sufficient statistic

1. definition

1.1. bayesian version

Let \(f(x\mid \theta)\) be a family of distributions parameterized by \(\theta\), and let \(X=\{x_1,\dots,x_n\}\) be a set of samples drawn from \(f(x\mid \theta)\) for some fixed \(\theta\). Then \(T(X)\) is a sufficient statistic with respect to \(\theta\) if no other statistic computed from the same sample provides additional information about \(\theta\). In other words, as far as estimating \(\theta\) is concerned, we can throw away \(X\) and keep only \(T(X)\).

Formally, \(T(X)\) is a sufficient statistic if the conditional distribution of the sample given the statistic, \(P_{X\mid T(X)}(x \mid T(X))\), does not depend on \(\theta\). This means that when we do maximum likelihood estimation of \(\theta\), once we know \(T(X)\) we don't need to know anything else about the sample: the remaining detail in \(X\) is distributed the same way for every \(\theta\), so it cannot favor one value of \(\theta\) over another.
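As a quick check of this definition, here is a minimal sketch for one assumed setting: i.i.d. Bernoulli(\(\theta\)) samples with \(T(X)=\sum_i x_i\). The function name `conditional_given_sum` and the example sample are made up for illustration. It computes \(P(X = x \mid T(X) = t)\) by brute-force enumeration for several values of \(\theta\); if \(T\) is sufficient, the result should be the same each time.

```python
from itertools import product
from math import comb, isclose

def conditional_given_sum(x, theta):
    """P(X = x | T(X) = sum(x)) for i.i.d. Bernoulli(theta) samples."""
    n, t = len(x), sum(x)
    # Joint probability of this exact 0/1 sequence.
    p_x = theta**t * (1 - theta)**(n - t)
    # Probability of the statistic: add up every sequence with the same total.
    p_t = sum(
        theta**sum(y) * (1 - theta)**(n - sum(y))
        for y in product((0, 1), repeat=n)
        if sum(y) == t
    )
    return p_x / p_t

x = (1, 0, 1, 1)  # hypothetical sample with n = 4 and T(x) = 3
results = [conditional_given_sum(x, theta) for theta in (0.2, 0.5, 0.9)]
print(results)
```

In this setting every sequence with the same total is equally likely given the total, so each entry should equal \(1/\binom{4}{3}\) regardless of \(\theta\).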

1.2. non-bayesian

In the non-Bayesian case, we don't talk about \(P(x\mid \theta)\); instead we talk about \(P(x ; \theta)\). There is no prior distribution on \(\theta\), and \(\theta\) is not a random variable.

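It turns out that the sufficient statistic is all that is needed to make a maximum likelihood estimation decision. Why? Because \(T(x)\) is a function of \(x\), the likelihood factors as \(p(x ; \theta) = p(x \mid T(x))\,p(T(x) ; \theta)\), and by sufficiency the first factor does not depend on \(\theta\). Hence \(\arg\max_{\theta} p(x ; \theta) = \arg\max_{\theta} p(x \mid T(x))\,p(T(x) ; \theta) = \arg\max_{\theta} p(T(x) ; \theta)\). This is the same idea as in the Bayesian case above.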
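The argmax identity can be illustrated numerically. The sketch below assumes i.i.d. Bernoulli(\(\theta\)) data with \(T(x)=\sum_i x_i\), so that \(T(x)\) is Binomial(\(n,\theta\)), and uses a crude grid search in place of a real optimizer. The helper names and the sample are hypothetical.

```python
from math import comb, log

def log_lik_full(x, theta):
    """log p(x; theta) for the full i.i.d. Bernoulli sample."""
    t, n = sum(x), len(x)
    return t * log(theta) + (n - t) * log(1 - theta)

def log_lik_stat(t, n, theta):
    """log p(T(x); theta), where T(x) = sum(x) is Binomial(n, theta)."""
    return log(comb(n, t)) + t * log(theta) + (n - t) * log(1 - theta)

# Candidate values of theta, strictly inside (0, 1) so the logs are defined.
grid = [i / 1000 for i in range(1, 1000)]

x = [1, 0, 1, 1, 0, 1, 1, 0]  # hypothetical sample
t, n = sum(x), len(x)

mle_full = max(grid, key=lambda th: log_lik_full(x, th))
mle_stat = max(grid, key=lambda th: log_lik_stat(t, n, th))
print(mle_full, mle_stat)
```

The two log-likelihoods differ only by the term \(\log\binom{n}{t}\), which is constant in \(\theta\), so both searches should land on the same grid point, the one closest to \(t/n\).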

2. Fisher-Neyman factorization theorem

The function \(T(x)\) is a sufficient statistic for \(\theta\) if and only if the probability density function \(f_{X}(x ; \theta)\) can be factorized as \[ f_{X}(x ; \theta) = h(x)\,g(T(x), \theta), \] where \(h\) does not depend on \(\theta\) and \(g\) depends on \(x\) only through \(T(x)\). Again, we see that if we are making a likelihood inference about what \(\theta\) could be, then we only need to pay attention to how \(g(T(x), \theta)\) varies with \(\theta\).
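As a worked example of the factorization, assume (for illustration only) that \(x_1,\dots,x_n\) are i.i.d. Poisson with mean \(\theta\). The joint probability mass function splits into a factor that ignores \(\theta\) and a factor that sees the data only through the sum:

\[ f_{X}(x ; \theta) = \prod_{i=1}^{n} \frac{e^{-\theta}\theta^{x_i}}{x_i!} = \underbrace{\frac{1}{\prod_{i=1}^{n} x_i!}}_{h(x)} \; \underbrace{e^{-n\theta}\,\theta^{\sum_{i=1}^{n} x_i}}_{g(T(x),\,\theta)}, \qquad T(x) = \sum_{i=1}^{n} x_i . \]

So under this assumption \(T(x)=\sum_i x_i\) is sufficient for \(\theta\), and maximizing \(g(T(x),\theta)\) over \(\theta\) gives \(\hat\theta = T(x)/n\).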

3. minimal sufficient statistic

There's a notion of dimensionality reduction at work here. Instead of storing the whole data set, we only need to store its sufficient statistic. A vector of observations can potentially be reduced to a single scalar value.

We then want a notion of which sufficient statistic has the smallest "size". A minimal sufficient statistic \(t^*\) is a sufficient statistic such that for every sufficient statistic \(t\), there exists a function \(g\) (which may depend on \(t\)) with \(t^*(x) = g(t(x))\) for all \(x\). We can think of \(g\) as a "grouping" function: each statistic \(t\) partitions the space of observations \(x\) into groups of observations sharing the same value of \(t(x)\), and \(g\) merges some of those groups together to produce the coarser partition induced by \(t^*\).
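The grouping picture can be made concrete with a small sketch, again assuming i.i.d. Bernoulli samples of length 4. The statistic `t_split` (sums of the first and second halves) is sufficient but not minimal, while `t_star` (the total sum) is the standard minimal sufficient statistic for this model. All names here are hypothetical.

```python
from collections import defaultdict
from itertools import product

def t_split(x):
    """Sufficient but not minimal: (sum of first half, sum of second half)."""
    half = len(x) // 2
    return (sum(x[:half]), sum(x[half:]))

def t_star(x):
    """Minimal sufficient statistic for i.i.d. Bernoulli data: the total sum."""
    return sum(x)

def g(t):
    """Grouping function mapping a value of t_split to the value of t_star."""
    return t[0] + t[1]

# Partition all 0/1 sequences of length 4 by each statistic.
samples = list(product((0, 1), repeat=4))
partition_split = defaultdict(list)
partition_star = defaultdict(list)
for x in samples:
    partition_split[t_split(x)].append(x)
    partition_star[t_star(x)].append(x)

print(len(partition_split), len(partition_star))
```

Here `t_split` cuts the 16 possible samples into 9 groups, and `g` merges them into the 5 groups of `t_star`; for example, the groups labeled `(0, 1)` and `(1, 0)` both map to the total 1.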