# Maximum Likelihood Estimation

Given a dataset \(\mathbf{y} = \{y_1, y_2, \dots\}\) and a parametric family of distributions indexed by \(\theta\), maximum likelihood estimation (MLE) is the strategy of selecting the \(\theta\) that maximizes \(\mathcal{L}(\mathbf{y}; \theta)\).

When the \(y_i\) are independent and identically distributed, the likelihood factorizes: \[\mathcal{L}(\mathbf{y}; \theta) = \prod_i l(y_i; \theta),\] where \(l(y_i; \theta)\) is the likelihood of a single observation.
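As a concrete illustration of this factorization, consider i.i.d. Bernoulli observations, where the likelihood is just the product of per-observation probabilities. A minimal sketch (the Bernoulli family and the sample data here are illustrative assumptions, not from the text):

```python
import math

def likelihood(ys, theta):
    # i.i.d. likelihood: product of per-observation Bernoulli likelihoods
    return math.prod(theta if y == 1 else 1 - theta for y in ys)

ys = [1, 1, 0, 1]
# The MLE for a Bernoulli parameter is the sample mean, here 3/4;
# the likelihood at 0.75 exceeds the likelihood at nearby values.
print(likelihood(ys, 0.75))
```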

## 1. Example

The family of Gaussians is parameterized by \(\mu\) and \(\sigma\). If our data are i.i.d., then the goal of MLE is to find the \(\mu\) and \(\sigma\) that maximize \[\mathcal{L}(\mathbf{y}; \mu, \sigma) = \prod_i \frac{1}{\sigma\sqrt{2\pi}}\exp\left(-\frac{(y_i - \mu)^2}{2\sigma^2}\right)\]
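In practice one maximizes the logarithm of this product, which turns it into a sum; for the Gaussian family the maximizer has a well-known closed form (the sample mean and the biased sample standard deviation). A sketch with made-up data:

```python
import math

def gaussian_mle(ys):
    # Closed-form Gaussian MLE: sample mean and biased sample standard deviation
    n = len(ys)
    mu = sum(ys) / n
    sigma = math.sqrt(sum((y - mu) ** 2 for y in ys) / n)
    return mu, sigma

def log_likelihood(ys, mu, sigma):
    # Sum of per-point Gaussian log-densities (note the minus sign in the exponent)
    return sum(-math.log(sigma * math.sqrt(2 * math.pi))
               - (y - mu) ** 2 / (2 * sigma ** 2) for y in ys)

ys = [2.1, 1.9, 2.5, 2.0, 1.8]
mu_hat, sigma_hat = gaussian_mle(ys)
```

Perturbing either parameter away from the closed-form estimate can only lower the log-likelihood, which is what makes it the maximizer.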

## 2. Relation to KL-divergence and cross entropy

Say that the true distribution is \(P_{\theta_0}\). Finding the \(\hat\theta\) such that \(Q_{\hat\theta}\) has maximum likelihood is equivalent to finding the \(\hat\theta\) that minimizes the KL divergence \(D_{KL}(P_{\theta_0} \| Q_{\hat\theta})\): the divergence expands to \(\mathbb{E}_P[\log P] - \mathbb{E}_P[\log Q_{\hat\theta}]\), and only the second term depends on \(\hat\theta\), so minimizing the divergence is the same as maximizing the expected log-likelihood. In machine learning, the true data distribution \(P\) is fixed. Recall that the cross entropy is \(H(P, Q) = H(P) + D_{KL}(P \| Q)\). Since \(H(P)\) is a constant when \(P\) is fixed, minimizing the KL divergence is equivalent to minimizing the cross entropy.
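The identity \(H(P, Q) = H(P) + D_{KL}(P \| Q)\) is easy to verify numerically for discrete distributions. A small check (the particular distributions are made up for illustration):

```python
import math

def entropy(p):
    # H(P) = -sum p_i log p_i
    return -sum(pi * math.log(pi) for pi in p)

def cross_entropy(p, q):
    # H(P, Q) = -sum p_i log q_i
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q))

def kl(p, q):
    # D_KL(P || Q) = sum p_i log(p_i / q_i)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

P = [0.2, 0.5, 0.3]  # fixed "true" distribution
Q = [0.3, 0.4, 0.3]  # candidate model
# Cross entropy and KL differ only by H(P), a constant in Q,
# so minimizing one over Q minimizes the other.
```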

## 3. What do we mean when we say "likelihood"?

It is the likelihood of the *parameters* given the observed data \(\mathbf{y}\): the same quantity as \(p(\mathbf{y} \mid \theta)\), but viewed as a function of \(\theta\) with \(\mathbf{y}\) held fixed.

## 4. Relationship to Maximum A Posteriori

If we have a prior on \(\theta\), we can weight our strategy by what we expect *a priori*. Maximum a posteriori (MAP) estimation finds the \(\theta\) that maximizes:
\[
p(\theta \mid y) = \frac{p(y\mid\theta)p(\theta)}{p(y)}
\]
The denominator \(p(y)\) is constant across \(\theta\), so for the purpose of finding the best \(\theta\), we can ignore it. If we further assume that the prior \(p(\theta)\) is uniform, then we can ignore that too, and we recover the MLE strategy.
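As a concrete instance of this (an assumed setup, not from the text): for a Gaussian likelihood with known \(\sigma\) and a Gaussian prior \(N(\mu_0, \tau^2)\) on the mean, the MAP estimate is a precision-weighted average of the prior mean and the sample mean, and a very flat prior (large \(\tau\)) recovers the MLE:

```python
def map_mean(ys, sigma, mu0, tau):
    # MAP estimate of a Gaussian mean with known sigma and prior N(mu0, tau^2):
    # a precision-weighted average of the prior mean and the sample mean.
    n = len(ys)
    ybar = sum(ys) / n
    return (mu0 / tau**2 + n * ybar / sigma**2) / (1 / tau**2 + n / sigma**2)

ys = [2.0, 3.0, 4.0]  # sample mean is 3.0, which is also the MLE
```

A tight prior (small \(\tau\)) pulls the estimate toward \(\mu_0\); a flat prior (large \(\tau\)) leaves it at the sample mean, matching the uniform-prior argument above.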