Finding Point Estimators

Overview of Estimation Methods

  • Two main methods for finding point estimators:
    • Method of Moments (MoM)
    • Maximum Likelihood Estimation (MLE)
  • These methods are generalizable to any parametric model.
  • Focus on random samples from one- or two-parameter distributions with closed-form solutions.

Method of Moments (MoM)

  • Equate sample moments with theoretical moments.
  • Involves solving a system of equations where the number of equations equals the number of parameters.
  • Theoretical (population) moments are denoted \mu_k and sample moments m_k.
  • \mu_k = E[X^k] = \begin{cases} \sum_{x} x^k f(x) & \text{discrete} \\ \int_{-\infty}^{\infty} x^k f(x) \, dx & \text{continuous} \end{cases}
  • m_k = \frac{1}{n} \sum_{i=1}^{n} X_i^k
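
As a quick sketch (in Python, with made-up data), a sample moment is just an average of powers:

```python
def sample_moment(xs, k):
    """k-th sample moment about the origin: m_k = (1/n) * sum(x_i^k)."""
    return sum(x ** k for x in xs) / len(xs)

data = [1.0, 2.0, 3.0, 4.0]  # hypothetical observations
m1 = sample_moment(data, 1)  # the sample mean
m2 = sample_moment(data, 2)  # the mean of the squares
```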

Calculation Methods

  • Informally

    1. Equate the first sample moment about the origin to the first theoretical moment.
    2. Equate the second sample moment about the origin to the second theoretical moment.
    3. Repeat for all parameters.
    4. Solve the system of equations for the parameters.
  • Formally

    • Given X_1, X_2, …, X_n \sim f(x; \theta), where \theta = (\theta_1, …, \theta_t) has dimension t, the t moment estimators \hat{\theta}_1, …, \hat{\theta}_t of \theta are the unique solutions of the system obtained by setting each sample moment equal to the corresponding theoretical moment:
      • m_1 = \mu_1
      • m_2 = \mu_2
      • \vdots
      • m_t = \mu_t
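
The recipe can be sketched in code for the two-parameter normal case worked out below: solve m_1 = \mu and m_2 = \mu^2 + \sigma^2 for (\mu, \sigma^2). This is a minimal illustration, not a general solver.

```python
def mom_normal(xs):
    """Method of moments for Normal(mu, sigma^2):
    solve m1 = mu and m2 = mu^2 + sigma^2, giving
    mu_hat = m1 and sigma2_hat = m2 - m1^2."""
    n = len(xs)
    m1 = sum(xs) / n
    m2 = sum(x * x for x in xs) / n
    return m1, m2 - m1 ** 2  # (mu_hat, sigma2_hat)
```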

Examples of Method of Moments Estimators

  • Poisson:

    • f(x; \lambda) = \frac{\lambda^x e^{-\lambda}}{x!}
    • E[X] = \lambda
    • \hat{\lambda}_{MM} = \bar{X}
  • Bernoulli:

    • f(x; p) = p^x (1-p)^{1-x}
    • E[X] = p
    • \hat{p}_{MM} = \bar{X}
  • Uniform (0, θ):

    • f(x; \theta) = \frac{1}{\theta - 0}, \quad 0 < x < \theta
    • E[X] = \frac{1}{2}(0 + \theta) = \frac{\theta}{2}
    • \hat{\theta}_{MM} = 2\bar{X}
  • Normal (\mu, \sigma^2):

    • E[X] = \mu
    • E[X^2] = \mu^2 + \sigma^2
    • \hat{\mu}_{MM} = \bar{X}
    • \hat{\sigma}^2_{MM} = \frac{1}{n} \sum_{i=1}^{n} (X_i - \bar{X})^2
  • Binomial:

    • f(x; n, p) = \binom{n}{x} p^x (1-p)^{n-x}
    • E[X] = np
    • E[X^2] = n(n-1)p^2 + np
    • \hat{n}_{MM} = \frac{\bar{X}^2}{\bar{X} - \left(\frac{1}{n} \sum_{i=1}^{n} X_i^2 - \bar{X}^2\right)}
    • \hat{p}_{MM} = 1 - \frac{\frac{1}{n} \sum_{i=1}^{n} X_i^2 - \bar{X}^2}{\bar{X}}
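
The binomial formulas are easy to get wrong by hand, so here is a small numerical sketch (with hypothetical count data). Writing s^2 for the biased sample variance m_2 - \bar{X}^2, the two estimators reduce to \hat{p} = 1 - s^2/\bar{X} and \hat{n} = \bar{X}/\hat{p}:

```python
def mom_binomial(xs):
    """Method-of-moments estimators for Binomial(n, p), both parameters unknown.
    Uses xbar and the biased sample variance s2 = m2 - xbar^2:
        p_hat = 1 - s2 / xbar
        n_hat = xbar^2 / (xbar - s2)  (equivalently xbar / p_hat)"""
    N = len(xs)
    xbar = sum(xs) / N
    s2 = sum(x * x for x in xs) / N - xbar ** 2
    n_hat = xbar ** 2 / (xbar - s2)
    p_hat = 1 - s2 / xbar
    return n_hat, p_hat
```

Note that \hat{n} need not be an integer; in practice it is often rounded.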

Consistency of Moments

  • If Y_1, …, Y_n are i.i.d. and E[|Y|^k] < \infty, then m_k \rightarrow \mu_k in probability (the law of large numbers applied to Y_i^k).
  • MoM estimators are often consistent.
    • Consistent Estimators Examples
      • Uniform(0, θ): \hat{\theta} = 2\bar{Y} \rightarrow \theta
      • Poisson(λ): \hat{\lambda} = \bar{Y} \rightarrow \lambda
      • Normal(\mu, \sigma^2): \hat{\mu} = \bar{Y} \rightarrow \mu and \hat{\sigma}^2 = \frac{1}{n} \sum (Y_i - \bar{Y})^2 \rightarrow \sigma^2
      • Bernoulli(p): \hat{p} = \bar{Y} \rightarrow p
  • If a moment estimator \hat{\theta} is a continuous function g(m_1, …, m_t) and g(\mu_1, …, \mu_t) = \theta, then \hat{\theta} is a consistent estimator.
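
Consistency is easy to see empirically. A small simulation (assumed setup: Uniform(0, 3) data and the estimator \hat{\theta} = 2\bar{Y} from above) shows the estimate settling near the true value as n grows:

```python
import random

random.seed(0)
theta = 3.0  # true parameter of Uniform(0, theta)
for n in (100, 10_000, 100_000):
    ys = [random.uniform(0, theta) for _ in range(n)]
    theta_hat = 2 * sum(ys) / n  # MoM estimator 2 * ybar
    print(n, theta_hat)  # drifts toward theta as n grows
```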

Maximum Likelihood Estimation (MLE)

  • Find the maximum of the likelihood function to get the most likely parameter values.
  • The value(s) of \theta that maximize \mathcal{L}(\theta; y) = f(y; \theta).
  • The MLE of θ is \hat{\theta} = \operatorname{argmax}_{\theta \in \Theta} \mathcal{L}(\theta; y)

Optimization Refresher

  • For a twice-differentiable function g(x), any optimal point x^* satisfies:
    • \frac{d}{dx} g(x) |_{x=x^*} = 0
    • x^* is a local maximum if: \frac{d^2}{dx^2} g(x) |_{x=x^*} < 0
    • x^* is a local minimum if: \frac{d^2}{dx^2} g(x) |_{x=x^*} > 0
  • If g(x) > 0 for all x on an interval (a, b), then g(x) achieves a local minimum (or maximum) at x^* if and only if \log(g(x)) achieves a local minimum (or maximum) at x^*.
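
The last point is why MLE is almost always carried out on the log scale. A quick numerical check, using an arbitrary positive function g(x) = x e^{-x} (chosen for illustration; its maximum is at x = 1), confirms that g and \log g peak at the same point:

```python
import math

# g(x) = x * exp(-x) is positive on (0, inf); log g(x) = log(x) - x.
# Both should attain their maximum at the same grid point.
xs = [i / 1000 for i in range(1, 5000)]
argmax_g = max(xs, key=lambda x: x * math.exp(-x))
argmax_log_g = max(xs, key=lambda x: math.log(x) - x)
```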

Remarks on Maximum Likelihood

  • In the context of i.i.d. univariate models, Y_1, …, Y_n \sim f(y; \theta) i.i.d., the likelihood factors as \mathcal{L}(\theta; y_1, …, y_n) = \prod_{i=1}^{n} f(y_i; \theta), and it is usually easier to work with the log-likelihood \ell(\theta) = \log \mathcal{L}(\theta).

Examples of Finding the MLE

  • Bernoulli:

    • f(Y_i; p) = p^{Y_i} (1-p)^{1-Y_i}
    • \mathcal{L}(p; Y_1, …, Y_n) = \prod_{i=1}^{n} p^{Y_i} (1-p)^{1-Y_i} = p^{\sum_{i=1}^{n} Y_i} (1-p)^{n - \sum_{i=1}^{n} Y_i}
    • \ell(p; Y_1, …, Y_n) = \left(\sum_{i=1}^{n} Y_i\right) \log(p) + \left(n - \sum_{i=1}^{n} Y_i\right) \log(1-p)
    • \frac{\partial}{\partial p} \ell(p; Y_1, …, Y_n) = \frac{\sum_{i=1}^{n} Y_i}{p} - \frac{n - \sum_{i=1}^{n} Y_i}{1-p} = 0
    • \hat{p}_{ML} = \frac{\sum_{i=1}^{n} Y_i}{n} = \bar{Y}
  • Gaussian:

    • f(Y_i; \mu, \sigma^2) = \frac{1}{\sqrt{2 \pi \sigma^2}} \exp\left(\frac{-(Y_i - \mu)^2}{2 \sigma^2}\right)
    • \mathcal{L}(\mu, \sigma^2; Y_1, …, Y_n) = \prod_{i=1}^{n} \frac{1}{\sqrt{2 \pi \sigma^2}} \exp\left(\frac{-(Y_i - \mu)^2}{2 \sigma^2}\right) = \sigma^{-n} (2 \pi)^{-\frac{n}{2}} \exp\left(-\frac{\sum_{i=1}^{n} (Y_i - \mu)^2}{2 \sigma^2}\right)
    • \ell(\mu, \sigma^2; Y_1, …, Y_n) = -\frac{n}{2} \log(\sigma^2) - \frac{n}{2} \log(2 \pi) - \frac{\sum_{i=1}^{n} (Y_i - \mu)^2}{2 \sigma^2}
    • \frac{\partial}{\partial \mu} \ell(\mu, \sigma^2; Y_1, …, Y_n) = \frac{\sum_{i=1}^{n} (Y_i - \mu)}{\sigma^2} = 0
    • \hat{\mu}_{ML} = \frac{\sum_{i=1}^{n} Y_i}{n} = \bar{Y}
    • \frac{\partial}{\partial \sigma^2} \ell(\mu, \sigma^2; Y_1, …, Y_n) = -\frac{n}{2 \sigma^2} + \frac{\sum_{i=1}^{n} (Y_i - \mu)^2}{2 (\sigma^2)^2} = 0
    • \hat{\sigma}^2_{ML} = \frac{\sum_{i=1}^{n} (Y_i - \hat{\mu})^2}{n}
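
A sanity check on the Gaussian derivation (with hypothetical data): evaluate the log-likelihood above at the closed-form MLEs and confirm that nearby parameter values do no better.

```python
import math

def gaussian_loglik(mu, sigma2, ys):
    """l(mu, sigma2) = -n/2 log(sigma2) - n/2 log(2 pi) - sum((y - mu)^2) / (2 sigma2)"""
    n = len(ys)
    ss = sum((y - mu) ** 2 for y in ys)
    return -0.5 * n * math.log(sigma2) - 0.5 * n * math.log(2 * math.pi) - ss / (2 * sigma2)

ys = [1.2, 0.7, 2.1, 1.5, 0.9]  # made-up sample
n = len(ys)
mu_hat = sum(ys) / n                                   # ybar
sigma2_hat = sum((y - mu_hat) ** 2 for y in ys) / n    # mean squared deviation

# the closed-form MLEs should beat nearby parameter values
best = gaussian_loglik(mu_hat, sigma2_hat, ys)
for d in (-0.1, 0.1):
    assert best >= gaussian_loglik(mu_hat + d, sigma2_hat, ys)
    assert best >= gaussian_loglik(mu_hat, sigma2_hat + d, ys)
```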

Properties of MLEs

  • MLEs have many favorable limiting properties:
    • Consistency
    • Efficiency
    • Invariance (or “functional equivariance”)
    • Asymptotic normality (under regularity conditions)

Invariance

  • If \hat{\theta} is the MLE for \theta, and if g(\theta) is any transformation of \theta, then the MLE for \alpha = g(\theta) is \hat{\alpha} = g(\hat{\theta}).
  • Another way to put this: If \hat{\theta} is MLE of \theta, and t is any function with a twice-differentiable inverse on \Theta, then t(\hat{\theta}) is the MLE of t(\theta).
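
For instance, with Bernoulli data the MLE of p is \bar{Y}, so by invariance the MLE of the odds \alpha = p/(1-p) should be \bar{Y}/(1-\bar{Y}). A grid-search sketch (made-up data; a crude maximizer, for illustration only) of the likelihood reparameterized in \alpha agrees:

```python
import math

# Invariance check: for Bernoulli data the MLE of p is ybar, so the MLE of
# the odds alpha = p / (1 - p) should be ybar / (1 - ybar).
ys = [1, 0, 1, 1, 0, 1, 1, 0]  # hypothetical 0/1 data
ybar = sum(ys) / len(ys)
odds_mle_via_invariance = ybar / (1 - ybar)

def loglik_odds(alpha, ys):
    """Bernoulli log-likelihood reparameterized via p = alpha / (1 + alpha)."""
    p = alpha / (1 + alpha)
    s, n = sum(ys), len(ys)
    return s * math.log(p) + (n - s) * math.log(1 - p)

# maximize directly over alpha on a fine grid
grid = [i / 1000 for i in range(1, 10000)]
odds_mle_numeric = max(grid, key=lambda a: loglik_odds(a, ys))
```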

Limiting Distribution of MLE

  • Under several regularity conditions:

    • \sqrt{n}(\hat{\theta} - \theta) \rightarrow N(0, I(\theta)^{-1}), where I(\theta) is the Fisher information for a single observation.
    • I(\theta) = E\left[-\frac{d^2}{d \theta^2} \ell(\theta; Y)\right]
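
A simulation makes this concrete. For Poisson(\lambda) the MLE is \hat{\lambda} = \bar{Y} and I(\lambda) = 1/\lambda, so \sqrt{n}(\hat{\lambda} - \lambda) should have variance near \lambda. The sketch below (assumed setup: \lambda = 2, repeated samples, Knuth's method for Poisson draws) checks this:

```python
import math
import random

random.seed(1)

def poisson_sample(lam):
    """One Poisson(lam) draw via Knuth's multiplication method."""
    L = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= random.random()
        if p < L:
            return k
        k += 1

lam, n, reps = 2.0, 400, 2000
zs = []
for _ in range(reps):
    ys = [poisson_sample(lam) for _ in range(n)]
    lam_hat = sum(ys) / n                      # MLE of lambda is ybar
    zs.append(math.sqrt(n) * (lam_hat - lam))  # standardized error

mean_z = sum(zs) / reps
var_z = sum(z * z for z in zs) / reps  # should be near 1 / I(lam) = lam
```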

MLE Terms and Conditions

  • Let Y_1, Y_2, … \sim f(y; \theta^*), and assume:

    • (A0) f is "identifiable": if \theta \neq \theta', then f(y; \theta) \neq f(y; \theta') for at least one y in the support set
    • (A1) f "has common support": \lbrace y : f(y; \theta) > 0 \rbrace = \lbrace y : f(y; \theta') > 0 \rbrace for any \theta, \theta'
    • (A2) the parameter space \Theta contains an open set \omega and \theta^* is an interior point of \omega
    • (A3) f is differentiable in \theta on \omega
  • Under (A0) – (A3), the likelihood equation

    • \frac{d}{d \theta} \ell(\theta; \mathcal{Y}) = 0
    • has a root (or solution) \hat{\theta}_n such that \hat{\theta}_n \rightarrow \theta^*
  • Corollary: If \hat{\theta}_n is unique and \Theta is an open interval, then with probability tending to 1, \hat{\theta}_n is the MLE.

Asymptotic Normality of MLE

  • Let Y_1, Y_2, … \sim f(y; \theta^*), and now assume (A0) – (A1) hold along with:

    • (A2') the parameter space \Theta is an open interval;
    • (A3') the Fisher information I(\theta^*) is positive and finite, plus a more technical smoothness condition (roughly, f thrice differentiable in \theta).
  • Then any consistent root of the likelihood equation satisfies:
    • \sqrt{n} (\hat{\theta}_{MLE} - \theta^*) \rightarrow N\left(0, \frac{1}{I(\theta^*)}\right)
    • \hat{\theta}_{MLE} \rightarrow \theta^*

Recap

  • Method of moments derives point estimators by equating sample moments and population moments.
  • Maximum likelihood estimation provides a framework for deriving point estimators by optimizing the likelihood of data.
  • MoM estimates are usually consistent.
  • MLEs are invariant for transformations of parameters and asymptotically normal.