Notes on Point Estimation and Inference (Unbiasedness, UMVUE, BLUE, MLE, and Bounds)
Unbiased Estimation
- Goal of point estimation: find a statistic T(X) that provides a precise estimate of an unknown parameter θ.
- Unbiased estimator: A statistic T(X) is unbiased for a function Y(θ) if
\mathbb{E}_{\theta}[T(X)]=Y(\theta)\quad\text{for all }\theta.
- Bias of an estimator T for θ is
\text{bias}(T,\theta)=\mathbb{E}_{\theta}[T(X)]-\theta.
- If a statistic is not unbiased, it is called biased.
- Notation: Let {fθ} be a family of pdfs/pmfs parameterized by θ, and let X=(X1,…,Xn) be a random sample from fθ.
- Example: Existence of unbiased estimators for powers of p
- For a single observation X~Bernoulli(p), no unbiased estimator of p² exists: any T(X) takes just two values T(0), T(1), so E[T(X)]=T(0)(1−p)+T(1)p is affine in p and cannot equal the quadratic p² for all p ∈ (0,1).
- With more than one observation, unbiased estimators can exist; e.g., with two independent Bernoulli(p) draws X1,X2, the product X1X2 has E[X1X2]=p², hence unbiased for p². More generally, with n samples, unbiased estimators for functions of θ may be constructed via moments of the joint distribution.
- Unbiased estimators may not exist for some targets and sample sizes; also, unbiasedness alone does not guarantee good performance (variance matters).
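To make the two-draw construction concrete, here is a small Monte Carlo sketch in Python (the function name and simulation sizes are illustrative, not from the notes) checking that X1·X2 averages to p²:

```python
import random

def simulate_unbiased_p2(p, n_sims=200_000, seed=0):
    """Monte Carlo check that T = X1*X2 is unbiased for p^2
    when X1, X2 are independent Bernoulli(p) draws."""
    rng = random.Random(seed)
    total = 0
    for _ in range(n_sims):
        x1 = 1 if rng.random() < p else 0
        x2 = 1 if rng.random() < p else 0
        total += x1 * x2
    return total / n_sims

est = simulate_unbiased_p2(0.3)
# the average should settle near p^2 = 0.09
```

The same device generalizes: products of k distinct draws are unbiased for p^k whenever n ≥ k.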
LMVUE and UMVUE
- Locally minimum variance unbiased estimator (LMVUE): an unbiased estimator of θ whose variance is minimal among all unbiased estimators at a fixed parameter value θ0; the optimality is local (at θ0), not uniform over the parameter space.
- Uniformly minimum variance unbiased estimator (UMVUE): among all unbiased estimators of θ, T achieving the minimum variance for all θ in the parameter space.
- Def: Let 𝔗 be the class of all unbiased estimators of θ. T_{UMVUE} = argmin_{T∈𝔗} Var_θ(T) simultaneously for all θ, if it exists.
- Rao–Blackwell Theorem (informal): If S is a sufficient statistic and h(X) is an unbiased estimator of θ, then E[h(X)|S] is a genuine statistic (by sufficiency it does not depend on θ), is unbiased for θ, and satisfies Varθ(E[h(X)|S]) ≤ Varθ(h(X)).
- Equality Varθ(E[h(X)|S]) = Varθ(h(X)) holds iff h(X) is already a function of S (i.e., h(X)=g(S) almost surely).
- Lehmann–Scheffé Theorem: If T is a complete sufficient statistic for θ and Y is any unbiased estimator of a function g(θ), then E[Y|T] is the UMVUE of g(θ); in particular, the UMVUE is unique (almost surely) if it exists.
- Practical takeaway:
- To obtain UMVUE, often compute E[unbiased estimator | a complete sufficient statistic].
- If several unbiased estimators exist, Rao–Blackwellizing any one of them with a complete sufficient statistic yields the same (unique) UMVUE.
Completeness, Sufficiency, and Symmetry
- Sufficiency: T(X) is sufficient for θ if the conditional distribution of X given T(X) does not depend on θ.
- Completeness: A statistic T is complete for the family {fθ} if for any function g, Eθ[g(T)]=0 for all θ implies P_θ(g(T)=0)=1 for all θ; informally, there are no nontrivial unbiased estimators of zero depending on T.
- Symmetry: If X1,…,Xn are i.i.d., the sample is exchangeable and natural estimators are symmetric functions of the observations; however, symmetry alone does not confer UMVUE status.
- Theorem (unique UMVUE when it exists): If there exist a complete sufficient statistic T for θ and an unbiased estimator U of φ(θ), then E[U|T] is the UMVUE for φ(θ) and is unique.
Rao–Blackwellization and Concrete Consequences
- Given any unbiased estimator h(X) of θ, the Rao–Blackwell improvement is h*(X)=E[h(X)|S], where S is a sufficient statistic (sufficiency ensures h* does not depend on θ). Then
- h*(X) is unbiased for θ,
- Varθ(h*(X)) ≤ Varθ(h(X)).
- This is the core mechanism to obtain lower-variance estimators and is the practical route to UMVUEs when a complete sufficient statistic exists.
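A minimal Python sketch of this mechanism for Bernoulli data, taking h = X1 (crude but unbiased) and S = ∑Xi, so that h* = E[X1|S] = S/n; the names and simulation sizes are illustrative:

```python
import random

def rao_blackwell_demo(p=0.4, n=10, n_sims=100_000, seed=1):
    """Compare Var of the crude unbiased estimator h = X1 with its
    Rao-Blackwellization h* = E[X1 | S] = S/n, where S = sum(X_i),
    for an i.i.d. Bernoulli(p) sample of size n."""
    rng = random.Random(seed)
    h_vals, hstar_vals = [], []
    for _ in range(n_sims):
        xs = [1 if rng.random() < p else 0 for _ in range(n)]
        h_vals.append(xs[0])            # crude estimator: first observation only
        hstar_vals.append(sum(xs) / n)  # conditioned on the sufficient statistic
    def var(v):
        m = sum(v) / len(v)
        return sum((x - m) ** 2 for x in v) / len(v)
    return var(h_vals), var(hstar_vals)

v_h, v_hstar = rao_blackwell_demo()
# theory: Var(h) = p(1-p) = 0.24 versus Var(h*) = p(1-p)/n = 0.024
```

The tenfold variance reduction matches the exact calculation Var(S/n) = p(1−p)/n.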
BLUE: Best Linear Unbiased Estimator
- Setup: X1,…,X_n independent with common mean μ and variance σ².
- Consider a linear unbiased estimator T(X)=\sum_{i=1}^n a_i X_i with \sum_{i=1}^n a_i = 1, so that E[T(X)]=μ.
- Variance: Var(T)=σ²∑ a_i².
- BLUE principle (Gauss–Markov): Among all unbiased linear estimators, the one with minimum Var(T) is the BLUE. For i.i.d. observations with equal variances, the BLUE is the sample mean
\hat{μ}_{BLUE}=\bar{X}=\frac{1}{n}\sum_{i=1}^n X_i.
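The variance formula Var(T)=σ²∑a_i² can be exercised directly; in the sketch below (hypothetical weight vectors, σ²=1) the equal weights of the sample mean beat an alternative unbiased weighting:

```python
def linear_estimator_variance(weights, sigma2=1.0):
    """Variance of T = sum a_i X_i for independent X_i with common
    variance sigma2: Var(T) = sigma2 * sum(a_i^2)."""
    assert abs(sum(weights) - 1.0) < 1e-9, "weights must sum to 1 (unbiasedness)"
    return sigma2 * sum(a * a for a in weights)

n = 5
equal = [1.0 / n] * n                 # the sample-mean weights (BLUE)
skewed = [0.5, 0.2, 0.1, 0.1, 0.1]    # another unbiased linear estimator
# equal weights give sigma2/n = 0.2; the skewed weights give 0.32
```

By Cauchy–Schwarz, ∑a_i² ≥ 1/n whenever ∑a_i = 1, with equality only at a_i = 1/n, which is the Gauss–Markov conclusion in this i.i.d. setting.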
- Likelihood setup:
- For a parametric family f(x; θ) with observation X, the likelihood is
L(θ; x)=f(x; θ),\quad \ell(θ; x)=\log L(θ; x).
- Score and Fisher information:
- Score: U(θ)=\frac{∂}{∂θ}\ell(θ; X)=\frac{∂}{∂θ}\log f(X; θ).
- Fisher information: I(θ)=\mathbb{E}_{θ}[U(θ)^2]=\operatorname{Var}_{θ}(U(θ)).
- For n i.i.d. observations, I_n(θ)=n I_1(θ).
- Regularity: E_θ[U(θ)]=0 and I(θ) finite under regularity conditions.
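As a worked instance of these definitions, for a single Bernoulli(p) observation the score and information come out in closed form:

```latex
\ell(p;x) = x\log p + (1-x)\log(1-p)
U(p) = \frac{\partial \ell}{\partial p} = \frac{x}{p} - \frac{1-x}{1-p} = \frac{x-p}{p(1-p)}
I_1(p) = \operatorname{Var}_p\big(U(p)\big) = \frac{\operatorname{Var}_p(X)}{p^2(1-p)^2}
       = \frac{p(1-p)}{p^2(1-p)^2} = \frac{1}{p(1-p)}
```

By the additivity noted above, I_n(p) = n/(p(1−p)) for an i.i.d. sample of size n.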
- Cramér–Rao lower bound (CRLB): For any unbiased estimator T(X) of a scalar θ,
\operatorname{Var}_{θ}(T(X)) \ge \frac{\big(\frac{d}{dθ}\mathbb{E}_{θ}[T(X)]\big)^2}{I_n(θ)}.
In the unbiased case E_θ[T(X)]=θ, so
\operatorname{Var}_{θ}(T(X)) \ge \frac{1}{I_n(θ)}.
- Multivariate extension: If θ=(θ1,…,θk) is k-dimensional and T(X) is a vector of unbiased estimators, then
\operatorname{Cov}_{θ}(T(X)) \ge I_n(θ)^{-1},
where I_n(θ) is the Fisher information matrix with entries
[I_n(θ)]_{ij}=\mathbb{E}_θ\left[\frac{∂}{∂θ_i}\log f(X; θ)\frac{∂}{∂θ_j}\log f(X; θ)\right].
- Notes:
- The CRLB provides a lower bound on the variance; an estimator achieving equality is called efficient.
- The bound as stated applies to unbiased estimators; the version with \frac{d}{dθ}\mathbb{E}_θ[T(X)] in the numerator covers biased estimators and estimable functions, and sharper refinements exist (Bhattacharyya, Chapman–Robbins–Kiefer bounds).
- The Fisher information in a sample equals the sum (for i.i.d.) of the information from each observation.
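A quick numeric sanity check in Python (function names are ad hoc): compute I_1(p) for the Bernoulli family exactly as E[U(p)²] by summing over x ∈ {0,1}, and confirm that Var(X̄) = p(1−p)/n equals the bound 1/(n I_1(p)), i.e., the sample mean is efficient for p:

```python
def fisher_info_bernoulli(p):
    """Exact I_1(p) = E[U(p)^2], summing score^2 over x in {0, 1}."""
    def score(x):
        return x / p - (1 - x) / (1 - p)
    return sum(score(x) ** 2 * (p if x == 1 else 1 - p) for x in (0, 1))

def crlb(p, n):
    """CRLB for an unbiased estimator of p from n i.i.d. draws: 1/(n I_1(p))."""
    return 1.0 / (n * fisher_info_bernoulli(p))

# Var(X-bar) = p(1-p)/n coincides with crlb(p, n): X-bar is efficient
p, n = 0.3, 25
var_xbar = p * (1 - p) / n
```

The exact sum reproduces the closed form I_1(p) = 1/p + 1/(1−p) = 1/(p(1−p)).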
Bhattacharyya and Chapman–Robbins–Kiefer Bounds (CRK/Bhattacharyya-type bounds)
- Generalized information bound for estimable functions:
- Let Y(θ)=E_θ[T(X)]. If Y is differentiable in θ and I(θ) is finite, then
\operatorname{Var}_{θ}(T(X)) \ge \frac{(dY/dθ)^2}{I(θ)}.
This extends the CRLB from θ itself to estimable functions Y(θ). In multiple dimensions, the bound becomes
\operatorname{Cov}_{θ}(T(X)) \ge \nabla_θ Y(θ)\, I(θ)^{-1}\, \nabla_θ Y(θ)^{\top}.
- The Bhattacharyya bounds sharpen this further by using higher-order derivatives \frac{∂^k f(X;θ)/∂θ^k}{f(X;θ)}, k=1,2,…; the k=1 case recovers the bound above.
- Chapman–Robbins–Kiefer (CRK) inequality:
- Replaces the derivative in the CRLB with a finite difference, so it needs no differentiability or other regularity conditions. For an unbiased estimator T of g(θ),
\operatorname{Var}_{θ}(T) \ge \sup_{h} \frac{\big(g(θ+h)-g(θ)\big)^2}{\mathbb{E}_θ\!\left[\left(\frac{f(X;θ+h)}{f(X;θ)}-1\right)^2\right]},
the supremum running over admissible perturbations h (those for which the likelihood ratio is well defined). Under regularity, letting h→0 recovers the CRLB, and taking the supremum can give a strictly sharper bound.
Method of Moments Estimation (MME)
- Idea: Use a finite set of population moments as targets to estimate parameters.
- Suppose θ ∈ R^k and the first k moments μ_r(θ)=E_θ[X^r] for r=1,…,k are available as functions of θ.
- If we can equate the sample moments to the population moments:
\hat{m}_r=\frac{1}{n}\sum_{i=1}^n X_i^r, \quad r=1,…,k,
and solve the system μ_r(θ)=\hat{m}_r for θ, we obtain the method-of-moments estimator T(X)=\hat{θ}.
- Examples:
- For a normal distribution with unknown μ and σ², the first two population moments are μ and μ²+σ². The sample moments are
\hat{m}_1 = \frac{1}{n}\sum_{i=1}^n X_i,\quad \hat{m}_2=\frac{1}{n}\sum_{i=1}^n X_i^2,
and solving μ=\hat{m}_1, μ²+σ²=\hat{m}_2 yields
\hat{μ}=\bar{X}, \quad \hat{σ}^2=\hat{m}_2-\hat{m}_1^2.
- Remarks:
- If there are more moments than parameters, choose a subset or minimize a loss function over the moment conditions. MME estimators may be biased in finite samples, but since sample moments converge to population moments (law of large numbers), the MME is consistent under regularity; its appeal is simplicity and interpretability.
- The method can yield multiple solutions when the moment equations are nonlinear; selection of a sensible (lower-order) moment system is common.
- Worked illustration (2-parameter example): If a pair (a,b) governs a distribution and the first two population moments are related to a and b via a system, equating sample first and second moments to those expressions and solving yields MME for a and b (the algebra depends on the chosen moment equations). The method of moments is often used for small-sample intuition and quick estimates.
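The normal example above fits in a few lines of Python (the helper name and data are illustrative):

```python
def mme_normal(xs):
    """Method-of-moments estimates for a normal sample: equate the first
    two sample moments to mu and mu^2 + sigma^2 and solve."""
    n = len(xs)
    m1 = sum(xs) / n                  # first sample moment
    m2 = sum(x * x for x in xs) / n   # second sample moment
    return m1, m2 - m1 * m1           # (mu_hat, sigma2_hat)

mu_hat, s2_hat = mme_normal([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
# for this small data set: mu_hat = 5.0, s2_hat = 29.0 - 25.0 = 4.0
```

Note that here the MME coincides with the (biased, 1/n-denominator) MLE of σ², a coincidence particular to the normal family.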
Maximum Likelihood Estimation (MLE)
- Likelihood principle: For iid samples X1,…,Xn with joint pdf/pmf f(x; θ), the likelihood is
L(θ;x)=\prod_{i=1}^n f(X_i; θ),\quad \ell(θ;x)=\log L(θ;x)=\sum_{i=1}^n \log f(X_i; θ).
- Definition: The MLE θ̂ maximizes L(θ; x) (equivalently, maximizes ℓ(θ; x)).
- Example: Normal with unknown μ and σ² (both unknown, μ real, σ²>0).
- Log-likelihood (for observations x1,…,xn):
\ell(μ, σ^2)= -\frac{n}{2}\log(2\pi) -\frac{n}{2}\log σ^2 - \frac{1}{2σ^2}\sum_{i=1}^n (x_i-μ)^2.
- Score equations:
\frac{∂\ell}{∂μ}=\frac{1}{σ^2}\sum_{i=1}^n (x_i-μ)=0 \Rightarrow \hat{μ}=\bar{x}.
\frac{∂\ell}{∂σ^2}= -\frac{n}{2σ^2}+\frac{1}{2(σ^2)^2}\sum_{i=1}^n (x_i-μ)^2 =0 \Rightarrow \hat{σ}^2=\frac{1}{n}\sum_{i=1}^n (x_i-\hat{μ})^2.
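A short Python sketch (helper names are ad hoc) evaluating the normal log-likelihood and confirming that the closed-form solution of the score equations is indeed a maximizer, by comparing against nearby parameter values:

```python
import math

def normal_loglik(xs, mu, sigma2):
    """Log-likelihood of an i.i.d. N(mu, sigma2) sample at (mu, sigma2)."""
    n = len(xs)
    ss = sum((x - mu) ** 2 for x in xs)
    return (-0.5 * n * math.log(2 * math.pi)
            - 0.5 * n * math.log(sigma2)
            - ss / (2 * sigma2))

def normal_mle(xs):
    """Closed-form MLE from the score equations: (x-bar, mean squared deviation)."""
    n = len(xs)
    mu_hat = sum(xs) / n
    s2_hat = sum((x - mu_hat) ** 2 for x in xs) / n
    return mu_hat, s2_hat

xs = [1.2, 0.7, 2.3, 1.9, 0.4]
mu_hat, s2_hat = normal_mle(xs)
# perturbing either parameter away from (mu_hat, s2_hat) lowers normal_loglik
```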
- Properties of MLE:
- Existence: MLE may not exist in some models or can be non-unique.
- Bias: MLE need not be unbiased.
- Efficiency: MLEs are asymptotically efficient under regularity (achieve CRLB asymptotically).
- Invariance: If θ̂ is an MLE for θ, then g(θ̂) is an MLE for g(θ); for one-to-one g this is immediate, and it extends to general g via the induced likelihood.
- MLEs can be obtained by solving likelihood equations ∂ℓ/∂θ=0; when closed-form solutions are not available, iterative methods are used.
- Fisher Scoring (iterative MLE method):
- Define score s(θ)=∂ℓ/∂θ and the expected information I(θ)=E[s(θ)^2].
- Iterative update (Fisher scoring):
\hat{θ}^{(t+1)}=\hat{θ}^{(t)}+I(\hat{θ}^{(t)})^{-1}s(\hat{θ}^{(t)}).
- In practice, the observed information I_obs(θ) = -∂^2\ell/∂θ^2 is often used in place of the expected information in the updates (giving Newton–Raphson).
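A minimal Fisher-scoring sketch in Python, using the Poisson(λ) family where s(λ) = ∑x_i/λ − n and I(λ) = n/λ (the function name and starting value are illustrative). For this model the update simplifies algebraically to λ ← x̄, so the iteration converges in a single step:

```python
def fisher_scoring_poisson(xs, lam0=1.0, tol=1e-10, max_iter=50):
    """Fisher scoring for the Poisson rate lambda.
    Score: s(lam) = sum(x)/lam - n; expected information: I(lam) = n/lam.
    Update: lam <- lam + s(lam)/I(lam)."""
    n = len(xs)
    sx = sum(xs)
    lam = lam0
    for _ in range(max_iter):
        score = sx / lam - n
        info = n / lam
        step = score / info      # = x-bar - lam for the Poisson model
        lam += step
        if abs(step) < tol:
            break
    return lam

lam_hat = fisher_scoring_poisson([3, 1, 4, 1, 5])
# converges to the sample mean 14/5 = 2.8, the Poisson MLE
```

In models without closed-form score roots (e.g., logistic regression), the same loop runs for several iterations instead of one.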
Practical MLE Examples and Invariance
- Example: If X1,…,Xn are i.i.d. uniform on {1,…,N} with N unknown, the likelihood is (1/N)^n for N ≥ max_i X_i and 0 otherwise, which is decreasing in N on its support; hence the MLE is N̂ = max_i X_i, the maximum observed value.
- Invariance principle: if θ̂ is an MLE of θ, then h(θ̂) is an MLE of h(θ) (immediately for one-to-one h; via the induced likelihood in general).
Minimum Chi-Square Method (Pearson’s Chi-Square Method)
- Setup: Observed frequencies F_i in K mutually exclusive bins A_i with model probabilities p_i(θ) and expected counts E_i(θ)=n p_i(θ).
- Test statistic (Pearson's chi-square):
\chi^2(θ)=\sum_{i=1}^K \frac{(F_i - E_i(θ))^2}{E_i(θ)}.
- Estimation by minimum chi-square: find θ that minimizes χ²(θ) (equivalently, solve ∂χ²/∂θ=0). This yields an estimator of θ within the specified model.
- Notes:
- The method is a quasi-parameter estimation technique based on frequency matching, not a full likelihood approach.
- It is widely used in goodness-of-fit tests and when the likelihood is difficult to handle directly.
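A sketch of minimum chi-square estimation in Python via a simple grid search; the 3-cell model with probabilities (p², 2p(1−p), (1−p)²) and the observed counts are hypothetical:

```python
def pearson_chi2(freqs, probs):
    """Pearson chi-square between observed counts and model cell probabilities."""
    n = sum(freqs)
    return sum((f - n * p) ** 2 / (n * p) for f, p in zip(freqs, probs))

def min_chi2_estimate(freqs, cell_probs, grid):
    """Minimum chi-square estimate over a grid of candidate theta values.
    cell_probs(theta) must return the model probability of each bin."""
    return min(grid, key=lambda th: pearson_chi2(freqs, cell_probs(th)))

# hypothetical 3-cell model: P(bin) = (p^2, 2p(1-p), (1-p)^2)
freqs = [30, 50, 20]
grid = [i / 1000 for i in range(1, 1000)]
p_hat = min_chi2_estimate(freqs,
                          lambda p: [p * p, 2 * p * (1 - p), (1 - p) ** 2],
                          grid)
# p_hat lands near 0.55, the frequency-matching value (2*30 + 50)/200
```

In smooth one-parameter problems a derivative-based solver would replace the grid, but the grid keeps the sketch dependency-free.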
Worked Examples and Special Topics (From Transcript)
- UMVUE examples for binomial-type problems:
- Let S = \sum_{i=1}^n X_i be the total number of successes in n Bernoulli(p) trials.
- S is complete and sufficient for p in the binomial family. The UMVUE of p is the conditional expectation of any unbiased estimator given S; here
\hat{p}_{UMVUE}=\frac{S}{n}.
- The UMVUE of p² for Binomial(n,p) is S(S-1)/(n(n-1)), using the factorial-moment identity E[S(S-1)]=n(n-1)p^2.
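The factorial-moment identity behind the p² result can be verified exactly in Python by summing over the binomial pmf (the function name is ad hoc):

```python
from math import comb

def expect_s_sminus1(n, p):
    """Exact E[S(S-1)] for S ~ Binomial(n, p), by summing over the pmf."""
    return sum(s * (s - 1) * comb(n, s) * p**s * (1 - p) ** (n - s)
               for s in range(n + 1))

n, p = 7, 0.3
lhs = expect_s_sminus1(n, p)   # exact factorial moment
rhs = n * (n - 1) * p**2       # so S(S-1)/(n(n-1)) is unbiased for p^2
```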
- UMVUE existence and uniqueness (Lehmann–Scheffé):
- If a complete statistic T exists and Y is unbiased for g(θ), then E[Y|T] is the UMVUE for g(θ) and is unique.
- Symmetry and completeness: For i.i.d. samples, symmetry of a statistic is natural but does not by itself guarantee UMVUE status; the UMVUE is typically a function of the complete sufficient statistic.
- Fisher information and multi-parameter cases: If θ ∈ R^k, the Fisher information is a k×k matrix I(θ) with entries
[I(θ)]_{ij}=\mathbb{E}_θ\left[\frac{∂}{∂θ_i}\log f(X; θ)\frac{∂}{∂θ_j}\log f(X; θ)\right].
- For n i.i.d. observations, I_n(θ)=n I(θ).
- Notions of optimality and limitations:
- MLE properties: existence, consistency, asymptotic normality, bias, variance.
- Invariance: MLEs are preserved under one-to-one transformations of the parameter.
- Boundaries: CRLB and other bounds provide benchmarks for estimator performance; MLEs may not attain CRLB in finite samples.
Summary of Key Takeaways
- Unbiasedness is a desirable property but does not guarantee low variance; unbiased estimators may not exist for some targets.
- The Rao–Blackwell theorem and Lehmann–Scheffé theorem provide systematic ways to obtain the UMVUE via conditioning on complete sufficient statistics.
- BLUE is the best linear unbiased estimator under the Gauss–Markov theorem; in i.i.d. cases with equal variances, the sample mean is BLUE for the population mean.
- Fisher information quantifies the amount of information in the data about θ; CRLB gives a lower bound on the variance of any unbiased estimator and is tight for efficient estimators.
- Inference tools include MLE (with invariance properties and iterative solution methods like Fisher scoring), the method of moments, and the minimum chi-square approach.
- Bounds beyond CRLB (Bhattacharyya, CRK) refine variance bounds for unbiased estimators and estimators of functionals, often using derivatives of E_θ[T(X)].
- Many estimators (notably MLEs) are not unbiased, but they may be consistent and asymptotically efficient; bias corrections and alternative estimators may be explored when unbiasedness is required for a specific purpose.
- Unbiasedness: \mathbb{E}_{\theta}[T(X)]=\theta.
- Bias: \text{bias}(T,\theta)=\mathbb{E}_{\theta}[T(X)]-\theta.
- Score: U(θ)=\frac{∂}{∂θ}\log f(X; θ).
- Fisher information: I(θ)=\mathbb{E}_{θ}[U(θ)^2] (scalar θ), or the information matrix for multi-parameter cases.
- CRLB (scalar θ): \operatorname{Var}_{θ}(T(X)) \ge \frac{\big(\frac{d}{dθ}\mathbb{E}_{θ}[T(X)]\big)^2}{I_n(θ)}. For unbiased T with E_θ[T(X)]=θ, this reduces to \operatorname{Var}_{θ}(T(X)) \ge \frac{1}{I_n(θ)}.
- Multivariate CRLB: \operatorname{Cov}_{θ}(T(X)) \ge I_n(θ)^{-1}.
- Likelihood and MLE: maximize L(θ;x)=\prod_{i=1}^n f(X_i; θ),\quad \ell(θ;x)=\log L(θ;x).
- MLE update via Fisher scoring: \hat{θ}^{(t+1)}=\hat{θ}^{(t)}+I(\hat{θ}^{(t)})^{-1} s(\hat{θ}^{(t)}),\quad s(θ)=\frac{∂}{∂θ}\ell(θ;x).
- BLUE for μ with i.i.d. X_i: \hat{μ}_{BLUE}=\bar{X}=\frac{1}{n}\sum_{i=1}^n X_i.
- Method of moments (example for normal):
\hat{μ}=\bar{X},\quad \hat{σ}^2=\frac{1}{n}\sum_{i=1}^n (X_i-\bar{X})^2.
- Pearson’s chi-square statistic: \chi^2(θ)=\sum_{i=1}^K \frac{(F_i - E_i(θ))^2}{E_i(θ)},\quad E_i(θ)=n p_i(θ).
- UMVUE for p with Binomial(n,p) via S=\sum_{i=1}^n X_i: \hat{p}_{UMVUE}=\frac{S}{n},\quad \widehat{p^2}_{UMVUE}=\frac{S(S-1)}{n(n-1)}.