The mean is the arithmetic average of a dataset, calculated by summing all data points and dividing by the total number of observations. It gives a measure of central tendency and works well when data is symmetrically distributed. However, the mean is sensitive to outliers; a few extreme values can skew it significantly. In many ML models, such as linear regression, minimizing mean squared error amounts to predicting the conditional mean of the response variable.
The median represents the middle value of a sorted dataset, or the average of the two middle values if the dataset has an even number of observations. Unlike the mean, the median is robust to outliers and skewed distributions, making it a better measure of central tendency when data contains extreme values. In business settings (e.g., income data), median is often reported instead of mean because it's less influenced by very high or low extremes.
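To make the outlier sensitivity concrete, here is a minimal sketch in Python (standard library only); the income figures are invented for illustration:

```python
from statistics import mean, median

incomes = [32_000, 35_000, 38_000, 41_000, 45_000]
print(mean(incomes), median(incomes))   # 38200 38000 -- nearly identical

# A single extreme value drags the mean far upward but barely moves the median.
incomes.append(900_000)
print(mean(incomes), median(incomes))   # 181833.33... 39500
```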
The mode is the value that appears most frequently in a dataset. It is especially useful for categorical data, where the mean and median may not make sense. In unimodal distributions, the mode, median, and mean can be close, but multimodal distributions have several modes. The mode also underlies majority voting in recommendation systems and in ensemble classifiers such as Random Forests, which return the most frequent prediction among their trees.
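A tiny sketch of the mode on categorical data, using the standard library's Counter (the labels are made up); the same most-frequent-value logic is what a majority vote computes:

```python
from collections import Counter

votes = ["blue", "red", "blue", "green", "blue", "red"]
mode_value, count = Counter(votes).most_common(1)[0]
print(mode_value, count)   # blue 3
```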
Variance measures the average squared deviation from the mean, quantifying the spread or dispersion in the dataset. High variance indicates that data points are spread out widely around the mean, while low variance means they're clustered closely. Variance plays a crucial role in understanding model generalization — high model variance may lead to overfitting.
Standard deviation is simply the square root of variance, which puts it back in the same units as the original data. It is a widely used measure of spread, often reported alongside the mean. In a normal distribution, roughly 68% of values lie within ±1 standard deviation of the mean. Several classical models (e.g., linear regression) assume normally distributed errors with constant standard deviation (homoscedasticity).
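The sketch below (NumPy, with an arbitrary seed and sample size) estimates variance and standard deviation from simulated normal data and checks the ~68% rule empirically:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=50, scale=10, size=100_000)   # simulated data: mean 50, std 10

var = x.var(ddof=1)                  # sample variance (squared deviations, n-1 denominator)
std = np.sqrt(var)                   # back in the original units
within_1sd = np.mean(np.abs(x - x.mean()) <= std)

print(f"variance ≈ {var:.1f}, std ≈ {std:.2f}, share within ±1 SD ≈ {within_1sd:.3f}")
# ≈ 100, ≈ 10, ≈ 0.683
```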
Distributions describe how data is spread; several families come up repeatedly in ML (a short sampling sketch follows this list).
Normal Distribution is symmetric and bell-shaped; many natural phenomena follow it.
Binomial Distribution models the number of successes in a fixed number of independent binary trials (e.g., heads in 10 coin tosses).
Poisson Distribution models count data for rare events (e.g., calls arriving at a call center).
Exponential Distribution models time between Poisson events.
Beta Distribution models probabilities (between 0 and 1), widely used in Bayesian inference.
Dirichlet Distribution generalizes Beta for multiple categories; used in NLP and topic modeling (e.g., LDA).
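One way to build intuition for these families is simply to draw samples from them. A sketch using NumPy's random Generator follows; all parameter values are arbitrary examples, not prescriptions:

```python
import numpy as np

rng = np.random.default_rng(42)

normal      = rng.normal(loc=0, scale=1, size=1_000)      # symmetric, bell-shaped
binomial    = rng.binomial(n=10, p=0.5, size=1_000)       # successes in 10 coin tosses
poisson     = rng.poisson(lam=3, size=1_000)              # event counts per interval
exponential = rng.exponential(scale=1 / 3, size=1_000)    # waiting times between Poisson events
beta        = rng.beta(a=2, b=5, size=1_000)              # values in (0, 1), e.g. probabilities
dirichlet   = rng.dirichlet(alpha=[1, 1, 1], size=1_000)  # each row sums to 1 across 3 categories

print(normal.mean(), binomial.mean(), poisson.mean())     # ≈ 0, 5, 3
print(dirichlet.sum(axis=1)[:3])                          # [1. 1. 1.]
```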
The CLT states that as sample size grows, the sampling distribution of the sample mean approaches a normal distribution, regardless of the population's original distribution (provided its variance is finite). This property allows us to apply normal-based inference (e.g., confidence intervals, hypothesis tests) even when the underlying data is not normally distributed, as long as the sample size is large.
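A small simulation makes the CLT visible. The sketch below uses a heavily skewed exponential population (an arbitrary choice); the means of repeated samples of size 100 cluster around the population mean with spread close to σ/√n:

```python
import numpy as np

rng = np.random.default_rng(1)

# Exponential(scale=1): skewed population with mean 1 and standard deviation 1.
n = 100
sample_means = rng.exponential(scale=1.0, size=(10_000, n)).mean(axis=1)

print(sample_means.mean())   # ≈ 1.0, the population mean
print(sample_means.std())    # ≈ 0.1, i.e. σ/√n = 1/10
```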
LLN states that as the number of trials or observations increases, the sample average will converge to the population mean. This principle underlies many ML algorithms because it assures that, with enough data, our estimates of model parameters (e.g., coefficients, means, probabilities) will become more accurate.
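A running-average sketch with simulated fair-coin flips (NumPy, arbitrary seed) shows the convergence the LLN promises:

```python
import numpy as np

rng = np.random.default_rng(2)
flips = rng.integers(0, 2, size=100_000)                 # fair coin: 0 or 1, true mean 0.5
running_mean = np.cumsum(flips) / np.arange(1, flips.size + 1)

for n in (10, 100, 1_000, 100_000):
    print(n, running_mean[n - 1])                        # drifts toward 0.5 as n grows
```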
In Frequentist statistics, parameters are fixed and data is random — we make inferences using hypothesis tests, confidence intervals, and p-values.
In Bayesian statistics, parameters are treated as random variables with prior distributions. Observed data updates these priors to produce a posterior distribution. Bayesian approaches naturally incorporate prior knowledge and give probability distributions over parameters, rather than point estimates.
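As a concrete contrast between the two views, here is a minimal Bayesian sketch for a coin-flip example with a conjugate Beta prior; the prior parameters and the observed counts are invented for illustration, and SciPy is just one convenient tool for the posterior:

```python
from scipy import stats

heads, tails = 7, 3                    # observed data (made up)
alpha_prior, beta_prior = 2, 2         # Beta(2, 2) prior: mild belief that the coin is fair

# Conjugacy: Beta prior + Binomial likelihood -> Beta posterior.
posterior = stats.beta(alpha_prior + heads, beta_prior + tails)

print("frequentist point estimate:", heads / (heads + tails))   # 0.7
print("posterior mean:", posterior.mean())                      # 9/14 ≈ 0.643
print("95% credible interval:", posterior.interval(0.95))       # a distribution over p, not a point
```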
MLE is a method of estimating model parameters by finding the values that maximize the likelihood function, i.e., the probability (or density) of observing the given data under those parameter values. Many models, including logistic regression and Gaussian Mixture Models, are fit by maximizing the (log-)likelihood, either in closed form or through iterative optimization.
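A sketch of MLE for a Gaussian model on simulated data (SciPy's generic optimizer is used here purely for illustration; this particular problem also has a closed-form answer, which the numerical result should match):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(3)
data = rng.normal(loc=5.0, scale=2.0, size=1_000)    # simulated sample

def neg_log_likelihood(params):
    mu, log_sigma = params                           # optimize log(sigma) so sigma stays positive
    sigma = np.exp(log_sigma)
    return -np.sum(-0.5 * np.log(2 * np.pi * sigma**2) - (data - mu) ** 2 / (2 * sigma**2))

result = minimize(neg_log_likelihood, x0=[0.0, 0.0])
mu_hat, sigma_hat = result.x[0], np.exp(result.x[1])

print(mu_hat, sigma_hat)                             # ≈ data.mean(), data.std(ddof=0)
```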
A confidence interval gives a range of plausible values for an unknown population parameter based on sample data. A 95% CI means that if we repeat the sampling process many times, approximately 95% of those intervals would contain the true parameter value. It does not mean there's a 95% chance the parameter lies in that specific interval.
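The repeated-sampling interpretation can be checked by simulation. In the sketch below (arbitrary population, sample size, and repetition count), roughly 95% of the t-based intervals end up containing the true mean:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
true_mean, n, reps = 10.0, 30, 5_000
covered = 0

for _ in range(reps):
    sample = rng.normal(loc=true_mean, scale=3.0, size=n)
    se = sample.std(ddof=1) / np.sqrt(n)               # standard error of the mean
    t_crit = stats.t.ppf(0.975, df=n - 1)              # two-sided 95% critical value
    lo, hi = sample.mean() - t_crit * se, sample.mean() + t_crit * se
    covered += lo <= true_mean <= hi

print(covered / reps)                                  # ≈ 0.95
```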
Hypothesis testing evaluates competing claims (null vs. alternative hypotheses) using observed data; a short worked example follows this list.
t-test compares means between two groups.
Chi-square test assesses independence between categorical variables.
The p-value is the probability of obtaining a result at least as extreme as the one observed, assuming the null hypothesis is true; smaller p-values indicate stronger evidence against the null.
Type I Error (α): Rejecting a true null hypothesis (false positive).
Type II Error (β): Failing to reject a false null hypothesis (false negative).
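A short worked sketch of the two tests above, on simulated and made-up data (SciPy is assumed as the tooling; group sizes, effect sizes, and table counts are arbitrary):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)

# Two-sample t-test: do the two groups share the same mean?
group_a = rng.normal(loc=50, scale=5, size=40)
group_b = rng.normal(loc=53, scale=5, size=40)
t_stat, p_val = stats.ttest_ind(group_a, group_b)
print(f"t-test: t = {t_stat:.2f}, p = {p_val:.4f}")     # small p: evidence against equal means

# Chi-square test of independence on a 2x2 contingency table (made-up counts).
table = np.array([[30, 10],
                  [20, 25]])
chi2, p_val, dof, expected = stats.chi2_contingency(table)
print(f"chi-square: chi2 = {chi2:.2f}, p = {p_val:.4f}, dof = {dof}")
```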
Correlation measures the degree to which two variables move together, but does not imply one causes the other. Causation means one variable directly affects another. ML models often capture correlations, but establishing causality usually requires controlled experiments (A/B tests) or causal inference methods.
Covariance measures the joint variability of two variables but is scale-dependent. Correlation standardizes covariance, producing values between -1 and 1, which makes the strength and direction of a linear relationship easier to interpret. In ML, correlation is often checked during feature selection to reduce multicollinearity.
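The sketch below uses simulated "hours studied vs. exam score" data (an invented relationship) to show why correlation is easier to interpret: rescaling a variable changes the covariance but leaves the correlation untouched.

```python
import numpy as np

rng = np.random.default_rng(6)
hours = rng.uniform(0, 10, size=200)
score = 50 + 4 * hours + rng.normal(0, 5, size=200)      # positively related, by construction

print(np.cov(hours, score)[0, 1])                        # covariance: scale-dependent
print(np.corrcoef(hours, score)[0, 1])                   # correlation: in [-1, 1]

# Measuring study time in minutes inflates the covariance 60x but not the correlation.
print(np.cov(hours * 60, score)[0, 1])
print(np.corrcoef(hours * 60, score)[0, 1])
```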
Bootstrapping: Resampling with replacement to estimate the sampling distribution of a statistic; useful when theoretical distributions are unknown (see the sketch after this list).
Stratified Sampling: Divides the population into subgroups (strata) and samples proportionally from each; improves estimate accuracy when subgroups differ significantly.
Simple Random Sampling: Every member of the population has an equal chance of being selected; effective for large populations with homogeneous characteristics.
Systematic Sampling: Selecting every nth member from an ordered list of the population after a random start; simplifies the sampling process while retaining a degree of randomness.
Cluster Sampling: Involves dividing the population into clusters (often geographically) and then randomly selecting entire clusters to sample; useful for cost-effective and geographically diverse surveys.
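To illustrate the first technique above, here is a minimal bootstrap sketch: resampling a small made-up sample with replacement to get a percentile interval for its median (the data values and the 10,000 resamples are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(7)
sample = np.array([12, 15, 9, 22, 17, 14, 30, 11, 19, 16])   # original sample (made up)

boot_medians = np.array([
    np.median(rng.choice(sample, size=sample.size, replace=True))
    for _ in range(10_000)
])

lo, hi = np.percentile(boot_medians, [2.5, 97.5])            # 95% percentile interval
print(f"bootstrap 95% CI for the median: ({lo:.1f}, {hi:.1f})")
```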