Vocabulary flashcards covering the major concepts, definitions, and tools introduced in the lecture notes on sampling distributions, hypothesis testing, and statistical learning.
Sampling Distribution
The probability distribution of a sample statistic calculated from all possible samples of a fixed size drawn from a population.
Random Experiment
A process that generates one outcome from several possible outcomes, where the specific result cannot be known in advance.
Deterministic Component
Part of a phenomenon that yields the same outcome every time given the same conditions; no randomness involved.
Purely Random Component
Part of a phenomenon that can lead to different outcomes despite identical conditions, due to inherent randomness.
Random Variable
A function that assigns numerical values to the outcomes of a random experiment.
Statistical Inference
Using sample data to draw conclusions or make guesses about the underlying population or data-generating process.
Simple Random Sampling
Selecting n observations from a population so that every possible sample of size n has an equal chance of being chosen.
Systematic Sampling
Selecting observations according to a fixed rule, e.g., every kth item after a random start.
Stratified Sampling
Randomly sampling within predefined subgroups (strata) in proportions that mirror their frequencies in the population.
Cluster Sampling
Randomly selecting entire groups (clusters) from the population and sampling all or some units within them.
Convenience Sampling
Non-random sampling that selects observations based on ease of access, risking bias.
Judgment Sampling
Non-random sampling where the researcher selects units deemed ‘representative’, introducing subjective bias.
Focus Group Sampling
Collecting data from a targeted discussion group, often recruited through non-random means like social media.
Sampling Bias
Systematic error that occurs when the sample does not accurately represent the intended population.
Random Sample (Three Criteria)
A sample where (1) every population member has equal selection probability, (2) selections are independent, (3) all possible samples of the size are equally likely.
Parameter
A numerical descriptive measure of an entire population, typically unknown.
Sample Statistic
A numerical descriptive measure computed from a sample.
Sampling Error
The difference between a sample statistic and its population parameter that arises purely by chance.
Central Limit Theorem (CLT)
States that, for large n, the sampling distribution of the sample mean is approximately normal with mean µ and variance σ²/n, regardless of the population’s distribution; the sample proportion obeys the analogous approximation with mean p and variance p(1-p)/n.
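A minimal NumPy sketch of the CLT in action, using an exponential (clearly non-normal) population; the population mean and sample sizes below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

# Population: exponential with mean 2 (so sigma = 2 as well).
mu, sigma = 2.0, 2.0
n, n_samples = 50, 10_000

# Draw many samples of size n and record each sample mean.
sample_means = rng.exponential(scale=mu, size=(n_samples, n)).mean(axis=1)

# CLT predicts mean mu and standard error sigma / sqrt(n).
print(sample_means.mean())          # ~2.0
print(sample_means.std(ddof=1))     # ~ sigma / np.sqrt(n) = 0.283
```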
Standard Error
The standard deviation of a sampling distribution, quantifying the typical distance between a sample statistic and the population parameter.
Sample Proportion (p-hat)
Number of ‘successes’ in a sample divided by the sample size; an estimator of the population proportion p.
Sampling Distribution of the Sample Proportion
Approximate normal distribution of p-hat with mean p and variance p(1-p)/n when np(1-p) > 5.
Bernoulli Random Variable
A binary variable taking value 1 with probability p (success) and 0 with probability 1-p (failure).
Continuity Correction
Adjustment applied when using a continuous normal distribution to approximate a discrete binomial distribution, improving accuracy for proportions.
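A short worked comparison, assuming SciPy and illustrative values n = 30, p = 0.4; the corrected approximation tracks the exact binomial probability more closely:

```python
import numpy as np
from scipy.stats import binom, norm

n, p, k = 30, 0.4, 10
mu, sd = n * p, np.sqrt(n * p * (1 - p))

exact = binom.cdf(k, n, p)             # exact P(X <= 10)
plain = norm.cdf(k, mu, sd)            # normal approximation, no correction
corrected = norm.cdf(k + 0.5, mu, sd)  # continuity-corrected: P(X <= 10.5)

print(exact, plain, corrected)  # corrected lies closer to exact
```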
Sample Variance (S²)
Average of squared deviations from the sample mean, adjusted by dividing by n-1 to remain unbiased for σ².
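A quick NumPy illustration with made-up data; `ddof=1` gives the n-1 denominator:

```python
import numpy as np

x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

biased = np.var(x)            # divides by n
unbiased = np.var(x, ddof=1)  # divides by n - 1 (the sample variance S^2)

print(biased, unbiased)  # 4.0, 4.571...
```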
Adjusted Sample Variance
Another name for the sample variance S²; dividing by n-1 rather than n removes the downward bias of the uncorrected estimator of σ².
Degrees of Freedom
Number of independent values that can vary in computing a statistic; for variance it is n-1.
Chi-Square Distribution
Distribution of a sum of squared standard normal variables; used for inference about variances.
De Moivre’s Equation
Var(X̄) = σ²/n for independent, identically distributed variables; basis for the Law of Large Numbers.
Law of Small Numbers
Cognitive bias where people expect small samples to closely resemble population proportions, underestimating true variability.
Exact Sampling Distribution
Distribution derived analytically or by enumerating all possible samples, feasible for small populations or normal populations with known parameters.
CLT Approximation
Using the Central Limit Theorem to model a sampling distribution as normal when exact derivation is impractical.
Simulation
Computational method that repeatedly draws random samples to approximate a sampling distribution empirically.
Empirical Probability
Probability estimated by the relative frequency of an event in simulated or observed data.
Hypothesis
A statement about a population parameter that is tested using sample data.
Null Hypothesis (H₀)
Default claim assumed true unless sample evidence is sufficiently strong to reject it.
Alternative Hypothesis (Hₐ)
Contrary claim to H₀ that a researcher seeks to support with evidence.
Test Statistic
Sample-based quantity calculated to decide between H₀ and Hₐ.
Level of Significance (α)
Threshold probability for rejecting H₀; equals risk of a Type I error.
Critical Value
Boundary of the rejection region determined by α; test statistics beyond it lead to rejection of H₀.
Critical (Rejection) Region
Set of extreme test statistic values that trigger rejection of H₀.
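A small sketch of how the critical value follows from α, assuming a z-test and SciPy; the outputs are the familiar 1.96 and 1.645 boundaries:

```python
from scipy.stats import norm

alpha = 0.05

# Two-tailed test: reject H0 when |z| exceeds this boundary.
z_crit_two = norm.ppf(1 - alpha / 2)   # 1.959...

# One-tailed (upper) test: reject H0 when z exceeds this boundary.
z_crit_upper = norm.ppf(1 - alpha)     # 1.644...

print(z_crit_two, z_crit_upper)
```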
Type I Error
Incorrectly rejecting a true null hypothesis; probability equals α.
Type II Error
Failing to reject a false null hypothesis; probability denoted β.
One-Tailed Test
Hypothesis test where Hₐ specifies a direction (greater than or less than).
Two-Tailed Test
Hypothesis test where Hₐ only states a difference (not direction), using both tails of the distribution.
Permutation Test
Non-parametric hypothesis test that assesses significance by evaluating all (or many) reallocations of observed data labels.
Sampling Distribution Under Permutation
Distribution of a statistic generated by all possible re-labelings consistent with H₀, providing exact or empirical p-values.
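A minimal sketch of an empirical permutation test for a difference in group means; the two samples below are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two small groups; H0: labels are exchangeable (no group effect).
a = np.array([12.1, 9.8, 11.5, 10.2, 13.0])
b = np.array([8.9, 10.1, 9.5, 8.7])

observed = a.mean() - b.mean()
pooled = np.concatenate([a, b])

# Empirical permutation distribution of the mean difference.
n_perm = 10_000
diffs = np.empty(n_perm)
for i in range(n_perm):
    perm = rng.permutation(pooled)
    diffs[i] = perm[:len(a)].mean() - perm[len(a):].mean()

# Two-sided empirical p-value.
p_value = np.mean(np.abs(diffs) >= abs(observed))
print(observed, p_value)
```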
Exogeneity
Condition where explanatory variables are uncorrelated with the error term in a regression model.
Endogeneity
Violation of exogeneity; explanatory variables correlate with errors, biasing OLS estimates.
Homoskedasticity
Assumption that error terms have constant variance across all levels of the independent variables.
Heteroskedasticity
Condition where error variance changes with the level of an explanatory variable, violating an OLS assumption.
Ordinary Least Squares (OLS)
Estimation method that chooses regression coefficients minimizing the sum of squared residuals.
Least Squares Line
Fitted regression line obtained via OLS that minimizes squared deviations between observed and predicted values.
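A compact NumPy sketch of an OLS fit on simulated data; the true coefficients 1 and 2 are chosen purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data: y = 1 + 2x + noise.
x = rng.uniform(0, 10, size=100)
y = 1.0 + 2.0 * x + rng.normal(0, 1.5, size=100)

# OLS solves min ||y - X beta||^2; include an intercept column.
X = np.column_stack([np.ones_like(x), x])
beta = np.linalg.lstsq(X, y, rcond=None)[0]
print(beta)  # approximately [1.0, 2.0]
```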
Residual
Difference between an observed value and its corresponding predicted value from a model.
Total Sum of Squares (SST)
Sum of squared deviations of observed y-values from their mean; measures total variability.
Explained Sum of Squares (SSE)
Sum of squared deviations of predicted values from the mean of y; variability explained by the model.
Residual Sum of Squares (SSR)
Sum of squared residuals; variability not explained by the model.
Coefficient of Determination (R²)
Proportion of total variability in the dependent variable explained by the regression model (SSE/SST).
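A sketch of the decomposition under this deck's naming (SSE = explained, SSR = residual), on the same kind of simulated data as above; note the identity SST = SSE + SSR requires an intercept in the model:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=100)
y = 1.0 + 2.0 * x + rng.normal(0, 1.5, size=100)

X = np.column_stack([np.ones_like(x), x])
y_hat = X @ np.linalg.lstsq(X, y, rcond=None)[0]

sst = np.sum((y - y.mean()) ** 2)       # total variability
sse = np.sum((y_hat - y.mean()) ** 2)   # explained variability
ssr = np.sum((y - y_hat) ** 2)          # residual variability

print(np.isclose(sst, sse + ssr))  # True: SST = SSE + SSR
print(sse / sst)                   # R^2
```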
Adjusted R²
R² corrected for the number of predictors, preventing artificial inflation when irrelevant variables are added.
Standard Error of Regression
Square root of SSR divided by (n – k); average distance of observations from the regression line.
Confidence Interval (Regression)
Range around a parameter estimate within which the true parameter is expected to lie with specified probability.
Prediction Interval
Interval within which a future individual response is expected to fall with a given probability.
Gauss–Markov Theorem
States that, under classical assumptions, OLS provides the Best Linear Unbiased Estimators (BLUE) for regression coefficients.
Multiple Linear Regression
Regression model with one dependent variable and two or more independent variables, estimated via OLS.
Multicollinearity
Strong linear relationships among independent variables that inflate variances of coefficient estimates.
Dummy Variable
Binary indicator (0/1) representing categories of a qualitative predictor in regression models.
Dummy Variable Trap
Perfect multicollinearity caused by including a full set of dummy variables for all categories; solved by omitting one reference category.
Logistic Regression
Model that relates predictors to the log-odds of a binary outcome, ensuring predicted probabilities lie between 0 and 1.
Logistic Function
S-shaped curve, 1 / (1 + e^{–z}), mapping real numbers to the (0,1) interval for probability estimation.
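A minimal implementation of the logistic function; the regression coefficients in the second part are hypothetical:

```python
import numpy as np

def logistic(z):
    """S-shaped map from the real line to (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-4.0, -1.0, 0.0, 1.0, 4.0])
print(logistic(z))  # [0.018, 0.269, 0.5, 0.731, 0.982]

# In logistic regression the log-odds are linear: log(p/(1-p)) = b0 + b1*x.
p = logistic(0.5 + 2.0 * 1.0)   # hypothetical b0 = 0.5, b1 = 2, x = 1
print(np.log(p / (1 - p)))      # recovers 2.5
```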
Odds
Ratio of probability of success to probability of failure; logistic regression models logarithm of odds (logit).
Classification Error Rate
Proportion of observations misclassified by a model on a given data set.
Confusion Matrix
Table displaying counts of true vs predicted classes, summarizing classification performance.
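A from-scratch sketch for a binary problem with made-up labels; rows index the true class, columns the predicted class:

```python
import numpy as np

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])

# 2x2 confusion matrix: cm[true, predicted].
cm = np.zeros((2, 2), dtype=int)
for t, p in zip(y_true, y_pred):
    cm[t, p] += 1
print(cm)
# [[3 1]
#  [1 3]]

error_rate = np.mean(y_true != y_pred)  # classification error rate
print(error_rate)  # 0.25
```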
Curse of Dimensionality
Phenomenon where data become sparse in high dimensions, hindering methods like local averaging or K-NN.
Bias-Variance Trade-Off
Balance between model complexity (variance) and accuracy of approximation (bias) that determines generalization error.
Mean Squared Error (MSE)
Expected squared difference between predicted and actual values; equals bias² plus variance plus irreducible error.
Bayes Classifier
Hypothetical classifier that assigns each observation to the class with the highest true conditional probability, achieving the lowest possible error rate.
Supervised Learning
Learning task where a model is trained on labeled data to predict an output variable.
Unsupervised Learning
Learning task aimed at discovering patterns or structure in data without labeled responses.
Regression Tree
Decision tree that predicts a continuous response by partitioning predictor space and using region means.
Classification Tree
Decision tree that assigns class labels by partitioning predictor space to maximize class purity within regions.
Recursive Binary Splitting
Greedy algorithm that builds trees by repeatedly splitting regions into two parts to improve a chosen criterion.
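A sketch of one greedy step for a single predictor, using the regression-tree RSS criterion (defined a few cards below); the toy data jump in level near x = 4:

```python
import numpy as np

def best_split(x, y):
    """Find the cutpoint s minimizing total within-region RSS
    for the two regions {x < s} and {x >= s} (one greedy step)."""
    best_s, best_rss = None, np.inf
    for s in np.unique(x)[1:]:  # candidate cutpoints
        left, right = y[x < s], y[x >= s]
        rss = (((left - left.mean()) ** 2).sum()
               + ((right - right.mean()) ** 2).sum())
        if rss < best_rss:
            best_s, best_rss = s, rss
    return best_s, best_rss

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([1.1, 0.9, 1.0, 5.2, 4.8, 5.0])
print(best_split(x, y))  # splits at x = 4, where the mean level jumps
```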
Pruning
Process of cutting back a large tree to a subtree that balances goodness of fit and model complexity, often using cross-validation.
Terminal Node (Leaf)
Final region in a decision tree where a single prediction is made for all observations falling there.
Internal Node
Decision point in a tree where the data set is split based on a predictor and threshold.
Residual Sum of Squares in Trees
Criterion minimized when splitting nodes in regression trees; sum of squared deviations within each region.
Cross-Validation (for Trees)
Technique that estimates prediction error to choose tuning parameters like the pruning penalty α.
K-Nearest Neighbours (K-NN)
Non-parametric method that classifies or regresses by averaging the outcomes of the K closest observations in feature space.
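A self-contained K-NN classification sketch on toy 2-D data (two well-separated clusters):

```python
import numpy as np

def knn_predict(X_train, y_train, x_new, k=3):
    """Classify x_new by majority vote among its k nearest neighbours."""
    dists = np.linalg.norm(X_train - x_new, axis=1)
    nearest = np.argsort(dists)[:k]
    return np.bincount(y_train[nearest]).argmax()

X_train = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
                    [4.0, 4.0], [4.2, 3.9], [3.8, 4.1]])
y_train = np.array([0, 0, 0, 1, 1, 1])

print(knn_predict(X_train, y_train, np.array([1.1, 1.0])))  # 0
print(knn_predict(X_train, y_train, np.array([4.1, 4.0])))  # 1
```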
Exact Test
Statistical test using the true sampling distribution without approximation, often via enumeration or permutation.
Standard Normal Distribution
Normal distribution with mean 0 and variance 1, used for z-scores.
Standard Error of the Mean
σ / √n; the standard deviation of the sampling distribution of the sample mean.
Law of Large Numbers
Theorem stating that sample averages converge to the population mean as sample size increases.
Empirical Cumulative Distribution Function (eCDF)
Step function giving the proportion of sample values less than or equal to each point; used for simulation-based inference.
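A one-function sketch; the sample values are made up:

```python
import numpy as np

def ecdf(sample, t):
    """Proportion of sample values <= t (the eCDF evaluated at t)."""
    return np.mean(np.asarray(sample) <= t)

data = [3.1, 1.4, 2.2, 5.0, 4.3]
print(ecdf(data, 2.2))  # 0.4  (2 of 5 values are <= 2.2)
print(ecdf(data, 6.0))  # 1.0
```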
Permutation Distribution
Distribution of a statistic over all possible reassignments of labels consistent with H₀, forming the basis of exact non-parametric tests.
p-Value
Probability, under H₀, of observing a result as extreme as or more extreme than the sample outcome.
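A small sketch, assuming a hypothetical observed z-statistic of 2.1 and SciPy:

```python
from scipy.stats import norm

z = 2.1  # hypothetical observed z-statistic

p_one_tailed = 1 - norm.cdf(z)        # upper-tailed test
p_two_tailed = 2 * (1 - norm.cdf(z))  # two-tailed test

print(p_one_tailed, p_two_tailed)  # ~0.0179, ~0.0357
```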
Standard Error of a Coefficient
Estimated standard deviation of an OLS coefficient; square root of its estimated variance.
F-Test of Global Significance
Hypothesis test that evaluates whether at least one predictor in a multiple regression explains variation in the response.
Partial Effect (Regression)
Change in the expected response due to a one-unit change in one predictor, holding others constant.
Heteroskedasticity-Robust SE
Standard error estimate that remains valid when error variance is not constant across observations.
Tree-Based Ensemble
Method combining many decision trees (e.g., random forest, boosting) to improve prediction accuracy.