Sampling, Inference, and Statistical Learning – Core Vocabulary

Description and Tags

Vocabulary flashcards covering the major concepts, definitions, and tools introduced in the lecture notes on sampling distributions, hypothesis testing, and statistical learning.

99 Terms

1

Sampling Distribution

The probability distribution of a sample statistic calculated from all possible samples of a fixed size drawn from a population.
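
For a very small population the definition can be carried out literally by listing every sample. A minimal sketch, assuming a made-up population of five values and samples of size 2 drawn without replacement (the data and variable names are illustrative only):

```python
from itertools import combinations
from collections import Counter
from fractions import Fraction

# Hypothetical population of five values (illustrative only).
population = [2, 4, 6, 8, 10]
n = 2  # fixed sample size

# Enumerate every possible sample of size n (without replacement).
samples = list(combinations(population, n))

# Compute the sample mean of each sample and tabulate relative frequencies.
means = [Fraction(sum(s), n) for s in samples]
counts = Counter(means)
sampling_distribution = {
    m: Fraction(c, len(samples)) for m, c in sorted(counts.items())
}

for m, prob in sampling_distribution.items():
    print(f"mean = {float(m):4.1f}  probability = {float(prob):.1f}")

# The mean of the sampling distribution equals the population mean.
mean_of_means = sum(m * prob for m, prob in sampling_distribution.items())
population_mean = Fraction(sum(population), len(population))
```

Exact fractions are used so the probabilities sum to exactly 1 and the two means can be compared without rounding.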

2

Random Experiment

A process that generates one outcome from several possible outcomes, where the specific result cannot be known in advance.

3

Deterministic Component

Part of a phenomenon that yields the same outcome every time given the same conditions; no randomness involved.

4

Purely Random Component

Part of a phenomenon that can lead to different outcomes despite identical conditions, due to inherent randomness.

5

Random Variable

A function that assigns numerical values to the outcomes of a random experiment.

6

Statistical Inference

Using sample data to draw conclusions about the underlying population or data-generating process, together with a statement of the uncertainty involved.

7

Simple Random Sampling

Selecting n observations from a population so that every possible sample of size n has an equal chance of being chosen.

8

Systematic Sampling

Selecting observations according to a fixed rule, e.g., every kth item after a random start.
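
The "every kth item after a random start" rule might be sketched as follows. The function name `systematic_sample` and the 100-unit frame are assumptions made for illustration:

```python
import random

def systematic_sample(items, k, rng=random):
    """Return every k-th item, starting from a random position in [0, k)."""
    if k < 1:
        raise ValueError("k must be a positive integer")
    start = rng.randrange(k)   # random start among the first k positions
    return items[start::k]

population = list(range(1, 101))   # hypothetical sampling frame of 100 units
sample = systematic_sample(population, k=10, rng=random.Random(0))
print(sample)
```

Only the start is random; once it is chosen, the rest of the sample is fully determined.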

9

Stratified Sampling

Dividing the population into predefined subgroups (strata) and randomly sampling within each one, typically in proportions that mirror the strata's frequencies in the population (proportional allocation).
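
A minimal sketch of proportional allocation, assuming the strata are given as a dictionary of lists. The function name, the strata labels, and the sizes are hypothetical; rounding can make the stratum sizes sum to slightly more or less than n in general:

```python
import random

def stratified_sample(strata, n, seed=None):
    """Draw a proportionally allocated stratified sample.

    strata: dict mapping stratum name -> list of units.
    n: target total sample size; each stratum gets round(n * share) units.
    """
    rng = random.Random(seed)
    total = sum(len(units) for units in strata.values())
    sample = {}
    for name, units in strata.items():
        n_h = round(n * len(units) / total)      # proportional allocation
        sample[name] = rng.sample(units, n_h)    # simple random sample within stratum
    return sample

# Hypothetical population: 60% in stratum A, 30% in B, 10% in C.
strata = {
    "A": [f"A{i}" for i in range(60)],
    "B": [f"B{i}" for i in range(30)],
    "C": [f"C{i}" for i in range(10)],
}
drawn = stratified_sample(strata, n=20, seed=1)
sizes = {name: len(units) for name, units in drawn.items()}
print(sizes)
```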

10

Cluster Sampling

Randomly selecting entire groups (clusters) from the population and sampling all or some units within them.

11

Convenience Sampling

Non-random sampling that selects observations based on ease of access, risking bias.

12

Judgment Sampling

Non-random sampling where the researcher selects units deemed ‘representative’, introducing subjective bias.

13

Focus Group Sampling

Collecting data from a targeted discussion group, often recruited through non-random means like social media.

14

Sampling Bias

Systematic error that occurs when the sample does not accurately represent the intended population.

15

Random Sample (Three Criteria)

A sample where (1) every population member has equal selection probability, (2) selections are independent, (3) all possible samples of the size are equally likely.

16

Parameter

A numerical descriptive measure of an entire population, typically unknown.

17

Sample Statistic

A numerical descriptive measure computed from a sample.

18

Sampling Error

The difference between a sample statistic and its population parameter that arises purely by chance.

19

Central Limit Theorem (CLT)

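States that, for large n, the sampling distribution of the sample mean is approximately normal with mean µ and variance σ²/n, whatever the shape of the population distribution, provided the population variance is finite. The sample proportion is a special case, with mean p and variance p(1-p)/n.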
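
The theorem can be checked by simulation. A minimal sketch, assuming an exponential population with rate 1 (mean 1, variance 1, strongly right-skewed); the sample size, number of replications, and seed are arbitrary choices:

```python
import random
import statistics

rng = random.Random(42)
n = 50        # size of each sample
reps = 5000   # number of simulated samples

# Draw many samples from the skewed population and record each sample mean.
sample_means = [
    statistics.fmean([rng.expovariate(1.0) for _ in range(n)])
    for _ in range(reps)
]

mean_of_means = statistics.fmean(sample_means)
var_of_means = statistics.variance(sample_means)
print(f"mean of sample means:     {mean_of_means:.3f} (theory: 1.000)")
print(f"variance of sample means: {var_of_means:.4f} (theory: {1 / n:.4f})")
```

A histogram of `sample_means` would look close to a bell curve even though the individual draws are far from normal.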

20

Standard Error

The standard deviation of a sampling distribution, quantifying the typical distance between a sample statistic and the population parameter.

21

Sample Proportion (p-hat)

Number of ‘successes’ in a sample divided by the sample size; an estimator of the population proportion p.

22

Sampling Distribution of the Sample Proportion

Approximately normal distribution of p-hat with mean p and variance p(1-p)/n; a common rule of thumb for using the approximation is np(1-p) > 5.

23

Bernoulli Random Variable

A binary variable taking value 1 with probability p (success) and 0 with probability 1-p (failure).

24

Continuity Correction

Adjustment applied when using a continuous normal distribution to approximate a discrete binomial distribution, improving accuracy for proportions.
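
The usual form of the adjustment is to shift the cut-off by half a unit, e.g., approximating P(X ≤ k) by the normal probability up to k + 0.5. A minimal sketch comparing the exact binomial probability with both approximations, assuming n = 40, p = 0.5, and k = 24 (values chosen only for illustration):

```python
from math import comb, sqrt
from statistics import NormalDist

n, p = 40, 0.5   # hypothetical binomial setting, np(1-p) = 10
k = 24           # target: P(X <= k)

# Exact binomial probability.
exact = sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))

# Normal approximation with matching mean and standard deviation.
approx = NormalDist(mu=n * p, sigma=sqrt(n * p * (1 - p)))
without_cc = approx.cdf(k)        # no correction
with_cc = approx.cdf(k + 0.5)     # continuity correction: extend by half a unit

print(f"exact:              {exact:.4f}")
print(f"normal, no cc:      {without_cc:.4f}")
print(f"normal, with cc:    {with_cc:.4f}")
```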

25

Sample Variance (S²)

Sum of squared deviations from the sample mean divided by n-1 rather than n, which makes it an unbiased estimator of σ².
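
A short worked example on a made-up sample of five values, showing the n-1 and n denominators side by side and checking them against Python's `statistics` module:

```python
import statistics

data = [4, 8, 6, 5, 7]   # hypothetical sample
n = len(data)
mean = sum(data) / n

sum_sq_dev = sum((x - mean) ** 2 for x in data)
s2 = sum_sq_dev / (n - 1)    # sample variance S^2 (divides by n - 1)
biased = sum_sq_dev / n      # divides by n; underestimates sigma^2 on average

# The standard library follows the same two conventions.
same_as_variance = abs(s2 - statistics.variance(data)) < 1e-12
same_as_pvariance = abs(biased - statistics.pvariance(data)) < 1e-12
print(s2, biased, same_as_variance, same_as_pvariance)
```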

26

Adjusted Sample Variance

Same as sample variance; uses n-1 in the denominator to correct bias.

27

Degrees of Freedom

Number of independent values that can vary in computing a statistic; for variance it is n-1.

28

Chi-Square Distribution

Distribution of a sum of squared independent standard normal variables, with degrees of freedom equal to the number of terms; used for inference about variances.

29

De Moivre’s Equation

Var( X̄ ) = σ² / n for independent, identically distributed variables; basis for the Law of Large Numbers.

30

Law of Small Numbers

Cognitive bias where people expect small samples to closely resemble population proportions, underestimating true variability.

31

Exact Sampling Distribution

Distribution derived analytically or by enumerating all possible samples, feasible for small populations or normal populations with known parameters.

32

CLT Approximation

Using the Central Limit Theorem to model a sampling distribution as normal when exact derivation is impractical.

33

Simulation

Computational method that repeatedly draws random samples to approximate a sampling distribution empirically.

34

Empirical Probability

Probability estimated by the relative frequency of an event in simulated or observed data.

35

Hypothesis

A statement about a population parameter that is tested using sample data.

36

Null Hypothesis (H₀)

Default claim assumed true unless sample evidence is sufficiently strong to reject it.

37

Alternative Hypothesis (Hₐ)

Contrary claim to H₀ that a researcher seeks to support with evidence.

38

Test Statistic

Sample-based quantity calculated to decide between H₀ and Hₐ.

39

Level of Significance (α)

Threshold probability for rejecting H₀, chosen before the test; the largest risk of a Type I error the researcher is willing to accept.

40

Critical Value

Boundary of the rejection region determined by α; test statistics beyond it lead to rejection of H₀.

41

Critical (Rejection) Region

Set of extreme test statistic values that trigger rejection of H₀.
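
For a z-test, the critical values and the rejection region follow directly from α. A minimal sketch, assuming a standard normal test statistic and α = 0.05; the helper `in_rejection_region` is a hypothetical name:

```python
from statistics import NormalDist

alpha = 0.05
std_normal = NormalDist()   # mean 0, standard deviation 1

z_two_sided = std_normal.inv_cdf(1 - alpha / 2)   # about 1.96
z_one_sided = std_normal.inv_cdf(1 - alpha)       # about 1.645

def in_rejection_region(z, two_sided=True):
    """True when the test statistic z falls in the rejection region."""
    if two_sided:
        return abs(z) > z_two_sided
    return z > z_one_sided   # upper-tailed test

print(round(z_two_sided, 3), round(z_one_sided, 3))
print(in_rejection_region(1.7), in_rejection_region(1.7, two_sided=False))
```

The same statistic (1.7 here) can be rejected by a one-tailed test and not by a two-tailed test at the same α.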

42

Type I Error

Incorrectly rejecting a true null hypothesis; probability equals α.
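
That the Type I error rate matches α can be illustrated by simulating many tests in which H₀ is true by construction. A minimal sketch, assuming a two-sided z-test of µ = 0 on N(0, 1) data with known σ = 1; sample size, replications, and seed are arbitrary:

```python
import random
from math import sqrt
from statistics import NormalDist, fmean

rng = random.Random(7)
alpha = 0.05
z_crit = NormalDist().inv_cdf(1 - alpha / 2)   # two-sided critical value
n, reps = 25, 4000

rejections = 0
for _ in range(reps):
    # H0 is true here: the data come from N(0, 1) and we test mu = 0.
    sample = [rng.gauss(0, 1) for _ in range(n)]
    z = fmean(sample) / (1 / sqrt(n))          # known sigma = 1
    if abs(z) > z_crit:
        rejections += 1                        # a Type I error

type_i_rate = rejections / reps
print(f"share of true nulls rejected: {type_i_rate:.3f} (alpha = {alpha})")
```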

43

Type II Error

Failing to reject a false null hypothesis; probability denoted β.

44

One-Tailed Test

Hypothesis test where Hₐ specifies a direction (greater than or less than).

45

Two-Tailed Test

Hypothesis test where Hₐ only states a difference (not direction), using both tails of the distribution.

46

Permutation Test

Non-parametric hypothesis test that assesses significance by evaluating all (or many) reallocations of observed data labels.
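
With small groups, every relabeling can be enumerated. A minimal sketch of an exact one-sided permutation test for a difference in means, assuming two made-up groups of three observations each:

```python
from itertools import combinations

# Hypothetical outcomes for two small groups.
treatment = [12, 15, 14]
control = [8, 9, 11]

pooled = treatment + control
n_t = len(treatment)
observed = sum(treatment) / n_t - sum(control) / len(control)

# Enumerate every way to label n_t of the pooled values as "treatment".
diffs = []
for idx in combinations(range(len(pooled)), n_t):
    t = [pooled[i] for i in idx]
    c = [pooled[i] for i in range(len(pooled)) if i not in idx]
    diffs.append(sum(t) / len(t) - sum(c) / len(c))

# One-sided exact p-value: share of relabelings at least as extreme as observed.
p_value = sum(d >= observed for d in diffs) / len(diffs)
print(f"observed difference = {observed:.3f}, exact p-value = {p_value:.2f}")
```

For larger groups the number of relabelings grows quickly, so a random subset of relabelings is typically used instead of the full enumeration.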

47

Sampling Distribution Under Permutation

Distribution of a statistic generated by all possible re-labelings consistent with H₀, providing exact or empirical p-values.

48

Exogeneity

Condition where explanatory variables are uncorrelated with the error term in a regression model.

49

Endogeneity

Violation of exogeneity; explanatory variables correlate with errors, biasing OLS estimates.

50

Homoskedasticity

Assumption that error terms have constant variance across all levels of the independent variables.

51

Heteroskedasticity

Condition where error variance changes with the level of an explanatory variable, violating an OLS assumption.

52

Ordinary Least Squares (OLS)

Estimation method that chooses regression coefficients minimizing the sum of squared residuals.

53

Least Squares Line

Fitted regression line obtained via OLS that minimizes squared deviations between observed and predicted values.

54

Residual

Difference between an observed value and its corresponding predicted value from a model.

55

Total Sum of Squares (SST)

Sum of squared deviations of observed y-values from their mean; measures total variability.

56

Explained Sum of Squares (SSE)

Sum of squared deviations of predicted values from the mean of y; variability explained by the model.

57

Residual Sum of Squares (SSR)

Sum of squared residuals; variability not explained by the model.

58

Coefficient of Determination (R²)

Proportion of total variability in the dependent variable explained by the regression model (SSE/SST).
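
A worked example of the decomposition SST = SSE + SSR for a simple regression, using this set's labels (SSE = explained, SSR = residual; some textbooks swap the two abbreviations). The five data points are made up for illustration:

```python
# Hypothetical data (illustrative only).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]
n = len(x)

x_bar = sum(x) / n
y_bar = sum(y) / n

# OLS slope and intercept for a simple (one-predictor) regression.
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
s_xx = sum((xi - x_bar) ** 2 for xi in x)
b1 = s_xy / s_xx
b0 = y_bar - b1 * x_bar

y_hat = [b0 + b1 * xi for xi in x]
residuals = [yi - fi for yi, fi in zip(y, y_hat)]

sst = sum((yi - y_bar) ** 2 for yi in y)       # total variability
sse = sum((fi - y_bar) ** 2 for fi in y_hat)   # explained by the model
ssr = sum(e ** 2 for e in residuals)           # left unexplained
r2 = sse / sst

print(f"b0 = {b0:.2f}, b1 = {b1:.2f}")
print(f"SST = {sst:.2f}, SSE = {sse:.2f}, SSR = {ssr:.2f}, R^2 = {r2:.2f}")
```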

59

Adjusted R²

R² corrected for the number of predictors, preventing artificial inflation when irrelevant variables are added.
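
The usual formula is 1 - (1 - R²)(n - 1)/(n - k - 1). A minimal sketch, assuming k counts the predictors and excludes the intercept (the function name is illustrative):

```python
def adjusted_r2(r2, n, k):
    """Adjusted R^2 for n observations and k predictors (intercept not counted)."""
    if n - k - 1 <= 0:
        raise ValueError("need n > k + 1")
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# Same R^2, more predictors: the adjusted value falls.
few = adjusted_r2(0.60, n=50, k=2)
many = adjusted_r2(0.60, n=50, k=10)
print(round(few, 4), round(many, 4))
```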

60

Standard Error of Regression

Square root of SSR / (n - k), where k is the number of estimated coefficients including the intercept; the typical distance of observations from the regression line.

61

Confidence Interval (Regression)

Range around a parameter estimate constructed so that, across repeated samples, a specified share of such intervals (e.g., 95%) contains the true parameter.

62

Prediction Interval

Interval within which a future individual response is expected to fall with a given probability.

63

Gauss–Markov Theorem

States that, under classical assumptions, OLS provides the Best Linear Unbiased Estimators (BLUE) for regression coefficients.

64

Multiple Linear Regression

Regression model with one dependent variable and two or more independent variables, estimated via OLS.

65

Multicollinearity

Strong linear relationships among independent variables that inflate variances of coefficient estimates.

66

Dummy Variable

Binary indicator (0/1) representing categories of a qualitative predictor in regression models.

67

Dummy Variable Trap

Perfect multicollinearity caused by including a full set of dummy variables for all categories; solved by omitting one reference category.
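
Omitting the reference category might look like this in code. A minimal sketch; `make_dummies` and the region data are hypothetical:

```python
def make_dummies(values, reference):
    """Encode a categorical list as 0/1 columns, omitting the reference category."""
    categories = sorted(set(values))
    if reference not in categories:
        raise ValueError("reference must be one of the observed categories")
    kept = [c for c in categories if c != reference]
    return {c: [1 if v == c else 0 for v in values] for c in kept}

region = ["north", "south", "west", "south", "north"]   # hypothetical predictor
dummies = make_dummies(region, reference="north")
print(dummies)

# With all three columns included, they would sum to 1 in every row,
# duplicating the intercept column: that is the dummy variable trap.
```

Coefficients on the remaining dummies are then read as differences relative to the omitted category.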

68

Logistic Regression

Model that relates predictors to the log-odds of a binary outcome, ensuring predicted probabilities lie between 0 and 1.

69

Logistic Function

S-shaped curve, 1 / (1 + e^{-z}), mapping real numbers to the (0,1) interval for probability estimation.
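
A minimal sketch of the function in code. The two-branch form is an implementation choice made here to avoid overflow in `math.exp` for large negative z, not part of the definition:

```python
import math

def logistic(z):
    """Compute 1 / (1 + exp(-z)) in a numerically stable way."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)          # avoids overflow when z is very negative
    return ez / (1.0 + ez)

for z in (-4, -1, 0, 1, 4):
    print(f"z = {z:+d}  ->  {logistic(z):.4f}")
```

The curve is symmetric around z = 0, where it equals 0.5, so logistic(-z) = 1 - logistic(z).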

70

Odds

Ratio of the probability of success to the probability of failure, p / (1 - p); logistic regression models the logarithm of the odds (the logit) as a linear function of the predictors.
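
Converting between probability, odds, and log-odds is a one-line calculation each way. A minimal sketch with illustrative function names:

```python
import math

def odds(p):
    """Odds of success for probability p (0 <= p < 1)."""
    return p / (1 - p)

def prob_from_odds(o):
    """Probability of success implied by odds o."""
    return o / (1 + o)

p = 0.8
o = odds(p)               # about 4, i.e. "4 to 1"
log_odds = math.log(o)    # the logit of p
print(round(o, 6), round(log_odds, 4), round(prob_from_odds(o), 6))
```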

71

Classification Error Rate

Proportion of observations misclassified by a model on a given data set.

72

Confusion Matrix

Table displaying counts of true vs predicted classes, summarizing classification performance.
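
A short example that builds the four cells of a binary confusion matrix and the classification error rate from them, using made-up labels:

```python
from collections import Counter

actual    = [1, 0, 1, 1, 0, 0, 1, 0]   # hypothetical true labels
predicted = [1, 0, 0, 1, 0, 1, 1, 0]   # hypothetical model output

# Count each (actual, predicted) pair.
counts = Counter(zip(actual, predicted))
tp, fn = counts[(1, 1)], counts[(1, 0)]
fp, tn = counts[(0, 1)], counts[(0, 0)]

print("            pred 0  pred 1")
print(f"actual 0    {tn:6d}  {fp:6d}")
print(f"actual 1    {fn:6d}  {tp:6d}")

# Misclassified observations are the off-diagonal cells.
error_rate = (fp + fn) / len(actual)
print(f"classification error rate = {error_rate:.2f}")
```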

73

Curse of Dimensionality

Phenomenon where data become sparse in high dimensions, hindering methods like local averaging or K-NN.

74

Bias-Variance Trade-Off

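Tension in choosing model complexity: more flexible models fit the true relationship more closely (lower bias) but are more sensitive to the particular training sample (higher variance); the balance between the two determines generalization error.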

75

Mean Squared Error (MSE)

Expected squared difference between predicted and actual values; for a new observation, the expected test MSE equals bias² plus variance plus irreducible error.

76

Bayes Classifier

Hypothetical classifier that assigns each observation to the class with the highest true conditional probability, achieving the lowest possible error rate.

77

Supervised Learning

Learning task where a model is trained on labeled data to predict an output variable.

78

Unsupervised Learning

Learning task aimed at discovering patterns or structure in data without labeled responses.

79

Regression Tree

Decision tree that predicts a continuous response by partitioning predictor space and using region means.

80

Classification Tree

Decision tree that assigns class labels by partitioning predictor space to maximize class purity within regions.

81

Recursive Binary Splitting

Greedy algorithm that builds trees by repeatedly splitting regions into two parts to improve a chosen criterion.

82

Pruning

Process of cutting back a large tree to a subtree that balances goodness of fit and model complexity, often using cross-validation.

83

Terminal Node (Leaf)

Final region in a decision tree where a single prediction is made for all observations falling there.

84

Internal Node

Decision point in a tree where the data set is split based on a predictor and threshold.

85

Residual Sum of Squares in Trees

Criterion minimized when splitting nodes in regression trees; sum of squared deviations within each region.
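
One step of a regression-tree split on a single predictor can be sketched as a search over candidate thresholds for the lowest combined RSS. This is a simplified illustration with hypothetical function names and data, not a full tree-building routine:

```python
def rss(values):
    """Sum of squared deviations from the mean (0 for an empty list)."""
    if not values:
        return 0.0
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values)

def best_split(x, y):
    """Return (threshold, total_rss) for the split on x that minimizes RSS."""
    pairs = sorted(zip(x, y))
    best = (None, float("inf"))
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue   # no threshold separates equal x values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2   # midpoint candidate
        left = [yi for _, yi in pairs[:i]]
        right = [yi for _, yi in pairs[i:]]
        total = rss(left) + rss(right)
        if total < best[1]:
            best = (threshold, total)
    return best

x = [1, 2, 3, 10, 11, 12]                  # hypothetical predictor
y = [5.0, 6.0, 5.5, 20.0, 21.0, 19.0]      # hypothetical response
threshold, total = best_split(x, y)
print(threshold, total)
```

A tree applies the same search to each resulting region in turn, across all predictors, until a stopping rule is met.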

86

Cross-Validation (for Trees)

Technique that estimates prediction error to choose tuning parameters like the pruning penalty α.

87

K-Nearest Neighbours (K-NN)

Non-parametric method that predicts from the K closest observations in feature space: by majority vote of their classes for classification, or by averaging their responses for regression.
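
A minimal sketch of both uses, assuming Euclidean distance and unweighted neighbours. The function name, the `task` parameter, and the data are illustrative:

```python
from collections import Counter
from math import dist

def knn_predict(train_x, train_y, query, k, task="classification"):
    """Predict for one query point from its k nearest training points."""
    ranked = sorted(zip(train_x, train_y), key=lambda pair: dist(pair[0], query))
    outcomes = [outcome for _, outcome in ranked[:k]]
    if task == "classification":
        return Counter(outcomes).most_common(1)[0][0]   # majority vote
    return sum(outcomes) / len(outcomes)                # mean response

# Hypothetical training data: two well-separated groups of points.
points = [(1, 1), (1, 2), (2, 1), (6, 6), (7, 6), (6, 7)]
labels = ["a", "a", "a", "b", "b", "b"]
values = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]

predicted_class = knn_predict(points, labels, query=(2, 2), k=3)
predicted_value = knn_predict(points, values, query=(6.5, 6.5), k=3, task="regression")
print(predicted_class, predicted_value)
```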

88

Exact Test

Statistical test using the true sampling distribution without approximation, often via enumeration or permutation.

89

Standard Normal Distribution

Normal distribution with mean 0 and variance 1, used for z-scores.

90

Standard Error of the Mean

σ / √n; the standard deviation of the sampling distribution of the sample mean.

91

Law of Large Numbers

Theorem stating that sample averages converge to the population mean as sample size increases.
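
The convergence can be watched in a simulation. A minimal sketch, assuming rolls of a fair six-sided die (population mean 3.5); the seed and checkpoints are arbitrary:

```python
import random

rng = random.Random(1)
checkpoints = (10, 100, 1000, 10000, 100000)

total = 0
running_means = {}
for i in range(1, max(checkpoints) + 1):
    total += rng.randint(1, 6)          # one roll of a fair die
    if i in checkpoints:
        running_means[i] = total / i    # sample average after i rolls

for n_rolls, avg in running_means.items():
    print(f"n = {n_rolls:6d}  sample mean = {avg:.4f}")
```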

92

Empirical Cumulative Distribution Function (eCDF)

Step function giving the proportion of sample values less than or equal to each point; used for simulation-based inference.
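
A minimal sketch of the step function, built with a sorted copy of the sample and a binary search (the function name `ecdf` and the sample are illustrative):

```python
from bisect import bisect_right

def ecdf(sample):
    """Return a function F where F(t) is the share of sample values <= t."""
    ordered = sorted(sample)
    n = len(ordered)

    def F(t):
        # bisect_right counts how many sorted values are <= t.
        return bisect_right(ordered, t) / n

    return F

F = ecdf([3, 1, 4, 1, 5])   # hypothetical sample
print(F(0), F(1), F(3.5), F(5))
```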

93

Permutation Distribution

Distribution of a statistic over all possible reassignments of labels consistent with H₀, forming the basis of exact non-parametric tests.

94

p-Value

Probability, under H₀, of observing a result as extreme as or more extreme than the sample outcome.
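
A worked example of a two-sided p-value for a z-test, assuming H₀: µ = 100, a known σ of 15, n = 36, and an observed sample mean of 105 (all numbers are made up for illustration):

```python
from math import sqrt
from statistics import NormalDist

mu0, sigma, n, x_bar = 100, 15, 36, 105   # hypothetical setting

# Test statistic: distance of the sample mean from mu0 in standard errors.
z = (x_bar - mu0) / (sigma / sqrt(n))

# Two-sided p-value: probability under H0 of a |Z| at least this large.
p_two_sided = 2 * (1 - NormalDist().cdf(abs(z)))

alpha = 0.05
reject = p_two_sided < alpha
print(f"z = {z:.2f}, p-value = {p_two_sided:.4f}, reject H0: {reject}")
```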

95

Standard Error of a Coefficient

Estimated standard deviation of an OLS coefficient; square root of its estimated variance.

96

F-Test of Global Significance

Hypothesis test of the null that all slope coefficients in a multiple regression are zero, against the alternative that at least one predictor explains variation in the response.

97

Partial Effect (Regression)

Change in the expected response due to a one-unit change in one predictor, holding others constant.

98

Heteroskedasticity-Robust SE

Standard error estimate that remains valid when error variance is not constant across observations.

99

Tree-Based Ensemble

Method combining many decision trees (e.g., random forest, boosting) to improve prediction accuracy.