Vocabulary flashcards covering core concepts from descriptive statistics, distributions, and inferential statistics as discussed in the video notes.
Central Limit Theorem (CLT)
States that the sum (or average) of a large number of independent, identically distributed random variables with finite variance tends toward a normal distribution, regardless of the shape of the original distribution.
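A quick simulation can illustrate the definition above. This is a sketch, not a proof: it draws samples from a uniform distribution (an assumption chosen for illustration) and checks that the sample means cluster tightly around the population mean.

```python
import random
import statistics

# Sketch of the CLT: averages of many i.i.d. Uniform(0, 1) draws
# cluster near the population mean 0.5, and their spread shrinks
# roughly like 1/sqrt(n) as the sample size n grows.
random.seed(0)

def sample_mean(n):
    """Mean of n independent Uniform(0, 1) draws."""
    return sum(random.random() for _ in range(n)) / n

means = [sample_mean(100) for _ in range(1000)]
center = statistics.mean(means)   # close to the population mean 0.5
spread = statistics.stdev(means)  # close to sqrt(1/12) / sqrt(100) ~ 0.029
```

Plotting a histogram of `means` would show the familiar bell shape even though each individual draw is uniform, not normal.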
Normal distribution (Gaussian)
A symmetric bell-shaped distribution that many statistical methods assume; characterized by its mean and standard deviation.
Type I Error
Rejecting a true null hypothesis (false positive); the probability of this error is denoted by alpha.
Type II Error
Failing to reject a false null hypothesis (false negative); the probability of this error is denoted by beta.
Significance level (alpha)
The probability threshold for rejecting the null hypothesis when it is true (commonly 0.05).
Power of a test
The probability of correctly rejecting a false null hypothesis; equal to 1 minus beta.
R-squared (coefficient of determination)
Measures the proportion of variance in the dependent variable explained by the model; ranges from 0 to 1, with 1 indicating a perfect fit.
Correlation
A measure of the strength and direction of a linear relationship between two variables; does not imply causation.
Causation
A cause-and-effect relationship where changes in one variable bring about changes in another.
Lurking variable (confounding variable)
An outside factor that affects both variables of interest, potentially creating a spurious association.
Parametric tests
Statistical tests that assume a specific population distribution (often normal), e.g., t-tests, ANOVA.
Non-parametric tests
Tests that do not assume a specific population distribution; e.g., Mann-Whitney U, Kruskal-Wallis.
p-value
The probability of observing data at least as extreme as what was observed, assuming the null hypothesis is true; used to decide whether to reject or fail to reject the null.
Cross-validation
A model evaluation method that partitions data into training and validation sets to assess performance.
k-fold cross-validation
A form of cross-validation where the data are split into k subsets (folds); the model is trained on k−1 folds and tested on the remaining fold, rotating through all k folds.
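The rotation described above can be sketched with plain index arithmetic, without assuming any ML library. The helper name `k_fold_indices` is made up for this example.

```python
# Sketch: index-based k-fold splits. Each of the k rotations holds out
# one fold for testing and trains on the remaining k - 1 folds.
def k_fold_indices(n_samples, k):
    # Assign samples to folds round-robin: fold i gets indices i, i+k, ...
    folds = [list(range(i, n_samples, k)) for i in range(k)]
    splits = []
    for held_out in range(k):
        test_idx = folds[held_out]
        train_idx = [i for f in range(k) if f != held_out for i in folds[f]]
        splits.append((train_idx, test_idx))
    return splits

splits = k_fold_indices(10, k=5)  # 5 (train, test) index pairs
```

Every sample appears in exactly one test fold across the k rotations, which is what makes the resulting performance estimate use all of the data.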
Bootstrapping
A resampling technique (sampling with replacement) used to estimate the distribution of a statistic.
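A minimal sketch of the resampling idea above, estimating the standard error of the sample mean; the data values are made up for illustration.

```python
import random
import statistics

# Bootstrap sketch: repeatedly resample the data WITH replacement,
# recompute the statistic (here, the mean), and use the spread of
# those bootstrap statistics as an estimate of its standard error.
random.seed(1)
data = [4.2, 5.1, 3.8, 6.0, 5.5, 4.9, 5.2, 4.4, 5.8, 5.0]

boot_means = []
for _ in range(2000):
    resample = [random.choice(data) for _ in data]  # same size, with replacement
    boot_means.append(statistics.mean(resample))

std_error = statistics.stdev(boot_means)  # bootstrap estimate of SE of the mean
```

The same loop works for statistics with no simple standard-error formula (medians, ratios), which is where bootstrapping is most useful.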
Descriptive statistics
Techniques that summarize data (e.g., central tendency and dispersion) without making inferences about a population.
Mean
Arithmetic average of a set of numbers.
Median
The middle value in an ordered data set.
Mode
The most frequently occurring value in a data set.
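The three measures of central tendency defined above can be computed directly with Python's standard library; the sample values are made up.

```python
import statistics

# Central tendency on a small made-up sample.
data = [2, 3, 3, 5, 7, 10]
m_mean = statistics.mean(data)      # (2+3+3+5+7+10) / 6 = 5.0
m_median = statistics.median(data)  # average of middle pair (3, 5) = 4.0
m_mode = statistics.mode(data)      # 3 occurs twice, most often
```

Note that with an even number of values the median averages the two middle values, and the mean is pulled upward by the larger value 10 while the median is not.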
Population
The entire group of interest in a study.
Sample
A subset drawn from a population used to estimate population characteristics.
Parameter
A numerical characteristic of a population (e.g., population mean).
Statistic
A numerical characteristic computed from a sample (e.g., sample mean).
Handling missing data by deletion
Removing records with missing values (listwise or pairwise deletion).
Imputation
Replacing missing values with estimated values (mean/median/mode or model-based).
Interquartile Range (IQR)
Q3 minus Q1; a robust measure of dispersion not affected by extreme values.
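A sketch of the IQR's robustness, using `statistics.quantiles` on made-up data containing one extreme outlier. (Quartile conventions vary; `statistics.quantiles` defaults to the exclusive method, so exact values can differ slightly from other textbooks or libraries.)

```python
import statistics

# IQR is unchanged by the outlier 200, unlike the range.
data = [1, 3, 5, 7, 9, 11, 13, 15, 200]
q1, q2, q3 = statistics.quantiles(data, n=4)  # quartile cut points
iqr = q3 - q1                                 # robust spread measure
full_range = max(data) - min(data)            # dominated by the outlier
```

Replacing 200 with 17 would drastically change `full_range` but leave `iqr` almost untouched, which is the sense in which the IQR is robust.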
Skewness
A measure of asymmetry in a distribution; negative means left tail longer, positive means right tail longer.
Box plot
A graphical display showing the median, Q1, Q3, and whiskers (typically extending to the most extreme values within 1.5 × IQR of the quartiles, with points beyond plotted as outliers).
Variance
Average of the squared deviations from the mean.
Standard deviation
Square root of the variance; measures spread in the same units as the data.
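The two definitions above can be checked with the standard library, which distinguishes population versions (divide by N) from sample versions (divide by N − 1); the data are made up.

```python
import statistics

# Population vs sample dispersion on made-up data with mean 5.
data = [2, 4, 4, 4, 5, 5, 7, 9]
pop_var = statistics.pvariance(data)  # sum of squared deviations / N = 32/8 = 4.0
pop_sd = statistics.pstdev(data)      # sqrt(4.0) = 2.0, in the data's units
samp_var = statistics.variance(data)  # divides by N - 1 instead: 32/7
```

The sample variance's N − 1 denominator (Bessel's correction) makes it an unbiased estimator of the population variance when working from a sample.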
Range
Difference between the maximum and minimum values in a dataset.
Z-score
The number of standard deviations a value is from the mean: z = (X − μ) / σ.
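The formula above in code, applied to a made-up exam score with an assumed population mean and standard deviation.

```python
# z = (X - mu) / sigma: how many standard deviations X lies from the mean.
def z_score(x, mu, sigma):
    return (x - mu) / sigma

z = z_score(85, mu=70, sigma=10)  # 85 is 1.5 standard deviations above the mean
```

A positive z-score means the value lies above the mean, a negative one below; a value equal to the mean has z = 0.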
Covariance
A measure of how two variables vary together; not standardized.
Pearson correlation coefficient (r)
A standardized measure of linear relationship between two variables, ranging from -1 to 1.
Kurtosis
A measure of the tailedness or extremity of a distribution's tails.
Simpson's Paradox
A trend that appears within each subgroup reverses or disappears when the data are aggregated across the groups, typically because of a confounding variable.
Outliers
Values far from the rest of the data that can distort statistics like the mean.
Log transformation
Applying a logarithm to data to reduce skew and stabilize variance.
Histogram
A bar chart showing the frequency distribution of data divided into bins.
Probability Density Function (PDF)
A function describing the probability distribution of a continuous random variable.
Probability Mass Function (PMF)
A function describing the probabilities of the discrete outcomes of a random variable.
Poisson distribution
A discrete distribution for counts of events occurring in a fixed interval of time or space; its parameter λ is both the mean and the variance.
Binomial distribution
Distribution of the number of successes in n independent Bernoulli trials with probability p.
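The binomial PMF follows directly from the definition above: P(X = k) = C(n, k) · p^k · (1 − p)^(n − k). Here it is evaluated for a made-up example of counting heads in fair coin flips.

```python
import math

# Binomial PMF: probability of exactly k successes in n Bernoulli(p) trials.
def binomial_pmf(k, n, p):
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)

prob = binomial_pmf(3, n=10, p=0.5)  # P(exactly 3 heads in 10 fair flips) = 120/1024
total = sum(binomial_pmf(k, n=10, p=0.5) for k in range(11))  # PMF sums to 1
```

Summing the PMF over all k from 0 to n returning 1 is a useful sanity check that the function really is a probability mass function.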
Hypothesis testing
A framework for testing assumptions about a population using sample data.
A/B testing
An experimental design comparing two versions (A and B) to determine which performs better.
Null Hypothesis (H0)
A statement of no effect or no difference to be tested against.
Alternative Hypothesis (HA)
A statement that there is an effect or difference to be detected.
Independent samples t-test
Tests whether the means of two independent groups are different.
Paired t-test
Tests whether the means of paired observations (e.g., before/after) differ.
ANOVA (Analysis of Variance)
A test comparing means across three or more groups to see if at least one differs.
Chi-square test
Tests independence between categorical variables or goodness-of-fit of observed frequencies.