Hypothesis Testing
A statistical method to make informed decisions about a population based on sample data.
null hypothesis (H₀)
A statement of no effect or no difference, assumed true until evidence suggests otherwise.
alternative hypothesis (H₁ or Hₐ)
A statement that contradicts the null hypothesis; it is supported when the evidence against H₀ is strong.
What does statistical significance mean?
The observed result is unlikely under the null hypothesis, often determined by a p-value less than α.
p-value
The probability of observing a test statistic at least as extreme as the one observed, assuming the null hypothesis is true.
significance level (α)
A threshold for rejecting H₀, typically set at 0.05 or 0.01.
Type I Error
Incorrectly rejecting a true null hypothesis (false positive).
Type II Error
Failing to reject a false null hypothesis (false negative).
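A minimal simulation sketch of the Type I error rate, using only the standard library. It assumes a two-tailed Z-test with known σ = 1 and draws every sample from a population where H₀ is true, so each rejection is a false positive. The seed, sample size, and trial count are arbitrary choices for illustration.

```python
import random
from math import sqrt
from statistics import NormalDist, mean

random.seed(0)          # fixed seed so the run is repeatable
ALPHA = 0.05
N, TRIALS = 25, 2000    # hypothetical sample size and number of experiments
z_crit = NormalDist().inv_cdf(1 - ALPHA / 2)  # two-tailed critical value

false_positives = 0
for _ in range(TRIALS):
    sample = [random.gauss(0, 1) for _ in range(N)]  # H0 is true: mu = 0
    z = mean(sample) / (1 / sqrt(N))                 # known sigma = 1
    if abs(z) > z_crit:
        false_positives += 1                         # rejected a true H0

type_i_rate = false_positives / TRIALS
print(f"Observed Type I error rate: {type_i_rate:.3f} (alpha = {ALPHA})")
```

The observed rate should land close to α, which is the sense in which α caps the false positive rate.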
random sampling
Every member of the population has an equal chance of selection.
stratified sampling
Divide the population into subgroups and randomly sample from each subgroup.
cluster sampling
Divide the population into groups (clusters), randomly select some clusters, and include all members of selected clusters.
systematic sampling
Select every kth individual from a list, starting at a random point.
convenience sampling
Select individuals who are easiest to reach; may introduce bias.
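The probability-based sampling methods above can be sketched with the standard `random` module. The population of 100 numbered members, the two strata, and the cluster size of 20 are hypothetical.

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable
population = list(range(1, 101))  # hypothetical population of 100 member IDs

# Simple random sampling: every member has the same chance of selection.
simple = random.sample(population, 10)

# Systematic sampling: every k-th member after a random starting point.
k = 10
start = random.randrange(k)
systematic = population[start::k]

# Stratified sampling: sample separately within each subgroup (stratum).
strata = {"low": population[:50], "high": population[50:]}
stratified = [m for group in strata.values() for m in random.sample(group, 5)]

# Cluster sampling: pick whole clusters at random and keep every member.
clusters = [population[i:i + 20] for i in range(0, 100, 20)]
cluster_sample = [m for c in random.sample(clusters, 2) for m in c]

print(simple, systematic, stratified, cluster_sample, sep="\n")
```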
critical region
The range of test statistic values for which the null hypothesis is rejected.
critical value
The boundary that separates the critical region from the rest of the distribution.
How do you decide whether to reject H₀ using a test statistic?
Compare the test statistic to the critical value, or compare the p-value to α. Reject H₀ if the statistic falls in the critical region or if the p-value is less than α.
one-tailed test
A test that checks for an effect in only one direction (e.g., μ > μ₀ or μ < μ₀).
two-tailed test
A test that checks for an effect in both directions (μ ≠ μ₀).
When do you use a Z-test?
When sample size is large (n > 30) and the population standard deviation is known.
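A minimal sketch of a two-tailed one-sample Z-test and its decision rule, using only the standard library. The figures (sample mean 52, μ₀ = 50, σ = 10, n = 100) are invented for illustration, and the function name is hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def z_test_two_tailed(sample_mean, mu0, sigma, n, alpha=0.05):
    """Return (z, p_value, reject_h0) for a two-tailed one-sample Z-test."""
    z = (sample_mean - mu0) / (sigma / sqrt(n))       # test statistic
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))      # both tails
    return z, p_value, p_value < alpha

z, p_value, reject = z_test_two_tailed(sample_mean=52, mu0=50, sigma=10, n=100)
print(f"z = {z:.2f}, p = {p_value:.4f}, reject H0: {reject}")
```

Here z = 2.0 exceeds the two-tailed critical value of about 1.96, and equivalently the p-value of about 0.0455 is below α = 0.05, so both routes lead to rejecting H₀.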
When do you use a T-test?
When the population standard deviation is unknown and must be estimated from the sample, especially when the sample size is small (n < 30).
paired T-test
A test for comparing means of the same group at two points in time or under two conditions.
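A sketch of the paired T-test statistic computed by hand: take the per-subject differences, then test whether their mean differs from zero. The before/after scores are made up for illustration.

```python
from math import sqrt
from statistics import mean, stdev

# Hypothetical scores for the same six subjects under two conditions.
before = [72, 68, 75, 80, 66, 71]
after = [75, 70, 74, 85, 70, 73]

diffs = [a - b for a, b in zip(after, before)]   # one difference per subject
n = len(diffs)
t_stat = mean(diffs) / (stdev(diffs) / sqrt(n))  # mean difference / standard error
dof = n - 1                                      # degrees of freedom

print(f"t = {t_stat:.3f} with {dof} degrees of freedom")
```

With 5 degrees of freedom the two-tailed critical value at α = 0.05 is about 2.571, so a statistic near 2.95 would lead to rejecting H₀ for this invented data.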
What is a chi-square test used for?
To test whether two categorical variables are associated (test of independence) or whether observed counts match an expected distribution (goodness of fit).
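A minimal sketch of the chi-square statistic for a test of independence on a 2×2 contingency table, without a continuity correction. The observed counts are invented.

```python
# Hypothetical 2x2 table: rows are groups, columns are outcomes.
observed = [[30, 20],
            [20, 30]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
total = sum(row_totals)

chi2 = 0.0
for i, row in enumerate(observed):
    for j, obs in enumerate(row):
        # Expected count if the two variables were independent.
        expected = row_totals[i] * col_totals[j] / total
        chi2 += (obs - expected) ** 2 / expected

dof = (len(observed) - 1) * (len(observed[0]) - 1)
print(f"chi-square = {chi2:.2f}, degrees of freedom = {dof}")
```

For 1 degree of freedom the critical value at α = 0.05 is about 3.841, so the statistic of 4.0 here falls in the critical region.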
What is ANOVA used for?
To compare the means of three or more groups for significant differences.
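A sketch of the one-way ANOVA F-statistic: the ratio of between-group variability to within-group variability. The three small groups are invented so the arithmetic is easy to follow.

```python
from statistics import mean

# Hypothetical measurements for three groups.
groups = [[4, 5, 6], [6, 7, 8], [8, 9, 10]]

grand_mean = mean(x for g in groups for x in g)

# Variation of group means around the grand mean.
ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
# Variation of observations around their own group mean.
ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)

df_between = len(groups) - 1
df_within = sum(len(g) for g in groups) - len(groups)

f_stat = (ss_between / df_between) / (ss_within / df_within)
print(f"F({df_between}, {df_within}) = {f_stat:.2f}")
```

A large F suggests that at least one group mean differs; it does not say which one.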
What are post hoc tests and when are they used?
Used after ANOVA to determine which specific group means differ significantly.
Probability
A measure between 0 and 1 that describes the likelihood of an event occurring.
What is Bayes' Rule used for?
To calculate the probability of a hypothesis based on prior knowledge and new evidence.
Conditional probability
The probability of an event occurring given that another event has already occurred.
Conditional Independence
When the occurrence of one event does not affect the probability of another, given a third event.
Law of Total Probability
A formula that finds the total probability of an event by summing over all the mutually exclusive ways it can happen.
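Bayes' Rule and the Law of Total Probability work together, as this sketch of a diagnostic test shows. The prevalence, sensitivity, and false positive rate are invented numbers.

```python
# Hypothetical figures for a diagnostic test.
prior = 0.01                 # P(disease)
sensitivity = 0.95           # P(positive | disease)
false_positive_rate = 0.05   # P(positive | no disease)

# Law of Total Probability: a positive result can occur with or without disease.
p_positive = sensitivity * prior + false_positive_rate * (1 - prior)

# Bayes' Rule: P(disease | positive) = P(positive | disease) * P(disease) / P(positive)
posterior = sensitivity * prior / p_positive

print(f"P(positive) = {p_positive:.3f}")
print(f"P(disease | positive) = {posterior:.3f}")
```

Even with an accurate test, the posterior stays near 16% here because the condition is rare.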
Expected Value
The long-run average value of a random variable over many repetitions; equivalently, the probability-weighted mean of its possible outcomes.
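As a small worked example, the expected value of a fair six-sided die is the probability-weighted sum of its faces. Exact fractions are used to avoid rounding.

```python
from fractions import Fraction

faces = range(1, 7)
p = Fraction(1, 6)  # each face of a fair die is equally likely

expected = sum(face * p for face in faces)
print(expected, float(expected))
```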
Uniform Distribution
A distribution where all outcomes are equally likely.
normal (Gaussian) distribution
A bell-shaped, symmetric distribution defined by a mean (μ) and standard deviation (σ).
What does the standard deviation (σ) control in a normal distribution?
It controls the spread; a smaller σ means a narrower peak, and a larger σ means a wider spread.
Z-score
A standardized score that tells how many standard deviations a value is from the mean.
Central Limit Theorem (CLT)
The sampling distribution of the sample mean approaches a normal distribution as sample size increases, regardless of the shape of the original population distribution (provided it has finite variance).
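A simulation sketch of the CLT: sample means drawn from a strongly skewed exponential population (mean 1, standard deviation 1) still cluster around the population mean with spread close to σ/√n. The seed, sample size, and number of repetitions are arbitrary.

```python
import random
from math import sqrt
from statistics import mean, stdev

random.seed(42)  # fixed seed so the run is repeatable

def sample_mean(n):
    """Mean of n draws from an exponential population with mean 1."""
    return mean(random.expovariate(1.0) for _ in range(n))

n = 30
sample_means = [sample_mean(n) for _ in range(2000)]

print(f"mean of sample means: {mean(sample_means):.3f} (population mean 1.0)")
print(f"sd of sample means:   {stdev(sample_means):.3f} (theory: {1 / sqrt(n):.3f})")
```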
Three main measures of central tendency
Mean, median, and mode.
When is the geometric mean preferred over the arithmetic mean?
For data involving growth rates, ratios, or percentages.
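A short illustration of why growth rates call for the geometric mean, using invented yearly growth factors of +10% and −10%.

```python
from statistics import geometric_mean, mean

growth_factors = [1.10, 0.90]  # hypothetical: +10% one year, -10% the next

arithmetic = mean(growth_factors)           # suggests no net change
geometric = geometric_mean(growth_factors)  # reflects the actual compound change

final_value = 100 * 1.10 * 0.90             # 100 units after two years
print(arithmetic, geometric, final_value)
```

The arithmetic mean of 1.0 implies the value is unchanged, yet 100 units end at about 99; the geometric mean of about 0.995 per year reproduces that result.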
Weighted Average
An average where each value contributes according to its importance or frequency.
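A minimal sketch of a weighted average, assuming hypothetical course components with weights that sum to 1.

```python
scores = [80, 90, 70]       # hypothetical exam, project, and homework scores
weights = [0.5, 0.3, 0.2]   # relative importance of each component

weighted_avg = sum(s * w for s, w in zip(scores, weights)) / sum(weights)
print(f"weighted average = {weighted_avg:.1f}")
```

Dividing by the sum of the weights keeps the formula correct even when the weights are raw frequencies rather than proportions.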
two common measures of variability
Variance and standard deviation.
How do the mean and median behave in skewed data?
The mean is pulled toward outliers; the median is more robust.
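A quick illustration with invented right-skewed data: one extreme value drags the mean far above the median.

```python
from statistics import mean, median

# Hypothetical incomes (in thousands) with one extreme value.
incomes = [30, 32, 35, 38, 40, 400]

print(f"mean   = {mean(incomes):.1f}")
print(f"median = {median(incomes):.1f}")
```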
What does standard deviation tell us?
How spread out the values are around the mean; higher values indicate greater spread.
Experimental design
The process of planning, conducting, and analyzing experiments to test a hypothesis and ensure reliable, unbiased conclusions.
Main steps of experimental design
Define the problem
Identify variables and population/sample
Formulate a hypothesis
Control for confounding variables
Choose data collection method
Analyze and conclude
Independent variable (IV)
The variable that is manipulated or changed in an experiment to observe its effect.
Dependent variable (DV)
The outcome that is measured in an experiment; it depends on the IV.
Population vs. sample
Population: Entire group of interest
Sample: Subset of the population used for study
Hypothesis
A testable explanation predicting the relationship between variables (e.g., "If X, then Y").
Optimization criterion
A goal or objective (like maximizing CTR or accuracy) used to evaluate outcomes.
Confounding variable
An external factor that may influence the DV and distort results if not controlled.
How can you control confounding variables?
Hold variables constant
Randomization (RCTs)
Replication
Stratified Randomization
Block Design (Matched Pair)
Randomization
Randomly assigning participants to groups to minimize systematic bias.
Replication
Repeating the experiment to confirm reliability and reduce the effect of anomalies.
Stratified randomization
Grouping subjects by confounders (e.g., prior knowledge), then randomizing within each group.
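Stratified randomization can be sketched as: group subjects by the confounder, shuffle within each group, then split each group between treatment and control. The subject IDs and the "prior knowledge" levels are hypothetical.

```python
import random

random.seed(1)  # fixed seed so the assignment is repeatable

# Hypothetical subjects tagged with a confounder (prior knowledge).
subjects = {
    "s1": "novice", "s2": "novice", "s3": "novice", "s4": "novice",
    "s5": "expert", "s6": "expert", "s7": "expert", "s8": "expert",
}

# Group subjects into strata by the confounder.
strata = {}
for subject_id, level in subjects.items():
    strata.setdefault(level, []).append(subject_id)

# Randomize within each stratum so both arms get the same mix.
assignment = {}
for level, members in strata.items():
    random.shuffle(members)
    half = len(members) // 2
    for subject_id in members[:half]:
        assignment[subject_id] = "treatment"
    for subject_id in members[half:]:
        assignment[subject_id] = "control"

print(assignment)
```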
Block design (matched pair)
Pairing similar individuals and assigning one to control and one to treatment to isolate effects.
Four main methods of data collection
Observational studies
Surveys
Experiments
Simulations
Types of observational studies
Cross-sectional
Retrospective (case-control)
Prospective (cohort)
Placebo effect
Improvement due to belief in treatment, not the treatment itself.
Blinding
A method to reduce bias where participants and/or researchers don’t know group assignments.
Single vs double blinding
Single-blind: Either participants or researchers are blind
Double-blind: Both are blind
When would you use a simulation for data collection?
When real-world testing is too expensive, dangerous, or impractical.
Fundamental rule of data collection
Your data must be representative of the population you want to study.
Data cleaning
The process of removing or correcting inaccurate, incomplete, or irrelevant data from a dataset.
Why is data cleaning important?
It ensures data quality, improves analysis reliability, and prepares data for modeling or decision-making.
What are duplicated records and how are they handled?
Duplicate rows that may be exact or slightly different. Use df.drop_duplicates() for exact matches; others require manual review.
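A sketch of both cases with pandas (assumed to be installed). The DataFrame contents are invented, and normalizing names by stripping whitespace and lowercasing is only one possible way to surface near-duplicates for manual review.

```python
import pandas as pd

# Hypothetical data: one exact duplicate and one near-duplicate ("bob ").
df = pd.DataFrame({
    "name": ["Ann", "Ann", "Bob", "bob "],
    "score": [90, 90, 85, 85],
})

# Exact duplicates: drop rows that repeat an earlier row in every column.
deduped = df.drop_duplicates()

# Near-duplicates: normalize the text, then flag matching rows for review.
normalized = df.assign(name=df["name"].str.strip().str.lower())
candidates = normalized[normalized.duplicated(keep=False)]

print(deduped)
print(candidates)
```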
Evolving labeling schemes
Changes in data categories or labels over time (e.g., "Good" becomes "Very Good"). Handle by remapping or splitting data by time periods.
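Remapping can be sketched as a lookup applied only to records from before the scheme changed. The records, the cutoff year, and the mapping are hypothetical.

```python
# Hypothetical records collected under an old and a new labeling scheme.
records = [
    {"year": 2019, "rating": "Good"},
    {"year": 2019, "rating": "Fair"},
    {"year": 2022, "rating": "Very Good"},
]

OLD_TO_NEW = {"Good": "Very Good"}  # assumed mapping between schemes
CUTOFF_YEAR = 2021                  # assumed year the scheme changed

harmonized = []
for record in records:
    rating = record["rating"]
    if record["year"] < CUTOFF_YEAR:
        # Only old-scheme records are remapped; unmapped labels pass through.
        rating = OLD_TO_NEW.get(rating, rating)
    harmonized.append({**record, "rating": rating})

print(harmonized)
```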
Outlier
A value significantly different from the rest, commonly flagged when it lies more than 2 (or, more conservatively, 3) standard deviations from the mean.
How is a z-score used in outlier detection?
It measures how far a data point is from the mean in standard deviations; extreme z-scores suggest outliers.
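A minimal sketch of z-score outlier detection using the standard library. The data and the function name are invented, and the threshold is a parameter because the cutoff is a judgment call.

```python
from statistics import mean, stdev

def zscore_outliers(values, threshold=2.0):
    """Return the values whose absolute z-score exceeds the threshold."""
    mu = mean(values)
    sigma = stdev(values)  # sample standard deviation
    return [v for v in values if abs((v - mu) / sigma) > threshold]

data = [10, 12, 11, 13, 12, 11, 10, 12, 11, 50]  # hypothetical readings
outliers = zscore_outliers(data)
print(outliers)
```

In small samples an extreme value inflates the standard deviation it is measured against, so its z-score stays modest: the value 50 here scores about 2.8 and would be missed by a cutoff of 3.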
Should outliers always be removed?
No. Remove them only if they distort the analysis and are not meaningful for the problem at hand.
What are the types of missing data?
MCAR (Missing Completely at Random), MAR (Missing at Random), and MNAR (Missing Not at Random).
MCAR
Missingness unrelated to any observed or unobserved data; purely random.
MAR
Missingness related to other observed variables, not the missing values themselves.
MNAR
Missingness is related to the missing values themselves; most difficult to address.
How do you handle MCAR data?
Use listwise deletion, pairwise deletion, or simple imputations like mean, median, or mode.
How do you handle MAR data?
Use regression, KNN imputation, multiple imputation, or ML models trained on observed variables.
How do you handle MNAR data?
Use sensitivity analysis, pattern mixture models, or domain knowledge to impute or analyze separately.
Listwise deletion
Remove entire rows with any missing values.
pairwise deletion
Use all available data for each analysis; only exclude missing values per variable involved.
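The difference between listwise and pairwise deletion shows up in how many rows survive. This sketch uses invented rows of (x, y, z) with `None` marking a missing value.

```python
# Hypothetical rows of (x, y, z); None marks a missing value.
rows = [
    (1.0, 2.0, 3.0),
    (2.0, None, 4.0),
    (3.0, 6.0, None),
    (4.0, 8.0, 5.0),
]

# Listwise deletion: keep only rows with no missing values at all.
listwise = [r for r in rows if None not in r]

# Pairwise deletion for an x-y analysis: only x and y must be present.
xy_pairs = [(x, y) for x, y, _ in rows if x is not None and y is not None]

print(len(listwise), "complete rows;", len(xy_pairs), "rows usable for x vs y")
```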
When is mean imputation appropriate?
For numeric data that is not skewed and has <5% missingness.
When is mode imputation used?
For categorical data or when a clear most frequent value exists.
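A sketch of mean imputation for a numeric column and mode imputation for a categorical one, using only the standard library. Both columns are invented, with `None` marking a missing value.

```python
from statistics import mean, mode

ages = [25, 30, None, 35, 30]          # hypothetical numeric column
colors = ["red", None, "blue", "red"]  # hypothetical categorical column

# Mean imputation: fill gaps with the mean of the observed values.
mean_age = mean(a for a in ages if a is not None)
ages_imputed = [a if a is not None else mean_age for a in ages]

# Mode imputation: fill gaps with the most frequent observed category.
top_color = mode(c for c in colors if c is not None)
colors_imputed = [c if c is not None else top_color for c in colors]

print(ages_imputed)
print(colors_imputed)
```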
hot-deck imputation
Replace missing values using values from similar cases within the dataset.
multiple imputation
Create several plausible versions of the dataset with different imputations, analyze separately, then combine results.
boundary conditions
Data constraints imposed by instruments or systems (e.g., a sensor that can't read below −10°C).
How can you detect incorrect data?
Look for attractors, discontinuities, extreme or impossible values, and data outside valid ranges.
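Two of these checks can be sketched as follows. The valid range is an assumption (a lower limit of −10°C and a hypothetical upper limit of 50°C), and treating "attractors" as values that recur suspiciously often is this sketch's interpretation.

```python
from collections import Counter

# Hypothetical temperature readings in degrees Celsius.
readings = [12.5, 13.0, -10.0, -10.0, -10.0, 14.2, 95.0]
VALID_MIN, VALID_MAX = -10.0, 50.0  # assumed instrument limits

# Impossible values: anything outside the valid range.
out_of_range = [r for r in readings if not VALID_MIN <= r <= VALID_MAX]

# Possible attractor: a value that recurs often, here the lower limit itself,
# which may mean the sensor is clipping rather than measuring.
counts = Counter(readings)
floor_hits = counts[VALID_MIN]

print("out of range:", out_of_range)
print("readings stuck at the lower limit:", floor_hits)
```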
How do instrument errors affect data?
They can introduce incorrect measurements; fix these by comparing the affected readings with known-good data and adjusting or recalibrating accordingly.