Statistics Study Guide for Data Science Interviews

64 Terms

1

Measures of Central Tendency

Mean, Median, Mode

2

Mean

Arithmetic average (sum of values divided by their count); sensitive to outliers

3

Median

Middle value when data is ordered; robust to outliers

4

Mode

Most frequently occurring value

5

When to use mean, median, and mode

Mean for roughly symmetric data without extreme outliers (e.g., normal distributions); median for skewed data or data with outliers; mode for categorical data
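
A minimal sketch of how one outlier moves the mean but not the median, using Python's standard `statistics` module. The salary and color values are made up for illustration.

```python
import statistics

# Hypothetical salaries (in thousands); the last value is an outlier.
salaries = [40, 45, 50, 55, 60, 1000]

mean_salary = statistics.mean(salaries)      # pulled upward by the outlier
median_salary = statistics.median(salaries)  # middle of the ordered values

# Mode suits categorical data, where mean and median are undefined.
colors = ["red", "blue", "blue", "green"]
most_common_color = statistics.mode(colors)

print(mean_salary, median_salary, most_common_color)
```

Here the mean lands above 200 while the median stays at 52.5, which is why the median is the usual summary for skewed data.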

6

Measures of Variability

Variance, Standard Deviation, Range, Interquartile Range (IQR)

7

Variance

Average of squared deviations from the mean (divide by n for a population, by n - 1 for a sample)

8

Standard Deviation

Square root of variance; same units as original data

9

Range

Difference between max and min values

10

Interquartile Range (IQR)

Range of the middle 50% of data (Q3 - Q1); robust to outliers
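
The four measures of variability can be computed with the standard `statistics` module, as in this sketch. The data values are hypothetical, and the quartiles use the module's default "exclusive" method, which is only one of several common conventions.

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical sample

pop_var = statistics.pvariance(data)    # population variance: divide by n
sample_var = statistics.variance(data)  # sample variance: divide by n - 1
pop_sd = statistics.pstdev(data)        # square root of population variance
value_range = max(data) - min(data)

# quantiles(n=4) returns the three quartile cut points (Q1, Q2, Q3).
q1, _, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1

print(pop_var, sample_var, pop_sd, value_range, iqr)
```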

11

Skewness

Measure of asymmetry (positive = longer right tail, negative = longer left tail)

12

Kurtosis

Measure of tail heaviness compared to normal distribution

13

Addition Rule

P(A or B) = P(A) + P(B) - P(A and B)

14

Multiplication Rule

P(A and B) = P(A) * P(B|A)

15

Conditional Probability

P(A|B) = P(A and B)/P(B), defined when P(B) > 0

16

Bayes’ Theorem

Formula: P(A|B) = P(B|A) * P(A)/P(B)

Key concept: Updates prior probability with new evidence

Common interview trap: Confusing P(A|B) with P(B|A)
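
A worked example of the formula and the trap, as a sketch. The prevalence, sensitivity, and false positive rate below are illustrative assumptions, not real test characteristics.

```python
# Hypothetical screening test; all three inputs are illustrative assumptions.
p_disease = 0.01            # prior P(D)
p_pos_given_disease = 0.95  # sensitivity P(+|D)
p_pos_given_healthy = 0.05  # false positive rate P(+|not D)

# The law of total probability gives the denominator P(+).
p_pos = p_pos_given_disease * p_disease + p_pos_given_healthy * (1 - p_disease)

# Bayes' theorem: P(D|+) = P(+|D) * P(D) / P(+)
p_disease_given_pos = p_pos_given_disease * p_disease / p_pos

print(round(p_disease_given_pos, 3))
```

With these inputs P(+|D) is 0.95 but P(D|+) is only about 0.16, because the disease is rare and most positives come from the healthy group.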

17

Independent Events

P(A|B) = P(A), equivalently P(A and B) = P(A) * P(B); knowing B doesn’t change probability of A

18

Mutually Exclusive

P(A and B) = 0; events cannot occur together

19

Independence vs. Mutual Exclusivity Common Mistake

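Assuming mutually exclusive events are independent. Mutually exclusive events with nonzero probabilities are always dependent: if one occurs, the other cannot.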
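
The distinction can be checked by enumerating one fair die roll. This is a sketch; the event names `even`, `odd`, and `low` are chosen here for illustration.

```python
from fractions import Fraction

die = set(range(1, 7))  # outcomes of one fair six-sided die

def prob(event):
    """Probability of an event given as a set of outcomes."""
    return Fraction(len(event & die), len(die))

even = {2, 4, 6}
odd = {1, 3, 5}
low = {1, 2}

# Mutually exclusive: P(even and odd) = 0, yet P(even) * P(odd) = 1/4,
# so the two events are dependent.
exclusive = prob(even & odd) == 0
independent_even_odd = prob(even & odd) == prob(even) * prob(odd)

# Independent but not mutually exclusive: P(even and low) = 1/6 = 1/2 * 1/3.
independent_even_low = prob(even & low) == prob(even) * prob(low)

print(exclusive, independent_even_odd, independent_even_low)
```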

20

Normal Distribution Properties

Bell-shaped, symmetric, defined by mean and standard deviation

21

68-95-99.7 Rule

~68% within 1 SD, ~95% within 2 SD, ~99.7% within 3 SD
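
The three percentages follow from the standard normal CDF. A short check using `statistics.NormalDist` from the standard library:

```python
from statistics import NormalDist

z = NormalDist()  # standard normal: mean 0, SD 1

# Probability of falling within k standard deviations of the mean.
coverage = {k: z.cdf(k) - z.cdf(-k) for k in (1, 2, 3)}

for k, p in coverage.items():
    print(f"within {k} SD: {p:.4f}")
```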

22

Standard Normal

Mean = 0, SD = 1; used for z-scores, z = (x - mean)/SD

23

Central Limit Theorem (CLT)

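Key insight: The sampling distribution of the sample mean approaches a normal distribution as sample size increases (for independent draws from a population with finite variance)

Rule of thumb: n >= 30 is often enough; heavily skewed or heavy-tailed populations may need more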

Why it matters: Enables inference even when population isn’t normal
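
A small simulation sketch of the CLT, drawing from a right-skewed exponential population. The seed, sample size of 30, and 5000 repetitions are arbitrary choices for illustration.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable

def sample_mean(n):
    # Exponential(rate=1) is strongly right-skewed, with mean 1 and SD 1.
    return statistics.fmean(random.expovariate(1.0) for _ in range(n))

# Distribution of the sample mean across many samples of size 30.
means = [sample_mean(30) for _ in range(5000)]

center = statistics.fmean(means)  # should be near the population mean, 1
spread = statistics.stdev(means)  # should be near 1 / sqrt(30)

print(round(center, 3), round(spread, 3))
```

A histogram of `means` would look roughly bell-shaped even though the individual draws are far from normal.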

24

Binomial Distribution

Number of successes in a fixed number of independent trials, each with the same success probability

25

Poisson Distribution

Count of events in a fixed interval of time or space, when events occur independently at a constant average rate

26

Exponential Distribution

Time between events in Poisson process

27

t-distribution

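Used for inference about a mean when population SD is unknown and estimated from the sample; heavier tails than the normal, which matters most when sample size is small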

28

Hypothesis Testing Core Framework

  1. Null Hypothesis (H0): Status quo assumption

  2. Alternative Hypothesis (H1): The claim we are seeking evidence for

  3. Test Statistic: Standardized measure of evidence against H0

  4. p-value: Probability of observing data at least as extreme as what was observed, assuming H0 is true

  5. Decision: Reject H0 if p-value < alpha (significance level)
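
The five steps above can be sketched as a two-sided one-sample z-test. The sample values, the null mean `mu0`, and the "known" population SD `sigma` are all hypothetical; with an unknown SD a t-test would be the usual choice instead.

```python
from math import sqrt
from statistics import NormalDist, fmean

# Hypothetical measurements.
sample = [52.1, 49.8, 53.4, 51.2, 50.9, 52.7, 51.8, 50.3, 53.0, 51.5]

mu0 = 50.0    # steps 1-2: H0 says mean = 50, H1 says mean != 50
sigma = 2.0   # assumed known population SD (an illustrative simplification)
alpha = 0.05  # significance level

# Step 3: test statistic, the standardized distance of the sample mean from mu0.
z_stat = (fmean(sample) - mu0) / (sigma / sqrt(len(sample)))

# Step 4: two-sided p-value under H0.
p_value = 2 * (1 - NormalDist().cdf(abs(z_stat)))

# Step 5: decision.
reject_h0 = p_value < alpha

print(round(z_stat, 2), round(p_value, 4), reject_h0)
```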

29

Type I Error (alpha)

Rejecting true null hypothesis (false positive)

30

Type II Error (beta)

Failing to reject false null hypothesis (false negative)

31

Power

1 - beta; probability of correctly rejecting false null hypothesis

32

Trade-off

For a fixed sample size and effect size, decreasing alpha increases beta, and vice versa

33

One-sample t-test

Compare sample mean to a known or hypothesized value

34

Two-sample t-test

Compare means of two independent groups

35

Paired t-test

Compare paired measurements on the same subjects (e.g., before/after)

36

Chi-square test

Test independence between categorical variables

37

ANOVA (Analysis of Variance)

Compare means across three or more groups

38

p-hacking

Manipulating the analysis (e.g., trying many metrics, subsets, or models and reporting only the significant ones) to achieve a significant p-value

39

Multiple testing problem

Increased chance of Type I error with multiple tests

40

Bonferroni correction

Adjust alpha by dividing by number of tests (per-test threshold = alpha / m)
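
A short calculation of why the correction is needed, assuming 20 independent tests where every null hypothesis is true. The count `m = 20` is an arbitrary example.

```python
alpha = 0.05
m = 20  # hypothetical number of independent tests, all with true nulls

# Chance of at least one false positive with no correction.
fwer_uncorrected = 1 - (1 - alpha) ** m

# Bonferroni: test each hypothesis at alpha / m instead.
alpha_bonferroni = alpha / m
fwer_corrected = 1 - (1 - alpha_bonferroni) ** m

print(round(fwer_uncorrected, 3), alpha_bonferroni, round(fwer_corrected, 3))
```

Without correction the familywise error rate is about 0.64; with Bonferroni it stays just under 0.05, at the cost of lower power per test.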

41

Confidence Interval Interpretation

CORRECT: “We are 95% confident the interval contains the true parameter,” meaning about 95% of intervals built this way from repeated samples would contain it.

INCORRECT: “There’s a 95% chance the parameter is in this interval.”

KEY POINT: The interval is random, NOT the parameter.

42

Factors Affecting Width (of confidence interval)

  • Sample size: Larger n → narrower interval

  • Confidence level: Higher confidence → wider interval

  • Population variability: Higher variability → wider interval
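
The three factors above can be seen in the margin of error for a mean, z * SD / sqrt(n). This sketch assumes a known SD and a normal approximation; the helper name `margin_of_error` and all inputs are illustrative.

```python
from math import sqrt
from statistics import NormalDist

def margin_of_error(sd, n, confidence):
    """Half-width of a z-based confidence interval for a mean."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return z * sd / sqrt(n)

base = margin_of_error(sd=10, n=100, confidence=0.95)
larger_n = margin_of_error(sd=10, n=400, confidence=0.95)     # 4x n halves the width
higher_conf = margin_of_error(sd=10, n=100, confidence=0.99)  # wider
more_var = margin_of_error(sd=20, n=100, confidence=0.95)     # 2x SD doubles the width

print(round(base, 2), round(larger_n, 2), round(higher_conf, 2), round(more_var, 2))
```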

43

Pearson Correlation

Strength and direction of the linear relationship between two continuous variables (-1 to +1)

KEY LIMITATION: Only captures linear relationships

44

Spearman Correlation

Monotonic relationship; uses ranks
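
A sketch contrasting the two coefficients on a monotonic but nonlinear relationship (y = x cubed). The helpers below are written from scratch for illustration, and `ranks` assumes there are no tied values.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def ranks(values):
    """Rank 1 = smallest value. Assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        result[i] = rank
    return result

def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))

xs = list(range(1, 11))
ys = [x ** 3 for x in xs]  # monotonic but not linear

r_pearson = pearson(xs, ys)
r_spearman = spearman(xs, ys)
print(round(r_pearson, 3), round(r_spearman, 3))
```

Spearman reports a perfect monotonic relationship, while Pearson falls below 1 because the relationship is not a straight line.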

45

Requirements for establishing causation

  • Temporal precedence

  • Covariation

  • No confounding variables (alternative explanations ruled out)

46

Common fallacy of causation

Assuming correlation implies causation

47

Solutions for establishing causation

  • Randomized experiments

  • Instrumental variables

  • Natural experiments

48

Types of Sampling Bias

  • Selection Bias

  • Survivorship Bias

  • Response Bias

  • Confirmation Bias

49

Selection Bias

Non-representative sample selection

50

Survivorship Bias

Only analyzing “survivors” of a process while ignoring the cases that dropped out

51

Response Bias

Systematic differences in who responds (often called nonresponse bias) or in how respondents answer

52

Confirmation Bias

Seeking data that confirms preconceptions

53

Factors of Sample Size Determination

  • Desired confidence level

  • Margin of error

  • Population variability

54

Power analysis for sample size determination

Determining the sample size needed to detect a meaningful effect with a chosen power (e.g., 80%) at a chosen alpha
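
A sketch of a common normal-approximation formula for comparing two means with equal group sizes: n per group = 2 * ((z_alpha/2 + z_power) * SD / effect)^2. The function name `n_per_group` and the default alpha and power are assumptions made here for illustration.

```python
from math import ceil
from statistics import NormalDist

def n_per_group(effect, sd, alpha=0.05, power=0.80):
    """Approximate sample size per group for a two-sided, two-sample z-test."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # critical value for the test
    z_power = z.inv_cdf(power)          # quantile matching the desired power
    return ceil(2 * ((z_alpha + z_power) * sd / effect) ** 2)

n_medium = n_per_group(effect=0.5, sd=1.0)
n_small = n_per_group(effect=0.25, sd=1.0)  # half the effect needs about 4x the sample

print(n_medium, n_small)
```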

55

Common mistake in sample size determination

Using sample size formulas without considering effect size

56

A/B Testing Design Principles

  • Randomization

  • Power calculation

  • Duration

  • Primary metric

57

Randomization

Ensures groups are comparable

58

Power calculation

Determine sample size before starting

59

Duration

Balance statistical power with external validity: run long enough to reach the planned sample size and to cover full weekly cycles

60

Primary metric

Define success metric before starting

61

A/B Testing Common Pitfalls

  • Peeking

  • Novelty effect

  • Simpson’s paradox

62

Peeking

Checking results before the predetermined end and stopping as soon as they look significant; inflates the Type I error rate
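
A simulation sketch of the inflation. It simplifies an A/B test to a one-sample z-test on a metric with no true effect and known SD of 1; the seed, 500 observations, 10 interim looks, and 2000 runs are arbitrary choices.

```python
import random
from math import sqrt
from statistics import NormalDist

random.seed(1)  # fixed seed so the run is repeatable
Z_CRIT = NormalDist().inv_cdf(0.975)  # two-sided critical value at alpha = 0.05

def run_experiment(n_total=500, look_every=50):
    """Simulate one test with no true effect.

    Returns (peek_hit, final_hit): whether any interim look was significant,
    and whether the single planned look at the end was significant.
    """
    total = 0.0
    z = 0.0
    peek_hit = False
    for i in range(1, n_total + 1):
        total += random.gauss(0, 1)  # observation under H0, known SD = 1
        if i % look_every == 0:
            z = total / sqrt(i)      # z-statistic for mean = 0
            if abs(z) > Z_CRIT:
                peek_hit = True
    final_hit = abs(z) > Z_CRIT      # z from the last look, at n_total
    return peek_hit, final_hit

results = [run_experiment() for _ in range(2000)]
peeking_rate = sum(p for p, _ in results) / len(results)
fixed_rate = sum(f for _, f in results) / len(results)

print(round(peeking_rate, 3), round(fixed_rate, 3))
```

The fixed-horizon false positive rate stays near alpha, while stopping at the first significant look produces false positives several times as often.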

63

Novelty effect

Short-lived behavior change caused by the newness of a change rather than by its lasting value

64

Simpson’s paradox

A trend seen in aggregated data reverses or disappears when the data is segmented into subgroups
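
A sketch with made-up conversion counts in which variant A wins inside every segment but loses overall, because the variants received very different mixes of desktop and mobile traffic.

```python
# Hypothetical (conversions, visitors) per segment and variant.
data = {
    "desktop": {"A": (90, 100), "B": (800, 1000)},
    "mobile": {"A": (300, 1000), "B": (20, 100)},
}

def rate(conversions, visitors):
    return conversions / visitors

# Per-segment conversion rates: A is higher in both segments.
segment_rates = {
    segment: {variant: rate(*counts) for variant, counts in variants.items()}
    for segment, variants in data.items()
}

# Aggregated conversion rates: B is higher once segments are pooled.
overall_rates = {
    variant: rate(
        sum(data[s][variant][0] for s in data),
        sum(data[s][variant][1] for s in data),
    )
    for variant in ("A", "B")
}

print(segment_rates)
print(overall_rates)
```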