Statistical Interviewing

Last updated 5:30 PM on 6/8/26

81 Terms

New cards

What does a p-value actually mean?

It's the probability of observing data at least as extreme as what we saw, assuming the null hypothesis is true. It is not the probability that the null is true, and it's not the probability the result happened by chance.
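
One way to make "at least as extreme, assuming the null is true" concrete is a permutation test. The sketch below is illustrative only: the function name, the two small samples, and the choice of a difference in means as the test statistic are all assumptions, not part of the card.

```python
import random

def permutation_p_value(a, b, n_permutations=10_000, seed=0):
    """Two-sided permutation p-value for a difference in means."""
    rng = random.Random(seed)
    observed = abs(sum(a) / len(a) - sum(b) / len(b))
    pooled = list(a) + list(b)
    extreme = 0
    for _ in range(n_permutations):
        # Under the null, group labels are exchangeable, so shuffle them.
        rng.shuffle(pooled)
        perm_a, perm_b = pooled[:len(a)], pooled[len(a):]
        diff = abs(sum(perm_a) / len(perm_a) - sum(perm_b) / len(perm_b))
        if diff >= observed:
            extreme += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (extreme + 1) / (n_permutations + 1)

# Hypothetical measurements for two groups.
control = [10.1, 9.8, 10.4, 10.0, 9.7, 10.2]
treated = [10.9, 10.6, 11.2, 10.8, 10.5, 11.0]
p = permutation_p_value(control, treated)
```

The returned value is the share of label shuffles that produce a gap at least as large as the observed one. It says nothing about the probability that the null itself is true.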

A colleague says "p = 0.04, so we're 96% confident our effect is real." How do you respond?

That framing conflates two different things. The p-value is conditional on the null being true — it doesn't tell us the probability the alternative is correct. To get a probability of the hypothesis itself, we'd need a Bayesian framing with a prior.

When would you choose a one-sided test?

When there's a directional hypothesis with real cost only on one side — for example, testing whether a new drug is better than placebo when worse-than-placebo isn't actionable. In most business contexts I default to two-sided because unexpected directions are themselves informative.

What is statistical power, in plain language?

It's the probability that our test will detect a real effect if one truly exists at a given size. Low power means we'll miss real effects; it's the underappreciated half of the Type I / Type II tradeoff.
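
As a rough sketch, power for a two-sided, two-sample z-test can be computed directly. This assumes equal group sizes and a known common standard deviation; the function and parameter names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def two_sample_power(delta, sigma, n_per_group, alpha=0.05):
    """Power of a two-sided two-sample z-test with equal group sizes."""
    z = NormalDist()
    se = sigma * sqrt(2 / n_per_group)   # standard error of the difference
    z_crit = z.inv_cdf(1 - alpha / 2)
    shift = delta / se                   # true effect in standard-error units
    # Probability the test statistic lands in either rejection region.
    return z.cdf(shift - z_crit) + z.cdf(-shift - z_crit)

power = two_sample_power(delta=0.5, sigma=1.0, n_per_group=63)
```

With a true effect of zero the function returns alpha, which is the Type I error rate rather than power in any useful sense.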

How do you explain Type I vs Type II error to a stakeholder?

Type I is a false alarm — we conclude there's an effect when there isn't. Type II is a missed signal — we fail to detect a real effect. Which one is more costly depends entirely on the decision context, and that's a business conversation, not a statistics one.

What's the difference between statistical and practical significance?

Statistical significance tells us the observed effect would be unlikely if the true effect were zero. Practical significance asks whether the effect is large enough to act on. With enough data, trivially small effects become statistically significant, so I always pair a p-value with an effect size.

When does multiple comparison correction matter, and which method?

Whenever I'm running many tests and the cost of a false positive is real — dashboards, exploratory analyses, multi-arm experiments. Bonferroni is conservative and simple; Benjamini-Hochberg controls false discovery rate, which is usually more appropriate when I expect some true positives.
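
The two corrections can be sketched in a few lines. The p-values below are hypothetical, and the function names are illustrative rather than taken from any library.

```python
def bonferroni(p_values, alpha=0.05):
    """Reject only p-values below alpha divided by the number of tests."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]

def benjamini_hochberg(p_values, alpha=0.05):
    """Step-up procedure controlling the false discovery rate at alpha."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k whose sorted p-value is at most k/m * alpha.
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            cutoff_rank = rank
    rejected = [False] * m
    for idx in order[:cutoff_rank]:
        rejected[idx] = True
    return rejected

p_values = [0.001, 0.012, 0.025, 0.038, 0.20]  # hypothetical results
bonf = bonferroni(p_values)           # [True, False, False, False, False]
bh = benjamini_hochberg(p_values)     # [True, True, True, True, False]
```

On these inputs Bonferroni keeps one finding and Benjamini-Hochberg keeps four, which is the conservatism gap the card describes.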

What is the null hypothesis really doing in a test?

It's a strawman — a specific, falsifiable claim we set up to potentially reject. The framework only lets us reject or fail to reject; we never "accept" the null. Failing to reject means our evidence wasn't strong enough, not that the null is true.

How do you correctly explain a 95% confidence interval?

If we repeated the sampling process many times, about 95% of the intervals we'd construct would contain the true parameter. It is not a 95% probability that this particular interval contains the truth — that's a Bayesian credible interval, which requires different assumptions.
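
The repeated-sampling reading can be checked by simulation. This is a minimal sketch assuming normally distributed data and a normal-approximation interval; the true mean, sigma, and sample size are arbitrary illustrative values.

```python
import random
from math import sqrt
from statistics import NormalDist, mean, stdev

def ci_coverage(true_mean=5.0, sigma=2.0, n=50, n_experiments=2000, seed=1):
    """Fraction of simulated 95% intervals that contain the true mean."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(0.975)
    hits = 0
    for _ in range(n_experiments):
        sample = [rng.gauss(true_mean, sigma) for _ in range(n)]
        center = mean(sample)
        half_width = z_crit * stdev(sample) / sqrt(n)
        if center - half_width <= true_mean <= center + half_width:
            hits += 1
    return hits / n_experiments

coverage = ci_coverage()
```

Coverage should come out near 0.95, typically a touch below because the interval uses a z critical value with an estimated standard deviation. Any single interval either contains the true mean or does not; the 95% describes the procedure.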

A confidence interval for a difference includes zero. What does that tell you?

We can't rule out that the true difference is zero at our confidence level. It doesn't mean the effect is zero, just that our data is consistent with no effect. The width of the interval tells me whether that's because the effect is genuinely small or because we don't have enough data.

When would you prefer a credible interval over a confidence interval?

When I want to make a direct probability statement about the parameter itself, or when I have meaningful prior information to incorporate. Credible intervals are also generally easier for non-statistical stakeholders to interpret correctly.

What does it mean if your confidence interval is much wider than expected?

Either I have less data than I need, the underlying variance is high, or I'm modeling something poorly specified. It's a signal to investigate sample size, measurement quality, or model fit before reporting.

What are the assumptions of OLS regression?

Linearity, independence of errors, homoscedasticity, normality of errors, and no perfect multicollinearity. In practice, the most consequential violations are usually independence and homoscedasticity — normality of residuals matters less than people think with reasonable sample sizes.

How do you check for heteroscedasticity, and what do you do about it?

I plot residuals against fitted values and look for fanning. I'll also run a Breusch-Pagan test. If it's present, I usually reach for heteroscedasticity-robust standard errors first — they're a low-cost fix that doesn't change my coefficient estimates.

What's multicollinearity, and when does it actually matter?

It's when predictors are highly correlated with each other. It inflates standard errors and makes individual coefficients unstable, but it doesn't bias predictions. So it matters when I care about interpreting individual coefficients; less so for pure prediction.

How do you decide whether to include an interaction term?

Substantively, when I have theoretical reason to expect the effect of one variable to depend on another. Empirically, I'll check fit with and without, but I'm wary of fishing for interactions in exploratory analysis without correction.

When would you use logistic regression vs a linear probability model?

Logistic is the default for binary outcomes — bounded predictions, appropriate likelihood. Linear probability is sometimes preferred in econometrics for interpretability and easier inclusion in larger models, accepting that predictions outside [0,1] are possible.

What does a Q-Q plot tell you?

It compares the distribution of my residuals to a theoretical distribution, usually normal. A straight line means residuals approximate that distribution; systematic deviation in the tails tells me about skew or heavy tails.

How do you handle outliers in regression?

First I investigate — are they data errors, legitimate extreme cases, or signal that my model is misspecified? I'll look at Cook's distance and leverage. If they're legitimate, I might use robust regression methods rather than dropping them; dropping data should be justified by the analysis question, not by what makes the model look better.

When would you use a mixed-effects model?

When my data has clustering or repeated measures — observations within groups aren't independent. Treating them as independent inflates Type I error. Mixed-effects models let me account for between-group variance while still estimating within-group effects.

How do you choose between Ridge, Lasso, and Elastic Net?

Lasso for sparse solutions — when I expect many coefficients to be truly zero, or want feature selection. Ridge when predictors are correlated and I want to keep them all but shrink. Elastic Net when I want a mix, particularly with grouped correlated features.

What is regularization actually doing?

It penalizes large coefficients, trading bias for variance. The model fits training data slightly worse but generalizes better. The penalty parameter controls how aggressively we trade off — typically chosen via cross-validation.

When is R-squared misleading?

When comparing models with different numbers of predictors (use adjusted R²), when the relationship is non-linear, when I care about prediction error rather than variance explained, or when I'm comparing models on different datasets. It also says nothing about whether the model is correctly specified.

How do you diagnose a poorly fitting model?

I look at residuals — against fitted values, against predictors, over time if relevant. Patterns mean misspecification. I'll also do out-of-sample testing if prediction is the goal. The diagnostic question isn't "does the model fit" but "where does it fail, and does that failure matter for my use case?"

What's the fundamental problem of causal inference?

We can never observe both potential outcomes for the same unit — we see what happened, not what would have happened under the alternative. Everything in causal inference is about constructing a credible counterfactual.

What makes randomization so powerful?

It ensures that treatment assignment is independent of all confounders — observed and unobserved. That's why an RCT gives unbiased estimates of average treatment effects without needing to model the assignment mechanism.

When can you not run an RCT?

When it's unethical, infeasible, too expensive, or when the treatment can't be assigned at the individual level. In those cases I reach for quasi-experimental methods — difference-in-differences, regression discontinuity, instrumental variables, or synthetic controls.

How does difference-in-differences work?

I compare the change over time in a treated group to the change in a control group. The key assumption is parallel trends — that absent treatment, both groups would have moved in parallel. I check pre-period trends to assess plausibility.
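
In the simplest two-group, two-period case the estimate is plain arithmetic on four group means. The numbers below are hypothetical, and the estimate is only meaningful if parallel trends holds.

```python
def diff_in_diff(treated_pre, treated_post, control_pre, control_post):
    """Change in the treated group minus change in the control group."""
    treated_change = treated_post - treated_pre
    control_change = control_post - control_pre
    return treated_change - control_change

# Hypothetical group means before and after the intervention.
effect = diff_in_diff(
    treated_pre=100.0, treated_post=130.0,
    control_pre=90.0, control_post=105.0,
)  # 30 - 15 = 15.0
```

The control group's change of 15 stands in for what the treated group would have done without treatment, so only the remaining 15 is attributed to the intervention.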

When would you use regression discontinuity?

When treatment is assigned based on a cutoff in some continuous running variable — like a test score threshold or eligibility age. Units just above and just below the cutoff are comparable, so the discontinuity in outcomes identifies the local treatment effect.

What's an instrumental variable?

A variable that affects the outcome only through its effect on treatment. It needs relevance (correlated with treatment), exclusion (no direct effect on outcome), and independence from unobserved confounders of the outcome. Good instruments are rare and the exclusion restriction is usually unverifiable, which is why IV results need to be argued carefully.

What's confounding, and how do you address it?

A confounder affects both the treatment and the outcome, creating a spurious association. Solutions: randomize if possible; otherwise control for it in regression, match on it, or use a design like DiD that differences it out.

What is selection bias?

When the sample we observe isn't representative of the population we want to make claims about, in a way that's correlated with the outcome. Classic example: studying job training effects only on those who completed the program. Solutions involve modeling the selection mechanism or finding a design that breaks the selection.

When is "controlling for" a variable a bad idea?

When the variable is a collider — affected by both treatment and outcome — controlling for it induces bias. Also when it's a mediator and we care about the total effect, not just the direct effect. The choice of controls should follow a causal diagram, not a regression's R-squared.

How do you communicate causal claims to a non-technical audience?

I'm specific about what design supports the claim. With an RCT I can say "caused"; with observational data I lean on "associated with" or "consistent with a causal effect, given these assumptions." Overclaiming undermines trust; underclaiming undersells the work.

What's a synthetic control?

A weighted combination of untreated units constructed to match the pre-treatment trajectory of the treated unit. Useful when there's no single good comparison group — like estimating the effect of a state-level policy.

What's the difference between ATE, ATT, and LATE?

Average Treatment Effect is the effect averaged over the whole population. ATT is the effect on the treated. LATE is the effect on compliers — the people whose treatment status was actually shifted by the instrument. Which one I want depends on the decision being made.

How do you decide on sample size for an A/B test?

From the minimum detectable effect I'd care about, the baseline variance, the desired power (usually 0.8), and the significance threshold (usually 0.05). I'd rather be slightly over-powered than under-powered and left with an inconclusive result.
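
Those four inputs plug into the standard normal-approximation formula for comparing two means. The sketch assumes a two-sided test, equal allocation, and a known common standard deviation; the function name is hypothetical.

```python
from math import ceil
from statistics import NormalDist

def sample_size_per_group(mde, sigma, power=0.8, alpha=0.05):
    """Units needed per arm to detect a mean difference of `mde`."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # two-sided significance threshold
    z_beta = z.inv_cdf(power)           # quantile matching the target power
    return ceil(2 * (z_alpha + z_beta) ** 2 * sigma ** 2 / mde ** 2)

n_half_sd = sample_size_per_group(mde=0.5, sigma=1.0)      # 63
n_quarter_sd = sample_size_per_group(mde=0.25, sigma=1.0)  # 252
```

Halving the minimum detectable effect roughly quadruples the required sample, which is usually the most useful fact to bring to a planning conversation.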

What is peeking, and why is it a problem?

Repeatedly checking results during a test and stopping when significance is reached. It inflates Type I error dramatically because each check is a new opportunity to cross the threshold by chance. Solutions: pre-specified analysis plans, sequential testing methods that adjust for repeated looks, or Bayesian methods that don't have this problem in the same way.
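
A small simulation shows the inflation. This is a sketch under simplifying assumptions: both arms draw from the same standard normal distribution (so the null is true), the variance is treated as known, and the look schedule and sample sizes are arbitrary.

```python
import random
from math import sqrt
from statistics import NormalDist

def false_positive_rates(n_sims=1000, n_per_arm=500, look_every=50,
                         alpha=0.05, seed=2):
    """Return (rate when stopping at any significant look, rate at final look only)."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    peeking_hits = 0
    fixed_hits = 0
    for _ in range(n_sims):
        sum_a = sum_b = 0.0
        crossed_at_any_look = False
        for i in range(1, n_per_arm + 1):
            # Same distribution in both arms: any "significant" result is a false positive.
            sum_a += rng.gauss(0.0, 1.0)
            sum_b += rng.gauss(0.0, 1.0)
            if i % look_every == 0:
                z = (sum_a / i - sum_b / i) / sqrt(2 / i)
                significant = abs(z) > z_crit
                if significant:
                    crossed_at_any_look = True
                if i == n_per_arm and significant:
                    fixed_hits += 1
        if crossed_at_any_look:
            peeking_hits += 1
    return peeking_hits / n_sims, fixed_hits / n_sims

peeking_rate, fixed_rate = false_positive_rates()
```

The single final look should stay near the nominal 5%, while stopping at the first significant result across ten looks lands well above it.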

How would you handle a metric that's heavily skewed in an A/B test?

Standard t-tests rely on CLT and work fine at large sample sizes even with skew, but I'd consider a log transformation, a rank-based test, or bootstrap confidence intervals. For business metrics like revenue, I usually look at both means and medians.
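
A percentile bootstrap for the difference in means is one of the options mentioned. The sketch below uses simulated log-normal "revenue" as a stand-in for a skewed metric; the data, seeds, and function name are all illustrative assumptions.

```python
import random

def bootstrap_diff_ci(a, b, n_boot=2000, level=0.95, seed=3):
    """Percentile bootstrap interval for mean(b) - mean(a)."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_boot):
        # Resample each arm independently, with replacement.
        resample_a = rng.choices(a, k=len(a))
        resample_b = rng.choices(b, k=len(b))
        diffs.append(sum(resample_b) / len(b) - sum(resample_a) / len(a))
    diffs.sort()
    lower_idx = int((1 - level) / 2 * n_boot)
    upper_idx = int((1 + level) / 2 * n_boot) - 1
    return diffs[lower_idx], diffs[upper_idx]

# Simulated skewed metric (hypothetical, not real revenue data).
data_rng = random.Random(42)
control = [data_rng.lognormvariate(3.0, 1.0) for _ in range(300)]
treated = [data_rng.lognormvariate(3.1, 1.0) for _ in range(300)]
low, high = bootstrap_diff_ci(control, treated)
```

The interval makes no normality assumption about the metric itself, though with very heavy tails the percentile method can still be unstable at small sample sizes.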

What is the SUTVA assumption?

Stable Unit Treatment Value Assumption — treatment of one unit doesn't affect outcomes for other units. Often violated when network effects or marketplace dynamics are present. Solutions: cluster randomize, use a different unit of analysis, or model the spillover explicitly.

When would you use a CUPED adjustment?

When I have pre-experiment data on the same units that correlates with the outcome. CUPED uses that data to reduce variance, effectively shrinking the noise and improving power without changing the experimental design.
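
The core of the adjustment is a single regression coefficient. This sketch assumes one pre-experiment covariate per unit and simulated data; in a real experiment the coefficient is usually estimated on data pooled across arms, which is not shown here.

```python
import random

def sample_variance(values):
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / (len(values) - 1)

def cuped_adjust(outcome, pre_metric):
    """Subtract the part of the outcome predicted by the pre-period metric."""
    n = len(outcome)
    mean_x = sum(pre_metric) / n
    mean_y = sum(outcome) / n
    cov_xy = sum((x - mean_x) * (y - mean_y)
                 for x, y in zip(pre_metric, outcome)) / (n - 1)
    theta = cov_xy / sample_variance(pre_metric)
    # Centering the covariate leaves the mean of the outcome unchanged.
    return [y - theta * (x - mean_x) for x, y in zip(pre_metric, outcome)]

# Simulated units whose outcome correlates with a pre-period value (hypothetical).
rng = random.Random(4)
pre = [rng.gauss(100.0, 20.0) for _ in range(500)]
post = [0.8 * x + rng.gauss(0.0, 10.0) for x in pre]
adjusted = cuped_adjust(post, pre)
```

The adjusted metric has the same mean as the raw outcome but much lower variance, and the reduction grows with the squared correlation between the pre-period and outcome values.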

How do you choose a primary metric for an experiment?

It should be sensitive to the change, aligned with the business decision, and measurable in the experimental timeframe. I prefer leading indicators with strong causal links to long-term outcomes over the long-term outcomes themselves when those would take too long to observe.

What is a guardrail metric?

A metric that I'm not trying to move but want to monitor for negative side effects. It catches the case where the primary metric improves at the cost of something I also care about — like increasing engagement while tanking user satisfaction.

How do you handle multiple metrics in an experiment?

I separate them into primary, secondary, and guardrail. Primary drives the decision; secondaries inform but don't override; guardrails check for harm. I'll correct for multiplicity within each tier rather than testing everything as if it were the primary.

What is survivorship bias?

Drawing conclusions from a sample that's been filtered by some process I'm not accounting for — like studying successful companies to find success factors, missing all the companies with the same traits that failed. Solutions involve identifying the survival mechanism and either correcting for it or sampling differently.

What is selection bias in survey data?

When the people who respond differ systematically from those who don't, in ways correlated with the outcome. I think about this as a missing data problem — modeling who responds and weighting accordingly, when possible.

How do you assess whether your sample is representative?

Compare its distribution on known characteristics to the population — demographics, geography, behavior. Differences don't necessarily invalidate the analysis, but they constrain the population I can generalize to.

What's the difference between MCAR, MAR, and MNAR?

MCAR means missingness is unrelated to any variable, observed or unobserved. MAR means it's related to observed variables but not unobserved ones. MNAR means it's related to the unobserved value itself. The first two are tractable with standard methods; MNAR requires modeling the missingness mechanism explicitly.

When is mean imputation a bad idea?

Almost always for analysis — it shrinks variance, attenuates correlations, and gives false precision. For quick exploration it's fine; for inference I prefer multiple imputation or model-based approaches that propagate uncertainty.

How do you handle class imbalance in classification?

Depends on the problem. Often the imbalance itself isn't the issue — the wrong evaluation metric is. I'll use precision-recall curves and class-weighted losses before reaching for resampling. SMOTE and oversampling can introduce their own issues.

What is measurement validity?

Whether the thing I'm measuring actually captures the construct I care about. Construct validity, content validity, criterion validity — they're different facets. Particularly relevant in psychometrics and survey work, but underappreciated in business metrics, where proxy measures often drift from what they're supposed to represent.

What is Simpson's paradox?

When an aggregated trend reverses or disappears within subgroups. It's a sign that a grouping variable left out of the aggregate view is confounding the relationship. The right level of aggregation depends on the question; there isn't a universally correct view.
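
A reversal is easy to reproduce with counts. The (successes, trials) figures below are illustrative, chosen so that the two treatments are spread unevenly across an easy and a hard subgroup; they are not results from this deck.

```python
# Illustrative (successes, trials) counts per treatment and subgroup.
counts = {
    "A": {"small": (81, 87), "large": (192, 263)},
    "B": {"small": (234, 270), "large": (55, 80)},
}

def rate(successes, trials):
    return successes / trials

def pooled_rate(groups):
    """Success rate after collapsing the subgroups together."""
    successes = sum(s for s, _ in groups.values())
    trials = sum(t for _, t in groups.values())
    return successes / trials

a_small, b_small = rate(*counts["A"]["small"]), rate(*counts["B"]["small"])
a_large, b_large = rate(*counts["A"]["large"]), rate(*counts["B"]["large"])
a_all, b_all = pooled_rate(counts["A"]), pooled_rate(counts["B"])
```

Treatment A wins inside each subgroup but loses in the pooled comparison, because most of A's cases fall in the harder "large" subgroup.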

How do you choose between accuracy, precision, recall, and F1?

Accuracy is fine when classes are balanced and errors are symmetric. Precision matters when false positives are costly; recall when false negatives are costly. F1 balances them. The choice should map to the actual cost structure of the decision.
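
The four metrics all come from the same confusion matrix. The counts below are hypothetical and deliberately imbalanced to show how accuracy can look strong while recall is poor.

```python
def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall, and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Hypothetical rare-event problem: 40 positives out of 1000 cases.
metrics = classification_metrics(tp=10, fp=5, fn=30, tn=955)
```

Here accuracy is 0.965 even though the model finds only a quarter of the positives, which is why the metric has to follow the cost structure.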

When is AUC misleading?

When the use case requires a specific operating point: AUC averages over all thresholds, which may not reflect how the model is actually used. ROC AUC can also look optimistic on heavily imbalanced data, because the large negative class keeps the false positive rate low even when precision is poor.

How do you choose k in k-fold cross-validation?

5 or 10 are conventional. Higher k means lower bias but higher variance and more compute. Leave-one-out is high-variance and rarely necessary. For small datasets I lean higher; for large datasets, a single train/validation split is often sufficient.

What is the bias-variance tradeoff?

Bias is error from systematic mismatch between model and truth; variance is sensitivity to the specific training sample. Complex models reduce bias at the cost of variance; simple models do the reverse. Expected squared error is squared bias plus variance plus irreducible noise, and the goal is to minimize that total, not either component individually.
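
Written out for squared-error loss, the decomposition looks like this. It assumes data generated as y = f(x) + ε with zero-mean noise of variance σ², a fitted model f-hat that depends on the training sample, and expectations taken over training samples and noise; the notation is introduced here for illustration.

```latex
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
  = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}\big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\big]}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{irreducible noise}}
```

The noise term does not depend on the model, so model choice only trades off the first two terms.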

How do you tell if a model is overfitting?

Training error keeps dropping while validation error stops dropping or starts rising. Other signs: very different performance across folds, unstable coefficients across resamples, perfect or near-perfect in-sample fit on noisy data.

When would you choose interpretability over accuracy?

When the decision context requires explanation — regulatory contexts, high-stakes individual decisions, contexts where stakeholders need to trust and validate the reasoning. Also when interpretability supports debugging and ongoing iteration.

How do you compare a model to a simple baseline?

Always do. The right baseline depends on the problem: predicting the mean, predicting the majority class, predicting last period's value. If a complex model can't beat a sensible baseline by a meaningful margin, it's not adding value.

What does calibration mean for a probability model?

When the model predicts 0.7, the event should actually happen about 70% of the time. Calibration is different from discrimination — a model can rank order well but have miscalibrated probabilities. I check with reliability diagrams and fix with Platt scaling or isotonic regression.
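
The numbers behind a reliability diagram are just binned averages. This sketch uses equal-width bins and simulated, perfectly calibrated predictions; the function name, bin count, and data are illustrative assumptions.

```python
import random

def reliability_table(probabilities, outcomes, n_bins=5):
    """Per-bin (mean predicted probability, observed event rate, count)."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probabilities, outcomes):
        idx = min(int(p * n_bins), n_bins - 1)  # keep p == 1.0 in the last bin
        bins[idx].append((p, y))
    table = []
    for members in bins:
        if not members:
            continue
        mean_pred = sum(p for p, _ in members) / len(members)
        observed = sum(y for _, y in members) / len(members)
        table.append((mean_pred, observed, len(members)))
    return table

# Simulated model whose stated probabilities are exactly right.
rng = random.Random(5)
probs = [rng.random() for _ in range(20_000)]
labels = [1 if rng.random() < p else 0 for p in probs]
table = reliability_table(probs, labels)
```

For a calibrated model the first two columns track each other closely in every bin; a systematic gap is the signal that recalibration is needed.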

When would you choose a Bayesian approach?

When I have meaningful prior information to incorporate, when I want direct probability statements about parameters, when I'm working with small samples where regularization through priors helps, or when I need to communicate uncertainty in a more intuitive way.

How do you explain a prior to a skeptical stakeholder?

It's an explicit statement of what we believe before seeing the data — which we're always doing implicitly anyway. The data updates the prior into the posterior. Strong priors require strong justification; weak priors let the data dominate.
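
A Beta prior on a conversion rate makes the weak-versus-strong distinction concrete, since the update is just addition. The prior parameters and the 7-out-of-10 data are hypothetical.

```python
def beta_posterior(prior_alpha, prior_beta, successes, failures):
    """Conjugate update of a Beta prior with binomial data."""
    return prior_alpha + successes, prior_beta + failures

def beta_mean(alpha, beta):
    return alpha / (alpha + beta)

# Same data (7 successes, 3 failures) under two different priors.
weak = beta_posterior(1, 1, successes=7, failures=3)      # Beta(8, 4)
strong = beta_posterior(50, 50, successes=7, failures=3)  # Beta(57, 53)

weak_mean = beta_mean(*weak)      # 8 / 12, close to the observed 0.7
strong_mean = beta_mean(*strong)  # 57 / 110, still close to the prior's 0.5
```

The flat prior lets ten observations dominate, while the strong prior behaves like 100 earlier observations and barely moves.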

What's the practical difference between a confidence interval and a credible interval?

A credible interval says "there's a 95% probability the parameter is in this range." A confidence interval says "if I repeated this procedure many times, 95% of the intervals would contain the parameter." The credible interval is what people usually want, but it requires a prior.

When is Bayesian inference impractical?

When priors are hard to justify, when computational cost is prohibitive at scale, when the audience expects frequentist outputs, or when the marginal value over a well-specified frequentist analysis is low. I don't reach for Bayesian by default — only when it earns its complexity.

What is a hierarchical model good for?

When data has natural grouping and I want to partially pool information across groups — borrowing strength from the population while still allowing group-level variation. Particularly powerful for small-sample groups where individual estimates would be too noisy.

A stakeholder wants to know "is this difference significant?" What do you do?

First clarify what decision rides on the answer — that determines whether statistical significance is even the right frame. Then give them the practical significance alongside any statistical test, and be explicit about what we can and can't conclude.

How do you communicate uncertainty to a leadership audience?

I lead with the best estimate and the range, framed in business units, not statistical ones. "We expect lift of X, somewhere between Y and Z." I name the assumptions that would invalidate the estimate, briefly. I avoid statistical jargon unless they ask for it.

A business partner is sure they see a trend in the data. You don't think it's real. How do you handle it?

I take their pattern recognition seriously — they often see things the data doesn't capture. Then I show them what statistical evidence would look like for the pattern, what we'd need to confirm it, and what alternative explanations exist. The goal is shared understanding, not winning the argument.

How do you decide what to measure when the goal is fuzzy?

I work backwards from the decision. What action would different measurements support? If no measurement changes the action, we don't need it. If many measurements would, we prioritize by what's tractable, sensitive to the variable of interest, and defensible against gaming.

How do you choose between rigor and speed?

By the cost of being wrong versus the cost of being late. For reversible decisions, speed wins. For irreversible or high-stakes ones, rigor wins. Most decisions are more reversible than people think, which means the bias should usually be toward shipping the analysis.

How do you push back on a request you think is methodologically flawed?

I name the specific concern, propose an alternative that addresses the underlying business need, and explain the tradeoff clearly. People usually have a real question — the flawed methodology is just their best guess at how to answer it. Reframing usually works.

How do you decide when an analysis is "done"?

When additional work wouldn't change the recommendation, or when the cost of further analysis exceeds the cost of acting on what we have. Perfect analyses don't ship; the goal is sufficient confidence for the decision at hand.

A stakeholder asks for predictions further out than the data supports. What do you do?

I show what the data supports, then show the extrapolation with a much wider uncertainty band and explicit assumptions. I name what would have to be true for the extrapolation to hold. The decision to extrapolate is theirs; my job is to make the cost clear.

How do you translate a model output into a recommendation?

I separate three things: what the model says, what it doesn't say, and what the implications are. The model produces estimates and uncertainties; the implications require business context the model doesn't have. Conflating these undermines trust over time.

"Correlation doesn't imply causation" — when do people misuse this?

As a thought-stopper to dismiss any observational evidence. The honest version is that correlation isn't sufficient for causation, but it's often necessary, and with careful design we can move from association to credible causal claims. The phrase is a starting point for the conversation, not an ending one.

Why is "the data speaks for itself" wrong?

Data is always shaped by collection, measurement, and framing choices. The same data supports different conclusions under different models and questions. What people usually mean is "this conclusion seems obvious to me" — and that intuition is worth examining, not deferring to.

Why isn't more data always better?

Garbage at scale is still garbage. More data sharpens estimates but doesn't fix bias, doesn't address confounding, and doesn't make a poorly designed measurement valid. With enough data, even trivial differences become statistically significant, which can mislead more than help.

When is "let's just look at the data" a problem?

When it skips the design question. Exploration without a plan invites multiple testing, motivated reasoning, and overconfidence in patterns that won't replicate. Exploration is valuable, but it should be labeled as exploration and followed by confirmation on new data.

Why is "we don't have time to think about assumptions" a red flag?

Because the assumptions are still being made — they're just being made implicitly. The time saved up front is paid back with interest when the analysis fails to replicate or leads to a bad decision. Making assumptions explicit is what makes them improvable.

What's wrong with "the model said so"?

It launders judgment as objectivity. Every model embeds choices — features, framing, loss function, training data. Those choices are human and contestable. Saying "the model said so" obscures the chain of reasoning that should be defended on its merits.
