50 vocabulary flashcards covering key statistical, programming, and modeling concepts from the data-science interview guide.
Type I Error
Incorrectly rejecting a true null hypothesis; also called a false positive.
Type II Error
Failing to reject a false null hypothesis; also called a false negative.
Null Hypothesis
Default assumption that there is no effect or no difference between groups being compared.
False Positive
A test result that indicates the presence of a condition when it is actually absent.
False Negative
A test result that fails to detect a condition that is actually present.
Hypothesis Testing
Statistical procedure for deciding whether data are consistent with a stated assumption (the null hypothesis).
A/B Test
Controlled experiment comparing two variants (A and B) to determine which performs better.
Linear Regression
Modeling technique that fits a straight-line relationship between one dependent variable and one or more independent variables.
Coefficient (in Regression)
Weight multiplied by an input feature in a regression equation, indicating direction and magnitude of effect.
p-value
Probability of observing data as extreme as the sample, assuming the null hypothesis is true; measures statistical significance.
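A minimal sketch tying the hypothesis-testing cards together (null hypothesis, A/B test, p-value, Type I/II errors). The group data and effect size are made up purely for illustration; the p-value comes from scipy.stats.ttest_ind.

```python
# Minimal sketch: two-sample t-test on synthetic A/B data (values are illustrative only).
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
group_a = rng.normal(loc=10.0, scale=2.0, size=200)   # metric under variant A
group_b = rng.normal(loc=10.4, scale=2.0, size=200)   # metric under variant B

# Null hypothesis: the two variants have equal mean metric values.
t_stat, p_value = stats.ttest_ind(group_a, group_b)

# A small p-value is evidence against the null. Rejecting a true null would be a
# Type I error (false positive); failing to reject a false null would be a Type II
# error (false negative).
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")
```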
r-squared (Coefficient of Determination)
Proportion of variance in the dependent variable explained by the independent variables in a regression model (0 to 1).
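A minimal sketch of the linear-regression cards (coefficients, r-squared) using scikit-learn; X and y are synthetic and the true coefficients are chosen arbitrarily.

```python
# Minimal sketch: fit a linear regression and read off coefficients and r-squared.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 2))                                    # two independent variables
y = 3.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(scale=0.5, size=100)

model = LinearRegression().fit(X, y)
print("coefficients:", model.coef_)    # direction and magnitude of each feature's effect
print("intercept:   ", model.intercept_)
print("r-squared:   ", model.score(X, y))   # proportion of variance explained (0 to 1)
```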
Heteroskedasticity
Situation in which the variance of the errors differs across levels of an independent variable, so the spread of points around the regression line is not constant; this violates linear regression's constant-variance assumption.
Logistic Regression
Statistical model that expresses the log-odds of an event as a linear combination of one or more independent variables; used for binary classification, it estimates the probability that a given input belongs to a particular category.
Sigmoid Function
S-shaped curve mapping real numbers to the (0,1) interval, used to convert linear outputs into probabilities.
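A minimal sketch of how the sigmoid turns a logistic regression's linear output (the log-odds) into a probability. The coefficients and input point below are hypothetical, not fitted values.

```python
# Minimal sketch: sigmoid converts a linear combination of features into P(class = 1 | x).
import numpy as np

def sigmoid(z):
    """S-shaped curve mapping any real number to the (0, 1) interval."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical coefficients for a two-feature logistic regression.
intercept = -0.5
coefs = np.array([1.2, -0.8])
x = np.array([0.9, 0.3])                  # one input point

log_odds = intercept + coefs @ x          # linear combination (the log-odds)
probability = sigmoid(log_odds)           # estimated probability of the positive class
print(round(probability, 3))
```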
Precision
Ratio of true positives to all predicted positives; measures exactness of a classifier.
Recall (Sensitivity)
Ratio of true positives to all actual positives; measures completeness of a classifier.
ROC Curve
Plot of true positive rate (recall) versus false positive rate across different classification thresholds.
Area Under the Curve (AUC)
Single-number summary of an ROC curve; values closer to 1 indicate better discriminatory ability.
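A minimal sketch of the classification-metric cards (precision, recall, ROC, AUC) using scikit-learn; the labels, hard predictions, and scores below are made up for illustration.

```python
# Minimal sketch: compute precision, recall, ROC points, and AUC with sklearn.metrics.
from sklearn.metrics import precision_score, recall_score, roc_curve, roc_auc_score

y_true  = [0, 0, 1, 1, 1, 0, 1, 0]
y_pred  = [0, 1, 1, 1, 0, 0, 1, 0]                    # hard predictions at one threshold
y_score = [0.2, 0.6, 0.8, 0.7, 0.4, 0.1, 0.9, 0.3]    # predicted probabilities

print("precision:", precision_score(y_true, y_pred))   # TP / (TP + FP): exactness
print("recall:   ", recall_score(y_true, y_pred))      # TP / (TP + FN): completeness

fpr, tpr, thresholds = roc_curve(y_true, y_score)      # TPR vs FPR across thresholds
print("AUC:", roc_auc_score(y_true, y_score))          # closer to 1 = better discrimination
```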
L1 Regularization (Lasso)
Penalty adding the absolute values of coefficients to a loss function, often driving some coefficients to zero for feature selection.
L2 Regularization (Ridge)
Penalty adding the squared values of coefficients to a loss function, shrinking coefficients toward zero without eliminating them.
Elastic Net
Regularization technique that combines L1 and L2 penalties with a mixing parameter to balance sparsity and shrinkage.
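A minimal sketch contrasting the three penalties with scikit-learn; the data are synthetic (only two informative features), and the alpha/l1_ratio values are arbitrary.

```python
# Minimal sketch: L1 tends to zero out coefficients, L2 only shrinks them,
# elastic net mixes the two via l1_ratio.
import numpy as np
from sklearn.linear_model import Lasso, Ridge, ElasticNet

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 10))
y = X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=200)

lasso = Lasso(alpha=0.1).fit(X, y)                        # L1 penalty
ridge = Ridge(alpha=1.0).fit(X, y)                        # L2 penalty
enet  = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)     # combined penalty

print("lasso coefficients set to zero:      ", (lasso.coef_ == 0).sum())
print("ridge coefficients set to zero:      ", (ridge.coef_ == 0).sum())
print("elastic net coefficients set to zero:", (enet.coef_ == 0).sum())
```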
Overfitting
Modeling error where a model captures noise in training data, harming its performance on unseen data.
Underfitting
Model too simple to capture underlying patterns, resulting in poor performance on both training and test data.
Class Imbalance
Condition where certain classes occur far more frequently than others in a data set.
Oversampling
Technique that duplicates or synthetically creates minority-class samples to balance class distribution.
Undersampling
Technique that removes samples from the majority class to balance class distribution.
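A minimal sketch of random over- and undersampling for class imbalance using only NumPy; the class sizes are illustrative, and dedicated libraries offer more sophisticated variants (e.g., synthetic sample generation).

```python
# Minimal sketch: balance an imbalanced data set by resampling row indices.
import numpy as np

rng = np.random.default_rng(3)
majority_idx = np.arange(1000)    # indices of majority-class rows
minority_idx = np.arange(50)      # indices of minority-class rows

# Oversampling: draw minority rows with replacement up to the majority size.
oversampled_minority = rng.choice(minority_idx, size=len(majority_idx), replace=True)

# Undersampling: draw majority rows without replacement down to the minority size.
undersampled_majority = rng.choice(majority_idx, size=len(minority_idx), replace=False)

print(len(oversampled_minority), len(undersampled_majority))   # 1000 50
```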
MapReduce
Distributed computing framework that splits a task into parallel ‘map’ operations and combines results in a ‘reduce’ step.
Master Node
Central coordinator in a MapReduce job that assigns tasks and aggregates results from worker nodes.
Worker Node
Individual machine in a distributed system that executes assigned map or reduce tasks.
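A toy, single-machine illustration of the map/reduce pattern (word count). In a real framework such as Hadoop, a master node would hand the map and reduce tasks to worker nodes; here both phases run locally.

```python
# Toy map/reduce word count: map each input split to intermediate counts,
# then reduce the intermediates into one final result.
from collections import Counter
from functools import reduce

documents = ["a b a", "b c", "a c c"]            # stand-in for distributed input splits

# Map step: each "worker" produces counts for its own split.
mapped = [Counter(doc.split()) for doc in documents]

# Reduce step: combine intermediate counts into the final tally.
word_counts = reduce(lambda acc, part: acc + part, mapped, Counter())
print(word_counts)                               # Counter({'a': 3, 'c': 3, 'b': 2})
```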
Mergesort
Divide-and-conquer sorting algorithm that recursively splits a list, sorts sublists, and merges them; O(n log n).
Quicksort
Divide-and-conquer sorting algorithm that partitions a list around a pivot and recursively sorts partitions; average O(n log n).
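Minimal sketches of both sorting cards, written for clarity rather than performance (quicksort here allocates new lists instead of partitioning in place).

```python
def mergesort(items):
    """Recursively split, sort each half, then merge; O(n log n)."""
    if len(items) <= 1:
        return items
    mid = len(items) // 2
    left, right = mergesort(items[:mid]), mergesort(items[mid:])
    merged, i, j = [], 0, 0
    while i < len(left) and j < len(right):      # merge the two sorted halves
        if left[i] <= right[j]:
            merged.append(left[i]); i += 1
        else:
            merged.append(right[j]); j += 1
    return merged + left[i:] + right[j:]

def quicksort(items):
    """Partition around a pivot and recursively sort the partitions; average O(n log n)."""
    if len(items) <= 1:
        return items
    pivot = items[len(items) // 2]
    less    = [x for x in items if x < pivot]
    equal   = [x for x in items if x == pivot]
    greater = [x for x in items if x > pivot]
    return quicksort(less) + equal + quicksort(greater)

print(mergesort([5, 2, 4, 1, 3]), quicksort([5, 2, 4, 1, 3]))
```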
Tuple (Python)
Immutable ordered collection of elements in Python, defined with parentheses.
List (Python)
Mutable ordered collection of elements in Python, defined with square brackets.
Mutability
Property of an object that allows its contents to be changed after creation.
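A quick illustration of the tuple/list/mutability cards.

```python
# Tuples are immutable; lists are mutable.
point = (1, 2)          # tuple: defined with parentheses
values = [1, 2]         # list: defined with square brackets

values.append(3)        # allowed: list contents can change after creation
print(values)           # [1, 2, 3]

try:
    point[0] = 99       # not allowed: tuples cannot be modified in place
except TypeError as err:
    print("tuples are immutable:", err)
```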
Inner Join
SQL operation returning rows present in both joined tables based on matching keys.
Left Join
SQL join returning all rows from the left table and matching rows from the right table; non-matches yield NULLs.
Right Join
SQL join returning all rows from the right table and matching rows from the left table; non-matches yield NULLs.
Union (SQL)
SQL operator that appends rows of two tables with identical column structures, removing duplicates unless UNION ALL is used.
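A minimal sketch of the join and union cards using Python's built-in sqlite3 with throwaway tables. A right join is the mirror image of the left join (swap the table order); it is omitted here because older SQLite versions do not support RIGHT JOIN directly.

```python
# Minimal sketch: inner join, left join, and union on two small in-memory tables.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE left_t  (id INTEGER, name TEXT);
    CREATE TABLE right_t (id INTEGER, score INTEGER);
    INSERT INTO left_t  VALUES (1, 'a'), (2, 'b'), (3, 'c');
    INSERT INTO right_t VALUES (2, 20), (3, 30), (4, 40);
""")

# Inner join: only rows whose keys match in both tables.
print(con.execute(
    "SELECT l.id, name, score FROM left_t l JOIN right_t r ON l.id = r.id").fetchall())

# Left join: every left row; unmatched right columns come back as NULL (None in Python).
print(con.execute(
    "SELECT l.id, name, score FROM left_t l LEFT JOIN right_t r ON l.id = r.id").fetchall())

# Union: stack two results with identical column structure, dropping duplicates.
print(con.execute("SELECT id FROM left_t UNION SELECT id FROM right_t").fetchall())
```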
Clustermap
Heatmap augmented with hierarchical clustering on rows and/or columns to reveal similarity patterns.
Box Plot
Statistical chart showing median, quartiles, and outliers of a distribution for quick comparison.
Violin Plot
Visualization combining a box plot and a kernel density plot to display data distribution shape and summary statistics.
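A minimal sketch of the three visualization cards using seaborn on a small synthetic DataFrame; the groups and values are arbitrary.

```python
# Minimal sketch: box plot, violin plot, and clustermap with seaborn.
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

rng = np.random.default_rng(4)
df = pd.DataFrame({
    "group": np.repeat(["A", "B"], 100),
    "value": np.concatenate([rng.normal(0, 1, 100), rng.normal(1, 2, 100)]),
})

sns.boxplot(data=df, x="group", y="value")       # median, quartiles, outliers
plt.figure()
sns.violinplot(data=df, x="group", y="value")    # box-plot summary plus density shape
sns.clustermap(rng.normal(size=(10, 6)))         # heatmap with hierarchical clustering
plt.show()
```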
Naive Bayes Classifier
Probabilistic model applying Bayes’ theorem with feature independence assumptions for classification tasks.
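A minimal sketch of a naive Bayes classifier using scikit-learn's GaussianNB on synthetic two-class data.

```python
# Minimal sketch: Gaussian naive Bayes assumes features are independent given the class.
import numpy as np
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(5)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(2, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)

clf = GaussianNB().fit(X, y)
print(clf.predict([[1.5, 1.5]]))          # predicted class for one point
print(clf.predict_proba([[1.5, 1.5]]))    # class probabilities via Bayes' theorem
```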
False Positive Rate
Proportion of negative cases incorrectly classified as positive; equal to 1 − specificity.
True Positive Rate
Same as recall; proportion of positive cases correctly classified by a model.
Cross-Validation
Resampling procedure that partitions data into multiple train/test splits to assess model generalization.
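A minimal sketch of k-fold cross-validation with scikit-learn; the data and the choice of five folds are illustrative.

```python
# Minimal sketch: 5-fold cross-validation, each fold trains on 4/5 of the data
# and scores on the held-out 1/5.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(6)
X = rng.normal(size=(200, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

scores = cross_val_score(LogisticRegression(), X, y, cv=5)
print(scores, scores.mean())    # per-fold accuracy and its average
```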
Feature Importance
Metric indicating how much each predictor contributes to a model’s predictive performance.
Generative Model
Model that learns the joint probability of inputs and outputs, enabling it to generate synthetic data.
Regularization
Technique of adding a penalty term to a loss function to discourage overly complex models and improve generalization.
Sample Size
Number of observations in a data set; larger sizes reduce variance and can lower both Type I and Type II errors.