50 vocabulary flashcards covering key statistical, programming, and modeling concepts from the data-science interview guide.
Type I Error
Incorrectly rejecting a true null hypothesis; also called a false positive.
Type II Error
Failing to reject a false null hypothesis; also called a false negative.
Null Hypothesis
Default assumption that there is no effect or no difference between groups being compared.
False Positive
A test result that indicates the presence of a condition when it is actually absent.
False Negative
A test result that fails to detect a condition that is actually present.
Hypothesis Testing
Statistical procedure for deciding whether data are consistent with a stated assumption (the null hypothesis).
A/B Test
Controlled experiment comparing two variants (A and B) to determine which performs better.
Linear Regression
Modeling technique that fits a straight-line relationship between one dependent variable and one or more independent variables.
Coefficient (in Regression)
Weight multiplied by an input feature in a regression equation, indicating direction and magnitude of effect.
p-value
Probability of observing data as extreme as the sample, assuming the null hypothesis is true; measures statistical significance.
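A minimal sketch tying the hypothesis-testing cards together (null hypothesis, A/B test, p-value, Type I/II errors). The group data and effect size are made up purely for illustration; the p-value comes from scipy.stats.ttest_ind.

```python
# Minimal sketch: two-sample t-test on synthetic A/B data (values are illustrative only).
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
group_a = rng.normal(loc=10.0, scale=2.0, size=200)   # metric under variant A
group_b = rng.normal(loc=10.4, scale=2.0, size=200)   # metric under variant B

# Null hypothesis: the two variants have equal mean metric values.
t_stat, p_value = stats.ttest_ind(group_a, group_b)

# A small p-value is evidence against the null. Rejecting a true null would be a
# Type I error (false positive); failing to reject a false null would be a Type II
# error (false negative).
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")
```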
r-squared (Coefficient of Determination)
Proportion of variance in the dependent variable explained by the independent variables in a regression model (0 to 1).
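A minimal sketch of the linear-regression cards (coefficients, r-squared) using scikit-learn; X and y are synthetic and the true coefficients are chosen arbitrarily.

```python
# Minimal sketch: fit a linear regression and read off coefficients and r-squared.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 2))                                    # two independent variables
y = 3.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(scale=0.5, size=100)

model = LinearRegression().fit(X, y)
print("coefficients:", model.coef_)    # direction and magnitude of each feature's effect
print("intercept:   ", model.intercept_)
print("r-squared:   ", model.score(X, y))   # proportion of variance explained (0 to 1)
```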
Heteroskedasticity
Situation in which the variance of the errors differs across levels of an independent variable, so the spread of points around the regression line is not constant; this violates linear regression's constant-variance assumption.
Logistic Regression
Statistical model that expresses the log-odds of an event as a linear combination of one or more independent variables; used for binary classification, it estimates the probability that a given input belongs to a particular category.
Sigmoid Function
S-shaped curve mapping real numbers to the (0,1) interval, used to convert linear outputs into probabilities.
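A minimal sketch of how the sigmoid turns a logistic regression's linear output (the log-odds) into a probability. The coefficients and input point below are hypothetical, not fitted values.

```python
# Minimal sketch: sigmoid converts a linear combination of features into P(class = 1 | x).
import numpy as np

def sigmoid(z):
    """S-shaped curve mapping any real number to the (0, 1) interval."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical coefficients for a two-feature logistic regression.
intercept = -0.5
coefs = np.array([1.2, -0.8])
x = np.array([0.9, 0.3])                  # one input point

log_odds = intercept + coefs @ x          # linear combination (the log-odds)
probability = sigmoid(log_odds)           # estimated probability of the positive class
print(round(probability, 3))
```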
Precision
Ratio of true positives to all predicted positives; measures exactness of a classifier.
Recall (Sensitivity)
Ratio of true positives to all actual positives; measures completeness of a classifier.
ROC Curve
Plot of true positive rate (recall) versus false positive rate across different classification thresholds.
Area Under the Curve (AUC)
Single-number summary of an ROC curve; values closer to 1 indicate better discriminatory ability.
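A minimal sketch of the classification-metric cards (precision, recall, ROC, AUC) using scikit-learn; the labels, hard predictions, and scores below are made up for illustration.

```python
# Minimal sketch: compute precision, recall, ROC points, and AUC with sklearn.metrics.
from sklearn.metrics import precision_score, recall_score, roc_curve, roc_auc_score

y_true  = [0, 0, 1, 1, 1, 0, 1, 0]
y_pred  = [0, 1, 1, 1, 0, 0, 1, 0]                    # hard predictions at one threshold
y_score = [0.2, 0.6, 0.8, 0.7, 0.4, 0.1, 0.9, 0.3]    # predicted probabilities

print("precision:", precision_score(y_true, y_pred))   # TP / (TP + FP): exactness
print("recall:   ", recall_score(y_true, y_pred))      # TP / (TP + FN): completeness

fpr, tpr, thresholds = roc_curve(y_true, y_score)      # TPR vs FPR across thresholds
print("AUC:", roc_auc_score(y_true, y_score))          # closer to 1 = better discrimination
```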
L1 Regularization (Lasso)
Penalty adding the absolute values of coefficients to a loss function, often driving some coefficients to zero for feature selection.
L2 Regularization (Ridge)
Penalty adding the squared values of coefficients to a loss function, shrinking coefficients toward zero without eliminating them.
Elastic Net
Regularization technique that combines L1 and L2 penalties with a mixing parameter to balance sparsity and shrinkage.
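A minimal sketch contrasting the three penalties with scikit-learn; the data are synthetic (only two informative features), and the alpha/l1_ratio values are arbitrary.

```python
# Minimal sketch: L1 tends to zero out coefficients, L2 only shrinks them,
# elastic net mixes the two via l1_ratio.
import numpy as np
from sklearn.linear_model import Lasso, Ridge, ElasticNet

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 10))
y = X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=200)

lasso = Lasso(alpha=0.1).fit(X, y)                        # L1 penalty
ridge = Ridge(alpha=1.0).fit(X, y)                        # L2 penalty
enet  = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)     # combined penalty

print("lasso coefficients set to zero:      ", (lasso.coef_ == 0).sum())
print("ridge coefficients set to zero:      ", (ridge.coef_ == 0).sum())
print("elastic net coefficients set to zero:", (enet.coef_ == 0).sum())
```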
Overfitting
Modeling error where a model captures noise in training data, harming its performance on unseen data.
Underfitting
Model too simple to capture underlying patterns, resulting in poor performance on both training and test data.
Class Imbalance
Condition where certain classes occur far more frequently than others in a data set.
Oversampling
Technique that duplicates or synthetically creates minority-class samples to balance class distribution.
Undersampling
Technique that removes samples from the majority class to balance class distribution.
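A minimal sketch of random over- and undersampling for class imbalance using only NumPy; the class sizes are illustrative, and dedicated libraries offer more sophisticated variants (e.g., synthetic sample generation).

```python
# Minimal sketch: balance an imbalanced data set by resampling row indices.
import numpy as np

rng = np.random.default_rng(3)
majority_idx = np.arange(1000)    # indices of majority-class rows
minority_idx = np.arange(50)      # indices of minority-class rows

# Oversampling: draw minority rows with replacement up to the majority size.
oversampled_minority = rng.choice(minority_idx, size=len(majority_idx), replace=True)

# Undersampling: draw majority rows without replacement down to the minority size.
undersampled_majority = rng.choice(majority_idx, size=len(minority_idx), replace=False)

print(len(oversampled_minority), len(undersampled_majority))   # 1000 50
```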
MapReduce
Distributed computing framework that splits a task into parallel ‘map’ operations and combines results in a ‘reduce’ step.
Master Node
Central coordinator in a MapReduce job that assigns tasks and aggregates results from worker nodes.
Worker Node
Individual machine in a distributed system that executes assigned map or reduce tasks.
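A toy, single-machine illustration of the map/reduce pattern (word count). In a real framework such as Hadoop, a master node would hand the map and reduce tasks to worker nodes; here both phases run locally.

```python
# Toy map/reduce word count: map each input split to intermediate counts,
# then reduce the intermediates into one final result.
from collections import Counter
from functools import reduce

documents = ["a b a", "b c", "a c c"]            # stand-in for distributed input splits

# Map step: each "worker" produces counts for its own split.
mapped = [Counter(doc.split()) for doc in documents]

# Reduce step: combine intermediate counts into the final tally.
word_counts = reduce(lambda acc, part: acc + part, mapped, Counter())
print(word_counts)                               # Counter({'a': 3, 'c': 3, 'b': 2})
```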
Mergesort
Divide-and-conquer sorting algorithm that recursively splits a list, sorts sublists, and merges them; O(n log n).
Quicksort
Divide-and-conquer sorting algorithm that partitions a list around a pivot and recursively sorts partitions; average O(n log n).
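Minimal sketches of both sorting cards, written for clarity rather than performance (quicksort here allocates new lists instead of partitioning in place).

```python
def mergesort(items):
    """Recursively split, sort each half, then merge; O(n log n)."""
    if len(items) <= 1:
        return items
    mid = len(items) // 2
    left, right = mergesort(items[:mid]), mergesort(items[mid:])
    merged, i, j = [], 0, 0
    while i < len(left) and j < len(right):      # merge the two sorted halves
        if left[i] <= right[j]:
            merged.append(left[i]); i += 1
        else:
            merged.append(right[j]); j += 1
    return merged + left[i:] + right[j:]

def quicksort(items):
    """Partition around a pivot and recursively sort the partitions; average O(n log n)."""
    if len(items) <= 1:
        return items
    pivot = items[len(items) // 2]
    less    = [x for x in items if x < pivot]
    equal   = [x for x in items if x == pivot]
    greater = [x for x in items if x > pivot]
    return quicksort(less) + equal + quicksort(greater)

print(mergesort([5, 2, 4, 1, 3]), quicksort([5, 2, 4, 1, 3]))
```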
Tuple (Python)
Immutable ordered collection of elements in Python, defined with parentheses.
List (Python)
Mutable ordered collection of elements in Python, defined with square brackets.
Mutability
Property of an object that allows its contents to be changed after creation.
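A quick illustration of the tuple/list/mutability cards.

```python
# Tuples are immutable; lists are mutable.
point = (1, 2)          # tuple: defined with parentheses
values = [1, 2]         # list: defined with square brackets

values.append(3)        # allowed: list contents can change after creation
print(values)           # [1, 2, 3]

try:
    point[0] = 99       # not allowed: tuples cannot be modified in place
except TypeError as err:
    print("tuples are immutable:", err)
```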
Inner Join
SQL operation returning rows present in both joined tables based on matching keys.
Left Join
SQL join returning all rows from the left table and matching rows from the right table; non-matches yield NULLs.
Right Join
SQL join returning all rows from the right table and matching rows from the left table; non-matches yield NULLs.
Union (SQL)
SQL operator that appends rows of two tables with identical column structures, removing duplicates unless UNION ALL is used.
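A minimal sketch of the join and union cards using Python's built-in sqlite3 with throwaway tables. A right join is the mirror image of the left join (swap the table order); it is omitted here because older SQLite versions do not support RIGHT JOIN directly.

```python
# Minimal sketch: inner join, left join, and union on two small in-memory tables.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE left_t  (id INTEGER, name TEXT);
    CREATE TABLE right_t (id INTEGER, score INTEGER);
    INSERT INTO left_t  VALUES (1, 'a'), (2, 'b'), (3, 'c');
    INSERT INTO right_t VALUES (2, 20), (3, 30), (4, 40);
""")

# Inner join: only rows whose keys match in both tables.
print(con.execute(
    "SELECT l.id, name, score FROM left_t l JOIN right_t r ON l.id = r.id").fetchall())

# Left join: every left row; unmatched right columns come back as NULL (None in Python).
print(con.execute(
    "SELECT l.id, name, score FROM left_t l LEFT JOIN right_t r ON l.id = r.id").fetchall())

# Union: stack two results with identical column structure, dropping duplicates.
print(con.execute("SELECT id FROM left_t UNION SELECT id FROM right_t").fetchall())
```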
Clustermap
Heatmap augmented with hierarchical clustering on rows and/or columns to reveal similarity patterns.
Box Plot
Statistical chart showing median, quartiles, and outliers of a distribution for quick comparison.
Violin Plot
Visualization combining a box plot and a kernel density plot to display data distribution shape and summary statistics.
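A minimal sketch of the three visualization cards using seaborn on a small synthetic DataFrame; the groups and values are arbitrary.

```python
# Minimal sketch: box plot, violin plot, and clustermap with seaborn.
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

rng = np.random.default_rng(4)
df = pd.DataFrame({
    "group": np.repeat(["A", "B"], 100),
    "value": np.concatenate([rng.normal(0, 1, 100), rng.normal(1, 2, 100)]),
})

sns.boxplot(data=df, x="group", y="value")       # median, quartiles, outliers
plt.figure()
sns.violinplot(data=df, x="group", y="value")    # box-plot summary plus density shape
sns.clustermap(rng.normal(size=(10, 6)))         # heatmap with hierarchical clustering
plt.show()
```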
Naive Bayes Classifier
Probabilistic model applying Bayes’ theorem with feature independence assumptions for classification tasks.
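A minimal sketch of a naive Bayes classifier using scikit-learn's GaussianNB on synthetic two-class data.

```python
# Minimal sketch: Gaussian naive Bayes assumes features are independent given the class.
import numpy as np
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(5)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(2, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)

clf = GaussianNB().fit(X, y)
print(clf.predict([[1.5, 1.5]]))          # predicted class for one point
print(clf.predict_proba([[1.5, 1.5]]))    # class probabilities via Bayes' theorem
```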
False Positive Rate
Proportion of negative cases incorrectly classified as positive; equal to 1 − specificity.
True Positive Rate
Same as recall; proportion of positive cases correctly classified by a model.
Cross-Validation
Resampling procedure that partitions data into multiple train/test splits to assess model generalization.
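A minimal sketch of k-fold cross-validation with scikit-learn; the data and the choice of five folds are illustrative.

```python
# Minimal sketch: 5-fold cross-validation, each fold trains on 4/5 of the data
# and scores on the held-out 1/5.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(6)
X = rng.normal(size=(200, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

scores = cross_val_score(LogisticRegression(), X, y, cv=5)
print(scores, scores.mean())    # per-fold accuracy and its average
```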
Feature Importance
Metric indicating how much each predictor contributes to a model’s predictive performance.
Generative Model
Model that learns the joint probability of inputs and outputs, enabling it to generate synthetic data.
Regularization
Technique of adding a penalty term to a loss function to discourage overly complex models and improve generalization.
Sample Size
Number of observations in a data set; larger sizes reduce variance and can lower both Type I and Type II errors.