Volume
The huge quantity of data being collected and stored.
Variety
The data comes in many different types and from many different sources.
Velocity
The incredible speed at which data is generated, often in real-time.
Veracity
How accurate and truthful the data is. Low-quality data leads to bad analysis.
Variability
The data is constantly changing, which can make it hard to manage.
Value
This is the most important part—what useful information can we get from the data to make decisions.
Data
Raw numbers and text collected from measurements.
Information
What you get after you analyze the data. It's the meaning you extract to make decisions.
Descriptive Analytics
"What happened?" This involves looking at past and current data to understand business performance. For example, looking at last quarter's sales report.
Predictive Analytics
"What will happen?" This uses historical data to find patterns and predict the future. For example, forecasting next month's sales based on past trends.
Prescriptive Analytics
"What should I do?” This is the most advanced type. It not only predicts what will happen but also suggests the best actions to take to achieve a goal, like minimizing costs or maximizing profits.
Reliability
When data are accurate and consistent (low variability).
Validity
When data correctly measure what they are supposed to measure. This means the data must be both accurate and aimed at the right concept.
Uncertainty
The imperfect knowledge of what will happen in the future. As the variety and velocity (speed) of data increase, uncertainty also increases.
Risk
The consequences of what happens. While uncertainty is about imperfect knowledge of the future, risk is about what is at stake when an outcome occurs.
Flexible/Complex Models
Models like deep neural networks or random forests can capture complex, highly non-linear patterns in f(X) and minimize prediction error. However, they often act as black boxes because transparency is sacrificed. These are typically used for Prediction.
Less Flexible/Simple Models
Models like linear regression are much easier to interpret and communicate. They may not predict as well when f(X) is highly non-linear, but their simplicity makes the role of each variable clear. These are typically used for Inference.
Mean Squared Error (MSE)
MSE quantifies the average squared difference between the true outcome values and the predicted values. A lower MSE indicates a better fit.
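As a formula: $\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2$, where $y_i$ is the true value and $\hat{y}_i$ the prediction for observation $i$.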
MSE Sensitivity
MSE is especially sensitive to large prediction errors, as the errors are squared, giving them a disproportionate impact on the metric.
MSE Interpretation
The square root of the MSE (the root mean squared error, RMSE) indicates the approximate average deviation of predictions from actual values.
Training Data
Contains the data the model will use to build its prediction function. Measuring performance on this set gives the Training Error.
Test Data
Contains the unseen data. Measuring performance (e.g., MSE) on this set gives the Test Error, which is the unbiased evaluation of how the model performs in the real world.
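A minimal sketch of the train/test workflow, assuming scikit-learn and a synthetic dataset (both illustrative choices, not from the card set):

```python
# Minimal sketch: training error vs. test error (assumes scikit-learn is installed)
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))          # one predictor
y = 3 + 2 * X[:, 0] + rng.normal(0, 2, 200)    # linear signal plus irreducible noise

# Hold out unseen data for the unbiased evaluation
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

model = LinearRegression().fit(X_train, y_train)
print("Training MSE:", mean_squared_error(y_train, model.predict(X_train)))
print("Test MSE:    ", mean_squared_error(y_test, model.predict(X_test)))
```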
Bias
Measures how far off, on average, the predictions are from the true value.
High Bias means the model is too simple and misses key patterns, resulting in underfitting.
Bias is typically high in simple models (e.g., Linear Regression) and low in complex models.
Variance
Measures how much predictions change if the model were trained on different datasets.
High Variance means the model is too complex and overly sensitive to noise, resulting in overfitting.
Variance is typically high in complex models (e.g., deep decision trees) and low in simple models.
Bias-Variance Trade-Off
Increasing model flexibility (complexity) tends to reduce bias but concurrently increase variance. The best models achieve a balance—low total error—by being neither too simple nor too complex. The goal is not to eliminate all error, but to reduce bias and variance to get as close as possible to the floor set by the irreducible error (ε).
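This balance can be written as a decomposition of the expected test error at a point $x_0$: $E\big[(y_0 - \hat{f}(x_0))^2\big] = \mathrm{Bias}\big(\hat{f}(x_0)\big)^2 + \mathrm{Var}\big(\hat{f}(x_0)\big) + \mathrm{Var}(\varepsilon)$, where $\mathrm{Var}(\varepsilon)$ is the irreducible error.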
Underfitting
Model is too simple, misses the pattern, and performs poorly on all data.
Overfitting
Model is too flexible, memorizes noise, performs well on training data, but poorly on new data (poor generalization).
Unsupervised learning
What you use when you only have input data but no specific output (Y) to predict. Involves unlabeled data that the algorithm tries to make sense of by extracting features and patterns on its own. Clustering is a classic example of an unsupervised learning problem.
Cluster
A group of similar observations.
Centroid
The “center” of a cluster (average of points in the cluster).
Distance
Usually Euclidean distance (straight-line distance between points).
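For two points $a$ and $b$ with $p$ coordinates each: $d(a, b) = \sqrt{\sum_{j=1}^{p}(a_j - b_j)^2}$.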
PCA (Principal Component Analysis)
A technique that reduces data with many variables into fewer dimensions, making it easier to visualize while keeping as much variation as possible.
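A minimal sketch tying the clustering and PCA terms above to code, assuming scikit-learn and synthetic data (both are illustrative choices): k-means groups observations into clusters with centroids, and PCA compresses the variables for visualization.

```python
# Minimal sketch: clustering + PCA (assumes scikit-learn; data are synthetic)
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(150, 5))            # 150 unlabeled observations, 5 variables

# Unsupervised step 1: group similar observations into 3 clusters
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print("Cluster labels:", km.labels_[:10])
print("Centroids shape:", km.cluster_centers_.shape)   # (3, 5): one centroid per cluster

# Unsupervised step 2: compress 5 variables to 2 principal components
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)
print("Variation kept:", pca.explained_variance_ratio_.sum())
```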
Supervised learning
The algorithm learns on a labeled dataset.
Input Variables (X)
Variables used to make predictions. These are also known as:
Predictors
Independent variables
Features
Covariates
Output Variable (Y)
The variable we are trying to predict or understand. Also referred to as:
Response
Dependent variable
Target variable
Prediction
Use inputs (X) to predict outputs (Y) for new data. Care less about the form of the function (it can stay a black box) and more about the accuracy of the outputs.
Ex. “Given your symptoms (X), I predict you have the flu (Y).”
Inference
Understand how inputs (X) are related to outputs (Y). Function matters, because it tells us which variables matter and how they affect Y.
Ex. “Fever is the strongest factor in diagnosing flu, more than cough or headache.”
Regression
(Predicting a Number): This is when you want to predict a continuous number, like a price or a temperature.
Example: Linear Regression Model
Classification
(Predicting a Category): This is when you want to predict which group or category something belongs to, like "yes/no" or "up/down".
Example: Logistic Regression Model
Simple Linear Regression (SLR)
Simple Linear Regression is a foundational technique in supervised learning used to model the relationship between two numerical variables. SLR aims to represent this relationship using a straight line.
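The model: $Y = \beta_0 + \beta_1 X + \varepsilon$, where $\beta_0$ is the intercept, $\beta_1$ the slope, and $\varepsilon$ the irreducible error.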
Response Variable (Y)
The dependent variable whose value we wish to predict.
Predictor Variable (X)
The independent variable used to predict the response.
Beta Coefficients
$\beta_0$ (intercept) and $\beta_1$ (slope)
Intercept
The predicted value of the response (Y) when the predictor (X) is equal to zero.
Slope
The amount by which the response (Y) is expected to change, on average, for every one-unit increase in the predictor (X).
Irreducible error (ε)
Captures all the variation in the response variable (Y) that is not explained by the predictor (X). We assume that the average value of this error term is zero.
Least Squares Estimation
To select the slope and intercept that minimize the sum of squared errors between the actual observed values and the values predicted by the line.
Residuals
These are the vertical distances between each observed data point and the fitted regression line, representing the error for that observation.
Residual Sum of Squares (RSS)
The sum of the squared residuals; this total squared error measures how far off the predictions are from the actual values. The goal of the least squares method is to minimize the RSS.
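As a formula, with residuals $e_i = y_i - \hat{y}_i$: $\mathrm{RSS} = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n}\big(y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i\big)^2$.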
Standard Errors (SE)
Shows how uncertain or variable the coefficient estimate is. Smaller standard errors mean more precise estimates.
Confidence Intervals
Gives a range of values that, with 95% confidence, contains the true population coefficient; if this range does not include zero, it indicates the predictor (X) has a real, statistically significant effect on the response (Y).
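Approximate 95% interval for the slope: $\hat{\beta}_1 \pm 2 \cdot \mathrm{SE}(\hat{\beta}_1)$.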
Null Hypothesis
The slope is zero (meaning X has no relationship with Y).
T-statistic
Tells you how many standard errors the coefficient is away from zero. Large absolute values (like 24 or 55) → strong evidence the effect is real.
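As a formula, for testing whether the slope is zero: $t = \hat{\beta}_1 / \mathrm{SE}(\hat{\beta}_1)$.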
P-Value
The probability of getting this result if the true effect were actually zero. Small p-values (usually < 0.05) → reject the null hypothesis → the variable has a statistically significant effect.
Residual Standard Error (RSE)
The typical size of the prediction errors, i.e., how far the actual data points fall from the regression line; it summarizes how spread out the residuals are.
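For simple linear regression: $\mathrm{RSE} = \sqrt{\mathrm{RSS}/(n - 2)}$.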
R-Squared
The percentage of variation in Y explained by X.
Range: 0 → 1; 0 means the model explains none of the variation, 1 means the model explains all the variation perfectly.
Ex. 0.6059 means horsepower explains about 61% of the variation in mpg.
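As a formula: $R^2 = 1 - \mathrm{RSS}/\mathrm{TSS}$, where $\mathrm{TSS} = \sum_i (y_i - \bar{y})^2$ is the total variation in $Y$.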
Adjusted R-Squared
How well the model explains the data after adjusting for the number of predictors. If Adjusted R² is close to R², your predictors are actually meaningful. If Adjusted R² is much lower, it means some predictors may not be helping.
F-statistic
Measures how well the regression model explains variation compared to a model with no predictors.
Bigger = better model fit.
SLR Assumptions
For regression results (like confidence intervals and p-values) to be trustworthy, these four assumptions need to hold, at least approximately.
Linearity
The relationship between X and Y should look like a straight line — not curved.
Normality of Errors
The leftover differences between actual and predicted values (called errors or residuals) should follow a normal, bell-shaped pattern.
Independence of Errors
Each data point should be separate — one observation’s error shouldn’t affect another’s.
Constant Variance (Homoskedasticity)
The spread of errors should be about the same everywhere along the line.
If not: it’s called heteroskedasticity, and it can make your model’s results less reliable.
Multiple Linear Regression (MLR)
Using two or more factors (X₁, X₂, …, Xₚ, called "predictors," "inputs," or "explanatory variables") at once to predict an outcome (Y, the "response variable").
Choosing Predictors
You might not want to use every predictor available. Keeping your model simple (parsimonious) makes it easier to understand and can lead to better predictions on new data.
Forward Selection
Start with no predictors. Add them one by one, always picking the one that improves the model the most, until adding more doesn't help significantly.
Backward Elimination
Start with all predictors. Remove the least useful one (usually the one with the highest p-value) and repeat this process until all remaining predictors are significant.
Stepwise Selection
A mix of backward and forward selection. At each step, the model can add a new useful predictor or drop one that has become non-significant.
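A minimal sketch of the forward variant, assuming statsmodels and pandas; df, the response name, and the AIC stopping criterion are illustrative choices, not the course's prescribed method:

```python
# Minimal sketch: forward selection by AIC (assumes statsmodels and pandas;
# df, "y", and the column names are illustrative placeholders)
import numpy as np
import pandas as pd
import statsmodels.api as sm

def forward_select(df: pd.DataFrame, response: str) -> list[str]:
    remaining = [c for c in df.columns if c != response]
    chosen: list[str] = []
    # Start from the intercept-only model
    best_aic = sm.OLS(df[response], np.ones(len(df))).fit().aic
    while remaining:
        # Try adding each remaining predictor; keep the one that helps most
        scores = {}
        for cand in remaining:
            X = sm.add_constant(df[chosen + [cand]])
            scores[cand] = sm.OLS(df[response], X).fit().aic
        best_cand = min(scores, key=scores.get)
        if scores[best_cand] >= best_aic:
            break                      # no candidate improves the model; stop
        best_aic = scores[best_cand]
        chosen.append(best_cand)
        remaining.remove(best_cand)
    return chosen
```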
Multicollinearity
A major issue in multiple regression. Happens when two or more of your predictor variables are highly correlated with each other.
Ex. trying to predict a person's weight using both their height in inches and their height in centimeters.
Variance Inflation Factor (VIF)
A score that measures how much a predictor is correlated with the others. A common rule of thumb is that a VIF score greater than 5 or 10 indicates a problem.
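A minimal sketch of computing VIFs, assuming statsmodels and a pandas DataFrame X of numeric predictors (an illustrative placeholder):

```python
# Minimal sketch: VIF per predictor (assumes statsmodels + pandas; X is a
# DataFrame of numeric predictors and is an illustrative placeholder)
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

def vif_table(X: pd.DataFrame) -> pd.Series:
    exog = sm.add_constant(X)          # include an intercept before computing VIFs
    return pd.Series(
        [variance_inflation_factor(exog.values, i + 1) for i in range(X.shape[1])],
        index=X.columns,
    )

# Predictors with VIF > 5 or 10 are suspect:
# print(vif_table(X))
```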
Qualitative Predictors
Also known as categorical variables; these represent discrete groups or categories rather than continuous numerical quantities.
Nominal
Categories that have no intrinsic ordering or rank.
Ordinal
Categories that have a clear rank ordering to them (e.g., 5-star ratings).
Dichotomous or Binary
Nominal variables with exactly two categories (Yes/No).
Ordinal Variable Issues
Standard dummy coding ignores their natural order (Loss of Information), while assigning arbitrary numbers (like 1–5) wrongly assumes equal spacing between categories (Incorrect Encoding Risk). Both approaches risk losing information or producing misleading results.
Dummy Variables
A special numeric tool used in regression analysis to represent discrete groups or categories (qualitative predictors), e.g., 1 = yes and 0 = no.
Baseline Category
The reference level is the category left out when creating dummy variables; it becomes the baseline the model compares all other coefficients against.
Example: If a variable “Color” has categories Red, Blue, and Green, and Red is the reference level, then the coefficients for Blue and Green show how their effects differ from Red.
Dummy Variable Trap
This is the condition where including all $k$ dummy variables for a categorical predictor with $k$ levels causes perfect multicollinearity, making it impossible to fit the model; therefore, one category (the baseline) must always be omitted.
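A minimal sketch of dummy coding with pandas (an assumption; the Color example mirrors the card above). Setting drop_first=True omits one category as the baseline, leaving $k - 1$ dummies and avoiding the trap:

```python
# Minimal sketch: dummy coding a 3-level categorical predictor (assumes pandas)
import pandas as pd

df = pd.DataFrame({"Color": ["Red", "Blue", "Green", "Red"]})

# drop_first=True omits the baseline category (here "Blue", the first
# alphabetically), leaving k - 1 dummies and avoiding the dummy variable trap
dummies = pd.get_dummies(df["Color"], drop_first=True)
print(dummies)   # columns: Green, Red; Blue is the baseline
```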
Omitted Variable Bias
Happens when you leave out an important variable that affects both your predictor and your outcome, causing the model’s results to be misleading.
Example: If you study how studying time affects grades but forget to include sleep, your results might be wrong because sleep affects both how much someone studies and how well they do.
Dummy Intercept
Represents the estimated average outcome (dependent variable) for the baseline category.
Example: If "January" is the baseline month, the Intercept is the expected units sold in January.
Dummy Coefficient
The difference in the expected outcome between the category associated with that dummy variable and the chosen baseline category.
Baseline Sensitivity
The interpretation of coefficients is entirely relative to the chosen baseline. If the baseline is changed (e.g., from Texas to Kentucky), the Intercept and all other group coefficients will change because they are recalculated based on the new reference point.
Insignificant intercept
Means there isn’t enough evidence to show that the average outcome is different from zero when all predictors are zero.
Additive Assumption
The standard default setting in linear regression, which assumes that the effect of one predictor on the response is independent of the value of the other predictors; for example, the increase in sales from spending on TV is assumed not to depend on the amount spent on radio.
Interaction Effect
Occurs when the effect of one predictor on the response variable depends on the value of another predictor, effectively removing the standard additive assumption from the model.
Types of Interactions
Two dummy variables (e.g., gender × treatment)
One dummy, one numeric (e.g., gender × age)
Two numeric variables (e.g., age × horsepower), as in the sketch below
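A minimal sketch of fitting an interaction with the statsmodels formula API; df, y, x1, and x2 are illustrative placeholders, and the formula "x1 * x2" expands to the main effects plus the interaction term:

```python
# Minimal sketch: interaction term via a formula (assumes statsmodels + pandas;
# df and the column names y, x1, x2 are illustrative placeholders)
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
df = pd.DataFrame({"x1": rng.normal(size=100), "x2": rng.normal(size=100)})
# Response whose x1 effect depends on x2 (a built-in interaction)
df["y"] = 1 + 2 * df.x1 + 3 * df.x2 + 4 * df.x1 * df.x2 + rng.normal(size=100)

# "x1 * x2" expands to main effects plus the interaction: x1 + x2 + x1:x2
model = smf.ols("y ~ x1 * x2", data=df).fit()
print(model.params)
```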
Breusch–Pagan (BP) test
Checks whether the spread of a regression’s errors (residuals) is constant — an assumption called homoskedasticity. If the test’s p-value is small (usually < 0.05), it means the errors’ spread changes with the predictor — a problem called heteroskedasticity.
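A minimal sketch of running the BP test with statsmodels on synthetic, deliberately heteroskedastic data (illustrative, not from the cards):

```python
# Minimal sketch: Breusch-Pagan test (assumes statsmodels; data are synthetic)
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan

rng = np.random.default_rng(0)
x = rng.uniform(1, 10, 200)
y = 2 + 3 * x + rng.normal(0, x)       # error spread grows with x: heteroskedastic

X = sm.add_constant(x)
fit = sm.OLS(y, X).fit()
lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(fit.resid, X)
print("BP p-value:", lm_pvalue)        # small p-value -> evidence of heteroskedasticity
```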