Big Data
Massive amounts of data generated daily from devices, sensors, and online activity.
Data Mining
The process of finding useful patterns or knowledge from large datasets using algorithms.
Knowledge
Actionable insights or information derived from data that help make decisions.
Learning Algorithm
A method used by computers to learn patterns or relationships from data.
Data Mining Pipeline
Input Data → Data Preprocessing → Data Mining → Post Processing → Information (the final useful knowledge you can act on)
Data Subsetting
Using a portion of the full dataset for analysis.
Supervised Learning
Learning from labeled data to predict outcomes. (Data Mining Tasks)
Unsupervised Learning
Finding hidden patterns in unlabeled data. (Data Mining Tasks)
Data Object
A single item or record in a dataset (a row).
Attribute
A property or characteristic of an object (a column).
Distinctness
Whether values can be told apart (=, ≠). Attribute Properties
Order
Whether values can be ranked (<, >). Attribute Properties
Addition
Whether differences between values are meaningful (+, −). Attribute Properties
Multiplication
Whether ratios between values are meaningful (×, ÷). Attribute Properties
Nominal
Categories with names only; no order or numbers.
Examples: ZIP code, Color, ID
Ordinal
Ordered categories; ranking matters, but the gaps between values are not consistent or measurable.
Examples: Grades, {Good, Better, Best}, Rank
Interval
Differences are meaningful, but no true zero.
Examples: Dates, °C, °F
Ratio
Differences and ratios are meaningful; has a true zero.
Examples: Age, Height, Weight, Money
Distinctness applies to
Nominal, Ordinal, Interval, Ratio
Order applies to
Ordinal, Interval, Ratio
Addition applies to
Interval, Ratio
Multiplication applies to
Ratio
Mode
Most common value; all attribute types
Median
Middle value (robust to outliers); Ordinal, Interval, Ratio
Mean (and weighted mean)
Sum ÷ count; Interval, Ratio only
Range
Max − min; Interval, Ratio
Variance (s²)
Average squared distance from the mean; Interval, Ratio
Standard Deviation (s)
Square root of variance; Interval, Ratio
Median Absolute Deviation
Median of absolute differences from the median; Interval, Ratio
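The summary statistics above can be sketched in plain Python using the standard library's `statistics` module (the dataset here is made up for illustration):

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]

mode = statistics.mode(data)          # most common value -> 4
median = statistics.median(data)      # middle value, robust to outliers -> 4.5
mean = statistics.mean(data)          # sum / count -> 5
rng = max(data) - min(data)           # range: max - min -> 7
var = statistics.pvariance(data)      # average squared distance from mean -> 4
std = statistics.pstdev(data)         # square root of variance -> 2
# median absolute deviation: median of |x - median| -> 0.5
mad = statistics.median(abs(x - median) for x in data)

print(mode, median, mean, rng, var, std, mad)
```

Note that mean, variance, and standard deviation assume interval/ratio data, while mode works for any attribute type, matching the table above.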
Discrete Data
Finite/countable values (integers); examples: ID, counts, zip codes
Continuous Data
Infinite real values (decimals); examples: height, temperature, age
Noise
Random errors in data (e.g. sensor error, distortion)
→ Fix: visualize, remove noisy attributes, avoid overfitting
Outlier
Value far from others; possible error or anomaly
→ Fix: detect, remove if irrelevant, use median-based stats
Nominal
Mode, Entropy, χ²
Ordinal
Median, Mode, Rank tests
Interval
Mean, Std. Dev., Correlation, z-score
Ratio
Mode, median, entropy, std. dev, correlation, z-score, rank tests, Geometric/Harmonic means (Everything!!)
Data Preprocessing
Steps to prepare raw data for analysis: sampling, feature selection, dimensionality reduction, feature creation, discretization, transformation.
Discretization
Converting continuous values into categories (e.g., age → "young," "middle," "old").
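Discretization can be sketched as simple binning; the age cutoffs below are assumed for illustration, not taken from the cards:

```python
def discretize_age(age):
    """Map a continuous age to a category (hypothetical bin edges)."""
    if age < 30:
        return "young"
    elif age < 60:
        return "middle"
    return "old"

ages = [15, 42, 71]
labels = [discretize_age(a) for a in ages]
print(labels)  # ['young', 'middle', 'old']
```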
Data Bias
Systematic errors caused by unrepresentative samples or flawed data sources.
Sampling
Selecting a subset of data to analyze when the full dataset is too large or costly.
Representative Sample
A sample that accurately reflects the population's key properties.
Simple Random Sampling
Every item has equal chance; may miss rare cases.
Stratified Sampling
Sampling from each subgroup to ensure all are represented
Progressive Sampling
Start small, increase sample size until results stabilize.
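Simple random and stratified sampling can be contrasted in a short sketch; the population below (90 items of class "A", 10 of class "B") is invented to show why the rare class can be missed by random sampling but is guaranteed by stratified sampling:

```python
import random

random.seed(0)  # fixed seed so the example is reproducible

population = [("A", i) for i in range(90)] + [("B", i) for i in range(10)]

# Simple random sampling: every item has an equal chance;
# the rare class "B" may be missed entirely in a small sample.
simple = random.sample(population, 10)

# Stratified sampling: sample from each subgroup (stratum)
# so every class is guaranteed to be represented.
def stratified_sample(items, key, per_stratum):
    strata = {}
    for item in items:
        strata.setdefault(key(item), []).append(item)
    sample = []
    for group in strata.values():
        sample.extend(random.sample(group, per_stratum))
    return sample

strat = stratified_sample(population, key=lambda x: x[0], per_stratum=5)
print(sorted({label for label, _ in strat}))  # ['A', 'B'] -- both classes present
```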
Sample Size
Must be large enough to capture patterns in the population.
Survivorship Bias
Only analyzing entities that "survived" over time (e.g., current companies).
Lookahead Bias
Using future or modern knowledge to influence past data analysis.
Feature Selection
Choosing the most useful attributes to improve model performance and reduce dimensionality. Reduces noise, speeds up learning, prevents overfitting, improves accuracy.
Redundant Feature
Duplicates info found in other attributes (e.g., price and sales tax).
Irrelevant Feature
Adds no useful info for prediction (e.g., student ID when predicting GPA).
Curse of Dimensionality
Too many attributes → sparse data → harder to find meaningful patterns.
Dimensionality
Number of features (attributes) in a dataset.
Principal Component Analysis (PCA)
Reduces the number of features while keeping most of the important information. Finds new axes (principal components) that capture most of the variance in the data, compressing many correlated features into a few powerful ones that still describe the data well.
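The variance-capturing idea behind PCA can be shown on a tiny 2-D dataset: compute the covariance matrix of the centered data, then use the closed-form eigenvalues of a symmetric 2×2 matrix. The data points are made up, chosen to be strongly correlated:

```python
import math

# Toy 2-D dataset with strongly correlated features (invented for illustration)
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [1.1, 1.9, 3.2, 3.9, 5.1]

n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n

# Covariance matrix [[a, b], [b, c]] of the centered data
a = sum((x - mx) ** 2 for x in xs) / n
b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n
c = sum((y - my) ** 2 for y in ys) / n

# Closed-form eigenvalues of a symmetric 2x2 matrix; the larger one is the
# variance captured along the first principal component's axis.
mid = (a + c) / 2
half = math.sqrt(((a - c) / 2) ** 2 + b ** 2)
lam1, lam2 = mid + half, mid - half

explained = lam1 / (lam1 + lam2)
print(f"first PC explains {explained:.1%} of the variance")
```

Because the two features move together, one principal component captures nearly all the variance, so the two columns compress to one with little loss.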
Precision
If you care more about avoiding false alarms (like spam detection) → maximize… Of the items you said were positive (TP + FP), how many of them really were (TP).
Recall
If you care more about not missing cases (like oil spills or cancer detection) → maximize… Of the items that really were positive (TP + FN), how many of them did you actually find (TP).
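The two definitions translate directly into formulas; the confusion-matrix counts below are hypothetical, just to exercise them:

```python
def precision(tp, fp):
    # Of the items predicted positive (TP + FP), how many really were (TP)
    return tp / (tp + fp)

def recall(tp, fn):
    # Of the items that really were positive (TP + FN), how many were found (TP)
    return tp / (tp + fn)

# Made-up counts: 8 true positives, 2 false alarms, 4 missed cases
tp, fp, fn = 8, 2, 4
print(precision(tp, fp))  # 0.8
print(recall(tp, fn))     # 0.666...
```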
Underfitting
Model is too simple → high errors on both training and testing.
(Like drawing a straight line through a wavy dataset.)
Overfitting
Model is too complex → perfect on training data but fails on test data.
(Like drawing a squiggly line that passes through every training point.)
Two major reasons for overfitting
Noise and insufficient data
Holdout method
A model evaluation technique where the dataset is split into separate training and testing sets; the model is trained on the training set and its performance is evaluated on the unseen testing set (e.g., 70/30, 60/40, or 50/50, chosen randomly).
Repeated Resampling
Repeat the holdout process several times and average results.
Stratified Sampling
Keep class proportions consistent in train/test splits (important for imbalanced data).
Bootstrap
Sampling with replacement to create multiple training sets
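A holdout split and a bootstrap sample can both be sketched with the standard `random` module (the data and the 70/30 ratio are illustrative):

```python
import random

random.seed(1)  # fixed seed so the example is reproducible
data = list(range(100))

# Holdout: shuffle, then split randomly into 70% train / 30% test
shuffled = data[:]
random.shuffle(shuffled)
cut = int(0.7 * len(shuffled))
train, test = shuffled[:cut], shuffled[cut:]

# Bootstrap: sample WITH replacement to build a training set the same size
# as the original; some items repeat, others are left out entirely
boot = random.choices(data, k=len(data))

print(len(train), len(test), len(set(boot)))
```

Repeating the holdout split with different random shuffles and averaging the results gives the repeated-resampling estimate described above.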
Hyperparameters
Settings you choose before training (not learned from data). Examples:
Decision tree max depth
Number of clusters in K-Means
Learning rate in neural networks
Polynomial degree in regression
Control model complexity → too high = overfitting, too low = underfitting
Eager learners
(like decision trees) build a model first using all training data.
Learning = slow
Predicting new data = fast
Lazy learners
(like KNN) don't build a model until prediction time.
Learning = fast (just store the data)
Predicting = slow (must look through all stored data to find nearest neighbors).
KNN
an instance-based or example-based classifier:
It stores all training examples.
When a new example comes, it compares it to stored cases.
It predicts the class of the new example based on similar past cases.
rote learner
Memorizes data exactly. To classify a new case, it looks for an exact match.
Nearest neighbor
Looks for closest (not identical) examples using a distance metric.
Small k
may overfit (too sensitive to noise).
large k
may underfit (too smooth, ignores local patterns)
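A minimal KNN classifier makes the lazy-learner idea concrete: training just stores the examples, and all the work happens at prediction time. The 2-D points and labels below are invented for illustration:

```python
from collections import Counter
import math

def knn_predict(train, query, k):
    """Predict the majority label among the k training points nearest to query."""
    # "Training" already happened: train is just the stored examples.
    nearest = sorted(train, key=lambda pl: math.dist(pl[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Made-up 2-D training points: cluster "A" near the origin, "B" far away
train = [((1, 1), "A"), ((1, 2), "A"), ((2, 1), "A"),
         ((8, 8), "B"), ((8, 9), "B"), ((9, 8), "B")]

print(knn_predict(train, (2, 2), k=3))  # 'A' -- all 3 nearest neighbors are A
```

With k=1 the prediction tracks the single closest point (sensitive to noise); raising k toward the size of the training set smooths the decision until it ignores local patterns, matching the small-k/large-k cards above.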
Ensemble methods
combine multiple models (classifiers) to make predictions.
The idea: instead of relying on one model, we train several and aggregate their outputs (e.g., by majority vote or averaging).
This usually improves accuracy and reduces overfitting.
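Aggregation by majority vote is the simplest ensemble step; the model outputs below are hypothetical stand-ins for three trained classifiers:

```python
from collections import Counter

def majority_vote(predictions):
    """Combine one predicted label per model into a single ensemble prediction."""
    return Counter(predictions).most_common(1)[0][0]

# Three hypothetical classifiers disagree on the same example
model_outputs = ["spam", "spam", "ham"]
print(majority_vote(model_outputs))  # 'spam' -- two votes beat one
```

For numeric predictions the same idea uses averaging instead of voting.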