What is machine learning?
Making predictions or decisions by finding patterns in data rather than relying on a closed‑form mathematical formula.
Essence of machine learning
An underlying pattern exists that we cannot describe with a closed form; we use data (preferably lots) to learn it.
ML notation: x, y, X, Y
x = feature vector; y = target value; X = matrix of feature vectors; Y = vector of target values.
Unknown target function f
The ideal (unknown) mapping from X to Y that we aim to approximate with data.
Training samples
Observed historical pairs (x_i, y_i) used to learn a model.
Hypothesis set H
A set of candidate functions {h1, h2, ...} from which the learning algorithm selects a final hypothesis.
Learning algorithm A
The procedure that uses training samples to select a hypothesis g ∈ H.
Final hypothesis g
The model chosen by the learning algorithm to approximate the unknown target function.
Supervised learning
Learning from labeled data; includes classification and regression.
Unsupervised learning
Learning from unlabeled data; includes clustering, association rules, and dimension reduction.
Classification task
Predicting a discrete label or class for an input (e.g., recurrent cancer vs no cancer).
Regression task
Predicting a continuous numerical value or relationship between variables.
Clustering
Grouping similar unlabeled data points into clusters (e.g., cancer subtypes).
Association rules
Finding co‑occurrence or co‑regulation patterns (e.g., coexpressed genes).
Dimension reduction
Reducing the number of features while preserving important structure (e.g., PCA).
Structured data
Data that fits into rows and columns (ideal for many ML algorithms).
Semi‑structured data
Data with partial structure such as JSON, HTML, or XML.
Unstructured data
Data without inherent tabular structure (images, free text, audio); often used in deep learning.
Data wrangling (munging)
Cleaning, correcting, and handling missing or inaccurate data before modeling.
Integration (data)
Combining multiple data sources into a single dataset for analysis.
Reduction (feature)
Consolidating or removing attributes to reduce dimensionality before modeling.
Aggregation (data)
Collapsing data to a common level (e.g., averages or sums) for analysis.
Exploratory Data Analysis (EDA)
Descriptive statistics, visualization, transforms, and tests used to understand data before modeling.
Normalization and standardization
Transformations to put features on comparable scales (e.g., z‑score, min‑max).
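The two scalings above can be sketched in a few lines of plain Python. This is illustrative only; in practice a library scaler (e.g., scikit-learn's StandardScaler/MinMaxScaler) handles edge cases such as zero variance.

```python
# Minimal sketch of z-score and min-max scaling for a single feature.

def z_score(values):
    """Standardize to mean 0 and (population) standard deviation 1."""
    n = len(values)
    mean = sum(values) / n
    std = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
    return [(v - mean) / std for v in values]

def min_max(values):
    """Rescale values linearly into the [0, 1] range."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

feature = [2.0, 4.0, 6.0, 8.0]
scaled = min_max(feature)      # endpoints map to 0.0 and 1.0
standardized = z_score(feature)  # mean ~0, std ~1
```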
Training / Validation / Testing split
Common split: ~70% training, 10% validation, 20% testing; train fits model, validation tunes hyperparameters, test evaluates final performance.
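A ~70/10/20 split can be sketched with the standard library alone (scikit-learn's train_test_split is the usual tool in practice); the fractions here are assumptions matching the card above.

```python
import random

def train_val_test_split(data, val_frac=0.1, test_frac=0.2, seed=0):
    """Shuffle indices, then carve off validation and test portions."""
    idx = list(range(len(data)))
    random.Random(seed).shuffle(idx)  # fixed seed for reproducibility
    n_val = int(len(data) * val_frac)
    n_test = int(len(data) * test_frac)
    val = [data[i] for i in idx[:n_val]]
    test = [data[i] for i in idx[n_val:n_val + n_test]]
    train = [data[i] for i in idx[n_val + n_test:]]
    return train, val, test

train, val, test = train_val_test_split(list(range(100)))
print(len(train), len(val), len(test))  # 70 10 20
```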
Accuracy (definition)
Number of correct predictions divided by total predictions; depends on data quality and balance.
Gold standard data
High‑quality labeled data used as the reference for training and evaluation.
Accuracy paradox
High overall accuracy can be misleading when classes are imbalanced; a model can be accurate yet fail on the minority class.
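The paradox is easy to demonstrate with made-up numbers: a model that always predicts the majority class scores high accuracy while catching zero minority cases.

```python
# Illustrative data: 95% negative, 5% positive.
labels = [0] * 95 + [1] * 5
preds = [0] * 100  # "model" that always predicts the majority class

accuracy = sum(p == y for p, y in zip(preds, labels)) / len(labels)
recall_pos = sum(p == y == 1 for p, y in zip(preds, labels)) / 5

print(accuracy)    # 0.95 -- looks great
print(recall_pos)  # 0.0  -- every positive case is missed
```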
Confusion matrix
Table summarizing true positives, false positives, true negatives, and false negatives (useful for classification evaluation).
Precision and recall
Precision = TP/(TP+FP); Recall = TP/(TP+FN); useful when class imbalance matters.
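The confusion-matrix counts and the two formulas above can be computed directly; the labels below are invented for illustration.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (TP, FP, FN, TN) for a binary classification."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    tn = sum(t != positive and p != positive for t, p in zip(y_true, y_pred))
    return tp, fp, fn, tn

y_true = [1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0]
tp, fp, fn, tn = confusion_counts(y_true, y_pred)
precision = tp / (tp + fp)  # TP / (TP + FP)
recall = tp / (tp + fn)     # TP / (TP + FN)
```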
ROC curve and AUC
ROC plots true positive rate vs false positive rate; AUC measures overall separability of classes.
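AUC has an equivalent rank interpretation: the probability that a randomly chosen positive example scores higher than a randomly chosen negative one. A sketch using that pairwise definition (the scores are invented):

```python
def auc_score(y_true, scores):
    """AUC as P(score of a positive > score of a negative); ties count 0.5."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

y = [0, 0, 1, 1]
s = [0.1, 0.4, 0.35, 0.8]
print(auc_score(y, s))  # 0.75
```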
Decision trees
Classification/regression algorithm that splits data by feature thresholds to form a tree of decisions.
Random forest
Ensemble of decision trees that reduces variance by averaging many trees trained on bootstrapped samples.
Gradient boosting machines
Ensemble method that builds models sequentially to correct errors of prior models (e.g., XGBoost, LightGBM).
k‑nearest neighbors (k‑NN)
Predicts label based on the labels of the k closest training examples in feature space.
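A minimal k-NN classifier is just "sort by distance, vote among the top k"; the toy 2-D points below are assumptions for illustration.

```python
from collections import Counter

def knn_predict(train_X, train_y, x, k=3):
    """Predict the majority label among the k nearest training points."""
    def dist(a, b):  # Euclidean distance
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b)) ** 0.5
    nearest = sorted(zip(train_X, train_y), key=lambda pair: dist(pair[0], x))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

X = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
y = ["a", "a", "a", "b", "b", "b"]
print(knn_predict(X, y, (4.5, 5.0)))  # b
```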
Logistic regression
Linear model for binary classification that outputs probabilities via the logistic function.
Naive Bayes
Probabilistic classifier assuming conditional independence of features given the class.
Support Vector Machine (SVM)
Finds a hyperplane that maximizes margin between classes; can use kernels for nonlinearity.
Neural networks
Models composed of layers of interconnected units (neurons) that learn hierarchical representations.
SVR (Support Vector Regression)
SVM variant for regression tasks.
K‑means clustering
Partitioning algorithm that assigns points to k clusters by minimizing within‑cluster variance.
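The assign-then-recompute loop (Lloyd's algorithm) can be sketched on 1-D points with hand-picked initial centers; real implementations add smarter initialization (e.g., k-means++) and convergence checks.

```python
def kmeans(points, centers, iters=10):
    """Lloyd's algorithm sketch on 1-D points with given initial centers."""
    for _ in range(iters):
        # Assignment step: each point joins its nearest center's cluster.
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        # Update step: each center moves to its cluster's mean.
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers

pts = [1.0, 1.2, 0.8, 8.0, 8.2, 7.8]
final = kmeans(pts, [0.0, 5.0])  # centers converge near 1.0 and 8.0
```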
Hierarchical clustering
Builds nested clusters by merging or splitting clusters based on distance metrics.
Apriori algorithm
Classic algorithm for mining frequent itemsets and association rules.
FP‑tree (frequent pattern tree)
Data structure for efficient mining of frequent patterns and association rules.
Principal Component Analysis (PCA)
Linear dimension reduction technique that projects data onto orthogonal components capturing maximum variance.
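For 2-D data the first principal component can be found by centering the data, forming the 2x2 covariance matrix, and running power iteration to get its dominant eigenvector. A stdlib-only sketch (the sample points are invented):

```python
def leading_component(data, iters=100):
    """First principal component of 2-D points via power iteration."""
    n = len(data)
    mx = sum(x for x, _ in data) / n
    my = sum(y for _, y in data) / n
    centered = [(x - mx, y - my) for x, y in data]
    # Entries of the 2x2 covariance matrix.
    cxx = sum(x * x for x, _ in centered) / n
    cyy = sum(y * y for _, y in centered) / n
    cxy = sum(x * y for x, y in centered) / n
    v = (1.0, 0.0)
    for _ in range(iters):  # power iteration toward the dominant eigenvector
        w = (cxx * v[0] + cxy * v[1], cxy * v[0] + cyy * v[1])
        norm = (w[0] ** 2 + w[1] ** 2) ** 0.5
        v = (w[0] / norm, w[1] / norm)
    return v

# Points spread mainly along the y = x direction, so PC1 ~ (0.71, 0.70).
pc1 = leading_component([(1, 1), (2, 2.1), (3, 2.9), (4, 4.0)])
```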
Linear Discriminant Analysis (LDA)
Dimension reduction and classification method that maximizes class separability in projected space.
t‑SNE
Nonlinear dimension reduction technique for visualizing high‑dimensional data in 2D/3D.
Autoencoder
Neural network that learns compressed representations (encoding) and reconstructs inputs (decoding).
Data quality importance
Models can only be as good as the data they are trained on; poor labels or bias degrade performance.
Overfitting
Model fits training data too closely and fails to generalize to new data; regularization and validation help prevent it.
Underfitting
Model is too simple to capture underlying patterns, leading to high bias and poor performance on both training and test data.
Cross‑validation
Method to estimate model performance by splitting data into multiple train/validation folds (e.g., k‑fold CV).
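The k-fold index bookkeeping can be sketched directly (scikit-learn's KFold does this, plus shuffling and stratified variants):

```python
def k_fold_indices(n, k):
    """Yield (train_idx, val_idx) pairs for k-fold cross-validation over n samples."""
    # Distribute any remainder so fold sizes differ by at most one.
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in fold_sizes:
        folds.append(list(range(start, start + size)))
        start += size
    for i, val_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, val_idx

splits = list(k_fold_indices(10, 5))  # 5 folds of 2 validation samples each
```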
Feature engineering
Creating or transforming features to improve model performance (domain knowledge often helps).
Bias and variance tradeoff
Bias: error from wrong assumptions; Variance: error from sensitivity to training data; balance is key.
Regularization
Techniques (L1, L2) that penalize model complexity to reduce overfitting.
Hyperparameter tuning
Selecting model settings that are not learned during training, using validation data or CV (e.g., grid or random search).
Ensemble methods
Combine multiple models (bagging, boosting, stacking) to improve predictive performance.
Feature selection
Selecting a subset of relevant features to reduce dimensionality and improve model interpretability.
Imbalanced data strategies
Use resampling, class weighting, or specialized metrics when classes are imbalanced.
6 standard criteria of life
List: Cellular organization; Homeostasis; Metabolism; Response to stimuli; Reproduction; Adaptation and evolution.
Cellular organization
Organisms are composed of one or more highly organized cells.
Homeostasis
Organisms regulate and maintain internal conditions within narrow limits.
Metabolism
Organisms carry out complex chemical reactions to sustain life.
Response to stimuli
Organisms can detect and respond to changes in their environment.
Reproduction
Organisms produce new individuals of the same species (asexual or sexual).
Adaptation and evolution
Populations change over time through natural selection and genetic variation.
Prokaryotic vs Eukaryotic
Prokaryotes lack a nucleus (nucleoid instead); eukaryotes have a membrane‑bound nucleus.
Nucleoid
Region in prokaryotes containing the genome (not membrane‑bound).
Organelles
Membrane‑bound or specialized cellular compartments performing distinct functions.
Non‑organelle structures
Examples: cell membrane, cell wall, cytoplasm, flagella, pilus, ribosome.
Cell membrane
Phospholipid bilayer with embedded proteins, channels, and cholesterol that separates inside from outside.
Phospholipid bilayer
Two layers of phospholipids with hydrophilic heads and hydrophobic tails forming the membrane.
Integral vs peripheral proteins
Integral proteins span the membrane; peripheral proteins attach to the surface.
Cell wall
Rigid or semi‑rigid structural outer layer providing support and protection; composition varies by organism.
Cytoplasm
Liquid component inside the cell membrane containing organelles and molecules; ~80% water.
Flagellum
Hairlike appendage providing motility by whipping; structure varies across species.
Pilus
Short hairlike appendage used for adhesion, sensing, and cell recognition.
Ribosome
Complex of RNA and proteins that translates mRNA into polypeptides (protein synthesis).
Cilia
Threadlike projections that can move cells or sense the environment; structurally similar to eukaryotic flagella.
Chloroplast
Membrane‑bound organelle in plants and algae that performs photosynthesis and contains chlorophyll.
Cytoskeleton
Network of protein filaments (actin, microtubules, intermediate filaments) providing structure and transport.
Rough Endoplasmic Reticulum (RER)
Membrane network studded with ribosomes; major site of protein synthesis and folding.
Smooth Endoplasmic Reticulum (SER)
Membrane network involved in lipid and small molecule synthesis; fewer ribosomes.
Golgi apparatus
Receives, sorts, modifies, and packages molecules for transport within or outside the cell.
Mitochondria
Membrane‑bound organelle that generates ATP; contains its own genome and is maternally inherited.
Vesicles
Small membrane‑bound compartments (endosomes, lysosomes, peroxisomes, vacuoles) for transport and storage.
Endosome vs lysosome
Endosomes sort internalized material; lysosomes contain degradative enzymes for breakdown.
Peroxisome
Organelle involved in oxidative reactions and detoxification (e.g., hydrogen peroxide metabolism).
Exosome
Small extracellular vesicle involved in intercellular communication and transport of molecules.
Macromolecule
Large polymeric biological molecules such as nucleic acids, proteins, carbohydrates, and lipids.
Monomer vs polymer
Monomer = single subunit (e.g., nucleotide, amino acid); Polymer = chain of monomers (e.g., DNA, protein).
Water (biological role)
Solvent for biological systems, polar molecule that stabilizes charged species and buffers temperature changes.
Carbohydrates
Composed of C, H, O; primary energy source; monomer = monosaccharide (e.g., glucose).
Glucose (formula)
Common sugar used for energy: C6H12O6.
Disaccharide example
Lactose is a disaccharide composed of two monosaccharides.
Polysaccharides
Glycogen (animal energy storage) and cellulose (plant structural component).
Lipids
Nonpolar molecules (fatty acids, phospholipids, sterols) used for energy storage, membranes, and signaling.
Fatty acid
Long hydrocarbon chain with a terminal carboxyl group; building block of many lipids.
Phospholipid
Structural lipid with a hydrophilic head and hydrophobic tails; forms cell membranes.
Sterols (cholesterol)
Rigid lipid molecules that modulate membrane fluidity and serve as hormone precursors.