Random variable
A random variable assigns a numerical value to each outcome in a random process. It allows us to mathematically describe uncertainty and model real-world phenomena. Random variables can be discrete (countable outcomes) or continuous (values over an interval). They are the foundational objects in probability theory and statistical modelling.
Discrete random variable
A discrete random variable takes on values from a countable set, such as integers or category labels. Probabilities are described using a probability mass function (PMF). These variables appear when modelling counts, successes/failures, or classification labels. They are easy to work with because probabilities add directly.
Continuous random variable
A continuous random variable takes values from an uncountable interval such as real numbers. Probabilities come from areas under the probability density function (PDF), not single values. Continuous variables are used for measurements like height, temperature, or time. They require calculus for proper handling.
PDF, PMF, CDF
The PMF gives the probability of each exact value of a discrete random variable; probabilities add up to 1. The PDF gives the probability density for a continuous random variable; probabilities come from integrating over intervals. The CDF gives P(X ≤ x) for both discrete and continuous variables and is especially useful for quantiles and tail analysis. Understanding the roles of PMF/PDF/CDF together allows you to read and interpret any probability distribution.
Expectation (mean)
Expectation represents the long-run average value of a random variable over infinite repetitions. It captures the “centre of mass” of a distribution. In ML, expectation underlies risk minimisation, gradient calculations, and probabilistic modelling. Formula: E[X] = ∑ x p(x) for discrete variables, or E[X] = ∫ x f(x) dx for continuous variables.
Variance
Variance measures the degree to which a random variable deviates from its mean. High variance means the data is widely spread; low variance means it is tightly clustered. In modelling, variance represents uncertainty and helps describe the stability of predictions. Formula: Var(X) = E[(X − μ)²].
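As a concrete illustration, the sketch below computes E[X] and Var(X) for a small discrete distribution directly from the formulas above (assuming NumPy; the values and probabilities are made up for the example).

```python
import numpy as np

# Hypothetical discrete distribution: values of X and their probabilities p(x)
x = np.array([0, 1, 2, 3])
p = np.array([0.1, 0.4, 0.3, 0.2])  # must sum to 1

mean = np.sum(x * p)                   # E[X] = sum of x * p(x)
variance = np.sum((x - mean) ** 2 * p)  # Var(X) = E[(X - mu)^2]

print(f"E[X] = {mean:.2f}, Var(X) = {variance:.2f}")  # E[X] = 1.60, Var(X) = 0.84
```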
(Expanded) Variance (model variance)
Model variance measures how much a model’s predictions would change if trained on different subsets of the data. High-variance models are sensitive to noise and tend to overfit. Variance is controlled using regularisation, ensembling, or simpler architectures. It forms one-half of the bias–variance tradeoff.
Covariance
Covariance measures how two variables move together. Positive covariance means they increase together; negative means one increases while the other decreases. Covariance is used in multivariate modelling, correlation, PCA, and understanding feature relationships. Formula: Cov(X, Y) = E[(X − μ_X)(Y − μ_Y)].
Correlation
Correlation is a normalised version of covariance that always lies between −1 and 1. It measures the strength and direction of linear relationships between variables. It is unitless and scale-independent, making it easier to interpret than covariance. Formula: ρ = Cov(X, Y) / (σ_X σ_Y).
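A quick numerical check of the relationship between covariance and correlation, as a sketch assuming NumPy (the sample data is invented):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=1000)
y = 0.8 * x + rng.normal(scale=0.5, size=1000)  # y is positively related to x

cov_xy = np.cov(x, y)[0, 1]                       # sample covariance Cov(X, Y)
rho_manual = cov_xy / (np.std(x, ddof=1) * np.std(y, ddof=1))
rho_builtin = np.corrcoef(x, y)[0, 1]             # NumPy's normalised version

print(rho_manual, rho_builtin)  # both lie in [-1, 1] and should agree
```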
Conditional probability
Conditional probability quantifies the likelihood of an event given another event has occurred. It is defined as P(A∣B) = P(A∩B) / P(B). Conditional reasoning underpins Bayesian inference, Markov models, and sequential decision-making. It reflects how probabilities update when new information becomes available.
Independence
Two events are independent if knowing one gives no information about the other. Mathematically, P(A∩B) = P(A)P(B). Independence assumptions simplify many models, such as Naive Bayes and classical statistics. True independence is rare, but approximate independence often suffices in modelling.
(Expanded) Bayes’ theorem
Bayes’ theorem provides a principled way to update beliefs after observing new evidence. It combines prior belief P(H), likelihood P(D∣H), and marginal likelihood P(D) to compute the posterior P(H∣D): P(H∣D) = P(D∣H)P(H) / P(D). Intuitively, it answers: “Given what I believed and what I see now, what should I believe next?” It is foundational in probabilistic modelling, diagnostics, Bayesian optimisation, and Naive Bayes. Bayes’ theorem is central to reasoning under uncertainty.
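A worked numeric sketch of Bayes’ theorem in plain Python; the disease-testing numbers are hypothetical, chosen only to illustrate the update:

```python
# Hypothetical diagnostic-test example
p_h = 0.01              # prior P(H): 1% of the population has the condition
p_d_given_h = 0.95      # likelihood P(D|H): test sensitivity
p_d_given_not_h = 0.05  # false-positive rate P(D|not H)

# Marginal likelihood P(D) via the law of total probability
p_d = p_d_given_h * p_h + p_d_given_not_h * (1 - p_h)

# Posterior P(H|D) = P(D|H) P(H) / P(D)
p_h_given_d = p_d_given_h * p_h / p_d
print(f"P(H|D) = {p_h_given_d:.3f}")  # ~0.161, far lower than the 95% sensitivity suggests
```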
Law of Large Numbers (LLN)
The LLN states that as the number of samples increases, the sample mean converges to the true population mean. This explains why averages stabilise with more data. It underlies statistical estimation, Monte Carlo methods, and performance evaluation. LLN justifies treating sample averages as reliable approximations.
Central Limit Theorem (CLT)
The CLT states that the mean of many independent random variables becomes approximately normally distributed, regardless of the original distribution. This result enables the use of normal-based confidence intervals and hypothesis tests even on non-normal data. It is one of the most powerful theorems in statistics and underpins much of classical ML theory.
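A small simulation sketch of the CLT (assuming NumPy; the exponential distribution and sample sizes are arbitrary choices): averaging many draws from a skewed distribution produces means that are approximately normal.

```python
import numpy as np

rng = np.random.default_rng(42)

# 10,000 sample means, each the average of 50 draws from a skewed (exponential) distribution
sample_means = rng.exponential(scale=1.0, size=(10_000, 50)).mean(axis=1)

# The CLT predicts mean ~ 1.0 and standard deviation ~ 1 / sqrt(50) for these averages
print(sample_means.mean(), sample_means.std(), 1 / np.sqrt(50))
```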
Bias (model bias)
Model bias is the error introduced when a model’s assumptions are too simple to capture the true data-generating process. High-bias models underfit and systematically miss important patterns. Bias can be reduced by increasing model complexity or enriching feature representations. It forms one-half of the bias–variance tradeoff.
Bias–variance tradeoff
The bias–variance tradeoff describes how model complexity affects generalisation. Increasing complexity reduces bias but increases variance; decreasing complexity increases bias but reduces variance. Good models strike a balance between these two sources of error. This framework is essential for diagnosing under- or overfitting.
Confidence interval
A confidence interval provides a range that, under repeated sampling, would contain the true parameter with a specified frequency (e.g., 95%). It reflects uncertainty in statistical estimates due to finite data. Confidence intervals are crucial in A/B testing, inference, and communicating uncertainty.
Hypothesis testing
Hypothesis testing evaluates whether observed data is consistent with a null hypothesis. A test statistic quantifies how extreme the data is relative to what the null predicts. P-values, t-tests, chi-square tests, and ANOVA are specific tools within this framework. Hypothesis testing helps assess significance rather than effect size.
p-value
A p-value is the probability of observing a result at least as extreme as the data if the null hypothesis were true. Low p-values suggest the data is unlikely under the null, providing evidence against it. A p-value does not measure effect size or the probability that a hypothesis is true, a common interview misconception.
t-test
A t-test compares means between groups while accounting for sample size and variability. It assumes approximate normality of the underlying data. The test statistic follows a t-distribution under the null. Common types include one-sample, paired, and two-sample t-tests.
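A minimal two-sample t-test sketch, assuming SciPy is installed (the group data is synthetic):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
group_a = rng.normal(loc=10.0, scale=2.0, size=30)  # control group (synthetic)
group_b = rng.normal(loc=11.0, scale=2.0, size=30)  # treatment group (synthetic)

# Two-sample t-test for a difference in means
t_stat, p_value = stats.ttest_ind(group_a, group_b)
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")
```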
Chi-square test
A chi-square test evaluates whether observed categorical counts differ significantly from expected counts. It is widely used for tests of independence and goodness-of-fit. The test statistic follows a chi-square distribution when sample sizes are sufficiently large.
ANOVA
ANOVA (Analysis of Variance) tests whether the means of three or more groups differ significantly. It decomposes total variability into between-group and within-group components. A significant result indicates at least one group mean differs, but post-hoc tests are needed to identify which.
Maximum likelihood estimation (MLE)
MLE chooses the parameter values that maximise the probability of the observed data. It is widely used due to its consistency and efficiency for large samples. Many classical ML models—including logistic regression and Naive Bayes—are trained using MLE principles.
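For intuition, the MLE of a Bernoulli parameter p is the sample proportion of successes; the sketch below (synthetic coin-flip data, assuming NumPy) checks this by scanning the log-likelihood over a grid.

```python
import numpy as np

rng = np.random.default_rng(7)
data = rng.binomial(1, 0.3, size=200)  # synthetic 0/1 outcomes with true p = 0.3

# Bernoulli log-likelihood for a grid of candidate p values
p_grid = np.linspace(0.01, 0.99, 99)
log_lik = data.sum() * np.log(p_grid) + (len(data) - data.sum()) * np.log(1 - p_grid)

p_mle = p_grid[np.argmax(log_lik)]
print(p_mle, data.mean())  # the maximiser matches the sample mean (up to grid resolution)
```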
MAP estimation
MAP estimation extends MLE by incorporating a prior belief over parameters. It maximises the posterior distribution P(θ∣D), not just the likelihood. Regularisation in ML (e.g., L2) can be interpreted as MAP with specific priors.
Normal distribution
The normal (Gaussian) distribution is a continuous, symmetric, bell-shaped distribution fully described by its mean μ and standard deviation σ. It arises via the Central Limit Theorem whenever many small independent effects are added, which is why it is the default model for noise and residuals. Many statistical methods and ML techniques assume approximate normality.
Bernoulli distribution
The Bernoulli distribution models a single binary outcome, taking value 1 with probability p and 0 with probability 1 − p. It is the simplest discrete distribution and forms the basis of logistic regression and binary classification. Bernoulli trials are building blocks for more complex models like the binomial distribution.
Binomial distribution
The binomial distribution models the number of successes in a fixed number of independent Bernoulli trials. It is parameterised by n (number of trials) and p (success probability). It is widely used in A/B testing, reliability analysis, and scenarios involving repeated binary outcomes.
Poisson distribution
The Poisson distribution models counts of rare, independent events that occur over a fixed interval. It is parameterised by a single rate parameter λ, which equals both its mean and variance. Common applications include call arrivals, web traffic, and failure events. It is also a natural limit of the binomial distribution when n is large and p is small.
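A short sketch (assuming SciPy) showing how the binomial PMF approaches the Poisson PMF when n is large and p is small; the specific n, p, and k values are arbitrary.

```python
from scipy import stats

n, p = 1000, 0.002   # many trials, small success probability
lam = n * p          # matching Poisson rate

for k in range(5):
    binom_pk = stats.binom.pmf(k, n, p)
    pois_pk = stats.poisson.pmf(k, lam)
    print(k, round(binom_pk, 4), round(pois_pk, 4))  # the two PMFs are close
```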
Exponential distribution
The exponential distribution models the waiting time between independent Poisson events. It has the “memoryless” property, meaning future behaviour does not depend on past events. This makes it useful in reliability engineering, queueing theory, and modelling time-to-failure.
Gamma distribution
The gamma distribution generalises the exponential distribution and models the sum of multiple independent exponential waiting times. It is parameterised by a shape and rate, allowing flexible modelling of positive, skewed data. It serves as a conjugate prior for Poisson rates in Bayesian inference.
Beta distribution
The beta distribution models probabilities or proportions on the interval [0,1], and is parameterised by two shape parameters. It can take many shapes (U-shaped, uniform, bell-curved), making it highly flexible. It is widely used as a conjugate prior for Bernoulli/binomial likelihoods.
Uniform distribution
The uniform distribution assigns equal probability density across a defined interval. It is simple, often used for baseline random sampling or non-informative priors. Many simulation methods and random number generators rely on uniform distributions.
Weibull distribution
The Weibull distribution is widely used in reliability engineering to model time-to-failure. It can model increasing, decreasing, or constant hazard rates depending on its shape parameter. This flexibility makes it suitable for materials fatigue, failure prediction, and wind-speed modelling.
Lognormal distribution
A lognormal distribution arises when the logarithm of a variable follows a normal distribution. It models positive, right-skewed data that grow multiplicatively rather than additively. Common applications include income, reaction times, and asset prices.
Pareto distribution
The Pareto distribution is a heavy-tailed distribution often associated with the “80/20 rule.” It models extreme values and highly skewed quantities like wealth, city sizes, or catastrophic losses. Its tail behaviour makes it useful in risk analysis and extreme value modelling.
Conjugate priors
A conjugate prior is a prior distribution that, when combined with a particular likelihood, results in a posterior of the same functional form. This makes Bayesian updating analytically simple and computationally efficient. Examples include Beta–Binomial, Gamma–Poisson, and Normal–Normal conjugate pairs.
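A minimal Beta–Binomial updating sketch in plain Python (the prior parameters and observed counts are illustrative): because the Beta prior is conjugate to the binomial likelihood, the posterior is obtained by simple addition.

```python
# Prior belief about a success probability: Beta(alpha, beta)
alpha_prior, beta_prior = 2, 2   # illustrative, weakly informative prior

# Observed data: successes and failures from Bernoulli/binomial trials
successes, failures = 30, 10

# Conjugacy: the posterior is again a Beta distribution
alpha_post = alpha_prior + successes
beta_post = beta_prior + failures

posterior_mean = alpha_post / (alpha_post + beta_post)
print(f"Posterior: Beta({alpha_post}, {beta_post}), mean = {posterior_mean:.3f}")
```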
Vector
A vector is an ordered list of numbers representing a point, direction, or feature set in a multidimensional space. Vectors are the fundamental objects of linear algebra and appear everywhere in ML: data samples, weight parameters, and gradients. They support operations like dot products, norms, and transformations.
Matrix
A matrix is a rectangular array of numbers representing datasets, linear transformations, or parameters in a model. They allow compact representation of systems of equations and transformations of vector spaces. In ML, matrices power operations like linear layers, attention mechanisms, and covariance calculations.
Dot product
The dot product measures similarity between two vectors by multiplying and summing their components. A large positive value indicates the vectors point in similar directions; zero indicates orthogonality. It underpins linear regression, cosine similarity, attention mechanisms, and gradient calculations.
Matrix multiplication
Matrix multiplication composes linear transformations and is defined by combining rows of one matrix with columns of another. It is central to all neural network operations, where layers apply matrix–vector multiplications. Understanding matrix multiplication is essential for reasoning about complexity and model architecture.
Norm
A norm measures the size or length of a vector. Common norms include L1 (sum of absolute values) and L2 (Euclidean length). Norms appear in regularisation, optimisation, and distance metrics, influencing model stability and generalisation.
Eigenvalue
An eigenvalue measures how much a corresponding eigenvector is scaled under a linear transformation. Large eigenvalues reflect directions of strong transformation. Eigenvalues are key in analysing covariance matrices, optimisation curvature, and system stability.
Eigenvector
An eigenvector is a direction that remains unchanged (up to scaling) under a linear transformation. It reveals intrinsic structure in matrices — such as principal directions of data variation in PCA. Eigenvectors are central in dimensionality reduction and spectral clustering.
Positive-definite matrix
A positive-definite matrix A produces a positive value xᵀAx for every non-zero vector x. These matrices define valid covariance matrices and guarantee convexity in quadratic optimisation problems. They ensure unique minima in many estimation tasks.
Singular Value Decomposition (SVD)
SVD decomposes a matrix into UΣVᵀ, representing rotation → scaling → rotation. It provides numerical stability and insights into matrix rank and structure. SVD is used in PCA, recommender systems, and dimensionality reduction.
PCA (as eigen-decomposition of covariance)
PCA finds orthogonal directions (principal components) that capture maximum variance in the data. It is implemented via eigen-decomposition of the covariance matrix or via SVD. PCA is used for dimensionality reduction, noise filtering, and visualisation, especially for high-dimensional data.
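A sketch of PCA via eigen-decomposition of the covariance matrix (assuming NumPy; the data is synthetic and the number of components kept is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 5))           # synthetic data: 200 samples, 5 features
X_centred = X - X.mean(axis=0)          # PCA requires mean-centred data

cov = np.cov(X_centred, rowvar=False)   # 5x5 covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)  # eigh: symmetric matrices, ascending eigenvalues

# Sort components by explained variance (largest eigenvalue first)
order = np.argsort(eigvals)[::-1]
components = eigvecs[:, order[:2]]      # keep the top 2 principal directions

X_reduced = X_centred @ components      # project onto the principal components
print(X_reduced.shape)                  # (200, 2)
```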
Derivative
A derivative measures the instantaneous rate of change of a function with respect to one of its inputs; geometrically, it is the slope of the tangent line at a point. In ML, derivatives describe how a small change in a parameter changes the loss, which is the basis of all gradient-based training.
Gradient
A gradient is the vector of partial derivatives of a multivariable function. It points in the direction of steepest increase of the function; the negative gradient points toward steepest descent. Gradients are essential in training ML models because they tell us how to adjust parameters to reduce loss. Efficient gradient computation via backpropagation enables deep learning.
Gradient descent
Gradient descent is an iterative optimisation algorithm that updates parameters in the direction of the negative gradient. By repeatedly taking small steps downhill, the algorithm seeks a local (or global) minimum of the loss function. Its performance depends heavily on the learning rate. Gradient descent is the foundation of most training procedures in ML.
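A minimal gradient-descent sketch in plain Python, minimising a simple quadratic loss (the function, learning rate, and starting point are illustrative):

```python
# Minimise f(w) = (w - 3)^2 with gradient descent; the gradient is f'(w) = 2 * (w - 3)
w = 0.0              # arbitrary starting point
learning_rate = 0.1

for step in range(100):
    grad = 2 * (w - 3)           # derivative of the loss at the current w
    w -= learning_rate * grad    # step in the direction of the negative gradient

print(w)  # converges towards the minimiser w = 3
```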
Stochastic Gradient Descent (SGD)
SGD computes gradients using small mini-batches rather than the entire dataset. This introduces noise into updates, which helps escape shallow minima and speeds up training on large datasets. SGD is widely used in deep learning due to its efficiency and generalisation properties. Tuning batch size and learning rate is critical for good performance.
Adam
Adam is an optimisation algorithm that combines momentum with adaptive learning rates. It keeps running estimates of both first and second moments of the gradient, adjusting learning rates per parameter. Adam often converges faster and more smoothly than SGD, especially in deep networks. It is widely used because it requires little hyperparameter tuning and works well out-of-the-box.
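A sketch of the core Adam update for a single parameter vector (NumPy; the beta and epsilon values follow the commonly cited defaults, while the learning rate and toy quadratic loss are chosen just to make this example converge quickly):

```python
import numpy as np

def adam_minimise(grad_fn, w, lr=0.05, beta1=0.9, beta2=0.999, eps=1e-8, steps=2000):
    """Minimal Adam loop: first/second moment estimates with bias correction."""
    m = np.zeros_like(w)  # running mean of gradients (first moment)
    v = np.zeros_like(w)  # running mean of squared gradients (second moment)
    for t in range(1, steps + 1):
        g = grad_fn(w)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g ** 2
        m_hat = m / (1 - beta1 ** t)   # bias-corrected first moment
        v_hat = v / (1 - beta2 ** t)   # bias-corrected second moment
        w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w

# Toy example: minimise f(w) = ||w - 3||^2, whose gradient is 2 * (w - 3)
w_opt = adam_minimise(lambda w: 2 * (w - 3), w=np.zeros(2))
print(w_opt)  # approaches [3, 3]
```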
RMSProp
RMSProp adapts learning rates by dividing the gradient by a moving average of squared gradients. This stabilises updates in directions with large gradients and accelerates learning in flatter directions. RMSProp was designed to address issues in non-stationary optimisation problems and is commonly used in training recurrent neural networks.
Convex function
A convex function has the property that any line segment between two points on its graph lies above or on the graph. This structure ensures that every local minimum is a global minimum, making optimisation predictable and stable. Many classical ML losses (e.g., squared error, logistic loss) are convex, which is why they are easier to optimise than deep networks.
Lagrange multipliers
Lagrange multipliers allow optimisation of a function subject to constraints by incorporating the constraints into a new objective, the Lagrangian. Solutions occur where gradients of the original function align with gradients of the constraint. This technique underpins constrained optimisation, dual formulations, and SVMs. It is a core tool in optimisation theory.
Loss landscape
A loss landscape is the geometric surface formed by plotting model parameters against loss values. Its shape determines how hard training is: flat valleys, sharp minima, and saddle points all affect optimiser behaviour. Deep learning landscapes are highly non-convex, but often contain large flat minima that generalise well. Understanding loss landscapes helps explain training stability and generalisation.
Supervised learning
Supervised learning fits a model to labelled examples, learning a mapping from inputs to known outputs. It covers regression (continuous targets) and classification (categorical targets). Performance is judged by how well predictions generalise to unseen data.
Unsupervised learning
Unsupervised learning finds structure in data without labels. It includes clustering, dimensionality reduction, and density estimation. These methods reveal patterns, groupings, or latent structure in the data and are useful for exploration, preprocessing, or feature extraction.
Reinforcement learning
Reinforcement learning trains an agent to make sequential decisions by receiving rewards from an environment. The agent learns a policy that maximises long-term reward through exploration and trial-and-error. RL is used in robotics, game-playing, and resource allocation problems, where actions influence future states.
Training set
The training set is the portion of data used to fit the model’s parameters. It directly influences learned patterns and optimisation. Overfitting can occur when the model learns noise instead of general structure.
Validation set
The validation set is used to tune hyperparameters and select models. It acts as an unbiased checkpoint during training, helping identify overfitting before the test stage. The validation set guides model refinement.
Test set
The test set provides the final unbiased estimate of model performance after all training and tuning are complete. It cannot influence any model decisions or hyperparameters. Good ML practice keeps the test set untouched until the end.
Underfitting
Underfitting occurs when a model is too simple to capture underlying patterns in the data. It results in high error on both training and test sets. Solutions include adding complexity, engineering better features, or reducing regularisation.
Overfitting
Overfitting occurs when a model captures noise or idiosyncrasies of the training data rather than general patterns. This produces low training error but high test error. Regularisation, early stopping, and cross-validation help prevent overfitting.
Cross-validation
Cross-validation evaluates model performance by training and validating on multiple different splits of the data. In k-fold CV, the dataset is divided into k parts, and each part is used once as validation. It reduces variance in performance estimates and provides a robust foundation for hyperparameter tuning.
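A cross-validation sketch assuming scikit-learn is available (the dataset is synthetic and the estimator choice is arbitrary):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))               # synthetic features
y = (X[:, 0] + X[:, 1] > 0).astype(int)     # synthetic binary labels

# 5-fold CV: each fold is used once as the validation split
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5, scoring="accuracy")
print(scores, scores.mean())
```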
Regularisation
Regularisation discourages overly complex models to improve generalisation. It works by adding penalties to the loss function or restricting model behaviour. Common techniques include L1/L2 penalties, dropout, and early stopping.
L1 regularisation
L1 regularisation adds the sum of absolute parameter values to the loss function. It encourages sparsity, driving some weights to exactly zero. This makes L1 useful for feature selection and high-dimensional datasets.
L2 regularisation
L2 regularisation adds the sum of squared parameter values to the loss. It discourages large weights but rarely forces them to zero. L2 stabilises optimisation and is used widely in linear models and neural networks.
Dropout
Dropout randomly disables a subset of neurons during training. This prevents co-adaptation and forces the network to learn redundant, robust representations. Dropout acts as a powerful regulariser in deep learning models.
Early stopping
Early stopping halts training when validation performance stops improving. This prevents the model from overfitting the training data. It is simple, effective, and widely used in neural network training.
Accuracy
Accuracy measures the proportion of predictions that are correct. It is intuitive but unreliable with class-imbalanced datasets. It should only be used when all classes are equally important.
Precision
Precision is the proportion of predicted positives that are true positives: TP / (TP + FP). It answers: “When the model predicts positive, how often is it right?” Precision is important when false positives are costly.
Recall
Recall is the proportion of actual positives correctly identified: TP / (TP + FN). It answers: “Out of all real positives, how many did the model find?” Recall is crucial when missing positive cases is costly.
F1 score
The F1 score is the harmonic mean of precision and recall. It balances the two and is especially useful for imbalanced datasets. A high F1 means the model maintains both good precision and recall.
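The sketch below computes precision, recall, and F1 directly from hypothetical confusion-matrix counts (plain Python):

```python
# Hypothetical confusion-matrix counts
tp, fp, fn = 80, 20, 40

precision = tp / (tp + fp)                           # TP / (TP + FP)
recall = tp / (tp + fn)                              # TP / (TP + FN)
f1 = 2 * precision * recall / (precision + recall)   # harmonic mean of the two

print(f"precision={precision:.2f}, recall={recall:.2f}, F1={f1:.2f}")
# precision=0.80, recall=0.67, F1=0.73
```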
ROC curve
The ROC curve plots true positive rate vs. false positive rate across classification thresholds. It visualises the trade-off between sensitivity and specificity. ROC curves are threshold-independent, making them useful for comparing classifiers.
AUC
AUC (Area Under the ROC Curve) measures how well a model separates positive from negative classes. An AUC of 0.5 corresponds to random ranking and 1.0 to perfect ranking. AUC is widely used in risk scoring, medicine, and fraud detection.
Log-loss
Log-loss (cross-entropy loss) measures the accuracy of predicted probabilities. It heavily penalises confident wrong predictions. It is the standard loss for classification tasks and encourages well-calibrated probabilities.
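A short log-loss computation sketch (NumPy; the labels and predicted probabilities are made up):

```python
import numpy as np

y_true = np.array([1, 0, 1, 1, 0])
y_prob = np.array([0.9, 0.2, 0.6, 0.95, 0.4])  # predicted P(y = 1)

eps = 1e-15                                    # clip to avoid log(0)
p = np.clip(y_prob, eps, 1 - eps)
log_loss = -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))
print(round(log_loss, 3))
```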
RMSE
RMSE (root mean squared error) is the square root of the average squared prediction error. It penalises large errors more than MAE. RMSE is widely used in regression because it measures typical prediction deviation in original units.
MAE
MAE (mean absolute error) averages the absolute differences between predictions and true values. It is more robust to outliers than RMSE. MAE is easy to interpret: “on average, predictions are off by X.”
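RMSE and MAE can be computed side by side, as in the sketch below (NumPy; the values are invented):

```python
import numpy as np

y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_pred = np.array([2.5, 5.0, 4.0, 8.0])

errors = y_pred - y_true
rmse = np.sqrt(np.mean(errors ** 2))  # penalises large errors more heavily
mae = np.mean(np.abs(errors))         # average absolute deviation

print(f"RMSE={rmse:.3f}, MAE={mae:.3f}")
```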
Feature engineering
Feature engineering involves creating or transforming input features to make patterns easier for a model to learn. It includes encoding categories, scaling values, generating interactions, and incorporating domain knowledge. Good feature engineering often matters more than model choice on tabular data.
Hyperparameter tuning
Hyperparameter tuning searches for the combination of settings that yields the best model performance. It is performed using validation data or cross-validation. Hyperparameter tuning can significantly improve results without changing the model architecture.
Grid search
Grid search evaluates all combinations of hyperparameters in a predefined grid. It is simple but becomes computationally expensive with many parameters. It is effective when the search space is small and well understood.
Random search
Random search samples hyperparameters randomly. It is more efficient than grid search in high-dimensional spaces because it explores more diverse configurations. It is often a strong baseline for tuning.
Bayesian optimisation
Bayesian optimisation models performance as a function of hyperparameters using a surrogate model (often a Gaussian process). It chooses new hyperparameters by balancing exploration and exploitation. It is highly effective when model training is expensive.
Data leakage
Data leakage occurs when information from outside the training data (especially from the test set or future data) influences the model during training. This causes unrealistically high performance during evaluation and failure in deployment. Avoiding leakage is one of the most important practical skills in ML.
Linear regression
Linear regression models a continuous target as a weighted linear combination of input features plus an intercept. It is typically fit by minimising squared error (ordinary least squares), which has a closed-form solution and a clear statistical interpretation. It remains a strong, interpretable baseline for regression problems.
Logistic regression
Logistic regression models the probability of a binary class using the logistic (sigmoid) function applied to a linear combination of features. It outputs probabilities and is highly interpretable through log-odds. It remains a strong baseline in classification tasks, especially with high-dimensional sparse data. Regularisation is commonly applied to improve generalisation.
Naive Bayes
Naive Bayes applies Bayes’ theorem with the strong assumption that features are conditionally independent given the class. Despite this unrealistic assumption, it performs extremely well on text and other high-dimensional datasets. It is simple, fast, and produces probabilistic outputs. Common variants include Gaussian, Multinomial, and Bernoulli Naive Bayes.
k-NN (k-nearest neighbours)
k-NN predicts a label based on the majority class (classification) or average value (regression) of the k closest training points. It is non-parametric and makes no assumptions about data distribution. However, it becomes slow on large datasets and suffers in high-dimensional spaces. Feature scaling is crucial for good performance.
Decision tree
A decision tree recursively splits the feature space into regions that maximise class purity or reduce error. Trees are interpretable and capture nonlinear relationships naturally. However, single trees tend to overfit unless pruned. They form the basis of more powerful ensemble methods.
Random forest
A random forest combines many decision trees trained on bootstrapped samples with random feature selection. Averaging many diverse trees reduces variance and improves generalisation. Random forests perform well out-of-the-box, handle mixed data types, and are robust to noise. They are widely used on tabular datasets.
XGBoost
XGBoost is an efficient implementation of gradient boosting that builds trees sequentially to correct residual errors. It incorporates regularisation, learning rate shrinkage, and weighted sampling to improve robustness. XGBoost is a top performer on structured/tabular data and is common in Kaggle competitions and production systems.
LightGBM
LightGBM is a gradient boosting framework optimised for speed and memory efficiency. It uses histogram-based feature binning and leaf-wise tree growth for faster training. It often outperforms XGBoost on large datasets. It also handles categorical variables efficiently.
CatBoost
CatBoost is a gradient boosting library designed to handle categorical features natively using target statistics without leakage. It reduces the need for preprocessing and performs strongly on mixed-type datasets. CatBoost often requires less tuning and excels on tabular problems with categorical richness.
SVM (Support Vector Machine)
An SVM finds the hyperplane that maximises the margin between classes in a feature space. Margin maximisation leads to strong generalisation performance. Through kernel functions (e.g., RBF), SVMs can model highly nonlinear boundaries without explicit feature engineering. They work well on medium-sized datasets and clear-margin problems.
k-means clustering
k-means partitions data into k clusters by alternating between assigning points to the nearest centroid and updating centroids. It is simple and scalable but assumes spherical, similarly sized clusters. It is sensitive to initialisation and does not handle complex cluster shapes well.
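A minimal k-means sketch in NumPy, alternating the assignment and centroid-update steps described above (the data, k, and iteration count are illustrative; real use would add convergence checks and multiple restarts):

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic 2-D data drawn around two different centres
X = np.vstack([rng.normal(loc=0.0, size=(100, 2)),
               rng.normal(loc=5.0, size=(100, 2))])

k = 2
centroids = X[rng.choice(len(X), size=k, replace=False)]  # random initialisation

for _ in range(20):
    # Assignment step: each point goes to its nearest centroid
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    # Update step: centroids move to the mean of their assigned points
    centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])

print(centroids)  # roughly [0, 0] and [5, 5]
```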
DBSCAN
DBSCAN clusters points based on density, identifying core points, border points, and noise. It can find arbitrarily shaped clusters and automatically detects outliers. It works well when cluster density is meaningful but struggles with varying densities or high-dimensional data.
Hierarchical clustering
Hierarchical clustering builds a multi-level clustering tree (dendrogram) either by merging clusters bottom-up (agglomerative) or splitting top-down (divisive). It does not require choosing the number of clusters in advance. The clustering structure can be visualised to choose meaningful groupings.
t-SNE
t-SNE is a nonlinear dimensionality reduction method designed for 2D/3D visualisation of high-dimensional data. It preserves local neighbourhood relationships, making clusters visually distinct. However, it does not preserve global structure and is not used for downstream ML tasks. It is best suited for exploratory analysis of embeddings or image features.
UMAP
UMAP is a nonlinear embedding method that preserves both local and some global structure while being faster and more scalable than t-SNE. It produces meaningful low-dimensional representations and maintains more of the data’s manifold structure. UMAP is popular for visualising embeddings, clustering structure, and high-dimensional biological data.
Hidden Markov Model (HMM)
An HMM models sequences where observations are generated by hidden states that follow a Markov process. It is defined by transition probabilities, emission probabilities, and initial state distributions. HMMs are used in speech recognition, bioinformatics, part-of-speech tagging, and any domain involving sequential latent structure. Algorithms like Viterbi (decoding) and Baum–Welch (learning) enable inference and training.