Random variable
A random variable assigns a numerical value to each outcome in a random process. It allows us to mathematically describe uncertainty and model real-world phenomena. Random variables can be discrete (countable outcomes) or continuous (values over an interval). They are the foundational objects in probability theory and statistical modelling.
Discrete random variable
A discrete random variable takes on values from a countable set, such as integers or category labels. Probabilities are described using a probability mass function (PMF). These variables appear when modelling counts, successes/failures, or classification labels. They are easy to work with because probabilities add directly.
Continuous random variable
A continuous random variable takes values from an uncountable interval such as real numbers. Probabilities come from areas under the probability density function (PDF), not single values. Continuous variables are used for measurements like height, temperature, or time. They require calculus for proper handling.
PDF, PMF, CDF
The PMF gives the probability of each exact value of a discrete random variable; probabilities add up to 1. The PDF gives the probability density for a continuous random variable; probabilities come from integrating over intervals. The CDF gives P(X ≤ x) for both discrete and continuous variables and is especially useful for quantiles and tail analysis. Understanding the roles of PMF/PDF/CDF together allows you to read and interpret any probability distribution.
Expectation (mean)
Expectation represents the long-run average value of a random variable over infinite repetitions. It captures the “centre of mass” of a distribution. In ML, expectation underlies risk minimisation, gradient calculations, and probabilistic modelling. Formula: E[X] = ∑ x p(x) for discrete variables, or E[X] = ∫ x f(x) dx for continuous variables.
Variance
Variance measures the degree to which a random variable deviates from its mean. High variance means the data is widely spread; low variance means it is tightly clustered. In modelling, variance represents uncertainty and helps describe the stability of predictions. Formula: Var(X) = E[(X − μ)²].
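As a concrete illustration, the sketch below computes E[X] and Var(X) for a small discrete distribution directly from the formulas above (assuming NumPy; the values and probabilities are made up for the example).

```python
import numpy as np

# Hypothetical discrete distribution: values of X and their probabilities p(x)
x = np.array([0, 1, 2, 3])
p = np.array([0.1, 0.4, 0.3, 0.2])  # must sum to 1

mean = np.sum(x * p)                   # E[X] = sum of x * p(x)
variance = np.sum((x - mean) ** 2 * p)  # Var(X) = E[(X - mu)^2]

print(f"E[X] = {mean:.2f}, Var(X) = {variance:.2f}")  # E[X] = 1.60, Var(X) = 0.84
```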
(Expanded) Variance (model variance)
Model variance measures how much a model’s predictions would change if trained on different subsets of the data. High-variance models are sensitive to noise and tend to overfit. Variance is controlled using regularisation, ensembling, or simpler architectures. It forms one-half of the bias–variance tradeoff.
Covariance
Covariance measures how two variables move together. Positive covariance means they increase together; negative means one increases while the other decreases. Covariance is used in multivariate modelling, correlation, PCA, and understanding feature relationships. Formula: Cov(X, Y) = E[(X − μ_X)(Y − μ_Y)].
Correlation
Correlation is a normalised version of covariance that always lies between −1 and 1. It measures the strength and direction of linear relationships between variables. It is unitless and scale-independent, making it easier to interpret than covariance. Formula: ρ = Cov(X, Y) / (σ_X σ_Y).
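A quick numerical check of the relationship between covariance and correlation, as a sketch assuming NumPy (the sample data is invented):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=1000)
y = 0.8 * x + rng.normal(scale=0.5, size=1000)  # y is positively related to x

cov_xy = np.cov(x, y)[0, 1]                       # sample covariance Cov(X, Y)
rho_manual = cov_xy / (np.std(x, ddof=1) * np.std(y, ddof=1))
rho_builtin = np.corrcoef(x, y)[0, 1]             # NumPy's normalised version

print(rho_manual, rho_builtin)  # both lie in [-1, 1] and should agree
```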
Conditional probability
Conditional probability quantifies the likelihood of an event given another event has occurred. It is defined as P(A∣B) = P(A∩B) / P(B). Conditional reasoning underpins Bayesian inference, Markov models, and sequential decision-making. It reflects how probabilities update when new information becomes available.
Independence
Two events are independent if knowing one gives no information about the other. Mathematically, P(A∩B) = P(A)P(B). Independence assumptions simplify many models, such as Naive Bayes and classical statistics. True independence is rare, but approximate independence often suffices in modelling.
(Expanded) Bayes’ theorem
Bayes’ theorem provides a principled way to update beliefs after observing new evidence. It combines prior belief P(H), likelihood P(D∣H), and marginal likelihood P(D) to compute the posterior P(H∣D): P(H∣D) = P(D∣H)P(H) / P(D). Intuitively, it answers: “Given what I believed and what I see now, what should I believe next?” It is foundational in probabilistic modelling, diagnostics, Bayesian optimisation, and Naive Bayes. Bayes’ theorem is central to reasoning under uncertainty.
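A worked numeric sketch of Bayes’ theorem in plain Python; the disease-testing numbers are hypothetical, chosen only to illustrate the update:

```python
# Hypothetical diagnostic-test example
p_h = 0.01              # prior P(H): 1% of the population has the condition
p_d_given_h = 0.95      # likelihood P(D|H): test sensitivity
p_d_given_not_h = 0.05  # false-positive rate P(D|not H)

# Marginal likelihood P(D) via the law of total probability
p_d = p_d_given_h * p_h + p_d_given_not_h * (1 - p_h)

# Posterior P(H|D) = P(D|H) P(H) / P(D)
p_h_given_d = p_d_given_h * p_h / p_d
print(f"P(H|D) = {p_h_given_d:.3f}")  # ~0.161, far lower than the 95% sensitivity suggests
```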
Law of Large Numbers (LLN)
The LLN states that as the number of samples increases, the sample mean converges to the true population mean. This explains why averages stabilise with more data. It underlies statistical estimation, Monte Carlo methods, and performance evaluation. LLN justifies treating sample averages as reliable approximations.
Central Limit Theorem (CLT)
The CLT states that the mean of many independent random variables becomes approximately normally distributed, regardless of the original distribution. This result enables the use of normal-based confidence intervals and hypothesis tests even on non-normal data. It is one of the most powerful theorems in statistics and underpins much of classical ML theory.
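A small simulation sketch of the CLT (assuming NumPy; the exponential distribution and sample sizes are arbitrary choices): averaging many draws from a skewed distribution produces means that are approximately normal.

```python
import numpy as np

rng = np.random.default_rng(42)

# 10,000 sample means, each the average of 50 draws from a skewed (exponential) distribution
sample_means = rng.exponential(scale=1.0, size=(10_000, 50)).mean(axis=1)

# The CLT predicts mean ~ 1.0 and standard deviation ~ 1 / sqrt(50) for these averages
print(sample_means.mean(), sample_means.std(), 1 / np.sqrt(50))
```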
Bias (model bias)
Model bias is the error introduced when a model’s assumptions are too simple to capture the true data-generating process. High-bias models underfit and systematically miss important patterns. Bias can be reduced by increasing model complexity or enriching feature representations. It forms one-half of the bias–variance tradeoff.
Bias–variance tradeoff
The bias–variance tradeoff describes how model complexity affects generalisation. Increasing complexity reduces bias but increases variance; decreasing complexity increases bias but reduces variance. Good models strike a balance between these two sources of error. This framework is essential for diagnosing under- or overfitting.
Confidence interval
A confidence interval provides a range that, under repeated sampling, would contain the true parameter with a specified frequency (e.g., 95%). It reflects uncertainty in statistical estimates due to finite data. Confidence intervals are crucial in A/B testing, inference, and communicating uncertainty.
Hypothesis testing
Hypothesis testing evaluates whether observed data is consistent with a null hypothesis. A test statistic quantifies how extreme the data is relative to what the null predicts. P-values, t-tests, chi-square tests, and ANOVA are specific tools within this framework. Hypothesis testing helps assess significance rather than effect size.
p-value
A p-value is the probability of observing a result at least as extreme as the data if the null hypothesis were true. Low p-values suggest the data is unlikely under the null, providing evidence against it. A p-value does not measure effect size or the probability that a hypothesis is true, a common interview misconception.
t-test
A t-test compares means between groups while accounting for sample size and variability. It assumes approximate normality of the underlying data. The test statistic follows a t-distribution under the null. Common types include one-sample, paired, and two-sample t-tests.
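A minimal two-sample t-test sketch, assuming SciPy is installed (the group data is synthetic):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
group_a = rng.normal(loc=10.0, scale=2.0, size=30)  # control group (synthetic)
group_b = rng.normal(loc=11.0, scale=2.0, size=30)  # treatment group (synthetic)

# Two-sample t-test for a difference in means
t_stat, p_value = stats.ttest_ind(group_a, group_b)
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")
```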
Chi-square test
A chi-square test evaluates whether observed categorical counts differ significantly from expected counts. It is widely used for tests of independence and goodness-of-fit. The test statistic follows a chi-square distribution when sample sizes are sufficiently large.
ANOVA
ANOVA (Analysis of Variance) tests whether the means of three or more groups differ significantly. It decomposes total variability into between-group and within-group components. A significant result indicates at least one group mean differs, but post-hoc tests are needed to identify which.
Maximum likelihood estimation (MLE)
MLE chooses the parameter values that maximise the probability of the observed data. It is widely used due to its consistency and efficiency for large samples. Many classical ML models—including logistic regression and Naive Bayes—are trained using MLE principles.
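For intuition, the MLE of a Bernoulli parameter p is the sample proportion of successes; the sketch below (synthetic coin-flip data, assuming NumPy) checks this by scanning the log-likelihood over a grid.

```python
import numpy as np

rng = np.random.default_rng(7)
data = rng.binomial(1, 0.3, size=200)  # synthetic 0/1 outcomes with true p = 0.3

# Bernoulli log-likelihood for a grid of candidate p values
p_grid = np.linspace(0.01, 0.99, 99)
log_lik = data.sum() * np.log(p_grid) + (len(data) - data.sum()) * np.log(1 - p_grid)

p_mle = p_grid[np.argmax(log_lik)]
print(p_mle, data.mean())  # the maximiser matches the sample mean (up to grid resolution)
```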
MAP estimation
MAP estimation extends MLE by incorporating a prior belief over parameters. It maximises the posterior distribution P(θ∣D), not just the likelihood. Regularisation in ML (e.g., L2) can be interpreted as MAP with specific priors.
Normal distribution
The normal (Gaussian) distribution is a continuous, symmetric, bell-shaped distribution fully described by its mean μ and standard deviation σ. It arises via the Central Limit Theorem whenever many small independent effects are added, which is why it is the default model for noise and residuals. Many statistical methods and ML techniques assume approximate normality.
Bernoulli distribution
The Bernoulli distribution models a single binary outcome, taking value 1 with probability p and 0 with probability 1 − p. It is the simplest discrete distribution and forms the basis of logistic regression and binary classification. Bernoulli trials are building blocks for more complex models like the binomial distribution.
Binomial distribution
The binomial distribution models the number of successes in a fixed number of independent Bernoulli trials. It is parameterised by n (number of trials) and p (success probability). It is widely used in A/B testing, reliability analysis, and scenarios involving repeated binary outcomes.
Poisson distribution
The Poisson distribution models counts of rare, independent events that occur over a fixed interval. It is parameterised by a single rate parameter λ, which equals both its mean and variance. Common applications include call arrivals, web traffic, and failure events. It is also a natural limit of the binomial distribution when n is large and p is small.
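A short sketch (assuming SciPy) showing how the binomial PMF approaches the Poisson PMF when n is large and p is small; the specific n, p, and k values are arbitrary.

```python
from scipy import stats

n, p = 1000, 0.002   # many trials, small success probability
lam = n * p          # matching Poisson rate

for k in range(5):
    binom_pk = stats.binom.pmf(k, n, p)
    pois_pk = stats.poisson.pmf(k, lam)
    print(k, round(binom_pk, 4), round(pois_pk, 4))  # the two PMFs are close
```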
Exponential distribution
The exponential distribution models the waiting time between independent Poisson events. It has the “memoryless” property, meaning future behaviour does not depend on past events. This makes it useful in reliability engineering, queueing theory, and modelling time-to-failure.
Gamma distribution
The gamma distribution generalises the exponential distribution and models the sum of multiple independent exponential waiting times. It is parameterised by a shape and rate, allowing flexible modelling of positive, skewed data. It serves as a conjugate prior for Poisson rates in Bayesian inference.
Beta distribution
The beta distribution models probabilities or proportions on the interval [0,1], and is parameterised by two shape parameters. It can take many shapes (U-shaped, uniform, bell-curved), making it highly flexible. It is widely used as a conjugate prior for Bernoulli/binomial likelihoods.
Uniform distribution
The uniform distribution assigns equal probability density across a defined interval. It is simple, often used for baseline random sampling or non-informative priors. Many simulation methods and random number generators rely on uniform distributions.
Weibull distribution
The Weibull distribution is widely used in reliability engineering to model time-to-failure. It can model increasing, decreasing, or constant hazard rates depending on its shape parameter. This flexibility makes it suitable for materials fatigue, failure prediction, and wind-speed modelling.
Lognormal distribution
A lognormal distribution arises when the logarithm of a variable follows a normal distribution. It models positive, right-skewed data that grow multiplicatively rather than additively. Common applications include income, reaction times, and asset prices.
Pareto distribution
The Pareto distribution is a heavy-tailed distribution often associated with the “80/20 rule.” It models extreme values and highly skewed quantities like wealth, city sizes, or catastrophic losses. Its tail behaviour makes it useful in risk analysis and extreme value modelling.
Conjugate priors
A conjugate prior is a prior distribution that, when combined with a particular likelihood, results in a posterior of the same functional form. This makes Bayesian updating analytically simple and computationally efficient. Examples include Beta–Binomial, Gamma–Poisson, and Normal–Normal conjugate pairs.
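A minimal Beta–Binomial updating sketch in plain Python (the prior parameters and observed counts are illustrative): because the Beta prior is conjugate to the binomial likelihood, the posterior is obtained by simple addition.

```python
# Prior belief about a success probability: Beta(alpha, beta)
alpha_prior, beta_prior = 2, 2   # illustrative, weakly informative prior

# Observed data: successes and failures from Bernoulli/binomial trials
successes, failures = 30, 10

# Conjugacy: the posterior is again a Beta distribution
alpha_post = alpha_prior + successes
beta_post = beta_prior + failures

posterior_mean = alpha_post / (alpha_post + beta_post)
print(f"Posterior: Beta({alpha_post}, {beta_post}), mean = {posterior_mean:.3f}")
```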
Vector
A vector is an ordered list of numbers representing a point, direction, or feature set in a multidimensional space. Vectors are the fundamental objects of linear algebra and appear everywhere in ML: data samples, weight parameters, and gradients. They support operations like dot products, norms, and transformations.
Matrix
A matrix is a rectangular array of numbers representing datasets, linear transformations, or parameters in a model. They allow compact representation of systems of equations and transformations of vector spaces. In ML, matrices power operations like linear layers, attention mechanisms, and covariance calculations.
Dot product
The dot product measures similarity between two vectors by multiplying and summing their components. A large positive value indicates the vectors point in similar directions; zero indicates orthogonality. It underpins linear regression, cosine similarity, attention mechanisms, and gradient calculations.
Matrix multiplication
Matrix multiplication composes linear transformations and is defined by combining rows of one matrix with columns of another. It is central to all neural network operations, where layers apply matrix–vector multiplications. Understanding matrix multiplication is essential for reasoning about complexity and model architecture.
Norm
A norm measures the size or length of a vector. Common norms include L1 (sum of absolute values) and L2 (Euclidean length). Norms appear in regularisation, optimisation, and distance metrics, influencing model stability and generalisation.
Eigenvalue
An eigenvalue measures how much a corresponding eigenvector is scaled under a linear transformation. Large eigenvalues reflect directions of strong transformation. Eigenvalues are key in analysing covariance matrices, optimisation curvature, and system stability.
Eigenvector
An eigenvector is a direction that remains unchanged (up to scaling) under a linear transformation. It reveals intrinsic structure in matrices — such as principal directions of data variation in PCA. Eigenvectors are central in dimensionality reduction and spectral clustering.
Positive-definite matrix
A positive-definite matrix A produces a positive value xᵀAx for every non-zero vector x. These matrices define valid covariance matrices and guarantee convexity in quadratic optimisation problems. They ensure unique minima in many estimation tasks.
Singular Value Decomposition (SVD)
SVD decomposes a matrix into UΣVᵀ, representing rotation → scaling → rotation. It provides numerical stability and insights into matrix rank and structure. SVD is used in PCA, recommender systems, and dimensionality reduction.
PCA (as eigen-decomposition of covariance)
PCA finds orthogonal directions (principal components) that capture maximum variance in the data. It is implemented via eigen-decomposition of the covariance matrix or via SVD. PCA is used for dimensionality reduction, noise filtering, and visualisation, especially for high-dimensional data.
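A sketch of PCA via eigen-decomposition of the covariance matrix (assuming NumPy; the data is synthetic and the number of components kept is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 5))           # synthetic data: 200 samples, 5 features
X_centred = X - X.mean(axis=0)          # PCA requires mean-centred data

cov = np.cov(X_centred, rowvar=False)   # 5x5 covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)  # eigh: symmetric matrices, ascending eigenvalues

# Sort components by explained variance (largest eigenvalue first)
order = np.argsort(eigvals)[::-1]
components = eigvecs[:, order[:2]]      # keep the top 2 principal directions

X_reduced = X_centred @ components      # project onto the principal components
print(X_reduced.shape)                  # (200, 2)
```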
Derivative
A derivative measures the instantaneous rate of change of a function with respect to one of its inputs; geometrically, it is the slope of the tangent line at a point. In ML, derivatives describe how a small change in a parameter changes the loss, which is the basis of all gradient-based training.
Gradient
A gradient is the vector of partial derivatives of a multivariable function. It points in the direction of steepest increase of the function; the negative gradient points toward steepest descent. Gradients are essential in training ML models because they tell us how to adjust parameters to reduce loss. Efficient gradient computation via backpropagation enables deep learning.
Gradient descent
Gradient descent is an iterative optimisation algorithm that updates parameters in the direction of the negative gradient. By repeatedly taking small steps downhill, the algorithm seeks a local (or global) minimum of the loss function. Its performance depends heavily on the learning rate. Gradient descent is the foundation of most training procedures in ML.
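A minimal gradient-descent sketch in plain Python, minimising a simple quadratic loss (the function, learning rate, and starting point are illustrative):

```python
# Minimise f(w) = (w - 3)^2 with gradient descent; the gradient is f'(w) = 2 * (w - 3)
w = 0.0              # arbitrary starting point
learning_rate = 0.1

for step in range(100):
    grad = 2 * (w - 3)           # derivative of the loss at the current w
    w -= learning_rate * grad    # step in the direction of the negative gradient

print(w)  # converges towards the minimiser w = 3
```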
Stochastic Gradient Descent (SGD)
SGD computes gradients using small mini-batches rather than the entire dataset. This introduces noise into updates, which helps escape shallow minima and speeds up training on large datasets. SGD is widely used in deep learning due to its efficiency and generalisation properties. Tuning batch size and learning rate is critical for good performance.
Adam
Adam is an optimisation algorithm that combines momentum with adaptive learning rates. It keeps running estimates of both first and second moments of the gradient, adjusting learning rates per parameter. Adam often converges faster and more smoothly than SGD, especially in deep networks. It is widely used because it requires little hyperparameter tuning and works well out-of-the-box.
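A sketch of the core Adam update for a single parameter vector (NumPy; the beta and epsilon values follow the commonly cited defaults, while the learning rate and toy quadratic loss are chosen just to make this example converge quickly):

```python
import numpy as np

def adam_minimise(grad_fn, w, lr=0.05, beta1=0.9, beta2=0.999, eps=1e-8, steps=2000):
    """Minimal Adam loop: first/second moment estimates with bias correction."""
    m = np.zeros_like(w)  # running mean of gradients (first moment)
    v = np.zeros_like(w)  # running mean of squared gradients (second moment)
    for t in range(1, steps + 1):
        g = grad_fn(w)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g ** 2
        m_hat = m / (1 - beta1 ** t)   # bias-corrected first moment
        v_hat = v / (1 - beta2 ** t)   # bias-corrected second moment
        w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w

# Toy example: minimise f(w) = ||w - 3||^2, whose gradient is 2 * (w - 3)
w_opt = adam_minimise(lambda w: 2 * (w - 3), w=np.zeros(2))
print(w_opt)  # approaches [3, 3]
```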
RMSProp
RMSProp adapts learning rates by dividing the gradient by a moving average of squared gradients. This stabilises updates in directions with large gradients and accelerates learning in flatter directions. RMSProp was designed to address issues in non-stationary optimisation problems and is commonly used in training recurrent neural networks.
Convex function
A convex function has the property that any line segment between two points on its graph lies above or on the graph. This structure ensures that every local minimum is a global minimum, making optimisation predictable and stable. Many classical ML losses (e.g., squared error, logistic loss) are convex, which is why they are easier to optimise than deep networks.
Lagrange multipliers
Lagrange multipliers allow optimisation of a function subject to constraints by incorporating the constraints into a new objective, the Lagrangian. Solutions occur where gradients of the original function align with gradients of the constraint. This technique underpins constrained optimisation, dual formulations, and SVMs. It is a core tool in optimisation theory.
Loss landscape
A loss landscape is the geometric surface formed by plotting model parameters against loss values. Its shape determines how hard training is: flat valleys, sharp minima, and saddle points all affect optimiser behaviour. Deep learning landscapes are highly non-convex, but often contain large flat minima that generalise well. Understanding loss landscapes helps explain training stability and generalisation.
Supervised learning
Supervised learning fits a model to labelled examples, learning a mapping from inputs to known outputs. It covers regression (continuous targets) and classification (categorical targets). Performance is judged by how well predictions generalise to unseen data.
Unsupervised learning
Unsupervised learning finds structure in data without labels. It includes clustering, dimensionality reduction, and density estimation. These methods reveal patterns, groupings, or latent structure in the data and are useful for exploration, preprocessing, or feature extraction.
Reinforcement learning
Reinforcement learning trains an agent to make sequential decisions by receiving rewards from an environment. The agent learns a policy that maximises long-term reward through exploration and trial-and-error. RL is used in robotics, game-playing, and resource allocation problems, where actions influence future states.
Training set
The training set is the portion of data used to fit the model’s parameters. It directly influences learned patterns and optimisation. Overfitting can occur when the model learns noise instead of general structure.
Validation set
The validation set is used to tune hyperparameters and select models. It acts as an unbiased checkpoint during training, helping identify overfitting before the test stage. The validation set guides model refinement.
Test set
The test set provides the final unbiased estimate of model performance after all training and tuning are complete. It cannot influence any model decisions or hyperparameters. Good ML practice keeps the test set untouched until the end.
Underfitting
Underfitting occurs when a model is too simple to capture underlying patterns in the data. It results in high error on both training and test sets. Solutions include adding complexity, engineering better features, or reducing regularisation.
Overfitting
Overfitting occurs when a model captures noise or idiosyncrasies of the training data rather than general patterns. This produces low training error but high test error. Regularisation, early stopping, and cross-validation help prevent overfitting.
Cross-validation
Cross-validation evaluates model performance by training and validating on multiple different splits of the data. In k-fold CV, the dataset is divided into k parts, and each part is used once as validation. It reduces variance in performance estimates and provides a robust foundation for hyperparameter tuning.
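A cross-validation sketch assuming scikit-learn is available (the dataset is synthetic and the estimator choice is arbitrary):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))               # synthetic features
y = (X[:, 0] + X[:, 1] > 0).astype(int)     # synthetic binary labels

# 5-fold CV: each fold is used once as the validation split
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5, scoring="accuracy")
print(scores, scores.mean())
```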
Regularisation
Regularisation discourages overly complex models to improve generalisation. It works by adding penalties to the loss function or restricting model behaviour. Common techniques include L1/L2 penalties, dropout, and early stopping.
L1 regularisation
L1 regularisation adds the sum of absolute parameter values to the loss function. It encourages sparsity, driving some weights to exactly zero. This makes L1 useful for feature selection and high-dimensional datasets.
L2 regularisation
L2 regularisation adds the sum of squared parameter values to the loss. It discourages large weights but rarely forces them to zero. L2 stabilises optimisation and is used widely in linear models and neural networks.
Dropout
Dropout randomly disables a subset of neurons during training. This prevents co-adaptation and forces the network to learn redundant, robust representations. Dropout acts as a powerful regulariser in deep learning models.
Early stopping
Early stopping halts training when validation performance stops improving. This prevents the model from overfitting the training data. It is simple, effective, and widely used in neural network training.
Accuracy
Accuracy measures the proportion of predictions that are correct. It is intuitive but unreliable with class-imbalanced datasets. It should only be used when all classes are equally important.
Precision
Precision is the proportion of predicted positives that are true positives: TP / (TP + FP). It answers: “When the model predicts positive, how often is it right?” Precision is important when false positives are costly.
Recall
Recall is the proportion of actual positives correctly identified: TP / (TP + FN). It answers: “Out of all real positives, how many did the model find?” Recall is crucial when missing positive cases is costly.
F1 score
The F1 score is the harmonic mean of precision and recall. It balances the two and is especially useful for imbalanced datasets. A high F1 means the model maintains both good precision and recall.
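The sketch below computes precision, recall, and F1 directly from hypothetical confusion-matrix counts (plain Python):

```python
# Hypothetical confusion-matrix counts
tp, fp, fn = 80, 20, 40

precision = tp / (tp + fp)                           # TP / (TP + FP)
recall = tp / (tp + fn)                              # TP / (TP + FN)
f1 = 2 * precision * recall / (precision + recall)   # harmonic mean of the two

print(f"precision={precision:.2f}, recall={recall:.2f}, F1={f1:.2f}")
# precision=0.80, recall=0.67, F1=0.73
```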
ROC curve
The ROC curve plots true positive rate vs. false positive rate across classification thresholds. It visualises the trade-off between sensitivity and specificity. ROC curves are threshold-independent, making them useful for comparing classifiers.
AUC
AUC (Area Under the ROC Curve) measures how well a model separates positive from negative classes. An AUC of 0.5 corresponds to random ranking and 1.0 to perfect ranking. AUC is widely used in risk scoring, medicine, and fraud detection.
Log-loss
Log-loss (cross-entropy loss) measures the accuracy of predicted probabilities. It heavily penalises confident wrong predictions. It is the standard loss for classification tasks and encourages well-calibrated probabilities.
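A short log-loss computation sketch (NumPy; the labels and predicted probabilities are made up):

```python
import numpy as np

y_true = np.array([1, 0, 1, 1, 0])
y_prob = np.array([0.9, 0.2, 0.6, 0.95, 0.4])  # predicted P(y = 1)

eps = 1e-15                                    # clip to avoid log(0)
p = np.clip(y_prob, eps, 1 - eps)
log_loss = -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))
print(round(log_loss, 3))
```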
RMSE
RMSE (root mean squared error) is the square root of the average squared prediction error. It penalises large errors more than MAE. RMSE is widely used in regression because it measures typical prediction deviation in original units.
MAE
MAE (mean absolute error) averages the absolute differences between predictions and true values. It is more robust to outliers than RMSE. MAE is easy to interpret: “on average, predictions are off by X.”
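RMSE and MAE can be computed side by side, as in the sketch below (NumPy; the values are invented):

```python
import numpy as np

y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_pred = np.array([2.5, 5.0, 4.0, 8.0])

errors = y_pred - y_true
rmse = np.sqrt(np.mean(errors ** 2))  # penalises large errors more heavily
mae = np.mean(np.abs(errors))         # average absolute deviation

print(f"RMSE={rmse:.3f}, MAE={mae:.3f}")
```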
Feature engineering
Feature engineering involves creating or transforming input features to make patterns easier for a model to learn. It includes encoding categories, scaling values, generating interactions, and incorporating domain knowledge. Good feature engineering often matters more than model choice on tabular data.
Hyperparameter tuning
Hyperparameter tuning searches for the combination of settings that yields the best model performance. It is performed using validation data or cross-validation. Hyperparameter tuning can significantly improve results without changing the model architecture.
Grid search
Grid search evaluates all combinations of hyperparameters in a predefined grid. It is simple but becomes computationally expensive with many parameters. It is effective when the search space is small and well understood.
Random search
Random search samples hyperparameters randomly. It is more efficient than grid search in high-dimensional spaces because it explores more diverse configurations. It is often a strong baseline for tuning.
Bayesian optimisation
Bayesian optimisation models performance as a function of hyperparameters using a surrogate model (often a Gaussian process). It chooses new hyperparameters by balancing exploration and exploitation. It is highly effective when model training is expensive.
Data leakage
Data leakage occurs when information from outside the training data (especially from the test set or future data) influences the model during training. This causes unrealistically high performance during evaluation and failure in deployment. Avoiding leakage is one of the most important practical skills in ML.
Linear regression
Linear regression models a continuous target as a weighted linear combination of input features plus an intercept. It is typically fit by minimising squared error (ordinary least squares), which has a closed-form solution and a clear statistical interpretation. It remains a strong, interpretable baseline for regression problems.
Logistic regression
Logistic regression models the probability of a binary class using the logistic (sigmoid) function applied to a linear combination of features. It outputs probabilities and is highly interpretable through log-odds. It remains a strong baseline in classification tasks, especially with high-dimensional sparse data. Regularisation is commonly applied to improve generalisation.
Naive Bayes
Naive Bayes applies Bayes’ theorem with the strong assumption that features are conditionally independent given the class. Despite this unrealistic assumption, it performs extremely well on text and other high-dimensional datasets. It is simple, fast, and produces probabilistic outputs. Common variants include Gaussian, Multinomial, and Bernoulli Naive Bayes.
k-NN (k-nearest neighbours)
k-NN predicts a label based on the majority class (classification) or average value (regression) of the k closest training points. It is non-parametric and makes no assumptions about data distribution. However, it becomes slow on large datasets and suffers in high-dimensional spaces. Feature scaling is crucial for good performance.
Decision tree
A decision tree recursively splits the feature space into regions that maximise class purity or reduce error. Trees are interpretable and capture nonlinear relationships naturally. However, single trees tend to overfit unless pruned. They form the basis of more powerful ensemble methods.
Random forest
A random forest combines many decision trees trained on bootstrapped samples with random feature selection. Averaging many diverse trees reduces variance and improves generalisation. Random forests perform well out-of-the-box, handle mixed data types, and are robust to noise. They are widely used on tabular datasets.
XGBoost
XGBoost is an efficient implementation of gradient boosting that builds trees sequentially to correct residual errors. It incorporates regularisation, learning rate shrinkage, and weighted sampling to improve robustness. XGBoost is a top performer on structured/tabular data and is common in Kaggle competitions and production systems.
LightGBM
LightGBM is a gradient boosting framework optimised for speed and memory efficiency. It uses histogram-based feature binning and leaf-wise tree growth for faster training. It often outperforms XGBoost on large datasets. It also handles categorical variables efficiently.
CatBoost
CatBoost is a gradient boosting library designed to handle categorical features natively using target statistics without leakage. It reduces the need for preprocessing and performs strongly on mixed-type datasets. CatBoost often requires less tuning and excels on tabular problems with categorical richness.
SVM (Support Vector Machine)
An SVM finds the hyperplane that maximises the margin between classes in a feature space. Margin maximisation leads to strong generalisation performance. Through kernel functions (e.g., RBF), SVMs can model highly nonlinear boundaries without explicit feature engineering. They work well on medium-sized datasets and clear-margin problems.
k-means clustering
k-means partitions data into k clusters by alternating between assigning points to the nearest centroid and updating centroids. It is simple and scalable but assumes spherical, similarly sized clusters. It is sensitive to initialisation and does not handle complex cluster shapes well.
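A minimal k-means sketch in NumPy, alternating the assignment and centroid-update steps described above (the data, k, and iteration count are illustrative; real use would add convergence checks and multiple restarts):

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic 2-D data drawn around two different centres
X = np.vstack([rng.normal(loc=0.0, size=(100, 2)),
               rng.normal(loc=5.0, size=(100, 2))])

k = 2
centroids = X[rng.choice(len(X), size=k, replace=False)]  # random initialisation

for _ in range(20):
    # Assignment step: each point goes to its nearest centroid
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    # Update step: centroids move to the mean of their assigned points
    centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])

print(centroids)  # roughly [0, 0] and [5, 5]
```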
DBSCAN
DBSCAN clusters points based on density, identifying core points, border points, and noise. It can find arbitrarily shaped clusters and automatically detects outliers. It works well when cluster density is meaningful but struggles with varying densities or high-dimensional data.
Hierarchical clustering
Hierarchical clustering builds a multi-level clustering tree (dendrogram) either by merging clusters bottom-up (agglomerative) or splitting top-down (divisive). It does not require choosing the number of clusters in advance. The clustering structure can be visualised to choose meaningful groupings.
t-SNE
t-SNE is a nonlinear dimensionality reduction method designed for 2D/3D visualisation of high-dimensional data. It preserves local neighbourhood relationships, making clusters visually distinct. However, it does not preserve global structure and is not used for downstream ML tasks. It is best suited for exploratory analysis of embeddings or image features.
UMAP
UMAP is a nonlinear embedding method that preserves both local and some global structure while being faster and more scalable than t-SNE. It produces meaningful low-dimensional representations and maintains more of the data’s manifold structure. UMAP is popular for visualising embeddings, clustering structure, and high-dimensional biological data.
Hidden Markov Model (HMM)
An HMM models sequences where observations are generated by hidden states that follow a Markov process. It is defined by transition probabilities, emission probabilities, and initial state distributions. HMMs are used in speech recognition, bioinformatics, part-of-speech tagging, and any domain involving sequential latent structure. Algorithms like Viterbi (decoding) and Baum–Welch (learning) enable inference and training.