89 vocabulary flashcards distilling the major machine-learning validation, optimization, feature-selection and explainability concepts covered in the lecture.
Data Augmentation
A technique commonly employed in machine learning to synthetically expand the size and diversity of a training dataset. It involves applying various transformations (e.g., rotations, flips, scaling, cropping, adding noise, color shifts) to existing data samples. This process helps to improve model generalization and robustness, reduce the risk of overfitting by introducing more varied examples, balance class distributions in imbalanced datasets, and lower the costs associated with acquiring new real-world data.
K-Fold Cross-Validation (K-Fold CV)
A robust and widely used validation scheme for evaluating machine learning models, short for K-Fold Cross-Validation. The dataset is partitioned into K equally sized subsets or 'folds'. The validation process is then repeated K times: in each iteration, one fold is designated as the validation set, and the remaining K-1 folds are used as the training set. This ensures every data point is used for validation exactly once and for training K-1 times, leading to more reliable and less biased estimates of model performance than a single train-test split. Common choices for K are 5 or 10, representing a trade-off between bias, variance, and computational cost.
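For illustration only (not from the lecture slides): a minimal Python sketch of 5-fold cross-validation, assuming scikit-learn is available; the iris data and logistic-regression estimator are arbitrary stand-ins.

```python
# Minimal K-Fold CV sketch (assumes scikit-learn is installed).
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold

X, y = load_iris(return_X_y=True)
kf = KFold(n_splits=5, shuffle=True, random_state=0)

scores = []
for train_idx, val_idx in kf.split(X):
    model = LogisticRegression(max_iter=1000)
    model.fit(X[train_idx], y[train_idx])                                 # train on K-1 folds
    scores.append(accuracy_score(y[val_idx], model.predict(X[val_idx])))  # validate on the held-out fold

print(f"mean accuracy over {kf.get_n_splits()} folds: {np.mean(scores):.3f}")
```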
Leave-One-Out Cross-Validation (LOOCV)
A special case of K-Fold Cross-Validation where the number of folds K is equal to the total number of data samples n in the dataset (K = n). In each of the n iterations, a single data point is used as the validation set, and the remaining n-1 data points are used for training. While this method generally provides a very low-bias estimate of the model's test error because the training set in each fold is nearly as large as the original dataset, its main drawbacks are high variance (due to high correlation between the n models trained) and extremely high computational cost, making it impractical for large datasets.
Leave-One-Subject-Out CV (LOSO-CV)
A specialized cross-validation variant crucial when a dataset contains multiple samples originating from the same subject or entity. To prevent data leakage, where information from the validation subject inadvertently influences the training phase, all measurements belonging to one subject are grouped and left out as the validation set in each fold. The model is then trained on data from all other subjects. This ensures that the model's performance generalizes well to entirely new, unseen subjects, rather than simply memorizing subject-specific patterns.
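A hedged sketch of LOSO-CV using scikit-learn's LeaveOneGroupOut; the synthetic data and made-up subject IDs are placeholders, and in practice the groups array would hold the real subject identifiers.

```python
# Leave-One-Subject-Out CV sketch: 'subject_ids' groups all measurements of one subject.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneGroupOut

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 4))               # 60 measurements, 4 features (synthetic)
y = rng.integers(0, 2, size=60)            # binary labels (synthetic)
subject_ids = np.repeat(np.arange(10), 6)  # 10 subjects, 6 measurements each

logo = LeaveOneGroupOut()
scores = []
for train_idx, test_idx in logo.split(X, y, groups=subject_ids):
    model = LogisticRegression(max_iter=500).fit(X[train_idx], y[train_idx])
    scores.append(model.score(X[test_idx], y[test_idx]))  # all samples of one subject held out together

print(f"mean accuracy over {len(scores)} held-out subjects: {np.mean(scores):.3f}")
```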
Bootstrap
A powerful resampling method used to estimate the sampling distribution of a statistic (e.g., mean, median, standard deviation) or to assess the accuracy of a model's estimators. It works by repeatedly drawing samples with replacement from the original observed dataset. If the original dataset has N samples, each bootstrap sample will also contain N samples, drawn randomly with replacement. This process generates many 'pseudo-datasets' (bootstrap samples), each of which can be used to train a model or compute a statistic. This technique is commonly used to estimate the variance, standard errors, and confidence intervals of model parameters or predictions, especially when traditional analytical methods are difficult or impossible.
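A small numpy sketch of a percentile bootstrap for the sample mean; the exponential toy data, 5,000 resamples, and 95% level are arbitrary choices.

```python
# Bootstrap estimate of the standard error and 95% CI of the sample mean.
import numpy as np

rng = np.random.default_rng(42)
data = rng.exponential(scale=2.0, size=200)   # observed dataset (N = 200)

n_boot = 5000
boot_means = np.empty(n_boot)
for b in range(n_boot):
    sample = rng.choice(data, size=data.size, replace=True)  # resample with replacement, same size N
    boot_means[b] = sample.mean()

se = boot_means.std(ddof=1)                               # bootstrap standard error
ci_low, ci_high = np.percentile(boot_means, [2.5, 97.5])  # percentile confidence interval
print(f"mean={data.mean():.3f}  SE={se:.3f}  95% CI=({ci_low:.3f}, {ci_high:.3f})")
```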
Gradient Descent
An iterative first-order optimization algorithm widely used to find the local minimum of a differentiable cost (or loss) function. It operates by repeatedly adjusting the model's parameters (weights and biases) in the direction opposite to the gradient of the cost function with respect to those parameters. The magnitude of each step taken is determined by a crucial hyperparameter called the learning rate (\alpha or \eta), which controls the convergence speed and stability of the optimization process. The procedure continues iteratively for a predefined number of epochs or until the change in the cost function falls below a certain threshold.
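A minimal sketch of gradient descent on a one-parameter quadratic cost J(w) = (w - 3)^2; the learning rate, starting point, and stopping threshold are arbitrary.

```python
# Gradient descent on J(w) = (w - 3)^2; the gradient is dJ/dw = 2*(w - 3).
learning_rate = 0.1
w = 10.0                                  # arbitrary starting point

for epoch in range(100):
    grad = 2.0 * (w - 3.0)                # gradient of the cost at the current w
    w -= learning_rate * grad             # step in the direction opposite to the gradient
    if abs(grad) < 1e-8:                  # stop when the update becomes negligible
        break

print(f"converged to w = {w:.6f} (true minimum at 3.0)")
```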
Simulated Annealing (SA)
A probabilistic metaheuristic optimization algorithm inspired by the annealing process in metallurgy, where metals are heated and slowly cooled to reduce defects and achieve a low-energy state. In this algorithm, the search space is explored by iteratively moving from a current solution to a neighboring one. Unlike greedy algorithms, it occasionally accepts 'worse' moves (i.e., moves that increase the cost function) with a diminishing probability that depends on a 'temperature' parameter (T). Initially, the temperature is high, allowing for greater exploration and acceptance of worse moves (helping to escape local minima). As the optimization progresses, the 'temperature' slowly 'cools' (decreases), making the algorithm less likely to accept worse moves and more likely to converge to an optimal or near-optimal solution.
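A toy simulated-annealing loop on a non-convex 1-D function; the cost function, neighbourhood width, initial temperature, and cooling factor are illustrative choices only.

```python
# Simulated annealing on a 1-D cost with several local minima.
import math
import random

def cost(x):
    return x**2 + 10.0 * math.sin(3.0 * x)   # non-convex objective

random.seed(0)
x = 5.0          # current solution
T = 5.0          # initial temperature (high -> more exploration)
cooling = 0.995  # geometric cooling schedule

for step in range(5000):
    candidate = x + random.uniform(-0.5, 0.5)            # neighbouring solution
    delta = cost(candidate) - cost(x)
    # always accept improvements; accept worse moves with probability exp(-delta / T)
    if delta < 0 or random.random() < math.exp(-delta / T):
        x = candidate
    T *= cooling                                          # temperature slowly 'cools'

print(f"final x: {x:.3f}, cost: {cost(x):.3f}")
```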
Swarm Intelligence
An Artificial Intelligence paradigm inspired by the collective behavior of decentralized, self-organized systems found in nature, such as ant colonies, bird flocks, or fish schools. It involves a multitude of simple, unsophisticated agents interacting locally with each other and with their environment. Despite the lack of central control and the simplicity of individual agents, their collective interactions lead to the emergence of complex, intelligent global behavior that can solve difficult problems, often exhibiting robustness and scalability (e.g., finding optimal paths, resource allocation, and job scheduling).
Ant Colony Optimization (ACO)
A probabilistic technique for solving computational problems, particularly those that can be reduced to finding optimal paths through graphs, inspired by the way ants find the shortest paths from their colony to food sources. Virtual 'ants' explore a graph, laying down virtual 'pheromones' on the edges they traverse. Paths that are frequently chosen by ants accumulate more pheromone, becoming more attractive for subsequent ants. Over time, pheromones also 'evaporate' (decrease), which helps the algorithm forget suboptimal paths and prevents premature convergence to local optima, allowing a broader exploration of the solution space and the discovery of low-cost (short) paths.
Maximum Likelihood Estimation (MLE)
A widely used method for estimating the parameters of a statistical model. The core idea is to find the parameter values that maximize the 'likelihood function', which quantifies how probable it is to observe the given set of data points if the model's parameters were those specific values. Essentially, this estimation technique seeks the parameter values that make the observed data most probable under the assumed statistical distribution. It is widely used in various statistical and machine learning contexts for parameter estimation when assuming a particular probability distribution for the data.
Genetic Algorithm (GA)
An evolutionary optimization and search algorithm inspired by the process of natural selection and genetics. It operates on a 'population' of candidate solutions, often called 'individuals' or 'chromosomes', which are typically initialized randomly. In each 'generation' (iteration), the algorithm iteratively improves the population through three main genetic operators: 1. Selection: Individuals with higher 'fitness' (better solutions) are more likely to be chosen as 'parents' for the next generation. 2. Crossover (Recombination): Genetic material from two parent individuals is combined to create new 'offspring' solutions, inheriting characteristics from both. 3. Mutation: Random small alterations are introduced into the offspring's genetic material to maintain diversity and prevent premature convergence to local optima. This iterative process allows the algorithm to effectively search large and complex solution spaces for optimal or near-optimal solutions.
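A toy genetic algorithm for the 'one-max' problem (maximize the number of 1s in a bitstring), sketching tournament selection, one-point crossover, and bit-flip mutation; all hyperparameters are arbitrary.

```python
# Tiny genetic algorithm maximizing the number of 1s in a bitstring ("one-max").
import random

random.seed(1)
N_BITS, POP_SIZE, N_GEN, P_MUT = 30, 40, 60, 0.02

def fitness(ind):
    return sum(ind)                                  # number of 1s

def tournament(pop):
    a, b = random.sample(pop, 2)
    return a if fitness(a) >= fitness(b) else b      # selection: fitter of two random individuals

pop = [[random.randint(0, 1) for _ in range(N_BITS)] for _ in range(POP_SIZE)]
for gen in range(N_GEN):
    new_pop = []
    while len(new_pop) < POP_SIZE:
        p1, p2 = tournament(pop), tournament(pop)
        cut = random.randint(1, N_BITS - 1)                                   # one-point crossover
        child = p1[:cut] + p2[cut:]
        child = [1 - g if random.random() < P_MUT else g for g in child]      # bit-flip mutation
        new_pop.append(child)
    pop = new_pop

best = max(pop, key=fitness)
print(f"best fitness after {N_GEN} generations: {fitness(best)}/{N_BITS}")
```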
Mutation (GA)
In the context of Genetic Algorithms, this is a genetic operator that introduces random alterations or changes in the 'genes' (components representing parameters or features) of an 'individual' (a candidate solution) within the population. This process is typically applied with a small probability after crossover. Its primary purpose is to maintain genetic diversity within the population from one generation to the next, prevent premature convergence to local optima by exploring new regions of the search space, and introduce potentially novel characteristics that might lead to better solutions.
Crossover (GA)
In Genetic Algorithms, this genetic operator (also known as recombination) combines genetic material from two 'parent' individuals (candidate solutions) to produce one or more new 'offspring' individuals. This process mimics biological reproduction and allows for the exploration of new solution combinations by mixing the traits (genes) of successful parents. Common types include one-point crossover (where a single point is chosen to swap segments), two-point crossover, or uniform crossover. This operator is crucial for efficient exploration of the search space and for creating new and potentially better solutions.
Fitness Threshold (GA)
In a Genetic Algorithm, this refers to a predefined stopping criterion that dictates when the evolutionary process terminates. 'Fitness' quantifies how well a candidate solution (individual) performs with respect to the problem's objective. During the evolutionary process, the algorithm monitors the 'best fitness' achieved by any individual in the current population. Once this best fitness value reaches or exceeds the specified threshold, the algorithm halts, assuming that a sufficiently good solution has been found. This threshold acts as a targeted performance goal for the optimization process.
Feature
An individual measurable property or characteristic of a phenomenon being observed, also known as an attribute or predictor variable. In the context of machine learning, these are the input variables used by a model to make predictions or classifications. They can be numerical (e.g., age, blood pressure), categorical (e.g., gender, disease status), or more abstract representations, but they must be relevant and ideally informative to the target problem. The quality and selection of these properties significantly impact model performance.
Clinically Relevant Feature
A measured variable that is not only statistically significant but also practically meaningful and useful in a clinical context, typically for diagnosis, prognosis, or monitoring. Statistically, it should demonstrate a significant difference between groups (e.g., diseased vs. healthy), often assessed by a small p-value (e.g., p < 0.05). Beyond statistical significance, the effect it captures should be large and reliable enough to actually inform clinical decisions.
Feature Selection
The process of choosing a subset of the most relevant and informative variables (features) from an original set of features in a dataset. The primary goals of this process are to: 1. Improve Model Accuracy: By removing irrelevant or redundant features that can introduce noise or bias. 2. Enhance Interpretability: Models built on fewer, more meaningful features are often easier to understand. 3. Reduce Computational Cost: Training and inference become faster with fewer dimensions. 4. Mitigate Overfitting: Simpler models with fewer features are less likely to overfit the training data. Common methods include filter, wrapper, and embedded techniques.
Univariate Evaluation
A method of assessing the relevance or performance of individual features in isolation, rather than considering their interactions with other features. Each feature is evaluated independently against the target variable using various statistical metrics. Common metrics include: the p-value (to test the statistical significance of the relationship between the feature and the target), Area Under the Curve (AUC) for classification tasks (measuring discriminative power), or explained variance / R-squared (R^2) for regression tasks. While simple and fast, this evaluation method might overlook features that are only informative when combined with others.
Relief Method
A heuristic feature-weighting algorithm used for feature selection, particularly effective for identifying attributes that are relevant in the presence of strong dependencies between features. For each instance in the dataset, this method identifies its 'nearest hit' (the closest instance of the same class) and its 'nearest miss' (the closest instance of a different class). It then updates a relevance weight for each feature: increasing the weight if the feature's value differs between the instance and its nearest hit, and decreasing it if the feature's value differs between the instance and its nearest miss. Features with higher final weights are considered more relevant, as they tend to distinguish between classes while being consistent within classes.
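A rough numpy sketch of the basic (binary-class) Relief weighting loop described above; the synthetic two-feature dataset, min-max scaling, and Manhattan distance are illustrative assumptions.

```python
# Sketch of the basic (binary-class) Relief feature-weighting algorithm.
import numpy as np

def relief(X, y, n_iter=None):
    X = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0) + 1e-12)  # scale features to [0, 1]
    n, d = X.shape
    n_iter = n if n_iter is None else n_iter
    w = np.zeros(d)
    for i in range(n_iter):
        dists = np.abs(X - X[i]).sum(axis=1)                # Manhattan distance to every sample
        dists[i] = np.inf                                    # exclude the sample itself
        same, diff = (y == y[i]), (y != y[i])
        hit = np.argmin(np.where(same, dists, np.inf))       # nearest hit (same class)
        miss = np.argmin(np.where(diff, dists, np.inf))      # nearest miss (different class)
        w -= np.abs(X[i] - X[hit]) / n_iter                  # penalize differing from the nearest hit
        w += np.abs(X[i] - X[miss]) / n_iter                 # reward differing from the nearest miss
    return w

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=100)
X = np.column_stack([y + 0.1 * rng.normal(size=100),         # informative feature
                     rng.normal(size=100)])                  # pure-noise feature
print(relief(X, y))                                          # first weight should be clearly larger
```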
Collinearity
A statistical phenomenon where two or more predictor variables (features) in a multiple regression model are highly correlated with each other. While not inherently problematic for prediction if the goal is purely predictive accuracy, it can significantly impair the interpretation and stability of individual regression coefficient estimates. High levels of this phenomenon mean the model struggles to determine the unique contribution of each correlated predictor, leading to inflated standard errors of coefficients, making them statistically insignificant, and potentially unstable model estimates.
Multicollinearity
An extreme form of collinearity, occurring when one predictor variable in a multiple regression model can be linearly predicted from the others with a high degree of accuracy, or when multiple predictor variables are highly correlated with each other. This situation causes the regression coefficients to become unstable and less interpretable, as changes in one predictor's value might be offset by changes in others, making it difficult to isolate the individual effect of each variable. Variance Inflation Factor (VIF) is a common metric used to detect and quantify this issue.
Correlation Coefficient Matrix (Heat Map)
A square table that quantifies and displays the pairwise correlation coefficients between multiple variables in a dataset. Often visualized as a Heat Map, each cell (i, j) in the matrix represents the correlation coefficient (e.g., Pearson's \rho) between variable i and variable j. The diagonal elements are always 1, as a variable's correlation with itself is perfect. This visual tool is commonly used to detect feature redundancy: if two features have a very high absolute correlation coefficient (e.g., | \rho | > 0.75 or 0.8), it suggests they provide similar information, and one might be considered redundant and a candidate for removal to simplify the model or mitigate multicollinearity.
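A short pandas/matplotlib sketch of computing and plotting a correlation matrix; the toy DataFrame and the |r| > 0.8 redundancy rule of thumb are placeholders.

```python
# Pairwise Pearson correlation matrix and a simple heat-map visualization.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "age": rng.normal(50, 10, 200),
    "weight": rng.normal(80, 15, 200),
})
df["bmi"] = df["weight"] / 3.0 + rng.normal(0, 1, 200)   # deliberately correlated with weight

corr = df.corr()                                         # Pearson correlation; diagonal is 1.0
print(corr.round(2))                                     # off-diagonal |r| > 0.8 -> candidate for removal

plt.imshow(corr.values, vmin=-1, vmax=1, cmap="coolwarm")
plt.xticks(range(len(corr)), corr.columns, rotation=45)
plt.yticks(range(len(corr)), corr.columns)
plt.colorbar(label="Pearson r")
plt.title("Correlation heat map")
plt.tight_layout()
plt.show()
```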
Sequential Forward Selection (SFS)
A bottom-up, wrapper-based feature selection method. It initiates with an empty set of features. In each step, it iteratively adds the feature from the remaining set that, when combined with the already selected features, yields the best improvement in the model's performance (e.g., accuracy, R-squared) as evaluated on a validation set or through cross-validation. This process continues until no further significant improvement is observed, or a predefined number of features is reached. This algorithm is greedy, meaning it does not re-evaluate features once added, which can sometimes lead to suboptimal global solutions.
Sequential Backward Selection (SBS)
A top-down, wrapper-based feature selection method. It begins with the full set of all available features. In each iterative step, it considers removing one feature from the current set. The feature whose removal results in the smallest degradation (or even improvement) in the model's performance (as assessed by a chosen evaluation metric on a validation set or via cross-validation) is permanently removed. This process continues until no further improvement is observed by removing features, or a desired number of features is left. Like SFS, this algorithm is also greedy.
Dimensionality Reduction
A set of techniques used to transform data from a high-dimensional space into a lower-dimensional space. The primary goal is to thereby reduce the number of input variables while attempting to retain as much relevant information as possible. This is often done to combat the 'curse of dimensionality,' which can lead to sparse data, increased computational cost, and overfitting in high-dimensional spaces. Methods like Singular Value Decomposition (SVD) and Principal Component Analysis (PCA) achieve this by creating new, composite features (components) that capture the underlying variance or structure in the original data.
Singular Value Decomposition (SVD)
A powerful matrix factorization technique that decomposes an m \times n matrix A into three constituent matrices: A = U \Sigma V^T. Here:

* U is an m \times m orthogonal matrix whose columns are the left singular vectors of A.
* \Sigma (often denoted as \Lambda) is an m \times n rectangular diagonal matrix with non-negative real numbers on the diagonal, called the singular values of A. These singular values are typically ordered in descending magnitude and quantify the amount of variance or 'information content' captured by each corresponding dimension.
* V^T is the transpose of an n \times n orthogonal matrix V, whose columns are the right singular vectors of A.

This decomposition is fundamental in dimensionality reduction, noise reduction, and recommender systems, as it allows for approximating the original matrix by retaining only the largest singular values and their corresponding vectors, effectively capturing the most significant patterns in the data.
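A small numpy illustration of SVD and a rank-k approximation; the random 6x4 matrix and k = 2 are arbitrary.

```python
# SVD of a matrix and a rank-k approximation keeping the largest singular values.
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(6, 4))

U, s, Vt = np.linalg.svd(A, full_matrices=False)   # A = U @ diag(s) @ Vt
k = 2
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]        # best rank-2 approximation (in Frobenius norm)

print("singular values:", np.round(s, 3))
print("'energy' captured by top-2:", round((s[:k]**2).sum() / (s**2).sum(), 3))
print("approximation error:", round(np.linalg.norm(A - A_k), 3))
```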
Principal Component Analysis (PCA)
An unsupervised dimensionality reduction technique used to transform a dataset's existing variables into a new set of orthogonal (uncorrelated) variables called 'principal components.' It achieves this by finding directions (components) in the data that capture the maximum amount of variance. The first principal component accounts for the largest possible variance, and each succeeding component accounts for the next highest variance under the constraint that it is orthogonal to the preceding components. This technique is particularly useful for mitigating multicollinearity because the new components are uncorrelated, simplifying model interpretation and improving stability. As an unsupervised method, it does not use information about the target variable during the transformation.
Noise (in data)
Within data analysis, this refers to unwanted, random, or irrelevant variations, errors, or inaccuracies present in a dataset that can obscure the true underlying patterns or signals. It can originate from various sources, such as measurement errors, faulty sensors, data entry mistakes, or natural variability. While typically viewed as detrimental (reducing model accuracy, increasing training time, and complicating analysis), in some contexts, what is considered 'noise' might contain useful information, or its structure might be indicative of certain phenomena. Examples include electrical interference in medical signals, environmental disturbances, or random fluctuations in survey responses.
Digital Filter
An algorithmic operation applied to a discrete-time signal to modify its frequency content. The primary purpose is typically to remove unwanted components (like noise) or to isolate specific frequency bands. These filters operate in either the frequency domain (by selectively attenuating or amplifying specific frequencies, like a low-pass filter allowing only low frequencies, or a high-pass filter allowing high frequencies) or the time domain (by convolving the signal with a filter kernel). Common types include low-pass, high-pass, band-pass, and band-stop filters, each designed to pass certain frequency ranges and block others based on their cutoff frequencies. They are crucial in signal processing for tasks like denoising, smoothing, and feature extraction.
Morphological Filter
A non-linear spatial-domain operation primarily used in image processing for tasks such as noise removal (e.g., salt-and-pepper noise), object boundary extraction, segmentation, and shape analysis. These filters operate based on set theory principles and are applied using a 'structuring element' or 'kernel' over the image. Key operations include:

* Dilation: Expands the shapes in an image, effectively 'growing' bright regions or filling small holes.
* Erosion: Shrinks shapes, effectively 'shrinking' bright regions or removing small spurious objects.
* Opening: An operation consisting of erosion followed by dilation, used to remove small objects and smooth object contours.
* Closing: An operation consisting of dilation followed by erosion, used to fill small holes and gaps within objects.

These filters are particularly effective for binary images but can be extended to grayscale images.
Median Filtering
A non-linear digital filtering technique commonly used for noise reduction, especially for suppressing 'salt-and-pepper' noise or other sharp, spike-like artifacts in signals or images. Unlike averaging filters that replace a pixel's value with the mean of its neighbors, this technique replaces each value with the median value of its surrounding neighbors within a defined window or kernel. Because the median is less sensitive to extreme outliers than the mean, this filter is highly effective at preserving edges and details while specifically removing impulse noise, making it a robust smoothing technique.
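A brief scipy sketch of median filtering a 1-D signal with injected impulse noise; the sine test signal and kernel size of 5 are arbitrary.

```python
# Median filtering of a 1-D signal corrupted by impulse ("salt-and-pepper") noise.
import numpy as np
from scipy.signal import medfilt

rng = np.random.default_rng(0)
t = np.linspace(0, 1, 500)
signal = np.sin(2 * np.pi * 5 * t)

noisy = signal.copy()
spikes = rng.choice(t.size, size=25, replace=False)
noisy[spikes] = rng.choice([-3.0, 3.0], size=25)   # sharp, spike-like artifacts

clean = medfilt(noisy, kernel_size=5)              # each sample replaced by the median of its window
print("max abs deviation from the clean signal after filtering:",
      round(np.max(np.abs(clean - signal)), 3))
```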
Averaging (Epoch Averaging)
A common noise-reduction technique applied to signals where a response or event is repeatedly evoked (forming 'epochs') and measured, often referred to as Epoch Averaging. By averaging multiple repetitions (epochs) of the signal, random uncorrelated noise tends to cancel itself out (due to its random nature), while the coherent underlying signal reinforces. The signal-to-noise ratio (SNR) improves proportionally to the square root of the number of averaged epochs (\sqrt{M}), meaning the noise amplitude decreases approximately by a factor of 1/\sqrt{M} for M epochs. This method is widely used in neuroscience (e.g., Event-Related Potentials, ERPs) and other fields where signals are weak and buried in noise.
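A numpy sketch of epoch averaging on a synthetic evoked response, checking the 1/\sqrt{M} noise reduction; the Gaussian 'ERP' shape, noise level, and M = 100 are made up.

```python
# Epoch averaging: averaging M noisy repetitions reduces noise by roughly 1/sqrt(M).
import numpy as np

rng = np.random.default_rng(0)
t = np.linspace(0, 1, 400)
erp = 2.0 * np.exp(-((t - 0.3) ** 2) / 0.002)         # the coherent, repeatable signal

M = 100
epochs = erp + rng.normal(0, 2.0, size=(M, t.size))   # M epochs, each buried in noise (sigma = 2)

average = epochs.mean(axis=0)                         # coherent signal reinforces, noise cancels
residual_noise = (average - erp).std()
print(f"single-epoch noise std = 2.0, after averaging = {residual_noise:.2f} "
      f"(theory: 2.0/sqrt({M}) = {2.0/np.sqrt(M):.2f})")
```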
Blind Source Separation (BSS)
A family of computational methods used to separate a set of observed mixed signals into their individual underlying source signals, with little or no prior information about the nature of the source signals or the mixing process (hence 'blind'). For example, it can separate individual voices from a recording of multiple people speaking simultaneously from a single microphone. Techniques like Independent Component Analysis (ICA) are prominent within this field, aiming to find components that are statistically as independent from each other as possible, which often corresponds to the original source signals.
Independent Component Analysis (ICA)
A computational method for Blind Source Separation (BSS) that aims to find an underlying set of statistically independent source signals from a multivariate dataset of observed mixed signals. Unlike Principal Component Analysis (PCA), which seeks uncorrelated components, this method seeks components that are as statistically independent as possible (meaning the value of one component provides no information about the value of another). This is often achieved by optimizing for non-Gaussianity, as typically at most one of the independent sources can be Gaussian. Common approaches to maximize independence involve maximizing functions like kurtosis or using Maximum Likelihood Estimation (MLE) based on non-Gaussian assumptions. This technique is widely used in biomedical signal processing (e.g., EEG, fMRI) to separate artifacts or distinct brain activity signals.
Classification
A supervised machine learning task where the objective is to predict the categorical class label of a given input data sample based on patterns learned from a labeled training dataset. 'Supervised' implies that the model learns from examples where the correct output (class) is already known. Common types of tasks include:

* Binary Classification: Assigning samples to one of two mutually exclusive classes (e.g., spam/not spam, disease/no disease).
* Multiclass Classification: Assigning samples to one of more than two mutually exclusive classes (e.g., classifying animal species: cat, dog, bird).
* Multilabel Classification: Assigning samples to multiple non-mutually exclusive classes simultaneously (e.g., an image of an animal can be both 'mammal' and 'pet').
* Image Segmentation: Classifying each pixel in an image to belong to a certain object or background class.
Clinical Prevalence
The proportion of individuals in a specific population who have a particular disease or medical condition at a given time point or over a specified period. In machine learning, especially in medical diagnostics, a low value of this metric means that the number of positive (diseased) cases in an available dataset is significantly smaller than the number of negative (healthy) cases. This imbalance can lead to challenges for classification models, as they may become biased towards the majority class, resulting in poor performance on the minority (diseased) class, even if overall accuracy appears high.
Confusion Matrix
A tabular summary used to evaluate the performance of a classification model, especially on a set of test data for which the true class labels are known. It allows for the visualization of the model's performance beyond simple overall accuracy. For a binary classification problem, it typically has four entries:

* True Positives (TP): Instances that were correctly predicted as positive.
* True Negatives (TN): Instances that were correctly predicted as negative.
* False Positives (FP): Instances that were incorrectly predicted as positive (Type I error).
* False Negatives (FN): Instances that were incorrectly predicted as negative (Type II error).

From these values, various performance metrics can be calculated, such as Sensitivity/Recall (TP / (TP + FN)), Specificity (TN / (TN + FP)), Precision (TP / (TP + FP)), Accuracy ((TP + TN) / (TP + TN + FP + FN)), and F1-score.
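A short sketch deriving these metrics from scikit-learn's confusion_matrix on a made-up pair of label vectors.

```python
# Deriving common metrics from a binary confusion matrix.
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0, 1, 0])

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
sensitivity = tp / (tp + fn)          # recall / true positive rate
specificity = tn / (tn + fp)
precision   = tp / (tp + fp)
accuracy    = (tp + tn) / (tp + tn + fp + fn)
f1          = 2 * precision * sensitivity / (precision + sensitivity)

print(f"TP={tp} TN={tn} FP={fp} FN={fn}")
print(f"sens={sensitivity:.2f} spec={specificity:.2f} "
      f"prec={precision:.2f} acc={accuracy:.2f} F1={f1:.2f}")
```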
Linear Discriminant Analysis (LDA)
A supervised dimensionality reduction and classification technique primarily used to find a linear combination of features that best separates two or more classes of objects or events. It aims to maximize the ratio of 'between-class scatter' (the variance between the means of different classes) to 'within-class scatter' (the variance within each class). By doing so, this technique projects the data onto a lower-dimensional space where classes are maximally separable, making it useful both for classification and for visualizing class separation. As a supervised method, it explicitly uses the class labels during its learning process.
Logistic Regression
A statistical classification algorithm used to model the probability of a binary outcome (e.g., 0 or 1, true or false) based on one or more predictor variables. Despite its name, it is fundamentally a classification model, not a regression model for continuous outcomes. It operates by using a sigmoid (logistic) function to transform the linear combination of inputs into a probability score between 0 and 1. This output can then be thresholded to assign a final class. The coefficients associated with each feature indicate the strength and direction of the relationship between that feature and the log-odds of the outcome, thereby implying feature importance.
Naïve Bayes Classifier
A probabilistic machine learning model commonly used for classification tasks. It is based on Bayes' Theorem with a 'naïve' assumption: that all features are conditionally independent of each other given the class label. This simplification makes the computation of probabilities feasible, even with high-dimensional data. Despite this strong independence assumption, these classifiers often perform surprisingly well in practice, especially in text classification (e.g., spam detection) and disease diagnosis. The model calculates the probability of each class given the input features and assigns the sample to the class with the highest posterior probability.
Fuzzy Logic
A form of many-valued logic in which the truth values of variables may be any real number between 0 and 1, inclusive, rather than just the crisp true (1) or false (0). This allows for representing and reasoning with vagueness and uncertainty, mimicking human intuition. Instead of strict binary classification, it assigns 'partial membership' to sets, enabling a system to understand concepts like 'slightly tall' or 'moderately ill.' This framework is particularly useful in expert systems, control systems, and clinical decision support where rules are subjective, imprecise, or based on linguistic variables (e.g., 'If temperature is high AND cough is mild, then assess for flu').
Decision Tree
A non-parametric supervised learning method used for both classification and regression tasks that is structured like a flowchart. Each internal 'node' represents a test on a specific feature (e.g., 'Is age > 30?'), each 'branch' represents the outcome of the test, and each 'leaf node' represents a class label (for classification) or a numerical value (for regression). The tree is built by recursively splitting the dataset based on feature thresholds chosen to maximize 'information gain' or minimize 'Gini impurity' at each step. While highly interpretable, these trees can be prone to overfitting if not properly pruned or limited in depth.
Information Gain
A metric primarily used in the construction of Decision Trees to determine the most effective feature to split the data at each node. It quantifies the reduction in 'entropy' (a measure of impurity or randomness in the data) after a dataset is split based on a particular feature. A higher value of this metric indicates a better split, meaning the chosen feature more effectively separates the data into purer, more homogeneous subsets with respect to the target variable. The algorithm selects the feature that yields the maximum value of this metric for the current node.
Gini Index
Also known as Gini impurity, this is a metric used as an alternative to information gain (based on entropy) for evaluating the quality of a split in a Decision Tree. It measures the impurity of a node, with 0 indicating a perfectly pure node where all samples belong to a single class, and higher values indicating greater impurity. The objective is to choose a split that minimizes the impurity (Gini index) of the resulting child nodes. While similar in purpose to entropy, this metric is computationally slightly faster as it doesn't involve logarithms, and it tends to isolate the most frequent class in its own branch more often than entropy.
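A compact numpy sketch of entropy, Gini impurity, and the information gain of one candidate split; the tiny label arrays are illustrative.

```python
# Entropy, Gini impurity, and the information gain of a candidate split.
import numpy as np

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def gini(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def information_gain(parent, left, right):
    n = len(parent)
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - weighted                    # reduction in entropy due to the split

parent = np.array([0, 0, 0, 0, 1, 1, 1, 1])
left, right = parent[:4], parent[4:]                     # a perfect split on some feature threshold
print("entropy(parent):", entropy(parent))               # 1.0
print("gini(parent):", gini(parent))                     # 0.5
print("information gain of split:", information_gain(parent, left, right))  # 1.0
```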
Bagging (Bootstrap Aggregating)
An ensemble machine learning technique designed to improve the stability and accuracy of predictive models and reduce variance, short for Bootstrap Aggregating. It involves building multiple versions of a predictor and then combining them for a final prediction. The process works as follows: 1. Bootstrap Samples: Multiple diverse subsets (bootstrap samples) are created from the original training dataset by sampling with replacement. 2. Model Training: A base learning algorithm (e.g., a decision tree) is trained independently on each of these bootstrap samples, resulting in multiple distinct models. 3. Aggregation: For classification, the final prediction is made by aggregating the outputs through voting (majority class). For regression, it's typically done by averaging the predictions. This technique effectively reduces variance and helps prevent overfitting by training models on slightly different datasets and averaging out their individual errors.
Random Subspace Method
An ensemble learning technique that promotes diversity among base learners by training each learner on a randomly selected subset of the original features (a 'random subspace'). Unlike Bagging, which samples data instances, this method samples features. By restricting the features available to each base model, it encourages different models to focus on different aspects of the data, thereby reducing correlation between the models and increasing ensemble diversity. This approach is particularly useful in high-dimensional datasets where many features might be noisy or redundant, and is a core component of algorithms like Random Forests.
Random Forest
A powerful and popular ensemble learning method applicable for both classification and regression tasks. It extends the concept of Bagging by building a 'forest' of multiple Decision Trees. Its strength comes from two key sources of randomness: 1. Bagging: Each tree in the forest is trained on a different bootstrap sample of the original training data. 2. Candidate Feature Randomness: When splitting a node in a tree, only a random subset of features is considered for the optimal split, rather than all available features. These two randomizations ensure that the individual trees are diverse and decorrelated. The final prediction is made by averaging the predictions of all individual trees (for regression) or by majority voting (for classification). These models are known for their high accuracy, robustness to overfitting, and ability to handle high-dimensional data.
Artificial Neuron
The fundamental building block of an artificial neural network, also known as a perceptron or node, mimicking the biological neuron. It performs a basic computation: 1. Weighted Sum: It receives multiple input signals (x_1, x_2, \ldots, x_n), each multiplied by an associated 'weight' (w_1, w_2, \ldots, w_n), representing the strength of the connection. 2. Bias: A constant term 'bias' (b) is added to the weighted sum. 3. Activation Function: The total sum (z = \sum_{i=1}^{n} w_i x_i + b) is then passed through a non-linear 'activation function' (e.g., ReLU, sigmoid, tanh), which introduces non-linearity into the network, allowing it to learn complex patterns. The output of the activation function is the neuron's output (a = f(z)), which can serve as input to other neurons.
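A minimal numpy forward pass for a single neuron with a sigmoid activation; the input, weight, and bias values are arbitrary.

```python
# Forward pass of a single artificial neuron: weighted sum + bias + sigmoid activation.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([0.5, -1.2, 3.0])        # inputs x_1..x_n
w = np.array([0.8, 0.1, -0.4])        # weights w_1..w_n
b = 0.2                               # bias

z = np.dot(w, x) + b                  # z = sum_i w_i * x_i + b
a = sigmoid(z)                        # neuron output a = f(z)
print(f"z = {z:.3f}, a = {a:.3f}")
```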
Backpropagation
The primary algorithm used to train Artificial Neural Networks, an abbreviation for 'backward propagation of errors.' It is a supervised learning method that works by efficiently computing the gradient of the loss function with respect to each weight in the network. The process involves two phases: 1. Forward Pass: Input data is fed through the network to generate an output prediction, and the loss (error) is calculated. 2. Backward Pass: The calculated loss is propagated backward from the output layer to the input layer. Using the chain rule of calculus, the algorithm determines how much each weight and bias contributed to the error. These calculated gradients are then used by an optimization algorithm (e.g., Gradient Descent) to iteratively adjust (update) the weights and biases in the network, aiming to minimize the loss function and improve model accuracy. This iterative process continues over many 'epochs'.
Overfitting
A common phenomenon in machine learning where a model learns the training data too well, effectively memorizing specific patterns, noise, and outliers present only in the training set rather than learning the underlying general relationships. Consequently, when presented with 'unseen' or new data, the model performs poorly because it fails to generalize its learned patterns. This typically occurs when a model is excessively complex for the available training data, or when it has been trained for an excessively long period.
Early Stopping
A regularization technique employed during the training of iterative machine learning models (especially neural networks) to prevent overfitting and improve generalization performance. It involves monitoring the model's performance (e.g., loss or accuracy) on a separate 'validation set' during training. Instead of training for a fixed number of epochs, training is halted prematurely when the performance on the validation set begins to degrade (e.g., validation loss starts increasing), even if the training loss continues to decrease. This strategy ensures that the model captures the underlying patterns without memorizing the noise specific to the training data.
Convolutional Neural Network (CNN)
A specialized type of Neural Network architecture primarily designed to process data with a known grid-like topology, such as images (2D grid of pixels) or time series (1D grid), also known as ConvNet. Its distinctive features are:

* Convolutional Layers: These layers apply 'filters' or 'kernels' (small matrices of learnable weights) that slide across the input data, performing dot products to detect local patterns and extract features like edges, textures, or shapes. This process results in 'feature maps.'
* Pooling Layers (e.g., Max Pooling): These layers progressively reduce the spatial dimensions of the feature maps, thereby decreasing the number of parameters and computations, and providing invariance to small translations.

These networks are dominant in computer vision tasks like image classification, object detection, and segmentation due to their ability to automatically learn hierarchical spatial features.
Kernel (Filter)
In a Convolutional Neural Network (CNN), this term (also known as a filter or feature detector) refers to a small, learnable matrix of weights that slides (convolves) across the input data (e.g., an image). At each position, it performs an element-wise multiplication with the local receptive field of the input and sums the results. This operation highlights specific local patterns or features, such as edges, corners, or textures. Different kernels are specialized to detect different features. The output of a kernel sliding across an entire input is a 'feature map', indicating where that particular feature is present in the input.
Feature Map
In a Convolutional Neural Network (CNN), this (also known as an activation map) refers to the output of a convolutional layer after applying a specific kernel (filter) across the input data. When a kernel slides over the input data, it computes a dot product at each spatial position, and these results populate the map. Each map highlights the presence and strength of a particular feature (e.g., a vertical edge, a specific texture) detected by its corresponding kernel across the entire input. A convolutional layer typically consists of multiple kernels, each producing its own map, which collectively form the input to subsequent layers.
Max Pooling
A common down-sampling operation used in Convolutional Neural Networks (CNNs) after convolutional layers. It partitions the input feature map into a set of non-overlapping rectangular 'windows' or 'regions'. For each window, it extracts (takes) the maximum value, discarding all other values within that window. This operation serves several purposes:

* Dimensionality Reduction: It reduces the spatial dimensions (width and height) of the feature map, thereby decreasing computational cost and memory usage.
* Feature Salience: By taking the maximum, it helps to identify the most salient or prominent features within each window, essentially distilling the most important information.
* Translation Invariance: It provides a degree of translation invariance, meaning that small shifts in the input feature's position do not significantly change the pooled output, making the model more robust.
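A numpy sketch of 2x2 max pooling with stride 2 on a small hand-written feature map.

```python
# 2x2 max pooling with stride 2 on a small feature map, using only numpy reshaping.
import numpy as np

feature_map = np.array([
    [1, 3, 2, 0],
    [4, 2, 1, 1],
    [0, 1, 5, 2],
    [2, 1, 3, 4],
])

h, w = feature_map.shape
# group the map into non-overlapping 2x2 windows, then take the max of each window
pooled = feature_map.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))
print(pooled)
# [[4 2]
#  [2 5]]
```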
Autoencoder
A type of unsupervised Artificial Neural Network designed to learn an efficient, compressed representation (encoding) of input data. It consists of two main parts:

* Encoder: Maps the input data into a lower-dimensional latent space representation (often called the 'bottleneck' or 'code' layer).
* Decoder: Reconstructs the original input data from this compressed representation.

The network is trained to minimize the reconstruction error (the difference between the original input and the reconstructed output). These networks can be used for various tasks, including: dimensionality reduction (the bottleneck layer provides a compact representation), denoising (by training on noisy data and reconstructing clean data), anomaly detection (anomalies are poorly reconstructed), and feature learning (the learned encoding can serve as features for other tasks).
Recurrent Neural Network (RNN)
A type of Artificial Neural Network specifically designed to process sequential data (e.g., time series, speech, text) where the order and context of elements are crucial. Unlike traditional feedforward networks, these networks have 'loops' or connections in their architecture that allow information to persist from one step of the sequence to the next. This internal memory mechanism enables them to capture temporal dependencies and context in the input sequence. However, basic versions of these networks suffer from vanishing and exploding gradient problems, making it difficult to learn long-term dependencies, which led to the development of more advanced variants like LSTMs and GRUs.
Long Short-Term Memory (LSTM)
A sophisticated variant of Recurrent Neural Networks (RNNs) designed to overcome the vanishing gradient problem, allowing them to effectively learn and remember long-term dependencies in sequential data. This is achieved through a unique architecture containing a 'cell state' and three specialized 'gates' that regulate the flow of information into and out of the cell: 1. Forget Gate: Decides which information to discard from the cell state. 2. Input Gate: Decides which new information to store in the cell state. 3. Output Gate: Controls what information from the cell state is outputted. This gating mechanism gives these networks a more refined control over what information is kept or forgotten, making them highly effective for tasks like natural language processing, speech recognition, and time series prediction.
K-Means Clustering
An unsupervised, partitional clustering algorithm that aims to partition n observations into K distinct clusters, where each observation belongs to the cluster with the nearest mean (centroid). The algorithm works iteratively: 1. Initialization: Randomly select K data points as initial cluster centroids. 2. Assignment Step: Each data point is assigned to its closest centroid based on a chosen distance metric (e.g., Euclidean distance). 3. Update Step: The centroid for each cluster is re-calculated as the mean of all data points assigned to that cluster. Steps 2 and 3 are repeated until the cluster assignments no longer change or a maximum number of iterations is reached. The 'K' parameter, representing the number of clusters, must be specified beforehand.
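A bare-bones numpy K-means loop (assignment step, then centroid update) on two synthetic Gaussian blobs; K = 2 and the initialization scheme are illustrative.

```python
# Minimal K-means: assign points to the nearest centroid, then recompute centroids.
import numpy as np

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (50, 2)),           # blob around (0, 0)
               rng.normal(4, 0.5, (50, 2))])          # blob around (4, 4)
K = 2
centroids = X[rng.choice(len(X), K, replace=False)]   # random initialization from the data

for _ in range(100):
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)       # (n, K) distances
    labels = dists.argmin(axis=1)                                                # assignment step
    new_centroids = np.array([X[labels == k].mean(axis=0) for k in range(K)])    # update step
    if np.allclose(new_centroids, centroids):                                    # assignments stable
        break
    centroids = new_centroids

print("centroids:\n", centroids.round(2))
```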
Hierarchical Clustering
An unsupervised clustering method that creates a hierarchy of clusters, often visualized as a dendrogram. It has the advantage of not requiring the number of clusters to be specified in advance. There are two main approaches: 1. Agglomerative (Bottom-Up): Starts with each data point as its own cluster. In each step, the two closest clusters are merged until only one large cluster remains. Different 'linkage' criteria (e.g., single, complete, average) determine how the distance between clusters is measured. 2. Divisive (Top-Down): Starts with all data points in one large cluster. In each step, the most heterogeneous cluster is split into two smaller clusters until each data point is its own cluster. The output is a tree-like structure (dendrogram) that allows users to decide on the number of clusters by cutting the dendrogram at a desired distance (or dissimilarity) threshold.
Dendrogram
A tree diagram used to visualize the results of hierarchical clustering. The bottom of this diagram represents individual data points, and as you move up the tree, branches merge (in agglomerative clustering) or split (in divisive clustering), illustrating the formation of clusters at different levels of similarity or distance.

* The vertical axis typically represents the distance or dissimilarity metric at which clusters are merged or split.
* The horizontal axis represents the data points or clusters.

By drawing a horizontal line across the diagram at a specific distance threshold, one can determine the number of clusters by counting the number of vertical lines intersected, effectively 'cutting' the tree to form distinct clusters.
Self-Organizing Map (SOM)
An unsupervised neural network, also known as a Kohonen map, that produces a low-dimensional (typically 2-D) discretized representation (a 'map') of the input space. It maps high-dimensional input data onto a grid of 'neurons' while preserving the topological properties of the input space. This means that data points that are close together in the high-dimensional input space will be mapped to neighboring neurons on the 2-D map. The learning process involves competitive learning (a 'winner-take-all' approach for the best matching unit) and collaborative learning (neighboring neurons also adapt). These maps are primarily used for visualizing high-dimensional data, clustering, and exploratory data analysis.
Linear Regression
A fundamental statistical modeling and supervised learning technique used to model the linear relationship between a continuous dependent variable (y) and one or more independent variables (x). Its goal is to find the best-fitting straight line (or hyperplane in higher dimensions) that minimizes the sum of squared residuals (the differences between observed and predicted values). For a single independent variable, the model is commonly represented as y = \alpha + \beta x + \epsilon, where \alpha is the intercept, \beta is the slope (coefficient), and \epsilon represents the error term. It provides a simple, interpretable model for predicting continuous outcomes.
Multiple Linear Regression
An extension of simple linear regression that models the relationship between a single continuous dependent variable (y) and two or more independent (or predictor) variables (x_1, x_2, \ldots, x_p). The model assumes a linear relationship and aims to fit a hyperplane to the data, represented by the equation: y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \ldots + \beta_p x_p + \epsilon. Here, \beta_0 is the intercept, and each \beta_i represents the coefficient for the corresponding predictor x_i, quantifying the change in y for a one-unit change in x_i, holding all other predictors constant. These coefficients indicate the estimated contribution and direction of each feature to the prediction of the dependent variable.
Stepwise Regression
An automated, iterative procedure for building a multiple linear regression model by sequentially adding or removing predictor variables based on a predefined statistical criterion, such as p-values, AIC (Akaike Information Criterion), or BIC (Bayesian Information Criterion). Common variants include:

* Forward Selection: Starts with no predictors and iteratively adds the most statistically significant one into the model.
* Backward Elimination: Starts with all predictors in the model and iteratively removes the least statistically significant one.
* Bidirectional (Hybrid): Combines both forward and backward steps, adding and removing variables at each iteration.

The goal is to find an optimal subset of predictors that improves model fit without unnecessary complexity. However, this method can suffer from issues like overfitting to the training data and ignoring important interactions between variables.
Support Vector Machine (SVM)
A powerful supervised learning model primarily used for classification tasks, though it can also be adapted for regression. For classification, its main objective is to find the optimal 'hyperplane' that maximally separates data points of different classes in a high-dimensional space. The 'maximal-margin hyperplane' is the one that has the largest distance (margin) to the nearest training data points of any class, which are known as 'support vectors'.

* Linear SVM: For linearly separable data, the hyperplane is a straight line or plane.
* Kernel Trick: For non-linearly separable data, these models use 'kernel functions' (e.g., polynomial, RBF, sigmoid) to implicitly map the input data into a higher-dimensional feature space where a linear separation might become possible. This avoids explicitly computing the coordinates in the higher-dimensional space, making it computationally efficient.
Support Vector Regression (SVR)
An adaptation of the Support Vector Machine (SVM) algorithm specifically for regression tasks, where the goal is to predict continuous values rather than discrete classes. Unlike traditional regression models that minimize the sum of squared errors, this method aims to find a function that deviates from the training data by no more than a specified threshold (\epsilon), while simultaneously trying to keep the function as flat as possible (i.e., minimizing its complexity). This '\epsilon-insensitive margin' means that errors occurring within this margin are not penalized. Similar to SVM for classification, this method can also use kernel functions to model non-linear relationships, providing a robust approach to regression, especially with noisy or high-dimensional data.
CART (Classification and Regression Tree)
A specific implementation of the Decision Tree algorithm that stands for Classification and Regression Tree, capable of handling both classification and regression tasks.

* For classification trees, this method typically uses criteria like the Gini Index (impurity) to determine the optimal splits, aiming to create leaf nodes that are as homogeneous as possible in terms of class labels. The prediction for a new sample is the majority class in its terminal leaf.
* For regression trees, this method uses criteria like mean squared error (MSE) or mean absolute error (MAE) to find splits that minimize the variance within leaf nodes. The prediction for a new sample is often the average or median of the target values of the training samples within its terminal leaf.

These trees are binary (each node splits into two branches) and are simple yet powerful, forming the basis for popular ensemble methods like Random Forests.
Regularization
A set of techniques used in machine learning to prevent overfitting by discouraging overly complex models. It works by adding a 'penalty term' to the model's loss (cost) function during training. This penalty term is typically a function of the model's coefficients or weights. By penalizing large coefficient values, this process forces the model to simplify by shrinking or even driving some coefficients to zero, thereby reducing the model's reliance on any single feature or complex interactions. This helps in improving the model's generalization ability to unseen data. Common types include L1 (Lasso) and L2 (Ridge) variants.
Ridge Regression
A regularized version of linear regression that helps to mitigate overfitting and multicollinearity by adding an 'L2 penalty' term to the standard ordinary least squares (OLS) loss function. This penalty is the sum of the squared coefficients multiplied by a tuning parameter \lambda (\lambda \ge 0): \lambda \sum_{j=1}^{p} \beta_j^2.

* The L2 penalty discourages large coefficient values, spreading influence among correlated features rather than selecting just one.
* \lambda (lambda): This hyperparameter controls the strength of the penalty. A larger \lambda value leads to greater shrinkage of coefficients towards zero.

This method effectively shrinks the coefficients, which primarily helps in controlling model variance and making the model more robust to noisy data. However, it does not perform feature selection, meaning all features are retained, just with reduced coefficients.
LASSO Regression
A type of linear regression that employs an 'L1 penalty' for regularization, short for Least Absolute Shrinkage and Selection Operator. It adds the sum of the absolute values of the coefficients, multiplied by a tuning parameter \lambda (\lambda \ge 0), to the standard OLS loss function: \lambda \sum_{j=1}^{p} |\beta_j|.

* The L1 penalty has a specific property that tends to drive some coefficients exactly to zero, effectively performing automatic feature selection by excluding less important features from the model. This results in a sparse model, which is simpler and more interpretable.
* \lambda (lambda): This hyperparameter controls the strength of the penalty. A larger \lambda value leads to more coefficients being shrunk to zero.

This regression technique is highly valuable when dealing with datasets that have many features, as it can identify and use only the most impactful ones, making it useful for both regularization and feature selection.
Elastic Net
A regularization method for linear regression that combines both the L1 (LASSO) and L2 (Ridge) penalties. It incorporates a weighted sum of both penalty terms into the standard ordinary least squares (OLS) loss function: \lambda_1 \sum_{j=1}^{p} |\beta_j| + \lambda_2 \sum_{j=1}^{p} \beta_j^2. This combination offers the benefits of both approaches:

* Like LASSO (L1 penalty), it can perform feature selection by shrinking some coefficients to exactly zero, resulting in a sparse model.
* Like Ridge (L2 penalty), it groups highly correlated features together and tends to shrink their coefficients proportionally, making it more robust to multicollinearity issues compared to using LASSO alone, which might arbitrarily select only one of a group of correlated features.

This method combines the sparsity of LASSO with the grouping effect of Ridge, making it a powerful and flexible regularization technique for models with many correlated features.
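A scikit-learn sketch comparing OLS, Ridge, LASSO, and Elastic Net coefficients on synthetic data with two correlated features; the alpha and l1_ratio values are arbitrary (in scikit-learn, alpha plays the role of \lambda).

```python
# Comparing OLS, Ridge (L2), LASSO (L1) and Elastic Net coefficients.
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso, LinearRegression, Ridge

rng = np.random.default_rng(0)
n, p = 100, 10
X = rng.normal(size=(n, p))
X[:, 1] = X[:, 0] + 0.05 * rng.normal(size=n)           # features 0 and 1 highly correlated
y = 3.0 * X[:, 0] + 2.0 * X[:, 2] + rng.normal(size=n)  # only features 0 and 2 matter

models = {
    "OLS": LinearRegression(),
    "Ridge": Ridge(alpha=1.0),                           # L2 penalty: shrinks, never exactly zero
    "LASSO": Lasso(alpha=0.1),                           # L1 penalty: sparse coefficients
    "ElasticNet": ElasticNet(alpha=0.1, l1_ratio=0.5),   # mix of L1 and L2 penalties
}
for name, model in models.items():
    model.fit(X, y)
    print(f"{name:>10}: {np.round(model.coef_, 2)}")     # L1-penalized models typically zero out noise features
```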
Interpretability
The degree to which a human can understand the internal workings, mechanisms, and reasoning behind a machine learning model's predictions or decisions. An interpretable model is often referred to as a 'white box' model because its logic is transparent and easily comprehensible. For example, simple linear regression models, decision trees (especially small ones), or rule-based systems are generally considered interpretable, as one can directly see how input features contribute to the output. High levels of this attribute are crucial in fields where trust, accountability, and causal understanding are paramount, such as medicine or finance.
Explainability
In the context of machine learning, this refers to the ability to clarify or justify the reasoning behind the predictions or decisions made by a complex, opaque model (often called a 'black box' model), even if its internal mechanisms are not directly understandable. Unlike 'interpretability,' which implies inherent transparency, this often involves 'post-hoc' analysis – applying techniques after the model has been trained to shed light on its behavior. Techniques like SHAP values, LIME, or Saliency Maps are used to provide insights into which features or input regions most influenced a specific prediction, helping to build trust, diagnose issues, and ensure compliance in complex models like deep neural networks.
Saliency Map
A visualization technique primarily used to understand which parts of an input (e.g., an image) are most important or influential for a Convolutional Neural Network's (CNN) specific prediction. It is typically generated by calculating the gradient of the model's output (e.g., the class score) with respect to the input pixels. High-gradient pixels indicate that a small change in their value would lead to a significant change in the output score, effectively highlighting the 'saliency' or importance of those regions. The result is often displayed as a heatmap overlaid on the original image, showing the areas that the model 'focused on' to make its decision.
Grad-CAM
An explainability technique, short for Gradient-weighted Class Activation Mapping, used to produce visual explanations for predictions made by Convolutional Neural Networks (CNNs), particularly useful for image classification tasks. It generates a 'coarse-grain localization map' (heatmap) that highlights the important regions in the input image for a specific predicted class. This is achieved by computing the gradient of the target class score with respect to the feature maps of a final convolutional layer. These gradients are then globally averaged and used as 'weights' for each feature map, indicating the importance of each feature map for the target class. The weighted sum of feature maps produces the activation heatmap, visually demonstrating why a CNN made a particular classification decision by showing the relevant areas.
LIME
An explainability technique, short for Local Interpretable Model-agnostic Explanations, designed to make the predictions of any 'black-box' machine learning model (making it 'model-agnostic') more understandable to humans. The core idea is to explain individual predictions. For a given prediction: 1. Local: It perturbs the original input data slightly to create new, nearby data samples. 2. Interpretable: It obtains the black-box model's predictions for these perturbed samples. 3. Surrogate Model: It then trains a simple, 'interpretable' model (e.g., linear model, decision tree) on these perturbed samples and their corresponding black-box predictions, weighted by their proximity to the original instance. This local surrogate model reveals which features were most influential for that specific prediction, providing a locally faithful explanation without requiring knowledge of the black-box model's internal structure.
SHAP Values
A cutting-edge explainability framework, short for SHapley Additive exPlanations, that applies concepts from cooperative game theory (specifically, Shapley values) to machine learning. SHAP values quantify the contribution of each feature to a particular prediction by distributing the 'payoff' (i.e., the difference between the actual prediction for an instance and the average prediction across the dataset) among the features. A SHAP value for a feature represents the average marginal contribution of that feature value across all possible coalitions (combinations) of features. This allows for both global (overall feature importance) and local (feature contribution to a specific prediction) interpretability for any machine learning model.