Null Hypothesis
A statistical assumption that there is no effect, no difference, or no relationship between variables in a population, serving as the default claim to be tested.
Shapiro-Wilk Test
A statistical test used to check if a dataset is normally distributed, with the null hypothesis stating that the data comes from a normal distribution.
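A minimal sketch of how a Shapiro-Wilk test is commonly run with scipy.stats.shapiro; the synthetic sample and the 0.05 significance threshold are assumptions for illustration.

```python
import numpy as np
from scipy.stats import shapiro

# Synthetic sample assumed for illustration: 200 draws from a normal distribution
rng = np.random.default_rng(42)
sample = rng.normal(loc=0.0, scale=1.0, size=200)

# Null hypothesis: the sample comes from a normal distribution
stat, p_value = shapiro(sample)
print(f"W statistic = {stat:.4f}, p-value = {p_value:.4f}")

# Fail to reject the null at the (assumed) 0.05 level if p >= 0.05
if p_value >= 0.05:
    print("No evidence against normality")
else:
    print("Evidence that the data are not normally distributed")
```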
Normal Distribution
A continuous probability distribution that is symmetric and bell-shaped, where most values cluster around the mean and probabilities decrease as you move further from it.
Correlation
A statistical measure that describes the strength and direction of the relationship between two variables, ranging from -1 (perfect negative) to +1 (perfect positive).
Autocorrelation
A statistical measure that shows how a time series is correlated with its own past values, indicating repeating patterns, trends, or seasonality.
Non-Parametric Methods
Statistical techniques that do not assume a specific probability distribution for the data, making them useful for small samples, ordinal data, or when normality assumptions are not met.
Uniform Distribution
A probability distribution where all outcomes in a given range are equally likely, with constant probability across the interval.
Log-Normal Distribution Test
A statistical test used to determine if data follows a log-normal distribution, where the logarithm of the variable is normally distributed.
Discrete Random Variable
A random variable that can take on a countable number of distinct values, often representing outcomes like counts, categories, or integers.
Empirical Distribution
A probability distribution derived directly from observed data, where probabilities are assigned based on the relative frequencies of outcomes in the sample.
Descriptive Probability
The use of probability to summarize and describe the likelihood of outcomes based on observed data, rather than predicting or inferring about a larger population.
Algorithm in Data Science
A step-by-step set of rules or instructions used to process data, perform analysis, or build models to identify patterns, make predictions, or solve problems.
Model in Data Science
A mathematical or computational representation of a real-world process or system, built from data to describe relationships, make predictions, or support decision-making.
Joint Distribution
A probability distribution that describes the likelihood of two or more random variables occurring together, showing how their probabilities are related.
Marginal Probability
The probability of a single event occurring, found by summing or integrating over all possible outcomes of the other variables in a joint distribution.
Conditional Probability
The probability of an event occurring given that another event has already occurred, expressed as P(A|B) = P(A∩B) / P(B).
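A small worked example of the formula in Python; the counts are invented purely for illustration.

```python
# Hypothetical counts out of 100 trials (invented for illustration)
n_total = 100
n_A_and_B = 12   # trials where both A and B occurred
n_B = 30         # trials where B occurred

p_A_and_B = n_A_and_B / n_total   # P(A ∩ B) = 0.12
p_B = n_B / n_total               # P(B)     = 0.30

# P(A|B) = P(A ∩ B) / P(B)
p_A_given_B = p_A_and_B / p_B
print(f"P(A|B) = {p_A_given_B:.2f}")  # P(A|B) = 0.40
```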
Discretize
The process of converting continuous data or variables into discrete categories or intervals, often used in data preprocessing or feature engineering.
Discrete Variable
A variable that can take only distinct, separate values (often integers or categories), with no intermediate values possible between them.
Central Limit Theorem
A statistical principle stating that the sampling distribution of the sample mean approaches a normal distribution as the sample size becomes large, regardless of the population's original distribution.
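A quick way to see the Central Limit Theorem at work is to simulate it; the sketch below uses an exponential population and arbitrary sample sizes, chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Population: an exponential distribution (clearly non-normal, right-skewed)
# Draw 5,000 samples of size n and record each sample mean
n, n_samples = 50, 5000
sample_means = rng.exponential(scale=1.0, size=(n_samples, n)).mean(axis=1)

# The sample means cluster around the population mean (1.0), look roughly normal,
# and have spread close to sigma / sqrt(n) = 1 / sqrt(50)
print("mean of sample means:", sample_means.mean())
print("std of sample means: ", sample_means.std())
print("theoretical std:     ", 1 / np.sqrt(n))
```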
Confidence Interval
A range of values, derived from sample data, that is likely to contain the true population parameter with a specified level of confidence (e.g., 95%).
Graph of a Function
A visual representation of all possible input-output pairs (x,f(x)), usually plotted on a coordinate plane, showing how the function behaves across its domain.
Cartesian Coordinates
A system that locates points on a plane using ordered pairs (x, y), where x represents the horizontal position and y the vertical position, defined relative to perpendicular axes.
Euler's Number
A mathematical constant denoted by e ≈ 2.718, which is the base of the natural logarithm and arises in continuous growth, compound interest, and many areas of calculus and probability.
Domain of a Function
The set of all possible input values (𝑥) for which the function is defined.
Range of a Function
The set of all possible output values (𝑦) that the function can produce.
K-Means Clustering
An unsupervised machine learning algorithm that partitions data into K clusters by assigning each point to the nearest centroid and updating centroids to minimize within-cluster distances.
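A minimal K-Means sketch with scikit-learn on synthetic blobs; the data, K=3, and random_state are assumptions for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

# Synthetic data with 3 blobs, assumed purely for illustration
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

# Scale first so all features contribute equally to the distance calculations
X_scaled = StandardScaler().fit_transform(X)

# Fit K-Means with K=3; n_init controls how many random centroid initializations are tried
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(X_scaled)

print(labels[:10])              # cluster assignment of the first 10 points
print(kmeans.cluster_centers_)  # final centroid coordinates
```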
Clustering (Unsupervised)
A machine learning technique that groups similar data points together without using predefined labels, aiming to discover hidden patterns or structures in the data.
Scale the Data (KMeans)
A preprocessing step where features are normalized or standardized so that all variables contribute equally to distance calculations, preventing features with larger ranges from dominating clustering.
Elbow Method
A technique to choose the optimal number of clusters in K-Means by plotting the within-cluster sum of squares (WCSS, also called inertia) against the number of clusters and identifying the "elbow" point where adding more clusters gives little improvement.
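One common way to produce an elbow plot with scikit-learn and matplotlib is sketched below; the synthetic data and the range of K values are assumptions for illustration.

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with 4 blobs, assumed for illustration
X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

# Fit K-Means for K = 1..10 and record the inertia (within-cluster sum of squares)
ks = range(1, 11)
inertias = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_ for k in ks]

# The "elbow" is the K where the curve bends and further clusters add little improvement
plt.plot(list(ks), inertias, marker="o")
plt.xlabel("Number of clusters (K)")
plt.ylabel("Inertia (WCSS)")
plt.title("Elbow method")
plt.show()
```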
Supervised Classification Model
A machine learning model trained on labeled data to predict categorical outcomes (class labels) for new, unseen inputs.
Random Forest
An ensemble learning method that builds multiple decision trees and combines their predictions to improve accuracy and reduce overfitting.
F1-Score
A performance metric that combines precision and recall into a single value using their harmonic mean, especially useful for evaluating models on imbalanced datasets.
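For concreteness, F1 = 2 · precision · recall / (precision + recall); the sketch below computes it by hand and with scikit-learn on made-up labels.

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Made-up true labels and predictions, for illustration only
y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0, 1, 0]

precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)

# Harmonic mean of precision and recall
f1_manual = 2 * precision * recall / (precision + recall)
print(f1_manual, f1_score(y_true, y_pred))  # the two values agree (0.8 here)
```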
Regression
A supervised learning method used to model and predict continuous numerical outcomes based on input features.
Imputation
The process of filling in missing data values using strategies like mean, median, or predictive methods.
Type Conversion
Changing data from one type or format to another, such as string to integer or float.
Downstream Modeling
The stage where preprocessed data is used to train and evaluate machine learning models.
Exploratory Data Analysis (EDA)
The process of analyzing and visualizing data to find patterns, detect anomalies, and test assumptions before modeling.
MinMaxScaler
A preprocessing method that rescales features to a fixed range, usually [0, 1].
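A minimal sketch of scikit-learn's MinMaxScaler; the feature matrix is invented so that one column dwarfs the other before scaling.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Small made-up feature matrix: one column in the tens, one in the thousands
X = np.array([[10.0, 1000.0],
              [20.0, 5000.0],
              [30.0, 9000.0]])

# Rescale each column to the default [0, 1] range: (x - min) / (max - min)
X_scaled = MinMaxScaler().fit_transform(X)
print(X_scaled)
# [[0.  0. ]
#  [0.5 0.5]
#  [1.  1. ]]
```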
Heatmap
A graphical representation of data where values are shown as colors, often used to visualize correlations in a matrix.
Radar Chart
A graphical method for displaying multivariate data, where multiple axes start from the same point and values are plotted in a spider-web shape.
Tokenization
Splitting text into smaller units such as words, sentences, or subwords.
Stemming
Reducing words to their root form by trimming suffixes or prefixes (e.g., "running" → "run").
Lemmatization
Reducing words to their base dictionary form using linguistic rules (e.g., "better" → "good").
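A minimal NLTK sketch of the preprocessing steps above (tokenization, stemming, lemmatization); the example sentence is invented, tokenization here is a plain whitespace split for simplicity, and the download calls fetch WordNet resources on first use.

```python
import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download("wordnet", quiet=True)   # resources needed by the lemmatizer
nltk.download("omw-1.4", quiet=True)

text = "The runners were running better than expected"

# Tokenization (simple whitespace split here; nltk.word_tokenize is the usual choice)
tokens = text.lower().split()

# Stemming: trims suffixes, e.g. "running" -> "run", "runners" -> "runner"
stemmer = PorterStemmer()
print([stemmer.stem(t) for t in tokens])

# Lemmatization: maps words to dictionary forms using WordNet
lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("running", pos="v"))  # -> "run"
print(lemmatizer.lemmatize("better", pos="a"))   # -> "good"
```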
POS Tagging
Assigning parts of speech (noun, verb, adjective, etc.) to each word in text.
Sentiment Analysis
Determining the emotional tone or opinion expressed in text (positive, negative, neutral).
Corpora
Large structured collections of text used for training, testing, and analyzing NLP models.
Named Entity Recognition (NER)
Identifying and classifying entities in text such as names, places, dates, and organizations.
Dependency Parsing
Analyzing the grammatical structure of a sentence to determine how words are related.
Word Vectors
Numerical representations of words in continuous space that capture meaning and semantic relationships.
Denominator
The bottom number in a fraction, representing the total number of equal parts the whole is divided into.
One-Hot Encoding
A method to convert categorical variables into binary vectors, where each category is represented by 1 (present) and 0 (absent).
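A small pandas sketch of one-hot encoding; the categorical column is made up for illustration.

```python
import pandas as pd

# Made-up categorical column, for illustration
df = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

# Each category becomes its own binary column (1 = present, 0 = absent)
encoded = pd.get_dummies(df, columns=["color"], dtype=int)
print(encoded)
#    color_blue  color_green  color_red
# 0           0            0          1
# 1           0            1          0
# 2           1            0          0
# 3           0            1          0
```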
Absolute Value
The distance of a number from zero on the number line, always non-negative.
Dot Product
An operation that multiplies two vectors and returns a scalar value, often used to measure similarity.
Vector Magnitude
The length of a vector, calculated as the square root of the sum of squared components.
Norms
Functions that measure the size or length of vectors, such as L1 (Manhattan) or L2 (Euclidean) norms.
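A NumPy sketch tying together the dot product, vector magnitude, and L1/L2 norms defined above; the vectors are invented for illustration.

```python
import numpy as np

# Two made-up vectors, for illustration
a = np.array([1.0, 2.0, 2.0])
b = np.array([2.0, 0.0, 1.0])

# Dot product: sum of element-wise products -> a scalar
print(np.dot(a, b))              # 1*2 + 2*0 + 2*1 = 4.0

# Vector magnitude (L2 norm): square root of the sum of squared components
print(np.linalg.norm(a))         # sqrt(1 + 4 + 4) = 3.0

# L1 (Manhattan) norm: sum of absolute values
print(np.linalg.norm(a, ord=1))  # |1| + |2| + |2| = 5.0
```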
Matrix
A rectangular array of numbers arranged in rows and columns.
Vector
An ordered list of numbers that can represent a point or direction in space.
Tensor
A generalization of scalars, vectors, and matrices to higher dimensions.
Linear Maps
Functions between vector spaces that preserve addition and scalar multiplication.
Linear System of Equations
A collection of linear equations involving the same set of variables, often solved simultaneously.
Eigenvectors
Non-zero vectors that remain in the same direction after a linear transformation, only scaled.
Eigenvalues
Scalars that indicate how much the corresponding eigenvector is stretched or compressed during a linear transformation.
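A minimal NumPy sketch of eigenvalues and eigenvectors; the 2×2 matrix is made up so the results are easy to check by hand.

```python
import numpy as np

# A small made-up matrix, for illustration
A = np.array([[2.0, 0.0],
              [0.0, 3.0]])

# Eigenpairs satisfy A @ v = lambda * v
eigenvalues, eigenvectors = np.linalg.eig(A)
print(eigenvalues)    # [2. 3.]
print(eigenvectors)   # columns are the eigenvectors

# Check the defining property for the first eigenpair
v = eigenvectors[:, 0]
print(np.allclose(A @ v, eigenvalues[0] * v))  # True
```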
Normalization
A preprocessing technique that rescales data so that values fall within a standard range, often between 0 and 1.
Customer Churn
When customers stop doing business with a company or service, often measured as the percentage of customers lost during a given period.
DBSCAN
A density-based clustering algorithm that groups together points that are closely packed and labels points in sparse regions as noise.
Pairplot
A Seaborn visualization that shows relationships between pairs of variables using scatterplots and histograms (or KDEs) in a grid format.
Cosine Similarity
A metric that measures how similar two vectors are by calculating the cosine of the angle between them, with values ranging from -1 (opposite) to 1 (identical).
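A short NumPy sketch of cosine similarity; the vectors are invented so the expected result is obvious.

```python
import numpy as np

# Two made-up vectors, for illustration
a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])   # same direction as a, so similarity should be 1

# cos(theta) = (a . b) / (||a|| * ||b||)
cos_sim = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
print(cos_sim)   # 1.0 (up to floating-point rounding)
```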
Cosine
A trigonometric function that represents the ratio of the adjacent side to the hypotenuse in a right triangle.
Matrix Determinant
A scalar value that represents how a matrix scales space and indicates whether the matrix is invertible (non-zero determinant) or not (zero determinant).
Orthogonal Matrices
Square matrices whose rows and columns are orthonormal vectors, meaning their transpose is equal to their inverse (AᵀA = I).
Data Serialization
The process of converting data into a standard format (like JSON, CSV, or binary) so it can be stored, transmitted, and later reconstructed.
Kernel Density Estimation (KDE)
A non-parametric method to estimate the probability density function of a dataset by smoothing data points with a kernel function, producing a smooth curve that represents the distribution.
Rolling Average (Moving Average)
A technique that calculates the average of data points over a sliding window, used to smooth short-term fluctuations and highlight longer-term trends.
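A small pandas sketch of a rolling average; the series values and window size are arbitrary choices for illustration.

```python
import pandas as pd

# Made-up daily values, for illustration
s = pd.Series([10, 12, 9, 15, 14, 13, 18, 20])

# 3-point rolling average: each value is the mean of the current and previous 2 points
print(s.rolling(window=3).mean())
# The first two entries are NaN because a full window of 3 is not yet available
```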
Feature Selection
The process of choosing the most relevant features (variables) from a dataset to improve model performance, reduce overfitting, and simplify the model.
Train/Test Split
A method of dividing data into two sets: one for training the model and the other for testing its performance on unseen data.
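A minimal train/test split with scikit-learn; the Iris dataset is used only as a convenient built-in example, and test_size and random_state are arbitrary choices.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Hold out 20% of the rows as an unseen test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(X_train.shape, X_test.shape)  # (120, 4) (30, 4)
```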
Nominal Data
Categorical data without order (e.g., colors, names).
Ordinal Data
Categorical data with a meaningful order but unequal intervals (e.g., rankings).
Interval Data
Data with equal intervals but no true zero (e.g., temperature in °C).
Ratio Data
Data with equal intervals and a true zero, allowing ratio comparisons (e.g., weight, height).
Supervised Learning
Models trained on labeled data to predict outcomes.
Unsupervised Learning
Models that find patterns or group data without labels.
Classification
A supervised task that predicts discrete categories (e.g., spam/not spam).
Regression
A supervised task that predicts continuous values (e.g., price, age).
Clustering
An unsupervised task that groups similar data points without predefined labels.
Logistic Regression
A supervised model that predicts the probability of binary outcomes using the logistic (sigmoid) function.
Multilayer Perceptron (MLP)
A type of artificial neural network composed of input, hidden, and output layers, used for supervised learning.
Sigmoid Function (Logistic Regression)
A mathematical function that maps any real value into the range (0, 1), giving an S-shaped curve for probabilities.
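A minimal sketch of the sigmoid, evaluated at a few arbitrarily chosen inputs.

```python
import numpy as np

def sigmoid(z):
    """Logistic (sigmoid) function: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(0))    # 0.5
print(sigmoid(4))    # ~0.982, close to 1
print(sigmoid(-4))   # ~0.018, close to 0
```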
Latent Processes (ML)
Hidden or unobserved variables or structures that influence observed data, often inferred using models like PCA, factor analysis, or topic modeling.
Confocal Images
High-resolution optical images produced by confocal microscopy, which uses focused laser light and spatial filtering to remove out-of-focus blur and improve depth accuracy.
Convolutional Layers
Layers in a neural network that apply filters to input data (like images) to detect features such as edges, textures, and shapes.
Latent Variables
Hidden or unobserved variables that influence observed data, often representing underlying patterns or factors inferred through models like PCA, factor analysis, or autoencoders.
Empirical Risk Minimization (ERM)
A principle in machine learning where a model is trained to minimize the average loss (error) on the training data, serving as an estimate of the true risk over the entire data distribution.
Rational Numbers
Numbers that can be expressed as a fraction of two integers a/b, where b ≠ 0.
Irrational Numbers
Numbers that cannot be expressed as a fraction of two integers; their decimals are non-repeating and non-terminating (e.g., √2, π).
Natural Numbers
The counting numbers starting from 1, 2, 3, and so on.