Vocabulary flashcards covering key terms, algorithms, metrics, and concepts from the Data Science lecture notes. Designed to reinforce definitions for exam preparation.
Data Science
Interdisciplinary field that extracts knowledge and insights from structured and unstructured data using scientific methods, processes, algorithms and systems.
Problem Formulation (Data Science Step)
Stage where the outcome of interest, the task type, and candidate predictor variables are identified.
Data Collection & Processing
Gathering representative examples and transforming, cleaning, filtering, and aggregating them into a model-ready format.
Modeling (Data Science Step)
Applying machine-learning algorithms, evaluating models, and analysing sensitivity and cost.
Insight & Action
Translating model results into understandable recommendations and communicating them for workflow improvement.
Structured Data
Highly organized, mostly quantitative data stored in rows, columns or relational databases; easy to manage with traditional tools.
Unstructured Data
Qualitative data (e.g., images, audio, e-mails) that cannot be easily stored in rows/columns and needs specialized tools to analyse.
Big Data – Volume
Characteristic referring to the massive amount of data generated and stored.
Big Data – Velocity
Characteristic describing the high speed at which data are generated and must be processed (often real-time).
Big Data – Variety
Characteristic denoting the multiple formats and types (text, video, sensor, etc.) present in big data sets.
Dynamic Dashboard
Interactive data-visualization interface enabling real-time exploration and communication of results.
Descriptive Analysis
Analytic type answering “What is happening?” through KPIs, summary tables, and static charts.
Diagnostic Analysis
Analytic type answering “Why is it happening?” by drilling down to uncover root causes and patterns.
Predictive Analysis
Analytic type answering “What is likely to happen?” by applying algorithms to forecast future outcomes.
Prescriptive Analysis
Analytic type answering “What should we do?” using optimization and simulation to recommend actions.
Mean
Average value; sum of observations divided by their count.
Median
Middle value when observations are ordered.
Mode
Most frequently occurring value in a data set.
Range
Difference between maximum and minimum values.
Variance
Average squared deviation of each observation from the mean.
Standard Deviation
Square root of variance; standard measure of data spread.
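The descriptive statistics on the cards above (mean, median, mode, range, variance, standard deviation) can be computed with Python's standard library; a minimal sketch on hypothetical data, using the population variants of variance and standard deviation:

```python
# Descriptive statistics for a small hypothetical sample,
# using only Python's standard library.
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]

mean = statistics.mean(data)           # sum of observations / count
median = statistics.median(data)       # middle value when ordered
mode = statistics.mode(data)           # most frequent value
value_range = max(data) - min(data)    # maximum minus minimum
variance = statistics.pvariance(data)  # average squared deviation from the mean
std_dev = statistics.pstdev(data)      # square root of the variance

print(mean, median, mode, value_range, variance, std_dev)
# → 5 4.5 4 7 4.0 2.0
```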
Supervised Learning
Machine-learning setting where both predictors X and response Y are observed for training.
Unsupervised Learning
Machine-learning setting where only predictors X are observed; aims to find structure (e.g., clusters).
Reinforcement Learning
Learning paradigm where an agent learns via trial-and-error interactions with an environment.
Regression (Predictive Modelling)
Supervised technique estimating a continuous outcome based on input variables.
Classification
Supervised technique assigning observations to discrete categories.
Clustering
Unsupervised technique grouping similar observations without predefined labels.
Forecasting
Predicting future values (often time-series) using methods like moving averages or exponential smoothing.
Bias (Model)
Error from approximating a real problem by a simpler model; affects accuracy systematically.
Variance (Model)
Sensitivity of a model to fluctuations in the training set; high variance implies overfitting risk.
Bias-Variance Trade-off
Balancing act between low bias (achieved by flexible, complex models) and low variance (achieved by simple models); expected test error is minimized somewhere in between.
Overfitting
Model captures noise in training data, harming performance on unseen data.
Underfitting
Model is too simple to capture underlying patterns, resulting in poor training and test performance.
Training Error
Average loss calculated on the training sample.
Test Error
Average loss when predicting unseen (held-out) data; true indicator of generalization.
Cross-Validation (CV)
Resampling technique to estimate test error by dividing data into training and validation folds.
Leave-One-Out CV (LOOCV)
CV method using n-1 observations for training and 1 for validation, repeated n times.
K-Fold Cross-Validation
CV method splitting data into K folds; each fold is used once for validation and K-1 times for training.
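The K-fold splitting rule on this card can be sketched in pure Python; a minimal illustration that yields the train/validation index sets, assuming the data are already shuffled:

```python
# Sketch of K-fold cross-validation index splitting: each observation
# lands in exactly one validation fold and in K-1 training sets.

def k_fold_indices(n, k):
    """Yield (train_idx, val_idx) pairs for K-fold CV over n observations."""
    indices = list(range(n))
    # Distribute any remainder so fold sizes differ by at most one
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, val
        start += size

folds = list(k_fold_indices(10, 5))
for train, val in folds:
    print(val)  # each fold of size 2 is used exactly once for validation
```

Setting k equal to n recovers LOOCV from the previous card.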
Bootstrap (Resampling)
Technique drawing B samples with replacement from data to estimate variability and confidence intervals.
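The bootstrap card can be illustrated with a short sketch: draw B resamples with replacement and use the spread of a statistic across them to estimate its variability (hypothetical data; the seed is fixed only for reproducibility):

```python
# Bootstrap estimate of the standard error of the mean.
import random
import statistics

random.seed(0)  # reproducibility for this illustration only
data = [3.1, 4.7, 2.9, 5.5, 4.1, 3.8]  # hypothetical sample
B = 1000

boot_means = []
for _ in range(B):
    resample = random.choices(data, k=len(data))  # sampling WITH replacement
    boot_means.append(statistics.mean(resample))

# Standard error estimated from the bootstrap distribution
se = statistics.stdev(boot_means)
# Rough 95% interval from the empirical percentiles
boot_means.sort()
ci = (boot_means[int(0.025 * B)], boot_means[int(0.975 * B)])
print(se, ci)
```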
Parametric Model
Model assuming a specific functional form with a fixed set of parameters (e.g., linear regression).
Non-Parametric Model
Model making no strong assumptions about data distribution; flexible but may need more data.
Linear Regression
Parametric method modelling linear relationship between dependent variable and one or more predictors.
Multicollinearity
Situation where predictors are highly correlated, destabilizing coefficient estimates.
p-Value
Probability of observing result at least as extreme as sample, assuming null hypothesis is true.
R-Squared
Proportion of variance in the response explained by the model; goodness-of-fit measure.
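The linear-regression and R-squared cards can be tied together with a minimal sketch: the closed-form least-squares fit for one predictor, and R² as explained variance (hypothetical data):

```python
# Simple linear regression by least squares, plus R-squared,
# using only the standard library.

xs = [1, 2, 3, 4, 5]
ys = [2.1, 4.0, 6.2, 7.9, 10.1]  # hypothetical, roughly y = 2x

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Closed-form least-squares slope and intercept
slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
        sum((x - mean_x) ** 2 for x in xs)
intercept = mean_y - slope * mean_x

preds = [intercept + slope * x for x in xs]

# R-squared: proportion of variance in y explained by the fit
ss_res = sum((y, p) and (y - p) ** 2 for y, p in zip(ys, preds))
ss_tot = sum((y - mean_y) ** 2 for y in ys)
r_squared = 1 - ss_res / ss_tot
print(slope, intercept, r_squared)  # slope ≈ 1.99, intercept ≈ 0.09, R² near 1
```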
Logistic Regression
Regression technique modelling log-odds of a binary outcome as a linear function of predictors.
Odds Ratio
Exponentiated logistic coefficient; multiplicative change in the odds for a one-unit increase in the predictor.
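The logistic-regression and odds-ratio cards can be made concrete with a small numeric sketch; the coefficient and intercept below are hypothetical, not fitted values:

```python
# Interpreting a logistic-regression coefficient as an odds ratio.
import math

beta = 0.405  # hypothetical fitted log-odds coefficient for one predictor
odds_ratio = math.exp(beta)  # odds multiply by ~1.5 per one-unit increase

def sigmoid(z):
    """Convert a linear predictor (log-odds) into a probability."""
    return 1 / (1 + math.exp(-z))

log_odds = -1.0 + beta * 2   # hypothetical intercept -1.0, predictor value 2
prob = sigmoid(log_odds)
print(odds_ratio, prob)
```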
k-Nearest Neighbour (kNN)
Non-parametric algorithm classifying (or regressing) based on labels of the k closest training points.
Distance Metric (kNN)
Mathematical measure (e.g., Euclidean, Hamming) used to quantify how close two observations are; a smaller distance means greater similarity.
Standardization (Scaling)
Rescaling features (e.g., to zero mean & unit variance) to prevent dominance in distance calculations.
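The kNN, distance-metric, and standardization cards can be combined into one sketch: without z-score scaling, the large-magnitude second feature below would dominate the Euclidean distance (toy hypothetical data):

```python
# kNN classification with Euclidean distance and z-score standardization,
# in pure Python.
import math
import statistics
from collections import Counter

X = [[1.0, 200.0], [1.2, 210.0], [3.0, 800.0], [3.2, 790.0]]
y = ["a", "a", "b", "b"]

# Standardize each feature to zero mean and unit variance so the
# large-scale second feature does not dominate the distance.
means = [statistics.mean(col) for col in zip(*X)]
stds = [statistics.stdev(col) for col in zip(*X)]

def scale(row):
    return [(v - m) / s for v, m, s in zip(row, means, stds)]

Xs = [scale(row) for row in X]

def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def knn_predict(query, k=3):
    q = scale(query)
    # Majority vote among the k nearest standardized training points
    neighbours = sorted(zip(Xs, y), key=lambda t: euclidean(q, t[0]))[:k]
    return Counter(label for _, label in neighbours).most_common(1)[0][0]

print(knn_predict([1.1, 205.0]))  # → "a" (nearest points are class "a")
```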
Curse of Dimensionality
Exponential growth of data sparsity and computational cost as feature dimension increases.
Decision Tree
Non-parametric model that splits data recursively based on feature thresholds to predict target.
Impurity Measure
Criterion (e.g., Gini index, entropy, RSS) used to select best split in a decision tree.
Bagging (Bootstrap Aggregating)
Ensemble technique averaging predictions of models fitted on bootstrap samples to reduce variance.
Random Forest
Bagging ensemble of decision trees with random feature selection at splits to decorrelate trees.
Variable Importance (RF)
Score quantifying each predictor’s contribution to reducing impurity or RSS across forest trees.
Boosting
Sequential ensemble technique that iteratively focuses on misclassified instances to reduce bias.
AdaBoost
Adaptive boosting algorithm assigning higher weights to previously misclassified observations when training new weak learners.
Cluster Cohesion (Intra-Cluster)
Measure of how close points are to centroid within the same cluster (e.g., SSE).
Cluster Separation (Inter-Cluster)
Measure of distance between centroids of different clusters.
K-Means
Iterative algorithm partitioning data into K clusters by minimizing within-cluster SSE.
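The K-means card describes Lloyd's algorithm: alternate assigning each point to its nearest centroid and moving each centroid to its cluster mean. A minimal pure-Python sketch on hypothetical 2-D data, with hand-picked initial centroids:

```python
# Sketch of Lloyd's algorithm for K-means clustering.

def kmeans(points, centroids, iterations=10):
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid's cluster
        clusters = [[] for _ in centroids]
        for p in points:
            sq_dists = [sum((a - b) ** 2 for a, b in zip(p, c))
                        for c in centroids]
            clusters[sq_dists.index(min(sq_dists))].append(p)
        # Update step: move each centroid to its cluster mean
        centroids = [
            [sum(dim) / len(cluster) for dim in zip(*cluster)] if cluster else c
            for cluster, c in zip(clusters, centroids)
        ]
    return centroids, clusters

points = [[1, 1], [1.5, 2], [8, 8], [9, 9]]  # hypothetical 2-D data
centroids, clusters = kmeans(points, centroids=[[0, 0], [10, 10]])
print(centroids)  # → [[1.25, 1.5], [8.5, 8.5]]
```

In practice the result depends on initialization, so the algorithm is usually restarted several times and the run with the lowest within-cluster SSE is kept.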
Hierarchical Clustering
Algorithm building nested clusters via successive merges (agglomerative) or splits (divisive).
Gower Distance
Dissimilarity measure that handles mixed numerical and categorical data; used in clustering.
Market Basket Analysis (MBA)
Mining transaction data to find associations between items frequently purchased together.
Frequent Item-Set
Set of items appearing together in transactions at least as often as a support threshold.
Support (MBA)
Proportion of transactions containing a given item-set.
Confidence (MBA)
Conditional probability that a transaction containing antecedent items also contains consequent items.
Lift
Ratio of confidence to expected confidence; evaluates strength of association relative to independence.
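The support, confidence, and lift cards can be computed directly from transaction data; a minimal sketch for the hypothetical rule {bread} → {butter}:

```python
# Support, confidence, and lift for a single association rule
# over hypothetical transaction data.

transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "milk"},
    {"milk"},
    {"bread", "butter", "jam"},
]
n = len(transactions)

def support(itemset):
    """Proportion of transactions containing the item-set."""
    return sum(itemset <= t for t in transactions) / n

antecedent, consequent = {"bread"}, {"butter"}
supp_rule = support(antecedent | consequent)  # P(bread and butter) = 0.6
confidence = supp_rule / support(antecedent)  # P(butter | bread)   = 0.75
lift = confidence / support(consequent)       # 0.75 / 0.6          = 1.25

print(supp_rule, confidence, lift)
```

A lift above 1 (here 1.25) means bread and butter co-occur more often than independence would predict.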
Apriori Algorithm
Classic method for mining frequent item-sets by exploring increasing item-set sizes and pruning infrequent candidates.