Vocabulary flashcards covering key terms, algorithms, metrics, and concepts from the Data Science lecture notes. Designed to reinforce definitions for exam preparation.
Data Science
Interdisciplinary field that extracts knowledge and insights from structured and unstructured data using scientific methods, processes, algorithms and systems.
Problem Formulation (Data Science Step)
Stage where the outcome of interest, the task type, and candidate predictor variables are identified.
Data Collection & Processing
Gathering representative examples and transforming, cleaning, filtering, and aggregating them into a model-ready format.
Modeling (Data Science Step)
Applying machine-learning algorithms, evaluating models, and analysing sensitivity and cost.
Insight & Action
Translating model results into understandable recommendations and communicating them for workflow improvement.
Structured Data
Highly organized, mostly quantitative data stored in rows, columns or relational databases; easy to manage with traditional tools.
Unstructured Data
Qualitative data (e.g., images, audio, e-mails) that cannot be easily stored in rows/columns and needs specialized tools to analyse.
Big Data – Volume
Characteristic referring to the massive amount of data generated and stored.
Big Data – Velocity
Characteristic describing the high speed at which data are generated and must be processed (often real-time).
Big Data – Variety
Characteristic denoting the multiple formats and types (text, video, sensor, etc.) present in big data sets.
Dynamic Dashboard
Interactive data-visualization interface enabling real-time exploration and communication of results.
Descriptive Analysis
Analytic type answering “What is happening?” through KPIs, summary tables, and static charts.
Diagnostic Analysis
Analytic type answering “Why is it happening?” by drilling down to uncover root causes and patterns.
Predictive Analysis
Analytic type answering “What is likely to happen?” by applying algorithms to forecast future outcomes.
Prescriptive Analysis
Analytic type answering “What should we do?” using optimization and simulation to recommend actions.
Mean
Average value; sum of observations divided by their count.
Median
Middle value when observations are ordered.
Mode
Most frequently occurring value in a data set.
Range
Difference between maximum and minimum values.
Variance
Average squared deviation of each observation from the mean.
Standard Deviation
Square root of variance; standard measure of data spread.
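The descriptive statistics on the cards above (mean, median, mode, range, variance, standard deviation) can be computed with Python's standard library; a minimal sketch on hypothetical data, using the population variants of variance and standard deviation:

```python
# Descriptive statistics for a small hypothetical sample,
# using only Python's standard library.
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]

mean = statistics.mean(data)           # sum of observations / count
median = statistics.median(data)       # middle value when ordered
mode = statistics.mode(data)           # most frequent value
value_range = max(data) - min(data)    # maximum minus minimum
variance = statistics.pvariance(data)  # average squared deviation from the mean
std_dev = statistics.pstdev(data)      # square root of the variance

print(mean, median, mode, value_range, variance, std_dev)
# → 5 4.5 4 7 4.0 2.0
```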
Supervised Learning
Machine-learning setting where both predictors X and response Y are observed for training.
Unsupervised Learning
Machine-learning setting where only predictors X are observed; aims to find structure (e.g., clusters).
Reinforcement Learning
Learning paradigm where an agent learns via trial-and-error interactions with an environment.
Regression (Predictive Modelling)
Supervised technique estimating a continuous outcome based on input variables.
Classification
Supervised technique assigning observations to discrete categories.
Clustering
Unsupervised technique grouping similar observations without predefined labels.
Forecasting
Predicting future values (often time-series) using methods like moving averages or exponential smoothing.
Bias (Model)
Error from approximating a real problem by a simpler model; affects accuracy systematically.
Variance (Model)
Sensitivity of a model to fluctuations in the training set; high variance implies overfitting risk.
Bias-Variance Trade-off
Balancing act between low bias (achieved by flexible, complex models) and low variance (achieved by simple models); expected test error is minimized somewhere in between.
Overfitting
Model captures noise in training data, harming performance on unseen data.
Underfitting
Model is too simple to capture underlying patterns, resulting in poor training and test performance.
Training Error
Average loss calculated on the training sample.
Test Error
Average loss when predicting unseen (held-out) data; true indicator of generalization.
Cross-Validation (CV)
Resampling technique to estimate test error by dividing data into training and validation folds.
Leave-One-Out CV (LOOCV)
CV method using n-1 observations for training and 1 for validation, repeated n times.
K-Fold Cross-Validation
CV method splitting data into K folds; each fold is used once for validation and K-1 times for training.
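The K-fold splitting rule on this card can be sketched in pure Python; a minimal illustration that yields the train/validation index sets, assuming the data are already shuffled:

```python
# Sketch of K-fold cross-validation index splitting: each observation
# lands in exactly one validation fold and in K-1 training sets.

def k_fold_indices(n, k):
    """Yield (train_idx, val_idx) pairs for K-fold CV over n observations."""
    indices = list(range(n))
    # Distribute any remainder so fold sizes differ by at most one
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, val
        start += size

folds = list(k_fold_indices(10, 5))
for train, val in folds:
    print(val)  # each fold of size 2 is used exactly once for validation
```

Setting k equal to n recovers LOOCV from the previous card.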
Bootstrap (Resampling)
Technique drawing B samples with replacement from data to estimate variability and confidence intervals.
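The bootstrap card can be illustrated with a short sketch: draw B resamples with replacement and use the spread of a statistic across them to estimate its variability (hypothetical data; the seed is fixed only for reproducibility):

```python
# Bootstrap estimate of the standard error of the mean.
import random
import statistics

random.seed(0)  # reproducibility for this illustration only
data = [3.1, 4.7, 2.9, 5.5, 4.1, 3.8]  # hypothetical sample
B = 1000

boot_means = []
for _ in range(B):
    resample = random.choices(data, k=len(data))  # sampling WITH replacement
    boot_means.append(statistics.mean(resample))

# Standard error estimated from the bootstrap distribution
se = statistics.stdev(boot_means)
# Rough 95% interval from the empirical percentiles
boot_means.sort()
ci = (boot_means[int(0.025 * B)], boot_means[int(0.975 * B)])
print(se, ci)
```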
Parametric Model
Model assuming a specific functional form with a fixed set of parameters (e.g., linear regression).
Non-Parametric Model
Model making no strong assumptions about data distribution; flexible but may need more data.
Linear Regression
Parametric method modelling linear relationship between dependent variable and one or more predictors.
Multicollinearity
Situation where predictors are highly correlated, destabilizing coefficient estimates.
p-Value
Probability of observing result at least as extreme as sample, assuming null hypothesis is true.
R-Squared
Proportion of variance in the response explained by the model; goodness-of-fit measure.
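The linear-regression and R-squared cards can be tied together with a minimal sketch: the closed-form least-squares fit for one predictor, and R² as explained variance (hypothetical data):

```python
# Simple linear regression by least squares, plus R-squared,
# using only the standard library.

xs = [1, 2, 3, 4, 5]
ys = [2.1, 4.0, 6.2, 7.9, 10.1]  # hypothetical, roughly y = 2x

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Closed-form least-squares slope and intercept
slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
        sum((x - mean_x) ** 2 for x in xs)
intercept = mean_y - slope * mean_x

preds = [intercept + slope * x for x in xs]

# R-squared: proportion of variance in y explained by the fit
ss_res = sum((y, p) and (y - p) ** 2 for y, p in zip(ys, preds))
ss_tot = sum((y - mean_y) ** 2 for y in ys)
r_squared = 1 - ss_res / ss_tot
print(slope, intercept, r_squared)  # slope ≈ 1.99, intercept ≈ 0.09, R² near 1
```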
Logistic Regression
Regression technique modelling log-odds of a binary outcome as a linear function of predictors.
Odds Ratio
Exponentiated logistic coefficient; multiplicative change in the odds for a one-unit increase in the predictor.
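The logistic-regression and odds-ratio cards can be made concrete with a small numeric sketch; the coefficient and intercept below are hypothetical, not fitted values:

```python
# Interpreting a logistic-regression coefficient as an odds ratio.
import math

beta = 0.405  # hypothetical fitted log-odds coefficient for one predictor
odds_ratio = math.exp(beta)  # odds multiply by ~1.5 per one-unit increase

def sigmoid(z):
    """Convert a linear predictor (log-odds) into a probability."""
    return 1 / (1 + math.exp(-z))

log_odds = -1.0 + beta * 2   # hypothetical intercept -1.0, predictor value 2
prob = sigmoid(log_odds)
print(odds_ratio, prob)
```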
k-Nearest Neighbour (kNN)
Non-parametric algorithm classifying (or regressing) based on labels of the k closest training points.
Distance Metric (kNN)
Mathematical measure (e.g., Euclidean, Hamming) used to quantify how close two observations are; a smaller distance means greater similarity.
Standardization (Scaling)
Rescaling features (e.g., to zero mean & unit variance) to prevent dominance in distance calculations.
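The kNN, distance-metric, and standardization cards can be combined into one sketch: without z-score scaling, the large-magnitude second feature below would dominate the Euclidean distance (toy hypothetical data):

```python
# kNN classification with Euclidean distance and z-score standardization,
# in pure Python.
import math
import statistics
from collections import Counter

X = [[1.0, 200.0], [1.2, 210.0], [3.0, 800.0], [3.2, 790.0]]
y = ["a", "a", "b", "b"]

# Standardize each feature to zero mean and unit variance so the
# large-scale second feature does not dominate the distance.
means = [statistics.mean(col) for col in zip(*X)]
stds = [statistics.stdev(col) for col in zip(*X)]

def scale(row):
    return [(v - m) / s for v, m, s in zip(row, means, stds)]

Xs = [scale(row) for row in X]

def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def knn_predict(query, k=3):
    q = scale(query)
    # Majority vote among the k nearest standardized training points
    neighbours = sorted(zip(Xs, y), key=lambda t: euclidean(q, t[0]))[:k]
    return Counter(label for _, label in neighbours).most_common(1)[0][0]

print(knn_predict([1.1, 205.0]))  # → "a" (nearest points are class "a")
```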
Curse of Dimensionality
Exponential growth of data sparsity and computational cost as feature dimension increases.
Decision Tree
Non-parametric model that splits data recursively based on feature thresholds to predict target.
Impurity Measure
Criterion (e.g., Gini index, entropy, RSS) used to select best split in a decision tree.
Bagging (Bootstrap Aggregating)
Ensemble technique averaging predictions of models fitted on bootstrap samples to reduce variance.
Random Forest
Bagging ensemble of decision trees with random feature selection at splits to decorrelate trees.
Variable Importance (RF)
Score quantifying each predictor’s contribution to reducing impurity or RSS across forest trees.
Boosting
Sequential ensemble technique that iteratively focuses on misclassified instances to reduce bias.
AdaBoost
Adaptive boosting algorithm assigning higher weights to previously misclassified observations when training new weak learners.
Cluster Cohesion (Intra-Cluster)
Measure of how close points are to centroid within the same cluster (e.g., SSE).
Cluster Separation (Inter-Cluster)
Measure of distance between centroids of different clusters.
K-Means
Iterative algorithm partitioning data into K clusters by minimizing within-cluster SSE.
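The K-means card describes Lloyd's algorithm: alternate assigning each point to its nearest centroid and moving each centroid to its cluster mean. A minimal pure-Python sketch on hypothetical 2-D data, with hand-picked initial centroids:

```python
# Sketch of Lloyd's algorithm for K-means clustering.

def kmeans(points, centroids, iterations=10):
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid's cluster
        clusters = [[] for _ in centroids]
        for p in points:
            sq_dists = [sum((a - b) ** 2 for a, b in zip(p, c))
                        for c in centroids]
            clusters[sq_dists.index(min(sq_dists))].append(p)
        # Update step: move each centroid to its cluster mean
        centroids = [
            [sum(dim) / len(cluster) for dim in zip(*cluster)] if cluster else c
            for cluster, c in zip(clusters, centroids)
        ]
    return centroids, clusters

points = [[1, 1], [1.5, 2], [8, 8], [9, 9]]  # hypothetical 2-D data
centroids, clusters = kmeans(points, centroids=[[0, 0], [10, 10]])
print(centroids)  # → [[1.25, 1.5], [8.5, 8.5]]
```

In practice the result depends on initialization, so the algorithm is usually restarted several times and the run with the lowest within-cluster SSE is kept.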
Hierarchical Clustering
Algorithm building nested clusters via successive merges (agglomerative) or splits (divisive).
Gower Distance
Dissimilarity measure that handles mixed numerical and categorical data; used in clustering.
Market Basket Analysis (MBA)
Mining transaction data to find associations between items frequently purchased together.
Frequent Item-Set
Set of items appearing together in transactions at least as often as a support threshold.
Support (MBA)
Proportion of transactions containing a given item-set.
Confidence (MBA)
Conditional probability that a transaction containing antecedent items also contains consequent items.
Lift
Ratio of confidence to expected confidence; evaluates strength of association relative to independence.
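The support, confidence, and lift cards can be computed directly from transaction data; a minimal sketch for the hypothetical rule {bread} → {butter}:

```python
# Support, confidence, and lift for a single association rule
# over hypothetical transaction data.

transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "milk"},
    {"milk"},
    {"bread", "butter", "jam"},
]
n = len(transactions)

def support(itemset):
    """Proportion of transactions containing the item-set."""
    return sum(itemset <= t for t in transactions) / n

antecedent, consequent = {"bread"}, {"butter"}
supp_rule = support(antecedent | consequent)  # P(bread and butter) = 0.6
confidence = supp_rule / support(antecedent)  # P(butter | bread)   = 0.75
lift = confidence / support(consequent)       # 0.75 / 0.6          = 1.25

print(supp_rule, confidence, lift)
```

A lift above 1 (here 1.25) means bread and butter co-occur more often than independence would predict.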
Apriori Algorithm
Classic method for mining frequent item-sets by exploring increasing item-set sizes and pruning infrequent candidates.