Mean
The average of a set of numbers, found by summing values and dividing by count.
Variance
A measure of how far data points spread out from the mean.
Precision
The proportion of true positives among all predicted positives.
Recall
The proportion of true positives among all actual positives.
F1-score
Harmonic mean of precision and recall, useful for imbalanced datasets.
Supervised Learning
Machine learning using labeled data to train models.
Unsupervised Learning
Machine learning using unlabeled data to find patterns.
Overfitting
When a model performs well on training data but poorly on new data.
Normalization
Scaling data to a standard range, often 0-1.
Neural Network
A computational model inspired by the human brain, consisting of layers of nodes.
Bias in AI
Systematic error introduced when training data misrepresents reality.
Bayes' Theorem
A formula for conditional probability: P(A|B) = P(B|A)P(A)/P(B).
ROC Curve
Graph showing trade-off between true positive rate and false positive rate.
Feature Engineering
Creating new input variables to improve model performance.
Reinforcement Learning
Training agents through rewards and penalties for actions taken.
Population vs. sample
A population is the entire set of interest; a sample is a subset drawn from it to estimate population parameters.
Parameter vs. statistic
Parameters describe populations (e.g., μ, σ); statistics describe samples (e.g., x̄, s).
Mean
Sum of values divided by count; sensitive to outliers.
Median
Middle value in a sorted list; robust to outliers.
Mode
Most frequent value; can be multimodal.
Variance
Average squared deviation from the mean; population: σ², sample: s².
Standard deviation
Square root of variance; interpretable spread in original units.
Interquartile range (IQR)
Q3 − Q1; robust spread measure used in boxplots.
Skewness
Asymmetry of distribution; positive skew has a long right tail.
Kurtosis
Tail heaviness relative to normal; high kurtosis implies heavy tails.
Empirical rule
In normal distributions, ~68%, 95%, 99.7% within 1, 2, 3 SDs.
Z-score
Standardized value: (x - μ)/σ; compares across scales.
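A minimal NumPy sketch of standardization, using made-up values:

```python
import numpy as np

x = np.array([4.0, 7.0, 10.0, 13.0, 16.0])   # illustrative values

# z-score: subtract the mean, divide by the standard deviation
z = (x - x.mean()) / x.std()

print(z)                   # standardized values
print(z.mean(), z.std())   # ~0.0 and 1.0
```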
Central limit theorem
Sample mean tends toward normal as n increases, regardless of population distribution.
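A minimal simulation sketch, assuming NumPy and a skewed exponential population:

```python
import numpy as np

rng = np.random.default_rng(0)

# Population: exponential (heavily skewed), mean = 1
n, trials = 50, 10_000
sample_means = rng.exponential(scale=1.0, size=(trials, n)).mean(axis=1)

# The distribution of the sample means is approximately normal around 1
print(sample_means.mean())   # close to 1
print(sample_means.std())    # close to 1 / sqrt(n)
```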
Law of large numbers
Sample average converges to population mean as sample size grows.
Correlation vs. causation
Correlation quantifies association; causation requires mechanisms and controls.
Pearson correlation
Linear association; sensitive to outliers; −1 to 1.
Spearman correlation
Rank-based; robust to nonlinearity and outliers.
Probability basics
P(A ∪ B) = P(A) + P(B) - P(A ∩ B); independence: P(A ∩ B)=P(A)P(B).
Conditional probability
P(A|B)=P(A ∩ B)/P(B).
Bayes' theorem
P(A|B)=P(B|A)P(A)/P(B).
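A worked example with made-up numbers (1% prevalence, 95% sensitivity, 5% false-positive rate), sketched in Python:

```python
# P(A): prior probability of the condition
p_a = 0.01
# P(B|A): probability of a positive test if the condition is present
p_b_given_a = 0.95
# P(B|not A): false-positive rate
p_b_given_not_a = 0.05

# Total probability of a positive test, P(B)
p_b = p_b_given_a * p_a + p_b_given_not_a * (1 - p_a)

# Posterior P(A|B) via Bayes' theorem
p_a_given_b = p_b_given_a * p_a / p_b
print(round(p_a_given_b, 3))   # ~0.161: most positives are still false alarms
```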
Prior vs. posterior
Prior: belief before data; posterior: updated belief after observing evidence via Bayes.
Likelihood
Probability of data given parameters; central in ML (maximum likelihood).
Distributions: normal
Symmetric, bell-shaped; defined by μ, σ; ubiquitous in measurement data.
Distributions: binomial
Fixed n trials, success probability p; counts of successes; mean np, var np(1−p).
Distributions: Poisson
Counts of events over fixed interval with rate λ; mean = variance = λ.
Distributions: exponential
Memoryless waiting times; parameter λ; mean 1/λ.
Distributions: Bernoulli
Single trial with success/failure; mean p, variance p(1−p).
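A sampling sketch with NumPy's random generator, using made-up parameters, to check the means and variances listed above (NumPy parameterizes the exponential by scale = 1/λ):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

samples = {
    "normal(mu=0, sigma=1)": rng.normal(0, 1, n),
    "binomial(n=10, p=0.3)": rng.binomial(10, 0.3, n),   # mean np=3, var np(1-p)=2.1
    "poisson(lam=4)":        rng.poisson(4, n),          # mean = var = 4
    "exponential(lam=2)":    rng.exponential(1 / 2, n),  # scale = 1/lambda, mean 0.5
    "bernoulli(p=0.3)":      rng.binomial(1, 0.3, n),    # mean p, var p(1-p)
}

for name, s in samples.items():
    print(f"{name}: mean={s.mean():.3f}, var={s.var():.3f}")
```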
Sampling methods
Simple random, stratified, cluster, systematic; impact bias and variance.
Bias types (stats)
Selection bias, survivorship bias, measurement bias, nonresponse bias.
Confidence intervals
Range likely containing parameter; depends on variability and sample size.
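A minimal normal-approximation sketch on simulated data; for small samples a t critical value would replace 1.96:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=10, scale=2, size=200)    # made-up sample

# 95% CI for the mean: x_bar +/- 1.96 * s / sqrt(n)
n = len(x)
x_bar = x.mean()
se = x.std(ddof=1) / np.sqrt(n)              # standard error from the sample SD
ci = (x_bar - 1.96 * se, x_bar + 1.96 * se)
print(ci)                                    # should cover 10 in ~95% of repeated samples
```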
Hypothesis testing
Null vs. alternative; p-value assesses evidence against null.
Type I vs. Type II error
Type I: false positive (α); Type II: false negative (β); power = 1−β.
Data quality dimensions
Accuracy, completeness, consistency, timeliness, validity, uniqueness.
Data cleaning
Handle missing (drop, impute), fix types, de-duplicate, resolve outliers, enforce constraints.
Missing data mechanisms
MCAR, MAR, MNAR; guide imputation strategy.
Imputation methods
Mean/median, mode, KNN impute, regression impute, multivariate imputation (MICE).
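A minimal scikit-learn sketch on a toy matrix with missing values (median and KNN imputation only):

```python
import numpy as np
from sklearn.impute import KNNImputer, SimpleImputer

# Toy matrix with missing entries marked as np.nan
X = np.array([[1.0, 2.0],
              [np.nan, 3.0],
              [7.0, np.nan],
              [4.0, 5.0]])

# Univariate: replace each missing value with the column median
X_median = SimpleImputer(strategy="median").fit_transform(X)

# Multivariate: estimate missing values from the k nearest rows
X_knn = KNNImputer(n_neighbors=2).fit_transform(X)

print(X_median)
print(X_knn)
```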
Feature scaling
Normalization (min-max), standardization (z-score), robust scaling (IQR-based).
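A NumPy sketch of the three scalings on a made-up feature that contains an outlier:

```python
import numpy as np

x = np.array([1.0, 2.0, 2.5, 3.0, 100.0])   # made-up feature with an outlier

# Min-max normalization to [0, 1]
minmax = (x - x.min()) / (x.max() - x.min())

# Z-score standardization
standard = (x - x.mean()) / x.std()

# Robust scaling: center on the median, scale by the IQR
q1, q3 = np.percentile(x, [25, 75])
robust = (x - np.median(x)) / (q3 - q1)

print(minmax, standard, robust, sep="\n")
```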
Feature encoding
One-hot, ordinal, target encoding (use with caution to avoid leakage).
Feature selection
Filter (correlation, chi-squared), wrapper (RFE), embedded (L1/L2 regularization).
Dimensionality reduction
PCA (linear), t-SNE/UMAP (manifold visualization), autoencoders (nonlinear).
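A minimal scikit-learn PCA sketch on made-up data (t-SNE/UMAP and autoencoders omitted):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))                    # made-up 5-dimensional data
X[:, 3] = X[:, 0] + 0.1 * rng.normal(size=200)   # nearly redundant column

pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)                      # project onto the top 2 components

print(X_2d.shape)                                # (200, 2)
print(pca.explained_variance_ratio_)             # variance retained per component
```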
Data leakage
Training data includes information from the test set or the future; inflates measured performance; avoid via strict splits.
Train/validation/test split
Typical: 60-20-20 or 70-15-15; validation tunes; test is final unbiased estimate.
Cross-validation
k-fold, stratified k-fold for classification; reduces variance of performance estimates.
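A minimal scikit-learn sketch of stratified 5-fold cross-validation on a made-up imbalanced dataset:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Toy imbalanced binary classification data
X, y = make_classification(n_samples=500, n_features=10,
                           weights=[0.8, 0.2], random_state=0)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                         cv=cv, scoring="f1")

print(scores)          # one F1 score per fold
print(scores.mean())   # cross-validated estimate
```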
Stratification
Preserve class proportions across folds/splits; critical in imbalanced data.
Supervised learning
Learn mapping from features X to labels y using labeled data.
Regression vs. classification
Regression predicts continuous values; classification predicts discrete classes.
Overfitting vs. underfitting
Overfit: memorizes noise; underfit: too simple; manage with regularization, more data.
Bias-variance tradeoff
High bias: underfit; high variance: overfit; aim for optimal complexity.
Regularization
L1 (lasso) sparsity, L2 (ridge) shrinkage; reduces overfitting.
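A scikit-learn sketch contrasting L2 and L1 penalties on made-up data where only one feature matters (alpha values are arbitrary):

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))
y = X[:, 0] * 3.0 + rng.normal(size=200)   # only the first feature is informative

ridge = Ridge(alpha=1.0).fit(X, y)         # L2: shrinks all coefficients
lasso = Lasso(alpha=0.1).fit(X, y)         # L1: drives irrelevant ones to exactly 0

print(np.count_nonzero(ridge.coef_))       # all 20 remain nonzero (just smaller)
print(np.count_nonzero(lasso.coef_))       # far fewer: a sparse solution
```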
Early stopping
Halt training when validation loss stops improving to prevent overfit.
Ensembles
Bagging (Random Forest), boosting (XGBoost), stacking; often superior generalization.
Linear regression
Minimize \(\sum (y - \hat{y})^2\); assumptions: linearity, homoscedasticity, normal errors, independence.
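A closed-form least-squares sketch with NumPy on simulated data (true intercept 2, slope 3):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=100)
y = 2.0 + 3.0 * x + rng.normal(scale=1.0, size=100)

# Ordinary least squares: minimize sum((y - y_hat)^2) in closed form
X = np.column_stack([np.ones_like(x), x])   # add an intercept column
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

print(beta)                                 # approximately [2, 3]
```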
Logistic regression
Sigmoid outputs probability; decision boundary via log-odds; interpretable coefficients.
KNN
Instance-based; choose k and distance metric; sensitive to scaling and noise.
Naive Bayes
Assumes feature independence; strong baseline for text; fast and robust.
Decision trees
Recursive splits; interpretable; prone to overfitting without pruning.
Random forest
Ensemble of trees via bagging; reduces variance; provides feature-importance estimates.
Gradient boosting
Sequential trees fit residuals; powerful but sensitive to hyperparameters.
SVM
Maximize margin with kernels (linear, RBF); effective in high-dimensional spaces.
Clustering: K-means
Partition into k clusters; minimizes within-cluster variance; requires scaling; assumes roughly spherical clusters.
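A minimal scikit-learn sketch on three made-up blobs, with scaling applied first:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Three made-up blobs in 2D
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(100, 2)) for c in (0, 5, 10)])

X_scaled = StandardScaler().fit_transform(X)   # scaling matters for distance-based methods
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_scaled)

print(km.labels_[:5])   # cluster assignment per point
print(km.inertia_)      # within-cluster sum of squares being minimized
```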
Clustering: hierarchical
Agglomerative/divisive; dendrogram visual; flexible but computationally heavy.
Clustering: DBSCAN
Density-based; finds arbitrary shapes and noise; requires eps/minPts tuning.
Topic modeling
LDA uncovers topics via word distributions; unsupervised text analysis.
Evaluation: accuracy
Proportion correct; misleading in imbalanced data.
Precision
TP / (TP + FP); the fraction of predicted positives that are correct.
Recall (sensitivity)
TP / (TP + FN); the fraction of actual positives that are captured.
Specificity
TN / (TN + FP); true negative rate.
F1-score
Harmonic mean of precision and recall; balances both.
Confusion matrix
2×2 summary: TP, FP, TN, FN; foundation for metrics.
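A scikit-learn sketch computing the counts and the metrics above from made-up labels:

```python
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score

# Made-up binary labels and predictions
y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tp, fp, tn, fn)                     # the counts behind every metric below

print(precision_score(y_true, y_pred))    # TP / (TP + FP)
print(recall_score(y_true, y_pred))       # TP / (TP + FN)
print(f1_score(y_true, y_pred))           # harmonic mean of the two
```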
ROC curve
TPR vs. FPR across thresholds; AUC summarizes separability.
PR curve
Precision vs. recall; preferred in heavy class imbalance.
Regression metrics
MAE (robust), MSE (penalizes large errors), RMSE (scale-aware), \(R^2\) (variance explained).
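A NumPy sketch computing all four metrics on made-up predictions:

```python
import numpy as np

y_true = np.array([3.0, 5.0, 7.5, 10.0])   # made-up targets
y_pred = np.array([2.5, 5.0, 8.0, 12.0])   # made-up predictions

mae = np.mean(np.abs(y_true - y_pred))
mse = np.mean((y_true - y_pred) ** 2)
rmse = np.sqrt(mse)
r2 = 1 - np.sum((y_true - y_pred) ** 2) / np.sum((y_true - y_true.mean()) ** 2)

print(mae, mse, rmse, r2)
```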
Calibration
Agreement between predicted probabilities and observed frequencies; reliability diagrams.
Threshold selection
Choose decision threshold optimizing metric of interest (F1, cost-sensitive, Youden's J).
Neural networks
Layers of neurons; weights and biases; nonlinear activations enable complex functions.
Activation functions
ReLU, Leaky ReLU, Sigmoid, Tanh, Softmax (for multiclass probabilities).
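A NumPy sketch of three common activations on a made-up input vector:

```python
import numpy as np

def relu(x):
    return np.maximum(0, x)

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def softmax(x):
    e = np.exp(x - np.max(x))   # subtract the max for numerical stability
    return e / e.sum()

z = np.array([-2.0, 0.0, 3.0])
print(relu(z))      # negatives clipped to 0
print(sigmoid(z))   # squashed into (0, 1)
print(softmax(z))   # sums to 1: multiclass probabilities
```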
Backpropagation
Gradient-based weight updates via chain rule; paired with optimizers like SGD/Adam.
Vanishing/exploding gradients
Gradients shrink or blow up in deep nets; mitigated with normalization and residual connections.
Batch normalization
Normalizes layer inputs per batch; stabilizes training.
Dropout
Randomly zeroes activations; regularizes by preventing co-adaptation.
CNNs
Convolutions for spatial features; pooling reduces dimensions; used in computer vision.
RNNs
Sequential models with recurrent connections; struggle with long-range dependencies.
LSTM/GRU
Gated RNNs; capture long-term dependencies more effectively than vanilla RNNs.
Transformers
Attention mechanisms model global dependencies; state-of-the-art in NLP and beyond.
Word embeddings
Dense vector representations (Word2Vec, GloVe); capture semantic similarity.