Machine Learning - Cap 5610 - Midterm

1
New cards

Machine Learning (ML)

Learn from data

2
New cards

ACM

Association for Computing Machinery

3
New cards

ACM Turing Award

Equivalent to the “Nobel Prize of Computing”

4
New cards

Alan Mathison Turing (1912–1954)

Father of theoretical computer science and artificial intelligence who articulated the mathematical foundation and limits of computing

5
New cards

Supervised Learning

A type of machine learning where both input and output are given (labeled data)

6
New cards

Unsupervised Learning

A type of machine learning where only input is given and no labels (unlabeled data)

7
New cards

Support Vector Machine (SVM)

A supervised learning model that classifies data by finding the optimal hyperplane

8
New cards

K-means Clustering

An unsupervised learning algorithm that groups data into clusters based on similarity

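Since the deck contrasts supervised (SVM) and unsupervised (k-means) learning, here is a minimal sketch of both on the same inputs, assuming scikit-learn and its bundled Iris data; it is illustrative, not the course's own code.

```python
# Supervised vs. unsupervised learning on Iris (scikit-learn assumed).
from sklearn.datasets import load_iris
from sklearn.svm import SVC
from sklearn.cluster import KMeans

X, y = load_iris(return_X_y=True)

# Supervised: both inputs X and labels y are given.
svm = SVC(kernel="linear").fit(X, y)
print("SVM training accuracy:", svm.score(X, y))

# Unsupervised: only inputs X are given; k-means infers 3 clusters.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print("Cluster assignments:", km.labels_[:10])
```
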
9
New cards

Feature

An independent variable

10
New cards

Observation (Row)

An example or data point in the dataset

11
New cards

Accuracy

(TP+TN)/(TP+TN+FP+FN) – proportion of correct predictions

12
New cards

Precision

TP/(TP+FP) – proportion of predicted positives that are truly positive

13
New cards

Recall (Sensitivity/TPR)

TP/(TP+FN) – proportion of actual positives correctly predicted

14
New cards

Specificity

TN/(TN+FP) – proportion of actual negatives correctly predicted

15
New cards

F1 Score

2 * (Precision * Recall) / (Precision + Recall) – harmonic mean of precision and recall

16
New cards

Confusion Matrix

A table showing counts of TP, FP, FN, and TN for a classifier’s predictions

17
New cards

True Positive (TP)

Correctly predicted positive cases

18
New cards

False Positive (FP)

Incorrectly predicted as positive

19
New cards

False Negative (FN)

Incorrectly predicted as negative

20
New cards

True Negative (TN)

Correctly predicted negative cases

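The metrics in cards 11–20 can be computed directly from a confusion matrix. A hedged sketch, assuming scikit-learn; the labels are toy values, not course data:

```python
# Confusion-matrix-based metrics (scikit-learn assumed).
from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# ravel() flattens the 2x2 matrix in scikit-learn's order: TN, FP, FN, TP.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy    = (tp + tn) / (tp + tn + fp + fn)
precision   = tp / (tp + fp)
recall      = tp / (tp + fn)          # sensitivity / TPR
specificity = tn / (tn + fp)
f1          = 2 * precision * recall / (precision + recall)
print(accuracy, precision, recall, specificity, f1)
```
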
21
New cards

ROC Curve

A graphical plot of True Positive Rate (TPR) vs. False Positive Rate (FPR) at various thresholds

22
New cards

AUC (Area Under Curve)

A measure of the overall performance of a classifier across thresholds

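A minimal sketch of the ROC curve and AUC from predicted scores, assuming scikit-learn and matplotlib are installed; the scores are illustrative, not from a real model:

```python
# ROC curve (TPR vs. FPR across thresholds) and its AUC.
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score

y_true  = [0, 0, 1, 1, 0, 1, 1, 0]
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.6, 0.3]

fpr, tpr, thresholds = roc_curve(y_true, y_score)  # one point per threshold
auc = roc_auc_score(y_true, y_score)               # area under that curve

plt.plot(fpr, tpr, label=f"AUC = {auc:.2f}")
plt.plot([0, 1], [0, 1], "--")                     # chance line
plt.xlabel("False Positive Rate"); plt.ylabel("True Positive Rate")
plt.legend(); plt.show()
```
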
23
New cards

Cross-validation

A resampling method (e.g., 5-fold) to evaluate model performance on different data splits

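A minimal 5-fold cross-validation sketch, assuming scikit-learn; the model and dataset are stand-ins:

```python
# 5-fold cross-validation: train/test on 5 different data splits.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print("Per-fold accuracy:", scores, "mean:", scores.mean())
```
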
24
New cards

LASSO (Least Absolute Shrinkage and Selection Operator)

A regression-based method for feature selection; its L1 penalty shrinks some coefficients exactly to zero, discarding those features

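A sketch of LASSO-based feature selection, assuming scikit-learn; the diabetes dataset and alpha value are illustrative choices:

```python
# LASSO feature selection: nonzero coefficients mark the kept features.
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

X, y = load_diabetes(return_X_y=True)
X = StandardScaler().fit_transform(X)   # LASSO is scale-sensitive

lasso = Lasso(alpha=0.5).fit(X, y)      # alpha controls shrinkage strength
selected = np.flatnonzero(lasso.coef_)  # indices of surviving features
print("Selected feature indices:", selected)
```
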
25
New cards

Concrete Autoencoder (CAE)

A deep learning method for feature selection using neural networks

26
New cards

Feature Extraction

Process of creating a new, smaller set of features that captures the most useful information from the original data

27
New cards

Principal Component Analysis (PCA)

A dimensionality reduction method for feature extraction

28
New cards

Autoencoder (AE)

A neural network that compresses and reconstructs data for feature extraction

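A minimal autoencoder sketch in PyTorch (an assumption; the course may use a different framework). The bottleneck compresses the input and the decoder reconstructs it; the bottleneck activations serve as the extracted features:

```python
# Autoencoder: compress (encode) then reconstruct (decode) the input.
import torch
import torch.nn as nn

autoencoder = nn.Sequential(
    nn.Linear(20, 4),   # encoder: 20 input features -> 4-dim bottleneck
    nn.ReLU(),
    nn.Linear(4, 20),   # decoder: reconstruct the 20 original features
)
opt = torch.optim.Adam(autoencoder.parameters(), lr=1e-3)
x = torch.randn(64, 20)            # toy batch; real data would go here

for _ in range(100):               # minimize reconstruction error
    loss = nn.functional.mse_loss(autoencoder(x), x)
    opt.zero_grad(); loss.backward(); opt.step()
print("final reconstruction MSE:", loss.item())
```
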
29
New cards

Pan-cancer Analysis

Studying multiple cancer types to find common biomarkers using machine learning

30
New cards

Stochastic Behavior of CAE

The variability in results due to randomness in training runs of Concrete Autoencoder

31
New cards
Data Representation (Computer Science)
Data is represented as 0 and 1 in general computing
32
New cards
Data Representation (Machine Learning)
Data is represented as vectors and matrices
33
New cards
Scalar
A single number, represented by a lowercase italic letter (e.g., 𝑥)
34
New cards
Vector
An ordered array of numbers
35
New cards
Matrix
A 2-D array of numbers identified by two indices
36
New cards
Tensor
An array with more than two axes
37
New cards
Categorical Variable
Discrete/qualitative variable
38
New cards
Nominal Variable
Categorical variable with two or more categories that have no intrinsic order
39
New cards
Ordinal Variable
Categorical variable with two or more categories that can be ordered or ranked
40
New cards
Continuous Variable
A variable that takes values on a continuous scale
41
New cards
Sample
An item to process (classify or cluster)
42
New cards
Feature Vector
An N-dimensional vector of numerical features representing a sample
43
New cards

Feature Selection

Process of filtering out irrelevant or redundant features while keeping a subset of the original features

44
New cards
Structured Data
Data organized in rows and columns (e.g., spreadsheets or relational tables)
45
New cards

Unstructured Data

Data without a predefined structure (e.g., text data), often converted into structured form like Bag-of-Words

46
New cards
Input Vector (𝑥𝑖)
Independent variable representing the ith sample
47
New cards
Response Variable (𝑦)
Dependent variable representing the outcome
48
New cards

Binary Classification (𝑦 ∈ {−1, 1} or {0, 1})

Task of predicting one of two classes

49
New cards
Multi-label Classification (𝑦 ∈ ℤ)
Task of predicting multiple discrete labels
50
New cards
Regression (𝑦 ∈ ℝ)
Predicting a continuous value
51
New cards
Principal Component Analysis (PCA)
A dimensionality reduction technique that transforms high-dimensional data into uncorrelated variables (principal components) capturing maximum variance
52
New cards
Principal Component (PC)
A linear combination of original variables that explains variance in data
53
New cards
PC1
Principal component explaining the largest variance in the dataset
54
New cards
PC2
Principal component explaining the next largest variance
55
New cards
Scree Plot
A plot showing the variance explained by each principal component
56
New cards
Step 1 of PCA
Standardization and centering data so variables share the same scale
57
New cards
Step 2 of PCA
Compute covariance matrix to identify relationships between variables
58
New cards
Step 3 of PCA
Eigen decomposition to identify eigenvectors (principal components) and eigenvalues (variance explained)
59
New cards
Eigenvector
A vector indicating a direction of maximum variance in data
60
New cards
Eigenvalue
A scalar indicating the amount of variance explained by its corresponding eigenvector
61
New cards
Step 4 of PCA
Select significant principal components to create a feature vector
62
New cards
Step 5 of PCA
Project data onto the new feature vector space (dimensionality reduction)
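
The five PCA steps above, sketched explicitly with NumPy (an assumption; scikit-learn's PCA wraps the same math) on toy data:

```python
# PCA from scratch, mirroring Steps 1-5 above.
import numpy as np

X = np.random.rand(100, 5)                    # toy data: 100 samples, 5 vars
Xc = (X - X.mean(axis=0)) / X.std(axis=0)     # Step 1: standardize and center
C = np.cov(Xc, rowvar=False)                  # Step 2: covariance matrix
eigvals, eigvecs = np.linalg.eigh(C)          # Step 3: eigendecomposition
order = np.argsort(eigvals)[::-1]             # sort PCs by variance explained
W = eigvecs[:, order[:2]]                     # Step 4: keep top 2 PCs
Z = Xc @ W                                    # Step 5: project (reduce dim)
print("explained variance ratio:", eigvals[order[:2]] / eigvals.sum())
```
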
63
New cards
Applications of PCA
Data visualization, dimensionality reduction, and feature extraction
64
New cards
t-SNE (t-Distributed Stochastic Neighbor Embedding)
A nonlinear dimensionality reduction technique for visualizing high-dimensional data in 2D or 3D
65
New cards
t-SNE Purpose
Creates a visual “map” of high-dimensional data to reveal patterns and clusters
66
New cards
t-SNE Strength
Preserves local structure of data
67
New cards
Difference PCA vs. t-SNE
PCA preserves global structure (deterministic) while t-SNE preserves local structure (stochastic)
68
New cards
KL Divergence in t-SNE
Measures the difference between probability distributions in original and reduced dimensions
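
For reference, the objective t-SNE minimizes is the KL divergence between the high-dimensional similarity distribution P and its low-dimensional counterpart Q (the algorithm's standard cost function, stated here rather than taken from this deck):

```latex
\mathrm{KL}(P \,\|\, Q) = \sum_{i \neq j} p_{ij} \log \frac{p_{ij}}{q_{ij}}
```
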
69
New cards
Model Parameters
Values determined using the training dataset
70
New cards
Hyperparameters
Values set before training that control the learning process (e.g., t-SNE perplexity, number of k-means clusters)
71
New cards
t-SNE Hyperparameter: Components
Dimension of the embedded space
72
New cards

t-SNE Hyperparameter: Perplexity

Effective number of nearest neighbors (recommended 5–50; must be less than the number of samples)

73
New cards
t-SNE Hyperparameter: Iterations
Number of optimization steps (≥250 recommended)
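
A sketch tying the three hyperparameters above to scikit-learn's TSNE (assumed available); note the stochasticity, so different random seeds can give different embeddings:

```python
# t-SNE embedding of Iris into 2D (scikit-learn assumed).
from sklearn.datasets import load_iris
from sklearn.manifold import TSNE

X, _ = load_iris(return_X_y=True)
emb = TSNE(
    n_components=2,   # dimension of the embedded space
    perplexity=30,    # effective number of neighbors (5-50, < n_samples)
    random_state=0,   # fixes the stochastic optimization
    # iterations default to 1000 (named n_iter in older scikit-learn,
    # max_iter in newer releases)
).fit_transform(X)
print(emb.shape)      # (150, 2)
```
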
74
New cards

Scatter Plot

A plot showing relationships between two variables (e.g., Iris dataset)

75
New cards
Line Plot
Visualization showing how one variable changes with respect to another
76
New cards
Histogram
Approximate representation of the distribution of numerical data (introduced by Karl Pearson)
77
New cards
Density Plot
Plot showing proportions of values in a distribution
78
New cards
Bar Plot
Visualization effective for categorical data with fewer than 10 categories
79
New cards

Box Plot

Visualization showing distributions with quartiles and medians (five-number summary: min, Q1, median, Q3, max)

80
New cards
Violin Plot
Visualization combining box plot and density plot to show distribution and probability density
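
A quick sketch of the distribution plots above using matplotlib (assumed); the data is a toy normal sample, not the course's Iris or IC50 data:

```python
# Histogram, box plot, and violin plot of the same sample.
import numpy as np
import matplotlib.pyplot as plt

data = np.random.normal(size=200)
fig, axes = plt.subplots(1, 3, figsize=(9, 3))
axes[0].hist(data, bins=20);  axes[0].set_title("Histogram")
axes[1].boxplot(data);        axes[1].set_title("Box plot")     # 5-number summary
axes[2].violinplot(data);     axes[2].set_title("Violin plot")  # box + density
plt.tight_layout(); plt.show()
```
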
81
New cards
LN_IC50
Log-normalized IC50, representing a tumor cell line’s sensitivity or resistance to a drug dose
82
New cards
Low IC50
Indicates tumor cells are more sensitive to the drug (better response)
83
New cards
High IC50
Indicates tumor cells are resistant to the drug (worse response)
84
New cards
Regression Models
Predict a continuous target variable such as drug response (IC50)
85
New cards
Five-Fold Cross Validation
Model evaluation method splitting data into 5 folds for training/testing
86
New cards
MAE (Mean Absolute Error)
Average of absolute differences between predictions and true values
87
New cards
MSE (Mean Squared Error)
Average of squared differences between predictions and true values
88
New cards
RMSE (Root Mean Squared Error)
Square root of MSE, expressed in the same units as the target variable
89
New cards
R² (Coefficient of Determination)
Proportion of variance explained by the model
90
New cards
PCC (Pearson Correlation Coefficient)
Measure of linear correlation between predicted and actual values
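
The regression metrics above, computed on toy predictions, assuming scikit-learn and SciPy are available:

```python
# MAE, MSE, RMSE, R^2, and PCC on toy values.
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from scipy.stats import pearsonr

y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5,  0.0, 2.0, 8.0])

mae  = mean_absolute_error(y_true, y_pred)
mse  = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)                       # RMSE = sqrt(MSE)
r2   = r2_score(y_true, y_pred)
pcc  = pearsonr(y_true, y_pred)[0]        # Pearson correlation coefficient
print(mae, mse, rmse, r2, pcc)
```
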
91
New cards
CatBoost
Gradient boosting decision tree model shown to perform best for Docetaxel regression
92
New cards
SHAP (SHapley Additive exPlanations)
Explainable AI method for feature importance attribution
93
New cards
Global Feature Importance (Regression)
Average of absolute SHAP values across all samples
94
New cards
Local Feature Importance (Regression)
SHAP values for an individual sample prediction
95
New cards
Top 10 Genes (Docetaxel)
Most important genomic features identified by SHAP for drug response
96
New cards
Force Plot (SHAP)
Visualization showing how each feature contributes positively or negatively to prediction
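
A hedged sketch of SHAP attribution for a tree model, assuming the `shap` package; the random-forest model and diabetes data stand in for the course's Docetaxel setup:

```python
# Global vs. local SHAP feature importance for a tree model.
import numpy as np
import shap
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor

X, y = load_diabetes(return_X_y=True)
model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)

explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)     # one value per feature per sample

global_importance = np.abs(shap_values).mean(axis=0)  # global: mean |SHAP|
local_importance = shap_values[0]                     # local: one sample
print("Top feature index:", global_importance.argmax())
```
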
97
New cards

Classification

Predicting categorical labels (e.g., cancer type) using input features

98
New cards
Five-Fold Cross Validation (Classification)
Evaluation method where classifier is trained/tested across 5 splits
99
New cards
Evaluation Metrics (Classification)
Accuracy, precision, recall, specificity, F1 score, and AUC
100
New cards
LightGBM
Gradient boosting framework identified as best classifier in experiments
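
An end-to-end sketch matching the classification setup above: a LightGBM classifier under five-fold cross-validation, assuming the `lightgbm` package; the breast-cancer dataset is a stand-in for the course's genomic features:

```python
# LightGBM classifier evaluated with 5-fold CV (lightgbm + scikit-learn assumed).
from lightgbm import LGBMClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)
scores = cross_val_score(LGBMClassifier(random_state=0), X, y,
                         cv=5, scoring="accuracy")
print("Five-fold accuracies:", scores.round(3), "mean:", scores.mean().round(3))
```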