DA 320 Prep
List
Mutable, ordered sequence ([1, 'a'])
Tuple
Immutable, ordered sequence ((1, 'a'))
Dictionary
Key-value pairs ({'key': value})
Set
Unordered collection of unique elements
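A quick Python sketch of the four structures (the example values are illustrative):

```python
# Python's four core collection types
nums = [1, 'a']              # list: mutable, ordered sequence
point = (1, 'a')             # tuple: immutable, ordered sequence
d = {'key': 42}              # dict: key-value pairs
unique = {1, 2, 3}           # set: unordered, unique elements

nums.append('b')             # lists can grow in place
# point[0] = 2               # would raise TypeError: tuples are immutable
print(d['key'])              # lookup by key -> 42
print(2 in unique)           # fast membership test -> True
```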
Data Wrangling - Discovering
Exploratory data analysis (EDA) to understand structure.
Data Wrangling - Structure
Transforming features to uniform formats.
Data Wrangling - Cleaning
Addressing missing values and outliers.
Data Wrangling - Enriching
Adding new features or external data.
Data Wrangling - Validating
Verifying consistency and accuracy of data.
Data Wrangling - Publishing
Making the dataset available for others.
Missing Completely at Random (MCAR)
Probability of missingness is same for all cases.
Missing at Random (MAR)
Probability of missingness depends on observed data.
Missing Not at Random (MNAR)
Probability of missingness depends on the unobserved (missing) values themselves.
How to visualize categorical features?
Bar charts (Counts), Pie Charts, Stacked/Grouped Bar Charts
How to visualize numerical features?
Histogram, Box Plot (min/Q1/median/Q3/max), Scatter Plot
Tukey’s Fences
Outliers are values falling below Q1 - 1.5(IQR) or above Q3 + 1.5(IQR), where IQR is the interquartile range.
Z Score
Z = (x - mean) / standard deviation; points with |Z| > 3 are often considered outliers.
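A minimal sketch applying both outlier rules above to a toy sample:

```python
import statistics

data = [2, 4, 4, 5, 5, 6, 7, 8, 40]   # 40 is an obvious outlier

# Tukey's fences: Q1 - 1.5*IQR and Q3 + 1.5*IQR
q1, _, q3 = statistics.quantiles(data, n=4)     # quartiles
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
tukey_outliers = [x for x in data if x < lower or x > upper]

# Z-score rule: |Z| > 3 (stricter; may not flag 40 in this tiny sample)
mu, sigma = statistics.mean(data), statistics.stdev(data)
z_outliers = [x for x in data if abs((x - mu) / sigma) > 3]

print(tukey_outliers, z_outliers)
```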
Linear Regression
Model: y = b + mx (b = y-intercept, m = slope)
Residuals
Difference between observed and predicted values.
Least Squares + Sum of Squared Errors
Minimize the sum of squared residuals (differences between observed and predicted values).
R² : Coefficient of Determination
Proportion of variation explained by the model.
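A minimal NumPy sketch tying these pieces together on toy data: fit by least squares, then compute residuals, SSE, and R²:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 4.3, 6.2, 8.1, 9.8])

# Least-squares fit of y = b + m*x
m, b = np.polyfit(x, y, deg=1)
y_pred = b + m * x

residuals = y - y_pred                      # observed - predicted
sse = np.sum(residuals ** 2)                # sum of squared errors
sst = np.sum((y - y.mean()) ** 2)           # total variation around the mean
r_squared = 1 - sse / sst                   # proportion of variation explained

print(f"slope={m:.2f}, intercept={b:.2f}, R^2={r_squared:.3f}")
```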
Logistic Regression
Used for binary classification, predicting 0 or 1 using the sigmoid function e^(mx + b) / (1 + e^(mx + b)). The model assumes a linear relationship between the input x and the log-odds of the outcome.
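A small sketch of the sigmoid turning a linear score into a probability; the coefficients m and b are arbitrary illustrative values:

```python
import math

def sigmoid(z):
    """Map a real-valued score to a probability in (0, 1)."""
    return math.exp(z) / (1 + math.exp(z))

m, b = 1.5, -2.0                            # illustrative coefficients
for x in [0.0, 1.0, 2.0, 3.0]:
    p = sigmoid(m * x + b)                  # P(y = 1 | x)
    print(x, round(p, 3), int(p >= 0.5))    # predict class 1 if p >= 0.5
```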
kNN
Classifies a new instance based on the majority class of its k closest neighbors.
Euclidean Distance Metric
For points x = (x1, x2) and y = (y1, y2): distance = sqrt((x1 - y1)² + (x2 - y2)²)
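A minimal kNN sketch using the Euclidean distance above; the training points and labels are made up for illustration:

```python
import math
from collections import Counter

def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def knn_predict(train, new_point, k=3):
    """Majority class among the k closest training instances."""
    neighbors = sorted(train, key=lambda item: euclidean(item[0], new_point))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

train = [((1, 1), 'A'), ((1, 2), 'A'), ((5, 5), 'B'), ((6, 5), 'B'), ((2, 1), 'A')]
print(knn_predict(train, (1.5, 1.5), k=3))   # -> 'A'
```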
Support Vector Machine (SVM)
Find a hyperplane that separates classes with the maximum margin.
Support Vectors
The specific instances closest to the decision boundary that determine its position.
Kernels
Functions (Polynomial, Radial Basis Function) that map the data to higher dimensions to make the data linearly separable.
Hinge Loss
A penalty loss function for instances that fall within the margin or on the wrong side.
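A small sketch of hinge loss for a linear score f(x), with labels coded as -1/+1:

```python
def hinge_loss(y, score):
    """Zero when the point is correctly classified beyond the margin."""
    return max(0.0, 1 - y * score)

print(hinge_loss(+1, 2.0))    # correct side, outside the margin -> 0.0
print(hinge_loss(+1, 0.5))    # correct side, inside the margin  -> 0.5
print(hinge_loss(-1, 1.0))    # wrong side of the boundary       -> 2.0
```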
Naive Bayes
Assumes features are independent given the class. Based on Bayes' theorem for calculating the posterior probability: P(Y|X) = P(X|Y)P(Y) / P(X), where P(Y) is the prior probability of the class and P(X|Y) is the likelihood.
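A toy sketch of the posterior calculation; all probabilities here are invented for illustration:

```python
# Toy spam example with invented probabilities
p_spam = 0.3                       # P(Y): prior probability of the class
p_words_given_spam = 0.8 * 0.6     # P(X|Y): product of per-feature likelihoods
p_words_given_ham = 0.1 * 0.2
p_ham = 1 - p_spam

# P(X): total probability of observing these features
p_words = p_words_given_spam * p_spam + p_words_given_ham * p_ham

# Bayes' theorem: P(Y|X) = P(X|Y) * P(Y) / P(X)
posterior_spam = p_words_given_spam * p_spam / p_words
print(round(posterior_spam, 3))
```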
Decision Tree
Nodes: root (start), decision (tests a feature), leaf (final output).
CART
Classification and Regression Trees; a splitting algorithm that recursively partitions data to minimize impurity.
Gini Index
Measures misclassification probability (0 is pure): Gini = 1 - sum(p_i^2).
Entropy
Measures information/disorder: Entropy = -sum(p_i * log(p_i)).
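Small sketches of both impurity measures computed from class probabilities:

```python
import math

def gini(probs):
    """Gini impurity: 1 - sum(p_i^2); 0 means a pure node."""
    return 1 - sum(p ** 2 for p in probs)

def entropy(probs):
    """Shannon entropy: -sum(p_i * log2(p_i))."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(gini([0.5, 0.5]), entropy([0.5, 0.5]))   # maximum impurity for 2 classes
print(gini([1.0, 0.0]), entropy([1.0, 0.0]))   # pure node -> 0 for both
```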
Bagging / Bootstrap Aggregating
Fits base models in parallel on random bootstrap samples. Aggregates via voting (classification) or averaging (regression). Example: Random Forest.
Boosting
Fits models sequentially. Each model focuses on the errors (misclassified instances) of the previous one by updating weights. Examples: AdaBoost, XGBoost.
Stacking
Fits multiple base models (often different types) and uses a meta-model to combine their predictions.
Random Forest
Collection of decision trees trained on bootstrap samples. Reduces variance and overfitting compared to single trees. Uses a random subset of features at each split.
K Means Clustering
Partition data into k clusters by minimizing distance to centroids.
Centroid Formula
Mean of points in a cluster
Elbow Method
Plot WCSS (Within Cluster Sum of Squares) vs k to find optimal cluster count.
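A minimal elbow-method sketch using scikit-learn's KMeans on toy 2-D blobs (assumes scikit-learn is available; the blob centers are arbitrary):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Three loose blobs of toy 2-D points
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(30, 2)) for c in (0, 5, 10)])

# Elbow method: WCSS (inertia) for each candidate k
for k in range(1, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(k, round(model.inertia_, 1))   # look for the "elbow" where the drop levels off
```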
Hierarchical
Agglomerative → Bottom up approach merging closest clusters.
Dendrogram → Tree diagram showing merge order and distances.
DBSCAN
Density Based. Groups points in high density regions; identifies points in low-density regions as outliers.
Dimensionality Reduction (PCA)
Principal Component Analysis transforms features into orthogonal components ordered by variance explained.
Eigenvalues
Represent the magnitude of variance explained by a factor.
Loadings
Correlation between an original feature and the principal component/factor. Loading = eigenvector * sqrt(eigenvalue).
Scree Plot
Line plot of eigenvalues used to select the number of factors (keep factors before the “leveling off” point).
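A small NumPy sketch of PCA on toy data: the eigendecomposition of the covariance matrix gives the eigenvalues (the scree-plot values) and the loadings:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))
X[:, 2] = X[:, 0] * 0.9 + rng.normal(scale=0.1, size=100)   # make features correlated

Xc = X - X.mean(axis=0)                          # center the data
cov = np.cov(Xc, rowvar=False)                   # covariance matrix
eigenvalues, eigenvectors = np.linalg.eigh(cov)
order = np.argsort(eigenvalues)[::-1]            # sort components by variance explained
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

explained = eigenvalues / eigenvalues.sum()      # values to examine in a scree plot
loadings = eigenvectors * np.sqrt(eigenvalues)   # loading = eigenvector * sqrt(eigenvalue)
print(np.round(explained, 3))
```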
Turing Test
A test proposed by Alan Turing where a machine is considered intelligent if it can successfully pretend to be human to a knowledgeable observer during a text-based interaction.
Generative AI
AI that creates new content (text, images, audio) using architectures such as GANs, Diffusion models, Transformers, and VAEs.
Hallucination
When LLMs generate nonsensical or factually inaccurate information.
Algorithmic Bias
Systematic errors creating unfair outcomes, often stemming from unrepresentative training data (e.g., facial recognition systems trained on datasets that underrepresent darker-skinned faces).
Transformer Architecture
Processes sequences in parallel, unlike RNNs, using an Encoder (converts the input sequence to a vector representation) and a Decoder (converts that vector to an output sequence auto-regressively).
Self Attention Mechanism
Weighs the importance of different words in a sequence relative to each other.
Positional Encoding
Adds info about the order of tokens since Transformers process inputs in parallel.
Perceptron (Single Neuron)
Receives inputs, multiplies them by weights, computes the weighted sum, adds a bias, and passes the result through an activation function.
Perceptron Training Rule
Updates weights to minimize error: W_new = W_old + (learning rate)(error)(x)
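A minimal sketch of a perceptron trained with this rule to learn the AND function (learning rate and epoch count are arbitrary choices):

```python
# Perceptron learning the AND function (toy example)
data = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
w = [0.0, 0.0]
bias, lr = 0.0, 0.1

for _ in range(20):                               # a few passes over the data
    for x, target in data:
        z = w[0] * x[0] + w[1] * x[1] + bias      # weighted sum + bias
        output = 1 if z >= 0 else 0               # unit-step activation
        error = target - output
        # Training rule: w_new = w_old + lr * error * x
        w = [w[i] + lr * error * x[i] for i in range(2)]
        bias += lr * error

print(w, bias)                                    # weights that separate AND
```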
Activation Function
Functions that define the output of a node given an input or set of inputs.
Unit Step
Outputs 1 if z >= 0, else 0
Sigmoid Function
Outputs a real number between 0 and 1 (useful for probabilities).
f(z) = \frac{1}{1 + e^{-z}}
Hyperbolic Tangent (tanh)
Outputs a value between -1 and 1.
f(z) = \frac{e^z - e^{-z}}{e^z + e^{-z}}
Rectified Linear Unit (ReLU)
Outputs the input directly if positive, otherwise 0: f(z) = max(0, z)
Softmax
Used for multi-class classification - converts a vector of numbers into probabilities summing to 1.
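Minimal NumPy sketches of the activation functions above (the max-shift in softmax is a common numerical-stability detail, not part of the definition):

```python
import numpy as np

def unit_step(z):
    return np.where(z >= 0, 1, 0)

def sigmoid(z):
    return 1 / (1 + np.exp(-z))                  # output in (0, 1)

def tanh(z):
    return np.tanh(z)                            # output in (-1, 1)

def relu(z):
    return np.maximum(0, z)                      # max(0, z)

def softmax(z):
    e = np.exp(z - np.max(z))                    # shift for numerical stability
    return e / e.sum()                           # probabilities summing to 1

print(softmax(np.array([1.0, 2.0, 3.0])).sum())  # -> 1.0
```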
Loss Functions
Measures the difference between predicted and actual values.
Mean Squared Error
For regression
Binary Cross-Entropy (Log Loss)
For binary classification
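Minimal NumPy sketches of both loss functions (the epsilon clip is an implementation detail to avoid log(0)):

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error for regression."""
    return np.mean((y_true - y_pred) ** 2)

def binary_cross_entropy(y_true, p_pred):
    """Log loss for binary classification; p_pred are predicted probabilities."""
    eps = 1e-12
    p = np.clip(p_pred, eps, 1 - eps)            # avoid log(0)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

print(mse(np.array([3.0, 5.0]), np.array([2.5, 5.5])))
print(binary_cross_entropy(np.array([1, 0, 1]), np.array([0.9, 0.2, 0.7])))
```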
Gradient Descent
Algorithm that finds a local minimum of a function by stepping in the direction of the negative gradient: W_new = W_old - learning_rate * (dL/dW)
Stochastic Gradient Descent (SGD)
Updates weights using the gradient of a single instance at a time.
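A minimal gradient-descent sketch for a one-weight model y = w * x on toy data; replacing the average over all points with a single randomly chosen point per update would give SGD:

```python
# Gradient descent on a single weight, minimizing MSE for y = w * x
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.1, 5.9, 8.2]       # roughly y = 2x

w, lr = 0.0, 0.01
for step in range(200):
    # dL/dw for L = mean((w*x - y)^2) is mean(2 * (w*x - y) * x)
    grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)
    w = w - lr * grad           # W_new = W_old - learning_rate * dL/dW

print(round(w, 3))              # converges near 2.0
```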
Regularization
Techniques to prevent overfitting by penalizing large weights.
L1 Reg
Penalty on the sum of absolute weights; can force some weights to become exactly zero.
L2 Reg
Penalty on the sum of squared weights; shrinks weights close to 0.
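A small sketch of both penalty terms on an example weight vector (the weights and regularization strength lambda are illustrative values):

```python
import numpy as np

weights = np.array([0.5, -1.2, 0.0, 3.0])
lam = 0.01                                       # regularization strength (assumed)

l1_penalty = lam * np.sum(np.abs(weights))       # L1: sum of absolute weights
l2_penalty = lam * np.sum(weights ** 2)          # L2: sum of squared weights

# Either penalty is added to the loss before computing gradients
print(l1_penalty, l2_penalty)
```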
Dropout
Randomly ignoring a subset of neurons during each training batch to improve robustness.
Association Analysis
Discover hidden predictive information and relationships (rules) among attributes in transactional datasets.
Rule X → Y
X is the antecedent/body; Y is the consequent/head.
Support
Probability that a transaction contains both X and Y
Confidence
The conditional probability that a transaction containing X also contains Y.
Conviction
Measures the degree of implication: conviction = (1 - support(Y)) / (1 - confidence(X → Y)); values above 1 indicate the rule exceeds the confidence expected if the items were independent.
Frequent Itemset Generation
Find all itemsets with support >= minimum support threshold
Rule Generation
From frequent itemsets, generate rules that satisfy confidence >= minimum confidence. Note: Rules from the same itemset (e.g., $\{A, B\}$) have identical support but different confidence.
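A sketch computing support, confidence, and conviction for one rule over toy transactions (the items and counts are invented for illustration):

```python
# Toy transaction data for the rule {bread} -> {butter}
transactions = [
    {'bread', 'butter', 'milk'},
    {'bread', 'butter'},
    {'bread', 'milk'},
    {'butter', 'milk'},
    {'bread', 'butter', 'eggs'},
]
n = len(transactions)
X, Y = {'bread'}, {'butter'}

support_xy = sum((X | Y) <= t for t in transactions) / n     # P(X and Y)
support_x = sum(X <= t for t in transactions) / n
support_y = sum(Y <= t for t in transactions) / n

confidence = support_xy / support_x                          # P(Y | X)
conviction = (1 - support_y) / (1 - confidence)              # degree of implication

print(round(support_xy, 2), round(confidence, 2), round(conviction, 2))
```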