Data Analysis

DA 320 Prep

77 Terms

1. List

Mutable, ordered sequence ([1, 'a'])

2. Tuple

Immutable, ordered sequence ((1, 'a'))

3. Dictionary

Key-value pairs ({'key': value})

4. Set

Unordered collection of unique elements
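
A minimal Python sketch of the four collection types above (values are illustrative):

# List: mutable, ordered sequence
nums = [1, 'a']
nums.append(2)           # lists can grow in place

# Tuple: immutable, ordered sequence
point = (1, 'a')         # point[0] = 2 would raise TypeError

# Dictionary: key-value pairs
config = {'key': 'value'}
config['key']            # lookup by key -> 'value'

# Set: unordered collection of unique elements
unique = {1, 2, 2, 3}    # duplicates collapse -> {1, 2, 3}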

5. Data Wrangling - Discovering

Exploratory data analysis (EDA) to understand structure.

6. Data Wrangling - Structuring

Transforming features to uniform formats.

7. Data Wrangling - Cleaning

Addressing missing values and outliers.

8. Data Wrangling - Enriching

Adding new features or external data.

9. Data Wrangling - Validating

Verifying consistency and accuracy of data.

10. Data Wrangling - Publishing

Making the dataset available for others.

11. Missing Completely at Random (MCAR)

Probability of missingness is the same for all cases.

12. Missing at Random (MAR)

Probability of missingness depends on observed data.

13. Missing Not at Random (MNAR)

Missingness depends on the unobserved (missing) values themselves.

14. How to visualize categorical features?

Bar charts (counts), pie charts, stacked/grouped bar charts

15. How to visualize numerical features?

Histogram, box plot (min/Q1/median/Q3/max), scatter plot

16. Tukey's Fences

Outliers are values falling below Q1 - 1.5·IQR or above Q3 + 1.5·IQR, where IQR is the interquartile range (Q3 - Q1).
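
A small NumPy sketch of the fences (the sample data is illustrative):

import numpy as np

x = np.array([1, 2, 2, 3, 3, 3, 4, 4, 5, 40])    # 40 is a suspect point
q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = x[(x < lower) | (x > upper)]           # -> array([40])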

17. Z-Score

Z = (x - mean) / standard deviation; points with |Z| > 3 are often considered outliers.

18. Linear Regression

Model: y = b + mx (b: y-intercept, m: slope)

19. Residuals

The difference between observed and predicted values: e = y - ŷ

20. Least Squares + Sum of Squared Errors

Minimize the sum of squared residuals (differences between observed and predicted values): SSE = Σ(yᵢ - ŷᵢ)²
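
A minimal NumPy sketch of the closed-form least-squares fit for one feature (data values are illustrative):

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Closed-form slope and intercept that minimize SSE
m = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b = y.mean() - m * x.mean()

residuals = y - (b + m * x)
sse = np.sum(residuals ** 2)    # the quantity least squares minimizes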

21. R²: Coefficient of Determination

Proportion of variation explained by the model.
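
Equivalently, as a formula (consistent with the residual definitions above):

R^2 = 1 - \frac{SS_{res}}{SS_{tot}} = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}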

22. Logistic Regression

Used for binary classification, predicting 0 or 1 using the sigmoid function e^(mx + b) / (1 + e^(mx + b)). The model assumes a linear relationship between the input x and the log-odds of the outcome.
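
The log-odds (logit) form of that assumption:

\log \frac{p}{1 - p} = mx + b \quad \Leftrightarrow \quad p = \frac{e^{mx + b}}{1 + e^{mx + b}}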

23. kNN (k-Nearest Neighbors)

Classifies a new instance based on the majority class of its k closest neighbors.

24. Euclidean Distance Metric

For points (x1, y1) and (x2, y2): distance = sqrt[(x1 - x2)² + (y1 - y2)²]
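
A minimal kNN sketch built on this distance (the toy data and k are illustrative):

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    # Euclidean distance from x_new to every training point
    dists = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    nearest = np.argsort(dists)[:k]                         # k closest neighbors
    return Counter(y_train[nearest]).most_common(1)[0][0]   # majority class

X = np.array([[1, 1], [1, 2], [5, 5], [6, 5]])
y = np.array(['A', 'A', 'B', 'B'])
knn_predict(X, y, np.array([1.5, 1.5]))                     # -> 'A'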

25. Support Vector Machine (SVM)

Find a hyperplane that separates classes with the maximum margin.

26. Support Vectors

The specific instances closest to the decision boundary that determine its position.

27. Kernels

Functions (Polynomial, Radial Basis Function) that map the data to higher dimensions to make the data linearly separable.

28. Hinge Loss

A penalty loss function for instances that fall within the margin or on the wrong side.
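
Written out, with labels y ∈ {-1, +1} and decision value f(x):

L(y, f(x)) = \max(0, \; 1 - y \cdot f(x))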

29. Naive Bayes

ASSUMES: features are independent given the class. Based on Bayes' theorem for calculating the posterior probability: P(Y|X) = P(X|Y)P(Y) / P(X), where P(Y) is the prior probability of the class and P(X|Y) is the likelihood.

30. Decision Tree

Nodes: root (start), decision (tests a feature), leaf (final output)

31. CART

Algorithm that recursively splits the data to minimize impurity.

32. Gini Index

Measures misclassification probability (0 is pure): Gini = 1 - Σ pᵢ², where pᵢ is the proportion of class i.

33. Entropy

Measures information/disorder: H = -Σ pᵢ log₂(pᵢ)
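
A small Python sketch computing both impurity measures from class proportions:

import numpy as np

def gini(p):
    p = np.asarray(p)
    return 1 - np.sum(p ** 2)             # 0 for a pure node

def entropy(p):
    p = np.asarray(p)
    p = p[p > 0]                          # avoid log2(0)
    return -np.sum(p * np.log2(p))        # 0 pure, 1 for a 50/50 binary split

gini([0.5, 0.5]), entropy([0.5, 0.5])     # -> (0.5, 1.0)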

34. Bagging / Bootstrap Aggregating

Fits base models in parallel on random bootstrap samples. Aggregates via voting (classification) or averaging (regression). Example: Random Forest.

35. Boosting

Fits models sequentially. Each model focuses on the errors (misclassified instances) of the previous one by updating weights. Examples: AdaBoost, XGBoost.

36. Stacking

Fits multiple base models (often different types) and uses a meta-model to combine their predictions.

37. Random Forest

Collection of decision trees trained on bootstrap samples. Reduces variance and overfitting compared to single trees. Uses a random subset of features at each split.
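
A minimal scikit-learn sketch (the dataset and hyperparameters are illustrative):

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Many trees on bootstrap samples, random feature subsets at each split
clf = RandomForestClassifier(n_estimators=100, max_features='sqrt', random_state=0)
clf.fit(X_train, y_train)
clf.score(X_test, y_test)   # mean accuracy on held-out data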

38. K-Means Clustering

Partition data into k clusters by minimizing distance to centroids.

39. Centroid Formula

Mean of the points assigned to a cluster: μ = (1/n) Σ xᵢ

40. Elbow Method

Plot WCSS (Within Cluster Sum of Squares) vs k to find optimal cluster count.
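
A sketch of the elbow plot with scikit-learn, whose inertia_ attribute is the WCSS (dataset and k range are illustrative):

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

wcss = []
for k in range(1, 10):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    wcss.append(km.inertia_)              # within-cluster sum of squares

plt.plot(range(1, 10), wcss, marker='o')  # look for the "elbow" near the true k
plt.xlabel('k'); plt.ylabel('WCSS')
plt.show()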

41. Hierarchical Clustering

Agglomerative → Bottom up approach merging closest clusters.

Dendrogram → Tree diagram showing merge order and distances.

42. DBSCAN

Density-based: groups points in high-density regions and flags points in low-density regions as outliers (noise).

43. Dimensionality Reduction (PCA)

Principal Component Analysis transforms features into orthogonal components ordered by variance explained.

44. Eigenvalues

Represent the magnitude of variance explained by a factor.

45. Loadings

Correlation between an original feature and a principal component (factor): loading = eigenvector × sqrt(eigenvalue)

46. Scree Plot

Line plot of eigenvalues used to select the number of factors (keep factors before the “leveling off” point).
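
A scikit-learn PCA sketch tying these cards together; explained_variance_ holds the eigenvalues plotted in a scree plot (dataset is illustrative):

import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_iris(return_X_y=True)
X_std = StandardScaler().fit_transform(X)   # PCA is scale-sensitive

pca = PCA().fit(X_std)
eigenvalues = pca.explained_variance_       # variance explained per component

plt.plot(range(1, len(eigenvalues) + 1), eigenvalues, marker='o')  # scree plot
plt.xlabel('Component'); plt.ylabel('Eigenvalue')
plt.show()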

47. Turing Test

A test proposed by Alan Turing where a machine is considered intelligent if it can successfully pretend to be human to a knowledgeable observer during a text-based interaction.

48. Generative AI

AI that creates new content (text, images, audio) using architectures like GANs, diffusion models, Transformers, and VAEs.

49. Hallucination

When LLMs generate nonsensical or inaccurate information.

50. Algorithmic Bias

Systematic errors creating unfair outcomes, often stemming from unrepresentative training data (e.g., facial recognition models trained on data underrepresenting darker-skinned faces).

51. Transformer Architecture

Processes sequences in parallel, unlike RNNs, using an Encoder (converts the input sequence to vectors) and a Decoder (converts vectors to the output sequence auto-regressively).

52. Self-Attention Mechanism

Weighs the importance of different words in a sequence relative to each other.

53. Positional Encoding

Adds info about the order of tokens since Transformers process inputs in parallel.

54. Perceptron (Single Neuron)

Receives inputs, assigns weights, computes the weighted sum, adds a bias, and passes the result through an activation function.

55. Perceptron Training Rule

Updates weights to minimize error: w_new = w_old + (learning rate)(error)(x)
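
A minimal sketch of this rule training a single neuron with a unit-step activation on AND-gate data (names and learning rate are illustrative):

import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])   # AND-gate inputs
y = np.array([0, 0, 0, 1])                        # AND-gate targets
w, b, lr = np.zeros(2), 0.0, 0.1

for _ in range(20):                               # a few passes over the data
    for xi, target in zip(X, y):
        pred = 1 if xi @ w + b >= 0 else 0        # unit-step activation
        error = target - pred
        w += lr * error * xi                      # w_new = w_old + lr * error * x
        b += lr * error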

56. Activation Function

Functions that define the output of a node given an input or set of inputs.

57. Unit Step

Outputs 1 if z >= 0, else 0

58. Sigmoid Function

 Outputs a real number between 0 and 1 (useful for probabilities).
f(z) = \frac{1}{1 + e^{-z}}

59. Hyperbolic Tangent (tanh)

Outputs a value between -1 and 1.
f(z) = \frac{e^z - e^{-z}}{e^z + e^{-z}}

60. Rectified Linear Unit (ReLU)

Outputs the input if positive, else 0: f(z) = max(0, z)

61. Softmax

Used for multi-class classification; converts a vector of numbers into probabilities summing to 1.
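
NumPy sketches of the activation functions above:

import numpy as np

def unit_step(z):  return np.where(z >= 0, 1, 0)
def sigmoid(z):    return 1 / (1 + np.exp(-z))    # (0, 1)
def tanh(z):       return np.tanh(z)              # (-1, 1)
def relu(z):       return np.maximum(0, z)        # max(0, z)

def softmax(z):
    e = np.exp(z - np.max(z))   # subtract max for numerical stability
    return e / e.sum()          # probabilities summing to 1

softmax(np.array([1.0, 2.0, 3.0])).sum()          # -> 1.0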

62. Loss Functions

Measures the difference between predicted and actual values.

63. Mean Squared Error

For regression: MSE = (1/n) Σ(yᵢ - ŷᵢ)²

64. Binary Cross-Entropy (Log Loss)

For binary classification: loss = -[y log(p) + (1 - y) log(1 - p)]

65. Gradient Descent

Algorithm to find a local minimum of a function by stepping in the direction of the negative gradient: w_new = w_old - (learning rate)(dL/dw)
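
A minimal sketch of the update rule minimizing a toy loss L(w) = (w - 3)² (learning rate is illustrative):

lr, w = 0.1, 0.0
for _ in range(100):
    grad = 2 * (w - 3)      # dL/dw for L(w) = (w - 3)^2
    w = w - lr * grad       # step against the gradient
# w converges to 3, the minimizer of L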

66. Stochastic Gradient Descent (SGD)

Updates weights using the gradient of a single instance at a time.

67. Regularization

Techniques to prevent overfitting by penalizing large weights.

68. L1 Regularization

Penalty equal to the sum of absolute weights; can force weights to become exactly zero (sparsity).

69. L2 Regularization

Penalty equal to the sum of squared weights; shrinks weights close to 0 without zeroing them.
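
The two penalty terms added to the loss, with regularization strength λ:

L1: \lambda \sum_j |w_j| \qquad\qquad L2: \lambda \sum_j w_j^2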

70. Dropout

Randomly ignoring a subset of neurons during each training batch to improve robustness.

71. Association Analysis

Discover hidden predictive information and relationships (rules) among attributes in transactional datasets.

72. Rule X → Y

X is the antecedent/body; Y is the consequent/head.

73. Support

Probability that a transaction contains both X and Y

74. Confidence

The conditional probability that a transaction containing X also contains Y.

75. Conviction

Measures the degree of implication; how much the rule exceeds expected confidence if items were independent.
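
As formulas, for a rule X → Y:

\text{supp}(X \to Y) = P(X \cap Y) \qquad \text{conf}(X \to Y) = \frac{P(X \cap Y)}{P(X)} \qquad \text{conv}(X \to Y) = \frac{1 - P(Y)}{1 - \text{conf}(X \to Y)}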

76. Frequent Itemset Generation

Find all itemsets with support >= minimum support threshold

77. Rule Generation

From frequent itemsets, generate rules that satisfy confidence >= the minimum confidence. Note: rules from the same itemset (e.g., {A, B}) have identical support but different confidence.