Data Science

49 Terms

1

Nominal

Categorical data that represents labels or names with no inherent order or ranking (e.g. eye color).

2

Ordinal

Categorical data with a defined order or ranking, so relative positions can be compared (e.g. small < medium < large).

3

Categorical

Data that represents distinct categories or groups.

4

Numerical

Data that represents quantifiable values, allowing for mathematical calculations and comparisons.

5

Discrete

Numerical data that can only take certain separate values (e.g. counts); the difference between units on the scale is constant.

6

Interval

Numerical data where the difference between units on the scale is constant but there is no true zero point, so exact differences are meaningful and ratios are not (e.g. temperature in °C).

7

Scatterplot/ line plot/ scatterplot matrix

Compare two variables (numerical/numerical)

8

Joint plot

Compare two variables (numerical / numerical data)

9

Bivariate Kernel Density Plot

Numerical/ numerical data

10

Boxplot / violin plot

categorical / numerical data

11

Heatmap

Categorical / categorical data

12

Probability Distribution

Assigns a probability between 0 and 1 to each possible outcome; the probabilities total 1.

13

Bernoulli Random Variable

Random variable with two possible outcomes: success (1) with probability p, failure (0) with probability 1 - p.

14

Binomial Random Variable

Number of successes in n independent Bernoulli trials.

15

Central limit theorem

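The distribution of the sample mean is approximately normal when the sample is large, whatever the shape of the population distribution (given finite variance).

-normalize the data using the data version of the mean (the sample mean)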
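
A small simulation can make the theorem concrete. The sketch below is only an illustration: the exponential population, the sample size of 50, and the helper name sample_means are assumptions chosen for the example, not part of the card.

```python
import random
import statistics

def sample_means(n, trials, seed=0):
    """Draw `trials` samples of size n from Exponential(1); return their means."""
    rng = random.Random(seed)
    return [statistics.fmean(rng.expovariate(1.0) for _ in range(n))
            for _ in range(trials)]

# The population is strongly skewed, yet the sample means cluster
# symmetrically around the population mean.
means = sample_means(n=50, trials=2000)

# Theory: mean 1 and standard deviation 1 / sqrt(50), about 0.141.
print(round(statistics.fmean(means), 2), round(statistics.stdev(means), 2))
```

Plotting a histogram of means should show the familiar bell shape even though a single exponential draw is far from normal.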

16

Bootstrap

Resampling the observed data with replacement

-provides consistency

-helps quantify errors when making inferences
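
As a rough sketch of how resampling quantifies error, the code below builds a percentile bootstrap confidence interval for a mean. The data values, the number of resamples, and the function name bootstrap_ci are made up for illustration.

```python
import random
import statistics

def bootstrap_ci(data, n_resamples=5000, level=0.95, seed=0):
    """Percentile bootstrap confidence interval for the mean."""
    rng = random.Random(seed)
    means = sorted(
        statistics.fmean(rng.choices(data, k=len(data)))  # sample WITH replacement
        for _ in range(n_resamples)
    )
    lo_idx = int((1 - level) / 2 * n_resamples)
    hi_idx = int((1 + level) / 2 * n_resamples) - 1
    return means[lo_idx], means[hi_idx]

data = [2.1, 3.4, 1.9, 5.0, 4.2, 3.3, 2.8, 4.7, 3.9, 3.1]  # made-up sample
low, high = bootstrap_ci(data)
print(f"95% CI for the mean: ({low:.2f}, {high:.2f})")
```

The percentile method is only one of several bootstrap interval constructions; it is used here because it is the shortest to write down.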

17

Confidence Intervals

Range of plausible values for a parameter; measures the variation (uncertainty) in a statistic.

18

Null Hypothesis

No effect or nothing of interest

19

Alternative Hypothesis

There is an effect

20

Test Statistic

Number computed from the sample that measures how far the data are from what the null hypothesis predicts.

21

Rejection Criterion

Rule for when to reject the null hypothesis (e.g. p-value below the significance level alpha).

22

Type I Error

Rejecting the null hypothesis when it is true (false positive).

23

Type II Error

Failing to reject the null hypothesis when it is false (false negative).

24

Hypothesis Testing

Framework for deciding between hypotheses using falsification: assume the null hypothesis and check whether the data contradict it.

25

P-Value

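Probability of getting a test statistic at least as extreme as the one observed, given that the null hypothesis is true.

-Lower p-value = stronger evidence against the null hypothesis; the risk of a type I error is set by the significance level alpha, not by the p-value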
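
To illustrate the definition, here is a minimal sketch of an exact one-sided p-value for a coin-flip experiment. The scenario (16 heads in 20 flips) and the function name binomial_p_value are invented for the example.

```python
from math import comb

def binomial_p_value(successes, n, p_null=0.5):
    """One-sided p-value: P(X >= successes) when X ~ Binomial(n, p_null)."""
    return sum(comb(n, k) * p_null**k * (1 - p_null)**(n - k)
               for k in range(successes, n + 1))

# H0: the coin is fair. Observed: 16 heads in 20 flips (hypothetical data).
p = binomial_p_value(16, 20)
print(round(p, 4), p < 0.05)
```

Here the test statistic is the number of heads, and "at least as extreme" means 16 or more heads.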

26

Rejects null hypothesis

p-value < alpha (commonly 0.05)

-If the null hypothesis is true, the chance of rejecting it by mistake (type I error) is at most alpha

27

Binomial Distribution

Probability distribution that describes the number of successes in a fixed number of independent trials of a binary experiment.

P(X = k) = C(n, k) p^k (1 - p)^(n - k): probability of k successes in n trials, each with success probability p
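
The probability of k successes can be computed directly from the standard binomial formula C(n, k) p^k (1 - p)^(n - k). A short sketch (the function name binomial_pmf is just a label for this example):

```python
from math import comb

def binomial_pmf(k, n, p):
    """P(exactly k successes in n independent trials with success probability p)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

print(binomial_pmf(3, 10, 0.5))  # 120 / 1024 = 0.1171875
```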

28

t-test

Numerical vs Categorical (two categories)

Compares a numerical variable across a categorical variable with two values. It answers the question of whether the means of two groups are different.
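
A sketch of the statistic behind a two-sample comparison is shown below. It uses Welch's version of the t-test (no equal-variance assumption), which is a choice made for this example since the card does not name a variant; the two groups are made-up numbers.

```python
import math
import statistics

def welch_t(a, b):
    """Welch's t statistic and approximate degrees of freedom for two samples."""
    va = statistics.variance(a) / len(a)   # squared standard error of mean(a)
    vb = statistics.variance(b) / len(b)   # squared standard error of mean(b)
    t = (statistics.fmean(a) - statistics.fmean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va**2 / (len(a) - 1) + vb**2 / (len(b) - 1))
    return t, df

group_a = [5.1, 4.9, 5.6, 5.8, 6.0, 5.3]  # hypothetical measurements
group_b = [4.2, 4.5, 4.1, 4.8, 4.4, 4.0]
t, df = welch_t(group_a, group_b)
print(round(t, 2), round(df, 1))
```

The p-value then comes from a t distribution with df degrees of freedom; in practice a statistics library (for example SciPy's scipy.stats.ttest_ind with equal_var=False) would normally handle that step.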

29

Kruskal-Wallis Test

Same idea as the t-test but with many categories (rank-based, non-parametric).

Hypothesis test that compares a numerical variable across multiple categories but does not specify which category is different.

-ranks all observations together and checks whether the rank sums differ between groups

30

Pearson’s Correlation

Numerical vs Numerical

Measures the strength and direction of the linear relationship and answers the question of whether two variables move together.

  • Between -1 and 1; 0 = no linear relationship (a non-linear one may still exist)
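
The following sketch computes the coefficient from its definition and shows that a value of 0 only rules out a linear relationship. The data and the function name pearson_r are illustrative assumptions.

```python
import math

def pearson_r(x, y):
    """Pearson correlation: covariance divided by the product of the spreads."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

x = [-2, -1, 0, 1, 2]
r_linear = pearson_r(x, [2 * v + 1 for v in x])  # exact straight line: r is about 1
r_curve = pearson_r(x, [v ** 2 for v in x])      # exact parabola: r is 0
print(r_linear, r_curve)
```

The parabola case has a perfect (but non-linear) dependence between x and y, yet the coefficient is 0.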

31

Spearman’s Correlation

Rank-based correlation, used when the relationship is monotonic but non-linear

  • correlation does not equal causation
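
A minimal sketch of the rank-based idea follows. It assumes there are no tied values so that the textbook shortcut formula applies; the helper names ranks and spearman_rho are chosen for this example.

```python
def ranks(values):
    """Ranks 1..n; assumes no tied values (ties would need average ranks)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        result[i] = rank
    return result

def spearman_rho(x, y):
    """Spearman's rho via the no-ties shortcut 1 - 6 * sum(d^2) / (n (n^2 - 1))."""
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    return 1 - 6 * d2 / (n * (n * n - 1))

x = [1, 2, 3, 4, 5, 6]
print(spearman_rho(x, [v ** 3 for v in x]))  # monotonic but non-linear: 1.0
print(spearman_rho(x, [-v for v in x]))      # perfectly decreasing: -1.0
```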

32

χ² (Chi-Squared) Test of Independence

Compares two categorical variables and measures whether there is dependence between them.

  • H0: independent, no association

  • Ha: dependent, there is an association
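
As a sketch, the statistic compares observed counts with the counts expected under H0. The version below is restricted to a 2x2 table, where the p-value has a simple closed form (1 degree of freedom); the counts and the function name are invented for illustration.

```python
import math

def chi2_independence_2x2(table):
    """Chi-squared statistic and p-value for a 2x2 table of counts (1 dof)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total  # expected under H0
            stat += (observed - expected) ** 2 / expected
    # With 1 degree of freedom the upper-tail probability is erfc(sqrt(stat / 2)).
    return stat, math.erfc(math.sqrt(stat / 2))

table = [[30, 10],
         [15, 25]]  # made-up counts
stat, p = chi2_independence_2x2(table)
print(round(stat, 2), p < 0.05)
```

No continuity correction is applied here, so library routines that apply one by default may report slightly different numbers for the same table.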

33

Family-Wise Error Rate (Family Error)

Probability of making at least one type I error.

34

What does the probability answer?

Counts all the tests in the same statistical family together

35

Bonferroni Correction

Running k tests simultaneously

  • Rejects when p-value <= alpha/k

  • Adjusts for multiple comparisons to control the family-wise error rate
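
The arithmetic behind the correction can be sketched as follows. The family-wise error formula assumes the k tests are independent, and the p-values are made up; both function names are labels for this example only.

```python
def family_wise_error(alpha, k):
    """P(at least one type I error) across k independent tests at level alpha."""
    return 1 - (1 - alpha) ** k

def bonferroni_reject(p_values, alpha=0.05):
    """Reject each hypothesis whose p-value is at most alpha / k."""
    alpha_new = alpha / len(p_values)
    return [p <= alpha_new for p in p_values]

print(round(family_wise_error(0.05, 10), 3))        # 0.401
print(bonferroni_reject([0.001, 0.02, 0.04, 0.3]))  # [True, False, False, False]
```

With 10 tests at alpha = 0.05 the chance of at least one false positive is about 40%; testing each at 0.05 / 10 brings it back under 5%.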

36

Multiple Hypothesis Test

The more hypotheses are tested, the more type I errors accrue

  • Using Bonferroni Correction will help reduce this risk by adjusting the threshold.

37

Alpha_new

Adjusted significance level for each individual test (alpha/k under the Bonferroni correction)

38

Data Set Cards

[Flashcard image]

39

Clustering

Unsupervised technique used to group similar data points

40

K-Means

Distance-based clustering algorithm

  • Uses distance to measure intra-cluster "coherence"

  • Finds a local optimum (the result depends on the initial centroids)

  • Clustering metric: within-cluster sum of squares

Pros:

  • Simplicity

  • Scalability

  • Convergence

Cons:

  • Sensitive to outliers

  • Assumes convex, roughly spherical cluster shapes

  • K must be chosen in advance
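
The properties above can be seen in a bare-bones sketch of the usual iterative procedure (Lloyd's algorithm), written here for 2-D points. The data, the random initialization, and the function name kmeans are assumptions for illustration; production libraries use more careful initialization.

```python
import random

def kmeans(points, k, iters=100, seed=0):
    """Lloyd's algorithm: returns (centroids, labels) for a list of (x, y) points."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)          # random initial centers
    labels = []
    for _ in range(iters):
        # Assignment step: each point joins its nearest centroid.
        labels = [min(range(k), key=lambda c: (p[0] - centroids[c][0]) ** 2
                                              + (p[1] - centroids[c][1]) ** 2)
                  for p in points]
        # Update step: move each centroid to the mean of its members.
        new = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new.append((sum(p[0] for p in members) / len(members),
                            sum(p[1] for p in members) / len(members)))
            else:
                new.append(centroids[c])       # keep the center of an empty cluster
        if new == centroids:                   # converged (to a local optimum)
            break
        centroids = new
    return centroids, labels

points = [(1, 1), (1.5, 2), (1, 0.5), (8, 8), (9, 9.5), (8.5, 9)]  # two obvious groups
centroids, labels = kmeans(points, k=2)
print(labels)
```

Each iteration can only lower the within-cluster sum of squares, which is why the loop converges, but a different seed may end in a different local optimum on harder data.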

41

Convex

A cluster shape is convex if the straight line from any point A to any point B inside it never leaves the shape (e.g. a circle).

42

Elbow

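Finds the optimal K value for K-means: plot the within-cluster sum of squares against K and pick the point where the curve bends

  • Can be very hard to find the "elbow" when the curve is close to a straight line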

43

Silhouette Scores

Metric for evaluating any clustering (not only to choose the best k for k-means). Returns the average of silhouette coefficients over all samples

-1: samples are likely assigned to the wrong cluster

0: overlapping clusters

1: dense, well-separated clusters (strong structure)
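
A sketch of how the score is computed: for each sample, a is the mean distance to its own cluster, b is the mean distance to the nearest other cluster, and the coefficient is (b - a) / max(a, b). The points and labels below are made up, and the function assumes at least two clusters.

```python
import math

def silhouette_score(points, labels):
    """Mean silhouette coefficient (b - a) / max(a, b) over all samples."""
    def mean_dist(p, others):
        return sum(math.dist(p, q) for q in others) / len(others)

    scores = []
    for i, (p, lab) in enumerate(zip(points, labels)):
        same = [q for j, (q, l) in enumerate(zip(points, labels))
                if l == lab and j != i]
        if not same:                    # convention: a singleton cluster scores 0
            scores.append(0.0)
            continue
        a = mean_dist(p, same)          # cohesion within the sample's own cluster
        b = min(mean_dist(p, [q for q, l in zip(points, labels) if l == other])
                for other in set(labels) if other != lab)  # nearest other cluster
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

points = [(0, 0), (0, 1), (10, 0), (10, 1)]
good = silhouette_score(points, [0, 0, 1, 1])  # sensible grouping: close to 1
bad = silhouette_score(points, [0, 1, 0, 1])   # mixed-up grouping: negative
print(round(good, 2), round(bad, 2))
```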

44

Hierarchical Clustering

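Bottom-up (agglomerative) method: each point starts as its own cluster and the closest clusters are merged step by step

  • Does not require the number of clusters k to run

  • Can be interpreted with a dendrogram ("tree-based")

  • Expensive in terms of compute and memory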

45

Curse of Dimensionality

High-dimensional data tends to be sparse and hard to analyze; each added feature makes the data harder to model.

46

Principal Components Analysis (PCA)

Reduces dimensionality while preserving as much information (variance) as possible.

  • Goal: find the lower-dimensional subspace that captures the most variance and project the data onto it.

  • Benefits: simplifies data with minimal loss of information and helps with visualization.

47

The Process of PCA

1) Build the feature matrix

2) Standardize the data (PCA is sensitive to scale)

3) Compute the covariance matrix Σ

4) Find the eigenvalues and eigenvectors of Σ

5) Project the data onto the eigenvectors with the largest eigenvalues
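
The five steps can be traced in a small sketch. To stay free of external libraries it is restricted to two features, where the eigen-decomposition of the symmetric 2x2 covariance matrix has a closed form; with more features one would normally use a linear algebra library. The data and the function name pca_2d are made up, and the code assumes the two features are correlated.

```python
import math
import statistics

def pca_2d(rows):
    """PCA for rows of two features: (eigenvalues, eigenvectors, scores on PC1)."""
    # 1) Feature matrix -> 2) standardize each column (PCA is sensitive to scale).
    cols = list(zip(*rows))
    z = [[(v - statistics.fmean(col)) / statistics.stdev(col) for v in col]
         for col in cols]
    n = len(rows)
    # 3) Covariance matrix [[s_xx, s_xy], [s_xy, s_yy]] of the standardized data.
    s_xx = sum(v * v for v in z[0]) / (n - 1)
    s_yy = sum(v * v for v in z[1]) / (n - 1)
    s_xy = sum(u * v for u, v in zip(z[0], z[1])) / (n - 1)
    # 4) Eigenvalues (largest first) and unit eigenvectors, closed form for 2x2.
    half_gap = math.hypot((s_xx - s_yy) / 2, s_xy)
    eigenvalues = [(s_xx + s_yy) / 2 + half_gap, (s_xx + s_yy) / 2 - half_gap]
    eigenvectors = []
    for lam in eigenvalues:
        v = (s_xy, lam - s_xx)          # valid when s_xy != 0 (correlated features)
        norm = math.hypot(*v)
        eigenvectors.append((v[0] / norm, v[1] / norm))
    # 5) Project the standardized data onto the first principal component.
    pc1 = eigenvectors[0]
    scores = [u * pc1[0] + v * pc1[1] for u, v in zip(z[0], z[1])]
    return eigenvalues, eigenvectors, scores

rows = [(1, 2.1), (2, 3.9), (3, 6.2), (4, 7.8), (5, 10.1)]  # made-up data
eigenvalues, eigenvectors, scores = pca_2d(rows)
print([round(v, 3) for v in eigenvalues])
```

The variance of the projected scores equals the largest eigenvalue, which is the sense in which an eigenvalue measures how much variance lies along its eigenvector.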

48

Eigenvalues

How much variance (information) lies along each direction

49

Eigenvector

New axes or directions (the principal components)