Data Science Fundamentals 2-2

54 Terms

1

___ learning should be used for an ML model predicting a value.

Supervised

2

___ learning should be used for an ML model not predicting a value.

Unsupervised

3

A supervised learning model finding a discrete value is performing ___.

Classification

4

A supervised learning model finding a continuous value is performing ___.

Regression

5

An unsupervised model trying to fit data into discrete groups is performing ___.

Clustering

6

An unsupervised model making a numeric estimate is performing ___.

Density Estimation

7

Supervised Learning

Learning from data with known outcomes.

8

Unsupervised Learning

Learning from data with unknown outcomes.

9

Step 1 of making a decision tree:

Calculate entropy of the target variable.

10

Step 2 of making a decision tree:

Split the dataset on an attribute and calculate the entropy of each subset. Take the weighted sum of the subset entropies and compare it to the original entropy.

11

Step 3 of making a decision tree:

Choose the attribute whose split gives the smallest entropy (the largest information gain) as the decision node. Repeat these steps for each branch.

12

Entropy/Uncertainty

A measure of how uncertain the outcome of a decision is; higher entropy means more information is needed to resolve it.

13

The best attribute to split a decision tree with is the one that produces the ___ tree.

Smallest

14

Decision Node Purity

The number of distinct outcomes a decision node can possibly have; a pure node's tuples all share a single outcome.

15

Is this decision pure?

Yes

16

Is this decision pure?

No

17

The attribute with the ___ entropy should be selected.

Lowest

18

Information Gain =

\text{Info}(D)-\text{Info}_A(D), where \text{Info}(D) is the entropy before the split and \text{Info}_A(D) is the entropy after splitting on attribute A.
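A minimal Python sketch of the entropy and information-gain calculations from the decision-tree cards above (the toy dataset and function names are illustrative assumptions, not part of the deck):

```python
import math
from collections import Counter

def entropy(labels):
    # Shannon entropy (in bits) of a list of class labels.
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def information_gain(rows, labels, attr):
    # Info(D) - Info_A(D): entropy before the split minus the
    # weighted sum of the entropies of the subsets after the split.
    subsets = {}
    for row, label in zip(rows, labels):
        subsets.setdefault(row[attr], []).append(label)
    info_a = sum((len(sub) / len(labels)) * entropy(sub)
                 for sub in subsets.values())
    return entropy(labels) - info_a

# Toy split on attribute 0: gain is about 0.311 bits.
rows = [("sunny",), ("sunny",), ("rain",), ("rain",)]
labels = ["play", "no", "play", "play"]
print(information_gain(rows, labels, 0))
```

The attribute to split on is the one that maximizes this gain, which is the same as minimizing the new entropy.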

19

k-Nearest Neighbors

A supervised learning algorithm that uses a distance (similarity) measure to classify a tuple based on the tuples nearest to it.

20

The k in k-NN stands for:

The number of nearest neighbors considered when classifying a tuple.

21

Step 1 of k-NN:

Decide on the similarity metric, then split the dataset into training and testing data. Pick an evaluation metric.

22

Step 2 of k-NN:

Run k-NN a few times, changing k each time.

23

Step 3 of k-NN:

Choose the “best” k.
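The three k-NN steps above, sketched with scikit-learn (assuming it is installed; the iris data, the candidate k values, and accuracy as the evaluation metric are illustrative choices):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Step 1: similarity metric (Euclidean), train/test split, evaluation metric (accuracy).
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Step 2: run k-NN a few times, changing k each time.
scores = {}
for k in (1, 3, 5, 7, 9):
    knn = KNeighborsClassifier(n_neighbors=k, metric="euclidean")
    knn.fit(X_train, y_train)
    scores[k] = accuracy_score(y_test, knn.predict(X_test))

# Step 3: choose the "best" k, here the one with the highest test accuracy.
best_k = max(scores, key=scores.get)
print(best_k, scores[best_k])
```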

24

k-Means

An unsupervised learning algorithm that clusters similar objects.

25

Step 1 of k-Means:

Pick k random points to be the initial centroids.

26

Step 2 of k-Means:

Assign each data point to the centroid closest to it.

27

Step 3 of k-Means:

Move each centroid to the average location of all the data points in its cluster. Repeat steps 2 and 3 until the centroids move little to none.
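A from-scratch sketch of the three k-Means steps for 2-D points (a toy illustration under those steps, not a production implementation):

```python
import random

def k_means(points, k, max_iters=100, tol=1e-6):
    # Step 1: pick k random points to be the initial centroids.
    centroids = random.sample(points, k)
    for _ in range(max_iters):
        # Step 2: assign each data point to its closest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: (p[0] - centroids[c][0]) ** 2
                                                + (p[1] - centroids[c][1]) ** 2)
            clusters[nearest].append(p)
        # Step 3: move each centroid to the average location of its cluster.
        moved = [(sum(p[0] for p in c) / len(c), sum(p[1] for p in c) / len(c))
                 if c else centroids[i] for i, c in enumerate(clusters)]
        # Repeat until all centroids move little to none.
        if all(abs(a - x) < tol and abs(b - y) < tol
               for (a, b), (x, y) in zip(moved, centroids)):
            break
        centroids = moved
    return centroids, clusters

points = [(1, 1), (1, 2), (8, 8), (9, 8)]
print(k_means(points, 2)[0])
```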

28

It is possible for k-Means to fall into an ___.

Infinite Loop

29

Random Forest

A collection of decision trees.
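A quick scikit-learn illustration that a random forest really is a collection of decision trees (the dataset and hyperparameters here are arbitrary examples):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# estimators_ holds the individual decision trees; the forest
# predicts by aggregating their votes.
print(len(forest.estimators_))   # 100
print(forest.predict(X[:1]))
```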

30

Social Network

A collection of actors and relations.

31

Social Actor

A single unit in a social network.

32

Social Dyad

A pair of actors.

33

Social Triad

A triplet of actors.

34

Social Subgroup

A subset of a social network.

35

Social Relation

A relational tie between actors.

36

Social Ego Network

The “part of the network surrounding a single actor.”
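The social-network vocabulary above, made concrete with a small networkx graph (the actor names are made up; nx.ego_graph is a real networkx function):

```python
import networkx as nx

# A social network: actors are nodes, relations are edges.
G = nx.Graph()
G.add_edges_from([("Ana", "Bo"), ("Ana", "Cy"), ("Bo", "Cy"), ("Cy", "Dee")])

# Each edge is a dyad; Ana, Bo, and Cy form a (closed) triad.
print(G.number_of_edges())  # 4 relations

# The ego network around Cy: Cy, every actor tied to Cy,
# and the relations among them.
ego = nx.ego_graph(G, "Cy")
print(sorted(ego.nodes))  # ['Ana', 'Bo', 'Cy', 'Dee']
```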

37

When presenting to a project sponsor, the presentation should be:

Short, technically simple, with the results introduced early in the presentation.

38

When presenting to an end user, the presentation should be:

Focused on how the model improves their day-to-day lives and how to use the model.

39

When presenting to other data scientists, the presentation should be:

Technically complex and brutally honest about the limitations and assumptions of the model.

40

True Positive (TP)

Predicted Positive, Actually Positive

41

True Negative (TN)

Predicted Negative, Actually Negative

42

False Positive (FP)

Predicted Positive, Actually Negative

43

False Negative (FN)

Predicted Negative, Actually Positive

44

F_1-Score =

2 \times \frac{\text{precision} \times \text{recall}}{\text{precision} + \text{recall}}

45

Precision =

\frac{TP}{TP+FP}

46

Recall =

\frac{TP}{TP+FN}

47

Accuracy =

\frac{TP+TN}{TP+TN+FP+FN}
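The four confusion-matrix metrics above, computed in one small Python function (the example counts are made up):

```python
def metrics(tp, tn, fp, fn):
    precision = tp / (tp + fp)            # TP / (TP + FP)
    recall = tp / (tp + fn)               # TP / (TP + FN)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return precision, recall, f1, accuracy

# 40 TP, 45 TN, 10 FP, 5 FN -> precision 0.80, recall ~0.89,
# F1 ~0.84, accuracy 0.85.
print(metrics(40, 45, 10, 5))
```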

48

Linear Regression =

y = \beta_0 + \beta_1 x + \epsilon, where \beta_1 is the slope, \beta_0 is the intercept, and \epsilon is the error term.

49

Multiple Linear Regression =

y = \beta_0 + \beta_1 i_1 + \dots + \beta_n i_n + \epsilon, where the \beta_j are the slopes (coefficients), the i_j are the independent variables, and \epsilon is the error term.
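A small NumPy sketch of fitting both regressions by least squares (the data points are invented for illustration):

```python
import numpy as np

# Simple linear regression: np.polyfit returns [slope, intercept].
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 8.0])
b1, b0 = np.polyfit(x, y, deg=1)
y_hat = b0 + b1 * x  # fitted line

# Multiple linear regression via least squares: the design matrix has a
# constant column (for the intercept) plus one column per independent variable.
i1 = np.array([1.0, 2.0, 3.0, 4.0])
i2 = np.array([0.0, 1.0, 0.0, 1.0])
X = np.column_stack([np.ones(4), i1, i2])
betas, *_ = np.linalg.lstsq(X, y, rcond=None)
print(betas)  # [beta_0, beta_1, beta_2]
```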

50

Underfitting

When the model’s predictions don’t come close to matching the actual data.

51

Overfitting

When the model’s predictions match the training data too well and the model underperforms on new, real-world data.

52

Data Leakage

Information shown to the model during training that wouldn’t be available at prediction time in the real world.

53

Support Vector Machines

A supervised classification model that separates data into two groups by finding the boundary (hyperplane) with the maximum margin between the two groups’ closest points.
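A maximum-margin separation sketch with scikit-learn’s SVC (the toy points are invented; kernel="linear" keeps the boundary a straight line):

```python
import numpy as np
from sklearn.svm import SVC

# Two linearly separable groups of points.
X = np.array([[1, 1], [2, 1], [1, 2], [6, 6], [7, 6], [6, 7]])
y = np.array([0, 0, 0, 1, 1, 1])

svm = SVC(kernel="linear").fit(X, y)

# The support vectors are the closest points that fix the maximum-margin boundary.
print(svm.support_vectors_)
print(svm.predict([[2, 2], [6, 5]]))
```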

54

Residuals =

x - \hat{x}, where x is the observed value and \hat{x} is the prediction.
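And the residuals card as one line of NumPy (reusing the invented observations and predictions from the regression sketch above):

```python
import numpy as np

x = np.array([2.1, 3.9, 6.2, 8.0])      # observed values
x_hat = np.array([2.0, 4.0, 6.0, 8.0])  # model predictions
residuals = x - x_hat                   # x - x_hat for every point
print(residuals)                        # [ 0.1 -0.1  0.2  0. ]
```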