Data Mining 1


97 Terms

1

Tabular Data

Data that’s found in a “table” format.

New cards
2

This is the:

Data Science Lifecycle

New cards
3

Tabular Row

A single observation with a group of features.

New cards
4

Tabular Column

A single feature of a group of observations.

New cards
5

pandas

A Python library used to manipulate tabular data.
Its dataframes are similar to data frames in R.

New cards
6

Dataframe

A table defined by pandas.

New cards
7

Series

A single column of a table as defined by pandas.
Each entry pairs an index label with a value.

New cards
8

Given a series s, s["a"] returns:

The value in s associated with index "a".

New cards
9

Given a series s, s[["a","c"]] returns:

A series that contains the rows in s that are associated with indices "a" and "c".

New cards
10

Given a series s, s>0 returns:

A series that contains every index in s and a bool telling whether or not the value is >0.

New cards
11

Given a series s, s[s>0] returns:

A series that contains every row in s whose value is >0.
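
The four series cards above can be tried together in a minimal sketch. The series below is made-up example data, not something from the course.

```python
import pandas as pd

# Hypothetical example data: labels and values are invented for illustration.
s = pd.Series([4, -2, 0, 7], index=["a", "b", "c", "d"])

single = s["a"]           # the value stored under label "a"
subset = s[["a", "c"]]    # a series holding only labels "a" and "c"
mask = s > 0              # a boolean series with the same index as s
positive = s[mask]        # only the rows where the mask is True
```

Note that 0 is not greater than 0, so label "c" is dropped from `positive`.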

New cards
12

Dataframe Index

The labels used to identify and look up the rows of a dataframe.
Does not have to be numeric or unique.

New cards
13

Given a dataframe df, df.index returns:

The index (row labels) of df, not the rows themselves.

New cards
14

Given a dataframe df, df.columns returns:

The name of every column in df.

New cards
15

Given a dataframe df, df.shape returns:

A tuple holding the number of rows and then the number of columns of df, i.e. (height, width).

New cards
16

Given a dataframe df, df.head(n) returns:

The first n rows from df.

New cards
17

Given a dataframe df, df.tail(n) returns:

The last n rows from df.
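
As a quick illustration of the inspection attributes above, here is a small sketch on an invented dataframe (the column names Year and Count are assumptions for the example).

```python
import pandas as pd

# Hypothetical example data.
df = pd.DataFrame({"Year": [2020, 2021, 2022, 2023], "Count": [5, 8, 3, 9]})

labels = list(df.index)      # row labels; the default index counts from 0
names = list(df.columns)     # column names
n_rows, n_cols = df.shape    # rows first, then columns
first_two = df.head(2)       # first 2 rows
last_one = df.tail(1)        # last row
```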

New cards
18

Given a dataframe df, df.loc[rows, cols] returns:

All rows specified in rows with only the columns specified in cols from df.

New cards
19

The difference between iloc and loc:

iloc uses the position of the column and row, while loc uses the identifier associated with the column and row.
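
A minimal sketch of the loc / iloc distinction, assuming a made-up dataframe whose row labels are strings so that labels and positions cannot be confused:

```python
import pandas as pd

# Hypothetical example data with non-numeric row labels.
df = pd.DataFrame(
    {"Name": ["Ann", "Bo", "Cy"], "Age": [31, 25, 40]},
    index=["x", "y", "z"],
)

by_label = df.loc["y", "Age"]              # row labelled "y", column labelled "Age"
by_position = df.iloc[1, 1]                # second row, second column
rows_cols = df.loc[["x", "z"], ["Name"]]   # two rows, one column, still a dataframe
```

Both lookups reach the same cell here; they would differ if the rows were reordered.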

New cards
20

Given a dataframe df, df[0:3] returns:

The rows of df at positions 0, 1 and 2. The end of the slice (3) is excluded.

New cards
21

Given a dataframe df, df[cols] returns:

Every row from df with only the columns specified in cols.

New cards
22

Given a dataframe df, df.loc[:9, :] returns:

Every row of df up to and including the row labelled 9, with every column included.
With the default index (0, 1, 2, ...) this is the first 10 rows.
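
The contrast between positional slicing and label-based loc can be sketched as follows. The dataframes are invented, and the shifted index is only there to show that loc follows labels.

```python
import pandas as pd

# Hypothetical data with the default index 0..11.
df = pd.DataFrame({"value": range(12)})

sliced = df[0:3]           # positions 0, 1, 2: the end of the slice is excluded
labelled = df.loc[:9, :]   # labels up to and including 9

# Same values, but the row labels now run from 5 to 16.
shifted = pd.DataFrame({"value": range(12)}, index=range(5, 17))
shifted_rows = shifted.loc[:9, :]   # labels 5, 6, 7, 8, 9
```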

New cards
23

Given a dataframe df, df[[True,False,True]] returns:

The 1st and 3rd rows from df.

New cards
24

Given a dataframe df, df[(df["Sex"]=="F")] returns:

All rows from df where Sex is equal to F.
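
Both forms of boolean selection can be sketched on a small, made-up dataframe. The list of booleans must have one entry per row.

```python
import pandas as pd

# Hypothetical example data.
df = pd.DataFrame({"Name": ["Ann", "Bo", "Cy"], "Sex": ["F", "M", "F"]})

by_list = df[[True, False, True]]    # keeps the 1st and 3rd rows
by_condition = df[df["Sex"] == "F"]  # keeps rows where the condition is True
```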

New cards
25

Given a dataframe df that has no column testCol, df["testCol"]=testData will:

Add a column called testCol to df with the values specified in testData.

New cards
26

Given a dataframe df, df["col"]=df["col"]-1 will:

Subtract 1 from every row in the column col from df.

New cards
27

Given a dataframe df, df.rename(columns={"oldCol":"newCol"}) returns:

A version of df where oldCol is instead called newCol.

New cards
28

Given a dataframe df, df.drop("oldCol", axis="columns") returns:

A version of df without oldCol.
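
The column operations above can be sketched together. The data is invented; the point is that assignment changes df itself, while rename and drop return new dataframes.

```python
import pandas as pd

# Hypothetical example data.
df = pd.DataFrame({"col": [1, 2, 3], "oldCol": ["a", "b", "c"]})

df["testCol"] = [10, 20, 30]   # adds a new column to df in place
df["col"] = df["col"] - 1      # overwrites col with col - 1

renamed = df.rename(columns={"oldCol": "newCol"})   # new dataframe, df unchanged
dropped = df.drop("oldCol", axis="columns")         # new dataframe, df unchanged
```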

New cards
29

Given a dataframe df, df.size returns:

The number of individual elements in df.
I.e. number of rows \times number of columns.

New cards
30

Given a dataframe df, df.describe() returns:

A general summary of df.

New cards
31

Given a dataframe df, df.sample(n) returns:

A random sample of n rows from df without replacement, so no row appears twice in one sample.
Each call draws independently, so calling this again may return some of the same rows.

New cards
32

Given a dataframe df, df["col"].value_counts() returns:

How many times each value in col occurs.

New cards
33

Given a dataframe df, df["col"].unique() returns:

An array of the distinct values in col from df, in order of first occurrence.

New cards
34

Given a dataframe df, df["col"].sort_values() will:

Return a copy of col with its values sorted in ascending numeric or alphabetical order.
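
A short sketch of the three column summaries above, using an invented column of letters:

```python
import pandas as pd

# Hypothetical example data.
df = pd.DataFrame({"col": ["b", "a", "b", "c", "b", "a"]})

counts = df["col"].value_counts()   # series: value -> number of occurrences
distinct = df["col"].unique()       # array of distinct values, first-seen order
ordered = df["col"].sort_values()   # sorted copy; df["col"] itself is unchanged
```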

New cards
35

Given a dataframe df, df.groupby("Year") will:

Group the rows in df by the data in Year.

New cards
36

Aggregate Function

A function that takes many values (e.g. a whole group of rows or a column) and reduces them to a single summary value, such as sum, mean, or count.

New cards
37

Given a dataframe df, df.groupby("Year").agg(sum) returns:

A dataframe with one row per unique year, holding the sum of each remaining column within that year.

New cards
38

Given a dataframe df, df.groupby("Year").size() returns:

The number of rows in df associated with every unique year.

New cards
39

Given a dataframe df, df.groupby("Year").count() returns:

The number of non-missing values in each column for every unique year.
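
The difference between agg, size, and count shows up once a value is missing. This is a sketch on made-up data; it passes the string "sum" to agg rather than the built-in sum used on the card, which is an equivalent spelling chosen here for the example.

```python
import pandas as pd

# Hypothetical example data with one missing Count in 2021.
df = pd.DataFrame({
    "Year": [2020, 2020, 2021, 2021, 2021],
    "Count": [1.0, 2.0, 3.0, None, 5.0],
})

grouped = df.groupby("Year")    # a DataFrameGroupBy object, nothing computed yet
totals = grouped.agg("sum")     # dataframe: one row per year, missing values skipped
sizes = grouped.size()          # series: rows per year, missing values included
counts = grouped.count()        # dataframe: non-missing values per column per year
```

For 2021, `sizes` reports 3 rows while `counts` reports 2 values.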

New cards
40

Given a dataframe df and a command
df.groupby(col).filter(FUNC), what goes in FUNC?

A function (often a lambda) that takes the sub-dataframe for one group and returns either True or False.
Groups for which it returns True are kept.
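
A minimal sketch of a group filter, assuming invented data and an arbitrary rule (keep years that have at least two rows):

```python
import pandas as pd

# Hypothetical example data.
df = pd.DataFrame({
    "Year": [2020, 2020, 2021, 2022, 2022, 2022],
    "Count": [1, 2, 3, 4, 5, 6],
})

# The lambda receives each year's sub-dataframe and returns True or False.
kept = df.groupby("Year").filter(lambda group: len(group) >= 2)
```

The result contains the original rows of the surviving groups, not one row per group.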

New cards
41

Given a dataframe df, df.groupby(val) returns an object of type:

DataFrameGroupBy

New cards
42

Given a dataframe df, df.groupby(val).agg(func) returns an object of type:

DataFrame

New cards
43

Given two dataframes df1 and df2, pd.merge(left=df1, right=df2, left_on="col1", right_on="col2") returns:

A dataframe that joins the rows of df1 and df2 wherever the value in col1 equals the value in col2.
By default only matching rows are kept (an inner join).
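
The merge call can be sketched with two small, made-up tables. Only the keys 2 and 3 appear on both sides, so only those rows survive the default inner join.

```python
import pandas as pd

# Hypothetical example data.
df1 = pd.DataFrame({"col1": [1, 2, 3], "name": ["Ann", "Bo", "Cy"]})
df2 = pd.DataFrame({"col2": [2, 3, 4], "city": ["Oslo", "Rome", "Lima"]})

merged = pd.merge(left=df1, right=df2, left_on="col1", right_on="col2")
```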

New cards
44

Given a series S, S.map(func) will:

Apply func to each element in S.
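
For instance, with an invented series and a squaring function:

```python
import pandas as pd

# Hypothetical example data.
S = pd.Series([1, 2, 3])

squared = S.map(lambda v: v ** 2)   # new series; S itself is unchanged
```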

New cards
45

Probability

The long-run relative frequency with which an event occurs over many independent, identical trials.

New cards
46

P(A|B) stands for:

The probability that A happens given that B has happened. It does not imply that B causes A.

New cards
47

P(A|B)=

\frac{P(B|A) \times P(A)}{P(B)}
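
A worked example of the formula above. All three input probabilities are invented for illustration, and P(B) is obtained from the law of total probability, which the card does not state.

```python
# Hypothetical inputs: A is rare, B is a fairly reliable signal of A.
p_a = 0.01               # P(A)
p_b_given_a = 0.90       # P(B|A)
p_b_given_not_a = 0.05   # P(B|not A)

# Law of total probability: P(B) = P(B|A)P(A) + P(B|not A)P(not A)
p_b = p_b_given_a * p_a + p_b_given_not_a * (1 - p_a)

# Bayes' rule from the card above.
p_a_given_b = p_b_given_a * p_a / p_b
```

Here P(A|B) comes out at roughly 0.15, far below P(B|A) = 0.90, because A is rare to begin with.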

New cards
48

Census

A complete count or survey of a population.

New cards
49

Population

The complete set of individuals being studied.

New cards
50

Survey

A set of questions or measurements.

New cards
51

Sample

A subset of the population.

New cards
52

Inference / Prediction

The act of drawing conclusions about a population based on a sample.

New cards
53

Sampling Frame

The set of individuals that could possibly end up in a sample.

New cards
54

Selection Bias

Systematically excluding particular groups.

New cards
55

Response Bias

Respondents don’t always respond truthfully.

New cards
56

Population Parameter

A number that describes something about the population.

New cards
57

Sample Statistic

A number computed from a sample, often used as an estimate of a population parameter.

New cards
58

Central Limit Theorem

Given a sample is large enough, the distribution of its sample mean will resemble a normal distribution centered at the population mean, whatever the shape of the population distribution.
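
A small simulation can make the theorem concrete. This sketch assumes an exponential population with mean 1 and SD 1 (a deliberately skewed choice), a sample size of 50, and 2000 repetitions; none of these numbers come from the course.

```python
import random
import statistics

rng = random.Random(0)   # fixed seed so the run is repeatable

def sample_mean(n):
    # Mean of n draws from the assumed exponential population (mean 1, SD 1).
    return statistics.fmean(rng.expovariate(1.0) for _ in range(n))

sample_size = 50
means = [sample_mean(sample_size) for _ in range(2000)]

center = statistics.fmean(means)    # should land close to the population mean, 1
spread = statistics.pstdev(means)   # should land close to 1 / sqrt(50)
```

Plotting a histogram of `means` would show a roughly bell-shaped curve even though the population itself is strongly skewed.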

New cards
59

Sample Space

All possible outcomes for some random event.

New cards
60

Until an experiment occurs a random variable:

Does not hold a value.

New cards
61

Probabilities

The chance that a random variable will take each possible value.

New cards
62

X \sim F(p) means:

The random variable X has a distribution F with a parameter p.

New cards
63

Null Hypothesis

The “default” hypothesis given a scenario.

New cards
64

P-Value

The probability, assuming the null hypothesis is true, of observing a result at least as extreme as the one actually observed.

New cards
65

Standard Deviation (SD) of the sample mean =

\frac{\text{Population SD}}{\sqrt{\text{Sample Size}}}

New cards
66

Square Root Law

Increasing the sample size by a factor will decrease the SD of the sample mean by the square root of that factor.
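
The square root law can be checked by simulation. This sketch assumes a uniform population on [0, 1) and compares sample sizes 25 and 100, so the factor is 4 and the SD of the sample mean should roughly halve.

```python
import random
import statistics

rng = random.Random(1)   # fixed seed so the run is repeatable

def sd_of_sample_means(n, repeats=2000):
    # SD of the sample mean, estimated from many repeated samples of size n.
    means = [statistics.fmean(rng.random() for _ in range(n)) for _ in range(repeats)]
    return statistics.pstdev(means)

sd_small = sd_of_sample_means(25)
sd_large = sd_of_sample_means(100)
ratio = sd_small / sd_large   # expected to be near sqrt(4) = 2
```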

New cards
67

Convergence

When two or more series of values drift towards the same value.

New cards
68

Expectation

The weighted average of the possible values of a random variable.
The weights are the probabilities of the values.

New cards
69

Given X is a random variable and x is a possible value of X, Expectation =

\sum_{\text{all possible x}}{xP(X=x)}
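
As a worked instance of the sum above, take X to be the roll of a fair six-sided die (an illustrative choice, not from the cards). Exact fractions avoid rounding.

```python
from fractions import Fraction

# Assumed distribution: each face 1..6 has probability 1/6.
distribution = {x: Fraction(1, 6) for x in range(1, 7)}

# Weighted average of the possible values, weighted by their probabilities.
expectation = sum(x * p for x, p in distribution.items())
```

The result is 7/2, a value the die can never actually show.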

New cards
70

Exploratory Data Analysis (EDA)

The process of iteratively asking and answering more questions about a dataset.

New cards
71

The EDA process:

Question
Investigate
Interpret
Repeat

New cards
72

Model

An idealized representation of a system.

New cards
73

Deterministic Physical Models

Laws that govern how the world works.

New cards
74

Reasons for building models:

To explain complex phenomena.
To make accurate predictions.
To make causal inferences.

New cards
75

Model Evaluation Statistics:

Error
Bias
Variance

New cards
76

Bias

How far the estimate is, on average, from the true parameter.

New cards
77

Variance

How spread out the estimate is.

New cards
78

Mean Squared Error (MSE) given \hat \theta =

E((\hat \theta - \theta)^2)

New cards
79

Bias given \hat \theta =

E(\hat \theta - \theta)=E(\hat \theta) - \theta

New cards
80

Variance =

\frac{1}{n} \sum^{n}_{i=1}(x_i - \bar x)^2

New cards
81

Simple Linear Regression (\hat y) =

\theta_0 + \theta_1 x

New cards
82

The Modeling Process:

Choose a model.
Choose a loss function.
Fit the model.
Evaluate model performance.

New cards
83

Loss Function

Characterizes the cost / error / fit resulting from a particular choice of model and parameters.

New cards
84

Squared Loss / L2 Loss (L(y,\hat y)) =

(y - \hat y)^2

New cards
85

Absolute Loss / L1 Loss (L(y,\hat y)) =

|y - \hat y|

New cards
86

Covariance =

\frac{1}{n} \sum_i (y_i - \bar y)(x_i - \bar x)
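
The covariance formula can be written out directly. The sketch below uses the 1/n (population) form from the card and invented data; it also shows that the covariance of a variable with itself is its variance.

```python
def mean(values):
    return sum(values) / len(values)

def covariance(xs, ys):
    # (1/n) * sum of (y_i - y_bar) * (x_i - x_bar)
    x_bar, y_bar = mean(xs), mean(ys)
    return sum((y - y_bar) * (x - x_bar) for x, y in zip(xs, ys)) / len(xs)

def variance(xs):
    # Variance is the covariance of a variable with itself.
    return covariance(xs, xs)

# Hypothetical example data: y is exactly 2 * x.
x = [1, 2, 3, 4]
y = [2, 4, 6, 8]
```

With this data, `variance(x)` is 1.25 and `covariance(x, y)` is 2.5, twice the variance because y = 2x.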

New cards
87

Multiple Linear Regression (\hat y) =

\theta_0 + \theta_1 x_1 + … + \theta_p x_p

New cards
88

L2 Vector Norm given an n-dimensional vector x =

\sqrt{ \sum^{n}_{i=1} (x^2_i) }
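
In code, the norm is a one-line function. The vector [3, 4] is just a convenient example.

```python
import math

def l2_norm(vector):
    # Square root of the sum of squared components.
    return math.sqrt(sum(component ** 2 for component in vector))

length = l2_norm([3, 4])   # the 3-4-5 right triangle
```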

New cards
89

Span

The set of all possible linear combinations of a set of vectors, e.g. the columns of a matrix.

New cards
90

Deductive Reasoning

Reasoning from general premises or rules to a specific conclusion.

New cards
91

Inductive Reasoning

Reasoning from the specific observations made to a general conclusion.

New cards
92

\frac{dz}{dx}=

\frac{dz}{d\hat{y}} \times \frac{d\hat{y}}{dx}

New cards
93

Multiple Linear Regression assumes that…

The included features are not linearly dependent on one another.

New cards
94

Gradient Descent

Iteratively finding a minimum of a function by repeatedly stepping in the direction opposite the slope (gradient).

New cards
95

Given gradient descent and a learning rate \alpha, x^{t+1}=

x^t - \alpha \frac{d}{dx}f(x^t)

New cards
96

Gradient Descent stops after…

A fixed number of updates, or once the change between updates falls below a threshold.
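
The update rule and both stopping conditions can be sketched in a few lines. The function name, default values, and the test function f(x) = (x - 3)^2 are all assumptions made for this example.

```python
def gradient_descent(df_dx, x0, alpha=0.1, max_updates=1000, tol=1e-8):
    """Minimize a one-dimensional function given its derivative df_dx."""
    x = x0
    for _ in range(max_updates):      # stop after a fixed number of updates
        step = alpha * df_dx(x)
        x = x - step                  # x^{t+1} = x^t - alpha * f'(x^t)
        if abs(step) < tol:           # or once the change is too small
            break
    return x

# Hypothetical objective f(x) = (x - 3)^2, whose derivative is 2 * (x - 3).
x_min = gradient_descent(lambda x: 2 * (x - 3), x0=0.0)
```

The minimum of this f is at x = 3, and `x_min` ends up within a tiny distance of it. A learning rate that is too large would make the steps overshoot instead of settle.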

New cards
97

MSE for Linear Regression (\hat{R}(\theta)) =

\frac{1}{n} \sum ^{n} _{i=1} (y_i - (\theta_0 + \theta_1 x_i))^2
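
The loss can be evaluated for any candidate pair of parameters. This sketch uses invented data that lies exactly on the line y = 2x, so one choice of parameters gives zero error.

```python
def mse(theta_0, theta_1, xs, ys):
    # Average squared difference between each y_i and its prediction.
    n = len(xs)
    return sum((y - (theta_0 + theta_1 * x)) ** 2 for x, y in zip(xs, ys)) / n

# Hypothetical example data.
xs = [1, 2, 3]
ys = [2, 4, 6]

perfect = mse(0, 2, xs, ys)   # predictions match every point
worse = mse(0, 1, xs, ys)     # residuals are 1, 2, 3
```

Fitting the model means searching for the parameters that make this value as small as possible.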

New cards
