Data Mining 1

97 Terms

New cards

Tabular Data

Data that’s found in a “table” format.

New cards

This is the:

Data Science Lifecycle

New cards

Tabular Row

A single observation with a group of features.

New cards

Tabular Column

A single feature of a group of observations.

New cards

pandas

A Python library used to manipulate tabular data.
Its dataframe is similar to the data frame in R.

New cards

Dataframe

A table defined by pandas.

New cards

Series

A single column of a table as defined by pandas.
A one-dimensional object that pairs each index label with a value.

New cards

Given a series s, s["a"] returns:

The value in s associated with index "a".

New cards

Given a series s, s[["a","c"]] returns:

A series that contains the rows in s that are associated with indices "a" and "c".

New cards

Given a series s, s>0 returns:

A series that contains every index in s and a bool telling whether or not the value is >0.

New cards

Given a series s, s[s>0] returns:

A series that contains every row in s whose value is >0.
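
The four series lookups above can be tried side by side. This is a minimal sketch, assuming a small made-up series; the values and labels are hypothetical and only there for illustration.

```python
import pandas as pd

# Hypothetical series with string labels as its index.
s = pd.Series([3, -1, 4], index=["a", "b", "c"])

single = s["a"]          # the value stored under label "a"
subset = s[["a", "c"]]   # a series holding only labels "a" and "c"
mask = s > 0             # a boolean series that shares the index of s
positive = s[mask]       # the rows of s whose mask entry is True
```

Note that a single label gives back a plain value, while a list of labels gives back a series.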

New cards

Dataframe Index

Labels each row of a dataframe so that the row can be looked up by its label.
Does not have to be numeric or unique.

New cards

Given a dataframe df, df.index will return:

The row labels (the index) of df, not the rows themselves.

New cards

Given a dataframe df, df.columns returns:

The name of every column in df.

New cards

Given a dataframe df, df.shape returns:

A tuple holding the height and width of df, i.e. (number of rows, number of columns).

New cards

Given a dataframe df, df.head(n) returns:

The first n rows from df.

New cards

Given a dataframe df, df.tail(n) returns:

The last n rows from df.

New cards

Given a dataframe df, df.loc[rows, cols] returns:

All rows specified in rows with only the columns specified in cols from df.

New cards

The difference between iloc and loc:

iloc uses the position of the column and row, while loc uses the identifier associated with the column and row.
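
A minimal sketch of the difference, assuming a made-up dataframe whose index labels (10, 20, 30) deliberately differ from the row positions (0, 1, 2). The column names and values are hypothetical.

```python
import pandas as pd

# Hypothetical frame: labels 10, 20, 30 sit at positions 0, 1, 2.
df = pd.DataFrame(
    {"name": ["Ann", "Bob", "Cy"], "age": [31, 25, 40]},
    index=[10, 20, 30],
)

by_label = df.loc[10:20, ["age"]]   # uses labels; the end label 20 is included
by_position = df.iloc[0:2, [1]]     # uses positions; the end position 2 is excluded
```

Both expressions select the same two rows and the same column, one by identifier and one by position.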

New cards

Given a dataframe df, df[0:3] returns:

The rows of df at positions 0, 1, and 2. The slice is positional and excludes the end position 3.

New cards

Given a dataframe df, df[cols] returns:

Every row from df with only the columns specified in cols.

New cards

Given a dataframe df, df.loc[:9, :] returns:

Every row from df up to and including the label 9, with every column included.
With the default integer index, this is the first 10 rows.
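
Whether the end of a slice is included depends on the accessor. This is a small sketch, assuming a made-up dataframe with 12 rows and the default integer index.

```python
import pandas as pd

# Hypothetical frame with 12 rows and the default index 0..11.
df = pd.DataFrame({"value": range(12)})

first_three = df[0:3]            # positional slice: positions 0, 1, 2
up_to_label_9 = df.loc[:9, :]    # label slice: labels 0 through 9, end included
first_nine = df.iloc[:9, :]      # positional slice: positions 0 through 8
```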

New cards

Given a dataframe df, df[[True,False,True]] returns:

The 1st and 3rd rows from df. The list needs one bool per row of df.

New cards

Given a dataframe df, df[(df["Sex"]=="F")] returns:

All rows from df where Sex is equal to F.
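
Filtering by a condition is boolean indexing in two steps: build a boolean series, then index with it. A minimal sketch, assuming a made-up dataframe with hypothetical Name and Sex columns:

```python
import pandas as pd

# Hypothetical data for illustration only.
df = pd.DataFrame({"Name": ["Ava", "Ben", "Cleo"], "Sex": ["F", "M", "F"]})

mask = df["Sex"] == "F"               # boolean series: True, False, True
females = df[mask]                    # keeps the rows where mask is True
same_rows = df[[True, False, True]]   # the same selection with an explicit list
```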

New cards

Given a dataframe df that has no column testCol, df["testCol"]=testData will:

Add a column called testCol to df with the values specified in testData.

New cards

Given a dataframe df, df["col"]=df["col"]-1 will:

Subtract 1 from every value in the column col from df.

New cards

Given a dataframe df, df.rename(columns={"oldCol": "newCol"}) returns:

A version of df where oldCol is instead called newCol.

New cards

Given a dataframe df, df.drop("oldCol", axis="columns") returns:

A version of df without oldCol.
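
The column operations above differ in whether they change df itself. A minimal sketch, assuming a made-up dataframe; the data is hypothetical and the column names are taken from the cards.

```python
import pandas as pd

# Hypothetical starting frame with a single column.
df = pd.DataFrame({"oldCol": [1, 2, 3]})

df["testCol"] = [10, 20, 30]         # adds a new column to df in place
df["testCol"] = df["testCol"] - 1    # overwrites the column with shifted values

renamed = df.rename(columns={"oldCol": "newCol"})   # new frame, df unchanged
dropped = df.drop("oldCol", axis="columns")         # new frame, df unchanged
```

Assignment with df[...] = ... modifies df, while rename and drop return modified copies by default.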

New cards

Given a dataframe df, df.size returns:

The number of individual elements in df.
I.e. number of rows \times number of columns.
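
A quick check of how shape and size relate, assuming a made-up dataframe with 3 rows and 2 columns:

```python
import pandas as pd

# Hypothetical frame: 3 rows, 2 columns.
df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

n_rows, n_cols = df.shape   # shape is (rows, columns)
total = df.size             # number of individual elements
```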

New cards

Given a dataframe df, df.describe() returns:

Summary statistics (count, mean, standard deviation, minimum, quartiles, and maximum) for each numeric column of df.

New cards

Given a dataframe df, df.sample(n) returns:

A random sample of n rows from df without replacement.
No row repeats within one call, but separate calls are independent and may return some of the same rows.

New cards

Given a dataframe df, df["col"].value_counts() returns:

How many times each value in col occurs.

New cards

Given a dataframe df, df["col"].unique() returns:

Every distinct value in col from df, in order of first occurrence.

New cards

Given a dataframe df, df["col"].sort_values() will:

Return the values of col sorted in ascending numeric or alphabetical order. df itself is not changed.
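
The three column summaries above can be compared on one small column. A minimal sketch, assuming a made-up dataframe with a hypothetical column named col:

```python
import pandas as pd

# Hypothetical column with repeated values.
df = pd.DataFrame({"col": ["b", "a", "b", "c", "b"]})

counts = df["col"].value_counts()     # occurrences of each distinct value
distinct = df["col"].unique()         # distinct values, in order of first appearance
ordered = df["col"].sort_values()     # a sorted copy; df["col"] is left as it was
```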

New cards

Given a dataframe df, df.groupby("Year") will:

Group the rows in df by the data in Year.

New cards

Aggregate Function

A function that combines the values from more than one row into a single summary value, e.g. a sum or a mean.

New cards

Given a dataframe df, df.groupby("Year").agg(sum) returns:

A dataframe with one row per unique year, holding the sum of each remaining column over that year's rows.

New cards

Given a dataframe df, df.groupby("Year").size() returns:

The number of rows in df associated with every unique year.

New cards

Given a dataframe df, df.groupby("Year").count() returns:

For every unique year, the number of non-missing values in each column.
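
The difference between agg, size, and count shows up once a value is missing. A minimal sketch, assuming a made-up dataframe with a hypothetical Sales column; it passes the string "sum" to agg, which selects the same aggregation as the built-in sum on the card.

```python
import pandas as pd

# Hypothetical data: one Sales value for 2020 is missing.
df = pd.DataFrame({
    "Year": [2020, 2020, 2021],
    "Sales": [10.0, None, 5.0],
})

grouped = df.groupby("Year")
totals = grouped.agg("sum")   # per-year sum of each column; missing values are skipped
sizes = grouped.size()        # per-year number of rows, missing values included
counts = grouped.count()      # per-year number of non-missing values in each column
```

For 2020, size reports 2 rows while count reports 1 non-missing Sales value.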

New cards

Given a dataframe df and a command
df.groupby(col).filter(FUNC), what goes in FUNC?

A function (often a lambda) that takes one group's rows and returns either True or False. Groups for which it returns False are dropped.
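
A minimal sketch of a group filter, assuming a made-up dataframe and a hypothetical rule that keeps only the years whose total Sales exceed 10:

```python
import pandas as pd

# Hypothetical data for illustration only.
df = pd.DataFrame({"Year": [2020, 2020, 2021], "Sales": [10, 20, 5]})

# The lambda receives one group's rows at a time and returns True or False.
kept = df.groupby("Year").filter(lambda group: group["Sales"].sum() > 10)
```

The result contains the original rows of the groups that passed, not one row per group.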

New cards

Given a dataframe df, df.groupby(val) returns an object of type:

DataFrameGroupBy

New cards

Given a dataframe df, df.groupby(val).agg(func) returns an object of type:

DataFrame

New cards

Given two dataframes df1 and df2, pd.merge(left=df1, right=df2, left_on="col1", right_on="col2") returns:

A dataframe that joins each row of df1 to the rows of df2 whose col2 equals its col1. By default, rows without a match are dropped (an inner join).
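
A minimal sketch of the merge, assuming two made-up dataframes; left_val and right_val are hypothetical column names added so that the result is easy to read.

```python
import pandas as pd

# Hypothetical frames: the keys 2 and 3 appear in both.
df1 = pd.DataFrame({"col1": [1, 2, 3], "left_val": ["a", "b", "c"]})
df2 = pd.DataFrame({"col2": [2, 3, 4], "right_val": ["x", "y", "z"]})

merged = pd.merge(left=df1, right=df2, left_on="col1", right_on="col2")
```

Only the keys 2 and 3 survive, and both key columns (col1 and col2) are kept in the result.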

New cards

Given a series S, S.map(func) will:

Apply func to each element in S.
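
A minimal sketch of map, assuming a made-up series and a hypothetical squaring function as func:

```python
import pandas as pd

# Hypothetical series for illustration only.
S = pd.Series([1, 2, 3])

squared = S.map(lambda value: value ** 2)   # func is applied element by element
```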

New cards

Probability

The long-run frequency with which an event occurs in a collection of independent but identical trials.

New cards

P(A|B) stands for:

The probability that A happens given that B has happened. It does not imply that B causes A.

New cards

P(A|B)=

\frac{P(B|A) \times P(A)}{P(B)}
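
A worked instance of the formula above, using made-up numbers that are assumptions for illustration only: A is "has a condition" and B is "tests positive". P(B) is obtained by adding up the two ways B can happen.

```python
# Hypothetical probabilities, chosen only to illustrate the formula.
p_a = 0.01              # P(A)
p_b_given_a = 0.9       # P(B|A)
p_b_given_not_a = 0.05  # P(B|not A)

# P(B) = P(B|A) P(A) + P(B|not A) P(not A)
p_b = p_b_given_a * p_a + p_b_given_not_a * (1 - p_a)

# P(A|B) = P(B|A) P(A) / P(B)
p_a_given_b = p_b_given_a * p_a / p_b
```

With these numbers P(A|B) comes out a little above 0.15, far lower than P(B|A) = 0.9.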

New cards

Census

A complete count or survey of a population.

New cards

Population

The complete set of individuals being studied.

New cards

Survey

A set of questions or measurements.

New cards

Sample

A subset of the population.

New cards

Inference / Prediction

The act of drawing conclusions about a population based on a sample.

New cards

Sampling Frame

The set of individuals that could possibly be in a sample.

New cards

Selection Bias

Systematically excluding particular groups.

New cards

Response Bias

Respondents don’t always respond truthfully.

New cards

Population Parameter

A number that describes something about the population.

New cards

Sample Statistic

An estimate of a population parameter, computed on a sample.

New cards

Central Limit Theorem

Given a sample is large enough, the distribution of its sample mean will resemble a normal distribution centered at the population mean, whatever the shape of the population distribution.

New cards

Sample Space

All possible outcomes for some random event.

New cards

Until an experiment occurs a random variable:

Does not hold a value.

New cards

Probabilities

The chance that a random variable will take each possible value.

New cards

X \sim F(p) means:

The random variable X has a distribution F with a parameter p.

New cards

Null Hypothesis

The “default” hypothesis given a scenario.

New cards

P-Value

The probability, assuming the null hypothesis is true, of observing a result at least as extreme as the one actually observed.

New cards

Standard Deviation (SD) of the Sample Mean =

\frac{\text{Population SD}}{\sqrt{\text{Sample Size}}}

New cards

Square Root Law

Increasing the sample size by a factor will decrease the SD of the sample mean by the square root of the factor.
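
A short numeric check of the law, assuming a made-up population SD of 10. The helper name sd_of_sample_mean is hypothetical.

```python
import math

def sd_of_sample_mean(population_sd, sample_size):
    """SD of the sample mean: population SD divided by sqrt(sample size)."""
    return population_sd / math.sqrt(sample_size)

# Hypothetical population SD of 10; the sample size grows by a factor of 4.
sd_small = sd_of_sample_mean(10, 25)
sd_large = sd_of_sample_mean(10, 100)
```

Quadrupling the sample size from 25 to 100 halves the SD, since the square root of 4 is 2.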

New cards

Convergence

When two or more series of values drift towards the same value.

New cards

Expectation

The weighted average of the possible values of a random variable.
The weights are the probabilities of the values.

New cards

Given X is a random variable and x is a possible value of X, Expectation =

\sum_{\text{all possible x}}{xP(X=x)}
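
A worked instance of the sum above, assuming X is the roll of a fair six-sided die (an example chosen for illustration). Fractions keep the arithmetic exact.

```python
from fractions import Fraction

# Assumed distribution: each face 1..6 has probability 1/6.
distribution = {x: Fraction(1, 6) for x in range(1, 7)}

# Expectation = sum over all possible x of x * P(X = x)
expectation = sum(x * p for x, p in distribution.items())
```

The result is 7/2, i.e. 3.5, which is not itself a value the die can show.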

New cards

Exploratory Data Analysis (EDA)

The process of iteratively asking and answering more questions about a dataset.

New cards

The EDA process:

Question
Investigate
Interpret
Repeat

New cards

Model

An idealized representation of a system.

New cards

Deterministic Physical Models

Laws that govern how the world works.

New cards

Reasons for building models:

To explain complex phenomena.
To make accurate predictions.
To make causal inferences.

New cards

Model Evaluation Statistics:

Error
Bias
Variance

New cards

Bias

How far the estimate is, on average, from the parameter.

New cards

Variance

How spread out the estimate is.

New cards

Mean Squared Error (MSE) given \hat \theta =

E((\hat \theta - \theta)^2)

New cards

Bias given \hat \theta =

E(\hat \theta - \theta)=E(\hat \theta) - \theta

New cards

Variance =

\frac{1}{n} \sum^{n}_{i=1}(x_i - \bar x)^2

New cards

Simple Linear Regression (\hat y) =

\theta_0 + \theta_1 x

New cards

The Modeling Process:

Choose a model.
Choose a loss function.
Fit the model.
Evaluate model performance.

New cards

Loss Function

Characterizes the cost / error / fit resulting from a particular choice of model and parameters.

New cards

Squared Loss / L2 Loss (L(y,\hat y)) =

(y - \hat y)^2

New cards

Absolute Loss / L1 Loss (L(y,\hat y)) =

|y - \hat y|
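
The two loss functions can be compared on a single prediction. A minimal sketch, assuming a made-up true value of 3 and a prediction of 5; the function names are hypothetical.

```python
def squared_loss(y, y_hat):
    """L2 loss: the squared difference between truth and prediction."""
    return (y - y_hat) ** 2

def absolute_loss(y, y_hat):
    """L1 loss: the absolute difference between truth and prediction."""
    return abs(y - y_hat)

# Hypothetical true value and prediction.
l2 = squared_loss(3, 5)
l1 = absolute_loss(3, 5)
```

For an error of 2, the squared loss is 4 and the absolute loss is 2, so squared loss penalizes large errors more heavily.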

New cards

Covariance =

\frac{1}{n} \sum_i (y_i - \bar y)(x_i - \bar x)
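
A minimal sketch of the covariance formula in plain Python, assuming two made-up lists of equal length. It also computes the variance in its 1/n form, which equals the covariance of a list with itself. The function names are hypothetical.

```python
def mean(values):
    return sum(values) / len(values)

def variance(xs):
    """Average squared deviation from the mean (1/n form)."""
    x_bar = mean(xs)
    return sum((x - x_bar) ** 2 for x in xs) / len(xs)

def covariance(xs, ys):
    """Average product of paired deviations from the means (1/n form)."""
    x_bar, y_bar = mean(xs), mean(ys)
    return sum((y - y_bar) * (x - x_bar) for x, y in zip(xs, ys)) / len(xs)

# Hypothetical data: ys is exactly twice xs.
xs = [1, 2, 3, 4]
ys = [2, 4, 6, 8]
var_x = variance(xs)
cov_xy = covariance(xs, ys)
```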

New cards

Multiple Linear Regression (\hat y) =

\theta_0 + \theta_1 x_1 + … + \theta_p x_p

New cards

L2 Vector Norm given an n-dimensional vector x =

\sqrt{ \sum^{n}_{i=1} (x^2_i) }
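
A minimal sketch of the norm in plain Python, assuming a made-up two-dimensional vector. The function name l2_norm is hypothetical.

```python
import math

def l2_norm(vector):
    """Square root of the sum of squared components."""
    return math.sqrt(sum(component ** 2 for component in vector))

# Hypothetical vector: the classic 3-4-5 right triangle.
length = l2_norm([3, 4])
```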

New cards

Span

The set of all possible linear combinations of a set of vectors, e.g. the columns of a matrix.

New cards

Deductive Reasoning

Reasoning from general rules or premises to a specific conclusion.

New cards

Inductive Reasoning

Reasoning based on the observations made.

New cards

\frac{dz}{dx}=

\frac{dz}{d\hat{y}} \times \frac{d\hat{y}}{dx}

New cards

Multiple Linear Regression assumes that…

The included features have no linear relationship with one another.

New cards

Gradient Descent

Finding a minimum of a function by repeatedly stepping in the direction opposite to the slope.

New cards

Given gradient descent and a learning rate \alpha, x^{t+1}=

x^t - \alpha \frac{d}{dx}f(x^t)

New cards

Gradient Descent stops after…

A fixed number of updates, or once the change between successive updates falls below a threshold.
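
The update rule and both stopping conditions can be sketched in a few lines. This is an illustrative sketch, not a reference implementation: the function f(x) = (x - 3)^2, the starting point, the learning rate, and the threshold are all assumptions chosen for the example.

```python
def gradient_descent(df_dx, x0, alpha=0.1, max_updates=1000, tol=1e-8):
    """Minimize a one-dimensional function given its derivative df_dx."""
    x = x0
    for _ in range(max_updates):       # stop after a fixed number of updates
        step = alpha * df_dx(x)
        x = x - step                   # x^{t+1} = x^t - alpha * f'(x^t)
        if abs(step) < tol:            # or stop once the change is too small
            break
    return x

def f_prime(x):
    """Derivative of the assumed function f(x) = (x - 3)^2."""
    return 2 * (x - 3)

x_min = gradient_descent(f_prime, x0=0.0)
```

The assumed function has its minimum at x = 3, and the loop ends close to that value.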

New cards

MSE for Linear Regression (\hat{R}(\theta)) =

\frac{1}{n} \sum ^{n} _{i=1} (y_i - (\theta_0 + \theta_1 x_i))^2
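
A minimal sketch of this average loss in plain Python, assuming made-up data in which y is exactly 2x. The function name mse is hypothetical.

```python
def mse(theta_0, theta_1, xs, ys):
    """Mean squared error of the line theta_0 + theta_1 * x over the data."""
    n = len(xs)
    return sum((y - (theta_0 + theta_1 * x)) ** 2 for x, y in zip(xs, ys)) / n

# Hypothetical data lying exactly on the line y = 2x.
xs = [1, 2, 3]
ys = [2, 4, 6]

perfect_fit = mse(0, 2, xs, ys)   # the true line
worse_fit = mse(0, 1, xs, ys)     # residuals are 1, 2, 3
```

The true line gives an error of 0, while the slope-1 line gives (1 + 4 + 9) / 3 = 14/3.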