Data Mining 1

Last updated 7:43 PM on 3/3/25

97 Terms

1

Tabular Data

Data that’s found in a “table” format.

2

This is the:

Data Science Lifecycle

3

Tabular Row

A single observation with a group of features.

4

Tabular Column

A single feature of a group of observations.

5

pandas

A Python library used to manipulate tabular data.
Its dataframes are similar to the data frames in R.

6

Dataframe

A table defined by pandas.

7

Series

A single column of a table as defined by pandas.
Pairs each value with an index label.

8

Given a series s, s["a"] returns:

The value in s associated with index label a.

9

Given a series s, s[["a", "c"]] returns:

A series that contains the rows in s associated with index labels a and c.

10

Given a series s, s > 0 returns:

A series that contains every index in s and a bool telling whether or not the value is > 0.

11

Given a series s, s[s > 0] returns:

A series that contains every row in s whose value is > 0.
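
The four series lookups on the cards above can be tried together. This is a minimal sketch on a made-up series; the values and index labels are assumptions chosen only for illustration.

```python
import pandas as pd

# Hypothetical series with string index labels.
s = pd.Series([4, -2, 0, 6], index=["a", "b", "c", "d"])

single = s["a"]           # the value stored under label "a"
subset = s[["a", "c"]]    # a series holding only labels "a" and "c"
mask = s > 0              # a boolean series with the same index as s
positive = s[s > 0]       # only the rows where the mask is True
```

Note that the mask keeps every index label, while filtering with the mask drops the rows marked False.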

12

Dataframe Index

The set of labels that identifies each row of a dataframe.
Does not have to be numeric or unique.

13

Given a dataframe df, df.index returns:

The index (row labels) of df, not the rows themselves.

14

Given a dataframe df, df.columns returns:

The name of every column in df.

15

Given a dataframe df, df.shape returns:

A tuple (number of rows, number of columns) for df.

16

Given a dataframe df, df.head(n) returns:

The first n rows from df.

17

Given a dataframe df, df.tail(n) returns:

The last n rows from df.

18

Given a dataframe df, df.loc[rows, cols] returns:

All rows specified in rows with only the columns specified in cols from df.

19

The difference between iloc and loc:

iloc uses the integer position of the row and column, while loc uses the label associated with the row and column.

20

Given a dataframe df, df[0:3] returns:

The rows of df at positions 0, 1, and 2 (the end of the slice is exclusive).

21

Given a dataframe df, df[cols] returns:

Every row from df with only the columns specified in cols.

22

Given a dataframe df, df.loc[:9, :] returns:

Every row up to and including index label 9, with every column included.
With the default 0-based integer index, this is the first 10 rows from df.

23

Given a dataframe df, df[[True, False, True]] returns:

The 1st and 3rd rows from df (the list must have one entry per row, so df has exactly 3 rows).

24

Given a dataframe df, df[(df["Sex"] == "F")] returns:

All rows from df where Sex is equal to F.
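
The selection cards above can be compared side by side. This is a minimal sketch on a hypothetical three-row dataframe; the column names, values, and the non-default index (10, 20, 30) are assumptions chosen so that labels and positions differ.

```python
import pandas as pd

# Hypothetical data; the index labels are deliberately not 0, 1, 2.
df = pd.DataFrame(
    {"Name": ["Ann", "Bob", "Cy"], "Sex": ["F", "M", "F"], "Age": [30, 25, 41]},
    index=[10, 20, 30],
)

by_label = df.loc[[10, 30], ["Name", "Age"]]   # loc: row and column labels
by_position = df.iloc[[0, 2], [0, 2]]          # iloc: positions, same cells as above
first_two = df[0:2]                            # positional slice, end exclusive
up_to_20 = df.loc[:20, :]                      # label slice, end inclusive
masked = df[[True, False, True]]               # boolean list keeps the 1st and 3rd rows
females = df[(df["Sex"] == "F")]               # boolean series used as a filter
```

The slice inside loc includes its endpoint, while the plain df[0:2] slice does not, which is an easy detail to mix up.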

25

Given a dataframe df that has no column testCol, df["testCol"] = testData will:

Add a column called testCol to df with the values specified in testData.

26

Given a dataframe df, df["col"] = df["col"] - 1 will:

Subtract 1 from every value in the column col of df.

27

Given a dataframe df, df.rename(columns={"oldCol": "newCol"}) returns:

A version of df where oldCol is instead called newCol.

28

Given a dataframe df, df.drop("oldCol", axis="columns") returns:

A version of df without oldCol.
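
The column operations above differ in whether they change df itself or return a new dataframe. The sketch below uses a hypothetical dataframe whose contents are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical data reusing the column names from the cards.
df = pd.DataFrame({"oldCol": [1, 2, 3], "col": [10, 20, 30]})

df["testCol"] = ["x", "y", "z"]   # adds a new column to df in place
df["col"] = df["col"] - 1         # overwrites the column with the shifted values

renamed = df.rename(columns={"oldCol": "newCol"})   # new dataframe, df is unchanged
dropped = df.drop("oldCol", axis="columns")         # new dataframe, df is unchanged
```

Assignment with df[...] = ... modifies df directly, while rename and drop leave df alone unless the result is assigned back.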

29

Given a dataframe df, df.size returns:

The number of individual elements in df.
I.e. number of rows × number of columns.

30

Given a dataframe df, df.describe() returns:

Summary statistics (count, mean, standard deviation, min, quartiles, max) for the numeric columns of df.

31

Given a dataframe df, df.sample(n) returns:

A random sample of n rows from df, drawn without replacement.
Each call draws independently, so calling this again may return some of the same rows.

32

Given a dataframe df, df["col"].value_counts() returns:

How many times each value in col occurs.

33

Given a dataframe df, df["col"].unique() returns:

An array of the distinct values in col, in order of first occurrence.

34

Given a dataframe df, df["col"].sort_values() will:

Return the values of col sorted in ascending numeric or alphabetical order.
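
The summary methods above can be checked on a small example. This is a sketch under assumptions: the dataframe, its column n, and the random_state seed are made up for illustration.

```python
import pandas as pd

# Hypothetical data: one text column and one numeric column.
df = pd.DataFrame({"col": ["b", "a", "b", "c", "b"], "n": [5, 3, 8, 1, 4]})

total_cells = df.size                   # rows * columns
summary = df.describe()                 # statistics for the numeric columns only
picked = df.sample(2, random_state=0)   # 2 distinct rows; the seed makes it repeatable
counts = df["col"].value_counts()       # occurrences of each value
distinct = df["col"].unique()           # distinct values in order of first appearance
ordered = df["n"].sort_values()         # ascending by default
```

Without a fixed random_state, repeated calls to sample are independent and can overlap.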

35

Given a dataframe df, df.groupby("Year") will:

Group the rows in df by the value in Year.

36

Aggregate Function

A function that reduces many values (for example, all the rows in a group) to a single summary value.

37

Given a dataframe df, df.groupby("Year").agg(sum) returns:

A dataframe with one row per unique Year, holding the sum of each remaining column over that year's rows.

38

Given a dataframe df, df.groupby("Year").size() returns:

The number of rows in df associated with every unique year.

39

Given a dataframe df, df.groupby("Year").count() returns:

For every unique year, the number of non-missing values in each column.

40

Given a dataframe df and a command
df.groupby(col).filter(FUNC), what goes in FUNC?

A function (often a lambda) that takes one group's rows and returns True to keep the group or False to drop it.

41

Given a dataframe df, df.groupby(val) returns an object of type:

DataFrameGroupBy

42

Given a dataframe df, df.groupby(val).agg(func) returns an object of type:

DataFrame
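
The groupby cards above can be seen in one place. The sketch below uses a hypothetical dataframe with a Year column and a Count column containing one missing value; both are assumptions for illustration. It passes the string "sum" to agg rather than the built-in sum, since recent pandas versions warn about passing the built-in.

```python
import pandas as pd

# Hypothetical data; the None becomes a missing value (NaN).
df = pd.DataFrame({
    "Year": [2020, 2020, 2021, 2021, 2021],
    "Count": [5.0, None, 2.0, 3.0, 4.0],
})

grouped = df.groupby("Year")                  # DataFrameGroupBy; nothing is computed yet
sums = grouped.agg("sum")                     # DataFrame with one row per Year
sizes = grouped.size()                        # rows per Year, missing values included
counts = grouped.count()                      # non-missing values per column per Year
big = grouped.filter(lambda g: len(g) >= 3)   # keep only the groups with 3 or more rows
```

The difference between size and count shows up for 2020: two rows, but only one non-missing Count.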

43

Given two dataframes df1 and df2, pd.merge(left=df1, right=df2, left_on="col1", right_on="col2") returns:

A dataframe that joins the rows of df1 and df2 where col1 in df1 equals col2 in df2 (an inner join by default).
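
A small merge makes the matching rule concrete. The two dataframes below are hypothetical; only the names df1, df2, col1, and col2 come from the card.

```python
import pandas as pd

# Hypothetical tables that share the key values 2 and 3.
df1 = pd.DataFrame({"col1": [1, 2, 3], "name": ["a", "b", "c"]})
df2 = pd.DataFrame({"col2": [2, 3, 4], "score": [20, 30, 40]})

# Default inner join: only keys present in both tables survive.
merged = pd.merge(left=df1, right=df2, left_on="col1", right_on="col2")
```

Keys 1 and 4 appear in only one table, so the default inner join drops them; both key columns are kept in the result.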

44

Given a series S, S.map(func) will:

Apply func to each element in S.
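
As a quick illustration of map, assuming a made-up series and a squaring function:

```python
import pandas as pd

# Hypothetical series; the lambda stands in for func.
S = pd.Series([1, 2, 3])
squared = S.map(lambda v: v ** 2)   # returns a new series; S itself is unchanged
```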

45

Probability

The long-run relative frequency with which an event occurs over many independent, identical trials.

46

P(A|B) stands for:

The probability that A happens given that B has happened.

47

P(A|B) =

\frac{P(B|A) \times P(A)}{P(B)}
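
A worked number helps with the formula above. The probabilities below are invented for illustration (A is "has a condition", B is "test is positive"), and P(B) is obtained from the law of total probability, which the card does not state.

```python
# Hypothetical inputs, not taken from the cards.
p_a = 0.01               # P(A)
p_b_given_a = 0.95       # P(B|A)
p_b_given_not_a = 0.05   # P(B|not A)

# Law of total probability: P(B) = P(B|A)P(A) + P(B|not A)P(not A).
p_b = p_b_given_a * p_a + p_b_given_not_a * (1 - p_a)

# Bayes' rule: P(A|B) = P(B|A) * P(A) / P(B).
p_a_given_b = p_b_given_a * p_a / p_b
```

With these inputs P(A|B) comes out near 0.16, far below P(B|A) = 0.95, because A is rare to begin with.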

48

Census

A complete count or survey of a population.

49

Population

The complete set of individuals being studied.

50

Survey

A set of questions or measurements.

51

Sample

A subset of the population.

52

Inference / Prediction

The act of drawing conclusions about a population based on a sample.

53

Sampling Frame

The set of individuals that could possibly end up in a sample.

54

Selection Bias

Systematically excluding particular groups.

55

Response Bias

Respondents don’t always respond truthfully.

56

Population Parameter

A number that describes something about the population.

57

Sample Statistic

A number computed on a sample, used as an estimate of a population parameter.

58

Central Limit Theorem

Given a sample is large enough, the distribution of its sample mean will resemble a normal distribution centered at the population mean, whatever the shape of the population's distribution.

59

Sample Space

All possible outcomes for some random event.

60

Until an experiment occurs a random variable:

Does not hold a value.

61

Probabilities

The chance that a random variable will take each possible value.

62

X \sim F(p) means:

The random variable X has a distribution F with a parameter p.

63

Null Hypothesis

The “default” hypothesis given a scenario.

64

P-Value

The probability, assuming the null hypothesis is true, of observing a result at least as extreme as the one actually observed.

65

Standard Deviation (SD) of the sample mean =

\frac{\text{Population SD}}{\sqrt{\text{Sample Size}}}

66

Square Root Law

Increasing the sample size by a factor will decrease the SD of the sample mean by the square root of that factor.
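
The square root law can be checked by simulation. This is a rough sketch, assuming a normal population with SD 2 and a fixed random seed; the helper name sd_of_sample_mean is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)   # fixed seed so the run is repeatable
population_sd = 2.0              # assumed population SD

def sd_of_sample_mean(n, trials=20_000):
    # Draw `trials` samples of size n and measure the spread of their means.
    samples = rng.normal(loc=0.0, scale=population_sd, size=(trials, n))
    return samples.mean(axis=1).std()

sd_25 = sd_of_sample_mean(25)     # theory: 2 / sqrt(25) = 0.4
sd_100 = sd_of_sample_mean(100)   # theory: 2 / sqrt(100) = 0.2
```

Multiplying the sample size by 4 should roughly halve the SD of the sample mean, since sqrt(4) = 2.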

67

Convergence

When two or more series of values drift towards the same value.

68

Expectation

The weighted average of the possible values of a random variable.
The weights are the probabilities of the values.

69

Given X is a random variable and x is a possible value of X, Expectation =

\sum_{\text{all possible } x} x P(X=x)
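
As a worked instance of the sum above, consider a fair six-sided die (an example chosen here, not taken from the cards). Exact fractions avoid rounding.

```python
from fractions import Fraction

# Fair die: each value 1..6 has probability 1/6.
distribution = {x: Fraction(1, 6) for x in range(1, 7)}

# Expectation: sum of x * P(X = x) over all possible x.
expectation = sum(x * p for x, p in distribution.items())
```

The result is 7/2, which is not itself a value the die can show; the expectation is a weighted average, not a typical outcome.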

70

Exploratory Data Analysis (EDA)

The process of iteratively asking and answering more questions about a dataset.

71

The EDA process:

Question
Investigate
Interpret
Repeat

72

Model

An idealized representation of a system.

73

Deterministic Physical Models

Laws that govern how the world works.

74

Reasons for building models:

To explain complex phenomena.
To make accurate predictions.
To make causal inferences.

75

Model Evaluation Statistics:

Error
Bias
Variance

76

Bias

How far the estimate is, on average, from the parameter.

77

Variance

How spread out the estimate is.

78

Mean Squared Error (MSE) given \hat \theta =

E((\hat \theta - \theta)^2)

79

Bias given \hat \theta =

E(\hat \theta - \theta) = E(\hat \theta) - \theta

80

Variance =

\frac{1}{n} \sum^{n}_{i=1} (x_i - \bar x)^2

81

Simple Linear Regression (\hat y) =

\theta_0 + \theta_1 x

82

The Modeling Process:

Choose a model.
Choose a loss function.
Fit the model.
Evaluate model performance.

83

Loss Function

Characterizes the cost / error / fit resulting from a particular choice of model and parameters.

84

Squared Loss / L2 Loss (L(y, \hat y)) =

(y - \hat y)^2

85

Absolute Loss / L1 Loss (L(y, \hat y)) =

|y - \hat y|
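
The two loss functions above can be compared on the same predictions. This is a minimal sketch with made-up values for y and \hat y; the function names l2_loss and l1_loss are hypothetical.

```python
import numpy as np

def l2_loss(y, y_hat):
    # Squared loss: (y - y_hat)^2, computed per observation.
    return (y - y_hat) ** 2

def l1_loss(y, y_hat):
    # Absolute loss: |y - y_hat|, computed per observation.
    return np.abs(y - y_hat)

# Hypothetical observations and predictions.
y = np.array([3.0, 5.0, 10.0])
y_hat = np.array([2.0, 5.0, 6.0])

squared = l2_loss(y, y_hat)
absolute = l1_loss(y, y_hat)
```

For the last point the error is 4, so the squared loss is 16 while the absolute loss is 4: squared loss penalizes large errors much more heavily.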

86

Covariance =

\frac{1}{n} \sum_i (y_i - \bar y)(x_i - \bar x)

87

Multiple Linear Regression (\hat y) =

\theta_0 + \theta_1 x_1 + \dots + \theta_p x_p

88

L2 Vector Norm given an n-dimensional vector x =

\sqrt{\sum^{n}_{i=1} x_i^2}

89

Span

The set of all possible linear combinations of the columns of a matrix.

90

Deductive Reasoning

Reasoning from general premises or rules to a specific conclusion.

91

Inductive Reasoning

Reasoning based on the observations made.

92

\frac{dz}{dx} =

\frac{dz}{d\hat{y}} \times \frac{d\hat{y}}{dx}

93

Multiple Linear Regression assumes that…

The included features have no linear relationship with one another.

94

Gradient Descent

Finding a minimum of a function by repeatedly stepping in the direction opposite to the slope.

95

Given gradient descent and a learning rate \alpha, x^{t+1} =

x^t - \alpha \frac{d}{dx}f(x^t)

96

Gradient Descent stops after…

A fixed number of updates, or once the change between updates becomes too small.
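
The update rule and both stopping conditions can be sketched in a few lines. This is an illustrative implementation under assumptions: the function name, the default learning rate, the tolerance, and the test function f(x) = (x - 3)^2 are all choices made here, not taken from the cards.

```python
def gradient_descent(df_dx, x0, alpha=0.1, max_updates=1000, tol=1e-8):
    """Minimize a one-variable function given its derivative df_dx."""
    x = x0
    for _ in range(max_updates):           # stop after a fixed number of updates
        x_next = x - alpha * df_dx(x)      # step against the slope
        if abs(x_next - x) < tol:          # stop once the change is too small
            return x_next
        x = x_next
    return x

# f(x) = (x - 3)^2 has derivative 2 * (x - 3) and its minimum at x = 3.
x_min = gradient_descent(lambda x: 2 * (x - 3), x0=0.0)
```

With a learning rate that is too large the updates can overshoot and diverge, so alpha usually needs tuning for the function at hand.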

97

MSE for Linear Regression (\hat{R}(\theta)) =

\frac{1}{n} \sum^{n}_{i=1} (y_i - (\theta_0 + \theta_1 x_i))^2
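
The MSE formula above translates directly to NumPy. This sketch assumes a tiny made-up dataset that lies exactly on the line y = 1 + 2x; the function name mse is hypothetical.

```python
import numpy as np

def mse(theta0, theta1, x, y):
    # Average squared difference between y_i and theta0 + theta1 * x_i.
    y_hat = theta0 + theta1 * x
    return np.mean((y - y_hat) ** 2)

# Hypothetical data generated from y = 1 + 2x with no noise.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, 3.0, 5.0, 7.0])

perfect = mse(1.0, 2.0, x, y)   # the true parameters leave no error
off = mse(0.0, 2.0, x, y)       # the wrong intercept makes every residual equal to 1
```

Fitting the model means searching for the theta values that make this quantity as small as possible.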