Machine Learning Exam 2 (with examples)


193 Terms

1

Data Understanding

- examining key summary characteristics
- find problems (invalid values, missing values, unexpected distributions, outliers)
- visualize data

2

Single variable summaries

mean and standard deviation

3

mean

simple average of all the values

4

standard deviation

spread of the distribution, most often needed in the normal distribution

5

Descriptive statistics

organize, describe, and summarize data to help you better understand it

6

examples of descriptive statistics

frequency, min/max, central tendency (mean, median, mode), dispersion or variability (range, variance, SD)
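
As a rough illustration of these statistics, the sketch below computes each one in plain Python (standard library only). The `scores` list is made-up sample data, and this is not SAS output.

```python
from collections import Counter
import statistics

# Hypothetical sample data, invented for illustration
scores = [70, 85, 85, 90, 100]

frequency = Counter(scores)                   # how often each value occurs
low, high = min(scores), max(scores)          # min / max

# Central tendency
mean = statistics.mean(scores)
median = statistics.median(scores)
mode = statistics.mode(scores)

# Dispersion (sample versions, dividing by n - 1)
value_range = high - low
variance = statistics.variance(scores)
sd = statistics.stdev(scores)                 # square root of the variance

print(frequency, low, high, mean, median, mode, value_range, variance, sd)
```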

7

What descriptive statistics are good for

- finding unusual data
- screening the range and shape of the data
- determining the central tendency
- drawing preliminary conclusions

8

FREQ procedure

produces a one-way frequency table for each variable named in the TABLES statement

9

TABLES statement

specifies the variable(s) to tabulate in the FREQ procedure

If the TABLES statement is omitted, a one-way frequency table is produced for every variable in the data set. This can produce a large amount of output and is seldom preferred.

10

FREQ procedure outputs

The default output includes frequency and percentage values, including cumulative statistics.
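
To make the four default columns concrete, here is a small Python sketch (not SAS) that builds a one-way frequency table by hand. The `departments` values are invented, and the alphabetical row order is just a choice made for this example.

```python
from collections import Counter

# Hypothetical values of one categorical variable
departments = ["Sales", "IT", "Sales", "HR", "Sales", "IT", "HR", "Sales"]

counts = Counter(departments)
total = len(departments)

rows = []
cum_freq = 0
for value in sorted(counts):                  # one row per distinct value
    freq = counts[value]
    cum_freq += freq                          # running total of frequencies
    rows.append((value, freq, 100 * freq / total, cum_freq, 100 * cum_freq / total))

# Columns: value, frequency, percent, cumulative frequency, cumulative percent
for value, freq, pct, cfreq, cpct in rows:
    print(f"{value:<6} {freq:>4} {pct:>7.2f} {cfreq:>4} {cpct:>7.2f}")
```

The last row's cumulative frequency equals the total count and its cumulative percent is 100, which is a quick sanity check on any such table.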

11

2 options to suppress statistics in a FREQ procedure (used in the TABLES statement)

nocum
nopercent

12

suppress statistics

use options in the tables statement to suppress the display of selected default statistics
(tables variable(s) / options;)

13

nocum

suppress the cumulative statistics

- cumulative frequency
- cumulative percent
both are standard with FREQ procedure

14

nopercent

suppresses the percentage display
the percentage display is standard with the FREQ procedure

15

BY statement

used to request separate analyses for each BY group

- the data set must be sorted or indexed by the variable(s) named in the BY statement

goes in a proc freq

16

cross-tabulation table

an asterisk between two variables generates a two-way frequency table, or cross tabulation table

17

3 ways to say cross-tabulation table

1. cross-tabulation table
2. contingency table
3. two-way frequency table

18

cross-tabulation table output

a single table with statistics for each distinct combination of values of the selected variables.

19

4 standard statistics on cross-tabulation table output

1. Frequency
2. Percent
3. row pct
4. col pct

20

Frequency

the number of observations that meet the criteria for the given box

21

Percent

the percentage of the total that meet the criteria for the given box

22

row pct

the percentage of the row that meet the criteria for the given box

23

col pct

the percentage of the column that meet the criteria for the given box
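
The four cross-tabulation statistics (frequency, percent, row pct, col pct) differ only in what the cell count is divided by. The Python sketch below (not SAS) works them out for one box; the `records` data and the `cell_stats` helper are hypothetical.

```python
from collections import Counter

# Hypothetical (gender, purchased) pairs, invented for illustration
records = ([("F", "yes")] * 3 + [("F", "no")] * 2 +
           [("M", "yes")] * 1 + [("M", "no")] * 4)

cell = Counter(records)                           # count per (row, column) combination
row_total = Counter(r for r, _ in records)        # count per row value
col_total = Counter(c for _, c in records)        # count per column value
total = len(records)

def cell_stats(row, col):
    """Return (frequency, percent, row pct, col pct) for one box of the table."""
    freq = cell[(row, col)]
    return (freq,
            100 * freq / total,             # share of all observations
            100 * freq / row_total[row],    # share of this row
            100 * freq / col_total[col])    # share of this column

freq, pct, row_pct, col_pct = cell_stats("F", "yes")
print(freq, pct, row_pct, col_pct)
```

Here 3 of the 10 records are ("F", "yes"), which is 30% of the total, 60% of the 5 "F" rows, and 75% of the 4 "yes" columns.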

24

4 options to suppress statistics in a cross-tabulation table

norow
nocol
nofreq
nopercent

25

norow

suppresses the display of the row percentage

26

nocol

suppresses the display of the column percentage.

27

nofreq

suppresses the frequency display

28

nopercent

suppresses the percentage display

29

2 options to change the look of the cross-tabulation table output

1. LIST
2. CROSSLIST

Included in TABLES statement

30

LIST

Makes the output look like a list

31

CROSSLIST

Displays the crosstabulation results in a segregated form

32

FREQ procedure with multiple variables (in TABLES statement) but no asterisks (*)

lists all discrete values for a variable and reports missing values

33

FREQ procedure with multiple variables (in TABLES statement) but no asterisks (*) output

Like a single-variable PROC FREQ, but repeated: a separate one-way frequency table for each variable listed

34

order = freq

displays the results in descending frequency order

included in proc FREQ statement

35

if-then statement

executes a SAS statement for observations that meet a specific condition

- defines a condition
- statement can be any executable SAS statement
- if expression is true, then statement executes

36

Fixing the data

must fix all problems in the data set and have a reason to fix them

37

means procedure

summary reports with descriptive statistics

38

PROC MEANS output

Includes descriptive stats (mean, std dev, min, max)

39

analysis variables

numeric variables for which statistics are to be computed

40

var statement

identifies the analysis variable (or variables) and their order in the output

41

Data Preparation

Hardest most time consuming part
Takes 60-90% of the time
Most important
How models are wrong

42

Variable Cleaning

- incorrect variables
- outliers (remove, transform, bin, leave them)
- missing values

43

3 Types of missing values

1. MCAR
2. MAR
3. MNAR

44

MCAR

Missing Completely At Random
- no way to determine what it should be

45

MAR

Missing At Random
- not completely random: whether a value is missing depends on other observed variables, not on the missing value itself

46

MNAR

Missing Not At Random
- whether a value is missing depends on the missing value itself
- can probably determine what it should be
- something someone doesn't want to specify but is implied

47

2 ways to deal with missing values

1. Listwise delete them
2. Impute a value

48

listwise deletion

- delete every observation that has a missing value altogether
- risk of biasing data

49

impute a value

Use a/the (constant, mean, median, distribution, calculation) to fill in missing values
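
A minimal Python sketch of the two approaches, listwise deletion and imputation, assuming missing values are marked with `None`. The `ages` data is invented, and filling with 0 is only one possible constant.

```python
import statistics

# Hypothetical ages; None marks a missing value
ages = [25, 30, None, 35, None, 90]

# Listwise deletion: keep only the observations that have a value
observed = [a for a in ages if a is not None]

mean_age = statistics.mean(observed)
median_age = statistics.median(observed)

# Imputation: fill each gap with a statistic or a constant
imputed_mean = [mean_age if a is None else a for a in ages]
imputed_median = [median_age if a is None else a for a in ages]
imputed_constant = [0 if a is None else a for a in ages]

print(observed, imputed_mean, imputed_median, imputed_constant)
```

Because 90 pulls the mean up, the mean fill (45) and the median fill (32.5) differ noticeably, which is one reason the choice of fill value matters.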

50

feature engineering

adds to the data by creating a variable from something that already exists
- variable creation
- dummy coding
- binning (bucketing)
- calculation

51

dummy coding

Numeric "1" or "0" coding where each number represents an alternate response such as "female" or "male" / "yes" , "no"
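
A short Python sketch of dummy coding on invented data. The second half goes slightly beyond the card: it assumes the usual convention that a variable with k categories is coded with k - 1 dummy columns, with one category left as the all-zero reference level.

```python
# Hypothetical survey responses
responses = ["yes", "no", "no", "yes", "yes"]

# Dummy code: 1 for "yes", 0 for "no"
responded_yes = [1 if r == "yes" else 0 for r in responses]

# Assumed convention: k categories need k - 1 dummy columns;
# here "red" is the reference level and gets all zeros.
colors = ["red", "green", "blue", "green"]
levels = ["green", "blue"]
color_dummies = [[1 if c == level else 0 for level in levels] for c in colors]

print(responded_yes)
print(color_dummies)
```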

52

Binning

group values into ranges (bins) when the exact value doesn't matter
- histograms
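
One way to picture binning in Python: each value is replaced by the label of the range it falls in, and counting the labels gives the bar heights a histogram would show. The bin edges, labels, and `bin_label` helper are all hypothetical.

```python
from collections import Counter

# Hypothetical ages and bin edges, invented for illustration
ages = [3, 17, 18, 25, 42, 64, 65, 80]
bins = [(0, 18, "child"), (18, 65, "adult"), (65, 200, "senior")]  # [low, high)

def bin_label(value):
    """Return the label of the bin whose half-open range contains value."""
    for low, high, label in bins:
        if low <= value < high:
            return label
    return "unknown"

age_group = [bin_label(a) for a in ages]
histogram = Counter(age_group)        # counts per bin

print(age_group)
print(histogram)
```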

53

normalization

converting the range of values into a standard range
- increases the speed of learning
- reduces the computing power
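
The card does not say which standard range is meant. The sketch below assumes min-max scaling to [0, 1], one common choice, applied to invented `incomes`.

```python
# Hypothetical incomes
incomes = [20_000, 35_000, 50_000, 120_000]

low, high = min(incomes), max(incomes)

# Min-max scaling: (x - min) / (max - min) maps every value into [0, 1]
normalized = [(x - low) / (high - low) for x in incomes]

print(normalized)
```

The smallest value always maps to 0 and the largest to 1, so a single extreme value compresses everything else toward 0.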

54

Standardization

- like normalization except uses the standard normal distribution

- mean of 0, standard deviation of 1

- use for unsupervised learning models, if it's close to a bell curve, if there are extreme outliers
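
A minimal sketch of standardization as z-scores, assuming the sample standard deviation (dividing by n - 1) is used; the `scores` data is invented.

```python
import statistics

# Hypothetical exam scores
scores = [60, 70, 80, 90, 100]

mean = statistics.mean(scores)
sd = statistics.stdev(scores)          # sample standard deviation

# z-score: how many standard deviations each value lies from the mean
z_scores = [(x - mean) / sd for x in scores]

print(z_scores)
```

After the transformation the values have mean 0 and standard deviation 1, whatever units the original variable was in.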

55

Feature engineering drawbacks

more variables are not always better
- risk of overfitting data
- risk of the curse of dimensionality (need more data points)

56

Proc means OTHER descriptive statistics available

- Median
- Mode
- Q1
- Q3
- Range
- Qrange

57

median

middle value when all the values are ordered

58

mode

the most repeated value

59

Q1

the value that 25% of all values are at or below

60

Q3

the value that 75% of all values are at or below

61

range

the largest minus the smallest

62

qrange

the Q3 value minus the Q1 value (essentially reduces impact of outliers)
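
The contrast between range and qrange shows up clearly on data with one outlier. This Python sketch uses invented `values`. Quartile conventions differ between tools, so the numbers need not match SAS output exactly.

```python
import statistics

# Hypothetical data; 100 is an outlier
values = [1, 2, 3, 4, 5, 6, 7, 8, 100]

# Quartile conventions differ between tools; this uses Python's "inclusive"
# method, which may not match SAS's default percentile definition exactly.
q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")

value_range = max(values) - min(values)    # dominated by the outlier
qrange = q3 - q1                           # ignores the extremes

print(q1, median, q3, value_range, qrange)
```

The range is 99 because of the single value 100, while the interquartile range is 4.0 and would be unchanged if the outlier were 9 instead.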

63

nonobs

suppresses the N Obs column

64

maxdec

specifies the number of decimals to display
- maxdec =1 will have one decimal place

65

adding statistics in the MEANS procedure

list statistic keywords in the PROC MEANS statement after the data set name to override the default statistics (includes the OTHER descriptive statistics referred to earlier)

66

class statement

identifies variables whose values define subgroups for the analysis

67

classification variables

- character or numeric
- typically have few discrete values
- the data set does not need to be sorted or indexed by the classification

68

n obs

the number of observations with each unique combination of class variables

69

n

the number of observations with non missing values of the analysis variable(s)
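
The difference between N Obs and N only appears when the analysis variable has missing values. A small Python sketch with invented rows, using `None` as the missing marker:

```python
from collections import defaultdict

# Hypothetical (department, salary) rows; None marks a missing salary
rows = [("IT", 50), ("IT", None), ("IT", 70), ("HR", 40), ("HR", 60)]

n_obs = defaultdict(int)   # rows in each class level
n = defaultdict(int)       # rows in each class level with a non-missing salary

for dept, salary in rows:
    n_obs[dept] += 1
    if salary is not None:
        n[dept] += 1

for dept in sorted(n_obs):
    print(dept, n_obs[dept], n[dept])
```

For "IT" the two counts differ (3 observations, 2 non-missing salaries); for "HR" they agree.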

70

skewness

- tendency of data to be more spread out on one side of the mean than the other
- asymmetry of data

71

left skewed

skewness < 0

72

right skewed

skewness > 0

73

Kurtosis

tendency of data to be concentrated toward the tails or middle of the data
- concentrated toward the mean

74

negative kurtosis (platykurtic)

less peaked
k < 0

75

Positive kurtosis (leptokurtic)

more peaked
k > 0

76

Moderate Kurtosis (mesokurtic / normal)

Middle peaked
k=0
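
The signs of skewness and kurtosis can be checked with simple moment-based formulas. This sketch uses the plain population form; statistical packages often apply small-sample corrections, so their reported values can differ somewhat. The function names and data are hypothetical.

```python
import statistics

def skewness(data):
    """Moment-based skewness: mean cubed z-score (population form)."""
    mu = statistics.fmean(data)
    sigma = statistics.pstdev(data)
    return sum(((x - mu) / sigma) ** 3 for x in data) / len(data)

def excess_kurtosis(data):
    """Mean fourth-power z-score minus 3, so a normal distribution gives 0."""
    mu = statistics.fmean(data)
    sigma = statistics.pstdev(data)
    return sum(((x - mu) / sigma) ** 4 for x in data) / len(data) - 3

symmetric = [1, 2, 3, 4, 5]
right_tailed = [1, 1, 2, 2, 3, 20]   # long tail to the right

print(skewness(symmetric), skewness(right_tailed), excess_kurtosis(symmetric))
```

The symmetric list has skewness of essentially 0, the long right tail gives a positive value, and the flat, evenly spread list has negative excess kurtosis (platykurtic).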

77

Normal distribution

a bell-shaped curve, describing the spread of a characteristic throughout a population

78

1 Standard deviation

about 68% of values in a normal distribution fall within 1 standard deviation of the mean

79

2 standard deviations

about 95% of values in a normal distribution fall within 2 standard deviations of the mean

80

3 standard deviations

about 99.7% of values in a normal distribution fall within 3 standard deviations of the mean
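
The percentages for 1, 2, and 3 standard deviations (roughly 68%, 95%, and 99.7%) can be recovered from the normal cumulative distribution function. A short sketch using Python's `statistics.NormalDist`; the `share_within` helper is a name made up for this example.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

def share_within(k):
    """Fraction of a normal distribution within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

for k in (1, 2, 3):
    print(f"within {k} SD: {share_within(k):.1%}")
```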

81

What indicates a normal distribution

- mean and median are similar
- skewness is close to 0
- kurtosis is close to 0

82

What is used in SAS to determine normality

Proc Univariate

83

proc univariate

displays extreme observations, missing values and other statistics for the variable(s) included in the var statement

If the VAR statement is omitted, PROC UNIVARIATE analyzes all numeric variables in the data set

84

proc univariate output includes

- extreme observations section
- moments section
- tests for location section
- missing values section
- basic statistical measures section
- quantiles section

85

extreme observations section

includes the five lowest and five highest values for the analysis variable and the corresponding observation numbers
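
A rough Python equivalent of what this section reports: each value is paired with its observation number, and the five smallest and five largest pairs are kept. The `salaries` list is invented, and list position stands in for the observation number.

```python
# Hypothetical salaries; position in the list plays the role of the observation number
salaries = [52, 48, 250, 61, 12, 58, 55, 49, 63, 57, 300, 51]

# Pair each value with its 1-based observation number, then sort by value
numbered = sorted((value, obs) for obs, value in enumerate(salaries, start=1))

lowest_five = numbered[:5]
highest_five = numbered[-5:]

print("lowest :", lowest_five)
print("highest:", highest_five)
```

Keeping the observation number next to each value is what lets you go back to the data set and inspect the suspicious rows (here observations 5, 3, and 11).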

86

moments section

includes things like skewness and kurtosis

87

basic statistical measures section

includes things like mean median and mode

88

Tests for location section

includes p value

89

Quantiles section

includes Q1 & Q3

90

Missing values section

displays the number and percentage of observations with missing values for the analysis variable

91

obs

is the observation number, not the count of observations with that value

92

ID statement

displays the value of the identifying variable (or variables) in addition to the observation number

93

histogram (proc univariate)

specifies that you want a histogram, and what variables to include
(histogram salary / normal;)

94

inset (proc univariate)

specifies statistics you might want to include on that plot (ex. skewness, kurtosis)

95

test for normality

want a p-value > 0.05 for a normal distribution (the test then fails to reject normality)

96

2 tests for normality

1. Kolmogorov-Smirnov
2. Anderson-Darling

- shown in the goodness-of-fit output after running PROC UNIVARIATE with a histogram or probplot

97

probplot

specifies what variables you want a plot for
options specify how you want the line drawn
(probplot variables </ options>;)

98

probplot output

line of best fit plus datapoints
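
To show where the plotted points come from, the sketch below computes the coordinates of a normal probability plot by hand in Python. The sample is invented, and the plotting position `(i - 0.375) / (n + 0.25)` is an assumed, commonly used convention rather than something stated in these notes.

```python
from statistics import NormalDist

# Hypothetical sample, invented for illustration
sample = [4.1, 5.0, 5.2, 5.9, 6.3, 7.0, 7.4]

ordered = sorted(sample)
n = len(ordered)

# Theoretical standard-normal quantile for each rank i, using the
# plotting position (i - 0.375) / (n + 0.25) (an assumed, common convention)
theoretical = [NormalDist().inv_cdf((i - 0.375) / (n + 0.25)) for i in range(1, n + 1)]

# Each (theoretical, observed) pair is one plotted point; points near a
# straight line suggest the sample is roughly normal.
points = list(zip(theoretical, ordered))

for t, x in points:
    print(f"{t:6.3f}  {x}")
```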

99

one-dimension data visualization

Histograms (binning)
- creating bins for a variable
- see how the variable looks

100

Box plots

graphical representation of the quartile statistics
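
The numbers behind a box plot can be sketched in a few lines of Python. The data is invented, the quartile method is Python's "inclusive" convention, and the 1.5 × IQR fence rule for flagging outliers is a common convention assumed here, not something stated on the card.

```python
import statistics

# Hypothetical data; 40 is an outlier
values = [2, 4, 5, 6, 7, 8, 9, 10, 40]

# Box: Q1 to Q3, with a line at the median
q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
iqr = q3 - q1

# The 1.5 * IQR fence rule is a common convention, assumed here
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = [v for v in values if v < lower_fence or v > upper_fence]
inside = [v for v in values if lower_fence <= v <= upper_fence]
whisker_low, whisker_high = min(inside), max(inside)   # whisker ends

print(q1, median, q3, whisker_low, whisker_high, outliers)
```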
