Machine Learning Exam 2 (with examples)

1

Data Understanding

- examining key summary characteristics
- find problems (invalid values, missing values, unexpected distributions, outliers)
- visualize data

2

Single variable summaries

mean and standard deviation

3

mean

simple average of all the values

4

standard deviation

measures how spread out the values are around the mean; together with the mean, it defines a normal distribution

5

Descriptive statistics

organize, describe, and summarize data to help you better understand it

6

examples of descriptive statistics

frequency, min/max, central tendency (mean, median, mode), dispersion or variability (range, variance, SD)

7

What descriptive statistics are good for

- finding unusual data
- screening the range and shape of the data
- determining the central tendency
- drawing preliminary conclusions

8

FREQ procedure

produces a one-way frequency table for each variable named in the TABLES statement

9

TABLES statement

names the variable(s) to tabulate in a PROC FREQ step

If the TABLES statement is omitted, a one-way frequency table is produced for every variable in the data set. This can produce a large amount of output and is seldom preferred.

10

FREQ procedure outputs

The default output includes frequency and percentage values, including cumulative statistics.

11

2 options to suppress statistics in a FREQ procedure (used in the TABLES statement)

nocum
nopercent

12

suppress statistics

use options in the tables statement to suppress the display of selected default statistics
(tables variable(s) / options;)

13

nocum

suppress the cumulative statistics

- cumulative frequency
- cumulative percent
both are shown by default in PROC FREQ output

14

nopercent

suppresses the percentage display
the percentage is shown by default in PROC FREQ output

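As a minimal sketch of the TABLES options above: the example assumes the SASHELP.CARS sample data set that ships with SAS, and the variables Type and Origin are illustrative choices, not part of the original cards.

```sas
/* One-way frequency tables with selected default statistics suppressed. */
/* SASHELP.CARS and its variables are assumed sample data.               */
proc freq data=sashelp.cars;
    tables Type / nocum;             /* keeps Frequency and Percent      */
    tables Origin / nocum nopercent; /* keeps Frequency only             */
run;
```

Leaving the options off either TABLES statement would bring back the cumulative frequency, cumulative percent, and percent columns.
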
15

BY statement

used to request separate analyses for each BY group

- the data set must be sorted or indexed by the variable(s) named in the BY statement

goes inside the PROC FREQ step

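The sorting requirement for BY groups can be sketched as follows. SASHELP.CARS, the output name work.cars_sorted, and the variables Origin and Type are assumptions chosen for illustration.

```sas
/* Sort first: BY-group processing expects the data ordered by the BY variable. */
proc sort data=sashelp.cars out=work.cars_sorted;
    by Origin;
run;

/* One separate frequency table of Type for each value of Origin. */
proc freq data=work.cars_sorted;
    by Origin;
    tables Type;
run;
```
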
16

cross-tabulation table

an asterisk between two variables generates a two-way frequency table, or cross tabulation table

17

3 ways to say cross-tabulation table

1. cross-tabulation table
2. contingency table
3. two-way frequency table

18

cross-tabulation table output

a single table with statistics for each distinct combination of values of the selected variables.

19

4 standard statistics on cross-tabulation table output

1. Frequency
2. Percent
3. row pct
4. col pct

20

Frequency

the number of observations that meet the criteria for the given cell

21

Percent

the percentage of the total that meet the criteria for the given box

22

row pct

the percentage of the row that meet the criteria for the given box

23

col pct

the percentage of the column that meet the criteria for the given box

24

4 options to suppress statistics in a cross-tabulation table

norow
nocol
nofreq
nopercent

25

norow

suppresses the display of the row percentage

26

nocol

suppresses the display of the column percentage.

27

nofreq

suppresses the frequency display

28

nopercent

suppresses the percentage display

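A short sketch of a cross-tabulation request and its suppression options, again assuming the SASHELP.CARS sample data set; Origin and Type are illustrative variables.

```sas
proc freq data=sashelp.cars;
    /* Default two-way table: Frequency, Percent, Row Pct, Col Pct in each cell. */
    tables Origin*Type;

    /* Same table with row and column percentages suppressed. */
    tables Origin*Type / norow nocol;
run;
```

The variable before the asterisk defines the rows and the variable after it defines the columns.
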
29

2 options to change the look of the cross-tabulation table output

1. LIST
2. CROSSLIST

Included in the TABLES statement

30

LIST

Displays the cross-tabulation in list format, one line per combination of values, instead of a grid

31

CROSSLIST

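Displays the cross-tabulation in a column layout: one row per cell, grouped by the values of the first variable, with the row and column percentages and totals kept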

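To compare the two layouts side by side, a sketch like the following could be run. SASHELP.CARS and the variables Origin and DriveTrain are assumptions for illustration.

```sas
proc freq data=sashelp.cars;
    tables Origin*DriveTrain / list;      /* one line per combination of values */
    tables Origin*DriveTrain / crosslist; /* column layout grouped by Origin    */
run;
```
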
32

FREQ procedure with multiple variables (in TABLES statement) but no asterisks (*)

lists all discrete values for a variable and reports missing values

33

FREQ procedure with multiple variables (in TABLES statement) but no asterisks (*) output

The same output as a single-variable PROC FREQ, repeated once per variable (one separate one-way table for each)

34

order = freq

displays the results in descending frequency order

included in the PROC FREQ statement

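A minimal sketch of ORDER=FREQ, assuming the SASHELP.CARS sample data set; Make is an illustrative variable with many distinct values.

```sas
/* ORDER=FREQ belongs on the PROC FREQ statement, not on TABLES. */
proc freq data=sashelp.cars order=freq;
    tables Make / nocum;   /* most common makes are listed first */
run;
```
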
35

if-then statement

executes a SAS statement for observations that meet a specific condition

- defines a condition
- statement can be any executable SAS statement
- if expression is true, then statement executes

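The pattern can be sketched in a DATA step as below. The data set SASHELP.CARS, the output name work.cars_flagged, the new variables, and the cutoff of 25 are all hypothetical choices for illustration.

```sas
data work.cars_flagged;
    set sashelp.cars;
    length Efficiency $ 8;

    /* Default values, overwritten only when a condition is true. */
    MissingCyl = 0;
    Efficiency = 'Standard';

    /* IF expression THEN statement; runs once per observation. */
    if missing(Cylinders) then MissingCyl = 1;
    if MPG_City >= 25 then Efficiency = 'High';
run;
```
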
36

Fixing the data

must fix all problems in the data set and have a reason to fix them

37

means procedure

summary reports with descriptive statistics

38

PROC MEANS output

Includes descriptive stats (mean, std dev, min, max)

39

analysis variables

numeric variables for which statistics are to be computed

40

var statement

identifies the analysis variable (or variables) and their order in the output

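A basic PROC MEANS call with a VAR statement might look like this. SASHELP.CARS and the three analysis variables are assumptions for illustration.

```sas
/* Default statistics: N, Mean, Std Dev, Minimum, Maximum.      */
/* The VAR statement picks the variables and their output order. */
proc means data=sashelp.cars;
    var MSRP Horsepower MPG_City;
run;
```
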
41

Data Preparation

Hardest, most time-consuming part
Takes 60-90% of the time
Most important step
Where models most often go wrong

42

Variable Cleaning

- incorrect variables
- outliers (remove, transform, bin, leave them)
- missing values

43

3 Types of missing values

1. MCAR
2. MAR
3. MNAR

44

MCAR

Missing Completely At Random
- missingness is unrelated to any observed or unobserved value, so there is no way to determine what it should be

45

MAR

Missing at random
- not completely random: missingness depends on other observed variables, not on the missing value itself

46

MNAR

missing not at random
- missingness depends on the missing value itself
- can probably determine what it should be
- e.g. something someone doesn't want to specify but is implied

47

2 ways to deal with missing values

1. Listwise delete them
2. Impute a value

48

listwise deletion

- delete observations that have missing values altogether
- risk of biasing data

49

impute a value

Fill in missing values with a constant, the mean, the median, a draw from a distribution, or a calculated value

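The two approaches to missing values can be sketched as follows. This assumes the SASHELP.CARS sample data set (where Cylinders has a few missing values) and that SAS/STAT is available for PROC STDIZE; the output data set names are hypothetical.

```sas
/* Listwise deletion: keep only rows with no missing analysis variables. */
data work.cars_complete;
    set sashelp.cars;
    if nmiss(Cylinders, Horsepower, MSRP) = 0;
run;

/* Mean imputation: REPONLY replaces missing values and leaves the rest */
/* untouched; METHOD=MEDIAN would impute the median instead.            */
proc stdize data=sashelp.cars out=work.cars_imputed method=mean reponly;
    var Cylinders;
run;
```

Deletion shrinks the data set and can bias it, while imputation keeps every row at the cost of inventing values.
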
50

feature engineering

adds to the data by creating a variable from something that already exists
- variable creation
- dummy coding
- binning (bucketing)
- calculation

51

dummy coding

Numeric "1" or "0" coding where each number represents an alternate response such as "female" or "male" / "yes" , "no"

52

Binning

groups values into ranges (bins) when the exact value doesn't matter
- e.g. histograms

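Dummy coding, binning, and a calculated variable can be sketched in one DATA step. The example assumes the SASHELP.CLASS sample data set (ages 11 to 16, height in inches, weight in pounds); the new variable names and bin boundaries are hypothetical.

```sas
data work.class_features;
    set sashelp.class;
    length AgeGroup $ 5;

    /* Dummy coding: a comparison returns 1 when true and 0 when false. */
    Female = (Sex = 'F');

    /* Binning: the exact age is replaced by a range. */
    if Age <= 12 then AgeGroup = '11-12';
    else if Age <= 14 then AgeGroup = '13-14';
    else AgeGroup = '15-16';

    /* Calculation: body mass index from pounds and inches. */
    BMI = 703 * Weight / Height**2;
run;
```
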
53

normalization

converting the range of values into a standard range
- increases the speed of learning
- reduces the computing power needed

54

Standardization

- like normalization, except it rescales to the standard normal distribution

- mean of 0, standard deviation of 1

- use for unsupervised learning models, if the data is close to a bell curve, or if there are extreme outliers

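One way to contrast normalization and standardization is with PROC STDIZE, sketched below. This assumes SAS/STAT is available and uses the SASHELP.CLASS sample data set; the output data set names are hypothetical.

```sas
/* Normalization: METHOD=RANGE computes (x - min) / (max - min), */
/* so each variable ends up on a 0-to-1 scale.                   */
proc stdize data=sashelp.class out=work.class_norm method=range;
    var Height Weight;
run;

/* Standardization: METHOD=STD computes (x - mean) / std,        */
/* giving each variable a mean of 0 and a standard deviation of 1. */
proc stdize data=sashelp.class out=work.class_std method=std;
    var Height Weight;
run;
```

Running PROC MEANS on each output data set is a quick check that the ranges or the means and standard deviations came out as expected.
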
55

Feature engineering drawbacks

more variables are not always better
- risk of overfitting data
- risk of the curse of dimensionality (need more data points)

56

Proc means OTHER descriptive statistics available

- Median
- Mode
- Q1
- Q3
- Range
- Qrange

57

median

middle value when all the values are ordered

58

mode

the most repeated value

59

Q1

the value that 25% of all values are at or below

60

Q3

the value that 75% of all values are at or below

61

range

the largest minus the smallest

62

qrange

the Q3 value minus the Q1 value (essentially reduces impact of outliers)

63

nonobs

suppresses the N Obs column

64

maxdec

specifies the number of decimals to display
- maxdec=1 will display one decimal place

65

adding statistics in the MEANS procedure

request statistics in the PROC MEANS statement, after the data set name, to override the default stats (includes the OTHER descriptive statistics referred to earlier)

66

class statement

identifies variables whose values define subgroups for the analysis

67

classification variables

- character or numeric
- typically have few discrete values
- the data set does not need to be sorted or indexed by the classification variables

68

n obs

the number of observations with each unique combination of class variables

69

n

the number of observations with non missing values of the analysis variable(s)

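The PROC MEANS options and statements covered so far can be combined in one step, sketched below under the assumption that the SASHELP.CARS sample data set is available; the chosen statistics and variables are illustrative.

```sas
/* Statistic keywords after the data set name override the defaults.    */
/* MAXDEC=1 rounds the display; NONOBS hides the N Obs column.          */
proc means data=sashelp.cars n mean median q1 q3 range qrange maxdec=1 nonobs;
    class Origin;          /* subgroups; no prior sorting needed        */
    var MSRP MPG_City;     /* analysis variables, in output order       */
run;
```

Removing NONOBS would show N Obs next to N, which makes it easy to spot analysis variables with missing values in a subgroup.
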
70

skewness

- tendency of data to be more spread out on one side of the mean than the other
- asymmetry of data

71

left skewed

skewness < 0

72

right skewed

skewness > 0

73

Kurtosis

tendency of data to be concentrated toward the tails or toward the middle of the distribution
- the middle means concentrated toward the mean

74

negative kurtosis (platykurtic)

less peaked
k < 0

75

Positive kurtosis (leptokurtic)

more peaked
k > 0

76

Moderate Kurtosis (mesokurtic / normal)

Middle peaked
k=0

77

Normal distribution

a bell-shaped curve, describing the spread of a characteristic throughout a population

78

1 Standard deviation

about 68% of values in a normal distribution fall within 1 standard deviation of the mean

79

2 standard deviations

about 95% of values in a normal distribution fall within 2 standard deviations of the mean

80

3 standard deviations

about 99.7% of values in a normal distribution fall within 3 standard deviations of the mean

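The three coverage figures can be checked from the standard normal cumulative distribution function. The sketch below uses the CDF function in a DATA step; the variable names are hypothetical.

```sas
/* Probability that a normal value lies within k standard deviations */
/* of the mean, for k = 1, 2, 3. Results are written to the log.     */
data _null_;
    within1 = cdf('NORMAL', 1) - cdf('NORMAL', -1);
    within2 = cdf('NORMAL', 2) - cdf('NORMAL', -2);
    within3 = cdf('NORMAL', 3) - cdf('NORMAL', -3);
    put within1= within2= within3=;
run;
```

The values come out near 0.683, 0.954, and 0.997, which is where the rounded 68%, 95%, and 99.7% figures come from.
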
81

What = normal

- mean and median are similar
- skewness is close to 0
- kurtosis is close to 0

82

What is used in SAS to determine normality

Proc Univariate

83

proc univariate

displays extreme observations, missing values and other statistics for the variable(s) included in the var statement

If the VAR statement is omitted, PROC UNIVARIATE analyzes all numeric variables in the data set

84

proc univariate output includes

- extreme observations section
- moments section
- tests for location section
- missing values section
- basic statistical measures section
- quantiles section

85

extreme observations section

includes the five lowest and five highest values for the analysis variable and the corresponding observation numbers

86

moments section

includes things like skewness and kurtosis

87

basic statistical measures section

includes things like mean median and mode

88

Tests for location section

includes p value

89

Quantiles section

includes Q1 & Q3

90

Missing values section

displays the number and percentage of observations with missing values for the analysis variable

91

obs

is the observation number, not the count of observations with that value

92

ID statement

displays the value of the identifying variable (or variables) in addition to the observation number

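A basic PROC UNIVARIATE call with VAR and ID statements could look like the sketch below. It assumes the SASHELP.CLASS sample data set, where Name identifies each student.

```sas
/* The extreme observations table lists the five lowest and five highest */
/* heights; the ID statement adds each student's Name next to Obs.       */
proc univariate data=sashelp.class;
    var Height;
    id Name;
run;
```
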
93

histogram (proc univariate)

specifies that you want a histogram, and what variables to include
(histogram salary / normal;)

94

inset (proc univariate)

specifies statistics you might want to include on that plot (ex. skewness, kurtosis)

95

test for normalcy

a p-value > 0.05 means the test does not reject normality, so the data can be treated as approximately normal

96

2 tests for normalcy

1. Kolmogorov-Smirnov
2. Anderson-Darling

- both appear in the goodness-of-fit output after running PROC UNIVARIATE with a histogram or probplot

97

probplot

specifies what variables you want a plot for
options specify how you want the line drawn
(probplot variables </ options>;)

98

probplot output

line of best fit plus datapoints

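The plotting statements from the last few cards can be combined into one normality check, sketched below. It assumes the SASHELP.CLASS sample data set and ODS Graphics output; the inset position and the estimated mu and sigma are illustrative option choices.

```sas
proc univariate data=sashelp.class;
    var Height;

    /* Histogram with a fitted normal curve; the NORMAL option also      */
    /* requests goodness-of-fit tests such as Kolmogorov-Smirnov and     */
    /* Anderson-Darling.                                                 */
    histogram Height / normal;

    /* INSET adds statistics to the plot requested just before it. */
    inset skewness kurtosis / position=ne;

    /* Probability plot with a reference line from the estimated mean */
    /* and standard deviation.                                        */
    probplot Height / normal(mu=est sigma=est);
run;
```

Points that stay close to the reference line, together with goodness-of-fit p-values above 0.05, suggest the variable is approximately normal.
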
99

one-dimension data visualization

Histograms (binning)
- creating bins for a variable
- see how the variable looks

100

Box plots

graphical representation of the quartile statistics
