Machine Learning Exam 2 (with examples)

193 Terms

New cards

Data Understanding

- examine key summary characteristics
- find problems (invalid values, missing values, unexpected distributions, outliers)
- visualize data

New cards

Single variable summaries

mean and standard deviation

New cards

mean

simple average of all the values

New cards

standard deviation

spread of the distribution around the mean; most often used to describe a normal distribution
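
A small worked example of the two single-variable summaries, written as a plain-Python sketch rather than SAS. The sample values are made up for illustration. Note that there are two common versions of the standard deviation: the sample version divides by n - 1 and the population version divides by n.

```python
import statistics

# Hypothetical sample values, chosen only for illustration.
values = [2, 4, 4, 4, 5, 5, 7, 9]

mean = statistics.mean(values)              # sum of the values / number of values
sample_sd = statistics.stdev(values)        # sample SD (divides by n - 1)
population_sd = statistics.pstdev(values)   # population SD (divides by n)

print(mean, round(sample_sd, 3), population_sd)
```

Here the mean is 5, and the two standard deviations differ slightly because of the different divisors.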

New cards

Descriptive statistics

organize, describe, and summarize data to help you better understand it

New cards

examples of descriptive statistics

frequency, min/max, central tendency (mean, median, mode), dispersion or variability (range, variance, SD)

New cards

What descriptive statistics are good for

- finding unusual data
- screening the range and shape of the data
- determining the central tendency
- drawing preliminary conclusions

New cards

FREQ procedure

produces a one-way frequency table for each variable named in the TABLES statement

New cards

TABLES statement

names the variable(s) to tabulate in a PROC FREQ step

If the TABLES statement is omitted, a one-way frequency table is produced for every variable in the data set. This can produce a large amount of output and is seldom preferred.

New cards

FREQ procedure outputs

The default output includes frequency and percentage values, including cumulative statistics.
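
To make the four default columns concrete, here is a plain-Python sketch (not SAS) that builds the same kind of one-way table: frequency, percent, cumulative frequency, and cumulative percent. The data values and variable names are hypothetical.

```python
from collections import Counter

# Hypothetical values of one categorical variable.
values = ["A", "B", "A", "C", "A", "B", "A", "C", "B", "A"]

counts = Counter(values)
total = len(values)

rows = []
cum_freq = 0
for level in sorted(counts):                 # one table row per distinct value
    freq = counts[level]
    cum_freq += freq                         # running total of frequencies
    rows.append((level, freq, 100 * freq / total,
                 cum_freq, 100 * cum_freq / total))

# Columns: value, frequency, percent, cumulative frequency, cumulative percent
for row in rows:
    print(row)
```

The last row always reaches the total count and 100 percent, which is a quick check that no values were dropped.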

New cards

2 options to suppress statistics in a FREQ procedure (used in the TABLES statement)

nocum
nopercent

New cards

suppress statistics

use options in the tables statement to suppress the display of selected default statistics
(tables variable(s) / options;)

New cards

nocum

suppresses the cumulative statistics

- cumulative frequency
- cumulative percent
both are displayed by default by the FREQ procedure

New cards

nopercent

suppresses the percentage display
percentages are displayed by default by the FREQ procedure

New cards

BY statement

used to request separate analyses for each BY group

- the data set must be sorted or indexed by the variable(s) named in the BY statement

can be placed inside a PROC FREQ step

New cards

cross-tabulation table

an asterisk between two variables generates a two-way frequency table, or cross tabulation table

New cards

3 ways to say cross-tabulation table

1. cross-tabulation table
2. contingency table
3. two-way frequency table

New cards

cross-tabulation table output

a single table with statistics for each distinct combination of values of the selected variables.

New cards

4 standard statistics on cross-tabulation table output

1. Frequency
2. Percent
3. row pct
4. col pct

New cards

Frequency

the number of observations that meet the criteria for the given box

New cards

Percent

the percentage of the total that meet the criteria for the given box

New cards

row pct

the percentage of the row that meet the criteria for the given box

New cards

col pct

the percentage of the column that meet the criteria for the given box
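
The four cell statistics differ only in what the cell count is divided by. A plain-Python sketch (not SAS) with made-up data; the function name `cell_stats` and the variable values are hypothetical:

```python
from collections import Counter

# Hypothetical (row variable, column variable) pairs, one per observation.
pairs = [("F", "yes"), ("F", "yes"), ("F", "no"), ("M", "yes"),
         ("M", "no"), ("M", "no"), ("M", "no"), ("F", "yes")]

cell = Counter(pairs)                        # count per (row, column) box
row_total = Counter(r for r, _ in pairs)     # count per row value
col_total = Counter(c for _, c in pairs)     # count per column value
n = len(pairs)                               # grand total

def cell_stats(r, c):
    """Return the four statistics for the box at row r, column c."""
    freq = cell[(r, c)]
    return {
        "frequency": freq,
        "percent": 100 * freq / n,               # share of all observations
        "row_pct": 100 * freq / row_total[r],    # share of that row
        "col_pct": 100 * freq / col_total[c],    # share of that column
    }

print(cell_stats("F", "yes"))
```

For the ("F", "yes") box this gives a frequency of 3, which is 37.5% of all 8 observations but 75% of the "F" row and 75% of the "yes" column.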

New cards

4 options to suppress statistics in a cross-tabulation table

norow
nocol
nofreq
nopercent

New cards

norow

suppresses the display of the row percentage

New cards

nocol

suppresses the display of the column percentage.

New cards

nofreq

suppresses the frequency display

New cards

nopercent

suppresses the percentage display

New cards

2 options to change the look of the cross-tabulation table output

1. LIST
2. CROSSLIST

Included in TABLES statement

New cards

LIST

displays the two-way table in list format (one row per combination of values) instead of as a grid

New cards

CROSSLIST

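displays the cross-tabulation results in column format, one row per table cell, grouped by the values of the first variable and including row and column totals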

New cards

FREQ procedure with multiple variables (in TABLES statement) but no asterisks (*)

produces a separate one-way table for each variable, listing all of its discrete values and reporting missing values

New cards

FREQ procedure with multiple variables (in TABLES statement) but no asterisks (*) output

like the single-variable PROC FREQ output, repeated once for each variable listed

New cards

order = freq

displays the results in descending frequency order

included in proc FREQ statement

New cards

if-then statement

executes a SAS statement for observations that meet a specific condition

- defines a condition
- statement can be any executable SAS statement
- if expression is true, then statement executes

New cards

Fixing the data

must fix all problems in the data set and have a reason to fix them

New cards

means procedure

summary reports with descriptive statistics

New cards

PROC MEANS output

Includes descriptive stats (mean, std dev, min, max)

New cards

analysis variables

numeric variables for which statistics are to be computed

New cards

var statement

identifies the analysis variable (or variables) and their order in the output

New cards

Data Preparation

Hardest, most time-consuming part
Takes 60-90% of the time
Most important
Where models most often go wrong

New cards

Variable Cleaning

- incorrect variables
- outliers (remove, transform, bin, leave them)
- missing values

New cards

3 Types of missing values

1. MCAR
2. MAR
3. MNAR

New cards

MCAR

Missing Completely At Random
- the missingness is unrelated to any variable, observed or not
- no way to determine what the value should be

New cards

MAR

Missing At Random
- not completely random: the missingness is related to other observed variables, not to the missing value itself

New cards

MNAR

Missing Not At Random
- the missingness is related to the missing value itself
- can probably infer roughly what it should be
- something someone doesn't want to specify but is implied

New cards

2 ways to deal with missing values

1. Listwise delete them
2. Impute a value

New cards

listwise deletion

- delete every observation (row) that has a missing value altogether
- risk of biasing the data

New cards

impute a value

Use a/the (constant, mean, median, distribution, calculation) to fill in missing values
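
The two ways of handling missing values can be compared side by side. This is a minimal plain-Python sketch with made-up records; `None` stands in for a missing value, and mean imputation is only one of the fill choices listed above.

```python
import statistics

# Hypothetical records; None marks a missing value.
records = [
    {"id": 1, "age": 20},
    {"id": 2, "age": None},
    {"id": 3, "age": 30},
    {"id": 4, "age": 40},
]

# Listwise deletion: drop every record that has a missing value.
complete = [r for r in records if r["age"] is not None]

# Mean imputation: fill each missing age with the mean of the observed ages.
mean_age = statistics.mean(r["age"] for r in complete)
imputed = [
    {**r, "age": mean_age if r["age"] is None else r["age"]}
    for r in records
]

print(len(complete), mean_age, imputed[1])
```

Deletion shrinks the data from 4 records to 3, while imputation keeps all 4 but fills record 2 with an estimated value (30).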

New cards

feature engineering

adds to the data by creating a variable from something that already exists
- variable creation
- dummy coding
- binning (bucketing)
- calculation

New cards

dummy coding

numeric 1/0 coding in which each number stands for one of two alternate responses, such as "female"/"male" or "yes"/"no"
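
A minimal plain-Python sketch of dummy coding with made-up responses. The second half, one 0/1 column per level for a variable with more than two values, is a common extension and an assumption here, not something the card states.

```python
# Hypothetical survey responses.
responses = ["yes", "no", "yes", "yes", "no"]

# Two-level variable: 1 for "yes", 0 for "no".
is_yes = [1 if r == "yes" else 0 for r in responses]

# Assumed extension: a variable with several levels gets one 0/1 column per level.
colors = ["red", "green", "blue", "green"]
levels = sorted(set(colors))
dummies = {level: [int(c == level) for c in colors] for level in levels}

print(is_yes)
print(dummies)
```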

New cards

Binning

grouping values into ranges (bins) when the exact value doesn't matter
- histograms
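
A short plain-Python sketch of binning. The ages, bin edges, and labels are hypothetical and chosen only to show how exact values collapse into buckets.

```python
from bisect import bisect_right
from collections import Counter

# Hypothetical ages and bin boundaries.
ages = [3, 17, 18, 25, 40, 64, 65, 90]
edges = [18, 40, 65]                              # boundaries between bins
labels = ["under 18", "18-39", "40-64", "65+"]    # one label per bin

# bisect_right counts how many edges are <= age, which is the bin index.
binned = [labels[bisect_right(edges, age)] for age in ages]

# Counting the bins gives the bar heights of a histogram.
histogram = Counter(binned)
print(binned)
print(histogram)
```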

New cards

normalization

converting the range of values into a standard range (for example, 0 to 1)
- increases the speed of learning
- reduces the computing power needed

New cards

Standardization

- like normalization, except values are rescaled to match the standard normal distribution

- mean of 0, standard deviation of 1

- use for unsupervised learning models, when the data is close to a bell curve, or when there are extreme outliers
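
Normalization and standardization side by side, as a plain-Python sketch with made-up values. Min-max scaling to the range 0 to 1 is assumed as the normalization method, and the population standard deviation is assumed for the z-scores; other choices are possible.

```python
import statistics

# Hypothetical values of one numeric variable.
values = [10, 20, 30, 40, 50]

# Normalization (assumed min-max): rescale so the range becomes 0 to 1.
low, high = min(values), max(values)
normalized = [(v - low) / (high - low) for v in values]

# Standardization (z-scores): subtract the mean, divide by the standard deviation.
mean = statistics.mean(values)
sd = statistics.pstdev(values)          # population SD, an assumption here
standardized = [(v - mean) / sd for v in values]

print(normalized)
print([round(z, 3) for z in standardized])
```

After standardization the values have a mean of 0 and a standard deviation of 1, whatever the original units were.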

New cards

Feature engineering drawbacks

more variables are not always better
- risk of overfitting data
- risk of the curse of dimensionality (need more data points)

New cards

Proc means OTHER descriptive statistics available

- Median
- Mode
- Q1
- Q3
- Range
- Qrange

New cards

median

middle value when all the values are ordered

New cards

mode

the most repeated value

New cards

Q1

the value that 25% of all values are at or below

New cards

Q3

the value that 75% of all values are at or below

New cards

range

the largest minus the smallest

New cards

qrange

the Q3 value minus the Q1 value (essentially reduces impact of outliers)
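
A plain-Python sketch of the order-based statistics, using made-up data with one outlier. Quartile conventions vary between packages, so the `method="inclusive"` choice below is an assumption and other tools may report slightly different Q1 and Q3 values on other data.

```python
import statistics

# Hypothetical values; 100 is a deliberate outlier.
values = [1, 3, 5, 7, 9, 11, 13, 15, 100]

# quantiles(..., n=4) returns the three cut points: Q1, median, Q3.
q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")

value_range = max(values) - min(values)   # largest minus smallest
qrange = q3 - q1                          # interquartile range, Q3 minus Q1

print(q1, median, q3, value_range, qrange)
```

The outlier stretches the range to 99, while the interquartile range stays at 8, which is why qrange is described as reducing the impact of outliers.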

New cards

nonobs

suppresses the N Obs column

New cards

maxdec

specifies the number of decimals to display
- maxdec=1 displays one decimal place

New cards

adding statistics in the MEANS procedure

list the statistics you want in the PROC MEANS statement, after the data set name, to override the default statistics (this includes the other descriptive statistics referred to earlier)

New cards

class statement

identifies variables whose values define subgroups for the analysis

New cards

classification variables

- character or numeric
- typically have few discrete values
- the data set does not need to be sorted or indexed by the classification variables

New cards

n obs

the number of observations with each unique combination of class variables

New cards

N

the number of observations with nonmissing values of the analysis variable(s)
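
The difference between the two counts shows up as soon as a subgroup contains a missing value. A plain-Python sketch (not SAS) with hypothetical rows, where the first item plays the role of the class variable and the second the analysis variable:

```python
from collections import defaultdict

# Hypothetical rows: (class variable value, analysis variable value); None = missing.
rows = [("A", 10.0), ("A", None), ("A", 14.0), ("B", 7.0), ("B", 9.0)]

# Collect the analysis values for each subgroup.
groups = defaultdict(list)
for group, value in rows:
    groups[group].append(value)

summary = {}
for group, vals in groups.items():
    present = [v for v in vals if v is not None]
    summary[group] = {
        "n_obs": len(vals),                    # all observations in the subgroup
        "n": len(present),                     # only nonmissing analysis values
        "mean": sum(present) / len(present),   # statistics use nonmissing values
    }

print(summary)
```

Group A has 3 observations but only 2 nonmissing values, so its mean is based on 2 values.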

New cards

skewness

- tendency of data to be more spread out on one side of the mean than the other
- asymmetry of data

New cards

left skewed

skewness < 0

New cards

right skewed

skewness > 0

New cards

Kurtosis

tendency of data to be concentrated toward the tails or middle of the data
- concentrated toward the mean

New cards

negative kurtosis (platykurtic)

less peaked
k < 0

New cards

Positive kurtosis (leptokurtic)

more peaked
k > 0

New cards

Moderate Kurtosis (mesokurtic / normal)

Middle peaked
k=0
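
Skewness and kurtosis can be computed from central moments. The sketch below uses the simple moment formulas, with 3 subtracted from kurtosis so that a normal distribution scores 0, matching the k < 0, k = 0, k > 0 convention above. Statistical packages often apply small-sample corrections, so their reported values may differ slightly. The function names and data are hypothetical.

```python
def central_moment(values, k):
    """Average of (value - mean) ** k over all values."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** k for v in values) / len(values)

def skewness(values):
    # Third moment scaled by the variance to the power 1.5.
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5

def excess_kurtosis(values):
    # Fourth moment scaled by the squared variance, minus 3.
    return central_moment(values, 4) / central_moment(values, 2) ** 2 - 3

symmetric = [1, 2, 3, 4, 5]          # evenly spread, no tail
right_tail = [1, 1, 2, 2, 3, 10]     # one large value pulls the tail right

print(skewness(symmetric), excess_kurtosis(symmetric))
print(skewness(right_tail))
```

The symmetric data has skewness 0 and negative kurtosis (flat, less peaked), while the data with a long right tail has positive skewness.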

New cards

Normal distribution

a bell-shaped curve, describing the spread of a characteristic throughout a population

New cards

1 Standard deviation

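about 68% of values in a normal distribution fall within 1 standard deviation of the mean

New cards

2 standard deviations

about 95% fall within 2 standard deviations of the mean

New cards

3 standard deviations

about 99.7% fall within 3 standard deviations of the mean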
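
The three percentages can be checked directly from the normal distribution's cumulative distribution function. A short Python sketch using the standard library; the helper name `share_within` is hypothetical.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

def share_within(k):
    """Fraction of a normal distribution within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

for k in (1, 2, 3):
    print(k, round(100 * share_within(k), 1))
# prints: 1 68.3, then 2 95.4, then 3 99.7
```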

New cards

What indicates normality

- mean and median are similar
- skewness is close to 0
- kurtosis is close to 0

New cards

What is used in SAS to determine normality

Proc Univariate

New cards

proc univariate

displays extreme observations, missing values and other statistics for the variable(s) included in the var statement

If the VAR statement is omitted, PROC UNIVARIATE analyzes all numeric variables in the data set

New cards

proc univariate output includes

- extreme observations section
- moments section
- tests for location section
- missing values section
- basic statistical measures section
- quantiles section