MLBA_02

39 Terms

Classification

Predicting a categorical outcome: each response falls into one of a set of categories/classes

Prediction

Predicting a numerical value rather than a class

Association Rules

Finding Associations/patterns between items in large databases
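
As a rough illustration of what finding associations between items can look like, the sketch below counts how often two items appear together in a few made-up transactions. The basket data, the function name `rule_stats`, and the choice of support and confidence as measures are assumptions added for illustration; the card itself does not name them.

```python
def rule_stats(transactions, antecedent, consequent):
    """Return (support, confidence) for the rule antecedent -> consequent."""
    n = len(transactions)
    # Transactions containing both items
    both = sum(1 for t in transactions if antecedent in t and consequent in t)
    # Transactions containing the antecedent at all
    ante = sum(1 for t in transactions if antecedent in t)
    support = both / n
    confidence = both / ante if ante else 0.0
    return support, confidence


# Hypothetical shopping baskets (made up for this example)
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]
support, confidence = rule_stats(baskets, "bread", "milk")
print(support, confidence)
```

Here support is the share of all baskets that contain both items, and confidence is the share of bread baskets that also contain milk.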

Collaborative Filtering

Recommends what goes with what for an individual user, based on user history and measurable behaviors
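
One very simple way to picture this is to find the user whose history overlaps most with the target user and suggest what that neighbor has that the target lacks. The sketch below does that with Jaccard overlap; the user names, item labels, the `recommend` function, and the similarity measure are all assumptions for illustration, not a method named on the card.

```python
def recommend(target, histories):
    """Suggest items from the most similar other user that `target` lacks."""
    seen = histories[target]

    def jaccard(a, b):
        # Overlap between two item sets, from 0 (disjoint) to 1 (identical)
        union = a | b
        return len(a & b) / len(union) if union else 0.0

    others = [u for u in histories if u != target]
    nearest = max(others, key=lambda u: jaccard(seen, histories[u]))
    return sorted(histories[nearest] - seen)


# Hypothetical purchase histories (made up for this example)
histories = {
    "ann": {"A", "B", "C"},
    "bob": {"A", "B", "D"},
    "cat": {"E", "F"},
}
suggestions = recommend("ann", histories)
print(suggestions)
```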

Data Reduction

Consolidating a large number of records into a smaller set

Dimension Reduction

Reducing the number of variables

Data Reduction

_____ reduces the number of rows in an .xlsx file

Dimension Reduction

_____ reduces the number of columns in an .xlsx file
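
The rows-versus-columns distinction can be sketched on a tiny made-up table. Keeping every other record and dropping one variable are stand-ins chosen only for brevity; real data reduction and dimension reduction usually use more careful techniques, and the table and names below are hypothetical.

```python
# Hypothetical table: each dict is one record (row); each key is one variable (column)
table = [
    {"age": 25, "income": 40, "visits": 3},
    {"age": 31, "income": 52, "visits": 1},
    {"age": 47, "income": 88, "visits": 7},
    {"age": 52, "income": 91, "visits": 2},
]

# Data reduction: fewer rows (here, simply keeping every other record)
fewer_rows = table[::2]

# Dimension reduction: fewer columns (here, simply dropping the "visits" variable)
keep = ("age", "income")
fewer_columns = [{k: row[k] for k in keep} for row in table]

print(len(fewer_rows), len(fewer_columns[0]))
```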

Data Visualization

Data exploration by creating charts and dashboards

Data Visualization is also called _____

Visual Analytics

Visual Analytics is also called _____

Data Visualization

Which charts are created in visual analytics for numerical variables?

Histograms and Boxplots

Which charts are created in visual analytics for categorical variables?

Bar Charts

Supervised Learning Algorithm

Uses training data to fit the model, validation data to compare its performance against other models, and test data to estimate how well the chosen model will perform on new data

Training Data

Data used to teach supervised learning algorithm

Validation Data

Data used to compare how the supervised learning algorithm performs against other models

Test Data

Data used to estimate how well the chosen supervised learning model will perform on new data

Does training data have known or unknown outcomes?

Known

Does validation data have known or unknown outcomes?

Known

Does testing data have known or unknown results?

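Known: the outcomes are held back from the model and used only to judge its predictions. Only new data being scored has unknown outcomes.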

Unsupervised Learning Algorithm

Has no outcome variable to predict or classify, so there is no learning from cases with known outcomes

Machine Learning Project Steps

Collect data
Explore, clean, and process data
Reduce dimensions, if necessary
Determine machine learning task
Partition data (training, validation, test)
Choose machine learning techniques to use
Use algorithms to perform tasks
Interpret results of algorithm
Deploy model

SEMMA

Sample, Explore, Modify, Model, Assess

S in SEMMA

Sample: Take a sample from the data set and partition it into three (training, validation, test)

E in SEMMA

Explore: Examine data, statistically and graphically

M1 in SEMMA

Modify: Transform variables and impute missing values

M2 in SEMMA

Model: Fit predictive models

A in SEMMA

Assess: Compare models using the validation data set

Preliminary Steps

Organize Data
Sample from database
Oversample rare events in classification tasks
Process and clean data
Handle categorical variables
Select variables
How many variables, how much data?
Outliers
Missing values
Normalize/Standardize, rescale data

Why sample from a database?

Models don’t need to be inundated with data to be accurate

Why oversample rare events in classification tasks?

If the event of interest is rare, a plain sample may contain too few rare cases and too much common data to create an accurate model
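
A minimal sketch of one simple oversampling variant follows: rare-class records are re-drawn with replacement until the two classes are the same size. The `fraud` field, the 95/5 split, the fixed seed, and the `oversample_rare` function are assumptions made up for illustration; other oversampling schemes exist.

```python
import random


def oversample_rare(records, label_key, rare_label, seed=0):
    """Re-draw rare-class records (with replacement) until classes are balanced."""
    rng = random.Random(seed)  # fixed seed so the example is repeatable
    rare = [r for r in records if r[label_key] == rare_label]
    common = [r for r in records if r[label_key] != rare_label]
    if not rare or len(rare) >= len(common):
        return list(records)  # nothing to balance
    extra = [rng.choice(rare) for _ in range(len(common) - len(rare))]
    return common + rare + extra


# Hypothetical data set: 95 common cases and 5 rare ones
data = [{"fraud": 0}] * 95 + [{"fraud": 1}] * 5
balanced = oversample_rare(data, "fraud", 1)
n_rare = sum(r["fraud"] for r in balanced)
print(len(balanced), n_rare)
```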

What are ways of handling categorical variables?

Code them numerically (for example, as 0/1 dummy variables)
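
One common way to code categories numerically is to give each category its own 0/1 column. The sketch below assumes that approach; the `dummy_code` function and the color values are hypothetical.

```python
def dummy_code(values):
    """Turn a list of category labels into one 0/1 column per category."""
    categories = sorted(set(values))
    return [{c: int(v == c) for c in categories} for v in values]


# Hypothetical categorical variable
colors = ["red", "green", "red", "blue"]
coded = dummy_code(colors)
print(coded[0])
```

Each coded row has exactly one column set to 1, marking the category that record belongs to.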

How many variables and how much data?

10 records per variable is a good rule of thumb

How to identify and what to do with outliers?

Flag values 3+ standard deviations from the mean, then determine whether each outlier is wrong, natural, or the very thing being sought.
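
The 3-standard-deviation check can be sketched as below. The readings, the `flag_outliers` function, and the use of the sample standard deviation are assumptions for illustration; the code only flags candidates, and deciding what each flagged value means is still a judgment call.

```python
import statistics


def flag_outliers(values, threshold=3.0):
    """Return values lying more than `threshold` standard deviations from the mean."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample standard deviation (an assumption)
    if sd == 0:
        return []
    return [v for v in values if abs(v - mean) / sd > threshold]


# Hypothetical readings with one suspicious value
readings = [10, 11, 9, 10, 12, 10, 11, 9, 10, 11,
            10, 9, 12, 10, 11, 10, 9, 11, 10, 100]
outliers = flag_outliers(readings)
print(outliers)
```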

How to handle missing values?

Omit the affected records, replace missing values with an imputed value (such as the mean), or determine whether the data is unnecessary or worth the investment to obtain
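
The first two options, omitting records and imputing the mean, can be sketched as follows. The `income` field, the use of `None` to mark a missing value, and both function names are assumptions for illustration.

```python
import statistics


def drop_missing(records, key):
    """Omit records whose value for `key` is missing (None)."""
    return [r for r in records if r[key] is not None]


def impute_mean(records, key):
    """Replace missing values for `key` with the mean of the known values."""
    known = [r[key] for r in records if r[key] is not None]
    fill = statistics.mean(known)
    # Build new dicts so the original records are left unchanged
    return [dict(r, **{key: fill}) if r[key] is None else dict(r) for r in records]


# Hypothetical records with one missing income
rows = [{"income": 40}, {"income": None}, {"income": 60}]
dropped = drop_missing(rows, "income")
imputed = impute_mean(rows, "income")
print(len(dropped), imputed[1]["income"])
```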

Normalizing Data

Also called standardizing: convert each value to a z-score (subtract the mean, divide by the standard deviation) to bring variables onto one scale
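
A minimal z-score sketch, showing two variables with very different units ending up on the same scale. The income and age values, the `z_scores` function, and the use of the sample standard deviation are assumptions for illustration.

```python
import statistics


def z_scores(values):
    """Rescale values to z-scores: (value - mean) / standard deviation."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample standard deviation (an assumption)
    return [(v - mean) / sd for v in values]


# Hypothetical variables measured on very different scales
incomes = [30000, 50000, 70000]
ages = [20, 40, 60]
z_income = z_scores(incomes)
z_age = z_scores(ages)
print(z_income, z_age)
```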

Data Partitioning

Dividing data into training, validation, and test data
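
A three-way partition can be sketched by shuffling the records and slicing. The 60/20/20 proportions, the fixed seed, and the `partition` function are assumptions for illustration; the card does not prescribe particular proportions.

```python
import random


def partition(records, train=0.6, valid=0.2, seed=0):
    """Shuffle records and split them into training, validation, and test sets."""
    rng = random.Random(seed)  # fixed seed so the split is repeatable
    shuffled = list(records)
    rng.shuffle(shuffled)
    n_train = round(len(shuffled) * train)
    n_valid = round(len(shuffled) * valid)
    return (
        shuffled[:n_train],
        shuffled[n_train:n_train + n_valid],
        shuffled[n_train + n_valid:],  # whatever remains is the test set
    )


# Hypothetical data set of 100 record ids
train_set, valid_set, test_set = partition(range(100))
print(len(train_set), len(valid_set), len(test_set))
```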

Overfitting

Model fits the training data too closely (including its noise) and thus is ineffective in predicting future outcome values
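
To make the idea concrete, the sketch below compares a model that simply memorizes the training points with a plain straight-line rule. All data values are made up, and the `fit_memorizer` and `simple` models are hypothetical stand-ins: the memorizer has zero error on the training data but does worse than the simple rule on held-out data.

```python
def mse(model, data):
    """Mean squared error of `model` over (x, y) pairs."""
    return sum((model(x) - y) ** 2 for x, y in data) / len(data)


def fit_memorizer(train):
    """A 'model' that memorizes training points: predicts the y of the nearest training x."""
    def predict(x):
        _, nearest_y = min(train, key=lambda p: abs(p[0] - x))
        return nearest_y
    return predict


def simple(x):
    """A deliberately simple rule that ignores the noise in the training data."""
    return 2 * x


# Made-up data: y is roughly 2*x, with hand-picked noise in the training set
train = [(1, 2.5), (2, 3.4), (3, 6.6), (4, 7.5)]
valid = [(1.4, 2.8), (2.6, 5.2), (3.4, 6.8)]

memorizer = fit_memorizer(train)
train_error = mse(memorizer, train)   # perfect fit on training data
valid_error = mse(memorizer, valid)   # noticeably worse on held-out data
print(train_error, valid_error, mse(simple, valid))
```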

Using a smaller data set is likely to generate a(n) _____ result, therefore ________________________________

ACCURATE, therefore the number of required records fits into the rows of an Excel spreadsheet