Classification
Predicting a response that falls into categories (classes)
Prediction
Predicting a numerical value rather than a class
Association Rules
Finding Associations/patterns between items in large databases
Collaborative Filtering
Recommends "what goes with what" for an individual, based on user history and measurable behaviors
Data Reduction
Consolidating a large number of records into a smaller set
Dimension Reduction
Reducing the number of variables
Data Reduction
_____ reduces the number of rows in an .xlsx file
Dimension Reduction
_____ reduces the number of columns in an .xlsx file
Data Visualization
Data exploration by creating charts and dashboards
Data Visualization is also called _____
Visual Analytics
Visual Analytics is also called _____
Data Visualization
Which charts are created for numerical data in visual analytics?
Histograms and Boxplots
Which charts are created for categorical data in visual analytics?
Bar Charts
Supervised Learning Algorithm
Uses training data to fit the model, validation data to compare its performance against other models, and test data to estimate how well the chosen model will perform on new data
Training Data
Data used to teach supervised learning algorithm
Validation Data
Used to compare how the supervised learning model performs against other models
Test Data
Used to estimate how well the final supervised learning model will perform on new data
Does training data have known or unknown outcomes?
Known
Does validation data have known or unknown outcomes?
Known
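Does test data have known or unknown outcomes?
Known, but withheld from the model until the final evaluation; only new data being scored has unknown outcomes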
Unsupervised Learning Algorithm
Has no outcome variable to predict or classify, so there is no "learning" from cases where the outcome is known
Machine Learning Project Steps
Collect data
Explore, clean, and process data
Reduce dimensions, if necessary
Determine machine learning task
Partition Data (training, validation, test)
Choose machine learning techniques to use
Use algorithms to perform tasks
Interpret results of algorithm
Deploy model
SEMMA
Sample, Explore, Modify, Model, Assess
S in SEMMA
Sample: Take a sample from the data set and partition it into training, validation, and test sets
E in SEMMA
Explore: Examine the data statistically and graphically
M1 in SEMMA
Modify: Transform variables and impute missing values
M2 in SEMMA
Model: Fit predictive models
A in SEMMA
Assess: Compare models using the validation data set
Preliminary Steps
Organize Data
Sample from database
Oversample rare events in classification tasks
Process and clean data
Handle categorical variables
Select variables
How many variables, how much data?
Outliers
Missing values
Normalize/Standardize, rescale data
Why sample from database?
Models don’t need to be inundated with data to be accurate
Why oversample rare events in classification tasks?
If the event of interest is rare, an ordinary sample is dominated by the common class and contains too few rare cases to build an accurate model
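A minimal sketch of the oversampling idea on the card above, assuming simple random duplication of rare records until the two classes are the same size. The function name `oversample_rare`, the `fraud`/`ok` labels, and the 95/5 split are hypothetical and chosen only for illustration.

```python
import random

def oversample_rare(records, label_of, rare_label, seed=0):
    """Duplicate rare-class records (sampling with replacement)
    until the rare class is as large as the common class."""
    rng = random.Random(seed)
    rare = [r for r in records if label_of(r) == rare_label]
    common = [r for r in records if label_of(r) != rare_label]
    if not rare or len(rare) >= len(common):
        return list(records)  # nothing to balance
    extra = [rng.choice(rare) for _ in range(len(common) - len(rare))]
    return common + rare + extra

# Hypothetical data: 95 ordinary transactions, 5 fraudulent ones.
records = [("t%d" % i, "ok") for i in range(95)]
records += [("f%d" % i, "fraud") for i in range(5)]

balanced = oversample_rare(records, label_of=lambda r: r[1], rare_label="fraud")
n_fraud = sum(1 for r in balanced if r[1] == "fraud")
print(len(balanced), n_fraud)
```

Duplicating records is only one way to oversample; drawing fewer common records is another.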
What are ways of handling categorical variables?
Code them numerically (for example, as binary dummy variables)
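One common way to code a categorical variable numerically is with 0/1 dummy (indicator) columns, one per category. The sketch below assumes that approach; `dummy_code` and the color values are hypothetical names used only for illustration.

```python
def dummy_code(values):
    """Return one 0/1 indicator column per distinct category."""
    categories = sorted(set(values))
    return {cat: [1 if v == cat else 0 for v in values] for cat in categories}

# Hypothetical categorical column.
colors = ["red", "blue", "red", "green"]
dummies = dummy_code(colors)

for category, column in dummies.items():
    print(category, column)
```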
How many variables and how much data?
10 records per variable is a good rule of thumb
How to identify and what to do with outliers?
Values 3 or more standard deviations from the mean are candidates; then determine whether each outlier is an error, natural variation, or the very thing being sought.
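The 3-standard-deviation rule from the card above can be sketched as follows. This assumes the sample standard deviation is used; `flag_outliers` and the data values are hypothetical. The code only flags candidates. Deciding whether a flagged value is an error or a genuine observation is still a judgment call.

```python
import statistics

def flag_outliers(values, threshold=3.0):
    """Return values more than `threshold` standard deviations from the mean."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample standard deviation
    if sd == 0:
        return []  # all values identical, nothing to flag
    return [v for v in values if abs(v - mean) / sd > threshold]

# Hypothetical measurements with one extreme value.
values = [10, 11, 9, 10, 12, 10, 11, 9, 10, 11,
          10, 9, 12, 10, 11, 10, 9, 11, 10, 95]
outliers = flag_outliers(values)
print(outliers)
```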
How to handle missing values?
Omit the records, replace the missing value with an imputed value (such as the mean), or decide whether the variable is unnecessary or worth further investment to obtain the data
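The first two options on the card above, omitting records and imputing the mean, can be sketched as below. The helper names `drop_missing` and `impute_mean`, the `age` column, and the use of `None` to mark a missing value are assumptions made for illustration.

```python
import statistics

def drop_missing(rows, column):
    """Omit every record whose value in `column` is missing (None)."""
    return [dict(r) for r in rows if r[column] is not None]

def impute_mean(rows, column):
    """Replace missing values in `column` with the mean of the observed values."""
    observed = [r[column] for r in rows if r[column] is not None]
    fill = statistics.mean(observed)
    return [dict(r, **{column: fill}) if r[column] is None else dict(r)
            for r in rows]

# Hypothetical records; the second one is missing its age.
rows = [{"age": 30}, {"age": None}, {"age": 50}]

dropped = drop_missing(rows, "age")
imputed = impute_mean(rows, "age")
print(len(dropped), imputed[1]["age"])
```

Omitting records shrinks the data set, while imputing keeps every record at the cost of understating variability in the imputed column.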
Normalizing Data
Standardizing data: convert each variable to z-scores (subtract the mean, divide by the standard deviation) to bring all variables onto one scale
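A minimal sketch of z-score normalization as described on the card above, assuming the sample standard deviation is used. The function name `z_scores` and the income figures are hypothetical.

```python
import statistics

def z_scores(values):
    """Rescale values to z-scores: (value - mean) / standard deviation."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample standard deviation
    return [(v - mean) / sd for v in values]

# Hypothetical variable measured on a large scale.
incomes = [30000, 45000, 60000, 75000, 90000]
z = z_scores(incomes)
print([round(v, 3) for v in z])
```

After rescaling, the variable has mean 0 and standard deviation 1, so it can be compared with variables originally measured in other units.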
Data Partitioning
Dividing data into training, validation, and test data
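Partitioning can be sketched as a random shuffle followed by slicing. The 50/30/20 proportions below are an assumption for illustration, not something the cards specify, and `partition` is a hypothetical helper name.

```python
import random

def partition(records, train=0.5, valid=0.3, seed=0):
    """Shuffle records and split them into training, validation, and test sets.
    Whatever is left after the training and validation slices becomes the test set."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train)
    n_valid = int(len(shuffled) * valid)
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_valid],
            shuffled[n_train + n_valid:])

# Hypothetical data set of 100 record IDs.
train_set, valid_set, test_set = partition(range(100))
print(len(train_set), len(valid_set), len(test_set))
```

Shuffling before slicing keeps any ordering in the original file from leaking into one partition.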
Overfitting
Model fits the training data too well, including its noise, and is therefore ineffective at predicting future outcome values
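A small, hand-built illustration of overfitting. It compares a model that memorizes the training data (a 1-nearest-neighbor lookup, chosen here only as an example of a very flexible model) with a simple threshold rule. The data, including the two deliberately mislabeled training points, and all function names are hypothetical.

```python
# Training data as (x, class); the true pattern is class 1 when x >= 5.
# The points at x=3 and x=8 are deliberately mislabeled to act as noise.
train = [(1, 0), (2, 0), (3, 1), (4, 0), (6, 1), (7, 1), (8, 0), (9, 1)]
valid = [(1.5, 0), (2.9, 0), (3.2, 0), (6.5, 1), (7.8, 1), (8.1, 1)]

def memorizer(x):
    """Overfit model: copy the class of the closest training point."""
    return min(train, key=lambda point: abs(point[0] - x))[1]

def simple_rule(x):
    """Simple model: a single threshold."""
    return 1 if x >= 5 else 0

def accuracy(model, data):
    """Fraction of records whose predicted class matches the known class."""
    return sum(1 for x, y in data if model(x) == y) / len(data)

memorizer_train = accuracy(memorizer, train)
memorizer_valid = accuracy(memorizer, valid)
rule_train = accuracy(simple_rule, train)
rule_valid = accuracy(simple_rule, valid)
print(memorizer_train, memorizer_valid, rule_train, rule_valid)
```

The memorizer scores perfectly on the training data because it reproduces the noise, then does worse than the simple rule on the validation data, which is the pattern the card describes.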
Using a smaller data set is likely to generate a(n) _____ result, therefore ________________________________
ACCURATE, therefore the number of required records fits into the rows of an Excel spreadsheet