MLBA_02


39 Terms

1

Classification

Predicting a response that falls into categories/classes

2

Prediction

Numerical value predicted (not class)

3

Association Rules

Finding associations/patterns between items in large databases
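
A minimal sketch of the idea, assuming the "database" is a list of shopping baskets. The function name `pair_counts` and the basket contents are made up for illustration. It only counts how often two items appear together, which is the raw pattern an association-rule method would build on.

```python
from collections import Counter
from itertools import combinations

def pair_counts(transactions):
    """Count how many transactions contain each unordered pair of items."""
    counts = Counter()
    for basket in transactions:
        # sorted(set(...)) removes duplicate items and fixes the pair order
        for pair in combinations(sorted(set(basket)), 2):
            counts[pair] += 1
    return counts

# Hypothetical transactions, invented for this example
baskets = [
    ["bread", "milk"],
    ["bread", "butter", "milk"],
    ["butter", "milk"],
    ["bread", "milk"],
]

counts = pair_counts(baskets)
print(counts.most_common(1))  # the pair that occurs together most often
```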

4

Collaborative Filtering

Suggests "what goes with what" for an individual user, based on user history and measurable behaviors

5

Data Reduction

Consolidating a large number of records into a smaller set

6

Dimension Reduction

Reducing the number of variables

7

Data Reduction

_____ reduces the number of rows in an .xlsx file

8

Dimension Reduction

_____ reduces the number of columns in an .xlsx file

9

Data Visualization

Data exploration by creating charts and dashboards

10

Data Visualization is also called _____

Visual Analytics

11

Visual Analytics is also called _____

Data Visualization

12

What charts are used to visualize numerical variables?

Histograms and Boxplots

13

What charts are used to visualize categorical variables?

Bar Charts

14

Supervised Learning Algorithm

Uses training data to fit the model, validation data to compare its performance against other candidate models, and test data to estimate how well the chosen model will perform on new data
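
The three roles of the data can be sketched as follows. This is an illustration under stated assumptions, not a prescribed method: the two candidate models (`fit_mean`, `fit_line`), the mean-squared-error metric, and all data values are invented for the example.

```python
from statistics import mean

def fit_mean(xs, ys):
    """Candidate 1: always predict the average training outcome."""
    m = mean(ys)
    return lambda x: m

def fit_line(xs, ys):
    """Candidate 2: simple least-squares line through the training data."""
    mx, my = mean(xs), mean(ys)
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    intercept = my - slope * mx
    return lambda x: intercept + slope * x

def mse(model, xs, ys):
    """Mean squared error of a model's predictions against known outcomes."""
    return mean([(model(x) - y) ** 2 for x, y in zip(xs, ys)])

# Made-up data: outcomes are known in all three partitions
train_x, train_y = [1, 2, 3, 4, 5, 6], [3.1, 4.9, 7.2, 8.8, 11.1, 12.9]
valid_x, valid_y = [7, 8, 9], [15.2, 16.8, 19.1]
test_x, test_y = [10, 11], [20.9, 23.2]

# 1. Training data teaches each candidate model
candidates = {"mean": fit_mean(train_x, train_y),
              "line": fit_line(train_x, train_y)}

# 2. Validation data compares the candidates
best_name = min(candidates, key=lambda n: mse(candidates[n], valid_x, valid_y))

# 3. Test data estimates how the chosen model will do on new data
test_mse = mse(candidates[best_name], test_x, test_y)
print(best_name, test_mse)
```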

15

Training Data

Data used to teach supervised learning algorithm

16

Validation Data

Data used to compare the performance of candidate models and select the best one

17

Test Data

Held-out data used to estimate how well the chosen model will perform on new data

18

Does training data have known or unknown outcomes?

Known

19

Does validation data have known or unknown outcomes?

Known

20

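Does test data have known or unknown outcomes?

Known, but withheld from the model. The outcomes are compared against the model's predictions to estimate performance; only new data being scored has unknown outcomes.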

21

Unsupervised Learning Algorithm

Has no outcome variable to predict or classify, so there is no learning from cases where the outcome is known

22

Machine Learning Project Steps

  1. Collect data

  2. Explore, clean, and process data

  3. Reduce dimensions, if necessary

  4. Determine machine learning task

  5. Partition data (training, validation, test)

  6. Choose machine learning techniques to use

  7. Use algorithms to perform tasks

  8. Interpret results of algorithm

  9. Deploy model

23

SEMMA

Sample, Explore, Modify, Model, Assess

24

S in SEMMA

Sample: Take a sample from the data set and partition it into training, validation, and test sets

25

E in SEMMA

Explore: Examine the data statistically and graphically

26

M1 in SEMMA

Modify: Transform variables and impute missing values

27

M2 in SEMMA

Model: Fit predictive models

28

A in SEMMA

Assess: Compare models using the validation data set

29

Preliminary Steps

  1. Organize Data

  2. Sample from database

  3. Oversample rare events in classification tasks

  4. Process and clean data

  5. Handle categorical variables

  6. Select variables

  7. How many variables, how much data?

  8. Outliers

  9. Missing values

  10. Normalize/Standardize, rescale data

30

Why sample from database?

Models don’t need to be inundated with data to be accurate

31

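Why oversample rare events in classification tasks?

When the event of interest is rare, a plain random sample contains too few rare cases relative to common ones for the model to learn to tell them apart; oversampling rebalances the classes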
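
One simple way to oversample, shown as a sketch only: duplicate randomly chosen rare-class records until the two classes are the same size. Other schemes exist (for example, drawing fewer common-class records instead). The function name `oversample`, the field name `bought`, and the data are hypothetical.

```python
import random

def oversample(records, label_key, rare_label, seed=1):
    """Duplicate rare-class records (with replacement) until classes are balanced."""
    rng = random.Random(seed)
    rare = [r for r in records if r[label_key] == rare_label]
    common = [r for r in records if r[label_key] != rare_label]
    if not rare:
        raise ValueError("no rare-class records to oversample")
    # Draw as many extra rare records as needed to match the common class
    extra = rng.choices(rare, k=max(0, len(common) - len(rare)))
    return common + rare + extra

# Hypothetical data: only 5 of 100 customers made a purchase
records = [{"id": i, "bought": "yes" if i < 5 else "no"} for i in range(100)]

balanced = oversample(records, "bought", "yes")
n_yes = sum(r["bought"] == "yes" for r in balanced)
n_no = sum(r["bought"] == "no" for r in balanced)
print(n_yes, n_no)
```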

32

What are ways of handling categorical variables?

Code them numerically (for example, as 0/1 dummy variables)
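
One common way to code a categorical variable numerically is with dummy (0/1) variables, one per category. The sketch below assumes that approach; the helper `dummy_code` and the `is_` column prefix are invented for illustration. Some models need one dummy dropped to avoid redundancy, which this sketch does not do.

```python
def dummy_code(values):
    """Turn a list of category labels into rows of 0/1 indicator columns."""
    categories = sorted(set(values))
    return [{f"is_{c}": int(v == c) for c in categories} for v in values]

colors = ["red", "blue", "red"]  # made-up categorical column
for row in dummy_code(colors):
    print(row)
```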

33

How many variables and how much data?

10 records per variable is a good rule of thumb

34

How to identify and what to do with outliers?

Flag values more than 3 standard deviations from the mean, then determine whether each outlier is an error, natural variation, or the very thing being sought.
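
The 3-standard-deviation check can be sketched as below. The function name `flag_outliers`, the use of the sample standard deviation, and the data are assumptions for this example. The code only flags candidates; deciding what each one means is still a judgment call.

```python
from statistics import mean, stdev

def flag_outliers(values, threshold=3.0):
    """Return values lying more than `threshold` standard deviations from the mean."""
    mu = mean(values)
    sigma = stdev(values)  # sample standard deviation (an assumption)
    return [v for v in values if abs(v - mu) > threshold * sigma]

# Made-up measurements: twenty ordinary values and one suspicious one
data = [48, 49, 50, 51, 52] * 4 + [500]
print(flag_outliers(data))
```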

35

How to handle missing values?

Omit the records, replace the missing value with an imputed value (such as the mean), or decide whether the variable is unnecessary or worth the investment needed to obtain the data
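
The first two options, omitting records and imputing the mean, can be sketched as follows. Missing values are assumed to be stored as `None`; the helper names and the `income` field are hypothetical.

```python
from statistics import mean

def omit_missing(records, field):
    """Drop every record whose `field` is missing (None)."""
    return [r for r in records if r[field] is not None]

def impute_mean(records, field):
    """Replace missing values of `field` with the mean of the observed values."""
    fill = mean(r[field] for r in records if r[field] is not None)
    return [{**r, field: fill if r[field] is None else r[field]} for r in records]

# Made-up records with one missing income
people = [
    {"id": 1, "income": 40},
    {"id": 2, "income": None},
    {"id": 3, "income": 60},
]

dropped = omit_missing(people, "income")
imputed = impute_mean(people, "income")
print(len(dropped), imputed[1]["income"])
```

Omitting is simplest but shrinks the data set; imputing keeps every record at the cost of understating the variable's spread.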

36

Normalizing Data

Also called standardizing: convert each value to a z-score (subtract the mean, divide by the standard deviation) to bring all variables onto one scale
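
A short sketch of z-score normalization, assuming the population standard deviation is used (the sample version is equally common). The function name `z_scores` and the two columns are made up. Incomes and ages start on very different scales but end up directly comparable.

```python
from statistics import mean, pstdev

def z_scores(values):
    """Return each value as a z-score: (value - mean) / standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation (an assumption)
    if sigma == 0:
        raise ValueError("cannot standardize a constant column")
    return [(v - mu) / sigma for v in values]

incomes = [30_000, 45_000, 60_000, 75_000, 90_000]  # made-up values
ages = [25, 35, 45, 55, 65]                          # made-up values

print(z_scores(incomes))
print(z_scores(ages))
```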

37

Data Partitioning

Dividing data into training, validation, and test data
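
A minimal partitioning sketch: shuffle the records, then slice them into three parts. The 50/30/20 proportions, the fixed seed, and the function name `partition` are assumptions chosen for illustration, not a required split.

```python
import random

def partition(records, train=0.5, valid=0.3, seed=1):
    """Shuffle records and split them into training, validation, and test lists."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # fixed seed makes the split repeatable
    n_train = round(len(shuffled) * train)
    n_valid = round(len(shuffled) * valid)
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_valid],
            shuffled[n_train + n_valid:])  # whatever remains is the test set

train_set, valid_set, test_set = partition(range(100))
print(len(train_set), len(valid_set), len(test_set))
```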

38

Overfitting

Model fits the training data too closely, capturing noise as well as signal, and is therefore ineffective at predicting future outcome values

39

Using a smaller data set is likely to generate a(n) _____ result, therefore ________________________________

Accurate; therefore the number of required records fits into the rows of an Excel spreadsheet