Ch. 2 Data Collection

43 Terms

1
What are the steps of the Machine Learning Cycle?
fetch, clean, prepare, train model, evaluate model, deploy to production, monitor & evaluate, repeat
2
What is the general rule about # of data points vs. features?
at least 10x as many data points as features
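A minimal sketch of the 10x rule as a check. The function name `meets_ten_x_rule` and the `factor` parameter are made up for illustration and are not part of any library.

```python
def meets_ten_x_rule(num_rows: int, num_features: int, factor: int = 10) -> bool:
    """Return True when the dataset has at least `factor` rows per feature."""
    if num_features <= 0:
        raise ValueError("num_features must be positive")
    return num_rows >= factor * num_features


# 20 features call for at least 200 rows under the rule of thumb.
print(meets_ten_x_rule(num_rows=500, num_features=20))
print(meets_ten_x_rule(num_rows=150, num_features=20))
```

This is a rule of thumb, so treat a failing check as a warning rather than a hard stop.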
3
Minimum # of rows in a good dataset, & what makes a large dataset good?
> 100 rows, because more data usually means better model training
4
What type of data attributes are ideal & why?
precise attributes because models need to train on important features
5
What should a data field be & why?
data fields should be complete with no missing values, because missing values skew the model's results
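One way to check completeness is to count empty fields per column. This is a sketch using only the standard library; the sample data and the name `missing_counts` are invented for illustration, and it assumes every row has the same number of fields as the header.

```python
import csv
import io

# Hypothetical sample data: three rows each have one empty field.
SAMPLE_CSV = """age,city,income
34,Boston,72000
,Denver,58000
29,,61000
41,Austin,
"""


def missing_counts(csv_text):
    """Count empty or absent values per column in CSV text."""
    reader = csv.DictReader(io.StringIO(csv_text))
    counts = {name: 0 for name in reader.fieldnames}
    for row in reader:
        for name, value in row.items():
            if value is None or value.strip() == "":
                counts[name] += 1
    return counts


print(missing_counts(SAMPLE_CSV))
```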
6
How should values be displayed & why?
values should be consistent because models like clean & consistent data
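As an illustration of consistent values, the sketch below maps several spellings of the same country to one canonical form. The `CANONICAL` table and `normalize_country` are hypothetical names, and a real dataset would need its own mapping.

```python
# Hypothetical lookup table: lower-cased variants mapped to one canonical value.
CANONICAL = {
    "usa": "US",
    "u.s.": "US",
    "united states": "US",
    "us": "US",
}


def normalize_country(raw):
    """Return the canonical form if known, otherwise the trimmed input."""
    key = raw.strip().lower()
    return CANONICAL.get(key, raw.strip())


values = ["USA", " u.s. ", "United States", "Canada"]
normalized = [normalize_country(v) for v in values]
print(normalized)
```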
7
How should data outcomes exist & why?
data outcomes should be evenly distributed, because models learn poorly from skewed (imbalanced) outcomes
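A quick way to see whether outcomes are evenly distributed is to look at the share of each label. This is a sketch; the helper names and the `max_ratio` threshold of 1.5 are assumptions chosen for illustration, not a standard cutoff.

```python
from collections import Counter


def outcome_shares(labels):
    """Return the fraction of rows that carry each outcome label."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


def is_balanced(labels, max_ratio=1.5):
    """True when the most common label is at most `max_ratio` times the rarest."""
    counts = Counter(labels)
    if not counts:
        raise ValueError("labels must not be empty")
    return max(counts.values()) / min(counts.values()) <= max_ratio


labels = ["fraud"] * 5 + ["ok"] * 95
print(outcome_shares(labels))
print(is_balanced(labels))
```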
8
What are the synonyms of a dataset?
input data, training/testing data
9
What are the synonyms of a column?
attribute, feature
10
What are the synonyms of a row?
data point, observation, sample
11
What is a schema?
information about the data like column names
12
What is structured data?
data with a defined schema
13
What is unstructured data?
data with no defined schema like PDFs, images, videos, audio
14
What is semi-structured data?
data with some structure, but not enough to be relational, like CSV, JSON, XML
15
What does a key-value pair make up in semi-structured data?
an attribute
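To make the key-value idea concrete, the sketch below parses one JSON record and treats each key as an attribute name. The record contents are invented for illustration.

```python
import json

# Hypothetical semi-structured record: each key-value pair is one attribute.
record_text = '{"id": 17, "name": "sensor-a", "reading": 21.5, "tags": ["indoor", "lab"]}'

record = json.loads(record_text)
for key, value in record.items():
    print(f"{key} -> {value!r}")

# The keys play the role that column names play in structured data.
attributes = list(record.keys())
```

Unlike a relational row, another record in the same file could carry different keys, which is why this counts as semi-structured.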
16
What is a Data Warehouse?
a big DB that collects usually preprocessed data from several sources in different formats
17
What can you do with data stored in a Data Warehouse?
run analytics & use BI tools to find important info about the data
18
What is a Data Lake?
single repository for storing large amounts of usually unprocessed/ unstructured data
19
What is the difference between labeled & unlabeled data?
the target attribute (the label) is known for labeled data, while it is unknown for unlabeled data
20
What is the difference between supervised learning & unsupervised learning?
supervised learning is done with labeled data, while unsupervised learning is with unlabeled data
21
What are categorical features?
values that are part of a group
22
What are continuous features?
values that are measurable numbers
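A rough sketch of telling the two feature kinds apart: treat a column as continuous when every value parses as a number, otherwise as categorical. The function `feature_kind` and the sample columns are hypothetical, and the heuristic is crude (numeric codes such as ZIP codes would be misread as continuous).

```python
def feature_kind(values):
    """Return 'continuous' if every value parses as a number, else 'categorical'."""
    try:
        for v in values:
            float(v)
    except (TypeError, ValueError):
        return "categorical"
    return "continuous"


# Hypothetical columns read as strings, for example from a CSV file.
columns = {
    "height_cm": ["170.2", "165.0", "181.4"],
    "color": ["red", "blue", "red"],
}
kinds = {name: feature_kind(vals) for name, vals in columns.items()}
print(kinds)
```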
23
Tip for remembering difference between qualitative vs quantitative data
quaLitative has an L, like categoricaL, while quaNtitative has an N, like coNtinuous
24
What is a Text/ Corpus Dataset & what is it used for?
dataset collected from text & used for NLP, speech recognition, text to speech
25
What is a Ground Truth dataset?
dataset composed of factual data, i.e. data that was observed or measured
26
What do we use S3 for in Machine Learning?
S3 is for storing our data
27
What is RDS used for?
create fully managed relational DBs
28
What is DynamoDB?
NoSQL (non-relational) datastore that is used to store key-value pairs
29
What is the ideal data for DynamoDB?
schemaless data & unstructured/semi-structured data
30
What is Redshift?
fully managed data warehouse that aggregates data from other services like S3 & DynamoDB
31
What does Redshift Spectrum allow you to do?
use a Redshift cluster to query data that stays in S3, without loading it into Redshift first
32
What does a Data Pipeline allow you to do?
process & move data between AWS compute & storage services, & support hybrid data migration
33
Give a Data Pipeline example transfer
RDS/ Redshift/ DynamoDB to Data Pipeline to S3
34
What does the Database Migration Service (DMS) allow you to do?
migrate data between different DB platforms
35
Give a DMS example migration
in-house (on-premises) DB / EC2 / RDS to DMS to S3
36
Can Data Pipeline or DMS use a transformation tool?
Data Pipeline can use a transformation tool
37
What is AWS Glue & what does it allow you to do?
Extract, Transform & Load (ETL) service for moving your data between data stores
38
How does Glue load data?
it sends out a crawler to determine the schema of the data; once the schema is determined, the data can be output to other services like Athena & EMR
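To get a feel for what "determining the schema" means, here is a toy schema inference over CSV text using only the standard library. It is an analogy, not Glue's actual crawler logic or API; `infer_type`, `infer_schema`, and the sample data are made up for illustration.

```python
import csv
import io


def infer_type(values):
    """Pick the narrowest type name that every value in the column parses as."""
    for type_name, caster in (("int", int), ("float", float)):
        try:
            for v in values:
                caster(v)
            return type_name
        except (TypeError, ValueError):
            continue
    return "string"


def infer_schema(csv_text):
    """Map each column name to an inferred type name."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    if not rows:
        return {}
    return {name: infer_type([row[name] for row in rows]) for name in rows[0]}


# Hypothetical sample file contents.
SAMPLE = "id,price,city\n1,9.99,Boston\n2,12.50,Denver\n"
print(infer_schema(SAMPLE))
```

Once column names and types are known, a query tool can treat the raw file as a table, which is the role the inferred schema plays for services like Athena.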
39
Give a Glue example load
S3/ EC2 DB/ DynamoDB/ Redshift/ RDS to Glue to S3/ Redshift/ Athena/ EMR
40
What is EMR?
fully managed Hadoop cluster ecosystem that runs on several EC2 instances
41
What does EMR allow you to do?
choose which frameworks you want in your clusters
42
What is Athena?
serverless SQL query tool for S3
43
What's the difference between Redshift Spectrum & Athena?
Redshift Spectrum requires a Redshift cluster to query S3 data, while Athena is serverless and only requires an S3 bucket of data