Ch. 2 Data Collection

What are the steps of the Machine Learning Cycle?

fetch, clean, prepare, train model, evaluate model, deploy to production, monitor & evaluate, repeat

What is the general rule about the number of data points vs. features?

at least 10x as many data points as features
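
The rule of thumb above can be checked with a one-line comparison. This is only a sketch: the function name `has_enough_rows` and the example counts are illustrative assumptions, not part of the course material.

```python
def has_enough_rows(n_rows: int, n_features: int, ratio: int = 10) -> bool:
    """Return True when the dataset has at least `ratio` rows per feature."""
    return n_rows >= ratio * n_features


# Hypothetical datasets with 20 features each, so the rule asks for 200+ rows.
print(has_enough_rows(n_rows=500, n_features=20))  # True
print(has_enough_rows(n_rows=150, n_features=20))  # False
```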

What is the minimum number of rows in a good dataset, & why is a large dataset good?

more than 100 rows, because more data usually means better model training

What type of data attributes are ideal & why?

precise attributes because models need to train on important features

What should a data field be & why?

data fields should be complete, with no missing values, because missing values skew the model's results
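
A completeness check might look like the sketch below. The sample rows and the choice to treat `None` and the empty string as "missing" are assumptions made for illustration.

```python
# Hypothetical sample rows; row 1 and row 2 each have a missing field.
rows = [
    {"age": 34, "city": "Austin"},
    {"age": None, "city": "Boston"},
    {"age": 51, "city": ""},
]


def incomplete_rows(rows, missing=(None, "")):
    """Return the indexes of rows that contain at least one missing value."""
    return [
        i for i, row in enumerate(rows)
        if any(value in missing for value in row.values())
    ]


print(incomplete_rows(rows))  # [1, 2]
```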

How should values be displayed & why?

values should be consistent (same format everywhere) because models train best on clean & consistent data

How should data outcomes exist & why?

data outcomes should be evenly distributed because models learn poorly from skewed outcomes
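
One simple way to see whether outcomes are skewed is to compute each outcome's share of the rows. The labels below are made up, and the helper name `outcome_shares` is an assumption for this sketch.

```python
from collections import Counter


def outcome_shares(labels):
    """Map each outcome to its share of the dataset (0.0 to 1.0)."""
    counts = Counter(labels)
    total = len(labels)
    return {label: count / total for label, count in counts.items()}


# Hypothetical, heavily skewed outcomes: 5 "spam" rows vs. 95 "ham" rows.
labels = ["spam"] * 5 + ["ham"] * 95
print(outcome_shares(labels))  # {'spam': 0.05, 'ham': 0.95}
```

Shares that are far apart, as here, suggest the dataset should be rebalanced before training.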

What are the synonyms of a dataset?

input data, training/testing data

What are the synonyms of a column?

attribute, feature

What are the synonyms of a row?

data point, observation, sample

What is a schema?

information about how the data is organized, such as column names and data types

What is structured data?

data with a defined schema
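
As a rough illustration of "data with a defined schema", the sketch below models a schema as column names mapped to Python types and checks a row against it. The column names and the `matches_schema` helper are hypothetical.

```python
# A schema here is just column name -> expected type (an assumption for this sketch).
schema = {"id": int, "name": str, "price": float}


def matches_schema(row, schema):
    """Return True when the row has exactly the schema's columns with the right types."""
    return row.keys() == schema.keys() and all(
        isinstance(row[column], expected) for column, expected in schema.items()
    )


print(matches_schema({"id": 1, "name": "widget", "price": 9.99}, schema))  # True
print(matches_schema({"id": "1", "name": "widget"}, schema))               # False
```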

What is unstructured data?

data with no defined schema like PDFs, images, videos, audio

What is semi-structured data?

data with some structure, but not enough to be relational, like CSV, JSON, XML

What does a key-value pair make up in semi-structured data?

an attribute
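
A small JSON record makes this concrete: each key-value pair is one attribute of the record. The record contents below are invented for the example.

```python
import json

# Hypothetical semi-structured record.
record = json.loads('{"name": "Ada", "age": 36, "tags": ["ml", "aws"]}')

# Each key-value pair is one attribute of the record.
for key, value in record.items():
    print(f"{key} = {value!r}")
```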

What is a Data Warehouse?

a large database that collects data, usually preprocessed, from several sources in different formats

What can you do with data stored in a Data Warehouse?

run analytics & use BI tools to find important info about the data

What is a Data Lake?

a single repository for storing large amounts of data, usually unprocessed/unstructured

What is the difference between labeled & unlabeled data?

the target attribute (the label) is known for labeled data, while it is unknown for unlabeled data

What is the difference between supervised learning & unsupervised learning?

supervised learning is done with labeled data, while unsupervised learning is with unlabeled data
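
The difference shows up in the shape of the data. In this sketch (all rows and the `"label"` column name are made-up assumptions), labeled rows can be split into features and known targets for supervised learning, while unlabeled rows have features only.

```python
# Labeled data: every row carries the known target attribute ("label").
labeled = [
    {"sqft": 900, "bedrooms": 2, "label": "apartment"},
    {"sqft": 2400, "bedrooms": 4, "label": "house"},
]

# Unlabeled data: features only, nothing to predict against.
unlabeled = [
    {"sqft": 1100, "bedrooms": 2},
    {"sqft": 3000, "bedrooms": 5},
]


def split_features_and_labels(rows, target="label"):
    """Separate each row into its features (X) and its known target (y)."""
    X = [{k: v for k, v in row.items() if k != target} for row in rows]
    y = [row[target] for row in rows]
    return X, y


X, y = split_features_and_labels(labeled)
print(y)  # ['apartment', 'house']
```

A supervised model would train on `X` together with `y`; an unsupervised model would only ever see rows like those in `unlabeled`.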

What are categorical features?

values that belong to a group (category)

What are continuous features?

values that are measurable numbers
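
A crude way to tell the two apart in code is to check whether every value in a column is a number. This is only a heuristic sketch: numeric codes such as ZIP codes are still categorical, and the column names below are invented.

```python
from numbers import Real


def feature_kind(values):
    """Heuristic: all-numeric columns are continuous, everything else is categorical."""
    is_numeric = all(
        isinstance(v, Real) and not isinstance(v, bool) for v in values
    )
    return "continuous" if is_numeric else "categorical"


# Hypothetical columns.
columns = {"height_cm": [170.2, 165.0, 181.5], "color": ["red", "blue", "red"]}
for name, values in columns.items():
    print(name, "->", feature_kind(values))
```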

What is a tip for remembering the difference between qualitative & quantitative data?

quaLitative has an L, as in categoricaL, while quaNtitative has an N, as in coNtinuous

What is a Text/Corpus Dataset & what is it used for?

dataset collected from text & used for NLP, speech recognition, text to speech

What is a Ground Truth dataset?

a dataset composed of factual data, i.e. data that was observed or measured

What do we use S3 for in Machine Learning?

S3 is for storing our data

What is RDS used for?

creating fully managed relational DBs

What is DynamoDB?

a NoSQL (non-relational) datastore that is used to store key-value pairs

What is the ideal data for DynamoDB?

schemaless data, i.e. unstructured/semi-structured data

What is Redshift?

a fully managed data warehouse that aggregates data from other services like S3 & DynamoDB

What does Redshift Spectrum allow you to do?

query data stored in S3 from a Redshift cluster, without first loading it into Redshift

What does a Data Pipeline allow you to do?

process & move data between AWS compute & storage services, & support hybrid data migration

Give a Data Pipeline example transfer

RDS/Redshift/DynamoDB -> Data Pipeline -> S3

What does the Database Migration Service (DMS) allow you to do?

migrate data between different DB platforms

Give a DMS example migration

in-house (on-premises) DB/EC2/RDS -> DMS -> S3

Can Data Pipeline or DMS use a transformation tool?

Data Pipeline can use a transformation tool

What is AWS Glue & what does it allow you to do?

an Extract, Transform & Load (ETL) service for moving your data between sources

How does Glue load data?

it sends out a crawler to determine the schema of the data; once the schema is determined, the data can be output to other services like Athena & EMR

Give a Glue example load

S3/EC2 DB/DynamoDB/Redshift/RDS -> Glue -> S3/Redshift/Athena/EMR

What is EMR?

a fully managed Hadoop cluster ecosystem that runs on several EC2 instances

What does EMR allow you to do?

choose which frameworks you want in your clusters

What is Athena?

a serverless SQL query service for data stored in S3
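
Submitting a query to Athena from Python might look like the sketch below, which wraps boto3's `start_query_execution` call on an Athena client. The helper name `start_athena_query` and every table, database, and bucket name in the usage comment are placeholders; running it for real requires boto3 and AWS credentials.

```python
def start_athena_query(athena_client, sql, database, output_location):
    """Submit `sql` to Athena and return the query execution ID."""
    response = athena_client.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        # Athena writes query results to this S3 location.
        ResultConfiguration={"OutputLocation": output_location},
    )
    return response["QueryExecutionId"]


# Usage (placeholder names; needs boto3 and AWS credentials):
# import boto3
# query_id = start_athena_query(
#     boto3.client("athena"),
#     "SELECT * FROM my_table LIMIT 10",
#     database="my_database",
#     output_location="s3://my-bucket/athena-results/",
# )
```

The client is passed in as a parameter so the function can be exercised with a stub instead of a live AWS connection.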

What's the difference between Redshift Spectrum & Athena?

Redshift Spectrum requires a Redshift cluster to run queries against S3, while Athena is serverless and only requires the data to be in an S3 bucket