3_Parallel Processing in the Data Pipeline


Description

Flashcards about parallel processing, Amazon EMR, and data pipelines.


5 Terms

1

Parallel Processing

A technique that splits a large dataset into parts, processes the parts concurrently across a cluster of servers, and then aggregates the partial results into a single output, so that large datasets can be handled efficiently.
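
The split, process, aggregate pattern can be sketched on a single machine. This is a minimal illustration only: a thread pool stands in for the cluster of servers, and the helper names `split`, `process`, and `parallel_sum` are hypothetical, not part of any EMR or Hadoop API.

```python
from concurrent.futures import ThreadPoolExecutor


def split(records, parts):
    """Split a list into at most `parts` chunks of roughly equal size."""
    size = max(1, -(-len(records) // parts))  # ceiling division, at least 1
    return [records[i:i + size] for i in range(0, len(records), size)]


def process(chunk):
    """Per-chunk work; here each worker computes a partial sum."""
    return sum(chunk)


def parallel_sum(records, workers=4):
    """Split the data, process the chunks concurrently, aggregate the results."""
    chunks = split(records, workers)
    # Assumption: threads stand in for separate servers in a cluster.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(process, chunks))
    return sum(partials)  # aggregation step


total = parallel_sum(list(range(1, 101)))
print(total)
```

On a real cluster, a framework such as Apache Spark or Hadoop would handle the splitting, scheduling, and aggregation across machines.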

2

Amazon EMR

A managed cluster platform that reduces the complexity of running big data frameworks like Apache Hadoop and Spark.

3

Data Lake Zones

Organized sections within a data lake that separate data based on its state (e.g., drop zone, data analytics zone, curated data zone).

4

Data Cleaning Job (in EMR)

Copies data from the drop zone, runs data cleaning processes, and copies the result set to the data analytics zone.
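
The copy, clean, copy flow of such a job might look like the sketch below. It rests on several assumptions: local directories stand in for the data lake zones, the input is CSV, and "cleaning" means trimming whitespace and dropping incomplete or duplicate rows. The function names and the sample file are hypothetical; an actual EMR job would typically run as a Spark or Hadoop step against cloud storage.

```python
import csv
import tempfile
from pathlib import Path


def clean_rows(rows):
    """Trim whitespace, drop rows with empty fields, drop exact duplicates."""
    seen = set()
    for row in rows:
        cleaned = tuple(field.strip() for field in row)
        if cleaned and all(cleaned) and cleaned not in seen:
            seen.add(cleaned)
            yield cleaned


def run_cleaning_job(drop_zone, analytics_zone):
    """Read each CSV from the drop zone and write a cleaned copy to the analytics zone."""
    analytics_zone.mkdir(parents=True, exist_ok=True)
    for src in sorted(drop_zone.glob("*.csv")):
        with src.open(newline="") as f:
            rows = list(csv.reader(f))
        with (analytics_zone / src.name).open("w", newline="") as f:
            csv.writer(f).writerows(clean_rows(rows))


# Hypothetical sample data: one padded value, one duplicate, one incomplete row.
root = Path(tempfile.mkdtemp())
drop, analytics = root / "drop", root / "analytics"
drop.mkdir()
(drop / "orders.csv").write_text("id,amount\n1, 10 \n1,10\n2,\n3,30\n")

run_cleaning_job(drop, analytics)
cleaned_lines = (analytics / "orders.csv").read_text().splitlines()
print(cleaned_lines)
```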

5

Data Curation Job (in EMR)

Copies data from the data analytics zone, runs processing steps to curate the data, and copies those results to the curated data zone for user access.
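
A curation step could be sketched as follows. The card does not say what the curation steps are, so this example assumes one plausible step: aggregating cleaned records into a per-customer summary. Local directories again stand in for the zones, and the column names `customer` and `amount`, the output file name, and `run_curation_job` are all hypothetical.

```python
import csv
import tempfile
from collections import defaultdict
from pathlib import Path


def run_curation_job(analytics_zone, curated_zone):
    """Aggregate cleaned CSV data by customer and write a summary to the curated zone."""
    totals = defaultdict(float)
    for src in sorted(analytics_zone.glob("*.csv")):
        with src.open(newline="") as f:
            for row in csv.DictReader(f):
                totals[row["customer"]] += float(row["amount"])

    curated_zone.mkdir(parents=True, exist_ok=True)
    out = curated_zone / "totals_by_customer.csv"
    with out.open("w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["customer", "total_amount"])
        for customer in sorted(totals):
            writer.writerow([customer, totals[customer]])
    return out


# Hypothetical cleaned input, as a cleaning job might have left it.
root = Path(tempfile.mkdtemp())
analytics, curated = root / "analytics", root / "curated"
analytics.mkdir()
(analytics / "orders.csv").write_text("customer,amount\nann,10\nbob,5\nann,2.5\n")

summary_path = run_curation_job(analytics, curated)
summary_lines = summary_path.read_text().splitlines()
print(summary_lines)
```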