Flashcards about parallel processing, Amazon EMR, and data pipelines.
Parallel Processing
A technique that splits data into parts, processes them separately on clusters of servers, and then aggregates the results to handle large datasets efficiently.
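As a minimal sketch of the split, process, and aggregate pattern described above, the following uses a local thread pool as a stand-in for a cluster of servers. The function names (`split`, `process`, `parallel_sum`) and the choice of a partial sum as the per-chunk work are illustrative assumptions, not part of any particular framework.

```python
from concurrent.futures import ThreadPoolExecutor


def split(data, parts):
    """Split data into contiguous chunks of roughly equal size."""
    size = max(1, -(-len(data) // parts))  # ceiling division
    return [data[i:i + size] for i in range(0, len(data), size)]


def process(chunk):
    """Process one chunk independently; here the work is a partial sum."""
    return sum(chunk)


def parallel_sum(data, parts=4):
    """Split the data, process each chunk on a worker, then aggregate."""
    chunks = split(data, parts)
    # The thread pool stands in for cluster nodes (an assumption for this sketch).
    with ThreadPoolExecutor(max_workers=parts) as pool:
        partials = list(pool.map(process, chunks))
    return sum(partials)  # aggregation step


total = parallel_sum(list(range(1, 101)))
print(total)  # 5050
```

On a real cluster the chunks would live on different machines and the aggregation would happen over the network, but the three steps stay the same.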
Amazon EMR
A managed cluster platform that reduces the complexity of running big data frameworks like Apache Hadoop and Spark.
Data Lake Zones
Organized sections within a data lake that separate data by its processing state (e.g., raw data in the drop zone, cleaned data in the data analytics zone, curated data in the curated data zone).
Data Cleaning Job (in EMR)
A job that copies data from the drop zone, runs data cleaning processes on it, and copies the result set to the data analytics zone.
Data Curation Job (in EMR)
A job that copies data from the data analytics zone, runs processing steps to curate it, and copies the results to the curated data zone for user access.
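The two jobs above can be sketched as a small two-stage pipeline. This is only an illustration under stated assumptions: local directories stand in for the data lake zones, plain Python stands in for a framework such as Spark running on EMR, and the file name `sales.csv`, the `region` and `amount` columns, and the function names `cleaning_job` and `curation_job` are all hypothetical.

```python
import csv
import tempfile
from collections import defaultdict
from pathlib import Path


def cleaning_job(drop_zone: Path, analytics_zone: Path) -> None:
    """Read raw CSVs from the drop zone, drop incomplete rows, write cleaned copies."""
    analytics_zone.mkdir(parents=True, exist_ok=True)
    for src in sorted(drop_zone.glob("*.csv")):
        with src.open(newline="") as f:
            reader = csv.DictReader(f)
            rows = list(reader)
            fieldnames = reader.fieldnames
        if not fieldnames:
            continue  # empty file, nothing to clean
        # Keep only rows where every field is a non-blank string, trimmed.
        cleaned = [
            {k: v.strip() for k, v in row.items()}
            for row in rows
            if all(isinstance(v, str) and v.strip() for v in row.values())
        ]
        with (analytics_zone / src.name).open("w", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=fieldnames)
            writer.writeheader()
            writer.writerows(cleaned)


def curation_job(analytics_zone: Path, curated_zone: Path) -> dict:
    """Aggregate cleaned rows into per-region totals and write a summary file."""
    curated_zone.mkdir(parents=True, exist_ok=True)
    totals = defaultdict(float)
    for src in sorted(analytics_zone.glob("*.csv")):
        with src.open(newline="") as f:
            for row in csv.DictReader(f):
                totals[row["region"]] += float(row["amount"])
    with (curated_zone / "summary.csv").open("w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["region", "total_amount"])
        for region in sorted(totals):
            writer.writerow([region, totals[region]])
    return dict(totals)


# Hypothetical sample data: one row has a missing amount and is dropped by cleaning.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    drop, analytics, curated = root / "drop", root / "analytics", root / "curated"
    drop.mkdir()
    (drop / "sales.csv").write_text("region,amount\neast, 10\nwest,5\neast,\neast,2.5\n")
    cleaning_job(drop, analytics)
    totals = curation_job(analytics, curated)

print(totals)  # {'east': 12.5, 'west': 5.0}
```

Each job reads only from the zone before it and writes only to the zone after it, so the raw input in the drop zone is never modified and users see only the curated output.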