Parallel Processing in the Data Pipeline

Parallel Processing for Big Data

  • Parallel processing techniques overcome the limits of a single server when handling large datasets.
  • Basic Steps:
    • Split the data into parts.
    • Process the parts in parallel on a cluster of servers.
    • Aggregate the partial results into a final output (see the sketch after this list).
  • Significance: Reduces overall processing time for big data jobs.
  • Example open-source frameworks: Apache Hadoop and Apache Spark.
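
To make the split-process-aggregate pattern concrete, here is a minimal sketch in plain Python using the standard-library multiprocessing module. The word-count workload, worker count, and sample data are illustrative assumptions, not part of the original notes; real frameworks such as Hadoop and Spark apply the same pattern across many machines rather than local processes.

    import multiprocessing as mp
    from collections import Counter

    def count_words(lines):
        # Process one part: count word occurrences in a chunk of lines.
        counts = Counter()
        for line in lines:
            counts.update(line.split())
        return counts

    if __name__ == "__main__":
        lines = ["big data", "big clusters", "parallel data jobs"] * 1000
        n = 4  # number of parallel workers (illustrative)
        # Step 1: split the data into roughly equal parts.
        chunks = [lines[i::n] for i in range(n)]
        # Step 2: process the parts in parallel.
        with mp.Pool(n) as pool:
            partials = pool.map(count_words, chunks)
        # Step 3: aggregate the partial results into a final output.
        total = sum(partials, Counter())
        print(total.most_common(3))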

Amazon EMR

  • Amazon EMR is a managed cluster platform.
  • It simplifies provisioning clusters and running big data frameworks such as Apache Hadoop and Apache Spark (a cluster-launch sketch follows).
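
As a hedged sketch of how an EMR cluster might be launched programmatically, the following uses the boto3 EMR client's run_job_flow call. The cluster name, region, release label, instance types, and log bucket are illustrative assumptions, and the default IAM roles referenced must already exist in the AWS account.

    import boto3

    emr = boto3.client("emr", region_name="us-east-1")  # region is an assumption

    response = emr.run_job_flow(
        Name="example-data-pipeline",              # hypothetical cluster name
        ReleaseLabel="emr-6.15.0",                 # illustrative EMR release
        Applications=[{"Name": "Spark"}],          # install Apache Spark on the cluster
        Instances={
            "MasterInstanceType": "m5.xlarge",
            "SlaveInstanceType": "m5.xlarge",
            "InstanceCount": 3,                    # one primary node, two core nodes
            "KeepJobFlowAliveWhenNoSteps": False,  # terminate when all steps finish
        },
        LogUri="s3://example-bucket/emr-logs/",    # hypothetical log bucket
        JobFlowRole="EMR_EC2_DefaultRole",         # default EMR instance profile
        ServiceRole="EMR_DefaultRole",             # default EMR service role
    )
    print("Cluster ID:", response["JobFlowId"])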

Data Pipeline Example Using Amazon EMR

  • Scenario: Cleaning and preparing data in an analytics pipeline.
  • Data Lake Organization: Uses zones to separate data at different stages of processing (drop, analytics, and curated).
  • Data Transfer: Data moves from on-premises sources to an Amazon S3 data drop zone.
  • Data Cleaning Job:
    • Copies data from the drop zone.
    • Runs data cleaning processes.
    • Copies the result set to the data analytics zone.
  • Data Curation Job:
    • Copies data from the data analytics zone.
    • Performs data curation steps.
    • Copies results to the curated data zone.
  • Curated Data Usage: Accessed for visualization and analysis; often loaded into a data warehouse (a sketch of the cleaning and curation jobs follows this list).
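
The cleaning and curation jobs described above could be written as PySpark scripts submitted to the EMR cluster. The sketch below assumes hypothetical S3 bucket and zone prefixes and simple illustrative transformations (dropping duplicates and null rows, renaming a column), since the original notes do not specify the actual processing logic.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("clean-and-curate").getOrCreate()

    # Zone locations are hypothetical placeholders for a real data lake layout.
    DROP_ZONE = "s3://example-datalake/drop-zone/orders/"
    ANALYTICS_ZONE = "s3://example-datalake/analytics-zone/orders/"
    CURATED_ZONE = "s3://example-datalake/curated-zone/orders/"

    # Data cleaning job: read from the drop zone, clean, write to the analytics zone.
    raw = spark.read.option("header", True).csv(DROP_ZONE)
    cleaned = raw.dropDuplicates().na.drop()  # illustrative cleaning rules
    cleaned.write.mode("overwrite").parquet(ANALYTICS_ZONE)

    # Data curation job: read from the analytics zone, curate, write to the curated zone.
    analytics = spark.read.parquet(ANALYTICS_ZONE)
    curated = analytics.withColumnRenamed("ord_dt", "order_date")  # illustrative curation step
    curated.write.mode("overwrite").parquet(CURATED_ZONE)

    spark.stop()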

Key Takeaways

  • Parallel processing uses server clusters to accelerate big data jobs via data segmentation.
  • Amazon EMR is a managed AWS service that runs big data frameworks such as Hadoop and Spark with reduced operational complexity.