3. Parallel Processing in the Data Pipeline
Parallel Processing for Big Data
- Parallel processing techniques address the limitations of a single server when handling large datasets.
- Basic Steps (see the sketch after this list):
  - Split the data into parts.
  - Process the parts in parallel on a cluster of servers.
  - Aggregate the results into a final output.
- Significance: Reduces the processing time for big data jobs.
- Examples of Open Source Frameworks: Apache Hadoop and Apache Spark.
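A minimal PySpark sketch of the split-process-aggregate pattern described above. The file name and column names are hypothetical placeholders; Spark handles the splitting (partitioning) and the final aggregation automatically.

```python
# Minimal sketch of split-process-aggregate with PySpark.
# Assumes pyspark is installed; "sales.csv" and its columns
# ("region", "amount") are hypothetical placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("parallel-aggregation").getOrCreate()

# Split: Spark partitions the input data across the cluster's executors.
df = spark.read.csv("sales.csv", header=True, inferSchema=True)

# Process: each partition computes partial sums in parallel.
# Aggregate: Spark shuffles and merges the partial results into one output.
totals = df.groupBy("region").agg(F.sum("amount").alias("total_amount"))

totals.show()
spark.stop()
```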
Amazon EMR
- Amazon EMR is a managed cluster platform.
- Simplifies running big data frameworks such as Apache Hadoop and Apache Spark.
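As a hedged sketch of how a cluster might be launched programmatically, the boto3 snippet below starts a transient EMR cluster that runs one Spark step and then terminates. The cluster name, release label, instance types, bucket, and script path are all assumptions for illustration; the default IAM roles must already exist in the account.

```python
# Sketch: launching a transient EMR cluster with boto3.
# Names, release label, and S3 paths are illustrative assumptions.
import boto3

emr = boto3.client("emr", region_name="us-east-1")

response = emr.run_job_flow(
    Name="data-cleaning-cluster",
    ReleaseLabel="emr-6.15.0",  # assumed release; choose a current one
    Applications=[{"Name": "Spark"}],
    Instances={
        "MasterInstanceType": "m5.xlarge",
        "SlaveInstanceType": "m5.xlarge",
        "InstanceCount": 3,
        "KeepJobFlowAliveWhenNoSteps": False,  # terminate after the step
    },
    Steps=[{
        "Name": "clean-data",
        "ActionOnFailure": "TERMINATE_CLUSTER",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", "s3://example-bucket/jobs/clean_data.py"],
        },
    }],
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
print("Cluster ID:", response["JobFlowId"])
```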
Data Pipeline Example Using Amazon EMR
- Scenario: Cleaning and preparing data in an analytics pipeline.
- Data Lake Organization: Uses separate zones for data at different stages of processing.
- Data Transfer: Data moves from on-premises sources to an Amazon S3 data drop zone.
- Data Cleaning Job (see the PySpark sketch after this list):
  - Copies data from the drop zone.
  - Runs data cleaning processes.
  - Copies the result set to the data analytics zone.
- Data Curation Job:
  - Copies data from the data analytics zone.
  - Performs data curation steps.
  - Copies the results to the curated data zone.
- Curated Data Usage: Accessed for visualization and analysis; often loaded into a data warehouse.
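A minimal sketch of what the data cleaning job might look like as a PySpark script submitted to EMR. The bucket name, zone prefixes, dataset, and cleaning rules are assumptions for illustration; the curation job would follow the same read-transform-write pattern between the next pair of zones.

```python
# Hypothetical PySpark cleaning job for EMR: reads raw data from the
# S3 drop zone, applies basic cleaning, and writes the result set to
# the data analytics zone. Bucket and prefix names are assumptions.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("clean-data").getOrCreate()

# Copy in: read raw files from the data drop zone.
raw = spark.read.csv("s3://example-data-lake/drop-zone/orders/",
                     header=True, inferSchema=True)

# Clean: drop exact duplicates and rows missing required fields.
cleaned = raw.dropDuplicates().dropna(subset=["order_id", "order_date"])

# Copy out: write the result set to the data analytics zone as Parquet.
cleaned.write.mode("overwrite").parquet(
    "s3://example-data-lake/analytics-zone/orders/")

spark.stop()
```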
Key Takeaways
- Parallel processing accelerates big data jobs by splitting data into parts that server clusters process in parallel before aggregating the results.
- Amazon EMR is a managed cloud service that reduces the complexity of running big data frameworks such as Apache Hadoop and Apache Spark.