# CS431/451/631/651 - Module 3: From MapReduce to Spark Notes
## Module Overview
- Focuses on Data-Intensive Distributed Computing.
- Covers programming and algorithm design aspects with a transition from MapReduce to Spark.
## Data Center as a Computer
- Key Concept: if the data center is viewed as a single computer, Hadoop (MapReduce) serves as its instruction set for data processing.
## Abstraction Layers
- From highest to lowest:
- Higher-Level Language (e.g. Python)
- Lower-Level Language (e.g. C)
- Assembly
- Machine Code
- Instruction Set Architecture
- Electronics (Transistors)
## Hadoop Overview
- Hadoop Characteristics:
- Requires extensive boilerplate code.
- Programming tends to be tedious due to low-level complexity.
- Raises the question: can we simplify programming with a higher-level approach?
## Possible Alternatives
- SQL and Pig scripts are introduced as alternatives that lower Hadoop's boilerplate burden.
## Apache Pig
- Pig simplifies processing with a script-like syntax, making it more user-friendly than raw Hadoop.
- Example use case in Pig: finding the top 10 most visited pages in each category of a dataset.
- Basic Pig Script Example (abbreviated in the slides):
```pig
visits = load '/data/visits' as (user, url, time);
…
store topUrls into '/data/topUrls';
```
## Hadoop 2.0 Enhancements
- Hadoop 2.0 introduces YARN, which lets developers build distributed applications other than MapReduce on the cluster, enhancing versatility.
## Introduction to Apache Spark
- Spark is described as:
- Efficient, with in-memory storage and faster execution.
- Compatible with Hadoop, while requiring far fewer lines of code.
- **Performance Benchmark**: Spark executes significantly faster than Hadoop MapReduce, largely due to its in-memory capabilities.
## Resilient Distributed Datasets (RDDs)
- **RDD Characteristics**:
- Immutable collections of objects partitioned across a cluster; can be cached in RAM.
- Built through parallel transformations and can automatically recover from failures.
- **Operations**:
- Transformations like `map`, `filter`, and `groupBy`.
- Actions like `count`, `collect`, and `save`.
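To make the transformation/action split concrete, here is a minimal plain-Python sketch (the `TinyRDD` class is hypothetical, not Spark's API): transformations only record work, and nothing executes until an action is called.

```python
# Minimal sketch of lazy transformations vs. eager actions.
# TinyRDD is a hypothetical stand-in for Spark's RDD, not the real API.
class TinyRDD:
    def __init__(self, data, ops=None):
        self.data = data
        self.ops = ops or []          # recorded transformations (lazy)

    def map(self, f):
        return TinyRDD(self.data, self.ops + [("map", f)])

    def filter(self, p):
        return TinyRDD(self.data, self.ops + [("filter", p)])

    def _compute(self):
        out = self.data
        for kind, f in self.ops:      # nothing ran until an action called this
            out = [f(x) for x in out] if kind == "map" else [x for x in out if f(x)]
        return out

    def count(self):                  # action: forces evaluation
        return len(self._compute())

    def collect(self):                # action: returns the results
        return self._compute()

rdd = TinyRDD([1, 2, 3, 4]).map(lambda x: x * 10).filter(lambda x: x > 15)
print(rdd.collect())  # [20, 30, 40]
print(rdd.count())    # 3
```

Chaining `map` and `filter` above builds up the `ops` list without touching the data, mirroring how Spark defers execution until an action such as `count` or `collect`.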
## Practical Example - Log Mining
- RDD example for error log analysis:
```python
lines = spark.textFile("hdfs://…")                      # load the log file from HDFS
errors = lines.filter(lambda s: s.startswith("ERROR"))  # keep only error lines
messages = errors.map(lambda s: s.split(" ")[2])        # extract the message field
```
## Fault Recovery in Spark
- RDDs track their lineage (the sequence of transformations that produced them), allowing lost partitions to be recomputed efficiently rather than replicated.
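The recovery idea can be sketched in plain Python (names and data are illustrative, not Spark internals): a lost partition is rebuilt by replaying the recorded transformations on its source data.

```python
# Sketch of lineage-based recovery: the lineage is the ordered list of
# transformations; replaying it on the source data rebuilds a lost partition.
lineage = [
    lambda part: [s for s in part if s.startswith("ERROR")],  # filter step
    lambda part: [s.split(" ")[2] for s in part],             # map step
]

def compute(partition, lineage):
    for step in lineage:
        partition = step(partition)
    return partition

source_partition = ["ERROR disk full", "INFO ok", "ERROR net down"]
result = compute(source_partition, lineage)     # initial computation
result = None                                   # simulate losing the cached result
recovered = compute(source_partition, lineage)  # replay lineage to recover
print(recovered)  # ['full', 'down']
```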
## Key-Value Pair Operations
- Spark provides additional operations on RDDs of key-value pairs, enhancing data-processing flexibility.
- Example of a word count application using key-value pairs:
```python
counts = (lines.flatMap(lambda line: line.split(" "))
               .map(lambda word: (word, 1))
               .reduceByKey(lambda x, y: x + y))
```
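The same word-count semantics can be checked in plain Python on a small input (the sample lines are made up for illustration): `flatMap` corresponds to flattening the split lines, and the `(word, 1)` plus `reduceByKey` steps correspond to counting.

```python
from collections import Counter

# Plain-Python equivalent of the flatMap/map/reduceByKey word count,
# useful for checking the expected result on a small sample.
lines = ["to be or", "not to be"]
words = [w for line in lines for w in line.split(" ")]  # flatMap
counts = Counter(words)                                 # (word, 1) + reduceByKey
print(dict(counts))  # {'to': 2, 'be': 2, 'or': 1, 'not': 1}
```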
## Parallelism in Spark
- Spark allows the level of parallelism to be set per operation.
- By default, Spark uses hash partitioning, which can be overridden (e.g., with a custom partitioner) for specific use cases.
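The idea behind hash partitioning can be sketched in plain Python (Spark's actual hash function differs; this only illustrates the scheme): each key is assigned to partition `hash(key) % numPartitions`, so all pairs with the same key land together.

```python
# Sketch of hash partitioning: a key always maps to the same partition,
# so records sharing a key are co-located for key-based operations.
def partition_for(key, num_partitions):
    return hash(key) % num_partitions

pairs = [("apple", 1), ("banana", 1), ("apple", 1)]
num_partitions = 4
partitions = [[] for _ in range(num_partitions)]
for k, v in pairs:
    partitions[partition_for(k, num_partitions)].append((k, v))

# Both "apple" pairs end up in the same partition.
apple_partition = partition_for("apple", num_partitions)
assert all((k, v) in partitions[apple_partition] for k, v in pairs if k == "apple")
```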
## Broadcast Variables and Accumulators
- Broadcast Variables: ship a single read-only copy of a variable to each executor, rather than one copy per task.
- Accumulators: let executors add values that are aggregated back at the driver during processing.
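A minimal plain-Python sketch of accumulator semantics (the `SimpleAccumulator` class is hypothetical, not Spark's API): tasks only add to it, and the driver reads the final total.

```python
# Sketch of an accumulator: tasks add values during processing;
# only the driver reads the aggregated result at the end.
class SimpleAccumulator:
    def __init__(self, initial=0):
        self.value = initial

    def add(self, amount):
        self.value += amount

acc = SimpleAccumulator()
records = ["ok", "ERROR x", "ok", "ERROR y"]
for r in records:              # imagine each record handled by some task
    if r.startswith("ERROR"):
        acc.add(1)             # executor-side: add only, never read
print(acc.value)  # 2 (driver-side read of the total)
```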
## Performance Optimization Techniques
- Prefer `reduceByKey` for key-based reductions: it combines values within each partition before shuffling, reducing the data sent over the network.
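A plain-Python sketch of why this helps (the partition data is made up for illustration): pre-aggregating locally, as `reduceByKey`'s combiner does, shrinks the number of records that must be shuffled, while a `groupByKey`-style approach ships every raw pair.

```python
# Sketch of map-side combining: each partition pre-aggregates its own pairs,
# so fewer records cross the network before the final merge.
def local_combine(partition):
    acc = {}
    for k, v in partition:
        acc[k] = acc.get(k, 0) + v
    return list(acc.items())

partitions = [
    [("a", 1), ("a", 1), ("b", 1)],
    [("a", 1), ("b", 1), ("b", 1)],
]

shuffled_without = sum(len(p) for p in partitions)  # groupByKey-style: 6 records
combined = [local_combine(p) for p in partitions]
shuffled_with = sum(len(p) for p in combined)       # reduceByKey-style: 4 records

# The reducer-side merge gives the same totals either way.
final = {}
for p in combined:
    for k, v in p:
        final[k] = final.get(k, 0) + v
print(shuffled_without, shuffled_with, final)  # 6 4 {'a': 3, 'b': 3}
```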
## Summary of Operations
- Core operations on Spark RDDs include `map`, `filter`, `groupBy`, `join`, `reduceByKey`, `combineByKey`, and more.
- Understanding the correct use of operations can drastically improve the performance and readability of Spark applications.