CS431/451/631/651 - Module 3: From MapReduce to Spark Notes

Module Overview

  • Focuses on Data-Intensive Distributed Computing.
  • Covers programming and algorithm design aspects with a transition from MapReduce to Spark.

Data Center as a Computer

  • Key Concept: In the "data center as a computer" view, Hadoop plays the role of the instruction set for large-scale data processing.

Abstraction Layers

  • From highest to lowest:
    • Higher-Level Language (e.g. Python)
    • Lower-Level Language (e.g. C)
    • Assembly
    • Machine Code
    • Instruction Set Architecture
    • Electronics (Transistors)

Hadoop Overview

  • Hadoop Characteristics:
    • Requires extensive boilerplate code, even for simple jobs.
    • Programming tends to be tedious because mappers, reducers, and job configuration must all be written by hand.
    • This raises the question: can a higher-level abstraction simplify programming?

Possible Alternatives

  • SQL and Pig scripts are introduced as higher-level alternatives that lower the boilerplate burden of Hadoop.

Apache Pig

  • Pig simplifies processing by allowing the use of script-like syntax, making it more user-friendly than raw Hadoop.
  • Example use case in Pig: To find the top 10 most visited pages in each category from a dataset.
  • Basic Pig Script Example (the intermediate steps filled in here follow the standard Pig Latin top-URLs example):
    ```pig
    visits = load '/data/visits' as (user, url, time);
    gVisits = group visits by url;
    visitCounts = foreach gVisits generate url, count(visits);
    urlInfo = load '/data/urlInfo' as (url, category, pRank);
    visitCounts = join visitCounts by url, urlInfo by url;
    gCategories = group visitCounts by category;
    topUrls = foreach gCategories generate top(visitCounts,10);
    store topUrls into '/data/topUrls';
    ```
Hadoop 2.0 Enhancements

  • The transition to Hadoop 2.0 (with YARN separating resource management from MapReduce) lets developers build distributed applications other than MapReduce, enhancing versatility.

Introduction to Apache Spark

  • Spark is described as:
    • Efficient, with in-memory storage and faster execution.
    • Compatible with Hadoop (e.g. reading from HDFS), while requiring far fewer lines of code.
  • Performance Benchmark: Spark executes significantly faster than Hadoop MapReduce, especially on iterative workloads, because intermediate results can be kept in memory rather than written to disk between jobs.

Resilient Distributed Datasets (RDDs)

  • RDD Characteristics:
    • Immutable collections of objects partitioned across a cluster; they can be cached in RAM for fast reuse.
    • Built through parallel transformations and able to recover automatically from failures.
  • Operations:
    • Transformations (lazy), like map, filter, and groupBy.
    • Actions (which trigger execution), like count, collect, and save.
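  • Transformations are lazy: they describe a computation without running it, and an action forces evaluation. A rough pure-Python analogy (generators standing in for the lazy pipeline; illustrative only, not Spark code):

```python
# Pure-Python analogy: generators build a pipeline lazily, like Spark
# transformations; consuming them is the "action" that runs everything.
data = ["ERROR disk full", "INFO ok", "ERROR timeout"]

# "Transformations": nothing is computed yet.
errors = (line for line in data if line.startswith("ERROR"))
words = (line.split(" ")[1] for line in errors)

# "Action": pulls the data through the whole pipeline at once.
result = list(words)
```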

Practical Example - Log Mining

  • RDD example for error log analysis (sc is the SparkContext):

```python
lines = sc.textFile("hdfs://…")
errors = lines.filter(lambda s: s.startswith("ERROR"))
messages = errors.map(lambda s: s.split(" ")[2])
messages.cache()  # keep the derived data in memory for repeated queries
```

Fault Recovery in Spark

  • RDDs maintain their lineage (the chain of transformations that produced them), allowing lost partitions to be recomputed efficiently rather than replicated.
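  • To make the lineage idea concrete, here is a toy sketch (not Spark's API): each dataset records its parent and the transformation that derived it, so a lost result can be rebuilt from the source data:

```python
# Toy sketch of lineage-based recovery (illustrative only, not Spark's API):
# each "RDD" remembers its parent and the transformation used to derive it.
def make_rdd(parent, transform):
    return {"parent": parent, "transform": transform}

def recompute(rdd, source):
    # Walk the lineage back to the source, then replay each transformation.
    if rdd["parent"] is None:
        return source
    return rdd["transform"](recompute(rdd["parent"], source))

base = make_rdd(None, None)
errors = make_rdd(base, lambda d: [x for x in d if x.startswith("ERROR")])
codes = make_rdd(errors, lambda d: [x.split(" ")[1] for x in d])

# Simulate losing the computed result: rebuild it from lineage alone.
recovered = recompute(codes, ["ERROR 404", "INFO ok"])
```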

Key-Value Pair Operations

  • Spark provides additional operations on RDDs of key-value pairs (e.g. reduceByKey, join), adding flexibility for data processing.
  • Example of a word count application using key-value pairs:

```python
counts = lines.flatMap(lambda line: line.split(" ")) \
              .map(lambda word: (word, 1)) \
              .reduceByKey(lambda x, y: x + y)
```

Parallelism in Spark

  • Spark allows for setting the level of parallelism in operations.
  • By default Spark uses hash partitioning to assign keys to partitions, but this can be overridden with a custom partitioner for specific use cases.
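  • Hash partitioning can be sketched in plain Python: a record's partition is hash(key) % numPartitions, so all records with the same key land in the same partition (a sketch of the idea, not Spark's implementation):

```python
# Sketch of hash partitioning: partition index = hash(key) % num_partitions,
# which guarantees records sharing a key end up in the same partition.
num_partitions = 4
pairs = [("apple", 1), ("banana", 1), ("apple", 1), ("cherry", 1)]

partitions = [[] for _ in range(num_partitions)]
for key, value in pairs:
    partitions[hash(key) % num_partitions].append((key, value))

# Both ("apple", 1) records now sit in the same partition.
apple_partition = partitions[hash("apple") % num_partitions]
```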

Broadcast Variables and Accumulators

  • Broadcast Variables: Ship a single read-only copy of a variable to each executor, rather than one copy per task.
  • Accumulators: Allow executors to communicate values back to the driver, e.g. to add up counters during processing.
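  • A toy back-of-the-envelope illustration (hypothetical numbers) of why broadcasting helps: a variable captured in a task closure is serialized once per task, while a broadcast variable is shipped once per executor:

```python
# Toy illustration with hypothetical numbers: count how many copies of a
# large lookup table get serialized with and without broadcasting.
tasks_per_executor = {"exec-1": 100, "exec-2": 100, "exec-3": 100}

# Without broadcast: the closure ships one copy of the table per task.
copies_without_broadcast = sum(tasks_per_executor.values())

# With broadcast: one copy per executor, shared by all of its tasks.
copies_with_broadcast = len(tasks_per_executor)
```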

Performance Optimization Techniques

  • Use reduceByKey for key-based reductions: it combines values locally on the map side before the shuffle, which reduces network traffic and improves performance.
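  • The effect can be sketched in plain Python: within one partition, reduceByKey-style combining reduces values locally before anything is shuffled, whereas groupByKey-style grouping would ship every record (a sketch of the idea, not Spark internals):

```python
from collections import defaultdict

# One partition's records before the shuffle.
partition = [("a", 1), ("b", 1), ("a", 1), ("a", 1)]

# Map-side combine (what reduceByKey does locally): reduce per key first.
combined = defaultdict(int)
for key, value in partition:
    combined[key] += value

# groupByKey would shuffle all len(partition) records; after combining,
# only one record per distinct key crosses the network.
shuffled_records = len(combined)
```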

Summary of Operations

  • Various operations in Spark RDDs include:
    • map, filter, groupBy, join, reduceByKey, combineByKey, and more.
  • Understanding the correct use of operations can drastically improve the performance and readability of Spark applications.