CS431/451/631/651 - Module 3: From MapReduce to Spark Notes

Module Overview

  • Focuses on Data-Intensive Distributed Computing.
  • Covers programming and algorithm design aspects with a transition from MapReduce to Spark.

Data Center as a Computer

  • Key Concept: In the "data center as a computer" view, Hadoop plays the role of the instruction set for large-scale data processing.

Abstraction Layers

  • From highest to lowest:
    • Higher-Level Language (e.g. Python)
    • Lower-Level Language (e.g. C)
    • Assembly
    • Machine Code
    • Instruction Set Architecture
    • Electronics (Transistors)

Hadoop Overview

  • Hadoop Characteristics:
    • Requires extensive boilerplate code, even for simple jobs.
    • Programming tends to be tedious because mappers, reducers, and job configuration must all be written by hand.
    • This raises the question: can a higher-level abstraction simplify programming?

Possible Alternatives

  • SQL and Pig scripts are introduced as higher-level alternatives that lower the boilerplate burden of Hadoop.

Apache Pig

  • Pig simplifies processing by allowing the use of script-like syntax, making it more user-friendly than raw Hadoop.
  • Example use case in Pig: To find the top 10 most visited pages in each category from a dataset.
  • Basic Pig Script Example (the intermediate steps filled in here follow the standard Pig Latin top-URLs example):
    ```pig
    visits = load '/data/visits' as (user, url, time);
    gVisits = group visits by url;
    visitCounts = foreach gVisits generate url, count(visits);
    urlInfo = load '/data/urlInfo' as (url, category, pRank);
    visitCounts = join visitCounts by url, urlInfo by url;
    gCategories = group visitCounts by category;
    topUrls = foreach gCategories generate top(visitCounts,10);
    store topUrls into '/data/topUrls';
    ```
Hadoop 2.0 Enhancements

  • The transition to Hadoop 2.0 (with YARN separating resource management from MapReduce) lets developers build distributed applications other than MapReduce, enhancing versatility.

Introduction to Apache Spark

  • Spark is described as:
    • Efficient, with in-memory storage and faster execution.
    • Compatible with Hadoop (e.g. reading from HDFS), while requiring far fewer lines of code.
  • Performance Benchmark: Spark executes significantly faster than Hadoop MapReduce, especially on iterative workloads, because intermediate results can be kept in memory rather than written to disk between jobs.

Resilient Distributed Datasets (RDDs)

  • RDD Characteristics:
    • Immutable collections of objects partitioned across a cluster; they can be cached in RAM for fast reuse.
    • Built through parallel transformations and able to recover automatically from failures.
  • Operations:
    • Transformations (lazy), like map, filter, and groupBy.
    • Actions (which trigger execution), like count, collect, and save.
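  • Transformations are lazy: they describe a computation without running it, and an action forces evaluation. A rough pure-Python analogy (generators standing in for the lazy pipeline; illustrative only, not Spark code):

```python
# Pure-Python analogy: generators build a pipeline lazily, like Spark
# transformations; consuming them is the "action" that runs everything.
data = ["ERROR disk full", "INFO ok", "ERROR timeout"]

# "Transformations": nothing is computed yet.
errors = (line for line in data if line.startswith("ERROR"))
words = (line.split(" ")[1] for line in errors)

# "Action": pulls the data through the whole pipeline at once.
result = list(words)
```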

Practical Example - Log Mining

  • RDD example for error log analysis (sc is the SparkContext):

```python
lines = sc.textFile("hdfs://…")
errors = lines.filter(lambda s: s.startswith("ERROR"))
messages = errors.map(lambda s: s.split(" ")[2])
messages.cache()  # keep the derived data in memory for repeated queries
```

Fault Recovery in Spark

  • RDDs maintain their lineage (the chain of transformations that produced them), allowing lost partitions to be recomputed efficiently rather than replicated.
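  • To make the lineage idea concrete, here is a toy sketch (not Spark's API): each dataset records its parent and the transformation that derived it, so a lost result can be rebuilt from the source data:

```python
# Toy sketch of lineage-based recovery (illustrative only, not Spark's API):
# each "RDD" remembers its parent and the transformation used to derive it.
def make_rdd(parent, transform):
    return {"parent": parent, "transform": transform}

def recompute(rdd, source):
    # Walk the lineage back to the source, then replay each transformation.
    if rdd["parent"] is None:
        return source
    return rdd["transform"](recompute(rdd["parent"], source))

base = make_rdd(None, None)
errors = make_rdd(base, lambda d: [x for x in d if x.startswith("ERROR")])
codes = make_rdd(errors, lambda d: [x.split(" ")[1] for x in d])

# Simulate losing the computed result: rebuild it from lineage alone.
recovered = recompute(codes, ["ERROR 404", "INFO ok"])
```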

Key-Value Pair Operations

  • Spark provides additional operations on RDDs of key-value pairs (e.g. reduceByKey, join), adding flexibility for data processing.
  • Example of a word count application using key-value pairs:

```python
counts = lines.flatMap(lambda line: line.split(" ")) \
              .map(lambda word: (word, 1)) \
              .reduceByKey(lambda x, y: x + y)
```

Parallelism in Spark

  • Spark allows for setting the level of parallelism in operations.
  • By default Spark uses hash partitioning to assign keys to partitions, but this can be overridden with a custom partitioner for specific use cases.
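  • Hash partitioning can be sketched in plain Python: a record's partition is hash(key) % numPartitions, so all records with the same key land in the same partition (a sketch of the idea, not Spark's implementation):

```python
# Sketch of hash partitioning: partition index = hash(key) % num_partitions,
# which guarantees records sharing a key end up in the same partition.
num_partitions = 4
pairs = [("apple", 1), ("banana", 1), ("apple", 1), ("cherry", 1)]

partitions = [[] for _ in range(num_partitions)]
for key, value in pairs:
    partitions[hash(key) % num_partitions].append((key, value))

# Both ("apple", 1) records now sit in the same partition.
apple_partition = partitions[hash("apple") % num_partitions]
```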

Broadcast Variables and Accumulators

  • Broadcast Variables: Ship a single read-only copy of a variable to each executor, rather than one copy per task.
  • Accumulators: Allow executors to communicate values back to the driver, e.g. to add up counters during processing.
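  • A toy back-of-the-envelope illustration (hypothetical numbers) of why broadcasting helps: a variable captured in a task closure is serialized once per task, while a broadcast variable is shipped once per executor:

```python
# Toy illustration with hypothetical numbers: count how many copies of a
# large lookup table get serialized with and without broadcasting.
tasks_per_executor = {"exec-1": 100, "exec-2": 100, "exec-3": 100}

# Without broadcast: the closure ships one copy of the table per task.
copies_without_broadcast = sum(tasks_per_executor.values())

# With broadcast: one copy per executor, shared by all of its tasks.
copies_with_broadcast = len(tasks_per_executor)
```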

Performance Optimization Techniques

  • Use reduceByKey for key-based reductions: it combines values locally on the map side before the shuffle, which reduces network traffic and improves performance.
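  • The effect can be sketched in plain Python: within one partition, reduceByKey-style combining reduces values locally before anything is shuffled, whereas groupByKey-style grouping would ship every record (a sketch of the idea, not Spark internals):

```python
from collections import defaultdict

# One partition's records before the shuffle.
partition = [("a", 1), ("b", 1), ("a", 1), ("a", 1)]

# Map-side combine (what reduceByKey does locally): reduce per key first.
combined = defaultdict(int)
for key, value in partition:
    combined[key] += value

# groupByKey would shuffle all len(partition) records; after combining,
# only one record per distinct key crosses the network.
shuffled_records = len(combined)
```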

Summary of Operations

  • Various operations in Spark RDDs include:
    • map, filter, groupBy, join, reduceByKey, combineByKey, and more.
  • Understanding the correct use of operations can drastically improve the performance and readability of Spark applications.