L19: Distributed File System and MapReduce

Overview

Big Data refers to the large volumes of structured and unstructured data that inundate businesses on a daily basis. To efficiently process and analyze this data, innovative data structures, databases, and algorithms have been developed.

Key Concepts in Big Data

  • Data Structures for Big Data: Efficient data structures are essential for managing and processing large datasets.

  • Big Data Databases: Various databases are designed to handle big data, utilizing parallel and distributed processing methods to enhance performance.

  • Parallel Processing: This technique allows multiple processes to work on a task simultaneously, improving the speed of data processing.

  • Distributed Processing: Involves distributing data across multiple servers for simultaneous processing.

Goals and Frameworks of Big Data

  • Goal of Big Data: To develop generalizations or models that summarize complex datasets. This involves creating frameworks that facilitate the implementation of powerful algorithms and analyses.

  • Data Frameworks: Technologies such as Hadoop and Spark facilitate distributed computing.

  • Hadoop Distributed File System (HDFS): A distributed file system designed to run on commodity hardware, addressing big data storage challenges efficiently.

  • MapReduce: A programming model for processing large data sets with a distributed algorithm on a cluster.

  • Recommendation Systems: Use algorithms to suggest products or content to users based on their preferences and behaviors.

Big Data Workflow

The Big Data workflow encompasses various systems and tools designed to orchestrate the processing of data:

  • Systems: Hardware and software architectures that manage data storage and processing.

  • Statistical Methods: Techniques used to analyze and interpret data patterns.

  • Distributed Tools: Software that enables data processing and analysis across multiple machines.

Classical Approach to Data Processing

The classical approach to processing data typically involves a single computer architecture:

  • CPU: Central processing unit that performs computations.

  • Memory: Temporary storage for data currently in use (e.g., a machine with 64 GB of RAM).

  • Disk: Persistent storage that holds data long term.

Reading Data: Disk vs. Memory

  • Reading Times: Accessing data from disk is approximately 100,000 times slower than accessing it from main memory, emphasizing the need for efficient data handling methods.

  • IO-Bound Systems: Systems limited by disk input/output operations, which cause significant delays when processing large datasets.
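The gap between the two media can be made concrete with a back-of-the-envelope calculation. The latency figures below are illustrative order-of-magnitude assumptions, not measurements from any particular machine:

```python
# Illustrative latency figures (order-of-magnitude assumptions).
MEMORY_ACCESS_NS = 100        # ~100 ns per main-memory reference
DISK_SEEK_NS = 10_000_000     # ~10 ms per random disk seek

ratio = DISK_SEEK_NS / MEMORY_ACCESS_NS
print(f"A disk seek is ~{ratio:,.0f}x slower than a memory access")
# With these assumed figures, the ratio is 100,000x.
```

This is why algorithms that stream data sequentially from disk, rather than seeking randomly, dominate big data processing.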

Challenges in Cluster Computing

  • Node Failures: Traditional clusters may experience node failures, impacting the performance and reliability of data processing.

  • Network Bottlenecks: With typical throughput of 1 to 10 Gb/s, network speeds can limit overall data processing capabilities.

  • Ad-hoc Programming Difficulties: Distributed programming often involves complex setups, requiring robust programming systems that simplify task distribution.
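The network bottleneck above can also be quantified with a quick estimate. The calculation below assumes decimal units (1 TB = 10^12 bytes) and a fully utilized 1 Gb/s link, which is optimistic:

```python
# How long does it take to move 1 TB across a 1 Gb/s link?
data_bits = 10**12 * 8      # 1 TB in bits (decimal units assumed)
link_bps = 1 * 10**9        # 1 Gb/s, assuming full utilization
seconds = data_bits / link_bps
print(f"{seconds:.0f} s = {seconds / 3600:.1f} hours")
# With these assumptions: 8000 s, roughly 2.2 hours for a single terabyte.
```

Numbers like this motivate a core design principle of MapReduce: move computation to the data rather than data to the computation.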

Distributed Filesystems

Distributed storage is critical for scaling big data solutions:

  • Chunk Servers: Store data in chunks (usually 16-64MB), which are often replicated for fault tolerance.

  • Name Node: Stores metadata and coordinates data access across chunk servers, ensuring data is available and retrievable.
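The division of labor above can be sketched in code. The server names, chunk size, and round-robin placement policy below are illustrative assumptions for exposition, not HDFS's actual placement algorithm:

```python
import itertools

def place_chunks(file_size_bytes, chunk_size=64 * 2**20,
                 servers=("cs1", "cs2", "cs3", "cs4"), replication=3):
    """Split a file into fixed-size chunks and assign each chunk to
    `replication` distinct chunk servers (simple round-robin placement).
    The returned mapping is the kind of metadata a name node tracks:
    chunk index -> list of chunk servers holding a replica."""
    n_chunks = -(-file_size_bytes // chunk_size)  # ceiling division
    placement = {}
    ring = itertools.cycle(range(len(servers)))
    for chunk in range(n_chunks):
        start = next(ring)
        placement[chunk] = [servers[(start + r) % len(servers)]
                            for r in range(replication)]
    return placement

# A 200 MB file with 64 MB chunks needs 4 chunks, each stored on 3 servers.
meta = place_chunks(200 * 2**20)
```

Because each chunk lives on several servers, the loss of any single machine leaves every chunk still retrievable, which is the fault-tolerance property the section describes.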

MapReduce Overview

MapReduce is a programming model used for processing large data sets in a distributed manner:

  • Mapper Function: Processes input data and produces a set of intermediate key-value pairs.

  • Shuffle and Sort: Organizes intermediate data by key before passing it to the Reducer.

  • Reducer Function: Combines key-value pairs to produce final output.
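The three phases above can be captured in a minimal single-machine sketch. A real framework runs the same flow distributed across many workers; the function below is only a model of the data flow:

```python
from collections import defaultdict

def map_reduce(inputs, mapper, reducer):
    """Single-machine sketch of the MapReduce data flow:
    map -> shuffle and sort by key -> reduce."""
    # Map phase: each input record yields zero or more (key, value) pairs.
    intermediate = []
    for record in inputs:
        intermediate.extend(mapper(record))
    # Shuffle and sort: group all intermediate values by key.
    groups = defaultdict(list)
    for key, value in sorted(intermediate):
        groups[key].append(value)
    # Reduce phase: combine each key's values into the final output.
    return {key: reducer(key, values) for key, values in groups.items()}
```

For example, word counting uses a mapper that emits `(word, 1)` per occurrence and a reducer that sums the values for each word.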

Use Cases for MapReduce

  • Document Processing: MapReduce can efficiently handle tasks such as word counting or log analysis across large collections of files.

  • Performance Enhancements: Employing techniques like combiners for local aggregation can help optimize data processing pipelines by reducing intermediate data size.
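The effect of a combiner can be illustrated with word counting. The sketch below folds the local aggregation directly into the mapper for simplicity, rather than running it as a separate combiner stage as Hadoop does; the point is the reduction in intermediate pairs sent over the network:

```python
from collections import Counter

def mapper_no_combiner(line):
    # Emits one ("word", 1) pair per occurrence: lots of intermediate data.
    return [(word, 1) for word in line.split()]

def mapper_with_combiner(line):
    # Local aggregation: sum counts per word before the shuffle,
    # so each mapper emits at most one pair per distinct word.
    return list(Counter(line.split()).items())

line = "to be or not to be"
print(len(mapper_no_combiner(line)))    # 6 intermediate pairs
print(len(mapper_with_combiner(line)))  # 4 pairs after local aggregation
```

Both variants produce the same final counts under a summing reducer; the combiner only shrinks the data shuffled between the map and reduce phases.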

Example Applications

  • Word Count: MapReduce can count the frequency of words in a set of documents by mapping words as keys and their counts as values.

  • Employee Data Analysis: By applying MapReduce, organizations can identify top-performing employees or analyze salary distributions across departments.

  • Weather Data Aggregation: The framework can also be applied to analyze temperature records from weather stations, extracting yearly maximum temperatures.
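The weather example above can be sketched as a mapper and reducer pair. The record format (`"station,year,temperature"`) is an illustrative assumption, and the shuffle is simulated in-line:

```python
from collections import defaultdict

def weather_mapper(record):
    # A record like "stationA,1950,22" yields one (year, temperature) pair.
    _station, year, temp = record.split(",")
    return [(int(year), int(temp))]

def weather_reducer(year, temps):
    # Keep only the maximum temperature observed in each year.
    return max(temps)

records = ["s1,1950,22", "s2,1950,31", "s1,1951,-4", "s2,1951,18"]

# Map, then shuffle: group temperatures by year.
by_year = defaultdict(list)
for record in records:
    for year, temp in weather_mapper(record):
        by_year[year].append(temp)

# Reduce: yearly maxima.
result = {year: weather_reducer(year, temps) for year, temps in by_year.items()}
print(result)  # {1950: 31, 1951: 18}
```

The same pattern (group by a key, then aggregate) covers the employee examples too, e.g., mapping records to `(department, salary)` and reducing with a mean or maximum.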

Advantages and Limitations of MapReduce

  • Advantages:

    • Fault tolerance, scalability, and cost-effectiveness.

    • It can handle vast datasets across commodity hardware without shared memory.

  • Limitations:

    • Not suited to real-time or low-latency processing; the model is designed primarily for batch workloads.

    • Limited ability to cache intermediate data between jobs, which makes iterative algorithms expensive.