TOPIC 6

Definition: A distributed system is a group of independent computers, known as nodes, that work together as a single entity.
Characteristics:
  - Independent Nodes: Each computer operates independently while contributing to the overall system.
  - Communication: Nodes communicate over a network to share data, resources, and tasks.
  - Common Goal: The nodes coordinate to achieve a unified objective while providing a seamless experience to the user.
  - Fault Tolerance: The system's operation is not halted by the failure of a single node.
Example: An online shopping platform such as Amazon runs product search, payment processing, and order fulfillment on separate groups of servers.
  - Note: If one server fails, other servers can take over its work, so users can usually continue without interruption.

Real-World Examples:
  - Online Banking Systems: Banks record transactions in distributed systems shared among branches and ATMs, so service continues if one site is unavailable.
  - E-Commerce Platforms: Websites such as Amazon and Flipkart spread orders, payments, and inventory across multiple servers to handle high traffic.
  - Social Media Platforms: Services such as Facebook, Twitter, and Instagram store vast amounts of user data across several data centers, so no single site has to store or process all of it.
  - Online Gaming Systems: Multiplayer games (e.g., PUBG) distribute game state across multiple servers to support real-time interaction and reduce latency.

Functionality:
  - Tasks and data are divided among nodes, which process them in parallel over a network.
  - Each node runs its own application instance and maintains its local data.
  - Nodes interact through communication protocols or middleware services.
  - Each node processes its data and shares results with other nodes as needed.
  - Data may reside in distributed databases rather than in a single centralized system.
  - Coordination mechanisms keep the nodes consistent and allow the system to tolerate faults.
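
The flow above (divide the work, process locally, share results) can be illustrated with a minimal sketch. It assumes threads as stand-ins for nodes; the names `split_work`, `node_task`, and `run_distributed_sum` are hypothetical, and a real system would send each partition over a network instead.

```python
from concurrent.futures import ThreadPoolExecutor


def split_work(items, node_count):
    """Deal items out to node_count partitions in round-robin order."""
    return [items[i::node_count] for i in range(node_count)]


def node_task(partition):
    """Work done locally by one node: here, summing its own partition."""
    return sum(partition)


def run_distributed_sum(items, node_count=3):
    partitions = split_work(items, node_count)
    # Threads stand in for nodes (an assumption made for illustration only).
    with ThreadPoolExecutor(max_workers=node_count) as pool:
        partial_results = list(pool.map(node_task, partitions))
    # A coordinator combines the partial results shared by each node.
    return sum(partial_results)


total = run_distributed_sum(list(range(1, 11)))
print(total)
```

Each partition is processed independently, so adding nodes only changes `node_count`, not the task logic.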

Advantages:
  - Resource Sharing: Increases efficiency by allowing nodes to share data and resources, thereby reducing costs.
  - Scalability: Facilitates easy handling of growing workloads by adding nodes without major restructuring.
  - Reliability and Fault Tolerance: The system keeps operating when individual nodes fail.
  - Performance: Spreading the workload across multiple nodes increases processing speed.
Disadvantages:
  - Complexity: Designing and maintaining a distributed system is generally more complicated than designing and maintaining a centralized one.
  - Security Issues: More nodes and network links create a larger attack surface, which increases vulnerability to unauthorized access.
  - Network Dependency: Performance is heavily reliant on network quality, speed, and reliability.
  - Data Consistency: Maintaining data synchronization across nodes, especially in real-time applications, can be challenging.

Common Challenges:
  - Network Partitions: When communication between groups of nodes is disrupted, each group may keep operating on its own, leaving the data in inconsistent states (a split-brain scenario).
  - Replication and Consistency: Replicating data improves availability but makes it harder to keep copies consistent; choosing an explicit model (strong or eventual consistency) makes this trade-off deliberate.
  - Fault Tolerance: Systems need robust strategies to keep operating during node failures.
  - Concurrency and Coordination: Avoiding conflicts during concurrent access necessitates complex coordination protocols.
  - Scalability and Load Balancing: Efficient performance requires effective management of resource distribution among nodes as workload increases.
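
As one illustration of the load-balancing challenge, the sketch below assigns each incoming task to the node with the least accumulated load. The class `LeastLoadedBalancer`, the node names, and the task costs are hypothetical; real balancers also account for node health, capacity, and task completion.

```python
class LeastLoadedBalancer:
    """Assign each task to the node with the smallest load so far."""

    def __init__(self, node_names):
        self.load = {name: 0 for name in node_names}

    def assign(self, cost=1):
        # min() returns the first node among ties, in insertion order.
        node = min(self.load, key=self.load.get)
        self.load[node] += cost
        return node


balancer = LeastLoadedBalancer(["node-a", "node-b", "node-c"])
assignments = [balancer.assign(cost) for cost in [5, 1, 1, 1, 1]]
print(assignments)
print(balancer.load)
```

After the first expensive task lands on `node-a`, the remaining cheap tasks go to the other two nodes until their loads catch up.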

Key Principles:
  1. Decentralization:
     - Spreading control among nodes enhances reliability, as failure in one node minimally impacts the entire system.
     - Examples: Peer-to-peer networking and distributed consensus algorithms reinforce decentralization.
  2. Scalability:
     - The system must accommodate increasing workloads seamlessly.
     - Two types exist: Horizontal Scalability (adding more nodes) and Vertical Scalability (upgrading existing nodes).
  3. Fault Tolerance:
     - The system should detect and recover from failures efficiently to ensure continuous operation.
     - Techniques include data/task replication and resource redundancy.
  4. Consistency:
     - All components should see a uniform view of the data despite simultaneous operations; atomic transactions and locks are common ways to achieve this.
     - Different consistency models include strong, eventual, and causal consistency.
  5. Performance Optimization:
     - Speed and efficiency are improved through better data storage strategies and optimized communication protocols.
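
The data replication technique listed under Fault Tolerance can be sketched as follows. This is a simplified in-memory model, not a production design: the classes `Replica` and `ReplicatedStore` and the `alive` flag are assumptions made for illustration.

```python
class Replica:
    """One node holding a full copy of the data."""

    def __init__(self, name):
        self.name = name
        self.data = {}
        self.alive = True


class ReplicatedStore:
    """Write to every live replica; read from the first live one."""

    def __init__(self, replicas):
        self.replicas = replicas

    def put(self, key, value):
        live = [r for r in self.replicas if r.alive]
        if not live:
            raise RuntimeError("no live replicas")
        for replica in live:
            replica.data[key] = value
        return len(live)  # number of replicas that stored the write

    def get(self, key):
        for replica in self.replicas:
            if replica.alive and key in replica.data:
                return replica.data[key]
        raise KeyError(key)


replicas = [Replica("r1"), Replica("r2"), Replica("r3")]
store = ReplicatedStore(replicas)
acks = store.put("balance", 100)
replicas[0].alive = False  # simulate a node failure
value_after_failure = store.get("balance")
print(acks, value_after_failure)
```

The read still succeeds after `r1` fails because the other replicas hold a copy. Note that a replica that is down during a write misses that update, which is where the consistency models listed above come into play.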

Overview: MapReduce is the computational engine in Hadoop that processes massive datasets through a structured two-phase model:
- Map Phase: Breaks down datasets into smaller chunks and processes them, producing intermediate (key, value) pairs.
- Reduce Phase: Aggregates intermediate results to generate the final output.
Storage: The Hadoop Distributed File System (HDFS) stores the large data volumes that MapReduce processes.

Input File: sample.txt is stored in HDFS and divided into input splits (e.g., first.txt, second.txt). Each split is assigned to a Mapper.
Step 1: Input Splitting & Record Reader:
- The input file is segmented into splits, which are converted into (key, value) pairs via RecordReader.
- TextInputFormat (Default): The key is the byte offset at which a line starts in the file, and the value is the text of that line.
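
The (byte offset, line) pairing can be imitated in plain Python. This sketch is not Hadoop's RecordReader; the function `read_records` is hypothetical, and the second line of the sample split is made up for illustration.

```python
def read_records(data: bytes):
    """Yield (byte_offset, line_text) pairs, one per line in data."""
    offset = 0
    for raw_line in data.splitlines(keepends=True):
        # The key is where the line starts; the value drops the line ending.
        yield offset, raw_line.rstrip(b"\r\n").decode("utf-8")
        offset += len(raw_line)


split = b"Hello I am GeeksforGeeks\nHow can I help you\n"
records = list(read_records(split))
print(records)
```

The first line occupies 24 bytes plus a newline, so the second record starts at offset 25.
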
Step 2: Map Phase:
- Each Mapper processes its assigned (key, value) pairs and emits intermediate pairs (e.g., from (0, "Hello I am GeeksforGeeks"), a word-count Mapper emits (Hello, 1), (I, 1), (am, 1), and (GeeksforGeeks, 1)).
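
A word-count map step can be sketched as a plain Python function. The name `map_words` is hypothetical and this is not the Hadoop Mapper API; it only mirrors the idea of emitting one (word, 1) pair per word.

```python
def map_words(offset, line):
    """Emit a (word, 1) pair for every whitespace-separated word.

    The offset key is accepted but unused, as is typical for word count.
    """
    return [(word, 1) for word in line.split()]


pairs = map_words(0, "Hello I am GeeksforGeeks")
print(pairs)
```
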
Step 3: Shuffling and Sorting:
- Outputs from the Mapper are grouped by identical keys (e.g., all values for "How" aggregated as (How, [1,1])) prior to the reduce phase.
- Keys are sorted so that each Reducer receives them in order and can process them sequentially.
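
Grouping and sorting can be sketched in a few lines. In Hadoop this happens across machines inside the framework; the function `shuffle_and_sort` and the sample pairs below are assumptions used only to show the resulting data shape.

```python
from collections import defaultdict


def shuffle_and_sort(pairs):
    """Group values by key, then return the groups sorted by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return sorted(grouped.items())


mapper_output = [("How", 1), ("can", 1), ("I", 1), ("How", 1), ("are", 1)]
grouped = shuffle_and_sort(mapper_output)
print(grouped)
```

Python compares strings by code point, so capitalized keys such as "How" sort before lowercase keys such as "are".
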
Step 4: Reduce Phase:
  - The Reducer combines the values for each key (e.g., for (How, [1,1]), the output becomes (How, 2)).
  - The final output, stored in result.output, lists each word with its total count, e.g.,
    - data - 5
    - science - 2
    - big - 2
    - fun - 1
    - powerful - 1
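
The reduce step for word count amounts to summing each key's list of values. The sketch below uses a hypothetical `reduce_counts` function and hand-written grouped input; it is not the Hadoop Reducer API.

```python
def reduce_counts(key, values):
    """Collapse all values for one key into a single total."""
    return key, sum(values)


grouped_input = [("How", [1, 1]), ("data", [1, 1, 1, 1, 1]), ("science", [1, 1])]
result = [reduce_counts(key, values) for key, values in grouped_input]
for word, count in result:
    print(f"{word} - {count}")
```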