TOPIC 6

Distributed Problem Solving

Introduction to Distributed Systems

  • Definition: A distributed system is a group of independent computers, known as nodes, that work together as a single entity.
  • Characteristics:
      - Independent Nodes: Each computer operates independently while contributing to the overall system.
      - Communication: Nodes communicate over a network to share data, resources, and tasks.
      - Common Goal: The nodes coordinate to achieve a unified objective while providing a seamless experience to the user.
      - Fault Tolerance: A well-designed system keeps operating when a single node fails.
  • Example: An online shopping platform like Amazon uses distributed systems to manage multiple servers that handle separate tasks such as product searches, payment processing, and order fulfillment.
      - Note: If one server fails, other servers take over its work, so users can usually continue without interruption.

Real-World Examples of Distributed Systems

  • Online Banking Systems: Banks leverage distributed systems where transactions are recorded and shared among branches and ATMs, facilitating uninterrupted service.
  • E-Commerce Platforms: Websites like Amazon or Flipkart utilize multiple servers to manage orders, payments, and inventory to accommodate high traffic efficiently.
  • Social Media Platforms: Services such as Facebook, Twitter, and Instagram store vast amounts of user data across several data centers to optimize processing.
  • Online Gaming Systems: Multiplayer games (e.g., PUBG) distribute game states across various servers to optimize real-time interaction and reduce latency.

Types of Distributed Systems

  • Distributed systems can be categorized based on organizational structure, communication methods, and task distribution:
      1. Client-Server Systems:
         - A central server provides services while multiple clients request them, e.g., Gmail, online banking systems.
         - Characteristics:
           - Server manages data and processing.
           - Clients send requests and receive responses.
      2. Peer-to-Peer (P2P) Systems:
         - No central authority exists; all nodes can act as both clients and servers, e.g., BitTorrent, blockchain networks.
         - Characteristics:
           - Direct sharing of resources among nodes.
           - Equal standing of all nodes in function.
      3. Clustered Systems:
         - A group of closely linked computers working together as one system for higher performance and reliability, e.g., Google search clusters, supercomputers.
      4. Cloud-Based Distributed Systems:
         - Uses cloud infrastructure to distribute computing resources across data centers, e.g., AWS, Microsoft Azure, Google Cloud.
         - Characteristics:
           - On-demand scalability.
           - Services accessible over the internet.

Working of Distributed Systems

  • Functionality:
      - Tasks are divided among nodes, which work collaboratively over a network.
      - Each node runs its own instance of the application and maintains its own local data.
      - Communication protocols or middleware services facilitate node interaction.
      - Tasks and data are distributed across nodes for parallel processing.
      - Each node processes its data and shares results with others as necessary.
      - Data may reside in distributed databases rather than a single centralized system.
      - The system ensures coordination, consistency, and fault tolerance.
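
The division of work described above can be sketched in a few lines of Python. This is a minimal illustration, not a real distributed system: threads stand in for networked nodes, and the names node_task and distribute are hypothetical helpers introduced only for this example.

```python
from concurrent.futures import ThreadPoolExecutor


def node_task(chunk):
    # Each hypothetical "node" processes only its own local chunk of data.
    return sum(chunk)


def distribute(data, num_nodes):
    # Split the data into num_nodes chunks, one per node.
    chunks = [data[i::num_nodes] for i in range(num_nodes)]
    # Assumption: threads stand in for machines connected by a network.
    with ThreadPoolExecutor(max_workers=num_nodes) as pool:
        partial_results = list(pool.map(node_task, chunks))
    # A coordinator combines the partial results into the final answer.
    return sum(partial_results)


total = distribute(list(range(1, 101)), num_nodes=4)
print(total)  # 5050
```

In a real deployment the chunks would travel over the network and the coordinator would also have to handle slow or failed nodes.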

Advantages and Disadvantages of Distributed Systems

  • Advantages:
      - Resource Sharing: Increases efficiency by allowing nodes to share data and resources, thereby reducing costs.
      - Scalability: Facilitates easy handling of growing workloads by adding nodes without major restructuring.
      - Reliability and Fault Tolerance: Ensures operation continuity even with node failures.
      - Performance: Enhanced processing speed as workloads are managed across multiple nodes.

  • Disadvantages:
      - Complexity: Designing and maintaining a distributed system is generally more complicated than doing so for a centralized system.
      - Security Issues: More nodes and network links mean a larger attack surface, which increases exposure to unauthorized access.
      - Network Dependency: Performance is heavily reliant on network quality, speed, and reliability.
      - Data Consistency: Maintaining data synchronization across nodes, especially in real-time applications, can be challenging.

Common Problems in Distributed Systems and Their Solutions

  • Common Challenges:
      - Network Partitions: Communication disruptions can lead to inconsistent data states (split-brain scenarios).
      - Replication and Consistency: Balancing high availability with data consistency is difficult; choosing an explicit consistency model (strong or eventual) that matches the application's needs makes the trade-off deliberate.
      - Fault Tolerance: Systems must embody robust strategies to sustain operations during node failures.
      - Concurrency and Coordination: Avoiding conflicts during concurrent access necessitates complex coordination protocols.
      - Scalability and Load Balancing: Efficient performance requires effective management of resource distribution among nodes as workload increases.
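
As one possible illustration of load balancing combined with basic fault tolerance, the sketch below hands out requests in round-robin order and skips nodes that have been marked as failed. The class RoundRobinBalancer and its method names are assumptions made for this example; how failures are detected (for instance, by health checks) is outside its scope.

```python
from itertools import cycle


class RoundRobinBalancer:
    """Hypothetical balancer: rotates over nodes and skips failed ones."""

    def __init__(self, nodes):
        self.nodes = list(nodes)
        self.healthy = set(self.nodes)
        self._ring = cycle(self.nodes)

    def mark_down(self, node):
        # Called when a node is considered failed.
        self.healthy.discard(node)

    def mark_up(self, node):
        # Called when a known node has recovered.
        if node in self.nodes:
            self.healthy.add(node)

    def next_node(self):
        # Look at each node at most once per call, so the loop always ends.
        for _ in range(len(self.nodes)):
            node = next(self._ring)
            if node in self.healthy:
                return node
        raise RuntimeError("no healthy nodes available")


balancer = RoundRobinBalancer(["a", "b", "c"])
balancer.mark_down("b")
print([balancer.next_node() for _ in range(4)])  # ['a', 'c', 'a', 'c']
```

Round-robin is only one policy; others, such as choosing the least-loaded node, follow the same structure with a different selection rule.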

Design Principles for Distributed Systems

  • Key Principles:
      1. Decentralization:
         - Spreading control among nodes enhances reliability, because the failure of one node has only a limited impact on the rest of the system.
         - Examples: Peer-to-peer networking and distributed consensus algorithms reinforce decentralization.
      2. Scalability:
         - The system must accommodate increasing workloads seamlessly.
         - Two types exist: Horizontal Scalability (adding more nodes) and Vertical Scalability (upgrading existing nodes).
      3. Fault Tolerance:
         - The system should detect and recover from failures efficiently to ensure continuous operation.
         - Techniques include data/task replication and resource redundancy.
      4. Consistency:
         - All nodes should see the same data despite simultaneous operations; methods such as atomic transactions and locks help enforce this.
         - Different consistency models include strong, eventual, and causal consistency.
      5. Performance Optimization:
         - Enhancing speed and efficiency through improved data storage strategies and optimized communication protocols.

MapReduce in Hadoop

  • Overview: MapReduce is the computational engine in Hadoop that processes massive datasets through a structured two-phase model:
      - Map Phase: Breaks down datasets into smaller chunks and processes them, producing intermediate (key, value) pairs.
      - Reduce Phase: Aggregates intermediate results to generate the final output.
  • Storage: Hadoop Distributed File System (HDFS) is utilized for storing large data volumes.

How MapReduce Works

  • Input File: sample.txt is stored in HDFS and divided into input splits (e.g., first.txt, second.txt). Each split is assigned to a Mapper.

  • Step 1: Input Splitting & Record Reader:
      - The input file is segmented into splits, which are converted into (key, value) pairs via RecordReader.
      - TextInputFormat (Default): The key is the byte offset at which the line starts in the file, and the value is the content of that line.
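
The (byte offset, line) records described in Step 1 can be imitated in plain Python. This is only a sketch of the idea, not Hadoop's RecordReader; the function text_records is hypothetical, and the second input line is an assumed sample.

```python
def text_records(data: bytes):
    """Yield (byte_offset, line) pairs, similar in spirit to TextInputFormat."""
    offset = 0
    for raw_line in data.splitlines(keepends=True):
        # The key is where the line starts; the value is the line without its newline.
        yield offset, raw_line.rstrip(b"\r\n").decode("utf-8")
        offset += len(raw_line)


split = b"Hello I am GeeksforGeeks\nHow can I help you\n"
for key, value in text_records(split):
    print(key, value)
# 0 Hello I am GeeksforGeeks
# 25 How can I help you
```

The second key is 25 because the first line occupies 24 bytes plus one newline byte.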

  • Step 2: Map Phase:
      - Each Mapper processes its assigned (key, value) pairs and produces intermediate outputs (e.g., from (0, "Hello I am GeeksforGeeks"), it outputs (Hello, 1), (I, 1), (am, 1), and (GeeksforGeeks, 1)).
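
A word-count mapper for Step 2 could look like the sketch below. It is written in plain Python rather than against the Hadoop API, and map_line is a hypothetical name; splitting on whitespace is an assumption about how words are separated.

```python
def map_line(offset, line):
    """Emit (word, 1) for every word; the byte offset key is not needed here."""
    for word in line.split():
        yield word, 1


print(list(map_line(0, "Hello I am GeeksforGeeks")))
# [('Hello', 1), ('I', 1), ('am', 1), ('GeeksforGeeks', 1)]
```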

  • Step 3: Shuffling and Sorting:
      - Outputs from all Mappers are grouped by key (e.g., all values for "How" are collected as (How, [1,1])) before the reduce phase.
      - The keys are sorted so that each Reducer receives its keys in order, one key with its full list of values at a time.
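
Grouping and sorting in Step 3 can be sketched as follows. In Hadoop this work is done by the framework across the network; the function shuffle_and_sort below is a hypothetical single-machine stand-in, and the input pairs are assumed sample mapper output.

```python
from collections import defaultdict


def shuffle_and_sort(pairs):
    """Group values by key, then return the groups ordered by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


mapped = [("How", 1), ("can", 1), ("How", 1)]
print(shuffle_and_sort(mapped))
# [('How', [1, 1]), ('can', [1])]
```

"How" sorts before "can" here because Python's default string ordering places uppercase letters before lowercase ones.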

  • Step 4: Reduce Phase:
      - The Reducer combines the values for each key, here by summing them (e.g., for (How, [1,1]), the output becomes (How, 2)).
      - The final output, stored in result.output, lists each word with its total number of occurrences, for example:
        - data - 5
        - science - 2
        - big - 2
        - fun - 1
        - powerful - 1
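
The reduce step for word counting can be sketched as a function that sums the list of values for one key. The name reduce_counts and the grouped input are assumptions for illustration; the input mirrors the (How, [1,1]) example from Step 3.

```python
def reduce_counts(key, values):
    """Combine all values for one key into a single total."""
    return key, sum(values)


grouped = [("How", [1, 1]), ("can", [1])]
results = [reduce_counts(key, values) for key, values in grouped]
for word, count in results:
    print(f"{word} - {count}")
# How - 2
# can - 1
```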