TOPIC 6
Distributed Problem Solving
Introduction to Distributed Systems
- Definition: A distributed system is a group of independent computers, known as nodes, that work together as a single entity.
- Characteristics:
- Independent Nodes: Each computer operates independently while contributing to the overall system.
- Communication: Nodes communicate over a network to share data, resources, and tasks.
- Common Goal: The nodes coordinate to achieve a unified objective while providing a seamless experience to the user.
- Fault Tolerance: The failure of a single node does not halt the system's operation.
- Example: An online shopping platform like Amazon uses distributed systems to manage multiple servers handling diverse tasks such as product searches, payment processing, and order fulfillment.
- Note: Even if one server experiences failure, users can continue their activities unimpeded.
Real-World Examples of Distributed Systems
- Online Banking Systems: Banks leverage distributed systems where transactions are recorded and shared among branches and ATMs, facilitating uninterrupted service.
- E-Commerce Platforms: Websites like Amazon or Flipkart utilize multiple servers to manage orders, payments, and inventory to accommodate high traffic efficiently.
- Social Media Platforms: Services such as Facebook, Twitter, and Instagram store vast amounts of user data across several data centers to optimize processing.
- Online Gaming Systems: Multiplayer games (e.g., PUBG) distribute game states across various servers to optimize real-time interaction and reduce latency.
Types of Distributed Systems
- Distributed systems can be categorized based on organizational structure, communication methods, and task distribution:
1. Client-Server Systems:
- A central server provides services while multiple clients request them, e.g., Gmail, online banking systems.
- Characteristics:
- Server manages data and processing.
- Clients send requests and receive responses.
2. Peer-to-Peer (P2P) Systems:
- No central authority exists; all nodes can act as both clients and servers, e.g., BitTorrent, blockchain networks.
- Characteristics:
- Direct sharing of resources among nodes.
- Equal standing of all nodes in function.
3. Clustered Systems:
- A group of closely linked computers working together as one system for higher performance and reliability, e.g., Google search clusters, supercomputers.
4. Cloud-Based Distributed Systems:
- Uses cloud infrastructure to distribute computing resources across data centers, e.g., AWS, Microsoft Azure, Google Cloud.
- Characteristics:
- On-demand scalability.
- Services accessible over the internet.
Working of Distributed Systems
- Functionality:
- Tasks are divided among nodes, which work collaboratively over a network.
- Each node runs its own instance of the application and maintains its own local data.
- Communication protocols or middleware services facilitate node interaction.
- Tasks and data are distributed across nodes for parallel processing.
- Each node processes its data and shares results with others as necessary.
- Data may reside in distributed databases rather than a single centralized system.
- The system ensures coordination, consistency, and fault tolerance.
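The divide, process, and combine flow described above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: threads in one process stand in for nodes, and the names split_into_chunks, node_task, and run_distributed are hypothetical helpers, not part of any real framework.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(data, num_nodes):
    """Divide data into at most num_nodes contiguous chunks of similar size."""
    chunk_size = max(1, -(-len(data) // num_nodes))  # ceiling division, never 0
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]


def node_task(chunk):
    """Work done by one simulated node: sum the squares of its local data."""
    return sum(x * x for x in chunk)


def run_distributed(data, num_nodes=4):
    """Distribute chunks to worker threads and combine the partial results."""
    chunks = split_into_chunks(data, num_nodes)
    with ThreadPoolExecutor(max_workers=num_nodes) as pool:
        partial_results = list(pool.map(node_task, chunks))
    return sum(partial_results)


if __name__ == "__main__":
    print(run_distributed(list(range(1, 11))))  # prints 385
```

In a real distributed system the chunks would travel over a network and a node could fail mid-task, so the coordinator would also need retries and timeouts, which this sketch leaves out.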
Advantages and Disadvantages of Distributed Systems
Advantages:
- Resource Sharing: Increases efficiency by allowing nodes to share data and resources, thereby reducing costs.
- Scalability: Facilitates easy handling of growing workloads by adding nodes without major restructuring.
- Reliability and Fault Tolerance: Ensures operation continuity even with node failures.
- Performance: Processing is faster because workloads are spread across multiple nodes.
Disadvantages:
- Complexity: Designing and maintaining distributed systems are generally more complicated than centralized systems.
- Security Issues: Larger attack surfaces due to multiple nodes increase vulnerability to unauthorized access.
- Network Dependency: Performance is heavily reliant on network quality, speed, and reliability.
- Data Consistency: Maintaining data synchronization across nodes, especially in real-time applications, can be challenging.
Common Problems in Distributed Systems and Their Solutions
- Common Challenges:
- Network Partitions: Communication disruptions can lead to inconsistent data states (split-brain scenarios).
- Replication and Consistency: Balancing high availability with data consistency is difficult; choosing an explicit consistency model (strong or eventual) makes the trade-off a deliberate one.
- Fault Tolerance: Systems need strategies such as replication and redundancy to keep operating during node failures.
- Concurrency and Coordination: Avoiding conflicts during concurrent access necessitates complex coordination protocols.
- Scalability and Load Balancing: Efficient performance requires effective management of resource distribution among nodes as workload increases.
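As one illustration of the load-balancing and fault-tolerance challenges above, the following sketch hands out requests in round-robin order and skips nodes marked as failed. The class RoundRobinBalancer and its methods are hypothetical names chosen for this example; real load balancers also use health checks, weights, and other policies not shown here.

```python
class RoundRobinBalancer:
    """Cycle through nodes in order, skipping any that are marked as down."""

    def __init__(self, nodes):
        self.nodes = list(nodes)
        self.healthy = set(self.nodes)
        self._next = 0  # index of the next node to try

    def mark_down(self, node):
        """Record that a node has failed and should not receive requests."""
        self.healthy.discard(node)

    def mark_up(self, node):
        """Record that a known node has recovered."""
        if node in self.nodes:
            self.healthy.add(node)

    def pick(self):
        """Return the next healthy node, trying each node at most once."""
        for _ in range(len(self.nodes)):
            node = self.nodes[self._next]
            self._next = (self._next + 1) % len(self.nodes)
            if node in self.healthy:
                return node
        raise RuntimeError("no healthy nodes available")


if __name__ == "__main__":
    lb = RoundRobinBalancer(["node-a", "node-b", "node-c"])
    lb.mark_down("node-b")
    print([lb.pick() for _ in range(4)])
```

Round-robin is the simplest policy; it assumes all requests cost about the same, which is why production systems often prefer least-loaded or weighted selection.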
Design Principles for Distributed Systems
- Key Principles:
1. Decentralization:
- Spreading control among nodes enhances reliability, as failure in one node minimally impacts the entire system.
- Examples: Peer-to-peer networking and distributed consensus algorithms reinforce decentralization.
2. Scalability:
- The system must accommodate increasing workloads seamlessly.
- Two types exist: Horizontal Scalability (adding more nodes) and Vertical Scalability (upgrading existing nodes).
3. Fault Tolerance:
- The system should detect and recover from failures efficiently to ensure continuous operation.
- Techniques include data/task replication and resource redundancy.
4. Consistency:
- All nodes should present a uniform view of the data despite simultaneous operations; methods such as atomic transactions and locks help enforce this.
- Different consistency models include strong, eventual, and causal consistency.
5. Performance Optimization:
- Enhancing speed and efficiency through improved data storage strategies and optimized communication protocols.
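The interplay between replication (fault tolerance) and consistency can be seen in a small in-memory sketch. Replica and ReplicatedStore are hypothetical classes written for this illustration, and the policy used (write to every live replica, read from the first live replica that has the key) is an assumed simplification, not a description of any particular database.

```python
class Replica:
    """One copy of the data, which may be temporarily unavailable."""

    def __init__(self, name):
        self.name = name
        self.data = {}
        self.alive = True


class ReplicatedStore:
    """Key-value store that keeps a copy of each value on several replicas."""

    def __init__(self, replica_names):
        self.replicas = [Replica(name) for name in replica_names]

    def put(self, key, value):
        """Write to every live replica and return how many accepted the write."""
        written = 0
        for replica in self.replicas:
            if replica.alive:
                replica.data[key] = value
                written += 1
        if written == 0:
            raise RuntimeError("no live replicas")
        return written

    def get(self, key):
        """Read from the first live replica that holds the key."""
        for replica in self.replicas:
            if replica.alive and key in replica.data:
                return replica.data[key]
        raise KeyError(key)


if __name__ == "__main__":
    store = ReplicatedStore(["r1", "r2", "r3"])
    store.put("balance", 100)
    store.replicas[0].alive = False   # r1 fails
    print(store.get("balance"))       # still readable from r2
```

Note the consistency gap: a replica that was down during a write returns stale data once it comes back, unless the system resynchronizes it. Closing that gap is what stronger consistency models are for.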
MapReduce in Hadoop
- Overview: MapReduce is the computational engine in Hadoop that processes massive datasets through a structured two-phase model:
- Map Phase: Breaks down datasets into smaller chunks and processes them, producing intermediate (key, value) pairs.
- Reduce Phase: Aggregates intermediate results to generate the final output.
- Storage: Hadoop Distributed File System (HDFS) is used to store large data volumes.
How MapReduce Works
Input File:
sample.txt is stored in HDFS and divided into input splits (e.g., first.txt, second.txt). Each split is assigned to a Mapper.
Step 1: Input Splitting & Record Reader:
- The input file is segmented into splits, which are converted into (key, value) pairs via RecordReader.
- TextInputFormat (Default): The key is the byte offset of the line within the file, and the value is the content of that line.
Step 2: Map Phase:
- Each Mapper processes its assigned (key, value) pairs and produces intermediate outputs (e.g., from (0, "Hello I am GeeksforGeeks"), it outputs (Hello, 1), (I, 1), (am, 1), and (GeeksforGeeks, 1)).
Step 3: Shuffling and Sorting:
- Outputs from the Mapper are grouped by identical keys (e.g., all values for "How" aggregated as (How, [1,1])) prior to the reduce phase.
- Sorting orders the keys so that the Reducer can process them sequentially.
Step 4: Reduce Phase:
- The Reducer sums the values for each key (e.g., for (How, [1,1]), the output is (How, 2)).
- The final output, stored in result.output, lists each word with its total count, e.g.,
- data - 5
- science - 2
- big - 2
- fun - 1
- powerful - 1
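The four steps can be simulated in plain Python to make the data flow concrete. This is a single-process sketch of the word-count idea, not Hadoop's actual API: the function names record_reader, mapper, shuffle_and_sort, reducer, and word_count are hypothetical, and the sample input (apart from its first line, which comes from the example in Step 2) is made up for illustration.

```python
from collections import defaultdict


def record_reader(text):
    """Step 1: yield (byte_offset, line) pairs, mimicking TextInputFormat."""
    offset = 0
    for line in text.splitlines(keepends=True):
        yield offset, line.rstrip("\r\n")
        offset += len(line.encode("utf-8"))


def mapper(offset, line):
    """Step 2: emit (word, 1) for every word in the line; the offset is unused."""
    for word in line.split():
        yield word, 1


def shuffle_and_sort(pairs):
    """Step 3: group values by key, then sort the groups by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return sorted(grouped.items())


def reducer(key, values):
    """Step 4: sum the counts collected for one key."""
    return key, sum(values)


def word_count(text):
    """Run all four steps in sequence and return a list of (word, count)."""
    intermediate = [
        pair
        for offset, line in record_reader(text)
        for pair in mapper(offset, line)
    ]
    return [reducer(key, values) for key, values in shuffle_and_sort(intermediate)]


if __name__ == "__main__":
    sample = "Hello I am GeeksforGeeks\nHow can I help you\nHow are you\n"
    for word, count in word_count(sample):
        print(word, "-", count)
```

In Hadoop, many Mappers and Reducers run these same steps in parallel on different nodes, and the shuffle moves data across the network; here everything runs sequentially so the intermediate pairs are easy to inspect.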