Comprehensive Summary of Distributed Systems and Cloud Computing
Computational Mathematics and Data Analytics (Distributed Systems and the Cloud)
Prof. Álvaro Wong (Computer Architecture & Operating Systems department)
Agenda:
What is a Distributed System?
Characterization of Distributed Systems
Challenges
Metrics
Paxos Algorithm
What is a Distributed System?
Definition:
A distributed system is one in which hardware or software components located at networked computers communicate and coordinate their actions only by passing messages (G. Coulouris).
This definition applies across all systems where networked computers are deployed.
Alternate definition: A collection of computers that appears to users as a single coherent system (A. S. Tanenbaum).
Significant Challenges:
Whichever definition is used, building a distributed system raises significant challenges that must be addressed.
Characterization of Distributed Systems
Main Characteristics
Concurrency:
Multiple components access a shared resource simultaneously.
Hardware examples: printers, disks.
Software examples: files, databases, data objects.
Absence of a Global Clock:
Requires timing mechanisms for coordination/synchronization (logical time algorithms).
Handling Independent Failures:
Failures can arise from:
Network isolation.
Machine crashes (hardware).
Abnormal program termination (software).
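To illustrate the "logical time algorithms" mentioned under Absence of a Global Clock, here is a minimal sketch of a Lamport logical clock, one well-known algorithm of this kind. The class and method names are illustrative assumptions, not taken from the course material.

```python
class LamportClock:
    """Logical clock: a counter that orders events without a shared physical clock."""

    def __init__(self):
        self.time = 0

    def tick(self):
        # Local event: advance the counter.
        self.time += 1
        return self.time

    def send(self):
        # Sending is itself an event; the returned timestamp travels with the message.
        return self.tick()

    def receive(self, msg_time):
        # Jump past the sender's timestamp so the receive is ordered after the send.
        self.time = max(self.time, msg_time) + 1
        return self.time


a, b = LamportClock(), LamportClock()
a.tick()                      # a is now at 1
stamp = a.send()              # a is now at 2; the message carries 2
b.tick()                      # b is now at 1
received = b.receive(stamp)   # b becomes max(1, 2) + 1 = 3
print(stamp, received)
```

The receive timestamp is always larger than the send timestamp, which is the ordering guarantee the nodes need in the absence of a global clock.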
Challenges in Distributed Systems
Heterogeneity:
Coordination of diverse components and resources differing in architecture, operating systems, hardware, networks, communication protocols, and data models.
Openness:
Ability to extend and reimplement the system. Open systems depend on uniform communication mechanisms and well-defined interfaces/APIs.
Security:
Protection of valuable information from unauthorized access, ensuring:
Confidentiality: Prevent unauthorized reading of data.
Integrity: Prevent unauthorized modifications.
Availability: Ensure reliable access to resources.
Scalability:
Capacity to handle increasing workloads by resource scaling or adding nodes, maintaining performance without degrading latency, availability, or consistency.
Types of scalability:
Vertical: Increase the server’s capacity.
Horizontal: Add new servers.
Elasticity: Automatic scaling of resources in real time as demand changes (a feature of cloud platforms).
Failure Handling:
Maintain system functionality in the event of errors.
Types of issues:
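Fault: A defect or abnormal condition in hardware, software, or the network that can cause an error.
Error: An erroneous system state caused by a fault.
Failure: A component or system deviates from its specification and stops delivering its service.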
Concurrency:
Simultaneous task execution.
Approaches to concurrency control include optimistic and pessimistic methods.
Transparency:
Hiding the distribution of resources from users and applications, typically through middleware, so that the intricacies of managing remote resources are not exposed.
Distributed System Metrics
Availability
Definition: Availability (A) is the percentage of time a system is accessible for use:
$$A = \frac{\text{uptime}}{\text{total time}} \times 100$$
Cloud system standard: High availability (99.999%).
Calculation of availability in a cloud scenario:
Total time in a month: 30 days × 24 hours/day × 60 minutes/hour = 43,200 minutes.
If downtime is 5 minutes, uptime is 43,200 - 5 = 43,195 minutes.
Availability Calculation:
$$A = \frac{43{,}195}{43{,}200} \times 100 = 99.9884\%$$
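The monthly availability arithmetic above can be checked with a few lines of Python. This is a small sketch; the function name `availability_percent` is an illustrative assumption. It also shows how little downtime a 99.999% target leaves in a 30-day month.

```python
def availability_percent(total_minutes, downtime_minutes):
    """Availability as a percentage: uptime divided by total time, times 100."""
    uptime = total_minutes - downtime_minutes
    return uptime / total_minutes * 100


total = 30 * 24 * 60                      # 43,200 minutes in a 30-day month
monthly = availability_percent(total, 5)  # 5 minutes of downtime
print(round(monthly, 4))                  # 99.9884

# Downtime budget (in minutes) that a 99.999% target leaves per month.
budget = total * (1 - 0.99999)
print(round(budget, 3))                   # 0.432
```

With 5 minutes of downtime the system falls short of "five nines", which allows under half a minute of downtime per month.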
Factors Affecting Availability
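Server Failures: If each server fails independently with probability p (e.g., 5% failure probability, p = 0.05), the service is unavailable only when all replicas fail at once:
$$A = 1 - p^n$$ (for n replicated servers)
Example for n = 2:
$$A = 1 - 0.05^2 = 0.9975 = 99.75\%$$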
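The replication example above (1 - 0.05^2 for n = 2) generalizes to 1 - p^n under the assumption that servers fail independently. The sketch below reproduces the example and searches for the smallest number of replicas that reaches a target; the function names are illustrative assumptions.

```python
def replicated_availability(p, n):
    """Probability that at least one of n servers is up, assuming independent failures."""
    return 1 - p ** n


def replicas_needed(p, target):
    """Smallest n for which 1 - p**n reaches the target availability."""
    n = 1
    while replicated_availability(p, n) < target:
        n += 1
    return n


two = replicated_availability(0.05, 2)
print(round(two, 4))                      # 0.9975
print(replicas_needed(0.05, 0.99999))     # 4
```

In practice failures are often correlated (shared power, network, or software bugs), so the real figure is usually lower than this model predicts.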
Network Partitions or Disconnections:
Can arise from intentional (e.g., BitTorrent) or unintentional causes.
Reliability Metrics
Mean Time Between Failures (MTBF): Average time between consecutive failures; a measure of how frequently failures occur.
Mean Time To Detect (MTTD): Average time to detect errors.
Mean Time To Resolve (MTTR): Average time to fix issues once detected.
Mean Time To Failure (MTTF): Average operational time before failure occurs.
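As a rough illustration of how these four metrics relate, the sketch below computes them from a small, entirely hypothetical incident log. Exact definitions vary between organizations; this example assumes MTBF is measured between the starts of consecutive failures and MTTF between a restore and the next failure.

```python
from statistics import mean

# Hypothetical incident log, in minutes since monitoring began:
# (failure started, failure detected, service restored)
incidents = [(100, 104, 110), (310, 312, 320), (530, 536, 540)]

# MTTD: average delay between a failure and its detection.
mttd = mean([detected - failed for failed, detected, _ in incidents])

# MTTR: average time to resolve an issue once it has been detected.
mttr = mean([restored - detected for _, detected, restored in incidents])

pairs = list(zip(incidents, incidents[1:]))

# MTTF: average operational time from a restore until the next failure.
mttf = mean([nxt[0] - cur[2] for cur, nxt in pairs])

# MTBF: average time between the starts of consecutive failures.
mtbf = mean([nxt[0] - cur[0] for cur, nxt in pairs])

print(mttd, mttr, mttf, mtbf)
```

Under these assumptions MTBF equals MTTF plus the average downtime (detection plus resolution) of the incidents in between.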
Coordination: Consensus Algorithms
Paxos Algorithm Overview
Paxos Algorithm: A distributed consensus algorithm proposed by Lamport to agree on a single value despite node or network failures.
Roles in Paxos:
Proposer: Initiates proposals.
Acceptor: Votes on proposals.
Learner: Receives the agreed-upon value.
Uses: Consistent replication, leader election, transaction coordination.
Paxos Algorithm Phases
Phase 1: Prepare
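Prepare (1a): A proposer chooses a unique proposal number N and sends a prepare(N) message to the acceptors.
Promise (1b): If N is greater than every proposal number the acceptor has already promised, it replies with promise(N, [v]), where the optional v is the value of the highest-numbered proposal it has accepted so far, if any. Otherwise it ignores or rejects the request.
Phase 2: Accept
Accept Request and Propose (2a): If the proposer receives promises from a majority of acceptors, it sends accept(N, v), where v is the value of the highest-numbered accepted proposal reported in those promises, or the proposer's own value if none was reported.
Accepted (2b): If an acceptor has not promised a higher proposal number in the meantime, it accepts (N, v) and notifies the learners.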
Paxos Algorithm Summary
Phase 1a: Prepare and send prepare(N) request.
Phase 1b: Promise not to accept proposals numbered lower than N.
Phase 2: Accept or reject proposals based on previous promises.
Example Scenarios in Paxos Algorithm
Case 1: Proposer initiates consensus without prior values.
Case 2: Prior accepted values influence new proposals.
Case 3: Lower proposal numbers are rejected to avoid overwriting ongoing consensus.
Case 4: Higher proposal numbers can take control, but they must propose the value of the highest-numbered proposal already accepted, if there is one.
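The two phases and the four cases above can be sketched as a single-process simulation. This is a simplified illustration, not a production implementation: messages are plain function calls, nothing is lost or delayed, learners are omitted, and the names `Acceptor` and `propose` are assumptions made for this example.

```python
class Acceptor:
    def __init__(self):
        self.promised = 0      # highest proposal number promised so far
        self.accepted = None   # (number, value) of the last accepted proposal

    def prepare(self, n):
        # Phase 1b: promise only if n is higher than anything promised before.
        if n > self.promised:
            self.promised = n
            return ("promise", self.accepted)
        return ("reject", None)

    def accept(self, n, value):
        # Phase 2b: accept unless a higher-numbered proposal has been promised.
        if n >= self.promised:
            self.promised = n
            self.accepted = (n, value)
            return True
        return False


def propose(acceptors, n, value):
    """Run both phases for proposal number n; return the chosen value or None."""
    majority = len(acceptors) // 2 + 1

    # Phase 1a/1b: collect promises.
    replies = [a.prepare(n) for a in acceptors]
    promises = [acc for kind, acc in replies if kind == "promise"]
    if len(promises) < majority:
        return None

    # If any acceptor already accepted a value, adopt the highest-numbered one.
    prior = [acc for acc in promises if acc is not None]
    if prior:
        value = max(prior, key=lambda p: p[0])[1]

    # Phase 2a/2b: ask the acceptors to accept (n, value).
    votes = sum(a.accept(n, value) for a in acceptors)
    return value if votes >= majority else None


acceptors = [Acceptor() for _ in range(3)]
first = propose(acceptors, 2, "A")   # Case 1: no prior values, "A" is chosen
stale = propose(acceptors, 1, "B")   # Case 3: lower number is rejected
later = propose(acceptors, 3, "C")   # Cases 2 and 4: must re-propose "A"
print(first, stale, later)           # A None A
```

The third call shows why consensus is stable: even though the proposer wanted "C", the promises it received forced it to carry forward the already accepted "A".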
Resource Management Techniques
Multi-Paxos Algorithm
Definition: An extension of Paxos for handling multiple values efficiently.
Leader election: Phase 1 is run once to establish a stable leader, instead of once per consensus instance (log slot).
Multiple proposals: The leader proposes values for multiple log slots.
Handling Failures and Fault Tolerance
Replication: Maintaining copies across multiple servers to enhance durability and availability.
Active vs. Passive Replication:
Active: Requests sent to all nodes concurrently.
Passive: Requests routed through a primary node before replication.
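The difference between the two replication styles can be sketched in a few lines. This is a deliberately simplified, sequential illustration with hypothetical names (`Replica`, `active_write`, `passive_write`); a real system would send requests over the network, concurrently, and handle replica failures.

```python
class Replica:
    def __init__(self):
        self.data = {}
        self.requests_seen = 0   # client requests this replica handled directly

    def apply(self, key, value):
        self.data[key] = value


def active_write(replicas, key, value):
    # Active: the client request goes to every replica, and each one processes it.
    for replica in replicas:
        replica.requests_seen += 1
        replica.apply(key, value)


def passive_write(primary, backups, key, value):
    # Passive: only the primary handles the client request,
    # then forwards the resulting update to the backups.
    primary.requests_seen += 1
    primary.apply(key, value)
    for backup in backups:
        backup.apply(key, value)


active = [Replica() for _ in range(3)]
active_write(active, "x", 1)

primary, backups = Replica(), [Replica(), Replica()]
passive_write(primary, backups, "x", 1)

print([r.requests_seen for r in active])                # [1, 1, 1]
print([r.requests_seen for r in [primary] + backups])   # [1, 0, 0]
```

In both cases every replica ends up with the same data; what differs is which nodes process the client request.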
Techniques for Improving Availability
Redundancy: Regularly maintained copies of key data/services.
Monitoring: Continuous health checks of system operations to preempt failures.
Rollback Capabilities: Reverting to a previous functional state to minimize the disruption caused by failures.
Analysis of Concurrency Approaches
Optimistic Concurrency Control:
Assumes conflicts are rare: transactions proceed without locking and are validated at commit time, retrying if a conflict is detected.
Pessimistic Concurrency Control:
Locks resources to ensure data integrity during transactions and avoid race conditions.
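The two approaches can be contrasted with a small in-memory sketch. The classes `VersionedRecord` and `LockedRecord` are hypothetical names for this example: the first validates a version number at commit time (optimistic), the second holds a lock for the whole read-modify-write (pessimistic).

```python
import threading


class VersionedRecord:
    """Optimistic: read a version, do the work, commit only if the version is unchanged."""

    def __init__(self, value):
        self.value = value
        self.version = 0
        self._guard = threading.Lock()  # protects only the brief validate-and-write step

    def read(self):
        with self._guard:
            return self.value, self.version

    def commit(self, new_value, expected_version):
        with self._guard:
            if self.version != expected_version:
                return False            # conflict found at validation; caller should retry
            self.value = new_value
            self.version += 1
            return True


class LockedRecord:
    """Pessimistic: hold the lock for the entire read-modify-write."""

    def __init__(self, value):
        self.value = value
        self.lock = threading.Lock()

    def update(self, fn):
        with self.lock:
            self.value = fn(self.value)
            return self.value


rec = VersionedRecord(10)
value, version = rec.read()
first_ok = rec.commit(value + 1, version)    # True: nobody changed the record
second_ok = rec.commit(value + 5, version)   # False: the version is now stale

locked = LockedRecord(10)
result = locked.update(lambda v: v + 1)      # 11
print(first_ok, second_ok, rec.value, result)
```

Optimistic control avoids waiting when conflicts are rare but wastes work on retries when they are common; pessimistic control does the opposite.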
Networking Concepts in Distributed Systems
IP Addressing and Classes
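IP Address Format: An IPv4 address is four numbers (each 0-255) separated by dots, e.g., 192.168.21.76.
Classes: Historically, IPv4 addresses were divided into five classes (A-E) by their leading bits. Classes A, B, and C differ in how many bits identify the network vs. the host; class D is used for multicast and class E is reserved.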
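As an illustration, the historical class of an IPv4 address can be read off its first octet. The sketch below uses Python's standard `ipaddress` module for parsing; the function name `ipv4_class` is an assumption for this example.

```python
import ipaddress


def ipv4_class(address):
    """Return the historical class (A-E) of a dotted-quad IPv4 address."""
    first_octet = int(ipaddress.IPv4Address(address)) >> 24
    if first_octet < 128:
        return "A"   # leading bit 0
    if first_octet < 192:
        return "B"   # leading bits 10
    if first_octet < 224:
        return "C"   # leading bits 110
    if first_octet < 240:
        return "D"   # leading bits 1110 (multicast)
    return "E"       # leading bits 1111 (reserved)


print(ipv4_class("192.168.21.76"))   # C
```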
CIDR and Subnetting Techniques
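Classless Inter-Domain Routing (CIDR): Describes networks and subnetworks with an address plus a prefix length (e.g., /24) instead of fixed classes, which allows address blocks of arbitrary size.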
Subnet Masks: Determine which part of the IP address is dedicated to the network vs. host.
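The relationship between a CIDR prefix, its subnet mask, and the network/host split can be explored with Python's standard `ipaddress` module. The /24 prefix attached to the example address here is an assumption chosen for illustration.

```python
import ipaddress

# The example address from above, with an assumed /24 prefix.
iface = ipaddress.ip_interface("192.168.21.76/24")
net = iface.network

print(net)                 # 192.168.21.0/24 (the network part)
print(net.netmask)         # 255.255.255.0   (the equivalent subnet mask)
print(net.num_addresses)   # 256             (addresses covered by the 8 host bits)

# Subnetting: borrow 2 host bits to split the /24 into four /26 subnets.
subnets = list(net.subnets(prefixlen_diff=2))
containing = [s for s in subnets if iface.ip in s]
print(containing[0])       # 192.168.21.64/26
```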
Security Implementation and Firewall Concepts
Security Groups: Act as virtual firewalls that control traffic at the instance level.
Network ACLs: Add an additional security layer with explicit allow/deny rules.
Security Features: Ensuring user-defined security policies are enforced at multiple network levels.
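The layering of the two mechanisms can be modeled with a toy rule evaluator. This is a simplified sketch, not an AWS API: it considers only destination ports, ignores statefulness and direction, and uses made-up rule sets. It assumes security groups contain allow rules only, while network ACL rules are numbered and evaluated in order with the first match deciding.

```python
def security_group_allows(rules, port):
    # Allow-only rules: traffic passes if any rule matches, otherwise it is denied.
    return any(low <= port <= high for low, high in rules)


def nacl_allows(rules, port):
    # Ordered allow/deny rules: the lowest-numbered matching rule decides.
    for _, (low, high), action in sorted(rules, key=lambda r: r[0]):
        if low <= port <= high:
            return action == "allow"
    return False   # nothing matched: implicit deny


# Hypothetical rule sets for illustration.
sg_rules = [(22, 22), (443, 443)]
nacl_rules = [(100, (22, 22), "deny"), (200, (0, 65535), "allow")]


def reaches_instance(port):
    # Traffic must pass both layers to reach the instance.
    return nacl_allows(nacl_rules, port) and security_group_allows(sg_rules, port)


print(reaches_instance(443), reaches_instance(22), reaches_instance(80))
```

Port 443 passes both layers, port 22 is stopped by the explicit deny in the network ACL, and port 80 is stopped because no security group rule allows it.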
Storage in Cloud Computing
Overview of Cloud Storage Services
Service Types: Block, File, Object storage methods within cloud infrastructure.
Instance Store vs. EBS vs. S3: Practical applications, advantages, and limitations of each storage method.
Amazon S3 Glacier: Long-Term Storage Solutions
Storage Classes: Various S3 classes based on access frequency, retrieval time, and associated costs (Standard, Intelligent-Tiering, Archival).
Optimizing Distributed File Systems
Strategies: Local caching, tiered storage, load balancing, and erasure coding for enhanced performance and cost savings.
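Of the strategies listed, erasure coding is the least self-explanatory, so here is a minimal sketch of its simplest form: a single XOR parity block that allows any one lost data block to be rebuilt. Real systems typically use stronger codes such as Reed-Solomon; the function name and sample blocks are assumptions for this example, and all blocks are assumed to have equal length.

```python
def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    result = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            result[i] ^= byte
    return bytes(result)


# Three data blocks stored on different nodes, plus one parity block.
data = [b"node", b"data", b"blob"]
parity = xor_blocks(data)

# Suppose the node holding data[1] is lost: rebuild it from the survivors and the parity.
rebuilt = xor_blocks([data[0], data[2], parity])
print(rebuilt)   # b'data'
```

Storing one parity block costs about 33% extra space here, compared with 200% for keeping two additional full replicas, which is the cost saving erasure coding is used for.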
Amazon EC2 Storage Options
Details on types, applications, and functional purposes of Elastic Block Store (EBS), Elastic File System (EFS), and their operational characteristics.
Conclusion
Distributed databases: Introduction to the functions and efficiencies in data storage and relations.
AWS Database Services: Various service offerings like Amazon RDS and DynamoDB with attributes and use cases detailed.
Sharding and Data Conditioning: For enhancing operational efficiency in large-scale data environments, covering technical strategies and their implementations via example.
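As a small illustration of sharding, the sketch below routes each record key to one of several shards by hashing the key. Hash-based routing is only one possible strategy (range-based and directory-based sharding are others), and the function name and keys are assumptions for this example.

```python
import hashlib


def shard_for(key, num_shards):
    """Map a record key to a shard index using a stable hash of the key."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


# Hypothetical keys spread over 4 shards.
keys = ["user:1", "user:2", "user:3", "order:17"]
placement = {key: shard_for(key, 4) for key in keys}
print(placement)
```

A stable hash such as SHA-256 is used instead of Python's built-in `hash()`, which is randomized per process for strings and would route the same key differently after a restart. Note that changing `num_shards` remaps most keys, which is why production systems often use consistent hashing instead.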