Comprehensive Summary of Distributed Systems and Cloud Computing

Matemática Computacional y Analítica de Datos (Sistemas Distribuidos y la Nube)

Page 1-2

Prof. Álvaro Wong (Computer Architecture & Operating Systems department)
Agenda:
- What is a Distributed System?
- Characterization of Distributed Systems
- Challenges
- Metrics
- Paxos Algorithm

What is a Distributed System?

Definition:
- A distributed system is a system in which the hardware and software components of a networked computer communicate and coordinate via message passing (G. Coulouris).
- This definition applies across all systems where networked computers are deployed.
- Alternate definition: A collection of computers that appears to users as a single coherent system (A. S. Tanenbaum).
Significant Challenges:
- All definitions of distributed systems present significant challenges that need to be addressed.

Characterization of Distributed Systems

Main Characteristics

Concurrency:
- Multiple components access a shared resource simultaneously.
- Hardware examples: printers, disks.
- Software examples: files, databases, data objects.
Absence of a Global Clock:
- Requires timing mechanisms for coordination/synchronization (logical time algorithms).
Handling Independent Failures:
- Failures can arise from:
- Network isolation.
- Machine crashes (hardware).
- Abnormal program termination (software).

Challenges in Distributed Systems

Heterogeneity:
- Coordination of diverse components and resources differing in architecture, operating systems, hardware, networks, communication protocols, and data models.
Openness:
- Ability to extend and reimplement the system. Open systems depend on uniform communication mechanisms and well-defined interfaces/APIs.
Security:
- Protection of valuable information from unauthorized access, ensuring:
- Confidentiality: Prevent unauthorized reading of data.
- Integrity: Prevent unauthorized modifications.
- Availability: Ensure reliable access to resources.
Scalability:
- Capacity to handle increasing workloads by resource scaling or adding nodes, maintaining performance without degrading latency, availability, or consistency.
- Types of scalability:
- Vertical: Increase the server’s capacity.
- Horizontal: Add new servers.
- Elasticity: Automatically scaling in real-time (related to cloud platforms).
Failure Handling:
- Maintain system functionality in the event of errors.
- Types of issues:
- Fault: Deviation from actual specifications.
- Error: Erroneous system state from faults.
- Failure: Component failure that stops it from working.
Concurrency:
- Simultaneous task execution.
- Approaches to concurrency control include optimistic and pessimistic methods.
Transparency:
- Middleware that simplifies user experience by hiding remote resource management intricacies.

Distributed System Metrics

Availability

Definition: Availability (A) is the percentage of time a system is accessible for use:
$A = rac{ ext{Uptime}}{ ext{Total Time}} imes 100$
Cloud system standard: High availability (99.999%).
Calculation of availability in a cloud scenario:
- Total time in a month: 30 days × 24 hours/day × 60 minutes/hour = 43,200 minutes.
- If downtime is 5 minutes, uptime is:
  $ext{Uptime} = 43,200 - 5 = 43,195 ext{ minutes}$
- Availability Calculation:
  A = rac{43,195}{43,200} imes 100 = 99.9884 ext{%}

Factors Affecting Availability

Server Failures: (e.g., 5% failure probability, p = 0.05)
- $ext{Availability} = 1 - p^n$ (for n replicated servers)
- Example for n = 2:
  1 - 0.05^2 = 0.9975 = 99.75 ext{%}
Network Partitions or Disconnections:
- Can arise from intentional (e.g., BitTorrent) or unintentional causes.

Reliability Metrics

Mean Time Between Failures (MTBF): Frequency metric of failure.
Mean Time To Detect (MTTD): Average time to detect errors.
Mean Time To Resolve (MTTR): Average time to fix issues once detected.
Mean Time To Failure (MTTF): Average operational time before failure occurs.

Coordination: Consensus Algorithms

Paxos Algorithm Overview

Paxos Algorithm: A distributed consensus algorithm proposed by Lamport to agree on a single value despite node or network failures.
Roles in Paxos:
- Proposer: Initiates proposals.
- Acceptor: Votes on proposals.
- Learner: Receives the agreed-upon value.
Uses: Consistent replication, leader election, transaction coordination.

Paxos Algorithm Phases

Phase 1: Prepare

Prepare (1a): Proposer sends a prepare(N) message with the proposal number N.
Promise (1b): If the proposal number is greater than previously known numbers, acceptors send a promise(N, [v]) message, where v is the value of the greatest numbered proposal accepted.

Phase 2: Accept

Accept Request and Propose (2a): If a proposer receives enough promises, it selects a value v to propose.
Accepted (2b): If an acceptor has not committed to a higher proposal, it accepts (N, v) and notifies learners.

Paxos Algorithm Summary

Phase 1a: Prepare and send prepare(N) request.
Phase 1b: Promise to not accept proposals lower than N.
Phase 2: Accept or reject proposals based on previous promises.

Example Scenarios in Paxos Algorithm

Case 1: Proposer initiates consensus without prior values.
Case 2: Prior accepted values influence new proposals.
Case 3: Lower proposal numbers are rejected to avoid overwriting ongoing consensus.
Case 4: Higher proposal numbers can take control, but they must propose the last accepted value.

Resource Management Techniques

Multi-Paxos Algorithm

Definition: An extension of Paxos for handling multiple values efficiently.
Leader election: Conducted once per instance.
Multiple proposals: The leader proposes values for multiple log slots.

Handling Failures and Fault Tolerance

Replication: Maintaining copies across multiple servers to enhance durability and availability.
Active vs. Passive Replication:
- Active: Requests sent to all nodes concurrently.
- Passive: Requests routed through a primary node before replication.

Techniques for Improving Availability

Redundancy: Regularly maintained copies of key data/services.
Monitoring: Continuous health checks of system operations to preempt failures.
Rollback Capabilities: Effective reverting to previous functional states to minimize disruption occurred by failures.

Analysis of Concurrency Approaches

Optimistic Concurrency Control:
- Assumes few conflicts occur, validating at completion.
Pessimistic Concurrency Control:
- Locks resources to ensure data integrity during transactions and avoid race conditions.

Networking Concepts in Distributed Systems

IP Addressing and Classes

IP Address Format: Four numbers separated by dots (0-255), e.g., 192.168.21.76.
Classes: Categorized into five distinct classes (A-E) based on size and usage, focusing on network vs. host identification.

CIDR and Subnetting Techniques

Classless Inter-domain Routing (CIDR): Describes networks/subnetworks efficiently.
Subnet Masks: Determine which part of the IP address is dedicated to the network vs. host.

Security Implementation and Firewall Concepts

Security Groups: Acts as virtual firewalls to control traffic at the instance level.
Network ACLs: Adds additional security layers with explicit allow/deny rules.
Security Features: Ensuring user-defined security policies are enforced at multiple network levels.

Storage in Cloud Computing

Overview of Cloud Storage Services

Service Types: Block, File, Object storage methods within cloud infrastructure.
Instance Store vs. EBS vs. S3: Practical applications, advantages, and limitations of each storage method.

Amazon S3 Glacier: Long Term Storage Solutions

Storage Classes: Various S3 classes based on access frequency, retrieval time, and associated costs (Standard, Intelligent-Tiering, Archival).

Optimizing Distributed File Systems

Strategies: Local caching, tiered storage, load balancing, and erasure coding for enhanced performance and cost savings.

Amazon EC2 Storage Options

Details on types, applications, and functional purposes of Elastic Block Store (EBS), Elastic File System (EFS), and their operational characteristics.

Conclusion

Distributed databases: Introduction to the functions and efficiencies in data storage and relations.
AWS Database Services: Various service offerings like Amazon RDS and DynamoDB with attributes and use cases detailed.
Sharding and Data Conditioning: For enhancing operational efficiency in large-scale data environments, covering technical strategies and their implementations via example.