Comprehensive Summary of Distributed Systems and Cloud Computing

Computational Mathematics and Data Analytics (Distributed Systems and the Cloud)

  • Prof. Álvaro Wong (Computer Architecture & Operating Systems department)

  • Agenda:

    • What is a Distributed System?

    • Characterization of Distributed Systems

    • Challenges

    • Metrics

    • Paxos Algorithm

What is a Distributed System?

  • Definition:

    • A distributed system is one in which hardware or software components located at networked computers communicate and coordinate their actions only by passing messages (G. Coulouris).

    • This definition applies across all systems where networked computers are deployed.

    • Alternate definition: A collection of computers that appears to users as a single coherent system (A. S. Tanenbaum).

  • Significant Challenges:

    • Whichever definition is used, building such a system raises significant challenges that must be addressed.

Characterization of Distributed Systems

Main Characteristics
  • Concurrency:

    • Multiple components access a shared resource simultaneously.

    • Hardware examples: printers, disks.

    • Software examples: files, databases, data objects.

  • Absence of a Global Clock:

    • Requires timing mechanisms for coordination/synchronization (logical time algorithms).

  • Handling Independent Failures:

    • Failures can arise from:

    • Network isolation.

    • Machine crashes (hardware).

    • Abnormal program termination (software).
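The notes mention logical time algorithms without naming one. As a minimal sketch, assuming Lamport's logical clock is a representative choice, the following shows how two nodes can order events with plain counters instead of a shared wall clock. The class and method names are illustrative, not taken from the course material.

```python
class LamportClock:
    """Logical clock: a counter that orders events without a global clock."""

    def __init__(self) -> None:
        self.time = 0

    def tick(self) -> int:
        """Local event: advance the counter by one."""
        self.time += 1
        return self.time

    def send(self) -> int:
        """Sending is an event; the returned timestamp travels with the message."""
        return self.tick()

    def receive(self, msg_time: int) -> int:
        """On receipt, move past both the local time and the sender's timestamp."""
        self.time = max(self.time, msg_time) + 1
        return self.time


a, b = LamportClock(), LamportClock()
a.tick()            # local event on node a
stamp = a.send()    # a sends a message carrying its timestamp
b.receive(stamp)    # b's clock jumps ahead of the message timestamp
print(a.time, b.time)
```

The receive rule guarantees that a message is always timestamped as received after it was sent, which is the ordering property coordination protocols rely on.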

Challenges in Distributed Systems
  • Heterogeneity:

    • Coordination of diverse components and resources differing in architecture, operating systems, hardware, networks, communication protocols, and data models.

  • Openness:

    • Ability to extend and reimplement the system. Open systems depend on uniform communication mechanisms and well-defined interfaces/APIs.

  • Security:

    • Protection of valuable information from unauthorized access, ensuring:

    • Confidentiality: Prevent unauthorized reading of data.

    • Integrity: Prevent unauthorized modifications.

    • Availability: Ensure reliable access to resources.

  • Scalability:

    • Capacity to handle increasing workloads by resource scaling or adding nodes, maintaining performance without degrading latency, availability, or consistency.

    • Types of scalability:

    • Vertical: Increase the server’s capacity.

    • Horizontal: Add new servers.

    • Elasticity: Automatic scaling up or down in real time as demand changes (typical of cloud platforms).

  • Failure Handling:

    • Maintain system functionality in the event of errors.

    • Types of issues:

    • Fault: A defect in a component, or a deviation from its specification.

    • Error: An erroneous system state caused by a fault.

    • Failure: The component stops delivering its specified service.

  • Concurrency:

    • Simultaneous task execution.

    • Approaches to concurrency control include optimistic and pessimistic methods.

  • Transparency:

    • Hiding the intricacies of remote resource management from users and applications, typically through middleware, so that the system is simpler to use.

Distributed System Metrics

Availability
  • Definition: Availability (A) is the percentage of time a system is accessible for use:
    $A = \frac{\text{Uptime}}{\text{Total Time}} \times 100$

  • Cloud system standard: High availability (99.999%).

  • Calculation of availability in a cloud scenario:

    • Total time in a month: 30 days × 24 hours/day × 60 minutes/hour = 43,200 minutes.

    • If downtime is 5 minutes, uptime is:
      $\text{Uptime} = 43{,}200 - 5 = 43{,}195 \text{ minutes}$

    • Availability Calculation:
      $A = \frac{43{,}195}{43{,}200} \times 100 = 99.9884\%$
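The monthly availability arithmetic above can be checked with a short script. This is a sketch only: the function names are hypothetical, and the 30-day month is the same assumption the notes make.

```python
def availability_percent(total_minutes: float, downtime_minutes: float) -> float:
    """Availability A = uptime / total time * 100."""
    uptime = total_minutes - downtime_minutes
    return uptime / total_minutes * 100


def max_downtime_minutes(total_minutes: float, target_percent: float) -> float:
    """Largest downtime that still meets a target availability percentage."""
    return total_minutes * (1 - target_percent / 100)


MINUTES_PER_MONTH = 30 * 24 * 60  # 43,200 minutes in a 30-day month

print(round(availability_percent(MINUTES_PER_MONTH, 5), 4))
print(round(max_downtime_minutes(MINUTES_PER_MONTH, 99.999), 3))
```

Under these assumptions, 5 minutes of downtime gives roughly 99.9884%, which falls short of the 99.999% target; that target allows only about 0.432 minutes (roughly 26 seconds) of downtime per 30-day month.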

Factors Affecting Availability
  • Server Failures: (e.g., 5% failure probability, p = 0.05)

    • $\text{Availability} = 1 - p^n$ (for n replicated servers)

    • Example for n = 2:
      $1 - 0.05^2 = 0.9975 = 99.75\%$

  • Network Partitions or Disconnections:

    • Can arise from intentional (e.g., BitTorrent) or unintentional causes.
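The replication formula 1 - p^n can be explored numerically. The sketch below assumes that replicas fail independently and that the service is available as long as at least one replica is up; the notes do not state these assumptions explicitly, and the function names are hypothetical.

```python
def replicated_availability(p_fail: float, n: int) -> float:
    """Availability of n replicas, assuming independent failures."""
    return 1 - p_fail ** n


def replicas_needed(p_fail: float, target: float) -> int:
    """Smallest n whose availability reaches the target (as a fraction)."""
    if not (0 <= p_fail < 1 and 0 <= target < 1):
        raise ValueError("p_fail and target must be in [0, 1)")
    n = 1
    while replicated_availability(p_fail, n) < target:
        n += 1
    return n


for n in (1, 2, 3):
    print(n, f"{replicated_availability(0.05, n) * 100:.4f}%")

print(replicas_needed(0.05, 0.99999))
```

With p = 0.05, two replicas reproduce the 99.75% figure from the example, and under the independence assumption four replicas are the minimum needed to reach 99.999%.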

Reliability Metrics
  • Mean Time Between Failures (MTBF): Average time between consecutive failures; indicates how frequently failures occur.

  • Mean Time To Detect (MTTD): Average time to detect errors.

  • Mean Time To Resolve (MTTR): Average time to fix issues once detected.

  • Mean Time To Failure (MTTF): Average operational time before failure occurs.
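To make the four metrics concrete, here is a small sketch that derives them from an incident log. The log is invented for illustration, and the exact boundaries of each metric vary between sources; the definitions coded here follow the wording of the list above (for example, MTTR is measured from detection, not from the start of the failure).

```python
from statistics import mean

# Hypothetical incident log, in minutes since monitoring began:
# (failure started, failure detected, service restored)
incidents = [
    (100, 104, 130),
    (700, 702, 720),
    (1500, 1509, 1540),
]

# MTTD: average delay between a failure starting and being detected.
mttd = mean(detected - failed for failed, detected, _ in incidents)
# MTTR: average time to restore service once the failure is detected.
mttr = mean(restored - detected for _, detected, restored in incidents)

consecutive = list(zip(incidents, incidents[1:]))
# MTTF: average operating time from a restoration to the next failure.
mttf = mean(nxt[0] - cur[2] for cur, nxt in consecutive)
# MTBF: average time from the start of one failure to the start of the next.
mtbf = mean(nxt[0] - cur[0] for cur, nxt in consecutive)

print(f"MTTD={mttd} MTTR={mttr} MTTF={mttf} MTBF={mtbf}")
```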

Coordination: Consensus Algorithms

Paxos Algorithm Overview
  • Paxos Algorithm: A distributed consensus algorithm proposed by Lamport to agree on a single value despite node or network failures.

  • Roles in Paxos:

    • Proposer: Initiates proposals.

    • Acceptor: Votes on proposals.

    • Learner: Receives the agreed-upon value.

  • Uses: Consistent replication, leader election, transaction coordination.

Paxos Algorithm Phases
Phase 1: Prepare
  1. Prepare (1a): Proposer sends a prepare(N) message with the proposal number N.

  2. Promise (1b): If N is greater than every proposal number the acceptor has already seen, the acceptor replies with a promise(N, [v]) message, where v is the value of the highest-numbered proposal it has accepted so far (if any).

Phase 2: Accept
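  1. Accept Request and Propose (2a): If the proposer receives promises from a majority of acceptors, it chooses the value v to propose: the value of the highest-numbered proposal reported in those promises, or its own value if none was reported. It then sends accept(N, v) to the acceptors.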

  2. Accepted (2b): If an acceptor has not promised a higher proposal number, it accepts (N, v) and notifies learners.
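The two phases above can be sketched as a single-process simulation. This is a simplified illustration, not a production protocol: there is no network, no message loss, and no learner role, and the names `Acceptor` and `propose` are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Acceptor:
    promised_n: int = 0                          # highest proposal number promised
    accepted: Optional[Tuple[int, str]] = None   # (n, value) last accepted, if any

    def prepare(self, n: int):
        """Phase 1b: promise only if n is higher than anything seen so far."""
        if n > self.promised_n:
            self.promised_n = n
            return ("promise", self.accepted)
        return ("reject", None)

    def accept(self, n: int, value: str) -> bool:
        """Phase 2b: accept unless a higher-numbered promise was made."""
        if n >= self.promised_n:
            self.promised_n = n
            self.accepted = (n, value)
            return True
        return False


def propose(acceptors: List[Acceptor], n: int, value: str) -> Optional[str]:
    """Run both phases for proposal n; return the chosen value, or None."""
    majority = len(acceptors) // 2 + 1

    # Phase 1a/1b: send prepare(n) and collect promises.
    replies = [a.prepare(n) for a in acceptors]
    promises = [prior for kind, prior in replies if kind == "promise"]
    if len(promises) < majority:
        return None

    # Phase 2a: adopt the value of the highest-numbered accepted proposal, if any.
    prior_accepted = [p for p in promises if p is not None]
    if prior_accepted:
        value = max(prior_accepted, key=lambda p: p[0])[1]

    # Phase 2b: ask the acceptors to accept (n, value).
    votes = sum(a.accept(n, value) for a in acceptors)
    return value if votes >= majority else None


acceptors = [Acceptor() for _ in range(3)]
print(propose(acceptors, 1, "A"))  # no prior value: the proposer's own value is chosen
print(propose(acceptors, 2, "B"))  # higher number wins phase 1 but must re-propose "A"
print(propose(acceptors, 1, "C"))  # lower number is rejected in phase 1
```

The three calls print `A`, `A`, and `None`: once a value has been accepted by a majority, later proposers can only confirm it, and stale proposal numbers are refused.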

Paxos Algorithm Summary
  • Phase 1a: Prepare and send prepare(N) request.

  • Phase 1b: Promise to not accept proposals lower than N.

  • Phase 2: Accept or reject proposals based on previous promises.

Example Scenarios in Paxos Algorithm
  1. Case 1: Proposer initiates consensus without prior values.

  2. Case 2: Prior accepted values influence new proposals.

  3. Case 3: Lower proposal numbers are rejected to avoid overwriting ongoing consensus.

  4. Case 4: Higher proposal numbers can take control, but they must propose the last accepted value.

Resource Management Techniques

Multi-Paxos Algorithm
  • Definition: An extension of Paxos for handling multiple values efficiently.

  • Leader election: Conducted once rather than for every value; a stable leader skips the prepare phase for subsequent log slots.

  • Multiple proposals: The leader proposes values for multiple log slots.

Handling Failures and Fault Tolerance
  • Replication: Maintaining copies across multiple servers to enhance durability and availability.

  • Active vs. Passive Replication:

    • Active: Requests sent to all nodes concurrently.

    • Passive: Requests routed through a primary node before replication.

Techniques for Improving Availability
  • Redundancy: Regularly maintained copies of key data/services.

  • Monitoring: Continuous health checks of system operations to preempt failures.

  • Rollback Capabilities: Reverting to a previous working state to minimize the disruption caused by failures.

Analysis of Concurrency Approaches
  • Optimistic Concurrency Control:

    • Assumes few conflicts occur, validating at completion.

  • Pessimistic Concurrency Control:

    • Locks resources to ensure data integrity during transactions and avoid race conditions.
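The contrast between the two approaches can be sketched on a single in-memory record. This is an illustration under assumptions: the `VersionedRecord` class, its version counter, and the retry limit are hypothetical, and a real database would validate at transaction commit rather than per field.

```python
import threading


class VersionedRecord:
    """A value plus a version counter used for optimistic validation."""

    def __init__(self, value: int) -> None:
        self.value = value
        self.version = 0
        self._lock = threading.Lock()

    def add_pessimistic(self, delta: int) -> None:
        """Pessimistic: hold the lock for the whole read-modify-write."""
        with self._lock:
            self.value += delta
            self.version += 1

    def read(self):
        """Take a consistent snapshot; no lock is held afterwards."""
        with self._lock:
            return self.value, self.version

    def commit_if_unchanged(self, new_value: int, expected_version: int) -> bool:
        """Optimistic: validate at completion, fail if someone else committed."""
        with self._lock:
            if self.version != expected_version:
                return False  # conflict detected
            self.value = new_value
            self.version += 1
            return True


def add_optimistic(record: VersionedRecord, delta: int, max_retries: int = 10) -> bool:
    """Retry the read-compute-validate cycle until it commits or retries run out."""
    for _ in range(max_retries):
        value, version = record.read()
        if record.commit_if_unchanged(value + delta, version):
            return True
    return False


record = VersionedRecord(10)
stale_value, stale_version = record.read()
record.add_pessimistic(5)                                          # another writer commits first
print(record.commit_if_unchanged(stale_value + 1, stale_version))  # False: stale snapshot
print(add_optimistic(record, 1), record.value)                     # True 16
```

The optimistic path does its work without holding a lock and pays only when a conflict occurs, which suits workloads where conflicts are rare; the pessimistic path never retries but serializes every writer.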

Networking Concepts in Distributed Systems

IP Addressing and Classes
  • IP Address Format: Four numbers separated by dots (0-255), e.g., 192.168.21.76.

  • Classes: Categorized into five distinct classes (A-E) based on size and usage, focusing on network vs. host identification.
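As an illustration of the class scheme, the class of an IPv4 address can be read from its first octet. This sketch uses the traditional classful boundaries (A: 0-127, B: 128-191, C: 192-223, D: 224-239, E: 240-255), which the notes do not spell out; the function name is hypothetical.

```python
import ipaddress


def ipv4_class(address: str) -> str:
    """Return the classful category (A-E) based on the first octet."""
    first_octet = ipaddress.IPv4Address(address).packed[0]
    if first_octet <= 127:
        return "A"
    if first_octet <= 191:
        return "B"
    if first_octet <= 223:
        return "C"
    if first_octet <= 239:
        return "D"  # multicast
    return "E"      # reserved


print(ipv4_class("192.168.21.76"))  # C
```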

CIDR and Subnetting Techniques
  • Classless Inter-domain Routing (CIDR): Describes networks/subnetworks efficiently.

  • Subnet Masks: Determine which part of the IP address is dedicated to the network vs. host.
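Python's standard `ipaddress` module can show how a CIDR prefix splits an address into network and host parts. The /24 prefix and the /26 subdivision below are assumptions chosen for illustration; only the address 192.168.21.76 comes from the notes.

```python
import ipaddress

# An address together with its prefix length (the /24 is an assumed example).
iface = ipaddress.ip_interface("192.168.21.76/24")
network = iface.network

print(network)                 # 192.168.21.0/24
print(network.netmask)         # 255.255.255.0
print(network.num_addresses)   # 256

# Membership test: is another host in the same network?
print(ipaddress.ip_address("192.168.21.200") in network)  # True

# Subnetting: borrow 2 host bits to split the /24 into four /26 subnets.
subnets = list(network.subnets(prefixlen_diff=2))
print([str(s) for s in subnets])
```

The subnet mask 255.255.255.0 marks the first 24 bits as the network part, leaving 8 bits (256 addresses) for hosts.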

Security Implementation and Firewall Concepts
  • Security Groups: Act as virtual firewalls that control traffic at the instance level.

  • Network ACLs: Add a further security layer at the subnet level, with explicit allow/deny rules.

  • Security Features: Ensuring user-defined security policies are enforced at multiple network levels.

Storage in Cloud Computing

Overview of Cloud Storage Services
  • Service Types: Block, File, Object storage methods within cloud infrastructure.

  • Instance Store vs. EBS vs. S3: Practical applications, advantages, and limitations of each storage method.

Amazon S3 Glacier: Long Term Storage Solutions
  • Storage Classes: Various S3 classes based on access frequency, retrieval time, and associated costs (Standard, Intelligent-Tiering, Archival).

Optimizing Distributed File Systems
  • Strategies: Local caching, tiered storage, load balancing, and erasure coding for enhanced performance and cost savings.

Amazon EC2 Storage Options
  • Details on types, applications, and functional purposes of Elastic Block Store (EBS), Elastic File System (EFS), and their operational characteristics.

Conclusion

  • Distributed databases: Introduction to the functions and efficiencies in data storage and relations.

  • AWS Database Services: Various service offerings like Amazon RDS and DynamoDB with attributes and use cases detailed.

  • Sharding and Data Conditioning: For enhancing operational efficiency in large-scale data environments, covering technical strategies and their implementations via example.
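As one possible illustration of sharding, the sketch below assigns records to shards by hashing their key. Hash-based sharding is only one strategy (range-based and directory-based schemes also exist), and the function name, key format, and shard count are hypothetical.

```python
import hashlib
from collections import defaultdict


def shard_for(key: str, num_shards: int) -> int:
    """Map a key to a shard index using a stable hash of the key."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


# Distribute some hypothetical user IDs across 4 shards.
placement = defaultdict(list)
for user_id in (f"user-{i}" for i in range(12)):
    placement[shard_for(user_id, 4)].append(user_id)

for shard, keys in sorted(placement.items()):
    print(shard, keys)
```

A stable hash (rather than Python's built-in `hash`, which is randomized per process for strings) keeps the mapping consistent across machines and restarts. Note that with plain modulo placement, changing the number of shards remaps most keys.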