lecture23 - Distributed File System 1

Class Summary on File Systems and Distributed Systems

1. Introduction

  • Today's session began with a quick check on progress for Assignment 2, encouraging students to assess their understanding and clarify any uncertainties.

  • Office hours were announced to be held from 3 PM to 6 PM to provide dedicated support for questions related to Assignment 2.

  • The main topics covered during the lecture included the complexities of file systems, the intricacies of file system versioning, and a foundational overview of distributed file systems, particularly focusing on how they differ from traditional file systems.

2. Correction from Last Lecture

  • A correction on RAID 10 was provided. RAID 0+1 and RAID 1+0 are two distinct nested layouts and must not be confused: RAID 1+0 (the layout normally meant by "RAID 10") stripes data across mirrored pairs, whereas RAID 0+1 mirrors two striped sets.

  • The differences in layering and their implications for performance, redundancy, and recovery were discussed in detail.

  • Students should be prepared for exam questions that will feature illustrated figures to aid in understanding the concepts more clearly.
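
The redundancy difference between the two layouts can be checked by enumeration. The following is a minimal sketch, assuming a hypothetical four-disk array in which disks 0 and 1 form one group and disks 2 and 3 the other; the disk numbering and function names are illustrative and not taken from the lecture. It counts how many of the six possible two-disk failures each layout survives.

```python
from itertools import combinations

# Hypothetical 4-disk layouts; the grouping below is an assumption for illustration.
RAID10_MIRRORS = [(0, 1), (2, 3)]  # RAID 1+0: mirrored pairs, striped together
RAID01_STRIPES = [(0, 1), (2, 3)]  # RAID 0+1: striped sets, mirrored together


def raid10_survives(failed):
    """RAID 1+0 survives while every mirrored pair keeps at least one disk."""
    return all(any(d not in failed for d in pair) for pair in RAID10_MIRRORS)


def raid01_survives(failed):
    """RAID 0+1 survives while at least one striped set is fully intact."""
    return any(all(d not in failed for d in stripe) for stripe in RAID01_STRIPES)


def surviving_two_disk_failures(survives):
    """Count the two-disk failure combinations that the array survives."""
    return sum(survives(set(pair)) for pair in combinations(range(4), 2))


print(surviving_two_disk_failures(raid10_survives), "of 6")  # RAID 1+0: 4 of 6
print(surviving_two_disk_failures(raid01_survives), "of 6")  # RAID 0+1: 2 of 6
```

Under these assumptions, RAID 1+0 tolerates more double failures because it only fails when both disks of the same mirror are lost, while RAID 0+1 fails as soon as each striped set has lost one disk.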

3. Writing to a File and Crash Scenarios

  • A comprehensive overview of the steps involved in writing to a file was presented. Students learned how unexpected crashes could disrupt these operations and compromise the overall data consistency of the system.

  • A discussion prompt invited students to explore and articulate their understanding of how crashes can vary in impact on file operations.

  • The possible outcomes of a crash identified in the discussion were lost files, inaccessible files, incorrect file sizes, and inconsistent metadata, all of which can seriously affect system reliability.

4. Causes of Crashes

  • Various potential causes for crashes were elaborated, including:

    • Hardware failures, which can stem from physical malfunctions of storage devices.

    • Power outages, resulting in abrupt system shutdowns that can affect uncommitted data.

    • Software bugs, which can lead to unpredictable behavior and system freezes.

  • Understanding these causes matters for business continuity and disaster recovery, because a crash can leave the file system in a corrupted state.

5. Recovery Strategies

5.1 Consistency Tracking Tools

  • Windows tools that scan the file system and repair metadata inconsistencies after a crash (for example, chkdsk) were discussed, along with their limitations: a full scan can be slow, and not every inconsistency can be repaired.

5.2 Journaling Techniques

  • Journaling was identified as a robust strategy: intended changes are first recorded in a journal and only then applied to the file system. After a crash, the system can replay complete journal entries and discard incomplete ones, which makes recovery safer and faster than a full scan.

    • File systems that implement journaling include ext3, ext4, and NTFS.

    • The journaling process writes changes to a dedicated journal area and brackets each operation with clear start and end markers.

    • Challenges discussed included writes that are larger than a single block, which cannot be journaled in one atomic step, and write reordering, where an end marker may reach the disk before the data it covers and make an incomplete entry look valid, leading to corruption.

    • Logical journaling was noted as a more efficient variant that logs only metadata changes (an approach commonly called metadata journaling), improving performance by reducing the volume of data that must be written.
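
The start and end markers described above can be illustrated with a small in-memory simulation. This is a minimal sketch under stated assumptions, not how ext3, ext4, or NTFS actually lay out their journals: the record format, the function names `log_transaction` and `recover`, and the block names are all hypothetical.

```python
def log_transaction(journal, txid, writes, crash_before_end=False):
    """Append a transaction to the journal, bracketed by BEGIN and END records."""
    journal.append(("BEGIN", txid))
    for block, data in writes.items():
        journal.append(("WRITE", txid, block, data))
    if crash_before_end:
        return  # simulated crash: the END marker never reaches the journal
    journal.append(("END", txid))


def recover(journal, disk):
    """Replay only transactions whose END marker is present in the journal."""
    committed = {rec[1] for rec in journal if rec[0] == "END"}
    for rec in journal:
        if rec[0] == "WRITE" and rec[1] in committed:
            _, _, block, data = rec
            disk[block] = data
    return disk


journal = []
log_transaction(journal, 1, {"inode_7": "size=4096", "bitmap": "block 12 used"})
log_transaction(journal, 2, {"inode_9": "size=8192"}, crash_before_end=True)

disk = recover(journal, {})
print(disk)  # only the writes of transaction 1 are applied
```

Transaction 2 has no END record, so recovery ignores it and the simulated disk never sees a half-applied update. The sketch assumes journal records reach storage in order; the reordering problem noted above is exactly the case where that assumption breaks.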

6. Data Integrity with ZFS File System

  • The copy-on-write (COW) technique used by the ZFS file system was highlighted. Instead of overwriting data in place, modified data is written to new blocks, and the file system switches to the new version only once all modifications are complete. The original data therefore stays intact throughout, which safeguards integrity even if a crash occurs mid-update.

  • A crash may leave newly written blocks unreferenced, which wastes space, but the existing data is never left half-updated. This guarantee against corruption remains a significant advantage.
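
The copy-on-write idea can be sketched in a few lines. This is a toy model under stated assumptions and does not reflect the real on-disk structures of ZFS: the class `CowStore`, its fields, and the `crash_before_commit` flag are hypothetical names introduced for illustration.

```python
class CowStore:
    """Toy copy-on-write store: blocks are never overwritten in place."""

    def __init__(self):
        self.blocks = {}   # block id -> contents
        self.root = {}     # file name -> block id (the live pointer table)
        self.next_id = 0

    def _alloc(self, data):
        """Write data into a fresh block and return its id."""
        block_id = self.next_id
        self.next_id += 1
        self.blocks[block_id] = data
        return block_id

    def write(self, name, data, crash_before_commit=False):
        new_block = self._alloc(data)   # step 1: write the new version elsewhere
        if crash_before_commit:
            return                      # simulated crash: the new block is orphaned
        new_root = dict(self.root)
        new_root[name] = new_block
        self.root = new_root            # step 2: one pointer switch commits the change

    def read(self, name):
        return self.blocks[self.root[name]]


store = CowStore()
store.write("a.txt", "v1")
store.write("a.txt", "v2", crash_before_commit=True)
print(store.read("a.txt"))   # still the old, intact version: v1
print(len(store.blocks))     # 2: one live block plus one orphaned block
```

The interrupted write costs one unreferenced block but leaves `a.txt` readable at its previous version, which matches the trade-off described above.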

7. Distributed Systems Overview

  • The transition from self-contained programs to services deployed across multiple computers was reviewed. This model, exemplified by services such as those provided by Google, marks a fundamental shift in how applications are designed.

8. Benefits of Distributed Systems

  • The primary benefits of implementing distributed systems include:

    • Performance, gained by interconnecting multiple machines and pooling their computing power and resources.

    • Data caching, which reduces latency by placing frequently accessed data closer to users, improving load times and user satisfaction.

    • Replication, which increases fault tolerance by ensuring that multiple copies of data exist, shielding the system from data loss.

9. Characteristics of Distributed Systems

  • Characteristics of distributed systems include:

    • A structure comprising independent components that operate collaboratively through a connected network.

    • The necessity for these systems to present themselves as a coherent unit, ensuring continuity in operation despite potential failures of individual components.

10. Challenges in Distributed Systems

10.1 Concurrency

  • Challenges of concurrent access were discussed, particularly the difficulties posed when multiple requests occur simultaneously. Ensuring consistent data across replicated instances remains a complex task.

10.2 Latency

  • Latency caused by communication delays over the network was highlighted. Three timing models for communication were introduced:

    • Synchronous: There is a known upper bound on how long communication can take.

    • Partially synchronous: No bound on timing is known in advance, but messages are expected to reach the recipient eventually.

    • Asynchronous: There is no guarantee on timing at all, which complicates protocol design because a slow message cannot be told apart from a lost one.
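
One way to see why the timing model matters is a timeout-based failure detector. The sketch below is illustrative only: the function `suspects`, the node names, and the delay values are hypothetical, and a reply delay of `None` stands for a node that has crashed and never answers.

```python
def suspects(reply_delays, timeout):
    """Return the nodes suspected of having crashed.

    reply_delays maps a node name to its reply delay in seconds,
    or to None if the node never replies (it has crashed).
    """
    return [
        name
        for name, delay in reply_delays.items()
        if delay is None or delay > timeout
    ]


MAX_DELAY = 2.0  # assumed known bound, which only the synchronous model provides

# Synchronous model: every live node replies within MAX_DELAY,
# so a timeout equal to the bound never suspects a live node.
sync_replies = {"a": 0.5, "b": 1.9, "c": None}
print(suspects(sync_replies, MAX_DELAY))   # ['c']

# Asynchronous model: node "b" is alive but slow. No timeout value is safe,
# so the detector wrongly suspects it alongside the crashed node "c".
async_replies = {"a": 0.5, "b": 5.0, "c": None}
print(suspects(async_replies, MAX_DELAY))  # ['b', 'c']
```

With a known bound, silence past the timeout is proof of failure. Without one, the same silence is only a guess, which is the difficulty that asynchronous protocols have to design around.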

10.3 Partial Failure Handling

  • The concept of partial failure handling was defined, emphasizing that the failure of a single component does not necessitate a failure at the system level. However, the absence of a global state complicates failure detection and recovery procedures.

11. Availability vs. Reliability

  • The discussion differentiated between availability, the fraction of time a system is operational, and reliability, how long a system runs continuously without failure. Examples showed how redundancy can improve availability by preventing downtime, with corresponding uptime calculations presented.
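
The summary does not record the exact calculations from the lecture, so the following is a sketch using the standard textbook formulas rather than the lecture's own figures. It assumes availability is MTTF / (MTTF + MTTR), that replicas fail independently, and that the system is up while at least one replica is up. The numbers are hypothetical.

```python
HOURS_PER_YEAR = 365 * 24


def availability(mttf_hours, mttr_hours):
    """Fraction of time a component is up, from mean time to failure and to repair."""
    return mttf_hours / (mttf_hours + mttr_hours)


def redundant_availability(single, replicas):
    """Availability of a group that is up while at least one replica is up.

    Assumes replica failures are independent.
    """
    return 1 - (1 - single) ** replicas


def downtime_hours_per_year(avail):
    """Expected hours of downtime per year at a given availability."""
    return (1 - avail) * HOURS_PER_YEAR


# Hypothetical component: fails on average every 990 hours, takes 10 hours to repair.
one = availability(990, 10)
two = redundant_availability(one, 2)

print(f"single server: {one:.4f}, {downtime_hours_per_year(one):.2f} h/year down")
print(f"two replicas:  {two:.4f}, {downtime_hours_per_year(two):.2f} h/year down")
```

Note that reliability corresponds to the MTTF input, while availability also depends on how quickly failures are repaired. A system can fail often yet remain highly available if each repair is fast.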

12. Conclusion

  • The session concluded with a preview of the next topics, which will focus on security aspects and further challenges within distributed systems, paving the way for a deeper understanding of data protection and system integrity in complex environments.