Domain 1 - Managing and Optimizing Storage

Amazon Elastic Block Store (EBS)
  • Definition: Amazon Elastic Block Store (EBS) is a high-performance, block-level storage service designed for use with Amazon EC2 instances, for both throughput-intensive and transaction-intensive workloads.

    • Unlike object storage (Amazon S3), EBS presents a raw block device that the instance formats and mounts like a local disk.

  • Availability and Durability:

    • Both the EBS volume and EC2 instance must reside in the same Availability Zone (AZ).

    • EBS volumes are automatically replicated within their AZ to protect against hardware failure, offering high durability.

  • Volume Lifecycle Management:

    1. Volume Creation: Can be provisioned as empty or from an existing Snapshot. The volume status will transition to available once ready for attachment.

    2. Attachment: Attaching an available volume to an instance moves it to the in-use state.

    3. Volume Deletion: Deletion is prohibited while the volume is attached.

    4. Root Volumes: By default, the DeleteOnTermination attribute is set to True for root volumes. For data persistence, this must be manually toggled to False.
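
The DeleteOnTermination change in step 4 can be scripted. The following is a minimal sketch, assuming a boto3 EC2 client is passed in; the function name and the default device name are illustrative assumptions (root device names vary by AMI, commonly /dev/xvda or /dev/sda1), and the instance ID in the usage comment is a placeholder.

```python
def keep_root_volume_on_termination(ec2_client, instance_id, device_name="/dev/xvda"):
    """Set DeleteOnTermination=False for one attached volume of an instance.

    ec2_client is expected to be a boto3 EC2 client (boto3.client("ec2")).
    device_name is an assumption; check the instance's actual root device name.
    """
    return ec2_client.modify_instance_attribute(
        InstanceId=instance_id,
        BlockDeviceMappings=[
            {
                "DeviceName": device_name,
                "Ebs": {"DeleteOnTermination": False},
            }
        ],
    )


# Usage (requires AWS credentials; the instance ID is a placeholder):
#   import boto3
#   keep_root_volume_on_termination(boto3.client("ec2"), "i-0123456789abcdef0")
```

Passing the client in as a parameter keeps the function easy to exercise with a stub before running it against a real account.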

Storage Metrics and Performance
  • IOPS (Input/Output Operations Per Second): A measure of the number of reads/writes a volume can perform per second. Crucial for database workloads.

  • Queue Length: The number of pending I/O requests. If this value is consistently high relative to IOPS, it indicates a performance bottleneck.

  • Throughput: Measured in MiB/s, representing the total volume of data moved. High throughput is vital for streaming and big data analysis.

  • Latency: The time taken for a single I/O unit to complete its round trip. High latency typically signifies that the volume has reached its performance limit.

  • Burst Balance: Applicable to gp2, st1, and sc1.

    • Volumes earn specific credits when operating below their baseline performance.

    • When the workload spikes, the volume consumes these credits to "burst" above the baseline. If BurstBalance = 0, the volume is throttled to its baseline performance.
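
The credit mechanism above behaves like a token bucket. The sketch below is a simplified model, not AWS's actual implementation: credits refill at the baseline rate each second, each I/O spends one credit, and delivery is capped at the burst rate. The function name and the numbers in the example are illustrative assumptions, not real volume limits.

```python
def simulate_burst_bucket(demand_iops, baseline_iops, burst_iops, bucket_capacity):
    """Simulate per-second delivered IOPS and burst balance (%) for a credit bucket.

    demand_iops: list of requested IOPS, one entry per second.
    The bucket starts full, which mirrors how new burstable volumes start.
    """
    credits = bucket_capacity
    delivered, balance_percent = [], []
    for demand in demand_iops:
        # Credits accrue at the baseline rate, up to the bucket capacity.
        credits = min(bucket_capacity, credits + baseline_iops)
        # The volume can serve at most the burst rate, and only while credits remain.
        served = min(demand, burst_iops, credits)
        credits -= served
        delivered.append(served)
        balance_percent.append(100 * credits / bucket_capacity)
    return delivered, balance_percent


# Toy numbers: a sustained spike drains the bucket, then the volume
# falls back to its baseline rate once the balance reaches 0.
delivered, balance = simulate_burst_bucket(
    demand_iops=[3000] * 10,
    baseline_iops=100,
    burst_iops=3000,
    bucket_capacity=10_000,
)
print(delivered)
print(balance[-1])
```

With demand below the baseline, the same loop shows the balance climbing back toward 100%, which is the "earning credits" phase described above.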

Detailed Volume Types

Solid State Drives (SSD)

  • General Purpose SSD (gp2/gp3):

    • gp2: Performance scales with volume size (3 IOPS per GiB) with a minimum of 100 IOPS and a burst up to 3,000 IOPS.

    • gp3: Decouples performance from storage size. Provides a baseline of 3,000 IOPS and 125 MiB/s regardless of volume size.

  • Provisioned IOPS SSD (io1/io2):

    • Designed for I/O-intensive database workloads.

    • io2 Block Express: Offers sub-millisecond latency and up to 256,000 IOPS.
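
The gp2 sizing rule (3 IOPS per GiB with a 100 IOPS floor) is easy to check with a small calculation. This is a sketch under one assumption that is not in these notes: the 16,000 IOPS ceiling for gp2 baseline performance, taken from AWS's published gp2 limits. The function name is illustrative.

```python
GP2_IOPS_PER_GIB = 3
GP2_MIN_IOPS = 100
GP2_MAX_IOPS = 16_000  # assumption from AWS's published gp2 limits, not from these notes
GP3_BASELINE_IOPS = 3_000


def gp2_baseline_iops(size_gib):
    """Baseline IOPS of a gp2 volume of the given size in GiB."""
    return min(max(GP2_MIN_IOPS, GP2_IOPS_PER_GIB * size_gib), GP2_MAX_IOPS)


# Compare gp2 baselines with the flat gp3 baseline at a few sizes.
for size in (20, 100, 1000):
    print(size, gp2_baseline_iops(size), GP3_BASELINE_IOPS)
```

Under this rule a gp2 volume needs to reach 1,000 GiB before its baseline matches the 3,000 IOPS that gp3 provides at any size.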

Hard Disk Drives (HDD)

  • Throughput Optimized HDD (st1): Focused on throughput (up to 500 MiB/s) rather than IOPS. Good for log processing.

  • Cold HDD (sc1): Lowest cost for infrequently accessed workloads.

  • Magnetic (Standard): Previous generation, rarely used in modern architectures.

RAID Configurations for EBS
  • RAID 0 (Striping): Used to increase total IOPS/Throughput by spreading data across multiple volumes. However, loss of one volume results in data loss for the whole set.

  • RAID 1, 5, 6: Generally redundant for EBS because Amazon already replicates each volume within its AZ. RAID 5/6 are specifically discouraged because their parity writes consume part of the IOPS available to the volumes, which degrades performance on network-attached storage.
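
As a rough illustration of the RAID 0 trade-off, the sketch below estimates the aggregate capacity and performance of a stripe set. It assumes data is striped evenly, so the array is limited by its smallest and slowest member multiplied by the member count; it ignores instance-level EBS bandwidth limits, which can cap the real figure. The class and function names are illustrative.

```python
from dataclasses import dataclass


@dataclass
class Volume:
    size_gib: int
    iops: int
    throughput_mibs: int


def raid0_totals(volumes):
    """Estimate the size, IOPS, and throughput of a RAID 0 set of EBS volumes."""
    if not volumes:
        raise ValueError("a RAID 0 set needs at least one volume")
    n = len(volumes)
    # Even striping means every member carries the same share of the load,
    # so the weakest member sets the per-member ceiling.
    return Volume(
        size_gib=n * min(v.size_gib for v in volumes),
        iops=n * min(v.iops for v in volumes),
        throughput_mibs=n * min(v.throughput_mibs for v in volumes),
    )


# Four volumes at the gp3 baseline of 3,000 IOPS and 125 MiB/s.
stripe = raid0_totals([Volume(100, 3000, 125)] * 4)
print(stripe)
```

The performance gain comes with no redundancy at the array level: losing any one member loses the whole set, so snapshots of all members remain necessary.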

Monitoring and Health Checks
  • Volume Status Checks: Automated checks run every 5 minutes and report a status for each volume. Separately, EBS sends volume metrics to CloudWatch every 1 minute.

    • ok: Everything is functioning normally.

    • impaired: The volume is unavailable or I/O is stalled.

  • Data Consistency: If AWS detects a potential inconsistency, I/O may be disabled. To resume service, either re-enable I/O on the volume manually (for example with the enable-volume-io CLI command) or set the volume's AutoEnableIO attribute so that I/O is re-enabled automatically.

  • OS Level Monitoring: Use iostat -xdmzt 1 on Linux or Perfmon on Windows to identify "Micro-bursting" (latency spikes shorter than the 1-minute CloudWatch polling interval).
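
To see why micro-bursts are invisible at 1-minute granularity, consider per-second latency samples such as those collected by an OS-level tool. The sketch below uses made-up sample values and an assumed 10 ms threshold; it only illustrates how a one-minute average hides a short spike.

```python
from statistics import mean


def summarize_minute(latency_ms_samples, threshold_ms=10.0):
    """Summarize one minute of per-second latency samples (milliseconds)."""
    return {
        "avg_ms": mean(latency_ms_samples),
        "max_ms": max(latency_ms_samples),
        "seconds_over_threshold": sum(
            1 for sample in latency_ms_samples if sample > threshold_ms
        ),
    }


# 55 quiet seconds at 1 ms plus a 5-second spike at 50 ms (made-up values).
samples = [1.0] * 55 + [50.0] * 5
summary = summarize_minute(samples)
print(summary)
```

The average for the minute stays near 5 ms even though the application saw 50 ms latency for five seconds, which is the gap that per-second OS tooling fills.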

Instance Store (Ephemeral Storage)
  • Characteristics:

    • Physically attached to the host server, resulting in very low latency and high IOPS.

    • Data Volatility: Data is lost if the instance is stopped, hibernated, or terminated, or if the underlying disk fails. Data persists only across instance reboots.

  • Use Cases: Temporary files, scratch space, distributed file systems (like HDFS), and swap files.

Modifying Volumes (Elastic Volumes)
  • Modifications: You can increase size, change volume type (e.g., gp2 to io1), or adjust IOPS/throughput on the fly without downtime.

  • Cooldown Period: Once a modification starts, you must wait at least 6 hours before modifying the same volume again.

  • File System Extension: After the EBS volume is resized in AWS, the OS-level file system (e.g., ext4, xfs, or NTFS) must be extended using commands like resize2fs or xfs_growfs.
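
Automation that resizes volumes has to respect the cooldown. The following is a minimal sketch of that check, assuming the start time of the last modification is already known (for example from your own records); the function names are illustrative, and the 6-hour window is the value stated above.

```python
from datetime import datetime, timedelta, timezone

MODIFICATION_COOLDOWN = timedelta(hours=6)


def next_modification_allowed(last_modification_start):
    """Earliest time the same volume can be modified again."""
    return last_modification_start + MODIFICATION_COOLDOWN


def can_modify(last_modification_start, now=None):
    """Return True if the cooldown since the last modification has elapsed."""
    if now is None:
        now = datetime.now(timezone.utc)
    return now >= next_modification_allowed(last_modification_start)


# Example with fixed timestamps so the result does not depend on the clock.
started = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
print(next_modification_allowed(started))
print(can_modify(started, now=started + timedelta(hours=5, minutes=59)))
print(can_modify(started, now=started + timedelta(hours=6)))
```

Because of this window, it is usually worth sizing a modification generously rather than planning several small increases in one day.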

Multi-Attach
  • Available for io1 and io2 volumes on Nitro-based instances.

  • Allows up to 16 instances to mount the same volume simultaneously.

  • Requires a cluster-aware file system (e.g., GFS2, OCFS2) to manage write-locking and data integrity.
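
The constraints above can be encoded before a volume is requested. This sketch builds the keyword arguments for boto3's EC2 create_volume call with MultiAttachEnabled set; the helper names are illustrative assumptions, and the values in the example are placeholders.

```python
MULTI_ATTACH_VOLUME_TYPES = ("io1", "io2")
MAX_MULTI_ATTACH_INSTANCES = 16


def build_multi_attach_volume_request(availability_zone, size_gib, iops, volume_type="io2"):
    """Build keyword arguments for ec2_client.create_volume with Multi-Attach enabled."""
    if volume_type not in MULTI_ATTACH_VOLUME_TYPES:
        raise ValueError(f"Multi-Attach requires io1 or io2, got {volume_type!r}")
    return {
        "AvailabilityZone": availability_zone,
        "Size": size_gib,
        "VolumeType": volume_type,
        "Iops": iops,
        "MultiAttachEnabled": True,
    }


def has_attachment_capacity(attached_instance_count):
    """Return True if one more instance can attach under the 16-instance limit."""
    return attached_instance_count < MAX_MULTI_ATTACH_INSTANCES


request = build_multi_attach_volume_request("us-east-1a", 100, 3000)
print(request)

# Usage (requires AWS credentials):
#   import boto3
#   boto3.client("ec2").create_volume(**request)
```

Enabling Multi-Attach only makes the block device shareable; coordinating writes is still the job of the cluster-aware file system.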

EBS Snapshots and Data Lifecycle
  • Incremental Nature: Only the blocks changed since the last snapshot are stored, reducing storage costs.

  • Fast Snapshot Restore (FSR): Eliminates the need for pre-warming (reading all blocks once). Volumes created from an FSR-enabled snapshot are fully initialized at creation and deliver their full provisioned performance immediately.

  • Amazon Data Lifecycle Manager (DLM):

    • Automates snapshot creation and deletion based on tags.

    • Supports cross-region copy and cross-account sharing policies.

  • Recycle Bin: Provides a safety net for snapshots and AMIs. Deleted snapshots are retained in the Recycle Bin for a specified period (1 to 365 days) before being permanently purged.
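
The incremental behavior of snapshots can be illustrated with a toy model. The sketch below treats a volume as a mapping from block index to block content and counts how many blocks each successive snapshot would need to store. It is a conceptual model only, not how EBS tracks blocks internally, and the function name is illustrative.

```python
def incremental_snapshot_block_counts(volume_states):
    """Count the blocks newly stored by each snapshot in a sequence.

    volume_states: list of dicts mapping block index -> block content,
    one dict per point in time at which a snapshot is taken.
    """
    stored_counts = []
    previous = {}
    for state in volume_states:
        # Only blocks that are new or differ from the previous snapshot are stored.
        changed = {index for index, data in state.items() if previous.get(index) != data}
        stored_counts.append(len(changed))
        previous = state
    return stored_counts


states = [
    {0: "a", 1: "b", 2: "c"},          # first snapshot stores every block
    {0: "a", 1: "B", 2: "c"},          # one block changed
    {0: "a", 1: "B", 2: "c", 3: "d"},  # one block added
]
counts = incremental_snapshot_block_counts(states)
print(counts)
```

The first snapshot carries the full cost, and later ones cost only as much as the data that changed in between.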

Amazon EFS (Elastic File System)
  • Protocol: Managed NFS (Network File System) for Linux-based workloads.

  • Storage Tiers:

    • Standard: For active data.

    • Infrequent Access (IA): Significantly cheaper; data is moved here automatically by Lifecycle Management if not accessed for a set period (7, 14, 30, 60, or 90 days).

  • Performance Modes:

    • General Purpose: Standard mode.

    • Max I/O: For massive scale; higher latency but higher aggregate throughput.
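
The Lifecycle Management periods listed under Storage Tiers map onto the TransitionToIA values accepted by the EFS API. The sketch below builds the keyword arguments for boto3's EFS put_lifecycle_configuration call; the helper name is an illustrative assumption, the file system ID is a placeholder, and only the periods listed in these notes are included.

```python
IA_TRANSITION_VALUES = {
    7: "AFTER_7_DAYS",
    14: "AFTER_14_DAYS",
    30: "AFTER_30_DAYS",
    60: "AFTER_60_DAYS",
    90: "AFTER_90_DAYS",
}


def build_lifecycle_request(file_system_id, days_without_access):
    """Build keyword arguments for efs_client.put_lifecycle_configuration."""
    if days_without_access not in IA_TRANSITION_VALUES:
        allowed = sorted(IA_TRANSITION_VALUES)
        raise ValueError(f"days_without_access must be one of {allowed}")
    return {
        "FileSystemId": file_system_id,
        "LifecyclePolicies": [
            {"TransitionToIA": IA_TRANSITION_VALUES[days_without_access]}
        ],
    }


lifecycle_request = build_lifecycle_request("fs-0123456789abcdef0", 30)
print(lifecycle_request)

# Usage (requires AWS credentials; the file system ID is a placeholder):
#   import boto3
#   boto3.client("efs").put_lifecycle_configuration(**lifecycle_request)
```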

Amazon FSx (Specialized File Systems)
  • FSx for Windows File Server: Fully managed native Windows SMB file system (supports NTFS and Active Directory).

  • FSx for Lustre: Designed for High-Performance Computing (HPC), machine learning, and video processing. Can process data directly from S3.

  • FSx for NetApp ONTAP: Provides the full capabilities of the NetApp ONTAP file system in the cloud.