Domain 1 - Managing and Optimizing Storage

Definition: Amazon Elastic Block Store (EBS) is a high-performance, block-level storage service designed for use with Amazon EC2 instances, for both throughput-intensive and transaction-intensive workloads.
- Unlike object storage (Amazon S3), EBS exposes a raw block device that the operating system formats and mounts.
Availability and Durability:
- Both the EBS volume and EC2 instance must reside in the same Availability Zone (AZ).
- EBS volumes are automatically replicated within their AZ to protect against hardware failure, offering high durability.
Volume Lifecycle Management:
1. Volume Creation: Can be provisioned as empty or from an existing Snapshot. The volume status will transition to available once ready for attachment.
2. Attachment: A volume in the in-use state is attached to an instance.
3. Volume Deletion: Deletion is prohibited while the volume is attached.
4. Root Volumes: By default, the DeleteOnTermination attribute is true for root volumes, so the root volume is deleted when the instance terminates. To keep the data, set the attribute to false.
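
As a minimal sketch of the root-volume setting above, the helper below builds the block device mapping that could be passed to the boto3 EC2 client's modify_instance_attribute call. The helper name, the device name /dev/xvda, and the instance ID are illustrative assumptions; the real root device name depends on the AMI.

```python
def keep_root_volume_mapping(device_name: str) -> list[dict]:
    """Build a BlockDeviceMappings payload that turns DeleteOnTermination off."""
    return [{"DeviceName": device_name, "Ebs": {"DeleteOnTermination": False}}]


# Hypothetical root device name; check the instance's RootDeviceName first.
mapping = keep_root_volume_mapping("/dev/xvda")
print(mapping)

# With boto3 installed and credentials configured, the payload could be applied
# like this (the instance ID is a placeholder):
# import boto3
# boto3.client("ec2").modify_instance_attribute(
#     InstanceId="i-0123456789abcdef0", BlockDeviceMappings=mapping)
```

Keeping the API call commented out lets the payload be inspected without touching a real account.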

IOPS (Input/Output Operations Per Second): A measure of the number of reads/writes a volume can perform per second. Crucial for database workloads.
Queue Length: The number of pending I/O requests. If this value is consistently high relative to IOPS, it indicates a performance bottleneck.
Throughput: Measured in MiB/s, representing the total volume of data moved. High throughput is vital for streaming and big data analysis.
Latency: The time taken for a single I/O operation to complete its round trip. Sustained high latency often indicates that the volume (or the instance's EBS bandwidth) has reached its performance limit.
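
IOPS, queue length, and latency are linked. A rough way to relate them is Little's law (average outstanding requests = request rate x time per request). The sketch below applies that relation. The function names are illustrative, and the model assumes steady-state averages, not a guarantee about any particular volume.

```python
def average_queue_length(iops: float, latency_ms: float) -> float:
    """Little's law: outstanding I/Os = rate (ops/s) x time in system (s)."""
    return iops * (latency_ms / 1000.0)


def latency_ms_from_queue(queue_length: float, iops: float) -> float:
    """Rearranged form: average latency implied by a queue length and an IOPS rate."""
    return queue_length / iops * 1000.0


# A volume delivering 3,000 IOPS at 1 ms average latency keeps about 3 I/Os in flight.
print(average_queue_length(3000, 1.0))
# If the queue grows to 12 while IOPS stay at 3,000, average latency is about 4 ms.
print(latency_ms_from_queue(12, 3000))
```

A queue that keeps growing while IOPS stay flat therefore shows up directly as rising latency.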
Burst Balance: Applicable to gp2, st1, and sc1.
- Volumes accumulate credits (I/O credits for gp2, throughput credits for st1 and sc1) while operating below their baseline performance.
- When the workload spikes, the volume consumes these credits to "burst" above the baseline. If BurstBalance = 0, the volume is throttled to its baseline performance.
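
The credit mechanism can be sketched as a simple bucket calculation for gp2. The bucket size of 5.4 million I/O credits is an assumption taken from AWS's published gp2 figures, not from these notes, and the function name is illustrative. The model also assumes the bucket starts full and that demand is constant.

```python
GP2_MAX_CREDITS = 5_400_000  # assumed size of a full gp2 credit bucket (I/O credits)
GP2_BURST_IOPS = 3_000       # burst ceiling stated in the notes


def gp2_burst_seconds(size_gib: int, demand_iops: int = GP2_BURST_IOPS) -> float:
    """Seconds a full bucket can sustain demand_iops before BurstBalance hits 0."""
    baseline = max(100, 3 * size_gib)          # credits are earned at the baseline rate
    demand = min(demand_iops, GP2_BURST_IOPS)  # a volume cannot burst past the ceiling
    if demand <= baseline:
        return float("inf")                    # no net credit drain, so no throttling
    return GP2_MAX_CREDITS / (demand - baseline)


for size in (100, 500, 1000):
    print(size, gp2_burst_seconds(size))
```

Under these assumptions, a 100 GiB volume drains credits at 2,700 per second at full burst, while a 1,000 GiB volume never drains because its baseline already equals the burst ceiling.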

Solid State Drives (SSD)

General Purpose SSD (gp2/gp3):
- gp2: Performance scales with volume size ( $3 \text{ IOPS per GiB}$ ) with a minimum of $100$ IOPS and a burst up to $3,000$ IOPS.
- gp3: Decouples performance from storage size. Provides a baseline of $3,000$ IOPS and $125 \text{ MiB/s}$ regardless of volume size.
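
To make the gp2 scaling rule concrete, the sketch below computes the gp2 baseline for a given size and compares it with the fixed gp3 baseline. The 16,000 IOPS cap on gp2 is an assumption from AWS's published limits, not from these notes. The function name is illustrative.

```python
GP3_BASELINE_IOPS = 3_000  # gp3 baseline, independent of size
GP2_MAX_IOPS = 16_000      # assumed gp2 per-volume ceiling


def gp2_baseline_iops(size_gib: int) -> int:
    """3 IOPS per GiB, floored at 100 and capped at the per-volume maximum."""
    return min(max(3 * size_gib, 100), GP2_MAX_IOPS)


for size in (20, 500, 1000, 6000):
    gp2 = gp2_baseline_iops(size)
    print(f"{size} GiB: gp2 baseline {gp2} IOPS, gp3 baseline {GP3_BASELINE_IOPS} IOPS")
```

Under this rule, a gp2 volume needs 1,000 GiB to reach the baseline that gp3 provides at any size.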
Provisioned IOPS SSD (io1/io2):
- Designed for I/O-intensive database workloads.
- io2 Block Express: Offers sub-millisecond latency and up to $256,000$ IOPS.

Hard Disk Drives (HDD)

Throughput Optimized HDD (st1): Optimized for throughput (up to $500 \text{ MiB/s}$) rather than IOPS. Well suited to log processing.
Cold HDD (sc1): Lowest cost for infrequently accessed workloads.
Magnetic (Standard): Previous generation, rarely used in modern architectures.

RAID 0 (Striping): Used to increase total IOPS/Throughput by spreading data across multiple volumes. However, loss of one volume results in data loss for the whole set.
RAID 1, 5, 6: Generally not recommended for EBS. Amazon already replicates volume data within the AZ, so the added redundancy is of limited value, and the parity writes of RAID 5/6 consume part of the IOPS available to the volumes, which reduces usable performance on network-attached storage.
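
The effect of RAID 0 striping can be estimated with a small calculation. This is an idealized sketch: it assumes the stripe set is limited by its smallest and slowest member, and it ignores the instance's own EBS bandwidth limit, which can cap the real result. The class and function names are illustrative.

```python
from dataclasses import dataclass


@dataclass
class Volume:
    size_gib: int
    iops: int
    throughput_mib_s: int


def raid0_totals(volumes: list[Volume]) -> Volume:
    """Ideal RAID 0 aggregate: member count x the weakest member on each axis."""
    n = len(volumes)
    return Volume(
        size_gib=n * min(v.size_gib for v in volumes),
        iops=n * min(v.iops for v in volumes),
        throughput_mib_s=n * min(v.throughput_mib_s for v in volumes),
    )


# Two volumes at the gp3 baseline from the notes (3,000 IOPS, 125 MiB/s).
stripe = raid0_totals([Volume(500, 3000, 125), Volume(500, 3000, 125)])
print(stripe)
```

The same arithmetic shows the risk: capacity and performance double, but so does the number of volumes whose failure destroys the whole set.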

Volume Status Checks: Automated checks run every 5 minutes and report a status for each volume. Separately, EBS sends volume metrics to CloudWatch at 1-minute intervals.
- ok: Everything is functioning normally.
- impaired: The volume is unavailable or I/O is stalled.
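Data Consistency: If AWS detects a potential data inconsistency, it disables I/O to the volume. To resume service, check the data and then enable I/O on the volume (for example, with the enable-volume-io CLI command), or set the volume's AutoEnableIO attribute so that I/O is re-enabled automatically.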
OS Level Monitoring: Use iostat -xdmzt 1 on Linux or Perfmon on Windows to identify "micro-bursting" (latency or I/O spikes shorter than the 1-minute CloudWatch reporting interval).
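
The reason per-second OS tools catch what CloudWatch misses can be shown with synthetic numbers. In the sketch below, the latency samples and the 20 ms threshold are made up for illustration, and the function name is hypothetical. A 3-second spike is obvious in the per-second data but nearly disappears in the 1-minute average.

```python
from statistics import mean


def find_microbursts(per_second_latency_ms, threshold_ms):
    """Return the indices (seconds) of samples that exceed the threshold."""
    return [i for i, value in enumerate(per_second_latency_ms) if value > threshold_ms]


# One minute of synthetic per-second latency: 1 ms, with a 3-second spike.
samples = [1.0] * 60
samples[10:13] = [40.0, 55.0, 38.0]

minute_average = mean(samples)  # roughly what a 1-minute datapoint would show
spikes = find_microbursts(samples, threshold_ms=20.0)
print(round(minute_average, 2), spikes)
```

The averaged figure stays well under the threshold even though three individual seconds were far above it.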

Instance Store Characteristics:
- Physically attached to the host server, resulting in very low latency and high IOPS.
- Data Volatility: Data is lost if the instance is stopped, hibernated, or terminated, or if the underlying disk fails. Data persists only across instance reboots.
Use Cases: Temporary files, scratch space, distributed file systems (like HDFS), and swap files.

Modifications: You can increase size, change volume type (e.g., gp2 to io1), or adjust IOPS/throughput on the fly without downtime.
Cooldown Period: Once a modification starts, you must wait at least $6$ hours before modifying the same volume again.
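File System Extension: After the EBS volume is resized in AWS, the OS does not use the new space automatically. If the volume is partitioned, extend the partition first (for example, with growpart on Linux), then extend the file system (e.g., ext4, xfs, or NTFS) using commands like resize2fs or xfs_growfs.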
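
The cooldown rule above amounts to simple date arithmetic. The sketch below computes the earliest time a volume could be modified again. The function names and the example timestamps are illustrative, and the 6-hour value is the one stated in these notes.

```python
from datetime import datetime, timedelta, timezone

MODIFICATION_COOLDOWN = timedelta(hours=6)  # waiting period stated in the notes


def next_modification_time(last_started: datetime) -> datetime:
    """Earliest time the same volume can be modified again."""
    return last_started + MODIFICATION_COOLDOWN


def can_modify(last_started: datetime, now: datetime) -> bool:
    """True once the cooldown since the last modification has elapsed."""
    return now >= next_modification_time(last_started)


started = datetime(2024, 1, 1, 8, 0, tzinfo=timezone.utc)
print(next_modification_time(started))
print(can_modify(started, started + timedelta(hours=5)))
print(can_modify(started, started + timedelta(hours=6)))
```

Because of this wait, it is usually worth applying size, type, and performance changes in a single modification.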

Multi-Attach: Available for io1 and io2 volumes on Nitro-based instances in the same Availability Zone as the volume.
Allows up to $16$ instances to mount the same volume simultaneously.
Requires a cluster-aware file system (e.g., GFS2, OCFS2) to manage write-locking and data integrity.

Incremental Nature: Only the blocks changed since the last snapshot are stored, reducing storage costs.
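
The incremental behaviour can be illustrated with a toy model in which a volume is a mapping from block number to content. This is a simplified sketch, not how EBS stores snapshots internally. The function names are hypothetical.

```python
def take_snapshot(volume: dict[int, bytes], previous_state: dict[int, bytes]) -> dict[int, bytes]:
    """Store only the blocks whose content differs from the previous snapshot state."""
    return {blk: data for blk, data in volume.items() if previous_state.get(blk) != data}


def restore(chain: list[dict[int, bytes]]) -> dict[int, bytes]:
    """Rebuild the volume by applying each snapshot in the chain, oldest first."""
    state: dict[int, bytes] = {}
    for snap in chain:
        state.update(snap)
    return state


volume = {0: b"a", 1: b"b", 2: b"c"}
snap1 = take_snapshot(volume, {})                # first snapshot: all 3 blocks
volume[1] = b"B"                                 # one block changes
snap2 = take_snapshot(volume, restore([snap1]))  # second snapshot: 1 block
print(len(snap1), len(snap2), restore([snap1, snap2]) == volume)
```

In this model the second snapshot holds one block instead of three, yet the chain still restores the full volume.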
Fast Snapshot Restore (FSR): Eliminates the need for pre-warming (reading all blocks once). Volumes created from an FSR-enabled snapshot are fully initialized at creation and deliver their provisioned performance immediately.
Amazon Data Lifecycle Manager (DLM):
- Automates snapshot creation and deletion based on tags.
- Supports cross-region copy and cross-account sharing policies.
Recycle Bin: Provides a safety net for snapshots and AMIs. Deleted snapshots are retained in the Recycle Bin for a specified period ($1$ to $365$ days) before being permanently purged.

Protocol: Amazon EFS is a managed NFS (Network File System) service for Linux-based workloads.
Storage Tiers:
- Standard: For active data.
- Infrequent Access (IA): Lower storage cost, with a charge for each access; files are moved here automatically by Lifecycle Management if not accessed for a configured period (for example, $7$, $14$, $30$, $60$, or $90$ days).
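
The lifecycle rule for the two tiers above can be sketched as a single comparison. The function name is hypothetical, and the exact boundary (whether a file moves on day N or the day after) is an assumption of this simplified model.

```python
def efs_storage_class(days_since_last_access: int, transition_after_days: int) -> str:
    """Return the tier a file would sit in under a simple age-based lifecycle rule."""
    return "IA" if days_since_last_access >= transition_after_days else "Standard"


# With a 30-day policy: a file read last week stays put, an older one moves to IA.
print(efs_storage_class(7, 30))
print(efs_storage_class(45, 30))
```

A shorter transition period moves files to IA sooner, which lowers storage cost but raises the chance of paying to read them back.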
Performance Modes:
- General Purpose: Standard mode.
- Max I/O: For massive scale; higher latency but higher aggregate throughput.

FSx for Windows File Server: Fully managed native Windows SMB file system (supports NTFS and Active Directory).
FSx for Lustre: Designed for High-Performance Computing (HPC), machine learning, and video processing. Can process data directly from S3.
FSx for NetApp ONTAP: Provides the full capabilities of the NetApp ONTAP file system in the cloud.