1. Storage in the Data Pipeline

Data Storage Considerations in Analytics Pipelines

This section outlines considerations for storing raw and processed data within an analytics pipeline, focusing on a modern data architecture.

The modern data architecture places a data lake at the center, surrounded by specialized data stores.
It supports data movement into, out of, and between storage components.
Goal: Make data accessible, control who can access it, and move it efficiently for analysis.

Amazon S3: Central component for data lake functionality.
AWS Lake Formation: Manages the data lake.
AWS Glue: Provides the data catalog.
Amazon Athena: SQL query engine for direct analysis of data within the lake.
Additional AWS storage services supplement this architecture (refer to the student guide and the AWS Well-Architected Framework for details).
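
To illustrate how Athena analyzes data in place, the following minimal sketch builds the request for Athena's StartQueryExecution API. The database, table, and bucket names are hypothetical placeholders, and the actual call (shown in a comment) assumes boto3 is installed and AWS credentials are configured.

```python
def build_athena_request(sql: str, database: str, output_location: str) -> dict:
    """Build keyword arguments for Athena's StartQueryExecution API."""
    return {
        "QueryString": sql,
        # Database registered in the AWS Glue Data Catalog.
        "QueryExecutionContext": {"Database": database},
        # Athena writes query results to this S3 location.
        "ResultConfiguration": {"OutputLocation": output_location},
    }


# Hypothetical names: database, table, and results bucket are placeholders.
request = build_athena_request(
    sql="SELECT status, COUNT(*) AS hits FROM web_logs GROUP BY status",
    database="analytics_lake",
    output_location="s3://example-athena-results/queries/",
)

# With boto3 and credentials available, the request could be submitted as:
#   import boto3
#   boto3.client("athena").start_query_execution(**request)
```

The query runs against files in S3 using the table definition in the catalog, so no data is loaded into a separate database first.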

Key Integration: Combination of Amazon S3 for the data lake and Amazon Redshift for the data warehouse.
Data Warehouse (e.g., Amazon Redshift):
- Use Case: Highly structured, curated data, complex queries, and business analytics.
- Cost: Higher storage cost.
Data Lake (e.g., Amazon S3):
- Use Case: Raw data in any format (structured, semi-structured, or unstructured) available for exploration.
- Cost: Lower storage cost.

Balance speed requirements with cost considerations.
Evaluate the business value gained from faster access against the associated cost.
Amazon S3: Lower storage cost than Amazon Redshift.
Performance can be optimized in S3 by organizing data according to access patterns (e.g., storing logs by date under separate prefixes, which appear as folders).
Data Warehouses: Efficient querying over large datasets spanning long periods might justify the higher cost.
Amazon Redshift Spectrum: Queries data in Amazon S3 directly, without loading it into Redshift, which offers a cost-effective option.
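
As a sketch of organizing S3 data by access pattern, the helper below builds date-based object keys for log files. The `logs` root and the `year=/month=/day=` naming are assumptions chosen for illustration (this Hive-style layout is a common convention that query engines such as Athena can treat as partitions); the notes themselves only call for storing logs by date.

```python
from datetime import date


def log_object_key(event_date: date, filename: str, root: str = "logs") -> str:
    """Return an S3 object key that places a log file under its date."""
    return (
        f"{root}/year={event_date.year:04d}/"
        f"month={event_date.month:02d}/day={event_date.day:02d}/{filename}"
    )


def month_prefix(year: int, month: int, root: str = "logs") -> str:
    """Return the key prefix shared by every log object in one month."""
    return f"{root}/year={year:04d}/month={month:02d}/"


# Hypothetical file name, used only to show the resulting layout.
key = log_object_key(date(2024, 3, 7), "app.log.gz")
print(key)                    # logs/year=2024/month=03/day=07/app.log.gz
print(month_prefix(2024, 3))  # logs/year=2024/month=03/
```

A query or listing limited to one month then touches only the objects under that prefix rather than scanning the whole bucket.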

Consider how long data needs to be stored and the associated cost.
Determine when data can be archived to lower-cost storage options such as Amazon S3.
Within Amazon S3: Optimize costs by moving data to different storage classes over time based on access frequency.
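
Moving data between storage classes over time is typically expressed as an S3 lifecycle configuration. The sketch below builds one as a plain dictionary in the shape accepted by boto3's `put_bucket_lifecycle_configuration`. The day thresholds, the `logs/` prefix, and the bucket name are illustrative assumptions, not recommendations from the notes.

```python
from typing import Optional


def build_lifecycle_rule(
    prefix: str,
    ia_after_days: int = 30,
    archive_after_days: int = 90,
    expire_after_days: Optional[int] = None,
) -> dict:
    """Build one lifecycle rule that tiers objects under a prefix as they age."""
    if not 0 < ia_after_days < archive_after_days:
        raise ValueError("transition days must be positive and increasing")
    rule = {
        "ID": f"tier-{prefix.strip('/').replace('/', '-')}",
        "Filter": {"Prefix": prefix},
        "Status": "Enabled",
        "Transitions": [
            # Infrequently accessed data moves to a cheaper class first.
            {"Days": ia_after_days, "StorageClass": "STANDARD_IA"},
            # Rarely accessed data moves to archival storage.
            {"Days": archive_after_days, "StorageClass": "GLACIER"},
        ],
    }
    if expire_after_days is not None:
        # Optionally delete objects once the retention period ends.
        rule["Expiration"] = {"Days": expire_after_days}
    return rule


# Assumed thresholds: 30 days to Standard-IA, 90 to Glacier, delete after 365.
lifecycle_configuration = {
    "Rules": [build_lifecycle_rule("logs/", expire_after_days=365)]
}

# With boto3 and credentials available, this could be applied as:
#   import boto3
#   boto3.client("s3").put_bucket_lifecycle_configuration(
#       Bucket="example-data-lake",  # hypothetical bucket name
#       LifecycleConfiguration=lifecycle_configuration,
#   )
```

The right thresholds depend on how often the data is actually read and how long it must be retained.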

The nature of the data influences storage choices.
Examples:
- Business data from relational databases.
- Real-time data collected in a stream.
Data Transformation: Transformed data can be stored in either the data warehouse or data lake for analytics.

Goal: Select storage that optimizes cost and business value.
Data pipelines utilize a combination of storage types.
Data may transition through different storage types as it moves through the pipeline.
Major Components: AWS data architecture includes an Amazon S3 data lake and an Amazon Redshift data warehouse.