Storage in the Data Pipeline
Data Storage Considerations in Analytics Pipelines
This section outlines considerations for storing raw and processed data within an analytics pipeline, focusing on a modern data architecture.
Modern Data Architecture
- Emphasizes a data lake at the center, surrounded by specialized data stores.
- Facilitates data movement into, out of, and between storage components.
- Goal: Make data accessible, control who can access it, and move data efficiently to where it is analyzed.
AWS Services for Data Lake Architecture
- Amazon S3: Object storage that serves as the central data lake store.
- AWS Lake Formation: Builds and secures the data lake, including centralized permissions management.
- AWS Glue: Provides the Data Catalog (plus crawlers and ETL jobs that populate it).
- Amazon Athena: Serverless SQL query engine for analyzing data in place within the lake.
- Additional AWS storage services supplement this architecture (refer to the student guide and the AWS Well-Architected Framework for details).
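For illustration, the sketch below shows one way these pieces fit together: a SQL query submitted to Athena against a table that the Glue Data Catalog already describes, with results written back to S3. The database, table, and bucket names are placeholders, not values from this course.

```python
"""Minimal sketch: querying data lake files in Amazon S3 with Amazon Athena.

Assumes a Glue database named "sales_lake" and an S3 bucket
"example-athena-results" for query output -- both are placeholder names.
"""
import time

import boto3  # AWS SDK for Python

athena = boto3.client("athena")

# Start a SQL query against a table already described in the Glue Data Catalog
# (for example, created by a Glue crawler over the lake).
response = athena.start_query_execution(
    QueryString="SELECT order_date, SUM(amount) AS total FROM orders GROUP BY order_date",
    QueryExecutionContext={"Database": "sales_lake"},                        # placeholder database
    ResultConfiguration={"OutputLocation": "s3://example-athena-results/"},  # placeholder bucket
)
query_id = response["QueryExecutionId"]

# Poll until the query finishes, then fetch the result rows.
while True:
    state = athena.get_query_execution(QueryExecutionId=query_id)["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

if state == "SUCCEEDED":
    for row in athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]:
        print([col.get("VarCharValue") for col in row["Data"]])
```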
Data Lake vs. Data Warehouse (Amazon S3 vs. Amazon Redshift)
- Key Integration: Combination of Amazon S3 for the data lake and Amazon Redshift for the data warehouse.
- Data Warehouse (e.g., Amazon Redshift):
- Use Case: Highly structured, curated data, complex queries, and business analytics.
- Cost: Higher storage cost.
- Data Lake (e.g., Amazon S3):
- Use Case: Raw data in any format (structured, semi-structured, or unstructured), kept available for exploration.
- Cost: Lower storage cost.
Factors Driving Pipeline Storage Choices
1. Performance vs. Cost
- Balance speed requirements with cost considerations.
- Evaluate the business value gained from faster access against the associated cost.
- Amazon S3: Lower storage cost than Amazon Redshift.
- Performance in Amazon S3 can be optimized by organizing data according to access patterns (for example, storing logs by date in separate folders; see the sketch after this list).
- Data Warehouses: Efficient querying over large datasets spanning long periods might justify the higher cost.
- Amazon Redshift Spectrum: Queries data in S3 buckets directly, without loading it into Redshift, offering a cost-effective middle ground.
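As one way to picture the Spectrum approach, the sketch below defines an external table over date-partitioned log files in S3 and queries it through the Redshift Data API, so only the matching date prefixes are scanned and nothing is loaded into the cluster. The schema, cluster, database, user, and bucket names are all placeholders, and an external schema backed by an IAM role is assumed to exist already.

```python
"""Minimal sketch: exposing date-partitioned S3 logs to Amazon Redshift Spectrum.

Assumes an existing external schema ("spectrum_logs") created with an IAM role,
plus a cluster, database, and user -- every name here is a placeholder.
"""
import boto3

redshift_data = boto3.client("redshift-data")  # Redshift Data API

# External table over log files laid out by date, e.g.
#   s3://example-log-bucket/app_logs/log_date=2024-01-15/...
# Spectrum reads the files in place; nothing is copied into the cluster.
# (Each date partition must also be registered, e.g. ALTER TABLE ... ADD PARTITION.)
ddl = """
CREATE EXTERNAL TABLE spectrum_logs.app_logs (
    request_id VARCHAR(64),
    status_code INT,
    latency_ms INT
)
PARTITIONED BY (log_date DATE)
STORED AS PARQUET
LOCATION 's3://example-log-bucket/app_logs/';
"""

# Querying by log_date lets Spectrum prune to the matching S3 prefixes.
query = """
SELECT log_date, COUNT(*) AS errors
FROM spectrum_logs.app_logs
WHERE status_code >= 500 AND log_date >= DATE '2024-01-01'
GROUP BY log_date
ORDER BY log_date;
"""

# execute_statement is asynchronous; in practice, wait for the DDL to finish
# (e.g., via describe_statement) before submitting the query.
for sql in (ddl, query):
    redshift_data.execute_statement(
        ClusterIdentifier="example-cluster",   # placeholder cluster
        Database="analytics",                  # placeholder database
        DbUser="analyst",                      # placeholder user
        Sql=sql,
    )
```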
2. Data Retention
- Consider how long data needs to be stored and the associated cost.
- Determine when data can be archived to lower-cost storage options such as Amazon S3.
- Within Amazon S3: Optimize costs by transitioning data to lower-cost storage classes over time, based on access frequency (for example, with S3 Lifecycle rules; see the sketch below).
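A hedged example of this kind of tiering: the sketch below applies an S3 Lifecycle configuration that transitions objects under a raw-data prefix to lower-cost classes as they age. The bucket name, prefix, and day thresholds are illustrative assumptions, not recommendations.

```python
"""Minimal sketch: S3 Lifecycle rules for aging raw data into cheaper storage.

The bucket name, prefix, and day counts are placeholders chosen for illustration.
"""
import boto3

s3 = boto3.client("s3")

s3.put_bucket_lifecycle_configuration(
    Bucket="example-data-lake",  # placeholder bucket
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "tier-raw-data",
                "Filter": {"Prefix": "raw/"},   # applies only to raw, rarely re-read data
                "Status": "Enabled",
                "Transitions": [
                    # Move to Infrequent Access after 90 days, then archive to Glacier.
                    {"Days": 90, "StorageClass": "STANDARD_IA"},
                    {"Days": 365, "StorageClass": "GLACIER"},
                ],
                # Delete objects once the (placeholder) retention period ends.
                "Expiration": {"Days": 2555},
            }
        ]
    },
)
```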
3. Data Characteristics
- The nature of the data influences storage choices.
- Examples:
- Business data from relational databases.
- Real-time data collected in a stream.
- Data Transformation: Transformed data can be stored in either the data warehouse or data lake for analytics.
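As a small illustration of the last point, the sketch below writes a transformed batch into the S3 data lake as Parquet under a date-based prefix, where engines such as Athena or Redshift Spectrum could later query it. The bucket, key prefix, and column names are placeholders, and pandas with pyarrow is assumed to be installed.

```python
"""Minimal sketch: landing transformed records in the S3 data lake as Parquet.

Assumes pandas and pyarrow are available; the bucket, prefix, and columns
are placeholder values for illustration only.
"""
import io

import boto3
import pandas as pd

# Transformed output of an upstream pipeline step (placeholder data).
df = pd.DataFrame(
    {
        "order_id": [101, 102],
        "order_date": ["2024-01-15", "2024-01-15"],
        "amount": [19.99, 5.25],
    }
)

# Serialize to columnar Parquet, a format analytics engines scan efficiently.
buffer = io.BytesIO()
df.to_parquet(buffer, index=False)

# Store under a date-based prefix so later queries can prune by date.
boto3.client("s3").put_object(
    Bucket="example-data-lake",                                    # placeholder bucket
    Key="curated/orders/order_date=2024-01-15/part-000.parquet",   # placeholder key
    Body=buffer.getvalue(),
)
```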
Key Takeaway
- Goal: Select storage that optimizes cost and business value.
- Data pipelines utilize a combination of storage types.
- Data may transition through different storage types as it moves through the pipeline.
- Major Components: The AWS modern data architecture pairs an Amazon S3 data lake with an Amazon Redshift data warehouse.