Storage in the Data Pipeline

Data Storage Considerations in Analytics Pipelines

This section outlines considerations for storing raw and processed data within an analytics pipeline, focusing on a modern data architecture.

Modern Data Architecture

  • Emphasizes a data lake at the center, surrounded by specialized data stores.
  • Facilitates data movement into, out of, and between storage components.
  • Goal: Make data accessible, control who can access it, and move it efficiently to where it is analyzed.

AWS Services for Data Lake Architecture

  • Amazon S3: Central component for data lake functionality.
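  • AWS Lake Formation: Builds, secures, and manages the data lake, including centralized access permissions.
  • AWS Glue: Provides the data catalog (AWS Glue Data Catalog), which stores table metadata such as schemas and data locations.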
  • Amazon Athena: SQL query engine for direct analysis of data within the lake.
  • Additional AWS storage services supplement this architecture (refer to the student guide and the AWS Well-Architected Framework for details).
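
To illustrate how Athena fits into this architecture, the sketch below assembles the request parameters for a SQL query against a table registered in the data catalog. The database name `analytics_db`, the table `web_logs`, its partition columns, and the results bucket are hypothetical placeholders, not names from this course. The dictionary keys match the parameters of boto3's Athena `start_query_execution` call, but the code itself does not contact AWS.

```python
def athena_query_request(sql: str, database: str, output_location: str) -> dict:
    """Assemble keyword arguments for an Athena query.

    The result could be passed to a boto3 Athena client as
    client.start_query_execution(**request); that call is not made here.
    """
    return {
        "QueryString": sql,
        # The database is a catalog database that holds the table metadata.
        "QueryExecutionContext": {"Database": database},
        # Athena writes query results to an S3 location.
        "ResultConfiguration": {"OutputLocation": output_location},
    }


# Hypothetical table and columns, used only for illustration.
sql = (
    "SELECT status, COUNT(*) AS requests "
    "FROM web_logs "
    "WHERE year = '2024' AND month = '03' "
    "GROUP BY status"
)
request = athena_query_request(
    sql,
    database="analytics_db",                           # hypothetical
    output_location="s3://example-athena-results/q/",  # hypothetical
)
print(request["QueryExecutionContext"])
```

Filtering on partition columns, as in the `WHERE` clause above, is one way to limit how much data a query has to scan.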

Data Lake vs. Data Warehouse (Amazon S3 vs. Amazon Redshift)

  • Key Integration: Combination of Amazon S3 for the data lake and Amazon Redshift for the data warehouse.
  • Data Warehouse (e.g., Amazon Redshift):
    • Use Case: Highly structured, curated data, complex queries, and business analytics.
    • Cost: Higher storage cost.
  • Data Lake (e.g., Amazon S3):
    • Use Case: Raw data in any format (structured, semi-structured, or unstructured) available for exploration.
    • Cost: Lower storage cost.

Factors Driving Pipeline Storage Choices

1. Performance vs. Cost

  • Balance speed requirements with cost considerations.
  • Evaluate the business value gained from faster access against the associated cost.
  • Amazon S3: Lower storage cost than Amazon Redshift.
  • Performance can be optimized in S3 by organizing objects according to access patterns (e.g., storing logs under a separate key prefix, or "folder", for each date).
  • Data Warehouses: Efficient querying over large datasets spanning long periods might justify the higher cost.
  • Amazon Redshift Spectrum: Queries data in Amazon S3 directly, without loading it into Redshift, which can be a cost-effective option.
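
The idea of organizing S3 data by access pattern can be sketched as a small helper that builds date-partitioned object keys. This is a minimal sketch: the function name, the `logs/web` prefix, and the `year=/month=/day=` layout are illustrative assumptions rather than a required convention.

```python
from datetime import date


def log_object_key(base_prefix: str, day: date, filename: str) -> str:
    """Build an S3 object key that places a log file under a date-based prefix."""
    return (
        f"{base_prefix.strip('/')}/"
        f"year={day.year:04d}/month={day.month:02d}/day={day.day:02d}/"
        f"{filename}"
    )


# Hypothetical prefix and file name, used only for illustration.
key = log_object_key("logs/web", date(2024, 3, 7), "part-0000.json.gz")
print(key)  # logs/web/year=2024/month=03/day=07/part-0000.json.gz
```

With keys laid out this way, a query for a single day only needs to read the objects under that day's prefix instead of the whole log history.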

2. Data Retention

  • Consider how long data needs to be stored and the associated cost.
  • Determine when data can be archived to lower-cost storage options such as Amazon S3.
  • Within Amazon S3: Optimize costs by moving data to different storage classes over time based on access frequency.
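
Moving data between storage classes over time is typically expressed as an S3 lifecycle configuration. The sketch below only builds such a configuration as a Python dictionary, in the shape accepted by boto3's `put_bucket_lifecycle_configuration`. The prefix, the day thresholds, and the choice of the `STANDARD_IA` and `GLACIER` classes are assumptions for illustration; real values depend on how often the data is accessed.

```python
import json
from typing import Optional


def build_lifecycle_configuration(
    prefix: str,
    days_to_infrequent_access: int,
    days_to_archive: int,
    days_to_expire: Optional[int] = None,
) -> dict:
    """Build a lifecycle configuration that tiers objects under a prefix."""
    if not 0 < days_to_infrequent_access < days_to_archive:
        raise ValueError("transition days must be positive and increasing")

    rule = {
        "ID": f"tier-{prefix.strip('/').replace('/', '-')}",
        "Filter": {"Prefix": prefix},
        "Status": "Enabled",
        "Transitions": [
            # Less frequently accessed data moves to a cheaper class first...
            {"Days": days_to_infrequent_access, "StorageClass": "STANDARD_IA"},
            # ...and rarely accessed data moves to archival storage later.
            {"Days": days_to_archive, "StorageClass": "GLACIER"},
        ],
    }
    if days_to_expire is not None:
        # Optionally delete objects once the retention period has passed.
        rule["Expiration"] = {"Days": days_to_expire}
    return {"Rules": [rule]}


# Hypothetical prefix and thresholds, used only for illustration.
config = build_lifecycle_configuration("logs/web/", 30, 90, days_to_expire=365)
print(json.dumps(config, indent=2))
```

The resulting dictionary could be supplied as the `LifecycleConfiguration` argument when applying the rule to a bucket; that call is left out so the sketch runs without AWS credentials.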

3. Data Characteristics

  • The nature of the data influences storage choices.
  • Examples:
    • Business data from relational databases.
    • Real-time data collected in a stream.
  • Data Transformation: Transformed data can be stored in either the data warehouse or data lake for analytics.

Key Takeaway

  • Goal: Select storage that optimizes cost and business value.
  • Data pipelines utilize a combination of storage types.
  • Data may transition through different storage types as it moves through the pipeline.
  • Major Components: AWS data architecture includes an Amazon S3 data lake and an Amazon Redshift data warehouse.