Study Guide: Fundamentals of Data Engineering

Fundamentals of Data Engineering

  • This text, authored by Joe Reis and Matt Housley, provides a comprehensive overview of the data engineering landscape, focusing on principles that encompass any relevant technology and aim to stand the test of time.

  • The central framework discussed is the Data Engineering Lifecycle.

Data Engineering Defined

  • Definition: The development, implementation, and maintenance of systems and processes that take in raw data and produce high-quality, consistent information to support downstream use cases such as analysis and machine learning.

  • Intersectionality: Data engineering exists at the intersection of security, data management, DataOps, data architecture, orchestration, and software engineering.

The Data Engineering Lifecycle Stages

  • Generation (Source Systems): The origin of data (IoT, apps, databases). The data engineer consumes but usually does not control these.

  • Storage: Underpins the entire lifecycle. Systems and patterns for data persistence.

  • Ingestion: The process of moving data from source systems into the lifecycle.

  • Transformation: Converting data from its raw form into a useful state for downstream consumption.

  • Serving: Getting value from data via analytics, ML, or reverse ETL.

Major Undercurrents

  • Security: Must be "top of mind." Includes the Principle of Least Privilege (giving users access only to what is essential) and protecting data at rest and in flight.

  • Data Management: Includes data governance, metadata management, data quality, and master data management (MDM).

  • DataOps: Mapping Agile, DevOps, and statistical process control (SPC) to the data domain. Pillars include Automation, Observability/Monitoring, and Incident Response.

  • Data Architecture: The design of systems to support evolving data needs through evaluation of trade-offs.

  • Orchestration: Coordinating many jobs so they run efficiently on a scheduled cadence, with dependencies between jobs typically expressed as a Directed Acyclic Graph (DAG).

  • Software Engineering: Applying production-grade engineering practices to data processing (SQL, Python, JVM languages).
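
The Orchestration bullet above mentions DAGs. As a minimal sketch of the idea, the standard-library `graphlib` module (Python 3.9+) can derive a valid run order from job dependencies. The job names are hypothetical, and a real orchestrator adds scheduling, retries, and monitoring on top of this ordering step.

```python
import graphlib

# Hypothetical pipeline: each key is a job, each value is the set of jobs
# that must finish before that job can start.
dependencies = {
    "ingest_orders": set(),
    "ingest_customers": set(),
    "transform_sales": {"ingest_orders", "ingest_customers"},
    "serve_dashboard": {"transform_sales"},
}

def run_order(deps):
    """Return one valid execution order; raises graphlib.CycleError on a cycle."""
    return list(graphlib.TopologicalSorter(deps).static_order())

order = run_order(dependencies)
print(order)  # both ingest jobs come first, serve_dashboard comes last
```

Because the graph is acyclic, every job appears after all of its upstream jobs. A circular dependency has no valid order, which is why `run_order` raises an error in that case.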

Data Maturity and Roles

  • Data Maturity Stages:

    • Stage 1: Starting with data (early planning, small teams).

    • Stage 2: Scaling with data (formal practices, specialized roles).

    • Stage 3: Leading with data (data-driven culture, self-service).

  • Roles:

    • Type A Data Engineer: "A" for Abstraction. Prefers off-the-shelf, managed services.

    • Type B Data Engineer: "B" for Build. Creates custom tools for core competitive advantage.

Data Storage: Raw Ingredients

  • Magnetic Disk Drives (HDD): Slower but cheap (about $0.03/GB). High capacity but limited by seek time and rotational latency (average >4 ms).

  • Solid-State Drives (SSD): Faster, electronic storage (about $0.20/GB). Low latency (<0.1 ms) and high IOPS.

  • Random Access Memory (RAM): Volatile, ultrafast (about 100 GB/s bandwidth, 0.1 microsecond latency), but expensive (about $10/GB).

  • Serialization: The process of flattening data for storage/transmission (e.g., JSON, Parquet, Avro).

  • Compression: Reducing data size. Algorithms like Snappy or Gzip increase effective disk and network bandwidth.
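
To make the Serialization and Compression bullets concrete, here is a small sketch using only the Python standard library. JSON and Gzip are both named above; the sample records are invented for illustration. Snappy, Parquet, and Avro need third-party packages and are not shown.

```python
import gzip
import json

# Hypothetical records: repetitive rows, as event data often is.
records = [{"user_id": i % 10, "event": "page_view", "ms": 120} for i in range(1000)]

# Serialization: in-memory objects -> a flat sequence of bytes.
serialized = json.dumps(records).encode("utf-8")

# Compression: the same information in fewer bytes.
compressed = gzip.compress(serialized)

# Round trip: decompress, then deserialize, to recover the original objects.
restored = json.loads(gzip.decompress(compressed).decode("utf-8"))

print(len(serialized), len(compressed))
```

Fewer bytes on disk or on the wire means more records move per second over the same hardware, which is the sense in which compression increases effective bandwidth. The cost is CPU time spent compressing and decompressing.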

Data Storage System Types

  • Block Storage: Virtualized storage like Amazon EBS; mimics a physical disk.

  • Object Storage: Key-value store of immutable objects (e.g., Amazon S3); an object is written once and replaced wholesale rather than updated in place. Highly scalable and the gold standard for data lakes.

  • Distributed Storage: Coordinating multiple servers (nodes) for redundancy and scale.

  • Consistency Models:

    • Strong Consistency: Reads always return the latest write.

    • Eventual Consistency: Part of the BASE model (Basically Available, Soft-state, Eventual consistency). Replicas converge over time, so a read may briefly return stale data.
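
The difference between the two consistency models can be illustrated with a toy simulation. This is a sketch under strong simplifying assumptions (one leader, followers updated only when `replicate()` is called); the class and method names are hypothetical, and real distributed stores rely on quorums or consensus protocols instead.

```python
class ToyCluster:
    """Toy model: one leader plus followers that lag until replicate() runs."""

    def __init__(self, n_followers=2):
        self.leader = {}
        self.followers = [{} for _ in range(n_followers)]
        self.pending = []  # writes accepted by the leader but not yet replicated

    def write(self, key, value):
        self.leader[key] = value
        self.pending.append((key, value))

    def read_strong(self, key):
        # Strongly consistent read in this toy: always served by the leader.
        return self.leader.get(key)

    def read_eventual(self, key, follower=0):
        # Eventually consistent read: served by a follower, possibly stale.
        return self.followers[follower].get(key)

    def replicate(self):
        # Apply all pending writes to every follower, in order.
        for key, value in self.pending:
            for follower in self.followers:
                follower[key] = value
        self.pending.clear()

cluster = ToyCluster()
cluster.write("balance", 100)
stale = cluster.read_eventual("balance")      # None: follower has not caught up
fresh = cluster.read_strong("balance")        # 100: leader has the latest write
cluster.replicate()
converged = cluster.read_eventual("balance")  # 100: replicas have converged
```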

Queries, Modeling, and Transformation

  • Normalization: Reducing redundancy (1NF, 2NF, 3NF).

  • Modeling Techniques:

    • Inmon: The "Corporate Information Factory." Highly normalized (3NF) central warehouse.

    • Kimball: Bottom-up approach. Uses Star Schema (Fact tables for quantitative events, Dimension tables for qualitative context).

    • Data Vault: Uses Hubs (keys), Links (relationships), and Satellites (attributes).

  • Query Performance Optimization: Avoid full table scans, use pruning, and leverage cached query results.

  • Streaming Data Windowing:

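    • Fixed/Tumbling Windows: Fixed-length, non-overlapping time intervals; each event belongs to exactly one window.

    • Sliding Windows: Fixed-length windows that overlap, so one event can fall into several windows.

    • Session Windows: Group events that occur close together; a gap of inactivity longer than a set threshold closes the session.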
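
The three window types above can be sketched over a list of event timestamps. This is an illustrative batch implementation, not how a streaming engine works internally; the timestamps, function names, and the choice to start sliding windows at time 0 are all assumptions.

```python
from collections import defaultdict

# Hypothetical event timestamps, in seconds.
events = [1, 2, 7, 12, 13, 14, 31, 33]

def tumbling_windows(timestamps, size):
    """Assign each event to exactly one window [start, start + size)."""
    windows = defaultdict(list)
    for t in timestamps:
        windows[(t // size) * size].append(t)
    return dict(windows)

def sliding_windows(timestamps, size, step):
    """Windows of length `size` starting every `step` seconds from 0.
    With step < size the windows overlap; empty windows are omitted."""
    windows = {}
    if not timestamps:
        return windows
    start, last = 0, max(timestamps)
    while start <= last:
        members = [t for t in timestamps if start <= t < start + size]
        if members:
            windows[start] = members
        start += step
    return windows

def session_windows(timestamps, gap):
    """Start a new session when the time since the previous event exceeds `gap`."""
    sessions = []
    for t in sorted(timestamps):
        if sessions and t - sessions[-1][-1] <= gap:
            sessions[-1].append(t)
        else:
            sessions.append([t])
    return sessions

print(tumbling_windows(events, 10))   # {0: [1, 2, 7], 10: [12, 13, 14], 30: [31, 33]}
print(sliding_windows(events, 10, 5)[5])  # [7, 12, 13, 14]
print(session_windows(events, 5))     # [[1, 2, 7, 12, 13, 14], [31, 33]]
```

Note that the event at second 7 appears in both the 0 and 5 sliding windows but in only one tumbling window, and that session boundaries depend on the data itself instead of the clock.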

Serving Data

  • Business Analytics: Dashboards and reports for human decision-making.

  • Operational Analytics: Real-time alerts and actions (e.g., monitoring application health).

  • Embedded Analytics: External-facing data apps for customers.

  • Reverse ETL: Pushing data from an OLAP system (Warehouse) back into a source system (e.g., CRM).

Future of Data Engineering

  • Live Data Stack: Movement toward true real-time applications and stream-based processing.

  • Fusion of Data and Apps: Data is created with analytics in mind from the start.

  • Cloud-Scale Data OS: Enhanced interoperability between services and standardized data APIs.

  • Dark Matter Data: The persistent importance of spreadsheets in business analysis.