Study Guide: Fundamentals of Data Engineering
Fundamentals of Data Engineering
This text, authored by Joe Reis and Matt Housley, provides a comprehensive overview of the data engineering landscape, focusing on principles that encompass any relevant technology and aim to stand the test of time.
The central framework discussed is the Data Engineering Lifecycle.
Data Engineering Defined
Definition: The development, implementation, and maintenance of systems and processes that take in raw data and produce high-quality, consistent information to support downstream use cases such as analysis and machine learning.
Scope: Data engineering sits at the intersection of security, data management, DataOps, data architecture, orchestration, and software engineering.
The Data Engineering Lifecycle Stages
Generation (Source Systems): The origin of data (IoT, apps, databases). The data engineer consumes but usually does not control these.
Storage: Underpins the entire lifecycle. Systems and patterns for data persistence.
Ingestion: The process of moving data from source systems into the lifecycle.
Transformation: Converting data from its raw form into a useful state for downstream consumption.
Serving: Getting value from data via analytics, ML, or reverse ETL.
Major Undercurrents
Security: Must be "top of mind." Includes the Principle of Least Privilege (giving users access only to what is essential) and protecting data at rest and in flight.
Data Management: Includes data governance, metadata management, data quality, and master data management (MDM).
DataOps: Mapping Agile, DevOps, and statistical process control (SPC) to the data domain. Pillars include Automation, Observability/Monitoring, and Incident Response.
Data Architecture: The design of systems to support evolving data needs through evaluation of trade-offs.
Orchestration: Coordinating many jobs to run efficiently on a scheduled cadence, typically by modeling the dependencies between jobs as a Directed Acyclic Graph (DAG).
Software Engineering: Applying production-grade engineering practices to data processing (SQL, Python, JVM languages).
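To make the orchestration undercurrent concrete, here is a minimal sketch of how a DAG of jobs can be turned into a valid run order. It uses Python's standard-library graphlib module; the task names are hypothetical and not taken from the book, and real orchestrators add scheduling, retries, and monitoring on top of this idea.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each task maps to the set of tasks it depends on.
dag = {
    "ingest_orders": set(),
    "ingest_customers": set(),
    "transform_sales": {"ingest_orders", "ingest_customers"},
    "serve_dashboard": {"transform_sales"},
}

def run_order(graph):
    """Return one valid execution order for the graph.

    Raises graphlib.CycleError if the graph contains a cycle,
    which is why orchestration graphs must be acyclic.
    """
    return list(TopologicalSorter(graph).static_order())

order = run_order(dag)
print(order)  # both ingest tasks come before transform_sales; serve_dashboard is last
```

The relative order of the two ingestion tasks is not fixed, because neither depends on the other. An orchestrator could run them in parallel.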
Data Maturity and Roles
Data Maturity Stages:
Stage 1: Starting with data (early planning, small teams).
Stage 2: Scaling with data (formal practices, specialized roles).
Stage 3: Leading with data (data-driven culture, self-service).
Roles:
Type A Data Engineer: "A" for Abstraction. Prefers off-the-shelf, managed services.
Type B Data Engineer: "B" for Build. Creates custom tools for core competitive advantage.
Data Storage: Raw Ingredients
Magnetic Disk Drives (HDD): Slower but cheap. High capacity, but performance is limited by seek time and rotational latency (average >4 ms).
Solid-State Drives (SSD): Faster, fully electronic storage with no moving parts. Low latency (<0.1 ms) and high IOPS.
Random Access Memory (RAM): Volatile and ultrafast (bandwidth measured in GB/s, sub-microsecond latency), but expensive.
Serialization: The process of flattening data for storage/transmission (e.g., JSON, Parquet, Avro).
Compression: Reducing data size. Algorithms like Snappy or Gzip increase effective disk and network bandwidth.
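The serialization and compression entries above can be illustrated with a short sketch that uses only the Python standard library. It assumes JSON as the serialization format and Gzip as the compression algorithm; the records are made up for illustration, and columnar formats such as Parquet would need a third-party library.

```python
import gzip
import json

# Hypothetical, highly repetitive records; repetition is what makes them compress well.
records = [{"sensor_id": i % 5, "status": "ok", "reading": 20.5} for i in range(1000)]

# Serialization: flatten in-memory objects into a byte sequence.
serialized = json.dumps(records).encode("utf-8")

# Compression: shrink the byte sequence before storing or sending it.
compressed = gzip.compress(serialized)

# Reading the data back reverses both steps.
restored = json.loads(gzip.decompress(compressed).decode("utf-8"))

ratio = len(serialized) / len(compressed)
print(len(serialized), len(compressed), round(ratio, 1))
```

Because fewer bytes cross the disk or network for the same logical data, the effective bandwidth rises, at the cost of CPU time spent compressing and decompressing.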
Data Storage System Types
Block Storage: Virtualized storage like Amazon EBS; mimics a physical disk.
Object Storage: Immutable key-value store (e.g., Amazon S3). Highly scalable and the gold standard for data lakes.
Distributed Storage: Coordinating multiple servers (nodes) for redundancy and scale.
Consistency Models:
Strong Consistency: Reads always return the latest write.
Eventual Consistency: Part of the BASE model (Basically Available, Soft-state, Eventual consistency). Replicas converge to the same value over time, so a read may briefly return stale data.
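The difference between the two consistency models can be seen in a toy simulation. This is a deliberately simplified sketch, not how any particular database works: the ToyStore class and its methods are hypothetical, and it assumes a single primary with one replica that is updated by an explicit replicate() call.

```python
class ToyStore:
    """Toy model: writes land on the primary and reach the replica later."""

    def __init__(self):
        self.primary = {}
        self.replica = {}
        self.pending = []  # replication log not yet applied to the replica

    def write(self, key, value):
        self.primary[key] = value
        self.pending.append((key, value))

    def read_strong(self, key):
        # Strong consistency: always read the node that has the latest write.
        return self.primary.get(key)

    def read_eventual(self, key):
        # Eventual consistency: read a replica that may lag behind.
        return self.replica.get(key)

    def replicate(self):
        # Apply the pending writes to the replica, in order.
        for key, value in self.pending:
            self.replica[key] = value
        self.pending.clear()


store = ToyStore()
store.write("user:1", "alice")
stale = store.read_eventual("user:1")      # None: replica has not caught up yet
fresh = store.read_strong("user:1")        # "alice"
store.replicate()
converged = store.read_eventual("user:1")  # "alice": replica has converged
```

Real distributed stores replicate in the background and often use quorums or versioning. The sketch only shows why an eventually consistent read can lag behind the latest write.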
Queries, Modeling, and Transformation
Normalization: Reducing redundancy by organizing tables into progressively stricter normal forms (1NF, 2NF, 3NF).
Modeling Techniques:
Inmon: The "Corporate Information Factory." Highly normalized (3NF) central warehouse.
Kimball: Bottom-up approach. Uses Star Schema (Fact tables for quantitative events, Dimension tables for qualitative context).
Data Vault: Uses Hubs (keys), Links (relationships), and Satellites (attributes).
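As an illustration of the Kimball star schema described above, the following sketch builds one fact table and one dimension table in an in-memory SQLite database and aggregates the facts by a dimension attribute. The table names, column names, and rows are hypothetical examples.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension table: qualitative context about each product.
CREATE TABLE dim_product (
    product_id INTEGER PRIMARY KEY,
    name TEXT,
    category TEXT
);
-- Fact table: quantitative events, linked to the dimension by a key.
CREATE TABLE fact_sales (
    sale_id INTEGER PRIMARY KEY,
    product_id INTEGER REFERENCES dim_product(product_id),
    quantity INTEGER,
    amount REAL
);
INSERT INTO dim_product VALUES
    (1, 'Widget', 'Hardware'), (2, 'Gadget', 'Hardware'), (3, 'Manual', 'Books');
INSERT INTO fact_sales VALUES
    (1, 1, 2, 20.0), (2, 2, 1, 15.0), (3, 3, 4, 40.0), (4, 1, 1, 10.0);
""")

# A typical star-schema query: join facts to a dimension, then aggregate.
rows = conn.execute("""
    SELECT d.category, SUM(f.amount) AS revenue
    FROM fact_sales AS f
    JOIN dim_product AS d ON d.product_id = f.product_id
    GROUP BY d.category
    ORDER BY d.category
""").fetchall()
print(rows)  # [('Books', 40.0), ('Hardware', 45.0)]
```

The fact table stays narrow and numeric, while descriptive attributes such as category live in the dimension table, so a new way of slicing the data usually means adding a dimension column rather than changing the facts.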
Query Performance Optimization: Avoid full table scans, use pruning, and leverage cached query results.
Streaming Data Windowing:
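Fixed/Tumbling Windows: Fixed-size, non-overlapping time intervals; each event belongs to exactly one window.
Sliding Windows: Fixed-size windows that overlap, so a single event can fall into several windows.
Session Windows: Group events by activity; a window closes after a gap of inactivity.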
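The three window types can be compared on the same set of events. The sketch below is a batch-style illustration in plain Python, assuming integer timestamps in seconds and windows aligned to time 0. The function names are hypothetical, and a real streaming engine would also handle late and out-of-order data.

```python
def tumbling_windows(timestamps, size):
    """Assign each timestamp to exactly one window [start, start + size)."""
    windows = {}
    for ts in timestamps:
        start = (ts // size) * size
        windows.setdefault((start, start + size), []).append(ts)
    return windows


def sliding_windows(timestamps, size, slide):
    """Assign each timestamp to every window [start, start + size) containing it.

    Window starts are multiples of `slide`; windows starting before time 0
    are ignored (a simplifying assumption of this sketch).
    """
    windows = {}
    for ts in timestamps:
        start = (ts // slide) * slide  # latest window start at or before ts
        while start > ts - size and start >= 0:
            windows.setdefault((start, start + size), []).append(ts)
            start -= slide
    return windows


def session_windows(timestamps, gap):
    """Group timestamps into sessions; a gap larger than `gap` starts a new one."""
    sessions = []
    for ts in sorted(timestamps):
        if sessions and ts - sessions[-1][-1] <= gap:
            sessions[-1].append(ts)
        else:
            sessions.append([ts])
    return sessions


events = [1, 2, 3, 12, 13, 30]  # hypothetical event times in seconds

tumbling = tumbling_windows(events, size=10)
sliding = sliding_windows(events, size=10, slide=5)
sessions = session_windows(events, gap=5)

print(tumbling)  # {(0, 10): [1, 2, 3], (10, 20): [12, 13], (30, 40): [30]}
print(sessions)  # [[1, 2, 3], [12, 13], [30]]
```

In the sliding case, events 12 and 13 appear in both the (5, 15) and (10, 20) windows, which is the overlap that distinguishes sliding from tumbling windows.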
Serving Data
Business Analytics: Dashboards and reports for human decision-making.
Operational Analytics: Real-time alerts and actions (e.g., monitoring application health).
Embedded Analytics: External-facing data apps for customers.
Reverse ETL: Pushing data from an OLAP system (Warehouse) back into a source system (e.g., CRM).
Future of Data Engineering
Live Data Stack: Movement toward true real-time applications and stream-based processing.
Fusion of Data and Apps: Data is created with analytics in mind from the start.
Cloud-Scale Data OS: Enhanced interoperability between services and standardized data APIs.
Dark Matter Data: The persistent importance of spreadsheets in business analysis.