Comprehensive Study Notes: Data Engineering & Big Data Landscape 2025

The Subject: What is Data Engineering?

Data engineering requires a variety of skills: design, technical and operational.
The main raw material of data engineering is data, while the product is information.
The information produced by data engineering supports other disciplines like data analytics and data science.
Data engineering involves multiple stakeholders and spans design and governance expertise (data architecture, security and management) as well as technical expertise (orchestration and software engineering), all managed through DataOps best practices.
A data engineer handles the complexities of data ingestion, integration, storage and processing to support existing information use cases and pioneer new data monetization opportunities.
Definition: "Data engineering is the development, implementation, and maintenance of systems and processes that take in raw data and produce high-quality, consistent information that supports downstream use cases, such as analysis and machine learning. Data engineering is the intersection of security, data management, DataOps, data architecture, orchestration, and software engineering. A data engineer manages the data engineering lifecycle, beginning with getting data from source systems and ending with serving data for use cases, such as analysis or machine learning."

The Data: The Value and Challenges of Big Data

The dimensions of big data were initially described by the 3 Vs: volume, velocity and variety. With time more descriptors were proposed: veracity, variability, complexity, value and decay.
The evolution of big data was influenced by the growth of ecommerce (1.0), social media (2.0) and IoT (3.0).
Innovations such as NoSQL databases, use of commodity hardware and open-source solutions, and cloud computing, have developed to handle big data.
The benefits of big data include:
- cost savings,
- better decision making,
- improved product/service quality,
- better pricing,
- personalized marketing,
- improved customer service.
The challenges of big data include:
- data quality,
- security,
- privacy,
- value justification,
- data management,
- skills shortage.

The Technology: The Big Data Landscape

The Open Source Data Engineering Landscape categorizes projects as:
- open source (fully interoperable and vendor-neutral),
- open core (not all components available in open source),
- open foundation (open source serves as backbone for commercial offerings).
OLAP extensions allow seamless transformation of OLTP databases into HTAP (Hybrid Transactional/Analytical Processing) or new HTAS (Hybrid Transactional Analytical Storage) database engines.
Zero disk architecture eliminates the need for locally attached disks, using remote deep storage solutions (e.g., S3 object storage) as the primary persistence layer.
The lakehouse ecosystem is growing, and the number of interoperable open standards and frameworks is expected to increase.
Approximately 90% of queries remain within a workload size that can run on a single machine, scanning only recent data.
The open source data engineering landscape covers broad categories, including:
- SQL databases (relational OLTP DBMS, distributed SQL DBMS, cache stores),
- NoSQL databases (document stores, key-value stores, wide-column, graph),
- OLAP databases (columnar, real-time OLAP, time-series),
- Data lake/storage foundation and formats (Parquet, ORC, Avro, etc.),
- Data integration and orchestration (Airflow, NiFi, Kafka, Debezium, etc.),
- Data processing & computation (Spark, Flink, Dask, PySpark, etc.),
- Data quality and modeling tools (Great Expectations, dbt, etc.),
- Metadata management and data catalogs (Apache Atlas, Amundsen, OpenMetadata),
- MLOps and analytics platforms (MLflow, Metaflow, Feast, etc.),
- Observability and monitoring (Prometheus, Grafana),
- Open standards and ecosystems (open table formats: Iceberg, Hudi, Delta Lake, Paimon).
Notable components highlighted include: Iceberg, Delta Lake, Hudi, Paimon (open table formats); Parquet/ORC (file formats); MinIO/S3 (storage); Spark, Flink, Kafka; Airflow, Beam, Kafka Connect; dbt; Great Expectations; Amundsen; OpenMetadata; query engines such as Trino/Presto; velocities of data tools in the modern data stack.
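Orchestration tools in the list above generally model a pipeline as a directed acyclic graph (DAG) of tasks. The sketch below is not the API of any of those tools; it only illustrates the underlying idea using Python's standard-library graphlib. The task names and the helper execution_order are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each task maps to the set of tasks it depends on.
pipeline = {
    "ingest_orders": set(),
    "ingest_customers": set(),
    "clean_orders": {"ingest_orders"},
    "join_orders_customers": {"clean_orders", "ingest_customers"},
    "publish_report": {"join_orders_customers"},
}


def execution_order(dag):
    """Return one valid run order in which every task follows its dependencies."""
    return list(TopologicalSorter(dag).static_order())


order = execution_order(pipeline)
print(order)
```

Several orderings can be valid (the two ingest tasks are independent), so only the relative order of dependent tasks is guaranteed. A cyclic dependency makes static_order raise graphlib.CycleError, which is why such pipelines must be acyclic.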

The Organization: Data Engineering in an Organization

Data maturity is used to describe organizational capabilities and integration of data across the organization.
Distinctions for data scientists and data engineers:
- Type A data scientists focus on analysis; Type A data engineers focus on abstraction and reuse.
- Type B data scientists build custom data tools; Type B data engineers likewise build data tools and systems.
- External-facing data engineers handle high concurrency and complex security; internal-facing data engineers focus on business operations and internal stakeholders.
Within an organization, roles and responsibilities include:
- Data architects design the blueprint for organizational data management and overall data architecture and systems.
- Software engineers build the software and systems that generate internal data for engineers to process.
- DevOps/SREs produce data through operational monitoring.
- Data analysts (business analysts) interpret business performance and trends; data scientists forecast and analyze.
- ML engineers develop advanced ML techniques, train models, and maintain the infrastructure for scaled ML processes.
Leadership roles often involved in large data initiatives:
- CEO: defines vision with C-suite and data leadership; data engineers map available data internally and from third parties.
- CIO: senior executive responsible for IT within the organization (internal-facing).
- CTO: owns outward-facing tech strategy and architectures for external-facing applications, which are critical data sources for data engineers.
- CDO: chief data officer responsible for data assets, data products, strategy, privacy, and data governance (e.g., master data management).
- Project managers: help prioritize deliverables and keep projects on track.
Data maturity framework example:
- Stages across organizations: Unaware (1%), Emerging (14%), Learning (46%), Developing (35%), Mastering (5%).
- Organizations by sector include Not-for-profit (n=632), Public sector (n=142), Commercial (n=265); total n=1039; data from 2019-2024.
Key takeaways:
- Data engineering (DE) is an abstract domain that requires data, knowledge, teamwork, and ethical practice.

The People: Data Engineering Roles, Skills and Activities

Data engineering requires understanding across three broad areas: design, technical, and operational skills.
The data engineer’s core job is to manage the lifecycle of data: ingestion, integration, storage, processing, and delivery to downstream use cases.
The skill set includes:
- Evaluation of data tools across undercurrents (data quality, governance, storage, processing, security, etc.).
- Integration and management of trade-offs across the data engineering lifecycle.
- Business/domain knowledge: how data is produced, how it will be consumed, and how it creates value after processing.
- Balancing optimization of cost, agility, scalability, simplicity, reuse, and interoperability.
- Focus on simple, cost-effective high-level services that deliver business value rather than complex low-level tasks (e.g., cluster administration).
Data engineers typically do not directly build ML models or produce reports for analytics; their main role is pipeline and data infrastructure.
Knowledge and collaboration: data engineers must communicate with both nontechnical and technical stakeholders, scope and gather business requirements, apply Agile, DevOps, and DataOps practices, control costs, and commit to continuous learning.

Data Lifecycle and Underlying Concepts

Data engineering lifecycle stages (Generation, Storage, Ingestion, Transformation, Serving):
- The cycle shifts focus from technology toward the data and the ultimate goals it must serve.
- Storage underpins the whole pipeline: it allows stages to be chained flexibly and accommodates both batch and real-time data generation.
- Reverse ETL feeds output from the end of the cycle back into source systems to enrich data processes.
- The undercurrents capture critical supportive tasks across the entire lifecycle.
Data provenance and governance are implicit in lifecycle design (security, management, data architecture, and DataOps).
Data monetization opportunities can arise from well-governed and well-processed data assets.
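The lifecycle stages above can be sketched as a chain of small functions that share a storage layer. This is a minimal illustration under stated assumptions, not a production design: the sample events, the dictionary standing in for storage, and all function names are hypothetical.

```python
import json


def generate():
    """Generation: a source system emits raw events (hypothetical sample data)."""
    return [
        '{"user": "a", "amount": "10.5"}',
        '{"user": "b", "amount": "4.0"}',
        '{"user": "a", "amount": "2.5"}',
    ]


def ingest(raw_events, storage):
    """Ingestion: land raw events in storage unchanged."""
    storage["raw"] = list(raw_events)


def transform(storage):
    """Transformation: parse raw records and aggregate spend per user."""
    totals = {}
    for line in storage["raw"]:
        record = json.loads(line)
        user = record["user"]
        totals[user] = totals.get(user, 0.0) + float(record["amount"])
    storage["curated"] = totals


def serve(storage, user):
    """Serving: expose curated data to a downstream consumer."""
    return storage["curated"].get(user)


storage = {}  # stand-in for the storage layer shared by every stage
ingest(generate(), storage)
transform(storage)
print(serve(storage, "a"))  # 13.0
```

Keeping the raw events in storage alongside the curated result means the transformation can be re-run or changed later without going back to the source system.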

History and Evolution of Data Engineering (Timeline)

1970s–1980s: Emergence of business data warehouses and data modeling; IBM engineers develop relational databases and SQL for BI.
1990s: The World Wide Web, invented by Tim Berners-Lee, reaches mainstream adoption in the mid-1990s; dot-com boom with companies like AOL, Yahoo, Amazon.
Early 2000s: Explosion of data; commodity hardware becomes cheap; Google File System and MapReduce (2003–2004) leverage commodity hardware.
Late 2000s: Yahoo develops Apache Hadoop; the term “big data” is coined; data-driven decision making grows; Amazon expands cloud infrastructure and revenue.
Post-2010: Big data engineering grows too complex; cloud providers (Google Cloud, Microsoft Azure) offer managed services; data tools proliferate; the term “big data” becomes less central; focus shifts to modern data stack, privacy, and compliance.

The Importance of Data Source and Data Quality

Data sources matter for learning and intelligence: human intelligence is fueled by raw data and knowledge; ML uses both unstructured and structured data.
Data can contain biases and irregularities; it should be explored via Exploratory Data Analysis (EDA) and, where needed, transformed (e.g., dimensionality reduction) before use.
The prevalence of big data increases the potential for ML applications and capabilities.
CRISP-DM and similar best practices guide enterprise data processes (data mining and analytics lifecycle).
Data source quality affects downstream analytics, ML outcomes, and decision-making.
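A first EDA pass often amounts to profiling each column for missing values and irregularities before the data is used. The sketch below uses only the Python standard library; the sensor readings, the valid range, and the profile helper are hypothetical and chosen only for illustration.

```python
from statistics import mean, median

# Hypothetical sample: sensor readings with one missing value and one suspicious outlier.
rows = [
    {"sensor": "s1", "temp_c": 21.0},
    {"sensor": "s2", "temp_c": 22.0},
    {"sensor": "s3", "temp_c": None},
    {"sensor": "s4", "temp_c": 23.0},
    {"sensor": "s5", "temp_c": 250.0},
]


def profile(rows, column, valid_range):
    """Summarize one numeric column: missing count, center, and out-of-range values."""
    values = [r[column] for r in rows if r[column] is not None]
    low, high = valid_range
    return {
        "rows": len(rows),
        "missing": len(rows) - len(values),
        "mean": mean(values),
        "median": median(values),
        "out_of_range": [v for v in values if not low <= v <= high],
    }


report = profile(rows, "temp_c", valid_range=(-40.0, 60.0))
print(report)
```

Here the mean (79.0) sits far above the median (22.5), which is a quick signal that a single irregular value is skewing the column and would distort any downstream model or report.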

Practical Landscape: Open Source, Lakehouse, and HTAP

Open Source Data Engineering Landscape highlights: open source, open core, and open foundation models; interoperability and vendor neutrality vary across projects.
HTAP and HTAS: OLAP extensions enable hybrid transactional-analytical processing and storage.
Zero-disk architecture pushes persistence to remote storage like object stores; reduces local disk dependency.
Lakehouse and open standards are increasingly adopted; emphasis on interoperability across tools and platforms.
Single-node processing remains viable for a large share of queries; around 90% of queries can run on a single node, scanning recent data.

Lakehouse Architecture and Data Lakes

Lakehouse architecture combines data lake storage with the ability to perform data warehousing-like analytics.
Key components include: open data lakehouse, open catalog, open table format, open file format, object storage (e.g., Amazon S3, MinIO), and widely used storage formats like Parquet and ORC.
Prominent ecosystem components include: Apache Iceberg, Apache Hudi, Delta Lake, Apache Paimon; engines for processing include Trino/Spark/ClickHouse, DuckDB, and others.
Storage and metadata layers (Metastore) integrate with tools like Spark, Trino, Hive, and various catalog services.
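To make the layering above concrete, the toy model below shows the general idea behind an open table format: data files in object storage are treated as immutable, and the table is a sequence of metadata snapshots that list which files are visible. This is a deliberately simplified illustration, not the specification of Iceberg, Hudi, Delta Lake, or Paimon; the class ToyTable and the bucket paths are hypothetical.

```python
class ToyTable:
    """Toy model of a table format: immutable data files plus snapshot metadata."""

    def __init__(self):
        self.snapshots = []  # each snapshot is a tuple of data file paths

    def append(self, *files):
        """Commit a new snapshot that adds files without rewriting earlier ones."""
        current = self.snapshots[-1] if self.snapshots else ()
        self.snapshots.append(current + tuple(files))
        return len(self.snapshots) - 1  # id of the new snapshot

    def files(self, snapshot_id=None):
        """List the data files visible at a snapshot (the latest by default)."""
        if not self.snapshots:
            return ()
        if snapshot_id is None:
            snapshot_id = len(self.snapshots) - 1
        return self.snapshots[snapshot_id]


table = ToyTable()
first = table.append("s3://bucket/orders/part-0.parquet")
second = table.append("s3://bucket/orders/part-1.parquet")
print(table.files())       # latest snapshot: both files
print(table.files(first))  # earlier snapshot: only the first file
```

Because earlier snapshots are never modified, any engine that reads the same metadata sees a consistent view of the table, and older versions stay queryable.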

Single Node Processing and Local Analytics

Single node processing capabilities enable running analytics locally with tools such as Pandas, Polars, DuckDB, and Arrow-based frameworks.
This enables lightweight analysis and prototype work before scaling to distributed systems.
Some examples of tools and ecosystems for single-node workloads include DuckDB, Polars, Arrow, and Pandas.
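The single-node pattern can be sketched with an in-process SQL engine. To keep the example dependency-free, it uses Python's built-in sqlite3 module as a stand-in; it is not one of the tools named above, and DuckDB, Polars, or Pandas would each express the same aggregation through their own APIs. The table and sample rows are hypothetical.

```python
import sqlite3

# In-memory, in-process database: no server or cluster involved.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (day TEXT, user TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?)",
    [
        ("2025-01-01", "a", 10.0),
        ("2025-01-01", "b", 5.0),
        ("2025-01-02", "a", 7.5),
    ],
)

# Aggregate events per day entirely on the local machine.
daily = conn.execute(
    "SELECT day, COUNT(*), SUM(amount) FROM events GROUP BY day ORDER BY day"
).fetchall()
print(daily)
conn.close()
```

A prototype like this can be validated locally and moved to a distributed engine only if the data volume later requires it.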

Data Scientists, Data Engineers, and the Analytics Loop

Data engineering sits upstream from data science: engineers provide data inputs used by data scientists.
Data scientists spend a large portion of their time (often 70–80%) gathering, cleaning, and processing data; actual analysis and ML receive a smaller share.
Data scientists typically are not trained to engineer production-grade data systems; data engineers build robust pipelines to support analytics and ML.
When data engineers optimize the data pipeline, they lay a solid foundation for data scientists to succeed.

Skills, Trade-offs, and the Practicalities of Data Engineering

The data engineer’s skill set includes:
- Understanding and selecting tools across the lifecycle and managing trade-offs among cost, performance, scalability, and simplicity.
- Domain knowledge: how data is produced, how it will be consumed, and how to derive value post-processing.
- Prioritizing high-level services that deliver business value over complex low-level tasks like cluster administration.
Data engineers do not typically build ML models or generate analytics reports; their primary responsibility is the data pipeline and platform.

Data Engineering as a Business Function

Communication skills with both nontechnical and technical audiences are essential.
Ability to scope and gather business and product requirements.
Deep understanding of Agile, DevOps, and DataOps cultures.
Cost control and continuous learning as ongoing responsibilities.

Data Maturity and Organizational Roles

Data maturity describes progression toward higher data utilization and integration.
Internal vs external data engineering roles vary by responsibilities and security concerns.
Data practitioners in organizations may focus on abstraction (reusable components) or building custom tools, depending on their role.
Leadership roles influence data strategy: CIO, CTO, CDO, and CEO collaborate to align data initiatives with business strategy; PMs help manage delivery.

Key Takeaways

Data engineering is an interdisciplinary domain that requires data, technical skills, governance, and ethical practice.
The lifecycle of data spans generation to serving, with DataOps guiding orchestration and governance.
The landscape of tools and architectures (open source, lakehouse, HTAP, zero-disk, etc.) is rapidly evolving and increasingly commoditized through cloud services.
Data maturity and organizational culture significantly influence the success of data initiatives.
Collaboration across roles (data architects, software engineers, DevOps/SREs, analysts, data scientists, ML engineers) is essential to deliver value.

Closing

DE is an abstract domain that requires data, knowledge, teamwork, and ethical practice.
The material emphasizes practical, scalable, and governance-aware approaches to building data-intensive organizations.