Comprehensive Study Notes: Data Engineering & Big Data Landscape 2025

The Subject: What is Data Engineering?

  • Data engineering requires a variety of skills: design, technical and operational.

  • The main raw material of data engineering is data, while the product is information.

  • The information produced by data engineering supports other disciplines like data analytics and data science.

  • Data engineering involves multiple stakeholders, spanning design and governance expertise (data architecture, security and management) as well as technical expertise (orchestration and software engineering), all coordinated through DataOps best practices.

  • A data engineer handles the complexities of data ingestion, integration, storage and processing to support existing information use cases and pioneer new data monetization opportunities.

  • Definition:- "Data engineering is the development, implementation, and maintenance of systems and processes that take in raw data and produce high-quality, consistent information that supports downstream use cases, such as analysis and machine learning. Data engineering is the intersection of security, data management, DataOps, data architecture, orchestration, and software engineering. A data engineer manages the data engineering lifecycle, beginning with getting data from source systems and ending with serving data for use cases, such as analysis or machine learning."

The Data: The Value and Challenges of Big Data

  • The dimensions of big data were initially described by the 3 Vs: volume, velocity and variety. With time more descriptors were proposed: veracity, variability, complexity, value and decay.

  • The evolution of big data was influenced by the growth of ecommerce (1.0), social media (2.0) and IoT (3.0).

  • Innovations such as NoSQL databases, use of commodity hardware and open-source solutions, and cloud computing, have developed to handle big data.

  • The benefits of big data include:

    • cost savings,

    • better decision making,

    • improved product/service quality,

    • better pricing,

    • personalized marketing,

    • improved customer service.

  • The challenges of big data include:

    • data quality,

    • security,

    • privacy,

    • value justification,

    • data management,

    • skills shortage.

The Technology: The Big Data Landscape

  • The Open Source Data Engineering Landscape categorizes projects as:

    • open source (fully interoperable and vendor-neutral),

    • open core (not all components available in open source),

    • open foundation (open source serves as backbone for commercial offerings).

  • OLAP extensions allow seamless transformation of OLTP databases into HTAP (Hybrid Transactional/Analytical Processing) or new HTAS (Hybrid Transactional Analytical Storage) database engines.

  • Zero disk architecture eliminates the need for locally attached disks, using remote deep storage solutions (e.g., S3 object storage) as the primary persistence layer.

  • Lakehouse ecosystem is growing; interoperable open standards and frameworks are expected to increase.

  • Approximately 90% of queries remain within a workload size that can run on a single machine, scanning only recent data.

  • The open source data engineering landscape covers broad categories, including:

    • SQL databases (relational OLTP DBMS, distributed SQL DBMS, cache stores),

    • NoSQL databases (document stores, key-value stores, wide-column, graph),

    • OLAP databases (columnar, real-time OLAP, time-series),

    • Data lake/storage foundation and formats (Parquet, ORC, Avro, etc.),

    • Data integration and orchestration (Airflow, NiFi, Kafka, Debezium, etc.),

    • Data processing & computation (Spark, Flink, Dask, PySpark, etc.),

    • Data quality and modeling tools (Great Expectations, dbt, etc.),

    • Metadata management and data catalogs (Apache Atlas, Amundsen, OpenMetadata),

    • MLOps and analytics platforms (MLflow, Metaflow, Feast, etc.),

    • Observability and monitoring (Prometheus, Grafana),

    • Open standards and ecosystems (open table formats: Iceberg, Hudi, Delta Lake, Paimon).

  • Notable components highlighted include: Iceberg, Delta Lake, Hudi, Paimon (open table formats); Parquet/ORC (file formats); MinIO/S3 (storage); Spark, Flink, Kafka; Airflow, Beam, Kafka Connect; dbt; Great Expectations; Amundsen; OpenMetadata; and query engines such as Trino/Presto.
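Orchestration tools such as Airflow describe a pipeline as a directed acyclic graph (DAG) of tasks. The sketch below is not Airflow code; it only illustrates the underlying idea using Python's standard library, and the task names are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each task maps to the set of tasks it depends on.
pipeline = {
    "ingest_orders": set(),
    "ingest_customers": set(),
    "validate_orders": {"ingest_orders"},
    "join_orders_customers": {"validate_orders", "ingest_customers"},
    "publish_report_table": {"join_orders_customers"},
}


def execution_order(dag):
    """Return one valid run order in which every task follows its dependencies."""
    return list(TopologicalSorter(dag).static_order())


order = execution_order(pipeline)
print(order)
```

Several orders can be valid (the two ingest tasks are independent of each other), which is what allows an orchestrator to run independent tasks in parallel.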

The Organization: Data Engineering in an Organization

  • Data maturity is used to describe organizational capabilities and integration of data across the organization.

  • Distinctions for data scientists and data engineers:

    • Type A data scientists focus on analysis; Type A data engineers focus on abstraction and reuse.

    • Type B data scientists can build custom data tools; Type B data engineers can build tools as well.

    • External-facing data engineers handle high concurrency and complex security; internal-facing focus on business operations and internal stakeholders.

  • Within an organization, roles and responsibilities include:

    • Data architects design the blueprint for organizational data management and overall data architecture and systems.

    • Software engineers build the software and systems that generate internal data for engineers to process.

    • DevOps/SREs produce data through operational monitoring.

    • Data analysts (business analysts) interpret business performance and trends; data scientists forecast and analyze.

    • ML engineers develop advanced ML techniques, train models, and maintain the infrastructure for scaled ML processes.

  • Leadership roles often involved in large data initiatives:

    • CEO: defines vision with C-suite and data leadership; data engineers map available data internally and from third parties.

    • CIO: senior executive responsible for IT within the organization (internal-facing).

    • CTO: owns outward-facing technology strategy and architectures for external-facing applications, which are often critical data sources.

    • CDO: chief data officer responsible for data assets, data products, strategy, privacy, and data governance (e.g., master data management).

    • Project managers: help prioritize deliverables and keep projects on track.

  • Data maturity framework example:

    • Stages across organizations: Unaware (1%), Emerging (14%), Learning (46%), Developing (35%), Mastering (5%).

    • Organizations by sector include Not-for-profit (n=632), Public sector (n=142), Commercial (n=265); total n=1039; data from 2019-2024.

  • Key takeaway: Data engineering (DE) is an abstract domain that requires data, knowledge, teamwork, and ethical practice.
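As a quick sanity check on the data maturity figures quoted above, the numbers can be totalled in a few lines. The values are copied from these notes; the observation that the stage shares add up to slightly more than 100% is most likely a rounding effect, which is an assumption rather than something the source states.

```python
# Figures copied from the notes above.
stage_share_pct = {
    "Unaware": 1,
    "Emerging": 14,
    "Learning": 46,
    "Developing": 35,
    "Mastering": 5,
}
sector_counts = {"Not-for-profit": 632, "Public sector": 142, "Commercial": 265}

total_orgs = sum(sector_counts.values())
total_pct = sum(stage_share_pct.values())

print(total_orgs)  # 1039, matching the reported total n
print(total_pct)   # 101, so the shares were presumably rounded

# Share of the sample contributed by each sector, in percent.
for sector, n in sector_counts.items():
    print(sector, round(100 * n / total_orgs, 1))
```

The sector breakdown shows that the sample is dominated by not-for-profit organizations, which is worth keeping in mind when generalizing the stage distribution.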

The People: Data Engineering Roles, Skills and Activities

  • Data engineering requires understanding across three broad areas: design, technical, and operational skills.

  • The data engineer’s core job is to manage the lifecycle of data: ingestion, integration, storage, processing, and delivery to downstream use cases.

  • The skill set includes:

    • Evaluation of data tools across undercurrents (data quality, governance, storage, processing, security, etc.).

    • Integration and management of trade-offs across the data engineering lifecycle.

    • Business/domain knowledge: how data is produced, how it will be consumed, and how it creates value after processing.

    • Balancing optimization of cost, agility, scalability, simplicity, reuse, and interoperability.

    • Focus on simple, cost-effective high-level services that deliver business value rather than complex low-level tasks (e.g., cluster administration).

  • Data engineers typically do not directly build ML models or produce reports for analytics; their main role is pipeline and data infrastructure.

  • Knowledge and collaboration: data engineers must know how to communicate with both nontechnical and technical stakeholders, scope and gather business requirements, and apply Agile, DevOps, and DataOps practices; control costs; and commit to continuous learning.

Data Lifecycle and Underlying Concepts

  • Data engineering lifecycle stages (Generation, Storage, Ingestion, Transformation, Serving):

    • The cycle shifts focus from technology toward the data and the ultimate goals it must serve.

    • Storage provides flexibility of pipeline operation, chaining, and accommodates batch vs real-time data generation schemes.

    • Reverse ETL takes processed output from the end of the lifecycle and feeds it back into source systems to enrich operational data processes.

    • The undercurrents capture critical supportive tasks across the entire lifecycle.

  • Data provenance and governance are implicit in lifecycle design (security, management, data architecture, and DataOps).

  • Data monetization opportunities can arise from well-governed and well-processed data assets.
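The lifecycle stages listed above can be sketched as a chain of small functions. This is a minimal illustration under stated assumptions, not a production design: the sample records, the `shop_db` source name, and the in-memory `storage` dictionary (standing in for a database or object store) are all hypothetical.

```python
from collections import defaultdict


def generate():
    """Generation: a source system emits raw records (hypothetical sample data)."""
    return [
        {"order_id": 1, "customer": "a", "amount": "19.90"},
        {"order_id": 2, "customer": "b", "amount": "5.00"},
        {"order_id": 3, "customer": "a", "amount": None},  # incomplete record
    ]


def ingest(events, source):
    """Ingestion: copy records out of the source system, tagging their origin."""
    return [{**e, "source": source} for e in events]


def store(storage, name, records):
    """Storage: persist a stage's output so later stages can be chained or re-run."""
    storage[name] = list(records)
    return storage[name]


def transform(records):
    """Transformation: drop incomplete rows, cast types, total revenue per customer."""
    totals = defaultdict(float)
    for r in records:
        if r["amount"] is None:
            continue
        totals[r["customer"]] += float(r["amount"])
    return dict(totals)


def serve(totals):
    """Serving: shape the result for a downstream consumer such as a report."""
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)


storage = {}  # stands in for a database or object store
raw = store(storage, "raw_orders", ingest(generate(), source="shop_db"))
curated = transform(raw)
store(storage, "revenue_by_customer", curated.items())
served = serve(curated)
print(served)
```

Persisting the raw records before transforming them reflects the point that storage underpins every stage: the transformation can be changed and re-run without going back to the source system.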

History and Evolution of Data Engineering (Timeline)

  • 1970s–1980s: Emergence of business data warehouses and data modeling; IBM engineers develop relational databases and SQL for BI.

  • 1990s: The World Wide Web, invented by Tim Berners-Lee in 1989, reaches mainstream adoption in the mid-1990s; dot-com boom with companies like AOL, Yahoo, Amazon.

  • Early 2000s: Explosion of data; commodity hardware becomes cheap; Google File System and MapReduce (2003–2004) leverage commodity hardware.

  • Late 2000s: Yahoo develops Apache Hadoop; the term “big data” gains currency; data-driven decision making grows; Amazon expands cloud infrastructure and revenue.

  • Post-2010: Big data engineering grows too complex; cloud providers (Google Cloud, Microsoft Azure) offer managed services; data tools proliferate; the term “big data” becomes less central; focus shifts to modern data stack, privacy, and compliance.

The Importance of Data Source and Data Quality

  • Data sources matter for learning and intelligence: human intelligence is fueled by raw data and knowledge; ML uses both unstructured and structured data.

  • Data can contain biases and irregularities; should be explored via Exploratory Data Analysis (EDA) and possibly transformed (e.g., dimensionality reduction) before use.

  • The prevalence of big data increases the potential for ML applications and capabilities.

  • CRISP-DM and similar best practices guide enterprise data processes (data mining and analytics lifecycle).

  • Data source quality affects downstream analytics, ML outcomes, and decision-making.
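A first pass at exploring a data source for gaps and irregularities can be very small. The sketch below uses only the standard library; the sensor readings are hypothetical, and the outlier rule (distance from the median above a fixed tolerance) is an assumed, deliberately simple choice rather than a recommended method.

```python
from statistics import mean, median

# Hypothetical sample: sensor readings with one gap and one suspicious value.
readings = [21.5, 22.0, None, 21.8, 95.0, 22.1]


def profile(values):
    """Summarize completeness and spread for a numeric column."""
    present = [v for v in values if v is not None]
    return {
        "rows": len(values),
        "missing": len(values) - len(present),
        "min": min(present),
        "max": max(present),
        "mean": mean(present),
        "median": median(present),
    }


def flag_outliers(values, tolerance=10.0):
    """Flag values more than `tolerance` away from the median (an assumed rule)."""
    present = [v for v in values if v is not None]
    mid = median(present)
    return [v for v in present if abs(v - mid) > tolerance]


summary = profile(readings)
outliers = flag_outliers(readings)
print(summary)
print(outliers)
```

Here the mean is pulled well above the median by a single reading, which is the kind of irregularity that would distort downstream analytics or ML training if it went unnoticed.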

Practical Landscape: Open Source, Lakehouse, and HTAP

  • The Open Source Data Engineering Landscape distinguishes open source, open core, and open foundation models; interoperability and vendor neutrality vary across projects.

  • HTAP and HTAS: OLAP extensions enable hybrid transactional-analytical processing and storage.

  • Zero-disk architecture pushes persistence to remote storage like object stores; reduces local disk dependency.

  • Lakehouse and open standards are increasingly adopted; emphasis on interoperability across tools and platforms.

  • Single-node processing remains viable for a large share of queries; around 90% of queries can run on a single node, scanning recent data.

Lakehouse Architecture and Data Lakes

  • Lakehouse architecture combines data lake storage with the ability to perform data warehousing-like analytics.

  • Key components of an open data lakehouse include: an open catalog, an open table format, an open file format, and object storage (e.g., Amazon S3, MinIO), with widely used file formats like Parquet and ORC.

  • Prominent ecosystem components include: Apache Iceberg, Apache Hudi, Delta Lake, Apache Paimon; processing engines include Trino, Spark, ClickHouse, DuckDB, and others.

  • Storage and metadata layers (Metastore) integrate with tools like Spark, Trino, Hive, and various catalog services.
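To make the layering concrete, the toy below shows the general idea behind a table format on object storage: data files are written once and never modified, and each commit adds a metadata snapshot listing the files that make up the table. It is a loose illustration only and does not reproduce the actual layout or API of Iceberg, Hudi, Delta Lake, or Paimon; the `ToyTable` class, the key names, the in-memory `object_store`, and the use of JSON instead of Parquet are all assumptions.

```python
import json

object_store = {}  # stands in for S3/MinIO: key -> bytes


def put(key, obj):
    """Write an object; existing data files are never modified in this sketch."""
    object_store[key] = json.dumps(obj).encode()


def get(key):
    """Read an object back from the store."""
    return json.loads(object_store[key])


class ToyTable:
    """A table is a sequence of snapshots, each listing immutable data files."""

    def __init__(self, name):
        self.name = name
        self.version = 0
        put(f"{name}/metadata/v0.json", {"files": []})

    def append(self, rows):
        """Write rows as a new data file, then commit a snapshot that includes it."""
        files = get(f"{self.name}/metadata/v{self.version}.json")["files"]
        data_key = f"{self.name}/data/part-{len(files)}.json"
        put(data_key, rows)
        self.version += 1
        put(f"{self.name}/metadata/v{self.version}.json", {"files": files + [data_key]})

    def scan(self, version=None):
        """Read all rows visible in a snapshot (the latest one by default)."""
        v = self.version if version is None else version
        files = get(f"{self.name}/metadata/v{v}.json")["files"]
        return [row for key in files for row in get(key)]


table = ToyTable("orders")
table.append([{"id": 1}, {"id": 2}])
table.append([{"id": 3}])
print(table.scan())           # rows from the latest snapshot
print(table.scan(version=1))  # rows as of the first commit
```

Because old snapshots remain readable, any engine that understands the metadata can query a consistent view of the table, which is one reason open formats help with interoperability across tools.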

Single Node Processing and Local Analytics

  • Single node processing capabilities enable running analytics locally with tools such as Pandas, Polars, DuckDB, and Arrow-based frameworks.

  • This enables lightweight analysis and prototype work before scaling to distributed systems.

  • Some examples of tools and ecosystems for single-node workloads include DuckDB, Polars, Arrow, and Pandas.
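A small aggregation of this kind fits comfortably on one machine. The sketch below uses Python's built-in `sqlite3` module purely as a stand-in so that it runs without extra installs; SQLite is not one of the analytics tools named above, and with DuckDB, Polars, or Pandas the same query would typically run directly over Parquet or CSV files. The event data is hypothetical.

```python
import sqlite3

# Hypothetical event data; in practice this might come from a Parquet or CSV file.
events = [
    ("2025-01-01", "view", 3),
    ("2025-01-01", "purchase", 1),
    ("2025-01-02", "view", 5),
    ("2025-01-02", "purchase", 2),
]

conn = sqlite3.connect(":memory:")  # everything stays in local memory
conn.execute("CREATE TABLE events (day TEXT, kind TEXT, n INTEGER)")
conn.executemany("INSERT INTO events VALUES (?, ?, ?)", events)

# Aggregate event counts per day, as a prototype of a reporting query.
daily_totals = conn.execute(
    "SELECT day, SUM(n) FROM events GROUP BY day ORDER BY day"
).fetchall()
conn.close()

print(daily_totals)
```

Prototyping the query locally first makes it cheap to validate the logic before deciding whether a distributed engine is needed at all.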

Data Scientists, Data Engineers, and the Analytics Loop

  • Data engineering sits upstream from data science: engineers provide data inputs used by data scientists.

  • Data scientists spend a large portion of their time (often 70–80%) gathering, cleaning, and processing data; actual analysis and ML receive a smaller share.

  • Data scientists typically are not trained to engineer production-grade data systems; data engineers build robust pipelines to support analytics and ML.

  • When data engineers optimize the data pipeline, they lay a solid foundation for data scientists to succeed.

Skills, Trade-offs, and the Practicalities of Data Engineering

  • The data engineer’s skill set includes:

    • Understanding and selecting tools across the lifecycle and managing trade-offs among cost, performance, scalability, and simplicity.

    • Domain knowledge: how data is produced, how it will be consumed, and how to derive value post-processing.

    • Favoring high-level services that deliver business value over complex low-level tasks like cluster administration.

  • Data engineers do not typically build ML models or generate analytics reports; their primary responsibility is the data pipeline and platform.

Data Engineering as a Business Function

  • Communication skills with both nontechnical and technical audiences are essential.

  • Ability to scope and gather business and product requirements.

  • Deep understanding of Agile, DevOps, and DataOps cultures.

  • Cost control and continuous learning as ongoing responsibilities.

Data Maturity and Organizational Roles

  • Data maturity describes progression toward higher data utilization and integration.

  • Internal vs external data engineering roles vary by responsibilities and security concerns.

  • Data practitioners in organizations may focus on abstraction (reusable components) or building custom tools, depending on their role.

  • Leadership roles influence data strategy: CIO, CTO, CDO, and CEO collaborate to align data initiatives with business strategy; PMs help manage delivery.

Key Takeaways

  • Data engineering is an interdisciplinary domain that requires data, technical skills, governance, and ethical practice.

  • The lifecycle of data spans generation to serving, with DataOps guiding orchestration and governance.

  • The landscape of tools and architectures (open source, lakehouse, HTAP, zero-disk, etc.) is rapidly evolving and increasingly commoditized through cloud services.

  • Data maturity and organizational culture significantly influence the success of data initiatives.

  • Collaboration across roles (data architects, software engineers, DevOps/SREs, analysts, data scientists, ML engineers) is essential to deliver value.

Closing

  • The material emphasizes practical, scalable, and governance-aware approaches to building data-intensive organizations.