
Module 1: Introduction to Big Data & Analytics

1.1 What is Big Data?

Big data refers to data sets that are so large and complex that they require advanced methods and technologies to process them efficiently. The characteristics of big data are often encapsulated in the 5 V's:

Volume: Refers to the vast amounts of data generated every second, which requires robust storage solutions.
Velocity: Pertains to the speed at which data is generated and processed, necessitating real-time handling of data streams.
Variety: Involves the different forms of data, whether structured, semi-structured, or unstructured.
Veracity: Addresses the quality and accuracy of data.
Value: Represents the usefulness of the data to inform business decisions.

Traditional Data Systems vs Big Data Systems

Traditional systems typically handle structured data with predefined schemas in relational databases. Big data systems also accommodate semi-structured and unstructured data, and they scale out across clusters using distributed file systems and distributed processing.

Batch vs Real-Time Processing

In batch processing, data is aggregated over a period and processed at once; examples include monthly sales reports. In contrast, real-time processing manages data streams as they occur for immediate insights, such as monitoring social media feeds.
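
The difference between the two modes can be sketched in a few lines. This is a minimal local illustration, not a GCP API; the event data and the function names `batch_totals` and `streaming_totals` are hypothetical.

```python
from collections import defaultdict

# Hypothetical sales events: (month, amount). Illustrative data only.
EVENTS = [("2024-01", 120.0), ("2024-01", 80.0), ("2024-02", 50.0)]


def batch_totals(events):
    """Batch style: wait until all events are collected, then aggregate once."""
    totals = defaultdict(float)
    for month, amount in events:
        totals[month] += amount
    return dict(totals)


def streaming_totals(events):
    """Streaming style: update the running total and emit it after every event."""
    totals = defaultdict(float)
    for month, amount in events:
        totals[month] += amount
        yield month, totals[month]


print(batch_totals(EVENTS))
for update in streaming_totals(EVENTS):
    print(update)
```

The batch function produces one result after seeing everything; the streaming function produces an updated result per event, which is what makes immediate insight possible.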

1.2 Big Data Challenges

Challenges in big data include:

Volume, Velocity, Variety: Managing the sheer quantity, speed, and diversity of data.
Scalability Issues: Building systems that can grow with increasing data.
Fault Tolerance: Ensuring systems are resilient against data loss or system crashes, requiring redundancy.
Cost Optimization: Balancing the expenses involved in data storage and processing versus the value gained.

1.3 Big Data Architecture Overview

Key components of big data architecture include:

Data Ingestion: The initial process of gathering data from different sources.
Storage: Where data is held; involves choosing platforms that suit the data type and access speed.
Processing: The manipulation or analysis of data to extract insights.
Analytics: Employing various methods to interpret data and gain actionable intelligence.
ML Integration: Incorporation of machine learning capabilities to automate and enhance data analysis.

Module 2: Google Cloud Data Platform Overview

2.1 Why Google Cloud for Big Data?

Google Cloud's big data services are built on infrastructure that Google developed for its own internal workloads, including:

Dremel: The distributed SQL query engine behind BigQuery, built for fast interactive queries over very large datasets.
Borg: The cluster management system that allows for efficient scheduling and resource utilization.
Colossus: The scalable file system designed to handle massive amounts of data.

Serverless Analytics Model

The serverless model simplifies data processing by removing the need for server management, allowing users to focus on data and analytics without infrastructure concerns.

2.2 GCP Big Data Ecosystem

Main components of the GCP Big Data ecosystem comprise:

Cloud Storage: A scalable and secure object storage service.
BigQuery: A fully-managed data warehouse that allows for super-fast SQL queries and analysis.
Pub/Sub: A managed asynchronous messaging service for event ingestion and event-driven architectures.
Dataflow: A fully managed service for stream and batch data processing.
Dataproc: A managed Hadoop and Spark service, providing flexibility and cost efficiency.
Composer (Apache Airflow): A service for workflow management, facilitating complex data pipeline orchestration.
Vertex AI: A platform that helps to build, deploy, and scale machine learning models.

2.3 End-to-End Data Architecture

Understanding the flow of data through a system includes:

Streaming + Batch Architecture: Combining both streaming and batch data processing approaches facilitates comprehensive data analysis.
Lambda vs Kappa Architecture:
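- Lambda Architecture: Runs a batch layer (complete, accurate views recomputed periodically) alongside a speed layer (low-latency views of recent data) and merges the two at query time.
- Kappa Architecture: Treats all data as a stream handled by a single pipeline. Historical results are rebuilt by replaying the event log, which removes the duplicate batch code path but requires a replayable log with sufficient retention.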
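
The contrast between the two architectures can be sketched as follows. This is a simplified illustration under stated assumptions: events are `(key, timestamp)` tuples, and `lambda_query` and `kappa_query` are hypothetical names, not part of any framework.

```python
from collections import Counter


def count_by_key(events):
    """Shared aggregation logic: count events per key."""
    return Counter(key for key, _timestamp in events)


def lambda_query(batch_events, recent_events):
    """Lambda: merge a precomputed batch view with a speed-layer view."""
    batch_view = count_by_key(batch_events)    # recomputed periodically
    speed_view = count_by_key(recent_events)   # covers events since the last batch run
    return batch_view + speed_view


def kappa_query(event_log):
    """Kappa: one streaming code path; history is handled by replaying the log."""
    return count_by_key(event_log)


BATCH = [("page_a", 1), ("page_b", 2)]
RECENT = [("page_a", 3)]
print(lambda_query(BATCH, RECENT))
print(kappa_query(BATCH + RECENT))
```

Both queries return the same counts here. The design difference is that Lambda maintains two views that must agree, while Kappa keeps a single path and depends on being able to replay the full log.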

Module 3: Data Storage in Google Cloud

3.1 Cloud Storage

Buckets and Objects

Cloud Storage organizes data into containers called buckets, and the individual files stored within these buckets are referred to as objects.

Storage Classes

Storage classes trade storage price against access cost and minimum storage duration. The options are Standard, Nearline, Coldline, and Archive; colder classes cost less to store but more to read, so they suit data that is accessed infrequently.

Lifecycle Rules

Lifecycle rules automatically move objects to a different storage class or delete them when conditions such as object age are met, keeping storage costs in line with how the data is actually used.
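
As a sketch, a lifecycle configuration can be written as a Python dictionary that mirrors the JSON shape Cloud Storage uses (`rule`, `action`, `condition`). The 30-day and 365-day thresholds are hypothetical, and `matching_actions` is a local simulation for illustration only, not a Cloud Storage API.

```python
# Mirrors the shape of a Cloud Storage lifecycle configuration.
# The age thresholds below are hypothetical examples.
LIFECYCLE = {
    "rule": [
        {
            "action": {"type": "SetStorageClass", "storageClass": "NEARLINE"},
            "condition": {"age": 30},
        },
        {
            "action": {"type": "Delete"},
            "condition": {"age": 365},
        },
    ]
}


def matching_actions(object_age_days, config=LIFECYCLE):
    """Local simulation: list the action types whose age condition is met."""
    return [
        rule["action"]["type"]
        for rule in config["rule"]
        if object_age_days >= rule["condition"]["age"]
    ]


for age in (10, 45, 400):
    print(age, matching_actions(age))
```

The simulation only lists which rules match; how the service resolves several matching rules is not modeled here.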

Data Lake Design

A well-structured data lake allows organizations to store vast amounts of raw data, preparing it for analysis while supporting multiple data formats.

3.2 BigQuery Architecture

BigQuery utilizes:

Columnar Storage: Optimizes storage and retrieval efficiency for analytical workloads.
Dremel Execution Engine: Handles processing and querying of data rapidly across distributed architectures.
Separation of Compute and Storage: Allows for independent scaling and pricing of computing and storage resources, enhancing performance and lowering costs.
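
To see why columnar storage helps analytical queries, consider this small sketch. The table contents and function names are hypothetical, and real columnar formats add compression and encoding that are not shown.

```python
# Hypothetical table in row-oriented form.
ROWS = [
    {"user": "a", "country": "DE", "amount": 10},
    {"user": "b", "country": "US", "amount": 25},
    {"user": "c", "country": "DE", "amount": 5},
]


def to_columns(rows):
    """Pivot row-oriented records into one list per column."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


def total_amount_row_store(rows):
    """Row layout: every full record is visited to read a single field."""
    return sum(row["amount"] for row in rows)


def total_amount_column_store(columns):
    """Column layout: only the 'amount' column is read."""
    return sum(columns["amount"])


COLUMNS = to_columns(ROWS)
print(total_amount_row_store(ROWS), total_amount_column_store(COLUMNS))
```

Both functions return the same total, but the columnar version never touches the `user` or `country` values, which is the property analytical engines exploit at scale.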

3.3 Data Lake vs Data Warehouse vs Lakehouse

Use-case Comparison

Data Lake: Best for raw and varied data without strict schema requirements, suitable for exploratory analytics.
Data Warehouse: Ideal for structured data requiring predefined schemas, optimized for reporting and complex queries.
Lakehouse: Combines features of data lakes and warehouses, addressing the need for both unstructured storage and structured data processing.

When to Use BigQuery

BigQuery is a good fit when analytical workloads require SQL querying over large datasets, including near-real-time analysis of streamed data. It is not designed for high-rate transactional workloads.

Module 4: Data Ingestion & Integration

4.1 Batch Data Ingestion

GCS → BigQuery

A common batch path is to load files (for example CSV, JSON, Avro, or Parquet) from Google Cloud Storage (GCS) into BigQuery using load jobs.

Database Exports

Data exported from operational databases, typically as files staged in Cloud Storage, can then be loaded into BigQuery; this is a common batch ingestion method.

Transfer Service

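Google's transfer services (Storage Transfer Service for moving objects into Cloud Storage, and BigQuery Data Transfer Service for scheduled loads into BigQuery) streamline data movement from other clouds and on-premises sources.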

4.2 Streaming Data Ingestion

Pub/Sub Architecture

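Publishers send messages to a topic; Pub/Sub retains each message and delivers it to every subscription attached to that topic, and subscribers acknowledge messages once they have processed them. This decouples producers from consumers and lets each side scale independently.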

Message Ordering

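Pub/Sub does not guarantee ordering by default. When message ordering is enabled on a subscription and publishers attach an ordering key, messages with the same key are delivered in the order they were published, which is critical for applications such as per-entity change streams.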
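
Ordering in Pub/Sub is scoped to an ordering key rather than to the whole topic. The following is a minimal local simulation of that idea, not the Pub/Sub client library; the message tuple layout `(ordering_key, sequence, payload)` and the function name are assumptions made for illustration.

```python
from collections import defaultdict


def deliver_in_key_order(messages):
    """Group messages by ordering key and order each group by publish sequence.

    messages: iterable of (ordering_key, sequence, payload) in arrival order.
    Returns a dict mapping each key to its payloads in publish order.
    """
    by_key = defaultdict(list)
    for key, sequence, payload in messages:
        by_key[key].append((sequence, payload))
    return {
        key: [payload for _sequence, payload in sorted(items)]
        for key, items in by_key.items()
    }


# Hypothetical messages that arrived out of order.
ARRIVED = [("user-1", 2, "update"), ("user-2", 1, "create"), ("user-1", 1, "create")]
print(deliver_in_key_order(ARRIVED))
```

Order is preserved within `user-1` and within `user-2`, but no order is implied between the two keys.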

Exactly-Once Semantics

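By default, Pub/Sub delivers each message at least once, so duplicates are possible. Exactly-once semantics ensure that each message takes effect only once, whether through exactly-once delivery on the subscription or through idempotent or deduplicating consumers, which prevents duplication and data integrity issues.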
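
One common way to get an exactly-once effect on top of at-least-once delivery is to deduplicate on a message ID in the consumer. The sketch below assumes each message carries a unique ID; `process_once` is a hypothetical helper, and a production system would keep the set of seen IDs in durable storage rather than in memory.

```python
def process_once(messages):
    """Apply each message at most once by remembering processed message IDs.

    messages: iterable of (message_id, payload), possibly with redeliveries.
    """
    seen_ids = set()
    applied = []
    for message_id, payload in messages:
        if message_id in seen_ids:
            continue  # redelivered duplicate: skip it
        seen_ids.add(message_id)
        applied.append(payload)
    return applied


# "m1" is delivered twice, as can happen with at-least-once delivery.
DELIVERIES = [("m1", "debit 10"), ("m2", "credit 5"), ("m1", "debit 10")]
print(process_once(DELIVERIES))
```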

4.3 Real-Time Data Pipelines

A typical real-time data pipeline utilizes the sequence: Pub/Sub → Dataflow → BigQuery, enabling seamless and continuous data streaming analytics.

Module 5: Data Processing & Transformation

5.1 Dataflow (Apache Beam)

Beam Programming Model

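Dataflow is a managed service for running Apache Beam pipelines. The Beam programming model expresses a pipeline as transforms applied to distributed collections of data, and the same model covers both batch and streaming.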

Windowing

Windowing defines how elements are grouped by timestamp (for example into fixed, sliding, or session windows) so that unbounded streams can be aggregated. It is essential for event-time processing.
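
Fixed windows can be illustrated without any framework. This sketch assigns each event to a window by its event timestamp in seconds; the 60-second window size and the function names are assumptions, and this is not Apache Beam code.

```python
from collections import defaultdict

WINDOW_SECONDS = 60  # hypothetical fixed window size


def window_start(event_time, size=WINDOW_SECONDS):
    """Return the start of the fixed window that contains event_time."""
    return event_time - (event_time % size)


def count_per_window(events, size=WINDOW_SECONDS):
    """Count events per fixed window, keyed by window start time."""
    counts = defaultdict(int)
    for event_time, _value in events:
        counts[window_start(event_time, size)] += 1
    return dict(counts)


# (event_time_seconds, value) pairs; illustrative data only.
EVENTS = [(5, "a"), (59, "b"), (60, "c"), (125, "d")]
print(count_per_window(EVENTS))
```

Events at 5 s and 59 s share the window starting at 0, while the event at 60 s opens the next window, because each window covers a half-open interval.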

Triggers

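Triggers determine when the results of a window are emitted, for example when the watermark passes the end of the window, after a processing-time delay, or after a certain number of elements arrive.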

Watermarks

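A watermark is the system's estimate of how far event time has progressed, that is, the point before which all data is expected to have arrived. Data that arrives behind the watermark is considered late and is handled according to the window's lateness settings.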
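
The relationship between a watermark and late data can be sketched with a simple heuristic. The assumption here is that the watermark trails the largest event time seen so far by a fixed delay; real runners derive watermarks from their sources, so this only illustrates the concept, and `classify_by_watermark` is a hypothetical name.

```python
def classify_by_watermark(event_times, allowed_delay):
    """Label each event as on time or late relative to a heuristic watermark.

    event_times: event timestamps in arrival order.
    The watermark is modeled as (max event time seen so far) - allowed_delay.
    """
    watermark = float("-inf")
    max_seen = float("-inf")
    results = []
    for event_time in event_times:
        status = "late" if event_time < watermark else "on_time"
        results.append((event_time, status))
        max_seen = max(max_seen, event_time)
        watermark = max_seen - allowed_delay
    return results


# The event with timestamp 15 arrives after the watermark has reached 20.
print(classify_by_watermark([10, 20, 12, 30, 15], allowed_delay=10))
```

The event at 12 is out of order but still ahead of the watermark, so it counts as on time; the event at 15 arrives after the watermark has passed it and is classified as late.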

5.2 Dataproc (Hadoop / Spark)

Managed Hadoop Ecosystem

Dataproc provides a managed environment for running Hadoop, Spark, and other ecosystem tools efficiently.

Spark, Hive, HBase

Dataproc supports processing frameworks such as Spark and Hive and storage systems such as HBase from the Hadoop ecosystem, enabling diverse analytics workloads.

Lift-and-Shift Workloads

Existing Hadoop and Spark workloads can be migrated to Dataproc with minimal changes while benefiting from elastic cloud resources.

5.3 Choosing Dataflow vs Dataproc

When deciding on which service to use, consider:

Serverless vs Cluster-Based: Dataflow is serverless and provisions and scales workers automatically, while Dataproc is typically cluster-based, so clusters must be created, sized, and maintained.
Cost and Performance Trade-Offs: Analyze workload requirements, data size, throughput needs, and budget constraints to choose appropriately.

Module 6: BigQuery Analytics & Optimization

6.1 BigQuery SQL

Partitioning and Clustering

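Partitioning divides a table into segments based on a column such as a date, so that queries filtering on that column scan only the relevant partitions. Clustering sorts the data within each partition by the values of chosen columns, which further reduces the data scanned for filters and aggregations on those columns.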
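
The effect of partition pruning can be simulated locally. This is a conceptual sketch, not BigQuery code; the rows, the choice of date as partition column and country as clustering column, and the function names are all hypothetical.

```python
from collections import defaultdict

# (date, country, amount) rows; illustrative data only.
TABLE = [
    ("2024-01-01", "DE", 10),
    ("2024-01-01", "US", 20),
    ("2024-01-02", "DE", 30),
    ("2024-01-03", "US", 40),
]


def build_partitions(rows):
    """Partition rows by date, then sort each partition by country (clustering)."""
    partitions = defaultdict(list)
    for day, country, amount in rows:
        partitions[day].append((country, amount))
    return {day: sorted(items) for day, items in partitions.items()}


def query_total(partitions, day, country):
    """Sum amounts for one day and country; also report how many rows were scanned."""
    scanned = partitions.get(day, [])  # pruning: other partitions are never read
    total = sum(amount for row_country, amount in scanned if row_country == country)
    return total, len(scanned)


PARTITIONS = build_partitions(TABLE)
print(query_total(PARTITIONS, "2024-01-01", "DE"))
```

Only the two rows in the requested date partition are scanned, instead of all four rows in the table.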

Nested and Repeated Fields

BigQuery supports complex data structures with nested and repeated fields, facilitating more intricate data modeling capabilities.
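
A nested field behaves like a record inside a row, and a repeated field behaves like a list inside a row. The sketch below models both with Python dictionaries and lists and flattens the repeated field into one row per element, roughly what unnesting does in SQL. The record layout and `unnest_items` are hypothetical.

```python
# One order row with a nested record and a repeated field (illustrative data).
ORDER = {
    "order_id": "o1",
    "customer": {"name": "Ada", "country": "UK"},  # nested record
    "items": [                                     # repeated field
        {"sku": "A", "qty": 2},
        {"sku": "B", "qty": 1},
    ],
}


def unnest_items(order):
    """Flatten the repeated 'items' field into one flat row per item."""
    return [
        {
            "order_id": order["order_id"],
            "customer_name": order["customer"]["name"],
            "sku": item["sku"],
            "qty": item["qty"],
        }
        for item in order["items"]
    ]


for row in unnest_items(ORDER):
    print(row)
```

Keeping items inside the order row avoids a separate table and join for the common case of reading an order together with its items.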

Analytical Functions

BigQuery SQL provides built-in aggregate, window (analytic), and statistical functions, so complex analysis can be performed directly in SQL.
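
To make the idea of a window function concrete, the sketch below emulates a running total per customer, which corresponds conceptually to a sum computed over rows partitioned by customer and ordered by day. It is plain Python for illustration; the data and `running_total` are hypothetical.

```python
from collections import defaultdict


def running_total(rows):
    """Emulate a per-customer running sum ordered by day.

    rows: iterable of (customer, day, amount).
    Returns (customer, day, running_sum) tuples sorted by customer, then day.
    """
    totals = defaultdict(float)
    output = []
    for customer, day, amount in sorted(rows):
        totals[customer] += amount
        output.append((customer, day, totals[customer]))
    return output


PURCHASES = [("a", 2, 5.0), ("b", 1, 1.0), ("a", 1, 10.0)]
print(running_total(PURCHASES))
```

Unlike a plain aggregate, the window-style result keeps one output row per input row, each carrying the aggregate computed up to that point.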

6.2 Performance Optimization

Query Cost Optimization

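With on-demand pricing, query cost is driven by the number of bytes scanned, so effective strategies include selecting only the columns that are needed, filtering on partition and clustering columns, and estimating a query's cost with a dry run before executing it.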
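
A rough back-of-the-envelope estimate shows why column selection matters. All numbers below are assumptions for illustration: the per-column byte widths, the row count, and in particular `PRICE_PER_TIB`, which is a placeholder and not a quoted BigQuery price.

```python
# Hypothetical average bytes per value for each column.
COLUMN_BYTES = {"user_id": 8, "event_time": 8, "payload": 500}
ROWS_IN_TABLE = 1_000_000  # hypothetical table size
PRICE_PER_TIB = 5.0        # placeholder value, not an actual price


def bytes_scanned(columns, rows=ROWS_IN_TABLE):
    """Estimate bytes read when only the given columns are selected."""
    return sum(COLUMN_BYTES[name] for name in columns) * rows


def estimated_cost(columns, rows=ROWS_IN_TABLE, price_per_tib=PRICE_PER_TIB):
    """Estimate on-demand cost as bytes scanned times a per-TiB price."""
    return bytes_scanned(columns, rows) / 2**40 * price_per_tib


all_columns = list(COLUMN_BYTES)
print(bytes_scanned(["user_id"]), bytes_scanned(all_columns))
print(estimated_cost(["user_id"]) < estimated_cost(all_columns))
```

Under these assumptions, selecting only `user_id` reads 8 MB, while selecting every column reads 516 MB, even though both queries touch the same number of rows.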

Slot Management

A slot is a unit of compute capacity that BigQuery uses to execute queries. Managing slots, for example by reserving capacity and assigning it to workloads, allows resources to scale up or down with demand.

BI Engine

BigQuery's BI Engine is an in-memory analysis service that accelerates queries, particularly those issued by dashboards and reporting tools, reducing latency for interactive visualization.

6.3 BigQuery Security

Column-Level Security

Column-level security restricts access to specific fields within tables, ensuring that sensitive information is protected.

Row-Level Security

Row-level security uses row access policies, each with a filter expression, to control which rows of a table a given user or group can see.

Authorized Views

Authorized views allow for sharing subsets of data without exposing the entire underlying dataset, maintaining security while enabling collaboration.

Module 7: Data Orchestration & Automation

7.1 Cloud Composer (Apache Airflow)

DAG Concepts

Cloud Composer defines workflows as Directed Acyclic Graphs (DAGs): tasks are the nodes, dependencies are the edges, and the absence of cycles guarantees that a valid execution order exists.

Scheduling

Tasks are scheduled to run at specified intervals or in response to triggers, so that workflows execute on time and reliably.

Dependency Management

Managing dependencies between tasks helps coordinate execution order to facilitate proper data flows throughout the pipeline.
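
Dependency resolution in a DAG amounts to a topological sort. The sketch below uses Python's standard `graphlib` module rather than Airflow itself; the task names and the `execution_order` helper are hypothetical.

```python
from graphlib import TopologicalSorter

# Each task maps to the set of upstream tasks it depends on (hypothetical pipeline).
DEPENDENCIES = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
    "quality_check": {"transform"},
    "report": {"load", "quality_check"},
}


def execution_order(dependencies):
    """Return one valid order in which every task runs after its upstream tasks."""
    return list(TopologicalSorter(dependencies).static_order())


print(execution_order(DEPENDENCIES))
```

`load` and `quality_check` do not depend on each other, so either may come first; an orchestrator can run such independent tasks in parallel. If the graph contained a cycle, `static_order` would raise an error, which is why workflows must be acyclic.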

7.2 Orchestrating Data Pipelines

Batch Workflows

Batch workflows are orchestrated by scheduling long-running data processing tasks and executing them efficiently in dependency order.

Streaming Workflows

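For streaming workloads, the orchestrator typically launches, monitors, and updates long-running streaming jobs rather than processing events itself, so that pipelines keep responding to real-time data and deliver insights without delay.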

Retry and Failure Handling

Robust strategies must be in place for handling task failures, relying on retries and alternative workflows to maintain data pipeline integrity.
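
A retry policy with exponential backoff can be sketched as below. This is a generic illustration, not Airflow's implementation; `run_with_retries`, `make_flaky`, and the default parameter values are hypothetical.

```python
import time


def run_with_retries(task, max_attempts=3, base_delay=0.0, sleep=time.sleep):
    """Run task(); on failure, retry with exponential backoff.

    The delay before retry n is base_delay * 2 ** (n - 1).
    The last exception is re-raised once max_attempts is exhausted.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return task()
        except Exception:
            if attempt == max_attempts:
                raise  # give up: let the caller run an alternative workflow
            sleep(base_delay * 2 ** (attempt - 1))


def make_flaky(failures):
    """Build a task that fails the first `failures` calls, then returns the call count."""
    state = {"calls": 0}

    def task():
        state["calls"] += 1
        if state["calls"] <= failures:
            raise RuntimeError("transient failure")
        return state["calls"]

    return task


print(run_with_retries(make_flaky(2), max_attempts=3))
```

Retries only help with transient failures, and they are safe only when the task is idempotent, meaning that running it twice leaves the data in the same state as running it once.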

7.3 Enterprise Data Pipelines

End-to-End Orchestration Design

An effective orchestration design for enterprise-level data pipelines enhances workflow efficiency, ensures data integrity, and maintains a scalable architecture that aligns with business goals.