Data Analytics - Week 4

Sharding and Replication in MongoDB

  • Overview of Sharding

    • Sharding is a method used for horizontal scaling in MongoDB, specifically designed to handle very large datasets by distributing them across multiple machines or 'shards'.

    • It involves splitting data into smaller, more manageable logical pieces, then physically distributing these pieces across separate MongoDB instances.

    • The primary goal of sharding is to enhance read and write performance, increase storage capacity, and manage high-volume data operations that a single server cannot handle.

    • Data is partitioned according to a shard key: one or more chosen fields whose values determine how documents are distributed among the shards. Queries that include the shard key can be routed directly to the relevant shard(s).

  • Replication

    • Replication provides high availability and data redundancy in MongoDB by maintaining multiple copies of data on different servers.

    • A replica set consists of a primary node, which handles all write operations, and one or more secondary nodes, which replicate the primary's data by applying its operation log (oplog).

    • For robust fault tolerance and to ensure a valid election for a new primary, a replica set typically requires at least three members: one primary and two secondary nodes, or a primary, a secondary, and an arbiter.

    • Mechanism for Failure:

      • If the primary node fails or becomes unavailable, a process called an election is triggered among the remaining nodes.

      • An arbiter (a voting member that stores no data) or any data-bearing member with voting rights participates in this election, selecting one of the secondary nodes to become the new primary and ensuring continuous operation with minimal downtime.
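
    • As an illustration, a minimal pymongo sketch (the hostnames and the replica set name rs0 are assumptions) of how a client connects to a replica set; the driver discovers the current primary and, with retryable writes, transparently re-routes operations after an election:

      from pymongo import MongoClient

      # Connect to any seed members; the driver discovers the full replica
      # set and routes writes to whichever node is currently primary.
      client = MongoClient(
          "mongodb://host1:27017,host2:27017,host3:27017/?replicaSet=rs0",
          retryWrites=True,  # retry the write once after a failover/election
      )

      # This write goes to the primary; if an election happens mid-operation,
      # the driver retries it against the newly elected primary.
      client.testdb.events.insert_one({"sensor": 42, "value": 17.3})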

Sharding by Primary Key
  • Shard Keys and Data Distribution

    • Each shard within a sharded cluster is itself a replica set, comprising a primary node and several secondary nodes to ensure high availability and data redundancy within that shard.

    • Client applications connect to the sharded cluster through a query router process called mongos, which transparently directs client operations to the appropriate shard(s).

    • mongos routes queries using the cluster metadata maintained by the config servers, which records the ranges of shard key values and the shards that own those ranges, so each operation reaches the shard(s) holding the relevant data.

  • Best Practices for Defining Shard Keys

    • Selecting an appropriate shard key is a critical decision, as it significantly impacts cluster performance and future scalability. Once defined, changing the shard key or the number of shards can be very difficult and resource-intensive.

    • It is highly recommended to establish the shard key from the outset of designing the database architecture, considering anticipated query patterns and data distribution.

    • While older versions made such changes nearly impossible, newer releases have relaxed this: MongoDB 4.2 allows updating a document's shard key value, 4.4 adds refineCollectionShardKey for extending an existing key with suffix fields, and 5.0 introduces live resharding. These remain complex, resource-intensive operations, so it is still advised to design shard keys well initially to avoid them.
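
    • For illustration, a pymongo sketch (the database, collection, and field names are hypothetical) that connects through mongos and shards a collection on a hashed key, the programmatic equivalent of sh.shardCollection in the mongo shell:

      from pymongo import MongoClient

      # Connect to the mongos query router, not to an individual shard.
      client = MongoClient("mongodb://mongos-host:27017")

      # Enable sharding on the database, then shard the collection.
      # A hashed shard key spreads monotonically increasing values
      # (timestamps, ObjectIds) evenly across shards.
      client.admin.command("enableSharding", "iot")
      client.admin.command(
          "shardCollection", "iot.readings",
          key={"device_id": "hashed"},
      )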

Difference Between Sharding and Replication
  • Sharding

    • Sharding is fundamentally a strategy for horizontal scaling that focuses on breaking a large dataset into smaller, independent pieces (fragments) for distribution across multiple servers.

    • Its primary purpose is to increase storage capacity and improve operational performance by parallelizing read and write operations. It does not inherently provide data duplication for redundancy.

  • Replication

    • Replication is a strategy for high availability and fault tolerance that ensures data redundancy and durability by maintaining identical copies of data across multiple servers.

    • It maintains duplicate copies of data within designated replica sets, significantly increasing data safety and availability in case of server failures. Replication does not primarily scale write capacity beyond the primary node.

Quiz Overview
  • Concepts of Sharding

    • Question on Sharding as Fragmentation: Sharding distributes data by breaking it into fragments across separate servers, which is a form of horizontal partitioning, not data duplication or replication.

    • Redundancy: While sharding itself doesn't duplicate data, each shard is typically implemented as a replica set. Therefore, shard replica sets do provide redundancy for the partitioned data, ensuring high availability for each fragment.

    • Horizontal Partitioning Misunderstanding: Horizontal partitioning (sharding) splits data by rows/documents and spreads those pieces, together with their workload, across additional servers (nodes), increasing capacity and performance across the cluster. It is distinct from vertical partitioning (splitting tables by columns) and from scaling concepts that don't involve distributing data.

    • Job-Based Sharding: This concept refers to intelligently placing data close to specific users or applications, which can significantly improve application performance by reducing data latency and network overhead, ensuring that operations are handled by geographically or logically proximate shards.

  • Directory-Based Sharding:

    • Directory-based sharding requires a lookup table or a routing component (like Mongos) that identifies the exact shard allocation for specific data based on its shard key. This differs from hash-based or range-based sharding methods which use algorithms to determine data placement.
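
    • To make the idea concrete, here is a toy Python sketch of a directory-based router (the shard names and key ranges are invented): an explicit lookup table, rather than a hash or range function, decides where each key lives:

      # Toy directory-based shard router: the lookup table maps key ranges
      # to shards, so placement can be changed by editing the directory
      # instead of re-hashing the data.
      DIRECTORY = [
          # (low key inclusive, high key exclusive, shard name)
          ("a", "h", "shard-1"),
          ("h", "p", "shard-2"),
          ("p", "{", "shard-3"),  # '{' sorts just after 'z' in ASCII
      ]

      def route(key: str) -> str:
          first = key[0].lower()
          for low, high, shard in DIRECTORY:
              if low <= first < high:
                  return shard
          raise KeyError(f"no shard registered for key {key!r}")

      print(route("customer-42"))  # -> shard-1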

ACID Transactions in MongoDB
  • MongoDB introduced multi-document ACID transactions starting with version 4.0 for replica sets and 4.2 for sharded clusters, enabling reliable management of complex transactions in a concurrent environment.

  • ACID properties ensure the integrity and reliability of database transactions:

    • Atomicity: Ensures that all operations within a transaction are completed successfully; if any part fails, the entire transaction is rolled back, leaving the database unchanged.

    • Consistency: Guarantees that a transaction brings the database from one valid state to another, maintaining all defined rules and constraints.

    • Isolation: Ensures that concurrent transactions execute independently without interference from each other, appearing as if they are executed sequentially.

    • Durability: Guarantees that once a transaction has been committed, it will remain committed even in the event of power loss, crashes, or system errors.
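
  • A minimal pymongo sketch of a multi-document transaction (the collection names and transfer logic are hypothetical; transactions require a replica set or sharded cluster):

      from pymongo import MongoClient

      client = MongoClient("mongodb://host1:27017/?replicaSet=rs0")
      accounts = client.bank.accounts

      # Everything inside the inner with-block commits atomically;
      # any exception aborts and rolls back the whole transaction.
      with client.start_session() as session:
          with session.start_transaction():
              accounts.update_one({"_id": "alice"},
                                  {"$inc": {"balance": -100}}, session=session)
              accounts.update_one({"_id": "bob"},
                                  {"$inc": {"balance": 100}}, session=session)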

Schema Validation with Large Datasets
  • When dealing with massive datasets, such as 500 million JSON documents streaming from IoT devices, ensuring data integrity and consistency is crucial.

  • Architects often need to evaluate schema-on-read approaches, which allow documents to be uploaded flexibly without a rigid schema enforced at write time, with validation occurring when the data is read or processed.

  • This approach helps avoid data loss during high-volume uploads where strict schema-on-write validation might cause rejections or delays. However, it shifts the burden of validation and potential data inconsistency issues to the application layer or later processing stages.
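
  • A small Python sketch of schema-on-read (the field names and validity rules are invented): documents are ingested as-is, and validation runs only when the data is read back for processing:

      def validate_reading(doc: dict) -> bool:
          """Schema-on-read check applied at processing time, not at upload."""
          return (
              isinstance(doc.get("device_id"), str)
              and isinstance(doc.get("temperature"), (int, float))
              and -50 <= doc["temperature"] <= 150
          )

      raw_docs = [
          {"device_id": "sensor-1", "temperature": 21.5},
          {"device_id": "sensor-2", "temperature": "N/A"},  # ingested, but invalid
      ]

      # Ingest accepted everything; only the read path filters bad records.
      valid = [d for d in raw_docs if validate_reading(d)]
      print(len(valid), "of", len(raw_docs), "documents passed validation")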

Cassandra Introduction
  • Cassandra is a highly available, decentralized, and linearly scalable NoSQL database system, designed to handle very large amounts of structured, semi-structured, and unstructured data across many commodity servers.

  • It's presented as another powerful NoSQL database to be discussed in upcoming sessions, contrasting its design principles with MongoDB's document model.

  • We will delve into Cassandra's data model (oriented around keyspaces and column families/tables), data partitioning strategies, clustering architecture, various replication models, and best practices for high-performance data processing in distributed environments.

Comparison with MongoDB
  • NoSQL Types Overview: The NoSQL landscape includes various database types:

    • Key-value stores: Simple, high-performance stores (e.g., Redis).

    • Document-based stores: Flexible, schema-less JSON-like document storage (e.g., MongoDB).

    • Column-based stores: Optimized for aggregate queries over large datasets by storing data in columns (e.g., Cassandra, HBase).

    • Graph-based stores: Designed for data with complex relationships (e.g., Neo4j).

  • Cassandra specifics: Cassandra is celebrated for its exceptional scalability, high availability, and strong partition tolerance (favoring A and P over C in the CAP theorem, offering tunable consistency).

    • Its distributed nature allows for linear scalability by simply adding more nodes.

    • The availability and consistency levels can be adjusted based on application requirements, ranging from eventual consistency to strong consistency (see the driver sketch after this list).

    • Availability comparisons depend on the deployment: traditional relational databases like Oracle typically achieve high availability through dedicated failover configurations, whereas Cassandra achieves it through a peer-to-peer, masterless architecture that avoids single points of failure.
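
    • As a sketch of tunable consistency with the Python cassandra-driver (the contact point, keyspace, and query are assumptions), the consistency level can be chosen per statement:

      from cassandra import ConsistencyLevel
      from cassandra.cluster import Cluster
      from cassandra.query import SimpleStatement

      session = Cluster(["127.0.0.1"]).connect("demo_ks")

      # QUORUM requires a majority of replicas to answer, trading latency
      # for stronger consistency; ONE would favor availability and speed.
      stmt = SimpleStatement(
          "SELECT * FROM users WHERE user_id = %s",
          consistency_level=ConsistencyLevel.QUORUM,
      )
      rows = session.execute(stmt, ("u-42",))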

Data Distribution in Cassandra
  • Keyspaces and Column Families: Unlike MongoDB, which organizes data into databases and collections of JSON-like documents, Cassandra operates on keyspaces (similar to a schema or database) and column families (now more commonly referred to as tables).

    • A keyspace defines replication strategies and options, while tables define the structure of the data.

    • Data scalability is achieved through independent nodes within a cluster operating in a peer-to-peer manner, where every node can accept read and write requests and is aware of the data distribution across the entire cluster.

    • Cassandra utilizes a partition key (a primary key component) for distributing data uniformly across different nodes in the cluster, ensuring that related data is often grouped together for efficient retrieval.
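
    • A sketch (the keyspace, table, and replication factor are illustrative) creating a keyspace and table through the Python driver; the keyspace carries the replication options, the table the data layout:

      from cassandra.cluster import Cluster

      session = Cluster(["127.0.0.1"]).connect()

      # The keyspace holds the replication strategy and factor.
      session.execute("""
          CREATE KEYSPACE IF NOT EXISTS iot
          WITH replication = {'class': 'SimpleStrategy',
                              'replication_factor': 3}
      """)

      # The partition key (device_id) decides which nodes own each row.
      session.execute("""
          CREATE TABLE IF NOT EXISTS iot.readings (
              device_id text,
              ts timestamp,
              value double,
              PRIMARY KEY ((device_id), ts)
          )
      """)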

Data Modelling
  • Partition and Clustering Keys Importance: Data modeling in Cassandra revolves heavily around query patterns and the careful selection of partition and clustering keys.

    • The partition key is critical; the user must design partition keys accurately to ensure that the data for a given partition fits within a single node and to distribute data evenly across the cluster. Poor partition key selection can lead to hot spots or massive partitions that degrade performance.

    • Clustering keys define the order in which data is stored within a partition on a node. They arrange data rows internally within partitions, enabling efficient range queries and sorting.

    • Composite keys (using multiple columns for either the partition key or clustering key) can be used flexibly but should be designed based on foreseen traffic intentions, expected data volume, and typical query filters to optimize for specific access patterns.
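
    • To illustrate (the table and column names are invented), a table with a composite partition key and ordered clustering keys, created via the Python driver:

      from cassandra.cluster import Cluster

      session = Cluster(["127.0.0.1"]).connect()

      # The composite partition key (sensor_id, day) bounds any single
      # partition to one day of data; the clustering keys (ts, reading_type)
      # sort rows inside each partition for efficient range scans.
      session.execute("""
          CREATE TABLE IF NOT EXISTS iot.readings_by_day (
              sensor_id text,
              day date,
              ts timestamp,
              reading_type text,
              value double,
              PRIMARY KEY ((sensor_id, day), ts, reading_type)
          ) WITH CLUSTERING ORDER BY (ts DESC, reading_type ASC)
      """)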

Data Partitioning and Reading Data
  • Best Practices for Querying: To retrieve results efficiently in Cassandra, all components of the partition key must be mentioned in WHERE clauses of queries. This enables Cassandra to directly route the query to the correct nodes holding the relevant partition.

    • Specification of clustering keys in a query is not mandatory if only the partition key is used and you want all data within that partition. However, if clustering keys are defined and filtering or ordering is desired on them, they should follow the proper order in the WHERE clause (e.g., WHERE partition_key = X AND clustering_key1 = Y AND clustering_key2 > Z).

    • Validation of Queries: Invalid queries prominently include those lacking a complete specification of the partition key in the WHERE clause, as Cassandra cannot determine where to find the data without it. Queries that attempt to filter or order on clustering keys in an incorrect sequence (not matching their definition order) will also be invalid or inefficient.
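
    • A few example queries against the hypothetical readings_by_day table sketched above, showing which WHERE clauses Cassandra accepts:

      from cassandra.cluster import Cluster

      session = Cluster(["127.0.0.1"]).connect("iot")

      # Valid: the full partition key is given, and the clustering key ts
      # is filtered as a range after it.
      session.execute("""
          SELECT * FROM readings_by_day
          WHERE sensor_id = 's1' AND day = '2024-05-01'
            AND ts > '2024-05-01 06:00:00'
      """)

      # Invalid: the partition key is incomplete (day is missing), so
      # Cassandra cannot locate the partition and rejects the query.
      # session.execute("SELECT * FROM readings_by_day WHERE sensor_id = 's1'")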

Loading Data in Cassandra
  • Loading data into Cassandra can be done using various methods, including the COPY FROM command in cqlsh, client drivers, and tools like dsbulk for bulk loading.

  • Dashboards and Monitoring: Effective monitoring tools and dashboards are essential for gaining insights into the performance metrics of the Cassandra system. These tools track usage metrics (CPU, memory, disk I/O, network) and latency for read/write operations, helping identify bottlenecks and ensure optimal cluster health.

  • Working with Structured and Semi-Structured Data: Cassandra supports various data types, including collections such as maps, sets, and lists. Stored alongside a row's unique identifier (its primary key), these collections let diverse, semi-structured data live within a single table while remaining efficiently queryable, as sketched below.
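
  • A brief sketch (all names invented) combining driver-side loading via prepared statements with Cassandra's collection types:

      from cassandra.cluster import Cluster

      session = Cluster(["127.0.0.1"]).connect("iot")

      session.execute("""
          CREATE TABLE IF NOT EXISTS devices (
              device_id text PRIMARY KEY,
              tags set<text>,
              readings list<double>,
              metadata map<text, text>
          )
      """)

      # Prepared statements are parsed once and reused, a reasonable
      # driver-level loading pattern for modest volumes (COPY FROM or
      # dsbulk suit very large files better).
      insert = session.prepare(
          "INSERT INTO devices (device_id, tags, readings, metadata) "
          "VALUES (?, ?, ?, ?)")
      session.execute(insert, ("d1", {"indoor", "temp"}, [21.5, 21.7],
                               {"fw": "1.2.0"}))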

MapReduce Equivalent in Hadoop
  • MapReduce is a core programming model in Hadoop for processing large datasets in a distributed computing environment. It involves two main phases:

    • The Map phase processes input data in parallel, transforming it into key-value pairs.

    • The Reduce phase then aggregates and combines these key-value pairs to produce a final output.

  • A key principle of MapReduce is locality of reference, whereby computation occurs locally on the nodes where the data resides, rather than transferring large data sets between nodes. This significantly reduces network traffic and improves processing efficiency for large-scale data analysis.
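
  • A minimal pure-Python word-count sketch that mirrors the two phases (real Hadoop would run the map tasks on the nodes that hold each input block, per locality of reference):

      from collections import defaultdict

      documents = ["big data on commodity hardware",
                   "big clusters process big data"]

      # Map phase: emit (word, 1) key-value pairs over the input splits.
      mapped = [(word, 1) for doc in documents for word in doc.split()]

      # Shuffle: group all values by key.
      groups = defaultdict(list)
      for word, count in mapped:
          groups[word].append(count)

      # Reduce phase: aggregate each key's values into the final output.
      result = {word: sum(counts) for word, counts in groups.items()}
      print(result)  # {'big': 3, 'data': 2, ...}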

Summary of Hadoop Developments
  • Hadoop has evolved significantly since its inception, becoming an open-source framework for distributed storage and processing of large datasets on commodity hardware.

  • Its history traces back to Google's MapReduce and GFS papers. Hadoop's applications span data warehousing, log processing, machine learning data preparation, and more.

  • MapReduce remains a fundamental component within the broader Hadoop ecosystem, which now includes various tools like HDFS (distributed file system), YARN (resource management), Hive (data warehousing), Pig (high-level platform), and Spark (fast, general-purpose cluster computing), all designed to address challenges in big data handling in large-scale applications.

Discussions and Questions
  • This section covers interactive discussions on critical aspects of distributed systems and data management, such as:

    • Scalability strategies: How to achieve linear scalability and handle increasing data volumes and user loads.

    • Optimal partition sizes: Best practices for defining partition keys to ensure balanced data distribution, efficient disk management, and optimal performance, avoiding over- or under-sized partitions.

    • Handling multiple clustering keys in query settings: Advanced techniques for querying data when multiple clustering keys are defined, including range queries and filtering across different clustering key components.

Conclusion and Further Reading
  • Scheduled topics for future discussions will delve deeper into the Hadoop ecosystem, focusing on advanced features, recognizing its limitations for certain workloads, and understanding its operational context for processing even larger and more complex datasets.

  • Future sessions may cover specific Hadoop tools in more detail, performance tuning, and integrating Hadoop with other big data technologies.