Understanding Big Data - Scalability, Data Distribution, CAP Theorem, ACID, and BASE

Databases

Databases are designed for storing and retrieving data.
Database management systems must:
- Store data persistently.
- Maintain data consistency.
- Ensure data availability.
Balancing consistency and high availability may increase transaction execution times.

Big Data Storage

Big data systems need to store and query vast amounts of variant data with multiple copies.
Examples include Facebook and Google, which process Exabytes and Zettabytes of data.
Big data requires scalable, flexible, available, and cost-effective storage strategies and technologies.
Scaling up becomes difficult and expensive as data volumes increase.
Scaling out (over a cluster) is a more appealing option but introduces complexity.

Scalability

Scalability is a system's ability to increase or decrease performance and cost in response to changes in processing demands.
It's a system's property to handle a growing amount of work by adding resources.
Scalability refers to how well a hardware system performs when the number of users increases.
Also refers to how well a database withstands growing numbers of queries.

Scale Up vs Scale Out

Scaling up:
- Making a component bigger or faster to handle more load.
- Example: Upgrading storage or processors.
Scaling out:
- Adding more components in parallel to spread out a load.
- Example: Independent CPU, independent memory, etc.

Clusters

A cluster is a tightly coupled collection of servers or nodes connected via a network to work as a single unit.
Each node in the cluster has its dedicated resources, such as memory, a processor, and a hard drive.
A cluster can execute a task by splitting it into small pieces and distributing their execution onto different computers that belong to the cluster.

Distributed File Systems

A distributed file system can store large files spread across the nodes of a cluster.
To the user, files appear to be local; however, physically, the files are distributed throughout the cluster.
This local view is presented via the distributed file system, and it enables the files to be accessed from multiple locations.
Examples include the Google File System (GFS) and Hadoop Distributed File System (HDFS).

Data Distribution

Data Distribution Models
- Single server (is an option for some applications).
- Multiple servers.
Orthogonal aspects of data distribution
- Sharding: different data on different nodes.
- Replication: the same data copied over multiple nodes.

Sharding

Sharding is the process of horizontally partitioning a large dataset into a collection of smaller, more manageable datasets called shards.
The shards are distributed across multiple nodes, where a node is a server or a machine.
Each shard shares the same schema, and all shards collectively represent the complete dataset.
It allows the distribution of processing loads across multiple nodes to achieve horizontal scalability.
Since each node is responsible for only a part of the whole dataset, read/write times are greatly improved.
It provides partial tolerance toward failures; in case of a node failure, only data stored on that node is affected.
With regards to data partitioning, query patterns need to be considered so that shards themselves do not become performance bottlenecks.
Queries requiring data from multiple shards will impose performance penalties.
Data Locality keeps commonly accessed data co-located on a single shard and helps counter such performance issues.
Main rules of sharding:
1. Place the data close to where it's accessed – Data Locality (e.g., Orders form Aswan: data in the Upper Egypt data centre).
2. Try to keep the load even – All nodes should get equal amounts of the load.
3. Put together data that may be read in sequence – Same orders, same node.

Replication

Replication stores multiple copies of a dataset, known as replicas, on multiple nodes.
It provides scalability and availability since the same data is replicated on various nodes.
Fault tolerance is also achieved since data redundancy ensures that data is not lost when an individual node fails.
Two different methods that are used to implement replication:
- Master-Slave
- Peer-to-peer

Master-Slave

In a master-slave configuration, all data is written to a master node. Once saved, the data is replicated over to multiple slave nodes.
All external write requests, including insert, update, and delete, occur on the master node.
Read requests can be fulfilled by any slave node.
Ideal for read-intensive loads rather than write-intensive loads.
Write performance will suffer as the amount of writes increases.
Reads are still possible via any of the slave nodes if the master node fails.
Writes are not supported until a master node is re-established if the master node fails.
The master node is either restored to life, or a new master node is chosen from the slave nodes.
A slave node can be configured as a backup node for the master node.
One concern with master-slave replication is read inconsistency, which can be an issue if a slave node is read prior to an update to the master being copied to it.
A voting system can be implemented where a read is declared consistent if the majority of the slaves contain the same version of the record.
Implementation of such a voting system requires a reliable and fast communication mechanism between the slaves.

Peer-to-Peer Replication

All the replicas have equal weight; there is no master-slave relationship between the nodes.
The loss of any of them doesn't prevent access to the data store.
Each node, known as a peer, is equally capable of handling reads and writes.
Each write is copied to all peers - prone to write inconsistencies.
Peer-to-peer replication is prone to write inconsistencies that occur as a result of a simultaneous update of the same data across multiple peers.
This is addressed by implementing one of the following concurrency strategy:
- Pessimistic Concurrency (2PC) - Proactive strategy that prevents inconsistency.
  - It uses locking to ensure that only one update to a record can occur at a time on the cost of availability since the database record being updated remains unavailable until all locks are released.
- Optimistic Concurrency (OCC) - Reactive strategy that does not use locking.
  - Instead, it allows inconsistency to occur with knowledge that eventually consistency will be achieved after all updates have propagated.
- Voting Protocols - Paxos or Raft

Combining Sharding and Master-Slave Replication

There are multiple masters, but each data item has a single master.
Two schemes:
- A node can be a master for some data and slaves for others.
- Nodes are dedicated for master or slave duties.
Write consistency is maintained by the master-shard. However, if the master-shard becomes non-operational, fault tolerance with regards to write operations is impacted.
Replicas of shards are kept on multiple slave nodes to provide scalability and fault tolerance for read operations.

Combining Sharding and Peer-to-Peer Replication

Each shard is replicated to multiple peers.
Each peer is only responsible for a subset of the overall dataset.
Collectively, this helps achieve increased scalability and fault tolerance.
As there is no master involved, there is no single point of failure, and fault-tolerance for both read and write operations is supported.

Maintain Data Consistency

Consistency means that only valid data, according to all defined rules, will be written to the persistent storage.
In the context of distributed systems, consistency also refers to maintaining a single and logically coherent view of data.
Relational database systems are designed to support consistency by the concept of atomic transaction (ACID Property).

Ensure Data Availability

Availability is the degree to which data can be instantly accessed even if a failure occurs.
One way to avoid unavailability is to have two database servers:
- Primary Server
- Backup Server.
To keep them consistent, a 2PC or 3PC Protocol can be used.

Consistency and Availability

There is a fundamental trade-off between Consistency and availability
There are two broad options:
- Ensuring consistency among all replicas on the cost availability.
- Accepting and coping with inconsistent writes to ensure availability.
These points are at the ends of a spectrum where we trade off consistency for availability.
Different domains have different tolerances for inconsistency, and we need to take this tolerance into account as we make our decisions.

Network Partition

Network partitioning is a network failure that causes the nodes to split into multiple groups such that a node in a group cannot communicate with nodes in other groups.
In a partition scenario, all sides of the original cluster operate independently assuming nodes in other sides are failed.

The CAP Theorem

"Given the properties of Consistency, Availability, and Partition tolerance, you can only get two”
Brewer’s Theorem
A triple constraint related to distributed database systems
- Consistency - read from any node results in the same data across multiple nodes.
- Availability - a guarantee that every request receives a response about whether it was successful or failed).
- Partition tolerance - the system continues to operate despite arbitrary message loss or failure of part of the system.
It is impossible for a distributed system to simultaneously provide all three properties together.
- CP: If a network partition occurs; the system will not respond to requests until it can ensure that all nodes have the same data. Example, used in banking systems (e.g., transfers should not allow inconsistent balance updates).
- AP: The system will continue to respond to requests even during a partition, but different nodes might return different data. Example, used in social media feeds, shopping carts (eventual consistency is okay).
- CA: This is generally not practical for distributed systems, as network partitions are inevitable in real-world scenarios. Example, in traditional relational databases (SQL) when running on a single node.
In a distributed database, scalability and fault tolerance can be improved through additional nodes, although this challenges consistency (C).
The addition of nodes can also cause availability (A) to suffer due to the latency caused by increased communication between nodes.
In distributed database systems, partition tolerance (P) must always be supported; therefore, CAP is generally a choice between choosing either C+P or A+P.
The requirements of the system will dictate which is chosen.

CAP Example

Alice is trying to book a room of the ABC Hotel on a node located in London of a booking system.
Bob is trying to do the same on a node located in Mumbai
The booking system uses a peer-to-peer distribution
There is only one room available
The network link breaks

Pick Two	Solution	Result
CA	Neither user can book any hotel room.	Neither user can book any hotel room.
CP	Designate Mumbai node as the master	Bob can make the reservation; Alice can see the inconsistent room information.
AP	Both nodes accept the hotel reservation	Overbooking

ACID

ACID is a database design principle related to transaction management. It is an acronym that stands for:
- Atomicity: All of the operations in the transaction will complete, or none will.
- Consistency: Transactions never observe or result in inconsistent data.
- Isolation: The transaction will behave as if it is the only operation being performed upon the database (i.e. uncommitted transactions are isolated).
- Durability: Upon completion of the transaction, the operation will not be reversed (i.e. committed transactions are permanent).

Forms of Consistency

ACID transaction Strong (or immediate) consistency Logical consistency
- No read-write conflicts (atomic transactions)
- Updates are serialized Sequential consistency
- Within a user’s session Session (or read-your-writes) consistency
- You may have replication inconsistencies, but eventually, all nodes will be updated to the same value Eventual consistency

BASE Properties

BASE is a database design principle based on the CAP theorem and leveraged by database systems that use distributed technology.
Favors availability over consistency by relaxing the strong consistency constraints mandated by the ACID properties.
- Basically Available means that there can be a partial failure in some parts of the distributed system, and the rest of the system continues to function.
- Soft State means data may eventually be overwritten with more recent data.
- Eventually Consistent means that there may be times when the database is in an inconsistent state.
  - Casual consistency
  - Read-your-writes consistency
  - Session consistency
  - Monotonic read consistency
  - Monotonic write consistency

RDBs and Scalability

The ACID properties are the cornerstone of SQL databases, ensuring reliable processing of transactions.
The ACID properties seem indispensable, and yet they are incompatible with availability and performance in very large systems.
Maintaining these properties across a distributed system is a challenge
Relational databases are designed to scale up on expensive single machines.