Why NoSQL?
•Relational databases have been a successful technology for twenty years, providing persistence, concurrency control, and an integration mechanism.
•Application developers have been frustrated with the impedance mismatch between the relational model and the in-memory data structures.
•The vital factor for a change in data storage was the need to support large volumes of data by running on clusters.
•Relational databases are not designed to run efficiently on clusters.
Common characteristics of a NoSQL database
•No relational model
•Suited to clusters
•Open-source
•Suits unstructured data (schemaless)
Current NoSQL trends
customer shift continues online
the internet is connecting everything
big data is getting better
applications are moving to the cloud
the world has gone mobile
Different NoSQL database models
key value
document
column family
graph (aggregate ignorant)
Aggregate orientated
Makes it easier for the database to manage data storage over clusters
Aggregate
A collection of data that can be interrogated as a unit
Aggregate orientated works well when
most of the time the same aggregate is needed
aggregate orientated does not work well when
users regularly change how they interrogate the data
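A minimal sketch of the trade-off above, assuming a made-up set of order aggregates (the `orders` data and both function names are hypothetical). Fetching a whole order is one lookup, while a question that cuts across orders has to visit every aggregate.

```python
# Hypothetical order aggregates, keyed by order id (illustrative data only).
orders = {
    "order-1": {
        "customer": "ann",
        "items": [{"product": "pen", "qty": 2}, {"product": "ink", "qty": 1}],
    },
    "order-2": {
        "customer": "bob",
        "items": [{"product": "pen", "qty": 5}],
    },
}


def get_order(order_id):
    """Fetch one whole aggregate in a single lookup (the case aggregates suit)."""
    return orders[order_id]


def units_sold(product):
    """Interrogate the data a different way: this must scan every aggregate."""
    return sum(
        item["qty"]
        for order in orders.values()
        for item in order["items"]
        if item["product"] == product
    )
```

If most requests look like `get_order`, the aggregate boundary helps; if they mostly look like `units_sold`, it gets in the way.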
update consistency
write-write conflicts (lost updates, values overwritten)
read consistency
read-write conflict (reading in the middle of someone else’s write)
relaxed consistency
a fully consistent DB is possible, but at what cost to performance? A trade-off may be necessary, and the domain may tolerate some inconsistency
The CAP Theorem (Consistency, Availability, Partition Tolerance)
Given the three properties, you can only get two
Consistency
When data is queried, the user receives the most up to date version of data
Availability
When data is queried, the user always receives a response, even if it is not the most up to date version
Partition Tolerance
Where a database is distributed over a clustered network, should part of the network fail, the rest of the clustered network can continue to operate (partitions on the network can be tolerated)
if the network is working normally
all nodes are operating normally, reading, writing, and syncing with each other (Consistency AND availability)
if the network becomes partitioned
part of the network has failed, which partitions the network so that nodes can't communicate with each other normally (consistency OR availability)
Write-write conflicts occur when
two clients try to write the same data at the same time
read-write conflicts occur when
one client reads inconsistent data in the middle of another client’s write
pessimistic approaches lock data records to
prevent conflicts
optimistic approaches detect conflicts and
fix them
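One common optimistic approach is a version check on write (a conditional update). The sketch below assumes a hypothetical in-memory `VersionedStore`; the class and exception names are illustrative, not taken from any particular database.

```python
class VersionConflict(Exception):
    """Raised when a write is based on a stale read."""


class VersionedStore:
    def __init__(self):
        self._data = {}  # key -> (version, value)

    def read(self, key):
        # Unknown keys start at version 0 with no value.
        return self._data.get(key, (0, None))

    def write(self, key, value, expected_version):
        current_version, _ = self.read(key)
        if current_version != expected_version:
            # Someone else wrote since this client read: detect, do not overwrite.
            raise VersionConflict(
                f"{key}: expected v{expected_version}, found v{current_version}"
            )
        self._data[key] = (current_version + 1, value)
        return current_version + 1


store = VersionedStore()
version_seen_by_a, _ = store.read("stock")
version_seen_by_b, _ = store.read("stock")
store.write("stock", 10, version_seen_by_a)      # client A's write succeeds
try:
    store.write("stock", 20, version_seen_by_b)  # client B's read is now stale
    conflict_detected = False
except VersionConflict:
    conflict_detected = True
```

A pessimistic design would instead make client B wait for a lock before reading, so the conflict never arises.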
distributed systems see read-write conflicts due to
some nodes having received updates while other nodes have not
eventual consistency
at some point the system will become consistent once all the writes have propagated to all the nodes
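A toy illustration of eventual consistency, assuming a hypothetical `Cluster` in which a write lands on one replica straight away and reaches the others only when `propagate()` runs. Real systems propagate in the background; the explicit call just makes the inconsistent window visible.

```python
class Cluster:
    def __init__(self, num_replicas):
        self.replicas = [{} for _ in range(num_replicas)]
        self.pending = []  # (replica_index, key, value) not yet delivered

    def write(self, key, value):
        # The first replica accepts the write at once; the rest are queued.
        self.replicas[0][key] = value
        for index in range(1, len(self.replicas)):
            self.pending.append((index, key, value))

    def read(self, replica_index, key):
        return self.replicas[replica_index].get(key)

    def propagate(self):
        # Deliver queued writes in order; afterwards all replicas agree.
        for index, key, value in self.pending:
            self.replicas[index][key] = value
        self.pending.clear()


cluster = Cluster(3)
cluster.write("x", 1)
stale_read = cluster.read(2, "x")   # replica 2 has not seen the write yet
cluster.propagate()
fresh_read = cluster.read(2, "x")   # now consistent with replica 0
```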
the CAP theorem states that if you get a network partition, you have to
trade off availability of data versus consistency
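The choice can be pictured with a single node that has lost contact with its peers. This is only a sketch: the `Node` class and the "CP"/"AP" mode labels are assumptions made for illustration, and a real database makes this decision per operation with far more nuance.

```python
class Unavailable(Exception):
    """Raised when a node refuses to answer rather than risk a stale reply."""


class Node:
    def __init__(self, mode):
        if mode not in ("CP", "AP"):
            raise ValueError("mode must be 'CP' or 'AP'")
        self.mode = mode
        self.value = "v1"          # last value this node synced from its peers
        self.partitioned = False   # True when peers cannot be reached

    def read(self):
        if self.partitioned and self.mode == "CP":
            # Favour consistency: no answer is better than a possibly stale one.
            raise Unavailable("cannot confirm the latest value during a partition")
        # Favour availability (or no partition): answer with what we have.
        return self.value


cp_node, ap_node = Node("CP"), Node("AP")
cp_node.partitioned = ap_node.partitioned = True

ap_answer = ap_node.read()         # responds, but the value may be out of date
try:
    cp_node.read()
    cp_answered = True
except Unavailable:
    cp_answered = False
```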
document data model two options
Embedded or Normalised
Embedded Data Model
Capture relationships between data by storing related data in a single document structure
allow applications to retrieve and manipulate related data in a single database operation
Normalised Data Model
references store the relationships between data by including links from one document to another
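The two options can be compared side by side using plain dictionaries as stand-ins for documents. The customer and address data, field names, and helper functions below are all hypothetical.

```python
# Embedded: the related addresses live inside the customer document.
customer_embedded = {
    "id": "c1",
    "name": "Ann",
    "addresses": [{"city": "Leeds"}, {"city": "York"}],
}

# Normalised: the customer document holds references to address documents.
customers = {"c1": {"id": "c1", "name": "Ann", "address_ids": ["a1", "a2"]}}
addresses = {
    "a1": {"id": "a1", "city": "Leeds"},
    "a2": {"id": "a2", "city": "York"},
}


def cities_embedded(customer):
    # One document read gives everything needed.
    return [address["city"] for address in customer["addresses"]]


def cities_normalised(customer_id):
    # One read for the customer, then one lookup per referenced address.
    customer = customers[customer_id]
    return [addresses[address_id]["city"] for address_id in customer["address_ids"]]
```

Both return the same cities; the difference is how many lookups are needed and whether an address can be shared or updated in one place.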
Distributed database
a logically interrelated collection of shared data, physically distributed over a computer network
Distributed DBMS
software system that permits the management of the distributed database and makes the distribution transparent to users
Types of DDBMS
Homogeneous and Heterogeneous
Homogeneous DDBMS
All sites use same DBMS product, much easier to design and manage
approach provides incremental growth and allows increased performance
Heterogeneous DDBMS
Sites may run different DBMS products with possibly different underlying data models
occurs when sites have implemented their own databases and integration is considered later
heterogeneous DDBMS require translations to allow for
different hardware and different DBMS products
functions of a DDBMS
functionality of a DBMS, extended communication services, data dictionary
concurrency control and recovery services, as well as distributed query processing
three key issues of distributed database design
fragmentation, allocation and replication
Fragmentation
Relation may be divided into a number of sub-relations, which are then distributed
Allocation
Each fragment is stored at site with optimal distribution
Replication
copy of fragments may be maintained at several sites
definition and allocation of fragments carried out strategically to achieve
locality of reference
improved reliability and availability
improved performance
balanced storage capacities and costs
minimal communications costs
quantitative information for fragmentation may include
frequency with which an application is run
site from which an application is run
performance criteria for transactions and applications
qualitative information may include
transactions that are executed by application
type of access
predicates of read operations
four alternative strategies regarding placement of data for data allocation:
centralized
partitioned (fragmented)
complete replication
selective replication
Centralized (Data Allocation)
Consists of single database and DBMS stored at one site with users distributed across the network
Partitioned (data allocation)
database partitioned into disjoint fragments
each fragment assigned to one site
complete replication (data allocation)
consists of maintaining complete copy of database at each site
selective replication (data allocation)
combination of partitioning, replication, and centralization
Why fragment?
applications work with views rather than entire relations
data is stored close to where it is most frequently used
data that is not needed by local applications is not stored
with fragments as unit of distribution, transaction can be divided into several subqueries that operate on fragments
data not required by local applications is not stored and so not available to unauthorized users
disadvantages of fragmenting
performance, integrity
types of fragmentation
horizontal, vertical, mixed, derived
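A small sketch of the first two types, assuming a made-up `staff` relation held as a list of dictionaries. Horizontal fragmentation selects rows by a predicate; vertical fragmentation selects columns and keeps the key so the relation can be rebuilt. All names and data here are illustrative.

```python
staff = [
    {"id": 1, "name": "Ann", "branch": "London", "salary": 30000},
    {"id": 2, "name": "Bob", "branch": "Glasgow", "salary": 25000},
    {"id": 3, "name": "Cat", "branch": "London", "salary": 40000},
]


def horizontal(rows, predicate):
    """Keep whole rows that satisfy the predicate."""
    return [row for row in rows if predicate(row)]


def vertical(rows, columns, key="id"):
    """Keep only the named columns, plus the key needed to rejoin later."""
    return [{column: row[column] for column in (key, *columns)} for row in rows]


def join_on_key(left, right, key="id"):
    """Rebuild rows from two vertical fragments that share the key."""
    right_by_key = {row[key]: row for row in right}
    return [{**row, **right_by_key[row[key]]} for row in left]


# Horizontal fragments: one per branch; their union is the original relation.
london = horizontal(staff, lambda row: row["branch"] == "London")
glasgow = horizontal(staff, lambda row: row["branch"] == "Glasgow")

# Vertical fragments: a join on the key gives back the original relation.
payroll = vertical(staff, ["salary"])
directory = vertical(staff, ["name", "branch"])
```

Mixed fragmentation would apply one of these to the output of the other.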
Transparencies in a DDBMS
distribution
fragmentation
location
replication
local mapping
naming
transaction
concurrency
failure
performance
DBMS
distribution transparency
allows user to perceive database as single, logical entity
if DDBMS exhibits distribution transparency, user does not need to know
that the data is fragmented (fragmentation transparency) or where data items are located (location transparency); if the user must know both, the DDBMS offers only local mapping transparency
naming transparency
each item in a DDB must have a unique name
DDBMS must ensure that no two sites create a database object with same name
One solution is to create central name server, however this results in loss of local autonomy
central site may become a bottleneck
low availability
transaction transparency
ensures that all distributed transactions maintain distributed database’s integrity and consistency
distributed transaction accesses data stored at more than one location
each transaction is divided into number of subtransactions
one for each site that has to be accessed
concurrency transparency
all transactions must execute independently and be logically consistent with the results obtained if the transactions executed one at a time, in some arbitrary serial order; same fundamental principles as for centralized DBMS
Failure transparency
DDBMS must ensure atomicity and durability of global transactions, which means ensuring that the subtransactions of a global transaction either all commit or all abort
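The notes do not name a protocol for this. One widely used way to get all-or-nothing behaviour is a two-phase commit, sketched below in a heavily simplified form. The `Site` class and `run_global_transaction` function are hypothetical, and the sketch ignores timeouts, logging, and coordinator failure.

```python
class Site:
    """One participant holding a subtransaction of the global transaction."""

    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit
        self.state = "active"

    def prepare(self):
        # Phase 1: vote on whether this subtransaction is able to commit.
        return self.can_commit

    def commit(self):
        self.state = "committed"

    def abort(self):
        self.state = "aborted"


def run_global_transaction(sites):
    # Phase 1: collect a vote from every site.
    votes = [site.prepare() for site in sites]
    # Phase 2: commit everywhere only if every vote was yes, else abort everywhere.
    if all(votes):
        for site in sites:
            site.commit()
        return "committed"
    for site in sites:
        site.abort()
    return "aborted"


all_ok = [Site("london"), Site("glasgow")]
one_fails = [Site("london"), Site("glasgow", can_commit=False)]
outcome_ok = run_global_transaction(all_ok)
outcome_fail = run_global_transaction(one_fails)
```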
performance transparency
DDBMS must perform as if it were a centralized DBMS
DDBMS should not suffer any performance degradation due to the distributed architecture
DDBMS should determine most cost-effective strategy to execute a request
12 rules for a DDBMS
Local Autonomy
No Reliance on a Central Site
Continuous Operation
Location Independence
Fragmentation Independence
Replication Independence
Distributed Query Processing
Distributed Transaction Processing
Hardware Independence
Operating System Independence
Network Independence
Database Independence
aggregation suits what databases
key-value, document, column-family
aggregates are a natural unit for
replication and sharding
aggregates easier for developers to work with as they
naturally manipulate data through aggregate structures
paths for distributing the DB
replication, sharding, single server
replication
same data copied to multiple nodes
sharding
different data stored on different nodes
single server - if the driver to use NoSQL is not running the DB on a cluster
no distribution of the DB is needed
Benefits of single server
eliminates all the complexities that the other distribution options introduce
easy for operations people to manage
easy for application developers to reason about
What database works best in a single server configuration
graph
each shard (data)
reads and writes its own data
with sharding, each node has
different data
with sharding, ideally each user accesses
one node each
how to decide the allocation of shards
aggregate orientation obvious unit of distribution
store aggregates together if they are normally read in sequence
use application logic to decide
some NoSQL databases offer auto-sharding
how to improve performance of sharding
locality of reference, distribute aggregates evenly across nodes
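One simple way to spread aggregates evenly is to hash the aggregate key and take the result modulo the number of shards. The sketch below assumes that scheme; `shard_for` and `ShardedStore` are hypothetical names. Note that plain hashing aims only at even distribution: it does nothing for locality of reference, which would need a key or placement rule chosen by the application.

```python
import hashlib


def shard_for(key, num_shards):
    """Map a key to a shard index in a stable, roughly uniform way."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


class ShardedStore:
    def __init__(self, num_shards):
        # Each dict stands in for one node that holds different data.
        self.shards = [{} for _ in range(num_shards)]

    def put(self, key, aggregate):
        self.shards[shard_for(key, len(self.shards))][key] = aggregate

    def get(self, key):
        # Only the one shard that owns the key is consulted.
        return self.shards[shard_for(key, len(self.shards))].get(key)


store = ShardedStore(4)
for user_id in ("ann", "bob", "cat", "dan"):
    store.put(user_id, {"name": user_id})
```

A drawback of the modulo scheme is that changing the number of shards moves most keys, which is one reason auto-sharding databases use more elaborate placement.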
sharding benefits
improves read and write performance
sharding drawbacks
may affect resilience
a node failure makes that shard’s data unavailable in the same way as a non-distributed model
only the user of that data on that shard will suffer
clusters may use less reliable machines making node failure more likely
master-slave replication
all changes are made to the master
changes propagate to slaves
reads can be done from either
master-slave replication details
data replicated across multiple nodes
one node is appointed (automatically or manually) as the master; it is the authoritative source for the data and responsible for its updates
the other nodes are slaves
benefits of master-slave replication
good for scaling out if the dataset is read intensive (add more slaves to handle the read load - will mean the master has to synchronise to more slaves when writing)
read resilience - if the master should fail, the slaves can still handle read requests, writes have to wait until the master recovers or is replaced
drawbacks of master-slave replication
not good for datasets with heavy write traffic, may cause inconsistency
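The behaviour described above can be sketched with a hypothetical `MasterSlaveCluster`. Propagation to slaves is done synchronously here purely to keep the example short; real systems usually replicate asynchronously, which is where the inconsistency risk comes from.

```python
class MasterSlaveCluster:
    def __init__(self, num_slaves):
        self.master = {}
        self.slaves = [{} for _ in range(num_slaves)]
        self.master_up = True

    def write(self, key, value):
        # All changes go through the master.
        if not self.master_up:
            raise RuntimeError("master unavailable: writes must wait")
        self.master[key] = value
        # Changes then propagate to every slave (synchronously in this sketch).
        for slave in self.slaves:
            slave[key] = value

    def read(self, key, slave_index=None):
        # Reads may be served by the master (default) or by any slave.
        if slave_index is None:
            if not self.master_up:
                raise RuntimeError("master unavailable: read from a slave instead")
            return self.master.get(key)
        return self.slaves[slave_index].get(key)


cluster = MasterSlaveCluster(num_slaves=2)
cluster.write("price", 9)
cluster.master_up = False                          # simulate a master failure
read_after_failure = cluster.read("price", slave_index=1)
try:
    cluster.write("price", 10)
    write_accepted = True
except RuntimeError:
    write_accepted = False
```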
peer-to-peer replication
no master node, all the replicas have equal standing, all nodes can accept writes
benefits of ptp replication
a failed node doesn't mean no writes are possible, adding more nodes improves performance
drawbacks of ptp replication
consistency - write-write conflicts are forever
how to mitigate w-w conflicts in ptp replication
replicas coordinate on each write to avoid conflicts, or inconsistent writes are allowed and merged later; the two options sit at opposite ends of the consistency versus availability trade-off
combining sharding and replication
multiple masters
each data item only has a single master
a node might be a master for some data and a slave for others, or nodes may be dedicated for master or slave duties
ptp and sharding is a common strategy for column family databases
key value model
simplest NoSQL DB model
uses the concept of key-value pairs
aggregate oriented but sees no structure in the aggregate
scaling is achieved with sharding
removing aggregates is performed using a key only
key value model in theory
the developer cannot search on fields within the aggregate, or retrieve parts of the aggregate
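In the pure form of the model, the store sees each value as opaque bytes and offers only whole-value operations by key. A minimal sketch, assuming a hypothetical in-memory `KeyValueStore`; any structure (JSON here) is known only to the application.

```python
import json


class KeyValueStore:
    """The store never looks inside a value; every operation takes a key."""

    def __init__(self):
        self._blobs = {}

    def put(self, key, blob: bytes):
        self._blobs[key] = blob

    def get(self, key) -> bytes:
        return self._blobs[key]

    def delete(self, key):
        del self._blobs[key]


store = KeyValueStore()
# The application serialises the aggregate; the store just holds the bytes.
store.put("session:42", json.dumps({"user": "ann", "cart": ["pen"]}).encode("utf-8"))

# Reading part of the aggregate means fetching and decoding all of it.
session = json.loads(store.get("session:42"))
cart = session["cart"]
```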
key value model in practice
some databases which would classify as key-value may still allow some structure on the data beyond a big blob of data, so the distinction between key-value DBs and document DBs has a grey area
document model NoSQL database
aggregate oriented
recognises structure in the aggregate (structures and types result in the limits on what can be stored but allows the developer flexibility in retrievals)
often developers add an ID field in a document database to do a key-value style lookup
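By contrast, a document store can see inside the aggregate, so it can match on fields and return parts of a document. The `DocumentStore` below is a hypothetical in-memory stand-in, not the API of any real product.

```python
class DocumentStore:
    def __init__(self):
        self._docs = {}

    def insert(self, doc):
        # The "id" field gives a key-value style lookup.
        self._docs[doc["id"]] = doc

    def get(self, doc_id):
        return self._docs.get(doc_id)

    def find(self, **criteria):
        """Return documents whose fields equal every given criterion."""
        return [
            doc
            for doc in self._docs.values()
            if all(doc.get(field) == value for field, value in criteria.items())
        ]

    def project(self, doc_id, fields):
        """Return only part of a document."""
        doc = self._docs[doc_id]
        return {field: doc[field] for field in fields}


db = DocumentStore()
db.insert({"id": "u1", "name": "Ann", "city": "Leeds"})
db.insert({"id": "u2", "name": "Bob", "city": "York"})
db.insert({"id": "u3", "name": "Cat", "city": "Leeds"})

leeds_names = sorted(doc["name"] for doc in db.find(city="Leeds"))
```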
key-value model applications
session information, user profiles/preferences, shopping cart data
document model scaling
achieved through a combination of sharding and replication; this allows for high availability but can make consistency a challenge
document model applications
event logging
content management systems
web analytics
e-commerce applications
column family model NoSQL databases
based on a column oriented model
data in cells grouped in columns rather than rows of data
use a concept of a keyspace which shows the structure of the column family (like tables as they contain rows, each row contains columns)
use of a query language to query data
ptp replication
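The layout can be pictured as nested maps: a keyspace holds column families, a column family holds rows by row key, and each row holds its own set of columns. The data and the `get_column` helper are made up for illustration, and the sketch ignores the query language and replication.

```python
# keyspace -> column family -> row key -> {column name: value}
keyspace = {
    "users": {
        "ann": {"email": "ann@example.com", "city": "Leeds"},
        "bob": {"email": "bob@example.com"},  # rows need not share columns
    },
    "logins": {
        "ann": {"2024-01-01": "web", "2024-01-02": "mobile"},
    },
}


def get_column(column_family, row_key, column):
    """Read one cell; returns None if the row has no such column."""
    return keyspace[column_family][row_key].get(column)


def columns_of(column_family, row_key):
    """List the column names a particular row actually has."""
    return sorted(keyspace[column_family][row_key])
```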
column family scaling
achieved through adding more nodes to the cluster
column family applications
event logging
content management systems
graph model NoSQL databases
driven by a different need from the aggregate-oriented models, so they take the opposite approach
small records with complex connections
graph data structure of nodes connected by edges/arcs
once the database is populated with nodes and edges the developer can query it
queries can exploit the complex relationships in the graph much more easily than an RDBMS
the queries can be chopped and changed regularly
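A relationship query of this kind can be sketched as a breadth-first traversal over an adjacency map. The friendship data and the `within_hops` function are hypothetical; a graph database would express the same question in its own query language.

```python
from collections import deque

# Nodes are people; each edge is an undirected "friend of" relationship.
edges = {
    "ann": {"bob", "cat"},
    "bob": {"ann", "dan"},
    "cat": {"ann"},
    "dan": {"bob", "eve"},
    "eve": {"dan"},
}


def within_hops(start, max_hops):
    """Return every node reachable from start in at most max_hops edges."""
    seen = {start}
    frontier = deque([(start, 0)])
    while frontier:
        node, distance = frontier.popleft()
        if distance == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbour in edges.get(node, ()):
            if neighbour not in seen:
                seen.add(neighbour)
                frontier.append((neighbour, distance + 1))
    return seen - {start}


friends = within_hops("ann", 1)
friends_of_friends = within_hops("ann", 2)
```

Changing the question (for example, three hops instead of two) only changes an argument, which is the flexibility the notes describe.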
graph model scaling
more likely to run as a single-server DB
no distribution is required so data is highly consistent
scaling out is a challenge because traversing relationships between nodes stored on different machines hurts performance
graph model applications
social networks
product preferences
eligibility rules
routing
dispatch
location based services
recommendation engines
each database model can be compared by
features
scaling
availability
consistency
data retrieval
secure systems require
what are the assets to be secured?
what are the threats to which those assets are vulnerable?
what services should be put in place to address those threats?
what are the current technological mechanisms that can support the services required?
what are the assets to be secured?
network resources, DB servers, application servers
what are the threats to which those assets are vulnerable?
information leakage - disclosure to unauthorised parties
integrity violation - data loss or corruption
denial of service - unavailability of system/service/network
illegitimate use - use of resource by unauthorised person or in unauthorised way