NoSQL
A generation of database management systems not based on the traditional relational model.
What "NoSQL" actually stands for
"Not Only SQL"
Five NoSQL characteristics
(1) Not based on the relational model, (2) Support distributed architectures, (3) Provide fault tolerance and high scalability/availability, (4) Support large amounts of sparse data, (5) Geared toward performance over consistency.
Four main categories of NoSQL databases
Key-Value, Document, Graph, Column-Oriented.
Key-Value (KV) database
A NoSQL model that stores data as key-value pairs in which the value is unintelligible to the DBMS.
Bucket (KV database)
A logical grouping of keys, similar to a table; a key can appear only once within a bucket.
Three KV operations
Get (retrieve value by key), Store (write value to key, replacing any existing value), Delete (remove the key-value pair).
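The three operations above can be sketched with a plain Python dict standing in for a bucket; class and key names are illustrative only.

```python
# Minimal sketch of the three KV operations using a Python dict as the "bucket".
# The value is opaque bytes: the store never inspects its contents.
class Bucket:
    def __init__(self):
        self._data = {}

    def store(self, key, value):
        # Store: write value to key, replacing any existing value
        self._data[key] = value

    def get(self, key):
        # Get: retrieve value by key (None if the key is absent)
        return self._data.get(key)

    def delete(self, key):
        # Delete: remove the key-value pair if present
        self._data.pop(key, None)

users = Bucket()
users.store("u:42", b'{"name": "Ada"}')
print(users.get("u:42"))   # b'{"name": "Ada"}'
users.delete("u:42")
print(users.get("u:42"))   # None
```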
Examples of Key-Value databases
Dynamo, Riak, Redis, Voldemort.
Document database
A NoSQL model that stores key-value pairs in which the value is a tag-encoded document (XML, JSON, BSON), and the DBMS understands the document's content.
Collection (document database)
The grouping container for key-value pairs, analogous to a bucket in KV databases.
Key difference between KV and Document databases
Document DBMSs understand and can query the value's internal structure; KV DBMSs do not.
Examples of Document databases
MongoDB, CouchDB, OrientDB, RavenDB.
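The key difference above (the DBMS can query inside the value) can be sketched with Python dicts as JSON-like documents; the collection contents and the `find` helper are hypothetical.

```python
# Sketch: unlike a KV store, a document store can filter on fields *inside*
# each value. A "collection" here is a dict mapping key -> document.
collection = {
    "d1": {"name": "Ada",    "city": "London"},
    "d2": {"name": "Grace",  "city": "Arlington"},
    "d3": {"name": "Edsger", "city": "London"},
}

def find(coll, **criteria):
    """Return documents whose fields match every criterion (MongoDB-style filter)."""
    return [doc for doc in coll.values()
            if all(doc.get(k) == v for k, v in criteria.items())]

print(find(collection, city="London"))   # both London documents
```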
Graph database
A NoSQL database that uses graph theory to store entity instances and the relationships between them, represented as nodes and edges.
Node (graph DB)
A single entity instance.
Edge (graph DB)
A relationship between nodes.
Property (graph DB)
An attribute describing a node or an edge.
How is graph data physically stored
Often in structures like an adjacency matrix or as key-value pairs, even though it is visualized as nodes and edges.
Examples of Graph databases
Neo4j, ArangoDB, GraphBase, Amazon Neptune.
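The two physical representations mentioned above (adjacency matrix vs. key-value pairs) can be shown side by side for a tiny made-up graph:

```python
# Tiny graph: Alice -KNOWS-> Bob, Bob -KNOWS-> Carol.
nodes = ["Alice", "Bob", "Carol"]
idx = {n: i for i, n in enumerate(nodes)}

# (1) Adjacency matrix: matrix[i][j] == 1 means an edge from node i to node j.
matrix = [[0] * len(nodes) for _ in nodes]
matrix[idx["Alice"]][idx["Bob"]] = 1
matrix[idx["Bob"]][idx["Carol"]] = 1

# (2) Key-value pairs: key = node, value = list of (edge_label, neighbor).
edges = {
    "Alice": [("KNOWS", "Bob")],
    "Bob":   [("KNOWS", "Carol")],
    "Carol": [],
}

print(matrix[idx["Alice"]][idx["Bob"]])   # 1
print(edges["Alice"])                     # [('KNOWS', 'Bob')]
```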
Hadoop
A Java-based framework (not a database) for distributing and processing very large data sets across clusters of computers.
Two most important parts of Hadoop
HDFS (Hadoop Distributed File System) and MapReduce.
HDFS
A highly distributed, fault-tolerant file storage system designed to manage large amounts of data at high speed; a low-level distributed file system used directly for storage.
Four HDFS assumptions
(1) High volume (terabyte+ files), (2) Write-once, read-many (no edits after close), (3) Streaming access (process whole files as a stream), (4) Fault tolerance (replicate data across many machines).
Client node (HDFS)
A node that makes requests to the file system.
Name node (HDFS)
The node that stores metadata about which blocks belong to which files and which data nodes hold them.
Data node (HDFS)
A node that stores the actual file data blocks.
Block report
A report sent every 6 hours from a data node to the name node listing which blocks it holds.
Heartbeat
A signal sent every 3 seconds from a data node to the name node to confirm it is still available.
What happens when a name node stops receiving heartbeats from a data node
It excludes that data node from future read/write lists and may instruct other nodes to replicate the missing data.
MapReduce
A divide-and-conquer parallel processing technique: split a large data block into sub-blocks, compute intermediate results, then summarize into one final answer.
Mapper
A program that performs the Map function, producing intermediate results from a sub-block of data.
Reducer
A program that performs the Reduce function, summarizing intermediate results into a final answer.
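The divide-and-conquer shape above can be sketched as the classic word count, in plain Python with no Hadoop involved:

```python
# Word-count sketch of map/reduce: split input into sub-blocks, map each to
# intermediate (key, value) pairs, then reduce them into one final answer.
from collections import defaultdict

def mapper(block):
    # Map: emit an intermediate (word, 1) pair for each word in the sub-block
    return [(word, 1) for word in block.split()]

def reducer(pairs):
    # Reduce: summarize the intermediate pairs into one total per word
    totals = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    return dict(totals)

blocks = ["to be or", "not to be"]    # the large input, split into sub-blocks
intermediate = [p for b in blocks for p in mapper(b)]
print(reducer(intermediate))          # {'to': 2, 'be': 2, 'or': 1, 'not': 1}
```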
Big Data
A term describing data sets so large, fast, or varied that traditional RDBMSs cannot handle them efficiently.
The 3 Vs
Volume, Velocity, Variety.
Volume
A characteristic of Big Data describing the quantity of data to be stored.
Velocity
A characteristic of Big Data describing the speed at which data enters the system and the speed at which it must be processed.
Variety
A characteristic of Big Data describing variations in the structure of the data being stored.
Scaling up
Handling data growth by migrating to a more powerful single system (more CPUs, more storage on one machine).
Scaling out
Handling data growth by distributing storage across a cluster of commodity servers; the dominant approach for Big Data.
Why RDBMSs are ill-suited for clusters
Distributing an RDBMS requires heavy communication and coordination among nodes, with significant performance cost.
Stream processing
Processing data as it enters the system to decide what to keep and what to discard before storage (focuses on inputs).
Feedback loop processing
Analyzing stored data to produce actionable results (focuses on outputs).
Structured data
Data that conforms to a predefined data model (e.g., relational tables).
Unstructured data
Data that does not conform to a predefined data model (e.g., images, video, audio).
BLOB (Binary Large Object)
An RDBMS data type for storing unstructured objects as a single atomic value; semantic content is opaque to the DBMS.
Variability
Big Data characteristic where the same data values may have different meanings or interpretations over time.
Veracity
Big Data characteristic regarding the trustworthiness/quality of the data.
Value
Big Data characteristic regarding the degree to which data can provide meaningful insights.
Visualization
The ability to graphically present data in a way that makes it understandable to users.
Concurrency control
A DBMS feature that coordinates simultaneous execution of transactions in a multi-user system while preserving data integrity.
Which ACID property does concurrency control mostly preserve
Isolation.
The three concurrency control problems
Lost update, uncommitted data (dirty read), inconsistent retrieval.
Lost update
A concurrency problem in which a data update is overwritten and lost during concurrent execution of transactions.
Uncommitted data (dirty read)
A concurrency problem in which a transaction reads data written by another transaction that later rolls back.
Inconsistent retrieval
A concurrency problem in which a transaction uses an aggregate function on data while other transactions are updating that data, producing incorrect aggregate results.
Scheduler
A DBMS component that establishes the order in which concurrent transaction operations are executed, interleaving them to ensure serializability.
Serializable schedule
A schedule of operations whose interleaved execution yields the same result as some serial execution.
Lock
A device that guarantees unique use of a data item for a particular transaction operation.
Pessimistic locking
Use of locks based on the assumption that conflicts between transactions will occur.
Lock manager
A DBMS component responsible for assigning and releasing locks.
Lock granularity
The level at which locks are applied: database, table, page, row, or field (broadest to most fine-grained).
Database-level lock
A lock that restricts database access to the lock owner; only one user at a time can use the database.
Table-level lock
A lock that allows only one transaction at a time to access a given table.
Page-level lock
A lock that restricts access to a disk page (a section of disk).
Row-level lock
A lock that allows concurrent transactions to access different rows of the same table, even if those rows live on the same page.
Field-level lock
A lock that allows concurrent transactions to access the same row but different fields; most flexible, highest overhead.
Trade-off as lock granularity gets finer
More concurrency, but higher overhead cost.
Binary lock
A lock with only two states: locked and unlocked.
Exclusive lock
A lock issued when a transaction requests permission to update a data item and no other locks are held on it.
Shared lock
A lock issued when a transaction requests permission to read a data item and no exclusive lock is held on it by another transaction.
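The shared/exclusive rules above (many readers, or one writer, never both) can be sketched as a tiny reader-writer lock using only the standard library; the class is an illustration, not how any DBMS implements it.

```python
# Sketch of shared (read) vs. exclusive (write) locks on one data item.
import threading

class SharedExclusiveLock:
    def __init__(self):
        self._cond = threading.Condition()
        self._readers = 0      # count of shared-lock holders
        self._writer = False   # True while an exclusive lock is held

    def acquire_shared(self):
        with self._cond:
            # A shared lock is granted only if no exclusive lock is held
            while self._writer:
                self._cond.wait()
            self._readers += 1

    def release_shared(self):
        with self._cond:
            self._readers -= 1
            self._cond.notify_all()

    def acquire_exclusive(self):
        with self._cond:
            # An exclusive lock requires that no other lock of any kind is held
            while self._writer or self._readers > 0:
                self._cond.wait()
            self._writer = True

    def release_exclusive(self):
        with self._cond:
            self._writer = False
            self._cond.notify_all()

lock = SharedExclusiveLock()
lock.acquire_shared()
lock.acquire_shared()      # two concurrent readers are fine
lock.release_shared()
lock.release_shared()
lock.acquire_exclusive()   # would block if any shared lock were still held
lock.release_exclusive()
```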
Deadlock
A condition in which two or more transactions wait indefinitely for each other to release locks (also called a deadly embrace).
Deadlock prevention
A transaction requesting a lock is aborted if there is any chance of deadlock.
Deadlock detection
The DBMS periodically tests the database for deadlocks; if found, one transaction is aborted.
Deadlock avoidance
Transactions must obtain every lock they will need before being allowed to execute.
Transaction
A sequence of database requests that accesses the database; a logical unit of work that either entirely completes or is aborted.
Consistent database state
A state in which all data integrity constraints are satisfied.
Rollback
Reverting the database to its previous consistent state because a transaction failed or was explicitly aborted.
Atomicity (the A in ACID)
All parts of a transaction are treated as a single, indivisible logical unit of work.
Consistency (the C in ACID)
Data integrity constraints are satisfied; transactions must start and end in consistent states.
Isolation (the I in ACID)
A data item used by one transaction is not available to other transactions until the first one ends.
Durability (the D in ACID)
Once a transaction is committed, its changes cannot be undone or lost, even after a system failure.
Serializability
The selected order of concurrent transaction operations produces the same final database state as some serial execution would have produced.
COMMIT
Permanently records all changes made by the transaction and ends the transaction.
ROLLBACK
Aborts all changes made by the transaction and reverts the database to its previous state.
START TRANSACTION (MySQL)
Explicitly begins a transaction; needed in MySQL because autocommit is on by default, so each statement would otherwise commit on its own.
Implicit COMMIT
When the SQL command set ends successfully, all changes are recorded automatically as if COMMIT were issued.
Implicit ROLLBACK
When the SQL command set terminates abnormally, changes are aborted automatically as if ROLLBACK were issued.
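The COMMIT/ROLLBACK behavior above can be demonstrated with Python's built-in sqlite3 module (any transactional DBMS behaves the same way); the account table is made up for the example.

```python
# COMMIT makes changes permanent; ROLLBACK reverts to the last consistent state.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE account (id INTEGER PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO account VALUES (1, 100)")
conn.commit()                        # permanently record the setup

conn.execute("UPDATE account SET balance = balance - 50 WHERE id = 1")
conn.rollback()                      # abort: the update is undone
print(conn.execute("SELECT balance FROM account").fetchone())   # (100,)

conn.execute("UPDATE account SET balance = balance - 50 WHERE id = 1")
conn.commit()                        # record the change permanently
print(conn.execute("SELECT balance FROM account").fetchone())   # (50,)
```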
Transaction log
A DBMS feature that keeps track of all transaction operations that update the database, used for recovery from rollbacks, abnormal termination, or system failure.
Six things stored in the transaction log
(1) Begin marker for each transaction, (2) Operation type (INSERT/UPDATE/DELETE), (3) Names of affected objects, (4) Before-and-after values for updated fields, (5) Pointers to previous and next log entries, (6) End/COMMIT marker.
Embedded SQL
SQL statements contained within an application written in a host programming language such as C, C++, Java, or ASP.NET.
Host language
Any programming language that contains embedded SQL.
Steps to build an embedded-SQL program
(1) Programmer writes embedded SQL inside host code, (2) Pre-processor transforms it into DBMS- and language-specific procedure calls, (3) Host compiler compiles the program, (4) Linker produces the executable plus an "access plan" module.
Access plan
The compiled module containing the instructions needed to run embedded SQL code at runtime.
Main weakness of embedded SQL
Executables can be decompiled, exposing table names and dictionary structure; SQL errors are not caught at compile time and may surface at run-time.
Stored procedure
Business logic stored on the database server in the form of SQL code (or a DBMS-specific procedural language) that can be called by applications.
Two main advantages of stored procedures
(1) Reduce network traffic and improve performance (SQL is not transmitted across the network), (2) Reduce code duplication, lowering errors and maintenance cost.
Stored procedure syntax (MySQL)
CREATE PROCEDURE name(parameter_list) BEGIN SQL_statements; END;
IN parameter
A value supplied by the caller into the stored procedure.
OUT parameter
A value returned from the stored procedure to the caller.
How to invoke a stored procedure manually
Use CALL procedure_name(arg1, arg2, …);
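Putting the syntax, the IN/OUT parameters, and CALL together, a hypothetical MySQL example (table and column names are assumptions, not from the source):

```sql
-- DELIMITER lets the procedure body contain semicolons without ending the
-- CREATE statement early.
DELIMITER //
CREATE PROCEDURE get_balance(IN p_account_id INT, OUT p_balance DECIMAL(10,2))
BEGIN
    SELECT balance INTO p_balance
    FROM account
    WHERE account_id = p_account_id;
END //
DELIMITER ;

-- Invoke it manually and read the OUT parameter back via a session variable:
CALL get_balance(42, @bal);
SELECT @bal;
```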
Why prefer stored procedures over inline SQL strings in app code
Centralizes business logic, improves security (less SQL injection surface), reduces network traffic, easier to maintain.