Database Models and SQL Concepts

3.1 Relational Model Overview
  • Database Models

    • A high-level, conceptual framework for organizing and structuring complex database systems. These models define how data is represented, stored, and manipulated within a database.

    • Parts:

      • Data Structures: The fundamental way data is organized, describing the types of data, their relationships, and the overall schema (e.g., tables, trees, graphs).

      • Operations: A set of predefined actions or functions used to manipulate and query the data structures (e.g., inserting, updating, deleting, retrieving data).

      • Rules (Integrity Constraints): A set of predefined conditions or rules that govern and ensure the validity, consistency, and integrity of the data stored in the database.

    • Models can be academic, serving as theoretical blueprints, or derived from existing commercial database systems, reflecting practical implementations.

3.2 Relational Model
  • Overview: This model is fundamentally based on a simple yet powerful concept: organizing all data into two-dimensional table structures, where each table represents a "relation."

  • Published: The foundational paper, "A Relational Model of Data for Large Shared Data Banks," was published in 1970 by Edgar F. Codd while he was working at IBM. This paper introduced the theoretical underpinnings of relational databases.

  • Commercial Release: While the theoretical model was established in 1970, the first commercially viable relational database management systems (RDBMS) began appearing around 1980.

    • Along with commercialization, industry efforts led to the standardization of operations and rules through the Structured Query Language (SQL), which became the de facto language for interacting with relational databases.

  • Historical Context: In the 1960s and 1970s, database systems were largely dominated by less flexible models like hierarchical databases (e.g., IBM's IMS) and network databases (e.g., CODASYL IDMS).

    • The relational model offered significant advantages in data independence, flexibility in query formulation, and a simpler logical view for users, which, combined with growing computing performance, led to its rapid rise and eventual dominance in the database market.

3.2.1 Initial Applications

  • The relational model was primarily designed and optimized for transactional applications, which require high concurrency, data integrity, and efficient processing of frequent, small data modifications. Examples include:

    • Bank transactions: Processing deposits, withdrawals, transfers, and account balance updates, requiring ACID (Atomicity, Consistency, Isolation, Durability) properties.

    • Airline reservations: Managing real-time seat availability, booking, and cancellation for flights, ensuring accurate and consistent data across multiple users.

    • Inventory management: Tracking stock levels, sales, and supply chain movements for retail and manufacturing.

3.2.2 Growth of Big Data

  • From the 1990s onward, the exponential growth of data — often termed "big data" — driven by the internet, sensors, and mobile devices, led to the emergence of various new, optimized database models.

    • These new models (e.g., NoSQL databases) often prioritize scalability, flexibility, and availability over strict ACID properties.

    • Despite these new entrants, relational databases adapted significantly, incorporating features like column stores, in-memory processing, and distributed architectures to provide more robust support and analytical capabilities for large datasets, often alongside specialized big data systems.

3.2.3 Other Database Models

  • Beyond the relational model, many other paradigms exist, each with different strengths and use cases. Some prominent examples include:

    • Hierarchical (e.g., IMS, 1960s): Organizes data in a tree-like structure, with parent-child relationships. Efficient for one-to-many relationships but rigid in structure.

    • Network (e.g., IDMS, 1970s): An extension of the hierarchical model, allowing a record to have multiple parents, forming a more complex graph-like structure. Offers more flexibility than hierarchical but still complex to manage.

    • Relational (e.g., Oracle, IBM Db2, Microsoft SQL Server; 1980s onwards): Stores data in tables with rows and columns. Known for strong data consistency and SQL query language.

    • Object (e.g., ObjectStore, GemStone/S; 1990s): Stores data as objects, similar to object-oriented programming. Good for complex data types and direct integration with OO applications.

    • Document (e.g., MongoDB, Couchbase; 2000s onwards): A NoSQL model that stores data in flexible, semi-structured documents (e.g., JSON, XML). Ideal for managing unstructured or rapidly changing data.

    • Key-Value (e.g., Redis, DynamoDB): The simplest NoSQL model, storing data as a collection of key-value pairs. Offers high scalability and performance for simple lookups.

    • Graph (e.g., Neo4j, Amazon Neptune): Stores data as nodes and edges, representing entities and their relationships. Optimized for traversing complex connections and relationship analysis.

3.3 Relational Data Structure
  • Based on Set Theory: The relational model draws heavily from mathematical set theory and first-order predicate logic.

    • Set: An unordered collection of distinct elements, where the order of elements does not matter, and duplicates are not allowed (e.g., a,b,c{a, b, c} is the same as c,a,b{c, a, b}).

    • Tuple: An ordered collection of elements. In the context of databases, a tuple corresponds to a row in a table, where the order of attributes (columns) is defined by the table schema (e.g., (a,b,c)(a, b, c) is distinct from (b,c,a)(b, c, a)).

    • In a relational database, tables themselves are considered a set of rows (tuples), which implies that the rows within a table have no inherent order. The order in which they are retrieved is at the discretion of the database system unless an explicit ORDER BY clause is used in a query.

3.3.1 Organizing Data in Tables

  • Table: Formally known as a "relation," a table is a named, two-dimensional structure composed of rows and columns, designed to store a specific type of data entity.

    • Each table has a defined schema (data structure) that specifies the names of its columns and their respective data types.

  • Rows (Tuples): Each row in a table represents a single record or entity instance. Rows are unordered, meaning their physical storage order does not convey any logical meaning.

    • Each row contains values that correspond to the data types and definitions of its respective columns.

  • Columns (Attributes): Each column in a table represents a specific characteristic or property (attribute) of the entity that the table describes.

    • All values within a given column must be of the same data type.

  • Common Database Terms: To bridge the gap between academic theory and practical implementation:

    • Table in commercial RDBMS is analogous to Relation in academic literature.

    • Column in commercial RDBMS is analogous to Attribute in academic literature.

    • Row in commercial RDBMS is analogous to Tuple in academic literature.

    • Domain refers to the set of all possible allowed values for a given column.

3.3.2 Relational Operations

  • Relational operations are a set of fundamental procedures, also based on set theory, used to manipulate relations (tables). These operations take one or more input tables and produce a new result table, without modifying the original tables. Collectively, these operations form Relational Algebra, a procedural query language for relational databases.

    • Select (σ\sigma): Also known as restriction, this operation filters a relation (table) to produce a subset of its rows that satisfy a specified selection condition. For example, selecting all employees earning more than $50,000.

    • Project (π\pi): This operation selects a subset of columns from a relation (table), effectively removing unwanted columns. It can also eliminate duplicate rows if the projected columns result in identical rows. For example, projecting only the names and email addresses from an employee table.

    • Product (Cartesian Product, ×\times): Combines every row from one table with every row from another table. If table A has mm rows and table B has nn rows, their Cartesian product will yield a new table with m×nm \times n rows. It is often a precursor to a Join operation.

    • Join (\Join): A fundamental operation that combines rows from two or more tables based on a related column between them. Various types of joins exist:

      • Inner Join: Returns only the rows where there is a match in both tables based on the join condition.

      • Left (Outer) Join: Returns all rows from the left table, and the matched rows from the right table. If no match, NULLs are returned for the right table's columns.

      • Right (Outer) Join: Returns all rows from the right table, and the matched rows from the left table. If no match, NULLs are returned for the left table's columns.

      • Full (Outer) Join: Returns all rows when there is a match in one of the tables. If no match, NULLs are returned for columns from the table without a match.

    • Union (\cup): Combines all unique rows from two or more tables that have the same number of columns and compatible data types. Duplicate rows are automatically eliminated.

    • Intersect (\cap): Returns only the common rows that exist in both (or all) input tables. Also requires compatible schemas.

    • Difference (Minus, -): Returns the rows that are present in the first table but not in the second table. This operation is asymmetrical.

    • Rename (ρ\rho): Changes the name of a relation (table) or an attribute (column) within a relation.

    • Aggregate: Not strictly part of classical relational algebra, but crucial for practical database operations. These are functions that perform calculations on a set of values (e.g., a column) and return a single scalar value. Common aggregate functions include SUM(), COUNT(), AVG(), MAX(), and MIN().

3.4 SQL and Relational Database Rules
  • Constraints are strict rules applied to tables and columns to ensure the validity, accuracy, and consistency of data within a database. They are critical for maintaining data integrity.

    • Every table in a relational database must adhere to certain fundamental integrity rules:

    • Entity Integrity (Primary Key Rule): Every table must have a Primary Key, which is a column or a set of columns with unique values for each row. Furthermore, no part of a primary key can contain NULL values. This ensures that each record (entity) in the table can be uniquely identified.

    • Referential Integrity (Foreign Key Rule): If a Foreign Key exists in a table, either its value must match an existing Primary Key value in the referenced table, or it must be NULL. This ensures consistency between related tables.

    • Unique Column Names: Within a single table, all column names must be distinct to avoid ambiguity when referring to attributes.

    • No Duplicate Rows: Because tables are conceptually sets of tuples, and sets do not allow duplicate members, no two rows in a table can be identical across all columns. The presence of a primary key naturally enforces this rule.

3.4.1 Implementing Constraints in SQL

  • SQL provides mechanisms to define and enforce these relational rules directly within the database schema. These constraints are declared as part of the CREATE TABLE or ALTER TABLE statements.

    • For instance, each table must explicitly define its primary key using the PRIMARY KEY constraint, which automatically enforces UNIQUE and NOT NULL for the primary key column(s).

    • Values in primary key columns cannot repeat, and attempts to insert or update a row with a duplicate primary key value will result in a database error.

    • Operations within SQL (e.g., INSERT, UPDATE, DELETE) are checked against these defined constraints. If an operation would violate any constraint, the database management system will typically reject the operation and issue an error, thus maintaining data integrity.

3.5 Introduction to SQL
  • Structured Query Language (SQL): A declarative, standard language specifically designed for managing and querying data in relational database management systems (RDBMS). It allows users to create, modify, retrieve, and control access to data.

  • Components of SQL: SQL is logically divided into several sub-languages, each serving a distinct purpose:

    • Data Definition Language (DDL): Used to define and manage the structure of database objects (e.g., tables, indexes, views). Commands include CREATE (for creating objects like CREATE TABLE, CREATE INDEX), ALTER (for modifying existing objects like ALTER TABLE), and DROP (for deleting objects like DROP TABLE).

    • Data Manipulation Language (DML): Used for managing data within the database objects. Commands include INSERT (to add new rows of data), UPDATE (to modify existing data), and DELETE (to remove rows of data).

    • Data Query Language (DQL): Primarily used for retrieving data from the database. The most common DQL command is SELECT, which allows specifying criteria for data retrieval, sorting, grouping, and aggregation.

    • Data Control Language (DCL): Used to manage database access permissions and control security. Commands include GRANT (to give users permissions to access or manipulate database objects) and REVOKE (to remove those permissions).

    • Transaction Control Language (TCL) (often grouped with DML): Used to manage transactions within a database, ensuring data consistency and integrity. Commands include COMMIT (to save changes), ROLLBACK (to undo changes), and SAVEPOINT (to set a point within a transaction to which you can later roll back).

3.6 SQL Syntax and Elements
  • SQL statements are constructed from a combination of specific elements that adhere to a predefined syntax:

    • Literals (Constants): Fixed data values that appear directly in SQL statements. These can include:

      • String literals: Enclosed in single quotes (e.g., 'John Doe', 'New York').

      • Numeric literals: Integers or decimal numbers (e.g., 123, 3.14).

      • Date/Time literals: Often enclosed in single quotes, sometimes with specific prefixes (e.g., '2023-10-27', '14:30:00').

      • Boolean literals: TRUE, FALSE.

    • Identifiers: User-specified names given to database objects such as tables, columns, views, indexes, or stored procedures. They must follow naming conventions (e.g., start with a letter, contain letters, numbers, and underscores, cannot be a reserved keyword).

    • Keywords: Reserved words that have special meaning in SQL and are used to perform specific operations (e.g., SELECT, FROM, WHERE, INSERT, UPDATE, DELETE, CREATE, TABLE, PRIMARY KEY). Keywords cannot be used as identifiers.

    • Comments: Non-executable text included within SQL statements to provide explanations or documentation. SQL supports two main types:

      • Single-line comments: -- This is a single-line comment

      • Multi-line comments: /* This is a multi-line comment that can span several lines */

  • SQL statements are typically constructed from one or more clauses (e.g., SELECT, FROM, WHERE, GROUP BY, ORDER BY), each serving a specific function in defining the query or operation. They must adhere to specific syntax rules to be parsed and executed correctly by the database system.

3.7 NULL Values in SQL
  • Definition: In SQL, NULL is a special marker used to represent missing, unknown, or inapplicable data. It is not a data type but a state that a column can have if it doesn't contain a specific value.

  • Differences: It's crucial to understand that NULL is distinct from and not equivalent to:

    • Zero (0): NULL is not the number zero; it signifies the absence of a numerical value, whereas zero is a concrete numerical value.

    • Blank space or Empty string (''): NULL is not an empty string; it means no string value is present, while an empty string is a string with zero characters.

    • False: NULL is not a boolean false; it represents an unknown truth value.

    • The presence of NULL can lead to three-valued logic in SQL (TRUE, FALSE, UNKNOWN) when comparisons are made.

  • Handling NULL: Because NULL is not a value, standard comparison operators (e.g., =, !=, <, >) do not work as expected with NULL. Instead, special operators are needed:

    • IS NULL: Used to check if a value is NULL (e.g., WHERE email IS NULL).

    • IS NOT NULL: Used to check if a value is not NULL (e.g., WHERE phone_number IS NOT NULL).

    • Any arithmetic operation involving NULL will generally result in NULL (e.g., 5+NULL=NULL5 + \text{NULL} = \text{NULL}). Similarly, most aggregate functions (except COUNT(*)) ignore NULL values by default.

3.8 Modifying Data
  • SQL provides specific DML statements for altering the rows stored in database tables:

    • INSERT Statement: Used to add one or more new rows of data into a table. It specifies the table name, optionally the columns to be populated, and the values for those columns.

    • UPDATE Statement: Used to modify existing row values in one or more columns within a table. It requires specifying the table, the columns and their new values (SET clause), and typically a condition (WHERE clause) to identify which rows to update.

    • DELETE Statement: Used to remove one or more rows from a table. It specifies the table name and typically a condition (WHERE clause) to identify which rows to delete. Without a WHERE clause, all rows in the table will be deleted.

3.8.1 Updating and Inserting Rows

  • Inserting Rows: New rows are added to a table using the INSERT INTO statement. The basic syntax involves specifying the table name, the columns you are inserting data into (optional if inserting into all columns in their defined order), and the VALUES clause with the corresponding data.

    • Example: INSERT INTO Employees (EmployeeID, FirstName, LastName) VALUES (101, 'Alice', 'Smith');

    • Multiple rows can be inserted in a single INSERT statement by providing multiple sets of values.

  • Updating Rows: Existing data can be modified using the UPDATE statement. You specify the table to update, use the SET clause to assign new values to one or more columns, and crucially, use a WHERE clause to filter which rows should be updated.

    • Example: UPDATE Employees SET Salary = 60000 WHERE EmployeeID = 101;

    • If the WHERE clause is omitted, the UPDATE statement will apply the changes to every row in the table.

  • Deleting Rows: Rows are removed from tables using the DELETE FROM statement. The WHERE clause is essential here to specify which rows to remove.

    • Example: DELETE FROM Employees WHERE Department = 'Sales';

    • Cautions: Executing DELETE FROM TableName; without a WHERE clause will remove all records from the table, effectively emptying it. This is irreversible unless within a transaction that is rolled back.

3.9 Primary Keys
  • Primary Key Requirement: A primary key is a crucial relational database constraint that uniquely identifies each record in a table and ensures data integrity.

    • Unique identifiers: The values in a primary key column (or set of columns) must be unique across all rows in the table. This is how the database ensures no two records are identical.

    • NOT NULL: Primary key columns cannot contain NULL values. This ensures that every record has a definite and identifiable value for its key.

    • A primary key can be defined as a single column (simple primary key) or as a combination of multiple columns (composite primary key) where the combination of values across those columns must be unique.

    • The primary key is often used to establish relationships with other tables through foreign keys.

3.9.1 Defining Primary Keys in SQL

  • Primary keys are typically defined when a table is created using the CREATE TABLE statement.

    • Column-level constraint: CREATE TABLE Customers ( CustomerID INT PRIMARY KEY, FirstName VARCHAR(50) );

    • Table-level constraint: CREATE TABLE Orders ( OrderID INT, OrderDate DATE, CustomerID INT, PRIMARY KEY (OrderID) );

    • Composite Primary Key: CREATE TABLE Enrollment ( StudentID INT, CourseID INT, Grade VARCHAR(2), PRIMARY KEY (StudentID, CourseID) );

  • A table can only have one primary key.

3.10 Foreign Keys
  • Foreign Key Basics and Functionality: A foreign key is a column or a set of columns in one table (the referencing table) that refers to the primary key in another table (the referenced or parent table). Its main purpose is to establish and enforce a link between tables, ensuring referential integrity.

    • Foreign keys create explicit relationships between tables, indicating how data in one table is related to data in another. This is fundamental to the relational model's ability to represent complex relationships efficiently.

    • Example: A CustomerID in an Orders table could be a foreign key referencing the CustomerID primary key in a Customers table. This ensures that every order is associated with a valid, existing customer.

  • Types of Foreign Key Relations and Constraints: Foreign keys define the cardinality of relationships, such as one-to-many, one-to-one, or many-to-many (through an intermediary table).

    • Referential Actions: When defining a foreign key, you can specify actions to take if the referenced primary key in the parent table is updated or deleted:

      • ON DELETE CASCADE: If a row is deleted from the parent table, all rows referring to that primary key in the child table are also deleted.

      • ON UPDATE CASCADE: If a primary key is updated in the parent table, the corresponding foreign key values in the child table are automatically updated.

      • ON DELETE SET NULL / ON UPDATE SET NULL: If a row in the parent table is deleted or its primary key is updated, the foreign key values in the child table are set to NULL (provided the foreign key column allows NULLs).

      • ON DELETE RESTRICT / ON UPDATE RESTRICT (or NO ACTION): This is the default behavior in many systems. It prevents deletion or update of a primary key in the parent table if there are related foreign key records in the child table.

3.11 Referential Integrity
  • Rules governing foreign key links: Referential integrity is a core concept in relational databases that ensures consistency between related tables.

    • It dictates that for every value in a foreign key column, that value must either exist as a primary key in the referenced table, or it must be NULL (if the foreign key column is nullable).

    • This prevents orphan records – rows in a child table that refer to a non-existent parent record.

  • Violations must be resolved to maintain database integrity: If an attempt is made to:

    • Insert a row into the child table with a foreign key value that does not exist in the parent table's primary key.

    • Update a foreign key value in the child table to a value that does not exist in the parent table's primary key.

    • Delete a row from the parent table when there are existing matching foreign key values in the child table (and no CASCADE or SET NULL rule is defined).

    • Update a primary key value in the parent table when there are existing matching foreign key values in the child table (and no CASCADE or SET NULL rule is defined).

    The database system will typically reject the operation, thus preventing a violation of referential integrity and maintaining the consistency of the database. The specific behavior (e.g., error, cascade) depends on the ON DELETE and ON UPDATE actions defined for the foreign key.

3.12 Constraints in SQL
  • In addition to primary and foreign keys, SQL provides several other types of constraints to enforce specific data validation rules at the column or table level:

    • NOT NULL: Ensures that a column cannot store NULL values. Every row must have a definite value for this column. Example: CREATE TABLE Users (UserID INT NOT NULL, UserName VARCHAR(100) NOT NULL);

    • UNIQUE: Ensures that all values in a column or a combination of columns are distinct. While similar to a primary key, a table can have multiple unique constraints, and unique columns can contain NULL values (though typically only one NULL is allowed per unique constraint, depending on the RDBMS).

      • Example: CREATE TABLE Employees (Email VARCHAR(255) UNIQUE);

    • CHECK: Enforces domain integrity by ensuring that all values in a column satisfy a specific condition. The CHECK constraint evaluates a boolean expression for each row.

      • Example: CREATE TABLE Products (Price DECIMAL(10, 2) CHECK (Price >= 0)); or CREATE TABLE People (Age INT CHECK (Age BETWEEN 0 AND 120));

  • Constraints are declared within the table structure, either directly after the column definition (column-level constraint) or at the end of the CREATE TABLE statement (table-level constraint for multiple columns or more complex rules). They are enforced automatically by the database system upon data insertion or modification.

3.12.1 Adding and Dropping Constraints

  • Existing tables can have constraints added or removed using the ALTER TABLE statement.

    • Adding a constraint: ALTER TABLE Employees ADD CONSTRAINT CHK_Salary CHECK (Salary >= 0);

      • To add a primary key: ALTER TABLE Customers ADD PRIMARY KEY (CustomerID);

      • To add a foreign key: ALTER TABLE Orders ADD CONSTRAINT FK_CustomerOrder FOREIGN KEY (CustomerID) REFERENCES Customers(CustomerID);

    • Dropping a constraint: ALTER TABLE Employees DROP CONSTRAINT CHK_Salary;

      • The name of the constraint (e.g., CHK_Salary, FK_CustomerOrder) must usually