Relational Algebra and Database Design - Notes

Relational Model and Relational Algebra

The transcript introduces the relational model and relational algebra as foundational concepts for querying and designing relational databases. A database is defined as a collection of persistent data, and a DBMS is software that supports creation, population, and querying of a database. A Relational Database Management System (RDBMS) stores data in tables (relations) that share a single schema describing the table definitions and attributes. Each relation has a schema, e.g., Students(sid, name, login, age, gpa), where sid is a primary key and identifies tuples (rows) in the relation. Keys are central: a primary key is a minimal set of fields that uniquely identifies a tuple, while foreign keys connect between relations (e.g., Enrolled(cid, grade, studid) relates the Courses and Students tables). A relation is a set of n-tuples drawn from domains of attributes, with arity equal to the number of attributes. A relation instance is the current state of a relation, shown as a table of tuples. Importantly, relations are unordered sets of tuples; their order does not matter for queries.

Core concepts and notation

Database: a persistent collection of data; DBMS manages creation, population, and querying.
Relational database: a collection of relations with a common schema; relations have attributes (columns) and tuples (rows).
Schema: the definition of a table, including its attributes and their types.
Primary key: a minimal subset of attributes that uniquely identifies a tuple in a relation.
Foreign key: an attribute (or set) that references a primary key in another relation, establishing a relationship between tables.
Arity: the number of attributes (domains) in a relation.
Domain: a data type corresponding to an attribute.
Relation instance: the current content of a relation, represented as a table of tuples.
Nulls: representations of missing or unknown values in a relation (with three-valued logic considerations in later sections).

Example schema and data (illustrative)

Courses(cid, instructor, quarter, dept)
Students(sid, name, gpa)
Enrolled(cid, grade, studid)
Keys: Students.sid is the primary key for Students; Courses.cid is the primary key for Courses; Enrolled uses foreign keys studid referencing Students.sid and cid referencing Courses.cid.
An example of the Students table: (sid, name, login, age, gpa) with tuples such as
- (50000, Dave, dave@cs, 19, 3.3)
- (53666, Jones, jones@cs, 18, 3.4)
- (53688, Smith, smith@ee, 18, 3.2)
- (53650, Smith, smith@math, 19, 3.8)
- (53831, Madayan, madayan@music, 11, 1.8)
- (53832, Guldu, guldu@music, 12, 2.0)
The notion of keys and foreign keys is reinforced across Students and Courses, and the relationship Enrolled represents a many-to-many association resolved via an enrollment relation.

Basic relational algebra operators

Selection (σ): filter rows by a predicate. Example: σ_{gpa>3.3}(S1) selects students with gpa > 3.3 from relation S1.
Projection (π): select specific columns, removing duplicates in the result. Example: π_{name, gpa}(S1) retains only name and gpa.
Union (U): combine two relations with the same arity.
Intersection (∩): return the common tuples of two relations with the same arity.
Set Difference (−): tuples in one relation but not in the other (same arity).
Cartesian Product (×): combine every tuple of R with every tuple of S; result arity is the sum of the arities.
Rename (ρ): change the relation name and optionally attribute names to enable composing expressions.
Composition: complex queries can be built by composing multiple operators, e.g., sA=C(r × s) or πA,B (σ…(r)).

Notation and examples

Selection notation: s p(r) where p is a predicate built from terms of the form op or , and op ∈ {=, ≠, >, ≥, <, ≤}.
Example: σ_{branch-name = "Perryridge"}(account) selects accounts in Perryridge branch.
Projection notation: π_{A1, A2, …, Ak}(r) returns a relation with attributes A1…Ak; duplicates are removed.
Combining selection and projection: π{name, gpa}(σ{gpa>3.3}(S1)).

Relational joins and more advanced operators

Joins: join two relations on common attributes to combine related data. Natural join is a common form where equal attributes are matched and duplicates removed.
Division (÷): used for "for all" style queries; given R(A1,…,Am,B1,…,Bn) and S(B1,…,Bn), R ÷ S yields a relation on (A1,…,Am) such that for each A1,…,Am tuple t, all combinations with B1,…,Bn exist in R if t is included.
Union, intersection, and set difference extend to complex schemas under arity and domain compatibility constraints.

Example relational-algebra queries (banking and classroom examples)

Find all loans with amount > 1200: π{loannumber}(σ_{amount > 1200}(loan)).
Names of customers who have either a loan or an account (or both): π{customername}(borrower) ∪ π{customername}(depositor).
Name of customers who have a loan at Perryridge but no account anywhere: π{customername} (σ{sbranchname = "Perryridge"}(borrower × loan)) − π{customername}(depositor).
More complex examples mix selection, projection and joins, e.g., combining Enrolled and Courses to list student grades for specific courses.

Relational Database Design and Normalization

Normalization overview

Normalization is the design technique to reduce redundancy and prevent update anomalies by decomposing large relations into smaller, well-structured relations.
The process relies on functional dependencies (FDs) among attributes and aims to achieve normal forms such as 1NF, 2NF, 3NF, BCNF, 4NF, and beyond.
BCNF (Boyce-Codd Normal Form) is stricter than 3NF; 3NF often preserves dependencies and ensures lossless join, whereas BCNF may sacrifice some dependency preservation.
The goal is often to reach BCNF with lossless joins and dependency preservation, though in some cases a 3NF design is preferable to preserve dependencies while maintaining losslessness.

Functional dependencies and closure

FD: X → Y means the attribute set Y is determined by X; any two tuples with the same X values have the same Y values.
Closure F+ of a set F of FDs is all dependencies implied by F using Armstrong’s axioms (reflexivity, augmentation, and transitivity, plus derived rules like union and decomposition).
Key concepts: superkey (a set of attributes that functionally determines all attributes of the relation), candidate keys (minimal superkeys), prime attributes (attributes that are part of any candidate key).
Attribute closure: for a set of attributes X, X+ is the set of attributes functionally determined by X under F.
Canonical cover Fc: a minimal-equivalent set of FDs that preserves F+ and is used to simplify validation and decomposition.
Extraneous attributes: parts of the left or right side of a dependency that can be removed without changing F+; detecting extraneous attributes is essential for computing Fc.

Normal forms and decomposition

1NF: all domain values atomic (no repeating groups); every attribute value must be atomic.
2NF: in addition to 1NF, no partial dependency of any non-prime attribute on any candidate key is allowed. This means every non-prime attribute must depend on the entire candidate key.
3NF: every non-prime attribute must be non-transitively dependent on every candidate key; equivalently, for every FD X → A, either X is a superkey, or A is a prime attribute.
BCNF: stronger than 3NF; for every FD X → A that holds in the relation, X must be a superkey.
4NF and beyond: handle multivalued dependencies (MVDS) and other complex constraints; 4NF requires that for any nontrivial MVD X →→ Y, X must be a superkey; higher normal forms (PJNF, DK/NFK) generalize this, but are rarely used in practice.

Decomposition and losslessness

A decomposition of R into R1, R2 is lossless if R1 ∩ R2 → R1 or R1 ∩ R2 → R2 holds in F+; equivalently, the natural join of the decomposed relations yields R without spurious tuples.
Dependency preservation: a decomposition preserves dependencies if the dependencies in F+ can be tested by enforcing dependencies only within the decomposed relations, without requiring a join. This is desirable because it makes consistency checks cheaper.
The 3NF decomposition algorithm and BCNF decomposition algorithms provide systematic ways to decompose a relation while trying to maintain losslessness and dependency preservation.

Practical themes and examples

Anomalies arise when data redundancy exists (insert, delete, update anomalies).
Design trade-offs: BCNF may lose dependency preservation; 3NF tends to preserve dependencies but may require some redundancy.
E-R modeling and normalization should align: a carefully designed E-R schema often minimizes the need for heavy normalization, but real designs may require FD-driven decompositions.
Temporal data, denormalization for performance, and domain-specific concerns (e.g., 4NF and MVDS) are advanced topics often encountered in larger systems.

Example: 3NF decomposition of dept_advisor

Original relation: deptadvisor(sID, iID, deptname).
FDs: iID → deptname and sID, deptname → i_ID.
Candidate keys: {sID, deptname}, {iID, sID}.
Fc (canonical cover) reduces to: {sID, deptname} → iID and iID → dept_name.
A 3NF decomposition yields relations e.g.:
- deptadvisor(sID, deptname, iID) // contains a candidate key
- instructor(iID, deptname) // separate out the dependent attribute
This example illustrates how 3NF achieves non-transitive dependencies while preserving keys, and why BCNF might force further decomposition that could lose dependency preservation.

Multivalued Dependencies and Fourth Normal Form

MVDS and 4NF

MVDS capture a situation where two or more attributes can vary independently of each other within a relation.
If Y and Z are independent given X, expressed as X →→ Y and X →→ Z, then the combination YZ is determined by X via MVDS.
4NF requires that for every nontrivial MVDS X →→ Y, X must be a superkey; otherwise decompose to remove the MVDS.
The theory also notes that MVDS semantics can lead to decompositions that are lossless and dependency-preserving under certain conditions.

Transaction Processing and Concurrency Control

Transaction concepts and ACID

A transaction T is a unit of work that may read and update data.
The ACID properties: Atomicity, Consistency, Isolation, Durability.
- Atomicity: either all updates of a transaction are completed or none are.
- Consistency: a transaction takes the database from one consistent state to another.
- Isolation: a transaction should appear to execute in isolation from others; intermediate states are not visible.
- Durability: once committed, changes persist despite failures.

Transaction states and schedules

States: Active, Partially committed, Committed, Aborted, Failed.
A schedule is a sequence of read/write operations from multiple transactions; a serial schedule executes transactions one after another.
Serializability: a concurrent schedule is serializable if it is equivalent to some serial schedule. Two main notions: Conflict serializability and View serializability.
Conflicts arise when two operations on the same data item are ordered differently in a schedule and at least one operation writes.

Concurrency control approaches

Lock-based protocols (2PL, strict 2PL, rigorous 2PL): obtain locks before accessing items, hold locks until commit/abort; ensures serializability but can cause deadlocks.
Two-Phase Locking (2PL): growing phase (acquire), shrinking phase (release); ensures serializability but may cause deadlocks.
Strict 2PL and rigorous 2PL: hold exclusive locks until commit/abort to ensure recoverability and avoid cascading aborts.
Timestamp-based protocols (TSO): assign timestamps to transactions; ensure operations respect timestamp order; can prevent deadlocks but may force rollbacks and cause less concurrency.
Validation-based protocols and multiversion schemes: different strategies to maintain isolation with varying trade-offs.
Graph-based protocols and tree protocols: alternative locking strategies to achieve concurrency with different performance characteristics.

Recovery and logs

WAL (Write-Ahead Logging): log records must be flushed to stable storage before corresponding data pages are written to disk.
Immediate vs deferred modification: immediate writes update disk before commit; deferred writes delay updates until commit.
Checkpoints: periodically record the system state to speed up recovery by limiting the amount of log that must be scanned.
ARIES (Advanced Recovery Techniques): sophisticated recovery algorithm with multiple phases (Analysis, Redo, Undo) and data structures (LSNs, Dirty Page Table, Log Buffer, etc.).
Shadow paging: an alternative recovery approach that uses shadow page tables to recover, avoiding some log-based overhead but with trade-offs (e.g., page-table copying, fragmentation).
Recovery goals: ensure atomicity, consistency, and durability in the presence of failures (crashes, partial updates, and nonvolatile storage failures).

Fault tolerance and backups

Remote backup systems and hot-spare configurations support high availability by maintaining up-to-date copies of the database at remote sites.
Recovery involves undoing incomplete transactions and redoing committed ones as needed, with strategies to minimize downtime.

Temporal data and modeling considerations

Temporal data introduces time-variant attributes (addresses, employment, etc.) and requires specialized modeling to reflect validity periods.
Start and end times can be added to relations to express validity windows; this complicates FD definitions and may require additional constraints to enforce non-overlapping periods.

ER modeling and real-world design notes

Real-world database design often involves converting ER models into relational schemas, after which normalization is applied to reduce redundancy and anomalies.
Denormalization may be employed for performance reasons, trading off some normalization for faster queries and simpler index structures.
Temporal and multivalued data constraints underscore the need for careful modeling beyond simple FD-based normalization.

Important mathematical concepts and formulas to remember

Functional dependency: X → Y means same X values determine same Y values across all tuples.
Closure: F+ is the set of all dependencies implied by F using Armstrong’s axioms:
- Reflexivity: if B ⊆ A then A → B
- Augmentation: if A → B then AC → BC for any attribute set C
- Transitivity: if A → B and B → C then A → C
Attribute closure: given F and an attribute set X, X+ is the set of attributes functionally determined by X.
Canonical cover Fc: a minimal set of FDs equivalent to F+ with no extraneous attributes and with each left side as small as possible.
Lossless join condition: R1 ∩ R2 → R1 or R1 ∩ R2 → R2 must hold in F+ for the decomposition R → {R1, R2} to be lossless.
Dependency preservation: a decomposition is dependency preserving if (F1 ∪ F2 ∪ … ∪ Fn)+ = F+ where Fi is the set of FDs that hold entirely within Ri.
1NF/2NF/3NF/BCNF are steps in normalization with increasingly stricter constraint requirements to avoid anomalies and redundancy.

Summary of practical implications

Relational algebra provides a formal foundation for specifying queries in a procedural-like, step-by-step manner and serves as the basis for SQL.
Normalization reduces redundancy and anomalies but may require trade-offs with performance and dependency preservation.
Concurrency control and recovery mechanisms are essential for ensuring database durability and correctness in multi-user environments.
Advanced topics such as MVDS, 4NF, BCNF, ARIES, and logging strategies are critical for understanding how real-world DBMSs maintain consistency and performance under failure and high concurrency.

References to typical exam-style relational-algebra tasks

Given a schema with passenger, agency, flight, and booking relations, express queries like:
- Complete details of all flights to a specific destination: σ_{dest = 'New Delhi'}(flight)
- Flights from Chennai to a destination: σ_{src = 'Chennai' ∧ dest = 'New Delhi'}(flight)
- Flight numbers for a passenger with a given id for flights to a city before a date: π{fid}(σ{pid = 123 ∧ dest = 'Chennai' ∧ fdate < '2020-11-06'}(flight) ⨝ booking ×)
- Passenger names with bookings on at least one flight: π_{pname}(passenger ⨝ booking)
Normalize a given relation using FDs to 1NF → 2NF → 3NF → BCNF with appropriate decomposition steps and verify whether the decomposition is lossless and dependency-preserving.
For MVDS and 4NF, identify sets of attributes that may vary independently and decompose to eliminate non-trivial MVDS, ensuring that the result remains lossless and, when possible, dependency-preserving.
In recovery, understand logs, checkpoints, and the ARIES three-pass recovery model (Analysis/Redo/Undo), the role of LSNs, and the use of compensating log records (CLRs) for robust recovery.
Understand lock-based vs timestamp-based concurrency control, two-phase locking, deadlock avoidance and detection, and the trade-offs between concurrency and safety.
Be comfortable with basic proofs of correctness for 3NF decomposition (dependency preservation and losslessness) and the concepts behind BCNF vs 3NF trade-offs.

These notes compile core ideas from the transcript, turning the slide content into cohesive study material with definitions, notation, examples, and practical implications for relational algebra, database design, normalization, and transaction processing.