4.1–4.8 Relational databases and SQL: Key concepts, operators, queries, joins, subqueries, Python integration, and pandas translation
4.1 Relational databases
Purpose and scope
Relational databases are used for extracting and manipulating data from multi-user, geographically distributed environments.
Compared to files/spreadsheets, databases offer benefits in:
Performance: optimized data layout for fast queries
Security: access control and encryption
Concurrency: managing conflicts when multiple users update data
Recovery: restoring to a consistent state after failures
Most leading systems are relational and organize data in rows and columns within tables.
Tables are related via keys and queried with Structured Query Language (SQL).
Core terms and structures
Table: fixed set of columns, variable rows; each column has a data type and a name.
Row: an unnamed sequence of values; each value corresponds to a column’s data type.
Column: named data field with a data type.
Cell: a single value at the intersection of a row and a column (one value per cell).
Data independence: the ability to tune physical data storage without changing query results.
Tables and rules (4.1.4)
A table has at least one column and any number of rows.
One value per cell; no multi-valued cells.
No duplicate column names within a table; duplicate column names may occur across different tables.
No duplicate rows (no two rows have identical values in all columns).
No ordered requirement for rows; storage order does not affect query results.
Keys and references
Primary key: a column (or group of columns) that uniquely identifies each row. Must be unique and cannot be NULL. Marked in diagrams with a solid circle (●) and usually leftmost.
Foreign key: a column (or group of columns) that refers to a primary key in another table. Values must be NULL or match some primary key value. Foreign keys are not necessarily unique.
Data types of foreign keys and their referenced primary keys must match (names may differ).
Foreign keys are usually in a different table from the referenced primary key, but may be in the same table.
NULL values
NULL represents unknown or inapplicable data.
NULL is not the same as zero or an empty string.
Arithmetic or comparison operations with NULL yield NULL.
Example truths involving NULL (partial truth table):
TRUE OR NULL -> TRUE
TRUE AND NULL -> NULL
NOT NULL -> NULL
Example data model (illustrative)
Countries table with CountryCode, CountryName, SurfaceArea, ContinentName.
Cities table with CityName, CountryCode, Population.
A nested/JSON-like example shows how countries and their cities might be represented, reinforcing table-based relational models.
SQL statement structure
A statement is a complete action, consisting of one or more clauses and ending with a semicolon.
Keywords, clauses, literals, and identifiers:
Keyword: reserved word with special meaning.
Clause: part of a statement (e.g., SELECT, FROM, WHERE).
Literal: fixed numeric or text value (text literals use single or double quotes).
- Identifier: user-defined name for a table, column, database, etc.
4.1.5 Primary and foreign keys (summary)
Primary keys are unique and cannot be NULL.
Foreign keys refer to primary keys and may be NULL or match a primary key value.
Data types of FK and PK must be the same.
In diagrams, primary keys are typically shown with a special marker and appear as leftmost columns.
4.1.6 Foreign keys (summary)
FK constraints ensure references only to existing, valid rows.
A foreign key helps maintain referential integrity across related tables.
4.1.7 SQL language elements (overview)
Keywords, clauses, literals, identifiers, and statements form SQL syntax.
A statement consists of one or more clauses and ends with a semicolon.
Common focus in this course: statements that retrieve data (queries), including SELECT, FROM, WHERE, etc.
4.1.8 Data independence and physical design considerations
Data independence enables tuning storage without altering queries.
Physical design can be optimized (indexing, partitioning) to improve performance while queries stay the same.
4.2 Simple queries
Operators
Arithmetic operators: +,\, -, \, *, /, \, \% , (^ \; exponent)
Examples: 4 + 3 = 7,\, 11 - 5 = 6,\, -(-2) = 2,\, 3 * 5 = 15,\, 4 / 2 = 2,\, 5 \% 2 = 1,\, 1^2 = 1
Comparison operators: =,
eq, <, >, \, \le, \, \geLogical operators: AND, OR, NOT (NOT is unary; AND/OR are binary)
NULL handling (arithmetic and comparisons)
Any arithmetic or comparison with a NULL operand yields NULL.
Logical operators involving NULL yield NULL or propagate NULL depending on operands (e.g., TRUE OR NULL -> TRUE; TRUE AND NULL -> NULL; NOT NULL -> NULL).
Expression evaluation and operator precedence
Expressions consist of operands (columns or literals), operators, and parentheses.
Operator precedence (simplified):
1) - (unary)
2) ^
3) =, !=, <, >, <=, >=
4) *, /, %
5) +, -
6) NOT
7) AND (binary)
8) OR
Parentheses override precedence: expressions inside parentheses evaluated first.
Expressions examples
Example calculations:
1) 7 + 3 * 2 = 13
2) (7 + 3) * 2 = 20
More complex: ( ext{8} mod 3 + 10 > 15) ext{ AND } ext{TRUE}
Compute: 8 mod 3 = 2; 2 + 10 = 12; 12 > 15 is FALSE; FALSE AND TRUE = FALSE.
Another example: Age = 8 AND Military = 'Army' evaluates to TRUE in the scenario where those values hold.
SQL language elements (recap)
Keywords and identifiers are usually not case-sensitive in many DBMSs.
White space is ignored; formatting is for clarity (e.g., placing each clause on its own line).
Statements can retrieve data (SELECT) or modify data (INSERT, UPDATE, DELETE) and manage objects (CREATE/DROP).
4.3 IN and LIKE; DISTINCT and ORDER BY; limiting clauses
IN operator (4.3.1)
Used in WHERE to check if a value matches any value in a given list.
Example pattern: SELECT * FROM CountryLanguage WHERE Language IN ('Dutch', 'Kongo', 'Albanian');
LIKE operator (4.3.2)
Pattern matching with wildcards % (any sequence) and _ (single character).
Case sensitivity depends on database and optional BINARY flag.
Escape characters: to search for literal % or _, prefix with a backslash: LIKE 'a\%' matches 'a%'.
LIKE practice and matching exercise (4.3.3, 4.3.4)
Practical activity: match languages, e.g., using LIKE to filter by patterns like 'A%n' or 'A%'.
Example to extract titles containing a substring like 'star' in Movie titles.
DISTINCT and ORDER BY (4.3.5, 4.3.6)
DISTINCT returns unique values for a column in the result set.
ORDER BY sorts results by one or more columns; default is ascending order; DESC for descending.
Limiting results (4.3.7, 4.3.4-ish)
MySQL: LIMIT n
SQL Server: TOP n (follows SELECT)
Oracle: FETCH FIRST n ROWS ONLY (with FETCH clause)
When using LIMIT/TOP/FETCH without ORDER BY, result order is not guaranteed; with ORDER BY, results are deterministic.
SELECT with asterisk (4.2.3/4.2.8 recap)
To select all columns: SELECT * FROM TableName;
4.4 Aggregate functions
Overview
Aggregate functions operate on sets of rows and return a single summary value.
Common aggregate functions: COUNT, MIN, MAX, SUM, AVG, VARIANCE.
They ignore NULL values in their arguments.
Examples and usage
COUNT(*) counts rows; MIN(Salary) finds the minimum; AVG(Salary) averages over rows; SUM(Price) sums values; etc.
GROUP BY and HAVING (4.4.5, 4.4.6, 4.4.7, 4.4.9)
GROUP BY groups rows by one or more columns and returns one row per group.
HAVING filters groups after aggregation (must reference an aggregate function in HAVING).
Example: group by Year and Type in Auto; then apply HAVING MAX(Price) > 15000, etc.
Practical examples
Group by Make in Auto and count occurrences; filter groups having more than one occurrence with HAVING.
Use AVG(Price), MAX(Price), and GROUP BY Year, Type to analyze auto data.
Flattening/subqueries relation (context in 4.6) and notes on aggregation performance.
4.5 Join queries
What joins do
Joins combine data from two tables (left and right) into a single result table by matching on related columns, typically foreign key to primary key.
Prefixes and aliases
When columns share names, qualify with table prefixes (e.g., Department.Name).
Aliases shorten and disambiguate, e.g., SELECT Department.Name AS Dept, Employee.Name AS Supervisor FROM Department JOIN Employee ON Department.Manager = Employee.ID.
Types of joins (summary)
INNER JOIN: returns only matching rows from both tables.
FULL JOIN: returns all rows from both tables, with NULLs where there is no match.
LEFT JOIN: returns all rows from the left table and matching rows from the right; non-matching right-side values are NULL.
RIGHT JOIN: returns all rows from the right table and matching rows from the left; non-matching left-side values are NULL.
Example scenario and mappings
Given Department and Employee, INNER JOIN on Manager = ID yields matched department-employee pairs.
FULL JOIN shows all rows from both sides, with NULLs where there is no match.
Practical activity (4.5.4–4.5.7) demonstrates choosing INNER vs FULL vs LEFT vs RIGHT depending on the desired result set.
4.6 Subqueries
Definitions
Subquery (inner query) is a query inside another SQL query, commonly used in WHERE to constrain outer query results.
Correlated subqueries reference a column from the outer query, requiring execution per outer-row.
EXISTS and NOT EXISTS
EXISTS returns TRUE if the subquery returns at least one row; NOT EXISTS returns TRUE if the subquery returns no rows.
Correlated subqueries (examples)
Example: SELECT Name, CountryCode, Population FROM City C WHERE Population > (SELECT AVG(Population) FROM City WHERE CountryCode = C.CountryCode);
Correlated references use table aliases to distinguish columns (City.CountryCode vs City.Code vs CountryCode in subqueries).
Flattening subqueries
Rewriting subqueries as joins can improve performance; retain outer query structure (SELECT, FROM, GROUP BY, HAVING, ORDER BY) and convert the subquery to INNER JOINs where applicable.
Practice activities
Exercises include selecting with EXISTS, NOT EXISTS, IN, and correlated subqueries; understanding how many times subqueries execute and how results can be reformulated as joins.
EXISTS with correlated subqueries
Example: EXISTS can be used to check existence of related rows (e.g., spouses, children) and return matching outer rows.
4.7 Queries in Python (Connector/Python)
Purpose
Demonstrates connecting to MySQL from Python using Connector/Python, a DB-API-compliant driver.
Connections
A connection is created via mysql.connector.connect(…) and assigned to a variable (e.g., reservationConnection).
Use try/except blocks to handle connection errors; close the connection when finished.
Cursors
A cursor (MySQLCursor) executes SQL statements via cursor.execute(sql, params).
The execute() method accepts a SQL string and optional parameter tuple for query parameters to prevent SQL injection.
Cursors are tied to a connection and can execute multiple statements; they must be closed with cursor.close() before closing the connection.
Query parameters and SQL injection prevention
Use placeholders (%s) in SQL strings and pass a tuple of values as the second argument to cursor.execute(…).
This automatically escapes inputs and prevents injection attacks.
Fetching results
cursor.fetchone() returns a single row or None if no more rows.
cursor.fetchall() returns a list of all rows returned by the query.
Basic CRUD example flow
Create a cursor, execute a query with parameters, fetch results, then close cursor and connection.
4.8 Case study: SQL vs pandas; Python integration
High-level comparison
SQL is ideal for data retrieval and basic aggregation; Python excels at data processing, modeling, and visualization.
SQL is often faster for querying large datasets; Python adds a broader ecosystem via packages for analysis and visualization.
Data pipelines commonly use SQL for extraction and Python (pandas) for transformation/analysis.
SQL-to-pandas translation (4.8.3/4.8.4)
Pandas queries use syntax inspired by Python and can emulate many SQL operations via DataFrame.query and methods like .loc, .groupby, etc.
Examples guide translating a simple SQL SELECT to a pandas DataFrame.query(…).
Pandas queries syntax and operators
DataFrame.query uses Python-like boolean expressions and supports logical operators (and, or, not) as well as comparison operators.
Example transformation exercise: transform SQL queries like SELECT CountryName FROM Country WHERE SurfaceArea > 300000; into pandas equivalents using Country.query("SurfaceArea > 300000")["CountryName"].
Case study activities (4.8.5)
Explore how many movies satisfy certain conditions when translated between SQL and pandas, reinforcing the mapping between SQL operations and pandas operations.
Practical guidance
Use SQL for data extraction and initial filtering; use pandas for downstream data processing, feature engineering, and visualization.
Key SQL-to-Pandas mapping notes
SELECT … FROM table WHERE condition → df.query("condition").[columns] or df.loc[df["column"] op value, [columns]]
Aggregations with GROUP BY → df.groupby([keys]).agg({"col": [aggregation]})
Joins → pd.merge(leftdf, rightdf, on=[key], how='inner'|'left'|'right'|'outer')
Subqueries → often replaced by merges/filters in pandas when flattening
Ethical and practical implications
SQL injection remains a critical risk; always use parameterized queries and proper input validation when building dynamic queries.
Data handling decisions (e.g., NULLs, missing values) should be explicit to avoid misinterpretation in analyses.
Consider performance: flattening subqueries to joins can improve performance; always test correctness and performance across representative data.
Formulas and representative equations in this content
Operator precedence (summary):
1. ext{ Unary }-, 2. ext{ Exponentiation }(^), 3. ext{ Comparison }(=,\neq,<,>,\le,\ge), 4. ext{ Multiplication/Division/Modulo }(*,/,% ), 5. ext{ Addition/Subtraction }(+,-), 6. ext{ NOT}, 7. ext{ AND}, 8. ext{ OR}NULL propagation rules (conceptual): if any operand is NULL in arithmetic or comparison, the result is NULL; in boolean logic with NULLs, results follow the truth table where certain combinations yield NULL or TRUE depending on operators involved.
Example expressions and outcomes (as shown in 4.2):
7 + 3 * 2 = 13
(7 + 3) * 2 = 20
(8 \% 3 + 10 > 15) \, AND \, TRUE = FALSE
Quick references to factual rules from the transcript
A table’s primary key must be unique and not NULL.
Foreign keys refer to primary keys and must be NULL or match an existing PK value; FK values are not required to be unique.
The data type of a foreign key must match that of the referenced primary key.
The last rule of 4.1.4 highlights data independence in relational databases.
Sample study prompts to review
Explain the difference between NULL and zero in SQL and how NULL affects arithmetic and comparison operations.
Describe how to enforce referential integrity using primary and foreign keys.
Distinguish between INNER, LEFT, RIGHT, and FULL joins with simple examples.
Explain when to use IN vs. LIKE vs. EXISTS in subqueries and provide a small example for each.
Outline how to translate common SQL queries to pandas equivalents and when to prefer one tool over the other.