4.1–4.8 Relational databases and SQL: Key concepts, operators, queries, joins, subqueries, Python integration, and pandas translation

4.1 Relational databases

Purpose and scope
- Relational databases are used for extracting and manipulating data from multi-user, geographically distributed environments.
- Compared to files/spreadsheets, databases offer benefits in:
- Performance: optimized data layout for fast queries
- Security: access control and encryption
- Concurrency: managing conflicts when multiple users update data
- Recovery: restoring to a consistent state after failures
- Most leading systems are relational and organize data in rows and columns within tables.
- Tables are related via keys and queried with Structured Query Language (SQL).
Core terms and structures
- Table: fixed set of columns, variable rows; each column has a data type and a name.
- Row: an unnamed sequence of values; each value corresponds to a column’s data type.
- Column: named data field with a data type.
- Cell: a single value at the intersection of a row and a column (one value per cell).
- Data independence: the ability to tune physical data storage without changing query results.
Tables and rules (4.1.4)
- A table has at least one column and any number of rows.
- One value per cell; no multi-valued cells.
- No duplicate column names within a table; duplicate column names may occur across different tables.
- No duplicate rows (no two rows have identical values in all columns).
- No ordered requirement for rows; storage order does not affect query results.
Keys and references
- Primary key: a column (or group of columns) that uniquely identifies each row. Must be unique and cannot be NULL. Marked in diagrams with a solid circle (●) and usually leftmost.
- Foreign key: a column (or group of columns) that refers to a primary key in another table. Values must be NULL or match some primary key value. Foreign keys are not necessarily unique.
- Data types of foreign keys and their referenced primary keys must match (names may differ).
- Foreign keys are usually in a different table from the referenced primary key, but may be in the same table.
NULL values
- NULL represents unknown or inapplicable data.
- NULL is not the same as zero or an empty string.
- Arithmetic or comparison operations with NULL yield NULL.
- Example truths involving NULL (partial truth table):
- TRUE OR NULL -> TRUE
- TRUE AND NULL -> NULL
- NOT NULL -> NULL
Example data model (illustrative)
- Countries table with CountryCode, CountryName, SurfaceArea, ContinentName.
- Cities table with CityName, CountryCode, Population.
- A nested/JSON-like example shows how countries and their cities might be represented, reinforcing table-based relational models.
SQL statement structure
- A statement is a complete action, consisting of one or more clauses and ending with a semicolon.
- Keywords, clauses, literals, and identifiers:
- Keyword: reserved word with special meaning.
- Clause: part of a statement (e.g., SELECT, FROM, WHERE).
- Literal: fixed numeric or text value (text literals use single or double quotes).

- Identifier: user-defined name for a table, column, database, etc.

4.1.5 Primary and foreign keys (summary)
- Primary keys are unique and cannot be NULL.
- Foreign keys refer to primary keys and may be NULL or match a primary key value.
- Data types of FK and PK must be the same.
- In diagrams, primary keys are typically shown with a special marker and appear as leftmost columns.
4.1.6 Foreign keys (summary)
- FK constraints ensure references only to existing, valid rows.
- A foreign key helps maintain referential integrity across related tables.

4.1.7 SQL language elements (overview)

Keywords, clauses, literals, identifiers, and statements form SQL syntax.
A statement consists of one or more clauses and ends with a semicolon.
Common focus in this course: statements that retrieve data (queries), including SELECT, FROM, WHERE, etc.

4.1.8 Data independence and physical design considerations

Data independence enables tuning storage without altering queries.
Physical design can be optimized (indexing, partitioning) to improve performance while queries stay the same.

4.2 Simple queries

Operators
- Arithmetic operators: +,\, -, \, *, /, \, \% , (^ \; exponent)
- Examples: $4 + 3 = 7,\, 11 - 5 = 6,\, -(-2) = 2,\, 3 * 5 = 15,\, 4 / 2 = 2,\, 5 \% 2 = 1,\, 1^2 = 1$
- Comparison operators: =,
  eq, <, >, \, \le, \, \ge
- Logical operators: $AND, OR, NOT$ (NOT is unary; AND/OR are binary)
NULL handling (arithmetic and comparisons)
- Any arithmetic or comparison with a NULL operand yields NULL.
- Logical operators involving NULL yield NULL or propagate NULL depending on operands (e.g., TRUE OR NULL -> TRUE; TRUE AND NULL -> NULL; NOT NULL -> NULL).
Expression evaluation and operator precedence
- Expressions consist of operands (columns or literals), operators, and parentheses.
- Operator precedence (simplified):
- 1) - (unary)
- 2) ^
- 3) =, !=, <, >, <=, >=
- 4) *, /, %
- 5) +, -
- 6) NOT
- 7) AND (binary)
- 8) OR
- Parentheses override precedence: expressions inside parentheses evaluated first.
Expressions examples
- Example calculations:
- 1) $7 + 3 * 2 = 13$
- 2) $(7 + 3) * 2 = 20$
- More complex: ( ext{8} mod 3 + 10 > 15) ext{ AND } ext{TRUE}
- Compute: $8 \bmod 3 = 2$ ; $2 + 10 = 12$ ; 12 > 15 is FALSE; FALSE AND TRUE = FALSE.
- Another example: Age = 8 AND Military = 'Army' evaluates to TRUE in the scenario where those values hold.
SQL language elements (recap)
- Keywords and identifiers are usually not case-sensitive in many DBMSs.
- White space is ignored; formatting is for clarity (e.g., placing each clause on its own line).
- Statements can retrieve data (SELECT) or modify data (INSERT, UPDATE, DELETE) and manage objects (CREATE/DROP).

4.3 IN and LIKE; DISTINCT and ORDER BY; limiting clauses

IN operator (4.3.1)
- Used in WHERE to check if a value matches any value in a given list.
- Example pattern: SELECT * FROM CountryLanguage WHERE Language IN ('Dutch', 'Kongo', 'Albanian');
LIKE operator (4.3.2)
- Pattern matching with wildcards % (any sequence) and _ (single character).
- Case sensitivity depends on database and optional BINARY flag.
- Escape characters: to search for literal % or _, prefix with a backslash: LIKE 'a\%' matches 'a%'.
LIKE practice and matching exercise (4.3.3, 4.3.4)
- Practical activity: match languages, e.g., using LIKE to filter by patterns like 'A%n' or 'A%'.
- Example to extract titles containing a substring like 'star' in Movie titles.
DISTINCT and ORDER BY (4.3.5, 4.3.6)
- DISTINCT returns unique values for a column in the result set.
- ORDER BY sorts results by one or more columns; default is ascending order; DESC for descending.
Limiting results (4.3.7, 4.3.4-ish)
- MySQL: LIMIT n
- SQL Server: TOP n (follows SELECT)
- Oracle: FETCH FIRST n ROWS ONLY (with FETCH clause)
- When using LIMIT/TOP/FETCH without ORDER BY, result order is not guaranteed; with ORDER BY, results are deterministic.
SELECT with asterisk (4.2.3/4.2.8 recap)
- To select all columns: SELECT * FROM TableName;

4.4 Aggregate functions

Overview
- Aggregate functions operate on sets of rows and return a single summary value.
- Common aggregate functions: COUNT, MIN, MAX, SUM, AVG, VARIANCE.
- They ignore NULL values in their arguments.
Examples and usage
- COUNT(*) counts rows; MIN(Salary) finds the minimum; AVG(Salary) averages over rows; SUM(Price) sums values; etc.
GROUP BY and HAVING (4.4.5, 4.4.6, 4.4.7, 4.4.9)
- GROUP BY groups rows by one or more columns and returns one row per group.
- HAVING filters groups after aggregation (must reference an aggregate function in HAVING).
- Example: group by Year and Type in Auto; then apply HAVING MAX(Price) > 15000, etc.
Practical examples
- Group by Make in Auto and count occurrences; filter groups having more than one occurrence with HAVING.
- Use AVG(Price), MAX(Price), and GROUP BY Year, Type to analyze auto data.
Flattening/subqueries relation (context in 4.6) and notes on aggregation performance.

4.5 Join queries

What joins do
- Joins combine data from two tables (left and right) into a single result table by matching on related columns, typically foreign key to primary key.
Prefixes and aliases
- When columns share names, qualify with table prefixes (e.g., Department.Name).
- Aliases shorten and disambiguate, e.g., SELECT Department.Name AS Dept, Employee.Name AS Supervisor FROM Department JOIN Employee ON Department.Manager = Employee.ID.
Types of joins (summary)
- INNER JOIN: returns only matching rows from both tables.
- FULL JOIN: returns all rows from both tables, with NULLs where there is no match.
- LEFT JOIN: returns all rows from the left table and matching rows from the right; non-matching right-side values are NULL.
- RIGHT JOIN: returns all rows from the right table and matching rows from the left; non-matching left-side values are NULL.
Example scenario and mappings
- Given Department and Employee, INNER JOIN on Manager = ID yields matched department-employee pairs.
- FULL JOIN shows all rows from both sides, with NULLs where there is no match.
Practical activity (4.5.4–4.5.7) demonstrates choosing INNER vs FULL vs LEFT vs RIGHT depending on the desired result set.

4.6 Subqueries

Definitions
- Subquery (inner query) is a query inside another SQL query, commonly used in WHERE to constrain outer query results.
- Correlated subqueries reference a column from the outer query, requiring execution per outer-row.
EXISTS and NOT EXISTS
- EXISTS returns TRUE if the subquery returns at least one row; NOT EXISTS returns TRUE if the subquery returns no rows.
Correlated subqueries (examples)
- Example: SELECT Name, CountryCode, Population FROM City C WHERE Population > (SELECT AVG(Population) FROM City WHERE CountryCode = C.CountryCode);
- Correlated references use table aliases to distinguish columns (City.CountryCode vs City.Code vs CountryCode in subqueries).
Flattening subqueries
- Rewriting subqueries as joins can improve performance; retain outer query structure (SELECT, FROM, GROUP BY, HAVING, ORDER BY) and convert the subquery to INNER JOINs where applicable.
Practice activities
- Exercises include selecting with EXISTS, NOT EXISTS, IN, and correlated subqueries; understanding how many times subqueries execute and how results can be reformulated as joins.
EXISTS with correlated subqueries
- Example: EXISTS can be used to check existence of related rows (e.g., spouses, children) and return matching outer rows.

4.7 Queries in Python (Connector/Python)

Purpose
- Demonstrates connecting to MySQL from Python using Connector/Python, a DB-API-compliant driver.
Connections
- A connection is created via mysql.connector.connect(…) and assigned to a variable (e.g., reservationConnection).
- Use try/except blocks to handle connection errors; close the connection when finished.
Cursors
- A cursor (MySQLCursor) executes SQL statements via cursor.execute(sql, params).
- The execute() method accepts a SQL string and optional parameter tuple for query parameters to prevent SQL injection.
- Cursors are tied to a connection and can execute multiple statements; they must be closed with cursor.close() before closing the connection.
Query parameters and SQL injection prevention
- Use placeholders (%s) in SQL strings and pass a tuple of values as the second argument to cursor.execute(…).
- This automatically escapes inputs and prevents injection attacks.
Fetching results
- cursor.fetchone() returns a single row or None if no more rows.
- cursor.fetchall() returns a list of all rows returned by the query.
Basic CRUD example flow
- Create a cursor, execute a query with parameters, fetch results, then close cursor and connection.

4.8 Case study: SQL vs pandas; Python integration

High-level comparison
- SQL is ideal for data retrieval and basic aggregation; Python excels at data processing, modeling, and visualization.
- SQL is often faster for querying large datasets; Python adds a broader ecosystem via packages for analysis and visualization.
- Data pipelines commonly use SQL for extraction and Python (pandas) for transformation/analysis.
SQL-to-pandas translation (4.8.3/4.8.4)
- Pandas queries use syntax inspired by Python and can emulate many SQL operations via DataFrame.query and methods like .loc, .groupby, etc.
- Examples guide translating a simple SQL SELECT to a pandas DataFrame.query(…).
Pandas queries syntax and operators
- DataFrame.query uses Python-like boolean expressions and supports logical operators (and, or, not) as well as comparison operators.
- Example transformation exercise: transform SQL queries like SELECT CountryName FROM Country WHERE SurfaceArea > 300000; into pandas equivalents using Country.query("SurfaceArea > 300000")["CountryName"].
Case study activities (4.8.5)
- Explore how many movies satisfy certain conditions when translated between SQL and pandas, reinforcing the mapping between SQL operations and pandas operations.
Practical guidance
- Use SQL for data extraction and initial filtering; use pandas for downstream data processing, feature engineering, and visualization.
Key SQL-to-Pandas mapping notes
- SELECT … FROM table WHERE condition → df.query("condition").[columns] or df.loc[df["column"] op value, [columns]]
- Aggregations with GROUP BY → df.groupby([keys]).agg({"col": [aggregation]})
- Joins → pd.merge(leftdf, rightdf, on=[key], how='inner'|'left'|'right'|'outer')
- Subqueries → often replaced by merges/filters in pandas when flattening
Ethical and practical implications
- SQL injection remains a critical risk; always use parameterized queries and proper input validation when building dynamic queries.
- Data handling decisions (e.g., NULLs, missing values) should be explicit to avoid misinterpretation in analyses.
- Consider performance: flattening subqueries to joins can improve performance; always test correctness and performance across representative data.
Formulas and representative equations in this content
- Operator precedence (summary):
  $1. ext{ Unary }-$ , $2. ext{ Exponentiation }(^)$ , 3. ext{ Comparison }(=,\neq,<,>,\le,\ge), $4. ext{ Multiplication/Division/Modulo }(*,/,% )$ , $5. ext{ Addition/Subtraction }(+,-)$ , $6. ext{ NOT}$ , $7. ext{ AND}$ , $8. ext{ OR}$
- NULL propagation rules (conceptual): if any operand is NULL in arithmetic or comparison, the result is NULL; in boolean logic with NULLs, results follow the truth table where certain combinations yield NULL or TRUE depending on operators involved.
- Example expressions and outcomes (as shown in 4.2):
- $7 + 3 * 2 = 13$
- $(7 + 3) * 2 = 20$
- (8 \% 3 + 10 > 15) \, AND \, TRUE = FALSE
Quick references to factual rules from the transcript
- A table’s primary key must be unique and not NULL.
- Foreign keys refer to primary keys and must be NULL or match an existing PK value; FK values are not required to be unique.
- The data type of a foreign key must match that of the referenced primary key.
- The last rule of 4.1.4 highlights data independence in relational databases.
Sample study prompts to review
- Explain the difference between NULL and zero in SQL and how NULL affects arithmetic and comparison operations.
- Describe how to enforce referential integrity using primary and foreign keys.
- Distinguish between INNER, LEFT, RIGHT, and FULL joins with simple examples.
- Explain when to use IN vs. LIKE vs. EXISTS in subqueries and provide a small example for each.
- Outline how to translate common SQL queries to pandas equivalents and when to prefer one tool over the other.