4.1–4.8 Relational databases and SQL: Key concepts, operators, queries, joins, subqueries, Python integration, and pandas translation

4.1 Relational databases

  • Purpose and scope

    • Relational databases are used for extracting and manipulating data from multi-user, geographically distributed environments.

    • Compared to files/spreadsheets, databases offer benefits in:

    • Performance: optimized data layout for fast queries

    • Security: access control and encryption

    • Concurrency: managing conflicts when multiple users update data

    • Recovery: restoring to a consistent state after failures

    • Most leading systems are relational and organize data in rows and columns within tables.

    • Tables are related via keys and queried with Structured Query Language (SQL).

  • Core terms and structures

    • Table: fixed set of columns, variable rows; each column has a data type and a name.

    • Row: an unnamed sequence of values; each value corresponds to a column’s data type.

    • Column: named data field with a data type.

    • Cell: a single value at the intersection of a row and a column (one value per cell).

    • Data independence: the ability to tune physical data storage without changing query results.

  • Tables and rules (4.1.4)

    • A table has at least one column and any number of rows.

    • One value per cell; no multi-valued cells.

    • No duplicate column names within a table; duplicate column names may occur across different tables.

    • No duplicate rows (no two rows have identical values in all columns).

    • No ordered requirement for rows; storage order does not affect query results.

  • Keys and references

    • Primary key: a column (or group of columns) that uniquely identifies each row. Must be unique and cannot be NULL. Marked in diagrams with a solid circle (●) and usually leftmost.

    • Foreign key: a column (or group of columns) that refers to a primary key in another table. Values must be NULL or match some primary key value. Foreign keys are not necessarily unique.

    • Data types of foreign keys and their referenced primary keys must match (names may differ).

    • Foreign keys are usually in a different table from the referenced primary key, but may be in the same table.

  • NULL values

    • NULL represents unknown or inapplicable data.

    • NULL is not the same as zero or an empty string.

    • Arithmetic or comparison operations with NULL yield NULL.

    • Example truths involving NULL (partial truth table):

    • TRUE OR NULL -> TRUE

    • TRUE AND NULL -> NULL

    • NOT NULL -> NULL

  • Example data model (illustrative)

    • Countries table with CountryCode, CountryName, SurfaceArea, ContinentName.

    • Cities table with CityName, CountryCode, Population.

    • A nested/JSON-like example shows how countries and their cities might be represented, reinforcing table-based relational models.

  • SQL statement structure

    • A statement is a complete action, consisting of one or more clauses and ending with a semicolon.

    • Keywords, clauses, literals, and identifiers:

    • Keyword: reserved word with special meaning.

    • Clause: part of a statement (e.g., SELECT, FROM, WHERE).

    • Literal: fixed numeric or text value (text literals use single or double quotes).

- Identifier: user-defined name for a table, column, database, etc.

  • 4.1.5 Primary and foreign keys (summary)

    • Primary keys are unique and cannot be NULL.

    • Foreign keys refer to primary keys and may be NULL or match a primary key value.

    • Data types of FK and PK must be the same.

    • In diagrams, primary keys are typically shown with a special marker and appear as leftmost columns.

  • 4.1.6 Foreign keys (summary)

    • FK constraints ensure references only to existing, valid rows.

    • A foreign key helps maintain referential integrity across related tables.

4.1.7 SQL language elements (overview)

  • Keywords, clauses, literals, identifiers, and statements form SQL syntax.

  • A statement consists of one or more clauses and ends with a semicolon.

  • Common focus in this course: statements that retrieve data (queries), including SELECT, FROM, WHERE, etc.

4.1.8 Data independence and physical design considerations

  • Data independence enables tuning storage without altering queries.

  • Physical design can be optimized (indexing, partitioning) to improve performance while queries stay the same.

4.2 Simple queries

  • Operators

    • Arithmetic operators: +,\, -, \, *, /, \, \% , (^ \; exponent)

    • Examples: 4 + 3 = 7,\, 11 - 5 = 6,\, -(-2) = 2,\, 3 * 5 = 15,\, 4 / 2 = 2,\, 5 \% 2 = 1,\, 1^2 = 1

    • Comparison operators: =,
      eq, <, >, \, \le, \, \ge

    • Logical operators: AND, OR, NOT (NOT is unary; AND/OR are binary)

  • NULL handling (arithmetic and comparisons)

    • Any arithmetic or comparison with a NULL operand yields NULL.

    • Logical operators involving NULL yield NULL or propagate NULL depending on operands (e.g., TRUE OR NULL -> TRUE; TRUE AND NULL -> NULL; NOT NULL -> NULL).

  • Expression evaluation and operator precedence

    • Expressions consist of operands (columns or literals), operators, and parentheses.

    • Operator precedence (simplified):

    • 1) - (unary)

    • 2) ^

    • 3) =, !=, <, >, <=, >=

    • 4) *, /, %

    • 5) +, -

    • 6) NOT

    • 7) AND (binary)

    • 8) OR

    • Parentheses override precedence: expressions inside parentheses evaluated first.

  • Expressions examples

    • Example calculations:

    • 1) 7 + 3 * 2 = 13

    • 2) (7 + 3) * 2 = 20

    • More complex: ( ext{8} mod 3 + 10 > 15) ext{ AND } ext{TRUE}

    • Compute: 8 mod 3 = 2; 2 + 10 = 12; 12 > 15 is FALSE; FALSE AND TRUE = FALSE.

    • Another example: Age = 8 AND Military = 'Army' evaluates to TRUE in the scenario where those values hold.

  • SQL language elements (recap)

    • Keywords and identifiers are usually not case-sensitive in many DBMSs.

    • White space is ignored; formatting is for clarity (e.g., placing each clause on its own line).

    • Statements can retrieve data (SELECT) or modify data (INSERT, UPDATE, DELETE) and manage objects (CREATE/DROP).

4.3 IN and LIKE; DISTINCT and ORDER BY; limiting clauses

  • IN operator (4.3.1)

    • Used in WHERE to check if a value matches any value in a given list.

    • Example pattern: SELECT * FROM CountryLanguage WHERE Language IN ('Dutch', 'Kongo', 'Albanian');

  • LIKE operator (4.3.2)

    • Pattern matching with wildcards % (any sequence) and _ (single character).

    • Case sensitivity depends on database and optional BINARY flag.

    • Escape characters: to search for literal % or _, prefix with a backslash: LIKE 'a\%' matches 'a%'.

  • LIKE practice and matching exercise (4.3.3, 4.3.4)

    • Practical activity: match languages, e.g., using LIKE to filter by patterns like 'A%n' or 'A%'.

    • Example to extract titles containing a substring like 'star' in Movie titles.

  • DISTINCT and ORDER BY (4.3.5, 4.3.6)

    • DISTINCT returns unique values for a column in the result set.

    • ORDER BY sorts results by one or more columns; default is ascending order; DESC for descending.

  • Limiting results (4.3.7, 4.3.4-ish)

    • MySQL: LIMIT n

    • SQL Server: TOP n (follows SELECT)

    • Oracle: FETCH FIRST n ROWS ONLY (with FETCH clause)

    • When using LIMIT/TOP/FETCH without ORDER BY, result order is not guaranteed; with ORDER BY, results are deterministic.

  • SELECT with asterisk (4.2.3/4.2.8 recap)

    • To select all columns: SELECT * FROM TableName;

4.4 Aggregate functions

  • Overview

    • Aggregate functions operate on sets of rows and return a single summary value.

    • Common aggregate functions: COUNT, MIN, MAX, SUM, AVG, VARIANCE.

    • They ignore NULL values in their arguments.

  • Examples and usage

    • COUNT(*) counts rows; MIN(Salary) finds the minimum; AVG(Salary) averages over rows; SUM(Price) sums values; etc.

  • GROUP BY and HAVING (4.4.5, 4.4.6, 4.4.7, 4.4.9)

    • GROUP BY groups rows by one or more columns and returns one row per group.

    • HAVING filters groups after aggregation (must reference an aggregate function in HAVING).

    • Example: group by Year and Type in Auto; then apply HAVING MAX(Price) > 15000, etc.

  • Practical examples

    • Group by Make in Auto and count occurrences; filter groups having more than one occurrence with HAVING.

    • Use AVG(Price), MAX(Price), and GROUP BY Year, Type to analyze auto data.

  • Flattening/subqueries relation (context in 4.6) and notes on aggregation performance.

4.5 Join queries

  • What joins do

    • Joins combine data from two tables (left and right) into a single result table by matching on related columns, typically foreign key to primary key.

  • Prefixes and aliases

    • When columns share names, qualify with table prefixes (e.g., Department.Name).

    • Aliases shorten and disambiguate, e.g., SELECT Department.Name AS Dept, Employee.Name AS Supervisor FROM Department JOIN Employee ON Department.Manager = Employee.ID.

  • Types of joins (summary)

    • INNER JOIN: returns only matching rows from both tables.

    • FULL JOIN: returns all rows from both tables, with NULLs where there is no match.

    • LEFT JOIN: returns all rows from the left table and matching rows from the right; non-matching right-side values are NULL.

    • RIGHT JOIN: returns all rows from the right table and matching rows from the left; non-matching left-side values are NULL.

  • Example scenario and mappings

    • Given Department and Employee, INNER JOIN on Manager = ID yields matched department-employee pairs.

    • FULL JOIN shows all rows from both sides, with NULLs where there is no match.

  • Practical activity (4.5.4–4.5.7) demonstrates choosing INNER vs FULL vs LEFT vs RIGHT depending on the desired result set.

4.6 Subqueries

  • Definitions

    • Subquery (inner query) is a query inside another SQL query, commonly used in WHERE to constrain outer query results.

    • Correlated subqueries reference a column from the outer query, requiring execution per outer-row.

  • EXISTS and NOT EXISTS

    • EXISTS returns TRUE if the subquery returns at least one row; NOT EXISTS returns TRUE if the subquery returns no rows.

  • Correlated subqueries (examples)

    • Example: SELECT Name, CountryCode, Population FROM City C WHERE Population > (SELECT AVG(Population) FROM City WHERE CountryCode = C.CountryCode);

    • Correlated references use table aliases to distinguish columns (City.CountryCode vs City.Code vs CountryCode in subqueries).

  • Flattening subqueries

    • Rewriting subqueries as joins can improve performance; retain outer query structure (SELECT, FROM, GROUP BY, HAVING, ORDER BY) and convert the subquery to INNER JOINs where applicable.

  • Practice activities

    • Exercises include selecting with EXISTS, NOT EXISTS, IN, and correlated subqueries; understanding how many times subqueries execute and how results can be reformulated as joins.

  • EXISTS with correlated subqueries

    • Example: EXISTS can be used to check existence of related rows (e.g., spouses, children) and return matching outer rows.

4.7 Queries in Python (Connector/Python)

  • Purpose

    • Demonstrates connecting to MySQL from Python using Connector/Python, a DB-API-compliant driver.

  • Connections

    • A connection is created via mysql.connector.connect(…) and assigned to a variable (e.g., reservationConnection).

    • Use try/except blocks to handle connection errors; close the connection when finished.

  • Cursors

    • A cursor (MySQLCursor) executes SQL statements via cursor.execute(sql, params).

    • The execute() method accepts a SQL string and optional parameter tuple for query parameters to prevent SQL injection.

    • Cursors are tied to a connection and can execute multiple statements; they must be closed with cursor.close() before closing the connection.

  • Query parameters and SQL injection prevention

    • Use placeholders (%s) in SQL strings and pass a tuple of values as the second argument to cursor.execute(…).

    • This automatically escapes inputs and prevents injection attacks.

  • Fetching results

    • cursor.fetchone() returns a single row or None if no more rows.

    • cursor.fetchall() returns a list of all rows returned by the query.

  • Basic CRUD example flow

    • Create a cursor, execute a query with parameters, fetch results, then close cursor and connection.

4.8 Case study: SQL vs pandas; Python integration

  • High-level comparison

    • SQL is ideal for data retrieval and basic aggregation; Python excels at data processing, modeling, and visualization.

    • SQL is often faster for querying large datasets; Python adds a broader ecosystem via packages for analysis and visualization.

    • Data pipelines commonly use SQL for extraction and Python (pandas) for transformation/analysis.

  • SQL-to-pandas translation (4.8.3/4.8.4)

    • Pandas queries use syntax inspired by Python and can emulate many SQL operations via DataFrame.query and methods like .loc, .groupby, etc.

    • Examples guide translating a simple SQL SELECT to a pandas DataFrame.query(…).

  • Pandas queries syntax and operators

    • DataFrame.query uses Python-like boolean expressions and supports logical operators (and, or, not) as well as comparison operators.

    • Example transformation exercise: transform SQL queries like SELECT CountryName FROM Country WHERE SurfaceArea > 300000; into pandas equivalents using Country.query("SurfaceArea > 300000")["CountryName"].

  • Case study activities (4.8.5)

    • Explore how many movies satisfy certain conditions when translated between SQL and pandas, reinforcing the mapping between SQL operations and pandas operations.

  • Practical guidance

    • Use SQL for data extraction and initial filtering; use pandas for downstream data processing, feature engineering, and visualization.

  • Key SQL-to-Pandas mapping notes

    • SELECT … FROM table WHERE condition → df.query("condition").[columns] or df.loc[df["column"] op value, [columns]]

    • Aggregations with GROUP BY → df.groupby([keys]).agg({"col": [aggregation]})

    • Joins → pd.merge(leftdf, rightdf, on=[key], how='inner'|'left'|'right'|'outer')

    • Subqueries → often replaced by merges/filters in pandas when flattening

  • Ethical and practical implications

    • SQL injection remains a critical risk; always use parameterized queries and proper input validation when building dynamic queries.

    • Data handling decisions (e.g., NULLs, missing values) should be explicit to avoid misinterpretation in analyses.

    • Consider performance: flattening subqueries to joins can improve performance; always test correctness and performance across representative data.

  • Formulas and representative equations in this content

    • Operator precedence (summary):
      1. ext{ Unary }-, 2. ext{ Exponentiation }(^), 3. ext{ Comparison }(=,\neq,<,>,\le,\ge), 4. ext{ Multiplication/Division/Modulo }(*,/,% ), 5. ext{ Addition/Subtraction }(+,-), 6. ext{ NOT}, 7. ext{ AND}, 8. ext{ OR}

    • NULL propagation rules (conceptual): if any operand is NULL in arithmetic or comparison, the result is NULL; in boolean logic with NULLs, results follow the truth table where certain combinations yield NULL or TRUE depending on operators involved.

    • Example expressions and outcomes (as shown in 4.2):

    • 7 + 3 * 2 = 13

    • (7 + 3) * 2 = 20

    • (8 \% 3 + 10 > 15) \, AND \, TRUE = FALSE

  • Quick references to factual rules from the transcript

    • A table’s primary key must be unique and not NULL.

    • Foreign keys refer to primary keys and must be NULL or match an existing PK value; FK values are not required to be unique.

    • The data type of a foreign key must match that of the referenced primary key.

    • The last rule of 4.1.4 highlights data independence in relational databases.

  • Sample study prompts to review

    • Explain the difference between NULL and zero in SQL and how NULL affects arithmetic and comparison operations.

    • Describe how to enforce referential integrity using primary and foreign keys.

    • Distinguish between INNER, LEFT, RIGHT, and FULL joins with simple examples.

    • Explain when to use IN vs. LIKE vs. EXISTS in subqueries and provide a small example for each.

    • Outline how to translate common SQL queries to pandas equivalents and when to prefer one tool over the other.