SQL Joins and Multi-Table Retrieval - Comprehensive Notes
SQL Joins and Multi-Table Retrieval – Comprehensive Notes
Overview: transactional data are high-volume and low-time, so relational databases use normalization to break information into many small, interconnected tables. This reduces redundancy and inconsistency by recording information only once, but makes data retrieval more complex because information is spread across multiple tables.
- Redundancy vs. consistency trade-off: normalization avoids duplicating data but complicates queries that need data from multiple places.
- In prior lecture: retrieval from a single table was discussed; relational databases typically require querying across multiple tables.
Core idea: joining tables to retrieve information that spans multiple tables by conceptually creating an intermediate, larger table that combines rows from the involved tables.
- Example setup:
- Employee table: columns include first name, last name, SSN, department number.
- Department table: includes department name (and department number as key).
- To know an employee’s department name, you must join employee with department on the department number.
- If you don’t have a direct one-table source for the needed information, you join to synthesize the needed data.
Nested loop intuition for joins:
- Outer loop: iterate over rows of one table (e.g., student table).
- Inner loop: for each outer row, scan the related inner table (e.g., skills table) to find a matching key (e.g., skill ID).
- Example: Student table with SkillID and Skills table with SkillID → SkillName.
- For each student, scan skills to find the matching SkillID and read the SkillName.
- This illustrates the basic inner-join mechanism as a nested loop; the inner loop is often the lookup table in such a scenario.
- Note: In practice, database engines use many algorithms; nested loops are one way to conceptualize a join, especially when a foreign key primary key relationship exists (one matching row in inner table for each outer row).
- Important caveat: a nested-loop intuition works even when there isn’t a strict PK/FK relationship; if multiple matches exist for a given key, multiple rows in the result should appear (e.g., multiple skills with the same SkillID).
PK/FK relationships and unary (self) joins:
- When there is a PK/FK relationship, joins are straightforward to reason about (one-to-many or one-to-one).
- Self-join (special case): join a single table to itself by creating two copies with different aliases. Used for unary relationships where two roles exist within the same entity (e.g., employees and their managers).
- Example: Employee table with manager SSN; E = employee as an employee, S = employee as a supervisor.
- Join condition: E.ManagerSSN = S.SSN.
- SELECT: E.FirstName, E.LastName, S.FirstName, S.LastName (clearly qualify columns with table aliases).
- Unary one-to-many relationship example: a department manager and employees in that department; manager is a person who also appears in the same employee entity.
- In self-joins, you typically present two copies of the same table to capture different roles; you must clearly qualify which table’s columns you’re selecting.
Join types: inner vs. outer and their behavior
- Inner join (most common):
- Output contains only rows with matches in both tables.
- Non-matching rows from either side are discarded.
- Example observation from a student-skills join: if a student has a SkillID that doesn’t exist in Skills, or if a student has SkillID NULL, those records do not appear in the inner join result.
- Syntax (explicit):
- FROM People P JOIN Skill S ON P.SkillID = S.SkillID
- SELECT P.FirstName, P.LastName, S.SkillName
- Role of column aliasing: when a column appears in both tables (e.g., SkillID), you must qualify with the table alias (P.SkillID vs S.SkillID).
- Left join (left outer join):
- Keeps all rows from the left table, and fills with NULLs where there is no match in the right table.
- Example: If a person has a SkillID that doesn’t exist in Skills, a left join will still include the person, with SkillName as NULL.
- Right join (right outer join):
- Keeps all rows from the right table, filling with NULLs where there’s no match on the left.
- Outer join (general concept):
- Retains all records from one or both sides, with non-matching side values shown as NULL.
- Practical takeaway: inner join returns the intersection; outer joins retain more data but may introduce NULLs for non-matching sides.
Explicit vs. implicit joins
- Explicit join (preferred): uses the JOIN keyword and ON clause to specify the joining condition.
- Example:
FROM Employee E JOIN Employee_Supervisor S ON E.ManagerSSN = S.SSN. - Implicit join (older style): lists multiple tables in the FROM clause and uses a WHERE clause to specify the join condition.
- Example:
FROM Employee E, Employee_Supervisor S WHERE E.ManagerSSN = S.SSN. - Important note: both approaches can yield the same results for inner joins, but explicit JOIN syntax is clearer and preferable, especially when dealing with left/right joins and complex queries.
Self-join example (detailed): two copies of the same table to show two roles (employee and supervisor)
- Tables: Employee E and Supervisor S (both derived from the same Employee table).
- Join condition: E.ManagerSSN = S.SSN.
- SELECT: E.FirstName, E.LastName, S.FirstName, S.LastName.
- This illustrates how a single entity can be represented in multiple roles using a self-join.
Multi-table joins and data integration across departments, employees, projects
- One-to-many example: Department to Employee (department number is a foreign key in Employee; Department has primary key DepartmentNumber).
- Many-to-many example: Employee to Project via an intersection/association table Works_On (EmployeeSSN, ProjectNumber).
- To retrieve department name along with employee info, you join Employee with Department on DepartmentNumber.
- To retrieve employee info along with project names, you join Employee with WorksOn and then with Project on appropriate keys (employee to workson via EmployeeSSN; works_on to project via ProjectNumber).
- Practical lesson: when information spans multiple entities, you build an intermediate result by joining several tables, then you can run standard single-table queries (or aggregations) on that intermediate result.
Example: finding employees with department names
- Problem: Employee table has department number; Department table has department name.
- Solution: join Employee E with Department D on E.DepartmentNumber = D.DepartmentNumber.
- Optional: apply a filter on DepartmentName (e.g., DepartmentName = 'Research') and compute aggregates on Salary if needed.
Aggregates and grouping with joins
- Typical use-case: find statistics about salaries per department, combining Employee with Department.
- Example queries:
- Retrieve max, min, and average salaries for employees in each department:
- SELECT D.DepartmentName, MAX(E.Salary) AS MaxSalary, MIN(E.Salary) AS MinSalary, AVG(E.Salary) AS AvgSalary
- FROM Employee E JOIN Department D ON E.DepartmentNumber = D.DepartmentNumber
- GROUP BY D.DepartmentName;
- Condition on aggregates (HAVING): e.g., departments with average salary > 60,000:
- HAVING AVG(E.Salary) > 60{,}000
- Represented in LaTeX: ext{AVG}(E.Salary) > 60{,}000
- Example: team size of projects via Works_On and Project tables
- Works_On is an intersection table between Employee and Project.
- To find projects with more than two employees: join Works_On with Project (to get project names) and group by project to count employees.
- SQL concept: group by project, count, then HAVING count(*) > 2.
- Important: to display project names, you join WorksOn with Project on WorksOn.ProjectNumber = Project.ProjectNumber.
Complex business questions: ordering results across entities
- Example: list employees and projects ordered by department name, with department name (department), employee name (employee), and project name (project).
- Data relationships: one-to-many between Department and Employee; many-to-many between Employee and Project via Works_On.
- Query strategy: join Department-Employee and Employee-Projects via Works_On and Project to produce a combined result; then apply ORDER BY DepartmentName, EmployeeLastName, ProjectName (as needed).
- Note: In such queries, you may omit table qualifiers in SELECT when there is no ambiguity, but explicit qualification is safer to avoid confusion.
Understanding table relationships and query construction
- To join Department and Employee: DepartmentNumber is the foreign key in Employee; DepartmentNumber is the primary key in Department.
- To join Employee and Project: use the Works_On intersection table, connecting Employee via EmployeeSSN and Project via ProjectNumber.
- The chain can extend to four tables (Department → Employee → Works_On → Project) to retrieve department name, employee names, and project names in a single joined result.
- The general rule: understand the logical entities and relationships first, then map them to tables and join conditions.
Unions and nested queries
- Union operator basics:
- Combine results from two queries with the same number of columns and compatible data types.
- UNION returns unique rows; UNION ALL retains duplicates.
- Example: cities where there is either a customer or a branch:
- SELECT City FROM Customer
- UNION
- SELECT City FROM Branch
- Nested queries (subqueries):
- Subqueries can be nested inside WHERE clauses or used in SELECT lists.
- Common operators used with subqueries: IN, NOT IN, EXISTS, NOT EXISTS, LIKE, UNION, JOIN, etc.
- Example scenario ( Saul Goodman deposits ):
- Subquery to identify branches where Saul Goodman has a deposit, then outer query to find other customers with deposits in those branches.
- This is a two-step process: first compute a set of branches, then filter the deposits table using that set to produce customer names.
- IN vs NOT IN example:
- IN finds values in a set produced by a subquery; NOT IN excludes values in that set.
- Example: customers who have a deposit in the Mesa branch and do not have a loan there.
- EXISTS operator example:
- EXISTS checks whether a subquery returns any row for a given outer row; if yes, the outer row is kept.
- Correlated subqueries (most complex in the class):
- Outer query references a column from the outer table inside the subquery.
- Example described: for each employee E, compute the maximum salary in E's department from an inner table E2 where E.DepartmentNumber = E2.DepartmentNumber; if E.Salary equals that maximum, then include the employee.
- Mechanism described: outer loop over Employee E; inner subquery fixes E.DepartmentNumber and computes a max salary from E2 with that department, returning a scalar; compare with E.Salary to decide whether to keep the row.
- Conceptual takeaway: the correlated subquery returns a value dependent on the outer row, enabling per-row comparisons to a department-wide or group-wide metric.
- High-level intuition from the lecturer: if you could precompute a per-department maximum salary and attach it to each employee, the problem could be reduced to a simple comparison; subqueries provide that dynamic computation when needed.
Execution plans and practical notes
- SQL engines are highly optimized and decide the actual sequence of operations; typical high-level flow is:
- Joins are processed first to create intermediate results, then filtering (WHERE), then grouping (GROUP BY), then aggregations (HAVING), then projection (SELECT).
- In practice, the engine may rearrange operations for performance based on statistics and indexes; the exact plan is shown in the execution plan.
- Key practical tip: having explicit PK-FK relationships makes understanding joins easier and often improves performance due to better optimizer decisions.
- Inner join is a natural tool for subset selection because it yields only the overlapping part of two relations.
Business questions, data granularity, and the role of group by
- Business questions typically start with a question, then data is collected, and the appropriate level of granularity is chosen.
- Granularity is controlled by GROUP BY: choosing what to group on determines whether you compute metrics at the transaction level, customer level, store level, department level, etc.
- Aggregations (SUM, MAX, MIN, AVG, COUNT, etc.) provide the statistics needed to answer business questions.
Practical implications and takeaways
- Normalization reduces redundancy and maintains data integrity but requires joins to reconstruct complete information.
- When information is distributed across multiple tables, you must design appropriate joins to retrieve a complete picture.
- The choice between inner, left, and right joins depends on whether you need only matching data (inner) or want to preserve all data from one side (outer joins).
- Self-joins enable modeling of hierarchical or role-based relationships within the same entity.
- Subqueries (nested queries) offer powerful ways to compute per-row metrics or to express existence and set-based criteria, including correlated subqueries for per-row comparisons.
Connections to real-world relevance
- Normalized schemas reflect how real systems store data (employees, departments, projects, customers, branches, deposits, borrowers, etc.).
- Typical business queries involve joining several tables to assemble a complete picture (e.g., department name, employee name, and project name).
- Understanding which keys link tables (PK/FK) and recognizing when to use which join type are critical for writing correct and efficient SQL.
Ethical, philosophical, and practical implications discussed in context
- Data integrity and consistency: normalization reduces anomalies; improper joins can yield incorrect results or duplicates.
- Data privacy and access control: retrieving multi-table results may expose sensitive information; careful permission checks and least-privilege access are essential when performing joins across tables containing personal data.
- Practicality and maintainability: clear explicit joins and well-documented schemas improve readability and maintainability, important for collaboration and long-term data stewardship.
Quick reference: notable examples and formulas from the lecture
- Inner join subset example: retrieve employee first name, last name, and department name by joining Employee and Department on DepartmentNumber.
- Aggregate example: per-department salary statistics:
- ext{AVG}( ext{Salary}) > 60{,}000
- Project team size example: join Works_On with Project, group by ProjectNumber, and count() to get team size; filter with HAVING count() > 2.
- Cross-entity ordering example: list department name, employee name, and project name ordered by department name, then by employee name.
- Subquery example (Saul Goodman): two-step approach where inner subquery identifies branches where Saul Goodman deposited, then outer query finds other customers with deposits in those branches.
- NOT IN vs IN example: two-step approach to identify deposits in a branch while excluding customers who have a loan in that same branch.
- EXISTS example: identify customers who live in cities where there is at least one branch.
Summary takeaway
- Joins are the primary mechanism to combine related data spread across multiple tables.
- Inner joins extract the intersection of related data; outer joins preserve data from one side with NULLs where there is no match.
- Self-joins enable modeling multiple roles within the same entity.
- Subqueries (including correlated ones) enable per-row computations and complex filtering criteria.
- Understanding the schema (tables and relationships) is essential to design effective queries and to reason about results.
Final note on next steps
- Next class: demonstration of multiple-table queries with live execution plans and hands-on exercises to reinforce the concepts discussed here.