Normalisation & SQL Join - W5

Week 5 Lecture: Normalisation & SQL Join

Introduction

Lecturer: Bin Huang
Online attendees: Mute microphone, use chat for questions, turn off video if internet is poor.
In-person attendees: Do not attend if you have any symptoms.

Acknowledgement of Country

UNSW Business School acknowledges the Bidjigal (Kensington campus) and Gadigal (City campus) as the traditional custodians of the lands.
Acknowledgement of Aboriginal and Torres Strait Islander Elders, past and present, and their communities.
Recognition of their ongoing leadership and contributions to business, education, and industry.

Database Design Process

Overview of data vs. information.
Data stored in databases.
Database management system (DBMS).
Database design defines database structure.
Conceptual Model:
- Entity-Relationship Modelling technique.
- Chen’s notation for high-level conceptual model.
- Crow’s Foot as the design standard.
- Entity type and entity instance.
- Attribute and value.
- Relationship (Degree, Connectivity, Cardinality).
Logical Model:
- Converting Conceptual model to detailed Logical Model (using Crow’s Foot).
- Advanced topics:
  - Relationship strength.
  - Composite entity.
  - Relationship degree.
  - Supertype and Subtype.
  - Selecting Primary Key.
Relational Model:
- Convert ER model to a set of tables (relations) in the relational model.
- Apply Normalisation on the relations to remove any anomalies.
SQL Data Definition & Manipulation Language:
- Use Relational Model to implement database by creating a table for each normalised relations.
- Data Definition Language defines the tables.
- Data Manipulation Language queries/updates the tables.

Normalisation and SQL Join Overview

Normalisation.
Functional Dependencies.
Normal Forms:
- 1NF
- 2NF
- 3NF
- BCNF
SQL Join to combine normalised tables.

What is Normalisation?

The process of correcting table structures to minimise data redundancies.
Why do we need Normalisation?
- To analyse the relationship among the attributes within each entity.
- To determine if the data structure can be improved to achieve a well-structured relational model that:
  - contains minimal data redundancy
  - allows users to insert, delete, and update rows without causing data inconsistencies and anomalies
  - has some desirable (“good”) properties in the data structure
A technique to define “goodness” (or “badness”) of a relation.
Converting a relation to a standard normal form in stages.
Minimising or even eliminating redundancy, i.e., duplication of data.
Decomposing a relation/table into smaller components while still capturing the precise content of the original relation/table.

Redundancy

Data about one entity is recorded more than once in a database.
- To be reduced to save space and prevent problems, i.e., database should not store same data several times
Redundancies could induce anomalies and should be minimised by normalisation.
One solution: Constraints can prevent anomalies, but difficult to get constraints right!
With redundancy in DB, anomalies may arise …
- Insertion Anomaly: adding new rows forces user to create duplicate data
- Deletion Anomaly: deleting rows may cause a loss of data that would be needed for other future rows
- Modification (Update) Anomaly: changing data in a row forces changes to other rows because of duplication
Better solution: NORMALISATION - database design should minimize redundancies, so that we are not dependent on constraints to prevent anomalies

Normal Form

A Normal Form is a state of a relation determined by applying rules on Functional Dependency
Normalisation (more and more normalised)
De-normalisation (less and less normalised)
Normal Form Characteristics
- First normal form (1NF): Table format, no repeating groups, and PK identified
- Second normal form (2NF): 1NF and no partial functional dependencies
- Third normal form (3NF): 2NF and no transitive functional dependencies
- Boyce-Codd normal form (BCNF): 3NF and every determinant is a candidate key (special case of 3NF)
- Fourth normal form (4NF): BCNF and no independent multivalued dependencies (not covered)
Functional Dependency (FD): semantic restriction that some values for a relation are not possible in reality
- relationships between attributes in a relation
- the semantics of the attributes in a relation
- can be inferred in a systematic way by applying a set of inference rules

Armstrong's Axioms

Attribute B “functionally depends” on an attribute A
Attribute A determines attribute B
“If I know the value of A, then I know the value of B”
Inclusion (Reflexive) rule: if Y ⊆ X, then X → Y (⊆: the notation of subset)
- If zID ⊆ {zID, LastName}, then {zID, LastName} → zID
- If zID is a part of the attribute set {zID, LastName}, then {zID, LastName} determines zID
Augmentation rule: if X → Y, then W, X → {W, Y}
- If zID → LastName, then {zID, FirstName} → {LastName, FirstName}
Transitivity rule: if X → Y and Y → Z, then X → Z
- If zID → MobileNumber and MobileNumber → LastName, then zID → LastName

Functional Dependence's Formal Definitions

Determinant: Any attribute(s) whose value determines other values within a row
Dependant: Attribute whose value is determined by a determinant
Functional dependence:
- The attribute B is fully functionally dependent on the attribute A, if each value of A determines one and only one value of B.
Fully functional dependence (composite key):
- If attribute B is functionally dependent on a composite key A but not on any subset of A, the attribute B is fully functionally dependent on A.
Functional dependence (Generalised definition):
- Attribute A determines attribute B (that is, B is functionally dependent on A) if all (generalized definition) of the rows in the table that agree in value for attribute A also agree in value for attribute B.
In plain English…
- If two or more rows in a table have the same value of A, they must also have the same value of B, i.e., knowing the value of A guarantees the value of B.
- If A is the same, but B is different in some rows, functional dependence does not exist.

Types of Functional Dependencies

Partial dependency
- Tends to be straight-forward and easy to identify
- Determinant is only part of primary key or candidate key
- Dependant is a non-prime attribute
- E.g., {A, B} -> {C, D}; B -> C; {A, B} is primary key and the only candidate key
- Functional dependency B -> C is a partial dependency because only part of the primary key {B} is needed to determine the value of C
Transitive dependency
- X -> Y; Y -> Z; X is primary key
- X determines the value of Z via Y, i.e., X -> Z
- More difficult to identify among a set of data
- Only when functional dependencies exist among non-prime attributes that that are not part of any candidate key
- E.g., if Suburb -> Postcode and Postcode -> State then Suburb -> State
What is non-prime attribute?
- Not part of any candidate key
- Cannot determine the record in any way
- a.k.a, non-key field

Normalisation

Normalisation is to decompose tables into well-formed tables and remove dependencies
- Look at one relation at a time
- Identify the dependencies of a relation (table)
- Progressively decompose tables into new set of tables using inferences rules (what attribute can infer what attribute?)
  - Reduce size and redundancy
  - Maintain lossless join property, i.e., decomposed components could be joined back to the original table

Case Study: Construction Company Reports

Summary
- Building project has: Project number, Name, Employees assigned to the project.
- Employee has: Employee number, Name, Job classification.
- The company charges its clients by billing the hours spent on each project.
- The hourly billing rate is dependent on the employee’s position.
Issues?
- Many multivalued attributes
- Trying to fit multiple employees' data into one row for one project
- Many data redundancies, e.g., employee data JOBCLASS, CHARGEHOUR is in many rows because employees can work on multiple projects
- Data inconsistencies, e.g. “Database Designer” or “DB Designer”
- All above could lead to insertion, deletion, update anomalies

Conversion to First Normal Form (1NF)

Creating a valid relation
1NF requirements
- PK attributes are defined and not NULL, i.e., valid PK
- All attributes are dependent on the primary key
- There are no repeating groups in the table, i.e., each cell of the table contains one and only one value
  - Repeating group: a group of multiple entries of 1 or more types exist for a single key attribute’s value (e.g., fit multiple employees' data and multiple locations’ data into the same row of one project)
  - This could be a few multi-valued attributes combined to form repeating groups
  - Eliminating repeating groups ensure that all attributes contain only one value, i.e., no multivalued attributes
Steps to reach 1NF
- Eliminate the repeating groups
  - Split repeating groups of data to multiple rows
  - Hence multivalued attributes are converted into single-valued attributes
- Identify the Primary Key and add appropriate entry in at least the PK column(s) to avoid null PK

Eliminating Repeating Groups & Identifying Primary Key

Split repeating groups of data on one row: each single project number (PROJ_NUM) occurrence can reference a group of data of the same employee type (employee number, employee name, job classification, and charge per hour), due to a combination of multi-valued attributes
Change the table from project level to assignment level for each employee assigned to a project
In this case, create additional rows for each assignment (for one employee on one project)
All attributes are dependent on PROJNUM + EMPNUM, hence composite PK: {Proj_Num, Emp_Num}
Add appropriate entry in these columns, if any empty values, to avoid null PK

Converting 1NF to Second Normal Form (2NF)

2NF requirements
- No partial dependencies, i.e., each non-prime attribute is functionally dependent only on the entire PK or other entire Candidate Keys (if any), and shall not be dependent on only a portion of PK or CK
- The relation/table must be already in 1NF
Steps to reach 2NF
- Identify partial dependencies, and assign corresponding dependent attributes
- Make new tables by separating the data items into a separate relation using appropriate PKs
  - The new tables after the split may require composite entity to bridge the tables

Identifying & Removing Partial Dependencies

Attributes are dependent on part of a PK, or part of composite PK
In this case
- Project number determines project name
- Employee number determines employee data, including name, job class, charge per hour
Remove partial dependencies by creating new tables
- The partial dependency of empnum, empname, jobclass and chghour can become another table, with emp_num as the PK.
- The partial dependency of projnum and projname can become a table, with proj_num as the PK.
Create new table(s) to capture the relationship between the new tables after the split, in this case:
- A project can contain multiple employees. An employee can be working on multiple projects concurrently. This is a many-to-many relationship.
- Recall that in ERD, many-to-many relationships shall be decomposed to two 1:M relationships
- Hence a composite (bridge) entity must be used to capture the employees assigned to each project
Find the right tables for remaining attributes outside the partial dependency, in this case:
- The new assignment entity is the composite entity bridging Project and Employee, and can include HOURS as an additional attribute, specific to the relationship between 1 project and 1 employee
- Attribute name changes from HOURS to ASSIGN_HOURS to make it more meaningful
Having remaining attributes that are not in any partial dependency but are dependant on the original composite PK could suggest the need of a new entity with that composite PK

Converting 2NF to Third Normal Form (3NF)

3NF requirements
- no transitive dependencies
- In simple terms, non-prime attributes being dependent on other non-prime attributes does not exist
- The relation/table must be already in 2NF
Steps to reach 3NF
- Identify transitive dependencies
- Make new tables to eliminate all transitive dependencies, and reassign corresponding dependent attributes

Identifying & Removing Transitive Dependencies

Non-prime attribute determined by other non-prime attributes
In this case
- Job classification determines charge per hour
- Job classification is not in any candidate key
- i.e., Job classification cannot help determine other values in the row in any way
Make new tables and reassign corresponding dependent attributes
- For every transitive dependency, put determinant in a new table as PK; place the dependent attributes in the same table as the determinant; remove the dependents from original table
- In this case, we put JOBCLASS in a new table JOB, place CHGHOUR in the new table, and remove it from original table
- How can Charge Per Hour be linked to an employee?
- CHGHOUR is dependent on JOBCLASS in JOB table, and JOB_CLASS is associated with each employee

Recap: Dependency Diagram

No repeating groups to meet 1NF requirement
Remove partial dependency for 2NF
Remove transitive dependency for 3NF

Normalisation Ensures Well-Formed Tables

Each table represents a single subject
No data item will be unnecessarily stored in more than one table
All non-prime attributes in a table are dependent on the primary key
Each table is void of insertion, update, and deletion anomalies
Ensures that all tables are in at least 3NF (rule of thumb)

Normalisation Impact

Normalisation makes database well-structured but increases no. of tables
- Number of database tables grows when tables are decomposed during normalization process
- Ensures data is structured efficiently
- Minimizes redundancy, i.e., duplicate data, reducing unnecessary updates
- Removes unwanted dependency to avoid insertion, deletion, update anomalies
Joining a larger number of tables takes additional processing logic and reduces system speed
Less efficient updates since we need to modify multiple tables

SQL Joins

With normalised tables, we need SQL Joins to combine tables for querying / reporting
Types of Joins:
- Cross Join
- Inner Join
- Natural Join
- Full Outer Join
Example Queries:
- What is the total charge on project 111?
- How much did employee Ankit earn in the last month from all projects?
- What is the total charge of Mary on Project 111?
- What is the project name with both Ankit and Mary on that project?

Cross Join (Cartesian Product)

Select all possible combinations of tuples in R with tuples in S, “R ∗ S”
“all possible tuple combinations of two relations”, “everything joined to everything”
Explicit cross join: SELECT * FROM R CROSS JOIN S;
Implicit cross join: SELECT * FROM R, S;

Inner Join

Returns combined tuples from two relations that have the same value for a defined attribute (match on the attribute)
i.e., a Cross Join (Cartesian Product) with all tuples removed that do not match on the defined attribute.
Most common join type, the default in most database systems
Inner join is usually an equi join with a condition with equality operator (=).
A theta join is when other comparison operators are used (
Explicit inner join: SELECT * FROM R INNER JOIN S ON R.City = S.City;

Inner Join vs. Natural Join

Inner Join joins 2 tables using common columns mentioned in the ON clause, and find matching values
A natural join joins tuples based on all attributes with identical names in the two relations, i.e., all common columns, and find matching values in each pair of common columns
Example:
- Explicit inner join: SELECT * FROM R INNER JOIN S ON R.City = S.City
- Natural join: SELECT * FROM R NATURAL JOIN S

Full Outer Join

Joins tuples from two tables that match on a defined attribute
If no match, the combined row will still appear with missing attributes as NULL (both tables are preserved)
Examples:
- SELECT * FROM R FULL OUTER JOIN S ON R.City = S.City; (gets both City columns from 2 tables)
- SELECT * FROM R FULL OUTER JOIN S USING(City); (only get one City column in the result)

Left Outer Join

Left table is completely preserved
If no match, the attributes from the right side will contain NULL values
Examples:
- SELECT * FROM R LEFT OUTER JOIN S ON R.City = S.City; (gets both City columns from 2 tables)
- SELECT * FROM R LEFT OUTER JOIN S USING(City); (only get one City column in the result)

Right Outer Join

Right table is completely preserved
If no match, the attributes from the left side will contain NULL values
Examples:
- SELECT * FROM R RIGHT OUTER JOIN S ON R.City = S.City; (gets both City columns from 2 tables)
- SELECT * FROM R RIGHT OUTER JOIN S USING(City); (only get one City column in the result)

Recap

Normalisation
Functional Dependencies
Normal Forms
- 1NF
- 2NF
- 3NF
- BCNF
SQL Join to combine normalised tables
- Cross join
- Inner join
- Full outer join
- Left outer join
- Right outer join

Next lecture

Normalisation advanced
Relational Algebra

Contact

Lecturer: Bin Huang, bin.huang@unsw.edu.au
The lecture recording will be made available in Moodle course site.