Normalisation & SQL Join - W5
Week 5 Lecture: Normalisation & SQL Join
Introduction
- Lecturer: Bin Huang
- Online attendees: Mute microphone, use chat for questions, turn off video if internet is poor.
- In-person attendees: Do not attend if you have any symptoms.
Acknowledgement of Country
- UNSW Business School acknowledges the Bidjigal (Kensington campus) and Gadigal (City campus) as the traditional custodians of the lands.
- Acknowledgement of Aboriginal and Torres Strait Islander Elders, past and present, and their communities.
- Recognition of their ongoing leadership and contributions to business, education, and industry.
Database Design Process
- Overview of data vs. information.
- Data stored in databases.
- Database management system (DBMS).
- Database design defines database structure.
- Conceptual Model:
- Entity-Relationship Modelling technique.
- Chen’s notation for high-level conceptual model.
- Crow’s Foot as the design standard.
- Entity type and entity instance.
- Attribute and value.
- Relationship (Degree, Connectivity, Cardinality).
- Logical Model:
- Converting Conceptual model to detailed Logical Model (using Crow’s Foot).
- Advanced topics:
- Relationship strength.
- Composite entity.
- Relationship degree.
- Supertype and Subtype.
- Selecting Primary Key.
- Relational Model:
- Convert ER model to a set of tables (relations) in the relational model.
- Apply Normalisation on the relations to remove any anomalies.
- SQL Data Definition & Manipulation Language:
- Use Relational Model to implement database by creating a table for each normalised relations.
- Data Definition Language defines the tables.
- Data Manipulation Language queries/updates the tables.
Normalisation and SQL Join Overview
- Normalisation.
- Functional Dependencies.
- Normal Forms:
- SQL Join to combine normalised tables.
What is Normalisation?
- The process of correcting table structures to minimise data redundancies.
- Why do we need Normalisation?
- To analyse the relationship among the attributes within each entity.
- To determine if the data structure can be improved to achieve a well-structured relational model that:
- contains minimal data redundancy
- allows users to insert, delete, and update rows without causing data inconsistencies and anomalies
- has some desirable (“good”) properties in the data structure
- A technique to define “goodness” (or “badness”) of a relation.
- Converting a relation to a standard normal form in stages.
- Minimising or even eliminating redundancy, i.e., duplication of data.
- Decomposing a relation/table into smaller components while still capturing the precise content of the original relation/table.
Redundancy
- Data about one entity is recorded more than once in a database.
- To be reduced to save space and prevent problems, i.e., database should not store same data several times
- Redundancies could induce anomalies and should be minimised by normalisation.
- One solution: Constraints can prevent anomalies, but difficult to get constraints right!
- With redundancy in DB, anomalies may arise …
- Insertion Anomaly: adding new rows forces user to create duplicate data
- Deletion Anomaly: deleting rows may cause a loss of data that would be needed for other future rows
- Modification (Update) Anomaly: changing data in a row forces changes to other rows because of duplication
- Better solution: NORMALISATION - database design should minimize redundancies, so that we are not dependent on constraints to prevent anomalies
- A Normal Form is a state of a relation determined by applying rules on Functional Dependency
- Normalisation (more and more normalised)
- De-normalisation (less and less normalised)
- Normal Form Characteristics
- First normal form (1NF): Table format, no repeating groups, and PK identified
- Second normal form (2NF): 1NF and no partial functional dependencies
- Third normal form (3NF): 2NF and no transitive functional dependencies
- Boyce-Codd normal form (BCNF): 3NF and every determinant is a candidate key (special case of 3NF)
- Fourth normal form (4NF): BCNF and no independent multivalued dependencies (not covered)
- Functional Dependency (FD): semantic restriction that some values for a relation are not possible in reality
- relationships between attributes in a relation
- the semantics of the attributes in a relation
- can be inferred in a systematic way by applying a set of inference rules
Armstrong's Axioms
- Attribute B “functionally depends” on an attribute A
- Attribute A determines attribute B
- “If I know the value of A, then I know the value of B”
- Inclusion (Reflexive) rule: if Y ⊆ X, then X → Y (⊆: the notation of subset)
- If zID ⊆ {zID, LastName}, then {zID, LastName} → zID
- If zID is a part of the attribute set {zID, LastName}, then {zID, LastName} determines zID
- Augmentation rule: if X → Y, then W, X → {W, Y}
- If zID → LastName, then {zID, FirstName} → {LastName, FirstName}
- Transitivity rule: if X → Y and Y → Z, then X → Z
- If zID → MobileNumber and MobileNumber → LastName, then zID → LastName
- Determinant: Any attribute(s) whose value determines other values within a row
- Dependant: Attribute whose value is determined by a determinant
- Functional dependence:
- The attribute B is fully functionally dependent on the attribute A, if each value of A determines one and only one value of B.
- Fully functional dependence (composite key):
- If attribute B is functionally dependent on a composite key A but not on any subset of A, the attribute B is fully functionally dependent on A.
- Functional dependence (Generalised definition):
- Attribute A determines attribute B (that is, B is functionally dependent on A) if all (generalized definition) of the rows in the table that agree in value for attribute A also agree in value for attribute B.
- In plain English…
- If two or more rows in a table have the same value of A, they must also have the same value of B, i.e., knowing the value of A guarantees the value of B.
- If A is the same, but B is different in some rows, functional dependence does not exist.
Types of Functional Dependencies
- Partial dependency
- Tends to be straight-forward and easy to identify
- Determinant is only part of primary key or candidate key
- Dependant is a non-prime attribute
- E.g., {A, B} -> {C, D}; B -> C; {A, B} is primary key and the only candidate key
- Functional dependency B -> C is a partial dependency because only part of the primary key {B} is needed to determine the value of C
- Transitive dependency
- X -> Y; Y -> Z; X is primary key
- X determines the value of Z via Y, i.e., X -> Z
- More difficult to identify among a set of data
- Only when functional dependencies exist among non-prime attributes that that are not part of any candidate key
- E.g., if Suburb -> Postcode and Postcode -> State then Suburb -> State
- What is non-prime attribute?
- Not part of any candidate key
- Cannot determine the record in any way
- a.k.a, non-key field
Normalisation
- Normalisation is to decompose tables into well-formed tables and remove dependencies
- Look at one relation at a time
- Identify the dependencies of a relation (table)
- Progressively decompose tables into new set of tables using inferences rules (what attribute can infer what attribute?)
- Reduce size and redundancy
- Maintain lossless join property, i.e., decomposed components could be joined back to the original table
Case Study: Construction Company Reports
- Summary
- Building project has: Project number, Name, Employees assigned to the project.
- Employee has: Employee number, Name, Job classification.
- The company charges its clients by billing the hours spent on each project.
- The hourly billing rate is dependent on the employee’s position.
- Issues?
- Many multivalued attributes
- Trying to fit multiple employees' data into one row for one project
- Many data redundancies, e.g., employee data JOBCLASS, CHARGEHOUR is in many rows because employees can work on multiple projects
- Data inconsistencies, e.g. “Database Designer” or “DB Designer”
- All above could lead to insertion, deletion, update anomalies
- Creating a valid relation
- 1NF requirements
- PK attributes are defined and not NULL, i.e., valid PK
- All attributes are dependent on the primary key
- There are no repeating groups in the table, i.e., each cell of the table contains one and only one value
- Repeating group: a group of multiple entries of 1 or more types exist for a single key attribute’s value (e.g., fit multiple employees' data and multiple locations’ data into the same row of one project)
- This could be a few multi-valued attributes combined to form repeating groups
- Eliminating repeating groups ensure that all attributes contain only one value, i.e., no multivalued attributes
- Steps to reach 1NF
- Eliminate the repeating groups
- Split repeating groups of data to multiple rows
- Hence multivalued attributes are converted into single-valued attributes
- Identify the Primary Key and add appropriate entry in at least the PK column(s) to avoid null PK
Eliminating Repeating Groups & Identifying Primary Key
- Split repeating groups of data on one row: each single project number (PROJ_NUM) occurrence can reference a group of data of the same employee type (employee number, employee name, job classification, and charge per hour), due to a combination of multi-valued attributes
- Change the table from project level to assignment level for each employee assigned to a project
- In this case, create additional rows for each assignment (for one employee on one project)
- All attributes are dependent on PROJNUM + EMPNUM, hence composite PK: {Proj_Num, Emp_Num}
- Add appropriate entry in these columns, if any empty values, to avoid null PK
- 2NF requirements
- No partial dependencies, i.e., each non-prime attribute is functionally dependent only on the entire PK or other entire Candidate Keys (if any), and shall not be dependent on only a portion of PK or CK
- The relation/table must be already in 1NF
- Steps to reach 2NF
- Identify partial dependencies, and assign corresponding dependent attributes
- Make new tables by separating the data items into a separate relation using appropriate PKs
- The new tables after the split may require composite entity to bridge the tables
Identifying & Removing Partial Dependencies
- Attributes are dependent on part of a PK, or part of composite PK
- In this case
- Project number determines project name
- Employee number determines employee data, including name, job class, charge per hour
- Remove partial dependencies by creating new tables
- The partial dependency of empnum, empname, jobclass and chghour can become another table, with emp_num as the PK.
- The partial dependency of projnum and projname can become a table, with proj_num as the PK.
- Create new table(s) to capture the relationship between the new tables after the split, in this case:
- A project can contain multiple employees. An employee can be working on multiple projects concurrently. This is a many-to-many relationship.
- Recall that in ERD, many-to-many relationships shall be decomposed to two 1:M relationships
- Hence a composite (bridge) entity must be used to capture the employees assigned to each project
- Find the right tables for remaining attributes outside the partial dependency, in this case:
- The new assignment entity is the composite entity bridging Project and Employee, and can include HOURS as an additional attribute, specific to the relationship between 1 project and 1 employee
- Attribute name changes from HOURS to ASSIGN_HOURS to make it more meaningful
- Having remaining attributes that are not in any partial dependency but are dependant on the original composite PK could suggest the need of a new entity with that composite PK
- 3NF requirements
- no transitive dependencies
- In simple terms, non-prime attributes being dependent on other non-prime attributes does not exist
- The relation/table must be already in 2NF
- Steps to reach 3NF
- Identify transitive dependencies
- Make new tables to eliminate all transitive dependencies, and reassign corresponding dependent attributes
Identifying & Removing Transitive Dependencies
- Non-prime attribute determined by other non-prime attributes
- In this case
- Job classification determines charge per hour
- Job classification is not in any candidate key
- i.e., Job classification cannot help determine other values in the row in any way
- Make new tables and reassign corresponding dependent attributes
- For every transitive dependency, put determinant in a new table as PK; place the dependent attributes in the same table as the determinant; remove the dependents from original table
- In this case, we put JOBCLASS in a new table JOB, place CHGHOUR in the new table, and remove it from original table
- How can Charge Per Hour be linked to an employee?
- CHGHOUR is dependent on JOBCLASS in JOB table, and JOB_CLASS is associated with each employee
Recap: Dependency Diagram
- No repeating groups to meet 1NF requirement
- Remove partial dependency for 2NF
- Remove transitive dependency for 3NF
- Each table represents a single subject
- No data item will be unnecessarily stored in more than one table
- All non-prime attributes in a table are dependent on the primary key
- Each table is void of insertion, update, and deletion anomalies
- Ensures that all tables are in at least 3NF (rule of thumb)
Normalisation Impact
- Normalisation makes database well-structured but increases no. of tables
- Number of database tables grows when tables are decomposed during normalization process
- Ensures data is structured efficiently
- Minimizes redundancy, i.e., duplicate data, reducing unnecessary updates
- Removes unwanted dependency to avoid insertion, deletion, update anomalies
- Joining a larger number of tables takes additional processing logic and reduces system speed
- Less efficient updates since we need to modify multiple tables
SQL Joins
- With normalised tables, we need SQL Joins to combine tables for querying / reporting
- Types of Joins:
- Cross Join
- Inner Join
- Natural Join
- Full Outer Join
- Example Queries:
- What is the total charge on project 111?
- How much did employee Ankit earn in the last month from all projects?
- What is the total charge of Mary on Project 111?
- What is the project name with both Ankit and Mary on that project?
Cross Join (Cartesian Product)
- Select all possible combinations of tuples in R with tuples in S, “R ∗ S”
- “all possible tuple combinations of two relations”, “everything joined to everything”
- Explicit cross join:
SELECT * FROM R CROSS JOIN S; - Implicit cross join:
SELECT * FROM R, S;
Inner Join
- Returns combined tuples from two relations that have the same value for a defined attribute (match on the attribute)
- i.e., a Cross Join (Cartesian Product) with all tuples removed that do not match on the defined attribute.
- Most common join type, the default in most database systems
- Inner join is usually an equi join with a condition with equality operator (=).
- A theta join is when other comparison operators are used (
- Explicit inner join:
SELECT * FROM R INNER JOIN S ON R.City = S.City;
Inner Join vs. Natural Join
- Inner Join joins 2 tables using common columns mentioned in the ON clause, and find matching values
- A natural join joins tuples based on all attributes with identical names in the two relations, i.e., all common columns, and find matching values in each pair of common columns
- Example:
- Explicit inner join:
SELECT * FROM R INNER JOIN S ON R.City = S.City - Natural join:
SELECT * FROM R NATURAL JOIN S
Full Outer Join
- Joins tuples from two tables that match on a defined attribute
- If no match, the combined row will still appear with missing attributes as NULL (both tables are preserved)
- Examples:
SELECT * FROM R FULL OUTER JOIN S ON R.City = S.City; (gets both City columns from 2 tables)SELECT * FROM R FULL OUTER JOIN S USING(City); (only get one City column in the result)
Left Outer Join
- Left table is completely preserved
- If no match, the attributes from the right side will contain NULL values
- Examples:
SELECT * FROM R LEFT OUTER JOIN S ON R.City = S.City; (gets both City columns from 2 tables)SELECT * FROM R LEFT OUTER JOIN S USING(City); (only get one City column in the result)
Right Outer Join
- Right table is completely preserved
- If no match, the attributes from the left side will contain NULL values
- Examples:
SELECT * FROM R RIGHT OUTER JOIN S ON R.City = S.City; (gets both City columns from 2 tables)SELECT * FROM R RIGHT OUTER JOIN S USING(City); (only get one City column in the result)
Recap
- Normalisation
- Functional Dependencies
- Normal Forms
- SQL Join to combine normalised tables
- Cross join
- Inner join
- Full outer join
- Left outer join
- Right outer join
Next lecture
- Normalisation advanced
- Relational Algebra
- Lecturer: Bin Huang, bin.huang@unsw.edu.au
- The lecture recording will be made available in Moodle course site.