Normalisation & SQL Join - W5

Week 5 Lecture: Normalisation & SQL Join

Introduction

  • Lecturer: Bin Huang
  • Online attendees: Mute microphone, use chat for questions, turn off video if internet is poor.
  • In-person attendees: Do not attend if you have any symptoms.

Acknowledgement of Country

  • UNSW Business School acknowledges the Bidjigal (Kensington campus) and Gadigal (City campus) as the traditional custodians of the lands.
  • Acknowledgement of Aboriginal and Torres Strait Islander Elders, past and present, and their communities.
  • Recognition of their ongoing leadership and contributions to business, education, and industry.

Database Design Process

  • Overview of data vs. information.
  • Data stored in databases.
  • Database management system (DBMS).
  • Database design defines database structure.
  • Conceptual Model:
    • Entity-Relationship Modelling technique.
    • Chen’s notation for high-level conceptual model.
    • Crow’s Foot as the design standard.
    • Entity type and entity instance.
    • Attribute and value.
    • Relationship (Degree, Connectivity, Cardinality).
  • Logical Model:
    • Converting Conceptual model to detailed Logical Model (using Crow’s Foot).
    • Advanced topics:
      • Relationship strength.
      • Composite entity.
      • Relationship degree.
      • Supertype and Subtype.
      • Selecting Primary Key.
  • Relational Model:
    • Convert ER model to a set of tables (relations) in the relational model.
    • Apply Normalisation on the relations to remove any anomalies.
  • SQL Data Definition & Manipulation Language:
    • Use Relational Model to implement database by creating a table for each normalised relations.
    • Data Definition Language defines the tables.
    • Data Manipulation Language queries/updates the tables.

Normalisation and SQL Join Overview

  • Normalisation.
  • Functional Dependencies.
  • Normal Forms:
    • 1NF
    • 2NF
    • 3NF
    • BCNF
  • SQL Join to combine normalised tables.

What is Normalisation?

  • The process of correcting table structures to minimise data redundancies.
  • Why do we need Normalisation?
    • To analyse the relationship among the attributes within each entity.
    • To determine if the data structure can be improved to achieve a well-structured relational model that:
      • contains minimal data redundancy
      • allows users to insert, delete, and update rows without causing data inconsistencies and anomalies
      • has some desirable (“good”) properties in the data structure
  • A technique to define “goodness” (or “badness”) of a relation.
  • Converting a relation to a standard normal form in stages.
  • Minimising or even eliminating redundancy, i.e., duplication of data.
  • Decomposing a relation/table into smaller components while still capturing the precise content of the original relation/table.

Redundancy

  • Data about one entity is recorded more than once in a database.
    • To be reduced to save space and prevent problems, i.e., database should not store same data several times
  • Redundancies could induce anomalies and should be minimised by normalisation.
  • One solution: Constraints can prevent anomalies, but difficult to get constraints right!
  • With redundancy in DB, anomalies may arise …
    • Insertion Anomaly: adding new rows forces user to create duplicate data
    • Deletion Anomaly: deleting rows may cause a loss of data that would be needed for other future rows
    • Modification (Update) Anomaly: changing data in a row forces changes to other rows because of duplication
  • Better solution: NORMALISATION - database design should minimize redundancies, so that we are not dependent on constraints to prevent anomalies

Normal Form

  • A Normal Form is a state of a relation determined by applying rules on Functional Dependency
  • Normalisation (more and more normalised)
  • De-normalisation (less and less normalised)
  • Normal Form Characteristics
    • First normal form (1NF): Table format, no repeating groups, and PK identified
    • Second normal form (2NF): 1NF and no partial functional dependencies
    • Third normal form (3NF): 2NF and no transitive functional dependencies
    • Boyce-Codd normal form (BCNF): 3NF and every determinant is a candidate key (special case of 3NF)
    • Fourth normal form (4NF): BCNF and no independent multivalued dependencies (not covered)
  • Functional Dependency (FD): semantic restriction that some values for a relation are not possible in reality
    • relationships between attributes in a relation
    • the semantics of the attributes in a relation
    • can be inferred in a systematic way by applying a set of inference rules

Armstrong's Axioms

  • Attribute B “functionally depends” on an attribute A
  • Attribute A determines attribute B
  • “If I know the value of A, then I know the value of B”
  • Inclusion (Reflexive) rule: if Y ⊆ X, then X → Y (⊆: the notation of subset)
    • If zID ⊆ {zID, LastName}, then {zID, LastName} → zID
    • If zID is a part of the attribute set {zID, LastName}, then {zID, LastName} determines zID
  • Augmentation rule: if X → Y, then W, X → {W, Y}
    • If zID → LastName, then {zID, FirstName} → {LastName, FirstName}
  • Transitivity rule: if X → Y and Y → Z, then X → Z
    • If zID → MobileNumber and MobileNumber → LastName, then zID → LastName

Functional Dependence's Formal Definitions

  • Determinant: Any attribute(s) whose value determines other values within a row
  • Dependant: Attribute whose value is determined by a determinant
  • Functional dependence:
    • The attribute B is fully functionally dependent on the attribute A, if each value of A determines one and only one value of B.
  • Fully functional dependence (composite key):
    • If attribute B is functionally dependent on a composite key A but not on any subset of A, the attribute B is fully functionally dependent on A.
  • Functional dependence (Generalised definition):
    • Attribute A determines attribute B (that is, B is functionally dependent on A) if all (generalized definition) of the rows in the table that agree in value for attribute A also agree in value for attribute B.
  • In plain English…
    • If two or more rows in a table have the same value of A, they must also have the same value of B, i.e., knowing the value of A guarantees the value of B.
    • If A is the same, but B is different in some rows, functional dependence does not exist.

Types of Functional Dependencies

  • Partial dependency
    • Tends to be straight-forward and easy to identify
    • Determinant is only part of primary key or candidate key
    • Dependant is a non-prime attribute
    • E.g., {A, B} -> {C, D}; B -> C; {A, B} is primary key and the only candidate key
    • Functional dependency B -> C is a partial dependency because only part of the primary key {B} is needed to determine the value of C
  • Transitive dependency
    • X -> Y; Y -> Z; X is primary key
    • X determines the value of Z via Y, i.e., X -> Z
    • More difficult to identify among a set of data
    • Only when functional dependencies exist among non-prime attributes that that are not part of any candidate key
    • E.g., if Suburb -> Postcode and Postcode -> State then Suburb -> State
  • What is non-prime attribute?
    • Not part of any candidate key
    • Cannot determine the record in any way
    • a.k.a, non-key field

Normalisation

  • Normalisation is to decompose tables into well-formed tables and remove dependencies
    • Look at one relation at a time
    • Identify the dependencies of a relation (table)
    • Progressively decompose tables into new set of tables using inferences rules (what attribute can infer what attribute?)
      • Reduce size and redundancy
      • Maintain lossless join property, i.e., decomposed components could be joined back to the original table

Case Study: Construction Company Reports

  • Summary
    • Building project has: Project number, Name, Employees assigned to the project.
    • Employee has: Employee number, Name, Job classification.
    • The company charges its clients by billing the hours spent on each project.
    • The hourly billing rate is dependent on the employee’s position.
  • Issues?
    • Many multivalued attributes
    • Trying to fit multiple employees' data into one row for one project
    • Many data redundancies, e.g., employee data JOBCLASS, CHARGEHOUR is in many rows because employees can work on multiple projects
    • Data inconsistencies, e.g. “Database Designer” or “DB Designer”
    • All above could lead to insertion, deletion, update anomalies

Conversion to First Normal Form (1NF)

  • Creating a valid relation
  • 1NF requirements
    • PK attributes are defined and not NULL, i.e., valid PK
    • All attributes are dependent on the primary key
    • There are no repeating groups in the table, i.e., each cell of the table contains one and only one value
      • Repeating group: a group of multiple entries of 1 or more types exist for a single key attribute’s value (e.g., fit multiple employees' data and multiple locations’ data into the same row of one project)
      • This could be a few multi-valued attributes combined to form repeating groups
      • Eliminating repeating groups ensure that all attributes contain only one value, i.e., no multivalued attributes
  • Steps to reach 1NF
    • Eliminate the repeating groups
      • Split repeating groups of data to multiple rows
      • Hence multivalued attributes are converted into single-valued attributes
    • Identify the Primary Key and add appropriate entry in at least the PK column(s) to avoid null PK

Eliminating Repeating Groups & Identifying Primary Key

  • Split repeating groups of data on one row: each single project number (PROJ_NUM) occurrence can reference a group of data of the same employee type (employee number, employee name, job classification, and charge per hour), due to a combination of multi-valued attributes
  • Change the table from project level to assignment level for each employee assigned to a project
  • In this case, create additional rows for each assignment (for one employee on one project)
  • All attributes are dependent on PROJNUM + EMPNUM, hence composite PK: {Proj_Num, Emp_Num}
  • Add appropriate entry in these columns, if any empty values, to avoid null PK

Converting 1NF to Second Normal Form (2NF)

  • 2NF requirements
    • No partial dependencies, i.e., each non-prime attribute is functionally dependent only on the entire PK or other entire Candidate Keys (if any), and shall not be dependent on only a portion of PK or CK
    • The relation/table must be already in 1NF
  • Steps to reach 2NF
    • Identify partial dependencies, and assign corresponding dependent attributes
    • Make new tables by separating the data items into a separate relation using appropriate PKs
      • The new tables after the split may require composite entity to bridge the tables

Identifying & Removing Partial Dependencies

  • Attributes are dependent on part of a PK, or part of composite PK
  • In this case
    • Project number determines project name
    • Employee number determines employee data, including name, job class, charge per hour
  • Remove partial dependencies by creating new tables
    • The partial dependency of empnum, empname, jobclass and chghour can become another table, with emp_num as the PK.
    • The partial dependency of projnum and projname can become a table, with proj_num as the PK.
  • Create new table(s) to capture the relationship between the new tables after the split, in this case:
    • A project can contain multiple employees. An employee can be working on multiple projects concurrently. This is a many-to-many relationship.
    • Recall that in ERD, many-to-many relationships shall be decomposed to two 1:M relationships
    • Hence a composite (bridge) entity must be used to capture the employees assigned to each project
  • Find the right tables for remaining attributes outside the partial dependency, in this case:
    • The new assignment entity is the composite entity bridging Project and Employee, and can include HOURS as an additional attribute, specific to the relationship between 1 project and 1 employee
    • Attribute name changes from HOURS to ASSIGN_HOURS to make it more meaningful
  • Having remaining attributes that are not in any partial dependency but are dependant on the original composite PK could suggest the need of a new entity with that composite PK

Converting 2NF to Third Normal Form (3NF)

  • 3NF requirements
    • no transitive dependencies
    • In simple terms, non-prime attributes being dependent on other non-prime attributes does not exist
    • The relation/table must be already in 2NF
  • Steps to reach 3NF
    • Identify transitive dependencies
    • Make new tables to eliminate all transitive dependencies, and reassign corresponding dependent attributes

Identifying & Removing Transitive Dependencies

  • Non-prime attribute determined by other non-prime attributes
  • In this case
    • Job classification determines charge per hour
    • Job classification is not in any candidate key
    • i.e., Job classification cannot help determine other values in the row in any way
  • Make new tables and reassign corresponding dependent attributes
    • For every transitive dependency, put determinant in a new table as PK; place the dependent attributes in the same table as the determinant; remove the dependents from original table
    • In this case, we put JOBCLASS in a new table JOB, place CHGHOUR in the new table, and remove it from original table
    • How can Charge Per Hour be linked to an employee?
    • CHGHOUR is dependent on JOBCLASS in JOB table, and JOB_CLASS is associated with each employee

Recap: Dependency Diagram

  • No repeating groups to meet 1NF requirement
  • Remove partial dependency for 2NF
  • Remove transitive dependency for 3NF

Normalisation Ensures Well-Formed Tables

  • Each table represents a single subject
  • No data item will be unnecessarily stored in more than one table
  • All non-prime attributes in a table are dependent on the primary key
  • Each table is void of insertion, update, and deletion anomalies
  • Ensures that all tables are in at least 3NF (rule of thumb)

Normalisation Impact

  • Normalisation makes database well-structured but increases no. of tables
    • Number of database tables grows when tables are decomposed during normalization process
    • Ensures data is structured efficiently
    • Minimizes redundancy, i.e., duplicate data, reducing unnecessary updates
    • Removes unwanted dependency to avoid insertion, deletion, update anomalies
  • Joining a larger number of tables takes additional processing logic and reduces system speed
  • Less efficient updates since we need to modify multiple tables

SQL Joins

  • With normalised tables, we need SQL Joins to combine tables for querying / reporting
  • Types of Joins:
    • Cross Join
    • Inner Join
    • Natural Join
    • Full Outer Join
  • Example Queries:
    • What is the total charge on project 111?
    • How much did employee Ankit earn in the last month from all projects?
    • What is the total charge of Mary on Project 111?
    • What is the project name with both Ankit and Mary on that project?

Cross Join (Cartesian Product)

  • Select all possible combinations of tuples in R with tuples in S, “R ∗ S”
  • “all possible tuple combinations of two relations”, “everything joined to everything”
  • Explicit cross join: SELECT * FROM R CROSS JOIN S;
  • Implicit cross join: SELECT * FROM R, S;

Inner Join

  • Returns combined tuples from two relations that have the same value for a defined attribute (match on the attribute)
  • i.e., a Cross Join (Cartesian Product) with all tuples removed that do not match on the defined attribute.
  • Most common join type, the default in most database systems
  • Inner join is usually an equi join with a condition with equality operator (=).
  • A theta join is when other comparison operators are used (
  • Explicit inner join: SELECT * FROM R INNER JOIN S ON R.City = S.City;

Inner Join vs. Natural Join

  • Inner Join joins 2 tables using common columns mentioned in the ON clause, and find matching values
  • A natural join joins tuples based on all attributes with identical names in the two relations, i.e., all common columns, and find matching values in each pair of common columns
  • Example:
    • Explicit inner join: SELECT * FROM R INNER JOIN S ON R.City = S.City
    • Natural join: SELECT * FROM R NATURAL JOIN S

Full Outer Join

  • Joins tuples from two tables that match on a defined attribute
  • If no match, the combined row will still appear with missing attributes as NULL (both tables are preserved)
  • Examples:
    • SELECT * FROM R FULL OUTER JOIN S ON R.City = S.City; (gets both City columns from 2 tables)
    • SELECT * FROM R FULL OUTER JOIN S USING(City); (only get one City column in the result)

Left Outer Join

  • Left table is completely preserved
  • If no match, the attributes from the right side will contain NULL values
  • Examples:
    • SELECT * FROM R LEFT OUTER JOIN S ON R.City = S.City; (gets both City columns from 2 tables)
    • SELECT * FROM R LEFT OUTER JOIN S USING(City); (only get one City column in the result)

Right Outer Join

  • Right table is completely preserved
  • If no match, the attributes from the left side will contain NULL values
  • Examples:
    • SELECT * FROM R RIGHT OUTER JOIN S ON R.City = S.City; (gets both City columns from 2 tables)
    • SELECT * FROM R RIGHT OUTER JOIN S USING(City); (only get one City column in the result)

Recap

  • Normalisation
  • Functional Dependencies
  • Normal Forms
    • 1NF
    • 2NF
    • 3NF
    • BCNF
  • SQL Join to combine normalised tables
    • Cross join
    • Inner join
    • Full outer join
    • Left outer join
    • Right outer join

Next lecture

  • Normalisation advanced
  • Relational Algebra

Contact

  • Lecturer: Bin Huang, bin.huang@unsw.edu.au
  • The lecture recording will be made available in Moodle course site.