This lecture covers Normalization Part 2 and SQL wrap-up.
Topics include Normal Forms (1NF, 2NF, 3NF, BCNF) and SQL wrap-up (Aggregate functions, GROUP BY, Multiple table operation, Data Definition Language, Data Manipulation Language, Transaction Control Language).
Data vs. Information
Conceptual Model: Entity-Relationship Modelling technique.
Chen’s notation: good for high-level conceptual model initially.
Crow’s Foot: the design standard in the textbook and this course.
Entity type and entity instance.
Attribute and value.
Relationship (Degree, Connectivity, Cardinality).
Logical Model: Converting Conceptual model to detailed Logical Model (using Crow’s Foot) ready for DB implementation.
Advanced topics:
Relationship strength
Composite entity
Relationship degree
Supertype and Subtype
Selecting Primary Key
Relational Model: Convert ER model to a set of tables (relations) in the relational model.
Apply Normalisation on the relations to remove any anomalies
SQL Data Definition & Manipulation Language
Use Relational Model to implement database by creating a table for each normalised relations
Data Definition Language defines the tables
Data Manipulation Language queries/updates the tables
Chapter 6. Normalisation of Database Tables (6-7 to 6-8)
Chapter 7. Introduction to SQL (7-8 to 7-10)
Chapter 8. Advanced SQL (8-1 to 8-4)
Normal Forms
Recap 1NF, 2NF, 3NF
BCNF
SQL wrap-up
Aggregate functions
GROUP BY
Multiple table operation
Data Definition Language (DROP, ALTER)
Data Manipulation Language (INSERT, DELETE, UPDATE)
Transaction Control Language (COMMIT, ROLLBACK, SAVEPOINT)
Normalization is a table design technique aimed at minimizing data redundancies.
Normalization focuses on the characteristics of specific entities.
The first three normal forms (1NF, 2NF, and 3NF) are most commonly used.
Best practice: continue the iterative ER process until all entities and their attributes are defined and all equivalent tables are in 3NF.
Normalization is an important part—but only a part—of the design process.
Proposed entities must meet the required normal form before table structures are created.
Principles and normalisation procedures to be understood to redesign and modify databases.
ERD is created through an iterative process.
Building project has: Project number, Name, Employees assigned to the project.
Employee has: Employee number, Name, Job classification.
The company charges its clients by billing the hours spent on each project.
The hourly billing rate is dependent on the employee’s position.
Many multivalued attributes
Trying to fit multiple employees' data into one row for one project
Many data redundancies, e.g., employee data JOBCLASS, CHARGEHOUR is in many rows because employees can work on multiple projects
Data inconsistencies, e.g. “Database Designer” or “DB Designer”
All above could lead to insertion, deletion, update anomalies
Eliminate the repeating groups and identify the Primary Key to fulfill 1NF
Table Name: RPT_FORMAT
PROJ_NUM | PROJECT_NAME | EMP_NUMBER | EMP_NAME | JOB_CLASS | CHARGE_HOUR | HOURS_BILLED |
---|---|---|---|---|---|---|
15 | Evergreen | 103,101,105, 106, 102 | June E. Arbough, John G. News, Alice K. Johnson | |||
*, William Smithfield, David H. Senior | Elec. Engineer, Database Designer, Database Designer, Programmer, Systems Analyst | 85.5, 105., 105., 35.75, 98.75 | 48.1, 18.36 | |||
18 | Amber Wave | 114, 118, 104, 112 | Annelise Jones, James J. Frommer, Anne K. Ramoras *, Darlene M. Smithson | Applications Designer, General Support, Systems Analyst, DSS Analyst | 23.8, 19.4, 35.7, 12.6, 23.8 | 96.75, 45.95 |
Table Name: DATAORG1NF
PROJ_NUM | PROJ_NAME | EMP_NUM | EMP_NAME | JOB_CLASS | CHG_HOUR | HOURS |
---|---|---|---|---|---|---|
15 | Evergreen | 103 | June E. Arbough | Elect. Engineer | 84.50 | 23.8 |
15 | Evergreen | 101 | John G. News | Database Designer | 105.00 | 19.4 |
15 | Evergreen | 105 | Alice K. Johnson | Database Designer | 105.00 | 35.7 |
15 | Evergreen | 106 | William Smithfield | Programmer | 35.75 | 12.6 |
15 | Evergreen | 102 | David H. Senior | Systems Analyst | 96.75 | 23.8 |
18 | Amber Wave | 114 | Annelise Jones | Applications Designer | 48.10 | 24.6 |
18 | Amber Wave | 118 | James J. Frommer | General Support | 18.36 | 45.3 |
18 | Amber Wave | 104 | Anne K. Ramoras | Systems Analyst | 96.75 | 32.4 |
18 | Amber Wave | 112 | Darlene M. Smithson | DSS Analyst | 45.95 | 44.0 |
Primary Key: (PROJNUM, EMPNUM)
No repeating groups to meet 1NF requirement
Remove partial dependency for 2NF
Remove transitive dependency for 3NF
Dependency Diagram – how do we fulfill 2NF, 3NF?
Create new table(s) to capture the relationship between the new tables after the split, in this case:
A project can contain multiple employees. An employee can be working on multiple projects concurrently. This is a many-to-many relationship.
Recall that in ERD, many-to-many relationships shall be decomposed to two 1:M relationships
Hence a composite (bridge) entity must be used to capture the employees assigned to each project
Remove partial dependencies and check if a composite entity is needed, to fulfill 2NF
Find the right tables for remaining attributes outside the partial dependency, in this case:
The new assignment entity is the composite entity bridging Project and Employee and can include HOURS as an additional attribute, specific to the relationship between 1 project and 1 employee.
Attribute name changes from HOURS to ASSIGN_HOURS to make it more meaningful.
Having remaining attributes that are not in any partial dependency but are dependant on the original composite PK could suggest the need of a new entity with that composite PK
What if PROJECT and EMPLOYEE have a 1-to-many relationship? Do we absolutely need a composite entity ASSIGNMENT? Which table would ASSIGN_HOURS go into?
Find the right tables for remaining attributes.
Remaining attribute(s) going into a new table.
Remove transitive dependency by making new tables & reassigning attributes for 3NF
For every transitive dependency, put determinant in a new table as PK; place the dependent attributes in the same table as the determinant; remove the dependents from the original table.
In this case, we put JOBCLASS in a new table JOB, place CHGHOUR in the new table, and remove it from the original table
How can Charge Per Hour be linked to an employee?
CHGHOUR is dependent on JOBCLASS in JOB table, and JOB_CLASS is associated with each employee
Considered a special case of 3NF, to cover some specific aspects and problems with the 3NF, also known as 3.5NF
Not always achievable (in contrast, 3NF is always achievable)
Can be violated only when the table contains more than one candidate key (Equivalent to 3NF when the table contains only one candidate key)
Already 3NF AND every determinant in the Dependency Diagram should be a complete candidate key
Every determinant of every functional dependency must be a complete CK
When we have multiple candidate keys…
Two candidate keys:
{CLASS_CODE}, chosen as PK
{CRSCODE, CLASSSECTION}
2NF and 3NF must be adhered to when we check the PK and all other candidate keys
Is this 2NF? Yes — Can part of PK or CK determine other non-prime attributes? No
Is this 3NF? Yes — Can a non-prime attribute determine another non-prime attribute? No
This is 3NF with no partial or transitive dependency
CLASSCODE -> STAFFID
Not partial, because dependant STAFF_ID is a prime attribute (partial dependency only applies when the dependant is a non-prime attribute)
Not transitive because CLASSCODE is a prime attribute, being part of a candidate key {STUID, CLASS_CODE} (transitive means non-prime attributes being determined by other non-prime attributes)
Not BCNF: in CLASSCODE -> STAFFID, the determinant CLASS_CODE is not a complete CK
Intuitively, the table is dual-purpose, staff assignments to classes AND enrolment, causing anomalies
Update Anomaly: if a different staff is assigned to CLASS_CODE 123, all the rows in the table for that class must be updated correctly
Deletion Anomaly: if all 5 existing students dropped out of CLASS_CODE 123 in week 3, after these 5 rows are deleted, information about who taught that class is lost forever
BCNF might be violated when multiple CK exists
For a table to be in BCNF, every determinant in all dependencies must be a complete CK
Need to remove the dependency CLASSCODE -> STAFFID, which is not partial or transitive, but it is problematic
Solution: decompose the table
Now every determinant in a dependency in each table is a candidate key
In this case, both tables have only 1 candidate key, so 3NF and BCNF clearly both hold
{STUID, CLASSCODE} is the only candidate key in the left table
{CLASS_CODE} is the only candidate key in the right table
Denormalization is needed to avoid excessive computation (e.g., JOIN, CASE, maths)
Reason | Example | Benefits of denormalisation | Control needed to avoid anomalies |
---|---|---|---|
Redundant Data | Storing ZIP and CITY in AGENT table when ZIP determines CITY | Reduce frequent JOIN to ZIP_CITY table; Improve performance lag in queries; Dropdown on user interface, where city is auto-populated based on zip code | Periodic data consistency checks |
Derived Data | Storing STUCLASS (Freshman, Junior…) in STUDENT table derived from STUHRS (student total hours in class to date) | Reduce runtime calculations (CASE WHEN STU_HRS < 30 THEN 'Freshman’ …); Avoid complex query doing calculation on the fly | Automatic recalculation is only triggered when STUHRS change; STUCLASS can be made read-only field to prevent any manual edits |
Pre-aggregated Data | Storing STU_WAM permanently in STUDENT table, redundant data based on each course score | Avoid expensive WAM maths recalculations upon every SELECT query; WAM auto-updated only when grades change | Admin-only access to update aggregated WAM |
Temporary Reporting | Tables Creating pivoted table with repeating groups for reporting purpose (more details next) | Reduce complex SQL JOIN and pivoting operations | Temp tables are automatically deleted after report issuance; Temp tables’ write access must be restricted to only reporting team |
The last four semesters taught are not necessarily the same for all staff, e.g., sabbaticals, new joiner
Normalised EVALDATA table is the master table
FACHIST table (faculty history) is a temporary denormalised table created from EVALDATA table
FACHIST table is used for reporting and filtering, e.g., filter by department, by feedback score (e.g., >3.5)
SELECT [DISTINCT | ALL] {* | [column_expression [AS new_name]] [, …] }
FROM table_name [alias] [, …]
[WHERE condition]
[GROUP BY column_list]
[HAVING condition]
[ORDER BY column_list];
[ ]
: indicates optional elements.
{ }
: indicates that the element may or may not appear.
|
: indicates “or”.
;
: indicates the end of the statement.
Aggregate functions can summarize statistics of selected rows
COUNT: number of selected rows.
MAX: maximum value within selected rows
MIN: minimum value within selected rows
AVG: arithmetic mean (average) for the selected rows.
SUM: total value for a numeric attribute within selected rows.
-- Example 1
-- How many vendors referenced in the PRODUCT table have supplied products with prices that are less than or equal A$10.00?
SELECT COUNT(DISTINCT V_Code)
FROM PRODUCT
WHERE P_Price <= 10.00;
-- Example 2
-- What is the max price for all products?
SELECT MAX(P_Price)
FROM PRODUCT;
All these functions yield only one single value
-- Example 3
-- What is the total dollar value of all the items on hand in our inventory?
SELECT SUM(P_Onhand * P_Price)
FROM PRODUCT;
-- Example 4
-- What is the average of product price in apple category?
SELECT AVG(P_Price)
FROM PRODUCT
WHERE lower(P_Description) LIKE ‘%apple%’;
SELECT MAX(P_Price)
FROM PRODUCT;
-- This previous query shows the highest price from the product table…
-- …but does not show the product details
SELECT P_Code, P_Description, P_Price
FROM PRODUCT
WHERE P_Price = (SELECT MAX(P_Price) FROM PRODUCT);
We need a nested query (query in a query)
The nested query (inner query, subquery) is performed first, and the single-value result is used for the outer query
SELECT …
FROM …
GROUP BY <column>
HAVING <condition>
Groups the rows from SELECT operation and produces a single summary row for each group
The GROUP BY clause is valid only when SELECT contains aggregate functions, such as AVG, SUM, COUNT
-- How many product units per product description are available from the vendor with vendor code 21344? (min 200 units)
SELECT V_Code, P_Description, SUM(P_Onhand)
FROM PRODUCT
WHERE V_Code = 21344
GROUP BY V_Code, P_Description
HAVING SUM(P_Onhand) > 200;
Multiple table operations are similar to JOIN, but no explicit “JOIN … ON”
-- List all student details in course ‘COMM2822’:
SELECT StdNo, StdName, Major, CourseNo, Grade
FROM Student, Enrolment
WHERE Student.StdNo = Enrolment.StdNo
AND Enrolment.CourseNo = 'COMM2822';
-- List the student and course details for student number 38214 (Note that we use aliases instead of the full table names)
SELECT StdName, CourseTitle, Grade
FROM Student S, Enrolment E, Course C
WHERE S.StdNo = E.StdNo
AND C.CourseNo = E.CourseNo
AND E.StdNo = 38214
Category | Command | Keyword | Explanation |
---|---|---|---|
DDL (Data Definition Language) | CREATE | Creates new tables, indexes, or database structures | |
ALTER | Modifies existing database structures, e.g., adding or removing columns | ||
DROP | Removes tables, indexes, or other database objects permanently | ||
DML (Data Manipulation | INSERT | Adds new rows of data to a table | |
DELETE | Removes existing rows from a table | ||
UPDATE | Modifies existing data within a table | ||
SELECT | Retrieves data from one or more tables, allowing filtering, aggregation | ||
TCL (Transaction Control | COMMIT | Saves all changes made during the current transaction to the database | |
ROLLBACK | Reverts changes made in the current transaction if an error occurs | ||
SAVEPOINT | Creates a temporary save point within a transaction, allowing partial rollbacks |
INSERT can add raw data into a table or add data from another table
-- INSERT raw data
INSERT INTO table_name [(column_name, …)]
VALUES (data_value, …);
-- INSERT data from an existing table
INSERT INTO table_name (column_name, …)
SELECT column_name, …
FROM copy_from_table_name
[WHERE (condition)]
-- Add a new customer
INSERT INTO Customers (CustomerName, ContactName, Address, City, PostalCode, Country)
VALUES ('Kai Trading', 'John Wick', 'Houston Rd 1', 'Kensington', '2033', 'Australia’);
-- we launched a truck fleet service and can serve all our suppliers to ship their products; hence all suppliers become customers
INSERT INTO Customers (CustomerName, City, Country)
SELECT SupplierName, City, Country
FROM Suppliers;
UPDATE can update selected rows and columns
UPDATE table_name
SET column_name = new_data_value, …
[WHERE (condition)]
-- Kai Trading moved to a new address
UPDATE Customers
SET City = 'Bondi Junction',
PostalCode = '2022',
Address = 'Council St 23'
WHERE CustomerName = 'Kai Trading';
DELETE FROM table_name
[WHERE (condition)]
WHERE clause specifies which rows should be deleted
If you do not have a filter with WHERE, all records in the table will be deleted!
Think a million times if you do NOT need the WHERE filter
-- Mohit Metals (CustomerID 333) stops the contract with our company
DELETE
FROM Customers
WHERE CustomerID = 333;
-- This should just delete one row with PK being 333
ALTER can change table structure by adding or dropping columns
ALTER TABLE table_name
ADD column_name datatype;
ALTER TABLE table_name
MODIFY column_name new_datatype;
ALTER TABLE table_name
DROP COLUMN column_name;
-- Add a new column to table customers
ALTER TABLE Customers
ADD Phone VARCHAR2(15);
-- Change the new column Phone to NUMBER data type (only works if column is empty or all values in the column are indeed convertible)
ALTER TABLE Customers
MODIFY Phone NUMBER (15);
-- Drop the column Phone
ALTER TABLE Customers
DROP COLUMN Phone;
DROP can drop the entire table, if no violation of referential integrity
Foreign Key
Is this 1-to-1 or 1-to-many relationship? 1-to-many
Which entity is the child, i.e. the “many” side? Customers
What if we try to drop Sales_agents table before dropping Customers table?
AgentID 101 and 102 in Customers table would be invalid – referential integrity violation
SAVEPOINT name;
creates a named checkpoint within a transaction, allowing partial rollbacks
ROLLBACK TO name;
reverts changes to a specific SAVEPOINT
ROLLBACK;
undo ALL changes since the last COMMIT OR the first DML statement of the current set of uncommitted changes (in Oracle only)
COMMIT;
marks end of transaction and save changes
COMMIT;
INSERT INTO employees (id, name)
VALUES (1, 'Mickey’);
SAVEPOINT after_insert;
UPDATE employees
SET name = 'Donald'
WHERE id = 1;
SAVEPOINT after_update;
DELETE
FROM employees
WHERE id = 1;
-- Partial rollback: undo DELETE
ROLLBACK TO after_update;
UPDATE employees
SET name = 'Daisy'
WHERE id = 1;
ROLLBACK;
-- Full rollback: undo all changes since the last COMMIT and we revert to an empty table
COMMIT;
BCNF
SQL wrap-up
Aggregate functions
GROUP BY
Multiple table operation
Data Definition Language (DROP, ALTER)
Data Manipulation Language (INSERT, DELETE, UPDATE)
Transaction Control Language (COMMIT, ROLLBACK, SAVEPOINT)
Next lecture Recap
Relational Algebra
Database Development