Week 1 - Principles, Origins and Objectives

Learning outcomes:

Understand the principles, origins and objectives of Database Management software
Understand the algebraic and logical foundations of the Relational Model

Introduction to databases (Bitesize)

Data: the raw, unprocessed facts (e.g. 29, John, Birkbeck)

Information: processed, organised data (e.g. John is 29 years old and studies at Birkbeck)

A database is an organised electronic collection of data. It can be stored locally or on the cloud. The database should be organised in a way that makes the data easy to access, retrieve and update. In essence, it is a digital filing system.

A database is persistent, meaning it still exists when a computer is turned off.

A field is a piece of information in a database, e.g. an address in an address book. These fields come together to create a record (e.g. the name, address and phone number of a business in an address book) which can then be stored in a table. Each record has a primary key – an identifier for each record (e.g. a hospital number).

Principles, origins and objectives

Database Management System – a software system that allows users to define (set constraints, specify data type, structure), construct (store data), manipulate (query and update the database) and share data. This way a large amount of data can be managed through one software.

Why do we need databases? It’s useful to have data that is persistent. Imagine if a computer crashes then all the data is lost! A regular file may be too small and inefficient for the vast amount of data a database may hold (some have terabytes of data). Databases also tend to be a shared resource – using one reduces the likelihood of data being overwritten by another user.

For example, imagine a new student about to enrol on a course. Without a database the data for the student might be held in record files maintained by different departments

Academic Dept Student Record

Name: A Student

Home Address: 26 Acacia Ave

Next of Kin: A Partner

Registry Student Record

Name: A Student

Home Address: 26 Acacia Ave

Enrolment Date: 29 Sept 2018

Fees Office Student Record

Name: A Student

Home Address: 28 Acacia Ave

Fees Balance: 1200.00

Exams Office Student Record

Name: A Student

Home Address: 28 Acasia Ave

Exam Entry Date: 11 Dec 2018

Why this sucks – redundancy (waste of space), varies and not shared

A DBMS has 3 levels of abstraction (hiding irrelevant details from the user): physical/internal, logical/conceptual and view/external.

Physical – the lowest level. It tells us how data is actually stored in memory.

Logical – the type of data that is stored and the relationship between the data.

View – the level users see.

Why do we need a Database Management System (DBMS)?

Provides a query language – a specialised language for making queries and retrieving information from databases. One example is Structured Query Language (SQL). This abstracts the code behind data storage and retrieval, making it more user-friendly

Saves time with constraints – the rules or conditions applied to database data, to enforce data quality by limiting the kind of data that can go into a table. These can be automatically checked by a DBMS

Reduces maintenance time with data independence – The ability to make changes at one level without affecting a higher level. In other words, changes to the data structure/storage don’t affect applications that use the data. The two main types are logical (changing the logical structure e.g. adding an email column to an employee database) and physical (changing how data is physically stored e.g. moving from an HDD to an SSD).

Provides concurrent access – multiple users/applications can access the database simultaneously without causing inconsistencies. This reduces response time and increases efficiency

Recovery – if there is a failure (e.g. due to power loss) there is a lower risk of data loss. Transaction logs record changes made to the database and can be used to roll back.

Security – a DBMS provides many security features to prevent unauthorised access to a database

Can also reduce redundancy (the same data being stored in multiple places) but this isn’t always wanted!

Some disadvantages of using a DBMS are the cost and complexity. It would need to be purchased and implemented/managed by the appropriate staff.

History of database systems

Humans have been storing information for a very long time.

The first computerised databases came about in the 1960s. There were 2 popular data models: IMS and CODASYL.

IMS is a hierarchal model meaning data is in a tree-like structure. The parent (or root) node can have multiple child nodes connected to it. While the child has one parent. An example for this is a directory.

CODASYL is a network model. Here, child nodes can have multiple parent nodes though in this case they are called owners. This allows more flexibility between data points.

In the 70s the relational model was introduced. Here, data is organised into tables (relations). Each row represents an entity and each column is an attribute. This underpins many of today’s most popular DBMS.

There is now also Not only SQL (NoSQL). The schema (the blueprint for how the database is organised, includes logical constraints like data type and the relationship between entities) is flexible and the database scalable. However schemas change infrequently as doing so requires taking the database offline.

Entity Relationship Model

The ER model is a conceptual model for designing databases. It is useful to have before moving to a lower-level model (e.g. relational). It contains entities, attributes and relationships. The diagram of this model is called an ERD.

What’s required in an ERD?

Entities and entity types (sets): an entity is a “thing” that exists and can be uniquely identified (e.g. a person). An entity type (or set) is a related group of things. Entity types can be thought of as similar to a class in OOP – they have attributes and properties related to the database e.g. entity type ‘Student’ would have various students and their attributes such as name, course, ID number etc. In the ERD they are represented as a rectangle

Attributes – the properties of entity types. Usually composed of simple values e.g. an integer or string

Relationship type – an association between 2 or more entity types. They can be many-to-one/one-to-many (e.g. many employees in one department, represented by a line), one-to-one (e.g. one manager heads one department, represented by a line with crow’s feet) or many-to-many (e.g. a student has many lecturers and a lecturer has many students, represented by a line with crow’s feet on both sides)

A diagram of a company

AI-generated content may be incorrect.

Department & employee are the entity types. They have a one-to-many relationship.

Any Employee ‘objects’ have ID, last name and first name as attributes

Remember – the relational model organises the data into tables (relations). Table definitions are defined in the schema

A screenshot of a computer

AI-generated content may be incorrect.

Each entity becomes a table

Each attribute is a column

Each instance is a row.

The third table is a bridge table. They are optional in one-to-many (we could put the department name in the employee table) but required in many-to-many. It links the other two tables together with foreign key references (e.g. employee 4, Linda Hurst, belongs to sales).

ERDs continued

ERD entities often have a business context (e.g. people/roles, tangible and intangible objects). Relationship is how they relate to each other in a system. They are useful for

· Database design. Altering a DBMS directly can be risky so drawing an ER can help to identify and correct flaws before executing changes in the database.

· Debugging – having a full picture of the database shema with an ERD makes it easier to locate entities, view their attributes and identify relations, therefore making it easier to analyse a database to reveal problems

· Creation – some ERD tools can automatically generate a database turning the diagram into an actual reflection of the database structure

· Requirements – they are more user-friendly and an be used to determine the requirements of a database

An entity in ERD is a definable thing or concept in a system. Each entity becomes a relation (table) in a relational database.

An attribute is a property or characteristic of an entity. It is also known as a column. In ERDs they are given a name that describes the property and a type that describes the kind of attribute it is such as int or char. It is important the type is supported by the target DBMS

A primary key is a special attribute that uniquely identifies a record (or row) in the database table, therefore there cannot be duplicates.

A screenshot of a computer screen

AI-generated content may be incorrect.

A foreign key is a reference to the primary key in a different (foreign) table. For example, the department ID

Here we have a DEPARTMENT entity and TEACHER entity. The ID in the teacher table is the primary key. The department ID in the teacher table is the foreign key (hence it can be duplicated, or even null) – in the department table it is the primary key (where it is unique).

A relationship between two entities means that they are somehow related. Cardinality defines the number of relationships an entity has with another. The three common ones are one-to-many/many-to-one, one-to-one and many-to-many.

2: Introduction to the relational model

What we’re doing here is in the logical layer.

Reminder: the schema is the structure of the database, formed by multiple relations but not the data itself.

Relation: an individual table in a relational database. It is composed of rows (tuples) and columns (attributes)

This relational database has 4 relations: SUPPLIER, PART, PROJECT and SUPPLY. Suppliers supply a certain quantity of parts to projects. The bridge table on the right shows us how these entities relate. For example supplier S1 (Smith with a status of 20 and in London) supplies part P1 (red nuts weighing 12, from London) to project J1 (Sorter in Paris). There is a many-to-many relationship (as one entity can be linked to many other entities and vice versa).

Each relation (table) is formed of tuples (rows) and named attributes (columns). The number of tuples in a relation is the cardinality and the number of attributes is the degree. The tuples contain a value for each attribute. The domain is a set of legal values that attributes draw their actual values from (e.g. like an enum or positive integers).

The set of attribute names for a relation is called the relation scheme.

The scheme is sometimes written as

Relation (Attribute1: Degree 1, A2: D2, … An: Dn)

e.g. SUPPLIER (S#: S#, SNAME: NAME, STATUS: STATUS, CITY: CITY)

Bear in mind that a schema is fixed, while the data itself is time-varying.

This can be simplified to

SUPPLIER(S#, SNAME, STATUS, CITY)

Therefore a user sees a relational database as a collection of time-varying relations of assorted degrees.

Relational Model formalisation

We can look at these in terms of set theory.

	Set	Tuple
Elements	Unique	Can have duplicates
Ordered	No (so {1, 3, 2} = {2, 3, 1})	Yes
Symbol	{ }	( )

A relation is a subset of the Cartesian product of a list of domains. Let’s break that down:

The Cartesian product (A x B) is a set of all possible ordered pairs {a, b}. for example if A = {1, 2, 3} and B = {5} then A x B = {(1, 5), (2, 5), (3, 5)}

Now imagine 4 students and their university course:

Name	Course
Anna	1
Bob	1, 2
Carol	3, 4
Dave	None

The Cartesian product of the domains would show us every possible combination i.e. {(Anna, 1), (Anna, 2), (Anna, 3) …}

But we don’t want every possible pair! Only a subset of them. So the subsets for our relation would be

Relation R = {(Anna, 1), (Bob, 1), (Bob, 2), (Carol, 3), (Carol, 4)}

Which shows us that a relation is a set of tuples that are a subset of the Cartesian product of a list of domains.

In terms of the relational model the Cartesian product of all domains (D1, D2, …, Dn, n being the number of attributes) is the set of all n-tuples (v1, v2, …, vn) such that v1 is in D1, v2 is in D2 etc

For example, let’s consider the CONSISTS_OF relation

It has 2 attributes: DEP_NAME drawn from a domain of department names (D1) and EMP_ID drawn from a domain of employee IDs (D2).

D1 = {SALES, TECHNICAL, ACCOUNTS, ADMIN}

D2 = {1, 2, 3, 4, 5}

Then D1 x D2 = { (SALES, 1), (SALES, 2), (SALES, 3), (SALES, 4), (SALES, 5), (TECHNICAL, 1), (TECHNICAL, 2), (TECHNICAL, 3), (TECHNICAL, 4), (TECHNICAL, 5), (ACCOUNTS, 1), (ACCOUNTS, 2), (ACCOUNTS, 3), (ACCOUNTS, 4), (ACCOUNTS, 5), (ADMIN, 1), (ADMIN, 2), (ADMIN, 3), (ADMIN, 4), (ADMIN, 5)}

Because there are 5 tuples in D1 and 5 tuples in D2 we get 25 sets in the Cartesian product.

Since a relation is a set of tuples, within any given relation there are no duplicate tuples and the tuples are unordered. The attributes are also ordered (e.g. (SALES, 1) is not the same tuple as (1, SALES))

Relational Integrity Constraints

Since a database models a part of the real world, there needs to be constraints so a database can’t contain values that would be impossible in real-world situations.

A DBMS would enforce database integrity constraints to ensure impossible values do not occur. For the Suppliers-Parts-Projects database for example constraints could include part weights must be above 0 or every supplier supplies at least one part. In terms of data the constraints also help us to (1) prevent a deleted supplier from appearing in the bridge table and (2) prevent adding a non-existent supplier.

Those constraints are specific to the S-P-P database, but some authors argue that the relational model has a few database-independent constraints. Let’s look at 2 rules in particular: entity integrity rule and the referential integrity rule. Both of these involve keys and null values.

Keys

A key in DBMS is used to uniquely identify a tuple. A (candidate) key of a relation is one or more attributes of relation R that satisfy the following two time-independent properties: (1) no two distinct tuples of R have the same value for each of the key attributes and the first property does not hold for any proper subset (a subset that is not equal to the original set, they are smaller) of the key attributes. Property 1 means key attributes are unique, while 2 ensures a key is the minimal group of attributes that act as an identifier.

For example, consider this table

STUDENT_ID	EMAI	NAME
01	a@email.com	Alice
02	b@email.com	Bob
03	c@email.com	Claire

{STUDENT_ID} and {EMAIL} are unique. {STUDENT_ID x EMAIL} is also unique however it is not a candidate key. This is because proper subsets of {STUDENT_ID x EMAIL} include {STUDENT_ID} and {EMAIL} – which have already been identified as unique. Meaning {STUDENT_ID x EMAIL} is not the minimum number of attributes.

Consider the SUPPLIER table. We’d need a way to uniquely identify tuples. S# does this so it could be a key. SNAME could do this but there is the chance of a new supplier having the same name. This does not make it time-independent so it wouldn’t be an ideal key. If a table doesn’t have a column to identify tuples then it’s good to make one! In S-P-P the S#, P# and J# do this.

Types of keys

Composite (or concatenated): a key that consists of more than one attribute. It acts as a primary key if there is no primary key in the table. A relation may have many composite keys.

Candidate: the minimum set of attributes that can uniquely identify a tuple. There can be more than one in a relation.

Nonkey: an attribute that is not part of any candidate keys. For example the COLOUR attribute in the PART table

Primary: a candidate key that is chosen by the database designer to identify tuples. It cannot contain any null values or duplicates. The remaining keys are also known as the alternate keys.

Foreign: a key that references a candidate key in another table or attribute. It is key (ha-ha) to the relationship between tables as it acts as a pointer. It can be composite. The foreign key doesn’t have to be a candidate key itself.

EMP_ID	LAST_NAME	FIRST_NAME	MANAGER_ID
1	JONES	BILL	3
2	CONNORS	MARY
3	HURST	LINDA	2

EMP_ID is the primary key and MANAGER_ID is a foreign key (it references the primary key)

Note that the second tuple has a null MANAGER_ID.

NULL values

These are missing pieces of information and can occur for many reasons. For example in the S-P-P database we might know of a new supplier based in Wigan, but we don’t yet know the name or status. Here they would be represented as null.

We can also use them to represent the fact that a value is not relevant in a particular context. For example in an EMPLOYEE database that lists managers under the attribute MANAGER_ID the tuple representing the highest-level manager could be null in that column, as they wouldn’t have a manager.

Remember that null isn’t the same as zero. If someone’s age was missing from a database we wouldn’t want to list it as zero! Null values can be a pain to deal with; hence a lot of forms have fields that must be completed.

Classes of Integrity Constraint rules

Entity integrity: primary key values can uniquely identify tuples in a relation

Referential integrity: each non-null foreign key value must have matching candidate key values in some relation