Database Systems


1
New cards

Function

A relation mapping input values to specific output values; each input maps to exactly one output.

2
New cards

Relation

A relationship between two sets, forming a collection of ordered pairs.

3
New cards

Relational model concerned with

  • Data structure - tables, columns

  • Data integrity - null values, referential integrity

  • Data manipulation - relational algebra.

4
New cards

Relational model goals:

  • Data independence - allow changes to the data structure without affecting the applications using it

  • Control redundancy - data is not stored in multiple places

  • Consistency - all copies are the same, accurate, and up to date.

5
New cards

Relational schema

A textual representation of the data model.

6
New cards

Domain

Set of possible values a column can take

7
New cards

Candidate key

A minimal set of columns that uniquely identifies each row.

8
New cards

Primary key

The candidate key chosen to uniquely identify each row.

9
New cards

Alternate key

The remaining candidate keys (those not chosen as the primary key).

10
New cards

Foreign key

A column (or set of columns) matching the candidate key of an associated table.

11
New cards

Composite key

A key consisting of two or more columns.

12
New cards

Null Values

Required to deal with missing information.

Why do we use them - to express an unknown or inapplicable value for a column, so that a row can still be stored in the relation even though not every attribute has a value yet.

13
New cards

•Entity integrity rule: constraint on primary keys

◦The value of a primary key must be unique and not null

14
New cards

•Referential integrity rule: constraint on foreign keys

◦The value of a foreign key must either match the candidate key it refers to

◦or be null
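
A minimal SQL sketch of these two rules; the Student and Enrolment tables and their columns are illustrative, not from the cards:

-- Entity integrity: a primary key is unique and cannot be null.
CREATE TABLE Student (
    student_id INT PRIMARY KEY,       -- must be unique and NOT NULL
    name       VARCHAR(100) NOT NULL
);

-- Referential integrity: the foreign key must match an existing Student row, or be null.
CREATE TABLE Enrolment (
    enrolment_id INT PRIMARY KEY,
    student_id   INT,                 -- nullable foreign key
    FOREIGN KEY (student_id) REFERENCES Student(student_id)
);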

15
New cards

Entity

•An entity represents a group of objects of interest with the same characteristics.

16
New cards

Attributes

•Attributes refer to properties of an entity or a relationship.

17
New cards

Attribute types

Single-valued

◦Attribute contains a single value for each instance of an entity

◦E.g., date_of_birth

Multi-valued

◦Attribute may contain more than one value for an entity instance

◦E.g., hobby, telephone number

Derived

◦Attribute may be derived from other attribute(s)

◦E.g., age is derived from date of birth.

18
New cards

Types of abstraction

•The following types of abstractions are the building blocks for all data models:

Classification (grouping concepts)

Aggregation (composing)

Specialisation/generalisation (hierarchy)

19
New cards

Degree

•The degree of a relationship is the number of participating entities in the relationship.

•Degree Types

◦Unary (or recursive)

◦Binary

◦N-ary

20
New cards

Why can't many-to-many relationships be modelled directly?

•In a relational model, attribute values are atomic values.

◦That is, one value for each column for a row in a table.

◦More on this when we look at normalisation.

•For many-to-many relationships, we would need multi-valued attributes.

◦This is only possible in a conceptual data model.

21
New cards

How to resolve a many-to-many relationship

•In a relational model, a many-to-many relationship is resolved (modelled) by:

◦Introducing an association entity.

◦Replacing it with two one-to-many relationships.

•For the new association entity:

◦Add the two primary keys from the two entities in the many-to-many relationship. These become foreign keys (FK) in the association entity.

◦Add any attributes that were part of (on) the many-to-many relationship.

◦Determine a primary key (PK) for the association entity. Two choices: a composite key formed from the two foreign keys (plus any other attribute needed to make it unique), or a new attribute added to act as the PK, e.g., an auto-increment ID (see the sketch below).
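
A minimal SQL sketch of this resolution, assuming hypothetical Student(student_id) and Module(module_id) tables linked by a many-to-many relationship; all names and the grade attribute are illustrative:

-- Association entity replacing the many-to-many relationship with two one-to-many relationships.
CREATE TABLE Registration (
    student_id INT NOT NULL,              -- FK to Student
    module_id  INT NOT NULL,              -- FK to Module
    grade      CHAR(2),                   -- attribute that was on the relationship
    PRIMARY KEY (student_id, module_id),  -- choice 1: composite key from the two FKs
    FOREIGN KEY (student_id) REFERENCES Student(student_id),
    FOREIGN KEY (module_id)  REFERENCES Module(module_id)
);
-- Choice 2 would instead add a surrogate key, e.g. registration_id INT AUTO_INCREMENT PRIMARY KEY (MySQL syntax).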

22
New cards

Resolving N-ary Relationships

•An n-ary relationship is resolved by:

◦Introducing a new entity to represent the relationship.

◦Any attributes that are part of the relationship will be added to the new entity.

•For the new entity:

◦Add the primary keys from all the entities that were part of the n-ary relationship. These become foreign keys (FK) in the new entity.

◦Add any attributes that were part of (on) the relationship.

◦Determine a primary key (PK) for the new entity. Two choices: a composite key formed from the foreign keys (plus any other attribute(s) needed to make it unique), or a new attribute added to act as the PK, e.g., an auto-increment ID.

23
New cards

Functional dependency

◦Describes the relationship between attributes in a table.

◦Is a property of the meaning (or semantics) of the attributes in a table.

24
New cards

Lossless join property

◦To preserve all the rows in the original table.

◦To ensure that no spurious rows are generated when the decomposed tables are combined through a natural join operation

25
New cards

Dependency preservation property

◦To ensure that constraints on the original table are preserved by enforcing constraints on the decomposed tables.

26
New cards

UNF (Unnormalised Form)

UNF (Unnormalised Form):

- Contains repeating groups

27
New cards

1NF

1NF (First Normal Form):

- All attributes contain atomic (single) values.

- Remove repeating groups and select a primary key.

- Example: Flattening the StudentData table to remove repeating groups.

28
New cards

2NF

2NF (Second Normal Form):

- The table must be in 1NF.

- Eliminate partial dependencies by ensuring non-key attributes depend on the entire primary key.

- Example: Decomposing the StudentModule table into Student, Module, and Registration tables.

29
New cards

3NF

3NF (Third Normal Form):

- The table must be in 2NF.

- Eliminate transitive dependencies by ensuring non-key attributes depend only on the primary key.

- Example: Further decomposing the Module table to remove the transitive dependency ConvenorID → ConvenorName.
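
A sketch of the 3NF decomposition described above; the Module table and the ConvenorID → ConvenorName dependency come from the card, while the remaining columns are assumptions for illustration:

-- ConvenorName depends on ConvenorID, not on the key ModuleID (a transitive dependency),
-- so it is moved into its own table.
CREATE TABLE Convenor (
    ConvenorID   INT PRIMARY KEY,
    ConvenorName VARCHAR(100)
);

CREATE TABLE Module (
    ModuleID   CHAR(8) PRIMARY KEY,
    Title      VARCHAR(100),
    ConvenorID INT,
    FOREIGN KEY (ConvenorID) REFERENCES Convenor(ConvenorID)  -- replaces ConvenorName in Module
);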

30
New cards

Consistency requirements

In general, consistency requirements include:

Database integrity constraints: Entity integrity (primary keys cannot be null) and Referential integrity (foreign keys must reference existing records)

Application constraints: For example, the sum of the balances of accounts A and B is the same before and after the transaction T1.

A transaction transforms the database from one consistent state to another, although consistency may be temporarily violated during the transaction.

31
New cards

Why we need transactions

Counting or general statistics on changing data can be error-prone, especially when multiple users are accessing or modifying the data simultaneously.

Transactions provide a consistent snapshot of the database, ensuring that operations are executed in a controlled manner, even when multiple users are interacting with the database at the same time.

For example, a booking system for student wellbeing appointments may run into a number of potential problems:

- Multiple students trying to book the same slot simultaneously.

- Lost Update Problem: The last booking overwrites previous ones, leading to inconsistent data.

The solution is to use transactions and concurrency control mechanisms to prevent such issues.
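
A minimal sketch of how a transaction can protect the booking example, assuming a hypothetical Appointment table; the row-locking syntax (FOR UPDATE) follows PostgreSQL/MySQL conventions:

BEGIN;                                            -- start the transaction
-- Lock the slot row so a concurrent booking cannot read it and then overwrite ours.
SELECT booked_by FROM Appointment WHERE slot_id = 42 FOR UPDATE;
-- Claim the slot only if it is still free; otherwise the application rolls back.
UPDATE Appointment SET booked_by = 'student123'
 WHERE slot_id = 42 AND booked_by IS NULL;
COMMIT;                                           -- make the booking durable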

32
New cards

ACID Transaction Properties

Atomicity: A transaction is either fully completed or fully aborted (all or nothing).

Consistency: The database remains in a consistent state before and after the transaction.

Isolation: Partial effects of incomplete transactions should not be visible to other transactions.

Durability: Once a transaction is committed, its effects are permanent.

33
New cards

HIGHER NORMAL FORMS

•We generally aim for 3NF, but higher normal forms exist.

Boyce-Codd Normal Form (BCNF) is a stronger form of 3NF

◦All determinants in a relation must be candidate keys

◦If all non-key attributes depend only on the complete key, a table is in BCNF

4NF is based on multi-valued dependencies

◦A table is in 4NF if it is in BCNF and contains no non-trivial multi-valued dependencies

◦Based on lossless join dependency

5NF involves relations that must be decomposed into three or more tables.

34
New cards

ER Modelling vs. Normalisation

ER modelling vs. normalisation:

- Approach: ER modelling is top-down; normalisation is bottom-up.

- Rigour: ER modelling is informal and intuitive; normalisation is formal.

- Orientation: ER modelling is "real world" oriented; normalisation is database oriented.

- Semantics: ER modelling can represent full semantics; normalisation captures limited semantics only, unless additional steps are introduced.

- Outcome: with ER modelling a good model is not guaranteed, but a skilled modeller may produce an excellent model; with normalisation a good model is guaranteed if the rules are applied rigorously, but it may not necessarily be the best.

35
New cards

The two main issues that occur with respect to concurrency control:

System failures of all kinds: Hardware, Software and Network issues may occur, leading to a failure

Concurrent transaction problems: Multiple people updating data at the same time

In order to counteract this, we can factor these four elements into our control:

Goal: Allow multiple transactions to run concurrently without interfering with each other.

Schedules: A sequence of operations by concurrent transactions. Operations from different transactions can be interleaved.

Serial Schedule: Transactions run one after another with no concurrency. This avoids concurrency issues but has poor performance.

Serializable Schedule: A schedule that produces the same result as a serial schedule, even though transactions are interleaved.

The three kinds of concurrency problems that may occur during a transaction include:

Lost Updates (Dirty Write): One transaction overwrites the changes of another.

Reading Uncommitted Data (Dirty Reads): A transaction reads uncommitted data from another transaction.

Non-Repeatable Reads (Inconsistent Analysis): A transaction reads the same data twice but gets different results due to updates by another transaction.

36
New cards

Pessimistic concurrency

Pessimistic concurrency control techniques assume that conflicts between transactions are likely to occur. Therefore, they prevent conflicts by checking for potential issues before executing operations.

37
New cards

Optimistic concurrency

Optimistic concurrency control techniques assume that conflicts between transactions are rare. Therefore, they allow transactions to proceed without checking for conflicts during execution and only validate for conflicts at the end of the transaction.

38
New cards

All types of locking:

Read Lock (Shared Lock): Multiple transactions can read the same data item, but no transaction can write to it.

Write Lock (Exclusive Lock): Only one transaction can write to a data item, and no other transactions can read or write to it.

Two-Phase Locking (2PL): a concurrency control protocol where a transaction acquires all its locks in a growing phase (Locks are acquired but not released) and releases all locks in a shrinking phase (Locks are released after the transaction commits or aborts), ensuring serializability.

The primary issue with locking is that deadlock and livelock can occur, potentially blocking transactions indefinitely.
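
A hedged sketch of shared vs. exclusive locks in SQL; the exact syntax is DBMS-specific (FOR SHARE / FOR UPDATE as in PostgreSQL and MySQL 8), and the Account table is illustrative:

BEGIN;
-- Read (shared) lock: other transactions may also read this row, but not write it.
SELECT balance FROM Account WHERE account_id = 1 FOR SHARE;

-- Write (exclusive) lock: no other transaction may lock or write this row until we finish.
SELECT balance FROM Account WHERE account_id = 2 FOR UPDATE;
UPDATE Account SET balance = balance - 100 WHERE account_id = 2;

COMMIT;  -- under two-phase locking, all locks are released only at commit or abort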

39
New cards

Timestamping:

Each transaction is assigned a unique timestamp, and operations are ordered based on these timestamps. If a transaction tries to read or write data that has been updated by a later transaction, it is rolled back and restarted. The primary advantage of timestamping is that deadlocks cannot occur. However, frequent rollbacks can occur if conflicts are common, which may slow the system down.

40
New cards

Optimising query processing

The primary goal of query processing is to transform a high-level query (typically written in SQL) into an efficient execution strategy expressed in a lower-level language (e.g., relational algebra). The strategy is then executed to retrieve the required data.

Particularly for larger, more complex datasets and queries, it becomes important to find execution strategies that reduce the work needed to answer a query.

41
New cards

Optimising query processing, Performance factors

There are a number of factors to consider when looking at performance:

Throughput: The number of tasks (e.g., queries) that can be completed within a given period.

Response Time: The time required to complete a single task (e.g., a query).

Additional performance considerations include:

Time Performance: How quickly the query can be executed.

Space Performance: How much memory or disk space is required for intermediate results.

43
New cards

There are multiple stages with respect to query processing:

Query Decomposition:

-Transform the user query into a sequence of operations on relations.

- Check the query for correctness (syntactically and semantically).

- Select an execution plan (e.g., relational algebra).

Query Optimization: Choose the most efficient execution strategy.

Code Generation: Generate the code to execute the query.

Run-Time Query Execution: Execute the query and retrieve the results.
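
Many DBMSs (e.g., PostgreSQL and MySQL) expose the execution strategy chosen during query optimisation via the EXPLAIN statement; a sketch with hypothetical table and column names:

-- Show the plan the optimiser chose (index scan vs. sequential scan, join order, etc.).
EXPLAIN
SELECT s.name, m.title
FROM   Student s
JOIN   Registration r ON r.student_id = s.student_id
JOIN   Module m       ON m.module_id  = r.module_id
WHERE  m.module_id = 42;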

44
New cards

Key considerations to think about when designing a database include:

Transaction Requirements: Frequency, volume, and type of access (read, update, insert).

System Resources: Main memory, CPU, disk access, storage, and network performance.

Objectives: Fast throughput, acceptable response time, and minimized disk storage.

Trade-offs: Physical design must balance conflicting objectives and adapt to changing requirements.

45
New cards

File Organisation

The three primary types of file organisation are:

Heap (Unordered) Files: Records are stored in insertion order. Efficient for bulk loading but inefficient for selective retrieval.

Sequential (Ordered) Files: Records are sorted by a specified field. Efficient for range queries but problematic for insertions and deletions.

Hash Files: Records are stored based on a hash function. Efficient for exact matches but not suitable for range queries or pattern matching.

46
New cards

Heap (unordered) files

Features of heap files:

Structure: Records are stored in the order they are inserted (chronological order), making the file unordered.

Insertion: Efficient, as new records are added at the end of the file.

Retrieval: Inefficient, as searching requires a linear scan from the start of the file.

Deletion: Problematic because:

- Deleted records are marked as deleted but not immediately removed.

- Periodic reorganization is needed to reclaim space, especially in volatile files (frequent deletions).

Default Structure: Heap files are the default table structure in many databases.

47
New cards

Heaps advantages

Advantages:

- Bulk Loading: Efficient for loading large amounts of data into a table.

- Large Retrievals: Useful when retrieving a large portion of the records.

Disadvantages:

- Selective Retrieval: Inefficient for queries that require searching by key values, especially in large tables.

- Sorting: External sorting (sorting data outside of memory) can be time-consuming.

- Volatility: Not suitable for tables with frequent changes or deletions.

Heap files are simple and efficient for bulk operations but are not ideal for selective retrieval or volatile tables. They are best suited for scenarios where large-scale data loading or retrieval is required, rather than frequent updates or deletions.

48
New cards

Sequential (ordered) files

Features of sequential files:

Structure: Records are stored in sorted order based on the values of one or more fields (e.g., sorted by name or id).

Retrieval: Efficient for:

- Exact matches: Finding records with a specific value in the ordering field.

- Range queries: Retrieving records within a range of values in the ordering field.

- Binary search: Faster than linear search because the data is sorted.

Insertion and Deletion: Problematic because:

- Records must be inserted or deleted in a way that maintains the sorted order.

- This can require shifting records, which is time-consuming.

49
New cards

Sequential (ordered) files advantages

Sequential files are efficient for retrieval operations, especially when searching for specific values or ranges in the ordering fields. However, they are less efficient for insertions and deletions because maintaining the sorted order requires additional overhead. This makes them less suitable for tables that undergo frequent changes.

50
New cards

Hash files

Features of hash files:

Structure:

- A hash function is applied to one or more fields (called the hash field or hash key) to generate the address of a disk block or page (called a bucket) where the record is stored.

- Within each bucket, records are stored in slots in the order they arrive.

- Techniques like division-remainder and folding are used to spread hash values evenly across the address space.

Example:

- A hash function like Mod 3 divides records into 3 buckets based on the hash key (e.g., carId).

- Records with carId values that hash to the same bucket are stored together.
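
The division-remainder example above can be sketched directly in SQL; the Car table is hypothetical, while carId and the Mod 3 hash come from the card:

-- Bucket address = carId mod 3: carIds 3, 6, 9 share bucket 0; 1, 4, 7 bucket 1; 2, 5, 8 bucket 2.
SELECT carId, MOD(carId, 3) AS bucket
FROM   Car
ORDER  BY bucket, carId;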

51
New cards

Hash files ADV/DISADV

Advantages:

- Efficient for Exact Matches: Very fast for retrieving records based on exact matches of the hash key (equality selection).

- Direct Access: Records can be accessed directly using the hash function, avoiding the need for linear searches.

Disadvantages:

- Pattern Matching: Cannot efficiently handle queries that involve partial matches or patterns.

- Range Retrieval: Inefficient for range queries, as records are not stored sequentially.

- Alternative Fields: Searches on fields other than the hash key require linear searches through all buckets/pages.

- Non-Key Fields: Retrieval based on non-key fields is inefficient.

Hash files are highly efficient for exact match queries on the hash key field, making them ideal for scenarios where quick access to specific records is needed. However, they are not suitable for pattern matching, range queries, or searches on non-key fields, as these require linear searches through all buckets. Hash files are best used when the primary access pattern involves exact key-based retrieval.

52
New cards

Table Indexing

Indexes are an important feature of SQL, as they can help to:

- Improve search and retrieval efficiency.

- Improve the performance of SQL queries.

An index file contains search key values and addresses of records in the data file. Indexes are typically created automatically on primary keys.
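
A minimal sketch of creating a secondary index (names are illustrative; as noted above, the index on the primary key is usually created automatically):

-- Create an index on a frequently searched non-key field.
CREATE INDEX idx_student_surname ON Student (surname);

-- Lookups and range queries on surname can now use the index instead of scanning the whole table.
SELECT * FROM Student WHERE surname = 'Codd';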

53
New cards

B+ Trees

A B+ Tree is a balanced, multi-level index structure used in databases to efficiently organize and retrieve data. Here are its key characteristics:

Dynamic Structure: Expands or shrinks automatically as rows are added or deleted, maintaining balance. Ensures that the tree remains efficient even as the dataset grows or changes.

Full Index: Every record in the data file is addressed by the index, so the data file itself does not need to be ordered. This allows for fast and efficient retrieval of any record.

Balanced Tree: All leaf nodes are at the same depth, ensuring consistent search times for any record. The tree is balanced, meaning the number of nodes searched to find any record is the same.

Node Structure: Each node contains M search key values and M+1 pointers. Non-leaf nodes contain keys and pointers to child nodes. Leaf nodes contain keys and pointers to the actual data records. All leaf nodes are linked in order, making range searches efficient.

Efficient Search: Retrieval is fast because the tree structure allows for binary search-like traversal. For example, to find a record with key C05, the search starts at the root, compares keys, and follows pointers to the appropriate child nodes until the correct leaf node is reached.

Degree of the Tree: The degree (or order) of the tree refers to the number of child nodes allowed per parent node. Trees with a larger degree are broader and shallower, which makes them faster for searches.

54
New cards

B+ TREES ADV/DISADV

Advantages:

- Supports exact match, range queries, and partial key searches efficiently.

- Handles dynamic datasets well, as it automatically reorganizes itself during insertions and deletions.

Disadvantages:

- Slightly higher overhead for insertions and deletions compared to simpler structures like hash tables.

- Requires additional space to store the index structure.

B+ Trees are a powerful and versatile indexing method, offering fast and efficient access to data for both exact and range queries. They are widely used in databases because of their ability to handle large, dynamic datasets while maintaining balanced and consistent performance.

55
New cards

The importance of reliability

DBMSs must ensure that databases are reliable and remain in a consistent state when failures occur. Failures that could occur include:

System crashes: Loss of main memory

Hardware failure: Loss of database storage

Natural disasters: Destruction of database systems

Human errors: Corruption or destruction of data, hardware or software

Reliability is achieved through:

- Minimising the frequency of failures and associated side effects

- Maximising the ability to recover

56
New cards

types of reliability

Primary Reliability:

- Avoid single points of failure

- Realized through hardware replication (e.g., RAID), backup power sources (UPS, emergency generators), etc.

Secondary Reliability:

- Rigorous operating procedures (e.g., script and test DB schema changes, no manual DB access)

- Access controls (e.g., limit access to necessary personnel, prevent physical access)

Database System Failure Types:

- Transaction failure: Deadlock, handled by rollback and restart

- System failure: Power failure, system should recover to a consistent state on restart

- Communication failure: Network losses, handled in distributed setups

- Media failure: Disk failure, recover from backup

57
New cards

Database recovery

Recovery mechanisms within database design should:

- Ensure atomicity (transactions are either completely executed or not at all)

- Ensure durability (changes from committed transactions are permanently stored)

- Guarantee database consistency

- Minimize impact on normal processing

High-performance DBMS may use at least three kinds of disks: System and DBMS software, Live data, Transaction logs

59
New cards

The DBMS log file contains information about the current states of transactions and database updates. This information usually includes:

- Transaction records (e.g., Transaction ID, type of log record, start, commit, abort records)

- Pages affected by insert, update, or delete actions (e.g., Data page ID, before-image, after-image)

- Checkpoint records (where all changes made to the database are on the disk)

60
New cards

With the Transaction Processing Model:

- Pages needed to process transactions are read from disk into memory buffers.

- Buffer pages are updated, and records of updates are written to the log.

- Buffer Manager:

- Controls the buffer pool, fetches pages from disk as required, and removes pages when no longer needed.

- Modified pages are flushed to disk at the end of a transaction or when the buffer is full.

61
New cards

Types of recovery

Normal recovery: After normal shutdown, start from the last log record (checkpoint written at shutdown).

Warm recovery: After system failure, revert to the last checkpoint in the log, apply committed transactions, and remove effects of uncommitted transactions.

Cold recovery: After media failure, restore from backup/dump and apply log records to reach the latest consistent state.

62
New cards

Checkpoints

- All modified pages in memory are written to disk, ensuring an accurate record of all current transactions in the log.

- A record describing the checkpoint is written to the log, including the list of current transactions.

63
New cards

The Undo/Redo recovery method (warm recovery algorithm)

The Undo/Redo algorithm is widely used; it allows the buffer manager to flush modified pages before, during, or after commit.

Steps:

1. Create two empty lists: UNDO and REDO

2. Start reading the log at the last checkpoint

3. Put all active transactions on the UNDO list

4. Read the log forwards, adding transactions to UNDO or REDO based on BEGIN-TRANSACTION or COMMIT records

5. At the end, use before-images to undo all on the UNDO list and after-images to redo all on the REDO list

64
New cards

Anderson’s Rule

Anderson's database rule comprises the following:

- A system designed for ease of access is insecure

- A system designed for strict security is difficult to use

This creates a trilemma between security, functionality, and scalability: only two can be optimised at a time.

The trilemma is getting the balance right:

- The more accessible and usable the database, the more vulnerable it is to security threats.

- The more secure and protected the database is from threats, the more difficult it is to access and use.

65
New cards

CIA Triad

The CIA Triad is a widely used model guiding database security:

1. Confidentiality – Prevent unauthorized access to sensitive data.

2. Integrity – Maintain data accuracy, consistency, and trustworthiness.

3. Availability – Ensure reliable access to data for authorized users.

To achieve security, all components of a database system must be protected:

- Users should be identifiable, authorized, and monitored.

- Data must be protected, tamper-proof, auditable, and recoverable.

- Hardware must be maintained and secured from physical tampering.

- Software should be updated and protected from vulnerabilities.

66
New cards

Common database security threats

Privilege Abuse - Users granted excessive privileges can misuse them. Example: An employee granted both read and write access to an accounts table when only read access is needed. Mitigation: Follow the principle of least privilege (grant only necessary access); see the sketch at the end of this card.

Input Injection - Attackers manipulate user input to execute unauthorized database queries, for example through SQL injection.

Malware - Malicious software may steal sensitive data (e.g., database passwords). Mitigations include: Installing and updating anti-malware tools, keeping applications and database software up to date, and educating staff on security risks (avoid suspicious websites & downloads).

Storage Media Exposure - Database backups must be secured to prevent unauthorized access. Mitigations include: Encrypting backups, storing them in secure, locked locations, and using cloud storage with encryption.
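
A sketch of the least-privilege mitigation mentioned under Privilege Abuse above; the role and table names are illustrative:

-- Grant only the read access this role actually needs on the accounts table.
GRANT SELECT ON accounts TO report_role;

-- Remove write privileges that are not required.
REVOKE INSERT, UPDATE, DELETE ON accounts FROM report_role;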

67
New cards

Common database security threats 2

Misconfigured Databases - Default installations may have security vulnerabilities. Example: Some MySQL versions install without a root password. Mitigations include: securing default accounts and disabling unnecessary features.

Unmanaged Sensitive Data - Sensitive data should be encrypted. Encryption Methods include: Data at Rest – Encrypt stored data (e.g., full-disk encryption, database-level encryption), Data in Transit – Encrypt data sent over networks (e.g., TLS/SSL), Client-side Encryption – Encrypt data before it is transmitted to the database.

Denial of Service - Attackers flood the database server with requests, making it unavailable. Mitigation: Using firewalls and rate limiting, ensuring extra server capacity (CPU, memory, bandwidth), and deploying distributed hosting to prevent single points of failure.

68
New cards

The best practices for database administration include the following:

The best practices for database administration include the following:

- Regularly update software and apply security patches.

- Implement audit trails to track database activity.

- Physically secure servers and backups.

- Ensure Database Administrators (DBAs) undergo regular training.