Databases

Basic Concepts

A.1.1 - Data vs Information

  • data: the quantities, symbols, and characters on which operations are being performed by a computer system, being stored, transmitted, and distributed in electrical signals, and recorded in optical/magnetic/mechanical recording media

  • often considered the more objective facts, having characters and symbols with no distinct meaning

  • information: occurs when connections between pieces of data are identified (processed data with meaning)

A.1.2 - Information System vs Database

  • information system: an organized collection of interconnected components that work together to collect, store, process, and distribute information, to support decision-making and operations within an organization 

  • hardware, software, data, people, process

  • database: an organized collection of data, typically stored electronically in a computer system, designed for efficient storage, retrieval, and management of data 

A.1.3 - The Need for Databases

  • needs: allow easy data sharing which enhances collaboration, improved efficiency and decision making, potential innovation and better customer experience 

  • challenges: high implementation cost, complex to manage, performance overhead 

A.1.4 - Transaction States 

  • db state: the actual data stored in the database at a specific point in time. it reflects the current contents and changes made through operations (insertion, deletion, update). during a lifetime of a transaction, there can be multiple states to go through

A.1.5 - Database Transaction

  • transaction: a sequence of one or more operations treated as a single unit of work. the goal is the ensure either all operations succeed, or none do, to maintain data consistency

  • common operations are the CRUD operations 

A.1.6 - Concurrency

  • two or more events/circumstances that are existing and happening at the same time

  • data concurrency refers to the ability of a database to handle multiple users and processes manipulating/accessing data simultaneously, without compromising data integrity and consistency

  • pros: better system resource utilization, higher throughput, however it is crucial that these transactions do not conflict with each other

  • dirty read: one transaction reads uncommitted data from another transaction

  • incorrect summary: one transaction reads before another has completed an update

  • non-repeatable read: multiple reads on one transaction gets different values 

  • phantom read: when a read variable is read again but an error occurs due to deletion 

  • lost update: an update is lost when another transaction overwrites the update 

  • popular solutions (thinking procedurally) → lock based (locking data when in use), time stamp based (using timestamps to monitor multiple transactions)  

A.1.7 - ACID Properties

  • atomicity → transactions are treated as an atomic unit (they either succeed, or not at all) 

  • consistency → transaction follows predefined rules/constraints

  • isolation → data isolated from each other to avoid conflicts and interference

  • durability → data is persisted even during system failure 

A.1.8 - Database Functions

  • query functions (read) → accessing and retrieving data for analyzing and reporting purposes 

  • update functions (delete, create, update) → making sure data is consistent and accurate/up to date 

A.1.9 - Data Validation and Verification

  • validation: ensures data is of the same type/format and within the expected range

  • verification: ensures data fits with the goals of the programs 

Relational Databases

A.2.1 - DBMS and RDBMS 

DBMS

  • Stores data individually

  • Data elements are accessed individually

  • No relationship is present between data

  • No normalization is present

  • No support for distributed databases

RDBMS

  • Stores data in table format

  • Multiple data elements can be accessed in the same time

  • Data is stored in the form of tables which are in relationship with each other 

  • Normalization is present

  • Supports distributed databases

Normalization

  • Process of organizing data within a database to minimize redundancy and to improve data integrity 

Distributed Databases

  • Systems where data is stored and managed across multiple computers or sites connected by a network 

A.2.2 - Functions and Tools of a DBMS

Functions of a DBMS

  • Creation (setting up the structure of the database)

  • Defining schema → determining the tables, fields, data, types, and relationships

  • Setting constraints → primary keys, foreign keys, unique constraints, default values

  • Data modeling → designing entity-relationship diagrams (ERDs) to map out logical structure

  • Manipulation (inserting, updating, deleting, and maintaining data within the database)

  • Data entry and updates → adding/modifying records

  • Data validation → ensuring data meets required formats/rules

  • Transaction management → ensuring data integrity (rollback/commit)

  • Interrogation (retrieving and analyzing data from the database)

  • Querying → fetching specific data using conditions or joins

  • Reporting → presenting data in user-friendly formats

  • Aggregation → using functions like SUM(), AVG(), COUNT()

Tools of a DBMS

  • MySQL

  • Microsoft SQL Server

  • SQLite

A.2.3 - DBMS and Data Security

Personal vs Organizational Security

  • Data security is the practice of protecting digital information from unauthorized access, corruption, or loss

  • It involves implementing measures to ensure the confidentiality, integrity, and availability of data throughout its lifecycle

  • Personal data and multi-factor authentication → may be less purposeful for bigger datasets in an organization → 2/3 factor authentication

  • Organizational data security areas → application security, data security

Data Validation

  • Data type → checks that the data matches the data type for a field (letters, numbers, etc) → range validation also would be needed

  • Data range → the data falls within an acceptable range of values defined for the field → data constraint needed

  • Data constraint → data meets certain conditions or criteria for a field (type of data, number of characters, etc.) → data range validation also needed

  • Data consistency → data makes sense in the context of other related data → data might be consistent but still inaccurate

  • Code structure → data follow or conforms to a set structure → format/pattern

  • Code validation → checks if it exists → application code systematically performs all of the previously mentioned validations during user data input → might not properly validate all possible variations with data input 

Access Rights (AKA Database Permissions/Privileges)

  • Define what actions users or roles can perform on various database objects, such as tables, views, schemas, or the entire database

  • Fundamental component of database security and access control, ensuring that only authorized individuals/systems can interact with data in specific ways 

  • Example → policy based access rights → only administrators can create and manage policies for database permissions and then apply those policies to users → permission levels

Data Locking

  • Mechanism in database management systems to control concurrent access to data, ensuring data integrity and consistency, especially in multi-user environments

  • Prevents conflicting operations from being performed on the same data by different transactions simultaneously

  • Approaches to data locking → allowing many reads but limited edits, locking specific tables, pages, database rows

A.2.4 - Schema

  • Defines the structure and organization of data within a database, acting as a blueprint for how data is stored, accessed, and manipulated

  • Outlines tables, fields, data types, relationships, and constraints that govern the database’s logical design 

  • Common components: tables, columns/fields, primary keys, foreign keys, relationships

  • Primary key → uniquely identifies a record in the table

  • Foreign key → field in the table that is a primary in another table

A.2.5 - Levels of Schema 

  • Conceptual level → represents the “what” of the data focusing on business concepts and their relationships. it is a high-level view, abstracting away technical details (stakeholder focus is upper business leaders)

  • Logical level → defines the structures and details of the data, still independent of any specific database system. it builds upon the conceptual model by adding more structure (data architect and analysts)

  • Physical level → defines how the data will be physically stored in a specific database system, it is the implementation of the blueprint (database developers)

A.2.6 - Nature of the Data Dictionary 

  • A collection of names, definitions, and attributes about data elements that are being used in a database/information system/part of a research project

  • describes the meanings and purposes of data elements within the context of a project

  • provides guidance on interpretation, accepted meanings/representation

  • in DBMS, it helps users manage data in an orderly manner, preventing data redundancy

  • important to provide clarity about the entities/ relationships, and attributes that exist in the data model

A.2.7 - Data Definition Language 

  • a syntax for creating and modifying database objects such as tables, indices, and users

  • these commands define implementation details of the database schema, which are usually hidden from the users

  • create, alter, drop, rename, truncate, comment

  • controls the design, not the data itself

  • it is a part of SQL (structured query language) commands used to manage RDBMS

A.2.8 - Importance of Data Modeling

  • process of creating a visual representation of data and how it relates to each other, often in the form of diagrams, to represent data structures and how they will be stored and used within a system

  • acts as a blueprint for databases/information systems, defining entities, attributes, and relationships

  • benefits → enhanced decision-making, optimized database performance, efficient data integration 

A.2.9 - Database Terms 

Basic Database Terms

Database Keys

A.2.10 - Types of Relationships

A.2.11 - Issues of Redundant Data 

  • Data redundancy simply refers to unneeded or unnecessary data

  • Good redundancy → system backup redundancy → failover system is a backup operational mode that automatically switches to a redundant/ standby system when a primary system fails 

  • If one system goes down, the others allows processes to continue without interruptions

  • Bad redundancy → within a database, it is possible for the same data to be duplicated multiple times (same tables, across multiple tables) → increased storage costs, potential data inconsistencies and complex data management 

  • Foreign keys can reduce redundancy → establishing relationships between tables in a relational database allows for data to be stored in a normalized fashion 

  • Pros: better data security, faster data updates and access, improved data reliability 

  • Cons: possible data inconsistency, increase in database corruption, increase in database size leading to unnecessary storage, increase in cost due to the need of more processing power

A.2.12 - Importance of Referential Integrity

  • Normalization is the process of dividing large tables in smaller ones and establishing relationships between them → improve db’s efficiency

  • Normalized db’s structure data into multiple linked tables → preventing anomalies (crud operation errors) 

  • Referential integrity is a fundamental principle of relational db that ensures the consistency of relationships between tables by requiring that foreign key values always point to valid and existing primary key values in a referenced table 

A.2.13 - Types of Normalized Data Forms

  • Normal forms are rules for structuring tables to minimize redundancy and improve data integrity. They are organized by levels that progressively get stricter

  • 1NF → eliminate repeating values → each column should hold atomic values → one column cannot hold multiple values 

  • 2NF → eliminate partial functional dependency → no attribute should depend on just a part of a composite key → full dependency is required

  • 3NF → eliminate transitive/indirect dependency → no non-key attribute should be dependent of another non key attribute 

A.2.14 - Characteristics of a Normalized Database 

  • reduced data redundancy

  • data consistency and accuracy

  • fewer data anomalies/oddities

  • clear table relationships 

A.2.15 - Different Data Types

  • data type: an attribute associated with a piece of data that tells a computer system how to interpret its value, reinforcing code reliability and efficiency

  • numeric → int (whole numbers), float (decimals) 

  • character string → char (fixed length text), varchar (variable length text) 

  • unicode character string → nchar (fixed length unicode text), ntext (multilingual text) 

  • binary → binary (0s and 1s for small binary data), image (stores large binary objects) 

  • date/time → date (year, month, day), datetime (date and time stamps) 

  • miscellaneous → json, xml (used for data transfer/serialization) 

A.2.16 - Entity-Relationship Diagram 

  • visual tool used to design databases and provide a structured blueprint for a database, ensuring clarity/consistency/efficiency during development

  • helps prevents redundancy and data loss

  • include entities, attributes, and relationships 

  • memorize the symbols 

A.2.17 - Constructing Relational Databases 

A.2.18 - Query in a Database 

  • a query is a request for information

  • a query asks the database to return data that meet certain conditions 

  • a query can perform multiple functions in SQL as needed

  • select using SELECT

  • filter using WHERE

  • sort data using ORDER BY

  • database view is a type of temporary virtual table that shows data in its current state without changing what is in the tables 

  • in SQL, CREATE VIEW can save a specific query

A.2.19 - Simple vs Complex Query

  • a simple query retrieves data from one table using basic commands (SELECT, FROM, and possibly WHERE)

  • a complex query involves multiple tables and advanced operations (JOIN, UNION, GROUP BY/HAVING, subqueries/nested queries, views/functions)

  • SQL parameter queries (prepared statement) separates the SQL query structure from the data value. this approach enhances security, performance and code maintainability. 

  • a parameter value is used as a placeholder that allows the user to provide an inputted value → minimizes risk for SQL injections → removes the ability for easy injection since input is viewed as a string and not SQL code

  • SQL and boolean operators are logical operators that combine/modify boolean expressions (true/false values) to control program flow. common operators (AND, OR, NOT) are used to make decisions in code, similar to how it is used in search queries

  • SQL derived fields are new tables created in the query by performing calculations on existing fields (a new column is formed through adding, subtracting, multiplying, etc.) 

A.2.20 - Query Construction 

  • When interrogating data, queries are created through SQL commands 

  • Commands must follow a specific order of execution 

  • Commands/clauses

  • Data dictionary language → changes structure of the database

  • Data manipulation language (update, insert, delete, select) → changes data stored in the structure of the database 

  • Order of execution (from, where, group by, having, select, order by, limit)

Other Database Areas

A.3.1 Database Administrator

  • A DBA is an IT professional who manages, organizes, and secures a company’s data by maintaining its databases

  • Their responsibilities include designing database structures, ensuring data access and security, performing backups and recovery, and troubleshooting performance issues to support business needs

A.3.2 End User Interactions

Direct Interaction: using query languages or tools via DBMS

  • SQL queries: are a common direct interaction method for end users to directly issue commands to a database → primarily performed by DB specialists

  • Query by example: DB query method that uses a visual interface to allow users to directly search data by providing examples, rather than writing code in languages like SQL → also focus on the DB specialists → provides GUI that shows tables with rows and columns → users fill in the blanks or example values in the relevant table column instead of writing text-based commands → users can type keywords, wildcards, or logical operations into the criteria row to specify search conditions → the DBS returns all records that match the user’s provided examples and conditions

Indirect interaction: via applications on the computers, the web, etc.

  • Visual queries: a visual query tool provides a GUI to construct and execute DB queries without requiring extensive knowledge of SQL syntax → searches are primarily multimedia-based searches performed by customers → users can drag and drop tables, define relationships, and specify criteria visually → automatically generates the corresponding query in the background → supports various transformations like filtering, sorting, grouping, and basic arithmetic operations on columns

  • Natural language interfaces (NLIDBs): systems that allow users to interact with DB using everyday, conversational language instead of technical query language like SQL → a user types a question in natural human language → then the system uses natural language processing to understand the intent of the question → it links the terms in the question to elements in the DB schema (tables and columns) → the system generates a structured query based on the user’s request and the DB schema → the SQL query is executed against the DB and the results are returned to the users.

A.3.3 Methods of Database Recovery

  • DB recovery is the process of restoring the DB to a correct and consistent state after failure (to maintain atomicity and durability)

  • Sudden power outage, software bug, hardware crash → DBMS needs a way to get back up and running without losing data or leaving it in a corrupted state

Log-based recovery

  • Log-based recovery: the ability to maintain or recover data by keeping a record of transaction on some stable storage device to provide easy access to data when the system fails

  • a log file will be created to store a log of every operation just before it is performed on the DB → when a transaction is recorded to the DB another log is then recorded as a confirmation → SQL log files are in their own format and needs a special reader → as a log file records every step of the DB transaction, when a failure occurs, a DBA can decide which recovery makes the most sense

  • undo for atomicity → reverse changes from transactions that did not finish

  • redo for durability → reapply changes from committed transactions

  • immediate updates → updates before transactions commit

  • deferred updates → updates after commits occur

Shadow paging

  • works by keeping two versions of the DB pages during the transaction → does not require log pages

  • page tables are logical divisions of data linked to physical storage locations on a disk

  • current/master page table: a copy of shadow page table when transaction begins, all updates are made here, and approved page tables become the new shadow page table

  • shadow page table: old and stable state of DB, never changes during transactions, if failure occurs, DB returns to this version

  • 1. when a transaction starts, a shadow page table is created with pointers to the same locations as current/master table

  • 2. prior to commit, pages needing updates are first copied → updates are included in the copies and current/master pointers are updated to new pages

  • 3. upon transaction success, the shadow page table pointers are updated → at this point, the shadow page table has become the current/master page table ready for the next transaction

Backup and Restore

  • creating periodic backups/copies of a DB to protect against data loss and using those copies to recover the data and operations in case of corruption or failure

  • process typically done via GUI tools but can be done via command line

  • full backup: entire data set, regardless of any previous backups or circumstances

  • differential backup: additions and alterations since the most recent full backup

  • incremental backup: additions and alterations since the most recent backup

  • backups will typically just replace bad database information with the backed-up data

A.3.4 Integrated Database Systems Functions

  • unifying data from different sources into a single, consistent view of data

  • it is impossible to have one single database solution for all the different departments of a business

  • different levels of data integration can be helpful

  • 1. extracting data from multiple sources

  • 2. transformation and loading of data into a central repository or data warehouse

  • 3. providing a common interface for visualization/reporting of data for users to interact with all data as if it were in one system

  • pros: better data quality, cost savings, better decision-making and collaboration, improved efficiency, higher quality customer experiences, increased revenue streams, stronger data security, improved data accessibility

A.3.5 Use of Databases in Different Areas

stock control

  • store product details (item codes, descriptions, quantities, price, suppliers)

  • automatically update stock levels when items are sold or delivered

  • reorder levels can be set to trigger alerts when stock is low

  • supplier management

  • historical sales data can be analyzed to forecast demand and reduce waste or shortages

police records

  • store personal details, criminal records, case histories

  • allow quick searching and cross-referencing of suspects and crimes

  • db help track investigations and identify patterns in crime

  • secure access controls protect sensitive information and ensure only authorized officers can view or edit data

A.3.6 Methods of Database Privacy

  • data security: focused on protecting personal data from unauthorized third-party access, malicious attacks, and exploitation. set up to protect personal data using different methods to ensure data privacy

  • data privacy: the right of individuals to control their personal information, dictating how organizations collect, store, use, and share it, ensuring it is handled responsibly in compliance with the law

data protection methods

  • network security: firewalls, network segmentation, VPNs, intrusion detection/prevention system

  • access management: user accounts/passwords, multi-factor authentication, role-based access control, account lockouts after input errors

  • threat protection: anti-malware software, SQL injection protection, web application firewalls, security monitoring and alerts

  • information protection: encryption at rest on server and in transit over networks, data masking, regular backups, data retention policies

data legislation

  • laws that control how data is collected, stores, used, and shared

  • different regions and countries have worked to develop laws regarding data privacy

  • computer misuse (unauthorized access to computer material)

  • data protection act (storage limitations, accuracy)

A.3.7 Need for Database Publicity

  • lawful access to data allows access to otherwise protected data when required by law, typically by authorized bodies such as police, courts or regulators

principles governing lawful access

  • legal basis: required by law (court order) and access must be clearly documented

  • least privilege: minimum data access needed

  • purpose limits: access for specific legal investigation → not for secondary/unrelated use

  • audibility

  • oversight

  • time-bound

  • integrity.custody

A.3.8 Data Matching and Mining

data matching

  • works to identify and link records that refer to the same real-world entity and to find similarities, creating a single, comprehensive view

  • compares fields (names, addresses, ID) across different datasets to resolve inconsistencies like typos or different formats

  • it is often used for security and verification

  • purpose is to improve data quality, cleansing, entity resolution, and building reliable DBs

data mining

  • works to discover patterns, correlations, anomalies, and trends in large datasets

  • uses algorithms (classification and clustering) to find buried knowledge that is not immediately obvious → works on larger datasets

  • purpose is for predictive analytics, targeted marketing, fraud detection, business intelligence