Databases

Basic Concepts

A.1.1 - Data vs Information

  • data: the quantities, symbols, and characters on which operations are performed by a computer system; data may be stored, transmitted, and distributed as electrical signals, and recorded on optical, magnetic, or mechanical media

  • often considered the more objective, raw facts: characters and symbols with no inherent meaning on their own

  • information: occurs when connections between pieces of data are identified (processed data with meaning)

A.1.2 - Information System vs Database

  • information system: an organized collection of interconnected components that work together to collect, store, process, and distribute information, to support decision-making and operations within an organization 

  • hardware, software, data, people, process

  • database: an organized collection of data, typically stored electronically in a computer system, designed for efficient storage, retrieval, and management of data 

A.1.3 - The Need for Databases

  • needs: allow easy data sharing which enhances collaboration, improved efficiency and decision making, potential innovation, and better customer experience 

  • challenges: high implementation cost, complex to manage, performance overhead 

A.1.4 - Transaction States 

  • db state: the actual data stored in the database at a specific point in time. it reflects the current contents and the changes made through operations (insertion, deletion, update). over the lifetime of a transaction, the database passes through multiple states

  • a transaction itself also moves through states: active → partially committed → committed, or failed → aborted (rolled back)

A.1.5 - Database Transaction

  • transaction: a sequence of one or more operations treated as a single unit of work. the goal is to ensure that either all operations succeed, or none do, to maintain data consistency

  • common operations are the CRUD operations 
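The all-or-nothing behavior described above can be sketched with Python's built-in sqlite3 module. The `accounts` table and the `transfer` rule are invented for illustration:

```python
import sqlite3

# In-memory database for illustration; table and column names are hypothetical
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 50)")
conn.commit()

def transfer(conn, src, dst, amount):
    """Move money between accounts as a single unit of work."""
    try:
        conn.execute("UPDATE accounts SET balance = balance - ? WHERE name = ?",
                     (amount, src))
        conn.execute("UPDATE accounts SET balance = balance + ? WHERE name = ?",
                     (amount, dst))
        # Hypothetical business rule: balances may not go negative
        (bal,) = conn.execute("SELECT balance FROM accounts WHERE name = ?",
                              (src,)).fetchone()
        if bal < 0:
            raise ValueError("insufficient funds")
        conn.commit()        # all operations take effect together...
    except Exception:
        conn.rollback()      # ...or none of them do
        raise
```

A successful `transfer(conn, 'alice', 'bob', 30)` commits both updates; an overdraft attempt rolls both back, leaving the balances untouched.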

A.1.6 - Concurrency

  • two or more events/circumstances that are existing and happening at the same time

  • data concurrency refers to the ability of a database to handle multiple users and processes manipulating/accessing data simultaneously, without compromising data integrity and consistency

  • pros: better system resource utilization, higher throughput, however it is crucial that these transactions do not conflict with each other

  • dirty read: one transaction reads uncommitted data from another transaction

  • incorrect summary: one transaction computes an aggregate (e.g. a sum) while another transaction is updating the same data, so the summary mixes old and new values

  • non-repeatable read: the same read repeated within one transaction returns different values because another transaction modified the data in between

  • phantom read: a repeated query within one transaction returns a different set of rows because another transaction inserted or deleted rows in between

  • lost update: an update is lost when another transaction overwrites the update 

  • popular solutions (thinking procedurally) → lock based (locking data while it is in use), timestamp based (using timestamps to order concurrent transactions)
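The lost-update problem and the lock-based fix can be sketched in plain Python with threads standing in for concurrent transactions (a real DBMS lock manager is far more elaborate; this is only the core idea):

```python
import threading

balance = 100                       # a shared "record"
lock = threading.Lock()

def deposit(amount):
    global balance
    with lock:                      # lock-based control: one writer at a time
        current = balance           # read the record
        balance = current + amount  # write it back; without the lock, two
                                    # threads could both read 100 and one
                                    # deposit would be overwritten (lost update)

threads = [threading.Thread(target=deposit, args=(10,)) for _ in range(10)]
for t in threads:
    t.start()
for t in threads:
    t.join()
# balance == 200: all ten deposits survive
```

Removing the `with lock:` line reintroduces the race: two threads can read the same starting value and one update is silently lost.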

A.1.7 - ACID Properties

  • atomicity → transactions are treated as an atomic unit (they either succeed, or not at all) 

  • consistency → transaction follows predefined rules/constraints

  • isolation → concurrent transactions are isolated from each other to avoid conflicts and interference

  • durability → data is persisted even during system failure 

A.1.8 - Database Functions

  • query functions (read) → accessing and retrieving data for analyzing and reporting purposes 

  • update functions (delete, create, update) → making sure data is consistent and accurate/up to date 

A.1.9 - Data Validation and Verification

  • validation: ensures data is of the expected type/format and falls within the expected range

  • verification: ensures data has been entered or transferred accurately, matching its original source (e.g. double entry, proofreading)

Relational Databases

A.2.1 - DBMS and RDBMS 

DBMS

  • Stores data individually

  • Data elements are accessed individually

  • No relationship is present between data

  • No normalization is present

  • No support for distributed databases

RDBMS

  • Stores data in table format

  • Multiple data elements can be accessed at the same time

  • Data is stored in the form of tables which are in relationship with each other 

  • Normalization is present

  • Supports distributed databases

Normalization

  • Process of organizing data within a database to minimize redundancy and to improve data integrity 

Distributed Databases

  • Systems where data is stored and managed across multiple computers or sites connected by a network 

A.2.2 - Functions and Tools of a DBMS

Functions of a DBMS

  • Creation (setting up the structure of the database)

  • Defining schema → determining the tables, fields, data types, and relationships

  • Setting constraints → primary keys, foreign keys, unique constraints, default values

  • Data modeling → designing entity-relationship diagrams (ERDs) to map out logical structure

  • Manipulation (inserting, updating, deleting, and maintaining data within the database)

  • Data entry and updates → adding/modifying records

  • Data validation → ensuring data meets required formats/rules

  • Transaction management → ensuring data integrity (rollback/commit)

  • Interrogation (retrieving and analyzing data from the database)

  • Querying → fetching specific data using conditions or joins

  • Reporting → presenting data in user-friendly formats

  • Aggregation → using functions like SUM(), AVG(), COUNT()
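The aggregation functions named above can be tried directly with Python's sqlite3 module; the `sales` table is invented for illustration:

```python
import sqlite3

# Tiny throwaway table to interrogate; names and values are hypothetical
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (item TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("pen", 2.0), ("pen", 3.0), ("book", 10.0)])

# SUM(), AVG(), COUNT() collapse many rows into one summary row
total, avg, n = conn.execute(
    "SELECT SUM(amount), AVG(amount), COUNT(*) FROM sales").fetchone()
# total == 15.0, avg == 5.0, n == 3
```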

Tools of a DBMS

  • MySQL

  • Microsoft SQL Server

  • SQLite

A.2.3 - DBMS and Data Security

Personal vs Organizational Security

  • Data security is the practice of protecting digital information from unauthorized access, corruption, or loss

  • It involves implementing measures to ensure the confidentiality, integrity, and availability of data throughout its lifecycle

  • Personal data security → 2/3-factor (multi-factor) authentication protects individual accounts, though it may be less practical on its own for an organization's larger datasets

  • Organizational data security areas → application security, data security

Data Validation

  • Data type → checks that the data matches the data type for a field (letters, numbers, etc) → range validation also would be needed

  • Data range → the data falls within an acceptable range of values defined for the field → data constraint needed

  • Data constraint → data meets certain conditions or criteria for a field (type of data, number of characters, etc.) → data range validation also needed

  • Data consistency → data makes sense in the context of other related data → data might be consistent but still inaccurate

  • Code structure → data follow or conforms to a set structure 

  • Code validation → application code systematically performs all of the previously mentioned validations during user data input → might not properly validate all possible variations with data input 
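The type and range checks above can be sketched as a small Python function; the `age` field and its 0-120 range are invented rules for illustration:

```python
def validate_age(raw):
    """Hypothetical validation for an 'age' field: type check, then range check."""
    try:
        value = int(raw)           # data type validation: must parse as a number
    except (TypeError, ValueError):
        return False
    return 0 <= value <= 120       # data range validation: acceptable values only

validate_age("25")   # True:  right type, within range
validate_age("abc")  # False: fails the type check
validate_age("-3")   # False: right type, outside the range
```

As the notes point out, data passing these checks can still be wrong in context (consistent but inaccurate); validation only filters out impossible values.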

Access Rights (AKA Database Permissions/Privileges)

  • Define what actions users or roles can perform on various database objects, such as tables, views, schemas, or the entire database

  • Fundamental component of database security and access control, ensuring that only authorized individuals/systems can interact with data in specific ways 

  • Example → policy based access rights → only administrators can create and manage policies for database permissions and then apply those policies to users → permission levels

Data Locking

  • Mechanism in database management systems to control concurrent access to data, ensuring data integrity and consistency, especially in multi-user environments

  • Prevents conflicting operations from being performed on the same data by different transactions simultaneously

  • Approaches to data locking → allowing many reads but limited edits, locking specific tables, pages, database rows

A.2.4 - Schema

  • Defines the structure and organization of data within a database, acting as a blueprint for how data is stored, accessed, and manipulated

  • Outlines tables, fields, data types, relationships, and constraints that govern the database’s logical design 

  • Common components: tables, columns/fields, primary keys, foreign keys, relationships

  • Primary key → uniquely identifies a record in the table

  • Foreign key → field in one table that is the primary key of another table, linking the two
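A minimal schema showing both kinds of key, using Python's sqlite3 module; the `authors`/`books` tables are invented for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE authors (
        author_id INTEGER PRIMARY KEY,   -- primary key: uniquely identifies a record
        name      TEXT NOT NULL
    );
    CREATE TABLE books (
        book_id   INTEGER PRIMARY KEY,
        title     TEXT NOT NULL,
        author_id INTEGER,               -- foreign key: primary key of authors
        FOREIGN KEY (author_id) REFERENCES authors(author_id)
    );
""")
conn.execute("INSERT INTO authors VALUES (1, 'Ursula K. Le Guin')")
conn.execute("INSERT INTO books VALUES (1, 'The Dispossessed', 1)")
```

The foreign key is what lets a query join the two tables back together instead of repeating the author's name in every book row.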

A.2.5 - Levels of Schema 

  • Conceptual level → represents the “what” of the data focusing on business concepts and their relationships. it is a high-level view, abstracting away technical details (stakeholder focus is upper business leaders)

  • Logical level → defines the structures and details of the data, still independent of any specific database system. it builds upon the conceptual model by adding more structure (data architect and analysts)

  • Physical level → defines how the data will be physically stored in a specific database system, it is the implementation of the blueprint (database developers)

A.2.6 - Nature of the Data Dictionary 

  • A collection of names, definitions, and attributes about data elements that are being used in a database/information system/part of a research project

  • describes the meanings and purposes of data elements within the context of a project

  • provides guidance on interpretation, accepted meanings/representation

  • in DBMS, it helps users manage data in an orderly manner, preventing data redundancy

  • important to provide clarity about the entities/ relationships, and attributes that exist in the data model

A.2.7 - Data Definition Language 

  • a syntax for creating and modifying database objects such as tables, indices, and users

  • these commands define implementation details of the database schema, which are usually hidden from the users

  • create, alter, drop, rename, truncate, comment

  • it is a part of SQL (structured query language) commands used to manage RDBMS
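Most of the DDL commands listed above can be run against SQLite from Python (note SQLite itself has no TRUNCATE or COMMENT; this is a sketch of the common subset):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE staff (id INTEGER PRIMARY KEY, name TEXT)")  # CREATE
conn.execute("ALTER TABLE staff ADD COLUMN email TEXT")                 # ALTER
conn.execute("ALTER TABLE staff RENAME TO employees")                   # RENAME
conn.execute("DROP TABLE employees")                                    # DROP
# SQLite has no TRUNCATE; `DELETE FROM table` plays the same role there
```

Each statement changes the structure of the database rather than the rows inside it, which is the dividing line between DDL and data manipulation commands.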

A.2.8 - Importance of Data Modeling

  • process of creating a visual representation of data and how it relates to each other, often in the form of diagrams, to represent data structures and how they will be stored and used within a system

  • acts as a blueprint for databases/information systems, defining entities, attributes, and relationships

  • benefits → enhanced decision-making, optimized database performance, efficient data integration 

A.2.9 - Database Terms 

Basic Database Terms

Database Keys

A.2.10 - Types of Relationships

A.2.11 - Issues of Redundant Data 

  • Data redundancy refers to the same data being stored in more than one place, or data that is otherwise unneeded

  • Good redundancy → system backup redundancy → failover system is a backup operational mode that automatically switches to a redundant/ standby system when a primary system fails 

  • If one system goes down, the other allows processes to continue without interruption

  • Bad redundancy → within a database, it is possible for the same data to be duplicated multiple times (same tables, across multiple tables) → increased storage costs, potential data inconsistencies and complex data management 

  • Foreign keys can reduce redundancy → establishing relationships between tables in a relational database allows for data to be stored in a normalized fashion 

  • Pros: better data security, faster data updates and access, improved data reliability 

  • Cons: possible data inconsistency, increase in database corruption, increase in database size leading to unnecessary storage, increase in cost due to the need of more processing power

A.2.12 - Importance of Referential Integrity

  • Normalization is the process of dividing large tables into smaller ones and establishing relationships between them → improves the db's efficiency

  • Normalized dbs structure data into multiple linked tables → preventing anomalies (errors in CRUD operations) 

  • Referential integrity is a fundamental principle of relational db that ensures the consistency of relationships between tables by requiring that foreign key values always point to valid and existing primary key values in a referenced table 
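Referential integrity enforcement can be demonstrated with sqlite3 (SQLite only checks foreign keys when the pragma is enabled; the `departments`/`employees` tables are invented for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")   # SQLite enforces FKs only when enabled
conn.executescript("""
    CREATE TABLE departments (dept_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE employees (
        emp_id  INTEGER PRIMARY KEY,
        dept_id INTEGER REFERENCES departments(dept_id)
    );
""")
conn.execute("INSERT INTO departments VALUES (1, 'HR')")
conn.execute("INSERT INTO employees VALUES (10, 1)")       # valid: dept 1 exists
try:
    conn.execute("INSERT INTO employees VALUES (11, 99)")  # invalid: no dept 99
except sqlite3.IntegrityError:
    rejected = True          # the DBMS refuses a dangling foreign key value
```

The database itself, not the application, guarantees that every `dept_id` in `employees` points at a real department.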

A.2.13 - Types of Normalized Data Forms

  • Normal forms are rules for structuring tables to minimize redundancy and improve data integrity. They are organized by levels that progressively get stricter

  • 1NF → eliminate repeating values → each column should hold atomic values → one column cannot hold multiple values 

  • 2NF → eliminate partial functional dependency → no attribute should depend on just a part of a composite key → full dependency is required

  • 3NF → eliminate transitive/indirect dependency → no non-key attribute should depend on another non-key attribute 
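A sketch of the 3NF step, using plain Python lists to stand in for tables (the student/department data is invented): the department head depends on the department, not on the key, so it moves to its own table.

```python
# Unnormalized: dept_head depends on dept, not on the key student_id
# (a transitive dependency), so "Dr. Lee" is stored redundantly
students = [
    (1, "Ana",  "CS",   "Dr. Lee"),
    (2, "Ben",  "CS",   "Dr. Lee"),
    (3, "Cara", "Math", "Dr. Shah"),
]

# 3NF: the dept → dept_head dependency gets its own table, one row per dept
students_3nf = [(1, "Ana", "CS"), (2, "Ben", "CS"), (3, "Cara", "Math")]
departments  = {"CS": "Dr. Lee", "Math": "Dr. Shah"}

# No information is lost: a join reconstructs the original rows
rebuilt = [(sid, name, dept, departments[dept])
           for (sid, name, dept) in students_3nf]
```

If CS gets a new head, the normalized form needs one update instead of one per student, which is exactly the update anomaly 3NF removes.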

A.2.14 - Characteristics of a Normalized Database 

  • reduced data redundancy

  • data consistency and accuracy

  • fewer data anomalies/oddities

  • clear table relationships 

A.2.15 - Different Data Types

  • data type: an attribute associated with a piece of data that tells a computer system how to interpret its value, reinforcing code reliability and efficiency

  • numeric → int (whole numbers), float (decimals) 

  • character string → char (fixed length text), varchar (variable length text) 

  • unicode character string → nchar (fixed length unicode text), ntext (multilingual text) 

  • binary → binary (0s and 1s for small binary data), image (stores large binary objects) 

  • date/time → date (year, month, day), datetime (date and time stamps) 

  • miscellaneous → json, xml (used for data transfer/serialization) 

A.2.16 - Entity-Relationship Diagram 

  • visual tool used to design databases and provide a structured blueprint for a database, ensuring clarity/consistency/efficiency during development

  • helps prevent redundancy and data loss

  • include entities, attributes, and relationships 

  • memorize the symbols 

A.2.17 - Constructing Relational Databases 

A.2.18 - Query in a Database 

  • a query is a request for information

  • a query asks the database to return data that meet certain conditions 

  • a query can perform multiple functions in SQL as needed

  • select using SELECT

  • filter using WHERE

  • sort data using ORDER BY

  • database view is a virtual table defined by a saved query; it shows data in its current state without changing what is stored in the underlying tables 

  • in SQL, CREATE VIEW can save a specific query
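The bullets above can be sketched with sqlite3; the `products` table and the `pricey` view are invented for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (name TEXT, price REAL)")
conn.executemany("INSERT INTO products VALUES (?, ?)",
                 [("pen", 2.0), ("book", 12.0), ("lamp", 25.0)])

# SELECT chooses columns, WHERE filters rows, ORDER BY sorts the result
rows = conn.execute(
    "SELECT name FROM products WHERE price > 5 ORDER BY price").fetchall()
# rows == [('book',), ('lamp',)]

# CREATE VIEW saves that query; the view always reflects the current table state
conn.execute("CREATE VIEW pricey AS SELECT name FROM products WHERE price > 5")
```

Inserting a new expensive product afterwards makes it appear in `pricey` immediately, without the view being rebuilt, because a view stores the query rather than the data.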

A.2.19 - Simple vs Complex Query

  • a simple query retrieves data from one table using basic commands (SELECT, FROM, and possibly WHERE)

  • a complex query involves multiple tables and advanced operations (JOIN, UNION, GROUP BY/HAVING, subqueries/nested queries, views/functions)

  • SQL parameter queries (prepared statement) separates the SQL query structure from the data value. this approach enhances security, performance and code maintainability. 

  • a parameter acts as a placeholder for a user-supplied value → minimizes the risk of SQL injection, since the input is treated as a plain value rather than executable SQL code

  • SQL and boolean operators are logical operators that combine/modify boolean expressions (true/false values) to control program flow. common operators (AND, OR, NOT) are used to make decisions in code, similar to how it is used in search queries

  • SQL derived fields are new columns created in the query by performing calculations on existing fields (e.g. adding, subtracting, or multiplying columns) 
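Two of the ideas above, parameter queries and derived fields, can be shown with sqlite3 (the `items` table and the sample injection string are invented for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (name TEXT, price REAL, qty INTEGER)")
conn.execute("INSERT INTO items VALUES ('pen', 2.0, 10)")

# Parameterized query: the driver treats the input strictly as a value, not SQL
user_input = "pen' OR '1'='1"          # a classic injection attempt
rows = conn.execute(
    "SELECT name FROM items WHERE name = ?", (user_input,)).fetchall()
# rows == [] because the whole string was compared as a literal name

# Derived field: a new column computed from existing ones
row = conn.execute(
    "SELECT name, price * qty AS total_value FROM items").fetchone()
# row == ('pen', 20.0)
```

Had the input been concatenated into the SQL string instead, the `OR '1'='1'` fragment would have become part of the query and matched every row.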

A.2.20 - Query Construction 

  • When interrogating data, queries are created through SQL commands 

  • Commands must follow a specific order of execution 

  • Commands/clauses

  • Data definition language → changes the structure of the database

  • Data manipulation language (update, insert, delete, select) → changes data stored in the structure of the database 

  • Order of execution (from, where, group by, having, select, order by, limit)
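A query using every clause in that execution order, runnable via sqlite3 (the `orders` table is invented for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, total REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [("ana", 10), ("ana", 20), ("ben", 5), ("cara", 40)])

# Written SELECT-first, but executed as:
# FROM -> WHERE -> GROUP BY -> HAVING -> SELECT -> ORDER BY -> LIMIT
rows = conn.execute("""
    SELECT customer, SUM(total) AS spent
    FROM orders
    WHERE total > 0
    GROUP BY customer
    HAVING SUM(total) >= 10
    ORDER BY spent DESC
    LIMIT 2
""").fetchall()
# rows == [('cara', 40.0), ('ana', 30.0)]
```

The execution order explains why WHERE cannot see the `spent` alias (it runs before SELECT) while ORDER BY can (it runs after), and why HAVING rather than WHERE must filter on the aggregate.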