IB Computer - Databases

Introduction to Databases

Definition and Purpose of Databases

  • Definition

    • A database is an organized collection of structured information or data, typically stored electronically in a computer system. 

    • A database is managed by a database management system (DBMS).

  • Purpose

    • The primary purpose of a database is to store, retrieve, and manage data efficiently and securely. 

      • Databases are used to organize data in a way that allows easy access, management, and updating. 

    • They support data integrity, data sharing, and data security, making them essential for applications ranging from simple personal data management to complex enterprise systems.

Types of Databases

  • Relational Databases:

    • Structure: Data is organized into tables (or relations) consisting of rows and columns.

    • Characteristics: Use SQL (Structured Query Language) for data definition and manipulation. Support ACID properties (Atomicity, Consistency, Isolation, Durability) to ensure transaction reliability.

      • Examples: MySQL, PostgreSQL, Oracle, Microsoft SQL Server.

  • NoSQL Databases:

    • Structure: Designed for unstructured or semi-structured data. These databases do not use the tabular schema of rows and columns.

    • Types:

      • Document Stores: Store data in document formats like JSON or BSON.

        • Examples: MongoDB, CouchDB.

      • Key-Value Stores: Store data as a collection of key-value pairs.

        • Examples: Redis, Amazon DynamoDB.

      • Column-Family Stores: Store data in columns rather than rows.

        • Examples: Apache Cassandra, HBase.

      • Graph Databases: Use graph structures with nodes, edges, and properties to represent and store data.

        • Examples: Neo4j, Amazon Neptune.

    • Characteristics: Flexible schema, horizontal scalability, suitable for big data and real-time web applications.

  • Other Types:

    • Hierarchical Databases: Data is organized in a tree-like structure.

      • Example: IBM Information Management System (IMS).

    • Network Databases: More flexible than hierarchical databases, allowing many-to-many relationships.

      • Example: Integrated Data Store (IDS).

Database Management System (DBMS) Functions and Architecture

  • Functions of a DBMS:

    • Data Definition: Defining and modifying database schema using Data Definition Language (DDL).

      • Examples: CREATE, ALTER, DROP commands.

    • Data Manipulation: Inserting, updating, deleting, and retrieving data using Data Manipulation Language (DML).

      • Examples: SELECT, INSERT, UPDATE, DELETE commands.

    • Data Control: Managing user permissions and access using Data Control Language (DCL).

      • Examples: GRANT, REVOKE commands.

    • Transaction Management: Ensuring ACID properties for reliable transactions.

      • Atomicity: Ensures that either all operations within a transaction complete successfully or none of them take effect.

      • Consistency: Ensures the database is in a valid state before and after a transaction.

      • Isolation: Ensures that transactions do not interfere with each other.

      • Durability: Ensures that once a transaction is committed, it will remain so, even in the event of a system failure.

    • Concurrency Control: Managing simultaneous data access to ensure consistency and prevent conflicts.

    • Data Security: Protecting data from unauthorized access and breaches.

    • Backup and Recovery: Ensuring data can be restored in case of data loss.

  • DBMS Architecture:

    • 1-Tier Architecture: Direct interaction between the user and the database. Suitable for simple applications.

    • 2-Tier Architecture: A client-server model where the user interface runs on the client (end user) and the database is stored on a server. The client communicates with the server directly.

    • 3-Tier Architecture: Adds an intermediate layer (application server) between the client and the database server. This layer handles business logic, improving security, scalability, and manageability.

      • Components:

        • Presentation Tier: User interface.

        • Application Tier: Business logic.

        • Data Tier: Database storage.

Database Design

Data Models

  • Hierarchical Data Model:

    • Structure: Data is organized into a tree-like structure with a single root and multiple levels of hierarchy.

      • Example: An organizational chart where each node represents an employee and the edges represent the reporting structure.

    • Advantages: Simple and easy to understand, fast data retrieval.

    • Disadvantages: Rigid structure, difficult to modify, limited flexibility in querying.

  • Network Data Model:

    • Structure: Similar to the hierarchical model but allows many-to-many relationships through a graph-like structure.

      • Example: A university database where students are enrolled in multiple courses and courses have multiple students.

    • Advantages: More flexible than hierarchical, supports complex relationships.

    • Disadvantages: More complex to design and maintain.

  • Relational Data Model:

    • Structure: Data is organized into tables (relations) with rows (tuples) and columns (attributes).

      • Example: A customer database with tables for customers, orders, and products.

    • Advantages: High flexibility, supports powerful query languages like SQL, easy to modify.

    • Disadvantages: Can be less efficient for certain types of data (e.g., hierarchical data).

Entity-Relationship (ER) Modeling

  • Entities: Objects or things in the real world that have a distinct existence (e.g., Student, Course).

  • Attributes: Properties or characteristics of entities (e.g., StudentID, CourseName).

  • Relationships: Associations between entities (e.g., a student enrolls in a course).

  • ER Diagrams: Visual representations of entities, attributes, and relationships. Includes entities as rectangles, attributes as ovals, and relationships as diamonds.
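
For example, the Student–Course model above could be implemented as relational tables, with the many-to-many "enrolls" relationship becoming a junction table. A minimal sketch in standard SQL (table and column names are hypothetical):

    CREATE TABLE Student (
        StudentID INT PRIMARY KEY,
        Name      VARCHAR(100)
    );

    CREATE TABLE Course (
        CourseID   INT PRIMARY KEY,
        CourseName VARCHAR(100)
    );

    -- The "enrolls" relationship gets its own table because it is many-to-many.
    CREATE TABLE Enrollment (
        StudentID INT REFERENCES Student(StudentID),
        CourseID  INT REFERENCES Course(CourseID),
        PRIMARY KEY (StudentID, CourseID)
    );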

Normalization

Normalization is the process of organizing data to reduce redundancy and improve data integrity.

  • First Normal Form (1NF):

    • Definition: Each table column should contain atomic (indivisible) values, and each column should contain values of a single type.

      • Example: Splitting a "FullName" column into "FirstName" and "LastName".

  • Second Normal Form (2NF):

    • Definition: Achieved when the table is in 1NF, and all non-key attributes are fully functionally dependent on the primary key.

      • Example: Moving columns that depend on part of a composite primary key to a separate table.

  • Third Normal Form (3NF):

    • Definition: Achieved when the table is in 2NF and no non-key attribute is transitively dependent on the primary key; every non-key attribute depends on the key directly.

      • Example: Removing transitive dependencies (e.g., if the primary key A determines B and B determines C, then C depends on A only through B, so B and C should be moved into a separate table keyed on B).

  • Boyce-Codd Normal Form (BCNF):

    • Definition: A stronger version of 3NF, where every determinant is a candidate key.

      • Example: Ensuring that for any dependency A → B, A is a superkey.
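
As a compact worked example (hypothetical names): an Orders table that stores customer details with every order, Orders(OrderID, OrderDate, CustomerID, CustomerName, CustomerCity), violates 3NF because CustomerName and CustomerCity depend on CustomerID rather than on the key OrderID. One possible decomposition:

    -- After normalization, each non-key attribute depends only on its table's key.
    CREATE TABLE Customer (
        CustomerID   INT PRIMARY KEY,
        CustomerName VARCHAR(100),
        CustomerCity VARCHAR(100)
    );

    CREATE TABLE Orders (
        OrderID    INT PRIMARY KEY,
        OrderDate  DATE,
        CustomerID INT REFERENCES Customer(CustomerID)
    );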

Functional Dependencies

  • Definition: A relationship that exists when one attribute uniquely determines another attribute.

  • Notation: Denoted as X → Y, meaning X determines Y.

    • Example: In a table with columns StudentID and StudentName, StudentID → StudentName because each StudentID is associated with a single StudentName.

Denormalization

  • Definition: The process of combining tables to reduce the complexity of queries and improve performance.

  • Purpose: Used to optimize read performance at the expense of write performance and increased redundancy.

    • Example: Combining customer and order tables to avoid join operations, allowing faster data retrieval for frequent queries.
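
One common form this takes is a pre-joined reporting table (a sketch with hypothetical tables; many DBMSs would use a materialized view for the same purpose):

    -- Customer data is stored redundantly with each order row to avoid a join at read time.
    CREATE TABLE CustomerOrders AS
    SELECT c.CustomerID, c.CustomerName, o.OrderID, o.OrderDate, o.Total
    FROM   Customers c
    JOIN   Orders o ON o.CustomerID = c.CustomerID;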

SQL

SQL Basics

  • DDL (Data Definition Language)

    • CREATE: Used to create a new table, database, index, or other objects.

      • CREATE TABLE table_name (column1 datatype, column2 datatype, ...);

    • ALTER: Used to modify an existing database object, such as a table.

      • ALTER TABLE table_name ADD column_name datatype;

    • DROP: Used to delete a table, index, or database.

      • DROP TABLE table_name;
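
Putting the three DDL commands together on a hypothetical Students table:

    CREATE TABLE Students (
        StudentID INT PRIMARY KEY,
        FirstName VARCHAR(50),
        LastName  VARCHAR(50)
    );

    ALTER TABLE Students ADD Email VARCHAR(100);

    DROP TABLE Students;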

  • DML (Data Manipulation Language)

    • INSERT: Used to add new records to a table.

      • INSERT INTO table_name (column1, column2, ...) VALUES (value1, value2, ...);

    • UPDATE: Used to modify existing records in a table.

      • UPDATE table_name SET column1 = value1, column2 = value2, ... WHERE condition;

    • DELETE: Used to delete records from a table.

      • DELETE FROM table_name WHERE condition;

  • DCL (Data Control Language)

    • GRANT: Used to give users access privileges to the database.

      • GRANT SELECT, INSERT ON table_name TO user_name;

    • REVOKE: Used to remove access privileges given to users.

      • REVOKE SELECT, INSERT ON table_name FROM user_name;

SQL Queries

  • SELECT: Used to retrieve data from a database.

    • Basic Select:

      • SELECT column1, column2 FROM table_name;

    • Select All Columns:

      • SELECT * FROM table_name;

    • With WHERE Clause:

      • SELECT column1, column2 FROM table_name WHERE condition;

    • With ORDER BY:

      • SELECT column1, column2 FROM table_name ORDER BY column1 ASC|DESC;

    • With GROUP BY and HAVING:

      • SELECT column1, COUNT(*) FROM table_name GROUP BY column1 HAVING COUNT(*) > 1;
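
For example, the clauses can be combined against a hypothetical Orders table to list customers with more than five orders this year, largest count first:

    SELECT   CustomerID, COUNT(*) AS OrderCount
    FROM     Orders
    WHERE    OrderDate >= '2024-01-01'
    GROUP BY CustomerID
    HAVING   COUNT(*) > 5
    ORDER BY OrderCount DESC;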

  • INSERT:

    • Single Row Insert:

      • INSERT INTO table_name (column1, column2, ...) VALUES (value1, value2, ...);

    • Multiple Rows Insert:

      • INSERT INTO table_name (column1, column2, ...) VALUES (value1, value2, ...), (value3, value4, ...);

  • UPDATE:

    • Update Single Column:

      • UPDATE table_name SET column1 = value1 WHERE condition;

    • Update Multiple Columns:

      • UPDATE table_name SET column1 = value1, column2 = value2 WHERE condition;

  • DELETE:

    • Delete Specific Records:

      • DELETE FROM table_name WHERE condition;

    • Delete All Records:

      • DELETE FROM table_name; (omitting the WHERE clause deletes every record in the table)

Advanced SQL

  • JOIN Operations

    • INNER JOIN: Select records with matching values in both tables.

      • SELECT columns FROM table1 INNER JOIN table2 ON table1.common_column = table2.common_column;

    • LEFT (OUTER) JOIN: Select all records from the left table and matched records from the right table.

      • SELECT columns FROM table1 LEFT JOIN table2 ON table1.common_column = table2.common_column;

    • RIGHT (OUTER) JOIN: Select all records from the right table and matched records from the left table.

      • SELECT columns FROM table1 RIGHT JOIN table2 ON table1.common_column = table2.common_column;

    • FULL (OUTER) JOIN: Select all records when there is a match in either left or right table.

      • SELECT columns FROM table1 FULL OUTER JOIN table2 ON table1.common_column = table2.common_column;
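
As a concrete illustration with hypothetical Customers and Orders tables, an INNER JOIN returns only customers who have orders, while a LEFT JOIN also keeps customers without any:

    -- Only customers with at least one order
    SELECT c.CustomerName, o.OrderID
    FROM   Customers c
    INNER JOIN Orders o ON o.CustomerID = c.CustomerID;

    -- All customers; OrderID is NULL where no matching order exists
    SELECT c.CustomerName, o.OrderID
    FROM   Customers c
    LEFT JOIN Orders o ON o.CustomerID = c.CustomerID;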

  • Subqueries

    • Subquery in SELECT:

      • SELECT column1, (SELECT column2 FROM table2 WHERE table2.common_column = table1.common_column) FROM table1;

    • Subquery in WHERE:

      • SELECT column1 FROM table1 WHERE column2 IN (SELECT column2 FROM table2 WHERE condition);

    • Subquery in FROM:

      • SELECT column1 FROM (SELECT column1, column2 FROM table2 WHERE condition) AS subquery;

  • Aggregate Functions

    • COUNT, SUM, AVG, MIN, MAX:

      • SELECT COUNT(column), SUM(column), AVG(column), MIN(column), MAX(column) FROM table_name;

  • Set Operations

    • UNION: Combines the results of two or more SELECT queries (removes duplicates).

      • SELECT column1 FROM table1 UNION SELECT column1 FROM table2;

    • UNION ALL: Combines the results of two or more SELECT queries (includes duplicates).

      • SELECT column1 FROM table1 UNION ALL SELECT column1 FROM table2;

    • INTERSECT: Returns the common records from two SELECT queries.

      • SELECT column1 FROM table1 INTERSECT SELECT column1 FROM table2;

    • EXCEPT (or MINUS): Returns the records from the first SELECT query that are not in the second SELECT query.

      • SELECT column1 FROM table1 EXCEPT SELECT column1 FROM table2;

Database Implementation

Physical Database Design

Physical database design focuses on how data will be stored and accessed on the hardware. It involves the creation of the actual storage structures and the implementation of the database on physical storage devices.

  • Tables and Storage Structures: Defining how tables will be stored on disk, including considerations for row-oriented vs. column-oriented storage.

  • Storage Media: Choosing appropriate storage media (HDDs, SSDs) based on access patterns, performance requirements, and cost.

  • Data Compression: Techniques to reduce storage space and improve I/O efficiency, such as columnar storage compression.

  • Partitioning: Dividing large tables into smaller, more manageable pieces (horizontal or vertical partitioning) to improve query performance and manageability.

  • File Organization: Organizing data files in a way that optimizes read and write operations.
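
Partitioning syntax is vendor-specific; as one hedged example, PostgreSQL-style declarative range partitioning of a hypothetical Orders table looks roughly like this:

    CREATE TABLE Orders (
        OrderID   INT,
        OrderDate DATE,
        Amount    NUMERIC
    ) PARTITION BY RANGE (OrderDate);

    -- Each partition holds one year of rows; queries filtered on OrderDate
    -- only need to scan the relevant partition.
    CREATE TABLE Orders_2024 PARTITION OF Orders
        FOR VALUES FROM ('2024-01-01') TO ('2025-01-01');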

Indexing and Hashing

Indexing and hashing are techniques used to speed up data retrieval by creating auxiliary data structures that allow faster search operations.

  • Indexing:

    • Types of Indexes:

      • Primary Indexes: An index on a set of fields that includes the primary key.

      • Secondary Indexes: Indexes on non-primary key fields.

      • Unique Indexes: Ensure all values in the indexed field are unique.

      • Composite Indexes: Indexes on multiple columns.

    • Index Data Structures:

      • B-trees: Balanced tree structures that maintain sorted data and allow searches, sequential access, insertions, and deletions in logarithmic time.

      • Hash Indexes: Use hash functions to compute the address of the data record.

      • Bitmap Indexes: Efficient for columns with a limited number of distinct values.

  • Hashing:

    • Hash Functions: Functions that map keys to positions in a hash table.

    • Collision Handling: Techniques like chaining (linked lists) or open addressing (linear probing, quadratic probing) to handle cases where multiple keys hash to the same position.
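
The index types listed above are typically created with variations of CREATE INDEX (the underlying structure, usually a B-tree, is the DBMS default; table and column names here are hypothetical):

    -- Secondary index on a non-key column
    CREATE INDEX idx_students_lastname ON Students (LastName);

    -- Unique index: enforces distinct values in the column
    CREATE UNIQUE INDEX idx_students_email ON Students (Email);

    -- Composite index over multiple columns
    CREATE INDEX idx_orders_cust_date ON Orders (CustomerID, OrderDate);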

Storage and File Organization

Storage and file organization deals with the efficient placement of data on disk to minimize access time and maximize performance.

  • File Organization:

    • Heap Files: Unordered files where new records are placed at the end.

    • Sorted Files: Files where records are sorted based on one or more fields.

    • Clustered Files: Files where records of related entities are stored close together on the disk.

  • Access Methods:

    • Sequential Access: Accessing records in a sequential order.

    • Direct Access: Accessing records directly using their address.

    • Indexed Access: Using an index to find the address of records.

  • Buffer Management: Managing the in-memory buffer that holds data pages to reduce disk I/O.

Database Performance Tuning

Database performance tuning involves optimizing the performance of a database by adjusting various parameters and configurations.

  • Query Optimization:

    • Query Execution Plans: Understanding and analyzing query execution plans to identify performance bottlenecks.

    • Query Rewriting: Modifying queries to improve performance by using efficient join strategies, eliminating unnecessary operations, and reducing the amount of data processed.

    • Use of Indexes: Ensuring appropriate indexes are in place to speed up data retrieval.

  • Hardware Tuning:

    • CPU and Memory: Allocating sufficient CPU and memory resources to the database server.

    • Disk I/O: Optimizing disk I/O through the use of SSDs, RAID configurations, and disk striping.

  • Configuration Tuning:

    • Database Parameters: Adjusting database configuration parameters such as cache size, buffer pools, and log file settings.

    • Connection Pooling: Efficiently managing database connections to handle multiple concurrent users.

  • Maintenance Tasks:

    • Index Maintenance: Regularly rebuilding or reorganizing indexes to ensure they remain efficient.

    • Statistics Updates: Keeping database statistics up to date for the query optimizer.

    • Vacuuming and Cleaning: Removing outdated or unnecessary data to improve performance and reclaim storage space.

Transactions and Concurrency Control

ACID Properties

ACID is an acronym for the four key properties that a transaction in a database system must satisfy to ensure data integrity and reliability.

  • Atomicity:

    • A transaction is an indivisible unit that is either fully completed or not executed at all.

    • If any part of the transaction fails, the entire transaction is rolled back, ensuring the database remains in a consistent state.

  • Consistency:

    • A transaction must transition the database from one valid state to another valid state.

    • Ensures that any data written to the database must be valid according to all defined rules, including constraints, cascades, triggers, and any combination thereof.

  • Isolation:

    • Transactions are executed in isolation; intermediate results within a transaction are not visible to other transactions until the transaction is committed.

    • Isolation levels include Read Uncommitted, Read Committed, Repeatable Read, and Serializable, which determine the visibility of data changes between concurrent transactions.

  • Durability:

    • Once a transaction has been committed, it remains so, even in the event of a system failure.

    • Ensures that the results of the committed transaction are permanently recorded in the database.

Transaction Management

Transaction management involves the control and coordination of various operations performed in a transaction to maintain database consistency and integrity.

  • Begin Transaction: Mark the starting point of a transaction.

  • Execute Transaction Operations: Perform various read/write operations as part of the transaction.

  • Commit: Successfully end a transaction and make all changes permanent.

  • Rollback: Undo all operations if an error occurs, reverting the database to its previous consistent state.
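
The classic illustration is a funds transfer, where both updates must succeed or neither should take effect (hypothetical Accounts table; the exact BEGIN/COMMIT keywords vary slightly between DBMSs):

    BEGIN;                               -- start the transaction
    UPDATE Accounts SET Balance = Balance - 100 WHERE AccountID = 1;
    UPDATE Accounts SET Balance = Balance + 100 WHERE AccountID = 2;
    COMMIT;                              -- make both changes permanent
    -- On any error, issue ROLLBACK instead to undo both updates.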

Concurrency Control Mechanisms

Concurrency control is vital in multi-user database environments to ensure that transactions are executed concurrently without violating the integrity of the database.

  • Locking:

    • Pessimistic Locking: Prevents other transactions from accessing data while it is being modified. Types include:

      • Exclusive Lock (X-lock): Prevents other transactions from reading or writing the locked data.

      • Shared Lock (S-lock): Allows multiple transactions to read but not modify the locked data.

    • Optimistic Locking: Transactions proceed without locking resources and validate changes before committing. If a conflict is detected, the transaction rolls back.

  • Timestamp Ordering:

    • Each transaction is assigned a unique timestamp.

    • Ensures that transactions are executed in a chronological order based on their timestamps to maintain consistency.

    • Thomas Write Rule: A refinement of basic timestamp ordering that allows more concurrency by discarding outdated write operations.

Deadlock Detection and Resolution

Deadlocks occur when two or more transactions are waiting for each other to release locks, causing a cycle of dependencies with no resolution.

  • Deadlock Detection:

    • Wait-For Graph (WFG): A directed graph where nodes represent transactions and edges represent waiting relationships. A cycle in the WFG indicates a deadlock.

    • Deadlock Detection Algorithm: Periodically checks the WFG for cycles to identify deadlocks.

  • Deadlock Resolution:

    • Timeout: Transactions are rolled back if they wait for a lock longer than a specified timeout period.

    • Deadlock Prevention: Techniques include:

      • Wait-Die Scheme: Older transactions can wait for younger transactions, but younger transactions requesting a lock held by an older transaction are rolled back (aborted).

      • Wound-Wait Scheme: Younger transactions wait for older ones, but older transactions requesting a lock held by a younger transaction preempt (force) the younger transaction to roll back.

    • Deadlock Recovery: Involves rolling back one or more transactions to break the deadlock. The choice of transaction to roll back can be based on factors such as transaction age, resource utilization, or the number of operations performed.

Database Security

Authentication and Authorization

  • Authentication:

    • The process of verifying the identity of a user or system.

    • Techniques include passwords, biometrics, two-factor authentication, and digital certificates.

    • Centralized authentication services, such as LDAP (Lightweight Directory Access Protocol) and Kerberos, are often used to manage user identities.

  • Authorization:

    • Determines what an authenticated user is allowed to do.

    • Implementation of user roles and permissions.

    • Access Control Lists (ACLs) and Role-Based Access Control (RBAC).
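
In SQL, role-based authorization is commonly expressed with roles and grants. A PostgreSQL-style sketch (assuming a user alice and a table Orders exist; role support varies by DBMS):

    CREATE ROLE analyst;                 -- define a role
    GRANT SELECT ON Orders TO analyst;   -- attach a permission to the role
    GRANT analyst TO alice;              -- assign the role to a user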

Data Encryption

  • Encryption:

    • The process of converting plain text into cipher text to prevent unauthorized access.

    • Types include symmetric (AES, DES) and asymmetric (RSA, ECC) encryption.

  • Encryption in Databases:

    • Encrypting data at rest (e.g., on disk) and data in transit (e.g., during transmission over networks).

    • Transparent Data Encryption (TDE) for protecting database files.

    • Column-level encryption for sensitive data fields.

Security Policies and Access Control Models

  • Security Policies:

    • A set of rules and practices that regulate how an organization manages, protects, and distributes sensitive information.

    • Examples include data classification policies, password policies, and incident response policies.

  • Access Control Models:

    • Discretionary Access Control (DAC): Access rights are assigned by the owner of the resource.

    • Mandatory Access Control (MAC): Access rights are regulated by a central authority based on multiple levels of security.

    • Role-Based Access Control (RBAC): Access rights are assigned based on roles within an organization.

    • Attribute-Based Access Control (ABAC): Access rights are determined by attributes (e.g., user attributes, resource attributes).

SQL Injection and Other Security Vulnerabilities

  • SQL Injection:

    • A code injection technique that exploits vulnerabilities in an application's software by injecting malicious SQL statements into an entry field for execution.

    • Prevention techniques include using parameterized queries, stored procedures, and ORM (Object-Relational Mapping) frameworks.
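
Parameterization works because user input is bound as data and never parsed as SQL. A sketch using MySQL's server-side prepared statements (hypothetical Users table; most applications would instead use placeholders in their language's database API):

    PREPARE get_user FROM 'SELECT * FROM Users WHERE Username = ?';
    SET @name = 'alice';          -- a malicious value such as ' OR 1=1 -- stays a plain string
    EXECUTE get_user USING @name;
    DEALLOCATE PREPARE get_user;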

  • Other Vulnerabilities:

    • Cross-Site Scripting (XSS): Injection of malicious scripts into web pages viewed by other users.

    • Cross-Site Request Forgery (CSRF): An attacker tricks the victim into submitting a malicious request.

    • Buffer Overflows: Occurs when more data is written to a buffer than it can hold, potentially allowing the execution of arbitrary code.

    • Man-in-the-Middle (MITM) Attacks: An attacker intercepts communication between two parties to steal or manipulate data.

Best Practices for Database Security

  • Regularly update and patch database systems to protect against known vulnerabilities.

  • Implement strong password policies and use multi-factor authentication (MFA).

  • Regularly back up data and store backups securely.

  • Monitor database activity to detect and respond to suspicious behavior.

  • Educate users and administrators about security best practices and potential threats.

Distributed Databases

Characteristics and Advantages of Distributed Databases

Characteristics:

  • Geographic Distribution: Data is stored across multiple locations, which can be spread out geographically.

  • Replication and Redundancy: Data is often replicated across multiple sites for fault tolerance and high availability.

  • Scalability: The system can scale horizontally by adding more nodes to handle increased loads.

  • Autonomy: Each site operates independently and can perform local transactions without depending on other sites.

  • Heterogeneity: Different sites may use different DBMSs, operating systems, or data models.

Advantages:

  • Reliability and Availability: Distributed databases improve system reliability and availability through data replication and redundancy.

  • Performance: By distributing data closer to the users, distributed databases can reduce query response times and improve performance.

  • Scalability: Adding more nodes to the system allows it to handle growing amounts of data and increased user load.

  • Flexibility: Distributed databases offer greater flexibility in data management and distribution, allowing for localized control and administration.

  • Cost Efficiency: Utilizing a network of less expensive, smaller servers can be more cost-effective than investing in a single large server.

Data Fragmentation, Replication, and Allocation

Data Fragmentation:

  • Horizontal Fragmentation: Dividing a table into rows, where each fragment contains a subset of the rows.

    • Example: A customer table split into fragments based on geographic regions.

  • Vertical Fragmentation: Dividing a table into columns, where each fragment contains a subset of the columns.

    • Example: Separating personal information and order details into different fragments.

  • Mixed Fragmentation: A combination of horizontal and vertical fragmentation.
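
A sketch of the horizontal case for the geographic example above (hypothetical Customers table; real distributed DBMSs manage fragments automatically):

    -- Each fragment holds the rows for one region
    CREATE TABLE Customers_EU AS SELECT * FROM Customers WHERE Region = 'EU';
    CREATE TABLE Customers_US AS SELECT * FROM Customers WHERE Region = 'US';

    -- The original table can be reconstructed as the union of its fragments
    SELECT * FROM Customers_EU
    UNION ALL
    SELECT * FROM Customers_US;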

Data Replication:

  • Full Replication: The entire database is copied and stored at multiple sites.

    • Advantage: High availability and fault tolerance.

    • Disadvantage: Increased storage requirements and potential consistency issues.

  • Partial Replication: Only selected parts of the database are replicated at different sites.

    • Advantage: Balances between availability and storage requirements.

    • Disadvantage: Complexity in ensuring data consistency.

Data Allocation:

  • Centralized Allocation: All data is stored at a single central site.

  • Decentralized Allocation: Data is distributed across multiple sites based on usage patterns, access frequencies, or organizational structure.

    • Advantage: Reduces access time and network traffic.

    • Disadvantage: Increased complexity in managing data consistency and integrity.

Distributed Query Processing

  • Query Decomposition: Breaking down a high-level query into smaller sub-queries that can be executed at different sites.

  • Data Localization: Transforming global queries into queries that reference local fragments.

  • Optimization: Choosing the most efficient strategy for executing a distributed query by considering factors such as data location, network latency, and processing power.

  • Execution: Coordinating the execution of sub-queries across different sites and aggregating the results to produce the final output.

Techniques:

  • Join Processing: Efficiently performing join operations across data stored at different sites.

  • Aggregation and Sorting: Ensuring that operations like COUNT, SUM, AVG, and ORDER BY are efficiently executed in a distributed environment.

  • Data Shipping: Determining whether to move data to the query or the query to the data to minimize data transfer costs.

Distributed Transaction Management and Consistency

  • ACID Properties: Ensuring Atomicity, Consistency, Isolation, and Durability in distributed transactions.

  • Two-Phase Commit Protocol (2PC): A coordination protocol to ensure all-or-nothing execution of a transaction across multiple sites.

    • Phase 1 (Prepare): The coordinator asks all participating sites if they are ready to commit.

    • Phase 2 (Commit/Rollback): Based on the responses, the coordinator decides to either commit or roll back the transaction and informs all sites.
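
Some DBMSs expose the participant side of 2PC directly. A PostgreSQL-style sketch of what one participating site executes (hypothetical Accounts table and transaction identifier):

    BEGIN;
    UPDATE Accounts SET Balance = Balance - 100 WHERE AccountID = 1;
    PREPARE TRANSACTION 'transfer_42';   -- phase 1: durably record the vote to commit
    -- ...the coordinator collects votes from all sites, then decides...
    COMMIT PREPARED 'transfer_42';       -- phase 2: finalize (or ROLLBACK PREPARED on abort)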

Consistency Models:

  • Strong Consistency: Guarantees that all nodes see the same data at the same time after a transaction.

  • Eventual Consistency: Ensures that, given enough time, all nodes will converge to the same value, allowing for temporary inconsistencies.

  • Causal Consistency: Ensures that causally related operations are seen in the same order across all nodes.

Concurrency Control:

  • Distributed Locking: Managing locks across multiple sites to prevent conflicting operations.

  • Timestamp Ordering: Using timestamps to order transactions and resolve conflicts.

  • Optimistic Concurrency Control: Allowing transactions to execute without restrictions and checking for conflicts before committing.

Big Data and NoSQL Databases

Characteristics of Big Data

Big Data refers to data sets that are so large or complex that traditional data processing applications are inadequate. The characteristics of Big Data are often described by the following "V's":

  • Volume:

    • The amount of data generated and stored. Big Data is characterized by its vast quantities, often measured in terabytes, petabytes, or even exabytes.

  • Velocity:

    • The speed at which data is generated, processed, and analyzed. This includes the rate of data flow from sources such as social media, sensors, and business transactions.

  • Variety:

    • The different types of data. Big Data encompasses structured data (like databases), semi-structured data (like XML or JSON files), and unstructured data (like text, video, or images).

  • Veracity:

    • The accuracy and trustworthiness of the data. Big Data often includes uncertain or imprecise data, making data quality and validation critical.

  • Value:

    • The potential insights and benefits that can be derived from analyzing Big Data. The value of Big Data is realized through its ability to improve decision-making, uncover hidden patterns, and enhance efficiency.

  • Variability:

    • The data flow can be inconsistent, with periodic peaks and troughs. Managing such variations and ensuring timely processing can be challenging.

Types of NoSQL Databases

NoSQL databases are designed to handle a wide variety of data models, offering flexible schemas and scalability. The primary types of NoSQL databases are:

  • Document Databases:

    • Store data in JSON, BSON, or XML documents.

    • Each document can have a different structure, making it flexible for varying data formats.

    • Examples: MongoDB, CouchDB.

  • Column-Family Databases:

    • Store data in columns rather than rows, allowing for efficient retrieval of large datasets.

    • Each row can have a different number of columns.

    • Examples: Apache Cassandra, HBase.

  • Key-Value Databases:

    • Store data as a collection of key-value pairs, similar to a dictionary.

    • Simple and fast, ideal for caching and session management.

    • Examples: Redis, Riak.

  • Graph Databases:

    • Store data in nodes, edges, and properties to represent and traverse relationships.

    • Ideal for applications involving complex relationships and networks.

    • Examples: Neo4j, Amazon Neptune.

Use Cases and Comparison with Relational Databases

Use Cases:

  • Document Databases:

    • Content management systems.

    • Blogging platforms.

    • E-commerce product catalogs.

  • Column-Family Databases:

    • Real-time data analytics.

    • Time-series data storage.

    • High-frequency trading platforms.

  • Key-Value Databases:

    • Caching mechanisms.

    • Session stores.

    • Shopping cart data.

  • Graph Databases:

    • Social networking sites.

    • Fraud detection systems.

    • Recommendation engines.

Comparison with Relational Databases:

  • Schema Flexibility:

    • NoSQL: Schema-less, allowing for dynamic and flexible data models.

    • Relational: Fixed schema, requiring predefined tables and relationships.

  • Scalability:

    • NoSQL: Horizontally scalable, designed to run on distributed systems and scale out by adding more nodes.

    • Relational: Traditionally vertically scalable, scaling up by adding more resources to a single server.

  • Data Integrity and Transactions:

    • NoSQL: Typically offers eventual consistency, though some provide strong consistency and ACID transactions (e.g., MongoDB, Couchbase).

    • Relational: Strong consistency with ACID properties, ensuring reliable transactions and data integrity.

  • Performance:

    • NoSQL: Optimized for large-scale read and write operations, handling high volumes of unstructured data.

    • Relational: Optimized for complex queries and joins, suitable for structured data and transactional applications.

  • Use Cases:

    • NoSQL: Ideal for Big Data, real-time web applications, and scenarios requiring flexible schema and high scalability.

    • Relational: Best for applications requiring complex queries, transactions, and structured data management.

IB Computer - Databases

Introduction to Databases

Definition and Purpose of Databases

  • Definition

    • A database is an organized collection of structured information or data, typically stored electronically in a computer system. 

    • A database is managed by a database management system (DBMS).

  • Purpose

    • The primary purpose of a database is to store, retrieve, and manage data efficiently and securely. 

      • Databases are used to organize data in a way that allows easy access, management, and updating. 

    • They support data integrity, data sharing, and data security, making them essential for applications ranging from simple personal data management to complex enterprise systems.

Types of Databases

  • Relational Databases:

    • Structure: Data is organized into tables (or relations) consisting of rows and columns.

    • Characteristics: Use of SQL (Structured Query Language) for data manipulation. Supports ACID properties (Atomicity, Consistency, Isolation, Durability) to ensure transaction reliability.

      • Examples: MySQL, PostgreSQL, Oracle, Microsoft SQL Server.

  • NoSQL Databases:

    • Structure: Designed for unstructured or semi-structured data. These databases do not use the tabular schema of rows and columns.

    • Types:

      • Document Stores: Store data in document formats like JSON or BSON.

        • Examples: MongoDB, CouchDB.

      • Key-Value Stores: Store data as a collection of key-value pairs.

        • Examples: Redis, Amazon DynamoDB.

      • Column-Family Stores: Store data in columns rather than rows.

        • Examples: Apache Cassandra, HBase.

      • Graph Databases: Use graph structures with nodes, edges, and properties to represent and store data.

        • Examples: Neo4j, Amazon Neptune.

    • Characteristics: Flexible schema, horizontal scalability, suitable for big data and real-time web applications.

  • Other Types:

    • Hierarchical Databases: Data is organized in a tree-like structure.

      • Example: IBM Information Management System (IMS).

    • Network Databases: More flexible than hierarchical databases, allowing many-to-many relationships.

      • Example: Integrated Data Store (IDS).

Database Management System (DBMS) Functions and Architecture

  • Functions of a DBMS:

    • Data Definition: Defining and modifying database schema using Data Definition Language (DDL).

      • Examples: CREATE, ALTER, DROP commands.

    • Data Manipulation: Inserting, updating, deleting, and retrieving data using Data Manipulation Language (DML).

      • Examples: SELECT, INSERT, UPDATE, DELETE commands.

    • Data Control: Managing user permissions and access using Data Control Language (DCL).

      • Examples: GRANT, REVOKE commands.

    • Transaction Management: Ensuring ACID properties for reliable transactions.

      • Atomicity: Ensures that all operations within a transaction are completed successfully.

      • Consistency: Ensures the database is in a valid state before and after a transaction.

      • Isolation: Ensures that transactions do not interfere with each other.

      • Durability: Ensures that once a transaction is committed, it will remain so, even in the event of a system failure.

    • Concurrency Control: Managing simultaneous data access to ensure consistency and prevent conflicts.

    • Data Security: Protecting data from unauthorized access and breaches.

    • Backup and Recovery: Ensuring data can be restored in case of data loss.

  • DBMS Architecture:

    • 1-Tier Architecture: Direct interaction between the user and the database. Suitable for simple applications.

    • 2-Tier Architecture: A client-server model where the user interface runs on the client (end user) and the database is stored on a server. The client communicates with the server directly.

    • 3-Tier Architecture: Adds an intermediate layer (application server) between the client and the database server. This layer handles business logic, improving security, scalability, and manageability.

      • Components:

        • Presentation Tier: User interface.

        • Application Tier: Business logic.

        • Data Tier: Database storage.

Database Design

Data Models

  • Hierarchical Data Model:

    • Structure: Data is organized into a tree-like structure with a single root and multiple levels of hierarchy.

      • Example: An organizational chart where each node represents an employee and the edges represent the reporting structure.

    • Advantages: Simple and easy to understand, fast data retrieval.

    • Disadvantages: Rigid structure, difficult to modify, limited flexibility in querying.

  • Network Data Model:

    • Structure: Similar to the hierarchical model but allows many-to-many relationships through a graph-like structure.

      • Example: A university database where students are enrolled in multiple courses and courses have multiple students.

    • Advantages: More flexible than hierarchical, supports complex relationships.

    • Disadvantages: More complex to design and maintain.

  • Relational Data Model:

    • Structure: Data is organized into tables (relations) with rows (tuples) and columns (attributes).

      • Example: A customer database with tables for customers, orders, and products.

    • Advantages: High flexibility, supports powerful query languages like SQL, easy to modify.

    • Disadvantages: Can be less efficient for certain types of data (e.g., hierarchical data).

Entity-Relationship (ER) Modeling

  • Entities: Objects or things in the real world that have a distinct existence (e.g., Student, Course).

  • Attributes: Properties or characteristics of entities (e.g., StudentID, CourseName).

  • Relationships: Associations between entities (e.g., a student enrolls in a course).

  • ER Diagrams: Visual representations of entities, attributes, and relationships. Includes entities as rectangles, attributes as ovals, and relationships as diamonds.

Normalization

Normalization is the process of organizing data to reduce redundancy and improve data integrity.

  • First Normal Form (1NF):

    • Definition: Each table column should contain atomic (indivisible) values, and each column should contain values of a single type.

      • Example: Splitting a "FullName" column into "FirstName" and "LastName".

  • Second Normal Form (2NF):

    • Definition: Achieved when the table is in 1NF, and all non-key attributes are fully functionally dependent on the primary key.

      • Example: Moving columns that depend on part of a composite primary key to a separate table.

  • Third Normal Form (3NF):

    • Definition: Achieved when the table is in 2NF, and all the attributes are functionally dependent only on the primary key.

      • Example: Removing transitive dependencies (e.g., if A depends on B, and B depends on C, then A should directly depend on C).

  • Boyce-Codd Normal Form (BCNF):

    • Definition: A stronger version of 3NF, where every determinant is a candidate key.

      • Example: Ensuring that for any dependency A → B, A is a superkey.

Functional Dependencies

  • Definition: A relationship that exists when one attribute uniquely determines another attribute.

  • Notation: Denoted as X → Y, meaning X determines Y.

    • Example: In a table with columns StudentID and StudentName, StudentID → StudentName because each StudentID is associated with a single StudentName.

Denormalisation

  • Definition: The process of combining tables to reduce the complexity of queries and improve performance.

  • Purpose: Used to optimize read performance at the expense of write performance and increased redundancy.

    • Example: Combining customer and order tables to avoid join operations, allowing faster data retrieval for frequent queries.

SQL

SQL Basics

  • DDL (Data Definition Language)

    • CREATE: Used to create a new table, database, index, or other objects.

      • CREATE TABLE table_name (column1 datatype, column2 datatype, ...);

    • ALTER: Used to modify an existing database object, such as a table.

      • ALTER TABLE table_name ADD column_name datatype;

    • DROP: Used to delete a table, index, or database.

      • DROP TABLE table_name;

  • DML (Data Manipulation Language)

    • INSERT: Used to add new records to a table.

      • INSERT INTO table_name (column1, column2, ...) VALUES (value1, value2, ...);

    • UPDATE: Used to modify existing records in a table.

      • UPDATE table_name SET column1 = value1, column2 = value2, ... WHERE condition;

    • DELETE: Used to delete records from a table.

      • DELETE FROM table_name WHERE condition;

  • DCL (Data Control Language)

    • GRANT: Used to give users access privileges to the database.

      • GRANT SELECT, INSERT ON table_name TO user_name;

    • REVOKE: Used to remove access privileges given to users.

      • REVOKE SELECT, INSERT ON table_name FROM user_name;

SQL Queries

  • SELECT: Used to retrieve data from a database.

    • Basic Select:

      • SELECT column1, column2 FROM table_name;

    • Select All Columns:

      • SELECT * FROM table_name;

    • With WHERE Clause:

      • SELECT column1, column2 FROM table_name WHERE condition;

    • With ORDER BY:

      • SELECT column1, column2 FROM table_name ORDER BY column1 ASC|DESC;

    • With GROUP BY and HAVING:

      • SELECT column1, COUNT(*) FROM table_name GROUP BY column1 HAVING COUNT(*) > 1;

  • INSERT:

    • Single Row Insert:

      • INSERT INTO table_name (column1, column2, ...) VALUES (value1, value2, ...);

    • Multiple Rows Insert:

      • INSERT INTO table_name (column1, column2, ...) VALUES (value1, value2, ...), (value3, value4, ...);

  • UPDATE:

    • Update Single Column:

      • UPDATE table_name SET column1 = value1 WHERE condition;

    • Update Multiple Columns:

      • UPDATE table_name SET column1 = value1, column2 = value2 WHERE condition;

  • DELETE:

    • Delete Specific Records:

      • DELETE FROM table_name WHERE condition;

    • Delete All Records:

      • DELETE FROM table_name; (without WHERE clause)

Advanced SQL

  • JOIN Operations

    • INNER JOIN: Select records with matching values in both tables.

      • SELECT columns FROM table1 INNER JOIN table2 ON table1.common_column = table2.common_column;

    • LEFT (OUTER) JOIN: Select all records from the left table and matched records from the right table.

      • SELECT columns FROM table1 LEFT JOIN table2 ON table1.common_column = table2.common_column;

    • RIGHT (OUTER) JOIN: Select all records from the right table and matched records from the left table.

      • SELECT columns FROM table1 RIGHT JOIN table2 ON table1.common_column = table2.common_column;

    • FULL (OUTER) JOIN: Select all records when there is a match in either left or right table.

      • SELECT columns FROM table1 FULL OUTER JOIN table2 ON table1.common_column = table2.common_column;

  • Subqueries

    • Subquery in SELECT:

      • SELECT column1, (SELECT column2 FROM table2 WHERE table2.common_column = table1.common_column) FROM table1;

    • Subquery in WHERE:

      • SELECT column1 FROM table1 WHERE column2 IN (SELECT column2 FROM table2 WHERE condition);

    • Subquery in FROM:

      • SELECT column1 FROM (SELECT column1, column2 FROM table2 WHERE condition) AS subquery;

  • Aggregate Functions

    • COUNT, SUM, AVG, MIN, MAX:

      • SELECT COUNT(column), SUM(column), AVG(column), MIN(column), MAX(column) FROM table_name;

  • Set Operations

    • UNION: Combines the results of two or more SELECT queries (removes duplicates).

      • SELECT column1 FROM table1 UNION SELECT column1 FROM table2;

    • UNION ALL: Combines the results of two or more SELECT queries (includes duplicates).

      • SELECT column1 FROM table1 UNION ALL SELECT column1 FROM table2;

    • INTERSECT: Returns the common records from two SELECT queries.

      • SELECT column1 FROM table1 INTERSECT SELECT column1 FROM table2;

    • EXCEPT (or MINUS): Returns the records from the first SELECT query that are not in the second SELECT query.

      • SELECT column1 FROM table1 EXCEPT SELECT column1 FROM table2;

Database Implementation

Physical Database Design

Physical database design focuses on how data will be stored and accessed on the hardware. It involves the creation of the actual storage structures and the implementation of the database on physical storage devices.

  • Tables and Storage Structures: Defining how tables will be stored on disk, including considerations for row-oriented vs. column-oriented storage.

  • Storage Media: Choosing appropriate storage media (HDDs, SSDs) based on access patterns, performance requirements, and cost.

  • Data Compression: Techniques to reduce storage space and improve I/O efficiency, such as columnar storage compression.

  • Partitioning: Dividing large tables into smaller, more manageable pieces (horizontal or vertical partitioning) to improve query performance and manageability.

  • File Organization: Organizing data files in a way that optimizes read and write operations.

Indexing and Hashing

Indexing and hashing are techniques used to speed up data retrieval by creating auxiliary data structures that allow faster search operations.

  • Indexing:

    • Types of Indexes:

      • Primary Indexes: An index on a set of fields that includes the primary key.

      • Secondary Indexes: Indexes on non-primary key fields.

      • Unique Indexes: Ensure all values in the indexed field are unique.

      • Composite Indexes: Indexes on multiple columns.

    • Index Data Structures:

      • B-trees: Balanced tree structures that maintain sorted data and allow searches, sequential access, insertions, and deletions in logarithmic time.

      • Hash Indexes: Use hash functions to compute the address of the data record.

      • Bitmap Indexes: Efficient for columns with a limited number of distinct values.

  • Hashing:

    • Hash Functions: Functions that map keys to positions in a hash table.

    • Collision Handling: Techniques like chaining (linked lists) or open addressing (linear probing, quadratic probing) to handle cases where multiple keys hash to the same position.

Storage and File Organization

Storage and file organization deals with the efficient placement of data on disk to minimize access time and maximize performance.

  • File Organization:

    • Heap Files: Unordered files where new records are placed at the end.

    • Sorted Files: Files where records are sorted based on one or more fields.

    • Clustered Files: Files where records of related entities are stored close together on the disk.

  • Access Methods:

    • Sequential Access: Accessing records in a sequential order.

    • Direct Access: Accessing records directly using their address.

    • Indexed Access: Using an index to find the address of records.

  • Buffer Management: Managing the in-memory buffer that holds data pages to reduce disk I/O.

Database Performance Tuning

Database performance tuning involves optimizing the performance of a database by adjusting various parameters and configurations.

  • Query Optimization:

    • Query Execution Plans: Understanding and analyzing query execution plans to identify performance bottlenecks.

    • Rewrite Queries: Modifying queries to improve performance by using efficient join strategies, eliminating unnecessary operations, and reducing the amount of data processed.

    • Use of Indexes: Ensuring appropriate indexes are in place to speed up data retrieval.

  • Hardware Tuning:

    • CPU and Memory: Allocating sufficient CPU and memory resources to the database server.

    • Disk I/O: Optimizing disk I/O through the use of SSDs, RAID configurations, and disk striping.

  • Configuration Tuning:

    • Database Parameters: Adjusting database configuration parameters such as cache size, buffer pools, and log file settings.

    • Connection Pooling: Efficiently managing database connections to handle multiple concurrent users.

  • Maintenance Tasks:

    • Index Maintenance: Regularly rebuilding or reorganizing indexes to ensure they remain efficient.

    • Statistics Updates: Keeping database statistics up to date for the query optimizer.

    • Vacuuming and Cleaning: Removing outdated or unnecessary data to improve performance and reclaim storage space.

Transactions and Concurrency Control

ACID Properties

ACID is an acronym for the four key properties of a transaction in a database system to ensure data integrity and reliability.

  • Atomicity:

    • A transaction is an indivisible unit that is either fully completed or not executed at all.

    • If any part of the transaction fails, the entire transaction is rolled back, ensuring the database remains in a consistent state.

  • Consistency:

    • A transaction must transition the database from one valid state to another valid state.

    • Ensures that any data written to the database must be valid according to all defined rules, including constraints, cascades, triggers, and any combination thereof.

  • Isolation:

    • Transactions are executed in isolation; intermediate results within a transaction are not visible to other transactions until the transaction is committed.

    • Isolation levels include Read Uncommitted, Read Committed, Repeatable Read, and Serializable, which determine the visibility of data changes between concurrent transactions.

  • Durability:

    • Once a transaction has been committed, it remains so, even in the event of a system failure.

    • Ensures that the results of the committed transaction are permanently recorded in the database.

Transaction Management

Transaction management involves the control and coordination of various operations performed in a transaction to maintain database consistency and integrity.

  • Begin Transaction: Mark the starting point of a transaction.

  • Execute Transaction Operations: Perform various read/write operations as part of the transaction.

  • Commit: Successfully end a transaction and make all changes permanent.

  • Rollback: Undo all operations if an error occurs, reverting the database to its previous consistent state.

Concurrency Control Mechanisms

Concurrency control is vital in multi-user database environments to ensure that transactions are executed concurrently without violating the integrity of the database.

  • Locking:

    • Pessimistic Locking: Prevents other transactions from accessing data while it is being modified. Types include:

      • Exclusive Lock (X-lock): Prevents both read and write access to the locked data.

      • Shared Lock (S-lock): Allows multiple transactions to read but not modify the locked data.

    • Optimistic Locking: Transactions proceed without locking resources and validate changes before committing. If a conflict is detected, the transaction rolls back.

  • Timestamp Ordering:

    • Each transaction is assigned a unique timestamp.

    • Ensures that transactions are executed in a chronological order based on their timestamps to maintain consistency.

    • Thomas Write Rule: A refinement of basic timestamp ordering that allows more concurrency by discarding outdated write operations.

Deadlock Detection and Resolution

Deadlocks occur when two or more transactions are waiting for each other to release locks, causing a cycle of dependencies with no resolution.

  • Deadlock Detection:

    • Wait-For Graph (WFG): A directed graph where nodes represent transactions and edges represent waiting relationships. A cycle in the WFG indicates a deadlock.

    • Deadlock Detection Algorithm: Periodically checks the WFG for cycles to identify deadlocks.

  • Deadlock Resolution:

    • Timeout: Transactions are rolled back if they wait for a lock longer than a specified timeout period.

    • Deadlock Prevention: Techniques include:

      • Wait-Die Scheme: Older transactions can wait for younger transactions, but younger transactions requesting a lock held by an older transaction are rolled back (aborted).

      • Wound-Wait Scheme: Younger transactions wait for older ones, but older transactions requesting a lock held by a younger transaction preempt (force) the younger transaction to roll back.

    • Deadlock Recovery: Involves rolling back one or more transactions to break the deadlock. The choice of transaction to roll back can be based on factors such as transaction age, resource utilization, or the number of operations performed.

Database Security

Authentication and Authorization

  • Authentication:

    • The process of verifying the identity of a user or system.

    • Techniques include passwords, biometrics, two-factor authentication, and digital certificates.

    • User authentication mechanisms, such as LDAP (Lightweight Directory Access Protocol) and Kerberos.

  • Authorization:

    • Determines what an authenticated user is allowed to do.

    • Implementation of user roles and permissions.

    • Access Control Lists (ACLs) and Role-Based Access Control (RBAC).

Data Encryption

  • Encryption:

    • The process of converting plain text into cipher text to prevent unauthorized access.

    • Types include symmetric (AES, DES) and asymmetric (RSA, ECC) encryption.

  • Encryption in Databases:

    • Encrypting data at rest (e.g., on disk) and data in transit (e.g., during transmission over networks).

    • Transparent Data Encryption (TDE) for protecting database files.

    • Column-level encryption for sensitive data fields.

Security Policies and Access Control Models

  • Security Policies:

    • A set of rules and practices that regulate how an organization manages, protects, and distributes sensitive information.

    • Examples include data classification policies, password policies, and incident response policies.

  • Access Control Models:

    • Discretionary Access Control (DAC): Access rights are assigned by the owner of the resource.

    • Mandatory Access Control (MAC): Access rights are regulated by a central authority based on multiple levels of security.

    • Role-Based Access Control (RBAC): Access rights are assigned based on roles within an organization.

    • Attribute-Based Access Control (ABAC): Access rights are determined by attributes (e.g., user attributes, resource attributes).

SQL Injection and Other Security Vulnerabilities

  • SQL Injection:

    • A code injection technique that exploits vulnerabilities in an application's software by injecting malicious SQL statements into an entry field for execution.

    • Prevention techniques include using parameterized queries, stored procedures, and ORM (Object-Relational Mapping) frameworks.

  • Other Vulnerabilities:

    • Cross-Site Scripting (XSS): Injection of malicious scripts into web pages viewed by other users.

    • Cross-Site Request Forgery (CSRF): An attacker tricks the victim into submitting a malicious request.

    • Buffer Overflows: Occurs when more data is written to a buffer than it can hold, potentially allowing the execution of arbitrary code.

    • Man-in-the-Middle (MITM) Attacks: An attacker intercepts communication between two parties to steal or manipulate data.

Best Practices for Database Security

  • Regularly update and patch database systems to protect against known vulnerabilities.

  • Implement strong password policies and use multi-factor authentication (MFA).

  • Regularly back up data and store backups securely.

  • Monitor database activity to detect and respond to suspicious behavior.

  • Educate users and administrators about security best practices and potential threats.

Distributed Databases

Characteristics and Advantages of Distributed Databases

Characteristics:

  • Geographic Distribution: Data is stored across multiple locations, which can be spread out geographically.

  • Replication and Redundancy: Data is often replicated across multiple sites for fault tolerance and high availability.

  • Scalability: The system can scale horizontally by adding more nodes to handle increased loads.

  • Autonomy: Each site operates independently and can perform local transactions without depending on other sites.

  • Heterogeneity: Different sites may run different DBMSs, operating systems, or data models.

Advantages:

  • Reliability and Availability: Distributed databases improve system reliability and availability through data replication and redundancy.

  • Performance: By distributing data closer to the users, distributed databases can reduce query response times and improve performance.

  • Scalability: Adding more nodes to the system allows it to handle growing amounts of data and increased user load.

  • Flexibility: Distributed databases offer greater flexibility in data management and distribution, allowing for localized control and administration.

  • Cost Efficiency: Utilizing a network of less expensive, smaller servers can be more cost-effective than investing in a single large server.

Data Fragmentation, Replication, and Allocation

Data Fragmentation:

  • Horizontal Fragmentation: Dividing a table into rows, where each fragment contains a subset of the rows.

    • Example: A customer table split into fragments based on geographic regions.

  • Vertical Fragmentation: Dividing a table into columns, where each fragment contains a subset of the columns.

    • Example: Separating personal information and order details into different fragments.

  • Mixed Fragmentation: A combination of horizontal and vertical fragmentation (both styles are illustrated in the sketch below).
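
A tiny sketch of both fragmentation styles over an in-memory "table" of Python dicts; the column names and regions are illustrative.

```python
customers = [
    {"id": 1, "region": "EU", "name": "Ada",  "card": "4111-0000"},
    {"id": 2, "region": "US", "name": "Alan", "card": "5500-0000"},
]

# Horizontal fragmentation: each site holds a subset of the ROWS
eu_fragment = [r for r in customers if r["region"] == "EU"]
us_fragment = [r for r in customers if r["region"] == "US"]

# Vertical fragmentation: each site holds a subset of the COLUMNS,
# keeping the primary key in every fragment so rows can be rejoined
personal = [{"id": r["id"], "name": r["name"]} for r in customers]
payment  = [{"id": r["id"], "card": r["card"]} for r in customers]
```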

Data Replication:

  • Full Replication: The entire database is copied and stored at multiple sites.

    • Advantage: High availability and fault tolerance.

    • Disadvantage: Increased storage requirements and potential consistency issues.

  • Partial Replication: Only selected parts of the database are replicated at different sites.

    • Advantage: Balances between availability and storage requirements.

    • Disadvantage: Complexity in ensuring data consistency.

Data Allocation:

  • Centralized Allocation: All data is stored at a single central site; this is simple to manage but creates a single point of failure and a potential network bottleneck.

  • Decentralized Allocation: Data is distributed across multiple sites based on usage patterns, access frequencies, or organizational structure.

    • Advantage: Reduces access time and network traffic.

    • Disadvantage: Increased complexity in managing data consistency and integrity.

Distributed Query Processing

  • Query Decomposition: Breaking down a high-level query into smaller sub-queries that can be executed at different sites.

  • Data Localization: Transforming global queries into queries that reference local fragments.

  • Optimization: Choosing the most efficient strategy for executing a distributed query by considering factors such as data location, network latency, and processing power.

  • Execution: Coordinating the execution of sub-queries across different sites and aggregating the results to produce the final output.

Techniques:

  • Join Processing: Efficiently performing join operations across data stored at different sites.

  • Aggregation and Sorting: Ensuring that operations like COUNT, SUM, AVG, and ORDER BY are efficiently executed in a distributed environment.

  • Data Shipping: Determining whether to move data to the query or the query to the data to minimize data transfer costs; the sketch below shows the partial-aggregation idea.
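
A minimal scatter-gather sketch: each site computes partial aggregates locally and only the partials are shipped to the coordinator, instead of the raw rows. The per-site numbers are invented.

```python
# Hypothetical per-site partials for: SELECT COUNT(*), SUM(amount), AVG(amount)
site_partials = [
    {"count": 120, "sum": 3_600.0},   # computed locally at site A
    {"count":  80, "sum": 2_000.0},   # computed locally at site B
]

# The coordinator combines the partials to form the final answer
total_count = sum(p["count"] for p in site_partials)
total_sum   = sum(p["sum"]   for p in site_partials)
avg = total_sum / total_count   # AVG must be rebuilt from SUM/COUNT,
                                # not by averaging the per-site averages
print(total_count, total_sum, avg)
```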

Distributed Transaction Management and Consistency

  • ACID Properties: Ensuring Atomicity, Consistency, Isolation, and Durability in distributed transactions.

  • Two-Phase Commit Protocol (2PC): A coordination protocol to ensure all-or-nothing execution of a transaction across multiple sites.

    • Phase 1 (Prepare): The coordinator asks all participating sites if they are ready to commit.

    • Phase 2 (Commit/Rollback): Based on the responses, the coordinator decides to either commit or roll back the transaction and informs all sites (see the sketch below).
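
A simplified coordinator loop for 2PC (timeouts, write-ahead logging, and crash recovery are omitted); the Site class is an invented stand-in for a real participant.

```python
def two_phase_commit(participants):
    # Phase 1 (Prepare): every site votes on whether it can commit
    votes = [site.prepare() for site in participants]
    if all(votes):
        # Phase 2 (Commit): unanimous "yes" -> everyone commits
        for site in participants:
            site.commit()
        return "committed"
    # Any "no" vote -> everyone rolls back
    for site in participants:
        site.rollback()
    return "rolled back"

class Site:
    def __init__(self, ready):
        self.ready = ready
    def prepare(self):
        return self.ready
    def commit(self):
        print("commit")
    def rollback(self):
        print("rollback")

print(two_phase_commit([Site(True), Site(False)]))  # -> rolled back
```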

Consistency Models:

  • Strong Consistency: Guarantees that all nodes see the same data at the same time after a transaction.

  • Eventual Consistency: Ensures that, given enough time, all nodes will converge to the same value, allowing for temporary inconsistencies.

  • Causal Consistency: Ensures that causally related operations are seen in the same order across all nodes.

Concurrency Control:

  • Distributed Locking: Managing locks across multiple sites to prevent conflicting operations.

  • Timestamp Ordering: Using timestamps to order transactions and resolve conflicts.

  • Optimistic Concurrency Control: Allowing transactions to execute without restrictions and checking for conflicts only at commit time (a version-check sketch follows).
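
A minimal version-check sketch of optimistic concurrency control; a concurrent write is simulated by bumping the record's version between read and commit.

```python
record = {"value": 100, "version": 1}

def read(rec):
    # Remember the version observed at read time
    return rec["value"], rec["version"]

def commit(rec, new_value, read_version):
    # Validation phase: abort if another transaction committed in between
    if rec["version"] != read_version:
        return False                      # conflict -> caller must retry
    rec["value"], rec["version"] = new_value, rec["version"] + 1
    return True

v, ver = read(record)
record["version"] = 2                     # a concurrent writer sneaks in
print(commit(record, v + 50, ver))        # False: conflict detected
```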

Big Data and NoSQL Databases

Characteristics of Big Data

Big Data refers to data sets that are so large or complex that traditional data processing applications are inadequate. The characteristics of Big Data are often described by the following "V's":

  • Volume:

    • The amount of data generated and stored. Big Data is characterized by its vast quantities, often measured in terabytes, petabytes, or even exabytes.

  • Velocity:

    • The speed at which data is generated, processed, and analyzed. This includes the rate of data flow from sources such as social media, sensors, and business transactions.

  • Variety:

    • The different types of data. Big Data encompasses structured data (like databases), semi-structured data (like XML or JSON files), and unstructured data (like text, video, or images).

  • Veracity:

    • The accuracy and trustworthiness of the data. Big Data often includes uncertain or imprecise data, making data quality and validation critical.

  • Value:

    • The potential insights and benefits that can be derived from analyzing Big Data. The value of Big Data is realized through its ability to improve decision-making, uncover hidden patterns, and enhance efficiency.

  • Variability:

    • The data flow can be inconsistent, with periodic peaks and troughs. Managing such variations and ensuring timely processing can be challenging.

Types of NoSQL Databases

NoSQL databases are designed to handle a wide variety of data models, offering flexible schemas and scalability. The primary types are listed below; a short sketch after the list shows how the same record might be modeled in each style:

  • Document Databases:

    • Store data in JSON, BSON, or XML documents.

    • Each document can have a different structure, making it flexible for varying data formats.

    • Examples: MongoDB, CouchDB.

  • Column-Family Databases:

    • Store data in columns rather than rows, allowing for efficient retrieval of large datasets.

    • Each row can have a different number of columns.

    • Examples: Apache Cassandra, HBase.

  • Key-Value Databases:

    • Store data as a collection of key-value pairs, similar to a dictionary.

    • Simple and fast, ideal for caching and session management.

    • Examples: Redis, Riak.

  • Graph Databases:

    • Store data in nodes, edges, and properties to represent and traverse relationships.

    • Ideal for applications involving complex relationships and networks.

    • Examples: Neo4j, Amazon Neptune.
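
To make the four models tangible, the sketch below shows one hypothetical order represented in each style using plain Python structures; real systems layer their own APIs and query languages on top of these shapes.

```python
# Document model: one self-contained, JSON-like document per entity
order_doc = {
    "_id": "o-17",
    "customer": {"name": "Ada", "city": "London"},
    "items": [{"sku": "A1", "qty": 2}, {"sku": "B4", "qty": 1}],
}

# Key-value model: an opaque value looked up by key; fast, but no querying inside
kv_store = {"order:o-17": order_doc}

# Column-family flavour: values grouped into column families under a row key
wide_row = {"o-17": {"customer": {"name": "Ada"}, "items": {"A1": 2, "B4": 1}}}

# Graph model: explicit nodes and edges for relationship-heavy queries
nodes = {"ada": {"type": "Customer"}, "o-17": {"type": "Order"}}
edges = [("ada", "PLACED", "o-17")]
```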

Use Cases and Comparison with Relational Databases

Use Cases:

  • Document Databases:

    • Content management systems.

    • Blogging platforms.

    • E-commerce product catalogs.

  • Column-Family Databases:

    • Real-time data analytics.

    • Time-series data storage.

    • High-frequency trading platforms.

  • Key-Value Databases:

    • Caching mechanisms.

    • Session stores.

    • Shopping cart data.

  • Graph Databases:

    • Social networking sites.

    • Fraud detection systems.

    • Recommendation engines.

Comparison with Relational Databases:

  • Schema Flexibility:

    • NoSQL: Schema-less, allowing for dynamic and flexible data models.

    • Relational: Fixed schema, requiring predefined tables and relationships.

  • Scalability:

    • NoSQL: Horizontally scalable, designed to run on distributed systems and scale out by adding more nodes.

    • Relational: Traditionally vertically scalable, scaling up by adding more resources to a single server.

  • Data Integrity and Transactions:

    • NoSQL: Typically offers eventual consistency, though some provide strong consistency and ACID transactions (e.g., MongoDB, Couchbase).

    • Relational: Strong consistency with ACID properties, ensuring reliable transactions and data integrity.

  • Performance:

    • NoSQL: Optimized for large-scale read and write operations, handling high volumes of unstructured data.

    • Relational: Optimized for complex queries and joins, suitable for structured data and transactional applications.

  • Use Cases:

    • NoSQL: Ideal for Big Data, real-time web applications, and scenarios requiring flexible schema and high scalability.

    • Relational: Best for applications requiring complex queries, transactions, and structured data management.