IDM - Concepts

0.0(0)

Studied by 0 people

Call Kai

Learn

Practice Test

Spaced Repetition

Match

Flashcards

Knowt Play

Card Sorting

1/127

Earn XP

Description and Tags

For IDM, theoretical concepts to know

Last updated 7:20 AM on 3/26/26

Name	Mastery	Learn	Test	Matching	Spaced	Call with Kai

No analytics yet

Send a link to your students to track their progress

128 Terms

New cards

Inpendence Mismatch

The resistance you encounter when trying to connect & persist data between your application & the database.
(E.g. converting Java objects to Tables in PostgreSQL?)

New cards

Object Persistence

New cards

ACID

ACID is a Framework to comply to, it stands for

Atomicity
Consistency
Isolation
Durability

New cards

Transactions

Small database programs that comply to ACID.
You can essentially think of it as a set of operations that happen as one unit of work.

Think of a database transaction like a promise that a series of changes will either all happen or none will. If something goes wrong with either, the entire transaction should be “rolled back” to make it as if nothing happened

New cards

How do transactions work?

Here’s the typical life cycle of a database transaction:

Begin Transaction: The transaction starts.
Execute Operations: All the actions within the transaction (like updates, inserts, or deletions) happen.
Commit Transaction: If everything goes smoothly, the transaction commits, making the changes permanent.
Rollback Transaction: If something goes wrong, the transaction rolls back, undoing any changes.

In SQL, you usually see commands like BEGIN TRANSACTION, COMMIT, and ROLLBACK to control this process. This ensures that either all parts of the transaction succeed, or none of them do.

New cards

Hierarchy of storage

Primary Storage (e.g., RAM): Fast, limited capacity, high price, usually volatile electronic storage → Frequently used data / current work data

Secondary Storage (Hard Disk / SSD): Slower, large capacity, lower price → Main stored data

Tertiary Storage (Tape): Even slower, huge capacity, even lower price, usually offline → Backup and long-term storage of not frequently used data

New cards

Biggest trade-off relation in different media types

The faster the access time, the more the storage costs

New cards

(Spinning) Hard Disk — Structure

Tracks: each surface is divided into circular tracks

Sector (e.g., 1-8 KB) is the smallest physical transfer unit – Size is fixed by manufacturer – More sectors on outer tracks of disk
Gaps between sectors make 10% of a track – Not magnetized – Serve to find the start of a sector
Platters are fixed on the main spindle – Rotate at constant speed (5400 rpm - 10000 rpm)
Head is attached to an arm —→ Arms can position heads along the surface
Blocks are logical transfer unit (more later) – Can consist of one or more sectors

<ul><li><p>Tracks: each surface is divided into circular tracks</p></li></ul><ul><li><p>Sector (e.g., 1-8 KB) is the smallest physical transfer unit – Size is fixed by manufacturer – More sectors on outer tracks of disk </p></li><li><p>Gaps between sectors make 10% of a track – Not magnetized – Serve to find the start of a sector </p></li><li><p>Platters are fixed on the main spindle – Rotate at constant speed (5400 rpm - 10000 rpm) </p></li><li><p>Head is attached to an arm —→ Arms can position heads along the surface </p></li><li><p>Blocks are logical transfer unit (more later) – Can consist of one or more sectors</p></li></ul><p></p>

New cards

HHD - How it works

Directionally magnetization of a ferromagnetic material

Realized on hard disk platters

Magnetic grains worked into base platter to form magnetic regions

Each region represents 1 Bit

Read head can detect magnetization direction of each region
Write head can change magnetization direction

<p><strong>Directionally magnetization of a ferromagnetic material </strong></p><p>Realized on hard disk platters </p><p>Magnetic grains worked into base platter to form magnetic regions </p><ul><li><p>Each region represents 1 Bit </p></li></ul><ul><li><p>Read head can detect magnetization direction of each region</p></li><li><p>Write head can change magnetization direction</p></li></ul><p></p>

New cards

What impacts latency in HHD?

1. Communication time between processor and disk controller (■ Fraction of a millisecond  can be ignored in this course )

2. Seektime to position R/W-head at the right cylinder

■ Between 0 and 42 ms  proportional to distance
■ Start time (1 ms), movement time (10 – 40 ms), Stop time (1 ms)

3. Rotational delay until first sector of the block appears under R/W-Head

4. Transfer time during disk rotation until all sectors and gaps of the block have passed the R/W head

New cards

Buffering

Buffering refers to the process of using a temporary storage area (buffer) to hold data in transit between a sender and a receiver. In buffering, whether the communication is direct or indirect, messages exchanged between communicating processes reside in a temporary queue.

New cards

Why is Buffering needed?

Buffering is essential in scenarios where devices or processes operate at different speeds or data sizes. Key needs for buffering include:

Speed Matching: It helps to match the speed between two devices, such as between a modem and a hard disk.
Data Size Adaptation: Facilitates adaptation between devices with differing data transfer sizes.
Data Manipulation: Allows devices to manipulate or reformat data before sending or receiving.
Support for Copy Semantics: Enables temporary data retention for applications that need multiple reads of the same data.

Note: The primary purpose of buffering is to manage and regulate the data flow rate, allowing the sender to transmit data at a faster rate while the receiver processes the data at its own pace.

New cards

Explain how to efficiently use HDD

—> finding where “the block” of data is takes a lot of time so:

– Avoid Random Accesses → Reposition head for each access ==> Which is slow because of: Seek Time + Rotational Delay + Block Transfer
– Aim for sequential access →Position head once, and then just read sectors & transfer
– Aim for re-using data already stored in the Buffers (RAM)

New cards

How does writing vs updating work with HHD? — Explain the latency / time involved

Writing of blocks the same algorithm and latency as reading.

Changing of blocks (aka updates) – Not possible to do it without RAM (main memory)

1. Read block into main memory

2. Modify the data

3. Write block back to disk

4. Check for success – Latency: t_read + t_write

(If no one reads anything in the meantime, the Head is still very close to the block we just read (and now want to update)

New cards

Why are Magnetic Tapes still used?

Cost Efficiency • Durability (30+ years)
Energy Efficiency (no power when not in use)
Scalability (up to exabytes)
Security → will be safe & untouched when offline
High Sequential Read/Write Speeds (400 MB/s) // HDD (30-150 MB/s), SSD (500-3500 MB/s)

—> Note: Tape is cheap BUT drivers are expensive

New cards

What is SSD?

SSD is an alternative to hard-drives that use microchips to store data.

They don’t contain moving parts

• Use the same interface as hard disk drives – Easy replacement in most applications possible

New cards

Advantages of SSD

– Low access time and latency (remember those 12ms?)

– No moving parts  shock resistant • MTBF about 1.5 million hours

– Lighter and more energy-efficient than HDDs

New cards

Disadvantages of SSD

Divided into blocks/pages

one byte changes the whole page has to be rewritten
The old page will be marked as stale • Only whole blocks can be deleted

Limited ability of being rewritten

Expensive
The “Data” cells are not permanent, they have a limit of how many times they can be changed (e.g. writing to memory)

→ Did you know that a chip of 1GB of SSD actually has about 1,5GB of storage?

→ This is because when the cells die after a lot of use, all data gets copied and transferred to new cells

→ after time, there’s no free cells (the buffer got used), now you can’t write / change anymore and your SSD becomes “READ-ONLY”

New cards

What is RAID?

RAID stands for: (Redundant Array of Independent Disks)

RAID Array treats multiple hardware disks as a single logical disk

More HDs for increased capacity
Parallel access for increased speed
Controlled redundancy for increased reliability

New cards

How does RAID work?

The RAID controller connects to multiple hard disks

Disks are virtualized and appear to be just one single logical disk
The RAID controller acts as an extended specialized HBA (Host Bus Adapter)
Still DAS (Directly Attached Storage)

New cards

RAID 0

Splitting a file into 2 disks

Double bandwidth when sequentially reading a file → Cyclical distribution of blocks, when all blocks are read/written with uniform frequency (round robin)

Disadvantages:

No acceleration when reading an individual block.
Data loss probability grows with growing number of disks.

<p>Splitting a file into 2 disks</p><p>Double bandwidth when sequentially reading a file → Cyclical distribution of blocks, when all blocks are read/written with uniform frequency (round robin) </p><p>Disadvantages:</p><ul><li><p>No acceleration when reading an individual block.</p></li><li><p>Data loss probability grows with growing number of disks.</p></li></ul><p></p>

New cards

RAID 1

—> basically having 2 complete copies of the same data:
This results to

Faster access
A back-up copy

But obviously costs more & sometimes redundant

Also when writing —> have to do it twice, which is slower ofc

<p>—> basically having 2 complete copies of the same data:<br>This results to</p><ul><li><p>Faster access</p></li><li><p>A back-up copy</p></li></ul><p>But obviously costs more & sometimes redundant</p><p>Also when writing —> have to do it twice, which is slower ofc</p><p></p>

New cards

RAID 0 + 1 (also called RAID 10)

Combines RAID 0 and RAID 1

Double storage consumption
Increased fault tolerance (through RAID 1)
Further reduced latency when reading files
File can be read from four disks in parallel.

Downside / considerations

Reading of a single block is hardly faster than without RAID
The write rate of a file doubles.
You’re using twice as much space

<p>Combines RAID 0 and RAID 1</p><ul><li><p>Double storage consumption</p></li><li><p>Increased fault tolerance (through RAID 1)</p></li><li><p>Further reduced latency when reading files</p></li><li><p>File can be read from four disks in parallel.</p></li></ul><p>Downside / considerations</p><ul><li><p>Reading of a single block is hardly faster than without RAID</p></li><li><p>The write rate of a file doubles.</p></li><li><p>You’re using twice as much space </p></li></ul><p></p>

New cards

RAID 2

RAID-2 is a specialized RAID level that uses bit-level striping combined with error correction.

In this configuration, data is distributed at the bit level across multiple drives and a dedicated parity drive is used for error detection and correction.

While it offers strong fault tolerance and higher throughput

Downsides:

cost & complexity
reading 1 sector takes same amount of time as 1/8th of a sector
every read/write accesses all disks in parallel —→ meaning there’s no acceleration for parallel reading operations

<p><span style="line-height: 1.6;"><span>RAID-2 is a specialized RAID level that uses bit-level striping combined with error correction.</span></span></p><p><span style="line-height: 1.6;"><span>In this configuration, data is distributed at the bit level across multiple drives and a dedicated parity drive is used for error detection and correction. </span></span></p><p><span style="line-height: 1.6;"><span>While it offers strong fault tolerance and higher throughput</span></span></p><p>Downsides:</p><ul><li><p>cost & complexity</p></li><li><p>reading 1 sector takes same amount of time as 1/8th of a sector</p></li><li><p>every read/write accesses all disks in parallel —→ meaning there’s no acceleration for parallel reading operations</p></li></ul><p></p>

New cards

RAID 3

Bit-level (or Byte-level) striping (like RAID 2)

Additional parity disk for the other disks

Parity = bit-wise xor
Single disk failure can be compensated
If a disk fails, based on the parity and the data on the other disks you can “puzzle” back the lost data

Downsides:

Reading/writing requires access of all disks

<p>Bit-level (or Byte-level) striping (like RAID 2)</p><p>Additional parity disk for the other disks</p><ul><li><p>Parity = bit-wise xor</p></li><li><p>Single disk failure can be compensated</p></li><li><p>If a disk fails, based on the parity and the data on the other disks you can “puzzle” back the lost data</p></li></ul><p>Downsides:</p><ul><li><p>Reading/writing requires access of all disks</p></li></ul><p></p>

New cards

RAID 4

Better Load Balancing than RAID 3

Multiple processes can read in parallel from multiple disks.

Thus: The parity disk becomes a bottleneck

Every write must access it.
Every read must (potentially) access it for error checking.

That means the old state of A must be read when changing A; and the new parity block and the new block A‘ must be written.

<p>Better Load Balancing than RAID 3 </p><ul><li><p>Multiple processes can read in parallel from multiple disks.</p></li></ul><p>Thus: The parity disk becomes a bottleneck </p><ul><li><p>Every write must access it.</p></li><li><p>Every read must (potentially) access it for error checking.</p></li></ul><p>That means the old state of A must be read when changing A; and the new parity block and the new block A‘ must be written.</p>

New cards

RAID 5

It uses Block Striping with Distributed Parity

The parity of a specific part of data is on disk that does not contain any data blocks of that data

Much better load balancing than RAID 4

Parity disk is not bottleneck anymore
Write requires n-1 disks for data and 1 disk for Parity

Often used in practice.

Typically, slower than RAID 0+1
But better storage utilization

Good compromise between performance and storage utilization

<p><strong>It uses Block Striping with Distributed Parity</strong></p><ul><li><p>The parity of a specific part of data is on disk that does not contain any data blocks of that data</p></li></ul><ul><li><p>Much better load balancing than RAID 4</p></li></ul><ul><li><p>Parity disk is not bottleneck anymore</p></li><li><p>Write requires n-1 disks for data and 1 disk for Parity</p></li></ul><p></p><p>Often used in practice.</p><ul><li><p>Typically, slower than RAID 0+1</p></li><li><p>But better storage utilization</p></li></ul><p>Good compromise between performance and storage utilization</p>

New cards

RAID 6

RAID-6 is an advanced version of RAID-5 that provides enhanced fault tolerance by introducing double distributed parity.

This allows RAID-6 to recover from the failure of up to two disks simultaneously, making it more reliable for critical systems with larger arrays.

However, the added parity calculations can impact write performance.

New cards

Which RAID options are often used?

RAID1: Simple and fast to realize, increased fault tolerance and performance, but doubles disk space requirement

RAID5: More complex, but increased fault tolerance and speed, only small increase of disk space requirement; requires at least 3 disks

New cards

What about SSD RAID?

The RAID principle is not really applicable to SSD because RAID involves a lot of writing. This is okay for HHD but as we know, with SSD a lot of writing operations in SSD damages the “data cells” and thus reduced the storage & performance of the SSD.

New cards

Differences RAID 3 & RAID 4?

Both use:

Striping & parity
Both are bottlenecked by the parity disk because all write operations have to also edit the parity disk

Differences:

Byte-level striping in RAID 3 → superior performance for big file transfers
Block-level striping in RAID 4 → improves random read speed.

New cards

What is load balancing?

Load balancing is the process of distributing a set of tasks over a set of resources with the aim of making their overall processing more efficient.

New cards

What is fault tolerance?

Fault tolerance in distributed systems is the capability to continue operating smoothly despite failures or errors in one or more of its components.

New cards

Primary Indexes + properties

Primary Index is a sorted file which is of fixed length size with two fields. The first field is the primary key and second field, is pointed to that specific data block. In the primary index, there is always one to one relationship between the entries in the index table.

* Created by default for the primary key attributes of a table

* Index physically reorders the whole table

* Efficient search is possible

* – Forward search, skip-forward search, binary search, etc.

New cards

Secondary Indexing + properties

- In secondary indexing, our main file is unsorted.

- Secondary indexing can be done on key as well as non key attribute.

- A non clustered or secondary index just tells us where the data lies, i.e. it gives us a list of virtual pointers or references to the location where the data is actually stored.

- We can have only dense ordering in the secondary indexing as sparse ordering is not possible because data is not sorted in the main file.

Advantages:
* Extremely beneficial for speeding up joins on foreign key constraints!

* Builds an additional data structure containing the index

– Usually, this is a B-Tree / Hash Table, etc.

– Costs space for storage and performance for updates

New cards

Hash Indexes

A hash index uses a hash function to map each value of the indexed column into a “bucket” (or slot) where related rows are stored. This mapping allows the database to quickly determine where to find a specific value by directly looking at the relevant bucket, rather than scanning the entire table.

Hash Function: A hash function takes an input (in this case, the column value, such as a shipment ID) and produces a unique “hash code.” This hash code represents the location of the data within the hash table (index).
Buckets: Each unique hash code corresponds to a “bucket,” which can store one or multiple rows depending on the hashing mechanism.

New cards

Advantages vs Disadvantages of Hash Indexes

PRO:
Insanely fast → hash indexes is constant time complexity O(1) for lookups, which means search time doesn’t increase with the number of rows, as long as there are no significant hash collisions.

CON:

No Range Queries: Hash indexes cannot be used for range queries (e.g., tracking_number BETWEEN 'ABC123000' AND 'ABC123999') since hash functions don’t preserve ordering.
Handling Collisions: Sometimes different values produce the same hash code (called a “collision”). While this is usually manageable, excessive collisions can slow down lookups.

New cards

What are B-Trees?

A B-tree is a self-balancing search tree data structure that maintains sorted data.

B-trees are generalizations of binary search trees, allowing more than two children per node. This characteristic makes B-trees particularly well-suited for systems that read and write large blocks of data, such as databases and file systems.

<p><span><span>A </span></span><strong>B-tree</strong><span><span> is a self-balancing search tree data structure that maintains sorted data.</span></span></p><p><span><span>B-trees are generalizations of binary search trees, allowing more than two children per node. This characteristic makes B-trees particularly well-suited for systems that read and write large blocks of data, such as databases and file systems.</span></span></p>

New cards

Name the properties of a B-tree

All leaves are at the same level: This ensures that the path length from the root to any leaf is always the same, providing balanced access times.
Nodes have a variable number of children: Each node in a B-tree can have a variable number of children, ranging from a predefined minimum to a maximum number. This flexibility helps maintain balance in the tree.
Keys are stored in sorted order: Each node contains keys that are sorted in non-decreasing order. This sorting allows for efficient searching and sequential access.
Nodes have a minimum and maximum number of keys: For a B-tree of order m (maximum number of children a node can have), each node (except the root) must have at least ⌈m/2⌉ - 1 keys and at most m - 1 keys. This ensures that nodes are never too sparse or too full.

New cards

Name the elements that are part of a B-Tree

A B-tree node contains several elements:

Keys: Ordered elements that act as separation values, dividing the data into subtrees.
Pointers: References to child nodes, which provide the hierarchical structure of the tree.
Leaf Indicator: A flag indicating whether a node is a leaf (contains no child pointers).

New cards

Advantages of B-Trees

Balanced Access: The balanced nature of B-trees ensures that operations like search, insertion, and deletion have consistent performance, typically in logarithmic time, O(log n).
Efficient Disk Access: B-trees are designed to minimize disk I/O operations, making them ideal for applications that require frequent data retrieval from disk storage. By using a larger node size (with multiple keys), B-trees can reduce the number of disk reads and writes.
Scalability: B-trees can efficiently handle large amounts of data due to their ability to maintain balance with a variable number of children per node. This scalability is crucial for applications with growing data needs.

New cards

Disadvantages of B-trees

Complexity of Implementation
Extra space since nodes have keys & values
Overhead in Small Data Sets
For small datasets that fit entirely in memory, the overhead of maintaining a B-tree structure may not be justified. Simpler structures can be faster and easier to manage.
Less Efficient for In-Memory Operations
B-trees are optimized for systems with slow access to large blocks of data (like disks). For purely in-memory operations, other balanced trees (e.g., red-black trees) or hash tables might offer better performance.

New cards

What are B+Trees?

B+ Trees: A B+ tree is a variant of the B-tree in which all values are stored at the leaf level, and internal nodes only store keys for indexing. B+ trees are widely used in databases for efficient range queries. Also the leave nodes are connected via a linked list to facilitate sequential access.

<p><strong>B+ Trees</strong>: A B+ tree is a variant of the B-tree in which all values are stored at the leaf level, and internal nodes only store keys for indexing. B+ trees are widely used in databases for efficient range queries. Also the leave nodes are connected via a linked list to facilitate sequential access. <br></p>

New cards

Name different Spatial Queries

Point queries
Range Queries (find all x within a certain radius)
Nearest Neighbor Queries

New cards

Explain how Point Queries work (in multi-dimensional cases)

Definition: Retrieve data points that exactly match a specific coordinate or multi-dimensional key.
Example: "Find the location exactly at (x=10, y=20)" or "Find the record with coordinates (latitude=52.0, longitude=4.3)."
Use case: When you want to find a precise object or event at an exact position in space.

New cards

Explain how Range Queries work (in multi-dimensional cases)

Definition: Retrieve all data points that lie within a specified multi-dimensional region or boundary.
Example: "Find all locations within a rectangular bounding box defined by (x1, y1) and (x2, y2)" or "Find all points within a radius of 5 km from a given coordinate."
Use case: When you want to find all objects or events within a geographic area, volume, or spatial region.

New cards

Explain how Nearest Neighbor Queries work (in multi-dimensional cases)

Query: Find the 3 nearest restaurants to your current location.
Process: The system calculates distances from your location to each restaurant, but uses a spatial index to quickly narrow down candidates.
Result: Returns the 3 restaurants closest to you.

Key Points

Indexing: Efficient NN queries rely on spatial indexes that prune the search space, avoiding exhaustive search.
Dimensionality: As the number of dimensions increases, NN search becomes more complex (curse of dimensionality), sometimes requiring approximate methods.
Growing Box: You use a rectangle that grows and “captures” the closest elements within it

<ul><li><p><strong>Query:</strong> Find the 3 nearest restaurants to your current location.</p></li><li><p><strong>Process:</strong> The system calculates distances from your location to each restaurant, but uses a spatial index to quickly narrow down candidates.</p></li><li><p><strong>Result:</strong> Returns the 3 restaurants closest to you.</p></li></ul><p></p><p>Key Points</p><ul><li><p><strong>Indexing:</strong> Efficient NN queries rely on spatial indexes that prune the search space, avoiding exhaustive search.</p></li><li><p><strong>Dimensionality:</strong> As the number of dimensions increases, NN search becomes more complex (curse of dimensionality), sometimes requiring approximate methods.</p></li><li><p><strong>Growing Box: </strong>You use a rectangle that grows and “captures” the closest elements within it</p></li></ul><p></p><p></p>

New cards

Explain how Multiple-Key Indexes work

==> You essentially do the same as other indexes only you use multiple (so e.g. all age indexes point to a salary range eventually points to an employee

New cards

Explain what KD-Trees are

KD stands for k-dimensional trees

They’re similar to binary trees, only now you filter based on multiple different attributes

Attributes alternate through levels (e.g. first age, then salary, then age again
The leaf nodes point to disk blocks with records
Essentially you’re break up the data into little boxes

New cards

Give 3 downsides of KD-Trees

1) Partial Queries are super ineffective, e.g if you want to look only on age then you must go down every branch that as a different attribute (since you can make any path choices except when it comes to age)

2) The pathlenght to a leaf is much longer than with B-trees
3) It’s much more complicated to balance kd-trees

New cards

What are efficient index for existence / single equality searches?

Especially Hash Indexes but B+-trees also work well

New cards

What are efficient index for existence / single equality searches?

B-trees

KD-trees (in cases of multi-dimensional trees)

New cards

What are R-Trees?

R-tree stands for Rectangle-tree.

It is a balanced tree structure, similar in concept to B-trees but optimized for spatial objects.
Instead of indexing single points, R-trees index minimum bounding rectangles (MBRs) that enclose spatial objects.
The tree organizes these MBRs hierarchically, allowing efficient querying of spatial data.

<p><strong>R-tree</strong> stands for <strong>Rectangle-tree</strong>.</p><ul><li><p>It is a <strong>balanced tree</strong> structure, similar in concept to B-trees but optimized for spatial objects.</p></li><li><p>Instead of indexing single points, R-trees index <strong>minimum bounding rectangles (MBRs)</strong> that enclose spatial objects.</p></li><li><p>The tree organizes these MBRs hierarchically, allowing efficient querying of spatial data.</p></li></ul><p></p>

New cards

What are downsides of R-Trees

Cons
Can suffer from overlapping MBRs, leading to less efficient pruning
Performance can degrade with highly irregular or skewed data
More complex to implement than simpler structures like k-d trees
Overlapping bounding rectangles may cause multiple subtree searches

New cards

What are R*-Trees?

Like R-trees, R*-trees organize spatial objects using Minimum Bounding Rectangles (MBRs).
Each node contains entries with MBRs that cover its child nodes or spatial objects.
The tree is balanced, with all leaf nodes at the same level.

Key Improvements Over R-trees

Forced Reinsertion:
- When a node overflows, instead of immediately splitting, some entries are removed and reinserted into the tree.
- This reinsertion helps to better organize entries and reduce overlap before splitting.
Better Node Splitting:
- Splitting tries to minimize the sum of perimeters (or margins) of resulting MBRs rather than just their areas.
- This reduces the chance of overlap between sibling nodes.
Overlap Minimization Priority:
- The algorithm prioritizes minimizing overlap between bounding rectangles, which directly improves query pruning efficiency.

New cards

How to Query Optimise — Cost-Based Algorithm

– Enumerate alternative query plans equivalent to
naïve plan
– Estimate cost for each alternative
– Choose the cheapest Query plan

Pro:

Typically better results than heuristic-based

Cons:

Designing a cost cost metric & function
more complex & expensive (research wise)

New cards

How to Query Optimise — Heuristic-Based Algorithm (aka rule of thumb)

– Start with naïve plan
– Repeat: Rewrite plan using common heuristics
which are “known to help”

Pros:

very easy & cheap
not a lot of data needed

Cons:

results can vary

New cards

Name 3 common rules for query optimization

1) convert catersian products with joins

2) convert subqueries into joins
3) project attributes as early as possible + only project attributes you actually need instead of the whole table

4) apply selections as early as possible

5) use pipelining for adjacent unary operators

New cards

Name 2 relevant cost metrics (we use in this course)

1) Size of intermediate result sets

2) Number of Block Accesses (e.g. how many accesses on secondary storage (like HHD which is slow)

New cards

What is the Canonical Operator Tree?

An unoptimized relational algebra statement representing the SQL query as provided by the application

New cards

What is the Annotated Operator Tree

An intermediate step towards the execution plan:
An operator tree also containing algorithm annotations

<p><span><span>An intermediate step towards the execution plan:</span></span><br><span><span>An operator tree also containing algorithm annotations</span></span><br></p>

New cards

How does query optimization happen?

1) Transform the canonical operator tree into a more efficient final operator tree for evaluation

2) The canonical and final tree have equal semantics, but different operators/execution orders

3) Common Heuristics and/or DB statistics are used to transform the canonical tree step-by-step
• Heuristic query optimization
• Cost-based query optimization

New cards

Give a plan for optimising heuristic wise

— Input: canonical query plan

• Step 1: Break up all selections
• Step 2: Push selections as far as possible
• Step 3: Break, Eliminate, Push and Introduce Projection. Try to project
intermediate result sets as greedily as possible

• Step 4: Collect selections and projections such that between other
operators there is only a single block of selections followed by a single
block of projections (and no projections followed by selections )
• Step 5: Combine selections and Cartesian products to joins
• Step 6: Prepare pipelining for blocks of unary operators

— Output: Improved query plan

New cards

State 2 problems with transactions

Recovery: —> what do you do when the system fails in the middle of execution?
Concurrency: what happens when 2 transactions try to access the same object?

New cards

Simple assumptions for transactions (in the IDM course)

A single central database server → no agreement & communication between multiple servers
Data stored non-redundantly → one copy of each data item
Multiple concurrent transaction need to be interleaved

New cards

List the different type of Concurrency Issues

Dirty reads
Dirty writes
Non repeatable reads
Phantom reads
Write Skew

New cards

What are Dirty Reads?

A Dirty Read in SQL occurs when a transaction reads data that has been modified by another transaction, but not yet committed. In other words, a transaction reads uncommitted data from another transaction, which can lead to incorrect or inconsistent results.

<p><span>A Dirty Read in SQL occurs when a transaction reads data that has been modified by another transaction, but not yet committed. In other words, a transaction reads uncommitted data from another transaction, which can lead to incorrect or inconsistent results.</span></p>

New cards

What are Dirty Writes?

A dirty write happens when two transactions modify the same data without proper synchronization, leading to data corruption.

<p><span>A </span><strong>dirty write</strong><span> happens when two transactions modify the same data without proper synchronization, leading to data corruption.</span></p>

New cards

What are Non-Repeatable Reads?

A non-repeatable read occurs, when during the course of a transaction, a row is retrieved twice and the values within the row differ between reads.

<p><span>A non-repeatable read occurs, when during the course of a transaction, a row is retrieved twice and the values within the row differ between reads.</span></p>

New cards

What are Phantom Reads?

A phantom read occurs when, in the course of a transaction, two identical queries are executed, and the collection of rows returned by the second query is different from the first.

<p><span>A phantom read occurs when, in the course of a transaction, two identical queries are executed, and the collection of rows returned by the second query is different from the first.</span></p>

New cards

What are Write Skews?

Write skew occurs when two or more transactions read overlapping data, then each makes decisions based on what they read, and finally, they update the data without being aware of each other's changes. This can lead to inconsistencies.

<p>Write skew occurs when two or more transactions read overlapping data, then each makes decisions based on what they read, and finally, they update the data without being aware of each other's changes. This can lead to inconsistencies.</p>

New cards

What are Lost Updates?

Basically it’s as if 1 update never happened since it got overwritten.

New cards

Difference between non-repeatable reads & phantom reads?

non-repeatable reads see a different value in the SAME ROW

phantom reads see a DIFFERENT SET of rows coming from the same query (e.g. some rows got added / deleted)

New cards

What are Isolation Levels?

Imagine you're at a library where several people are reading, borrowing, and returning books at the same time. To keep things organized and prevent chaos, the library might enforce certain rules. For example, they might say, "You can only borrow a book after it's been returned," or "You can check out a book, but others can't read it until you're done."

In a database, transaction isolation levels work in a similar way.

Transaction Isolation Levels define how & when changes made by one transaction become visible by others.

New cards

Give the 5 different type of Isolation Levels

Read Uncommitted
Read Committed
Repeatable Read
Serializable
Snapshot Isolation

New cards

Give an example of Write Skew

Example of Write Skew:

Transaction A and Transaction B both check if a bank account balance is above a certain threshold before allowing a withdrawal.
Both transactions see that the balance is sufficient and proceed with the withdrawal.
After both transactions commit, the account balance might drop below the threshold, even though each transaction checked it.

New cards

Explain Read Uncommitted

At the lowest isolation level, Read Uncommitted allows dirty reads, non-repeatable reads, and phantom reads. Transactions are free to read data that other transactions have modified but haven't yet committed. It's like being able to borrow a book that someone else is still reading and hasn't finished yet.

Pros: It's fast and doesn't involve much locking.
Cons: Consistency is very prone to errors

<p>At the lowest isolation level, <strong>Read Uncommitted</strong> allows dirty reads, non-repeatable reads, and phantom reads. Transactions are free to read data that other transactions have modified but haven't yet committed. It's like being able to borrow a book that someone else is still reading and hasn't finished yet.</p><ul><li><p><strong>Pros:</strong> It's fast and doesn't involve much locking.</p></li><li><p><strong>Cons:</strong> Consistency is very prone to errors</p></li></ul><p></p>

New cards

Explain Read Committed

Read Committed prevents dirty reads, ensuring that only committed data is read. However, it still allows non-repeatable reads and phantom reads. It's like waiting until a book is returned to the shelf before you can borrow it.

Pros:
- a good balance between data accuracy and performance.
- only read committed data
Cons:
- non-repeatable reads can still happen.
- other ones too

<p><strong>Read Committed</strong> prevents dirty reads, ensuring that only committed data is read. However, it still allows non-repeatable reads and phantom reads. It's like waiting until a book is returned to the shelf before you can borrow it.</p><ul><li><p><strong>Pros:</strong> </p><ul><li><p>a good balance between data accuracy and performance.</p></li><li><p>only read committed data</p></li></ul></li><li><p><strong>Cons:</strong> </p><ul><li><p>non-repeatable reads can still happen. </p></li><li><p>other ones too</p></li></ul></li></ul><p></p>

New cards

Explain Repeatable Read

With Repeatable Read, the database ensures that once you've read a piece of data, it won't change for the duration of your transaction. It's like checking out a book and knowing that no one else can take it until you're done.

Pros:
- prevents both dirty reads and non-repeatable reads.
- data read in your transaction remains consistent, even if other transactions are trying to modify it.
Cons:
- requires more locking
- doesn't fully prevent phantom reads

Often used with financial transaction

New cards

Explain Serializability

At the top level, Serializable ensures the highest degree of isolation. It's like closing the entire library just for yourself, so no one else can borrow or return books while you're in there.

Pros:
- You prevent all data consistency errors
- Your transactions will behave as if they're running one after the other, rather than concurrently.
Cons:
- can be very slow.
- to reduced system throughput and potentially long delays.

<p>At the top level, <strong>Serializable</strong> ensures the highest degree of isolation. It's like closing the entire library just for yourself, so no one else can borrow or return books while you're in there.</p><ul><li><p><strong>Pros:</strong> </p><ul><li><p>You prevent all data consistency errors</p></li><li><p>Your transactions will behave as if they're running one after the other, rather than concurrently.</p></li></ul></li><li><p><strong>Cons:</strong> </p><ul><li><p>can be very slow. </p></li><li><p>to reduced system throughput and potentially long delays.</p></li></ul></li></ul><p></p>

New cards

Explain Snapshot Isolation

When a transaction starts under Snapshot Isolation:
1) it gets a "snapshot" of the data as it existed at that moment.

2) Throughout the transaction, even if other transactions are updating the data, your transaction sees a consistent, unchanging view based on that initial snapshot.
3) No dirty reads, non-repeatable reads, or phantom reads

New cards

Explain the Page model (also called block / buffer model)

New cards

What is a Schedule?

New cards

When are 2 schedules equivalent?

They comprise the same set of operations
Every transaction in both schedules reads and writes the same
values from/to a given record

New cards

When is a schedule serializable?

A complete schedule is called serializable, if it is
equivalent to any serial schedule of the respective set of
transactions

<p><span>A complete schedule is called serializable, if it is</span><br><span>equivalent to any serial schedule of the respective set of</span><br><span>transactions</span></p>

New cards

What is Conflict Serializability?

Conflict Serializability ensures that a concurrent schedule produces the same result as some serial execution by reordering non-conflicting operations.

<p><span>Conflict Serializability ensures that a concurrent schedule produces the same result as some serial execution by reordering non-conflicting operations.</span></p>

New cards

What is Conflict Equivalence?

New cards

What is a conflict graph?

It’s a graph of a schedule S where:

All committed transactions of S are nodes
If there is an edge between two transactions there is at least one
conflict between them
The direction of each edge respects the ordering of the conflicting steps
There’s no obvious cycle

New cards

How does the graph of a non-conflict serializable schedule look like?

New cards

Quick overview of Serializable

New cards

What is locking? (give the 2 types)

Locking is a technique for concurrency control, locks are a mechanism to ensure data integrity.

There are 2 major types:

Shared locks let you read the row or the table that is being locked. Hence, it’s also called a read lock. No transaction is allowed to update the resource while it has a shared lock.
Exclusive locks lock the row or table entirely and let the transaction update the row in isolation. Only 1 transaction can get an exclusive lock at a time.

New cards

Locks Overview

New cards

How do you generate a conflict-serializable schedule?

with Two-Phase Locking

—> Two-phase locking protocols are a simple way to generate
only serializable schedules

New cards

What is Two-Phase Locking?

The Two-Phase Locking Protocol defines clear rules for managing data locks. It divides a transaction into two phases:

Growing Phase: In this step, the transaction gathers all the locks it needs to access the required data. During this phase, it cannot release any locks.
Shrinking Phase: Once a transaction starts releasing locks, it cannot acquire any new ones. This ensures that no other transaction interferes with the ongoing process.

<p><span style="line-height: 1.6; font-size: inherit; color: inherit;">The Two-Phase Locking Protocol defines clear rules for managing data locks. It divides a transaction into two phases:</span></p><ol><li><p><strong>Growing Phase</strong><span style="line-height: 1.6; font-size: inherit; color: inherit;">: In this step, the transaction gathers all the locks it needs to access the required data. During this phase, it cannot release any locks.</span></p></li><li><p><strong>Shrinking Phase</strong><span style="line-height: 1.6; font-size: inherit; color: inherit;">: Once a transaction starts releasing locks, it cannot acquire any new ones. This ensures that no other transaction interferes with the ongoing process.</span></p></li></ol><p></p>

New cards

Pros & Cons of 2-phase locking

Pros:

Correctness —> ensured serializability
Easy to build into transaction manager
Intuitive

Cons:

Deadlock —> especially during growing phase
Cascading rollbacks —> If one part fails, other parts need to be rolled back too
Reduced Concurrency —> Transactions can hold onto locks for a long time

New cards

What is timestamp concurrency control?

The Optimist approach:
Assumes conflicts are rare.
Transactions execute without waiting. Validation is performed via timestamps to ensure serializability.

New cards

How does timestamp concurrency control work?

It uses timestamps to determine the order of transaction execution and ensures that conflicting operations follow a consistent order.

Each transaction T is assigned a unique timestamp TS(T) when it enters the system. This timestamp determines the transaction’s place in the execution order.

The Timestamp Ordering Protocol enforces that older transactions (with smaller timestamps) are given higher priority. This prevents conflicts and ensures the execution is serializable and deadlock-free.

New cards

Timestamp vs Locking

New cards

What are database logs?

Database logs are:

ordered, persistent records of database state changes
more than journals, they’re the definitive history enabling the DBMS to uphold the ACID properties

100

New cards

What is undo logging?

Undo logging is a recovery mechanism where the database records the old value of a data element before it is overwritten.

The primary goal is to guarantee that an aborted or failed transaction has absolutely no effect on the disk