Note

0.0(0)

Take a practice test

Chat with Kai

Explore Top Notes

AP Biology Ultimate Guide

Studied by 130 people

Neuroscience and Psychiatric disease midterm notes:

Studied by 8 people

Roles and Processes in Schools

Studied by 1 person

Studied by 12 people

Studied by 5 people

Chapters 28.5-28.6: Money, banking, financial institutions

Studied by 7 people

Query Processing Notes

Query Processing

Database Systems (H) - Dr Chris Anagnostopoulos

Roadmap

External Sorting: Fundamental Operator
Strategies for SELECT
- Simple, conjunctive, disjunctive
Strategies for JOIN
- Five fundamental algorithms for JOIN
Principle: Estimate the cost of each plan, choose the plan with the minimum expected cost, and then execute!

Fundamental Tool: Sorting

Almost all SQL queries involve sorting of tuples with respect to ad-hoc sorting requests defined by the user.
- CREATE PRIMARY INDEX ON EMPLOYEE(SSN) means sort by SSN.
- ORDER BY Name means sort by Name.
- SELECT DISTINCT Salary means sort by Salary to create clusters and then identify the distinct values.
- SELECT DNO, COUNT (*) FROM EMPLOYEE GROUP BY DNO means sort by DNO to create clusters.
Fundamental Limitation: We cannot store the entire relation into memory for sorting the records (e.g., bubble sort, quick sort, heap sort, merge sort).
External Sorting: Sorting algorithm for large relations stored on disk that do not fit entirely in main memory.

External Sorting: Overview

Principle: Divide & Sort (Conquer)
- Divide: A file of b blocks into L smaller sub-files (b/L blocks each).
- Sort: Load each small sub-file to memory, sort using, e.g., quick-sort, bubble-sort and write it back to the disk.
- Merge: Merge two (or more) sorted sub-files loaded from disk in memory creating bigger sorted sub-files, that are merged in turn.

External Sorting: Overview (Lemma 1)

Lemma 1: The expected cost of the sort-merge strategy in block accesses is: 2b.(1 + log_M(L)). Where:
- b is the number of file blocks.
- M: degree of merging, i.e., number of sorted blocks merged in each loop.
- L: number of the initial sorted sub-files (before entering merging phase).
Proof: (See Knuth, Donald (1998). "Chapter 5.4.1. Multiway Merging and Replacement Selection". Sorting and Searching. The Art of Computer Programming. 3 (2nd ed.). Addison-Wesley. pp. 158–168.)
M = 2 gives the worst-case performance because it merges in parallel only a pair of blocks at each step.
M > 2: merge more than two blocks at each step; (M-way merging).

Strategies for SELECT

Given: SELECT * FROM relation WHERE selection-conditions
S1: Linear search over a key
- Retrieve every record; test whether it satisfies the selection condition.
- SELECT * FROM EMPLOYEE WHERE SSN = 1234
- Expected Cost: b/2
S2: Binary Search over a key
- SELECT * FROM EMPLOYEE WHERE SSN = 1234
- Expected Cost (unsorted file): log2(b) + 2b(1 + logM(L))
- Expected Cost (sorted file): log_2(b)
S3: Primary Index or Hash Function over a key
- SELECT * FROM EMPLOYEE WHERE SSN = 1234
- Precondition (Index): Primary Index of level t over key (sorted by key)
- Precondition (Hash): File hashed with the key
- Expected Cost (sorted file): t + 1
- Expected Cost (hashed file): 1 + n/2 (n = #overflow blocks < b)
S4: Hash Function over a non-key
- SELECT * FROM EMPLOYEE WHERE DNO = 10
- Precondition (Hash): File hashed with the non-key
- Expected Cost: 1 + n (n = #overflow blocks having the non-key)
S5: Primary Index over a key in a range query: $>>, \geq, <, \leq
- Use Index to find the record satisfying the equality (e.g., SSN = 9) and retrieve all subsequent blocks from the ordered file.
- SELECT * FROM EMPLOYEE WHERE SSN > 8;
- Precondition: Primary Index of level t over the sorted by key
- Expected Cost (sorted file): t + O(b)
- Estimated Cost (sorted file): t + P(X > x) . b = t + b . (1– x/r)
  - P(X = x): probability of SSN = x = 1/r
  - r = number of employees = b . f
  - f = employees/block
  - P(X > x) = 1 – P(X \leq x) = 1 - (1/r + … + 1/r) = 1 – x/r
S6: Clustering Index over ordering, non-key
- Retrieve all contiguous blocks of the cluster.
- SELECT * FROM EMPLOYEE WHERE DNO = 5;
- Precondition: Clustering Index of level t on non-key (file sorted by non-key)
- Expected cost (sorted file): t + O(b/n)
- Estimation: t + b/n or t + P(X = x) . b
  - Note 1: n = #distinct values of the non-key attribute
  - Note 2: Attribute is uniformly distributed, i.e., P(X = x) = 1/n
S7: Secondary Index (B+ Tree) over non-ordering key
- SELECT * FROM DEPARTMENT WHERE MGR_SSN = 1234
- Precondition: File is not ordered by key.
- Expected Cost: t + 1
- B+ Leaf Node points at the unique block
S8: Secondary Index (B+ Tree) over non-ordering, non-key
- Note: retrieve multiple records from different blocks having the same value.
- SELECT * FROM EMPLOYEE WHERE SALARY = 40000;
- Precondition: File is not ordered by non-key
- Expected Cost: t + m + O(b)
  - Note 1: B+ Leaf Node points to the first block of a cluster of blocks storing pointers to data blocks with Salary = 40K (2 levels of indirection)
  - Note 2: m is number of blocks with block pointers (m \geq 1)

Strategies for Disjunctive SELECT

Disjunctive Selections: conditions involving OR
SELECT * FROM EMPLOYEE WHERE SALARY > 10000 OR NAME LIKE ‘%Chris%’
Final result: contains tuples satisfying the union of all selection conditions
Methodology:
- IF an access path exists, e.g., B+/hash/primary-index for all of the attributes:
  - use each to retrieve the set of records satisfying each condition
  - union all sets to get the final result.
- ELSE if none or some of the attributes have an access path, linear search is unavoidable!

Strategies for Conjunctive SELECT

Conjunctive Selections: conditions involving AND
SELECT * FROM EMPLOYEE WHERE SALARY > 40000 AND NAME LIKE ‘%Chris%’
Methodology:
- IF an access path exists (index) for an attribute, use it to retrieve the tuples satisfying the condition, e.g., Salary > 40000 [intermediate result]
- GO through this intermediate result to check which record satisfies also the other condition(s), e.g., Name LIKE ‘%Chris%’ in memory!
- If you have two indexes, which index is to be used first?
  - Answer: use the index that generates the smallest intermediate result set hoping to fit in the memory! [selectivity = #tuples retrieved]
- Optimization: find the execution sequence of conditions that minimizes the expected cost.
- Principle: Predict the selectivity beforehand!

Strategies for JOIN

Observation: the most resource-consuming operator!
Focus: two-way equijoin, i.e., join two relations with equality ‘=’
SELECT * FROM EMPLOYEE E, DEPARTMENT D WHERE E.DNO = D.DNUMBER
Five fundamental strategies for join processing:
- Naïve join (no access path)
- Nested-loop join (no access path)
- Index-based nested-loop join (index; B+ Trees)
- Merge-join (sorted relations)
- Hash-join (hashed relations)

Naïve Join

SELECT * FROM R, S WHERE R.A = S.B --A and B are join attributes, e.g., PK, FK.
Step 1: Compute the Cartesian product of R and S, i.e., all tuples from R are concatenated (combined) with all tuples from S.
Step 2: Store the result in a file T and for each concatenated tuple t = (r, s) with r \in R and s \in S check if r.A = s.B
Algorithm Naïve Join
- T = Cartesian R x S
- Scan T, a tuple t \in T at a time: t = (r, s)
- If r.A = s.B then add (r, s) to the result file
- Go to next tuple t \in T.
Outcome: inefficient, typically the result is a small subset of the Cartesian!
What-If: no tuples are actually matched; predict the matching tuples in advance!

Nested-Loop Join

Algorithm Nested-Loop Join
- For each tuple r \in R //outer relation
- For each tuple s \in S //inner relation
- If r.A = s.B then add (r, s) to the result file;
Note: the outer & inner loops are over blocks and not over tuples!
Note: Re-form the pseudocode in a block-centric programming mode (system programming using files)
Challenge 1: Which relation should be in the outer loop and which in the inner loop to minimize the join processing cost?

Nested-Loop Join: Algorithm

Step 1:
- LOAD a set (chunk) of blocks from the outer relation R.
- LOAD one block from inner relation S
- Maintain an output buffer for the matching (resulting) tuples (r, s): r.A = s.B
Step 2:
- JOIN the S block with each R block from the chunk
- FOR each matching tuple r \in R-block and s \in S-block ADD (r, s) to Output Buffer
- IF Outer Buffer is full, PAUSE; WRITE the current join result to disk; CONTINUE
Step 3: LOAD next S-block and GOTO Step 2
Step 4: GOTO Step 1

Nested-Loop Join: Cost

SELECT * FROM EMPLOYEE E, DEPARTMENT D WHERE E.DNO = D.DNUMBER
Employee (E): nE blocks used at the outer loop
Department (D): nD blocks used at the inner loop
Memory: nB blocks available:
- 1 block for reading the inner file D,
- 1 block for writing the join result,
- nB-2 blocks for reading the outer file E: chunk size.
Observation 1: Each block of the outer relation E is read once.
Observation 2: The whole inner relation D is read every time we read a chunk of (nB-2) blocks of E.

Nested-Loop Join: Cost (Continued)

Total number of blocks read for outer relation E: nE
Outer Loops: Number of chunks of (nB-2) blocks of outer relation: ceil(nE/(nB-2))
For each chunk of (nB-2) blocks read all the blocks of inner relation D:
Total number of block read in all outer loops: nD*ceil(nE/(nB-2))
Total Expected Cost: nE + nD * ceil(nE/(nB-2)) block accesses
Example: nE = 2,000 blocks; nD = 10 blocks; nB = 7 blocks
- Strategy Cost 1: (E outer; D inner) nE + nD * ceil(nE/(nB-2)) = 6,000 block accesses
- Strategy Cost 2: (D outer; E inner) nD + nE * ceil(nD/(nB-2)) = 4,010 block accesses
Note: The file with fewer blocks goes to the outer loop.

Index-Based Nested-Loop Join

Idea: Use of an index on either A or B joining attributes: R.A = S.B.
Focus: Assume an index I on attribute B of relation S.
Algorithm Index-Based Nested-Loop Join
- For each tuple r \in R
- Use index of B: I(r.A), to retrieve all tuples s \in S having s.B = r.A
- For each such tuple s \in S, add matching tuple (r, s) to the result file;
Claim: Much faster compared to the nested-loop join, why?
- Because: We get immediate access on s \in S with s.B = r.A by searching for r.A using the index I, avoiding linear search on S.
Challenge 2: Which index to use to minimize the join processing cost?

Index-Based Nested-Loop: Cost

SELECT * FROM EMPLOYEE E, DEPARTMENT D WHERE D.MGR_SSN = E.SSN
B+ Tree on Mgr_Ssn with level xD = 2
B+ Tree on SSN with level xE = 4
E: rE = 6000 tuples; nE = 2,000 blocks; D: rD = 50 tuples; nD = 10 blocks
Strategy 1: Employee e, use B+ Tree (MgrSsn) to find department d: e.Ssn = d.MgrSsn.
- Observation: not all employees are managers; –search without meaning sometimes…
- Probability an employee being manager: 50/6000 = 0.83% (99.16% meaningless searches)
- Strategy Cost 1: nE + rE*(xD + 1) = 20,000 block accesses;
Strategy 2: Department d, use B+ Tree (SSN) to find Employee e: e.Ssn = d.Mgr_Ssn
- Observation: every department has one manager –search is fruitful…
- Probability a manager being an employee is 100% ☺
- Strategy Cost 2: nD + rD*(xE + 1) = 260 block accesses;
Huge difference (20,000 vs 260 block accesses):
- every record in Department is joined with exactly one record in Employee (manager)
- only some employees from Employee are managers of departments…
Lesson Learnt: Use the index built on the PK
Note: not for recursive relationships, e.g., employee-supervisor

Sort-Merge Join

Idea: Use of the merge-sort algorithm over two sorted relations w.r.t. their joining attributes.
Pre-condition: Relations R and S are physically ordered on their joining A and B;
Methodology:
- Step 1: Load a pair {R.block, S.block} of sorted blocks into the memory;
- Step 2: Both blocks are linearly scanned concurrently over the joining attributes (sort-merge algorithm in memory);
- Step 3: If matching tuples found then store them in a buffer.
Gain: The blocks of each file are scanned only once!
But: If R and S are not a-priori physically ordered on A and B then sort them first!

Sort-Merge Join: Example

R

A sname rating age
22 dustin 7 45.0
28 yuppy 9 35.0
44 guppy 5 35.0
58 rusty 10 35.0

S

B bid day rname
28 103 12/4/96 guppy
28 103 11/3/96 yuppy
31 101 10/10/96 dustin
31 102 10/12/96 lubber
31 101 10/11/96 lubber
58 103 11/12/96 dustin

Result Buffer

(28,yuppy,9,35,103,12/4/96,guppy)
(28,yuppy,9,35,103,11/3/96,yuppy)
(58,rusty,10,35,103,11/12/96,dustin)

Sort-Merge-Join: Cost

Requirement: Efficient if both Employee E and Department D are already sorted by their joining attributes: SSN and Mgr_Ssn.
Observation: only a single pass is made for each file.
Strategy Cost: nE + nD = 2,010 block accesses.
IF both files are not sorted THEN use external sorting!
- Strategy Cost 1: External sorting (2-way merge): 2.nE + 2. nE .log_2(ceil(nE / nB))
  - ceil(nE / nB): number of initial sorted sub-files (each sub-file is nB blocks)
  - nB: number of available memory blocks.
- Strategy Cost 2: External sorting (2-way merge): 2.nD + 2. nD .log_2(ceil(nD / nB))
Total Strategy Cost:
- nE + nD + 2.nE + 2. nE .log2(ceil(nE / nB)) + 2.nD + 2. nD .log2(ceil(nD / nB))
Example: nE = 2,000 blocks; nD = 10 blocks; nB = 7 blocks, we get:
- 2010 + 4000 + 4000 log(286) + 20 + 20 log(2) = 38,690 block accesses; only 5.1% is devoted to join!
Think before sort only for joining purposes!

Hash-Join

Pre-condition:
- File R is partitioned into M buckets w.r.t. hash function h over joining attribute A.
- File S is also partitioned into M buckets w.r.t. the same hash function h over attribute B;
Assumption: R is the smallest file and fits into main memory: M buckets of R are in memory.
Algorithm Hash-Join
- Partitioning phase
  - For each tuple r \in R,
  - Compute y = h(r.A) /* address of bucket*/
  - Place tuple r into bucket y = h(r.A) in memory
- Probing phase
  - For each tuple s \in S,
  - Compute y = h(s.B) /use the same hash function h/
  - Find the bucket y = h(s.B) in memory (of the R partition).
  - For each tuple r \in R in the bucket y = h(s.B)
  - If s.B = r.A add (r, s) to the result file; /join/

Hash-Join (Partitioning and Probing phases)

Partitioning Phase

Partition of R over attribute A using hash h(A) = A \mod M into M buckets.
M+1 main memory buffers

Probing Phase

Hashing each tuple s from S, using hash h(s.B) = s.B \mod M to identify the y = h(s.B) bucket in memory.

Hash-Join: Cost

Best Case: Memory nB > nD + 2
- nD: blocks for the smallest of the two relations (e.g., Department)
- Whole relation Department fits in memory and is hashed into M buckets.
- Each Employee tuple is loaded and hashed on joining attribute SSN.
- The corresponding bucket is found and searched for a matching tuple.
- The result is stored in another buffer (that’s why nB > nD + 2)
Best Case: nE + nD block accesses.
Normal Case (smallest relation cannot fit in memory): 3(nE + nD) block accesses

Put-All-Together: Join Cost Prediction

Naïve Join Cost: nE * nD: 20,000 block accesses
Nested-Loop Cost (best): nD + nE * ceil(nD/(nB-2)): 4,010 block accesses
Index-based Nested-Loop Cost (best): nD + rD*(xE + 1): 260 block accesses
Sort-Merge Cost (already sorted): nE + nD: 2,010 block accesses
Hash-Join Normal-Case Cost: 3(nE + nD): 6,030 block accesses
Hold on a second: the cost for writing the result-set buffer (block) from memory to disk in each strategy is not yet considered!
How many blocks are written? How many tuples are matched?…next weeks

So Far…

Naïve Join: Exploit nothing. Cartesian product and then check…
Nested-Loop: Exploit nothing. Computing-oriented join.
- Which relation should be in the outer loop? Influences the join cost.
- Can you predict the cost then? Optimization…
Index-based Nested-Loop: Exploit at least one index. Use index to find the matched tuples as quick as possible ☺
- If we have two indexes (over R.A and over S.B), which one to use? Influences the join cost! Optimization…
Merge-Join: Exploit both ordered relations; otherwise; sort them 
Hash-Join: Exploit hashing. Hash one relation.
- Use the same hashing function to find the matching tuples in the same bucket ☺

Use Case I

SELECT D.NAME, E.NAME FROM EMPLOYEE E, DEPARTMENT D WHERE E.DNO = D.DNUMBER
Clustering Index (1-level) ordering, non-key DNO with nC = 4 index-blocks
Employee: nE = 100 blocks; rE = 2000; Department: rD = 500; nD = 10 blocks
Memory: nB = 22 blocks, bfrE = 2 employees per block
Task: propose 3 strategies and choose the best.

Use Case I (Continued)

Clustering Index (1-level) ordering, non-key DNO with nC = 4 index-blocks
Employee: nE = 100 blocks; rE = 2000; Department: rD = 500; nD = 10 blocks
Memory: nB = 22 blocks, bfrE = 2 employees per block
Strategy 1: Use Clustering Index (DNO). For each department d, find details of its employees: e.DNO
- Cost for accessing a department in Clustering Index: log_2(nC) = 2 block accesses
- 2000 employees in 500 departments, i.e., m = 4 employees per department
- 2 blocks per department (m / bfrE)
- Cost-Clustering-Index = log_2(nC) + m / bfrE = 2 + 2 = 4 block accesses
- Cost-1: nD + rD*(log_2(nC) + m / bfrE) = 2010 block accesses

Use Case I (Continued 2)

Clustering Index (1-level) ordering, non-key DNO with nC = 4 index-blocks
Employee: nE = 100 blocks; rE = 2000; Department: rD = 500; nD = 10 blocks
Memory: nB = 22 blocks, bfrE = 2 employees per block
Strategy 2: Department relation fits in memory; Hash-join (M = 10 buckets).
- Hash Department and store into memory: nD = 10 block accesses.
- Map each employee to a bucket in main memory and search within the bucket.
- Partitioning phase cost: nD = 10 block accesses.
- Probing phase cost: nE = 100 block accesses.
- Cost-2: nD + nE = 110 block accesses

Use Case I (Continued 3)

Clustering Index (1-level) ordering, non-key DNO with nC = 4 index-blocks
Employee: nE = 100 blocks; rE = 2000; Department: rD = 500; nD = 10 blocks
Memory: nB = 22 blocks, bfrE = 2 employees per block
Strategy 3: Department relation is smaller than employee; nested-loop join
- Outer relation: Department
- Cost-3: nD + nE * ceil(nD/(nB-2)) = 10 + 100*(10/20) = 60 block accesses
Strategy 1 (index-based join): 2010 block accesses
Strategy 2 (hash-join): 110 block accesses
Strategy 3 (nested-loop join): 60 block accesses

Use Case II

SELECT E.NAME, S.NAME FROM EMPLOYEE E, EMPLOYEE S WHERE E.SUPER_SSN = S.SSN
Context:
- b = 2,000 blocks;
- r = 10,000 records (employees);
- B+ Index over SSN of level x = 5
- B+ Index over Super_SSN of level y = 2
- 10% of employees are supervisors;
- a supervisor does not have any supervisor: Super_SSN is NULL;
Task: Propose a plan that minimizes the expected cost using Index- based Nested-loop.

Use Case II (Continued)

SELECT E.NAME, S.NAME FROM EMPLOYEE E, EMPLOYEE S WHERE E.SUPER_SSN = S.SSN
Facts (Solution 1):
- PK is SSN & FK is Super_SSN.
- Use the B+ Index over SSN (ssn-index).
- Scan relation Employee once, i.e., b = 2000 block accesses
- For each block from Employee:
  - For each tuple e check if this employee is a supervisor, i.e., Super_SSN is NULL
  - IF employee e is NOT supervisor (w.p. 90%)
    - THEN use ssn-index(e.super_ssn)
  - ELSE go to next employee
- Total Cost: b + 0.9r(x+1) = 56,000 block accesses

Use Case II (Continued 2)

SELECT E.NAME, S.NAME FROM EMPLOYEE E, EMPLOYEE S WHERE E.SUPER_SSN = S.SSN
Facts (Solution 2):
- PK is SSN & FK is Super_SSN.
- Use the B+ Index over Super_SSN (super-index).
- Scan relation Employee once, i.e., b = 2000 block accesses
- For each block:
  - For each tuple e check if this employee is a supervisor, i.e., Super_SSN is NULL
  - IF employee e is NOT supervisor (w.p. 90%)
    - THEN use super-index(e.super_ssn)
  - ELSE go to next employee
- Total Cost: b + 0.9r(y+1) = 29,000 block accesses (48% faster)

Hash-Join: Cost (Sketch) OPTIONAL

Normal Case: The smallest relation cannot fit in memory.
Partitioning Phase
- Read both relations E & D first (one block at a time);
- Partial Cost: nE + nD
- Partition into M buckets using with the same hashing function h,
- The M main buckets fit in memory; overflow blocks in disk!
- Store the hashed buckets of each relation to the disk.
- Partial Cost: nE + nD
Probing Phase
- For each m =1…M bucket Do
- Read a pair: the m-th bucket from E and the m-th bucket from D
- Partial Cost: nE + nD
- Perform join focusing only on the tuples from the same bucket m
- Idea: e might be matched with d since h(e) = h(d) = m
Expected Cost: 3(nE + nD)$$ block accesses.

Note

0.0(0)

Take a practice test

Chat with Kai

Explore Top Notes

AP Biology Ultimate Guide

Studied by 130 people

Neuroscience and Psychiatric disease midterm notes:

Studied by 8 people

Roles and Processes in Schools

Studied by 1 person

Studied by 12 people

Studied by 5 people

Chapters 28.5-28.6: Money, banking, financial institutions

Studied by 7 people