Optimization of Queries in Relational Databases

INTRODUCTION

One of the main objectives of databases is to return information within acceptable timeframes.
This involves finding the best way to access and process data to respond to queries as quickly as possible through query evaluation.
Executing a query requires:
- Transferring data from stable memory (MS - disks) to central memory (MC).
- Processing in MC.
- Writing back to MS if there are updates.
Optimization aims to minimize the volume of these transfers.
To achieve this, the DBMS has a query evaluation module to choose the best possible execution strategy for a relational expression.

EXAMPLE Problem Statement

Working with a supplier database with the following assumptions:
- Cardinality of the supplier relation = 100.
- Cardinality of the supplier-product relation = 10000.
Query: "Give the names of suppliers of part P2."
SQL query:
- sql SELECT nomf FROM fournisseur, four-produit WHERE nump = ‘P2’ and fournisseur.numf= four-produit.numf
An unoptimized algebraic tree for the query:
- The tree structure involves joining Fournisseur and Four-produit relations, then filtering by Nump = P2, and finally projecting the nomF attribute.

Evaluation of the Non-Optimized Tree

1st step: Join the fournisseur and four-produit relations:
- Read all 10,000 tuples of four-produit, then read each of the 100 fournisseur tuples 10,000 times (resulting in 10,000 x 100 disk accesses at most).
- Construct an intermediate join result in MC (minimum 10,000 joined tuples). It is not certain whether this intermediate result can be kept in central memory.
2nd step: Restrict the result from the previous step:
- Apply the condition nump = ‘P2’ to the joined tuples (10,000 accesses at most).
- Assume 50 tuples satisfy this condition and can be kept in memory.
3rd step: Project onto nomf:
- Traverse the 50 tuples from the previous result to keep only the nomf attribute.
Observation: Performing the join first results in wasted time due to repeated reads.
Alternative Strategy:
- 1st step: Restrict four-produit to nump = ‘P2’:
  - Read the 10,000 tuples of four-produit to keep only a table of 50 tuples in memory.
- 2nd step: Join the result with fournisseur:
  - Extract the 100 fournisseur tuples (100 disk accesses).
  - The result contains 50 tuples.
- 3rd step: Perform the projection.
Commuting the join and restriction significantly reduces the number of transfers from secondary to central memory, saving time and reducing costs.

PLAN

I. Typology of queries

II. The different phases of query processing

III. Query optimization

i. Syntax method

ii. Cost estimation

iii. Implementation techniques of operators: join case

TYPOLOGY OF QUERIES

Example Database Schema:

Ouvrier (NSS, Name, Salary, Spec, #N°U)
Usine (N°U, Site, Prod, #NSSChef)
Contrat (N°Cont, Fin, Cost, #NSS, #N°U)

Classification of Queries:

Singular Query (Point Query)

Based on a single relation.
The selection predicate is composed of one or more equality conditions.
The response includes at most one tuple.
Example:
- sql SELECT * FROM Ouvrier WHERE NSS = 1233333;

Multiple Singular Query (Multipoint Query)

Based on a single relation.
The selection predicate is composed of one or more equality conditions.
The response includes multiple tuples.
Example:
- sql SELECT * FROM Ouvrier WHERE Salary > 1500 AND Spec =`soudeur ’;

Interval Query (Range Query)

Based on a single relation.
The predicate includes an interval test.
Example:
- sql SELECT * FROM Ouvrier WHERE Salary > 1500 AND Salary < 5000;

Membership Query to an ordered set of attributes Z

Queries whose conditions of equality must include a substring defined according to the set Z.
Example:
- Z = {NSS, Nom}
- sql SELECT Nom, Salary FROM Ouvrier WHERE NSS = 12345 AND Nom = `Dupont’;

Min-Max Query

Query formulated on a single relation.
The predicate is composed of equalities referring to a minimum or maximum value.
Example:
- sql SELECT Nom, Salary FROM Ouvrier WHERE Salary = max (SELECT salary FROM Ouvrier);

Grouping Query

Query whose response is partitioned by one or more attributes.
Example:
- sql SELECT N°U, AVG(Salary) FROM Ouvrier GROUP BY N°U;

Join Query

Query based on multiple relations.
Example:
- sql SELECT NSS, Nom, Site FROM Ouvrier O, Usine U WHERE O.N°U = U.N°U;

The Different Phases of Query Processing

Query Processing Phases:
- Analysis: Syntactic and semantic analysis.
- Optimization: Finding an optimal execution plan.
- Execution: Running the plan.
- Results: Output.
Evaluation of a Query involves using the:
- query itself,
- data dictionary,
- indexes, and
- statistics.
The result of analysis is a Query Tree.

Detailed Steps in Query Processing

Analysis: The SQL query is transformed into an algebraic tree (if using relational algebra operators) or a connection graph (if using relational calculus).
Optimization: Rules and heuristics are applied to the tree from the previous phase to determine the optimal ordering of relational operations and the use of data access methods, generating an optimal execution plan.
Execution: The final step involves executing the execution plan to generate the results.

Optimization Methods:

Syntactic: Transforms query trees using transformation rules on the ordering of algebraic operators.
Cost Estimation: Estimates the execution costs of various evaluation strategies based on statistical information and access methods.
Semantic: Exploits integrity constraints.

These methods are often combined, particularly methods 1 and 2.

OPTIMIZATION OF REQUESTS, SYNTAX METHOD

SYNTAX METHOD

Objective:
- Syntactic study of the question: checking its correctness and sometimes semantic analysis.
- Verification of the presence of referenced attributes with the catalogs.
- Semantic analysis: checking if the question is well formulated, e.g., checking for a missing operator or contradictions (selection with two criteria >10 and <10, for example).
Result: generation of a query tree.
- A query tree is constructed using an abstract syntax associated with the operators of relational algebra or relational calculus.
- Aim to facilitate the comprehension, works on a syntax close to the notation used in relational algebra. The query tree will then be an algebraic tree.

Analysis of the request TREE STRUCTURE

An algebraic tree is a tree where:
- Terminal nodes are relations.
- Intermediate nodes are relational algebra operators.
- The root node is the result of the query.
- Arcs represent data flows between operators.

OPTIMIZATION OF REQUESTS, ORDERING OF OPERATORS

The ordering of relational operations is complex.
Optimization depends on the order of execution of the operators appearing in the algebraic tree used.

SYNTAX METHOD BASED ON OPERATOR ORDERING

General optimization strategy:
- Perform unary operations of restriction and projection as early as possible because they reduce the degree and cardinality of a relation, respectively.
- Pre-process relations before performing the join, either by creating indexes or by performing sorting.
- Search for sub-expressions common to several sub-trees to pre-evaluate them and store the result for reuse.
- Process restriction and projection operations on the same relation in a single pass.

Rules of transformation of trees

Grouping rules

Rule 1: grouping of projections
- $π(π(R / A1, A2, …, An) / B1, B2, …, Bp) = π(R / B1, B2, …, Bp)$
- with ${Bj} ⊆ {Ai}$
- In a sequence of projection operations only the last one is necessary
  - $π(π Employé/ NSS, NOME,PRENOME)/NSS) = π Employé/NSS$
Rule 2: grouping of restrictions
- $σ(σ( R / Xi θ a)/ Xj θ b ) = σ( R / Xi θ a ∧ Xj θ b)$
- $σ(σ Employé/ Adresse =‘Alger’ / date-naissance = ’21-10-2000) = σEmployé /Adresse =‘Alger’ ∧ date-naissance = ’21-10-2000$

Commutativity rules

Rule 3: Commutativity of joins and Cartesian product
- $R1 ⋈ R2 = R2 ⋈ R1$
- $R1 ⊗ R2 = R2 ⊗ R1$
Union and intersection operators are also commutative; difference is not.

Commutation or distributivity rules

The distributivity of an operator $Θ$ over an operator $\theta$ is expressed as follows:
- $Θ(R1 \theta R2) = (Θ R1) \theta (Θ R2)$
Rule 4: Commutation of restrictions and projections
- 1st case: the restriction argument is part of the projection attributes
  - $π/(X1,…, ,Xi) (σ (R/ (Xp θ a )) = σ/ ( Xp θ a ) ( π(R /(X1,…, Xp…,Xi))$
  - $π /NSS, Adresse ( σ Employé /adresse=‘Alger’) = σ /adresse=‘Alger’ (πEmployé/ NSS, adresse)$
- 2nd case: Otherwise
  - $π /(X1,…, ,Xi) (σ (R/ (Xp θ a )) = π)/ X1,…,Xi (σ/ Xp θ a (π (R/(X1,…,Xi, Xp) ))$
Rule 5: Commutation of restrictions with the union
- $σ ( Union (R1, R2 ) / Xp θ a) = Union (σ(R1/ Xp θ a), σ( R2/ Xp θ a))$
Rule 6: Commutation of restrictions with the difference
- $σ ( Minus (R1, R2 ) / Xp θ a) = Minus (σ(R1/ Xp θ a), σ( R2/ Xp θ a))$
Rule 7: Commutation of restrictions with the Cartesian product
- 1st case: the restriction relates to the argument of a single relation: the two relations have no common attributes:
  - R1 ( …., Xi, …), R2 (…., Yj, …)
  - $σ (R1 ⊗ R2) / (Xi θ a) ) = σ (R1/ (Xi θ a)) ⊗ R2$
- 2nd case: the restriction relates to two respective arguments of R1 and R2, the two relations have no common attributes:
  - R1 ( …., Xi, …), R2 (…., Yj, …)
  - $σ (R1 ⊗ R2) / (Xi θ a ∧ Yj θ b) ) = σ (R1/ (Xi θ a)) ⊗ σ (R2/ (Yj θ b))$
- 3rd case: the restriction relates to two respective arguments of R1 and R2, the two relations have a common attribute Xp:
  - R1 ( …., Xi, Xp, …), R2 (…., Xp, …)
  - $σ (R1 ⊗ R2) / (Xi θ a ∧ Xp θ b) ) = σ ( (σ (R1/ (Xi θ a)) ⊗R2 )/ (Xp θ b) )$
Rule 8: Commutation of projections and Cartesian product
- Let R1 ( …., Xi, …) and R2 (…., Yj, …)
  - $π( (R1 ⊗ R2) / X1, …, Xp, Y1, …Yp ) = π(R1 / X1, …, Xp) ⊗π(R2 / Y1,…,Yp)$
Rule 9: commutation of projections with the union
- $π( (R1 Union R2) / X1, …, Xp ) = π(R1 / X1, …, Xp) Union π(R2 / X1,…,Xp)$
Rule 10: Case of the natural join
- A natural join of R1 ( X1…., Xi, …,Xn), R2 (Y1,…., Xi, …,Ym) is a restriction of a Cartesian product to Xi = Xi followed by the projection on R1R2
  - Commutation with projection:
    - $π( (R1 ⋈ R2 ) / X1….XpY1…Yq ) = π( π(R1 / X1, …, Xp,Xi) ⋈ π(R2 / Y1,…,Yp,Xi)/ X1..XpY1..Yq )$
    - Remark: the last projection is only necessary if Xi does not belong to X1,…Xp or Y1,..,Yq
  - Commutation with restriction:
    - R1 ( …,Xi ,…) and R2 (…,Xi,..Yj,…)
    - $σ(( R1 ⋈ R2 ) / (Yj θ b)) = R1 ⋈ (σ(R2 / (Yj θ b))$
- Associativity: A $Θ$ (B $Θ$ C) = (A $Θ$ B) $Θ$ C
  - Union, intersection, and join are all associative. Difference is not.

Simple Algorithm for Generating a Query Plan

Apply rules 1 and 2 to group the restrictions and projections
Push the restrictions as low as possible using rule R4 (to pass a projection), R5 and R6 (to pass a union or a difference) of R7 (to pass a product or a join) and possibly combine restrictions with R2
Push the projections down through the projections using the rule R1, through the restrictions (R4), through the products (R8), through the joins (R10)
Remove redundant projections: At the level of the leaves of the query tree if all the attributes are cited, At the level of the nodes by the use of the grouping rule R1
Several query plans can be generated by transformations of expressions. The optimizer will then have to choose the least expensive one. This is the cost estimation step that will allow the choice.

OPTIMIZATION OF REQUESTS ESTIMATION OF EXECUTION COSTS

A simple solution to choose the best execution plan is to generate all of them and estimate the execution cost of each and choose the lowest cost.
Unfortunately this approach is not very feasible because:
- the number of plans can be very large.
- the size of intermediate results taken into account is an additional cost.
A simple solution is to use stored statistical knowledge, knowledge about the implementation costs of operators, and knowledge about the indexing techniques used in order to choose the best execution strategy.
Explain plan - examples
- Rows Execution Plan
- - 12 SORT AGGREGATE
- - 2 SORT GROUP BY
- - 76563 NESTED LOOPS
- - 76575 NESTED LOOPS
- - 19 TABLE ACCESS FULL CNPAYRUNSALL
- - 76570 TABLE ACCESS BY INDEX ROWID CNPOSTINGDETAILS_ALL
- - 76570 INDEX RANGE SCAN (object id 178321)
- - 76563 TABLE ACCESS BY INDEX ROWID CNPAYMENTWORKSHEETS_ALL
- - 11432983 INDEX RANGE SCAN (object id 186024)

USE OF STATISTICAL DATA

The estimation of the execution cost uses the statistics of the database stored in the catalogs.
We remind you that some information such as the cardinality of the relations already exists in the catalogs already seen.
Some additional information is available such as the number of pages occupied by a table, the number of distinct values in a column, the number of index levels, the number of pages of each level…

USE OF OPERATOR IMPLEMENTATION PROCEDURES

The optimization program will have a set of predefined operator implementation procedures at its disposal.
For example:
- the restriction may be implemented differently depending on whether the qualification formula refers to a key, or to a non-key but indexed attribute, or simply to a non-key non-indexed attribute.
- The join of two relations R1 and R2 can be implemented:
  - brutally using two nested loops (a complete outer loop on the occurrences of R1 and a complete inner loop on the occurrences of R2),
  - or if there is an index on the join attribute, it will not be necessary to perform the second loop but to perform a direct access to the tuple (indexed search)
  - or it will be possible to sort the two relations according to the join attribute, so the traversal of the two relations can be synchronized and will provide only one data traversal without redundancy (implementation by sort-fusion).

Index

Indexes are files that contain the pairs attribute(s) of access → page number
A primary index is built on the primary key of a relation
A secondary index is built on any attribute of a relation

Indexing techniques

B-tree
Hashing
Bitmap

Search in a table with an index

Example: SELECT * FROM Etudiant WHERE Age = 24;
An index on Age allows to quickly obtain the RIDs of the tuples of the employees of 24 years.
Conjunctive queries:
- SELECT * FROM Employe WHERE Salary = 2000 AND Ville = umber Laghouat ’;
- Suppose the existence of two indexes; one on Salary and the other on City.
- The first index provides the RIDs satisfying the 1st condition which will be grouped in a temporary table T1. Idem for the second index ~ T2
- Finally the intersection of T1 and T2 gives the answer

The choices of Oracle

By default the index is a B tree
- As soon as a PRIMARY KEY command is used, Oracle creates a B tree on the primary key
Tree is stored in an index segment

Examples of index creation

The name of an index is chosen by the DBA and can be normalized by prefixing it with idx_ followed by the name of the indexed attribute and terminated by the name of the table
- Example:
  - CREATE UNIQUE INDEX idx_NSS_Ouvrier ON Ouvrier(NSS);
  - CREATE UNIQUE INDEX idx_NSS_Salaire_Ouvrier ON Ouvrier(NSS, Salaire);

Delete an index

DROP INDEX <Index Name>;
DROP INDEX idx_NSS_Ouvrier;

Guide to the use of indexes during the exploitation of a BD

The useful indexes are those necessary for the current exploitation of the data
Delete the indexes when it is a question of processing a large flow of transactions of type update on a table. They are first deleted, then recreated as needed after the updates
Acceleration of the join is possible by the creation of the indexes: indexing the join attributes will allow a faster access

Join algorithms

Join without index
- Nested loop join
- Sort-join
Join with index
- Indexed-loop join

Nested Loop Join (JBI): Brute force

R ∞FS
Principle:
- The first table is read sequentially and preferably stored entirely in memory
- For each tuple of R, there is a comparison with each of the tuples of S

Nested loop join: ALGORITHM

FOR each i := 1 to m  /*outer loop */
FOR each j := 1 to n  /*inner loop*/
SI = R[i].C = S[j].C then
  add the joined n-tuple R[i]*S[j] to the result
Fin Si
finPOUR
FinPOUR

Remarks:
- Requires mxn read operations
- Cardinality:
  - If join several to 1: card = at most cardinality of R or S whose foreign key is involved in the join
  - If join several to several the cardinality depends on the number of distinct values of the join attribute C in R or S.
  - mxn/dCR or mxn/dCS depending on whether we start with R or S

Indexed search

There is an index X on the join attribute of the internal relation S.
Difference with the brute force algorithm: for each tuple of the external relation R, the access to the tuples of the internal relation is direct via the index and not a systematic scan:
The total number of n-tuple readings for relations R and S is simply the cardinality of the join result: m + mxn/dCS
ALGORITHM

/* we assume that there is an index X on S.C */
For i := 1 to m:  /* outer loop*/
/* suppose there are k index entries X[1].. X[k] with as value of indexed join attribute = R[i].C */
For j:= 1 to n;  /* inner loop */
/* let S[j] be the n-tuple indexed by X[j] */
Add the joined n-tuple R[i]*S[j] to the result
fin
Fin

Join by tri fusion

More efficient than nested loops for large tables
Principle:
- Sort the two tables on the join attribute
- Perform the merge
Complexity: Sorting that costs expensive

ALGORITHM

Sort R and S by external sorting and rewrite to temporary files
Read group of lines GR(cR) of R for the first value cR of join key
Read group of lines GS(cS) of S for the first value cS of join key
WHILE there are lines of R and S to process
  SI cR = cS
    Produce the concatenated lines for each of the line combinations of GR(cR) and GS(cS);
    Read the following groups GR(cR) of R and GS(cS) of S;
  SINON
    SI cR < cS
      Read the next group GR(cR) of R
    SINON
      SI cR > cS
        Read the following group GS(cS) in S
      FINSI
    FINSI
  FINSI
FIN TANT QUE

Hash join

Very efficient when one of the two tables is small (1, 2, 3 times the size of the memory)
Algorithm in option in Oracle
Principle of the hash join
- 1- Hash the smallest of the two tables into n fragments
- 2- Hash the second table, with the same function, into n other fragments
- 3- Join the fragments in pairs, and do the join