Indexing Structures and Physical Database Design Notes

Indexing Structures for Files and Physical Database Design

Introduction

Indexes speed up record retrieval based on search conditions.
They provide secondary access paths.
Any field can be indexed.
Multiple indexes can be created.
Most indexes are based on ordered files.
Tree data structures organize the index.

Multilevel Indexes

Indexing schemes typically involve an ordered index file.
Binary search locates pointers to a disk block using approximately (log2 bi).
Each step reduces the search space by a factor of 2.
Multilevel indexes reduce the index by a value larger than 2.
bfri is the fan-out of the multilevel index, denoted as fo.
Binary search divides the search space into two halves.
Multilevel index divides it into n ways (where n = f_o) at each step.
Searching a multilevel index requires approximately (log{fo} b_i) block accesses.
The fan-out is generally much larger than 2.
With a block size of 4,096 bytes (common in DBMSs), the fan-out depends on how many (key + block pointer) entries fit within a block.
With a 4-byte block pointer and a 9-byte key (e.g., SSN), the fan-out can be calculated.
Index file is the first (or base) level of a multilevel index.
The second level is a primary index to the first level.
The third level is a primary index to the second level.
We repeat the process until all entries of some index level t fit in a single block.
The last level is the top index level.
Each level reduces entries at the previous level by a factor of f_o, the index fan-out.
A multilevel index will have approximately t levels.
Searching the index retrieves a single disk block at each level.
t disk blocks are accessed for an index search, where t is the number of index levels.

Example

Converting a dense secondary index into a multilevel index.
Given bfri = 273 index entries per block (fan-out fo = 273) and b_1 = 1099 blocks at the first level.
The number of second-level blocks will be b2 = \lceil (b1 / f_o) \rceil = \lceil (1099 / 273) \rceil = 5 blocks.
The number of third-level blocks will be b3 = \lceil (b2 / f_o) \rceil = \lceil (5 / 273) \rceil = 1 block.
The third level is the top level, and t = 3.
Accessing a record requires accessing one block at each level plus one from the data file, needing t + 1 = 3 + 1 = 4 block accesses.
Compared to a single-level index with binary search, which may require 12 block accesses.

Dynamic Multilevel Indexes Using B-Trees and B+-Trees

Tree data structure terminology:
- Tree is formed of nodes.
- Each node (except the root) has one parent and zero or more child nodes.
- Leaf node has no child nodes.
- Unbalanced if leaf nodes occur at different levels.
- Nonleaf node is called an internal node.
- Subtree of a node consists of the node and all descendant nodes.

Search Trees

Search tree guides the search for a record given a value of one of the record’s fields.
A search tree of order p is a tree where each node contains at most q: p - 1 search values and p pointers in the order P1, K1, P2, K2, …, Pq, Kq, P_q+1, where q ≤ p.
Each P_i is a pointer to a child node (or NULL).
Each K_i is a search value from some ordered set of values.
All search values are assumed to be unique.

B-Trees

B-tree has additional constraints to ensure the tree is always balanced and space wasted by deletion is controlled.
Provides multi-level access structure.
Each node is at least half-full.
Each node in a B-tree of order p can have at most p-1 search values.

Example

Search field is a nonordering key field and construct a B-tree on this field with p = 23.
Assume each node is 69% full.
Each node, on average, will have p * 0.69 = 23 * 0.69 ≈ 16 pointers and 15 search key field values.
The average fan-out f_o = 16.
- Root: 1 node, 15 key entries, 16 pointers
- Level 1: 16 nodes, 240 key entries, 256 pointers
- Level 2: 256 nodes, 3,840 key entries, 4,096 pointers
- Level 3: 4,096 nodes, 61,440 key entries

B+-Trees

Data pointers are stored only at the leaf nodes.
Leaf nodes have an entry for every value of the search field and a data pointer to the record if the search field is a key field.
For a nonkey search field, the pointer points to a block containing pointers to the data file records.
Internal nodes repeat some search field values from the leaf nodes to guide the search.
every value of the search field appears once at some level in the tree, along with a data pointer
data pointers are stored only at the leaf nodes of the tree.
the structure of leaf nodes differs from the structure of internal nodes.
The leaf nodes have an entry for every value of the search field, along with a data pointer.

Indexes on Multiple Keys

Multiple attributes are involved in many retrieval and update requests.
Composite keys use a key value that combines attributes.
Partitioned hashing is suitable for equality comparisons.

Example

Consider an EMPLOYEE file with attributes Dno (department number), Age, Street, City, Zipcode, Salary, and Skillcode, with the key of Ssn (Social Security number).
Query: List the employees in department number 4 whose age is 59.
Dno and Age are nonkey attributes, so a search value for either points to multiple records.
- Assuming Dno has an index, but Age does not: Access records with Dno = 4 and select those with Age = 59.
- Assuming Age has an index, but Dno does not: Access records with Age = 59 and select those with Dno = 4.
- If both Dno and Age are indexed: Use both indexes, intersect the sets of records or pointers to find those satisfying both conditions.
Grid files use an array with one dimension for each search attribute.

Other Types of Indexes

Hash indexes are a secondary structure for file access.
Use hashing on a search key other than the one used for the primary data file organization.
Index entries are of the form (K, Pr) or (K, P).
- Pr: pointer to the record containing the key.
- P: pointer to the block containing the record for that key.
  *
  *Hash-based indexing is a method of organizing and retrieving data efficiently using hash functions.

Bitmap Indexes

Used with a large number of rows.
Creates an index for one or more columns.
Each value or value range in the column is indexed.
Built on one particular value of a particular field.
Array of bits.
Existence bitmap.
Bitmaps for B+-tree leaf nodes.

Function-Based Indexing

The value resulting from applying some function on a field (or fields) becomes the index key.
Introduced in Oracle relational DBMS.
Example: Function UPPER(Lname) returns uppercase representation.

Some General Issues Concerning Indexing

Physical index: Pointer specifies physical record address.
- Disadvantage: pointer must be changed if the record is moved.
Logical index: Used when physical record addresses are expected to change frequently.
- Entries are of the form (K, Kp).

Index Creation

General form of the command to create an index.
UNIQUE and CLUSTER keywords optional.
Order can be ASC or DESC.
Secondary indexes can be created for any primary record organization.
Complements other primary access methods.

Indexing of Strings

Strings can be variable length.
Strings may be too long, limiting the fan-out.
Prefix compression: Stores only the prefix of the search key adequate to distinguish the keys being separated and directed to the subtree.

Tuning Indexes

Tuning goals: Dynamically evaluate requirements and reorganize indexes to yield the best performance.
Reasons for revising initial index choice:
- Certain queries may take too long due to a lack of an index.
- Certain indexes may not be utilized.
- Certain indexes may undergo too much updating if based on an attribute that undergoes frequent changes.

Additional Issues Related to Storage of Relations and Indexes

Enforcing a key constraint on an attribute: Reject insertion if a new record has the same key attribute as an existing record.
Duplicates occur if an index is created on a nonkey field.
Fully inverted file: Has a secondary index on every field.
Indexing hints in queries: Suggestions used to expedite query execution.
Column-based storage of relations: An alternative to the traditional way of storing relations by row.
Offers advantages for read-only queries.
Offers additional freedom in index creation.

Physical Database Design in Relational Databases

Physical design goals: Create an appropriate structure for data in storage and guarantee good performance.
Must know job mix for a particular set of database system applications.
Analyzing the database queries and transactions.
- Information about each retrieval query.
- Information about each update transaction.
Analyzing the expected frequency of invocation of queries and transactions.
Expected frequency of using each attribute as a selection or join attribute.
80-20 rule: 80 percent of processing accounted for by only 20 percent of queries and transactions.
Analyzing the time constraints of queries and transactions.
Selection attributes associated with time constraints are candidates for primary access structures.
Analyzing the expected frequency of update operations.
Minimize the number of access paths for a frequently updated file.
Updating the access paths themselves slows down update operations.
Analyzing the uniqueness constraints on attributes.
Access paths should be specified on all candidate key attributes that are either the primary key of a file or unique attributes.

Physical Database Design Decisions

Design decisions about indexing:
- Whether to index an attribute: If the attribute is a key or used by a query.
- What attribute(s) to index on: Single or multiple.
- Whether to set up a clustered index: One per table.
- Whether to use a hash index over a tree index: Hash indexes do not support range queries.
- Whether to use dynamic hashing: Appropriate for very volatile files.

Summary

Indexes are access structures that improve the efficiency of record retrieval from a data file.
Ordered single-level index types: Primary, clustering, and secondary.
Multilevel indexes can be implemented as B-trees and B+-trees.
Dynamic structures.
Multiple key access methods.
Logical and physical indexes.