Indexing Structures for Files and Physical Database Design

Introduction

Indexes are used to speed up record retrieval in response to specific search conditions.
Index structures provide secondary access paths.
Any field can be used to create an index.
Multiple indexes can be constructed.
Most indexes are based on ordered files.
Tree data structures are used to organize the index.

Types of Single-Level Ordered Indexes

Ordered index is similar to an index in a textbook.
Indexing field (attribute).
The index stores each value of the index field with a list of pointers to all disk blocks that contain records with that field value.
Values in the index are ordered.
- Primary index
  - Specified on the ordering key field of an ordered file of records.
- Clustering index
  - Used if numerous records can have the same value for the ordering field.
- Secondary index
  - Can be specified on any non-ordering field.
  - A data file can have several secondary indexes.

Primary Indexes

Ordered file with two fields:
- Primary key, K(i)
- Pointer to a disk block, P(i)
One index entry in the index file for each block in the data file.
Indexes may be dense or sparse.
- Dense index: Has an index entry for every search key value in the data file.
- Sparse index: Has entries for only some search values.

Example 1

Consider an ordered file with r = 300,000 records.
Disk with block size B = 4,096 bytes.
File records are of fixed size and are unspanned.
Record length R = 100 bytes.
Blocking factor for the file: bfr = \lfloor(B/R)\rfloor = \lfloor(4,096/100)\rfloor = 40 records per block.
Number of blocks needed for the file: b = \lceil(r/bfr)\rceil = \lceil(300,000/40)\rceil = 7,500 blocks.
A binary search on the data file would need approximately \lceil log2 b\rceil = \lceil (log2 7,500)\rceil = 13 block accesses.
Key field size is V = 9 bytes, and block pointer size is 6 bytes.
Size of each index entry is (9 + 6) = 15 bytes.
Blocking factor for the index: bfrI = \lfloor(B/Ri)\rfloor = \lfloor(4,096/15)\rfloor = 273 entries per block.
Total number of index entries equals the number of blocks in the data file, which is 7,500.
Number of index blocks: bi = \lceil(ri/bfri)\rceil = \lceil(7,500/273)\rceil = 28 blocks.
A binary search on the index file would need \lceil(log2 bi)\rceil = \lceil(log2 28)\rceil = 5 block accesses.
One additional block access to the data file is needed, for a total of 5 + 1 = 6 block accesses.

Insertion and Deletion

Major problem: insertion and deletion of records.
Requires moving records around and changing index values.
With a primary index, the problem is compounded.
- Inserting a record in its correct position requires moving records and changing some index entries.
Solutions:
- Use an unordered overflow file.
- Use a linked list of overflow records.

Clustering Indexes

Clustering field: File records are physically ordered on a non-key field without a distinct value for each record.
Ordered file with two fields:
- Same type as the clustering field.
- Disk block pointer.

Example 2

Consider the same ordered file with r = 300,000 records stored on a disk with block size B = 4,096 bytes.
Ordered by the attribute Zipcode, and there are 1,000 zip codes in the file (with an average of 300 records per zip code, assuming even distribution across zip codes).
The index in this case has 1,000 index entries of 11 bytes each (5-byte Zipcode and 6-byte block pointer) with a blocking factor:
- bfri = \lceil (B/Ri) \rceil = \lceil (4,096/11) \rceil = 372 index entries per block.
Number of index blocks is hence:
- bi = \lceil (ri/bfr_i) \rceil = \lceil (1,000/372) \rceil = 3 blocks.
To perform a binary search on the index file would need:
- \lceil (log2 bi) \rceil = \lceil (log_2 3) \rceil = 2 block accesses.
This index would typically be loaded in main memory (occupies 11,000 or 11 Kbytes) and takes negligible time to search in memory.
One block access to the data file would lead to the first record with a given zip code.

Secondary Indexes

Provide a secondary means of accessing a data file.
Some primary access exists.
The data file records could be ordered, unordered, or hashed.
Ordered file with two fields:
- Indexing field, K(i)
- Block pointer or record pointer, P(i)
Usually need more storage space and longer search time than a primary index.
Provide improved search time for arbitrary records.

Example 3

Consider the file of Example 1 with r = 300,000 fixed-length records of size R = 100 bytes stored on a disk with block size B = 4,096 bytes.
The file has b = 7,500 blocks, as calculated in Example 1.
Suppose we want to search for a record with a specific value for the secondary key—a nonordering key field of the file that is V = 9 bytes long.
Without the secondary index, to do a linear search on the file would require b/2 = 7,500/2 = 3,750 block accesses on average.
Suppose that we construct a secondary index on that nonordering key field of the file. As in Example 1, a block pointer is P = 6 bytes long, so each index entry is R_i = (9+6) = 15 bytes, and the blocking factor for the index is:
- bfri = \lceil (B/Ri) \rceil = \lceil (4,096/15) \rceil = 273 index entries per block.
In a dense secondary index such as this, the total number of index entries r_i is equal to the number of records in the data file, which is 300,000.
The number of blocks needed for the index is hence:
- bi = \lceil (r/bfri) \rceil = \lceil (300,000/273) \rceil = 1,099 blocks.
A binary search on this secondary index needs \lceil (log2 bi) \rceil = \lceil (log_2 1,099) \rceil = 11 block accesses.
To search for a record using the index, we need an additional block access to the data file for a total of 11 + 1 = 12 block accesses - a vast improvement over the 3,750 block accesses needed on average for a linear search, but slightly worse than the 6 block accesses required for the primary index.
This difference arose because the primary index was non-dense and hence shorter, with only 28 blocks in length as opposed to the 1,099 blocks dense index here.

Multilevel Indexes

Designed to greatly reduce remaining search space as search is conducted.
Index file:
- Considered the first (or base level) of a multilevel index.
Second level:
- Primary index to the first level.
Third level:
- Primary index to the second level.

Dynamic Multilevel Indexes Using B-Trees and B+-Trees

Tree data structure terminology:
- A tree is formed of nodes.
- Each node (except the root) has one parent and zero or more child nodes.
- A leaf node has no child nodes.
- Unbalanced if leaf nodes occur at different levels.
- A non-leaf node is called an internal node.
- The subtree of a node consists of the node and all descendant nodes.

Search Trees and B-Trees

A search tree is used to guide the search for a record given the value of one of the record’s fields.
Algorithms are necessary for inserting and deleting search values into and from the tree.

B-Trees

Provide a multi-level access structure.
The tree is always balanced.
Space wasted by deletion never becomes excessive.
Each node is at least half-full.
Each node in a B-tree of order p can have at most p-1 search values.

B+-Trees

Data pointers are stored only at the leaf nodes.
Leaf nodes have an entry for every value of the search field, and a data pointer to the record if the search field is a key field.
For a non-key search field, the pointer points to a block containing pointers to the data file records.
Internal nodes:
- Some search field values from the leaf nodes are repeated to guide the search.

Indexes on Multiple Keys

Multiple attributes are involved in many retrieval and update requests.
Composite keys:
- Access structure using a key value that combines attributes.
Partitioned hashing:
- Suitable for equality comparisons.
Grid files:
- Array with one dimension for each search attribute.

Other Types of Indexes

Hash indexes
- Secondary structure for file access.
- Uses hashing on a search key other than the one used for the primary data file organization.
- Index entries of the form (K, P_r) or (K, P)
  - P_r: pointer to the record containing the key.
  - P: pointer to the block containing the record for that key.
Bitmap Indexes
- Used with a large number of rows.
- Creates an index for one or more columns.
- Each value or value range in the column is indexed.
- Built on one particular value of a particular field.
- Array of bits.
  - Existence bitmap.
  - Bitmaps for B+-tree leaf nodes.
Function-Based Indexing
- The value resulting from applying some function on a field (or fields) becomes the index key.
- Introduced in Oracle relational DBMS.
- Example:
  - The function UPPER(Lname) returns the uppercase representation.

Some General Issues Concerning Indexing

Physical index
- A pointer specifies the physical record address.
- Disadvantage: the pointer must be changed if the record is moved.
Logical index
- Used when physical record addresses are expected to change frequently.
- Entries are of the form (K, K_p)

Index Creation

General form of the command to create an index
Unique and cluster keywords are optional.
Order can be ASC or DESC.
Secondary indexes can be created for any primary record organization.
Complements other primary access methods.

Indexing of Strings

Strings can be variable length.
Strings may be too long, limiting the fan-out.
Prefix compression:
- Stores only the prefix of the search key that is adequate to distinguish the keys that are being separated and directed to the subtree.

Tuning Indexes

Tuning goals:
- Dynamically evaluate requirements.
- Reorganize indexes to yield the best performance.
Reasons for revising initial index choice:
- Certain queries may take too long to run due to the lack of an index.
- Certain indexes may not get utilized.
- Certain indexes may undergo too much updating if based on an attribute that undergoes frequent changes.

Additional Issues Related to Storage of Relations and Indexes

Enforcing a key constraint on an attribute:
- Reject insertion if a new record has the same key attribute as an existing record.
- Duplicates occur if the index is created on a non-key field.
Fully inverted file:
- Has a secondary index on every field.
Indexing hints in queries:
- Suggestions are used to expedite query execution.
Column-based storage of relations:
- An alternative to the traditional way of storing relations by row.
- Offers advantages for read-only queries.
- Offers additional freedom in index creation.

Physical Database Design in Relational Databases

Physical design goals:
- Create an appropriate structure for data in storage.
- Guarantee good performance.
Must know job mix for a particular set of database system applications.
Analyzing the database queries and transactions:
- Information about each retrieval query.
- Information about each update transaction.
Analyzing the expected frequency of invocation of queries and transactions.
Expected frequency of using each attribute as a selection or join attribute.
80-20 rule: 80 percent of processing is accounted for by only 20 percent of queries and transactions.
Analyzing the time constraints of queries and transactions.
Selection attributes associated with time constraints are candidates for primary access structures.
Analyzing the expected frequency of update operations.
- Minimize the number of access paths for a frequently updated file.
- Updating the access paths themselves slows down update operations.
Analyzing the uniqueness constraints on attributes.
- Access paths should be specified on all candidate key attributes that are either the primary key of a file or unique attributes.

Physical Database Design Decisions

Design decisions about indexing:
- Whether to index an attribute (attribute is a key or used by a query).
- What attribute(s) to index on (single or multiple).
- Whether to set up a clustered index (one per table).
- Whether to use a hash index over a tree index (hash indexes do not support range queries).
- Whether to use dynamic hashing (appropriate for very volatile files).

Summary

Indexes are access structures that improve the efficiency of record retrieval from a data file.
Ordered single-level index types include primary, clustering, and secondary indexes.
Multilevel indexes can be implemented as B-trees and B+-trees.
Dynamic structures.
Multiple key access methods.
Logical and physical indexes.