Indexing Structures and Physical Database Design Notes
Indexing Structures for Files and Physical Database Design
Introduction
- Indexes speed up record retrieval based on search conditions.
- They provide secondary access paths.
- Any field can be indexed.
- Multiple indexes can be created.
- Most indexes are based on ordered files.
- Tree data structures organize the index.
Multilevel Indexes
- Indexing schemes typically involve an ordered index file.
- Binary search locates pointers to a disk block using approximately (log2 bi).
- Each step reduces the search space by a factor of 2.
- Multilevel indexes reduce the index by a value larger than 2.
- bfri is the fan-out of the multilevel index, denoted as fo.
- Binary search divides the search space into two halves.
- Multilevel index divides it into n ways (where n = f_o) at each step.
- Searching a multilevel index requires approximately (log{fo} b_i) block accesses.
- The fan-out is generally much larger than 2.
- With a block size of 4,096 bytes (common in DBMSs), the fan-out depends on how many (key + block pointer) entries fit within a block.
- With a 4-byte block pointer and a 9-byte key (e.g., SSN), the fan-out can be calculated.
- Index file is the first (or base) level of a multilevel index.
- The second level is a primary index to the first level.
- The third level is a primary index to the second level.
- We repeat the process until all entries of some index level t fit in a single block.
- The last level is the top index level.
- Each level reduces entries at the previous level by a factor of f_o, the index fan-out.
- A multilevel index will have approximately t levels.
- Searching the index retrieves a single disk block at each level.
- t disk blocks are accessed for an index search, where t is the number of index levels.
Example
- Converting a dense secondary index into a multilevel index.
- Given bfri = 273 index entries per block (fan-out fo = 273) and b_1 = 1099 blocks at the first level.
- The number of second-level blocks will be b2 = \lceil (b1 / f_o) \rceil = \lceil (1099 / 273) \rceil = 5 blocks.
- The number of third-level blocks will be b3 = \lceil (b2 / f_o) \rceil = \lceil (5 / 273) \rceil = 1 block.
- The third level is the top level, and t = 3.
- Accessing a record requires accessing one block at each level plus one from the data file, needing t + 1 = 3 + 1 = 4 block accesses.
- Compared to a single-level index with binary search, which may require 12 block accesses.
Dynamic Multilevel Indexes Using B-Trees and B+-Trees
- Tree data structure terminology:
- Tree is formed of nodes.
- Each node (except the root) has one parent and zero or more child nodes.
- Leaf node has no child nodes.
- Unbalanced if leaf nodes occur at different levels.
- Nonleaf node is called an internal node.
- Subtree of a node consists of the node and all descendant nodes.
Search Trees
- Search tree guides the search for a record given a value of one of the record’s fields.
- A search tree of order p is a tree where each node contains at most q: p - 1 search values and p pointers in the order P1, K1, P2, K2, …, Pq, Kq, P_q+1, where q ≤ p.
- Each P_i is a pointer to a child node (or NULL).
- Each K_i is a search value from some ordered set of values.
- All search values are assumed to be unique.
B-Trees
- B-tree has additional constraints to ensure the tree is always balanced and space wasted by deletion is controlled.
- Provides multi-level access structure.
- Each node is at least half-full.
- Each node in a B-tree of order p can have at most p-1 search values.
Example
- Search field is a nonordering key field and construct a B-tree on this field with p = 23.
- Assume each node is 69% full.
- Each node, on average, will have p * 0.69 = 23 * 0.69 ≈ 16 pointers and 15 search key field values.
- The average fan-out f_o = 16.
- Root: 1 node, 15 key entries, 16 pointers
- Level 1: 16 nodes, 240 key entries, 256 pointers
- Level 2: 256 nodes, 3,840 key entries, 4,096 pointers
- Level 3: 4,096 nodes, 61,440 key entries
B+-Trees
- Data pointers are stored only at the leaf nodes.
- Leaf nodes have an entry for every value of the search field and a data pointer to the record if the search field is a key field.
- For a nonkey search field, the pointer points to a block containing pointers to the data file records.
- Internal nodes repeat some search field values from the leaf nodes to guide the search.
- every value of the search field appears once at some level in the tree, along with a data pointer
- data pointers are stored only at the leaf nodes of the tree.
- the structure of leaf nodes differs from the structure of internal nodes.
- The leaf nodes have an entry for every value of the search field, along with a data pointer.
Indexes on Multiple Keys
- Multiple attributes are involved in many retrieval and update requests.
- Composite keys use a key value that combines attributes.
- Partitioned hashing is suitable for equality comparisons.
Example
- Consider an EMPLOYEE file with attributes Dno (department number), Age, Street, City, Zipcode, Salary, and Skillcode, with the key of Ssn (Social Security number).
- Query: List the employees in department number 4 whose age is 59.
- Dno and Age are nonkey attributes, so a search value for either points to multiple records.
- Assuming Dno has an index, but Age does not: Access records with Dno = 4 and select those with Age = 59.
- Assuming Age has an index, but Dno does not: Access records with Age = 59 and select those with Dno = 4.
- If both Dno and Age are indexed: Use both indexes, intersect the sets of records or pointers to find those satisfying both conditions.
- Grid files use an array with one dimension for each search attribute.
Other Types of Indexes
- Hash indexes are a secondary structure for file access.
- Use hashing on a search key other than the one used for the primary data file organization.
- Index entries are of the form (K, Pr) or (K, P).
- Pr: pointer to the record containing the key.
- P: pointer to the block containing the record for that key.
*
*Hash-based indexing is a method of organizing and retrieving data efficiently using hash functions.
Bitmap Indexes
- Used with a large number of rows.
- Creates an index for one or more columns.
- Each value or value range in the column is indexed.
- Built on one particular value of a particular field.
- Array of bits.
- Existence bitmap.
- Bitmaps for B+-tree leaf nodes.
Function-Based Indexing
- The value resulting from applying some function on a field (or fields) becomes the index key.
- Introduced in Oracle relational DBMS.
- Example: Function UPPER(Lname) returns uppercase representation.
Some General Issues Concerning Indexing
- Physical index: Pointer specifies physical record address.
- Disadvantage: pointer must be changed if the record is moved.
- Logical index: Used when physical record addresses are expected to change frequently.
- Entries are of the form (K, Kp).
Index Creation
- General form of the command to create an index.
- UNIQUE and CLUSTER keywords optional.
- Order can be ASC or DESC.
- Secondary indexes can be created for any primary record organization.
- Complements other primary access methods.
Indexing of Strings
- Strings can be variable length.
- Strings may be too long, limiting the fan-out.
- Prefix compression: Stores only the prefix of the search key adequate to distinguish the keys being separated and directed to the subtree.
Tuning Indexes
- Tuning goals: Dynamically evaluate requirements and reorganize indexes to yield the best performance.
- Reasons for revising initial index choice:
- Certain queries may take too long due to a lack of an index.
- Certain indexes may not be utilized.
- Certain indexes may undergo too much updating if based on an attribute that undergoes frequent changes.
- Enforcing a key constraint on an attribute: Reject insertion if a new record has the same key attribute as an existing record.
- Duplicates occur if an index is created on a nonkey field.
- Fully inverted file: Has a secondary index on every field.
- Indexing hints in queries: Suggestions used to expedite query execution.
- Column-based storage of relations: An alternative to the traditional way of storing relations by row.
- Offers advantages for read-only queries.
- Offers additional freedom in index creation.
Physical Database Design in Relational Databases
- Physical design goals: Create an appropriate structure for data in storage and guarantee good performance.
- Must know job mix for a particular set of database system applications.
- Analyzing the database queries and transactions.
- Information about each retrieval query.
- Information about each update transaction.
- Analyzing the expected frequency of invocation of queries and transactions.
- Expected frequency of using each attribute as a selection or join attribute.
- 80-20 rule: 80 percent of processing accounted for by only 20 percent of queries and transactions.
- Analyzing the time constraints of queries and transactions.
- Selection attributes associated with time constraints are candidates for primary access structures.
- Analyzing the expected frequency of update operations.
- Minimize the number of access paths for a frequently updated file.
- Updating the access paths themselves slows down update operations.
- Analyzing the uniqueness constraints on attributes.
- Access paths should be specified on all candidate key attributes that are either the primary key of a file or unique attributes.
Physical Database Design Decisions
- Design decisions about indexing:
- Whether to index an attribute: If the attribute is a key or used by a query.
- What attribute(s) to index on: Single or multiple.
- Whether to set up a clustered index: One per table.
- Whether to use a hash index over a tree index: Hash indexes do not support range queries.
- Whether to use dynamic hashing: Appropriate for very volatile files.
Summary
- Indexes are access structures that improve the efficiency of record retrieval from a data file.
- Ordered single-level index types: Primary, clustering, and secondary.
- Multilevel indexes can be implemented as B-trees and B+-trees.
- Dynamic structures.
- Multiple key access methods.
- Logical and physical indexes.