Indexing Structures and Physical Database Design Notes

Indexing Structures for Files and Physical Database Design

Introduction

  • Indexes speed up record retrieval based on search conditions.
  • They provide secondary access paths.
  • Any field can be indexed.
  • Multiple indexes can be created.
  • Most indexes are based on ordered files.
  • Tree data structures organize the index.

Multilevel Indexes

  • Indexing schemes typically involve an ordered index file.
  • Binary search locates pointers to a disk block using approximately (log2 bi).
  • Each step reduces the search space by a factor of 2.
  • Multilevel indexes reduce the index by a value larger than 2.
  • bfri is the fan-out of the multilevel index, denoted as fo.
  • Binary search divides the search space into two halves.
  • Multilevel index divides it into n ways (where n = f_o) at each step.
  • Searching a multilevel index requires approximately (log{fo} b_i) block accesses.
  • The fan-out is generally much larger than 2.
  • With a block size of 4,096 bytes (common in DBMSs), the fan-out depends on how many (key + block pointer) entries fit within a block.
  • With a 4-byte block pointer and a 9-byte key (e.g., SSN), the fan-out can be calculated.
  • Index file is the first (or base) level of a multilevel index.
  • The second level is a primary index to the first level.
  • The third level is a primary index to the second level.
  • We repeat the process until all entries of some index level t fit in a single block.
  • The last level is the top index level.
  • Each level reduces entries at the previous level by a factor of f_o, the index fan-out.
  • A multilevel index will have approximately t levels.
  • Searching the index retrieves a single disk block at each level.
  • t disk blocks are accessed for an index search, where t is the number of index levels.

Example

  • Converting a dense secondary index into a multilevel index.
  • Given bfri = 273 index entries per block (fan-out fo = 273) and b_1 = 1099 blocks at the first level.
  • The number of second-level blocks will be b2 = \lceil (b1 / f_o) \rceil = \lceil (1099 / 273) \rceil = 5 blocks.
  • The number of third-level blocks will be b3 = \lceil (b2 / f_o) \rceil = \lceil (5 / 273) \rceil = 1 block.
  • The third level is the top level, and t = 3.
  • Accessing a record requires accessing one block at each level plus one from the data file, needing t + 1 = 3 + 1 = 4 block accesses.
  • Compared to a single-level index with binary search, which may require 12 block accesses.

Dynamic Multilevel Indexes Using B-Trees and B+-Trees

  • Tree data structure terminology:
    • Tree is formed of nodes.
    • Each node (except the root) has one parent and zero or more child nodes.
    • Leaf node has no child nodes.
    • Unbalanced if leaf nodes occur at different levels.
    • Nonleaf node is called an internal node.
    • Subtree of a node consists of the node and all descendant nodes.

Search Trees

  • Search tree guides the search for a record given a value of one of the record’s fields.
  • A search tree of order p is a tree where each node contains at most q: p - 1 search values and p pointers in the order P1, K1, P2, K2, …, Pq, Kq, P_q+1, where q ≤ p.
  • Each P_i is a pointer to a child node (or NULL).
  • Each K_i is a search value from some ordered set of values.
  • All search values are assumed to be unique.

B-Trees

  • B-tree has additional constraints to ensure the tree is always balanced and space wasted by deletion is controlled.
  • Provides multi-level access structure.
  • Each node is at least half-full.
  • Each node in a B-tree of order p can have at most p-1 search values.

Example

  • Search field is a nonordering key field and construct a B-tree on this field with p = 23.
  • Assume each node is 69% full.
  • Each node, on average, will have p * 0.69 = 23 * 0.69 ≈ 16 pointers and 15 search key field values.
  • The average fan-out f_o = 16.
    • Root: 1 node, 15 key entries, 16 pointers
    • Level 1: 16 nodes, 240 key entries, 256 pointers
    • Level 2: 256 nodes, 3,840 key entries, 4,096 pointers
    • Level 3: 4,096 nodes, 61,440 key entries

B+-Trees

  • Data pointers are stored only at the leaf nodes.
  • Leaf nodes have an entry for every value of the search field and a data pointer to the record if the search field is a key field.
  • For a nonkey search field, the pointer points to a block containing pointers to the data file records.
  • Internal nodes repeat some search field values from the leaf nodes to guide the search.
  • every value of the search field appears once at some level in the tree, along with a data pointer
  • data pointers are stored only at the leaf nodes of the tree.
  • the structure of leaf nodes differs from the structure of internal nodes.
  • The leaf nodes have an entry for every value of the search field, along with a data pointer.

Indexes on Multiple Keys

  • Multiple attributes are involved in many retrieval and update requests.
  • Composite keys use a key value that combines attributes.
  • Partitioned hashing is suitable for equality comparisons.

Example

  • Consider an EMPLOYEE file with attributes Dno (department number), Age, Street, City, Zipcode, Salary, and Skillcode, with the key of Ssn (Social Security number).
  • Query: List the employees in department number 4 whose age is 59.
  • Dno and Age are nonkey attributes, so a search value for either points to multiple records.
    • Assuming Dno has an index, but Age does not: Access records with Dno = 4 and select those with Age = 59.
    • Assuming Age has an index, but Dno does not: Access records with Age = 59 and select those with Dno = 4.
    • If both Dno and Age are indexed: Use both indexes, intersect the sets of records or pointers to find those satisfying both conditions.
  • Grid files use an array with one dimension for each search attribute.

Other Types of Indexes

  • Hash indexes are a secondary structure for file access.
  • Use hashing on a search key other than the one used for the primary data file organization.
  • Index entries are of the form (K, Pr) or (K, P).
    • Pr: pointer to the record containing the key.
    • P: pointer to the block containing the record for that key.
      *
      *Hash-based indexing is a method of organizing and retrieving data efficiently using hash functions.

Bitmap Indexes

  • Used with a large number of rows.
  • Creates an index for one or more columns.
  • Each value or value range in the column is indexed.
  • Built on one particular value of a particular field.
  • Array of bits.
  • Existence bitmap.
  • Bitmaps for B+-tree leaf nodes.

Function-Based Indexing

  • The value resulting from applying some function on a field (or fields) becomes the index key.
  • Introduced in Oracle relational DBMS.
  • Example: Function UPPER(Lname) returns uppercase representation.

Some General Issues Concerning Indexing

  • Physical index: Pointer specifies physical record address.
    • Disadvantage: pointer must be changed if the record is moved.
  • Logical index: Used when physical record addresses are expected to change frequently.
    • Entries are of the form (K, Kp).

Index Creation

  • General form of the command to create an index.
  • UNIQUE and CLUSTER keywords optional.
  • Order can be ASC or DESC.
  • Secondary indexes can be created for any primary record organization.
  • Complements other primary access methods.

Indexing of Strings

  • Strings can be variable length.
  • Strings may be too long, limiting the fan-out.
  • Prefix compression: Stores only the prefix of the search key adequate to distinguish the keys being separated and directed to the subtree.

Tuning Indexes

  • Tuning goals: Dynamically evaluate requirements and reorganize indexes to yield the best performance.
  • Reasons for revising initial index choice:
    • Certain queries may take too long due to a lack of an index.
    • Certain indexes may not be utilized.
    • Certain indexes may undergo too much updating if based on an attribute that undergoes frequent changes.

Additional Issues Related to Storage of Relations and Indexes

  • Enforcing a key constraint on an attribute: Reject insertion if a new record has the same key attribute as an existing record.
  • Duplicates occur if an index is created on a nonkey field.
  • Fully inverted file: Has a secondary index on every field.
  • Indexing hints in queries: Suggestions used to expedite query execution.
  • Column-based storage of relations: An alternative to the traditional way of storing relations by row.
  • Offers advantages for read-only queries.
  • Offers additional freedom in index creation.

Physical Database Design in Relational Databases

  • Physical design goals: Create an appropriate structure for data in storage and guarantee good performance.
  • Must know job mix for a particular set of database system applications.
  • Analyzing the database queries and transactions.
    • Information about each retrieval query.
    • Information about each update transaction.
  • Analyzing the expected frequency of invocation of queries and transactions.
  • Expected frequency of using each attribute as a selection or join attribute.
  • 80-20 rule: 80 percent of processing accounted for by only 20 percent of queries and transactions.
  • Analyzing the time constraints of queries and transactions.
  • Selection attributes associated with time constraints are candidates for primary access structures.
  • Analyzing the expected frequency of update operations.
  • Minimize the number of access paths for a frequently updated file.
  • Updating the access paths themselves slows down update operations.
  • Analyzing the uniqueness constraints on attributes.
  • Access paths should be specified on all candidate key attributes that are either the primary key of a file or unique attributes.

Physical Database Design Decisions

  • Design decisions about indexing:
    • Whether to index an attribute: If the attribute is a key or used by a query.
    • What attribute(s) to index on: Single or multiple.
    • Whether to set up a clustered index: One per table.
    • Whether to use a hash index over a tree index: Hash indexes do not support range queries.
    • Whether to use dynamic hashing: Appropriate for very volatile files.

Summary

  • Indexes are access structures that improve the efficiency of record retrieval from a data file.
  • Ordered single-level index types: Primary, clustering, and secondary.
  • Multilevel indexes can be implemented as B-trees and B+-trees.
  • Dynamic structures.
  • Multiple key access methods.
  • Logical and physical indexes.