Indexing Structures for Files and Physical Database Design

Indexing Structures for Files and Physical Database Design

Introduction

  • Indexes are used to speed up record retrieval in response to specific search conditions.
  • Index structures provide secondary access paths.
  • Any field can be used to create an index.
  • Multiple indexes can be constructed.
  • Most indexes are based on ordered files.
  • Tree data structures are used to organize the index.

Types of Single-Level Ordered Indexes

  • Ordered index is similar to an index in a textbook.
  • Indexing field (attribute).
  • The index stores each value of the index field with a list of pointers to all disk blocks that contain records with that field value.
  • Values in the index are ordered.
    • Primary index
      • Specified on the ordering key field of an ordered file of records.
    • Clustering index
      • Used if numerous records can have the same value for the ordering field.
    • Secondary index
      • Can be specified on any non-ordering field.
      • A data file can have several secondary indexes.

Primary Indexes

  • Ordered file with two fields:
    • Primary key, K(i)
    • Pointer to a disk block, P(i)
  • One index entry in the index file for each block in the data file.
  • Indexes may be dense or sparse.
    • Dense index: Has an index entry for every search key value in the data file.
    • Sparse index: Has entries for only some search values.

Example 1

  • Consider an ordered file with r = 300,000 records.
  • Disk with block size B = 4,096 bytes.
  • File records are of fixed size and are unspanned.
  • Record length R = 100 bytes.
  • Blocking factor for the file: bfr = \lfloor(B/R)\rfloor = \lfloor(4,096/100)\rfloor = 40 records per block.
  • Number of blocks needed for the file: b = \lceil(r/bfr)\rceil = \lceil(300,000/40)\rceil = 7,500 blocks.
  • A binary search on the data file would need approximately \lceil log2 b\rceil = \lceil (log2 7,500)\rceil = 13 block accesses.
  • Key field size is V = 9 bytes, and block pointer size is 6 bytes.
  • Size of each index entry is (9 + 6) = 15 bytes.
  • Blocking factor for the index: bfrI = \lfloor(B/Ri)\rfloor = \lfloor(4,096/15)\rfloor = 273 entries per block.
  • Total number of index entries equals the number of blocks in the data file, which is 7,500.
  • Number of index blocks: bi = \lceil(ri/bfri)\rceil = \lceil(7,500/273)\rceil = 28 blocks.
  • A binary search on the index file would need \lceil(log2 bi)\rceil = \lceil(log2 28)\rceil = 5 block accesses.
  • One additional block access to the data file is needed, for a total of 5 + 1 = 6 block accesses.

Insertion and Deletion

  • Major problem: insertion and deletion of records.
  • Requires moving records around and changing index values.
  • With a primary index, the problem is compounded.
    • Inserting a record in its correct position requires moving records and changing some index entries.
  • Solutions:
    • Use an unordered overflow file.
    • Use a linked list of overflow records.

Clustering Indexes

  • Clustering field: File records are physically ordered on a non-key field without a distinct value for each record.
  • Ordered file with two fields:
    • Same type as the clustering field.
    • Disk block pointer.

Example 2

  • Consider the same ordered file with r = 300,000 records stored on a disk with block size B = 4,096 bytes.
  • Ordered by the attribute Zipcode, and there are 1,000 zip codes in the file (with an average of 300 records per zip code, assuming even distribution across zip codes).
  • The index in this case has 1,000 index entries of 11 bytes each (5-byte Zipcode and 6-byte block pointer) with a blocking factor:
    • bfri = \lceil (B/Ri) \rceil = \lceil (4,096/11) \rceil = 372 index entries per block.
  • Number of index blocks is hence:
    • bi = \lceil (ri/bfr_i) \rceil = \lceil (1,000/372) \rceil = 3 blocks.
  • To perform a binary search on the index file would need:
    • \lceil (log2 bi) \rceil = \lceil (log_2 3) \rceil = 2 block accesses.
  • This index would typically be loaded in main memory (occupies 11,000 or 11 Kbytes) and takes negligible time to search in memory.
  • One block access to the data file would lead to the first record with a given zip code.

Secondary Indexes

  • Provide a secondary means of accessing a data file.
  • Some primary access exists.
  • The data file records could be ordered, unordered, or hashed.
  • Ordered file with two fields:
    • Indexing field, K(i)
    • Block pointer or record pointer, P(i)
  • Usually need more storage space and longer search time than a primary index.
  • Provide improved search time for arbitrary records.

Example 3

  • Consider the file of Example 1 with r = 300,000 fixed-length records of size R = 100 bytes stored on a disk with block size B = 4,096 bytes.
  • The file has b = 7,500 blocks, as calculated in Example 1.
  • Suppose we want to search for a record with a specific value for the secondary key—a nonordering key field of the file that is V = 9 bytes long.
  • Without the secondary index, to do a linear search on the file would require b/2 = 7,500/2 = 3,750 block accesses on average.
  • Suppose that we construct a secondary index on that nonordering key field of the file. As in Example 1, a block pointer is P = 6 bytes long, so each index entry is R_i = (9+6) = 15 bytes, and the blocking factor for the index is:
    • bfri = \lceil (B/Ri) \rceil = \lceil (4,096/15) \rceil = 273 index entries per block.
  • In a dense secondary index such as this, the total number of index entries r_i is equal to the number of records in the data file, which is 300,000.
  • The number of blocks needed for the index is hence:
    • bi = \lceil (r/bfri) \rceil = \lceil (300,000/273) \rceil = 1,099 blocks.
  • A binary search on this secondary index needs \lceil (log2 bi) \rceil = \lceil (log_2 1,099) \rceil = 11 block accesses.
  • To search for a record using the index, we need an additional block access to the data file for a total of 11 + 1 = 12 block accesses - a vast improvement over the 3,750 block accesses needed on average for a linear search, but slightly worse than the 6 block accesses required for the primary index.
  • This difference arose because the primary index was non-dense and hence shorter, with only 28 blocks in length as opposed to the 1,099 blocks dense index here.

Multilevel Indexes

  • Designed to greatly reduce remaining search space as search is conducted.
  • Index file:
    • Considered the first (or base level) of a multilevel index.
  • Second level:
    • Primary index to the first level.
  • Third level:
    • Primary index to the second level.

Dynamic Multilevel Indexes Using B-Trees and B+-Trees

  • Tree data structure terminology:
    • A tree is formed of nodes.
    • Each node (except the root) has one parent and zero or more child nodes.
    • A leaf node has no child nodes.
    • Unbalanced if leaf nodes occur at different levels.
    • A non-leaf node is called an internal node.
    • The subtree of a node consists of the node and all descendant nodes.

Search Trees and B-Trees

  • A search tree is used to guide the search for a record given the value of one of the record’s fields.
  • Algorithms are necessary for inserting and deleting search values into and from the tree.

B-Trees

  • Provide a multi-level access structure.
  • The tree is always balanced.
  • Space wasted by deletion never becomes excessive.
  • Each node is at least half-full.
  • Each node in a B-tree of order p can have at most p-1 search values.

B+-Trees

  • Data pointers are stored only at the leaf nodes.
  • Leaf nodes have an entry for every value of the search field, and a data pointer to the record if the search field is a key field.
  • For a non-key search field, the pointer points to a block containing pointers to the data file records.
  • Internal nodes:
    • Some search field values from the leaf nodes are repeated to guide the search.

Indexes on Multiple Keys

  • Multiple attributes are involved in many retrieval and update requests.
  • Composite keys:
    • Access structure using a key value that combines attributes.
  • Partitioned hashing:
    • Suitable for equality comparisons.
  • Grid files:
    • Array with one dimension for each search attribute.

Other Types of Indexes

  • Hash indexes
    • Secondary structure for file access.
    • Uses hashing on a search key other than the one used for the primary data file organization.
    • Index entries of the form (K, P_r) or (K, P)
      • P_r: pointer to the record containing the key.
      • P: pointer to the block containing the record for that key.
  • Bitmap Indexes
    • Used with a large number of rows.
    • Creates an index for one or more columns.
    • Each value or value range in the column is indexed.
    • Built on one particular value of a particular field.
    • Array of bits.
      • Existence bitmap.
      • Bitmaps for B+-tree leaf nodes.
  • Function-Based Indexing
    • The value resulting from applying some function on a field (or fields) becomes the index key.
    • Introduced in Oracle relational DBMS.
    • Example:
      • The function UPPER(Lname) returns the uppercase representation.

Some General Issues Concerning Indexing

  • Physical index
    • A pointer specifies the physical record address.
    • Disadvantage: the pointer must be changed if the record is moved.
  • Logical index
    • Used when physical record addresses are expected to change frequently.
    • Entries are of the form (K, K_p)

Index Creation

  • General form of the command to create an index
  • Unique and cluster keywords are optional.
  • Order can be ASC or DESC.
  • Secondary indexes can be created for any primary record organization.
  • Complements other primary access methods.

Indexing of Strings

  • Strings can be variable length.
  • Strings may be too long, limiting the fan-out.
  • Prefix compression:
    • Stores only the prefix of the search key that is adequate to distinguish the keys that are being separated and directed to the subtree.

Tuning Indexes

  • Tuning goals:
    • Dynamically evaluate requirements.
    • Reorganize indexes to yield the best performance.
  • Reasons for revising initial index choice:
    • Certain queries may take too long to run due to the lack of an index.
    • Certain indexes may not get utilized.
    • Certain indexes may undergo too much updating if based on an attribute that undergoes frequent changes.

Additional Issues Related to Storage of Relations and Indexes

  • Enforcing a key constraint on an attribute:
    • Reject insertion if a new record has the same key attribute as an existing record.
    • Duplicates occur if the index is created on a non-key field.
  • Fully inverted file:
    • Has a secondary index on every field.
  • Indexing hints in queries:
    • Suggestions are used to expedite query execution.
  • Column-based storage of relations:
    • An alternative to the traditional way of storing relations by row.
    • Offers advantages for read-only queries.
    • Offers additional freedom in index creation.

Physical Database Design in Relational Databases

  • Physical design goals:
    • Create an appropriate structure for data in storage.
    • Guarantee good performance.
  • Must know job mix for a particular set of database system applications.
  • Analyzing the database queries and transactions:
    • Information about each retrieval query.
    • Information about each update transaction.
  • Analyzing the expected frequency of invocation of queries and transactions.
  • Expected frequency of using each attribute as a selection or join attribute.
  • 80-20 rule: 80 percent of processing is accounted for by only 20 percent of queries and transactions.
  • Analyzing the time constraints of queries and transactions.
  • Selection attributes associated with time constraints are candidates for primary access structures.
  • Analyzing the expected frequency of update operations.
    • Minimize the number of access paths for a frequently updated file.
    • Updating the access paths themselves slows down update operations.
  • Analyzing the uniqueness constraints on attributes.
    • Access paths should be specified on all candidate key attributes that are either the primary key of a file or unique attributes.

Physical Database Design Decisions

  • Design decisions about indexing:
    • Whether to index an attribute (attribute is a key or used by a query).
    • What attribute(s) to index on (single or multiple).
    • Whether to set up a clustered index (one per table).
    • Whether to use a hash index over a tree index (hash indexes do not support range queries).
    • Whether to use dynamic hashing (appropriate for very volatile files).

Summary

  • Indexes are access structures that improve the efficiency of record retrieval from a data file.
  • Ordered single-level index types include primary, clustering, and secondary indexes.
  • Multilevel indexes can be implemented as B-trees and B+-trees.
  • Dynamic structures.
  • Multiple key access methods.
  • Logical and physical indexes.