Indexing Structures for Files and Physical Database Design
Indexing Structures for Files and Physical Database Design
Introduction
- Indexes are used to speed up record retrieval in response to specific search conditions.
- Index structures provide secondary access paths.
- Any field can be used to create an index.
- Multiple indexes can be constructed.
- Most indexes are based on ordered files.
- Tree data structures are used to organize the index.
Types of Single-Level Ordered Indexes
- Ordered index is similar to an index in a textbook.
- Indexing field (attribute).
- The index stores each value of the index field with a list of pointers to all disk blocks that contain records with that field value.
- Values in the index are ordered.
- Primary index
- Specified on the ordering key field of an ordered file of records.
- Clustering index
- Used if numerous records can have the same value for the ordering field.
- Secondary index
- Can be specified on any non-ordering field.
- A data file can have several secondary indexes.
Primary Indexes
- Ordered file with two fields:
- Primary key, K(i)
- Pointer to a disk block, P(i)
- One index entry in the index file for each block in the data file.
- Indexes may be dense or sparse.
- Dense index: Has an index entry for every search key value in the data file.
- Sparse index: Has entries for only some search values.
Example 1
- Consider an ordered file with r = 300,000 records.
- Disk with block size B = 4,096 bytes.
- File records are of fixed size and are unspanned.
- Record length R = 100 bytes.
- Blocking factor for the file: bfr = \lfloor(B/R)\rfloor = \lfloor(4,096/100)\rfloor = 40 records per block.
- Number of blocks needed for the file: b = \lceil(r/bfr)\rceil = \lceil(300,000/40)\rceil = 7,500 blocks.
- A binary search on the data file would need approximately \lceil log2 b\rceil = \lceil (log2 7,500)\rceil = 13 block accesses.
- Key field size is V = 9 bytes, and block pointer size is 6 bytes.
- Size of each index entry is (9 + 6) = 15 bytes.
- Blocking factor for the index: bfrI = \lfloor(B/Ri)\rfloor = \lfloor(4,096/15)\rfloor = 273 entries per block.
- Total number of index entries equals the number of blocks in the data file, which is 7,500.
- Number of index blocks: bi = \lceil(ri/bfri)\rceil = \lceil(7,500/273)\rceil = 28 blocks.
- A binary search on the index file would need \lceil(log2 bi)\rceil = \lceil(log2 28)\rceil = 5 block accesses.
- One additional block access to the data file is needed, for a total of 5 + 1 = 6 block accesses.
Insertion and Deletion
- Major problem: insertion and deletion of records.
- Requires moving records around and changing index values.
- With a primary index, the problem is compounded.
- Inserting a record in its correct position requires moving records and changing some index entries.
- Solutions:
- Use an unordered overflow file.
- Use a linked list of overflow records.
Clustering Indexes
- Clustering field: File records are physically ordered on a non-key field without a distinct value for each record.
- Ordered file with two fields:
- Same type as the clustering field.
- Disk block pointer.
Example 2
- Consider the same ordered file with r = 300,000 records stored on a disk with block size B = 4,096 bytes.
- Ordered by the attribute Zipcode, and there are 1,000 zip codes in the file (with an average of 300 records per zip code, assuming even distribution across zip codes).
- The index in this case has 1,000 index entries of 11 bytes each (5-byte Zipcode and 6-byte block pointer) with a blocking factor:
- bfri = \lceil (B/Ri) \rceil = \lceil (4,096/11) \rceil = 372 index entries per block.
- Number of index blocks is hence:
- bi = \lceil (ri/bfr_i) \rceil = \lceil (1,000/372) \rceil = 3 blocks.
- To perform a binary search on the index file would need:
- \lceil (log2 bi) \rceil = \lceil (log_2 3) \rceil = 2 block accesses.
- This index would typically be loaded in main memory (occupies 11,000 or 11 Kbytes) and takes negligible time to search in memory.
- One block access to the data file would lead to the first record with a given zip code.
Secondary Indexes
- Provide a secondary means of accessing a data file.
- Some primary access exists.
- The data file records could be ordered, unordered, or hashed.
- Ordered file with two fields:
- Indexing field, K(i)
- Block pointer or record pointer, P(i)
- Usually need more storage space and longer search time than a primary index.
- Provide improved search time for arbitrary records.
Example 3
- Consider the file of Example 1 with r = 300,000 fixed-length records of size R = 100 bytes stored on a disk with block size B = 4,096 bytes.
- The file has b = 7,500 blocks, as calculated in Example 1.
- Suppose we want to search for a record with a specific value for the secondary key—a nonordering key field of the file that is V = 9 bytes long.
- Without the secondary index, to do a linear search on the file would require b/2 = 7,500/2 = 3,750 block accesses on average.
- Suppose that we construct a secondary index on that nonordering key field of the file. As in Example 1, a block pointer is P = 6 bytes long, so each index entry is R_i = (9+6) = 15 bytes, and the blocking factor for the index is:
- bfri = \lceil (B/Ri) \rceil = \lceil (4,096/15) \rceil = 273 index entries per block.
- In a dense secondary index such as this, the total number of index entries r_i is equal to the number of records in the data file, which is 300,000.
- The number of blocks needed for the index is hence:
- bi = \lceil (r/bfri) \rceil = \lceil (300,000/273) \rceil = 1,099 blocks.
- A binary search on this secondary index needs \lceil (log2 bi) \rceil = \lceil (log_2 1,099) \rceil = 11 block accesses.
- To search for a record using the index, we need an additional block access to the data file for a total of 11 + 1 = 12 block accesses - a vast improvement over the 3,750 block accesses needed on average for a linear search, but slightly worse than the 6 block accesses required for the primary index.
- This difference arose because the primary index was non-dense and hence shorter, with only 28 blocks in length as opposed to the 1,099 blocks dense index here.
Multilevel Indexes
- Designed to greatly reduce remaining search space as search is conducted.
- Index file:
- Considered the first (or base level) of a multilevel index.
- Second level:
- Primary index to the first level.
- Third level:
- Primary index to the second level.
Dynamic Multilevel Indexes Using B-Trees and B+-Trees
- Tree data structure terminology:
- A tree is formed of nodes.
- Each node (except the root) has one parent and zero or more child nodes.
- A leaf node has no child nodes.
- Unbalanced if leaf nodes occur at different levels.
- A non-leaf node is called an internal node.
- The subtree of a node consists of the node and all descendant nodes.
Search Trees and B-Trees
- A search tree is used to guide the search for a record given the value of one of the record’s fields.
- Algorithms are necessary for inserting and deleting search values into and from the tree.
B-Trees
- Provide a multi-level access structure.
- The tree is always balanced.
- Space wasted by deletion never becomes excessive.
- Each node is at least half-full.
- Each node in a B-tree of order p can have at most p-1 search values.
B+-Trees
- Data pointers are stored only at the leaf nodes.
- Leaf nodes have an entry for every value of the search field, and a data pointer to the record if the search field is a key field.
- For a non-key search field, the pointer points to a block containing pointers to the data file records.
- Internal nodes:
- Some search field values from the leaf nodes are repeated to guide the search.
Indexes on Multiple Keys
- Multiple attributes are involved in many retrieval and update requests.
- Composite keys:
- Access structure using a key value that combines attributes.
- Partitioned hashing:
- Suitable for equality comparisons.
- Grid files:
- Array with one dimension for each search attribute.
Other Types of Indexes
- Hash indexes
- Secondary structure for file access.
- Uses hashing on a search key other than the one used for the primary data file organization.
- Index entries of the form (K, P_r) or (K, P)
- P_r: pointer to the record containing the key.
- P: pointer to the block containing the record for that key.
- Bitmap Indexes
- Used with a large number of rows.
- Creates an index for one or more columns.
- Each value or value range in the column is indexed.
- Built on one particular value of a particular field.
- Array of bits.
- Existence bitmap.
- Bitmaps for B+-tree leaf nodes.
- Function-Based Indexing
- The value resulting from applying some function on a field (or fields) becomes the index key.
- Introduced in Oracle relational DBMS.
- Example:
- The function UPPER(Lname) returns the uppercase representation.
Some General Issues Concerning Indexing
- Physical index
- A pointer specifies the physical record address.
- Disadvantage: the pointer must be changed if the record is moved.
- Logical index
- Used when physical record addresses are expected to change frequently.
- Entries are of the form (K, K_p)
Index Creation
- General form of the command to create an index
- Unique and cluster keywords are optional.
- Order can be ASC or DESC.
- Secondary indexes can be created for any primary record organization.
- Complements other primary access methods.
Indexing of Strings
- Strings can be variable length.
- Strings may be too long, limiting the fan-out.
- Prefix compression:
- Stores only the prefix of the search key that is adequate to distinguish the keys that are being separated and directed to the subtree.
Tuning Indexes
- Tuning goals:
- Dynamically evaluate requirements.
- Reorganize indexes to yield the best performance.
- Reasons for revising initial index choice:
- Certain queries may take too long to run due to the lack of an index.
- Certain indexes may not get utilized.
- Certain indexes may undergo too much updating if based on an attribute that undergoes frequent changes.
- Enforcing a key constraint on an attribute:
- Reject insertion if a new record has the same key attribute as an existing record.
- Duplicates occur if the index is created on a non-key field.
- Fully inverted file:
- Has a secondary index on every field.
- Indexing hints in queries:
- Suggestions are used to expedite query execution.
- Column-based storage of relations:
- An alternative to the traditional way of storing relations by row.
- Offers advantages for read-only queries.
- Offers additional freedom in index creation.
Physical Database Design in Relational Databases
- Physical design goals:
- Create an appropriate structure for data in storage.
- Guarantee good performance.
- Must know job mix for a particular set of database system applications.
- Analyzing the database queries and transactions:
- Information about each retrieval query.
- Information about each update transaction.
- Analyzing the expected frequency of invocation of queries and transactions.
- Expected frequency of using each attribute as a selection or join attribute.
- 80-20 rule: 80 percent of processing is accounted for by only 20 percent of queries and transactions.
- Analyzing the time constraints of queries and transactions.
- Selection attributes associated with time constraints are candidates for primary access structures.
- Analyzing the expected frequency of update operations.
- Minimize the number of access paths for a frequently updated file.
- Updating the access paths themselves slows down update operations.
- Analyzing the uniqueness constraints on attributes.
- Access paths should be specified on all candidate key attributes that are either the primary key of a file or unique attributes.
Physical Database Design Decisions
- Design decisions about indexing:
- Whether to index an attribute (attribute is a key or used by a query).
- What attribute(s) to index on (single or multiple).
- Whether to set up a clustered index (one per table).
- Whether to use a hash index over a tree index (hash indexes do not support range queries).
- Whether to use dynamic hashing (appropriate for very volatile files).
Summary
- Indexes are access structures that improve the efficiency of record retrieval from a data file.
- Ordered single-level index types include primary, clustering, and secondary indexes.
- Multilevel indexes can be implemented as B-trees and B+-trees.
- Dynamic structures.
- Multiple key access methods.
- Logical and physical indexes.