1/100
Looks like no tags are added yet.
Name | Mastery | Learn | Test | Matching | Spaced |
|---|
No study sessions yet.
What’s the purpose of indexing?
indexing mechanisms used to speed up access to desired data
search key
attribute of set of attributes used to look up records in a file
index file
consists of records of the form
search-key | pointer/address |
*stored at that address (not the actual column!)
records are called what in an index file?
index entries
Are index files smaller or larger than the original file?
index files are typically much smaller than the original file
in fact, data files are usually too large so that’s why we convert it to index files
What are the two basic kinds of indices?
ordered indices and hash indices
Ordered indices
search keys are stored in sorted order on the search key value
Hash indices
search keys are distributed using a “hash function” (i.e. hash table)
primary index
in a sequentially ordered file, the index whose search key specifies the sequential order of the file
primary index is also called what?
clustering index
Are primary key and primary index the same thing?
NO primary key =! key → but the search key of a primary index is usually but not necessarily the primary key
secondary index
an index whose search key specifies an order different from the sequential order of the file
secondary index is also called what?
non-clustering index
Index-sequential file (direct-lookup)
ordered sequential file w/a primary index
How many primary indices can be in an index-sequential file?
only 1 per file!!!
range queries
some info between some range (x to y)
Example:
SELECT *
FROM faculty
WHERE id >= 20000
AND id <= 50000;
range query w/a range from (20000 to 50000)
dense index
index record appears for every search-key value in the file
What is the problem w/dense index?
It is very hard to do modifications b/c it is packed!
have to shift down/up a lot of other records for any NSERT (shift down) or DELETE (shift up)
Are search key and key the same thing?
NO, search key =! key → forget “key” from the last few chapters
*also remember that primary key =! key
sparse index
contains index records for only some search-key values (i.e., there are records not in the index)
applicable when records are sequentially ordered on search-key
To locate a record w/search-key value K we:
fine index record w/largest search-key value < K
search file sequentially starting at the record to which the index record points
index maintance
updating index to accommodate insert/delete of record(s)
dense index advantages/disadvantages
advantages: faster to search
disadvantage: update takes longer & more storage
spare index advantages/disadvantages
disadvantages: slower search
advantages: update takes shorter time & less storage
Compared to dense indices, spare indices are:
less space and less maintenance overhead for insertions and deletions
generally slower than dense index for locating records
good tradeoff
sparse index w/an index entry for every data block in file, corresponding to least search-key value in the data block
How does the dense and sparse index incorporate both ideas?
dense b/c every data block has a pointer
spares b/c not every record has a pointer
only the first record of each data block has a pointer
Are range queries good w/secondary index?
NO, range queries are VERY BAD w/secondary index b/c it has to jump around (which is very time consuming)
Secondary index implementation
index record points to a buck that contains pointers to all the actual records w/that particular search-key value
Are secondary index dense or sparse?
secondary indices have to be dense
Overall analysis of primary and secondary indices
indices offer substantial benefits when searching for records
but updating indices imposes overhead on db modification (when a file is modified, every index on the file must be updated) → index maintenance
Sequential scans on primary vs secondary indices
primary index is efficient w/sequential scan
secondary index is expensive w/sequential scan
Sequential scans are also called what?
range queries
When does access to memory become expensive w/index?
if primary index does not fit in memory, access becomes expensive
this could happen if primary index is TOO LARGE → can’t be loaded into main memory
How can we make the primary index fit into memory when it is too large?
make a multilevel index: treat primary index kept on disk as a sequential file and construct a sparse index on it
has an outer and inner index
Outer index
a sparse index of primary index → in a multilevel index
Inner index
the primary index file → in a multilevel index
What happens if the outer index becomes too large in a multilevel index?
if even outer index is too large to fit in main memory, yet another level of index can be created, and so on
Modifications w/multilevel index
indices at all levels must be updated on insertion or deletion from the file
What does the B in B+ tree stand for?
balanced
B+ tree is an example of what type of index?
B+tree is a multilevel index!
tree is an index (WOW!)
Trees are called what?
dynamic data structure
dynamic = changes in sizes, modifying structure done easily
since trees are dynamic → it helps w/index maintenance
tree is formed of?
nodes
How many parent/child do nodes have?
each node (except root) has one parent and zero or more child nodes
How many parent/child do leaf nodes have?
leaf nodes have no children
When is a tree unbalanced?
it is unbalanced if leaf nodes occur at different levels
nonleaf node is also called?
internal node
What does a subtree of node consist of?
node and all descendant nodes
Dynamic multilevel indexes using B+ Trees
provide multi-level access structure
tree is always balanced
What do you have to mention w/B+ trees?
you must mention the order of the B+ tree
example: B+ tree of order n
Typical node consists of what?
pointer | search-key | pointer | pointer | search-key | pointer | |
P1 | K1 | P2 | … | Pn-1 | Kn-1 | Pn |
pointers (Pi) and search-keys (Ki)!
In a typical node, Ki are what?
search-key values
In a typical node, Pi are what?
pointers to children (for non-leaf nodes) OR
pointers to records OR
buckets of records (for leaf nodes)
*NOTE: these (data) pointers point to where the data is stored in the data file!
What is the max # of pointers in a typical node?
n pointers
What is the max # of search keys in a typical node?
n-1 search keys
How are the search-keys ordered in a typical node?
K1 < K2 < K3 < … < Kn-1
All B+ trees (rooted trees) satisfy the following properties:
all paths from root to leaf are of the same length
each internal node has between ceiling(n/2) and n children (pointers)
What are the different types of nodes in a B+ tree?
root node (top level), internal node (middle level(s)), leaf node (last level)
A leaf node has between ___ and ___ values?
between ceiling( (n-1)/2 ) and (n-1) values!!!
What are the special cases for leaf nodes?
if the root is not a leaf, it has at least 2 children
if the root is a leaf (that is, there are no other nodes in the tree), it can have between 0 and (n-1) values
Data pointers are stored where in a B+ tree?
data pointers are only stored at the leaf nodes
leaf nodes have an entry for every value of the search field and a data pointer to the record if search field is a key field
internal nodes
some search field values from the leaf nodes repeated to guide search
For a B+ tree of order 5 (n=5), how many leaf values and internal nodes must there be?
leaf nodes = between 2 and 4 values
Min: ceiling ( (n-1)/2 ) = ceiling ( (5-1)/2 ) = 2
Max: n-1 = 5-1 = 4
internal nodes other than root = between 3 and 5 children
Min: ceiling (n/2) = ceiling (5/2) = 3
Max: n = 5
*NOTE: root must have at least 2 children
**NOTE: all paths from root to leaf are of the same length!
Can we traverse between the leaf nodes in a B+ tree?
YES → leaf nodes are organized into a doubly-linked list that allows us to easily traverse the leaf pages in any direction
B+ Tree adding a value
find leaf node where item belongs
insert in leaf
if node is too full,
split, and promote middle key up to parent, middle key also goes to the right
if root split, create new root containing promoted key
Splitting a B+ tree (during insertion/deletion)
Let n be the order of the B+ tree and the leaf node is full with (n-1) values, adding one more value will lead to overflow this leaf node, hence this node should be split into two leaf nodes
→ Let m = int (n/2)
if a leaf overflows:
the left most m keys are left in the node
the right most m+1 keys are moved to the new node
the (m+1)-st key is propagated to the parent node
**NOTE: The right sub-tree of each non-leaf node should always contain greater or equal key values
B+ Tree deleting a value
when a record is deleted from a B+ tree, it is always removed from the leaf level
if the deletion of the key does not cause the leaf underflow
delete the key from the leaf
if the key of the deleted record appears in an index node, sue the next key to replace it
if the deletion causes the leaf and the corresponding index node underflow
redistribute → if there is a sibling with more than m keys
merge → if there is no sibling with more than m keys
adjust the index node to reflect the change
hashing is also known as?
unordered indexing
hash table
an array of some fixed size, that positions data elements based on (keys) according to an algorithm called hash function
hash function
a key → position is an array index
key → hash function: h(key) → position
Example of a hash function → a “hash function” h for integer keys is:
h(i) = i % array.length = array index (i.e. position)
What is the lookup time using hash function?
constant-time lookup → O(1), which is extremely fast!
What is the runtime for an insertion using hash function?
O(1), which is extremely fast for insertion!
What is the runtime for a deletion using hash function?
O(1), which is extremely fast for deletion!
What about ordering for hash tables?
hash tables have no ordering information!
*NOTE: this makes hashing not good for range queries
For hash tables, it is expensive to do the following:
getMin, getMax, removeMin, removeMax
various ordered traversals
printing items in sorted order
Hash collisions
the event that two hash keys map into the same position in the array
collision resolution
a strategy for fixing collisions in a hash table
Example: we can use bucket of (data) records
What if the bucket gets full? then we make a chain of buckets!
How can collisions be reduced?
with a selection of a good hash function BUT it is not possible to avoid collisions altogether
the only way to avoid complete is to find a perfect hash function (no collisions possible)
which is hard to do & takes LOTS OF TIME & is application-specific
What if we increase the size of a hash table?
Increasing the size of the hash table leads to a decrease in collisions
Bucket
a unit of storage containing one or more records (typically a disk block)
hash file organization
obtain the bucket of a record directly from its search-key value using a hash function
Why is it hard to sequential search buckets?
records w/different search-key values may be mapped to the same bucket; thus entire bucket has to be search sequentially to locate a record
What are the common collision resolving strategies?
separate chaining (chain of buckets)
open addressing techniques
linear probing
quadratic probing
linear probing
The idea:
table remains a simple array of size N
on insert(x), compute h(x) mod N, if the cell is full, find another by sequentially searching for the next available slot
go to h(x)+1, h(x)+2, etc.
on find x, compute h(x) mod N, if the cell doesn’t match, look elsewhere
*NOTE: hash function → hash(key, size) = key % size
quadratic probing
The idea:
Let key x be stored in element h(x) = t of the array
on insert(x), if the cell at h(x) is full, find another be sequentially searching for the next available slot based on a quadratic sequence
the probing sequence is h(x), (h(x) = i2) % N) for i = 1,2,3,… until an empty slot is found
N is the table size
bucket overflow can occur b/c of:
insufficient buckets
skew in distribution of records
When can a skew in distribution of records occur?
multiple records have same search-key value
chosen hash function produces non-uniform distribution of key values
Overflow buckets
although the probability of bucket overflow can be reduced, it cannot be eliminated and instead is handled w/these buckets
Overflow chaining
the overflow buckets of a given bucket are chained together in a linked list
hash function for strings
keys could be strings → just view a string by its letters
how to map a string into an integer index? “hash” it!
one possible hash function:
treat first character as an int, and hash on that
h(s) = c0 % array.length
*NOTE: this is not a good hash function b/c it only considers the first character of a string (clustering likely to occur)
**NOTE: strings will collide if the start w/the same character or if their first characters produce the same remainder after the modulo
Typical hash functions perform what?
perform computation on the internal binary representation of the search-key
Example: for a string search-key, the binary representations of all the characters in the string could be added and the sum modulo the number of buckets could be returned
Creating indexes in SQL (B+ tree)
CREATE INDEX indexName ON tableName(attributeName)
What is the default index made in SQL?
default is a B+ tree
Why not create indexes on everything?
too much storage and index maintenance
What does SQL do automatically in regards to index?
automatically creates an index on primary key
Creating index in SQL (hash table)
CREATE INDEX hashIndex ON tableName USING HASH (columnName);
Creating indexes on more than one attribute
CREATE INDEX doubleIndex ON tableName(columnName1, columnName2)
Index Selection problem
Ask yourself: what indexes should we build to speed up the workload?
FROM/WHERE clauses → favor an index
INSERT/UPDATE clauses → discourage an index
Index selection = normally done by people, recently done automatically (SQL server)
workload
a set of SQL queries plus how often they run