Prin Info Final
Indexing
We use indexing to access data faster
Search Key - An attribute that is used to find records in a file
Index File - file that consists of records with two columns; search-key, and pointer
Each record is called an index entry
smaller than the original file
Two types of indices - Ordered and Hash
Ordered - search keys are stored in a sorted order
Hash - Search keys are distributed uniformly across buckets using a hash function
Our index evaluation metrics include:
What kinds of searches can be done quickly with this index
Looking for one exact value (Equality search)
Looking for records between two values (range search)
Access time
Insertion time
Deletion time
Space overhead
In an ordered index, index entries are always sorted based on the search key value
Clustering/primary index - in a sequentially ordered file, the index with the search key that determines that order
The search key of a primary index is usually the primary key
Nonclustering/Secondary index - an index whose search key specifies an order different from the sequential order of the file.
Index record points to a bucket that contains pointers to all the actual records with that particular search-key value
Have to be dense
Dense Index File - Index record appears for search-key value in the file
Sparse Index File - contains index records only for some search-key values
Indices are very beneficial when searching for records, but they impose overhead on database modification
If a record is inserted or deleted, every index must be updated. There can be multiple indexes per record
When a record is updated, any updated index on that record must be updated on the index file
Sequential scan using clustering index is efficient, non-clustering index is not; very expensive on a magnetic disk
If an index is too large to fit in memory, it’s no longer doing its job (making searches fast)
So we build multilevel index - indexes on indexes
Treat the index as a sorted file on disk, then build a smaller sparse index on that one
Outer index - sparse top level index with key values and pointers to blocks in the lower level index, usually fits in memory
Inner index - the original full dense index, stored on disk
If outer is still too large for main memory, build another outer index
This is where multilevel comes from
When data changes, the basic index must be updated , resulting in changes to all levels, resulting in increased update overhead
Composite Search Key - Can have an index on multiple keys at once
Index on instructor relation on attributes (name, ID)
Values sorted lexicographically, (John, 12121) < (John, 13514) and (John, 135114) < (Peter, 11223)
Only check the second if the first ones are equal
Can query on just name, or on (name, ID); non prefix lookup does not work
Hash Index
Hash index - stores data as key-value pairs, where we give a key and it returns the value
No sorted order, can’t scan keys in increasing order
Uses a dictionary/map structure
We supply the search key, and get an array index for a bucket. Then, we can find the record pointer stored there
We can jump directly to where the data should be instead of searching
However, there’s the problem of collisions, meaning hash(key1) = hash(key2)
Tradeoffs are necessary - smaller table = less memory, but more collisions and slower lookups, larger table = more memory, but faster with less collisions
Quick for equality search
Lots of different hash functions - just need something that will return an integer representation of a key, fast, and with a low collision rate
Avoid cryptographic hash functions because they’re computationally expensive, security they provide is unnecessary
Examples include:
CRC-64
MurmurHash
Google CityHash
Facebook XXHash
Google FarmHash
Static hashing - keeping the number of buckets fixed when the table is created, and not changing the hash function as the amount of data grows
Will result in buckets overflowing, increased collisions, and performance degrades
Linear probe hashing - single table of buckets; if a bucket is full, check the next one and insert there if you need to
Cuckoo hashing - Use two or more hash functions and tables for each function, if a spot is taken reinsert it elsewhere using one of the other functions; results in each key having multiple possible location; could result in an infinite loop
If we do get a cycle, we rebuild with new hash functions
Chained hashing - each bucket stores a linked list, colliding keys go into the chain
Dynamic Hashing - hash structure grows as data grows; avoids full rehashing, and keeps performance stable over time
Extendible hashing - Use a prefix of the has value to locate the bucket
Global depth = prefix length, increase/decrease for grow/shrink, directory size = 2^d
Hash function produces a large bit string
Multiple entries can point to the same bucket, which each have a local depth <= global depth
When a bucket is full…
If local depth < global depth; allocate a new bucket, increase local depth by 1, redistribute the records to the two buckets, update the bucket address table, further split if necessary
If local depth = global depth: double the size of the bucket address table and increase global depth by 1
replace each entry in the table by two entries that point to the same bucket
now local depth < global depth so use above case
Bucket can be removed if it is empty
Linear Hashing:
Maintain a pointer that tracks the next bucket to split - we split them one at a time
When any bucket overflows, split the bucket at the pointer location
Uses two hash functions during growth
Expands smoothly and not all at once
Basic process of bucket selection:
compute b = h_i(key)
If b < split, recompute using b = h_{i + 1}(key)
this is because if b < split, the element is going to a bucket that’s on a new level - it’s already been split
b is where the record lives
When split = 2^i, i.e. the full level has been split, set split = 0, then i = i + 1
Hash Functions - map large variable sized data to small, fixed size buckets
Main problem is collisions: When two different inputs produce the same output hash value, makes our table into a slow list
Simpled h(x) = x mod m is predictable, ends up creating massive collisions
Perfect hash is permutation; maps everything to its own unique bucket in a perfect shuffle, but full lookup table is too expensive
h(x) = (ax + b) mod p, perfect permutation; fast, cheap, random enough
p - large prime number, larger than any possible input value
a - random number from {1,…,p-1}
b - random number from {0,…,p-1}