Prin Info Final

Indexing

We use indexing to access data faster
Search Key - An attribute that is used to find records in a file
Index File - file that consists of records with two columns; search-key, and pointer
- Each record is called an index entry
- smaller than the original file
Two types of indices - Ordered and Hash
- Ordered - search keys are stored in a sorted order
- Hash - Search keys are distributed uniformly across buckets using a hash function
Our index evaluation metrics include:
- What kinds of searches can be done quickly with this index
  - Looking for one exact value (Equality search)
  - Looking for records between two values (range search)
- Access time
- Insertion time
- Deletion time
- Space overhead
In an ordered index, index entries are always sorted based on the search key value
Clustering/primary index - in a sequentially ordered file, the index with the search key that determines that order
- The search key of a primary index is usually the primary key
Nonclustering/Secondary index - an index whose search key specifies an order different from the sequential order of the file.
- Index record points to a bucket that contains pointers to all the actual records with that particular search-key value
- Have to be dense
Dense Index File - Index record appears for search-key value in the file
Sparse Index File - contains index records only for some search-key values
Indices are very beneficial when searching for records, but they impose overhead on database modification
If a record is inserted or deleted, every index must be updated. There can be multiple indexes per record
When a record is updated, any updated index on that record must be updated on the index file
Sequential scan using clustering index is efficient, non-clustering index is not; very expensive on a magnetic disk
If an index is too large to fit in memory, it’s no longer doing its job (making searches fast)
So we build multilevel index - indexes on indexes
Treat the index as a sorted file on disk, then build a smaller sparse index on that one
Outer index - sparse top level index with key values and pointers to blocks in the lower level index, usually fits in memory
Inner index - the original full dense index, stored on disk
If outer is still too large for main memory, build another outer index
- This is where multilevel comes from
When data changes, the basic index must be updated , resulting in changes to all levels, resulting in increased update overhead
Composite Search Key - Can have an index on multiple keys at once
- Index on instructor relation on attributes (name, ID)
- Values sorted lexicographically, (John, 12121) < (John, 13514) and (John, 135114) < (Peter, 11223)
  - Only check the second if the first ones are equal
- Can query on just name, or on (name, ID); non prefix lookup does not work

Hash Index

Hash index - stores data as key-value pairs, where we give a key and it returns the value
No sorted order, can’t scan keys in increasing order
Uses a dictionary/map structure
We supply the search key, and get an array index for a bucket. Then, we can find the record pointer stored there
We can jump directly to where the data should be instead of searching
However, there’s the problem of collisions, meaning hash(key1) = hash(key2)
Tradeoffs are necessary - smaller table = less memory, but more collisions and slower lookups, larger table = more memory, but faster with less collisions
Quick for equality search
Lots of different hash functions - just need something that will return an integer representation of a key, fast, and with a low collision rate
- Avoid cryptographic hash functions because they’re computationally expensive, security they provide is unnecessary
- Examples include:
  - CRC-64
  - MurmurHash
  - Google CityHash
  - Facebook XXHash
  - Google FarmHash
Static hashing - keeping the number of buckets fixed when the table is created, and not changing the hash function as the amount of data grows
- Will result in buckets overflowing, increased collisions, and performance degrades
- Linear probe hashing - single table of buckets; if a bucket is full, check the next one and insert there if you need to
- Cuckoo hashing - Use two or more hash functions and tables for each function, if a spot is taken reinsert it elsewhere using one of the other functions; results in each key having multiple possible location; could result in an infinite loop
  - If we do get a cycle, we rebuild with new hash functions
- Chained hashing - each bucket stores a linked list, colliding keys go into the chain
Dynamic Hashing - hash structure grows as data grows; avoids full rehashing, and keeps performance stable over time
- Extendible hashing - Use a prefix of the has value to locate the bucket
  - Global depth = prefix length, increase/decrease for grow/shrink, directory size = 2^d
  - Hash function produces a large bit string
  - Multiple entries can point to the same bucket, which each have a local depth <= global depth
  - When a bucket is full…
    - If local depth < global depth; allocate a new bucket, increase local depth by 1, redistribute the records to the two buckets, update the bucket address table, further split if necessary
    - If local depth = global depth: double the size of the bucket address table and increase global depth by 1
    - replace each entry in the table by two entries that point to the same bucket
    - now local depth < global depth so use above case
  - Bucket can be removed if it is empty
- Linear Hashing:
  - Maintain a pointer that tracks the next bucket to split - we split them one at a time
  - When any bucket overflows, split the bucket at the pointer location
  - Uses two hash functions during growth
  - Expands smoothly and not all at once
  - Basic process of bucket selection:
    - compute b = h_i(key)
    - If b < split, recompute using b = h_{i + 1}(key)
      - this is because if b < split, the element is going to a bucket that’s on a new level - it’s already been split
    - b is where the record lives
  - When split = 2^i, i.e. the full level has been split, set split = 0, then i = i + 1
Hash Functions - map large variable sized data to small, fixed size buckets
- Main problem is collisions: When two different inputs produce the same output hash value, makes our table into a slow list
- Simpled h(x) = x mod m is predictable, ends up creating massive collisions
- Perfect hash is permutation; maps everything to its own unique bucket in a perfect shuffle, but full lookup table is too expensive
- h(x) = (ax + b) mod p, perfect permutation; fast, cheap, random enough
  - p - large prime number, larger than any possible input value
  - a - random number from {1,…,p-1}
  - b - random number from {0,…,p-1}