Prin Info Final

Indexing

  • We use indexing to access data faster

  • Search Key - An attribute that is used to find records in a file

  • Index File - file that consists of records with two columns; search-key, and pointer

    • Each record is called an index entry

    • smaller than the original file

  • Two types of indices - Ordered and Hash

    • Ordered - search keys are stored in a sorted order

    • Hash - Search keys are distributed uniformly across buckets using a hash function

  • Our index evaluation metrics include:

    • What kinds of searches can be done quickly with this index

      • Looking for one exact value (Equality search)

      • Looking for records between two values (range search)

    • Access time

    • Insertion time

    • Deletion time

    • Space overhead

  • In an ordered index, index entries are always sorted based on the search key value

  • Clustering/primary index - in a sequentially ordered file, the index with the search key that determines that order

    • The search key of a primary index is usually the primary key

  • Nonclustering/Secondary index - an index whose search key specifies an order different from the sequential order of the file.

    • Index record points to a bucket that contains pointers to all the actual records with that particular search-key value

    • Have to be dense

  • Dense Index File - Index record appears for search-key value in the file

  • Sparse Index File - contains index records only for some search-key values

  • Indices are very beneficial when searching for records, but they impose overhead on database modification

  • If a record is inserted or deleted, every index must be updated. There can be multiple indexes per record

  • When a record is updated, any updated index on that record must be updated on the index file

  • Sequential scan using clustering index is efficient, non-clustering index is not; very expensive on a magnetic disk

  • If an index is too large to fit in memory, it’s no longer doing its job (making searches fast)

  • So we build multilevel index - indexes on indexes

  • Treat the index as a sorted file on disk, then build a smaller sparse index on that one

  • Outer index - sparse top level index with key values and pointers to blocks in the lower level index, usually fits in memory

  • Inner index - the original full dense index, stored on disk

  • If outer is still too large for main memory, build another outer index

    • This is where multilevel comes from

  • When data changes, the basic index must be updated , resulting in changes to all levels, resulting in increased update overhead

  • Composite Search Key - Can have an index on multiple keys at once

    • Index on instructor relation on attributes (name, ID)

    • Values sorted lexicographically, (John, 12121) < (John, 13514) and (John, 135114) < (Peter, 11223)

      • Only check the second if the first ones are equal

    • Can query on just name, or on (name, ID); non prefix lookup does not work

Hash Index

  • Hash index - stores data as key-value pairs, where we give a key and it returns the value

  • No sorted order, can’t scan keys in increasing order

  • Uses a dictionary/map structure

  • We supply the search key, and get an array index for a bucket. Then, we can find the record pointer stored there

  • We can jump directly to where the data should be instead of searching

  • However, there’s the problem of collisions, meaning hash(key1) = hash(key2)

  • Tradeoffs are necessary - smaller table = less memory, but more collisions and slower lookups, larger table = more memory, but faster with less collisions

  • Quick for equality search

  • Lots of different hash functions - just need something that will return an integer representation of a key, fast, and with a low collision rate

    • Avoid cryptographic hash functions because they’re computationally expensive, security they provide is unnecessary

    • Examples include:

      • CRC-64

      • MurmurHash

      • Google CityHash

      • Facebook XXHash

      • Google FarmHash

  • Static hashing - keeping the number of buckets fixed when the table is created, and not changing the hash function as the amount of data grows

    • Will result in buckets overflowing, increased collisions, and performance degrades

    • Linear probe hashing - single table of buckets; if a bucket is full, check the next one and insert there if you need to

    • Cuckoo hashing - Use two or more hash functions and tables for each function, if a spot is taken reinsert it elsewhere using one of the other functions; results in each key having multiple possible location; could result in an infinite loop

      • If we do get a cycle, we rebuild with new hash functions

    • Chained hashing - each bucket stores a linked list, colliding keys go into the chain

  • Dynamic Hashing - hash structure grows as data grows; avoids full rehashing, and keeps performance stable over time

    • Extendible hashing - Use a prefix of the has value to locate the bucket

      • Global depth = prefix length, increase/decrease for grow/shrink, directory size = 2^d

      • Hash function produces a large bit string

      • Multiple entries can point to the same bucket, which each have a local depth <= global depth

      • When a bucket is full…

        • If local depth < global depth; allocate a new bucket, increase local depth by 1, redistribute the records to the two buckets, update the bucket address table, further split if necessary

        • If local depth = global depth: double the size of the bucket address table and increase global depth by 1

        • replace each entry in the table by two entries that point to the same bucket

        • now local depth < global depth so use above case

      • Bucket can be removed if it is empty

    • Linear Hashing:

      • Maintain a pointer that tracks the next bucket to split - we split them one at a time

      • When any bucket overflows, split the bucket at the pointer location

      • Uses two hash functions during growth

      • Expands smoothly and not all at once

      • Basic process of bucket selection:

        • compute b = h_i(key)

        • If b < split, recompute using b = h_{i + 1}(key)

          • this is because if b < split, the element is going to a bucket that’s on a new level - it’s already been split

        • b is where the record lives

      • When split = 2^i, i.e. the full level has been split, set split = 0, then i = i + 1

  • Hash Functions - map large variable sized data to small, fixed size buckets

    • Main problem is collisions: When two different inputs produce the same output hash value, makes our table into a slow list

    • Simpled h(x) = x mod m is predictable, ends up creating massive collisions

    • Perfect hash is permutation; maps everything to its own unique bucket in a perfect shuffle, but full lookup table is too expensive

    • h(x) = (ax + b) mod p, perfect permutation; fast, cheap, random enough

      • p - large prime number, larger than any possible input value

      • a - random number from {1,…,p-1}

      • b - random number from {0,…,p-1}