DB Chapter 10 Concepts - Indexing

0.0(0)

Studied by 0 people

0.0(0)

Call Kai

Learn

Practice Test

Spaced Repetition

Match

Flashcards

Knowt Play

Card Sorting

1/100

There's no tags or description

Looks like no tags are added yet.

Study Analytics

Name	Mastery	Learn	Test	Matching	Spaced

No study sessions yet.

101 Terms

New cards

What’s the purpose of indexing?

indexing mechanisms used to speed up access to desired data

New cards

search key

attribute of set of attributes used to look up records in a file

New cards

index file

consists of records of the form

search-key

pointer/address

*stored at that address (not the actual column!)

New cards

records are called what in an index file?

index entries

New cards

Are index files smaller or larger than the original file?

index files are typically much smaller than the original file

in fact, data files are usually too large so that’s why we convert it to index files

New cards

What are the two basic kinds of indices?

ordered indices and hash indices

New cards

Ordered indices

search keys are stored in sorted order on the search key value

New cards

Hash indices

search keys are distributed using a “hash function” (i.e. hash table)

New cards

primary index

in a sequentially ordered file, the index whose search key specifies the sequential order of the file

New cards

primary index is also called what?

clustering index

New cards

Are primary key and primary index the same thing?

NO primary key =! key → but the search key of a primary index is usually but not necessarily the primary key

New cards

secondary index

an index whose search key specifies an order different from the sequential order of the file

New cards

secondary index is also called what?

non-clustering index

New cards

Index-sequential file (direct-lookup)

ordered sequential file w/a primary index

New cards

How many primary indices can be in an index-sequential file?

only 1 per file!!!

New cards

range queries

some info between some range (x to y)

Example:
- SELECT *
  FROM faculty
  WHERE id >= 20000
  AND id <= 50000;
- range query w/a range from (20000 to 50000)

New cards

dense index

index record appears for every search-key value in the file

New cards

What is the problem w/dense index?

It is very hard to do modifications b/c it is packed!

have to shift down/up a lot of other records for any NSERT (shift down) or DELETE (shift up)

New cards

Are search key and key the same thing?

NO, search key =! key → forget “key” from the last few chapters

*also remember that primary key =! key

New cards

sparse index

contains index records for only some search-key values (i.e., there are records not in the index)

applicable when records are sequentially ordered on search-key

New cards

To locate a record w/search-key value K we:

fine index record w/largest search-key value < K
search file sequentially starting at the record to which the index record points

New cards

index maintance

updating index to accommodate insert/delete of record(s)

New cards

dense index advantages/disadvantages

advantages: faster to search
disadvantage: update takes longer & more storage

New cards

spare index advantages/disadvantages

disadvantages: slower search
advantages: update takes shorter time & less storage

New cards

Compared to dense indices, spare indices are:

less space and less maintenance overhead for insertions and deletions
generally slower than dense index for locating records

New cards

good tradeoff

sparse index w/an index entry for every data block in file, corresponding to least search-key value in the data block

New cards

How does the dense and sparse index incorporate both ideas?

dense b/c every data block has a pointer
spares b/c not every record has a pointer
- only the first record of each data block has a pointer

New cards

Are range queries good w/secondary index?

NO, range queries are VERY BAD w/secondary index b/c it has to jump around (which is very time consuming)

New cards

Secondary index implementation

index record points to a buck that contains pointers to all the actual records w/that particular search-key value

New cards

Are secondary index dense or sparse?

secondary indices have to be dense

New cards

Overall analysis of primary and secondary indices

indices offer substantial benefits when searching for records
but updating indices imposes overhead on db modification (when a file is modified, every index on the file must be updated) → index maintenance

New cards

Sequential scans on primary vs secondary indices

primary index is efficient w/sequential scan
- secondary index is expensive w/sequential scan

New cards

Sequential scans are also called what?

range queries

New cards

When does access to memory become expensive w/index?

if primary index does not fit in memory, access becomes expensive

this could happen if primary index is TOO LARGE → can’t be loaded into main memory

New cards

How can we make the primary index fit into memory when it is too large?

make a multilevel index: treat primary index kept on disk as a sequential file and construct a sparse index on it

has an outer and inner index

New cards

Outer index

a sparse index of primary index → in a multilevel index

New cards

Inner index

the primary index file → in a multilevel index

New cards

What happens if the outer index becomes too large in a multilevel index?

if even outer index is too large to fit in main memory, yet another level of index can be created, and so on

New cards

Modifications w/multilevel index

indices at all levels must be updated on insertion or deletion from the file

New cards

What does the B in B+ tree stand for?

balanced

New cards

B+ tree is an example of what type of index?

B+tree is a multilevel index!

tree is an index (WOW!)

New cards

Trees are called what?

dynamic data structure

dynamic = changes in sizes, modifying structure done easily
since trees are dynamic → it helps w/index maintenance

New cards

tree is formed of?

nodes

New cards

How many parent/child do nodes have?

each node (except root) has one parent and zero or more child nodes

New cards

How many parent/child do leaf nodes have?

leaf nodes have no children

New cards

When is a tree unbalanced?

it is unbalanced if leaf nodes occur at different levels

New cards

nonleaf node is also called?

internal node

New cards

What does a subtree of node consist of?

node and all descendant nodes

New cards

Dynamic multilevel indexes using B+ Trees

provide multi-level access structure
tree is always balanced

New cards

What do you have to mention w/B+ trees?

you must mention the order of the B+ tree

example: B+ tree of order n

New cards

Typical node consists of what?

pointer	search-key	pointer		pointer	search-key	pointer
P₁	K₁	P₂	…	P_n-1	K_n-1	P_n

pointers (P_i) and search-keys (K_i)!

New cards

In a typical node, K_i are what?

search-key values

New cards

In a typical node, P_i are what?

pointers to children (for non-leaf nodes) OR
pointers to records OR
buckets of records (for leaf nodes)

*NOTE: these (data) pointers point to where the data is stored in the data file!

New cards

What is the max # of pointers in a typical node?

n pointers

New cards

What is the max # of search keys in a typical node?

n-1 search keys

New cards

How are the search-keys ordered in a typical node?

K₁ < K₂ < K₃ < … < K_n-1

New cards

All B+ trees (rooted trees) satisfy the following properties:

all paths from root to leaf are of the same length
each internal node has between ceiling(n/2) and n children (pointers)

New cards

What are the different types of nodes in a B+ tree?

root node (top level), internal node (middle level(s)), leaf node (last level)

New cards

A leaf node has between ___ and ___ values?

between ceiling( (n-1)/2 ) and (n-1) values!!!

New cards

What are the special cases for leaf nodes?

if the root is not a leaf, it has at least 2 children
if the root is a leaf (that is, there are no other nodes in the tree), it can have between 0 and (n-1) values

New cards

Data pointers are stored where in a B+ tree?

data pointers are only stored at the leaf nodes

leaf nodes have an entry for every value of the search field and a data pointer to the record if search field is a key field

New cards

internal nodes

some search field values from the leaf nodes repeated to guide search

New cards

For a B+ tree of order 5 (n=5), how many leaf values and internal nodes must there be?

leaf nodes = between 2 and 4 values
- Min: ceiling ( (n-1)/2 ) = ceiling ( (5-1)/2 ) = 2
- Max: n-1 = 5-1 = 4
internal nodes other than root = between 3 and 5 children
- Min: ceiling (n/2) = ceiling (5/2) = 3
- Max: n = 5

*NOTE: root must have at least 2 children

**NOTE: all paths from root to leaf are of the same length!

New cards

Can we traverse between the leaf nodes in a B+ tree?

YES → leaf nodes are organized into a doubly-linked list that allows us to easily traverse the leaf pages in any direction

New cards

B+ Tree adding a value

find leaf node where item belongs
insert in leaf
if node is too full,
- split, and promote middle key up to parent, middle key also goes to the right
if root split, create new root containing promoted key

New cards

Splitting a B+ tree (during insertion/deletion)

Let n be the order of the B+ tree and the leaf node is full with (n-1) values, adding one more value will lead to overflow this leaf node, hence this node should be split into two leaf nodes

→ Let m = int (n/2)

if a leaf overflows:
- the left most m keys are left in the node
- the right most m+1 keys are moved to the new node
- the (m+1)-st key is propagated to the parent node

**NOTE: The right sub-tree of each non-leaf node should always contain greater or equal key values

New cards

B+ Tree deleting a value

when a record is deleted from a B+ tree, it is always removed from the leaf level
if the deletion of the key does not cause the leaf underflow
- delete the key from the leaf
- if the key of the deleted record appears in an index node, sue the next key to replace it
if the deletion causes the leaf and the corresponding index node underflow
- redistribute → if there is a sibling with more than m keys
- merge → if there is no sibling with more than m keys
- adjust the index node to reflect the change

New cards

hashing is also known as?

unordered indexing

New cards

hash table

an array of some fixed size, that positions data elements based on (keys) according to an algorithm called hash function

New cards

hash function

a key → position is an array index

key → hash function: h(key) → position

New cards

Example of a hash function → a “hash function” h for integer keys is:

h(i) = i % array.length = array index (i.e. position)

New cards

What is the lookup time using hash function?

constant-time lookup → O(1), which is extremely fast!

New cards

What is the runtime for an insertion using hash function?

O(1), which is extremely fast for insertion!

New cards

What is the runtime for a deletion using hash function?

O(1), which is extremely fast for deletion!

New cards

What about ordering for hash tables?

hash tables have no ordering information!

*NOTE: this makes hashing not good for range queries

New cards

For hash tables, it is expensive to do the following:

getMin, getMax, removeMin, removeMax
various ordered traversals
printing items in sorted order

New cards

Hash collisions

the event that two hash keys map into the same position in the array

New cards

collision resolution

a strategy for fixing collisions in a hash table

Example: we can use bucket of (data) records
- What if the bucket gets full? then we make a chain of buckets!

New cards

How can collisions be reduced?

with a selection of a good hash function BUT it is not possible to avoid collisions altogether

the only way to avoid complete is to find a perfect hash function (no collisions possible)
which is hard to do & takes LOTS OF TIME & is application-specific

New cards

What if we increase the size of a hash table?

Increasing the size of the hash table leads to a decrease in collisions

New cards

Bucket

a unit of storage containing one or more records (typically a disk block)

New cards

hash file organization

obtain the bucket of a record directly from its search-key value using a hash function

New cards

Why is it hard to sequential search buckets?

records w/different search-key values may be mapped to the same bucket; thus entire bucket has to be search sequentially to locate a record

New cards

What are the common collision resolving strategies?

separate chaining (chain of buckets)
open addressing techniques
- linear probing
- quadratic probing

New cards

linear probing

The idea:

table remains a simple array of size N
on insert(x), compute h(x) mod N, if the cell is full, find another by sequentially searching for the next available slot
- go to h(x)+1, h(x)+2, etc.
on find x, compute h(x) mod N, if the cell doesn’t match, look elsewhere

*NOTE: hash function → hash(key, size) = key % size

New cards

quadratic probing

The idea:

Let key x be stored in element h(x) = t of the array
on insert(x), if the cell at h(x) is full, find another be sequentially searching for the next available slot based on a quadratic sequence
the probing sequence is h(x), (h(x) = i²) % N) for i = 1,2,3,… until an empty slot is found
- N is the table size

New cards

bucket overflow can occur b/c of:

insufficient buckets
skew in distribution of records

New cards

When can a skew in distribution of records occur?

multiple records have same search-key value
chosen hash function produces non-uniform distribution of key values

New cards

Overflow buckets

although the probability of bucket overflow can be reduced, it cannot be eliminated and instead is handled w/these buckets

New cards

Overflow chaining

the overflow buckets of a given bucket are chained together in a linked list

New cards

hash function for strings

keys could be strings → just view a string by its letters
how to map a string into an integer index? “hash” it!
one possible hash function:
- treat first character as an int, and hash on that
  - h(s) = c0 % array.length

*NOTE: this is not a good hash function b/c it only considers the first character of a string (clustering likely to occur)

**NOTE: strings will collide if the start w/the same character or if their first characters produce the same remainder after the modulo

New cards

Typical hash functions perform what?

perform computation on the internal binary representation of the search-key

Example: for a string search-key, the binary representations of all the characters in the string could be added and the sum modulo the number of buckets could be returned

New cards

Creating indexes in SQL (B+ tree)

CREATE INDEX indexName ON tableName(attributeName)

New cards

What is the default index made in SQL?

default is a B+ tree

New cards

Why not create indexes on everything?

too much storage and index maintenance

New cards

What does SQL do automatically in regards to index?

automatically creates an index on primary key

New cards

Creating index in SQL (hash table)

CREATE INDEX hashIndex ON tableName USING HASH (columnName);

New cards

Creating indexes on more than one attribute

CREATE INDEX doubleIndex ON tableName(columnName1, columnName2)

New cards

Index Selection problem

Ask yourself: what indexes should we build to speed up the workload?

FROM/WHERE clauses → favor an index
INSERT/UPDATE clauses → discourage an index
Index selection = normally done by people, recently done automatically (SQL server)

100

New cards

workload

a set of SQL queries plus how often they run