DB Chapter 10 Concepts - Indexing

0.0(0)
studied byStudied by 0 people
0.0(0)
full-widthCall Kai
learnLearn
examPractice Test
spaced repetitionSpaced Repetition
heart puzzleMatch
flashcardsFlashcards
GameKnowt Play
Card Sorting

1/100

encourage image

There's no tags or description

Looks like no tags are added yet.

Study Analytics
Name
Mastery
Learn
Test
Matching
Spaced

No study sessions yet.

101 Terms

1
New cards

What’s the purpose of indexing?

indexing mechanisms used to speed up access to desired data

2
New cards

search key

attribute of set of attributes used to look up records in a file

3
New cards

index file

consists of records of the form

search-key

pointer/address

*stored at that address (not the actual column!)

4
New cards

records are called what in an index file?

index entries

5
New cards

Are index files smaller or larger than the original file?

index files are typically much smaller than the original file

  • in fact, data files are usually too large so that’s why we convert it to index files

6
New cards

What are the two basic kinds of indices?

ordered indices and hash indices

7
New cards

Ordered indices

search keys are stored in sorted order on the search key value

8
New cards

Hash indices

search keys are distributed using a “hash function” (i.e. hash table)

9
New cards

primary index

in a sequentially ordered file, the index whose search key specifies the sequential order of the file

10
New cards

primary index is also called what?

clustering index

11
New cards

Are primary key and primary index the same thing?

NO primary key =! key → but the search key of a primary index is usually but not necessarily the primary key

12
New cards

secondary index

an index whose search key specifies an order different from the sequential order of the file

13
New cards

secondary index is also called what?

non-clustering index

14
New cards

Index-sequential file (direct-lookup)

ordered sequential file w/a primary index

15
New cards

How many primary indices can be in an index-sequential file?

only 1 per file!!!

16
New cards

range queries

some info between some range (x to y)

  • Example:

    • SELECT *
      FROM faculty
      WHERE id >= 20000
      AND id <= 50000;

    • range query w/a range from (20000 to 50000)

17
New cards

dense index

index record appears for every search-key value in the file

18
New cards

What is the problem w/dense index?

It is very hard to do modifications b/c it is packed!

  • have to shift down/up a lot of other records for any NSERT (shift down) or DELETE (shift up)

19
New cards

Are search key and key the same thing?

NO, search key =! key → forget “key” from the last few chapters

*also remember that primary key =! key

20
New cards

sparse index

contains index records for only some search-key values (i.e., there are records not in the index)

  • applicable when records are sequentially ordered on search-key

21
New cards

To locate a record w/search-key value K we:

  1. fine index record w/largest search-key value < K

  2. search file sequentially starting at the record to which the index record points

22
New cards

index maintance

updating index to accommodate insert/delete of record(s)

23
New cards

dense index advantages/disadvantages

  • advantages: faster to search

  • disadvantage: update takes longer & more storage

24
New cards

spare index advantages/disadvantages

  • disadvantages: slower search

  • advantages: update takes shorter time & less storage

25
New cards

Compared to dense indices, spare indices are:

  • less space and less maintenance overhead for insertions and deletions

  • generally slower than dense index for locating records

26
New cards

good tradeoff

sparse index w/an index entry for every data block in file, corresponding to least search-key value in the data block

27
New cards

How does the dense and sparse index incorporate both ideas?

  • dense b/c every data block has a pointer

  • spares b/c not every record has a pointer

    • only the first record of each data block has a pointer

28
New cards

Are range queries good w/secondary index?

NO, range queries are VERY BAD w/secondary index b/c it has to jump around (which is very time consuming)

29
New cards

Secondary index implementation

  • index record points to a buck that contains pointers to all the actual records w/that particular search-key value

30
New cards

Are secondary index dense or sparse?

secondary indices have to be dense

31
New cards

Overall analysis of primary and secondary indices

  • indices offer substantial benefits when searching for records

  • but updating indices imposes overhead on db modification (when a file is modified, every index on the file must be updated) → index maintenance

32
New cards

Sequential scans on primary vs secondary indices

  • primary index is efficient w/sequential scan

    • secondary index is expensive w/sequential scan

33
New cards

Sequential scans are also called what?

range queries

34
New cards

When does access to memory become expensive w/index?

if primary index does not fit in memory, access becomes expensive

  • this could happen if primary index is TOO LARGE → can’t be loaded into main memory

35
New cards

How can we make the primary index fit into memory when it is too large?

make a multilevel index: treat primary index kept on disk as a sequential file and construct a sparse index on it

  • has an outer and inner index

36
New cards

Outer index

a sparse index of primary index → in a multilevel index

37
New cards

Inner index

the primary index file → in a multilevel index

38
New cards

What happens if the outer index becomes too large in a multilevel index?

if even outer index is too large to fit in main memory, yet another level of index can be created, and so on

39
New cards

Modifications w/multilevel index

indices at all levels must be updated on insertion or deletion from the file

40
New cards

What does the B in B+ tree stand for?

balanced

41
New cards

B+ tree is an example of what type of index?

B+tree is a multilevel index!

  • tree is an index (WOW!)

42
New cards

Trees are called what?

dynamic data structure

  • dynamic = changes in sizes, modifying structure done easily

  • since trees are dynamic → it helps w/index maintenance

43
New cards

tree is formed of?

nodes

44
New cards

How many parent/child do nodes have?

each node (except root) has one parent and zero or more child nodes

45
New cards

How many parent/child do leaf nodes have?

leaf nodes have no children

46
New cards

When is a tree unbalanced?

it is unbalanced if leaf nodes occur at different levels

47
New cards

nonleaf node is also called?

internal node

48
New cards

What does a subtree of node consist of?

node and all descendant nodes

49
New cards

Dynamic multilevel indexes using B+ Trees

  • provide multi-level access structure

  • tree is always balanced

50
New cards

What do you have to mention w/B+ trees?

you must mention the order of the B+ tree

  • example: B+ tree of order n

51
New cards

Typical node consists of what?

pointer

search-key

pointer

pointer

search-key

pointer

P1

K1

P2

Pn-1

Kn-1

Pn

pointers (Pi) and search-keys (Ki)!

52
New cards

In a typical node, Ki are what?

search-key values

53
New cards

In a typical node, Pi are what?

  • pointers to children (for non-leaf nodes) OR

  • pointers to records OR

  • buckets of records (for leaf nodes)

*NOTE: these (data) pointers point to where the data is stored in the data file!

54
New cards

What is the max # of pointers in a typical node?

n pointers

55
New cards

What is the max # of search keys in a typical node?

n-1 search keys

56
New cards

How are the search-keys ordered in a typical node?

K1 < K2 < K3 < … < Kn-1

57
New cards

All B+ trees (rooted trees) satisfy the following properties:

  1. all paths from root to leaf are of the same length

  2. each internal node has between ceiling(n/2) and n children (pointers)

58
New cards

What are the different types of nodes in a B+ tree?

root node (top level), internal node (middle level(s)), leaf node (last level)

59
New cards

A leaf node has between ___ and ___ values?

between ceiling( (n-1)/2 ) and (n-1) values!!!

60
New cards

What are the special cases for leaf nodes?

  • if the root is not a leaf, it has at least 2 children

  • if the root is a leaf (that is, there are no other nodes in the tree), it can have between 0 and (n-1) values

61
New cards

Data pointers are stored where in a B+ tree?

data pointers are only stored at the leaf nodes

  • leaf nodes have an entry for every value of the search field and a data pointer to the record if search field is a key field

62
New cards

internal nodes

some search field values from the leaf nodes repeated to guide search

63
New cards

For a B+ tree of order 5 (n=5), how many leaf values and internal nodes must there be?

  • leaf nodes = between 2 and 4 values

    • Min: ceiling ( (n-1)/2 ) = ceiling ( (5-1)/2 ) = 2

    • Max: n-1 = 5-1 = 4

  • internal nodes other than root = between 3 and 5 children

    • Min: ceiling (n/2) = ceiling (5/2) = 3

    • Max: n = 5

*NOTE: root must have at least 2 children

**NOTE: all paths from root to leaf are of the same length!

64
New cards

Can we traverse between the leaf nodes in a B+ tree?

YES → leaf nodes are organized into a doubly-linked list that allows us to easily traverse the leaf pages in any direction

65
New cards

B+ Tree adding a value

  • find leaf node where item belongs

  • insert in leaf

  • if node is too full,

    • split, and promote middle key up to parent, middle key also goes to the right

  • if root split, create new root containing promoted key

66
New cards

Splitting a B+ tree (during insertion/deletion)

Let n be the order of the B+ tree and the leaf node is full with (n-1) values, adding one more value will lead to overflow this leaf node, hence this node should be split into two leaf nodes

→ Let m = int (n/2)

  • if a leaf overflows:

    • the left most m keys are left in the node

    • the right most m+1 keys are moved to the new node

    • the (m+1)-st key is propagated to the parent node

**NOTE: The right sub-tree of each non-leaf node should always contain greater or equal key values

67
New cards

B+ Tree deleting a value

  • when a record is deleted from a B+ tree, it is always removed from the leaf level

  • if the deletion of the key does not cause the leaf underflow

    • delete the key from the leaf

    • if the key of the deleted record appears in an index node, sue the next key to replace it

  • if the deletion causes the leaf and the corresponding index node underflow

    • redistribute → if there is a sibling with more than m keys

    • merge → if there is no sibling with more than m keys

    • adjust the index node to reflect the change

68
New cards

hashing is also known as?

unordered indexing

69
New cards

hash table

an array of some fixed size, that positions data elements based on (keys) according to an algorithm called hash function

70
New cards

hash function

a key → position is an array index

  • key → hash function: h(key) → position

71
New cards

Example of a hash function → a “hash function” h for integer keys is:

h(i) = i % array.length = array index (i.e. position)

72
New cards

What is the lookup time using hash function?

constant-time lookup → O(1), which is extremely fast!

73
New cards

What is the runtime for an insertion using hash function?

O(1), which is extremely fast for insertion!

74
New cards

What is the runtime for a deletion using hash function?

O(1), which is extremely fast for deletion!

75
New cards

What about ordering for hash tables?

hash tables have no ordering information!

*NOTE: this makes hashing not good for range queries

76
New cards

For hash tables, it is expensive to do the following:

  • getMin, getMax, removeMin, removeMax

  • various ordered traversals

  • printing items in sorted order

77
New cards

Hash collisions

the event that two hash keys map into the same position in the array

78
New cards

collision resolution

a strategy for fixing collisions in a hash table

  • Example: we can use bucket of (data) records

    • What if the bucket gets full? then we make a chain of buckets!

79
New cards

How can collisions be reduced?

with a selection of a good hash function BUT it is not possible to avoid collisions altogether

  • the only way to avoid complete is to find a perfect hash function (no collisions possible)

  • which is hard to do & takes LOTS OF TIME & is application-specific

80
New cards

What if we increase the size of a hash table?

Increasing the size of the hash table leads to a decrease in collisions

81
New cards

Bucket

a unit of storage containing one or more records (typically a disk block)

82
New cards

hash file organization

obtain the bucket of a record directly from its search-key value using a hash function

83
New cards

Why is it hard to sequential search buckets?

records w/different search-key values may be mapped to the same bucket; thus entire bucket has to be search sequentially to locate a record

84
New cards

What are the common collision resolving strategies?

  • separate chaining (chain of buckets)

  • open addressing techniques

    • linear probing

    • quadratic probing

85
New cards

linear probing

The idea:

  • table remains a simple array of size N

  • on insert(x), compute h(x) mod N, if the cell is full, find another by sequentially searching for the next available slot

    • go to h(x)+1, h(x)+2, etc.

  • on find x, compute h(x) mod N, if the cell doesn’t match, look elsewhere

*NOTE: hash function → hash(key, size) = key % size

86
New cards

quadratic probing

The idea:

  • Let key x be stored in element h(x) = t of the array

  • on insert(x), if the cell at h(x) is full, find another be sequentially searching for the next available slot based on a quadratic sequence

  • the probing sequence is h(x), (h(x) = i2) % N) for i = 1,2,3,… until an empty slot is found

    • N is the table size

87
New cards

bucket overflow can occur b/c of:

  • insufficient buckets

  • skew in distribution of records

88
New cards

When can a skew in distribution of records occur?

  1. multiple records have same search-key value

  2. chosen hash function produces non-uniform distribution of key values

89
New cards

Overflow buckets

although the probability of bucket overflow can be reduced, it cannot be eliminated and instead is handled w/these buckets

90
New cards

Overflow chaining

the overflow buckets of a given bucket are chained together in a linked list

91
New cards

hash function for strings

  • keys could be strings → just view a string by its letters

  • how to map a string into an integer index? “hash” it!

  • one possible hash function:

    • treat first character as an int, and hash on that

      • h(s) = c0 % array.length

*NOTE: this is not a good hash function b/c it only considers the first character of a string (clustering likely to occur)

**NOTE: strings will collide if the start w/the same character or if their first characters produce the same remainder after the modulo

92
New cards

Typical hash functions perform what?

perform computation on the internal binary representation of the search-key

  • Example: for a string search-key, the binary representations of all the characters in the string could be added and the sum modulo the number of buckets could be returned

93
New cards

Creating indexes in SQL (B+ tree)

CREATE INDEX indexName ON tableName(attributeName)

94
New cards

What is the default index made in SQL?

default is a B+ tree

95
New cards

Why not create indexes on everything?

too much storage and index maintenance

96
New cards

What does SQL do automatically in regards to index?

automatically creates an index on primary key

97
New cards

Creating index in SQL (hash table)

CREATE INDEX hashIndex ON tableName USING HASH (columnName);

98
New cards

Creating indexes on more than one attribute

CREATE INDEX doubleIndex ON tableName(columnName1, columnName2)

99
New cards

Index Selection problem

Ask yourself: what indexes should we build to speed up the workload?

  • FROM/WHERE clauses → favor an index

  • INSERT/UPDATE clauses → discourage an index

  • Index selection = normally done by people, recently done automatically (SQL server)

100
New cards

workload

a set of SQL queries plus how often they run