Lectures 11 & 12 Databases Quiz - File Storage and Indexing

1.0(1)
studied byStudied by 1 person
learnLearn
examPractice Test
spaced repetitionSpaced Repetition
heart puzzleMatch
flashcardsFlashcards
Card Sorting

1/56

encourage image

There's no tags or description

Looks like no tags are added yet.

Study Analytics
Name
Mastery
Learn
Test
Matching
Spaced

No study sessions yet.

57 Terms

1
New cards

where are databases typically stored?

on magnetic disks; accessed using physical database file structures

2
New cards

storage hierarchy

  1. primary storage (CPU main memory, cache memory)

  2. secondary storage (magnetic disks, flash memory, solid-state drives)

  3. tertiary storage (removable media)

3
New cards

memory hierarchies and storage devices

  • cache memory (static RAM, DRAM)

  • mass storage (magnetic disks- CD ROM, DVD, tape drives)

  • flash memory (nonvolatile)

4
New cards

storage hierarchy diagram

knowt flashcard image
5
New cards

primary storage

fastest media but volatile (cache, main memory)

6
New cards

secondary storage

next level in hierarchy, non-volatile, moderately fast access time; also called on-line storage; ex: flash memory, magnetic disks

7
New cards

tertiary storage

lowest level in hierarchy, non-volatile, slow access time; also called off-line storage and used for archival storage; ex: magnetic tape, optical storage

8
New cards

magnetic tape

sequential access, 1 to 12 TB capacity; a few drives with many types; juke boxes with petabytes (thousands of TBs) of storage

9
New cards

storage organization of databases

  • persistent data (most databases)

  • transient data (exists only during program execution)

  • file organization (determines how records are physically placed on the disk and how they are accessed)

10
New cards

why do we not only use main memory?

it is possible that the database might not fit into main memory. main memory is volatile.

11
New cards

main memory vs disk

  • given available main memory, when do we keep which part of the database in main memory?

    • buffer manager- component of DBMS that decides when to move data between disk and main memory

  • how do we ensure transaction property durability?

    • buffer manager needs to make sure data written by committed transactions is written to disk to ensure durability

12
New cards

blocks

a database file is partitioned into fixed-length storage units called blocks. blocks are units of both storage allocation and data transfer

Database system seeks to minimize the number of block transfers between the disk and memory. We can reduce the number of disk accesses by keeping as many blocks as possible in main memory.

13
New cards

buffer

portion of main memory available to store copies of disk blocks

14
New cards

buffer

portion of main memory available to store copies of disk blocks

15
New cards

buffer manager

subsystem responsible for allocating buffer space in main memory; programs call on the buffer manager when they need a block from disk.

  1. if the block is already in the buffer, the buffer manager returns the address of the block in main memory

  2. if the block is not in the buffer, the buffer manager

    1. allocates space in the buffer for the block

      1. replacing (throwing out) some other block, if required, to make space for the new block

      2. replaced block written back to disk only if it was modified since the most recent time that it was written to/fetched from the disk

    2. reads the block from the disk to the buffer, and returns

16
New cards

least recently used (LRU) strategy

buffer replacement policy; most operating systems replace the block least recently used. Idea behind LRU- use past pattern of block references as a predictor of future references

Queries have well-defined access patterns (such as sequential scans), and a database system can use the information in a user’s query to predict future references

  • LRU can be a bad strategy from certain access patterns involving repeated scans of data

  • mixed strategy with hints on replacement strategy provided by the query optimizer is preferable

17
New cards

pinned block

memory block that is not allowed to be written back to disk

18
New cards

toss-immediate strategy

frees the space occupied by a block as soon as the final tuple of that block has been processed

19
New cards

most recently used (MRU) strategy

system must pin the block currently being processed. After the final tuple of that block has been processed, the block is unpinned, and it becomes the most recently used block. Buffer manager can use statistical information regarding the probability that a request will reference a particular relation.

20
New cards

record

collection of related data values or items; values correspond to record fields

21
New cards

binary large objects

BLOBs; unstructured objects

22
New cards

file organization

  • the database is stored as a collection of files. Each file stores records (tuples from a table.) A record is a sequence of fields (the attributes of a tuple).

  • reading one record at a time from disk would be very slow (random access)

    • organize our database files in pages (size of block or larger)

    • read/write data in units of pages

    • one page will usually contain several records

  • one approach:

    • assume record size is fixed

    • each file has records of one particular type only

    • diff files are used for diff relations

    • this case is easiest to implement

23
New cards

reasons for variable-length records

  • one or more fields have variable length

  • one or more fields are repeating

  • one or more fields are optional

  • file contains records of diff types

24
New cards

fixed-length records

  • simple approach

    • store record i starting from byte n * (i-1) where n is the size of each record

    • record access is simple, but records may cross blocks

      • modification: do not allow records to cross block boundaries

  • deletion of record i: alternatives

    • move records i+1, … n to i, … n -1

    • move record n to i

    • do not move records, but link all free records on a free list

25
New cards

free lists

  • store the address of the first deleted record in the file header

  • use this first record to store the address of the second deleted record, and so on

  • can think of these stored addresses as pointers since they “point” to the location of a record

  • more space efficient representation: reuse space for normal attributes of free records to store pointers (no pointers stored in in-use records)

26
New cards

variable-length records

  • variable-length records arise in database systems in several ways:

    • storage of multiple record types in a file

    • record types that allow variable lengths for one or more fields such as strings (varchar)

    • record types that allow repeating fields (used in some older data models)

  • attributes are stored in order

  • variable length attributes represented by fixed size (offset, length), with actual data stored after all fixed length attributes

  • null values represented by null-value bitmap

27
New cards

spanned records

larger than a single block; pointer at end of first block points to block containing remainder of record

28
New cards

unspanned record

records not allowed to cross block boundaries

29
New cards

blocking factor

average number of records per block for the file

30
New cards

allocating file blocks on disk

  • contiguous allocation

  • linked allocation

  • indexed allocation

31
New cards

file header (file descriptor)

contains file information needed by system programs (disk addresses, format descriptions)

32
New cards

heap

a record can be placed anywhere in the file where there is space

33
New cards

sequential

store records in sequential order, based on the value of the search key of each record

34
New cards

hashing

a hash function computed on some attribute of each record; the result specifies in which block of the file the record should be placed

35
New cards

multitable clustering file organization

records of each relation may be stored in a separate file. In a multitable clustering file organization, records of several different relations can be stored in the same file; motivation: store related records on the same block to minimize I/O

36
New cards

files of unordered records (heap files)

  • heap (or pile) file

    • records placed in file in order of insertion

  • inserting a new record is very efficient

  • searching for a record requires linear search

  • deletion techniques

    • rewrite the block

    • use deletion marker

37
New cards

files of ordered records (sorted files)

  • ordered (sequential) file

    • records sorted by ordering field

      • called ordering key if ordering field is a key field

  • advantages

    • reading records in order of ordering key value is extremely efficient

    • finding next record

    • binary search technique

38
New cards

access times for various file organizations

knowt flashcard image
39
New cards

bucket

unit of storage containing one or more records. a bucket is typically a disk block

40
New cards

static hashing

  • in hash file organization we obtain the bucket of a record directly from its search-key value using a hash function

  • hash function h is a function from the set of all search-key values to the set of all bucket addresses B

  • hash function is used to locate records for access, insertion, as well as deletion

  • records with different search-key values may be mapped to the same bucket; thus entire bucket has to be searched sequentially to locate a record

41
New cards

hash functions

  • worst hash function maps all search-key values to the same bucket; this makes access time proportional to the number of search-key values in the file

  • an ideal hash function is uniform, meaning each bucket is assigned the same number of search-key values from the set of all possible values

  • ideal hash function is random, so each bucket will have the same number of records assigned to it irrespective of the actual distribution of search-key values in the file

  • typical hash functions perform computation on the internal binary representation of the search-key

42
New cards

why can bucket overflow occur?

  • insufficient buckets

  • skew in distribution of records because of two reasons:

    • multiple records have the same search-key value

    • chosen hash function produces non-uniform distribution of key values

although the probability of bucket overflow can be reduced, it cannot be eliminated; it is handled using overflow buckets

43
New cards

overflow chaining

the overflow buckets of a given bucket are chained together in a linked list; example of closed hashing

44
New cards

open hashing

does not use overflow buckets; not suitable for database applications

45
New cards

deficiencies of static hashing

  • if the initial number of buckets is too small, and file grows, performance will degrade due to too much overflows

  • if space is allocated for anticipated growth, a significant amount of space will be wasted initially (and buckets won’t be full)

  • if database shrinks, again space will be wasted

one solution: periodic re-organization of the file with a new hash function; this is expensive and disrupts normal operations

better solution: allow the number of buckets to be modified dynamically

46
New cards

indexes

used to speed up record retrieval in response to certain search conditions; index structures provide secondary access paths; any field can be used to create an index, and multiple indexes can be constructed; most indexes based on ordered files organized into tree structures

47
New cards

types of single-level ordered indexes

  • ordered index similar to index in a textbook

  • indexing field (attribute)

    • index stores each value of the index field with list of pointers to all disk blocks that contain records with that field value

  • values in index are ordered

  • primary index, clustering index, secondary index

48
New cards

primary index

  • specified on the ordering key field of ordered file of records

  • ordered file with two fields

    • primary key, K(i)

    • pointer to a disk block, P(i)

  • one index entry in the index file for each block in the data file

  • indexes may be dense or sparse

  • major problem: insertion and deletion of records

    • move records around and change index values

    • solutions:

      • use unordered overflow file

      • use linked list of overflow records

49
New cards

dense index

has an index entry for every search key value in the data file

50
New cards

sparse index

has entries for only some search values

51
New cards

clustering indexes

  • clustering field

    • file records are physically ordered on a nonkey field without a distinct value for each record

  • ordered file with two fields

    • same type as clustering field

    • disk block pointer

52
New cards

secondary indexes

  • provide secondary means of accessing a data file

    • some primary access exists

  • ordering file with two fields

    • Indexing field K(i)

    • block pointer or record pointer, P(i)

  • usually need more storage space and longer search time than primary index

    • improved search time for arbitrary record

  • index record points to a bucket that contains pointers to all the actual records with that particular search-key value

  • secondary indices have to be dense

53
New cards

primary and secondary indices sequential scanning

  • indices offer substantial benefits when searching for records

  • BUT: updating indices imposes overhead on database modification — when a file is modified, every index on the file must be updated,

  • sequential scan using primary index is efficient, but a sequential scan using a secondary index is expensive

    • each record access may fetch a new block from disk

    • block fetch requires about 5 to 10 milliseconds, vs about 100 nanoseconds for memory access

54
New cards

multilevel index

  • if primary index does not fit in memory, access becomes expensive

  • solution: treat primary index kept on disk as sequential file and construct a sparse index on it

    • outer index- a sparse index of primary index

    • inner index- the primary index file

  • if even outer index is too large to fit in main memory, yet another level of index can be created, and so on

  • indices at all levels must be updated on insertion or deletion from the file

  • designed to greatly reduce remaining search space as search is conducted

  • index file- considered first (or base level) of multilevel index

  • second level- primary index to the first level

  • third level- primary index to the second level

55
New cards

dynamic multilevel indexes using B+ trees

  • tree data structure terminology

    • tree is formed of nodes

    • each node (except root) has one parent and zero or more child nodes

    • leaf node has no child node

      • unbalance if leaf nodes occur at different levels

    • nonleaf node called internal node

    • subtree of node consists of node and all descendant nodes

56
New cards

B+ trees

  • data pointers stored only at the leaf nodes

    • leaf nodes have an entry for every value of the search field, and a data pointer to the record if search field is a key field

    • for a nonkey search field, the pointer points to a block containing pointers to the data file records

  • internal nodes

    • some search field values from the leaf nodes repeated to guide search

57
New cards

B+ tree index files

  • B+ tree indices are an alternative to indexed-sequential files

  • disadvantage of indexed-sequential files:

    • performance degrades as file grows, since many overflow blocks get created

    • periodic reorganization of entire file is required

  • advantages of B+ tree index files:

    • automatically reorganizes itself with small, local, changes in the face of insertions and deletions

    • reorganization of entire file is not required to maintain performance

  • (minor) disadvantage of B+ trees:

    • extra insertion and deletion overhead, space overhead

  • advantages of B+ trees outweigh disadvantages, so B+ trees are used extensively