Informatics 141 / Computer Science 121 - Lecture 18

0.0(0)
studied byStudied by 0 people
learnLearn
examPractice Test
spaced repetitionSpaced Repetition
heart puzzleMatch
flashcardsFlashcards
Card Sorting

1/21

flashcard set

Earn XP

Description and Tags

Flashcards for reviewing index engineering issues, scaling, distributed indexing, MapReduce, and index updating strategies.

Study Analytics
Name
Mastery
Learn
Test
Matching
Spaced

No study sessions yet.

22 Terms

1
New cards

Indexer Milestone #1

Implement a simple Indexer with a small set of files, traversing folders, reading JSON, parsing HTML dealing with broken HTML, tokenization & stemming.

2
New cards

Boolean Search Milestone #2

Implement simple Boolean retrieval, AND only, with a text interface. Bonus points for a web GUI.

3
New cards

Ranked Search Milestone #3

Implement ranked retrieval, scale up. Must perform under 300ms, ideally ~100ms.

4
New cards

Small Index

Load entirely into memory as a hash table/dictionary. Find the postings by key on the query terms.

5
New cards

Large Index

Does not fit in memory. Must search for postings in the file itself.

6
New cards

How to search for postings in a large index?

Scan the file looking for the query terms or jump to right positions in the file.

7
New cards

Index the index

Create another bookkeeping file with offsets into the inverted index file.

8
New cards

Seek operation

Used to jump to specific positions in a file to access data.

9
New cards

Auxiliary Structures

Document IDs: lookup table from docid to URL. Vocabulary or lexicon: lookup table from index terms to the byte offset.

10
New cards

Alternatives to single inverted index

Split the inverted index file into alphabet ranges and Create file system path trees.

11
New cards

Distributed Indexing

Distributed processing driven by the need to index and analyze huge amounts of data.

12
New cards

MapReduce

Distributed programming tool designed for indexing and analysis tasks.

13
New cards

MapReduce

Distributed programming framework that focuses on data placement and distribution.

14
New cards

Mapper

Transforms a list of items into another list of items of the same length.

15
New cards

Reducer

Transforms a list of items into a single item.

16
New cards

Map Stage

Transforms data records into pairs, each with a key and a value (e.g.,

17
New cards

Shuffle

Uses a hash function so that all pairs with the same key end up next to each other and on the same machine.

18
New cards

Reduce Stage

Processes records in batches, where all pairs with the same key are processed at the same time.

19
New cards

Updating Index

Index merging is a good strategy for handling updates when they come in large batches.

20
New cards

Handling small Index updates

Create separate index for new documents, merge results a posteriori from the search in both indexes.

21
New cards

Handling Deletions

Handled using delete list.

22
New cards

Handling Index Modifications

Putting old version on delete list, adding new version to new documents index.