1/21
Flashcards for reviewing index engineering issues, scaling, distributed indexing, MapReduce, and index updating strategies.
Name | Mastery | Learn | Test | Matching | Spaced |
---|
No study sessions yet.
Indexer Milestone #1
Implement a simple Indexer with a small set of files, traversing folders, reading JSON, parsing HTML dealing with broken HTML, tokenization & stemming.
Boolean Search Milestone #2
Implement simple Boolean retrieval, AND only, with a text interface. Bonus points for a web GUI.
Ranked Search Milestone #3
Implement ranked retrieval, scale up. Must perform under 300ms, ideally ~100ms.
Small Index
Load entirely into memory as a hash table/dictionary. Find the postings by key on the query terms.
Large Index
Does not fit in memory. Must search for postings in the file itself.
How to search for postings in a large index?
Scan the file looking for the query terms or jump to right positions in the file.
Index the index
Create another bookkeeping file with offsets into the inverted index file.
Seek operation
Used to jump to specific positions in a file to access data.
Auxiliary Structures
Document IDs: lookup table from docid to URL. Vocabulary or lexicon: lookup table from index terms to the byte offset.
Alternatives to single inverted index
Split the inverted index file into alphabet ranges and Create file system path trees.
Distributed Indexing
Distributed processing driven by the need to index and analyze huge amounts of data.
MapReduce
Distributed programming tool designed for indexing and analysis tasks.
MapReduce
Distributed programming framework that focuses on data placement and distribution.
Mapper
Transforms a list of items into another list of items of the same length.
Reducer
Transforms a list of items into a single item.
Map Stage
Transforms data records into pairs, each with a key and a value (e.g.,
Shuffle
Uses a hash function so that all pairs with the same key end up next to each other and on the same machine.
Reduce Stage
Processes records in batches, where all pairs with the same key are processed at the same time.
Updating Index
Index merging is a good strategy for handling updates when they come in large batches.
Handling small Index updates
Create separate index for new documents, merge results a posteriori from the search in both indexes.
Handling Deletions
Handled using delete list.
Handling Index Modifications
Putting old version on delete list, adding new version to new documents index.