stupid baka quiz 4

Last updated 8:38 AM on 11/25/25

New cards

retrieval models

provide a mathematical framework for defining the search process

New cards

What is an in-memory index

it keeps entire vocabulary and posting lists in RAM during index construction

New cards

why is an in-memory index limited?

it cannot scale because it assumes entire index fits in memory (fails for large collections where postings/term dict > avail RAM)

New cards

why is random disk access a bottleneck?

it requires disk seeks, which take wayyy longer than RAM access

New cards

why build partial indexes?

because full inverted index cannot fit into RAM

New cards

how long would it take to read 1mb sequentially from MEMORY?

~3 microseconds

New cards

how long would it take to read 1mb sequentially from DISK?

~825 microseconds

New cards

how long would it take to read 1mb sequentially from SSD?

49 microseconds

New cards

why sort partial indexes/lists before writing them onto the disk?

to make them easier to merge

New cards

you have two sorted partial indexes A and B and you want to merge them so that you can have one nice big index. HOWEVER, you do not have enough space to hold both in memory at the same time. You should….

Read the two files block by block: merge the blocks currently in memory, write the merged output to disk, and load the next block from a file whenever its current block runs out. Repeat until both files are used up and you have one big sorted index
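
A minimal sketch of the block-by-block merge, assuming each partial index is a sorted run of `(term, docID)` records. The record layout, `read_blocks`, and `merge_runs` are hypothetical names for illustration, and a list slice stands in for the disk read.

```python
from typing import Iterator, List, Optional, Tuple

Posting = Tuple[str, int]  # (term, docID); hypothetical record layout


def read_blocks(run: List[Posting], block_size: int) -> Iterator[Posting]:
    """Yield records one by one while holding only one block at a time.

    A real implementation would read block_size records from a file here;
    the list slice stands in for that disk read.
    """
    for start in range(0, len(run), block_size):
        block = run[start:start + block_size]  # the only block "in memory"
        yield from block


def merge_runs(a: List[Posting], b: List[Posting], block_size: int = 2) -> List[Posting]:
    """Merge two sorted runs into one sorted run."""
    it_a, it_b = read_blocks(a, block_size), read_blocks(b, block_size)
    rec_a: Optional[Posting] = next(it_a, None)
    rec_b: Optional[Posting] = next(it_b, None)
    merged: List[Posting] = []  # stands in for the output file on disk
    while rec_a is not None and rec_b is not None:
        if rec_a <= rec_b:
            merged.append(rec_a)
            rec_a = next(it_a, None)
        else:
            merged.append(rec_b)
            rec_b = next(it_b, None)
    # One run is exhausted: copy whatever is left of the other.
    while rec_a is not None:
        merged.append(rec_a)
        rec_a = next(it_a, None)
    while rec_b is not None:
        merged.append(rec_b)
        rec_b = next(it_b, None)
    return merged
```

Note that the next block is loaded from whichever run empties first, so the blocks of A and B do not have to advance in lockstep.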

New cards

Given partial indexes (A, B, C, D, E, ….) is it more efficient to read from two files at a time or do a multi-way merge and read from all files simultaneously?

Read from all files simultaneously (as long as reading block sizes are big enough)
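
One way to sketch a multi-way merge in Python is with the standard library's `heapq.merge`, which lazily pulls the smallest next record from any number of sorted inputs. The function name `multiway_merge` and the in-memory lists standing in for partial index files are assumptions for illustration.

```python
import heapq
from typing import Iterable, List, Tuple

Posting = Tuple[str, int]  # (term, docID); hypothetical record layout


def multiway_merge(runs: Iterable[Iterable[Posting]]) -> List[Posting]:
    """Merge any number of sorted runs in a single pass.

    heapq.merge keeps one "current" record per run in a small heap,
    so each input is read once, front to back.
    """
    return list(heapq.merge(*runs))
```

With file-backed inputs, each run would be an iterator over a buffered reader, which is where the "big enough block size" condition comes in.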

New cards

Given that postings are sorted by docID, if list lengths are x and y, then merge takes …

O(x + y) operations
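
A sketch of the linear merge (intersection) of two docID-sorted postings lists. Each step advances at least one pointer, which is where the O(x + y) bound comes from. The function name `intersect` is chosen here for illustration.

```python
from typing import List


def intersect(p1: List[int], p2: List[int]) -> List[int]:
    """Return docIDs present in both sorted postings lists."""
    answer: List[int] = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1  # p1[i] cannot match anything later in p2
        else:
            j += 1  # p2[j] cannot match anything later in p1
    return answer
```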

New cards

Given x terms and their corresponding postings lists, what’s the best order for query processing?

Process them in order of INCREASING frequency (e.g., terms with the fewest postings first, terms with the most postings last)
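
A small sketch of picking the processing order from document frequencies. The function name `processing_order` and the frequencies in the usage example are made up for illustration.

```python
from typing import Dict, List


def processing_order(doc_freq: Dict[str, int]) -> List[str]:
    """Return query terms sorted by increasing document frequency."""
    return sorted(doc_freq, key=doc_freq.get)
```

For example, `processing_order({"brutus": 3, "caesar": 5, "calpurnia": 1})` puts `calpurnia` first, so the intermediate result starts out as small as possible.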

New cards

Given the phrase “Friends, Romans, Countrymen”, what bi-grams are generated?

(friends romans) (romans countrymen)
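
A minimal sketch of generating bi-grams (biwords), assuming tokens are lowercased runs of letters and digits. This tokenization rule and the name `biwords` are assumptions, not something the card specifies.

```python
import re
from typing import List, Tuple


def biwords(text: str) -> List[Tuple[str, str]]:
    """Return every pair of adjacent tokens in the text."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return list(zip(tokens, tokens[1:]))
```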

New cards

Should n-grams be adopted for arbitrary sizes? (True/False)

False, they’re infeasible

New cards

What are some problems with n-grams?

False positives, indexes/dict too big

New cards

For phrase queries, we should use a merge algorithm _______ at the _____ level

recursively, document

New cards

A positional index ________ postings storage ______

expands, substantially

New cards

Are positional indexes in standard use now?

True

New cards

Positional indexes require an entry for each….

occurrence
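
A rough sketch of a positional index, assuming whitespace tokenization and a nested `term -> docID -> [positions]` layout. Both are illustrative choices, and the function names are hypothetical. Every occurrence gets its own position entry, which is why storage grows.

```python
from collections import defaultdict
from typing import Dict, List

PositionalIndex = Dict[str, Dict[int, List[int]]]


def build_positional_index(docs: Dict[int, str]) -> PositionalIndex:
    """Record the position of every occurrence of every term."""
    index: PositionalIndex = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.lower().split()):
            index[term][doc_id].append(pos)
    return index


def phrase_docs(index: PositionalIndex, first: str, second: str) -> List[int]:
    """Return docIDs where `second` occurs immediately after `first`."""
    postings_1 = index.get(first, {})
    postings_2 = index.get(second, {})
    hits: List[int] = []
    # Document-level merge first, then a position-level check.
    for doc_id in sorted(set(postings_1) & set(postings_2)):
        following = set(postings_2[doc_id])
        if any(pos + 1 in following for pos in postings_1[doc_id]):
            hits.append(doc_id)
    return hits
```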

New cards

What are some advantages of Boolean search?

all of the above

New cards

What are some drawbacks to boolean retrieval?

All of the above

New cards

With a large index that doesn’t fit in memory, scanning the file looking for query terms could take…

from O(n) to O(log n), depending on whether or not query terms are sorted

New cards

With a large index that doesn’t fit in memory, jumping to the right positions could take…

O(1)

New cards

In order to keep retrieval time quick, you should build a _______ for your index

index

New cards

MapReduce is…

a distributed programming tool designed for indexing and analysis tasks

New cards

A key benefit of MapReduce is that multiple operations on the same input provide…

fault tolerance

New cards

Index merging is a good strategy when updates come in _______

large batches

New cards

Is index merging efficient at handling small updates? (True/False)

False

New cards

Why is multi-way merging more efficient?

Since it reads from all partial index files at once, it avoids the repeated disk reads and writes of intermediate merged files that pairwise merging needs

New cards

Why is sequential disk access faster than random access?

Sequential disk access avoids disk seeks

New cards

What do skip pointers do, and how do they help?

Allow the algorithm to jump ahead within long postings lists instead of advancing one docID at a time

New cards

What is the trade-off of adding more skip pointers?

Improves skipping ability but increases index size and overhead

New cards

How do skip pointers help?

reduce the number of comparisons needed during intersections
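
A sketch of intersection with skip pointers, assuming skips are placed every roughly sqrt(length) entries and simulated by index arithmetic rather than stored pointers. The placement rule is a common heuristic, used here as an assumption.

```python
import math
from typing import List


def intersect_with_skips(p1: List[int], p2: List[int]) -> List[int]:
    """Intersect two sorted postings lists, jumping ahead where it is safe."""
    skip1 = max(1, int(math.sqrt(len(p1))))
    skip2 = max(1, int(math.sqrt(len(p2))))
    answer: List[int] = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            # Take the skip only if its target does not overshoot p2[j].
            if i % skip1 == 0 and i + skip1 < len(p1) and p1[i + skip1] <= p2[j]:
                i += skip1
            else:
                i += 1
        else:
            if j % skip2 == 0 and j + skip2 < len(p2) and p2[j + skip2] <= p1[i]:
                j += skip2
            else:
                j += 1
    return answer
```

A skip is safe because every entry jumped over is smaller than the skip target, which is itself no larger than the docID being searched for.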

New cards

Why can’t we simply scan the entire inverted index file to find a term’s postings?

Scanning is O(n), and the file may be gigabytes long

New cards

What does “indexing the index” accomplish?

It creates a RAM-resident table mapping terms → byte offsets in the on-disk index
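
A minimal sketch of "indexing the index", assuming one JSON record per line in the on-disk file. That file format and the names `write_index` and `lookup` are hypothetical. An `io.BytesIO` buffer stands in for the index file so the example is self-contained.

```python
import io
import json
from typing import BinaryIO, Dict, List


def write_index(postings: Dict[str, List[int]], out: BinaryIO) -> Dict[str, int]:
    """Write one record per term and return a term -> byte offset table."""
    offsets: Dict[str, int] = {}
    for term in sorted(postings):
        offsets[term] = out.tell()  # where this term's record starts
        record = json.dumps([term, postings[term]]) + "\n"
        out.write(record.encode("utf-8"))
    return offsets


def lookup(term: str, offsets: Dict[str, int], index_file: BinaryIO) -> List[int]:
    """Jump straight to a term's record instead of scanning the file."""
    if term not in offsets:
        return []
    index_file.seek(offsets[term])
    _, doc_ids = json.loads(index_file.readline().decode("utf-8"))
    return doc_ids
```

Only the small `offsets` table has to stay in RAM; the postings themselves are read from the file on demand.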

New cards

Why is binary search NOT sufficient to search inside the inverted index file?

Records inside the index file have variable length, so you cannot jump to the middle.

New cards

What is stored in the secondary index (the lexicon)?

Term statistics, vocabulary, and byte offset pointers

New cards

Why does the secondary index likely fit in RAM?

It only stores metadata for terms, not the postings lists

New cards

What is the purpose of splitting the index into directories (a/, b/, c/, …)?

It keeps each index file smaller and easier to cache or load in parallel.

New cards

How does MapReduce support large-scale indexing?

It parallelizes indexing by mapping documents into (term, docID) pairs and reducing them by term

New cards

What happens during the Shuffle phase of MapReduce indexing?

Machines combine identical terms from all mappers so reducers receive all data for each term.
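
A single-machine sketch of the map, shuffle, and reduce steps for indexing, assuming whitespace tokenization. The function names are illustrative and do not belong to any real MapReduce framework's API.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def map_phase(doc_id: int, text: str) -> List[Tuple[str, int]]:
    """Emit one (term, docID) pair per token in the document."""
    return [(term, doc_id) for term in text.lower().split()]


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group pairs by term so each reducer sees all docIDs for its term."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for term, doc_id in pairs:
        grouped[term].append(doc_id)
    return grouped


def reduce_phase(term: str, doc_ids: List[int]) -> Tuple[str, List[int]]:
    """Turn one term's docIDs into a sorted, de-duplicated postings list."""
    return term, sorted(set(doc_ids))


def build_index(docs: Dict[int, str]) -> Dict[str, List[int]]:
    """Run the three phases one after another on one machine."""
    pairs = [p for doc_id, text in docs.items() for p in map_phase(doc_id, text)]
    return dict(reduce_phase(term, ids) for term, ids in shuffle(pairs).items())
```

In a real cluster the mappers and reducers run on different machines, and the shuffle is the step that moves data between them.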

New cards

Boolean queries are good for …

Expert users with precise knowledge of their needs and the collection; applications (which can easily consume thousands of results)

New cards

Boolean queries result in….(too few, too many, both) results

both (too few and too many)

New cards

In ranked retrieval, system returns

ordered list of top documents in collection for a query

New cards

Free text queries are queries where

user queries are just words in a human language

New cards

How can we rank-order the documents in the collection with respect to a query?

Assign a score that measures how well document and query “match” to each document

New cards

Jaccard coefficient

jaccard(A,B) = | intersection(A, B) | / | union(A, B) |

New cards

What are some good properties of Jaccard coefficient?

A, B don’t need to be same size

Always comes out to be in range [0, 1] inclusive
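
A short sketch of the Jaccard coefficient on token sets, assuming whitespace tokenization and treating two empty sets as a score of 0 (a convention chosen here to avoid dividing by zero).

```python
from typing import Iterable


def jaccard(a: Iterable[str], b: Iterable[str]) -> float:
    """Return |A ∩ B| / |A ∪ B| for two collections of tokens."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0
```

For example, `jaccard("kitty loaf".split(), "fat cat loaf is the best loaf".split())` has one shared term out of seven distinct terms, so it returns 1/7. The repeated `loaf` counts only once.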

New cards

What are some issues with using Jaccard?

Doesn’t consider term frequency, doesn’t consider IDF

New cards

Bag of words model

Vector representation that doesn’t consider the ordering of words in a document

New cards

Relevance (does/does not) increase proportionally with term frequency.

does not

New cards

Log frequency weight of term t in d is, (given term frequency > 0)

1 + log10(tf), where tf = term frequency of t in d

New cards

Given a query ‘kitty loaf’ and document: “fat cat loaf is the best loaf”, what is the Jaccard Coefficient?

1/7

New cards

Given a query ‘kitty loaf’ and document: “I like chonk chonk cute cat”, what is the Jaccard Coefficient?

0 (the query and the document share no terms, so the intersection is empty)

New cards

What is the document frequency of cat given document corpus:

New cards

What is the inverse document frequency of cat given the following corpus

==> total # of docs = N = 3

document frequency = 3

log10(N/df) = log10(3/3) = log10(1) = 0

New cards

Rare terms are (more/less) informative than frequent terms

more

New cards

Document frequency captures….

number of documents that contain a word t

inverse measure of informativeness of t

New cards

idf (inverse document frequency) of t is calculated by doing

idf = log10(N / df)

where total # of docs = N

document frequency = # of docs term is in = df
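
A tiny sketch of the idf formula with base-10 logs, as in the card above. The function name `idf` and its argument names are chosen for illustration.

```python
import math


def idf(n_docs: int, df: int) -> float:
    """Inverse document frequency: log10(N / df). Assumes df > 0."""
    return math.log10(n_docs / df)
```

A term that appears in every document (df = N) gets `idf(3, 3) == 0.0`, so it contributes nothing to ranking, while a term found in only 1 of 3 documents gets about 0.477.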

New cards

Does idf have an effect on ranking for one-term queries? (True/False)

False

New cards

The collection frequency of t is ….

the number of occurrences of t in the collection, counting multiple occurrences.

New cards

tf-idf is calculated by ..

multiplying tf and idf together

= (1 + log(tf)) * log(N/df)

New cards

What are some benefits to using tf-idf?

Increases with the number of occurrences within a document
Increases with the rarity of the term in the collection

New cards

In the vector space model, _______ are the axes

terms

New cards

In the vector space model, _______ are the points/vectors

documents

New cards

In the vector space model, documents are ranked according to….

proximity to queries in this space

New cards

In the vector space model, proximity represents the…

similarities of vectors AND the inverse of distance

New cards

Why is Euclidean distance a bad idea for vector space proximity?

because Euclidean distance is large for vectors of different lengths, and queries and documents will (usually) have very different lengths...

New cards

In the vector space model, documents are ranked according to the _______ between documents and query

angles
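
A sketch comparing cosine similarity with Euclidean distance on plain Python lists. The vectors in the usage note are made up: the second is the first scaled by 10, as if a document were pasted onto itself ten times.

```python
import math
from typing import Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between u and v (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def euclidean(u: Sequence[float], v: Sequence[float]) -> float:
    """Straight-line distance between u and v."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))
```

For `[1, 2]` and `[10, 20]` the angle is zero, so the cosine is essentially 1, yet the Euclidean distance is over 20. Length differences dominate the distance even though the term mix is identical.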

New cards

After length-normalization, both long and short documents now have comparable…

weights

New cards

SMART Notation

denotes the combination in use in an engine, with the notation ddd.qqq

New cards

Determine the most efficient processing order, if any, for the Boolean query Q considering the document frequency information from the table →

(T1 AND T3) first, then merge with T2

New cards

What is the tf-idf of...the word loaf in S1?

(1+log(tf)) * log(N/df)

tf = 2 (occurs twice)

idf = log(N/df) = log(3/2)

(1 + log(2)) * log(3/2)
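
A sketch that evaluates the expression above, assuming the logs are base 10 as in the idf card. The function name `tf_idf` and returning 0 when the term is absent are choices made for this example.

```python
import math


def tf_idf(tf: int, df: int, n_docs: int) -> float:
    """(1 + log10(tf)) * log10(N / df), or 0.0 if the term is absent."""
    if tf == 0 or df == 0:
        return 0.0
    return (1 + math.log10(tf)) * math.log10(n_docs / df)
```

With tf = 2, df = 2, and N = 3, `tf_idf(2, 2, 3)` comes out to about 0.229.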

New cards

Using cosine similarity as the ranking formula, what is the relative ranking of these documents for a query with coordinates [1, 1, 1, 1]?

D2, D1

(there’s a lot of math involved, but for time purposes in the quiz, just look and see that D2 is more similar to the query than D1)
