stupid baka quiz 4

75 Terms

1
New cards

retrieval models

provide a mathematical framework for defining the search process

2
New cards

What is an in-memory index?

it keeps the entire vocabulary and postings lists in RAM during index construction

3
New cards

why is an in-memory index limited?

it cannot scale because it assumes entire index fits in memory (fails for large collections where postings/term dict > avail RAM)

4
New cards

why is random disk access a bottleneck?

they require disk seeks, which take far longer than RAM access

5
New cards

why build partial indexes?

because full inverted index cannot fit into RAM

6
New cards

how long would it take to read 1mb sequentially from MEMORY?

~3 microseconds

7
New cards

how long would it take to read 1mb sequentially from DISK?

~825 microseconds

8
New cards

how long would it take to read 1mb sequentially from SSD?

49 microseconds

9
New cards

why sort partial indexes/lists before writing them onto the disk?

to make them easier to merge

10
New cards

you have two sorted partial indexes A and B and you want to merge them so that you can have one nice big index. HOWEVER, you do not have enough space to hold both in memory at the same time. You should….

Read the two files block by block, merging as you go, and repeat until you have one big sorted index at the end (e.g., compare block A1 with block B1, write out the merged result, then move on to A2 and B2, etc.)

11
New cards

Given partial indexes (A, B, C, D, E, ….) is it more efficient to read from two files at a time or to do a multi-way merge and read from all files simultaneously?

Read from all files simultaneously (as long as reading block sizes are big enough)
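The multi-way merge from cards 10–11 can be sketched in Python with the standard library's `heapq.merge`, which streams from every sorted run at once so only one block per run needs to be in memory. The run contents here are made-up examples.

```python
import heapq

def multiway_merge(sorted_runs):
    """Merge any number of sorted partial postings runs in one pass.

    heapq.merge performs an n-way merge, pulling the smallest next
    (term, docID) pair from whichever run has it, so the whole runs
    never have to fit in memory at once.
    """
    return list(heapq.merge(*sorted_runs))

# Partial indexes written to disk as sorted (term, docID) runs:
run_a = [("cat", 1), ("cat", 4), ("dog", 2)]
run_b = [("cat", 3), ("fish", 5)]
run_c = [("dog", 7)]

merged = multiway_merge([run_a, run_b, run_c])
```

In a real indexer each run would be a file iterator rather than an in-memory list; `heapq.merge` accepts those just as well.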

12
New cards

Given that postings are sorted by docID, if list lengths are x and y, then merge takes …

O(x + y) operations
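The O(x + y) merge of card 12 is the classic two-pointer intersection of docID-sorted postings lists; a minimal sketch:

```python
def intersect(p1, p2):
    """Intersect two docID-sorted postings lists in O(x + y) time."""
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])  # docID in both lists
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1  # advance the pointer with the smaller docID
        else:
            j += 1
    return answer

intersect([1, 2, 4, 8, 16], [1, 2, 3, 5, 8])  # -> [1, 2, 8]
```

Each iteration advances at least one pointer, so the loop runs at most x + y times.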

13
New cards

Given x terms and their corresponding postings lists, what’s the best order for query processing?

Process them in order of INCREASING document frequency (start with the terms that have the fewest postings, so intermediate result sets stay as small as possible)

14
New cards

Given the phrase “Friends, Romans, Countrymen”, what bi-grams are generated? 

(friends romans) (romans countrymen)
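Bi-gram generation as in card 14 is just pairing each word with its successor; a small sketch with naive normalization (lowercase, strip commas):

```python
def bigrams(phrase):
    """Generate word bi-grams from a phrase after simple normalization."""
    words = phrase.lower().replace(",", "").split()
    return list(zip(words, words[1:]))

bigrams("Friends, Romans, Countrymen")
# -> [('friends', 'romans'), ('romans', 'countrymen')]
```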

15
New cards

Should n-grams be adopted for arbitrary sizes? (True/False)

False, they're infeasible

16
New cards

What’s some problems with n-grams?

False positives, indexes/dict too big

17
New cards

For phrase queries, we should use a merge algorithm _______ at the _____ level

recursively, document

18
New cards

A positional index ________ postings storage ______

expands, substantially

19
New cards

Are positional indexes in standard use now?

True

20
New cards

Positional indexes require an entry for each….

occurrence

21
New cards

What are some advantages of Boolean search?

all of the above

22
New cards

What are some drawbacks to boolean retrieval?

All of the above

23
New cards

With a large index that doesn’t fit in memory, scanning the file looking for query terms could take…

from O(n) to O(log N), depending on whether or not the query terms are sorted

24
New cards

With a large index that doesn’t fit in memory, jumping to the right positions could take…

O(1)

25
New cards

In order to keep retrieval time quick, you should build a _______ for your index

index

26
New cards

MapReduce is…

a distributed programming tool designed for indexing and analysis tasks

27
New cards

A key benefit of MapReduce is that the multiple operations on the same input provides…

fault tolerance

28
New cards

Index merging is a good strategy when updates come in _______

large batches

29
New cards

Is index merging efficient at handling small updates? (True/False)

False

30
New cards

Why is multi-way merging more efficient?

It reads from all partial index files at once, merging them in a single pass and minimizing disk I/O

31
New cards

Why is sequential disk access faster than random access?

Sequential disk access avoids disk seeks

32
New cards

What do skip pointers do, and how do they help?

Allow the algorithm to jump ahead within long postings lists instead of advancing one docID at a time

33
New cards

What is the trade-off of adding more skip pointers?

Improves skipping ability but increases index size and overhead

34
New cards

How do skip pointers help?

reduce the number of comparisons needed during intersections
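Cards 32–34 can be illustrated with a sketch of intersection using skip pointers. Rather than storing explicit pointers, this version simulates skips at roughly √L intervals (a common heuristic): when the current docID is behind, it checks whether jumping ahead by the skip distance still lands at or below the target.

```python
import math

def intersect_with_skips(p1, p2):
    """Intersect two sorted postings lists, simulating skip pointers
    placed every ~sqrt(L) entries to jump ahead in long lists."""
    skip1 = max(1, math.isqrt(len(p1)))
    skip2 = max(1, math.isqrt(len(p2)))
    answer, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            # follow the skip pointer only if its target doesn't overshoot
            if i + skip1 < len(p1) and p1[i + skip1] <= p2[j]:
                i += skip1
            else:
                i += 1
        else:
            if j + skip2 < len(p2) and p2[j + skip2] <= p1[i]:
                j += skip2
            else:
                j += 1
    return answer
```

The trade-off from card 33 shows up directly: shorter skip intervals mean more chances to jump but more pointer storage in a real index.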

35
New cards

Why can’t we simply scan the entire inverted index file to find a term’s postings

Scanning is O(n), and the file may be gigabytes long

36
New cards

What does “indexing the index” accomplish?

It creates a RAM-resident table mapping terms → byte offsets in the on-disk index
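A sketch of "indexing the index": build a RAM-resident lexicon of term → offset while scanning once, then answer lookups with one seek instead of a scan. The record format (`term<TAB>docID,docID,...` per line) is hypothetical, and `io.StringIO` stands in for the on-disk file (ASCII data, so character offsets equal byte offsets).

```python
import io

def build_lexicon(index_file):
    """Scan the index once, recording each term's starting offset."""
    lexicon, offset = {}, 0
    for line in index_file:
        lexicon[line.split("\t", 1)[0]] = offset
        offset += len(line)
    return lexicon

def lookup(index_file, lexicon, term):
    """Jump straight to a term's record and parse its postings list."""
    index_file.seek(lexicon[term])
    line = index_file.readline()
    return [int(d) for d in line.rstrip("\n").split("\t", 1)[1].split(",")]

disk = io.StringIO("cat\t1,4,9\ndog\t2,3\nfish\t7\n")
lexicon = build_lexicon(disk)
lookup(disk, lexicon, "dog")  # one seek + one read, no scan
```

This also shows why binary search on the raw file fails (card 37): the records have variable length, so without the lexicon there is no way to compute a record's position.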

37
New cards

Why is binary search NOT sufficient to search inside the inverted index file?

Records inside the index file have variable length, so you cannot jump to the middle.

38
New cards

What is stored in the secondary index (the lexicon)?

Term statistics, vocabulary, and byte offset pointers

39
New cards

Why does the secondary index likely fit in RAM?

It only stores metadata for terms, not the postings lists

40
New cards

What is the purpose of splitting the index into directories (a/, b/, c/, …)?

It keeps each index file smaller and easier to cache or load in parallel.

41
New cards

How does MapReduce support large-scale indexing?

It parallelizes indexing by mapping documents into (term, docID) pairs and reducing them by term

42
New cards

What happens during the Shuffle phase of MapReduce indexing?

Machines combine identical terms from all mappers so reducers receive all data for each term.
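Cards 41–42 can be sketched as a toy, single-machine version of the three MapReduce phases; the function names and two-document corpus are made up for illustration.

```python
from collections import defaultdict

def map_phase(doc_id, text):
    """Map: emit a (term, docID) pair for every token in the document."""
    return [(term, doc_id) for term in text.lower().split()]

def shuffle(pairs):
    """Shuffle: group pairs by term so each reducer gets one term's data."""
    grouped = defaultdict(list)
    for term, doc_id in pairs:
        grouped[term].append(doc_id)
    return grouped

def reduce_phase(term, doc_ids):
    """Reduce: build the sorted, de-duplicated postings list for a term."""
    return term, sorted(set(doc_ids))

docs = {1: "cat loaf", 2: "fat cat"}
pairs = [p for d, t in docs.items() for p in map_phase(d, t)]
index = dict(reduce_phase(t, ids) for t, ids in shuffle(pairs).items())
# index == {"cat": [1, 2], "loaf": [1], "fat": [2]}
```

In real MapReduce the map and reduce calls run on many machines and the shuffle moves data over the network, but the data flow is the same.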

43
New cards

Boolean queries are good for …

Expert users with a precise understanding of their needs, and applications (which can easily consume thousands of results)

44
New cards

Boolean queries result in….(too few, too many, both) results

both (too few and too many)

45
New cards

In ranked retrieval, system returns

ordered list of top documents in collection for a query

46
New cards

Free text queries are queries where

user queries are just words in a human language

47
New cards

How can we rank-order the documents in the collection with respect to a query?

Assign to each document a score that measures how well the document and query “match”

48
New cards

Jaccard coefficient

jaccard(A,B) = | intersection(A, B) | / | union(A, B) |
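The formula above translates directly to set operations; the example calls reuse the documents from cards 54–55:

```python
def jaccard(a, b):
    """Jaccard coefficient of two term sets: |A ∩ B| / |A ∪ B|."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

jaccard("kitty loaf".split(), "fat cat loaf is the best loaf".split())  # 1/7
jaccard("kitty loaf".split(), "I like chonk chonk cute cat".split())    # 0.0
```

Note that because the inputs are converted to sets, repeated terms (the two "loaf"s, the two "chonk"s) count once — exactly the term-frequency blindness card 50 complains about.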

49
New cards

What are some good properties of Jaccard coefficient?

A, B don’t need to be same size

Always comes out to be in range [0, 1] inclusive

50
New cards

What are some issues with using Jaccard

Doesn’t consider term frequency, doesn’t consider IDF

51
New cards

Bag of words model

Vector representation that doesn’t consider the ordering of words in a document

52
New cards

Relevance (does/does not) increase proportionally with term frequency.

does not

53
New cards

The log frequency weight of term t in d (given term frequency > 0) is

1 + log10(tf)
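As a one-liner, with the conventional weight of 0 when the term is absent:

```python
import math

def log_tf_weight(tf):
    """Log-frequency weight: 1 + log10(tf) for tf > 0, else 0."""
    return 1 + math.log10(tf) if tf > 0 else 0

log_tf_weight(1)   # 1.0
log_tf_weight(10)  # 2.0
log_tf_weight(0)   # 0
```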

54
New cards

Given a query ‘kitty loaf’ and document ‘fat cat loaf is the best loaf’, what is the Jaccard coefficient?

1/7

55
New cards

Given a query ‘kitty loaf’ and document ‘I like chonk chonk cute cat’, what is the Jaccard coefficient?

0

56
New cards

What is the document frequency of cat given document corpus:

3

57
New cards

What is the inverse document frequency of cat given the following corpus

0

==> total # of docs = N = 3

document frequency = df = 3

idf = log10(N/df) = log10(3/3) = log10(1) = 0

58
New cards

Rare terms are (more/less) informative than frequent terms

more

59
New cards

Document frequency captures….

number of documents that contain a word t

inverse measure of informativeness of t

60
New cards

idf (inverse document frequency) of t is calculated by doing

idf = log10(N / df)

where N = total # of docs in the collection

df = document frequency = # of docs that contain t

61
New cards

Does idf have an effect on ranking for one-term queries? (True/False)

False

62
New cards

The collection frequency of t is ….

the number of occurrences of t in the collection, counting multiple occurrences.

63
New cards

tf-idf is calculated by ..

multiplying tf and idf together

= (1 + log(tf)) * log(N/df)
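That combined weight can be sketched as a small function, using base-10 logs as in the cards above:

```python
import math

def tf_idf(tf, df, n_docs):
    """tf-idf weight: (1 + log10(tf)) * log10(N / df); 0 if the term is absent."""
    if tf == 0:
        return 0.0
    return (1 + math.log10(tf)) * math.log10(n_docs / df)

# a term occurring twice in a doc, appearing in 2 of 3 docs:
tf_idf(2, 2, 3)  # (1 + log10 2) * log10(3/2) ≈ 0.229
```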

64
New cards

What are some benefits to using tf-idf?

  • Increases with the number of occurrences within a document

  • Increases with the rarity of the term in the collection

65
New cards

In the vector space model, _______ are the axes

terms

66
New cards

In the vector space model, _______ are the points/vectors

documents

67
New cards

In the vector space model, documents are ranked according to….

proximity to queries in this space

68
New cards

In the vector space model, proximity represents the…

similarities of vectors AND the inverse of distance

69
New cards

Why is Euclidean distance a bad idea for vector space proximity?

because Euclidean distance is large for vectors of different lengths, and queries and documents will (usually) have very different lengths...

70
New cards

In the vector space model, documents are ranked according to the _______- between documents and query

angles
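Ranking by angle is usually implemented as cosine similarity (the dot product of length-normalized vectors); a minimal sketch, with made-up document vectors in the style of card 75:

```python
import math

def cosine(u, v):
    """Cosine similarity: dot product divided by the vector lengths."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)

query = [1, 1, 1, 1]
d1, d2 = [1, 0, 0, 2], [1, 1, 1, 2]  # hypothetical tf-idf vectors

# rank documents by decreasing cosine similarity to the query
ranked = sorted([("D1", d1), ("D2", d2)],
                key=lambda p: cosine(query, p[1]), reverse=True)
```

Because both vectors are divided by their lengths, a long document doesn't outscore a short one just by having more terms — the length-normalization point from card 71.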

71
New cards

After length-normalization, both long and short documents now have comparable…

weights

72
New cards

SMART Notation

denotes the weighting combination in use in an engine, with the notation ddd.qqq (document weighting . query weighting)

73
New cards

Determine the most efficient processing order, if any, for the Boolean query Q considering the document frequency information from the table → 

(T1 AND T3) first, then merge with T2

74
New cards

What is the tf-idf of...the word loaf in S1?

≈ 0.23

(1 + log10(tf)) * log10(N/df)

tf = 2 (loaf occurs twice)

idf = log10(N/df) = log10(3/2)

(1 + log10(2)) * log10(3/2) ≈ 1.301 × 0.176 ≈ 0.23

75
New cards

Using cosine similarity as the ranking formula, what is the relative ranking of these documents for a query with coordinates [1, 1, 1, 1]?

D2, D1

(there’s a lot of math involved, but to save time on the quiz, just note that D2 is more similar to the query than D1)
