CS121 / INF141 Information Retrieval Practice Questions

Last updated 3:49 PM on 6/9/26

121 Terms

New cards

A startup is building a new tool that identifies documents of interest to a user as soon as they are published, pushing them directly to the user's feed. Which category best describes the core function?

A) Search

B) Classification

C) Filtering or tracking

D) Question answering

C) Filtering or tracking

New cards

A user submits the query "Find how to merge two lists in Python" into a search engine. Which characteristic of search engines, as opposed to database engines, is essential here?

A) The search involves numbers in addition to text.

B) The ability to infer meaning from informal and ad-hoc queries.

C) The data is required to be well-structured.

D) The necessity of using a formal language.

B) The ability to infer meaning from informal and ad-hoc queries.

New cards

Which factor is a core element in modern Web IR that is explicitly ignored in the assumptions of Classic IR models?

A) Textual similarity between query and document

B) The context of the user (location, prior queries, age)

C) The existence of a relevance score R(Q,D)

D) The collection of documents, or corpus

B) The context of the user

New cards

A search engine struggles to return fresh results because new pages appear constantly and the volume is too large for one server. Which two features of Web IR is it primarily struggling to manage?

A) User context and large user count

B) The corpus is not centralized and is not static

C) Relevance score maximization and link structure

D) Performance optimization and evaluation logging

B) The corpus is not centralized, and the corpus is not static.

New cards

A company wants a search for "fishing boats" to be treated the same as "fishes boat." Which Text Transformation process groups similar words together?

A) Tokenizer

B) Stopping

C) Stemming

D) Information Extraction

C) Stemming

New cards

In the Web Search Engine Workflow, which component traverses the web, finds documents, and sends them to the Indexer?

A) The Tokenizer

B) The Ad Indices

C) The Query Broker

D) The Web Spider, or Crawler

D) The Web Spider, or Crawler

New cards

True or False? The left side of the architecture of a Web search engine pertains to processes that are done well before any query is issued.

A) True

B) False

A) True

New cards

Within the architecture of a search engine, tokenization pertains mostly to:

A) Text acquisition

B) Text transformation

C) Indexing

B) Text transformation

New cards

Users click on the third result instead of the first, even though the first result has a higher relevance score. Which part of the architecture uses this data to analyze ranking algorithms?

A) Text Transformation/Query expansion

B) Index Creation/Term weighting

C) Evaluation/Logging (Ranking analysis)

D) Text Acquisition/Web Crawling

C) Evaluation/Logging, specifically Ranking analysis

New cards

A developer focuses on the top 5 results on the first page. Which user characteristic justifies this prioritization?

A) Most queries have spelling mistakes.

B) Generic relevance models are impossible.

C) Approximately 85% of users look over only the first result screen.

D) Users rarely submit queries longer than 3 words.

C) Approximately 85% of users look over only the first result screen.

New cards

A component runs over documents to discard common, uninformative words like "the" and "a." What is this process called?

A) Tokenization

B) Stemming

C) Linguistic analysis

D) Removal of stop words

D) Removal of stop words

New cards

A librarian searches a fixed archive of exactly 5,000 digitized manuscripts. In IR terminology, what do these 5,000 manuscripts collectively represent?

A) The inverted index

B) The Query Log

C) The Corpus

D) The Crawled Data

C) The Corpus

New cards

A student spends an hour refining search terms and modifying queries based on results. Which of the following is true?

A) This is typical search engine user behavior.

B) This is atypical search engine user behavior, as users almost never modify their queries.

New cards

In medical research search, finding every relevant study is critical even if irrelevant results are included. Which metric is most important?

A) Precision, because users only look at the first screen.

B) Recall, because completeness is required in Sciences/Law.

C) Precision, because the corpus is static.

D) Recall, to eliminate duplicates.

B) Recall, because completeness is required in specialized searches like Sciences/Law.

New cards

Hundreds of pages contain the exact same article text due to syndication. This problem relates to which Web Graph characteristic?

A) High linkage

B) High rate of change

C) Significant duplication (25%-40% of content)

D) Web graph is too large for one server

C) Significant duplication, with studies showing between 25%-40% of content may be duplicated.

New cards

A tool indexes a fixed, unchangeable collection of historical documents. Which two components can they safely skip?

A) Text processing

B) Collection of user context

C) Spam detection and elimination

D) Crawling

E) Indexing

C) Spam detection and elimination, D) Crawling

New cards

Processing: "The team's decision to re-prioritize the task was difficult." Which decision pertains specifically to the Tokenization step?

A) Applying stemming to "decision" and "decide".

B) Identifying user intent.

C) How to handle the apostrophe in "team's" and the hyphen in "re-prioritize".

D) Deciding if "The" is a stop word.

C) How to handle the apostrophe in "team's" and the hyphen in "re-prioritize".

New cards

Match URL syntax A://B/C?D#E:

A) A=Protocol, B=Hostname, C=Path, D=Query, E=Fragment

B) A=Hostname, B=Protocol, C=Query, D=Path, E=Fragment

C) A=Protocol, B=Path, C=Hostname, D=Fragment, E=Query

A) A = Protocol, B = Hostname, C = Path, D = Query, E = Fragment
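
The mapping can be checked with Python's standard `urllib.parse` module. This is a minimal sketch; the URL below is a made-up example chosen only so that every component is present.

```python
from urllib.parse import urlsplit

# Hypothetical URL used only to exercise all five components.
url = "https://www.example.com/docs/page.html?q=stemming#results"
parts = urlsplit(url)

components = {
    "A (protocol / scheme)": parts.scheme,
    "B (hostname)": parts.hostname,
    "C (path)": parts.path,
    "D (query)": parts.query,
    "E (fragment)": parts.fragment,
}
for label, value in components.items():
    print(f"{label}: {value}")
```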

New cards

Tokenizer rule: Separate by whitespace, preserve punctuation, lower-case. Sentence: "It's 10:30, and $500 is due." Tokens?

A) it, s, 10, 30, and, 500, is, due

B) its, 10:30, and, 500, is, due

C) it's, 10:30,, and, $500, is, due.

D) it's, 10:30, and, $500, is, due

C) it's, 10:30,, and, $500, is, due.
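
A minimal sketch of the rule stated in the question (split on whitespace, keep punctuation, lower-case). The function name is hypothetical.

```python
def whitespace_tokenize(text: str) -> list[str]:
    """Lower-case the text and split on whitespace only.

    Punctuation is never stripped, so it stays attached to its token.
    """
    return text.lower().split()


tokens = whitespace_tokenize("It's 10:30, and $500 is due.")
print(tokens)
```

The comma after `10:30` and the final period survive because nothing but whitespace acts as a separator.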

New cards

Aggressive tokenizer: all non-alphanumeric characters are separators, lower-case. Sentence: "The team's decision to re-examine the data, 1,000 points, was final." Tokens?

A) the, team's, decision... 1000, points...

B) the, teams, decision... 1000, points...

C) the, team, decision, examine, data, points, final

D) the, team, s, decision, to, re, examine, the, data, 1, 000, points, was, final

D) the, team, s, decision, to, re, examine, the, data, 1, 000, points, was, final
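
A minimal sketch of the aggressive rule, assuming "alphanumeric" means ASCII letters and digits. The function name is hypothetical.

```python
import re


def aggressive_tokenize(text: str) -> list[str]:
    """Lower-case the text and treat every non-alphanumeric character as a separator."""
    return re.findall(r"[a-z0-9]+", text.lower())


tokens = aggressive_tokenize(
    "The team's decision to re-examine the data, 1,000 points, was final."
)
print(tokens)
```

The apostrophe, hyphen, and thousands separator each split a word, which is why `team's` becomes `team` and `s`, and `1,000` becomes `1` and `000`.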

New cards

What is the frontier of a web crawler?

A. Set of URLs already crawled

B. Set of URLs from which to start

C. Set of URLs that have been seen but not yet crawled

C) It's the set of URLs that have been seen but not yet crawled

New cards

Robots.txt:

User-agent: *

Disallow: /admin/

User-agent: Googlebot

Disallow: /images/

Mark true:

A. All except PerplexityBot can crawl /images/

B. No crawler should fetch /admin/

C. Googlebot can fetch /admin/

D. "IR S26" can crawl /images/ but not /admin/

B, D

New cards

Politeness in web crawling refers to avoiding overloading websites by limiting request frequency. A. True B. False

A) True

New cards

Purpose of politeness policies in web crawling?

A. Increase pages per second

B. Reduce redundant downloads

C. Prevent overloading servers by controlling timing and frequency

D. Ensure crawler only accesses allowed permissions

C) To prevent overloading web servers by controlling the timing and frequency of requests

New cards

Desired characteristics for a large-scale Web crawler? (Select all)

A. Distributed system

B. Crawl only specific topics

C. Scalable

D. Fully utilize processing and bandwidth

A, C, D

New cards

Distinguish URL from URN:

A. URL identifies by name, URN by location

B. URL specifies location, URN specifies what the resource is

C. Both identify by physical location

D. URN is a subset of URL for web resources

B) A URL specifies where a resource is located, while a URN specifies what the resource is

New cards

A crawler trap is always unintentional. A. True B. False

B) False

New cards

Legal trouble scenarios for a crawler: (Mark all)

A. Republishing posts verbatim

B. Short excerpts with commentary

C. Bypassing paywalls commercially

D. Stripping copyright and selling access

A, C, D

New cards

Why use a cache server to fetch pages? (Select all)

A. To avoid overloading the local network (e.g. ICS)

B. Websites refuse unknown crawlers

C. To allow staff to analyze requests

D. Frontier is managed by cache

A, C

New cards

Why should web crawlers avoid running as fast as possible?

A. Reduce latency

B. Avoid overwhelming servers and respect rate limits

C. Collect all data

D. Minimize memory

B) To avoid overwhelming servers and respect rate limits

New cards

Characteristics of a crawler trap? (Select all)

A. Dynamically generated infinite URLs

B. Single large file slowing speed

C. Outdated harmless pages

D. Pages causing endless loops

A, D

New cards

Responsible developer action per Bot Writer Guidelines:

A. Deploy immediately across hundreds of sites

B. Run full speed overnight

C. Test locally and check each site's policy

D. Hide identity

C) Test the bot locally and check each site's crawling policy

New cards

What determines which page is crawled next after one is finished?

A. Never crawled before

B. Most popular

C. Highest-priority page on highest-priority site not visited for N seconds

D. Highest-priority on frontier

C) The highest-priority page hosted by the highest-priority site that hasn't been visited for the past N seconds
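
One way to picture this selection rule is the sketch below. It is not the course's crawler: the function name, the data layout (a list of `(page_priority, url)` pairs, a per-host priority table, and a per-host last-visit timestamp) and the convention that a larger number means higher priority are all assumptions made for illustration.

```python
from urllib.parse import urlsplit


def pick_next(frontier, site_priority, last_visit, now, delay):
    """Return the URL to crawl next, or None if every site is still cooling down.

    frontier:      list of (page_priority, url) pairs (hypothetical layout)
    site_priority: dict host -> priority, larger means more important
    last_visit:    dict host -> timestamp of the most recent request
    delay:         the politeness interval N, in seconds
    """
    # Keep only pages whose host has not been contacted in the last `delay` seconds.
    eligible = {}
    for priority, url in frontier:
        host = urlsplit(url).hostname
        if now - last_visit.get(host, float("-inf")) >= delay:
            eligible.setdefault(host, []).append((priority, url))
    if not eligible:
        return None
    # Highest-priority eligible site first, then its highest-priority page.
    best_host = max(eligible, key=lambda h: site_priority.get(h, 0))
    return max(eligible[best_host])[1]


frontier = [
    (5, "https://a.example/x"),
    (9, "https://a.example/y"),
    (7, "https://b.example/z"),
]
site_priority = {"a.example": 2, "b.example": 1}
last_visit = {"a.example": 100.0}

too_soon = pick_next(frontier, site_priority, last_visit, now=100.5, delay=1.0)
later = pick_next(frontier, site_priority, last_visit, now=102.0, delay=1.0)
print(too_soon, later)
```

At `now=100.5` the more important site was contacted half a second ago, so the crawler falls back to the other site; at `now=102.0` the important site is eligible again and its best page is chosen.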

New cards

URL Syntax A://B/C?D#E:

A. A=Protocol, B=Hostname, C=Path, D=Query, E=Fragment

B. A=Hostname, B=Protocol...

C. A=Protocol, B=Path...

A) A = Protocol, B = Hostname, C = Path, D = Query, E = Fragment

New cards

After obtaining an IP, the crawler can open a connection without a port number as the server routes automatically. A. False B. True

A) False

New cards

Robots.txt:

User-Agent: *

Content-Signal: search=yes, ai-train=no

Allow: /

Mark true:

A. Invalid, ignore it

B. Search engines can crawl entirety

C. No part for AI training

D. No part for AI prompts

B, C, D

New cards

Crawler Code line #3: tbd_url = self.frontier.get_tbd_url()

A. URL marked valid

B. Just downloaded URL picked

C. Next URL to be downloaded is picked

D. New URL placed in frontier

C) The next URL to be downloaded is picked from the frontier

New cards

Crawler Code line #7: resp = download(tbd_url, ...)

A. Next URL placed in frontier

B. Many URLs downloaded at once

C. Next URL is downloaded

D. Response from frontier

C) The next URL is downloaded

New cards

Crawler Code line #11: scraped_urls = scraper(tbd_url, resp)

A. Your code is called

B. Next URL picked

C. Next URL downloaded

D. Crawler receives URLs from frontier

A) Your code is called

New cards

What is missing from the provided crawler code loop?

A. Way to discard URLs

B. Robots.txt check for tbd_url

C. Stop condition for empty frontier

D. Adding URLs to frontier

B) A check of whether the tbd_url is allowed in robots.txt
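
A hedged sketch of what such a check could look like, using the standard library's `urllib.robotparser`. The robots.txt text, the user-agent string `MyCrawler`, and the helper names are hypothetical; a real crawler would fetch and cache each site's robots.txt rather than use a hard-coded string.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, parsed offline for illustration.
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
"""


def make_checker(robots_txt: str) -> RobotFileParser:
    """Build a parser from robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def is_allowed(parser: RobotFileParser, user_agent: str, url: str) -> bool:
    """Return True if the rules permit `user_agent` to fetch `url`."""
    return parser.can_fetch(user_agent, url)


checker = make_checker(ROBOTS_TXT)
blocked = is_allowed(checker, "MyCrawler", "https://example.com/admin/users")
permitted = is_allowed(checker, "MyCrawler", "https://example.com/index.html")
print(blocked, permitted)
```

In the loop described by the question, a call like `is_allowed(...)` would sit between picking `tbd_url` from the frontier and downloading it.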

New cards

Boolean retrieval: ANDing three sorted lists of size n, m, q. Complexity of best 3-way merge?

A. O(n + m + q)

B. O(n × m × q)

C. O(n² + m² + q²)

D. O(n)

A) O(n + m + q)
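
A minimal sketch of a linear three-way intersection, assuming each postings list is sorted in ascending order with no duplicates. Every loop iteration advances at least one pointer, so the total work is bounded by n + m + q.

```python
def intersect3(a, b, c):
    """Intersect three sorted lists of docIDs in O(n + m + q) time."""
    i = j = k = 0
    result = []
    while i < len(a) and j < len(b) and k < len(c):
        x, y, z = a[i], b[j], c[k]
        if x == y == z:
            result.append(x)
            i += 1
            j += 1
            k += 1
        else:
            # Any pointer sitting below the current maximum cannot match; advance it.
            largest = max(x, y, z)
            if x < largest:
                i += 1
            if y < largest:
                j += 1
            if z < largest:
                k += 1
    return result


common = intersect3([1, 3, 5, 7, 9], [3, 4, 5, 9, 10], [0, 3, 9, 11])
print(common)
```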

New cards

A startup moves from a term-document matrix to a different structure for millions of pages due to memory. What is the best change?

A. Compress pages to integers

B. Map structure from term to list of docIDs

C. Matrix on disk

D. Remove rare terms

B) Use map structure from term to a list of docIDs

New cards

Postings: "election" (10k), "fraud" (500), "statistics" (1k). Best merge order for AND query?

A. Scan all documents

B. (election AND statistics) then fraud

C. (fraud AND statistics) then election

D. (fraud AND statistics), sort, then election

C) Merge the postings of "fraud" and "statistics" first, and then merge with the posting of "election"
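
The usual heuristic is to process terms in increasing order of postings-list size, so intermediate results stay as small as possible. A small sketch using the sizes from the question (the function name is hypothetical):

```python
def and_query_order(posting_sizes: dict[str, int]) -> list[str]:
    """Order the terms of an AND query from shortest to longest postings list."""
    return sorted(posting_sizes, key=posting_sizes.get)


order = and_query_order({"election": 10_000, "fraud": 500, "statistics": 1_000})
print(order)
```

The intersection of "fraud" and "statistics" can hold at most 500 docIDs, so the final merge against the 10,000-entry list starts from a short list.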

New cards

Simhash: "ocean waves carry energy across the ocean using deep waves and surface waves."

Summation vector: [X, 1, 5, -1, 3, 1, Y, 3]

What is X? (ocean hash: 10110001, count 2)

A. 1 B. -1 C. 5 D. 3 E. -5

B) -1

New cards

Simhash: "ocean waves carry energy across the ocean using deep waves and surface waves."

Summation vector: [X, 1, 5, -1, 3, 1, Y, 3]

What is Y? (waves hash: 01101101, count 3)

A. 1 B. -1 C. 5 D. 3 E. -5

E) -5

New cards

Simhash final fingerprint for summation vector [-1, 1, 5, -1, 3, 1, -5, 3]?

A. 11101101

B. 01101111

C. 01101101

D. 01101100

C) 01101101
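
The last two Simhash steps can be sketched as follows. The sketch assumes the common convention that a positive component maps to bit 1 and anything else to bit 0, and that a word adds its count for a 1 bit and subtracts it for a 0 bit; the function names are hypothetical.

```python
def contribution(hash_bits: str, count: int) -> list[int]:
    """Per-bit contribution of one word: +count where its hash bit is 1, -count where it is 0."""
    return [count if bit == "1" else -count for bit in hash_bits]


def fingerprint(summation: list[int]) -> str:
    """Collapse the summation vector into bits: positive -> 1, otherwise -> 0."""
    return "".join("1" if value > 0 else "0" for value in summation)


ocean = contribution("10110001", 2)   # "ocean" appears twice
waves = contribution("01101101", 3)   # "waves" appears three times
final = fingerprint([-1, 1, 5, -1, 3, 1, -5, 3])
print(ocean, waves, final)
```

The full summation vector adds the contributions of every word in the sentence; only the two hashes given in the questions are shown here.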

New cards

Porter Stemmer behavior:

A. Morphological analysis for dictionary word

B. Heuristic "suffix-stripping" resulting in non-word roots

C. POS tagging to resolve ambiguity

D. Mapping to synonyms

B) It applies heuristic "suffix-stripping" rules to truncate words, which often results in a "root" that is not a valid dictionary word.

New cards

HTML:

Important Announcement

Cache server

Do not bypass the cache

If you do...

Which words are most important?

A. Important Announcement Cache server

B. Do not bypass the cache

C. If you do...

D. Important Announcement Cache server Do not bypass the cache

D) Important Announcement Cache server Do not bypass the cache

New cards

Indexing 200M pages. Postings store full URL strings as docIDs. Memory is an issue. Best fix?

A. Compress URL strings

B. Store domain only

C. Map unique URL to small integer docID

D. Duplicate docIDs across indexes

C) Map each unique URL to a small integer docID and store only the integers in postings
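
A minimal sketch of such a mapping, assuming docIDs are handed out sequentially in the order URLs are first seen. The class and method names are hypothetical.

```python
class DocIdMap:
    """Assign each unique URL a small sequential integer and support reverse lookup."""

    def __init__(self):
        self._id_by_url = {}   # URL -> docID
        self._url_by_id = []   # docID -> URL (list index is the docID)

    def get_id(self, url: str) -> int:
        """Return the docID for `url`, allocating a new one on first sight."""
        if url not in self._id_by_url:
            self._id_by_url[url] = len(self._url_by_id)
            self._url_by_id.append(url)
        return self._id_by_url[url]

    def get_url(self, doc_id: int) -> str:
        """Translate a docID back to its URL when results are displayed."""
        return self._url_by_id[doc_id]


ids = DocIdMap()
first = ids.get_id("https://example.com/a")
second = ids.get_id("https://example.com/b")
again = ids.get_id("https://example.com/a")
print(first, second, again, ids.get_url(second))
```

Each URL string is stored once in the mapping, while every posting carries only the integer.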

New cards

How can we build indexes larger than available memory?

A. Offload in-memory hashtable to files periodically and merge

B. One index per file

C. Never possible

A) We can offload the in-memory hashtable to files, every so often, and merge those files in the end.
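
A hedged sketch of the offload-and-merge idea. The file format (one JSON line per term, written in sorted term order), the threshold parameter, and all function names are assumptions. For easy inspection the merged result is returned as a dictionary; a real indexer would stream it to a final file instead.

```python
import heapq
import json
import os
import tempfile
from collections import defaultdict
from itertools import groupby


def spill(index, directory, part_no):
    """Write the in-memory partial index to disk, one sorted term per line."""
    path = os.path.join(directory, f"part{part_no}.jsonl")
    with open(path, "w", encoding="utf-8") as f:
        for term in sorted(index):
            f.write(json.dumps([term, index[term]]) + "\n")
    return path


def read_partial(path):
    """Stream (term, postings) pairs back from one partial index file."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            term, postings = json.loads(line)
            yield term, postings


def build_index(docs, directory, max_terms_in_memory):
    index, paths = defaultdict(list), []
    for doc_id, text in docs:
        for term in set(text.lower().split()):
            index[term].append(doc_id)
        if len(index) >= max_terms_in_memory:      # memory budget reached
            paths.append(spill(index, directory, len(paths)))
            index = defaultdict(list)
    if index:
        paths.append(spill(index, directory, len(paths)))
    # Each file is sorted by term, so a k-way merge visits terms in order.
    streams = heapq.merge(*(read_partial(p) for p in paths), key=lambda pair: pair[0])
    merged = {}
    for term, group in groupby(streams, key=lambda pair: pair[0]):
        merged[term] = sorted(d for _, postings in group for d in postings)
    return merged


with tempfile.TemporaryDirectory() as tmp:
    docs = [(0, "to be or not"), (1, "not to worry"), (2, "be happy")]
    result = build_index(docs, tmp, max_terms_in_memory=3)
print(result)
```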

New cards

Users search for "U.S.A." and "USA" expecting same results. Best strategy?

A. Keep different

B. Normalization to common form

C. Remove punctuation during indexing only

D. Stop word removal

B) Use normalization to map "U.S.A." and "USA" to a common form in index and query

New cards

Ranking algorithm uses weighted features (tf, length, links). What is the index doing?

A. Stores ranking scores directly

B. Replaces ranking model

C. Unnecessary

D. Stores docIDs and tf to efficiently find candidate docs

D) The index stores lists of doc IDs along with term frequencies, so that the ranking function can efficiently find candidate docs

New cards

E-commerce search for: t shirts, t-shirts, tshirt. Best strategy?

A. Hyphens as separators

B. Strip hyphens and combine tokens (map to tshirt)

C. Remove hyphenated tokens

D. Keep separate

B) Strip hyphens and combine tokens, mapping all forms to tshirt

New cards

Support phrase/proximity queries like "tropical fish" efficiently. Best index upgrade?

A. Global word count table

B. Term to document count mapping

C. Store word importance

D. Store word position within document

D) Extend the postings to store word position within each document
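
A minimal sketch of a positional index and a phrase check built on it. Tokenization is plain lower-casing and whitespace splitting, and the documents and function names are made up for illustration.

```python
from collections import defaultdict


def build_positional_index(docs):
    """Map term -> {docID -> [positions]} for every term occurrence."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for position, term in enumerate(text.lower().split()):
            index[term][doc_id].append(position)
    return index


def phrase_match(index, phrase):
    """Return docIDs in which the words of `phrase` occur consecutively."""
    terms = phrase.lower().split()
    if not terms or any(term not in index for term in terms):
        return []
    # Candidate documents must contain every term of the phrase.
    candidates = set(index[terms[0]])
    for term in terms[1:]:
        candidates &= set(index[term])
    hits = []
    for doc_id in sorted(candidates):
        # The k-th phrase word must appear k positions after the first word.
        if any(
            all(start + k in index[term][doc_id] for k, term in enumerate(terms))
            for start in index[terms[0]][doc_id]
        ):
            hits.append(doc_id)
    return hits


docs = {
    1: "tropical fish tank",
    2: "fish from tropical waters",
    3: "rare tropical fish",
}
index = build_positional_index(docs)
matches = phrase_match(index, "tropical fish")
print(matches)
```

Document 2 contains both words but not adjacently, so only the stored positions allow it to be rejected.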

New cards

Legal search for "to be or not to be". System removes stopwords. Consequence?

A. Likely fail to find exact phrase

B. Work fine as stopwords ignored in Law

C. Find only documents where stopwords repeat

D. Automatic synonym rewrite

A) The system will likely fail to find documents that contain this exact phrase

New cards

Tokenizer: I.B.M. -> ibm, cs.uci.edu -> csuciedu. Query keeps periods. Search: PhD cs.uci.edu. Result?

A. PhD matches (case-sensitive)

B. Matches URL but not abbreviations

C. No impact

D. Matches PhD but fails to find documents with cs.uci.edu

D) The engine succeeds in matching documents that contain PhD but fails to find documents that contain cs.uci.edu

New cards

Why is Boolean retrieval well-suited for professional search (e.g. WestLaw)?

A. Returns fewer documents

B. Ranks according to lawyer expectations

C. Gives exact inclusion/exclusion control and proximity

D. Interprets natural language

C) Because Boolean queries give exact inclusion/exclusion control over terms and even proximity

New cards

Pseudo code: BuildIndex(D) using HashTable and n=n+1. Mark all correct:

A. Processes all documents

B. Small batches

C. Index on disk

D. Won't work for sufficiently large data

E. Terms in lists

A, D

New cards

Editors see "comput" or "argu" in logs. Want recall for argue/argued/argues but naturalness. Best change?

A. Keep non-word stems

B. No stemming

C. Hybrid dictionary-based stemmer (Krovetz)

D. Only remove final 's'

C) Replace the algorithmic stemmer with a hybrid dictionary-based stemmer like Krovetz

New cards

Postings (term: docID: pos1, pos2...)

not: 10: 48, 192; 15: 14; 22: 7, 79, 150, 306...

to: 10: 4, 18; 15: 22; 22: 76, 80, 112, 540...

Query: "to be or not to be". Best candidate?

A. 10 B. 15 C. 22 D. 30

C) 22

New cards

Boolean system: "standard user dlink 650" too many results, slightly longer variant zero results. Best fix for novice users?

A. Switch AND to OR automatically

B. Ranked retrieval model with weighted document vectors

C. Silently drop terms

D. Require explicit operators

B) Replace the Boolean model with a ranked retrieval model that scores free-text queries using term-weighted document vectors.

New cards

Query "capricious person". "person" is common, "capricious" rare. d1 has "person" 3x. d2 has "person" 1x, "capricious" 1x. Effect of idf?

A. d1 stays above d2

B. d2 likely moves above d1

C. Same score

D. Both pushed to bottom

B) d2 is more likely to move above d1, because the rare term "capricious" gets a much higher idf weight.
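
A small worked example of the effect. The collection size and document frequencies are invented numbers, and the score is a plain sum of raw tf (optionally times idf) over the query terms, chosen only to keep the arithmetic visible.

```python
import math

# Hypothetical collection statistics, not taken from the question.
N_DOCS = 1_000_000
DF = {"person": 100_000, "capricious": 10}

D1 = {"person": 3}
D2 = {"person": 1, "capricious": 1}
QUERY = ["capricious", "person"]


def idf(term: str) -> float:
    """Inverse document frequency: log10(N / df)."""
    return math.log10(N_DOCS / DF[term])


def score(doc: dict, use_idf: bool) -> float:
    """Sum raw tf over the query terms, weighted by idf when requested."""
    return sum(doc.get(t, 0) * (idf(t) if use_idf else 1.0) for t in QUERY)


print("tf only :", score(D1, False), score(D2, False))
print("tf * idf:", score(D1, True), score(D2, True))
```

With tf alone d1 scores higher; once idf is applied, the single occurrence of the rare term outweighs three occurrences of the common one.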

New cards

Why does Boolean retrieval frustrate average users compared to Ranked retrieval?

A. More processing power

B. Zero results or thousands of unorganized results

C. Slower

D. Requires vector input

B) Boolean retrieval may return either zero results or thousands of unorganized results.

New cards

Long articles repeat terms more often. Euclidean distance ranks them below short pieces with same distribution. How to fix without penalizing length?

A. Cosine similarity on length-normalized vectors

B. Drop rare terms

C. Normalize only query vector

D. Jaccard similarity

A) Switch from Euclidean distance to cosine similarity computed on length-normalized document and query vectors.

New cards

MapReduce for inverted index. Which behavior is correct?

A. Mapper:

B. Mapper:

C. Mapper:

D. Mapper:

D) Mappers emit ⟨ term, docId ⟩ or ⟨ term, docId:position ⟩ pairs; reducers gather all values for the same term and write the postings list for that term.
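
The behavior in the answer can be simulated in a single process. This is only a sketch of the data flow, with hypothetical function names and toy documents; a real MapReduce framework performs the grouping (shuffle) step across machines.

```python
from collections import defaultdict


def mapper(doc_id, text):
    """Emit one (term, (docID, position)) pair per word occurrence."""
    for position, term in enumerate(text.lower().split()):
        yield term, (doc_id, position)


def reducer(term, values):
    """Gather all values for one term into its sorted postings list."""
    return term, sorted(values)


def run(docs):
    # Shuffle step: group every mapper output by its key (the term).
    grouped = defaultdict(list)
    for doc_id, text in docs.items():
        for term, value in mapper(doc_id, text):
            grouped[term].append(value)
    return dict(reducer(term, values) for term, values in sorted(grouped.items()))


index = run({1: "new home sales", 2: "home sales rise"})
print(index)
```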

New cards

500 GB index, 5 GB RAM. Best architecture for fast response per "Index the Index"?

A. Virtual memory/paging

B. MapReduce for every request

C. In-memory lexicon pointing to byte offsets on disk

D. Linear scan of inverted index

C) Keep a small lexicon, or term-to-offset dictionary, in memory that points to the specific byte location of postings lists on the disk.
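
A hedged sketch of the lexicon idea: postings live in a file, and only a term-to-byte-offset dictionary stays in memory. The file layout (one JSON-encoded postings list per line) and the function names are assumptions made for illustration.

```python
import json
import os
import tempfile


def write_postings(index, path):
    """Write each postings list on its own line and record where it starts."""
    lexicon = {}
    with open(path, "wb") as f:
        for term in sorted(index):
            lexicon[term] = f.tell()                      # byte offset of this term
            f.write(json.dumps(index[term]).encode("utf-8") + b"\n")
    return lexicon


def read_postings(term, lexicon, path):
    """Seek straight to the term's postings instead of scanning the file."""
    if term not in lexicon:
        return []
    with open(path, "rb") as f:
        f.seek(lexicon[term])
        return json.loads(f.readline())


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "postings.bin")
    lexicon = write_postings({"fish": [1, 3, 8], "tropical": [3, 8]}, path)
    fish_postings = read_postings("fish", lexicon, path)
    missing = read_postings("shark", lexicon, path)
print(lexicon, fish_postings, missing)
```

A query then costs one dictionary lookup plus one seek and one read per term, regardless of how large the postings file is.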

New cards

Query "climate change". Doc A mentions once. Doc B mentions 40 times. Same doc length. Jaccard Coefficient outcome?

A. B scores 40x higher

B. B slightly higher

C. Identical scores (Jaccard ignores tf)

D. A higher (concise)

C) Both documents will have identical Jaccard scores, as Jaccard ignores term frequency.

New cards

Vector Space Model: short summary and long textbook, same word distribution. Euclidean vs Cosine similarity?

A. Euclidean similar, Cosine different

B. Euclidean different, Cosine identical

C. Both identical

D. Cosine not applicable

B) Euclidean distance will show them as very different, while Cosine similarity will show them as identical.
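
A small numeric check, using made-up term counts in which the long document is exactly ten times the short one:

```python
import math


def euclidean(u, v):
    """Straight-line distance between two term-count vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def cosine(u, v):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


summary = [1, 2, 0, 3]                    # hypothetical term counts
textbook = [10 * x for x in summary]      # same distribution, ten times longer

distance = euclidean(summary, textbook)
similarity = cosine(summary, textbook)
print(distance, similarity)
```

The two vectors point in the same direction, so the angle between them is zero and the cosine is 1 (up to floating-point rounding), while the Euclidean distance grows with the difference in length.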

New cards

Query "The Arachnocentric Universe." Standard TF-IDF weighting. Which term contributes most to rank?

A. "The" (high tf)

B. "Universe" (noun weight)

C. "Arachnocentric" (low df)

D. All equal

C) "Arachnocentric", because it likely has the lowest Document Frequency.

New cards

D2 is D1 appended to itself (tf in D2 = 2x tf in D1). Length normalization applied. True statement?

A. D2 has larger weights

B. Euclidean distance(q, D1) = Euclidean distance(q, D2)

C. Cosine similarity decreases

D. Length normalization favors long docs

B) For any query vector q, the Euclidean distance between q and D1 is equal to the Euclidean distance between q and D2.

New cards

D1: "Red Portuguese wine is the best wine". Word "wine" count?

A. 1 B. 2 C. 3 D. 4

B) 2

New cards

Weighted term frequency of "wine" in D3 (count=2)?

A. 2 B. log(2) C. 2 / log(2) D. 1 + log(2)

D) 1 + log(2)

New cards

3 Documents: D1, D2, D3 all contain "Portuguese". Document frequency of "Portuguese"?

A. 1 B. 3/log(3) C. log(3/3) D. 3

D) 3

New cards

3 Documents: D1, D2, D3 all contain "Portuguese". Inverse document frequency of "Portuguese"?

A. 1 B. 3/log(3) C. log(3/3) D. 3

C) log(3/3)
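
The two weights from these questions can be computed as below. The sketch assumes base-10 logarithms, which the cards do not state explicitly.

```python
import math


def log_tf(count: int) -> float:
    """Log-weighted term frequency: 1 + log10(tf) for tf > 0, else 0."""
    return 1 + math.log10(count) if count > 0 else 0.0


def idf(n_docs: int, df: int) -> float:
    """Inverse document frequency: log10(N / df)."""
    return math.log10(n_docs / df)


wine_weight = log_tf(2)          # "wine" occurs twice
portuguese_idf = idf(3, 3)       # "Portuguese" occurs in all 3 documents
print(wine_weight, portuguese_idf)
```

A term that appears in every document gets an idf of log(1) = 0, so it contributes nothing to a tf-idf score.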

New cards

Jaccard similarity between Q: "tasting wine" and D3: "Portuguese wine also includes white wine"?

A. 1/6 B. 1/7 C. 1/8 D. 2/7

A) 1/6
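
The computation can be reproduced with sets, assuming lower-casing and whitespace tokenization. Repeated words collapse into one set element, which is why "wine" counts once even though it appears twice in D3.

```python
from fractions import Fraction


def jaccard(a: str, b: str) -> Fraction:
    """Jaccard coefficient |A ∩ B| / |A ∪ B| over the sets of distinct words."""
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    union = set_a | set_b
    return Fraction(len(set_a & set_b), len(union)) if union else Fraction(0)


similarity = jaccard("tasting wine", "Portuguese wine also includes white wine")
print(similarity)
```

The intersection is {wine} and the union has six distinct words, giving 1/6.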

New cards

SMART notation "lnc.ltc". Select all true components:

A. Doc vector uses log tf

B. Doc vector uses no idf

C. Query vector uses cosine normalization

D. Query vector uses idf

E. Query vector uses log tf

A, B, C, D, E
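
A hedged sketch of lnc.ltc weighting, assuming base-10 logarithms. The document frequency used in the example is a made-up number, and the function names are hypothetical.

```python
import math
from collections import Counter


def cosine_normalize(weights: dict) -> dict:
    """Divide every weight by the vector's Euclidean length (the 'c' step)."""
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()} if norm else dict(weights)


def lnc_document(tokens):
    """Document side: log tf (l), no idf (n), cosine normalization (c)."""
    tf = Counter(tokens)
    return cosine_normalize({t: 1 + math.log10(c) for t, c in tf.items()})


def ltc_query(tokens, df, n_docs):
    """Query side: log tf (l), idf (t), cosine normalization (c)."""
    tf = Counter(tokens)
    weights = {
        t: (1 + math.log10(c)) * math.log10(n_docs / df[t])
        for t, c in tf.items()
        if df.get(t)                     # skip terms absent from the collection
    }
    return cosine_normalize(weights)


def score(doc_vec, query_vec):
    """Dot product of the two normalized vectors."""
    return sum(w * doc_vec.get(t, 0.0) for t, w in query_vec.items())


doc_vec = lnc_document("red portuguese wine is the best wine".split())
query_vec = ltc_query(["wine"], df={"wine": 2}, n_docs=3)   # hypothetical df
result = score(doc_vec, query_vec)
print(result)
```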

New cards

Vector Space Model representation. Select all true:

A. Terms are axes

B. Documents are vectors

C. Space is low-dimensional (2-3)

D. Vectors are sparse

E. Terms are vectors, docs are axes

A, B, D

New cards

Why is Jaccard insufficient for ranked retrieval? (Select all)

A. Ignores term frequency

B. Cannot calculate overlap

C. Fails to account for term rarity

D. Computationally impossible

E. Loses information on word usage frequency

A, C, E

New cards

Why store only weighted tf in index and use idf in query?

A. Better ranking

B. Idf requires cosine distance

C. Adding idf to index requires recalculating scores at end of processing collection

D. Idf unknown until query

C) Because adding the idf component to the index would require recalculating all the tf-related scores in all the postings at the end of processing all the documents.

New cards

Why use cosine of angle instead of the angle for similarity? (Select all)

A. Easier to calculate

B. Varies 0 to 1 vs 0 to π/2

C. Cosine same shape as log

D. Captures rare terms better

A, B

New cards

Boolean retrieval: complexity of best 3-way merge for sorted lists n, m, q?

A. O(n + m + q)

B. O(n × m × q)

C. O(n² + m² + q²)

D. O(n)

A) O(n + m + q)

New cards

Reading 1 MB sequentially from disk is faster than reading 1 MB sequentially from RAM. A. True B. False

B) False

New cards

What is a posting?

A. Inverted index

B. Data structure describing context of word occurrence in a document

C. Web page retrieved by crawler

B) It's a data structure that describes the context of occurrence of a word in a document.

New cards

Boolean Retrieval truths: (Select all)

A. Effectiveness depends on user

B. Simple queries work well

C. Complex queries difficult to implement

D. Inefficient (processes all docs)

A, C

New cards

What is an inverted index?

A. Map with terms as keys and postings lists as values

B. Term-document matrix

C. Map with documents as keys

A) It's a map with terms as keys and postings lists as values.

New cards

Main problem of term-document matrix for large collections?

A. Slow term search

B. Inefficient memory use

C. Slow document search

B) It is an inefficient use of memory.

New cards

Complexity of indexer associating URL with integer in postings for N documents?

A. O(1) B. O(N) C. O(N²) D. O(N³)

C) O(N²)

New cards

Word most likely on standard stopword list?

A. algorithm B. data C. the D. scraper

C) "the"

New cards

Matrix search: information AND retrieval. Result?

A. page 2 B. none C. page 3 D. page 1 and 4

C) page 3

New cards

Minimum information in a posting?

A. Word count B. Doc id C. Word position D. URL

B) The document id

New cards

S1: May the force... Champions. S2: Live long... Champion. S3: Champions are not... Postings for "Champions"?

A. S1, S2, S3 B. S1, S3 C. S2, S3 D. S2

B) S1, S3

New cards

S1: May the force be with you, Champions. S2: Live long and prosper, Champion. S3: Champions are not made in gyms, they are made from something deep inside of them. Terms count?

A. 3 B. 23 C. 24 D. 25

C) 24
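
The count can be checked by building a small inverted index over the three sentences. The sketch assumes lower-casing, punctuation removal, and no stemming, so "Champion" and "Champions" remain separate terms.

```python
import re
from collections import defaultdict

SENTENCES = {
    "S1": "May the force be with you, Champions.",
    "S2": "Live long and prosper, Champion.",
    "S3": "Champions are not made in gyms, they are made from "
          "something deep inside of them.",
}


def build_inverted_index(docs):
    """Map each distinct term to the sorted list of documents containing it."""
    index = defaultdict(list)
    for doc_id in sorted(docs):
        terms = set(re.findall(r"[a-z0-9]+", docs[doc_id].lower()))
        for term in terms:
            index[term].append(doc_id)
    return dict(index)


index = build_inverted_index(SENTENCES)
print(len(index), index["champions"])
```

S1 contributes 7 terms, S2 adds 5 new ones, and S3 adds 12 more (its "champions" is already counted), for 24 in total.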

New cards

HTML document with header, bold, and paragraph text. Which words most important?

A. Header words B. Bold words C. Paragraph words D. Header and Bold words

D) Header and Bold words

New cards

Boolean retrieval: unsorted lists. Benefit from sorting before merging?

A. Same B. Benefit from sorting first C. Worse

B) It will benefit from sorting first

New cards

Tokenization is always performed after removing HTML tags. A. False B. True

A) False

New cards

Simple in-memory indexer pseudo code. Mark all correct:

A. Processes all documents

B. Small batches

C. Inverted index on disk

D. Won't work for large data

E. Terms in lists

A, D

New cards

Can we build indexes larger than RAM? Best way?

A. Offload hashtable to files and merge

B. Write directly to file bypassing memory

C. Never possible

A) Offload the in-memory hashtable to files every so often, then merge those files at the end.

New cards

Role of a positional index in Boolean systems?

A. Stores words/positions for phrase queries

B. List of unique terms

C. Improve speed of non-Boolean

D. Single-word efficiency

A) It stores words and their positions to ensure phrase queries can be resolved accurately.

New cards

Using a 2-gram index can decrease retrieval performance for large text corpus. A. True B. False

A) True

New cards

Implementation for word importance (headers, bold, normal). Select all reasonable:

A. Three separate lists per term

B. Field in postings classifying type

C. Extent lists

D. HTML tags in terms

E. Three separate indexes

A, B, C, E