What is information retrieval?
It's a field concerned with the structure, analysis, organization, storage, searching, and retrieval of information
Which of the following are examples of work within the Information Retrieval field?
web search engines
filtering for documents of interest
the design of a relational database for a library
classifying books into categories
automatically answering customer support questions
web search engines,
filtering for documents of interest,
classifying books into categories,
automatically answering customer support questions
web search
gathering and finding information on the web
vertical search
gathering and finding information focused on a specific topic
desktop search
gathering and finding information on a single computer
peer-to-peer search
gathering and finding information on a network of independent nodes
enterprise search
gathering and finding information within a company's network
Find out what is going on today at UCI?
relational database vs search engine
search engine
Find all female students whose last name is Smith
relational database vs search engine
relational database
Find how to split words in Python
relational database vs search engine
search engine
What is the weather like in Bali?
relational database vs search engine
search engine
What were the temperature and humidity values registered in Crystal Cove between 9/1/2019 and 10/1/2019?
relational database vs search engine
relational database
What contributes to the relevance of a document with respect to a query in the context of a search engine?
prior queries made by the same user, the author of the document, the geographic location of the person who's querying, the popularity of the document, textual similarity, the geographic origin of the document
True or False? The right side of the architecture pertains to processes that are done well before any query is done.
false
How is advertisement integrated with web search?
The user's query goes both to the search engine and the ad engine; the search engine retrieves the most relevant results, the ad engine uses an auction system on the query words.
Cost Per Mille (CPM)
Cost of showing the ad 1,000 times
Cost Per Click (CPC)
Cost charged each time a user clicks on the ad after it is shown to them
The following is the syntax of a URL:
A://B/C?D#E
A: scheme
B: authority
C: path
D: query
E: fragment
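As a quick check of the five components, here is a minimal sketch using Python's standard `urllib.parse.urlsplit`. The URL itself is a made-up example (the query and fragment values are assumptions chosen only so that every component is non-empty).

```python
from urllib.parse import urlsplit

# Hypothetical URL, chosen only so that all five components are present.
url = "http://www.ics.uci.edu/~lopes/index.html?term=fall#teaching"
parts = urlsplit(url)

print(parts.scheme)    # A: scheme
print(parts.netloc)    # B: authority
print(parts.path)      # C: path (urlsplit keeps the leading "/")
print(parts.query)     # D: query
print(parts.fragment)  # E: fragment
```

Note that `urlsplit` reports the path with its leading slash, whereas the template A://B/C?D#E writes that slash as a separator.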
Which of these are Uniform Resource Identifiers (URIs)?
ISBN 0-486-2777-3
rmi://filter.uci.edu
"Pride and Prejudice"
http://www.ics.uci.edu/~lopes
ISBN 0-486-2777-3
rmi://filter.uci.edu
http://www.ics.uci.edu/~lopes
Besides web crawling, what other ways are there to obtain data from Web sites?
targeted downloads of specific URLs
Downloads from the Library of Congress
Data dumps provided by companies and organizations
Web APIs provided by certain web sites
targeted downloads of specific URLs
Data dumps provided by companies and organizations
Web APIs provided by certain web sites
Consider the following robots.txt file:
User-agent: *
Disallow: /foo
Disallow: /bar
User-agent: Googlebot
Disallow: /baz/a
According to this, the Googlebot is
Allowed to crawl /foo and /bar
Not allowed to crawl either /foo or /bar
Not allowed to crawl /baz/a
Allowed to crawl /baz/a
Allowed to crawl /foo and /bar
Not allowed to crawl /baz/a
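The answer can be checked with Python's standard `urllib.robotparser`, which applies the group that names the crawler and falls back to the `*` group only when no specific group matches. This is a sketch of one parser's behavior; `SomeOtherBot` is a made-up agent name used for contrast.

```python
from urllib.robotparser import RobotFileParser

# The robots.txt file from the question, verbatim.
ROBOTS_TXT = """\
User-agent: *
Disallow: /foo
Disallow: /bar
User-agent: Googlebot
Disallow: /baz/a
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# "SomeOtherBot" is a hypothetical crawler that only matches the "*" group.
for agent in ("Googlebot", "SomeOtherBot"):
    for path in ("/foo", "/bar", "/baz/a"):
        print(agent, path, parser.can_fetch(agent, path))
```

Googlebot matches its own group, so only /baz/a is off limits for it; the /foo and /bar rules apply to every other crawler.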
True or False?
"If something is on the Web, a Web crawler has the right to get it"
False
Hw1 (tokenization) pertains mostly to which of the following?
index
text acquisition
text transformation
text transformation
How large is the web, measured in number of hosts?
O(quadrillion)
O(million)
O(billion)
O(trillion)
O(billion)
Consider the sequence of characters "hello!!"
Is this a token?
Maybe; it depends on your definition of a token
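To make the "it depends" concrete, here is a small sketch with two possible token definitions. Both definitions are illustrative assumptions, not the definition used in any particular assignment.

```python
import re

text = "hello!!"

# Definition 1 (assumed): a token is a maximal run of non-whitespace characters.
whitespace_tokens = text.split()

# Definition 2 (assumed): a token is a maximal run of ASCII letters or digits.
alnum_tokens = re.findall(r"[A-Za-z0-9]+", text)

print(whitespace_tokens)
print(alnum_tokens)
```

Under the first definition "hello!!" is itself a token; under the second, only "hello" is.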
Should crawlers wait between requests to the same web site?
yes
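One way a crawler could enforce such a wait is to remember, per host, the earliest time the next request is allowed. The sketch below is an assumption-laden illustration: the class name `PolitenessGate`, the 0.5 second delay, and the example URLs are all hypothetical.

```python
import time
from urllib.parse import urlsplit


class PolitenessGate:
    """Tracks, per host, when the next request may be sent (hypothetical helper)."""

    def __init__(self, delay_seconds=0.5, clock=time.monotonic):
        self.delay = delay_seconds      # assumed delay; real sites may ask for more
        self.clock = clock              # injectable so the logic can be tested
        self.next_allowed = {}          # host -> earliest time of the next request

    def wait_time(self, url):
        """Return how long to sleep before fetching url, and book that slot."""
        host = urlsplit(url).netloc
        now = self.clock()
        ready_at = max(now, self.next_allowed.get(host, now))
        self.next_allowed[host] = ready_at + self.delay
        return ready_at - now


gate = PolitenessGate()
for url in ("http://a.example/1", "http://a.example/2", "http://b.example/1"):
    time.sleep(gate.wait_time(url))  # a real crawler would fetch url after sleeping
```

Requests to different hosts do not delay each other; only back-to-back requests to the same host are spaced out.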
All the crawler traps that exist on the web are deliberately created
False
What is bad about crawler traps?
They are hard to detect, they keep web crawlers busy for no good reason, and they prevent or delay crawlers from going to other sites
What is the frontier of a web crawler?
It's the set of URLs that have been seen but not yet crawled
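A frontier can be sketched as a queue of pending URLs plus a set of every URL seen so far, so that a URL is never queued twice. The class and method names below are hypothetical, and the FIFO order is just one possible crawl policy.

```python
from collections import deque


class Frontier:
    """URLs that have been seen but not yet crawled (hypothetical minimal version)."""

    def __init__(self, seeds):
        self.seen = set()     # every URL ever added, crawled or not
        self.queue = deque()  # URLs still waiting to be crawled
        for url in seeds:
            self.add(url)

    def add(self, url):
        """Queue url unless it has been seen before."""
        if url not in self.seen:
            self.seen.add(url)
            self.queue.append(url)

    def next_url(self):
        """Return the next URL to crawl, or None when the frontier is empty."""
        return self.queue.popleft() if self.queue else None

    def __len__(self):
        return len(self.queue)


frontier = Frontier(["http://a.example/"])
frontier.add("http://a.example/x")
frontier.add("http://a.example/")  # already seen, so it is ignored
print(len(frontier))
```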
What are HTTP status codes 2xx?
Page retrieved successfully
What are HTTP status codes 3xx?
Redirection
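A crawler usually branches on the first digit of the status code. The helper below is a hypothetical sketch; the 1xx, 4xx, and 5xx labels come from the HTTP specification rather than from these cards.

```python
def status_class(code):
    """Map an HTTP status code to the name of its class (hypothetical helper)."""
    classes = {
        1: "informational",
        2: "success",        # page retrieved successfully
        3: "redirection",
        4: "client error",
        5: "server error",
    }
    return classes.get(code // 100, "unknown")


for code in (200, 301, 404):
    print(code, status_class(code))
```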
A normal crawler fetches pages directly from the web servers. However, your crawler used a cache server to fetch pages. Why?
Because having more than one hundred crawlers fetching pages directly could overload the ICS network if the crawlers are not properly developed
Which of the following methods can you DIRECTLY use to detect pages or documents that are near duplicates?
Simhash
Fingerprint
Cyclic redundancy check
document slope curve
Blake3
Simhash
Fingerprint
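For intuition, here is a minimal Simhash sketch: each word votes on every bit of a 64-bit fingerprint, weighted by its frequency, and near-duplicate documents end up with fingerprints that differ in few bits. The tokenization rule, the choice of MD5 as the per-word hash, and the 64-bit width are assumptions made for illustration.

```python
import hashlib
import re
from collections import Counter

BITS = 64  # assumed fingerprint width


def simhash(text):
    """Return a 64-bit Simhash fingerprint of text (illustrative sketch)."""
    weights = Counter(re.findall(r"[a-z0-9]+", text.lower()))
    totals = [0] * BITS
    for token, weight in weights.items():
        digest = hashlib.md5(token.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")  # first 64 bits of the word hash
        for i in range(BITS):
            # A 1 bit adds the word's weight, a 0 bit subtracts it.
            totals[i] += weight if (h >> i) & 1 else -weight
    fingerprint = 0
    for i, total in enumerate(totals):
        if total > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


a = simhash("To be or not to be, this is the question.")
b = simhash("To be or not to be, that is the question.")
print(hamming_distance(a, b))
```

Two pages would then be flagged as near duplicates when the distance falls below a chosen threshold; the threshold is a tuning decision not specified here.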
The deep web is a large part of the web that has only encrypted content, and thus it is neither crawled nor indexed by normal search engines. T/F
False
Web crawlers can and should send hundreds of requests per second to a web site, because otherwise they will take a very long time to crawl. T/F
False
What is the main problem of using a term-document matrix for searching in large collections of documents?
It is an inefficient use of memory: the matrix is sparse, so most of its cells are zero
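A back-of-the-envelope calculation shows why memory is the concern with a term-document matrix. All three numbers below are hypothetical, picked only to illustrate the order of magnitude.

```python
num_docs = 1_000_000              # hypothetical collection size
vocab_size = 500_000              # hypothetical number of distinct terms
avg_distinct_terms_per_doc = 200  # hypothetical

matrix_cells = num_docs * vocab_size                   # one cell per (term, document)
nonzero_cells = num_docs * avg_distinct_terms_per_doc  # cells that carry information
fill_ratio = nonzero_cells / matrix_cells

print(matrix_cells, nonzero_cells, fill_ratio)
```

Under these assumptions the matrix reserves 500 billion cells to hold 200 million useful entries, which is the waste that storing only the non-zero entries avoids.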
Should crawlers hit the same web site as fast as possible as a strategy to crawl faster?
No
What is an inverted index?
It's a map with terms as keys and postings lists as values
What is the minimum information in a posting?
The document ID
Consider the following sentences (each sentence is to be considered a different document):
S1: I tried searching for this error but got me nowhere.
S2: To be or not to be, this is the question.
S3: This seems to do the trick.
What are the postings for the term "to"?
S2, S3
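The postings can be reproduced with a small inverted index built over the three sentences. This is a minimal sketch: document IDs 1 to 3 stand for S1 to S3, and the tokenizer (lowercase, runs of letters and digits) is an assumption.

```python
import re
from collections import defaultdict

docs = {
    1: "I tried searching for this error but got me nowhere.",  # S1
    2: "To be or not to be, this is the question.",             # S2
    3: "This seems to do the trick.",                           # S3
}


def build_index(docs):
    """Map each term to the sorted list of IDs of the documents that contain it."""
    index = defaultdict(list)
    for doc_id in sorted(docs):
        terms = set(re.findall(r"[a-z0-9]+", docs[doc_id].lower()))
        for term in terms:
            index[term].append(doc_id)
    return index


index = build_index(docs)
print(index["to"])  # [2, 3], i.e. S2 and S3
```

Each posting here holds only the document ID, the minimum a posting needs.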
Reading 1 MB sequentially from memory is faster than reading 1 MB sequentially from disk.
True
Reading 1 MB sequentially from memory is 2 times faster than reading 1 MB sequentially from disk.
False
In Boolean retrieval, a query that ANDs three terms results in having to intersect three lists of postings. Assume the three lists are of size n, m, and q, respectively, each being very large.
If you keep the lists unsorted, what best approximates the complexity of a 3-way intersection algorithm?
O(nmq)
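The O(nmq) figure corresponds to comparing every triple of postings. For contrast, the sketch below also shows the standard merge over sorted lists, which runs in O(n + m + q). Both function names are hypothetical, and each list is assumed to contain no repeated document IDs.

```python
def intersect_unsorted(a, b, c):
    """Triple nested loop over unsorted lists: O(n * m * q) comparisons."""
    result = []
    for x in a:
        for y in b:
            for z in c:
                if x == y == z:
                    result.append(x)
    return result


def intersect_sorted(a, b, c):
    """Linear merge over three sorted lists: O(n + m + q)."""
    i = j = k = 0
    result = []
    while i < len(a) and j < len(b) and k < len(c):
        if a[i] == b[j] == c[k]:
            result.append(a[i])
            i, j, k = i + 1, j + 1, k + 1
        else:
            # Advance every pointer that lags behind the largest current value.
            largest = max(a[i], b[j], c[k])
            if a[i] < largest:
                i += 1
            if b[j] < largest:
                j += 1
            if c[k] < largest:
                k += 1
    return result


print(intersect_unsorted([9, 1, 7, 3, 5], [10, 3, 9, 5, 4], [11, 9, 0, 3]))
print(intersect_sorted([1, 3, 5, 7, 9], [3, 4, 5, 9, 10], [0, 3, 9, 11]))
```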