More Unethical SEO
-Keyword Stuffing (e.g. misleading meta tags, repeating words, hiding keywords)
-Cloaking (Serving different content to a spider than to a browser, e.g. via redirects)
-Link exchanges (pages link to each other for no real reason)
Spam
2 types:
Link spamming- Bots that search for blogs and leave comments with links
Clicker bots- Bots that issue queries and "click" on targeted query results
Fighting spam is rapidly advancing (on BOTH sides)
Webmaster Guidelines
Search engines have SEO policies - What is allowed and not allowed
They must not be ignored - Once a site is blacklisted by a search engine, it will virtually disappear from the Web
Uniform Resource Identifiers (URI)
A string of characters used to identify a resource
Name what the resource is
URLs also fall under this (and also specify where the resource is)
Anatomy of a URL
scheme:// domain:port / path?query_string#fragment_id
scheme and domain are mandatory, everything else is optional
subdomains are prepended in front of the domain name, like this: calendar.ics.uci.edu (uci.edu is the domain)
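A quick way to see these parts, using Python's standard library (the URL itself is made up for illustration):

    from urllib.parse import urlparse

    url = "https://calendar.ics.uci.edu:443/search?q=crawler#results"
    parts = urlparse(url)

    print(parts.scheme)    # 'https'
    print(parts.hostname)  # 'calendar.ics.uci.edu'
    print(parts.port)      # 443
    print(parts.path)      # '/search'
    print(parts.query)     # 'q=crawler'
    print(parts.fragment)  # 'results'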
How to acquire data?
Data dumps
URL downloads (Find out the URLs of specific resources + Run a downloader that takes that list and downloads the resources)
Web APIs
Web Crawling (should be treated as a last resort)
Knuth's definition of algorithm
Finiteness : does it end after a finite number of steps?
Definiteness : are the steps rigorously and unambiguously specified?
Input : does it have zero or more inputs?
Output : does it have one or more outputs related to the inputs?
Effectiveness : are the operations basic enough for someone to perform them using pencil and paper only?
Basic Crawl Algorithm
Initialize a queue of URLs (seeds)
Repeat until no more URLs in queue:
- Get one URL from the queue
- If the page can be crawled, fetch associated page
- Store representation of page
- Extract URLs from page and add them to the queue
Queue = "frontier"
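A minimal Python sketch of this loop (can_crawl and store are placeholders for real checks and storage, and a real crawler would use an HTML parser instead of a regex):

    from collections import deque
    from urllib.parse import urljoin
    from urllib.request import urlopen
    import re

    def can_crawl(url):
        # placeholder: a real crawler checks robots.txt, URL scheme, traps, ...
        return url.startswith("http")

    def store(url, page):
        # placeholder: a real crawler stores a representation of the page
        print("fetched", url, len(page), "characters")

    def basic_crawl(seeds):
        frontier = deque(seeds)          # the queue of URLs is the "frontier"
        seen = set(seeds)                # don't queue the same URL twice
        while frontier:                  # repeat until no more URLs in queue
            url = frontier.popleft()     # get one URL from the queue
            if not can_crawl(url):
                continue
            try:
                page = urlopen(url, timeout=10).read().decode("utf-8", errors="replace")
            except Exception:            # unresponsive servers, bad URLs, ...
                continue
            store(url, page)
            # naive link extraction: find href attributes and add new URLs
            for link in re.findall(r'href="([^"]+)"', page):
                absolute = urljoin(url, link)
                if absolute not in seen:
                    seen.add(absolute)
                    frontier.append(absolute)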
Drawbacks of the basic algorithm
Will upset web admins (impolite) - It's abusing the web servers
Very slow - 1 page at a time
Will get caught in traps and infinite sequences
Will fetch duplicates without noticing
Will bring in data noise
Will miss content due to client-side scripting
Permission to crawl
Be polite: try to ask the website if you can crawl it first!
Robots Exclusion Standard aka robots.txt
- Sites may have that file at the root
- Very simple syntax (but no formal/official standard yet!)
It works on an honor basis! It's not a security mechanism
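Python's standard library can read and honor robots.txt directly; a minimal sketch (site and user-agent name are hypothetical):

    from urllib.robotparser import RobotFileParser

    rp = RobotFileParser("https://www.example.com/robots.txt")
    rp.read()                     # fetch and parse the site's robots.txt

    if rp.can_fetch("MyCrawler", "https://www.example.com/private/page.html"):
        pass                      # polite crawlers fetch only when this is True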
Sitemaps
Also listed in robots.txt
Allow webmasters to send info to crawlers.
Usually generated by scripts
Contain lists of URLs and data about those URLs, E.g.
• Location of pages that might not be linked
• Relative importance
• Update frequency
• "Last updated" / Modification time
• You can allow a specific resource inside an otherwise disallowed section (e.g. you can crawl these listed pages but the rest is off limits)
Can inform the crawler about pages it might not find
Gives crawler a hint about when to check a page for changes
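Sitemaps are just XML, so they are easy to parse; a minimal sketch using the standard library (the sitemap URL is hypothetical):

    import urllib.request
    import xml.etree.ElementTree as ET

    SITEMAP_URL = "https://www.example.com/sitemap.xml"
    NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

    xml_data = urllib.request.urlopen(SITEMAP_URL).read()
    root = ET.fromstring(xml_data)

    for url in root.findall("sm:url", NS):
        loc = url.findtext("sm:loc", namespaces=NS)                 # page location
        lastmod = url.findtext("sm:lastmod", namespaces=NS)         # "last updated" hint
        changefreq = url.findtext("sm:changefreq", namespaces=NS)   # update frequency hint
        print(loc, lastmod, changefreq)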
Politeness
Avoid hitting any site too often - Sites are for people, not for bots
Ignore politeness -> Denial of service (DOS) attack
Be polite -> Use artificial delays
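A minimal per-host delay tracker (the 2-second delay is an arbitrary choice for illustration):

    import time

    DELAY = 2.0            # seconds between requests to the same host (assumed value)
    last_hit = {}          # host -> time of the last request

    def wait_politely(host):
        elapsed = time.time() - last_hit.get(host, 0.0)
        if elapsed < DELAY:
            time.sleep(DELAY - elapsed)   # artificial delay before hitting the host again
        last_hit[host] = time.time()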
Performance (I)
Back of the envelope calculation:
- If 1 page fetch = 500ms, then how much time to crawl 1 million pages? (it's worse than that... Unresponsive servers)
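Worked out sequentially: 1,000,000 × 0.5 s = 500,000 s ≈ 139 hours ≈ 5.8 days - and that is before counting unresponsive servers.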
Most of the time, the crawler thread is waiting for the network data
Solution: multi-threaded or distributed crawling - politeness is harder to control, but it is possible (e.g. crawl different servers in parallel)
Performance (II)
Domain Name lookups - Given a domain name, retrieve its IP address
www.ics.uci.edu -> 128.195.1.83
Distributed set of servers - Latency can be high (2 secs is not unusual)
Common implementations are blocking - One request at a time, result is cached
Back of the envelope calculation - if 1 DNS lookup = 800ms, how much time to lookup the entire Web?
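A rough answer, assuming (purely for illustration) 100 million unique hostnames resolved one at a time: 100,000,000 × 0.8 s = 80,000,000 s ≈ 925 days, i.e. roughly 2.5 years - which is why caching and parallel lookups are essential.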
Crawler traps
May trap the crawler on the site forever - Web server responds with ever changing URLs and content (Dynamic pages)
Can be intentional or unintentional
E.g. the ICS calendar is a crawler trap
Some webadmins can create traps to penalize impolite crawlers
See http://www.fleiner.com/bots/
E.g. very large documents, disallowed in robots.txt, created to consume crawler resources or even to break poorly designed parsers of crawlers that ignore robots.txt
Duplicate Detection
Exact and near duplication are widespread
- Copies, mirror sites, versions, spam, plagiarism...
- Studies: 30% of Web pages are [near-]duplicates of the other 70%
- Little or no value (noise to the search engine and the user; you can show only one to the user and perhaps a "show similar" link)
Detection
- Detection of exact duplication is easy, but exact duplication is rare (use Hashes, checksums)
- Detection of near-duplicates is harder (use Page fingerprints)
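One simple fingerprinting idea (a sketch, not the exact technique of any particular engine): hash overlapping word n-grams ("shingles") and compare the sets.

    import hashlib

    def shingles(text, k=8):
        # hashed k-word shingles of a document (k is an arbitrary choice)
        words = text.lower().split()
        return {
            hashlib.md5(" ".join(words[i:i + k]).encode()).hexdigest()
            for i in range(max(1, len(words) - k + 1))
        }

    def jaccard(a, b):
        # Jaccard similarity of two shingle sets: |A ∩ B| / |A ∪ B|
        if not a and not b:
            return 1.0
        return len(a & b) / len(a | b)

    # Pages whose similarity exceeds some threshold (e.g. 0.9) are treated as
    # near-duplicates; exact duplicates can be caught with one hash of the whole page.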
Data Noise
Web pages have content not directly related to the page
- Ads, templates, etc
- Noise negatively impacts information retrieval
Technique 1: Cumulative distribution of tags - document slope curve (a sketch follows below)
(Other techniques in literature)
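A minimal sketch of the underlying curve, assuming the page has already been split into a mixed sequence of tag tokens and text tokens (the full technique then searches this curve for its flattest, i.e. most text-heavy, region):

    def document_slope_curve(tokens):
        # curve[i] = number of tag tokens among the first i+1 tokens;
        # the main content is the stretch where this curve is flattest
        curve, tags_seen = [], 0
        for tok in tokens:
            if tok.startswith("<"):      # crude test: the token is an HTML tag
                tags_seen += 1
            curve.append(tags_seen)
        return curve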
Client-Side Scripting
Modern web sites are heavily scripted (JavaScript, TypeScript) - Content behind XMLHttpRequests
To get to that content crawlers must interact with the scripts
- Hard thing to do: user interaction emulation (e.g. Selenium)
- Most crawlers will not do it and the content will never be indexed
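One common way to emulate a user is to drive a real browser, e.g. with Selenium; a minimal sketch (assumes Chrome and a matching driver are installed; the URL is hypothetical):

    from selenium import webdriver

    driver = webdriver.Chrome()
    driver.get("https://www.example.com/app")   # scripts and XMLHttpRequests run here
    html_after_scripts = driver.page_source     # the DOM as a user's browser would see it
    driver.quit()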
The Deep Web
Places where crawlers rarely go...
- Content behind login forms
- Content behind JavaScript/TypeScript
- Sites that aren't linked from anywhere
It is estimated that the deep web is 400-500x larger than the shallow web
Guidelines for bot writers
-Reconsider - do you really need a bot?
- Be Accountable - if your actions cause problems, be available to take prompt action in response
- Test Locally - expand the scope gradually before you unleash your crawler on others
- Don't hog resources - web servers are for people primarily. Walk, don't run.
- Stay with it - "it's vital to know what your robot is doing, and that it remains under control".
Find out the sites' crawling policies
Contact the Web admins
The Web and the Law
The Web is only ~30 years old and was born at CERN for physics research
Wild Wild West (or Wild Wild Web...), for the most part
Very few laws related to the internet
Existing laws either untested or outdated
Our own judgments/actions matter
First web search engines
W3Catalog (1993) : primitive catalogue
World Wide Web Wanderer (1993) : first web crawler to measure the size of the web
JumpStation (1993) : crawling, indexing, searching. First "modern" engine.
Excite and RankDex (1994) : ranking system inspired Google's PageRank
WebCrawler (1994) : full text indexing
Who offends whom?
Users can offend providers (personal information/privacy issues)
Providers can in turn offend users (content/security issues)
Copyrights
Products of intellect are ruled by "Copyright Law"
Web content, source code, etc. are products of intellect
It protects works regardless of whether they have been published or not
Fair Use
Various purposes for which the reproduction of a particular work may be considered fair:
- Criticism
- Comment
- News
- Teaching
- Scholarship
- Research
It's a very grey area
Fair Use Factors
-The purpose and character of the use, including whether such use is of commercial nature or is for nonprofit educational purposes
- The nature of the copyrighted work
- The amount and substantiality of the portion used in relation to the copyrighted work as a whole (are you using the "core" of the work?)
- The effect of the use upon the potential market for, or value of, the copyrighted work
eBay vs. Bidder's Edge (May 24, 2000)
-eBay successfully stopped crawlers from Bidder's Edge
-"trespass to chattels"
-Court ruling:
Bidder's Edge intentionally and without authorization interfered with eBay's possessory interest in the computer system.
Bidder's Edge's unauthorized use proximately resulted in damage to eBay
American Airlines vs. FareChase (March 10, 2003)
-American Airlines stopped crawlers from FareChase from scraping the AA.com website for "web fares" information (online fare comparison)
-"trespass to chattels"
-The suit was settled later that year.
Court ruling:
-FareChase's actions are intentional and without American's consent
-FareChase's conduct interferes with American's computer network and system (...) such actions adversely affect and harm American and the condition, quality and value of American's property
Two major jobs of the crawler
download webpages if authorized, and discover new URLs.
How does the crawler see the web?
It sees the web through the frontier: it grabs URLs from the frontier and adds newly discovered URLs back to it
Retrieving Web Pages
-Web crawler client program connects to a domain name system (DNS) server
-DNS server translates the hostname into an internet protocol (IP) address
-Crawler then attempts to connect to the server host using a specific port
-After connection, crawler sends an HTTP request to the web server to request a page - usually a GET request
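A minimal sketch of those steps with the Python standard library (the host is the example from above; in practice the response may just be a redirect, e.g. to HTTPS):

    import socket
    import http.client

    host = "www.ics.uci.edu"
    ip = socket.gethostbyname(host)               # DNS: hostname -> IP address
    print(ip)                                     # e.g. 128.195.1.83

    conn = http.client.HTTPConnection(host, 80)   # connect to a specific port
    conn.request("GET", "/")                      # send an HTTP GET request
    response = conn.getresponse()
    page = response.read()                        # the page (or a redirect) from the server
    conn.close()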
Problem: Web crawlers spend a lot of time waiting for responses to requests
Solution: To reduce this inefficiency, web crawlers use threads and fetch hundreds of pages at once
BUT crawlers could potentially flood sites with requests for pages
To avoid this problem, web crawlers use politeness policies- e.g., distribute requests between different servers, delay between requests to same web server, ...
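A minimal threaded-fetch sketch using the standard library (the URLs are placeholders taken from the frontier; a real crawler would combine this with per-host delays as above):

    from concurrent.futures import ThreadPoolExecutor
    from urllib.request import urlopen

    def fetch(url):
        # hypothetical fetch helper; real crawlers add politeness and retries
        try:
            return url, urlopen(url, timeout=10).read()
        except Exception:
            return url, None

    urls = ["https://www.example.com/a", "https://www.example.com/b"]  # from the frontier
    with ThreadPoolExecutor(max_workers=100) as pool:   # hundreds of pages in flight
        for url, body in pool.map(fetch, urls):
            pass  # store a representation of each fetched page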
Where to enforce politeness?
When grabbing URLs from the frontier and also when grabbing a web page's links
Freshness
Definition: Freshness is the proportion of pages that are fresh in the collection
In a website, webpages are constantly added, deleted, and modified
Stale copies (or their representation in our index) no longer reflect the real contents of the web pages
Web crawler must continually revisit pages- Check for modifications and maintain the freshness of the collection
Not possible to constantly check all pages- must check important pages and pages that change frequently
HTTP protocol has a special request type called HEAD that makes it easy to check for page changes- returns information about page, not page itself
Optimizing for this metric is dangerous - e.g. if a website is updated constantly, optimizing purely for freshness would lead you to never crawl that site (it can never count as fresh, so it earns no freshness credit)
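A HEAD-based change check like the one mentioned above might look like this (host and path are hypothetical):

    import http.client

    conn = http.client.HTTPSConnection("www.example.com")
    conn.request("HEAD", "/some/page.html")        # returns headers only, not the page
    resp = conn.getresponse()
    print(resp.status, resp.getheader("Last-Modified"))  # compare with our stored copy
    conn.close()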
Age
Another time metric
Age of a page = how long the page has been changed without us re-crawling it (0 if it has not changed since the last crawl); its expected value, for a page with mean change rate λ crawled t time units ago, is given by the formula below
Web page updates follow the Poisson distribution on average - time until the next update is governed by an exponential distribution
The older a page gets, the more it costs not to crawl it - e.g., the expected age keeps growing ever faster for any given mean change frequency
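For reference, under the Poisson-change model above the age integral has a simple closed form (a quick derivation, not taken from the slides):

    \mathrm{Age}(\lambda, t) \;=\; \int_0^t \lambda e^{-\lambda x}\,(t - x)\, dx \;=\; t - \frac{1 - e^{-\lambda t}}{\lambda}

Its second derivative, \lambda e^{-\lambda t}, is positive, so expected age grows faster and faster the longer a page goes uncrawled - the "costs more not to crawl it" effect.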
Distributed Crawling
Three reasons to use multiple computers for crawling
- Helps to put the crawler closer to the sites it crawls; lower requirements in the network
- Reduces the number of sites each crawler has to remember; lower requirements on local memory
- Reduces the locally required resources
Distributed crawlers may use a hash function to assign URLs to crawling computers/threads
- A hash function is computed on the host part of each URL (so all URLs from the same host go to the same crawler)
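A minimal sketch of host-based assignment (the function and crawler count are made up for illustration):

    import zlib
    from urllib.parse import urlparse

    def assign_crawler(url, num_crawlers):
        # hash the host so every URL from one host lands on the same machine/thread,
        # which simplifies politeness and duplicate detection
        host = urlparse(url).hostname or ""
        return zlib.crc32(host.encode("utf-8")) % num_crawlers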
Focused Crawling
-Attempts to download only those pages that are about a particular topic
- used by vertical search applications
- can also be used to improve a search engine in a specific topic, or around specific events in time
-Relies on the fact that pages about a topic tend to have links to other pages on the same topic
- popular pages for a topic are typically used as seeds
- E.g. pages that many users are clicking after they search a keyword
-Crawler uses some text classifier to decide whether a page is on topic
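A toy stand-in for "some text classifier" (topic terms and threshold are made up; a real focused crawler would use a trained classifier):

    TOPIC_TERMS = {"surfing", "surfboard", "wave", "wetsuit"}   # hypothetical topic

    def on_topic(page_text, threshold=3):
        # follow a page's links only if enough topic terms appear in its text
        words = set(page_text.lower().split())
        return len(words & TOPIC_TERMS) >= threshold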
Desktop Crawls
-Used for desktop search and enterprise search
-Differences with respect to web crawling:
- Much easier to find the data (file system ~ sitemap)
- Responding quickly to updates is more important
- Must be conservative in terms of disk, memory and CPU usage (It should be transparent to the user.)
- Need to cope with many different document formats
- Data privacy much more important than in the web context
Document Feeds Crawls
-In general on the web: documents are created and updated
-But many documents are also just published (created at a fixed time and rarely updated again)
e.g., news articles, blog posts, press releases, email
-Document feed: Published documents from a single source can be ordered in a sequence
- Crawlers can find new documents by examining the end of the feed
Two types of document feeds
push feed: alerts the subscriber to new documents, involves subscription
• E.g. news agencies
pull feed: requires the subscriber to check periodically for new documents
most common format is RSS (Really Simple Syndication, Resource Description Framework Site Summary, Rich Site Summary)
Major characteristics of RSS to crawlers
ttl tag (time to live) - amount of time (in minutes) contents should be cached
RSS feeds are accessed like web pages
- using HTTP GET requests to web servers that host them
- You can use HTTP HEAD requests to check if the feed has changed
Easy for crawlers to parse (it is just XML)
Easy to find new information:
- just GET, parse and if any, grab the new
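A minimal pull-feed polling sketch, assuming an RSS 2.0 feed at a hypothetical URL:

    import urllib.request
    import xml.etree.ElementTree as ET

    FEED_URL = "https://www.example.com/news/rss.xml"
    seen_links = set()

    xml_data = urllib.request.urlopen(FEED_URL).read()    # GET the feed
    channel = ET.fromstring(xml_data).find("channel")     # RSS is just XML
    ttl = channel.findtext("ttl")                         # minutes the feed may be cached, if present
    for item in channel.findall("item"):
        link = item.findtext("link")
        if link not in seen_links:                        # grab only the new documents
            seen_links.add(link)
            print("new document:", item.findtext("title"), link)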
The need for conversion
Text is stored in hundreds of incompatible file formats - e.g., raw text, RTF, HTML, XML, Microsoft Word, ODF, PDF
Other types of files, not dedicated to text, are also important - e.g., PowerPoint, Excel, Keynote, Pages
Typically use a conversion tool or library
- converts the document content into a tagged text format such as HTML or XML
- retains some of the important formatting information
- Some converters will not retain formatting (e.g. *nix's pdftotext, pdf2ps + pdf2ascii, etc.)
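For example, conversion is often just a call out to an existing tool (file names here are hypothetical, and the plain-text output keeps none of the formatting):

    import subprocess

    # shell out to poppler's pdftotext: pdftotext input.pdf output.txt
    subprocess.run(["pdftotext", "report.pdf", "report.txt"], check=True)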
Glyphs
the symbols that you see, that encode letters or meanings
Character Encoding
A character encoding is a mapping between bits and glyphs
- i.e., getting from bits in a file to characters on a screen
- Can be a major source of incompatibility
ASCII
ASCII is basic character encoding scheme for English (since 1963)
- Encodes 128 letters, numbers, special characters, and control characters in 7 bits, extended with an extra bit for storage in bytes
The problem with character encoding
Other languages can have many more glyphs
- e.g., Chinese > 40,000 characters, with over 3,000 in common use
Many languages have multiple encoding schemes
- e.g., CJK (Chinese-Japanese-Korean) family of East Asian languages, Hindi, Arabic
- must specify encoding in the file : the file is not self-describing
- can't have multiple languages in one file
Unicode was developed in the late 80's to address encoding problems
Unicode
Single mapping from numbers to glyphs that attempts to include all glyphs in common use in all known languages
Unicode is a mapping between numbers and glyphs
- solves the problem of mapping from number to glyphs
- does not uniquely specify the bits-to-glyph mapping - e.g., UTF-8, UTF-16, UTF-32
Unicode encodings
Proliferation of encodings comes from a need for compatibility and to save space
- UTF-8 uses one byte for English (=ASCII), but as many as 4 bytes for some traditional Chinese characters
• variable length encoding, more difficult to do string operations
- UTF-32 uses 4 bytes for every character
• more memory, no backward compatibility with ASCII
You can mix both in your design
- Many applications use UTF-32 for internal text encoding (fast random lookup) and UTF-8 for disk storage (less space)
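A quick way to see the trade-off (illustrative only):

    # byte cost of the same characters under different Unicode encodings
    for ch in ("a", "é", "中"):
        print(ch,
              len(ch.encode("utf-8")),      # 1, 2, 3 bytes (variable length)
              len(ch.encode("utf-32-le")))  # always 4 bytes ("-le" avoids the BOM)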