Chapter 3 - Web Crawler

24 Terms

1. What is a web crawler?
A program that automatically finds and downloads web pages, building the collection of pages used for searching.

2. Where are web pages stored?
Web pages are stored on web servers, which use HTTP to exchange information with client software.

3. What are the steps for retrieving web pages? (See the sketch below.)
1. The web crawler client program connects to a domain name system (DNS) server.
2. The DNS server translates the hostname into an Internet Protocol (IP) address.
3. The crawler then attempts to connect to the server host using a specific port (typically port 80 for HTTP).
4. After the connection is established, the crawler sends an HTTP request to the web server to ask for a page.
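
A minimal sketch of these four steps in Python, using only the standard socket module (the hostname example.com and the helper name fetch_page are illustrative, not part of the chapter):

```python
import socket

def fetch_page(hostname, path="/"):
    # Steps 1-2: a DNS lookup translates the hostname into an IP address.
    ip_address = socket.gethostbyname(hostname)

    # Step 3: connect to the server host on port 80, the default HTTP port.
    conn = socket.create_connection((ip_address, 80), timeout=10)

    # Step 4: send an HTTP GET request for the page.
    request = f"GET {path} HTTP/1.1\r\nHost: {hostname}\r\nConnection: close\r\n\r\n"
    conn.sendall(request.encode("ascii"))

    # Read the full response (headers + body) until the server closes the connection.
    chunks = []
    while True:
        data = conn.recv(4096)
        if not data:
            break
        chunks.append(data)
    conn.close()
    return b"".join(chunks)

print(fetch_page("example.com")[:200])
```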

4. What are the steps for a web crawler? (See the sketch below.)
1. The crawler starts with a set of seeds: URLs given to it as parameters, which are placed in a request queue.
2. The crawler fetches pages from the request queue, parses them, and adds any newly discovered URLs to the queue.
3. It continues until there are no more new URLs or the disk is full.
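
A toy version of this loop in Python; the deque plays the role of the request queue, and the regular-expression link extraction is a simplified stand-in for a real HTML parser:

```python
from collections import deque
from urllib.parse import urljoin, urlparse
import re
import urllib.request

def crawl(seeds, max_pages=100):
    frontier = deque(seeds)   # request queue, initialized with the seed URLs
    seen = set(seeds)         # remember URLs so each page is fetched only once
    while frontier and len(seen) <= max_pages:
        url = frontier.popleft()
        try:
            html = urllib.request.urlopen(url, timeout=10).read().decode("utf-8", "replace")
        except OSError:
            continue          # skip pages that fail to download
        # Parse out links and add any new absolute URLs to the queue.
        for href in re.findall(r'href="([^"]+)"', html):
            link = urljoin(url, href)
            if urlparse(link).scheme in ("http", "https") and link not in seen:
                seen.add(link)
                frontier.append(link)

crawl(["https://example.com/"])
```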

5. How do web crawlers increase the efficiency of crawling?
Web crawlers use threads and fetch hundreds of pages at once.
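
A sketch of concurrent fetching with a thread pool; the worker count and URLs are illustrative:

```python
from concurrent.futures import ThreadPoolExecutor
import urllib.request

def fetch(url):
    # While one thread waits on the network, other threads keep downloading.
    try:
        return url, urllib.request.urlopen(url, timeout=10).read()
    except OSError:
        return url, None

urls = ["https://example.com/", "https://example.org/", "https://example.net/"]
with ThreadPoolExecutor(max_workers=100) as pool:  # hundreds of fetches in flight at once
    for url, body in pool.map(fetch, urls):
        print(url, 0 if body is None else len(body))
```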

6. How can crawlers avoid flooding sites with requests for pages?
Web crawlers use politeness policies: a delay is enforced between requests to the same web server.
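
One possible way to enforce such a delay, tracking the last request time per host (the 10-second delay is an assumed value, not a standard):

```python
import time

last_request = {}        # last fetch time for each host
POLITENESS_DELAY = 10.0  # seconds between requests to the same server (assumed value)

def wait_politely(host):
    # Sleep if the last request to this host was less than POLITENESS_DELAY ago.
    now = time.monotonic()
    elapsed = now - last_request.get(host, float("-inf"))
    if elapsed < POLITENESS_DELAY:
        time.sleep(POLITENESS_DELAY - elapsed)
    last_request[host] = time.monotonic()

wait_politely("example.com")  # first call returns immediately
wait_politely("example.com")  # second call sleeps ~10 seconds
```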

7. What file is used to control crawlers?
The robots.txt file, which tells crawlers which parts of a site they may (and may not) fetch.
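
Python's standard library includes a robots.txt parser; a short sketch (the crawler name MyCrawler and the URLs are illustrative):

```python
import urllib.robotparser

# Fetch and parse the site's robots.txt file.
rp = urllib.robotparser.RobotFileParser("https://example.com/robots.txt")
rp.read()

# Check whether our crawler's user agent is allowed to fetch a page.
if rp.can_fetch("MyCrawler", "https://example.com/private/page.html"):
    print("allowed to fetch")
else:
    print("disallowed by robots.txt")
```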

8. What is focused crawling?
Focused crawling attempts to download only those pages that are about a particular topic.

9. What is the deep Web?
Sites that are difficult for a crawler to find are collectively referred to as the deep (or hidden) Web.

10. What are private sites?
Sites that have no incoming links, or that may require logging in with a valid account.

11. What are form results?
Sites that can be reached only after entering some data into a form.

12. What are scripted pages?
Pages that use JavaScript, Flash, or another client-side language to generate links.

13. What are sitemaps?
Sitemaps contain lists of URLs and data about those URLs, such as modification time and modification frequency.
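
A sketch of reading those fields from a sitemap with the standard XML parser (the sitemap content here is a made-up example):

```python
import xml.etree.ElementTree as ET

# A tiny example sitemap (URLs and values are illustrative).
SITEMAP = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/</loc>
    <lastmod>2024-01-15</lastmod>
    <changefreq>daily</changefreq>
  </url>
</urlset>"""

ns = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
root = ET.fromstring(SITEMAP)
for url in root.findall("sm:url", ns):
    # loc is the page URL; lastmod and changefreq help the crawler schedule revisits.
    print(url.findtext("sm:loc", namespaces=ns),
          url.findtext("sm:lastmod", namespaces=ns),
          url.findtext("sm:changefreq", namespaces=ns))
```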

14. What are the reasons to use multiple computers for crawling?
1. It helps to put the crawler closer to the sites it crawls.
2. It reduces the number of sites each crawler has to remember.
3. It reduces the computing resources required on any one machine.

15. What are desktop crawls?
Crawls used for desktop search and enterprise search.

16. What is a push feed?
A push feed alerts the subscriber to new documents.

17. What is a pull feed?
A pull feed requires the subscriber to check periodically for new documents.

18. What does Really Simple Syndication (RSS) mean?
RSS is the term used to refer to a collection of Web feed formats that provide updated or shared information in a standardized way.
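
A sketch of polling a pull feed: fetch the RSS XML and list its items (the feed URL is illustrative):

```python
import urllib.request
import xml.etree.ElementTree as ET

def pull_feed(feed_url):
    # A pull feed: the subscriber fetches the RSS XML and checks for new items.
    xml_data = urllib.request.urlopen(feed_url, timeout=10).read()
    root = ET.fromstring(xml_data)
    for item in root.iter("item"):
        # Each <item> describes one document: its title, link, and publication date.
        print(item.findtext("title"), item.findtext("link"), item.findtext("pubDate"))

pull_feed("https://example.com/feed.rss")  # illustrative URL
```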

19. Is the Web under the control of the search engine provider?
No. The Web is not under the control of the search engine provider.

20. What is page freshness?
Freshness is how up to date the crawler's stored copies of web pages are; it is a challenge because web pages are constantly being added, deleted, and modified.

21. What special request type in the HTTP protocol helps a crawler check freshness?
The HEAD request, which returns only header information about a page (such as its last-modified date), not the page itself.
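
A sketch of using a HEAD request to check a page's Last-Modified header without downloading the page (example.com is illustrative):

```python
import http.client

# A HEAD request returns only the headers for a page, not the page itself,
# so the crawler can check the Last-Modified date cheaply.
conn = http.client.HTTPSConnection("example.com", timeout=10)
conn.request("HEAD", "/")
response = conn.getresponse()
print(response.status, response.getheader("Last-Modified"))
conn.close()
```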

22. What does a crawler use to decide whether a page is on topic?
A focused crawler uses a text classifier.
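
A sketch of such a classifier, assuming scikit-learn is available; the tiny training set is made up purely for illustration:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy training data: a few pages labeled on-topic (1) or off-topic (0).
pages = [
    "web crawler downloads pages search engine index",
    "information retrieval ranking search queries",
    "chocolate cake recipe with butter and sugar",
    "football match results and league standings",
]
labels = [1, 1, 0, 0]

classifier = make_pipeline(TfidfVectorizer(), MultinomialNB())
classifier.fit(pages, labels)

# The focused crawler keeps a fetched page only if the classifier says it is on topic.
new_page = "crawler politeness policy and request queue"
if classifier.predict([new_page])[0] == 1:
    print("on topic: keep the page and follow its links")
else:
    print("off topic: discard the page")
```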

23. How does a distributed crawler assign URLs to crawling computers?
A hash function is computed on the host part of each URL; the hash value determines which crawling computer handles that URL.
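
A sketch of this assignment (the crawler count and the choice of MD5 are assumptions; any stable hash function works):

```python
import hashlib
from urllib.parse import urlparse

NUM_CRAWLERS = 4  # assumed number of crawling computers

def assign_crawler(url):
    # Hash only the host part, so all URLs from one site go to the same computer,
    # which can then enforce politeness for that whole site.
    host = urlparse(url).netloc
    digest = hashlib.md5(host.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_CRAWLERS

print(assign_crawler("https://example.com/page1"))  # same crawler...
print(assign_crawler("https://example.com/page2"))  # ...for the same host
print(assign_crawler("https://example.org/page1"))  # possibly a different crawler
```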

24. What are the differences between desktop crawls and web crawling?
Compared to web crawling, a desktop crawl:
1. Finds the data more easily.
2. Must respond quickly to changed files.
3. Must be conservative in its disk and CPU usage.
4. Handles many different document formats.
5. Must treat data privacy as very important.