Chapter 3 - Web Crawler

24 Terms

1. What is a web crawler?
A program that automatically finds and downloads web pages, building the collection of pages used for searching.

2. Where are web pages stored?
Web pages are stored on web servers, which use HTTP to exchange information with client software.

3. What are the steps for retrieving web pages? (See the sketch below.)
1. The web crawler client program connects to a domain name system (DNS) server.
2. The DNS server translates the hostname into an Internet Protocol (IP) address.
3. The crawler then attempts to connect to the server host using a specific port (typically port 80 for HTTP).
4. After the connection is established, the crawler sends an HTTP request to the web server to ask for a page.
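
A minimal sketch of these four steps in Python, using only the standard socket module (the hostname example.com and the helper name fetch_page are illustrative, not part of the chapter):

```python
import socket

def fetch_page(hostname, path="/"):
    # Steps 1-2: a DNS lookup translates the hostname into an IP address.
    ip_address = socket.gethostbyname(hostname)

    # Step 3: connect to the server host on port 80, the default HTTP port.
    conn = socket.create_connection((ip_address, 80), timeout=10)

    # Step 4: send an HTTP GET request for the page.
    request = f"GET {path} HTTP/1.1\r\nHost: {hostname}\r\nConnection: close\r\n\r\n"
    conn.sendall(request.encode("ascii"))

    # Read the full response (headers + body) until the server closes the connection.
    chunks = []
    while True:
        data = conn.recv(4096)
        if not data:
            break
        chunks.append(data)
    conn.close()
    return b"".join(chunks)

print(fetch_page("example.com")[:200])
```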

4. What are the steps for a web crawler? (See the sketch below.)
1. The crawler starts with a set of seeds: URLs given to it as parameters, which are placed in a request queue.
2. The crawler fetches pages from the request queue, parses them, and adds any newly discovered URLs to the queue.
3. It continues until there are no more new URLs or the disk is full.
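
A toy version of this loop in Python; the deque plays the role of the request queue, and the regular-expression link extraction is a simplified stand-in for a real HTML parser:

```python
from collections import deque
from urllib.parse import urljoin, urlparse
import re
import urllib.request

def crawl(seeds, max_pages=100):
    frontier = deque(seeds)   # request queue, initialized with the seed URLs
    seen = set(seeds)         # remember URLs so each page is fetched only once
    while frontier and len(seen) <= max_pages:
        url = frontier.popleft()
        try:
            html = urllib.request.urlopen(url, timeout=10).read().decode("utf-8", "replace")
        except OSError:
            continue          # skip pages that fail to download
        # Parse out links and add any new absolute URLs to the queue.
        for href in re.findall(r'href="([^"]+)"', html):
            link = urljoin(url, href)
            if urlparse(link).scheme in ("http", "https") and link not in seen:
                seen.add(link)
                frontier.append(link)

crawl(["https://example.com/"])
```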

5. How do web crawlers increase the efficiency of crawling?
Web crawlers use threads and fetch hundreds of pages at once.
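
A sketch of concurrent fetching with a thread pool; the worker count and URLs are illustrative:

```python
from concurrent.futures import ThreadPoolExecutor
import urllib.request

def fetch(url):
    # While one thread waits on the network, other threads keep downloading.
    try:
        return url, urllib.request.urlopen(url, timeout=10).read()
    except OSError:
        return url, None

urls = ["https://example.com/", "https://example.org/", "https://example.net/"]
with ThreadPoolExecutor(max_workers=100) as pool:  # hundreds of fetches in flight at once
    for url, body in pool.map(fetch, urls):
        print(url, 0 if body is None else len(body))
```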

6. How can crawlers avoid flooding sites with requests for pages?
Web crawlers use politeness policies: a delay is enforced between requests to the same web server.
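
One possible way to enforce such a delay, tracking the last request time per host (the 10-second delay is an assumed value, not a standard):

```python
import time

last_request = {}        # last fetch time for each host
POLITENESS_DELAY = 10.0  # seconds between requests to the same server (assumed value)

def wait_politely(host):
    # Sleep if the last request to this host was less than POLITENESS_DELAY ago.
    now = time.monotonic()
    elapsed = now - last_request.get(host, float("-inf"))
    if elapsed < POLITENESS_DELAY:
        time.sleep(POLITENESS_DELAY - elapsed)
    last_request[host] = time.monotonic()

wait_politely("example.com")  # first call returns immediately
wait_politely("example.com")  # second call sleeps ~10 seconds
```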

7. What file is used to control crawlers?
The robots.txt file, which tells crawlers which parts of a site they may (and may not) fetch.
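
Python's standard library includes a robots.txt parser; a short sketch (the crawler name MyCrawler and the URLs are illustrative):

```python
import urllib.robotparser

# Fetch and parse the site's robots.txt file.
rp = urllib.robotparser.RobotFileParser("https://example.com/robots.txt")
rp.read()

# Check whether our crawler's user agent is allowed to fetch a page.
if rp.can_fetch("MyCrawler", "https://example.com/private/page.html"):
    print("allowed to fetch")
else:
    print("disallowed by robots.txt")
```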

8. What is focused crawling?
Focused crawling attempts to download only those pages that are about a particular topic.

9. What is the deep Web?
Sites that are difficult for a crawler to find are collectively referred to as the deep (or hidden) Web.

10. What are private sites?
Sites that have no incoming links, or that may require logging in with a valid account.

11. What are form results?
Sites that can be reached only after entering some data into a form.

12. What are scripted pages?
Pages that use JavaScript, Flash, or another client-side language to generate links.

13. What are sitemaps?
Sitemaps contain lists of URLs and data about those URLs, such as modification time and modification frequency.
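
A sketch of reading those fields from a sitemap with the standard XML parser (the sitemap content here is a made-up example):

```python
import xml.etree.ElementTree as ET

# A tiny example sitemap (URLs and values are illustrative).
SITEMAP = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/</loc>
    <lastmod>2024-01-15</lastmod>
    <changefreq>daily</changefreq>
  </url>
</urlset>"""

ns = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
root = ET.fromstring(SITEMAP)
for url in root.findall("sm:url", ns):
    # loc is the page URL; lastmod and changefreq help the crawler schedule revisits.
    print(url.findtext("sm:loc", namespaces=ns),
          url.findtext("sm:lastmod", namespaces=ns),
          url.findtext("sm:changefreq", namespaces=ns))
```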

14. What are the reasons to use multiple computers for crawling?
1. It helps to put the crawler closer to the sites it crawls.
2. It reduces the number of sites each crawler has to remember.
3. It reduces the computing resources required on any one machine.

15. What are desktop crawls?
Crawls used for desktop search and enterprise search.

16. What is a push feed?
A push feed alerts the subscriber to new documents.

17. What is a pull feed?
A pull feed requires the subscriber to check periodically for new documents.

18. What does Really Simple Syndication (RSS) mean?
RSS is the term used to refer to a collection of Web feed formats that provide updated or shared information in a standardized way.
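
A sketch of polling a pull feed: fetch the RSS XML and list its items (the feed URL is illustrative):

```python
import urllib.request
import xml.etree.ElementTree as ET

def pull_feed(feed_url):
    # A pull feed: the subscriber fetches the RSS XML and checks for new items.
    xml_data = urllib.request.urlopen(feed_url, timeout=10).read()
    root = ET.fromstring(xml_data)
    for item in root.iter("item"):
        # Each <item> describes one document: its title, link, and publication date.
        print(item.findtext("title"), item.findtext("link"), item.findtext("pubDate"))

pull_feed("https://example.com/feed.rss")  # illustrative URL
```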

19. Is the Web under the control of the search engine provider?
No. The Web is not under the control of the search engine provider.

20. What is page freshness?
Freshness is how up to date the crawler's stored copies of web pages are; it is a challenge because web pages are constantly being added, deleted, and modified.

21. What special request type in the HTTP protocol helps a crawler check freshness?
The HEAD request, which returns only header information about a page (such as its last-modified date), not the page itself.
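
A sketch of using a HEAD request to check a page's Last-Modified header without downloading the page (example.com is illustrative):

```python
import http.client

# A HEAD request returns only the headers for a page, not the page itself,
# so the crawler can check the Last-Modified date cheaply.
conn = http.client.HTTPSConnection("example.com", timeout=10)
conn.request("HEAD", "/")
response = conn.getresponse()
print(response.status, response.getheader("Last-Modified"))
conn.close()
```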

22. What does a crawler use to decide whether a page is on topic?
A focused crawler uses a text classifier.
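
A sketch of such a classifier, assuming scikit-learn is available; the tiny training set is made up purely for illustration:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy training data: a few pages labeled on-topic (1) or off-topic (0).
pages = [
    "web crawler downloads pages search engine index",
    "information retrieval ranking search queries",
    "chocolate cake recipe with butter and sugar",
    "football match results and league standings",
]
labels = [1, 1, 0, 0]

classifier = make_pipeline(TfidfVectorizer(), MultinomialNB())
classifier.fit(pages, labels)

# The focused crawler keeps a fetched page only if the classifier says it is on topic.
new_page = "crawler politeness policy and request queue"
if classifier.predict([new_page])[0] == 1:
    print("on topic: keep the page and follow its links")
else:
    print("off topic: discard the page")
```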

23. How does a distributed crawler assign URLs to crawling computers?
A hash function is computed on the host part of each URL; the hash value determines which crawling computer handles that URL.
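
A sketch of this assignment (the crawler count and the choice of MD5 are assumptions; any stable hash function works):

```python
import hashlib
from urllib.parse import urlparse

NUM_CRAWLERS = 4  # assumed number of crawling computers

def assign_crawler(url):
    # Hash only the host part, so all URLs from one site go to the same computer,
    # which can then enforce politeness for that whole site.
    host = urlparse(url).netloc
    digest = hashlib.md5(host.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_CRAWLERS

print(assign_crawler("https://example.com/page1"))  # same crawler...
print(assign_crawler("https://example.com/page2"))  # ...for the same host
print(assign_crawler("https://example.org/page1"))  # possibly a different crawler
```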

24. What are the differences between desktop crawls and web crawling?
Compared to web crawling, a desktop crawl:
1. Finds the data more easily.
2. Must respond quickly to changed files.
3. Must be conservative in its disk and CPU usage.
4. Handles many different document formats.
5. Must treat data privacy as very important.