CHAPTER 9: DESIGN A WEB CRAWLER

21 Terms

1

Web crawler purposes

  1. Search engine indexing

  2. Web archiving

  3. Web mining

  4. Web monitoring

2

Basic steps of a web crawler

  1. Download web pages for a given list of URLs

  2. Extract URLs (links) from the downloaded pages

  3. Repeat

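A minimal, standard-library-only Python sketch of this download/extract/repeat loop (the class and function names are illustrative, not from the chapter):

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class LinkExtractor(HTMLParser):
    """Collects href values from <a> tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed_urls, max_pages=100):
    frontier = deque(seed_urls)    # URLs waiting to be downloaded
    seen = set(seed_urls)          # avoids enqueueing the same URL twice
    downloaded = 0
    while frontier and downloaded < max_pages:
        url = frontier.popleft()
        try:
            html = urlopen(url, timeout=5).read().decode("utf-8", "replace")
        except (OSError, ValueError):
            continue               # skip unreachable or malformed URLs
        downloaded += 1
        extractor = LinkExtractor()
        extractor.feed(html)
        for link in extractor.links:
            absolute = urljoin(url, link)          # relative -> absolute
            if absolute.startswith("http") and absolute not in seen:
                seen.add(absolute)
                frontier.append(absolute)
    return seen
```
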
3

Requirement questions

  1. What is the purpose?

  2. How many pages are crawled per month?

  3. What content types: HTML only, or also PDF, images, etc.?

  4. Should we store the HTML contents, and for how long?

4

Characteristics of a good web crawler

  1. Scalable: parallelize the crawl to handle the huge number of pages

  2. Robust: handle bad HTML, unresponsive servers, crashes, etc.

  3. Polite: do not send too many requests to the same server

  4. Extensible: easy to support new content types and features

5

Estimation

  1. Pages to crawl per second

  2. Storage needed: average page size × pages crawled per second × retention period (in seconds)

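A back-of-envelope version of the same calculation, using assumed figures (1 billion pages per month, 500 KB average page size, 5 years of retention):

```python
PAGES_PER_MONTH = 1_000_000_000        # assumed
AVG_PAGE_SIZE = 500 * 1024             # 500 KB, assumed
YEARS = 5                              # assumed retention period

qps = PAGES_PER_MONTH / (30 * 24 * 3600)            # pages crawled per second
storage = AVG_PAGE_SIZE * qps * YEARS * 365 * 24 * 3600

print(f"pages per second ~= {qps:.0f}")             # ~400
print(f"storage ~= {storage / 1e15:.0f} PB")        # ~30 PB
```
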
6

Diagram components

  1. Seed URLs

  2. URL frontier

  3. DNS resolver

  4. HTML downloader

  5. Content parser

  6. Content storage

  7. URL extractor

  8. URL filter

  9. URL storage

7

What makes a good seed URL?

Popular websites, selected by country or by topic

8

Why do we need a content parser?

  1. Validate web page format

  2. Reduce storage space

9

What does the URL extractor do?

  1. Extract URLs (links) from the downloaded page content

  2. Convert relative URLs to absolute ones

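The relative-to-absolute conversion is exactly what urllib.parse.urljoin does; the URLs below are made-up examples:

```python
from urllib.parse import urljoin

base = "https://example.com/a/b/page.html"
print(urljoin(base, "../c/other.html"))        # https://example.com/a/c/other.html
print(urljoin(base, "/root.html"))             # https://example.com/root.html
print(urljoin(base, "https://example.org/x"))  # already-absolute URLs pass through
```
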
10

What does URL filter do?

  1. Exclude certain content types

  2. Exclude error links

  3. Exclude blacklisted sites

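A sketch of such a filter; the blacklist and extension list are illustrative placeholders:

```python
from urllib.parse import urlparse

BLACKLISTED_DOMAINS = {"spam.example.com"}       # hypothetical blacklist
EXCLUDED_EXTENSIONS = {".jpg", ".png", ".gif", ".css", ".js", ".pdf"}


def keep_url(url: str) -> bool:
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https"):   # drop mailto:, javascript:, ...
        return False
    if parsed.hostname in BLACKLISTED_DOMAINS:
        return False
    if any(parsed.path.lower().endswith(ext) for ext in EXCLUDED_EXTENSIONS):
        return False                             # unwanted content types
    return True
```
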
11

Requirements of URL frontier

  1. Prioritize more important URLs

  2. Rate limit requests to each domain

12

URL frontier components

  1. Prioritizer: computes the priority of each URL

  2. Front queues: one queue per priority level

  3. Front queue selector: chooses a front queue to pull from, biased toward higher priority

  4. Mapping table: maps each domain to its back queue

  5. Back queue router: enqueues a URL into the back queue for its domain

  6. Back queues: each holds URLs from a single domain (politeness)

  7. Back queue selector: assigns back queues to worker threads

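A toy in-memory sketch of the two queue layers (prioritization in front, per-domain politeness in back); the class and method names are mine, not the chapter's:

```python
import random
from collections import defaultdict, deque
from urllib.parse import urlparse


class Frontier:
    def __init__(self, num_priorities=3):
        # Front queues: one per priority level (0 = highest priority).
        self.front = [deque() for _ in range(num_priorities)]
        # Back queues: one per domain (this dict doubles as the mapping table).
        self.back = defaultdict(deque)

    def add(self, url, priority):
        # The prioritizer's output decides which front queue the URL lands in.
        self.front[priority].append(url)

    def _front_select(self):
        # Front queue selector: pick a non-empty queue, biased toward high priority.
        nonempty = [i for i, q in enumerate(self.front) if q]
        if not nonempty:
            return None
        weights = [2 ** (len(self.front) - i) for i in nonempty]
        chosen = random.choices(nonempty, weights=weights, k=1)[0]
        return self.front[chosen].popleft()

    def route_to_back(self):
        # Back queue router: move a URL into the back queue for its domain.
        url = self._front_select()
        if url is not None:
            self.back[urlparse(url).hostname].append(url)

    def next_for_worker(self, domain):
        # Back queue selector: a worker thread drains only its assigned domain,
        # which keeps requests to any single host spaced out (politeness).
        queue = self.back.get(domain)
        return queue.popleft() if queue else None
```
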
13

When to recrawl pages

  1. Based on historical update frequency

  2. Based on importance

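One illustrative way to combine the two signals into a schedule; the base interval and weighting are made up, not from the chapter:

```python
from datetime import datetime, timedelta


def next_crawl_time(last_crawled: datetime,
                    changes_per_day: float,
                    importance: float) -> datetime:
    """importance is assumed to be a normalized score in (0, 1]."""
    base_interval_days = 30                      # slowest revisit cadence (assumed)
    # Pages that change often or matter more get a shorter revisit interval.
    interval = base_interval_days / ((1 + changes_per_day) * max(importance, 0.01))
    return last_crawled + timedelta(days=min(max(interval, 0.5), 365))
```
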
14

How does storage for the URL frontier work?

  1. Store URLs on disk because they won't all fit in memory

  2. Write to an in-memory buffer first, then bulk-write to disk

  3. Bulk-fetch from disk into an in-memory buffer for dequeuing

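A toy version of such a disk-backed queue; the file format and batch size are arbitrary choices, and a real frontier would use a more efficient on-disk layout:

```python
from collections import deque


class DiskBackedQueue:
    def __init__(self, path="frontier.txt", batch_size=1000):
        self.path = path
        self.batch_size = batch_size
        self.write_buffer = []            # recently enqueued URLs
        self.read_buffer = deque()        # URLs prefetched for dequeue
        open(self.path, "a").close()      # make sure the file exists

    def enqueue(self, url):
        self.write_buffer.append(url)
        if len(self.write_buffer) >= self.batch_size:
            self._flush()

    def _flush(self):
        with open(self.path, "a") as f:   # bulk write to disk
            f.write("".join(u + "\n" for u in self.write_buffer))
        self.write_buffer.clear()

    def dequeue(self):
        if not self.read_buffer:
            self._refill()
        return self.read_buffer.popleft() if self.read_buffer else None

    def _refill(self):
        # Bulk-fetch one batch into memory; keep the rest on disk.
        self._flush()
        with open(self.path) as f:
            lines = f.read().splitlines()
        self.read_buffer.extend(lines[: self.batch_size])
        with open(self.path, "w") as f:
            f.write("".join(u + "\n" for u in lines[self.batch_size:]))
```
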
15

What is Robots.txt

  1. Robots exclusion protocol

  2. Specifies what pages crawlers are allowed to download

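Python's standard library already implements the robots exclusion protocol; the site URL and user-agent string below are placeholders:

```python
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()                                  # downloads and parses robots.txt

if rp.can_fetch("MyCrawler/1.0", "https://example.com/some/page.html"):
    print("allowed to download this page")
else:
    print("disallowed by robots.txt")
```
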
16

How to speed up HTML downloading

  1. Scale out the downloader: distribute the crawl across multiple servers and threads

  2. Cache DNS records to avoid repeated lookups

  3. Deploy downloaders geographically closer to the website hosts

  4. Specify a short max wait time so slow servers don't block workers

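A minimal sketch of the DNS cache from item 2; the 5-minute TTL is an arbitrary choice:

```python
import socket
import time

_dns_cache = {}            # hostname -> (ip_address, expiry_timestamp)
_TTL_SECONDS = 300


def resolve(hostname: str) -> str:
    cached = _dns_cache.get(hostname)
    if cached and cached[1] > time.time():
        return cached[0]                       # cache hit
    ip = socket.gethostbyname(hostname)        # blocking DNS lookup
    _dns_cache[hostname] = (ip, time.time() + _TTL_SECONDS)
    return ip
```
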
17

How to make the system more robust

  1. Have multiple downloaders

  2. Save crawl states and data snapshots

  3. Exception handling to prevent crash

  4. Data validation

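A sketch of item 3 combined with a timeout and retries, so one bad page cannot crash a worker; the retry policy is my own choice:

```python
import time
from urllib.request import urlopen


def fetch_with_retries(url, retries=3, timeout=5):
    # Catch network errors instead of letting them crash the worker,
    # backing off exponentially between attempts.
    for attempt in range(retries):
        try:
            return urlopen(url, timeout=timeout).read()
        except (OSError, ValueError):      # URLError/HTTPError are OSError subclasses
            time.sleep(2 ** attempt)
    return None                            # give up; caller logs and moves on
```
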
18

Where to add new modules, such as a PNG downloader?

After the content parser

19

How to check if a web page is a duplicate?

Compare hashes or checksums of the page contents

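A sketch of hash-based de-duplication; SHA-256 is just one reasonable choice of hash:

```python
import hashlib

seen_digests = set()


def is_duplicate(page_html: str) -> bool:
    # Identical content always produces the same digest, so only the
    # small digest needs to be stored and compared.
    digest = hashlib.sha256(page_html.encode("utf-8")).hexdigest()
    if digest in seen_digests:
        return True
    seen_digests.add(digest)
    return False
```
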
20

What is a spider trap?

A web page that traps the crawler in an infinite loop, for example an infinitely deep directory structure

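One common mitigation (not the only one) is to cap URL length and path depth; the limits below are arbitrary:

```python
from urllib.parse import urlparse

MAX_URL_LENGTH = 2000
MAX_PATH_DEPTH = 10


def looks_like_trap(url: str) -> bool:
    if len(url) > MAX_URL_LENGTH:
        return True
    depth = len([part for part in urlparse(url).path.split("/") if part])
    return depth > MAX_PATH_DEPTH
```
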
21

Wrap up talking points

  1. Server-side render JavaScript-heavy pages (e.g., SPAs) before parsing them

  2. Filter out spam webpages

  3. Increase database availability

  4. Collect analytics