CHAPTER 9: DESIGN A WEB CRAWLER

21 Terms

1

Web crawler purposes

  1. Search engine indexing

  2. Web archiving

  3. Web mining

  4. Web monitoring

2

Basic steps of a web crawler

  1. Download web pages for a given list of URLs

  2. Extract URLs (links) from the downloaded pages

  3. Repeat

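A minimal, standard-library-only Python sketch of this download/extract/repeat loop (the class and function names are illustrative, not from the chapter):

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class LinkExtractor(HTMLParser):
    """Collects href values from <a> tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed_urls, max_pages=100):
    frontier = deque(seed_urls)    # URLs waiting to be downloaded
    seen = set(seed_urls)          # avoids enqueueing the same URL twice
    downloaded = 0
    while frontier and downloaded < max_pages:
        url = frontier.popleft()
        try:
            html = urlopen(url, timeout=5).read().decode("utf-8", "replace")
        except (OSError, ValueError):
            continue               # skip unreachable or malformed URLs
        downloaded += 1
        extractor = LinkExtractor()
        extractor.feed(html)
        for link in extractor.links:
            absolute = urljoin(url, link)          # relative -> absolute
            if absolute.startswith("http") and absolute not in seen:
                seen.add(absolute)
                frontier.append(absolute)
    return seen
```
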
3

Requirement questions

  1. What is the purpose?

  2. How many pages are crawled per month?

  3. What content types: HTML only, or also PDF, images, etc.?

  4. Should we store the HTML contents, and for how long?

4

Characteristics of a good web crawler

  1. Scalable: parallelize the crawl to handle the huge number of pages

  2. Robust: handle bad HTML, unresponsive servers, crashes, etc.

  3. Polite: do not send too many requests to the same server

  4. Extensible: easy to support new content types and features

5

Estimation

  1. Pages to crawl per second

  2. Storage needed: average page size × pages crawled per second × retention period (in seconds)

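A back-of-envelope version of the same calculation, using assumed figures (1 billion pages per month, 500 KB average page size, 5 years of retention):

```python
PAGES_PER_MONTH = 1_000_000_000        # assumed
AVG_PAGE_SIZE = 500 * 1024             # 500 KB, assumed
YEARS = 5                              # assumed retention period

qps = PAGES_PER_MONTH / (30 * 24 * 3600)            # pages crawled per second
storage = AVG_PAGE_SIZE * qps * YEARS * 365 * 24 * 3600

print(f"pages per second ~= {qps:.0f}")             # ~400
print(f"storage ~= {storage / 1e15:.0f} PB")        # ~30 PB
```
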
6

Diagram components

  1. Seed URLs

  2. URL frontier

  3. DNS resolver

  4. HTML downloader

  5. Content parser

  6. Content storage

  7. URL extractor

  8. URL filter

  9. URL storage

7

What makes a good seed URL?

Popular websites, selected by country or by topic

8

Why do we need a content parser?

  1. Validate web page format

  2. Reduce storage space

9

What does the URL extractor do?

  1. Extract URLs (links) from the downloaded page content

  2. Convert relative URLs to absolute ones

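The relative-to-absolute conversion is exactly what urllib.parse.urljoin does; the URLs below are made-up examples:

```python
from urllib.parse import urljoin

base = "https://example.com/a/b/page.html"
print(urljoin(base, "../c/other.html"))        # https://example.com/a/c/other.html
print(urljoin(base, "/root.html"))             # https://example.com/root.html
print(urljoin(base, "https://example.org/x"))  # already-absolute URLs pass through
```
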
10

What does URL filter do?

  1. Exclude certain content types

  2. Exclude error links

  3. Exclude blacklisted sites

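A sketch of such a filter; the blacklist and extension list are illustrative placeholders:

```python
from urllib.parse import urlparse

BLACKLISTED_DOMAINS = {"spam.example.com"}       # hypothetical blacklist
EXCLUDED_EXTENSIONS = {".jpg", ".png", ".gif", ".css", ".js", ".pdf"}


def keep_url(url: str) -> bool:
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https"):   # drop mailto:, javascript:, ...
        return False
    if parsed.hostname in BLACKLISTED_DOMAINS:
        return False
    if any(parsed.path.lower().endswith(ext) for ext in EXCLUDED_EXTENSIONS):
        return False                             # unwanted content types
    return True
```
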
11

Requirements of URL frontier

  1. Prioritize more important URLs

  2. Rate limit requests to each domain

12

URL frontier components

  1. Prioritizer: computes the priority of each URL

  2. Front queues: one queue per priority level

  3. Front queue selector: chooses a front queue to pull from, biased toward higher priority

  4. Mapping table: maps each domain to its back queue

  5. Back queue router: enqueues a URL into the back queue for its domain

  6. Back queues: each holds URLs from a single domain (politeness)

  7. Back queue selector: assigns back queues to worker threads

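A toy in-memory sketch of the two queue layers (prioritization in front, per-domain politeness in back); the class and method names are mine, not the chapter's:

```python
import random
from collections import defaultdict, deque
from urllib.parse import urlparse


class Frontier:
    def __init__(self, num_priorities=3):
        # Front queues: one per priority level (0 = highest priority).
        self.front = [deque() for _ in range(num_priorities)]
        # Back queues: one per domain (this dict doubles as the mapping table).
        self.back = defaultdict(deque)

    def add(self, url, priority):
        # The prioritizer's output decides which front queue the URL lands in.
        self.front[priority].append(url)

    def _front_select(self):
        # Front queue selector: pick a non-empty queue, biased toward high priority.
        nonempty = [i for i, q in enumerate(self.front) if q]
        if not nonempty:
            return None
        weights = [2 ** (len(self.front) - i) for i in nonempty]
        chosen = random.choices(nonempty, weights=weights, k=1)[0]
        return self.front[chosen].popleft()

    def route_to_back(self):
        # Back queue router: move a URL into the back queue for its domain.
        url = self._front_select()
        if url is not None:
            self.back[urlparse(url).hostname].append(url)

    def next_for_worker(self, domain):
        # Back queue selector: a worker thread drains only its assigned domain,
        # which keeps requests to any single host spaced out (politeness).
        queue = self.back.get(domain)
        return queue.popleft() if queue else None
```
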
13

When to recrawl pages

  1. Based on historical update frequency

  2. Based on importance

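One illustrative way to combine the two signals into a schedule; the base interval and weighting are made up, not from the chapter:

```python
from datetime import datetime, timedelta


def next_crawl_time(last_crawled: datetime,
                    changes_per_day: float,
                    importance: float) -> datetime:
    """importance is assumed to be a normalized score in (0, 1]."""
    base_interval_days = 30                      # slowest revisit cadence (assumed)
    # Pages that change often or matter more get a shorter revisit interval.
    interval = base_interval_days / ((1 + changes_per_day) * max(importance, 0.01))
    return last_crawled + timedelta(days=min(max(interval, 0.5), 365))
```
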
14

How does storage for the URL frontier work?

  1. Store URLs on disk because they won't all fit in memory

  2. Write to an in-memory buffer first, then bulk-write to disk

  3. Bulk-fetch from disk into an in-memory buffer for dequeuing

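A toy version of such a disk-backed queue; the file format and batch size are arbitrary choices, and a real frontier would use a more efficient on-disk layout:

```python
from collections import deque


class DiskBackedQueue:
    def __init__(self, path="frontier.txt", batch_size=1000):
        self.path = path
        self.batch_size = batch_size
        self.write_buffer = []            # recently enqueued URLs
        self.read_buffer = deque()        # URLs prefetched for dequeue
        open(self.path, "a").close()      # make sure the file exists

    def enqueue(self, url):
        self.write_buffer.append(url)
        if len(self.write_buffer) >= self.batch_size:
            self._flush()

    def _flush(self):
        with open(self.path, "a") as f:   # bulk write to disk
            f.write("".join(u + "\n" for u in self.write_buffer))
        self.write_buffer.clear()

    def dequeue(self):
        if not self.read_buffer:
            self._refill()
        return self.read_buffer.popleft() if self.read_buffer else None

    def _refill(self):
        # Bulk-fetch one batch into memory; keep the rest on disk.
        self._flush()
        with open(self.path) as f:
            lines = f.read().splitlines()
        self.read_buffer.extend(lines[: self.batch_size])
        with open(self.path, "w") as f:
            f.write("".join(u + "\n" for u in lines[self.batch_size:]))
```
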
15

What is Robots.txt

  1. Robots exclusion protocol

  2. Specifies what pages crawlers are allowed to download

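Python's standard library already implements the robots exclusion protocol; the site URL and user-agent string below are placeholders:

```python
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()                                  # downloads and parses robots.txt

if rp.can_fetch("MyCrawler/1.0", "https://example.com/some/page.html"):
    print("allowed to download this page")
else:
    print("disallowed by robots.txt")
```
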
16

How to speed up HTML downloading

  1. Scale out the downloader: distribute the crawl across multiple servers and threads

  2. Cache DNS records to avoid repeated lookups

  3. Deploy downloaders geographically closer to the website hosts

  4. Specify a short max wait time so slow servers don't block workers

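A minimal sketch of the DNS cache from item 2; the 5-minute TTL is an arbitrary choice:

```python
import socket
import time

_dns_cache = {}            # hostname -> (ip_address, expiry_timestamp)
_TTL_SECONDS = 300


def resolve(hostname: str) -> str:
    cached = _dns_cache.get(hostname)
    if cached and cached[1] > time.time():
        return cached[0]                       # cache hit
    ip = socket.gethostbyname(hostname)        # blocking DNS lookup
    _dns_cache[hostname] = (ip, time.time() + _TTL_SECONDS)
    return ip
```
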
17

How to make the system more robust

  1. Have multiple downloaders

  2. Save crawl states and data snapshots

  3. Exception handling to prevent crash

  4. Data validation

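A sketch of item 3 combined with a timeout and retries, so one bad page cannot crash a worker; the retry policy is my own choice:

```python
import time
from urllib.request import urlopen


def fetch_with_retries(url, retries=3, timeout=5):
    # Catch network errors instead of letting them crash the worker,
    # backing off exponentially between attempts.
    for attempt in range(retries):
        try:
            return urlopen(url, timeout=timeout).read()
        except (OSError, ValueError):      # URLError/HTTPError are OSError subclasses
            time.sleep(2 ** attempt)
    return None                            # give up; caller logs and moves on
```
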
18

Where to add new modules, such as a PNG downloader?

After the content parser

19

How to check if a web page is a duplicate?

Compare hashes or checksums of the page contents

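A sketch of hash-based de-duplication; SHA-256 is just one reasonable choice of hash:

```python
import hashlib

seen_digests = set()


def is_duplicate(page_html: str) -> bool:
    # Identical content always produces the same digest, so only the
    # small digest needs to be stored and compared.
    digest = hashlib.sha256(page_html.encode("utf-8")).hexdigest()
    if digest in seen_digests:
        return True
    seen_digests.add(digest)
    return False
```
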
20

What is a spider trap?

A web page that traps the crawler in an infinite loop, for example an infinitely deep directory structure

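One common mitigation (not the only one) is to cap URL length and path depth; the limits below are arbitrary:

```python
from urllib.parse import urlparse

MAX_URL_LENGTH = 2000
MAX_PATH_DEPTH = 10


def looks_like_trap(url: str) -> bool:
    if len(url) > MAX_URL_LENGTH:
        return True
    depth = len([part for part in urlparse(url).path.split("/") if part])
    return depth > MAX_PATH_DEPTH
```
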
21

Wrap up talking points

  1. Server-side render JavaScript-heavy pages (e.g., SPAs) before parsing them

  2. Filter out spam webpages

  3. Increase database availability

  4. Collect analytics