Web Crawler – Data Flow

Description and Tags

Covers the step-by-step flow of data in a web crawler: Queue, Hot Path, Processing Path, and Controller Logic.

5 Terms

1

What are the main steps in the Data Flow of a web crawler?

  • Queue

  • Hot Path

  • Processing Path

  • Controller Logic
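
The four stages can be read as a pipeline in which each stage feeds the next. Below is a minimal Python sketch of that flow; the function names and stub bodies are illustrative stand-ins, not an actual implementation of any of the stages.

```python
from collections import deque

# Illustrative skeleton only: each stub mirrors one of the four stages above;
# function names and bodies are placeholders, not a real crawler.

def queue_stage(seed_urls):
    """Queue: start from an initial seed of URLs (domain sharding omitted)."""
    return deque(seed_urls)

def hot_path(url):
    """Hot Path: DNS lookup and HTTP fetch would happen here."""
    return f"<html>stub page for {url}</html>"

def processing_path(html):
    """Processing Path: dedupe, convert to text, extract hyperlinks."""
    return []  # extracted hyperlinks

def controller(links, queue, seen):
    """Controller Logic: enqueue only links that have not been visited yet."""
    for link in links:
        if link not in seen:
            seen.add(link)
            queue.append(link)

def crawl(seed_urls):
    queue, seen = queue_stage(seed_urls), set(seed_urls)
    while queue:
        url = queue.popleft()
        links = processing_path(hot_path(url))
        controller(links, queue, seen)

crawl(["https://example.com"])
```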

2

What happens in the Queue step of the data flow?

  • Start with an initial seed of URLs and insert them into the queue

  • Put the elements into append-only logs, sharded by domain name

  • Streaming jobs run every k minutes per domain, take one element from the logs, and put it into the HotPath Queue (while respecting robots.txt and other niceness properties)
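
A minimal sketch of the sharding idea, assuming in-memory lists stand in for the append-only logs and for the HotPath Queue, a single function call stands in for the every-k-minutes streaming job, and the robots.txt check is reduced to a stub:

```python
from collections import defaultdict
from urllib.parse import urlparse

# Sketch of the Queue step: seeds are appended to per-domain logs (lists standing
# in for append-only logs), and a periodic job moves one entry per domain into
# the HotPath Queue.

def shard_seeds(seed_urls):
    logs = defaultdict(list)                      # domain -> append-only log
    for url in seed_urls:
        logs[urlparse(url).netloc].append(url)
    return logs

def allowed_by_robots(url):
    return True  # placeholder for a real robots.txt / politeness check

def streaming_job(logs, hot_path_queue):
    """Runs every k minutes per domain in the real design; here it does one pass."""
    for domain, log in logs.items():
        if log and allowed_by_robots(log[0]):
            hot_path_queue.append(log.pop(0))     # one element per domain per run

logs = shard_seeds(["https://a.com/x", "https://a.com/y", "https://b.com/z"])
hot_path_queue = []
streaming_job(logs, hot_path_queue)
print(hot_path_queue)                             # ['https://a.com/x', 'https://b.com/z']
```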

3

What happens in the Hot Path step of the data flow?

  • Dequeue a URL from the HotPath Queue

  • Query DNS servers for the IP address (use a cached result if available)

  • Fetch the page contents over HTTP (e.g., with curl)

  • Save the raw HTML to an S3 bucket

  • Record the storage location in a StateDB indexed on the URL

  • On failure, push the item back into Amazon SQS for a retry
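
A rough sketch of the Hot Path, assuming boto3 and requests are available with AWS credentials configured; the bucket name, SQS queue URL, and the in-memory DNS cache and StateDB below are hypothetical stand-ins for real infrastructure:

```python
import socket
from urllib.parse import urlparse

import boto3     # assumes boto3 and AWS credentials/region are configured
import requests  # assumes requests is installed

s3 = boto3.client("s3")
sqs = boto3.client("sqs")
dns_cache = {}   # hostname -> IP address
state_db = {}    # URL -> S3 location (stand-in for a real StateDB)

BUCKET = "crawler-raw-html"                    # hypothetical bucket name
RETRY_QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/crawler-retries"  # hypothetical queue

def hot_path(url):
    host = urlparse(url).hostname
    try:
        # DNS lookup with a simple cache (the IP would be handed to the HTTP
        # client in a fuller implementation)
        ip = dns_cache.setdefault(host, socket.gethostbyname(host))
        html = requests.get(url, timeout=10).text                # fetch the page
        key = f"raw/{host}/{abs(hash(url))}.html"
        s3.put_object(Bucket=BUCKET, Key=key, Body=html)         # save raw HTML
        state_db[url] = f"s3://{BUCKET}/{key}"                   # record location by URL
    except Exception:
        sqs.send_message(QueueUrl=RETRY_QUEUE_URL, MessageBody=url)  # push back for retry
```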

4

What happens in the Processing Path step of the data flow?

  • Deduplicate content by comparing content hashes (e.g., kept in a Redis set)

  • Convert the HTML into plain text and store it in blob storage

  • Extract hyperlinks and images

  • Update storage with processed data
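
A sketch of the Processing Path under the assumption that a local Redis instance holds the dedupe set, the standard-library HTMLParser does the text/link extraction, and a plain dict stands in for blob storage:

```python
import hashlib
from html.parser import HTMLParser

import redis   # assumes a Redis server is reachable; used only for the dedupe set

r = redis.Redis()
blob_storage = {}          # content hash -> extracted text (stand-in for blob storage)

class Extractor(HTMLParser):
    """Collects plain text, hyperlinks, and image sources from an HTML page."""
    def __init__(self):
        super().__init__()
        self.text, self.links, self.images = [], [], []
    def handle_data(self, data):
        self.text.append(data.strip())
    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and "href" in attrs:
            self.links.append(attrs["href"])
        elif tag == "img" and "src" in attrs:
            self.images.append(attrs["src"])

def process(html):
    digest = hashlib.sha256(html.encode()).hexdigest()
    if r.sadd("seen_content_hashes", digest) == 0:
        return []                                # duplicate content, skip it
    parser = Extractor()
    parser.feed(html)
    blob_storage[digest] = " ".join(t for t in parser.text if t)  # store plain text
    return parser.links                          # hand hyperlinks to the controller
```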

5

What does Controller Logic manage in the data flow?

  • Push new hyperlinks into the appropriate sharded queue after verifying they have not already been visited

  • Keep track of crawl depth for each domain and stop once the depth limit is reached

  • Respect robots.txt and other niceness properties
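
A sketch of the Controller Logic, with the depth limit, visited set, sharded queues, and robots.txt check all reduced to in-memory stand-ins; MAX_DEPTH, the function names, and the simplified per-domain depth accounting are illustrative assumptions:

```python
from collections import defaultdict
from urllib.parse import urlparse

MAX_DEPTH = 5                       # hypothetical per-domain depth limit
visited = set()
domain_depth = defaultdict(int)     # domain -> deepest level crawled so far
sharded_queues = defaultdict(list)  # domain -> queue shard

def allowed_by_robots(url):
    return True                     # placeholder for a real robots.txt check

def controller(new_links, parent_depth):
    for url in new_links:
        domain = urlparse(url).netloc
        if url in visited:                      # skip already-visited URLs
            continue
        if parent_depth + 1 > MAX_DEPTH:        # stop once the depth limit is reached
            continue
        if not allowed_by_robots(url):          # politeness / robots.txt
            continue
        visited.add(url)
        domain_depth[domain] = max(domain_depth[domain], parent_depth + 1)
        sharded_queues[domain].append((url, parent_depth + 1))
```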