Covers the step-by-step flow of data in a web crawler: Queue, Hot Path, Processing Path, and Controller Logic.
What are the main steps in the Data Flow of a web crawler?
Queue
Hot Path
Processing Path
Controller Logic
What happens in the Queue step of the data flow?
Start with an initial seed of URLs and insert them into the queue
Put the elements into append-only logs, sharded by domain name
Streaming jobs run every k minutes per domain, take one element from the logs, and put it into the HotPath Queue (while respecting robots.txt and other niceness properties)
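Below is a minimal Python sketch of the Queue step, with in-memory dicts and deques standing in for the sharded append-only logs and the HotPath queue; the seed URLs and the names (`enqueue`, `streaming_tick`, `allowed_by_robots`) are illustrative assumptions, not part of the original design.

```python
from collections import defaultdict, deque
from urllib.parse import urlparse

seed_urls = ["https://example.com/", "https://example.org/about"]  # initial seed
domain_logs = defaultdict(deque)   # append-only log per domain shard
hot_path_queue = deque()           # consumed by the Hot Path workers

def allowed_by_robots(domain: str) -> bool:
    # Placeholder; a real crawler would fetch and parse robots.txt here.
    return True

def enqueue(url: str) -> None:
    # Shard by domain name and append to that domain's log.
    domain_logs[urlparse(url).netloc].append(url)

def streaming_tick() -> None:
    # Runs every k minutes per domain: move one URL into the HotPath queue,
    # respecting robots.txt and other niceness properties.
    for domain, log in domain_logs.items():
        if log and allowed_by_robots(domain):
            hot_path_queue.append(log.popleft())

for url in seed_urls:
    enqueue(url)
streaming_tick()
print(list(hot_path_queue))
```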
What happens in the Hot Path step of the data flow?
Withdraw URL from the queue
Query DNS servers for the IP address (use cache if available)
Fetch the webpage contents over HTTP (e.g., a curl-style GET)
Save the raw HTML in an S3 bucket
Save the location in a StateDB indexed on the URL
For retries, push the failed item back into Amazon SQS
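A minimal sketch of the Hot Path follows, using only the standard library; local dicts and a list stand in for the S3 bucket, the StateDB, and the Amazon SQS retry queue, and every name here is an illustrative assumption.

```python
import socket
import urllib.request
from urllib.parse import urlparse

dns_cache: dict[str, str] = {}      # host -> IP, reused across fetches
html_store: dict[str, bytes] = {}   # stand-in for the S3 bucket
state_db: dict[str, str] = {}       # URL -> storage location
retry_queue: list[str] = []         # stand-in for Amazon SQS

def resolve(host: str) -> str:
    # Query DNS, using the cache if the host was already resolved.
    if host not in dns_cache:
        dns_cache[host] = socket.gethostbyname(host)
    return dns_cache[host]

def fetch(url: str) -> None:
    host = urlparse(url).netloc
    try:
        resolve(host)
        html = urllib.request.urlopen(url, timeout=10).read()  # HTTP GET
        key = f"raw/{host}/{abs(hash(url))}.html"
        html_store[key] = html      # save the HTML (S3 in production)
        state_db[url] = key         # index the storage location by URL
    except Exception:
        retry_queue.append(url)     # push the failed item back for retry (SQS)

fetch("https://example.com/")
print(state_db or retry_queue)
```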
What happens in the Processing Path step of the data flow?
Deduplicate content by comparing hashes (e.g., Redis set)
Convert the HTML into plain text and store it in blob storage
Extract hyperlinks and images
Update storage with processed data
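Here is a minimal sketch of the Processing Path, where a plain Python set stands in for the Redis hash set and a dict for blob storage; the regex-based HTML handling and all names are illustrative simplifications.

```python
import hashlib
import re

seen_hashes: set[str] = set()    # content hashes (a Redis set in production)
blob_store: dict[str, str] = {}  # processed text blobs

def process(html: str) -> list[str]:
    # Deduplicate by content hash before doing any further work.
    digest = hashlib.sha256(html.encode()).hexdigest()
    if digest in seen_hashes:
        return []
    seen_hashes.add(digest)

    text = re.sub(r"<[^>]+>", " ", html)                   # crude HTML -> text
    links = re.findall(r'href="([^"]+)"', html)            # extract hyperlinks
    images = re.findall(r'<img[^>]+src="([^"]+)"', html)   # extract images

    blob_store[f"text/{digest}"] = text                    # update storage
    return links + images

print(process('<a href="https://example.com/next">next</a>'))
```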
What does Controller Logic manage in the data flow?
Push new hyperlinks into the appropriate sharded queue after verifying they haven't already been visited
Keep track of crawl depth for each domain and stop once the depth limit is reached
Respect robots.txt and other niceness properties
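A minimal sketch of the Controller Logic, assuming a visited set, a per-domain depth counter, and stub versions of the enqueue/robots helpers from the Queue sketch above; MAX_DEPTH and all other names are illustrative assumptions.

```python
from urllib.parse import urlparse

MAX_DEPTH = 5                       # illustrative per-domain depth limit
visited: set[str] = set()           # URLs already pushed for crawling
domain_depth: dict[str, int] = {}   # deepest level reached per domain

def allowed_by_robots(domain: str) -> bool:
    return True                     # placeholder robots.txt / niceness check

def enqueue(url: str) -> None:
    print("queued:", url)           # stand-in for the sharded queue insert

def controller(new_links: list[str], parent_depth: int) -> None:
    for link in new_links:
        domain = urlparse(link).netloc
        depth = parent_depth + 1
        if link in visited:                   # already visited: skip
            continue
        if depth > MAX_DEPTH:                 # depth limit reached: stop
            continue
        if not allowed_by_robots(domain):     # respect robots.txt / niceness
            continue
        visited.add(link)
        domain_depth[domain] = max(domain_depth.get(domain, 0), depth)
        enqueue(link)                         # back into the sharded queue

controller(["https://example.com/a", "https://example.com/a"], parent_depth=0)
```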