Functional/Non-functional requirements
Functional: crawls websites for a given purpose (search indexing, web archival, web monitoring, etc.).
Non-functional: scalable, robust, extensible, polite.
High-level diagram
Deep Dive (non-diagram)
Enforcing Politeness- Honor robots.txt (the file a site uses to tell web crawlers which URLs to exclude).
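A minimal sketch of the robots.txt check, using Python's standard `urllib.robotparser`. The `PolitenessFilter` class name is an assumption for illustration; here the robots.txt body is passed in already downloaded, whereas a real crawler would fetch and cache it per host with a TTL.

```python
from urllib.robotparser import RobotFileParser

class PolitenessFilter:
    """Caches a parsed robots.txt per host and answers can-fetch queries."""

    def __init__(self, user_agent="my-crawler"):  # hypothetical agent name
        self.user_agent = user_agent
        self._parsers = {}

    def allowed(self, host, robots_txt, url):
        # robots_txt is the already-downloaded file body for this host.
        if host not in self._parsers:
            rp = RobotFileParser()
            rp.parse(robots_txt.splitlines())
            self._parsers[host] = rp
        return self._parsers[host].can_fetch(self.user_agent, url)
```

Any URL not matched by a Disallow rule for our user agent is fetchable; disallowed paths are skipped before they ever reach the downloader.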
Extensible- Want to be able to plug in new processing components, like a PNG extractor (after the Content Seen? component).
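One way to get that extensibility is a pipeline of pluggable processors run on each downloaded page; adding a new extractor means appending one more processor. This is a sketch under assumed names (`Pipeline`, `PngExtractor`, the `process` interface are all illustrative, not from the source).

```python
class PngExtractor:
    """Example plug-in: picks out PNG resources from crawled pages."""

    def process(self, url, body):
        if url.endswith(".png"):
            return ("png", url)
        return None

class Pipeline:
    """Runs every registered processor on each (url, body) pair."""

    def __init__(self, processors):
        self.processors = processors

    def run(self, url, body):
        results = []
        for p in self.processors:
            out = p.process(url, body)
            if out is not None:
                results.append(out)
        return results
```

New components (link extractor, image extractor, archiver) are added by registering them with the pipeline, with no change to the crawl loop itself.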
Freshness- Could recrawl pages based on their update history. Expensive, though.
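A sketch of one assumed recrawl policy (not from the source): shrink a page's recrawl interval when it changed since the last visit, grow it when it did not, clamped to sane bounds. This keeps crawl budget focused on volatile pages.

```python
def next_interval(current_hours, changed,
                  min_hours=1.0, max_hours=24 * 30):
    """Adapt a page's recrawl interval from its observed change history."""
    if changed:
        new = current_hours / 2    # page is volatile: revisit sooner
    else:
        new = current_hours * 1.5  # page is stable: back off
    return max(min_hours, min(max_hours, new))
```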
Robust- Save crawl state so that if a download fails, the crawler can pick up where it left off. Downloaders hash page content to check whether it has been seen before (Content Seen?).
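The content-seen check can be sketched as a set of content hashes; this assumes exact-duplicate detection via SHA-256, whereas production crawlers often use near-duplicate fingerprints (e.g., SimHash) instead.

```python
import hashlib

class ContentSeen:
    """Remembers hashes of downloaded page bodies to skip duplicates."""

    def __init__(self):
        self._seen = set()

    def is_duplicate(self, body: bytes) -> bool:
        digest = hashlib.sha256(body).hexdigest()
        if digest in self._seen:
            return True
        self._seen.add(digest)
        return False
```

Persisting `_seen` (and the frontier) to durable storage is what lets a restarted crawler resume without re-downloading everything.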
Deep Dive (High-Level Diagram)