More Unethical SEO
-Keyword Stuffing (e.g. misleading meta tags, repeating words, hiding keywords)
-Cloaking (Serving different content to a spider than to a browser, e.g. via redirects)
-Link exchanges (pages link to each other for no real reason)
Spam
2 types:
Link spamming- Bots that search for blogs and leave comments with links
Clicker bots- Bots that issue queries and "click" on targeted query results
Fighting spam is rapidly advancing (on BOTH sides)
Webmaster Guidelines
Search engines have SEO policies - What is allowed and not allowed
They must not be ignored - Once a site is blacklisted by a search engine, it will virtually disappear from the Web
Uniform Resource Identifiers (URI)
A string of characters used to identify a resource
Name what the resource is
URLs also fall under this (and also specify where the resource is)
Anatomy of a URL
scheme:// domain:port / path?query_string#fragment_id
scheme and domain are mandatory, everything else is optional
subdomains are prepended in front of the domain name, like this: calendar.ics.uci.edu (uci.edu is the domain)
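A quick way to see these parts, using Python's standard library (the URL itself is made up for illustration):

    from urllib.parse import urlparse

    url = "https://calendar.ics.uci.edu:443/search?q=crawler#results"
    parts = urlparse(url)

    print(parts.scheme)    # 'https'
    print(parts.hostname)  # 'calendar.ics.uci.edu'
    print(parts.port)      # 443
    print(parts.path)      # '/search'
    print(parts.query)     # 'q=crawler'
    print(parts.fragment)  # 'results'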
How to acquire data?
Data dumps
URL downloads (Find out the URLs of specific resources + Run a downloader that takes that list and downloads the resources)
Web APIs
Web Crawling (should be treated as a last resort)
Knuth's definition of algorithm
Finiteness : does it end after a finite number of steps?
Definiteness : are the steps rigorously and unambiguously specified?
Input : does it have zero or more inputs?
Output : does it have one or more outputs related to the inputs?
Effectiveness : are the operations basic enough for someone to perform them using pencil and paper only?
Basic Crawl Algorithm
Initialize a queue of URLs (seeds)
Repeat until no more URLs in queue:
- Get one URL from the queue
- If the page can be crawled, fetch associated page
- Store representation of page
- Extract URLs from page and add them to the queue
Queue = "frontier"
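A minimal Python sketch of this loop (can_crawl and store are placeholders for real checks and storage, and a real crawler would use an HTML parser instead of a regex):

    from collections import deque
    from urllib.parse import urljoin
    from urllib.request import urlopen
    import re

    def can_crawl(url):
        # placeholder: a real crawler checks robots.txt, URL scheme, traps, ...
        return url.startswith("http")

    def store(url, page):
        # placeholder: a real crawler stores a representation of the page
        print("fetched", url, len(page), "characters")

    def basic_crawl(seeds):
        frontier = deque(seeds)          # the queue of URLs is the "frontier"
        seen = set(seeds)                # don't queue the same URL twice
        while frontier:                  # repeat until no more URLs in queue
            url = frontier.popleft()     # get one URL from the queue
            if not can_crawl(url):
                continue
            try:
                page = urlopen(url, timeout=10).read().decode("utf-8", errors="replace")
            except Exception:            # unresponsive servers, bad URLs, ...
                continue
            store(url, page)
            # naive link extraction: find href attributes and add new URLs
            for link in re.findall(r'href="([^"]+)"', page):
                absolute = urljoin(url, link)
                if absolute not in seen:
                    seen.add(absolute)
                    frontier.append(absolute)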
Drawbacks of the basic algorithm
Will upset web admins (impolite) - It's abusing the web servers
Very slow - 1 page at a time
Will get caught in traps and infinite sequences
Will fetch duplicates without noticing
Will bring in data noise
Will miss content due to client-side scripting
Permission to crawl
Be polite: try to ask the website if you can crawl it first!
Robots Exclusion Standard aka robots.txt
- Sites may have that file at the root
- Very simple syntax (but no formal/official standard yet!)
It works on an honor basis! It's not a security mechanism
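Python's standard library can read and honor robots.txt directly; a minimal sketch (site and user-agent name are hypothetical):

    from urllib.robotparser import RobotFileParser

    rp = RobotFileParser("https://www.example.com/robots.txt")
    rp.read()                     # fetch and parse the site's robots.txt

    if rp.can_fetch("MyCrawler", "https://www.example.com/private/page.html"):
        pass                      # polite crawlers fetch only when this is True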
Sitemaps
Also listed in robots.txt
Allow webmasters to send info to crawlers.
Usually generated by scripts
Contain lists of URLs and data about those URLs, E.g.
• Location of pages that might not be linked
• Relative importance
• Update frequency
• "Last updated" / Modification time
• You can allow a specific resource inside an otherwise disallowed section (e.g. you can crawl these listed pages but the rest is off limits)
Can inform the crawler about pages it might not find
Gives crawler a hint about when to check a page for changes
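Sitemaps are just XML, so they are easy to parse; a minimal sketch using the standard library (the sitemap URL is hypothetical):

    import urllib.request
    import xml.etree.ElementTree as ET

    SITEMAP_URL = "https://www.example.com/sitemap.xml"
    NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

    xml_data = urllib.request.urlopen(SITEMAP_URL).read()
    root = ET.fromstring(xml_data)

    for url in root.findall("sm:url", NS):
        loc = url.findtext("sm:loc", namespaces=NS)                 # page location
        lastmod = url.findtext("sm:lastmod", namespaces=NS)         # "last updated" hint
        changefreq = url.findtext("sm:changefreq", namespaces=NS)   # update frequency hint
        print(loc, lastmod, changefreq)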
Politeness
Avoid hitting any site too often - Sites are for people, not for bots
Ignore politeness -> Denial of service (DOS) attack
Be polite -> Use artificial delays
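A minimal per-host delay tracker (the 2-second delay is an arbitrary choice for illustration):

    import time

    DELAY = 2.0            # seconds between requests to the same host (assumed value)
    last_hit = {}          # host -> time of the last request

    def wait_politely(host):
        elapsed = time.time() - last_hit.get(host, 0.0)
        if elapsed < DELAY:
            time.sleep(DELAY - elapsed)   # artificial delay before hitting the host again
        last_hit[host] = time.time()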
Performance (I)
Back of the envelope calculation:
- If 1 page fetch = 500ms, then how much time to crawl 1 million pages? (it's worse than that... Unresponsive servers)
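Worked out sequentially: 1,000,000 × 0.5 s = 500,000 s ≈ 139 hours ≈ 5.8 days - and that is before counting unresponsive servers.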
Most of the time, the crawler thread is waiting for the network data
Solution: multi-threaded or distributed crawling - politeness is harder to control, but it is possible (e.g. crawl different servers in parallel)
Performance (II)
Domain Name lookups - Given a domain name, retrieve its IP address
www.ics.uci.edu -> 128.195.1.83
Distributed set of servers - Latency can be high (2 secs is not unusual)
Common implementations are blocking - One request at a time, result is cached
Back of the envelope calculation - if 1 DNS lookup = 800ms, how much time to lookup the entire Web?
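A rough answer, assuming (purely for illustration) 100 million unique hostnames resolved one at a time: 100,000,000 × 0.8 s = 80,000,000 s ≈ 925 days, i.e. roughly 2.5 years - which is why caching and parallel lookups are essential.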
Crawler traps
May trap the crawler on the site forever - Web server responds with ever changing URLs and content (Dynamic pages)
Can be intentional or unintentional
E.g. the ICS calendar is a crawler trap
Some webadmins can create traps to penalize impolite crawlers
See http://www.fleiner.com/bots/
E.g. very large documents, disallowed in robots.txt, created to consume crawler resources or even to break poorly designed parsers of crawlers that ignore robots.txt
Duplicate Detection
Exact and near duplication are widespread
- Copies, mirror sites, versions, spam, plagiarism...
- Studies: 30% of Web pages are [near-]duplicates of the other 70%
- Little or no value (noise to the search engine and the user; you can show only one to the user and perhaps a "show similar" link)
Detection
- Detection of exact duplication is easy, but exact duplication is rare (use Hashes, checksums)
- Detection of near-duplicates is harder (use Page fingerprints)
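One simple fingerprinting idea (a sketch, not the exact technique of any particular engine): hash overlapping word n-grams ("shingles") and compare the sets.

    import hashlib

    def shingles(text, k=8):
        # hashed k-word shingles of a document (k is an arbitrary choice)
        words = text.lower().split()
        return {
            hashlib.md5(" ".join(words[i:i + k]).encode()).hexdigest()
            for i in range(max(1, len(words) - k + 1))
        }

    def jaccard(a, b):
        # Jaccard similarity of two shingle sets: |A ∩ B| / |A ∪ B|
        if not a and not b:
            return 1.0
        return len(a & b) / len(a | b)

    # Pages whose similarity exceeds some threshold (e.g. 0.9) are treated as
    # near-duplicates; exact duplicates can be caught with one hash of the whole page.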
Data Noise
Web pages have content not directly related to the page
- Ads, templates, etc
- Noise negatively impacts information retrieval
Technique 1: Cumulative distribution of tags - document slope curve (a sketch follows below)
(Other techniques in literature)
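A minimal sketch of the underlying curve, assuming the page has already been split into a mixed sequence of tag tokens and text tokens (the full technique then searches this curve for its flattest, i.e. most text-heavy, region):

    def document_slope_curve(tokens):
        # curve[i] = number of tag tokens among the first i+1 tokens;
        # the main content is the stretch where this curve is flattest
        curve, tags_seen = [], 0
        for tok in tokens:
            if tok.startswith("<"):      # crude test: the token is an HTML tag
                tags_seen += 1
            curve.append(tags_seen)
        return curve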
Client-Side Scripting
Modern web sites are heavily scripted (JavaScript, TypeScript) - Content behind XMLHttpRequests
To get to that content crawlers must interact with the scripts
- Hard thing to do: user interaction emulation (e.g. Selenium)
- Most crawlers will not do it and the content will never be indexed
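One common way to emulate a user is to drive a real browser, e.g. with Selenium; a minimal sketch (assumes Chrome and a matching driver are installed; the URL is hypothetical):

    from selenium import webdriver

    driver = webdriver.Chrome()
    driver.get("https://www.example.com/app")   # scripts and XMLHttpRequests run here
    html_after_scripts = driver.page_source     # the DOM as a user's browser would see it
    driver.quit()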
The Deep Web
Places where crawlers rarely go...
- Content behind login forms
- Content behind JavaScript/TypeScript
- Sites that aren't linked from anywhere
It is estimated that the deep web is 400-500x larger than the shallow web
Guidelines for bot writers
-Reconsider - do you really need a bot?
- Be Accountable - if your actions cause problems, be available to take prompt action in response
- Test Locally - expand the scope gradually before you unleash your crawler on others
- Don't hog resources - web servers are for people primarily. Walk, don't run.
- Stay with it - "it's vital to know what your robot is doing, and that it remains under control".
Find out the sites' crawling policies
Contact the Web admins
The Web and the Law
The Web is only ~30 years old and was born at CERN for physics research
Wild Wild West (or Wild Wild Web...), for the most part
Very few laws related to the internet
Existing laws either untested or outdated
Our own judgments/actions matter
First web search engines
W3Catalog (1993) : primitive catalogue
World Wide Web Wanderer (1993) : first web crawler to measure the size of the web
JumpStation (1993) : crawling, indexing, searching. First "modern" engine.
Excite and RankDex (1994) : ranking system inspired Google's PageRank
WebCrawler (1994) : full text indexing
Who offends whom?
Users can offend providers (personal information/privacy issues)
Providers can in turn offend users (content/security issues)
Copyrights
Products of intellect are ruled by "Copyright Law"
Web content, source code, etc. are products of intellect
It protects works regardless of whether they have been published or not
Fair Use
Various purposes for which the reproduction of a particular work may be considered fair:
- Criticism
- Comment
- News
- Teaching
- Scholarship
- Research
It's a very grey area
Fair Use Factors
-The purpose and character of the use, including whether such use is of commercial nature or is for nonprofit educational purposes
- The nature of the copyrighted work
- The amount and substantiality of the portion used in relation to the copyrighted work as a whole (are you using the "core" of the work?)
- The effect of the use upon the potential market for, or value of, the copyrighted work
eBay vs. Bidder's Edge (May 24, 2000)
-eBay successfully stopped crawlers from Bidder's Edge
-"trespass to chattels"
-Court ruling:
Bidder's Edge intentionally and without authorization interfered with eBay's possessory interest in the computer system.
Bidder's Edge's unauthorized use proximately resulted in damage to eBay
American Airlines vs. FareChase (March 10, 2003)
-American Airlines stopped crawlers from FareChase from scraping the AA.com website for "web fares" information (online fare comparison)
-"trespass to chattels"
-The suit was settled later that year.
Court ruling:
-FareChase's actions are intentional and without American's consent
-FareChase's conduct interferes with American's computer network and system (...) such actions adversely affect and harm American and the condition, quality and value of American's property
Two major jobs of the crawler
download webpages if authorized, and discover new URLs.
How does the crawler see the web?
It sees the web through the frontier: it grabs URLs from the frontier and adds newly discovered URLs back to it
Retrieving Web Pages
-Web crawler client program connects to a domain name system (DNS) server
-DNS server translates the hostname into an internet protocol (IP) address
-Crawler then attempts to connect to the server host using a specific port
-After connection, crawler sends an HTTP request to the web server to request a page - usually a GET request
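A minimal sketch of those steps with the Python standard library (the host is the example from above; in practice the response may just be a redirect, e.g. to HTTPS):

    import socket
    import http.client

    host = "www.ics.uci.edu"
    ip = socket.gethostbyname(host)               # DNS: hostname -> IP address
    print(ip)                                     # e.g. 128.195.1.83

    conn = http.client.HTTPConnection(host, 80)   # connect to a specific port
    conn.request("GET", "/")                      # send an HTTP GET request
    response = conn.getresponse()
    page = response.read()                        # the page (or a redirect) from the server
    conn.close()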
Problem: Web crawlers spend a lot of time waiting for responses to requests
Solution: To reduce this inefficiency, web crawlers use threads and fetch hundreds of pages at once
BUT crawlers could potentially flood sites with requests for pages
To avoid this problem, web crawlers use politeness policies- e.g., distribute requests between different servers, delay between requests to same web server, ...
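A minimal threaded-fetch sketch using the standard library (the URLs are placeholders taken from the frontier; a real crawler would combine this with per-host delays as above):

    from concurrent.futures import ThreadPoolExecutor
    from urllib.request import urlopen

    def fetch(url):
        # hypothetical fetch helper; real crawlers add politeness and retries
        try:
            return url, urlopen(url, timeout=10).read()
        except Exception:
            return url, None

    urls = ["https://www.example.com/a", "https://www.example.com/b"]  # from the frontier
    with ThreadPoolExecutor(max_workers=100) as pool:   # hundreds of pages in flight
        for url, body in pool.map(fetch, urls):
            pass  # store a representation of each fetched page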
Where to enforce politeness?
When grabbing URLs from the frontier and also when grabbing a web page's links
Freshness
Definition: Freshness is the proportion of pages that are fresh in the collection
In a website, webpages are constantly added, deleted, and modified
Stale copies (or their representation in our index) no longer reflect the real contents of the web pages
Web crawler must continually revisit pages- Check for modifications and maintain the freshness of the collection
Not possible to constantly check all pages- must check important pages and pages that change frequently
HTTP protocol has a special request type called HEAD that makes it easy to check for page changes- returns information about page, not page itself
Optimizing for this metric is dangerous - e.g. if a website is updated constantly, optimizing purely for freshness would lead you to never crawl that site (it can never count as fresh, so it earns no freshness credit)
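A HEAD-based change check like the one mentioned above might look like this (host and path are hypothetical):

    import http.client

    conn = http.client.HTTPSConnection("www.example.com")
    conn.request("HEAD", "/some/page.html")        # returns headers only, not the page
    resp = conn.getresponse()
    print(resp.status, resp.getheader("Last-Modified"))  # compare with our stored copy
    conn.close()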
Age
Another time metric
Age of a page = how long the page has been changed without us re-crawling it (0 if it has not changed since the last crawl); its expected value, for a page with mean change rate λ crawled t time units ago, is given by the formula below
Web page updates follow the Poisson distribution on average - time until the next update is governed by an exponential distribution
The older a page gets, the more it costs not to crawl it - e.g., the expected age keeps growing ever faster for any given mean change frequency
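For reference, under the Poisson-change model above the age integral has a simple closed form (a quick derivation, not taken from the slides):

    \mathrm{Age}(\lambda, t) \;=\; \int_0^t \lambda e^{-\lambda x}\,(t - x)\, dx \;=\; t - \frac{1 - e^{-\lambda t}}{\lambda}

Its second derivative, \lambda e^{-\lambda t}, is positive, so expected age grows faster and faster the longer a page goes uncrawled - the "costs more not to crawl it" effect.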
Distributed Crawling
Three reasons to use multiple computers for crawling
- Helps to put the crawler closer to the sites it crawls; lower requirements in the network
- Reduces the number of sites each crawler has to remember; lower requirements on local memory
- Reduces the locally required resources
Distributed crawlers may use a hash function to assign URLs to crawling computers/threads
- A hash function is computed on the host part of each URL (so all URLs from the same host go to the same crawler)
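A minimal sketch of host-based assignment (the function and crawler count are made up for illustration):

    import zlib
    from urllib.parse import urlparse

    def assign_crawler(url, num_crawlers):
        # hash the host so every URL from one host lands on the same machine/thread,
        # which simplifies politeness and duplicate detection
        host = urlparse(url).hostname or ""
        return zlib.crc32(host.encode("utf-8")) % num_crawlers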
Focused Crawling
-Attempts to download only those pages that are about a particular topic
- used by vertical search applications
- can also be used to improve a search engine in a specific topic, or around specific events in time
-Relies on the fact that pages about a topic tend to have links to other pages on the same topic
- popular pages for a topic are typically used as seeds
- E.g. pages that many users are clicking after they search a keyword
-Crawler uses some text classifier to decide whether a page is on topic
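A toy stand-in for "some text classifier" (topic terms and threshold are made up; a real focused crawler would use a trained classifier):

    TOPIC_TERMS = {"surfing", "surfboard", "wave", "wetsuit"}   # hypothetical topic

    def on_topic(page_text, threshold=3):
        # follow a page's links only if enough topic terms appear in its text
        words = set(page_text.lower().split())
        return len(words & TOPIC_TERMS) >= threshold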
Desktop Crawls
-Used for desktop search and enterprise search
-Differences with respect to web crawling:
- Much easier to find the data (file system ~ sitemap)
- Responding quickly to updates is more important
- Must be conservative in terms of disk, memory and CPU usage (It should be transparent to the user.)
- Need to cope with many different document formats
- Data privacy much more important than in the web context
Document Feeds Crawls
-In general on the web: documents are created and updated
-But many documents are also just published (created at a fixed time and rarely updated again)
e.g., news articles, blog posts, press releases, email
-Document feed: Published documents from a single source can be ordered in a sequence
- Crawlers can find new documents by examining the end of the feed
Two types of document feeds
push feed: alerts the subscriber to new documents, involves subscription
• E.g. news agencies
pull feed: requires the subscriber to check periodically for new documents
most common format is RSS (Really Simple Syndication, Resource Description Framework Site Summary, Rich Site Summary)
Major characteristics of RSS to crawlers
ttl tag (time to live) - amount of time (in minutes) contents should be cached
RSS feeds are accessed like web pages
- using HTTP GET requests to web servers that host them
- You can use HTTP HEAD requests to check if the feed has changed
Easy for crawlers to parse (it is just XML)
Easy to find new information:
- just GET, parse and if any, grab the new
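A minimal pull-feed polling sketch, assuming an RSS 2.0 feed at a hypothetical URL:

    import urllib.request
    import xml.etree.ElementTree as ET

    FEED_URL = "https://www.example.com/news/rss.xml"
    seen_links = set()

    xml_data = urllib.request.urlopen(FEED_URL).read()    # GET the feed
    channel = ET.fromstring(xml_data).find("channel")     # RSS is just XML
    ttl = channel.findtext("ttl")                         # minutes the feed may be cached, if present
    for item in channel.findall("item"):
        link = item.findtext("link")
        if link not in seen_links:                        # grab only the new documents
            seen_links.add(link)
            print("new document:", item.findtext("title"), link)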
The need for conversion
Text is stored in hundreds of incompatible file formats - e.g., raw text, RTF, HTML, XML, Microsoft Word, ODF, PDF
Other types of files, not dedicated to text, are also important - e.g., PowerPoint, Excel, Keynote, Pages
Typically use a conversion tool or library
- converts the document content into a tagged text format such as HTML or XML
- retains some of the important formatting information
- Some converters will not retain formatting (e.g. *nix's pdftotext, pdf2ps + pdf2ascii, etc.)
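For example, conversion is often just a call out to an existing tool (file names here are hypothetical, and the plain-text output keeps none of the formatting):

    import subprocess

    # shell out to poppler's pdftotext: pdftotext input.pdf output.txt
    subprocess.run(["pdftotext", "report.pdf", "report.txt"], check=True)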
Glyphs
the symbols that you see, that encode letters or meanings
Character Encoding
A character encoding is a mapping between bits and glyphs
- i.e., getting from bits in a file to characters on a screen
- Can be a major source of incompatibility
ASCII
ASCII is basic character encoding scheme for English (since 1963)
- Encodes 128 letters, numbers, special characters, and control characters in 7 bits, extended with an extra bit for storage in bytes
The problem with character encoding
Other languages can have many more glyphs
- e.g., Chinese > 40,000 characters, with over 3,000 in common use
Many languages have multiple encoding schemes
- e.g., CJK (Chinese-Japanese-Korean) family of East Asian languages, Hindi, Arabic
- must specify encoding in the file : the file is not self-describing
- can't have multiple languages in one file
Unicode was developed in the late 80's to address encoding problems
Unicode
Single mapping from numbers to glyphs that attempts to include all glyphs in common use in all known languages
Unicode is a mapping between numbers and glyphs
- solves the problem of mapping from number to glyphs
- does not uniquely specify the bits-to-glyph mapping - e.g., UTF-8, UTF-16, UTF-32
Unicode encodings
Proliferation of encodings comes from a need for compatibility and to save space
- UTF-8 uses one byte for English (=ASCII), but as many as 4 bytes for some traditional Chinese characters
• variable length encoding, more difficult to do string operations
- UTF-32 uses 4 bytes for every character
• more memory, no backward compatibility with ASCII
You can mix both in your design
- Many applications use UTF-32 for internal text encoding (fast random lookup) and UTF-8 for disk storage (less space)
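A quick way to see the trade-off (illustrative only):

    # byte cost of the same characters under different Unicode encodings
    for ch in ("a", "é", "中"):
        print(ch,
              len(ch.encode("utf-8")),      # 1, 2, 3 bytes (variable length)
              len(ch.encode("utf-32-le")))  # always 4 bytes ("-le" avoids the BOM)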