Bidders specify:
Search terms that trigger their bid.
Amount to bid for each search term.
Overall ad budget and bid limits.
Google may set a reserve/minimum price for each term.
Google estimates each bidder's click-through rate (CTR) if listed first.
Ads with very low CTRs are not displayed.
Bids are ranked by multiplying the click-through rate and the bid amount.
Ads are displayed in rank order.
Google is paid only when an ad is clicked.
The price paid is the smallest amount the bidder could have bid to get their ranking.
Example: If A bids $1.00 and B bids $0.50, then when A's ad is clicked, A pays $0.51. However, if A's ad also receives 1,000 impressions without clicks, A is charged nothing for those impressions.
Ad Rank: Determines the position of an ad.
$$\text{Ad Rank} = \text{Bid} \times \text{Quality Score}$$
$$\text{Ad Rank} = \text{Bid} \times \text{Click Probability}$$
CPC: Determined by the ad rank of the next highest ad below you.
Exception: If you are the only bidder or the lowest bid, you pay your maximum bid per click!
$$\text{CPC} = \frac{\text{Ad Rank of the ad below you}}{\text{Your Quality Score}} + \$0.01$$
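A minimal sketch of this ranking and pricing rule, with made-up bids and quality scores (not Google's actual implementation):

```python
# Hypothetical AdWords-style ranking and pricing (illustrative values only).
bidders = [
    {"name": "A", "bid": 1.00, "quality": 0.8},
    {"name": "B", "bid": 1.50, "quality": 0.4},
    {"name": "C", "bid": 0.50, "quality": 0.9},
]

# Rank ads by Ad Rank = bid * quality score (quality score stands in for click probability).
for b in bidders:
    b["ad_rank"] = b["bid"] * b["quality"]
ranked = sorted(bidders, key=lambda b: b["ad_rank"], reverse=True)

# CPC = (Ad Rank of the ad below you) / (your Quality Score) + $0.01;
# the last-ranked ad pays its maximum bid.
for pos, b in enumerate(ranked):
    if pos + 1 < len(ranked):
        b["cpc"] = ranked[pos + 1]["ad_rank"] / b["quality"] + 0.01
    else:
        b["cpc"] = b["bid"]
    print(f'{pos + 1}. {b["name"]}: ad_rank={b["ad_rank"]:.2f}, cpc=${b["cpc"]:.2f}')
```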
AdWords bidding penalizes advertisers with low Quality Scores.
High Quality Scores lead to higher ad ranks and lower CPCs.
The average cost per click varies by keyword and industry.
Roughly $2.32 on the search network.
Roughly $0.58 on the display network.
Historical click performance of the ad
Landing page quality
Relevance to the user
User click-through rates
Landing pages are the pages that appear when a user clicks on your ad.
AdSense is a service for placing Google ads on web pages.
Offers ads besides cost-per-click:
Cost per thousand impressions (CPM)
Cost per Engagement: Advertisers pay only when users actively engage with ads (e.g., hovering, expanding, reviewing the landing page).
Google assigns two identifiers for advertisers: "google_ad_client" and "google_ad_slot."
Small, transparent images on a web page that allow tracking page loads by a web server.
Based on WordNet, a database of word meanings developed at Princeton.
A semantic lexicon for the English language.
Groups English words into sets of synonyms called synsets.
Provides short, general definitions.
Records semantic relations between synsets.
Purpose: a) Intuitive dictionary/thesaurus combination, b) Support automatic text analysis.
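A small illustration of these synset relations using NLTK's WordNet interface (assumes the nltk package is installed and the wordnet corpus has been downloaded):

```python
# Exploring WordNet synsets and semantic relations via NLTK
# (assumes `pip install nltk` and nltk.download("wordnet") have been run).
from nltk.corpus import wordnet as wn

dog = wn.synsets("dog")[0]        # first synset for "dog"
print(dog.definition())           # short, general definition
print(dog.lemma_names())          # synonyms grouped in this synset
print(dog.hypernyms())            # broader (superordinate) concepts
print(dog.hyponyms()[:3])         # more specific concepts
print(dog.member_holonyms())      # wholes that "dog" is a member of (e.g., a pack)
```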
A technology platform facilitating the buying and selling of media advertising inventory from multiple ad networks.
Key function: Aggregation of ad space supply from publishers and matching it with advertiser demand.
A real-time marketplace owned by Google for buying and selling advertising.
Uses Dynamic Advertising Reporting and Targeting (DART) technology.
When a user calls up a webpage using DART, a tag signals DoubleClick's server to find an advertisement matching the marketer's needs with the user's profile.
Different users can see different advertisements on the same site simultaneously.
Cookies refine targeting by tracking repeat visitors or those who have seen a specific ad.
DoubleClick places small (1×1 pixel) GIF files on sites to set cookies on your machine.
Can potentially see search strings typed into search engines.
Number of unique users their advertisements were displayed to.
How many users clicked on their ads or paid listings.
Which ads or paid listings they clicked on.
Small strings of HTML code placed in a web page.
Also called "clear GIFs," pixel tags, web bugs, tracking bugs.
IP address of the computer.
URL of the page (referer).
URL of the web beacon image.
Time the Web Beacon was viewed.
Browser type (user_agent).
Any previously set cookie values.
IP address may change; unique ID stored with the cookie helps track the user.
A protocol for two digital advertising companies to transact.
Ad Exchange consolidates the integration of ad networks.
Technology used to store complex structured and unstructured information.
Knowledge-base: Represents facts about the world.
Inference Engine: Reasons about facts and deduces new facts using rules and logic.
Freebase, Google's Knowledge Graph, Apple's Siri, IBM’s Watson
A hierarchy of concepts with parent/child relations (subClass/superClass, broader/narrower).
The representation of knowledge in a knowledgebase.
Can express arbitrary complex relations between concepts (e.g., X marriedTo Y, A worksFor B, C locatedIn D).
A crowd-sourced community effort to extract structured information from Wikipedia.
Makes information available on the Web using RDF (Resource Description Framework) and SPARQL.
Taxonomies are narrower than ontologies, which include a larger variety of relation types.
Mathematically:
Taxonomy: A tree structure of classifications.
Ontology: A directed, labeled, cyclic graph.
Convert speech to text
Convert text to entities
Convert entities to a query
Retrieve results
Interpret results
Convert back to speech
Automated speech recognition
Parts of Speech tagging
Question/Answer analysis
Database interfaces to Wolfram, OpenTable, etc
Transforming machine output to natural language
Text-to-speech processing
A knowledgebase can be seen as a directed labeled multigraph, where nodes are entities and edges are relations.
A graph with multiple edges between the same end nodes.
Consists of: subject, predicate, object (a subject–verb–object structure).
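A toy sketch of storing such subject–predicate–object triples as a directed labeled multigraph; the class and the facts below are purely illustrative:

```python
from collections import defaultdict

# A knowledgebase as a directed labeled multigraph: each edge is a
# (subject, predicate, object) triple; multiple edges may share endpoints.
class KnowledgeBase:
    def __init__(self):
        self.triples = []                     # all (s, p, o) edges
        self.by_subject = defaultdict(list)   # adjacency: subject -> its edges

    def add(self, subject, predicate, obj):
        triple = (subject, predicate, obj)
        self.triples.append(triple)
        self.by_subject[subject].append(triple)

    def relations_of(self, subject):
        return self.by_subject[subject]

kb = KnowledgeBase()
kb.add("Barack_Obama", "marriedTo", "Michelle_Obama")   # illustrative facts
kb.add("Barack_Obama", "bornIn", "Honolulu")
kb.add("Honolulu", "locatedIn", "Hawaii")
print(kb.relations_of("Barack_Obama"))
```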
Hyponym: More specific.
Holonym: Denoting the whole.
Hypernym: A broad or superordinate.
A component that applies logical rules to a knowledgebase to deduce new information.
Forward chaining: Starts with known facts and asserts new facts; search engines typically use forward chaining
Backward chaining: Starts with goals and works backward to determine necessary facts.
Repeated application of modus ponens.
Starts with available data and uses inference rules to extract more data until a goal is reached.
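A minimal forward-chaining sketch (repeated modus ponens); the rules and starting facts are hypothetical:

```python
# Forward chaining: repeatedly apply rules (premises -> conclusion) to the
# fact set until no new facts can be derived. Rules and facts are toy examples.
facts = {"rainy", "outside"}
rules = [
    ({"rainy", "outside"}, "wet"),
    ({"wet"}, "needs_towel"),
]

changed = True
while changed:
    changed = False
    for premises, conclusion in rules:
        if premises <= facts and conclusion not in facts:
            facts.add(conclusion)   # modus ponens: premises hold, assert conclusion
            changed = True

print(facts)   # {'rainy', 'outside', 'wet', 'needs_towel'}
```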
Improves variety of search results by locating alternate interpretations of query terms (e.g., "taj mahal" - mausoleum or musician).
Provides deeper and broader results (e.g., person entities include relations like age, birthplace, marital status, children, education, etc.).
Provides the best summary by exploiting relationships among entities.
30 million articles, 4 million in English.
16 million images.
8000 views per second.
500 million unique visitors per month.
3.7 billion monthly mobile page views.
2.1 billion edits, 700 million English.
20 million registered users.
120,000 active editors.
1,400 administrators.
All working for free, with no central control
Jimmy Wales
Encyclopaedia: Notable topics, No original research (NOR), Neutral point of view (NPOV), Verifiability (referencing).
Free content: Anyone can edit, No copyright infringements.
Be civil.
No firm rules.
Article links
Category links
Interlingual links
Redirect pages: Short pages providing equivalent names for an entity.
Disambiguation pages: Pages linking to multiple similarly named articles.
Articles form a network of semantically related terms.
Categories are organized in a taxonomy-like structure.
Semi-protect pages: Lock pages from being edited by unregistered and new users.
Protect pages: Lock pages from being edited.
Does not take advertising, only donations.
Wikimedia Foundation, Inc. (WMF): American non-profit organization headquartered in San Francisco.
Owns the internet domain names and hosts Wikipedia.
Employs over 280 people, with annual revenues in excess of $75 million (as of 2015).
Commons for multimedia.
Wiktionary as a free dictionary.
Wikidata for structured data.
An effort to convert Wikipedia data into a knowledgebase.
Aims to create a free RDF-like KB about the world that can be read/edited by humans & machines.
Wikidata clients use the repository to populate Web pages or Wikipedia infoboxes.
3 years old, 14M items, 209M edits
Two structured representations of Wikipedia.
Wikidata: Manually curated; will be the master source of structured data for Wikipedia; synchronized through bots; data is fairly accurate, but data depth is still small.
DBpedia: Automatically extracted; live updates; one-way extraction only; data reach is deep, but there are problems in the ontology and mappings, especially for non-English content.
Categories also contain classes, but do not form a taxonomic hierarchy. There is no ISA hierarchy.
Created by Ward Cunningham
Wikis = Website that you can create and edit in real-time.
In Nupedia, content can only be edited by experts.
In Wikipedia, content can be edited by anyone.
Considering Only Terms with High-idf Scores: the postings of low-idf terms contain many docs; ignoring those terms eliminates their docs from the set of contenders.
Considering Only Docs Containing Many Query Terms: Easy to implement in postings traversal.
Pre-compute for each dictionary term t, the r docs of highest weight (tf-idf) in t’s postings.
Call this the champion list for t (aka fancy list or top docs for t).
r has to be chosen at index build time, thus it’s possible that r < K (number of docs to return).
At query time, only compute scores for docs in the champion list of some query term.
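A sketch of building and querying champion lists, assuming tf-idf postings are already available (the postings data and term names are illustrative):

```python
import heapq

# postings[t] maps each doc id to its tf-idf weight for term t (illustrative data).
postings = {
    "ipod": {1: 2.1, 2: 0.3, 3: 1.7, 4: 0.9},
    "nano": {2: 1.2, 3: 2.5, 5: 0.4},
}

R = 2  # champion-list size, chosen at index build time (may be < K)

# Pre-compute: for each term, keep only the R docs with highest weight.
champion = {
    t: dict(heapq.nlargest(R, plist.items(), key=lambda kv: kv[1]))
    for t, plist in postings.items()
}

# At query time, score only docs appearing in some query term's champion list.
def score(query_terms):
    scores = {}
    for t in query_terms:
        for doc, w in champion.get(t, {}).items():
            scores[doc] = scores.get(doc, 0.0) + w
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

print(score(["ipod", "nano"]))
```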
Top-ranking documents should be both relevant and authoritative.
Relevance is modeled by cosine scores.
Authority is a query-independent property of a document, assigned a query-independent quality score g(d) between 0 and 1.
Wikipedia among websites.
Articles in certain newspapers.
A paper with many citations.
High PageRank.
Combining cosine relevance and authority
$$net\_score(q,d) = g(d) + cosine(q,d)$$
Other linear combinations can be used, assigning different weights to the two factors.
Order all postings by g(d), the authority measure.
Most authoritative documents appear early in the postings list.
Maintain for each term a champion list of the r docs with the highest g(d) + tf-idf(t,d).
Seek top-K results from only the docs in these champion lists.
For each term, maintain two postings lists called high and low.
high is the champion list.
When traversing postings on a query, only traverse high lists first.
If we get more than K docs, select the top K and stop.
Else proceed to get docs from the low lists.
Can be used even for simple cosine scores, without global quality g(d).
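A sketch of the high/low (tiered) traversal on the same kind of illustrative data: scan the high tier first, and fall back to the low tier only if fewer than K docs were found.

```python
# Tiered index sketch: each term has a "high" (champion) list and a "low" list.
# The doc-to-score maps below are illustrative.
K = 2
index = {
    "ipod": {"high": {1: 2.1, 3: 1.7}, "low": {2: 0.3, 4: 0.9}},
    "nano": {"high": {3: 2.5, 2: 1.2}, "low": {5: 0.4}},
}

def tiered_search(query_terms):
    scores = {}
    for tier in ("high", "low"):
        for t in query_terms:
            for doc, w in index.get(t, {}).get(tier, {}).items():
                scores[doc] = scores.get(doc, 0.0) + w
        ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
        if len(ranked) >= K:        # enough docs from this tier: stop early
            return ranked[:K]
    return ranked[:K]               # otherwise return whatever the low tier added

print(tiered_search(["ipod", "nano"]))
```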
Content factors:
Content relevance
Word count
User signals:
Click through rate
Bounce rate
Technical factors:
Presence of H1/H2
Use of HTTPS
User experience:
Number of internal/external links
Social signals:
Facebook total
Tweets
The highest content relevance scores were found among the results for positions 3 to 6.
It has been rising for years.
Measures the average percentage of users who click on the result at each position on the Search Engine Results Page (SERP).
Measures the percentage of users who only click on the URL from Google’s search results, without visiting any other URLs on the domain, and then return back to the SERP.
These are single-page sessions where the user leaves the site without interacting with the page.
Page encryption using HTTPS is rising.
Facebook has the highest level of user interactions.
All top 100 websites have a mobile-friendly version; they use either a mobile sub-domain or responsive design.
Correlations are not synonymous with causal relationships.
There is no guarantee that respective factors actually have any impact on the ranking.
There are many examples of illusory correlations (logical fallacy).
Google Search, Google Maps, YouTube, Android, Gmail, the Play Store and Google Chrome.
Chrome and Safari.
Chrome: Google search.
Firefox: Yahoo.
Microsoft Edge and Internet Explorer: Bing.
Use a microphone to give a spoken query like “how old is Obama”, and after getting the result, send a follow-up query like “how tall is he”.
A horizontal partition of data in a database or search engine.
Each individual partition is referred to as a shard.
Each shard is held on a separate database server instance, to spread the load.
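A toy sketch of hash-based horizontal partitioning; the shard hostnames and the choice of hash function are assumptions for illustration:

```python
import hashlib

# Assign each document id to one of N shards by hashing a partition key.
SHARDS = ["shard-0.example.com", "shard-1.example.com", "shard-2.example.com"]

def shard_for(doc_id: str) -> str:
    h = int(hashlib.md5(doc_id.encode("utf-8")).hexdigest(), 16)
    return SHARDS[h % len(SHARDS)]

for doc_id in ["doc-17", "doc-18", "doc-19"]:
    print(doc_id, "->", shard_for(doc_id))
```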
Parse the query
Convert words into wordIDs using the lexicon
Select the barrels that contain documents which match the wordIDs
Scan through the document list until there is a document that matches all of the search terms
Compute the rank of that document for the query (using PageRank as one component)
Repeat step 4 until no documents are found and we've examined all of the barrels
Sort the matched documents by document rank and return the top k.
High performance, scalable, full-text search library.
Does not include crawlers or explicit searching of documents. Searching the index must be implemented elsewhere.
100% Java, no dependencies. Offered by the Apache Software Foundation
A full-text search server based on Lucene.
Communicates with Lucene via XML/HTTP, JSON Interfaces.
Written in Java 5.
Flexible data schema to define types and fields
Output includes highlighting matches
Configurable advanced caching
Index Replication
Extensible Open Architecture, Plugins
Web Administration Interface
Stores a positional inverted index.
Creates tokens using a Tokenizer and/or applying Filters (Token Filters).
Lucene scores using a combination of TF-IDF weighting and vector (cosine) similarity.
$$\text{TF-IDF}: W(x,y) = tf(x,y) \cdot \log\left(\frac{N}{df(x)}\right)$$ (term frequency × inverse document frequency)
$$\text{Cosine-similarity}(V(q), V(d)) = \frac{V(q) \cdot V(d)}{|V(q)| \, |V(d)|}$$
$$tf(x,y)$$: term frequency of x in y
$$N$$: total number of documents
$$df(x)$$: number of documents containing x
$$V(q) \cdot V(d)$$ is the dot product of the weighted vectors
$$|V(q)|, |V(d)|$$ are the Euclidean norms of the vectors (square root of the sum of squares)
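A small sketch that computes these two quantities directly from the formulas above, on toy documents and a toy query (this is not Lucene's actual scoring code):

```python
import math
from collections import Counter

docs = {
    "d1": "the quick brown fox",
    "d2": "the lazy brown dog",
    "d3": "the quick dog jumps",
}
N = len(docs)

# df(x): number of documents containing term x
df = Counter()
for text in docs.values():
    df.update(set(text.split()))

def tfidf_vector(text):
    tf = Counter(text.split())
    return {x: tf[x] * math.log(N / df[x]) for x in tf if df[x]}

def cosine(v1, v2):
    dot = sum(v1[x] * v2.get(x, 0.0) for x in v1)
    n1 = math.sqrt(sum(w * w for w in v1.values()))
    n2 = math.sqrt(sum(w * w for w in v2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

q = tfidf_vector("quick dog")
for name, text in docs.items():
    print(name, round(cosine(q, tfidf_vector(text)), 3))
```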
Accepts queries in the form of HTTP requests and returns results as JSON strings
status, always 0 in a successful response
QTime, the server-side query time in milliseconds
numFound, the total number of documents matching the query
start, the offset into the ordered list of results
field types include str, boolean, int, long, float, double, date, lst, arr
lst is a named list
arr is an array
multivalued fields are returned in an <arr> element.
1. Start Solr: java -jar start.jar
2. Index your XML data: java -jar post.jar *.xml
http://solr/select?q=electronics&fl=name+price
search electronics and limit results to fields: name and price
http://solr/select?q=electronics&sort=price+desc
search electronics and sort by the price in decreasing order
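A hedged example of issuing such a query programmatically, assuming a local Solr instance on the default port with a core named techproducts and the requests library installed (host, port, and core name are assumptions):

```python
import requests

# Query a (hypothetical) local Solr core over HTTP and read the JSON response.
params = {
    "q": "electronics",
    "fl": "name,price",      # limit returned fields to name and price
    "sort": "price desc",    # sort by price, descending
    "wt": "json",            # request a JSON response
}
resp = requests.get("http://localhost:8983/solr/techproducts/select", params=params)
data = resp.json()

print("status:", data["responseHeader"]["status"])   # 0 on success
print("QTime:", data["responseHeader"]["QTime"], "ms")
print("numFound:", data["response"]["numFound"])
for doc in data["response"]["docs"]:
    print(doc)
```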
Single and multi-term queries
+, -, AND, OR, NOT operators are supported
Range queries on date or numeric fields.
Boost queries (give extra weight to particular terms).
Fuzzy search: a search for words that are similar in spelling.
Proximity search: uses a sloppy phrase query; the closer together the two terms appear, the higher the score.
Apple.com, Netflix.com, Whitehouse.gov, Indeed.com, Twitter.com, LinkedIn.com
VirtualBox is an open-source, freely available application that runs on Windows and other platforms and lets you run multiple operating systems on a single machine.
Ubuntu is a Linux-based operating system for personal computers, smartphones, and network servers. It uses Unity as its default desktop environment.
Query, index, delete, commit, and optimize
For the query “Queen Victoria”, after tokenization, Queen: position:0; offset:0; length:5; Victoria: position:1; offset:6; length:8
Term query: matches a precise term
Wildcard query: portions of term left unspecified, e.g. c*t=cat, cot, cant
Prefix query: end of term left unspecified, e.g. ca*=cat, camp, cad
Fuzzy query: matches an imprecise term
Phrase query: matches multiple terms in sequence/proximity
Range query: matches an alphabetical or numeric range of terms
Boolean query: logically composite other queries
Lucene.NET: implementation of Lucene in C#
PyLucene: Lucene running in JVM embedded in CPython
Tika: Java library for text and metadata extraction
Luke: tool for inspecting Lucene indexes
Lucene document=collection of fields.
Fields are stored separately from the index (.fdt file)
A field may be indexed and stored, indexed but not stored, or stored but not indexed.
A single Lucene index may be split into multiple segments.
Each segment stores one or more of the documents in its own inverted index and field storage.
MergePolicy: findMerges() is invoked when a new segment is created and returns a list of segments to merge.
MergeScheduler: processes the merges, sometimes in separate threads.
When a document is deleted, it is marked for deletion; it actually gets removed only when its segment gets merged.
You can’t. Only whole documents can be added and deleted; fields of an existing document cannot be modified.
Yonik Seeley
It contains field types, fields, and dynamic fields (the schema.xml / managed-schema file).
It contains Lucene index parameters, request handler mappings, cache settings and plugins
Yes, Solr supports indexing data from databases and via HTTP GET, with both full and incremental indexing and searching.
Solr supports indexing MS Office, PDF, RTF, OpenDocument, images, MP3, ZIP, etc.
JSON: the default format of Solr search results.
It is the front-end framework for using Solr in web-page UIs.
It relies on server and container security.
Solr supports both master/slave replication and sharding, but not all Solr features support sharding.
Yes, Solr can be used to achieve near-real-time index updates.
Non-word errors
Typographical errors
Cognitive errors
Spelling Error Detection
Spelling Error Correction
Yes; a big dictionary, the use of context, and fast response time (for autocomplete) are needed.
Any word not in a dictionary is presumed to be an error.
The larger the dictionary the better
Generate candidates from the dictionary that are close to the error.
There are several ways:
(a) Shortest weighted edit distance
Edit distance is a way of quantifying how dissimilar two strings (e.g., words) are to one another by counting the minimum number of operations required to transform one string into the other
(b) Highest noisy channel probability
Use probabilities to select the most likely candidates
Typing quickly, keyboard adjacency, inconsistent rules, ambiguous word breaking, new words
According to Cucerzan and Brill, more than 10% of search engine queries are misspelled
Consider whether the surrounding words “make sense” for your candidate set.
For candidate words with similar sounds but different spellings and different meanings.
(we can use the Soundex algorithm)
Allow for insertion of a space or hyphen, or deletion of a space; watch out for valid words that are NOT in the lexicon.
This model suggests treating the misspelled word as if a correctly spelled word has been distorted by being passed through a noisy communication channel.
Noise in this case refers to substitutions, insertions or deletions of letters
To predict the correct word, use Bayesian inference: the formula multiplies the prior probability of a candidate word by the likelihood of the observed misspelling given that word.
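Written out in standard noisy-channel notation (with x the observed misspelling, w a candidate word, and V the vocabulary):

$$\hat{w} = \arg\max_{w \in V} P(w \mid x) = \arg\max_{w \in V} P(x \mid w)\, P(w)$$

Here $$P(w)$$ is the prior (language-model) probability of the word and $$P(x \mid w)$$ is the likelihood (channel/error model) of producing the misspelling x from w.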
Initial step: Create a dictionary and encode it for fast retrieval
When a query is submitted, the spell checker examines each word and looks for possible character edits. Observation:
80% of errors are within edit distance 1
Almost all errors within edit distance 2
Take the output of step 2 and compute probabilities for the candidates using previously identified n-grams
Select the result with the highest probability.
Which data structure is usually used for efficient prefix matching in autocomplete and spell correction?
Trie.
Is the search for corrections carried out from left-to-right or right-to-left? What factor is used to control the response time?
The search for corrections is carried out from left-to-right.
The branching factor controls the amount of time required to search for spelling corrections.
O(m), where m is the length of the string, at the expense of increased storage.
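A minimal trie sketch with O(m) prefix lookup; the word list is illustrative:

```python
# A minimal trie: insert words, then find completions of a prefix in O(m) time,
# where m is the length of the query string.
class Trie:
    def __init__(self):
        self.root = {}

    def insert(self, word):
        node = self.root
        for ch in word:
            node = node.setdefault(ch, {})
        node["$"] = True        # end-of-word marker

    def words_with_prefix(self, prefix):
        node = self.root
        for ch in prefix:       # O(m) walk down the trie
            if ch not in node:
                return []
            node = node[ch]
        out = []
        def collect(n, path):
            if "$" in n:
                out.append(prefix + path)
            for ch, child in n.items():
                if ch != "$":
                    collect(child, path + ch)
        collect(node, "")
        return out

t = Trie()
for w in ["car", "card", "care", "cat"]:
    t.insert(w)
print(t.words_with_prefix("car"))   # ['car', 'card', 'care']
```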
Wikipedia’s list of common English misspellings
Aspell filtered version of that list
Birkbeck spelling error corpus
An n-gram model is a type of probabilistic language model for predicting the next item in a sequence.
Two benefits of n-gram models are simplicity and scalability: with larger n, a model can store more context with a well-understood space–time tradeoff, enabling small experiments to scale up efficiently.
Google has collected and uses a great deal of N-gram data. Google is using the Linguistic Data Consortium to distribute more than one trillion words extracted from public web pages.
Rather than store each word as a string, it is generally represented in memory as a 64-bit hash number, with the words themselves stored on disk.
Probabilities are generally quantized using only 4–8 bits (instead of 8-byte floats).
N-grams are stored in reverse tries.
N-grams can also be shrunk by pruning, for example only storing N-grams with counts greater than some threshold (such as the count threshold of 40 used for the Google N-gram release).
If a higher-order N-gram has a zero count, we simply back off to a lower-order N-gram, weighted by a fixed (context-independent) weight.
The backoff terminates at the unigram.
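A tiny sketch of that backoff idea in the spirit of "stupid backoff", using a fixed weight of 0.4; the corpus and counts are toy values:

```python
from collections import Counter

# Toy corpus; trigram/bigram/unigram counts for backoff scoring.
tokens = "the cat sat on the mat the cat ran".split()
unigrams = Counter(tokens)
bigrams  = Counter(zip(tokens, tokens[1:]))
trigrams = Counter(zip(tokens, tokens[1:], tokens[2:]))
ALPHA = 0.4   # fixed, context-independent backoff weight

def score(w3, w1, w2):
    """Score w3 given the context (w1, w2), backing off on zero counts."""
    if trigrams[(w1, w2, w3)] > 0:
        return trigrams[(w1, w2, w3)] / bigrams[(w1, w2)]
    if bigrams[(w2, w3)] > 0:
        return ALPHA * bigrams[(w2, w3)] / unigrams[w2]
    return ALPHA * ALPHA * unigrams[w3] / len(tokens)   # terminate at the unigram

print(score("sat", "the", "cat"))   # seen trigram
print(score("ran", "the", "dog"))   # backs off to bigram, then unigram
```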
For a given “big.txt”, extract individual words and build a dictionary (hashtable) that counts how many times each word appears. (for words not appearing, set default value to 1).
For a given query word, compute all the words within edit distance 1 of it, and all words within edit distance 2 of it.
For the given query word, if the word itself is in the original word set, return it as the correct word. Otherwise find the word with the highest frequency at edit distance 1; if none exists, find the word with the highest frequency at edit distance 2.
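A condensed Norvig-style sketch of that corrector; "big.txt" is whatever corpus file you supply, and the commented-out lines at the end show intended usage:

```python
import re
from collections import Counter

LETTERS = "abcdefghijklmnopqrstuvwxyz"

# Step 1: build a word-frequency dictionary from a corpus file ("big.txt" here).
def build_counts(path="big.txt"):
    text = open(path, encoding="utf-8").read().lower()
    return Counter(re.findall(r"[a-z]+", text))

# Step 2: all candidates within edit distance 1 (deletes, transposes, replaces, inserts).
def edits1(word):
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes    = [L + R[1:] for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    replaces   = [L + c + R[1:] for L, R in splits if R for c in LETTERS]
    inserts    = [L + c + R for L, R in splits for c in LETTERS]
    return set(deletes + transposes + replaces + inserts)

def edits2(word):
    return {e2 for e1 in edits1(word) for e2 in edits1(e1)}

# Step 3: prefer the word itself, then the most frequent distance-1 candidate,
# then the most frequent distance-2 candidate.
def correct(word, counts):
    for candidates in ([word], edits1(word), edits2(word)):
        known = [w for w in candidates if w in counts]
        if known:
            return max(known, key=counts.get)
    return word

# counts = build_counts("big.txt")
# print(correct("speling", counts))
```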
The minimum edit distance between two strings is the minimum number of editing operations (insertion, deletion, substitution) needed to transform one into the other
To compute the minimum edit distance D(n, m) for two strings X and Y, the formula is:
D(i, j)= min(
D(i-1, j)+1 // for deletion
D(i, j-1)+1 // for insertion
D(i-1, j-1)+2 // for substitution, if X[i]!=Y[j],
D(i-1, j-1) // if X[i]==Y[j]
)
For initial states, D(i, 0)=i, D(0, j)=j
Computing minimum distance time: O(mn)
Computing minimum distance space: O(mn)
Backtracing: O(m+n)
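A direct implementation of that recurrence (O(mn) time and space), using the substitution cost of 2 from above:

```python
def min_edit_distance(x, y):
    """Minimum edit distance with substitution cost 2, per the recurrence above."""
    n, m = len(x), len(y)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = i                      # delete all of x[:i]
    for j in range(1, m + 1):
        D[0][j] = j                      # insert all of y[:j]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if x[i - 1] == y[j - 1] else 2
            D[i][j] = min(D[i - 1][j] + 1,        # deletion
                          D[i][j - 1] + 1,        # insertion
                          D[i - 1][j - 1] + sub)  # substitution / match
    return D[n][m]

print(min_edit_distance("intention", "execution"))   # 8 under this cost model
```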
For spell correction: some letters are more likely to be mistyped than others
A specific table layout that allows visualization of the performance of an algorithm;
Each column of the matrix represents the instances in a predicted class while each row represents the instances in an actual class (or vice-versa)
To account for special costs for deletion and insertion during edit distance calculation:
$$D(i, j) = \min\begin{cases} D(i-1, j) + del[x[i]] & \text{deletion} \\ D(i, j-1) + ins[y[j]] & \text{insertion} \\ D(i-1, j-1) + sub[x[i], y[j]] & \text{substitution, if } x[i] \ne y[j] \\ D(i-1, j-1) & \text{if } x[i] = y[j] \end{cases}$$
For initial states:
$$D(0,0) = 0$$
$$D(i, 0) = D(i-1, 0) + del[x[i]], \quad i = 1 \ldots N$$
$$D(0, j) = D(0, j-1) + ins[y[j]], \quad j = 1 \ldots M$$
To correct query phrases that match no docs, retrieve dictionary terms close (in weighted edit distance) to each query term. Then try all possible resulting phrases with one word "fixed" at a time.