Bidders specify:
Search terms that trigger their bid.
Amount to bid for each search term.
Overall ad budget and bid limits.
Google may set a reserve/minimum price for each term.
Google estimates each bidder's click-through rate (CTR) if listed first.
Ads with very low CTRs are not displayed.
Bids are ranked by multiplying the click-through rate and the bid amount.
Ads are displayed in rank order.
Google is paid only when an ad is clicked.
The price paid is the smallest amount the bidder could have bid to get their ranking.
Example: If A bids $1.00 and B bids $0.50, then when A's ad is clicked, A pays $0.51. However, if A's ad also receives 1,000 impressions without clicks, A is charged nothing for those impressions.
Ad Rank: Determines the position of an ad.
$$\text{Ad Rank} = \text{Bid} \times \text{Quality Score}$$
$$\text{Ad Rank} = \text{Bid} \times \text{Click Probability}$$
CPC: Determined by the ad rank of the next highest ad below you.
Exception: If you are the only bidder or the lowest bid, you pay your maximum bid per click!
$$\text{CPC} = \frac{\text{Ad Rank of the ad below you}}{\text{Your Quality Score}} + \$0.01$$
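A minimal sketch of this ranking and pricing rule, with made-up bids and quality scores (not Google's actual implementation):

```python
# Hypothetical AdWords-style ranking and pricing (illustrative values only).
bidders = [
    {"name": "A", "bid": 1.00, "quality": 0.8},
    {"name": "B", "bid": 1.50, "quality": 0.4},
    {"name": "C", "bid": 0.50, "quality": 0.9},
]

# Rank ads by Ad Rank = bid * quality score (quality score stands in for click probability).
for b in bidders:
    b["ad_rank"] = b["bid"] * b["quality"]
ranked = sorted(bidders, key=lambda b: b["ad_rank"], reverse=True)

# CPC = (Ad Rank of the ad below you) / (your Quality Score) + $0.01;
# the last-ranked ad pays its maximum bid.
for pos, b in enumerate(ranked):
    if pos + 1 < len(ranked):
        b["cpc"] = ranked[pos + 1]["ad_rank"] / b["quality"] + 0.01
    else:
        b["cpc"] = b["bid"]
    print(f'{pos + 1}. {b["name"]}: ad_rank={b["ad_rank"]:.2f}, cpc=${b["cpc"]:.2f}')
```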
AdWords bidding penalizes advertisers with low Quality Scores.
High Quality Scores lead to higher ad ranks and lower CPCs.
The average cost per click varies by keyword and industry.
Roughly $2.32 on the search network.
Roughly $0.58 on the display network.
Historical click performance of the ad
Landing page quality
Relevance to the user
User click-through rates
Landing pages are the pages that appear when a user clicks on your ad.
AdSense is a service for placing Google ads on web pages.
Offers ads besides cost-per-click:
Cost per thousand impressions (CPM)
Cost per Engagement: Advertisers pay only when users actively engage with ads (e.g., hovering, expanding, reviewing the landing page).
Google assigns two identifiers for advertisers: "google_ad_client" and "google_ad_slot."
Small, transparent images on a web page that allow tracking page loads by a web server.
Based on WordNet, a database of word meanings developed at Princeton.
A semantic lexicon for the English language.
Groups English words into sets of synonyms called synsets.
Provides short, general definitions.
Records semantic relations between synsets.
Purpose: a) Intuitive dictionary/thesaurus combination, b) Support automatic text analysis.
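A small illustration of these synset relations using NLTK's WordNet interface (assumes the nltk package is installed and the wordnet corpus has been downloaded):

```python
# Exploring WordNet synsets and semantic relations via NLTK
# (assumes `pip install nltk` and nltk.download("wordnet") have been run).
from nltk.corpus import wordnet as wn

dog = wn.synsets("dog")[0]        # first synset for "dog"
print(dog.definition())           # short, general definition
print(dog.lemma_names())          # synonyms grouped in this synset
print(dog.hypernyms())            # broader (superordinate) concepts
print(dog.hyponyms()[:3])         # more specific concepts
print(dog.member_holonyms())      # wholes that "dog" is a member of (e.g., a pack)
```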
A technology platform facilitating the buying and selling of media advertising inventory from multiple ad networks.
Key function: Aggregation of ad space supply from publishers and matching it with advertiser demand.
A real-time marketplace owned by Google for buying and selling advertising.
Uses Dynamic Advertising Reporting and Targeting (DART) technology.
When a user calls up a webpage using DART, a tag signals DoubleClick's server to find an advertisement matching the marketer's needs with the user's profile.
Different users can see different advertisements on the same site simultaneously.
Cookies refine targeting by tracking repeat visitors or those who have seen a specific ad.
DoubleClick places small (1×1 pixel) GIF files on sites to set cookies on your machine.
Can potentially see search strings typed into search engines.
Number of unique users their advertisements were displayed to.
How many users clicked on their ads or paid listings.
Which ads or paid listings they clicked on.
Small strings of HTML code placed in a web page.
Also called "clear GIFs," pixel tags, web bugs, tracking bugs.
IP address of the computer.
URL of the page (referer).
URL of the web beacon image.
Time the Web Beacon was viewed.
Browser type (user_agent).
Any previously set cookie values.
IP address may change; unique ID stored with the cookie helps track the user.
A protocol for two digital advertising companies to transact.
Ad Exchange consolidates the integration of ad networks.
Technology used to store complex structured and unstructured information.
Knowledge-base: Represents facts about the world.
Inference Engine: Reasons about facts and deduces new facts using rules and logic.
Freebase, Google's Knowledge Graph, Apple's Siri, IBM’s Watson
A hierarchy of concepts with parent/child relations (subClass/superClass, broader/narrower).
The representation of knowledge in a knowledgebase.
Can express arbitrary complex relations between concepts (e.g., X marriedTo Y, A worksFor B, C locatedIn D).
A crowd-sourced community effort to extract structured information from Wikipedia.
Makes information available on the Web using RDF (Resource Description Framework) and SPARQL.
Taxonomies are narrower than ontologies, which include a larger variety of relation types.
Mathematically:
Taxonomy: A tree structure of classifications.
Ontology: A directed, labeled, cyclic graph.
Convert speech to text
Convert text to entities
Convert entities to a query
Retrieve results
Interpret results
Convert back to speech
Automated speech recognition
Parts of Speech tagging
Question/Answer analysis
Database interfaces to Wolfram, OpenTable, etc
Transforming machine output to natural language
Text-to-speech processing
A knowledgebase can be seen as a directed labeled multigraph, where nodes are entities and edges are relations.
A graph with multiple edges between the same end nodes.
Consists of: subject, predicate, object (a subject–verb–object structure).
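A toy sketch of storing such subject–predicate–object triples as a directed labeled multigraph; the class and the facts below are purely illustrative:

```python
from collections import defaultdict

# A knowledgebase as a directed labeled multigraph: each edge is a
# (subject, predicate, object) triple; multiple edges may share endpoints.
class KnowledgeBase:
    def __init__(self):
        self.triples = []                     # all (s, p, o) edges
        self.by_subject = defaultdict(list)   # adjacency: subject -> its edges

    def add(self, subject, predicate, obj):
        triple = (subject, predicate, obj)
        self.triples.append(triple)
        self.by_subject[subject].append(triple)

    def relations_of(self, subject):
        return self.by_subject[subject]

kb = KnowledgeBase()
kb.add("Barack_Obama", "marriedTo", "Michelle_Obama")   # illustrative facts
kb.add("Barack_Obama", "bornIn", "Honolulu")
kb.add("Honolulu", "locatedIn", "Hawaii")
print(kb.relations_of("Barack_Obama"))
```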
Hyponym: More specific.
Holonym: Denoting the whole.
Hypernym: A broad or superordinate.
A component that applies logical rules to a knowledgebase to deduce new information.
Forward chaining: Starts with known facts and asserts new facts; search engines typically use forward chaining
Backward chaining: Starts with goals and works backward to determine necessary facts.
Repeated application of modus ponens.
Starts with available data and uses inference rules to extract more data until a goal is reached.
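A minimal forward-chaining sketch (repeated modus ponens); the rules and starting facts are hypothetical:

```python
# Forward chaining: repeatedly apply rules (premises -> conclusion) to the
# fact set until no new facts can be derived. Rules and facts are toy examples.
facts = {"rainy", "outside"}
rules = [
    ({"rainy", "outside"}, "wet"),
    ({"wet"}, "needs_towel"),
]

changed = True
while changed:
    changed = False
    for premises, conclusion in rules:
        if premises <= facts and conclusion not in facts:
            facts.add(conclusion)   # modus ponens: premises hold, assert conclusion
            changed = True

print(facts)   # {'rainy', 'outside', 'wet', 'needs_towel'}
```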
Improves variety of search results by locating alternate interpretations of query terms (e.g., "taj mahal" - mausoleum or musician).
Provides deeper and broader results (e.g., person entities include relations like age, birthplace, marital status, children, education, etc.).
Provides the best summary by exploiting relationships among entities.
30 million articles, 4 million in English.
16 million images.
8000 views per second.
500 million unique visitors per month.
3.7 billion monthly mobile page views.
2.1 billion edits, 700 million English.
20 million registered users.
120,000 active editors.
1,400 administrators.
All working for free, with no central control
Jimmy Wales
Encyclopaedia: Notable topics, No original research (NOR), Neutral point of view (NPOV), Verifiability (referencing).
Free content: Anyone can edit, No copyright infringements.
Be civil.
No firm rules.
Article links
Category links
Interlingual links
Redirect pages: Short pages providing equivalent names for an entity.
Disambiguation pages: Pages linking to multiple similarly named articles.
Articles form a network of semantically related terms.
Categories are organized in a taxonomy-like structure.
Semi-protect pages: Lock pages from being edited by unregistered and new users.
Protect pages: Lock pages from being edited.
Does not take advertising, only donations.
Wikimedia Foundation, Inc. (WMF): American non-profit organization headquartered in San Francisco.
Owns the internet domain names and hosts Wikipedia.
Employs over 280 people, with annual revenues in excess of $75 million (as of 2015).
Commons for multimedia.
Wiktionary as a free dictionary.
Wikidata for structured data.
An effort to convert Wikipedia data into a knowledgebase.
Aims to create a free RDF-like KB about the world that can be read/edited by humans & machines.
Wikidata clients use the repository to populate Web pages or Wikipedia infoboxes.
3 years old, 14M items, 209M edits
Two structured representations of Wikipedia.
Wikidata: Manually curated; will be the master source of structured data for Wikipedia; synchronized through bots; data is fairly accurate, but data depth is still small.
DBpedia: Automatically extracted; live updates; one-way extraction only; data reach is deep, but there are problems in the ontology and mappings, especially for non-English content.
Categories also contain classes, but do not form a taxonomic hierarchy. There is no ISA hierarchy.
Created by Ward Cunningham
Wikis = Website that you can create and edit in real-time.
In Nupedia, content can only be edited by experts.
In Wikipedia, content can be edited by anyone.
Considering Only Terms with High-idf Scores: the postings of low-idf terms contain many docs; ignoring those terms eliminates their docs from the set of contenders.
Considering Only Docs Containing Many Query Terms: Easy to implement in postings traversal.
Pre-compute for each dictionary term t, the r docs of highest weight (tf-idf) in t’s postings.
Call this the champion list for t (aka fancy list or top docs for t).
r has to be chosen at index build time, thus it’s possible that r < K (number of docs to return).
At query time, only compute scores for docs in the champion list of some query term.
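A sketch of building and querying champion lists, assuming tf-idf postings are already available (the postings data and term names are illustrative):

```python
import heapq

# postings[t] maps each doc id to its tf-idf weight for term t (illustrative data).
postings = {
    "ipod": {1: 2.1, 2: 0.3, 3: 1.7, 4: 0.9},
    "nano": {2: 1.2, 3: 2.5, 5: 0.4},
}

R = 2  # champion-list size, chosen at index build time (may be < K)

# Pre-compute: for each term, keep only the R docs with highest weight.
champion = {
    t: dict(heapq.nlargest(R, plist.items(), key=lambda kv: kv[1]))
    for t, plist in postings.items()
}

# At query time, score only docs appearing in some query term's champion list.
def score(query_terms):
    scores = {}
    for t in query_terms:
        for doc, w in champion.get(t, {}).items():
            scores[doc] = scores.get(doc, 0.0) + w
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

print(score(["ipod", "nano"]))
```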
Top-ranking documents should be both relevant and authoritative.
Relevance is modeled by cosine scores.
Authority is a query-independent property of a document, assigned a query-independent quality score g(d) between 0 and 1.
Wikipedia among websites.
Articles in certain newspapers.
A paper with many citations.
High PageRank.
Combining cosine relevance and authority
$$net\_score(q,d) = g(d) + cosine(q,d)$$
Other linear combinations can be used, assigning different weights to the two factors.
Order all postings by g(d), the authority measure.
Most authoritative documents appear early in the postings list.
Maintain for each term a champion list of the r docs with the highest g(d) + tf-idf(t,d).
Seek top-K results from only the docs in these champion lists.
For each term, maintain two postings lists called high and low.
high is the champion list.
When traversing postings on a query, only traverse high lists first.
If we get more than K docs, select the top K and stop.
Else proceed to get docs from the low lists.
Can be used even for simple cosine scores, without global quality g(d).
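A sketch of the high/low (tiered) traversal on the same kind of illustrative data: scan the high tier first, and fall back to the low tier only if fewer than K docs were found.

```python
# Tiered index sketch: each term has a "high" (champion) list and a "low" list.
# The doc-to-score maps below are illustrative.
K = 2
index = {
    "ipod": {"high": {1: 2.1, 3: 1.7}, "low": {2: 0.3, 4: 0.9}},
    "nano": {"high": {3: 2.5, 2: 1.2}, "low": {5: 0.4}},
}

def tiered_search(query_terms):
    scores = {}
    for tier in ("high", "low"):
        for t in query_terms:
            for doc, w in index.get(t, {}).get(tier, {}).items():
                scores[doc] = scores.get(doc, 0.0) + w
        ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
        if len(ranked) >= K:        # enough docs from this tier: stop early
            return ranked[:K]
    return ranked[:K]               # otherwise return whatever the low tier added

print(tiered_search(["ipod", "nano"]))
```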
Content factors:
Content relevance
Word count
User signals:
Click through rate
Bounce rate
Technical factors:
Presence of H1/H2
Use of HTTPS
User experience:
Number of internal/external links
Social signals:
Facebook total
Tweets
The highest content relevance scores were found among the results for positions 3 to 6.
It has been rising for years.
Measures the average percentage of users who click on the result at each position on the Search Engine Results Page (SERP).
Measures the percentage of users who only click on the URL from Google’s search results, without visiting any other URLs on the domain, and then return back to the SERP.
These are single-page sessions where the user leaves the site without interacting with the page.
Page encryption using HTTPS is rising.
Facebook has the highest level of user interactions.
All top 100 websites have a mobile-friendly version; they use either a mobile sub-domain or responsive design.
Correlations are not synonymous with causal relationships.
There is no guarantee that respective factors actually have any impact on the ranking.
There are many examples of illusory correlations (logical fallacy).
Google Search, Google Maps, YouTube, Android, Gmail, the Play Store and Google Chrome.
Chrome and Safari.
Chrome: Google search.
Firefox: Yahoo.
Microsoft Edge and Internet Explorer: Bing.
Use a microphone to give a spoken query like “how old is Obama”, and after getting the result, send a follow-up query like “how tall is he”.
A horizontal partition of data in a database or search engine.
Each individual partition is referred to as a shard.
Each shard is held on a separate database server instance, to spread the load.
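A toy sketch of hash-based horizontal partitioning; the shard hostnames and the choice of hash function are assumptions for illustration:

```python
import hashlib

# Assign each document id to one of N shards by hashing a partition key.
SHARDS = ["shard-0.example.com", "shard-1.example.com", "shard-2.example.com"]

def shard_for(doc_id: str) -> str:
    h = int(hashlib.md5(doc_id.encode("utf-8")).hexdigest(), 16)
    return SHARDS[h % len(SHARDS)]

for doc_id in ["doc-17", "doc-18", "doc-19"]:
    print(doc_id, "->", shard_for(doc_id))
```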
Parse the query
Convert words into wordIDs using the lexicon
Select the barrels that contain documents which match the wordIDs
Scan through the document list until there is a document that matches all of the search terms
Compute the rank of that document for the query (using PageRank as one component)
Repeat step 4 until no documents are found and we've examined all of the barrels
Sort the matched documents by document rank and return the top k.
High performance, scalable, full-text search library.
Does not include crawlers or explicit searching of documents. Searching the index must be implemented elsewhere.
100% Java, no dependencies. Offered by the Apache Software Foundation
A full-text search server based on Lucene.
Communicates with Lucene via XML/HTTP, JSON Interfaces.
Written in Java 5.
Flexible data schema to define types and fields
Output includes highlighting matches
Configurable advanced caching
Index Replication
Extensible Open Architecture, Plugins
Web Administration Interface
Stores a positional inverted index.
Creates tokens using a Tokenizer and/or applying Filters (Token Filters).
Lucene scores using a combination of TF-IDF weighting and vector (cosine) similarity.
$$\text{TF-IDF}: W(x,y) = tf(x,y) \cdot \log\left(\frac{N}{df(x)}\right)$$ (term frequency × inverse document frequency)
$$\text{Cosine-similarity}(V(q), V(d)) = \frac{V(q) \cdot V(d)}{|V(q)| \, |V(d)|}$$
$$tf(x,y)$$: term frequency of x in y
$$N$$: total number of documents
$$df(x)$$: number of documents containing x
$$V(q) \cdot V(d)$$ is the dot product of the weighted vectors
$$|V(q)|, |V(d)|$$ are the Euclidean norms of the vectors (square root of the sum of squares)
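A small sketch that computes these two quantities directly from the formulas above, on toy documents and a toy query (this is not Lucene's actual scoring code):

```python
import math
from collections import Counter

docs = {
    "d1": "the quick brown fox",
    "d2": "the lazy brown dog",
    "d3": "the quick dog jumps",
}
N = len(docs)

# df(x): number of documents containing term x
df = Counter()
for text in docs.values():
    df.update(set(text.split()))

def tfidf_vector(text):
    tf = Counter(text.split())
    return {x: tf[x] * math.log(N / df[x]) for x in tf if df[x]}

def cosine(v1, v2):
    dot = sum(v1[x] * v2.get(x, 0.0) for x in v1)
    n1 = math.sqrt(sum(w * w for w in v1.values()))
    n2 = math.sqrt(sum(w * w for w in v2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

q = tfidf_vector("quick dog")
for name, text in docs.items():
    print(name, round(cosine(q, tfidf_vector(text)), 3))
```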
Accepts queries in the form of HTTP requests and returns results as JSON strings
status, always 0 in a successful response
QTime, the server-side query time in milliseconds
numFound, the total number of documents matching the query
start, the offset into the ordered list of results
field types include str, boolean, int, long, float, double, date, lst, arr
lst is a named list
arr is an array
multivalued fields are returned in an <arr> element.
1. Start Solr: java -jar start.jar
2. Index your XML data: java -jar post.jar *.xml
http://solr/select?q=electronics&fl=name+price
search electronics and limit results to fields: name and price
http://solr/select?q=electronics&sort=price+desc
search electronics and sort by the price in decreasing order
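A hedged example of issuing such a query programmatically, assuming a local Solr instance on the default port with a core named techproducts and the requests library installed (host, port, and core name are assumptions):

```python
import requests

# Query a (hypothetical) local Solr core over HTTP and read the JSON response.
params = {
    "q": "electronics",
    "fl": "name,price",      # limit returned fields to name and price
    "sort": "price desc",    # sort by price, descending
    "wt": "json",            # request a JSON response
}
resp = requests.get("http://localhost:8983/solr/techproducts/select", params=params)
data = resp.json()

print("status:", data["responseHeader"]["status"])   # 0 on success
print("QTime:", data["responseHeader"]["QTime"], "ms")
print("numFound:", data["response"]["numFound"])
for doc in data["response"]["docs"]:
    print(doc)
```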
Single and multi-term queries
+, -, AND, OR, NOT operators are supported
Range queries on date or numeric fields.
Boost queries (give extra weight to particular terms).
Fuzzy search: a search for words that are similar in spelling.
Proximity search: uses a sloppy phrase query; the closer together the two terms appear, the higher the score.
Apple.com, Netflix.com, Whitehouse.gov, Indeed.com, Twitter.com, LinkedIn.com
VirtualBox is an open-source, freely available application that runs on Windows and other platforms and lets you run multiple operating systems on a single machine.
Ubuntu is a Linux-based operating system for personal computers, smartphones, and network servers. It uses Unity as its default desktop environment.
Query, index, delete, commit, and optimize
For the query “Queen Victoria”, after tokenization, Queen: position:0; offset:0; length:5; Victoria: position:1; offset:6; length:8
Term query: matches a precise term
Wildcard query: portions of term left unspecified, e.g. c*t=cat, cot, cant
Prefix query: end of term left unspecified, e.g. ca*=cat, camp, cad
Fuzzy query: matches an imprecise term
Phrase query: matches multiple terms in sequence/proximity
Range query: matches an alphabetical or numeric range of terms
Boolean query: logically composite other queries
Lucene.NET: implementation of Lucene in C#
PyLucene: Lucene running in JVM embedded in CPython
Tika: Java library for text and metadata extraction
Luke: tool for inspecting Lucene indexes
Lucene document=collection of fields.
Fields are stored separately from the index (.fdt file)
A field may be indexed and stored, indexed but not stored, or stored but not indexed.
A single Lucene index may be split into multiple segments.
Each segment stores one or more of the documents in its own inverted index and field storage.
MergePolicy: findMerges() is invoked when a new segment is created and returns a list of segments to merge.
MergeScheduler: processes the merges, sometimes in separate threads.
When a document is deleted, it is marked for deletion; it actually gets removed only when its segment gets merged.
You can’t. Only whole documents can be added and deleted; fields of an existing document cannot be modified.
Yonik Seeley
It contains field types, fields, and dynamic fields (the schema.xml / managed-schema file).
It contains Lucene index parameters, request handler mappings, cache settings and plugins
Yes, Solr supports indexing data from databases and via HTTP GET, with both full and incremental indexing and searching.
Solr supports indexing MS Office, PDF, RTF, OpenDocument, images, MP3, ZIP, etc.
JSON: the default format of Solr search results.
It is the front-end framework for using Solr in web-page UIs.
It relies on server and container security.
Solr supports both master/slave replication and sharding, but not all Solr features support sharding.
Yes, Solr can be used to achieve near-real-time index updates.
Non-word errors
Typographical errors
Cognitive errors
Spelling Error Detection
Spelling Error Correction
Yes; a big dictionary, the use of context, and fast response time (for autocomplete) are needed.
Any word not in a dictionary is presumed to be an error.
The larger the dictionary the better
Generate candidates from the dictionary that are close to the error.
There are several ways:
(a) Shortest weighted edit distance
Edit distance is a way of quantifying how dissimilar two strings (e.g., words) are to one another by counting the minimum number of operations required to transform one string into the other
(b) Highest noisy channel probability
Use probabilities to select the most likely candidates
Typing quickly, keyboard adjacency, inconsistent rules, ambiguous word breaking, new words
According to Cucerzan and Brill, more than 10% of search engine queries are misspelled
Consider whether the surrounding words “make sense” for your candidate set.
For candidate words with similar sounds but different spellings and different meanings.
(we can use the Soundex algorithm)
Allow for insertion of a space or hyphen, or deletion of a space; watch out for valid words that are NOT in the lexicon.
This model suggests treating the misspelled word as if a correctly spelled word has been distorted by being passed through a noisy communication channel.
Noise in this case refers to substitutions, insertions or deletions of letters
To predict the correct word, use Bayesian inference: the formula multiplies the prior probability of a candidate word by the likelihood of the observed misspelling given that word.
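Written out in standard noisy-channel notation (with x the observed misspelling, w a candidate word, and V the vocabulary):

$$\hat{w} = \arg\max_{w \in V} P(w \mid x) = \arg\max_{w \in V} P(x \mid w)\, P(w)$$

Here $$P(w)$$ is the prior (language-model) probability of the word and $$P(x \mid w)$$ is the likelihood (channel/error model) of producing the misspelling x from w.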
Initial step: Create a dictionary and encode it for fast retrieval
When a query is submitted, the spell checker examines each word and looks for possible character edits. Observation:
80% of errors are within edit distance 1
Almost all errors within edit distance 2
Take the output of step 2 and compute probabilities for the candidates using previously identified n-grams
Select the result with the highest probability.
Which data structure is usually used for efficient prefix matching in autocomplete and spell correction?
Trie.
Is the search for corrections carried out from left-to-right or right-to-left? What factor is used to control the response time?
The search for corrections is carried out from left-to-right.
The branching factor controls the amount of time required to search for spelling corrections.
O(m), where m is the length of the string, at the expense of increased storage.
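A minimal trie sketch with O(m) prefix lookup; the word list is illustrative:

```python
# A minimal trie: insert words, then find completions of a prefix in O(m) time,
# where m is the length of the query string.
class Trie:
    def __init__(self):
        self.root = {}

    def insert(self, word):
        node = self.root
        for ch in word:
            node = node.setdefault(ch, {})
        node["$"] = True        # end-of-word marker

    def words_with_prefix(self, prefix):
        node = self.root
        for ch in prefix:       # O(m) walk down the trie
            if ch not in node:
                return []
            node = node[ch]
        out = []
        def collect(n, path):
            if "$" in n:
                out.append(prefix + path)
            for ch, child in n.items():
                if ch != "$":
                    collect(child, path + ch)
        collect(node, "")
        return out

t = Trie()
for w in ["car", "card", "care", "cat"]:
    t.insert(w)
print(t.words_with_prefix("car"))   # ['car', 'card', 'care']
```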
Wikipedia’s list of common English misspellings
Aspell filtered version of that list
Birkbeck spelling error corpus
An n-gram model is a type of probabilistic language model for predicting the next item in a sequence.
Two benefits of n-gram models are simplicity and scalability: with larger n, a model can store more context with a well-understood space–time tradeoff, enabling small experiments to scale up efficiently.
Google has collected and uses a great deal of N-gram data. Google is using the Linguistic Data Consortium to distribute more than one trillion words extracted from public web pages.
Rather than store each word as a string, it is generally represented in memory as a 64-bit hash number, with the words themselves stored on disk.
Probabilities are generally quantized using only 4–8 bits (instead of 8-byte floats).
N-grams are stored in reverse tries.
N-grams can also be shrunk by pruning, for example only storing N-grams with counts greater than some threshold (such as the count threshold of 40 used for the Google N-gram release).
If a higher-order N-gram has a zero count, we simply back off to a lower-order N-gram, weighted by a fixed (context-independent) weight.
The backoff terminates at the unigram.
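A tiny sketch of that backoff idea in the spirit of "stupid backoff", using a fixed weight of 0.4; the corpus and counts are toy values:

```python
from collections import Counter

# Toy corpus; trigram/bigram/unigram counts for backoff scoring.
tokens = "the cat sat on the mat the cat ran".split()
unigrams = Counter(tokens)
bigrams  = Counter(zip(tokens, tokens[1:]))
trigrams = Counter(zip(tokens, tokens[1:], tokens[2:]))
ALPHA = 0.4   # fixed, context-independent backoff weight

def score(w3, w1, w2):
    """Score w3 given the context (w1, w2), backing off on zero counts."""
    if trigrams[(w1, w2, w3)] > 0:
        return trigrams[(w1, w2, w3)] / bigrams[(w1, w2)]
    if bigrams[(w2, w3)] > 0:
        return ALPHA * bigrams[(w2, w3)] / unigrams[w2]
    return ALPHA * ALPHA * unigrams[w3] / len(tokens)   # terminate at the unigram

print(score("sat", "the", "cat"))   # seen trigram
print(score("ran", "the", "dog"))   # backs off to bigram, then unigram
```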
For a given “big.txt”, extract individual words and build a dictionary (hashtable) that counts how many times each word appears. (for words not appearing, set default value to 1).
For a given query word, compute all the words within edit distance 1 of it, and all words within edit distance 2 of it.
For the given query word, if the word itself is in the original word set, return it as the correct word. Otherwise find the word with the highest frequency at edit distance 1; if none exists, find the word with the highest frequency at edit distance 2.
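A condensed Norvig-style sketch of that corrector; "big.txt" is whatever corpus file you supply, and the commented-out lines at the end show intended usage:

```python
import re
from collections import Counter

LETTERS = "abcdefghijklmnopqrstuvwxyz"

# Step 1: build a word-frequency dictionary from a corpus file ("big.txt" here).
def build_counts(path="big.txt"):
    text = open(path, encoding="utf-8").read().lower()
    return Counter(re.findall(r"[a-z]+", text))

# Step 2: all candidates within edit distance 1 (deletes, transposes, replaces, inserts).
def edits1(word):
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes    = [L + R[1:] for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    replaces   = [L + c + R[1:] for L, R in splits if R for c in LETTERS]
    inserts    = [L + c + R for L, R in splits for c in LETTERS]
    return set(deletes + transposes + replaces + inserts)

def edits2(word):
    return {e2 for e1 in edits1(word) for e2 in edits1(e1)}

# Step 3: prefer the word itself, then the most frequent distance-1 candidate,
# then the most frequent distance-2 candidate.
def correct(word, counts):
    for candidates in ([word], edits1(word), edits2(word)):
        known = [w for w in candidates if w in counts]
        if known:
            return max(known, key=counts.get)
    return word

# counts = build_counts("big.txt")
# print(correct("speling", counts))
```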
The minimum edit distance between two strings is the minimum number of editing operations (insertion, deletion, substitution) needed to transform one into the other
To compute the minimum edit distance D(n, m) for two strings X and Y, the formula is:
D(i, j)= min(
D(i-1, j)+1 // for deletion
D(i, j-1)+1 // for insertion
D(i-1, j-1)+2 // for substitution, if X[i]!=Y[j],
D(i-1, j-1) // if X[i]==Y[j]
)
For initial states, D(i, 0)=i, D(0, j)=j
Computing minimum distance time: O(mn)
Computing minimum distance space: O(mn)
Backtracing: O(m+n)
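A direct implementation of that recurrence (O(mn) time and space), using the substitution cost of 2 from above:

```python
def min_edit_distance(x, y):
    """Minimum edit distance with substitution cost 2, per the recurrence above."""
    n, m = len(x), len(y)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = i                      # delete all of x[:i]
    for j in range(1, m + 1):
        D[0][j] = j                      # insert all of y[:j]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if x[i - 1] == y[j - 1] else 2
            D[i][j] = min(D[i - 1][j] + 1,        # deletion
                          D[i][j - 1] + 1,        # insertion
                          D[i - 1][j - 1] + sub)  # substitution / match
    return D[n][m]

print(min_edit_distance("intention", "execution"))   # 8 under this cost model
```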
For spell correction: some letters are more likely to be mistyped than others
A specific table layout that allows visualization of the performance of an algorithm;
Each column of the matrix represents the instances in a predicted class while each row represents the instances in an actual class (or vice-versa)
To account for special costs for deletion and insertion during edit distance calculation:
$$D(i, j) = \min\begin{cases} D(i-1, j) + del[x[i]] & \text{deletion} \\ D(i, j-1) + ins[y[j]] & \text{insertion} \\ D(i-1, j-1) + sub[x[i], y[j]] & \text{substitution, if } x[i] \ne y[j] \\ D(i-1, j-1) & \text{if } x[i] = y[j] \end{cases}$$
For initial states:
$$D(0,0) = 0$$
$$D(i, 0) = D(i-1, 0) + del[x[i]], \quad i = 1 \ldots N$$
$$D(0, j) = D(0, j-1) + ins[y[j]], \quad j = 1 \ldots M$$
To correct query phrases that match no docs, retrieve dictionary terms close (in weighted edit distance) to each query term. Then try all possible resulting phrases with one word "fixed" at a time.