
Final Review 2



Google AdWords Bidding

  • Bidders specify:

    • Search terms that trigger their bid.

    • Amount to bid for each search term.

    • Overall ad budget and bid limits.

  • Google may set a reserve/minimum price for each term.

Google AdWords Ranking

  • Google estimates each bidder's click-through rate (CTR) if listed first.

  • Ads with very low CTRs are not displayed.

  • Bids are ranked by multiplying the click-through rate and the bid amount.

  • Ads are displayed in rank order.

Google AdWords Payment

  • Google is paid only when an ad is clicked.

  • The price paid is the smallest amount the bidder could have bid to get their ranking.

    • Example: If A bids $1.00 and B bids $0.50, when A's ad is clicked, A pays $0.51. However, if A's ad only receives impressions (e.g., 1,000 views) without clicks, A is charged nothing.

Ad Rank and Cost-Per-Click (CPC)

  • Ad Rank: Determines the position of an ad.

    • $$\text{Ad Rank} = \text{Bid} \times \text{Quality Score}$$

    • $$\text{Ad Rank} = \text{Bid} \times \text{Click Probability}$$

  • CPC: Determined by the ad rank of the next highest ad below you.

    • Exception: If you are the only bidder or the lowest bid, you pay your maximum bid per click!

    • $$CPC = \frac{\text{Ad Rank of the ad below you}}{\text{Your Quality Score}} + \$0.01$$
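
A small worked example of the Ad Rank and CPC formulas above; the bidders, bids, and quality scores below are made up for illustration:

```python
# Hypothetical bidders: (name, max bid in $, quality score)
bidders = [("A", 2.00, 4.0), ("B", 4.00, 1.5), ("C", 1.00, 5.0)]

# Ad Rank = Bid * Quality Score; higher Ad Rank gets a higher position
ranked = sorted(bidders, key=lambda b: b[1] * b[2], reverse=True)

for i, (name, bid, qs) in enumerate(ranked):
    ad_rank = bid * qs
    if i + 1 < len(ranked):
        # CPC = Ad Rank of the ad below you / your Quality Score + $0.01
        below_rank = ranked[i + 1][1] * ranked[i + 1][2]
        cpc = below_rank / qs + 0.01
    else:
        # The lowest (or only) bidder pays their maximum bid per click
        cpc = bid
    print(f"{name}: Ad Rank = {ad_rank:.2f}, CPC = ${cpc:.2f}")
```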

Click-Through Rates (CTR) in AdWords

  • AdWords bidding penalizes advertisers with low-quality scores.

  • High-Quality Scores lead to higher ad ranks and lower CPCs.

Average Cost of AdWords

  • The average cost per click varies by keyword and industry.

    • Roughly $2.32 per click on the search network.

    • Roughly $0.58 per click on the display network.

Factors Influencing Click-Through Probability

  • Historical click performance of the ad

  • Landing page quality

  • Relevance to the user

  • User click-through rates

Landing Pages

  • Landing pages are the pages that appear when a user clicks on your ad.

Google AdSense

  • AdSense is a service for placing Google ads on web pages.

  • Offers ads besides cost-per-click:

    • Cost per Thousand displays (CPM)

    • Cost per Engagement: Advertisers pay only when users actively engage with ads (e.g., hovering, expanding, reviewing the landing page).

  • Google assigns two numbers for advertisers: "googleadclient" and "googleadslot."

Tracking Pixels

  • Small, transparent images on a web page that allow tracking page loads by a web server.

AdSense Technology

  • Based on WordNet, a database of word meanings developed at Princeton.

WordNet

  • A semantic lexicon for the English language.

  • Groups English words into sets of synonyms called synsets.

  • Provides short, general definitions.

  • Records semantic relations between synsets.

  • Purpose: a) Intuitive dictionary/thesaurus combination, b) Support automatic text analysis.

Ad Exchange

  • A technology platform facilitating the buying and selling of media advertising inventory from multiple ad networks.

Ad Network

  • Key function: Aggregation of ad space supply from publishers and matching it with advertiser demand.

DoubleClick Ad Exchange

  • A real-time marketplace owned by Google for buying and selling advertising.

  • Uses Dynamic Advertising Reporting and Targeting (DART) technology.

DART Technology

  • When a user calls up a webpage using DART, a tag signals DoubleClick's server to find an advertisement matching the marketer's needs with the user's profile.

  • Different users can see different advertisements on the same site simultaneously.

  • Cookies refine targeting by tracking repeat visitors or those who have seen a specific ad.

DoubleClick User Tracking

  • DoubleClick places small (1x1 pixels) GIF files on sites to load cookies on your machine.

  • Can potentially see search strings typed into search engines.

DART Cookies Information

  • Number of unique users their advertisements were displayed to.

  • How many users clicked on their ads or paid listings.

  • Which ads or paid listings they clicked on.

Web Beacons

  • Small strings of HTML code placed in a web page.

  • Also called "clear GIFs," pixel tags, web bugs, tracking bugs.

Information Sent to Server When a Web Beacon is Viewed

  • IP address of the computer.

  • URL of the page (referer).

  • URL of the web beacon image.

  • Time the Web Beacon was viewed.

  • Browser type (user_agent).

  • Any previously set cookie values.

IP Address Challenge with Web Beacons

  • IP address may change; unique ID stored with the cookie helps track the user.

Real-Time Bidding (RTB)

  • A protocol for two digital advertising companies to transact.

Ad Exchange Efficiency

  • Ad Exchange consolidates the integration of ad networks.

Knowledgebase (KB)

  • Technology used to store complex structured and unstructured information.

Knowledge Base System Components

  • Knowledge-base: Represents facts about the world.

  • Inference Engine: Reasons about facts and deduces new facts using rules and logic.

Knowledgebase Examples

  • Freebase, Google's Knowledge Graph, Apple's Siri, IBM’s Watson

Taxonomy

  • A hierarchy of concepts with parent/child relations (subClass/superClass, broader/narrower).

Ontology

  • The representation of knowledge in a knowledgebase.

  • Can express arbitrary complex relations between concepts (e.g., X marriedTo Y, A worksFor B, C locatedIn D).

DBpedia

  • A crowd-sourced community effort to extract structured information from Wikipedia.

  • Makes information available on the Web using RDF (Resource Description Framework) and SPARQL.

Taxonomy vs. Ontology

  • Taxonomies are narrower than ontologies, which include a larger variety of relation types.

  • Mathematically:

    • Taxonomy: A tree structure of classifications.

    • Ontology: A directed, labeled, cyclic graph.

Siri Steps/Challenges

  • Convert speech to text

  • Convert text to entities

  • Convert entities to a query

  • Retrieve results

  • Interpret results

  • Convert back to speech

Siri Related Technologies

  • Automated speech recognition

  • Parts of Speech tagging

  • Question/Answer analysis

  • Database interfaces to Wolfram, OpenTable, etc

  • Transforming machine output to natural language

  • Text-to-speech processing

Knowledgebase Graph Transformation

  • A knowledgebase can be seen as a directed labeled multigraph, where nodes are entities and edges are relations.

Multigraph

  • A graph with multiple edges between the same end nodes.

Resource Description Format (RDF) Triple

  • Consists of: subject, predicate, object.

Hyponym, Holonym, Hypernym

  • Hyponym: a more specific term.

  • Holonym: a term denoting the whole.

  • Hypernym: a broader or superordinate term.

Inference Engine

  • A component that applies logical rules to a knowledgebase to deduce new information.

Forward Chaining vs. Backward Chaining

  • Forward chaining: Starts with known facts and asserts new facts; search engines typically use forward chaining

  • Backward chaining: Starts with goals and works backward to determine necessary facts.

Forward Chaining

  • Repeated application of modus ponens.

  • Starts with available data and uses inference rules to extract more data until a goal is reached.
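
A minimal sketch of forward chaining over RDF-style (subject, predicate, object) triples; the facts and the single transitivity rule are made-up illustrations, not material from the notes:

```python
# Knowledge base as a set of (subject, predicate, object) triples
facts = {
    ("Alice", "worksFor", "Acme"),
    ("Acme", "locatedIn", "California"),
    ("California", "locatedIn", "USA"),
}

def apply_rules(kb):
    # One toy rule: locatedIn is transitive (repeated modus ponens over triples)
    new = set()
    for (a, p1, b) in kb:
        for (c, p2, d) in kb:
            if p1 == p2 == "locatedIn" and b == c:
                new.add((a, "locatedIn", d))
    return new - kb

# Forward chaining: keep asserting new facts until a fixed point is reached
while True:
    derived = apply_rules(facts)
    if not derived:
        break
    facts |= derived

print(sorted(facts))  # now also contains ("Acme", "locatedIn", "USA")
```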

Knowledge Graph Enhancement of Google Search

  • Improves variety of search results by locating alternate interpretations of query terms (e.g., "taj mahal" - mausoleum or musician).

  • Provides deeper and broader results (e.g., person entities include relations like age, birthplace, marital status, children, education, etc.).

  • Provides the best summary by exploiting relationships among entities.

Wikipedia Statistics

  • 30 million articles, 4 million in English.

  • 16 million images.

  • 8000 views per second.

  • 500 million unique visitors per month.

  • 3.7 billion monthly mobile page views.

  • 2.1 billion edits, 700 million English.

  • 20 million registered users.

  • 120,000 active editors.

  • 1,400 administrators.

  • All working for free, with no central control

Wikipedia Founder

  • Jimmy Wales

Wikipedia’s Five Pillars

  • Encyclopaedia: Notable topics, No original research (NOR), Neutral point of view (NPOV), Verifiability (referencing).

  • Free content: Anyone can edit, No copyright infringements.

  • Be civil.

  • No firm rules.

Wikipedia Link Types

  • Article links

  • Category links

  • Interlingual links

Wikipedia Special Pages

  • Redirect pages: Short pages providing equivalent names for an entity.

  • Disambiguation pages: Pages linking to multiple similarly named articles.

Wikipedia Category Graph (WCG)

  • Articles form a network of semantically related terms.

  • Categories are organized in a taxonomy-like structure.

Wikipedia Page Protection

  • Semi-protect pages: Lock pages from being edited by unregistered and new users.

  • Protect pages: Lock pages from being edited.

Wikipedia Revenue

  • Does not take advertising, only donations.

Wikimedia

  • Wikimedia Foundation, Inc. (WMF): American non-profit organization headquartered in San Francisco.

  • Owns the internet domain names and hosts Wikipedia.

  • Employs over 280 people, with annual revenues in excess of $75 million (as of 2015).

Related Projects to Wikipedia

  • Commons for multimedia.

  • Wiktionary as a free dictionary.

  • Wikidata for structured data.

Wikidata

  • An effort to convert Wikipedia data into a knowledgebase.

  • Aims to create a free RDF-like KB about the world that can be read/edited by humans & machines.

  • Wikidata clients use the repository to populate Web pages or Wikipedia infoboxes.

Wikidata Size

  • 3 years old, 14M items, 209M edits

Wikidata vs. DBpedia

  • Two structured representations of Wikipedia.

    • Wikidata: Manually curated; intended to be the master source of structured data for Wikipedia; synchronized through bots; data is fairly accurate, but data depth is still small.

    • DBpedia: Automatically extracted; live updates; one-way extraction only; data reach is deep, but there are problems in the ontology and mappings, especially for non-English languages.

Wikipedia's Structure

  • Categories also contain classes, but do not form a taxonomic hierarchy. There is no ISA hierarchy.

Website Model Wiki

  • Created by Ward Cunningham

  • Wikis = Website that you can create and edit in real-time.

Difference between Nupedia and Wikipedia

  • In Nupedia, content can only be edited by experts.

  • In Wikipedia, content can be edited by anyone.

Query Processing

  • Considering Only Terms with High-idf Scores: Postings of low-idf terms (e.g., stop words) contain many docs, so ignoring those terms eliminates many docs from the set of contenders.

  • Considering Only Docs Containing Many Query Terms: Easy to implement in postings traversal.

Champion Lists Heuristic

  • Pre-compute for each dictionary term t, the r docs of highest weight (tf-idf) in t’s postings.

  • Call this the champion list for t (aka fancy list or top docs for t).

  • r has to be chosen at index build time, thus it’s possible that r < K (number of docs to return).

  • At query time, only compute scores for docs in the champion list of some query term.
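
A minimal sketch of the champion-lists idea; the toy postings, tf-idf weights, champion-list size r, and K are all made up for illustration:

```python
from collections import defaultdict

# Toy postings: term -> {doc_id: tf-idf weight}
postings = {
    "apple":  {1: 0.9, 2: 0.2, 3: 0.7, 4: 0.1},
    "banana": {2: 0.8, 3: 0.3, 5: 0.6},
}

R = 2  # champion list size r, chosen at index build time

# Pre-compute: for each term, keep only the r docs of highest weight
champion = {
    t: dict(sorted(w.items(), key=lambda kv: kv[1], reverse=True)[:R])
    for t, w in postings.items()
}

def query(terms, k=2):
    # At query time, score only docs appearing in some champion list
    scores = defaultdict(float)
    for t in terms:
        for doc, w in champion.get(t, {}).items():
            scores[doc] += w
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:k]

print(query(["apple", "banana"]))  # [(1, 0.9), (2, 0.8)] for this toy data
```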

Static Quality Scores Heuristic

  • Top-ranking documents should be both relevant and authoritative.

  • Relevance is modeled by cosine scores.

  • Authority is a query-independent property of a document, assigned a query-independent quality score g(d) between 0 and 1.

High Authority Signals Examples

  • Wikipedia among websites.

  • Articles in certain newspapers.

  • A paper with many citations.

  • High PageRank.

Net Score Calculation

  • Combining cosine relevance and authority

  • $$net\_score(q,d) = g(d) + cosine(q,d)$$

  • Other linear combinations can be used, assigning different weights to the two factors.

Inverted List Reorganization

  • Order all postings by g(d), the authority measure.

  • Most authoritative documents appear early in the postings list.

  • Maintain for each term a champion list of the r docs with highest g(d) + tf-idf(t,d)

  • Seek top-K results from only the docs in these champion lists.

High and Low Lists Heuristic

  • For each term, maintain two postings lists called high and low.

  • high is the champion list.

  • When traversing postings on a query, only traverse high lists first.

    • If we get more than K docs, select the top K and stop.

    • Else proceed to get docs from the low lists.

  • Can be used even for simple cosine scores, without global quality g(d).

Google Ranking Factors

  • Content factors:

    • Content relevance

    • Word count

  • User signals:

    • Click through rate

    • Bounce rate

  • Technical factors:

    • Presence of H1/H2

    • Use of HTTPS

  • User experience:

    • Number of internal/external links

  • Social signals:

    • Facebook total

    • Tweets

Content Relevance Scores

  • The highest content relevance scores were found among the results for positions 3 to 6.

Landing Page Word Count

  • It has been rising for years.

Click-Through Rate (CTR)

  • Measures the average percentage of users who click on the result at each position on the Search Engine Results Page (SERP).

Bounce Rate

  • Measures the percentage of users who only click on the URL from Google’s search results, without visiting any other URLs on the domain, and then return back to the SERP.

  • These are single-page sessions where the user leaves the site without interacting with the page.

Google HTTPS

  • Page encryption using HTTPS is rising.

Social Network User Interactions

  • Facebook has the highest level of user interactions.

Mobile-Friendly Websites

  • All top 100 websites have a mobile-friendly version; they use either a mobile sub-domain or responsive design.

SEO Findings and Google Algorithm

  • Correlations are not synonymous with causal relationships.

  • There is no guarantee that respective factors actually have any impact on the ranking.

  • There are many examples of illusory correlations (logical fallacy).

Google Services with Over a Billion Monthly Active Users

  • Google Search, Google Maps, YouTube, Android, Gmail, the Play Store and Google Chrome.

Popular Browsers in US

  • Chrome and Safari.

Default Search Engines

  • Chrome: Google search.

  • Firefox: Yahoo.

  • Microsoft Edge and Internet Explorer: Bing.

Query Processing Using Conversational Search Scenario

  • Use a microphone to give a spoken query like “how old is Obama”, and after getting the result, send a follow-up query like “how tall is he”.

Database Shard

  • A horizontal partition of data in a database or search engine.

  • Each individual partition is referred to as a shard.

  • Each shard is held on a separate database server instance, to spread the load.

Google Query Processing Basic Steps

  1. Parse the query

  2. Convert words into wordIDs using the lexicon

  3. Select the barrels that contain documents which match the wordIDs

  4. Scan through the document list until there is a document that matches all of the search terms

  5. Compute the rank of that document for the query (using PageRank as one component)

  6. Repeat step 4 until no documents are found and we've examined all of the barrels

  7. Sort the set of returned documents that have been matched by document rank and return the top k.

Lucene

  • High performance, scalable, full-text search library.

  • Does not include crawlers or explicit searching of documents. Searching the index must be implemented elsewhere.

  • 100% Java, no dependencies. Offered by the Apache Software Foundation

Solr

  • A full-text search server based on Lucene.

  • Communicates with Lucene via XML/HTTP, JSON Interfaces.

  • Written in Java 5.

Solr Advantages

  • Flexible data schema to define types and fields

  • Output includes highlighting matches

  • Configurable advanced caching

  • Index Replication

  • Extensible Open Architecture, Plugins

  • Web Administration Interface

Lucene's Inverted Index

  • Stores a positional inverted index.

Lucene's Analyzer

  • Creates tokens using a Tokenizer and/or applying Filters (Token Filters).

Lucene's Similarity Decision

  • Lucene scores using a combination of TF-IDF and vector closeness

  • TF-IDF: $$W(x,y) = tf(x,y) \cdot \log\left(\frac{N}{df(x)}\right)$$ (term frequency * inverse document frequency)

  • Cosine similarity: $$cosine(V(q), V(d)) = \frac{V(q) \cdot V(d)}{|V(q)| \, |V(d)|}$$

TF-IDF Formula Terms

  • $$tf(x,y)$$: term frequency of x in y

  • $$N$$: total number of documents

  • $$df(x)$$: number of documents containing x

Cosine-Similarity Formula Terms

  • $$V(q) \cdot V(d)$$ is the dot product of the weighted vectors

  • $$|V(q)|, |V(d)|$$ are the Euclidean norms of the vectors (square root of the sum of squares)
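
A minimal sketch tying the TF-IDF and cosine-similarity formulas above together on a toy corpus; this follows the textbook formulas, not Lucene's actual scoring code:

```python
import math
from collections import Counter

docs = ["the cat sat on the mat", "the dog sat", "cats and dogs"]
N = len(docs)
tokenized = [d.split() for d in docs]

def df(term):
    # number of documents containing the term
    return sum(term in toks for toks in tokenized)

def tfidf_vector(toks):
    tf = Counter(toks)
    # W(x, y) = tf(x, y) * log(N / df(x)); skip terms unseen in the corpus
    return {t: tf[t] * math.log(N / df(t)) for t in tf if df(t) > 0}

def cosine(v, w):
    dot = sum(v[t] * w.get(t, 0.0) for t in v)
    norms = math.sqrt(sum(x * x for x in v.values())) * math.sqrt(sum(x * x for x in w.values()))
    return dot / norms if norms else 0.0

q = tfidf_vector("cat sat".split())
for i, toks in enumerate(tokenized):
    print(i, round(cosine(q, tfidf_vector(toks)), 3))
```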

Solr Server Function

  • Accepts queries in the form of http requests and returns results as JSON strings

Solr Query Result

  • status, always 0 in a successful response

  • QTime, the server-side query time in milliseconds

  • numFound, the total number of documents matching the query

  • start, the offset into the ordered list of results

  • field types include str, boolean, int, long, float, double, date, lst, arr

  • lst is a named list

  • arr is an array

  • multivalued fields are returned in an arr element.

Solr Index XML Data

  • 1. Start Solr: java -jar start.jar

  • 2. Index your XML data: java -jar post.jar *.xml

Solr Query Meaning

  • http://solr/select?q=electronics&fl=name+price : search for "electronics" and limit the returned fields to name and price

  • http://solr/select?q=electronics&sort=price+desc : search for "electronics" and sort results by price in descending order
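
A minimal sketch of issuing the same kind of select query over HTTP, assuming the Python requests library and a hypothetical local Solr core; the host, port, and core name are placeholders, not values from the notes:

```python
import requests

# Hypothetical local Solr instance and core
SOLR = "http://localhost:8983/solr/techproducts/select"

params = {
    "q": "electronics",    # query term
    "fl": "name,price",    # limit returned fields to name and price
    "sort": "price desc",  # sort by price, descending
    "wt": "json",          # ask for a JSON response
}

resp = requests.get(SOLR, params=params).json()
print(resp["response"]["numFound"])      # total number of matching documents
for doc in resp["response"]["docs"]:
    print(doc.get("name"), doc.get("price"))
```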

Query Types Supported by Solr

  • Single and multi-term queries

  • +, -, AND, OR, NOT operators are supported

  • Range queries on date or numeric fields

  • Boost queries

  • Fuzzy search: a search for words that are similar in spelling

  • Proximity search: uses a sloppy phrase query; the closer together the two terms appear, the higher the score.

Lucene/Solr Websites

  • Apple.com, Netflix.com, Whitehouse.gov, Indeed.com, Twitter.com, LinkedIn.com

VirtualBox

  • VirtualBox is an open-source, freely available application that lets you run multiple operating systems on a single machine; it runs on Windows and other platforms.

Ubuntu

  • Ubuntu is a Linux-based operating system for personal computers, smartphones, and network servers; it uses Unity as its default desktop environment.

Solr's Basic Operations

  • Query, index, delete, commit, and optimize

Query Tokenization Example

  • For the query “Queen Victoria”, after tokenization, Queen: position:0; offset:0; length:5; Victoria: position:1; offset:6; length:8

Query Type Explanation

  • Term query: matches a precise term

  • Wildcard query: portions of term left unspecified, e.g. c*t=cat, cot, cant

  • Prefix query: end of term left unspecified, e.g. ca*=cat, camp, cad

  • Fuzzy query: matches an imprecise term

  • Phrase query: matches multiple terms in sequence/proximity

  • Range query: matches an alphabetical or numeric range of terms

  • Boolean query: logically combines other queries

Lucene Terms

  • Lucene.NET: implementation of Lucene in C#

  • PyLucene: Lucene running in JVM embedded in CPython

  • Tika: Java library for text and metadata extraction

  • Luke: tool for inspecting Lucene indexes

Lucene Document Organization

  • A Lucene document is a collection of fields.

Lucene Index Storage

  • Fields are stored separately from the index (.fdt file)

Lucene Field

  • It may be indexed and stored, indexed but not stored, or stored but not indexed.

Lucene Index Segment

  • A single Lucene index may be split into multiple segments.

  • Each segment stores one or more of the documents in its own inverted index and field storage.

Lucene's Merge Segments

  • MergePolicy: findMerges() is invoked when a new segment is created and returns a list of segments to merge.

  • MergeScheduler: processes the merges, sometimes in separate threads.

Lucene Document Deletion

  • A deleted document is only marked for deletion; it is actually removed when its segment gets merged.

Lucene Document Modification

  • You can’t. Only whole documents can be added and deleted; fields of an existing document cannot be modified.

Solr Creator

  • Yonik Seeley

Solr Config File - Schema.xml

  • Contains field types, fields, and dynamic fields; in newer Solr versions this file is named managed-schema.

Solr Config File - Solrconfig.xml

  • It contains Lucene index parameters, request handler mappings, cache settings and plugins

Indexing and Search with Databases

  • Yes, Solr supports indexing data from databases and via HTTP GET, and supports full/incremental indexing and searching.

Solr File Extensions for Indexing

  • Solr supports indexing MS Office, PDF, RTF, OpenDocument, images, MP3, ZIP, etc.

Solr Results Type

  • JSON is the default format of Solr search results.

SolrJs

  • A front-end JavaScript framework for building web-page UIs on top of Solr.

Solr Security

  • It relies on server and container security.

Solr Scaling

  • Solr supports both master/slave replication and sharding, but not all Solr features support sharding.

Solr Real-Time Index Updates

  • Yes, Solr can achieve near-real-time index updates.

Types of Spelling Errors

  • Non-word errors

  • Typographical errors

  • Cognitive errors

Main Spelling Tasks

  • Spelling Error Detection

  • Spelling Error Correction

Spelling Error Correction Requirements

  • A large dictionary, the use of context, and fast response times (for autocomplete) are needed.

Non-Word Spelling Error Detection

  • Any word not in a dictionary is presumed to be an error.

  • The larger the dictionary the better

Non-Word Spelling Error Correction

  • Generate candidates from the dictionary that are close to the error.

  • There are two main approaches:

    • (a) Shortest weighted edit distance

    • Edit distance is a way of quantifying how dissimilar two strings (e.g., words) are to one another by counting the minimum number of operations required to transform one string into the other

    • (b) Highest noisy channel probability

    • Use probabilities to select the most likely candidates

Common Types of Misspellings

  • Typing quickly, keyboard adjacency, inconsistent rules, ambiguous word breaking, new words

Search Queries Misspelled

  • According to Cucerzan and Brill, more than 10% of search engine queries are misspelled

Using Context for Spelling Correction

  • Consider whether the surrounding words “make sense” for your candidate set.

  • For candidate words with similar sounds but different spellings and meanings, the Soundex algorithm can be used.

Challenges for Identifying Spelling Errors

  • Allow for insertion of a space or hyphen; Or deletion of a space; Watch out for words NOT in the Lexicon that are valid

Noisy Channel Model

  • This model suggests treating the misspelled word as if a correctly spelled word has been distorted by being passed through a noisy communication channel.

  • Noise in this case refers to substitutions, insertions or deletions of letters

Bayesian inference

  • To predict the correct word, use Bayesian inference: the formula multiplies the prior probability of a word's existence by the likelihood of producing the observed misspelling from it.
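
In the usual noisy channel notation (x is the observed misspelling, w ranges over the dictionary V), this is:

$$\hat{w} = \arg\max_{w \in V} P(x \mid w) \, P(w)$$

where $$P(w)$$ is the prior from the language model and $$P(x \mid w)$$ is the likelihood under the channel (error) model.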

Basic Spell Correction Algorithm Steps

  • Initial step: Create a dictionary and encode it for fast retrieval

  • When a query is submitted, the spell checker examines each word and looks for possible character edits. Observation:

    • 80% of errors are within edit distance 1

    • Almost all errors within edit distance 2

  • Take the output of step 2 and compute probabilities for the candidates using previously identified n-grams

  • Select the result with the highest probability

Prefix Matching Data Structure

  • Usually which data structure is used for efficient prefix matching in autocomplete and spell correction?

  • Trie.

Spelling Correction Search Direction

  • Is the search for corrections carried out from left-to-right or right-to-left? What factor is used to control the response time?

  • The search for corrections is carried out from left-to-right.

  • The branching factor controls the amount of time required to search for spelling corrections.

Trie Time Complexity

  • O(m). where m is the length of the string, at the expense of increased storage
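
A minimal trie sketch for prefix matching; insertion and walking a prefix are O(m) in the string length m, at the cost of extra storage (the word list is illustrative):

```python
class TrieNode:
    def __init__(self):
        self.children = {}   # char -> TrieNode
        self.is_word = False

def insert(root, word):
    node = root
    for ch in word:
        node = node.children.setdefault(ch, TrieNode())
    node.is_word = True

def words_with_prefix(root, prefix):
    # Walk down the prefix in O(len(prefix)), then collect completions
    node = root
    for ch in prefix:
        if ch not in node.children:
            return []
        node = node.children[ch]
    out = []
    def collect(n, path):
        if n.is_word:
            out.append(prefix + path)
        for ch, child in n.children.items():
            collect(child, path + ch)
    collect(node, "")
    return out

root = TrieNode()
for w in ["cat", "cap", "car", "dog"]:
    insert(root, w)
print(words_with_prefix(root, "ca"))  # ['cat', 'cap', 'car'] (order may vary)
```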

Spelling Correction Dictionary Error Test Sets

  • Wikipedia's list of common English misspellings

  • Aspell filtered version of that list

  • Birkbeck spelling error corpus

N-gram Model

  • An n-gram model is a type of probabilistic language model for predicting the next item in a sequence.

  • Two benefits of n-gram models are simplicity and scalability: with larger n, a model can store more context with a well-understood space–time tradeoff, enabling small experiments to scale up efficiently.

Google N-gram Data Use

  • Google has collected and uses a great deal of N-gram data; it has used the Linguistic Data Consortium to distribute more than one trillion words extracted from public web pages.

Google N-gram Data Storage

  • Rather than storing each word as a string, it is generally represented in memory as a 64-bit hash number, with the words themselves stored on disk.

  • Probabilities are generally quantized using only 4-8 bits (instead of 8-byte floats).

  • N-grams are stored in reverse tries.

  • N-grams can also be shrunk by pruning, for example only storing N-grams with counts greater than some threshold (such as the count threshold of 40 used for the Google N-gram release).

Applying the N-Gram Model to Spelling Correction

  • If a higher-order N-gram has a zero count, we simply back off to a lower-order N-gram, weighted by a fixed (context-independent) weight.

  • The backoff terminates in the unigram.
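
A minimal sketch of this fixed-weight backoff, assuming bigram and unigram counts are already available; the toy counts and the 0.4 backoff weight are illustrative, not values from the notes:

```python
# Toy counts; in practice these come from a large corpus
unigram = {"the": 100, "cat": 10, "hat": 8}
bigram = {("the", "cat"): 6, ("the", "hat"): 3}
TOTAL = sum(unigram.values())
BACKOFF = 0.4  # fixed, context-independent backoff weight (illustrative)

def prob(prev, word):
    # Use the bigram estimate if the bigram has a nonzero count...
    if bigram.get((prev, word), 0) > 0:
        return bigram[(prev, word)] / unigram[prev]
    # ...otherwise back off to the unigram, scaled by the fixed weight
    return BACKOFF * unigram.get(word, 0) / TOTAL

print(prob("the", "cat"))  # bigram estimate: 6/100
print(prob("cat", "hat"))  # backed-off unigram estimate: 0.4 * 8/118
```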

Peter Norvig’s Spelling Corrector Steps

  • For a given "big.txt", extract individual words and build a dictionary (hashtable) that counts how many times each word appears (for words not appearing, the default value is 1).

  • For a given query word, compute all the words within edit distance 1 of it, and the words within edit distance 2.

  • If the query word itself is in the original word set, return it as the correct word. Otherwise, return the word with the highest frequency at edit distance 1; if none exists, return the word with the highest frequency at edit distance 2.
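
A condensed sketch following these steps; the corpus file name "big.txt" comes from the description above, and the edit-distance-1 operations (deletes, transposes, replaces, inserts) follow Norvig's published corrector:

```python
import re
from collections import Counter

# Step 1: count word frequencies in the corpus file
WORDS = Counter(re.findall(r"[a-z]+", open("big.txt").read().lower()))
LETTERS = "abcdefghijklmnopqrstuvwxyz"

def edits1(word):
    # All strings at edit distance 1: deletes, transposes, replaces, inserts
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [L + R[1:] for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    replaces = [L + c + R[1:] for L, R in splits if R for c in LETTERS]
    inserts = [L + c + R for L, R in splits for c in LETTERS]
    return set(deletes + transposes + replaces + inserts)

def known(words):
    return {w for w in words if w in WORDS}

def correction(word):
    # Prefer the word itself, then known words at distance 1, then distance 2
    candidates = (known([word])
                  or known(edits1(word))
                  or known(e2 for e1 in edits1(word) for e2 in edits1(e1))
                  or [word])
    return max(candidates, key=lambda w: WORDS.get(w, 1))

print(correction("speling"))  # e.g. 'spelling', if present in the corpus
```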

Minimum Edit Distance

  • The minimum edit distance between two strings is the minimum number of editing operations (insertion, deletion, substitution) needed to transform one into the other

Levenshtein Algorithm Recursion Formula

  • To compute the minimum edit distance D(n, m) for two strings X and Y, the formula is:

  • D(i, j)= min(

    • D(i-1, j)+1 // for deletion

    • D(i, j-1)+1 // for insertion

    • D(i-1, j-1)+2 // for substitution, if X[i]!=Y[j],

    • D(i-1, j-1) // if X[i]==Y[j]
      )

  • For initial states, D(i, 0)=i, D(0, j)=j
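
A minimal dynamic-programming implementation of this recursion, using substitution cost 2 as in the formula above:

```python
def min_edit_distance(x, y):
    n, m = len(x), len(y)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    # Initial states: D(i, 0) = i, D(0, j) = j
    for i in range(n + 1):
        D[i][0] = i
    for j in range(m + 1):
        D[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if x[i - 1] == y[j - 1] else 2   # substitution costs 2
            D[i][j] = min(D[i - 1][j] + 1,           # deletion
                          D[i][j - 1] + 1,           # insertion
                          D[i - 1][j - 1] + sub)     # substitution / match
    return D[n][m]

print(min_edit_distance("intention", "execution"))  # 8 with these costs
```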

Minimum Edit Distance Complexities

  • Computing minimum distance time: O(mn)

  • Computing minimum distance space: O(mn)

  • Backtracing: O(m+n)

Adding Weights to Edit Distance Computation

  • For spell correction: some letters are more likely to be mistyped than others

Confusion Matrix

  • A specific table layout that allows visualization of the performance of an algorithm;

  • Each column of the matrix represents the instances in a predicted class while each row represents the instances in an actual class (or vice-versa)

How to add Special Cost for Deletion and Insertion to calculate Edit Distance

  • To account for special costs for deletion and insertion during edit distance calculation:

$$
D(i, j) = \min \begin{cases}
D(i-1, j) + del[x[i]] & \text{(deletion)} \\
D(i, j-1) + ins[y[j]] & \text{(insertion)} \\
D(i-1, j-1) + sub[x[i], y[j]] & \text{(substitution, if } X[i] \neq Y[j]) \\
D(i-1, j-1) & \text{(if } X[i] = Y[j])
\end{cases}
$$

For the initial states:

$$D(0,0) = 0$$
$$D(i, 0) = D(i-1, 0) + del[x[i]], \quad i = 1 \ldots N$$
$$D(0, j) = D(0, j-1) + ins[y[j]], \quad j = 1 \ldots M$$
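
A minimal sketch of this weighted variant, assuming per-character deletion/insertion cost tables and a substitution cost function; the specific cost values here are illustrative placeholders for confusion-matrix-derived weights:

```python
# Illustrative per-character costs; real values come from confusion-matrix counts
del_cost = {c: 1.0 for c in "abcdefghijklmnopqrstuvwxyz"}
ins_cost = {c: 1.0 for c in "abcdefghijklmnopqrstuvwxyz"}

def sub_cost(a, b):
    # e.g. keyboard-adjacent pairs could be made cheaper than 1.5 here
    return 0.0 if a == b else 1.5

def weighted_edit_distance(x, y):
    n, m = len(x), len(y)
    D = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = D[i - 1][0] + del_cost[x[i - 1]]
    for j in range(1, m + 1):
        D[0][j] = D[0][j - 1] + ins_cost[y[j - 1]]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i][j] = min(D[i - 1][j] + del_cost[x[i - 1]],
                          D[i][j - 1] + ins_cost[y[j - 1]],
                          D[i - 1][j - 1] + sub_cost(x[i - 1], y[j - 1]))
    return D[n][m]

print(weighted_edit_distance("cat", "cart"))  # 1.0 (one insertion of 'r')
```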

Context-Sensitive Correction

  • To correct query phrases that match no docs, retrieve dictionary terms close (in weighted edit distance) to each query term. Then try all possible resulting phrases with one word "fixed" at a time.

