Information retrieval is the standard name for the field, but traditionally it meant document retrieval, leaving further analysis to the user. Modern search engines have improved on this by combining information retrieval with knowledge graphs, inferencing, query history, location data, and natural language processing. Question Answering (QA) focuses on building systems that automatically answer questions posed by humans in natural language.
Around 10-20% of the queries in search-engine logs are questions.
In the past, Google's approach involved finding the question as a string on the web and returning the subsequent sentence as the answer. This worked effectively for FAQ-style questions but often failed otherwise. A more sophisticated version combines knowledge graphs, N-grams, WordNet, and NLP techniques.
Example: Question: Who was the prime minister of Australia during the Great Depression? Answer: James Scullin (Labor) 1929–31
Many questions present semantic challenges; entity identification and disambiguation are crucial.
NLP is essential because keyword matching is insufficient. Consider the question: "When was Wendy’s founded?" A passage might mention Wendy Moonan and the founding of the Murano glassmaking industry in 1291, leading to an incorrect answer.
Example: The renowned Murano glassmaking industry, on an island in the Venetian lagoon, has gone through several reincarnations since it was founded in 1291. Three exhibitions of 20th-century Murano glass are coming up in New York. By Wendy Moonan.
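To see why bag-of-words matching goes wrong here, consider a minimal sketch; the scoring function and the second passage are illustrative, not part of any particular engine (the first passage is the Murano example above):

```python
import re

def tokens(text):
    """Lowercase, drop punctuation and possessive 's (crude normalization, no NLP)."""
    return set(re.findall(r"[a-z0-9]+", text.lower().replace("'s", "")))

def keyword_score(question, passage):
    """Naive relevance: count word tokens shared by question and passage."""
    return len(tokens(question) & tokens(passage))

question = "When was Wendy's founded?"
passages = [
    # The Murano passage: mentions "Wendy" and "founded", but not Wendy's.
    "The renowned Murano glassmaking industry has gone through several "
    "reincarnations since it was founded in 1291. By Wendy Moonan.",
    # The passage a user actually wants (wording is illustrative).
    "Wendy's was founded by Dave Thomas in Columbus, Ohio, in 1969.",
]

for p in passages:
    print(keyword_score(question, p), "|", p[:55], "...")
# Both passages share the tokens "wendy", "was", and "founded", so pure
# keyword overlap cannot prefer the correct one and may return "1291".
```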
Identifying the relationship between entities is crucial. For instance, with the question "When was Microsoft established?", the system needs to differentiate between Microsoft establishing partnerships and the establishment of Microsoft itself. A correct answer might not even include the query term.
Example: Microsoft Corp was founded in the US in 1975, incorporated in 1981, and established in the UK in 1982.
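A dependency parse can make that distinction explicit. A small sketch, assuming spaCy and its en_core_web_sm model are installed (the second sentence is illustrative, and the exact parse depends on the model):

```python
import spacy

nlp = spacy.load("en_core_web_sm")

sentences = [
    "Microsoft was founded in the US in 1975.",           # the company is founded
    "Microsoft established a new subsidiary in the UK.",  # the company establishes something
]

for sent in sentences:
    doc = nlp(sent)
    for tok in doc:
        if tok.lemma_ in ("found", "establish"):
            # Subjects attached to the verb show what the relation is about.
            subjects = [(c.text, c.dep_) for c in tok.children
                        if c.dep_ in ("nsubj", "nsubjpass")]
            print(sent)
            print("  verb:", tok.lemma_, "| subject:", subjects)
# In the first sentence Microsoft is the passive subject of "founded"
# (it is the thing being founded); in the second it is the active subject
# of "established", so only the first sentence answers the question.
```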
Some questions necessitate inference, posing a challenge for search engines.
Example: What is the distance between the largest city in California and the largest city in Nevada?
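Answering this requires decomposing the question into sub-questions and then computing something no single page is likely to state. A minimal sketch of that inference chain, using a hand-coded two-entry knowledge base and approximate coordinates:

```python
# Sketch of the inference chain: two knowledge-base lookups plus a computation.
# The tiny "knowledge base" and the coordinates below are illustrative.
from math import radians, sin, cos, asin, sqrt

largest_city = {"California": "Los Angeles", "Nevada": "Las Vegas"}
coords = {  # approximate latitude/longitude in degrees
    "Los Angeles": (34.05, -118.24),
    "Las Vegas": (36.17, -115.14),
}

def haversine_km(a, b):
    """Great-circle distance between two (lat, lon) points in kilometres."""
    lat1, lon1, lat2, lon2 = map(radians, (*a, *b))
    h = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371 * asin(sqrt(h))

# Steps 1 and 2: answer the two sub-questions from the knowledge base.
city_ca = largest_city["California"]
city_nv = largest_city["Nevada"]
# Step 3: compute the distance, which no single document is likely to state directly.
print(f"{city_ca} to {city_nv}: {haversine_km(coords[city_ca], coords[city_nv]):.0f} km")
```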
Sometimes the required data may not exist or may not be readily accessible.
Example: How many Ph.D. degrees in mathematics were granted by European universities in 1986?
Siri, which began as a DARPA project, answers requests through a multi-step process.
Coreference resolution helps resolve ambiguities, and clarification questions are used to understand the user's intent.
Examples: U: “book a table at Il Fornaio at 7:00 with my mom”. U: “also send her an email reminder”
U: “chicago pizza”. S: “Did you mean pizza restaurants in Chicago or Chicago-style pizza?”
Watson is a QA computing system that applies natural language processing, information retrieval, knowledge representation, automated reasoning, and machine learning. Although it won the Jeopardy! contest in 2011, it has struggled to meet the expectations raised at the time; it nonetheless performs well on standard natural language tasks.
Previously known for specializing in Q&A, AskJeeves is now less effective than Google.
Questions fall into distinct categories.
A typical QA architecture includes question processing, passage retrieval, and answer extraction, and draws on tools such as WordNet, NER (Named Entity Recognition), and part-of-speech (POS) taggers.
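To make the three stages concrete, here is a schematic sketch in Python; every function is a stub (all names are hypothetical), and a real system would plug in NER, POS tagging, WordNet, a search engine, and a ranker:

```python
# Schematic sketch of the three-stage QA architecture described above.
def process_question(question: str) -> dict:
    """Question processing: guess the expected answer type and extract keywords."""
    keywords = [w for w in question.rstrip("?").split()
                if w.lower() not in {"when", "was", "the", "of"}]
    answer_type = "DATE" if question.lower().startswith("when") else "OTHER"
    return {"keywords": keywords, "answer_type": answer_type}

def retrieve_passages(query: dict) -> list:
    """Passage retrieval: send keywords to a search engine and return snippets (stubbed)."""
    return ["The first internal combustion engine was built in 1867."]

def extract_answer(query: dict, passages: list) -> str:
    """Answer extraction: pick a span that matches the expected answer type."""
    for passage in passages:
        for token in passage.split():
            if query["answer_type"] == "DATE" and token.strip(".").isdigit():
                return token.strip(".")
    return "unknown"

question = "When was the internal combustion engine invented?"
query = process_question(question)
print(extract_answer(query, retrieve_passages(query)))  # -> 1867
```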
Questions can be organized into taxonomies (e.g., reason, number, manner, location). Factoid questions (who, where, when, how many) have predictable answer categories.
Tools include part-of-speech recognizers and named entity recognizers to identify information units (names, locations, numeric expressions).
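A small sketch of these tools in use, assuming spaCy with its en_core_web_sm model (entity labels depend on the model):

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Who was the prime minister of Australia during the Great Depression?")

print([(tok.text, tok.pos_) for tok in doc])          # part-of-speech tags
print([(ent.text, ent.label_) for ent in doc.ents])   # named entities, e.g. Australia -> GPE
# "Who" plus the head noun "minister" suggests the answer type PERSON, while
# the recognized entities (Australia, the Great Depression) constrain the search.
```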
Nouns typically map to entities, verbs to relationships, and adjectives to properties of entities.
After extracting a relation from the question, information sources (Wikipedia infoboxes, DBpedia, Freebase) can be queried via a triple store.
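For instance, DBpedia exposes such triples over SPARQL. A minimal sketch, assuming the SPARQLWrapper package and DBpedia's public endpoint; the dbo:foundingDate property name is an assumption about DBpedia's current schema:

```python
from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("https://dbpedia.org/sparql")
sparql.setQuery("""
    PREFIX dbr: <http://dbpedia.org/resource/>
    PREFIX dbo: <http://dbpedia.org/ontology/>
    SELECT ?date WHERE { dbr:Microsoft dbo:foundingDate ?date . }
""")
sparql.setReturnFormat(JSON)

results = sparql.query().convert()
for row in results["results"]["bindings"]:
    # Founding date as stored in the infobox-derived triples.
    print(row["date"]["value"])
```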
WordNet provides hypernyms (superordinate groupings) and hyponyms (more specific terms).
Examples:
Question: When was the internal combustion engine invented? Answer: The first internal combustion engine was built in 1867.
Lexical chain: invent:v#1 → HYPERNYM → create_by_mental_act:v#1 → HYPERNYM → create:v#1 → HYPONYM → build:v#1
Question: How many chromosomes does a human zygote have? Answer: 46 chromosomes lie in the nucleus of every normal human cell.
Lexical chain: zygote:n#1 → HYPERNYM → cell:n#1 → HAS-PART → nucleus:n#1
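A small sketch of how such chains can be explored with NLTK's WordNet interface (assumes nltk with the WordNet corpus downloaded via nltk.download("wordnet"); the exact sense numbers printed depend on the WordNet version):

```python
from nltk.corpus import wordnet as wn

# Verb chain: the hypernyms of "invent" should mirror the chain above.
for syn in wn.synsets("invent", pos=wn.VERB):
    print(syn.name(), "-> HYPERNYM ->", [h.name() for h in syn.hypernyms()])

# Noun chain: zygote -> HYPERNYM -> cell, plus the cell's HAS-PART relations.
zygote = wn.synset("zygote.n.01")
for hyp in zygote.hypernyms():
    print(zygote.name(), "-> HYPERNYM ->", hyp.name(),
          "| parts:", [p.name() for p in hyp.part_meronyms()])
```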
WordNet refines the expected answer type and allows named-entity categories to be merged into its hierarchy.
One measure is Leacock-Chodorow similarity, defined as:
sim_{LC}(c_1, c_2) = -\ln\left(\frac{\mathrm{length}(c_1, c_2)}{2D}\right)
where length(c_1, c_2) is the shortest path between the two concepts in the WordNet hierarchy and D is the maximum depth of the taxonomy.
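NLTK exposes this measure directly as lch_similarity. A small sketch (assumes nltk and its WordNet corpus; the chosen synsets are illustrative):

```python
from nltk.corpus import wordnet as wn

engine = wn.synset("engine.n.01")
motor = wn.synset("motor.n.01")
car = wn.synset("car.n.01")

# Higher values mean a shorter path relative to the depth of the taxonomy.
print(engine.lch_similarity(motor))  # closely related concepts
print(engine.lch_similarity(car))    # more distant concepts
```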
After queries are formulated, they are sent to a search engine and snippets are retrieved. Results are filtered by expected answer type, and passages are ranked by a trained classifier.
Features: Question keywords, Named Entities, Longest overlapping sequence, Shortest keyword-covering span, N-gram overlap.
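A minimal sketch of computing a few of these features for one question/passage pair (tokenization is deliberately crude, and the feature set is illustrative; the passage is the engine example above):

```python
from difflib import SequenceMatcher

def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as a set of tuples."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def passage_features(question, passage):
    q = question.lower().rstrip("?").split()
    p = passage.lower().rstrip(".").split()
    keyword_overlap = len(set(q) & set(p))
    # Longest overlapping sequence of tokens shared by question and passage.
    match = SequenceMatcher(None, q, p).find_longest_match(0, len(q), 0, len(p))
    # Bigram overlap between question and passage.
    bigram_overlap = len(ngrams(q, 2) & ngrams(p, 2))
    return {"keywords": keyword_overlap,
            "longest_seq": match.size,
            "bigram_overlap": bigram_overlap}

print(passage_features(
    "When was the internal combustion engine invented",
    "The first internal combustion engine was built in 1867"))
```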
Passage ordering and ranking consider factors such as the expected answer type, the content of the text passage, and the proximity of keywords.
Answer extraction involves identifying relationships between question head words and anchor words in candidate answer passages.
Supervised machine learning can rank candidate passages based on such features.
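As a sketch of how such features could feed a supervised ranker, here is a toy example with scikit-learn's logistic regression (assumes scikit-learn and numpy; the training data is a stand-in, and this is just one possible model):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Feature vectors: [keyword overlap, longest overlapping sequence, bigram overlap]
X_train = np.array([
    [5, 3, 2],   # passage that contained the answer
    [4, 2, 1],   # passage that contained the answer
    [2, 1, 0],   # passage that did not
    [1, 1, 0],   # passage that did not
])
y_train = np.array([1, 1, 0, 0])

ranker = LogisticRegression().fit(X_train, y_train)

# Rank new candidate passages by the probability that they contain the answer.
candidates = np.array([[5, 3, 2], [2, 0, 0]])
scores = ranker.predict_proba(candidates)[:, 1]
print(sorted(zip(scores, ["passage A", "passage B"]), reverse=True))
```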
BERT helps computers interpret language by using the surrounding text as context. It has achieved strong results in sentiment analysis, semantic role labeling, and disambiguation, and Google applies it in its search algorithms. BERT is pre-trained on unlabeled text and is then fine-tuned for specific downstream tasks.
Unlike context-free embeddings such as word2vec or GloVe, BERT reads text bidirectionally, so the representation of each word takes into account all of the other words in the sentence.
Variants of BERT are pre-trained on specialized corpora (patentBERT, docBERT, bioBERT, VideoBERT).
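A sketch of BERT-style extractive QA using the Hugging Face transformers package (assumed to be installed; the pipeline downloads a default SQuAD-fine-tuned model), applied to the Microsoft sentence above:

```python
from transformers import pipeline

qa = pipeline("question-answering")

result = qa(
    question="When was Microsoft established?",
    context="Microsoft Corp was founded in the US in 1975, incorporated in 1981, "
            "and established in the UK in 1982.",
)
print(result)  # predicted answer span, confidence score, and character offsets
```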
AskMSR relies on the redundancy of information scattered across the web and deliberately simple methods rather than deep linguistic analysis.