Information Extraction (IE) is the task of automatically extracting structured information from unstructured and/or semi-structured machine-readable documents.
The goal is "machine reading" of text to populate databases or knowledge graphs, enabling computers to understand and process human language in a meaningful way.
Organize information for people:
Info boxes in Wikipedia, which provide a structured summary of key facts about a topic.
Summarize facts across news collections, allowing users to quickly grasp the main points of multiple articles.
Organize information for machine algorithms:
Data analytics, enabling automated analysis of large volumes of text data.
New knowledge through inference (e.g., works-for(x, y) ∧ located-in(y, z) → lives-in(x, z)), allowing the discovery of implicit relationships.
Question answering, enabling machines to answer questions based on extracted information.
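The inference rule above can be sketched over a toy triple store (entity names are made up for illustration):

```python
# Toy knowledge base of (subject, relation, object) triples.
triples = {
    ("alice", "works-for", "acme"),
    ("acme", "located-in", "glasgow"),
}

def infer_lives_in(kb):
    """Apply the rule works-for(x, y) AND located-in(y, z) -> lives-in(x, z)."""
    inferred = set()
    for (x, r1, y) in kb:
        if r1 != "works-for":
            continue
        for (y2, r2, z) in kb:
            if r2 == "located-in" and y2 == y:
                inferred.add((x, "lives-in", z))
    return inferred

print(infer_lives_in(triples))  # {('alice', 'lives-in', 'glasgow')}
```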
Common steps in IE include:
Named Entity Recognition (NER), which identifies and classifies named entities in text.
Linking entities, which connects named entities to their corresponding entries in a knowledge base.
Extracting relations, which identifies and classifies relationships between entities.
Resolving coreferences, which identifies and groups mentions that refer to the same entity.
Detect and classify named entities, such as:
[Boeing]'s new CEO [Kelly Ortberg] acknowledged the company's past management failures, particularly under former CEO [Dave Calhoun], which led to significant financial losses and delays in aircraft deliveries. [Ortberg] admitted that they struggled with internal communication and decision-making during [Calhoun]'s tenure, which contributed to the prolonged delays in the development of the [737 MAX] and other key projects. Furthermore, [Boeing]'s inability to meet deadlines for high-profile contracts, such as the [Air Force One] replacement, reflected poor leadership and a lack of strategic oversight.
Ground entities to a representation of "world" knowledge.
Example: China (LOC) https://en.wikipedia.org/wiki/China
Classify relationships between entities.
Example: is CEO
Resolve pronouns to named entities.
Example: President Xi Jinping of China, on his first visit to the United States …
Mentions can be grouped if they refer to the same entity.
Facts extracted from text are often triples: (entity1, relationship, entity2).
A collection of these triples forms a knowledge graph or knowledge base.
(Paris, is-capital-of, France)
Knowledge graphs are directed graphs with entities as vertices and relations as labelled edges.
Useful for search and question answering systems.
Queries become traversals on the knowledge graph.
Wikidata: a structured knowledge graph with an entry for almost every page on Wikipedia.
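A minimal sketch of a triple store in which a query becomes an edge traversal (the triples are invented for illustration):

```python
from collections import defaultdict

# Store the graph as adjacency lists keyed by (entity, relation).
triples = [
    ("Paris", "is-capital-of", "France"),
    ("France", "member-of", "EU"),
]

graph = defaultdict(list)
for subj, rel, obj in triples:
    graph[(subj, rel)].append(obj)

# A query like "What is Paris the capital of?" is a single edge traversal.
print(graph[("Paris", "is-capital-of")])  # ['France']

# Multi-hop query: which union is that country a member of?
countries = graph[("Paris", "is-capital-of")]
print([u for c in countries for u in graph[(c, "member-of")]])  # ['EU']
```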
Finding mentions of specific entities.
Useful for search functionality.
Finding "facts" mentioned in the text.
Building structured knowledge bases or knowledge graphs.
Knowledge graphs are used for many applications, including question answering
Knowledge bases help people find information without having to read large amounts of text
Persons, objects, etc that are typically named with a proper noun.
Objects: Eiffel Tower
Persons: Nelson Mandela
Organisations: United Nations
Locations: Paris
etc
Domain-specific:
Biomedicine (drugs, genes, diseases, etc)
Legal (laws, courts, etc)
Sports (teams, players, events, etc)
Named entity recognition: Which tokens are naming a particular entity type?
Entity Linking: Which specific entity does the mention correspond to?
Define a list of terms and synonyms
Use exact string matching to find them in your corpus
Does mention detection and entity linking in one go
Don’t need annotated examples to train a classifier
Works fairly well for specific cases (e.g. names of drugs) where words are only used in a single context
Completely ignores the context of the sentence
Requires an exhaustive list of terms & synonyms - not appropriate for many purposes
What about ambiguous cases?
“Paris” in France or Texas?
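A minimal sketch of the dictionary approach, using a hypothetical drug gazetteer: mention detection and linking happen in one pass, but the sentence context is ignored entirely.

```python
import re

# Hypothetical term list: surface forms (and synonyms) map to one entity ID.
gazetteer = {
    "paracetamol": "DRUG:acetaminophen",
    "acetaminophen": "DRUG:acetaminophen",
    "ibuprofen": "DRUG:ibuprofen",
}

def dictionary_tag(text):
    """Exact (case-insensitive) string matching against the gazetteer.
    Returns (start, end, entity_id) spans."""
    pattern = re.compile(
        r"\b(" + "|".join(map(re.escape, gazetteer)) + r")\b", re.IGNORECASE
    )
    return [(m.start(), m.end(), gazetteer[m.group(1).lower()])
            for m in pattern.finditer(text)]

print(dictionary_tag("She took Paracetamol and ibuprofen."))
```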
Which spans of text are entities?
span = start and end coordinates in text
What types of entity are they?
Categories of entities (e.g. people, location, organisation, etc)
Often these two problems are solved together
Inside–outside–beginning (IOB) labeling – whether a token begins a named entity, is a continuation of a named entity, or is not one.
(Other labeling schemes exist)
Often paired with the task of extracting the type of entity: person, product, organisation, location, etc.
John Doe lost his £1300 MacBook Pro at the University of Glasgow .
B-PER I-PER O O B-MONEY B-PROD I-PROD O O B-ORG I-ORG I-ORG O
Given a sequence of tokens, label each one given a set of labels (e.g. O, B-CITY, I-CITY).
Supervised - therefore needs annotated data for training and evaluation
Hidden States: The IOB tags
Observed States: The words in the sequence
Goal: predict the hidden states (the IOB tags) from the observed words
HMMs can’t take in extra information:
The part of speech can be useful for predicting NER
Capitalization can be a useful signal
“Bath” - city? “bath” - object?
Other possible features
What kind of text is this?
Does the text match any known entities?
More context about other words/entities in the text
Basic CRF approach:
Turn this into a supervised classification problem
Basic features are the current token and the previously predicted label
Can then add custom features (e.g. Part of Speech, capitalization, etc)
CRFs allow modelling of more complex sequences while integrating extra features
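The feature-based setup can be sketched as a function mapping each token to a feature dictionary, the input format used by libraries such as sklearn-crfsuite (the specific features here are illustrative):

```python
def token_features(tokens, pos_tags, i):
    """Feature dict for token i: current token, previous token,
    capitalisation, and part-of-speech -- the kind of hand-built
    features a CRF can combine with the label sequence."""
    token = tokens[i]
    feats = {
        "token.lower": token.lower(),
        "token.istitle": token.istitle(),  # "Bath" (city) vs "bath" (object)
        "token.isupper": token.isupper(),
        "pos": pos_tags[i],
    }
    if i > 0:
        feats["prev.lower"] = tokens[i - 1].lower()
    else:
        feats["BOS"] = True  # beginning of sentence
    return feats

tokens = ["Bath", "is", "lovely"]
pos = ["NNP", "VBZ", "JJ"]
print(token_features(tokens, pos, 0))
```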
We can use transformer models like BERT to make token-level predictions.
Also effective: use a CRF on top of token embeddings.
Typically, we want whole entities, so the IOB tags need to be turned into spans
Token-level: Calculate (macro) precision, recall, F1 score for IOB tags
Often easier to set up
BUT: Dependent on the tokenizer, so cannot compare results from different tokenizers
Span-level: Match predicted spans versus the correct spans
Trickier: predicted and correct spans may not match exactly
Can calculate True Positives, False Positives and False Negatives
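Turning IOB tags into spans and counting exact span matches can be sketched as:

```python
def iob_to_spans(tags):
    """Convert IOB tags to (start, end, type) spans; end is exclusive."""
    spans, start, etype = [], None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel closes a final span
        if tag.startswith("B-") or tag == "O":
            if start is not None:
                spans.append((start, i, etype))
                start = None
        if tag.startswith("B-"):
            start, etype = i, tag[2:]
        elif tag.startswith("I-") and start is None:
            start, etype = i, tag[2:]  # tolerate an I- with no preceding B-
    return set(spans)

gold = iob_to_spans(["B-PER", "I-PER", "O", "B-ORG"])
pred = iob_to_spans(["B-PER", "I-PER", "O", "O"])
tp = len(gold & pred)   # spans predicted exactly right
fp = len(pred - gold)   # predicted spans with no exact gold match
fn = len(gold - pred)   # gold spans that were missed
print(tp, fp, fn)  # 1 0 1
```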
Supervised approaches
HMM: classic method, but doesn't do well with extra features
Conditional Random Fields: treats the problem like a sequential classification problem
can factor in extra features like part-of-speech
BERT-based: classification problem for each token
generates dense context vectors for each token
use vectors as input to a classifier for different token labels
About linking mentions to a knowledge base
Is about dealing with ambiguity
Many knowledge bases have lists of names that the entity is known by
Can be used for exact matching or other comparisons
Can apply ideas from information retrieval [see Information Retrieval course for more detail]
Candidate generation
Use faster (less accurate) approach to get a short list of candidates
Two popular options: character n-grams & dense vector representations
(optionally) rerank candidates to pick top hits
Use a more costly approach to score the short list and pick the best
A popular candidate generation approach
Represent an entity using a vector built from its name (or other aliases)
Use ideas from earlier in the course
Encode all the entities in the knowledge base with a transformer
Encode mention with a transformer to get a dense vector
Use dot-product or cosine similarity to compare vectors to find best entity
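A minimal sketch of the similarity step, with made-up 3-dimensional vectors standing in for the transformer embeddings:

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical entity vectors; in practice these come from encoding
# each entity's name/aliases with a transformer.
entity_vecs = {
    "Paris (France)": [0.9, 0.1, 0.0],
    "Paris (Texas)": [0.1, 0.9, 0.1],
}
mention_vec = [0.8, 0.2, 0.1]  # encoded mention of "Paris" in context

best = max(entity_vecs, key=lambda e: cosine(mention_vec, entity_vecs[e]))
print(best)  # Paris (France)
```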
Pair up the document text (with entity tagged) with one of the candidates
Train as a binary classifier (whether it is the correct entity)
Pick the candidate with the best score
An example of dense-vector candidate generation (known as a bi-encoder) with a cross-encoder reranker
How many mentions are matched to the correct entity?
Accuracy@1
How many mentions have the correct entity in their top 5 candidates?
Accuracy@5
Or other values (e.g. top 10, top 50, etc)
Other information retrieval metrics may also be appropriate
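Accuracy@k can be computed as follows (the candidate lists and gold entity IDs are invented):

```python
def accuracy_at_k(ranked_candidates, gold, k):
    """Fraction of mentions whose correct entity is in the top-k candidates."""
    hits = sum(1 for cands, g in zip(ranked_candidates, gold) if g in cands[:k])
    return hits / len(gold)

# One ranked candidate list per mention, plus the correct entity for each.
ranked = [["Q90", "Q830149"], ["Q5", "Q30"], ["Q30", "Q99"]]
gold = ["Q90", "Q30", "Q99"]

print(accuracy_at_k(ranked, gold, 1))  # 1 of 3 correct at rank 1
print(accuracy_at_k(ranked, gold, 2))  # 1.0
```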
Searching a knowledge base of entities for the right one
Can use ideas from information retrieval
Getting candidates
Character n-grams or dense vector approach
Optionally reranking with a cross-encoder
Evaluating with hits@1, hits@k, etc
Why extract relations?
We want to know whether (and how) entities are related
Useful for all kinds of applications
Question-answering
Knowledge inference
Two entities appearing in lots of sentences/documents together likely means that they are connected
Could measure with raw counts of co-occurrences
May be skewed by very popular terms
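A sketch of co-occurrence counting; pointwise mutual information (PMI) is one standard way to correct for very popular terms (the entities and counts here are invented):

```python
import math
from collections import Counter
from itertools import combinations

# Each "sentence" here is just the set of entities mentioned in it.
sentences = [
    {"Obama", "Honolulu"},
    {"Obama", "USA"},
    {"Obama", "Honolulu"},
    {"USA", "Paris"},
]

entity_counts = Counter(e for s in sentences for e in s)
pair_counts = Counter(frozenset(p) for s in sentences
                      for p in combinations(sorted(s), 2))
n = len(sentences)

def pmi(e1, e2):
    """Pointwise mutual information: high when e1 and e2 co-occur more
    often than their individual popularity would predict."""
    p_xy = pair_counts[frozenset((e1, e2))] / n
    p_x = entity_counts[e1] / n
    p_y = entity_counts[e2] / n
    return math.log2(p_xy / (p_x * p_y))

print(round(pmi("Obama", "Honolulu"), 3))
```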
Knowledge may be represented with a triple
(subject, relation, object) or sometimes (subject, predicate, object)
Can also be expressed as relation(subject, object)
e.g. capital_of(Reykjavik, Iceland)
Binary relation: relational triple (e1, relation, e2)
(Barack Obama, place-of-birth, Honolulu)
(Barack Obama, height, 6’1”)
Many relations are not binary:
“[person] appointed as [job-title] by [organisation]”
One option: decompose n-ary relations to multiple binary
([person], employed-by, [organisation])
([person], has-job-title, [job-title])
([organisation], uses-job-title, [job-title])
Some relations only occur between certain entity types
Entities may have types from NER/linking
Choose your relation of interest
Gather some example entity pairs for that relation
Find sentences that contain both entities
Find common patterns in those sentences that fit the relation of interest
Use those patterns on text to find more entity pairs and repeat Steps 2-5
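One iteration of the bootstrapping loop above can be sketched with naive string patterns (the sentences and seed pair are invented; real systems use more robust pattern matching):

```python
import re

corpus = [
    "Reykjavik is the capital of Iceland.",
    "Paris is the capital of France.",
    "Ottawa, the capital of Canada, is cold.",
]

# Seed entity pairs for the relation of interest (capital-of).
seeds = {("Reykjavik", "Iceland")}

# Find sentences containing both entities and take the text
# between them as a candidate pattern.
patterns = set()
for e1, e2 in seeds:
    for sent in corpus:
        if e1 in sent and e2 in sent:
            between = sent.split(e1, 1)[1].split(e2, 1)[0].strip()
            patterns.add(between)

# Apply the patterns to the corpus to find new entity pairs,
# which could seed the next iteration.
found = set()
for pat in patterns:
    for sent in corpus:
        m = re.search(r"(\w+)\W+" + re.escape(pat) + r"\W+(\w+)", sent)
        if m:
            found.add((m.group(1), m.group(2)))

print(patterns, found)
```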
Good quality requires human annotators
Three types of classification task:
Binary, Multi-class, Multi-label
Entity names?
Words in between?
Words before and after?
Not every token is relevant for identifying the relation between two entities
Dependency parse can help isolate the important tokens
Idea: Find the shortest path between the two entities in the dependency tree
Use the tokens on the path as input for a relation classifier
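A sketch of the shortest-path idea using breadth-first search over hand-written dependency edges (a real system would get the edges from a dependency parser such as spaCy's):

```python
from collections import deque

# Hand-written dependency edges (head, dependent) for:
#   "Obama, who was born in Honolulu, visited Paris."
edges = [
    ("visited", "Obama"), ("visited", "Paris"),
    ("Obama", "born"), ("born", "who"), ("born", "was"),
    ("born", "in"), ("in", "Honolulu"),
]

# Treat the tree as undirected for path finding.
adj = {}
for head, dep in edges:
    adj.setdefault(head, set()).add(dep)
    adj.setdefault(dep, set()).add(head)

def shortest_path(a, b):
    """Breadth-first search for the shortest path between two tokens."""
    queue, seen = deque([[a]]), {a}
    while queue:
        path = queue.popleft()
        if path[-1] == b:
            return path
        for nxt in adj[path[-1]] - seen:
            seen.add(nxt)
            queue.append(path + [nxt])

# Tokens on this path would feed the relation classifier.
print(shortest_path("Obama", "Honolulu"))  # ['Obama', 'born', 'in', 'Honolulu']
```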
Need to tell the BERT model which tokens are the entities in the candidate relation
Treat it as text classification and don’t tell BERT about the relevant entities
Entity markers use special tokens to identify the entities
Can use specific context vectors (e.g. the entities or their markers)
Entity resolution (linking)
spaCy, TAGME, DBpedia Spotlight, others…
Relation extraction
Stanford CoreNLP, DeepDive, Stanford MIML-RE, UMass FACTORIE
Which term (in purple or green) does the pronoun (in red) correspond to?
Winograd schema: A famous set of challenging coreference problems
A single word change can flip the coreference, and resolving it requires reasoning
Coreference: when two mentions refer to the same entity in the world
Anaphora: a word referring backwards to a previous word
Cataphora: a word referring forwards to a future word
Exophora: external context (maybe unresolved)
Mention: span of text referring to some entity
Pronouns
Use a part-of-speech tagger
Named entities
Use a named entity recognition approach
Noun phrases
Use a parser
Task: Find the noun-phrase that a pronoun refers to
Defines a way to traverse a syntax tree
Person/Number/Gender agreement
More recently mentioned entities are preferred as referents
Semantic compatibility
Grammatical Role
Certain syntactic constraints
Parallelism
Binary classification for coreference relation between two mentions
Could use tf-idf approach with scikit-learn classifier or transformers
Can apply different clustering methods to group mentions together
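The mention-pair approach followed by clustering can be sketched with hypothetical classifier scores and union-find grouping (a trained classifier would supply the scores):

```python
# Mentions in a document, with hypothetical pairwise coreference scores
# from a binary classifier (closer to 1.0 = "same entity").
mentions = ["Barack Obama", "he", "Paris", "the city"]
pair_scores = {
    (0, 1): 0.95,  # "Barack Obama" <- "he"
    (2, 3): 0.85,  # "Paris" <- "the city"
    (1, 2): 0.05,
}

# Union-find: merge any mention pair scored above a threshold.
parent = list(range(len(mentions)))

def find(i):
    """Return the cluster root of mention i, with path compression."""
    while parent[i] != i:
        parent[i] = parent[parent[i]]
        i = parent[i]
    return i

for (i, j), score in pair_scores.items():
    if score > 0.5:
        parent[find(i)] = find(j)

# Group mentions by their cluster root.
clusters = {}
for i, m in enumerate(mentions):
    clusters.setdefault(find(i), []).append(m)
print(list(clusters.values()))
```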