Information Extraction Notes
Information Extraction
Information Extraction (IE) is the task of automatically extracting structured information from unstructured and/or semi-structured machine-readable documents.
The goal is "machine reading" of text to populate databases or knowledge graphs, enabling computers to understand and process human language in a meaningful way.
Goals of Information Extraction
Organize information for people:
Info boxes in Wikipedia, which provide a structured summary of key facts about a topic.
Summarize facts across news collections, allowing users to quickly grasp the main points of multiple articles.
Organize information for machine algorithms:
Data analytics, enabling automated analysis of large volumes of text data.
New knowledge through inference (e.g., Works-for(x, y) ∧ located-in(y, z) → lives-in(x, z)), allowing the discovery of implicit relationships.
Question answering, enabling machines to answer questions based on extracted information.
Information Extraction Steps
Common steps in IE include:
Named Entity Recognition (NER), which identifies and classifies named entities in text.
Linking entities, which connects named entities to their corresponding entries in a knowledge base.
Extracting relations, which identifies and classifies relationships between entities.
Resolving coreferences, which identifies and groups mentions that refer to the same entity.
Named Entity Recognition (NER)
Detect and classify named entities, such as:
[Boeing]'s new CEO [Kelly Ortberg] acknowledged the company's past management failures, particularly under former CEO [Dave Calhoun], which led to significant financial losses and delays in aircraft deliveries. [Ortberg] admitted that they struggled with internal communication and decision-making during [Calhoun]'s tenure, which contributed to the prolonged delays in the development of the [737 MAX] and other key projects. Furthermore, [Boeing]'s inability to meet deadlines for high-profile contracts, such as the [Air Force One] replacement, reflected poor leadership and a lack of strategic oversight.
Linking Entities
Ground entities to a representation of "world" knowledge.
Example: China (LOC) → https://en.wikipedia.org/wiki/China
Relation Extraction
Classify relationships between entities.
Example: is CEO
Coreference Resolution
Resolve pronouns to named entities.
Example: President Xi Jinping of China, on his first visit to the United States …
Mentions can be grouped together if they refer to the same entity.
Knowledge Graphs / Knowledge Bases
Facts extracted from text are often triples: (entity1, relationship, entity2).
A collection of these triples forms a knowledge graph or knowledge base.
(Paris, is-capital-of, France)
Knowledge graphs are directed graphs with entities as vertices and relations as labelled edges.
Useful for search and question answering systems.
Queries become traversals on the knowledge graph.
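The triple-and-traversal idea above can be sketched in a few lines; the entity and relation names here are illustrative, not from a real knowledge base.

```python
# A minimal knowledge graph: a set of (subject, relation, object) triples.
triples = {
    ("Paris", "is-capital-of", "France"),
    ("France", "located-in", "Europe"),
    ("Eiffel Tower", "located-in", "Paris"),
}

def objects_of(subject, relation):
    """Follow edges labelled `relation` out of `subject`."""
    return {o for (s, r, o) in triples if s == subject and r == relation}

# "What is Paris the capital of?" becomes a one-edge traversal:
print(objects_of("Paris", "is-capital-of"))  # {'France'}
```

Multi-hop questions ("What continent is the Eiffel Tower in?") chain such traversals across several edges.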
Example of a Knowledge Graph
WikiData: a structured knowledge graph for almost every page on Wikipedia.
Uses of Information Extraction
Finding mentions of specific entities.
Useful for search functionality.
Finding "facts" mentioned in the text.
Building structured knowledge bases or knowledge graphs.
Knowledge graphs are used in many applications, including question answering.
Knowledge bases help people find information without having to read large amounts of text.
Named Entities
Persons, objects, etc. that are typically named with a proper noun.
Objects: Eiffel Tower
Persons: Nelson Mandela
Organisations: United Nations
Locations: Paris
etc
Domain-specific:
Biomedicine (drugs, genes, diseases, etc)
Legal (laws, courts, etc)
Sports (teams, players, events, etc)
Two Common Steps for Extracting Entities
Named entity recognition: Which tokens are naming a particular entity type?
Entity Linking: Which specific entity does the mention correspond to?
Solving with a huge dictionary of terms
Define a list of terms and synonyms
Use exact string matching to find them in your corpus
Advantages:
Does mention detection and entity linking in one go
Don’t need annotated examples to train a classifier
Works fairly well for specific cases (e.g. names of drugs) where words are only used in a single context
Disadvantages:
Completely ignores the context of the sentence
Requires an exhaustive list of terms & synonyms - not appropriate for many purposes
What about ambiguous cases?!?!
“Paris” in France or Texas?
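The dictionary approach can be illustrated with a toy example; the terms and knowledge-base IDs below are made up. Exact matching does detection and linking in one step, but the "Paris" entry shows why ignoring context is a problem.

```python
# Toy dictionary-based entity extraction via exact string matching.
DICTIONARY = {
    "aspirin": "DB00945",               # hypothetical KB identifier
    "acetylsalicylic acid": "DB00945",  # synonym maps to the same entry
    "paris": "AMBIGUOUS",               # France or Texas? matching cannot tell
}

def dictionary_match(text):
    """Return (term, start, end, entity_id) for each dictionary hit."""
    matches = []
    lowered = text.lower()
    for term, entity_id in DICTIONARY.items():
        start = lowered.find(term)
        if start != -1:
            matches.append((term, start, start + len(term), entity_id))
    return matches

print(dictionary_match("She took aspirin in Paris."))
```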
Named Entity Recognition
Which spans of text are entities?
span = start and end coordinates in text
What types of entity are they?
Categories of entities (e.g. people, location, organisation, etc)
Often these two problems are solved together
Sequence-Labelling Problem
Inside–outside–beginning (IOB) labeling – whether a token begins a named entity, is a continuation of a named entity, or is not one.
(Other labeling schemes exist)
Often paired with the task of extracting the type of entity: person, product, organisation, location, etc.
John Doe lost his £1300 MacBook Pro at the University of Glasgow .
B-PER I-PER O O B-MONEY B-PROD I-PROD O O B-ORG I-ORG I-ORG O
Supervised Sequence Labelling
Given a sequence of tokens, label each one given a set of labels (e.g. O, B-CITY, I-CITY).
Supervised - therefore needs annotated data for training and evaluation
Hidden Markov Models (HMMs)
Hidden states: the IOB tags (the goal is to predict these)
Observed states: the words in the sequence
Limits of Hidden Markov Models
HMMs can’t take in extra information:
The part of speech can be useful for predicting NER
Capitalization can be a useful signal
“Bath” - city? “bath” - object?
Other possible features
What kind of text is this?
Does the text match any known entities?
More context about other words/entities in the text
Conditional Random Fields (CRF)
Basic CRF approach:
Turn this into a supervised classification problem
Basic features are the current token and the previously predicted label
Can then add custom features (e.g. Part of Speech, capitalization, etc)
CRFs allow modelling of more complex sequences while integrating in extra features
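A sketch of the kind of hand-crafted per-token features a CRF can use, alongside the current token and the previously predicted label. The feature names are illustrative (similar in spirit to common sklearn-crfsuite examples).

```python
# Per-token feature extraction of the kind used as CRF input.
def token_features(tokens, i, prev_label):
    tok = tokens[i]
    return {
        "token": tok.lower(),
        "is_capitalized": tok[:1].isupper(),   # "Bath" (city) vs "bath" (object)
        "is_all_caps": tok.isupper(),
        "has_digit": any(c.isdigit() for c in tok),
        "prev_label": prev_label,              # previously predicted IOB tag
        "prev_token": tokens[i - 1].lower() if i > 0 else "<BOS>",
    }

feats = token_features(["I", "visited", "Bath"], 2, "O")
print(feats["is_capitalized"], feats["prev_token"])
```

A part-of-speech tag or a gazetteer-lookup flag would be added to this dictionary in exactly the same way.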
BERT-based sequence labelling
We can use transformer models like BERT to make token-level predictions.
Also effective: use a CRF on top of token embeddings.
Recovering spans from Inside–outside–beginning tags
Typically, we want the whole entities, so need to turn IOB tags into spans
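The IOB-to-span conversion can be sketched as a single pass over the tags; here `end` is exclusive, and a stray I- tag whose type differs from the open span starts a new span.

```python
# Convert a sequence of IOB tags into (start, end, type) spans.
def iob_to_spans(tags):
    spans, start, etype = [], None, None
    for i, tag in enumerate(tags):
        if tag.startswith("B-") or (tag.startswith("I-") and etype != tag[2:]):
            if start is not None:
                spans.append((start, i, etype))   # close the open span
            start, etype = i, tag[2:]             # open a new span
        elif tag == "O":
            if start is not None:
                spans.append((start, i, etype))
            start, etype = None, None
    if start is not None:                         # span running to the end
        spans.append((start, len(tags), etype))
    return spans

tags = ["B-PER", "I-PER", "O", "B-ORG", "I-ORG"]
print(iob_to_spans(tags))  # [(0, 2, 'PER'), (3, 5, 'ORG')]
```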
Evaluation of Named Entity Recognition
Token-level: Calculate (macro) precision, recall, F1 score for IOB tags
Often easier to set up
BUT: Dependent on the tokenizer, so results obtained with different tokenizers cannot be compared
Span-level: Match predicted spans versus the correct spans
Trickier: predicted and correct spans may not match exactly
Can calculate True Positives, False Positives and False Negatives
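Span-level evaluation with exact matching can be sketched with set operations: a predicted span counts as a true positive only if a gold span with the same boundaries and type exists.

```python
# Span-level precision, recall and F1 with exact span matching.
def span_prf(predicted, gold):
    predicted, gold = set(predicted), set(gold)
    tp = len(predicted & gold)       # exact boundary + type matches
    fp = len(predicted - gold)       # spurious predictions
    fn = len(gold - predicted)       # missed gold spans
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

p, r, f = span_prf({(0, 2, "PER"), (5, 6, "LOC")}, {(0, 2, "PER")})
print(p, r, f)  # 0.5 1.0 0.666...
```

Partial-overlap matching (the "trickier" case above) would need a looser comparison than set intersection.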
Summary of NER methods
Supervised approaches
HMM: classic method, but does not handle extra features well
Conditional Random Fields: treats the problem as sequential classification
can factor in extra features like part-of-speech
BERT-based: a classification problem for each token
generates dense context vectors for each token
use vectors as input to a classifier for different token labels
Entity Linking
About linking mentions to a knowledge base
Is about dealing with ambiguity
Using aliases from the knowledge base
Many knowledge bases have lists of names that the entity is known by
Can be used for exact matching or other comparisons
Entity linking is a retrieval task
Can apply ideas from information retrieval [see Information Retrieval course for more detail]
Candidate generation
Use faster (less accurate) approach to get a short list of candidates
Two popular options: character n-grams & dense vector representations
(optionally) rerank candidates to pick top hits
Use a more costly approach to score the short list and pick the best
Candidate generation: character n-grams
A popular candidate generation approach
Represent an entity using a vector built from its name (or other aliases)
Use ideas from earlier in the course
Candidate generation: dense vector representations
Encode all the entities in the knowledge base with a transformer
Encode mention with a transformer to get a dense vector
Use dot-product or cosine similarity to compare vectors to find best entity
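The bi-encoder retrieval step can be sketched as follows; the tiny hand-made vectors stand in for transformer embeddings, and the entity names are illustrative.

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

entity_vectors = {                  # would come from encoding the KB entries
    "Paris_(France)": [0.9, 0.1, 0.0],
    "Paris_(Texas)":  [0.1, 0.9, 0.0],
}
mention_vector = [0.8, 0.2, 0.1]    # would come from encoding the mention

best = max(entity_vectors,
           key=lambda e: cosine(mention_vector, entity_vectors[e]))
print(best)  # Paris_(France)
```

In practice all entity vectors are precomputed once, and a nearest-neighbour index replaces the linear scan.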
Rerank the candidates with a cross-encoder
Pair up the document text (with entity tagged) with one of the candidates
Train as a binary classifier (whether it is the correct entity)
Pick the candidate with the best score
An example of dense vector candidate generation (known as a bi-encoder) combined with a cross-encoder reranker
Evaluating entity linking
How many mentions are matched to the correct entity?
Accuracy@1
How many mentions have the correct entity in their top 5 candidates?
Accuracy@5
Or other values (e.g. top 10, top 50, etc)
Other information retrieval metrics may also be appropriate
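Accuracy@k can be sketched directly from ranked candidate lists; the entity names below are toy data.

```python
# Accuracy@k: fraction of mentions whose gold entity is in the top-k candidates.
def accuracy_at_k(ranked_candidates, gold_entities, k):
    hits = sum(1 for cands, gold in zip(ranked_candidates, gold_entities)
               if gold in cands[:k])
    return hits / len(gold_entities)

ranked = [["France", "Texas"], ["Texas", "France"], ["Iceland"]]
gold = ["France", "France", "Norway"]
print(accuracy_at_k(ranked, gold, 1))  # 1/3: only the first mention is top-1
print(accuracy_at_k(ranked, gold, 2))  # 2/3: the second mention is in the top 2
```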
Summary of entity linking
Searching a knowledge base of entities for the right one
Can use ideas from information retrieval
Getting candidates
Character n-grams or dense vector approach
Optionally reranking with a cross-encoder
Evaluating with hits@1, hits@k, etc
Relation Extraction
Why extract relations?
We want to know whether (and how) entities are related
Useful for all kinds of applications
Question-answering
Knowledge inference
A blunt tool: co-occurrences
Two entities appearing in lots of sentences/documents together likely means that they are connected
Could measure with raw counts of co-occurrences
May be skewed by very popular terms
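Raw co-occurrence counting can be sketched over per-sentence entity lists; the sentences and entities here are toy data.

```python
from collections import Counter
from itertools import combinations

# Entities already extracted from each sentence by NER/linking.
sentences_entities = [
    ["Reykjavik", "Iceland"],
    ["Reykjavik", "Iceland", "Europe"],
    ["Iceland", "Europe"],
]

# Count how often each (sorted) entity pair appears in the same sentence.
cooc = Counter()
for ents in sentences_entities:
    for pair in combinations(sorted(set(ents)), 2):
        cooc[pair] += 1

print(cooc[("Iceland", "Reykjavik")])  # 2
```

A popular entity like "Europe" co-occurs with almost everything, which is exactly the skew mentioned above; association measures such as PMI are often used to correct for it.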
Knowing the relation type is invaluable
Knowledge may be represented with a triple
(subject, relation, object) or sometimes (subject, predicate, object)
Can also be expressed as relation(subject, object)
e.g. capital_of(Reykjavik, Iceland)
N-ary Relations
Binary relation: relational triple (e1, relation, e2)
(Barack Obama, place-of-birth, Honolulu)
(Barack Obama, height, 6’1”)
Many relations are not binary:
e.g. "[Person] appointed as [JobTitle] of [Organisation]"
One option: decompose n-ary relations to multiple binary
([Person], employed-by, [Organisation])
([Person], has-job-title, [JobTitle])
([Organisation], uses-job-title, [JobTitle])
Filtering illogical relations
Some relations only occur between certain entity types
Entities may have types from NER/linking
Rule-based with Hearst Patterns
1. Choose your relation of interest
2. Gather some example entity pairs for that relation
3. Find sentences that contain both entities
4. Find common patterns in those sentences that fit the relation of interest
5. Use those patterns on text to find more entity pairs, then repeat from Step 2
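The pattern-application step can be sketched with a regular expression; the pattern and sentences below are illustrative for a capital_of relation.

```python
import re

# A hand-written surface pattern for the capital_of relation.
pattern = re.compile(r"(\w+), the capital of (\w+)")

text = ("Reykjavik, the capital of Iceland, hosted the summit. "
        "Oslo, the capital of Norway, is further south.")

# Each match yields a new (city, country) entity pair for the next round.
pairs = pattern.findall(text)
print(pairs)  # [('Reykjavik', 'Iceland'), ('Oslo', 'Norway')]
```

The newly found pairs then seed the next bootstrapping iteration, which searches for further sentences and patterns.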
Supervised = annotated data
Good quality requires human annotators
What classification task is relation extraction?
It can be framed as three kinds of classification task:
Binary, Multi-class, Multi-label
What text is important for identifying the relation?
Entity names?
Words in between?
Words before and after?
Dependency-parse based
Every token may not be relevant for identifying the relation between two entities
Dependency parse can help isolate the important tokens
Idea: Find the shortest path between two tokens on the dependency path
Use the tokens on the path as input for a relation classifier
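The shortest-path idea can be sketched by treating the dependency parse as an undirected graph over token indices and running breadth-first search; the parse edges below are hand-made for illustration, not from a real parser.

```python
from collections import deque

def shortest_path(edges, start, goal):
    """BFS shortest path between two token indices in an undirected graph."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    queue, seen = deque([[start]]), {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in graph.get(path[-1], ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None

# "Obama was born in Honolulu": "born" is the root linking both entities.
tokens = ["Obama", "was", "born", "in", "Honolulu"]
edges = [(2, 0), (2, 1), (2, 3), (3, 4)]  # (head, dependent) pairs
path = shortest_path(edges, 0, 4)
print([tokens[i] for i in path])  # ['Obama', 'born', 'in', 'Honolulu']
```

The tokens on this path ("born in") are exactly the ones that signal the place-of-birth relation, while "was" is left out.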
BERT for relation classification
Need to tell the BERT model which tokens are the entities in the candidate relation
Different ways to use BERT for relation extraction
Treat it as text classification and don’t tell BERT about the relevant entities
Entity markers use special tokens to identify the entities
Can use specific context vectors (e.g. the entities or their markers)
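The entity-marker idea can be sketched as a token-level transformation; the marker strings ([E1], [/E1], ...) are illustrative, and real systems add such markers to the model's vocabulary as special tokens.

```python
# Wrap the two candidate entities in marker tokens before encoding.
def add_entity_markers(tokens, span1, span2):
    """Each span is (start, end) with exclusive end; span1 precedes span2."""
    s1, e1 = span1
    s2, e2 = span2
    return (tokens[:s1] + ["[E1]"] + tokens[s1:e1] + ["[/E1]"]
            + tokens[e1:s2] + ["[E2]"] + tokens[s2:e2] + ["[/E2]"]
            + tokens[e2:])

marked = add_entity_markers(["Obama", "born", "in", "Honolulu"], (0, 1), (3, 4))
print(" ".join(marked))  # [E1] Obama [/E1] born in [E2] Honolulu [/E2]
```

The context vectors at the marker positions (or at the entity tokens themselves) are then fed to the relation classifier.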
Entity Linking & Relation Extraction Tools
Entity resolution (linking)
spaCy, TAGME, DBpedia Spotlight, others…
Relation extraction
Stanford CoreNLP, DeepDive, Stanford MIML-RE, UMass FACTORIE
Coreference Resolution
Which candidate term in the sentence does the pronoun correspond to?
Reasoning/knowledge may be needed
Winograd schema: A famous set of challenging coreference problems
One word change can affect the co-reference and requires reasoning
Different types of references
Coreference: when two mentions refer to the same entity in the world
Anaphora: a word referring backwards to a previous word
Cataphora: a word referring forwards to a future word
Exophora: external context (maybe unresolved)
Mention detection
Mention: span of text referring to some entity
Pronouns
Use a part-of-speech tagger
Named entities
Use a named entity recognition approach
Noun phrases
Use a parser
Hobbs Algorithm for Pronoun Resolution
Task: Find the noun-phrase that a pronoun refers to
Defines a way to traverse a syntax tree
Features for coreference identification
Person/Number/Gender agreement
More recently mentioned entities are preferred as referents
Semantic compatibility
Grammatical Role
Certain syntactic constraints
Parallelism
Supervised: as a relation extraction problem
Binary classification for coreference relation between two mentions
Could use tf-idf approach with scikit-learn classifier or transformers
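The mention-pair framing can be sketched with a small feature function computing agreement and distance signals of the kind listed above; the feature names and toy mentions are illustrative.

```python
# Features for a candidate (antecedent, anaphor) mention pair.
def mention_pair_features(m1, m2):
    return {
        "same_gender": m1["gender"] == m2["gender"],   # gender agreement
        "same_number": m1["number"] == m2["number"],   # number agreement
        "token_distance": m2["position"] - m1["position"],  # recency signal
    }

xi = {"text": "Xi Jinping", "gender": "m", "number": "sg", "position": 1}
his = {"text": "his", "gender": "m", "number": "sg", "position": 7}
feats = mention_pair_features(xi, his)
print(feats)  # {'same_gender': True, 'same_number': True, 'token_distance': 6}
```

Such feature dictionaries (or transformer embeddings of the two mentions) are then fed to a binary classifier that decides whether the pair corefers.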
Supervised: as a clustering problem
Can apply different clustering methods to group mentions together