Information Extraction (IE) is the task of automatically extracting structured information from unstructured and/or semi-structured machine-readable documents.
The goal is "machine reading" of text to populate databases or knowledge graphs, enabling computers to understand and process human language in a meaningful way.
Organize information for people:
Info boxes in Wikipedia, which provide a structured summary of key facts about a topic.
Summarize facts across news collections, allowing users to quickly grasp the main points of multiple articles.
Organize information for machine algorithms:
Data analytics, enabling automated analysis of large volumes of text data.
New knowledge through inference (e.g., works-for(x, y) ∧ located-in(y, z) → lives-in(x, z)), allowing the discovery of implicit relationships.
Question answering, enabling machines to answer questions based on extracted information.
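The inference rule above can be sketched over a toy triple store (entity names are made up for illustration):

```python
# Toy knowledge base of (subject, relation, object) triples.
triples = {
    ("alice", "works-for", "acme"),
    ("acme", "located-in", "glasgow"),
}

def infer_lives_in(kb):
    """Apply the rule works-for(x, y) AND located-in(y, z) -> lives-in(x, z)."""
    inferred = set()
    for (x, r1, y) in kb:
        if r1 != "works-for":
            continue
        for (y2, r2, z) in kb:
            if r2 == "located-in" and y2 == y:
                inferred.add((x, "lives-in", z))
    return inferred

print(infer_lives_in(triples))  # {('alice', 'lives-in', 'glasgow')}
```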
Common steps in IE include:
Named Entity Recognition (NER), which identifies and classifies named entities in text.
Linking entities, which connects named entities to their corresponding entries in a knowledge base.
Extracting relations, which identifies and classifies relationships between entities.
Resolving coreferences, which identifies and groups mentions that refer to the same entity.
Detect and classify named entities, such as:
[Boeing]'s new CEO [Kelly Ortberg] acknowledged the company's past management failures, particularly under former CEO [Dave Calhoun], which led to significant financial losses and delays in aircraft deliveries. [Ortberg] admitted that they struggled with internal communication and decision-making during [Calhoun]'s tenure, which contributed to the prolonged delays in the development of the [737 MAX] and other key projects. Furthermore, [Boeing]'s inability to meet deadlines for high-profile contracts, such as the [Air Force One] replacement, reflected poor leadership and a lack of strategic oversight.
Ground entities to a representation of "world" knowledge.
Example: China (LOC) https://en.wikipedia.org/wiki/China
Classify relationships between entities.
Example: is CEO
Resolve pronouns to named entities.
Example: President Xi Jinping of China, on his first visit to the United States …
Mentions can be grouped if they refer to the same entity.
Facts extracted from text are often triples: (entity1, relationship, entity2).
A collection of these triples forms a knowledge graph or knowledge base.
(Paris, is-capital-of, France)
Knowledge graphs are directed graphs with entities as vertices and relations as labelled edges.
Useful for search and question answering systems.
Queries become traversals on the knowledge graph.
Wikidata: a structured knowledge graph with an entry for almost every page on Wikipedia.
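A minimal sketch of a triple store in which a query becomes an edge traversal (the triples are invented for illustration):

```python
from collections import defaultdict

# Store the graph as adjacency lists keyed by (entity, relation).
triples = [
    ("Paris", "is-capital-of", "France"),
    ("France", "member-of", "EU"),
]

graph = defaultdict(list)
for subj, rel, obj in triples:
    graph[(subj, rel)].append(obj)

# A query like "What is Paris the capital of?" is a single edge traversal.
print(graph[("Paris", "is-capital-of")])  # ['France']

# Multi-hop query: which union is that country a member of?
countries = graph[("Paris", "is-capital-of")]
print([u for c in countries for u in graph[(c, "member-of")]])  # ['EU']
```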
Finding mentions of specific entities.
Useful for search functionality.
Finding "facts" mentioned in the text.
Building structured knowledge bases or knowledge graphs.
Knowledge graphs are used for many applications, including question answering
Knowledge bases help people find information without having to read large amounts of text
Persons, objects, etc that are typically named with a proper noun.
Objects: Eiffel Tower
Persons: Nelson Mandela
Organisations: United Nations
Locations: Paris
etc
Domain-specific:
Biomedicine (drugs, genes, diseases, etc)
Legal (laws, courts, etc)
Sports (teams, players, events, etc)
Named entity recognition: Which tokens are naming a particular entity type?
Entity Linking: Which specific entity does the mention correspond to?
Define a list of terms and synonyms
Use exact string matching to find them in your corpus
Does mention detection and entity linking in one go
Don’t need annotated examples to train a classifier
Works fairly well for specific cases (e.g. names of drugs) where words are only used in a single context
Completely ignores the context of the sentence
Requires an exhaustive list of terms & synonyms - not appropriate for many purposes
What about ambiguous cases?
“Paris” in France or Texas?
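A minimal sketch of the dictionary approach, using a hypothetical drug gazetteer: mention detection and linking happen in one pass, but the sentence context is ignored entirely.

```python
import re

# Hypothetical term list: surface forms (and synonyms) map to one entity ID.
gazetteer = {
    "paracetamol": "DRUG:acetaminophen",
    "acetaminophen": "DRUG:acetaminophen",
    "ibuprofen": "DRUG:ibuprofen",
}

def dictionary_tag(text):
    """Exact (case-insensitive) string matching against the gazetteer.
    Returns (start, end, entity_id) spans."""
    pattern = re.compile(
        r"\b(" + "|".join(map(re.escape, gazetteer)) + r")\b", re.IGNORECASE
    )
    return [(m.start(), m.end(), gazetteer[m.group(1).lower()])
            for m in pattern.finditer(text)]

print(dictionary_tag("She took Paracetamol and ibuprofen."))
```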
Which spans of text are entities?
span = start and end coordinates in text
What types of entity are they?
Categories of entities (e.g. people, location, organisation, etc)
Often these two problems are solved together
Inside–outside–beginning (IOB) labeling – whether a token begins a named entity, is a continuation of a named entity, or is not one.
(Other labeling schemes exist)
Often paired with the task of extracting the type of entity: person, product, organisation, location, etc.
John Doe lost his £1300 MacBook Pro at the University of Glasgow .
B-PER I-PER O O B-MONEY B-PROD I-PROD O O B-ORG I-ORG I-ORG O
Given a sequence of tokens, label each one given a set of labels (e.g. O, B-CITY, I-CITY).
Supervised - therefore needs annotated data for training and evaluation
Hidden States: The IOB tags
Observed States: The words in the sequence
Goal: predict the hidden states (the IOB tags) from the observed words
HMMs can’t take in extra information:
The part of speech can be useful for predicting NER
Capitalization can be a useful signal
“Bath” - city? “bath” - object?
Other possible features
What kind of text is this?
Does the text match any known entities?
More context about other words/entities in the text
Basic CRF approach:
Turn this into a supervised classification problem
Basic features are the current token and the previously predicted label
Can then add custom features (e.g. Part of Speech, capitalization, etc)
CRFs allow modelling of more complex sequences while integrating extra features
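The feature-based setup can be sketched as a function mapping each token to a feature dictionary, the input format used by libraries such as sklearn-crfsuite (the specific features here are illustrative):

```python
def token_features(tokens, pos_tags, i):
    """Feature dict for token i: current token, previous token,
    capitalisation, and part-of-speech -- the kind of hand-built
    features a CRF can combine with the label sequence."""
    token = tokens[i]
    feats = {
        "token.lower": token.lower(),
        "token.istitle": token.istitle(),  # "Bath" (city) vs "bath" (object)
        "token.isupper": token.isupper(),
        "pos": pos_tags[i],
    }
    if i > 0:
        feats["prev.lower"] = tokens[i - 1].lower()
    else:
        feats["BOS"] = True  # beginning of sentence
    return feats

tokens = ["Bath", "is", "lovely"]
pos = ["NNP", "VBZ", "JJ"]
print(token_features(tokens, pos, 0))
```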
We can use transformer models like BERT to make token-level predictions.
Also effective: use a CRF on top of token embeddings.
Typically, we want whole entities, so the IOB tags need to be turned into spans
Token-level: Calculate (macro) precision, recall, F1 score for IOB tags
Often easier to set up
BUT: Dependent on the tokenizer, so cannot compare results from different tokenizers
Span-level: Match predicted spans versus the correct spans
Trickier: predicted and correct spans may not match exactly
Can calculate True Positives, False Positives and False Negatives
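Turning IOB tags into spans and counting exact span matches can be sketched as:

```python
def iob_to_spans(tags):
    """Convert IOB tags to (start, end, type) spans; end is exclusive."""
    spans, start, etype = [], None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel closes a final span
        if tag.startswith("B-") or tag == "O":
            if start is not None:
                spans.append((start, i, etype))
                start = None
        if tag.startswith("B-"):
            start, etype = i, tag[2:]
        elif tag.startswith("I-") and start is None:
            start, etype = i, tag[2:]  # tolerate an I- with no preceding B-
    return set(spans)

gold = iob_to_spans(["B-PER", "I-PER", "O", "B-ORG"])
pred = iob_to_spans(["B-PER", "I-PER", "O", "O"])
tp = len(gold & pred)   # spans predicted exactly right
fp = len(pred - gold)   # predicted spans with no exact gold match
fn = len(gold - pred)   # gold spans that were missed
print(tp, fp, fn)  # 1 0 1
```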
Supervised approaches
HMM: classic method, but doesn't do well with extra features
Conditional Random Fields: treats the problem like a sequential classification problem
can factor in extra features like part-of-speech
BERT-based: classification problem for each token
generates dense context vectors for each token
use vectors as input to a classifier for different token labels
About linking mentions to a knowledge base
Is about dealing with ambiguity
Many knowledge bases have lists of names that the entity is known by
Can be used for exact matching or other comparisons
Can apply ideas from information retrieval [see Information Retrieval course for more detail]
Candidate generation
Use faster (less accurate) approach to get a short list of candidates
Two popular options: character n-grams & dense vector representations
(optionally) rerank candidates to pick top hits
Use a more costly approach to score the short list and pick the best
A popular candidate generation approach
Represent an entity using a vector built from its name (or other aliases)
Use ideas from earlier in the course
Encode all the entities in the knowledge base with a transformer
Encode mention with a transformer to get a dense vector
Use dot-product or cosine similarity to compare vectors to find best entity
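A minimal sketch of the similarity step, with made-up 3-dimensional vectors standing in for the transformer embeddings:

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical entity vectors; in practice these come from encoding
# each entity's name/aliases with a transformer.
entity_vecs = {
    "Paris (France)": [0.9, 0.1, 0.0],
    "Paris (Texas)": [0.1, 0.9, 0.1],
}
mention_vec = [0.8, 0.2, 0.1]  # encoded mention of "Paris" in context

best = max(entity_vecs, key=lambda e: cosine(mention_vec, entity_vecs[e]))
print(best)  # Paris (France)
```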
Pair up the document text (with entity tagged) with one of the candidates
Train as a binary classifier (whether it is the correct entity)
Pick the candidate with the best score
An example of dense-vector candidate generation (known as a bi-encoder) with a cross-encoder reranker
How many mentions are matched to the correct entity?
Accuracy@1
How many mentions have the correct entity in their top 5 candidates?
Accuracy@5
Or other values (e.g. top 10, top 50, etc)
Other information retrieval metrics may also be appropriate
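Accuracy@k can be computed as follows (the candidate lists and gold entity IDs are invented):

```python
def accuracy_at_k(ranked_candidates, gold, k):
    """Fraction of mentions whose correct entity is in the top-k candidates."""
    hits = sum(1 for cands, g in zip(ranked_candidates, gold) if g in cands[:k])
    return hits / len(gold)

# One ranked candidate list per mention, plus the correct entity for each.
ranked = [["Q90", "Q830149"], ["Q5", "Q30"], ["Q30", "Q99"]]
gold = ["Q90", "Q30", "Q99"]

print(accuracy_at_k(ranked, gold, 1))  # 1 of 3 correct at rank 1
print(accuracy_at_k(ranked, gold, 2))  # 1.0
```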
Searching a knowledge base of entities for the right one
Can use ideas from information retrieval
Getting candidates
Character n-grams or dense vector approach
Optionally reranking with a cross-encoder
Evaluating with hits@1, hits@k, etc
Why extract relations?
We want to know whether (and how) entities are related
Useful for all kinds of applications
Question-answering
Knowledge inference
Two entities appearing in lots of sentences/documents together likely means that they are connected
Could measure with raw counts of co-occurrences
May be skewed by very popular terms
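A sketch of co-occurrence counting; pointwise mutual information (PMI) is one standard way to correct for very popular terms (the entities and counts here are invented):

```python
import math
from collections import Counter
from itertools import combinations

# Each "sentence" here is just the set of entities mentioned in it.
sentences = [
    {"Obama", "Honolulu"},
    {"Obama", "USA"},
    {"Obama", "Honolulu"},
    {"USA", "Paris"},
]

entity_counts = Counter(e for s in sentences for e in s)
pair_counts = Counter(frozenset(p) for s in sentences
                      for p in combinations(sorted(s), 2))
n = len(sentences)

def pmi(e1, e2):
    """Pointwise mutual information: high when e1 and e2 co-occur more
    often than their individual popularity would predict."""
    p_xy = pair_counts[frozenset((e1, e2))] / n
    p_x = entity_counts[e1] / n
    p_y = entity_counts[e2] / n
    return math.log2(p_xy / (p_x * p_y))

print(round(pmi("Obama", "Honolulu"), 3))
```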
Knowledge may be represented with a triple
(subject, relation, object) or sometimes (subject, predicate, object)
Can also be expressed as relation(subject, object)
e.g. capital_of(Reykjavik, Iceland)
Binary relation: relational triple (e1, relation, e2)
(Barack Obama, place-of-birth, Honolulu)
(Barack Obama, height, 6’1”)
Many relations are not binary:
“[person] appointed as [job-title] by [organisation]”
One option: decompose n-ary relations to multiple binary
([person], employed-by, [organisation])
([person], has-job-title, [job-title])
([organisation], uses-job-title, [job-title])
Some relations only occur between certain entity types
Entities may have types from NER/linking
Choose your relation of interest
Gather some example entity pairs for that relation
Find sentences that contain both entities
Find common patterns in those sentences that fit the relation of interest
Use those patterns on text to find more entity pairs and repeat Steps 2-5
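One iteration of the bootstrapping loop above can be sketched with naive string patterns (the sentences and seed pair are invented; real systems use more robust pattern matching):

```python
import re

corpus = [
    "Reykjavik is the capital of Iceland.",
    "Paris is the capital of France.",
    "Ottawa, the capital of Canada, is cold.",
]

# Seed entity pairs for the relation of interest (capital-of).
seeds = {("Reykjavik", "Iceland")}

# Find sentences containing both entities and take the text
# between them as a candidate pattern.
patterns = set()
for e1, e2 in seeds:
    for sent in corpus:
        if e1 in sent and e2 in sent:
            between = sent.split(e1, 1)[1].split(e2, 1)[0].strip()
            patterns.add(between)

# Apply the patterns to the corpus to find new entity pairs,
# which could seed the next iteration.
found = set()
for pat in patterns:
    for sent in corpus:
        m = re.search(r"(\w+)\W+" + re.escape(pat) + r"\W+(\w+)", sent)
        if m:
            found.add((m.group(1), m.group(2)))

print(patterns, found)
```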
Good quality requires human annotators
Three types of classification task:
Binary, Multi-class, Multi-label
Entity names?
Words in between?
Words before and after?
Not every token is relevant for identifying the relation between two entities
Dependency parse can help isolate the important tokens
Idea: Find the shortest path between the two entities in the dependency tree
Use the tokens on the path as input for a relation classifier
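A sketch of the shortest-path idea using breadth-first search over hand-written dependency edges (a real system would get the edges from a dependency parser such as spaCy's):

```python
from collections import deque

# Hand-written dependency edges (head, dependent) for:
#   "Obama, who was born in Honolulu, visited Paris."
edges = [
    ("visited", "Obama"), ("visited", "Paris"),
    ("Obama", "born"), ("born", "who"), ("born", "was"),
    ("born", "in"), ("in", "Honolulu"),
]

# Treat the tree as undirected for path finding.
adj = {}
for head, dep in edges:
    adj.setdefault(head, set()).add(dep)
    adj.setdefault(dep, set()).add(head)

def shortest_path(a, b):
    """Breadth-first search for the shortest path between two tokens."""
    queue, seen = deque([[a]]), {a}
    while queue:
        path = queue.popleft()
        if path[-1] == b:
            return path
        for nxt in adj[path[-1]] - seen:
            seen.add(nxt)
            queue.append(path + [nxt])

# Tokens on this path would feed the relation classifier.
print(shortest_path("Obama", "Honolulu"))  # ['Obama', 'born', 'in', 'Honolulu']
```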
Need to tell the BERT model which tokens are the entities in the candidate relation
Treat it as text classification and don’t tell BERT about the relevant entities
Entity markers use special tokens to identify the entities
Can use specific context vectors (e.g. the entities or their markers)
Entity resolution (linking)
spaCy, TAGME, DBpedia Spotlight, others…
Relation extraction
Stanford CoreNLP, DeepDive, Stanford MIML-RE, UMass FACTORIE
Which term (in purple or green) does the pronoun (in red) correspond to?
Winograd schema: A famous set of challenging coreference problems
A single word change can flip the coreference, and resolving it requires reasoning
Coreference: when two mentions refer to the same entity in the world
Anaphora: a word referring backwards to a previous word
Cataphora: a word referring forwards to a future word
Exophora: external context (maybe unresolved)
Mention: span of text referring to some entity
Pronouns
Use a part-of-speech tagger
Named entities
Use a named entity recognition approach
Noun phrases
Use a parser
Task: Find the noun-phrase that a pronoun refers to
Defines a way to traverse a syntax tree
Person/Number/Gender agreement
More recently mentioned entities are preferred as referents
Semantic compatibility
Grammatical Role
Certain syntactic constraints
Parallelism
Binary classification for coreference relation between two mentions
Could use tf-idf approach with scikit-learn classifier or transformers
Can apply different clustering methods to group mentions together
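The mention-pair approach followed by clustering can be sketched with hypothetical classifier scores and union-find grouping (a trained classifier would supply the scores):

```python
# Mentions in a document, with hypothetical pairwise coreference scores
# from a binary classifier (closer to 1.0 = "same entity").
mentions = ["Barack Obama", "he", "Paris", "the city"]
pair_scores = {
    (0, 1): 0.95,  # "Barack Obama" <- "he"
    (2, 3): 0.85,  # "Paris" <- "the city"
    (1, 2): 0.05,
}

# Union-find: merge any mention pair scored above a threshold.
parent = list(range(len(mentions)))

def find(i):
    """Return the cluster root of mention i, with path compression."""
    while parent[i] != i:
        parent[i] = parent[parent[i]]
        i = parent[i]
    return i

for (i, j), score in pair_scores.items():
    if score > 0.5:
        parent[find(i)] = find(j)

# Group mentions by their cluster root.
clusters = {}
for i, m in enumerate(mentions):
    clusters.setdefault(find(i), []).append(m)
print(list(clusters.values()))
```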