
Information Extraction Notes

Information Extraction

  • Information Extraction (IE) is the task of automatically extracting structured information from unstructured and/or semi-structured machine-readable documents.

  • The goal is "machine reading" of text to populate databases or knowledge graphs, enabling computers to understand and process human language in a meaningful way.

Goals of Information Extraction
  1. Organize information for people:

    • Info boxes in Wikipedia, which provide a structured summary of key facts about a topic.

    • Summarize facts across news collections, allowing users to quickly grasp the main points of multiple articles.

  2. Organize information for machine algorithms:

    • Data analytics, enabling automated analysis of large volumes of text data.

    • New knowledge through inference (e.g., works-for(x, y) ∧ located-in(y, z) → lives-in(x, z)), allowing the discovery of implicit relationships.

    • Question answering, enabling machines to answer questions based on extracted information.

Information Extraction Steps
  • Common steps in IE include:

    • Named Entity Recognition (NER), which identifies and classifies named entities in text.

    • Linking entities, which connects named entities to their corresponding entries in a knowledge base.

    • Extracting relations, which identifies and classifies relationships between entities.

    • Resolving coreferences, which identifies and groups mentions that refer to the same entity.

Named Entity Recognition (NER)
  • Detect and classify named entities, such as:

    • [Boeing]'s new CEO [Kelly Ortberg] acknowledged the company's past management failures, particularly under former CEO [Dave Calhoun], which led to significant financial losses and delays in aircraft deliveries. [Ortberg] admitted that they struggled with internal communication and decision-making during [Calhoun]'s tenure, which contributed to the prolonged delays in the development of the [737 MAX] and other key projects. Furthermore, [Boeing]'s inability to meet deadlines for high-profile contracts, such as the [Air Force One] replacement, reflected poor leadership and a lack of strategic oversight.

Linking Entities
  • Ground entities to a representation of "world" knowledge.

    • Example: China (LOC) https://en.wikipedia.org/wiki/China

Relation Extraction
  • Classify relationships between entities.

    • Example: is-CEO-of (e.g., Kelly Ortberg is-CEO-of Boeing)

Coreference Resolution
  • Resolve pronouns to named entities.

    • Example: President Xi Jinping of China, on his first visit to the United States …

    • Mentions (e.g. “President Xi Jinping” and “his”) can be grouped if they refer to the same entity

Knowledge Graphs / Knowledge Bases
  • Facts extracted from text are often triples: (entity1, relationship, entity2).

  • A collection of these triples forms a knowledge graph or knowledge base.

    • (Paris, is-capital-of, France)

  • Knowledge graphs are directed graphs with entities as vertices and relations as labelled edges.

  • Useful for search and question answering systems.

  • Queries become traversals on the knowledge graph.
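
The points above can be sketched in a few lines of Python; the `TripleStore` class and the example entities are illustrative, not from any particular library:

```python
from collections import defaultdict

# A minimal triple store: the knowledge graph is a directed graph with
# entities as vertices and relations as labelled edges.
class TripleStore:
    def __init__(self):
        self.edges = defaultdict(list)  # subject -> [(relation, object), ...]

    def add(self, subj, rel, obj):
        self.edges[subj].append((rel, obj))

    def query(self, subj, rel):
        # A query is a traversal: follow 'rel' edges out of 'subj'
        return [o for (r, o) in self.edges[subj] if r == rel]

kg = TripleStore()
kg.add("Paris", "is-capital-of", "France")
kg.add("France", "located-in", "Europe")

print(kg.query("Paris", "is-capital-of"))  # ['France']
```

Multi-hop questions ("Which continent is Paris's country in?") become chains of such traversals.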

Example of a Knowledge Graph
  • Wikidata: a structured knowledge graph with an entry for almost every page on Wikipedia.

Uses of Information Extraction
  • Finding mentions of specific entities.

    • Useful for search functionality.

  • Finding "facts" mentioned in the text.

    • Building structured knowledge bases or knowledge graphs.

    • Graphs are used for many applications, including question answering (QA)

    • Knowledge bases can help people find information without having to read lots of text

Named Entities
  • Persons, objects, etc. that are typically named with a proper noun.

    • Objects: Eiffel Tower

    • Persons: Nelson Mandela

    • Organisations: United Nations

    • Locations: Paris

    • etc

  • Domain-specific:

    • Biomedicine (drugs, genes, diseases, etc)

    • Legal (laws, courts, etc)

    • Sports (teams, players, events, etc)

Two Common Steps for Extracting Entities
  1. Named entity recognition: Which tokens are naming a particular entity type?

  2. Entity Linking: Which specific entity does the mention correspond to?

Solving with a huge dictionary of terms
  1. Define a list of terms and synonyms

  2. Use exact string matching to find them in your corpus
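
The two steps above can be sketched in a few lines of Python; the drug names and `DRUG:` entity IDs are made-up placeholders:

```python
import re

# Hypothetical term dictionary: surface forms (and synonyms) mapped to an entity ID
DICTIONARY = {
    "aspirin": "DRUG:aspirin",
    "acetylsalicylic acid": "DRUG:aspirin",  # synonym of the same entity
    "ibuprofen": "DRUG:ibuprofen",
}

def dictionary_match(text):
    """Find dictionary terms in text by exact (case-insensitive) string matching."""
    matches = []
    for term, entity_id in DICTIONARY.items():
        # \b anchors avoid matching inside longer words
        for m in re.finditer(r"\b" + re.escape(term) + r"\b", text, re.IGNORECASE):
            matches.append((m.start(), m.end(), entity_id))
    return sorted(matches)

print(dictionary_match("The patient took aspirin and ibuprofen."))
# [(17, 24, 'DRUG:aspirin'), (29, 38, 'DRUG:ibuprofen')]
```

Note that because synonyms map to the same ID, mention detection and entity linking happen in one go, as the advantages below point out.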

Advantages:
  • Does mention detection and entity linking in one go

  • Don’t need annotated examples to train a classifier

  • Works fairly well for specific cases (e.g. names of drugs) where words are only used in a single context

Disadvantages:
  • Completely ignores the context of the sentence

  • Requires an exhaustive list of terms & synonyms - not appropriate for many purposes

  • What about ambiguous cases?

    • “Paris” in France or Texas?

Named Entity Recognition
  1. Which spans of text are entities?

    • span = start and end coordinates in text

  2. What types of entity are they?

    • Categories of entities (e.g. people, location, organisation, etc)

  • Often these two problems are solved together

Sequence-Labelling Problem
  • Inside–outside–beginning (IOB) labeling – whether a token begins a named entity, is a continuation of a named entity, or is not one.

    • (Other labeling schemes exist)

  • Often paired with the task of extracting the type of entity: person, product, organisation, location, etc.

    • John Doe lost his £1300 MacBook Pro at the University of Glasgow .

    • B-PER I-PER O O B-MONEY B-PROD I-PROD O O B-ORG I-ORG I-ORG O

Supervised Sequence Labelling
  • Given a sequence of tokens, label each one given a set of labels (e.g. O, B-CITY, I-CITY).

  • Supervised - therefore needs annotated data for training and evaluation

Hidden Markov Models (HMMs)
  • Hidden States: The IOB tags

  • Observed States: The words in the sequence

  • Goal: Predict the hidden tag sequence from the observed words!
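
A toy illustration of HMM decoding with the Viterbi algorithm; the two-tag state set and all probabilities below are invented for the example, not trained values:

```python
import math

def viterbi(obs, states, start_p, trans_p, emit_p, floor=1e-6):
    """Most likely hidden tag sequence for an observed word sequence (log-space)."""
    # V[t][s]: best log-probability of any tag sequence ending in state s at time t
    V = [{s: math.log(start_p[s]) + math.log(emit_p[s].get(obs[0], floor))
          for s in states}]
    back = [{}]
    for t in range(1, len(obs)):
        V.append({}); back.append({})
        for s in states:
            prev = max(states, key=lambda p: V[t - 1][p] + math.log(trans_p[p][s]))
            V[t][s] = (V[t - 1][prev] + math.log(trans_p[prev][s])
                       + math.log(emit_p[s].get(obs[t], floor)))
            back[t][s] = prev
    # Backtrace from the best final state
    best = max(states, key=lambda s: V[-1][s])
    path = [best]
    for t in range(len(obs) - 1, 0, -1):
        path.insert(0, back[t][path[0]])
    return path

# Toy, made-up probabilities for a two-tag model (O vs. B-LOC)
states = ["O", "B-LOC"]
start_p = {"O": 0.9, "B-LOC": 0.1}
trans_p = {"O": {"O": 0.8, "B-LOC": 0.2}, "B-LOC": {"O": 0.9, "B-LOC": 0.1}}
emit_p = {"O": {"i": 0.3, "live": 0.3, "in": 0.4}, "B-LOC": {"paris": 1.0}}

print(viterbi(["i", "live", "in", "paris"], states, start_p, trans_p, emit_p))
# ['O', 'O', 'O', 'B-LOC']
```

In a real tagger the transition and emission probabilities are estimated from annotated training data rather than written by hand.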

Limits of Hidden Markov Models

HMMs can’t take in extra information:

  • The part of speech can be useful for predicting NER

  • Capitalization can be a useful signal

    • “Bath” - city? “bath” - object?

  • Other possible features

    • What kind of text is this?

    • Does the text match any known entities?

    • More context about other words/entities in the text

Conditional Random Fields (CRF)
  • Basic CRF approach:

    • Turn this into a supervised classification problem

    • Basic features are the current token and the previously predicted label

    • Can then add custom features (e.g. Part of Speech, capitalization, etc)

  • CRFs allow modelling of more complex sequences while integrating extra features

BERT-based sequence labelling
  • We can use transformer models like BERT to make token-level predictions.

  • Also effective: use a CRF on top of token embeddings.

Recovering spans from Inside–outside–beginning tags
  • Typically, we want the whole entities, so need to turn IOB tags into spans
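
One possible conversion routine, as a sketch assuming the IOB2 convention with end-exclusive token indices:

```python
def iob_to_spans(tags):
    """Convert per-token IOB tags into (start, end, type) spans (end-exclusive)."""
    spans, start, etype = [], None, None
    for i, tag in enumerate(tags):
        # Close any open entity when we hit O, a new B-, or a type change
        if tag == "O" or tag.startswith("B-") or (start is not None and etype != tag[2:]):
            if start is not None:
                spans.append((start, i, etype))
                start, etype = None, None
        # Open a new entity on B-, or on a stray I- with nothing open
        if tag.startswith("B-") or (tag.startswith("I-") and start is None):
            start, etype = i, tag[2:]
    if start is not None:
        spans.append((start, len(tags), etype))
    return spans

tags = ["B-PER", "I-PER", "O", "O", "B-PROD", "I-PROD", "O"]
print(iob_to_spans(tags))  # [(0, 2, 'PER'), (4, 6, 'PROD')]
```

Treating a stray `I-` tag as the start of an entity is one common repair heuristic; strictly discarding such tags is another valid choice.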

Evaluation of Named Entity Recognition
  • Token-level: Calculate (macro) precision, recall, F1 score for IOB tags

    • Often easier to set up

    • BUT: Dependent on the tokenizer, so results from systems using different tokenizers are not directly comparable

  • Span-level: Match predicted spans versus the correct spans

    • Trickier: predicted and correct spans may not match exactly

    • Can calculate True Positives, False Positives and False Negatives
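
A minimal exact-match span evaluation might look like this (a sketch; real evaluation schemes often add per-type breakdowns and partial-match variants):

```python
def span_prf(predicted, gold):
    """Exact-match span evaluation: a predicted span is a true positive only if
    its (start, end, type) triple exactly matches a gold span."""
    pred, gold = set(predicted), set(gold)
    tp = len(pred & gold)
    fp = len(pred - gold)
    fn = len(gold - pred)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

gold = [(0, 2, "PER"), (4, 6, "PROD"), (9, 12, "ORG")]
predicted = [(0, 2, "PER"), (4, 5, "PROD")]   # second span's boundary is wrong
print(span_prf(predicted, gold))  # precision 0.5, recall 1/3, F1 0.4
```

Note how the boundary error costs both a false positive and a false negative, which is exactly the "trickier" case mentioned above.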

Summary of NER methods
  • Supervised approaches

    • HMM: classic method, but doesn’t handle extra features well

    • Conditional Random Fields: treats the problem as sequential classification

      • can factor in extra features like part-of-speech

    • BERT-based: classification problem for each token

      • generates dense context vectors for each token

      • use vectors as input to a classifier for different token labels

Entity Linking
  • Links mentions in text to entries in a knowledge base

  • The core challenge is ambiguity: many entities share the same name

Using aliases from the knowledge base
  • Many knowledge bases have lists of names that the entity is known by

    • Can be used for exact matching or other comparisons

Entity linking is a retrieval task
  • Can apply ideas from information retrieval [see Information Retrieval course for more detail]

    • Candidate generation

      • Use faster (less accurate) approach to get a short list of candidates

      • Two popular options: character n-grams & dense vector representations

    • (optionally) rerank candidates to pick top hits

      • Use a more costly approach to score the short list and pick the best

Candidate generation: character n-grams
  • A popular candidate generation approach

  • Represent an entity using a vector built from its name (or other aliases)

  • Use ideas from earlier in the course
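
For instance, entity names can be compared by the overlap of their character trigram sets; the names below and the choice of Jaccard similarity are illustrative:

```python
def char_ngrams(s, n=3):
    """Character n-grams of a lowercased string, with '#' boundary markers."""
    s = "#" + s.lower() + "#"
    return {s[i:i + n] for i in range(len(s) - n + 1)}

def jaccard(a, b):
    return len(a & b) / len(a | b)

def generate_candidates(mention, entity_names, k=2, n=3):
    """Rank knowledge-base entity names by character n-gram overlap with the mention."""
    m = char_ngrams(mention, n)
    scored = sorted(((jaccard(m, char_ngrams(name, n)), name)
                     for name in entity_names), reverse=True)
    return [name for _, name in scored[:k]]

print(generate_candidates("paris", ["Paris", "Parma", "London"]))
# ['Paris', 'Parma']
```

Because n-gram sets tolerate small spelling differences, this is more forgiving than exact matching while still being cheap enough to run over a whole knowledge base.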

Candidate generation: dense vector representations
  1. Encode all the entities in the knowledge base with a transformer

  2. Encode mention with a transformer to get a dense vector

  3. Use dot-product or cosine similarity to compare vectors to find best entity

Reranking the candidates with a cross-encoder
  • Pair up the document text (with entity tagged) with one of the candidates

    • Train as a binary classifier (whether it is the correct entity)

    • Pick the candidate with the best score

  • This is an example of dense-vector candidate generation (known as a bi-encoder) combined with a cross-encoder reranker

Evaluating entity linking
  • How many mentions are matched to the correct entity?

    • Accuracy@1

  • How many mentions have the correct entity in their top 5 candidates?

    • Accuracy@5

    • Or other values (e.g. top 10, top 50, etc)

  • Other information retrieval metrics may also be appropriate
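
Accuracy@k is straightforward to compute given ranked candidate lists per mention; the entity IDs below are hypothetical:

```python
def accuracy_at_k(ranked_candidates, gold_entities, k):
    """Fraction of mentions whose correct entity appears in the top-k candidates."""
    hits = sum(1 for cands, gold in zip(ranked_candidates, gold_entities)
               if gold in cands[:k])
    return hits / len(gold_entities)

# Hypothetical candidate rankings for three mentions
ranked = [["paris_fr", "paris_tx"],
          ["paris_tx", "paris_fr"],   # wrong entity ranked first
          ["london_uk", "london_ca"]]
gold = ["paris_fr", "paris_fr", "london_uk"]

print(accuracy_at_k(ranked, gold, 1))  # Accuracy@1 = 2/3
print(accuracy_at_k(ranked, gold, 2))  # Accuracy@2 = 1.0
```

Accuracy@k with k > 1 is often used to judge the candidate-generation step alone, since the reranker can still recover the right answer from the short list.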

Summary of entity linking
  • Searching a knowledge base of entities for the right one

    • Can use ideas from information retrieval

  • Getting candidates

    • Character n-grams or dense vector approach

  • Optionally reranking with a cross-encoder

  • Evaluating with hits@1, hits@k, etc

Relation Extraction
  • Why extract relations?

    • We want to know whether (and how) entities are related

    • Useful for all kinds of applications

      • Question-answering

      • Knowledge inference

A blunt tool: co-occurrences
  • Two entities appearing in lots of sentences/documents together likely means that they are connected

  • Could measure with raw counts of co-occurrences

  • May be skewed by very popular terms
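
Counting co-occurrences needs only a few lines; the documents and entity mentions below are illustrative:

```python
from collections import Counter
from itertools import combinations

def cooccurrence_counts(docs_entities):
    """Count how often each unordered pair of entities appears in the same document."""
    counts = Counter()
    for entities in docs_entities:
        # sorted() makes the pair order canonical; set() ignores repeat mentions
        for pair in combinations(sorted(set(entities)), 2):
            counts[pair] += 1
    return counts

docs = [["Boeing", "Kelly Ortberg"],
        ["Boeing", "Kelly Ortberg", "Dave Calhoun"],
        ["Boeing", "737 MAX"]]
counts = cooccurrence_counts(docs)
print(counts[("Boeing", "Kelly Ortberg")])  # 2
```

Normalizing by each entity's individual frequency (e.g. pointwise mutual information) is one common way to correct for the popular-term skew noted above.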

Knowing the relation type is invaluable
  • Knowledge may be represented with a triple

    • (subject, relation, object) or sometimes (subject, predicate, object)

  • Can also be expressed as relation(subject, object)

    • e.g. capital_of(Reykjavik, Iceland)

N-ary Relations
  • Binary relation: relational triple (e1, relation, e2)

    • (Barack Obama, place-of-birth, Honolulu)

    • (Barack Obama, height, 6’1”)

  • Many relations are not binary:

    • “[person] appointed as [job title] at [organisation]”

  • One option: decompose n-ary relations to multiple binary

    • ([person], employed-by, [organisation])

    • ([person], has-job-title, [job title])

    • ([organisation], uses-job-title, [job title])

Filtering illogical relations
  • Some relations only occur between certain entity types

  • Entities may have types from NER/linking

Rule-based with Hearst Patterns
  1. Choose your relation of interest

  2. Gather some example entity pairs for that relation

  3. Find sentences that contain both entities

  4. Find common patterns in those sentences that fit the relation of interest

  5. Use those patterns on text to find more entity pairs and repeat Steps 2-5
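
As a sketch, a single Hearst-style pattern (“X such as Y and Z”) can be implemented with a regular expression; matching single words rather than full noun phrases is a simplification:

```python
import re

# "X such as Y and Z" suggests Y is-a X and Z is-a X
PATTERN = re.compile(r"(\w+) such as (\w+)(?: and (\w+))?")

def extract_isa(text):
    pairs = []
    for m in PATTERN.finditer(text):
        hypernym = m.group(1)
        for hyponym in m.groups()[1:]:
            if hyponym:
                pairs.append((hyponym, "is-a", hypernym))
    return pairs

print(extract_isa("They visited cities such as Paris and Berlin."))
# [('Paris', 'is-a', 'cities'), ('Berlin', 'is-a', 'cities')]
```

The newly extracted pairs (Paris, Berlin) would then seed the next iteration of Steps 2–5, which is what makes the approach a bootstrapping method.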

Supervised = annotated data
  • Good quality requires human annotators

What classification task is relation extraction?
  • It can be framed as any of three kinds of classification task

  • Binary, multi-class, or multi-label

What text is important for identifying the relation?
  • Entity names?

  • Words in between?

  • Words before and after?

Dependency-parse based
  • Every token may not be relevant for identifying the relation between two entities

  • Dependency parse can help isolate the important tokens

  • Idea: Find the shortest path between the two entity tokens in the dependency parse

    • Use the tokens on the path as input for a relation classifier

BERT for relation classification
  • Need to tell the BERT model which tokens are the entities in the candidate relation

Different ways to use BERT for relation extraction
  1. Treat it as text classification and don’t tell BERT about the relevant entities

  2. Entity markers use special tokens to identify the entities

    • Can use specific context vectors (e.g. the entities or their markers)
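
A sketch of the entity-marker idea at the token level; the `[E1]`/`[E2]` marker names are illustrative, and a real system would register them as special tokens in the tokenizer’s vocabulary:

```python
def insert_entity_markers(tokens, head_span, tail_span):
    """Wrap the two candidate-entity spans (end-exclusive token indices) in
    special marker tokens so the model knows which spans the relation is about."""
    (h1, h2), (t1, t2) = head_span, tail_span
    out = []
    for i, tok in enumerate(tokens):
        if i == h1: out.append("[E1]")
        if i == t1: out.append("[E2]")
        out.append(tok)
        if i == h2 - 1: out.append("[/E1]")
        if i == t2 - 1: out.append("[/E2]")
    return out

tokens = ["Kelly", "Ortberg", "is", "CEO", "of", "Boeing", "."]
print(insert_entity_markers(tokens, (0, 2), (5, 6)))
# ['[E1]', 'Kelly', 'Ortberg', '[/E1]', 'is', 'CEO', 'of', '[E2]', 'Boeing', '[/E2]', '.']
```

The classifier can then read off the context vectors at the marker positions (option 2 above) instead of relying on a single sentence-level vector.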

Entity Linking & Relation Extraction Tools
  • Entity resolution (linking)

    • spaCy, TAGME, DBpedia Spotlight, others…

  • Relation extraction

    • Stanford CoreNLP, DeepDive, Stanford MIML-RE, UMass FACTORIE

Coreference Resolution
  • Which previously mentioned term does each pronoun correspond to?

Reasoning/knowledge may be needed
  • Winograd schema: A famous set of challenging coreference problems

  • One word change can affect the co-reference and requires reasoning

Different types of references
  • Coreference: when two mentions refer to the same entity in the world

  • Anaphora: a word referring backwards to a previous word

  • Cataphora: a word referring forwards to a future word

  • Exophora: a reference to something in the external context, which may be unresolvable from the text alone

Mention detection
  • Mention: span of text referring to some entity

    • Pronouns

      • Use a part-of-speech tagger

    • Named entities

      • Use a named entity recognition approach

    • Noun phrases

      • Use a parser

Hobbs Algorithm for Pronoun Resolution
  • Task: Find the noun-phrase that a pronoun refers to

  • Defines a systematic way to traverse a syntax tree

Features for coreference identification
  1. Person/Number/Gender agreement

  2. More recently mentioned entities are preferred as referents

  3. Semantic compatibility

  4. Grammatical Role

  5. Certain syntactic constraints

  6. Parallelism

Supervised: as a relation extraction problem
  • Binary classification for coreference relation between two mentions

  • Could use tf-idf approach with scikit-learn classifier or transformers

Supervised: as a clustering problem
  • Can apply different clustering methods to group mentions together
