1st

Text Mining

  • Unstructured data is the easiest form of data that can be created in any application scenario.
  • There's a tremendous need to design methods and algorithms which can effectively process a wide variety of text applications
  • Text mining can be regarded as going beyond information access to further help users analyze and digest information and facilitate decision-making

Module 1 - Information Extraction and Text Summarization

Information Extraction:

  • Information extraction is the task of finding structured information from unstructured or semi-structured text.
  • Two fundamental tasks of information extraction are:
    • Named entity recognition
    • Relation extraction
  • Named entity recognition refers to finding names of entities such as people, organizations, and locations.
    • Example: In 1998, Larry Page and Sergey Brin founded Google Inc.
  • Relation extraction refers to finding semantic relations such as FounderOf and HeadquarteredIn between entities.
    • Examples:
      • FounderOf(Larry Page, Google Inc.)
      • FounderOf(Sergey Brin, Google Inc.)
      • FoundedIn(Google Inc., 1998)
  • Example of Information Extraction:
    • Text: Brazil ranks number 5 in the list of countries by population. The term "Ibu Negara” (Lady/Mother of the State) is used for wife of the President of Indonesia. Game of Thrones is an adaptation of A Song of Ice and Fire, George R. R. Martin's series of fantasy novels. It ranks fourth among the IMDB Top Rated TV Shows
    • Data Out
      • THE COUNTRIES WITH THE LARGEST POPULATION
        • China 1 1,388,232,693
        • India 2 1,342,512,706
        • United States 3 326,474,013
        • Indonesia 4 263,510,146
        • Brasil 5 174,315,386
      • THE COUNTRY'S' FIRST LADIES
        • Brigitte Macron
          • Spouse: Emmanuel Macron, President of France (2017-)
        • Melania Trump
          • Spouse: Donald J. Trump, U.S. President (2017-)
        • Iriana Widodo
          • Spouse: Joko Widodo, President of Indonesia (2014-)
          • Also known as: "Ibu Negara" (Lady/Mother of the State)
      • IMDB TOP RATED TV SHOWS
        • Planet Earth II (2016) 9.6
        • Band of Brothers (2001) 9.5
        • Planet Earth (2006) 9.5
        • Game of Thrones (2011) 9.4
        • Breaking Bad (2008) 9.4
  • For structured information to be extracted from unstructured texts, the following main subtasks are involved:
    • Pre-processing of the text – this is where the text is prepared for processing with the help of computational linguistics tools such as tokenization, sentence splitting, morphological analysis (Morphemes can also be divided into inflectional or derivational morphemes.), etc.
    • Finding and classifying concepts – this is where mentions of people, things, locations, events and other pre-specified types of concepts are detected and classified.
    • Connecting the concepts – this is the task of identifying relationships between the extracted concepts.
    • Unifying – this subtask is about presenting the extracted data into a standard form.
    • Getting rid of the noise – this subtask involves eliminating duplicate data.
    • Enriching your knowledge base – this is where the extracted knowledge is ingested in your database for further use.

Example of Information Extraction

  • Consider the paragraph below (an excerpt from a news article about Valencia MotoGP and Marc Marques):
    • Marc Marquez was fastest in the final MotoGP warm-up session of the 2016 season at Valencia, heading Maverick Vinales by just over a tenth of a second. After qualifying second on Saturday behind a rampant Jorge Lorenzo, Marquez took charge of the 20-minute session from the start, eventually setting a best time of 1m31.095s at half-distance.
  • Through information extraction, the following basic facts can be pulled out of the free-flowing text and organized in a structured, machine-readable form:
    • Person: Marc Marquez
    • Location: Valencia
    • Event: MotoGP
    • Related mentions: Maverick Vinales, Yamaha, Jorge Lorenzo

Adding Semantics to the Information Extraction Process

  • Information extraction helps for finding entities, classifying, and storing them in a database.
  • Semantically enhanced information extraction couples those entities with their semantic descriptions and connections from a knowledge graph. The latter is also known as semantic annotation
  • Semantic annotation is applicable for any sort of text – web pages, regular (non-web) documents, text fields in databases, etc.
  • Semantic information extraction enables new types of applications such as:
    • Highlighting, indexing, and retrieval
    • Categorization and generation of more advanced metadata
    • Smooth traversal between unstructured text and available relevant knowledge.

NER (Named Entity Recognition)

  • The named entity recognition (NER) is one of the most popular data preprocessing tasks.
  • NER is a form of NLP.
  • At its core, NLP is just a two-step process:
    • Detecting the entities from the text
    • Classifying them into different categories
  • Some of the categories that are the most important architecture in NER include:
    • Person
    • Organization
    • Place/location
  • Other common tasks include classifying the following:
    • Date/time
    • Expression
    • Numeral measurement (money, percent, weight, etc.)
    • E-mail address

Ambiguity in NER

  • For a person, the category definition is intuitively quite clear, but for computers, there is some ambiguity in classification.
  • Examples:
    • England (Organization) won the 2019 World Cup vs. The 2019 World Cup happened in England (Location).
    • Washington (Location) is the capital of the U.S. vs. The first president of the U.S. was Washington (Person).

NE Types

  • ENAMEX
  • NUMEX
  • TIMEX
ENAMEX
  • PERSON
  • LOCATION
  • ORGANIZATION
  • FACILITIES
  • LOCOMOTIVES
  • ARTIFACTS
  • ENTERTAINMENT
  • CUISINES
  • ORGANISMS
  • PLANTS
  • DISEASE
NUMEX
  • DISTANCE
  • QUANTITY
  • MONEY
  • COUNT
TIMEX
  • TIME
  • DAY
  • MONTH
  • PERIOD
  • DATE
  • YEAR
  • SPECIAL DAY

Techniques of NER

  • RULE-BASED
    • Dictionaries
    • Regular Expressions
    • Context-Free Grammars
  • SUPERVISED
    • Hidden Markov Model
    • Maximum Entropy-Based Model
    • Support Vector Machine Model
    • Conditional Random Field Model
  • SEMI-SUPERVISED
    • Bootstrapping-Based
  • UNSUPERVISED
    • KnowItAll
Dictionary (Gazetteers) Look-up Approach
  • Uses Dictionaries for identifying NERs (Gazetteers).
  • Gazetteer contains NEs from all domains.
  • Advantages:
    • Very simple approach.
    • Gives very high precision.
  • Disadvantages:
    • Preparation of exhaustive dictionary is a tedious and expensive process.
    • The dictionary should cover the different spellings of the same place.
Rule-Based Approach
  • Rule-Based System:
    • Needs more rules to tag all kinds of NE.
  • Advantages:
    • Rich and expressive rules.
    • Good results.
  • Disadvantages:
    • Requires huge experience and grammatical knowledge.
    • Experts to craft rules are expensive.
    • Highly domain-specific (not portable to a new domain).
  • General difficulties:
    • Capitalization is useless for the first word.
    • 's' is not part of the name (e.g., Italy's).
    • Date is "last Thursday," not "Thursday."
    • Milan is a location, not an organization.
    • Arthur Andersen is an organization, not a person.
Statistical Machine Learning Approach
  • Treat each word in a sentence as an observation and assign a label for each.
  • The class labels have to clearly indicate both the boundaries and the types of named entities within the sequence.
  • The Hidden Markov Model (HMM):
    • The Hidden Markov Model is a probabilistic model, usually applied in simple sequential machine learning problems. As the word ‘hidden’ in HMM suggest, it refers to the hidden states or targets to be inferred or predicted.
    • Can be applied in Parts-of-Speech (POS) tagging in NLP
    • Where the hidden states are the POS tags of the words, and Named Entity Recognition, where hidden states can be the boundary labels or named entity labels.
  • With this notation, for each entity type T, two labels are created, namely, B-T and I-T.
    • A token labeled with B-T is the beginning of a named entity of type T while a token labeled with I-T is inside (but not the beginning of) a named entity of type T.
    • In addition, there is a label O for tokens outside of any named entity.
  • HMM:
    • In a probabilistic framework, the best label sequence y=(y<em>1,y</em>2,,y<em>n)y = (y<em>1, y</em>2, …, y<em>n) for an observation sequence x=(x</em>1,x<em>2,,x</em>n)x = (x</em>1, x<em>2, …, x</em>n) is the one that maximizes the conditional probability p(yx)p(y|x), or equivalently, the one that maximizes the joint probability p(x,y)p(x, y).
    • An example is the Nymble system developed by BBN, one of the earliest statistical learning-based NER systems.
    • Each y<em>iy<em>i is generated conditioning on the previous label y</em>i1y</em>{i-1} and the previous word xi1x_{i-1}.
    • If x<em>ix<em>i is the first word of a named entity, it is generated conditioning on the current and the previous labels, i.e., y</em>iy</em>i and yi1y_{i-1}.
  • If x<em>ix<em>i is inside a named entity, it is generated conditioning on the previous observation x</em>i1x</em>{i-1}.
    • P(Y<em>i=c</em>1Y<em>i1=c</em>2,x<em>i1=w)=c(c</em>1,c<em>2,w)c(c</em>2,w)P(Y<em>i = c</em>1 | Y<em>{i-1} = c</em>2, x<em>{i-1} = w) = \frac{c(c</em>1, c<em>2, w)}{c(c</em>2, w)}
    • where C<em>1C<em>1 and C</em>2C</em>2 are two class labels and ww is a word. p(y<em>i=C</em>1Y<em>i1=C</em>2,x<em>i1=w)p(y<em>i = C</em>1 | Y<em>{i-1} = C</em>2, x<em>{i-1} = w) is the probability of observing the class label c</em>1c</em>1 given that the previous class label is c<em>2c<em>2 and the previous word is ww. c(C</em>1,C<em>2,W)c(C</em>1, C<em>2, W) is the number of times we observe class label c</em>1c</em>1 when the previous class label is c<em>2c<em>2 and the previous word is ww, and c(c</em>2,w)c(c</em>2, w) is the number of times we observe the previous class label to be c2c_2 and the previous word to be ww regardless of the current class label.
    • Nymble assumes the following generative process:
      • Each y<em>iy<em>i is generated conditioning on the previous label y</em>i1y</em>{i-1} and the previous word xi1x_{i-1}.
      • If x<em>ix<em>i is the first word of a named entity, it is generated conditioning on the current and the previous labels, i.e., y</em>iy</em>i and yi1y_{i-1}.
  • Components of HMM:
    • States: Hidden variables (e.g., POS tags).
    • Observations: Visible data (e.g., words).
    • Transition Probabilities (A): P(state<em>istate</em>j)P(state<em>i → state</em>j)
    • Emission Probabilities (B): P(observationstate)P(observation | state)
    • Initial Probabilities (π\pi): P(starting state)P(starting \ state)
  • POS Tagging with HMM:
    • Task: Assign POS tags to each word in a sentence.
    • Hidden states: POS tags (Noun, Verb, etc.).
    • Observations: Words.
    • Trained on a tagged corpus for probabilities.
    • Example: Sentence: "Fish swim"
      • Step 1: Tags
        • Fish: Noun or Verb
        • Swim: Noun or Verb
      • Step 2: Probabilities
        • π\pi: P(Noun) = 0.6, P(Verb) = 0.4
        • A: P(Noun→Verb) = 0.3, P(Verb→Verb) = 0.6
        • B: P(Fish|Noun) = 0.5, P(Fish|Verb) = 0.5
        • P(Swim|Verb) = 0.6, P(Swim|Noun) = 0.1
      • Step 3: Viterbi Algorithm
        • Best sequence: Fish→Noun, Swim→Verb
  • NER with HMM:
    • Named Entity Recognition: Find entities like Person, Location, Org
    • Hidden states: B-PER, I-PER, O, etc.
    • Observations: Words
    • Useful for structured info extraction
Maximum-Entropy Markov Model (MEMM) or Conditional Markov Model (CMM)
  • It is a graphical model for sequence labeling that combines features of hidden Markov models (HMM) and maximum entropy (MaxEnt) models.
  • An MEMM is a discriminative model that extends a standard maximum entropy classifier by assuming that the unknown values to be learnt are connected in a Markov chain rather than being conditionally independent of each other.
HMM vs. MEMM vs. CRF
  • In CRFs, the label of the current observation can depend not only on previous labels but also on future labels.
  • CRFs are undirected graphical models while both HMMs and MEMMs are directed graphical models.

Relation Extraction

  • Relation Extraction (RE) is the task of extracting semantic relationships from text, which usually occur between two or more entities.
  • These relations can be of different types.
    • E.g., “Paris is in France” states a “is in” relationship from Paris to France. This can be denoted using triples, (Paris, is in, France).
  • There are five different methods of doing Relation Extraction:
    1. Rule-based RE
    2. Weakly Supervised RE
    3. Supervised RE
    4. Distantly Supervised RE
    5. Unsupervised RE

Rule-Based RE

  • Many instances of relations can be identified through hand-crafted patterns, looking for triples (X, α, Y) where X are entities and α are words in between.
    • For the “Paris is in France” example, α=”is in”. This could be extracted with a regular expression.
    • Named entities in sentence
    • Part-of-speech tags in sentence
  • Only looking at keyword matches will also retrieve many false positives.
    • We can mitigate this by filtering on named entities, only retrieving (CITY, is in, COUNTRY).
    • We can also take into account the part-of-speech (POS) tags to remove additional false positives.
  • These are examples of doing word sequence patterns because the rule specifies a pattern following the order of the text.
    • Unfortunately, these types of rules fall apart for longer-range patterns and sequences with greater variety.
      • E.g., “Fred and Mary got married” cannot successfully be handled by a word sequence pattern.
Dependency Paths in Sentence
  • Instead, we can make use of dependency paths in the sentences, knowing which word is having a grammatical dependency on what other word. This can greatly increase the coverage of the rule without extra effort.
  • We can also transform the sentences before applying the rule.
    • E.g., “The cake was baked by Harry” or “The cake which Harry baked” can be transformed into “Harry baked the cake”. Then we are changing the order to work with our “linear rule,” while also removing redundant modifying word in between.
  • Pros:
    • Humans can create patterns which tend to have high precision
    • Can be tailored to specific domains
  • Cons:
    • Human patterns are still often low-recall (too much variety in languages)
    • A lot of manual work to create all possible rules
    • Have to create rules for every relation type

Weakly Supervised RE

  • The idea here is to start out with a set of hand-crafted rules and automatically find new ones from the unlabeled text data, through an iterative process (bootstrapping).
  • Alternatively, one can start out with a set of seed tuples, describing entities with a specific relation.
    • E.g., seed={(ORG:IBM, LOC:Armonk), (ORG:Microsoft, LOC:Redmond)} states entities having the relation “based in.”
Snowball Algorithm Example
  1. Start with a set of seed tuples (or extract a seed set from the unlabeled text with a few hand-crafted rules).
  2. Extract occurrences from the unlabeled text that matches the tuples and tag them with a NER (named entity recognizer).
  3. Create patterns for these occurrences, e.g., “ORG is based in LOC”.
  4. Generate new tuples from the text, e.g., (ORG:Intel, LOC: Santa Clara), and add to the seed set.
  5. Go to step 2 or terminate and use the patterns that were created for further extraction.
  • Pros:
    • More relations can be discovered than for Rule-based RE (higher recall)
    • Less human effort required (does only require a high-quality seed)
  • Cons:
    • The set of patterns becomes more error-prone with each iteration
    • Must be careful when generating new patterns through occurrences of tuples, e.g., “IBM shut down an office in Hursley” could easily be caught by mistake when generating patterns for the “based in” relation
    • New relation types require new seeds (which have to be manually provided)

Supervised RE

  • A common way to do Supervised Relation Extraction is to train a stacked binary classifier (or a regular binary classifier) to determine if there is a specific relation between two entities.
  • These classifiers take features about the text as input, thus requiring the text to be annotated by other NLP modules first.
  • Typical features are: context words, part-of-speech tags, dependency path between entities, NER tags, tokens, proximity distance between words, etc.
  • Steps:
    1. Manually label the text data according to if a sentence is relevant or not for a specific relation type.
      • E.g., for the “CEO” relation:
        • “Apple CEO Steve Jobs said to Bill Gates.” is relevant
        • “Bob, Pie Enthusiast, said to Bill Gates.” is not relevant
    2. Manually label the relevant sentences as positive/negative if they are expressing the relation.
      • E.g., “Apple CEO Steve Jobs said to Bill Gates.”:
        • (Steve Jobs, CEO, Apple) is positive
        • (Bill Gates, CEO, Apple) is negative
    3. Learn a binary classifier to determine if the sentence is relevant for the relation type
    4. Learn a binary classifier on the relevant sentences to determine if the sentence expresses the relation or not
    5. Use the classifiers to detect relations in new text data. Some choose not to train a “relevance classifier” and instead let a single binary classifier determine both things in one go.
  • Pros:
    • High-quality supervision (ensuring that the relations that are extracted are relevant)
    • We have explicit negative examples
  • Cons:
    • Expensive to label examples
    • Expensive/difficult to add new relations (need to train a new classifier)
    • Does not generalize well to new domains
    • Is only feasible for a small set of relation types

Distantly Supervised RE

  • We can combine the idea of using seed data, as for Weakly Supervised RE, with training a classifier, as for Supervised RE.
  • However, instead of providing a set of seed tuples ourselves, we can take it from an existing Knowledge Base (KB), such as Wikipedia, DBpedia, Wikidata, Freebase, Yago.
  • Steps:
    1. For each relation type we are interested in the KB
    2. For each tuple of this relation in the KB
    3. Select sentences from our unlabeled text data that match these tuples (both words of the tuple are co-occurring in the sentence) and assume that these sentences are positive examples for this relation type
    4. Extract features from these sentences (e.g., POS, context words, etc.)
    5. Train a supervised classifier on this
      • Pros:
        • Less manual effort
        • Can scale to use large amount of labeled data and many relations
        • No iterations required (compared to Weakly Supervised RE)
      • Cons:
        • Noisy annotation of training corpus (sentences that have both words in the tuple may actually not describe the relation)
        • There are no explicit negative examples (this can be tackled by matching unrelated entities)
        • Is restricted to the Knowledge Base
        • May require careful tuning to the task

Unsupervised RE

  • Here we extract relations from text without having to label any training data, provide a set of seed tuples, or having to write rules to capture different types of relations in the text.
  • Instead, we rely on a set of very general constraints and heuristics. It could be argued if this is truly unsupervised since we are using “rules” which are at a more general level.
  • Also, for some cases, even leveraging small sets of labeled text data to design and tweak the systems.
  • Nevertheless, these systems tend to require less supervision in general. Open Information Extraction (Open IE) generally refers to this paradigm.
  • Constraints:
    1. There exists a dependency chain between el and e2 that is not longer than a certain length.
    2. This chain should contain some words of the relation r (usually the main verb)
    3. The path from el to e2 along the syntax tree doesn't cross the sentence-like Boundary (e.g., relative clauses). This means that this path can contain S (SINV, ROOT, etc.) constituents only at the common ancestor position.
    4. Entities do not consist solely of the pronoun.
    5. r should contain at least one VP tag.
    6. r and e2 should have at least one VP tag as a common ancestor.

Text Summarization

  • Text summarization is the process of generating a short, fluent, and most importantly accurate summary of a respectively longer text document.
  • The main idea behind automatic text summarization is to be able to find a short subset of the most essential information from the entire set and present it in a human-readable format.
  • As online textual data grows, automatic text summarization methods have the potential to be very helpful because more useful information can be read in a short time.

Types of Text Summarization

  • Based on Input Type:
    • Single-Document
    • Multi-Document
  • Based on Output Type:
    • Extractive
    • Abstractive
  • Based on the Purpose:
    • Generic
    • Domain-Specific
    • Query-Based
Based on Input Type:
  1. Single Document, where the input length is short. Many of the early summarization systems dealt with single-document summarization.
  2. Multi-Document, where the input can be arbitrarily long.
Based on the Purpose:
  1. Generic, where the model makes no assumptions about the domain or content of the text to be summarized and treats all inputs as homogeneous. The majority of the work that has been done revolves around generic summarization.
  2. Domain-specific, where the model uses domain-specific knowledge to form a more accurate summary. For example, summarizing research papers of a specific domain, biomedical documents, etc.
  3. Query-based, where the summary only contains information that answers natural language questions about the input text.
Based on Output Type:
  1. Extractive, where important sentences are selected from the input text to form a summary. Most summarization approaches today are extractive in nature.
  2. Abstractive, where the model forms its own phrases and sentences to offer a more coherent summary, like what a human would generate. This approach is definitely more appealing but much more difficult than extractive summarization.

How to do Text Summarization

  • Text cleaning
  • Sentence tokenization
  • Word tokenization
  • Word-frequency table
  • Summarization
Steps (example):
  1. Text cleaning:
    • ```python
    • !pip install -U spacy
    • !python -m spacy download encoreweb_sm
    • import spacy
    • from spacy.lang.en.stopwords import STOPWORDS
    • from string import punctuation
    • stopwords = list(STOP_WORDS)
    • nlp = spacy.load(‘encoreweb_sm’)
    • doc = nlp(text)
    • ```
  2. Word tokenization:
    • ```python
    • tokens = [token.text for token in doc]
    • print(tokens)
    • punctuation = punctuation + ‘\n’
    • punctuation
    • word_frequencies = {}
    • for word in doc:
    • if word.text.lower() not in stopwords:
    • if word.text.lower() not in punctuation:
    • if word.text not in word_frequencies.keys():
    • word_frequencies[word.text] = 1
    • else:
    • word_frequencies[word.text] += 1
    • print(word_frequencies)
    • ```
  3. Sentence tokenization:
    • ```python
    • maxfrequency = max(wordfrequencies.values())
    • max_frequency
    • for word in word_frequencies.keys():
    • wordfrequencies[word] = wordfrequencies[word]/max_frequency
    • print(word_frequencies)
    • sentence_tokens = [sent for sent in doc.sents]
    • print(sentence_tokens)
    • ```
  4. Word frequency table:
    • ```python
    • sentence_scores = {}
    • for sent in sentence_tokens:
    • for word in sent:
    • if word.text.lower() in word_frequencies.keys():
    • if sent not in sentence_scores.keys():
    • sentencescores[sent] = wordfrequencies[word.text.lower()]
    • else:
    • sentencescores[sent] += wordfrequencies[word.text.lower()]
    • sentence_scores
    • ```
  5. Summarization:
    • ```python
    • from heapq import nlargest
    • selectlength = int(len(sentencetokens)*0.3)
    • select_length
    • summary = nlargest(selectlength, sentencescores, key = sentence_scores.get)
    • summary
    • final_summary = [word.text for word in summary]
    • summary = ‘ ‘.join(final_summary)
    • ```
Example
  • Input:

    text = """Maria Sharapova has basically no friends as tennis players on the WTA Tour. The Russian player has no problems in openly speaking about it and in a recent interview she said: ‘I don’t really hide any feelings too much.I think everyone knows this is my job here. When I’m on the courts or when I’m on the court playing, I’m a competitor and I want to beat every single person whether they’re in the locker room or across the net.So I’m not the one to strike up a conversation about the weather and know that in the next few minutes I have to go and try to win a tennis match.I’m a pretty competitive girl. I say my hellos, but I’m not sending any players flowers as well. Uhm, I’m not really friendly or close to many players.I have not a lot of friends away from the courts. ’ When she said she is not really close to a lot of players, is that something strategic that she is doing? Is it different on the men’s tour than the women’s tour? ‘No, not at all.I think just because you’re in the same sport doesn’t mean that you have to be friends with everyone just because you’re categorized, you’re a tennis player, so you’re going to get along with tennis players.I think every person has different interests. I have friends that have completely different jobs and interests, and I’ve met them in very different parts of my life.I think everyone just thinks because we’re tennis players we should be the greatest of friends. But ultimately tennis is just a very small part of what we do.There are so many other things that we’re interested in, that we do. ’"""
    
  • Output (final summary):

    • ```text
    • summary I think just because you’re in the same sport doesn’t mean that you have to be friends with everyone just because you’re categorized, you’re a tennis player, so you’re going to get along with tennis players. Maria Sharapova has basically no friends as tennis players on the WTA Tour. I have friends that have completely different jobs and interests, and I’ve met them in very different parts of my life. I think everyone just thinks because we’re tennis players So I’m not the one to strike up a conversation about the weather and know that in the next few minutes I have to go and try to win a tennis match. When she said she is not really close to a lot of players, is that something strategic that she is doing?
    • ```

Topic Representation Approaches

  • Examples:
    • Headlines (from around the world)
    • Outlines (notes for students)
    • Minutes (of a meeting)
    • Previews (of movies)
    • Synopses (soap opera listings)
    • Digests (TV guide)
    • Biography (resumes, obituaries)
    • Abridgments (Shakespeare for children)
    • Reviews (of a book, CD, movie, etc.)
    • Bulletins (weather forecasts/stock market reports)
    • Sound bites (politicians on a current issue)
    • Histories (chronologies of salient events)
Topic Words
  • This common technique aims to identify words that describe the topic of the input document.
  • An advance of the initial Luhn’s idea was to use a log-likelihood ratio test to identify explanatory words known as the “topic signature.”
  • Generally speaking, there are two ways to compute the importance of a sentence:
    • As a function of the number of topic signatures it contains
    • As the proportion of the topic signatures in the sentence
  • While the first method gives higher scores to longer sentences with more words, the second one measures the density of the topic words.
Frequency-Driven Approaches
  • This approach uses the frequency of words as indicators of importance. The two most common techniques in this category are: word probability and TFIDF (Term Frequency Inverse Document Frequency).
    • The probability of a word w is determined as the number of occurrences of the word, f(w)f(w), divided by the number of all words in the input (which can be a single document or multiple documents).
    • Words with the highest probability are assumed to represent the topic of the document and are included in the summary.
    • TFIDF, a more sophisticated technique, assesses the importance of words and identifies very common words (that should be omitted from consideration) in the document(s) by giving low weights to words appearing in most documents.
    • TFIDF has given way to centroid-based approaches that rank sentences by computing their salience using a set of features.
    • After creating TFIDF vector representations of documents, the documents that describe the same topic are clustered together and centroids are computed—pseudo-documents that consist of the words whose TFIDF scores are higher than a certain threshold and form the cluster.
    • Afterwards, the centroids are used to identify sentences in each cluster that are central to the topic.
Latent Semantic Analysis
  • Latent semantic analysis (LSA) is an unsupervised method for extracting a representation of text semantics based on observed words.
  • The first step is to build a term-sentence matrix, where each row corresponds to a word from the input (n words) and each column corresponds to a sentence.
  • Each entry of the matrix is the weight of the word i in sentence j computed by the TFIDF technique.
  • Then singular value decomposition (SVD) is used on the matrix that transforms the initial matrix into three matrices:
    • A term-topic matrix having weights of words
    • A diagonal matrix where each row corresponds to the weight of a topic
    • A topic-sentence matrix.
  • If you multiply the diagonal matrix with weights with the topic-sentence matrix, the result will describe how much a sentence represents a topic, in other words, the weight of the topic i in sentence j.
Discourse-Based Method
  • A logical development of analyzing semantics is to perform discourse analysis, finding