TEXT MINING
is the practice of analyzing vast collections of textual materials to capture key concepts, trends and hidden relationships
Deals with unstructured data – conversations, declarations or even tweets or comments
Structured
Multimedia
Free Text
Enumerate Text Formats
Lexical Analysis (POS)
Semantic Analysis
Syntactic Analysis (Parsing)
Pragmatic Analysis (Speech act)
Enumerate Basic Concepts of NLP
Word-level ambiguity
Syntactic Ambiguity
Anaphora resolution
Presupposition
Enumerate DIFFICULTIES OF NLP
Presupposition
“He has quit smoking.” implies that he smoked before
Anaphora resolution
“John persuaded Bill to buy a TV for himself.”
(himself = John or Bill?)
Syntactic Ambiguity
“natural language processing”
“A man saw a boy with a telescope.”
Word-level ambiguity
“design” can be a noun or a verb (ambiguous POS)
“root” has multiple meanings (ambiguous sense)
Convert Accented Characters
Expand Contractions
Tokenization
Stemming
Lemmatization
Parts of Speech tagging
Stopwords removal
Data Pre-processing for Textual Data
Stopwords removal
Common words that occur in sentences but do not add much value to the meaning of the sentence
Act as bridges and ensure that sentences are grammatically correct
Words that are filtered out before processing natural language data
Example: the, is, in, for, to, etc.
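A minimal sketch of the filtering step, assuming NLTK and its English stopword list are available:

```python
import nltk
nltk.download("stopwords", quiet=True)  # one-time download of the word list
from nltk.corpus import stopwords

stop_words = set(stopwords.words("english"))
tokens = ["the", "cat", "is", "on", "the", "mat"]
print([t for t in tokens if t not in stop_words])  # ['cat', 'mat']
```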
Parts of Speech tagging
Assigning a part of speech to each word in a sentence
Tagging is performed at the token level
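A quick sketch using NLTK's default tagger; the tagger model is a one-time download, and the tags shown are illustrative:

```python
import nltk
# One-time model download; newer NLTK versions name this resource
# "averaged_perceptron_tagger_eng" instead.
nltk.download("averaged_perceptron_tagger", quiet=True)

tokens = "I saw a cat".split()   # tagging is performed at the token level
print(nltk.pos_tag(tokens))
# e.g. [('I', 'PRP'), ('saw', 'VBD'), ('a', 'DT'), ('cat', 'NN')]
```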
Lemmatization
Performs morphological analysis of the words
Example: Word: helps; Morphological info: third person, singular number, present tense; Lemma: help
Wordnet
spaCy
TextBlob
Pattern
Stanford CoreNLP
Examples of Lemmatization Tools
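A small sketch of the helps → help example using NLTK's WordNet lemmatizer (WordNet is one of the tools listed above):

```python
import nltk
nltk.download("wordnet", quiet=True)  # one-time corpus download
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
# pos="v" tells WordNet to analyze the word as a verb
print(lemmatizer.lemmatize("helps", pos="v"))  # help
```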
Stemming
A way to reduce a word to its root (stem)
Removes prefixes and suffixes
Less input dimensions
Make training data more dense
Reduce the size of the dictionary
Helps to normalize the word in the document
Word: cats, Suffix: s, Stem: cat
Word: caring, Suffix: ing, Stem: care/car
Porter stemmer
Krovetz stemmer
Examples of Stemming Algorithms
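A sketch of the card's examples with NLTK's Porter stemmer; output can differ between stemmers, hence the "care/car" note above:

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["cats", "caring"]:
    print(word, "->", stemmer.stem(word))
# cats -> cat
# caring -> care   (another stemmer may yield "car")
```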
Tokenization
Used to split a phrase, sentence, paragraph or an entire document into smaller units such as words or terms.
It helps to interpret the meaning of the text by analyzing the words present in the text
Count the number of words in a text
Example: “I saw a cat” → Tokens: “I”, “saw”, “a”, “cat”
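The card's example, sketched with NLTK's rule-based Treebank tokenizer (a plain str.split would also work for this simple sentence):

```python
from nltk.tokenize import TreebankWordTokenizer

tokenizer = TreebankWordTokenizer()   # rule-based, no model download needed
tokens = tokenizer.tokenize("I saw a cat")
print(tokens)                         # ['I', 'saw', 'a', 'cat']
print(len(tokens))                    # word count: 4
```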
Expand Contractions
don’t – do not, can’t – cannot
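A minimal dictionary-based sketch; the mapping here is a tiny hypothetical sample, and real lookup tables (or libraries built for this) cover far more forms:

```python
import re

# Hypothetical minimal mapping; real lookup tables cover many more forms
CONTRACTIONS = {"don't": "do not", "can't": "cannot", "won't": "will not"}

def expand_contractions(text: str) -> str:
    pattern = re.compile("|".join(re.escape(c) for c in CONTRACTIONS),
                         re.IGNORECASE)
    return pattern.sub(lambda m: CONTRACTIONS[m.group(0).lower()], text)

print(expand_contractions("I don't know, I can't say"))
# I do not know, I cannot say
```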
Convert Accented Characters
Latté – latte, Café – cafe
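A standard-library sketch using Unicode decomposition; no third-party package is needed:

```python
import unicodedata

def strip_accents(text: str) -> str:
    # Decompose characters (é -> e + combining accent), then drop the accents
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

print(strip_accents("Latté at the Café"))  # Latte at the Cafe
```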
Machine-based
Rule-based
Hybrid Approach
TEXT CLASSIFICATION Approaches
Machine-based
classification is learned based on pre-labeled datasets.
Rule-based
texts are grouped based on handcrafted linguistic rules.
Users define a list of words for each group.
Examples:
Robredo, Lacson, Marcos - categorized into politics
Christianity, Islam, Atheism - categorized into religion
Hybrid Approach
combination of the rule-based and machine-based approaches.
Uses the list of words to label the dataset. The classification is iteratively improved by updating the list of words.
BAG OF WORDS (BOW) MODEL
a vector represents the frequency of words from a predefined dictionary (word list).
keeps a count of total occurrences of most frequently used words
one of the methods used to transform tokens into a set of features
ignores grammar and order of words
Clean Text
Tokenize
Build Vocab
Generate Vectors
Process of BAG OF WORDS (BOW) MODEL
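A toy walk-through of the four steps above in plain Python:

```python
import re

docs = ["The dog is lazy.", "The fox is brown."]

# 1. Clean text and 2. tokenize: lowercase, keep alphabetic runs only
tokenized = [re.findall(r"[a-z]+", d.lower()) for d in docs]

# 3. Build vocab: every distinct word, in a stable order
vocab = sorted(set(w for doc in tokenized for w in doc))

# 4. Generate vectors: count of each vocab word per document
vectors = [[doc.count(w) for w in vocab] for doc in tokenized]
print(vocab)
print(vectors)
```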
Vocabulary
list of words in a corpus
Feature vector
is an n-dimensional vector of numerical features that represent some object.
N-gram
is an N-token sequence of words
2-gram (bigram)
a two-word sequence of words (“really good”)
3-gram (trigram)
a three-word sequence of words (“not at all”)
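A small sketch of n-gram extraction with a sliding window (the sample sentence is made up):

```python
def ngrams(tokens, n):
    # Slide a window of size n across the token list
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "it was not at all really good".split()
print(ngrams(tokens, 2))   # bigrams,  e.g. ('really', 'good')
print(ngrams(tokens, 3))   # trigrams, e.g. ('not', 'at', 'all')
```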
Binary term occurrence
considers only whether a term occurs in the document (1 if present, 0 if not)
Ex:
The quick brown fox jumped over the lazy dog.
The dog is lazy.
The fox is brown.
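A sketch of binary term occurrence over the three example documents, using scikit-learn's CountVectorizer with binary=True (assumes scikit-learn is installed):

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "The quick brown fox jumped over the lazy dog.",
    "The dog is lazy.",
    "The fox is brown.",
]

# binary=True stores 1 if the term occurs in the document, 0 otherwise
vectorizer = CountVectorizer(binary=True)
X = vectorizer.fit_transform(docs)
print(vectorizer.get_feature_names_out())  # the vocabulary
print(X.toarray())                         # one 0/1 row per document
```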
TERM FREQUENCY AND INVERSE DOCUMENT FREQUENCY
considers the weight of words across all documents in a corpus
The score is used in information retrieval (IR) and summarization
Used to reflect how relevant a term is in a given document
Helps to establish how important a particular word is in the context of the document corpus. TF-IDF takes into account the number of times the word appears in the document, offset by the number of documents in the corpus that contain the word
Term frequency (TF)
represents the number of occurrences of a term in a document
is the frequency of the term divided by the total number of terms in the document
Inverse Document Frequency
is obtained by dividing the total number of documents by the number of documents containing the term and then taking the logarithm of that quotient
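A manual sketch that follows the two definitions above, on toy pre-tokenized documents; library implementations such as scikit-learn's TfidfVectorizer add smoothing on top of this basic formula:

```python
import math

docs = [                       # toy pre-tokenized documents
    ["the", "quick", "brown", "fox"],
    ["the", "lazy", "dog"],
    ["the", "fox", "is", "brown"],
]

def tf(term, doc):
    # occurrences of the term divided by total terms in the document
    return doc.count(term) / len(doc)

def idf(term, docs):
    # log of (total documents / documents containing the term)
    containing = sum(1 for d in docs if term in d)
    return math.log(len(docs) / containing)

print(tf("fox", docs[0]) * idf("fox", docs))  # TF-IDF of "fox" in doc 1
```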
Count Vectorizer:
gives the frequency count of each word with respect to its index in the vocabulary
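A quick sketch, assuming the card refers to scikit-learn's CountVectorizer:

```python
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(["the dog saw the fox"])
print(vectorizer.vocabulary_)  # word -> column index in the vocabulary
print(X.toarray())             # counts per index; "the" appears twice
```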
Naive Bayes
Support Vector Machine
Enumerate some MACHINE-BASED APPROACHES
Naive Bayes
used for text classification and text analysis
based on Bayes’ theorem
describes the relationship of conditional probabilities of statistical quantities
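A minimal text-classification sketch with scikit-learn's MultinomialNB; the four training texts and labels are made up for illustration:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Tiny made-up training set
texts = ["the match was great", "terrible performance",
         "what a great win", "awful game"]
labels = ["pos", "neg", "pos", "neg"]

# BOW counts feed the multinomial Naive Bayes classifier
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(texts, labels)
print(model.predict(["great game"]))  # likely ['pos']
```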
Support Vector Machine
supervised learning algorithm for classification
The goal is to create the best line or decision boundary that can segregate n-dimensional space into classes
Hyperplane
the best decision boundary among the many possible lines that could separate the classes
Support Vectors
data points that are the closest to the hyperplane
Margin
the distance between the support vectors and the hyperplane
Linear SVM
Non-linear SVM
Two types of SVM
Linear SVM
Type of SVM used for linearly separable data (data that can be divided into two classes with a single straight line)
Non-linear SVM
Type of SVM used for non-linearly separated data
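A small scikit-learn sketch; the four 2-D points are made up and linearly separable, so a linear kernel suffices:

```python
from sklearn import svm

# Four made-up 2-D points, linearly separable into two classes
X = [[0, 0], [0, 1], [2, 2], [3, 2]]
y = [0, 0, 1, 1]

clf = svm.SVC(kernel="linear")     # kernel="rbf" handles non-linear data
clf.fit(X, y)
print(clf.predict([[2.5, 2.0]]))   # [1]
print(clf.support_vectors_)        # the points closest to the hyperplane
```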
Random Forest Algorithm
supervised learning algorithm for classification
Based on ensemble learning, which is a process of combining multiple classifiers to solve a complex problem and to improve the performance of the model.
ensemble learning
Random Forest is based on _____ _____, which is a process of combining multiple classifiers to solve a complex problem and to improve the performance of the model.
higher
The greater number of trees in the forest leads to _____ accuracy and prevents the problem of overfitting.
Random Forest Algorithm
There should be some actual values in the feature variable of the dataset so that the classifier can predict accurate results rather than a guessed result.
The predictions from each tree must have very low correlations.
WHY USE IT?
It takes less training time as compared to other algorithms.
It predicts output with high accuracy; even for large datasets it runs efficiently.
It can also maintain accuracy when a large proportion of data is missing.
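A short scikit-learn sketch on the bundled Iris dataset:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# n_estimators = number of trees; more trees generally raises accuracy
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))   # accuracy on held-out data
```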
Lexicon-based approach
Another name for RULE-BASED APPROACH
RULE-BASED/LEXICON-BASED APPROACH
uses a lexicon for sentiment analysis
words in texts are labeled as positive or negative (and sometimes as neutral) with the help of a so-called valence dictionary.
Sentiment Analysis
the classification of texts according to the emotion that the text appears to convey.
typically classifies texts as positive, negative or neutral
some applications: Market analysis, Social media monitoring, Customer feedback analysis, Market research
Lexicon
the vocabulary of a group of people, a language or a field
Dictionary-based approach
Corpus-based approach
Types of Rule-based/Lexicon-based approach
Dictionary-based approach
The dictionary is created by taking a few seed words initially.
Then an online dictionary, thesaurus or WordNet can be used to expand this _____ by incorporating synonyms and antonyms of those words.
The dictionary is expanded until no new words can be added.
It can be refined by manual inspection.
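A sketch of one expansion pass using NLTK's WordNet interface, starting from a hypothetical one-word seed list:

```python
import nltk
nltk.download("wordnet", quiet=True)   # one-time corpus download
from nltk.corpus import wordnet

seed = {"good"}                        # hypothetical one-word seed list
expanded = set(seed)
for word in seed:
    for synset in wordnet.synsets(word):
        for lemma in synset.lemmas():
            expanded.add(lemma.name())             # synonyms join the list
            for antonym in lemma.antonyms():
                print("antonym:", antonym.name())  # would seed the opposite list
print(sorted(expanded))
```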
Corpus-based approach
finds the sentiment orientation of context-specific words
Statistical Approach
Semantic Approach
Two types of Corpus-based approach
Statistical Approach
Words that occur more frequently in positive text are considered to have positive polarity.
If they recur more often in negative text, they have negative polarity.
If the frequency is equal in positive and negative text, the word has neutral polarity.
Semantic approach
This approach assigns sentiment values to words and to words that are semantically close to them; semantic closeness can be found through the synonyms and antonyms of a word.
(Number of positive words - Number of negative words) / Total number of words
Formula for Sentiment Score
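A toy computation of the formula; the valence dictionary here is a tiny hypothetical sample:

```python
positive = {"good", "great", "love"}   # hypothetical valence dictionary
negative = {"bad", "awful", "hate"}

tokens = "the food was good but the service was awful and slow".split()
n_pos = sum(t in positive for t in tokens)
n_neg = sum(t in negative for t in tokens)

score = (n_pos - n_neg) / len(tokens)  # (positives - negatives) / total words
print(score)  # 0.0: one positive and one negative word cancel out
```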