Natural Language Processing
Interaction between computers and human language
Syntax
Grammatical structure of a sentence
Semantics
Meaning of a sentence
What are the potential challenges of processing and interpreting text?
A single sentence can have different semantic meanings depending on the context
Applications of NLP
Sentiment analysis, topic modeling, question answering, named entity resolution, text summarization
Sentiment Analysis
Classifying the emotional intent of text. Gives the probability that the sentiment is positive, negative, or neutral.
1 - positive
0 - negative
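A minimal sentiment-scoring sketch, assuming NLTK's VADER analyzer (the vader_lexicon resource must be downloaded first); any classifier that outputs positive/negative/neutral probabilities would play the same role:

```python
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

# Fetch the VADER lexicon (no-op if already present).
nltk.download("vader_lexicon", quiet=True)

sia = SentimentIntensityAnalyzer()
scores = sia.polarity_scores("I really enjoyed this movie!")
print(scores)  # e.g. {'neg': 0.0, 'neu': ..., 'pos': ..., 'compound': ...}
```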
Named Entity Recognition (NER)
Identifies and categorizes named entities within text into predefined categories (persons, organizations, locations, etc.)
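A short NER sketch, assuming spaCy with its small English model installed (python -m spacy download en_core_web_sm):

```python
import spacy

# Load the small English pipeline (assumed to be installed).
nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple opened a new office in Berlin in 2021.")

# Each entity carries its text span and a predicted category label.
for ent in doc.ents:
    print(ent.text, ent.label_)  # e.g. Apple ORG, Berlin GPE, 2021 DATE
```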
Sentence Segmentation
Divides the paragraph into different sentences for better understanding
What is the NLP Pipeline?
Sentence segmentation, word tokenization, stemming, lemmatization, stop word analysis
Corpus
A collection of documents
Text Cleaning
Remove unwanted characters, symbols, and noise from the raw text to make it cleaner and more uniform. Removes punctuation, special characters, and whitespace.
Tokenization
Split the text into smaller, manageable units like words or subwords, called tokens
Stop-word removal
Handle common words like “the” or “and” by removing them to focus on meaningful content
Typo Correction
Replace a word with the word in our dictionary with the nearest edit distance
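A toy sketch of this idea; the dictionary and misspelled word are made up for illustration:

```python
# Pick the dictionary word with the smallest Levenshtein (edit) distance
# to the misspelled word.
def edit_distance(a: str, b: str) -> int:
    # Classic dynamic-programming Levenshtein distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

dictionary = ["language", "natural", "processing"]   # hypothetical dictionary
word = "procesing"
print(min(dictionary, key=lambda w: edit_distance(word, w)))  # -> processing
```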
Stemming
Reducing words to their stem/base form by removing suffixes. Crude chopping.
Lemmatization
Uses context to find the correct dictionary form. Best for tasks where meaning and context are important.
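A sketch of these preprocessing steps with NLTK (assumes the punkt, stopwords, and wordnet resources have been downloaded via nltk.download):

```python
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

text = "The cats were running quickly. They chased the mice."

sentences = nltk.sent_tokenize(text)                            # sentence segmentation
tokens = [t for s in sentences for t in nltk.word_tokenize(s)]  # tokenization

tokens = [t.lower() for t in tokens if t.isalpha()]             # text cleaning
stop_words = set(stopwords.words("english"))
tokens = [t for t in tokens if t not in stop_words]             # stop-word removal

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()
print([stemmer.stem(t) for t in tokens])                    # crude chopping of suffixes
print([lemmatizer.lemmatize(t, pos="v") for t in tokens])   # dictionary forms, e.g. running -> run
```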
Feature Extraction
Converting texts into numerical vectors for machine learning tasks
Bag of Words
Representation of text that describes the occurrence of words within a document. Order doesn’t matter, just occurrences.
What are the pros of Bag of Words?
Simple, efficient, applicable to any language, and captures word importance.
What are the cons of Bag of Words?
Loss of word order, limited context, grows large with a big vocabulary size
N-Grams
Variation of Bag of Words that captures sequences of N adjacent words.
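A bag-of-words / n-gram sketch using scikit-learn's CountVectorizer (an assumption; any counting scheme works the same way):

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the cat sat on the mat", "the dog sat on the log"]

# ngram_range=(1, 2) keeps single words plus adjacent word pairs (bigrams).
vectorizer = CountVectorizer(ngram_range=(1, 2))
X = vectorizer.fit_transform(docs)           # sparse document-term matrix

print(vectorizer.get_feature_names_out())    # the unigram/bigram vocabulary
print(X.toarray())                           # occurrence counts per document
```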
Term Frequency-Inverse Document Frequency (TF-IDF)
A way of measuring how relevant a word is to a document in a collection of documents
Term Frequency (TF)
How many times a term appears in a given document
Document Frequency (DF)
Number of documents in which the word is present
How do you compute TF-IDF?
w_{x,y} = tf_{x,y} * log(N / df_x), where tf_{x,y} is the frequency of term x in document y, df_x is the number of documents containing x, and N is the total number of documents.
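A hypothetical worked example: if term x appears 3 times in document y (tf_{x,y} = 3) and occurs in 10 of N = 1,000 documents (df_x = 10), then w_{x,y} = 3 * log(1000/10) = 3 * 2 = 6, using log base 10.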
How is cosine similarity used in NLP?
Preprocessing, vectorization, similarity calculation, and ranking.
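A sketch of that workflow with scikit-learn, ranking a small made-up document collection against a query by cosine similarity over TF-IDF vectors:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "natural language processing with python",
    "deep learning for computer vision",
    "statistical language models and text processing",
]
query = ["language processing"]

vectorizer = TfidfVectorizer(stop_words="english")   # preprocessing + vectorization
doc_vectors = vectorizer.fit_transform(docs)
query_vector = vectorizer.transform(query)

scores = cosine_similarity(query_vector, doc_vectors)[0]   # similarity calculation
ranking = scores.argsort()[::-1]                           # ranking, most similar first
print([(docs[i], round(float(scores[i]), 3)) for i in ranking])
```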
Which representation should be used for sparse graphs?
Adjacency Lists or Adjacency Dictionaries
Which representation should be used for dense graphs?
Adjacency Matrix
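A small sketch of both representations for the same 4-node graph (plain Python, no graph library assumed):

```python
# Adjacency list/dictionary: good for sparse graphs, stores only existing edges.
adj_list = {
    0: [1, 2],
    1: [0],
    2: [0, 3],
    3: [2],
}

# Adjacency matrix: good for dense graphs, O(1) edge lookup but O(V^2) space.
n = 4
adj_matrix = [[0] * n for _ in range(n)]
for u, neighbors in adj_list.items():
    for v in neighbors:
        adj_matrix[u][v] = 1

print(adj_matrix[0][2])   # 1 -> edge (0, 2) exists
print(2 in adj_list[0])   # True -> same check via the adjacency list
```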
Graph Featurization
Transforms graph structural information into numerical features that can be used for ML tasks
Graph Level Features
Features that describe the entire graph as a whole and capture global structural patterns (graph diameter, average path length, modularity)
Edge Level Features
Features that describe the relationships or connections between two nodes (edge betweenness centrality, common neighbors)
Node Level Features
Features that describe individual nodes in a graph (degree, centrality measure)
Centrality
Measures how “central” or important a node is within a graph
Centrality Analysis
Discover the most important node(s) in one network
Degree Centrality
Importance of a node based on the degree of that node.
Closeness Centrality
Importance of a node based on how close it is to all the other nodes in the graph. Computed from the shortest-path distances from the node to every other node (typically the reciprocal of their sum or average).
Steps to calculate closeness centrality
Find shortest distance
Sum shortest distances
Apply formula
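A sketch of those three steps on a small path graph, assuming NetworkX and the (n - 1) / sum-of-distances formula:

```python
import networkx as nx

G = nx.path_graph(4)                          # 0 - 1 - 2 - 3

# Manual calculation for node 1
dist = nx.shortest_path_length(G, source=1)   # step 1: shortest distances
total = sum(dist.values())                    # step 2: sum them (1 + 0 + 1 + 2 = 4)
closeness = (len(G) - 1) / total              # step 3: apply formula -> 0.75

print(closeness)
print(nx.closeness_centrality(G)[1])          # matches the built-in result: 0.75
```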
Betweenness Centrality
Identifies nodes that act as bridges along the shortest paths between pairs of nodes in a network
Vertex Betweenness
Measures the importance of nodes in a graph. Indicates how many shortest paths between other nodes pass through a particular node.
Edge Betweenness
Measures the importance of edges in a network, rather than nodes. Calculates how many shortest paths between pairs of nodes pass through a particular edge.
Steps to calculate betweenness centrality
Identify shortest paths
Count paths through node
Calculate betweenness
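A sketch with NetworkX on a small "two triangles plus a bridge" graph, showing both vertex and edge betweenness:

```python
import networkx as nx

# Two triangles joined by the single edge (2, 3); nodes 2 and 3 act as bridges.
G = nx.Graph([(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)])

print(nx.betweenness_centrality(G))        # vertex betweenness: highest at nodes 2 and 3
print(nx.edge_betweenness_centrality(G))   # edge betweenness: highest for edge (2, 3)
```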
Girvan-Newman Algorithm
Community-detection algorithm that relies on the iterative removal of the edges with the highest number of shortest paths between nodes passing through them (highest edge betweenness). By removing edges from the graph one by one, the network breaks down into smaller pieces, the so-called communities.
Modularity
Measure of the strength of division of a network into modules or communities. The difference between the number of edges within modules and the expected number of edges if the edges were distributed randomly. Ranges from -1 to 1.
1 - Indicates strong division
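A community-detection sketch using NetworkX's Girvan-Newman implementation, scored with modularity (the example graph is made up):

```python
import networkx as nx
from networkx.algorithms.community import girvan_newman, modularity

# Two dense clusters connected by a single bridge edge.
G = nx.Graph([(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)])

communities = next(girvan_newman(G))   # first split: removes the bridge edge
print(communities)                     # ({0, 1, 2}, {3, 4, 5})
print(modularity(G, communities))      # closer to 1 means a stronger division
```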
Recommender System
Algorithms that recommend products a user is likely to consume, based on the user's preferences, behavior, or past interactions.
Content-based Recommendation
Predicts what a user will like based on their past likes and item features. Requires information on the content and the user profile.
Process of Content-Based Recommendation
Featurize Items
Calculate Similarity (cosine similarity)
Learn User Preferences
Recommend
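A minimal content-based sketch of that process with hypothetical genre features, assuming NumPy and scikit-learn:

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Featurize items: rows = items, columns = hypothetical genre features
# [action, comedy, drama].
items = np.array([
    [1, 0, 0],   # item 0: action
    [1, 0, 1],   # item 1: action/drama
    [0, 1, 0],   # item 2: comedy
    [0, 0, 1],   # item 3: drama
])
liked = [0, 1]                                  # items this user already liked

# Learn user preferences: average the feature vectors of liked items.
user_profile = items[liked].mean(axis=0)

# Calculate similarity between the profile and every item, then recommend
# the most similar item the user has not seen yet.
scores = cosine_similarity([user_profile], items)[0]
unseen = [i for i in range(len(items)) if i not in liked]
print(max(unseen, key=lambda i: scores[i]))     # -> 3 (the drama item)
```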
Pros of Content Based Systems
Works independently of other users, personalized, supports new/unpopular items
Cons of Content Based Systems
Feature selection is difficult, cold start is difficult, limits diversity, maintenance overhead (retraining as user tastes change)
Collaborative Filtering
Recommends products to a user based on the preferences of other users with similar tastes
User-based nearest-neighbor collaborative filtering
Recommendations based on the preferences and behaviors of similar users
Utility or User-Item Matrix
Matrix that captures the interactions between N users and M items; its dimensions are N × M.
Jaccard Similarity
J(A, B) = |A ∩ B| / |A ∪ B|
Ignores rating values
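A tiny sketch of Jaccard similarity over two users' sets of rated items (item names are made up):

```python
def jaccard(a: set, b: set) -> float:
    # |A ∩ B| / |A ∪ B| -- only which items were rated matters, not the ratings.
    return len(a & b) / len(a | b)

user_a = {"item1", "item2", "item3"}
user_b = {"item2", "item3", "item4"}
print(jaccard(user_a, user_b))   # 2 / 4 = 0.5
```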
What is the issue with cosine similarity in collaborative based systems?
Treats missing ratings as negative
Centered Cosine Similarity
Calculate the mean of each row (user), normalize the ratings by subtracting the row means, then apply the cosine similarity formula.
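A centered-cosine sketch over a tiny user-item matrix, assuming NumPy and scikit-learn (missing ratings become 0 after centering):

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# User-item ratings; np.nan marks items a user has not rated.
R = np.array([
    [4.0, np.nan, 5.0, 1.0],
    [5.0, 5.0,    4.0, np.nan],
])

row_means = np.nanmean(R, axis=1, keepdims=True)       # each user's mean rating
centered = np.where(np.isnan(R), 0.0, R - row_means)   # subtract means, missing -> 0

print(cosine_similarity(centered)[0, 1])               # centered cosine between users 0 and 1
```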
Item-based collaborative filtering
Recommend items similar to those a user already likes. Assumes users will prefer items resembling their past preferences
Steps of Item-based collaborative filtering
For a given item i, find other similar items
Estimate rating for item i based on ratings for similar items
Apply algorithm
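A sketch of the rating estimate in those steps: a similarity-weighted average over the k most similar items the user has already rated (the item similarities here are hypothetical, precomputed values):

```python
def predict_rating(user_ratings, similarities, k=2):
    # user_ratings: {item: rating} for items the user has rated
    # similarities: {item: similarity of that item to the target item i}
    neighbors = sorted(
        (item for item in user_ratings if item in similarities),
        key=lambda item: similarities[item],
        reverse=True,
    )[:k]                                       # the k most similar rated items
    num = sum(similarities[j] * user_ratings[j] for j in neighbors)
    den = sum(similarities[j] for j in neighbors)
    return num / den if den else None

user_ratings = {"A": 5.0, "B": 3.0, "C": 1.0}
sim_to_i = {"A": 0.9, "B": 0.4, "C": 0.1}       # hypothetical similarities to item i
print(predict_rating(user_ratings, sim_to_i))   # weighted toward A's rating (~4.4)
```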
Pros of Collaborative Filtering
No domain knowledge needed, diverse recommendations, leverages information from other users
Cons of Collaborative Filtering
Data sparsity, cold start problems (new users, new items), popularity bias
Complementary idea
Find rules that associate the presence of one set of items with another set of items
Association rules
Discover relationships between items that frequently occur together in transactions
Support count
Frequency of occurrence of an itemset
Metrics of Association Rules
Support
Confidence
Support (association rule metric)
Fraction of transactions that contain both X and Y
Confidence
Measures how often items in Y appear in transactions that contain X
Steps of Mining Association Rules
Frequent Itemset Generation - Generate all itemsets whose support count >= minsup
Rule Generation - Generate high-confidence rules from each frequent itemset
Apriori principle
If an itemset is frequent, then all of its subsets must also be frequent
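A pure-Python sketch computing support and confidence for one candidate rule {bread} -> {butter} over a toy transaction list:

```python
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "milk"},
    {"milk"},
]

X, Y = {"bread"}, {"butter"}
n = len(transactions)

count_X = sum(X <= t for t in transactions)          # transactions containing X
count_XY = sum((X | Y) <= t for t in transactions)   # transactions containing X and Y

print("support =", count_XY / n)            # 2/4 = 0.5
print("confidence =", count_XY / count_X)   # 2/3 ≈ 0.67
```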
Explicit values
Values that designers intend for their products to embody
Collateral values
Values that crop up as side effects of design decisions and the way users interact with those designs. These values are not intentionally designed into the system.
Normative Language
Evaluative statements that express the speaker's opinion.