NLP
Any computer manipulation of natural language, from counting words to fully understanding meaning.
NER
Named Entity Recognition: finds and classifies named entities in text, such as PERSON, PLACE, ORGANISATION, DATE, MONEY, and PERCENTAGE.
Challenges of NER
The same word can mean different things in different contexts (e.g. "Washington" as a person or a place), so the system must read the full context to classify it correctly.
Advantages of One-Hot Encoding
Simple and easy to implement; every word has a unique identifier.
Disadvantages of One-Hot Encoding
Vector size = vocabulary size (10,000+ numbers per word); cannot handle new words outside the vocabulary; zero semantics (no relationship between any two words).
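A minimal one-hot sketch in plain Python (the three-word vocabulary is invented for illustration). It shows all three disadvantages: the vector is as long as the vocabulary, unknown words fail, and every pair of different words has dot product 0, i.e. no semantic relationship.

```python
# Tiny illustrative vocabulary; a real one would have 10,000+ entries.
vocab = ["cat", "dog", "car"]
index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    # One position set to 1, everything else 0; length = vocabulary size.
    if word not in index:
        raise KeyError(f"'{word}' is not in the vocabulary (OOV problem)")
    vec = [0] * len(vocab)
    vec[index[word]] = 1
    return vec

print(one_hot("cat"))  # [1, 0, 0]
# Dot product of any two different words is 0: zero semantics.
print(sum(a * b for a, b in zip(one_hot("cat"), one_hot("dog"))))  # 0
```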
Word2Vec
Maps each word to a DENSE vector (about 300 numbers), with similar words close together in this mathematical space.
Meaning in Word2Vec
Encoded as direction and distance; words used in similar contexts cluster together.
Comparison: One-Hot Encoding vs Word2Vec
One-Hot:
- Vector size = vocabulary size (10,000+)
- Almost all zeros (very sparse)
- No relationship between any words
- No training needed
Word2Vec:
- Vector size ≈ 300 (fixed, compact)
- All numbers are meaningful (dense)
- Similar words are mathematically close
- Requires training on a large text corpus
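The "mathematically close" point above can be made concrete with cosine similarity. The 4-dimensional vectors below are invented for illustration (a trained Word2Vec model would learn ~300 dimensions from a large corpus); the point is only that similar words get similar dense vectors, so their cosine similarity is high.

```python
import math

# Made-up dense vectors, standing in for learned Word2Vec embeddings.
vectors = {
    "king":  [0.9, 0.8, 0.1, 0.3],
    "queen": [0.9, 0.7, 0.2, 0.4],
    "apple": [0.1, 0.2, 0.9, 0.8],
}

def cosine(a, b):
    # Cosine similarity: direction matters, not magnitude.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Words used in similar contexts cluster together:
print(cosine(vectors["king"], vectors["queen"]))  # high, close to 1
print(cosine(vectors["king"], vectors["apple"]))  # much lower
```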
TF-IDF
Measures how IMPORTANT and DISTINCTIVE a word is to a specific document.
TF-IDF = TF x IDF.
TF (Term Frequency)
Count of word in document / total words in document; high TF means word appears often in this document.
IDF (Inverse Document Frequency)
log(total number of documents / number of documents containing this word); high IDF means word is RARE across all documents.
Purpose of IDF
Punishes words that appear everywhere and rewards rare, distinctive words.
Example of TF Limitation
The word 'the' appears very often (high TF) in every document — but it is meaningless.
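The definitions above (TF = count / document length, IDF = log(total docs / docs containing the word), TF-IDF = TF × IDF) can be sketched directly; the three-document corpus is invented for illustration. It also demonstrates the limitation: 'the' appears in every document, so IDF = log(3/3) = 0 and its TF-IDF is 0 despite a high TF.

```python
import math

# Tiny illustrative corpus.
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "the stock market fell sharply",
]
tokenized = [d.split() for d in docs]

def tf(word, doc):
    # Term frequency: count of word in document / total words in document.
    return doc.count(word) / len(doc)

def idf(word, corpus):
    # log(total number of documents / number of documents containing the word).
    containing = sum(1 for doc in corpus if word in doc)
    return math.log(len(corpus) / containing)

def tf_idf(word, doc, corpus):
    return tf(word, doc) * idf(word, corpus)

# 'the' is in all 3 documents: IDF = log(3/3) = 0, so TF-IDF = 0.
print(tf_idf("the", tokenized[0], tokenized))  # 0.0
# 'market' is rare and distinctive, so it scores higher.
print(tf_idf("market", tokenized[2], tokenized))
```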