MapReduce
distributed programming tool used for indexing and analysis
mapper
transforms a list of items into another list of items of the same length
reducer
transforms a list of items into a single item
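A minimal sketch of the mapper and reducer cards, using Python's built-in map and functools.reduce on a single machine. The word list is made up for illustration; a real MapReduce job would spread this work over many servers.

```python
from functools import reduce

words = ["map", "reduce", "index"]  # illustrative input, not from the cards

# Mapper: one output item per input item, so the list length is preserved.
lengths = list(map(len, words))

# Reducer: folds the whole list into a single item (here, a sum).
total = reduce(lambda acc, n: acc + n, lengths, 0)
```

Here lengths has the same number of items as words, while total is a single number.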
distributed processing
uses a large number of inexpensive servers; driven by the need to index and analyze big data
director server
distributes the query to multiple indexing machines
index server
handles only its part of the query, searching its own portion of the index
director machine
organizes the results and returns them to the user
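A rough sketch of how the director and index server cards fit together, simulated in one process. The names SHARDS, index_server, and director are hypothetical, the scores are invented, and the sketch assumes each index server holds a separate slice of the collection.

```python
import heapq

# Hypothetical shards: each index server holds term weights for some documents.
SHARDS = [
    {"doc1": {"big": 0.4, "data": 0.2}, "doc2": {"data": 0.9}},
    {"doc3": {"big": 0.7}, "doc4": {"big": 0.1, "data": 0.5}},
]

def index_server(shard, query_terms):
    """Score only the documents held by this shard."""
    results = []
    for doc_id, weights in shard.items():
        score = sum(weights.get(t, 0.0) for t in query_terms)
        if score > 0:
            results.append((score, doc_id))
    return results

def director(query_terms, k=2):
    """Send the query to every shard, then merge and return the top k doc ids."""
    partial = [index_server(shard, query_terms) for shard in SHARDS]
    merged = [hit for hits in partial for hit in hits]
    return [doc_id for score, doc_id in heapq.nlargest(k, merged)]
```

In a real system the calls to index_server would go over the network and run in parallel; the list comprehension here only stands in for that step.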
Jaccard coefficient
Jaccard(A, B) = |A ∩ B| / |A ∪ B|
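A short sketch of the Jaccard coefficient on Python sets. Returning 0.0 when both sets are empty is a convention chosen here, not something the card specifies.

```python
def jaccard(a, b):
    """Size of the intersection divided by size of the union."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0  # assumed convention for two empty sets
    return len(a & b) / len(union)

query = "ides of march".split()
doc = "caesar died in march".split()
score = jaccard(query, doc)  # 1 shared word out of 6 distinct words
```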
bag of words model
each document is stored as a vector of word occurrence counts, ignoring the order of the words.
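A small sketch of the bag of words model using collections.Counter. Lowercasing and splitting on whitespace are simplifying assumptions; the sentences are made up to show that word order is lost.

```python
from collections import Counter

def bag_of_words(text):
    """Map each word to its occurrence count, discarding word order."""
    return Counter(text.lower().split())

a = bag_of_words("John is quicker than Mary")
b = bag_of_words("Mary is quicker than John")
same_bag = (a == b)  # different sentences, identical bags
```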
term frequency
number of times term t occurs in document d
score for the doc-query pair
the sum, over terms t that occur in both q and d, of the weight of t in d (for example, its tf-idf weight)
inverse document frequency (IDF)
idf = log(N/df), where N is the number of documents in the collection
has no effect on the ranking for one-term queries
document frequency(df)
number of documents that contain term t
collection frequency
number of occurrences of t in the collection including duplicates
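A quick sketch contrasting document frequency with collection frequency on a toy collection. The three documents and the function names are illustrative only.

```python
docs = [
    "to be or not to be".split(),
    "to sleep".split(),
    "perchance to dream".split(),
]

def document_frequency(term, docs):
    """Number of documents that contain the term at least once."""
    return sum(1 for doc in docs if term in doc)

def collection_frequency(term, docs):
    """Total occurrences of the term across all documents, duplicates included."""
    return sum(doc.count(term) for doc in docs)
```

For the term "to", the document frequency is 3 while the collection frequency is 4, because the first document contains it twice.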
TF-IDF
tf-idf = (1+log(tf)) * log(N/df)
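A worked sketch of the tf-idf formula. The card does not state the base of the logarithm; base 10 is assumed here, and a weight of 0 is assumed when the term does not occur.

```python
import math

def tf_idf(tf, df, n_docs):
    """(1 + log10 tf) * log10(N / df); 0 when the term is absent (assumed)."""
    if tf == 0 or df == 0:
        return 0.0
    return (1 + math.log10(tf)) * math.log10(n_docs / df)

# A term occurring 10 times in a document and appearing in
# 1,000 of 1,000,000 documents: (1 + 1) * 3 = 6.
weight = tf_idf(tf=10, df=1_000, n_docs=1_000_000)
```

A term that appears in every document gets log(N/N) = 0, so its weight is 0 regardless of tf.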
documents as vectors: the terms
axes of the space
documents as vectors: the documents
points in the space
cosine(query, document)
cos(q,d) = (q · d) / (|q| |d|)
normalize vector by length
divide each component by the L2 norm: ||x|| = sqrt(sum_i x_i²)
vector space ranking steps
represent query and document as weighted tf-idf vectors
compute the cosine similarity score between the query vector and each document vector
rank documents with respect to query by score
return top k
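The ranking steps above can be sketched end to end as follows. This is one possible reading, assuming base-10 logs, whitespace tokenization, and sparse vectors stored as dicts; the function names and the three toy documents are hypothetical.

```python
import math
from collections import Counter

def tfidf_vector(tokens, df, n_docs):
    """Step 1: weight each term as (1 + log10 tf) * log10(N / df)."""
    vec = {}
    for term, tf in Counter(tokens).items():
        if df.get(term, 0) > 0:  # skip terms unseen in the collection
            vec[term] = (1 + math.log10(tf)) * math.log10(n_docs / df[term])
    return vec

def cosine(u, v):
    """Step 2: dot product divided by the product of the vector lengths."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def rank(query, docs, k):
    """Steps 3 and 4: sort documents by score and return the top k ids."""
    n_docs = len(docs)
    df = Counter(term for tokens in docs.values() for term in set(tokens))
    q_vec = tfidf_vector(query, df, n_docs)
    scores = {
        doc_id: cosine(q_vec, tfidf_vector(tokens, df, n_docs))
        for doc_id, tokens in docs.items()
    }
    return sorted(scores, key=scores.get, reverse=True)[:k]

docs = {
    "d1": "cheap flights to paris".split(),
    "d2": "paris hotels and paris museums".split(),
    "d3": "cheap car insurance quotes".split(),
}
top = rank("cheap paris flights".split(), docs, k=2)
```

With this toy collection, d1 ranks first because it shares all three query terms, including the rarest one.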
cosine for length-normalized vectors
cos(q,d) = q · d = sum over i = 1 to |V| of (q_i * d_i), where |V| is the number of terms (dimensions)
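A small check, under the assumption of nonzero vectors, that dividing each vector by its length first reduces cosine similarity to a plain dot product. The two vectors are arbitrary examples.

```python
import math

def l2_normalize(x):
    """Divide each component by the vector's L2 length (x must be nonzero)."""
    length = math.sqrt(sum(xi * xi for xi in x))
    return [xi / length for xi in x]

def dot(q, d):
    return sum(qi * di for qi, di in zip(q, d))

q = [3.0, 4.0]
d = [4.0, 3.0]

# Full formula: dot product over the product of the two lengths.
cos_full = dot(q, d) / (math.sqrt(dot(q, q)) * math.sqrt(dot(d, d)))

# Same value from the dot product of the length-normalized vectors.
cos_normalized = dot(l2_normalize(q), l2_normalize(d))
```

Both values come out to 24/25 = 0.96, up to floating-point rounding.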