What kind of data is usually stored in text datasets?
Semi-structured
What is the goal of information retrieval?
Locating relevant documents based on user input, such as keywords or example documents
Name 2 Information Retrieval systems
Online library catalogue systems
Online document management systems
What is precision?
% of retrieved examples that are relevant
What is recall?
% of relevant examples that were retrieved
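A minimal sketch of the two measures above, assuming retrieved and relevant documents are given as sets of ids. The function name `precision_recall` and the document ids are made up for illustration.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for two collections of document ids."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant  # relevant documents that were actually retrieved
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical query: 4 documents retrieved, 3 truly relevant, 2 in common.
p, r = precision_recall({"d1", "d2", "d3", "d4"}, {"d2", "d4", "d7"})
print(p, r)  # precision 0.5, recall about 0.667
```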
What is synonymy with regard to keyword-based retrieval?
A keyword T not appearing in a document, even though the document is closely related to T
What is polysemy with regard to keyword-based retrieval?
The same keyword may mean different things in different contexts, e.g. mining
What is the goal of similarity-based retrieval
Find similar documents based on a set of common keywords
Issue of similarity-based retrieval
Need to come up with a measure of similarity
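One commonly used similarity measure is the cosine of the angle between two keyword-count vectors. The card does not name a specific measure, so this is only one possible choice, and the helper name `cosine_similarity` is an assumption for illustration.

```python
import math
from collections import Counter


def cosine_similarity(doc_a, doc_b):
    """Cosine similarity between the word-count vectors of two texts."""
    a = Counter(doc_a.lower().split())
    b = Counter(doc_b.lower().split())
    # Only words present in both documents contribute to the dot product.
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is similar to nothing
    return dot / (norm_a * norm_b)


same = cosine_similarity("data mining", "data mining")
unrelated = cosine_similarity("data mining", "coal export")
```

Documents sharing no keywords score 0, and identical documents score 1 (up to floating-point rounding).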
How to identify keywords for similarity-based retrieval
Stop list
What is a stop list in regards to similarity-based retrieval
A set of words deemed "irrelevant" even though they may appear frequently (e.g. a, the, of, for, with, etc.)
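A small sketch of stop-list filtering, assuming whitespace tokenization and a stop list containing only the example words from the card. `STOP_LIST` and `extract_keywords` are illustrative names.

```python
# Stop list taken from the examples on the card; a real one would be much longer.
STOP_LIST = {"a", "the", "of", "for", "with"}


def extract_keywords(text, stop_list=STOP_LIST):
    """Lowercase the text, split on whitespace, and drop stop-list words."""
    return [word for word in text.lower().split() if word not in stop_list]


keywords = extract_keywords("The mining of data for a bank")
print(keywords)  # ['mining', 'data', 'bank']
```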
What is latent semantic indexing
Create a word frequency table and use singular value decomposition (SVD) techniques to reduce the size. Retain K most significant rows
How do you carry out latent semantic indexing
Create a term frequency matrix
SVD construction: compute the SVD of the matrix by splitting it into 3 matrices, U, S, V
Vector identification: For each document, replace the original document vector with a new vector that excludes the eliminated terms
Index creation: Store the set of all vectors, indexed
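The four steps above can be sketched with NumPy. This is a toy illustration under stated assumptions, not a production LSI pipeline: the 4x4 term frequency matrix `A`, the choice `k = 2`, and the dictionary `index` are all made up for the example.

```python
import numpy as np

# Step 1: term frequency matrix (rows = terms, columns = documents). Toy data.
A = np.array([
    [2, 1, 0, 0],  # "data"
    [1, 2, 0, 0],  # "mining"
    [0, 0, 1, 2],  # "music"
    [0, 0, 2, 1],  # "guitar"
], dtype=float)

# Step 2: SVD construction, A = U @ diag(s) @ Vt.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Step 3: keep only the k largest singular values, giving each document
# a k-dimensional vector instead of one entry per term.
k = 2
doc_vectors = (np.diag(s[:k]) @ Vt[:k, :]).T  # shape: (n_documents, k)

# Step 4: index creation, storing the reduced vectors by document id.
index = {f"d{j}": doc_vectors[j] for j in range(A.shape[1])}


def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))
```

In the reduced space, documents d0 and d1 (both about data mining) end up pointing in the same direction, while d0 and d2 are orthogonal.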
What kind of Database problems are not present in Information Retrieval?
Updating, transaction management, concurrency control, recovery
What problems of information retrieval are not well addressed in DBMS?
Unstructured documents
Approximate search with keywords
Notion of relevance
What is inverted index retrieval
The index is organised by terms, and each term points to a list of documents or web pages that contain that term.
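A minimal in-memory sketch of an inverted index, assuming whitespace tokenization. The names `build_inverted_index` and `search` and the sample documents are hypothetical.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def search(index, *terms):
    """Return the ids of documents containing every query term."""
    postings = [index.get(term.lower(), set()) for term in terms]
    return set.intersection(*postings) if postings else set()


docs = {
    "d1": "data mining for text",
    "d2": "coal mining export",
    "d3": "text retrieval",
}
index = build_inverted_index(docs)
print(search(index, "mining"))  # both d1 and d2 match: polysemy is not resolved
```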
Advantage of inverted indexing
Easy to implement
Disadvantage of inverted indexing
Doesn't handle synonymy and polysemy well
Posting lists could be too long (i.e. takes up a lot of storage)
Types of Text Data Mining
Keyword-based association analysis
Automatic document classification
Similarity detection
Link analysis
Hypertext analysis
What is keyword-based association analysis?
Collect sets of keywords or terms that occur frequently together and then find the association or correlation relationships among them
What is the core idea behind association mining algorithms?
Consider each document as a transaction
View a set of keywords in the document as a set of items in the transaction (question: apriori algorithm?)
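To illustrate the document-as-transaction view, the sketch below counts how many documents contain each keyword pair and keeps the pairs that meet a minimum support. This is plain brute-force pair counting, not the full Apriori algorithm, and `frequent_keyword_pairs` and the sample data are made up.

```python
from collections import Counter
from itertools import combinations


def frequent_keyword_pairs(documents, min_support):
    """Count keyword pairs per document and keep those seen often enough."""
    counts = Counter()
    for keywords in documents:
        # Each document is one transaction; its keywords are the items.
        for pair in combinations(sorted(set(keywords)), 2):
            counts[pair] += 1
    return {pair: n for pair, n in counts.items() if n >= min_support}


transactions = [
    {"data", "mining", "text"},
    {"data", "mining"},
    {"text", "retrieval"},
]
pairs = frequent_keyword_pairs(transactions, min_support=2)
print(pairs)  # {('data', 'mining'): 2}
```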
What is the motivation for automatic document classification?
Large numbers of online text documents (web pages, emails, etc.) need to be classified
What is the goal of document clustering?
Automatically group related documents based on their contents
Does document clustering require a predetermined taxonomy?
No, a taxonomy is generated at runtime
What are the major steps of document clustering?
Pre-processing
Hierarchical clustering
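A rough sketch of the two steps, under several assumptions not stated on the card: pre-processing is just lowercasing and splitting into keyword sets, similarity is Jaccard overlap, and clusters are merged bottom-up (single linkage) until no pair is similar enough. All names and the threshold are illustrative.

```python
def jaccard(a, b):
    """Share of keywords two sets have in common."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster_documents(docs, threshold):
    """Agglomerative (bottom-up) clustering; returns lists of document positions."""
    # Pre-processing: turn each document into a set of lowercase keywords.
    keyword_sets = [set(d.lower().split()) for d in docs]
    clusters = [[i] for i in range(len(docs))]
    while len(clusters) > 1:
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                # Single linkage: similarity of the closest pair of members.
                sim = max(jaccard(keyword_sets[i], keyword_sets[j])
                          for i in clusters[x] for j in clusters[y])
                if best is None or sim > best[0]:
                    best = (sim, x, y)
        sim, x, y = best
        if sim < threshold:
            break  # remaining clusters are too dissimilar to merge
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return clusters


docs = ["data mining text", "text data mining tools",
        "guitar music", "music guitar chords"]
groups = cluster_documents(docs, threshold=0.5)
print(groups)  # [[0, 1], [2, 3]]
```

No taxonomy is supplied up front; the groups emerge from the merge order.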
What are the challenges of mining the world-wide web?
Too large for effective data warehousing and data mining
Too complex and heterogeneous
What are the 2 major difficulties of keyword-based retrieval
Synonymy
Polysemy
How is a term frequency table structured?
table(i, j) = # of occurrences of the word tᵢ in document dⱼ
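A short sketch of building such a table, assuming whitespace tokenization and alphabetically ordered terms. The function name `term_frequency_table` and the sample documents are hypothetical.

```python
def term_frequency_table(docs):
    """Return (terms, table) where table[i][j] counts terms[i] in docs[j]."""
    tokenized = [d.lower().split() for d in docs]
    terms = sorted({word for words in tokenized for word in words})
    table = [[words.count(term) for words in tokenized] for term in terms]
    return terms, table


terms, table = term_frequency_table(["data mining data", "text mining"])
print(terms)  # ['data', 'mining', 'text']
print(table)  # [[2, 0], [1, 1], [0, 1]]
```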
How do you reduce the size of the term frequency matrix in latent semantic indexing?
Singular value decomposition (SVD)
What are the issues with using hyperlinks to infer authority?
What is the goal of HITS (Hyperlink Induced Topic Search)?
Exploring interactions between hubs and authoritative pages
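HITS gives every page two scores: a hub score (it links to good authorities) and an authority score (it is linked from good hubs), and updates them in turn. The sketch below is a simplified illustration on a tiny hypothetical link graph, with a fixed iteration count in place of a convergence test.

```python
import math


def hits(links, iterations=50):
    """links maps each page to the pages it links to; returns (hub, authority)."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority score: sum of hub scores of the pages linking to p.
        auth = {p: sum(hub[q] for q in pages if p in links.get(q, ()))
                for p in pages}
        norm = math.sqrt(sum(v * v for v in auth.values())) or 1.0
        auth = {p: v / norm for p, v in auth.items()}
        # Hub score: sum of authority scores of the pages p links to.
        hub = {p: sum(auth[t] for t in links.get(p, ())) for p in pages}
        norm = math.sqrt(sum(v * v for v in hub.values())) or 1.0
        hub = {p: v / norm for p, v in hub.items()}
    return hub, auth


# Hypothetical graph: h1 and h2 are link collections, a and b are content pages.
graph = {"h1": ["a", "b"], "h2": ["a"], "a": [], "b": []}
hub, auth = hits(graph)
best_hub = max(hub, key=hub.get)
best_authority = max(auth, key=auth.get)
```

Here `h1` comes out as the strongest hub and `a` as the strongest authority, since `a` is linked from both hubs.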
What algorithm is Google search based on?
PageRank (not HITS, which is a separate link-analysis algorithm)