Web Mining & Text Retrieval

New cards

What kind of data is usually stored in text datasets?

Semi-structured

New cards

What is the goal of information retrieval?

Locating relevant documents based on user input, such as keywords or example documents

New cards

Name 2 Information Retrieval systems

Online library catalogue systems
Online document management systems

New cards

What is precision?

The percentage of retrieved documents that are relevant

New cards

What is recall?

The percentage of relevant documents that were retrieved
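
As a minimal sketch of both measures, assuming documents are identified by ids and that the full set of relevant documents is known (the function name and the example ids are hypothetical):

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for two collections of document ids."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant  # relevant documents that were actually retrieved
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical example: 4 documents retrieved, 3 of them relevant,
# and 6 relevant documents in the whole collection.
p, r = precision_recall(
    {"d1", "d2", "d3", "d4"},
    {"d1", "d2", "d3", "d5", "d6", "d7"},
)
print(p, r)  # 0.75 0.5
```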

New cards

What is synonymy with regard to keyword-based retrieval?

A keyword T does not appear anywhere in a document, even though the document is closely related to T (for example, because it uses a synonym of T)

New cards

What is polysemy with regard to keyword-based retrieval?

The same keyword may mean different things in different contexts, e.g. mining

New cards

What is the goal of similarity-based retrieval

Find similar documents based on a set of common keywords

New cards

Issue of similarity-based retrieval

Need to come up with a measure of similarity
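
One commonly used measure is cosine similarity over word counts. The card does not name a specific measure, so this choice, the whitespace tokenization, and the function name are all assumptions made for illustration:

```python
import math
from collections import Counter


def cosine_similarity(doc_a, doc_b):
    """Cosine of the angle between the word-count vectors of two texts."""
    a = Counter(doc_a.lower().split())
    b = Counter(doc_b.lower().split())
    # Only words present in both documents contribute to the dot product.
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_a * norm_b)
```

Documents with no words in common score 0, and documents with identical word counts score 1 (up to floating-point rounding).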

New cards

How to identify keywords for similarity-based retrieval

Stop list

New cards

What is a stop list with regard to similarity-based retrieval?

A set of words deemed "irrelevant" even though they may appear frequently (e.g. a, the, of, for, with, etc.)
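
A small sketch of stop-list filtering, using only the example words from the card as the stop list (a real stop list would be much longer; the function name and tokenization are assumptions):

```python
# Illustrative stop list taken from the examples on the card.
STOP_LIST = {"a", "the", "of", "for", "with"}


def extract_keywords(text, stop_list=STOP_LIST):
    """Lower-case the text, strip punctuation, and drop stop-list words."""
    tokens = [w.strip(".,;:!?\"'()").lower() for w in text.split()]
    return [t for t in tokens if t and t not in stop_list]


keywords = extract_keywords("The mining of data for a web search engine")
print(keywords)  # ['mining', 'data', 'web', 'search', 'engine']
```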

New cards

What is latent semantic indexing

Create a word frequency table and use singular value decomposition (SVD) to reduce its size, retaining only the K most significant rows

New cards

How do you carry out latent semantic indexing?

Create a term frequency matrix
SVD construction: compute the SVD of the matrix, factoring it into three matrices U, S, and V
Vector identification: for each document, replace the original document vector with a new vector that excludes the eliminated terms
Index creation: store the set of all vectors in an index
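
The steps above can be written compactly as follows. The symbols A (an m × n term frequency matrix with terms as rows and documents as columns), k (the number of singular values kept), and d_j (the j-th document column) are notation assumed here, not taken from the cards:

```latex
% SVD of the term frequency matrix
A = U S V^{T}

% Keep only the k largest singular values (rank-k approximation)
A \approx A_k = U_k S_k V_k^{T}

% Reduced k-dimensional vector for document d_j (the j-th column of A)
\hat{d}_j = U_k^{T} d_j
```

The reduced vectors are what get stored in the index, so similarity is compared in k dimensions rather than m.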

New cards

What kind of Database problems are not present in Information Retrieval?

Updating, transaction management, concurrency control, recovery

New cards

What problems of information retrieval are not well addressed in DBMS?

Unstructured documents
Approximate search with keywords
Notion of relevance

New cards

What is inverted index retrieval

The index is organised by terms, and each term points to a list of documents or web pages that contain that term.
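
A minimal sketch of this structure, assuming whitespace tokenization and an all-terms-must-match query (both are simplifications, and the names are hypothetical):

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to the set of ids of documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)  # the set is the term's posting list
    return index


def search(index, query):
    """Return the ids of documents that contain every query term."""
    terms = query.lower().split()
    if not terms:
        return set()
    postings = [index.get(t, set()) for t in terms]
    return set.intersection(*postings)


docs = {1: "web mining and text retrieval", 2: "text mining", 3: "web search"}
index = build_inverted_index(docs)
print(search(index, "text mining"))  # {1, 2}
```

Because lookup is by exact term, a query for "car" would not find a document that only says "automobile", which is the synonymy weakness of this approach.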

New cards

Advantage of inverted indexing

Easy to implement

New cards

Disadvantage of inverted indexing

Doesn't handle synonymy and polysemy well
Posting lists can become very long (i.e. they take up a lot of storage)

New cards

Types of Text Data Mining

Keyword-based association analysis
Automatic document classification
Similarity detection
Link analysis
Hypertext analysis

New cards

What is keyword-based association analysis?

Collect sets of keywords or terms that occur frequently together and then find the association or correlation relationships among them

New cards

What is the core idea behind association mining algorithms?

Consider each document as a transaction
View a set of keywords in the document as a set of items in the transaction (question: apriori algorithm?)
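
To illustrate the document-as-transaction view, the sketch below counts keyword pairs that co-occur in at least a given fraction of documents. It covers only the pair-counting step, not a full association mining algorithm; the function name, the `min_support` parameter, and the sample data are assumptions:

```python
from collections import Counter
from itertools import combinations


def frequent_keyword_pairs(documents, min_support):
    """Return {pair: support} for keyword pairs meeting the support threshold.

    Each document is a list of keywords and is treated as one transaction.
    """
    n = len(documents)
    counts = Counter()
    for keywords in documents:
        # Sort so a pair has one canonical order; set() ignores repeated keywords.
        for pair in combinations(sorted(set(keywords)), 2):
            counts[pair] += 1
    return {pair: c / n for pair, c in counts.items() if c / n >= min_support}


docs = [
    ["data", "mining", "web"],
    ["data", "mining"],
    ["web", "search"],
    ["data", "mining", "search"],
]
pairs = frequent_keyword_pairs(docs, min_support=0.5)
print(pairs)  # {('data', 'mining'): 0.75}
```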

New cards

What is the motivation for automatic document classification?

Large numbers of online text documents (web pages, emails, etc.) need to be classified

New cards

What is the goal of document clustering?

Automatically group related documents based on their contents

New cards

Does document clustering require a predetermined taxonomy?

No, a taxonomy is generated at runtime

New cards

What are the major steps of document clustering?

Pre-processing
Hierarchical clustering

New cards

What are the challenges of mining the world-wide web?

Too large for effective data warehousing and data mining
Too complex and heterogeneous

New cards

What are the 2 major difficulties of keyword-based retrieval

Synonymy
Polysemy

New cards

How is a term frequency table structured?

table(i, j) = number of occurrences of the word t_i in document d_j
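
A short sketch of building such a table, with one row per term and one column per document (whitespace tokenization and the function name are assumptions):

```python
def term_frequency_table(docs):
    """Return (terms, table) where table[i][j] counts terms[i] in docs[j]."""
    tokenized = [d.lower().split() for d in docs]
    # Sorted vocabulary gives each term a stable row index.
    terms = sorted({t for doc in tokenized for t in doc})
    table = [[doc.count(t) for doc in tokenized] for t in terms]
    return terms, table


terms, table = term_frequency_table(["web mining", "text mining mining"])
print(terms)  # ['mining', 'text', 'web']
print(table)  # [[1, 2], [0, 1], [1, 0]]
```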

New cards

How do you reduce the size of the term frequency matrix in latent semantic indexing?

Singular value decomposition (SVD)

New cards

What are the issues with using hyperlinks to infer authority?

New cards

What is the goal of HITS (Hyperlink Induced Topic Search)

Exploring interactions between hubs and authoritative pages
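
The mutual reinforcement in HITS can be sketched as an iteration over hub and authority scores: a page's authority score sums the hub scores of pages linking to it, and its hub score sums the authority scores of pages it links to. The graph, the iteration count, and the function name below are hypothetical, and this is a simplified sketch rather than a complete implementation:

```python
import math


def hits(links, iterations=20):
    """Return (hub, authority) score dicts for a graph {page: [linked pages]}."""
    pages = set(links) | {q for targets in links.values() for q in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority update: add up hub scores of the pages that link here.
        auth = {p: 0.0 for p in pages}
        for p, targets in links.items():
            for q in targets:
                auth[q] += hub[p]
        # Hub update: add up authority scores of the pages linked to.
        hub = {p: sum(auth[q] for q in links.get(p, ())) for p in pages}
        # Normalize both score vectors to unit length so values stay bounded.
        for scores in (auth, hub):
            norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
            for p in scores:
                scores[p] /= norm
    return hub, auth


# Hypothetical graph: A and B both link to C, and C links to D.
links = {"A": ["C"], "B": ["C"], "C": ["D"]}
hub, auth = hits(links)
```

In this toy graph C ends up with the highest authority score, while A and B are the strongest hubs because they point to it.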

New cards

What algorithm is Google search based on?

The PageRank algorithm, a link-analysis method related to HITS

New cards