Web Mining & Text Retrieval


33 Terms

1

What kind of data is usually stored in text datasets?

Semi-structured

2

What is the goal of information retrieval?

Locating relevant documents based on user input, such as keywords or example documents

3

Name 2 Information Retrieval systems

  • Online library catalogue systems

  • Online document management systems

4

What is precision?

The percentage of retrieved documents that are relevant

5

What is recall?

The percentage of relevant documents that were retrieved
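Both measures can be computed from two sets of document IDs. The following is a minimal sketch; the function name `precision_recall` and the document IDs are made up for illustration.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query, given two collections of document IDs."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant  # relevant documents that were actually retrieved
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical query: 4 documents retrieved, 5 documents truly relevant.
retrieved_docs = {"d1", "d2", "d3", "d4"}
relevant_docs = {"d2", "d4", "d5", "d6", "d7"}
print(precision_recall(retrieved_docs, relevant_docs))  # (0.5, 0.4)
```

Two of the four retrieved documents are relevant (precision 0.5), and two of the five relevant documents were found (recall 0.4).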

6

What is synonymy with regard to keyword-based retrieval?

A keyword T does not appear in a document even though the document is closely related to T (e.g. the document uses a synonym of T instead)

7

What is polysemy with regard to keyword-based retrieval?

The same keyword may mean different things in different contexts, e.g. "mining" (data mining vs. mineral mining)

8

What is the goal of similarity-based retrieval?

Find similar documents based on a set of common keywords

9

Issue of similarity-based retrieval

Need to come up with a measure of similarity
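One commonly used measure is cosine similarity between keyword-count vectors. The card does not name a specific measure, so treat the choice of cosine similarity, the whitespace tokenisation, and the function name as assumptions in this sketch.

```python
import math
from collections import Counter


def cosine_similarity(doc_a, doc_b):
    """Cosine of the angle between the keyword-count vectors of two documents."""
    counts_a = Counter(doc_a.lower().split())
    counts_b = Counter(doc_b.lower().split())
    # Only keywords present in both documents contribute to the dot product.
    dot = sum(counts_a[t] * counts_b[t] for t in counts_a.keys() & counts_b.keys())
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_a * norm_b)


print(cosine_similarity("data mining", "gold mining"))
print(cosine_similarity("web mining", "cooking recipes"))
```

The score is close to 1 for documents with the same keyword distribution and 0 for documents with no keywords in common.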

10

How to identify keywords for similarity-based retrieval

Use a stop list to filter out common, uninformative words; the remaining words are candidate keywords

11

What is a stop list with regard to similarity-based retrieval?

A set of words deemed "irrelevant" even though they may appear frequently (e.g. a, the, of, for, with)
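Stop-list filtering can be sketched in a few lines. The list below contains only the example words from this card (real stop lists are much longer), and the tokenisation and the name `extract_keywords` are assumptions for illustration.

```python
STOP_LIST = {"a", "the", "of", "for", "with"}  # example words only; real lists are longer


def extract_keywords(text, stop_list=STOP_LIST):
    """Lower-case the text, split on whitespace, and drop stop-list words."""
    tokens = (word.strip(".,;:!?\"'()") for word in text.lower().split())
    return [t for t in tokens if t and t not in stop_list]


print(extract_keywords("The goal of mining the Web for text"))
# ['goal', 'mining', 'web', 'text']
```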

12

What is latent semantic indexing?

Build a word (term) frequency table, then use singular value decomposition (SVD) to reduce its size, retaining only the K most significant dimensions (those with the largest singular values)

13

How do you carry out latent semantic indexing?

  1. Create a term frequency matrix

  2. SVD construction: compute the SVD of the term frequency matrix, factoring it into 3 matrices, U, S, and V

  3. Vector identification: for each document, replace the original document vector with a new vector that excludes the eliminated terms

  4. Index creation: store the set of all reduced vectors in an index
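In matrix notation, the steps above can be summarised as follows. The symbols are assumptions introduced for illustration: A is the term frequency matrix (rows are terms, columns are documents), d_j is column j of A, and k is the number of dimensions kept.

```latex
A = U S V^{\top}, \qquad
A_k = U_k S_k V_k^{\top}, \qquad
\hat{d}_j = S_k^{-1} U_k^{\top} d_j
```

Here U_k and V_k hold the first k columns of U and V, S_k is the top-left k-by-k block of S (the k largest singular values), and the reduced vector of document j has only k components instead of one per term.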

14

What kind of Database problems are not present in Information Retrieval?

Updating, transaction management, concurrency control, recovery

15

What problems of information retrieval are not well addressed in DBMS?

  • Unstructured documents

  • Approximate search with keywords

  • Notion of relevance

16

What is inverted index retrieval?

The index is organised by terms, and each term points to a list of documents or web pages that contain that term.
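A minimal sketch of this structure, assuming whitespace tokenisation and integer document IDs. The names `build_inverted_index` and `search` are hypothetical.

```python
from collections import defaultdict


def build_inverted_index(documents):
    """Map each term to the sorted list of IDs of the documents that contain it."""
    postings = defaultdict(set)
    for doc_id, text in documents.items():
        for term in text.lower().split():
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}


def search(index, *terms):
    """Return IDs of documents containing every query term (a simple AND query)."""
    lists = [set(index.get(t.lower(), ())) for t in terms]
    return sorted(set.intersection(*lists)) if lists else []


docs = {1: "web mining and text mining", 2: "text retrieval", 3: "web search"}
index = build_inverted_index(docs)
print(index["text"])                   # [1, 2]
print(search(index, "web", "mining"))  # [1]
```

Each posting list grows with the number of documents containing the term, which is why very common terms produce long lists.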

17

Advantage of inverted indexing

Easy to implement

18

Disadvantage of inverted indexing

  • Doesn't handle synonymy and polysemy well

  • Posting lists could be too long (i.e. takes up a lot of storage)

19

Types of Text Data Mining

  • Keyword-based association analysis

  • Automatic document classification

  • Similarity detection

  • Link analysis

  • Hypertext analysis

20

What is keyword-based association analysis?

Collect sets of keywords or terms that occur frequently together and then find the association or correlation relationships among them

21

What is the core idea behind association mining algorithms?

  • Consider each document as a transaction

  • View a set of keywords in the document as a set of items in the transaction (frequent-itemset algorithms such as Apriori can then be applied)
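The document-as-transaction view can be sketched as below. This version simply counts every keyword pair by brute force; it does not implement Apriori's candidate pruning, and the function name and `min_support` parameter are assumptions for illustration.

```python
from collections import Counter
from itertools import combinations


def frequent_keyword_pairs(documents, min_support):
    """Return keyword pairs that co-occur in at least `min_support` documents."""
    # Each document becomes a transaction: the set of distinct keywords it contains.
    transactions = [set(doc.lower().split()) for doc in documents]
    pair_counts = Counter()
    for items in transactions:
        # Sorting makes ('mining', 'web') and ('web', 'mining') the same key.
        for pair in combinations(sorted(items), 2):
            pair_counts[pair] += 1
    return {pair: n for pair, n in pair_counts.items() if n >= min_support}


docs = ["data mining web", "web mining text", "text retrieval"]
print(frequent_keyword_pairs(docs, min_support=2))
# {('mining', 'web'): 2}
```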

22

What is the motivation for automatic document classification?

Large numbers of online text documents (web pages, emails, etc.) need to be classified

23

What is the goal of document clustering?

Automatically group related documents based on their contents

24

Does document clustering require a predetermined taxonomy?

No, a taxonomy is generated at runtime

25

What are the major steps of document clustering?

  1. Pre-processing

  2. Hierarchical clustering
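The two steps can be sketched as follows. The card does not specify the pre-processing or the linkage rule, so this sketch assumes word-set pre-processing, Jaccard similarity, and single-link agglomerative merging. All function names are hypothetical.

```python
def jaccard(a, b):
    """Share of words two documents have in common, out of all words in either."""
    return len(a & b) / len(a | b) if a | b else 0.0


def cluster_documents(documents, num_clusters):
    """Merge the two most similar clusters until `num_clusters` remain."""
    # Step 1, pre-processing: reduce each document to its set of lower-cased words.
    word_sets = [set(doc.lower().split()) for doc in documents]
    # Step 2, hierarchical clustering: start with one cluster per document.
    clusters = [[i] for i in range(len(documents))]
    while len(clusters) > max(num_clusters, 1):
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                # Single link: similarity of the closest pair of members.
                sim = max(jaccard(word_sets[i], word_sets[j])
                          for i in clusters[x] for j in clusters[y])
                if best is None or sim > best[0]:
                    best = (sim, x, y)
        _, x, y = best
        clusters[x] += clusters.pop(y)  # y > x, so index x is unaffected by the pop
    return clusters


docs = ["web mining text", "text mining web data",
        "cooking pasta recipes", "pasta recipes italian"]
print(cluster_documents(docs, 2))  # [[0, 1], [2, 3]]
```

The sequence of merges forms the hierarchy, so no taxonomy has to be supplied in advance.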

26

What are the challenges of mining the world-wide web?

  • Too large for effective data warehousing and data mining

  • Too complex and heterogeneous

27

What are the 2 major difficulties of keyword-based retrieval?

  • Synonymy

  • Polysemy

28

How is a term frequency table structured?

table(i, j) = number of occurrences of the word tᵢ in document dⱼ
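A small sketch of building such a table, with one row per term and one column per document. The whitespace tokenisation and the name `term_frequency_table` are assumptions for illustration.

```python
def term_frequency_table(documents):
    """Return (terms, table) where table[i][j] counts terms[i] in documents[j]."""
    tokenized = [doc.lower().split() for doc in documents]
    terms = sorted({t for tokens in tokenized for t in tokens})
    table = [[tokens.count(term) for tokens in tokenized] for term in terms]
    return terms, table


terms, table = term_frequency_table(["web mining web", "text mining"])
print(terms)  # ['mining', 'text', 'web']
print(table)  # [[1, 1], [0, 1], [2, 0]]
```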

29

How do you reduce the size of the term frequency matrix in latent semantic indexing?

Singular value decomposition (SVD)

30

What are the issues with using hyperlinks to infer authority?

31

What is the goal of HITS (Hyperlink-Induced Topic Search)?

Exploring interactions between hubs and authoritative pages
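The interaction can be sketched as an iterative update in which good hubs point to good authorities and good authorities are pointed to by good hubs. This is a simplified sketch on a hand-made link graph; the function name, the fixed iteration count, and the page names are assumptions.

```python
import math


def hits(links, iterations=50):
    """links maps each page to the pages it links to; returns (hub, authority) dicts."""
    pages = set(links) | {p for targets in links.values() for p in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # A page's authority is the sum of the hub scores of the pages linking to it.
        auth = {p: 0.0 for p in pages}
        for src, targets in links.items():
            for dst in targets:
                auth[dst] += hub[src]
        # A page's hub score is the sum of the authority scores of the pages it links to.
        hub = {p: sum(auth[dst] for dst in links.get(p, ())) for p in pages}
        # Normalise both score vectors so the values stay bounded.
        for scores in (auth, hub):
            norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
            for p in scores:
                scores[p] /= norm
    return hub, auth


hub, auth = hits({"a": ["c"], "b": ["c"], "c": []})
print(max(auth, key=auth.get))  # c
```

In this toy graph, page c is the only authority because both a and b link to it, while a and b share the hub role equally.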

32

What algorithm is Google search based on?

PageRank, a link-analysis algorithm that is related to but distinct from HITS
