ITEC 4020 - L7: Indexing and Search


54 Terms

1
Search Types
The different kinds of searches users perform, including question answering, navigation, task accomplishment, and housekeeping.
2
Question Answering Search
A search focused on finding specific factual information, such as “best lens Sanibel.”
3
Navigation Search
A search used to locate a known web page or resource, such as “garden photography.”
4
Task Accomplishment Search
A search performed to complete an action, such as “photo sharing.”
5
Housekeeping Search
A search done to find site-related information like “privacy policy” or “contact us.”
6
SQL Search Limitation
Plain SQL predicates such as = and LIKE match exact values or patterns and return results without any ranking by relevance.
7
LIKE Operator Problem
A pattern with a leading wildcard, such as LIKE '%term%', cannot use an ordinary index, so the database scans every row (O(N)) and becomes inefficient with large datasets.
8
Full Text Search
A search technique that allows ranked, flexible, and high-performance retrieval from large text collections.
9
Benefits of Full Text Search
Provides fast results and relevance ranking, and reduces the number of queries that return no results.
10
Index in Databases
A data structure that improves retrieval speed by storing key values together with the locations of the matching rows.
11
Trade-off of Indexing
Indexes speed up reads but slow down writes and updates due to maintenance overhead.
12
Information Retrieval (IR) Index
An index designed for text and document retrieval, different from traditional DBMS indexes.
13
Inverted Index
A structure that maps words to the list of documents containing them, enabling fast searches.
14
Steps to Build an Inverted Index
Parse documents, extract tokens, sort tokens, merge duplicates, and store term frequencies.
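The steps above can be sketched in a few lines of Python. This is a minimal illustration, assuming whitespace tokenization and lowercasing; the function name `build_inverted_index` and the sample documents are hypothetical, not taken from the lecture.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to a sorted list of (doc_id, term_frequency) pairs."""
    # Steps 1-2: parse each document and extract lowercase tokens.
    pairs = []
    for doc_id, text in docs.items():
        for token in text.lower().split():
            pairs.append((token, doc_id))
    # Step 3: sort by term, then by document ID.
    pairs.sort()
    # Steps 4-5: merge duplicate (term, doc) pairs and keep the count
    # as the term frequency.
    index = defaultdict(dict)
    for token, doc_id in pairs:
        index[token][doc_id] = index[token].get(doc_id, 0) + 1
    return {term: sorted(postings.items()) for term, postings in index.items()}

# Hypothetical sample collection.
docs = {
    1: "new home sales top forecasts",
    2: "home sales rise in july",
    3: "increase in home sales in july",
}
index = build_inverted_index(docs)
print(index["in"])  # postings for the term "in"
```

Looking up a query term is then a single dictionary access instead of a scan over every document.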
15
Dictionary or Lexicon
In IR, a list of unique terms with the number of documents containing each term.
16
Postings File
A file that contains document IDs and positions for each term in the dictionary.
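A rough sketch of how a postings structure with positions relates to the dictionary, assuming whitespace tokenization and zero-based word positions. The names `build_positional_postings`, `postings`, and `dictionary` are illustrative only.

```python
def build_positional_postings(docs):
    """Return {term: {doc_id: [positions]}} for a small collection."""
    postings = {}
    for doc_id in sorted(docs):
        for position, token in enumerate(docs[doc_id].lower().split()):
            postings.setdefault(token, {}).setdefault(doc_id, []).append(position)
    return postings

# Hypothetical sample collection.
docs = {1: "to be or not to be", 2: "to do is to be"}
postings = build_positional_postings(docs)

# The dictionary keeps one entry per unique term together with the
# number of documents that contain it (the document frequency).
dictionary = {term: len(by_doc) for term, by_doc in postings.items()}

print(postings["to"])
print(dictionary["or"])
```

Storing positions is what makes phrase and proximity queries possible, at the cost of a larger postings file.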
17
Term Frequency (TF)
The number of times a term appears within a document, used for ranking relevance.
18
Inverse Document Frequency (IDF)
A measure of how rare a term is across all documents, calculated as log(N/nj), where N is the total number of documents and nj is the number of documents containing term j.
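As a small worked example of the log(N/nj) formula, the sketch below computes IDF over a toy collection. It assumes the natural logarithm (the card does not state a base) and assigns zero to terms that appear in no document; both choices, and the function name `idf`, are assumptions.

```python
import math

def idf(term, docs):
    """Inverse document frequency: log(N / n_j)."""
    n_docs = len(docs)
    n_with_term = sum(1 for tokens in docs if term in tokens)
    if n_with_term == 0:
        return 0.0  # assumption: unseen terms get zero weight
    return math.log(n_docs / n_with_term)

# Hypothetical collection: each document is a set of its terms.
docs = [
    {"database", "index"},
    {"database", "search"},
    {"database", "crawler"},
    {"pagerank", "crawler"},
]

for term in ("database", "crawler", "pagerank"):
    print(term, round(idf(term, docs), 3))
```

The term found in three of the four documents gets the lowest weight, and the term found in only one gets the highest.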
19
TF-IDF Weighting
A ranking method that increases importance for terms that appear often in a document but rarely overall.
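One way to see this weighting in action is to score a few documents against a query. The sketch assumes raw term counts for TF, natural-log IDF, and a score equal to the sum of TF × IDF over the query terms; real engines use more refined variants. The function name `tf_idf_scores` and the sample texts are hypothetical.

```python
import math
from collections import Counter

def tf_idf_scores(query, docs):
    """Rank documents by the sum of tf * idf over the query terms."""
    tokenized = {doc_id: Counter(text.lower().split()) for doc_id, text in docs.items()}
    n_docs = len(tokenized)
    scores = {doc_id: 0.0 for doc_id in tokenized}
    for term in query.lower().split():
        df = sum(1 for counts in tokenized.values() if term in counts)
        if df == 0:
            continue  # term appears nowhere, so it contributes nothing
        idf = math.log(n_docs / df)
        for doc_id, counts in tokenized.items():
            scores[doc_id] += counts[term] * idf
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Hypothetical sample collection.
docs = {
    "a": "inverted index maps terms to documents",
    "b": "the index of the book lists the chapters",
    "c": "an inverted index makes an inverted lookup fast",
}
ranked = tf_idf_scores("inverted index", docs)
print([doc_id for doc_id, _ in ranked])
```

Here "index" occurs in every document, so its IDF is zero and only "inverted" separates the documents.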
20
Stop Words
Common words like “and” or “the” that are ignored in search indexing to improve efficiency.
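A minimal sketch of stop-word filtering before indexing. The `STOP_WORDS` set is a tiny illustrative list, not the list used by any particular search engine.

```python
# Illustrative stop list only; real systems ship much longer lists.
STOP_WORDS = {"a", "an", "and", "of", "the", "to"}

def remove_stop_words(text):
    """Lowercase, split on whitespace, and drop stop words."""
    return [token for token in text.lower().split() if token not in STOP_WORDS]

tokens = remove_stop_words("The history of the web and the search engine")
print(tokens)
```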
21
Stemming
Reduces words to their base form (e.g., “running” becomes “run”) to standardize search terms.
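To make the idea concrete, here is a deliberately crude suffix stripper. It is not the Porter algorithm or any library's stemmer; the suffix list, the minimum stem length, and the function name `crude_stem` are assumptions chosen for illustration.

```python
def crude_stem(word):
    """Strip one common English suffix; a toy stand-in for a real stemmer."""
    word = word.lower()
    for suffix in ("ing", "ed", "es", "s"):
        # Require at least three characters to remain after stripping.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            stem = word[: -len(suffix)]
            # Undo consonant doubling, e.g. "runn" -> "run".
            if len(stem) >= 2 and stem[-1] == stem[-2] and stem[-1] not in "aeiouls":
                stem = stem[:-1]
            return stem
    return word

for word in ("running", "runs", "searching", "searched", "searches"):
    print(word, "->", crude_stem(word))
```

Because query terms and indexed terms pass through the same function, a query for "running" can match a document that only contains "runs".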
22
Relevance Ranking
The process of ordering search results based on how closely they match the query terms.
23
MySQL FULLTEXT Index
A MySQL index type that enables full-text indexing and searching on CHAR, VARCHAR, or TEXT columns.
24
Creating FULLTEXT Index
A FULLTEXT index is defined with FULLTEXT(title, body) in a CREATE TABLE statement, or added later with ALTER TABLE or CREATE FULLTEXT INDEX.
25
MATCH...AGAINST Syntax
The MySQL command for performing full text searches and retrieving ranked results.
26
Natural Language Mode
A MySQL search mode that interprets queries as normal phrases without special operators.
27
Boolean Mode
A MySQL search mode that supports operators like + (must include) and - (must exclude).
28
Query Expansion Mode
A MySQL search mode that runs the search twice: the second pass adds words from the most relevant documents found in the first pass to the original query.
29
Relevance Score
A non-negative floating-point value returned by MATCH...AGAINST that indicates how closely a row matches the search query; zero means no similarity.
30
SQL vs Full Text Search
Plain SQL predicates match exact values or patterns without ranking, while full-text search returns ranked results, accepts flexible queries, and uses an index for faster lookups.
31
Web Crawler
An automated program that browses the internet to collect and index web pages for search engines.
32
Web Crawling Process
Starts with known URLs, follows links recursively, respects robots.txt, and stores page data.
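The process can be sketched without touching the network by crawling an in-memory link graph. Everything here is hypothetical: the `WEB` dictionary stands in for fetching pages, `is_allowed` stands in for a robots.txt check, and the recursion is expressed with a queue (breadth-first) rather than recursive calls.

```python
from collections import deque

def crawl(seed_urls, fetch_links, is_allowed, max_pages=100):
    """Breadth-first crawl: visit allowed pages and follow their links."""
    frontier = deque(seed_urls)
    seen = set(seed_urls)
    stored = []
    while frontier and len(stored) < max_pages:
        url = frontier.popleft()
        if not is_allowed(url):
            continue  # respect the exclusion rule; do not fetch or follow
        stored.append(url)  # stand-in for storing the page data
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return stored

# Hypothetical link graph standing in for the live web.
WEB = {
    "http://example.com/": ["http://example.com/a", "http://example.com/private/x"],
    "http://example.com/a": ["http://example.com/b", "http://example.com/"],
    "http://example.com/b": [],
    "http://example.com/private/x": ["http://example.com/secret"],
}

pages = crawl(
    ["http://example.com/"],
    fetch_links=lambda url: WEB.get(url, []),
    is_allowed=lambda url: "/private/" not in url,
)
print(pages)
```

The `seen` set keeps the crawler from looping when pages link back to each other, and links on a disallowed page are never discovered.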
33
robots.txt
File on a website that defines which pages web crawlers are allowed or disallowed to access.
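Python's standard library includes a parser for this file, which gives a quick way to see the allow/disallow check. The robots.txt content, the `ExampleBot` user agent, and the URLs below are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real crawler would download it
# from the site's /robots.txt URL.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

allowed = parser.can_fetch("ExampleBot", "http://example.com/index.html")
blocked = parser.can_fetch("ExampleBot", "http://example.com/private/data.html")
print(allowed, blocked)
```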
34
Duplicate Detection
Uses hashing to detect and ignore identical pages during crawling.
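A minimal sketch of hash-based duplicate detection, assuming SHA-256 over whitespace-normalized content. The normalization step and the function names are assumptions; exact hashing like this catches identical pages only, not near-duplicates.

```python
import hashlib

def content_fingerprint(html):
    """Hash the page content after collapsing runs of whitespace."""
    normalized = " ".join(html.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

def drop_duplicates(pages):
    """Keep the first URL seen for each distinct content fingerprint."""
    seen = set()
    unique_urls = []
    for url, html in pages:
        digest = content_fingerprint(html)
        if digest in seen:
            continue  # identical content already stored under another URL
        seen.add(digest)
        unique_urls.append(url)
    return unique_urls

# Hypothetical crawled pages as (url, content) pairs.
pages = [
    ("http://example.com/a", "<p>Hello world</p>"),
    ("http://example.com/a?ref=1", "<p>Hello   world</p>\n"),
    ("http://example.com/b", "<p>Goodbye</p>"),
]
print(drop_duplicates(pages))
```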
35
Recrawling
Revisiting web pages periodically to ensure index freshness and accuracy.
36
Search Engine Challenges
Issues like broken links, duplicate pages, and misleading or spam content.
37
Link Analysis
Technique where pages gain ranking based on how many other reputable pages link to them.
38
PageRank
Google’s algorithm that ranks pages based on incoming links weighted by their own importance.
39
PageRank Formula
PR(A) = (1-d) + d(PR(A1)/C(A1) + … + PR(An)/C(An)), where A1…An are the pages that link to A, C(Ai) is the number of outbound links on page Ai, and d is the damping factor (~0.85).
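The formula can be applied repeatedly until the values settle. The sketch below assumes a tiny hypothetical three-page graph in which every page has at least one outbound link, starts every page at 1.0, and runs a fixed 50 iterations instead of testing for convergence.

```python
def pagerank(links, d=0.85, iterations=50):
    """Iterate PR(A) = (1 - d) + d * sum(PR(T) / C(T)) over pages T linking to A."""
    pages = list(links)
    pr = {page: 1.0 for page in pages}
    for _ in range(iterations):
        new_pr = {}
        for page in pages:
            # Each linking page passes on its rank divided by its outbound link count.
            incoming = sum(pr[src] / len(links[src]) for src in pages if page in links[src])
            new_pr[page] = (1 - d) + d * incoming
        pr = new_pr
    return pr

# Hypothetical graph: page -> list of pages it links to.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
pr = pagerank(links)
print({page: round(score, 3) for page, score in pr.items()})
```

Page C ends up highest because it receives links from both A and B, and A comes next because its only inbound link is from the highly ranked C.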
40
PageRank Damping Factor
A value (typically 0.85) representing the probability a user continues following links.
41
Hit List
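In Google’s indexing, a list of tuples recording the occurrences of a particular word in a particular document, including position, font, and capitalization information.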
42
Barrels
Inverted files partitioned by word ID range, used in Google’s index for efficient searching.
43
Lexicon
A mapping of words to word IDs and their positions in the index structure.
44
Google Ranking Combination
Google combines PageRank (link importance) with IR relevance (TF-IDF) for results.
45
SQL Search Complexity
Linear time, O(N), when a query such as LIKE '%term%' must scan every row; inefficient for large datasets.
46
Full Text Search Complexity
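Looking up a term in the index dictionary is roughly constant time, O(1), with a hash table; the remaining cost depends on the length of the matching postings lists rather than on a scan of every document.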
47
Difference Between IR and DBMS Indexing
IR indexes map terms to documents and support relevance ranking; DBMS indexes support exact-value and range lookups on column values such as primary keys.
48
Crawling Freshness
The measure of how up-to-date a search index is compared to the live web.
49
Ranking Factors
Combination of TF-IDF scores, PageRank, and metadata relevance in modern search engines.
50
Full Text Search Example Query
SELECT * FROM articles WHERE MATCH(title, body) AGAINST('database' IN NATURAL LANGUAGE MODE);
51
Boolean Mode Example
SELECT * FROM articles WHERE MATCH(title, body) AGAINST('+database -MySQL' IN BOOLEAN MODE);
52
Query Expansion Example
SELECT * FROM articles WHERE MATCH(title, body) AGAINST('database' WITH QUERY EXPANSION);
53
Relevance Ranking Advantage
Ensures that more meaningful results appear higher in search output.
54
Information Retrieval Goal
Retrieve the most relevant documents efficiently based on the user’s search intent.