ITEC 4020 - L7: Indexing and Search

Search Types

The different kinds of searches users perform, including question answering, navigation, task accomplishment, and housekeeping.

Question Answering Search

A search focused on finding specific factual information, such as “best lens Sanibel.”

Navigation Search

A search used to locate a known web page or resource, such as “garden photography.”

Task Accomplishment Search

A search performed to complete an action, such as “photo sharing.”

Housekeeping Search

A search done to find site-related information like “privacy policy” or “contact us.”

SQL Search Limitation

Plain SQL predicates such as = and LIKE match exact values or fixed patterns and return unranked rows, with no notion of relevance.

LIKE Operator Problem

A LIKE pattern with a leading wildcard (e.g., '%term%') cannot use an ordinary index, so the database scans every row, which is O(N) and becomes inefficient with large datasets.

Full Text Search

A search technique that allows ranked, flexible, and high-performance retrieval from large text collections.

Benefits of Full Text Search

Provides fast results, relevance ranking, and reduces cases with no returned results.

Index in Databases

A data structure that improves retrieval speed by storing key values (or keywords) together with pointers to where the matching records are located.

Trade-off of Indexing

Indexes speed up reads but slow down writes and updates due to maintenance overhead.

Information Retrieval (IR) Index

An index designed for text and document retrieval, different from traditional DBMS indexes.

Inverted Index

A structure that maps words to the list of documents containing them, enabling fast searches.
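
A minimal sketch of the idea, assuming whitespace tokenization and a small in-memory collection (the names build_index and search_all are illustrative, not from the lecture). Each word maps to the set of document IDs that contain it, so a query only touches the entries for its own words:

```python
from collections import defaultdict

def build_index(docs):
    """Map each lowercase word to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

def search_all(index, query):
    """Return the sorted IDs of documents that contain every query word."""
    words = query.lower().split()
    if not words:
        return []
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())  # intersect the document sets
    return sorted(result)

# Hypothetical sample documents.
docs = {
    1: "garden photography tips",
    2: "photo sharing for garden clubs",
    3: "best lens for bird photography",
}
index = build_index(docs)
print(search_all(index, "garden photography"))  # [1]
```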

Steps to Build an Inverted Index

Parse documents, extract tokens, sort tokens, merge duplicates, and store term frequencies.
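
The steps above can be sketched as follows. This is an illustrative version only: the regular-expression tokenizer and the function names are assumptions, and a real indexer would sort and merge on disk rather than in memory.

```python
import re
from itertools import groupby

def tokenize(text):
    """Steps 1-2: parse the text and extract lowercase alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_postings(docs):
    # Collect one (term, doc_id) pair per token occurrence.
    pairs = [(term, doc_id)
             for doc_id, text in docs.items()
             for term in tokenize(text)]
    # Step 3: sort by term, then by document ID.
    pairs.sort()
    # Steps 4-5: merge duplicate pairs; the size of each group is the
    # term frequency of that term in that document.
    index = {}
    for (term, doc_id), group in groupby(pairs):
        index.setdefault(term, []).append((doc_id, len(list(group))))
    return index

# Hypothetical sample documents.
docs = {1: "the cat sat on the mat", 2: "the dog sat"}
index = build_postings(docs)
print(index["the"])  # [(1, 2), (2, 1)]
print(index["sat"])  # [(1, 1), (2, 1)]
```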

Dictionary or Lexicon

In IR, a list of unique terms with the number of documents containing each term.

Postings File

A file that contains document IDs and positions for each term in the dictionary.
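
One way to picture the dictionary and the postings side by side, sketched with in-memory Python dicts rather than files (an assumption for brevity; the function name is illustrative). The dictionary holds each term's document count, and the postings hold document IDs with word positions:

```python
from collections import defaultdict

def build_positional_index(docs):
    """Return (dictionary, postings) for a small document collection."""
    postings = defaultdict(dict)  # term -> {doc_id: [word positions]}
    for doc_id, text in docs.items():
        for position, term in enumerate(text.lower().split()):
            postings[term].setdefault(doc_id, []).append(position)
    # Dictionary: each unique term with the number of documents containing it.
    dictionary = {term: len(by_doc) for term, by_doc in postings.items()}
    return dictionary, dict(postings)

# Hypothetical sample documents.
docs = {1: "to be or not to be", 2: "to do"}
dictionary, postings = build_positional_index(docs)
print(dictionary["to"])  # 2
print(postings["to"])    # {1: [0, 4], 2: [0]}
```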

Term Frequency (TF)

The number of times a term appears within a document, used for ranking relevance.

Inverse Document Frequency (IDF)

A measure of how rare a term is across all documents, calculated as log(N/nj), where N is the total number of documents and nj is the number of documents that contain term j.
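
A small worked example of log(N/n), where N is the total number of documents and n is the number containing the term. It assumes the natural logarithm (the base only rescales the values) and, as a further assumption, returns 0 for a term that appears in no document:

```python
import math

def idf(term, docs):
    """Inverse document frequency of a term over a list of word sets."""
    n_total = len(docs)
    n_with_term = sum(1 for words in docs if term in words)
    if n_with_term == 0:
        return 0.0  # assumption: an unseen term contributes nothing
    return math.log(n_total / n_with_term)

# Hypothetical collection of four documents, each reduced to a set of words.
docs = [
    {"database", "index"},
    {"database", "search"},
    {"web", "crawler"},
    {"database", "web"},
]
print(idf("database", docs))  # log(4/3): common term, low weight
print(idf("crawler", docs))   # log(4/1): rare term, high weight
```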

TF-IDF Weighting

A ranking method that increases importance for terms that appear often in a document but rarely overall.
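
A hedged sketch of TF-IDF scoring: each document's score is the sum, over the query terms, of the term's count in the document times log(N/n). The raw-count TF, natural logarithm, and function name are assumptions; real engines usually add length normalization and other refinements.

```python
import math
from collections import Counter

def tf_idf_scores(query, docs):
    """Score each document against the query with a basic TF-IDF sum."""
    tokenized = {doc_id: text.lower().split() for doc_id, text in docs.items()}
    n_docs = len(tokenized)
    scores = {}
    for doc_id, words in tokenized.items():
        counts = Counter(words)
        score = 0.0
        for term in query.lower().split():
            # Document frequency: how many documents contain the term.
            df = sum(1 for other in tokenized.values() if term in other)
            if df:
                score += counts[term] * math.log(n_docs / df)
        scores[doc_id] = score
    return scores

# Hypothetical sample documents.
docs = {
    1: "database index database",
    2: "database search",
    3: "web search engine",
}
scores = tf_idf_scores("database index", docs)
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)  # [1, 2, 3]
```

Document 1 ranks first because it repeats "database" and is the only one containing the rarer term "index".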

Stop Words

Common words like “and” or “the” that are ignored in search indexing to improve efficiency.
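
For illustration, stop-word removal can be a simple filter applied before indexing. The word list below is a tiny made-up sample, not any engine's actual list:

```python
# Tiny illustrative stop-word list (an assumption, not a standard list).
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in"}

def remove_stop_words(text):
    """Return the lowercase words of text with stop words dropped."""
    return [word for word in text.lower().split() if word not in STOP_WORDS]

print(remove_stop_words("The history of the web and search"))
# ['history', 'web', 'search']
```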

Stemming

Reduces words to their base form (e.g., “running” becomes “run”) to standardize search terms.
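
A deliberately naive suffix-stripping sketch to show the idea. It is not the Porter stemmer or any production algorithm; the suffix list and the doubled-consonant rule are assumptions chosen to handle a few common cases:

```python
SUFFIXES = ("ing", "ed", "es", "s")  # checked in this order

def naive_stem(word):
    """Strip one common suffix, keeping at least three leading letters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            stem = word[: -len(suffix)]
            # Undo a doubled final consonant: "runn" -> "run".
            if len(stem) >= 2 and stem[-1] == stem[-2] and stem[-1] not in "aeiouls":
                stem = stem[:-1]
            return stem
    return word

print(naive_stem("running"))   # run
print(naive_stem("searched"))  # search
print(naive_stem("indexes"))   # index
```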

Relevance Ranking

The process of ordering search results based on how closely they match the query terms.

MySQL FULLTEXT Index

A MySQL index type that enables text-based indexing and searching on CHAR, VARCHAR, or TEXT columns.

Creating FULLTEXT Index

A FULLTEXT index is defined with FULLTEXT(title, body) inside CREATE TABLE, or added to an existing table with ALTER TABLE or CREATE FULLTEXT INDEX.

MATCH...AGAINST Syntax

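The MySQL syntax for full-text searches: MATCH() lists the indexed columns, AGAINST() supplies the search string and an optional mode, and the expression yields a relevance score for each row.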

Natural Language Mode

The default MySQL search mode, which interprets the query as a free-text phrase without special operators.

Boolean Mode

A MySQL search mode that supports operators like + (must include) and - (must exclude).
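
To make the + and - semantics concrete, here is a simplified Python emulation over in-memory documents. It is an illustration only and not how MySQL evaluates queries: it ignores unprefixed (optional) terms, relevance scoring, and the index itself, and the function name is hypothetical.

```python
def boolean_search(docs, query):
    """Return IDs of documents with every +term and none of the -terms."""
    tokens = query.split()
    required = [t[1:].lower() for t in tokens if t.startswith("+")]
    excluded = [t[1:].lower() for t in tokens if t.startswith("-")]
    hits = []
    for doc_id, text in docs.items():
        words = set(text.lower().split())
        has_required = all(term in words for term in required)
        has_excluded = any(term in words for term in excluded)
        if has_required and not has_excluded:
            hits.append(doc_id)
    return sorted(hits)

# Hypothetical sample documents.
docs = {
    1: "database design basics",
    2: "mysql database tuning",
    3: "web crawler design",
}
print(boolean_search(docs, "+database -MySQL"))  # [1]
```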

Query Expansion Mode

A MySQL search mode that runs the search twice: the second pass expands the query with words taken from the most relevant documents found in the first pass.

Relevance Score

A non-negative floating-point value returned by MATCH() that indicates how closely a row matches the search query; zero means no similarity.

SQL vs Full Text Search

Ordinary SQL predicates are exact-match or pattern based, while full-text search uses an index to provide ranked, flexible, and faster querying.

Web Crawler

An automated program that browses the internet to collect and index web pages for search engines.

Web Crawling Process

Starts with known URLs, follows links recursively, respects robots.txt, and stores page data.
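
The process can be sketched as a breadth-first traversal. To keep the example self-contained it crawls a made-up in-memory link graph instead of the network; fetch_links and is_allowed are hypothetical callbacks standing in for page download plus link extraction and for a robots.txt check.

```python
from collections import deque

def crawl(seed_urls, fetch_links, is_allowed, max_pages=100):
    """Visit pages breadth-first from the seeds, skipping disallowed URLs."""
    frontier = deque(seed_urls)  # URLs waiting to be fetched
    seen = set(seed_urls)        # URLs already queued, to avoid loops
    visited = []                 # URLs whose page data would be stored
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        if not is_allowed(url):
            continue
        visited.append(url)
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return visited

# Hypothetical link graph: URL -> links found on that page.
FAKE_WEB = {
    "http://example.com/": ["http://example.com/a", "http://example.com/private"],
    "http://example.com/a": ["http://example.com/", "http://example.com/b"],
    "http://example.com/private": ["http://example.com/secret"],
    "http://example.com/b": [],
}
order = crawl(
    ["http://example.com/"],
    fetch_links=lambda url: FAKE_WEB.get(url, []),
    is_allowed=lambda url: "/private" not in url,
)
print(order)
# ['http://example.com/', 'http://example.com/a', 'http://example.com/b']
```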

robots.txt

A file at the root of a website that tells web crawlers which paths they are allowed or disallowed to access.
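
As a sketch, Python's standard urllib.robotparser module can evaluate such rules. The rules, the crawler name "MyCrawler", and the URLs below are made up for the example, and the rules are parsed from a string rather than fetched from a site:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content for example.com.
rules = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# A polite crawler checks each URL before fetching it.
print(parser.can_fetch("MyCrawler", "http://example.com/index.html"))         # True
print(parser.can_fetch("MyCrawler", "http://example.com/private/data.html"))  # False
```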

Duplicate Detection

Uses hashing to detect and ignore identical pages during crawling.
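
A minimal sketch of hash-based duplicate detection using SHA-256 from the standard library. Collapsing whitespace before hashing is an assumption added here so that trivially reformatted copies still match; the function names and sample pages are illustrative.

```python
import hashlib

def content_fingerprint(html):
    """Hash the page content after collapsing runs of whitespace."""
    normalized = " ".join(html.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

def drop_duplicates(pages):
    """Keep the URL of the first page seen for each distinct fingerprint."""
    seen = set()
    unique = []
    for url, html in pages:
        fingerprint = content_fingerprint(html)
        if fingerprint not in seen:
            seen.add(fingerprint)
            unique.append(url)
    return unique

# Hypothetical crawl results: (URL, page content).
pages = [
    ("http://example.com/a", "<p>Hello world</p>"),
    ("http://example.com/a?ref=1", "<p>Hello   world</p>\n"),
    ("http://example.com/b", "<p>Other</p>"),
]
print(drop_duplicates(pages))
# ['http://example.com/a', 'http://example.com/b']
```

Exact hashing only catches identical content; near-duplicates with small textual differences need other techniques.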

Recrawling

Revisiting web pages periodically to ensure index freshness and accuracy.

Search Engine Challenges

Issues like broken links, duplicate pages, and misleading or spam content.

Link Analysis

Technique where pages gain ranking based on how many other reputable pages link to them.

PageRank

Google’s algorithm that ranks pages based on incoming links weighted by their own importance.

PageRank Formula

PR(A) = (1-d) + d(PR(A1)/C(A1) + … + PR(An)/C(An)), where A1…An are the pages that link to A, C(Ai) is the number of outgoing links on page Ai, and d is the damping factor (~0.85).
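
The formula can be applied iteratively until the values settle. The sketch below assumes every page starts at 1.0, runs a fixed number of iterations, and uses a made-up three-page graph in which every page has at least one outgoing link; the function name is illustrative.

```python
def pagerank(links, d=0.85, iterations=50):
    """Iterate PR(A) = (1 - d) + d * sum(PR(T) / C(T)) over pages T linking to A."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    pr = {page: 1.0 for page in pages}  # assumption: start every page at 1.0
    for _ in range(iterations):
        new_pr = {}
        for page in pages:
            # Each linking page passes on its rank divided by its out-link count.
            incoming = sum(
                pr[source] / len(targets)
                for source, targets in links.items()
                if page in targets
            )
            new_pr[page] = (1 - d) + d * incoming
        pr = new_pr
    return pr

# Hypothetical graph: page -> pages it links to.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(links)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

Page C ends up highest because it receives links from both A and B, while B receives only half of A's rank.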

PageRank Damping Factor

A value (typically 0.85) representing the probability a user continues following links.

Hit List

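In Google’s indexing, the list of occurrences (hits) of a particular word in a particular document, including position, font, and capitalization information.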

Barrels

Partitions of Google’s index, each holding the entries for a range of word IDs, so the index can be sorted and searched in manageable pieces.

Lexicon

A mapping of words to word IDs and their positions in the index structure.

Google Ranking Combination

Google combines PageRank (link importance) with IR relevance (TF-IDF) for results.

SQL Search Complexity

Linear time, O(N), in the number of rows when no index can be used; inefficient for large datasets.

Full Text Search Complexity

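Roughly constant time, O(1), to look up a term in a hashed dictionary, plus time proportional to the number of matching postings; efficient because the index avoids scanning every document.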

Difference Between IR and DBMS Indexing

IR indexes map terms to documents and support relevance ranking; DBMS indexes support exact and range lookups on column values, such as primary keys.

Crawling Freshness

The measure of how up-to-date a search index is compared to the live web.

Ranking Factors

Combination of TF-IDF scores, PageRank, and metadata relevance in modern search engines.

Full Text Search Example Query

SELECT * FROM articles WHERE MATCH(title, body) AGAINST('database' IN NATURAL LANGUAGE MODE);

Boolean Mode Example

SELECT * FROM articles WHERE MATCH(title, body) AGAINST('+database -MySQL' IN BOOLEAN MODE);

Query Expansion Example

SELECT * FROM articles WHERE MATCH(title, body) AGAINST('database' WITH QUERY EXPANSION);

Relevance Ranking Advantage

Ensures that more meaningful results appear higher in search output.

Information Retrieval Goal

Retrieve the most relevant documents efficiently based on the user’s search intent.