Search Engines and Crawlers


Flashcards covering the fundamentals of search engine architecture, the indexing and query processes, text transformation techniques, and web crawling mechanisms.

Last updated 11:19 PM on 5/16/26

29 Terms

1

Information Retrieval

A field that includes the structure, analysis, storage, searching, and retrieving of information, often dealing with unstructured data, in contrast to the structured records handled by databases.

2

1st goal of search engines

Effectiveness: aims to retrieve the most relevant results for a user's query.

3

2nd goal of search engines

Efficiency: aims to process queries as quickly as possible.

4

Indexing Process

The method where search engines collect and store data to facilitate fast and accurate information retrieval.

5

Text Acquisition

The 1st phase of indexing, where the indexer opens files and acquires the raw text for the document data store.

6

Text Transformation

The 2nd phase of indexing that looks for patterns or repetitions to simplify documents, making them more easily categorisable.

7

Index Creation

The 3rd phase of indexing, where documents are assigned to categories, providing the format for the index results that appear following a search query.

8

Crawlers/Spiders

Software programs that follow hyperlinks or anchor tags to discover and download content from different web pages.

9

Focused/Topical Crawlers

Specialised bots designed to only discover pages relevant to a pre-defined topic, rather than searching the entire internet.

10

Push Feed

A document feed where the site alerts subscribers automatically whenever any updates occur.

11

Pull Feed

A document feed where the subscriber is required to manually retrieve the information from the source.

12

Parser

A component that breaks down data and text into smaller, more meaningful components known as tokens based on syntax.

13

Tokens

The smaller, meaningful components resulting from the breakdown of data by a parser for further analysis.

14

Stopping

The process of filtering and removing common or low-meaning words like “the”, “is”, or “to” from indexes to enhance search efficiency.

15

Stemming

The process of grouping words derived from common stems, such as replacing “fish”, “fishes”, and “fishing” with the single designated word “fish”.
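The tokenisation, stopping, and stemming steps from these cards can be sketched together as one pipeline. The stopword list and suffix rules below are illustrative toys, not a real stemmer such as Porter's:

```python
import re

# Toy stopword list for illustration only; real engines use longer lists.
STOPWORDS = {"the", "is", "to", "a", "of"}

def tokenize(text):
    # Lowercase and split on any run of non-alphanumeric characters.
    return [t for t in re.split(r"[^a-z0-9]+", text.lower()) if t]

def stem(token):
    # Crude suffix stripping: "fishes" -> "fish", "fishing" -> "fish".
    for suffix in ("ing", "es", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def transform(text):
    # Tokenise, drop stopwords, then stem what remains.
    return [stem(t) for t in tokenize(text) if t not in STOPWORDS]
```

For example, `transform("The fish is fishing")` drops "the" and "is" and maps both remaining words to the stem "fish".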

16

Classifier

A component that identifies class-related metadata and assigns labels to documents to represent specific categories using clustering techniques.

17

Weighting

The calculation of the relative importance of data; terms that occur frequently across documents are typically assigned a low weight.
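One common weighting scheme that follows this rule is tf-idf: terms occurring in many documents receive a low weight. The exact formula varies between engines; this version (raw term frequency times log of inverse document frequency) is only an illustrative sketch:

```python
import math

def tf_idf(term, doc_tokens, all_docs):
    # tf: how often the term appears in this document.
    tf = doc_tokens.count(term)
    # df: how many documents in the collection contain the term.
    df = sum(1 for d in all_docs if term in d)
    n = len(all_docs)
    # Frequent-everywhere terms get log(n/df) close to 0, i.e. low weight.
    return tf * math.log(n / df) if df else 0.0
```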

18

Inversion

A process that converts document-to-term information into term-to-document information to enable fast and efficient searches.
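A minimal sketch of inversion, assuming documents arrive as a mapping from document id to its list of index terms (the input shape is an assumption for illustration):

```python
from collections import defaultdict

def invert(docs):
    # docs: document id -> list of index terms (document-to-term).
    # Returns term -> set of document ids (term-to-document).
    index = defaultdict(set)
    for doc_id, terms in docs.items():
        for term in terms:
            index[term].add(doc_id)
    return index
```

Looking up a term in the result immediately yields every document containing it, which is what makes query evaluation fast.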

19

Query Transformation

Modifying a user's initial search query through tokenisation, stopping, and stemming to produce index terms comparable to document terms.

20

Term at a time scoring

A scoring method where each query term's postings list is processed in full before moving to the next term, accumulating partial scores per document; it needs memory for the score accumulators but reads each postings list sequentially.

21

Document at a time scoring

A scoring method where all query-term postings for one document are processed together, producing that document's complete score before moving to the next; this makes it easy to keep only the current top-ranked documents in memory.
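Term-at-a-time scoring is often sketched as an accumulator over postings lists. The postings structure and weights below are assumed purely for illustration:

```python
from collections import defaultdict

def term_at_a_time(query_terms, postings):
    # postings: term -> {doc_id: term weight in that document}.
    # Each term's postings list is walked fully before the next term,
    # accumulating partial scores per document.
    scores = defaultdict(float)
    for term in query_terms:
        for doc_id, weight in postings.get(term, {}).items():
            scores[doc_id] += weight
    return dict(scores)
```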

22

Query Broker

The component that manages all requested queries, prioritising those that are most asked or easiest to calculate.

23

Caching

The process of storing the results of commonly asked queries to speed up future retrieval.
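A minimal sketch of query-result caching; `search_fn` stands in for whatever function actually evaluates a query (an assumed placeholder, not a real API):

```python
def make_cached_search(search_fn):
    # Wrap a search function so repeated queries skip re-evaluation.
    cache = {}
    def cached(query):
        if query not in cache:
            cache[query] = search_fn(query)  # evaluate once, then reuse
        return cache[query]
    return cached
```

Real engines also bound the cache size and evict stale entries; this sketch omits both for brevity.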

24

Politeness Policy

A crawler rule that limits how often the bot requests pages from the same web server, typically by waiting a delay between successive requests, to avoid overloading it.

25

/robots.txt

A file used by server administrators to allow or disallow crawlers from entering certain pages on a website.
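Python's standard library includes a parser for such files; the rules below are an illustrative robots.txt, not a real site's:

```python
from urllib import robotparser

# Illustrative robots.txt: every crawler is barred from /private/.
rules = """
User-agent: *
Disallow: /private/
"""

rp = robotparser.RobotFileParser()
rp.parse(rules.splitlines())

# A polite crawler calls can_fetch(user_agent, url) before downloading.
```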

26

Deep Web

Content that is difficult for crawlers to find, such as pages requiring a login, form submissions, or those written in script languages.

27

Sitemaps

Files provided by site owners that list a site's pages and indicate when and how often they are updated, helping crawlers avoid redundant downloads.

28

The query process

The process by which a search engine returns the most relevant results after a user enters a query.

29

User Interaction

The 1st stage of the query process, where the user creates and refines their query and the results are displayed to them.