Search Engines, Spiders, and Crawlers

Description and Tags

Vocabulary terms and definitions related to information retrieval, search engine architecture, and web crawling processes.

Last updated 8:00 PM on 5/14/26

29 Terms

1

Information Retrieval

A field concerned with the structure, analysis, organisation, storage, searching, and retrieval of information, primarily dealing with unstructured text. A key distinction is between a document, which is mostly free text, and a database record, which has well-defined fields.

2

Topical relevance

One of the two components of relevance in retrieval models (the other is user relevance). A document is topically relevant to a query if it is about the same subject.

3

UIMA

Unstructured Information Management Architecture; a framework that provides a standard for integrating search and related language technology components into a search engine architecture.

4

Effectiveness

A measure of search quality where the goal is to retrieve the most relevant set of documents possible.

5

Efficiency

A measure of search speed focusing on processing user queries as quickly as possible.

6

Web crawler

A program that follows links to discover and download new pages. It can be restricted to a single site (for site search) or focused on specific topics.

A key challenge is handling the huge volume of new pages.

7

Feeds

Provide access to real-time streams of documents. The search engine acquires documents by monitoring feeds: a reader watches each feed and, when something new appears, it is indexed automatically.

8

Conversion

Documents are rarely in plain text, so search engines require them to be converted to a consistent format. Each crawler does this in its own way.

9

Document Data Store

A database system used to manage a large number of documents and their associated structured data. Document contents are typically stored compressed, the structured data consists of metadata, and the store is optimized for search and retrieval.

10

Parser

A component that processes the sequence of text tokens using knowledge of markup language syntax to identify document structure.

11

Stopping

The process of removing common function words (such as "the", "of", and "to") from the token stream to reduce index size, usually with little effect on search effectiveness. It can be hard to decide which words belong on the stop-word list.

12

Stemming

The process of grouping words derived from a common stem (for example "fish", "fishes", and "fishing") and replacing them with a single word, which increases the likelihood that query words match document words.

13

Link extraction

  • Link analysis - provides information about the popularity and authority of a page

  • Anchor text - enhances the text content of the page that the link points to
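
As a minimal sketch of link extraction, assuming pages are HTML and that the standard-library `html.parser` is good enough for the markup, the snippet below collects each link target together with its anchor text. The names `LinkExtractor` and `extract_links` are illustrative, not part of any search engine.

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collects (href, anchor text) pairs from <a> tags."""

    def __init__(self):
        super().__init__()
        self.links = []      # finished (href, anchor_text) pairs
        self._href = None    # href of the <a> tag currently open, if any
        self._text = []      # text fragments seen inside that tag

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            # Collapse runs of whitespace in the anchor text.
            anchor = " ".join("".join(self._text).split())
            self.links.append((self._href, anchor))
            self._href = None


def extract_links(html):
    parser = LinkExtractor()
    parser.feed(html)
    parser.close()
    return parser.links
```

The anchor text would be stored with the page the link points to, while the link itself feeds link analysis.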

14

Classifier

A tool that identifies class-related metadata and assigns labels to documents representing topic categories like sport, spam, or advertising.

15

Document statistics

Gathers and records statistics about words. The ranking component then uses this information, and the statistics are stored in lookup tables.

16

tf.idf weighting

A weighting scheme that gives high weights to terms that occur frequently in a document (tf) but in very few documents across the collection (idf). The weights are calculated during the indexing process.
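
A small worked sketch of tf.idf may help. There are many variants; this one assumes term frequency normalised by document length and idf computed as log(N / df), which is only one common choice. The function name `tf_idf` is illustrative.

```python
import math
from collections import Counter


def tf_idf(docs):
    """docs: dict of doc_id -> list of tokens. Returns doc_id -> {term: weight}."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter()
    for tokens in docs.values():
        df.update(set(tokens))

    weights = {}
    for doc_id, tokens in docs.items():
        tf = Counter(tokens)
        weights[doc_id] = {
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        }
    return weights
```

With this variant, a term that appears in every document gets weight 0, while a term confined to one document gets the largest idf.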

17

Inversion

The core of the indexing process that changes the stream of document-term information into term-document information.
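
Inversion can be sketched in a few lines, assuming the document-term information arrives as (doc_id, term) pairs and that a plain in-memory dictionary is an acceptable stand-in for a real index. The function name `invert` is illustrative.

```python
from collections import defaultdict


def invert(doc_terms):
    """Turn (doc_id, term) pairs into term -> sorted list of doc_ids."""
    postings = defaultdict(set)
    for doc_id, term in doc_terms:
        postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}
```

For example, the stream (1, "web"), (1, "crawler"), (2, "web") becomes "web" -> [1, 2] and "crawler" -> [1].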

18

Query input

Provides an interface and a parser for a query language. Most search engines use a simple query language with only a small number of operators.

19

Query transformation

Tokenisation, stopping and stemming must be done on the query text to produce index terms that are comparable to the document terms in the index.
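
The sketch below shows why the same pipeline must run on both queries and documents. The stop-word list and the suffix-stripping rule are deliberately crude assumptions for illustration, not a real stemmer such as Porter's.

```python
import re

STOP_WORDS = {"the", "of", "a", "an", "and", "to", "in", "for"}  # illustrative only
SUFFIXES = ("ing", "s")  # crude suffix stripping, not a real stemmer


def tokenise(text):
    """Lower-case the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def stem(token):
    """Strip the first matching suffix if at least three characters remain."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def to_index_terms(text):
    """Tokenise, remove stop words, then stem what is left."""
    return [stem(t) for t in tokenise(text) if t not in STOP_WORDS]
```

Because both sides pass through `to_index_terms`, a query containing "crawling" and a document containing "crawls" end up with the same index term.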

20

Logging

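Recording users' queries and their interactions with the results. These logs are a valuable source of information for tuning and improving search engines, and they can be used to derive the query and relevance judgment pairs needed for evaluation.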

21

Results Output

The component responsible for constructing the display of ranked documents. It generates snippets to summarise the retrieved documents, highlights important words or passages, and may cluster the output to identify groups of related documents.

22

Weighting

Calculates term weights using the document statistics and stores them in lookup tables. Weights could be calculated during query processing, but it is better to calculate them during indexing.

23

Performance Optimisation

Designing ranking algorithms to decrease response time and increase query throughput. Scores can be calculated term-at-a-time or document-at-a-time.
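
The two scoring orders can be contrasted with a simplified sketch. It assumes a toy index of the form term -> {doc_id: weight} and a score that is just the sum of matching weights. A real document-at-a-time implementation would walk sorted posting lists in parallel rather than use dictionary lookups.

```python
from collections import defaultdict


def score_term_at_a_time(index, query_terms):
    """Process one posting list at a time, accumulating partial scores."""
    accumulators = defaultdict(float)
    for term in query_terms:
        for doc_id, weight in index.get(term, {}).items():
            accumulators[doc_id] += weight
    return dict(accumulators)


def score_document_at_a_time(index, query_terms):
    """Compute each document's complete score before moving to the next."""
    lists = [index.get(term, {}) for term in query_terms]
    candidates = sorted(set().union(*lists))
    return {
        doc_id: sum(postings.get(doc_id, 0.0) for postings in lists)
        for doc_id in candidates
    }
```

Both orders produce the same scores; they differ in memory use (accumulators for every candidate versus one document at a time) and in which optimisations they allow.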

24

Distribution

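Ranking can be distributed across multiple machines. A query broker allocates queries to the machines and assembles the final ranked list, and caching keeps results from previous queries in memory so they can be reused.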

25

Evaluations

  • Logging - records queries and user interactions

  • Ranking analysis - uses log data or relevance judgment pairs to measure and compare the effectiveness of ranking algorithms

  • Performance analysis - monitors and improves overall system performance

26

Crawls and Feeds

Crawling is finding and downloading web pages automatically. A key difficulty is that web pages are usually not under the control of the people building the search engine database.

27

Retrieving web pages

The client program connects to a DNS server, which translates the hostname into an IP address. The program then attempts to connect to the server computer at that IP address. Once the connection is established, the client sends an HTTP request to the web server to request the page.
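
The steps above can be sketched with the standard library. This is a simplified illustration that assumes plain HTTP (no TLS) and a minimal GET request. The function names `build_request` and `fetch` are made up for this example, and any hostname passed in is up to the caller.

```python
import socket
from urllib.parse import urlsplit


def build_request(url):
    """Return (hostname, port, request_text) for a simple HTTP/1.1 GET."""
    parts = urlsplit(url)
    host = parts.hostname
    port = parts.port or (443 if parts.scheme == "https" else 80)
    path = parts.path or "/"
    if parts.query:
        path += "?" + parts.query
    # The Host header carries the port only when it is not a default one.
    host_header = host if port in (80, 443) else f"{host}:{port}"
    request = f"GET {path} HTTP/1.1\r\nHost: {host_header}\r\nConnection: close\r\n\r\n"
    return host, port, request


def fetch(url):
    """Plain-HTTP fetch: resolve the name, connect, send the request, read the reply."""
    host, port, request = build_request(url)
    ip_address = socket.gethostbyname(host)  # DNS lookup: hostname -> IP address
    with socket.create_connection((ip_address, port)) as conn:
        conn.sendall(request.encode("ascii"))
        chunks = []
        while True:
            data = conn.recv(4096)
            if not data:  # the server closed the connection
                break
            chunks.append(data)
    return b"".join(chunks)
```

In practice a crawler would use an HTTP library that also handles TLS, redirects, and timeouts; the raw socket version is only meant to make the DNS, connect, and request steps visible.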

28

Web crawler

Some web servers are not very powerful and could end up spending all their time handling requests from crawlers. To be polite, a crawler splits its request queue into a single queue per server, so that no one server receives too many requests at once.
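
A per-server queue can be sketched as follows. The class name `PoliteFrontier` and the five-second default delay are assumptions for illustration, and the current time is passed in explicitly so the behaviour is easy to follow.

```python
from collections import defaultdict, deque
from urllib.parse import urlsplit


class PoliteFrontier:
    """URL frontier with one FIFO queue per server and a per-server delay."""

    def __init__(self, delay_seconds=5.0):
        self.delay = delay_seconds
        self.queues = defaultdict(deque)  # hostname -> pending URLs
        self.next_allowed = {}            # hostname -> earliest next fetch time

    def add(self, url):
        self.queues[urlsplit(url).hostname].append(url)

    def next_url(self, now):
        """Return a URL whose server may be contacted at time `now`, else None."""
        for host, queue in self.queues.items():
            if queue and self.next_allowed.get(host, 0.0) <= now:
                self.next_allowed[host] = now + self.delay
                return queue.popleft()
        return None
```

While one server is in its waiting period, the crawler can still fetch from other servers, which keeps throughput up without overloading any single site.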

29

Focused Crawling

A less expensive approach than crawling the whole web. It relies on the fact that pages about a topic tend to link to other pages on the same topic. It uses a number of popular pages as seeds and uses text classifiers to determine what each page is about.
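
A toy sketch of focused crawling, assuming the web is represented as an in-memory dictionary and that a simple keyword test stands in for a trained text classifier. The names `is_on_topic` and `focused_crawl` are illustrative.

```python
from collections import deque


def is_on_topic(text, topic_words):
    """Stand-in for a text classifier: true if any topic word is in the page."""
    return bool(set(text.lower().split()) & topic_words)


def focused_crawl(pages, seeds, topic_words):
    """pages: url -> (text, outgoing links). Returns on-topic URLs in visit order."""
    frontier = deque(seeds)
    seen = set(seeds)
    on_topic = []
    while frontier:
        url = frontier.popleft()
        if url not in pages:
            continue
        text, links = pages[url]
        if not is_on_topic(text, topic_words):
            continue  # do not follow links from off-topic pages
        on_topic.append(url)
        for link in links:
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return on_topic
```

Links are only followed from pages judged to be on topic, which is what keeps the crawl cheaper than downloading everything reachable from the seeds.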