Vocabulary terms and definitions related to information retrieval, search engine architecture, and web crawling processes.
Information Retrieval
A field concerned with the structure, analysis, organisation, storage, searching, and retrieval of information, primarily dealing with unstructured text. A key distinction is between a document, which is mostly free text, and a database record, which has well-defined fields.
Topical relevance
One of the two components of relevance in retrieval models (the other is user relevance). A document is topically relevant when it is about the same subject as the query.
UIMA
Unstructured Information Management Architecture; an example of a standard framework for integrating the components of a search engine architecture.
Effectiveness
A measure of search quality where the goal is to retrieve the most relevant set of documents possible.
Efficiency
A measure of search speed focusing on processing user queries as quickly as possible.
Web crawler
A program that follows links to discover and download new pages. Handling the huge volume of new and changing pages is a major challenge.
Can be restricted to a single site for site search, or focused on specific topics.
Feeds
Provide access to a real-time stream of documents. The search engine acquires documents by monitoring feeds: a reader watches each feed and, when something new appears, it is indexed automatically.
Conversion
Documents are rarely in plain text, so search engines need them converted to a consistent format. Each crawler does this in its own way.
Document Data Store
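A database system used to manage a large number of documents and their associated structured data. Document contents are typically stored compressed; the structured data consists of metadata and other extracted information such as links and anchor text. Optimised for fast retrieval of a document by its identifier.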
Parser
A component that processes the sequence of text tokens using knowledge of markup language syntax to identify document structure.
Stopping
The process of removing common function words (such as "the", "of", "to") from the token stream to reduce index size, usually with little effect on search effectiveness. Deciding which words belong on the stop-word list can be hard.
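A minimal sketch of stopping, assuming a deliberately tiny, hypothetical stop-word list (`STOP_WORDS`) and whitespace tokenisation; real lists are much longer and vary by application.

```python
# Hypothetical stop-word list for illustration only.
STOP_WORDS = {"the", "a", "an", "of", "to", "in", "and", "is"}

def remove_stop_words(tokens):
    """Return the tokens that are not on the stop-word list (case-insensitive)."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]

tokens = "the structure of the web is a graph".split()
filtered = remove_stop_words(tokens)
print(filtered)
```

Only the content-bearing tokens remain, so fewer terms need to be stored in the index.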
Stemming
The process of grouping words derived from a common stem (for example "fish", "fishes", "fishing") and replacing them with a single form, to increase the chance that query terms match document terms.
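A rough sketch of the idea behind stemming. This is a crude suffix stripper written for illustration, not the Porter stemmer or any other published algorithm; the suffix list and the minimum stem length of 3 are assumptions.

```python
# Hypothetical suffix list, checked in this order.
SUFFIXES = ("ing", "ed", "es", "s")

def crude_stem(word):
    """Strip the first matching suffix if at least 3 characters remain."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

stems = [crude_stem(w) for w in ["indexes", "indexed", "indexing"]]
print(stems)
```

All three variants reduce to the same index term, so a query containing one of them can match documents containing the others.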
Link extraction
Identifies the links and anchor text in a document.
Link analysis: links indicate the popularity and authority of a page.
Anchor text: enhances the text content of the page that the link points to.
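A small sketch of link extraction using Python's standard `html.parser` module. The class name `LinkExtractor` and the sample page are hypothetical; a production crawler would also resolve relative URLs and cope with malformed markup.

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect (href, anchor text) pairs from <a> elements."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None   # href of the <a> currently open, if any
        self._text = []     # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self._href = href
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None

page = '<p>See the <a href="http://example.com/ir">IR course notes</a> and <a href="/faq">FAQ</a>.</p>'
extractor = LinkExtractor()
extractor.feed(page)
extractor.close()
print(extractor.links)
```

The extracted URLs feed link analysis, while the anchor text can be indexed as extra text for the target page.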
Classifier
A tool that identifies class-related metadata and assigns labels to documents representing topic categories like sport, spam, or advertising.
Document statistics
Gathers and records statistics about words. The ranking component uses this information, and the statistics are stored in a lookup table.
tf.idf weighting
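A weighting scheme that combines term frequency (tf) within a document with inverse document frequency (idf) across the collection. It gives high weights to terms that occur often in a document but in very few documents overall. Calculated during the indexing process.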
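A sketch of one common tf.idf variant, raw term frequency multiplied by ln(N / df). Many other variants exist, and the three-document collection below is made up for illustration.

```python
import math
from collections import Counter

# Hypothetical toy collection: document id -> list of index terms.
docs = {
    "d1": ["web", "crawler", "web", "pages"],
    "d2": ["web", "search", "engine"],
    "d3": ["index", "search", "pages"],
}

def tf_idf(term, doc_id):
    """Raw term frequency times ln(N / document frequency)."""
    tf = Counter(docs[doc_id])[term]
    df = sum(1 for tokens in docs.values() if term in tokens)
    if tf == 0 or df == 0:
        return 0.0
    return tf * math.log(len(docs) / df)

print(tf_idf("crawler", "d1"), tf_idf("web", "d1"))
```

In `d1`, "crawler" outweighs "web" even though "web" occurs twice, because "crawler" appears in only one of the three documents.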
Inversion
The core of the indexing process that changes the stream of document-term information into term-document information.
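A minimal sketch of inversion, assuming the document-term stream arrives as (document id, term list) pairs. The function name `invert` and the data are hypothetical, and real indexes also store counts or positions and are built in compressed form.

```python
from collections import defaultdict

# Hypothetical document-term stream produced by text transformation.
doc_terms = [
    ("d1", ["web", "crawler"]),
    ("d2", ["web", "search"]),
    ("d3", ["search", "index"]),
]

def invert(doc_term_stream):
    """Turn document -> terms information into term -> documents lists."""
    index = defaultdict(list)
    for doc_id, terms in doc_term_stream:
        for term in dict.fromkeys(terms):  # de-duplicate, keep first-seen order
            index[term].append(doc_id)
    return dict(index)

inverted = invert(doc_terms)
print(inverted)
```

Looking up a query term now returns its list of documents directly, without scanning every document.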
Query input
Provides an interface and a parser for the query language. Most search engines use a simple query language with only a small number of operators.
Query transformation
Tokenisation, stopping and stemming must be done on the query text to produce index terms that are comparable to the document terms in the index.
Logging
Logs of users' queries and their interactions with the search engine are a valuable source of information for tuning and improving the engine, for example by providing query and relevance judgment pairs.
Results Output
The component responsible for constructing the display of ranked documents. It generates snippets to summarise retrieved documents, highlights important words or passages, and can cluster the output to identify related groups of documents.
Weighting
Calculates term weights using document statistics and stores them in lookup tables. Weights could be calculated during query processing, but it is better to calculate them during indexing so that queries run faster.
Performance Optimisation
Designs ranking algorithms to decrease response time and increase query throughput. Scores can be calculated term-at-a-time or document-at-a-time.
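A sketch contrasting the two scoring orders, assuming a hypothetical in-memory index that maps each term to per-document weights and a score that is simply the sum of the matching weights. Real engines read sorted, compressed posting lists and step through them in parallel for document-at-a-time scoring.

```python
from collections import defaultdict

# Hypothetical inverted index: term -> {doc_id: precomputed weight}.
index = {
    "web": {"d1": 0.8, "d2": 0.4},
    "search": {"d2": 0.5, "d3": 0.7},
}

def term_at_a_time(query_terms):
    """Process one posting list fully before the next, using accumulators."""
    scores = defaultdict(float)
    for term in query_terms:
        for doc_id, weight in index.get(term, {}).items():
            scores[doc_id] += weight
    return dict(scores)

def document_at_a_time(query_terms):
    """Compute the complete score of one document before the next."""
    postings = [index.get(term, {}) for term in query_terms]
    candidates = sorted(set().union(*postings))
    return {d: sum(p.get(d, 0.0) for p in postings) for d in candidates}

query = ["web", "search"]
print(term_at_a_time(query))
print(document_at_a_time(query))
```

Both orders give the same scores; they differ in memory use and in how easily low-scoring documents can be skipped early.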
Distribution
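Ranking can be distributed across multiple machines. A query broker decides how to allocate queries to processors and assembles the final ranked list. Caching keeps indexes or results of previous queries in memory to speed up repeated queries.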
Evaluations
Covers logging, ranking analysis, and performance analysis. Ranking analysis uses log data or relevance judgment pairs to measure and compare the effectiveness of ranking algorithms.
The performance analysis component monitors and improves overall system performance.
Crawls and Feeds
Crawling is finding and downloading web pages automatically. The pages are usually not under the control of the people building the search engine database.
Retrieving web pages
A client program connects to a DNS server, which translates the hostname into an IP address. The program then attempts to connect to the server computer at that IP address. Once the connection is established, the client sends an HTTP request to the web server to request a page.
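A sketch of the client side of these steps, built without touching the network. The helper name `build_get_request` is hypothetical, default ports are assumed, and `www.example.com` is a placeholder host.

```python
from urllib.parse import urlsplit

def build_get_request(url):
    """Split a URL into the host to resolve, the port to connect to,
    and the text of a simple HTTP/1.1 GET request."""
    parts = urlsplit(url)
    host = parts.hostname
    port = parts.port or (443 if parts.scheme == "https" else 80)
    path = parts.path or "/"
    if parts.query:
        path += "?" + parts.query
    request = f"GET {path} HTTP/1.1\r\nHost: {host}\r\nConnection: close\r\n\r\n"
    return host, port, request

host, port, request = build_get_request("http://www.example.com/index.html")
print(host, port)
print(request)
```

A crawler would next resolve `host` through DNS (for example with `socket.gethostbyname(host)`), open a TCP connection to that address on `port`, and send `request`.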
Web crawler
Some servers are not very powerful and could spend all their time handling requests from crawlers. To be polite, the crawler splits its request queue into a single queue per server, so that no one server is flooded with requests.
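A minimal sketch of a polite request queue, assuming one FIFO queue per host and a fixed delay between requests to the same host. The class name `PoliteFrontier` and the 5-second default are hypothetical, and the current time is passed in explicitly to keep the example deterministic.

```python
from collections import defaultdict, deque
from urllib.parse import urlsplit

class PoliteFrontier:
    """URL frontier with one queue per server and a per-server delay."""

    def __init__(self, delay_seconds=5.0):
        self.delay = delay_seconds
        self.queues = defaultdict(deque)        # host -> pending URLs
        self.next_allowed = defaultdict(float)  # host -> earliest next fetch time

    def add(self, url):
        self.queues[urlsplit(url).hostname].append(url)

    def next_url(self, now):
        """Return a URL from any host that may be contacted at time `now`."""
        for host, queue in self.queues.items():
            if queue and now >= self.next_allowed[host]:
                self.next_allowed[host] = now + self.delay
                return queue.popleft()
        return None  # every host with pending URLs is still waiting

frontier = PoliteFrontier(delay_seconds=5.0)
for u in ["http://a.com/1", "http://a.com/2", "http://b.com/1"]:
    frontier.add(u)
print(frontier.next_url(0.0), frontier.next_url(0.0), frontier.next_url(1.0))
```

The second call skips `a.com` because it was contacted a moment earlier, so the crawler stays busy with other servers instead of repeatedly hitting the same one.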
Focused Crawling
A less expensive approach than crawling the whole web. It relies on the fact that pages tend to link to other pages on the same topic, uses a number of popular pages as seeds, and uses text classifiers to determine what each page is about.