Flashcards covering the fundamentals of search engine architecture, the indexing and query processes, text transformation techniques, and web crawling mechanisms.
Information Retrieval
A field concerned with the structure, analysis, storage, searching, and retrieval of information, often dealing with unstructured data such as text documents rather than the structured records of a database.
1st goal of search engines
Effectiveness: aims to retrieve the most relevant results for a user's query.
2nd goal of search engines
Efficiency: aims to process queries as quickly as possible.
Indexing Process
The method where search engines collect and store data to facilitate fast and accurate information retrieval.
Text Acquisition
The 1st phase of indexing, where the indexer opens files and acquires the raw text for the document data store.
Text Transformation
The 2nd phase of indexing, where documents are converted into index terms or features, using steps such as parsing, stopping, and stemming.
Index Creation
The 3rd phase of indexing, where the index terms produced by text transformation are used to build the index data structures (typically an inverted index) that make fast searching possible.
Crawlers/Spiders
Software programs that follow hyperlinks or anchor tags to discover and download content from different web pages.
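As a rough illustration of the link-following step, the sketch below pulls the anchor-tag targets out of one page using only the Python standard library. The names `LinkExtractor` and `extract_links` and the example URLs are made up for this card; a real crawler would also download pages, queue the links, and track which URLs it has already seen.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects the absolute URLs found in <a href="..."> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page's own URL.
                self.links.append(urljoin(self.base_url, value))


def extract_links(html, base_url):
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links


page = '<p><a href="/about">About</a> <a href="https://example.org/x">X</a></p>'
found = extract_links(page, "https://example.com/index.html")
```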
Focused/Topical Crawlers
Specialised bots designed to only discover pages relevant to a pre-defined topic, rather than searching the entire internet.
Push Feed
A document feed where the site alerts subscribers automatically whenever any updates occur.
Pull Feed
A document feed where the subscriber must periodically check the source for updates, rather than being alerted automatically.
Parser
A component that breaks down data and text into smaller, more meaningful components known as tokens based on syntax.
Tokens
The smaller, meaningful components resulting from the breakdown of data by a parser for further analysis.
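A minimal sketch of tokenisation, assuming a very simple rule: lowercase the text and treat every run of letters or digits as one token. The function name `tokenize` is hypothetical, and real parsers also handle markup, punctuation inside words, and other languages.

```python
import re


def tokenize(text):
    # Lowercase first, then keep runs of letters and digits as tokens.
    return re.findall(r"[a-z0-9]+", text.lower())


tokens = tokenize("Search engines index the Web.")
```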
Stopping
The process of filtering and removing common or low-meaning words like “the”, “is”, or “to” from indexes to enhance search efficiency.
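Stopping can be sketched as a filter over a token list. The stopword set below is a tiny illustrative sample, not a standard list, and `remove_stopwords` is a name chosen for this example.

```python
# Illustrative sample only; real stopword lists are much longer.
STOPWORDS = {"the", "is", "to", "a", "of", "and"}


def remove_stopwords(tokens):
    # Keep only the tokens that are not in the stopword set.
    return [t for t in tokens if t not in STOPWORDS]


kept = remove_stopwords(["the", "crawler", "is", "going", "to", "index", "pages"])
```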
Stemming
The process of grouping words derived from common stems, such as replacing “fish”, “fishes”, and “fishing” with the single designated word “fish”.
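The toy stemmer below only strips three hard-coded suffixes, which is enough to reproduce the "fish" example. It is a sketch under that assumption and not a real stemming algorithm; production stemmers such as the Porter stemmer apply many more rules.

```python
SUFFIXES = ("ing", "es", "s")  # checked in this order


def toy_stem(word):
    for suffix in SUFFIXES:
        # Only strip when a stem of at least 3 letters remains.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


stems = [toy_stem(w) for w in ["fish", "fishes", "fishing"]]
```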
Classifier
A component that identifies class-related metadata and assigns labels to documents to represent specific categories; clustering is the related technique used to group documents when no categories are defined in advance.
Weighting
The calculation of the relative importance of data; terms that occur frequently across documents are typically assigned a low weight.
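One common way to give frequent terms a low weight is inverse document frequency, idf = log(N / df), where N is the number of documents and df is the number of documents containing the term. The sketch below assumes that formula; the card does not say which weighting scheme is meant, and the sample documents are invented.

```python
import math


def inverse_document_frequency(docs):
    n = len(docs)
    df = {}
    for terms in docs:
        # Count each term once per document.
        for term in set(terms):
            df[term] = df.get(term, 0) + 1
    return {term: math.log(n / count) for term, count in df.items()}


docs = [
    ["search", "engine", "index"],
    ["search", "crawler"],
    ["search", "index", "ranking"],
]
idf = inverse_document_frequency(docs)
```

Here "search" occurs in every document and gets weight 0, while "crawler" occurs in only one and gets the highest weight.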
Inversion
A process that converts document-to-term information into term-to-document information to enable fast and efficient searches.
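A small sketch of inversion: the input maps each document ID to its terms, and the output maps each term to the sorted list of documents that contain it. The function name `invert` and the sample data are made up for illustration.

```python
from collections import defaultdict


def invert(doc_terms):
    index = defaultdict(set)
    for doc_id, terms in doc_terms.items():
        for term in terms:
            index[term].add(doc_id)
    # Sorted posting lists make the result deterministic.
    return {term: sorted(ids) for term, ids in index.items()}


doc_terms = {1: ["fish", "tank"], 2: ["fish", "food"], 3: ["tank", "filter"]}
inverted = invert(doc_terms)
```

Looking up `inverted["fish"]` now answers "which documents mention fish?" without scanning every document.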
Query Transformation
Modifying a user's initial search query through tokenisation, stopping, and stemming to produce index terms comparable to document terms.
Term-at-a-time scoring
A scoring method that processes the inverted list of one query term at a time, adding each term's contribution to a running (partial) score per document until all query terms have been processed.
Document-at-a-time scoring
A scoring method that computes the complete score for one document at a time by reading the inverted lists of all query terms together, so no partial scores need to be kept.
Query Broker
The component that manages incoming queries in a distributed search engine, passing each query to the machines that hold the index and combining their results into a single ranked list.
Caching
The process of storing the results of commonly asked queries to speed up future retrieval.
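A sketch of query-result caching using `functools.lru_cache` from the Python standard library. The `search` function is a stand-in that returns a fake result, and the `CALLS` counter exists only to show that the second identical query never reaches it.

```python
from functools import lru_cache

CALLS = {"count": 0}


@lru_cache(maxsize=1024)
def search(query):
    # Stand-in for an expensive ranking step.
    CALLS["count"] += 1
    return (f"results for {query}",)


first = search("deep web")
second = search("deep web")  # served from the cache
```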
Politeness Policy
A crawler setting that avoids overloading a web server: the bot fetches only one page at a time from a given server and waits between successive requests to it.
/robots.txt
A file used by server administrators to allow or disallow crawlers from entering certain pages on a website.
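Python's standard library can interpret these rules through `urllib.robotparser`. The sketch below parses an invented robots.txt from a string instead of downloading it; the bot name `ExampleBot` and the URLs are assumptions for the example. `Crawl-delay` is a widely used but non-standard directive.

```python
from urllib.robotparser import RobotFileParser

# Invented rules for illustration.
rules = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

allowed = parser.can_fetch("ExampleBot", "https://example.com/index.html")
blocked = parser.can_fetch("ExampleBot", "https://example.com/private/data.html")
delay = parser.crawl_delay("ExampleBot")  # seconds to wait between requests
```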
Deep Web
Content that is difficult for crawlers to find, such as pages requiring a login, pages reached only by submitting a form, or pages generated by scripts.
Sitemaps
Files provided by website owners that list a site's URLs together with hints such as when each page was last modified and how often it changes, helping crawlers avoid redundant downloads.
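A sketch of how a crawler might read those update hints, assuming the common XML sitemap format with `loc`, `lastmod`, and `changefreq` elements. The sitemap content and the function name `read_sitemap` are invented for the example.

```python
import xml.etree.ElementTree as ET

SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/</loc>
    <lastmod>2024-01-15</lastmod>
    <changefreq>daily</changefreq>
  </url>
  <url>
    <loc>https://example.com/archive</loc>
    <changefreq>yearly</changefreq>
  </url>
</urlset>
"""

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def read_sitemap(xml_text):
    root = ET.fromstring(xml_text)
    entries = []
    for url in root.findall("sm:url", NS):
        entries.append({
            "loc": url.findtext("sm:loc", namespaces=NS),
            # Optional elements come back as None when absent.
            "lastmod": url.findtext("sm:lastmod", namespaces=NS),
            "changefreq": url.findtext("sm:changefreq", namespaces=NS),
        })
    return entries


entries = read_sitemap(SITEMAP)
```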
The query process
The process in which a search engine takes a user's query and returns the most relevant results.
User Interaction
the 1st stage of query process where the user