Information Retrieval Systems Notes

Introduction

Definition: An Information Retrieval System (IRS) is a system designed for the storage, retrieval, and maintenance of information.
Objectives: The primary goal of an IRS is to minimize the overhead for a user in locating needed information, which includes time spent on query generation, execution, scanning results, and reading non-relevant items.
Functional Overview: An IRS comprises processes for item normalization, selective dissemination of information, archival document database search, and index database search.
Relationship to DBMS: Database Management Systems (DBMS) and Information Retrieval Systems (IRS) are often confused. The key difference is that a DBMS excels at handling structured data, whereas an IRS is designed to process less structured "information" such as text.
Relationship to Digital Libraries and Data Warehouses: Information Storage and Retrieval technology addresses a subset of the issues associated with Digital Libraries. Data warehouses are similar to information storage and retrieval systems in that both need to search and retrieve information.

Information Retrieval System Capabilities

Search

Algorithms used for searching include Boolean, natural language processing, and probabilistic methods.
Probabilistic algorithms use the frequency of processing tokens (words) to determine similarities between queries and items.

Browse

Browse functions assist users in filtering search results to find relevant information.

Miscellaneous

Includes vocabulary browsing and iterative search capabilities.

Detailed Explanation of Information Retrieval System Components

Definition of IRS

An IRS is capable of storing, retrieving, and maintaining information, which can include text, images, audio, video, and other multi-media objects.
The term "item" represents the smallest complete unit processed by the system, which could be a document, video, or other media.

Objectives of IRS

The general objective is to minimize the overhead of a user locating needed information.
Overhead includes time spent on query generation, query execution, scanning results, and reading non-relevant items.
Relevant Item: An item containing the needed information; from a user's perspective, relevant and needed are synonymous.
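Precision: The fraction of retrieved items that are relevant; it drops whenever non-relevant items are retrieved.
Recall: The fraction of all relevant items in the database that are retrieved; it is not affected by the retrieval of non-relevant items, so once it reaches 100% it stays there as more items are retrieved.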
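
The two measures above can be sketched in a few lines. This is a minimal illustration, assuming items are identified by ids and that the set of relevant items is known in advance; the function name precision_recall and the sample ids are hypothetical.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a collection of retrieved item ids."""
    retrieved = set(retrieved)
    relevant = set(relevant)
    hits = retrieved & relevant  # relevant items that were actually retrieved
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


relevant = {"d1", "d2"}
# Every retrieved item is relevant and nothing relevant is missed.
print(precision_recall({"d1", "d2"}, relevant))              # (1.0, 1.0)
# Two extra non-relevant items: precision drops, recall is unchanged.
print(precision_recall({"d1", "d2", "d3", "d4"}, relevant))  # (0.5, 1.0)
```

The second call shows the behaviour described above: retrieving non-relevant items halves precision while recall stays at 100%.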

Functional Overview

A total Information Storage and Retrieval System is composed of four major functional processes:

Item Normalization:
- The first step is to normalize incoming items to a standard format.
- This involves logical restructuring of the item.
- Additional operations include:
  - Identification of processing tokens (words).
  - Characterization of tokens.
  - Stemming (removing word endings) of tokens.
- Standardizing the input translates different external formats to formats acceptable to the system.
- Multi-media normalization includes standardizing textual and multi-media inputs.
  - Video standards: MPEG-2, MPEG-1, AVI, or Real Media.
  - Audio standards: WAV or Real Media (Real Audio).
  - Image formats: range from JPEG to BMP.
- Zoning: Parsing the item into logical sub-divisions (e.g., Title, Author, Abstract) to increase search precision and optimize display.
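
The normalization steps listed above (token identification, stemming, zoning) might look roughly like the following sketch. It assumes an item arrives already split into named zones, and it uses a deliberately crude suffix stripper rather than a real stemming algorithm; tokenize, crude_stem, normalize_item and the SUFFIXES list are all hypothetical names chosen for illustration.

```python
import re

# Illustrative suffix list only; a production system would use a real stemmer.
SUFFIXES = ("ing", "ed", "es", "s")


def tokenize(text):
    """Identify processing tokens: lowercase runs of letters or digits."""
    return re.findall(r"[a-z0-9]+", text.lower())


def crude_stem(token):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def normalize_item(zones):
    """Map each zone name to its list of stemmed processing tokens."""
    return {
        zone: [crude_stem(t) for t in tokenize(text)]
        for zone, text in zones.items()
    }


item = {
    "Title": "Searching Indexed Documents",
    "Abstract": "Users searched the indexes.",
}
print(normalize_item(item))
```

Keeping the tokens grouped by zone is what lets a later search be restricted to, say, the Title zone only.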
Selective Dissemination of Information (SDI):
- Dynamically compares newly received items against standing statements of interest of users.
- Delivers the item to users whose statement of interest matches the contents of the item.
- The process is composed of a search process, user profiles, and user mail files.
- Has not yet been applied to multimedia sources.
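
As a rough sketch of the SDI flow above, the following compares one incoming item against standing profiles and files it in the matching users' mail files. It assumes a profile is simply a set of terms that must all appear in the item; real profiles are usually richer queries. The names disseminate, profiles and mail_files are hypothetical.

```python
from collections import defaultdict


def disseminate(item_id, item_tokens, profiles, mail_files):
    """Append item_id to the mail file of every user whose profile matches."""
    tokens = set(item_tokens)
    for user, required_terms in profiles.items():
        # Assumed matching rule: every profile term occurs in the item.
        if required_terms <= tokens:
            mail_files[user].append(item_id)


profiles = {"alice": {"retrieval", "index"}, "bob": {"video"}}
mail_files = defaultdict(list)

disseminate("item-1", ["index", "database", "retrieval"], profiles, mail_files)
print(dict(mail_files))  # {'alice': ['item-1']}
```

Note the direction of the comparison: the profiles are the stored "queries" and the newly received item is what gets searched.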
Document Database Search:
- Provides the capability for a query to search against all items received by the system.
- Composed of the search process, user-entered queries, and the document database.
- Items in the Document Database typically do not change once received.
Index Database Search:
- Allows users to save items of interest for future reference.
- Users can logically store an item in a file with additional index terms and descriptive text.
- Two classes of index files:
  - Public Index files: Maintained by professional library services personnel.
  - Private Index files: Maintained by individual users.

Relationship to Database Management Systems (DBMS)

Integration of DBMSs and Information Retrieval Systems is very important.
Commercial database companies have integrated the two types of systems.
Examples:
- INQUIRE DBMS: One of the first to integrate the two systems.
- ORACLE DBMS: Offers an embedded capability called CONVECTIS, using a comprehensive thesaurus to generate themes.
- INFORMIX DBMS: Can link to RetrievalWare for integration of structured data and information.

Data Warehouses

Data warehouses are focused on structured data and decision support technologies. Data mining analyzes data and extracts relationships and dependencies that were not part of the database design.

Information Retrieval System Capabilities

Search Capabilities

Objective: To map a user’s specified need to items in the information database.
Can consist of natural language text and/or query terms with Boolean logic.
Weighting: Assigning importance to search terms (values between 0.0 and 1.0).
Functions define relationships between terms (Boolean, Natural Language, Proximity, Contiguous Word Phrases, and Fuzzy Searches) and the interpretation of a particular word (Term Masking, Numeric and Date Range, Contiguous Word Phrases, and Concept/Thesaurus expansion).

Boolean Logic

Allows users to logically relate multiple concepts.
Operators: AND, OR, and NOT, implemented using set intersection, set union, and set difference.
M of N Logic: The user lists N search terms and accepts any item containing at least M of them (for example, any two of five listed terms).
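
The set-based view of the Boolean operators can be illustrated with Python sets. This sketch assumes a toy inverted index that maps each term to the set of item ids containing it; the index contents and the helper names postings and m_of_n are made up for the example.

```python
# Hypothetical inverted index: term -> ids of the items containing it.
index = {
    "computer": {1, 2, 4},
    "network": {2, 3, 4},
    "security": {4, 5},
}


def postings(term):
    """Return the set of item ids for a term (empty if the term is unknown)."""
    return index.get(term, set())


and_result = postings("computer") & postings("network")   # set intersection
or_result = postings("computer") | postings("security")   # set union
not_result = postings("network") - postings("security")   # set difference


def m_of_n(terms, m):
    """Return ids of items containing at least m of the listed terms."""
    counts = {}
    for term in set(terms):
        for item_id in postings(term):
            counts[item_id] = counts.get(item_id, 0) + 1
    return {item_id for item_id, n in counts.items() if n >= m}


print(and_result)   # {2, 4}
print(or_result)    # {1, 2, 4, 5}
print(not_result)   # {2, 3}
print(m_of_n(["computer", "network", "security"], 2))  # {2, 4}
```

Here NOT is treated as a binary "A but not B" operation, which is how set difference naturally models it.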

Proximity

Restricts the distance allowed between two search terms.
Increases the precision of a search.
Format: TERM1 within “m” “units” of TERM2 (units are Characters, Words, Sentences, or Paragraphs).
Adjacent (ADJ) operator: A special case of proximity with a distance operator of one and a forward-only direction.
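
A word-level proximity test, and ADJ as its special case, might be sketched as follows. The sketch assumes the unit is always "words" and that an item is available as a list of tokens; the function names within and adj and the ordered flag are hypothetical.

```python
def within(tokens, term1, term2, m, ordered=False):
    """True if term2 occurs within m words of term1.

    With ordered=True, term2 must come after term1 (forward only).
    """
    positions1 = [i for i, t in enumerate(tokens) if t == term1]
    positions2 = [i for i, t in enumerate(tokens) if t == term2]
    for p1 in positions1:
        for p2 in positions2:
            gap = p2 - p1
            if ordered and 0 < gap <= m:
                return True
            if not ordered and 0 < abs(gap) <= m:
                return True
    return False


def adj(tokens, term1, term2):
    """ADJ: proximity with distance one in the forward direction only."""
    return within(tokens, term1, term2, 1, ordered=True)


tokens = "the united states navy visited venetian ports".split()
print(within(tokens, "navy", "united", 2))   # True  (two words apart)
print(within(tokens, "united", "ports", 3))  # False (five words apart)
print(adj(tokens, "united", "states"))       # True
print(adj(tokens, "states", "united"))       # False (wrong direction)
```

Supporting character, sentence, or paragraph units would require the index to record those positions as well.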

Contiguous Word Phrases (CWP)

Two or more words treated as a single semantic unit (e.g., "United States of America").
Acts as a special search operator similar to the proximity operator.
Called Literal Strings in WAIS and Exact Phrases in RetrievalWare.
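
A contiguous word phrase can be checked as a run of consecutive tokens. This is a minimal sketch assuming lowercase tokens and a phrase given as a space-separated string; contains_phrase is a hypothetical helper, not an API of WAIS or RetrievalWare.

```python
def contains_phrase(tokens, phrase):
    """True if the words of phrase appear consecutively, in order, in tokens."""
    words = phrase.lower().split()
    if not words:
        return False
    n = len(words)
    return any(tokens[i:i + n] == words for i in range(len(tokens) - n + 1))


tokens = "the united states of america signed the treaty".split()
print(contains_phrase(tokens, "United States of America"))  # True
print(contains_phrase(tokens, "states united"))             # False
```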

Fuzzy Searches

Locate spellings of words similar to the entered search term.
Increases recall but decreases precision.
Includes terms with similar spellings, giving more weight to words with similar lengths and character positions.
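
One way to experiment with fuzzy term expansion is Python's standard difflib module. Its similarity ratio is only a stand-in for whatever spelling-similarity measure a given retrieval system uses; the vocabulary, the 0.8 cutoff, and the function name fuzzy_terms are assumptions made for this sketch.

```python
from difflib import get_close_matches


def fuzzy_terms(term, vocabulary, cutoff=0.8):
    """Return vocabulary words whose spelling is close to term."""
    # n is set so that every sufficiently similar word is returned.
    return get_close_matches(
        term, vocabulary, n=max(1, len(vocabulary)), cutoff=cutoff
    )


vocabulary = ["computer", "compiter", "commuter", "compute", "network"]
print(fuzzy_terms("computer", vocabulary))
```

The result includes the misspelling "compiter" (a recall gain) but also the unrelated word "commuter" (a precision loss), which is the trade-off noted above.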

Term Masking

Expands a query term by masking a portion of the term.
Types: fixed length and variable length (“don’t care” functions).
Variable length masking: Allows masking of any number of characters.
- Suffix Search: “*COMPUTER”
- Prefix Search: “COMPUTER*”
- Imbedded String Search: “*COMPUTER*”
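
The three variable-length masks can be tried out with the standard fnmatch module, which happens to use "*" for any number of characters. The "?" used below for a single masked character is fnmatch's own convention and is an assumption here, not necessarily the fixed-length mask symbol of any particular retrieval system. The vocabulary and the expand helper are hypothetical.

```python
from fnmatch import fnmatchcase

vocabulary = ["COMPUTER", "COMPUTERS", "MINICOMPUTER", "MICROCOMPUTERS", "COMPUTE"]


def expand(mask, vocab=vocabulary):
    """Return every vocabulary word matched by the mask."""
    return [word for word in vocab if fnmatchcase(word, mask)]


print(expand("*COMPUTER"))   # suffix search:   ['COMPUTER', 'MINICOMPUTER']
print(expand("COMPUTER*"))   # prefix search:   ['COMPUTER', 'COMPUTERS']
print(expand("*COMPUTER*"))  # imbedded string: all words except 'COMPUTE'
print(expand("COMPUTE?"))    # fixed length:    ['COMPUTER']
```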

Numeric and Date Ranges

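Term masking does not work for finding ranges of numbers or numeric dates. For example, the mask “125*” matches only tokens that begin with the digits 125; it cannot find all numbers larger than 125. Range searches therefore need tokens to be recognized as numbers or dates and compared by value.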
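
The contrast between masking and a value comparison can be shown with a short sketch. It assumes tokens are plain strings and only handles non-negative integers; numbers_in_range and the sample tokens are hypothetical.

```python
def numbers_in_range(tokens, low=None, high=None):
    """Return the integer tokens whose value lies in [low, high]."""
    found = []
    for token in tokens:
        if not token.isdecimal():
            continue  # not a number, so it cannot take part in a range search
        value = int(token)
        if low is not None and value < low:
            continue
        if high is not None and value > high:
            continue
        found.append(value)
    return found


tokens = ["125", "1250", "130", "99", "abc"]

# A prefix mask compares characters, not values.
print([t for t in tokens if t.startswith("125")])  # ['125', '1250']
# A range search compares values: everything larger than 125.
print(numbers_in_range(tokens, low=126))           # [1250, 130]
print(numbers_in_range(tokens, low=100, high=200)) # [125, 130]
```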

Concept/Thesaurus Expansion

Expands search terms via Thesaurus or Concept Class database reference tool.
Thesaurus: one-level or two-level expansion of a term to similar terms.
Concept Class: A tree structure that expands each meaning of a word into related concepts.
Thesauri can be semantic or based on statistics.
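
One-level and two-level thesaurus expansion can be sketched as repeated lookups. The tiny thesaurus below is invented for illustration, and expand_term is a hypothetical helper; a real thesaurus would also let the user prune unwanted terms before the search runs.

```python
# Hypothetical thesaurus: term -> similar terms.
thesaurus = {
    "car": ["automobile", "vehicle"],
    "automobile": ["car", "motorcar"],
}


def expand_term(term, levels=1):
    """Return the term plus everything reachable in the given number of levels."""
    expanded = {term}
    frontier = {term}
    for _ in range(levels):
        # Look up each newly added term and keep only terms not seen before.
        frontier = {s for t in frontier for s in thesaurus.get(t, [])} - expanded
        expanded |= frontier
    return expanded


print(sorted(expand_term("car", levels=1)))  # ['automobile', 'car', 'vehicle']
print(sorted(expand_term("car", levels=2)))  # adds 'motorcar'
```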

Natural Language Queries

Natural language queries tend to improve recall, at the cost of reduced precision when the query requires negation.

Browse Capabilities

Allow the user to determine which items are of interest and select those to be displayed.
Display options: line item status and data visualization.
Ranking: Relevance scores are normalized to a value between 0.0 and 1.0.
Collaborative filtering: Provides an option for selecting and ordering output based upon other users' queries.
Zoning: Limits what an end user needs to review from a hit item by displaying only selected zones or passages.
Highlighting: Begin the display of an item at the first highlight and allow jumping to each subsequent highlight.
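
The notes state only that relevance scores are normalized to the range 0.0 to 1.0, not how. One simple possibility, assumed here purely for illustration, is to divide each raw score by the highest score in the hit list; normalize_scores, ranked, and the sample scores are hypothetical.

```python
def normalize_scores(raw_scores):
    """Scale non-negative raw scores so the best hit gets 1.0 (assumed method)."""
    top = max(raw_scores.values(), default=0.0)
    if top <= 0:
        return {item_id: 0.0 for item_id in raw_scores}
    return {item_id: score / top for item_id, score in raw_scores.items()}


def ranked(raw_scores):
    """Return (item id, normalized score) pairs, best hit first."""
    normalized = normalize_scores(raw_scores)
    return sorted(normalized.items(), key=lambda pair: pair[1], reverse=True)


raw = {"d1": 4.0, "d2": 8.0, "d3": 2.0}
print(ranked(raw))  # [('d2', 1.0), ('d1', 0.5), ('d3', 0.25)]
```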

Miscellaneous Capabilities

Vocabulary Browse

Displays words from the document database in alphabetical order, along with a count of unique items in which the word is found.
Helps the user judge the impact of applying a fixed or variable length mask to a search term, and reveals potential misspellings.
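
A vocabulary browse listing can be sketched by counting, for each word, the number of distinct items that contain it. The sketch assumes items are available as token lists keyed by item id; vocabulary_counts and the sample items are hypothetical.

```python
from collections import Counter


def vocabulary_counts(items):
    """Return (word, number of items containing it) pairs in alphabetical order."""
    counts = Counter()
    for tokens in items.values():
        counts.update(set(tokens))  # count each word once per item
    return sorted(counts.items())


items = {
    "d1": ["computer", "network", "computer"],
    "d2": ["compute", "network"],
    "d3": ["computer"],
}
for word, item_count in vocabulary_counts(items):
    print(f"{word:10} {item_count}")
```

Seeing "compute" next to "computer" in the listing, each with its item count, is what tells the user how many extra hits a mask such as "COMPUTE*" would pull in.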

Iterative Search and Search History Log

Iterative search refines the results of a previous search.
The search history log displays all previous searches executed during the current session.
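
Iterative search and the history log could be modeled together, as in the sketch below. It assumes single-term queries against a toy inverted index and treats refinement as restricting a new search to an earlier result set; the SearchSession class and its attributes are hypothetical.

```python
class SearchSession:
    """Holds the searches run during one user session."""

    def __init__(self, index):
        self.index = index   # term -> set of item ids
        self.history = []    # (term, result set) pairs in execution order

    def search(self, term, within=None):
        """Run a search, optionally restricted to a previous result set."""
        hits = set(self.index.get(term, set()))
        if within is not None:
            hits &= within   # iterative search: refine earlier results
        self.history.append((term, hits))
        return hits


session = SearchSession({"computer": {1, 2, 4}, "network": {2, 3, 4}})
first = session.search("computer")
refined = session.search("network", within=first)

print(refined)                            # {2, 4}
print([term for term, _ in session.history])  # ['computer', 'network']
```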

Canned Query

The capability to name a query and store it to be retrieved and executed during a later user session.

Standards

Z39.50

A computer-to-computer communications standard for database searching and record retrieval.
Defines eight operation types: Init, Search, Present, Delete, Scan, Sort, Resource-report, and Extended Services.
The client (Origin) initiates the search and translates the query into a standardized format.
The server (Target) interfaces to the database and responds to requests from the Origin.

WAIS (Wide Area Information Servers)

An Internet system with specialized subject databases at multiple server locations.
The user enters a search argument for a selected database, and the client accesses all the servers on which the database is distributed.

RetrievalWare

An enterprise search engine emphasizing natural language processing and semantic networks.