Information Retrieval in Biomedicine: Quick Reference Notes
IR Overview
- IR (search) = acquisition, organization, and searching of knowledge-based information.
- Biomedical IR now includes multimedia content beyond text (images, sequences, etc.).
- Two main information categories:
- Patient-specific information (health records, mobile/wearable data)
- Knowledge-based information (research results, guidelines, consumer health info)
- IR goal: find content that meets a user’s information need. Two core processes: indexing assigns metadata to content; retrieval matches user queries against that metadata to fetch items.
- Metadata = data about data; used by search engines to match queries to content.
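The two-step process above (indexing assigns terms, retrieval matches queries against them) can be sketched with a toy inverted index; the documents and queries below are invented for illustration:

```python
from collections import defaultdict

# Toy document collection (invented titles).
docs = {
    1: "aspirin therapy for myocardial infarction",
    2: "statin therapy and cholesterol",
    3: "aspirin and stroke prevention",
}

def build_index(docs):
    """Indexing step: map each word to the set of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

def retrieve(index, query):
    """Retrieval step: return documents containing every query word."""
    result = None
    for word in query.lower().split():
        postings = index.get(word, set())
        result = postings if result is None else result & postings
    return result or set()

index = build_index(docs)
print(retrieve(index, "aspirin therapy"))  # {1}
```

Real systems index far richer metadata (controlled vocabulary terms, fields, weights), but the index-then-match structure is the same.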
Content Categories in Health and Biomedicine
- Patient-specific vs knowledge-based information (definitions above).
- Knowledge-based content derived from observational/experimental research; supports evidence for individual care and broader knowledge.
The IR Process and Content Types
- IR process relies on indexing (metadata assignment) and retrieval (user query + matching).
- Content can be categorized into four main types:
- Bibliographic content (citations/pointers to literature; e.g., MEDLINE)
- Full-text content (online journals, books, textbooks, Web sites)
- Annotated content (databases with structured data like images, omics, EBM data, trials)
- Aggregated content (collections combining multiple content types, e.g., MedlinePlus)
Lifecycle of Knowledge-Based Information
- Original research → scientific paper → peer review → acceptance/rejection → publication.
- By-products: copyright transfer to publishers; secondary publications (reviews, books) and data generation for further research.
- Data publishing: increasing emphasis on depositing underlying data in public repositories (e.g., genomics, clinical trials data).
- Preprints: early postings before peer review; growing in biomedicine, with concerns about quality; registered reports proposed to mitigate delays.
Publishing Trends and Open Science
- Internet-enabled electronic publishing lowers some costs but raises political/economic questions (who pays).
- Open Access (OA): free online access with alternative funding sources; OA Gold (journal makes articles freely available, typically funded by author-side article-processing charges) and OA Green (authors deposit manuscripts in repositories).
- NIH data-sharing policy and PMC (PubMed Central) as repository; concerns about predatory journals.
- Open science components: Open data, Open source, Open methodology, Open peer review.
Information Needs and Users
- Clinicians: four states of information need (Gorman):
- Unrecognized need, Recognized need, Pursued need, Satisfied need.
- Evidence suggests many information needs remain unmet; clinicians often rely on colleagues or textbooks when seeking answers.
- Consumers: ~80% of Internet users search for personal health information; common topics include diseases, treatments, and providers.
- Researchers: information needs studied less extensively, but growing interest in data sharing and clinical decision support (CDS) resources.
Changes in Publishing and Access
- Web/Internet transformation: nearly all journals are now electronic; access remains constrained by economics and licensing.
- OA concerns: predatory journals; need for quality controls such as the HONcode, URAC accreditation, and JAMA benchmark criteria.
- Preprints and rapid dissemination have contested impacts on medicine (COVID-19 era highlighted both speed and risks).
Quality and Misinformation in Knowledge-Based Information
- Misinformation on the Web and in science; quality varies by topic; responsible dissemination is critical.
- Issues: spin in journals, selective reporting of risks, search engine manipulation via data voids, paywalls, and misinformation spread via social media.
- Initiatives to improve quality: JAMA benchmark criteria, the HONcode, URAC accreditation; Retraction Watch tracks retractions.
- Publication bias (the “file drawer” problem) and conflict-of-interest concerns remain active challenges.
Quality Assurance and Standards in Web Content
- Web quality criteria (early efforts): JAMA benchmark criteria; the HONcode; URAC accreditation.
- Misconduct concerns include data fabrication, image manipulation, peer-review fraud, and “zombie science.”
- Retracted papers may still be cited; Retraction Watch maintains a database.
Content Taxonomy for Knowledge-Based Information
- Bibliographic Content
- MEDLINE as core bibliographic database; ~60 fields per record; identifiers include the PMID, PMCID (PubMed Central), and AUID/ORCID (author).
- Other major bibliographic resources: CINAHL (nursing/allied health), EMBASE; Web catalogs and aggregations (HON Select, TRIP, CISMeF); Google Scholar; RSS feeds.
- Full-Text Content
- Electronic journals/books; linking to abstracts, full text, and related resources; government content (e.g., NLM, CDC) and commercial publishers.
- Textbooks, encyclopedias, and online medical resources (e.g., NLM Bookshelf, UpToDate, Dynamed, etc.).
- Annotated Content
- Image databases (e.g., Visible Human, Open-I), omics databases (NCBI), citation databases (Web of Science, Scopus), EBM databases (Cochrane, BMJ Best Practice, UpToDate), clinical trial registries (ClinicalTrials.gov), NIH RePORTER, data set catalogs (DataMed).
- Aggregated Content
- Large, topic-focused aggregations (MedlinePlus) and model organism databases (Mouse Genome Informatics).
Indexing: Manual vs Automated
- Indexing = assignment of metadata to content to enable retrieval.
- Manual indexing
- Done by human indexers using controlled terminologies (e.g., MeSH); common for bibliographic and annotated content; follows explicit protocols.
- Automated indexing
- Computer-generated indexing (word-based), increasingly used for large/full-text content; often hybrid with manual terms.
- Controlled terminologies
- Terms, synonyms, and relationships (thesauri) to support retrieval.
Controlled Terminologies: MeSH and Others
- MeSH (Medical Subject Headings): used to index most NLM databases.
- 28,000+ headings; 90,000+ entry terms; 16 trees; hierarchical structure; synonyms; related terms.
- Features to aid retrieval: subheadings, check tags, geographic (Z) tags, publication types.
- Other thesauri: CINAHL Subject Headings (MeSH-based with domain-specific terms), EMTREE (EMBASE).
- Manual indexing in MEDLINE uses MeSH terms and subheadings; some automation supported but manual input remains important.
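MeSH's hierarchical tree structure is what makes "exploding" a heading possible: because tree numbers encode the hierarchy as dotted prefixes, a heading's descendants are exactly the headings whose tree numbers extend its own. A minimal sketch, using an illustrative fragment rather than the real MeSH vocabulary:

```python
# Illustrative MeSH-style fragment: tree number -> heading.
# (Not the actual MeSH records; structure only.)
mesh = {
    "C14": "Cardiovascular Diseases",
    "C14.280": "Heart Diseases",
    "C14.280.647": "Myocardial Ischemia",
    "C14.907": "Vascular Diseases",
    "C23": "Pathological Conditions, Signs and Symptoms",
}

def explode(tree_number, vocabulary):
    """Return the heading plus all of its descendants in the tree."""
    prefix = tree_number + "."
    return sorted(
        name for num, name in vocabulary.items()
        if num == tree_number or num.startswith(prefix)
    )

print(explode("C14.280", mesh))  # ['Heart Diseases', 'Myocardial Ischemia']
```

This prefix trick is why a search on a broad heading can automatically include documents indexed only with narrower terms.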
Metadata for Web Content: Dublin Core and RDF
- Dublin Core Metadata Initiative (DCMI): 15 elements used to describe Web resources (e.g., title, creator, subject, description, publisher, date, etc.).
- RDF (Resource Description Framework): emerging standard for external metadata, enabling linked data.
- Yahoo/dmoz (Open Directory) historically attempted human-curated directories; later largely superseded by search engines.
- Limitations include inconsistencies across manual indexing and scalability issues for Web-scale content.
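One common convention for exposing Dublin Core elements is embedding them as HTML `<meta>` tags with a `DC.` prefix; the record values below are invented:

```python
# Emit Dublin Core elements as HTML <meta> tags (a common embedding
# convention for Web pages). All field values here are made up.
record = {
    "title": "Hypertension Patient Handout",
    "creator": "Example Health Library",
    "subject": "Hypertension",
    "date": "2024-01-15",
}

def dc_meta_tags(record):
    """Render each Dublin Core element/value pair as a <meta> tag."""
    return "\n".join(
        f'<meta name="DC.{element}" content="{value}">'
        for element, value in record.items()
    )

print(dc_meta_tags(record))
```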
Manual vs Automated Indexing on the Web
- Manual indexing limitations: inconsistency, time-consuming, not scalable for billions of pages.
- Web catalogs and aggregations partly mitigate scalability issues but may introduce quality variability.
Retrieval: Exact-Match vs Partial-Match
Exact-Match Retrieval (Boolean/Search Set)
- Uses AND, OR, NOT to form document sets; common in bibliographic/annotated databases.
- PubMed Advanced Search Builder supports sets, limits, and automatic Boolean placement.
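A Boolean query such as `(hypertension AND diabetes) NOT pregnancy` evaluates directly as set operations over an inverted index's postings; the postings and document IDs below are invented:

```python
# Exact-match (Boolean) retrieval: each term denotes the set of documents
# indexed with it; AND, OR, and NOT map to set intersection, union, and
# difference. Postings here are toy data.
postings = {
    "hypertension": {1, 2, 5},
    "diabetes": {2, 3, 5},
    "pregnancy": {5},
}

both = postings["hypertension"] & postings["diabetes"]    # AND -> {2, 5}
either = postings["hypertension"] | postings["diabetes"]  # OR  -> {1, 2, 3, 5}
result = both - postings["pregnancy"]                     # NOT -> {2}
print(result)
```

Systems like PubMed build up numbered search sets the same way, then combine them with Boolean operators.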
Partial-Match Retrieval (Ranking, Vector Space)
- Natural-language-like querying; documents ranked by relevance.
- Weighting schemes: TF*IDF, where
- TF(t, d) = frequency of term t in document d
- IDF(t) = log(N / n_t), where N = number of documents in the collection and n_t = number of documents containing term t
- Document score across query terms: score(q, d) = sum over t in q of TF(t, d) × IDF(t)
- BM25 and learning-to-rank are popular ranking approaches.
- Query features in practice: phrase search, wildcards, stop word removal, stemming, synonym handling, and expansion via related terms.
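The TF*IDF scoring idea can be sketched in a few lines; this is a minimal vector-space illustration with invented documents (real systems add length normalization, BM25 saturation, etc.):

```python
import math
from collections import Counter

# Toy collection for ranking (invented text).
docs = {
    "d1": "aspirin aspirin stroke",
    "d2": "stroke rehabilitation therapy",
    "d3": "aspirin dosage guidelines",
}

tokenized = {d: text.split() for d, text in docs.items()}
N = len(docs)

def idf(term):
    """log(N / n_t): rarer terms get higher weight."""
    n_t = sum(1 for words in tokenized.values() if term in words)
    return math.log(N / n_t) if n_t else 0.0

def score(query, doc_id):
    """Sum TF * IDF over the query terms present in the document."""
    tf = Counter(tokenized[doc_id])
    return sum(tf[t] * idf(t) for t in query.split())

ranking = sorted(docs, key=lambda d: score("aspirin stroke", d), reverse=True)
print(ranking)  # 'd1' ranks first: it contains both terms, 'aspirin' twice
```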
Retrieval Systems: PubMed, Web Search, and Contextual Linking
- PubMed: automatic term mapping to MeSH, journals, phrases, and authors; supports phrase search, wildcards, limits, and Advanced Search sets.
- Web search engines (Google/Bing): natural language queries, implicit AND, some Boolean controls, location-aware results, and sponsored results (ads).
- PubMed Clinical Queries: specialized filters to retrieve best evidence for clinical questions; supports PICO-like queries.
- Infobuttons (HL7 standard): context-aware linking of patient data to knowledge resources within EHRs.
- Image retrieval: semantic/textual vs visual/content-based indexing; Open-I as a clinical image retrieval system; Google Images for medical imagery.
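Infobutton requests are typically expressed as URLs that carry EHR context as parameters. A sketch of building one, where the base URL and code values are illustrative (parameter names follow the HL7 Context-Aware Knowledge Retrieval URL convention used by services such as MedlinePlus Connect):

```python
from urllib.parse import urlencode

# Hypothetical knowledge-resource endpoint; the coded problem (ICD-10-CM
# I10, essential hypertension) stands in for context pulled from the EHR.
base = "https://knowledge.example.org/infobutton"
params = {
    "mainSearchCriteria.v.c": "I10",                      # concept code
    "mainSearchCriteria.v.cs": "2.16.840.1.113883.6.90",  # ICD-10-CM OID
    "knowledgeResponseType": "text/html",                 # desired format
}
url = base + "?" + urlencode(params)
print(url)
```

The knowledge resource parses these parameters and returns content matched to the patient context, without the clinician retyping a query.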
Evaluation of IR Systems
- Core metrics: recall (proportion of relevant documents retrieved) and precision (proportion of retrieved documents that are relevant).
- Aggregate measures: Mean Average Precision (MAP), B-Pref (for incomplete judgments), NDCG (graded relevance).
- Evaluation approaches: system-oriented (focus on IR system performance) and user-oriented (focus on user task success and satisfaction).
- TREC and other challenge evaluations provide standardized tasks, collections, and relevance judgments; biomedical tracks include Genomics, Medical Records, CDS/Precision Medicine, and Clinical Trials tracks.
- Findings: users frequently obtain answers with limited recall/precision; user studies show varying impact on knowledge and practice; technology evolves rapidly, complicating longitudinal comparisons.
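Recall, precision, and average precision (the per-query quantity averaged into MAP) can be computed directly from a ranked result list and a set of relevance judgments; the toy judgments below are invented:

```python
# Toy relevance judgments and a ranked system output.
relevant = {"d1", "d3", "d7"}
retrieved = ["d1", "d2", "d3", "d4"]

hits = relevant & set(retrieved)
recall = len(hits) / len(relevant)      # relevant docs that were retrieved
precision = len(hits) / len(retrieved)  # retrieved docs that are relevant

def average_precision(retrieved, relevant):
    """Mean of precision-at-rank at each rank where a relevant doc appears."""
    found, total = 0, 0.0
    for rank, doc in enumerate(retrieved, start=1):
        if doc in relevant:
            found += 1
            total += found / rank
    return total / len(relevant) if relevant else 0.0

print(recall, precision, average_precision(retrieved, relevant))
```

MAP is simply this average precision averaged over a set of test queries, as in TREC-style evaluations.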
Research Directions and Conclusions
- Ongoing needs: lower user effort in busy clinical settings; extract new knowledge from large corpora; ensure high-quality health information for consumers.
- Key research areas:
- Information extraction and text mining (NLP) to extract facts from text.
- Summarization to generate concise abstracts/overviews.
- Question-answering to provide direct answers beyond document retrieval.
- Conclusions: IR and digital libraries have advanced substantially, but significant challenges remain in accessibility, accuracy, interoperability, and preserving trust in biomedical information.
Takeaways for Last-Minute Review
- Know the two main information types and the four content categories (bibliographic, full-text, annotated, aggregated).
- Understand the two IR processes: indexing (metadata) and retrieval (match + rank).
- Recall the core retrieval models: exact-match (Boolean) vs partial-match (ranking with TF*IDF, BM25, learning-to-rank).
- Be able to explain MeSH and Dublin Core roles in indexing/metadata.
- Recognize PubMed features (term mapping to MeSH, phrase search, limits, advanced search, clinical queries).
- Be aware of quality issues (misinformation, predatory OA, retractions, COI) and standards (HON, URAC).
- Remember evaluation metrics (recall, precision, MAP, NDCG) and the role of challenge evaluations (TREC).