Information Retrieval in Biomedicine: Quick Reference Notes

IR Overview

  • IR (search) = acquisition, organization, and searching of knowledge-based information.
  • Biomedical IR now includes multimedia content beyond text (images, sequences, etc.).
  • Two main information categories:
    • Patient-specific information (health records, mobile/wearable data)
    • Knowledge-based information (research results, guidelines, consumer health info)
  • IR goal: find content that meets a user’s information needs by querying with metadata; indexing assigns metadata, retrieval uses queries to fetch items.
  • Metadata = data about data; used by search engines to match queries to content.

Content Categories in Health and Biomedicine

  • Patient-specific vs knowledge-based information (definitions above).
  • Knowledge-based content derived from observational/experimental research; supports evidence for individual care and broader knowledge.

The IR Process and Content Types

  • IR process relies on indexing (metadata assignment) and retrieval (user query + matching).
  • Content can be categorized into four main types:
    • Bibliographic content (citations/pointers to literature; e.g., MEDLINE)
    • Full-text content (online journals, books, textbooks, Web sites)
    • Annotated content (databases with structured data like images, omics, EBM data, trials)
    • Aggregated content (collections combining multiple content types, e.g., MedlinePlus)

Lifecycle of Knowledge-Based Information

  • Original research → scientific paper → peer review → acceptance/rejection → publication.
  • By-products: copyright transfer to publishers; secondary publications (reviews, books) and data generation for further research.
  • Data publishing: increasing emphasis on depositing underlying data in public repositories (e.g., genomics, clinical trials data).
  • Preprints: early postings before peer review; growing in biomedicine, with concerns about quality; registered reports proposed to mitigate delays.

Publishing Trends and Open Science

  • Internet-enabled electronic publishing lowers some costs but raises political/economic questions (who pays).
  • Open Access (OA): free online access with alternative funding sources; OA Gold (author pays) and OA Green (deposited manuscripts in repositories).
  • NIH data-sharing policy and PMC (PubMed Central) as repository; concerns about predatory journals.
  • Open science components: Open data, Open source, Open methodology, Open peer review.

Information Needs and Users

  • Clinicians: four states of information need (Gorman):
    • Unrecognized need, Recognized need, Pursued need, Satisfied need.
  • Evidence suggests many information needs remain unmet; clinicians often rely on colleagues or textbooks when seeking answers.
  • Consumers: ~80% of Internet users search for personal health information; common topics include diseases, treatments, and providers.
  • Researchers: information needs studied less extensively but growing interest in data sharing and CDS resources.

Changes in Publishing and Access

  • The Web has transformed publishing: nearly all journals are now electronic, but access remains constrained by economics and licensing.
  • OA concerns: predatory journals; quality controls such as the HONcode, URAC accreditation, and JAMA benchmark criteria.
  • Preprints and rapid dissemination have contested impacts on medicine (COVID-19 era highlighted both speed and risks).

Quality and Misinformation in Knowledge-Based Information

  • Misinformation on the Web and in science; quality varies by topic; responsible dissemination is critical.
  • Issues: spin in journals, selective reporting of risks, search engine manipulation via data voids, paywalls, and misinformation spread via social media.
  • Initiatives to improve quality: JAMA criteria, the HONcode, URAC accreditation; Retraction Watch tracks retractions.
  • Publication bias (the file-drawer problem) and conflict-of-interest concerns remain active challenges.

Quality Assurance and Standards in Web Content

  • Web quality criteria (early efforts): JAMA criteria; the HONcode; URAC accreditation.
  • Misconduct concerns include data fabrication, image manipulation, peer-review fraud, and “zombie science.”
  • Retracted papers may still be cited; Retraction Watch maintains a database.

Content Taxonomy for Knowledge-Based Information

  • Bibliographic Content
    • MEDLINE as core bibliographic database; ~60 fields per record; identifiers include PMID, PMCID, and AUID/ORCID.
    • Other major bibliographic resources: CINAHL (nursing/allied health), EMBASE; Web catalogs and aggregations (HON Select, TRIP, CISMeF); Google Scholar; RSS feeds.
  • Full-Text Content
    • Electronic journals/books; linking to abstracts, full text, and related resources; government content (e.g., NLM, CDC) and commercial publishers.
    • Textbooks, encyclopedias, and online medical resources (e.g., NLM Bookshelf, UpToDate, Dynamed, etc.).
  • Annotated Content
    • Image databases (e.g., Visible Human, Open-I), omics databases (NCBI), citation databases (Web of Science, Scopus), EBM databases (Cochrane, BMJ Best Practice, UpToDate), clinical trial registries (ClinicalTrials.gov), NIH RePORTER, data set catalogs (DataMed).
  • Aggregated Content
    • Large, topic-focused aggregations (MedlinePlus) and model organism databases (Mouse Genome Informatics).

Indexing: Manual vs Automated

  • Indexing = assignment of metadata to content to enable retrieval.
  • Manual indexing
    • Done by human indexers using controlled terminologies (e.g., MeSH); common for bibliographic and annotated content; follows explicit protocols.
  • Automated indexing
    • Computer-generated indexing (word-based), increasingly used for large/full-text content; often hybrid with manual terms.
  • Controlled terminologies
    • Terms, synonyms, and relationships (thesauri) to support retrieval.
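The word-based automated indexing described above can be sketched as a minimal inverted index; the documents and IDs here are hypothetical, and a real indexer would add stop-word removal, stemming, and positional postings:

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each token to the set of document IDs containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[token].add(doc_id)
    return index

# Toy collection (made-up texts for illustration)
docs = {
    1: "aspirin reduces myocardial infarction risk",
    2: "statin therapy and myocardial infarction",
}
index = build_inverted_index(docs)
print(sorted(index["myocardial"]))  # → [1, 2]
```

Retrieval then reduces to looking up query terms in the index rather than scanning documents, which is what makes large-scale search feasible.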

Controlled Terminologies: MeSH and Others

  • MeSH (Medical Subject Headings): used to index most NLM databases.
    • 28,000+ headings; 90,000+ entry terms; 16 trees; hierarchical structure; synonyms; related terms.
    • Features to aid retrieval: subheadings, check tags, geographic (Z) tags, publication types.
  • Other thesauri: CINAHL Subject Headings (MeSH-based with domain-specific terms), EMTREE (EMBASE).
  • Manual indexing in MEDLINE uses MeSH terms and subheadings; some automation supported but manual input remains important.
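MeSH's hierarchical trees enable the "explode" operation: searching a heading also retrieves content indexed under its narrower descendants. A sketch of the idea, using made-up tree numbers in the MeSH style (not actual MeSH codes):

```python
# Illustrative heading -> tree-number mapping; real MeSH headings can
# carry several tree numbers across the 16 trees.
mesh_trees = {
    "Heart Diseases": ["C14.280"],
    "Myocardial Infarction": ["C14.280.647"],
    "Arrhythmias, Cardiac": ["C14.280.067"],
    "Lung Diseases": ["C08.381"],
}

def explode(heading):
    """Return all headings at or below `heading` in the hierarchy."""
    prefixes = mesh_trees[heading]
    return sorted(
        term for term, nums in mesh_trees.items()
        if any(n == p or n.startswith(p + ".") for n in nums for p in prefixes)
    )

# Exploding "Heart Diseases" pulls in its descendants but not lung disease
print(explode("Heart Diseases"))
```

Descendant status is just a tree-number prefix test, which is why a single hierarchical code per concept is enough to support broad-vs-narrow searching.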

Metadata for Web Content: Dublin Core and RDF

  • Dublin Core Metadata Initiative (DCMI): 15 elements used to describe Web resources (e.g., title, creator, subject, description, publisher, date, etc.).
  • RDF (Resource Description Framework): emerging standard for external metadata, enabling linked data.
  • Yahoo/dmoz (Open Directory) historically attempted human-curated directories; later largely superseded by search engines.
  • Limitations include inconsistencies across manual indexing and scalability issues for Web-scale content.
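Dublin Core elements attach naturally to the subject–predicate–object triple model that RDF formalizes. A minimal sketch, with a hypothetical resource URL and values:

```python
# Each Dublin Core element becomes one (subject, predicate, object) triple.
resource = "https://example.org/guideline123"  # hypothetical URL
triples = [
    (resource, "dc:title", "Hypertension Management Guideline"),
    (resource, "dc:creator", "Example Health Agency"),
    (resource, "dc:subject", "Hypertension"),
    (resource, "dc:date", "2024-01-15"),
]

def get(subject, predicate):
    """Return all object values for a subject/predicate pair."""
    return [o for s, p, o in triples if s == subject and p == predicate]

print(get(resource, "dc:title"))
```

Because the metadata is external to the resource and uses shared element names, independent catalogs can merge and query descriptions of the same content, which is the linked-data motivation behind RDF.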

Manual vs Automated Indexing on the Web

  • Manual indexing limitations: inconsistency, time-consuming, not scalable for billions of pages.
  • Web catalogs and aggregations partly mitigate scalability issues but may introduce quality variability.

Retrieval: Exact-Match vs Partial-Match

  • Exact-Match Retrieval (Boolean/Search Set)
    • Uses AND, OR, NOT to form document sets; common in bibliographic/annotated databases.
    • PubMed Advanced Search Builder supports sets, limits, and automatic Boolean placement.
  • Partial-Match Retrieval (Ranking, Vector Space)
    • Natural-language-like querying; documents ranked by relevance.
    • Weighting schemes: TF*IDF
      • IDF(t) = log(N / n_t), where N = number of documents in the collection and n_t = number of documents containing term t
      • TF(t,d) = frequency of term t in document d
      • w(t,d) = TF(t,d) × IDF(t)
      • Document score across query terms: W(d,Q) = Σ over t in Q of w(t,d)
    • BM25 and learning-to-rank are popular ranking approaches.
    • Query features in practice: phrase search, wildcards, stop-word removal, stemming, synonym handling, and expansion via related terms.
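Both retrieval models can be sketched over a toy collection (hypothetical documents; a real engine adds stemming, stop words, and BM25-style weighting):

```python
import math
from collections import Counter

docs = {
    1: "aspirin aspirin heart attack",
    2: "heart failure treatment",
    3: "aspirin clinical trial",
}
tokens = {d: text.split() for d, text in docs.items()}

# Exact match (Boolean): combine the posting sets of the query terms.
def postings(term):
    return {d for d, toks in tokens.items() if term in toks}

boolean_and = postings("aspirin") & postings("heart")  # AND
boolean_or = postings("aspirin") | postings("heart")   # OR

# Partial match: score every document by W(d,Q) = sum of TF(t,d) * IDF(t),
# with IDF(t) = log(N / n_t) as defined above, then sort by score.
def rank(query):
    N = len(docs)
    df = Counter(t for toks in tokens.values() for t in set(toks))
    scores = {}
    for d, toks in tokens.items():
        tf = Counter(toks)
        scores[d] = sum(
            tf[t] * math.log(N / df[t]) for t in query.split() if df[t]
        )
    return sorted(scores, key=lambda d: -scores[d])

# Doc 1 ranks first: it matches both query terms, "aspirin" twice.
print(sorted(boolean_and), rank("aspirin heart"))
```

Note the contrast: Boolean retrieval returns an unordered set that either contains a document or not, while ranked retrieval orders the whole collection by graded relevance.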

Retrieval Systems: PubMed, Web Search, and Contextual Linking

  • PubMed: automatic term mapping to MeSH, journals, phrases, and authors; supports phrase search, wildcards, limits, and Advanced Search sets.
  • Web search engines (Google/Bing): natural language queries, implicit AND, some Boolean controls, location-aware results, and sponsored results (ads).
  • PubMed Clinical Queries: specialized filters to retrieve best evidence for clinical questions; supports PICO-like queries.
  • Infobuttons (HL7 standard): context-aware linking of patient data to knowledge resources within EHRs.
  • Image retrieval: semantic/textual vs visual/content-based indexing; Open-I as a clinical image retrieval system; Google Images for medical imagery.
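PubMed can also be queried programmatically through NCBI's E-utilities; the sketch below only constructs an esearch request URL (the query string is a made-up example using PubMed's field-tag syntax) rather than fetching it:

```python
from urllib.parse import urlencode

# NCBI E-utilities esearch endpoint for PubMed; fetching the resulting
# URL returns the matching PMIDs as XML (or JSON with retmode=json).
base = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"
params = {
    "db": "pubmed",
    # Boolean query with an explicit MeSH field tag (illustrative query)
    "term": "myocardial infarction[MeSH Terms] AND aspirin",
    "retmax": 20,
}
url = base + "?" + urlencode(params)
print(url)
```

The same term syntax works in the PubMed search box, so queries can be prototyped interactively before being automated.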

Evaluation of IR Systems

  • Core metrics: recall (proportion of relevant documents retrieved) and precision (proportion of retrieved documents that are relevant).
  • Aggregate measures: Mean Average Precision (MAP), B-Pref (for incomplete judgments), NDCG (graded relevance).
  • Evaluation approaches: system-oriented (focus on IR system performance) and user-oriented (focus on user task success and satisfaction).
  • TREC and other challenge evaluations provide standardized tasks, collections, and relevance judgments; biomedical tracks include Genomics, Medical Records, CDS/Precision Medicine, and Clinical Trials.
  • Findings: users frequently obtain answers despite limited recall/precision; user studies show varying impact on knowledge and practice; rapid technology change complicates longitudinal comparisons.
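The core metrics above can be computed directly from a ranked result list and a set of relevance judgments (the document IDs here are hypothetical):

```python
def precision_recall(retrieved, relevant):
    """Precision = |retrieved ∩ relevant| / |retrieved|;
    recall = |retrieved ∩ relevant| / |relevant|."""
    hits = len(set(retrieved) & set(relevant))
    return hits / len(retrieved), hits / len(relevant)

def average_precision(ranked, relevant):
    """Average of precision values at each rank where a relevant
    document appears, divided by the total number of relevant docs;
    MAP is this quantity averaged over a set of queries."""
    hits, total = 0, 0.0
    for i, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / i
    return total / len(relevant)

ranked = ["d1", "d2", "d3", "d4"]      # system output, best first
relevant = {"d1", "d3", "d9"}          # judged relevant (d9 was missed)
p, r = precision_recall(ranked, relevant)
ap = average_precision(ranked, relevant)
print(p, r, ap)  # 0.5, 2/3, and (1/1 + 2/3)/3 = 5/9
```

Note how the unretrieved relevant document d9 lowers both recall and average precision, which is why judgments over the whole collection (or pooled judgments, as in TREC) matter.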

Research Directions and Conclusions

  • Ongoing needs: lower user effort in busy clinical settings; extract new knowledge from large corpora; ensure high-quality health information for consumers.
  • Key research areas:
    • Information extraction and text mining (NLP) to extract facts from text.
    • Summarization to generate concise abstracts/overviews.
    • Question answering to provide direct answers beyond document retrieval.
  • Conclusions: IR and digital libraries have advanced substantially, but significant challenges remain in accessibility, accuracy, interoperability, and preserving trust in biomedical information.

Takeaways for Last-Minute Review

  • Know the two main information types and the four content categories (bibliographic, full-text, annotated, aggregated).
  • Understand the two IR processes: indexing (metadata) and retrieval (match + rank).
  • Recall the core retrieval models: exact-match (Boolean) vs partial-match (ranking with TF*IDF, BM25, learning-to-rank).
  • Be able to explain the roles of MeSH and Dublin Core in indexing/metadata.
  • Recognize PubMed features (term mapping to MeSH, phrase search, limits, advanced search, clinical queries).
  • Be aware of quality issues (misinformation, predatory OA, retractions, COI) and standards (HONcode, URAC).
  • Remember evaluation metrics (recall, precision, MAP, NDCG) and the role of challenge evaluations (TREC).