Information Retrieval in Biomedicine: Quick Reference Notes
IR Overview
- IR (search) = acquisition, organization, and searching of knowledge-based information.
- Biomedical IR now includes multimedia content beyond text (images, sequences, etc.).
- Two main information categories:
- Patient-specific information (health records, mobile/wearable data)
- Knowledge-based information (research results, guidelines, consumer health info)
- IR goal: find content that meets a user’s information need. Two core processes: indexing assigns metadata to content; retrieval matches user queries against that metadata to fetch items.
- Metadata = data about data; used by search engines to match queries to content.
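The two-step process above (indexing assigns terms, retrieval matches queries against them) can be sketched with a toy inverted index; the documents and queries below are invented for illustration:

```python
from collections import defaultdict

# Toy document collection (invented titles).
docs = {
    1: "aspirin therapy for myocardial infarction",
    2: "statin therapy and cholesterol",
    3: "aspirin and stroke prevention",
}

def build_index(docs):
    """Indexing step: map each word to the set of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

def retrieve(index, query):
    """Retrieval step: return documents containing every query word."""
    result = None
    for word in query.lower().split():
        postings = index.get(word, set())
        result = postings if result is None else result & postings
    return result or set()

index = build_index(docs)
print(retrieve(index, "aspirin therapy"))  # {1}
```

Real systems index far richer metadata (controlled vocabulary terms, fields, weights), but the index-then-match structure is the same.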
Content Categories in Health and Biomedicine
- Patient-specific vs knowledge-based information (definitions above).
- Knowledge-based content derived from observational/experimental research; supports evidence for individual care and broader knowledge.
The IR Process and Content Types
- IR process relies on indexing (metadata assignment) and retrieval (user query + matching).
- Content can be categorized into four main types:
- Bibliographic content (citations/pointers to literature; e.g., MEDLINE)
- Full-text content (online journals, books, textbooks, Web sites)
- Annotated content (databases with structured data like images, omics, EBM data, trials)
- Aggregated content (collections combining multiple content types, e.g., MedlinePlus)
Lifecycle of Knowledge-Based Information
- Original research → scientific paper → peer review → acceptance/rejection → publication.
- By-products: copyright transfer to publishers; secondary publications (reviews, books) and data generation for further research.
- Data publishing: increasing emphasis on depositing underlying data in public repositories (e.g., genomics, clinical trials data).
- Preprints: early postings before peer review; growing in biomedicine, with concerns about quality; registered reports proposed to mitigate delays.
Publishing Trends and Open Science
- Internet-enabled electronic publishing lowers some costs but raises political/economic questions (who pays).
- Open Access (OA): free online access with alternative funding sources; OA Gold (journal makes articles freely available, typically funded by author-side article-processing charges) and OA Green (authors deposit manuscripts in repositories).
- NIH data-sharing policy and PMC (PubMed Central) as repository; concerns about predatory journals.
- Open science components: Open data, Open source, Open methodology, Open peer review.
Information Needs and Users
- Clinicians: four states of information need (Gorman):
- Unrecognized need, Recognized need, Pursued need, Satisfied need.
- Evidence suggests many information needs remain unmet; clinicians often rely on colleagues or textbooks when seeking answers.
- Consumers: ~80% of Internet users search for personal health information; common topics include diseases, treatments, and providers.
- Researchers: information needs studied less extensively, but growing interest in data sharing and clinical decision support (CDS) resources.
Changes in Publishing and Access
- Web/Internet transformation: nearly all journals are now electronic; access remains constrained by economics and licensing.
- OA concerns: predatory journals; need for quality controls such as the HONcode, URAC accreditation, and JAMA benchmark criteria.
- Preprints and rapid dissemination have contested impacts on medicine (COVID-19 era highlighted both speed and risks).
Quality and Misinformation in Knowledge-Based Information
- Misinformation on the Web and in science; quality varies by topic; responsible dissemination is critical.
- Issues: spin in journals, selective reporting of risks, search engine manipulation via data voids, paywalls, and misinformation spread via social media.
- Initiatives to improve quality: JAMA benchmark criteria, the HONcode, URAC accreditation; Retraction Watch tracks retractions.
- Publication bias (the “file drawer” problem) and conflict-of-interest concerns remain active challenges.
Quality Assurance and Standards in Web Content
- Web quality criteria (early efforts): JAMA benchmark criteria; the HONcode; URAC accreditation.
- Misconduct concerns include data fabrication, image manipulation, peer-review fraud, and “zombie science.”
- Retracted papers may still be cited; Retraction Watch maintains a database.
Content Taxonomy for Knowledge-Based Information
- Bibliographic Content
- MEDLINE as core bibliographic database; ~60 fields per record; identifiers include the PMID, PMCID (PubMed Central), and AUID/ORCID (author).
- Other major bibliographic resources: CINAHL (nursing/allied health), EMBASE; Web catalogs and aggregations (HON Select, TRIP, CISMeF); Google Scholar; RSS feeds.
- Full-Text Content
- Electronic journals/books; linking to abstracts, full text, and related resources; government content (e.g., NLM, CDC) and commercial publishers.
- Textbooks, encyclopedias, and online medical resources (e.g., NLM Bookshelf, UpToDate, Dynamed, etc.).
- Annotated Content
- Image databases (e.g., Visible Human, Open-I), omics databases (NCBI), citation databases (Web of Science, Scopus), EBM databases (Cochrane, BMJ Best Practice, UpToDate), clinical trial registries (ClinicalTrials.gov), NIH RePORTER, data set catalogs (DataMed).
- Aggregated Content
- Large, topic-focused aggregations (MedlinePlus) and model organism databases (Mouse Genome Informatics).
Indexing: Manual vs Automated
- Indexing = assignment of metadata to content to enable retrieval.
- Manual indexing
- Done by human indexers using controlled terminologies (e.g., MeSH); common for bibliographic and annotated content; follows explicit protocols.
- Automated indexing
- Computer-generated indexing (word-based), increasingly used for large/full-text content; often hybrid with manual terms.
- Controlled terminologies
- Terms, synonyms, and relationships (thesauri) to support retrieval.
Controlled Terminologies: MeSH and Others
- MeSH (Medical Subject Headings): used to index most NLM databases.
- 28,000+ headings; 90,000+ entry terms; 16 trees; hierarchical structure; synonyms; related terms.
- Features to aid retrieval: subheadings, check tags, geographic (Z) tags, publication types.
- Other thesauri: CINAHL Subject Headings (MeSH-based with domain-specific terms), EMTREE (EMBASE).
- Manual indexing in MEDLINE uses MeSH terms and subheadings; some automation supported but manual input remains important.
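MeSH's hierarchical tree structure is what makes "exploding" a heading possible: because tree numbers encode the hierarchy as dotted prefixes, a heading's descendants are exactly the headings whose tree numbers extend its own. A minimal sketch, using an illustrative fragment rather than the real MeSH vocabulary:

```python
# Illustrative MeSH-style fragment: tree number -> heading.
# (Not the actual MeSH records; structure only.)
mesh = {
    "C14": "Cardiovascular Diseases",
    "C14.280": "Heart Diseases",
    "C14.280.647": "Myocardial Ischemia",
    "C14.907": "Vascular Diseases",
    "C23": "Pathological Conditions, Signs and Symptoms",
}

def explode(tree_number, vocabulary):
    """Return the heading plus all of its descendants in the tree."""
    prefix = tree_number + "."
    return sorted(
        name for num, name in vocabulary.items()
        if num == tree_number or num.startswith(prefix)
    )

print(explode("C14.280", mesh))  # ['Heart Diseases', 'Myocardial Ischemia']
```

This prefix trick is why a search on a broad heading can automatically include documents indexed only with narrower terms.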
Metadata for Web Content: Dublin Core and RDF
- Dublin Core Metadata Initiative (DCMI): 15 elements used to describe Web resources (e.g., title, creator, subject, description, publisher, date, etc.).
- RDF (Resource Description Framework): emerging standard for external metadata, enabling linked data.
- Yahoo/dmoz (Open Directory) historically attempted human-curated directories; later largely superseded by search engines.
- Limitations include inconsistencies across manual indexing and scalability issues for Web-scale content.
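One common convention for exposing Dublin Core elements is embedding them as HTML `<meta>` tags with a `DC.` prefix; the record values below are invented:

```python
# Emit Dublin Core elements as HTML <meta> tags (a common embedding
# convention for Web pages). All field values here are made up.
record = {
    "title": "Hypertension Patient Handout",
    "creator": "Example Health Library",
    "subject": "Hypertension",
    "date": "2024-01-15",
}

def dc_meta_tags(record):
    """Render each Dublin Core element/value pair as a <meta> tag."""
    return "\n".join(
        f'<meta name="DC.{element}" content="{value}">'
        for element, value in record.items()
    )

print(dc_meta_tags(record))
```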
Manual vs Automated Indexing on the Web
- Manual indexing limitations: inconsistency, time-consuming, not scalable for billions of pages.
- Web catalogs and aggregations partly mitigate scalability issues but may introduce quality variability.
Retrieval: Exact-Match vs Partial-Match
Exact-Match Retrieval (Boolean/Search Set)
- Uses AND, OR, NOT to form document sets; common in bibliographic/annotated databases.
- PubMed Advanced Search Builder supports sets, limits, and automatic Boolean placement.
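A Boolean query such as `(hypertension AND diabetes) NOT pregnancy` evaluates directly as set operations over an inverted index's postings; the postings and document IDs below are invented:

```python
# Exact-match (Boolean) retrieval: each term denotes the set of documents
# indexed with it; AND, OR, and NOT map to set intersection, union, and
# difference. Postings here are toy data.
postings = {
    "hypertension": {1, 2, 5},
    "diabetes": {2, 3, 5},
    "pregnancy": {5},
}

both = postings["hypertension"] & postings["diabetes"]    # AND -> {2, 5}
either = postings["hypertension"] | postings["diabetes"]  # OR  -> {1, 2, 3, 5}
result = both - postings["pregnancy"]                     # NOT -> {2}
print(result)
```

Systems like PubMed build up numbered search sets the same way, then combine them with Boolean operators.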
Partial-Match Retrieval (Ranking, Vector Space)
- Natural-language-like querying; documents ranked by relevance.
- Weighting schemes: TF*IDF, where
- TF(t, d) = frequency of term t in document d
- IDF(t) = log(N / n_t), where N = number of documents in the collection and n_t = number of documents containing term t
- Document score across query terms: score(q, d) = sum over t in q of TF(t, d) × IDF(t)
- BM25 and learning-to-rank are popular ranking approaches.
- Query features in practice: phrase search, wildcards, stop word removal, stemming, synonym handling, and expansion via related terms.
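The TF*IDF scoring idea can be sketched in a few lines; this is a minimal vector-space illustration with invented documents (real systems add length normalization, BM25 saturation, etc.):

```python
import math
from collections import Counter

# Toy collection for ranking (invented text).
docs = {
    "d1": "aspirin aspirin stroke",
    "d2": "stroke rehabilitation therapy",
    "d3": "aspirin dosage guidelines",
}

tokenized = {d: text.split() for d, text in docs.items()}
N = len(docs)

def idf(term):
    """log(N / n_t): rarer terms get higher weight."""
    n_t = sum(1 for words in tokenized.values() if term in words)
    return math.log(N / n_t) if n_t else 0.0

def score(query, doc_id):
    """Sum TF * IDF over the query terms present in the document."""
    tf = Counter(tokenized[doc_id])
    return sum(tf[t] * idf(t) for t in query.split())

ranking = sorted(docs, key=lambda d: score("aspirin stroke", d), reverse=True)
print(ranking)  # 'd1' ranks first: it contains both terms, 'aspirin' twice
```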
Retrieval Systems: PubMed, Web Search, and Contextual Linking
- PubMed: automatic term mapping to MeSH, journals, phrases, and authors; supports phrase search, wildcards, limits, and Advanced Search sets.
- Web search engines (Google/Bing): natural language queries, implicit AND, some Boolean controls, location-aware results, and sponsored results (ads).
- PubMed Clinical Queries: specialized filters to retrieve best evidence for clinical questions; supports PICO-like queries.
- Infobuttons (HL7 standard): context-aware linking of patient data to knowledge resources within EHRs.
- Image retrieval: semantic/textual vs visual/content-based indexing; Open-I as a clinical image retrieval system; Google Images for medical imagery.
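Infobutton requests are typically expressed as URLs that carry EHR context as parameters. A sketch of building one, where the base URL and code values are illustrative (parameter names follow the HL7 Context-Aware Knowledge Retrieval URL convention used by services such as MedlinePlus Connect):

```python
from urllib.parse import urlencode

# Hypothetical knowledge-resource endpoint; the coded problem (ICD-10-CM
# I10, essential hypertension) stands in for context pulled from the EHR.
base = "https://knowledge.example.org/infobutton"
params = {
    "mainSearchCriteria.v.c": "I10",                      # concept code
    "mainSearchCriteria.v.cs": "2.16.840.1.113883.6.90",  # ICD-10-CM OID
    "knowledgeResponseType": "text/html",                 # desired format
}
url = base + "?" + urlencode(params)
print(url)
```

The knowledge resource parses these parameters and returns content matched to the patient context, without the clinician retyping a query.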
Evaluation of IR Systems
- Core metrics: recall (proportion of relevant documents retrieved) and precision (proportion of retrieved documents that are relevant).
- Aggregate measures: Mean Average Precision (MAP), B-Pref (for incomplete judgments), NDCG (graded relevance).
- Evaluation approaches: system-oriented (focus on IR system performance) and user-oriented (focus on user task success and satisfaction).
- TREC and other challenge evaluations provide standardized tasks, collections, and relevance judgments; biomedical tracks include Genomics, Medical Records, CDS/Precision Medicine, and Clinical Trials tracks.
- Findings: users frequently obtain answers with limited recall/precision; user studies show varying impact on knowledge and practice; technology evolves rapidly, complicating longitudinal comparisons.
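Recall, precision, and average precision (the per-query quantity averaged into MAP) can be computed directly from a ranked result list and a set of relevance judgments; the toy judgments below are invented:

```python
# Toy relevance judgments and a ranked system output.
relevant = {"d1", "d3", "d7"}
retrieved = ["d1", "d2", "d3", "d4"]

hits = relevant & set(retrieved)
recall = len(hits) / len(relevant)      # relevant docs that were retrieved
precision = len(hits) / len(retrieved)  # retrieved docs that are relevant

def average_precision(retrieved, relevant):
    """Mean of precision-at-rank at each rank where a relevant doc appears."""
    found, total = 0, 0.0
    for rank, doc in enumerate(retrieved, start=1):
        if doc in relevant:
            found += 1
            total += found / rank
    return total / len(relevant) if relevant else 0.0

print(recall, precision, average_precision(retrieved, relevant))
```

MAP is simply this average precision averaged over a set of test queries, as in TREC-style evaluations.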
Research Directions and Conclusions
- Ongoing needs: lower user effort in busy clinical settings; extract new knowledge from large corpora; ensure high-quality health information for consumers.
- Key research areas:
- Information extraction and text mining (NLP) to extract facts from text.
- Summarization to generate concise abstracts/overviews.
- Question-answering to provide direct answers beyond document retrieval.
- Conclusions: IR and digital libraries have advanced substantially, but significant challenges remain in accessibility, accuracy, interoperability, and preserving trust in biomedical information.
Takeaways for Last-Minute Review
- Know the two main information types and the four content categories (bibliographic, full-text, annotated, aggregated).
- Understand the two IR processes: indexing (metadata) and retrieval (match + rank).
- Recall the core retrieval models: exact-match (Boolean) vs partial-match (ranking with TF*IDF, BM25, learning-to-rank).
- Be able to explain MeSH and Dublin Core roles in indexing/metadata.
- Recognize PubMed features (term mapping to MeSH, phrase search, limits, advanced search, clinical queries).
- Be aware of quality issues (misinformation, predatory OA, retractions, COI) and standards (HON, URAC).
- Remember evaluation metrics (recall, precision, MAP, NDCG) and the role of challenge evaluations (TREC).