Metadata Encoding Schemes – Comprehensive Study Notes

Definition and Core Idea of Metadata Encoding Schemes

  • Encoding schemes = agreed-upon ways of structuring, tagging, and representing metadata so both humans and machines can parse it.
  • Govern:
    • How individual surrogate records (the descriptions that stand in for resources) are broken into elements.
    • Which tags / letters / words label each element.
  • Primary objectives
    • Display: render information consistently for users.
    • Access: support precise searching, sorting, and retrieval.
    • Integration of surrogates: let multiple descriptions coexist or merge.
    • Management: enable long-term maintenance, migration, and preservation of data sets.

Why Encoding Schemes Matter

  • Consistency
    • Promotes standardization across datasets and institutions.
  • Interoperability
    • Lets metadata move between disparate systems with little or no loss, provided both sides follow the same scheme.
    • Underpins cross-database harvesting, OAI-PMH, linked data, etc.
  • Automation
    • Allows software to interpret fields, run batch operations, and generate services (search facets, displays, APIs).
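
  To make the automation point concrete, here is a minimal Python sketch of a search facet computed from consistently labeled fields. The records and field names (title, format, language) are hypothetical and not taken from any particular scheme.

```python
from collections import Counter

# Hypothetical surrogate records; the field names are illustrative only.
records = [
    {"title": "Introduction to Metadata", "format": "book", "language": "en"},
    {"title": "XML in Libraries", "format": "book", "language": "en"},
    {"title": "Cataloguing Basics", "format": "video", "language": "fr"},
]

def facet_counts(records, field):
    """Count how many records carry each value of the given field."""
    return Counter(r[field] for r in records if field in r)

format_facet = facet_counts(records, "format")
language_facet = facet_counts(records, "language")
print(format_facet.most_common())
```

  Because every record labels its format the same way, the count needs no per-record guesswork; with inconsistent labels the same facet would require manual cleanup first.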

Major Families of Encoding Schemes

  • HTML – Hypertext Markup Language
  • XML – eXtensible Markup Language
  • SGML – Standard Generalized Markup Language
  • RDF – Resource Description Framework
  • MARC – Machine-Readable Cataloging
  • MIME – Multipurpose Internet Mail Extensions

HTML (Hypertext Markup Language)

  • Native language of the World Wide Web; looks like traditional typesetting code.
  • Purpose
    • Structures and formats hypertext documents (pages, menus, forms, mail, hypermedia, query results, graphics, etc.).
  • Syntax example
    • <p>This is a paragraph</p> marks a paragraph element.
  • Typical use cases
    • Web pages, help files, e-learning modules, documentation portals.
  • Relationship to other schemes
    • Defined as an application of SGML with a fixed element set (through HTML 4.01); HTML5 is specified independently of SGML.
    • Focuses on how content looks and behaves, not on describing its data semantics.
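
  As a rough illustration of how software reads HTML's fixed tag set, the sketch below uses Python's standard html.parser module to collect paragraph text. The class name ParagraphCollector and the sample markup are assumptions made for this example.

```python
from html.parser import HTMLParser

class ParagraphCollector(HTMLParser):
    """Collect the text content of every <p> element."""

    def __init__(self):
        super().__init__()
        self.in_paragraph = False
        self.paragraphs = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.in_paragraph = True
            self.paragraphs.append("")

    def handle_endtag(self, tag):
        if tag == "p":
            self.in_paragraph = False

    def handle_data(self, data):
        # Only keep text that appears inside a paragraph element.
        if self.in_paragraph:
            self.paragraphs[-1] += data

collector = ParagraphCollector()
collector.feed("<h1>Notes</h1><p>This is a paragraph</p>")
collector.close()
print(collector.paragraphs)  # ['This is a paragraph']
```

  The parser can tell that the text is a paragraph, but nothing in the markup says what the paragraph is about, which is the limitation the last bullet describes.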

XML (eXtensible Markup Language)

  • Flexible, text-based metalanguage derived from SGML (ISO 8879).
  • Key properties
    • Designers create custom tags—XML is not bound to a preset vocabulary.
    • Human-readable and machine-readable.
    • Separates content/structure from presentation.
  • Roles
    • Encoding and exchanging metadata in web services, enterprise data feeds, digital libraries, etc.
    • Handles large-scale electronic publishing as well as lightweight data interchange.
  • Comparison with HTML
    • HTML = presentation-centric; XML = data-centric.
    • HTML mixes data and display; XML captures only meaning/structure; styling done by external tech (XSLT, CSS).
  • Example document snippet
  <?xml version="1.0" encoding="UTF-8" standalone="yes"?>
  <!DOCTYPE FAQ SYSTEM "FAQ.DTD">
  <FAQ>
    <INFO>
      <SUBJECT>XML</SUBJECT>
      <AUTHOR>Lars Marius Garshol</AUTHOR>
      <EMAIL>larsga@ifi.uio.no</EMAIL>
      <VERSION>1.0</VERSION>
      <DATE>20.jun.97</DATE>
    </INFO>
    <PART NO="1">
      <Q NO="1">
        <QTEXT>What is XML?</QTEXT>
        <A>SGML light.</A>
      </Q>
    </PART>
  </FAQ>
  • Practical domains
    • Data exchange (SOAP, RSS, Atom), configuration files, scientific datasets, metadata repositories (MODS, METS, EAD).
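
  A minimal sketch of reading a trimmed copy of the FAQ snippet with Python's standard xml.etree.ElementTree module. The XML declaration and DOCTYPE line are left out here because this parser does not validate against a DTD; the variable names are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the FAQ document, without declaration or DOCTYPE.
faq_xml = """
<FAQ>
  <INFO>
    <SUBJECT>XML</SUBJECT>
    <AUTHOR>Lars Marius Garshol</AUTHOR>
    <VERSION>1.0</VERSION>
  </INFO>
  <PART NO="1">
    <Q NO="1">
      <QTEXT>What is XML?</QTEXT>
      <A>SGML light.</A>
    </Q>
  </PART>
</FAQ>
""".strip()

root = ET.fromstring(faq_xml)

# Custom tags become dictionary keys without any preset vocabulary.
info = {child.tag: child.text for child in root.find("INFO")}

# Attributes (NO) and nested elements (QTEXT, A) are addressed by name.
questions = [
    (q.get("NO"), q.findtext("QTEXT"), q.findtext("A"))
    for q in root.iter("Q")
]

print(info["AUTHOR"])
print(questions)
```

  The code never mentions fonts or layout: the tags carry only structure, so the same data could feed a web page, an index, or another system.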

SGML (Standard Generalized Markup Language)

  • ISO standard (ISO 8879:1986); progenitor of both HTML and XML.
  • Functions as a metalanguage—lets you create other markup languages by defining a Document Type Definition (DTD).
  • DTD example (declarations shown in XML-compatible syntax)
  <!DOCTYPE book [
    <!ELEMENT book (title, author, chapter+)>
    <!ELEMENT title (#PCDATA)>
    <!ELEMENT author (#PCDATA)>
    <!ELEMENT chapter (title, content)>
    <!ELEMENT content (#PCDATA)>
  ]>
  <book>
    <title>Introduction to Metadata</title>
    <author>John Doe</author>
    <chapter>
      <title>Chapter 1: Basics</title>
      <content>This is the first chapter.</content>
    </chapter>
  </book>
  • Design goals
    • Provide grammar-like mechanism for users to prescribe document structure and allowable tags.
    • Support compound documents containing text, graphics, hypertext links.
  • Outputs and transformations
    • Typesetting, indexing, CD-ROM, internationalization, web delivery, etc.
  • Relation map
    • HTML = SGML application with fixed tags (through HTML 4.01).
    • XML = streamlined SGML profile for web data exchange (drops complex features such as tag minimization and other optional SGML features).
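
  The grammar-like role of a DTD can be sketched in Python by treating each content model as a regular expression over child element names. This is a hand translation of the book DTD above, not a real validating parser; the names CONTENT_MODELS and conforms are assumptions for this example, and the standard ElementTree parser used here does not itself validate.

```python
import re
import xml.etree.ElementTree as ET

# Content models from the book DTD, rewritten by hand as regular expressions
# over a sequence of child element names, each followed by one space.
CONTENT_MODELS = {
    "book": r"title author (chapter )+",   # (title, author, chapter+)
    "chapter": r"title content ",          # (title, content)
}

def conforms(element):
    """Return True if the element and all descendants match CONTENT_MODELS."""
    model = CONTENT_MODELS.get(element.tag)
    if model is None:
        # Treated as (#PCDATA): text only, no child elements allowed.
        return len(element) == 0
    children = "".join(child.tag + " " for child in element)
    if re.fullmatch(model, children) is None:
        return False
    return all(conforms(child) for child in element)

valid = ET.fromstring(
    "<book><title>Introduction to Metadata</title><author>John Doe</author>"
    "<chapter><title>Chapter 1: Basics</title>"
    "<content>This is the first chapter.</content></chapter></book>"
)
invalid = ET.fromstring(
    "<book><title>No author here</title>"
    "<chapter><title>T</title><content>C</content></chapter></book>"
)

print(conforms(valid), conforms(invalid))
```

  The second document is rejected because the content model requires an author element between title and the first chapter.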

RDF (Resource Description Framework)

  • W3C framework for representing metadata as machine-understandable, graph-based triples (subject–predicate–object).
  • Goals
    • Interoperability across applications; shared semantics.
    • Automated processing of web resources—cornerstone of the Semantic Web and "Web of Trust".
  • Application areas
    • Resource discovery ⇒ improves search engine relevance.
    • Cataloging digital libraries, collections, or single pages.
    • Knowledge sharing for intelligent agents.
    • Content rating systems.
    • IPR (intellectual property rights) encoding.
  • Security synergy
    • Coupled with digital signatures → enables e-commerce, collaboration, verified data provenance.
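
  The triple model can be sketched in plain Python without an RDF library. The example.org identifiers and the data are hypothetical; dc: and foaf: are written as abbreviated prefixes standing for Dublin Core and FOAF terms, and the helper names match and creator_names are assumptions for this example.

```python
# Each statement is a (subject, predicate, object) triple.
# The example.org IRIs and all values are hypothetical.
triples = {
    ("http://example.org/book/1", "dc:title", "Introduction to Metadata"),
    ("http://example.org/book/1", "dc:creator", "http://example.org/person/jdoe"),
    ("http://example.org/person/jdoe", "foaf:name", "John Doe"),
}

def match(triples, subject=None, predicate=None, obj=None):
    """Return the triples matching a pattern; None acts as a wildcard."""
    return sorted(
        t for t in triples
        if (subject is None or t[0] == subject)
        and (predicate is None or t[1] == predicate)
        and (obj is None or t[2] == obj)
    )

def creator_names(triples, book):
    """Follow book -> dc:creator -> foaf:name across two triples."""
    names = []
    for _, _, creator in match(triples, subject=book, predicate="dc:creator"):
        for _, _, name in match(triples, subject=creator, predicate="foaf:name"):
            names.append(name)
    return names

print(creator_names(triples, "http://example.org/book/1"))
```

  Because the creator is itself a resource with its own statements, a program can follow the graph from the book to the person's name, which is the kind of automated processing the goals above describe.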

MARC (Machine-Readable Cataloging)

  • Highly standardized bibliographic format used by libraries worldwide.
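  • Encapsulates metadata in variable-length fields identified by three-digit numeric tags (e.g., 100 = main-entry personal name, 245 = title statement, 650 = topical subject), refined by indicators and subfield codes.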
  • Main use: library catalog records imported/exported between ILSs, OCLC, LoC, WorldCat, etc.
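
  A rough Python sketch of how tag, indicators, and subfields divide up a record. The one-line-per-field layout parsed here is a simplified human-readable display assumed for this example (with # standing for a blank indicator); it is not the binary exchange format that systems actually transmit, and the sample record is invented.

```python
# Simplified display lines: tag, two indicators, then $-coded subfields.
# The record content is invented for illustration.
lines = [
    "100 1# $aDoe, John.",
    "245 10 $aIntroduction to metadata /$cJohn Doe.",
    "650 #0 $aMetadata.",
]

def parse_field(line):
    """Split one display line into tag, indicators, and subfields."""
    tag, indicators, data = line[:3], line[4:6], line[7:]
    subfields = [
        (chunk[0], chunk[1:].strip())   # (subfield code, value)
        for chunk in data.split("$") if chunk
    ]
    return {"tag": tag, "indicators": indicators, "subfields": subfields}

def subfield_values(record, tag, code):
    """Return every value stored under the given tag and subfield code."""
    return [
        value
        for field in record if field["tag"] == tag
        for sub_code, value in field["subfields"] if sub_code == code
    ]

record = [parse_field(line) for line in lines]
print(subfield_values(record, "245", "a"))
```

  Software that knows 245 $a holds the title can extract it from any record that follows the format, regardless of which library produced it.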

MIME (Multipurpose Internet Mail Extensions)

  • Internet standard that labels file/media types so they travel safely via e-mail, HTTP, and other protocols.
  • How it works
    • Sender encodes file; header includes type/subtype (e.g., image/jpeg).
    • Receiver’s MIME-aware application decodes and invokes correct viewer.
  • Role on the web
    • Web servers label each response with a Content-Type header; browsers map that type to a display or handling routine using built-in or user-configured MIME tables.
  • Typical types
    • Graphics: image/jpeg, image/gif, image/tiff.
    • Audio: audio/basic (au), audio/wav.
    • Video: video/mp4, legacy motion formats.
  • Core in e-mail
    • Virtually all Internet e-mail = SMTP/MIME; transports non-ASCII attachments seamlessly.

Conceptual Connections & Implications

  • Evolutionary line: SGML → (simplified profile) XML; SGML → (application) HTML.
  • Complementarity
    • XML or SGML may define the structure; CSS/XSLT or HTML define presentation.
    • RDF is an abstract data model with several serializations (RDF/XML, JSON-LD, Turtle), which bridges data graphs and web markup.
  • Ethical / practical concerns
    • Interoperability avoids vendor lock-in and data silos.
    • Accurate tagging enhances accessibility (screen readers rely on semantic HTML).
    • "Web of Trust" vision (RDF + signatures) underpins secure commerce, privacy, and provenance.
  • Real-world relevance
    • Digital preservation: libraries rely on crosswalks between MARC and XML-based schemas (e.g., MARCXML, MODS).
    • Open data portals publish in XML/JSON and expose RDF to integrate with knowledge graphs.
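    • MIME type labels help clients pick an appropriate handler for incoming content, which reduces the risk of opening foreign binaries with the wrong application; the label alone is not a security guarantee.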

Quick Reference Cheat-Sheet

  • HTML: Fixed tags, presentation, web pages.
  • XML: Custom tags, data representation, interchange.
  • SGML: Metalanguage; defines new markup languages via DTD.
  • RDF: Triples/graphs; semantic metadata; automation.
  • MARC: Library bibliographic exchange (developed at the Library of Congress).
  • MIME: Media type registry; e-mail & HTTP encoding of binary/non-ASCII files.