Metadata Encoding Schemes – Comprehensive Study Notes
- Encoding schemes = agreed-upon ways of structuring, tagging, and representing metadata so both humans and machines can parse it.
- Govern:
- How individual surrogate records (stand-in descriptions of a resource) are broken into elements.
- Which tags / letters / words label each element.
- Primary objectives
- Display: render information consistently for users.
- Access: support precise searching, sorting, and retrieval.
- Integration of surrogates: let multiple descriptions coexist or merge.
- Management: enable long-term maintenance, migration, and preservation of data sets.
Why Encoding Schemes Matter
- Consistency
- Promotes standardization across datasets and institutions.
- Interoperability
- Lets metadata move between disparate systems with little or no loss.
- Underpins cross-database harvesting, OAI-PMH, linked data, etc.
- Automation
- Allows software to interpret fields, run batch operations, and generate services (search facets, displays, APIs).
Major Families of Encoding Schemes
- HTML – Hypertext Markup Language
- XML – eXtensible Markup Language
- SGML – Standard Generalized Markup Language
- RDF – Resource Description Framework
- MARC – Machine-Readable Cataloging
- MIME – Multipurpose Internet Mail Extensions
HTML (Hypertext Markup Language)
- Native language of the World Wide Web; looks like traditional typesetting code.
- Purpose
- Structures and formats hypertext documents (pages, menus, forms, mail, hypermedia, query results, graphics, etc.).
- Syntax example
<p>This is a paragraph</p> marks a paragraph element.
- Typical use cases
- Web pages, help files, e-learning modules, documentation portals.
- Relationship to other schemes
- An application of SGML (defined by an SGML DTD through HTML 4.01) with a fixed element set.
- Focuses on how content looks and behaves, not on describing its data semantics.
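To see how software reads the paragraph markup shown above, here is a minimal sketch using Python's standard `html.parser` module. The class name `ParagraphCollector` and the sample `<h1>` heading are assumptions made for illustration; only the `<p>` element comes from these notes.

```python
from html.parser import HTMLParser


class ParagraphCollector(HTMLParser):
    """Collect the text content of every <p> element (illustrative helper)."""

    def __init__(self):
        super().__init__()
        self.in_paragraph = False
        self.paragraphs = []

    def handle_starttag(self, tag, attrs):
        # A new <p> opens: start an empty buffer for its text.
        if tag == "p":
            self.in_paragraph = True
            self.paragraphs.append("")

    def handle_endtag(self, tag):
        if tag == "p":
            self.in_paragraph = False

    def handle_data(self, data):
        # Text outside <p> (for example the heading) is ignored.
        if self.in_paragraph:
            self.paragraphs[-1] += data


collector = ParagraphCollector()
collector.feed("<h1>Notes</h1><p>This is a paragraph</p>")
collector.close()
print(collector.paragraphs)  # ['This is a paragraph']
```

The parser knows the text is a paragraph, but nothing in the markup says what the text is about, which is the gap that XML vocabularies and RDF aim to fill.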
XML (eXtensible Markup Language)
- Flexible, text-based metalanguage derived from SGML (ISO 8879).
- Key properties
- Designers create custom tags—XML is not bound to a preset vocabulary.
- Human-readable and machine-readable.
- Separates content/structure from presentation.
- Roles
- Encoding and exchanging metadata in web services, enterprise data feeds, digital libraries, etc.
- Handles large-scale electronic publishing as well as lightweight data interchange.
- Comparison with HTML
- HTML = presentation-centric; XML = data-centric.
- HTML mixes data and display; XML captures only meaning/structure; styling done by external tech (XSLT, CSS).
- Example document snippet
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<!DOCTYPE FAQ SYSTEM "FAQ.DTD">
<FAQ>
<INFO>
<SUBJECT>XML</SUBJECT>
<AUTHOR>Lars Marius Garshol</AUTHOR>
<EMAIL>larsga@ifi.uio.no</EMAIL>
<VERSION>1.0</VERSION>
<DATE>20.jun.97</DATE>
</INFO>
<PART NO="1">
<Q NO="1">
<QTEXT>What is XML?</QTEXT>
<A>SGML light.</A>
</Q>
</PART>
</FAQ>
- Practical domains
- Data exchange (SOAP, RSS, Atom), configuration files, scientific datasets, metadata repositories (MODS, METS, EAD).
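Because the FAQ tags name the data they hold, a program can pull values out by tag name. The sketch below uses Python's standard `xml.etree.ElementTree` on a shortened copy of the FAQ snippet; the XML declaration, DOCTYPE line, and a few elements are left out here for brevity, and the variable names are illustrative.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the FAQ example (declaration and DOCTYPE omitted).
faq_xml = """
<FAQ>
  <INFO>
    <SUBJECT>XML</SUBJECT>
    <AUTHOR>Lars Marius Garshol</AUTHOR>
    <VERSION>1.0</VERSION>
  </INFO>
  <PART NO="1">
    <Q NO="1">
      <QTEXT>What is XML?</QTEXT>
      <A>SGML light.</A>
    </Q>
  </PART>
</FAQ>
"""

root = ET.fromstring(faq_xml)

# Descriptive metadata: one dictionary entry per child of <INFO>.
info = {child.tag: child.text for child in root.find("INFO")}

# Content: (question number, question text, answer) for every <Q>.
questions = [
    (q.get("NO"), q.findtext("QTEXT"), q.findtext("A"))
    for q in root.iter("Q")
]

print(info["AUTHOR"])  # Lars Marius Garshol
print(questions)       # [('1', 'What is XML?', 'SGML light.')]
```

No styling information is involved at any point; a separate stylesheet would decide how these values are displayed.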
SGML (Standard Generalized Markup Language)
- ISO standard ISO 8879:1986; progenitor of both HTML and XML.
- Functions as a metalanguage—lets you create other markup languages by defining a Document Type Definition (DTD).
- DTD example
<!DOCTYPE book [
<!ELEMENT book (title, author, chapter+)>
<!ELEMENT title (#PCDATA)>
<!ELEMENT author (#PCDATA)>
<!ELEMENT chapter (title, content)>
<!ELEMENT content (#PCDATA)>
]>
<book>
<title>Introduction to Metadata</title>
<author>John Doe</author>
<chapter>
<title>Chapter 1: Basics</title>
<content>This is the first chapter.</content>
</chapter>
</book>
- Design goals
- Provide grammar-like mechanism for users to prescribe document structure and allowable tags.
- Support compound documents containing text, graphics, hypertext links.
- Outputs and transformations
- Typesetting, indexing, CD-ROM, internationalization, web delivery, etc.
- Relation map
- HTML = SGML application with fixed tags.
- XML = streamlined SGML profile for web data exchange (drops complex features such as tag minimization).
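The "grammar-like mechanism" can be made concrete with a small sketch. Python's standard library has no SGML or DTD validator, so the code below swaps in a hand-written check: the content models from the DTD example are translated by hand into regular expressions over child tag names. The `CONTENT_MODELS` table and the `check` function are assumptions for illustration, and the sample is parsed with an XML parser, which works only because this particular document also happens to be well-formed XML.

```python
import re
import xml.etree.ElementTree as ET

# Content models from the DTD example, hand-translated into regular
# expressions over a sequence of child tag names, each followed by a space.
#   <!ELEMENT book (title, author, chapter+)>
#   <!ELEMENT chapter (title, content)>
CONTENT_MODELS = {
    "book": r"title author (chapter )+",
    "chapter": r"title content ",
}


def check(element):
    """Return the tags of elements whose children violate their content model."""
    problems = []
    model = CONTENT_MODELS.get(element.tag)
    if model is not None:
        children = "".join(child.tag + " " for child in element)
        if not re.fullmatch(model, children):
            problems.append(element.tag)
    for child in element:
        problems.extend(check(child))
    return problems


valid = ET.fromstring(
    "<book><title>Introduction to Metadata</title><author>John Doe</author>"
    "<chapter><title>Chapter 1: Basics</title>"
    "<content>This is the first chapter.</content></chapter></book>"
)
invalid = ET.fromstring("<book><title>No author or chapters</title></book>")

print(check(valid))    # []
print(check(invalid))  # ['book']
```

A real validating parser derives these rules from the DTD itself; the point here is only that a content model is a grammar that a document either satisfies or does not.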
RDF (Resource Description Framework)
- W3C framework for representing metadata as machine-understandable, graph-based triples (subject–predicate–object).
- Goals
- Interoperability across applications; shared semantics.
- Automated processing of web resources—cornerstone of the Semantic Web and "Web of Trust".
- Application areas
- Resource discovery ⇒ improves search engine relevance.
- Cataloging digital libraries, collections, or single pages.
- Knowledge sharing for intelligent agents.
- Content rating systems.
- IPR (intellectual property rights) encoding.
- Security synergy
- Coupled with digital signatures → enables e-commerce, collaboration, verified data provenance.
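The triple model can be sketched without any RDF library by storing each statement as a plain Python tuple. This is only an illustration of the data model, not how an RDF toolkit works internally: the `example.org` URIs, the record contents, and the `match` helper are all assumptions, while the Dublin Core Terms and FOAF namespaces are the published ones.

```python
DC = "http://purl.org/dc/terms/"
FOAF = "http://xmlns.com/foaf/0.1/"

# Each statement is a (subject, predicate, object) triple.
# The example.org identifiers are placeholders.
triples = {
    ("http://example.org/book/1", DC + "title", "Introduction to Metadata"),
    ("http://example.org/book/1", DC + "creator", "http://example.org/person/john-doe"),
    ("http://example.org/person/john-doe", FOAF + "name", "John Doe"),
}


def match(subject=None, predicate=None, obj=None):
    """Return the triples that fit the pattern; None acts as a wildcard."""
    return sorted(
        t for t in triples
        if (subject is None or t[0] == subject)
        and (predicate is None or t[1] == predicate)
        and (obj is None or t[2] == obj)
    )


# Follow the graph: book -> creator -> name.
creator = match("http://example.org/book/1", DC + "creator")[0][2]
creator_name = match(creator, FOAF + "name")[0][2]
print(creator_name)  # John Doe
```

Because the creator is itself a resource with its own statements, the description forms a graph that software can traverse, rather than a flat record.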
MARC (Machine-Readable Cataloging)
- Highly standardized bibliographic format used by libraries worldwide.
- Labels metadata fields (author, title, subjects, codes) with three-digit numeric tags, e.g. 100 (main entry, personal name), 245 (title statement), 650 (subject heading).
- Main use: library catalog records imported/exported between ILSs, OCLC, LoC, WorldCat, etc.
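As a rough picture of what a tagged record looks like, the sketch below models a bibliographic record as a list of fields, each with a numeric tag, two indicator characters, and coded subfields, then prints it in the human-readable style often used in cataloging documentation. The record content is invented for illustration, the helper names are assumptions, and this is not the binary exchange format that systems actually transmit.

```python
# Each field: (tag, indicators, [(subfield code, value), ...]).
# The record is an invented example, not taken from a real catalog.
record = [
    ("100", "1 ", [("a", "Doe, John.")]),
    ("245", "10", [("a", "Introduction to metadata /"), ("c", "John Doe.")]),
    ("650", " 0", [("a", "Metadata.")]),
]


def format_field(tag, indicators, subfields):
    """Render one field as 'TAG II $a value $c value'."""
    body = " ".join(f"${code} {value}" for code, value in subfields)
    return f"{tag} {indicators} {body}"


def subfield_values(fields, tag, code):
    """Return every value stored under the given tag and subfield code."""
    return [
        value
        for field_tag, _, subfields in fields
        if field_tag == tag
        for sub_code, value in subfields
        if sub_code == code
    ]


lines = [format_field(*field) for field in record]
print("\n".join(lines))
print(subfield_values(record, "245", "a"))  # ['Introduction to metadata /']
```

The numeric tags are what let an importing system map each field to the right index or display slot without understanding the text itself.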
MIME (Multipurpose Internet Mail Extensions)
- Internet standard that labels file/media types so they travel safely via e-mail, HTTP, and other protocols.
- How it works
- Sender encodes file; header includes type/subtype (e.g., image/jpeg).
- Receiver’s MIME-aware application decodes and invokes correct viewer.
- Dependence tree
- WWW browsers & servers consult built-in or user-added MIME tables to decide display/handling routine.
- Typical types
- Graphics: image/jpeg, image/gif, image/tiff.
- Audio: audio/basic (au), audio/wav.
- Video: video/mp4, legacy motion formats.
- Core in e-mail
- Virtually all Internet e-mail = SMTP/MIME; transports non-ASCII attachments seamlessly.
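The sender and receiver steps described in this section can be sketched with Python's standard `mimetypes` and `email` modules. The file name, subject line, and byte string below are placeholders (the bytes are not a real image), so treat this as an outline of the labeling mechanism rather than a complete mail client.

```python
import mimetypes
from email.message import EmailMessage

# Sender side: look up the media type from the file extension.
media_type, _ = mimetypes.guess_type("photo.jpg")
maintype, subtype = media_type.split("/")

message = EmailMessage()
message["Subject"] = "Scanned cover"
message.set_content("Cover image attached.")

# Placeholder bytes standing in for the file's contents.
fake_image_bytes = b"\xff\xd8\xff\xe0 not a real JPEG"
message.add_attachment(
    fake_image_bytes, maintype=maintype, subtype=subtype, filename="photo.jpg"
)

# Receiver side: read the label, then decode the payload.
attachment = next(message.iter_attachments())
print(message.get_content_type())              # multipart/mixed
print(attachment.get_content_type())           # image/jpeg
print(attachment["Content-Transfer-Encoding"]) # base64
print(attachment.get_content() == fake_image_bytes)  # True
```

The Content-Type header tells the receiver what the bytes are, and the base64 transfer encoding is what lets binary data pass through text-oriented mail transport unchanged.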
Conceptual Connections & Implications
- Evolutionary line: SGML → (simplified) XML; SGML → (application) HTML.
- Complementarity
- XML or SGML may define the structure; CSS/XSLT or HTML define presentation.
- RDF graphs can be serialized in several syntaxes (RDF/XML, JSON-LD, Turtle), bridging data graphs with web markup.
- Ethical / practical concerns
- Interoperability avoids vendor lock-in and data silos.
- Accurate tagging enhances accessibility (screen readers rely on semantic HTML).
- "Web of Trust" vision (RDF + signatures) underpins secure commerce, privacy, and provenance.
- Real-world relevance
- Digital preservation: libraries rely on MARC/XML crosswalks.
- Open data portals publish in XML/JSON and expose RDF to integrate with knowledge graphs.
- MIME labeling helps clients choose an appropriate handler for incoming content, reducing the chance that a foreign binary is opened by the wrong application.
Quick Reference Cheat-Sheet
- HTML: Fixed tags, presentation, web pages.
- XML: Custom tags, data representation, interchange.
- SGML: Meta-standard; defines new markup languages via DTD.
- RDF: Triples/graphs; semantic metadata; automation.
- MARC: Library bibliographic exchange (developed at the Library of Congress).
- MIME: Media type registry; e-mail & HTTP encoding of binary/non-ASCII files.