Metadata

Digitizing Information

  • Bits are used to digitize various forms of data: numbers, letters, and symbols.

  • Other content types like images, sound, and video will be discussed in Chapter 8.

  • The digitizing process converts content into binary, which is essential but only half of the information representation problem.

What is Metadata?

  • The second half of the problem is describing the properties of the information; these descriptions are known as metadata.

  • Metadata answers essential questions such as:

    • How is the content structured?

    • What other content is it related to?

    • Where was it collected?

    • What units are measurements in?

    • How should it be displayed?

    • When was it created or captured?

  • Metadata is information that describes other information; unlike numbers or letters, it does not require its own binary encoding.

  • Common methods of specifying metadata include using tags (e.g., in HTML).

  • Content should be kept separate from the metadata that controls its display (e.g., font choices), so the same content can be presented in different ways.

The Price of Metadata

  • The price of a product can be included as part of its metadata:

    • Example: Stores mark items with barcodes (Universal Product Codes, or UPCs) and look up the price at checkout, so a price can be changed without relabeling individual items.
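
The barcode example can be sketched as a lookup table: the item carries only an identifier, and the price is stored separately. This is a minimal sketch; the UPC value, the prices, and the name price_of are all made up for illustration.

```python
# Hypothetical price table keyed by UPC; the code and prices are invented.
prices = {"036000291452": 3.49}

def price_of(upc: str) -> float:
    """Look up the current price for a scanned UPC."""
    return prices[upc]

before = price_of("036000291452")
prices["036000291452"] = 2.99   # a price change touches only the table
after = price_of("036000291452")
print(before, after)
```

Nothing printed on the item changes between the two lookups; only the table entry does.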

The Oxford English Dictionary (OED)

  • The OED is the comprehensive reference for English words, including meanings, etymologies, and usage.

  • The printed version consists of 20 volumes and weighs 150 pounds; it was initially expected to be 4 volumes totaling 6,400 pages.

  • The first edition, completed in 1928, included 15,490 pages with 252,200 entries.

  • Digitization started in 1984 to provide easier access and retrieval of word definitions.

  • Extended ASCII (Latin-1) code for the pound symbol ("£"):

    • Hex: A3

    • Binary: 1010 0011

    • Decimal: 163 (binary 1010 0011 = 128 + 32 + 2 + 1). The code identifies the symbol itself, not an amount in pounds.
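
The three forms of the code can be checked with a short sketch. It assumes the byte is interpreted as ISO 8859-1 (Latin-1), the 8-bit extension of ASCII in which A3 is the pound sign; the variable names are illustrative.

```python
code = int("A3", 16)                      # hex A3 as an integer
bits = format(code, "08b")                # the same value as eight binary digits
symbol = bytes([code]).decode("latin-1")  # assumes Latin-1 interpretation
print(code, bits, symbol)                 # 163 10100011 £
```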

Challenges in Digital Dictionaries

  • Searching for the definition of a common word (e.g., "set") is difficult because the word also appears many times inside the definitions of other words.

  • Computers cannot tell which occurrence is the definition; human readers rely on context cues (e.g., punctuation, layout) that software does not inherently understand.

  • Metadata can enhance digital dictionary structure to facilitate easier searching by associating tags with definitions.
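
The search problem can be illustrated with a small sketch. The miniature dictionary text below is made up (it is not OED content), and it borrows the <hw> headword tag only to show the idea.

```python
import re

# Made-up miniature dictionary text; not actual OED content.
text = (
    "<e><hw>set</hw> <ps>v.</ps> To put in a place.</e>\n"
    "<e><hw>table</hw> <ps>n.</ps> A set of facts arranged in rows.</e>\n"
    "<e><hw>kit</hw> <ps>n.</ps> A set of tools.</e>\n"
)

plain_hits = len(re.findall(r"\bset\b", text))          # every occurrence of the word
headword_hits = len(re.findall(r"<hw>set</hw>", text))  # only where it is being defined
print(plain_hits, headword_hits)                        # 3 1
```

A plain word search returns every use of "set"; restricting the search to headwords returns only the entry that defines it.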

Structure Tags in the OED

  • The OED utilizes a special set of tags to specify structure.

  • Example tags include:

    • <hw> for headwords (words being defined).

    • <pr> for pronunciation.

    • <ph> for phonetic notations.

    • <ps> for part of speech.

    • <hm> for homonym numbers (distinguishing headwords that are spelled alike).

  • Tags can enclose other tags: <e> surrounds an entire entry, and <hg> (head group) surrounds all the information at the start of a definition.
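
As a rough sketch of how software might use these tags, the code below pulls individual fields out of a tagged entry. The entry string is invented for illustration (real OED markup is richer), and the helper name field is hypothetical.

```python
import re
from typing import Optional

# Illustrative entry only; not the actual OED text or full markup.
entry = "<e><hg><hw>byte</hw> <pr><ph>baIt</ph></pr> <ps>n.</ps></hg> A group of eight bits.</e>"

def field(tag: str, text: str) -> Optional[str]:
    """Return the text inside the first <tag>...</tag> pair, or None if absent."""
    match = re.search(rf"<{tag}>(.*?)</{tag}>", text)
    return match.group(1) if match else None

headword = field("hw", entry)
part_of_speech = field("ps", entry)
print(headword, part_of_speech)   # byte n.
```

Because <hg> encloses the other tags, field("hg", entry) returns the whole head group, inner tags included.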

Formatting and Structure

  • Tags improve readability without altering content.

  • Users can locate definitions using structure tags, enhancing the search process.

  • Formatting tags control how text appears (e.g., bold for headwords and italics for parts of speech).

  • Structure tags enable automatic formatting based on the information provided.
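
Automatic formatting can be sketched as a mapping from structure tags to display tags. The mapping below (headword to HTML bold, part of speech to HTML italics) and the function name render are assumptions chosen for illustration, not the OED's actual rules.

```python
import re

# Assumed mapping from structure tags to HTML display tags; purely illustrative.
STYLE = {"hw": "b", "ps": "i"}

def render(entry: str) -> str:
    """Swap mapped structure tags for formatting tags and drop all other tags."""
    def swap(match):
        slash, tag = match.group(1), match.group(2)
        return f"<{slash}{STYLE[tag]}>" if tag in STYLE else ""
    return re.sub(r"<(/?)(\w+)>", swap, entry)

entry = "<e><hg><hw>byte</hw> <ps>n.</ps></hg> A group of eight bits.</e>"
formatted = render(entry)
print(formatted)   # <b>byte</b> <i>n.</i> A group of eight bits.
```

Changing the look of every headword then means editing one line of the mapping rather than every entry.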

Case Study: The Word "Byte"

  • The digitized entry for "byte" illustrates the application of both formatting and structure tags for clarity.

  • Tags add characters of their own, so a tagged file is larger than the same content stored as plain text.
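
The size cost of tags can be measured directly. This sketch uses a made-up tagged entry and counts characters with and without the tags.

```python
import re

# Invented entry for illustration; not the real OED text.
tagged = "<e><hg><hw>byte</hw> <ps>n.</ps></hg> A group of eight bits.</e>"
plain = re.sub(r"</?\w+>", "", tagged)   # strip every opening and closing tag
overhead = len(tagged) - len(plain)      # characters contributed by tags alone
print(len(plain), len(tagged), overhead)
```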

  • The term "byte" was coined by Werner Buchholz in the 1950s to denote a unit of memory made up of a group of bits.

  • The spelling "byte" (rather than "bite") was chosen to avoid confusion with "bit".

  • Error Detection:

    • A bit in memory can change unexpectedly, so engineers add safeguards to detect such errors.

    • A common safeguard is an extra parity bit stored with each byte, which reveals when a single bit has changed.
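
The parity idea can be sketched in a few lines. This assumes even parity (the extra bit makes the total number of 1s even); the function names are illustrative.

```python
def parity_bit(byte: int) -> int:
    """Even parity: return the bit that makes the total count of 1s even."""
    return bin(byte).count("1") % 2

def check(byte: int, parity: int) -> bool:
    """True if the stored byte still agrees with its parity bit."""
    return parity_bit(byte) == parity

stored = 0b10100011
p = parity_bit(stored)            # four 1s, so the parity bit is 0
flipped = stored ^ 0b00000100     # simulate a single bit flipping in memory
print(check(stored, p), check(flipped, p))  # True False
```

A single parity bit detects any odd number of flipped bits but cannot say which bit changed, and it misses an even number of flips.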