Metadata

Bits are used to digitize various forms of data: numbers, letters, and symbols.
Other content types like images, sound, and video will be discussed in Chapter 8.
The digitizing process converts content into binary, which is essential but only half of the information representation problem.

The second half of the problem is describing the properties of the information; that description is known as metadata.
Metadata answers essential questions such as:
- How is the content structured?
- What other content is it related to?
- Where was it collected?
- What units are measurements in?
- How should it be displayed?
- When was it created or captured?
Metadata is information that describes other information, and it does not require a binary encoding of its own.
A common way to specify metadata is with tags, as in HTML.
It is important to keep content separate from its display metadata (e.g., font choices).

The price of a product can be treated as part of its metadata:
- Example: Stores use barcodes (Universal Product Codes, or UPCs) for pricing. The barcode identifies the product and the price is looked up from it, so prices can be changed efficiently without altering individual items.
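
The barcode idea can be sketched as a lookup table. This is a minimal illustration, not how any particular store system works: the `prices` table, the `price_of` function, and the UPC values and prices are all made up for the example.

```python
# Hypothetical price table kept by the store, keyed by UPC.
# The UPC strings and prices below are invented for illustration.
prices = {
    "036000291452": 3.49,
    "012345678905": 1.99,
}

def price_of(upc: str) -> float:
    """Look up the current price for a scanned UPC."""
    return prices[upc]

# A price change touches only the table, not the items on the shelf.
prices["036000291452"] = 2.99
print(price_of("036000291452"))
```

The item carries only its identifier; the price lives in one place and is attached at lookup time.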

The Oxford English Dictionary (OED) is the comprehensive reference for English words, including meanings, etymologies, and usage.
The printed version consists of 20 volumes and weighs 150 pounds; it was initially expected to be 4 volumes totaling 6400 pages.
The first edition, completed in 1928, included 15,490 pages with 252,200 entries.
Digitization started in 1984 to provide easier access and retrieval of word definitions.
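Extended ASCII (ISO 8859-1) for the pound symbol ("£"):
- Hex: A3
- Binary: 1010 0011
- Read as a number, binary 1010 0011 is decimal 163, so the bits alone do not say whether they represent the character "£" or the value 163; metadata settles which interpretation applies.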
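
The arithmetic for the bit pattern 1010 0011 can be checked directly. The sketch below reads the same eight bits two ways, as an unsigned number and as a character; it assumes the ISO 8859-1 (Latin-1) character set, since 7-bit ASCII has no pound symbol.

```python
bits = "10100011"  # the pattern 1010 0011, written without the space

value = int(bits, 2)              # interpret the bits as an unsigned binary number
hex_form = format(value, "02X")   # the same value in two hex digits
char = bytes([value]).decode("latin-1")  # interpret the same byte as a Latin-1 character

print(value, hex_form, char)
```

Place values confirm the number: 128 + 32 + 2 + 1 = 163.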

Searching plain digitized text for the definition of a common word (e.g., "set") is difficult, because the word occurs frequently and in many contexts other than its own entry.
Computers cannot inherently tell which occurrence is the one being defined; human readers rely on context such as punctuation and layout.
Metadata can enhance digital dictionary structure to facilitate easier searching by associating tags with definitions.

The OED uses a special set of tags to specify structure. Example tags include:
- <hw> for headwords (words being defined).
- <pr> for pronunciation.
- <ph> for phonetic notations.
- <ps> for part of speech.
- <hm> for homonyms.
Each entry is enclosed in tags such as <e> for the entire entry and <hg> for the head group, which holds all the information at the start of a definition.
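
A minimal sketch of why structure tags help with searching. The tag names come from the notes; the two entries, their wording, and the `headword` helper are made up for illustration and are not actual OED text or software.

```python
import re

# Two hypothetical entries. Tag names follow the notes; the wording is invented.
dictionary = [
    "<e><hg><hw>set</hw> <ps>v.</ps></hg> To put in a place.</e>",
    "<e><hg><hw>byte</hw> <ps>n.</ps></hg> A set of adjacent bits.</e>",
]

def headword(entry: str) -> str:
    """Return the text inside the first <hw>...</hw> pair of an entry."""
    match = re.search(r"<hw>(.*?)</hw>", entry)
    return match.group(1) if match else ""

# Plain text search matches both entries, because "set" also occurs in a definition.
plain_hits = [e for e in dictionary if "set" in e]

# Searching on the headword tag matches only the entry that defines "set".
tagged_hits = [e for e in dictionary if headword(e) == "set"]

print(len(plain_hits), len(tagged_hits))
```

The regular expression is enough here because <hw> tags are not nested; a real tagged file would call for a proper parser.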

Tags improve readability without altering content.
Users can locate definitions using structure tags, enhancing the search process.
Examples include specific tags for formatting textual details (e.g., bold for headwords and italics for parts of speech).
Structure tags enable automatic formatting based on the information provided.
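
Automatic formatting from structure tags can be sketched as a tag substitution. The display rules in `STYLE` (bold for headwords, italics for parts of speech) follow the example in the notes; the `<b>` and `<i>` output tags, the `render` function, and the sample entry are assumptions made for illustration.

```python
import re

# Assumed display rules: headwords in bold, parts of speech in italics.
STYLE = {"hw": "b", "ps": "i"}

def render(entry: str) -> str:
    """Replace known structure tags with formatting tags and drop the rest."""
    def swap(match: re.Match) -> str:
        slash, name = match.group(1), match.group(2)
        if name in STYLE:
            return f"<{slash}{STYLE[name]}>"
        return ""  # structure-only tags such as <e> and <hg> leave no visible mark
    return re.sub(r"<(/?)(\w+)>", swap, entry)

# Hypothetical entry; the definition text is invented.
entry = "<e><hg><hw>byte</hw> <ps>n.</ps></hg> A group of eight bits.</e>"
print(render(entry))
```

Because the formatting is derived from the structure, changing how headwords look means editing one rule rather than every entry.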

The digitized entry for "byte" illustrates the application of both formatting and structure tags for clarity.
Including the tag information makes the data file larger than the plain text alone.
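
The size cost of tagging can be measured on a small sample. The entry below is hypothetical (tag names from the notes, wording invented), so the figures it produces describe only this sample and not the real OED files.

```python
import re

# Hypothetical tagged entry; the definition text is invented.
tagged = "<e><hg><hw>byte</hw> <ps>n.</ps></hg> A group of eight bits.</e>"

# Strip every opening and closing tag to recover the plain text.
plain = re.sub(r"</?\w+>", "", tagged)

tagged_size = len(tagged.encode("ascii"))  # bytes with tags
plain_size = len(plain.encode("ascii"))    # bytes without tags
overhead = tagged_size - plain_size        # bytes spent on tags alone

print(plain_size, tagged_size, overhead)
```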
The term "byte" was coined by Werner Buchholz in the 1950s to denote a small unit of memory; the spelling "byte" was chosen to avoid confusion with "bit".
Error Detection:
- Bits in memory are subject to occasional unintended changes, so engineers build in safeguards to detect such errors.
- A typical safeguard adds an extra parity bit to each byte so that bit-level changes can be detected.
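
A minimal sketch of parity checking, assuming even parity (the extra bit is chosen so the total number of 1 bits is even). The function names are made up for illustration; real memory hardware does this in circuitry, not software.

```python
def even_parity_bit(byte: int) -> int:
    """Return the bit that makes the total number of 1 bits even."""
    return bin(byte & 0xFF).count("1") % 2

def has_error(byte: int, parity: int) -> bool:
    """True if the stored parity bit no longer matches the byte."""
    return even_parity_bit(byte) != parity

stored = 0b10100011
parity = even_parity_bit(stored)   # the byte has four 1 bits, so the parity bit is 0

corrupted = stored ^ 0b00000100    # flip a single bit
print(has_error(stored, parity), has_error(corrupted, parity))
```

A single parity bit detects any odd number of flipped bits but misses an even number of flips, and it cannot say which bit changed.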