Notes on "ChatGPT Is a Blurry JPEG of the Web"

Xerox Photocopier Incident

  • A Xerox photocopier used a lossy compression format (JBIG2), which introduced subtle but significant errors.
  • When copying a floor plan, it replaced the area labels of three rooms (14.13, 21.11, and 17.42 sq meters) with a single value (14.13 sq meters), because the compressor judged the labels similar enough to store only one of them.
  • The errors were not immediately obvious: the copies looked accurate when they were not.
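
A toy sketch of how this kind of substitution can happen. It is not the JBIG2 algorithm itself, only an illustration of "store similar-looking symbols once". The 3x5 bitmaps, the function names, and the pixel-difference threshold are all hypothetical choices made for this example.

```python
# Toy glyphs as 3x5 bitmaps flattened to 15-character strings ("1" = ink).
# These shapes are made up for illustration.
EIGHT = "111" "101" "111" "101" "111"
SIX   = "111" "100" "111" "101" "111"   # differs from EIGHT in one pixel
ONE   = "010" "110" "010" "010" "111"


def hamming(a, b):
    """Count the positions where two equal-length bitmaps differ."""
    return sum(x != y for x, y in zip(a, b))


def compress_glyphs(glyphs, threshold):
    """Store each distinct-looking glyph once; reuse a stored glyph when a
    new one differs from it by at most `threshold` pixels."""
    dictionary, indices = [], []
    for glyph in glyphs:
        for i, stored in enumerate(dictionary):
            if hamming(glyph, stored) <= threshold:
                indices.append(i)        # "close enough": reuse stored glyph
                break
        else:
            dictionary.append(glyph)     # new shape: add it to the dictionary
            indices.append(len(dictionary) - 1)
    return dictionary, indices


def decompress_glyphs(dictionary, indices):
    """Rebuild the page from the dictionary; substituted glyphs look crisp."""
    return [dictionary[i] for i in indices]


page = [EIGHT, SIX, ONE]
exact = decompress_glyphs(*compress_glyphs(page, threshold=0))
lossy = decompress_glyphs(*compress_glyphs(page, threshold=1))
print(exact == page)   # True: nothing was merged
print(lossy == page)   # False: the 6 was silently replaced by an 8
```

The point of the sketch is that the lossy output is not blurry. The wrong digit is drawn as sharply as the right ones, which is why the error is hard to spot.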

ChatGPT as Lossy Compression

  • ChatGPT is like a "blurry JPEG" of the web: it retains much of the information but not the exact sequences of words.
  • It captures statistical regularities in text, much as a lossy compression algorithm does.
  • "Hallucinations" (nonsensical answers) are compression artifacts, analogous to the Xerox copier's errors.
  • Interpolation: ChatGPT estimates missing information by looking at surrounding text, like averaging pixels in image compression.
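
A minimal sketch of the interpolation idea on a single row of pixel values. The sample rows and the function names are assumptions made for illustration, and real image codecs are considerably more elaborate than this.

```python
def drop_pixels(row, step=2):
    """Lossy step: keep every `step`-th pixel and discard the rest."""
    return row[::step]


def interpolate(kept, step, length):
    """Rebuild `length` pixels by linear interpolation between kept samples."""
    rebuilt = []
    for i in range(length):
        left = i // step
        right = min(left + 1, len(kept) - 1)   # clamp at the last sample
        frac = (i % step) / step
        rebuilt.append(kept[left] * (1 - frac) + kept[right] * frac)
    return rebuilt


smooth = [10, 20, 30, 40, 50]
spiky = [10, 90, 30, 40, 50]    # same row, but pixel 1 carries real detail

print(interpolate(drop_pixels(smooth), 2, len(smooth)))
print(interpolate(drop_pixels(spiky), 2, len(spiky)))
```

Both calls print [10.0, 20.0, 30.0, 40.0, 50.0]. The smooth row is recovered, while the 90 in the second row is replaced by a plausible-looking 20, which is the sense in which a filled-in gap can be fluent and wrong at once.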

Compression and Understanding

  • Marcus Hutter's Prize for Compressing Human Knowledge rewards lossless compression of a fixed snapshot of Wikipedia, on the belief that better compression leads to AI.
  • The greatest compression is achieved by understanding the text and its underlying principles (e.g., arithmetic, economics).
  • ChatGPT identifies statistical regularities (e.g., that "supply is low" often appears near "prices rise") but doesn't necessarily understand the underlying concepts.
  • GPT-3 struggles with arithmetic, especially with larger numbers, which suggests it has not derived the underlying rules.
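
A rough sketch of why regularity and compression go together, using Python's standard zlib module. zlib is a general-purpose compressor and has nothing to do with actual Hutter Prize entries; the addition table and the RULE string are hypothetical stand-ins for "text" and "understanding the principle behind it".

```python
import os
import zlib

# The "rule": a short expression that regenerates every sum of two numbers
# below 100. Evaluating it stands in for "understanding" addition.
RULE = r'"".join(f"{a}+{b}={a+b}\n" for a in range(100) for b in range(100))'

table = eval(RULE).encode()       # 10,000 lines such as "37+58=95"
noise = os.urandom(len(table))    # same size, but with no regularities

packed_table = zlib.compress(table, 9)
packed_noise = zlib.compress(noise, 9)

# Lossless: decompression returns the exact original bytes.
assert zlib.decompress(packed_table) == table

print("table bytes:       ", len(table))
print("table, compressed: ", len(packed_table))
print("noise, compressed: ", len(packed_noise))
print("rule, as source:   ", len(RULE))
```

The regular table shrinks and the random bytes do not. The rule that generates the table is shorter still than the zlib output, which is the sense in which knowing the principle beats spotting surface patterns.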

Lossy vs. Lossless Compression

  • If ChatGPT were lossless, it would answer with verbatim quotes and would look like little more than a search engine.
  • Because it is lossy, ChatGPT rephrases rather than quotes, which makes it seem smarter and creates an illusion of understanding.

Applications of Large Language Models

  • Evaluating LLMs requires considering whether they've been trained on biased or inaccurate data.
  • Some blurriness is acceptable (restating information in different words); fabrication is not.

Concerns

  • Generating web content with LLMs may lead to a blurrier web due to repackaging of existing information.
  • OpenAI might exclude ChatGPT-generated content from GPT-4 training data, which would signal doubts about the quality of that content.

Original Writing

  • Using LLMs might not be the best approach for original work.
  • Writing and rewriting, even when the result is unoriginal, is crucial for developing original ideas and writing skills. A first draft is not an unoriginal idea expressed clearly but an original idea expressed poorly.