Notes on "ChatGPT Is a Blurry JPEG of the Web"
Xerox Photocopier Incident
- A Xerox photocopier used a lossy compression format (JBIG2), which introduced subtle but significant errors.
- On a floor plan, it replaced the distinct area labels of three rooms (14.13, 21.11, and 17.42 sq meters) with a single value (14.13 sq meters), because the compressor judged the labels similar enough to store only one of them.
- The problem was that the inaccuracies were not immediately obvious: the copies were sharp and readable, so they seemed accurate when they were not.
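The substitution error above can be illustrated with a toy sketch. This is not the real JBIG2 algorithm: the 3x5 glyph bitmaps, the `threshold` parameter, and all function names are hypothetical, chosen only to show how storing one pattern for "similar enough" symbols yields a crisp but wrong copy.

```python
# Toy symbol-substitution compressor (illustrative only, not actual JBIG2).
# Each glyph is a 3x5 bitmap flattened row by row into a string of 0s and 1s.
GLYPHS = {
    "1": "010110010010111",
    "6": "111100111101111",
    "8": "111101111101111",  # differs from "6" by a single pixel
}

def hamming(a, b):
    """Count the pixels at which two bitmaps differ."""
    return sum(x != y for x, y in zip(a, b))

def compress(text, threshold):
    """Store each distinct glyph once; a glyph within `threshold` pixels
    of an already stored glyph reuses that stored glyph instead."""
    dictionary = []  # list of (character, bitmap) actually stored
    indices = []     # one dictionary index per input character
    for ch in text:
        bitmap = GLYPHS[ch]
        for i, (_, stored) in enumerate(dictionary):
            if hamming(bitmap, stored) <= threshold:
                indices.append(i)
                break
        else:
            dictionary.append((ch, bitmap))
            indices.append(len(dictionary) - 1)
    return dictionary, indices

def decompress(dictionary, indices):
    """Rebuild the text from the stored glyphs."""
    return "".join(dictionary[i][0] for i in indices)

def roundtrip(text, threshold):
    return decompress(*compress(text, threshold))

exact = roundtrip("1868", threshold=0)  # "1868": only identical glyphs are merged
lossy = roundtrip("1868", threshold=1)  # "1888": the 6 is replaced by a stored 8
```

With the looser threshold the output is still made of perfectly sharp digits, which is why this kind of error is hard to spot.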
ChatGPT as Lossy Compression
- ChatGPT is like a "blurry JPEG" of the web, retaining information but not exact sequences.
- It uses statistical regularities to compress text, similar to lossy compression.
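- "Hallucinations" (nonsensical answers to factual questions) can be read as compression artifacts, analogous to the Xerox copier's errors: plausible enough that they are hard to spot without checking against the original.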
- Interpolation: ChatGPT estimates missing information by looking at surrounding text, like averaging pixels in image compression.
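As a rough illustration of interpolation, the sketch below drops every other sample from a row of pixel values and fills the gaps by linear interpolation between the surviving neighbors. The function names and the sample values are assumptions made for this example; real image codecs are far more elaborate.

```python
def downsample(values, step):
    """Lossy step: keep only every `step`-th sample."""
    return values[::step]

def interpolate(kept, step, length):
    """Reconstruct `length` samples by linear interpolation between the
    kept samples on either side of each position."""
    out = []
    for i in range(length):
        left = i // step
        right = min(left + 1, len(kept) - 1)
        frac = (i % step) / step
        out.append(kept[left] * (1 - frac) + kept[right] * frac)
    return out

row = [10, 12, 14, 90, 18, 20, 22]   # 90 is a genuine detail in the original
kept = downsample(row, 2)            # [10, 14, 18, 22]
restored = interpolate(kept, 2, len(row))
# restored == [10.0, 12.0, 14.0, 16.0, 18.0, 20.0, 22.0]
```

The dropped value 90 comes back as 16.0: a smooth, plausible guess built from its surroundings, and wrong.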
Compression and Understanding
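- Marcus Hutter's Prize for Compressing Human Knowledge rewards lossless compression of a fixed snapshot of Wikipedia; Hutter believes that better text compression will help lead to AI.
- The greatest compression is achieved by understanding the text and its underlying principles (e.g., arithmetic, economics). A file full of arithmetic examples, for instance, could in principle be replaced by a rule that regenerates the answers.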
- ChatGPT identifies statistical regularities (e.g., "supply is low" and "prices rise") but doesn't necessarily understand the underlying concepts.
- GPT-3 struggles with arithmetic, especially with larger numbers, indicating a lack of genuine understanding.
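The link between compression and understanding can be sketched with a toy corpus of addition facts. The corpus format, the helper names, and the use of `zlib` as a stand-in for a purely statistical compressor are all assumptions for illustration. A compressor that "knows" the rule of addition can discard every sum and regenerate it exactly, so it remains lossless while storing less.

```python
import random
import zlib

def make_corpus(n, seed=0):
    """Build n lines of the form 'a + b = c' with pseudo-random operands."""
    rng = random.Random(seed)
    lines = []
    for _ in range(n):
        a, b = rng.randrange(1000), rng.randrange(1000)
        lines.append(f"{a} + {b} = {a + b}")
    return "\n".join(lines)

def compress_with_rule(corpus):
    """Keep only the operands; the sums are implied by the rule of addition."""
    return "\n".join(line.split(" = ")[0] for line in corpus.split("\n"))

def decompress_with_rule(packed):
    """Regenerate every sum by actually doing the arithmetic."""
    lines = []
    for expr in packed.split("\n"):
        a, b = (int(x) for x in expr.split(" + "))
        lines.append(f"{a} + {b} = {a + b}")
    return "\n".join(lines)

corpus = make_corpus(1000)
packed = compress_with_rule(corpus)

# Lossless: the rule-based round trip restores the corpus exactly.
restored = decompress_with_rule(packed)

# Compare compressed sizes: statistics alone versus statistics plus the rule.
statistical_only = len(zlib.compress(corpus.encode()))
with_rule = len(zlib.compress(packed.encode()))
```

Comparing `with_rule` against `statistical_only` shows how much of the file a statistical compressor spends on sums it cannot predict but a rule can reproduce.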
Lossy vs. Lossless Compression
- If ChatGPT were a lossless algorithm, it would answer with verbatim quotes from web pages, much like a search engine.
- Because it is lossy, ChatGPT rephrases rather than quotes, which makes it seem smarter and creates an illusion of understanding.
Applications of Large Language Models
- Evaluating LLMs requires considering whether they've been trained on biased or inaccurate data.
- Some blurriness is unacceptable: outright fabrication, as opposed to merely restating information in different words.
Concerns
- Generating web content with LLMs may lead to a blurrier web due to repackaging of existing information.
- If OpenAI excludes ChatGPT-generated content from GPT-4's training data, that would suggest concerns about the quality of such content.
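The worry about a progressively blurrier web resembles making photocopies of photocopies. The sketch below is only a loose analogy, with a hypothetical three-sample averaging `blur` standing in for one round of lossy repackaging: each generation is derived from the previous one, and the contrast in the signal keeps shrinking.

```python
def blur(values):
    """One lossy 'generation': replace each sample with the average of
    itself and its immediate neighbors."""
    out = []
    for i in range(len(values)):
        window = values[max(0, i - 1): i + 2]
        out.append(sum(window) / len(window))
    return out

def spread(values):
    """Contrast remaining in the signal (max minus min)."""
    return max(values) - min(values)

def generation_spreads(values, generations):
    """Blur repeatedly, recording the spread after each generation."""
    spreads = [spread(values)]
    for _ in range(generations):
        values = blur(values)
        spreads.append(spread(values))
    return spreads

# A high-contrast original, then three generations of copies of copies.
spreads = generation_spreads([0, 100] * 4, 3)
```

In this toy run the recorded spread falls with every generation, so later copies carry less of the original detail than earlier ones.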
Original Writing
- Using LLMs might not be the best approach for original work.
- The process of writing and rewriting, even unoriginal work, is crucial for developing original ideas and writing skills. A first draft is not an unoriginal idea expressed clearly but an original idea expressed poorly.