ChatGPT as a Blurry JPEG: Lossy Compression, Artifacts, and Understanding
The Xerox incident and lossy compression
2013: a German Xerox photocopier altered numbers while copying floor plans; distinct room areas such as 14.13, 21.11, and 17.42 m² all came out labeled 14.13 m².
Patch released in 2014 to fix the issue.
Root cause: lossy compression used by the copier; it stores a single copy for similar-looking regions and reuses it, degrading fidelity in a subtle way.
Key idea: the problem isn’t lossy compression per se, but that the degradation shows up as readable-but-incorrect data, so the copies look accurate while being wrong.
Lossy vs. lossless compression (essential definitions)
Lossless: restored data exactly equals original.
Lossy: restored data is only an approximation.
Encoding → Decoding steps; aim to reduce size while preserving essential information.
Common domains: lossless for text/software; lossy for images/audio/video.
Compression ratio: original size ÷ compressed size (e.g., 10 MB compressed to 2 MB is 5:1); higher ratios save more space but, for lossy formats, discard more information.
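The lossless/lossy distinction and the compression ratio can be demonstrated in a few lines with Python's standard zlib module (a lossless codec): the encode/decode round trip reproduces the input byte-for-byte, and the ratio is just original size over compressed size.

```python
import zlib

# Lossless compression round-trips exactly; compression ratio = original / compressed.
original = ("the quick brown fox jumps over the lazy dog. " * 200).encode()

compressed = zlib.compress(original)    # encoding step
restored = zlib.decompress(compressed)  # decoding step

assert restored == original             # lossless: exact reconstruction
ratio = len(original) / len(compressed)
print(f"{len(original)} -> {len(compressed)} bytes, ratio {ratio:.1f}:1")
```

The high ratio here comes from the input's heavy repetition; a lossy codec would achieve even higher ratios by settling for an approximation instead of an exact reconstruction.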
How lossy compression works in images
Identify similar-looking regions; store one copy and reuse on decompression.
Reduces storage but can introduce artifacts that are not immediately obvious.
Xerox used a lossy format (JBIG2) for black-and-white images; artifacts can occur even when text remains readable.
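A toy sketch of the region-reuse idea (illustrative only; not Xerox's actual JBIG2 implementation): near-identical blocks are stored once and reused on decompression, so small but meaningful differences between regions silently vanish.

```python
# Toy region-reuse lossy compression: blocks within `threshold` of a stored
# block are replaced by that block's index instead of being stored again.

def compress(blocks, threshold=2):
    """Map each block to the index of the first stored block it resembles."""
    stored, indices = [], []
    for b in blocks:
        for i, s in enumerate(stored):
            if max(abs(x - y) for x, y in zip(b, s)) <= threshold:
                indices.append(i)           # reuse an existing, similar block
                break
        else:
            indices.append(len(stored))
            stored.append(b)                # store a new representative
    return stored, indices

def decompress(stored, indices):
    return [stored[i] for i in indices]

# The first two blocks differ only slightly (like a 6 vs. an 8 at low resolution)...
blocks = [(10, 10, 12, 10), (10, 11, 12, 10), (90, 90, 90, 90)]
out = decompress(*compress(blocks))
print(out)  # ...but both decode to the same block: a readable-looking error
```

This is exactly the failure mode of the floor plans: the output is crisp and legible, so nothing signals that two originally different regions were merged.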
Analogy to large language models (ChatGPT)
Think of ChatGPT as a blurry JPEG of all text on the Web: retains much information but not in exact original form.
Exact word-for-word retrieval is impossible; the model provides gist and reformulations.
Hallucinations are compression artifacts: plausible but not guaranteed to be true; require checking against originals or knowledge.
Interpolation and understanding
A common lossy strategy: interpolate missing parts from neighbors; reconstruction using surrounding context.
This yields plausible text but not guaranteed accuracy or internal consistency with underlying facts.
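The interpolation strategy can be made concrete with a minimal sketch (the neighbor-averaging rule here is an illustrative assumption, not any specific codec): a missing value is reconstructed from its nearest known neighbors, yielding an answer that is plausible but not guaranteed to match what was actually there.

```python
# Fill missing entries by averaging the nearest known neighbors on each side.
# The reconstruction is smooth and plausible, but the true value may differ.

def interpolate_missing(values):
    """Replace each None with the mean of its nearest known neighbors."""
    out = list(values)
    for i, v in enumerate(out):
        if v is None:
            left = next(out[j] for j in range(i - 1, -1, -1) if out[j] is not None)
            right = next(out[j] for j in range(i + 1, len(out)) if out[j] is not None)
            out[i] = (left + right) / 2
    return out

row = [10, 12, None, 18, 20]
filled = interpolate_missing(row)
print(filled)  # the gap is filled with 15.0 -- plausible, possibly wrong
```

If the true missing value was an outlier (say, 40), interpolation erases it while producing output that looks perfectly consistent, which is the textual analogue of a hallucination.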
Hutter Prize and compression of knowledge
Marcus Hutter’s Prize (since 2006) rewards lossless compression of a 1 GB snapshot of English Wikipedia (the enwik9 file).
Real-world progress: standard zip reduces the 1 GB file to roughly 300 MB, while top prize entries have reached roughly 115 MB; such gains suggest a form of compacting knowledge.
Aim: better text compression may reveal deeper understanding.
Does compression imply understanding?
LLMs identify statistical regularities (e.g., the phrase “supply is low” frequently appearing near “prices rise”).
They don’t necessarily derive underlying principles (e.g., arithmetic rules): GPT-3 is weak on addition of large (e.g., five-digit) numbers, suggesting memorized correlations rather than genuine rule-based computation.
Even when outputs seem to show understanding, they may be surface-level generalizations rather than true comprehension.
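The contrast between surface-level regularities and underlying principles can be caricatured in code (a deliberately simplistic analogy, not a claim about GPT-3's internals): a lookup table that memorizes every sum it has seen fails on unseen large numbers, while a rule-based procedure that actually carries digits generalizes to any size.

```python
# A "model" that memorizes training pairs vs. a procedure that knows the rule.

training = {(a, b): a + b for a in range(100) for b in range(100)}

def memorized_add(a, b):
    return training.get((a, b))        # None when this pair was never seen

def rule_based_add(a, b):
    # Digit-by-digit addition with carrying: the principle, not the examples.
    result, carry = [], 0
    da, db = str(a)[::-1], str(b)[::-1]
    for i in range(max(len(da), len(db))):
        s = (int(da[i]) if i < len(da) else 0) + (int(db[i]) if i < len(db) else 0) + carry
        result.append(str(s % 10))
        carry = s // 10
    if carry:
        result.append(str(carry))
    return int("".join(reversed(result)))

print(memorized_add(41, 57))       # seen during "training": 98
print(memorized_add(24847, 7924))  # unseen large inputs: None
print(rule_based_add(24847, 7924)) # the carrying rule generalizes: 32771
```

The lookup table even scores perfectly on its training distribution, which mirrors how statistically fluent output can masquerade as understanding until tested off-distribution.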
Types of blurriness and evaluation
Acceptable blurriness: restatement in different words; preserves usefulness.
Unacceptable blurriness: fabrication; verifiability is compromised.
Challenge: whether we can keep useful, rephrased content while avoiding fabricated facts; current limits are uncertain.
Uses and limits of LLMs
Replacing traditional search: unlikely to reproduce exact originals; often rephrases rather than quotes.
As writing aids: can help draft, but starting from a blurry copy of unoriginal work risks hindering true originality.
Drafting process: original work typically emerges through revising and refining one’s own rough first drafts; starting from a machine-generated draft skips that struggle.
Potential in art: by analogy with Xerox art, AI could enable genuinely new forms, though such tools have not yet become an essential part of most creators’ workflows.
Internet integrity, training data, and the Web
Proliferation of generated text could blur the Web, affecting search and credibility.
Training-data concerns: training future models on model-generated text is like repeatedly recompressing a JPEG—each generation gets blurrier.
The fact that AI companies deliberately keep model-generated content out of training data reinforces the analogy between LLMs and lossy compression.
Future outlook and takeaway
If model outputs ever become good enough to train newer models without degradation, that would be evidence of genuinely high quality—and the compression analogy would no longer apply; until then, the limitations remain.
The lossy-compression analogy remains a useful lens to discuss capabilities and limitations of LLMs: they summarize and interpolate rather than faithfully reproduce original sources.
The broader implication: critical appraisal is needed when relying on LLMs for factual content; verification against reliable sources is essential.
Bottom line
Lossy compression explains why LLMs can give plausible, useful text without exact original sources.
Hallucinations and fabrications are akin to compression artifacts and require verification.
While helpful for many tasks, LLMs don’t yet replace precise retrieval, original thought, or rigorous computation; human judgment remains essential.