ChatGPT as a Blurry JPEG: Lossy Compression, Artifacts, and Understanding

Xerox incident and the lossiness

  • 2013: a German Xerox photocopier altered floor plans while copying; room-area labels collapsed to a single value, e.g., all rooms labeled 14.13 m^2 instead of the correct 14.13, 21.11, and 17.42 m^2.

  • Patch released in 2014 to fix the issue.

  • Root cause: lossy compression used by the copier; it stores a single copy for similar-looking regions and reuses it, degrading fidelity in a subtle way.

  • Key idea: the problem isn’t lossy compression per se, but that it can produce readable-but-incorrect data: copies that look accurate while being wrong, so the errors pass unnoticed.

Lossy vs. lossless compression (essential definitions)

  • Lossless: restored data exactly equals the original: F^{-1}(F(x)) = x

  • Lossy: restored data is only an approximation: F^{-1}(F(x)) ≠ x in general, but F^{-1}(F(x)) ≈ x

  • Encoding → Decoding steps; aim to reduce size while preserving essential information.

  • Common domains: lossless for text/software; lossy for images/audio/video.

  • Compression ratio: R = |original| / |compressed|
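
The definitions above can be sketched in a few lines of Python: a real lossless round trip using zlib, and a toy lossy scheme (quantization, chosen here purely for illustration) whose round trip only approximates the input.

```python
import zlib

# Lossless: the decoder recovers the input exactly, F^{-1}(F(x)) == x.
text = b"to be or not to be " * 50
assert zlib.decompress(zlib.compress(text)) == text

# Lossy (toy quantizer, for illustration only): rounding to a coarse
# step discards precision, so the round trip is only approximate.
def encode(xs, step=0.5):
    return [round(x / step) for x in xs]   # small integers, cheaper to store

def decode(qs, step=0.5):
    return [q * step for q in qs]

samples = [1.07, 2.49, 3.92]
restored = decode(encode(samples))
# restored == [1.0, 2.5, 4.0]: close to the original, but not equal to it.
```

Text and software demand the first behavior; images, audio, and video usually tolerate the second.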

How lossy compression works in images

  • Identify similar-looking regions; store one copy and reuse on decompression.

  • Reduces storage but can introduce artifacts that are not immediately obvious.

  • Xerox used a lossy format (JBIG2) for black-and-white images; artifacts can occur even when the text remains readable.
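
The symbol-reuse idea behind JBIG2’s lossy mode can be sketched as follows (a toy illustration of the concept, not the actual JBIG2 format): tiles of a black-and-white image that look similar enough are collapsed onto one stored representative, which is exactly how one digit can silently become another.

```python
# Toy symbol-matching compression in the spirit of JBIG2's lossy mode
# (illustrative only; the real format is far more involved). Tiles that
# differ by at most `threshold` pixels share one stored representative.

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def compress(tiles, threshold=1):
    reps, indices = [], []
    for tile in tiles:
        for i, rep in enumerate(reps):
            if hamming(tile, rep) <= threshold:
                indices.append(i)          # reuse an existing symbol
                break
        else:
            indices.append(len(reps))      # store a new symbol
            reps.append(tile)
    return reps, indices

def decompress(reps, indices):
    return [reps[i] for i in indices]

# Two tiles differing by a single pixel collapse into the same symbol:
tiles = ["0110", "0111", "1000"]
reps, indices = compress(tiles)
out = decompress(reps, indices)
# out == ["0110", "0110", "1000"]: still crisp and readable,
# but the second tile is no longer what was scanned.
```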

Analogy to large language models (ChatGPT)

  • Think of ChatGPT as a blurry JPEG of all text on the Web: retains much information but not in exact original form.

  • Exact word-for-word retrieval is impossible; the model provides gist and reformulations.

  • Hallucinations are compression artifacts: plausible but not guaranteed to be true; require checking against originals or knowledge.

Interpolation and understanding

  • A common lossy strategy: interpolate missing parts from their neighbors, reconstructing data from the surrounding context.

  • This yields plausible text but not guaranteed accuracy or internal consistency with underlying facts.
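
A minimal sketch of interpolation as a lossy strategy (assuming, for simplicity, that gaps never sit at the ends of the signal): each missing sample is reconstructed as the average of its neighbors, which is plausible but not guaranteed to match what was lost.

```python
def interpolate_gaps(samples):
    """Fill each None with the average of its two neighbors."""
    out = list(samples)
    for i, value in enumerate(out):
        if value is None:
            out[i] = (out[i - 1] + out[i + 1]) / 2
    return out

signal = [10.0, None, 14.0, None, 20.0]
print(interpolate_gaps(signal))  # [10.0, 12.0, 14.0, 17.0, 20.0]
```

The filled values look smooth and reasonable, but if an original sample was, say, 11.9 rather than 12.0, the reconstruction is subtly wrong in a hard-to-spot way, much like a hallucinated sentence.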

Hutter Prize and compression of knowledge

  • The Hutter Prize (established by Marcus Hutter in 2006) rewards lossless compression of a 1 GB snapshot of English Wikipedia.

  • Real-world progress: standard zip reduces the 1 GB file to roughly 300 MB; prize-winning entries approach ~115 MB; success suggests a form of compactifying knowledge.

  • Aim: better text compression may reveal deeper understanding.
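
As a small illustration of measuring a compression ratio with deflate (the algorithm behind zip), using arbitrary filler text chosen here for convenience; note that the heavy repetition makes the ratio far higher than what zip achieves on real Wikipedia prose (roughly 3x):

```python
import zlib

# Arbitrary filler text (not Wikipedia); repetition inflates the ratio.
text = ("Better compression of text may require, and so may reveal, "
        "a deeper understanding of it. ") * 200
data = text.encode()
compressed = zlib.compress(data, level=9)
ratio = len(data) / len(compressed)
print(f"{len(data)} -> {len(compressed)} bytes, R ~= {ratio:.0f}")
assert zlib.decompress(compressed) == data   # still perfectly lossless
```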

Does compression imply understanding?

  • LLMs identify statistical regularities in text (e.g., phrases like “supply is low” tend to co-occur with “prices rise”).

  • They don’t necessarily derive the underlying principles (e.g., the rules of arithmetic): GPT-3 is weak at arithmetic on large numbers, suggesting statistical pattern-matching rather than genuine rule-based computation.

  • Even when outputs seem to show understanding, they may be surface-level generalizations rather than true comprehension.
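
The distinction can be made concrete with a deliberately crude sketch (not a claim about how GPT-3 actually works): a model that only memorizes observed question/answer pairs fails outside its training set, while the carry-based rule of grade-school addition generalizes to numbers it has never seen.

```python
# "Memorization": answers recalled from a finite table of seen examples.
train = {(a, b): a + b for a in range(100) for b in range(100)}

def lookup_add(a, b):
    return train.get((a, b))          # None for any unseen pair

# "Understanding": the digit-by-digit rule with carrying.
def rule_add(a, b):
    result, carry, place = 0, 0, 1
    while a or b or carry:
        s = a % 10 + b % 10 + carry
        result += (s % 10) * place
        carry, place = s // 10, place * 10
        a, b = a // 10, b // 10
    return result

print(lookup_add(12, 34))      # 46: this pair appeared in "training"
print(lookup_add(1234, 5678))  # None: never memorized
print(rule_add(1234, 5678))    # 6912: the rule generalizes
```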

Types of blurriness and evaluation

  • Acceptable blurriness: restatement in different words; preserves usefulness.

  • Unacceptable blurriness: fabrication; verifiability is compromised.

  • Challenge: whether we can keep useful, rephrased content while avoiding fabricated facts; current limits are uncertain.

Uses and limits of LLMs

  • Replacing traditional search: unlikely to reproduce exact originals; often rephrases rather than quotes.

  • As writing aids: can help draft, but starting from a blurry copy of unoriginal work risks hindering true originality.

  • Drafting process: original writing typically emerges through revision and refinement, not from a single initial burst of generated text.

  • Potential in art: by analogy with Xerox art, AI could enable new forms, but most creators do not yet rely on such tools as an essential part of their workflow.

Internet integrity, training data, and the Web

  • Proliferation of generated text could blur the Web, affecting search and credibility.

  • Training data quality: if a model’s output were good enough to use as training data for future models, that would be evidence of its quality and utility.

  • Conversely, the fact that AI companies deliberately exclude model-generated content from training data reinforces the analogy between LLMs and lossy compression: the output is treated as a degraded copy.

Future outlook and takeaway

  • If model outputs become good enough to train newer models, confidence in their quality grows; otherwise, limitations remain.

  • The lossy-compression analogy remains a useful lens to discuss capabilities and limitations of LLMs: they summarize and interpolate rather than faithfully reproduce original sources.

  • The broader implication: critical appraisal is needed when relying on LLMs for factual content; verification against reliable sources is essential.

Bottom line

  • Lossy compression explains why LLMs can give plausible, useful text without exact original sources.

  • Hallucinations and fabrications are akin to compression artifacts and require verification.

  • While helpful for many tasks, LLMs don’t yet replace precise retrieval, original thought, or rigorous computation; human judgment remains essential.