ChatGPT as a Blurry JPEG: Lossy Compression, Artifacts, and Understanding

Xerox incident and the lossiness

  • 2013: a German Xerox photocopier altered floor plans while copying; room-area labels collapsed to a single value, e.g., all rooms labeled 14.13 m^2 instead of the correct 14.13, 21.11, and 17.42 m^2.

  • Patch released in 2014 to fix the issue.

  • Root cause: lossy compression used by the copier; it stores a single copy for similar-looking regions and reuses it, degrading fidelity in a subtle way.

  • Key idea: the problem isn’t lossy compression per se, but that it can produce readable-but-incorrect data: copies that look accurate while being wrong, so the errors pass unnoticed.

Lossy vs. lossless compression (essential definitions)

  • Lossless: restored data exactly equals the original: F^{-1}(F(x)) = x

  • Lossy: restored data is only an approximation: F^{-1}(F(x)) ≠ x in general, but F^{-1}(F(x)) ≈ x

  • Encoding → Decoding steps; aim to reduce size while preserving essential information.

  • Common domains: lossless for text/software; lossy for images/audio/video.

  • Compression ratio: R = |original| / |compressed|
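
The definitions above can be sketched in a few lines of Python: a real lossless round trip using zlib, and a toy lossy scheme (quantization, chosen here purely for illustration) whose round trip only approximates the input.

```python
import zlib

# Lossless: the decoder recovers the input exactly, F^{-1}(F(x)) == x.
text = b"to be or not to be " * 50
assert zlib.decompress(zlib.compress(text)) == text

# Lossy (toy quantizer, for illustration only): rounding to a coarse
# step discards precision, so the round trip is only approximate.
def encode(xs, step=0.5):
    return [round(x / step) for x in xs]   # small integers, cheaper to store

def decode(qs, step=0.5):
    return [q * step for q in qs]

samples = [1.07, 2.49, 3.92]
restored = decode(encode(samples))
# restored == [1.0, 2.5, 4.0]: close to the original, but not equal to it.
```

Text and software demand the first behavior; images, audio, and video usually tolerate the second.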

How lossy compression works in images

  • Identify similar-looking regions; store one copy and reuse on decompression.

  • Reduces storage but can introduce artifacts that are not immediately obvious.

  • Xerox used a lossy format (JBIG2) for black-and-white images; artifacts can occur even when the text remains readable.
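
The symbol-reuse idea behind JBIG2’s lossy mode can be sketched as follows (a toy illustration of the concept, not the actual JBIG2 format): tiles of a black-and-white image that look similar enough are collapsed onto one stored representative, which is exactly how one digit can silently become another.

```python
# Toy symbol-matching compression in the spirit of JBIG2's lossy mode
# (illustrative only; the real format is far more involved). Tiles that
# differ by at most `threshold` pixels share one stored representative.

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def compress(tiles, threshold=1):
    reps, indices = [], []
    for tile in tiles:
        for i, rep in enumerate(reps):
            if hamming(tile, rep) <= threshold:
                indices.append(i)          # reuse an existing symbol
                break
        else:
            indices.append(len(reps))      # store a new symbol
            reps.append(tile)
    return reps, indices

def decompress(reps, indices):
    return [reps[i] for i in indices]

# Two tiles differing by a single pixel collapse into the same symbol:
tiles = ["0110", "0111", "1000"]
reps, indices = compress(tiles)
out = decompress(reps, indices)
# out == ["0110", "0110", "1000"]: still crisp and readable,
# but the second tile is no longer what was scanned.
```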

Analogy to large language models (ChatGPT)

  • Think of ChatGPT as a blurry JPEG of all text on the Web: retains much information but not in exact original form.

  • Exact word-for-word retrieval is impossible; the model provides gist and reformulations.

  • Hallucinations are compression artifacts: plausible but not guaranteed to be true; require checking against originals or knowledge.

Interpolation and understanding

  • A common lossy strategy: interpolate missing parts from their neighbors, reconstructing data from the surrounding context.

  • This yields plausible text but not guaranteed accuracy or internal consistency with underlying facts.
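
A minimal sketch of interpolation as a lossy strategy (assuming, for simplicity, that gaps never sit at the ends of the signal): each missing sample is reconstructed as the average of its neighbors, which is plausible but not guaranteed to match what was lost.

```python
def interpolate_gaps(samples):
    """Fill each None with the average of its two neighbors."""
    out = list(samples)
    for i, value in enumerate(out):
        if value is None:
            out[i] = (out[i - 1] + out[i + 1]) / 2
    return out

signal = [10.0, None, 14.0, None, 20.0]
print(interpolate_gaps(signal))  # [10.0, 12.0, 14.0, 17.0, 20.0]
```

The filled values look smooth and reasonable, but if an original sample was, say, 11.9 rather than 12.0, the reconstruction is subtly wrong in a hard-to-spot way, much like a hallucinated sentence.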

Hutter Prize and compression of knowledge

  • The Hutter Prize (established by Marcus Hutter in 2006) rewards lossless compression of a 1 GB snapshot of English Wikipedia.

  • Real-world progress: standard zip reduces the 1 GB file to roughly 300 MB; prize-winning entries approach ~115 MB; success suggests a form of compactifying knowledge.

  • Aim: better text compression may reveal deeper understanding.
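
As a small illustration of measuring a compression ratio with deflate (the algorithm behind zip), using arbitrary filler text chosen here for convenience; note that the heavy repetition makes the ratio far higher than what zip achieves on real Wikipedia prose (roughly 3x):

```python
import zlib

# Arbitrary filler text (not Wikipedia); repetition inflates the ratio.
text = ("Better compression of text may require, and so may reveal, "
        "a deeper understanding of it. ") * 200
data = text.encode()
compressed = zlib.compress(data, level=9)
ratio = len(data) / len(compressed)
print(f"{len(data)} -> {len(compressed)} bytes, R ~= {ratio:.0f}")
assert zlib.decompress(compressed) == data   # still perfectly lossless
```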

Does compression imply understanding?

  • LLMs identify statistical regularities in text (e.g., phrases like “supply is low” tend to co-occur with “prices rise”).

  • They don’t necessarily derive the underlying principles (e.g., the rules of arithmetic): GPT-3 is weak at arithmetic on large numbers, suggesting statistical pattern-matching rather than genuine rule-based computation.

  • Even when outputs seem to show understanding, they may be surface-level generalizations rather than true comprehension.
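
The distinction can be made concrete with a deliberately crude sketch (not a claim about how GPT-3 actually works): a model that only memorizes observed question/answer pairs fails outside its training set, while the carry-based rule of grade-school addition generalizes to numbers it has never seen.

```python
# "Memorization": answers recalled from a finite table of seen examples.
train = {(a, b): a + b for a in range(100) for b in range(100)}

def lookup_add(a, b):
    return train.get((a, b))          # None for any unseen pair

# "Understanding": the digit-by-digit rule with carrying.
def rule_add(a, b):
    result, carry, place = 0, 0, 1
    while a or b or carry:
        s = a % 10 + b % 10 + carry
        result += (s % 10) * place
        carry, place = s // 10, place * 10
        a, b = a // 10, b // 10
    return result

print(lookup_add(12, 34))      # 46: this pair appeared in "training"
print(lookup_add(1234, 5678))  # None: never memorized
print(rule_add(1234, 5678))    # 6912: the rule generalizes
```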

Types of blurriness and evaluation

  • Acceptable blurriness: restatement in different words; preserves usefulness.

  • Unacceptable blurriness: fabrication; verifiability is compromised.

  • Challenge: whether we can keep useful, rephrased content while avoiding fabricated facts; current limits are uncertain.

Uses and limits of LLMs

  • Replacing traditional search: unlikely to reproduce exact originals; often rephrases rather than quotes.

  • As writing aids: can help draft, but starting from a blurry copy of unoriginal work risks hindering true originality.

  • Drafting process: original writing typically emerges through revision and refinement, not from a single initial burst of generated text.

  • Potential in art: by analogy with Xerox art, AI could enable new forms, but most creators do not yet rely on such tools as an essential part of their workflow.

Internet integrity, training data, and the Web

  • Proliferation of generated text could blur the Web, affecting search and credibility.

  • Training data quality: if a model’s output were good enough to use as training data for future models, that would be evidence of its quality and utility.

  • Conversely, the fact that AI companies deliberately exclude model-generated content from training data reinforces the analogy between LLMs and lossy compression: the output is treated as a degraded copy.

Future outlook and takeaway

  • If model outputs become good enough to train newer models, confidence in their quality grows; otherwise, limitations remain.

  • The lossy-compression analogy remains a useful lens to discuss capabilities and limitations of LLMs: they summarize and interpolate rather than faithfully reproduce original sources.

  • The broader implication: critical appraisal is needed when relying on LLMs for factual content; verification against reliable sources is essential.

Bottom line

  • Lossy compression explains why LLMs can give plausible, useful text without exact original sources.

  • Hallucinations and fabrications are akin to compression artifacts and require verification.

  • While helpful for many tasks, LLMs don’t yet replace precise retrieval, original thought, or rigorous computation; human judgment remains essential.