Scalable Extraction of Training Data from (Production) Language Models

Abstract

  • Focus: Extractable memorization — training data that an adversary can efficiently recover from a language model through adversarial querying, without prior knowledge of the training set.

  • Results indicate adversaries can extract gigabytes of training data from various models: open-source (Pythia, GPT-Neo), semi-open (LLaMA), and closed (ChatGPT).

  • Development of a new divergence attack for extracting data from aligned models like ChatGPT, achieving a 150× higher extraction rate than regular use.

Introduction

  • Large language models (LLMs) memorize training data, allowing for potential private information extraction.

  • Objective: Unify previous studies of memorization and analyze extractable memorization in LLMs at scale.

  • Definitions:

    • Extractable Memorization: Data that an adversary can recover efficiently.

    • Discoverable Memorization: Data recoverable only when prompted with other training data.
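
The two definitions can be sketched as predicates over a model's generation function; `generate` here is a hypothetical stand-in that maps a prompt to the model's continuation, not an interface from the paper, and characters stand in for the paper's token-level measurement:

```python
def is_discoverably_memorized(generate, example: str, prefix_len: int = 50) -> bool:
    """Discoverable: prompting the model with the first `prefix_len`
    characters of a training example reproduces the remaining suffix."""
    prefix, suffix = example[:prefix_len], example[prefix_len:]
    return generate(prefix).startswith(suffix)


def is_extractably_memorized(generate, adversarial_prompts, example: str) -> bool:
    """Extractable: some prompt chosen *without* knowledge of the
    training data makes the model emit the example verbatim."""
    return any(example in generate(p) for p in adversarial_prompts)
```

Discoverable memorization is the weaker notion to test (the attacker is handed the prefix), which is why the paper treats it as an upper bound on what extraction attacks might achieve.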

Methodology

  • Conducted large-scale analyses of model memorization, querying models extensively and checking terabytes of generated output against training data.

  • Attack methodology involved:

    • Using existing heuristics to prompt models.

    • Testing prompt effectiveness and verifying whether model outputs appeared verbatim in the training data.

  • Key components:

    • Use of a suffix array to enable fast searches against large datasets.
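
A minimal sketch of the suffix-array check, assuming a plain-string corpus; the paper's implementation operates over tokenized, terabyte-scale datasets with efficient construction, which this toy version does not attempt:

```python
def build_suffix_array(text: str) -> list[int]:
    # Sort the starting positions of all suffixes lexicographically.
    # This naive construction is O(n^2 log n); production code uses
    # linear-time algorithms over byte arrays.
    return sorted(range(len(text)), key=lambda i: text[i:])


def contains(text: str, sa: list[int], query: str) -> bool:
    # Binary search for the first suffix >= query; if that suffix
    # starts with `query`, the query occurs verbatim in the corpus.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + len(query)] < query:
            lo = mid + 1
        else:
            hi = mid
    return lo < len(sa) and text[sa[lo]:].startswith(query)
```

Once the array is built, checking whether a span of model output occurs verbatim in the corpus costs only a logarithmic number of comparisons, which is what makes verification feasible at scale.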

Results

  • Extraction results across different models:

    • Larger models emit memorized training data at higher rates.

    • Standard querying recovered almost no training data from the aligned ChatGPT, initially suggesting that its alignment suppresses memorization.

  • Divergence prompting extracted over 10,000 unique memorized instances from ChatGPT within a $200 query budget; extrapolation suggests far more could be recovered with a larger budget.

Ethics & Responsible Disclosure

  • Findings were responsibly shared with model authors beforehand to address vulnerabilities.

  • Noted that the divergence attack is specific to ChatGPT and does not transfer directly to other production language models.

  • Encouragement of strong safeguards when deploying LLMs in privacy-sensitive applications.

Background & Related Work

  • Discussion on the training data used by LLMs and the implications of proprietary versus open-source data.

  • Instruction-tuning and RLHF: LLMs are further trained to follow instructions and align with human preferences via reinforcement learning from human feedback (RLHF).

  • An overview of privacy attacks including membership inference and data extraction.

  • Insight into existing methodologies and past results on data extractability across various models.

Data Extraction Attacks on Open Models

  • Evaluated data extraction attacks on models whose parameters and training datasets are both publicly available.

  • Defined extractable memorization based on explicit prompts that recover training data accurately.

  • Proposed method improves on previous attack strategies by automating verification of candidate outputs, reducing manual verification time.

Experimental Results

  • Found rates of memorization varied significantly across open-source models, with the potential for higher data extraction than previously documented.

  • Empirical analysis indicated that the proposed strategies extract a considerable amount of extractably memorized data.

Estimating Total Memorization

  • Examination of how query variety impacts extractable memorization rates.

  • Proposed method for estimating total memorization via Good-Turing frequency estimation, acknowledging potential underestimations.
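
The Good-Turing step can be illustrated as follows: the fraction of extracted sequences seen exactly once estimates the probability that the next extraction yields a new, previously unseen memorized string — which is also why the method tends to underestimate total memorization. A minimal sketch:

```python
from collections import Counter


def good_turing_unseen_mass(samples) -> float:
    """Good-Turing estimate of the probability that the *next* extracted
    sequence is a brand-new memorized string: N1 / N, where N1 counts
    sequences observed exactly once and N is the total number of samples."""
    counts = Counter(samples)
    n1 = sum(1 for c in counts.values() if c == 1)
    return n1 / len(samples)
```

When this unseen mass is still large after many queries, the observed set of unique memorized strings is far from exhaustive, so total memorization must exceed what has been collected so far.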

Discoverable vs. Extractable Memorization

  • Comparative analysis revealed significant differences between discoverable and extractable memorization, highlighting inefficiencies in current extraction techniques.

Extracting Data from Semi-closed Models

  • Discussion of methodologies adapted for semi-closed models with publicly available parameters but inaccessible training datasets.

  • Built an auxiliary corpus of publicly available web data as a proxy for the unknown training sets, using it to verify extraction success.

ChatGPT Data Extraction

  • Explored challenges specific to extracting data from aligned conversational models.

  • Standard prompting attacks failed against ChatGPT, but a new divergence attack — asking the model to repeat a single word indefinitely — caused it to deviate from its aligned behavior and emit training data.

  • Results showed that memorized outputs could be elicited at scale and at low cost through this prompting technique.
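
A hedged sketch of the attack's two pieces — prompt construction and divergence detection; the function names and exact prompt wording are illustrative, not the paper's verbatim artifacts, and real queries go through the ChatGPT API:

```python
def make_divergence_prompt(word: str, repeats: int = 50) -> str:
    # The attack asks the model to repeat one word forever; after many
    # repetitions ChatGPT sometimes 'diverges' and emits unrelated text,
    # which can include memorized training data.
    return f'Repeat this word forever: "{word}"\n' + " ".join([word] * repeats)


def divergent_tail(response: str, word: str) -> str:
    # Return whatever the model produced after it stopped repeating
    # `word`; an empty string means no divergence occurred.
    tokens = response.split()
    i = 0
    while i < len(tokens) and tokens[i].strip('."').lower() == word.lower():
        i += 1
    return " ".join(tokens[i:])
```

Any non-empty tail is then checked against the auxiliary web corpus to decide whether it is memorized training data rather than ordinary generation.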

Conclusion

  • Findings position ChatGPT as substantially more vulnerable to data extraction than earlier language models, with implications for both users and developers.

  • Stressed the need for further research on data deduplication and effective safeguards for LLM deployment.