Scalable Extraction of Training Data from (Production) Language Models

Abstract

  • Focus: Extractable memorization — training data that an adversary can efficiently recover from a language model through adversarial querying, without prior knowledge of the training set.

  • Results indicate adversaries can extract gigabytes of training data from various models: open-source (Pythia, GPT-Neo), semi-open (LLaMA), and closed (ChatGPT).

  • Development of a new divergence attack for extracting data from aligned models like ChatGPT, achieving a 150× higher extraction rate than regular use.

Introduction

  • Large language models (LLMs) memorize training data, allowing for potential private information extraction.

  • Objective: Unify previous studies of memorization and analyze extractable memorization in LLMs at scale.

  • Definitions:

    • Extractable Memorization: Data that an adversary can recover efficiently.

    • Discoverable Memorization: Data recoverable only when prompted with other training data.
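
The two definitions can be sketched as predicates over a model's generation function; `generate` here is a hypothetical stand-in that maps a prompt to the model's continuation, not an interface from the paper, and characters stand in for the paper's token-level measurement:

```python
def is_discoverably_memorized(generate, example: str, prefix_len: int = 50) -> bool:
    """Discoverable: prompting the model with the first `prefix_len`
    characters of a training example reproduces the remaining suffix."""
    prefix, suffix = example[:prefix_len], example[prefix_len:]
    return generate(prefix).startswith(suffix)


def is_extractably_memorized(generate, adversarial_prompts, example: str) -> bool:
    """Extractable: some prompt chosen *without* knowledge of the
    training data makes the model emit the example verbatim."""
    return any(example in generate(p) for p in adversarial_prompts)
```

Discoverable memorization is the weaker notion to test (the attacker is handed the prefix), which is why the paper treats it as an upper bound on what extraction attacks might achieve.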

Methodology

  • Conducted large-scale analyses of model memorization, querying models extensively and checking terabytes of generated output against training data.

  • Attack methodology involved:

    • Using existing heuristics to prompt models.

    • Testing prompt effectiveness and verifying whether model outputs appeared verbatim in the training data.

  • Key components:

    • Use of a suffix array to enable fast searches against large datasets.
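
A minimal sketch of the suffix-array check, assuming a plain-string corpus; the paper's implementation operates over tokenized, terabyte-scale datasets with efficient construction, which this toy version does not attempt:

```python
def build_suffix_array(text: str) -> list[int]:
    # Sort the starting positions of all suffixes lexicographically.
    # This naive construction is O(n^2 log n); production code uses
    # linear-time algorithms over byte arrays.
    return sorted(range(len(text)), key=lambda i: text[i:])


def contains(text: str, sa: list[int], query: str) -> bool:
    # Binary search for the first suffix >= query; if that suffix
    # starts with `query`, the query occurs verbatim in the corpus.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + len(query)] < query:
            lo = mid + 1
        else:
            hi = mid
    return lo < len(sa) and text[sa[lo]:].startswith(query)
```

Once the array is built, checking whether a span of model output occurs verbatim in the corpus costs only a logarithmic number of comparisons, which is what makes verification feasible at scale.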

Results

  • Extraction results across different models:

    • Larger models emit memorized training data at higher rates.

    • Standard querying recovered almost no training data from the aligned ChatGPT, initially suggesting that its alignment suppresses memorization.

  • Divergence prompting extracted over 10,000 unique memorized instances from ChatGPT within a $200 query budget; extrapolation suggests far more could be recovered with a larger budget.

Ethics & Responsible Disclosure

  • Findings were responsibly shared with model authors beforehand to address vulnerabilities.

  • Noted that the divergence attack is specific to ChatGPT and does not transfer directly to other production language models.

  • Encouragement of strong safeguards when deploying LLMs in privacy-sensitive applications.

Background & Related Work

  • Discussion on the training data used by LLMs and the implications of proprietary versus open-source data.

  • Instruction-tuning and RLHF: LLMs are further trained to follow instructions and align with human preferences via reinforcement learning from human feedback (RLHF).

  • An overview of privacy attacks including membership inference and data extraction.

  • Insight into existing methodologies and past results on data extractability across various models.

Data Extraction Attacks on Open Models

  • Evaluated data extraction attacks on models whose parameters and training datasets are both publicly available.

  • Defined extractable memorization based on explicit prompts that recover training data accurately.

  • Proposed method improves on previous attack strategies by automating verification of candidate outputs, reducing manual verification time.

Experimental Results

  • Found rates of memorization varied significantly across open-source models, with the potential for higher data extraction than previously documented.

  • Empirical analysis indicated that the proposed strategies extract a considerable amount of extractably memorized data.

Estimating Total Memorization

  • Examination of how query variety impacts extractable memorization rates.

  • Proposed method for estimating total memorization via Good-Turing frequency estimation, acknowledging potential underestimations.
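
The Good-Turing step can be illustrated as follows: the fraction of extracted sequences seen exactly once estimates the probability that the next extraction yields a new, previously unseen memorized string — which is also why the method tends to underestimate total memorization. A minimal sketch:

```python
from collections import Counter


def good_turing_unseen_mass(samples) -> float:
    """Good-Turing estimate of the probability that the *next* extracted
    sequence is a brand-new memorized string: N1 / N, where N1 counts
    sequences observed exactly once and N is the total number of samples."""
    counts = Counter(samples)
    n1 = sum(1 for c in counts.values() if c == 1)
    return n1 / len(samples)
```

When this unseen mass is still large after many queries, the observed set of unique memorized strings is far from exhaustive, so total memorization must exceed what has been collected so far.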

Discoverable vs. Extractable Memorization

  • Comparative analysis revealed significant differences between discoverable and extractable memorization, highlighting inefficiencies in current extraction techniques.

Extracting Data from Semi-closed Models

  • Discussion of methodologies adapted for semi-closed models with publicly available parameters but inaccessible training datasets.

  • Built an auxiliary corpus of publicly available web data as a proxy for the unknown training sets, using it to verify extraction success.

ChatGPT Data Extraction

  • Explored challenges specific to extracting data from aligned conversational models.

  • Standard prompting attacks failed against ChatGPT, but a new divergence attack — asking the model to repeat a single word indefinitely — caused it to deviate from its aligned behavior and emit training data.

  • Results showed that memorized outputs could be elicited at scale and at low cost through this prompting technique.
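
A hedged sketch of the attack's two pieces — prompt construction and divergence detection; the function names and exact prompt wording are illustrative, not the paper's verbatim artifacts, and real queries go through the ChatGPT API:

```python
def make_divergence_prompt(word: str, repeats: int = 50) -> str:
    # The attack asks the model to repeat one word forever; after many
    # repetitions ChatGPT sometimes 'diverges' and emits unrelated text,
    # which can include memorized training data.
    return f'Repeat this word forever: "{word}"\n' + " ".join([word] * repeats)


def divergent_tail(response: str, word: str) -> str:
    # Return whatever the model produced after it stopped repeating
    # `word`; an empty string means no divergence occurred.
    tokens = response.split()
    i = 0
    while i < len(tokens) and tokens[i].strip('."').lower() == word.lower():
        i += 1
    return " ".join(tokens[i:])
```

Any non-empty tail is then checked against the auxiliary web corpus to decide whether it is memorized training data rather than ordinary generation.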

Conclusion

  • Findings position ChatGPT as substantially more vulnerable to data extraction than earlier language models, with implications for both users and developers.

  • Stressed the need for further research on data deduplication and effective safeguards for LLM deployment.