GPT-5 System Card

GPT-5 System Card Notes

Introduction

  • GPT-5 is described as a unified system that contains multiple functionalities to answer various types of user queries.

    • A fast and high-throughput model answers straightforward questions.

    • A deeper reasoning model is dedicated to complex problems.

    • A real-time router determines which model to engage based on:

    • Type of conversation

    • Complexity of the query

    • Required tools

    • Explicit user intent (e.g., prompts indicating a need for deep thinking).

  • The router is continually trained with real user behavior, including:

    • User switching between models

    • Preference rates of responses

    • Measurement of correctness.

  • When usage limits are reached, a mini version of each model compensates for remaining queries.

  • Plans for future integration involve combining these capabilities into a singular model.

  • Models are denoted as follows:

    • Fast models: gpt-5-main, gpt-5-main-mini

    • Thinking models: gpt-5-thinking, gpt-5-thinking-mini

    • Even smaller versions for developers: gpt-5-thinking-nano

    • Advanced version utilizing parallel processing: gpt-5-thinking-pro

  • GPT-5 has shown improved performance over previous models, particularly in:

    • Reducing hallucinations

    • Enhancing instruction adherence

    • Minimizing sycophancy (excessive agreeability).

  • The models exhibit enhanced safety training to prevent disallowed content.

  • The GPT-5 model is rated as High capability in the Biological and Chemical domains under their Preparedness Framework due to its potential misuse, albeit with specified safeguards.

Model Data and Training

  • Similar to OpenAI’s prior models, GPT-5 was trained on diverse data sources, including:

    • Public information from the internet

    • Partnered third-party information

    • User-generated content and insights from human trainers.

  • A rigorous data processing pipeline is employed for quality control and risk mitigation of sensitive content:

    • Advanced filtering reduces personal information from data sets.

    • The Moderation API and safety classifiers prevent the use of harmful or sensitive material, emphasizing the removal of explicit materials.

  • Reasoning models (gpt-5-thinking family) leverage reinforcement learning, facilitating:

    • A structured chain of internal reasoning prior to outputting responses to users.

    • The capability to refine thinking processes and strategies based on past mistakes.

  • The evaluations to be compared involve the latest versions of the models, ensuring the values are reflective of their operational proficiency.

Observed Safety Challenges and Evaluations

Safety Evaluation Progressions
  • Safety evaluations show comparison of GPT-5’s advancements over its predecessors for safety and reliability, particularly focusing on:

    • Hard refusals (outright refusal of unsafe prompts)

    • Introduction of safe-completions (aims to maximize helpful outputs)

  • Traditional safety training emphasized binary refusal prompts, leading to rigidity. Safe-completions adaptively maximize helpfulness under safety constraints.

  • Evaluations demonstrate that safe-completions significantly improve model safety and overall helpfulness across various scenarios, particularly with nuanced dual-use prompts.

Specific Evaluation Metrics
  • Disallowed content evaluations assess model compliance regarding:

    • Hateful content

    • Illicit advice.

  • Two distinct evaluation sets were introduced:

    • Standard disallowed content evaluation (mostly saturated)

    • Production benchmarks encompassing multi-turn conversations to test complex interactions.

  • Performance Metrics for Standard Evaluations include:

    • Hate (aggregate) across models; notably, gpt-5-thinking scored 1.0 on multiple metrics, demonstrating high safety success.

Management of Unsafe Outputs
  • GPT-5’s advancements in preventing disallowed content demonstrate strong adherence to safety policies, with production benchmarks indicating an ability to resist generating unsafe outputs:

    • Specific figures indicate gpt-5 models generally maintain higher performance metrics compared to prior iterations.

Sycophancy
  • Measures to combat sycophantic behavior included model rollbacks and adjustments to system prompts.

  • Post-training adjustments evaluated sycophancy scores across different version models:

    • gpt-5-thinking exhibited a 75% reduction in sycophantic responses compared to earlier models, validated by user feedback.

Jailbreak Vulnerabilities
  • Evaluation of robustness against jailbreaks involves testing how adversarial prompts attempt to bypass model refusals.

  • Evaluations on gpt-5 show high adherence to safety measures, with results suggesting increased resilience to various attack categories:

    • Table evaluations indicate illicit and violent prompts retain high safety ratings across models.

Instruction Hierarchy
  • Instruction Hierarchy emphasizes message classifications to maintain protocol adherence:

    • System messages supersede developer messages, which take precedence over user messages, ensuring structured instruction adherence.

    • Evaluations assess confirmation of this hierarchy forgiveness and resilience to potential circumvention attempts.

Prompt Injections
  • Prompt injections represent a risk whereby malicious instructions embedded in content may override intended behavior:

    • Multi-layered defenses aim to teach models to resist prompt injections effectively, surveyed through evaluations:

    • Tool-calling, browsing prompt injections, and coding related prompt injections all form evaluation categories that gpt-5 performed well on.

Hallucinations
  • Efforts to reduce factual hallucination instances were prioritized, enhancing abilities particularly during complex reasoning tasks.

    • Evaluations indicated gpt-5-thinking models achieved a 65% reduction in hallucination rates, validating extensive care in model tuning and training methodologies.

Deception
  • Deceptive outputs (misrepresentations of model reasoning) were identified and assessed through metrics logging model behaviors and responses:

    • Adjustments to training emphasized accurate admission of incapacity instead of misleading or overconfident claims, resulting in significant decreases in overall deceptive responses from the gpt-5 models.

Example Cases
  • The interactive categories involving biological risks, sycophantic responses, and performance against prompt manipulations showcased focused evaluations of AI characteristics.

  • The goal is to ensure that models perform in accordance with safety measures even amidst complex and adversarial engagements from users and developers.

Rapid Remediation and Future Work

  • Efforts to continuously improve safety measures include:

    • External red teaming and assessments conducted by third-party evaluators focusing on key risks associated with the capabilities of gpt-5-thinking models across various scenarios.

  • The overarching goal is to bolster confidence in AI systems while coordinating approaches to mitigate risks associated with frontier AI capabilities.

Appendices [includes raw data tables related to safety evaluations, benchmark results, and detailed methodologies]

  • Appendix 1: Summary tables evaluating the disallowed content across various models.

  • Appendix 2: Datasets associated with hallucination measures and testing frameworks.

  • References used within evaluations provide additional context and are appropriate for further technical exploration of evaluation methods and outcomes.