GPT-5 System Card
GPT-5 System Card Notes
Introduction
GPT-5 is described as a unified system that contains multiple functionalities to answer various types of user queries.
A fast and high-throughput model answers straightforward questions.
A deeper reasoning model is dedicated to complex problems.
A real-time router determines which model to engage based on:
Type of conversation
Complexity of the query
Required tools
Explicit user intent (e.g., prompts indicating a need for deep thinking).
The router is continually trained with real user behavior, including:
User switching between models
Preference rates of responses
Measurement of correctness.
When usage limits are reached, a mini version of each model compensates for remaining queries.
Plans for future integration involve combining these capabilities into a singular model.
Models are denoted as follows:
Fast models: gpt-5-main, gpt-5-main-mini
Thinking models: gpt-5-thinking, gpt-5-thinking-mini
Even smaller versions for developers: gpt-5-thinking-nano
Advanced version utilizing parallel processing: gpt-5-thinking-pro
GPT-5 has shown improved performance over previous models, particularly in:
Reducing hallucinations
Enhancing instruction adherence
Minimizing sycophancy (excessive agreeability).
The models exhibit enhanced safety training to prevent disallowed content.
The GPT-5 model is rated as High capability in the Biological and Chemical domains under their Preparedness Framework due to its potential misuse, albeit with specified safeguards.
Model Data and Training
Similar to OpenAI’s prior models, GPT-5 was trained on diverse data sources, including:
Public information from the internet
Partnered third-party information
User-generated content and insights from human trainers.
A rigorous data processing pipeline is employed for quality control and risk mitigation of sensitive content:
Advanced filtering reduces personal information from data sets.
The Moderation API and safety classifiers prevent the use of harmful or sensitive material, emphasizing the removal of explicit materials.
Reasoning models (gpt-5-thinking family) leverage reinforcement learning, facilitating:
A structured chain of internal reasoning prior to outputting responses to users.
The capability to refine thinking processes and strategies based on past mistakes.
The evaluations to be compared involve the latest versions of the models, ensuring the values are reflective of their operational proficiency.
Observed Safety Challenges and Evaluations
Safety Evaluation Progressions
Safety evaluations show comparison of GPT-5’s advancements over its predecessors for safety and reliability, particularly focusing on:
Hard refusals (outright refusal of unsafe prompts)
Introduction of safe-completions (aims to maximize helpful outputs)
Traditional safety training emphasized binary refusal prompts, leading to rigidity. Safe-completions adaptively maximize helpfulness under safety constraints.
Evaluations demonstrate that safe-completions significantly improve model safety and overall helpfulness across various scenarios, particularly with nuanced dual-use prompts.
Specific Evaluation Metrics
Disallowed content evaluations assess model compliance regarding:
Hateful content
Illicit advice.
Two distinct evaluation sets were introduced:
Standard disallowed content evaluation (mostly saturated)
Production benchmarks encompassing multi-turn conversations to test complex interactions.
Performance Metrics for Standard Evaluations include:
Hate (aggregate) across models; notably, gpt-5-thinking scored 1.0 on multiple metrics, demonstrating high safety success.
Management of Unsafe Outputs
GPT-5’s advancements in preventing disallowed content demonstrate strong adherence to safety policies, with production benchmarks indicating an ability to resist generating unsafe outputs:
Specific figures indicate gpt-5 models generally maintain higher performance metrics compared to prior iterations.
Sycophancy
Measures to combat sycophantic behavior included model rollbacks and adjustments to system prompts.
Post-training adjustments evaluated sycophancy scores across different version models:
gpt-5-thinking exhibited a 75% reduction in sycophantic responses compared to earlier models, validated by user feedback.
Jailbreak Vulnerabilities
Evaluation of robustness against jailbreaks involves testing how adversarial prompts attempt to bypass model refusals.
Evaluations on gpt-5 show high adherence to safety measures, with results suggesting increased resilience to various attack categories:
Table evaluations indicate illicit and violent prompts retain high safety ratings across models.
Instruction Hierarchy
Instruction Hierarchy emphasizes message classifications to maintain protocol adherence:
System messages supersede developer messages, which take precedence over user messages, ensuring structured instruction adherence.
Evaluations assess confirmation of this hierarchy forgiveness and resilience to potential circumvention attempts.
Prompt Injections
Prompt injections represent a risk whereby malicious instructions embedded in content may override intended behavior:
Multi-layered defenses aim to teach models to resist prompt injections effectively, surveyed through evaluations:
Tool-calling, browsing prompt injections, and coding related prompt injections all form evaluation categories that gpt-5 performed well on.
Hallucinations
Efforts to reduce factual hallucination instances were prioritized, enhancing abilities particularly during complex reasoning tasks.
Evaluations indicated gpt-5-thinking models achieved a 65% reduction in hallucination rates, validating extensive care in model tuning and training methodologies.
Deception
Deceptive outputs (misrepresentations of model reasoning) were identified and assessed through metrics logging model behaviors and responses:
Adjustments to training emphasized accurate admission of incapacity instead of misleading or overconfident claims, resulting in significant decreases in overall deceptive responses from the gpt-5 models.
Example Cases
The interactive categories involving biological risks, sycophantic responses, and performance against prompt manipulations showcased focused evaluations of AI characteristics.
The goal is to ensure that models perform in accordance with safety measures even amidst complex and adversarial engagements from users and developers.
Rapid Remediation and Future Work
Efforts to continuously improve safety measures include:
External red teaming and assessments conducted by third-party evaluators focusing on key risks associated with the capabilities of gpt-5-thinking models across various scenarios.
The overarching goal is to bolster confidence in AI systems while coordinating approaches to mitigate risks associated with frontier AI capabilities.
Appendices [includes raw data tables related to safety evaluations, benchmark results, and detailed methodologies]
Appendix 1: Summary tables evaluating the disallowed content across various models.
Appendix 2: Datasets associated with hallucination measures and testing frameworks.
References used within evaluations provide additional context and are appropriate for further technical exploration of evaluation methods and outcomes.