Paper: https://cdn.openai.com/papers/diverse...) Blog: https://openai.com/index/advancing-re...)
This OpenAI research paper presents novel methods for automated red teaming of large language models (LLMs). The approach factorizes the red-teaming task into generating diverse attack goals and then training a reinforcement learning (RL) attacker to achieve those goals effectively and diversely. Key contributions include using automatically generated rule-based rewards and a multi-step RL process that encourages stylistic diversity in attacks. The methods are applied to two tasks: indirect prompt injection and safety "jailbreaking," demonstrating improved diversity and effectiveness compared to prior approaches. The paper also addresses limitations and suggests future research directions.
ai , model , ai safety , openai, genai, generativeai, artificialintelligence , arxiv , research , paper , publication, reinforcement learning, rl