Reinforcement Learning from Human Feedback (RLHF) has emerged as the central technique for aligning frontier AI systems such as GPT-4, Claude, Bard, and Llama-2. However, there has been little public work systematically formalizing its problems. In a new survey of more than 250 papers, produced in collaboration with some fifteen international contributors including EffiSciences, we examine the open problems and fundamental limitations of RLHF, with an emphasis on applications to large language models.
Reinforcement learning from human feedback (RLHF) is a technique for training AI systems to align with human goals. RLHF has emerged as the central method used to finetune state-of-the-art large language models (LLMs). Despite this popularity, there has been relatively little public work systematizing the challenges it poses. In this paper, we (1) survey concrete challenges and open questions with RLHF; (2) overview techniques to better understand, improve, and complement it in practice; and (3) discuss how standards for transparency and auditing could help to reduce risks. We emphasize the importance of recognizing the limitations of RLHF and integrating it into a more complete framework for developing safer AI.
- Concrete challenges with RLHF: We taxonomize and survey problems with RLHF, dividing them into three primary categories: challenges with feedback, challenges with the reward model, and challenges with the policy. We also distinguish between challenges that are relatively tractable versus ones that are more fundamental limitations of alignment with RLHF.
- Incorporating RLHF into a broader technical safety framework: We discuss how RLHF is not a complete framework for developing safe AI and highlight additional approaches that can help to better understand, improve, and complement it.
- Governance and transparency: We consider the challenge of improving industry norms and regulations affecting models trained with RLHF. Specifically, we discuss how the disclosure of certain details by companies using RLHF to train AI systems can improve accountability and auditing.
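The three categories above (feedback, reward model, policy) mirror the stages of the standard RLHF pipeline: humans compare pairs of model outputs, a reward model is fit to those comparisons, and the policy (the LLM) is then optimized against the reward model. As a minimal sketch of the middle stage, the reward model is commonly trained with a Bradley-Terry-style preference loss; all names and scores below are illustrative assumptions, not taken from the paper:

```python
import math

def reward_model_loss(r_chosen: float, r_rejected: float) -> float:
    """Bradley-Terry preference loss: -log sigmoid(r_chosen - r_rejected).

    The loss is small when the reward model scores the human-preferred
    completion higher than the rejected one, and large otherwise.
    """
    return -math.log(1.0 / (1.0 + math.exp(-(r_chosen - r_rejected))))

# (1) Feedback: human comparisons over pairs of completions
#     (chosen, rejected) -- a toy stand-in for a preference dataset.
feedback = [("a helpful, accurate answer", "a rude, evasive answer")]

# (2) Reward model: hypothetical scalar scores; in practice these come
#     from a neural network trained on many such pairs.
loss_agree = reward_model_loss(r_chosen=2.0, r_rejected=-1.0)
loss_disagree = reward_model_loss(r_chosen=-1.0, r_rejected=2.0)

# Scoring in agreement with the human preference yields a lower loss,
# which is the gradient signal that fits the reward model to the feedback.
print(loss_agree < loss_disagree)  # prints True
```

(3) The policy stage, not shown here, typically optimizes the LLM against this learned reward with an RL algorithm such as PPO, often with a KL penalty toward the pretrained model. Each of the survey's challenge categories attaches to one of these stages.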