Aims to identify the key technical gaps that need to be addressed for an effective alignment solution involving Reinforcement Learning from Human Feedback (RLHF).
RLHF is progress because it can already do useful things (e.g. teach ChatGPT to be polite, a complex and fragile value) and its learned reward withstands more optimization pressure than hand-crafted reward functions.
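For context, here is a minimal sketch of the mechanism (illustrative only: the class names, dimensions, and toy data are made up, and a real system fine-tunes a language model rather than a small MLP). A reward model is fit to human preference comparisons with a Bradley-Terry-style loss, and a policy is then optimized against that learned reward, typically with a KL penalty toward the original model.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RewardModel(nn.Module):
    """Toy stand-in for a reward model; in practice this is a fine-tuned LM head."""
    def __init__(self, dim: int = 16):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, 32), nn.Tanh(), nn.Linear(32, 1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x).squeeze(-1)

def preference_loss(rm: RewardModel, chosen: torch.Tensor, rejected: torch.Tensor) -> torch.Tensor:
    # Bradley-Terry loss: push the score of the rater-preferred response
    # above the score of the rejected one.
    return -F.logsigmoid(rm(chosen) - rm(rejected)).mean()

if __name__ == "__main__":
    torch.manual_seed(0)
    rm = RewardModel()
    opt = torch.optim.Adam(rm.parameters(), lr=1e-3)
    # Fake "human feedback": feature vectors for preferred vs. rejected responses.
    chosen = torch.randn(256, 16) + 0.5
    rejected = torch.randn(256, 16) - 0.5
    for _ in range(200):
        loss = preference_loss(rm, chosen, rejected)
        opt.zero_grad()
        loss.backward()
        opt.step()
    print(f"final preference loss: {loss.item():.3f}")
    # A policy would then be trained (e.g. with PPO) to maximise rm's scores,
    # usually with a KL penalty keeping it close to the original model.
```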
However, it’s currently insufficient because of:
- Non-robust ML systems: e.g. ChatGPT can still fail, and arguably more so after public push-back against its refusals to answer certain questions. This kind of error may disappear in future systems, but it is unclear whether it will.
- Incentive issues in reinforcement learning: e.g. the model is rewarded for getting a human to hit 'approve', even if that involves deception (see the toy sketch after this list). It may also over-agree with a user's sensibilities or steer conversations toward positive-sentiment topics like weddings, and its trains of thought may be hard to read because they are mainly associative.
- Problems with the human feedback part: human raters make errors, ChatGPT has a large pool of raters and still has issues, and scaling feedback while maintaining its quality is hard.
- Superficial outer alignment: RLHF never specifies values explicitly, only a process that we hope will 'show' the model values through examples. An existing model can also be fine-tuned in a bad direction, e.g. by asking it to mimic someone with bad values.
- The strawberry problem: getting an AI to place two strawberries, identical down to the cellular level, on a plate and then do nothing else. Solving it requires capability, the ability to point the AI at a goal, and corrigibility (if we tell it to stop, it will). RLHF may not solve this, though the same could be said of other alignment approaches.
- Unknown properties under generalization: the trained model may not generalize well to new situations, may need to actually take harmful actions (such as killing people) before we can give feedback on what is bad, and may act very differently at higher levels of capability or in dangerous domains.
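To make the incentive issue above concrete, here is a toy Goodhart-style sketch under made-up assumptions (the reward functions and numbers are invented for illustration, not drawn from any real system): the learned reward tracks what we care about at first but also contains an exploitable 'looks approvable' term, so applying more optimization pressure eventually drives the true value down even as the proxy keeps rising.

```python
import numpy as np

def true_value(x: np.ndarray) -> float:
    """What we actually care about (hidden from the optimizer); best at x = 1."""
    return float(-np.sum((x - 1.0) ** 2))

def proxy_reward(x: np.ndarray) -> float:
    """Learned reward: tracks true value near the start, plus an exploitable
    'looks approvable to the rater' term that keeps growing."""
    return true_value(x) + 3.0 * float(np.sum(np.abs(x)))

def proxy_grad(x: np.ndarray) -> np.ndarray:
    return -2.0 * (x - 1.0) + 3.0 * np.sign(x)

# Apply increasing optimization pressure to the proxy via gradient ascent:
# true value improves at first, then falls while the proxy keeps climbing.
x = np.zeros(4)
for step in range(1, 61):
    x = x + 0.05 * proxy_grad(x)
    if step in (1, 5, 10, 20, 60):
        print(f"step {step:2d}  proxy={proxy_reward(x):7.2f}  true={true_value(x):7.2f}")
```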