I am intrigued by this idea: "human evaluation is the only way to get a reliable signal."
Do you mean RLHF or something more? For example, Alan Cowen from Hume AI believes RLHF will always be biased and that we instead need to move to something more akin to evaluation based on how models actually affect users (e.g., https://x.com/AlanCowen/status/1613293979071664146).
It’s in line with the statement “human evaluation is the only way to get reliable signal”. Without superintelligence, models can only approximate the human preferences they are trained on.
Alan's approach reminds me of how Atari built games early on: the same person who designed the game also tested it as a user and provided feedback. I can see many benefits to his approach:
* It's way more scalable than in-house paid RLHF, which is of course biased by the moral code of the company
* It's genuinely "democratized", provided that users of AI are generally representative of the world.
* It could learn signals about different moral frameworks around the world, and behave differently for different users and user groups (a rough sketch of what that could look like follows this list)
* It's how we've evolved our own moral codes in the wild, by leveraging our inbuilt empathy and reading social cues
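To make the "different users and user groups" point concrete, here is a rough sketch of what turning implicit user reactions into per-group preference signals could look like. Everything in it (the feedback fields, the grouping key, the aggregation) is my own illustration, not Hume's or anyone else's actual pipeline:

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class UserFeedback:
    user_group: str      # e.g. a coarse locale or community bucket (hypothetical)
    response_style: str  # which style/policy variant produced the reply
    reaction: float      # implicit signal in [-1, 1]: thumbs, regenerations, sentiment, ...

def group_preferences(events: list[UserFeedback]) -> dict[str, dict[str, float]]:
    """Average implicit reactions per (user_group, response_style).

    The resulting table is the kind of signal a user-based evaluation loop
    could feed back into training or routing, instead of a single global
    rater-defined preference.
    """
    totals: dict[tuple[str, str], list[float]] = defaultdict(list)
    for e in events:
        totals[(e.user_group, e.response_style)].append(e.reaction)

    prefs: dict[str, dict[str, float]] = defaultdict(dict)
    for (group, style), vals in totals.items():
        prefs[group][style] = sum(vals) / len(vals)
    return prefs
```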
At the same time, models can already pick up on subtle cues through rater feedback (e.g., pairwise comparisons in RLHF), and ChatGPT already reads subtle communication cues in plain text, so I question whether Cowen's claims have been fully tested.
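For concreteness, the pairwise rater signal I have in mind is the standard Bradley-Terry-style reward-model objective. A minimal PyTorch sketch, where `reward_model` and the batch encodings are placeholders rather than anyone's real training code:

```python
import torch
import torch.nn.functional as F

def pairwise_rlhf_loss(reward_model, chosen_batch, rejected_batch):
    """Bradley-Terry loss on pairwise rater feedback.

    `reward_model` maps a batch of (prompt, response) encodings to one scalar
    score per example; `chosen_batch` / `rejected_batch` hold the responses
    raters preferred / rejected for the same prompts.
    """
    r_chosen = reward_model(chosen_batch)      # shape: (batch,)
    r_rejected = reward_model(rejected_batch)  # shape: (batch,)
    # Maximize the log-probability that the chosen response outranks the rejected one.
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```

The subtle-cue learning happens because the reward model is free to latch onto whatever features of the text separate chosen from rejected responses, including tone and implicit social signals.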
Should we let ChatGPT rate its own conversations and learn from them? I'd be surprised if OpenAI didn't consider doing that, or perhaps they use such automated feedback to prioritize what to give to raters.
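The version of that I'd find most plausible is using a model-as-judge score only to triage which conversations get sent to human raters. A minimal sketch, assuming a hypothetical `judge` callable that returns a 0-1 score; this is speculation about how such prioritization could work, not a claim about what OpenAI actually does:

```python
from typing import Callable

def triage_for_raters(
    conversations: list[str],
    judge: Callable[[str], float],  # hypothetical: returns a 0-1 quality score
    budget: int,
) -> list[str]:
    """Send humans the conversations the model-judge is least sure about.

    Scores nearest 0.5 are treated as most ambiguous, so automated feedback
    prioritizes rater attention rather than replacing it.
    """
    scored = [(abs(judge(c) - 0.5), c) for c in conversations]
    scored.sort(key=lambda pair: pair[0])
    return [c for _, c in scored[:budget]]
```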
Overall, the balance between RLHF and user-based evaluation might be key to developing models that not only reflect human preferences more accurately but also adapt to a diverse range of moral and social contexts.
I think this is a pretty nuanced topic, and it isn't clear to me whether training will require these kinds of changes. Evals on human-judged tasks make sense, though.
Thanks for the question.