RLHF 101: Reinforcement Learning from Human Feedback for LLM AIs

A technique called RLHF has been getting a lot of AI insider / expert buzz lately, ever since OpenAI announced that this was one of the key “fine-tuning” methodologies used to transform the raw GPT3.5 model into first, InstructGPT, and later and epically, ChatGPT. The RLHF acronym stands for “Reinforcement Learning from Human Feedback,” which, while mildly cryptic, is a decent enough description of what is going on.

Essentially, a bunch of highly educated and well-trained worker bees sit in front of a few hundred terminals chatting with the AI and themselves, and later on ranking various responses in a 1-2-3-4 type of rating system. All of this, of course, is fed back into the AI neural nets front end, in a truly massive feedback loop.

“If you want the position of God,
then accept the Responsibility.”

OpenAI’s diagram of the RLHF Process

RLHF diagram

from the source:

“We trained this model using Reinforcement Learning from Human Feedback (RLHF), using the same methods as InstructGPT, but with slight differences in the data collection setup. We trained an initial model using supervised fine-tuning:

1. human AI trainers provided conversations in which they [role] played both sides — that of both the human user and that of the AI assistant.
2. We gave the trainers access to model-written suggestions to help them compose their responses.
3. We mixed this newly recorded dialogue dataset with the existing InstructGPT dataset,
4. which we transformed into a dialogue format.

To create a reward model for reinforcement learning (RL):

1. we needed to collect comparison data,
  which consisted of two or more model responses
  ranked by quality.
2. To collect this data,
  we took conversations that AI trainers had with the chatbot.
3. We randomly selected a model-written message,
  sampled several alternative completions, and
4. had AI trainers rank order them
  (1-best, 2-second, 3-third 4-worst etc).

Using these reward models, we can fine-tune the model using Proximal Policy Optimization. We performed several iterations of this process.”

White Hats and Red Teams

After the model has been successfully sculpted and (re) biased via the RLHF process, it goes to the Red Team… essentially, QA for LLMs. The purpose of the Red Team is to goad the LLM into toxic responses, using a variety of adversarial attacks… In other words, they are doing their best to “break” and/or “jailbreak” the AI, using prompt engineering, in anticipation of real hackers and bad actors doing the same post-release.

If all goes as planned (and how often does that happen?), the Red Team uncovers most of the “low-hanging-fruit” vulnerabilities, and those prompts are engineered into a prophylactic defense that sits atop the core LLM engine, preventing such attacks, or at least elegantly deflecting them.

Of course, as we’ve seen, even with extensive RLHF and Red Team deployment, even the best of AIs falls rapidly when under assault by the general populace. For example:

coaxing out the Sydney Prompt despite explicit security protocols prohibiting its disclosure, and
implementing the now legendary DAN (Do Anything Now) mode of ChatGPT.

SurgeHQ.ai has some very detailed material on this topic:

AI Red Teams for Adversarial Training:
How to Make ChatGPT and LLMs Adversarially Robust