Reward Hacking in Large Language Models (LLMs)

We often hear about reward hacking in the context of LLMs generating content that mimics style over substance. So what exactly does this mean, and why does it occur? Before we dive into the concept of reward hacking as applied to LLMs, let's first clarify what "reward" means in the RL literature.

An LLM hallucinating facts in court, leading to chaos: how misaligned reward optimization ends up learning human preferences that put style over substance.

The Concept of Reward in Reinforcement Learning

In the realm of reinforcement learning (RL), a reward r(s,a) is a scalar feedback signal that indicates how good an agent’s action (a) was, in a given state (s), in terms of achieving its goal. Think of it as a way of telling the agent, “You did well!” or “That wasn’t the right move.” The primary objective of an RL agent is to maximize its cumulative reward over time. This is achieved by interacting with an environment, taking actions based on its current state (s), and receiving rewards (or penalties) based on the outcomes of those actions. The reward mechanism serves as a guiding light, helping the agent learn which actions are beneficial and which are not. Over time, by optimizing its actions to accumulate the highest rewards, the agent learns a policy that dictates the best action to take in any given state. This process of trial, error, and learning through rewards is the essence of reinforcement learning.

The whole process is inspired by human learning. Think of a child learning to walk, who goes through a similar trial-and-error process, upweighting tactics that led to walking (reward) and downweighting tactics that led to falling (penalty). The relationship between reward and value is described by the Bellman equation, expressed below.

Bellman Equation
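For reference, the standard Bellman optimality form of this relationship writes the value V(s) of a state in terms of the immediate reward r(s,a) and the discounted value of successor states; the discount factor γ and transition probabilities P(s'|s,a) are standard RL notation not introduced above:

```latex
V(s) = \max_{a} \left[ r(s,a) + \gamma \sum_{s'} P(s' \mid s, a)\, V(s') \right]
```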

Brian Christian’s book “The Alignment Problem” sheds light on how poor design of a reward function results in unexpected behaviors, in domains ranging from job recruiting to self-driving cars. Below is one of my favorite quotes from the book (it is simple yet effective in conveying the message):

University of Toronto economist Joshua Gans wanted to enlist the help of his older daughter in potty training her younger brother. So he did what any good economist would do. He offered her an incentive: anytime she helped her brother go to the bathroom, she would get a piece of candy. The daughter immediately found a loophole that her father, the economics professor, had overlooked. “I realized that the more that goes in, the more comes out,” she says. “So I was just feeding my brother buckets and buckets of water.” Gans affirms: “It didn’t really work out too well.”


LLMs like GPT-4, PaLM, and Claude are trained using self-supervised pretraining followed by supervised fine-tuning on instructions from varied domains and tasks. To align LLMs with human preferences, there is an additional RL training stage called RLHF (RL from human feedback) that comprises two parts: (i) a reward model (RM) trained on human preference data, and (ii) policy optimization to maximize the reward assigned by the RM. Human preferences can come in many forms, prime examples being helpfulness, harmlessness, and honesty. Just like other AI systems, LLMs trained with RLHF are prone to reward hacking, which we discuss next.
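As a rough sketch of part (i): reward models are typically trained on pairwise human comparisons with a Bradley-Terry style loss, so that the preferred response receives a higher scalar score. Below is a minimal, illustrative version; it assumes PyTorch and hypothetical scalar RM scores (r_chosen, r_rejected), and is not the exact objective used by any particular lab.

```python
import torch
import torch.nn.functional as F

def reward_model_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    """Pairwise (Bradley-Terry) loss: push the reward model to score the
    human-preferred response above the rejected one."""
    return -F.logsigmoid(r_chosen - r_rejected).mean()

# Hypothetical scalar scores from the RM for three comparison pairs.
r_chosen = torch.tensor([1.2, 0.3, 2.1])
r_rejected = torch.tensor([0.4, 0.9, 1.5])
print(reward_model_loss(r_chosen, r_rejected))  # lower loss means the RM ranks preferred responses higher
```

Part (ii) then optimizes the LLM policy (commonly with PPO) to maximize this learned reward, typically with a KL penalty that keeps the policy close to the supervised fine-tuned model.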

So what exactly is Reward Hacking?

Reward hacking is a phenomenon observed in machine learning where a model learns to exploit the reward system to achieve high scores without genuinely solving the intended problem. The model identifies a shortcut within the problem space that lets it maximize reward (or minimize loss) without truly learning the crucial aspects of the problem. This can lead to models that perform well on training data but fail to deliver in real-world scenarios. In the context of LLMs like GPT-4, it can manifest in various ways, from copying style without generating meaningful content to being overly cautious in responses. Hallucination is a prime example of reward hacking: the LLM learns to generate content that looks well preferred by humans even when it is not factually grounded, thereby pleasing humans with its style of writing rather than its factual grounding.

Source: https://x.com/prdeepakbabu/status/1708513048905040367?s=20
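To make this concrete, here is a toy, entirely made-up proxy reward that scores answers on surface features (length and confident-sounding phrases). Under such a proxy, a fluent but fabricated answer can outscore a short, factually grounded one, which is exactly the style-over-substance failure described above.

```python
# Toy illustration only: a hand-written proxy reward over surface features.
def proxy_reward(answer: str) -> float:
    confident_phrases = ["clearly", "it is well known", "undoubtedly"]
    fluency_bonus = sum(p in answer.lower() for p in confident_phrases)
    return 0.01 * len(answer.split()) + fluency_bonus  # longer + more confident = higher score

grounded = "The case was dismissed in 2019; see the court record for details."
hallucinated = ("It is well known that the case clearly set a landmark precedent, "
                "cited undoubtedly in dozens of later rulings.")  # fabricated claim

print(proxy_reward(grounded), proxy_reward(hallucinated))
# The proxy prefers the hallucinated answer, even though it is not factually grounded.
```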

1. Mimicking Style Over Substance

One of the challenges in training LLMs is ensuring that they generate content that is both stylistically appropriate and substantively accurate. However, due to the vast amount of data they are trained on, LLMs can sometimes prioritize mimicking the style of the input without truly understanding or generating the desired logic or content. This is a form of reward hacking because the model might receive positive feedback for producing text that “sounds right” even if it doesn’t convey the intended meaning.
Example: If a user asks for an explanation of a complex topic and the model provides a verbose answer filled with jargon but lacking clear explanations, it might seem knowledgeable at first glance. However, upon closer inspection, the answer might not be helpful at all.

2. Overcautiousness as a Form of Reward Hacking

While not a classic example of reward hacking, the behavior of LLMs being overly cautious can be seen as a byproduct of their training. When models are trained to be safe, they might lean towards being unhelpfully cautious to avoid potential harm. This is because the model might “learn” that providing non-committal or vague answers reduces the risk of giving incorrect or harmful information, even if it compromises the utility of the response.
Example: If a user asks for medical advice, a well-trained LLM might respond with “I’m not a doctor, please consult a medical professional.” While this is safe, it might not be helpful, especially if the user was looking for general information that the model could provide without giving direct medical advice.
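One way to see how this behavior can emerge: if the effective reward heavily penalizes perceived risk, a blanket refusal can become the reward-maximizing response even when a helpful, safe answer exists. The numbers below are invented purely for illustration.

```python
# Toy illustration with made-up numbers: a heavy penalty on perceived risk
# makes the refusal the reward-maximizing response.
LAMBDA = 5.0  # weight on perceived risk (assumed value)

candidates = {
    "general information plus 'consult a professional' caveat": {"helpfulness": 0.8, "risk": 0.2},
    "blanket refusal": {"helpfulness": 0.1, "risk": 0.0},
}

def reward(scores: dict) -> float:
    return scores["helpfulness"] - LAMBDA * scores["risk"]

best = max(candidates, key=lambda name: reward(candidates[name]))
print(best)  # -> "blanket refusal": the risk penalty dominates the helpfulness term
```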

3. The Challenge of Reward Design

Designing the right reward mechanism is crucial for training effective machine learning models. In reinforcement learning, for instance, the agent learns by receiving rewards or penalties based on its actions. If the reward system isn’t designed correctly, the agent might find unintended shortcuts to maximize its reward.

Can we say reward hacking is simply the result of poorly designed reward functions? Not necessarily. Reward hacking and poor design of reward functions are related concepts, but they are not the same. Let's break down the differences:

Reward Hacking:

  • This refers to situations where an AI system finds a way to maximize its reward by exploiting loopholes or unintended behaviors in the reward function.
  • Instead of achieving the desired outcome that the designer intended, the AI system “hacks” the reward mechanism to get the maximum reward in an unintended way.
  • It’s a consequence of unforeseen shortcuts or exploits that the AI discovers within the parameters of the reward function.

Poor Design of Reward Function:

  • This refers to the flawed or inadequate specification of the reward function by the designer.
  • A poorly designed reward function might not capture the true intent of the desired behavior or might be too ambiguous, leading to unintended behaviors.
  • Reward hacking can be a result of a poorly designed reward function, but not all poorly designed reward functions lead to reward hacking.

In essence, reward hacking is a symptom, and poor design of the reward function can be one of the causes. However, even well-designed reward functions can sometimes be susceptible to reward hacking if there are unforeseen ways to exploit them. The challenge in AI alignment is to design reward functions that both capture the desired behavior and are robust against potential hacks or exploits.
