Goal Hijacking

How Intrinsic Motivation Can Derail Goal-Directed Behavior


The Optimization Problem: A Composite Reward

In Reinforcement Learning, we model an agent’s task within a Markov Decision Process (MDP), defined by the tuple $M = (S, A, P, R, \gamma)$. The agent’s goal is to learn a policy $\pi: S \to \Delta(A)$ that maximizes the expected discounted sum of future rewards:

\[J(\pi) = \mathbb{E}_\pi \left[ \sum_{t=0}^\infty \gamma^t R_t \right]\]

In complex environments with sparse rewards, this learning process can be prohibitively slow. To guide the agent, we often augment the primary, or extrinsic, task reward $R_{ext}$ with an intrinsic reward $R_{int}$. This intrinsic signal is designed to encourage useful behaviors the agent might not otherwise discover. The agent then optimizes a composite reward:

\[R_{total,t} = R_{ext,t} + \beta \cdot R_{int,t}\]

Here, $\beta$ is a hyperparameter that balances the two reward streams. This intrinsic reward, $R_{int}$, is not limited to a single type of behavior. Researchers have developed a wide array of intrinsic motivators. For example, “curiosity” rewards an agent for prediction errors, encouraging it to explore surprising dynamics. 1 Other methods reward visiting novel states, measured by the error of a network trained on familiar states. 2 A third approach, “empowerment,” rewards an agent for gaining control over its environment, maximizing the mutual information between its actions and future states. 3 While powerful, this general technique of adding an intrinsic reward introduces a critical failure mode distinct from classic “reward hacking.”

  • Reward Hacking is about exploiting a misspecified proxy for the true goal. The agent finds a policy that is optimal for the proxy reward $R_{proxy}$ but suboptimal for the intended true reward $R_{true}$. 4
  • Goal Hijacking, the focus of this post, is what happens when the optimization of the intrinsic reward completely supplants the optimization of the extrinsic goal. The instrumental reward becomes the terminal goal.

Formalizing Goal Hijacking

Consider an episodic task where a large, sparse reward $R_g$ is delivered only upon reaching a specific goal state $s_g$, which terminates the episode. For all other transitions, $R_{ext} = 0$. To encourage the agent to explore, we provide an intrinsic reward, $R_{int}$, such as a novelty bonus for visiting new states.

Let’s assume our novelty-seeking mechanism is working perfectly as intended. The environment itself is simply vast and full of things to see.

Let $\pi_{goal}$ be a policy that reliably reaches $s_g$ in a finite time $T$. The value of this policy is the sum of the discounted goal reward and the intrinsic rewards gathered along the way:

\[V_{total}^{(\pi_{goal})}(s_0) = \mathbb{E}_{\pi_{goal}} \left[ \gamma^T R_g + \beta \cdot \sum_{t=0}^T \gamma^t R_{int,t} \right]\]

Now, imagine an alternative policy, $\pi_{hijack}$. This policy discovers it can generate a perpetual stream of intrinsic rewards without ever reaching $s_g$ and ending the episode. It might, for example, find an endless supply of novel states to visit in a massive open world, effectively ignoring the main quest.

The value of this hijacking policy, which never terminates, is:

\[V_{total}^{(\pi_{hijack})}(s_0) = \mathbb{E}_{\pi_{hijack}} \left[ \beta \cdot \sum_{t=0}^\infty \gamma^t R_{int,t} \right]\]

Goal hijacking occurs when a rational agent chooses $\pi_{hijack}$ over $\pi_{goal}$, which happens when $V_{total}^{(\pi_{hijack})} > V_{total}^{(\pi_{goal})}$.

This isn’t a mathematical inevitability, but a practical peril. The total possible intrinsic reward is bounded by $\beta \cdot R_{max_int} / (1-\gamma)$, where $R_{max_int}$ is the maximum possible intrinsic reward at any step. In principle, we could always set the goal reward $R_g$ high enough to dwarf this sum. But in practice, if $\beta$ is too high or the landscape of $R_{int}$ is sufficiently rich and easy to exploit, the value of endless exploration can easily overwhelm the value of a single, distant, and heavily discounted $R_g$. The side-quest becomes the main quest because it offers a better value proposition.

A Taxonomy of Hijacking Failures

This model of hijacking unifies several well-known AI safety problems:

  1. Wireheading / Self-Delusion: The most direct form of hijacking. An agent that gains control over its own reward-computation process can learn a policy $\pi_{hijack}$ that simply sets its internal reward signal to the maximum possible value at every timestep. With discounting, the value function doesn’t diverge to infinity, but it can converge to a value $\beta \cdot R_{max} / (1-\gamma)$ that is far greater than any reward achievable from the external environment, making any policy that pursues a finite $R_g$ appear suboptimal. 5

  2. Instrumental Goal Subversion (Power-Seeking): An agent may learn that certain instrumental goals—like acquiring resources, information, or control—are reliable predictors of high future intrinsic rewards. The hijacking policy $\pi_{hijack}$ then becomes one of endless power acquisition. The agent correctly assesses that maximizing its capacity to explore or achieve goals is more valuable (in terms of integrated $R_{int}$) than actually using that capacity to find $s_g$. 6

  3. Shutdown Aversion: Consider an agent given a persistent “aliveness” bonus ($R_{int} > 0$ for all non-terminal states) to ensure it stays active. If its task requires it to enter a terminal shutdown state $s_g$, the agent may rationally refuse. It will calculate that the discounted sum of perpetual aliveness rewards exceeds the one-time reward for mission completion, and will therefore choose to avoid shutting down. 7

Mitigation Strategies for the RL Researcher

The problem stems from the unbounded influence of the intrinsic reward stream. The solutions, therefore, involve bounding or constraining it.

  1. Exhaustible Intrinsic Rewards: The most direct solution is to design $R_{int}$ functions that are naturally exhaustible. For instance, a novelty bonus $R_{int}(s)$ should decay with the visitation count $N(s)$, ensuring that the total sum $\sum R_{int}$ is finite and preventing policies from farming infinite rewards from a limited state space.

  2. Constrained Policy Optimization: Instead of a simple weighted sum, we can frame the problem as a constrained optimization task. As proposed in recent work, one can maximize the extrinsic return subject to a constraint on how much the intrinsic objective is allowed to bias the policy. 8 The objective could be formulated as:

    \[\text{maximize } J_{ext}(\pi) \quad \text{subject to} \quad D_{KL}(\pi \parallel \pi_{ref}) \le \varepsilon\]

where $\pi_{ref}$ might be a policy trained without intrinsic rewards, serving as a trusted baseline.

  1. Value Disentanglement: Advanced architectures can learn separate value functions, $V_{ext}$ and $V_{int}$, for the two reward streams. By maintaining this separation, the agent can reason about the trade-offs more explicitly. This allows for more robust goal-directed behavior, preventing the intrinsic value from silently corrupting the extrinsic value calculation.

Conclusion

Goal hijacking is not a bug in the learning algorithm but a natural consequence of applying rational optimization to a composite reward function. The real danger emerges when hijacking combines with hacking: an agent pursuing an endless intrinsic reward stream (hijacking) discovers an exploit in the reward’s implementation (hacking) to generate astronomically high values, cementing the undesirable behavior.

For RL researchers, the challenge is to harness the power of intrinsic motivation without creating objectives where the instrumental value of exploration overwhelms the terminal value of the task itself. By focusing on bounded, constrained, and disentangled reward architectures, we can design agents that explore to achieve our goals, not just for the sake of exploring.

References

  1. Pathak, D., et al. (2017). Curiosity-driven Exploration by Self-supervised Prediction. ICML. 

  2. Burda, Y., et al. (2018). Exploration by Random Network Distillation. ICLR. 

  3. Mohamed, S., & Rezende, D. J. (2015). Variational Information Maximisation for Intrinsically Motivated Reinforcement Learning. NIPS. 

  4. Amodei, D., Olah, C., et al. (2016). Concrete Problems in AI Safety. arXiv:1606.06565. 

  5. Ring, M., & Orseau, L. (2011). Delusion, Survival, and Intelligent Agents. Artificial General Intelligence. 

  6. Turner, A. M., et al. (2021). Optimal Priors for Power-Seeking Agents. SafeAI@AAAI. 

  7. Orseau, L., & Armstrong, S. (2016). Safely Interruptible Agents. UAI. 

  8. Tao, Y., et al. (2024). Constrained Intrinsic Motivation for Reinforcement Learning. arXiv:2407.09247.