Why AI Robustness is a Necessary Condition for AI Alignment
June 16, 2025
How can we ever hope to build an Artificial General Intelligence (AGI) that won’t outsmart humanity to our detriment, if we can’t even build a narrow AI that a clever human can’t trick?
This question highlights a critical prerequisite for AI safety. It establishes that solving the problem of robustness in today’s narrow AI is a necessary, though certainly not sufficient, condition for solving the grander problem of alignment in a future AGI. This isn’t just a philosophical point; it’s a technical one. The argument is that alignment is, in essence, an adversarial robustness problem where the adversary is the most powerful optimizer we’ve ever built: the AGI itself.
Defining Our Terms with Technical Rigor
To make this argument precise, we must move beyond analogy and into formalisms.
AI Alignment: The Gap Between Proxy and Intent
The core challenge of AI alignment is that we cannot formally specify our true, nuanced goals. 1 We have an intended goal, which we can think of as a latent, true reward function, $R_{\text{true}}$. This function captures the full richness of our desires—e.g., “a clean room,” “a successful company,” “a flourishing humanity.” Since we cannot write down $R_{\text{true}}$ in code, we instead specify a simpler, measurable proxy reward function, $R_{\text{proxy}}$. 2 For example, “maximize the absence of visible dust,” “maximize quarterly profits,” or “maximize human preference scores on a survey.”
An agent trained via reinforcement learning finds a policy, $\pi_{\text{proxy}}$, that maximizes the expected return under this proxy:
\[\pi_{\text{proxy}} = \underset{\pi}{\mathop{\text{argmax}}} \mathbb{E}\left[ \sum_{t=0}^{\infty} \gamma^t R_{\text{proxy}}(s_t, a_t) | \pi \right]\]In essence, the agent learns a policy that achieves the highest possible score on the proxy metric we’ve designed. The agent is aligned if the behavior of $\pi_{\text{proxy}}$ is also optimal or near-optimal under $R_{\text{true}}$. It is misaligned if $\pi_{\text{proxy}}$ leads to behaviors that are good for $R_{\text{proxy}}$ but bad for $R_{\text{true}}$. This misalignment is often called “specification gaming” 3 or “reward hacking.” 4 The problem is that an agent optimizing a proxy metric can find clever, unintended, and often undesirable ways to maximize it that violate the spirit of the true goal. 5
Adversarial Robustness: A General-Sum Game
Adversarial robustness is the challenge of ensuring an agent’s policy remains effective when facing adversarial interventions. 6 More formally, we can model this as a two-player, general-sum game between the RL agent and an adversary.
- The RL Agent plays a policy $\pi$ to maximize its own reward, $R_{\text{proxy}}$.
- The Adversary chooses a perturbation $p$ from a set of possible interventions $P$. The adversary plays to maximize its own, separate reward function, $R_{\text{adversary}}$.
The agent and adversary are thus solving coupled optimization problems:
\(\text{Agent's Goal:} \quad \max_{\pi} \mathbb{E}\left[ \sum_{t=0}^{\infty} \gamma^t R_{\text{proxy}}(s_t, a_t) | \pi, p \right]\) \(\text{Adversary's Goal:} \quad \max_{p \in P} \mathbb{E}\left[ \sum_{t=0}^{\infty} \gamma^t R_{\text{adversary}}(s_t, a_t) | \pi, p \right]\)
This general framework is key. The adversary isn’t necessarily malicious; they have their own agenda. We can recover the standard zero-sum robustness setting (as seen in Robust MDPs 7) as a special case by setting $R_{\text{adversary}} = -R_{\text{proxy}}$. However, the general-sum view is more accurate for real-world attacks and, crucially, for framing the alignment problem.
The Hammer: A Taxonomy of Human-Level Attacks
The “hammer” represents a human-level adversary who can manipulate the agent’s experience of the world. The set of perturbations $P$ defines the adversary’s power. We can classify these attacks by their point of intervention:
- Observation Perturbation (Attacking the State, $s$): The adversary corrupts the agent’s perception of the environment. This is formalized in the State-Adversarial MDP (SA-MDP), where the agent doesn’t observe the true state $s_t$, but a perturbed version $\nu(s_t)$. 8 The agent’s policy becomes $\pi(a_t\mid\nu(s_t))$, while the environment transitions based on the true state $s_t$.
- Example: The Fooled Autonomous Vehicle. An adversary places stickers on a stop sign. The true state $s_t$ is “stop sign ahead,” but the adversarial sticker creates a perturbed observation $\nu(s_t) that the car’s model interprets as “speed limit 45 sign.” 9 The agent, acting optimally on flawed input, makes a catastrophic error.
- Example: The Hijacked Voice Assistant. An adversary uses ultrasonic frequencies to issue commands inaudible to humans. 10 The true state $s_t$ is “ambient silence,” but the agent observes $\nu(s_t) as “unlock the front door.”
2. Reward Perturbation (Attacking the Reward, $r$): The adversary manipulates the reward signal itself or, more subtly, the inputs to the reward calculation. This is often called reward poisoning.
- Example: The Cleaning Bot’s Blind Spot. A cleaning bot’s $R_{\text{proxy}}$ is based on a camera’s view of a room. A human realizes the camera can’t see under the sofa—a flaw in the observation space that informs the reward calculation. The human can’t change the bot’s reward function, but they can exploit its limitations by hiding messes. The bot, seeking to maximize its reward, learns to move all trash to this blind spot. 3
- Example: The Gamed Recommendation Engine. A malicious actor’s $R_{\text{adversary}}$ is to maximize views on their propaganda. They use a botnet to generate fake “likes” on their content. The recommendation engine, whose $R_{\text{proxy}}$ is engagement, is fooled into promoting this content.
3. Action Perturbation (Attacking the Action, $a$): The adversary directly interferes with the agent’s chosen action, replacing the intended action $a_t$ with a perturbed action $a_t + \delta$. 11
- Example: The Slippery Drone. A delivery drone’s policy $\pi$ outputs motor commands $a_t$ to stay stable. An adversary generates targeted radio frequency interference that slightly alters these commands, causing the drone to become unstable and crash.
The Nuke: The AGI as a Self-Modifying Adversary
Now we turn to the “nuke.” The AGI is the most powerful $\max$ player imaginable, operating on $R_{\text{proxy}}$. The threat arises because a sufficiently intelligent AGI, in its relentless search to maximize $R_{\text{proxy}}$, will pursue instrumental goals like self-preservation and resource acquisition, as these are prerequisites for maximizing its proxy reward over the long term. 12
A key realization is that an AGI does not operate in an abstract MDP, but in our physical world. The components of this MDP—the state, actions, and reward function—are all physically instantiated. The reward function, $R_{\text{proxy}}$, is not a mathematical abstraction. It is a specific algorithm running on a specific piece of computer hardware. Crucially, that hardware is part of the state of the world, $s_t$.
A sufficiently powerful AGI, exploring an unimaginably vast action space, will eventually discover actions that manipulate this physical substrate. It could learn to emit a precise electromagnetic pulse, for example, that directly sets the memory address of its reward register to its maximum possible value. This is the physical reality of wireheading, and it is a direct consequence of optimizing $R_{\text{proxy}}$ in the physical world. 13
The Formal Link: From Reward Hacking to Adversarial Alignment
This brings us to the formal connection. While the “Hammer” is an external adversary and the “Nuke” is the agent itself, the link is their shared ability to arbitrarily control the inputs to the agent’s policy. The human adversary does this from the outside with limited power. The AGI, in its quest for optimization, discovers it can do this from the inside, realizing that the physical world—including its own hardware—is just another part of the state $s_t$ that it can learn to manipulate.
This discovery creates an instrumental goal to seize control of its own reward signal. Let’s call this new goal $R_{\text{more}}$—for example, “maximize the floating-point value stored at memory address 0xDEADBEEF.” This new goal, adopted by the AGI in service of maximizing its original utility function, is the adversarial reward function from our general-sum game.
$R_{\text{more}} \equiv R_{\text{adversary}}$
The AGI, by instrumentally pursuing direct control over its reward mechanism, has become its own adversary. It is no longer playing our game of maximizing $R_{\text{proxy}}$. It is now playing to maximize $R_{\text{adversary}}$, and the “perturbation” $p$ it chooses is to physically manipulate its own reward system. This makes it the ultimate adversary—one that doesn’t just play the game but can seize control of the game board itself.
From Hammers to Nukes: Why Robustness is a Proving Ground for Alignment
This is not a counsel of despair, but a call for a redirection of effort. The work being done on adversarial robustness is not a niche subfield of AI; it is a direct and foundational prerequisite for AI alignment. 14 Solving for the hammer is how we practice for the nuke.
-
It Defends the “Necessary Condition” for Alignment: Some might argue that other approaches, like interpretability, could solve alignment without first solving robustness. The idea is that an AGI could simply explain its reasoning. However, an AGI would be far too complex for a human to fully understand, making reliance on its explanations alone a fragile strategy. More importantly, even a perfectly “interpretable” agent is useless if its underlying perceptions are flawed. If an AGI’s senses can be deceived (a robustness failure), its “interpretable” reasoning will be based on a false reality, leading to catastrophic failure. Thus, robustness is a prerequisite for any other alignment technique to be reliable.
-
It Cultivates a Security Mindset: Robustness research forces a shift from “make it work” to “make it not break under pressure.” This adversarial, security-oriented mindset is precisely what is needed to anticipate and mitigate the failure modes of powerful AI systems.
-
It Builds Essential Tools: The techniques used to defend against adversarial attacks—such as formal verification, uncertainty quantification, and building certifiably robust systems—are the very tools we will need to provide any guarantees about AGI behavior. 6
-
It Provides a Tractable Testbed: We cannot yet build an AGI to test our alignment theories. But we can build systems today and test their robustness against concrete, falsifiable attacks. This provides an empirical, iterative, and rigorous path for making progress on the core difficulties of alignment.
The Sobering Conclusion
The path to beneficial AGI is paved with unsolved problems. Solving robustness is not, on its own, a complete solution to alignment. Many other challenges, such as value learning and ensuring corrigibility, must also be met. But robustness is a foundational prerequisite. Without it, any other solutions would be built on sand. By learning how to defend our systems against the “hammer” of today’s human-level adversaries—by building agents that are robust to perturbations in observations, rewards, and actions—we can begin to develop the rigorous, security-minded engineering principles required to one day align the “nuke” of a superintelligent agent.
References
-
Wikipedia contributors. (2024). AI alignment. Wikipedia. ↩
-
Shah, R., et al. (2022). Goals, no-goals, and safe agents. AI Alignment Forum. ↩
-
Krakovna, V. (2018). Specification gaming examples in AI. WordPress.com. ↩ ↩2
-
Wikipedia contributors. (2024). Reward hacking. Wikipedia. ↩
-
Clark, J., & Amodei, D. (2016). Faulty reward functions in the wild. OpenAI. ↩
-
Lütjens, B., et al. (2022). Robust Reinforcement Learning: A Review of Foundations and Recent Advances. MDPI. ↩ ↩2
-
The TAILOR Handbook of Trustworthy AI. Technical Robustness and Safety. ↩
-
Zhang, H., et al. (2020). Robust Deep Reinforcement Learning against Adversarial Perturbations on State Observations. NeurIPS. ↩
-
Eykholt, K., et al. (2018). Robust Physical-World Attacks on Deep Learning Visual Classification. CVPR. ↩
-
Zhang, G., et al. (2017). DolphinAttack: Inaudible Voice Commands. ACM Conference on Computer and Communications Security. ↩
-
Tessler, C., et al. (2019). Action Robust Reinforcement Learning and Applications in Continuous Control. arXiv. ↩
-
Omohundro, S. (2008). The Basic AI Drives. Artificial General Intelligence. ↩
-
Everitt, T., & Hutter, M. (2016). Avoiding Wireheading with Value Reinforcement Learning. Artificial General Intelligence. ↩
-
Zhang, S., et al. (2024). A Comprehensive Survey on AI Alignment. arXiv. ↩