Bayesian Bellman Equations and Information Directed Sampling
July 29, 2025
A non-standard introduction to Bayesian Belief Bellman equations and Information Directed Sampling.
In the grand challenge of artificial intelligence, reinforcement learning (RL) tackles a fundamental problem: how can an agent learn to thrive in an environment through trial and error? The goal is to accumulate the most reward possible, but the path to success depends entirely on the agent’s knowledge of the world.
Consider the difference between playing chess and playing a brand-new video game for the first time. In chess, the rules are known. The challenge is purely one of planning—finding the optimal strategy within a perfectly defined system. In the new game, however, the rules themselves must be discovered. Which actions are powerful? Which items are valuable? This is a challenge of both learning and planning.
The World with a Map: The Classic Bellman Equation
In reinforcement learning, the “known world” of chess is formalized as a Markov Decision Process (MDP). An MDP is defined by a set of states s, actions a, and two crucial functions that act as the world’s rulebook:
- Transition Function P(s'|s,a): If you are in state s and take action a, what is the probability of ending up in state s'?
- Reward Function R(s,a): How much reward do you get for taking action a in state s?
When this rulebook is known, the agent’s task is to find an optimal policy, π*(s), that maximizes the cumulative discounted reward. The cornerstone for solving this is the Bellman Optimality Equation:
\[V^*(s) = \max_a \left\{ R(s,a) + \gamma \sum_{s'} P(s' \mid s,a)\, V^*(s') \right\}\]
In simple terms, this equation states that the value of being in a state s
is the immediate reward from the best possible action, plus the discounted value of the state you are likely to land in next. It’s a beautiful, recursive principle for perfect planning when you have a perfect map of the world.
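To make this recursion concrete, here is a minimal value-iteration sketch for a small tabular MDP. It is only an illustrative example: it assumes the map is given as a NumPy array P of shape (states, actions, next states) and a reward array R of shape (states, actions), both fully known.

```python
import numpy as np

def value_iteration(P, R, gamma=0.95, tol=1e-6):
    """Plan with a known 'map': P[s, a, s'] = transition prob, R[s, a] = reward.

    Repeatedly applies the Bellman optimality backup until the values stop changing.
    """
    n_states, n_actions = R.shape
    V = np.zeros(n_states)
    while True:
        # Q[s, a] = R(s, a) + gamma * sum_s' P(s'|s, a) * V(s')
        Q = R + gamma * np.einsum("sap,p->sa", P, V)
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmax(axis=1)  # optimal values and a greedy policy
        V = V_new
```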
The World Without a Map: The Two Paths of Learning
But what happens when the agent doesn’t have the map? What if P
and R
are unknown? This is the true challenge of learning, and it introduces the famous exploration-exploitation dilemma. Should the agent exploit what it already knows to get guaranteed rewards, or should it explore uncertain paths to gain knowledge that might lead to even greater rewards in the future?
Modern RL has largely forged two paths to solve this:
- Model-Free Learning: This popular approach, including algorithms like Q-Learning, bypasses learning the map (P and R) altogether. Instead, it learns the value of actions directly, often storing them in a table or function Q(s,a). Exploration is typically handled by a separate mechanism, like the ε-greedy strategy, which forces the agent to take a random action a fraction of the time. The reasoning about exploration is effective, but often external to the core value-learning process.
- Model-Based Learning: This approach embraces the map. The agent tries to learn an approximate model of the world, P̂ and R̂, from its experiences. It then uses this learned model to plan, often by applying the Bellman equation to its own imperfect map.
The Bayesian Leap: Modeling the Learner Itself
This brings us to a more profound, third way. What if we take the model-based approach a step further? Instead of learning a single “best guess” of the map, what if the agent explicitly models its own uncertainty? What if its “state” includes not just its physical location, but its state of knowledge?
This is the core idea of Bayesian Reinforcement Learning. The agent maintains a belief, which is a probability distribution over all possible models of the world. It knows what it knows, and it knows what it doesn’t.
This changes everything. An action is no longer judged solely on its potential reward, but also on its potential to provide information—to sharpen the agent’s belief and reduce its uncertainty. To make decisions that intelligently trade off immediate reward with the long-term value of learning, we need a new kind of Bellman equation. First, we will build this ‘holy grail’ of learning agents: the Bayesian Bellman equation. We will see why it perfectly captures the value of information, but is ultimately intractable. Then, we will explore how the core principle of Information-Directed Sampling provides a brilliant and practical one-step approximation to this ideal.
Belief States
To tackle a world with unknown rules, we must fundamentally change how the agent perceives its state. The agent’s state is no longer just its physical position s
in the environment; it must also include the agent’s knowledge about those rules. This is the central shift that takes us beyond the standard MDP.
This approach is a sophisticated form of model-based RL. We formalize the agent’s knowledge using a belief state, $b_t$. Let’s bundle the world’s unknown dynamics—the true transition function P
and reward function R
—into a single parameter, θ
. The belief state is then the agent’s posterior probability distribution over θ
, conditioned on its entire history of actions and observations:
\[b_t(\theta) = P(\theta \mid h_t), \qquad h_t = (s_0, a_0, r_0, s_1, a_1, r_1, \dots, s_t)\]
Rather than committing to a single, potentially incorrect model of the world, the agent maintains this rich, probabilistic belief over all possible models. This belief, $b_t$, is the agent’s model, and it evolves as the agent gathers more data.
With this richer state representation, we can now write a Bellman equation where the value function, V(b)
, represents not the value of a physical location, but the value of a state of knowledge.
From Beliefs to Actions to Value
The Bellman equation we are building operates on belief states. The value of a belief state b_t
under a given policy π
is the total expected reward you will accumulate by following that policy from now on.
This value can be expressed with the recursive Bayesian Bellman Equation. The value of the current belief b_t
is the expected immediate reward from the action it prescribes, plus the discounted expected value of the next belief state, b_{t+1}
:
\[V^\pi(b_t) = E\big[\, r \mid b_t, \pi(b_t) \,\big] + \gamma\, E_{b_{t+1} \sim P(\,\cdot\, \mid b_t, \pi(b_t))}\big[ V^\pi(b_{t+1}) \big]\]
The optimal value, V*(b)
, is simply the value achieved by the best possible policy. This gives us the Bayesian Bellman Optimality Equation, which defines the objective that algorithms like Information-Directed Sampling (IDS) seek to approximate:
\[V^*(b_t) = \max_a \left\{ E[\, r \mid b_t, a \,] + \gamma\, E_{b_{t+1} \sim P(\,\cdot\, \mid b_t, a)}\big[ V^*(b_{t+1}) \big] \right\}\]
This equation is the heart of the problem. It tells us that the best action is the one that maximizes the sum of immediate expected reward and the expected value of the resulting future belief state.
One of the most elegant policies for acting on beliefs is Thompson Sampling. The procedure is beautifully simple:
- Sample a world: At each decision point t, draw one complete model of the world’s dynamics, θ_sample, from your current belief distribution b_t.
- Act optimally in that world: For this single, fully-specified (but hypothetical) world θ_sample, you can compute the exact optimal policy, π_sample.
- Take one step: Execute the single action that π_sample recommends for your current physical state s.
This approach turns a complex problem of acting under uncertainty into a simpler problem of planning under certainty, naturally balancing exploration and exploitation.
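As a rough sketch of this loop, reusing the value_iteration example from earlier and assuming hypothetical sample_model, update_posterior, and env.step helpers (none of these come from a particular library):

```python
def thompson_sampling_step(belief, state, env, gamma=0.95):
    """One Thompson Sampling decision: sample a world, plan in it, act once, learn."""
    # 1. Sample a world: draw one complete model (P, R) from the current posterior.
    P_sample, R_sample = sample_model(belief)
    # 2. Act optimally in that world: plan as if the sampled model were the truth.
    _, policy = value_iteration(P_sample, R_sample, gamma)
    # 3. Take one step: execute the sampled policy's action in the real environment.
    action = policy[state]
    next_state, reward = env.step(action)
    # Update the belief with the observed outcome via Bayes' Rule.
    belief = update_posterior(belief, state, action, next_state, reward)
    return belief, next_state
```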
The Learner in the Loop: The Dynamics of Belief
This is where the model becomes truly fascinating. In a standard MDP, the transition function P(s'|s,a)
is a fixed property of the environment. In our Bayesian MDP, the belief transition P(b'|b,a)
is a property of the learner itself. It describes how the agent’s belief state b
evolves after taking action a
and observing an outcome (s', r)
. This belief transition is driven by the environment’s dynamics, but determined by the agent’s own update rule.
Let’s be precise. The transition from one belief to the next is a two-part process:
- The environment produces an outcome (s', r).
- The agent uses this new piece of data to update its belief via Bayes’ Rule.
For any specific outcome (s', r)
that the agent might observe, the resulting belief state b_{t+1}
is calculated deterministically:
\[b_{t+1}(\theta) = P(\theta \mid b_t, a, s', r) = \frac{\overbrace{P(s', r \mid s, a, \theta)}^{\text{Likelihood}} \; \overbrace{b_t(\theta)}^{\text{Prior}}}{\underbrace{P(s', r \mid b_t, a)}_{\text{Evidence}}}\]
From the agent’s perspective at time t
, it doesn’t know what the outcome (s', r)
will be. Therefore, it doesn’t know for sure what its next belief b_{t+1}
will be. This is why the belief transition P(b'|b,a)
is a probability distribution.
The probability of transitioning to a particular new belief b'
is the total probability of observing any outcome (s', r)
that would cause that specific update. The agent calculates this by first finding the probability of a given outcome (the “Evidence” term above) by averaging over its current belief:
\[P(s', r \mid b_t, a) = \int P(s', r \mid s, a, \theta)\, b_t(\theta)\, d\theta\]
Then, the probability of reaching the next belief state b'
is the sum of probabilities of all the outcomes that lead to it:
\[P(b' \mid b_t, a) = \sum_{(s', r)\;:\; b_{t+1}(b_t, a, s', r) = b'} P(s', r \mid b_t, a)\]
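For a finite set of candidate models, both steps fit in a few lines. The sketch below assumes the belief is a dict mapping each candidate θ to its probability, and a hypothetical likelihood(theta, s, a, s_next, r) function returning P(s', r | s, a, θ):

```python
def predictive_prob(belief, s, a, s_next, r, likelihood):
    """Evidence term P(s', r | b, a): average the likelihood over the current belief."""
    return sum(p * likelihood(theta, s, a, s_next, r) for theta, p in belief.items())

def bayes_update(belief, s, a, s_next, r, likelihood):
    """Deterministic belief transition for one observed outcome (s', r)."""
    evidence = predictive_prob(belief, s, a, s_next, r, likelihood)
    return {theta: p * likelihood(theta, s, a, s_next, r) / evidence
            for theta, p in belief.items()}
```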
The Bellman equation is now modeling the dynamics of our own learning process, as it’s driven by the environment. It asks a profound question: “Which action will lead to a sequence of observations that, in turn, leads to a more valuable future belief state?” An action is good not just if it yields a high immediate reward, but if it teaches us something that will enable higher future rewards. The value of an action is now inextricably tied to the value of the information it provides.
The True Value of Information
The Bayesian Bellman equation perfectly captures the learner’s dilemma. The value of any action a
is split between immediate gratification and long-term investment in knowledge:
\[Q^*(b_t, a) = \underbrace{E[\, r \mid b_t, a \,]}_{\text{Immediate Reward}} + \underbrace{\gamma\, E_{b_{t+1} \sim P(\,\cdot\, \mid b_t, a)}\big[ V^*(b_{t+1}) \big]}_{\text{Future Potential}}\]
The first term is straightforward. The second term is where the magic happens. Why would one future belief state, b_{t+1}
, be more valuable than another? Because it is more accurate. A more accurate belief—a sharper probability distribution that has ruled out incorrect hypotheses about the world—allows the agent to make better decisions for the rest of time. If you discover a hidden shortcut in a maze, your knowledge has improved, and the value of your future self has increased because you can now exploit that shortcut.
We can formalize this concept. The value of the information (VoI) gained from an action is the expected increase in the value of your future self, thanks to the knowledge acquired in this single step.
\[\text{VoI}(a) = E_{b_{t+1} \sim P(b'\mid b_t, a)}[V^*(b_{t+1})] - V^*(b_t)\]
This metric answers the question: “By how much will my maximum expected future reward increase as a direct result of the information I gain from this action?”
This brings us to a classic chicken-and-egg problem, but one of a particularly vicious kind. To choose the best action a
, we need to calculate its VoI, which requires knowing the value of future belief states, V^*(b_{t+1})
. But to calculate V^*(b_{t+1})
, we would need to have already solved the Bellman equation for all possible future beliefs.
In a standard MDP with a finite number of states, this recursion is solvable with algorithms like value iteration. Here, it is fundamentally intractable. The reason is that the state space of beliefs is infinite-dimensional. A belief b
is a probability distribution, which is a function. You cannot build a lookup table for V(b)
where the rows are functions, making standard dynamic programming impossible.
Since we cannot compute V^*(b_{t+1})
, we cannot compute the true value of information. This is why solving the Bayesian RL problem directly is intractable. We must, therefore, turn to a clever heuristic that provides a computable proxy for the value of information.
The IDS Heuristic: Pricing Information with an Exploration Bonus
We’ve established that we cannot compute the true value of information because we can’t solve the Bellman equation over the infinite-dimensional belief space. To make progress, we must find a computable proxy for the “future potential” term, γE[V*(b_{t+1})]
.
Information-Directed Sampling (IDS) provides an elegant heuristic to do just this. However, it’s crucial to note that IDS was originally developed for the multi-armed bandit problem, a simpler setting without states or long-term consequences. A “vanilla” IDS rule that simply picks an action by maximizing E[r] + VoI(a)
would be too myopic for the full reinforcement learning problem; once learning is complete, it would devolve into a purely greedy agent that only considers immediate rewards, failing to plan.
To properly extend the spirit of IDS to the MDP setting, we must integrate its core insight into a long-horizon planning algorithm. We do this by treating information gain as an exploration bonus. The agent modifies its planning objective, acting as if it receives an intrinsic reward for reducing its own uncertainty.
Instead of solving the standard Bellman equation, the agent plans by solving a modified, “optimistic” Bellman equation:
\[V_{optimistic}^*(s, b_t) = \max_a \left\{ \underbrace{E[r \mid b_t, a]}_{\text{Expected Reward}} + \underbrace{\lambda \cdot g_t(a)}_{\text{Information Bonus}} + \underbrace{\gamma \cdot E_{b_{t+1}, s'}[V_{optimistic}^*(s', b_{t+1})]}_{\text{Future Potential}} \right\}\]
Here, λ
is a hyperparameter that represents the “price” the agent is willing to pay for knowledge. This framework ensures the agent always plans for the long-term, but with a temporary, curiosity-driven bias. When the information bonus g_t(a)
fades, the equation seamlessly reverts to the standard Bellman equation, ensuring proper exploitation.
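A rough sketch of how this modified objective drives action selection, under the same discrete-belief assumptions as before and with hypothetical expected_reward, info_gain, and future_value helpers (a concrete info_gain sketch follows in the next section):

```python
def optimistic_action(belief, state, actions, expected_reward, info_gain, future_value,
                      lam=0.1, gamma=0.95):
    """Choose the action maximizing E[r | b, a] + lambda * g(a) + gamma * E[V(s', b')].

    All three scoring functions are assumed helpers; in a full implementation,
    future_value would itself come from solving the optimistic recursion.
    """
    def score(a):
        return (expected_reward(belief, state, a)
                + lam * info_gain(belief, state, a)
                + gamma * future_value(belief, state, a))
    return max(actions, key=score)
```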
This bonus-based planning is a great principle, but it hinges on finding a good measure for the information bonus. Here lies the key insight of Information-Directed Sampling: we can approximate the value of information using mutual information.
To understand this, we must realize that from the agent’s perspective at time t
, anything it doesn’t know for certain is a random variable. Specifically:
- The Optimal Action, A*: Since the true world dynamics θ are unknown, the true optimal action A* from the current state is also unknown. The agent’s belief b_t induces a probability distribution over what A* might be.
- The Next Observation, (s', r): The agent cannot perfectly predict the outcome of its action, so it has a probability distribution over the next observation it will see.
The information gain of an action a
, denoted g_t(a)
, is defined as the mutual information between the optimal action and the next observation, given our current belief and the chosen action:
\[g_t(a) = I\big( A^*;\, (s', r) \,\big\vert\, b_t, a \big)\]
This quantity measures, in “bits,” how much observing the outcome (s', r)
is expected to reduce the agent’s uncertainty about the identity of the true optimal action A*
. An action is informative if its outcome tells us a lot about how we should act in the future.
Crucially, this can be rewritten in a more intuitive form that reveals the “what-if” simulation the agent runs in its head. The mutual information is exactly equal to the expected reduction in entropy of the agent’s belief about the optimal action:
\[g_t(a) = \underbrace{H(P(A^* \mid b_t))}_{\text{Current Uncertainty}} - \underbrace{\mathbb E_{(s',r) \sim P(\cdot \mid b_t, a)} [H(P(A^* \mid b_t, a, s', r))]}_{\text{Expected Future Uncertainty}}\]
This is our computable proxy for the value of information. By embedding it as a bonus within a proper Bellman equation, we create an agent that intelligently plans for the long run. It is intrinsically motivated to perform experiments that resolve its uncertainty, and once that uncertainty is gone, the bonus disappears, leaving a perfectly rational agent to exploit its hard-won knowledge.
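Here is one way this entropy reduction could be computed for a finite belief, sketched under the same assumptions as before: a dict-based belief, the predictive_prob and bayes_update helpers from the earlier sketch, a hypothetical optimal_action(theta, state) that solves each candidate model, and an enumerable set of possible outcomes.

```python
import math
from collections import defaultdict

def entropy(dist):
    """Shannon entropy (in bits) of a dict mapping outcomes to probabilities."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

def astar_distribution(belief, state, optimal_action):
    """P(A* | b): push the mass on each candidate model onto that model's optimal action."""
    dist = defaultdict(float)
    for theta, p in belief.items():
        dist[optimal_action(theta, state)] += p
    return dist

def info_gain(belief, state, a, outcomes, likelihood, optimal_action):
    """g_t(a): current entropy of A* minus its expected entropy after observing (s', r)."""
    current = entropy(astar_distribution(belief, state, optimal_action))
    expected_future = 0.0
    for s_next, r in outcomes:
        p_outcome = predictive_prob(belief, state, a, s_next, r, likelihood)
        if p_outcome == 0:
            continue
        posterior = bayes_update(belief, state, a, s_next, r, likelihood)
        expected_future += p_outcome * entropy(astar_distribution(posterior, state, optimal_action))
    return current - expected_future
```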
Can We Trust Our Own Imagination? The Convergence Guarantee
The entire IDS calculation, from the expected immediate reward to the information gain, hinges on the agent’s current belief, b_t
. The process is a form of self-simulation: the agent makes decisions by reasoning about a future that it “imagines” based on its current model of the world. This raises a critical question: Can we trust our own imagination? How can we be sure that the agent’s belief b_t
will eventually converge to the true world dynamics θ
, ensuring its decisions are based on reality, not fantasy?
The answer lies in the power of Bayesian inference, which guarantees that the agent’s belief will converge to the truth, provided two fundamental conditions are met.
1. A Learnable World (Identifiability)
This is a condition on the environment. The world itself must be learnable. If two different sets of rules for the world (e.g., θ_1
and θ_2
) produce the exact same statistical pattern of observations no matter what the agent does, then no algorithm could ever tell them apart. We must assume the world is not pathologically ambiguous.
2. A Curious Agent (Sufficient Exploration)
This is a condition on the agent’s policy. The agent must not prematurely stop gathering data. If it finds a reasonably good strategy early on and exploits it forever, it may never try the actions needed to discover the truly optimal strategy. It must remain “curious.”
This is the final, beautiful piece of the puzzle. The IDS algorithm ensures its own success by satisfying the second condition by design.
Many RL algorithms treat exploration as an external mechanism, like the random actions of an ε-greedy policy. For IDS, exploration is not an afterthought; it is part of the objective function. The + λ * Information Gain(a)
term provides an explicit reward for reducing uncertainty. As long as the agent is uncertain about the optimal action—meaning the entropy of its belief H(P(A* | b_t))
is greater than zero—it is intrinsically motivated to take actions that yield information.
This creates a powerful, self-correcting feedback loop:
- Belief drives the agent to take actions it predicts will be informative.
- These actions gather new data from the environment.
- The new data refines the belief via Bayes’ Rule, making it a more accurate model of reality.
- This more accurate belief allows for better decisions and more precise simulations in the future.
Limitations and Practical Considerations
While this framework is powerful, it is not a silver bullet. Moving from this ideal to a practical algorithm introduces significant challenges:
- Computational Cost: Maintaining and updating a full belief state b_t can be computationally prohibitive. For all but the simplest models, the belief is an intractable probability distribution. Practical implementations must rely on approximations, such as representing the belief with a finite set of samples (particle filtering) or assuming it belongs to a simple parametric family (conjugate priors; see the small sketch after this list).
- The Myopic Approximation: The IDS heuristic is “myopic” or one-step. It prices information based only on its immediate impact on the next belief state. It doesn’t account for the possibility of taking an action now that unlocks even more valuable information-gathering opportunities several steps in the future.
- The Price of Information (λ): The performance of the algorithm can be highly sensitive to the hyperparameter λ. Setting this “price” is a difficult problem in itself, often requiring domain-specific tuning.
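As a tiny illustration of the conjugate-prior escape hatch mentioned above, a Bernoulli reward with a Beta prior keeps the entire belief in two counters (a deliberately simple, hypothetical example rather than a general recipe):

```python
class BetaBernoulliBelief:
    """Belief over an unknown Bernoulli reward probability, stored as Beta(alpha, beta)."""

    def __init__(self, alpha=1.0, beta=1.0):
        self.alpha, self.beta = alpha, beta  # Beta(1, 1) is a uniform prior

    def update(self, reward):
        """Bayes' Rule collapses to a counter increment for this conjugate pair."""
        if reward == 1:
            self.alpha += 1
        else:
            self.beta += 1

    def mean(self):
        """Posterior mean estimate of the reward probability."""
        return self.alpha / (self.alpha + self.beta)
```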
Despite these limitations, the Bayesian Bellman equation and the IDS heuristic provide a principled and elegant framework for thinking about one of the deepest problems in artificial intelligence: how to act intelligently when you don’t know the rules. Belief drives actions, and actions refine belief, all guided by the elegant principle of getting the most knowledge for the lowest price.