Inverting the Bellman Equation: From Q-Values to World Models

Alistair Letcher
Mattie Fellows
Alexander Goldie
Jonathan Richens
Jakob Foerster
Oliver Richardson

TLDR: We prove that model-free agents trained on a sufficiently rich set of reward functions, e.g. using goal-conditioned RL, implicitly encode a unique world model (WM) in their $Q$-values. We introduce $P$-learning to extract this WM in practice, and show that agents trained on just a handful of goals encode accurate dynamics in Reacher, MountainCar, and stochastic FourRooms, even over variables that rewards never directly depend on. Surprisingly, policies trained exclusively on a Reacher agent's implicit WM are quasi-optimal on velocity-based goals despite position-only training.

Illustration of P-learning

Overview

The value functions typically learnt by model-free agents are intrinsically tied to policies and reward functions, making it challenging to interpret what they understand of the world (as separate from what they value). In general, distinct worlds (transition kernels) can give rise to identical value functions, an obstacle formalised as the value equivalence problem. $\newcommand{\cS}{\mathcal{S}} \newcommand{\cA}{\mathcal{A}} \newcommand{\cG}{\mathcal{G}} \newcommand{\cT}{\mathcal{T}} \newcommand{\E}{\mathbb{E}} \newcommand{\eq}{:=} \newcommand{\om}{\omega}$ This raises the question:

When do model-free agents encode an accurate model of their environment?

We hypothesise that agents trained on a sufficiently rich set of goals — e.g. via goal-conditioned RL (GCRL), successor features, or forward-backward learning — implicitly learn a unique world model. Taking GCRL as the broader framework for agents trained on goals parameterised by arbitrary reward functions, our work makes three contributions to test this hypothesis and bridge model-based, model-free and goal-conditioned RL.

Methodologically, we introduce $P$-learning, an inverse analogue to $Q$-learning: instead of iteratively updating value estimates for a fixed environment, $P$-learning updates a candidate WM to be consistent with fixed value functions, effectively inverting the Bellman equation. We make this precise by proving that tabular $P$-learning converges to a closed-form expression involving the Moore-Penrose pseudo-inverse.
Theoretically, we prove conditions on the families of reward functions for which value equivalence is broken, i.e. for which $P$ is uniquely determined by $Q$-values. Our results demonstrate that GCRL can bridge model-free and model-based RL, with $Q$-values (paired with rewards) becoming informationally equivalent to the kernel $P$.
Empirically, we show that agents trained with a handful of goals often contain accurate WMs, even in continuous spaces like Reacher, and even over variables that rewards never directly depend on. Surprisingly, this transfers to near-optimal planning (exclusively inside the WM) for out-of-distribution goals, including reaching specific angular velocities despite position-only training. We analyse these implicit generalisation capabilities in MountainCar, revealing that agents with widely different objectives can secretly encode similar models. Finally, we identify a strong correlation (Spearman $\rho = 0.98$) between an agent's performance and the accuracy of its internalised WM, suggesting that GCRL is an implicitly hybrid method linking model-free and model-based RL.

$P$-learning: extracting world models from $Q$-values

Our starting point is a simple symmetry. The (goal-augmented) Bellman operator for a fixed policy $\pi$ is given by

$$ \mathcal{T}^\pi_{P}(Q)(s, a, g) = \mathbb{E}_{s' \sim P(s,a),\,a'\sim\pi(s',g)}\bigl[\,r(s', g) + \gamma\, Q(s', a', g)\,\bigr] \,, $$

making explicit the dependence of $\cT^\pi_P$ on the kernel $P$. While $Q$-learning treats $P$ as fixed and searches for a value function $Q$ that satisfies $\mathcal{T}^*_P(Q) = Q$, we consider the inverse problem of extracting an internal model $P^\star$ of the environment from a fixed agent with goal-conditioned $Q$-values, policy $\pi$, and known reward function $r$. The guiding observation is that a model $P_\phi$ satisfying the Bellman equation $\cT^\pi_{P_\phi}(Q) = Q$ is compatible with the agent's behaviour, and may thus be viewed as one possible subjective belief of the agent about the environment. A natural objective to extract such a model is therefore to minimise the Bellman residual

$$ \mathcal{L}(\phi) = \bigl\| \mathcal{T}^\pi_{P_\phi}(Q) - Q \bigr\|_d^2 $$

with respect to $\phi$, for some reference distribution $d \in \Delta(\cS \times \cA \times \cG)$, e.g. induced by an exploration policy. We call this $P$-learning. Defining the TD estimate $\hat\delta_\phi(s,a,g,s',a') \eq r(s',g) + \gamma Q(s',a',g) - Q(s,a,g)$, and writing $\delta_\phi(s,a,g) \eq \cT^\pi_{P_\phi}(Q)(s,a,g) - Q(s,a,g)$ for its expectation over $s'$ and $a'$ (the Bellman residual), we prove under regularity assumptions that the gradient is given by

$$ \nabla_\phi \mathcal L = \E_{g,s,a,s',a'}\bigl[\delta_\phi \hat\delta_\phi \nabla_\phi \log P_\phi(s'|s,a)\bigr] \,, $$

where $(s,a,g) \sim d$ and $s' \sim P_\phi(s, a), a' \sim \pi(s', g)$. For finite MDPs, we can show that the Bellman equation becomes a set of linear systems $M\,P_\phi(s,a) = Q(s,a)$, where $M_{lk} := r(s'_k, g_l) + \gamma V(s'_k, g_l)$ with $V(s', g) \eq \E_{a' \sim \pi(s', g)}[Q(s', a', g)]$, and we prove (Theorem 1) that tabular $P$-learning converges, for any learning rate $0 < \alpha < 2/\sigma_{\max}^2(M)$, to

$$ P_\infty(s,a) = M^{+}\, Q(s,a) + \bigl(I - M^{+}M\bigr)\, P_0(s,a)\,, $$

where $M^{+}$ is the Moore–Penrose pseudo-inverse. This reveals a key condition under which value equivalence can be broken: if $M$ has full column rank, then $M^{+}M = I$, the second term vanishes, and the iteration converges to a unique solution $M^+Q$, with $M^{+}$ acting as an inverse Bellman map from values back to dynamics (note that $M^+ = M^{-1}$ when $M$ is square). If $Q$-values are moreover exact, i.e. satisfy $\cT^\pi_P(Q) = Q$, then $P$-learning converges to the true kernel $P\,$!
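The closed-form limit can be checked numerically on a toy system. The sketch below is only an illustration under stated assumptions: it treats tabular $P$-learning for one fixed $(s,a)$ as plain, unconstrained gradient descent on $\tfrac12\|Mp - q\|^2$ (goals weighted uniformly), and the entries of `M2`, `M3`, `p_true` and `p0` are made-up numbers, not values from the experiments.

```python
import numpy as np

def tabular_p_learning(M, q, p0, n_steps=500):
    """Gradient descent on 0.5 * ||M p - q||^2 for one fixed (s, a) pair."""
    alpha = 1.0 / np.linalg.norm(M, 2) ** 2   # spectral norm = sigma_max(M), so alpha < 2 / sigma_max^2
    p = p0.copy()
    for _ in range(n_steps):
        p = p - alpha * M.T @ (M @ p - q)
    return p

def closed_form_limit(M, q, p0):
    """M^+ q + (I - M^+ M) p0, the limit stated in the text."""
    M_pinv = np.linalg.pinv(M)
    return M_pinv @ q + (np.eye(M.shape[1]) - M_pinv @ M) @ p0

# Hypothetical numbers: rows index goals, columns index next states.
M3 = np.array([[2.80, 0.90, 0.45],
               [0.90, 2.80, 0.45],
               [0.45, 0.45, 2.80]])
M2 = M3[:2]                              # only two goals: rank-deficient in the columns
p_true = np.array([0.7, 0.2, 0.1])       # true next-state distribution
p0 = np.full(3, 1.0 / 3.0)               # initial model P_0(s, a)

# Two goals: gradient descent matches the closed form, but keeps part of p0.
q2 = M2 @ p_true                         # exact Q-values
p_gd = tabular_p_learning(M2, q2, p0)
p_cf = closed_form_limit(M2, q2, p0)

# Three goals (square, invertible M): the true distribution is recovered.
p_full = closed_form_limit(M3, M3 @ p_true, p0)
```

With two goals the limit reproduces the $Q$-values exactly yet differs from `p_true`, because the null-space component of the initialisation survives; with three goals that component vanishes.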

$P$-learning — Pseudocode

Require: $Q$-values, policy $\pi$, reward $r$, distribution $d$ over $(s, a, g)$, step sizes $\{\alpha_n\}$, parametric family $P_\phi\,$, initial $\phi_0$

for $n = 0, 1, \ldots$ do
    sample $(s, a, g) \sim d$
    sample $s'_1, s'_2 \sim P_{\phi_n}(\cdot \mid s, a)$, $a'_1 \sim \pi(\cdot \mid s'_1, g)$, $a'_2 \sim \pi(\cdot \mid s'_2, g)$
    $\delta^{i} \gets r(s'_i, g) + \gamma\, Q(s'_i, a'_i, g) - Q(s, a, g)$ for $i = 1, 2$
    $\hat{g}_n \gets \delta^{1}\, \delta^{2}\, \nabla_\phi \log P_\phi(s'_2 \mid s, a)\big|_{\phi=\phi_n}$
    $\phi_{n+1} \gets \phi_n - \alpha_n\, \hat{g}_n$
end for
return extracted world model $P_{\phi_n}$
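As a rough illustration of the loop above, here is a minimal NumPy sketch for finite state, action and goal sets. Several choices are assumptions made for the example rather than details from the paper: the model is a softmax over next-state logits `phi`, the reference distribution $d$ is uniform, the step size is constant, and the arrays `Q`, `pi` and `r` are random placeholders.

```python
import numpy as np

def softmax(x):
    z = np.exp(x - x.max(axis=-1, keepdims=True))
    return z / z.sum(axis=-1, keepdims=True)

def p_learning_step(phi, Q, pi, r, gamma, alpha, rng):
    """One stochastic update of a tabular softmax model.

    phi: logits, shape (S, A, S);   Q: Q-values, shape (S, A, G);
    pi:  policy probabilities, shape (S, G, A);   r: rewards, shape (S, G).
    """
    S, A, G = Q.shape
    # Sample (s, a, g) from d (assumed uniform here).
    s, a, g = rng.integers(S), rng.integers(A), rng.integers(G)
    p = softmax(phi[s, a])
    # Two independent next states from the current model, then next actions.
    s1, s2 = rng.choice(S, size=2, p=p)
    a1 = rng.choice(A, p=pi[s1, g])
    a2 = rng.choice(A, p=pi[s2, g])
    # TD errors for both samples.
    d1 = r[s1, g] + gamma * Q[s1, a1, g] - Q[s, a, g]
    d2 = r[s2, g] + gamma * Q[s2, a2, g] - Q[s, a, g]
    # Score of the second sample: grad of log softmax is onehot(s2) - p.
    score = -p
    score[s2] += 1.0
    g_hat = np.zeros_like(phi)
    g_hat[s, a] = d1 * d2 * score
    return phi - alpha * g_hat, g_hat

# Placeholder problem, for illustration only.
rng = np.random.default_rng(0)
S, A, G, gamma = 4, 2, 3, 0.9
Q = rng.uniform(size=(S, A, G))
pi = np.full((S, G, A), 1.0 / A)      # uniform policy
r = rng.uniform(size=(S, G))
phi = np.zeros((S, A, S))
for _ in range(100):
    phi, g_hat = p_learning_step(phi, Q, pi, r, gamma, alpha=0.05, rng=rng)
P_model = softmax(phi)                # extracted kernel, shape (S, A, S)
```

Using two independent next-state samples keeps the product of TD errors an unbiased estimate of the residual times the score-weighted TD error, which a single sample would not provide.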

Theory: when is $P$ uniquely determined by $Q$?

While $P$-learning enables the efficient extraction of WMs, the resulting kernel may not be determined uniquely (e.g. if $M$ is rank-deficient in the setting above). We provide conditions under which agents break this degeneracy when $Q$-values are exact, with tight error bounds when they are approximate. Four regimes emerge for stochastic/deterministic kernels and finite/continuous state spaces. Indicator rewards are defined by $r(s,g) = \delta_{sg}$ in finite $\cS$ or $r(s,g) = \mathbf{1}[\|s-g\| \leq \sigma]$ with $\sigma > 0$ in continuous $\cS\,$; Gaussian rewards are given by $r(s,g) = e^{-\|s-g\|^2/2\sigma^2}$.
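For concreteness, the indicator and Gaussian rewards defined above can be written as small functions. This is only a sketch: the function names are hypothetical, and states and goals are assumed to be integers (finite case) or NumPy-compatible vectors (continuous case).

```python
import numpy as np

def indicator_reward_finite(s, g):
    """r(s, g) = 1 if s == g else 0, for finite state spaces."""
    return float(s == g)

def indicator_reward_continuous(s, g, sigma):
    """r(s, g) = 1 if ||s - g|| <= sigma else 0, with sigma > 0."""
    dist = np.linalg.norm(np.asarray(s, dtype=float) - np.asarray(g, dtype=float))
    return float(dist <= sigma)

def gaussian_reward(s, g, sigma):
    """r(s, g) = exp(-||s - g||^2 / (2 sigma^2))."""
    sq_dist = np.sum((np.asarray(s, dtype=float) - np.asarray(g, dtype=float)) ** 2)
    return float(np.exp(-sq_dist / (2.0 * sigma ** 2)))
```

The Gaussian reward is strictly positive everywhere, whereas the indicator reward carries no information outside a ball of radius $\sigma$ around the goal.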

Conditions on $(\cG, r)$ under which environment dynamics are uniquely determined by $Q$-values.

State space $\cS$	Deterministic $P$	Stochastic $P$	Reward $r$
Finite $\cS$	$|\cG| \geq 1$	$|\cG| \geq |\cS|$	Generic
Finite $\cS$	$|\cG| \geq 1$	$|\cG| \geq |\cS|$	Gaussian
Finite $\cS$	$|\cG| \geq |\cS|$	$|\cG| \geq |\cS|$	Indicator
Continuous $\cS \subseteq \mathbb{R}^d$	if ^†‡ then $|\cG| \geq 2d+1$, else $|\cG|$ large (finite)	$\cG$ with non-empty interior^‡	Gaussian
Continuous $\cS \subseteq \mathbb{R}^d$	$\cG \supseteq \cS + B_\sigma(0)$	$\cG \supseteq \cS + B_\sigma(0)$^‡	Indicator

^† For real-analytic value functions. ^‡ For unconditional policies.

Taken together, our results show that methods like GCRL can bridge model-free and model-based RL, with the collection $(Q_g, \pi_g, r_g)_{g \in \cG}$ becoming informationally equivalent to the kernel $P$ when $\cG$ is sufficiently rich. Note that policies are most often induced by $Q$-functions via $\text{argmax}$ or $\text{softmax}$, so $(Q_g, r_g)$ is typically sufficient in practice. Moreover, our results make no assumptions on policy optimality: only $Q$-value accuracy is required.
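A toy finite example may help make the stochastic-kernel condition tangible. In the sketch below, all numbers are made up: the rows of each matrix play the role of $r(s', g) + \gamma V(s', g)$ for one goal, and `kernel_is_identifiable` is a hypothetical helper that simply checks full column rank.

```python
import numpy as np

def kernel_is_identifiable(M):
    """True when M (goals x next states) has full column rank, so M p = q has one solution."""
    return np.linalg.matrix_rank(M) == M.shape[1]

# One goal, three next states: two different kernels give the same Q-value.
M_one_goal = np.array([[1.0, 2.0, 3.0]])
p_det = np.array([0.0, 1.0, 0.0])        # always move to state 1
p_mix = np.array([0.5, 0.0, 0.5])        # coin flip between states 0 and 2
same_q = np.allclose(M_one_goal @ p_det, M_one_goal @ p_mix)

# Three goals with linearly independent rows: the kernel is pinned down.
M_three_goals = np.array([[1.0, 2.0, 3.0],
                          [3.0, 1.0, 2.0],
                          [2.0, 3.0, 1.0]])
p_recovered = np.linalg.solve(M_three_goals, M_three_goals @ p_mix)
```

Here `same_q` is true, so a single goal cannot separate the two kernels, while the three-goal system returns `p_mix` exactly.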

Experiment (`Reacher`): beyond theoretical guarantees

We train a goal-conditioned PQN agent on MuJoCo Reacher with only $|\mathcal{G}|=4$ sparse goals at the cardinal positions $\{(\pm 1, 0), (0, \pm 1)\}$, and extract its implicit WM with $P$-learning. We visualise the agent's return, $Q$-values and extracted WM over the course of training in the animation below. Despite imperfect $Q$-values (normalised MSE $=5.7\cdot10^{-1}$), the WM matches ground-truth dynamics with high fidelity (NMSE $=1.2\cdot10^{-4}$).

Top row: agent return on training goals, MSE of world model (WM) extracted via $P$-learning, and return of policies planned inside the WM on unseen goals. Bottom row: slice of the learnt $Q$-values, implicit WM, and policy rollouts, all shown vs ground truth.

Agent return. Discounted return of the Reacher agent on four training goals (fingertip targets at unit distance in the four cardinal directions), with the exponentially weighted average return in black.

Left: Discounted return of optimal $(R^\star)$ vs WM-trained $(R^{\text{WM}})$ policies on three unseen goals each. Mean $\pm$ SE over 10 seeds and 512 resets. WM-derived policies are quasi-optimal on every out-of-distribution task.

To further validate accuracy, we train policies inside the extracted WM for three unseen goals increasingly far from the training distribution: (i) far fingertip: reach fingertip position $(x, y) = (\sqrt{2}, \sqrt{2})$, which is twice as far as any training goal; (ii) target angle: reach joint configuration $\theta = (0, \pi)$, noting that distinct angle configurations can give the same fingertip position; and (iii) target velocity: reach velocity $\om = (2, 2)$, where $\omega$ lies in state dimensions that the training goals never involve. Despite the agent being trained exclusively on position-based goals, the WM-learnt policy is quasi-optimal on all three tasks, as shown in the bar plot, revealing implicit generalisation capabilities.

Finally, we test whether better agents have better implicit world models by sweeping over architecture capacity. We plot agent return, MSE of the extracted WM, and return of a WM-trained policy on unseen goals, against capacity in the figure below. The Spearman correlation coefficients between agent return and WM error, and agent return and WM-policy return on unseen goals, are $\rho = -0.98$ with 95% confidence interval $[-0.99, -0.95]$, and $\rho = +0.95$ $[+0.89, +0.97]$, suggesting that performance and implicit generalisation are tightly coupled.
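For readers unfamiliar with the statistic, Spearman's $\rho$ is the Pearson correlation of the ranks. The sketch below is a generic NumPy version for samples without ties; the two arrays are made-up placeholders and do not reproduce the paper's measurements or its confidence intervals.

```python
import numpy as np

def spearman_rho(x, y):
    """Spearman rank correlation for samples without ties."""
    rank_x = np.argsort(np.argsort(x))   # 0-based ranks of x
    rank_y = np.argsort(np.argsort(y))   # 0-based ranks of y
    return float(np.corrcoef(rank_x, rank_y)[0, 1])

# Placeholder values, for illustration only.
agent_return = np.array([0.10, 0.40, 0.35, 0.80, 0.90])
wm_error = np.array([0.50, 0.30, 0.20, 0.05, 0.01])
rho = spearman_rho(agent_return, wm_error)   # negative: higher return, lower error
```

Because only ranks enter, the statistic is insensitive to monotone rescalings of either quantity, such as plotting the error on a log scale.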

Top: agent return, WM error, and WM-policy return on unseen goals over training, for fixed width $W = 512$ and increasing $Q$-network depth $D$. Bottom: the same three metrics as heatmaps over all widths and depths at the end of training. More capacity $\Rightarrow$ better agent and better implicit WM.

Experiment (`MountainCar`): agents with different goals secretly agree

Left: Quiver plots of WMs extracted from two agents trained with position- and velocity-based goals vs ground truth, for $a = \text{right}$. Each arrow starts from a state $s$ and points to the predicted or true next-state $s' = P(s,a)$.

Digging deeper, we repeat the same experiment in MountainCar, and then train a second agent with 4 velocity-based goals. The resulting implicit WMs are near-identical (NMSE $= 1.7\cdot10^{-4}$), and ~42 times closer to each other than to the true kernel (NMSE $= 7.2\cdot10^{-3}$), as illustrated in the quiver plot to the right. This leads us to formulate an informal local-global hypothesis left for future exploration: value-based agents, trained to predict returns on a small number of local reward functions (i.e. defined on a subset of variables), tend to encode an accurate world model over all dependent environment variables.

Experiment (`FourRooms`): more stochasticity, more goals

To validate our theoretical results for stochastic environments, and assess the number of goals required in practice, we run experiments on three increasingly stochastic variants of the classic FourRooms gridworld. In each case we plot (left) the extracted world-model MSE as a function of training timestep, for increasing $|\cG|$. We also plot (right) the policy obtained by planning inside the extracted WM (solid) vs the optimal policy (dashed), for two out-of-distribution goals specified by reaching a given state without walking onto unsafe states.

Deterministic — $|\mathcal{G}| = 1$ suffices

Left: A single generic goal (reward function perturbed with uniform noise at initialisation) drives the extracted WM to zero error during training. Right: optimal (dashed) vs WM-derived (solid) policy on two out-of-distribution goals. They coincide exactly, in line with our theorem (deterministic $P$, finite $\cS$).
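The intuition for why one generic goal can suffice here is that, for a deterministic kernel, $Q(s,a,g) = r(s',g) + \gamma V(s',g)$ at the true next state $s'$, so distinct entries identify $s'$ by lookup. A minimal sketch of that lookup follows, with randomly drawn placeholder rewards and values standing in for a trained agent; `decode_next_state` is a hypothetical helper, not the paper's procedure.

```python
import numpy as np

def decode_next_state(q_sa, M_row):
    """Return the next state whose entry r(s', g) + gamma * V(s', g) matches Q(s, a, g)."""
    return int(np.argmin(np.abs(M_row - q_sa)))

rng = np.random.default_rng(0)
n_states, gamma = 6, 0.9
r = rng.uniform(size=n_states)     # placeholder generic reward r(s', g)
V = rng.uniform(size=n_states)     # placeholder values V(s', g)
M_row = r + gamma * V              # one entry per candidate next state

# If the true next state is k, the exact Q-value equals M_row[k].
decoded = [decode_next_state(M_row[k], M_row) for k in range(n_states)]
```

The lookup relies on the entries of `M_row` being pairwise distinct, which is what a generic (noise-perturbed) reward provides.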

Windy (local stochasticity) — $|\mathcal{G}| = 4$ suffices

The windy variant realises the intended action w.p. $\tfrac{1}{2}$ and rotates it 90° either way w.p. $\tfrac{1}{4}$ each. A single goal no longer suffices under stochastic dynamics, but with $|\mathcal{G}| = 4$ the WM-derived policies (solid) achieve returns within a few percent of the optimal policies (dashed) on OOD goals.
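The windy noise model can be written as a distribution over executed actions. The sketch below assumes four actions listed in clockwise order, so that a 90° rotation is a shift by one index; this ordering and the function name are illustrative choices.

```python
import numpy as np

ACTIONS = ("up", "right", "down", "left")   # clockwise order (assumed)

def windy_action_distribution(intended):
    """Probability of each executed action given the index of the intended one."""
    probs = np.zeros(len(ACTIONS))
    probs[intended] = 0.5                          # intended action
    probs[(intended + 1) % len(ACTIONS)] = 0.25    # rotated 90 degrees clockwise
    probs[(intended - 1) % len(ACTIONS)] = 0.25    # rotated 90 degrees anticlockwise
    return probs

probs_up = windy_action_distribution(ACTIONS.index("up"))
```

For the intended action `up`, this puts mass on `up`, `right` and `left` and none on `down`, so each state-action pair has at most three distinct successor cells.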

Teleporting (full stochasticity) — $|\mathcal{G}| = 20$ suffices

In the teleporting variant, four cells (marked T) re-emit the agent uniformly into the diagonally-opposite room. With only $|\mathcal{G}| = 20$ goals, well below the worst-case theoretical bound $|\mathcal{G}| \geq |\mathcal{S}| = 68$, the extracted WM is almost perfect, again producing quasi-optimal policies on OOD goals.

Finally, we plot the return on out-of-distribution goals as a function of the number of training goals $|\cG|$, for each variant of the environment. Surprisingly, the number of goals needed for quasi-optimal OOD generalisation grows gently, from $|\mathcal{G}| = 1$ (deterministic) to $4$ (local stochasticity) to $20$ (full stochasticity), well below the $|\mathcal{G}| \geq |\mathcal{S}| = 68$ that our worst-case bound requires for stochastic kernels.

Mean return $\pm$ SE (10 training seeds), for WM-trained $(R^\text{WM})$ vs optimal $(R^\star)$ policies, on two unseen goals, in each FourRooms variant, for varying number of training goals $|\cG|$ on the $x$-axis.

Takeaways

Model-free agents that learn $Q$-values implicitly carry, in decodable form, a model of the world.
$P$-learning provides a concrete algorithm for inverting the Bellman equation: $(Q, \pi, r) \mapsto P$.
These internalised world models are often highly accurate, even for agents trained on a handful of goals, inducing implicit generalisation capabilities via zero-shot planning for out-of-distribution goals — including over variables the training reward functions never directly depend on.
Our results take a first step in bridging model-based, model-free, and goal-conditioned RL: when model-free agents are trained on a sufficiently rich set of goals, they implicitly contain a unique and accurate world model. We hope this will inspire interpretability and auditing tools based on implicit world models, to help ensure that agents can efficiently be repurposed when trained using reward functions that are, inevitably, misspecified.

Citation

@misc{letcher2026inverting,
    title={Inverting the Bellman Equation: From $Q$-Values to World Models}, 
    author={Alistair Letcher and Mattie Fellows and Alexander D. Goldie and Jonathan Richens and Jakob N. Foerster and Oliver Richardson},
    year={2026},
    eprint={2606.21173},
    archivePrefix={arXiv},
    primaryClass={cs.LG},
    url={https://arxiv.org/abs/2606.21173}, 
}
