R2PO

Reflective Prompted Policy Optimization: Trajectory-Grounded Revision and Salience Bias

Rahaf Abu Hara, Vaibbhav Murarri, Claudio Zito
Laboratory of AI and Robotics Research (LAIRR)
Heriot-Watt University Dubai
rha4001@hw.ac.uk, vm81@hw.ac.uk, C.Zito@hw.ac.uk

Abstract

Existing LLM-based policy optimizers see only scalar rewards: they learn that a policy scored 0.45, but not whether the agent got stuck in a loop, fell into a hole on the third step, or succeeded on 19 of 20 rollouts and failed catastrophically on one.

We propose Reflective Prompted Policy Optimization (R2PO), a two-stage LLM framework for policy search over compact policy classes that augments scalar reward feedback with trajectory-level behavioral evidence. A Search-LLM acts as a global policy optimizer and proposes candidate policy parameters; the environment executes them; a Critic-LLM then inspects the resulting rollouts and proposes targeted parameter revisions grounded in observed states, actions, and rewards.

Across ten environments, ablations show that R2PO's gains arise from a design that explicitly separates global search from behavior-grounded revision and uses selection to filter high-variance edits. We further identify a dominant failure mode, salience bias: when presented with multiple rollouts, the Critic-LLM fixates on repairing a single failure even when most trajectories succeed. In a three-trajectory variant, this behavior explains 76.6% of regressions on CartPole. R2PO mitigates it through three design choices: aggregate rollout statistics, median-trajectory selection, and an explicit revision rule that preserves policies already performing well.

Using a relatively small open-weight 20B-parameter model, R2PO matches or exceeds every baseline's mean best reward across all ten environments, reaches near-optimal performance substantially earlier in training (e.g., near-maximum CartPole reward within ~500 episodes), and trains far more stably than both deep RL and prior LLM-based methods.

The R2PO Framework

R2PO uses two LLMs in distinct roles. The Search-LLM is the global policy optimizer: given the history of previously evaluated policies and their scalar mean rewards, it proposes where to move next in parameter space. The Critic-LLM is the trajectory-based reviser: given a calibrated summary of the behavior observed when the Search-LLM's proposal was executed, it diagnoses what went wrong and proposes a targeted revision. Both roles are served by the same underlying 20B model via two independent calls with role-specific prompts.

[Figure: R2PO framework overview]
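A minimal sketch of one R2PO iteration under this two-role design, assuming a generic chat-completion client (call_llm) and a user-supplied evaluate(params, k) callable; the prompts, the JSON output format, and the summarize helper are illustrative placeholders rather than the paper's implementation.

import json

def call_llm(prompt: str) -> str:
    """Placeholder for one call to the shared 20B model (any chat API)."""
    raise NotImplementedError

def summarize(trajectories):
    """Placeholder evidence builder; a fuller version is sketched under Stage 2."""
    returns = [t["return"] for t in trajectories]
    return {"mean": sum(returns) / len(returns),
            "min": min(returns), "max": max(returns)}

def r2po_step(history, evaluate, k=20):
    # Stage 1: the Search-LLM sees only (params, mean_reward) pairs.
    params = json.loads(call_llm(
        f"History of evaluated policies: {history}\n"
        "Propose the next parameter vector as a JSON list."))
    mean_r, trajectories = evaluate(params, k)

    # Stage 2: the Critic-LLM sees trajectory-level behavioral evidence.
    revised = json.loads(call_llm(
        f"Proposed params: {params}\n"
        f"Rollout evidence: {summarize(trajectories)}\n"
        "Propose a targeted revision as a JSON list."))
    revised_r, _ = evaluate(revised, k)

    # Keep-best selection: the revision survives only if it improves the mean.
    if revised_r > mean_r:
        params, mean_r = revised, revised_r
    history.append((params, mean_r))
    return params, mean_r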

Stage 1: Search-LLM Proposal

At each iteration the Search-LLM is prompted with the optimization task and a reward-only replay history of previously evaluated parameters and their mean returns. It proposes an initial parameter vector, which the environment evaluates over K independent rollouts, returning the mean reward and a set of recorded trajectories.
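One possible form of this evaluation step, assuming a Gymnasium environment and a user-supplied policy(params, obs) -> action mapping; the compact policy parameterization itself is environment-specific and not specified on this page.

import gymnasium as gym
import numpy as np

def evaluate(params, policy, env_id="CartPole-v1", n_rollouts=20, seed=0):
    """Run K independent rollouts; return the mean reward and recorded trajectories."""
    env = gym.make(env_id)
    returns, trajectories = [], []
    for i in range(n_rollouts):
        obs, _ = env.reset(seed=seed + i)
        steps, total, done = [], 0.0, False
        while not done:
            action = policy(params, obs)
            next_obs, reward, terminated, truncated, _ = env.step(action)
            steps.append((obs, action, float(reward)))  # record (s, a, r)
            total += float(reward)
            obs, done = next_obs, terminated or truncated
        returns.append(total)
        trajectories.append({"steps": steps, "return": total})
    env.close()
    return float(np.mean(returns)), trajectories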

Stage 2: Critic-LLM Reflection

The Critic-LLM receives a calibrated trajectory evidence package consisting of (i) the median trajectory reflecting typical policy behavior, (ii) aggregate rollout statistics (reward mean/min/max, episode length, success and failure rates), and (iii) a revision rule that preserves policies already performing well. It proposes a revised parameter vector, which is re-evaluated and kept only if it achieves higher reward.
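A sketch of how such an evidence package could be assembled from the trajectories recorded in Stage 1; the success threshold and the revision-rule wording here are illustrative assumptions, not the paper's exact prompt.

import numpy as np

# Illustrative paraphrase of the revision rule, not the paper's verbatim prompt.
REVISION_RULE = ("If most rollouts already succeed, preserve the current behavior: "
                 "change as few parameters as possible, and only slightly.")

def build_evidence(trajectories, success_return=475.0):
    # success_return is an assumed CartPole-style success threshold.
    returns = np.array([t["return"] for t in trajectories])
    lengths = np.array([len(t["steps"]) for t in trajectories])
    n = len(returns)
    successes = int((returns >= success_return).sum())
    # Median trajectory: the rollout whose return sits at the sample median,
    # so the Critic-LLM sees typical rather than worst-case behavior.
    median_idx = int(np.argsort(returns)[n // 2])
    return {
        "stats": {
            "reward_mean": float(returns.mean()),
            "reward_min": float(returns.min()),
            "reward_max": float(returns.max()),
            "mean_episode_length": float(lengths.mean()),
            "success_rate": f"{successes}/{n}",
            "failure_rate": f"{n - successes}/{n}",
        },
        "median_trajectory": trajectories[median_idx]["steps"],
        "revision_rule": REVISION_RULE,
    }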

Salience Bias

When shown multiple rollouts of a policy, the Critic-LLM systematically overweights the worst trajectory in its diagnosis and edits, even when most rollouts indicate good performance. We call this salience bias. On CartPole, 233 of 304 regressions (76.6%) in a three-trajectory setting meet the strict definition of a salience-problem regression. R2PO's evidence design — communicating failure frequency numerically via aggregate statistics rather than foregrounding a worst-case trace — mitigates this vulnerability while preserving the benefits of trajectory-grounded diagnosis.
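The strict definition of a salience-problem regression is not reproduced on this page, so the check below is only a plausible operationalization: flag a revision as salience-driven when it lowers mean reward even though a majority of the pre-revision rollouts already succeeded. Both the success criterion and the majority threshold are assumptions.

def is_salience_regression(old_returns, revised_mean,
                           success_return=475.0, majority=0.5):
    """Proxy check, not the paper's strict definition: the Critic-LLM
    edited a policy that succeeded on most rollouts, and made it worse."""
    old_mean = sum(old_returns) / len(old_returns)
    success_rate = sum(r >= success_return for r in old_returns) / len(old_returns)
    return revised_mean < old_mean and success_rate > majority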

Results

R2PO is evaluated on ten environments spanning discrete and continuous action spaces, as well as stochastic and deterministic dynamics: CartPole-v1, FrozenLake-v1, MountainCar-v0, MountainCarContinuous-v0, InvertedPendulum-v5, InvertedDoublePendulum-v5, Swimmer-v5, Maze, Nim, and Pong. All methods receive a matched budget of 200 LLM calls and 4,000 episodes per run; results are averaged over 10 independent runs.
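A sketch of the matched-budget harness this protocol implies, assuming a hypothetical method.step() interface that reports per-iteration LLM-call and episode consumption; only the budget numbers come from the text above.

def run_with_budget(method, max_llm_calls=200, max_episodes=4000):
    """Run one training run under the matched budget; return the best mean reward seen."""
    llm_calls = episodes = 0
    best = float("-inf")
    while llm_calls < max_llm_calls and episodes < max_episodes:
        report = method.step()  # one optimizer iteration (hypothetical interface)
        llm_calls += report["llm_calls"]
        episodes += report["episodes"]
        best = max(best, report["mean_reward"])
    return best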

Table 1 — Mean reward (± std), averaged across all training iterations. "Best SB3" reports the strongest Stable-Baselines3 deep-RL baseline per environment, with the algorithm in parentheses.

Environment             ProPS              ProPS+             Best SB3                R2PO
Nim                     −0.59 ± 0.21       −0.25 ± 0.07       0.01 ± 0.32 (A2C)       0.61 ± 0.03
Pong                    0.74 ± 0.63        1.22 ± 0.58        1.02 ± 0.74 (PPO)       2.51 ± 0.21
Swimmer                 89.22 ± 48.68      162.05 ± 66.07     44.60 ± 7.30 (TRPO)     260.35 ± 36.05
MountainCarContinuous   −23.45 ± 41.65     17.90 ± 37.48      82.33 ± 2.57 (SAC)      81.61 ± 5.18
MountainCar             −199.31 ± 2.18     −195.81 ± 3.73     −199.99 ± 0.02 (DQN)    −147.84 ± 7.11
InvertedDoublePendulum  79.44 ± 16.50      86.71 ± 17.70      86.04 ± 0.33 (TRPO)     158.51 ± 68.41
InvertedPendulum        234.14 ± 169.76    309.52 ± 247.92    24.35 ± 0.13 (TRPO)     756.08 ± 154.50
FrozenLake              0.05 ± 0.05        0.37 ± 0.06        0.02 ± 0.02 (TRPO)      0.62 ± 0.07
CartPole                258.09 ± 138.58    253.06 ± 133.35    216.92 ± 63.34 (TRPO)   474.67 ± 16.90
Maze                    −1.03 ± 0.18       0.76 ± 0.05        0.83 ± 0.13 (A2C)       0.83 ± 0.06

Table 2 — Mean best reward (± std): peak performance across all training iterations ("Best SB3" as in Table 1).

Environment             ProPS              ProPS+             Best SB3                R2PO
Nim                     0.75 ± 0.30        1.00 ± 0.00        0.88 ± 0.27 (A2C)       1.00 ± 0.00
Pong                    2.15 ± 0.95        2.78 ± 0.41        2.33 ± 0.78 (PPO)       3.00 ± 0.00
Swimmer                 208.52 ± 68.38     274.86 ± 38.81     67.89 ± 20.92 (TRPO)    294.57 ± 43.79
MountainCarContinuous   75.37 ± 30.57      98.70 ± 0.50       94.81 ± 0.42 (SAC)      98.75 ± 0.44
MountainCar             −191.84 ± 25.80    −150.21 ± 26.62    −197.47 ± 3.66 (DQN)    −111.04 ± 3.92
InvertedDoublePendulum  112.18 ± 20.94     128.81 ± 54.51     98.25 ± 2.56 (TRPO)     254.04 ± 232.39
InvertedPendulum        649.91 ± 432.17    657.88 ± 444.42    28.49 ± 0.71 (TRPO)     1000.00 ± 0.00
FrozenLake              0.24 ± 0.14        0.90 ± 0.04        0.12 ± 0.02 (TRPO)      0.93 ± 0.05
CartPole                427.74 ± 143.04    396.21 ± 155.21    490.76 ± 29.22 (TRPO)   500.00 ± 0.00
Maze                    −0.70 ± 0.88       0.97 ± 0.00        0.97 ± 0.00 (A2C)       0.97 ± 0.00

Qualitative Examples

Two representative R2PO revision episodes on CartPole. Example 1 shows a conservative one-parameter edit guided by aggregate statistics; Example 2 shows why selection remains essential: well-reasoned minimal edits can still fail. The changed parameters are noted beneath each pair of vectors.

Example 1: Conservative Repair (CartPole, R2PO) 490.15 → 493.70 (+3.55)

Stats: Mean 490.15, Min 436, Max 500, Success 19/20

Initial:  [6.0, 5.5, 6.0, 6.0, -1.0, 6.0, -0.5, 6.0, -2.0, -2.0]
Revised: [6.0, 6.0, 6.0, 6.0, -1.0, 6.0, -0.5, 6.0, -2.0, -2.0]
         (changed: params[1], 5.5 → 6.0)

Critic-LLM: "The average reward (490.15) and success rate (19/20) indicate the policy is already highly effective… I increased only params[1]."

Example 2: Honest Failure (CartPole, R2PO) 436.05 → 173.30 (−262.75)

Stats: Mean 436.05, Success 15/20, sporadic failures

Initial:  [6.0, 6.0, 6.0, 6.0, -1.0, 6.0, -0.5, 6.0, -1.8, -2.0]
Revised: [6.0, 6.0, 6.0, 6.0, -1.0, 6.0, -0.3, 6.0, -1.5, -2.0]
         (changed: params[6], −0.5 → −0.3; params[8], −1.8 → −1.5)

Critic-LLM: "The failures are sporadic; the median rollout is the same as with the current weights… The adjustments are small and target only the aspects linked to occasional failures."

Learning Curves

Mean reward (± standard deviation) over 10 independent runs across all ten environments. R2PO reaches strong performance earlier and maintains it more consistently than baselines.

[Figure: per-environment learning curves for CartPole-v1, Swimmer-v5, InvertedPendulum-v5, InvertedDoublePendulum-v5, MountainCar-v0, MountainCarContinuous-v0, FrozenLake-v1, Pong, Nim, and Maze]

Contributions

  • R2PO, a two-stage LLM policy-optimization framework that combines scalar reward search with trajectory-grounded behavioral diagnosis, treating rollouts as first-class in-context evidence.
  • A systematic ablation across ten environments over search budget, prompt design, two-stage architecture, selection, and trajectory evidence, showing that R2PO's gains require the combination of trajectory-grounded revision, two-stage design, and keep-best selection.
  • A mechanistic analysis identifying and quantifying salience bias — a dominant failure mode in which the Critic-LLM overweights vivid worst-case rollouts when multiple trajectories are available, accounting for 76.6% of regressions on CartPole in a three-trajectory setting.
  • Empirical validation using a 20B open-weight LLM: R2PO achieves the highest mean reward on 9 of 10 environments, matches or exceeds all baselines on mean best reward, reaches near-optimal performance substantially earlier in training, and trains more stably than both deep RL and prior LLM-based baselines.

BibTeX

@misc{hara2026reflectivepromptedpolicyoptimization,
      title={Reflective Prompted Policy Optimization: Trajectory-Grounded Revision and Salience Bias},
      author={Rahaf Abu Hara and Vaibbhav Murarri and Claudio Zito},
      year={2026},
      eprint={2605.08315},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2605.08315},
}