R2PO

Reflective Prompted Policy Optimization: Trajectory-Grounded Revision and Salience Bias

Rahaf Abu Hara, Vaibbhav Murarri, Claudio Zito
Laboratory of AI and Robotics Research (LAIRR)
Heriot-Watt University Dubai
rha4001@hw.ac.uk, vm81@hw.ac.uk, C.Zito@hw.ac.uk

Abstract

Existing LLM-based policy optimizers see only scalar rewards: they learn that a policy scored 0.45, but not whether the agent got stuck in a loop, fell into a hole on the third step, or succeeded on 19 of 20 rollouts and failed catastrophically on one.

We propose Reflective Prompted Policy Optimization (R2PO), a two-stage LLM framework for policy search over compact policy classes that augments scalar reward feedback with trajectory-level behavioral evidence. A Search-LLM acts as a global policy optimizer and proposes candidate policy parameters; the environment executes them; a Critic-LLM then inspects the resulting rollouts and proposes targeted parameter revisions grounded in observed states, actions, and rewards.

Across ten environments, ablations show that R2PO's gains arise from a design that explicitly separates global search from behavior-grounded revision and uses selection to filter high-variance edits. We further identify a dominant failure mode, salience bias: when presented with multiple rollouts, the Critic-LLM fixates on repairing a single failure even when most trajectories succeed. In a three-trajectory variant, this behavior explains 76.6% of regressions on CartPole. R2PO mitigates it through three design choices: aggregate rollout statistics, median-trajectory selection, and an explicit revision rule that preserves policies already performing well.

Using a relatively small open-weight 20B-parameter model, R2PO matches or exceeds every baseline's mean best reward across all ten environments, reaches near-optimal performance substantially earlier in training (e.g., near-maximum CartPole reward within ~500 episodes), and trains far more stably than both deep RL and prior LLM-based methods.

The R2PO Framework

R2PO uses two LLMs in distinct roles. The Search-LLM is the global policy optimizer: given the history of previously evaluated policies and their scalar mean rewards, it proposes where to move next in parameter space. The Critic-LLM is the trajectory-based reviser: given a calibrated summary of the behavior observed when the Search-LLM's proposal was executed, it diagnoses what went wrong and proposes a targeted revision. Both roles are served by the same underlying 20B model via two independent calls with role-specific prompts.

[Figure: R2PO framework overview]
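A minimal sketch of one R2PO iteration under this two-role design, assuming a generic chat-completion client (call_llm) and a user-supplied evaluate(params, k) callable; the prompts, the JSON output format, and the summarize helper are illustrative placeholders rather than the paper's implementation.

import json

def call_llm(prompt: str) -> str:
    """Placeholder for one call to the shared 20B model (any chat API)."""
    raise NotImplementedError

def summarize(trajectories):
    """Placeholder evidence builder; a fuller version is sketched under Stage 2."""
    returns = [t["return"] for t in trajectories]
    return {"mean": sum(returns) / len(returns),
            "min": min(returns), "max": max(returns)}

def r2po_step(history, evaluate, k=20):
    # Stage 1: the Search-LLM sees only (params, mean_reward) pairs.
    params = json.loads(call_llm(
        f"History of evaluated policies: {history}\n"
        "Propose the next parameter vector as a JSON list."))
    mean_r, trajectories = evaluate(params, k)

    # Stage 2: the Critic-LLM sees trajectory-level behavioral evidence.
    revised = json.loads(call_llm(
        f"Proposed params: {params}\n"
        f"Rollout evidence: {summarize(trajectories)}\n"
        "Propose a targeted revision as a JSON list."))
    revised_r, _ = evaluate(revised, k)

    # Keep-best selection: the revision survives only if it improves the mean.
    if revised_r > mean_r:
        params, mean_r = revised, revised_r
    history.append((params, mean_r))
    return params, mean_r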

Stage 1: Search-LLM Proposal

At each iteration the Search-LLM is prompted with the optimization task and a reward-only replay history of previously evaluated parameters and their mean returns. It proposes an initial parameter vector, which the environment evaluates over K independent rollouts, returning the mean reward and a set of recorded trajectories.
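One possible form of this evaluation step, assuming a Gymnasium environment and a user-supplied policy(params, obs) -> action mapping; the compact policy parameterization itself is environment-specific and not specified on this page.

import gymnasium as gym
import numpy as np

def evaluate(params, policy, env_id="CartPole-v1", n_rollouts=20, seed=0):
    """Run K independent rollouts; return the mean reward and recorded trajectories."""
    env = gym.make(env_id)
    returns, trajectories = [], []
    for i in range(n_rollouts):
        obs, _ = env.reset(seed=seed + i)
        steps, total, done = [], 0.0, False
        while not done:
            action = policy(params, obs)
            next_obs, reward, terminated, truncated, _ = env.step(action)
            steps.append((obs, action, float(reward)))  # record (s, a, r)
            total += float(reward)
            obs, done = next_obs, terminated or truncated
        returns.append(total)
        trajectories.append({"steps": steps, "return": total})
    env.close()
    return float(np.mean(returns)), trajectories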

Stage 2: Critic-LLM Reflection

The Critic-LLM receives a calibrated trajectory evidence package consisting of (i) the median trajectory reflecting typical policy behavior, (ii) aggregate rollout statistics (reward mean/min/max, episode length, success and failure rates), and (iii) a revision rule that preserves policies already performing well. It proposes a revised parameter vector, which is re-evaluated and kept only if it achieves higher reward.
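A sketch of how such an evidence package could be assembled from the trajectories recorded in Stage 1; the success threshold and the revision-rule wording here are illustrative assumptions, not the paper's exact prompt.

import numpy as np

# Illustrative paraphrase of the revision rule, not the paper's verbatim prompt.
REVISION_RULE = ("If most rollouts already succeed, preserve the current behavior: "
                 "change as few parameters as possible, and only slightly.")

def build_evidence(trajectories, success_return=475.0):
    # success_return is an assumed CartPole-style success threshold.
    returns = np.array([t["return"] for t in trajectories])
    lengths = np.array([len(t["steps"]) for t in trajectories])
    n = len(returns)
    successes = int((returns >= success_return).sum())
    # Median trajectory: the rollout whose return sits at the sample median,
    # so the Critic-LLM sees typical rather than worst-case behavior.
    median_idx = int(np.argsort(returns)[n // 2])
    return {
        "stats": {
            "reward_mean": float(returns.mean()),
            "reward_min": float(returns.min()),
            "reward_max": float(returns.max()),
            "mean_episode_length": float(lengths.mean()),
            "success_rate": f"{successes}/{n}",
            "failure_rate": f"{n - successes}/{n}",
        },
        "median_trajectory": trajectories[median_idx]["steps"],
        "revision_rule": REVISION_RULE,
    }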

Salience Bias

When shown multiple rollouts of a policy, the Critic-LLM systematically overweights the worst trajectory in its diagnosis and edits, even when most rollouts indicate good performance. We call this salience bias. On CartPole, 233 of 304 regressions (76.6%) in a three-trajectory setting meet the strict definition of a salience-problem regression. R2PO's evidence design — communicating failure frequency numerically via aggregate statistics rather than foregrounding a worst-case trace — mitigates this vulnerability while preserving the benefits of trajectory-grounded diagnosis.
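The strict definition of a salience-problem regression is not reproduced on this page, so the check below is only a plausible operationalization: flag a revision as salience-driven when it lowers mean reward even though a majority of the pre-revision rollouts already succeeded. Both the success criterion and the majority threshold are assumptions.

def is_salience_regression(old_returns, revised_mean,
                           success_return=475.0, majority=0.5):
    """Proxy check, not the paper's strict definition: the Critic-LLM
    edited a policy that succeeded on most rollouts, and made it worse."""
    old_mean = sum(old_returns) / len(old_returns)
    success_rate = sum(r >= success_return for r in old_returns) / len(old_returns)
    return revised_mean < old_mean and success_rate > majority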

Results

R2PO is evaluated on ten environments spanning discrete and continuous action spaces, as well as stochastic and deterministic dynamics: CartPole-v1, FrozenLake-v1, MountainCar-v0, MountainCarContinuous-v0, InvertedPendulum-v5, InvertedDoublePendulum-v5, Swimmer-v5, Maze, Nim, and Pong. All methods receive a matched budget of 200 LLM calls and 4,000 episodes per run; results are averaged over 10 independent runs.
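A sketch of the matched-budget harness this protocol implies, assuming a hypothetical method.step() interface that reports per-iteration LLM-call and episode consumption; only the budget numbers come from the text above.

def run_with_budget(method, max_llm_calls=200, max_episodes=4000):
    """Run one training run under the matched budget; return the best mean reward seen."""
    llm_calls = episodes = 0
    best = float("-inf")
    while llm_calls < max_llm_calls and episodes < max_episodes:
        report = method.step()  # one optimizer iteration (hypothetical interface)
        llm_calls += report["llm_calls"]
        episodes += report["episodes"]
        best = max(best, report["mean_reward"])
    return best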

Table 1 — Mean reward (± std), averaged across all training iterations. "Best SB3" reports the strongest Stable-Baselines3 deep-RL baseline per environment, with the algorithm in parentheses.

Environment             ProPS              ProPS+             Best SB3                R2PO
Nim                     −0.59 ± 0.21       −0.25 ± 0.07       0.01 ± 0.32 (A2C)       0.61 ± 0.03
Pong                    0.74 ± 0.63        1.22 ± 0.58        1.02 ± 0.74 (PPO)       2.51 ± 0.21
Swimmer                 89.22 ± 48.68      162.05 ± 66.07     44.60 ± 7.30 (TRPO)     260.35 ± 36.05
MountainCarContinuous   −23.45 ± 41.65     17.90 ± 37.48      82.33 ± 2.57 (SAC)      81.61 ± 5.18
MountainCar             −199.31 ± 2.18     −195.81 ± 3.73     −199.99 ± 0.02 (DQN)    −147.84 ± 7.11
InvertedDoublePendulum  79.44 ± 16.50      86.71 ± 17.70      86.04 ± 0.33 (TRPO)     158.51 ± 68.41
InvertedPendulum        234.14 ± 169.76    309.52 ± 247.92    24.35 ± 0.13 (TRPO)     756.08 ± 154.50
FrozenLake              0.05 ± 0.05        0.37 ± 0.06        0.02 ± 0.02 (TRPO)      0.62 ± 0.07
CartPole                258.09 ± 138.58    253.06 ± 133.35    216.92 ± 63.34 (TRPO)   474.67 ± 16.90
Maze                    −1.03 ± 0.18       0.76 ± 0.05        0.83 ± 0.13 (A2C)       0.83 ± 0.06

Table 2 — Mean best reward (± std): peak performance across all training iterations ("Best SB3" as in Table 1).

Environment             ProPS              ProPS+             Best SB3                R2PO
Nim                     0.75 ± 0.30        1.00 ± 0.00        0.88 ± 0.27 (A2C)       1.00 ± 0.00
Pong                    2.15 ± 0.95        2.78 ± 0.41        2.33 ± 0.78 (PPO)       3.00 ± 0.00
Swimmer                 208.52 ± 68.38     274.86 ± 38.81     67.89 ± 20.92 (TRPO)    294.57 ± 43.79
MountainCarContinuous   75.37 ± 30.57      98.70 ± 0.50       94.81 ± 0.42 (SAC)      98.75 ± 0.44
MountainCar             −191.84 ± 25.80    −150.21 ± 26.62    −197.47 ± 3.66 (DQN)    −111.04 ± 3.92
InvertedDoublePendulum  112.18 ± 20.94     128.81 ± 54.51     98.25 ± 2.56 (TRPO)     254.04 ± 232.39
InvertedPendulum        649.91 ± 432.17    657.88 ± 444.42    28.49 ± 0.71 (TRPO)     1000.00 ± 0.00
FrozenLake              0.24 ± 0.14        0.90 ± 0.04        0.12 ± 0.02 (TRPO)      0.93 ± 0.05
CartPole                427.74 ± 143.04    396.21 ± 155.21    490.76 ± 29.22 (TRPO)   500.00 ± 0.00
Maze                    −0.70 ± 0.88       0.97 ± 0.00        0.97 ± 0.00 (A2C)       0.97 ± 0.00

Qualitative Examples

Two representative R2PO revision episodes on CartPole. Example 1 shows a conservative one-parameter edit guided by aggregate statistics; Example 2 shows why selection remains essential: well-reasoned minimal edits can still fail. The changed parameters are noted beneath each pair of vectors.

Example 1: Conservative Repair (CartPole, R2PO) 490.15 → 493.70 (+3.55)

Stats: Mean 490.15, Min 436, Max 500, Success 19/20

Initial:  [6.0, 5.5, 6.0, 6.0, -1.0, 6.0, -0.5, 6.0, -2.0, -2.0]
Revised: [6.0, 6.0, 6.0, 6.0, -1.0, 6.0, -0.5, 6.0, -2.0, -2.0]
         (changed: params[1], 5.5 → 6.0)

Critic-LLM: "The average reward (490.15) and success rate (19/20) indicate the policy is already highly effective… I increased only params[1]."

Example 2: Honest Failure (CartPole, R2PO) 436.05 → 173.30 (−262.75)

Stats: Mean 436.05, Success 15/20, sporadic failures

Initial:  [6.0, 6.0, 6.0, 6.0, -1.0, 6.0, -0.5, 6.0, -1.8, -2.0]
Revised: [6.0, 6.0, 6.0, 6.0, -1.0, 6.0, -0.3, 6.0, -1.5, -2.0]
         (changed: params[6], −0.5 → −0.3; params[8], −1.8 → −1.5)

Critic-LLM: "The failures are sporadic; the median rollout is the same as with the current weights… The adjustments are small and target only the aspects linked to occasional failures."

Learning Curves

Mean reward (± standard deviation) over 10 independent runs across all ten environments. R2PO reaches strong performance earlier and maintains it more consistently than baselines.

[Figure: per-environment learning curves for CartPole-v1, Swimmer-v5, InvertedPendulum-v5, InvertedDoublePendulum-v5, MountainCar-v0, MountainCarContinuous-v0, FrozenLake-v1, Pong, Nim, and Maze]

Contributions

  • R2PO, a two-stage LLM policy-optimization framework that combines scalar reward search with trajectory-grounded behavioral diagnosis, treating rollouts as first-class in-context evidence.
  • A systematic ablation across ten environments over search budget, prompt design, two-stage architecture, selection, and trajectory evidence, showing that R2PO's gains require the combination of trajectory-grounded revision, two-stage design, and keep-best selection.
  • A mechanistic analysis identifying and quantifying salience bias — a dominant failure mode in which the Critic-LLM overweights vivid worst-case rollouts when multiple trajectories are available, accounting for 76.6% of regressions on CartPole in a three-trajectory setting.
  • Empirical validation using a 20B open-weight LLM: R2PO achieves the highest mean reward on 9 of 10 environments, matches or exceeds all baselines on mean best reward, reaches near-optimal performance substantially earlier in training, and trains more stably than both deep RL and prior LLM-based baselines.

BibTeX

@misc{hara2026reflectivepromptedpolicyoptimization,
      title={Reflective Prompted Policy Optimization: Trajectory-Grounded Revision and Salience Bias},
      author={Rahaf Abu Hara and Vaibbhav Murarri and Claudio Zito},
      year={2026},
      eprint={2605.08315},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2605.08315},
}