POISE · Human Oriented Language Intelligence Lab (HOLI) · Graduate School of Data Science, Seoul National University

arXiv preprint · cs.LG · 2026

Your Language Model is Its Own Critic

Reinforcement Learning with Value Estimation from Actor’s Internal States

Yunho Choi*, Jongwon Lim*, Woojin Ahn, Minjae Oh, Jeonghoon Shim, Yohan Jo

*Equal contribution. Corresponding author.

TL;DR

POISE asks how the actor's internal representations can be folded back into RL training: it turns hidden states from the model's own generation process into a baseline function for RL updates, without a separate critic or many extra samples.

What is this paper about?

Reinforcement learning with verifiable rewards trains language models from outcome feedback, such as whether a final math answer is correct. To turn those raw rewards into useful policy-gradient updates, RL methods usually subtract a baseline, so the update reflects whether a response did better or worse than expected for the prompt. Existing methods estimate that baseline either with a separate critic model or with multiple rollouts from the same prompt. POISE studies how to use the actor's own hidden states for that baseline, so the training loop can use information the model already computed while generating its answer.
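
In standard notation (ours, not taken verbatim from the paper), the baselined policy-gradient estimator for a prompt x and a sampled response y is

\nabla_\theta J(\theta) = \mathbb{E}_{x,\ y \sim \pi_\theta(\cdot \mid x)} \big[ \left( R(x, y) - b(x) \right) \nabla_\theta \log \pi_\theta(y \mid x) \big]

Any baseline b(x) that does not depend on the sampled response leaves this gradient unbiased and only reduces its variance. Critic methods learn b(x) with a separate value network, GRPO-style methods estimate it from the mean reward of several same-prompt rollouts, and POISE reads it off the actor's own forward pass.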

Introduction

Why internal value estimation matters

A good baseline function converts raw rewards into advantages by comparing each response with the reward expected for its prompt. POISE replaces expensive baselines with a lightweight estimator trained on hidden states and entropy statistics that the actor already computes during generation.

No separate critic model

Instead of training another language-model-sized network, POISE uses the actor's internal states.

Less rollout overhead

POISE avoids spending many same-prompt samples just to estimate a group-relative baseline.

Stable online learning

The estimator is updated as the policy changes, using current and recent rollouts.
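
As a concrete picture of what training on "current and recent rollouts" could look like, here is a minimal sketch of an online update with a rolling buffer. The buffer size, the refit cadence, and the use of a ridge regressor are our assumptions for illustration, not details confirmed by the paper.

from collections import deque

import numpy as np
from sklearn.linear_model import Ridge

BUFFER = deque(maxlen=2048)  # rollouts kept from the current and recent policies (size is illustrative)
probe = Ridge(alpha=1.0)     # lightweight linear value probe

def record_rollout(features: np.ndarray, reward: float) -> None:
    """Store one rollout's pooled internal-state features and its verifier reward."""
    BUFFER.append((features, reward))

def refit_probe() -> None:
    """Refit on current + recent rollouts so the probe tracks the moving policy."""
    X = np.stack([f for f, _ in BUFFER])
    y = np.array([r for _, r in BUFFER])
    probe.fit(X, y)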

Preliminary

Internal states carry a usable value signal

Before using internal states inside RL training, POISE first tests whether they can support prompt-level value estimation. On held-out DAPO-Math rollouts, a lightweight probe trained on hidden states and entropy is compared against a separately trained policy-scale critic.
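
A minimal sketch of how such a probe can be evaluated, assuming the pooled hidden-state and entropy features and the empirical Avg@8 pass rates are already available as arrays (the feature construction and Ridge probe are our assumptions; the paper's exact probe may differ):

import numpy as np
from scipy.stats import pearsonr
from sklearn.linear_model import Ridge

def evaluate_probe(X_train, y_train, X_test, y_test):
    """Fit a linear probe on pooled hidden-state + entropy features and score it
    against empirical Avg@8 pass rates on held-out prompts."""
    probe = Ridge(alpha=1.0).fit(X_train, y_train)
    preds = probe.predict(X_test)
    r, _ = pearsonr(preds, y_test)        # Pearson correlation with held-out pass rates
    mae = np.abs(preds - y_test).mean()   # mean absolute error
    return r, mae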

[Figure: Preliminary value-prediction experiment comparing the POISE probe with a separately trained critic. Predictions are compared against empirical Avg@8 pass rates on held-out prompts.]

Internal states are predictive

The internal-state probe reaches Pearson r = 0.870 and MAE = 0.141 on the preliminary benchmark, indicating that verifier-reward information is accessible from the actor's own forward-pass signals.

Competitive with a large critic

The comparison critic, finetuned from Qwen3-4B, reaches only r = 0.676 and MAE = 0.262 in the same setting. POISE therefore starts from a simple observation: the actor already carries a useful value signal.

Method

Cross-rollout value estimation

A generated answer should not directly create its own baseline, because that can bias the policy gradient. POISE solves this with a simple paired-rollout trick.

[Figure: POISE cross-rollout algorithm. The two responses are independent samples for the same prompt; each uses the other response's internal-state features to estimate its baseline.]
1. Generate two answers. The old policy samples two independent responses for each prompt.
2. Read internal signals. POISE collects hidden-state pools and token-entropy statistics from each response.
3. Swap predicted values. Each response subtracts a value estimated from the other response, preserving independence (sketched in code below).
4. Update the estimator. A small probe is trained online so its predictions follow the changing actor.
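
Putting the four steps together, a minimal sketch of the advantage computation (the probe interface and feature shapes are our assumptions; the paper's exact estimator may differ):

import numpy as np

def cross_rollout_advantages(reward_a, reward_b, feats_a, feats_b, probe):
    """Baseline each response with a value predicted from the OTHER response's
    internal-state features, so no response baselines itself."""
    base_a = probe.predict(feats_b.reshape(1, -1))[0]  # baseline for A from B's features
    base_b = probe.predict(feats_a.reshape(1, -1))[0]  # baseline for B from A's features
    return reward_a - base_a, reward_b - base_b

The swap in step 3 is the key design choice: because the two rollouts are sampled independently, a baseline computed from the other rollout's features does not depend on the response being updated, which is what keeps the policy gradient unbiased.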

Results

Comparable accuracy with less wall-clock time

Across AMC23/24, AIME24/25/26, HMMT25, and BRUMO25, Avg@32 estimates the pass rate from 32 sampled answers per problem. Against DAPO, a state-of-the-art GRPO-based RLVR method for mathematical reasoning, POISE reaches comparable accuracy without DAPO's rollout-heavy baseline estimation.
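
For concreteness, Avg@k is just the mean pass rate over k sampled answers per problem (a standard definition; the sketch below is ours):

def avg_at_k(correct: list[bool]) -> float:
    """Mean pass rate over k sampled answers to one problem (here k = 32)."""
    return sum(correct) / len(correct)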

Benchmark suite

The reported math suite spans olympiad-style reasoning problems from AMC, AIME, HMMT, and BRUMO. DAPO avoids a separate critic by using group rollouts; POISE attacks the same baseline-estimation bottleneck with the actor's internal states instead.


Accuracy across the suite

The average is not meant to hide dataset-level variation; it summarizes whether POISE reaches the same regime as the SOTA DAPO baseline across the seven reported math tasks.

Qwen3-4B (Avg@32):                      POISE 0.500 · DAPO 0.508
DeepSeek-R1-Distill-Qwen-1.5B (Avg@32): POISE 0.303 · DAPO 0.296

Training cost

DAPO is the harder comparison because it is already a strong GRPO-based RLVR method. POISE aims for comparable accuracy while spending less wall-clock time on baseline estimation.

Qwen3-4B (2x B200, wall-clock):              POISE 36 h · DAPO 49 h
DeepSeek-R1-Distill-Qwen-1.5B (wall-clock):  POISE 18 h · DAPO 24 h
[Figure: Benchmark table and training dynamics: reported benchmark accuracy, wall-clock curves, gradient norms, and estimator calibration.]

Analysis

What the estimator is actually learning

After showing that the probe is reliable enough to use, the paper analyzes why it works: whether it tracks a changing policy, generalizes beyond math, and depends on simple, linearly accessible internal signals.

Tracks policy drift

Online analyses evaluate the estimator every 10 training steps against empirical Avg@8 values from the current actor checkpoint, testing whether it remains useful as the policy changes.

Works beyond math

The same internal-state idea is tested on math, coding, tool-calling, and instruction-following RLVR tasks, suggesting the signal is not just a math benchmark artifact.

Mostly linear signal

Probe ablations show that a linear ridge probe is competitive with larger MLP probes, supporting the claim that value-relevant information is readily accessible.
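
A sketch of the drift-tracking check described above; the sampler and featurizer names (sample_responses, featurize) and the response objects are hypothetical, while the 10-step cadence and Avg@8 target follow the text:

import numpy as np
from scipy.stats import pearsonr

def drift_check(probe, prompts, sample_responses, featurize, k=8):
    """Every 10 training steps: compare probe predictions with empirical Avg@k
    pass rates computed from the CURRENT actor checkpoint."""
    preds, empirical = [], []
    for x in prompts:
        rollouts = sample_responses(x, n=k)       # k fresh samples per prompt
        empirical.append(np.mean([r.correct for r in rollouts]))
        feats = featurize(rollouts[0])            # one rollout's internal-state features
        preds.append(probe.predict(feats.reshape(1, -1))[0])
    r, _ = pearsonr(preds, empirical)
    return r  # stays high only if the estimator tracks the moving policy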

Pearson correlation

Higher is better. Qwen3-4B value prediction across RLVR domains.

Math (DAPO-Math):                 Probe 0.870 · Critic 0.676
Math (DeepScaleR):                Probe 0.609 · Critic 0.384
Tool use (ToolDial):              Probe 0.840 · Critic 0.440
Instruction following (IF-RLVR):  Probe 0.642 · Critic 0.150

Mean absolute error

Lower is better.

Math (DAPO-Math):                 Probe 0.141 · Critic 0.262
Math (DeepScaleR):                Probe 0.231 · Critic 0.393
Tool use (ToolDial):              Probe 0.188 · Critic 0.303
Instruction following (IF-RLVR):  Probe 0.195 · Critic 0.350
Bottom line. The preliminary section establishes that the probe is accurate enough to motivate the method. The analysis section then asks broader questions: which internal signals make that possible, whether the estimator survives online policy drift, and whether the same idea works outside the main math-training experiment.

BibTeX

Citation

Citation metadata is based on the arXiv preprint record.

@misc{choi2026poise,
  title = {Your Language Model is Its Own Critic: Reinforcement Learning with Value Estimation from Actor's Internal States},
  author = {Yunho Choi and Jongwon Lim and Woojin Ahn and Minjae Oh and Jeonghoon Shim and Yohan Jo},
  year = {2026},
  eprint = {2605.07579},
  archivePrefix = {arXiv},
  primaryClass = {cs.LG},
  note = {Preprint}
}