POISE · Human Oriented Language Intelligence Lab (HOLI) · Graduate School of Data Science, Seoul National University

arXiv preprint · cs.LG · 2026

Your Language Model is Its Own Critic

Reinforcement Learning with Value Estimation from Actor’s Internal States

Yunho Choi*, Jongwon Lim*, Woojin Ahn, Minjae Oh, Jeonghoon Shim, Yohan Jo

*Equal contribution. Corresponding author.

TL;DR

POISE asks how the actor's internal representations can be folded back into RL training: it turns hidden states from the model's own generation process into a baseline function for RL updates, without a separate critic or many extra samples.

What is this paper about?

Reinforcement learning with verifiable rewards trains language models from outcome feedback, such as whether a final math answer is correct. To turn those raw rewards into useful policy-gradient updates, RL methods usually subtract a baseline, so the update reflects whether a response did better or worse than expected for the prompt. Existing methods estimate that baseline either with a separate critic model or with multiple rollouts from the same prompt. POISE studies how to use the actor's own hidden states for that baseline, so the training loop can use information the model already computed while generating its answer.
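
In standard notation (ours, not taken verbatim from the paper), the baselined policy-gradient estimator for a prompt x and a sampled response y is

\nabla_\theta J(\theta) = \mathbb{E}_{x,\ y \sim \pi_\theta(\cdot \mid x)} \big[ \left( R(x, y) - b(x) \right) \nabla_\theta \log \pi_\theta(y \mid x) \big]

Any baseline b(x) that does not depend on the sampled response leaves this gradient unbiased and only reduces its variance. Critic methods learn b(x) with a separate value network, GRPO-style methods estimate it from the mean reward of several same-prompt rollouts, and POISE reads it off the actor's own forward pass.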

Introduction

Why internal value estimation matters

A good baseline function converts raw rewards into advantages by comparing each response with the reward expected for its prompt. POISE replaces expensive baselines with a lightweight estimator trained on hidden states and entropy statistics that the actor already computes during generation.

No separate critic model

Instead of training another language-model-sized network, POISE uses the actor's internal states.

Less rollout overhead

POISE avoids spending many same-prompt samples just to estimate a group-relative baseline.

Stable online learning

The estimator is updated as the policy changes, using current and recent rollouts.
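
As a concrete picture of what training on "current and recent rollouts" could look like, here is a minimal sketch of an online update with a rolling buffer. The buffer size, the refit cadence, and the use of a ridge regressor are our assumptions for illustration, not details confirmed by the paper.

from collections import deque

import numpy as np
from sklearn.linear_model import Ridge

BUFFER = deque(maxlen=2048)  # rollouts kept from the current and recent policies (size is illustrative)
probe = Ridge(alpha=1.0)     # lightweight linear value probe

def record_rollout(features: np.ndarray, reward: float) -> None:
    """Store one rollout's pooled internal-state features and its verifier reward."""
    BUFFER.append((features, reward))

def refit_probe() -> None:
    """Refit on current + recent rollouts so the probe tracks the moving policy."""
    X = np.stack([f for f, _ in BUFFER])
    y = np.array([r for _, r in BUFFER])
    probe.fit(X, y)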

Preliminary

Internal states carry a usable value signal

Before using internal states inside RL training, POISE first tests whether they can support prompt-level value estimation. On held-out DAPO-Math rollouts, a lightweight probe trained on hidden states and entropy is compared against a separately trained policy-scale critic.
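
A minimal sketch of how such a probe can be evaluated, assuming the pooled hidden-state and entropy features and the empirical Avg@8 pass rates are already available as arrays (the feature construction and Ridge probe are our assumptions; the paper's exact probe may differ):

import numpy as np
from scipy.stats import pearsonr
from sklearn.linear_model import Ridge

def evaluate_probe(X_train, y_train, X_test, y_test):
    """Fit a linear probe on pooled hidden-state + entropy features and score it
    against empirical Avg@8 pass rates on held-out prompts."""
    probe = Ridge(alpha=1.0).fit(X_train, y_train)
    preds = probe.predict(X_test)
    r, _ = pearsonr(preds, y_test)        # Pearson correlation with held-out pass rates
    mae = np.abs(preds - y_test).mean()   # mean absolute error
    return r, mae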

[Figure: Preliminary value-prediction experiment comparing the POISE probe with a separately trained critic. Predictions are compared against empirical Avg@8 pass rates on held-out prompts.]

Internal states are predictive

The internal-state probe reaches Pearson r = 0.870 and MAE = 0.141 on the preliminary benchmark, indicating that verifier-reward information is accessible from the actor's own forward-pass signals.

Competitive with a large critic

The comparison critic, finetuned from Qwen3-4B, reaches only r = 0.676 and MAE = 0.262 in the same setting. POISE therefore starts from a simple observation: the actor already carries a useful value signal.

Method

Cross-rollout value estimation

A generated answer should not directly create its own baseline, because that can bias the policy gradient. POISE solves this with a simple paired-rollout trick.

[Figure: POISE cross-rollout algorithm. The two responses are independent samples for the same prompt; each uses the other response's internal-state features to estimate its baseline.]
1. Generate two answers. The old policy samples two independent responses for each prompt.
2. Read internal signals. POISE collects hidden-state pools and token-entropy statistics from each response.
3. Swap predicted values. Each response subtracts a value estimated from the other response, preserving independence (sketched in code below).
4. Update the estimator. A small probe is trained online so its predictions follow the changing actor.
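
Putting the four steps together, a minimal sketch of the advantage computation (the probe interface and feature shapes are our assumptions; the paper's exact estimator may differ):

import numpy as np

def cross_rollout_advantages(reward_a, reward_b, feats_a, feats_b, probe):
    """Baseline each response with a value predicted from the OTHER response's
    internal-state features, so no response baselines itself."""
    base_a = probe.predict(feats_b.reshape(1, -1))[0]  # baseline for A from B's features
    base_b = probe.predict(feats_a.reshape(1, -1))[0]  # baseline for B from A's features
    return reward_a - base_a, reward_b - base_b

The swap in step 3 is the key design choice: because the two rollouts are sampled independently, a baseline computed from the other rollout's features does not depend on the response being updated, which is what keeps the policy gradient unbiased.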

Results

Comparable accuracy with less wall-clock time

Across AMC23/24, AIME24/25/26, HMMT25, and BRUMO25, Avg@32 estimates the pass rate from 32 sampled answers per problem. Against DAPO, a state-of-the-art GRPO-based RLVR method for mathematical reasoning, POISE reaches comparable accuracy without DAPO's rollout-heavy baseline estimation.
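
For concreteness, Avg@k is just the mean pass rate over k sampled answers per problem (a standard definition; the sketch below is ours):

def avg_at_k(correct: list[bool]) -> float:
    """Mean pass rate over k sampled answers to one problem (here k = 32)."""
    return sum(correct) / len(correct)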

Benchmark suite

The reported math suite spans olympiad-style reasoning problems from AMC, AIME, HMMT, and BRUMO. DAPO avoids a separate critic by using group rollouts; POISE attacks the same baseline-estimation bottleneck with the actor's internal states instead.


Accuracy across the suite

The average is not meant to hide dataset-level variation; it summarizes whether POISE reaches the same regime as the SOTA DAPO baseline across the seven reported math tasks.

Qwen3-4B (Avg@32):                      POISE 0.500 · DAPO 0.508
DeepSeek-R1-Distill-Qwen-1.5B (Avg@32): POISE 0.303 · DAPO 0.296

Training cost

DAPO is the harder comparison because it is already a strong GRPO-based RLVR method. POISE aims for comparable accuracy while spending less wall-clock time on baseline estimation.

Qwen3-4B (2x B200, wall-clock):              POISE 36 h · DAPO 49 h
DeepSeek-R1-Distill-Qwen-1.5B (wall-clock):  POISE 18 h · DAPO 24 h
[Figure: Benchmark table and training dynamics: reported benchmark accuracy, wall-clock curves, gradient norms, and estimator calibration.]

Analysis

What the estimator is actually learning

After showing that the probe is reliable enough to use, the paper analyzes why it works: whether it tracks a changing policy, generalizes beyond math, and depends on simple, linearly accessible internal signals.

Tracks policy drift

Online analyses evaluate the estimator every 10 training steps against empirical Avg@8 values from the current actor checkpoint, testing whether it remains useful as the policy changes.

Works beyond math

The same internal-state idea is tested on math, coding, tool-calling, and instruction-following RLVR tasks, suggesting the signal is not just a math benchmark artifact.

Mostly linear signal

Probe ablations show that a linear ridge probe is competitive with larger MLP probes, supporting the claim that value-relevant information is readily accessible.
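
A sketch of the drift-tracking check described above; the sampler and featurizer names (sample_responses, featurize) and the response objects are hypothetical, while the 10-step cadence and Avg@8 target follow the text:

import numpy as np
from scipy.stats import pearsonr

def drift_check(probe, prompts, sample_responses, featurize, k=8):
    """Every 10 training steps: compare probe predictions with empirical Avg@k
    pass rates computed from the CURRENT actor checkpoint."""
    preds, empirical = [], []
    for x in prompts:
        rollouts = sample_responses(x, n=k)       # k fresh samples per prompt
        empirical.append(np.mean([r.correct for r in rollouts]))
        feats = featurize(rollouts[0])            # one rollout's internal-state features
        preds.append(probe.predict(feats.reshape(1, -1))[0])
    r, _ = pearsonr(preds, empirical)
    return r  # stays high only if the estimator tracks the moving policy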

Pearson correlation

Higher is better. Qwen3-4B value prediction across RLVR domains.

Math (DAPO-Math):                 Probe 0.870 · Critic 0.676
Math (DeepScaleR):                Probe 0.609 · Critic 0.384
Tool use (ToolDial):              Probe 0.840 · Critic 0.440
Instruction following (IF-RLVR):  Probe 0.642 · Critic 0.150

Mean absolute error

Lower is better.

Math (DAPO-Math):                 Probe 0.141 · Critic 0.262
Math (DeepScaleR):                Probe 0.231 · Critic 0.393
Tool use (ToolDial):              Probe 0.188 · Critic 0.303
Instruction following (IF-RLVR):  Probe 0.195 · Critic 0.350
Bottom line. The preliminary section establishes that the probe is accurate enough to motivate the method. The analysis section then asks broader questions: which internal signals make that possible, whether the estimator survives online policy drift, and whether the same idea works outside the main math-training experiment.

BibTeX

Citation

Citation metadata is based on the arXiv preprint record.

@misc{choi2026poise,
  title = {Your Language Model is Its Own Critic: Reinforcement Learning with Value Estimation from Actor's Internal States},
  author = {Yunho Choi and Jongwon Lim and Woojin Ahn and Minjae Oh and Jeonghoon Shim and Yohan Jo},
  year = {2026},
  eprint = {2605.07579},
  archivePrefix = {arXiv},
  primaryClass = {cs.LG},
  note = {Preprint}
}