Human Oriented Language Intelligence Lab (HOLI), Graduate School of Data Science, Seoul National University

ICML 2026 Regular Paper

Dual Mechanisms of Value Expression

Intrinsic vs. Prompted Values in Large Language Models

Jongwook Han*, Jongwon Lim*, Injin Kong, Yohan Jo

*Equal contribution. Corresponding author.

TL;DR

This paper asks whether values that emerge from a model's learned behavior and values requested by a prompt use the same internal mechanisms. The two mechanisms partially overlap, but the components unique to intrinsic expression support more natural and diverse value expression, while the components unique to prompted expression act more like a direct instruction-following control channel.

What is this paper about?

A language model can express a value even when the prompt does not explicitly ask for one: for example, its answer may naturally favor benevolence, security, or tradition. A model can also express a value because the prompt directly asks it to. This paper separates these two cases as intrinsic value expression and prompted value expression, then asks whether they rely on the same internal machinery. The analysis uses activation directions and MLP neurons to compare where the two mechanisms overlap and where they split apart.

Pipeline for extracting intrinsic and prompted value vectors
The extraction pipeline generates intrinsic and prompted responses, labels whether a target value is expressed, and derives value vectors from differences in model activations.
Shared components: Intrinsic and prompted mechanisms partially overlap in vectors and neurons.
Intrinsic is diverse: Intrinsic-unique components broaden vocabulary and response variation.
Prompted is steerable: Prompted-unique components produce stronger value steering.
Compliance channel: Prompted components also affect non-value instruction-following behavior.

Introduction

Do prompted values reuse intrinsic mechanisms?

Prompting is a common way to make language models express particular values, but that does not mean the prompted expression uses the same mechanism as the model's learned value tendencies. This paper compares intrinsic and prompted expression at the level of activation directions and MLP neurons.

Vector-level view

Value vectors capture activation directions associated with expressing each Schwartz value.

Neuron-level view

Value neurons identify MLP components whose output directions contribute to those value directions.

Behavioral view

Steering experiments test whether shared and unique components produce different behavior.

Method

Vectors first, neurons second

The method extracts value directions from residual stream activations and then attributes those directions to MLP neurons, separating shared and mechanism-specific components.

Geometric interpretation of shared and unique value neurons
Neurons are projected into the subspace spanned by intrinsic and prompted value vectors.
1. Generate responses

Collect intrinsic responses without value prompts and prompted responses with value-targeting prompts.

2. Label value expression

Classify whether each response expresses the target value, producing expressed and unexpressed sets.

3. Extract value vectors

Compute difference-in-means directions from residual stream activations.

4. Identify value neurons

Classify neurons by alignment with shared, intrinsic-unique, and prompted-unique axes.
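
Steps 3 and 4 can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the cosine `threshold`, and the exact construction of the three axes (shared axis as the bisector of the two unit value vectors, unique axes as the part of each vector orthogonal to the other) are hypothetical choices made for the example.

```python
import numpy as np

def difference_in_means(expressed, unexpressed):
    """Value vector: mean activation over expressed responses minus
    mean activation over unexpressed responses. Inputs are (n, d_model)."""
    return expressed.mean(axis=0) - unexpressed.mean(axis=0)

def unit(v):
    """Scale a vector to unit length."""
    return v / np.linalg.norm(v)

def value_axes(v_intrinsic, v_prompted):
    """ASSUMED decomposition (not taken from the paper):
    shared = bisector of the two unit vectors,
    unique = component of each vector orthogonal to the other.
    Requires the two vectors to be neither parallel nor antiparallel."""
    u_i, u_p = unit(v_intrinsic), unit(v_prompted)
    return {
        "shared": unit(u_i + u_p),
        "intrinsic_unique": unit(u_i - (u_i @ u_p) * u_p),
        "prompted_unique": unit(u_p - (u_p @ u_i) * u_i),
    }

def classify_neurons(w_out, axes, threshold=0.2):
    """Label each neuron by the axis its output direction aligns with best.
    w_out is (n_neurons, d_model); rows are MLP output directions.
    Neurons whose best cosine is below `threshold` are labeled 'none'."""
    names = list(axes)
    axis_matrix = np.stack([axes[name] for name in names])       # (3, d_model)
    w_unit = w_out / np.linalg.norm(w_out, axis=1, keepdims=True)
    cos = w_unit @ axis_matrix.T                                  # (n_neurons, 3)
    best = cos.argmax(axis=1)
    return [names[b] if cos[i, b] >= threshold else "none"
            for i, b in enumerate(best)]

# Toy example with hand-written vectors standing in for real activations.
toy_axes = value_axes(np.array([1.0, 0.5, 0.0, 0.0]),
                      np.array([1.0, -0.5, 0.0, 0.0]))
toy_neurons = np.array([[2.0, 0.0, 0.0, 0.0],
                        [0.0, 1.0, 0.0, 0.0],
                        [0.0, -3.0, 0.0, 0.0],
                        [0.0, 0.0, 1.0, 0.0]])
print(classify_neurons(toy_neurons, toy_axes))
```

The three axes in this sketch are not mutually orthogonal, so a neuron is assigned to whichever axis has the highest cosine. A real analysis would also need a principled threshold.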

Evidence

The mechanisms overlap, but not completely

Intrinsic and prompted value representations show meaningful similarity, but the non-overlapping components are large enough to create different behavioral effects.

Cosine similarity heatmap between intrinsic and prompted value vectors
Cosine similarity between intrinsic and prompted value vectors at layer 14.
Distribution of shared and unique value neurons by layer
Distribution of shared and unique neurons across layers for a representative value.
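
The heatmap comparison above reduces to a matrix of pairwise cosine similarities. A minimal sketch, assuming the intrinsic and prompted value vectors for one layer are stacked as rows of two arrays (the function name and array layout are illustrative, not from the paper):

```python
import numpy as np

def cosine_similarity_matrix(intrinsic, prompted):
    """Entry [i, j] is the cosine between intrinsic vector i and
    prompted vector j. Both inputs have shape (n_values, d_model)."""
    a = intrinsic / np.linalg.norm(intrinsic, axis=1, keepdims=True)
    b = prompted / np.linalg.norm(prompted, axis=1, keepdims=True)
    return a @ b.T

# Toy 2-dimensional vectors, not real model activations.
sims = cosine_similarity_matrix(np.array([[1.0, 0.0], [0.0, 2.0]]),
                                np.array([[3.0, 0.0], [1.0, 1.0]]))
print(sims)
```

The diagonal of such a matrix compares the intrinsic and prompted vectors of the same value, while off-diagonal entries compare different values.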

Behavior

Prompting steers harder; intrinsic values speak more broadly

The paper finds a consistent tradeoff: prompted mechanisms are more directly steerable, while intrinsic mechanisms produce more lexically diverse responses.
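
This page does not spell out how steering is applied. One common approach, assumed here purely for illustration, is additive activation steering: a scaled unit value vector is added to the hidden state at every token position. The names `steer`, `projection`, and `alpha` are hypothetical.

```python
import numpy as np

def steer(hidden, direction, alpha):
    """Add alpha times the unit-normalized direction to each row of hidden.
    hidden is (n_tokens, d_model); direction is (d_model,)."""
    unit_dir = direction / np.linalg.norm(direction)
    return hidden + alpha * unit_dir

def projection(hidden, direction):
    """Scalar projection of each hidden state onto the unit direction."""
    unit_dir = direction / np.linalg.norm(direction)
    return hidden @ unit_dir

# Toy example: zero hidden states pushed along a 3-dimensional direction.
toy_hidden = np.zeros((2, 3))
toy_direction = np.array([0.0, 3.0, 4.0])
steered = steer(toy_hidden, toy_direction, alpha=5.0)
print(projection(steered, toy_direction))
```

Under this assumption, comparing intrinsic and prompted steering amounts to swapping which value vector is passed as `direction` while holding `alpha` fixed.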

PVQ steering, questionnaire

Intrinsic: +1.74
Prompted: +2.21

Mean score delta over five languages and ten Schwartz values with Qwen2.5-7B-Instruct.

PVQ steering, free-form

Intrinsic: +0.98
Prompted: +1.04

Prompted steering remains slightly stronger in open-ended value expression.

Distinct-3 diversity

Intrinsic: 0.654
Prompted: 0.619

Intrinsic steering uses a broader set of lexical choices in English situational dilemmas.

Entropy-3 diversity

Intrinsic: 14.36
Prompted: 13.79

The diversity advantage persists across multiple lexical and semantic metrics.
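
For readers unfamiliar with the two lexical metrics, the sketch below uses their usual definitions: Distinct-3 is the ratio of unique trigrams to total trigrams, and Entropy-3 is the Shannon entropy of the trigram distribution. Whitespace tokenization, lowercasing, and the base-2 logarithm are assumptions of this example; the paper's tokenizer and log base may differ.

```python
import math
from collections import Counter

def ngram_counts(responses, n):
    """Count n-grams over a list of response strings.
    ASSUMPTION: lowercase whitespace tokenization."""
    counts = Counter()
    for text in responses:
        tokens = text.lower().split()
        counts.update(tuple(tokens[i:i + n])
                      for i in range(len(tokens) - n + 1))
    return counts

def distinct_n(responses, n=3):
    """Unique n-grams divided by total n-grams (0.0 if there are none)."""
    counts = ngram_counts(responses, n)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return len(counts) / total

def entropy_n(responses, n=3):
    """Shannon entropy, in bits, of the empirical n-gram distribution."""
    counts = ngram_counts(responses, n)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Toy corpus: four trigrams in total, three of them unique.
toy_responses = ["a b c d", "a b c e"]
print(distinct_n(toy_responses), entropy_n(toy_responses))
```

Both metrics rise when responses reuse fewer trigrams, which is the sense in which intrinsic steering is described as more lexically diverse.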

Steering results on situational dilemmas.
Steering results on the Value Portrait benchmark.

Analysis

Shared semantics, distinct control channels

Shared components recover the structure of human values, while unique components explain the tradeoff between natural value expression and direct instruction compliance.

PCA visualization of shared Schwartz value axes
Shared value axes recover structure aligned with Schwartz's theory of basic human values.
Lexical entropy comparison of value vectors
Intrinsic vectors, especially intrinsic-orthogonal components, lead to higher lexical entropy.

The prompted-unique component also behaves like a broader instruction-following channel, which makes the mechanism relevant for transparency and safety analysis beyond value expression.

BibTeX

Citation

This BibTeX entry includes the arXiv identifier and ICML 2026 status.

@misc{han2026dualmechanisms,
  title = {Dual Mechanisms of Value Expression: Intrinsic vs. Prompted Values in Large Language Models},
  author = {Jongwook Han and Jongwon Lim and Injin Kong and Yohan Jo},
  year = {2026},
  eprint = {2509.24319},
  archivePrefix = {arXiv},
  primaryClass = {cs.CL},
  note = {ICML 2026 Regular Paper}
}