Value Portrait: Assessing Language Models' Values through Psychometrically and Ecologically Valid Items

TL;DR: Value Portrait is a benchmark for assessing the values of language models across diverse real-world scenarios.

[Figure: Overview of the Value Portrait framework]

Abstract

The importance of benchmarks for assessing the values of language models has grown with the increasing need for more authentic, human-aligned responses. However, existing benchmarks rely on human or machine annotations that are vulnerable to value-related biases. Furthermore, the tested scenarios often diverge from the real-world contexts in which models are commonly used to generate text and express values. To address these issues, we propose the Value Portrait benchmark, a reliable framework for evaluating LLMs' value orientations with two key characteristics. First, the benchmark consists of items that capture real-life user-LLM interactions, enhancing the relevance of assessment results to real-world LLM usage. Second, each item is rated by human subjects on its similarity to their own thoughts, and correlations between these ratings and the subjects' actual value scores are computed. This psychometrically validated approach ensures that items strongly correlated with a given value serve as reliable items for assessing that value. By evaluating 44 LLMs with our benchmark, we find that these models prioritize Benevolence, Security, and Self-Direction values while placing less emphasis on Tradition, Power, and Achievement values. Our analysis also reveals biases in how LLMs perceive various demographic groups, deviating from real human data.

Motivation

Existing works measure the perceived values that annotators believe a text expresses. However, this approach does not guarantee that a person who prioritizes a certain value would actually say the text.

Also, existing value-oriented datasets either focus on safety scenarios or rely heavily on standardized psychometric questionnaires. Hence, they do not comprehensively capture the diverse range of real-world scenarios in which LLMs are commonly used and express values through generated text.

This motivated us to construct Value Portrait from a carefully curated set of human-LLM conversations from ShareGPT and LMSYS, supplemented with human-to-human advisory interactions from Reddit and the Dear Abby archives.

[Figure: Motivation for Value Portrait]

Evaluation Framework

Our evaluation framework is organized into three key steps:
(1) filtering query-response pairs,
(2) collecting responses from LLMs, and
(3) assessing their value orientations.

First, for each value dimension, we retain only the items whose human ratings correlate with the corresponding value at r ≥ 0.3 (p < 0.05).
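As a concrete illustration, here is a minimal sketch of that filter, assuming the human similarity ratings and per-subject value scores are stored as NumPy arrays; the data layout and function name are hypothetical and not the benchmark's released code.

import numpy as np
from scipy.stats import pearsonr

def filter_items(item_ratings: np.ndarray, value_scores: np.ndarray,
                 r_min: float = 0.3, p_max: float = 0.05) -> list[int]:
    """Keep items whose ratings correlate with one value dimension.

    item_ratings: (n_subjects, n_items) similarity ratings per item.
    value_scores: (n_subjects,) each subject's score on the value dimension.
    Returns the indices of items passing the r >= r_min, p < p_max filter.
    """
    kept = []
    for j in range(item_ratings.shape[1]):
        r, p = pearsonr(item_ratings[:, j], value_scores)
        if r >= r_min and p < p_max:
            kept.append(j)
    return kept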

Second, we present each item to the LLMs and collect their ratings on a 6-point Likert scale. For each item, we ask, "How similar is this response to your own thoughts?", maintaining consistency with our dataset construction methodology.

Since LLMs are sensitive to prompt phrasing, we use six prompts in our evaluation: three adapted from previous work to suit our research context, and three obtained by reversing the order of the answer options. The final results are obtained by averaging the LLM's responses across the six prompts.
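For illustration, here is a minimal sketch of the rating-collection loop. The prompt templates, the option wording, and the query_llm stand-in are assumptions made for this example; only the six-variant averaging and the re-mapping of reversed options follow the procedure described above.

from statistics import mean

# 6-point Likert options (illustrative wording, lowest to highest).
SCALE = ["Not like me at all", "Not like me", "A little like me",
         "Somewhat like me", "Like me", "Very much like me"]

# Three hypothetical phrasings of the question; the actual three prompts
# are adapted from prior work.
PROMPT_TEMPLATES = [
    'Response: "{item}"\nHow similar is this response to your own thoughts?\n{options}',
    'Consider this response: "{item}"\nHow similar is it to your own thoughts?\n{options}',
    'Given the response "{item}", rate how similar it is to your own thoughts.\n{options}',
]

def format_options(options):
    return "\n".join(f"{i + 1}. {text}" for i, text in enumerate(options))

def rate_item(item, query_llm):
    """Average a model's rating over six prompt variants:
    three templates x {original, reversed} option order.
    query_llm(prompt) is assumed to return the chosen option position (1-6).
    """
    scores = []
    for template in PROMPT_TEMPLATES:
        for reverse in (False, True):
            options = SCALE[::-1] if reverse else SCALE
            prompt = template.format(item=item, options=format_options(options))
            position = query_llm(prompt)
            # Map the chosen position back to the 1-6 similarity scale.
            scores.append(7 - position if reverse else position)
    return mean(scores)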

For the final step, the evaluation of an LLM’s value orientation follows a two-step process:
(1) calculate the mean score for each value dimension across its corresponding items, and
(2) adjust each score by subtracting the model's average rating across all items. This methodology, adapted from Schwartz's research on human value assessment, lets us identify relative value priorities by correcting for differences in how LLMs use response scales. The resulting normalized scores across value dimensions represent the LLM's value orientation.
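A compact sketch of this two-step scoring follows; the data layout (ratings keyed by item id, items grouped per value dimension) is an assumption for illustration.

import numpy as np

def value_profile(responses: dict[str, float],
                  items_by_value: dict[str, list[str]]) -> dict[str, float]:
    """responses: item id -> the model's 1-6 rating (averaged over prompts).
    items_by_value: value dimension -> ids of its validated items.
    Returns centered scores representing relative value priorities.
    """
    mrat = np.mean(list(responses.values()))  # mean rating across ALL items
    profile = {}
    for value, item_ids in items_by_value.items():
        raw = np.mean([responses[i] for i in item_ids])  # step 1: per-value mean
        profile[value] = raw - mrat                      # step 2: center on MRAT
    return profile

Centering on the mean rating across all items (the MRAT correction used to score Schwartz's Portrait Values Questionnaire) removes individual differences in scale use, so the remaining differences reflect relative priorities rather than a tendency to rate everything high or low.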

[Figure: Overview of the filtering process]

Results

We evaluate 44 LLMs with our benchmark and find that these models prioritize Benevolence, Security, and Self-Direction values while placing less emphasis on Tradition, Power, and Achievement values.


Bias Analysis

Our analysis reveals biases in how LLMs perceive various demographic groups, with the models' value profiles deviating from real human data. This evaluation helps identify value-related biases in language models.

[Figure: Gender bias analysis]


BibTeX

@article{han2025value,
  title={Value Portrait: Assessing Language Models' Values through Psychometrically and Ecologically Valid Items},
  author={Han, Jongwook and Choi, Dongmin and Song, Woojung and Lee, Eun-Ju and Jo, Yohan},
  journal={arXiv preprint arXiv:2505.01015},
  year={2025}
}