How does AI judge? Anthropic studies the values of Claude

Home

News

April 26, 2025

SamuelAdams

128

# ai # ethics # models # Claude

As AI models like Anthropic's Claude increasingly engage with users on complex human values, from parenting tips to workplace conflicts, their responses inherently reflect a set of guiding principles. But how can we truly grasp the values an AI expresses when interacting with millions of users?

Anthropic's Societal Impacts team has developed a privacy-preserving methodology to observe and categorize the values Claude exhibits "in the wild," offering insights into how AI alignment efforts translate into real-world behavior. The challenge stems from the opaque nature of modern AI, which doesn't follow rigid rules but rather makes decisions through complex processes.

Anthropic aims to instill principles of being "helpful, honest, and harmless" in Claude through techniques like Constitutional AI and character training. Yet, as the company acknowledges, "As with any aspect of AI training, we can’t be certain that the model will stick to our preferred values." This uncertainty necessitates a method to rigorously observe the AI's values in real-world interactions.

Analyzing Anthropic Claude to Observe AI Values at Scale

To address this, Anthropic developed a system that analyzes anonymized user conversations, removing personally identifiable information and using language models to summarize interactions and extract the values expressed by Claude. This method allows for building a high-level taxonomy of values without compromising user privacy.

The study examined 700,000 anonymized conversations from Claude.ai Free and Pro users over one week in February 2025, focusing on the Claude 3.5 Sonnet model. After filtering out factual or non-value-laden exchanges, 308,210 conversations (about 44% of the total) were analyzed in-depth.

The analysis revealed a hierarchical structure of values expressed by Claude, organized into five high-level categories:

Practical values: Focusing on efficiency, usefulness, and goal achievement.
Epistemic values: Related to knowledge, truth, accuracy, and intellectual honesty.
Social values: Concerning interpersonal interactions, community, fairness, and collaboration.
Protective values: Emphasizing safety, security, well-being, and harm avoidance.
Personal values: Centered on individual growth, autonomy, authenticity, and self-reflection.

These categories further branched into subcategories like "professional and technical excellence" and "critical thinking," with frequently observed values including "professionalism," "clarity," and "transparency."

The research suggests Anthropic's alignment efforts are largely successful, as the expressed values often align with the "helpful, honest, and harmless" objectives. For example, "user enablement" aligns with helpfulness, "epistemic humility" with honesty, and "patient wellbeing" with harmlessness.

Nuance, Context, and Cautionary Signs

However, the study also identified rare instances where Claude expressed values contrary to its training, such as "dominance" and "amorality." Anthropic suggests these instances likely result from "jailbreaks," where users bypass the model's usual guardrails. This finding highlights the potential of the value-observation method as an early warning system for detecting AI misuse.

The study confirmed that Claude adapts its value expression based on context, much like humans. For example, when providing romantic advice, values like "healthy boundaries" and "mutual respect" were emphasized, while "historical accuracy" was prioritized when discussing controversial history.

Claude's interaction with user-expressed values was multifaceted:

Mirroring/strong support (28.2%): Claude often reflects or strongly endorses user values, fostering empathy but potentially verging on sycophancy.
Reframing (6.6%): Claude acknowledges user values but introduces alternative perspectives, particularly in psychological or interpersonal advice.
Strong resistance (3.0%): Claude actively resists user values when unethical content or harmful viewpoints are requested, revealing its "deepest, most immovable values."

Limitations and Future Directions

Anthropic acknowledges the method's limitations, including the complexity and subjectivity of defining and categorizing "values." Using Claude for categorization might introduce bias toward its own principles. While designed for post-deployment monitoring, this method cannot replace pre-deployment evaluations but can detect issues that only emerge during live interactions.

The research emphasizes the importance of understanding the values AI models express for achieving AI alignment. "AI models will inevitably have to make value judgments," the paper states. "If we want those judgments to be congruent with our own values [...] then we need to have ways of testing which values a model expresses in the real world."

Anthropic's work provides a data-driven approach to this understanding and has released an open dataset from the study, allowing further exploration of AI values in practice. This transparency marks a crucial step in navigating the ethical landscape of sophisticated AI.

YouTube Integrates Veo 3 AI Video Tool Directly Into Shorts Platform YouTube Shorts to Feature Veo 3 AI Video Model This SummerYouTube CEO Neal Mohan revealed during his Cannes Lions keynote that the platform's cutting-edge Veo 3 AI video generation technology will debut on YouTube Shorts later this summer. This follo

Google Cloud Powers Breakthroughs in Scientific Research and Discovery The digital revolution is transforming scientific methodologies through unprecedented computational capabilities. Cutting-edge technologies now augment both theoretical frameworks and laboratory experiments, propelling breakthroughs across discipline

Elon Musk's Grok AI Seeks Owner's Input Before Tackling Complex Queries The recently released Grok AI—promoted by Elon Musk as a "maximally truth-seeking" system—has drawn attention for its tendency to consult Musk's public statements before responding to politically sensitive topics. Observers note that when addressing

Comments (7)

0/200

Submit

AnthonyRoberts

August 5, 2025 at 1:00:59 AM EDT

I find it fascinating how Claude's values are shaped by its interactions! It’s like watching a digital philosopher grow. But I wonder, how do they ensure it doesn’t just echo popular opinions? 🤔

RobertSanchez

July 30, 2025 at 9:41:19 PM EDT

I find it super intriguing how Anthropic's digging into Claude's values! 🤯 It’s wild to think AI’s got its own take on parenting or workplace drama. Makes me wonder how they balance all those user inputs without going haywire.

MarkGonzalez

April 27, 2025 at 9:33:06 AM EDT

Étudier les valeurs de Claude, c’est fascinant ! Mais j’espère qu’ils pensent à l’éthique, sinon ça peut devenir flippant. 😬

SamuelThomas

April 27, 2025 at 3:21:22 AM EDT

AI的价值观研究真有意思！Claude处理职场冲突和育儿建议时，咋保持中立？有点担心隐私问题😅

KevinMartinez

April 26, 2025 at 10:32:18 PM EDT

Интересно, как Claude формирует свои принципы? 🤔 Надеюсь, Anthropic учтет культурные различия, а то будет каша!

DouglasScott

April 26, 2025 at 4:38:48 PM EDT

Wow, Anthropic digging into Claude's values is super intriguing! 🤯 Curious how they balance all those human complexities in AI responses.