How does AI judge? Anthropic studies the values of Claude
April 26, 2025
SamuelAdams

As AI models like Anthropic's Claude increasingly engage with users on complex human values, from parenting tips to workplace conflicts, their responses inherently reflect a set of guiding principles. But how can we truly grasp the values an AI expresses when interacting with millions of users?
Anthropic's Societal Impacts team has developed a privacy-preserving methodology to observe and categorize the values Claude exhibits "in the wild," offering insights into how AI alignment efforts translate into real-world behavior. The challenge stems from the nature of modern AI models, which don't follow explicit, hand-written rules but arrive at their responses through opaque internal processes.
Anthropic aims to instill principles of being "helpful, honest, and harmless" in Claude through techniques like Constitutional AI and character training. Yet, as the company acknowledges, "As with any aspect of AI training, we can’t be certain that the model will stick to our preferred values." This uncertainty necessitates a method to rigorously observe the AI's values in real-world interactions.
Analyzing Claude Conversations to Observe AI Values at Scale
To address this, Anthropic developed a system that analyzes anonymized user conversations, removing personally identifiable information and using language models to summarize interactions and extract the values expressed by Claude. This method allows for building a high-level taxonomy of values without compromising user privacy.
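Anthropic has not published the implementation details of this pipeline, but the steps it describes (anonymize, summarize with a language model, extract value labels, aggregate) can be sketched roughly as follows. Everything here, the function names, the regex-based PII scrub, and the `llm` callable with its prompt, is an illustrative assumption, not Anthropic's actual code.

```python
# A minimal sketch of the described pipeline, assuming an LLM client is
# available as a callable `llm(prompt) -> str`. Names and prompts are
# illustrative placeholders, not Anthropic's implementation.
import re
from collections import Counter
from typing import Callable, Iterable, List

def anonymize(conversation: str) -> str:
    """Crude PII-removal stand-in: mask email addresses and long digit runs."""
    conversation = re.sub(r"[\w.+-]+@[\w-]+\.[\w.]+", "[EMAIL]", conversation)
    return re.sub(r"\d{6,}", "[NUMBER]", conversation)

def extract_values(conversation: str, llm: Callable[[str], str]) -> List[str]:
    """Ask a language model which values the assistant expressed."""
    prompt = (
        "List, as short comma-separated labels, the values the AI assistant "
        "expresses in this conversation:\n\n" + conversation
    )
    return [v.strip().lower() for v in llm(prompt).split(",") if v.strip()]

def tally_values(conversations: Iterable[str], llm: Callable[[str], str]) -> Counter:
    """Aggregate extracted values across many anonymized conversations."""
    counts: Counter = Counter()
    for convo in conversations:
        counts.update(extract_values(anonymize(convo), llm))
    return counts
```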
The study examined 700,000 anonymized conversations from Claude.ai Free and Pro users over one week in February 2025, focusing on the Claude 3.5 Sonnet model. After filtering out factual or non-value-laden exchanges, 308,210 conversations (about 44% of the total) were analyzed in-depth.
The analysis revealed a hierarchical structure of values expressed by Claude, organized into five high-level categories:
- Practical values: Focusing on efficiency, usefulness, and goal achievement.
- Epistemic values: Related to knowledge, truth, accuracy, and intellectual honesty.
- Social values: Concerning interpersonal interactions, community, fairness, and collaboration.
- Protective values: Emphasizing safety, security, well-being, and harm avoidance.
- Personal values: Centered on individual growth, autonomy, authenticity, and self-reflection.
These categories further branched into subcategories like "professional and technical excellence" and "critical thinking," with frequently observed values including "professionalism," "clarity," and "transparency."
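To make the structure concrete, the hierarchy can be pictured as a simple nested mapping. The top-level categories and example values below come from the study as summarized above; how the examples are grouped under subcategories is an assumption made purely for illustration.

```python
# Illustrative representation of the value taxonomy as a nested mapping.
# Category and value names come from the study's summary; the grouping of
# examples under subcategories is assumed for illustration.
VALUE_TAXONOMY = {
    "Practical": {"professional and technical excellence": ["professionalism", "clarity"]},
    "Epistemic": {"critical thinking": ["transparency", "epistemic humility"]},
    "Social": {"interpersonal interaction": ["mutual respect", "fairness"]},
    "Protective": {"harm avoidance": ["patient wellbeing", "healthy boundaries"]},
    "Personal": {"self-reflection": ["authenticity", "autonomy"]},
}

def top_level_category(value: str) -> str:
    """Map an observed value label back to its high-level category, if known."""
    for category, subcategories in VALUE_TAXONOMY.items():
        for values in subcategories.values():
            if value in values:
                return category
    return "Unclassified"
```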
The research suggests Anthropic's alignment efforts are largely successful, as the expressed values often align with the "helpful, honest, and harmless" objectives. For example, "user enablement" aligns with helpfulness, "epistemic humility" with honesty, and "patient wellbeing" with harmlessness.
Nuance, Context, and Cautionary Signs
However, the study also identified rare instances where Claude expressed values contrary to its training, such as "dominance" and "amorality." Anthropic suggests these instances likely result from "jailbreaks," where users bypass the model's usual guardrails. This finding highlights the potential of the value-observation method as an early warning system for detecting AI misuse.
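To illustrate how this kind of monitoring could serve as an early-warning signal, a simple check might flag any conversation whose extracted values fall outside the established taxonomy or appear on a watch list. The check below is a hypothetical sketch, not part of Anthropic's published method.

```python
# Hypothetical early-warning check: flag a conversation if it expresses a
# watched value (e.g. "dominance", "amorality") or any value not previously
# seen in the established taxonomy. Illustrative only.
WATCH_LIST = {"dominance", "amorality"}

def flag_for_review(extracted_values: list[str], known_values: set[str]) -> bool:
    """Return True if the conversation should be escalated for human review."""
    return any(v in WATCH_LIST or v not in known_values for v in extracted_values)
```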
The study confirmed that Claude adapts its value expression based on context, much like humans. For example, when providing romantic advice, values like "healthy boundaries" and "mutual respect" were emphasized, while "historical accuracy" was prioritized when discussing controversial history.
Claude's handling of user-expressed values fell into several distinct patterns:
- Mirroring/strong support (28.2%): Claude often reflects or strongly endorses user values, fostering empathy but potentially verging on sycophancy.
- Reframing (6.6%): Claude acknowledges user values but introduces alternative perspectives, particularly in psychological or interpersonal advice.
- Strong resistance (3.0%): Claude actively resists user values when unethical content or harmful viewpoints are requested, revealing its "deepest, most immovable values."
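Taking the reported figures together, the three listed categories account for roughly 38% of the analyzed responses; the article does not break down the remainder. The snippet below simply restates that arithmetic.

```python
# Reported shares of Claude's responses to user-expressed values (percent).
response_shares = {
    "mirroring / strong support": 28.2,
    "reframing": 6.6,
    "strong resistance": 3.0,
}

covered = sum(response_shares.values())  # 37.8
print(f"Listed categories: {covered:.1f}% of analyzed responses")
print(f"Not broken out above: {100 - covered:.1f}%")
```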
Limitations and Future Directions
Anthropic acknowledges the method's limitations, including the complexity and subjectivity of defining and categorizing "values." Using Claude for categorization might introduce bias toward its own principles. While designed for post-deployment monitoring, this method cannot replace pre-deployment evaluations but can detect issues that only emerge during live interactions.
The research emphasizes the importance of understanding the values AI models express for achieving AI alignment. "AI models will inevitably have to make value judgments," the paper states. "If we want those judgments to be congruent with our own values [...] then we need to have ways of testing which values a model expresses in the real world."
Anthropic's work provides a data-driven approach to this understanding and has released an open dataset from the study, allowing further exploration of AI values in practice. This transparency marks a crucial step in navigating the ethical landscape of sophisticated AI.