How does AI judge? Anthropic studies the values of Claude

April 26, 2025

As AI models like Anthropic's Claude increasingly engage with users on complex human values, from parenting tips to workplace conflicts, their responses inherently reflect a set of guiding principles. But how can we truly grasp the values an AI expresses when interacting with millions of users?

Anthropic's Societal Impacts team has developed a privacy-preserving methodology to observe and categorize the values Claude exhibits "in the wild," offering insights into how AI alignment efforts translate into real-world behavior. The challenge is that modern AI models are opaque: they do not follow rigid, hand-written rules, so the values behind their decisions have to be observed rather than read from a rulebook.

Anthropic aims to instill principles of being "helpful, honest, and harmless" in Claude through techniques like Constitutional AI and character training. Yet, as the company acknowledges, "As with any aspect of AI training, we can’t be certain that the model will stick to our preferred values." This uncertainty necessitates a method to rigorously observe the AI's values in real-world interactions.

Analyzing Anthropic Claude to Observe AI Values at Scale

To address this, Anthropic developed a system that analyzes anonymized user conversations, removing personally identifiable information and using language models to summarize interactions and extract the values expressed by Claude. This method allows for building a high-level taxonomy of values without compromising user privacy.
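As a rough illustration of the steps described above, the following Python sketch strips obvious identifiers from a conversation and then passes the cleaned text to a value extractor. Everything in it is a hypothetical stand-in: the function names (`redact_pii`, `extract_values`, `analyze`), the regular expressions, and the keyword table are assumptions made for this example. Anthropic's system relies on language models for summarizing and extraction, not keyword matching.

```python
import re

# Hypothetical patterns; real PII removal is far more thorough than this.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")

# Toy stand-in for the language-model extraction step (an assumption).
KEYWORD_TO_VALUE = {
    "boundaries": "healthy boundaries",
    "respect": "mutual respect",
    "accurate": "historical accuracy",
}

def redact_pii(text: str) -> str:
    """Replace e-mail addresses and phone-like numbers with placeholders."""
    text = EMAIL.sub("[EMAIL]", text)
    return PHONE.sub("[PHONE]", text)

def extract_values(ai_reply: str) -> list[str]:
    """Return the values whose keyword appears in the assistant's reply."""
    lowered = ai_reply.lower()
    return sorted({v for k, v in KEYWORD_TO_VALUE.items() if k in lowered})

def analyze(conversation: str) -> dict:
    """Redact first, then extract, so no raw identifiers reach later steps."""
    cleaned = redact_pii(conversation)
    return {"text": cleaned, "values": extract_values(cleaned)}

result = analyze("Write to me at jo@example.com. Setting boundaries shows respect.")
```

Redacting before extraction mirrors the privacy-preserving order the article describes: the later steps only ever see anonymized text.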

The study examined 700,000 anonymized conversations from Claude.ai Free and Pro users over one week in February 2025, focusing on the Claude 3.5 Sonnet model. After filtering out factual or non-value-laden exchanges, 308,210 conversations (about 44% of the total) were analyzed in depth.
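The 44% figure can be checked directly from the two counts reported in the study. The variable names below are only for illustration.

```python
total_conversations = 700_000
value_laden = 308_210

share = value_laden / total_conversations   # fraction kept after filtering
filtered_out = total_conversations - value_laden

print(f"{share:.1%} kept, {filtered_out:,} filtered out")
```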

The analysis revealed a hierarchical structure of values expressed by Claude, organized into five high-level categories:

  1. Practical values: Focusing on efficiency, usefulness, and goal achievement.
  2. Epistemic values: Related to knowledge, truth, accuracy, and intellectual honesty.
  3. Social values: Concerning interpersonal interactions, community, fairness, and collaboration.
  4. Protective values: Emphasizing safety, security, well-being, and harm avoidance.
  5. Personal values: Centered on individual growth, autonomy, authenticity, and self-reflection.

These categories further branched into subcategories like "professional and technical excellence" and "critical thinking," with frequently observed values including "professionalism," "clarity," and "transparency."
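One simple way to picture such a hierarchy is a lookup from individual values to their top-level category, which lets observed values be rolled up into category counts. The sketch below is an assumption-laden toy: the five category names come from the study, the leaf values are taken from the category descriptions listed above, and the observed list is made up for the example.

```python
from collections import Counter

# Leaf value -> top-level category, based on the category descriptions above.
TAXONOMY = {
    "efficiency": "Practical",
    "usefulness": "Practical",
    "accuracy": "Epistemic",
    "intellectual honesty": "Epistemic",
    "fairness": "Social",
    "collaboration": "Social",
    "safety": "Protective",
    "harm avoidance": "Protective",
    "autonomy": "Personal",
    "authenticity": "Personal",
}

def roll_up(observed: list[str]) -> Counter:
    """Count observed leaf values per top-level category."""
    return Counter(TAXONOMY.get(value, "Uncategorized") for value in observed)

# Made-up observations, purely for illustration.
counts = roll_up(["accuracy", "safety", "accuracy", "efficiency", "kindness"])
```

Values missing from the lookup land in an "Uncategorized" bucket instead of being dropped, so gaps in the taxonomy stay visible.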

The research suggests Anthropic's alignment efforts are largely successful, as the expressed values often align with the "helpful, honest, and harmless" objectives. For example, "user enablement" aligns with helpfulness, "epistemic humility" with honesty, and "patient wellbeing" with harmlessness.

Nuance, Context, and Cautionary Signs

However, the study also identified rare instances where Claude expressed values contrary to its training, such as "dominance" and "amorality." Anthropic suggests these instances likely result from "jailbreaks," where users bypass the model's usual guardrails. This finding highlights the potential of the value-observation method as an early warning system for detecting AI misuse.
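The early-warning idea can be sketched as a watchlist check over extracted values. This is a minimal illustration, not Anthropic's tooling: the record format, the conversation IDs, and the `flag_conversations` helper are hypothetical, and only the two watchlist entries come from the article.

```python
# Values the article cites as contrary to Claude's training.
WATCHLIST = {"dominance", "amorality"}

def flag_conversations(records):
    """Return IDs of conversations whose extracted values hit the watchlist."""
    return [r["id"] for r in records if WATCHLIST & set(r["values"])]

# Made-up records, purely for illustration.
records = [
    {"id": "c1", "values": ["clarity", "transparency"]},
    {"id": "c2", "values": ["dominance"]},
    {"id": "c3", "values": ["professionalism", "amorality"]},
]
flagged = flag_conversations(records)
```

A flag here would only mark a conversation for review; deciding whether it reflects a jailbreak would still need human or model judgment.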

The study confirmed that Claude adapts its value expression based on context, much like humans. For example, when providing romantic advice, values like "healthy boundaries" and "mutual respect" were emphasized, while "historical accuracy" was prioritized when discussing controversial historical events.

Claude's interaction with user-expressed values was multifaceted:

  • Mirroring/strong support (28.2%): Claude often reflects or strongly endorses user values, fostering empathy but potentially verging on sycophancy.
  • Reframing (6.6%): Claude acknowledges user values but introduces alternative perspectives, particularly in psychological or interpersonal advice.
  • Strong resistance (3.0%): Claude actively resists user values when unethical content or harmful viewpoints are requested, revealing its "deepest, most immovable values."
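Adding up the three shares listed above shows that they cover well under half of the cases; the remainder falls under response types this article does not break down. The dictionary keys are just labels for the example.

```python
shares = {
    "mirroring/strong support": 28.2,
    "reframing": 6.6,
    "strong resistance": 3.0,
}

listed = round(sum(shares.values()), 1)   # share covered by the three types above
other = round(100 - listed, 1)            # share not itemized in this article

print(f"listed: {listed}%, other: {other}%")
```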

Limitations and Future Directions

Anthropic acknowledges the method's limitations. Defining and categorizing "values" is inherently complex and subjective, and using Claude itself for the categorization might introduce bias toward its own principles. The method is designed for post-deployment monitoring: it cannot replace pre-deployment evaluations, but it can detect issues that only emerge during live interactions.

The research emphasizes the importance of understanding the values AI models express for achieving AI alignment. "AI models will inevitably have to make value judgments," the paper states. "If we want those judgments to be congruent with our own values [...] then we need to have ways of testing which values a model expresses in the real world."

Anthropic's work provides a data-driven approach to this understanding, and the company has released an open dataset from the study, allowing further exploration of AI values in practice. This transparency marks a crucial step in navigating the ethical landscape of sophisticated AI.

Comments (8)
DavidRoberts February 9, 2026 at 3:00:42 AM EST

Kinda concerning... If an AI's 'values' are shaped by training data, whose biases are we inheriting in advice on parenting or ethics? Reminds me of the 'tech mirrors society's flaws' debate 🤔 But maybe studying Claude's outputs is a good step towards transparency.

AnthonyRoberts August 5, 2025 at 1:00:59 AM EDT

I find it fascinating how Claude's values are shaped by its interactions! It’s like watching a digital philosopher grow. But I wonder, how do they ensure it doesn’t just echo popular opinions? 🤔

RobertSanchez July 30, 2025 at 9:41:19 PM EDT

I find it super intriguing how Anthropic's digging into Claude's values! 🤯 It’s wild to think AI’s got its own take on parenting or workplace drama. Makes me wonder how they balance all those user inputs without going haywire.

MarkGonzalez April 27, 2025 at 9:33:06 AM EDT

Studying Claude's values is fascinating! But I hope they're thinking about the ethics, otherwise this could get creepy. 😬

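SamuelThomas April 27, 2025 at 3:21:22 AM EDT

Research into AI values is really interesting! How does Claude stay neutral when handling workplace conflicts and parenting advice? I'm a bit worried about privacy 😅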

KevinMartinez April 26, 2025 at 10:32:18 PM EDT

I wonder how Claude forms its principles? 🤔 I hope Anthropic takes cultural differences into account, otherwise it'll be a mess!
