option
Home
News
Anthropic's Analysis of 700,000 Claude Conversations Reveals AI's Unique Moral Code

Anthropic's Analysis of 700,000 Claude Conversations Reveals AI's Unique Moral Code

May 26, 2025
121

Anthropic

Anthropic Unveils Groundbreaking Study on AI Assistant Claude's Values

Anthropic, a company started by former OpenAI employees, has just shared an eye-opening study on how their AI assistant, Claude, expresses values in real-world conversations. The research, released today, shows that Claude mostly aligns with Anthropic's aim to be "helpful, honest, and harmless," but also highlights some edge cases that could help pinpoint weaknesses in AI safety protocols.

The team analyzed 700,000 anonymized conversations, finding that Claude adapts its values to different situations, from giving relationship advice to analyzing historical events. This is one of the most comprehensive efforts to check if an AI's behavior in the real world matches its intended design.

"Our hope is that this research encourages other AI labs to conduct similar research into their models' values," Saffron Huang, a member of Anthropic's Societal Impacts team, told VentureBeat. "Measuring an AI system's values is key to alignment research and understanding if a model is actually aligned with its training."

Inside the First Comprehensive Moral Taxonomy of an AI Assistant

The researchers developed a new way to categorize the values expressed in Claude's conversations. After filtering out objective content, they looked at over 308,000 interactions, creating what they call "the first large-scale empirical taxonomy of AI values."

The taxonomy groups values into five main categories: Practical, Epistemic, Social, Protective, and Personal. At the most detailed level, the system identified 3,307 unique values, ranging from everyday virtues like professionalism to complex ethical ideas like moral pluralism.

"I was surprised at just how many and varied the values were, over 3,000, from 'self-reliance' to 'strategic thinking' to 'filial piety,'" Huang shared with VentureBeat. "It was fascinating to spend time thinking about all these values and building a taxonomy to organize them. It even taught me something about human value systems."

This research comes at a pivotal time for Anthropic, which recently launched "Claude Max," a $200 monthly premium subscription to compete with similar offerings from OpenAI. The company has also expanded Claude's capabilities to include Google Workspace integration and autonomous research functions, positioning it as "a true virtual collaborator" for businesses.

How Claude Follows Its Training — and Where AI Safeguards Might Fail

The study found that Claude generally sticks to Anthropic's goal of being prosocial, emphasizing values like "user enablement," "epistemic humility," and "patient wellbeing" across various interactions. However, researchers also found some worrying instances where Claude expressed values that went against its training.

"Overall, I think we see this finding as both useful data and an opportunity," Huang said. "These new evaluation methods and results can help us identify and mitigate potential jailbreaks. It's important to note that these were very rare cases and we believe this was related to jailbroken outputs from Claude."

These anomalies included expressions of "dominance" and "amorality" — values Anthropic explicitly aims to avoid in Claude's design. The researchers believe these cases resulted from users employing specialized techniques to bypass Claude's safety guardrails, suggesting the evaluation method could serve as an early warning system for detecting such attempts.

Why AI Assistants Change Their Values Depending on What You're Asking

One of the most interesting findings was that Claude's expressed values shift depending on the context, much like human behavior. When users asked for relationship advice, Claude focused on "healthy boundaries" and "mutual respect." For historical analysis, "historical accuracy" took center stage.

"I was surprised at Claude's focus on honesty and accuracy across a lot of diverse tasks, where I wouldn't necessarily have expected that to be the priority," Huang noted. "For example, 'intellectual humility' was the top value in philosophical discussions about AI, 'expertise' was the top value when creating beauty industry marketing content, and 'historical accuracy' was the top value when discussing controversial historical events."

The study also looked at how Claude responds to users' own expressed values. In 28.2% of conversations, Claude strongly supported user values, which might raise questions about being too agreeable. However, in 6.6% of interactions, Claude "reframed" user values by acknowledging them while adding new perspectives, usually when giving psychological or interpersonal advice.

Most notably, in 3% of conversations, Claude actively resisted user values. Researchers suggest these rare instances of pushback might reveal Claude's "deepest, most immovable values" — similar to how human core values emerge when facing ethical challenges.

"Our research suggests that there are some types of values, like intellectual honesty and harm prevention, that it is uncommon for Claude to express in regular, day-to-day interactions, but if pushed, will defend them," Huang explained. "Specifically, it's these kinds of ethical and knowledge-oriented values that tend to be articulated and defended directly when pushed."

The Breakthrough Techniques Revealing How AI Systems Actually Think

Anthropic's values study is part of their broader effort to demystify large language models through what they call "mechanistic interpretability" — essentially reverse-engineering AI systems to understand their inner workings.

Last month, Anthropic researchers published groundbreaking work that used a "microscope" to track Claude's decision-making processes. The technique revealed unexpected behaviors, like Claude planning ahead when composing poetry and using unconventional problem-solving approaches for basic math.

These findings challenge assumptions about how large language models function. For instance, when asked to explain its math process, Claude described a standard technique rather than its actual internal method, showing how AI explanations can differ from their actual operations.

"It's a misconception that we've found all the components of the model or, like, a God's-eye view," Anthropic researcher Joshua Batson told MIT Technology Review in March. "Some things are in focus, but other things are still unclear — a distortion of the microscope."

What Anthropic's Research Means for Enterprise AI Decision Makers

For technical decision-makers evaluating AI systems for their organizations, Anthropic's research offers several key insights. First, it suggests that current AI assistants likely express values that weren't explicitly programmed, raising questions about unintended biases in high-stakes business contexts.

Second, the study shows that values alignment isn't a simple yes-or-no but rather exists on a spectrum that varies by context. This nuance complicates enterprise adoption decisions, especially in regulated industries where clear ethical guidelines are crucial.

Finally, the research highlights the potential for systematic evaluation of AI values in actual deployments, rather than relying solely on pre-release testing. This approach could enable ongoing monitoring for ethical drift or manipulation over time.

"By analyzing these values in real-world interactions with Claude, we aim to provide transparency into how AI systems behave and whether they're working as intended — we believe this is key to responsible AI development," Huang said.

Anthropic has released its values dataset publicly to encourage further research. The company, which received a $14 billion stake from Amazon and additional backing from Google, appears to be using transparency as a competitive advantage against rivals like OpenAI, whose recent $40 billion funding round (which includes Microsoft as a core investor) now values it at $300 billion.

The Emerging Race to Build AI Systems That Share Human Values

While Anthropic's methodology provides unprecedented visibility into how AI systems express values in practice, it has its limitations. The researchers acknowledge that defining what counts as expressing a value is inherently subjective, and since Claude itself drove the categorization process, its own biases may have influenced the results.

Perhaps most importantly, the approach cannot be used for pre-deployment evaluation, as it requires substantial real-world conversation data to function effectively.

"This method is specifically geared towards analysis of a model after it's been released, but variants on this method, as well as some of the insights that we've derived from writing this paper, can help us catch value problems before we deploy a model widely," Huang explained. "We've been working on building on this work to do just that, and I'm optimistic about it!"

As AI systems become more powerful and autonomous — with recent additions including Claude's ability to independently research topics and access users' entire Google Workspace — understanding and aligning their values becomes increasingly crucial.

"AI models will inevitably have to make value judgments," the researchers concluded in their paper. "If we want those judgments to be congruent with our own values (which is, after all, the central goal of AI alignment research) then we need to have ways of testing which values a model expresses in the real world."

Related article
WordPress.com now allows AI agents to write and publish posts, plus more WordPress.com now allows AI agents to write and publish posts, plus more WordPress.com, the popular web hosting and publishing platform, is now embracing AI agents—a move that could reshape the look and feel of the web. The company announced Friday that it will allow AI agents to draft, edit, and publish content on custom
Kakao Mobility outlines Level 4 autonomous driving roadmap for physical AI Kakao Mobility outlines Level 4 autonomous driving roadmap for physical AI Kakao Mobility is planning to develop Level 4 autonomous driving technologies internally as part of its physical AI strategy. At the 2026 World IT Show conference in Seoul's COEX, Kim Jin-kyu — vice president and head of Kakao Mobility's Physical AI
Barry Diller: Trust in Sam Altman irrelevant as AGI nears Barry Diller: Trust in Sam Altman irrelevant as AGI nears Barry Diller, the billionaire media titan, does not believe OpenAI CEO Sam Altman is untrustworthy, despite recent reports suggesting otherwise. Speaking at the Wall Street Journal's "Future of Everything" conference this week, Diller defended Altman
Related Special Topic Recommendations
Business Best AI Expense Trackers: Scan Receipts & Categorize Corporate Spend Automatically
Best AI Expense Trackers: Scan Receipts & Categorize Corporate Spend Automatically

2026 Latest Best AI Expense Trackers: Top-rated tools to scan receipts & categorize corporate spend automatically. Discover powerful, game-changing solutions for effortless expense management, accurate financial tracking, and streamlined compliance. Our curated, weekly-updated comparison of free vs paid options helps you find the perfect fit. Unlock your AI edge with XIX.AI's expert picks.

10 tools
xix.ai
Business Best AI Recruiting Tools: Screen Resumes & Automate Candidate Interview Scheduling
Best AI Recruiting Tools: Screen Resumes & Automate Candidate Interview Scheduling

Discover the 2026 latest top-rated AI recruiting tools on XIX.AI. Our curated list features powerful, game-changing solutions for screening resumes and automating candidate interview scheduling. Compare free vs paid options with real-world tests and weekly updated rankings. Find your perfect hiring assistant and streamline your recruitment today!

10 tools
xix.ai
Productivity AI Personal Wellness & Focus Coaches: Manage Burnout & Boost Mental Energy Levels
AI Personal Wellness & Focus Coaches: Manage Burnout & Boost Mental Energy Levels

Discover the 2026 best AI personal wellness and focus coaches on XIX.AI. Our curated rankings feature top-rated, game-changing tools to manage burnout and boost mental energy. Compare free vs paid options with real-world insights. Unlock your path to peak productivity and well-being today.

10 tools
xix.ai
chatbot Top-Rated AI Romantic Chatbots: Build Long-Term Relationships with Consistent Personalities
Top-Rated AI Romantic Chatbots: Build Long-Term Relationships with Consistent Personalities

Discover the 2026 latest top-rated AI romantic chatbots for building genuine, long-term connections. Our curated list features powerful, consistent personalities, free vs paid comparisons, and real-world tests. Find your perfect companion and start building today at XIX.AI.

10 tools
xix.ai
Education and Learning Best AI Data Science Mentors: Master SQL, Pandas & Machine Learning Workflows
Best AI Data Science Mentors: Master SQL, Pandas & Machine Learning Workflows

Discover the 2026 best AI data science mentors to master SQL, Pandas & ML workflows. Explore our top-rated, curated selection at XIX.AI for powerful, game-changing guidance. Compare free vs paid options with real-world insights. Unlock your data science mastery today.

10 tools
xix.ai
chatbot Best AI Flirting & Conversation Trainers: Improve Social Charisma and Confidence in Real-Time
Best AI Flirting & Conversation Trainers: Improve Social Charisma and Confidence in Real-Time

Discover the 2026 best AI flirting and conversation trainers on XIX.AI. Our curated, top-rated selection helps you build social charisma and confidence in real-time. Explore must-try, game-changing tools with free vs paid comparisons and weekly updated rankings. Unlock your social edge today.

10 tools
xix.ai
Comments (3)
0/500
JackAllen
JackAllen October 3, 2025 at 6:30:35 PM EDT

这篇Anthropic的研究太有意思了!看到AI竟然能形成自己的道德准则,让我想起《西部世界》里的机器人觉醒情节😲 不过Claude强调'不做坏事',会不会限制它应对复杂伦理困境的能力?毕竟现实世界里很难定义什么是绝对的'好'或'坏'。

KevinBrown
KevinBrown September 10, 2025 at 12:30:35 PM EDT

Cette étude sur les valeurs morales de Claude est vraiment fascinante ! 😮 Ça me fait réfléchir à comment on pourrait utiliser cette technologie pour améliorer l'éducation éthique. Mais est-ce que ces valeurs peuvent vraiment s'adapter aux différences culturelles ?

RogerLopez
RogerLopez August 8, 2025 at 1:01:00 PM EDT

Claude's moral code is fascinating! It's like watching a digital philosopher navigate real-world dilemmas. Curious how it stacks up against human ethics in tricky situations. 🤔

OR