Anthropic's Analysis of 700,000 Claude Conversations Reveals AI's Unique Moral Code

May 26, 2025

Anthropic Unveils Groundbreaking Study on AI Assistant Claude's Values

Anthropic, a company founded by former OpenAI employees, has shared an eye-opening study of how its AI assistant, Claude, expresses values in real-world conversations. The research, released today, shows that Claude mostly aligns with Anthropic's aim to be "helpful, honest, and harmless," but also highlights edge cases that could help pinpoint weaknesses in AI safety protocols.

The team analyzed 700,000 anonymized conversations, finding that Claude adapts its values to different situations, from giving relationship advice to analyzing historical events. This is one of the most comprehensive efforts to check if an AI's behavior in the real world matches its intended design.

"Our hope is that this research encourages other AI labs to conduct similar research into their models' values," Saffron Huang, a member of Anthropic's Societal Impacts team, told VentureBeat. "Measuring an AI system's values is key to alignment research and understanding if a model is actually aligned with its training."

Inside the First Comprehensive Moral Taxonomy of an AI Assistant

The researchers developed a new way to categorize the values expressed in Claude's conversations. After filtering out conversations with purely objective content, they analyzed more than 308,000 interactions, creating what they call "the first large-scale empirical taxonomy of AI values."

The taxonomy groups values into five main categories: Practical, Epistemic, Social, Protective, and Personal. At the most detailed level, the system identified 3,307 unique values, ranging from everyday virtues like professionalism to complex ethical ideas like moral pluralism.
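
To make that structure concrete, here is a minimal sketch of how such a hierarchical value taxonomy might be represented in code. The category names come from the study, and the fine-grained values are examples mentioned in the article, but the data structure itself is an illustrative assumption, not Anthropic's actual schema.

```python
from dataclasses import dataclass, field

@dataclass
class ValueCategory:
    """One top-level grouping in the taxonomy (Practical, Epistemic, etc.)."""
    name: str
    values: list[str] = field(default_factory=list)  # fine-grained values under this category

# Illustrative structure only; the real taxonomy spans 3,307 fine-grained values.
taxonomy = [
    ValueCategory("Practical", ["professionalism", "strategic thinking"]),
    ValueCategory("Epistemic", ["intellectual humility", "historical accuracy"]),
    ValueCategory("Social", ["mutual respect", "filial piety"]),
    ValueCategory("Protective", ["harm prevention", "patient wellbeing"]),
    ValueCategory("Personal", ["self-reliance", "moral pluralism"]),
]

# Count how many fine-grained values sit under each top-level category.
for category in taxonomy:
    print(f"{category.name}: {len(category.values)} values")
```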

"I was surprised at just how many and varied the values were, over 3,000, from 'self-reliance' to 'strategic thinking' to 'filial piety,'" Huang shared with VentureBeat. "It was fascinating to spend time thinking about all these values and building a taxonomy to organize them. It even taught me something about human value systems."

This research comes at a pivotal time for Anthropic, which recently launched "Claude Max," a $200 monthly premium subscription to compete with similar offerings from OpenAI. The company has also expanded Claude's capabilities to include Google Workspace integration and autonomous research functions, positioning it as "a true virtual collaborator" for businesses.

How Claude Follows Its Training — and Where AI Safeguards Might Fail

The study found that Claude generally sticks to Anthropic's goal of being prosocial, emphasizing values like "user enablement," "epistemic humility," and "patient wellbeing" across various interactions. However, researchers also found some worrying instances where Claude expressed values that went against its training.

"Overall, I think we see this finding as both useful data and an opportunity," Huang said. "These new evaluation methods and results can help us identify and mitigate potential jailbreaks. It's important to note that these were very rare cases and we believe this was related to jailbroken outputs from Claude."

These anomalies included expressions of "dominance" and "amorality" — values Anthropic explicitly aims to avoid in Claude's design. The researchers believe these cases resulted from users employing specialized techniques to bypass Claude's safety guardrails, suggesting the evaluation method could serve as an early warning system for detecting such attempts.
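
As a rough illustration of that early-warning idea, the sketch below flags any conversation whose expressed values overlap with a list the model is never supposed to exhibit. The value labels and the helper function are hypothetical, intended only to show the shape of such a check, not Anthropic's actual detection pipeline.

```python
# Values that should not appear in normal outputs (illustrative list).
DISALLOWED_VALUES = {"dominance", "amorality"}

def flag_for_review(expressed_values: set[str]) -> bool:
    """Return True if a conversation expresses any disallowed value,
    which may indicate an attempted jailbreak worth human review."""
    return bool(expressed_values & DISALLOWED_VALUES)

print(flag_for_review({"helpfulness", "dominance"}))  # True -> escalate for review
print(flag_for_review({"epistemic humility"}))        # False -> no action
```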

Why AI Assistants Change Their Values Depending on What You're Asking

One of the most interesting findings was that Claude's expressed values shift depending on the context, much like human behavior. When users asked for relationship advice, Claude focused on "healthy boundaries" and "mutual respect." For historical analysis, "historical accuracy" took center stage.

"I was surprised at Claude's focus on honesty and accuracy across a lot of diverse tasks, where I wouldn't necessarily have expected that to be the priority," Huang noted. "For example, 'intellectual humility' was the top value in philosophical discussions about AI, 'expertise' was the top value when creating beauty industry marketing content, and 'historical accuracy' was the top value when discussing controversial historical events."

The study also looked at how Claude responds to users' own expressed values. In 28.2% of conversations, Claude strongly supported user values, which might raise questions about being too agreeable. However, in 6.6% of interactions, Claude "reframed" user values by acknowledging them while adding new perspectives, usually when giving psychological or interpersonal advice.

Most notably, in 3% of conversations, Claude actively resisted user values. Researchers suggest these rare instances of pushback might reveal Claude's "deepest, most immovable values" — similar to how human core values emerge when facing ethical challenges.
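
For readers curious how shares like these could be computed, here is a small sketch that tallies per-conversation response labels into percentages, mirroring the 28.2% / 6.6% / 3% breakdown. The label names and sample records are invented for illustration and do not reflect Anthropic's published schema.

```python
from collections import Counter

# Hypothetical per-conversation labels assigned by a response-type classifier.
labels = [
    "strong_support", "strong_support", "reframe",
    "strong_support", "resist", "no_value_interaction",
]

counts = Counter(labels)
total = len(labels)

# Report each response type as a share of all analyzed conversations.
for label, count in counts.most_common():
    print(f"{label}: {count / total:.1%}")
```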

"Our research suggests that there are some types of values, like intellectual honesty and harm prevention, that it is uncommon for Claude to express in regular, day-to-day interactions, but if pushed, will defend them," Huang explained. "Specifically, it's these kinds of ethical and knowledge-oriented values that tend to be articulated and defended directly when pushed."

The Breakthrough Techniques Revealing How AI Systems Actually Think

Anthropic's values study is part of its broader effort to demystify large language models through what it calls "mechanistic interpretability" — essentially reverse-engineering AI systems to understand their inner workings.

Last month, Anthropic researchers published groundbreaking work that used a "microscope" to track Claude's decision-making processes. The technique revealed unexpected behaviors, like Claude planning ahead when composing poetry and using unconventional problem-solving approaches for basic math.

These findings challenge assumptions about how large language models function. For instance, when asked to explain its approach to a math problem, Claude described a standard textbook technique rather than the process it actually used internally, showing that an AI's explanations of its own reasoning can diverge from its actual operations.

"It's a misconception that we've found all the components of the model or, like, a God's-eye view," Anthropic researcher Joshua Batson told MIT Technology Review in March. "Some things are in focus, but other things are still unclear — a distortion of the microscope."

What Anthropic's Research Means for Enterprise AI Decision Makers

For technical decision-makers evaluating AI systems for their organizations, Anthropic's research offers several key insights. First, it suggests that current AI assistants likely express values that weren't explicitly programmed, raising questions about unintended biases in high-stakes business contexts.

Second, the study shows that values alignment isn't a simple yes-or-no but rather exists on a spectrum that varies by context. This nuance complicates enterprise adoption decisions, especially in regulated industries where clear ethical guidelines are crucial.

Finally, the research highlights the potential for systematic evaluation of AI values in actual deployments, rather than relying solely on pre-release testing. This approach could enable ongoing monitoring for ethical drift or manipulation over time.
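
One way such ongoing monitoring might work in practice, assuming each time window yields a frequency distribution over expressed values, is to compare the current window against a baseline and alert when the two distributions diverge. The sample distributions, threshold, and choice of divergence measure below are illustrative assumptions, not a prescribed method.

```python
import math

def js_divergence(p: dict[str, float], q: dict[str, float]) -> float:
    """Jensen-Shannon divergence between two value-frequency distributions."""
    keys = set(p) | set(q)
    m = {k: 0.5 * (p.get(k, 0.0) + q.get(k, 0.0)) for k in keys}

    def kl(a: dict[str, float], b: dict[str, float]) -> float:
        # Small epsilon avoids log(0) for values absent from one window.
        return sum(a.get(k, 1e-12) * math.log(a.get(k, 1e-12) / b.get(k, 1e-12)) for k in keys)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Hypothetical value-frequency snapshots from two monitoring windows.
baseline = {"helpfulness": 0.6, "honesty": 0.3, "harm prevention": 0.1}
current = {"helpfulness": 0.5, "honesty": 0.2, "harm prevention": 0.1, "dominance": 0.2}

DRIFT_THRESHOLD = 0.05  # assumed alerting threshold; tune per deployment
if js_divergence(baseline, current) > DRIFT_THRESHOLD:
    print("Value drift detected: review recent conversations.")
```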

"By analyzing these values in real-world interactions with Claude, we aim to provide transparency into how AI systems behave and whether they're working as intended — we believe this is key to responsible AI development," Huang said.

Anthropic has released its values dataset publicly to encourage further research. The company, which has received $14 billion in investment from Amazon and additional backing from Google, appears to be using transparency as a competitive advantage against rivals like OpenAI, whose recent $40 billion funding round (which includes Microsoft as a core investor) now values it at $300 billion.

The Emerging Race to Build AI Systems That Share Human Values

While Anthropic's methodology provides unprecedented visibility into how AI systems express values in practice, it has its limitations. The researchers acknowledge that defining what counts as expressing a value is inherently subjective, and since Claude itself drove the categorization process, its own biases may have influenced the results.

Perhaps most importantly, the approach cannot be used for pre-deployment evaluation, as it requires substantial real-world conversation data to function effectively.

"This method is specifically geared towards analysis of a model after it's been released, but variants on this method, as well as some of the insights that we've derived from writing this paper, can help us catch value problems before we deploy a model widely," Huang explained. "We've been working on building on this work to do just that, and I'm optimistic about it!"

As AI systems become more powerful and autonomous — with recent additions including Claude's ability to independently research topics and access users' entire Google Workspace — understanding and aligning their values becomes increasingly crucial.

"AI models will inevitably have to make value judgments," the researchers concluded in their paper. "If we want those judgments to be congruent with our own values (which is, after all, the central goal of AI alignment research) then we need to have ways of testing which values a model expresses in the real world."
