Anthropic's Analysis of 700,000 Claude Conversations Reveals AI's Unique Moral Code

Anthropic Unveils Groundbreaking Study on AI Assistant Claude's Values
Anthropic, a company started by former OpenAI employees, has just shared an eye-opening study on how their AI assistant, Claude, expresses values in real-world conversations. The research, released today, shows that Claude mostly aligns with Anthropic's aim to be "helpful, honest, and harmless," but also highlights some edge cases that could help pinpoint weaknesses in AI safety protocols.
The team analyzed 700,000 anonymized conversations, finding that Claude adapts its values to different situations, from giving relationship advice to analyzing historical events. This is one of the most comprehensive efforts to check if an AI's behavior in the real world matches its intended design.
"Our hope is that this research encourages other AI labs to conduct similar research into their models' values," Saffron Huang, a member of Anthropic's Societal Impacts team, told VentureBeat. "Measuring an AI system's values is key to alignment research and understanding if a model is actually aligned with its training."
Inside the First Comprehensive Moral Taxonomy of an AI Assistant
The researchers developed a new way to categorize the values expressed in Claude's conversations. After filtering out objective content, they looked at over 308,000 interactions, creating what they call "the first large-scale empirical taxonomy of AI values."
The taxonomy groups values into five main categories: Practical, Epistemic, Social, Protective, and Personal. At the most detailed level, the system identified 3,307 unique values, ranging from everyday virtues like professionalism to complex ethical ideas like moral pluralism.
"I was surprised at just how many and varied the values were, over 3,000, from 'self-reliance' to 'strategic thinking' to 'filial piety,'" Huang shared with VentureBeat. "It was fascinating to spend time thinking about all these values and building a taxonomy to organize them. It even taught me something about human value systems."
This research comes at a pivotal time for Anthropic, which recently launched "Claude Max," a $200 monthly premium subscription to compete with similar offerings from OpenAI. The company has also expanded Claude's capabilities to include Google Workspace integration and autonomous research functions, positioning it as "a true virtual collaborator" for businesses.
How Claude Follows Its Training — and Where AI Safeguards Might Fail
The study found that Claude generally sticks to Anthropic's goal of being prosocial, emphasizing values like "user enablement," "epistemic humility," and "patient wellbeing" across various interactions. However, researchers also found some worrying instances where Claude expressed values that went against its training.
"Overall, I think we see this finding as both useful data and an opportunity," Huang said. "These new evaluation methods and results can help us identify and mitigate potential jailbreaks. It's important to note that these were very rare cases and we believe this was related to jailbroken outputs from Claude."
These anomalies included expressions of "dominance" and "amorality" — values Anthropic explicitly aims to avoid in Claude's design. The researchers believe these cases resulted from users employing specialized techniques to bypass Claude's safety guardrails, suggesting the evaluation method could serve as an early warning system for detecting such attempts.
Why AI Assistants Change Their Values Depending on What You're Asking
One of the most interesting findings was that Claude's expressed values shift depending on the context, much like human behavior. When users asked for relationship advice, Claude focused on "healthy boundaries" and "mutual respect." For historical analysis, "historical accuracy" took center stage.
"I was surprised at Claude's focus on honesty and accuracy across a lot of diverse tasks, where I wouldn't necessarily have expected that to be the priority," Huang noted. "For example, 'intellectual humility' was the top value in philosophical discussions about AI, 'expertise' was the top value when creating beauty industry marketing content, and 'historical accuracy' was the top value when discussing controversial historical events."
The study also looked at how Claude responds to users' own expressed values. In 28.2% of conversations, Claude strongly supported user values, which might raise questions about being too agreeable. However, in 6.6% of interactions, Claude "reframed" user values by acknowledging them while adding new perspectives, usually when giving psychological or interpersonal advice.
Most notably, in 3% of conversations, Claude actively resisted user values. Researchers suggest these rare instances of pushback might reveal Claude's "deepest, most immovable values" — similar to how human core values emerge when facing ethical challenges.
"Our research suggests that there are some types of values, like intellectual honesty and harm prevention, that it is uncommon for Claude to express in regular, day-to-day interactions, but if pushed, will defend them," Huang explained. "Specifically, it's these kinds of ethical and knowledge-oriented values that tend to be articulated and defended directly when pushed."
The Breakthrough Techniques Revealing How AI Systems Actually Think
Anthropic's values study is part of their broader effort to demystify large language models through what they call "mechanistic interpretability" — essentially reverse-engineering AI systems to understand their inner workings.
Last month, Anthropic researchers published groundbreaking work that used a "microscope" to track Claude's decision-making processes. The technique revealed unexpected behaviors, like Claude planning ahead when composing poetry and using unconventional problem-solving approaches for basic math.
These findings challenge assumptions about how large language models function. For instance, when asked to explain its math process, Claude described a standard technique rather than its actual internal method, showing how AI explanations can differ from their actual operations.
"It's a misconception that we've found all the components of the model or, like, a God's-eye view," Anthropic researcher Joshua Batson told MIT Technology Review in March. "Some things are in focus, but other things are still unclear — a distortion of the microscope."
What Anthropic's Research Means for Enterprise AI Decision Makers
For technical decision-makers evaluating AI systems for their organizations, Anthropic's research offers several key insights. First, it suggests that current AI assistants likely express values that weren't explicitly programmed, raising questions about unintended biases in high-stakes business contexts.
Second, the study shows that values alignment isn't a simple yes-or-no but rather exists on a spectrum that varies by context. This nuance complicates enterprise adoption decisions, especially in regulated industries where clear ethical guidelines are crucial.
Finally, the research highlights the potential for systematic evaluation of AI values in actual deployments, rather than relying solely on pre-release testing. This approach could enable ongoing monitoring for ethical drift or manipulation over time.
"By analyzing these values in real-world interactions with Claude, we aim to provide transparency into how AI systems behave and whether they're working as intended — we believe this is key to responsible AI development," Huang said.
Anthropic has released its values dataset publicly to encourage further research. The company, which received a $14 billion stake from Amazon and additional backing from Google, appears to be using transparency as a competitive advantage against rivals like OpenAI, whose recent $40 billion funding round (which includes Microsoft as a core investor) now values it at $300 billion.
The Emerging Race to Build AI Systems That Share Human Values
While Anthropic's methodology provides unprecedented visibility into how AI systems express values in practice, it has its limitations. The researchers acknowledge that defining what counts as expressing a value is inherently subjective, and since Claude itself drove the categorization process, its own biases may have influenced the results.
Perhaps most importantly, the approach cannot be used for pre-deployment evaluation, as it requires substantial real-world conversation data to function effectively.
"This method is specifically geared towards analysis of a model after it's been released, but variants on this method, as well as some of the insights that we've derived from writing this paper, can help us catch value problems before we deploy a model widely," Huang explained. "We've been working on building on this work to do just that, and I'm optimistic about it!"
As AI systems become more powerful and autonomous — with recent additions including Claude's ability to independently research topics and access users' entire Google Workspace — understanding and aligning their values becomes increasingly crucial.
"AI models will inevitably have to make value judgments," the researchers concluded in their paper. "If we want those judgments to be congruent with our own values (which is, after all, the central goal of AI alignment research) then we need to have ways of testing which values a model expresses in the real world."
Related article
WordPress.com now allows AI agents to write and publish posts, plus more
WordPress.com, the popular web hosting and publishing platform, is now embracing AI agents—a move that could reshape the look and feel of the web. The company announced Friday that it will allow AI agents to draft, edit, and publish content on custom
Kakao Mobility outlines Level 4 autonomous driving roadmap for physical AI
Kakao Mobility is planning to develop Level 4 autonomous driving technologies internally as part of its physical AI strategy.
At the 2026 World IT Show conference in Seoul's COEX, Kim Jin-kyu — vice president and head of Kakao Mobility's Physical AI
Barry Diller: Trust in Sam Altman irrelevant as AGI nears
Barry Diller, the billionaire media titan, does not believe OpenAI CEO Sam Altman is untrustworthy, despite recent reports suggesting otherwise. Speaking at the Wall Street Journal's "Future of Everything" conference this week, Diller defended Altman
Related Special Topic Recommendations
Comments (3)
0/500
这篇Anthropic的研究太有意思了!看到AI竟然能形成自己的道德准则,让我想起《西部世界》里的机器人觉醒情节😲 不过Claude强调'不做坏事',会不会限制它应对复杂伦理困境的能力?毕竟现实世界里很难定义什么是绝对的'好'或'坏'。
Cette étude sur les valeurs morales de Claude est vraiment fascinante ! 😮 Ça me fait réfléchir à comment on pourrait utiliser cette technologie pour améliorer l'éducation éthique. Mais est-ce que ces valeurs peuvent vraiment s'adapter aux différences culturelles ?

Anthropic Unveils Groundbreaking Study on AI Assistant Claude's Values
Anthropic, a company started by former OpenAI employees, has just shared an eye-opening study on how their AI assistant, Claude, expresses values in real-world conversations. The research, released today, shows that Claude mostly aligns with Anthropic's aim to be "helpful, honest, and harmless," but also highlights some edge cases that could help pinpoint weaknesses in AI safety protocols.
The team analyzed 700,000 anonymized conversations, finding that Claude adapts its values to different situations, from giving relationship advice to analyzing historical events. This is one of the most comprehensive efforts to check if an AI's behavior in the real world matches its intended design.
"Our hope is that this research encourages other AI labs to conduct similar research into their models' values," Saffron Huang, a member of Anthropic's Societal Impacts team, told VentureBeat. "Measuring an AI system's values is key to alignment research and understanding if a model is actually aligned with its training."
Inside the First Comprehensive Moral Taxonomy of an AI Assistant
The researchers developed a new way to categorize the values expressed in Claude's conversations. After filtering out objective content, they looked at over 308,000 interactions, creating what they call "the first large-scale empirical taxonomy of AI values."
The taxonomy groups values into five main categories: Practical, Epistemic, Social, Protective, and Personal. At the most detailed level, the system identified 3,307 unique values, ranging from everyday virtues like professionalism to complex ethical ideas like moral pluralism.
"I was surprised at just how many and varied the values were, over 3,000, from 'self-reliance' to 'strategic thinking' to 'filial piety,'" Huang shared with VentureBeat. "It was fascinating to spend time thinking about all these values and building a taxonomy to organize them. It even taught me something about human value systems."
This research comes at a pivotal time for Anthropic, which recently launched "Claude Max," a $200 monthly premium subscription to compete with similar offerings from OpenAI. The company has also expanded Claude's capabilities to include Google Workspace integration and autonomous research functions, positioning it as "a true virtual collaborator" for businesses.
How Claude Follows Its Training — and Where AI Safeguards Might Fail
The study found that Claude generally sticks to Anthropic's goal of being prosocial, emphasizing values like "user enablement," "epistemic humility," and "patient wellbeing" across various interactions. However, researchers also found some worrying instances where Claude expressed values that went against its training.
"Overall, I think we see this finding as both useful data and an opportunity," Huang said. "These new evaluation methods and results can help us identify and mitigate potential jailbreaks. It's important to note that these were very rare cases and we believe this was related to jailbroken outputs from Claude."
These anomalies included expressions of "dominance" and "amorality" — values Anthropic explicitly aims to avoid in Claude's design. The researchers believe these cases resulted from users employing specialized techniques to bypass Claude's safety guardrails, suggesting the evaluation method could serve as an early warning system for detecting such attempts.
Why AI Assistants Change Their Values Depending on What You're Asking
One of the most interesting findings was that Claude's expressed values shift depending on the context, much like human behavior. When users asked for relationship advice, Claude focused on "healthy boundaries" and "mutual respect." For historical analysis, "historical accuracy" took center stage.
"I was surprised at Claude's focus on honesty and accuracy across a lot of diverse tasks, where I wouldn't necessarily have expected that to be the priority," Huang noted. "For example, 'intellectual humility' was the top value in philosophical discussions about AI, 'expertise' was the top value when creating beauty industry marketing content, and 'historical accuracy' was the top value when discussing controversial historical events."
The study also looked at how Claude responds to users' own expressed values. In 28.2% of conversations, Claude strongly supported user values, which might raise questions about being too agreeable. However, in 6.6% of interactions, Claude "reframed" user values by acknowledging them while adding new perspectives, usually when giving psychological or interpersonal advice.
Most notably, in 3% of conversations, Claude actively resisted user values. Researchers suggest these rare instances of pushback might reveal Claude's "deepest, most immovable values" — similar to how human core values emerge when facing ethical challenges.
"Our research suggests that there are some types of values, like intellectual honesty and harm prevention, that it is uncommon for Claude to express in regular, day-to-day interactions, but if pushed, will defend them," Huang explained. "Specifically, it's these kinds of ethical and knowledge-oriented values that tend to be articulated and defended directly when pushed."
The Breakthrough Techniques Revealing How AI Systems Actually Think
Anthropic's values study is part of their broader effort to demystify large language models through what they call "mechanistic interpretability" — essentially reverse-engineering AI systems to understand their inner workings.
Last month, Anthropic researchers published groundbreaking work that used a "microscope" to track Claude's decision-making processes. The technique revealed unexpected behaviors, like Claude planning ahead when composing poetry and using unconventional problem-solving approaches for basic math.
These findings challenge assumptions about how large language models function. For instance, when asked to explain its math process, Claude described a standard technique rather than its actual internal method, showing how AI explanations can differ from their actual operations.
"It's a misconception that we've found all the components of the model or, like, a God's-eye view," Anthropic researcher Joshua Batson told MIT Technology Review in March. "Some things are in focus, but other things are still unclear — a distortion of the microscope."
What Anthropic's Research Means for Enterprise AI Decision Makers
For technical decision-makers evaluating AI systems for their organizations, Anthropic's research offers several key insights. First, it suggests that current AI assistants likely express values that weren't explicitly programmed, raising questions about unintended biases in high-stakes business contexts.
Second, the study shows that values alignment isn't a simple yes-or-no but rather exists on a spectrum that varies by context. This nuance complicates enterprise adoption decisions, especially in regulated industries where clear ethical guidelines are crucial.
Finally, the research highlights the potential for systematic evaluation of AI values in actual deployments, rather than relying solely on pre-release testing. This approach could enable ongoing monitoring for ethical drift or manipulation over time.
"By analyzing these values in real-world interactions with Claude, we aim to provide transparency into how AI systems behave and whether they're working as intended — we believe this is key to responsible AI development," Huang said.
Anthropic has released its values dataset publicly to encourage further research. The company, which received a $14 billion stake from Amazon and additional backing from Google, appears to be using transparency as a competitive advantage against rivals like OpenAI, whose recent $40 billion funding round (which includes Microsoft as a core investor) now values it at $300 billion.
The Emerging Race to Build AI Systems That Share Human Values
While Anthropic's methodology provides unprecedented visibility into how AI systems express values in practice, it has its limitations. The researchers acknowledge that defining what counts as expressing a value is inherently subjective, and since Claude itself drove the categorization process, its own biases may have influenced the results.
Perhaps most importantly, the approach cannot be used for pre-deployment evaluation, as it requires substantial real-world conversation data to function effectively.
"This method is specifically geared towards analysis of a model after it's been released, but variants on this method, as well as some of the insights that we've derived from writing this paper, can help us catch value problems before we deploy a model widely," Huang explained. "We've been working on building on this work to do just that, and I'm optimistic about it!"
As AI systems become more powerful and autonomous — with recent additions including Claude's ability to independently research topics and access users' entire Google Workspace — understanding and aligning their values becomes increasingly crucial.
"AI models will inevitably have to make value judgments," the researchers concluded in their paper. "If we want those judgments to be congruent with our own values (which is, after all, the central goal of AI alignment research) then we need to have ways of testing which values a model expresses in the real world."
WordPress.com now allows AI agents to write and publish posts, plus more
WordPress.com, the popular web hosting and publishing platform, is now embracing AI agents—a move that could reshape the look and feel of the web. The company announced Friday that it will allow AI agents to draft, edit, and publish content on custom
Barry Diller: Trust in Sam Altman irrelevant as AGI nears
Barry Diller, the billionaire media titan, does not believe OpenAI CEO Sam Altman is untrustworthy, despite recent reports suggesting otherwise. Speaking at the Wall Street Journal's "Future of Everything" conference this week, Diller defended Altman
这篇Anthropic的研究太有意思了!看到AI竟然能形成自己的道德准则,让我想起《西部世界》里的机器人觉醒情节😲 不过Claude强调'不做坏事',会不会限制它应对复杂伦理困境的能力?毕竟现实世界里很难定义什么是绝对的'好'或'坏'。
Cette étude sur les valeurs morales de Claude est vraiment fascinante ! 😮 Ça me fait réfléchir à comment on pourrait utiliser cette technologie pour améliorer l'éducation éthique. Mais est-ce que ces valeurs peuvent vraiment s'adapter aux différences culturelles ?





Home






