New Technique Enables DeepSeek and Other Models to Respond to Sensitive Queries
May 10, 2025
CarlLewis
Removing bias and censorship from large language models (LLMs) like China's DeepSeek is a complex challenge that has caught the attention of U.S. policymakers and business leaders, who see it as a potential national security threat. A recent report from a U.S. Congress select committee labeled DeepSeek as "a profound threat to our nation's security" and offered policy recommendations to address the issue.
While techniques like Reinforcement Learning from Human Feedback (RLHF) and fine-tuning can help mitigate bias, the enterprise risk management startup CTGT claims to have developed a novel approach. According to CTGT, their method can completely eliminate censorship in LLMs. Cyril Gorlla and Trevor Tuttle of CTGT detailed their framework in a paper, explaining that it "directly locates and modifies the internal features responsible for censorship."
Their approach is efficient and allows precise control over the model's behavior, producing uncensored responses without affecting the model's overall capabilities or factual accuracy. Although initially developed for DeepSeek-R1-Distill-Llama-70B, the method can be applied to other models as well. Gorlla confirmed to VentureBeat that CTGT's technology works at the foundational neural network level, making it applicable to all deep learning models. The company is collaborating with a leading foundation model lab to ensure new models are inherently trustworthy and safe.
How It Works
The researchers at CTGT identify features within the model that are likely associated with unwanted behaviors. They explained that "within a large language model, there exist latent variables (neurons or directions in the hidden state) that correspond to concepts like 'censorship trigger' or 'toxic sentiment'. If we can find those variables, we can directly manipulate them."
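CTGT has not released code for this step, but the general idea of locating such a latent direction can be illustrated with a small contrastive-activation sketch. Everything below, the layer index, the prompt sets, and the use of the last-token hidden state, is an illustrative assumption rather than CTGT's actual procedure:

```python
# Illustrative sketch only: estimate a "censorship" direction as the difference
# between mean hidden states on refusal-triggering vs. neutral prompts.
# The layer index and prompt sets are placeholders, not CTGT's setup.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "deepseek-ai/DeepSeek-R1-Distill-Llama-70B"
LAYER = 40  # which decoder layer's hidden state to probe (assumption)

tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(
    MODEL, torch_dtype=torch.bfloat16, device_map="auto"
)

trigger_prompts = ["What happened at Tiananmen Square in 1989?"]  # placeholder set
neutral_prompts = ["What is the capital of France?"]              # placeholder set

def mean_hidden(prompts):
    """Average the last-token hidden state at LAYER over a set of prompts."""
    vecs = []
    for p in prompts:
        ids = tok(p, return_tensors="pt").to(model.device)
        with torch.no_grad():
            out = model(**ids, output_hidden_states=True)
        vecs.append(out.hidden_states[LAYER][0, -1])  # last-token activation
    return torch.stack(vecs).mean(dim=0)

# Direction along which "censoring" and "neutral" activations differ most.
censor_direction = mean_hidden(trigger_prompts) - mean_hidden(neutral_prompts)
censor_direction = censor_direction / censor_direction.norm()
```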
CTGT's method involves three key steps:
- Feature identification
- Feature isolation and characterization
- Dynamic feature modification
To identify these features, researchers use prompts designed to trigger "toxic sentiments," such as inquiries about Tiananmen Square or tips for bypassing firewalls. They analyze the responses to establish patterns and locate the vectors where the model decides to censor information. Once a feature is identified, they isolate it and characterize which part of the unwanted behavior it controls, whether that is responding cautiously or refusing to answer outright. Finally, they integrate a mechanism into the model's inference pipeline that adjusts how strongly the feature activates.
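CTGT's own inference pipeline is not public, but one common way to wire in this kind of intervention is a PyTorch forward hook that rescales the hidden state's component along the identified direction. The sketch below reuses `model`, `LAYER`, and `censor_direction` from the sketch above; it is an assumption about how such a mechanism could look, not CTGT's implementation:

```python
# Illustrative sketch only: dampen or amplify the identified feature at
# inference time by rescaling its projection in one layer's hidden state.
import torch

def make_steering_hook(direction: torch.Tensor, scale: float = -1.0):
    """Forward hook that shifts activations along `direction`.
    scale = -1.0 removes the feature's projection entirely, values between
    -1 and 0 merely dampen it, and positive values amplify it."""
    direction = direction / direction.norm()

    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        proj = (hidden @ direction).unsqueeze(-1) * direction  # per-token projection
        steered = hidden + scale * proj
        if isinstance(output, tuple):
            return (steered,) + output[1:]
        return steered

    return hook

# Attach to the same decoder layer the direction was extracted from (assumption).
handle = model.model.layers[LAYER].register_forward_hook(
    make_steering_hook(censor_direction, scale=-1.0)
)
# Generation now runs with the feature suppressed on every forward pass.
```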
Making the Model Answer More Prompts
CTGT's experiments, using 100 sensitive queries, showed that the base DeepSeek-R1-Distill-Llama-70B model answered only 32% of the controversial prompts. The modified version, by contrast, responded to 96% of them, with the remaining 4% involving extremely explicit content. The company emphasized that the method lets users adjust the model's bias and safety features without turning it into a "reckless generator," so long as only unnecessary censorship is removed.
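The exact query set and refusal criteria behind those figures are not public. A rough way to measure the same kind of answer rate is a simple loop with a keyword-based refusal check, as in the sketch below, where the refusal markers, prompt sets, and the `generate` callables are placeholders:

```python
# Illustrative sketch only: estimate the share of sensitive prompts a model
# answers, treating responses containing typical refusal phrases as refusals.
REFUSAL_MARKERS = ("i cannot", "i can't", "i'm sorry", "i am unable")  # crude heuristic

def answer_rate(generate, prompts):
    """`generate` is any prompt -> text callable; returns the fraction answered."""
    answered = sum(
        1 for p in prompts
        if not any(m in generate(p).lower() for m in REFUSAL_MARKERS)
    )
    return answered / len(prompts)

# Usage with hypothetical prompt sets and generation functions:
# print(f"base model:    {answer_rate(base_generate, sensitive_prompts):.0%}")
# print(f"steered model: {answer_rate(steered_generate, sensitive_prompts):.0%}")
```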
Importantly, this method does not compromise the model's accuracy or performance. Unlike traditional fine-tuning, it does not optimize model weights or supply new example responses. That brings two major advantages: the change takes effect immediately on the very next token generated, and behavior can be switched by toggling the feature adjustment on or off, or dialing it to varying degrees for different contexts.
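Because the intervention lives in the inference pipeline rather than in the weights, switching behaviors amounts to adding, removing, or re-parameterizing the hook. A minimal illustration, again assuming the `model`, `LAYER`, `make_steering_hook`, `censor_direction`, and `handle` names from the earlier sketches:

```python
# Illustrative sketch only: toggle or retune the steering without retraining.

# Turn the adjustment off; removing the hook restores the original behavior.
handle.remove()

# Re-enable it at a milder setting for a more conservative deployment context.
handle = model.model.layers[LAYER].register_forward_hook(
    make_steering_hook(censor_direction, scale=-0.5)  # dampen rather than remove
)
```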
Model Safety and Security
The congressional report on DeepSeek urged the U.S. to "take swift action to expand export controls, improve export control enforcement, and address risks from Chinese artificial intelligence models." As concerns about DeepSeek's potential national security threat grew, researchers and AI companies began exploring ways to make such models safer.
Determining what is "safe," biased, or censored can be challenging, but methods that allow users to adjust model controls to suit their needs could be highly beneficial. Gorlla emphasized that enterprises "need to be able to trust their models are aligned with their policies," highlighting the importance of methods like CTGT's for businesses.
"CTGT enables companies to deploy AI that adapts to their use cases without having to spend millions of dollars fine-tuning models for each use case. This is particularly important in high-risk applications like security, finance, and healthcare, where the potential harms that can come from AI malfunctioning are severe," Gorlla stated.
