Chinese AI Censorship Exposed by Leaked Data
China's use of AI to enhance its censorship capabilities has reached a new level, as revealed by a leaked database containing 133,000 examples of content flagged as sensitive by the Chinese government. The database appears to be training data for a sophisticated large language model (LLM) designed to automatically detect and censor content on a wide range of topics, from poverty in rural areas to corruption within the Communist Party to subtle political satire.

This photo, taken on June 4, 2019, shows the Chinese flag behind razor wire at a housing compound in Yengisar, south of Kashgar, in China's western Xinjiang region. Image credits: Greg Baker / AFP / Getty Images
According to Xiao Qiang, a researcher at UC Berkeley who specializes in Chinese censorship, this database is "clear evidence" that the Chinese government or its affiliates are using LLMs to bolster their repression efforts. Unlike traditional methods that depend on human moderators and keyword filtering, this AI-driven approach can significantly enhance the efficiency and precision of state-controlled information management.
The dataset, discovered by security researcher NetAskari on an unsecured Elasticsearch database hosted on a Baidu server, includes recent entries from December 2024. It's unclear who exactly created the dataset, but its purpose is evident: to train an LLM to identify and flag content related to sensitive topics such as pollution, food safety, financial fraud, labor disputes, and military matters. Political satire, especially when it involves historical analogies or references to Taiwan, is also a high-priority target.
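For readers unfamiliar with how such a discovery happens: an Elasticsearch instance left exposed to the internet without authentication can be read by anyone who finds its address. The Python sketch below shows the general pattern a researcher might follow; the host address, index name, and document fields are hypothetical inventions for illustration, since the actual server's layout has not been published.

```python
# A minimal sketch of how an unsecured Elasticsearch instance exposes its
# contents. With no authentication configured, an anonymous client can list
# every index and page through documents. The host address, index name, and
# fields below are hypothetical, invented for illustration only.
from elasticsearch import Elasticsearch

# Hypothetical address of an exposed instance (no credentials required).
es = Elasticsearch("http://203.0.113.10:9200")

# Every index on the server is visible to an anonymous client.
for idx in es.cat.indices(format="json"):
    print(idx["index"], idx["docs.count"])

# Pull a sample of documents from a hypothetical index of flagged content.
resp = es.search(index="flagged_content", query={"match_all": {}}, size=5)
for hit in resp["hits"]["hits"]:
    print(hit["_source"])
```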

The training data includes various examples of content that could potentially stir social unrest, such as complaints about corrupt police officers, reports on rural poverty, and news about expelled Communist Party officials. The dataset also contains extensive references to Taiwan and military-related topics, with the Chinese word for Taiwan (台湾) appearing over 15,000 times.
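A figure like the 15,000-plus count for 台湾 is the kind of number that falls out of a simple term tally over the dumped records. A minimal sketch of how such a count could be reproduced, assuming the records were exported to a JSON-lines file with a text field — both assumptions, since the dataset's actual schema has not been published:

```python
# A minimal sketch of tallying how often sensitive terms appear across a
# dump of records. The file name and the "text" field are assumptions made
# for illustration; the leaked dataset's real schema is not public.
import json
from collections import Counter

TERMS = ["台湾", "军事"]  # "Taiwan", "military"

counts = Counter()
with open("dump.jsonl", encoding="utf-8") as f:
    for line in f:
        text = json.loads(line).get("text", "")
        for term in TERMS:
            counts[term] += text.count(term)

for term, n in counts.most_common():
    print(term, n)
```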
The dataset's intended use is described as "public opinion work," a term that Michael Caster of Article 19 explains is typically associated with the Cyberspace Administration of China (CAC) and involves censorship and propaganda efforts. This aligns with Chinese President Xi Jinping's view of the internet as the "frontline" of the Communist Party's public opinion work.
This development is part of a broader trend of authoritarian regimes adopting AI technology for repressive purposes. OpenAI recently reported that an unidentified actor, likely from China, used generative AI to monitor social media and forward anti-government posts to the Chinese government. The same technology was also used to generate critical comments about a prominent Chinese dissident, Cai Xia.
While China's traditional censorship methods rely on basic algorithms to block blacklisted terms, the use of LLMs represents a significant advancement. These AI systems can detect even subtle criticism on a massive scale and continuously improve as they process more data.
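The difference is easy to see in code. A keyword blacklist matches only literal strings, while an LLM is handed the whole post and asked to judge it, so paraphrase, satire, and historical allusion can be caught without any keyword ever appearing. A rough sketch of both approaches follows; the blacklist entries, prompt wording, and category labels are illustrative inventions, not taken from the leaked dataset:

```python
# Illustrative contrast between keyword-based censorship and LLM-based
# flagging. Blacklist entries, prompt text, and categories are invented
# for this sketch; none are drawn from the leaked dataset.

BLACKLIST = {"六四", "法轮功"}  # examples of well-known blocked terms

def keyword_filter(post: str) -> bool:
    """Traditional approach: flag only if a literal blacklisted term appears."""
    return any(term in post for term in BLACKLIST)

def build_flagging_prompt(post: str) -> str:
    """LLM approach: ask a model to judge the post as a whole. The prompt
    would be sent to whatever model the operator runs; the model call
    itself is deliberately left out of this sketch."""
    return (
        "Classify the following post into one of these categories: "
        "politics, military, social_unrest, satire, none. "
        "Flag indirect criticism, historical analogies, and satire "
        "even when no sensitive keyword appears.\n\n"
        f"Post: {post}\nCategory:"
    )

# An oblique, invented example: satire with no blacklisted term in it.
post = "Ah, the emperor has decided terms can be limitless now."
print(keyword_filter(post))         # False: no literal term matches
print(build_flagging_prompt(post))  # an LLM could still flag this as satire
```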
"I think it's crucial to highlight how AI-driven censorship is evolving, making state control over public discourse even more sophisticated, especially at a time when Chinese AI models such as DeepSeek are making headwaves," Xiao Qiang told TechCrunch.
Comments (37)
CharlesGonzalez
August 1, 2025 at 9:47:34 AM EDT
This leak is wild! 133,000 flagged posts show how deep China's AI censorship goes. It's like a digital Big Brother on steroids. 😳 Makes you wonder how much we're not seeing online.
ElijahWalker
July 22, 2025 at 3:35:51 AM EDT
This leak is wild! 133,000 flagged posts? That’s a scary peek into how AI’s being used to control speech in China. Makes you wonder how much is being filtered without us knowing. 😳
MichaelDavis
April 21, 2025 at 4:06:03 AM EDT
This tool is revealing! It shows how deep AI censorship in China goes. The database leak is a bit scary, but it's important to know what's happening behind the scenes. Definitely something everyone interested in internet freedom should know about. Keep an eye on this! 👀
SebastianAnderson
April 19, 2025 at 6:25:56 PM EDT
The leaked data on China's AI censorship is chilling. It's terrifying to think about how AI is being used to control information. We need more transparency and less censorship, don't you think? 🤔
RoyYoung
April 19, 2025 at 12:38:42 PM EDT
China's AI censorship is getting more and more out of control! 😱 The leak of 133,000 examples of flagged content shows how deep this goes. The thought of AI automatically censoring things is really frightening. We need more transparency and less control, right? 🚫
EdwardTaylor
April 19, 2025 at 11:12:20 AM EDT
This tool is truly astonishing! It really shows how deep China's AI censorship goes. The database leak is a bit scary, but it's important to know what's going on behind the scenes. A must-read for anyone interested in internet freedom! 👀