Chinese AI Censorship Exposed by Leaked Data
China's use of AI to enhance its censorship capabilities has reached a new level, as revealed by a leaked database containing 133,000 examples of content flagged as sensitive by the Chinese government. The database appears to be training data for a sophisticated large language model (LLM) designed to automatically detect and censor content on a wide range of topics, from poverty in rural areas to corruption within the Communist Party to subtle political satire.

This photo, taken on June 4, 2019, shows the Chinese flag behind razor wire at a housing compound in Yengisar, south of Kashgar, in China's western Xinjiang region. Image credits: Greg Baker / AFP / Getty Images
According to Xiao Qiang, a researcher at UC Berkeley who specializes in Chinese censorship, this database is "clear evidence" that the Chinese government or its affiliates are using LLMs to bolster their repression efforts. Unlike traditional methods that depend on human moderators and keyword filtering, this AI-driven approach can significantly enhance the efficiency and precision of state-controlled information management.
The dataset, discovered by security researcher NetAskari on an unsecured Elasticsearch database hosted on a Baidu server, includes recent entries from December 2024. It's unclear who exactly created the dataset, but its purpose is evident: to train an LLM to identify and flag content related to sensitive topics such as pollution, food safety, financial fraud, labor disputes, and military matters. Political satire, especially when it involves historical analogies or references to Taiwan, is also a high-priority target.
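For readers unfamiliar with how such a discovery happens: an Elasticsearch instance left exposed to the internet without authentication can be read by anyone who finds its address. The Python sketch below shows the general pattern a researcher might follow; the host address, index name, and document fields are hypothetical inventions for illustration, since the actual server's layout has not been published.

```python
# A minimal sketch of how an unsecured Elasticsearch instance exposes its
# contents. With no authentication configured, an anonymous client can list
# every index and page through documents. The host address, index name, and
# fields below are hypothetical, invented for illustration only.
from elasticsearch import Elasticsearch

# Hypothetical address of an exposed instance (no credentials required).
es = Elasticsearch("http://203.0.113.10:9200")

# Every index on the server is visible to an anonymous client.
for idx in es.cat.indices(format="json"):
    print(idx["index"], idx["docs.count"])

# Pull a sample of documents from a hypothetical index of flagged content.
resp = es.search(index="flagged_content", query={"match_all": {}}, size=5)
for hit in resp["hits"]["hits"]:
    print(hit["_source"])
```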

The training data includes various examples of content that could potentially stir social unrest, such as complaints about corrupt police officers, reports on rural poverty, and news about expelled Communist Party officials. The dataset also contains extensive references to Taiwan and military-related topics, with the Chinese word for Taiwan (台湾) appearing over 15,000 times.
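A figure like the 15,000-plus count for 台湾 is the kind of number that falls out of a simple term tally over the dumped records. A minimal sketch of how such a count could be reproduced, assuming the records were exported to a JSON-lines file with a text field — both assumptions, since the dataset's actual schema has not been published:

```python
# A minimal sketch of tallying how often sensitive terms appear across a
# dump of records. The file name and the "text" field are assumptions made
# for illustration; the leaked dataset's real schema is not public.
import json
from collections import Counter

TERMS = ["台湾", "军事"]  # "Taiwan", "military"

counts = Counter()
with open("dump.jsonl", encoding="utf-8") as f:
    for line in f:
        text = json.loads(line).get("text", "")
        for term in TERMS:
            counts[term] += text.count(term)

for term, n in counts.most_common():
    print(term, n)
```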
The dataset's intended use is described as "public opinion work," a term that Michael Caster of Article 19 explains is typically associated with the Cyberspace Administration of China (CAC) and involves censorship and propaganda efforts. This aligns with Chinese President Xi Jinping's view of the internet as the "frontline" of the Communist Party's public opinion work.
This development is part of a broader trend of authoritarian regimes adopting AI technology for repressive purposes. OpenAI recently reported that an unidentified actor, likely from China, used generative AI to monitor social media and forward anti-government posts to the Chinese government. The same technology was also used to generate critical comments about a prominent Chinese dissident, Cai Xia.
While China's traditional censorship methods rely on basic algorithms to block blacklisted terms, the use of LLMs represents a significant advancement. These AI systems can detect even subtle criticism on a massive scale and continuously improve as they process more data.
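The difference is easy to see in code. A keyword blacklist matches only literal strings, while an LLM is handed the whole post and asked to judge it, so paraphrase, satire, and historical allusion can be caught without any keyword ever appearing. A rough sketch of both approaches follows; the blacklist entries, prompt wording, and category labels are illustrative inventions, not taken from the leaked dataset:

```python
# Illustrative contrast between keyword-based censorship and LLM-based
# flagging. Blacklist entries, prompt text, and categories are invented
# for this sketch; none are drawn from the leaked dataset.

BLACKLIST = {"六四", "法轮功"}  # examples of well-known blocked terms

def keyword_filter(post: str) -> bool:
    """Traditional approach: flag only if a literal blacklisted term appears."""
    return any(term in post for term in BLACKLIST)

def build_flagging_prompt(post: str) -> str:
    """LLM approach: ask a model to judge the post as a whole. The prompt
    would be sent to whatever model the operator runs; the model call
    itself is deliberately left out of this sketch."""
    return (
        "Classify the following post into one of these categories: "
        "politics, military, social_unrest, satire, none. "
        "Flag indirect criticism, historical analogies, and satire "
        "even when no sensitive keyword appears.\n\n"
        f"Post: {post}\nCategory:"
    )

# An oblique, invented example: satire with no blacklisted term in it.
post = "Ah, the emperor has decided terms can be limitless now."
print(keyword_filter(post))         # False: no literal term matches
print(build_flagging_prompt(post))  # an LLM could still flag this as satire
```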
"I think it's crucial to highlight how AI-driven censorship is evolving, making state control over public discourse even more sophisticated, especially at a time when Chinese AI models such as DeepSeek are making headwaves," Xiao Qiang told TechCrunch.
Comments (37)
CharlesGonzalez
August 1, 2025 at 9:47:34 AM EDT
This leak is wild! 133,000 flagged posts show how deep China's AI censorship goes. It's like a digital Big Brother on steroids. 😳 Makes you wonder how much we're not seeing online.
ElijahWalker
July 22, 2025 at 3:35:51 AM EDT
This leak is wild! 133,000 flagged posts? That’s a scary peek into how AI’s being used to control speech in China. Makes you wonder how much is being filtered without us knowing. 😳
MichaelDavis
April 21, 2025 at 4:06:03 AM EDT
This tool is revealing! It shows how deep AI censorship in China goes. The database leak is a bit scary, but it's important to know what's happening behind the scenes. Definitely something everyone interested in internet freedom should know about. Keep an eye on this! 👀
SebastianAnderson
April 19, 2025 at 6:25:56 PM EDT
The leaked data on China's AI censorship is chilling. It's terrifying to think about how AI is being used to control information. We need more transparency and less censorship, don't you think? 🤔
RoyYoung
April 19, 2025 at 12:38:42 PM EDT
China's AI censorship is getting more and more out of control! 😱 The leak of 133,000 examples of flagged content shows how deep this goes. The thought of AI automatically censoring things is really frightening. We need more transparency and less control, right? 🚫
EdwardTaylor
April 19, 2025 at 11:12:20 AM EDT
This tool is truly astonishing! It really shows how deep China's AI censorship goes. The database leak is a bit scary, but it's important to know what's going on behind the scenes. A must-read for anyone interested in internet freedom! 👀