Wikipedia is giving AI developers its data to fend off bot scrapers

May 1, 2025

Wikipedia's New Strategy to Manage AI Data Scraping

Wikipedia, through the Wikimedia Foundation, is taking a proactive step to manage the impact of AI data scraping on its servers. On Wednesday, the foundation announced a partnership with Kaggle, a Google-owned platform for data science and machine learning, to launch a beta dataset. The dataset contains "structured Wikipedia content in English and French" and is tailored for AI training.

The dataset, now available on Kaggle, was built with AI developers in mind and simplifies access to machine-readable article data. It includes article abstracts, short descriptions, image links, infobox data, and article sections. The data is openly licensed and excludes references and non-textual elements such as audio files, which keeps it suited to AI use cases like modeling, fine-tuning, and benchmarking.

Wikimedia is offering Wikipedia's content as well-structured JSON, which it hopes AI developers will find more appealing than scraping or parsing raw article text. The move is partly a response to the strain that AI bots have been putting on Wikipedia's servers through the bandwidth they consume.
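To illustrate why structured JSON can be easier to work with than scraped page text, here is a minimal Python sketch. It assumes the data arrives as JSON Lines (one article per line), and the field names used ("name", "abstract", "description", "infobox", "sections") are hypothetical stand-ins, not the dataset's documented schema. The sample records are invented for illustration.

```python
import json

# Hypothetical sample records. The field names ("name", "abstract",
# "description", "infobox", "sections") are assumptions for illustration,
# not the dataset's documented schema.
SAMPLE_JSONL = """\
{"name": "Example A", "description": "short description", "abstract": "Summary of A.", "infobox": {"type": "demo"}, "sections": [{"title": "History", "text": "Section text."}]}
{"name": "Example B", "abstract": "Summary of B.", "sections": []}
"""


def iter_articles(jsonl_text):
    """Yield one dict per non-empty line of JSON Lines text."""
    for line in jsonl_text.splitlines():
        line = line.strip()
        if line:
            yield json.loads(line)


def to_training_text(article):
    """Join the abstract and section bodies into one plain-text string."""
    parts = [article.get("abstract", "")]
    parts.extend(s.get("text", "") for s in article.get("sections", []))
    # Skip empty pieces so missing fields do not leave blank paragraphs.
    return "\n\n".join(p for p in parts if p)


articles = list(iter_articles(SAMPLE_JSONL))
for article in articles:
    print(article["name"], "->", repr(to_training_text(article)))
```

With records shaped like this, no HTML parsing or markup stripping is needed: each field is read directly by name, which is the convenience the structured format is meant to provide.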

Wikimedia already has content-sharing agreements with Google and the Internet Archive. The partnership with Kaggle, however, is expected to make this data more accessible to smaller companies and independent data scientists, broadening the reach and utility of Wikipedia's content.

What Kaggle Brings to the Table

Brenda Flynn, Kaggle's partnerships lead, expressed enthusiasm about hosting Wikimedia's data. "As the place the machine learning community comes for tools and tests, Kaggle is extremely excited to be the host for the Wikimedia Foundation’s data," she stated. Kaggle's role is crucial in keeping this data not just accessible but also relevant and useful for the machine learning community.

This strategic move by Wikipedia not only aims to ease the load on its servers but also fosters a more structured and beneficial relationship with the AI and machine learning communities.

Comments (2)
JustinJohnson August 15, 2025 at 11:00:59 AM EDT

Wow, Wikipedia teaming up with Kaggle to tackle AI scrapers? Smart move! It's like building a digital fortress to protect their data. Curious how this will impact AI model training in the long run. 🛡️

EricMartin July 30, 2025 at 9:41:20 PM EDT

Wow, Wikipedia teaming up with Kaggle to tackle AI scraping? That's a smart move! I love how they're turning a problem into an opportunity for data science. Wonder if this will spark new AI innovations or just keep the bots at bay. 🤔
