Wikipedia is giving AI developers its data to fend off bot scrapers

May 1, 2025

Wikipedia's New Strategy to Manage AI Data Scraping

The Wikimedia Foundation, which operates Wikipedia, is taking a proactive step to limit the strain that AI data scraping places on its servers. On Wednesday, the foundation announced a collaboration with Kaggle, a Google-owned platform for data science and machine learning, to launch a beta dataset. The dataset contains "structured Wikipedia content in English and French" and is tailored for AI training.

The dataset, now available on Kaggle, was built with AI developers in mind and simplifies access to machine-readable article data. It includes research summaries, short descriptions, image links, infobox data, and article sections. The data is openly licensed and omits references and non-textual elements such as audio files, which keeps it focused on AI use cases such as modeling, fine-tuning, and benchmarking.

Wikimedia is offering Wikipedia's content in a well-structured JSON format, which it hopes AI developers will find more appealing than scraping or parsing raw article text. The move is partly a response to the strain that AI bots have been putting on Wikipedia's servers through their bandwidth consumption.
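
To illustrate why structured JSON is easier to work with than scraped pages, here is a minimal Python sketch that flattens one article record into plain text. The record and its field names (`name`, `abstract`, `sections`, and so on) are hypothetical, chosen to mirror the content types listed above; they are not the dataset's documented schema, so check the Kaggle dataset page for the actual layout.

```python
import json

# Hypothetical record: these field names are illustrative assumptions,
# not the dataset's documented schema.
SAMPLE_LINE = json.dumps({
    "name": "Example article",
    "abstract": "A short summary of the article.",
    "description": "short description",
    "image_url": "https://example.org/image.jpg",
    "infobox": {"Founded": "2001"},
    "sections": [
        {"title": "History", "text": "First paragraph of the section."},
        {"title": "Usage", "text": "Another paragraph."},
    ],
})


def article_to_text(line: str) -> str:
    """Flatten one JSON record into plain text for training or evaluation."""
    record = json.loads(line)
    parts = [record["name"], record.get("abstract", "")]
    for section in record.get("sections", []):
        parts.append(section["title"])
        parts.append(section["text"])
    # Drop empty fields and separate the rest with blank lines.
    return "\n\n".join(p for p in parts if p)


print(article_to_text(SAMPLE_LINE))
```

With a scraped HTML page, the same step would first require stripping markup, navigation, and reference lists; here the fields are already separated.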

Wikimedia already has content sharing agreements with Google and the Internet Archive. The partnership with Kaggle, however, is expected to make the data more accessible to smaller companies and independent data scientists, broadening the reach and utility of Wikipedia's content.

What Kaggle Brings to the Table

Brenda Flynn, Kaggle's partnerships lead, expressed enthusiasm about hosting Wikimedia's data. "As the place the machine learning community comes for tools and tests, Kaggle is extremely excited to be the host for the Wikimedia Foundation’s data," she stated. Kaggle's role is crucial in keeping this data not just accessible but also relevant and useful for the machine learning community.

This strategic move by Wikipedia not only aims to ease the load on its servers but also fosters a more structured and beneficial relationship with the AI and machine learning communities.
