Wikipedia is giving AI developers its data to fend off bot scrapers
May 1, 2025
PeterLopez

Wikipedia's New Strategy to Manage AI Data Scraping
The Wikimedia Foundation, the nonprofit that operates Wikipedia, is moving to limit the impact of AI data scraping on its servers. On Wednesday, it announced a collaboration with Kaggle, a Google-owned data science and machine learning platform, to launch a beta dataset containing "structured Wikipedia content in English and French" built for AI training.
The dataset, now available on Kaggle, is designed to give AI developers easy access to machine-readable article data, including research summaries, short descriptions, image links, infobox data, and article sections. The data is openly licensed and excludes references and non-textual elements such as audio files, which suits it to AI use cases like modeling, fine-tuning, and benchmarking.
Wikimedia offers the content as well-structured JSON, which it hopes AI developers will prefer to scraping or parsing raw article text. The move is partly a response to the strain that AI bots put on Wikipedia's servers through the bandwidth they consume.
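To illustrate why structured JSON is easier to work with than scraped pages, here is a minimal Python sketch that turns one JSON record into plain training text and a key-value infobox. The field names (`name`, `abstract`, `sections`, `infobox`) and the sample record are assumptions made up for this example and may not match the dataset's actual schema, so check the documentation on Kaggle before relying on them.

```python
import json

# Hypothetical record: the field names below are illustrative assumptions,
# not the dataset's documented schema.
SAMPLE_LINE = json.dumps({
    "name": "Example article",
    "abstract": "A short summary of the article.",
    "infobox": [{"name": "Founded", "value": "2001"}],
    "sections": [
        {"name": "History", "text": "First paragraph."},
        {"name": "Reception", "text": "Second paragraph."},
    ],
})


def article_to_text(line: str) -> str:
    """Join the title, abstract, and section text of one JSON record."""
    record = json.loads(line)
    parts = [record.get("name", ""), record.get("abstract", "")]
    for section in record.get("sections", []):
        heading = section.get("name", "")
        body = section.get("text", "")
        parts.append(f"{heading}\n{body}")
    # Skip empty pieces so missing fields do not leave blank blocks.
    return "\n\n".join(p for p in parts if p.strip())


def infobox_to_dict(line: str) -> dict:
    """Flatten the (assumed) list of infobox entries into a plain dict."""
    record = json.loads(line)
    return {
        item["name"]: item["value"]
        for item in record.get("infobox", [])
        if "name" in item and "value" in item
    }


if __name__ == "__main__":
    print(article_to_text(SAMPLE_LINE))
    print(infobox_to_dict(SAMPLE_LINE))
```

No HTML parsing or wikitext cleanup is needed here, which is the kind of convenience the structured format is meant to provide.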
Wikimedia already has content-sharing agreements with large organizations such as Google and the Internet Archive. The Kaggle partnership is expected to make the data more accessible to smaller companies and independent data scientists.
What Kaggle Brings to the Table
Brenda Flynn, Kaggle's partnerships lead, expressed enthusiasm about hosting Wikimedia's data. "As the place the machine learning community comes for tools and tests, Kaggle is extremely excited to be the host for the Wikimedia Foundation’s data," she said. Kaggle's part in the arrangement is to keep the data accessible, relevant, and useful for the machine learning community.
The move aims both to ease the load on Wikipedia's servers and to build a more structured relationship with the AI and machine learning communities.