AI Crawlers Drive Wikimedia Commons Bandwidth Demand Up 50%

The Wikimedia Foundation, the umbrella organization behind Wikipedia and other crowdsourced knowledge projects, said on Wednesday that bandwidth consumption for multimedia downloads from Wikimedia Commons has risen by 50% since January 2024. The cause, according to a blog post the Foundation published on Tuesday, is not growing demand from human readers but automated scrapers collecting data to train AI models.
“Our infrastructure is designed to handle sudden surges in traffic from humans during major events, but the volume of traffic from scraper bots is unmatched and poses increasing risks and costs,” the post explains.
Wikimedia Commons serves as a freely accessible hub for images, videos, and audio files, all available under open licenses or in the public domain.
Wikimedia added that 65% of its most resource-intensive traffic, meaning the traffic that is most expensive to serve given the kind of content requested, comes from bots, even though bots account for only 35% of overall pageviews. The discrepancy, according to Wikimedia, arises because frequently accessed content is cached closer to users, while less popular content, which bots often request, must be served from the more costly "core data center."
“While human readers tend to focus on specific, often similar, topics, crawler bots tend to ‘bulk read’ a larger number of pages and visit less popular ones as well,” Wikimedia noted. “This results in these requests being forwarded to the core datacenter, which significantly increases our resource consumption costs.”
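The caching effect Wikimedia describes can be illustrated with a toy simulation. The sketch below is not Wikimedia's architecture: the `EdgeCache` class, the cache capacity, and the request patterns are all hypothetical, chosen only to show why traffic spread across many rarely requested pages misses a cache far more often than traffic concentrated on a few popular ones.

```python
from collections import OrderedDict


class EdgeCache:
    """Toy LRU cache standing in for a caching layer near users (hypothetical)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.items = OrderedDict()
        self.hits = 0
        self.misses = 0  # each miss stands for a request forwarded to the core data center

    def request(self, page_id):
        if page_id in self.items:
            # Cache hit: mark the page as most recently used.
            self.items.move_to_end(page_id)
            self.hits += 1
            return True
        # Cache miss: fetch from the core, store it, evict the least recently used page.
        self.misses += 1
        self.items[page_id] = True
        if len(self.items) > self.capacity:
            self.items.popitem(last=False)
        return False


def miss_rate(requests, capacity=100):
    """Fraction of requests that the cache could not serve."""
    cache = EdgeCache(capacity)
    for page_id in requests:
        cache.request(page_id)
    return cache.misses / len(requests)


# Assumed human-like pattern: 10,000 requests concentrated on 50 popular pages.
human_requests = [i % 50 for i in range(10_000)]
# Assumed crawler-like pattern: 10,000 requests, each for a different page ("bulk read").
bot_requests = list(range(10_000))

human_miss = miss_rate(human_requests)
bot_miss = miss_rate(bot_requests)
print(f"human-like miss rate:   {human_miss:.1%}")
print(f"crawler-like miss rate: {bot_miss:.1%}")
```

In this simplified model, the human-like pattern misses only on the first request for each popular page, while every crawler-like request misses and would reach the core. Real traffic is far messier, but the asymmetry is the one Wikimedia's figures point to.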
As a result, the Wikimedia Foundation's site reliability team is dedicating substantial time and resources to blocking these crawlers to prevent disruptions for everyday users. This doesn't even touch on the escalating cloud costs the Foundation is contending with.
This scenario is part of a broader trend that's endangering the open internet. Just last month, software engineer and open-source advocate Drew DeVault lamented that AI crawlers are blatantly ignoring “robots.txt” files intended to deter automated traffic. Similarly, Gergely Orosz, author of "The Pragmatic Engineer" newsletter, recently voiced his frustration over how AI scrapers from companies like Meta have driven up bandwidth demands for his projects.
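For readers unfamiliar with the mechanism, a “robots.txt” file only states a site's wishes; it is up to each crawler to read and honor it. The sketch below shows what a well-behaved crawler would do, using Python's standard `urllib.robotparser`. The rules, the `ExampleAIBot` user agent, and the `example.org` URLs are made up for illustration and are not taken from any real site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: block one named crawler entirely,
# and keep every other agent out of /private/ only.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def may_fetch(user_agent, url):
    """Return True if the parsed rules allow this agent to fetch the URL."""
    return parser.can_fetch(user_agent, url)


for agent, url in [
    ("ExampleAIBot", "https://example.org/images/cat.jpg"),
    ("SomeBrowser", "https://example.org/images/cat.jpg"),
    ("SomeBrowser", "https://example.org/private/notes.html"),
]:
    print(agent, url, may_fetch(agent, url))
```

Nothing in this check is enforced by the server. A crawler that skips it, or ignores the result, can still request every URL, which is the behavior DeVault describes.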
While open-source infrastructure is particularly vulnerable, developers are responding with ingenuity and determination. TechCrunch reported last week that some tech companies are stepping in as well. For instance, Cloudflare introduced AI Labyrinth, which uses AI-generated content to slow crawlers down.
Yet, it remains a constant game of cat and mouse, one that might push many publishers to retreat behind logins and paywalls, ultimately harming the open nature of the web we all rely on.
Comments (15)
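This traffic surge is way over the top! AI crawlers eating up half of Wikimedia Commons' bandwidth? No wonder downloading images has been so slow lately... But thinking about it, it makes sense: lots of AI models are frantically grabbing training data right now. If this keeps up, though, won't it drain nonprofit resources dry? I'm a bit worried about the future sustainability of open resources 😅
Incredible, a 50% increase in bandwidth for Wikimedia Commons! It shows just how much AI vacuums up everything in its path, doesn't it? 😅 I just hope it doesn't overload the servers or slow down access for regular users.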
Whoa, a 50% spike in Wikimedia Commons bandwidth? AI crawlers are eating up data like it’s an all-you-can-eat buffet! 😄 Makes me wonder how much of this is legit research vs. bots just hoarding images for some shady AI training. Anyone else curious about what’s driving this?
Wow, a 50% spike in bandwidth for Wikimedia Commons? That’s wild! AI crawlers are probably gobbling up all those images for training. Kinda cool but also makes me wonder if this is pushing the limits of what open platforms can handle. 😅
Wow, a 50% spike in bandwidth for Wikimedia Commons? That’s wild! AI crawlers are probably gobbling up all those images for training. Makes me wonder how much data these AI models are chugging through daily. 😳 Cool to see open knowledge fueling innovation, though!
