
The debate over AI benchmarking has reached Pokémon

May 3, 2025
JonathanDavis


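Not even the beloved world of Pokémon is immune to AI benchmarking drama. A post on X recently went viral, claiming that Google's latest Gemini model had surpassed Anthropic's flagship Claude model in the original Pokémon video game trilogy. According to the post, Gemini had reached Lavender Town in a developer's Twitch stream, while Claude was still stuck at Mount Moon as of late February.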

After reaching Lavender Town

Only 119 live viewers, by the way; an underrated stream pic.twitter.com/8avsovai4x

- Jush (@jush21e8), April 10, 2025

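What the post conveniently left out, however, is that Gemini had a somewhat unfair advantage. Sharp-eyed users on Reddit quickly pointed out that the developer behind the Gemini stream had built a custom minimap. The tool helps the model identify "tiles" in the game, such as cuttable trees, which greatly reduces the time Gemini has to spend analyzing screenshots before deciding on its next move.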

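Pokémon may not be the most serious AI benchmark, but it is a fun and telling example of how different setups can skew the results of these tests. Take Anthropic's latest model, Claude 3.7 Sonnet. On SWE-bench Verified, a benchmark designed to test coding ability, it achieved 62.3% accuracy. With a "custom scaffold" that Anthropic built, however, the score jumped to 70.3%.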
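To make the arithmetic concrete, here is a minimal Python sketch of treating the evaluation setup as part of a result, so that scores are only compared like for like. The `Result` fields, the helper names, and the "standard" setup label are illustrative assumptions; only the 62.3% and 70.3% figures come from the reporting above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Result:
    # All field names here are hypothetical, chosen for illustration.
    model: str
    benchmark: str
    setup: str    # e.g. "standard" or "custom scaffold" (labels are assumptions)
    score: float  # accuracy in percent


def comparable(a: Result, b: Result) -> bool:
    """Two results are like for like only if benchmark and setup both match."""
    return a.benchmark == b.benchmark and a.setup == b.setup


def gap_in_points(a: Result, b: Result) -> float:
    """Difference b - a in percentage points, rounded to one decimal place."""
    return round(b.score - a.score, 1)


baseline = Result("Claude 3.7 Sonnet", "SWE-bench Verified", "standard", 62.3)
scaffolded = Result("Claude 3.7 Sonnet", "SWE-bench Verified", "custom scaffold", 70.3)

print(comparable(baseline, scaffolded))     # False: same benchmark, different setup
print(gap_in_points(baseline, scaffolded))  # 8.0
```

The same model on the same benchmark differs by 8.0 percentage points depending on the setup alone, which is why a headline score is hard to interpret without knowing how it was produced.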

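And it doesn't stop there. Meta took one of its newer models, Llama 4 Maverick, and fine-tuned a version of it specifically for the LM Arena benchmark. The vanilla version of the model performs noticeably worse on the same test.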

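AI benchmarks, our friendly Pokémon example included, are imperfect measures to begin with, and these custom tweaks and non-standard methods make it even trickier to draw meaningful comparisons between models as they are released. Comparing apples to apples looks set to get harder.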
