DataGemma Tackles AI Hallucinations with Real-World Data

Large language models (LLMs) are at the heart of today's AI breakthroughs, capable of sifting through massive text datasets to produce summaries, spark creative ideas, and even write code. Yet, despite their prowess, these models can sometimes deliver information that's just plain wrong, a problem we call "hallucination." It's a big hurdle in the world of generative AI.
We're excited to share some cutting-edge research that's tackling this issue head-on, aiming to curb hallucinations by grounding LLMs in real-world stats. And we're thrilled to introduce DataGemma, the first open models that link LLMs with a wealth of real-world data from Google's Data Commons.
Data Commons: A Treasure Trove of Trustworthy Data
Data Commons is like a giant, ever-growing library of public data, boasting over 240 billion data points on everything from health to economics. It pulls this information from reliable sources like the UN, WHO, CDC, and national census bureaus. By organizing these datasets into a single set of tools and AI models, Data Commons helps policymakers, researchers, and organizations get the accurate insights they need.
Imagine a vast database where you can ask questions in plain English, like which African countries have seen the biggest jump in electricity access, or how income relates to diabetes across US counties. That's Data Commons for you.
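To give a feel for what programmatic access looks like, here is a minimal sketch using the Data Commons Python client (the `datacommons` package on PyPI). The place and variable identifiers (DCIDs) shown are common, well-known examples; a real application would look up the exact DCIDs it needs in the Data Commons catalog, and some endpoints may require an API key.

```python
# Minimal sketch: pulling a statistic from Data Commons with the Python client.
# pip install datacommons
import datacommons as dc

# DCIDs identify places and statistical variables in the Data Commons graph.
# "country/USA" and "Count_Person" are well-known identifiers; other variables
# (e.g. electricity access or diabetes prevalence) would be looked up first.
place = "country/USA"
stat_var = "Count_Person"

# Latest observed value for this variable at this place.
population = dc.get_stat_value(place, stat_var)
print(f"Latest {stat_var} for {place}: {population}")

# Full time series, useful for questions like "has X increased over time?"
series = dc.get_stat_series(place, stat_var)
for date, value in sorted(series.items()):
    print(date, value)
```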
How Data Commons Helps Fight Hallucination
As more folks turn to generative AI, we're working to make these experiences more grounded by weaving Data Commons into Gemma, our family of lightweight, top-notch open models. These DataGemma models are now available for researchers and developers to dive into.
DataGemma boosts Gemma's capabilities by tapping into Data Commons' knowledge, using two cool methods to improve the accuracy and reasoning of LLMs:
RIG (Retrieval-Interleaved Generation) amps up our Gemma 2 model by actively checking facts against Data Commons. When you ask DataGemma a question, it hunts down statistical data from Data Commons to give you a solid answer. While RIG isn't a new idea, the way we're using it in DataGemma is pretty special.
Example query: "Has the use of renewables increased in the world?" Applying the DataGemma RIG methodology leverages Data Commons (DC) for authoritative data.
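The research paper spells out the actual fine-tuning and query format, but the overall RIG flow can be sketched roughly as follows. The inline `[DC(...)]` placeholder syntax and the `query_data_commons` helper are illustrative assumptions, not the published DataGemma interface: the idea is that the model drafts an answer with its statistical claims marked, and each marked claim is resolved against Data Commons before the text is returned.

```python
# Rough sketch of a RIG-style post-processing step (placeholder syntax assumed).
import re

def query_data_commons(nl_query: str) -> str:
    # Stub: a real implementation would call a Data Commons statistics
    # endpoint and return the retrieved value along with its source.
    return "<value retrieved from Data Commons>"

def resolve_rig_output(model_text: str) -> str:
    # Assume the fine-tuned model marks claims it wants checked as
    # [DC("question") -> "model's own guess"]. Each marker is replaced with
    # the value fetched from Data Commons, keeping the draft for comparison.
    pattern = re.compile(r'\[DC\("(?P<query>[^"]+)"\)\s*->\s*"(?P<guess>[^"]+)"\]')

    def substitute(match: re.Match) -> str:
        grounded = query_data_commons(match.group("query"))
        return f'{grounded} (Data Commons; model draft: {match.group("guess")})'

    return pattern.sub(substitute, model_text)

draft = ('Renewables supplied [DC("share of global electricity from renewables, '
         '2021") -> "28%"] of electricity.')
print(resolve_rig_output(draft))
```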
RAG (Retrieval-Augmented Generation) lets language models pull in extra info beyond what they've been trained on, making their answers richer and more accurate. With DataGemma, we use Gemini 1.5 Pro's long context window to fetch relevant data from Data Commons before the model starts crafting its response, cutting down on hallucinations.
Example query: "Has the use of renewables increased in the world?" Applying the DataGemma RAG methodology showcases deeper reasoning and the inclusion of footnotes.
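Again as a rough, hedged sketch rather than the published pipeline (the helper names and the generation call are stand-ins): the question is first turned into Data Commons lookups, the retrieved tables are placed in the prompt, and only then is the long-context model asked to answer with citations back to those tables.

```python
# Rough RAG sketch: retrieve Data Commons tables first, then generate.
# fetch_relevant_tables() stands in for the retrieval step (mapping the
# question to Data Commons statistical variables and pulling their series);
# generate() stands in for a call to a long-context model such as Gemini 1.5 Pro.

def fetch_relevant_tables(question: str) -> list[str]:
    # Placeholder: a real implementation would query Data Commons and format
    # each result as a small table with its source noted.
    return [
        "Table 1 | Share of electricity from renewables, world, 2010-2021 | source: Data Commons",
    ]

def generate(prompt: str) -> str:
    # Placeholder for the model call (e.g. the Gemini API).
    return "<model answer grounded in the tables above>"

def answer_with_rag(question: str) -> str:
    tables = fetch_relevant_tables(question)
    prompt = (
        "Answer the question using only the statistics in the tables below, "
        "and cite the table you used for each figure.\n\n"
        + "\n".join(tables)
        + f"\n\nQuestion: {question}"
    )
    return generate(prompt)

print(answer_with_rag("Has the use of renewables increased in the world?"))
```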
Promising Results and What's Next
Our early tests with RIG and RAG are looking good. We're seeing better accuracy in our models when dealing with numbers, which means fewer hallucinations for folks using these models for research, decision-making, or just to satisfy their curiosity. You can check out these results in our research paper.
Illustration of a RAG query and response. Supporting ground truth statistics are referenced as tables served from Data Commons. *Partial response shown for brevity.
We're not stopping here. We're all in on refining these methods, scaling up our efforts, and putting them through the wringer with more tests. Eventually, we'll roll out these improvements to both Gemma and Gemini models, starting with a limited-access phase.
By sharing our research and making this new Gemma model variant open, we hope to spread the use of these Data Commons-based techniques far and wide. Making LLMs more reliable and trustworthy is crucial for turning them into essential tools for everyone, helping to build a future where AI gives people accurate info, supports informed choices, and deepens our understanding of the world.
Researchers and developers can jump right in with DataGemma using our quickstart notebooks for both RIG and RAG. To dive deeper into how Data Commons and Gemma work together, check out our Research post.
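As a starting point outside the notebooks, the models load like any other Gemma variant through Hugging Face Transformers. The model ID below matches the naming used at release, but check the model cards for the current identifiers and hardware requirements; the 27B weights need a large GPU or quantization.

```python
# Minimal sketch: loading a DataGemma model with Hugging Face Transformers.
# Model ID reflects the naming at release; confirm against the model card.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "google/datagemma-rag-27b-it"  # or "google/datagemma-rig-27b-it"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",           # spread the 27B weights across available GPUs
    torch_dtype=torch.bfloat16,  # half precision to reduce memory use
)

prompt = "Has the use of renewables increased in the world?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:],
                       skip_special_tokens=True))
```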
Comments (37)
StephenScott
August 8, 2025 at 5:00:59 AM EDT
This article on DataGemma is super intriguing! I love how it dives into fixing AI hallucinations with real-world data. Makes me wonder if we’ll finally get models that don’t spit out random nonsense. 😄 Anyone else excited about this?
ArthurYoung
July 29, 2025 at 8:25:16 AM EDT
This article on DataGemma is super intriguing! It's wild how LLMs can churn out so much but still trip over facts. Excited to see how real-world data could make AI less of a fibber! 😄
RalphJohnson
April 21, 2025 at 12:26:32 AM EDT
DataGemma is a real help! It reins in AI hallucinations with real-world data, so it's like the AI comes with its own fact-checker. It would be perfect if processing were a bit faster, but it's still a great tool! 👍
WillieAnderson
April 17, 2025 at 5:10:42 PM EDT
DataGemma really helps! It cuts down on AI hallucinations with real-world data, so it's like the AI has a built-in fact-checker. I wish processing were a little faster, but it's still an excellent tool! 👍
JosephGreen
April 16, 2025 at 4:14:53 PM EDT
DataGemma is a lifesaver! It really cuts down on those annoying AI hallucinations by grounding the models in real-world data. It's like having a fact-checker for my AI buddy. Only wish it was a bit faster at processing, but still, it's a solid tool! 👍
LeviKing
April 13, 2025 at 4:47:31 PM EDT
DataGemma's approach to tackling AI hallucinations is really impressive. Using real-world data to keep the AI in check is great. I do wonder whether it actually solves the problem or just papers over it, but either way it's a step in the right direction. Keep it up!