Nvidia’s new Llama-3.1 Nemotron Ultra outperforms DeepSeek R1 at half the size
April 13, 2025
LarryMartinez

While Meta grapples with the scrutiny surrounding its latest Llama 4 model family, Nvidia has quietly rolled out a new, fully open-source large language model (LLM) based on Meta's earlier Llama-3.1-405B-Instruct model. Named Llama-3.1-Nemotron-Ultra-253B-v1, this model boasts 253 billion parameters and is engineered to excel in advanced reasoning, instruction following, and AI assistant workflows. Nvidia first hinted at this model during its annual GPU Technology Conference (GTC) in March.
The release underscores Nvidia's ongoing commitment to enhancing performance through architectural innovation and meticulous post-training processes. Announced on April 7, 2025, the model's code, weights, and post-training data are now freely accessible on Hugging Face. It's designed to seamlessly switch between complex reasoning tasks and simpler outputs based on system prompts, offering developers flexibility in their applications.
Designed for Efficient Inference
Building on Nvidia's prior efforts in optimizing LLMs for inference, the Llama-3.1-Nemotron-Ultra-253B incorporates a Neural Architecture Search (NAS) process to refine its architecture. This includes innovative features like skipped attention layers, fused feedforward networks (FFNs), and variable FFN compression ratios. These modifications reduce the model's memory usage and computational requirements, making it deployable on a single 8x H100 GPU node without compromising output quality.
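To make the idea concrete, below is a minimal, hypothetical sketch of a decoder block in which the attention sub-layer can be skipped and the FFN width varies per layer; the class and parameter names are illustrative assumptions, not Nvidia's actual architecture code.

```python
import torch
import torch.nn as nn

class SkippableBlock(nn.Module):
    """Toy decoder block: attention can be disabled per layer and the FFN
    width can vary, mimicking NAS-derived heterogeneous layers.
    Hypothetical illustration; causal masking omitted for brevity."""

    def __init__(self, d_model: int, ffn_mult: float, use_attention: bool):
        super().__init__()
        self.use_attention = use_attention
        if use_attention:
            self.attn = nn.MultiheadAttention(d_model, num_heads=8, batch_first=True)
            self.attn_norm = nn.LayerNorm(d_model)
        hidden = int(d_model * ffn_mult)  # variable FFN compression ratio
        self.ffn = nn.Sequential(
            nn.Linear(d_model, hidden), nn.GELU(), nn.Linear(hidden, d_model)
        )
        self.ffn_norm = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if self.use_attention:
            h = self.attn_norm(x)
            attn_out, _ = self.attn(h, h, h, need_weights=False)
            x = x + attn_out
        return x + self.ffn(self.ffn_norm(x))

# A NAS procedure would search over per-layer choices like these, e.g.
# stacking SkippableBlock(4096, 2.0, True) and SkippableBlock(4096, 0.5, False),
# trading output quality against memory and latency.
```

In the released model, which layers keep attention and which FFN ratios are used is the outcome of that search, optimized against the single-node H100 deployment target.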
Nvidia claims the model delivers robust performance while being cost-effective for data center deployments. It is compatible with Nvidia's Hopper and Blackwell (B100) GPU architectures and has been tested in both BF16 and FP8 precision modes.
Post-Training for Reasoning and Alignment
The model underwent a comprehensive post-training regimen. This included supervised fine-tuning across various domains such as math, code generation, chat, and tool use, followed by reinforcement learning with Group Relative Policy Optimization (GRPO) to enhance its instruction-following and reasoning capabilities.
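For readers unfamiliar with GRPO, its core trick is to sample a group of responses per prompt and standardize each response's reward against the group's statistics, removing the need for a separate value network. A minimal sketch of that group-relative advantage, assuming scalar per-response rewards (illustrative only, not Nvidia's training code):

```python
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Group-relative advantages: each of the G responses sampled from the
    same prompt gets its reward standardized against the group mean and std.
    Illustrative sketch, not Nvidia's training code."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Example: four sampled answers to one math prompt, reward 1.0 if correct.
rewards = torch.tensor([1.0, 0.0, 0.0, 1.0])
print(grpo_advantages(rewards))  # correct answers receive positive advantage
```

These advantages then weight a clipped policy-gradient update, much as in PPO, pushing the model toward responses that beat their group's average.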
Further refinement came through a knowledge distillation phase over 65 billion tokens, and continual pretraining on an additional 88 billion tokens. The training data sources included FineWeb, Buzz-V1.2, and Dolma, with post-training prompts and responses drawn from both public corpora and synthetic generation methods. This approach helped the model differentiate between its reasoning modes.
Improved Performance Across Numerous Domains and Benchmarks
When enabled for reasoning, the model showed significant improvements on various benchmarks. For instance, on the MATH500 benchmark, its performance surged from 80.40% in standard mode to 97.00% with reasoning enabled. Similarly, AIME25 scores jumped from 16.67% to 72.50%, and LiveCodeBench results more than doubled, from 29.03% to 66.31%.
The model also excelled in tool-based tasks and graduate-level question answering (GPQA), scoring 76.01% in reasoning mode compared to 56.60% without. These benchmarks were run with a maximum sequence length of 32,000 tokens, and each test was repeated up to 16 times to average out run-to-run variance.
Compared with DeepSeek R1, a state-of-the-art mixture-of-experts (MoE) model with 671 billion total parameters, Nvidia's model holds its own despite being far smaller. It outperforms DeepSeek R1 on GPQA (76.01 vs. 71.5), IFEval instruction following (89.45 vs. 83.3), and LiveCodeBench coding tasks (66.31 vs. 65.9). However, DeepSeek R1 edges ahead slightly on certain math evaluations, particularly AIME25 (79.8 vs. 72.50) and MATH500 (97.3 vs. 97.00).
These results indicate that Nvidia's dense model can match or exceed MoE models in reasoning and general instruction alignment, though it lags slightly in math-intensive categories.
Usage and Integration
The model integrates seamlessly with the Hugging Face Transformers library (version 4.48.3 recommended) and supports sequences up to 128,000 tokens. Developers can toggle reasoning behavior using system prompts and choose decoding strategies based on task needs. For reasoning tasks, Nvidia suggests using temperature sampling (0.6) with a top-p value of 0.95, while greedy decoding is recommended for deterministic outputs.
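Putting those recommendations together, here is a minimal usage sketch with Transformers. The repository id and the "detailed thinking on/off" system-prompt toggle follow the model's Hugging Face card as commonly described; treat both as assumptions to verify against the card itself.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Repo id and system-prompt convention assumed from the model card;
# verify against the official Hugging Face page before relying on them.
model_id = "nvidia/Llama-3_1-Nemotron-Ultra-253B-v1"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # model is tested in BF16 (and FP8)
    device_map="auto",           # shards weights across the 8x H100 node
    trust_remote_code=True,      # NAS-derived custom architecture
)

# System prompt "detailed thinking on" enables reasoning mode;
# "detailed thinking off" yields shorter, direct answers.
messages = [
    {"role": "system", "content": "detailed thinking on"},
    {"role": "user", "content": "Prove that the sum of two odd numbers is even."},
]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

# Nvidia's suggested sampling for reasoning: temperature 0.6, top-p 0.95.
# For deterministic outputs, use greedy decoding (do_sample=False) instead.
output = model.generate(
    inputs, max_new_tokens=1024, do_sample=True, temperature=0.6, top_p=0.95
)
print(tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True))
```

Flipping the system prompt to "detailed thinking off" and switching to greedy decoding gives the terse, deterministic behavior described above.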
Llama-3.1-Nemotron-Ultra-253B supports multilingual applications, including English, German, French, Italian, Portuguese, Hindi, Spanish, and Thai. It's well-suited for various LLM use cases such as chatbot development, AI agent workflows, retrieval-augmented generation (RAG), and code generation.
Licensed for Commercial Use
Released under the Nvidia Open Model License and governed by the Llama 3.1 Community License Agreement, the model is ready for commercial applications. Nvidia stresses the importance of responsible AI development, urging teams to assess the model's alignment, safety, and bias for their specific use cases.
Oleksii Kuchaiev, Nvidia's Director of AI Model Post-Training, shared the excitement about this open release on X, highlighting its dense 253B design with toggleable reasoning capabilities, and the inclusion of open weights and data.
Comments (50)
KeithNelson
April 13, 2025 at 7:54:42 PM GMT
Nvidia's new model is impressive, outperforming others at half the size. It's great for those who need efficiency without sacrificing performance. The only downside is the setup can be a bit tricky. Overall, a solid choice for AI enthusiasts!
RalphMitchell
April 13, 2025 at 7:54:42 PM GMT
Nvidia's new model is impressive, outperforming other models at half the size. It's ideal for those who want efficiency, though the only drawback is that setup is a bit difficult. Overall, a good choice for AI enthusiasts!
GeorgeWilson
April 13, 2025 at 7:54:42 PM GMT
It's impressive that Nvidia's new model outperforms other models even at half the size. It's good for people who want performance without sacrificing efficiency. The only downside is that configuration is a bit finicky. Overall, a good choice for AI enthusiasts!
GeorgeNelson
April 13, 2025 at 7:54:42 PM GMT
Nvidia's new model is impressive, outperforming others at half the size. It's great for anyone who needs efficiency without sacrificing performance. The only drawback is that the setup can be a bit complicated. Overall, a good choice for AI enthusiasts!
GeorgeMiller
April 13, 2025 at 7:54:42 PM GMT
Nvidia's new model is impressive, outperforming others at half the size. It's great for those who need efficiency without sacrificing performance. The only drawback is that the configuration can be a bit complicated. Overall, a solid option for AI enthusiasts!
BrianLewis
April 13, 2025 at 5:40:08 PM GMT
Nvidia's Llama-3.1 Nemotron Ultra is impressive! It outperforms DeepSeek R1 and is half the size, which is crazy. I've been using it for my projects and it's been a game-changer. The only downside is the setup can be a bit tricky, but once you get it running, it's smooth sailing!