Large language models (LLMs), which are no longer just text generators, have evolved into something more. The agentic system —able to plan, reason, and autonomously act—there is a significant increase in both their capabilities and associated risks. Agentic AI is rapidly being adopted by enterprises for automation. However, this poses new challenges to organizations: Unintended behavior, goal misalignment and data leakage are all examples of unintended behavior, prompt injection.. NVIDIA responded to these concerns by releasing an open-source software suite and a post-training safety recipe designed to safeguard agentic AI systems throughout their lifecycle.
Agentic AI: The need for safety
Agentic LLMs are able to function with a great deal of autonomy because they use advanced tools and reasoning. This autonomy may lead to the following:
- Content moderation failures (e.g. the generation of harmful or toxic outputs)
- Security flaws (prompt injection, jailbreak attempts)
- Risks associated with compliance and trust Inability to adhere to enterprise policy or regulatory standards
As attackers’ techniques and models rapidly change, traditional guardrails are often ineffective. Businesses need to align open-models with their internal policies, external regulations and lifecycle strategies.
NVIDIA Safety Recipe Overview and Architecture
NVIDIA’s Agentic AI Safety Recipe provides an overview of the safety features that NVIDIA offers. comprehensive end-to-end framework Before, during and after deployment, it is important to align, evaluate and secure LLMs.
- You can also check out our website for more information.The recipe allows testing of enterprise policies and security thresholds, as well as benchmarks, using public datasets.
- After-Training AlignmentModels can be further aligned to safety standards by using Reinforcement learning (RL), supervised fine-tuning (SFT) or blends of policy datasets.
- Continuous ProtectionNVIDIA NeMo’s Guardrails microservices for real-time surveillance and monitoring are available after deployment. These guardrails can be programmed to block unsafe outputs as well as defend against attempts at jailbreak and prompt injections.
Core Components
| Stage | Technology/Tools | Usefulness |
|---|---|---|
| Pre-Deployment evaluation | Garak scanner, WildGuardMix and Nemotron Content Safety Dataset | Safety/security test |
| After-Training Alignment | Open license data | Fine-tune safety/alignment |
| Deployment & Inference | NeMo Guardrails and NIM Microservices (content safety, topic control, Jailbreak detection) | Stop unsafe behavior |
| Monitoring & Feedback | Garak real-time analysis | Detect/resist new attacks |
Open Datasets Benchmarks
- Nemotron Content Safety Dataset v2: Use this dataset to evaluate a range of potentially harmful behaviors before and after training.
- WildGuardMix dataset Use content moderation to target content across all ambiguous or adversarial prompts.
- Aegis content safety dataset Over 35,000 sample annotations, to enable fine-grained classifier and filter development for LLM tasks.
The Post-Training Experience
NVIDIA’s safety recipe is made available to all participants as a post-training guide. Open-source Jupyter Notebook Or as a cloud module that can be launched, which ensures transparency and wide accessibility. The typical workflow includes:
- Initial Model Evaluation: Open benchmarks for safety/security baseline testing.
- The On-Policy Safety Training Open datasets, reinforcement learning, supervised tuning, and response generation using the model.
- Re-evaluation: Once the training is completed, you can rerun your safety/security benchmarks and confirm any improvements.
- Deployment: Trusted model deployment with guardrail microservices and live monitoring.
Quantitative impact
- Content Security: Improved from 88% to 94% after applying the NVIDIA safety post-training recipe—a 6% gain, with no measurable loss of accuracy.
- Safety and Security of Products: Improved resilience against adversarial prompts (jailbreaks etc.) Increased from 56% up to 63%.
Collaboration and Ecosystem Integrative
NVIDIA’s approach goes beyond internal tools—Partnering Leading cybersecurity providers such as Cisco AI Defense, CrowdStrike and Active Fence (Trend Micro, Active Fence), enable the integration of safety signals continuously throughout AI’s lifecycle.
Start Here
- Open Source AccessDownload the full recipe for safety evaluation and after-training (tools and datasets) and deploy it in cloud.
- Custom Policy Alignment: Enterprises can define custom business policies, risk thresholds, and regulatory requirements—using the recipe to align models accordingly.
- Iterative HardeningAfter training, evaluate, then reevaluate and deploy the model as new risk emerges, to ensure its trustworthiness.
You can also read our conclusion.
NVIDIA’s Safety Recipe for Agentic LLMs is a safety-oriented approach to agentic LLMs. Openly accessible, industry-first approach To harden LLMs to modern AI risk. Through the implementation of robust, transparent and extensible safety protocol, enterprise can use agentic AI in a secure and compliant manner.
Click here to find out more NVIDIA AI safety recipe You can also find out more about the following: Technical details. The researchers are the sole owners of all credit. Also, feel free to follow us on Twitter Don’t forget about our 100k+ ML SubReddit Subscribe Now our Newsletter.
Marktechpost can help you promote your AI Product and place it in front AI Devs and Data Engineers.
Ans: Marktechpost will help you promote your AI products by publishing sponsored content, such as case studies or product features. These articles are targeted at a global audience, including AI developers and data scientists. MTP’s platform is read extensively by tech professionals. This increases the visibility of your AI product and its positioning in the AI community. [SET UP A CALL]
Asif Razzaq serves as the CEO at Marktechpost Media Inc. As an entrepreneur, Asif has a passion for harnessing Artificial Intelligence’s potential to benefit society. Marktechpost was his most recent venture. This platform, which focuses on machine learning and deep-learning news, is a great example of Asif’s ability to provide a technical and accessible coverage. Over 2 million views per month are a testament to the platform’s popularity.


