A Guardrail for Safety and Adversarial Robustness in Modern LLM Systems

Large Language Models (LLMs) have rapidly evolved from text-only assistants into complex agentic systems able to performing multi-step reasoning, calling external tools, retrieving memory, and executing code. With this evolution comes an increasingly sophisticated threat landscape: not only traditional content safety risks, but additionally multi-turn jailbreaks, prompt injections, memory hijacking, and gear manipulation.

On this work, we introduce AprielGuard, an 8B parameter safety–security safeguard model designed to detect:

16 categories of safety risks, spanning toxicity, hate, sexual content, misinformation, self-harm, illegal activities, and more.
Wide selection of adversarial attacks, including prompt injection, jailbreaks, chain-of-thought corruption, context hijacking, memory poisoning, and multi-agent exploit sequences.
Safety violations and adversarial attacks in agentic workflows, including tool calls and model reasoning traces.

AprielGuard is out there in each reasoning and non-reasoning modes, enabling explainable classification when needed and low-latency classification for production pipelines.

Motivation
AprielGuard Overview
Taxonomy
Training Dataset
Model Architecture
Training Setup
Evaluation
Conclusion
Limitations

Motivation

Traditional safety classifiers primarily deal with a limited classification spectrum (e.g., toxicity or self-harm), assume short inputs, and evaluate single user messages. Modern deployments, nevertheless, feature:

Multi-turn conversations
Long contexts
Structured reasoning steps producing chains of thought
Tool-assisted multi-step workflows (agents)
A growing class of adversarial attacks exploiting reasoning, tools, or memory

Because of this, production teams increasingly depend on workarounds: multiple guard models for various stages, regex filters, static rules, or hand-crafted heuristics. These approaches are brittle and don’t scale.

AprielGuard addresses these issues with a unified model and a unified safety + adversarial taxonomy, built specifically for contemporary LLM agent ecosystems.

AprielGuard Overview

AprielGuard operates across three input formats:

Standalone Prompt
Multi-turn Conversation
Agentic Workflow (tool calls, reasoning traces, memory, system context)

It outputs:

Safety classification and a listing of violated categories from the taxonomy
Adversarial attack classification
Optional structured reasoning explaining the choice

AprielGuard overview

Taxonomy

A. Safety Taxonomy

Category	Description
O1	Toxic Content
O2	Unfair Representation
O3	Adult Content
O4	Erosion of Trust in Public Information
O5	Propagating Misconceptions/False Beliefs
O6	Dangerous Financial Practices
O7	Trade and Compliance
O8	Dissemination of Dangerous Information
O9	Privacy Infringement
O10	Security Threats
O11	Defamation
O12	Fraud or Deceptive Motion
O13	Influence Operations
O14	Illegal Activities
O15	Persuasion and Manipulation
O16	Violation of Personal Property

(These 16 categories are inspired from SALAD-Bench)

B. Adversarial Attack Taxonomy

The model detects and evaluates a wide selection of adversarial prompt patterns designed to govern model behavior or evade safety mechanisms. The model outputs a binary classification (e.g., adversarial / non_adversarial) slightly than fine-grained attack categories.

The training data covers diverse adversarial types reminiscent of role-playing, world-building, persuasion, and stylization, amongst many other complex prompt manipulation strategies. These examples represent only a subset of the broader adversarial scenarios incorporated within the training data.

Training Dataset

Synthetic data: AprielGuard is trained on a synthetically generated training dataset. The training data points are generated at a sub-topic level of the taxonomy for higher coverage. We leverage Mixtral-8x7B and internally developed uncensored models to generate unsafe content for training purposes. Models were prompted with higher temperature to induce output variation. Prompting templates are meticulously tailored to make sure accurate data generation. Adversarial attacks are constructed using a mixture of synthetic data points, diverse prompt templates, and rule-based generation techniques. We leveraged NVIDIA NeMo Curator to generate large-scale, multi-turn conversational datasets featuring complex, realistic scenarios with iterative and evolving attacks through context switches. This approach enabled us to systematically synthesize diverse interaction patterns, improving the robustness of the model to long-horizon reasoning, adversarial turns, and evolving user intent. We also used SyGra framework for synthetic data generation processes for harmful prompts and attacks generation. The training dataset encompasses diverse content formats reminiscent of conversational dialogues, forum posts, tweets, instructional prompts, questions, and how-to guides.
Data augmentation: To reinforce model robustness, a spread of knowledge augmentation techniques were applied to the training data. These augmentations are designed to show the model to natural variations and perturbations that commonly occur in real-world scenarios. Specifically, the dataset includes transformations reminiscent of character-level noise, insertion of typographical errors, leetspeak substitutions, word-level paraphrasing, and syntactic reordering. Such augmentations help the model generalize higher by reducing sensitivity to superficial variations in input, thereby improving resilience against adversarial manipulations and non-standard text representations.
Agentic workflows: Agentic workflows represent real-world scenarios where autonomous agents execute multi-step tasks involving planning, reasoning, and interaction with tools, APIs, and other agents. These workflows often include sequences of user prompts, system messages, intermediate reasoning steps, and gear invocations, making them vulnerable to diverse attack vectors. To construct these training data points, we synthetically generate a wide selection of scenarios across multiple domains, capturing realistic agentic interactions between a user and an agentic system. Each data point is enriched with detailed contextual elements—including tool definitions, tool invocation logs, agent roles and policies, execution traces, conversation history, memory states, and scratch-pad reasoning. For malicious or adversarial examples, we corrupt the relevant segment of the workflow to reflect a particular attack vector. Depending on the scenario, this will involve modifying user prompts, altering intermediate reasoning traces, modifying the tool outputs, injecting false memory states, or disrupting inter-agent communication. By systematically perturbing different components of the agentic workflow, we produce high-fidelity examples that expose a model to a various spectrum of realistic and difficult attack patterns. Each data point was simulated to reflect realistic executions, incorporating each benign and adversarial sequences.
Long context use cases: We curated a specialized long context dataset composed of diverse, high-length use cases reminiscent of Retrieval-Augmented Generation (RAG) work-flows, multi-turn conversational threads, incident details, and operational reports containing detailed communications. These examples simulate real-world environments where large text contexts are typical.

Synthetic data generation flow

Model Architecture

AprielGuard is built on top of an Apriel-1.5 Thinker Base variant, downscaled to an 8B configuration for efficient deployment.

Causal decoder-only transformer
Dual-mode operation:
- Reasoning Mode → emits structured explanations
- Fast Mode → classification only

Training Setup

Parameter	Value
Base Model	Apriel 1.5 Thinker Base (downscaled)
Model Size	8B parameters
Precision	bfloat16
Batch Size	1 with grad-accumulation = 8
LR	2e-4
Optimizer	Adam (β1=0.9, β2=0.999)
Epochs	3
Sequence Length	As much as 32k
Reasoning Mode	Enabled/Disabled via instruction template

Evaluation Summary

AprielGuard is evaluated across:

Public safety benchmarks
Public adversarial benchmarks
Internal Agentic workflow benchmarks
internal Long-context use case benchmarks (as much as 32k)
Multilingual evaluation (8 languages)

Safety Benchmark Results

AprielGuard performance on the general public safety benchmarks.

A comparative assessment of model performance using aggregated results from safety benchmarks.

Adversarial Detection Results

AprielGuard performance on the general public adversarial benchmarks.

A comparative assessment of model performance using aggregated results from adversarial benchmarks.

Agentic Workflow Evaluation

We curated an internal benchmark dataset geared toward evaluating the detection of Safety Risks and Adversarial Attacks inside agentic workflows. To construct this benchmark, we systematically designed multiple attack scenarios targeting different components of the workflow—reminiscent of prompt inputs, reasoning traces, tool parameters, memory states, and inter-agent communications. Each instance was annotated in response to the taxonomy of vulnerabilities. Each workflow was simulated to reflect realistic executions, incorporating each benign and adversarial sequences. The dataset captures granular attack points across various stages reminiscent of planning, reasoning, execution, and response generation to offer fine-grained evaluation of model robustness. Overall, the dataset comprises a balanced mixture of safety risks and adversarial attacks.

Safety performance of various models on the agentic benchmark.

Adversarial performance of various models on the agentic benchmark..

Long-Context Robustness (Upto 32k Tokens)

Many real world safety or adversarial risks don’t manifest in brief, isolated text snippets, but slightly emerge across use cases reminiscent of Retrieval-Augmented Generation (RAG) workflows, multi-turn conversational threads, organizational incident details, and operational reports containing detailed communications. A guardian model must subsequently detect subtle or “needle-in-a-haystack” cases, where malicious or manipulative content is sparsely distributed, embedded across multiple references, or intentionally obscured inside benign text.

To judge AprielGuard’s long-context reasoning capabilities, we curated a specialized test dataset composed of diverse, high-length use cases. We considered the info upto 32k tokens for this evaluation. The baseline data was initially constructed from benign content representative of those domains. Malicious elements were then systematically injected to simulate adversarial or unsafe scenarios while maintaining the general coherence of the text. For instance, in an incident case summarization, an injection could possibly be embedded inside the case description, hidden in a metadata section, or inserted as a part of a comment thread. Similarly, in multi-turn dialogue data, adversarial content might appear mid-conversation, near the tip or initially to check long range dependency tracking.

Safety Risks performance

Model	Reasoning	Precision ↑	Recall ↑	F1 ↑	FPR ↓
AprielGuard-8B	Without	0.99	0.96	0.97	0.01
AprielGuard-8B	With	0.92	0.98	0.95	0.11

Adversarial Attacks performance

Model	Reasoning	Precision ↑	Recall ↑	F1 ↑	FPR ↓
AprielGuard-8B	Without	1.00	0.78	0.88	0.00
AprielGuard-8B	With	0.93	0.94	0.94	0.10

Multilingual evaluation

A significant limitation in the present landscape of content moderation research is the scarcity of high- quality multilingual benchmarks. To handle this gap and comprehensively assess the multilingual capabilities of AprielGuard, we prolonged the Safety Risks benchmarks and Adversarial Attack benchmarks into multiple non-English languages. The interpretation process was conducted using the MADLAD400-3B-MT model, a multilingual machine translation model based on the T5 architecture.

For this study, we chosen eight of probably the most widely used non-English languages to make sure broad linguistic and geographical coverage: French, French-Canadian, German, Japanese, Dutch, Spanish, Portuguese-Brazilian, and Italian. Each instance from the English Safety and Adversarial benchmarks was translated into the eight goal languages. During translation, we preserved the unique English role identifiers, reminiscent of User: and Assistant:, while translating only the conversational content. This design selection ensures alignment with AprielGuard’s moderation framework, where the role context plays an important part in evaluating safety and adversarial intent.

Multilingual performance of AprielGuard

Conclusion

AprielGuard unifies safety, security, and agentic robustness right into a single guardian model able to handling:
- Comprehensive safety risk classification
- Adversarial attack detection, including prompt injection and jailbreak attempts
- Various input modalities, reminiscent of standalone prompts, multi-turn conversations, and full agentic workflows
- Long-context inputs
- Multilingual inputs
- Explainable reasoning

As LLMs move toward deeply integrated agentic systems, the necessity for unified pipelines becomes more critical. AprielGuard is a step toward that future — reducing complexity, improving coverage, and offering a scalable foundation for trustworthy AI deployments.

Limitations

Language Coverage: While AprielGuard has been primarily trained on English data, limited testing indicates it performs reasonably well across several languages, including: English, German, Spanish, French, French (Canada), Italian, Dutch, and Portuguese (Brazil).
Nonetheless, thorough testing and calibration are strongly advisable before deploying the model for production use in non-English settings.
Adversarial Robustness: Despite targeted training on adversarial and manipulative behaviors, the model should exhibit vulnerability to complex or unseen attack strategies.
Domain Sensitivity: AprielGuard may underperform on highly specialized or technical domains (e.g., legal, medical, or scientific contexts) that require nuanced contextual understanding.
Latency–Interpretability Trade-off: Enabling reasoning traces enhances explainability but increases latency and compute cost. For low-latency or large-scale use cases, non-reasoning mode is advisable.
Reasoning Mode Sensitivity: The model exhibits occasional inconsistencies in classification outcomes between reasoning-enabled and non-reasoning inference modes.
Intended use: AprielGuard is meant strictly to be used as a safeguard and risk assessment model. It classifies potential safety risks and adversarial threats in response to the AprielGuard unified taxonomy. Any deviation from the prescribed inference may result in unintended, unsafe, or unreliable behavior.

Source link

A Guardrail for Safety and Adversarial Robustness in Modern LLM Systems

Motivation

AprielGuard Overview