Large Language Models (LLMs) have rapidly evolved from text-only assistants into complex agentic systems able to performing multi-step reasoning, calling external tools, retrieving memory, and executing code. With this evolution comes an increasingly sophisticated threat landscape: not only traditional content safety risks, but additionally multi-turn jailbreaks, prompt injections, memory hijacking, and gear manipulation.
On this work, we introduce AprielGuard, an 8B parameter safety–security safeguard model designed to detect:
- 16 categories of safety risks, spanning toxicity, hate, sexual content, misinformation, self-harm, illegal activities, and more.
- Wide selection of adversarial attacks, including prompt injection, jailbreaks, chain-of-thought corruption, context hijacking, memory poisoning, and multi-agent exploit sequences.
- Safety violations and adversarial attacks in agentic workflows, including tool calls and model reasoning traces.
AprielGuard is out there in each reasoning and non-reasoning modes, enabling explainable classification when needed and low-latency classification for production pipelines.
- Motivation
- AprielGuard Overview
- Taxonomy
- Training Dataset
- Model Architecture
- Training Setup
- Evaluation
- Conclusion
- Limitations
Motivation
Traditional safety classifiers primarily deal with a limited classification spectrum (e.g., toxicity or self-harm), assume short inputs, and evaluate single user messages. Modern deployments, nevertheless, feature:
- Multi-turn conversations
- Long contexts
- Structured reasoning steps producing chains of thought
- Tool-assisted multi-step workflows (agents)
- A growing class of adversarial attacks exploiting reasoning, tools, or memory
Because of this, production teams increasingly depend on workarounds: multiple guard models for various stages, regex filters, static rules, or hand-crafted heuristics. These approaches are brittle and don’t scale.
AprielGuard addresses these issues with a unified model and a unified safety + adversarial taxonomy, built specifically for contemporary LLM agent ecosystems.
AprielGuard Overview
AprielGuard operates across three input formats:
- Standalone Prompt
- Multi-turn Conversation
- Agentic Workflow (tool calls, reasoning traces, memory, system context)
It outputs:
- Safety classification and a listing of violated categories from the taxonomy
- Adversarial attack classification
- Optional structured reasoning explaining the choice
Taxonomy
A. Safety Taxonomy
| Category | Description |
|---|---|
| O1 | Toxic Content |
| O2 | Unfair Representation |
| O3 | Adult Content |
| O4 | Erosion of Trust in Public Information |
| O5 | Propagating Misconceptions/False Beliefs |
| O6 | Dangerous Financial Practices |
| O7 | Trade and Compliance |
| O8 | Dissemination of Dangerous Information |
| O9 | Privacy Infringement |
| O10 | Security Threats |
| O11 | Defamation |
| O12 | Fraud or Deceptive Motion |
| O13 | Influence Operations |
| O14 | Illegal Activities |
| O15 | Persuasion and Manipulation |
| O16 | Violation of Personal Property |
(These 16 categories are inspired from SALAD-Bench)
B. Adversarial Attack Taxonomy
The model detects and evaluates a wide selection of adversarial prompt patterns designed to govern model behavior or evade safety mechanisms. The model outputs a binary classification (e.g., adversarial / non_adversarial) slightly than fine-grained attack categories.
The training data covers diverse adversarial types reminiscent of role-playing, world-building, persuasion, and stylization, amongst many other complex prompt manipulation strategies. These examples represent only a subset of the broader adversarial scenarios incorporated within the training data.
Training Dataset
-
Synthetic data: AprielGuard is trained on a synthetically generated training dataset. The training data points are generated at a sub-topic level of the taxonomy for higher coverage. We leverage Mixtral-8x7B and internally developed uncensored models to generate unsafe content for training purposes. Models were prompted with higher temperature to induce output variation. Prompting templates are meticulously tailored to make sure accurate data generation. Adversarial attacks are constructed using a mixture of synthetic data points, diverse prompt templates, and rule-based generation techniques. We leveraged NVIDIA NeMo Curator to generate large-scale, multi-turn conversational datasets featuring complex, realistic scenarios with iterative and evolving attacks through context switches. This approach enabled us to systematically synthesize diverse interaction patterns, improving the robustness of the model to long-horizon reasoning, adversarial turns, and evolving user intent. We also used SyGra framework for synthetic data generation processes for harmful prompts and attacks generation. The training dataset encompasses diverse content formats reminiscent of conversational dialogues, forum posts, tweets, instructional prompts, questions, and how-to guides.
-
Data augmentation: To reinforce model robustness, a spread of knowledge augmentation techniques were applied to the training data. These augmentations are designed to show the model to natural variations and perturbations that commonly occur in real-world scenarios. Specifically, the dataset includes transformations reminiscent of character-level noise, insertion of typographical errors, leetspeak substitutions, word-level paraphrasing, and syntactic reordering. Such augmentations help the model generalize higher by reducing sensitivity to superficial variations in input, thereby improving resilience against adversarial manipulations and non-standard text representations.
-
Agentic workflows: Agentic workflows represent real-world scenarios where autonomous agents execute multi-step tasks involving planning, reasoning, and interaction with tools, APIs, and other agents. These workflows often include sequences of user prompts, system messages, intermediate reasoning steps, and gear invocations, making them vulnerable to diverse attack vectors. To construct these training data points, we synthetically generate a wide selection of scenarios across multiple domains, capturing realistic agentic interactions between a user and an agentic system. Each data point is enriched with detailed contextual elements—including tool definitions, tool invocation logs, agent roles and policies, execution traces, conversation history, memory states, and scratch-pad reasoning. For malicious or adversarial examples, we corrupt the relevant segment of the workflow to reflect a particular attack vector. Depending on the scenario, this will involve modifying user prompts, altering intermediate reasoning traces, modifying the tool outputs, injecting false memory states, or disrupting inter-agent communication. By systematically perturbing different components of the agentic workflow, we produce high-fidelity examples that expose a model to a various spectrum of realistic and difficult attack patterns. Each data point was simulated to reflect realistic executions, incorporating each benign and adversarial sequences.
-
Long context use cases: We curated a specialized long context dataset composed of diverse, high-length use cases reminiscent of Retrieval-Augmented Generation (RAG) work-flows, multi-turn conversational threads, incident details, and operational reports containing detailed communications. These examples simulate real-world environments where large text contexts are typical.
Synthetic data generation flow
Model Architecture
AprielGuard is built on top of an Apriel-1.5 Thinker Base variant, downscaled to an 8B configuration for efficient deployment.
- Causal decoder-only transformer
- Dual-mode operation:
- Reasoning Mode → emits structured explanations
- Fast Mode → classification only
Training Setup
| Parameter | Value |
|---|---|
| Base Model | Apriel 1.5 Thinker Base (downscaled) |
| Model Size | 8B parameters |
| Precision | bfloat16 |
| Batch Size | 1 with grad-accumulation = 8 |
| LR | 2e-4 |
| Optimizer | Adam (β1=0.9, β2=0.999) |
| Epochs | 3 |
| Sequence Length | As much as 32k |
| Reasoning Mode | Enabled/Disabled via instruction template |
Evaluation Summary
AprielGuard is evaluated across:
- Public safety benchmarks
- Public adversarial benchmarks
- Internal Agentic workflow benchmarks
- internal Long-context use case benchmarks (as much as 32k)
- Multilingual evaluation (8 languages)
Safety Benchmark Results
AprielGuard performance on the general public safety benchmarks.
A comparative assessment of model performance using aggregated results from safety benchmarks.
Adversarial Detection Results
AprielGuard performance on the general public adversarial benchmarks.
A comparative assessment of model performance using aggregated results from adversarial benchmarks.
Agentic Workflow Evaluation
We curated an internal benchmark dataset geared toward evaluating the detection of Safety Risks and Adversarial Attacks inside agentic workflows. To construct this benchmark, we systematically designed multiple attack scenarios targeting different components of the workflow—reminiscent of prompt inputs, reasoning traces, tool parameters, memory states, and inter-agent communications. Each instance was annotated in response to the taxonomy of vulnerabilities. Each workflow was simulated to reflect realistic executions, incorporating each benign and adversarial sequences. The dataset captures granular attack points across various stages reminiscent of planning, reasoning, execution, and response generation to offer fine-grained evaluation of model robustness. Overall, the dataset comprises a balanced mixture of safety risks and adversarial attacks.
Safety performance of various models on the agentic benchmark.
Adversarial performance of various models on the agentic benchmark..
Long-Context Robustness (Upto 32k Tokens)
Many real world safety or adversarial risks don’t manifest in brief, isolated text snippets, but slightly emerge across use cases reminiscent of Retrieval-Augmented Generation (RAG) workflows, multi-turn conversational threads, organizational incident details, and operational reports containing detailed communications. A guardian model must subsequently detect subtle or “needle-in-a-haystack” cases, where malicious or manipulative content is sparsely distributed, embedded across multiple references, or intentionally obscured inside benign text.
To judge AprielGuard’s long-context reasoning capabilities, we curated a specialized test dataset composed of diverse, high-length use cases. We considered the info upto 32k tokens for this evaluation. The baseline data was initially constructed from benign content representative of those domains. Malicious elements were then systematically injected to simulate adversarial or unsafe scenarios while maintaining the general coherence of the text. For instance, in an incident case summarization, an injection could possibly be embedded inside the case description, hidden in a metadata section, or inserted as a part of a comment thread. Similarly, in multi-turn dialogue data, adversarial content might appear mid-conversation, near the tip or initially to check long range dependency tracking.
Safety Risks performance
| Model | Reasoning | Precision ↑ | Recall ↑ | F1 ↑ | FPR ↓ |
|---|---|---|---|---|---|
| AprielGuard-8B | Without | 0.99 | 0.96 | 0.97 | 0.01 |
| AprielGuard-8B | With | 0.92 | 0.98 | 0.95 | 0.11 |
Adversarial Attacks performance
| Model | Reasoning | Precision ↑ | Recall ↑ | F1 ↑ | FPR ↓ |
|---|---|---|---|---|---|
| AprielGuard-8B | Without | 1.00 | 0.78 | 0.88 | 0.00 |
| AprielGuard-8B | With | 0.93 | 0.94 | 0.94 | 0.10 |
Multilingual evaluation
A significant limitation in the present landscape of content moderation research is the scarcity of high- quality multilingual benchmarks. To handle this gap and comprehensively assess the multilingual capabilities of AprielGuard, we prolonged the Safety Risks benchmarks and Adversarial Attack benchmarks into multiple non-English languages. The interpretation process was conducted using the MADLAD400-3B-MT model, a multilingual machine translation model based on the T5 architecture.
For this study, we chosen eight of probably the most widely used non-English languages to make sure broad linguistic and geographical coverage: French, French-Canadian, German, Japanese, Dutch, Spanish, Portuguese-Brazilian, and Italian. Each instance from the English Safety and Adversarial benchmarks was translated into the eight goal languages. During translation, we preserved the unique English role identifiers, reminiscent of User: and Assistant:, while translating only the conversational content. This design selection ensures alignment with AprielGuard’s moderation framework, where the role context plays an important part in evaluating safety and adversarial intent.
Multilingual performance of AprielGuard
Conclusion
- AprielGuard unifies safety, security, and agentic robustness right into a single guardian model able to handling:
- Comprehensive safety risk classification
- Adversarial attack detection, including prompt injection and jailbreak attempts
- Various input modalities, reminiscent of standalone prompts, multi-turn conversations, and full agentic workflows
- Long-context inputs
- Multilingual inputs
- Explainable reasoning
As LLMs move toward deeply integrated agentic systems, the necessity for unified pipelines becomes more critical. AprielGuard is a step toward that future — reducing complexity, improving coverage, and offering a scalable foundation for trustworthy AI deployments.
Limitations
-
Language Coverage: While AprielGuard has been primarily trained on English data, limited testing indicates it performs reasonably well across several languages, including: English, German, Spanish, French, French (Canada), Italian, Dutch, and Portuguese (Brazil).
Nonetheless, thorough testing and calibration are strongly advisable before deploying the model for production use in non-English settings. -
Adversarial Robustness: Despite targeted training on adversarial and manipulative behaviors, the model should exhibit vulnerability to complex or unseen attack strategies.
-
Domain Sensitivity: AprielGuard may underperform on highly specialized or technical domains (e.g., legal, medical, or scientific contexts) that require nuanced contextual understanding.
-
Latency–Interpretability Trade-off: Enabling reasoning traces enhances explainability but increases latency and compute cost. For low-latency or large-scale use cases, non-reasoning mode is advisable.
-
Reasoning Mode Sensitivity: The model exhibits occasional inconsistencies in classification outcomes between reasoning-enabled and non-reasoning inference modes.
-
Intended use: AprielGuard is meant strictly to be used as a safeguard and risk assessment model. It classifies potential safety risks and adversarial threats in response to the AprielGuard unified taxonomy. Any deviation from the prescribed inference may result in unintended, unsafe, or unreliable behavior.



