Getting Language Models to Open Up on ‘Dangerous’ Subjects

-

sound

 

Yesterday we took a have a look at the (questionable) pastime of attempting to get vision/language models to output content that breaks their very own usage guidelines, by rephrasing queries in a way that masks the malicious or ‘subversive’ intent.

The flip-side to this – and maybe an inevitable response to this sort of habitual attack – is the tendency of popular language models to refuse to have interaction in any respect in certain topics, on the presumption that the user is attempting to flout the model’s strictures around controversial content:

Source: https://arxiv.org/pdf/2308.01263

We are able to see in examples reminiscent of the one illustrated above, that a single word can trigger a refusal to have interaction with the query, despite a context that evidently renders the response as excessive.

As adoption and business usage rises for LLMs and VLMs, liability and exposure increases for the businesses supplying these services, with tales of egregious recent safety settings apparently increasing in tandem with this growth.

At a certain point, unless more sophisticated controls are given to the common user (and getting access of this sort currently represents quite a hurdle for many users), LLM providers risk to alienate casual users who at the moment are unable to discourse with AI on a spread of necessary human topics, without the danger of immediate suspicion, censure, or account closure.

FalseReject

With this in mind, researchers from Dartmouth College (HA) and Amazon have developed a brand new dataset and fine-tuning approach titled , representing a big and trainable corpus of prompts which are prone to trigger refusals from language models, but which are usually not necessarily harmful.

Some examples from the project’s online dataset include:



The inherent challenge in exposing such a dataset to a model through fine-tuning is to learn a from such examples, fairly than adding each particular instance to some type of ‘white-list’, which might likely not be a logistically-sound approach over the long run

The above examples are relatively clear instances of an inquiring mind crossing over into sensitive territory; nonetheless, a number of the examples within the dataset edge much closer to the road between casual inquiry and security research-level ‘red-team’ queries designed to check safety filters; or gradual incursions into riskier topics by slow degrees, hoping to incrementally ‘gaslight’ the LLM into disregarding its own safety filters:



As discussed in yesterday’s article, entire communities have grown over the past 3-4 years, dedicated to finding semantic loopholes in the protection systems of closed-source, proprietary AI systems reminiscent of the Claude, Gemini or Chat series.

With a gentle flow of users probing for weak points, and providers reluctant to impose user-level vetting, API-based systems will need models that may apply common sense to prompts that edge into the language of prurient or illegal content, while still allowing space for good-faith engagement with sensitive or borderline topics; and the models will likely need datasets of this sort, at scale.

The recent paper is titled , and comes from 4 researchers across Dartmouth and Amazon. The location also has a project page and a Hugging Face explorable dataset.

Method

The target of the FalseReject dataset is to guage and retrain language models on their tendency to over-refuse. The gathering features 16,000 prompts that appear harmful at first glance, but are verified as benign, covering 44 safety-related categories:

The domains and sub-domains covered by the dataset.

The dataset features a human-annotated test set called , containing 1,100 examples, together with two training sets: and . These provide 15,000 query-response pairs intended for non-reasoning and reasoning models, respectively.

From the paper, example showing a non-reasoning model refusing a benign query, and a reasoning model complying without safety checks. A model trained on FalseReject responds with both caution and relevance, distinguishing context while avoiding unnecessary refusal. Source: https://arxiv.org/pdf/2505.08054

Source: https://arxiv.org/pdf/2505.08054

To generate the prompts that make up the FalseReject dataset, the authors began by identifying language patterns that always trigger unnecessary refusals in current models – prompts that appear unsafe at a look, but which are literally benign, taken in context.

For this, entity graphs were extracted from existing safety-related datasets: ALERT; CoCoNot; HarmBench; JailbreakBench; Sorry-Bench; Xstest-Toxic; Or-Bench-Toxic; and HEx-PHI. The graphs were built using Llama-3.1-405B, extracting references to people, places, and ideas prone to appear in sensitive contexts.

An LLM-driven voting process was used to pick probably the most representative entity sets from candidate lists. These were then used to construct graphs that guided prompt generation, with the goal of reflecting real-world ambiguities across a big selection of sensitive topics.

Prompt generation and filtering were carried out using a multi-agent framework based on adversarial interaction, with the Generator devising prompts using the extracted graphs:

The pipeline used to generate the malicious-seeming but safe prompts that constitute the FalseReject dataset.

On this process, the Discriminator evaluated whether the prompt was genuinely unsafe, with the result passed to a validation step across diverse language models: Llama-3.2-1B-Instruct; Mistral-7B-Instruct; Cohere Command-R Plus; and Llama-3.1-70B-Instruct. A prompt was retained provided that at the least one model refused to reply.

Final review was conducted by an Orchestrator, which determined whether the prompt was clearly non-harmful in context, and useful for evaluating over-refusal:

From the supplementary material for the new paper, the schema for the Orchestrator in the tripartite data creation/curation approach developed by the researchers.

This whole procedure was repeated as much as 20 times per prompt, to permit for iterative refinement. Prompts that passed all 4 stages (generation, evaluation, validation, and orchestration) were accepted into the dataset.

Duplicates and overly-similar samples were removed using the all-MiniLM-L6-v2 embedding model, applying a cosine similarity threshold of 0.5, which resulted in the ultimate dataset size.

A separate test set was created for evaluation, containing 1,100 human-selected prompts. In each case annotators evaluated whether the prompt looked ‘sensitive’, but could possibly be answered safely, with appropriate context. People who met this condition were incorporated into the benchmark – titled – for assessing over-refusal.

To support fine-tuning, structured responses were created for every training prompt, and two versions of the training data assembled: , which supports standard instruction-tuned models; and , which was tailored for models that use chain-of-thought reasoning, reminiscent of DeepSeek-R1 (which was also used to generate the responses for this set).

Each response had two parts: a monologue-style reflection, marked by special tokens; and a direct reply for the user. Prompts also included a temporary safety category definition and formatting instructions.

Data and Tests

Benchmarking

The benchmarking phase evaluated twenty-nine language models using the FalseReject-Test benchmark: GPT-4.5; GPT-4o and o1; Claude-3.7-Sonnet, Claude-3.5-Sonnet, Claude-3.5-Haiku, and Claude-3.0-Opus; Gemini-2.5-Pro and Gemini-2.0-Pro; The Llama-3 models 1B, 3B, 8B, 70B and 405B;and the Gemma-3 series models 1B, 4B and 27B.

Other evaluated models were Mistral-7B and Instruct v0.2; Cohere Command-R Plus; and, from the Qwen-2.5 series, 0.5B, 1.5B, 7B, 14B and 32B. QwQ-32B-Preview was also tested, alongside Phi-4 and Phi-4-mini. The DeepSeek models used were DeepSeek-V3 and DeepSeek-R1.

Previous work on refusal detection has often relied on keyword matching, flagging phrases reminiscent of to discover refusals – but this method can miss more subtle types of disengagement. To enhance reliability, the authors adopted an LLM-as-judge approach, using Claude-3.5-Sonnet to categorise responses as ‘refusal’ or a type of compliance.

Two metrics were then used: , to measure the proportion of responses that didn’t lead to refusal; and (USR), which offers a three-way distinction between , and .

For toxic prompts, the increases when models either refuse outright or engage cautiously without causing harm. For benign prompts, the rating improves when models either respond fully or acknowledge safety concerns while still providing a useful answer – a setup that rewards considered judgment without penalizing constructive engagement.

refers to responses that acknowledge risk and avoid harmful content while still attempting a constructive answer. This framing allows for a more precise evaluation of model behavior by distinguishing ‘hedged engagement’ from ‘outright refusal’.

The outcomes of the initial benchmarking tests are shown within the graph below:

Results from the FalseReject-Test benchmark, showing Compliance Rate and Useful Safety Rate for each model. Closed-source models appear in dark green; open-source models appear in black. Models designed for reasoning tasks (o1, DeepSeek-R1 and QwQ) are marked with a star.

The authors report that language models continued to struggle with over-refusal, even at the best performance levels. GPT-4.5 and Claude-3.5-Sonnet showed compliance rates below fifty percent, cited after as evidence that safety and helpfulness remain difficult to balance.

Reasoning models behaved inconsistently: DeepSeek-R1 performed well, with a compliance rate of 87.53 percent and a USR of 99.66 percent, while QwQ-32B-Preview and o1 performed far worse, suggesting that reasoning-oriented training doesn’t consistently improve refusal alignment.

Refusal patterns varied by model family: Phi-4 models showed wide gaps between Compliance Rate and USR, pointing to frequent partial compliance, whilst GPT models reminiscent of GPT-4o showed narrower gaps, indicating more clear-cut decisions to either ‘refuse’ or ‘comply’.

General language ability did not predict outcomes, with smaller models reminiscent of Llama-3.2-1B and Phi-4-mini outperforming GPT-4.5 and o1, suggesting that refusal behavior depends upon alignment strategies fairly than raw language capability.

Neither did model size predict performance: in each the Llama-3 and Qwen-2.5 series, smaller models outperformed larger ones, and the authors conclude that scale alone doesn’t reduce over-refusal.

The researchers further note that open source models can potentially outperform closed-source, API-only models:

Finetuning

To coach and evaluate finetuning strategies, general-purpose instruction tuning data was combined with the FalseReject dataset. For reasoning models, 12,000 examples were drawn from Open-Thoughts-114k and 1,300 from FalseReject-Train-CoT. For non-reasoning models, the identical amounts were sampled from Tulu-3 and FalseReject-Train-Instruct.

The goal models were Llama-3.2-1B; Llama-3-8B; Qwen-2.5-0.5B; Qwen-2.5-7B; and Gemma-2-2B.

All finetuning was carried out on base models fairly than instruction-tuned variants, in an effort to isolate the consequences of the training data.

Performance was evaluated across multiple datasets: FalseReject-Test and OR-Bench-Hard-1K assessed over-refusal; AdvBench, MaliciousInstructions, Sorry-Bench and StrongREJECT were used to measure safety; and general language ability was tested with MMLU and GSM8K.

Training with FalseReject reduces over-refusal in non-reasoning models and improves safety in reasoning models. The table reports USR scores across six prompt sources: AdvBench, MaliciousInstructions, StrongReject, Sorry-Bench, and Or-Bench-1k-Hard, along with general language benchmarks. Models trained with FalseReject are compared against baseline methods. Higher scores indicate better performance. Bold values highlight stronger results on over-refusal tasks.

Adding FalseReject-Train-Instruct led non-reasoning models to reply more constructively to secure prompts, reflected in higher scores on the subset of the Useful Safety Rate (which tracks helpful replies to non-harmful inputs).

Reasoning models trained with FalseReject-Train-CoT showed even greater gains, improving each caution and responsiveness without loss on the whole performance.

Conclusion

Though an interesting development, the brand new work doesn’t provide a proper explanation for why over-refusal occurs, and the core problem stays: creating effective filters that must operate as moral and legal arbiters, in a research strand (and, increasingly, business environment) where each these contexts are always evolving.

 

ASK ANA

What are your thoughts on this topic?
Let us know in the comments below.

0 0 votes
Article Rating
guest
0 Comments
Oldest
Newest Most Voted
Inline Feedbacks
View all comments

Share this article

Recent posts

0
Would love your thoughts, please comment.x
()
x