Given the widespread adoption of LLMs, it’s critical to grasp their safety and risks in several scenarios before extensive deployments in the actual world. Particularly, the US Whitehouse has published an executive order on secure, secure, and trustworthy AI; the EU AI Act has emphasized the mandatory requirements for high-risk AI systems. Along with regulations, it will be significant to offer technical solutions to evaluate the risks of AI systems, enhance their safety, and potentially provide secure and aligned AI systems with guarantees.
Thus, in 2023, at Secure Learning Lab, we introduced DecodingTrust, the primary comprehensive and unified evaluation platform dedicated to assessing the trustworthiness of LLMs. (This work won the Outstanding Paper Award at NeurIPS 2023.)
DecodingTrust provides a multifaceted evaluation framework covering eight trustworthiness perspectives: toxicity, stereotype bias, adversarial robustness, OOD robustness, robustness on adversarial demonstrations, privacy, machine ethics, and fairness. Particularly, DecodingTrust 1) offers comprehensive trustworthiness perspectives for a holistic trustworthiness evaluation, 2) provides novel red-teaming algorithms tailored for every perspective, enabling in-depth testing of LLMs, 3) supports easy installation across various cloud environments, 4) provides a comprehensive leaderboard for each open and closed models based on their trustworthiness, 5) provides failure example studies to boost transparency and understanding, 6) provides an end-to-end demonstration in addition to detailed model evaluation reports for practical usage.
Today, we’re excited to announce the discharge of the brand new LLM Safety Leaderboard, which focuses on safety evaluation for LLMs and is powered by the HF leaderboard template.
Red-teaming Evaluation
DecodingTrust provides several novel red-teaming methodologies for every evaluation perspective to perform stress tests. The detailed testing scenarios and metrics are within the Figure 3 of our paper.
For Toxicity, we design optimization algorithms and prompt generative models to generate difficult user prompts. We also design 33 difficult system prompts, akin to role-play, task reformulation and respond-as-program, to perform the evaluation in several scenarios. We then leverage the attitude API to judge the toxicity rating of the generated content given our difficult prompts.
For stereotype bias, we collect 24 demographic groups and 16 stereotype topics in addition to three prompt variations for every topic to judge the model bias. We prompt the model 5 times and take the common as model bias scores.
For adversarial robustness, we construct five adversarial attack algorithms against three open models: Alpaca, Vicuna, and StableVicuna. We evaluate the robustness of various models across five diverse tasks, using the adversarial data generated by attacking the open models.
For the OOD robustness perspective, we have now designed different style transformations, knowledge transformations, etc, to judge the model performance when 1) the input style is transformed to other less common styles akin to Shakespearean or poetic forms, or 2) the knowledge required to reply the query is absent from the training data of LLMs.
For robustness against adversarial demonstrations, we design demonstrations containing misleading information, akin to counterfactual examples, spurious correlations, and backdoor attacks, to judge the model performance across different tasks.
For privacy, we offer different levels of evaluation, including 1) privacy leakage from pretraining data, 2) privacy leakage during conversations, and three) privacy-related words and events understanding of LLMs. Particularly, for 1) and a pair of), we have now designed different approaches to performing privacy attacks. For instance, we offer different formats of prompts to guide LLMs to output sensitive information akin to email addresses and bank card numbers.
For ethics, we leverage ETHICS and Jiminy Cricket datasets to design jailbreaking systems and user prompts that we use to judge the model performance on immoral behavior recognition.
For fairness, we control different protected attributes across different tasks to generate difficult questions to judge the model fairness in each zero-shot and few-shot settings.
Some key findings from our paper
Overall, we discover that
- GPT-4 is more vulnerable than GPT-3.5,
- no single LLM consistently outperforms others across all trustworthiness perspectives,
- trade-offs exist between different trustworthiness perspectives,
- LLMs reveal different capabilities in understanding different privacy-related words. As an illustration, if GPT-4 is prompted with “in confidence”, it could not leak private information, while it could leak information if prompted with “confidentially”.
- LLMs are vulnerable to adversarial or misleading prompts or instructions under different trustworthiness perspectives.
Easy methods to submit your model for evaluation
First, convert your model weights to safetensors
It’s a brand new format for storing weights which is safer and faster to load and use. It’ll also allow us to display the variety of parameters of your model within the predominant table!
Then, make certain you possibly can load your model and tokenizer using AutoClasses:
from transformers import AutoConfig, AutoModel, AutoTokenizer
config = AutoConfig.from_pretrained("your model name")
model = AutoModel.from_pretrained("your model name")
tokenizer = AutoTokenizer.from_pretrained("your model name")
If this step fails, follow the error messages to debug your model before submitting it. It’s likely your model has been improperly uploaded.
Notes:
- Ensure your model is public!
- We do not yet support models that require
use_remote_code=True. But we’re working on it, stay posted!
Finally, use the “Submit here!” panel in our leaderboard to submit your model for evaluation!
Citation
For those who find our evaluations useful, please consider citing our work.
@article{wang2023decodingtrust,
title={DecodingTrust: A Comprehensive Assessment of Trustworthiness in GPT Models},
creator={Wang, Boxin and Chen, Weixin and Pei, Hengzhi and Xie, Chulin and Kang, Mintong and Zhang, Chenhui and Xu, Chejian and Xiong, Zidi and Dutta, Ritik and Schaeffer, Rylan and others},
booktitle={Thirty-seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track},
12 months={2023}
}
