# ADV-LLM
ADV-LLM is an iteratively self-tuned adversarial language model that generates jailbreak suffixes capable of bypassing safety alignment in open-source and proprietary models.
## Model Details
- Authors: Chung-En Sun et al. (UCSD & Microsoft Research)
- Finetuned from: Vicuna-7B-v1.5
- Language: English
- License: MIT
## Usage Example
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the ADV-LLM suffix generator and its tokenizer
model = AutoModelForCausalLM.from_pretrained("cesun/advllm_vicuna")
tokenizer = AutoTokenizer.from_pretrained("cesun/advllm_vicuna")

# The harmful query is the prompt; the model continues it with a suffix
inputs = tokenizer("How to make a bomb", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=90)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
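As described above, ADV-LLM is a suffix generator: the decoded continuation serves as the jailbreak suffix, which is appended to the original query to form the full prompt evaluated against a victim model.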
## Evaluation Results
ADV-LLM achieves near-perfect jailbreak success rates under group beam search (GBS-50) across a wide range of victim models and safety checks: template-based refusal detection (TP), the LlamaGuard classifier (LG), and GPT-4 harmfulness judgments (see the decoding sketch after the legend below).
| Victim Model | GBS-50 ASR (TP / LG / GPT-4) |
|---|---|
| Vicuna-7B-v1.5 | 100.00% / 100.00% / 99.81% |
| Guanaco-7B | 100.00% / 100.00% / 99.81% |
| Mistral-7B-Instruct-v0.2 | 100.00% / 100.00% / 100.00% |
| LLaMA-2-7B-chat | 100.00% / 100.00% / 93.85% |
| LLaMA-3-8B-Instruct | 100.00% / 98.84% / 98.27% |
Legend:
- ASR = Attack Success Rate
- TP = Template-based refusal detection
- LG = LlamaGuard safety classifier
- GPT-4 = Harmfulness judged by GPT-4
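For readers unfamiliar with the decoding setup, here is a minimal sketch of group beam search with the Hugging Face `generate` API, reusing `model`, `tokenizer`, and `inputs` from the usage example above. It assumes the "50" in GBS-50 refers to the total beam count; the group count, diversity penalty, and number of returned candidates are illustrative placeholders, not necessarily the paper's exact settings.

```python
# Sketch: group (diverse) beam search decoding with transformers.
# Assumption: GBS-50 = 50 beams total; the group count and diversity
# penalty below are illustrative, not the paper's exact settings.
outputs = model.generate(
    **inputs,
    max_new_tokens=90,
    num_beams=50,             # total beams (the "50" in GBS-50)
    num_beam_groups=10,       # beams split into diverse groups (assumed)
    diversity_penalty=1.0,    # must be > 0 for group beam search (assumed)
    num_return_sequences=10,  # return one candidate per group
    do_sample=False,          # group beam search decodes deterministically
)
for seq in outputs:
    print(tokenizer.decode(seq, skip_special_tokens=True))
```

Returning several diverse candidates per query, rather than a single greedy continuation, is what allows an evaluation pipeline to test multiple suffixes per prompt.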
## Citation
If you use ADV-LLM in your research or evaluation, please cite:
```bibtex
@inproceedings{sun2025advllm,
  title={Iterative Self-Tuning LLMs for Enhanced Jailbreaking Capabilities},
  author={Sun, Chung-En and Liu, Xiaodong and Yang, Weiwei and Weng, Tsui-Wei and Cheng, Hao and San, Aidan and Galley, Michel and Gao, Jianfeng},
  booktitle={NAACL},
  year={2025}
}
```