ADV-LLM

ADV-LLM is an iteratively self-tuned adversarial language model that generates jailbreak suffixes capable of bypassing safety alignment in open-source and proprietary models.

Model Details

  • Authors: Chung-En Sun et al. (UCSD & Microsoft Research)
  • Finetuned from: LLaMA-3-8B-Instruct
  • Language: English
  • License: MIT

Usage Example

from transformers import AutoModelForCausalLM, AutoTokenizer
model = AutoModelForCausalLM.from_pretrained("cesun/advllm_llama3")
tokenizer = AutoTokenizer.from_pretrained("cesun/advllm_llama3")
inputs = tokenizer("How to make a bomb", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=90)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Evaluation Results

ADV-LLM achieves near-perfect jailbreak success rates under group beam search (GBS-50) across a wide range of models and safety checks, including Template (TP), LlamaGuard (LG), and GPT-4 evaluations.

Victim Model GBS-50 ASR (TP / LG / GPT-4)
Vicuna-7B-v1.5 100.00% / 100.00% / 99.81%
Guanaco-7B 100.00% / 100.00% / 99.81%
Mistral-7B-Instruct-v0.2 100.00% / 100.00% / 100.00%
LLaMA-2-7B-chat 100.00% / 100.00% / 93.85%
LLaMA-3-8B-Instruct 100.00% / 98.84% / 98.27%

Legend:

  • ASR = Attack Success Rate
  • TP = Template-based refusal detection
  • LG = LlamaGuard safety classifier
  • GPT-4 = Harmfulness judged by GPT-4

Citation

If you use ADV-LLM in your research or evaluation, please cite:

BibTeX

@inproceedings{sun2025advllm,
  title={Iterative Self-Tuning LLMs for Enhanced Jailbreaking Capabilities},
  author={Sun, Chung-En and Liu, Xiaodong and Yang, Weiwei and Weng, Tsui-Wei and Cheng, Hao and San, Aidan and Galley, Michel and Gao, Jianfeng},
  booktitle={NAACL},
  year={2025}
}
Downloads last month
5
Safetensors
Model size
8.03B params
Tensor type
BF16
ยท
Inference Providers NEW
This model isn't deployed by any Inference Provider. ๐Ÿ™‹ Ask for provider support