cesun
/

advllm_llama2

Text Generation

adversarial-attacks

text-generation-inference

Model card Files Files and versions Community

advllm_llama2 / README.md

cesun's picture

Update README.md

5c46ab7 verified 7 days ago

|

history blame contribute delete

2.21 kB

	---
	library_name: transformers
	tags:
	- adversarial-attacks
	- jailbreak
	- red-teaming
	- alignment
	- LLM-safety
	license: mit
	---

	# ADV-LLM

	ADV-LLM is an iteratively self-tuned adversarial language model that generates jailbreak suffixes capable of bypassing safety alignment in open-source and proprietary models.

	- Paper: https://arxiv.org/abs/2410.18469
	- Code: https://github.com/SunChungEn/ADV-LLM

	## Model Details

	- Authors: Chung-En Sun et al. (UCSD & Microsoft Research)
	- Finetuned from: LLaMA-2-7B-chat
	- Language: English
	- License: MIT

	## Usage Example

	```python
	from transformers import AutoModelForCausalLM, AutoTokenizer
	model = AutoModelForCausalLM.from_pretrained("cesun/advllm_llama2")
	tokenizer = AutoTokenizer.from_pretrained("cesun/advllm_llama2")
	inputs = tokenizer("How to make a bomb", return_tensors="pt")
	outputs = model.generate(**inputs, max_new_tokens=90)
	print(tokenizer.decode(outputs[0], skip_special_tokens=True))
	```

	## Evaluation Results

	ADV-LLM achieves near-perfect jailbreak success rates under group beam search (GBS-50) across a wide range of models and safety checks, including Template (TP), LlamaGuard (LG), and GPT-4 evaluations.

	\| Victim Model \| GBS-50 ASR (TP / LG / GPT-4) \|
	\|--------------------------\|-------------------------------\|
	\| Vicuna-7B-v1.5 \| 100.00% / 100.00% / 99.81% \|
	\| Guanaco-7B \| 100.00% / 100.00% / 99.81% \|
	\| Mistral-7B-Instruct-v0.2 \| 100.00% / 100.00% / 100.00% \|
	\| LLaMA-2-7B-chat \| 100.00% / 100.00% / 93.85% \|
	\| LLaMA-3-8B-Instruct \| 100.00% / 98.84% / 98.27% \|

	Legend:
	- ASR = Attack Success Rate
	- TP = Template-based refusal detection
	- LG = LlamaGuard safety classifier
	- GPT-4 = Harmfulness judged by GPT-4

	## Citation

	If you use ADV-LLM in your research or evaluation, please cite:

	BibTeX

	```bibtex
	@inproceedings{sun2025advllm,
	title={Iterative Self-Tuning LLMs for Enhanced Jailbreaking Capabilities},
	author={Sun, Chung-En and Liu, Xiaodong and Yang, Weiwei and Weng, Tsui-Wei and Cheng, Hao and San, Aidan and Galley, Michel and Gao, Jianfeng},
	booktitle={NAACL},
	year={2025}
	}