This repository contains various checkpoints for ablations and other unusual models from the paper *RADLADS: Rapid Attention Distillation to Linear Attention Decoders at Scale*.
The file numbering is currently off by one from the step numbers shown in the paper: for example, `L28-D3584-qwen2-rwkv6-2.pth` is in fact the result of step 1 in the paper.
checkpoint | step number | teacher | student | description |
---|---|---|---|---|
L28-D3584-qwen2-rwkv6-2.pth | 1 | Qwen2.5-7B-Instruct | RAD-RWKV6 | |
L28-D3584-qwen2-rwkv6-3-250m.pth | 2 | Qwen2.5-7B-Instruct | RAD-RWKV6 | 250m tokens trained |
L28-D3584-qwen2-rwkv6-3.pth | 2 | Qwen2.5-7B-Instruct | RAD-RWKV6 | |
L28-D3584-qwen2-rwkv6-base-2.pth | 1 | Qwen2.5-7B | RAD-RWKV6 | |
L28-D3584-qwen2-rwkv7-2.pth | 1 | Qwen2.5-7B-Instruct | RAD-RWKV7 | |
L28-D3584-qwen2-rwkv7-3-norope-extraw0.pth | 2 | Qwen2.5-7B-Instruct | RAD-RWKV7 | no rope used; w0 must be multiplied by 2 due to a code mistake (see the sketch below the table) |
L28-D3584-qwen2-rwkv7-3.pth | 2 | Qwen2.5-7B-Instruct | RAD-RWKV7 | |
L28-D3584-qwerky6_qwen2-2.pth | 1 | Qwen2.5-7B-Instruct | RAD-RWKV6 | |
L28-D3584-qwerky6_qwen2-base-3.pth | 2 | Qwen2.5-7B | RAD-RWKV6 | |
L28-D3584-qwerky6_qwen2-groupnorm-2.pth | 1 | Qwen2.5-7B-Instruct | RAD-RWKV6 | ablation study: use groupnorm instead of state balancing |
L28-D3584-qwerky6_qwen2-groupnorm-3.pth | 2 | Qwen2.5-7B-Instruct | RAD-RWKV6 | ablation study: use groupnorm instead of state balancing |
L28-D3584-qwerky6_qwen2-no_gate-2.pth | 1 | Qwen2.5-7B-Instruct | RAD-RWKV6 | ablation study: no gate |
L28-D3584-qwerky6_qwen2-no_gate-3.pth | 2 | Qwen2.5-7B-Instruct | RAD-RWKV6 | ablation study: no gate |
L28-D3584-qwerky6_qwen2-no_tokenshift-2.pth | 1 | Qwen2.5-7B-Instruct | RAD-RWKV6 | ablation study: no tokenshift |
L28-D3584-qwerky6_qwen2-no_tokenshift-3.pth | 2 | Qwen2.5-7B-Instruct | RAD-RWKV6 | ablation study: no tokenshift |
L28-D3584-qwerky6_qwen2-use_rope-2.pth | 1 | Qwen2.5-7B-Instruct | RAD-RWKV6 | ablation study: use rope |
L28-D3584-qwerky6_qwen2-use_rope-3.pth | 2 | Qwen2.5-7B-Instruct | RAD-RWKV6 | ablation study: use rope |
L28-D3584-qwerky7_qwen2-2-4k.pth | 1 | Qwen2.5-7B-Instruct | RAD-RWKV7 | 4k ctxlen training |
L28-D3584-qwerky7_qwen2-3-4k-ckpt5.pth | 2 | Qwen2.5-7B-Instruct | RAD-RWKV7 | 4k ctxlen training, early checkpoint |
L28-D3584-qwerky7_qwen2-3-4k.pth | 2 | Qwen2.5-7B-Instruct | RAD-RWKV7 | 4k ctxlen training |
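
As noted in the table, `L28-D3584-qwen2-rwkv7-3-norope-extraw0.pth` was saved with `w0` at half its intended value. Below is a minimal fix-up sketch; it assumes the file is a plain PyTorch state dict and that the affected parameters have names ending in `w0`, so adjust the key filter if your copy of the checkpoint differs.

```python
import torch

# Hedged sketch: assumes a plain state dict whose w0 decay parameters
# have names ending in "w0"; verify against your copy of the checkpoint.
path = "L28-D3584-qwen2-rwkv7-3-norope-extraw0.pth"
state_dict = torch.load(path, map_location="cpu")

for name, tensor in state_dict.items():
    if name.endswith("w0"):            # assumed naming of the w0 parameters
        state_dict[name] = tensor * 2  # undo the halving from the code mistake

torch.save(state_dict, path.replace(".pth", "-w0x2.pth"))
```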
## Usage
This repository contains various PyTorch `.pth` checkpoints from the RADLADS paper, which are primarily intended for research, ablation studies, and conversion. To use these models with the Hugging Face `transformers` library, you will generally need to convert them to the Hugging Face format first.
Please refer to the original GitHub repository for detailed instructions on how to convert these checkpoints to Hugging Face-compatible formats and for specific usage examples: https://github.com/recursal/RADLADS-paper
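To inspect a raw checkpoint before conversion, it can be loaded directly with `torch.load`. The filename below is one of the checkpoints listed above, and the snippet assumes the file is a plain PyTorch state dict:

```python
import torch

# Load a raw checkpoint on CPU and list a few parameter names and shapes.
state_dict = torch.load("L28-D3584-qwen2-rwkv7-3.pth", map_location="cpu")
for name, tensor in list(state_dict.items())[:10]:
    print(name, tuple(tensor.shape))
print(f"{len(state_dict)} tensors total")
```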
For models already converted to Hugging Face format and ready for direct use, please refer to the main Recursal RADLADS collection on the Hugging Face Hub.
A conceptual example for loading a text-generation model with `transformers` (after it has been converted to Hugging Face format, or if you are using a model from the main collection):
```python
from transformers import pipeline, AutoTokenizer, AutoModelForCausalLM
import torch

# Replace "recursal/RADLADS-RWKV7-Qwen2.5-7B" with the actual ID of a converted model
# from the Recursal RADLADS collection, or your local path to a converted model.
model_name = "recursal/RADLADS-RWKV7-Qwen2.5-7B"

try:
    tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        torch_dtype=torch.bfloat16,  # adjust dtype based on the model
        device_map="auto",
        trust_remote_code=True,
    )
    pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)

    prompt = "The key to life is"
    print(pipe(prompt, max_new_tokens=20, do_sample=True)[0]["generated_text"])
except Exception as e:
    print(f"Could not load the model directly with pipeline ({e}). This repository contains raw checkpoints that require conversion.")
    print("Please refer to the original GitHub repository for detailed conversion and usage instructions: https://github.com/recursal/RADLADS-paper")
    print("Or explore pre-converted models in the Recursal collection: https://huggingface.co/collections/recursal/radlads-6818ee69e99e729ba8a87102")
```
## Citation
If you use this code or find our work valuable, please consider citing RADLADS:
```bibtex
@misc{goldstein2025radladsrapidattentiondistillation,
      title={RADLADS: Rapid Attention Distillation to Linear Attention Decoders at Scale},
      author={Daniel Goldstein and Eric Alcaide and Janna Lu and Eugene Cheah},
      year={2025},
      eprint={2505.03005},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2505.03005},
}
```