
This repository contains various checkpoints for ablations and other unusual models from the paper *RADLADS: Rapid Attention Distillation to Linear Attention Decoders at Scale*.

The file numbering is currently off by one from the step numbers shown in the paper: for example, L28-D3584-qwen2-rwkv6-2.pth is in fact the result of step 1 in the paper.
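As a quick sanity check, the off-by-one mapping can be expressed in code. This is a minimal illustrative helper, not part of the paper's codebase; it assumes the step digit is the first dash-delimited all-digit token in the filename, which holds for every checkpoint listed below:

```python
def paper_step(filename: str) -> int:
    """Return the paper step number for a checkpoint filename.

    The file numbering is off by one relative to the paper, so a
    trailing '-2' means paper step 1 and '-3' means paper step 2.
    """
    tokens = filename.removesuffix(".pth").split("-")
    # The step digit is the first dash-delimited token that is all digits.
    file_number = next(int(t) for t in tokens if t.isdigit())
    return file_number - 1

assert paper_step("L28-D3584-qwen2-rwkv6-2.pth") == 1
assert paper_step("L28-D3584-qwerky7_qwen2-3-4k-ckpt5.pth") == 2
```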

| checkpoint | step number | teacher | student | description |
|---|---|---|---|---|
| L28-D3584-qwen2-rwkv6-2.pth | 1 | Qwen2.5-7B-Instruct | RAD-RWKV6 | |
| L28-D3584-qwen2-rwkv6-3-250m.pth | 2 | Qwen2.5-7B-Instruct | RAD-RWKV6 | 250m tokens trained |
| L28-D3584-qwen2-rwkv6-3.pth | 2 | Qwen2.5-7B-Instruct | RAD-RWKV6 | |
| L28-D3584-qwen2-rwkv6-base-2.pth | 1 | Qwen2.5-7B | RAD-RWKV6 | |
| L28-D3584-qwen2-rwkv7-2.pth | 1 | Qwen2.5-7B-Instruct | RAD-RWKV7 | |
| L28-D3584-qwen2-rwkv7-3-norope-extraw0.pth | 2 | Qwen2.5-7B-Instruct | RAD-RWKV7 | no RoPE used; w0 must be multiplied by 2 due to a code mistake (see the correction snippet after this table) |
| L28-D3584-qwen2-rwkv7-3.pth | 2 | Qwen2.5-7B-Instruct | RAD-RWKV7 | |
| L28-D3584-qwerky6_qwen2-2.pth | 1 | Qwen2.5-7B-Instruct | RAD-RWKV6 | |
| L28-D3584-qwerky6_qwen2-base-3.pth | 2 | Qwen2.5-7B | RAD-RWKV6 | |
| L28-D3584-qwerky6_qwen2-groupnorm-2.pth | 1 | Qwen2.5-7B-Instruct | RAD-RWKV6 | ablation study: use groupnorm instead of state balancing |
| L28-D3584-qwerky6_qwen2-groupnorm-3.pth | 2 | Qwen2.5-7B-Instruct | RAD-RWKV6 | ablation study: use groupnorm instead of state balancing |
| L28-D3584-qwerky6_qwen2-no_gate-2.pth | 1 | Qwen2.5-7B-Instruct | RAD-RWKV6 | ablation study: no gate |
| L28-D3584-qwerky6_qwen2-no_gate-3.pth | 2 | Qwen2.5-7B-Instruct | RAD-RWKV6 | ablation study: no gate |
| L28-D3584-qwerky6_qwen2-no_tokenshift-2.pth | 1 | Qwen2.5-7B-Instruct | RAD-RWKV6 | ablation study: no tokenshift |
| L28-D3584-qwerky6_qwen2-no_tokenshift-3.pth | 2 | Qwen2.5-7B-Instruct | RAD-RWKV6 | ablation study: no tokenshift |
| L28-D3584-qwerky6_qwen2-use_rope-2.pth | 1 | Qwen2.5-7B-Instruct | RAD-RWKV6 | ablation study: use RoPE |
| L28-D3584-qwerky6_qwen2-use_rope-3.pth | 2 | Qwen2.5-7B-Instruct | RAD-RWKV6 | ablation study: use RoPE |
| L28-D3584-qwerky7_qwen2-2-4k.pth | 1 | Qwen2.5-7B-Instruct | RAD-RWKV7 | 4k ctxlen training |
| L28-D3584-qwerky7_qwen2-3-4k-ckpt5.pth | 2 | Qwen2.5-7B-Instruct | RAD-RWKV7 | 4k ctxlen training, early checkpoint |
| L28-D3584-qwerky7_qwen2-3-4k.pth | 2 | Qwen2.5-7B-Instruct | RAD-RWKV7 | 4k ctxlen training |
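For the norope-extraw0 checkpoint, the w0 correction noted above could be applied to the raw file before use. This is a minimal sketch: it assumes the .pth file is a flat state dict of tensors and that the affected parameters contain "w0" in their key names, which you should verify against the RADLADS code:

```python
import torch

# Load the raw checkpoint on CPU; assumed to be a flat state dict of tensors.
sd = torch.load("L28-D3584-qwen2-rwkv7-3-norope-extraw0.pth", map_location="cpu")

# Assumption: the affected parameters have "w0" in their key names.
# Multiply them by 2 to undo the code mistake noted in the table.
for key in list(sd.keys()):
    if "w0" in key:
        sd[key] = sd[key] * 2

torch.save(sd, "L28-D3584-qwen2-rwkv7-3-norope-extraw0-fixed.pth")
```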

## Usage

This repository contains various PyTorch .pth checkpoints from the RADLADS paper, which are primarily intended for research, ablation studies, and conversion. To use these models with the Hugging Face transformers library, you will generally need to convert them to the Hugging Face format first.
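Because these files are raw PyTorch checkpoints rather than Hugging Face model directories, you can inspect them directly before conversion. A minimal sketch, assuming each .pth file is a flat state dict:

```python
import torch

# Load a raw RADLADS checkpoint on CPU without instantiating a model.
state_dict = torch.load("L28-D3584-qwen2-rwkv6-2.pth", map_location="cpu")

# Print the first few parameter names and shapes to verify the contents.
for name, tensor in list(state_dict.items())[:10]:
    print(name, tuple(tensor.shape))
```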

Please refer to the original GitHub repository for detailed instructions on how to convert these checkpoints to Hugging Face-compatible formats and for specific usage examples: https://github.com/recursal/RADLADS-paper

For models already converted to Hugging Face format and ready for direct use, please refer to the main Recursal RADLADS collection on the Hugging Face Hub.

A conceptual example of loading a text generation model with transformers (after it has been converted to Hugging Face format, or if you are using a model from the main collection):

```python
from transformers import pipeline, AutoTokenizer, AutoModelForCausalLM
import torch

# Replace "recursal/RADLADS-RWKV7-Qwen2.5-7B" with the actual ID of a converted model
# from the Recursal RADLADS collection, or your local path to a converted model.
model_name = "recursal/RADLADS-RWKV7-Qwen2.5-7B"

try:
    tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        torch_dtype=torch.bfloat16,  # adjust dtype based on the model
        device_map="auto",
        trust_remote_code=True,
    )
    pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)

    prompt = "The key to life is"
    print(pipe(prompt, max_new_tokens=20, do_sample=True)[0]["generated_text"])

except Exception as e:
    print(f"Could not load the model directly with pipeline: {e}")
    print("This repository contains raw checkpoints that require conversion.")
    print("See the GitHub repository for conversion and usage instructions: https://github.com/recursal/RADLADS-paper")
    print("Or explore pre-converted models in the Recursal collection: https://huggingface.co/collections/recursal/radlads-6818ee69e99e729ba8a87102")
```

## Citation

If you use this code or find our work valuable, please consider citing RADLADS:

```bibtex
@misc{goldstein2025radladsrapidattentiondistillation,
      title={RADLADS: Rapid Attention Distillation to Linear Attention Decoders at Scale},
      author={Daniel Goldstein and Eric Alcaide and Janna Lu and Eugene Cheah},
      year={2025},
      eprint={2505.03005},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2505.03005},
}
```