# GRPO-Trained Qwen 2.5 LoRA
This is a GRPO (Group Relative Policy Optimization) fine-tuned LoRA adapter for Qwen 2.5.
## Model Details
- Base model: Qwen/Qwen2.5
- Training method: GRPO
- LoRA rank: 16
- Training objectives: XML formatting, correctness, and structured output (see the training sketch after this list)
- Use case: Generating well-formatted, structured outputs with improved accuracy
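The actual training script is not published with this card. The following is a minimal sketch of a comparable setup, assuming trl's `GRPOTrainer` with a peft `LoraConfig` at rank 16; the reward function, dataset, and hyperparameters shown are illustrative placeholders, not the ones used for this adapter.

```python
# Minimal GRPO training sketch (illustrative; not the actual training script).
# Assumes trl's GRPOTrainer and peft's LoraConfig; the reward function and
# dataset below are placeholders for the XML-formatting/correctness objectives.
from datasets import Dataset
from peft import LoraConfig
from trl import GRPOConfig, GRPOTrainer

def xml_format_reward(completions, **kwargs):
    # Hypothetical reward: 1.0 if the completion is wrapped in <answer> tags.
    return [1.0 if "<answer>" in c and "</answer>" in c else 0.0
            for c in completions]

# Placeholder prompt dataset; GRPOTrainer expects a "prompt" column.
train_dataset = Dataset.from_dict(
    {"prompt": ["Answer inside <answer></answer> tags: what is 2 + 2?"]}
)

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5",  # base model ID from this card
    reward_funcs=xml_format_reward,
    args=GRPOConfig(output_dir="qwen2.5-grpo-lora"),
    train_dataset=train_dataset,
    peft_config=LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM"),
)
trainer.train()
```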
## Usage

```python
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the base model and tokenizer
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5")

# Load the LoRA adapter
model = PeftModel.from_pretrained(model, "yashwanthjanke/qwen2.5-grpo-lora")

# Use the model
input_text = "Your prompt here"
inputs = tokenizer(input_text, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=100)
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)
```
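To serve the model without loading the adapter separately, the LoRA weights can be merged into the base model. A minimal sketch continuing from the snippet above, using peft's `merge_and_unload` (the output directory name is illustrative):

```python
# Merge the LoRA weights into the base model for standalone deployment.
merged_model = model.merge_and_unload()
merged_model.save_pretrained("qwen2.5-grpo-merged")  # illustrative path
tokenizer.save_pretrained("qwen2.5-grpo-merged")
```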