---
base_model: unsloth/Qwen2.5-0.5B-Instruct
library_name: peft
license: mit
datasets:
- HoangHa/pensez-grpo
language:
- en
pipeline_tag: text-generation
tags:
- math
- trl
- unsloth
- grpo
- transformers
---

# Model Card for Math-RL

## Model Details

This model is a fine-tuned version of Qwen2.5-0.5B-Instruct, trained with Group Relative Policy Optimization (GRPO) on a curated set of about 700 math problems.
Fine-tuning aims to strengthen the model's step-by-step reasoning on mathematical problems and, more broadly, on structured reasoning tasks.
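
For readers unfamiliar with GRPO, its central idea is to score each sampled completion relative to the other completions drawn for the same prompt. The sketch below shows only that normalization step in plain Python. The function name and reward values are hypothetical, and this is not the training code used for this model.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of completions for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    # eps avoids division by zero when every completion gets the same reward
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical rewards for four completions of one math problem
# (1.0 = correct final answer, 0.0 = incorrect).
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Completions that beat the group average get a positive advantage and the rest get a negative one. Implementations may differ in details such as using the sample rather than the population standard deviation.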

### Model Description

- Language(s) (NLP): English
- License: MIT
- Finetuned from model: Qwen2.5-0.5B-Instruct
- Fine-tuning Method: GRPO with LoRA
- Domain: Mathematics (problem-solving, reasoning)
- Dataset Size: ~700 examples


## Uses

### Direct Use

The model is intended for:

- Educational purposes: assisting students with math problems
- Research on small-scale RLHF-style fine-tuning (GRPO)
- Experiments in reasoning with small instruction-tuned models
- Serving as a lightweight math reasoning assistant in constrained environments


## Bias, Risks, and Limitations

- Small Dataset: Fine-tuned on only about 700 math problems, so generalization is limited.
- Reasoning Errors: May produce incorrect or hallucinated answers. Always verify results.
- Not a Math Oracle: Should not be used in high-stakes scenarios (e.g., exams, grading, critical calculations).
- Limited Scope: Performance is strongest on problems similar to the fine-tuning dataset and may degrade on problems outside that distribution.
- Language: The base model supports multiple languages, but the math-specific fine-tuning data was primarily English.
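
Because answers should always be verified, a cheap numeric sanity check is often worth running for closed-form problems. As a sketch, the sample question from the getting-started section (shift $y=\sin 2x$ left by $\pi/6$, double the ordinate, and require the minimum of $f(x)+a$ on $[0, \pi/2]$ to be $\sqrt{3}$) can be checked by dense sampling. The form of `f` below is derived by hand from the problem statement and is not model output.

```python
import math

def f(x):
    # y = sin(2x) shifted left by pi/6, then ordinates doubled
    return 2 * math.sin(2 * (x + math.pi / 6))

# Sample the interval [0, pi/2] densely, endpoints included.
n = 10_000
xs = [i * (math.pi / 2) / n for i in range(n + 1)]
min_f = min(f(x) for x in xs)

# min(f + a) = sqrt(3)  =>  a = sqrt(3) - min(f)
a = math.sqrt(3) - min_f
print(a)
```

The printed value can be compared with the model's `<answer>` block. A hand derivation gives $a = 2\sqrt{3}$, since the minimum of $f$ on the interval is $-\sqrt{3}$ at $x = \pi/2$.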


## How to Get Started with the Model

The code below loads the base model, attaches the LoRA adapter, and streams an answer to a sample question. It assumes a CUDA GPU is available as device 0.

```python
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer, TextStreamer

# If the Hub asks for authentication, run `huggingface-cli login` first
# or call huggingface_hub.login() here.

base_id = "unsloth/Qwen2.5-0.5B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(base_id)
base_model = AutoModelForCausalLM.from_pretrained(
    base_id,
    device_map={"": 0},  # place the whole model on GPU 0
)

# Attach the fine-tuned LoRA adapter to the base model
model = PeftModel.from_pretrained(base_model, "Rustamshry/Math-RL")

# Raw string, so LaTeX backslashes (\right, \boxed, ...) are not read as escapes
question = r"""
Translate the graph of the function $y=\sin 2x$ along the $x$-axis to the left by $\dfrac{\pi }{6}$ units, and stretch the ordinate to twice its original length (the abscissa remains unchanged) to obtain the graph of the function $y=f(x)$. If the minimum value of the function $y=f(x)+a$ on the interval $\left[ 0,\dfrac{\pi }{2} \right]$ is $\sqrt{3}$, then $a=\boxed{\_\_\_\_\_}$.
"""

system = """
Respond in the following format:
<reasoning>
...
</reasoning>
<answer>
...
</answer>
"""

messages = [
    {"role": "system", "content": system},
    {"role": "user", "content": question},
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,  # end the prompt with the assistant turn marker
)

# Move the inputs to the same device as the model, then stream the output
inputs = tokenizer(text, return_tensors="pt").to(model.device)
_ = model.generate(
    **inputs,
    max_new_tokens=2048,
    streamer=TextStreamer(tokenizer, skip_prompt=True),
)
```
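
The system prompt asks for the final result inside `<answer>` tags, so downstream code will usually want to pull that part out of the decoded completion. Here is a minimal sketch, assuming the completion is already available as a string. The helper name `extract_answer` and the sample text are hypothetical, and a small model may not always follow the format, so the helper returns `None` when the tags are missing.

```python
import re

def extract_answer(completion):
    """Return the text inside the last <answer>...</answer> pair, or None."""
    matches = re.findall(r"<answer>(.*?)</answer>", completion, flags=re.DOTALL)
    return matches[-1].strip() if matches else None

# Hypothetical completion text, written by hand for illustration.
sample = "<reasoning>\n2 + 2 = 4\n</reasoning>\n<answer>\n4\n</answer>"
parsed = extract_answer(sample)
print(parsed)
```

Taking the last match is a design choice: if the model repeats the format, the final block is treated as its final answer.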


### Framework versions

- PEFT 0.15.2