wangclnlp committed (verified)
Commit 84fe5e4 · 1 Parent(s): f8bfb75

Update README.md

Files changed (1)
  1. README.md +11 -3
README.md CHANGED
@@ -12,9 +12,9 @@ tags:
  ---
  # Introduction
  
- This repository contains the released models for the paper [GRAM: A Generative Foundation Reward Model for Reward Generalization 📝]().
+ This repository contains the released models for the paper [GRAM: A Generative Foundation Reward Model for Reward Generalization 📝](https://arxiv.org/abs/2506.14175).
  
- <img src="https://raw.githubusercontent.com/wangclnlp/GRAM/refs/heads/main/gram.png?token=GHSAT0AAAAAAC5DHKJKFOGQKURCJNSPUTJG2CRBUJQ" width="1000px"></img>
+ <img src="https://raw.githubusercontent.com/wangclnlp/GRAM/refs/heads/main/gram.png" width="1000px"></img>
  
  This training process is introduced above. Traditionally, these models are trained using labeled data, which can limit their potential. In this study, we propose a new method that combines both labeled and unlabeled data for training reward models. We introduce a generative reward model that first learns from a large amount of unlabeled data and is then fine-tuned with supervised data. Additionally, we demonstrate that using label smoothing during training improves performance by optimizing a regularized ranking loss. This approach bridges generative and discriminative models, offering a new perspective on training reward models. Our model can be easily applied to various tasks without the need for extensive fine-tuning. This means that when aligning LLMs, there is no longer a need to train a reward model from scratch with large amounts of task-specific labeled data. Instead, **you can directly apply our reward model or adapt it to align your LLM based on our [code](https://github.com/wangclnlp/GRAM/tree/main)**.
  
@@ -96,5 +96,13 @@ print({
  
  If you find this model helpful for your research, please cite GRAM:
  ```bash
- bib
+ @misc{wang2025gram,
+   title={GRAM: A Generative Foundation Reward Model for Reward Generalization},
+   author={Chenglong Wang and Yang Gan and Yifu Huo and Yongyu Mu and Qiaozhi He and Murun Yang and Bei Li and Tong Xiao and Chunliang Zhang and Tongran Liu and Jingbo Zhu},
+   year={2025},
+   eprint={2506.14175},
+   archivePrefix={arXiv},
+   primaryClass={cs.CL},
+   url={https://arxiv.org/abs/2506.14175},
+ }
  ```
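
For readers who want a concrete picture of the "label smoothing ... optimizing a regularized ranking loss" idea mentioned in the introduction paragraph of the diff above, here is a minimal PyTorch sketch of a label-smoothed pairwise ranking loss. It is an illustration only: the function name, the smoothing coefficient `alpha`, and the toy reward scores are assumptions made for this sketch, not code taken from the GRAM repository or the paper.

```python
# Minimal sketch of a label-smoothed pairwise ranking loss.
# The names below (label_smoothed_ranking_loss, alpha) are illustrative
# assumptions, not the GRAM repository's API.
import torch
import torch.nn.functional as F


def label_smoothed_ranking_loss(chosen_rewards: torch.Tensor,
                                rejected_rewards: torch.Tensor,
                                alpha: float = 0.1) -> torch.Tensor:
    """Pairwise ranking loss with label smoothing.

    With alpha = 0 this reduces to the standard -log sigmoid(r_chosen - r_rejected)
    preference loss; alpha > 0 softens the preference label, acting as the
    regularizer referred to in the README paragraph.
    """
    margin = chosen_rewards - rejected_rewards
    loss = -(1.0 - alpha) * F.logsigmoid(margin) - alpha * F.logsigmoid(-margin)
    return loss.mean()


# Toy usage with random reward scores for a batch of preference pairs.
chosen = torch.randn(8)
rejected = torch.randn(8)
print(label_smoothed_ranking_loss(chosen, rejected, alpha=0.1))
```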