add citation
README.md
CHANGED
@@ -12,4 +12,20 @@ We release the scripts to evaluate our model's performance [here](https://github
## Training
Our code reranker uses LLM-based listwise reranking, which has gained prominence for its ability to score multiple passages simultaneously. To build the training data, we selected 50,000 `<query, positive, negatives>` tuples from our high-quality dataset [CoRNStack](https://gangiswag.github.io/cornstack/), filtered to ensure higher similarity scores and better ranks for the positives. Since CoRNStack doesn't contain the ranked orderings required for training listwise rerankers, we use ranked orderings generated by [Qwen-2.5-32B-Instruct](https://huggingface.co/Qwen/Qwen2.5-32B-Instruct) for each example as ranking supervision. We initialize our reranker with [Qwen2.5-Coder-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-Coder-7B-Instruct) and fine-tune it with a language modeling objective that minimizes the next-token prediction error.
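
To make the listwise setup more concrete, here is a minimal sketch of how one tuple and its teacher-provided ordering could be turned into a prompt/target pair for next-token training. The function name `build_listwise_example`, the prompt wording, and the `[2] > [3] > [1]` target format are assumptions chosen for illustration (the format is common in listwise reranking work); this README does not specify the actual template used.

```python
from typing import Dict, List


def build_listwise_example(query: str, passages: List[str], ranking: List[int]) -> Dict[str, str]:
    """Turn one query, its candidate passages, and a teacher ranking into a prompt/target pair.

    `ranking` lists 1-based passage identifiers, most relevant first.
    The prompt wording and the "[i] > [j]" target format are illustrative
    assumptions, not the template confirmed by this README.
    """
    if sorted(ranking) != list(range(1, len(passages) + 1)):
        raise ValueError("ranking must be a permutation of 1..len(passages)")

    # Number each candidate so the model can refer to it by identifier.
    numbered = "\n".join(f"[{i}] {p}" for i, p in enumerate(passages, start=1))
    prompt = (
        f"Query: {query}\n\n"
        f"Rank the following {len(passages)} code snippets by relevance to the query.\n"
        f"{numbered}\n\n"
        "Ranking:"
    )
    # The target is the ordering the reranker should learn to generate.
    target = " > ".join(f"[{i}]" for i in ranking)
    return {"prompt": prompt, "target": target}


example = build_listwise_example(
    query="reverse a linked list",
    passages=["def add(a, b): return a + b", "def reverse(head): ...", "def sort(xs): ..."],
    ranking=[2, 3, 1],
)
print(example["target"])  # [2] > [3] > [1]
```

Under this sketch, fine-tuning with a language modeling objective would mean computing the next-token loss on the `target` string given the `prompt`.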

# Citation

If you find the model, dataset, or training code useful, please cite our work:

```bibtex
@misc{suresh2025cornstackhighqualitycontrastivedata,
      title={CoRNStack: High-Quality Contrastive Data for Better Code Retrieval and Reranking},
      author={Tarun Suresh and Revanth Gangi Reddy and Yifei Xu and Zach Nussbaum and Andriy Mulyar and Brandon Duderstadt and Heng Ji},
      year={2025},
      eprint={2412.01007},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2412.01007},
}
```