tarsur909 committed
Commit a33c32c · verified · 1 Parent(s): 6f4cc2c

add citation

Files changed (1)
  1. README.md +17 -1
README.md CHANGED
@@ -12,4 +12,20 @@ We release the scripts to evaluate our model's performance [here](https://github
 
 ## Training
 
-Our code reranker builds on LLM-based listwise reranking, which has gained prominence for its ability to score multiple passages simultaneously. We generated the listwise training data by selecting 50,000 <query, positive, negatives> tuples from our high-quality dataset [CoRNStack](https://gangiswag.github.io/cornstack/), filtered to ensure higher similarity scores and better ranks for the positives. Since CoRNStack does not contain the ranked orderings required to train listwise rerankers, we use rankings produced by the [Qwen-2.5-32B-Instruct](https://huggingface.co/Qwen/Qwen2.5-32B-Instruct) LLM for each example as ranking supervision. We initialize our reranker from [Qwen2.5-Coder-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-Coder-7B-Instruct) and fine-tune it with a language modeling objective that minimizes the prediction error on the next token in the sequence.
+Our code reranker builds on LLM-based listwise reranking, which has gained prominence for its ability to score multiple passages simultaneously. We generated the listwise training data by selecting 50,000 <query, positive, negatives> tuples from our high-quality dataset [CoRNStack](https://gangiswag.github.io/cornstack/), filtered to ensure higher similarity scores and better ranks for the positives. Since CoRNStack does not contain the ranked orderings required to train listwise rerankers, we use rankings produced by the [Qwen-2.5-32B-Instruct](https://huggingface.co/Qwen/Qwen2.5-32B-Instruct) LLM for each example as ranking supervision. We initialize our reranker from [Qwen2.5-Coder-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-Coder-7B-Instruct) and fine-tune it with a language modeling objective that minimizes the prediction error on the next token in the sequence.
+
+## Citation
+
+If you find the model, dataset, or training code useful, please cite our work:
+
+```bibtex
+@misc{suresh2025cornstackhighqualitycontrastivedata,
+  title={CoRNStack: High-Quality Contrastive Data for Better Code Retrieval and Reranking},
+  author={Tarun Suresh and Revanth Gangi Reddy and Yifei Xu and Zach Nussbaum and Andriy Mulyar and Brandon Duderstadt and Heng Ji},
+  year={2025},
+  eprint={2412.01007},
+  archivePrefix={arXiv},
+  primaryClass={cs.CL},
+  url={https://arxiv.org/abs/2412.01007},
+}
+```
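
For concreteness, here is a minimal sketch of the listwise fine-tuning objective the README describes. This is not the released training code: the prompt template, the `[1] > [2]` ordering format, and the helper names (`build_example`, `lm_loss`) are illustrative assumptions; only the base checkpoint and the next-token prediction loss come from the text above.

```python
# Minimal sketch of listwise reranker fine-tuning with a language modeling
# objective. Prompt/target formats below are illustrative assumptions, not
# the released CoRNStack training pipeline.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "Qwen/Qwen2.5-Coder-7B-Instruct"  # base checkpoint named in the README
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.bfloat16)

def build_example(query: str, passages: list[str], ranking: list[int]):
    """Format one training example.

    `ranking` is the ordering supplied by the teacher LLM
    (Qwen-2.5-32B-Instruct in the README), given as 1-based passage indices.
    """
    numbered = "\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    prompt = (
        f"Rank the following {len(passages)} code passages by relevance to the query.\n"
        f"Query: {query}\nPassages:\n{numbered}\nRanking:"
    )
    target = " " + " > ".join(f"[{i}]" for i in ranking)  # e.g. " [2] > [1] > [3]"
    return prompt, target

def lm_loss(prompt: str, target: str) -> torch.Tensor:
    """Next-token prediction loss, masked so only the ranking string is supervised."""
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    target_ids = tokenizer(
        target + tokenizer.eos_token, add_special_tokens=False, return_tensors="pt"
    ).input_ids
    input_ids = torch.cat([prompt_ids, target_ids], dim=-1)
    labels = input_ids.clone()
    labels[:, : prompt_ids.shape[-1]] = -100  # exclude prompt tokens from the loss
    return model(input_ids=input_ids, labels=labels).loss

# Toy usage: one query, three candidate passages, and a teacher-provided ranking.
prompt, target = build_example(
    "binary search over a sorted list",
    ["def bubble_sort(a): ...", "def bsearch(a, x): ...", "def parse_json(s): ..."],
    ranking=[2, 1, 3],
)
loss = lm_loss(prompt, target)
loss.backward()  # an optimizer step would follow in a real training loop
```

Masking the prompt tokens with `-100` is the standard way in `transformers` to restrict a causal LM loss to the target tokens, so the objective reduces to next-token prediction over the teacher-provided ordering.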