---
license: mit
library_name: fasttext
pipeline_tag: text-classification
---
This is the fastText pretraining data filter targeting the SciQ task, discussed in the main text of the Perplexity Correlations paper: https://arxiv.org/abs/2409.05816
The accompanying package can be used to obtain LLM pretraining data sampling distributions using simple statistical methods. The compute requirements are minimal, and you don't need to train any LLMs yourself.
Essentially, this approach encourages training on domains where lower loss is strongly correlated with higher downstream performance. Existing, freely available LLMs can be used to estimate these correlations.
Code: https://github.com/TristanThrush/perplexity-correlations |
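As a rough sketch, the filter can be loaded with the `fasttext` library and used to score candidate pretraining documents. The weights filename, repository ID placeholder, and label names below are assumptions; check this repository's files and the linked code for the exact values.

```python
# Minimal usage sketch, not the official pipeline. Filename and label
# names are assumptions; verify against this repo and the linked code.
import fasttext
from huggingface_hub import hf_hub_download

# Download the fastText classifier weights from the Hub
# ("model.bin" is an assumed filename).
model_path = hf_hub_download(
    repo_id="<this-repo-id>",  # replace with this model's Hub repo ID
    filename="model.bin",
)
model = fasttext.load_model(model_path)

# Score a candidate pretraining document. fastText predicts on a single
# line of text, so strip newlines first.
doc = "Photosynthesis converts light energy into chemical energy in plants."
labels, probs = model.predict(doc.replace("\n", " "), k=2)
print(labels, probs)  # label names (e.g. include/exclude) depend on how the filter was trained
```

The predicted probabilities can then be used to rank or sample documents when building a pretraining mixture, as described in the paper and the linked codebase.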