---
license: mit
library_name: fasttext
pipeline_tag: text-classification
---

This is the fastText pretraining data filter targeting the SciQ task, discussed in the main text of the Perplexity Correlations paper: https://arxiv.org/abs/2409.05816
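As a rough illustration of how a fastText filter like this one might be applied, the sketch below keeps a document only when the classifier's top label is the "include" label with sufficient confidence. The label name `__label__include`, the 0.5 threshold, and the helper names `keep_document` and `load_filter` are assumptions made for this example and are not taken from this model card; check the labels the model actually emits before relying on them.

```python
def keep_document(model, text, include_label="__label__include", threshold=0.5):
    """Return True if the top predicted label is `include_label` with enough confidence.

    `include_label` and `threshold` are illustrative assumptions, not values
    documented for this model.
    """
    # fastText's predict() rejects newlines, so collapse all whitespace first.
    single_line = " ".join(text.split())
    labels, probs = model.predict(single_line, k=1)
    if not labels:
        # Nothing was predicted (for example, empty input): drop the document.
        return False
    return labels[0] == include_label and float(probs[0]) >= threshold


def load_filter(model_path):
    """Load a fastText model from a local file path (path supplied by the caller)."""
    # Imported lazily so keep_document can be used with any object exposing predict().
    import fasttext

    return fasttext.load_model(model_path)
```

A filter loaded with `load_filter` from a locally downloaded model file could then be applied as `kept = [doc for doc in docs if keep_document(model, doc)]`.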

The perplexity-correlations package (linked below) can be used to obtain LLM pretraining data sampling distributions using simple statistical methods. The compute requirements are minimal, and you don't need to train any LLMs yourself.

Essentially, this approach favors training on domains where lower loss is strongly correlated with higher downstream performance. These correlations can be estimated from existing, freely available LLMs.
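To make the idea concrete, here is a simplified sketch that ranks domains by how well lower loss tracks higher benchmark accuracy across a set of models. It uses a plain Spearman rank correlation as a stand-in; it is not the estimator from the paper or the package's API, and the domain names and all numbers are made up for illustration.

```python
def ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of entries tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equal-length sequences with nonzero variance."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical data: one loss per existing model for each domain,
# and one benchmark accuracy per model (same model order throughout).
losses = {
    "domain_a": [3.1, 2.8, 2.5, 2.2],
    "domain_b": [2.9, 3.0, 2.7, 2.95],
}
benchmark_accuracy = [0.40, 0.48, 0.55, 0.63]

# Negate the losses so a high score means "lower loss goes with higher accuracy".
scores = {
    domain: spearman([-loss for loss in domain_losses], benchmark_accuracy)
    for domain, domain_losses in losses.items()
}

# Domains with the highest scores are the ones this heuristic would favor.
ranked_domains = sorted(scores, key=scores.get, reverse=True)
print(ranked_domains)
```

In this toy data, `domain_a` ranks first because its loss falls consistently as accuracy rises, while the loss on `domain_b` shows no such ordering.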

Code: https://github.com/TristanThrush/perplexity-correlations