	The [PolyCoder paper](https://arxiv.org/pdf/2202.13169v3.pdf) gives a nice comparison of existing code models. The model was trained on 254GB of data (measured after preprocessing) drawn from popular GitHub repositories, each with at least 50 stars, covering 12 popular programming languages and collected in October 2021. The data went through the following preprocessing:
	- Exact match deduplication
	- Filtering:
	  - Files larger than 1 MB are removed
	  - Files with fewer than 100 tokens are removed
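
The preprocessing steps can be sketched roughly as follows. This is a minimal illustration, not PolyCoder's actual pipeline. Exact-match deduplication is done here by hashing file contents, and filtering is expressed as a size cap plus a minimum token count. The function name `preprocess`, the parameters `max_bytes` and `min_tokens`, their default values, and the use of whitespace splitting in place of a real tokenizer are all assumptions made for the example.

```python
import hashlib


def preprocess(files, max_bytes=1_000_000, min_tokens=100):
    """Return the files that survive deduplication and filtering.

    files: dict mapping a (hypothetical) path to the file's text content.
    max_bytes, min_tokens: assumed thresholds, adjust to the real pipeline.
    """
    seen_hashes = set()
    kept = {}
    for path, text in files.items():
        data = text.encode("utf-8")

        # Exact-match deduplication: identical contents hash identically,
        # so only the first occurrence of each content is considered.
        digest = hashlib.sha256(data).hexdigest()
        if digest in seen_hashes:
            continue
        seen_hashes.add(digest)

        # Filter out very large files.
        if len(data) > max_bytes:
            continue

        # Filter out very short files. Whitespace splitting is a
        # stand-in for a real tokenizer (assumption).
        if len(text.split()) < min_tokens:
            continue

        kept[path] = text
    return kept
```

Hashing keeps memory use proportional to the number of distinct files rather than their total size, which matters at the scale of hundreds of gigabytes; a near-duplicate detector would need a different technique.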