	---
	license: mit
	widget:
	language:
	- en

	datasets:
	- pytorrent
	---

	# 🔥 RoBERTa-MLM-based PyTorrent 1M 🔥
	Pretrained weights based on the [PyTorrent Dataset](https://github.com/fla-sil/PyTorrent), a curated dataset built from a large collection of official Python packages.
	We use the PyTorrent dataset to train a preliminary DistilBERT-Masked Language Modeling (MLM) model from scratch. The trained model, along with the dataset, aims to help researchers work easily and efficiently with a large dataset of Python packages, loading the transformer-based model in only 5 lines of code. We train the model on 1M raw Python scripts from PyTorrent, which contain 12,350,000 lines of code (LOC). We also train a byte-level byte-pair encoding (BPE) tokenizer with a vocabulary of 56,000 tokens; each LOC is truncated to a length of 50 to save computational resources.

	## Training Objective
	This model is trained with a Masked Language Model (MLM) objective.
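
To make the objective concrete, here is a minimal, simplified sketch of how MLM training examples can be built: some input tokens are hidden and the model is asked to recover them. The `<mask>` string, the 15% masking rate, and the `mask_tokens` helper are illustrative assumptions, not details taken from the PyTorrent training setup.

```python
import random

MASK = "<mask>"  # assumed mask token; the real tokenizer defines its own


def mask_tokens(tokens, mask_prob=0.15, seed=None):
    """Return (masked_tokens, labels) for a simplified MLM example.

    Each token is replaced by MASK with probability mask_prob (an assumed
    rate). labels holds the original token at masked positions and None
    elsewhere, so a loss would only be computed on the masked positions.
    """
    rng = random.Random(seed)
    masked = list(tokens)
    labels = [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            masked[i] = MASK
            labels[i] = tok
    return masked, labels


if __name__ == "__main__":
    code_tokens = ["import", "os", "print", "(", "os", ".", "getcwd", "(", ")", ")"]
    masked, labels = mask_tokens(code_tokens, seed=0)
    print(masked)
    print(labels)
```

Production MLM pipelines usually add refinements, such as occasionally keeping or randomly replacing a selected token instead of masking it; this sketch leaves those out for brevity.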

	## How to use the model?
	```python
	from transformers import AutoTokenizer, AutoModel

	# Download the tokenizer and the pretrained weights from the Hugging Face Hub
	tokenizer = AutoTokenizer.from_pretrained("Fujitsu/pytorrent")
	model = AutoModel.from_pretrained("Fujitsu/pytorrent")
	```
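
Once the model is loaded, one common way to use it for feature extraction is to average the final hidden states over the non-padding tokens of each snippet. The sketch below illustrates that idea; the helper names `mean_pool`, `embed`, and `demo` are hypothetical, mean pooling is an assumed choice rather than a method prescribed by the authors, and the code assumes PyTorch is installed and that the tokenizer defines a padding token.

```python
import torch


def mean_pool(last_hidden_state, attention_mask):
    """Average token vectors per sequence, ignoring padding positions."""
    # (batch, seq) -> (batch, seq, 1) so the mask broadcasts over hidden dims
    mask = attention_mask.unsqueeze(-1).to(last_hidden_state.dtype)
    summed = (last_hidden_state * mask).sum(dim=1)
    counts = mask.sum(dim=1).clamp(min=1.0)  # guard against division by zero
    return summed / counts


def embed(snippets, tokenizer, model):
    """Return one vector per code snippet (hypothetical helper)."""
    batch = tokenizer(snippets, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**batch)
    return mean_pool(outputs.last_hidden_state, batch["attention_mask"])


def demo():
    """Download the model and embed two short snippets (needs network access)."""
    from transformers import AutoTokenizer, AutoModel

    tokenizer = AutoTokenizer.from_pretrained("Fujitsu/pytorrent")
    model = AutoModel.from_pretrained("Fujitsu/pytorrent")
    model.eval()
    vectors = embed(["import os", "print(os.getcwd())"], tokenizer, model)
    print(vectors.shape)  # (number of snippets, hidden size)
```

Calling `demo()` downloads the weights and prints the shape of the resulting embedding matrix. The vectors can then be fed to a downstream classifier or compared with cosine similarity.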
	## Citation
	Preprint: [https://arxiv.org/pdf/2110.01710.pdf](https://arxiv.org/pdf/2110.01710.pdf)
	```
	@misc{bahrami2021pytorrent,
	title={PyTorrent: A Python Library Corpus for Large-scale Language Models},
	author={Mehdi Bahrami and N. C. Shrikanth and Shade Ruangwan and Lei Liu and Yuji Mizobuchi and Masahiro Fukuyori and Wei-Peng Chen and Kazuki Munakata and Tim Menzies},
	year={2021},
	eprint={2110.01710},
	archivePrefix={arXiv},
	primaryClass={cs.SE},
	howpublished={https://arxiv.org/pdf/2110.01710},
	}
	```