---
license: mit
widget:
language:
- en
datasets:
- pytorrent
---
# 🔥 RoBERTa-MLM-based PyTorrent 1M 🔥

Pretrained weights based on the [PyTorrent dataset](https://github.com/fla-sil/PyTorrent), a curated corpus built from a large collection of official Python packages.

We use the PyTorrent dataset to train a preliminary DistilBERT masked language modeling (MLM) model from scratch. The trained model, along with the dataset, aims to help researchers work easily and efficiently with a large corpus of Python packages, needing only five lines of code to load the transformer-based model. We use 1M raw Python scripts from PyTorrent, comprising 12,350,000 lines of code (LOC), to train the model. We also train a byte-level Byte-Pair Encoding (BPE) tokenizer with a vocabulary of 56,000 tokens; lines of code are truncated to a length of 50 to save computation resources.
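
For reference, a tokenizer of this kind can be trained with the Hugging Face `tokenizers` library. The sketch below is only an illustration, not the exact training script: the corpus file name and `min_frequency` value are assumptions, while the 56,000-token vocabulary follows the description above and the special tokens assume the usual RoBERTa-style set.

```python
# A minimal sketch of training a byte-level BPE tokenizer on raw Python
# scripts with the Hugging Face `tokenizers` library. The corpus file name
# and min_frequency are illustrative assumptions.
from tokenizers import ByteLevelBPETokenizer

tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=["pytorrent_scripts.txt"],   # hypothetical dump of the 1M raw Python scripts
    vocab_size=56_000,                 # vocabulary size quoted above
    min_frequency=2,                   # assumed cut-off for rare byte pairs
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],  # assumed RoBERTa-style specials
)
tokenizer.save_model("pytorrent-tokenizer")  # writes vocab.json and merges.txt
```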
### Training Objective

This model is trained with a masked language modeling (MLM) objective.
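
The objective can be illustrated with a fill-mask query, where the model recovers a token hidden behind the mask. This is a minimal sketch that assumes the published checkpoint retains its MLM head and uses a RoBERTa-style `<mask>` token.

```python
# Minimal illustration of the MLM objective at inference time: predict the
# token hidden behind <mask>. Assumes the checkpoint retains its MLM head.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="Fujitsu/pytorrent")
predictions = fill_mask("def add(a, b):\n    return a <mask> b")
for p in predictions:
    print(p["token_str"], p["score"])
```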
## How to use the model?

```python
from transformers import AutoTokenizer, AutoModel

# Load the tokenizer and the pretrained weights from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained("Fujitsu/pytorrent")
model = AutoModel.from_pretrained("Fujitsu/pytorrent")
```
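
Once loaded, the tokenizer and model can embed a snippet of Python code. The example below is only a usage sketch; the sample function and printed shape are illustrative.

```python
# Encode a short Python snippet and obtain its contextual embeddings.
import torch

code = "def hello():\n    print('Hello, world!')"
inputs = tokenizer(code, return_tensors="pt", truncation=True)
with torch.no_grad():
    outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (batch size, sequence length, hidden size)
```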
## Citation

Preprint: [https://arxiv.org/pdf/2110.01710.pdf](https://arxiv.org/pdf/2110.01710.pdf)

```
@misc{bahrami2021pytorrent,
      title={PyTorrent: A Python Library Corpus for Large-scale Language Models},
      author={Mehdi Bahrami and N. C. Shrikanth and Shade Ruangwan and Lei Liu and Yuji Mizobuchi and Masahiro Fukuyori and Wei-Peng Chen and Kazuki Munakata and Tim Menzies},
      year={2021},
      eprint={2110.01710},
      archivePrefix={arXiv},
      primaryClass={cs.SE},
      howpublished={https://arxiv.org/pdf/2110.01710},
}
```