---
license: cc-by-4.0
language:
- en
library_name: transformers
pipeline_tag: text-classification
tags:
- code
metrics:
- accuracy
- f1
---
|
# CodeBERT-SO
|
Repository for CodeBERT, fine-tuned on Stack Overflow snippets using NL-PL pairs from six languages (Python, Java, JavaScript, PHP, Ruby, Go).
|
## Training Objective
|
This model is initialized from [CodeBERT-base](https://huggingface.co/microsoft/codebert-base) and trained to classify whether a user will drop out, given their posts and code snippets.
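
As a minimal sketch (not the exact training script), the corresponding setup in `transformers` loads CodeBERT-base with a freshly initialized two-way classification head; treating the labels as drop-out vs. no drop-out is our reading of the objective above:

```python
from transformers import AutoModelForSequenceClassification

# CodeBERT-base encoder plus a randomly initialized 2-way classification
# head (assumed labels: drop-out vs. no drop-out).
model = AutoModelForSequenceClassification.from_pretrained(
    "microsoft/codebert-base", num_labels=2
)
```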
|
## Training Regime
|
Preprocessing of the input texts includes Unicode normalisation (NFC form), removal of extraneous whitespace, removal of punctuation (except within links), lowercasing, and stopword removal.
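
A minimal sketch of such a pipeline in Python follows; the URL regex, the placeholder stopword list, and the exact ordering of steps are assumptions, not the manuscript's implementation:

```python
import re
import string
import unicodedata

# Placeholder stopword list; the manuscript's exact list is not given here.
STOPWORDS = {"the", "a", "an", "and", "is", "are", "to", "of", "in", "it"}
URL_RE = re.compile(r"https?://\S+")
PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def preprocess(text: str) -> str:
    # Unicode normalisation to NFC form.
    text = unicodedata.normalize("NFC", text)
    cleaned = []
    # Splitting on whitespace also collapses extraneous whitespace.
    for tok in text.split():
        if URL_RE.match(tok):
            cleaned.append(tok)  # keep punctuation inside links intact
            continue
        tok = tok.translate(PUNCT_TABLE).lower()
        if tok and tok not in STOPWORDS:  # drop empties and stopwords
            cleaned.append(tok)
    return " ".join(cleaned)

print(preprocess("It IS   a great answer, see https://example.com please!"))
# -> "great answer see https://example.com please"
```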
|
In-line comments and docstrings were also stripped from the code snippets (cf. the main manuscript). The RoBERTa tokenizer was used, as it is the built-in tokenizer for the original CodeBERT implementation.
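
The tokenizer ships with the CodeBERT-base checkpoint, so it can be loaded directly; a sketch of encoding a preprocessed post together with its comment-stripped snippet as an NL-PL pair (the example strings are illustrative, not from the dataset):

```python
from transformers import AutoTokenizer

# RoBERTa-style tokenizer bundled with the CodeBERT-base checkpoint.
tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")

post = "how do i reverse a list"
snippet = "xs = xs[::-1]"

# NL (post) and PL (snippet) segments are encoded together as one pair.
inputs = tokenizer(post, snippet, truncation=True, return_tensors="pt")
```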
|
|
|
Training ran for 8 epochs with a batch size of 8, a learning rate of 1e-5, and an epsilon (the denominator term in the weight update) of 1e-8.
|
A random 20% sample of the entire dataset was used as the validation set.
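
Put together, a sketch of this configuration with the `transformers` Trainer; `model` is the classification model from above, while `train_ds`/`val_ds` are placeholder dataset objects, and any optimizer details beyond those listed are assumptions:

```python
from transformers import Trainer, TrainingArguments

args = TrainingArguments(
    output_dir="codebert-so",
    num_train_epochs=8,             # 8 epochs
    per_device_train_batch_size=8,  # batch size of 8
    learning_rate=1e-5,             # learning rate of 1e-5
    adam_epsilon=1e-8,              # epsilon in the weight-update denominator
)

trainer = Trainer(
    model=model,             # the classification model from above
    args=args,
    train_dataset=train_ds,  # remaining 80% of the data (placeholder)
    eval_dataset=val_ds,     # random 20% validation sample (placeholder)
)
trainer.train()
trainer.evaluate()
```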
|
## Performance
|
* Final validation accuracy: 0.822

* Final validation F1: 0.809

* Final validation loss: 0.5
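
For reference, both reported metrics can be computed with scikit-learn; a short sketch, assuming binary labels and scikit-learn's default (binary) F1 averaging, which the card does not specify:

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

def compute_metrics(eval_pred):
    # eval_pred is the (logits, labels) pair the Trainer passes in.
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return {
        "accuracy": accuracy_score(labels, preds),
        "f1": f1_score(labels, preds),
    }
```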