Update README.md
README.md CHANGED
@@ -16,7 +16,7 @@ Repository for CodeBERT, fine-tuned on Stack Overflow snippets with respect to N
 This model is initialized with [CodeBERT-base](https://huggingface.co/microsoft/codebert-base) and trained to classify whether a user will drop out given their posts and code snippets.
 ## Training Regime
 Preprocessing methods for input texts include unicode normalisation (NFC form), removal of extraneous whitespace, removal of punctuation (except within links), lowercasing, and removal of stopwords.
-Code snippets were also stripped of their in-line comments and docstrings (cf. the main manuscript). The RoBERTa tokenizer was used, as it is the built-in tokenizer for the original
+Code snippets were also stripped of their in-line comments and docstrings (cf. the main manuscript). The RoBERTa tokenizer was used, as it is the built-in tokenizer for the original CodeBERT implementation.
 
 Training was done for 8 epochs with a batch size of 8, a learning rate of 1e-5, and an epsilon (the constant in the weight-update denominator) of 1e-8.
 A random 20% sample of the entire dataset was used as the validation set.
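
As a concrete illustration of the text-preprocessing pipeline the README describes (NFC normalisation, whitespace and punctuation cleanup, lowercasing, stopword removal), here is a minimal Python sketch. The stopword list and the link-protection placeholder are illustrative assumptions; the README does not specify which were actually used.

```python
import re
import unicodedata

# Illustrative stopword list only; the README does not name the list used.
STOPWORDS = {"a", "an", "and", "are", "is", "it", "of", "the", "to", "was", "were"}

URL_RE = re.compile(r"https?://\S+")

def preprocess_text(text: str) -> str:
    """Sketch of the README's preprocessing steps for post texts."""
    # Unicode normalisation (NFC form).
    text = unicodedata.normalize("NFC", text)
    # Lowercasing.
    text = text.lower()
    # Protect links so their punctuation survives the next step.
    links = URL_RE.findall(text)
    text = URL_RE.sub(" __link__ ", text)
    # Remove punctuation outside links, then collapse extraneous whitespace.
    text = re.sub(r"[^\w\s]", " ", text)
    text = re.sub(r"\s+", " ", text).strip()
    # Restore links and drop stopwords.
    restored, link_iter = [], iter(links)
    for token in text.split():
        if token == "__link__":
            restored.append(next(link_iter))
        elif token not in STOPWORDS:
            restored.append(token)
    return " ".join(restored)

print(preprocess_text("See https://example.com/a?b=1 and THE answer!!"))
# -> see https://example.com/a?b=1 answer
```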
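
The comment and docstring stripping, followed by RoBERTa tokenisation, might look like the sketch below. The regexes are a naive stand-in assuming Python-style snippets (and they do not respect `#` inside string literals); only the tokenizer checkpoint name comes from the README.

```python
import re
from transformers import RobertaTokenizer

# CodeBERT reuses the RoBERTa tokenizer, so it loads directly from the
# base checkpoint linked in the README.
tokenizer = RobertaTokenizer.from_pretrained("microsoft/codebert-base")

def strip_comments(code: str) -> str:
    """Remove docstrings and '#' comments (Python-style snippets assumed;
    the manuscript's exact procedure may differ)."""
    code = re.sub(r'"""[\s\S]*?"""', "", code)  # triple-double-quoted docstrings
    code = re.sub(r"'''[\s\S]*?'''", "", code)  # triple-single-quoted docstrings
    code = re.sub(r"#[^\n]*", "", code)         # in-line '#' comments
    return code

snippet = 'def square(x):\n    """Return x squared."""\n    return x * x  # fast\n'
encoding = tokenizer(strip_comments(snippet), truncation=True, max_length=512)
print(tokenizer.convert_ids_to_tokens(encoding["input_ids"])[:8])
```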
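
Finally, a sketch of a fine-tuning setup using the hyperparameters quoted above (8 epochs, batch size 8, learning rate 1e-5, epsilon 1e-8) and the random 20% validation split. The toy rows, output directory, and seed are placeholders; the hyperparameters and checkpoint follow the README, but this is not necessarily the authors' training script.

```python
from datasets import Dataset
from transformers import (RobertaForSequenceClassification, RobertaTokenizer,
                          Trainer, TrainingArguments)

tokenizer = RobertaTokenizer.from_pretrained("microsoft/codebert-base")

# Toy stand-in rows; the real inputs are the preprocessed posts and snippets.
full_ds = Dataset.from_dict({
    "text": ["keeps posting code"] * 5 + ["goes silent"] * 5,
    "label": [0] * 5 + [1] * 5,
}).map(lambda ex: tokenizer(ex["text"], truncation=True,
                            padding="max_length", max_length=128))

# A random 20% of the whole dataset held out for validation, per the README.
split = full_ds.train_test_split(test_size=0.2, seed=42)

model = RobertaForSequenceClassification.from_pretrained(
    "microsoft/codebert-base", num_labels=2)  # binary drop-out label

# Hyperparameters quoted in the README: 8 epochs, batch size 8,
# learning rate 1e-5, Adam epsilon 1e-8.
args = TrainingArguments(
    output_dir="codebert-dropout",  # placeholder path
    num_train_epochs=8,
    per_device_train_batch_size=8,
    learning_rate=1e-5,
    adam_epsilon=1e-8,
)

Trainer(model=model, args=args,
        train_dataset=split["train"],
        eval_dataset=split["test"]).train()
```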