# CodeT5 for Code Comment Generation |
|
This is a CodeT5 model fine-tuned from Salesforce/codet5-base for generating natural language comments from Python code snippets. It maps code snippets to descriptive comments and can be used for automated code documentation, code understanding, or educational purposes. |
|
|
|
# Model Details |
|
**Model Description** |
|
**Model Type:** Sequence-to-Sequence Transformer |
|
**Base Model:** Salesforce/codet5-base |
|
**Maximum Sequence Length:** 128 tokens (input and output; see the truncation sketch below)
|
**Output:** Natural language comments describing the input code |
|
**Task:** Code-to-comment generation |
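
Inputs and outputs longer than 128 tokens are truncated. A minimal sketch of the truncation behavior, using the base tokenizer (the over-long snippet here is illustrative):

```python
from transformers import RobertaTokenizer

tokenizer = RobertaTokenizer.from_pretrained("Salesforce/codet5-base")

# Any input longer than 128 tokens is cut off at the limit.
long_snippet = "x = [i for i in range(100)]\n" * 50
encoded = tokenizer(long_snippet, max_length=128, truncation=True)
print(len(encoded["input_ids"]))  # 128
```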
|
|
|
# Model Sources |
|
**Documentation:** [CodeT5 documentation (GitHub README)](https://github.com/salesforce/CodeT5#readme)

**Repository:** [salesforce/CodeT5 on GitHub](https://github.com/salesforce/CodeT5)

**Hugging Face:** [Salesforce/codet5-base](https://huggingface.co/Salesforce/codet5-base)
|
|
|
# Full Model Architecture |
|
```
T5ForConditionalGeneration(
  (shared): Embedding(32100, 768)
  (encoder): T5Stack(
    (embed_tokens): Embedding(32100, 768)
    (block): ModuleList(...)
    (final_layer_norm): LayerNorm((768,), eps=1e-12)
    (dropout): Dropout(p=0.1)
  )
  (decoder): T5Stack(
    (embed_tokens): Embedding(32100, 768)
    (block): ModuleList(...)
    (final_layer_norm): LayerNorm((768,), eps=1e-12)
    (dropout): Dropout(p=0.1)
  )
  (lm_head): Linear(in_features=768, out_features=32100, bias=False)
)
```
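
You can reproduce this printout and check the parameter count directly; a small sketch (the base checkpoint is used here for illustration):

```python
from transformers import T5ForConditionalGeneration

model = T5ForConditionalGeneration.from_pretrained("Salesforce/codet5-base")
print(model)  # prints the full module tree

# codet5-base has on the order of 220M parameters
n_params = sum(p.numel() for p in model.parameters())
print(f"{n_params / 1e6:.0f}M parameters")
```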
|
|
|
# Usage

Install the dependencies, then load the model and run inference:

```bash
pip install -U transformers torch datasets
```

```python
import torch
from transformers import T5ForConditionalGeneration, RobertaTokenizer

# Download from the 🤗 Hub
model_name = "AventIQ-AI/t5_code_summarizer"
|
tokenizer = RobertaTokenizer.from_pretrained(model_name) |
|
model = T5ForConditionalGeneration.from_pretrained(model_name) |
|
|
|
# Move to GPU if available |
|
device = torch.device("cuda" if torch.cuda.is_available() else "cpu") |
|
model.to(device) |
|
|
|
# Inference |
|
code_snippet = "sum(d * 10 ** i for i, d in enumerate(x[::-1]))" |
|
inputs = tokenizer(code_snippet, max_length=128, truncation=True, padding="max_length", return_tensors="pt").to(device) |
|
outputs = model.generate( |
|
input_ids=inputs["input_ids"], |
|
attention_mask=inputs["attention_mask"], |
|
max_length=128, |
|
num_beams=4, |
|
early_stopping=True |
|
) |
|
comment = tokenizer.decode(outputs[0], skip_special_tokens=True) |
|
print(f"Code: {code_snippet}") |
|
print(f"Comment: {comment}") |
|
# Expected output: Something close to "Concatenate elements of a list 'x' of multiple integers to a single integer" |
|
``` |
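
For documenting many snippets at once, batching the `generate` call is cheaper than looping one snippet at a time. A small helper sketch that reuses the `tokenizer`, `model`, and `device` objects from above (the function name is illustrative):

```python
def generate_comments(snippets, batch_size=8):
    """Generate one comment per code snippet (illustrative helper)."""
    comments = []
    for i in range(0, len(snippets), batch_size):
        batch = snippets[i:i + batch_size]
        inputs = tokenizer(batch, max_length=128, truncation=True,
                           padding=True, return_tensors="pt").to(device)
        outputs = model.generate(**inputs, max_length=128,
                                 num_beams=4, early_stopping=True)
        comments.extend(tokenizer.decode(o, skip_special_tokens=True)
                        for o in outputs)
    return comments

print(generate_comments(["int(''.join(map(str, x)))"]))
```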
|
|
|
# Training Details |
|
**Training Dataset**
|
**Name:** janrauhl/conala |
|
**Size:** 2,300 training samples, 477 validation samples |
|
**Columns:** snippet (code), rewritten_intent (comment), intent, question_id |
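
The dataset can be inspected directly; a quick sketch, assuming the dataset is available on the Hugging Face Hub under the ID above:

```python
from datasets import load_dataset

dataset = load_dataset("janrauhl/conala")
print(dataset)  # DatasetDict with train/validation splits

example = dataset["train"][0]
print(example["snippet"], "->", example["rewritten_intent"])
```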
|
|
|
# Approximate Statistics (based on inspection)

```
snippet:
  type: string
  min length:  ~10 tokens
  mean length: ~20-30 tokens (estimated)
  max length:  ~100 tokens (before truncation)

rewritten_intent:
  type: string
  min length:  ~5 tokens
  mean length: ~10-15 tokens (estimated)
  max length:  ~50 tokens (before truncation)
```

Samples:

- snippet: `sum(d * 10 ** i for i, d in enumerate(x[::-1]))`
  rewritten_intent: "Concatenate elements of a list 'x' of multiple integers to a single integer"
- snippet: `int(''.join(map(str, x)))`
  rewritten_intent: "Convert a list of integers into a single integer"
- snippet: `datetime.strptime('2010-11-13 10:33:54.227806', '%Y-%m-%d %H:%M:%S.%f')`
  rewritten_intent: "Convert a DateTime string back to a DateTime object of format '%Y-%m-%d %H:%M:%S.%f'"
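
These figures are easy to re-check; a sketch that reuses the `tokenizer` and `dataset` objects loaded above:

```python
# Token lengths of the training snippets, per the CodeT5 tokenizer.
lengths = [len(tokenizer(ex["snippet"])["input_ids"]) for ex in dataset["train"]]
print(f"min={min(lengths)}, mean={sum(lengths) / len(lengths):.1f}, max={max(lengths)}")
```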
|
# Training Hyperparameters |
|
### Non-Default Hyperparameters
|
- **per_device_train_batch_size:** 4 |
|
- **per_device_eval_batch_size:** 4 |
|
- **gradient_accumulation_steps:** 2 (effective batch size = 8) |
|
- **num_train_epochs:** 10 |
|
- **learning_rate:** 1e-4 |
|
- **fp16:** True |
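
These settings map onto transformers' `Seq2SeqTrainingArguments` roughly as in the sketch below; `output_dir` is an assumption, not a documented choice from the training run:

```python
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="./codet5-comment-gen",  # assumed path
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    gradient_accumulation_steps=2,      # effective batch size = 8
    num_train_epochs=10,
    learning_rate=1e-4,
    fp16=True,
)
```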
|
|
|
|