# CodeT5 for Code Comment Generation
This is a CodeT5 model fine-tuned from Salesforce/codet5-base for generating natural language comments from Python code snippets. It maps code snippets to descriptive comments and can be used for automated code documentation, code understanding, or educational purposes.
# Model Details
**Model Description**
**Model Type:** Sequence-to-Sequence Transformer
**Base Model:** Salesforce/codet5-base
**Maximum Sequence Length:** 128 tokens (input and output)
**Output:** Natural language comments describing the input code
**Task:** Code-to-comment generation
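Both input and output are capped at 128 tokens, so longer snippets are truncated. As a quick sanity check (a minimal sketch; reusing the `Salesforce/codet5-base` tokenizer is an assumption here), you can measure how many tokens a snippet occupies before sending it to the model:

```python
from transformers import RobertaTokenizer

# Assumption: the fine-tuned model reuses the Salesforce/codet5-base tokenizer
tokenizer = RobertaTokenizer.from_pretrained("Salesforce/codet5-base")

snippet = "sum(d * 10 ** i for i, d in enumerate(x[::-1]))"
n_tokens = len(tokenizer(snippet)["input_ids"])
print(n_tokens, n_tokens <= 128)  # snippets longer than 128 tokens are truncated
```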
# Model Sources
**Documentation:** CodeT5 Documentation
**Repository:** CodeT5 on GitHub
**Hugging Face:** CodeT5 on Hugging Face
# Full Model Architecture
```
T5ForConditionalGeneration(
  (shared): Embedding(32100, 768)
  (encoder): T5Stack(
    (embed_tokens): Embedding(32100, 768)
    (block): ModuleList(...)
    (final_layer_norm): LayerNorm((768,), eps=1e-12)
    (dropout): Dropout(p=0.1)
  )
  (decoder): T5Stack(
    (embed_tokens): Embedding(32100, 768)
    (block): ModuleList(...)
    (final_layer_norm): LayerNorm((768,), eps=1e-12)
    (dropout): Dropout(p=0.1)
  )
  (lm_head): Linear(in_features=768, out_features=32100, bias=False)
)
```
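The printed module tree above can be reproduced by loading the checkpoint and printing it; the sketch below also counts parameters (the model ID is the one used in the inference example further down):

```python
from transformers import T5ForConditionalGeneration

model = T5ForConditionalGeneration.from_pretrained("AventIQ-AI/t5_code_summarizer")
print(model)  # prints the architecture shown above

# Rough parameter count; codet5-base is roughly 220M parameters
n_params = sum(p.numel() for p in model.parameters())
print(f"Parameters: {n_params:,}")
```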
# Usage
Install the required packages:
```bash
pip install -U transformers torch datasets
```
Then, load the model and run inference:
```python
import torch
from transformers import T5ForConditionalGeneration, RobertaTokenizer

# Download from the 🤗 Hub
model_name = "AventIQ-AI/t5_code_summarizer" # Update with your HF model ID
tokenizer = RobertaTokenizer.from_pretrained(model_name)
model = T5ForConditionalGeneration.from_pretrained(model_name)
# Move to GPU if available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
# Inference
code_snippet = "sum(d * 10 ** i for i, d in enumerate(x[::-1]))"
inputs = tokenizer(code_snippet, max_length=128, truncation=True, padding="max_length", return_tensors="pt").to(device)
outputs = model.generate(
    input_ids=inputs["input_ids"],
    attention_mask=inputs["attention_mask"],
    max_length=128,
    num_beams=4,
    early_stopping=True
)
comment = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(f"Code: {code_snippet}")
print(f"Comment: {comment}")
# Expected output: Something close to "Concatenate elements of a list 'x' of multiple integers to a single integer"
```
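To document several snippets at once, the single-snippet code above can be wrapped in a small batching helper. This is a sketch, not part of the original card; it reuses the `model`, `tokenizer`, and `device` objects created above:

```python
def generate_comments(snippets, max_length=128, num_beams=4):
    """Generate a natural-language comment for each code snippet in a batch."""
    inputs = tokenizer(
        snippets,
        max_length=max_length,
        truncation=True,
        padding=True,
        return_tensors="pt",
    ).to(device)
    outputs = model.generate(
        input_ids=inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        max_length=max_length,
        num_beams=num_beams,
        early_stopping=True,
    )
    return [tokenizer.decode(o, skip_special_tokens=True) for o in outputs]

print(generate_comments(["int(''.join(map(str, x)))"]))
```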
# Training Details
### Training Dataset
**Name:** janrauhl/conala
**Size:** 2,300 training samples, 477 validation samples
**Columns:** snippet (code), rewritten_intent (comment), intent, question_id
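The dataset can be pulled and inspected directly with the `datasets` library; a minimal sketch, assuming the standard `train`/`validation` split names implied by the sample counts above:

```python
from datasets import load_dataset

dataset = load_dataset("janrauhl/conala")
print(dataset)              # split names and sizes
print(dataset["train"][0])  # one example with snippet / rewritten_intent / intent / question_id
```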
# Approximate Statistics (based on inspection):
```
snippet:
  Type: string
  Min length: ~10 tokens
  Mean length: ~20-30 tokens (estimated)
  Max length: ~100 tokens (before truncation)
rewritten_intent:
  Type: string
  Min length: ~5 tokens
  Mean length: ~10-15 tokens (estimated)
  Max length: ~50 tokens (before truncation)

Samples:
  snippet: sum(d * 10 ** i for i, d in enumerate(x[::-1]))
  rewritten_intent: "Concatenate elements of a list 'x' of multiple integers to a single integer"

  snippet: int(''.join(map(str, x)))
  rewritten_intent: "Convert a list of integers into a single integer"

  snippet: datetime.strptime('2010-11-13 10:33:54.227806', '%Y-%m-%d %H:%M:%S.%f')
  rewritten_intent: "Convert a DateTime string back to a DateTime object of format '%Y-%m-%d %H:%M:%S.%f'"
```
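The approximate token-length figures above can be re-estimated with the model's tokenizer; a minimal sketch, assuming `dataset` and `tokenizer` are loaded as in the earlier snippets:

```python
from statistics import mean

def length_stats(texts, tokenizer):
    """Min / mean / max token counts for a list of strings (empty or None entries skipped)."""
    lengths = [len(tokenizer(t)["input_ids"]) for t in texts if t]
    return min(lengths), round(mean(lengths), 1), max(lengths)

train = dataset["train"]
print("snippet:", length_stats(train["snippet"], tokenizer))
print("rewritten_intent:", length_stats(train["rewritten_intent"], tokenizer))
```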
# Training Hyperparameters
### Non-Default Hyperparameters:
- **per_device_train_batch_size:** 4
- **per_device_eval_batch_size:** 4
- **gradient_accumulation_steps:** 2 (effective batch size = 8)
- **num_train_epochs:** 10
- **learning_rate:** 1e-4
- **fp16:** True
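These settings map onto a `Seq2SeqTrainingArguments` configuration roughly like the one below; the output directory is a placeholder, and any argument not listed above is left at its default:

```python
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="./codet5-comment-gen",  # placeholder path, not from the card
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    gradient_accumulation_steps=2,      # effective batch size = 8
    num_train_epochs=10,
    learning_rate=1e-4,
    fp16=True,
)
```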