CodeT5 for Code Comment Generation
This is a CodeT5 model fine-tuned from Salesforce/codet5-base to generate natural language comments from Python code snippets. It can be used for automated code documentation, code understanding, or educational purposes.
Model Details
Model Description
- Model Type: Sequence-to-Sequence Transformer
- Base Model: Salesforce/codet5-base
- Maximum Sequence Length: 128 tokens (input and output)
- Output: Natural language comments describing the input code
- Task: Code-to-comment generation
Model Sources
- Documentation: CodeT5 Documentation
- Repository: CodeT5 on GitHub
- Hugging Face: CodeT5 on Hugging Face
Full Model Architecture
T5ForConditionalGeneration(
  (shared): Embedding(32100, 768)
  (encoder): T5Stack(
    (embed_tokens): Embedding(32100, 768)
    (block): ModuleList(...)
    (final_layer_norm): LayerNorm((768,), eps=1e-12)
    (dropout): Dropout(p=0.1)
  )
  (decoder): T5Stack(
    (embed_tokens): Embedding(32100, 768)
    (block): ModuleList(...)
    (final_layer_norm): LayerNorm((768,), eps=1e-12)
    (dropout): Dropout(p=0.1)
  )
  (lm_head): Linear(in_features=768, out_features=32100, bias=False)
)
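As a quick sanity check on the printout above, the sizes of the embedding table and the output projection follow directly from the listed shapes. This is plain arithmetic on the numbers shown, not an inspection of the checkpoint, and it makes no claim about whether the two matrices share weights.

```python
# Shapes taken from the architecture printout above.
vocab_size = 32100
hidden_size = 768

# Embedding(32100, 768): one 768-dimensional vector per vocabulary entry.
embedding_weights = vocab_size * hidden_size

# Linear(in_features=768, out_features=32100, bias=False): same count, no bias term.
lm_head_weights = hidden_size * vocab_size

print(f"shared embedding weights: {embedding_weights:,}")
print(f"lm_head weights: {lm_head_weights:,}")
```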
Usage
First, install the required libraries:
pip install -U transformers torch datasets
Then, load the model and run inference:
import torch
from transformers import T5ForConditionalGeneration, RobertaTokenizer

# Download from the Hugging Face Hub
model_name = "AventIQ-AI/t5_code_summarizer" # Update with your HF model ID
tokenizer = RobertaTokenizer.from_pretrained(model_name)
model = T5ForConditionalGeneration.from_pretrained(model_name)
# Move to GPU if available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
# Inference
code_snippet = "sum(d * 10 ** i for i, d in enumerate(x[::-1]))"
inputs = tokenizer(code_snippet, max_length=128, truncation=True, padding="max_length", return_tensors="pt").to(device)
outputs = model.generate(
    input_ids=inputs["input_ids"],
    attention_mask=inputs["attention_mask"],
    max_length=128,
    num_beams=4,
    early_stopping=True
)
comment = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(f"Code: {code_snippet}")
print(f"Comment: {comment}")
# Expected output: Something close to "Concatenate elements of a list 'x' of multiple integers to a single integer"
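To comment several snippets at once, the single-snippet call above can be wrapped in a small helper. The following is a minimal sketch, not part of the original card: `generate_comments` is a hypothetical name, and it assumes the tokenizer and model behave like the `RobertaTokenizer` and `T5ForConditionalGeneration` objects loaded above. It pads to the longest snippet in the batch (`padding=True`) instead of always padding to 128 tokens.

```python
def generate_comments(snippets, tokenizer, model, device="cpu",
                      max_length=128, num_beams=4):
    """Return one generated comment per code snippet (hypothetical helper)."""
    # Pad only to the longest snippet in the batch rather than to max_length.
    inputs = tokenizer(
        snippets,
        max_length=max_length,
        truncation=True,
        padding=True,
        return_tensors="pt",
    ).to(device)
    # Same generation settings as the single-snippet example.
    outputs = model.generate(
        input_ids=inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        max_length=max_length,
        num_beams=num_beams,
        early_stopping=True,
    )
    # Decode every sequence in the batch, dropping special tokens.
    return tokenizer.batch_decode(outputs, skip_special_tokens=True)
```

With the `tokenizer`, `model`, and `device` from the example above, calling `generate_comments([code_snippet, "int(''.join(map(str, x)))"], tokenizer, model, device)` should return a list of two strings, one per snippet.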
Training Details
Training Dataset
- Name: janrauhl/conala
- Size: 2,300 training samples, 477 validation samples
- Columns: snippet (code), rewritten_intent (comment), intent, question_id
Approximate Statistics (based on inspection):
snippet:
- Type: string
- Min length: ~10 tokens
- Mean length: ~20-30 tokens (estimated)
- Max length: ~100 tokens (before truncation)
rewritten_intent:
- Type: string
- Min length: ~5 tokens
- Mean length: ~10-15 tokens (estimated)
- Max length: ~50 tokens (before truncation)
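The figures above are estimates. One way to measure them exactly is sketched below, assuming you have the column values as a list of strings. `length_stats` is a hypothetical helper, and whitespace splitting stands in for the real tokenizer here, so the toy counts will not match the model's token counts.

```python
from statistics import mean

def length_stats(texts, tokenize):
    """Return (min, mean, max) token counts for an iterable of strings."""
    lengths = [len(tokenize(text)) for text in texts]
    return min(lengths), mean(lengths), max(lengths)

# Toy example with whitespace splitting as a stand-in tokenizer.
samples = [
    "int(''.join(map(str, x)))",
    "sum(d * 10 ** i for i, d in enumerate(x[::-1]))",
]
print(length_stats(samples, str.split))
```

To get model-level counts, pass the tokenizer's `tokenize` method in place of `str.split`.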
Samples:
snippet: sum(d * 10 ** i for i, d in enumerate(x[::-1])), rewritten_intent: "Concatenate elements of a list 'x' of multiple integers to a single integer"
snippet: int(''.join(map(str, x))), rewritten_intent: "Convert a list of integers into a single integer"
snippet: datetime.strptime('2010-11-13 10:33:54.227806', '%Y-%m-%d %H:%M:%S.%f'), rewritten_intent: "Convert a DateTime string back to a DateTime object of format '%Y-%m-%d %H:%M:%S.%f'"
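Rows like these are typically turned into model inputs by tokenizing `snippet` as the source and `rewritten_intent` as the target. The sketch below shows one common way to do that. It is not the author's confirmed preprocessing: the `preprocess` name, the fallback to `intent` when `rewritten_intent` is empty, and the `-100` label masking are all assumptions.

```python
def preprocess(batch, tokenizer, max_length=128):
    """Tokenize a batch of CoNaLa-style rows for seq2seq fine-tuning (sketch)."""
    # Assumption: fall back to `intent` when `rewritten_intent` is missing.
    targets = [
        rewritten if rewritten else intent
        for rewritten, intent in zip(batch["rewritten_intent"], batch["intent"])
    ]
    # Source side: the code snippet, truncated and padded to max_length.
    model_inputs = tokenizer(
        batch["snippet"], max_length=max_length, truncation=True, padding="max_length"
    )
    # Target side: the natural language comment.
    labels = tokenizer(
        targets, max_length=max_length, truncation=True, padding="max_length"
    )["input_ids"]
    # Replace padding ids with -100 so the loss ignores padded positions.
    pad_id = tokenizer.pad_token_id
    model_inputs["labels"] = [
        [tok if tok != pad_id else -100 for tok in seq] for seq in labels
    ]
    return model_inputs
```

With the `datasets` library, a function like this could be applied through `dataset.map(preprocess, batched=True, fn_kwargs={"tokenizer": tokenizer})`.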
Training Hyperparameters
Non-Default Hyperparameters:
- per_device_train_batch_size: 4
- per_device_eval_batch_size: 4
- gradient_accumulation_steps: 2 (effective batch size = 8)
- num_train_epochs: 10
- learning_rate: 1e-4
- fp16: True
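To make the effective batch size and the rough length of the training run concrete, the arithmetic can be sketched as follows. The single-device setting is an assumption, and the step count is approximate: the exact number depends on how the training loop handles the final partial batch of each epoch.

```python
import math

train_samples = 2300              # from the dataset description
per_device_train_batch_size = 4
gradient_accumulation_steps = 2
num_train_epochs = 10
num_devices = 1                   # assumption: single-GPU training

# Samples contributing to each optimizer update.
effective_batch_size = (
    per_device_train_batch_size * gradient_accumulation_steps * num_devices
)

# Approximate optimizer updates per epoch and over the whole run.
steps_per_epoch = math.ceil(train_samples / effective_batch_size)
total_steps = steps_per_epoch * num_train_epochs

print(effective_batch_size, steps_per_epoch, total_steps)
```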