CodeT5 for Code Comment Generation

This is a CodeT5 model fine-tuned from Salesforce/codet5-base for generating natural language comments from Python code snippets. It maps code snippets to descriptive comments and can be used for automated code documentation, code understanding, or educational purposes.

Model Details

Model Description

Model Type: Sequence-to-Sequence Transformer
Base Model: Salesforce/codet5-base
Maximum Sequence Length: 128 tokens (input and output)
Output: Natural language comments describing the input code
Task: Code-to-comment generation

Model Sources

Documentation: CodeT5 Documentation
Repository: CodeT5 on GitHub
Hugging Face: CodeT5 on Hugging Face

Full Model Architecture

T5ForConditionalGeneration(
  (shared): Embedding(32100, 768)
  (encoder): T5Stack(
    (embed_tokens): Embedding(32100, 768)
    (block): ModuleList(...)
    (final_layer_norm): LayerNorm((768,), eps=1e-12)
    (dropout): Dropout(p=0.1)
  )
  (decoder): T5Stack(
    (embed_tokens): Embedding(32100, 768)
    (block): ModuleList(...)
    (final_layer_norm): LayerNorm((768,), eps=1e-12)
    (dropout): Dropout(p=0.1)
  )
  (lm_head): Linear(in_features=768, out_features=32100, bias=False)
)
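
As a quick sanity check on the printout above, the sizes of the two layers whose shapes it shows in full can be worked out by hand. This is only a sketch: the transformer blocks are elided in the printout and are not counted here, and the variable names are illustrative. In the Hugging Face T5 implementation, the encoder and decoder `embed_tokens` entries refer to the same `shared` table, so it is counted once.

```python
# Dimensions read directly from the architecture printout.
VOCAB_SIZE = 32100   # rows of the shared embedding table
HIDDEN_SIZE = 768    # width of every hidden state

# Embedding(32100, 768) stores one 768-dimensional vector per vocabulary entry.
embedding_params = VOCAB_SIZE * HIDDEN_SIZE

# Linear(in_features=768, out_features=32100, bias=False) has a weight matrix
# of the same size and no bias term.
lm_head_params = HIDDEN_SIZE * VOCAB_SIZE

print(f"shared embedding: {embedding_params:,} parameters")
print(f"lm_head:          {lm_head_params:,} parameters")
```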

Usage

First, install the required libraries:

pip install -U transformers torch datasets

Then, load the model and run inference:

import torch
from transformers import T5ForConditionalGeneration, RobertaTokenizer

# Download from the 🤗 Hub

model_name = "AventIQ-AI/t5_code_summarizer"  # Update with your HF model ID
tokenizer = RobertaTokenizer.from_pretrained(model_name)
model = T5ForConditionalGeneration.from_pretrained(model_name)

# Move to GPU if available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

# Inference
code_snippet = "sum(d * 10 ** i for i, d in enumerate(x[::-1]))"
inputs = tokenizer(code_snippet, max_length=128, truncation=True, padding="max_length", return_tensors="pt").to(device)
outputs = model.generate(
    input_ids=inputs["input_ids"],
    attention_mask=inputs["attention_mask"],
    max_length=128,
    num_beams=4,
    early_stopping=True
)
comment = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(f"Code: {code_snippet}")
print(f"Comment: {comment}")
# Expected output: Something close to "Concatenate elements of a list 'x' of multiple integers to a single integer"
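
For the automated documentation use case, the generated text still has to be placed next to the code it describes. The helper below is a minimal sketch of one way to do that. `attach_comment` is a hypothetical name and is not part of the model or of `transformers`; it formats a comment string as `#` lines above a snippet.

```python
import textwrap


def attach_comment(code: str, comment: str, width: int = 79) -> str:
    """Return `code` preceded by `comment` formatted as `#` comment lines."""
    # Reuse the indentation of the first code line so the comment lines up with it.
    first_line = code.splitlines()[0] if code else ""
    indent = first_line[: len(first_line) - len(first_line.lstrip())]
    prefix = f"{indent}# "

    # Wrap long comments so each line stays within `width` characters.
    wrapped = textwrap.wrap(comment.strip(), width=max(width - len(prefix), 20))
    comment_lines = [prefix + line for line in wrapped]
    return "\n".join(comment_lines + [code])


# Example with a snippet/comment pair taken from the training samples in this card.
documented = attach_comment(
    "int(''.join(map(str, x)))",
    "Convert a list of integers into a single integer",
)
print(documented)
```

In practice the `comment` argument would be the string decoded from `model.generate`, as in the inference example.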

Training Details

Training Dataset

Name: janrauhl/conala
Size: 2,300 training samples, 477 validation samples
Columns: snippet (code), rewritten_intent (comment), intent, question_id

Approximate Statistics (based on inspection):

snippet:
  Type: string
  Min length: ~10 tokens
  Mean length: ~20-30 tokens (estimated)
  Max length: ~100 tokens (before truncation)
rewritten_intent:
  Type: string
  Min length: ~5 tokens
  Mean length: ~10-15 tokens (estimated)
  Max length: ~50 tokens (before truncation)
Samples:
  snippet: sum(d * 10 ** i for i, d in enumerate(x[::-1])), rewritten_intent: "Concatenate elements of a list 'x' of multiple integers to a single integer"
  snippet: int(''.join(map(str, x))), rewritten_intent: "Convert a list of integers into a single integer"
  snippet: datetime.strptime('2010-11-13 10:33:54.227806', '%Y-%m-%d %H:%M:%S.%f'), rewritten_intent: "Convert a DateTime string back to a DateTime object of format '%Y-%m-%d %H:%M:%S.%f'"
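
The columns above suggest how rows might be turned into training pairs: `snippet` as the model input and `rewritten_intent` as the target. The sketch below shows one possible preprocessing step and is not necessarily what was used to train this model. `to_pair` and `tokenize_batch` are hypothetical helpers, and falling back to `intent` when `rewritten_intent` is empty is an assumption; the card does not say how such rows were handled.

```python
from typing import Dict, List, Optional, Tuple

MAX_LENGTH = 128  # maximum sequence length stated in this card


def to_pair(row: Dict[str, Optional[str]]) -> Tuple[str, str]:
    """Map one dataset row to a (source code, target comment) pair."""
    # Assumption: use `intent` when `rewritten_intent` is missing or empty.
    target = row.get("rewritten_intent") or row["intent"]
    return row["snippet"].strip(), target.strip()


def tokenize_batch(batch: Dict[str, List[Optional[str]]], tokenizer):
    """Tokenize a column-oriented batch; inputs go to `input_ids`, targets to `labels`."""
    # Rebuild row dicts from the column-oriented batch that `datasets` passes in.
    rows = [dict(zip(batch, values)) for values in zip(*batch.values())]
    sources, targets = zip(*(to_pair(row) for row in rows))
    return tokenizer(
        list(sources),
        text_target=list(targets),
        max_length=MAX_LENGTH,
        truncation=True,
    )


# Two of the sample rows listed in this card (only the columns shown there).
rows = [
    {"snippet": "int(''.join(map(str, x)))",
     "rewritten_intent": "Convert a list of integers into a single integer"},
    {"snippet": "sum(d * 10 ** i for i, d in enumerate(x[::-1]))",
     "rewritten_intent": "Concatenate elements of a list 'x' of multiple integers to a single integer"},
]
pairs = [to_pair(row) for row in rows]
print(pairs[0])
```

With the `datasets` library, a function like this could be applied as `dataset.map(tokenize_batch, batched=True, fn_kwargs={"tokenizer": tokenizer})`, assuming `tokenizer` is the one loaded in the inference example.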

Training Hyperparameters

Non-Default Hyperparameters:

per_device_train_batch_size: 4
per_device_eval_batch_size: 4
gradient_accumulation_steps: 2 (effective batch size = 8)
num_train_epochs: 10
learning_rate: 1e-4
fp16: True
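
To make the numbers above concrete, the sketch below collects them in a dictionary and derives the effective batch size and an approximate step count. It assumes a single device and that training used the `transformers` Trainer API, neither of which the card states. `build_training_args` is a hypothetical helper, and the exact step count a trainer reports may differ slightly depending on how it rounds the last partial accumulation.

```python
import math

TRAIN_SAMPLES = 2300  # training set size stated in this card
NUM_DEVICES = 1       # assumption: the card does not say how many devices were used

hyperparameters = {
    "per_device_train_batch_size": 4,
    "per_device_eval_batch_size": 4,
    "gradient_accumulation_steps": 2,
    "num_train_epochs": 10,
    "learning_rate": 1e-4,
    "fp16": True,  # mixed precision; typically needs a GPU
}

# Samples consumed per optimizer update.
effective_batch_size = (
    hyperparameters["per_device_train_batch_size"]
    * hyperparameters["gradient_accumulation_steps"]
    * NUM_DEVICES
)

# Approximate optimizer updates, counting the final partial batch of each epoch.
steps_per_epoch = math.ceil(TRAIN_SAMPLES / effective_batch_size)
total_steps = steps_per_epoch * hyperparameters["num_train_epochs"]

print(f"effective batch size: {effective_batch_size}")
print(f"updates per epoch:    {steps_per_epoch}")
print(f"total updates:        {total_steps}")


def build_training_args(output_dir: str):
    """Hypothetical helper: pass the values above to Seq2SeqTrainingArguments."""
    from transformers import Seq2SeqTrainingArguments  # imported lazily

    return Seq2SeqTrainingArguments(output_dir=output_dir, **hyperparameters)
```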