# CodeT5 for Code Comment Generation

This is a CodeT5 model fine-tuned from Salesforce/codet5-base for generating natural language comments from Python code snippets. It maps code snippets to descriptive comments and can be used for automated code documentation, code understanding, or educational purposes.

# Model Details

**Model Description**

- **Model Type:** Sequence-to-Sequence Transformer
- **Base Model:** Salesforce/codet5-base
- **Maximum Sequence Length:** 128 tokens (input and output)
- **Output:** Natural language comments describing the input code
- **Task:** Code-to-comment generation

# Model Sources

- **Documentation:** CodeT5 Documentation
- **Repository:** CodeT5 on GitHub
- **Hugging Face:** CodeT5 on Hugging Face

# Full Model Architecture

```
T5ForConditionalGeneration(
  (shared): Embedding(32100, 768)
  (encoder): T5Stack(
    (embed_tokens): Embedding(32100, 768)
    (block): ModuleList(...)
    (final_layer_norm): T5LayerNorm()
    (dropout): Dropout(p=0.1)
  )
  (decoder): T5Stack(
    (embed_tokens): Embedding(32100, 768)
    (block): ModuleList(...)
    (final_layer_norm): T5LayerNorm()
    (dropout): Dropout(p=0.1)
  )
  (lm_head): Linear(in_features=768, out_features=32100, bias=False)
)
```

# Usage

First, install the required libraries:

```bash
pip install -U transformers torch datasets
```

Then, load the model and run inference:

```python
import torch
from transformers import T5ForConditionalGeneration, RobertaTokenizer

# Download from the 🤗 Hub
model_name = "AventIQ-AI/t5_code_summarizer"  # Update with your HF model ID
tokenizer = RobertaTokenizer.from_pretrained(model_name)
model = T5ForConditionalGeneration.from_pretrained(model_name)

# Move to GPU if available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

# Inference
code_snippet = "sum(d * 10 ** i for i, d in enumerate(x[::-1]))"
inputs = tokenizer(
    code_snippet,
    max_length=128,
    truncation=True,
    padding="max_length",
    return_tensors="pt",
).to(device)

outputs = model.generate(
    input_ids=inputs["input_ids"],
    attention_mask=inputs["attention_mask"],
    max_length=128,
    num_beams=4,
    early_stopping=True,
)

comment = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(f"Code: {code_snippet}")
print(f"Comment: {comment}")
# Expected output: something close to
# "Concatenate elements of a list 'x' of multiple integers to a single integer"
```

# Training Details

### Training Dataset

- **Name:** janrauhl/conala
- **Size:** 2,300 training samples, 477 validation samples
- **Columns:** snippet (code), rewritten_intent (comment), intent, question_id

# Approximate Statistics (based on inspection)

```
snippet:
  Type: string
  Min length: ~10 tokens
  Mean length: ~20-30 tokens (estimated)
  Max length: ~100 tokens (before truncation)

rewritten_intent:
  Type: string
  Min length: ~5 tokens
  Mean length: ~10-15 tokens (estimated)
  Max length: ~50 tokens (before truncation)

Samples:
  snippet: sum(d * 10 ** i for i, d in enumerate(x[::-1]))
  rewritten_intent: "Concatenate elements of a list 'x' of multiple integers to a single integer"

  snippet: int(''.join(map(str, x)))
  rewritten_intent: "Convert a list of integers into a single integer"

  snippet: datetime.strptime('2010-11-13 10:33:54.227806', '%Y-%m-%d %H:%M:%S.%f')
  rewritten_intent: "Convert a DateTime string back to a DateTime object of format '%Y-%m-%d %H:%M:%S.%f'"
```

# Training Hyperparameters

### Non-Default Hyperparameters:

- **per_device_train_batch_size:** 4
- **per_device_eval_batch_size:** 4
- **gradient_accumulation_steps:** 2 (effective batch size = 8)
- **num_train_epochs:** 10
- **learning_rate:** 1e-4
- **fp16:** True

A sketch of how this configuration could be reproduced follows below.
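
For reference, the `snippet`/`rewritten_intent` columns described above map onto a standard sequence-to-sequence preprocessing step. The following is a minimal sketch, not the exact training pipeline: the `train`/`validation` split names are inferred from the sample counts above, the filter for missing `rewritten_intent` values is a defensive assumption, and `preprocess` is an illustrative name.

```python
from datasets import load_dataset
from transformers import RobertaTokenizer

tokenizer = RobertaTokenizer.from_pretrained("Salesforce/codet5-base")

# Load the dataset named in this card; split names are assumed from the
# sample counts above (2,300 train / 477 validation).
dataset = load_dataset("janrauhl/conala")

# Assumption: some CoNaLa rows may lack a rewritten intent, so drop them.
dataset = dataset.filter(lambda ex: ex["rewritten_intent"] is not None)

def preprocess(batch):
    # Code snippet -> encoder input, capped at the 128-token limit above.
    model_inputs = tokenizer(batch["snippet"], max_length=128, truncation=True)
    # Rewritten intent -> decoder labels (text_target requires a recent
    # transformers version, >= 4.21).
    labels = tokenizer(
        text_target=batch["rewritten_intent"], max_length=128, truncation=True
    )
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

tokenized = dataset.map(
    preprocess, batched=True, remove_columns=dataset["train"].column_names
)
```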
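
Similarly, the non-default hyperparameters listed above translate directly into `Seq2SeqTrainingArguments`. This sketch continues from the preprocessing example (reusing `tokenizer` and `tokenized`); `output_dir` is an illustrative path, and `DataCollatorForSeq2Seq` is assumed here for padding, since the card does not state which collator was used.

```python
from transformers import (
    DataCollatorForSeq2Seq,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
    T5ForConditionalGeneration,
)

model = T5ForConditionalGeneration.from_pretrained("Salesforce/codet5-base")

# Dynamically pads inputs and replaces label padding with -100 so that
# padded positions are ignored by the cross-entropy loss.
collator = DataCollatorForSeq2Seq(tokenizer, model=model)

args = Seq2SeqTrainingArguments(
    output_dir="codet5-conala",       # illustrative output path
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    gradient_accumulation_steps=2,    # effective batch size = 8
    num_train_epochs=10,
    learning_rate=1e-4,
    fp16=True,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["validation"],
    data_collator=collator,
    tokenizer=tokenizer,
)
trainer.train()
```

With `gradient_accumulation_steps=2` and a per-device batch size of 4, gradients are accumulated over two forward passes before each optimizer step, giving the effective batch size of 8 noted in the hyperparameter list.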