# CodeT5 for Code Comment Generation |
|
This is a CodeT5 model fine-tuned from Salesforce/codet5-base for generating natural language comments from Python code snippets. It maps code snippets to descriptive comments and can be used for automated code documentation, code understanding, or educational purposes. |
|
|
|
# Model Details |
|
**Model Description** |
|
**Model Type:** Sequence-to-Sequence Transformer |
|
**Base Model:** Salesforce/codet5-base |
|
**Maximum Sequence Length:** 128 tokens (input and output; see the truncation sketch below)
|
**Output:** Natural language comments describing the input code |
|
**Task:** Code-to-comment generation |
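
Inputs and outputs longer than 128 tokens are truncated. A minimal sketch of the truncation behavior, using the base tokenizer (the over-long snippet here is illustrative):

```python
from transformers import RobertaTokenizer

tokenizer = RobertaTokenizer.from_pretrained("Salesforce/codet5-base")

# Any input longer than 128 tokens is cut off at the limit.
long_snippet = "x = [i for i in range(100)]\n" * 50
encoded = tokenizer(long_snippet, max_length=128, truncation=True)
print(len(encoded["input_ids"]))  # 128
```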
|
|
|
# Model Sources |
|
**Documentation:** [CodeT5 documentation (GitHub README)](https://github.com/salesforce/CodeT5#readme)

**Repository:** [salesforce/CodeT5 on GitHub](https://github.com/salesforce/CodeT5)

**Hugging Face:** [Salesforce/codet5-base](https://huggingface.co/Salesforce/codet5-base)
|
|
|
# Full Model Architecture |
|
```
T5ForConditionalGeneration(
  (shared): Embedding(32100, 768)
  (encoder): T5Stack(
    (embed_tokens): Embedding(32100, 768)
    (block): ModuleList(...)
    (final_layer_norm): LayerNorm((768,), eps=1e-12)
    (dropout): Dropout(p=0.1)
  )
  (decoder): T5Stack(
    (embed_tokens): Embedding(32100, 768)
    (block): ModuleList(...)
    (final_layer_norm): LayerNorm((768,), eps=1e-12)
    (dropout): Dropout(p=0.1)
  )
  (lm_head): Linear(in_features=768, out_features=32100, bias=False)
)
```
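
You can reproduce this printout and check the parameter count directly; a small sketch (the base checkpoint is used here for illustration):

```python
from transformers import T5ForConditionalGeneration

model = T5ForConditionalGeneration.from_pretrained("Salesforce/codet5-base")
print(model)  # prints the full module tree

# codet5-base has on the order of 220M parameters
n_params = sum(p.numel() for p in model.parameters())
print(f"{n_params / 1e6:.0f}M parameters")
```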
|
|
|
# Usage

Install the dependencies, then load the model and run inference:

```bash
pip install -U transformers torch datasets
```

```python
import torch
from transformers import T5ForConditionalGeneration, RobertaTokenizer

# Download from the 🤗 Hub
model_name = "AventIQ-AI/t5_code_summarizer"
|
tokenizer = RobertaTokenizer.from_pretrained(model_name) |
|
model = T5ForConditionalGeneration.from_pretrained(model_name) |
|
|
|
# Move to GPU if available |
|
device = torch.device("cuda" if torch.cuda.is_available() else "cpu") |
|
model.to(device) |
|
|
|
# Inference |
|
code_snippet = "sum(d * 10 ** i for i, d in enumerate(x[::-1]))" |
|
inputs = tokenizer(code_snippet, max_length=128, truncation=True, padding="max_length", return_tensors="pt").to(device) |
|
outputs = model.generate( |
|
input_ids=inputs["input_ids"], |
|
attention_mask=inputs["attention_mask"], |
|
max_length=128, |
|
num_beams=4, |
|
early_stopping=True |
|
) |
|
comment = tokenizer.decode(outputs[0], skip_special_tokens=True) |
|
print(f"Code: {code_snippet}") |
|
print(f"Comment: {comment}") |
|
# Expected output: Something close to "Concatenate elements of a list 'x' of multiple integers to a single integer" |
|
``` |
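
For documenting many snippets at once, batching the `generate` call is cheaper than looping one snippet at a time. A small helper sketch that reuses the `tokenizer`, `model`, and `device` objects from above (the function name is illustrative):

```python
def generate_comments(snippets, batch_size=8):
    """Generate one comment per code snippet (illustrative helper)."""
    comments = []
    for i in range(0, len(snippets), batch_size):
        batch = snippets[i:i + batch_size]
        inputs = tokenizer(batch, max_length=128, truncation=True,
                           padding=True, return_tensors="pt").to(device)
        outputs = model.generate(**inputs, max_length=128,
                                 num_beams=4, early_stopping=True)
        comments.extend(tokenizer.decode(o, skip_special_tokens=True)
                        for o in outputs)
    return comments

print(generate_comments(["int(''.join(map(str, x)))"]))
```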
|
|
|
# Training Details |
|
**Training Dataset**
|
**Name:** janrauhl/conala |
|
**Size:** 2,300 training samples, 477 validation samples |
|
**Columns:** snippet (code), rewritten_intent (comment), intent, question_id |
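
The dataset can be inspected directly; a quick sketch, assuming the dataset is available on the Hugging Face Hub under the ID above:

```python
from datasets import load_dataset

dataset = load_dataset("janrauhl/conala")
print(dataset)  # DatasetDict with train/validation splits

example = dataset["train"][0]
print(example["snippet"], "->", example["rewritten_intent"])
```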
|
|
|
# Approximate Statistics (based on inspection)

```
snippet:
  type: string
  min length:  ~10 tokens
  mean length: ~20-30 tokens (estimated)
  max length:  ~100 tokens (before truncation)

rewritten_intent:
  type: string
  min length:  ~5 tokens
  mean length: ~10-15 tokens (estimated)
  max length:  ~50 tokens (before truncation)
```

Samples:

- snippet: `sum(d * 10 ** i for i, d in enumerate(x[::-1]))`
  rewritten_intent: "Concatenate elements of a list 'x' of multiple integers to a single integer"
- snippet: `int(''.join(map(str, x)))`
  rewritten_intent: "Convert a list of integers into a single integer"
- snippet: `datetime.strptime('2010-11-13 10:33:54.227806', '%Y-%m-%d %H:%M:%S.%f')`
  rewritten_intent: "Convert a DateTime string back to a DateTime object of format '%Y-%m-%d %H:%M:%S.%f'"
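
These figures are easy to re-check; a sketch that reuses the `tokenizer` and `dataset` objects loaded above:

```python
# Token lengths of the training snippets, per the CodeT5 tokenizer.
lengths = [len(tokenizer(ex["snippet"])["input_ids"]) for ex in dataset["train"]]
print(f"min={min(lengths)}, mean={sum(lengths) / len(lengths):.1f}, max={max(lengths)}")
```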
|
# Training Hyperparameters |
|
### Non-Default Hyperparameters
|
- **per_device_train_batch_size:** 4 |
|
- **per_device_eval_batch_size:** 4 |
|
- **gradient_accumulation_steps:** 2 (effective batch size = 8) |
|
- **num_train_epochs:** 10 |
|
- **learning_rate:** 1e-4 |
|
- **fp16:** True |
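
These settings map onto transformers' `Seq2SeqTrainingArguments` roughly as in the sketch below; `output_dir` is an assumption, not a documented choice from the training run:

```python
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="./codet5-comment-gen",  # assumed path
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    gradient_accumulation_steps=2,      # effective batch size = 8
    num_train_epochs=10,
    learning_rate=1e-4,
    fp16=True,
)
```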
|
|
|
|