# CodeT5 for Code Comment Generation

This is a CodeT5 model fine-tuned from Salesforce/codet5-base for generating natural language comments from Python code snippets. It maps code snippets to descriptive comments and can be used for automated code documentation, code understanding, or educational purposes.

# Model Details
**Model Description**
- **Model Type:** Sequence-to-Sequence Transformer
- **Base Model:** Salesforce/codet5-base
- **Maximum Sequence Length:** 128 tokens (input and output)
- **Output:** Natural language comments describing the input code
- **Task:** Code-to-comment generation

# Model Sources
- **Documentation:** CodeT5 Documentation
- **Repository:** CodeT5 on GitHub
- **Hugging Face:** CodeT5 on Hugging Face

# Full Model Architecture
```
T5ForConditionalGeneration(
  (shared): Embedding(32100, 768)
  (encoder): T5Stack(
    (embed_tokens): Embedding(32100, 768)
    (block): ModuleList(...)
    (final_layer_norm): LayerNorm((768,), eps=1e-12)
    (dropout): Dropout(p=0.1)
  )
  (decoder): T5Stack(
    (embed_tokens): Embedding(32100, 768)
    (block): ModuleList(...)
    (final_layer_norm): LayerNorm((768,), eps=1e-12)
    (dropout): Dropout(p=0.1)
  )
  (lm_head): Linear(in_features=768, out_features=32100, bias=False)
)
```
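If you want to verify this structure locally, loading the checkpoint and printing the model reproduces the module tree above (the model ID below is the same placeholder used in the usage example):

```
from transformers import T5ForConditionalGeneration

# Placeholder model ID (replace with the actual repository name after uploading)
model = T5ForConditionalGeneration.from_pretrained("your-username/codet5-conala-comments")
print(model)  # prints the full module tree summarized above
```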
# Usage
First install the required libraries with `pip install -U transformers torch datasets`. Then, load the model and run inference:

```
import torch
from transformers import T5ForConditionalGeneration, RobertaTokenizer

# Download from the 🤗 Hub (replace with your model ID after uploading)
model_name = "your-username/codet5-conala-comments"  # Update with your HF model ID
tokenizer = RobertaTokenizer.from_pretrained(model_name)
model = T5ForConditionalGeneration.from_pretrained(model_name)

# Move to GPU if available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

# Inference
code_snippet = "sum(d * 10 ** i for i, d in enumerate(x[::-1]))"
inputs = tokenizer(code_snippet, max_length=128, truncation=True, padding="max_length", return_tensors="pt").to(device)
outputs = model.generate(
    input_ids=inputs["input_ids"],
    attention_mask=inputs["attention_mask"],
    max_length=128,
    num_beams=4,
    early_stopping=True
)
comment = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(f"Code: {code_snippet}")
print(f"Comment: {comment}")
# Expected output: something close to "Concatenate elements of a list 'x' of multiple integers to a single integer"
```
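For documenting several snippets at once, generation can also be batched. The following is a minimal sketch that reuses the `tokenizer`, `model`, and `device` objects from the example above; the example snippets are illustrative inputs only:

```
# Batch inference: tokenize several snippets together and decode every output
snippets = [
    "int(''.join(map(str, x)))",
    "sorted(d.items(), key=lambda kv: kv[1])",
]
batch = tokenizer(snippets, max_length=128, truncation=True, padding=True, return_tensors="pt").to(device)
outputs = model.generate(**batch, max_length=128, num_beams=4, early_stopping=True)
for code, ids in zip(snippets, outputs):
    print(code, "->", tokenizer.decode(ids, skip_special_tokens=True))
```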
# Training Details
**Training Dataset**
- **Name:** janrauhl/conala
- **Size:** 2,300 training samples, 477 validation samples
- **Columns:** snippet (code), rewritten_intent (comment), intent, question_id

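A rough preprocessing sketch, assuming the dataset is available on the Hub under this ID and falling back to intent whenever rewritten_intent is missing:

```
from datasets import load_dataset
from transformers import RobertaTokenizer

tokenizer = RobertaTokenizer.from_pretrained("Salesforce/codet5-base")
dataset = load_dataset("janrauhl/conala")

def preprocess(example):
    # Code snippet is the model input; rewritten_intent (or intent as fallback) is the target comment
    target = example["rewritten_intent"] or example["intent"]
    model_inputs = tokenizer(example["snippet"], max_length=128, truncation=True, padding="max_length")
    labels = tokenizer(target, max_length=128, truncation=True, padding="max_length")
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

tokenized = dataset.map(preprocess)
```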
# Approximate Statistics (based on inspection)
```
snippet:
  Type: string
  Min length: ~10 tokens
  Mean length: ~20-30 tokens (estimated)
  Max length: ~100 tokens (before truncation)
rewritten_intent:
  Type: string
  Min length: ~5 tokens
  Mean length: ~10-15 tokens (estimated)
  Max length: ~50 tokens (before truncation)
Samples:
  snippet: sum(d * 10 ** i for i, d in enumerate(x[::-1])), rewritten_intent: "Concatenate elements of a list 'x' of multiple integers to a single integer"
  snippet: int(''.join(map(str, x))), rewritten_intent: "Convert a list of integers into a single integer"
  snippet: datetime.strptime('2010-11-13 10:33:54.227806', '%Y-%m-%d %H:%M:%S.%f'), rewritten_intent: "Convert a DateTime string back to a DateTime object of format '%Y-%m-%d %H:%M:%S.%f'"
```
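These figures are estimates from manual inspection; they can be recomputed with the tokenizer, for example as follows (assuming the `dataset` and `tokenizer` objects from the loading sketch above and a train split):

```
# Recompute approximate token-length statistics for both columns of the train split
def stats(texts):
    lengths = [len(tokenizer(t)["input_ids"]) for t in texts if t]
    return min(lengths), sum(lengths) / len(lengths), max(lengths)

print("snippet (min/mean/max tokens):", stats(dataset["train"]["snippet"]))
print("rewritten_intent (min/mean/max tokens):", stats(dataset["train"]["rewritten_intent"]))
```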
# Training Hyperparameters
Non-Default Hyperparameters:
- **per_device_train_batch_size:** 4
- **per_device_eval_batch_size:** 4
- **gradient_accumulation_steps:** 2 (effective batch size = 8)
- **num_train_epochs:** 10
- **learning_rate:** 1e-4
- **fp16:** True

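A training configuration matching these values could look roughly like the following sketch (the output directory and the use of Seq2SeqTrainingArguments are assumptions, not the exact script used):

```
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="codet5-conala-comments",  # assumed output directory
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    gradient_accumulation_steps=2,  # effective batch size of 8
    num_train_epochs=10,
    learning_rate=1e-4,
    fp16=True,
)
```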
# Citation
```
@inproceedings{wang2021codet5,
  title={CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for Code Understanding and Generation},
  author={Wang, Yue and Wang, Weishi and Joty, Shafiq and Hoi, Steven C. H.},
  booktitle={Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing},
  year={2021},
  url={https://arxiv.org/abs/2109.00859}
}
```