---
license: apache-2.0
language:
- en
- multilingual
tags:
- code-to-docstring
- code-summarization
- code-documentation
- encoder-decoder
- code
- python
- java
- transformers
- huggingface
- modernbert
- gpt2
base_model:
- Shuu12121/CodeModernBERT-Ghost
- openai-community/gpt2-large
pipeline_tag: text2text-generation
---

# CodeEncoderDecoderModel-Ghost-large👻

A multilingual encoder-decoder model for generating **docstrings from code snippets**.
It is based on a custom BERT-style encoder pretrained on source code (`CodeModernBERT-Ghost`) and a large-scale decoder model (`GPT2-large`).

## 🏗️ Model Architecture

- **Encoder:** [`Shuu12121/CodeModernBERT-Ghost`](https://huggingface.co/Shuu12121/CodeModernBERT-Ghost)
- **Decoder:** [`openai-community/gpt2-large`](https://huggingface.co/openai-community/gpt2-large)
- Connected via Hugging Face's `EncoderDecoderModel` with cross-attention, as sketched below.
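
For reference, a pairing like this can be assembled with `EncoderDecoderModel.from_encoder_decoder_pretrained`, which marks the GPT-2 side as a decoder and adds the cross-attention layers. The snippet below is a minimal sketch of that wiring (assuming a `transformers` version with ModernBERT support; the token-ID choices are illustrative, not the exact script used to build this checkpoint):

```python
from transformers import AutoTokenizer, EncoderDecoderModel

# Sketch: pair the pretrained encoder and decoder. from_encoder_decoder_pretrained
# sets is_decoder=True on the GPT-2 side and adds cross-attention layers.
model = EncoderDecoderModel.from_encoder_decoder_pretrained(
    "Shuu12121/CodeModernBERT-Ghost",
    "openai-community/gpt2-large",
)

decoder_tokenizer = AutoTokenizer.from_pretrained("openai-community/gpt2-large")
decoder_tokenizer.pad_token = decoder_tokenizer.eos_token  # GPT-2 ships without a pad token

# Generation-time token IDs on the top-level config (illustrative choices).
model.config.decoder_start_token_id = decoder_tokenizer.bos_token_id
model.config.eos_token_id = decoder_tokenizer.eos_token_id
model.config.pad_token_id = decoder_tokenizer.pad_token_id
```

The released checkpoint already contains this wiring, so for inference you only need `from_pretrained`, as shown in the usage section below.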

## 🎯 Intended Use

- Generating docstrings (documentation comments) for functions or methods in multiple languages.
- Summarizing code for educational or review purposes.
- Assisting in automated documentation generation pipelines.

Supported languages (code input):
- Python
- Java

## 📦 How to Use

```python
from transformers import AutoTokenizer, EncoderDecoderModel
import torch

# Load the paired model and its two tokenizers (encoder side and decoder side).
model = EncoderDecoderModel.from_pretrained("Shuu12121/CodeEncoderDecoderModel-Ghost-large").to("cuda")
encoder_tokenizer = AutoTokenizer.from_pretrained("Shuu12121/CodeEncoderDecoderModel-Ghost-large", subfolder="encoder_tokenizer")
decoder_tokenizer = AutoTokenizer.from_pretrained("Shuu12121/CodeEncoderDecoderModel-Ghost-large", subfolder="decoder_tokenizer")

# GPT-2 has no pad token; reuse EOS for padding.
if decoder_tokenizer.pad_token is None:
    decoder_tokenizer.pad_token = decoder_tokenizer.eos_token

code = '''
def greet(name):
    return f"Hello, {name}!"
'''

inputs = encoder_tokenizer(code, return_tensors="pt", truncation=True, padding=True, max_length=2048).to("cuda")
outputs = model.generate(
    input_ids=inputs.input_ids,
    attention_mask=inputs.attention_mask,
    max_length=256,
    num_beams=5,
    early_stopping=True,
    decoder_start_token_id=model.config.decoder_start_token_id,
    eos_token_id=model.config.eos_token_id,
    pad_token_id=model.config.pad_token_id,
    no_repeat_ngram_size=2
)

docstring = decoder_tokenizer.decode(outputs[0], skip_special_tokens=True)
print(docstring)
```
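
For the documentation-pipeline use case, the same call batches naturally. The helper below is an illustrative sketch that reuses the `model`, `encoder_tokenizer`, and `decoder_tokenizer` objects loaded above; the `document_functions` name and the batching choices are assumptions, not part of the released code:

```python
def document_functions(functions, batch_size=8):
    """Generate a docstring for each code snippet in `functions` (illustrative helper)."""
    docstrings = []
    for start in range(0, len(functions), batch_size):
        batch = functions[start:start + batch_size]
        inputs = encoder_tokenizer(
            batch, return_tensors="pt", truncation=True, padding=True, max_length=2048
        ).to("cuda")
        outputs = model.generate(
            input_ids=inputs.input_ids,
            attention_mask=inputs.attention_mask,
            max_length=256,
            num_beams=5,
            early_stopping=True,
            decoder_start_token_id=model.config.decoder_start_token_id,
            eos_token_id=model.config.eos_token_id,
            pad_token_id=model.config.pad_token_id,
            no_repeat_ngram_size=2,
        )
        docstrings.extend(decoder_tokenizer.batch_decode(outputs, skip_special_tokens=True))
    return docstrings

print(document_functions(["def add(a, b):\n    return a + b"]))
```

Batching mainly trades GPU memory for throughput; the generation settings match the single-example call above.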

## 🧪 Training Details

- **Task:** Code-to-docstring generation
- **Dataset:** [CodeXGLUE: Code-to-Text](https://github.com/microsoft/CodeXGLUE) – using subsets of Python, Java, JavaScript, Go, Ruby, PHP
- **Loss:** Cross-entropy over the tokenized docstring
- **Sequence lengths:** up to 2048 tokens on the encoder input, up to 256 tokens on the decoder output
- **Decoder modifications:** GPT2-large adapted with a padding token and cross-attention layers for the encoder-decoder setup
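
To make the loss definition above concrete, a single fine-tuning step of this kind typically tokenizes the code as encoder input, tokenizes the docstring as `labels`, and lets the model compute cross-entropy over the label tokens. The snippet below is a hedged sketch under those assumptions, not the author's training script:

```python
from transformers import AutoTokenizer, EncoderDecoderModel

# Assumed setup: the paired model and tokenizers published in this repository.
model = EncoderDecoderModel.from_pretrained("Shuu12121/CodeEncoderDecoderModel-Ghost-large")
encoder_tok = AutoTokenizer.from_pretrained(
    "Shuu12121/CodeEncoderDecoderModel-Ghost-large", subfolder="encoder_tokenizer"
)
decoder_tok = AutoTokenizer.from_pretrained(
    "Shuu12121/CodeEncoderDecoderModel-Ghost-large", subfolder="decoder_tokenizer"
)
if decoder_tok.pad_token is None:
    decoder_tok.pad_token = decoder_tok.eos_token

code = "def add(a, b):\n    return a + b"
docstring = "Return the sum of a and b."

enc = encoder_tok(code, return_tensors="pt", truncation=True, max_length=2048)
labels = decoder_tok(docstring, return_tensors="pt", truncation=True, max_length=256).input_ids

# The model shifts `labels` to build decoder inputs and returns the cross-entropy loss.
outputs = model(input_ids=enc.input_ids, attention_mask=enc.attention_mask, labels=labels)
print(outputs.loss)
outputs.loss.backward()
```

In a full run this step would sit inside a standard PyTorch training loop or a `Seq2SeqTrainer`, with batched, padded inputs and an optimizer.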

## ⚠️ Limitations & Risks

1. **Generated documentation may be inaccurate, incomplete, or misleading.** Always review generated docstrings manually.
2. **Formatting may not follow specific standards** (e.g., Google/NumPy style in Python or full Javadoc).
3. **Limited context:** The model only sees a single function at a time and lacks broader project-level understanding.
4. **Language variance:** Performance may differ across programming languages due to the training data distribution.
5. **⚠️ Decoder risks (GPT2-large):** GPT-2 models are known to sometimes generate inappropriate, offensive, or biased outputs, depending on the prompt. Although this model is fine-tuned on technical data (code-docstring pairs), similar risks **may still be present** in edge cases due to properties inherited from `gpt2-large`. Please exercise caution, especially when using the model in public or educational settings.

## 📄 License

Apache-2.0
Model weights and tokenizer artifacts are released under the same license. You are free to use, modify, and redistribute them with attribution.