---
license: apache-2.0
language:
- en
- multilingual
tags:
- code-to-docstring
- code-summarization
- code-documentation
- encoder-decoder
- code
- python
- java
- transformers
- huggingface
- modernbert
- gpt2
base_model:
- Shuu12121/CodeModernBERT-Ghost
- openai-community/gpt2-large
pipeline_tag: text2text-generation
---

# CodeEncoderDecoderModel-Ghost-large👻

A multilingual encoder-decoder model for generating **docstrings from code snippets**.
It is based on a custom BERT-style encoder pretrained on source code (`CodeModernBERT-Ghost`) and a large-scale decoder model (`GPT2-large`).

## 🏗️ Model Architecture

- **Encoder:** [`Shuu12121/CodeModernBERT-Ghost`](https://huggingface.co/Shuu12121/CodeModernBERT-Ghost)
- **Decoder:** [`openai-community/gpt2-large`](https://huggingface.co/openai-community/gpt2-large)
- Connected via Hugging Face's `EncoderDecoderModel` with cross-attention, as sketched below.
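
For reference, a pairing like this can be assembled with `EncoderDecoderModel.from_encoder_decoder_pretrained`, which marks the GPT-2 side as a decoder and adds the cross-attention layers. The snippet below is a minimal sketch of that wiring (assuming a `transformers` version with ModernBERT support; the token-ID choices are illustrative, not the exact script used to build this checkpoint):

```python
from transformers import AutoTokenizer, EncoderDecoderModel

# Sketch: pair the pretrained encoder and decoder. from_encoder_decoder_pretrained
# sets is_decoder=True on the GPT-2 side and adds cross-attention layers.
model = EncoderDecoderModel.from_encoder_decoder_pretrained(
    "Shuu12121/CodeModernBERT-Ghost",
    "openai-community/gpt2-large",
)

decoder_tokenizer = AutoTokenizer.from_pretrained("openai-community/gpt2-large")
decoder_tokenizer.pad_token = decoder_tokenizer.eos_token  # GPT-2 ships without a pad token

# Generation-time token IDs on the top-level config (illustrative choices).
model.config.decoder_start_token_id = decoder_tokenizer.bos_token_id
model.config.eos_token_id = decoder_tokenizer.eos_token_id
model.config.pad_token_id = decoder_tokenizer.pad_token_id
```

The released checkpoint already contains this wiring, so for inference you only need `from_pretrained`, as shown in the usage section below.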

## 🎯 Intended Use

- Generating docstrings (documentation comments) for functions or methods in multiple languages.
- Summarizing code for educational or review purposes.
- Assisting in automated documentation generation pipelines.

Supported languages (code input):
- Python
- Java

## 📦 How to Use

```python
from transformers import AutoTokenizer, EncoderDecoderModel
import torch

# Load the paired model and its two tokenizers (encoder side and decoder side).
model = EncoderDecoderModel.from_pretrained("Shuu12121/CodeEncoderDecoderModel-Ghost-large").to("cuda")
encoder_tokenizer = AutoTokenizer.from_pretrained("Shuu12121/CodeEncoderDecoderModel-Ghost-large", subfolder="encoder_tokenizer")
decoder_tokenizer = AutoTokenizer.from_pretrained("Shuu12121/CodeEncoderDecoderModel-Ghost-large", subfolder="decoder_tokenizer")

# GPT-2 has no pad token; reuse EOS for padding.
if decoder_tokenizer.pad_token is None:
    decoder_tokenizer.pad_token = decoder_tokenizer.eos_token

code = '''
def greet(name):
    return f"Hello, {name}!"
'''

inputs = encoder_tokenizer(code, return_tensors="pt", truncation=True, padding=True, max_length=2048).to("cuda")
outputs = model.generate(
    input_ids=inputs.input_ids,
    attention_mask=inputs.attention_mask,
    max_length=256,
    num_beams=5,
    early_stopping=True,
    decoder_start_token_id=model.config.decoder_start_token_id,
    eos_token_id=model.config.eos_token_id,
    pad_token_id=model.config.pad_token_id,
    no_repeat_ngram_size=2
)

docstring = decoder_tokenizer.decode(outputs[0], skip_special_tokens=True)
print(docstring)
```
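
For the documentation-pipeline use case, the same call batches naturally. The helper below is an illustrative sketch that reuses the `model`, `encoder_tokenizer`, and `decoder_tokenizer` objects loaded above; the `document_functions` name and the batching choices are assumptions, not part of the released code:

```python
def document_functions(functions, batch_size=8):
    """Generate a docstring for each code snippet in `functions` (illustrative helper)."""
    docstrings = []
    for start in range(0, len(functions), batch_size):
        batch = functions[start:start + batch_size]
        inputs = encoder_tokenizer(
            batch, return_tensors="pt", truncation=True, padding=True, max_length=2048
        ).to("cuda")
        outputs = model.generate(
            input_ids=inputs.input_ids,
            attention_mask=inputs.attention_mask,
            max_length=256,
            num_beams=5,
            early_stopping=True,
            decoder_start_token_id=model.config.decoder_start_token_id,
            eos_token_id=model.config.eos_token_id,
            pad_token_id=model.config.pad_token_id,
            no_repeat_ngram_size=2,
        )
        docstrings.extend(decoder_tokenizer.batch_decode(outputs, skip_special_tokens=True))
    return docstrings

print(document_functions(["def add(a, b):\n    return a + b"]))
```

Batching mainly trades GPU memory for throughput; the generation settings match the single-example call above.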

## 🧪 Training Details

- **Task:** Code-to-docstring generation
- **Dataset:** [CodeXGLUE: Code-to-Text](https://github.com/microsoft/CodeXGLUE) – using subsets of Python, Java, JavaScript, Go, Ruby, PHP
- **Loss:** Cross-entropy over the tokenized docstring
- **Sequence lengths:** up to 2048 tokens on the encoder input, up to 256 tokens on the decoder output
- **Decoder modifications:** GPT2-large adapted with a padding token and cross-attention layers for the encoder-decoder setup
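
To make the loss definition above concrete, a single fine-tuning step of this kind typically tokenizes the code as encoder input, tokenizes the docstring as `labels`, and lets the model compute cross-entropy over the label tokens. The snippet below is a hedged sketch under those assumptions, not the author's training script:

```python
from transformers import AutoTokenizer, EncoderDecoderModel

# Assumed setup: the paired model and tokenizers published in this repository.
model = EncoderDecoderModel.from_pretrained("Shuu12121/CodeEncoderDecoderModel-Ghost-large")
encoder_tok = AutoTokenizer.from_pretrained(
    "Shuu12121/CodeEncoderDecoderModel-Ghost-large", subfolder="encoder_tokenizer"
)
decoder_tok = AutoTokenizer.from_pretrained(
    "Shuu12121/CodeEncoderDecoderModel-Ghost-large", subfolder="decoder_tokenizer"
)
if decoder_tok.pad_token is None:
    decoder_tok.pad_token = decoder_tok.eos_token

code = "def add(a, b):\n    return a + b"
docstring = "Return the sum of a and b."

enc = encoder_tok(code, return_tensors="pt", truncation=True, max_length=2048)
labels = decoder_tok(docstring, return_tensors="pt", truncation=True, max_length=256).input_ids

# The model shifts `labels` to build decoder inputs and returns the cross-entropy loss.
outputs = model(input_ids=enc.input_ids, attention_mask=enc.attention_mask, labels=labels)
print(outputs.loss)
outputs.loss.backward()
```

In a full run this step would sit inside a standard PyTorch training loop or a `Seq2SeqTrainer`, with batched, padded inputs and an optimizer.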

## ⚠️ Limitations & Risks

1. **Generated documentation may be inaccurate, incomplete, or misleading.** Always review generated docstrings manually.
2. **Formatting may not follow specific standards** (e.g., Google/NumPy style in Python or full Javadoc).
3. **Limited context:** The model only sees a single function at a time and lacks broader project-level understanding.
4. **Language variance:** Performance may differ across programming languages due to the training data distribution.
5. **⚠️ Decoder risks (GPT2-large):** GPT-2 models are known to sometimes generate inappropriate, offensive, or biased outputs, depending on the prompt. Although this model is fine-tuned on technical data (code-docstring pairs), similar risks **may still be present** in edge cases due to properties inherited from `gpt2-large`. Please exercise caution, especially when using the model in public or educational settings.

## 📄 License

Apache-2.0
Model weights and tokenizer artifacts are released under the same license. You are free to use, modify, and redistribute them with attribution.