---
tags:
- sentence-transformers
- sentence-similarity
- dataset_size:901028
- loss:CosineSimilarityLoss
base_model: Shuu12121/CodeModernBERT-Owl
pipeline_tag: sentence-similarity
library_name: sentence-transformers
metrics:
- pearson_cosine
- accuracy
- f1
model-index:
- name: SentenceTransformer based on Shuu12121/CodeModernBERT-Owl
  results:
  - task:
      type: semantic-similarity
      name: Semantic Similarity
    dataset:
      name: val
      type: val
    metrics:
    - type: pearson_cosine
      value: 0.9481467499740959
      name: Training Pearson Cosine
    - type: accuracy
      value: 0.9900051996071408
      name: Test Accuracy
    - type: f1
      value: 0.963323498754483
      name: Test F1 Score
license: apache-2.0
datasets:
- google/code_x_glue_cc_clone_detection_big_clone_bench
---
# SentenceTransformer based on `Shuu12121/CodeModernBERT-Owl🦉`
This model is a SentenceTransformer fine-tuned from [`Shuu12121/CodeModernBERT-Owl🦉`](https://huggingface.co/Shuu12121/CodeModernBERT-Owl) on the [BigCloneBench](https://huggingface.co/datasets/google/code_x_glue_cc_clone_detection_big_clone_bench) dataset for **code clone detection**. It maps code snippets into a 768-dimensional dense vector space for semantic similarity tasks.
## 🎯 Distinctive Performance and Stability
This model reaches about **0.99 accuracy and 0.96 F1** on the BigCloneBench test split for code clone detection.
One particularly noteworthy characteristic is that **changing the similarity threshold has minimal impact on classification performance**.
This indicates that the model has learned to **clearly separate clones from non-clones**, resulting in a **stable and reliable similarity score distribution**.
| Threshold | Accuracy | F1 Score |
|-------------------|-------------------|--------------------|
| 0.5 | 0.9900 | 0.9633 |
| 0.85 | 0.9903 | 0.9641 |
| 0.90 | 0.9902 | 0.9637 |
| 0.95 | 0.9887 | 0.9579 |
| 0.98 | 0.9879 | 0.9540 |
- **High Stability**: Across the tested thresholds from 0.5 to 0.98, accuracy stays within about 0.003 and F1 within about 0.01.
  _(This suggests that code pairs considered clones generally score between 0.9 and 1.0 in cosine similarity.)_
- **Reliable in Real-World Applications**: Even if the similarity threshold is slightly adjusted for different tasks or environments, the model maintains consistent performance without significant degradation.
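
The threshold comparison above can be reproduced with a small sweep. The sketch below is illustrative only: `accuracy_and_f1` is a hypothetical helper, and the scores and labels are made-up values, not outputs of this model. In practice `scores` would hold the cosine similarities of the evaluated pairs and `labels` their ground-truth clone labels.

```python
def accuracy_and_f1(scores, labels, threshold):
    """Label a pair as a clone (1) when its score is >= threshold,
    then return (accuracy, F1) against the ground-truth labels."""
    tp = fp = tn = fn = 0
    for score, label in zip(scores, labels):
        pred = 1 if score >= threshold else 0
        if pred == 1 and label == 1:
            tp += 1
        elif pred == 1 and label == 0:
            fp += 1
        elif pred == 0 and label == 0:
            tn += 1
        else:
            fn += 1
    accuracy = (tp + tn) / len(labels)
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0
    return accuracy, f1


# Made-up example data (assumption): well-separated clone / non-clone scores.
scores = [0.99, 0.97, 0.96, 0.40, 0.10, 0.05]
labels = [1, 1, 1, 0, 0, 0]

for threshold in (0.5, 0.85, 0.90, 0.95, 0.98):
    acc, f1 = accuracy_and_f1(scores, labels, threshold)
    print(f"threshold={threshold:.2f}  accuracy={acc:.4f}  f1={f1:.4f}")
```

When the two score groups are far apart, as in this toy data, the metrics only move once the threshold starts cutting into the clone scores themselves.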
## 📌 Model Overview
- **Architecture**: Sentence-BERT (SBERT)
- **Base Model**: `Shuu12121/CodeModernBERT-Owl`
- **Output Dimension**: 768
- **Max Sequence Length**: 2048 tokens
- **Pooling Method**: CLS token pooling
- **Similarity Function**: Cosine Similarity
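
As a rough illustration of the pooling and similarity settings listed above: CLS pooling takes the first token's vector as the embedding of the whole snippet, and cosine similarity compares two such embeddings. The sketch below uses tiny made-up 3-dimensional vectors in place of the real 768-dimensional ones; `cls_pool` and `cosine_similarity` are hypothetical helpers, not the library's implementation.

```python
import math


def cls_pool(token_embeddings):
    """Return the first token's vector ([CLS]) as the sequence embedding."""
    return token_embeddings[0]


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Made-up per-token outputs for two snippets (assumption: 3 dims, not 768).
tokens_a = [[0.9, 0.1, 0.0], [0.2, 0.5, 0.3]]
tokens_b = [[0.8, 0.2, 0.1], [0.7, 0.1, 0.9]]

emb_a = cls_pool(tokens_a)
emb_b = cls_pool(tokens_b)
print(f"cosine similarity: {cosine_similarity(emb_a, emb_b):.4f}")
```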
---
## 🏋️‍♂️ Training Configuration
- **Loss Function**: `CosineSimilarityLoss`
- **Epochs**: 1
- **Batch Size**: 32
- **Warmup Steps**: 3% of training steps
- **Evaluator**: `EmbeddingSimilarityEvaluator` (on validation)
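
To make these settings concrete, the sketch below works through two pieces of arithmetic. It assumes that the `dataset_size:901028` tag in the metadata is the number of training pairs, that there is no gradient accumulation, and that the warmup count is rounded down; none of this is confirmed by the training script. It also shows the per-pair objective that `CosineSimilarityLoss` uses by default (squared error between the cosine similarity and the label). Both function names are hypothetical.

```python
import math


def schedule_steps(num_examples, batch_size, epochs, warmup_fraction):
    """Return (total optimizer steps, warmup steps) under the stated assumptions."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    total_steps = steps_per_epoch * epochs
    return total_steps, int(total_steps * warmup_fraction)


def cosine_similarity_loss(u, v, label):
    """Squared error between cos(u, v) and the target label for one pair."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return (dot / (norm_u * norm_v) - label) ** 2


total, warmup = schedule_steps(901028, batch_size=32, epochs=1, warmup_fraction=0.03)
print(total, warmup)  # 28158 844

# A clone pair (label 1.0) with identical embeddings incurs zero loss.
print(cosine_similarity_loss([1.0, 0.0], [1.0, 0.0], 1.0))  # 0.0
```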
---
## 📊 Evaluation Metrics
| Metric                            | Score    |
|-----------------------------------|----------|
| Pearson Cosine (Train)            | `0.9481` |
| Accuracy (Test, threshold 0.90)   | `0.9902` |
| F1 Score (Test, threshold 0.90)   | `0.9637` |
---
## 📚 Dataset
- [Google BigCloneBench](https://huggingface.co/datasets/google/code_x_glue_cc_clone_detection_big_clone_bench)
---
## 🧪 How to Use
```python
from sentence_transformers import SentenceTransformer
from torch.nn.functional import cosine_similarity
import torch
# Load the fine-tuned model
model = SentenceTransformer("Shuu12121/CodeCloneDetection-ModernBERT-Owl")
# Two code snippets to compare
code1 = "def add(a, b): return a + b"
code2 = "def sum(x, y): return x + y"
# Encode the code snippets
embeddings = model.encode([code1, code2], convert_to_tensor=True)
# Compute cosine similarity
similarity_score = cosine_similarity(embeddings[0].unsqueeze(0), embeddings[1].unsqueeze(0)).item()
# Print the result
print(f"Cosine Similarity: {similarity_score:.4f}")
if similarity_score >= 0.9:
    print("🟢 These code snippets are considered CLONES.")
else:
    print("🔴 These code snippets are NOT considered clones.")
```
## 🧪 How to Test
```python
# Install dependencies first (in a notebook, prefix the command with "!"):
#   pip install -U sentence-transformers datasets scikit-learn
from sentence_transformers import SentenceTransformer
from datasets import load_dataset
import torch
from sklearn.metrics import accuracy_score, f1_score

# --- Load the dataset ---
ds_test = load_dataset("google/code_x_glue_cc_clone_detection_big_clone_bench", split="test")
model = SentenceTransformer("Shuu12121/CodeCloneDetection-ModernBERT-Owl")
# Use the GPU when one is available; otherwise fall back to the CPU
model.to("cuda" if torch.cuda.is_available() else "cpu")

test_sentences1 = ds_test["func1"]
test_sentences2 = ds_test["func2"]
test_labels = ds_test["label"]

batch_size = 256  # Adjust to fit your GPU memory

print("Encoding sentences1...")
embeddings1 = model.encode(
    test_sentences1,
    convert_to_tensor=True,
    batch_size=batch_size,
    show_progress_bar=True
)
print("Encoding sentences2...")
embeddings2 = model.encode(
    test_sentences2,
    convert_to_tensor=True,
    batch_size=batch_size,
    show_progress_bar=True
)

print("Calculating cosine scores...")
# Row-wise cosine similarity between each func1/func2 pair
cosine_scores = torch.nn.functional.cosine_similarity(embeddings1, embeddings2)

# Set the threshold (0.9 is used here)
threshold = 0.9
print(f"Using threshold: {threshold}")
predictions = (cosine_scores > threshold).long().cpu().numpy()
accuracy = accuracy_score(test_labels, predictions)
f1 = f1_score(test_labels, predictions)
print("Test Accuracy:", accuracy)
print("Test F1 Score:", f1)
```
## 🛠️ Model Architecture
```python
SentenceTransformer(
  (0): Transformer({'max_seq_length': 2048}) with model 'ModernBertModel'
  (1): Pooling({
    'word_embedding_dimension': 768,
    'pooling_mode_cls_token': True,
    ...
  })
)
```
---
## 📦 Dependencies
- Python: `3.11.11`
- sentence-transformers: `4.0.1`
- transformers: `4.50.3`
- torch: `2.6.0+cu124`
- datasets: `3.5.0`
- tokenizers: `0.21.1`
- flash-attn: ✅ Installed
### Install Required Libraries
```bash
# Quote the version specifier so the shell does not treat ">" as a redirect
pip install -U sentence-transformers "transformers>=4.48.0" flash-attn datasets
```
---
## 🔐 Optional: Authentication
```python
from huggingface_hub import login
login("your_huggingface_token")
import wandb
wandb.login(key="your_wandb_token")
```
---
## 🧾 Citation
```bibtex
@inproceedings{reimers-2019-sentence-bert,
title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
author = "Reimers, Nils and Gurevych, Iryna",
booktitle = "EMNLP 2019",
url = "https://arxiv.org/abs/1908.10084"
}
```
---
## 🔓 License
Apache License 2.0