---
tags:
- sentence-transformers
- sentence-similarity
- dataset_size:901028
- loss:CosineSimilarityLoss
base_model: Shuu12121/CodeModernBERT-Owl
pipeline_tag: sentence-similarity
library_name: sentence-transformers
metrics:
- pearson_cosine
- accuracy
- f1
model-index:
- name: SentenceTransformer based on Shuu12121/CodeModernBERT-Owl
  results:
  - task:
      type: semantic-similarity
      name: Semantic Similarity
    dataset:
      name: val
      type: val
    metrics:
    - type: pearson_cosine
      value: 0.9481467499740959
      name: Training Pearson Cosine
    - type: accuracy
      value: 0.9900051996071408
      name: Test Accuracy
    - type: f1
      value: 0.963323498754483
      name: Test F1 Score
license: apache-2.0
datasets:
- google/code_x_glue_cc_clone_detection_big_clone_bench
---

# SentenceTransformer based on `Shuu12121/CodeModernBERT-Owl🦉`

This model is a SentenceTransformer fine-tuned from [`Shuu12121/CodeModernBERT-Owl🦉`](https://huggingface.co/Shuu12121/CodeModernBERT-Owl) on the [BigCloneBench](https://huggingface.co/datasets/google/code_x_glue_cc_clone_detection_big_clone_bench) dataset for **code clone detection**. It maps code snippets into a 768-dimensional dense vector space for semantic similarity tasks.



## 🎯 Distinctive Performance and Stability

This model achieves **very high accuracy and F1 scores** in code clone detection.  
One particularly noteworthy characteristic is that **changing the similarity threshold has minimal impact on classification performance**.  
This indicates that the model has learned to **clearly separate clones from non-clones**, resulting in a **stable and reliable similarity score distribution**.

| Threshold         | Accuracy          | F1 Score           |
|-------------------|-------------------|--------------------|
| 0.5               | 0.9900            | 0.9633             |
| 0.85              | 0.9903            | 0.9641             |
| 0.90              | 0.9902            | 0.9637             |
| 0.95              | 0.9887            | 0.9579             |
| 0.98              | 0.9879            | 0.9540             |

- **High Stability**: Between thresholds of 0.85 and 0.98, accuracy and F1 scores remain nearly constant.  
  _(This suggests that code pairs considered clones generally score between 0.9 and 1.0 in cosine similarity.)_

- **Reliable in Real-World Applications**: Even if the similarity threshold is slightly adjusted for different tasks or environments, the model maintains consistent performance without significant degradation.
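
If you want to reproduce this sweep, a minimal sketch is below. It assumes you already have 1-D arrays of cosine scores and gold labels, e.g. `cosine_scores.cpu().numpy()` and `test_labels` from the "How to Test" script further down:

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

def sweep_thresholds(cosine_scores, labels, thresholds=(0.5, 0.85, 0.90, 0.95, 0.98)):
    """Print accuracy and F1 at each decision threshold."""
    cosine_scores = np.asarray(cosine_scores)
    labels = np.asarray(labels).astype(int)
    for t in thresholds:
        preds = (cosine_scores > t).astype(int)
        print(f"threshold={t:.2f}  accuracy={accuracy_score(labels, preds):.4f}  "
              f"F1={f1_score(labels, preds):.4f}")
```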



## 📌 Model Overview

- **Architecture**: Sentence-BERT (SBERT)
- **Base Model**: `Shuu12121/CodeModernBERT-Owl`
- **Output Dimension**: 768
- **Max Sequence Length**: 2048 tokens
- **Pooling Method**: CLS token pooling
- **Similarity Function**: Cosine Similarity
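
These properties can be checked directly on the loaded model; a quick sanity-check sketch:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Shuu12121/CodeCloneDetection-ModernBERT-Owl")

print(model.get_sentence_embedding_dimension())  # 768
print(model.max_seq_length)                      # 2048

embedding = model.encode("def add(a, b): return a + b")
print(embedding.shape)                           # (768,)
```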

---

## 🏋️‍♂️ Training Configuration

- **Loss Function**: `CosineSimilarityLoss`
- **Epochs**: 1
- **Batch Size**: 32
- **Warmup Steps**: 3% of training steps
- **Evaluator**: `EmbeddingSimilarityEvaluator` (on validation)
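
The exact training script is not included in this card, but a minimal sketch of an equivalent run with the legacy `model.fit` API might look as follows (the `func1`/`func2`/`label` column mapping is taken from the test script below; the validation-time evaluator setup is omitted):

```python
from torch.utils.data import DataLoader
from datasets import load_dataset
from sentence_transformers import SentenceTransformer, InputExample, losses

# BigCloneBench pairs: clones are labeled 1, non-clones 0
ds_train = load_dataset("google/code_x_glue_cc_clone_detection_big_clone_bench", split="train")
train_examples = [
    InputExample(texts=[ex["func1"], ex["func2"]], label=float(ex["label"]))
    for ex in ds_train
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=32)

model = SentenceTransformer("Shuu12121/CodeModernBERT-Owl")
train_loss = losses.CosineSimilarityLoss(model)  # regresses cosine similarity toward the 0/1 labels

warmup_steps = int(0.03 * len(train_dataloader))  # 3% of training steps, single epoch

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=1,
    warmup_steps=warmup_steps,
)
```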

---

## 📊 Evaluation Metrics

| Metric                          | Score    |
|---------------------------------|----------|
| Pearson Cosine (Validation)     | `0.9481` |
| Accuracy (Test, threshold 0.90) | `0.9902` |
| F1 Score (Test, threshold 0.90) | `0.9637` |

---

## 📚 Dataset

- [Google BigCloneBench](https://huggingface.co/datasets/google/code_x_glue_cc_clone_detection_big_clone_bench)

---

## 🧪 How to Use

```python
from sentence_transformers import SentenceTransformer
from torch.nn.functional import cosine_similarity

# Load the fine-tuned model
model = SentenceTransformer("Shuu12121/CodeCloneDetection-ModernBERT-Owl")

# Two code snippets to compare
code1 = "def add(a, b): return a + b"
code2 = "def sum(x, y): return x + y"

# Encode the code snippets
embeddings = model.encode([code1, code2], convert_to_tensor=True)

# Compute cosine similarity
similarity_score = cosine_similarity(embeddings[0].unsqueeze(0), embeddings[1].unsqueeze(0)).item()

# Print the result
print(f"Cosine Similarity: {similarity_score:.4f}")
if similarity_score >= 0.9:
    print("🟢 These code snippets are considered CLONES.")
else:
    print("🔴 These code snippets are NOT considered clones.")
```
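
Alternatively, sentence-transformers 3.x and later expose a `model.similarity(...)` helper that applies the model's configured similarity function (cosine here), avoiding the manual `unsqueeze` bookkeeping:

```python
# embeddings is the (2, 768) tensor from the snippet above
similarity_matrix = model.similarity(embeddings, embeddings)  # 2x2 cosine-similarity matrix
print(f"Cosine Similarity: {similarity_matrix[0, 1].item():.4f}")
```
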
## 🧪 How to Test

```python
!pip install -U sentence-transformers datasets scikit-learn  # notebook cell; drop the leading "!" in a plain shell

from sentence_transformers import SentenceTransformer
from datasets import load_dataset
import torch
from sklearn.metrics import accuracy_score, f1_score

# --- Load the BigCloneBench test split ---
ds_test = load_dataset("google/code_x_glue_cc_clone_detection_big_clone_bench", split="test")

model = SentenceTransformer("Shuu12121/CodeCloneDetection-ModernBERT-Owl")
model.to("cuda" if torch.cuda.is_available() else "cpu")

test_sentences1 = ds_test["func1"]
test_sentences2 = ds_test["func2"]
test_labels = ds_test["label"]

batch_size = 256  # adjust to fit your GPU memory

print("Encoding sentences1...")

embeddings1 = model.encode(
    test_sentences1,
    convert_to_tensor=True,
    batch_size=batch_size,
    show_progress_bar=True
)

print("Encoding sentences2...")
embeddings2 = model.encode(
    test_sentences2,
    convert_to_tensor=True,
    batch_size=batch_size,
    show_progress_bar=True
)

print("Calculating cosine scores...")
cosine_scores = torch.nn.functional.cosine_similarity(embeddings1, embeddings2)

# Similarity threshold (0.9 is used here)
threshold = 0.9
print(f"Using threshold: {threshold}")
predictions = (cosine_scores > threshold).long().cpu().numpy()

accuracy = accuracy_score(test_labels, predictions)
f1 = f1_score(test_labels, predictions)
print("Test Accuracy:", accuracy)
print("Test F1 Score:", f1)

```

## 🛠️ Model Architecture
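
For reference, this is the (truncated) module composition that `print(model)` reports for the loaded SentenceTransformer: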

```python
SentenceTransformer(
  (0): Transformer({'max_seq_length': 2048}) with model 'ModernBertModel'
  (1): Pooling({
        'word_embedding_dimension': 768,
        'pooling_mode_cls_token': True,
        ...
  })
)
```

---

## 📦 Dependencies

- Python: `3.11.11`
- sentence-transformers: `4.0.1`
- transformers: `4.50.3`
- torch: `2.6.0+cu124`
- datasets: `3.5.0`
- tokenizers: `0.21.1`
- flash-attn: ✅ Installed

### Install Required Libraries

```bash
pip install -U sentence-transformers "transformers>=4.48.0" flash-attn datasets
```

---

## 🔐 Optional: Authentication

```python
from huggingface_hub import login
login("your_huggingface_token")

import wandb
wandb.login(key="your_wandb_token")
```

---

## 🧾 Citation

```bibtex
@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}
```

---

## 🔓 License

Apache License 2.0