|
--- |
|
tags: |
|
- sentence-transformers |
|
- sentence-similarity |
|
- dataset_size:901028 |
|
- loss:CosineSimilarityLoss |
|
base_model: Shuu12121/CodeModernBERT-Owl |
|
pipeline_tag: sentence-similarity |
|
library_name: sentence-transformers |
|
metrics: |
|
- pearson_cosine |
|
- accuracy |
|
- f1 |
|
model-index: |
|
- name: SentenceTransformer based on Shuu12121/CodeModernBERT-Owl |
|
results: |
|
- task: |
|
type: semantic-similarity |
|
name: Semantic Similarity |
|
dataset: |
|
name: val |
|
type: val |
|
metrics: |
|
- type: pearson_cosine |
|
value: 0.9481467499740959 |
|
name: Training Pearson Cosine |
|
- type: accuracy |
|
value: 0.9900051996071408 |
|
name: Test Accuracy |
|
- type: f1 |
|
value: 0.963323498754483 |
|
name: Test F1 Score |
|
license: apache-2.0 |
|
datasets: |
|
- google/code_x_glue_cc_clone_detection_big_clone_bench |
|
--- |
|
|
|
# SentenceTransformer based on `Shuu12121/CodeModernBERT-Owl🦉` |
|
|
|
This model is a SentenceTransformer fine-tuned from [`Shuu12121/CodeModernBERT-Owl🦉`](https://huggingface.co/Shuu12121/CodeModernBERT-Owl) on the [BigCloneBench](https://huggingface.co/datasets/google/code_x_glue_cc_clone_detection_big_clone_bench) dataset for **code clone detection**. It maps code snippets into a 768-dimensional dense vector space for semantic similarity tasks. |
|
|
|
|
|
|
|
## 🎯 Distinctive Performance and Stability |
|
|
|
This model achieves **very high accuracy and F1 scores** in code clone detection. |
|
One particularly noteworthy characteristic is that **changing the similarity threshold has minimal impact on classification performance**. |
|
This indicates that the model has learned to **clearly separate clones from non-clones**, resulting in a **stable and reliable similarity score distribution**. |
|
|
|
| Threshold | Accuracy | F1 Score | |
|
|-------------------|-------------------|--------------------| |
|
| 0.5 | 0.9900 | 0.9633 | |
|
| 0.85 | 0.9903 | 0.9641 | |
|
| 0.90 | 0.9902 | 0.9637 | |
|
| 0.95 | 0.9887 | 0.9579 | |
|
| 0.98 | 0.9879 | 0.9540 | |
|
|
|
- **High Stability**: Between thresholds of 0.85 and 0.98, accuracy and F1 scores remain nearly constant. |
|
_(This suggests that code pairs considered clones generally score between 0.9 and 1.0 in cosine similarity.)_ |
|
|
|
- **Reliable in Real-World Applications**: Even if the similarity threshold is slightly adjusted for different tasks or environments, the model maintains consistent performance without significant degradation, as the threshold-sweep sketch below illustrates.
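
This stability is easy to verify with a threshold sweep. A minimal sketch, assuming `cosine_scores` and `test_labels` have already been built as in the "How to Test" section below:

```python
# Sweep decision thresholds over precomputed cosine scores to check how
# flat accuracy and F1 remain. Assumes `cosine_scores` (pairwise cosine
# similarities) and `test_labels` exist as in the "How to Test" section.
from sklearn.metrics import accuracy_score, f1_score

for threshold in (0.5, 0.85, 0.90, 0.95, 0.98):
    predictions = (cosine_scores > threshold).long().cpu().numpy()
    acc = accuracy_score(test_labels, predictions)
    f1 = f1_score(test_labels, predictions)
    print(f"threshold={threshold:.2f}  accuracy={acc:.4f}  f1={f1:.4f}")
```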
|
|
|
|
|
|
|
## 📌 Model Overview |
|
|
|
- **Architecture**: Sentence-BERT (SBERT) |
|
- **Base Model**: `Shuu12121/CodeModernBERT-Owl` |
|
- **Output Dimension**: 768 |
|
- **Max Sequence Length**: 2048 tokens |
|
- **Pooling Method**: CLS token pooling |
|
- **Similarity Function**: Cosine Similarity |
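
These properties can be read directly off the loaded model. A quick sanity check using the standard `sentence-transformers` accessors:

```python
from sentence_transformers import SentenceTransformer

# Load the published checkpoint and inspect its key properties
model = SentenceTransformer("Shuu12121/CodeCloneDetection-ModernBERT-Owl")

print(model.get_sentence_embedding_dimension())  # 768
print(model.max_seq_length)                      # 2048
print(model)                                     # Transformer + CLS-token Pooling stack
```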
|
|
|
--- |
|
|
|
## 🏋️‍♂️ Training Configuration
|
|
|
- **Loss Function**: `CosineSimilarityLoss` |
|
- **Epochs**: 1 |
|
- **Batch Size**: 32 |
|
- **Warmup Steps**: 3% of training steps |
|
- **Evaluator**: `EmbeddingSimilarityEvaluator` (on validation) |
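
The exact training script is not reproduced here, but a minimal sketch of an equivalent setup with the classic `model.fit` API, under the configuration above, might look like the following. The pair construction is illustrative and may differ in detail from the original run:

```python
from torch.utils.data import DataLoader
from datasets import load_dataset
from sentence_transformers import SentenceTransformer, InputExample, losses
from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator

# Illustrative sketch: build (func1, func2, label) pairs from BigCloneBench
ds_train = load_dataset("google/code_x_glue_cc_clone_detection_big_clone_bench", split="train")
train_examples = [
    InputExample(texts=[row["func1"], row["func2"]], label=float(row["label"]))
    for row in ds_train
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=32)

# Validation evaluator: cosine similarity against the binary clone label
ds_val = load_dataset("google/code_x_glue_cc_clone_detection_big_clone_bench", split="validation")
evaluator = EmbeddingSimilarityEvaluator(
    ds_val["func1"], ds_val["func2"], [float(label) for label in ds_val["label"]]
)

model = SentenceTransformer("Shuu12121/CodeModernBERT-Owl")
train_loss = losses.CosineSimilarityLoss(model)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    evaluator=evaluator,
    epochs=1,
    warmup_steps=int(0.03 * len(train_dataloader)),  # 3% of training steps
)
```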
|
|
|
--- |
|
|
|
## 📊 Evaluation Metrics |
|
|
|
| Metric                         | Score    |
|--------------------------------|----------|
| Pearson Cosine (Train)         | `0.9481` |
| Accuracy (Test, threshold 0.9) | `0.9902` |
| F1 Score (Test, threshold 0.9) | `0.9637` |
|
|
|
--- |
|
|
|
## 📚 Dataset |
|
|
|
- [Google BigCloneBench](https://huggingface.co/datasets/google/code_x_glue_cc_clone_detection_big_clone_bench) |
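
Each example pairs two Java functions with a binary clone label. A quick peek at the schema, using the same field names as the "How to Test" section below:

```python
from datasets import load_dataset

# Inspect the test split: each row holds a pair of functions and a clone label
ds = load_dataset("google/code_x_glue_cc_clone_detection_big_clone_bench", split="test")

print(ds.column_names)  # includes 'func1', 'func2', and 'label'
print(ds[0]["label"])   # clone / non-clone label for the first pair
```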
|
|
|
--- |
|
|
|
## 🧪 How to Use |
|
|
|
```python |
|
from sentence_transformers import SentenceTransformer |
|
from torch.nn.functional import cosine_similarity
|
|
|
# Load the fine-tuned model |
|
model = SentenceTransformer("Shuu12121/CodeCloneDetection-ModernBERT-Owl") |
|
|
|
# Two code snippets to compare |
|
code1 = "def add(a, b): return a + b" |
|
code2 = "def sum(x, y): return x + y" |
|
|
|
# Encode the code snippets |
|
embeddings = model.encode([code1, code2], convert_to_tensor=True) |
|
|
|
# Compute cosine similarity |
|
similarity_score = cosine_similarity(embeddings[0].unsqueeze(0), embeddings[1].unsqueeze(0)).item() |
|
|
|
# Print the result |
|
print(f"Cosine Similarity: {similarity_score:.4f}") |
|
if similarity_score >= 0.9: |
|
print("🟢 These code snippets are considered CLONES.") |
|
else: |
|
print("🔴 These code snippets are NOT considered clones.") |
|
``` |
|
## 🧪 How to Test |
|
|
|
```python |
|
# In a notebook (e.g., Google Colab), install the dependencies first:
# !pip install -U sentence-transformers datasets
|
|
|
from sentence_transformers import SentenceTransformer |
|
from datasets import load_dataset |
|
import torch |
|
from sklearn.metrics import accuracy_score, f1_score |
|
|
|
# --- Load the dataset ---
|
ds_test = load_dataset("google/code_x_glue_cc_clone_detection_big_clone_bench", split="test") |
|
|
|
model = SentenceTransformer("Shuu12121/CodeCloneDetection-ModernBERT-Owl") |
|
model.to("cuda") |
|
|
|
|
|
test_sentences1 = ds_test["func1"] |
|
test_sentences2 = ds_test["func2"] |
|
test_labels = ds_test["label"] |
|
|
|
batch_size = 256  # Adjust to fit your GPU memory
|
|
|
print("Encoding sentences1...") |
|
|
|
embeddings1 = model.encode( |
|
test_sentences1, |
|
convert_to_tensor=True, |
|
batch_size=batch_size, |
|
show_progress_bar=True |
|
) |
|
|
|
print("Encoding sentences2...") |
|
embeddings2 = model.encode( |
|
test_sentences2, |
|
convert_to_tensor=True, |
|
batch_size=batch_size, |
|
show_progress_bar=True |
|
) |
|
|
|
print("Calculating cosine scores...") |
|
cosine_scores = torch.nn.functional.cosine_similarity(embeddings1, embeddings2) |
|
|
|
# Set the decision threshold (0.9 is used here)
|
threshold = 0.9 |
|
print(f"Using threshold: {threshold}") |
|
predictions = (cosine_scores > threshold).long().cpu().numpy() |
|
|
|
accuracy = accuracy_score(test_labels, predictions) |
|
f1 = f1_score(test_labels, predictions) |
|
print("Test Accuracy:", accuracy) |
|
print("Test F1 Score:", f1) |
|
|
|
``` |
|
|
|
## 🛠️ Model Architecture |
|
|
|
```python |
|
SentenceTransformer( |
|
(0): Transformer({'max_seq_length': 2048}) with model 'ModernBertModel' |
|
(1): Pooling({ |
|
'word_embedding_dimension': 768, |
|
'pooling_mode_cls_token': True, |
|
... |
|
}) |
|
) |
|
``` |
|
|
|
--- |
|
|
|
## 📦 Dependencies |
|
|
|
- Python: `3.11.11` |
|
- sentence-transformers: `4.0.1` |
|
- transformers: `4.50.3` |
|
- torch: `2.6.0+cu124` |
|
- datasets: `3.5.0` |
|
- tokenizers: `0.21.1` |
|
- flash-attn: ✅ Installed |
|
|
|
### Install Required Libraries |
|
|
|
```bash |
|
pip install -U sentence-transformers "transformers>=4.48.0" flash-attn datasets
|
``` |
|
|
|
--- |
|
|
|
## 🔐 Optional: Authentication |
|
|
|
```python |
|
from huggingface_hub import login

# Log in to the Hugging Face Hub (replace with your own access token)
login("your_huggingface_token")

import wandb

# Optional: log training runs to Weights & Biases (replace with your own key)
wandb.login(key="your_wandb_token")
|
``` |
|
|
|
--- |
|
|
|
## 🧾 Citation |
|
|
|
```bibtex |
|
@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing (EMNLP)",
    year = "2019",
    url = "https://arxiv.org/abs/1908.10084",
}
|
``` |
|
|
|
--- |
|
|
|
## 🔓 License |
|
|
|
Apache License 2.0 |
|
|
|
|