---
tags:
- sentence-transformers
- sentence-similarity
- dataset_size:901028
- loss:CosineSimilarityLoss
base_model: Shuu12121/CodeModernBERT-Owl
pipeline_tag: sentence-similarity
library_name: sentence-transformers
metrics:
- pearson_cosine
- accuracy
- f1
model-index:
- name: SentenceTransformer based on Shuu12121/CodeModernBERT-Owl
results:
- task:
type: semantic-similarity
name: Semantic Similarity
dataset:
name: val
type: val
metrics:
- type: pearson_cosine
value: 0.9481467499740959
name: Training Pearson Cosine
- type: accuracy
value: 0.9900051996071408
name: Test Accuracy
- type: f1
value: 0.963323498754483
name: Test F1 Score
license: apache-2.0
datasets:
- google/code_x_glue_cc_clone_detection_big_clone_bench
---
# SentenceTransformer based on `Shuu12121/CodeModernBERT-Owl🦉`
This model is a SentenceTransformer fine-tuned from [`Shuu12121/CodeModernBERT-Owl🦉`](https://huggingface.co/Shuu12121/CodeModernBERT-Owl) on the [BigCloneBench](https://huggingface.co/datasets/google/code_x_glue_cc_clone_detection_big_clone_bench) dataset for **code clone detection**. It maps code snippets into a 768-dimensional dense vector space for semantic similarity tasks.
## 🎯 Distinctive Performance and Stability
This model achieves **very high accuracy and F1 scores** in code clone detection.
One particularly noteworthy characteristic is that **changing the similarity threshold has minimal impact on classification performance**.
This indicates that the model has learned to **clearly separate clones from non-clones**, resulting in a **stable and reliable similarity score distribution**.
| Threshold | Accuracy | F1 Score |
|-------------------|-------------------|--------------------|
| 0.5 | 0.9900 | 0.9633 |
| 0.85 | 0.9903 | 0.9641 |
| 0.90 | 0.9902 | 0.9637 |
| 0.95 | 0.9887 | 0.9579 |
| 0.98 | 0.9879 | 0.9540 |
- **High Stability**: Between thresholds of 0.85 and 0.98, accuracy and F1 scores remain nearly constant.
_(This suggests that code pairs considered clones generally score between 0.9 and 1.0 in cosine similarity.)_
- **Reliable in Real-World Applications**: Even if the similarity threshold is adjusted slightly for different tasks or environments, the model maintains consistent performance without significant degradation (see the threshold-sweep sketch below).
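The table above can be reproduced by sweeping the decision threshold over cosine scores computed on the BigCloneBench test split. The sketch below mirrors the evaluation code in the "How to Test" section further down; exact values may differ slightly depending on hardware and batch size.
```python
import torch
from datasets import load_dataset
from sentence_transformers import SentenceTransformer
from sklearn.metrics import accuracy_score, f1_score

# Load the test split and the fine-tuned model (GPU if available)
ds = load_dataset("google/code_x_glue_cc_clone_detection_big_clone_bench", split="test")
device = "cuda" if torch.cuda.is_available() else "cpu"
model = SentenceTransformer("Shuu12121/CodeCloneDetection-ModernBERT-Owl", device=device)

# Encode both functions of every pair and score them with cosine similarity
emb1 = model.encode(ds["func1"], convert_to_tensor=True, batch_size=256, show_progress_bar=True)
emb2 = model.encode(ds["func2"], convert_to_tensor=True, batch_size=256, show_progress_bar=True)
scores = torch.nn.functional.cosine_similarity(emb1, emb2).cpu()

# Sweep the decision threshold and report accuracy / F1 at each setting
for threshold in (0.5, 0.85, 0.90, 0.95, 0.98):
    preds = (scores > threshold).long().numpy()
    print(f"threshold={threshold:.2f}  "
          f"accuracy={accuracy_score(ds['label'], preds):.4f}  "
          f"F1={f1_score(ds['label'], preds):.4f}")
```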
## 📌 Model Overview
- **Architecture**: Sentence-BERT (SBERT)
- **Base Model**: `Shuu12121/CodeModernBERT-Owl`
- **Output Dimension**: 768
- **Max Sequence Length**: 2048 tokens
- **Pooling Method**: CLS token pooling
- **Similarity Function**: Cosine Similarity (these properties can be checked as shown below)
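A small sketch, using standard sentence-transformers accessors, to verify these properties on the loaded model:
```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Shuu12121/CodeCloneDetection-ModernBERT-Owl")

# Embedding dimension and maximum sequence length reported by the loaded model
print(model.get_sentence_embedding_dimension())  # 768
print(model.max_seq_length)                      # 2048

# Printing the model lists the Transformer backbone and the CLS-token pooling layer
print(model)
```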
---
## 🏋️‍♂️ Training Configuration
- **Loss Function**: `CosineSimilarityLoss`
- **Epochs**: 1
- **Batch Size**: 32
- **Warmup Steps**: 3% of training steps
- **Evaluator**: `EmbeddingSimilarityEvaluator` (on the validation split; a minimal training sketch follows below)
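These settings correspond to the classic sentence-transformers fit loop. Below is a minimal, illustrative sketch assuming the legacy `model.fit` API, with two hypothetical placeholder pairs standing in for the real BigCloneBench training data; it is not the exact training script.
```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses
from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator

# Placeholder pairs: (code1, code2, similarity label in [0, 1]).
# The actual training data comes from BigCloneBench, not these two examples.
train_examples = [
    InputExample(texts=["def add(a, b): return a + b", "def sum(x, y): return x + y"], label=1.0),
    InputExample(texts=["def add(a, b): return a + b", "def read(p): return open(p).read()"], label=0.0),
]
val_examples = train_examples  # use a real held-out validation split in practice

model = SentenceTransformer("Shuu12121/CodeModernBERT-Owl")
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=32)
train_loss = losses.CosineSimilarityLoss(model)
evaluator = EmbeddingSimilarityEvaluator.from_input_examples(val_examples, name="val")

total_steps = len(train_dataloader) * 1  # 1 epoch
model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    evaluator=evaluator,
    epochs=1,
    warmup_steps=int(0.03 * total_steps),  # 3% of training steps
)
```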
---
## 📊 Evaluation Metrics
| Metric | Score |
|---------------------------|--------------------|
| Pearson Cosine (Train) | `0.9481` |
| Accuracy (Test, threshold 0.9) | `0.9902` |
| F1 Score (Test, threshold 0.9) | `0.9637` |
---
## 📚 Dataset
- [Google BigCloneBench](https://huggingface.co/datasets/google/code_x_glue_cc_clone_detection_big_clone_bench)
---
## 🧪 How to Use
```python
from sentence_transformers import SentenceTransformer
from torch.nn.functional import cosine_similarity

# Load the fine-tuned model
model = SentenceTransformer("Shuu12121/CodeCloneDetection-ModernBERT-Owl")

# Two code snippets to compare
code1 = "def add(a, b): return a + b"
code2 = "def sum(x, y): return x + y"

# Encode the code snippets
embeddings = model.encode([code1, code2], convert_to_tensor=True)

# Compute cosine similarity between the two embeddings
similarity_score = cosine_similarity(embeddings[0].unsqueeze(0), embeddings[1].unsqueeze(0)).item()

# Print the result and apply the 0.9 decision threshold
print(f"Cosine Similarity: {similarity_score:.4f}")
if similarity_score >= 0.9:
    print("🟢 These code snippets are considered CLONES.")
else:
    print("🔴 These code snippets are NOT considered clones.")
```
## 🧪 How to Test
```python
# In a notebook, install dependencies first: !pip install -U sentence-transformers datasets
from sentence_transformers import SentenceTransformer
from datasets import load_dataset
import torch
from sklearn.metrics import accuracy_score, f1_score

# --- Load the dataset ---
ds_test = load_dataset("google/code_x_glue_cc_clone_detection_big_clone_bench", split="test")

model = SentenceTransformer("Shuu12121/CodeCloneDetection-ModernBERT-Owl")
model.to("cuda")

test_sentences1 = ds_test["func1"]
test_sentences2 = ds_test["func2"]
test_labels = ds_test["label"]

batch_size = 256  # adjust to fit your GPU memory

print("Encoding sentences1...")
embeddings1 = model.encode(
    test_sentences1,
    convert_to_tensor=True,
    batch_size=batch_size,
    show_progress_bar=True
)

print("Encoding sentences2...")
embeddings2 = model.encode(
    test_sentences2,
    convert_to_tensor=True,
    batch_size=batch_size,
    show_progress_bar=True
)

print("Calculating cosine scores...")
cosine_scores = torch.nn.functional.cosine_similarity(embeddings1, embeddings2)

# Decision threshold (0.9 is used here)
threshold = 0.9
print(f"Using threshold: {threshold}")

predictions = (cosine_scores > threshold).long().cpu().numpy()
accuracy = accuracy_score(test_labels, predictions)
f1 = f1_score(test_labels, predictions)

print("Test Accuracy:", accuracy)
print("Test F1 Score:", f1)
```
## 🛠️ Model Architecture
```python
SentenceTransformer(
(0): Transformer({'max_seq_length': 2048}) with model 'ModernBertModel'
(1): Pooling({
'word_embedding_dimension': 768,
'pooling_mode_cls_token': True,
...
})
)
```
---
## 📦 Dependencies
- Python: `3.11.11`
- sentence-transformers: `4.0.1`
- transformers: `4.50.3`
- torch: `2.6.0+cu124`
- datasets: `3.5.0`
- tokenizers: `0.21.1`
- flash-attn: ✅ Installed
### Install Required Libraries
```bash
pip install -U sentence-transformers "transformers>=4.48.0" flash-attn datasets
```
---
## 🔐 Optional: Authentication
```python
from huggingface_hub import login
login("your_huggingface_token")
import wandb
wandb.login(key="your_wandb_token")
```
---
## 🧾 Citation
```bibtex
@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084"
}
```
---
## 🔓 License
Apache License 2.0