Shuu12121/CodeCloneDetection-ModernBERT-Owl

Sentence Similarity

sentence-transformers

dataset_size:901028

loss:CosineSimilarityLoss

text-embeddings-inference

Shuu12121 committed on Apr 3

Commit d90aaa2 (verified) · 1 Parent(s): edf421e

Update README.md

Files changed (1)

README.md +54 -4

@@ -67,8 +67,8 @@ This model is a SentenceTransformer fine-tuned from [`Shuu12121/CodeModernBERT-O
 | Metric                    | Score              |
 |---------------------------|--------------------|
 | Pearson Cosine (Train)    | `0.9481`           |
-| Accuracy (Test)           | `0.9900`           |
-| F1 Score (Test)           | `0.9633`           |
+| Accuracy (Test)           | `0.9902`           |
+| F1 Score (Test)           | `0.9637`           |
 ---
@@ -100,13 +100,63 @@ similarity_score = cosine_similarity(embeddings[0].unsqueeze(0), embeddings[1].u
 # Print the result
 print(f"Cosine Similarity: {similarity_score:.4f}")
-if similarity_score >= 0.5:
+if similarity_score >= 0.9:
     print("🟢 These code snippets are considered CLONES.")
 else:
     print("🔴 These code snippets are NOT considered clones.")
 ```
----
+## 🧪 How to Test
+```python
+!pip install -U sentence-transformers datasets
+from sentence_transformers import SentenceTransformer
+from datasets import load_dataset
+import torch
+from sklearn.metrics import accuracy_score, f1_score
+# --- Load the dataset ---
+ds_test = load_dataset("google/code_x_glue_cc_clone_detection_big_clone_bench", split="test")
+model = SentenceTransformer("Shuu12121/CodeCloneDetection-ModernBERT-Owl")
+model.to("cuda")
+test_sentences1 = ds_test["func1"]
+test_sentences2 = ds_test["func2"]
+test_labels = ds_test["label"]
+batch_size = 256  # adjust to fit GPU memory
+print("Encoding sentences1...")
+embeddings1 = model.encode(
+    test_sentences1,
+    convert_to_tensor=True,
+    batch_size=batch_size,
+    show_progress_bar=True
+)
+print("Encoding sentences2...")
+embeddings2 = model.encode(
+    test_sentences2,
+    convert_to_tensor=True,
+    batch_size=batch_size,
+    show_progress_bar=True
+)
+print("Calculating cosine scores...")
+cosine_scores = torch.nn.functional.cosine_similarity(embeddings1, embeddings2)
+# Threshold setting (0.9 is used here)
+threshold = 0.9
+print(f"Using threshold: {threshold}")
+predictions = (cosine_scores > threshold).long().cpu().numpy()
+accuracy = accuracy_score(test_labels, predictions)
+f1 = f1_score(test_labels, predictions)
+print("Test Accuracy:", accuracy)
+print("Test F1 Score:", f1)
+```
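The commit raises the decision threshold from 0.5 to 0.9 but does not say how 0.9 was chosen. One common approach, sketched below as an assumption rather than the author's method, is to sweep candidate thresholds over cosine scores from a held-out validation split and keep the one with the best F1. The helper names `f1_at_threshold` and `best_threshold` and the toy scores are hypothetical and are not taken from the README.

```python
from typing import Iterable, Sequence, Tuple


def f1_at_threshold(scores: Sequence[float], labels: Sequence[int], threshold: float) -> float:
    """F1 of the clone class when predicting clone iff score > threshold."""
    tp = fp = fn = 0
    for score, label in zip(scores, labels):
        predicted = 1 if score > threshold else 0
        if predicted == 1 and label == 1:
            tp += 1
        elif predicted == 1 and label == 0:
            fp += 1
        elif predicted == 0 and label == 1:
            fn += 1
    if tp == 0:
        return 0.0  # no true positives: F1 is defined as 0 here
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def best_threshold(
    scores: Sequence[float], labels: Sequence[int], candidates: Iterable[float]
) -> Tuple[float, float]:
    """Return (threshold, f1) with the highest F1; ties keep the earliest candidate."""
    best_t, best_f1 = 0.0, -1.0
    for t in candidates:
        f1 = f1_at_threshold(scores, labels, t)
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t, best_f1


# Made-up scores and labels for illustration only (not model output).
toy_scores = [0.98, 0.95, 0.92, 0.85, 0.70, 0.40]
toy_labels = [1, 1, 1, 0, 0, 0]
candidates = [i / 20 for i in range(10, 20)]  # 0.50, 0.55, ..., 0.95

threshold, f1 = best_threshold(toy_scores, toy_labels, candidates)
print(f"best threshold={threshold:.2f}, F1={f1:.4f}")
```

With real data, `toy_scores` would be replaced by the `cosine_scores` computed on a validation split, so that the test split used in "How to Test" stays untouched by threshold selection.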
 ## 🛠️ Model Architecture