|
--- |
|
tags: |
|
- sentence-transformers |
|
- sentence-similarity |
|
- dataset_size:901028 |
|
- loss:CosineSimilarityLoss |
|
base_model: Shuu12121/CodeModernBERT-Owl |
|
pipeline_tag: sentence-similarity |
|
library_name: sentence-transformers |
|
metrics: |
|
- pearson_cosine |
|
- accuracy |
|
- f1 |
|
model-index: |
|
- name: SentenceTransformer based on Shuu12121/CodeModernBERT-Owl |
|
results: |
|
- task: |
|
type: semantic-similarity |
|
name: Semantic Similarity |
|
dataset: |
|
name: val |
|
type: val |
|
metrics: |
|
- type: pearson_cosine |
|
value: 0.9481467499740959 |
|
name: Training Pearson Cosine |
|
- type: accuracy |
|
value: 0.9900051996071408 |
|
name: Test Accuracy |
|
- type: f1 |
|
value: 0.963323498754483 |
|
name: Test F1 Score |
|
license: apache-2.0 |
|
datasets: |
|
- google/code_x_glue_cc_clone_detection_big_clone_bench |
|
--- |
|
|
|
# SentenceTransformer based on `Shuu12121/CodeModernBERT-Owl🦉` |
|
|
|
This model is a SentenceTransformer fine-tuned from [`Shuu12121/CodeModernBERT-Owl🦉`](https://huggingface.co/Shuu12121/CodeModernBERT-Owl) on the [BigCloneBench](https://huggingface.co/datasets/google/code_x_glue_cc_clone_detection_big_clone_bench) dataset for **code clone detection**. It maps code snippets into a 768-dimensional dense vector space for semantic similarity tasks. |
|
|
|
|
|
|
|
## 🎯 Distinctive Performance and Stability |
|
|
|
This model achieves **very high accuracy and F1 scores** in code clone detection. |
|
One particularly noteworthy characteristic is that **changing the similarity threshold has minimal impact on classification performance**. |
|
This indicates that the model has learned to **clearly separate clones from non-clones**, resulting in a **stable and reliable similarity score distribution**. |
|
|
|
| Threshold | Accuracy | F1 Score | |
|
|-------------------|-------------------|--------------------| |
|
| 0.5 | 0.9900 | 0.9633 | |
|
| 0.85 | 0.9903 | 0.9641 | |
|
| 0.90 | 0.9902 | 0.9637 | |
|
| 0.95 | 0.9887 | 0.9579 | |
|
| 0.98 | 0.9879 | 0.9540 | |
|
|
|
- **High Stability**: Between thresholds of 0.85 and 0.98, accuracy and F1 scores remain nearly constant. |
|
_(This suggests that code pairs considered clones generally score between 0.9 and 1.0 in cosine similarity.)_ |
|
|
|
- **Reliable in Real-World Applications**: Even if the similarity threshold is slightly adjusted for different tasks or environments, the model maintains consistent performance without significant degradation, as the threshold-sweep sketch below illustrates.
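
This stability is easy to verify with a threshold sweep. A minimal sketch, assuming `cosine_scores` and `test_labels` have already been built as in the "How to Test" section below:

```python
# Sweep decision thresholds over precomputed cosine scores to check how
# flat accuracy and F1 remain. Assumes `cosine_scores` (pairwise cosine
# similarities) and `test_labels` exist as in the "How to Test" section.
from sklearn.metrics import accuracy_score, f1_score

for threshold in (0.5, 0.85, 0.90, 0.95, 0.98):
    predictions = (cosine_scores > threshold).long().cpu().numpy()
    acc = accuracy_score(test_labels, predictions)
    f1 = f1_score(test_labels, predictions)
    print(f"threshold={threshold:.2f}  accuracy={acc:.4f}  f1={f1:.4f}")
```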
|
|
|
|
|
|
|
## 📌 Model Overview |
|
|
|
- **Architecture**: Sentence-BERT (SBERT) |
|
- **Base Model**: `Shuu12121/CodeModernBERT-Owl` |
|
- **Output Dimension**: 768 |
|
- **Max Sequence Length**: 2048 tokens |
|
- **Pooling Method**: CLS token pooling |
|
- **Similarity Function**: Cosine Similarity |
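
These properties can be read directly off the loaded model. A quick sanity check using the standard `sentence-transformers` accessors:

```python
from sentence_transformers import SentenceTransformer

# Load the published checkpoint and inspect its key properties
model = SentenceTransformer("Shuu12121/CodeCloneDetection-ModernBERT-Owl")

print(model.get_sentence_embedding_dimension())  # 768
print(model.max_seq_length)                      # 2048
print(model)                                     # Transformer + CLS-token Pooling stack
```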
|
|
|
--- |
|
|
|
## 🏋️‍♂️ Training Configuration
|
|
|
- **Loss Function**: `CosineSimilarityLoss` |
|
- **Epochs**: 1 |
|
- **Batch Size**: 32 |
|
- **Warmup Steps**: 3% of training steps |
|
- **Evaluator**: `EmbeddingSimilarityEvaluator` (on validation) |
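
The exact training script is not reproduced here, but a minimal sketch of an equivalent setup with the classic `model.fit` API, under the configuration above, might look like the following. The pair construction is illustrative and may differ in detail from the original run:

```python
from torch.utils.data import DataLoader
from datasets import load_dataset
from sentence_transformers import SentenceTransformer, InputExample, losses
from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator

# Illustrative sketch: build (func1, func2, label) pairs from BigCloneBench
ds_train = load_dataset("google/code_x_glue_cc_clone_detection_big_clone_bench", split="train")
train_examples = [
    InputExample(texts=[row["func1"], row["func2"]], label=float(row["label"]))
    for row in ds_train
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=32)

# Validation evaluator: cosine similarity against the binary clone label
ds_val = load_dataset("google/code_x_glue_cc_clone_detection_big_clone_bench", split="validation")
evaluator = EmbeddingSimilarityEvaluator(
    ds_val["func1"], ds_val["func2"], [float(label) for label in ds_val["label"]]
)

model = SentenceTransformer("Shuu12121/CodeModernBERT-Owl")
train_loss = losses.CosineSimilarityLoss(model)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    evaluator=evaluator,
    epochs=1,
    warmup_steps=int(0.03 * len(train_dataloader)),  # 3% of training steps
)
```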
|
|
|
--- |
|
|
|
## 📊 Evaluation Metrics |
|
|
|
| Metric                         | Score    |
|--------------------------------|----------|
| Pearson Cosine (Train)         | `0.9481` |
| Accuracy (Test, threshold 0.9) | `0.9902` |
| F1 Score (Test, threshold 0.9) | `0.9637` |
|
|
|
--- |
|
|
|
## 📚 Dataset |
|
|
|
- [Google BigCloneBench](https://huggingface.co/datasets/google/code_x_glue_cc_clone_detection_big_clone_bench) |
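
Each example pairs two Java functions with a binary clone label. A quick peek at the schema, using the same field names as the "How to Test" section below:

```python
from datasets import load_dataset

# Inspect the test split: each row holds a pair of functions and a clone label
ds = load_dataset("google/code_x_glue_cc_clone_detection_big_clone_bench", split="test")

print(ds.column_names)  # includes 'func1', 'func2', and 'label'
print(ds[0]["label"])   # clone / non-clone label for the first pair
```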
|
|
|
--- |
|
|
|
## 🧪 How to Use |
|
|
|
```python |
|
from sentence_transformers import SentenceTransformer |
|
from torch.nn.functional import cosine_similarity
|
|
|
# Load the fine-tuned model |
|
model = SentenceTransformer("Shuu12121/CodeCloneDetection-ModernBERT-Owl") |
|
|
|
# Two code snippets to compare |
|
code1 = "def add(a, b): return a + b" |
|
code2 = "def sum(x, y): return x + y" |
|
|
|
# Encode the code snippets |
|
embeddings = model.encode([code1, code2], convert_to_tensor=True) |
|
|
|
# Compute cosine similarity |
|
similarity_score = cosine_similarity(embeddings[0].unsqueeze(0), embeddings[1].unsqueeze(0)).item() |
|
|
|
# Print the result |
|
print(f"Cosine Similarity: {similarity_score:.4f}") |
|
if similarity_score >= 0.9: |
|
print("🟢 These code snippets are considered CLONES.") |
|
else: |
|
print("🔴 These code snippets are NOT considered clones.") |
|
``` |
|
## 🧪 How to Test |
|
|
|
```python |
|
# In a notebook (e.g., Google Colab), install the dependencies first:
# !pip install -U sentence-transformers datasets
|
|
|
from sentence_transformers import SentenceTransformer |
|
from datasets import load_dataset |
|
import torch |
|
from sklearn.metrics import accuracy_score, f1_score |
|
|
|
# --- Load the dataset ---
|
ds_test = load_dataset("google/code_x_glue_cc_clone_detection_big_clone_bench", split="test") |
|
|
|
model = SentenceTransformer("Shuu12121/CodeCloneDetection-ModernBERT-Owl") |
|
model.to("cuda") |
|
|
|
|
|
test_sentences1 = ds_test["func1"] |
|
test_sentences2 = ds_test["func2"] |
|
test_labels = ds_test["label"] |
|
|
|
batch_size = 256  # Adjust to fit your GPU memory
|
|
|
print("Encoding sentences1...") |
|
|
|
embeddings1 = model.encode( |
|
test_sentences1, |
|
convert_to_tensor=True, |
|
batch_size=batch_size, |
|
show_progress_bar=True |
|
) |
|
|
|
print("Encoding sentences2...") |
|
embeddings2 = model.encode( |
|
test_sentences2, |
|
convert_to_tensor=True, |
|
batch_size=batch_size, |
|
show_progress_bar=True |
|
) |
|
|
|
print("Calculating cosine scores...") |
|
cosine_scores = torch.nn.functional.cosine_similarity(embeddings1, embeddings2) |
|
|
|
# Set the decision threshold (0.9 is used here)
|
threshold = 0.9 |
|
print(f"Using threshold: {threshold}") |
|
predictions = (cosine_scores > threshold).long().cpu().numpy() |
|
|
|
accuracy = accuracy_score(test_labels, predictions) |
|
f1 = f1_score(test_labels, predictions) |
|
print("Test Accuracy:", accuracy) |
|
print("Test F1 Score:", f1) |
|
|
|
``` |
|
|
|
## 🛠️ Model Architecture |
|
|
|
```python |
|
SentenceTransformer( |
|
(0): Transformer({'max_seq_length': 2048}) with model 'ModernBertModel' |
|
(1): Pooling({ |
|
'word_embedding_dimension': 768, |
|
'pooling_mode_cls_token': True, |
|
... |
|
}) |
|
) |
|
``` |
|
|
|
--- |
|
|
|
## 📦 Dependencies |
|
|
|
- Python: `3.11.11` |
|
- sentence-transformers: `4.0.1` |
|
- transformers: `4.50.3` |
|
- torch: `2.6.0+cu124` |
|
- datasets: `3.5.0` |
|
- tokenizers: `0.21.1` |
|
- flash-attn: ✅ Installed |
|
|
|
### Install Required Libraries |
|
|
|
```bash |
|
pip install -U sentence-transformers "transformers>=4.48.0" flash-attn datasets
|
``` |
|
|
|
--- |
|
|
|
## 🔐 Optional: Authentication |
|
|
|
```python |
|
from huggingface_hub import login

# Log in to the Hugging Face Hub (replace with your own access token)
login("your_huggingface_token")

import wandb

# Optional: log training runs to Weights & Biases (replace with your own key)
wandb.login(key="your_wandb_token")
|
``` |
|
|
|
--- |
|
|
|
## 🧾 Citation |
|
|
|
```bibtex |
|
@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing (EMNLP)",
    year = "2019",
    url = "https://arxiv.org/abs/1908.10084",
}
|
``` |
|
|
|
--- |
|
|
|
## 🔓 License |
|
|
|
Apache License 2.0 |
|
|
|
|