Create README.md
README.md
**🧠 AI-Text-Similarity-Model**

A sentence-embedding model fine-tuned on the STS Benchmark (Semantic Textual Similarity) dataset. It encodes each sentence into an embedding and scores the semantic closeness of a sentence pair with cosine similarity. It is suited to tasks such as duplicate detection, semantic search, question-answer matching, and text clustering.

---

✨ **Model Highlights**

- Based on `sentence-transformers/paraphrase-MiniLM-L6-v2`
- Fine-tuned on the STS Benchmark (English)
- Outputs a cosine similarity score: values near 1 mean very similar, values near 0 mean not similar (cosine similarity can in principle be negative when embeddings point in opposite directions)
- ⚡ Fast, lightweight, and efficient on both CPU and GPU
- Trained with contrastive loss using sentence embeddings
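
To make the scoring step concrete, here is a minimal sketch of cosine similarity in plain Python. The three-dimensional vectors are made-up stand-ins for real sentence embeddings, and `cosine_similarity` is an illustrative helper, not part of this model's API.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

# Toy "embeddings" (assumed values, for illustration only)
same_direction = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
orthogonal = cosine_similarity([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])
opposite = cosine_similarity([1.0, 2.0, 3.0], [-1.0, -2.0, -3.0])
print(same_direction, orthogonal, opposite)
```

Parallel vectors score close to 1, orthogonal vectors score 0, and opposed vectors score close to -1, which is why the score is best read as "higher means more similar" rather than as a strict 0-to-1 probability.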

---

🧠 Intended Uses

- ✅ Duplicate sentence detection
- ✅ Semantic search engines
- ✅ Question-answer pair matching
- ✅ Plagiarism detection
- ✅ Conversational agent re-ranking
- ✅ Text clustering and grouping based on meaning
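
Several of these uses reduce to the same pattern: embed a query and a set of candidates, then rank or threshold by cosine similarity. The sketch below assumes the embeddings have already been computed. The vectors, the `rank_candidates` helper, and the 0.8 duplicate threshold are illustrative assumptions, not values taken from this model.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_candidates(query_vec, candidates):
    """Return (text, score) pairs sorted from most to least similar."""
    scored = [(text, cosine_similarity(query_vec, vec))
              for text, vec in candidates.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Hypothetical precomputed embeddings (real ones have hundreds of dimensions)
query = [0.9, 0.1, 0.0]
corpus = {
    "A person is having a meal.": [0.8, 0.2, 0.1],
    "The stock market fell today.": [0.0, 0.1, 0.9],
    "Someone eats dinner.": [0.7, 0.3, 0.0],
}

# Semantic search: order candidates by similarity to the query
ranked = rank_candidates(query, corpus)

# Duplicate detection: keep candidates above an assumed threshold
DUPLICATE_THRESHOLD = 0.8
duplicates = [text for text, score in ranked if score >= DUPLICATE_THRESHOLD]
```

In practice the threshold would need tuning on labeled pairs from the target domain.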

---

🚫 Limitations

- ❌ Trained on English sentences only
- ❌ Not suitable for zero-shot multilingual similarity
- ❌ Accuracy may degrade on domain-specific or technical content
- ❌ Slight performance dip on long sequences (>128 tokens, the maximum length used in training)
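
One common workaround for the long-sequence limitation is to split text into windows that fit the 128-token limit and score each window separately. The sketch below uses whitespace-separated words as a rough stand-in for model tokens. That is an assumption: the real tokenizer produces subword tokens, so actual counts will differ. `chunk_words` and its `overlap` parameter are hypothetical, not part of this repository.

```python
def chunk_words(text, max_tokens=128, overlap=16):
    """Split text into overlapping windows of at most max_tokens words."""
    if not 0 <= overlap < max_tokens:
        raise ValueError("overlap must be in [0, max_tokens)")
    words = text.split()
    step = max_tokens - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_tokens]))
        # Stop once a window has reached the end of the text
        if start + max_tokens >= len(words):
            break
    return chunks

# A synthetic 300-word input, longer than one window
long_text = " ".join(f"w{i}" for i in range(300))
chunks = chunk_words(long_text)
print(len(chunks), [len(c.split()) for c in chunks])
```

Each chunk can then be embedded on its own, with the per-chunk scores combined (for example by taking the maximum) depending on the task.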

---

🏋️‍♂️ Training Details

| Field          | Value                          |
| -------------- | ------------------------------ |
| **Base Model** | paraphrase-MiniLM-L6-v2        |
| **Dataset**    | stsb_multi_mt, English         |
| **Framework**  | PyTorch with 🤗 Transformers   |
| **Epochs**     | 3                              |
| **Batch Size** | 16                             |
| **Max Length** | 128 tokens                     |
| **Optimizer**  | AdamW                          |
| **Loss**       | CrossEntropyLoss (token-level) |
| **Device**     | Trained on CUDA-enabled GPU    |

---

Evaluation Metrics

| Metric    | Score |
| --------- | ----- |
| Accuracy  | 0.82  |
| F1-Score  | 0.87  |
| Precision | 0.84  |
| Recall    | 0.85  |
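
The card does not say how these classification-style metrics were derived from similarity scores. One plausible setup, shown purely as an assumption, is to threshold the cosine score into a binary similar / not-similar label and compare it against gold labels. The scores, labels, and 0.5 threshold below are made up for illustration, and `binary_metrics` is a hypothetical helper.

```python
def binary_metrics(scores, labels, threshold=0.5):
    """Threshold similarity scores and compare against 0/1 gold labels."""
    preds = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 1)
    fp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 0)
    fn = sum(1 for p, y in zip(preds, labels) if p == 0 and y == 1)
    tn = sum(1 for p, y in zip(preds, labels) if p == 0 and y == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "accuracy": (tp + tn) / len(labels),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

# Made-up similarity scores and gold labels (1 = similar pair)
scores = [0.91, 0.15, 0.62, 0.40, 0.78, 0.55]
labels = [1, 0, 1, 1, 1, 0]
metrics = binary_metrics(scores, labels)
print(metrics)
```

Because F1 is the harmonic mean of precision and recall, it always lies between the two.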

---

Usage

```python
from sentence_transformers import SentenceTransformer, util

model_name = "AmanSengar/AI-Text-Similarity-Model"
# Load the fine-tuned model as a sentence encoder
model = SentenceTransformer(model_name)
model.eval()


# Inference
def get_similarity(text1, text2):
    # Encode each sentence into an embedding tensor
    emb1 = model.encode(text1, convert_to_tensor=True)
    emb2 = model.encode(text2, convert_to_tensor=True)
    # Cosine similarity between the two embeddings
    score = util.cos_sim(emb1, emb2).item()
    return round(score, 4)

# Test Example
print(get_similarity("A man is eating food.", "A person is having a meal."))
```

---

🧩 Quantization

- Post-training static quantization was applied using PyTorch to reduce model size and accelerate inference on edge devices.
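
To illustrate what quantization does to the weights, here is a plain-Python sketch of affine 8-bit quantization: floats are mapped to integers in [0, 255] through a scale and a zero point, then mapped back with a small rounding error. It shows only the arithmetic idea and is not PyTorch's implementation. The helper names and sample weights are hypothetical.

```python
def quantize(values, num_bits=8):
    """Map floats to unsigned integers using a scale and zero point."""
    qmin, qmax = 0, 2 ** num_bits - 1
    # The representable range must include 0.0 so zero maps exactly
    lo = min(min(values), 0.0)
    hi = max(max(values), 0.0)
    scale = (hi - lo) / (qmax - qmin) or 1.0  # avoid a zero scale
    zero_point = round(qmin - lo / scale)
    quantized = [max(qmin, min(qmax, round(v / scale) + zero_point))
                 for v in values]
    return quantized, scale, zero_point

def dequantize(quantized, scale, zero_point):
    """Map quantized integers back to approximate floats."""
    return [(q - zero_point) * scale for q in quantized]

# Made-up weights for illustration
weights = [-0.52, 0.0, 0.13, 0.47, 1.20]
q, scale, zp = quantize(weights)
restored = dequantize(q, scale, zp)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(q, max_error)
```

Each weight now fits in one byte instead of four, at the cost of a reconstruction error on the order of the scale.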

---

Repository Structure

```
.
├── model/               # Quantized model files
├── tokenizer_config/    # Tokenizer and vocab files
├── model.safetensors    # Fine-tuned model in safetensors format
└── README.md            # Model card
```

---

🤝 Contributing

Open to improvements and feedback! Feel free to submit a pull request or open an issue if you find any bugs or want to enhance the model.