**🧠 AI-Text-Similarity-Model**

A sentence-similarity model fine-tuned on the STS Benchmark (Semantic Textual Similarity) dataset. It computes the semantic closeness of sentence pairs using cosine similarity and is well suited to tasks such as duplicate detection, semantic search, question-answer matching, and text clustering.

---
✨ **Model Highlights**

- 📌 Based on sentence-transformers/paraphrase-MiniLM-L6-v2
- 🔍 Fine-tuned on the STS Benchmark (English)
- 📈 Outputs a cosine similarity score between 0 (not similar) and 1 (very similar)
- ⚡ Fast, lightweight, and efficient on both CPU and GPU
- 🔁 Trained with contrastive loss on sentence embeddings

---
🧠 **Intended Uses**

- ✅ Duplicate sentence detection
- ✅ Semantic search engines (see the sketch after this list)
- ✅ Question-Answer pair matching
- ✅ Plagiarism detection
- ✅ Conversational agent re-ranking
- ✅ Text clustering and grouping based on meaning

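For the semantic-search use case, the sentence-transformers `util.semantic_search` helper can rank a corpus against a query with this model's embeddings. A minimal sketch; the corpus sentences are illustrative, not from this card:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("AmanSengar/AI-Text-Similarity-Model")

# Illustrative corpus; in practice these embeddings would be precomputed once
corpus = [
    "How do I reset my password?",
    "What is the refund policy?",
    "Where can I download my invoice?",
]
corpus_emb = model.encode(corpus, convert_to_tensor=True)

query_emb = model.encode("I forgot my login credentials", convert_to_tensor=True)
hits = util.semantic_search(query_emb, corpus_emb, top_k=2)[0]
for hit in hits:
    print(f"{hit['score']:.4f}  {corpus[hit['corpus_id']]}")
```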
---

🚫 **Limitations**

- ❌ Trained on English sentences only
- ❌ Not suitable for zero-shot multilingual similarity
- ❌ Accuracy may degrade on domain-specific or technical content
- ❌ Slight performance dip for long sequences (>128 tokens)

---
🏋️‍♂️ **Training Details**

| Field          | Value                                    |
| -------------- | ---------------------------------------- |
| **Base Model** | paraphrase-MiniLM-L6-v2                  |
| **Dataset**    | stsb_multi_mt (English)                  |
| **Framework**  | PyTorch with 🤗 Transformers             |
| **Epochs**     | 3                                        |
| **Batch Size** | 16                                       |
| **Max Length** | 128 tokens                               |
| **Optimizer**  | AdamW                                    |
| **Loss**       | Contrastive loss on sentence embeddings  |
| **Device**     | CUDA-enabled GPU                         |

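The exact training script is not included in this card; the sketch below shows how a run with these hyperparameters might look using the classic sentence-transformers trainer, with `CosineSimilarityLoss` assumed as the contrastive-style objective:

```python
from datasets import load_dataset
from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader

# Hypothetical reconstruction of the setup in the table above
model = SentenceTransformer("sentence-transformers/paraphrase-MiniLM-L6-v2")
model.max_seq_length = 128  # Max Length

# stsb_multi_mt scores sentence pairs 0-5; normalize to 0-1 cosine targets
train = load_dataset("stsb_multi_mt", "en", split="train")
examples = [
    InputExample(texts=[row["sentence1"], row["sentence2"]],
                 label=row["similarity_score"] / 5.0)
    for row in train
]

loader = DataLoader(examples, shuffle=True, batch_size=16)
loss = losses.CosineSimilarityLoss(model)  # assumed contrastive-style objective

model.fit(train_objectives=[(loader, loss)], epochs=3)  # AdamW is the default optimizer
```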
---

📊 **Evaluation Metrics**

| Metric    | Score |
| --------- | ----- |
| Accuracy  | 0.82  |
| F1-Score  | 0.87  |
| Precision | 0.84  |
| Recall    | 0.85  |

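The card does not state how these classification-style metrics were derived from a regression-style STS task. One plausible reading, sketched below, is that gold scores (normalized to 0-1) and predicted cosine scores were binarized at a threshold (0.5 here, an assumption) before computing the metrics with scikit-learn:

```python
from sentence_transformers import SentenceTransformer, util
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

model = SentenceTransformer("AmanSengar/AI-Text-Similarity-Model")

def evaluate(pairs, gold_scores, threshold=0.5):
    # pairs: list of (sentence1, sentence2); gold_scores: floats in [0, 1]
    emb1 = model.encode([a for a, _ in pairs], convert_to_tensor=True)
    emb2 = model.encode([b for _, b in pairs], convert_to_tensor=True)
    cos = util.cos_sim(emb1, emb2).diagonal().tolist()
    preds = [int(s >= threshold) for s in cos]
    gold = [int(s >= threshold) for s in gold_scores]
    prec, rec, f1, _ = precision_recall_fscore_support(gold, preds, average="binary")
    return accuracy_score(gold, preds), prec, rec, f1
```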
---

🚀 **Usage**

```python
from sentence_transformers import SentenceTransformer, util

# Load the fine-tuned similarity model
model_name = "AmanSengar/AI-Text-Similarity-Model"
model = SentenceTransformer(model_name)

# Inference: encode both sentences and compare embeddings via cosine similarity
def get_similarity(text1, text2):
    emb1 = model.encode(text1, convert_to_tensor=True)
    emb2 = model.encode(text2, convert_to_tensor=True)
    score = util.cos_sim(emb1, emb2).item()
    return round(score, 4)

# Test example
print(get_similarity("A man is eating food.", "A person is having a meal."))
```
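Continuing from the snippet above, many candidates can be scored against one query in a single batch, which is how the re-ranking and clustering use cases would typically call the model:

```python
# Score one query against several candidates at once
candidates = [
    "A person is having a meal.",
    "A dog runs through the park.",
    "Someone is eating dinner.",
]
query_emb = model.encode("A man is eating food.", convert_to_tensor=True)
cand_embs = model.encode(candidates, convert_to_tensor=True)

scores = util.cos_sim(query_emb, cand_embs)[0]  # 1 x len(candidates) matrix
for sent, score in zip(candidates, scores):
    print(f"{score:.4f}  {sent}")
```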
---

🧩 **Quantization**

Post-training static quantization was applied using PyTorch to reduce model size and accelerate inference on edge devices.

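The exact static-quantization recipe is not included in this card. As a stand-in, the sketch below uses PyTorch dynamic quantization of the linear layers, which gives a similar size/latency benefit without the calibration pass that static quantization requires:

```python
import torch
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("AmanSengar/AI-Text-Similarity-Model")

# Dynamic (not static) int8 quantization of nn.Linear layers; the card's
# static pipeline would additionally calibrate activations on sample data.
model[0].auto_model = torch.quantization.quantize_dynamic(
    model[0].auto_model, {torch.nn.Linear}, dtype=torch.qint8
)

print(model.encode("Quantized inference runs on CPU.")[:5])
```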
---

🗂 **Repository Structure**

```
.
├── model/               # Quantized model files
├── tokenizer_config/    # Tokenizer and vocab files
├── model.safetensors    # Fine-tuned model weights (safetensors format)
├── README.md            # Model card
```

---

🤝 **Contributing**

Open to improvements and feedback! Feel free to submit a pull request or open an issue if you find any bugs or want to enhance the model.