radoslavralev committed on
Commit b556388 · verified · 1 Parent(s): e17d3f5

Add new SentenceTransformer model

Files changed (1)
  1. README.md +90 -109
README.md CHANGED
@@ -12,42 +12,54 @@ tags:
  - retrieval
  - reranking
  - generated_from_trainer
- - dataset_size:478600
  - loss:MultipleNegativesSymmetricRankingLoss
  base_model: Alibaba-NLP/gte-modernbert-base
  widget:
- - source_sentence: The brown dog is sniffing the back of a small black dog
  sentences:
- - Pickens died in Edgefield and was buried on the Willow Brook Cemetery in Edgefield
- , South Carolina .
- - It is notable as the oldest Chinatown in Australia , the oldest continuous Chinese
- settlement in Australia , and the longest continuously running Chinatown outside
- of Asia .
- - There is no large brown dog and small grey dog standing on a rocky surface
- - source_sentence: Is it harmful from security perspectives to use public Wi-Fi?
  sentences:
- - What is the best way to drive traffic to a website?
- - What startups have used GitHub?
- - Is there something wrong with using public Wi-Fi?
- - source_sentence: How can we make education better?
  sentences:
- - What are some things that would make education better today?
- - Mistery works full-time as a graffiti artist and is also Emcee / Rapper in the
- Brethren group .
- - Jammu Airport operates flights to many cities in India such as Delhi , Leh and
- Srinagar .
- - source_sentence: So are you.
  sentences:
- - 'Brown said afterwards that he was surprised they had not scored five , and Astall
- wrote in his newspaper column :'
- - Just like yourself.
- - How do I actually lose weight?
- - source_sentence: A group of boys are playing with a ball in front of a large door
- made of wood
  sentences:
- - The children are playing in front of a large door
- - What is the blind spot?
- - What are some good techniques for controlling your anger?
  datasets:
  - redis/langcache-sentencepairs-v1
  pipeline_tag: sentence-similarity
@@ -64,37 +76,6 @@ metrics:
  model-index:
  - name: Redis fine-tuned BiEncoder model for semantic caching on LangCache
  results:
- - task:
- type: binary-classification
- name: Binary Classification
- dataset:
- name: val
- type: val
- metrics:
- - type: cosine_accuracy
- value: 0.9996860282574568
- name: Cosine Accuracy
- - type: cosine_accuracy_threshold
- value: 0.4801735281944275
- name: Cosine Accuracy Threshold
- - type: cosine_f1
- value: 0.9998429894802952
- name: Cosine F1
- - type: cosine_f1_threshold
- value: 0.4801735281944275
- name: Cosine F1 Threshold
- - type: cosine_precision
- value: 1.0
- name: Cosine Precision
- - type: cosine_recall
- value: 0.9996860282574568
- name: Cosine Recall
- - type: cosine_ap
- value: 0.9999999999999999
- name: Cosine Ap
- - type: cosine_mcc
- value: 0.0
- name: Cosine Mcc
  - task:
  type: binary-classification
  name: Binary Classification
@@ -103,28 +84,28 @@ model-index:
  type: test
  metrics:
  - type: cosine_accuracy
- value: 0.9999627560521416
  name: Cosine Accuracy
  - type: cosine_accuracy_threshold
- value: 0.42059871554374695
  name: Cosine Accuracy Threshold
  - type: cosine_f1
- value: 0.9999813776792864
  name: Cosine F1
  - type: cosine_f1_threshold
- value: 0.42059871554374695
  name: Cosine F1 Threshold
  - type: cosine_precision
- value: 1.0
  name: Cosine Precision
  - type: cosine_recall
- value: 0.9999627560521416
  name: Cosine Recall
  - type: cosine_ap
- value: 1.0
  name: Cosine Ap
  - type: cosine_mcc
- value: 0.0
  name: Cosine Mcc
  ---

@@ -178,9 +159,9 @@ from sentence_transformers import SentenceTransformer
  model = SentenceTransformer("redis/langcache-embed-v3")
  # Run inference
  sentences = [
- 'A group of boys are playing with a ball in front of a large door made of wood',
- 'The children are playing in front of a large door',
- 'What are some good techniques for controlling your anger?',
  ]
  embeddings = model.encode(sentences)
  print(embeddings.shape)
@@ -189,9 +170,9 @@ print(embeddings.shape)
  # Get the similarity scores for the embeddings
  similarities = model.similarity(embeddings, embeddings)
  print(similarities)
- # tensor([[1.0000, 0.8672, 0.4121],
- # [0.8672, 1.0000, 0.4219],
- # [0.4121, 0.4219, 1.0000]], dtype=torch.bfloat16)
  ```

  <!--
@@ -224,19 +205,19 @@ You can finetune this model on your own dataset.

  #### Binary Classification

- * Datasets: `val` and `test`
  * Evaluated with [<code>BinaryClassificationEvaluator</code>](https://sbert.net/docs/package_reference/sentence_transformer/evaluation.html#sentence_transformers.evaluation.BinaryClassificationEvaluator)

- | Metric | val | test |
- |:--------------------------|:--------|:--------|
- | cosine_accuracy | 0.9997 | 1.0 |
- | cosine_accuracy_threshold | 0.4802 | 0.4206 |
- | cosine_f1 | 0.9998 | 1.0 |
- | cosine_f1_threshold | 0.4802 | 0.4206 |
- | cosine_precision | 1.0 | 1.0 |
- | cosine_recall | 0.9997 | 1.0 |
- | **cosine_ap** | **1.0** | **1.0** |
- | cosine_mcc | 0.0 | 0.0 |

  <!--
  ## Bias, Risks and Limitations
@@ -257,19 +238,19 @@ You can finetune this model on your own dataset.
  #### LangCache Sentence Pairs (all)

  * Dataset: [LangCache Sentence Pairs (all)](https://huggingface.co/datasets/redis/langcache-sentencepairs-v1)
- * Size: 26,850 training samples
- * Columns: <code>sentence1</code> and <code>sentence2</code>
  * Approximate statistics based on the first 1000 samples:
- | | sentence1 | sentence2 |
- |:--------|:----------------------------------------------------------------------------------|:----------------------------------------------------------------------------------|
- | type | string | string |
- | details | <ul><li>min: 4 tokens</li><li>mean: 16.76 tokens</li><li>max: 44 tokens</li></ul> | <ul><li>min: 4 tokens</li><li>mean: 16.58 tokens</li><li>max: 44 tokens</li></ul> |
  * Samples:
- | sentence1 | sentence2 |
- |:---------------------------------------------------------------------------------------|:--------------------------------------------------------------------------------------------|
- | <code>A chef is preparing a meal</code> | <code>Some food is being prepared by a chef</code> |
- | <code>The presentation is being watched by a classroom of students</code> | <code>A classroom is full of students</code> |
- | <code>Garden River , located north of Garden River Airport , Alberta , Canada .</code> | <code>Garden River , , is located north of Garden River Airport , Alberta , Canada .</code> |
  * Loss: [<code>MultipleNegativesSymmetricRankingLoss</code>](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#multiplenegativessymmetricrankingloss) with these parameters:
  ```json
  {
@@ -284,19 +265,19 @@ You can finetune this model on your own dataset.
  #### LangCache Sentence Pairs (all)

  * Dataset: [LangCache Sentence Pairs (all)](https://huggingface.co/datasets/redis/langcache-sentencepairs-v1)
- * Size: 26,850 evaluation samples
- * Columns: <code>sentence1</code> and <code>sentence2</code>
  * Approximate statistics based on the first 1000 samples:
- | | sentence1 | sentence2 |
- |:--------|:----------------------------------------------------------------------------------|:----------------------------------------------------------------------------------|
- | type | string | string |
- | details | <ul><li>min: 4 tokens</li><li>mean: 16.76 tokens</li><li>max: 44 tokens</li></ul> | <ul><li>min: 4 tokens</li><li>mean: 16.58 tokens</li><li>max: 44 tokens</li></ul> |
  * Samples:
- | sentence1 | sentence2 |
- |:---------------------------------------------------------------------------------------|:--------------------------------------------------------------------------------------------|
- | <code>A chef is preparing a meal</code> | <code>Some food is being prepared by a chef</code> |
- | <code>The presentation is being watched by a classroom of students</code> | <code>A classroom is full of students</code> |
- | <code>Garden River , located north of Garden River Airport , Alberta , Canada .</code> | <code>Garden River , , is located north of Garden River Airport , Alberta , Canada .</code> |
  * Loss: [<code>MultipleNegativesSymmetricRankingLoss</code>](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#multiplenegativessymmetricrankingloss) with these parameters:
  ```json
  {
@@ -307,9 +288,9 @@ You can finetune this model on your own dataset.
  ```

  ### Training Logs
- | Epoch | Step | val_cosine_ap | test_cosine_ap |
- |:-----:|:----:|:-------------:|:--------------:|
- | -1 | -1 | 1.0000 | 1.0 |


  ### Framework Versions
 
  - retrieval
  - reranking
  - generated_from_trainer
+ - dataset_size:483820
  - loss:MultipleNegativesSymmetricRankingLoss
  base_model: Alibaba-NLP/gte-modernbert-base
  widget:
+ - source_sentence: In 2015 Adolf Hitler appeared in the kickstarter short movie ``
+ Kung Fury `` as Taccone ( A.K.A .
  sentences:
+ - In 2015 , Adolf Hitler appeared in the Kickstarter - short film `` Kung Fury ``
+ as Taccone ( A.K.A .
+ - In 1795 , the only white residents were Dr. John Laidley and two brothers with
+ the surname Ainslie .
+ - The 125th University Match was played in March 2014 at the Rye Golf Club , Oxford
+ , East Sussex won the game 8.5 - 6.5 .
+ - source_sentence: From 1973 to 1974 , Aubrey toured with the Cambridge Theatre Company
+ as Diggory in `` She Stoops to Conquer `` and again as Aguecheek .
  sentences:
+ - Oxide can be reduced to metallic samarium at higher temperatures by heating with
+ a reducing agent such as hydrogen or carbon monoxide .
+ - From 1973 to 1974 Aguecheek toured with the Cambridge Theatre Company as Diggory
+ in `` You Stoops to Conquer `` and again as Aubrey .
+ - The medals were presented by Barry Maister , IOC member , New Zealand and Sarah
+ Webb Gosling , Vice President of World Sailing .
+ - source_sentence: There is no official wall on the border , although there are sections
+ of fence near populated areas and continuous border crossings .
  sentences:
+ - The 2014 -- 15 Boston Bruins season was the 91st season for the National Hockey
+ League franchise that was established on November 1 , 1924 .
+ - He was trained by the Inghams and owned by John Hawkes .
+ - There is no continuous wall on the border , although there are fence sections
+ near populated areas and official border crossings .
+ - source_sentence: Capital . `` The French established similar hill stations in Indochina
+ , such as Dalat built in 1921 .
  sentences:
+ - Lubuk China is a small town in Alor Gajah District , Melaka , Malaysia . It is
+ situated near the border with Negeri Sembilan .
+ - The French established similar hill stations in Indochina , such as Dalat , built
+ in 1921 .
+ - John Potts ( or Pott ) was a doctor and colonial governor of Virginia in the Jamestown
+ settlement at Virginia Colony in the early 17th century .
+ - source_sentence: The band pursued `` signals `` in January 2012 in three weeks ,
+ and drums were recorded in a day and a half .
  sentences:
+ - It was repaired at the beginning of the 20th century and is listed as closed in
+ our records .
+ - The band tracked `` Signals `` in three weeks in January 2012 . Drums were recorded
+ in a day and a half .
+ - Contributors include actor Anton LaVey , Satanist Christopher Lee , serial killer
+ expert Clive Barker , author Karen Greenlee , and necrophile Robert Ressler .
  datasets:
  - redis/langcache-sentencepairs-v1
  pipeline_tag: sentence-similarity
 
  model-index:
  - name: Redis fine-tuned BiEncoder model for semantic caching on LangCache
  results:
  - task:
  type: binary-classification
  name: Binary Classification
 
  type: test
  metrics:
  - type: cosine_accuracy
+ value: 0.7037777526966672
  name: Cosine Accuracy
  - type: cosine_accuracy_threshold
+ value: 0.8524033427238464
  name: Cosine Accuracy Threshold
  - type: cosine_f1
+ value: 0.7122170715871171
  name: Cosine F1
  - type: cosine_f1_threshold
+ value: 0.8118724822998047
  name: Cosine F1 Threshold
  - type: cosine_precision
+ value: 0.5989283084033827
  name: Cosine Precision
  - type: cosine_recall
+ value: 0.8783612662942272
  name: Cosine Recall
  - type: cosine_ap
+ value: 0.6476665223951498
  name: Cosine Ap
  - type: cosine_mcc
+ value: 0.44182914870985407
  name: Cosine Mcc
  ---

 
  model = SentenceTransformer("redis/langcache-embed-v3")
  # Run inference
  sentences = [
+ 'The band pursued `` signals `` in January 2012 in three weeks , and drums were recorded in a day and a half .',
+ 'The band tracked `` Signals `` in three weeks in January 2012 . Drums were recorded in a day and a half .',
+ 'Contributors include actor Anton LaVey , Satanist Christopher Lee , serial killer expert Clive Barker , author Karen Greenlee , and necrophile Robert Ressler .',
  ]
  embeddings = model.encode(sentences)
  print(embeddings.shape)
 
  # Get the similarity scores for the embeddings
  similarities = model.similarity(embeddings, embeddings)
  print(similarities)
+ # tensor([[0.9961, 0.9570, 0.4941],
+ # [0.9570, 0.9961, 0.5078],
+ # [0.4941, 0.5078, 1.0000]], dtype=torch.bfloat16)
  ```

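As a rough illustration of how the similarity scores printed in the updated snippet might feed a semantic-cache decision, the sketch below applies a cosine cut-off to two of those scores. It assumes the `cosine_accuracy_threshold` reported for the `test` split (0.8524) is reused as the cut-off; `is_cache_hit` is a hypothetical helper and not part of sentence-transformers.

```python
# Cosine similarities copied from the printed matrix (row 0 vs. rows 1 and 2).
SIM_PARAPHRASE = 0.9570
SIM_UNRELATED = 0.4941

# Assumption: reuse the reported cosine_accuracy_threshold as the cut-off.
THRESHOLD = 0.8524


def is_cache_hit(similarity: float, threshold: float = THRESHOLD) -> bool:
    """Hypothetical helper: treat a pair as a match at or above the threshold."""
    return similarity >= threshold


print(is_cache_hit(SIM_PARAPHRASE))  # True
print(is_cache_hit(SIM_UNRELATED))   # False
```

A production cache would likely tune this cut-off on its own traffic rather than reuse an evaluation threshold directly.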
  <!--
 

  #### Binary Classification

+ * Dataset: `test`
  * Evaluated with [<code>BinaryClassificationEvaluator</code>](https://sbert.net/docs/package_reference/sentence_transformer/evaluation.html#sentence_transformers.evaluation.BinaryClassificationEvaluator)

+ | Metric | Value |
+ |:--------------------------|:-----------|
+ | cosine_accuracy | 0.7038 |
+ | cosine_accuracy_threshold | 0.8524 |
+ | cosine_f1 | 0.7122 |
+ | cosine_f1_threshold | 0.8119 |
+ | cosine_precision | 0.5989 |
+ | cosine_recall | 0.8784 |
+ | **cosine_ap** | **0.6477** |
+ | cosine_mcc | 0.4418 |

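The updated F1 value can be cross-checked against precision and recall, since F1 is their harmonic mean. This is a minimal sketch, assuming the precision and recall in the model-index metadata were both measured at the F1 threshold; the variable names are illustrative only.

```python
# Full-precision values from the updated model-index metadata.
precision = 0.5989283084033827
recall = 0.8783612662942272

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)

print(round(f1, 4))  # 0.7122
```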
  <!--
  ## Bias, Risks and Limitations
 
  #### LangCache Sentence Pairs (all)

  * Dataset: [LangCache Sentence Pairs (all)](https://huggingface.co/datasets/redis/langcache-sentencepairs-v1)
+ * Size: 62,021 training samples
+ * Columns: <code>sentence1</code>, <code>sentence2</code>, and <code>label</code>
  * Approximate statistics based on the first 1000 samples:
+ | | sentence1 | sentence2 | label |
+ |:--------|:----------------------------------------------------------------------------------|:----------------------------------------------------------------------------------|:------------------------------------------------|
+ | type | string | string | int |
+ | details | <ul><li>min: 8 tokens</li><li>mean: 27.46 tokens</li><li>max: 53 tokens</li></ul> | <ul><li>min: 9 tokens</li><li>mean: 27.36 tokens</li><li>max: 52 tokens</li></ul> | <ul><li>0: ~50.30%</li><li>1: ~49.70%</li></ul> |
  * Samples:
+ | sentence1 | sentence2 | label |
+ |:--------------------------------------------------------------------------------------------------------------------------------------------|:----------------------------------------------------------------------------------------------------------------------------------------------|:---------------|
+ | <code>The newer Punts are still very much in existence today and race in the same fleets as the older boats .</code> | <code>The newer punts are still very much in existence today and run in the same fleets as the older boats .</code> | <code>1</code> |
+ | <code>Turner Valley , was at the Turner Valley Bar N Ranch Airport , southwest of the Turner Valley Bar N Ranch , Alberta , Canada .</code> | <code>Turner Valley Bar N Ranch Airport , , was located at Turner Valley Bar N Ranch , southwest of Turner Valley , Alberta , Canada .</code> | <code>0</code> |
+ | <code>After losing his second election , he resigned as opposition leader and was replaced by Geoff Pearsall .</code> | <code>Max Bingham resigned as opposition leader after losing his second election , and was replaced by Geoff Pearsall .</code> | <code>1</code> |
  * Loss: [<code>MultipleNegativesSymmetricRankingLoss</code>](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#multiplenegativessymmetricrankingloss) with these parameters:
  ```json
  {
 
  #### LangCache Sentence Pairs (all)

  * Dataset: [LangCache Sentence Pairs (all)](https://huggingface.co/datasets/redis/langcache-sentencepairs-v1)
+ * Size: 62,021 evaluation samples
+ * Columns: <code>sentence1</code>, <code>sentence2</code>, and <code>label</code>
  * Approximate statistics based on the first 1000 samples:
+ | | sentence1 | sentence2 | label |
+ |:--------|:----------------------------------------------------------------------------------|:----------------------------------------------------------------------------------|:------------------------------------------------|
+ | type | string | string | int |
+ | details | <ul><li>min: 8 tokens</li><li>mean: 27.46 tokens</li><li>max: 53 tokens</li></ul> | <ul><li>min: 9 tokens</li><li>mean: 27.36 tokens</li><li>max: 52 tokens</li></ul> | <ul><li>0: ~50.30%</li><li>1: ~49.70%</li></ul> |
  * Samples:
+ | sentence1 | sentence2 | label |
+ |:--------------------------------------------------------------------------------------------------------------------------------------------|:----------------------------------------------------------------------------------------------------------------------------------------------|:---------------|
+ | <code>The newer Punts are still very much in existence today and race in the same fleets as the older boats .</code> | <code>The newer punts are still very much in existence today and run in the same fleets as the older boats .</code> | <code>1</code> |
+ | <code>Turner Valley , was at the Turner Valley Bar N Ranch Airport , southwest of the Turner Valley Bar N Ranch , Alberta , Canada .</code> | <code>Turner Valley Bar N Ranch Airport , , was located at Turner Valley Bar N Ranch , southwest of Turner Valley , Alberta , Canada .</code> | <code>0</code> |
+ | <code>After losing his second election , he resigned as opposition leader and was replaced by Geoff Pearsall .</code> | <code>Max Bingham resigned as opposition leader after losing his second election , and was replaced by Geoff Pearsall .</code> | <code>1</code> |
  * Loss: [<code>MultipleNegativesSymmetricRankingLoss</code>](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#multiplenegativessymmetricrankingloss) with these parameters:
  ```json
  {
 
  ```

  ### Training Logs
+ | Epoch | Step | test_cosine_ap |
+ |:-----:|:----:|:--------------:|
+ | -1 | -1 | 0.6477 |


  ### Framework Versions