radoslavralev committed
Commit e17d3f5 · verified · 1 Parent(s): 4dcd005

Add new SentenceTransformer model

Files changed (2)
  1. README.md +90 -115
  2. model.safetensors +1 -1
README.md CHANGED
@@ -12,58 +12,42 @@ tags:
   - retrieval
   - reranking
   - generated_from_trainer
- - dataset_size:1047690
- - loss:CoSENTLoss
+ - dataset_size:478600
+ - loss:MultipleNegativesSymmetricRankingLoss
   base_model: Alibaba-NLP/gte-modernbert-base
   widget:
- - source_sentence: That is evident from their failure , three times in a row , to
-     get a big enough turnout to elect a president .
+ - source_sentence: The brown dog is sniffing the back of a small black dog
     sentences:
- - 'given a text, decide to which of a predefined set of classes it belongs. examples:
-     language identification, genre classification, sentiment analysis, and spam detection'
- - Three times in a row , they failed to get a big _ enough turnout to elect a president
-     .
- - He said the Government still did not know the real reason the original Saudi buyer
-     pulled out on August 21 .
- - source_sentence: these use built-in and learned knowledge to make decisions and
-     accomplish tasks that fulfill the intentions of the user.
+ - Pickens died in Edgefield and was buried on the Willow Brook Cemetery in Edgefield
+     , South Carolina .
+ - It is notable as the oldest Chinatown in Australia , the oldest continuous Chinese
+     settlement in Australia , and the longest continuously running Chinatown outside
+     of Asia .
+ - There is no large brown dog and small grey dog standing on a rocky surface
+ - source_sentence: Is it harmful from security perspectives to use public Wi-Fi?
     sentences:
- - It also features a 4.5 in back-lit LCD screen and memory expansion facilities
-     .
- - '- set of interrelated components - collect, process, store and distribute info.
-     - support decision-making, coordination, and control'
- - software programs that work without direct human intervention to carry out specific
-     tasks for an individual user, business process, or software application -siri
-     adapts to your preferences over time
- - source_sentence: any location in storage can be accessed at any moment in approximately
-     the same amount of time.
+ - What is the best way to drive traffic to a website?
+ - What startups have used GitHub?
+ - Is there something wrong with using public Wi-Fi?
+ - source_sentence: How can we make education better?
     sentences:
- - your study can adopt the original model used by the cited theorist but you can
-     modify different variables depending on your study of the whole theory
- - an access method that can access any storage location directly and in any order;
-     primary storage devices and disk storage devices use random access...
- - Branson said that his preference would be to operate a fully commercial service
-     on routes to New York , Barbados and Dubai .
- - source_sentence: United issued a statement saying it will " work professionally
-     and cooperatively with all its unions . "
+ - What are some things that would make education better today?
+ - Mistery works full-time as a graffiti artist and is also Emcee / Rapper in the
+     Brethren group .
+ - Jammu Airport operates flights to many cities in India such as Delhi , Leh and
+     Srinagar .
+ - source_sentence: So are you.
     sentences:
- - network that acts like the human brain; type of ai
- - a database system consists of one or more databases and a database management
-     system (dbms).
- - Senior vice president Sara Fields said the airline " will work professionally
-     and cooperatively with all our unions . "
- - source_sentence: A European Union spokesman said the Commission was consulting EU
-     member states " with a view to taking appropriate action if necessary " on the
-     matter .
+ - 'Brown said afterwards that he was surprised they had not scored five , and Astall
+     wrote in his newspaper column :'
+ - Just like yourself.
+ - How do I actually lose weight?
+ - source_sentence: A group of boys are playing with a ball in front of a large door
+     made of wood
     sentences:
- - Justice Minister Martin Cauchon and Prime Minister Jean Chretien both have said
-     the government will introduce legislation to decriminalize possession of small
-     amounts of pot .
- - Laos 's second most important export destination - said it was consulting EU member
-     states ' ' with a view to taking appropriate action if necessary ' ' on the matter
-     .
- - the form data assumes and the possible range of values that the attribute defined
-     as that type of data may express 1. text 2. numerical
+ - The children are playing in front of a large door
+ - What is the blind spot?
+ - What are some good techniques for controlling your anger?
   datasets:
   - redis/langcache-sentencepairs-v1
   pipeline_tag: sentence-similarity
@@ -88,28 +72,28 @@ model-index:
   type: val
   metrics:
   - type: cosine_accuracy
-   value: 0.7638310529446758
+   value: 0.9996860282574568
   name: Cosine Accuracy
   - type: cosine_accuracy_threshold
-   value: 0.8640533685684204
+   value: 0.4801735281944275
   name: Cosine Accuracy Threshold
   - type: cosine_f1
-   value: 0.6912742186395134
+   value: 0.9998429894802952
   name: Cosine F1
   - type: cosine_f1_threshold
-   value: 0.825770378112793
+   value: 0.4801735281944275
   name: Cosine F1 Threshold
   - type: cosine_precision
-   value: 0.6289243437982501
+   value: 1.0
   name: Cosine Precision
   - type: cosine_recall
-   value: 0.7673469387755102
+   value: 0.9996860282574568
   name: Cosine Recall
   - type: cosine_ap
-   value: 0.7353968345121902
+   value: 0.9999999999999999
   name: Cosine Ap
   - type: cosine_mcc
-   value: 0.4778469995044085
+   value: 0.0
   name: Cosine Mcc
   - task:
   type: binary-classification
@@ -119,28 +103,28 @@ model-index:
   type: test
   metrics:
   - type: cosine_accuracy
-   value: 0.7037777526966672
+   value: 0.9999627560521416
   name: Cosine Accuracy
   - type: cosine_accuracy_threshold
-   value: 0.8524033427238464
+   value: 0.42059871554374695
   name: Cosine Accuracy Threshold
   - type: cosine_f1
-   value: 0.7122170715871171
+   value: 0.9999813776792864
   name: Cosine F1
   - type: cosine_f1_threshold
-   value: 0.8118724822998047
+   value: 0.42059871554374695
   name: Cosine F1 Threshold
   - type: cosine_precision
-   value: 0.5989283084033827
+   value: 1.0
   name: Cosine Precision
   - type: cosine_recall
-   value: 0.8783612662942272
+   value: 0.9999627560521416
   name: Cosine Recall
   - type: cosine_ap
-   value: 0.6476665223951498
+   value: 1.0
   name: Cosine Ap
   - type: cosine_mcc
-   value: 0.44182914870985407
+   value: 0.0
   name: Cosine Mcc
   ---
 
@@ -194,9 +178,9 @@ from sentence_transformers import SentenceTransformer
  model = SentenceTransformer("redis/langcache-embed-v3")
  # Run inference
  sentences = [
-     'A European Union spokesman said the Commission was consulting EU member states " with a view to taking appropriate action if necessary " on the matter .',
-     "Laos 's second most important export destination - said it was consulting EU member states ' ' with a view to taking appropriate action if necessary ' ' on the matter .",
-     'the form data assumes and the possible range of values that the attribute defined as that type of data may express 1. text 2. numerical',
+     'A group of boys are playing with a ball in front of a large door made of wood',
+     'The children are playing in front of a large door',
+     'What are some good techniques for controlling your anger?',
  ]
  embeddings = model.encode(sentences)
  print(embeddings.shape)
@@ -205,9 +189,9 @@ print(embeddings.shape)
  # Get the similarity scores for the embeddings
  similarities = model.similarity(embeddings, embeddings)
  print(similarities)
- # tensor([[1.0078, 0.8789, 0.4961],
- #         [0.8789, 1.0000, 0.4648],
- #         [0.4961, 0.4648, 1.0078]], dtype=torch.bfloat16)
+ # tensor([[1.0000, 0.8672, 0.4121],
+ #         [0.8672, 1.0000, 0.4219],
+ #         [0.4121, 0.4219, 1.0000]], dtype=torch.bfloat16)
  ```
 
  <!--
@@ -243,16 +227,16 @@ You can finetune this model on your own dataset.
  * Datasets: `val` and `test`
  * Evaluated with [<code>BinaryClassificationEvaluator</code>](https://sbert.net/docs/package_reference/sentence_transformer/evaluation.html#sentence_transformers.evaluation.BinaryClassificationEvaluator)
 
- | Metric                    | val        | test       |
- |:--------------------------|:-----------|:-----------|
- | cosine_accuracy           | 0.7638     | 0.7038     |
- | cosine_accuracy_threshold | 0.8641     | 0.8524     |
- | cosine_f1                 | 0.6913     | 0.7122     |
- | cosine_f1_threshold       | 0.8258     | 0.8119     |
- | cosine_precision          | 0.6289     | 0.5989     |
- | cosine_recall             | 0.7673     | 0.8784     |
- | **cosine_ap**             | **0.7354** | **0.6477** |
- | cosine_mcc                | 0.4778     | 0.4418     |
+ | Metric                    | val     | test    |
+ |:--------------------------|:--------|:--------|
+ | cosine_accuracy           | 0.9997  | 1.0     |
+ | cosine_accuracy_threshold | 0.4802  | 0.4206  |
+ | cosine_f1                 | 0.9998  | 1.0     |
+ | cosine_f1_threshold       | 0.4802  | 0.4206  |
+ | cosine_precision          | 1.0     | 1.0     |
+ | cosine_recall             | 0.9997  | 1.0     |
+ | **cosine_ap**             | **1.0** | **1.0** |
+ | cosine_mcc                | 0.0     | 0.0     |
 
  <!--
  ## Bias, Risks and Limitations
@@ -273,24 +257,25 @@
  #### LangCache Sentence Pairs (all)
 
  * Dataset: [LangCache Sentence Pairs (all)](https://huggingface.co/datasets/redis/langcache-sentencepairs-v1)
- * Size: 8,405 training samples
- * Columns: <code>sentence1</code>, <code>sentence2</code>, and <code>label</code>
+ * Size: 26,850 training samples
+ * Columns: <code>sentence1</code> and <code>sentence2</code>
  * Approximate statistics based on the first 1000 samples:
-   |         | sentence1 | sentence2 | label |
-   |:--------|:----------|:----------|:------|
-   | type    | string    | string    | int   |
-   | details | <ul><li>min: 6 tokens</li><li>mean: 24.89 tokens</li><li>max: 50 tokens</li></ul> | <ul><li>min: 6 tokens</li><li>mean: 24.3 tokens</li><li>max: 43 tokens</li></ul> | <ul><li>0: ~45.80%</li><li>1: ~54.20%</li></ul> |
+   |         | sentence1 | sentence2 |
+   |:--------|:----------|:----------|
+   | type    | string    | string    |
+   | details | <ul><li>min: 4 tokens</li><li>mean: 16.76 tokens</li><li>max: 44 tokens</li></ul> | <ul><li>min: 4 tokens</li><li>mean: 16.58 tokens</li><li>max: 44 tokens</li></ul> |
  * Samples:
-   | sentence1 | sentence2 | label |
-   |:----------|:----------|:------|
-   | <code>He said the foodservice pie business doesn 't fit the company 's long-term growth strategy .</code> | <code>" The foodservice pie business does not fit our long-term growth strategy .</code> | <code>1</code> |
-   | <code>Magnarelli said Racicot hated the Iraqi regime and looked forward to using his long years of training in the war .</code> | <code>His wife said he was " 100 percent behind George Bush " and looked forward to using his years of training in the war .</code> | <code>0</code> |
-   | <code>The dollar was at 116.92 yen against the yen , flat on the session , and at 1.2891 against the Swiss franc , also flat .</code> | <code>The dollar was at 116.78 yen JPY = , virtually flat on the session , and at 1.2871 against the Swiss franc CHF = , down 0.1 percent .</code> | <code>0</code> |
- * Loss: [<code>CoSENTLoss</code>](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#cosentloss) with these parameters:
+   | sentence1 | sentence2 |
+   |:----------|:----------|
+   | <code>A chef is preparing a meal</code> | <code>Some food is being prepared by a chef</code> |
+   | <code>The presentation is being watched by a classroom of students</code> | <code>A classroom is full of students</code> |
+   | <code>Garden River , located north of Garden River Airport , Alberta , Canada .</code> | <code>Garden River , , is located north of Garden River Airport , Alberta , Canada .</code> |
+ * Loss: [<code>MultipleNegativesSymmetricRankingLoss</code>](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#multiplenegativessymmetricrankingloss) with these parameters:
  ```json
  {
      "scale": 20.0,
-     "similarity_fct": "pairwise_cos_sim"
+     "similarity_fct": "cos_sim",
+     "gather_across_devices": false
  }
  ```
 
@@ -299,31 +284,32 @@
  #### LangCache Sentence Pairs (all)
 
  * Dataset: [LangCache Sentence Pairs (all)](https://huggingface.co/datasets/redis/langcache-sentencepairs-v1)
- * Size: 8,405 evaluation samples
- * Columns: <code>sentence1</code>, <code>sentence2</code>, and <code>label</code>
+ * Size: 26,850 evaluation samples
+ * Columns: <code>sentence1</code> and <code>sentence2</code>
  * Approximate statistics based on the first 1000 samples:
-   |         | sentence1 | sentence2 | label |
-   |:--------|:----------|:----------|:------|
-   | type    | string    | string    | int   |
-   | details | <ul><li>min: 6 tokens</li><li>mean: 24.89 tokens</li><li>max: 50 tokens</li></ul> | <ul><li>min: 6 tokens</li><li>mean: 24.3 tokens</li><li>max: 43 tokens</li></ul> | <ul><li>0: ~45.80%</li><li>1: ~54.20%</li></ul> |
+   |         | sentence1 | sentence2 |
+   |:--------|:----------|:----------|
+   | type    | string    | string    |
+   | details | <ul><li>min: 4 tokens</li><li>mean: 16.76 tokens</li><li>max: 44 tokens</li></ul> | <ul><li>min: 4 tokens</li><li>mean: 16.58 tokens</li><li>max: 44 tokens</li></ul> |
  * Samples:
-   | sentence1 | sentence2 | label |
-   |:----------|:----------|:------|
-   | <code>He said the foodservice pie business doesn 't fit the company 's long-term growth strategy .</code> | <code>" The foodservice pie business does not fit our long-term growth strategy .</code> | <code>1</code> |
-   | <code>Magnarelli said Racicot hated the Iraqi regime and looked forward to using his long years of training in the war .</code> | <code>His wife said he was " 100 percent behind George Bush " and looked forward to using his years of training in the war .</code> | <code>0</code> |
-   | <code>The dollar was at 116.92 yen against the yen , flat on the session , and at 1.2891 against the Swiss franc , also flat .</code> | <code>The dollar was at 116.78 yen JPY = , virtually flat on the session , and at 1.2871 against the Swiss franc CHF = , down 0.1 percent .</code> | <code>0</code> |
- * Loss: [<code>CoSENTLoss</code>](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#cosentloss) with these parameters:
+   | sentence1 | sentence2 |
+   |:----------|:----------|
+   | <code>A chef is preparing a meal</code> | <code>Some food is being prepared by a chef</code> |
+   | <code>The presentation is being watched by a classroom of students</code> | <code>A classroom is full of students</code> |
+   | <code>Garden River , located north of Garden River Airport , Alberta , Canada .</code> | <code>Garden River , , is located north of Garden River Airport , Alberta , Canada .</code> |
+ * Loss: [<code>MultipleNegativesSymmetricRankingLoss</code>](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#multiplenegativessymmetricrankingloss) with these parameters:
  ```json
  {
      "scale": 20.0,
-     "similarity_fct": "pairwise_cos_sim"
+     "similarity_fct": "cos_sim",
+     "gather_across_devices": false
  }
  ```
 
  ### Training Logs
  | Epoch | Step | val_cosine_ap | test_cosine_ap |
  |:-----:|:----:|:-------------:|:--------------:|
- | -1    | -1   | 0.7354        | 0.6477         |
+ | -1    | -1   | 1.0000        | 1.0            |
 
 
  ### Framework Versions
@@ -352,17 +338,6 @@
  }
  ```
 
- #### CoSENTLoss
- ```bibtex
- @online{kexuefm-8847,
-     title={CoSENT: A more efficient sentence vector scheme than Sentence-BERT},
-     author={Su Jianlin},
-     year={2022},
-     month={Jan},
-     url={https://kexue.fm/archives/8847},
- }
- ```
-
  <!--
  ## Glossary
 
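The updated usage snippet prints the similarity matrix in `torch.bfloat16`, which rounds coarsely; the previous card even showed self-similarities of 1.0078. A minimal sketch, not part of the card itself, of the same check with an explicit float32 cosine similarity via `sentence_transformers.util`:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("redis/langcache-embed-v3")
sentences = [
    "A group of boys are playing with a ball in front of a large door made of wood",
    "The children are playing in front of a large door",
    "What are some good techniques for controlling your anger?",
]

# Encode to tensors, cast to float32, and compute the pairwise cosine-similarity matrix.
embeddings = model.encode(sentences, convert_to_tensor=True).float()
print(util.cos_sim(embeddings, embeddings))  # diagonal is 1.0 up to float32 rounding
```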
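One reading of the new evaluation table: `BinaryClassificationEvaluator` reports precision 1.0, AP 1.0, and MCC 0.0 together when the labeled pairs are (nearly) all positive, so the updated `val`/`test` scores likely reflect a positives-only split rather than a perfect classifier. A hedged sketch of running the evaluator yourself; the pairs below are illustrative stand-ins taken from the card's samples, not the real splits:

```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.evaluation import BinaryClassificationEvaluator

model = SentenceTransformer("redis/langcache-embed-v3")

# Illustrative pairs only; the card's real splits come from redis/langcache-sentencepairs-v1.
evaluator = BinaryClassificationEvaluator(
    sentences1=["A chef is preparing a meal", "So are you."],
    sentences2=["Some food is being prepared by a chef", "How do I actually lose weight?"],
    labels=[1, 0],  # 1 = paraphrase, 0 = not a paraphrase
    name="val",
)
print(evaluator(model))  # metrics dict: cosine accuracy/F1/precision/recall/AP/MCC
```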
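The loss change tracks the schema change: CoSENTLoss needs labeled pairs, while MultipleNegativesSymmetricRankingLoss needs only positive (`sentence1`, `sentence2`) pairs, treating the other in-batch sentences as negatives and scoring in both directions. A minimal training sketch under stated assumptions (a tiny illustrative dataset in place of the real 26,850-sample split, and default trainer settings):

```python
from datasets import Dataset
from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer, util
from sentence_transformers.losses import MultipleNegativesSymmetricRankingLoss

model = SentenceTransformer("Alibaba-NLP/gte-modernbert-base")

# Illustrative positive pairs; no label column is needed for this loss.
train_dataset = Dataset.from_dict({
    "sentence1": ["A chef is preparing a meal",
                  "The presentation is being watched by a classroom of students"],
    "sentence2": ["Some food is being prepared by a chef",
                  "A classroom is full of students"],
})

# Mirrors the card's parameters: {"scale": 20.0, "similarity_fct": "cos_sim"}.
loss = MultipleNegativesSymmetricRankingLoss(model, scale=20.0, similarity_fct=util.cos_sim)

trainer = SentenceTransformerTrainer(model=model, train_dataset=train_dataset, loss=loss)
trainer.train()
```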
model.safetensors CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:db6feb45cb2c9f0ca7da393bb058316538cac11e31b6ef6fc4cb21e299b1e346
+ oid sha256:95d02211c4cca89113f9f3e93ed91f5176bf50170faa2cb835f7bfea15bb9dd2
  size 298041696
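The `model.safetensors` entry is a Git LFS pointer: the `oid` is the SHA-256 digest of the weight file's contents and `size` is its length in bytes. A small sketch, not part of the commit, for verifying a downloaded copy against this pointer:

```python
import hashlib
import os

EXPECTED_OID = "95d02211c4cca89113f9f3e93ed91f5176bf50170faa2cb835f7bfea15bb9dd2"
EXPECTED_SIZE = 298041696

def matches_pointer(path: str) -> bool:
    """Check a local file against the LFS pointer's sha256 oid and byte size."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):  # read in 1 MiB chunks
            digest.update(chunk)
    return digest.hexdigest() == EXPECTED_OID and os.path.getsize(path) == EXPECTED_SIZE

print(matches_pointer("model.safetensors"))
```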