radoslavralev committed on
Commit cfc7a2d · verified · 1 Parent(s): 07bd930

Add new SentenceTransformer model

Files changed (2)
  1. README.md +68 -67
  2. model.safetensors +1 -1
README.md CHANGED
@@ -12,50 +12,51 @@ tags:
12
  - retrieval
13
  - reranking
14
  - generated_from_trainer
15
- - dataset_size:3587
16
  - loss:CustomBCELoss
17
  base_model: Alibaba-NLP/gte-modernbert-base
18
  widget:
19
- - source_sentence: Hunter College was originally Lehman College 's uptown campus .
 
20
  sentences:
21
- - Acquired programming includes the Irish soap `` Fair City `` and Finnish drama
22
- `` Black Widows `` .
23
- - According to the United States Census Bureau , the town has a total area of ;
24
- of the area is land and 0.66 % is water .
25
- - Hunter College originally was Lehman College Uptown Campus .
26
- - source_sentence: He hoped to defeat them and then marry Ravonna .
 
 
27
  sentences:
28
- - Stillwater Creek received its official name in 1884 when William L. Couch established
29
- his `` boomer colony `` on its banks .
30
- - Note that the invertible of a matrix is always an exponential matrix .
31
- - He hoped to defeat them and marry Ravonna .
32
- - source_sentence: Born on February 2 , 1984 , Abrar Khan is a professional Pakistani
33
- international Kabaddi player .
 
 
34
  sentences:
35
- - Born on February 2 , 1984 , Abrar Khan is a professional Pakistani international
36
- Kabaddi player .
37
- - Together , the paired mylohyoid muscles form a muscular floor for the oral cavity
38
- of the mouth .
39
- - Abrar Khan born 2 February 1984 is a Pakistani professional international Kabaddi
40
- player .
41
- - source_sentence: Certainly , `` Lucy was nothing like flat `` in physical form ,
42
- social condition , and personality .
43
  sentences:
44
- - The real number is called the `` imaginary part `` of the real number ; the real
45
- number is called the `` complex part `` of .
46
- - From the Celebes lake , the captain Bullock observed the appearance of the corona
47
- , while Gustav Fritsch accompanied an expedition to Aden .
48
- - Certainly `` Lucy was , in physical form , social condition and personality ,
49
- nothing like Shallow `` .
50
- - source_sentence: The trio has performed besides Gesaffelstein , Justice , Bob Moses
51
- and Lee Foss .
52
  sentences:
53
- - The trio has performed besides Gesaffelstein , Justice , Bob Moses and Lee Foss
54
- .
55
- - The suttas generally contain educational content , while other early Buddhist
56
- texts deal with monastic discipline or vinaya .
57
- - The trio has performed alongside Bob Moses , Justice , Gesaffelstein and Lee Foss
58
- .
59
  datasets:
60
  - redis/langcache-sentencepairs-v2
61
  pipeline_tag: sentence-similarity
@@ -87,13 +88,13 @@ model-index:
87
  value: 0.5679885764966713
88
  name: Cosine Recall@1
89
  - type: cosine_ndcg@10
90
- value: 0.773078207125666
91
  name: Cosine Ndcg@10
92
  - type: cosine_mrr@1
93
  value: 0.5861241448475948
94
  name: Cosine Mrr@1
95
  - type: cosine_map@100
96
- value: 0.7217228927629071
97
  name: Cosine Map@100
98
  ---
99
 
@@ -147,9 +148,9 @@ from sentence_transformers import SentenceTransformer
147
  model = SentenceTransformer("redis/langcache-embed-v3")
148
  # Run inference
149
  sentences = [
150
- 'The trio has performed besides Gesaffelstein , Justice , Bob Moses and Lee Foss .',
151
- 'The trio has performed besides Gesaffelstein , Justice , Bob Moses and Lee Foss .',
152
- 'The trio has performed alongside Bob Moses , Justice , Gesaffelstein and Lee Foss .',
153
  ]
154
  embeddings = model.encode(sentences)
155
  print(embeddings.shape)
@@ -158,9 +159,9 @@ print(embeddings.shape)
158
  # Get the similarity scores for the embeddings
159
  similarities = model.similarity(embeddings, embeddings)
160
  print(similarities)
161
- # tensor([[0.9961, 0.9961, 0.9844],
162
- # [0.9961, 0.9961, 0.9844],
163
- # [0.9844, 0.9844, 0.9961]], dtype=torch.bfloat16)
164
  ```
165
 
166
  <!--
@@ -196,14 +197,14 @@ You can finetune this model on your own dataset.
196
  * Dataset: `test`
197
  * Evaluated with [<code>InformationRetrievalEvaluator</code>](https://sbert.net/docs/package_reference/sentence_transformer/evaluation.html#sentence_transformers.evaluation.InformationRetrievalEvaluator)
198
 
199
- | Metric | Value |
200
- |:-------------------|:-----------|
201
- | cosine_accuracy@1 | 0.5861 |
202
- | cosine_precision@1 | 0.5861 |
203
- | cosine_recall@1 | 0.568 |
204
- | **cosine_ndcg@10** | **0.7731** |
205
- | cosine_mrr@1 | 0.5861 |
206
- | cosine_map@100 | 0.7217 |
207
 
208
  <!--
209
  ## Bias, Risks and Limitations
@@ -224,19 +225,19 @@ You can finetune this model on your own dataset.
224
  #### LangCache Sentence Pairs (all)
225
 
226
  * Dataset: [LangCache Sentence Pairs (all)](https://huggingface.co/datasets/redis/langcache-sentencepairs-v2)
227
- * Size: 1,922 training samples
228
  * Columns: <code>anchor</code>, <code>positive</code>, and <code>negative</code>
229
  * Approximate statistics based on the first 1000 samples:
230
  | | anchor | positive | negative |
231
  |:--------|:----------------------------------------------------------------------------------|:----------------------------------------------------------------------------------|:----------------------------------------------------------------------------------|
232
  | type | string | string | string |
233
- | details | <ul><li>min: 8 tokens</li><li>mean: 27.26 tokens</li><li>max: 49 tokens</li></ul> | <ul><li>min: 8 tokens</li><li>mean: 27.24 tokens</li><li>max: 49 tokens</li></ul> | <ul><li>min: 9 tokens</li><li>mean: 27.09 tokens</li><li>max: 49 tokens</li></ul> |
234
  * Samples:
235
- | anchor | positive | negative |
236
- |:--------------------------------------------------------------------------------------------------------------------------------------------|:--------------------------------------------------------------------------------------------------------------------------------------------|:-----------------------------------------------------------------------------------------------------------------------------------------------------------|
237
- | <code>The newer Punts are still very much in existence today and race in the same fleets as the older boats .</code> | <code>The newer punts are still very much in existence today and run in the same fleets as the older boats .</code> | <code>At that time , on June 22 , 1754 , Edward Bentham married Bentham Elizabeth Bates ( d . 1790 ) from Hampshire in the nearby county of Alton .</code> |
238
- | <code>The newer punts are still very much in existence today and run in the same fleets as the older boats .</code> | <code>The newer Punts are still very much in existence today and race in the same fleets as the older boats .</code> | <code>In 2012 , Cornell 5th and Lehigh 8th , Cornell was also 4th in 2013 and 7th in 2014 .</code> |
239
- | <code>Turner Valley , was at the Turner Valley Bar N Ranch Airport , southwest of the Turner Valley Bar N Ranch , Alberta , Canada .</code> | <code>Turner Valley , was at the Turner Valley Bar N Ranch Airport , southwest of the Turner Valley Bar N Ranch , Alberta , Canada .</code> | <code>Turner Valley Bar N Ranch Airport , , was located at Turner Valley Bar N Ranch , southwest of Turner Valley , Alberta , Canada .</code> |
240
  * Loss: <code>losses.CustomBCELoss</code>
241
 
242
  ### Evaluation Dataset
@@ -244,25 +245,25 @@ You can finetune this model on your own dataset.
244
  #### LangCache Sentence Pairs (all)
245
 
246
  * Dataset: [LangCache Sentence Pairs (all)](https://huggingface.co/datasets/redis/langcache-sentencepairs-v2)
247
- * Size: 1,922 evaluation samples
248
  * Columns: <code>anchor</code>, <code>positive</code>, and <code>negative</code>
249
  * Approximate statistics based on the first 1000 samples:
250
  | | anchor | positive | negative |
251
  |:--------|:----------------------------------------------------------------------------------|:----------------------------------------------------------------------------------|:----------------------------------------------------------------------------------|
252
  | type | string | string | string |
253
- | details | <ul><li>min: 8 tokens</li><li>mean: 27.26 tokens</li><li>max: 49 tokens</li></ul> | <ul><li>min: 8 tokens</li><li>mean: 27.24 tokens</li><li>max: 49 tokens</li></ul> | <ul><li>min: 9 tokens</li><li>mean: 27.09 tokens</li><li>max: 49 tokens</li></ul> |
254
  * Samples:
255
- | anchor | positive | negative |
256
- |:--------------------------------------------------------------------------------------------------------------------------------------------|:--------------------------------------------------------------------------------------------------------------------------------------------|:-----------------------------------------------------------------------------------------------------------------------------------------------------------|
257
- | <code>The newer Punts are still very much in existence today and race in the same fleets as the older boats .</code> | <code>The newer punts are still very much in existence today and run in the same fleets as the older boats .</code> | <code>At that time , on June 22 , 1754 , Edward Bentham married Bentham Elizabeth Bates ( d . 1790 ) from Hampshire in the nearby county of Alton .</code> |
258
- | <code>The newer punts are still very much in existence today and run in the same fleets as the older boats .</code> | <code>The newer Punts are still very much in existence today and race in the same fleets as the older boats .</code> | <code>In 2012 , Cornell 5th and Lehigh 8th , Cornell was also 4th in 2013 and 7th in 2014 .</code> |
259
- | <code>Turner Valley , was at the Turner Valley Bar N Ranch Airport , southwest of the Turner Valley Bar N Ranch , Alberta , Canada .</code> | <code>Turner Valley , was at the Turner Valley Bar N Ranch Airport , southwest of the Turner Valley Bar N Ranch , Alberta , Canada .</code> | <code>Turner Valley Bar N Ranch Airport , , was located at Turner Valley Bar N Ranch , southwest of Turner Valley , Alberta , Canada .</code> |
260
  * Loss: <code>losses.CustomBCELoss</code>
261
 
262
  ### Training Logs
263
  | Epoch | Step | test_cosine_ndcg@10 |
264
  |:-----:|:----:|:-------------------:|
265
- | -1 | -1 | 0.7731 |
266
 
267
 
268
  ### Framework Versions
 
12
  - retrieval
13
  - reranking
14
  - generated_from_trainer
15
+ - dataset_size:3119809
16
  - loss:CustomBCELoss
17
  base_model: Alibaba-NLP/gte-modernbert-base
18
  widget:
19
+ - source_sentence: Hayley Vaughan portrayed Ripa on the ABC daytime soap opera , ``
20
+ All My Children `` , between 1990 and 2002 .
21
  sentences:
22
+ - Traxxpad is a music application for Sony 's PlayStation Portable published by
23
+ Definitive Studios and developed by Eidos Interactive .
24
+ - Between 1990 and 2002 , Hayley Vaughan Ripa portrayed in the ABC soap opera ``
25
+ All My Children `` .
26
+ - Between 1990 and 2002 , Ripa Hayley portrayed Vaughan in the ABC soap opera ``
27
+ All My Children `` .
28
+ - source_sentence: Olivella monilifera is a species of dwarf sea snail , small gastropod
29
+ mollusk in the family Olivellidae , the marine olives .
30
  sentences:
31
+ - Olivella monilifera is a species of the dwarf - sea snail , small gastropod mollusk
32
+ in the Olivellidae family , the marine olives .
33
+ - He was cut by the Browns after being signed by the Bills in 2013 . He was later
34
+ released .
35
+ - Olivella monilifera is a kind of sea snail , marine gastropod mollusk in the Olivellidae
36
+ family , the dwarf olives .
37
+ - source_sentence: Hayashi said that Mackey `` is a sort of `` of the original model
38
+ for Tenchi .
39
  sentences:
40
+ - In the summer of 2009 , Ellick shot a documentary about Malala Yousafzai .
41
+ - Hayashi said that Mackey is `` sort of `` the original model for Tenchi .
42
+ - Mackey said that Hayashi is `` sort of `` the original model for Tenchi .
43
+ - source_sentence: Much of the film was shot on location in Los Angeles and in nearby
44
+ Burbank and Glendale .
 
 
 
45
  sentences:
46
+ - Much of the film was shot on location in Los Angeles and in nearby Burbank and
47
+ Glendale .
48
+ - Much of the film was shot on site in Burbank and Glendale and in the nearby Los
49
+ Angeles .
50
+ - Traxxpad is a music application for the Sony PlayStation Portable developed by
51
+ the Definitive Studios and published by Eidos Interactive .
52
+ - source_sentence: According to him , the earth is the carrier of his artistic work
53
+ , which is only integrated into the creative process by minimal changes .
54
  sentences:
55
+ - National players are Bold players .
56
+ - According to him , earth is the carrier of his artistic work being integrated
57
+ into the creative process only by minimal changes .
58
+ - According to him , earth is the carrier of his creative work being integrated
59
+ into the artistic process only by minimal changes .
 
60
  datasets:
61
  - redis/langcache-sentencepairs-v2
62
  pipeline_tag: sentence-similarity
 
88
  value: 0.5679885764966713
89
  name: Cosine Recall@1
90
  - type: cosine_ndcg@10
91
+ value: 0.7729838064849864
92
  name: Cosine Ndcg@10
93
  - type: cosine_mrr@1
94
  value: 0.5861241448475948
95
  name: Cosine Mrr@1
96
  - type: cosine_map@100
97
+ value: 0.7216697804426214
98
  name: Cosine Map@100
99
  ---
100
 
 
148
  model = SentenceTransformer("redis/langcache-embed-v3")
149
  # Run inference
150
  sentences = [
151
+ 'According to him , the earth is the carrier of his artistic work , which is only integrated into the creative process by minimal changes .',
152
+ 'According to him , earth is the carrier of his artistic work being integrated into the creative process only by minimal changes .',
153
+ 'According to him , earth is the carrier of his creative work being integrated into the artistic process only by minimal changes .',
154
  ]
155
  embeddings = model.encode(sentences)
156
  print(embeddings.shape)
 
159
  # Get the similarity scores for the embeddings
160
  similarities = model.similarity(embeddings, embeddings)
161
  print(similarities)
162
+ # tensor([[1.0000, 0.9961, 0.9922],
163
+ # [0.9961, 1.0000, 0.9961],
164
+ # [0.9922, 0.9961, 0.9961]], dtype=torch.bfloat16)
165
  ```
166
 
167
  <!--
 
197
  * Dataset: `test`
198
  * Evaluated with [<code>InformationRetrievalEvaluator</code>](https://sbert.net/docs/package_reference/sentence_transformer/evaluation.html#sentence_transformers.evaluation.InformationRetrievalEvaluator)
199
 
200
+ | Metric | Value |
201
+ |:-------------------|:----------|
202
+ | cosine_accuracy@1 | 0.5861 |
203
+ | cosine_precision@1 | 0.5861 |
204
+ | cosine_recall@1 | 0.568 |
205
+ | **cosine_ndcg@10** | **0.773** |
206
+ | cosine_mrr@1 | 0.5861 |
207
+ | cosine_map@100 | 0.7217 |
208
 
209
  <!--
210
  ## Bias, Risks and Limitations
 
225
  #### LangCache Sentence Pairs (all)
226
 
227
  * Dataset: [LangCache Sentence Pairs (all)](https://huggingface.co/datasets/redis/langcache-sentencepairs-v2)
228
+ * Size: 126,938 training samples
229
  * Columns: <code>anchor</code>, <code>positive</code>, and <code>negative</code>
230
  * Approximate statistics based on the first 1000 samples:
231
  | | anchor | positive | negative |
232
  |:--------|:----------------------------------------------------------------------------------|:----------------------------------------------------------------------------------|:----------------------------------------------------------------------------------|
233
  | type | string | string | string |
234
+ | details | <ul><li>min: 8 tokens</li><li>mean: 27.27 tokens</li><li>max: 49 tokens</li></ul> | <ul><li>min: 8 tokens</li><li>mean: 27.27 tokens</li><li>max: 48 tokens</li></ul> | <ul><li>min: 7 tokens</li><li>mean: 26.54 tokens</li><li>max: 61 tokens</li></ul> |
235
  * Samples:
236
+ | anchor | positive | negative |
237
+ |:--------------------------------------------------------------------------------------------------------------------------------------------|:----------------------------------------------------------------------------------------------------------------------------------------------|:----------------------------------------------------------------------------------------------------------------------------------------------|
238
+ | <code>The newer Punts are still very much in existence today and race in the same fleets as the older boats .</code> | <code>The newer punts are still very much in existence today and run in the same fleets as the older boats .</code> | <code>how can I get financial freedom as soon as possible?</code> |
239
+ | <code>The newer punts are still very much in existence today and run in the same fleets as the older boats .</code> | <code>The newer Punts are still very much in existence today and race in the same fleets as the older boats .</code> | <code>The older Punts are still very much in existence today and race in the same fleets as the newer boats .</code> |
240
+ | <code>Turner Valley , was at the Turner Valley Bar N Ranch Airport , southwest of the Turner Valley Bar N Ranch , Alberta , Canada .</code> | <code>Turner Valley , , was located at Turner Valley Bar N Ranch Airport , southwest of Turner Valley Bar N Ranch , Alberta , Canada .</code> | <code>Turner Valley Bar N Ranch Airport , , was located at Turner Valley Bar N Ranch , southwest of Turner Valley , Alberta , Canada .</code> |
241
  * Loss: <code>losses.CustomBCELoss</code>
242
 
243
  ### Evaluation Dataset
 
245
  #### LangCache Sentence Pairs (all)
246
 
247
  * Dataset: [LangCache Sentence Pairs (all)](https://huggingface.co/datasets/redis/langcache-sentencepairs-v2)
248
+ * Size: 126,938 evaluation samples
249
  * Columns: <code>anchor</code>, <code>positive</code>, and <code>negative</code>
250
  * Approximate statistics based on the first 1000 samples:
251
  | | anchor | positive | negative |
252
  |:--------|:----------------------------------------------------------------------------------|:----------------------------------------------------------------------------------|:----------------------------------------------------------------------------------|
253
  | type | string | string | string |
254
+ | details | <ul><li>min: 8 tokens</li><li>mean: 27.27 tokens</li><li>max: 49 tokens</li></ul> | <ul><li>min: 8 tokens</li><li>mean: 27.27 tokens</li><li>max: 48 tokens</li></ul> | <ul><li>min: 7 tokens</li><li>mean: 26.54 tokens</li><li>max: 61 tokens</li></ul> |
255
  * Samples:
256
+ | anchor | positive | negative |
257
+ |:--------------------------------------------------------------------------------------------------------------------------------------------|:----------------------------------------------------------------------------------------------------------------------------------------------|:----------------------------------------------------------------------------------------------------------------------------------------------|
258
+ | <code>The newer Punts are still very much in existence today and race in the same fleets as the older boats .</code> | <code>The newer punts are still very much in existence today and run in the same fleets as the older boats .</code> | <code>how can I get financial freedom as soon as possible?</code> |
259
+ | <code>The newer punts are still very much in existence today and run in the same fleets as the older boats .</code> | <code>The newer Punts are still very much in existence today and race in the same fleets as the older boats .</code> | <code>The older Punts are still very much in existence today and race in the same fleets as the newer boats .</code> |
260
+ | <code>Turner Valley , was at the Turner Valley Bar N Ranch Airport , southwest of the Turner Valley Bar N Ranch , Alberta , Canada .</code> | <code>Turner Valley , , was located at Turner Valley Bar N Ranch Airport , southwest of Turner Valley Bar N Ranch , Alberta , Canada .</code> | <code>Turner Valley Bar N Ranch Airport , , was located at Turner Valley Bar N Ranch , southwest of Turner Valley , Alberta , Canada .</code> |
261
  * Loss: <code>losses.CustomBCELoss</code>
262
 
263
  ### Training Logs
264
  | Epoch | Step | test_cosine_ndcg@10 |
265
  |:-----:|:----:|:-------------------:|
266
+ | -1 | -1 | 0.7730 |
267
 
268
 
269
  ### Framework Versions
model.safetensors CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:101e6d51a3bcf3b334098c04557b82cbb0cc0e364fd85abb022b2062eacab4e7
3
  size 298041696
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:95d02211c4cca89113f9f3e93ed91f5176bf50170faa2cb835f7bfea15bb9dd2
3
  size 298041696
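
The `model.safetensors` change above only swaps the `oid` in a Git LFS pointer; the size stays at 298041696 bytes. As a minimal sketch, assuming you have downloaded the weights to a local path, the pointer can be parsed and compared against the file as follows. The helper names `parse_lfs_pointer` and `matches_pointer` are hypothetical and not part of any library; only the pointer text is taken from this commit.

```python
import hashlib
import os

# Pointer text for the new model.safetensors, copied from this commit.
NEW_POINTER = """version https://git-lfs.github.com/spec/v1
oid sha256:95d02211c4cca89113f9f3e93ed91f5176bf50170faa2cb835f7bfea15bb9dd2
size 298041696
"""


def parse_lfs_pointer(text):
    """Parse the 'key value' lines of a Git LFS pointer file (hypothetical helper)."""
    fields = {}
    for line in text.strip().splitlines():
        key, _, value = line.strip().partition(" ")
        fields[key] = value
    # The oid field has the form '<hash algorithm>:<hex digest>'.
    algo, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "algo": algo,
        "digest": digest,
        "size": int(fields["size"]),
    }


def matches_pointer(path, pointer, chunk_size=1 << 20):
    """Return True if the file at `path` has the size and digest the pointer records."""
    if os.path.getsize(path) != pointer["size"]:
        return False
    hasher = hashlib.new(pointer["algo"])
    with open(path, "rb") as handle:
        # Read in chunks so a ~300 MB weights file is never held in memory at once.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest() == pointer["digest"]


pointer = parse_lfs_pointer(NEW_POINTER)
print(pointer["algo"], pointer["size"])
```

Calling `matches_pointer("model.safetensors", pointer)` on a local copy (the path is an assumption) would then indicate whether that copy corresponds to this commit rather than to the parent commit, whose pointer records a different `oid` with the same size.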