radoslavralev committed
Commit e17d3f5 · verified · 1 Parent(s): 4dcd005

Add new SentenceTransformer model

Files changed (2)
  1. README.md +90 -115
  2. model.safetensors +1 -1
README.md CHANGED
@@ -12,58 +12,42 @@ tags:
   - retrieval
   - reranking
   - generated_from_trainer
- - dataset_size:1047690
- - loss:CoSENTLoss
+ - dataset_size:478600
+ - loss:MultipleNegativesSymmetricRankingLoss
   base_model: Alibaba-NLP/gte-modernbert-base
   widget:
- - source_sentence: That is evident from their failure , three times in a row , to
-     get a big enough turnout to elect a president .
+ - source_sentence: The brown dog is sniffing the back of a small black dog
     sentences:
- - 'given a text, decide to which of a predefined set of classes it belongs. examples:
-     language identification, genre classification, sentiment analysis, and spam detection'
- - Three times in a row , they failed to get a big _ enough turnout to elect a president
-     .
- - He said the Government still did not know the real reason the original Saudi buyer
-     pulled out on August 21 .
- - source_sentence: these use built-in and learned knowledge to make decisions and
-     accomplish tasks that fulfill the intentions of the user.
+ - Pickens died in Edgefield and was buried on the Willow Brook Cemetery in Edgefield
+     , South Carolina .
+ - It is notable as the oldest Chinatown in Australia , the oldest continuous Chinese
+     settlement in Australia , and the longest continuously running Chinatown outside
+     of Asia .
+ - There is no large brown dog and small grey dog standing on a rocky surface
+ - source_sentence: Is it harmful from security perspectives to use public Wi-Fi?
     sentences:
- - It also features a 4.5 in back-lit LCD screen and memory expansion facilities
-     .
- - '- set of interrelated components - collect, process, store and distribute info.
-     - support decision-making, coordination, and control'
- - software programs that work without direct human intervention to carry out specific
-     tasks for an individual user, business process, or software application -siri
-     adapts to your preferences over time
- - source_sentence: any location in storage can be accessed at any moment in approximately
-     the same amount of time.
+ - What is the best way to drive traffic to a website?
+ - What startups have used GitHub?
+ - Is there something wrong with using public Wi-Fi?
+ - source_sentence: How can we make education better?
     sentences:
- - your study can adopt the original model used by the cited theorist but you can
-     modify different variables depending on your study of the whole theory
- - an access method that can access any storage location directly and in any order;
-     primary storage devices and disk storage devices use random access...
- - Branson said that his preference would be to operate a fully commercial service
-     on routes to New York , Barbados and Dubai .
- - source_sentence: United issued a statement saying it will " work professionally
-     and cooperatively with all its unions . "
+ - What are some things that would make education better today?
+ - Mistery works full-time as a graffiti artist and is also Emcee / Rapper in the
+     Brethren group .
+ - Jammu Airport operates flights to many cities in India such as Delhi , Leh and
+     Srinagar .
+ - source_sentence: So are you.
     sentences:
- - network that acts like the human brain; type of ai
- - a database system consists of one or more databases and a database management
-     system (dbms).
- - Senior vice president Sara Fields said the airline " will work professionally
-     and cooperatively with all our unions . "
- - source_sentence: A European Union spokesman said the Commission was consulting EU
-     member states " with a view to taking appropriate action if necessary " on the
-     matter .
+ - 'Brown said afterwards that he was surprised they had not scored five , and Astall
+     wrote in his newspaper column :'
+ - Just like yourself.
+ - How do I actually lose weight?
+ - source_sentence: A group of boys are playing with a ball in front of a large door
+     made of wood
     sentences:
- - Justice Minister Martin Cauchon and Prime Minister Jean Chretien both have said
-     the government will introduce legislation to decriminalize possession of small
-     amounts of pot .
- - Laos 's second most important export destination - said it was consulting EU member
-     states ' ' with a view to taking appropriate action if necessary ' ' on the matter
-     .
- - the form data assumes and the possible range of values that the attribute defined
-     as that type of data may express 1. text 2. numerical
+ - The children are playing in front of a large door
+ - What is the blind spot?
+ - What are some good techniques for controlling your anger?
   datasets:
   - redis/langcache-sentencepairs-v1
   pipeline_tag: sentence-similarity
@@ -88,28 +72,28 @@ model-index:
   type: val
   metrics:
   - type: cosine_accuracy
-   value: 0.7638310529446758
+   value: 0.9996860282574568
   name: Cosine Accuracy
   - type: cosine_accuracy_threshold
-   value: 0.8640533685684204
+   value: 0.4801735281944275
   name: Cosine Accuracy Threshold
   - type: cosine_f1
-   value: 0.6912742186395134
+   value: 0.9998429894802952
   name: Cosine F1
   - type: cosine_f1_threshold
-   value: 0.825770378112793
+   value: 0.4801735281944275
   name: Cosine F1 Threshold
   - type: cosine_precision
-   value: 0.6289243437982501
+   value: 1.0
   name: Cosine Precision
   - type: cosine_recall
-   value: 0.7673469387755102
+   value: 0.9996860282574568
   name: Cosine Recall
   - type: cosine_ap
-   value: 0.7353968345121902
+   value: 0.9999999999999999
   name: Cosine Ap
   - type: cosine_mcc
-   value: 0.4778469995044085
+   value: 0.0
   name: Cosine Mcc
   - task:
   type: binary-classification
@@ -119,28 +103,28 @@ model-index:
   type: test
   metrics:
   - type: cosine_accuracy
-   value: 0.7037777526966672
+   value: 0.9999627560521416
   name: Cosine Accuracy
   - type: cosine_accuracy_threshold
-   value: 0.8524033427238464
+   value: 0.42059871554374695
   name: Cosine Accuracy Threshold
   - type: cosine_f1
-   value: 0.7122170715871171
+   value: 0.9999813776792864
   name: Cosine F1
   - type: cosine_f1_threshold
-   value: 0.8118724822998047
+   value: 0.42059871554374695
   name: Cosine F1 Threshold
   - type: cosine_precision
-   value: 0.5989283084033827
+   value: 1.0
   name: Cosine Precision
   - type: cosine_recall
-   value: 0.8783612662942272
+   value: 0.9999627560521416
   name: Cosine Recall
   - type: cosine_ap
-   value: 0.6476665223951498
+   value: 1.0
   name: Cosine Ap
   - type: cosine_mcc
-   value: 0.44182914870985407
+   value: 0.0
   name: Cosine Mcc
   ---
 
@@ -194,9 +178,9 @@ from sentence_transformers import SentenceTransformer
  model = SentenceTransformer("redis/langcache-embed-v3")
  # Run inference
  sentences = [
-     'A European Union spokesman said the Commission was consulting EU member states " with a view to taking appropriate action if necessary " on the matter .',
-     "Laos 's second most important export destination - said it was consulting EU member states ' ' with a view to taking appropriate action if necessary ' ' on the matter .",
-     'the form data assumes and the possible range of values that the attribute defined as that type of data may express 1. text 2. numerical',
+     'A group of boys are playing with a ball in front of a large door made of wood',
+     'The children are playing in front of a large door',
+     'What are some good techniques for controlling your anger?',
  ]
  embeddings = model.encode(sentences)
  print(embeddings.shape)
@@ -205,9 +189,9 @@ print(embeddings.shape)
  # Get the similarity scores for the embeddings
  similarities = model.similarity(embeddings, embeddings)
  print(similarities)
- # tensor([[1.0078, 0.8789, 0.4961],
- #         [0.8789, 1.0000, 0.4648],
- #         [0.4961, 0.4648, 1.0078]], dtype=torch.bfloat16)
+ # tensor([[1.0000, 0.8672, 0.4121],
+ #         [0.8672, 1.0000, 0.4219],
+ #         [0.4121, 0.4219, 1.0000]], dtype=torch.bfloat16)
  ```
 
  <!--
@@ -243,16 +227,16 @@ You can finetune this model on your own dataset.
  * Datasets: `val` and `test`
  * Evaluated with [<code>BinaryClassificationEvaluator</code>](https://sbert.net/docs/package_reference/sentence_transformer/evaluation.html#sentence_transformers.evaluation.BinaryClassificationEvaluator)
 
- | Metric                    | val        | test       |
- |:--------------------------|:-----------|:-----------|
- | cosine_accuracy           | 0.7638     | 0.7038     |
- | cosine_accuracy_threshold | 0.8641     | 0.8524     |
- | cosine_f1                 | 0.6913     | 0.7122     |
- | cosine_f1_threshold       | 0.8258     | 0.8119     |
- | cosine_precision          | 0.6289     | 0.5989     |
- | cosine_recall             | 0.7673     | 0.8784     |
- | **cosine_ap**             | **0.7354** | **0.6477** |
- | cosine_mcc                | 0.4778     | 0.4418     |
+ | Metric                    | val     | test    |
+ |:--------------------------|:--------|:--------|
+ | cosine_accuracy           | 0.9997  | 1.0     |
+ | cosine_accuracy_threshold | 0.4802  | 0.4206  |
+ | cosine_f1                 | 0.9998  | 1.0     |
+ | cosine_f1_threshold       | 0.4802  | 0.4206  |
+ | cosine_precision          | 1.0     | 1.0     |
+ | cosine_recall             | 0.9997  | 1.0     |
+ | **cosine_ap**             | **1.0** | **1.0** |
+ | cosine_mcc                | 0.0     | 0.0     |
 
  <!--
  ## Bias, Risks and Limitations
@@ -273,24 +257,25 @@
  #### LangCache Sentence Pairs (all)
 
  * Dataset: [LangCache Sentence Pairs (all)](https://huggingface.co/datasets/redis/langcache-sentencepairs-v1)
- * Size: 8,405 training samples
- * Columns: <code>sentence1</code>, <code>sentence2</code>, and <code>label</code>
+ * Size: 26,850 training samples
+ * Columns: <code>sentence1</code> and <code>sentence2</code>
  * Approximate statistics based on the first 1000 samples:
-   |         | sentence1 | sentence2 | label |
-   |:--------|:----------|:----------|:------|
-   | type    | string    | string    | int   |
-   | details | <ul><li>min: 6 tokens</li><li>mean: 24.89 tokens</li><li>max: 50 tokens</li></ul> | <ul><li>min: 6 tokens</li><li>mean: 24.3 tokens</li><li>max: 43 tokens</li></ul> | <ul><li>0: ~45.80%</li><li>1: ~54.20%</li></ul> |
+   |         | sentence1 | sentence2 |
+   |:--------|:----------|:----------|
+   | type    | string    | string    |
+   | details | <ul><li>min: 4 tokens</li><li>mean: 16.76 tokens</li><li>max: 44 tokens</li></ul> | <ul><li>min: 4 tokens</li><li>mean: 16.58 tokens</li><li>max: 44 tokens</li></ul> |
  * Samples:
-   | sentence1 | sentence2 | label |
-   |:----------|:----------|:------|
-   | <code>He said the foodservice pie business doesn 't fit the company 's long-term growth strategy .</code> | <code>" The foodservice pie business does not fit our long-term growth strategy .</code> | <code>1</code> |
-   | <code>Magnarelli said Racicot hated the Iraqi regime and looked forward to using his long years of training in the war .</code> | <code>His wife said he was " 100 percent behind George Bush " and looked forward to using his years of training in the war .</code> | <code>0</code> |
-   | <code>The dollar was at 116.92 yen against the yen , flat on the session , and at 1.2891 against the Swiss franc , also flat .</code> | <code>The dollar was at 116.78 yen JPY = , virtually flat on the session , and at 1.2871 against the Swiss franc CHF = , down 0.1 percent .</code> | <code>0</code> |
- * Loss: [<code>CoSENTLoss</code>](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#cosentloss) with these parameters:
+   | sentence1 | sentence2 |
+   |:----------|:----------|
+   | <code>A chef is preparing a meal</code> | <code>Some food is being prepared by a chef</code> |
+   | <code>The presentation is being watched by a classroom of students</code> | <code>A classroom is full of students</code> |
+   | <code>Garden River , located north of Garden River Airport , Alberta , Canada .</code> | <code>Garden River , , is located north of Garden River Airport , Alberta , Canada .</code> |
+ * Loss: [<code>MultipleNegativesSymmetricRankingLoss</code>](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#multiplenegativessymmetricrankingloss) with these parameters:
  ```json
  {
      "scale": 20.0,
-     "similarity_fct": "pairwise_cos_sim"
+     "similarity_fct": "cos_sim",
+     "gather_across_devices": false
  }
  ```
 
@@ -299,31 +284,32 @@
  #### LangCache Sentence Pairs (all)
 
  * Dataset: [LangCache Sentence Pairs (all)](https://huggingface.co/datasets/redis/langcache-sentencepairs-v1)
- * Size: 8,405 evaluation samples
- * Columns: <code>sentence1</code>, <code>sentence2</code>, and <code>label</code>
+ * Size: 26,850 evaluation samples
+ * Columns: <code>sentence1</code> and <code>sentence2</code>
  * Approximate statistics based on the first 1000 samples:
-   |         | sentence1 | sentence2 | label |
-   |:--------|:----------|:----------|:------|
-   | type    | string    | string    | int   |
-   | details | <ul><li>min: 6 tokens</li><li>mean: 24.89 tokens</li><li>max: 50 tokens</li></ul> | <ul><li>min: 6 tokens</li><li>mean: 24.3 tokens</li><li>max: 43 tokens</li></ul> | <ul><li>0: ~45.80%</li><li>1: ~54.20%</li></ul> |
+   |         | sentence1 | sentence2 |
+   |:--------|:----------|:----------|
+   | type    | string    | string    |
+   | details | <ul><li>min: 4 tokens</li><li>mean: 16.76 tokens</li><li>max: 44 tokens</li></ul> | <ul><li>min: 4 tokens</li><li>mean: 16.58 tokens</li><li>max: 44 tokens</li></ul> |
  * Samples:
-   | sentence1 | sentence2 | label |
-   |:----------|:----------|:------|
-   | <code>He said the foodservice pie business doesn 't fit the company 's long-term growth strategy .</code> | <code>" The foodservice pie business does not fit our long-term growth strategy .</code> | <code>1</code> |
-   | <code>Magnarelli said Racicot hated the Iraqi regime and looked forward to using his long years of training in the war .</code> | <code>His wife said he was " 100 percent behind George Bush " and looked forward to using his years of training in the war .</code> | <code>0</code> |
-   | <code>The dollar was at 116.92 yen against the yen , flat on the session , and at 1.2891 against the Swiss franc , also flat .</code> | <code>The dollar was at 116.78 yen JPY = , virtually flat on the session , and at 1.2871 against the Swiss franc CHF = , down 0.1 percent .</code> | <code>0</code> |
- * Loss: [<code>CoSENTLoss</code>](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#cosentloss) with these parameters:
+   | sentence1 | sentence2 |
+   |:----------|:----------|
+   | <code>A chef is preparing a meal</code> | <code>Some food is being prepared by a chef</code> |
+   | <code>The presentation is being watched by a classroom of students</code> | <code>A classroom is full of students</code> |
+   | <code>Garden River , located north of Garden River Airport , Alberta , Canada .</code> | <code>Garden River , , is located north of Garden River Airport , Alberta , Canada .</code> |
+ * Loss: [<code>MultipleNegativesSymmetricRankingLoss</code>](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#multiplenegativessymmetricrankingloss) with these parameters:
  ```json
  {
      "scale": 20.0,
-     "similarity_fct": "pairwise_cos_sim"
+     "similarity_fct": "cos_sim",
+     "gather_across_devices": false
  }
  ```
 
  ### Training Logs
  | Epoch | Step | val_cosine_ap | test_cosine_ap |
  |:-----:|:----:|:-------------:|:--------------:|
- | -1    | -1   | 0.7354        | 0.6477         |
+ | -1    | -1   | 1.0000        | 1.0            |
 
 
  ### Framework Versions
@@ -352,17 +338,6 @@
  }
  ```
 
- #### CoSENTLoss
- ```bibtex
- @online{kexuefm-8847,
-     title={CoSENT: A more efficient sentence vector scheme than Sentence-BERT},
-     author={Su Jianlin},
-     year={2022},
-     month={Jan},
-     url={https://kexue.fm/archives/8847},
- }
- ```
-
  <!--
  ## Glossary
 
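The updated usage snippet prints the similarity matrix in `torch.bfloat16`, which rounds coarsely; the previous card even showed self-similarities of 1.0078. A minimal sketch, not part of the card itself, of the same check with an explicit float32 cosine similarity via `sentence_transformers.util`:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("redis/langcache-embed-v3")
sentences = [
    "A group of boys are playing with a ball in front of a large door made of wood",
    "The children are playing in front of a large door",
    "What are some good techniques for controlling your anger?",
]

# Encode to tensors, cast to float32, and compute the pairwise cosine-similarity matrix.
embeddings = model.encode(sentences, convert_to_tensor=True).float()
print(util.cos_sim(embeddings, embeddings))  # diagonal is 1.0 up to float32 rounding
```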
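One reading of the new evaluation table: `BinaryClassificationEvaluator` reports precision 1.0, AP 1.0, and MCC 0.0 together when the labeled pairs are (nearly) all positive, so the updated `val`/`test` scores likely reflect a positives-only split rather than a perfect classifier. A hedged sketch of running the evaluator yourself; the pairs below are illustrative stand-ins taken from the card's samples, not the real splits:

```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.evaluation import BinaryClassificationEvaluator

model = SentenceTransformer("redis/langcache-embed-v3")

# Illustrative pairs only; the card's real splits come from redis/langcache-sentencepairs-v1.
evaluator = BinaryClassificationEvaluator(
    sentences1=["A chef is preparing a meal", "So are you."],
    sentences2=["Some food is being prepared by a chef", "How do I actually lose weight?"],
    labels=[1, 0],  # 1 = paraphrase, 0 = not a paraphrase
    name="val",
)
print(evaluator(model))  # metrics dict: cosine accuracy/F1/precision/recall/AP/MCC
```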
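The loss change tracks the schema change: CoSENTLoss needs labeled pairs, while MultipleNegativesSymmetricRankingLoss needs only positive (`sentence1`, `sentence2`) pairs, treating the other in-batch sentences as negatives and scoring in both directions. A minimal training sketch under stated assumptions (a tiny illustrative dataset in place of the real 26,850-sample split, and default trainer settings):

```python
from datasets import Dataset
from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer, util
from sentence_transformers.losses import MultipleNegativesSymmetricRankingLoss

model = SentenceTransformer("Alibaba-NLP/gte-modernbert-base")

# Illustrative positive pairs; no label column is needed for this loss.
train_dataset = Dataset.from_dict({
    "sentence1": ["A chef is preparing a meal",
                  "The presentation is being watched by a classroom of students"],
    "sentence2": ["Some food is being prepared by a chef",
                  "A classroom is full of students"],
})

# Mirrors the card's parameters: {"scale": 20.0, "similarity_fct": "cos_sim"}.
loss = MultipleNegativesSymmetricRankingLoss(model, scale=20.0, similarity_fct=util.cos_sim)

trainer = SentenceTransformerTrainer(model=model, train_dataset=train_dataset, loss=loss)
trainer.train()
```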
model.safetensors CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:db6feb45cb2c9f0ca7da393bb058316538cac11e31b6ef6fc4cb21e299b1e346
+ oid sha256:95d02211c4cca89113f9f3e93ed91f5176bf50170faa2cb835f7bfea15bb9dd2
  size 298041696
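The `model.safetensors` entry is a Git LFS pointer: the `oid` is the SHA-256 digest of the weight file's contents and `size` is its length in bytes. A small sketch, not part of the commit, for verifying a downloaded copy against this pointer:

```python
import hashlib
import os

EXPECTED_OID = "95d02211c4cca89113f9f3e93ed91f5176bf50170faa2cb835f7bfea15bb9dd2"
EXPECTED_SIZE = 298041696

def matches_pointer(path: str) -> bool:
    """Check a local file against the LFS pointer's sha256 oid and byte size."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):  # read in 1 MiB chunks
            digest.update(chunk)
    return digest.hexdigest() == EXPECTED_OID and os.path.getsize(path) == EXPECTED_SIZE

print(matches_pointer("model.safetensors"))
```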