- bag-of-words
---

# opensearch-neural-sparse-encoding-doc-v3-gte

## Select the model

The model should be selected considering search relevance, model inference efficiency, and retrieval efficiency (FLOPS). We benchmark the models' performance on a subset of the BEIR benchmark: TrecCovid, NFCorpus, NQ, HotpotQA, FiQA, ArguAna, Touche, DBPedia, SCIDOCS, FEVER, Climate FEVER, SciFact, and Quora.

| Model | Inference-free for Retrieval | Model Parameters | AVG NDCG@10 | AVG FLOPS |
|-------|------------------------------|------------------|-------------|-----------|
| [opensearch-neural-sparse-encoding-doc-v2-distill](https://huggingface.co/opensearch-project/opensearch-neural-sparse-encoding-doc-v2-distill) | ✔️ | 67M | 0.504 | 1.8 |
| [opensearch-neural-sparse-encoding-doc-v2-mini](https://huggingface.co/opensearch-project/opensearch-neural-sparse-encoding-doc-v2-mini) | ✔️ | 23M | 0.497 | 1.7 |
| [opensearch-neural-sparse-encoding-doc-v3-distill](https://huggingface.co/opensearch-project/opensearch-neural-sparse-encoding-doc-v3-distill) | ✔️ | 67M | 0.517 | 1.8 |
| [opensearch-neural-sparse-encoding-doc-v3-gte](https://huggingface.co/opensearch-project/opensearch-neural-sparse-encoding-doc-v3-gte) | ✔️ | 133M | 0.546 | 1.7 |

## Overview

- **Paper**:
  - [Exploring $\ell_0$ Sparsification for Inference-free Sparse Retrievers](https://arxiv.org/abs/2504.14839)
  - [Towards Competitive Search Relevance For Inference-Free Learned Sparse Retrievers](https://arxiv.org/abs/2411.04403)
- **Code**: [opensearch-sparse-model-tuning-sample](https://github.com/zhichao-aws/opensearch-sparse-model-tuning-sample/tree/l0_enhance)

This is a learned sparse retrieval model. It encodes documents into 30522-dimensional **sparse vectors**. For queries, it uses only a tokenizer and a weight look-up table to generate sparse vectors, so no model inference is needed at query time. A non-zero dimension index corresponds to a token in the vocabulary, and its weight reflects the importance of that token. The similarity score is the inner product of the query and document sparse vectors.
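
To make the scoring concrete, here is a minimal sketch (illustrative only; the token ids and weights are made up, not model output) of the inner product between two such sparse vectors represented as `{token_id: weight}` maps:

```python
# Illustrative sketch: sparse vectors as {token_id: weight} dicts; only the
# dimensions that are non-zero in both vectors contribute to the score.
def sparse_inner_product(query_vec: dict, doc_vec: dict) -> float:
    return sum(weight * doc_vec[i] for i, weight in query_vec.items() if i in doc_vec)

# made-up token ids and weights, purely for illustration
query_vec = {2032: 5.77, 4633: 4.57}
doc_vec = {2032: 0.97, 4633: 1.04, 1999: 0.25}
print(sparse_inner_product(query_vec, doc_vec))  # ≈ 10.3497
```

The usage sample below computes the same kind of score with the real model, at vocabulary scale.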

```python
import json
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

# ... (the helper functions get_sparse_vector and transform_sparse_vector_to_dict
# from the full sample are omitted in this excerpt)

# download the idf file from model hub. idf is used to give weights for query tokens
def get_tokenizer_idf(tokenizer):
    from huggingface_hub import hf_hub_download
    local_cached_path = hf_hub_download(repo_id="opensearch-project/opensearch-neural-sparse-encoding-doc-v3-gte", filename="idf.json")
    with open(local_cached_path) as f:
        idf = json.load(f)
    idf_vector = [0] * tokenizer.vocab_size
    # fill in the weight of every token that appears in the idf table
    for token, weight in idf.items():
        _id = tokenizer._convert_token_to_id_with_added_voc(token)
        idf_vector[_id] = weight
    return torch.tensor(idf_vector)

# load the model
model = AutoModelForMaskedLM.from_pretrained("opensearch-project/opensearch-neural-sparse-encoding-doc-v3-gte", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("opensearch-project/opensearch-neural-sparse-encoding-doc-v3-gte")
idf = get_tokenizer_idf(tokenizer)

# set the special tokens and id_to_token transform for post-process
# ... (special-token setup and the query/document encoding steps are omitted in this excerpt)

# get similarity score
sim_score = torch.matmul(query_sparse_vector[0], document_sparse_vector[0])
print(sim_score)   # tensor(12.5747, grad_fn=<DotBackward0>)


query_token_weight = transform_sparse_vector_to_dict(query_sparse_vector)[0]
document_query_token_weight = transform_sparse_vector_to_dict(document_sparse_vector)[0]
for token in sorted(query_token_weight, key=lambda x: query_token_weight[x], reverse=True):
    if token in document_query_token_weight:
        print("score in query: %.4f, score in document: %.4f, token: %s" % (query_token_weight[token], document_query_token_weight[token], token))


# result:
# score in query: 5.7729, score in document: 0.9703, token: ny
# score in query: 4.5684, score in document: 1.0387, token: weather
# score in query: 3.5895, score in document: 0.5861, token: now
# score in query: 0.4989, score in document: 0.2494, token: in
```

The code sample above shows an example of neural sparse search. Although the original query and document share no overlapping tokens, the model still produces a good match.
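
The query side stays inference-free because a query vector can be built from the tokenizer and the idf look-up table alone. Here is a minimal sketch of that step; it is an assumption about what the elided part of the sample does, not verbatim code from it:

```python
import torch

# Sketch (assumed, not verbatim from the sample above): build the query's
# sparse vector by marking each query token's dimension, then scaling by idf.
def encode_query(query: str, tokenizer, idf: torch.Tensor) -> torch.Tensor:
    feature = tokenizer([query], padding=True, truncation=True, return_tensors="pt")
    query_vector = torch.zeros(1, tokenizer.vocab_size)
    # set a 1 at every dimension whose token appears in the query
    # (special tokens get marked too; the full sample zeroes them in post-process)
    query_vector[0, feature["input_ids"][0]] = 1
    # weight each activated token by its idf value; no model forward pass needed
    return query_vector * idf
```

The document side, by contrast, requires a forward pass through the model, which is why only retrieval (not indexing) is inference-free.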

| Model | Average | TrecCovid | NFCorpus | NQ | HotpotQA | FiQA | ArguAna | Touche | DBPedia | SCIDOCS | FEVER | Climate FEVER | SciFact | Quora |
|-------|---------|-----------|----------|----|----------|------|---------|--------|---------|---------|-------|---------------|---------|-------|
| [opensearch-neural-sparse-encoding-doc-v2-distill](https://huggingface.co/opensearch-project/opensearch-neural-sparse-encoding-doc-v2-distill) | 0.504 | 0.690 | 0.343 | 0.528 | 0.675 | 0.357 | 0.496 | 0.287 | 0.418 | 0.166 | 0.818 | 0.224 | 0.715 | 0.841 |
| [opensearch-neural-sparse-encoding-doc-v2-mini](https://huggingface.co/opensearch-project/opensearch-neural-sparse-encoding-doc-v2-mini) | 0.497 | 0.709 | 0.336 | 0.510 | 0.666 | 0.338 | 0.480 | 0.285 | 0.407 | 0.164 | 0.812 | 0.216 | 0.699 | 0.837 |
| [opensearch-neural-sparse-encoding-doc-v3-distill](https://huggingface.co/opensearch-project/opensearch-neural-sparse-encoding-doc-v3-distill) | 0.517 | 0.724 | 0.345 | 0.544 | 0.694 | 0.356 | 0.520 | 0.294 | 0.424 | 0.163 | 0.845 | 0.239 | 0.708 | 0.863 |
| [opensearch-neural-sparse-encoding-doc-v3-gte](https://huggingface.co/opensearch-project/opensearch-neural-sparse-encoding-doc-v3-gte) | 0.546 | 0.734 | 0.360 | 0.582 | 0.716 | 0.407 | 0.520 | 0.389 | 0.455 | 0.167 | 0.860 | 0.312 | 0.725 | 0.873 |

</div>

## License