- bag-of-words
---

# opensearch-neural-sparse-encoding-doc-v3-gte

## Select the model

The model should be selected considering search relevance, model inference efficiency, and retrieval efficiency (FLOPS). We benchmark the models' performance on a subset of the BEIR benchmark: TrecCovid, NFCorpus, NQ, HotpotQA, FiQA, ArguAna, Touche, DBPedia, SCIDOCS, FEVER, Climate FEVER, SciFact, and Quora.

| Model | Inference-free for Retrieval | Model Parameters | AVG NDCG@10 | AVG FLOPS |
|-------|------------------------------|------------------|-------------|-----------|
| [opensearch-neural-sparse-encoding-doc-v2-distill](https://huggingface.co/opensearch-project/opensearch-neural-sparse-encoding-doc-v2-distill) | ✔️ | 67M | 0.504 | 1.8 |
| [opensearch-neural-sparse-encoding-doc-v2-mini](https://huggingface.co/opensearch-project/opensearch-neural-sparse-encoding-doc-v2-mini) | ✔️ | 23M | 0.497 | 1.7 |
| [opensearch-neural-sparse-encoding-doc-v3-distill](https://huggingface.co/opensearch-project/opensearch-neural-sparse-encoding-doc-v3-distill) | ✔️ | 67M | 0.517 | 1.8 |
| [opensearch-neural-sparse-encoding-doc-v3-gte](https://huggingface.co/opensearch-project/opensearch-neural-sparse-encoding-doc-v3-gte) | ✔️ | 133M | 0.546 | 1.7 |

## Overview

- **Paper**:
  - [Exploring $\ell_0$ Sparsification for Inference-free Sparse Retrievers](https://arxiv.org/abs/2504.14839)
  - [Towards Competitive Search Relevance For Inference-Free Learned Sparse Retrievers](https://arxiv.org/abs/2411.04403)
- **Code**: [opensearch-sparse-model-tuning-sample](https://github.com/zhichao-aws/opensearch-sparse-model-tuning-sample/tree/l0_enhance)

This is a learned sparse retrieval model. It encodes documents into 30522-dimensional **sparse vectors**. For queries, it uses only a tokenizer and a weight look-up table to generate sparse vectors, so no model inference is needed at query time. A non-zero dimension index corresponds to a token in the vocabulary, and its weight reflects the importance of that token. The similarity score is the inner product of the query and document sparse vectors.
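
To make the scoring concrete, here is a minimal sketch (illustrative only; the token ids and weights are made up, not model output) of the inner product between two such sparse vectors represented as `{token_id: weight}` maps:

```python
# Illustrative sketch: sparse vectors as {token_id: weight} dicts; only the
# dimensions that are non-zero in both vectors contribute to the score.
def sparse_inner_product(query_vec: dict, doc_vec: dict) -> float:
    return sum(weight * doc_vec[i] for i, weight in query_vec.items() if i in doc_vec)

# made-up token ids and weights, purely for illustration
query_vec = {2032: 5.77, 4633: 4.57}
doc_vec = {2032: 0.97, 4633: 1.04, 1999: 0.25}
print(sparse_inner_product(query_vec, doc_vec))  # ≈ 10.3497
```

The usage sample below computes the same kind of score with the real model, at vocabulary scale.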

```python
import json
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

# ... (the helper functions get_sparse_vector and transform_sparse_vector_to_dict
# from the full sample are omitted in this excerpt)

# download the idf file from model hub. idf is used to give weights for query tokens
def get_tokenizer_idf(tokenizer):
    from huggingface_hub import hf_hub_download
    local_cached_path = hf_hub_download(repo_id="opensearch-project/opensearch-neural-sparse-encoding-doc-v3-gte", filename="idf.json")
    with open(local_cached_path) as f:
        idf = json.load(f)
    idf_vector = [0] * tokenizer.vocab_size
    # fill in the weight of every token that appears in the idf table
    for token, weight in idf.items():
        _id = tokenizer._convert_token_to_id_with_added_voc(token)
        idf_vector[_id] = weight
    return torch.tensor(idf_vector)

# load the model
model = AutoModelForMaskedLM.from_pretrained("opensearch-project/opensearch-neural-sparse-encoding-doc-v3-gte", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("opensearch-project/opensearch-neural-sparse-encoding-doc-v3-gte")
idf = get_tokenizer_idf(tokenizer)

# set the special tokens and id_to_token transform for post-process
# ... (special-token setup and the query/document encoding steps are omitted in this excerpt)

# get similarity score
sim_score = torch.matmul(query_sparse_vector[0], document_sparse_vector[0])
print(sim_score)   # tensor(12.5747, grad_fn=<DotBackward0>)


query_token_weight = transform_sparse_vector_to_dict(query_sparse_vector)[0]
document_query_token_weight = transform_sparse_vector_to_dict(document_sparse_vector)[0]
for token in sorted(query_token_weight, key=lambda x: query_token_weight[x], reverse=True):
    if token in document_query_token_weight:
        print("score in query: %.4f, score in document: %.4f, token: %s" % (query_token_weight[token], document_query_token_weight[token], token))


# result:
# score in query: 5.7729, score in document: 0.9703, token: ny
# score in query: 4.5684, score in document: 1.0387, token: weather
# score in query: 3.5895, score in document: 0.5861, token: now
# score in query: 0.4989, score in document: 0.2494, token: in
```

The code sample above shows an example of neural sparse search. Although the original query and document share no overlapping tokens, the model still produces a good match.
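
The query side stays inference-free because a query vector can be built from the tokenizer and the idf look-up table alone. Here is a minimal sketch of that step; it is an assumption about what the elided part of the sample does, not verbatim code from it:

```python
import torch

# Sketch (assumed, not verbatim from the sample above): build the query's
# sparse vector by marking each query token's dimension, then scaling by idf.
def encode_query(query: str, tokenizer, idf: torch.Tensor) -> torch.Tensor:
    feature = tokenizer([query], padding=True, truncation=True, return_tensors="pt")
    query_vector = torch.zeros(1, tokenizer.vocab_size)
    # set a 1 at every dimension whose token appears in the query
    # (special tokens get marked too; the full sample zeroes them in post-process)
    query_vector[0, feature["input_ids"][0]] = 1
    # weight each activated token by its idf value; no model forward pass needed
    return query_vector * idf
```

The document side, by contrast, requires a forward pass through the model, which is why only retrieval (not indexing) is inference-free.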

| Model | Average | TrecCovid | NFCorpus | NQ | HotpotQA | FiQA | ArguAna | Touche | DBPedia | SCIDOCS | FEVER | Climate FEVER | SciFact | Quora |
|-------|---------|-----------|----------|----|----------|------|---------|--------|---------|---------|-------|---------------|---------|-------|
| [opensearch-neural-sparse-encoding-doc-v2-distill](https://huggingface.co/opensearch-project/opensearch-neural-sparse-encoding-doc-v2-distill) | 0.504 | 0.690 | 0.343 | 0.528 | 0.675 | 0.357 | 0.496 | 0.287 | 0.418 | 0.166 | 0.818 | 0.224 | 0.715 | 0.841 |
| [opensearch-neural-sparse-encoding-doc-v2-mini](https://huggingface.co/opensearch-project/opensearch-neural-sparse-encoding-doc-v2-mini) | 0.497 | 0.709 | 0.336 | 0.510 | 0.666 | 0.338 | 0.480 | 0.285 | 0.407 | 0.164 | 0.812 | 0.216 | 0.699 | 0.837 |
| [opensearch-neural-sparse-encoding-doc-v3-distill](https://huggingface.co/opensearch-project/opensearch-neural-sparse-encoding-doc-v3-distill) | 0.517 | 0.724 | 0.345 | 0.544 | 0.694 | 0.356 | 0.520 | 0.294 | 0.424 | 0.163 | 0.845 | 0.239 | 0.708 | 0.863 |
| [opensearch-neural-sparse-encoding-doc-v3-gte](https://huggingface.co/opensearch-project/opensearch-neural-sparse-encoding-doc-v3-gte) | 0.546 | 0.734 | 0.360 | 0.582 | 0.716 | 0.407 | 0.520 | 0.389 | 0.455 | 0.167 | 0.860 | 0.312 | 0.725 | 0.873 |

</div>

## License