Update README.md
README.md
CHANGED
@@ -116,6 +116,40 @@ for doc, score in doc_score_pairs:
 print(score, doc)
 ```
 
+## Usage (Text Embeddings Inference (TEI))
+
+[Text Embeddings Inference (TEI)](https://github.com/huggingface/text-embeddings-inference) is a blazing fast inference solution for text embeddings models.
+
+- CPU:
+```bash
+docker run -p 8080:80 -v hf_cache:/data --pull always ghcr.io/huggingface/text-embeddings-inference:cpu-latest \
+--model-id sentence-transformers/multi-qa-mpnet-base-dot-v1 \
+--pooling cls \
+--dtype float16
+```
+
+- NVIDIA GPU:
+```bash
+docker run --gpus all -p 8080:80 -v hf_cache:/data --pull always ghcr.io/huggingface/text-embeddings-inference:cuda-latest \
+--model-id sentence-transformers/multi-qa-mpnet-base-dot-v1 \
+--pooling cls \
+--dtype float16
+```
+
+Send a request to `/v1/embeddings` to generate embeddings via the [OpenAI Embeddings API](https://platform.openai.com/docs/api-reference/embeddings/create):
+```bash
+curl http://localhost:8080/v1/embeddings \
+-H "Content-Type: application/json" \
+-d '{
+"model": "sentence-transformers/multi-qa-mpnet-base-dot-v1",
+"input": "How many people live in London?"
+}'
+```
+
+Or check the [Text Embeddings Inference API specification](https://huggingface.github.io/text-embeddings-inference/) instead.
+
+----
+
 ## Technical Details
 
 In the following some technical details how this model must be used:
@@ -129,7 +163,6 @@ In the following some technical details how this model must be used:
 
 ----
 
-
 ## Background
 
 The project aims to train sentence embedding models on very large sentence level datasets using a self-supervised
@@ -146,8 +179,6 @@ Our model is intended to be used for semantic search: It encodes queries / quest
 
 Note that there is a limit of 512 word pieces: Text longer than that will be truncated. Further note that the model was just trained on input text up to 250 word pieces. It might not work well for longer text.
 
-
-
 ## Training procedure
 
 The full training script is accessible in this current repository: `train_script.py`.
@@ -163,9 +194,6 @@ We sampled each dataset given a weighted probability which configuration is deta
 
 The model was trained with [MultipleNegativesRankingLoss](https://www.sbert.net/docs/package_reference/losses.html#multiplenegativesrankingloss) using CLS-pooling, dot-product as similarity function, and a scale of 1.
 
-
-
-
 | Dataset | Number of training tuples |
 |--------------------------------------------------------|:--------------------------:|
 | [WikiAnswers](https://github.com/afader/oqa#wikianswers-corpus) Duplicate question pairs from WikiAnswers | 77,427,422 |
@@ -185,36 +213,4 @@ The model was trained with [MultipleNegativesRankingLoss](https://www.sbert.net/
 | [Natural Questions (NQ)](https://ai.google.com/research/NaturalQuestions) (Question, Paragraph) pairs for 100k real Google queries with relevant Wikipedia paragraph | 100,231 |
 | [SQuAD2.0](https://rajpurkar.github.io/SQuAD-explorer/) (Question, Paragraph) pairs from SQuAD2.0 dataset | 87,599 |
 | [TriviaQA](https://huggingface.co/datasets/trivia_qa) (Question, Evidence) pairs | 73,346 |
-| **Total** | **214,988,242** |
-
-## Usage (Text Embeddings Inference (TEI))
-
-[Text Embeddings Inference (TEI)](https://github.com/huggingface/text-embeddings-inference) is a blazing fast inference solution for text embeddings models.
-
-- CPU:
-```bash
-docker run -p 8080:80 -v hf_cache:/data --pull always ghcr.io/huggingface/text-embeddings-inference:cpu-latest \
---model-id sentence-transformers/multi-qa-mpnet-base-dot-v1 \
---pooling cls \
---dtype float16
-```
-
-- NVIDIA GPU:
-```bash
-docker run --gpus all -p 8080:80 -v hf_cache:/data --pull always ghcr.io/huggingface/text-embeddings-inference:cuda-latest \
---model-id sentence-transformers/multi-qa-mpnet-base-dot-v1 \
---pooling cls \
---dtype float16
-```
-
-Send a request to `/v1/embeddings` to generate embeddings via the [OpenAI Embeddings API](https://platform.openai.com/docs/api-reference/embeddings/create):
-```bash
-curl http://localhost:8080/v1/embeddings \
--H "Content-Type: application/json" \
--d '{
-"model": "sentence-transformers/multi-qa-mpnet-base-dot-v1",
-"input": "How many people live in London?"
-}'
-```
-
-Or check the [Text Embeddings Inference API specification](https://huggingface.github.io/text-embeddings-inference/) instead.
+| **Total** | **214,988,242** |
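
The `curl` request that this change moves up in the README can also be issued from Python. The following is a minimal sketch using only the standard library. The helper names `build_request` and `embed` are hypothetical (they are not part of TEI or of this repository), and the sketch assumes that a TEI container is already listening on `localhost:8080`, as started by the `docker run` commands, and that the response follows the OpenAI embeddings schema (`data[0].embedding`).

```python
import json
import urllib.request

# Assumption: TEI is reachable at this address, matching `-p 8080:80` in the docker commands.
TEI_URL = "http://localhost:8080/v1/embeddings"
MODEL_ID = "sentence-transformers/multi-qa-mpnet-base-dot-v1"


def build_request(text, url=TEI_URL, model=MODEL_ID):
    """Build (but do not send) a POST request mirroring the curl example."""
    payload = json.dumps({"model": model, "input": text}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def embed(text):
    """Send the request and return the first embedding vector.

    Assumes an OpenAI-style response body: {"data": [{"embedding": [...]}]}.
    """
    with urllib.request.urlopen(build_request(text)) as response:
        body = json.load(response)
    return body["data"][0]["embedding"]
```

With a server running, `embed("How many people live in London?")` should return a list of floats. Keeping request construction separate from sending makes the payload easy to inspect without a live container.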