Tom Aarsen committed · Commit 1828279
1 Parent(s): 0c579fa
Fix typo; update README script + specific MRL snippets; bold in table

README.md CHANGED

@@ -2902,23 +2902,25 @@ base_model:
-ModernBERT Embed is an embedding model trained from [ModernBERT-base](https://huggingface.co/answerdotai/ModernBERT-base),
-| Model
-| nomic-embed-text-v1
-| nomic-embed-text-v1.5 | 768
-| nomic-embed-text-v1.5 | 256
-You can use these models directly with the transformers library. Until the next transformers release, doing so requires installing transformers from main

@@ -2926,7 +2928,59 @@ pip install git+https://github.com/huggingface/transformers.git
-Most use cases, adding `search_query` to the query and `search_document` to the documents will be sufficient.

@@ -2935,48 +2989,95 @@ import torch
-    input_mask_expanded =
-sentences = ['search_query: What is TSNE?', 'search_query: Who is Laurens van der Maaten?']
-model.eval()
-model = SentenceTransformer(
-    "nomic-ai/modernbert-embed",
-)

# ModernBERT Embed

ModernBERT Embed is an embedding model trained from [ModernBERT-base](https://huggingface.co/answerdotai/ModernBERT-base), bringing the new advances of ModernBERT to embeddings!

Trained on the [Nomic Embed](https://arxiv.org/abs/2402.01613) weakly-supervised and supervised datasets, `modernbert-embed` also supports Matryoshka Representation Learning dimensions of 256, reducing memory by 3x with minimal performance loss.
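As a rough back-of-the-envelope illustration of that saving (a sketch assuming embeddings are stored as float32 and ignoring any index overhead; the numbers are purely illustrative):

```python
# Approximate storage for one million embeddings at full vs. truncated dimensionality.
num_vectors = 1_000_000
bytes_per_float = 4  # float32

full_dim, truncated_dim = 768, 256
full_gb = num_vectors * full_dim * bytes_per_float / 1e9            # ~3.07 GB
truncated_gb = num_vectors * truncated_dim * bytes_per_float / 1e9  # ~1.02 GB

print(f"768d: {full_gb:.2f} GB, 256d: {truncated_gb:.2f} GB ({full_gb / truncated_gb:.0f}x smaller)")
```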

## Performance

| Model                 | Dimensions | Average (56) | Classification (12) | Clustering (11) | Pair Classification (3) | Reranking (4) | Retrieval (15) | STS (10)  | Overall/Summ (1) |
|-----------------------|------------|--------------|---------------------|-----------------|-------------------------|---------------|----------------|-----------|------------------|
| nomic-embed-text-v1   | 768        | 62.4         | 74.1                | 43.9            | **85.2**                | 55.7          | 52.8           | 82.1      | 30.1             |
| nomic-embed-text-v1.5 | 768        | 62.28        | 73.55               | 43.93           | 84.61                   | **55.78**     | **53.01**      | **81.94** | 30.4             |
| modernbert-embed      | 768        | **62.62**    | **74.31**           | **44.98**       | 83.96                   | 56.42         | 52.89          | 81.78     | **31.39**        |
| nomic-embed-text-v1.5 | 256        | 61.04        | 72.1                | 43.16           | 84.09                   | 55.18         | 50.81          | 81.34     |                  |
| modernbert-embed      | 256        | 61.17        | 72.40               | 43.82           | 83.45                   | 55.69         | 50.62          | 81.12     | 31.27            |

## Usage

You can use these models directly with the transformers library. Until the next transformers release, doing so requires installing `transformers` from `main`:

```bash
pip install git+https://github.com/huggingface/transformers.git
```

Reminder: this model is trained similarly to Nomic Embed and **REQUIRES** prefixes to be added to the input. For more information, see the instructions in [Nomic Embed](https://huggingface.co/nomic-ai/nomic-embed-text-v1.5#task-instruction-prefixes).

For most use cases, adding `search_query: ` to the query and `search_document: ` to the documents will be sufficient.
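For example, a minimal sketch of adding those prefixes to otherwise plain strings before encoding (the example texts are the same ones used in the snippets below):

```python
queries = ["What is TSNE?", "Who is Laurens van der Maaten?"]
documents = ["TSNE is a dimensionality reduction algorithm created by Laurens van Der Maaten"]

# Prepend the task prefixes the model expects before embedding.
prefixed_queries = [f"search_query: {query}" for query in queries]
prefixed_documents = [f"search_document: {document}" for document in documents]
```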

### Sentence Transformers

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("nomic-ai/modernbert-embed")

query_embeddings = model.encode([
    "search_query: What is TSNE?",
    "search_query: Who is Laurens van der Maaten?",
])
doc_embeddings = model.encode([
    "search_document: TSNE is a dimensionality reduction algorithm created by Laurens van Der Maaten",
])
print(query_embeddings.shape, doc_embeddings.shape)
# (2, 768) (1, 768)

similarities = model.similarity(query_embeddings, doc_embeddings)
print(similarities)
# tensor([[0.7214],
#         [0.3260]])
```
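As a small follow-up, the similarity matrix can be turned into a per-query ranking; a purely illustrative sketch reusing `similarities` from the snippet above:

```python
# Rank documents for each query by descending similarity score.
ranking = similarities.argsort(dim=1, descending=True)
for query_idx, order in enumerate(ranking):
    best = order[0].item()
    print(f"Query {query_idx}: best document is {best} (score {similarities[query_idx, best].item():.4f})")
```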

<details><summary>Click to see Sentence Transformers usage with Matryoshka Truncation</summary>

In Sentence Transformers, you can truncate embeddings to a smaller dimension by using the `truncate_dim` parameter when loading the `SentenceTransformer` model.

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("nomic-ai/modernbert-embed", truncate_dim=256)

query_embeddings = model.encode([
    "search_query: What is TSNE?",
    "search_query: Who is Laurens van der Maaten?",
])
doc_embeddings = model.encode([
    "search_document: TSNE is a dimensionality reduction algorithm created by Laurens van Der Maaten",
])
print(query_embeddings.shape, doc_embeddings.shape)
# (2, 256) (1, 256)

similarities = model.similarity(query_embeddings, doc_embeddings)
print(similarities)
# tensor([[0.7759],
#         [0.3419]])
```

Note the small differences compared to the full 768-dimensional similarities.

</details>

### Transformers

```python
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel


def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0]
    input_mask_expanded = (
        attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    )
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(
        input_mask_expanded.sum(1), min=1e-9
    )


queries = ["search_query: What is TSNE?", "search_query: Who is Laurens van der Maaten?"]
documents = ["search_document: TSNE is a dimensionality reduction algorithm created by Laurens van Der Maaten"]

tokenizer = AutoTokenizer.from_pretrained("nomic-ai/modernbert-embed")
model = AutoModel.from_pretrained("nomic-ai/modernbert-embed")

encoded_queries = tokenizer(queries, padding=True, truncation=True, return_tensors="pt")
encoded_documents = tokenizer(documents, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    queries_outputs = model(**encoded_queries)
    documents_outputs = model(**encoded_documents)

query_embeddings = mean_pooling(queries_outputs, encoded_queries["attention_mask"])
query_embeddings = F.normalize(query_embeddings, p=2, dim=1)
doc_embeddings = mean_pooling(documents_outputs, encoded_documents["attention_mask"])
doc_embeddings = F.normalize(doc_embeddings, p=2, dim=1)
print(query_embeddings.shape, doc_embeddings.shape)
# torch.Size([2, 768]) torch.Size([1, 768])

similarities = query_embeddings @ doc_embeddings.T
print(similarities)
# tensor([[0.7214],
#         [0.3260]])
```
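Since both embedding matrices are L2-normalized above, the matrix product is already cosine similarity; a quick sanity-check sketch reusing the tensors from the snippet above:

```python
# Explicit cosine similarity should match the matrix product of the normalized embeddings.
cosine = F.cosine_similarity(query_embeddings.unsqueeze(1), doc_embeddings.unsqueeze(0), dim=-1)
print(torch.allclose(cosine, similarities, atol=1e-6))
# True
```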

<details><summary>Click to see Transformers usage with Matryoshka Truncation</summary>

In `transformers`, you can truncate embeddings to a smaller dimension by slicing the mean pooled embeddings prior to normalization.

```python
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel


def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0]
    input_mask_expanded = (
        attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    )
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(
        input_mask_expanded.sum(1), min=1e-9
    )


queries = ["search_query: What is TSNE?", "search_query: Who is Laurens van der Maaten?"]
documents = ["search_document: TSNE is a dimensionality reduction algorithm created by Laurens van Der Maaten"]

tokenizer = AutoTokenizer.from_pretrained("nomic-ai/modernbert-embed")
model = AutoModel.from_pretrained("nomic-ai/modernbert-embed")
truncate_dim = 256

encoded_queries = tokenizer(queries, padding=True, truncation=True, return_tensors="pt")
encoded_documents = tokenizer(documents, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    queries_outputs = model(**encoded_queries)
    documents_outputs = model(**encoded_documents)

query_embeddings = mean_pooling(queries_outputs, encoded_queries["attention_mask"])
query_embeddings = query_embeddings[:, :truncate_dim]
query_embeddings = F.normalize(query_embeddings, p=2, dim=1)
doc_embeddings = mean_pooling(documents_outputs, encoded_documents["attention_mask"])
doc_embeddings = doc_embeddings[:, :truncate_dim]
doc_embeddings = F.normalize(doc_embeddings, p=2, dim=1)
print(query_embeddings.shape, doc_embeddings.shape)
# torch.Size([2, 256]) torch.Size([1, 256])

similarities = query_embeddings @ doc_embeddings.T
print(similarities)
# tensor([[0.7759],
#         [0.3419]])
```

Note the small differences compared to the full 768-dimensional similarities.

</details>

## Training