@@ -1,6 +1,15 @@
 ---
 license: apache-2.0
 ---
 # PLaMo-Embedding-1B
 ## Model Overview
@@ -8,7 +17,7 @@ PLaMo-Embedding-1Bは、Preferred Networks, Inc. によって開発された日
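 Given Japanese text as input, the model converts it into a numerical vector that can be used for a wide range of applications, including information retrieval, text classification, and clustering.
-On [JMTEB](https://github.com/sbintuitions/JMTEB), a benchmark for Japanese text embeddings, it achieved top-level scores as of 2025/4/*.
 It performs especially well on retrieval tasks.
 PLaMo-Embedding-1B is released under the [Apache v2.0](https://www.apache.org/licenses/LICENSE-2.0) license and can be used freely, including for commercial purposes.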
@@ -16,6 +25,15 @@ PLaMo-Embedding-1Bは [Apache v2.0](https://www.apache.org/licenses/LICENSE-2.0)
 For technical details, see the following Tech Blog: [link]
 ## Usage
 ```python
 import torch
 import torch.nn.functional as F
@@ -25,6 +43,9 @@ from transformers import AutoModel, AutoTokenizer
 tokenizer = AutoTokenizer.from_pretrained("pfnet/plamo-embedding-1b", trust_remote_code=True)
 model = AutoModel.from_pretrained("pfnet/plamo-embedding-1b", trust_remote_code=True)
 query = "PLaMo-Embedding-1Bとは何ですか？"
 documents = [
     "PLaMo-Embedding-1Bは、Preferred Networks, Inc. によって開発された日本語テキスト埋め込みモデルです。",
@@ -46,23 +67,22 @@ print(similarities)
 # tensor([0.8812, 0.5533])
 ```
 ## Benchmark Results
 We evaluated performance using [JMTEB](https://github.com/sbintuitions/JMTEB), a benchmark for Japanese text embeddings.
 | Model                                         |Avg.      | Retrieval   | STS       | Classification   | Reranking   | Clustering   | PairClassification   |
 |:----------------------------------------------|:----------|:------------|:----------|:-----------------|:------------|:-------------|:---------------------|
 | [intfloat/multilingual-e5-large](https://huggingface.co/intfloat/multilingual-e5-large)       |70.90     | 70.98       | 79.70     | 72.89            | 92.96       | 51.24        | 62.15                |
-| [pkshatech/RoSEtta-base-ja](https://huggingface.co/pkshatech/RoSEtta-base-ja)   |72.04     | 73.21       | 81.39     | 72.41            | 92.69       | 53.23        | 61.74                |
-| [retrieva-jp/amber-large](https://huggingface.co/retrieva-jp/amber-large)   |72.06     | 71.71	       | 80.87   | 72.45	        | 93.29      | 51.59	        | **62.42**                |
 | [pkshatech/GLuCoSE-base-ja-v2](https://huggingface.co/pkshatech/GLuCoSE-base-ja-v2)   |72.23     | 73.36       | 82.96     | 74.21            | 93.01       | 48.65        | 62.37                |
-| [jinaai/jina-embeddings-v3](https://huggingface.co/jinaai/jina-embeddings-v3)     |73.44     | 75.22       | 80.05     | 76.39            | 92.71       | 52.46        | 62.37                |
 | [OpenAI/text-embedding-3-large](https://openai.com/index/new-embedding-models-and-api-updates/)      |74.05 | 74.48   | 82.52     | 77.58        | 93.58   | 53.32        | 62.35                |
 | [cl-nagoya/ruri-large-v2](https://huggingface.co/cl-nagoya/ruri-large-v2)     |74.55     | 76.34       | 83.17     | 77.18            | 93.21       | 52.14        | 62.27                |
 |[Sarashina-Embedding-v1-1B](https://huggingface.co/sbintuitions/sarashina-embedding-v1-1b)|75.50|77.61|82.71|**78.37**|**93.74**|**53.86**|62.00|
 |||
-|[**PLaMo-Embedding-1B**](https://huggingface.co/pfnet/plamo-embedding-1b) (This model) [^1]|**76.10**|**79.94**|**83.14**|77.20|93.57|53.47|62.37|
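-[^1]: Measured with a context length of 1024. The model supports context lengths up to 4096, but because training used context lengths of at most 1024, we measured at 1024. That said, evaluating at 4096 is known to have little effect on the average score (see the Tech Blog).
 ## Model Details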

 ---
+language:
+  - ja
+library_name: transformers
 license: apache-2.0
+pipeline_tag: sentence-similarity
+tags:
+  - feature-extraction
+  - sentence-similarity
+  - transformers
 ---
 # PLaMo-Embedding-1B
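 ## Model Overview
 Given Japanese text as input, the model converts it into a numerical vector that can be used for a wide range of applications, including information retrieval, text classification, and clustering.
+On [JMTEB](https://github.com/sbintuitions/JMTEB), a benchmark for Japanese text embeddings, it achieved top-level scores as of early April 2025.
 It performs especially well on retrieval tasks.
 PLaMo-Embedding-1B is released under the [Apache v2.0](https://www.apache.org/licenses/LICENSE-2.0) license and can be used freely, including for commercial purposes.
 For technical details, see the following Tech Blog: [link]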
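To make the retrieval use case concrete, here is a minimal sketch of ranking documents by cosine similarity to a query vector. It uses only the Python standard library, and the three-dimensional vectors are made-up toy values rather than model outputs. The helper names `cosine_similarity` and `rank_documents` are illustrative and are not part of this model's API.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_documents(query_vec, doc_vecs):
    """Return (index, score) pairs sorted from most to least similar."""
    scores = [(i, cosine_similarity(query_vec, d)) for i, d in enumerate(doc_vecs)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Toy vectors (assumed values; real embeddings would come from the model).
query_vec = [1.0, 0.0, 1.0]
doc_vecs = [
    [0.0, 1.0, 0.0],  # orthogonal to the query
    [1.0, 0.0, 1.0],  # same direction as the query
    [1.0, 1.0, 0.0],  # partially overlapping with the query
]

ranking = rank_documents(query_vec, doc_vecs)
print([i for i, _ in ranking])  # [1, 2, 0]
```

With real embeddings the same ranking step applies unchanged; only the vectors would be replaced by the model's outputs.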
 ## Usage
+### Requirements
+```
+sentencepiece
+torch
+transformers
+```
+### Sample code
 ```python
 import torch
 import torch.nn.functional as F
 tokenizer = AutoTokenizer.from_pretrained("pfnet/plamo-embedding-1b", trust_remote_code=True)
 model = AutoModel.from_pretrained("pfnet/plamo-embedding-1b", trust_remote_code=True)
+device = "cuda" if torch.cuda.is_available() else "cpu"
+model = model.to(device)
 query = "PLaMo-Embedding-1Bとは何ですか？"
 documents = [
     "PLaMo-Embedding-1Bは、Preferred Networks, Inc. によって開発された日本語テキスト埋め込みモデルです。",
 # tensor([0.8812, 0.5533])
 ```
+Note: `encode_document` and `encode_query` truncate any text that exceeds the model's maximum context length of 4096. In particular, `encode_query` adds a prefix internally, so its effective maximum context length is slightly shorter.
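The arithmetic behind this note can be sketched as follows. This is a standalone illustration, not the model's actual implementation. The real prefix length is not stated here, so `HYPOTHETICAL_PREFIX_TOKENS` is an assumed value, and keeping the leading tokens is also an assumption about which side is cut.

```python
MAX_CONTEXT_LENGTH = 4096  # maximum context length stated in this model card


def usable_length(prefix_tokens=0, max_length=MAX_CONTEXT_LENGTH):
    """Number of tokens left for the user's text after an internal prefix."""
    if prefix_tokens < 0 or prefix_tokens > max_length:
        raise ValueError("prefix_tokens must be between 0 and max_length")
    return max_length - prefix_tokens


def truncate(token_ids, prefix_tokens=0, max_length=MAX_CONTEXT_LENGTH):
    """Keep only the leading tokens that fit (assumed truncation side)."""
    return token_ids[: usable_length(prefix_tokens, max_length)]


# Hypothetical numbers: a 5000-token text and an assumed 10-token query prefix.
tokens = list(range(5000))
HYPOTHETICAL_PREFIX_TOKENS = 10

as_document = truncate(tokens)
as_query = truncate(tokens, prefix_tokens=HYPOTHETICAL_PREFIX_TOKENS)
print(len(as_document), len(as_query))  # 4096 4086
```

In practice, texts close to the limit may be worth splitting into chunks before encoding so that no content is silently dropped.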
 ## Benchmark Results
 We evaluated performance using [JMTEB](https://github.com/sbintuitions/JMTEB), a benchmark for Japanese text embeddings.
 | Model                                         |Avg.      | Retrieval   | STS       | Classification   | Reranking   | Clustering   | PairClassification   |
 |:----------------------------------------------|:----------|:------------|:----------|:-----------------|:------------|:-------------|:---------------------|
 | [intfloat/multilingual-e5-large](https://huggingface.co/intfloat/multilingual-e5-large)       |70.90     | 70.98       | 79.70     | 72.89            | 92.96       | 51.24        | 62.15                |
 | [pkshatech/GLuCoSE-base-ja-v2](https://huggingface.co/pkshatech/GLuCoSE-base-ja-v2)   |72.23     | 73.36       | 82.96     | 74.21            | 93.01       | 48.65        | 62.37                |
 | [OpenAI/text-embedding-3-large](https://openai.com/index/new-embedding-models-and-api-updates/)      |74.05 | 74.48   | 82.52     | 77.58        | 93.58   | 53.32        | 62.35                |
 | [cl-nagoya/ruri-large-v2](https://huggingface.co/cl-nagoya/ruri-large-v2)     |74.55     | 76.34       | 83.17     | 77.18            | 93.21       | 52.14        | 62.27                |
 |[Sarashina-Embedding-v1-1B](https://huggingface.co/sbintuitions/sarashina-embedding-v1-1b)|75.50|77.61|82.71|**78.37**|**93.74**|**53.86**|62.00|
 |||
+|[**PLaMo-Embedding-1B**](https://huggingface.co/pfnet/plamo-embedding-1b) (This model) (*)|**76.10**|**79.94**|**83.14**|77.20|93.57|53.47|62.37|
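+(*): Measured with a context length of 1024. The model supports context lengths up to 4096, but because training used context lengths of at most 1024, we measured at 1024. That said, evaluating at 4096 is known to have little effect on the average score (see the Tech Blog).
 ## Model Details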