---
language:
- ko
license: apache-2.0
tags:
- sentence-transformers
- sentence-similarity
- transformers
---
|
|
|
## PwC-Embedding-expr
|
|
|
We trained the **PwC-Embedding-expr** model on top of the [multilingual-e5-large-instruct](https://huggingface.co/intfloat/multilingual-e5-large-instruct) embedding model.

To improve performance on Korean, we applied our curated augmentation to STS datasets and fine-tuned the E5 model with a carefully balanced sampling ratio across datasets.
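
Since the base model is multilingual-e5-large-instruct, queries are expected to carry a one-line task instruction prefix. Below is a minimal usage sketch with `sentence-transformers`; the repository id is a placeholder, and whether this fine-tune keeps the exact E5-instruct prompt format is an assumption.

```python
from sentence_transformers import SentenceTransformer

# Placeholder repo id -- replace with the actual path of this model on the Hub.
model = SentenceTransformer("PwC-Embedding-expr")

# multilingual-e5-large-instruct formats queries as "Instruct: ...\nQuery: ...";
# we assume this fine-tune keeps the same convention. Passages are encoded as-is.
def get_detailed_instruct(task_description: str, query: str) -> str:
    return f"Instruct: {task_description}\nQuery: {query}"

task = "Given a web search query, retrieve relevant passages that answer the query"
queries = [get_detailed_instruct(task, "대한민국의 수도는 어디인가요?")]
passages = [
    "서울은 대한민국의 수도이다.",
    "부산은 대한민국 남동부에 위치한 항구 도시이다.",
]

query_emb = model.encode(queries, normalize_embeddings=True)     # shape (1, 1024)
passage_emb = model.encode(passages, normalize_embeddings=True)  # shape (2, 1024)

# Cosine similarity reduces to a dot product on normalized embeddings.
scores = query_emb @ passage_emb.T
print(scores)
```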
|
|
|
> ⚠️ This is an experimental model and is under continuous development.
|
|
|
### To-do

- [x] MTEB Leaderboard
- [ ] Technical Report
|
|
|
|
|
## MTEB

PwC-Embedding_expr was evaluated on the Korean subset of MTEB.
A leaderboard link will be added once the results are published.
|
|
|
| Task              | PwC-Embedding_expr |
|-------------------|--------------------|
| KLUE-STS          | 0.88               |
| KLUE-TC           | 0.73               |
| Ko-StrategyQA     | 0.80               |
| KorSTS            | 0.84               |
| MIRACL-Reranking  | 0.72               |
| MIRACL-Retrieval  | 0.65               |
| **Average**       | **0.77**           |
|
|
|
|
|
## Model

- Base Model: [intfloat/multilingual-e5-large-instruct](https://huggingface.co/intfloat/multilingual-e5-large-instruct)
- Model Size: 0.56B parameters
- Embedding Dimension: 1024 (see the encoding sketch below)
- Max Input Tokens: 514
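
As a rough check of the specs above, here is a sketch using `transformers` directly with E5-style average pooling; the repo id is again a placeholder, and the pooling choice is assumed to be inherited from the base model.

```python
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

MODEL_ID = "PwC-Embedding-expr"  # placeholder repo id

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModel.from_pretrained(MODEL_ID)
model.eval()

texts = ["서울은 대한민국의 수도이다.", "한강은 서울을 가로질러 흐른다."]
batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    outputs = model(**batch)

# E5-style average pooling over non-padding tokens, then L2 normalization.
mask = batch["attention_mask"].unsqueeze(-1).float()
embeddings = (outputs.last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1)
embeddings = F.normalize(embeddings, p=2, dim=1)

print(embeddings.shape)  # expected: torch.Size([2, 1024])
```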
|
|
|
|
|
## Requirements

The model works with the dependencies included in the latest version of MTEB.
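
Below is a sketch of how the scores in the table could be reproduced with the `mteb` package. The task identifiers mirror the table above but may not match the exact names registered in MTEB, and the repo id is a placeholder.

```python
import mteb
from sentence_transformers import SentenceTransformer

# Placeholder repo id -- replace with the actual path of this model on the Hub.
model = SentenceTransformer("PwC-Embedding-expr")

# Task names mirror the results table; the identifiers registered in MTEB may differ.
tasks = mteb.get_tasks(tasks=["KLUE-STS", "KLUE-TC", "Ko-StrategyQA", "KorSTS"])
evaluation = mteb.MTEB(tasks=tasks)
evaluation.run(model, output_folder="results/PwC-Embedding-expr")
```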
|
|
|
|
|
## Citation
|
|
|
TBD (technical report expected September 2025)