---
license: apache-2.0
datasets:
- deepvk/GeRaCl_synthethic_dataset
language:
- ru
base_model:
- deepvk/USER2-base
pipeline_tag: zero-shot-classification
---

# GeRaCl-USER2-base

**GeRaCl** is a **Ge**neral **Ra**pid **Cl**assifier designed to perform zero-shot classification tasks, primarily on Russian texts.

This 155M-parameter model is built on top of the [USER2-base](https://huggingface.co/deepvk/USER2-base) sentence encoder (149M) and fine-tuned for the zero-shot classification task.

## What is zero-shot classification?

Zero-shot text classification lets a model assign user-supplied labels to a text without seeing any training examples for those labels. At inference you simply provide the candidate labels as strings, and the model chooses the most appropriate one.

## Performance

To evaluate the model, we measure quality on multiclass classification tasks from the `MTEB-rus` benchmark.

**MTEB-rus**

| Model                            | Size  | Type        | Mean(task) | Kinopoisk <nobr>(3&nbsp;classes)</nobr> | Headlines <nobr>(6&nbsp;classes)</nobr> | GRNTI <nobr>(28&nbsp;classes)</nobr>  | OECD <nobr>(29&nbsp;classes)</nobr>   | Inappropriateness <nobr>(3&nbsp;classes)</nobr> |
| -------------------------------- | ----- | ----------- | ---------- | --------- | --------- | -------- | -------- | ----------------- |
| `GeRaCl-USER2-base`              | 155 M | GeRaCl      | **0.65**   | 0.61      | 0.80      | **0.63** | **0.48** | 0.71              |
| `USER2-base`                     | 149 M | Encoder     | 0.52       | 0.50      | 0.65      | 0.56     | 0.39     | 0.51              |
| `USER-bge-m3`                    | 359 M | Encoder     | 0.53       | 0.60      | 0.73      | 0.43     | 0.28     | 0.62              |
| `multilingual-e5-large-instruct` | 560 M | Encoder     | 0.63       | 0.56      | **0.83**  | 0.62     | 0.46     | 0.67              |
| `mDeBERTa-v3-base-mnli-xnli`     | 279 M | NLI-encoder | 0.45       | 0.54      | 0.53      | 0.34     | 0.23     | 0.62              |
| `bge-m3-zeroshot-v2.0`           | 568 M | NLI-encoder | 0.60       | **0.65**  | 0.72      | 0.53     | 0.41     | 0.67              |
| `Qwen2.5-1.5B-Instruct`          | 1.5 B | LLM         | 0.56       | 0.62      | 0.55      | 0.51     | 0.41     | 0.71              |
| `Qwen2.5-3B-Instruct`            | 3 B   | LLM         | 0.63       | 0.63      | 0.74      | 0.60     | 0.43     | **0.75**          |

**How the comparison was performed**

1. NLI encoders were used via the 🤗 `pipeline("zero-shot-classification")`

Models such as `mDeBERTa-v3-base-mnli-xnli` and `bge-m3-zeroshot-v2.0` are pre-trained on Natural Language Inference corpora. The Hugging Face pipeline converts classification into NLI hypotheses like:

Premise: *text*

Hypothesis: *"This text is about {label}."*

The model scores each (premise, hypothesis) pair independently; the label with the highest entailment probability wins.
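
For reference, a minimal sketch of such a pipeline call (the full repo id `MoritzLaurer/mDeBERTa-v3-base-mnli-xnli` and the English hypothesis template are assumptions on our side):

```python
from transformers import pipeline

# NLI-based zero-shot classification, as used for the encoders in the table above.
classifier = pipeline(
    "zero-shot-classification",
    model="MoritzLaurer/mDeBERTa-v3-base-mnli-xnli",
)

result = classifier(
    "Утилизация катализаторов: как неплохо заработать",
    candidate_labels=["экономика", "происшествия", "политика"],
    hypothesis_template="This text is about {}.",
)
print(result["labels"][0])  # label with the highest entailment probability
```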

2. LLMs prompted for classification

Large language models such as `Qwen2.5-1.5B-Instruct` and `Qwen2.5-3B-Instruct` are queried with a simple classification prompt (an English translation is given in the comments below):

```python
# English translation of the prompt:
#   "Below is a text. You must assign it one of the classes listed below.
#    Text: {}
#    Classes: {}.
#    Your answer must consist only of the chosen class, nothing else."
PROMPT = """Ниже указан текст. Ты должен присвоить ему один из перечисленных ниже классов.

Текст:
{}

Классы:
{}.

Твой ответ должен состоять только из выбранного класса, ничего другого.
"""
```
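
For illustration, a minimal sketch of querying such a model with the prompt above (standard `transformers` chat-template usage; greedy decoding and the token budget are assumptions):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-1.5B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")

text = "Мне не понравился этот фильм"
labels = ["нейтральный", "позитивный", "негативный"]
prompt = PROMPT.format(text, "\n".join(labels))  # PROMPT from the block above

messages = [{"role": "user", "content": prompt}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

# Greedy decoding; the model is expected to answer with just the class name.
output = model.generate(inputs, max_new_tokens=16, do_sample=False)
print(tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True))
```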

3. GeRaCl architecture. Detailed information about the architecture is given in the **Training details** section below.

## Installation

Clone and install directly from GitHub:

```bash
git clone https://github.com/deepvk/geracl
cd geracl

pip install -r requirements.txt
```

Verify your installation:

```python
import geracl
print(geracl.__version__)
```

## Usage

#### Single classification scenario

```python
from transformers import AutoTokenizer
from geracl import GeraclHF, ZeroShotClassificationPipeline

model = GeraclHF.from_pretrained('deepvk/GeRaCl-USER2-base').to('cuda').eval()
tokenizer = AutoTokenizer.from_pretrained('deepvk/GeRaCl-USER2-base')

pipe = ZeroShotClassificationPipeline(model, tokenizer, device="cuda")

text = "Утилизация катализаторов: как неплохо заработать"
labels = ["экономика", "происшествия", "политика", "культура", "наука", "спорт"]
result = pipe(text, labels, batch_size=1)[0]

print(labels[result])
```

#### Multiple classification scenarios

```python
from transformers import AutoTokenizer
from geracl import GeraclHF, ZeroShotClassificationPipeline

model = GeraclHF.from_pretrained('deepvk/GeRaCl-USER2-base').to('cuda').eval()
tokenizer = AutoTokenizer.from_pretrained('deepvk/GeRaCl-USER2-base')

pipe = ZeroShotClassificationPipeline(model, tokenizer, device="cuda")

texts = [
  "Утилизация катализаторов: как неплохо заработать",
  "Мне не понравился этот фильм"
]
labels = [
  ["экономика", "происшествия", "политика", "культура", "наука", "спорт"],
  ["нейтральный", "позитивный", "негативный"]
]
results = pipe(texts, labels, batch_size=2)

for i in range(len(labels)):
    print(labels[i][results[i]])
```

## Training details

This is the base version with 155 million parameters, built on top of the [`USER2-base`](https://huggingface.co/deepvk/USER2-base) sentence encoder. The model follows an idea similar to GLiNER, but it produces a single vector of similarity scores instead of a full matrix of similarities.
Compared to USER2-base, there are two additional MLP layers: one for the text embeddings and another for the class embeddings. The detailed architecture is shown in the picture below.

<img src="assets/architecture.png" alt="GeRaCl architecture" width="600"/>
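
As a rough illustration of the scoring step described above, here is a schematic sketch (module names, dimensions, and the dot-product scoring are our assumptions, not the actual implementation):

```python
import torch
import torch.nn as nn

class ScoringHead(nn.Module):
    """Sketch: separate MLPs project the text embedding and the class
    embeddings; a dot product then yields one similarity score per class,
    i.e. a single vector of scores rather than a full similarity matrix."""

    def __init__(self, hidden: int = 768):
        super().__init__()
        self.text_mlp = nn.Sequential(
            nn.Linear(hidden, hidden), nn.GELU(), nn.Linear(hidden, hidden)
        )
        self.class_mlp = nn.Sequential(
            nn.Linear(hidden, hidden), nn.GELU(), nn.Linear(hidden, hidden)
        )

    def forward(self, text_emb: torch.Tensor, class_embs: torch.Tensor) -> torch.Tensor:
        # text_emb: (hidden,), class_embs: (num_classes, hidden)
        scores = self.class_mlp(class_embs) @ self.text_mlp(text_emb)
        return scores  # predicted label index = scores.argmax()
```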

The training set is built entirely from splits of the [`deepvk/GeRaCl_synthethic_dataset`](https://huggingface.co/datasets/deepvk/GeRaCl_synthethic_dataset) dataset. It is a concatenation of three sub-datasets:
 - **Synthetic classes part**. For every training example we randomly chose one of the five class lists (`classes_0`–`classes_4`) and paired it with the sample’s text (see the sketch after this list). The validation and test splits were added unchanged.
 - **RU-MTEB part**. The entire `ru_mteb_classes` dataset was added to the mix.
 - **RU-MTEB extended part**. The entire `ru_mteb_extended_classes` dataset was added to the mix.
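
A small sketch of the synthetic-classes sampling described above (the field names `text` and `classes_0`…`classes_4` are taken from the dataset description; treat the exact schema as an assumption):

```python
import random

def build_example(sample: dict) -> dict:
    """Pair a sample's text with one randomly chosen class list
    out of classes_0 ... classes_4."""
    key = f"classes_{random.randrange(5)}"
    return {"text": sample["text"], "classes": sample[key]}
```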


| Dataset                     | # Samples |
|:----------------------------|:----:|
| [GeRaCl_synthethic_dataset/synthetic_classes_train](https://huggingface.co/datasets/deepvk/GeRaCl_synthethic_dataset/viewer/synthetic_classes_train)                    | 99K |
| [GeRaCl_synthethic_dataset/ru_mteb_classes](https://huggingface.co/datasets/deepvk/GeRaCl_synthethic_dataset/viewer/ru_mteb_classes/)                    | 52K  |
| [GeRaCl_synthethic_dataset/ru_mteb_extended_classes](https://huggingface.co/datasets/deepvk/GeRaCl_synthethic_dataset/viewer/ru_mteb_extended_classes)             | 93K  |
| **Total**                   | 244K |

## Citations
```
@misc{deepvk2025geracl,
    title={GeRaCl},
    author={Vyrodov, Mikhail and Spirin, Egor and Sokolov, Andrey},
    url={https://huggingface.co/deepvk/GeRaCl-USER2-base},
    publisher={Hugging Face},
    year={2025},
}
```