MikhailVyrodov committed on
Commit 8267c46 · 1 Parent(s): 002aee0

Add more info about zero-shot text classification and validation process

Files changed (1)
  1. README.md +46 -17
README.md CHANGED
@@ -16,30 +16,60 @@ pipeline_tag: zero-shot-classification

 This is a model with 155M parameters that is built on top of the [USER2-base](https://huggingface.co/deepvk/USER2-base) sentence encoder (149M) and is fine-tuned for the zero-shot classification task.

 ## Performance

 To evaluate the model, we measure quality on multiclass classification tasks from the `MTEB-rus` benchmark.

 **MTEB-rus**

- | Model | Size | Hidden Dim | Context Length | Mean(task) | Kinopoisk | Headlines | GRNTI | OECD | Inappropriateness |
- |---------------------------------:|:----:|:----------:|:--------------:|:----------:|:---------:|:---------:|:-----:|:----:|:-----------------:|
- | `GeRaCl-USER2-base` | 155M | 768 | 8192 | 0.65 | 0.61 | 0.80 | 0.63 | 0.48 | 0.71 |
- | `USER2-base` | 149M | 768 | 8192 | 0.52 | 0.50 | 0.65 | 0.56 | 0.39 | 0.51 |
- | `USER-bge-m3` | 359M | 1024 | 8192 | 0.53 | 0.60 | 0.73 | 0.43 | 0.28 | 0.62 |
- | `multilingual-e5-large-instruct` | 560M | 1024 | 512 | 0.63 | 0.56 | 0.83 | 0.62 | 0.46 | 0.67 |
- | `mDeBERTa-v3-base-mnli-xnli` | 279M | 768 | 512 | 0.45 | 0.54 | 0.53 | 0.34 | 0.23 | 0.62 |
- | `bge-m3-zeroshot-v2.0` | 568M | 1024 | 8192 | 0.60 | 0.65 | 0.72 | 0.53 | 0.41 | 0.67 |
- | `Qwen2.5-1.5B-Instruct` | 1.5B | 1536 | 128K | 0.56 | 0.62 | 0.55 | 0.51 | 0.41 | 0.71 |
- | `Qwen2.5-3B-Instruct` | 3B | 2048 | 128K | 0.63 | 0.63 | 0.74 | 0.60 | 0.43 | 0.75 |

- ## Usage

- ### Prefixes

- This model is based on the USER2-base sentence encoder and uses the "classification: " prefix for classification tasks.

- ### Code

 #### Single classification scenario

@@ -86,7 +116,7 @@ for i in range(len(labels)):

 ## Training details

- This is the base version with 155 million parameters, based on the [`USER2-base`](https://huggingface.co/deepvk/USER2-base) sentence encoder. This model uses the GLiNER architecture, but it has only one vector of similarity scores instead of a full matrix of similarities.
 Compared to the USER2-base model, there are two additional MLP layers: one for the text embeddings and another for the class embeddings. The detailed model architecture is shown in the picture below.

 <img src="assets/architecture.png" alt="GeRaCl architecture" width="600"/>
@@ -99,8 +129,7 @@ The training set is built entirely from splits of the [`deepvk/GeRaCl_synthethi

 | Dataset | # Samples |
 |----------------------------:|:----:|
- | [GeRaCl_synthethic_dataset/synthetic_classes_train](https://huggingface.co/datasets/deepvk/GeRaCl_synthethic_dataset/viewer/synthetic_classes_train) | 93K |
- | [GeRaCl_synthethic_dataset/synthetic_classes](https://huggingface.co/datasets/deepvk/GeRaCl_synthethic_dataset/viewer/synthetic_classes) (val and test) | 6K |
 | [GeRaCl_synthethic_dataset/ru_mteb_classes](https://huggingface.co/datasets/deepvk/GeRaCl_synthethic_dataset/viewer/ru_mteb_classes/) | 52K |
 | [GeRaCl_synthethic_dataset/ru_mteb_extended_classes](https://huggingface.co/datasets/deepvk/GeRaCl_synthethic_dataset/viewer/ru_mteb_extended_classes) | 93K |
 | **Total** | 244K |


 This is a model with 155M parameters that is built on top of the [USER2-base](https://huggingface.co/deepvk/USER2-base) sentence encoder (149M) and is fine-tuned for the zero-shot classification task.

+ ## What is Zero-Shot Classification?
+
+ Zero-shot text classification lets a model assign user-supplied labels to a text without seeing any training examples for those labels. At inference you simply provide the candidate labels as strings, and the model chooses the most appropriate one.
+
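+ A minimal illustration of the task interface, using the generic 🤗 zero-shot pipeline (the checkpoint and labels here are only illustrative; this model's own API is shown in the Usage section below):
+
+ ```python
+ from transformers import pipeline
+
+ # Any zero-shot-capable NLI checkpoint works for illustrating the interface.
+ clf = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")
+ out = clf("I loved this movie!", candidate_labels=["positive", "negative", "neutral"])
+ print(out["labels"][0])  # the label the model considers most appropriate
+ ```
+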
 ## Performance

 To evaluate the model, we measure quality on multiclass classification tasks from the `MTEB-rus` benchmark.

 **MTEB-rus**

+ | Model | Size | Type | Mean(task) | Kinopoisk <nobr>(3&nbsp;classes)</nobr> | Headlines <nobr>(6&nbsp;classes)</nobr> | GRNTI <nobr>(28&nbsp;classes)</nobr> | OECD <nobr>(29&nbsp;classes)</nobr> | Inappropriateness <nobr>(3&nbsp;classes)</nobr> |
+ | -------------------------------- | ----- | ----------- | ---------- | --------- | --------- | -------- | -------- | ----------------- |
+ | `GeRaCl-USER2-base` | 155 M | GeRaCl | **0.65** | 0.61 | 0.80 | **0.63** | **0.48** | 0.71 |
+ | `USER2-base` | 149 M | Encoder | 0.52 | 0.50 | 0.65 | 0.56 | 0.39 | 0.51 |
+ | `USER-bge-m3` | 359 M | Encoder | 0.53 | 0.60 | 0.73 | 0.43 | 0.28 | 0.62 |
+ | `multilingual-e5-large-instruct` | 560 M | Encoder | 0.63 | 0.56 | **0.83** | 0.62 | 0.46 | 0.67 |
+ | `mDeBERTa-v3-base-mnli-xnli` | 279 M | NLI encoder | 0.45 | 0.54 | 0.53 | 0.34 | 0.23 | 0.62 |
+ | `bge-m3-zeroshot-v2.0` | 568 M | NLI encoder | 0.60 | **0.65** | 0.72 | 0.53 | 0.41 | 0.67 |
+ | `Qwen2.5-1.5B-Instruct` | 1.5 B | LLM | 0.56 | 0.62 | 0.55 | 0.51 | 0.41 | 0.71 |
+ | `Qwen2.5-3B-Instruct` | 3 B | LLM | 0.63 | 0.63 | 0.74 | 0.60 | 0.43 | **0.75** |

+ **How the comparison was performed**
+
+ 1. NLI encoders were used via the 🤗 `pipeline("zero-shot-classification")`
+
+ Models such as mDeBERTa-v3-base-mnli-xnli and bge-m3-zeroshot-v2.0 are pre-trained on Natural Language Inference corpora. The Hugging Face pipeline converts classification into NLI hypotheses like:
+
+ Premise: text
+
+ Hypothesis: "This text is about {label}."
+
+ The model scores each (premise, hypothesis) pair independently; the label with the highest entailment probability wins. A sketch of this setup is shown below.
+
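+ A minimal sketch of this evaluation path (the checkpoint id and hypothesis template below are assumptions for illustration, not the exact evaluation code):
+
+ ```python
+ from transformers import pipeline
+
+ # NLI encoder checkpoint; the pipeline turns every candidate label into a hypothesis.
+ clf = pipeline("zero-shot-classification", model="MoritzLaurer/mDeBERTa-v3-base-mnli-xnli")
+ out = clf(
+     "Вышел новый трейлер фильма",  # premise
+     candidate_labels=["кино", "спорт", "политика"],
+     hypothesis_template="This text is about {}.",
+ )
+ print(out["labels"][0])  # label with the highest entailment probability
+ ```
+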
+ 2. LLMs prompted for classification
+
+ Large language models such as Qwen2.5-1.5B and Qwen2.5-3B are queried with a simple classification prompt:
+
+ ```
+ PROMPT = """Ниже указан текст. Ты должен присвоить ему один из перечисленных ниже классов.
+
+ Текст:
+ {}
+
+ Классы:
+ {}.
+
+ Твой ответ должен состоять только из выбранного класса, ничего другого.
+ """
+ ```
+
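+ In English, the prompt reads: "Below is a text. You must assign it one of the classes listed below. ... Your answer must consist only of the chosen class, nothing else." A rough sketch of how such a query could be issued (the checkpoint id and decoding settings are assumptions, not the exact evaluation code):
+
+ ```python
+ import torch
+ from transformers import AutoModelForCausalLM, AutoTokenizer
+
+ model_name = "Qwen/Qwen2.5-1.5B-Instruct"  # assumed checkpoint
+ tokenizer = AutoTokenizer.from_pretrained(model_name)
+ model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16, device_map="auto")
+
+ text = "Вышел новый трейлер фильма"
+ labels = ["кино", "спорт", "политика"]
+ messages = [{"role": "user", "content": PROMPT.format(text, ", ".join(labels))}]
+
+ # Format the chat prompt and generate a short, deterministic answer.
+ inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)
+ output = model.generate(inputs, max_new_tokens=10, do_sample=False)
+ print(tokenizer.decode(output[0][inputs.shape[1]:], skip_special_tokens=True))  # expected: the chosen class
+ ```
+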
+ 3. The GeRaCl architecture (this model). Detailed information about it can be found in the **Training details** section.
+
+ ## Usage

 #### Single classification scenario

 ## Training details

+ This is the base version with 155 million parameters, based on the [`USER2-base`](https://huggingface.co/deepvk/USER2-base) sentence encoder. It uses an idea similar to GLiNER, but with a single vector of similarity scores instead of a full matrix of similarities.
 Compared to the USER2-base model, there are two additional MLP layers: one for the text embeddings and another for the class embeddings. The detailed model architecture is shown in the picture below.

 <img src="assets/architecture.png" alt="GeRaCl architecture" width="600"/>
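+ A schematic sketch of the scoring head described above (layer sizes, activation, and the exact similarity function are assumptions, not the reference implementation):
+
+ ```python
+ import torch
+ import torch.nn as nn
+
+ class GeRaClScoringHead(nn.Module):
+     """Hypothetical sketch: two MLPs on top of a sentence encoder's embeddings."""
+
+     def __init__(self, hidden_dim: int = 768):
+         super().__init__()
+         # One MLP for the text embedding, one for the class embeddings.
+         self.text_mlp = nn.Sequential(nn.Linear(hidden_dim, hidden_dim), nn.GELU(), nn.Linear(hidden_dim, hidden_dim))
+         self.class_mlp = nn.Sequential(nn.Linear(hidden_dim, hidden_dim), nn.GELU(), nn.Linear(hidden_dim, hidden_dim))
+
+     def forward(self, text_emb: torch.Tensor, class_embs: torch.Tensor) -> torch.Tensor:
+         # text_emb: (hidden_dim,), class_embs: (n_classes, hidden_dim)
+         t = self.text_mlp(text_emb)
+         c = self.class_mlp(class_embs)
+         # One similarity score per candidate class: a vector, not a matrix.
+         return c @ t
+ ```
+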
 
 | Dataset | # Samples |
 |----------------------------:|:----:|
+ | [GeRaCl_synthethic_dataset/synthetic_classes_train](https://huggingface.co/datasets/deepvk/GeRaCl_synthethic_dataset/viewer/synthetic_classes_train) | 99K |
 | [GeRaCl_synthethic_dataset/ru_mteb_classes](https://huggingface.co/datasets/deepvk/GeRaCl_synthethic_dataset/viewer/ru_mteb_classes/) | 52K |
 | [GeRaCl_synthethic_dataset/ru_mteb_extended_classes](https://huggingface.co/datasets/deepvk/GeRaCl_synthethic_dataset/viewer/ru_mteb_extended_classes) | 93K |
 | **Total** | 244K |