Update README.md
README.md
CHANGED
@@ -9,70 +9,13 @@ thumbnail: >-
   https://cdn-uploads.huggingface.co/production/uploads/64bfc4d55ce3d382c05c0f9a/AZ7NrNQ2RuRFcIQ63L5Jv.png
 ---
 
-
-**JQL (Judging Quality across Languages)** is a scalable and lightweight data filtering approach
-that distills the judgment capabilities of strong multilingual LLMs into efficient cross-lingual annotators.
-These annotators enable robust filtering of web-scale data.
-
----
-
-## 🧩 Main Pipeline Steps
-
-
-
-1. **📋 Ground Truth Creation**
-   Human annotators label monolingual documents based on a structured instruction prompt.
-   These documents are then translated into all target languages to form a multilingual gold-standard dataset.
-   *(See Figure 1)*
-
-2. **🤖 LLM-as-a-Judge Selection & Data Annotation**
-   Several strong multilingual LLMs (e.g., Gemma-3-27B-it, Mistral-3.1-24B-it, LLaMA-3.3-70B-it) are evaluated against the ground-truth dataset.
-   The top-$n$ models are selected to generate high-quality synthetic annotations on a large multilingual corpus.
-   *(See Figure 1)*
-
-3. **🪶 Lightweight Annotator Training**
-   Regression heads are trained on top of frozen multilingual embeddings (e.g., Snowflake Arctic Embed v2) using the synthetic data.
-   This results in lightweight, efficient annotators capable of high-throughput filtering.
-   *(See Figure 1)*
-
-4. **🚀 Scalable Data Filtering**
-   The trained annotators are used to label large-scale pretraining corpora.
-   Quality-based thresholds (e.g., the 0.6 or 0.7 quantile) are applied to retain high-quality subsets for downstream LLM training.
-   *(See Figure 1)*
-
----
-
-## 📊 Results
-
-- **Accuracy**: Spearman’s ρ > 0.87 with human ground truth
-- **Downstream LLM Training**:
-  - Up to **+7.2%** benchmark performance improvement
-  - **+4.8% token retention** compared to the baseline FineWeb2 heuristic filter
-  - Reliable trade-offs with the 0.6 and 0.7 quantile filtering strategies
-- **Annotation Speed**: ~11,000 docs/min (A100 GPU, 690 tokens avg. length)
-
----
-
-## 📁 Available Artifacts
-
-- ✅ Ground truth annotations in 35 languages
-- ✅ Synthetic LLM-annotated dataset (14M+ documents)
-- ✅ Lightweight annotation models:
-  - `JQL-Gemma`
-  - `JQL-Mistral`
-  - `JQL-Llama`
-- ✅ Training & inference scripts *(coming soon)*
-
----
-
-## 📜 Citation
-
-If you use JQL, the annotations, or the pretrained annotators, please cite the paper:
+
+**Jackal-AI** is a community of researchers committed to advancing the development of **multilingual foundation models**. We focus on open methods, reproducible experiments, and practical tools to improve multilingual training, alignment, and reasoning at scale.
+
+## Latest Research
+
+- [Judging Across Languages]()
+- [Tokenizer Choice For LLM Training: Negligible or Crucial?](https://aclanthology.org/2024.findings-naacl.247/)
+- [Investigating Multilingual Instruction-Tuning: Do Polyglot Models Demand for Multilingual Instructions?](https://aclanthology.org/2024.emnlp-main.1159/)
+- [Do Multilingual Large Language Models Mitigate Stereotype Bias?](https://aclanthology.org/2024.c3nlp-1.6.pdf)
+
+---
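
Step 3 of the JQL pipeline described above trains regression heads on top of frozen multilingual embeddings. A minimal sketch of that setup, assuming PyTorch with precomputed document embeddings and illustrative dimensions and hyperparameters (this is not the released JQL training code):

```python
# Sketch of "Lightweight Annotator Training": a small regression head on top of
# frozen document embeddings, trained to predict LLM-judge quality scores.
# Embedding dimension, head size, and hyperparameters are illustrative assumptions.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

EMB_DIM = 1024  # assumed width of the frozen multilingual encoder


class QualityHead(nn.Module):
    """Two-layer MLP mapping a document embedding to a scalar quality score."""

    def __init__(self, emb_dim: int = EMB_DIM, hidden: int = 256):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(emb_dim, hidden), nn.GELU(), nn.Linear(hidden, 1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.mlp(x).squeeze(-1)


def train_head(embeddings: torch.Tensor, scores: torch.Tensor, epochs: int = 5) -> QualityHead:
    """embeddings: (N, EMB_DIM) frozen encoder outputs; scores: (N,) synthetic judge labels."""
    head = QualityHead()
    opt = torch.optim.AdamW(head.parameters(), lr=1e-3)
    loss_fn = nn.MSELoss()
    loader = DataLoader(TensorDataset(embeddings, scores), batch_size=256, shuffle=True)
    for _ in range(epochs):
        for emb, target in loader:
            opt.zero_grad()
            loss_fn(head(emb), target).backward()
            opt.step()
    return head


if __name__ == "__main__":
    # Random tensors stand in for real frozen embeddings and judge scores.
    head = train_head(torch.randn(4096, EMB_DIM), torch.rand(4096))
```

Because the encoder stays frozen, only the small head is optimized, which is what keeps the annotators cheap enough for high-throughput filtering.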
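Step 4 keeps only documents whose predicted score clears a quality threshold, e.g. the 0.6 or 0.7 quantile mentioned above. A small sketch of that thresholding, with random scores standing in for real annotator outputs:

```python
# Sketch of "Scalable Data Filtering": keep documents whose predicted quality
# score clears a chosen quantile threshold. Scores here are random placeholders.
import numpy as np


def filter_by_quantile(scores: np.ndarray, q: float = 0.7) -> np.ndarray:
    """Boolean mask keeping documents at or above the q-quantile of the score distribution."""
    return scores >= np.quantile(scores, q)


scores = np.random.rand(1_000_000)        # predicted scores for one corpus shard
keep = filter_by_quantile(scores, q=0.7)  # retains roughly the top 30% of documents
print(f"retained {keep.mean():.1%} of documents")
```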
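The Results section reports Spearman's ρ > 0.87 against human ground truth. As a reference for how such a rank correlation is computed (the arrays below are synthetic placeholders, not the evaluation data):

```python
# Rank-correlation check in the spirit of the reported Spearman's rho;
# both arrays are synthetic placeholders, not the actual evaluation data.
import numpy as np
from scipy.stats import spearmanr

human = np.random.randint(0, 6, size=500)                  # hypothetical 0-5 human labels
predicted = human + np.random.normal(0.0, 1.0, size=500)   # hypothetical annotator scores
rho, p_value = spearmanr(human, predicted)
print(f"Spearman's rho = {rho:.2f} (p = {p_value:.1e})")
```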