mali90 committed
Commit 49a1e87 · verified · 1 Parent(s): 3c36980

Update README.md

Files changed (1)
  1. README.md +7 -64
README.md CHANGED
@@ -9,70 +9,13 @@ thumbnail: >-
  https://cdn-uploads.huggingface.co/production/uploads/64bfc4d55ce3d382c05c0f9a/AZ7NrNQ2RuRFcIQ63L5Jv.png
  ---

- # 🦊 JQL: Judging Quality across Languages
-
- High-quality multilingual data is crucial for training effective large language models (LLMs).
- **JQL (Judging Quality across Languages)** is a scalable and lightweight data filtering approach
- that distills the judgment capabilities of strong multilingual LLMs into efficient cross-lingual annotators.
- These annotators enable robust filtering of web-scale data.
-
- JQL improves data quality, retains more tokens, and generalizes beyond high-resource European languages, achieving strong performance
- on Arabic, Thai, and Mandarin.
- It outperforms heuristic baselines and enables efficient multilingual pretraining data curation at scale.
-
- ---
-
- ## 🧩 Main Pipeline Steps
-
- ![JQL Pipeline Overview](https://cdn-uploads.huggingface.co/production/uploads/64bfc4d55ce3d382c05c0f9a/AZ7NrNQ2RuRFcIQ63L5Jv.png)
-
- 1. **📋 Ground Truth Creation**
-    Human annotators label monolingual documents based on a structured instruction prompt.
-    These documents are then translated into all target languages to form a multilingual gold-standard dataset.
-    *(See Figure 1)*
-
- 2. **🤖 LLM-as-a-Judge Selection & Data Annotation**
-    Several strong multilingual LLMs (e.g., Gemma-3-27B-it, Mistral-3.1-24B-it, LLaMA-3.3-70B-it) are evaluated against the ground-truth dataset.
-    The top-$n$ models are selected to generate high-quality synthetic annotations on a large multilingual corpus.
-    *(See Figure 1)*
-
- 3. **🪶 Lightweight Annotator Training**
-    Regression heads are trained on top of frozen multilingual embeddings (e.g., Snowflake Arctic Embed v2) using the synthetic data; a sketch of this step follows the list.
-    This results in lightweight, efficient annotators capable of high-throughput filtering.
-    *(See Figure 1)*
-
- 4. **🚀 Scalable Data Filtering**
-    The trained annotators are used to label large-scale pretraining corpora.
-    Quality-based thresholds (e.g., the 0.6 or 0.7 quantile) are applied to retain high-quality subsets for downstream LLM training.
-    *(See Figure 1)*
-
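- As a rough illustration of step 3, here is a minimal sketch (not the released JQL training code) that fits a small regression head on frozen sentence embeddings. The embedding model id, head architecture, and toy data are assumptions:
-
- ```python
- # Sketch: train a scalar quality-score head on frozen embeddings.
- # Assumed, not from the JQL release: model id, head shape, toy data.
- import torch
- import torch.nn as nn
- from sentence_transformers import SentenceTransformer
-
- encoder = SentenceTransformer("Snowflake/snowflake-arctic-embed-l-v2.0")  # assumed model id
-
- class QualityHead(nn.Module):
-     """Maps a document embedding to a scalar quality score."""
-     def __init__(self, dim: int, hidden: int = 256):
-         super().__init__()
-         self.mlp = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, 1))
-
-     def forward(self, x: torch.Tensor) -> torch.Tensor:
-         return self.mlp(x).squeeze(-1)
-
- texts = ["A thorough, well-structured article ...", "buy now!!! click here !!!"]
- scores = torch.tensor([4.5, 0.5])  # toy stand-ins for judge-assigned quality scores
-
- with torch.no_grad():  # the encoder stays frozen; only the head is trained
-     emb = encoder.encode(texts, convert_to_tensor=True).float()
-
- head = QualityHead(emb.shape[-1])
- opt = torch.optim.AdamW(head.parameters(), lr=1e-3)
- for _ in range(200):  # toy loop; real training iterates over a large corpus
-     opt.zero_grad()
-     loss = nn.functional.mse_loss(head(emb), scores)
-     loss.backward()
-     opt.step()
- ```
-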
- ---
-
- ## 📊 Results
-
- - **Accuracy**: Spearman’s ρ > 0.87 with human ground truth
- - **Downstream LLM Training**:
-   - Up to **+7.2%** benchmark performance improvement
-   - **+4.8%** token retention over the baseline FineWeb2 heuristic filter
-   - Reliable trade-offs with the 0.6 and 0.7 quantile filtering strategies (a small sketch follows this list)
- - **Annotation Speed**: ~11,000 docs/min (A100 GPU, 690-token average document length)
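-
- The quantile-based trade-off is straightforward to reproduce. A minimal illustrative sketch follows; the score arrays are random stand-ins, not outputs of the released pipeline:
-
- ```python
- # Sketch: agreement check plus 0.7-quantile filtering over annotator scores.
- # The score arrays below are random stand-ins for real annotator outputs.
- import numpy as np
- from scipy.stats import spearmanr
-
- rng = np.random.default_rng(0)
- model_scores = rng.random(1_000_000)  # stand-in for lightweight-annotator scores
- human_scores = model_scores + rng.normal(0, 0.1, model_scores.shape)  # noisy proxy
-
- rho, _ = spearmanr(model_scores[:10_000], human_scores[:10_000])
- print(f"Spearman rho on a sample: {rho:.2f}")
-
- threshold = np.quantile(model_scores, 0.7)  # or 0.6, per the quantile strategies above
- keep = model_scores >= threshold
- print(f"retained {keep.mean():.0%} of documents (score >= {threshold:.3f})")
- ```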
-
- ---
-
- ## 📁 Available Artifacts
-
- - ✅ Ground truth annotations in 35 languages
- - ✅ Synthetic LLM-annotated dataset (14M+ documents)
- - ✅ Lightweight annotation models:
-   - `JQL-Gemma`
-   - `JQL-Mistral`
-   - `JQL-Llama`
- - ✅ Training & inference scripts *(coming soon)*
-
- ---
-
- ## 📜 Citation
-
- If you use JQL, the annotations, or the pretrained annotators, please cite the paper:
+ **Jackal-AI** is a community of researchers committed to advancing the development of **multilingual foundation models**. We focus on open methods, reproducible experiments, and practical tools to improve multilingual training, alignment, and reasoning at scale.
+
+ ## Latest Research
+
+ - [Judging Across Languages]()
+ - [Tokenizer Choice For LLM Training: Negligible or Crucial?](https://aclanthology.org/2024.findings-naacl.247/)
+ - [Investigating Multilingual Instruction-Tuning: Do Polyglot Models Demand for Multilingual Instructions?](https://aclanthology.org/2024.emnlp-main.1159/)
+ - [Do Multilingual Large Language Models Mitigate Stereotype Bias?](https://aclanthology.org/2024.c3nlp-1.6.pdf)
+
+ ---