mali90 committed · Commit f6dbd4b · verified · 1 Parent(s): b5134eb

Update README.md
Files changed (1): README.md (+68 / -1)

The update replaces the placeholder line "Edit this `README.md` markdown file to author your organization card." with the organization card below; the YAML front matter (`sdk: static`, `pinned: false`) is unchanged.
# 🦊 JQL: Judging Quality across Languages

High-quality multilingual data is crucial for training effective large language models (LLMs). **JQL (Judging Quality across Languages)** is a scalable and lightweight data-filtering approach that distills the judgment capabilities of strong multilingual LLMs into efficient cross-lingual annotators. These annotators enable robust filtering of web-scale data.

JQL improves data quality, retains more tokens, and generalizes beyond high-resource European languages, achieving strong performance on Arabic, Thai, and Mandarin. It outperforms heuristic baselines and enables efficient multilingual pretraining data curation at scale.

---
## 🧩 Main Pipeline Steps

![JQL Pipeline Overview](assets/jql.pdf)

1. **📋 Ground Truth Creation**
   Human annotators label monolingual documents based on a structured instruction prompt. These documents are then translated into all target languages to form a multilingual gold-standard dataset. *(See Figure 1)*

2. **🤖 LLM-as-a-Judge Selection & Data Annotation**
   Several strong multilingual LLMs (e.g., Gemma-3-27B-it, Mistral-3.1-24B-it, LLaMA-3.3-70B-it) are evaluated against the ground-truth dataset. The top-*n* models are selected to generate high-quality synthetic annotations on a large multilingual corpus. *(See Figure 1)*

3. **🪶 Lightweight Annotator Training**
   Regression heads are trained on top of frozen multilingual embeddings (e.g., Snowflake Arctic Embed v2) using the synthetic data. This yields lightweight, efficient annotators capable of high-throughput filtering; a minimal training sketch follows this list. *(See Figure 1)*

4. **🚀 Scalable Data Filtering**
   The trained annotators are used to label large-scale pretraining corpora. Quality-based thresholds (e.g., the 0.6 or 0.7 quantile) are applied to retain high-quality subsets for downstream LLM training. *(See Figure 1)*
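To make step 3 concrete, here is a minimal training sketch, assuming a Snowflake Arctic Embed v2 checkpoint from the Hugging Face Hub, a single-hidden-layer head, and toy placeholder annotations. It illustrates the idea only and is not the official JQL training code.

```python
# Minimal sketch (NOT the official JQL training code): fit a small regression head
# on frozen multilingual embeddings to predict document quality scores.
# The model ID, head architecture, and data below are illustrative assumptions.
import torch
import torch.nn as nn
from sentence_transformers import SentenceTransformer

# Frozen multilingual embedding model (e.g., Snowflake Arctic Embed v2).
embedder = SentenceTransformer("Snowflake/snowflake-arctic-embed-l-v2.0")

# Toy (document, quality score) pairs; in JQL these are synthetic annotations
# produced by the selected LLM judges.
docs = [
    "A well-structured, informative article about solar energy ...",
    "BUY NOW!!! click here click here click here",
]
targets = torch.tensor([[0.9], [0.1]])

# Embed once; the embedding model itself is never updated.
with torch.no_grad():
    embeddings = torch.tensor(embedder.encode(docs), dtype=torch.float32)

# Lightweight regression head mapping each embedding to a scalar quality score.
head = nn.Sequential(
    nn.Linear(embeddings.shape[1], 256),
    nn.ReLU(),
    nn.Linear(256, 1),
)
optimizer = torch.optim.AdamW(head.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

for step in range(200):
    optimizer.zero_grad()
    loss = loss_fn(head(embeddings), targets)
    loss.backward()
    optimizer.step()

print("predicted scores:", head(embeddings).squeeze(-1).tolist())
```

Because only the small head is trained and run on top of precomputed embeddings, the resulting annotators stay cheap enough for web-scale filtering.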
---

## 📊 Results

- **Accuracy**: Spearman’s ρ > 0.87 with human ground truth
- **Downstream LLM Training**:
  - Up to **+7.2%** benchmark performance improvement
  - **+4.8%** token retention over the baseline FineWeb2 heuristic filter
  - Reliable quality/quantity trade-offs with the 0.6 and 0.7 quantile filtering strategies (a filtering sketch follows this list)
- **Annotation Speed**: ~11,000 docs/min on an A100 GPU (average document length: 690 tokens)
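For illustration, here is a minimal sketch of the quantile-based filtering step, assuming each document already carries a score from one of the lightweight annotators; the documents and score values are placeholders.

```python
# Minimal sketch of quantile-based filtering: keep documents whose annotator score
# clears a corpus-level quantile threshold (0.6 or 0.7 in the JQL setup).
# Documents and scores below are placeholders.
import numpy as np

docs = ["doc_a", "doc_b", "doc_c", "doc_d", "doc_e"]
scores = np.array([0.82, 0.15, 0.64, 0.91, 0.40])  # lightweight-annotator outputs

threshold = np.quantile(scores, 0.7)  # 0.7 quantile: retain roughly the top 30%
kept = [doc for doc, score in zip(docs, scores) if score >= threshold]

print(f"threshold={threshold:.2f}, kept {len(kept)}/{len(docs)} documents")
```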
---

## 📁 Available Artifacts

- ✅ Ground truth annotations in 35 languages
- ✅ Synthetic LLM-annotated dataset (14M+ documents)
- ✅ Lightweight annotation models *(see the loading sketch below)*:
  - `JQL-Gemma`
  - `JQL-Mistral`
  - `JQL-Llama`
- ✅ Training & inference scripts *(coming soon)*
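Until the official inference scripts are released, the sketch below shows roughly how one of the lightweight annotators might be loaded and applied. The repository ID, checkpoint filename, and head layout are hypothetical placeholders rather than the published artifacts; only the generic `huggingface_hub`, `sentence_transformers`, and PyTorch calls are real APIs.

```python
# Hedged sketch only: the repo ID, filename, and head layout are HYPOTHETICAL
# placeholders until the official JQL inference scripts are released.
import torch
import torch.nn as nn
from huggingface_hub import hf_hub_download
from sentence_transformers import SentenceTransformer

# Hypothetical location of a JQL-Gemma regression-head checkpoint.
ckpt_path = hf_hub_download(repo_id="<org>/JQL-Gemma", filename="head.pt")

embedder = SentenceTransformer("Snowflake/snowflake-arctic-embed-l-v2.0")
head = nn.Sequential(nn.Linear(1024, 256), nn.ReLU(), nn.Linear(256, 1))  # assumed layout
head.load_state_dict(torch.load(ckpt_path, map_location="cpu"))
head.eval()

docs = ["Ein gut recherchierter Artikel über erneuerbare Energien ..."]
with torch.no_grad():
    scores = head(torch.tensor(embedder.encode(docs))).squeeze(-1)
print(scores.tolist())
```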
---

## 📜 Citation

If you use JQL, the annotations, or the pretrained annotators, please cite the paper: