Update README.md
README.md
CHANGED
@@ -9,70 +9,13 @@ thumbnail: >-
   https://cdn-uploads.huggingface.co/production/uploads/64bfc4d55ce3d382c05c0f9a/AZ7NrNQ2RuRFcIQ63L5Jv.png
 ---
 
-
-**JQL (Judging Quality across Languages)** is a scalable and lightweight data filtering approach
-that distills the judgment capabilities of strong multilingual LLMs into efficient cross-lingual annotators.
-These annotators enable robust filtering of web-scale data.
-
----
-
-## 🧩 Main Pipeline Steps
-
-
-
-1. **📋 Ground Truth Creation**
-   Human annotators label monolingual documents based on a structured instruction prompt.
-   These documents are then translated into all target languages to form a multilingual gold-standard dataset.
-   *(See Figure 1)*
-
-2. **🤖 LLM-as-a-Judge Selection & Data Annotation**
-   Several strong multilingual LLMs (e.g., Gemma-3-27B-it, Mistral-3.1-24B-it, LLaMA-3.3-70B-it) are evaluated against the ground-truth dataset.
-   The top-$n$ models are selected to generate high-quality synthetic annotations on a large multilingual corpus.
-   *(See Figure 1)*
-
-3. **🪶 Lightweight Annotator Training**
-   Regression heads are trained on top of frozen multilingual embeddings (e.g., Snowflake Arctic Embed v2) using the synthetic data.
-   This results in lightweight, efficient annotators capable of high-throughput filtering.
-   *(See Figure 1)*
-
-4. **🚀 Scalable Data Filtering**
-   The trained annotators are used to label large-scale pretraining corpora.
-   Quality-based thresholds (e.g., the 0.6 or 0.7 quantile) are applied to retain high-quality subsets for downstream LLM training.
-   *(See Figure 1)*
-
----
-
-## 📊 Results
-
-- **Accuracy**: Spearman’s ρ > 0.87 with human ground truth
-- **Downstream LLM Training**:
-  - Up to **+7.2%** benchmark performance improvement
-  - **+4.8% token retention** compared to the baseline FineWeb2 heuristic filter
-  - Reliable trade-offs with the 0.6 and 0.7 quantile filtering strategies
-- **Annotation Speed**: ~11,000 docs/min (A100 GPU, 690 tokens avg. length)
-
----
-
-## 📁 Available Artifacts
-
-- ✅ Ground truth annotations in 35 languages
-- ✅ Synthetic LLM-annotated dataset (14M+ documents)
-- ✅ Lightweight annotation models:
-  - `JQL-Gemma`
-  - `JQL-Mistral`
-  - `JQL-Llama`
-- ✅ Training & inference scripts *(coming soon)*
-
----
-
-## 📜 Citation
-
-If you use JQL, the annotations, or the pretrained annotators, please cite the paper:
+
+**Jackal-AI** is a community of researchers committed to advancing the development of **multilingual foundation models**. We focus on open methods, reproducible experiments, and practical tools to improve multilingual training, alignment, and reasoning at scale.
+
+## Latest Research
+
+- [Judging Across Languages]()
+- [Tokenizer Choice For LLM Training: Negligible or Crucial?](https://aclanthology.org/2024.findings-naacl.247/)
+- [Investigating Multilingual Instruction-Tuning: Do Polyglot Models Demand for Multilingual Instructions?](https://aclanthology.org/2024.emnlp-main.1159/)
+- [Do Multilingual Large Language Models Mitigate Stereotype Bias?](https://aclanthology.org/2024.c3nlp-1.6.pdf)
+
+---
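
Step 3 of the JQL pipeline described above trains regression heads on top of frozen multilingual embeddings. A minimal sketch of that setup, assuming PyTorch with precomputed document embeddings and illustrative dimensions and hyperparameters (this is not the released JQL training code):

```python
# Sketch of "Lightweight Annotator Training": a small regression head on top of
# frozen document embeddings, trained to predict LLM-judge quality scores.
# Embedding dimension, head size, and hyperparameters are illustrative assumptions.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

EMB_DIM = 1024  # assumed width of the frozen multilingual encoder


class QualityHead(nn.Module):
    """Two-layer MLP mapping a document embedding to a scalar quality score."""

    def __init__(self, emb_dim: int = EMB_DIM, hidden: int = 256):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(emb_dim, hidden), nn.GELU(), nn.Linear(hidden, 1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.mlp(x).squeeze(-1)


def train_head(embeddings: torch.Tensor, scores: torch.Tensor, epochs: int = 5) -> QualityHead:
    """embeddings: (N, EMB_DIM) frozen encoder outputs; scores: (N,) synthetic judge labels."""
    head = QualityHead()
    opt = torch.optim.AdamW(head.parameters(), lr=1e-3)
    loss_fn = nn.MSELoss()
    loader = DataLoader(TensorDataset(embeddings, scores), batch_size=256, shuffle=True)
    for _ in range(epochs):
        for emb, target in loader:
            opt.zero_grad()
            loss_fn(head(emb), target).backward()
            opt.step()
    return head


if __name__ == "__main__":
    # Random tensors stand in for real frozen embeddings and judge scores.
    head = train_head(torch.randn(4096, EMB_DIM), torch.rand(4096))
```

Because the encoder stays frozen, only the small head is optimized, which is what keeps the annotators cheap enough for high-throughput filtering.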
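Step 4 keeps only documents whose predicted score clears a quality threshold, e.g. the 0.6 or 0.7 quantile mentioned above. A small sketch of that thresholding, with random scores standing in for real annotator outputs:

```python
# Sketch of "Scalable Data Filtering": keep documents whose predicted quality
# score clears a chosen quantile threshold. Scores here are random placeholders.
import numpy as np


def filter_by_quantile(scores: np.ndarray, q: float = 0.7) -> np.ndarray:
    """Boolean mask keeping documents at or above the q-quantile of the score distribution."""
    return scores >= np.quantile(scores, q)


scores = np.random.rand(1_000_000)        # predicted scores for one corpus shard
keep = filter_by_quantile(scores, q=0.7)  # retains roughly the top 30% of documents
print(f"retained {keep.mean():.1%} of documents")
```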
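The Results section reports Spearman's ρ > 0.87 against human ground truth. As a reference for how such a rank correlation is computed (the arrays below are synthetic placeholders, not the evaluation data):

```python
# Rank-correlation check in the spirit of the reported Spearman's rho;
# both arrays are synthetic placeholders, not the actual evaluation data.
import numpy as np
from scipy.stats import spearmanr

human = np.random.randint(0, 6, size=500)                  # hypothetical 0-5 human labels
predicted = human + np.random.normal(0.0, 1.0, size=500)   # hypothetical annotator scores
rho, p_value = spearmanr(human, predicted)
print(f"Spearman's rho = {rho:.2f} (p = {p_value:.1e})")
```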