Meta-rater Language Model (7.2B Parameters, 150B Tokens)
This repository contains the model described in the paper Meta-rater: A Multi-dimensional Data Selection Method for Pre-training Language Models.
Code: https://github.com/opendatalab/Meta-rater
Model Description
This is a 7.2B-parameter, decoder-only transformer language model trained from scratch on 150B tokens selected from the SlimPajama dataset with the Meta-rater framework using all 25 quality scores. It is the largest and most capable model in the Meta-rater study and illustrates the benefits of quality-driven data selection at scale.
Model Details
- Architecture: Transformer decoder-only
- Parameters: 7.2B (7,241,732,096 parameters)
- Training Tokens: 150B tokens
- Context Window: 1,024 tokens
- Vocabulary Size: 32,000 (LLaMA tokenizer)
- Data Selection Method: Meta-rater with all 25 quality scores
- Score Weighting: Optimal weights learned from 1.3B-parameter proxy experiments
Architecture Specifications
- Hidden Dimension: 4,096
- Number of Layers: 32
- Attention Heads: 32
- Key-Value Heads: 8 (Grouped Query Attention)
- MLP Ratio: 8/3
- Position Encoding: RoPE (base=10,000)
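These specifications map onto a standard decoder-only configuration. The sketch below uses the Hugging Face LlamaConfig class purely as an approximation; the config class and the intermediate size (rounded from the 8/3 MLP ratio) are assumptions, not the released implementation.

from transformers import LlamaConfig

# Approximate configuration derived from the specifications above.
# Intermediate size is an assumption: 4096 * 8/3 ≈ 10923, rounded here to
# 11008 as is common in LLaMA-style models.
config = LlamaConfig(
    vocab_size=32_000,              # LLaMA tokenizer vocabulary
    hidden_size=4_096,              # hidden dimension
    num_hidden_layers=32,           # decoder layers
    num_attention_heads=32,         # query heads
    num_key_value_heads=8,          # grouped-query attention
    intermediate_size=11_008,       # assumed rounding of the 8/3 MLP ratio
    max_position_embeddings=1_024,  # context window
    rope_theta=10_000.0,            # RoPE base
)
print(config.num_hidden_layers, config.hidden_size)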
Data Selection Framework
The training data was selected using the complete Meta-rater framework, demonstrating its scalability:
Comprehensive Quality Assessment (25 metrics)
- Natural Language Quality Signals (11): RedPajama rule-based measures
- Data Importance Scores (3): DSIR similarity to Books, Wikipedia, AutoMathText
- Model-based Ratings (11): PRRC + QuRating + FineWeb-Edu + WanjuanCC
Optimal Integration Strategy
The same weight distribution learned in the 1.3B-parameter proxy experiments was applied here, demonstrating that the Meta-rater weighting transfers across model scales.
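As a minimal sketch of how a weighted combination of quality scores can rank and select documents, consider the example below. The score columns, weights, and selection ratio are illustrative placeholders, not the learned values from the paper.

import numpy as np

# Hypothetical per-document quality scores (columns) for a tiny corpus (rows).
# Meta-rater uses 25 scores (rule-based, DSIR, and model-based raters);
# only three placeholder columns are shown here.
scores = np.array([
    [0.82, 0.40, 0.91],
    [0.35, 0.88, 0.22],
    [0.67, 0.75, 0.80],
])

# Placeholder weights standing in for the learned optimal weighting.
weights = np.array([0.5, 0.2, 0.3])

composite = scores @ weights              # one composite rating per document
keep = int(0.5 * len(composite))          # e.g., keep the top 50% (illustrative)
selected = np.argsort(-composite)[:keep]  # indices of the highest-rated documents
print(selected, composite[selected])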
Training Details
- Hardware: 32x NVIDIA A800 GPUs
- Global Batch Size: 4,194,304 tokens
- Learning Rate: 5e-5
- Optimizer: Adam (β₁=0.9, β₂=0.95, ε=1e-8)
- Training Time: ~284 hours
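For reference, these settings imply roughly 4,096 sequences per step at the 1,024-token context length and about 36K optimization steps over 150B tokens. The sketch below is a back-of-the-envelope calculation plus the stated Adam hyperparameters; it is illustrative only and uses placeholder parameters.

import torch

# Rough step count implied by the stated training configuration.
global_batch_tokens = 4_194_304
context_window = 1_024
total_tokens = 150e9

sequences_per_step = global_batch_tokens // context_window  # 4,096 sequences
total_steps = total_tokens / global_batch_tokens            # ≈ 35,763 steps
print(sequences_per_step, round(total_steps))

# Adam with the hyperparameters listed above (dummy parameters for illustration).
dummy_params = [torch.nn.Parameter(torch.zeros(1))]
optimizer = torch.optim.Adam(dummy_params, lr=5e-5, betas=(0.9, 0.95), eps=1e-8)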
Performance Results
Downstream Task Performance (Average Accuracy)
General Knowledge: 67.97% (+2.87% vs Random 7.2B)
- ARC-Easy: 71.34%
- ARC-Challenge: 39.76%
- SciQ: 92.80%
Commonsense Reasoning: 54.58% (+2.57% vs Random 7.2B)
- HellaSwag: 58.97%
- SIQA: 44.32%
- WinoGrande: 60.45%
Reading Comprehension: 37.14% (+1.27% vs Random 7.2B)
- RACE: 36.08%
- OpenbookQA: 38.20%
Overall Average: 55.24% (+3.12% vs Random 7.2B)
Knowledge-Intensive Tasks
- MMLU: 26.24% (+0.03% vs Random 7.2B)
- NaturalQuestions: 10.42% (-0.47% vs Random 7.2B)
Scaling Excellence
Meta-rater Scaling Progression
- 1.3B Meta-rater: 47.01% overall
- 3.3B Meta-rater: 54.71% overall (+7.70% vs 1.3B)
- 7.2B Meta-rater: 55.24% overall (+0.53% vs 3.3B)
Scaling Efficiency Comparison
Meta-rater vs Random across scales:
- 1.3B: +3.23% improvement (Meta-rater advantage)
- 3.3B: +1.73% improvement (maintained advantage)
- 7.2B: +3.12% improvement (advantage widens again relative to 3.3B)
Key Research Findings
Data Quality Becomes More Valuable at Scale
- Large Improvement: 7.2B shows a +3.12% absolute improvement over random selection
- Efficiency Recovery: Meta-rater overcomes random selection plateau
- Continued Benefits: Quality selection prevents performance stagnation
- Scale Synergy: Larger models better utilize high-quality data
Breakthrough Performance
This model demonstrates:
- Highest Absolute Performance: 55.24% overall accuracy
- Strong Scaling Efficiency: Large improvement over the random baseline at the largest scale tested
- Consistent Quality: Strong performance across all task categories
- Framework Validation: Meta-rater scales effectively to 7.2B parameters
Usage
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# Load model and tokenizer
model_name = "opendatalab/meta-rater-7b-25raters"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Generate text
prompt = "The implications of quantum computing for cryptography"
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    outputs = model.generate(
        inputs.input_ids,
        max_length=200,
        temperature=0.7,
        do_sample=True,
        pad_token_id=tokenizer.eos_token_id,
    )
generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(generated_text)
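Because the context window is 1,024 tokens, long prompts should be truncated before generation. A small variation on the tokenization step above (the max_length value simply mirrors the stated context window):

# Truncate long prompts to the model's 1,024-token context window.
inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=1024)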
Applications
This model is exceptionally well-suited for:
- Production deployment requiring highest quality standards
- Research applications needing state-of-the-art baselines
- Content generation at scale with quality assurance
- Educational platforms across diverse domains
- Knowledge-intensive applications and question answering
- Professional writing assistance and content creation
- Multi-domain tasks requiring robust capabilities
Key Advantages
Quality Data Selection at Scale
- Overcomes Plateau: While random selection stagnates, Meta-rater continues improving
- Efficiency Gain: The +3.12% improvement represents a significant capability gain
- Resource Optimization: Same training cost, substantially better results
- Scalability Proof: Validates data curation importance at any scale
State-of-the-Art Achievement
- Best Performance: Highest accuracy among all tested models and methods
- Large Improvement: +3.12% gain over the random baseline at the 7.2B scale
- Comprehensive Excellence: Strong across all evaluation categories
- Framework Validation: Proves Meta-rater methodology at scale
Research Significance
This model provides strong evidence that:
- Data quality matters more at scale: Larger models need better data
- Meta-rater scaling: Framework benefits increase rather than diminish
- Efficiency paradigm: Quality beats quantity in data selection
- Practical impact: Substantial performance gains with existing computational budgets
Strengths
- Highest Performance: Best accuracy across all model scales
- Scaling Success: Demonstrates continued Meta-rater benefits at scale
- Quality Consistency: Reliable high-quality output generation
- Resource Efficiency: Maximum return on computational investment
- Robust Capabilities: Strong performance across diverse tasks
- Research Validated: Empirically proven methodology
Limitations
- Computational Requirements: Large model requires significant resources
- Context Window: Limited to 1,024 tokens
- No Instruction Tuning: Base model without safety alignment
- Data Selection Overhead: Requires quality score preprocessing
- Specialized Infrastructure: Needs appropriate hardware for deployment
Critical Insights
Data Curation Imperative
These results indicate:
- Scale Amplifies Quality: Better data becomes more important with larger models
- Random Selection Failure: Quality-agnostic approaches hit performance walls
- Meta-rater Success: Systematic quality integration scales effectively
- Future Direction: Data curation essential for continued progress
Performance Breakthrough
- vs Random 7.2B: +3.12% improvement
- vs Best Single Method: Outperforms all baseline approaches
- vs Simple Combinations: Superior to naive quality score averaging
- vs Previous SOTA: Establishes new state-of-the-art for data selection
Citation
If you use this model in your research, please cite:
@article{zhuang2025meta,
  title={Meta-rater: A Multi-dimensional Data Selection Method for Pre-training Language Models},
  author={Zhuang, Xinlin and Peng, Jiahui and Ma, Ren and Wang, Yinfan and Bai, Tianyi and Wei, Xingjian and Qiu, Jiantao and Zhang, Chi and Qian, Ying and He, Conghui},
  journal={arXiv preprint arXiv:2504.14194},
  year={2025}
}
Related Resources
- Complete Model Series: 1.3B, 3.3B, and 7.2B variants
- PRRC Rating Models: Quality assessment models for data curation
- Annotated SlimPajama: Fully labeled dataset with all quality scores
- Meta-rater Framework: Implementation and methodology details
License
Please refer to the license terms of the original SlimPajama dataset and follow applicable data licensing requirements.
Contact
For questions or issues, please contact the authors or open an issue in the repository.