AstroMLab

AstroMLab is a diverse group of researchers dedicated to advancing the application of Large Language Models (LLMs) in astronomy, with the goal of expediting scientific discovery through LLM-driven research. Our team includes:

  • Leading astronomers, astrophysicists, and cosmologists
  • Natural language processing experts
  • Frontier arXivists from the NASA Astrophysics Data System

Objectives

  • Develop specialized LLMs for astronomy
  • Create open-source models for advanced research
  • Facilitate LLM-driven end-to-end agentic research in astronomy

Current Work

Our ongoing projects include:

  • Curation of an astronomy-based benchmarking dataset
  • Development of specialized astronomy LLMs
  • Performance evaluation of models on astronomical tasks (see the sketch after this list)
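
To make the evaluation item above concrete, the sketch below shows the simplest form such an evaluation can take: grading a model's multiple-choice answers against an answer key. The data and function names are hypothetical illustrations, not AstroMLab's actual benchmark harness.

```python
# Minimal sketch of multiple-choice grading; the data and names are
# hypothetical and do not reflect AstroMLab's actual evaluation pipeline.

def grade(predictions, answer_key):
    """Return the fraction of multiple-choice answers (e.g. 'A'-'D') matching the key."""
    if len(predictions) != len(answer_key):
        raise ValueError("predictions and answer_key must have the same length")
    correct = sum(p.strip().upper() == a.strip().upper()
                  for p, a in zip(predictions, answer_key))
    return correct / len(answer_key)

if __name__ == "__main__":
    # Toy example: three questions, two answered correctly.
    model_answers = ["A", "C", "B"]
    answer_key = ["A", "C", "D"]
    print(f"Accuracy: {grade(model_answers, answer_key):.1%}")  # Accuracy: 66.7%
```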

Models and Performance

We have developed several models, including AstroSage-LLaMA-3.1-70B (de Haan et al. 2025b), AstroSage-LLaMA-3.1-8B (de Haan et al. 2025a), AstroLLaMA-2-70B (Pan et al. 2024), and AstroLLaMA-3-8B (Pan et al. 2024). Our AstroSage models have demonstrated strong performance on astronomy Q&A tasks (Ting et al. 2024):

Model                                   Score (%)
AstroSage-LLaMA-3.1-70B (AstroMLab)     86.2
Claude-4-Opus                           86.3
o3                                      85.4
Claude-4-Sonnet                         85.0
GPT-4.1                                 84.7
o4-Mini                                 84.7
Gemini-2.5-Pro                          84.8
Deepseek-R1                             84.4
Qwen-3-235B                             84.0
LLaMA-4-Maverick                        83.4
Deepseek-v3-2503                        82.9
Gemini-2.5-Flash-0520                   82.3
LLaMA-4-Scout                           82.2
Grok-3                                  81.7
Mistral-Medium-v3                       81.8
AstroSage-LLaMA-3.1-8B (AstroMLab)      80.9
Mistral-Large-v2                        80.8
Qwen-3-32B                              79.7
Mistral-Small-v3.1                      78.6
GPT-4.1-Nano                            78.0
Gemini-2-Flash-Lite                     78.4
Gemma-3-27B                             76.9
Qwen-3-14B                              76.4
AstroLLaMA-2-7B                         44.3

As of this writing in May 2025, AstroSage-LLaMA-3.1-70B (de Haan et al. 2025b) achieves one of the highest scores on AstroBench (Ting et al. 2024), effectively tying with Claude-4-Opus (86.2% vs. 86.3%) and outperforming other leading models such as GPT-4.1, o3, Gemini-2.5-Pro, and Claude-4-Sonnet.

Figure: Cost and performance trade-off on AstroBench.
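
For readers who want to try these models, the snippet below is a minimal sketch of querying the 8B model with the Hugging Face transformers library. It assumes the model is published under a repository id such as AstroMLab/AstroSage-8B; check the organization's model listing for the exact name and any recommended prompt or chat template.

```python
# Minimal sketch of querying an AstroSage model with Hugging Face transformers.
# The repository id "AstroMLab/AstroSage-8B" is an assumption; confirm the exact
# name (and any chat template) on the organization's model page.
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="AstroMLab/AstroSage-8B",  # assumed repository id
    device_map="auto",               # needs `accelerate`; omit to run on a single device
)

prompt = "Explain why Type Ia supernovae can be used as standardizable candles."
result = generator(prompt, max_new_tokens=200, do_sample=False)
print(result[0]["generated_text"])
```

The 70B variant can be loaded the same way, but it requires substantially more GPU memory or multi-GPU sharding.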

Support and Resources

Our research benefits from:

  • Access to the Frontier supercomputer at the Oak Ridge Leadership Computing Facility
  • Support from Microsoft's Accelerating Foundation Models Research (AFMR) program

Contact

For inquiries or collaboration opportunities, please contact: [email protected]