# AstroMLab

AstroMLab is a diverse group of researchers dedicated to advancing the application of Large Language Models (LLMs) in astronomy. Our team includes:
- Leading astronomers, astrophysicists, and cosmologists.
- Natural language processing experts.
- Frontier arXivists from the NASA Astrophysics Data System.

## Objectives
- Develop specialized LLMs for astronomy
- Create open-source models for advanced research
- Facilitate LLM-driven end-to-end agentic research in astronomy

## Current Work

Our ongoing projects include:

- Curation of an astronomy-based benchmarking dataset
- Development of specialized astronomy LLMs
- Performance evaluation of models on astronomical tasks

## Models and Performance

We have developed several models, including AstroSage-LLaMA-3.1-70B ([de Haan et al. 2025b](https://arxiv.org/abs/2505.17592)), AstroSage-LLaMA-3.1-8B ([de Haan et al. 2025a](https://arxiv.org/abs/2411.09012)), and AstroLLaMA-2-70B and AstroLLaMA-3-8B ([Pan et al. 2024](https://arxiv.org/abs/2409.19750)). Our AstroSage models have demonstrated strong performance on astronomy Q&A tasks ([Ting et al. 2024](https://arxiv.org/abs/2407.11194)):

| Model | Score (%) |
|-------|-----------|
| **AstroSage-LLaMA-3.1-70B (AstroMLab)** | **86.2** |
| Claude-4-Opus | **86.3** |
| o3 | 85.4 |
| Claude-4-Sonnet | 85.0 |
| Gemini-2.5-Pro | 84.8 |
| GPT-4.1 | 84.7 |
| o4-Mini | 84.7 |
| Deepseek-R1 | 84.4 |
| Qwen-3-235B | 84.0 |
| LLaMA-4-Maverick | 83.4 |
| Deepseek-v3-2503 | 82.9 |
| Gemini-2.5-Flash-0520 | 82.3 |
| LLaMA-4-Scout | 82.2 |
| Mistral-Medium-v3 | 81.8 |
| Grok-3 | 81.7 |
| **AstroSage-LLaMA-3.1-8B (AstroMLab)** | **80.9** |
| Mistral-Large-v2 | 80.8 |
| Qwen-3-32B | 79.7 |
| Mistral-Small-v3.1 | 78.6 |
| Gemini-2-Flash-Lite | 78.4 |
| GPT-4.1-Nano | 78.0 |
| Gemma-3-27B | 76.9 |
| Qwen-3-14B | 76.4 |
| AstroLLaMA-2-7B | 44.3 |

As of this writing in May 2025, AstroSage-LLaMA-3.1-70B ([de Haan et al. 2025b](https://arxiv.org/abs/2505.17592)) achieves one of the highest scores on AstroBench ([Ting et al. 2024](https://arxiv.org/abs/2407.11194)). At 86.2%, it essentially ties with Claude-4-Opus (86.3%) and outperforms other leading models, including o3, Claude-4-Sonnet, Gemini-2.5-Pro, and GPT-4.1.
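The comparison above can be checked directly from the table. The following is a minimal sketch, not part of the AstroMLab tooling: it copies a subset of the scores from the table, ranks them, and reports each model's gap to AstroSage-LLaMA-3.1-70B in percentage points. The `margins` helper is a hypothetical name introduced only for this illustration.

```python
# Scores (%) copied from the table above; only the top few rows are included.
scores = {
    "AstroSage-LLaMA-3.1-70B": 86.2,
    "Claude-4-Opus": 86.3,
    "o3": 85.4,
    "Claude-4-Sonnet": 85.0,
    "Gemini-2.5-Pro": 84.8,
    "GPT-4.1": 84.7,
}


def margins(scores, reference):
    """Hypothetical helper: score of each other model minus the reference
    model's score, in percentage points (rounded to one decimal)."""
    ref = scores[reference]
    return {
        name: round(value - ref, 1)
        for name, value in scores.items()
        if name != reference
    }


# Model names ordered from highest to lowest score.
ranked = sorted(scores, key=scores.get, reverse=True)
gaps = margins(scores, "AstroSage-LLaMA-3.1-70B")

for name in ranked:
    gap = gaps.get(name)
    suffix = "" if gap is None else f" ({gap:+.1f} pts vs. AstroSage-70B)"
    print(f"{name}: {scores[name]:.1f}%{suffix}")
```

A positive gap means the model scores above AstroSage-LLaMA-3.1-70B and a negative gap means it scores below. The sketch compares the reported scores only and says nothing about statistical uncertainty.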

![Cost and performance trade-off in AstroBench](https://cdn-uploads.huggingface.co/production/uploads/64f12d6e057f7e90416ce3c4/EW5taqz-hYtKSsFVK6xeF.png)


## Support and Resources

Our research benefits from:
- Access to the Frontier nodes at Oak Ridge Leadership Computing Facility
- Support from Microsoft's Accelerating Foundation Models Research (AFMR) program

## Contact

For inquiries or collaboration opportunities, please contact: [email protected]