arxiv:2311.12983

GAIA: a benchmark for General AI Assistants

Published on Nov 21, 2023
Submitted by akhaliq on Nov 23, 2023
#1 Paper of the day

Abstract

We introduce GAIA, a benchmark for General AI Assistants that, if solved, would represent a milestone in AI research. GAIA proposes real-world questions that require a set of fundamental abilities such as reasoning, multi-modality handling, web browsing, and generally tool-use proficiency. GAIA questions are conceptually simple for humans yet challenging for most advanced AIs: we show that human respondents obtain 92% vs. 15% for GPT-4 equipped with plugins. This notable performance disparity contrasts with the recent trend of LLMs outperforming humans on tasks requiring professional skills in e.g. law or chemistry. GAIA's philosophy departs from the current trend in AI benchmarks of targeting tasks that are ever more difficult for humans. We posit that the advent of Artificial General Intelligence (AGI) hinges on a system's capability to exhibit similar robustness as the average human does on such questions. Using GAIA's methodology, we devise 466 questions and their answers. We release our questions while retaining answers to 300 of them to power a leaderboard available at https://huggingface.co/gaia-benchmark.

Community

This work is both fascinating and necessary, yet it seems to overlook the representation of the median human in its analysis. The educational breakdown of the annotators, as presented, does not align with the general population's educational levels. For context:

In the general U.S. population aged 25 and older in 2022, only 23% held a bachelor’s degree as their highest degree, while 14% had advanced education (like a master’s, professional, or doctoral degree), according to the Census Bureau.
In contrast, the paper indicates a much higher educational level among annotators:
Bachelor’s Degree: 61%
Master’s Degree: 26%
PhD: 17%
This discrepancy raises questions about the representativeness of the research sample compared to the general population.

In the level 3 example you say explicitly "Use commas as thousands separators in the number of minutes.". The provided answer is "Ground truth: White; 5876". Should it not be "Ground truth: White; 5,876"?

Paper author

You're absolutely right: "Use commas as thousands separators in the number of minutes." comes from an older version of the dataset; we will remove it in the next version of the paper.
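For what it's worth, a lenient comparison makes that particular formatting detail moot anyway. Here is a minimal sketch (my own illustration, not the official GAIA scorer) of a normalization step that drops thousands separators before comparing numeric answers, so "5876" and "5,876" are treated as equal:

```python
def normalize_number(answer: str) -> str:
    # Strip whitespace and thousands separators so "5,876" and "5876" compare equal.
    return answer.strip().replace(",", "").replace(" ", "")


def answers_match(predicted: str, ground_truth: str) -> bool:
    try:
        # Numeric comparison when both sides parse as numbers.
        return float(normalize_number(predicted)) == float(normalize_number(ground_truth))
    except ValueError:
        # Otherwise fall back to a case-insensitive string comparison.
        return predicted.strip().lower() == ground_truth.strip().lower()


assert answers_match("5,876", "5876")
assert answers_match("White", "white")
```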


Regarding the education breakdown of the annotators: there is indeed likely a discrepancy, one that is impossible to fully close, between the distribution of the annotators and the general population. That being said, the questions require fundamental abilities (planning, tool use, multi-modal understanding, etc.) rather than expert knowledge.

Very cool benchmark, congrats!

Can you share any examples from levels 1 & 2 where GPT-4 got the right answer, but the human annotators didn't? I think it would be quite interesting to learn whether there's a type of multi-step question that LLMs are intrinsically better at than humans.

Paper author

Most of the mistakes made by human validators (and the reason we don't get a 100% human score) were attention mistakes (misreading or mistyping something, for example) rather than a difference in actual capability - unless you count "focus" as a capability, in which case we could argue that machines in general are already better at it than most of us 😅

@gregmialz would have specific examples of this.

GAIA is the Turing test of AI!


I wouldn't say closing the gap between the annotators and the general population is impossible, but I'm not sure how feasible it would be to:

  • require/validate specific standardized test results in the annotators' CVs prior to acceptance, taken within X years (relative to the type of test)
  • rank annotators against the population being compared to
  • make annotator pay reflect current job responsibilities and the requirements to obtain them

I love that NASA question. It will be something else entirely when LLMs are nailing those level 3 questions. I mean, you could wrestle the answer out with some clever and patient prompt engineering and chaining, but when it can do that zero-shot... basically magic. This oughta be the gold standard.

What is the plan for the inevitability of someone solving all the questions and putting them out on the open web? Just regularly create new problem sets?

This couples nicely with another benchmark suite: GPQA: A Graduate-Level Google-Proof Q&A Benchmark (https://arxiv.org/abs/2311.12022)

Paper author

@someone13574 Yes, these questions are quite easy to re-create or slightly modify in the case of memorization.
But also: getting the right answer without a good "trace of reasoning" doesn't mean much on this dataset.

Thanks @clefourrier for letting us know! 🤗


Nice work! Those questions are fun. It's a shame the new ChatGPT with all tools (web, image, Python) doesn't have a proper API so that it could be tested as well. Here is a totally cherry-picked example (it worked only once), and it still counts as a loss because the answer is not properly formatted:
[screenshot: drawing.jpg]

What is the plan for the inevitability of someone solving all the questions and putting them out on the open web? Just regularly create new problem sets?

You'd have to be a bit of a bastard to do that 😂 maybe someone would do it to poison the competition?

It's certainly not something to overlook.

Here are my thoughts:

  1. Publish 70% of the dataset, then keep 30% behind a trusted API (a rough sketch of this follows below). Hugging Face et al. could easily implement this functionality. Essentially, we would all have to agree that this central authority is trustworthy and unbiased.

  2. Regularly update the dataset. Requires humans and is expensive. Who has the incentive to do this?

  3. Synthetic dataset generated on the fly. Is this even plausible, and is it self-defeating?

  4. Close your eyes and hope for the best

😂
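To make option 1 concrete, here's a rough sketch of what a "trusted scoring API" could look like: the held-out answers live only on the server, and submitters get back an aggregate score, never the per-question ground truth. This is purely hypothetical (the endpoint, file name, and payload format are made up), not how the actual GAIA leaderboard works:

```python
# Hypothetical sketch of a gated scoring endpoint; not the GAIA leaderboard.
import json

from flask import Flask, jsonify, request

app = Flask(__name__)

# Assumed private file kept server-side: {"task_id": "ground-truth answer", ...}.
with open("private_test_answers.json") as f:
    GROUND_TRUTH = json.load(f)


def lenient_match(predicted: str, truth: str) -> bool:
    # Very lenient comparison; swap in a proper quasi-exact-match scorer here.
    def norm(s: str) -> str:
        return s.strip().lower().replace(",", "")
    return norm(predicted) == norm(truth)


@app.post("/score")
def score():
    # Expected payload (assumed): {"predictions": {"task_id": "model answer", ...}}
    predictions = request.get_json().get("predictions", {})
    hits = sum(
        lenient_match(answer, GROUND_TRUTH[task_id])
        for task_id, answer in predictions.items()
        if task_id in GROUND_TRUTH
    )
    # Only the aggregate accuracy is returned; answers never leave the server.
    return jsonify({"accuracy": hits / max(len(GROUND_TRUTH), 1)})


if __name__ == "__main__":
    app.run()
```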

People really don't care about data contamination. How about we resist running to ChatGPT with the dataset, ha.

Hi! Thank you all for your points about data contamination!

This is precisely why

  1. we only released the answers on the validation set, not on the test set, which is considerably bigger
  2. we released the precise recipe for generating such a dataset, in the hope that it will be extended with time
  3. we ask for the reasoning trace of the model

But since, at the moment, even the best models don't reach more than a few points on level 3, I think we have some time ahead of us :)
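For anyone who wants to verify point 1 themselves, here is a short sketch using the `datasets` library. The repo id, config name ("gaia-benchmark/GAIA", "2023_all"), and column names ("Question", "Final answer") are my reading of the dataset card, so treat them as assumptions; the dataset is also gated, so you need to accept the terms on the Hub and log in first:

```python
# Sketch: inspect the GAIA splits with the Hugging Face `datasets` library.
# Repo id, config, and column names are assumptions from the dataset card;
# depending on your `datasets` version you may also need trust_remote_code=True.
from datasets import load_dataset

gaia = load_dataset("gaia-benchmark/GAIA", "2023_all")

print(gaia)  # expected: a DatasetDict with "validation" and "test" splits

val_row = gaia["validation"][0]
print(val_row["Question"])
print(val_row["Final answer"])    # answers are released for validation...

test_row = gaia["test"][0]
print(test_row["Final answer"])   # ...but withheld (a placeholder) for test
```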

The 'GAIA' paper presents a fascinating study but raises a crucial question: does the higher educational level of annotators, compared to the general population, affect the evaluation of AI performance? This discrepancy might skew the AI's ability to handle real-world tasks that are more representative of the broader population's capabilities and perspectives. It's vital to consider a more diverse range of annotators to truly assess AI's proficiency in real-world scenarios.


@clefourrier @gregmialz
Hi. I was looking for 'pure riddles' (questions where the number of tools used equals zero) in the dataset.
The following tasks contain an incorrect number of tools in the solution (i.e. the described solution mentions 'websearch' or other tools, but the 'tools' section is empty); a quick scan sketch is at the end of this comment:

  • 305ac316-eef6-4446-960a-92d80d542f82
  • cf106601-ab4f-4af9-b045-5295fe67b37d
  • 5a0c1adf-205e-4841-a666-7c3ef95def9d

Btw, I'm not sure the answers to the following riddles are correct:

  • 42576abe-0deb-4869-8c63-225c2d75a95a (ask Gpt4 to think step by step)
  • ec09fa32-d03f-4bf8-84b0-1f16922c3ae4
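In case it helps to reproduce the first list, here is the kind of scan I would run on the validation split. The column and metadata key names ("task_id", "Annotator Metadata", "Steps", "Tools") are assumptions taken from the dataset card and may need adjusting:

```python
# Rough sketch: flag validation tasks whose metadata lists no tools while the
# written solution still mentions searching or browsing. Names are assumptions
# from the dataset card; the dataset is gated, so log in and accept the terms.
from datasets import load_dataset

gaia = load_dataset("gaia-benchmark/GAIA", "2023_all", split="validation")

for row in gaia:
    meta = row.get("Annotator Metadata") or {}
    steps = str(meta.get("Steps", "")).lower()
    tools = str(meta.get("Tools", "")).strip().lower()
    declares_no_tools = tools in {"", "none"}
    mentions_tools = any(kw in steps for kw in ("search", "browse", "google", "calculator"))
    if declares_no_tools and mentions_tools:
        print(row["task_id"], "-> steps mention a tool but the tools field is empty")
```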

GAIA: Benchmarking the True Capabilities of AI Assistants

Links 🔗:

👉 Subscribe: https://www.youtube.com/@Arxflix
👉 Twitter: https://x.com/arxflix
👉 LMNT (Partner): https://lmnt.com/

By Arxflix

Hey guys! I just finished reading the paper and wanted to share my summary notes with you.

Intro

  • Problem → The more complex the tasks used to evaluate an AI's performance, the harder it becomes to leverage humans for evaluation and to build tailored evaluation benchmarks. Existing benchmarks fall short because of (1) expert-level tasks and (2) human (subjective) evaluation
  • GAIA → An evaluation benchmark built to evaluate AI assistants on tasks that are simple but tedious for humans, assessing their capability to plan, reason, and use tools
  • Evaluation → Consists of 466 questions. Zero-shot is a must. Answers should be a number or a few plain words (e.g. 100, Andrew, etc.). The questions must follow this framework:
    1. Real-world challenging questions
    2. Ungameability: hard to brute-force without cheating, due to the number of steps + verification of reasoning traces + question diversity
    3. Unambiguity: questions shouldn't have multiple interpretations, and answers should be unique
    4. Simplicity: easy-to-understand questions + easy-to-verify answers (factoid answers)
  • LLMs do poorly on GAIA (<30% vs ~90% for humans). Why? Because of the lack of integration between planning, reasoning, and tool use

Related work

  • Evaluating LLMs → Performance on human-expert exams such as the US Bar or USMLE; suggestions include
    • Compilations of evaluations
    • Human-in-the-loop evaluation (time-consuming + difficult to scale)
    • Model-based evaluation (hard to evaluate state-of-the-art models)
  • Evaluating general assistants → Focuses on the technical performance of AI assistants, such as API calls, whereas GAIA focuses on real-world questions

GAIA

  • What does it Consist of?

    → Questions consist of text + an attached file / source of truth (SOT)

    → Answers must be short, unique, and easy to verify

    → Requires fundamental abilities: reasoning, multi-modality, code, and tool usage

  • Principles

    • Design choices. Rewards adaptability rather than specialized knowledge
    • Interpretability. Model performance is easy to analyze
    • Robustness against memorization: action-space size + question diversity
    • Easy to use
  • Evaluation

    • Automated, fast, and factual (quasi-exact verification; see the sketch after the "Build and extend GAIA" notes below)
  • Composition of GAIA

    • Not too strict on the behaviour, as there are multiple ways to solve each question
    • Levels
      1. ≤5 steps, no tool usage needed
      2. 5 < steps < 10, reasoning + tools
      3. Near-perfect general assistant: large number of steps and tool calls
    • Accessibility of questions + diversity (topic domains + cultures)
  • Build and extend GAIA

    • Craft questions → use of sources of truth (SOTs) such as Wikipedia
    • Validate questions → validation = the same answer from 3 different annotators
    • Relying on the web → information changes over time. Paywalls. Restrictions
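Since the notes above mention quasi-exact verification, here is a small sketch of what such a scorer could look like: numbers are compared after dropping separators, strings after lowercasing and stripping punctuation, and comma-separated lists element-wise. This is my own approximation under those assumptions, not the paper's official scoring code:

```python
# Approximate quasi-exact-match scorer; an illustration, not GAIA's official code.
import re
import string


def _normalize_str(s: str) -> str:
    # Lowercase, trim, and drop punctuation for a lenient string comparison.
    s = s.strip().lower()
    return s.translate(str.maketrans("", "", string.punctuation)).strip()


def _normalize_num(s: str) -> float:
    # Drop thousands separators, spaces, currency and percent signs before parsing.
    return float(re.sub(r"[,\s$%]", "", s))


def quasi_exact_match(prediction: str, truth: str) -> bool:
    # Numeric answers (possibly with thousands separators) are compared as numbers.
    try:
        return _normalize_num(prediction) == _normalize_num(truth)
    except ValueError:
        pass
    # List-style answers ("a, b, c") are compared element-wise.
    if "," in truth:
        pred_items = prediction.split(",")
        true_items = truth.split(",")
        return len(pred_items) == len(true_items) and all(
            quasi_exact_match(p, t) for p, t in zip(pred_items, true_items)
        )
    # Everything else falls back to a lenient string comparison.
    return _normalize_str(prediction) == _normalize_str(truth)


assert quasi_exact_match("5876", "5,876")
assert quasi_exact_match("Right, 2 wheels", "right, 2 wheels")
```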

LLM results on GAIA

  • Their performance is very poor on these questions
  • API / tool access is a huge improvement
  • GPT-4 can get past some web-search tasks thanks to memorization

Discussion

  • Reproducibility is not that important (time decay)
  • Static vs dynamic benchmarks. Time decay is key (data contamination, disappearance of web info). GAIA aims to update its questions as time passes
  • Evaluation unification
  • Partial vs full automation: partial is the current landscape; full is the target, but it represents a game-changer for work and the economy. Solving GAIA requires full automation to guarantee objectivity. Ethics and open source will be key

Limitations

  • Evaluating traces is not implemented
  • Question design is still expensive
  • Lack of linguistic / cultural diversity

Key Points 🎯

  • LLMs do poorly on simple but tedious tasks that require multiple steps, multi-modality, and planning
  • GAIA is simple, scalable, and automated, enabling evaluation of AI assistants on simple but tedious tasks
