arxiv:2311.12983

GAIA: a benchmark for General AI Assistants

Published on Nov 21, 2023
Submitted by akhaliq on Nov 23, 2023
#1 Paper of the day

Abstract

We introduce GAIA, a benchmark for General AI Assistants that, if solved, would represent a milestone in AI research. GAIA proposes real-world questions that require a set of fundamental abilities such as reasoning, multi-modality handling, web browsing, and generally tool-use proficiency. GAIA questions are conceptually simple for humans yet challenging for most advanced AIs: we show that human respondents obtain 92% vs. 15% for GPT-4 equipped with plugins. This notable performance disparity contrasts with the recent trend of LLMs outperforming humans on tasks requiring professional skills in e.g. law or chemistry. GAIA's philosophy departs from the current trend in AI benchmarks of targeting tasks that are ever more difficult for humans. We posit that the advent of Artificial General Intelligence (AGI) hinges on a system's capability to exhibit similar robustness as the average human does on such questions. Using GAIA's methodology, we devise 466 questions and their answers. We release our questions while retaining answers to 300 of them to power a leaderboard available at https://huggingface.co/gaia-benchmark.

Community

This work is both fascinating and necessary, yet it seems to overlook the representation of the median human in its analysis. The educational breakdown of the annotators, as presented, does not align with the general population's educational levels. For context:

In the general U.S. population aged 25 and older in 2022, only 23% held a bachelor’s degree as their highest degree, while 14% had advanced education (like a master’s, professional, or doctoral degree), according to the Census Bureau.
In contrast, the paper indicates a much higher educational level among annotators:
Bachelor’s Degree: 61%
Master’s Degree: 26%
PhD: 17%
This discrepancy raises questions about the representativeness of the research sample compared to the general population.

In the level 3 example you say explicitly "Use commas as thousands separators in the number of minutes.". The provided answer is "Ground truth: White; 5876". Should it not be "Ground truth: White; 5,876"?

Paper author

You're absolutely right: "Use commas as thousands separators in the number of minutes." comes from an older version of the dataset; we will remove it in the next version of the paper.
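For what it's worth, a lenient comparison makes that particular formatting detail moot anyway. Here is a minimal sketch (my own illustration, not the official GAIA scorer) of a normalization step that drops thousands separators before comparing numeric answers, so "5876" and "5,876" are treated as equal:

```python
def normalize_number(answer: str) -> str:
    # Strip whitespace and thousands separators so "5,876" and "5876" compare equal.
    return answer.strip().replace(",", "").replace(" ", "")


def answers_match(predicted: str, ground_truth: str) -> bool:
    try:
        # Numeric comparison when both sides parse as numbers.
        return float(normalize_number(predicted)) == float(normalize_number(ground_truth))
    except ValueError:
        # Otherwise fall back to a case-insensitive string comparison.
        return predicted.strip().lower() == ground_truth.strip().lower()


assert answers_match("5,876", "5876")
assert answers_match("White", "white")
```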


Regarding the education breakdown of the annotators: there is indeed likely a discrepancy, one that is impossible to fully close, between the distribution of the annotators and the general population. That being said, the questions require fundamental abilities (planning, tool use, multi-modal understanding, etc.) rather than expert knowledge.

Very cool benchmark, congrats!

Can you share any examples from levels 1 & 2 where GPT-4 got the right answer, but the human annotators didn't? I think it would be quite interesting to learn whether there's a type of multi-step question that LLMs are intrinsically better at than humans.

Paper author

Most of the mistakes made by human validators (and the reason we don't get a 100% human score) were attention mistakes (misreading or mistyping something, for example) rather than a difference in actual capability - unless you count "focus" as a capability, in which case we could argue that machines in general are already better at it than most of us 😅

@gregmialz would have specific examples of this.

GAIA is the Turing test of AI!


I wouldn't say closing the gap between the annotators and the general population is impossible, but I'm not sure how feasible it would be to:

  • require/validate specific standardized test results in the annotators' CVs prior to acceptance, taken within X years (relative to the type of test)
  • rank annotators against the population being compared to
  • make annotator pay reflect current job responsibilities and the requirements to obtain them

I love that NASA question. It will be something else entirely when LLMs are nailing those level 3 questions. I mean, you could wrestle the answer out with some clever and patient prompt engineering and chaining, but when it can do that zero-shot... basically magic. This oughta be the gold standard.

What is the plan for the inevitability of someone solving all the questions and putting them out on the open web? Just regularly create new problem sets?

This couples nicely with another benchmark suite: GPQA: A Graduate-Level Google-Proof Q&A Benchmark (https://arxiv.org/abs/2311.12022)

Paper author

@someone13574 Yes, these questions are quite easy to re-create or slightly modify in the case of memorization.
But also: getting the right answer without a good "trace of reasoning" doesn't mean much on this dataset.

Thanks @clefourrier for letting us know! 🤗


Nice work! Those questions are fun. It's a shame the new ChatGPT with all tools (web, image, Python) doesn't have a proper API so that it could be tested as well. Here is a totally cherry-picked example (it worked only once), and it still counts as a loss because the answer is not properly formatted:
[screenshot: drawing.jpg]

What is the plan for the inevitability of someone solving all the questions and putting them out on the open web? Just regularly create new problem sets?

You'd have to be a bit of a bastard to do that 😂 maybe someone would do it to poison the competition?

It's certainly not something to overlook.

Here are my thoughts:

  1. Publish 70% of the dataset, then keep 30% behind a trusted API (a rough sketch of this follows below). Hugging Face et al. could easily implement this functionality. Essentially, we would all have to agree that this central authority is trustworthy and unbiased.

  2. Regularly update the dataset. Requires humans and is expensive. Who has the incentive to do this?

  3. Synthetic dataset generated on the fly. Is this even plausible, and is it self-defeating?

  4. Close your eyes and hope for the best

😂
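To make option 1 concrete, here's a rough sketch of what a "trusted scoring API" could look like: the held-out answers live only on the server, and submitters get back an aggregate score, never the per-question ground truth. This is purely hypothetical (the endpoint, file name, and payload format are made up), not how the actual GAIA leaderboard works:

```python
# Hypothetical sketch of a gated scoring endpoint; not the GAIA leaderboard.
import json

from flask import Flask, jsonify, request

app = Flask(__name__)

# Assumed private file kept server-side: {"task_id": "ground-truth answer", ...}.
with open("private_test_answers.json") as f:
    GROUND_TRUTH = json.load(f)


def lenient_match(predicted: str, truth: str) -> bool:
    # Very lenient comparison; swap in a proper quasi-exact-match scorer here.
    def norm(s: str) -> str:
        return s.strip().lower().replace(",", "")
    return norm(predicted) == norm(truth)


@app.post("/score")
def score():
    # Expected payload (assumed): {"predictions": {"task_id": "model answer", ...}}
    predictions = request.get_json().get("predictions", {})
    hits = sum(
        lenient_match(answer, GROUND_TRUTH[task_id])
        for task_id, answer in predictions.items()
        if task_id in GROUND_TRUTH
    )
    # Only the aggregate accuracy is returned; answers never leave the server.
    return jsonify({"accuracy": hits / max(len(GROUND_TRUTH), 1)})


if __name__ == "__main__":
    app.run()
```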

People really don't care about data contamination. How about we resist running to ChatGPT with the dataset, ha.

Hi! Thank you all for your points about data contamination!

This is precisely why

  1. we only released the answers on the validation set, not on the test set, which is considerably bigger
  2. we released the precise recipe for generating such a dataset, in the hope that it will be extended with time
  3. we ask for the reasoning trace of the model

But since, at the moment, even the best models don't reach more than a few points on level 3, I think we have some time ahead of us :)
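For anyone who wants to verify point 1 themselves, here is a short sketch using the `datasets` library. The repo id, config name ("gaia-benchmark/GAIA", "2023_all"), and column names ("Question", "Final answer") are my reading of the dataset card, so treat them as assumptions; the dataset is also gated, so you need to accept the terms on the Hub and log in first:

```python
# Sketch: inspect the GAIA splits with the Hugging Face `datasets` library.
# Repo id, config, and column names are assumptions from the dataset card;
# depending on your `datasets` version you may also need trust_remote_code=True.
from datasets import load_dataset

gaia = load_dataset("gaia-benchmark/GAIA", "2023_all")

print(gaia)  # expected: a DatasetDict with "validation" and "test" splits

val_row = gaia["validation"][0]
print(val_row["Question"])
print(val_row["Final answer"])    # answers are released for validation...

test_row = gaia["test"][0]
print(test_row["Final answer"])   # ...but withheld (a placeholder) for test
```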

The 'GAIA' paper presents a fascinating study but raises a crucial question: does the higher educational level of annotators, compared to the general population, affect the evaluation of AI performance? This discrepancy might skew the AI's ability to handle real-world tasks that are more representative of the broader population's capabilities and perspectives. It's vital to consider a more diverse range of annotators to truly assess AI's proficiency in real-world scenarios.


@clefourrier @gregmialz
Hi. I was looking for 'pure riddles' (questions where the number of tools used equals zero) in the dataset.
The following tasks contain an incorrect number of tools in the solution (i.e. the described solution mentions 'websearch' or other tools, but the 'tools' section is empty); a quick scan sketch is at the end of this comment:

  • 305ac316-eef6-4446-960a-92d80d542f82
  • cf106601-ab4f-4af9-b045-5295fe67b37d
  • 5a0c1adf-205e-4841-a666-7c3ef95def9d

Btw, I'm not sure the answers to the following riddles are correct:

  • 42576abe-0deb-4869-8c63-225c2d75a95a (ask Gpt4 to think step by step)
  • ec09fa32-d03f-4bf8-84b0-1f16922c3ae4
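In case it helps to reproduce the first list, here is the kind of scan I would run on the validation split. The column and metadata key names ("task_id", "Annotator Metadata", "Steps", "Tools") are assumptions taken from the dataset card and may need adjusting:

```python
# Rough sketch: flag validation tasks whose metadata lists no tools while the
# written solution still mentions searching or browsing. Names are assumptions
# from the dataset card; the dataset is gated, so log in and accept the terms.
from datasets import load_dataset

gaia = load_dataset("gaia-benchmark/GAIA", "2023_all", split="validation")

for row in gaia:
    meta = row.get("Annotator Metadata") or {}
    steps = str(meta.get("Steps", "")).lower()
    tools = str(meta.get("Tools", "")).strip().lower()
    declares_no_tools = tools in {"", "none"}
    mentions_tools = any(kw in steps for kw in ("search", "browse", "google", "calculator"))
    if declares_no_tools and mentions_tools:
        print(row["task_id"], "-> steps mention a tool but the tools field is empty")
```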

GAIA: Benchmarking the True Capabilities of AI Assistants

Links 🔗:

👉 Subscribe: https://www.youtube.com/@Arxflix
👉 Twitter: https://x.com/arxflix
👉 LMNT (Partner): https://lmnt.com/

By Arxflix

Hey guys! I just finished reading the paper and wanted to share my summary notes with you.

Intro

  • Problem → The more complex the tasks used to evaluate an AI's performance, the harder it becomes to leverage humans for evaluation and to build tailored evaluation benchmarks. Existing benchmarks fall short because of (1) expert-level tasks and (2) human (subjective) evaluation
  • GAIA → An evaluation benchmark built to evaluate AI assistants on tasks that are simple but tedious for humans, assessing their capability to plan, reason, and use tools
  • Evaluation → Consists of 466 questions. Zero-shot is a must. Answers should be a number or a few plain words (e.g. 100, Andrew, etc.). The questions must follow this framework:
    1. Real-world challenging questions
    2. Ungameability: hard to brute-force without cheating, due to the number of steps + verification of reasoning traces + question diversity
    3. Unambiguity: questions shouldn't have multiple interpretations, and answers should be unique
    4. Simplicity: easy-to-understand questions + easy-to-verify answers (factoid answers)
  • LLMs do poorly on GAIA (<30% vs ~90% for humans). Why? Because of the lack of integration between planning, reasoning, and tool use

Related work

  • Evaluating LLMs → Performance on human-expert exams such as the US Bar or USMLE; suggestions include
    • Compilations of evaluations
    • Human-in-the-loop evaluation (time-consuming + difficult to scale)
    • Model-based evaluation (hard to evaluate state-of-the-art models)
  • Evaluating general assistants → Focuses on the technical performance of AI assistants, such as API calls, whereas GAIA focuses on real-world questions

GAIA

  • What does it Consist of?

    → Questions consist of text + an attached file / source of truth (SOT)

    → Answers must be short, unique, and easy to verify

    → Requires fundamental abilities: reasoning, multi-modality, code, and tool usage

  • Principles

    • Design choices. Rewards adaptability rather than specialized knowledge
    • Interpretability. Model performance is easy to analyze
    • Robustness against memorization: action-space size + question diversity
    • Easy to use
  • Evaluation

    • Automated, fast, and factual (quasi-exact verification; see the sketch after the "Build and extend GAIA" notes below)
  • Composition of GAIA

    • Not too strict on the behaviour, as there are multiple ways to solve each question
    • Levels
      1. ≤5 steps, no tool usage needed
      2. 5 < steps < 10, reasoning + tools
      3. Near-perfect general assistant: large number of steps and tool calls
    • Accessibility of questions + diversity (topic domains + cultures)
  • Build and extend GAIA

    • Craft questions → use of sources of truth (SOTs) such as Wikipedia
    • Validate questions → validation = the same answer from 3 different annotators
    • Relying on the web → information changes over time. Paywalls. Restrictions
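Since the notes above mention quasi-exact verification, here is a small sketch of what such a scorer could look like: numbers are compared after dropping separators, strings after lowercasing and stripping punctuation, and comma-separated lists element-wise. This is my own approximation under those assumptions, not the paper's official scoring code:

```python
# Approximate quasi-exact-match scorer; an illustration, not GAIA's official code.
import re
import string


def _normalize_str(s: str) -> str:
    # Lowercase, trim, and drop punctuation for a lenient string comparison.
    s = s.strip().lower()
    return s.translate(str.maketrans("", "", string.punctuation)).strip()


def _normalize_num(s: str) -> float:
    # Drop thousands separators, spaces, currency and percent signs before parsing.
    return float(re.sub(r"[,\s$%]", "", s))


def quasi_exact_match(prediction: str, truth: str) -> bool:
    # Numeric answers (possibly with thousands separators) are compared as numbers.
    try:
        return _normalize_num(prediction) == _normalize_num(truth)
    except ValueError:
        pass
    # List-style answers ("a, b, c") are compared element-wise.
    if "," in truth:
        pred_items = prediction.split(",")
        true_items = truth.split(",")
        return len(pred_items) == len(true_items) and all(
            quasi_exact_match(p, t) for p, t in zip(pred_items, true_items)
        )
    # Everything else falls back to a lenient string comparison.
    return _normalize_str(prediction) == _normalize_str(truth)


assert quasi_exact_match("5876", "5,876")
assert quasi_exact_match("Right, 2 wheels", "right, 2 wheels")
```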

LLM results on GAIA

  • Their performance is very poor on these questions
  • API / tool access is a huge improvement
  • GPT-4 can get past some web-search tasks thanks to memorization

Discussion

  • Reproducibility is not that important (time decay)
  • Static vs dynamic benchmarks. Time decay is key (data contamination, disappearance of web info). GAIA aims to update its questions as time passes
  • Evaluation unification
  • Partial vs full automation: partial is the current landscape; full is the target, but it represents a game-changer for work and the economy. Solving GAIA requires full automation to guarantee objectivity. Ethics and open source will be key

Limitations

  • Evaluating traces is not implemented
  • Question design is still expensive
  • Lack of linguistic / cultural diversity

Key Points 🎯

  • LLMs do poorly on simple but tedious tasks that require multiple steps, multi-modality, and planning
  • GAIA is simple, scalable, and automated, enabling evaluation of AI assistants on simple but tedious tasks
