|
--- |
|
title: GAIA Benchmark Agent - Final Assessment |
|
emoji: 🕵🏻‍♂️
|
colorFrom: indigo |
|
colorTo: indigo |
|
sdk: gradio |
|
sdk_version: 5.25.2 |
|
app_file: app.py |
|
pinned: false |
|
hf_oauth: true |
|
|
|
hf_oauth_expiration_minutes: 480 |
|
--- |
|
|
|
# AI Agent for GAIA Benchmark |
|
|
|
**Final assessment for the Hugging Face AI Agents course** |
|
|
|
This repository contains a fully implemented autonomous agent designed to solve Level 1 of the [GAIA benchmark](https://arxiv.org/abs/2311.12983). The agent leverages large language models and a suite of external tools to tackle complex, real-world, multi-modal tasks. It is ready to run and submit answers to the GAIA evaluation server, and can be deployed as a HuggingFace Space with a Gradio interface.
|
|
|
## Project Summary |
|
- **Purpose:** Automatically solve and submit answers for the GAIA benchmark, which evaluates generalist AI agents on tasks requiring reasoning, code execution, web search, data analysis, and more. |
|
- **Features:** |
|
- Uses LLMs (OpenAI, HuggingFace, etc.) for reasoning and planning |
|
  - Integrates multiple tools: web search, Wikipedia, Python code execution, YouTube transcript retrieval, and more (see the sketch after this list)
|
- Handles file-based and multi-modal tasks |
|
- Submits results and displays scores in a user-friendly Gradio interface |
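
The agent's tool use can be pictured as a simple reason-act loop: the LLM proposes a tool call, the agent executes it, and the observation is fed back into the context until a final answer is produced. The sketch below is illustrative only; the tool bodies, the `call_llm` callable, and the JSON action format are hypothetical placeholders and do not mirror the actual `agent.py` implementation.

```python
# Illustrative sketch of a reason-act loop with tool dispatch.
# Tool bodies and `call_llm` are hypothetical; the real logic lives in agent.py / tools.py.
import json

def web_search(query: str) -> str:
    # Placeholder: a real tool would call a search API and summarise results.
    return f"search results for: {query}"

def run_python(code: str) -> str:
    # Placeholder: a real tool would execute the code in a sandbox.
    return "execution output"

TOOLS = {"web_search": web_search, "run_python": run_python}

def solve(question: str, call_llm, max_steps: int = 10) -> str:
    """Ask the LLM for JSON actions until it emits a final answer."""
    history = [{"role": "user", "content": question}]
    for _ in range(max_steps):
        reply = call_llm(history)          # e.g. '{"tool": "web_search", "input": "..."}'
        action = json.loads(reply)
        if action["tool"] == "final_answer":
            return action["input"]         # GAIA expects short, exact answers
        observation = TOOLS[action["tool"]](action["input"])
        history.append({"role": "assistant", "content": reply})
        history.append({"role": "user", "content": f"Observation: {observation}"})
    return "unable to answer"
```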
|
|
|
## How to Run |
|
|
|
**On HuggingFace Spaces:** |
|
- Log in with your HuggingFace account. |
|
- Click "Run Evaluation & Submit All Answers" to evaluate the agent on the GAIA benchmark and see your score. |
|
|
|
**Locally:** |
|
```bash
pip install -r requirements.txt
python app.py
```
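
If the chosen model backend needs an API key, set it before launching; see the Environment Variables section below.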
|
|
|
## About GAIA |
|
GAIA is a challenging benchmark for evaluating the capabilities of generalist AI agents on real-world, multi-step, and multi-modal tasks. Each task may require code execution, web search, data analysis, or other tool use. This agent is designed to autonomously solve such tasks and submit answers for evaluation. |
|
|
|
## Architecture |
|
- `app.py` – Gradio app and evaluation logic. Fetches questions, runs the agent, and submits answers (see the sketch after this list)

- `agent.py` – Main `Agent` class. Implements reasoning, tool use, and answer formatting

- `model.py` – Loads and manages LLM backends (OpenAI, HuggingFace, LiteLLM, etc.)

- `tools.py` – Implements the external tools (web search, Wikipedia, code execution, YouTube transcripts, etc.)

- `utils/logger.py` – Logging utility
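
As a rough picture of how `app.py` ties these pieces together, here is a minimal sketch of the fetch-run-submit flow. The base URL, payload fields, and helper signature are placeholders for illustration, not the actual evaluation API.

```python
# Minimal sketch of the fetch -> run agent -> submit flow in app.py.
# The base URL and payload fields are placeholders, not the real API.
import requests

API_BASE = "https://example.org/gaia-scoring"   # placeholder evaluation server

def run_and_submit(agent, username: str) -> dict:
    questions = requests.get(f"{API_BASE}/questions", timeout=30).json()
    answers = [
        {"task_id": q["task_id"], "submitted_answer": agent(q["question"])}
        for q in questions
    ]
    response = requests.post(
        f"{API_BASE}/submit",
        json={"username": username, "answers": answers},
        timeout=60,
    )
    return response.json()                      # score shown in the Gradio UI
```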
|
|
|
## Environment Variables |
|
Some model backends require API keys. Set these as secrets in your Space or in your local environment (see the example after this list):
|
- `OPENAI_API_KEY` and `OPENAI_API_BASE` (for OpenAI models) |
|
- `HUGGINGFACEHUB_API_TOKEN` (for HuggingFace Hub models) |
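
For a local run, export the relevant keys before launching the app. The snippet below is only an illustration of how a backend loader such as `model.py` might read them; it is not the actual code.

```python
# Illustrative: read optional API keys from the environment before
# constructing an LLM backend. Not the actual model.py code.
import os

openai_key = os.getenv("OPENAI_API_KEY")
openai_base = os.getenv("OPENAI_API_BASE")            # optional custom endpoint
hf_token = os.getenv("HUGGINGFACEHUB_API_TOKEN")

if not (openai_key or hf_token):
    raise RuntimeError("Set OPENAI_API_KEY or HUGGINGFACEHUB_API_TOKEN first.")
```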
|
|
|
## Dependencies |
|
All required Python packages are listed in `requirements.txt`.
|
|