
---
title: GAIA Benchmark Agent - Final Assessment
emoji: 🕵🏻‍♂️
colorFrom: indigo
colorTo: indigo
sdk: gradio
sdk_version: 5.25.2
app_file: app.py
pinned: false
hf_oauth: true
hf_oauth_expiration_minutes: 480
---

# AI Agent for GAIA Benchmark

Final assessment for the Hugging Face AI Agents course.

This repository contains a fully implemented autonomous agent designed to solve Level 1 of the GAIA benchmark. The agent combines large language models with a suite of external tools to tackle complex, real-world, multi-modal tasks. It is ready to run and submit answers to the GAIA evaluation server, and can be deployed as a Hugging Face Space with a Gradio interface.

## Project Summary

- **Purpose:** Automatically solve and submit answers for the GAIA benchmark, which evaluates generalist AI agents on tasks requiring reasoning, code execution, web search, data analysis, and more.
- **Features:**
  - Uses LLMs (OpenAI, Hugging Face, etc.) for reasoning and planning
  - Integrates multiple tools: web search, Wikipedia, Python code execution, YouTube transcripts, and more
  - Handles file-based and multi-modal tasks
  - Submits results and displays scores in a user-friendly Gradio interface
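As a rough illustration of the tool-integration pattern described above, the sketch below shows a minimal tool-dispatch step: the agent picks a tool by name and routes the argument to it. The tool stand-ins and the `run_tool` helper are hypothetical, not the actual API of `agent.py` or `tools.py`.

```python
# Minimal sketch of tool dispatch; tool names and run_tool are
# illustrative assumptions, not this repository's real API.

def web_search(query: str) -> str:
    """Stand-in for a real web-search tool."""
    return f"results for: {query}"

def python_exec(code: str) -> str:
    """Stand-in for a sandboxed Python-execution tool."""
    # Illustration only: real sandboxing needs much more than this.
    return str(eval(code, {"__builtins__": {}}))

TOOLS = {"web_search": web_search, "python_exec": python_exec}

def run_tool(name: str, argument: str) -> str:
    """Dispatch an LLM-chosen action to the matching tool."""
    if name not in TOOLS:
        return f"unknown tool: {name}"
    return TOOLS[name](argument)
```

In the real agent, the LLM emits the tool name and argument as part of its reasoning loop, and the tool's output is fed back into the next prompt.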

## How to Run

**On Hugging Face Spaces:**

- Log in with your Hugging Face account.
- Click "Run Evaluation & Submit All Answers" to evaluate the agent on the GAIA benchmark and see your score.

**Locally:**

```bash
pip install -r requirements.txt
python app.py
```

## About GAIA

GAIA is a challenging benchmark for evaluating the capabilities of generalist AI agents on real-world, multi-step, and multi-modal tasks. Each task may require code execution, web search, data analysis, or other tool use. This agent is designed to autonomously solve such tasks and submit answers for evaluation.

## Architecture

- `app.py`: Gradio app and evaluation logic. Fetches questions, runs the agent, and submits answers.
- `agent.py`: Main `Agent` class. Implements reasoning, tool use, and answer formatting.
- `model.py`: Loads and manages LLM backends (OpenAI, Hugging Face, LiteLLM, etc.).
- `tools.py`: Implements the external tools.
- `utils/logger.py`: Logging utility.
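The fetch-answer-submit cycle that `app.py` implements can be sketched roughly as follows. The `API_URL`, endpoint paths (`/questions`, `/submit`), and payload field names are illustrative assumptions, not the real scoring server's API.

```python
# Hedged sketch of the evaluation flow: fetch questions, run the
# agent on each, and submit all answers in one payload.
# NOTE: API_URL, endpoint paths, and field names are assumptions
# for illustration; they are not the real scoring server's API.
import json
from urllib.request import Request, urlopen

API_URL = "https://example-scoring-server.hf.space"  # placeholder

def build_submission(username: str, answers: dict) -> dict:
    """Shape per-question answers into one submission payload."""
    return {
        "username": username,
        "answers": [
            {"task_id": task_id, "submitted_answer": answer}
            for task_id, answer in sorted(answers.items())
        ],
    }

def run_evaluation(agent, username: str) -> dict:
    """Fetch questions, answer each one, and submit the results."""
    with urlopen(f"{API_URL}/questions", timeout=30) as resp:
        questions = json.load(resp)
    answers = {q["task_id"]: agent(q["question"]) for q in questions}
    body = json.dumps(build_submission(username, answers)).encode()
    req = Request(
        f"{API_URL}/submit",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urlopen(req, timeout=60) as resp:
        return json.load(resp)
```

The Gradio button handler then displays the returned score in the interface.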

## Environment Variables

Some models require API keys. Set these in your Space or local environment:

- `OPENAI_API_KEY` and `OPENAI_API_BASE` (for OpenAI models)
- `HUGGINGFACEHUB_API_TOKEN` (for Hugging Face Hub models)
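For a local run, these variables can be exported before starting the app; the key values below are placeholders, not working credentials.

```shell
# Placeholder values for illustration; substitute your real keys.
export OPENAI_API_KEY="sk-..."
export OPENAI_API_BASE="https://api.openai.com/v1"  # standard OpenAI endpoint
export HUGGINGFACEHUB_API_TOKEN="hf_..."
```

On Hugging Face Spaces, set the same variable names as Space secrets instead of exporting them in a shell.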

## Dependencies

All required packages are listed in `requirements.txt`.