Daniil Bogdanov committed
Commit a225ae4 · 1 Parent(s): 932fded

Release v5

README.md CHANGED
@@ -16,9 +16,9 @@ hf_oauth_expiration_minutes: 480
 
 **Final assessment for the Hugging Face AI Agents course**
 
-This repository contains a fully implemented autonomous agent designed to solve the [GAIA benchmark](https://arxiv.org/abs/2403.08790) - level 1. The agent leverages large language models and a suite of external tools to tackle complex, real-world, multi-modal tasks. It is ready to run and submit answers to the GAIA evaluation server, and is deployable as a HuggingFace Space with a Gradio interface.
+This repository contains a fully implemented autonomous agent designed to solve the [GAIA benchmark](https://arxiv.org/abs/2311.12983) - level 1. The agent leverages large language models and a suite of external tools to tackle complex, real-world, multi-modal tasks. It is ready to run and submit answers to the GAIA evaluation server, and is deployable as a HuggingFace Space with a Gradio interface.
 
-## 🏆 Project Summary
+## Project Summary
 - **Purpose:** Automatically solve and submit answers for the GAIA benchmark, which evaluates generalist AI agents on tasks requiring reasoning, code execution, web search, data analysis, and more.
 - **Features:**
   - Uses LLMs (OpenAI, HuggingFace, etc.) for reasoning and planning
@@ -26,7 +26,7 @@ This repository contains a fully implemented autonomous agent designed to solve
   - Handles file-based and multi-modal tasks
   - Submits results and displays scores in a user-friendly Gradio interface
 
-## 🚀 How to Run
+## How to Run
 
 **On HuggingFace Spaces:**
 - Log in with your HuggingFace account.
@@ -38,21 +38,20 @@ pip install -r requirements.txt
 python app.py
 ```
 
-## 🧠 About GAIA
+## About GAIA
 GAIA is a challenging benchmark for evaluating the capabilities of generalist AI agents on real-world, multi-step, and multi-modal tasks. Each task may require code execution, web search, data analysis, or other tool use. This agent is designed to autonomously solve such tasks and submit answers for evaluation.
 
-## 🏗️ Architecture
+## Architecture
 - `app.py` — Gradio app and evaluation logic. Fetches questions, runs the agent, and submits answers
 - `agent.py` — Main `Agent` class. Implements reasoning, tool use, and answer formatting
 - `model.py` — Loads and manages LLM backends (OpenAI, HuggingFace, LiteLLM, etc.)
 - `tools.py` — Implements external tools
 - `utils/logger.py` — Logging utility
-- `requirements.txt` — All dependencies for local and Spaces deployment
 
-## 🔑 Environment Variables
+## Environment Variables
 Some models require API keys. Set these in your Space or local environment:
 - `OPENAI_API_KEY` and `OPENAI_API_BASE` (for OpenAI models)
 - `HUGGINGFACEHUB_API_TOKEN` (for HuggingFace Hub models)
 
-## 📦 Dependencies
+## Dependencies
 All required packages are listed in `requirements.txt`
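
For a local run, the variables from the Environment Variables section can be exported before `python app.py`. A minimal sketch with placeholder values (not real credentials; the `OPENAI_API_BASE` default shown is an assumption, override it if you use a proxy):

```python
# Placeholder values only; set real keys in your shell or Space secrets.
import os

os.environ["OPENAI_API_KEY"] = "sk-..."                      # OpenAI models
os.environ["OPENAI_API_BASE"] = "https://api.openai.com/v1"  # assumed default base URL
os.environ["HUGGINGFACEHUB_API_TOKEN"] = "hf_..."            # HuggingFace Hub models
```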
agent.py CHANGED
@@ -38,6 +38,7 @@ class Agent:
             "re",
             "openpyxl",
             "pathlib",
+            "sys",
         ]
         self.agent = CodeAgent(
             model=self.model,
@@ -55,12 +56,14 @@ class Agent:
         - Reason step-by-step. Think through the solution logically and plan your actions carefully before answering.
         - Validate information. Always verify facts when possible instead of guessing.
         - Use code if needed. For calculations, parsing, or transformations, generate Python code and execute it. But be careful: some questions contain time-consuming tasks, so watch what code you run. It is better to analyze the question first and think about the best way to solve it.
+        - Don't forget to use `final_answer` to give the final answer.
+        - Use the file name ONLY FROM the "FILE:" section. THIS IS ALWAYS A FILE.
 
         IMPORTANT: When giving the final answer, output only the direct required result without any extra text like "Final Answer:" or explanations. YOU MUST RESPOND IN THE EXACT FORMAT AS THE QUESTION.
 
         QUESTION: {question}
 
-        CONTEXT: {context}
+        FILE: {context}
 
         ANSWER:
         """
app.py CHANGED
@@ -9,7 +9,7 @@ import requests
 
 from agent import Agent
 from model import get_model
-from tools import get_tools
+from tools.tools import get_tools
 
 # (Keep Constants as is)
 # --- Constants ---
@@ -33,7 +33,7 @@ def run_and_submit_all(
         Tuple[str, Optional[pd.DataFrame]]: Status message and DataFrame of results.
     """
     # --- Determine HF Space Runtime URL and Repo URL ---
-    space_id = "exsandebest/agent-course-final-assessment"  # Get the SPACE_ID for sending link to the code
+    space_id = os.getenv("SPACE_ID")  # Get the SPACE_ID for sending link to the code
 
     if profile:
         username = f"{profile.username}"
@@ -95,10 +95,26 @@ def run_and_submit_all(
         try:
             file_response = requests.get(f"{files_url}/{task_id}", timeout=15)
             if file_response.status_code == 200 and file_response.content:
-                with tempfile.NamedTemporaryFile(delete=False) as tmp_file:
-                    tmp_file.write(file_response.content)
-                    file_path = tmp_file.name
-                print(f"Downloaded file for task {task_id} to {file_path}")
+                # Get filename from Content-Disposition header or URL
+                filename = None
+                content_disposition = file_response.headers.get(
+                    "Content-Disposition"
+                )
+                if content_disposition and "filename=" in content_disposition:
+                    filename = content_disposition.split("filename=")[-1].strip('"')
+                else:
+                    # Try to get filename from URL
+                    url = file_response.url
+                    filename = url.split("/")[-1]
+                    if not filename or filename == str(task_id):
+                        filename = f"file_{task_id}"
+
+                # Create temp directory and save file with original name
+                temp_dir = tempfile.mkdtemp()
+                file_path = os.path.join(temp_dir, filename)
+                with open(file_path, "wb") as f:
+                    f.write(file_response.content)
+                print(f"Downloaded file for task {task_id} to {file_path}")
             else:
                 print(f"No file for task {task_id} or file is empty.")
         except Exception as e:
@@ -213,8 +229,8 @@ with gr.Blocks() as demo:
 if __name__ == "__main__":
     print("\n" + "-" * 30 + " App Starting " + "-" * 30)
     # Check for SPACE_HOST and SPACE_ID at startup for information
-    space_host_startup = "exsandebest-agent-course-final-assessment.hf.space"
-    space_id_startup = "exsandebest/agent-course-final-assessment"
+    space_host_startup = os.getenv("SPACE_HOST")
+    space_id_startup = os.getenv("SPACE_ID")
 
     if space_host_startup:
         print(f"✅ SPACE_HOST found: {space_host_startup}")
requirements.txt CHANGED
@@ -11,4 +11,6 @@ smolagents[openai]
 smolagents[transformers]
 transformers
 wikipedia-api
-youtube-transcript-api
+youtube-transcript-api
+openai-whisper
+openai
tools.py DELETED
@@ -1,100 +0,0 @@
-from typing import Any, List
-
-import pytesseract
-from PIL import Image
-from smolagents import (
-    DuckDuckGoSearchTool,
-    PythonInterpreterTool,
-    SpeechToTextTool,
-    Tool,
-    VisitWebpageTool,
-    WikipediaSearchTool,
-)
-from youtube_transcript_api import YouTubeTranscriptApi
-
-
-class YouTubeTranscriptionTool(Tool):
-    """
-    Tool to fetch the transcript of a YouTube video given its URL.
-
-    Args:
-        video_url (str): YouTube video URL.
-
-    Returns:
-        str: Transcript of the video as a single string.
-    """
-
-    name = "youtube_transcription"
-    description = "Fetches the transcript of a YouTube video given its URL"
-    inputs = {
-        "video_url": {"type": "string", "description": "YouTube video URL"},
-    }
-    output_type = "string"
-
-    def forward(self, video_url: str) -> str:
-        video_id = video_url.strip().split("v=")[-1]
-        transcript = YouTubeTranscriptApi.get_transcript(video_id)
-        return " ".join([entry["text"] for entry in transcript])
-
-
-class ReadFileTool(Tool):
-    """
-    Tool to read a file and return its content.
-
-    Args:
-        file_path (str): Path to the file to read.
-
-    Returns:
-        str: Content of the file or error message.
-    """
-
-    name = "read_file"
-    description = "Reads a file and returns its content"
-    inputs = {
-        "file_path": {"type": "string", "description": "Path to the file to read"},
-    }
-    output_type = "string"
-
-    def forward(self, file_path: str) -> str:
-        try:
-            with open(file_path, "r") as file:
-                return file.read()
-        except Exception as e:
-            return f"Error reading file: {str(e)}"
-
-
-class ExtractTextFromImageTool(Tool):
-    name = "extract_text_from_image"
-    description = "Extracts text from an image using pytesseract"
-    inputs = {
-        "image_path": {"type": "string", "description": "Path to the image file"},
-    }
-    output_type = "string"
-
-    def forward(self, image_path: str) -> str:
-        try:
-            image = Image.open(image_path)
-            text = pytesseract.image_to_string(image)
-            return text
-        except Exception as e:
-            return f"Error extracting text from image: {str(e)}"
-
-
-def get_tools() -> List[Tool]:
-    """
-    Returns a list of available tools for the agent.
-
-    Returns:
-        List[Tool]: List of initialized tool instances.
-    """
-    tools = [
-        DuckDuckGoSearchTool(),
-        PythonInterpreterTool(),
-        WikipediaSearchTool(),
-        VisitWebpageTool(),
-        SpeechToTextTool(),
-        YouTubeTranscriptionTool(),
-        ReadFileTool(),
-        ExtractTextFromImageTool(),
-    ]
-    return tools
tools/describe_image_tool.py ADDED
@@ -0,0 +1,112 @@
+import base64
+import os
+
+from openai import OpenAI
+from smolagents import Tool
+
+client = OpenAI()
+
+
+class DescribeImageTool(Tool):
+    """
+    Tool to analyze and describe any image using GPT-4 Vision API.
+
+    Args:
+        image_path (str): Path to the image file.
+        description_type (str): Type of description to generate. Options:
+            - "general": General description of the image
+            - "detailed": Detailed analysis of the image
+            - "chess": Analysis of a chess position
+            - "text": Extract and describe text from the image
+            - "custom": Custom description based on user prompt
+
+    Returns:
+        str: Description of the image based on the requested type.
+    """
+
+    name = "describe_image"
+    description = "Analyzes and describes images using GPT-4 Vision API"
+    inputs = {
+        "image_path": {"type": "string", "description": "Path to the image file"},
+        "description_type": {
+            "type": "string",
+            "description": "Type of description to generate (general, detailed, chess, text, custom)",
+            "nullable": True,
+        },
+        "custom_prompt": {
+            "type": "string",
+            "description": "Custom prompt for description (only used when description_type is 'custom')",
+            "nullable": True,
+        },
+    }
+    output_type = "string"
+
+    def encode_image(self, image_path: str) -> str:
+        """Encode image to base64 string."""
+        with open(image_path, "rb") as image_file:
+            return base64.b64encode(image_file.read()).decode("utf-8")
+
+    def get_prompt(self, description_type: str, custom_prompt: str = None) -> str:
+        """Get appropriate prompt based on description type."""
+        prompts = {
+            "general": "Provide a general description of this image. Focus on the main subjects, colors, and overall scene.",
+            "detailed": """Analyze this image in detail. Include:
+1. Main subjects and their relationships
+2. Colors, lighting, and composition
+3. Any text or symbols present
+4. Context or possible meaning
+5. Notable details or interesting elements""",
+            "chess": """Analyze this chess position and provide a detailed description including:
+1. List of pieces on the board for both white and black
+2. Whose turn it is to move
+3. Basic evaluation of the position
+4. Any immediate tactical opportunities or threats
+5. Suggested next moves with brief explanations""",
+            "text": "Extract and describe any text present in this image. If there are multiple pieces of text, organize them clearly.",
+        }
+        return (
+            custom_prompt
+            if description_type == "custom"
+            else prompts.get(description_type, prompts["general"])
+        )
+
+    def forward(
+        self,
+        image_path: str,
+        description_type: str = "general",
+        custom_prompt: str = None,
+    ) -> str:
+        try:
+            if not os.path.exists(image_path):
+                return f"Error: Image file not found at {image_path}"
+
+            # Encode the image
+            base64_image = self.encode_image(image_path)
+
+            # Get appropriate prompt
+            prompt = self.get_prompt(description_type, custom_prompt)
+
+            # Make the API call
+            response = client.chat.completions.create(
+                model="gpt-4.1",
+                messages=[
+                    {
+                        "role": "user",
+                        "content": [
+                            {"type": "text", "text": prompt},
+                            {
+                                "type": "image_url",
+                                "image_url": {
+                                    "url": f"data:image/jpeg;base64,{base64_image}"
+                                },
+                            },
+                        ],
+                    }
+                ],
+                max_tokens=1000,
+            )
+
+            return response.choices[0].message.content
+
+        except Exception as e:
+            return f"Error analyzing image: {str(e)}"
tools/openai_speech_to_text_tool.py ADDED
@@ -0,0 +1,35 @@
+import os
+
+import whisper
+from smolagents import Tool
+
+
+class OpenAISpeechToTextTool(Tool):
+    """
+    Tool to convert speech to text using OpenAI's Whisper model.
+
+    Args:
+        audio_path (str): Path to the audio file.
+
+    Returns:
+        str: Transcribed text from the audio file.
+    """
+
+    name = "transcribe_audio"
+    description = "Transcribes audio to text and returns the text"
+    inputs = {
+        "audio_path": {"type": "string", "description": "Path to the audio file"},
+    }
+    output_type = "string"
+
+    def forward(self, audio_path: str) -> str:
+        try:
+            model = whisper.load_model("small")
+
+            if not os.path.exists(audio_path):
+                return f"Error: Audio file not found at {audio_path}"
+
+            result = model.transcribe(audio_path)
+            return result["text"]
+        except Exception as e:
+            return f"Error transcribing audio: {str(e)}"
tools/read_file_tool.py ADDED
@@ -0,0 +1,27 @@
+from smolagents import Tool
+
+
+class ReadFileTool(Tool):
+    """
+    Tool to read a file and return its content.
+
+    Args:
+        file_path (str): Path to the file to read.
+
+    Returns:
+        str: Content of the file or error message.
+    """
+
+    name = "read_file"
+    description = "Reads a file and returns its content"
+    inputs = {
+        "file_path": {"type": "string", "description": "Path to the file to read"},
+    }
+    output_type = "string"
+
+    def forward(self, file_path: str) -> str:
+        try:
+            with open(file_path, "r") as file:
+                return file.read()
+        except Exception as e:
+            return f"Error reading file: {str(e)}"
tools/tools.py ADDED
@@ -0,0 +1,34 @@
+from typing import List
+
+from smolagents import (
+    DuckDuckGoSearchTool,
+    PythonInterpreterTool,
+    Tool,
+    VisitWebpageTool,
+    WikipediaSearchTool,
+)
+
+from .describe_image_tool import DescribeImageTool
+from .openai_speech_to_text_tool import OpenAISpeechToTextTool
+from .read_file_tool import ReadFileTool
+from .youtube_transcription_tool import YouTubeTranscriptionTool
+
+
+def get_tools() -> List[Tool]:
+    """
+    Returns a list of available tools for the agent.
+
+    Returns:
+        List[Tool]: List of initialized tool instances.
+    """
+    tools = [
+        DuckDuckGoSearchTool(),
+        PythonInterpreterTool(),
+        WikipediaSearchTool(),
+        VisitWebpageTool(),
+        OpenAISpeechToTextTool(),
+        YouTubeTranscriptionTool(),
+        ReadFileTool(),
+        DescribeImageTool(),
+    ]
+    return tools
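
A hedged sketch of how this registry plugs into the agent, based on the imports in app.py; the zero-argument `get_model()` call is an assumption, since model.py is not part of this diff:

```python
from smolagents import CodeAgent

from model import get_model        # repo module; exact signature lives in model.py
from tools.tools import get_tools

# CodeAgent takes a model backend and a list of Tool instances (smolagents API)
agent = CodeAgent(model=get_model(), tools=get_tools())
```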
tools/youtube_transcription_tool.py ADDED
@@ -0,0 +1,26 @@
+from smolagents import Tool
+from youtube_transcript_api import YouTubeTranscriptApi
+
+
+class YouTubeTranscriptionTool(Tool):
+    """
+    Tool to fetch the transcript of a YouTube video given its URL.
+
+    Args:
+        video_url (str): YouTube video URL.
+
+    Returns:
+        str: Transcript of the video as a single string.
+    """
+
+    name = "youtube_transcription"
+    description = "Fetches the transcript of a YouTube video given its URL"
+    inputs = {
+        "video_url": {"type": "string", "description": "YouTube video URL"},
+    }
+    output_type = "string"
+
+    def forward(self, video_url: str) -> str:
+        video_id = video_url.strip().split("v=")[-1]
+        transcript = YouTubeTranscriptApi.get_transcript(video_id)
+        return " ".join([entry["text"] for entry in transcript])