Yago Bolivar committed on
Commit
4d7d7f8
·
1 Parent(s): 5e7fe8b

feat: add API documentation and project overview for GAIA benchmark agent

Files changed (2)
  1. docs/API.md +80 -0
  2. docs/project_overview.md +48 -0
docs/API.md ADDED
@@ -0,0 +1,80 @@
You can access the GAIA benchmark API for your agent's evaluation through the following endpoints:

### Base URL
The API's base URL is:
```
https://agents-course-unit4-scoring.hf.space
```

### API Endpoints

1. **Retrieve Evaluation Questions**
   Endpoint:
   ```
   GET /questions
   ```
   - Returns the full list of filtered evaluation questions.

2. **Retrieve a Random Question**
   Endpoint:
   ```
   GET /random-question
   ```
   - Fetches a single random question from the available set.

3. **Download Associated Files**
   Endpoint:
   ```
   GET /files/{task_id}
   ```
   - Downloads the files associated with a specific task, useful for questions that require external data or multimodal analysis.

4. **Submit Agent Answers**
   Endpoint:
   ```
   POST /submit
   ```
   - Submits your agent's answers to be evaluated against the benchmark.
   - Requires a JSON payload structured as:
     ```json
     {
       "username": "Your Hugging Face username",
       "agent_code": "URL to your Hugging Face Space code repository",
       "answers": [{"task_id": "task identifier", "submitted_answer": "your answer"}]
     }
     ```

### API Usage Example
Here's an illustrative example using Python and the `requests` library:

```python
import requests

BASE_URL = "https://agents-course-unit4-scoring.hf.space"

# Retrieve all questions
response = requests.get(f"{BASE_URL}/questions", timeout=30)
response.raise_for_status()
questions = response.json()

# Fetch a random question
random_question = requests.get(f"{BASE_URL}/random-question", timeout=30).json()

# Download the file attached to a specific task_id
task_id = "example_task_id"
file_response = requests.get(f"{BASE_URL}/files/{task_id}", timeout=30)
file_response.raise_for_status()
with open("downloaded_file", "wb") as f:
    f.write(file_response.content)

# Submit answers for evaluation
submission_payload = {
    "username": "your_username",
    "agent_code": "https://huggingface.co/spaces/your_username/your_space_name/tree/main",
    "answers": [{"task_id": "task_id", "submitted_answer": "answer_text"}],
}
submit_response = requests.post(f"{BASE_URL}/submit", json=submission_payload, timeout=60)
print(submit_response.json())
```

Replace the placeholder values (`your_username`, `task_id`, `answer_text`, etc.) with your actual values, and add authentication if the API requires it.
docs/project_overview.md ADDED
@@ -0,0 +1,48 @@
# Project: GAIA Benchmark Agent Development

## Context
The project involves the design and implementation of an advanced AI agent that can efficiently tackle a variety of real-world tasks defined by the GAIA benchmark. The benchmark evaluates AI systems across three complexity levels, focusing on core competencies such as reasoning, multimodal understanding, web browsing, and proficient tool use. The agent must demonstrate structured problem-solving, multimodal reasoning, multi-hop fact retrieval, and coherent task sequencing.

## Requirements

### General Requirements
- Design and implement an AI agent capable of addressing tasks from the GAIA benchmark.
- The agent must achieve a minimum performance benchmark of 30% accuracy on GAIA's Level 1 questions.
- Maintain publicly accessible code and documentation on Hugging Face to allow verification and reproducibility.

### Technical Requirements
- Use the provided API endpoints for interacting with the GAIA evaluation service (a minimal interaction sketch follows this list):
  - `GET /questions`: Retrieve evaluation questions.
  - `GET /random-question`: Obtain a single random question.
  - `GET /files/{task_id}`: Access files associated with specific task IDs.
  - `POST /submit`: Submit answers for evaluation and leaderboard updates.
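
As a rough illustration of the expected flow, here is a minimal sketch that fetches the questions, produces an answer for each, and submits the batch. `answer_question` is a hypothetical placeholder for the agent's actual reasoning pipeline, and the sketch assumes each question object carries a `task_id` field, consistent with the submission schema below.

```python
import requests

BASE_URL = "https://agents-course-unit4-scoring.hf.space"

def answer_question(question: dict) -> str:
    """Hypothetical placeholder: replace with your agent's reasoning pipeline."""
    raise NotImplementedError

def run_evaluation(username: str, agent_code: str) -> dict:
    # Fetch the full set of evaluation questions.
    questions = requests.get(f"{BASE_URL}/questions", timeout=30).json()

    # Build one {task_id, submitted_answer} entry per question.
    answers = [
        {"task_id": q["task_id"], "submitted_answer": answer_question(q)}
        for q in questions
    ]

    # Submit the whole batch for scoring.
    payload = {"username": username, "agent_code": agent_code, "answers": answers}
    response = requests.post(f"{BASE_URL}/submit", json=payload, timeout=60)
    response.raise_for_status()
    return response.json()
```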

### Submission Requirements
- Username: Provide your Hugging Face username for identification purposes.
- Code Link (`agent_code`): Include a URL pointing directly to your Hugging Face Space code repository.
- Answers: Submit a structured response (`{"task_id": ..., "submitted_answer": ...}`) generated by your agent.

### Evaluation Criteria
- Answers are evaluated through exact-match comparison against the provided ground truth.
- Results and rankings are displayed on a publicly accessible student leaderboard.
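
Because scoring is exact match, stray whitespace, trailing punctuation, or embedded reasoning text can turn a correct answer into a miss. The following is a hypothetical pre-submission cleanup, not the official scorer's logic; the specific rules (prefix stripping, punctuation trimming) are assumptions to verify against the benchmark's answer-format guidance.

```python
def clean_answer(raw: str) -> str:
    """Hypothetical pre-submission cleanup; verify the rules against GAIA's guidance."""
    answer = raw.strip()
    # Strip a leading "FINAL ANSWER:" marker if the agent emits one (assumed pattern).
    prefix = "FINAL ANSWER:"
    if answer.upper().startswith(prefix):
        answer = answer[len(prefix):].strip()
    # Drop trailing punctuation that an exact-match comparison would penalize.
    return answer.rstrip(".!")
```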

### Advanced Capabilities
- Ensure your agent can efficiently do the following (a generic tool-dispatch sketch appears after this list):
  - Engage in multi-step reasoning and coordinate multiple tools.
  - Conduct multimodal analyses involving text and image-based data.
  - Execute web browsing and structured information retrieval.
  - Manage advanced planning and tool integration (particularly relevant for Level 2 and Level 3 GAIA tasks).
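
To make the tool-coordination requirement concrete, here is a minimal tool-registry pattern. It is an illustrative sketch of one common agent architecture, not this project's prescribed design; the tool names (`web_search`, `read_file`) are hypothetical.

```python
from typing import Callable

# Registry mapping tool names to callables (hypothetical structure).
TOOLS: dict[str, Callable[[str], str]] = {}

def register_tool(name: str):
    """Decorator that adds a function to the tool registry under `name`."""
    def wrap(fn: Callable[[str], str]) -> Callable[[str], str]:
        TOOLS[name] = fn
        return fn
    return wrap

@register_tool("web_search")
def web_search(query: str) -> str:
    """Hypothetical tool: replace with a real search integration."""
    raise NotImplementedError

@register_tool("read_file")
def read_file(path: str) -> str:
    """Hypothetical tool: replace with parsing logic for downloaded task files."""
    raise NotImplementedError

def dispatch(tool_name: str, argument: str) -> str:
    """Route a planner-chosen tool call to its registered implementation."""
    if tool_name not in TOOLS:
        return f"Unknown tool: {tool_name}"
    return TOOLS[tool_name](argument)
```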

### Additional Notes
- GAIA's benchmark questions range from simple tasks (Level 1, fewer than 5 steps) to complex tasks requiring extensive coordination and planning (Levels 2 and 3).
- Agent development should prioritize the following principles:
  - Real-world applicability of solutions.
  - Clarity and interpretability for human evaluators.
  - Robustness against brute-force methods (non-gameability).
  - Concise and unambiguous answers.

### Documentation and Compliance
- Ensure proper attribution and compliance with GAIA data usage policies.
- Keep your codebase transparent and publicly accessible for verification purposes.
- Clearly document your agent's design, architecture, and reasoning processes to facilitate comprehension and evaluation.