Yago Bolivar committed
Commit · 4d7d7f8
1 Parent(s): 5e7fe8b
feat: add API documentation and project overview for GAIA benchmark agent

Browse files
- docs/API.md +80 -0
- docs/project_overview.md +48 -0
docs/API.md ADDED @@ -0,0 +1,80 @@
You can access the GAIA benchmark scoring API for your agent's evaluation through the endpoints below, as described in the course documentation.

### Base URL

The API's base URL is:

```
https://agents-course-unit4-scoring.hf.space
```

### API Endpoints

1. **Retrieve Evaluation Questions**

   Endpoint:

   ```
   GET /questions
   ```

   - Returns the full list of filtered evaluation questions.

2. **Retrieve a Random Question**

   Endpoint:

   ```
   GET /random-question
   ```

   - Fetches a single random question from the available set.

3. **Download Associated Files**

   Endpoint:

   ```
   GET /files/{task_id}
   ```

   - Downloads any file associated with the given task, useful for questions that require external data or multimodal analysis.

4. **Submit Agent Answers**

   Endpoint:

   ```
   POST /submit
   ```

   - Submits your agent's answers to be evaluated against the benchmark.
   - Requires a JSON payload structured as:

   ```json
   {
     "username": "Your Hugging Face username",
     "agent_code": "URL to your Hugging Face Space code repository",
     "answers": [{"task_id": "task identifier", "submitted_answer": "your answer"}]
   }
   ```

### API Usage Example

Here's an illustrative example using Python and the `requests` library:

```python
import requests

BASE_URL = "https://agents-course-unit4-scoring.hf.space"

# Retrieve all questions
response = requests.get(f"{BASE_URL}/questions", timeout=30)
response.raise_for_status()
questions = response.json()

# Fetch a random question
random_question = requests.get(f"{BASE_URL}/random-question", timeout=30).json()

# Download the file attached to a specific task_id
task_id = "example_task_id"
file_response = requests.get(f"{BASE_URL}/files/{task_id}", timeout=30)
with open("downloaded_file", "wb") as f:
    f.write(file_response.content)

# Submit answers
submission_payload = {
    "username": "your_username",
    "agent_code": "https://huggingface.co/spaces/your_username/your_space_name/tree/main",
    "answers": [{"task_id": "task_id", "submitted_answer": "answer_text"}],
}
submit_response = requests.post(f"{BASE_URL}/submit", json=submission_payload, timeout=60)
print(submit_response.json())
```

Ensure you have proper authentication if required, and replace the placeholders (`your_username`, `task_id`, `answer_text`, etc.) with your actual values.
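For file downloads specifically, the hard-coded `downloaded_file` name above is simplistic. The sketch below fails fast on HTTP errors and tries to derive a filename from the response; note that a `Content-Disposition` header is an assumption on our part (the documentation above does not specify it), and `example_task_id` is a placeholder:

```python
import re
import requests

BASE_URL = "https://agents-course-unit4-scoring.hf.space"
task_id = "example_task_id"  # placeholder: use a real task_id from /questions

resp = requests.get(f"{BASE_URL}/files/{task_id}", timeout=30)
resp.raise_for_status()  # avoid silently writing an HTML error page to disk

# Prefer a server-supplied filename if one is present; the Content-Disposition
# header is an assumption, so fall back to the task_id otherwise.
cd = resp.headers.get("content-disposition", "")
match = re.search(r'filename="?([^";]+)"?', cd)
filename = match.group(1) if match else f"{task_id}.bin"

with open(filename, "wb") as f:
    f.write(resp.content)
```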
docs/project_overview.md ADDED @@ -0,0 +1,48 @@
# Project: GAIA Benchmark Agent Development

## Overview

The project involves designing and implementing an AI agent that can efficiently tackle the variety of real-world tasks defined by the GAIA benchmark. The benchmark evaluates AI systems across three complexity levels, focusing on core competencies such as reasoning, multimodal understanding, web browsing, and proficient tool use. The agent must demonstrate structured problem-solving, multimodal reasoning, multi-hop fact retrieval, and coherent task sequencing.

## Requirements

### General Requirements

- Design and implement an AI agent capable of addressing tasks from the GAIA benchmark.
- The agent must achieve at least 30% accuracy on GAIA's Level 1 questions.
- Maintain publicly accessible code and documentation on Hugging Face to allow verification and reproducibility.

### Technical Requirements

- Use the provided API endpoints to interact with the GAIA evaluation (a minimal end-to-end sketch follows this list):
  - `GET /questions`: Retrieve evaluation questions.
  - `GET /random-question`: Obtain individual random questions.
  - `GET /files/{task_id}`: Access files associated with specific task IDs.
  - `POST /submit`: Submit answers for evaluation and leaderboard updates.
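The following is a minimal sketch of that interaction loop. It assumes each question object returned by `/questions` carries a `task_id` field (consistent with the submission format), and `answer_question` is a hypothetical placeholder for your agent's actual pipeline:

```python
import requests

BASE_URL = "https://agents-course-unit4-scoring.hf.space"

def answer_question(question: dict) -> str:
    """Hypothetical placeholder: swap in your agent's reasoning pipeline."""
    return "placeholder answer"

# 1. Retrieve the full evaluation set.
questions = requests.get(f"{BASE_URL}/questions", timeout=30).json()

# 2. Run the agent over every question and collect structured answers.
answers = [
    {"task_id": q["task_id"], "submitted_answer": answer_question(q)}
    for q in questions
]

# 3. Submit for scoring and leaderboard placement.
payload = {
    "username": "your_username",  # your Hugging Face username
    "agent_code": "https://huggingface.co/spaces/your_username/your_space_name/tree/main",
    "answers": answers,
}
print(requests.post(f"{BASE_URL}/submit", json=payload, timeout=60).json())
```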
### Submission Requirements

- Username: your Hugging Face username, used for identification.
- Code link (`agent_code`): a URL pointing directly to your Hugging Face Space code repository.
- Answers: a structured response (`{"task_id": ..., "submitted_answer": ...}`) generated by your agent.

### Evaluation Criteria

- Answers are evaluated by exact-match comparison against the provided ground truth (see the self-check sketch after this list).
- Results and rankings are displayed on a publicly accessible student leaderboard.
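Because scoring is exact match, it is worth normalizing answers before submission. The comparison below is only a hypothetical local self-check; the official scorer's normalization rules are not specified here and may differ:

```python
def matches(submitted: str, ground_truth: str) -> bool:
    # Hypothetical approximation of exact-match scoring: trim whitespace
    # and ignore case. The official comparison may be stricter.
    return submitted.strip().lower() == ground_truth.strip().lower()

assert matches(" Paris ", "paris")
assert not matches("Paris, France", "Paris")
```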
### Advanced Capabilities

- Ensure your agent can efficiently:
  - Engage in multi-step reasoning and coordinate multiple tools.
  - Conduct multimodal analyses involving text and image-based data.
  - Execute web browsing and structured information retrieval.
  - Manage advanced planning and tool integration (particularly relevant for Level 2 and Level 3 GAIA tasks).

### Additional Notes

- GAIA's questions range from simple tasks (Level 1, fewer than 5 steps) to complex tasks requiring extensive coordination and planning (Levels 2 and 3).
- Agent development should prioritize the following principles:
  - Real-world applicability of solutions.
  - Clarity and interpretability for human evaluators.
  - Robustness against brute-force methods (non-gameability).
  - Concise and unambiguous answers.

### Documentation and Compliance

- Ensure proper attribution and compliance with GAIA data usage policies.
- Keep your codebase transparent and publicly accessible for verification.
- Clearly document your agent's design, architecture, and reasoning processes to facilitate comprehension and evaluation.