Zachary Siegel committed
Commit fd7b6c5 · 1 Parent(s): a1b0cc7

edits to submission instructions

Files changed (1): agent_submission.md (+5 -5)
agent_submission.md CHANGED
@@ -1,15 +1,15 @@
  ### To submit **a new agent** to the leaderboard, follow these steps:

- 1. **Run your agent on the [CORE-Bench Harness](https://github.com/siegelz/core-bench).** When developing your agent, ensure that it generates a file named `agent_trace.log` in the base directory in which it is invoked for each run. The file must be in JSON format and **at least** include the keys `cost` and `agent_trace`:
+ 1. **Run your agent on the [CORE-Bench Harness](https://github.com/siegelz/core-bench).** When developing your agent, ensure that it generates a file named `agent_trace.log` in the base directory in which it is invoked for each run. The content of this file must be in JSON format and **at least** include the keys `cost` and `agent_trace`:

      ```json
      {
          "cost": 0.59,
-         "agent_trace": "The agent trace is a string that describes the intermediate steps the agent took to arrive at the final solution."
+         "agent_trace": "The agent trace is a string that describes the intermediate steps the agent took to arrive at the final solution. This trace does not need to follow a specific format."
      }
      ```

-     - **`cost`**: A float representing the total cost (USD) of API calls made by the agent (the agent will need to log the cost of each request and sum them up).
+     - **`cost`**: A float representing the total cost (USD) of API calls made by the agent. We recommend using [Weave](https://github.com/wandb/weave) for easy cost logging.
      - **`agent_trace`**: A string describing the steps your agent took to arrive at its final solution. It should adhere to the following guidelines inspired by [SWE-Bench](https://www.swebench.com/submit.html):
          - Human-readable.
          - Reflects the intermediate steps your system took that led to the final solution.
@@ -17,13 +17,13 @@

  If you have any trouble implementing this, feel free to reach out to us for support.

- 2. **Evaluate your agent using the [CORE-Bench Harness](https://github.com/siegelz/core-bench)** on all tasks of the test set. You will almost certainly need to run your agent on Azure (with the `--use_azure` flag) to avoid long experiment times. Set the `--experiment_name` flag to be the name of your agent. We highly encourage you to run your agent on all three levels of the benchmark: CORE-Bench-Easy, CORE-Bench-Medium, and CORE-Bench-Hard, but you can choose to run on any subset of these levels.
+ 2. **Run your agent** on all tasks of the test set. You will almost certainly need to run your agent using our Azure VM harness (with the `--use_azure` flag) to avoid long experiment times. Set the `--experiment_name` flag to be the name of your agent. We highly encourage you to run your agent on all three levels of the benchmark: CORE-Bench-Easy, CORE-Bench-Medium, and CORE-Bench-Hard, but you can choose to run on any subset of these levels.

  3. **Submit the following two directories from the harness**:
      - `benchmark/results/[experiment_name]`: Contains the results of your agent on each task.
      - `benchmark/logs/[experiment_name]`: Contains the logs of your agent's execution on each task (which are the `agent_trace.log` files your agent submits).
      - These files are automatically generated by the harness when you run your agent. You should not be manually modifying these files.

- Compress these directories into two `.tar.gz` files and email them to [[email protected]](mailto:[email protected]). **In the body of the email, please also include the name of your agent that you wish to be displayed on the leaderboard.**
+ Compress these directories into two `.tar.gz` or `.zip` files and email them to [[email protected]](mailto:[email protected]). **In the body of the email, please also include the name of your agent that you wish to be displayed on the leaderboard.**

  4. [Optional] We highly encourage you to submit the files of your agent (i.e. `benchmark/agents/[agent_name]`) so we can verify the performance of your agent on the leaderboard. If you choose to do so, compress this directory into a `.tar.gz` file and include it in the email.
 
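For step 1 above, the sketch below shows one way an agent could assemble the required `agent_trace.log`: it assumes the agent records a short description and the API cost of each call as it runs (the `record_step` helper and the example step descriptions are hypothetical, not part of the harness), then sums the per-call costs and writes the two required keys.

```python
import json
from pathlib import Path

# Hypothetical bookkeeping: the agent appends one entry per API call,
# each with a human-readable description and the call's cost in USD.
steps = []

def record_step(description: str, cost_usd: float) -> None:
    """Log one intermediate step of the agent (hypothetical helper)."""
    steps.append({"description": description, "cost": cost_usd})

# ... the agent runs here, calling record_step(...) after each API request ...
record_step("Inspected the repository README and located the run script", 0.12)
record_step("Executed the experiment and parsed the results file", 0.47)

# Write agent_trace.log in the base directory the agent was invoked in.
# Only `cost` and `agent_trace` are required; extra keys are allowed,
# since the file need only include those two at a minimum.
trace = {
    "cost": round(sum(s["cost"] for s in steps), 4),
    "agent_trace": "\n".join(
        f"Step {i + 1}: {s['description']}" for i, s in enumerate(steps)
    ),
}
Path("agent_trace.log").write_text(json.dumps(trace, indent=2))
```

Because the trace does not need to follow a specific format, a plain, human-readable list of steps like this is sufficient.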
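For step 3, the two directories can be packaged from the root of the harness checkout with any standard archiver; here is a minimal Python sketch using the built-in `tarfile` module, where `my_agent` is a placeholder for the value you passed to `--experiment_name` (`.zip` archives are also accepted per the updated instructions).

```python
import tarfile
from pathlib import Path

# The experiment name you passed to the harness via --experiment_name (placeholder).
experiment_name = "my_agent"

# Compress benchmark/results/[experiment_name] and benchmark/logs/[experiment_name]
# into two .tar.gz archives to attach to the submission email.
for subdir in ("results", "logs"):
    src = Path("benchmark") / subdir / experiment_name
    archive = Path(f"{subdir}_{experiment_name}.tar.gz")
    with tarfile.open(archive, "w:gz") as tar:
        tar.add(src, arcname=src.name)
    print(f"Wrote {archive}")
```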