Zachary Siegel committed
Commit fd7b6c5 · 1 Parent(s): a1b0cc7

edits to submission instructions

Files changed (1): agent_submission.md (+5 -5)
agent_submission.md CHANGED
@@ -1,15 +1,15 @@
  ### To submit **a new agent** to the leaderboard, follow these steps:

- 1. **Run your agent on the [CORE-Bench Harness](https://github.com/siegelz/core-bench).** When developing your agent, ensure that it generates a file named `agent_trace.log` in the base directory in which it is invoked for each run. The file must be in JSON format and **at least** include the keys `cost` and `agent_trace`:
+ 1. **Run your agent on the [CORE-Bench Harness](https://github.com/siegelz/core-bench).** When developing your agent, ensure that it generates a file named `agent_trace.log` in the base directory in which it is invoked for each run. The content of this file must be in JSON format and **at least** include the keys `cost` and `agent_trace`:

      ```json
      {
          "cost": 0.59,
-         "agent_trace": "The agent trace is a string that describes the intermediate steps the agent took to arrive at the final solution."
+         "agent_trace": "The agent trace is a string that describes the intermediate steps the agent took to arrive at the final solution. This trace does not need to follow a specific format."
      }
      ```

-     - **`cost`**: A float representing the total cost (USD) of API calls made by the agent (the agent will need to log the cost of each request and sum them up).
+     - **`cost`**: A float representing the total cost (USD) of API calls made by the agent. We recommend using [Weave](https://github.com/wandb/weave) for easy cost logging.
      - **`agent_trace`**: A string describing the steps your agent took to arrive at its final solution. It should adhere to the following guidelines inspired by [SWE-Bench](https://www.swebench.com/submit.html):
          - Human-readable.
          - Reflects the intermediate steps your system took that led to the final solution.
@@ -17,13 +17,13 @@

  If you have any trouble implementing this, feel free to reach out to us for support.

- 2. **Evaluate your agent using the [CORE-Bench Harness](https://github.com/siegelz/core-bench)** on all tasks of the test set. You will almost certainly need to run your agent on Azure (with the `--use_azure` flag) to avoid long experiment times. Set the `--experiment_name` flag to be the name of your agent. We highly encourage you to run your agent on all three levels of the benchmark: CORE-Bench-Easy, CORE-Bench-Medium, and CORE-Bench-Hard, but you can choose to run on any subset of these levels.
+ 2. **Run your agent** on all tasks of the test set. You will almost certainly need to run your agent using our Azure VM harness (with the `--use_azure` flag) to avoid long experiment times. Set the `--experiment_name` flag to be the name of your agent. We highly encourage you to run your agent on all three levels of the benchmark: CORE-Bench-Easy, CORE-Bench-Medium, and CORE-Bench-Hard, but you can choose to run on any subset of these levels.

  3. **Submit the following two directories from the harness**:
      - `benchmark/results/[experiment_name]`: Contains the results of your agent on each task.
      - `benchmark/logs/[experiment_name]`: Contains the logs of your agent's execution on each task (which are the `agent_trace.log` files your agent submits).
      - These files are automatically generated by the harness when you run your agent. You should not be manually modifying these files.

- Compress these directories into two `.tar.gz` files and email them to [[email protected]](mailto:[email protected]). **In the body of the email, please also include the name of your agent that you wish to be displayed on the leaderboard.**
+ Compress these directories into two `.tar.gz` or `.zip` files and email them to [[email protected]](mailto:[email protected]). **In the body of the email, please also include the name of your agent that you wish to be displayed on the leaderboard.**

  4. [Optional] We highly encourage you to submit the files of your agent (i.e. `benchmark/agents/[agent_name]`) so we can verify the performance of your agent on the leaderboard. If you choose to do so, compress this directory into a `.tar.gz` file and include it in the email.
 
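For step 1 above, the sketch below shows one way an agent could assemble the required `agent_trace.log`: it assumes the agent records a short description and the API cost of each call as it runs (the `record_step` helper and the example step descriptions are hypothetical, not part of the harness), then sums the per-call costs and writes the two required keys.

```python
import json
from pathlib import Path

# Hypothetical bookkeeping: the agent appends one entry per API call,
# each with a human-readable description and the call's cost in USD.
steps = []

def record_step(description: str, cost_usd: float) -> None:
    """Log one intermediate step of the agent (hypothetical helper)."""
    steps.append({"description": description, "cost": cost_usd})

# ... the agent runs here, calling record_step(...) after each API request ...
record_step("Inspected the repository README and located the run script", 0.12)
record_step("Executed the experiment and parsed the results file", 0.47)

# Write agent_trace.log in the base directory the agent was invoked in.
# Only `cost` and `agent_trace` are required; extra keys are allowed,
# since the file need only include those two at a minimum.
trace = {
    "cost": round(sum(s["cost"] for s in steps), 4),
    "agent_trace": "\n".join(
        f"Step {i + 1}: {s['description']}" for i, s in enumerate(steps)
    ),
}
Path("agent_trace.log").write_text(json.dumps(trace, indent=2))
```

Because the trace does not need to follow a specific format, a plain, human-readable list of steps like this is sufficient.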
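For step 3, the two directories can be packaged from the root of the harness checkout with any standard archiver; here is a minimal Python sketch using the built-in `tarfile` module, where `my_agent` is a placeholder for the value you passed to `--experiment_name` (`.zip` archives are also accepted per the updated instructions).

```python
import tarfile
from pathlib import Path

# The experiment name you passed to the harness via --experiment_name (placeholder).
experiment_name = "my_agent"

# Compress benchmark/results/[experiment_name] and benchmark/logs/[experiment_name]
# into two .tar.gz archives to attach to the submission email.
for subdir in ("results", "logs"):
    src = Path("benchmark") / subdir / experiment_name
    archive = Path(f"{subdir}_{experiment_name}.tar.gz")
    with tarfile.open(archive, "w:gz") as tar:
        tar.add(src, arcname=src.name)
    print(f"Wrote {archive}")
```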