ServiceNow/browsergym-leaderboard
meghsn committed on Dec 6, 2024

Commit 90d6776 (1 parent: 2a1e680)

Added readme, visualwebarena


Files changed (9)

app.py +1 -1
results/GenericAgent-Claude-3.5-Sonnet/README.md +3 -1
results/GenericAgent-Claude-3.5-Sonnet/visualwebarena.json +16 -0
results/GenericAgent-GPT-4o-mini/visualwebarena.json +16 -0
results/GenericAgent-GPT-4o/README.md +46 -1
results/GenericAgent-GPT-4o/visualwebarena.json +16 -0
results/GenericAgent-GPT-o1-mini/README.md +46 -1
results/GenericAgent-Llama-3.1-405b/README.md +46 -1
results/GenericAgent-Llama-3.1-70b/README.md +46 -1

app.py CHANGED

@@ -17,7 +17,7 @@ import re
 import html
 from typing import Dict, Any
-BENCHMARKS = ["WebArena", "WorkArena-L1", "WorkArena-L2", "WorkArena-L3", "MiniWoB", "WebLINX", "AssistantBench"]
+BENCHMARKS = ["WebArena", "WorkArena-L1", "WorkArena-L2", "WorkArena-L3", "MiniWoB", "WebLINX", "VisualWebArena", "AssistantBench"]
 def sanitize_agent_name(agent_name):
     # Only allow alphanumeric chars, hyphen, underscore

results/GenericAgent-Claude-3.5-Sonnet/README.md CHANGED

@@ -41,4 +41,6 @@ BASE_FLAGS = GenericPromptFlags(
     be_cautious=True,
     extra_instructions=None,
 )
-```
+```
+Note: Agents don't use vision except for VisualWebArena, where the vision flag is turned on (and the LLM supports it).

results/GenericAgent-Claude-3.5-Sonnet/visualwebarena.json ADDED

@@ -0,0 +1,16 @@
+[
+    {
+        "agent_name": "GenericAgent-Claude-3.5-Sonnet",
+        "study_id": "study_id",
+        "benchmark": "VisualWebArena",
+        "score": 21.0,
+        "std_err": 1.3,
+        "benchmark_specific": "No",
+        "benchmark_tuned": "No",
+        "followed_evaluation_protocol": "Yes",
+        "reproducible": "Yes",
+        "comments": "NA",
+        "original_or_reproduced": "Original",
+        "date_time": "2021-01-01 12:00:00"
+    }
+]
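
The result file above follows a fixed set of keys, and its `benchmark` value has to match an entry in the `BENCHMARKS` list from `app.py`. A minimal sketch of a pre-submission check is given below. The helper name `check_results` and the exact rules it enforces are assumptions for illustration; the leaderboard app may validate differently or not at all.

```python
import json
from typing import List

# Field names copied from the visualwebarena.json entries in this commit.
REQUIRED_FIELDS = {
    "agent_name", "study_id", "benchmark", "score", "std_err",
    "benchmark_specific", "benchmark_tuned", "followed_evaluation_protocol",
    "reproducible", "comments", "original_or_reproduced", "date_time",
}

# Mirrors the updated BENCHMARKS list in app.py.
BENCHMARKS = ["WebArena", "WorkArena-L1", "WorkArena-L2", "WorkArena-L3",
              "MiniWoB", "WebLINX", "VisualWebArena", "AssistantBench"]


def check_results(raw_json: str) -> List[str]:
    """Hypothetical helper: return a list of problems found in a results file."""
    entries = json.loads(raw_json)
    if not isinstance(entries, list):
        return ["top-level value must be a list of entries"]
    problems = []
    for i, entry in enumerate(entries):
        missing = REQUIRED_FIELDS - set(entry)
        if missing:
            problems.append(f"entry {i}: missing fields {sorted(missing)}")
        if entry.get("benchmark") not in BENCHMARKS:
            problems.append(f"entry {i}: unknown benchmark {entry.get('benchmark')!r}")
        if not isinstance(entry.get("score"), (int, float)):
            problems.append(f"entry {i}: score must be numeric")
    return problems


# The Claude 3.5 Sonnet entry from this commit, reproduced as a sample.
SAMPLE = json.dumps([{
    "agent_name": "GenericAgent-Claude-3.5-Sonnet",
    "study_id": "study_id",
    "benchmark": "VisualWebArena",
    "score": 21.0,
    "std_err": 1.3,
    "benchmark_specific": "No",
    "benchmark_tuned": "No",
    "followed_evaluation_protocol": "Yes",
    "reproducible": "Yes",
    "comments": "NA",
    "original_or_reproduced": "Original",
    "date_time": "2021-01-01 12:00:00",
}])

if __name__ == "__main__":
    print(check_results(SAMPLE))
```

Before this commit, the same entry would have been flagged as an unknown benchmark, since `"VisualWebArena"` was not yet in `BENCHMARKS`.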

results/GenericAgent-GPT-4o-mini/visualwebarena.json ADDED

@@ -0,0 +1,16 @@
+[
+    {
+        "agent_name": "GenericAgent-GPT-4o-mini",
+        "study_id": "study_id",
+        "date_time": "2021-01-01 12:00:00",
+        "benchmark": "VisualWebArena",
+        "score": 16.9,
+        "std_err": 1.2,
+        "benchmark_specific": "No",
+        "benchmark_tuned": "No",
+        "followed_evaluation_protocol": "Yes",
+        "reproducible": "Yes",
+        "comments": "NA",
+        "original_or_reproduced": "Original"
+    }
+]

results/GenericAgent-GPT-4o/README.md CHANGED

@@ -1 +1,46 @@
-## GPT-4o model
+### GenericAgent-GPT-4o
+This agent is [GenericAgent](https://github.com/ServiceNow/AgentLab/blob/main/src/agentlab/agents/generic_agent/generic_agent.py) from [AgentLab](https://github.com/ServiceNow/AgentLab)
+It uses GPT-4o as a backend, with the following [flags](https://github.com/ServiceNow/AgentLab/blob/main/src/agentlab/agents/generic_agent/tmlr_config.py):
+```python
+BASE_FLAGS = GenericPromptFlags(
+    obs=dp.ObsFlags(
+        use_html=False,
+        use_ax_tree=True,
+        use_focused_element=True,
+        use_error_logs=True,
+        use_history=True,
+        use_past_error_logs=False,
+        use_action_history=True,
+        use_think_history=True,  # gpt-4o config except for this line
+        use_diff=False,
+        html_type="pruned_html",
+        use_screenshot=False,
+        use_som=False,
+        extract_visible_tag=True,
+        extract_clickable_tag=True,
+        extract_coords="False",
+        filter_visible_elements_only=False,
+    ),
+    action=dp.ActionFlags(
+        multi_actions=False,
+        action_set="bid",
+        long_description=False,
+        individual_examples=False,
+    ),
+    use_plan=False,
+    use_criticise=False,
+    use_thinking=True,
+    use_memory=False,
+    use_concrete_example=True,
+    use_abstract_example=True,
+    use_hints=True,
+    enable_chat=False,
+    max_prompt_tokens=40_000,
+    be_cautious=True,
+    extra_instructions=None,
+)
+```
+Note: Agents don't use vision except for VisualWebArena, where the vision flag is turned on (and the LLM supports it).

results/GenericAgent-GPT-4o/visualwebarena.json ADDED

@@ -0,0 +1,16 @@
+[
+    {
+        "agent_name": "GenericAgent-GPT-4o",
+        "study_id": "study_id",
+        "date_time": "2021-01-01 12:00:00",
+        "benchmark": "VisualWebArena",
+        "score": 26.7,
+        "std_err": 1.5,
+        "benchmark_specific": "No",
+        "benchmark_tuned": "No",
+        "followed_evaluation_protocol": "Yes",
+        "reproducible": "Yes",
+        "comments": "NA",
+        "original_or_reproduced": "Original"
+    }
+]
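
The three `visualwebarena.json` files added in this commit each carry a `score` and a `std_err`. As a rough sketch of how such entries could be ordered for display, the snippet below sorts them by score. The function names are hypothetical and the sorting rule is an assumption; the actual leaderboard logic in `app.py` is not shown in this diff.

```python
# (agent_name, score, std_err) copied from the three visualwebarena.json files.
RESULTS = [
    ("GenericAgent-Claude-3.5-Sonnet", 21.0, 1.3),
    ("GenericAgent-GPT-4o-mini", 16.9, 1.2),
    ("GenericAgent-GPT-4o", 26.7, 1.5),
]


def rank_by_score(results):
    """Return the rows sorted from highest to lowest score."""
    return sorted(results, key=lambda row: row[1], reverse=True)


def format_row(name, score, std_err):
    """Render one row as 'name: score ± std_err' with one decimal place."""
    return f"{name}: {score:.1f} ± {std_err:.1f}"


if __name__ == "__main__":
    for row in rank_by_score(RESULTS):
        print(format_row(*row))
```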

results/GenericAgent-GPT-o1-mini/README.md CHANGED

@@ -1 +1,46 @@
-## GPT-o1-mini model
+### GenericAgent-GPT-o1-mini
+This agent is [GenericAgent](https://github.com/ServiceNow/AgentLab/blob/main/src/agentlab/agents/generic_agent/generic_agent.py) from [AgentLab](https://github.com/ServiceNow/AgentLab)
+It uses o1-mini as a backend, with the following [flags](https://github.com/ServiceNow/AgentLab/blob/main/src/agentlab/agents/generic_agent/tmlr_config.py):
+```python
+BASE_FLAGS = GenericPromptFlags(
+    obs=dp.ObsFlags(
+        use_html=False,
+        use_ax_tree=True,
+        use_focused_element=True,
+        use_error_logs=True,
+        use_history=True,
+        use_past_error_logs=False,
+        use_action_history=True,
+        use_think_history=True,  # gpt-4o config except for this line
+        use_diff=False,
+        html_type="pruned_html",
+        use_screenshot=False,
+        use_som=False,
+        extract_visible_tag=True,
+        extract_clickable_tag=True,
+        extract_coords="False",
+        filter_visible_elements_only=False,
+    ),
+    action=dp.ActionFlags(
+        multi_actions=False,
+        action_set="bid",
+        long_description=False,
+        individual_examples=False,
+    ),
+    use_plan=False,
+    use_criticise=False,
+    use_thinking=True,
+    use_memory=False,
+    use_concrete_example=True,
+    use_abstract_example=True,
+    use_hints=True,
+    enable_chat=False,
+    max_prompt_tokens=40_000,
+    be_cautious=True,
+    extra_instructions=None,
+)
+```
+Note: Agents don't use vision except for VisualWebArena, where the vision flag is turned on (and the LLM supports it).

results/GenericAgent-Llama-3.1-405b/README.md CHANGED

@@ -1 +1,46 @@
-### Llama-3.1-405B
+### GenericAgent-Llama-3.1-405b
+This agent is [GenericAgent](https://github.com/ServiceNow/AgentLab/blob/main/src/agentlab/agents/generic_agent/generic_agent.py) from [AgentLab](https://github.com/ServiceNow/AgentLab)
+It uses Llama-3.1-405b as a backend, with the following [flags](https://github.com/ServiceNow/AgentLab/blob/main/src/agentlab/agents/generic_agent/tmlr_config.py):
+```python
+BASE_FLAGS = GenericPromptFlags(
+    obs=dp.ObsFlags(
+        use_html=False,
+        use_ax_tree=True,
+        use_focused_element=True,
+        use_error_logs=True,
+        use_history=True,
+        use_past_error_logs=False,
+        use_action_history=True,
+        use_think_history=True,  # gpt-4o config except for this line
+        use_diff=False,
+        html_type="pruned_html",
+        use_screenshot=False,
+        use_som=False,
+        extract_visible_tag=True,
+        extract_clickable_tag=True,
+        extract_coords="False",
+        filter_visible_elements_only=False,
+    ),
+    action=dp.ActionFlags(
+        multi_actions=False,
+        action_set="bid",
+        long_description=False,
+        individual_examples=False,
+    ),
+    use_plan=False,
+    use_criticise=False,
+    use_thinking=True,
+    use_memory=False,
+    use_concrete_example=True,
+    use_abstract_example=True,
+    use_hints=True,
+    enable_chat=False,
+    max_prompt_tokens=40_000,
+    be_cautious=True,
+    extra_instructions=None,
+)
+```
+Note: Agents don't use vision except for VisualWebArena, where the vision flag is turned on (and the LLM supports it).

results/GenericAgent-Llama-3.1-70b/README.md CHANGED

@@ -1 +1,46 @@
-### Llama-3.1-70B
+### GenericAgent-Llama-3.1-70b
+This agent is [GenericAgent](https://github.com/ServiceNow/AgentLab/blob/main/src/agentlab/agents/generic_agent/generic_agent.py) from [AgentLab](https://github.com/ServiceNow/AgentLab)
+It uses Llama-3.1-70b as a backend, with the following [flags](https://github.com/ServiceNow/AgentLab/blob/main/src/agentlab/agents/generic_agent/tmlr_config.py):
+```python
+BASE_FLAGS = GenericPromptFlags(
+    obs=dp.ObsFlags(
+        use_html=False,
+        use_ax_tree=True,
+        use_focused_element=True,
+        use_error_logs=True,
+        use_history=True,
+        use_past_error_logs=False,
+        use_action_history=True,
+        use_think_history=True,  # gpt-4o config except for this line
+        use_diff=False,
+        html_type="pruned_html",
+        use_screenshot=False,
+        use_som=False,
+        extract_visible_tag=True,
+        extract_clickable_tag=True,
+        extract_coords="False",
+        filter_visible_elements_only=False,
+    ),
+    action=dp.ActionFlags(
+        multi_actions=False,
+        action_set="bid",
+        long_description=False,
+        individual_examples=False,
+    ),
+    use_plan=False,
+    use_criticise=False,
+    use_thinking=True,
+    use_memory=False,
+    use_concrete_example=True,
+    use_abstract_example=True,
+    use_hints=True,
+    enable_chat=False,
+    max_prompt_tokens=40_000,
+    be_cautious=True,
+    extra_instructions=None,
+)
+```
+Note: Agents don't use vision except for VisualWebArena, where the vision flag is turned on (and the LLM supports it).
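
The note added to each README says vision is off by default and enabled only for VisualWebArena. The sketch below illustrates that rule with a small stand-in dataclass. It assumes that "the vision flag" refers to `use_screenshot`, and that the parenthetical means vision is enabled only when the LLM supports it; neither is confirmed by this diff. `ObsFlagsSketch` and `obs_flags_for` are hypothetical names, not part of AgentLab.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class ObsFlagsSketch:
    """Stand-in for dp.ObsFlags; models only two fields listed in the README."""
    use_ax_tree: bool = True
    use_screenshot: bool = False  # assumed to be the "vision flag"


# Benchmarks for which vision is switched on, per the README note.
VISION_BENCHMARKS = {"VisualWebArena"}


def obs_flags_for(benchmark: str, llm_supports_vision: bool) -> ObsFlagsSketch:
    """Return observation flags, enabling screenshots only where the note applies."""
    base = ObsFlagsSketch()
    if benchmark in VISION_BENCHMARKS and llm_supports_vision:
        # replace() builds a modified copy and leaves the base config untouched.
        return replace(base, use_screenshot=True)
    return base


if __name__ == "__main__":
    print(obs_flags_for("VisualWebArena", llm_supports_vision=True))
    print(obs_flags_for("WebArena", llm_supports_vision=True))
```

Using a frozen dataclass with `replace` keeps the base configuration immutable, so a benchmark-specific override cannot leak into runs on other benchmarks.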