ServiceNow/browsergym-leaderboard

xhluca committed on Apr 14

Commit 65166ee (verified) · 1 parent: e2b5505

Add A3-Qwen3.5-9B WorkArena-L2 results (9.7%)

Adding WorkArena++ L2 (test split, 185 tasks) evaluation results for A3-Qwen3.5-9B.

Score: 9.7% (±2.2 std err)
Model not trained on ServiceNow data.
Follows standard GenericAgent + BrowserGym evaluation protocol.
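
As a rough sanity check on the reported uncertainty, the sketch below assumes the standard error is the usual binomial one, sqrt(p(1-p)/n), taken over the 185 test tasks. The commit does not state how the standard error was actually computed, and `binomial_std_err` is a hypothetical helper name, not part of BrowserGym.

```python
import math


def binomial_std_err(score_pct: float, n_tasks: int) -> float:
    """Standard error, in percentage points, of a success rate.

    Assumption: each task is an independent pass/fail trial.
    """
    p = score_pct / 100.0
    return 100.0 * math.sqrt(p * (1.0 - p) / n_tasks)


# Values reported in this commit: 9.7% over 185 tasks.
se = binomial_std_err(9.7, 185)
print(round(se, 1))  # prints 2.2
```

Under that assumption the result agrees with the ±2.2 reported above.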

Files changed (1)

results/A3-Qwen3.5-9B/workarena-l2.json ADDED (+16 -0)

@@ -0,0 +1,16 @@

+[
+    {
+        "agent_name": "A3-Qwen3.5-9B",
+        "study_id": "2026-03-14_12-38-08_genericagent-checkpoints-qwen-qwen3-5-9b-web-pro-low-8903051-checkpoint-latest-on-workarena-l2-test-test",
+        "date_time": "2026-03-14 12:38:08",
+        "benchmark": "WorkArena-L2",
+        "score": 9.7,
+        "std_err": 2.2,
+        "benchmark_specific": "No",
+        "benchmark_tuned": "No",
+        "followed_evaluation_protocol": "Yes",
+        "reproducible": "Yes",
+        "comments": "185 tasks (test split). Model not trained on ServiceNow data.",
+        "original_or_reproduced": "Original"
+    }
+]
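
A result file like the one added here can be checked before submission with a minimal sketch along these lines. The set of required keys is taken only from this entry; the leaderboard's own validation, if any, is not shown in the commit, and `check_entry` is a hypothetical helper.

```python
import json

# Keys observed in the entry added by this commit (an assumption, not an official schema).
REQUIRED_KEYS = {
    "agent_name", "study_id", "date_time", "benchmark", "score", "std_err",
    "benchmark_specific", "benchmark_tuned", "followed_evaluation_protocol",
    "reproducible", "comments", "original_or_reproduced",
}


def check_entry(entry: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means the entry looks fine."""
    problems = [f"missing key: {key}" for key in sorted(REQUIRED_KEYS - entry.keys())]
    score = entry.get("score")
    if not isinstance(score, (int, float)) or not 0 <= score <= 100:
        problems.append("score must be a number between 0 and 100")
    std_err = entry.get("std_err")
    if not isinstance(std_err, (int, float)) or std_err < 0:
        problems.append("std_err must be a non-negative number")
    return problems


def check_results(text: str) -> list[str]:
    """Parse a results file (a JSON list of entries) and collect problems from every entry."""
    entries = json.loads(text)
    if not isinstance(entries, list):
        return ["top-level JSON value must be a list"]
    return [problem for entry in entries for problem in check_entry(entry)]
```

For example, `check_results(open("results/A3-Qwen3.5-9B/workarena-l2.json").read())` would return an empty list for the file above.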