ServiceNow/browsergym-leaderboard

Add A3-Qwen3.5-9B WorkArena-L2 results (9.7%)

#13

by xhluca - opened Apr 14

Apr 14

Adding WorkArena++ L2 (test split, 185 tasks) evaluation results for A3-Qwen3.5-9B.

Score: 9.7% (±2.2 std err)
Model not trained on ServiceNow data.
Follows standard GenericAgent + BrowserGym evaluation protocol.
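
As a sanity check on the reported uncertainty, here is a minimal sketch that recomputes the standard error from the score and the task count. It assumes the ±2.2 is the usual binomial standard error of a proportion over the 185 test tasks, which the post does not state explicitly; the helper name `binomial_std_err` is hypothetical and not part of BrowserGym.

```python
import math


def binomial_std_err(p: float, n: int) -> float:
    """Standard error of a success rate p measured over n independent tasks.

    Assumption: each task is an independent pass/fail trial, so the
    standard error is sqrt(p * (1 - p) / n).
    """
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be in [0, 1]")
    return math.sqrt(p * (1.0 - p) / n)


# Values taken from the post: 9.7% success rate on 185 tasks.
score = 0.097
num_tasks = 185

std_err = binomial_std_err(score, num_tasks)
print(f"{100 * score:.1f}% (±{100 * std_err:.1f} std err)")
```

Under that assumption the result rounds to 2.2 percentage points, which is consistent with the figure reported above.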

Apr 14

Closing in favor of a clean PR with correct title and description.

xhluca changed pull request status to closed Apr 14
