Add A3-Qwen3.5-9B WorkArena-L2 results (9.7%)
#13
by xhluca - opened
Adding WorkArena++ L2 (test split, 185 tasks) evaluation results for A3-Qwen3.5-9B.
Score: 9.7% success rate (±2.2 percentage points, standard error)
The model was not trained on ServiceNow data.
Evaluation follows the standard GenericAgent + BrowserGym protocol.
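As a sanity check on the reported standard error, here is a minimal sketch that assumes the figure is the usual binomial standard error of a proportion, sqrt(p(1-p)/n), computed over the 185 test tasks. The PR does not state how the error was computed, so this formula is an assumption, and the helper name `binomial_std_err` is hypothetical.

```python
import math


def binomial_std_err(success_rate: float, n_tasks: int) -> float:
    """Standard error of a success rate under a binomial model (an assumption;
    the PR does not say which estimator was used)."""
    return math.sqrt(success_rate * (1.0 - success_rate) / n_tasks)


# Values taken from the PR description: 9.7% success on 185 tasks.
std_err = binomial_std_err(0.097, 185)
print(f"standard error: {100 * std_err:.1f} percentage points")
```

Under this assumption the result rounds to 2.2 percentage points, which matches the reported value.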
Closing in favor of a clean PR with correct title and description.
xhluca changed pull request status to closed