Add A3-Qwen3.5-9B WorkArena-L2 results (9.7%)

#13

Adding WorkArena++ L2 (test split, 185 tasks) evaluation results for A3-Qwen3.5-9B.

Score: 9.7% (±2.2 std err)
The model was not trained on ServiceNow data.
Evaluation follows the standard GenericAgent + BrowserGym protocol.
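
As a sanity check on the reported figures, here is a minimal sketch that assumes the standard error is the usual binomial one, sqrt(p(1-p)/n), over the 185 test tasks. The success count of 18 is an assumption inferred from 9.7% of 185; it is not stated in this PR, and the helper name `binomial_std_err` is hypothetical.

```python
import math


def binomial_std_err(successes: int, n: int) -> float:
    """Standard error of a success rate under a binomial model."""
    p = successes / n
    return math.sqrt(p * (1 - p) / n)


n_tasks = 185    # test split size, from the PR description
successes = 18   # ASSUMPTION: inferred from 9.7% of 185, not reported directly

rate = successes / n_tasks
se = binomial_std_err(successes, n_tasks)

# Format the same way the score line is written in the PR.
print(f"{100 * rate:.1f}% (±{100 * se:.1f} std err)")
```

Under these assumptions the numbers round to the reported 9.7% and ±2.2. If the evaluation harness computes the standard error differently (for example across seeds), the result may not match.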

Closing in favor of a clean PR with correct title and description.

xhluca changed pull request status to closed
