WebShepherd committed (verified) · Commit 25c711a · Parent(s): cf9ac4d

Update index.html

Files changed (1): index.html (+10 -10)
index.html CHANGED

--- a/index.html
+++ b/index.html
@@ -3,10 +3,10 @@
 <head>
 <meta charset="utf-8">
 <meta name="description"
-      content="WEB-SHEPHERD: Advancing PRMs for Reinforcing Web Agents">
+      content="Web-Shepherd: Advancing PRMs for Reinforcing Web Agents">
 <meta name="keywords" content="Nerfies, D-NeRF, NeRF">
 <meta name="viewport" content="width=device-width, initial-scale=1">
-<title>WEB-SHEPHERD: Advancing PRMs for Reinforcing Web Agents</title>
+<title>Web-Shepherd: Advancing PRMs for Reinforcing Web Agents</title>
 
 <link href="https://fonts.googleapis.com/css?family=Google+Sans|Noto+Sans|Castoro"
       rel="stylesheet">
@@ -35,7 +35,7 @@
 <div class="column has-text-centered">
 <h1 class="title is-1 publication-title">
 <img src="static/images/shepherd_emoji.png" style="width:1em;vertical-align: middle" alt="Logo"/>
-WEB-SHEPHERD:
+Web-Shepherd:
 </h1>
 <h2 class="subtitle is-3 publication-subtitle">
 Advancing PRMs for Reinforcing Web Agents
@@ -141,7 +141,7 @@
 <div class="container is-max-desktop">
 <div class="content has-text-centered">
 <img src="static/images/figure_1.png" alt="geometric reasoning" width="95%"/>
-<p> Performance and cost-efficiency of WEB-SHEPHERD (3B). WEB-SHEPHERD achieves the state-of-the-art performance while requiring significantly lower cost compared to existing baselines. </p>
+<p> Performance and cost-efficiency of Web-Shepherd (3B). Web-Shepherd achieves the state-of-the-art performance while requiring significantly lower cost compared to existing baselines. </p>
 </div>
 <!-- </div> -->
 </div>
@@ -160,10 +160,10 @@
 </p>
 <p>
 Yet, specialized reward models for web navigation that can be utilized during both training and test-time have been absent until now. Despite the importance of speed and cost-effectiveness, prior works have utilized MLLMs as reward models, which poses significant constraints for real-world deployment.
-To address this, in this work, we propose the first process reward model (PRM) called WEB-SHEPHERD which could assess web navigation trajectories in a step-level. To achieve this, we first construct the WEBPRM collection, a large-scale dataset with 40K step-level preference pairs and annotated checklists spanning diverse domains and difficulty levels. Next, we also introduce the WEB-RewardBench, the first meta-evaluation benchmark for evaluating PRMs. In our experiments, we observe that our WEB-SHEPHERD achieves about 30 points better accuracy compared to using GPT-4o on WEB-RewardBench.
+To address this, in this work, we propose the first process reward model (PRM) called Web-Shepherd which could assess web navigation trajectories in a step-level. To achieve this, we first construct the WEBPRM collection, a large-scale dataset with 40K step-level preference pairs and annotated checklists spanning diverse domains and difficulty levels. Next, we also introduce the WEB-RewardBench, the first meta-evaluation benchmark for evaluating PRMs. In our experiments, we observe that our Web-Shepherd achieves about 30 points better accuracy compared to using GPT-4o on WEB-RewardBench.
 </p>
 <p>
-Furthermore, when testing on WebArena-lite by using GPT-4o-mini as the policy and WEB-SHEPHERD as the verifier, we achieve 10.3 points better performance, in 10 times less cost compared to using GPT-4o-mini as the verifier.
+Furthermore, when testing on WebArena-lite by using GPT-4o-mini as the policy and Web-Shepherd as the verifier, we achieve 10.3 points better performance, in 10 times less cost compared to using GPT-4o-mini as the verifier.
 </p>
 </div>
 </div>
@@ -207,7 +207,7 @@
 <section class="hero is-light is-small">
 <div class="hero-body has-text-centered">
 <h1 class="title is-1 mmmu">
-<span class="mmmu" style="vertical-align: middle">WEB-SHEPHERD</span>
+<span class="mmmu" style="vertical-align: middle">Web-Shepherd</span>
 </h1>
 </div>
 </section>
@@ -224,10 +224,10 @@
 </div>
 <div class="content has-text-justified">
 <p>
-We introduce WEB-SHEPHERD, a process reward model designed to provide dense and reliable supervision to web agents and enable more informative credit assignment.
+We introduce Web-Shepherd, a process reward model designed to provide dense and reliable supervision to web agents and enable more informative credit assignment.
 </p>
 <p>
-We train WEB-SHEPHERD on the WEBPRM Collection to support two key functionalities: (1) generating task-specific checklists, and (2) assigning rewards based on checklist completion.
+We train Web-Shepherd on the WEBPRM Collection to support two key functionalities: (1) generating task-specific checklists, and (2) assigning rewards based on checklist completion.
 </p>
 </div>
 </div>
@@ -258,7 +258,7 @@
 
 <div class="content has-text-justified">
 <p>
-Table above reports the evaluation results on WEB-RewardBench. As shown in Table, state-of-the-art MLLMs struggle to provide reliable rewards for web navigation tasks. This limitation is particularly evident in the trajectory accuracy metric. In this measure, models frequently fail to assign correct rewards consistently at each time step within a single task. In contrast, WEB-SHEPHERD significantly outperforms all baselines, demonstrating a substantial performance gap across all benchmark settings.
+Table above reports the evaluation results on WEB-RewardBench. As shown in Table, state-of-the-art MLLMs struggle to provide reliable rewards for web navigation tasks. This limitation is particularly evident in the trajectory accuracy metric. In this measure, models frequently fail to assign correct rewards consistently at each time step within a single task. In contrast, Web-Shepherd significantly outperforms all baselines, demonstrating a substantial performance gap across all benchmark settings.
 </p>
 <p>
 Also, Table above demonstrates that both baseline and our models benefit significantly from the checklist in assigning rewards.
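
For readers skimming this commit: the page content above describes two functions the renamed model performs, (1) generating a task-specific checklist for an instruction and (2) assigning a step-level reward based on checklist completion, with the reward then used to rerank a policy's candidate actions. The sketch below is a toy illustration of that loop, not the project's actual code; every name in it (Trajectory, generate_checklist, step_reward, best_candidate) is hypothetical, and keyword heuristics stand in for the trained PRM's predictions.

```python
# Toy sketch of checklist-based process rewarding as described on the page.
# All names are hypothetical; a trained PRM replaces the heuristics below.
from dataclasses import dataclass, field


@dataclass
class Trajectory:
    instruction: str
    actions: list[str] = field(default_factory=list)  # e.g. "click('Add to cart')"


def generate_checklist(instruction: str) -> list[str]:
    """Function (1): decompose the task into subgoals.
    A real PRM generates these with a model; we hard-code an example."""
    return ["open the product page", "add to cart", "proceed to checkout"]


def item_completion_prob(item: str, traj: Trajectory) -> float:
    """Stand-in for the PRM's per-item judgment: probability that the
    subgoal is already satisfied by the trajectory so far."""
    history = " ".join(traj.actions).lower()
    hit = any(token in history for token in item.lower().split())
    return 0.9 if hit else 0.1


def step_reward(checklist: list[str], traj: Trajectory) -> float:
    """Function (2): aggregate per-item completion estimates into a single
    scalar step-level reward (here, the mean over checklist items)."""
    probs = [item_completion_prob(item, traj) for item in checklist]
    return sum(probs) / len(probs)


def best_candidate(checklist: list[str], traj: Trajectory, candidates: list[str]) -> str:
    """Verifier-guided selection: the policy proposes candidate actions,
    the reward model ranks them, and the highest-scoring one is executed."""
    def score(action: str) -> float:
        return step_reward(checklist, Trajectory(traj.instruction, traj.actions + [action]))
    return max(candidates, key=score)
```

Under these assumptions, calling best_candidate with a handful of policy proposals mirrors the WebArena-lite setup the page cites, where GPT-4o-mini proposes actions and Web-Shepherd serves as the verifier that reranks them.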