---
base_model:
- Qwen/Qwen2.5-7B-Instruct
datasets:
- miromind-ai/MiroRL-GenQA
language:
- en
license: apache-2.0
tags:
- agent
- deepresearch
- llm
- rl
- reinforcementlearning
pipeline_tag: text-generation
library_name: transformers
---

# Model Card for PokeeResearch

## Model Details

### Model Description

**PokeeResearch-7B** is a **7-billion-parameter deep research agent** developed by **Pokee AI** to advance reliable, aligned, and scalable research-grade reasoning in tool-augmented LLMs.  
The model integrates **Reinforcement Learning from AI Feedback (RLAIF)** with a **robust reasoning scaffold**, enabling it to conduct complex, multi-step research workflows that include self-correction, verification, and synthesis across multiple independent research threads.

- **Developed by:** Pokee AI
- **Model type:** Tool-augmented large language model (LLM) research agent  
- **Language(s):** English, Chinese, and others (inherited from the Qwen2.5 backbone)
- **License:** Apache 2.0  
- **Finetuned from model:** Qwen2.5-7B-Instruct

### Model Sources

- **Repository:** [https://github.com/Pokee-AI/PokeeResearchOSS](https://github.com/Pokee-AI/PokeeResearchOSS)  
- **Paper:** [*PokeeResearch: Effective Deep Research via Reinforcement Learning from AI Feedback and Robust Reasoning Scaffold*](https://arxiv.org/pdf/2510.15862), Pokee AI, October 2025
- **Project Page:** [https://pokee.ai/deepresearch-preview](https://pokee.ai/deepresearch-preview)

---

## Uses

### Direct Use
PokeeResearch-7B is designed for **deep research automation**, where the model autonomously:
- Decomposes complex user queries  
- Retrieves and reads from external sources  
- Synthesizes factual, verifiable, and grounded answers  

It can be used as a **standalone research assistant** or integrated into **multi-agent systems** to support academic, enterprise, or product-level research tasks.

### Downstream Use
PokeeResearch-7B can be **fine-tuned** or **extended** for:
- Domain-specific scientific discovery  
- Autonomous document retrieval and synthesis  
- Multi-source verification and summarization pipelines  
- Integration into reinforcement learning research agents (RLHF/RLAIF frameworks)

### Out-of-Scope Use
The model should **not** be used for:
- Generating unverified or speculative claims  
- Automated decision-making in high-stakes domains (medical, legal, or financial)  
- Applications requiring strict factual precision without external verification  
- Generating content without citation or evidence tracing  

---

## Bias, Risks, and Limitations

PokeeResearch-7B is optimized for factual grounding and robustness, but limitations include:
- Dependence on **external data quality** and **retrieval accuracy**  
- Potential **semantic bias** introduced by AI-based feedback signals  
- Limited coverage for **non-English** or **multi-modal** reasoning tasks  
- Risk of **hallucinated synthesis** when sources conflict or lack clarity  

### Recommendations
Users should:
- Cross-verify answers, especially in multi-hop reasoning cases  
- Monitor output for citation accuracy and alignment with source data  
- Refrain from using outputs as sole evidence in decision-critical contexts  

---

## How to Get Started with the Model
Please refer to the codebase README for usage instructions: [https://github.com/Pokee-AI/PokeeResearchOSS/blob/main/README.md](https://github.com/Pokee-AI/PokeeResearchOSS/blob/main/README.md)
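
As a minimal sketch only (the Hub id below is an assumed placeholder — verify it on the model page, and use the repository above for the full tool-augmented agent loop), the checkpoint can be loaded for plain chat-style generation with `transformers`:

```python
# Minimal sketch: plain chat generation with transformers.
# NOTE: the model id is an assumed placeholder; the full research agent
# (web search, verification loop) lives in the PokeeResearchOSS repo.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "PokeeAI/PokeeResearch-7B"  # placeholder id; verify on the Hub

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",
    device_map="auto",  # requires `accelerate`
)

messages = [{"role": "user", "content": "What year was the transistor invented, and by whom?"}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output_ids = model.generate(input_ids, max_new_tokens=512)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```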

---

## Training Details

### Training Data
- **Dataset:** MiroRL-GenQA dataset (MiroMind AI, 2025)  
- **Data characteristics:** Complex, multi-turn question–answer pairs requiring multi-step reasoning  
- **Data filtering:** No benchmark test data was included in training; the model was trained only on open-domain text Q&A samples  

### Training Procedure

#### Preprocessing
- Normalization and tokenization aligned with Qwen2.5 tokenizer  
- Structured prompt–response pairs in research/verification format (`<tool_call>`, `<answer>`, `<verification>`)
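
For illustration only, the sketch below shows roughly what a scaffolded turn in this format might look like; the tool name, JSON payload, and tag contents are assumptions, and the real templates are defined in the PokeeResearchOSS codebase.

```python
# Illustrative only: a rough rendering of the research/verification tag
# structure listed above. Tool names and payloads here are assumptions.
example_turn = (
    "<tool_call>\n"
    '{"name": "web_search", "arguments": {"query": "2024 Nobel Prize in Physics laureates"}}\n'
    "</tool_call>\n"
    "...tool results inserted by the environment...\n"
    "<answer>\nJohn Hopfield and Geoffrey Hinton.\n</answer>\n"
    "<verification>\nThe answer names both laureates and is supported by the retrieved pages.\n</verification>"
)
print(example_turn)
```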

#### Training Hyperparameters
- **Algorithm:** RLOO (REINFORCE Leave-One-Out)  
- **Batch size:** 64  
- **Research threads per prompt:** 8  
- **Learning rate:** 3e-6  
- **Context limit:** 32,768 tokens  
- **Steps:** 140 fine-tuning iterations  
- **Regularization:** None (no entropy or KL regularization)  
- **Precision regime:** bf16 mixed precision  
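
As a simplified sketch of how RLOO turns the 8 research threads sampled per prompt into advantages (this is not the training code, just the leave-one-out baseline idea):

```python
import numpy as np

def rloo_advantages(rewards: np.ndarray) -> np.ndarray:
    """Leave-one-out advantages for the threads sampled from one prompt.

    Each thread's baseline is the mean reward of the *other* threads from the
    same prompt, so no learned value function is needed.
    """
    k = len(rewards)
    baselines = (rewards.sum() - rewards) / (k - 1)
    return rewards - baselines

# 8 research threads for one prompt, one scalar reward each
print(rloo_advantages(np.array([1.0, 0.0, 1.0, 1.0, 0.0, 0.0, 1.0, 0.0])))
```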

#### Reward Design
- Combined reward signal from:
  - **AI feedback** (semantic equivalence via external LLM judge)  
  - **Format adherence reward** (ensures correct agent behavior)  
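
A minimal sketch of how these two terms could be combined into a single scalar per thread; the gating and partial credit below are illustrative assumptions, not the exact scheme from the paper.

```python
def combined_reward(judge_says_correct: bool, format_ok: bool) -> float:
    """Toy combination of the two reward terms described above.

    judge_says_correct: an external LLM judge deems the final answer
        semantically equivalent to the reference answer.
    format_ok: the trajectory followed the required tag format.
    The weighting/gating here is an illustrative assumption.
    """
    if not format_ok:
        return 0.0  # malformed trajectories earn nothing
    return 1.0 if judge_says_correct else 0.1  # small credit for correct format only
```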

#### Speeds, Sizes, Times
- **Model size:** 7 billion parameters  
- **Training duration:** ~5 days on 8 × A100 80GB GPUs  
- **Checkpoint size:** ~13 GB  

---

## Evaluation

### Testing Data, Factors & Metrics

#### Testing Data
10 open-domain research and QA benchmarks:
- NQ, TriviaQA, PopQA, HotpotQA, 2WikiMultiHopQA, Musique, Bamboogle, GAIA, BrowseComp, Humanity’s Last Exam

#### Factors
- Benchmarks differ by reasoning depth, retrieval dependence, and factual precision requirements.  
- Evaluations disaggregate by dataset difficulty and task type (single-hop vs multi-hop).  

#### Metrics
- Mean accuracy (mean@4), averaged over four independent research threads per query  

### Results

**PokeeResearch-7B (PR)** and its **RTS variant (PR+)** outperform all baselines at 7B scale across the 10 benchmarks.  
Highlights (mean@4 accuracy):  
| **Method** | **HLE** | **GAIA** | **BrowseComp** | **BAMB** | **2WIKI** | **TQ** | **NQ** | **POPQA** | **MUSIQUE** | **HOTPOTQA** |
|-------------|----------|-----------|----------------|-----------|-----------|----------|----------|-------------|---------------|----------------|
| R1searcher | 5.4 | 8.3 | 1.0 | 63.2 | 61.4 | 77.2 | 59.6 | 51.8 | 35.8 | 62.4 |
| SearchR1 | 13.0 | 18.7 | 0.4 | 67.8 | 62.8 | 81.0 | 67.6 | 59.6 | 33.2 | 63.2 |
| ZeroSearch | 8.6 | 9.9 | 1.4 | 51.4 | 33.6 | 61.6 | 48.2 | 38.0 | 19.0 | 32.4 |
| ASearcher | 13.8 | 22.1 | 3.2 | 68.8 | 69.2 | 85.2 | 71.2 | 58.2 | 35.8 | 71.0 |
| DeepResearcher | 6.0 | 24.03 | 1.8 | 71.0 | 58.8 | 82.2 | 60.2 | 55.2 | 26.8 | 56.6 |
| **PR** | **15.2** | **36.9** | **5.4** | **74.5** | **74.0** | **91.3** | **75.1** | **59.8** | **39.8** | **71.2** |
| **PR+** | **17.6** | **41.3** | **8.4** | **75.0** | **75.0** | **91.8** | **75.0** | **60.0** | **41.4** | **71.6** |

#### Summary
PokeeResearch-7B variants achieve **state-of-the-art performance among 7B-scale open deep research agents**, validating the RLAIF and reasoning-scaffold design for robust, verifiable research workflows.

---

## Technical Specifications

### Model Architecture and Objective
- **Base Architecture:** Transformer decoder (Qwen2.5-7B-Instruct backbone)  
- **Objective:** Reinforcement learning with AI feedback to maximize semantic correctness and alignment with human-style reasoning  

### Compute Infrastructure
#### Hardware
- 8 × NVIDIA A100 80GB GPUs for training; 1 × A100 80GB for inference

---

## Citation

**BibTeX:**
```bibtex
@article{pokee2025deepresearch,
  title={PokeeResearch: Effective Deep Research via
          Reinforcement Learning from AI Feedback and Robust Reasoning Scaffold},
  author={Yi Wan* and Jiuqi Wang* and Liam Li
          and Jinsong Liu and Ruihao Zhu and Zheqing Zhu},
  journal={Pokee AI Technical Report},
  year={2025},
  url={https://arxiv.org/pdf/2510.15862}
}
```

**APA:**
Wan, Y., Wang, J., Li, L., Liu, J., Zhu, R., & Zhu, Z. (2025). *PokeeResearch: Effective Deep Research via Reinforcement Learning from AI Feedback and Robust Reasoning Scaffold.* Pokee AI.

---

## Glossary

- **RLAIF:** Reinforcement Learning from AI Feedback – optimization using LLM-based reward signals.  
- **RLOO:** REINFORCE Leave-One-Out – unbiased policy gradient variant for on-policy learning.  
- **RTS:** Research Threads Synthesis – synthesis of multiple independent reasoning threads at inference time.  

---

## More Information
For technical details, visit: [https://github.com/Pokee-AI/PokeeResearchOSS](https://github.com/Pokee-AI/PokeeResearchOSS)  
For inquiries, contact: [email protected]  

---

## Model Card Authors
**Yi Wan**, **Jiuqi Wang**, Liam Li, Jinsong Liu, Ruihao Zhu, and Zheqing Zhu — Pokee AI Research Team  

## Model Card Contact
Pokee AI Team — [email protected]