billmatrix committed
Commit 6cde7f4 · verified · 1 Parent(s): d709715

Create README.md

Files changed (1)
  1. README.md +217 -0
README.md ADDED
@@ -0,0 +1,217 @@
---
license: apache-2.0
language:
- en
tags:
- agent
- deepresearch
- llm
- rl
- reinforcementlearning
datasets:
- miromind-ai/MiroRL-GenQA
base_model:
- Qwen/Qwen2.5-7B-Instruct
---

# Model Card for PokeeResearch

## Model Details

### Model Description

**PokeeResearch-7B** is a **7-billion-parameter deep research agent** developed by **Pokee AI** to advance reliable, aligned, and scalable research-grade reasoning in tool-augmented LLMs.
The model integrates **Reinforcement Learning from AI Feedback (RLAIF)** with a **robust reasoning scaffold**, enabling it to conduct complex, multi-step research workflows that include self-correction, verification, and synthesis across multiple independent research threads.

- **Developed by:** Pokee AI
- **Model type:** Tool-augmented large language model (LLM) research agent
- **Language(s):** English (primary), with Chinese and other languages inherited from the Qwen2.5 backbone
- **License:** Apache 2.0
- **Finetuned from model:** Qwen2.5-7B-Instruct

### Model Sources

- **Repository:** [https://github.com/Pokee-AI/PokeeResearchOSS](https://github.com/Pokee-AI/PokeeResearchOSS)
- **Paper:** *PokeeResearch: Effective Deep Research via Reinforcement Learning from AI Feedback and Robust Reasoning Scaffold*, Pokee AI, October 2025
- **API Access:** [https://pokee.ai/deepresearch](https://pokee.ai/deepresearch)

---

## Uses

### Direct Use
PokeeResearch-7B is designed for **deep research automation**, where the model autonomously:
- Decomposes complex user queries
- Retrieves and reads from external sources
- Synthesizes factual, verifiable, and grounded answers

It can be used as a **standalone research assistant** or integrated into **multi-agent systems** to support academic, enterprise, or product-level research tasks.

### Downstream Use
PokeeResearch-7B can be **fine-tuned** or **extended** for:
- Domain-specific scientific discovery
- Autonomous document retrieval and synthesis
- Multi-source verification and summarization pipelines
- Integration into reinforcement learning research agents (RLHF/RLAIF frameworks)

### Out-of-Scope Use
The model should **not** be used for:
- Generating unverified or speculative claims
- Automated decision-making in high-stakes domains (medical, legal, or financial)
- Applications requiring strict factual precision without external verification
- Generating content without citation or evidence tracing

---

## Bias, Risks, and Limitations

PokeeResearch-7B is optimized for factual grounding and robustness, but limitations include:
- Dependence on **external data quality** and **retrieval accuracy**
- Potential **semantic bias** introduced by AI-based feedback signals
- Limited coverage for **non-English** or **multi-modal** reasoning tasks
- Risk of **hallucinated synthesis** when sources conflict or lack clarity

### Recommendations
Users should:
- Cross-verify answers, especially in multi-hop reasoning cases
- Monitor output for citation accuracy and alignment with source data
- Refrain from using outputs as sole evidence in decision-critical contexts

---

## How to Get Started with the Model
For full setup and usage instructions, refer to the PokeeResearchOSS codebase:
https://github.com/Pokee-AI/PokeeResearchOSS/blob/main/README.md
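
The full agent loop (web search, page reading, self-verification, and Research Threads Synthesis) is driven by the scaffold in that repository. Purely as a minimal smoke test for the checkpoint itself, the sketch below assumes the weights load as a standard Qwen2.5-style chat model with Hugging Face Transformers; the repository id `PokeeAI/PokeeResearch-7B` is a placeholder, not a confirmed identifier.

```python
# Minimal smoke test: load the checkpoint as a standard Qwen2.5-style chat model.
# NOTE: "PokeeAI/PokeeResearch-7B" is a placeholder id; substitute the repository
# this model card is hosted under. The full research agent (search tools,
# verification loop, RTS) requires the PokeeResearchOSS scaffold.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "PokeeAI/PokeeResearch-7B"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

messages = [{"role": "user", "content": "Who proposed the transformer architecture, and in which paper?"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```

Without the tool scaffold this is a plain chat completion, so treat it only as a check that the weights load and generate.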

---

## Training Details

### Training Data
- **Dataset:** MiroRL-GenQA (MiroMind AI, 2025)
- **Data characteristics:** Complex, multi-turn question–answer pairs requiring multi-step reasoning
- **Data filtering:** No benchmark test data was included in training; the model was trained only on open-domain text Q&A samples

### Training Procedure

#### Preprocessing
- Normalization and tokenization aligned with the Qwen2.5 tokenizer
- Structured prompt–response pairs in the research/verification format (`<tool_call>`, `<answer>`, `<verification>`); an illustrative turn is sketched below
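
As a purely illustrative example of what one structured turn might look like (the exact tag schema, tool names, and JSON fields are defined in the PokeeResearchOSS repo, so everything below is an assumption):

```python
# Illustrative only: the concrete tag schema, tool names, and JSON fields are
# defined in the PokeeResearchOSS repo; the values below are hypothetical.
search_turn = """<tool_call>
{"name": "web_search", "arguments": {"query": "2023 Nobel Prize in Physics laureates"}}
</tool_call>"""

final_turn = """<answer>
Pierre Agostini, Ferenc Krausz, and Anne L'Huillier received the 2023 Nobel Prize in Physics.
</answer>
<verification>
The answer agrees with the retrieved sources; no conflicting evidence was found.
</verification>"""

print(search_turn)
print(final_turn)
```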

#### Training Hyperparameters
- **Algorithm:** RLOO (REINFORCE Leave-One-Out); the leave-one-out advantage is sketched after this list
- **Batch size:** 64
- **Research threads per prompt:** 8
- **Learning rate:** 3e-6
- **Context limit:** 32,768 tokens
- **Steps:** 140 fine-tuning iterations
- **Regularization:** None (no entropy or KL regularization)
- **Precision regime:** bf16 mixed precision
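
The leave-one-out advantage mentioned above baselines each of the 8 research threads sampled for a prompt against the mean reward of the other 7 threads. The snippet below is a sketch of that estimator only, not the released training loop:

```python
import torch

def rloo_advantages(rewards: torch.Tensor) -> torch.Tensor:
    """Leave-one-out advantages for a single prompt.

    rewards: shape (k,), the scalar reward of each of the k research threads
    sampled for the same prompt (k = 8 in this setup). Each thread is
    baselined against the mean reward of the other k - 1 threads.
    """
    k = rewards.shape[0]
    loo_baseline = (rewards.sum() - rewards) / (k - 1)
    return rewards - loo_baseline

# Example: 8 threads, reward 1.0 when the judge accepts the answer, else 0.0.
rewards = torch.tensor([1.0, 0.0, 1.0, 1.0, 0.0, 1.0, 0.0, 1.0])
print(rloo_advantages(rewards))
```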

#### Reward Design
- Combined reward signal from the two terms below (a hedged combination sketch follows this list):
  - **AI feedback** (semantic equivalence via external LLM judge)
  - **Format adherence reward** (ensures correct agent behavior)
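
A hedged sketch of how the two terms could be combined per trajectory follows. Gating correctness on format adherence, the `<answer>` regex, and the `judge` callable are assumptions made for illustration; the reward code in PokeeResearchOSS is authoritative.

```python
import re

ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)

def format_reward(trajectory: str) -> float:
    """1.0 if the trajectory contains exactly one well-formed <answer> block."""
    return 1.0 if len(ANSWER_RE.findall(trajectory)) == 1 else 0.0

def combined_reward(trajectory: str, gold_answer: str, judge) -> float:
    """AI-feedback correctness gated on format adherence (illustrative)."""
    if format_reward(trajectory) == 0.0:
        return 0.0  # malformed trajectories earn no correctness credit
    predicted = ANSWER_RE.search(trajectory).group(1).strip()
    # `judge` is any callable (e.g. an external LLM) that returns True when the
    # predicted answer is semantically equivalent to the gold answer.
    return 1.0 if judge(predicted, gold_answer) else 0.0
```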

#### Speeds, Sizes, Times
- **Model size:** 7 billion parameters
- **Training duration:** ~5 days on 8 × A100 80GB GPUs
- **Checkpoint size:** ~13 GB

---

## Evaluation

### Testing Data, Factors & Metrics

#### Testing Data
10 open-domain research and QA benchmarks:
- NQ, TriviaQA, PopQA, HotpotQA, 2WikiMultiHopQA, Musique, Bamboogle, GAIA, BrowseComp, Humanity’s Last Exam

#### Factors
- Benchmarks differ by reasoning depth, retrieval dependence, and factual precision requirements.
- Evaluations disaggregate by dataset difficulty and task type (single-hop vs. multi-hop).

#### Metrics
- Mean accuracy (**mean@4**): accuracy averaged over 4 independent research threads per question, with correctness judged by semantic equivalence to the reference answer
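
Concretely, the metric is a per-question average over the 4 threads followed by an average over the benchmark; a tiny sketch with an illustrative helper name:

```python
def mean_at_k(per_question_scores: list[list[float]]) -> float:
    """per_question_scores[i] holds the 0/1 judge scores of the k independent
    research threads (here k = 4) run for question i."""
    per_question_means = [sum(scores) / len(scores) for scores in per_question_scores]
    return sum(per_question_means) / len(per_question_means)

# Example: 3 questions, 4 threads each -> (0.75 + 0.25 + 1.0) / 3 ≈ 0.667
print(mean_at_k([[1, 1, 0, 1], [0, 0, 1, 0], [1, 1, 1, 1]]))
```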

### Results

**PokeeResearch-7B** and its **RTS variant** outperform all baselines at the 7B scale across the 10 benchmarks.
Highlights (mean@4 accuracy; PR = PokeeResearch-7B, PR+ = PokeeResearch-7B with RTS):
| **Method** | **HLE** | **GAIA** | **BrowseComp** | **BAMB** | **2WIKI** | **TQ** | **NQ** | **POPQA** | **MUSIQUE** | **HOTPOTQA** |
|-------------|----------|-----------|----------------|-----------|-----------|----------|----------|-------------|---------------|----------------|
| R1searcher | 5.4 | 8.3 | 1.0 | 63.2 | 61.4 | 77.2 | 59.6 | 51.8 | 35.8 | 62.4 |
| SearchR1 | 13.0 | 18.7 | 0.4 | 67.8 | 62.8 | 81.0 | 67.6 | 59.6 | 33.2 | 63.2 |
| ZeroSearch | 8.6 | 9.9 | 1.4 | 51.4 | 33.6 | 61.6 | 48.2 | 38.0 | 19.0 | 32.4 |
| ASearcher | 13.8 | 22.1 | 3.2 | 68.8 | 69.2 | 85.2 | 71.2 | 58.2 | 35.8 | 71.0 |
| DeepResearcher | 6.0 | 24.03 | 1.8 | 71.0 | 58.8 | 82.2 | 60.2 | 55.2 | 26.8 | 56.6 |
| **PR** | **15.2** | **36.9** | **5.4** | **74.5** | **74.0** | **91.3** | **75.1** | **59.8** | **39.8** | **71.2** |
| **PR+** | **17.6** | **41.3** | **8.4** | **75.0** | **75.0** | **91.8** | **75.0** | **60.0** | **41.4** | **71.6** |

#### Summary
PokeeResearch-7B variants achieve **state-of-the-art performance among 7B-scale open deep research agents**, validating the RLAIF and reasoning-scaffold design for robust, verifiable research workflows.

---

## Model Examination
The model’s **self-verification loop** prevents common reasoning errors by iteratively verifying its own answers before finalizing them.
Example walkthroughs in Appendix B of the technical report show that incorrect responses are identified and corrected through self-evaluation cycles.
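
A rough sketch of that verify-then-retry control flow is below; the function names, retry budget, and feedback plumbing are illustrative assumptions, and the actual loop lives in the PokeeResearchOSS scaffold.

```python
def research_with_self_verification(question, agent, verifier, max_attempts=3):
    """Illustrative control flow only: draft an answer with the tool-using agent,
    verify it against the retrieved evidence, and retry with the verifier's
    feedback when verification fails."""
    feedback = None
    answer = ""
    for _ in range(max_attempts):
        answer = agent(question, feedback=feedback)   # tool-augmented research pass
        ok, feedback = verifier(question, answer)     # self-verification pass
        if ok:
            return answer
    return answer  # fall back to the last draft if verification never passes
```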

---

## Technical Specifications

### Model Architecture and Objective
- **Base Architecture:** Transformer decoder (Qwen2.5-7B-Instruct backbone)
- **Objective:** Reinforcement learning with AI feedback to maximize semantic correctness and alignment with human-style reasoning

### Compute Infrastructure
#### Hardware
- NVIDIA A100 80GB GPUs: ×8 for training, ×1 for inference

#### Software
- Framework: PyTorch + DeepSpeed
- Training orchestrator: MiroRL (MiroMind AI)
- Toolchain integration: Serper.dev and Jina Reader

---

## Citation

**BibTeX:**
```bibtex
@article{pokee2025deepresearch,
  title={PokeeResearch: Effective Deep Research via Reinforcement Learning from AI Feedback and Robust Reasoning Scaffold},
  author={Yi Wan* and Jiuqi Wang* and Liam Li and Jinsong Liu and Ruihao Zhu and Zheqing Zhu},
  journal={Pokee AI Technical Report},
  year={2025},
  url={https://github.com/Pokee-AI/PokeeResearchOSS}
}
```

**APA:**
Wan, Y., Wang, J., Li, L., Liu, J., Zhu, R., & Zhu, Z. (2025). *PokeeResearch: Effective Deep Research via Reinforcement Learning from AI Feedback and Robust Reasoning Scaffold.* Pokee AI.

---

## Glossary

- **RLAIF:** Reinforcement Learning from AI Feedback – optimization using LLM-based reward signals.
- **RLOO:** REINFORCE Leave-One-Out – unbiased policy-gradient variant for on-policy learning.
- **RTS:** Research Threads Synthesis – synthesis of multiple independent reasoning threads at inference time.

---

## More Information
For technical details, visit: [https://github.com/Pokee-AI/PokeeResearchOSS](https://github.com/Pokee-AI/PokeeResearchOSS)
For inquiries, contact: [email protected]

---

## Model Card Authors
**Yi Wan**, **Jiuqi Wang**, Liam Li, Jinsong Liu, Ruihao Zhu, and Zheqing Zhu — Pokee AI Research Team

## Model Card Contact
Pokee AI Team — [email protected]