---
base_model:
- Qwen/Qwen2.5-7B-Instruct
datasets:
- miromind-ai/MiroRL-GenQA
language:
- en
license: apache-2.0
tags:
- agent
- deepresearch
- llm
- rl
- reinforcementlearning
pipeline_tag: text-generation
library_name: transformers
---
# Model Card for PokeeResearch
## Model Details
### Model Description
**PokeeResearch-7B** is a **7-billion-parameter deep research agent** developed by **Pokee AI** to advance reliable, aligned, and scalable research-grade reasoning in tool-augmented LLMs.
The model integrates **Reinforcement Learning from AI Feedback (RLAIF)** with a **robust reasoning scaffold**, enabling it to conduct complex, multi-step research workflows that include self-correction, verification, and synthesis across multiple independent research threads.
- **Developed by:** Pokee AI
- **Model type:** Tool-augmented large language model (LLM) research agent
- **Language(s):** English, Chinese, and other languages supported by the Qwen2.5 backbone; training and evaluation focus on English
- **License:** Apache 2.0
- **Finetuned from model:** Qwen2.5-7B-Instruct
### Model Sources
- **Repository:** [https://github.com/Pokee-AI/PokeeResearchOSS](https://github.com/Pokee-AI/PokeeResearchOSS)
- **Paper:** [*PokeeResearch: Effective Deep Research via Reinforcement Learning from AI Feedback and Robust Reasoning Scaffold*](https://arxiv.org/pdf/2510.15862), Pokee AI, October 2025
- **Project Page:** [https://pokee.ai/deepresearch-preview](https://pokee.ai/deepresearch-preview)
---
## Uses
### Direct Use
PokeeResearch-7B is designed for **deep research automation**, where the model autonomously:
- Decomposes complex user queries
- Retrieves and reads from external sources
- Synthesizes factual, verifiable, and grounded answers
It can be used as a **standalone research assistant** or integrated into **multi-agent systems** to support academic, enterprise, or product-level research tasks.
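For illustration only, a hypothetical outline of such a research loop is sketched below; the function names and the `web_search` tool are placeholders, not the PokeeResearchOSS API.

```python
# Hypothetical outline of a deep-research loop; names are placeholders,
# not the actual PokeeResearchOSS API.
def deep_research(query: str, agent, web_search, max_steps: int = 8) -> str:
    notes = []
    for _ in range(max_steps):
        # The agent proposes its next action: another search or a final answer.
        action = agent.next_action(query, notes)
        if action.kind == "answer":
            # Verification pass: re-check the draft answer against gathered evidence.
            return agent.verify_and_finalize(query, notes, action.text)
        # Otherwise retrieve and read external sources, then continue.
        notes.append(web_search(action.text))
    return agent.verify_and_finalize(query, notes, None)
```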
### Downstream Use
PokeeResearch-7B can be **fine-tuned** or **extended** for:
- Domain-specific scientific discovery
- Autonomous document retrieval and synthesis
- Multi-source verification and summarization pipelines
- Integration into reinforcement learning research agents (RLHF/RLAIF frameworks)
### Out-of-Scope Use
The model should **not** be used for:
- Generating unverified or speculative claims
- Automated decision-making in high-stakes domains (medical, legal, or financial)
- Applications requiring strict factual precision without external verification
- Generating content without citation or evidence tracing
---
## Bias, Risks, and Limitations
PokeeResearch-7B is optimized for factual grounding and robustness, but limitations include:
- Dependence on **external data quality** and **retrieval accuracy**
- Potential **semantic bias** introduced by AI-based feedback signals
- Limited coverage for **non-English** or **multi-modal** reasoning tasks
- Risk of **hallucinated synthesis** when sources conflict or lack clarity
### Recommendations
Users should:
- Cross-verify answers, especially in multi-hop reasoning cases
- Monitor output for citation accuracy and alignment with source data
- Refrain from using outputs as sole evidence in decision-critical contexts
---
## How to Get Started with the Model
Please refer to the PokeeResearchOSS README for full setup and usage instructions: [https://github.com/Pokee-AI/PokeeResearchOSS/blob/main/README.md](https://github.com/Pokee-AI/PokeeResearchOSS/blob/main/README.md)
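As a minimal sketch, the checkpoint can be loaded like any Qwen2.5-based causal LM with `transformers`. The model id below is an assumption, and standalone generation does not perform web retrieval; the tool-call loop and verification scaffold live in the PokeeResearchOSS codebase.

```python
# Minimal sketch: load PokeeResearch-7B as a standard causal LM.
# The model id below is an assumption; see the PokeeResearchOSS README
# for the published checkpoint name and the full research-agent scaffold.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "PokeeAI/PokeeResearch-7B"  # assumed Hugging Face id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # bf16 matches the training precision
    device_map="auto",
)

messages = [{"role": "user", "content": "Who introduced the transformer architecture, and in which paper?"}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
output_ids = model.generate(input_ids, max_new_tokens=512)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```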
---
## Training Details
### Training Data
- **Dataset:** MiroRL-GenQA dataset (MiroMind AI, 2025)
- **Data characteristics:** Complex, multi-turn question–answer pairs requiring multi-step reasoning
- **Data filtering:** None of the evaluation benchmark data was included in training; the model was trained only on open-domain text Q&A samples
### Training Procedure
#### Preprocessing
- Normalization and tokenization aligned with Qwen2.5 tokenizer
- Structured prompt–response pairs in research/verification format (`<tool_call>`, `<answer>`, `<verification>`)
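Purely as an illustration of the tag structure listed above (the actual template and tool-call schema are defined in the PokeeResearchOSS codebase), an agent turn might look like:

```python
# Illustrative only: the real prompt template and tool-call schema are
# defined in the PokeeResearchOSS codebase.
example_turn = """\
<tool_call>{"name": "web_search", "arguments": {"query": "2024 Nobel Prize in Physics laureates"}}</tool_call>
...tool results returned by the environment...
<answer>The 2024 Nobel Prize in Physics was awarded to John Hopfield and Geoffrey Hinton.</answer>
<verification>The answer is consistent with two independently retrieved sources above.</verification>
"""
```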
#### Training Hyperparameters
- **Algorithm:** RLOO (REINFORCE Leave-One-Out); see the sketch after this list
- **Batch size:** 64
- **Research threads per prompt:** 8
- **Learning rate:** 3e-6
- **Context limit:** 32,768 tokens
- **Steps:** 140 fine-tuning iterations
- **Regularization:** None (no entropy or KL regularization)
- **Precision regime:** bf16 mixed precision
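As a sketch of the RLOO baseline used here: with 8 research threads per prompt, each thread's advantage is its reward minus the mean reward of the other 7 threads for the same prompt. A minimal NumPy illustration, not the training code itself:

```python
import numpy as np

def rloo_advantages(rewards: np.ndarray) -> np.ndarray:
    """Leave-one-out advantages for k rollouts of the same prompt.

    rewards: shape (k,), one scalar reward per research thread.
    Each thread is baselined against the mean reward of the other k - 1 threads.
    """
    k = rewards.shape[0]
    loo_mean = (rewards.sum() - rewards) / (k - 1)
    return rewards - loo_mean

# Example: 8 research threads per prompt, as in training.
advantages = rloo_advantages(np.array([1.0, 0.0, 1.0, 1.0, 0.0, 1.0, 0.0, 1.0]))
```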
#### Reward Design
- Combined reward signal from:
- **AI feedback** (semantic equivalence via external LLM judge)
- **Format adherence reward** (ensures correct agent behavior)
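The precise combination of the two signals is not specified in this card; a minimal sketch, assuming a binary LLM-judge correctness score gated by format adherence, might be:

```python
def thread_reward(semantically_correct: bool, format_ok: bool) -> float:
    # Sketch only: the actual reward combination is defined in PokeeResearchOSS.
    # An answer that violates the required tag format earns no credit;
    # a well-formed answer is credited when the LLM judge deems it
    # semantically equivalent to the reference.
    if not format_ok:
        return 0.0
    return 1.0 if semantically_correct else 0.0
```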
#### Speeds, Sizes, Times
- **Model size:** 7 billion parameters
- **Training duration:** ~5 days on 8 × A100 80G GPUs
- **Checkpoint size:** ~13 GB
---
## Evaluation
### Testing Data, Factors & Metrics
#### Testing Data
10 open-domain research and QA benchmarks:
- NQ, TriviaQA, PopQA, HotpotQA, 2WikiMultiHopQA, Musique, Bamboogle, GAIA, BrowseComp, Humanity’s Last Exam
#### Factors
- Benchmarks differ by reasoning depth, retrieval dependence, and factual precision requirements.
- Evaluations disaggregate by dataset difficulty and task type (single-hop vs multi-hop).
#### Metrics
- Mean accuracy (mean@4): accuracy averaged over four independent research threads per question, with answer correctness judged for semantic equivalence to the reference
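Concretely, mean@4 scores each question by the fraction of its four independent research threads judged correct, then averages over the benchmark; for example:

```python
# mean@4 for a single question: fraction of 4 independent threads judged correct.
thread_correct = [True, True, False, True]
question_score = sum(thread_correct) / len(thread_correct)  # 0.75
# The reported benchmark number is this score averaged over all questions.
```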
### Results
**PokeeResearch-7B** (PR) and its **RTS variant** (PR+) outperform all baselines at the 7B scale across all 10 benchmarks.
Highlights (mean@4 accuracy):
| **Method** | **HLE** | **GAIA** | **BrowseComp** | **BAMB** | **2WIKI** | **TQ** | **NQ** | **POPQA** | **MUSIQUE** | **HOTPOTQA** |
|-------------|----------|-----------|----------------|-----------|-----------|----------|----------|-------------|---------------|----------------|
| R1searcher | 5.4 | 8.3 | 1.0 | 63.2 | 61.4 | 77.2 | 59.6 | 51.8 | 35.8 | 62.4 |
| SearchR1 | 13.0 | 18.7 | 0.4 | 67.8 | 62.8 | 81.0 | 67.6 | 59.6 | 33.2 | 63.2 |
| ZeroSearch | 8.6 | 9.9 | 1.4 | 51.4 | 33.6 | 61.6 | 48.2 | 38.0 | 19.0 | 32.4 |
| ASearcher | 13.8 | 22.1 | 3.2 | 68.8 | 69.2 | 85.2 | 71.2 | 58.2 | 35.8 | 71.0 |
| DeepResearcher | 6.0 | 24.03 | 1.8 | 71.0 | 58.8 | 82.2 | 60.2 | 55.2 | 26.8 | 56.6 |
| **PR** | **15.2** | **36.9** | **5.4** | **74.5** | **74.0** | **91.3** | **75.1** | **59.8** | **39.8** | **71.2** |
| **PR+** | **17.6** | **41.3** | **8.4** | **75.0** | **75.0** | **91.8** | **75.0** | **60.0** | **41.4** | **71.6** |
#### Summary
PokeeResearch-7B achieves **state-of-the-art performance among 7B-scale open deep research agents**, validating the RLAIF and reasoning-scaffold design for robust, verifiable research workflows.
---
## Technical Specifications
### Model Architecture and Objective
- **Base Architecture:** Transformer decoder (Qwen2.5-7B-Instruct backbone)
- **Objective:** Reinforcement learning with AI feedback to maximize semantic correctness and alignment with human-style reasoning
### Compute Infrastructure
#### Hardware
- 8 × NVIDIA A100 80GB GPUs for training; 1 × A100 80GB GPU for inference
---
## Citation
**BibTeX:**
```bibtex
@article{pokee2025deepresearch,
  title={PokeeResearch: Effective Deep Research via Reinforcement Learning from AI Feedback and Robust Reasoning Scaffold},
  author={Yi Wan* and Jiuqi Wang* and Liam Li and Jinsong Liu and Ruihao Zhu and Zheqing Zhu},
  journal={Pokee AI Technical Report},
  year={2025},
  url={https://arxiv.org/pdf/2510.15862}
}
```
**APA:**
Wan, Y., Wang, J., Li, L., Liu, J., Zhu, R., & Zhu, Z. (2025). *PokeeResearch: Effective Deep Research via Reinforcement Learning from AI Feedback and Robust Reasoning Scaffold.* Pokee AI.
---
## Glossary
- **RLAIF:** Reinforcement Learning from AI Feedback – optimization using LLM-based reward signals.
- **RLOO:** REINFORCE Leave-One-Out – unbiased policy gradient variant for on-policy learning.
- **RTS:** Research Threads Synthesis – synthesis of multiple independent reasoning threads at inference time.
---
## More Information
For technical details, visit: [https://github.com/Pokee-AI/PokeeResearchOSS](https://github.com/Pokee-AI/PokeeResearchOSS)
For inquiries, contact: [email protected]
---
## Model Card Authors
**Yi Wan**, **Jiuqi Wang**, Liam Li, Jinsong Liu, Ruihao Zhu, and Zheqing Zhu — Pokee AI Research Team
## Model Card Contact
Pokee AI Team — [email protected]