PokeeAI/pokee_research_7b

+---
+license: apache-2.0
+language:
+- en
+tags:
+- agent
+- deepresearch
+- llm
+- rl
+- reinforcementlearning
+datasets:
+- miromind-ai/MiroRL-GenQA
+base_model:
+- Qwen/Qwen2.5-7B-Instruct
+---
+# Model Card for PokeeResearch
+## Model Details
+### Model Description
+**PokeeResearch-7B** is a **7-billion-parameter deep research agent** developed by **Pokee AI** to advance reliable, aligned, and scalable research-grade reasoning in tool-augmented LLMs.
+The model integrates **Reinforcement Learning from AI Feedback (RLAIF)** with a **robust reasoning scaffold**, enabling it to conduct complex, multi-step research workflows that include self-correction, verification, and synthesis across multiple independent research threads.
+- **Developed by:** Pokee AI
+- **Model type:** Tool-augmented large language model (LLM) research agent
+- **Language(s):** English, Chinese, and other languages (coverage beyond English is limited)
+- **License:** Apache 2.0
+- **Finetuned from model:** Qwen2.5-7B-Instruct
+### Model Sources
+- **Repository:** [https://github.com/Pokee-AI/PokeeResearchOSS](https://github.com/Pokee-AI/PokeeResearchOSS)
+- **Paper:** *PokeeResearch: Effective Deep Research via Reinforcement Learning from AI Feedback and Robust Reasoning Scaffold*, Pokee AI, October 2025
+- **API Access:** [https://pokee.ai/deepresearch](https://pokee.ai/deepresearch)
+---
+## Uses
+### Direct Use
+PokeeResearch-7B is designed for **deep research automation**, where the model autonomously:
+- Decomposes complex user queries
+- Retrieves and reads from external sources
+- Synthesizes factual, verifiable, and grounded answers
+It can be used as a **standalone research assistant** or integrated into **multi-agent systems** to support academic, enterprise, or product-level research tasks.
+### Downstream Use
+PokeeResearch-7B can be **fine-tuned** or **extended** for:
+- Domain-specific scientific discovery
+- Autonomous document retrieval and synthesis
+- Multi-source verification and summarization pipelines
+- Integration into reinforcement learning research agents (RLHF/RLAIF frameworks)
+### Out-of-Scope Use
+The model should **not** be used for:
+- Generating unverified or speculative claims
+- Automated decision-making in high-stakes domains (medical, legal, or financial)
+- Applications requiring strict factual precision without external verification
+- Generating content without citation or evidence tracing
+---
+## Bias, Risks, and Limitations
+PokeeResearch-7B is optimized for factual grounding and robustness, but limitations include:
+- Dependence on **external data quality** and **retrieval accuracy**
+- Potential **semantic bias** introduced by AI-based feedback signals
+- Limited coverage for **non-English** or **multi-modal** reasoning tasks
+- Risk of **hallucinated synthesis** when sources conflict or lack clarity
+### Recommendations
+Users should:
+- Cross-verify answers, especially in multi-hop reasoning cases
+- Monitor output for citation accuracy and alignment with source data
+- Refrain from using outputs as sole evidence in decision-critical contexts
+---
+## How to Get Started with the Model
+Refer to the repository README for instructions on using PokeeResearch-7B:
+[https://github.com/Pokee-AI/PokeeResearchOSS/blob/main/README.md](https://github.com/Pokee-AI/PokeeResearchOSS/blob/main/README.md)
+---
+## Training Details
+### Training Data
+- **Dataset:** MiroRL-GenQA dataset (MiroMind AI, 2025)
+- **Data characteristics:** Complex, multi-turn question–answer pairs requiring multi-step reasoning
+- **Data filtering:** No data from the evaluation benchmarks was used for training; the model was trained only on open-domain text Q&A samples
+### Training Procedure
+#### Preprocessing
+- Normalization and tokenization aligned with Qwen2.5 tokenizer
+- Structured prompt–response pairs in research/verification format (`<tool_call>`, `<answer>`, `<verification>`)
+#### Training Hyperparameters
+- **Algorithm:** RLOO (REINFORCE Leave-One-Out)
+- **Batch size:** 64
+- **Research threads per prompt:** 8
+- **Learning rate:** 3e-6
+- **Context limit:** 32,768 tokens
+- **Steps:** 140 fine-tuning iterations
+- **Regularization:** None (no entropy or KL regularization)
+- **Precision regime:** bf16 mixed precision
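The leave-one-out baseline behind RLOO can be illustrated with a short sketch. This is not the project's training code: it only shows, under the assumption of one scalar reward per research thread, how each of the 8 threads sampled for a prompt is scored against the mean reward of the other 7. The function name `rloo_advantages` and the example rewards are hypothetical.

```python
def rloo_advantages(rewards):
    """Leave-one-out advantages for the k threads sampled for one prompt."""
    k = len(rewards)
    if k < 2:
        raise ValueError("RLOO needs at least two samples per prompt")
    total = sum(rewards)
    # The baseline for thread i is the mean reward of the other k - 1 threads,
    # so a thread's own reward never enters its own baseline.
    return [r - (total - r) / (k - 1) for r in rewards]


# Hypothetical rewards for 8 research threads on a single prompt.
rewards = [1.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0]
advantages = rloo_advantages(rewards)
```

Threads that beat their peers receive a positive advantage and the rest a negative one, and the advantages for a prompt sum to zero. In a policy-gradient update, each advantage would weight the log-probability of its thread's tokens.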
+#### Reward Design
+- Combined reward signal from:
+  - **AI feedback** (semantic equivalence via external LLM judge)
+  - **Format adherence reward** (ensures correct agent behavior)
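The card does not state how the two reward terms are combined, so the following is only a sketch under explicit assumptions: the format rule (exactly one `<answer>...</answer>` block), the `format_weight` value, and the `toy_judge` stand-in for the external LLM judge are all hypothetical.

```python
import re

# Assumed format rule for this sketch: exactly one <answer>...</answer> block.
ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def extract_answer(response):
    """Return the answer text, or None if the response is malformed."""
    matches = ANSWER_RE.findall(response)
    return matches[0].strip() if len(matches) == 1 else None


def combined_reward(response, reference, judge, format_weight=0.1):
    """Format term plus judged-correctness term (the weighting is an assumption)."""
    answer = extract_answer(response)
    if answer is None:
        return 0.0  # malformed output earns nothing
    correct = 1.0 if judge(answer, reference) else 0.0
    return format_weight + (1.0 - format_weight) * correct


def toy_judge(answer, reference):
    """Placeholder for an LLM judge that decides semantic equivalence."""
    return answer.casefold() == reference.casefold()


good = combined_reward("<answer>Paris</answer>", "paris", toy_judge)
wrong = combined_reward("<answer>Lyon</answer>", "paris", toy_judge)
malformed = combined_reward("Paris", "paris", toy_judge)
```

In practice, `judge` would wrap a call to the external LLM judge. The exact-match placeholder is stricter than a semantic-equivalence check.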
+#### Speeds, Sizes, Times
+- **Model size:** 7 billion parameters
+- **Training duration:** ~5 days on 8 × A100 80GB GPUs
+- **Checkpoint size:** ~13 GB
+---
+## Evaluation
+### Testing Data, Factors & Metrics
+#### Testing Data
+10 open-domain research and QA benchmarks:
+- NQ, TriviaQA (TQ), PopQA, HotpotQA, 2WikiMultiHopQA (2WIKI), MuSiQue, Bamboogle (BAMB), GAIA, BrowseComp, Humanity’s Last Exam (HLE)
+#### Factors
+- Benchmarks differ by reasoning depth, retrieval dependence, and factual precision requirements.
+- Evaluations disaggregate by dataset difficulty and task type (single-hop vs multi-hop).
+#### Metrics
+- Mean accuracy (mean@4 across independent research threads) based on
+### Results
+**PokeeResearch-7B** (row **PR**) and its **RTS variant** (row **PR+**) outperform all 7B-scale baselines on each of the 10 benchmarks.
+Highlights (mean@4 accuracy):
+| **Method** | **HLE** | **GAIA** | **BrowseComp** | **BAMB** | **2WIKI** | **TQ** | **NQ** | **POPQA** | **MUSIQUE** | **HOTPOTQA** |
+|-------------|----------|-----------|----------------|-----------|-----------|----------|----------|-------------|---------------|----------------|
+| R1searcher | 5.4 | 8.3 | 1.0 | 63.2 | 61.4 | 77.2 | 59.6 | 51.8 | 35.8 | 62.4 |
+| SearchR1 | 13.0 | 18.7 | 0.4 | 67.8 | 62.8 | 81.0 | 67.6 | 59.6 | 33.2 | 63.2 |
+| ZeroSearch | 8.6 | 9.9 | 1.4 | 51.4 | 33.6 | 61.6 | 48.2 | 38.0 | 19.0 | 32.4 |
+| ASearcher | 13.8 | 22.1 | 3.2 | 68.8 | 69.2 | 85.2 | 71.2 | 58.2 | 35.8 | 71.0 |
+| DeepResearcher | 6.0 | 24.03 | 1.8 | 71.0 | 58.8 | 82.2 | 60.2 | 55.2 | 26.8 | 56.6 |
+| **PR** | **15.2** | **36.9** | **5.4** | **74.5** | **74.0** | **91.3** | **75.1** | **59.8** | **39.8** | **71.2** |
+| **PR+** | **17.6** | **41.3** | **8.4** | **75.0** | **75.0** | **91.8** | **75.0** | **60.0** | **41.4** | **71.6** |
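As a rough illustration of the mean@4 metric used in the table above, the sketch below averages correctness over four independent research threads per question and then over questions. The function name `mean_at_k` and the sample data are hypothetical, and the card does not specify how each thread's correctness is judged.

```python
def mean_at_k(correct_by_question, k=4):
    """Mean accuracy (in percent) over k independent threads per question."""
    if not correct_by_question:
        raise ValueError("no questions to score")
    per_question = []
    for flags in correct_by_question:
        if len(flags) != k:
            raise ValueError(f"expected {k} threads per question, got {len(flags)}")
        # Fraction of this question's threads that were judged correct.
        per_question.append(sum(flags) / k)
    return 100.0 * sum(per_question) / len(per_question)


# Hypothetical correctness flags: 3 questions, 4 threads each.
results = [
    [True, True, False, True],
    [False, False, False, False],
    [True, True, True, True],
]
score = mean_at_k(results)
```

Because every question has the same number of threads, this equals the average of the four per-thread accuracies.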
+#### Summary
+The PokeeResearch-7B variants achieve **state-of-the-art performance among 7B-scale open deep research agents**, validating the RLAIF and reasoning scaffold design for robust, verifiable research workflows.
+---
+## Model Examination
+The model’s **self-verification loop** prevents common reasoning errors by iteratively verifying its answers.
+Example walkthroughs in Appendix B of the paper show that incorrect responses are identified and corrected through self-evaluation cycles.
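The control flow of such a loop might look like the sketch below. It is an illustration only: the `research` and `verify` callables, the feedback string, and the `max_rounds` cap are hypothetical stand-ins for model calls, not the project's API.

```python
from typing import Callable, Optional, Tuple


def research_with_verification(
    question: str,
    research: Callable[[str, Optional[str]], str],
    verify: Callable[[str, str], Tuple[bool, str]],
    max_rounds: int = 3,
) -> str:
    """Alternate between answering and verifying until the check passes."""
    if max_rounds < 1:
        raise ValueError("max_rounds must be at least 1")
    feedback: Optional[str] = None
    answer = ""
    for _ in range(max_rounds):
        answer = research(question, feedback)      # draft, or redraft using feedback
        passed, feedback = verify(question, answer)
        if passed:
            break
    return answer  # last attempt, verified or not, once the budget is spent


# Toy stand-ins: the first draft is wrong and is fixed once feedback arrives.
def toy_research(question, feedback):
    return "Canberra" if feedback else "Sydney"


def toy_verify(question, answer):
    return answer == "Canberra", "Sydney is not the capital; re-check the source."


final = research_with_verification("What is the capital of Australia?", toy_research, toy_verify)
```

Capping the number of rounds keeps the loop bounded when verification never passes. A real agent would also need to decide what to return in that case.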
+---
+## Technical Specifications
+### Model Architecture and Objective
+- **Base Architecture:** Transformer decoder (Qwen2.5-7B-Instruct backbone)
+- **Objective:** Reinforcement learning with AI feedback to maximize semantic correctness and alignment with human-style reasoning
+### Compute Infrastructure
+#### Hardware
+- NVIDIA A100 80GB GPUs: 8 for training, 1 for inference
+#### Software
+- Framework: PyTorch + DeepSpeed
+- Training orchestrator: MiroRL (MiroMind Foundation)
+- Toolchain integration: Serper.dev and Jina Reader
+---
+## Citation
+**BibTeX:**
+```bibtex
+@article{pokee2025deepresearch,
+  title={PokeeResearch: Effective Deep Research via Reinforcement Learning from AI Feedback and Robust Reasoning Scaffold},
+  author={Yi Wan* and Jiuqi Wang* and Liam Li and Jinsong Liu and Ruihao Zhu and Zheqing Zhu},
+  journal={Pokee AI Technical Report},
+  year={2025},
+  url={https://github.com/Pokee-AI/PokeeResearchOSS}
+}
+```
+**APA:**
+Wan, Y., Wang, J., Li, L., Liu, J., Zhu, R., & Zhu, Z. (2025). *PokeeResearch: Effective Deep Research via Reinforcement Learning from AI Feedback and Robust Reasoning Scaffold.* Pokee AI.
+---
+## Glossary
+- **RLAIF:** Reinforcement Learning from AI Feedback – optimization using LLM-based reward signals.
+- **RLOO:** REINFORCE Leave-One-Out – unbiased policy gradient variant for on-policy learning.
+- **RTS:** Research Threads Synthesis – synthesis of multiple independent reasoning threads at inference time.
+---
+## More Information
+For technical details, visit: [https://github.com/Pokee-AI/PokeeResearchOSS](https://github.com/Pokee-AI/PokeeResearchOSS)
+For inquiries, contact: [email protected]
+---
+## Model Card Authors
+**Yi Wan**, **Jiuqi Wang**, Liam Li, Jinsong Liu, Ruihao Zhu, and Zheqing Zhu — Pokee AI Research Team
+## Model Card Contact
+Pokee AI Team — [email protected]