Model Card for PokeeResearch

Model Details

Model Description

PokeeResearch-7B is a 7-billion-parameter deep research agent developed by Pokee AI to advance reliable, aligned, and scalable research-grade reasoning in tool-augmented LLMs.
The model integrates Reinforcement Learning from AI Feedback (RLAIF) with a robust reasoning scaffold, enabling it to conduct complex, multi-step research workflows that include self-correction, verification, and synthesis across multiple independent research threads.

  • Developed by: Pokee AI
  • Model type: Tool-augmented large language model (LLM) research agent
  • Language(s): English, Chinese and many more
  • License: Apache 2.0
  • Finetuned from model: Qwen2.5-7B-Instruct

Model Sources


Uses

Direct Use

PokeeResearch-7B is designed for deep research automation, where the model autonomously:

  • Decomposes complex user queries
  • Retrieves and reads from external sources
  • Synthesizes factual, verifiable, and grounded answers

It can be used as a standalone research assistant or integrated into multi-agent systems to support academic, enterprise, or product-level research tasks.

Downstream Use

PokeeResearch-7B can be fine-tuned or extended for:

  • Domain-specific scientific discovery
  • Autonomous document retrieval and synthesis
  • Multi-source verification and summarization pipelines
  • Integration into reinforcement learning research agents (RLHF/RLAIF frameworks)

Out-of-Scope Use

The model should not be used for:

  • Generating unverified or speculative claims
  • Automated decision-making in high-stakes domains (medical, legal, or financial)
  • Applications requiring strict factual precision without external verification
  • Generating content without citation or evidence tracing

Bias, Risks, and Limitations

PokeeResearch-7B is optimized for factual grounding and robustness, but limitations include:

  • Dependence on external data quality and retrieval accuracy
  • Potential semantic bias introduced by AI-based feedback signals
  • Limited coverage for non-English or multi-modal reasoning tasks
  • Risk of hallucinated synthesis when sources conflict or lack clarity

Recommendations

Users should:

  • Cross-verify answers, especially in multi-hop reasoning cases
  • Monitor output for citation accuracy and alignment with source data
  • Refrain from using outputs as sole evidence in decision-critical contexts

How to Get Started with the Model

For usage instructions, see the README in the PokeeResearch codebase: https://github.com/Pokee-AI/PokeeResearchOSS/blob/main/README.md


Training Details

Training Data

  • Dataset: MiroRL-GenQA dataset (MiroMind AI, 2025)
  • Data characteristics: Complex, multi-turn question–answer pairs requiring multi-step reasoning
  • Data filtering: No benchmark test data was used for training; the model was trained only on open-domain text Q&A samples

Training Procedure

Preprocessing

  • Normalization and tokenization aligned with Qwen2.5 tokenizer
  • Structured prompt–response pairs in research/verification format (<tool_call>, <answer>, <verification>)

Training Hyperparameters

  • Algorithm: RLOO (REINFORCE Leave-One-Out)
  • Batch size: 64
  • Research threads per prompt: 8
  • Learning rate: 3e-6
  • Context limit: 32,768 tokens
  • Steps: 140 fine-tuning iterations
  • Regularization: None (no entropy or KL regularization)
  • Precision regime: bf16 mixed precision
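To make the RLOO setting concrete, here is a minimal sketch of the leave-one-out advantage computation for one prompt with 8 research threads. The function name and the reward values are hypothetical and chosen only for illustration; the actual training code may differ.

```python
from typing import List


def rloo_advantages(rewards: List[float]) -> List[float]:
    """Leave-one-out advantages for k samples drawn for the same prompt.

    The baseline for sample i is the mean reward of the other k - 1 samples.
    """
    k = len(rewards)
    if k < 2:
        raise ValueError("RLOO needs at least two samples per prompt")
    total = sum(rewards)
    return [r - (total - r) / (k - 1) for r in rewards]


# Eight research threads for one prompt (made-up rewards, not real training data).
rewards = [1.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0]
advantages = rloo_advantages(rewards)
print(advantages)
```

Threads that score above the average of their siblings receive a positive advantage and the rest a negative one, so no separate value network is needed for the baseline.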

Reward Design

  • Combined reward signal from:
    • AI feedback (semantic equivalence via external LLM judge)
    • Format adherence reward (ensures correct agent behavior)
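The two reward components could be combined along the following lines. This is only a sketch under stated assumptions: the format rule (exactly one <answer> block), the 0.1 format weight, and the string-matching stand-in for the external LLM judge are all hypothetical, since the card does not specify how the signals are weighted.

```python
import re
from typing import Callable

ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def format_ok(response: str) -> bool:
    """Hypothetical format rule: exactly one <answer>...</answer> block."""
    return len(ANSWER_RE.findall(response)) == 1


def combined_reward(
    response: str,
    reference: str,
    judge: Callable[[str, str], bool],
    format_weight: float = 0.1,  # assumed weight, not taken from the card
) -> float:
    """Format reward plus a correctness reward from a judge callable."""
    if not format_ok(response):
        return 0.0
    answer = ANSWER_RE.search(response).group(1).strip()
    correct = 1.0 if judge(answer, reference) else 0.0
    return format_weight + (1.0 - format_weight) * correct


def toy_judge(answer: str, reference: str) -> bool:
    """Stand-in for the external LLM judge: case-insensitive exact match."""
    return answer.lower() == reference.lower()


print(combined_reward("<answer> Paris </answer>", "paris", toy_judge))
print(combined_reward("<answer>Lyon</answer>", "paris", toy_judge))
print(combined_reward("Paris", "paris", toy_judge))
```

In practice the judge would be a call to an LLM that decides whether the answer is semantically equivalent to the reference; the callable interface keeps that choice separate from the reward arithmetic.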

Speeds, Sizes, Times

  • Model size: 7 billion parameters
  • Training duration: ~5 days on 8 × A100 80G GPUs
  • Checkpoint size: ~13 GB

Evaluation

Testing Data, Factors & Metrics

Testing Data

10 open-domain research and QA benchmarks:

  • NQ, TriviaQA, PopQA, HotpotQA, 2WikiMultiHopQA, MuSiQue, Bamboogle, GAIA, BrowseComp, Humanity’s Last Exam

Factors

  • Benchmarks differ by reasoning depth, retrieval dependence, and factual precision requirements.
  • Evaluations disaggregate by dataset difficulty and task type (single-hop vs multi-hop).

Metrics

  • Mean accuracy (mean@4 across independent research threads) based on
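As an illustration of the mean@4 metric, the sketch below averages per-question accuracy over 4 independent research threads and then averages across questions. The function name, question IDs, and outcomes are hypothetical, and how each thread's answer is judged correct is left abstract here.

```python
from typing import Dict, List


def mean_at_k(correct: Dict[str, List[bool]], k: int = 4) -> float:
    """Mean accuracy (in percent) over k independent threads per question."""
    per_question = []
    for qid, outcomes in correct.items():
        if len(outcomes) != k:
            raise ValueError(f"question {qid!r} has {len(outcomes)} threads, expected {k}")
        # Fraction of the k threads that answered this question correctly.
        per_question.append(sum(outcomes) / k)
    return 100.0 * sum(per_question) / len(per_question)


# Made-up outcomes for three questions, four threads each.
outcomes = {
    "q1": [True, True, False, True],
    "q2": [False, False, False, False],
    "q3": [True, True, True, True],
}
score = mean_at_k(outcomes)
print(round(score, 2))
```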

Results

PokeeResearch-7B (RTS variant) and PokeeResearch-7B outperform all baselines at 7B scale across 10 benchmarks.
Highlights (mean@4 accuracy):

Method          HLE   GAIA   BrowseComp  BAMB  2WIKI  TQ    NQ    POPQA  MUSIQUE  HOTPOTQA
R1searcher      5.4   8.3    1.0         63.2  61.4   77.2  59.6  51.8   35.8     62.4
SearchR1        13.0  18.7   0.4         67.8  62.8   81.0  67.6  59.6   33.2     63.2
ZeroSearch      8.6   9.9    1.4         51.4  33.6   61.6  48.2  38.0   19.0     32.4
ASearcher       13.8  22.1   3.2         68.8  69.2   85.2  71.2  58.2   35.8     71.0
DeepResearcher  6.0   24.03  1.8         71.0  58.8   82.2  60.2  55.2   26.8     56.6
PR              15.2  36.9   5.4         74.5  74.0   91.3  75.1  59.8   39.8     71.2
PR+             17.6  41.3   8.4         75.0  75.0   91.8  75.0  60.0   41.4     71.6

Summary

PokeeResearch-7B variants achieve state-of-the-art performance among 7B-scale open deep research agents, validating the RLAIF and reasoning scaffold design for robust, verifiable research workflows.


Technical Specifications

Model Architecture and Objective

  • Base Architecture: Transformer decoder (Qwen2.5-7B-Instruct backbone)
  • Objective: Reinforcement learning with AI feedback to maximize semantic correctness and alignment with human-style reasoning

Compute Infrastructure

Hardware

  • NVIDIA A100 80GB GPUs: ×8 for training and ×1 for inference

Citation

BibTeX:

@article{pokee2025deepresearch,
  title={PokeeResearch: Effective Deep Research via
          Reinforcement Learning from AI Feedback and Robust Reasoning Scaffold},
  author={Yi Wan* and Jiuqi Wang* and Liam Li
          and Jinsong Liu and Ruihao Zhu and Zheqing Zhu},
  journal={Pokee AI Technical Report},
  year={2025},
  url={https://arxiv.org/pdf/2510.15862}
}

APA: Wan, Y., Wang, J., Li, L., Liu, J., Zhu, R., & Zhu, Z. (2025). PokeeResearch: Effective Deep Research via Reinforcement Learning from AI Feedback and Robust Reasoning Scaffold. Pokee AI.


Glossary

  • RLAIF: Reinforcement Learning from AI Feedback – optimization using LLM-based reward signals.
  • RLOO: REINFORCE Leave-One-Out – unbiased policy gradient variant for on-policy learning.
  • RTS: Research Threads Synthesis – synthesis of multiple independent reasoning threads at inference time.

More Information

For technical details, visit: https://github.com/Pokee-AI/PokeeResearchOSS
For inquiries, contact: [email protected]


Model Card Authors

Yi Wan, Jiuqi Wang, Liam Li, Jinsong Liu, Ruihao Zhu, and Zheqing Zhu — Pokee AI Research Team

Model Card Contact

Pokee AI Team — [email protected]
