Beyond Semantic Similarity: Introducing NVIDIA NeMo Retriever’s Generalizable Agentic Retrieval Pipeline
We are thrilled to announce that NVIDIA NeMo Retriever team has developed a new agentic retrieval pipeline that has officially secured the #1 spot on the ViDoRe v3 pipeline leaderboard. In addition, this exact same pipeline architecture achieved the #2 spot on the highly demanding, reasoning-intensive BRIGHT leaderboard.
In the rapidly evolving landscape of AI retrieval, many solutions are highly specialized, engineered to perform exceptionally well on specific, narrow tasks. However, real-world enterprise applications rarely have the luxury of perfectly curated, single-domain data. They require systems that can seamlessly adapt to a wide variety of challenges—from parsing complex visual layouts to executing deep logical reasoning.
That is why we prioritized generalizability in our design. Rather than relying on dataset-specific heuristics, we built an agentic pipeline that dynamically adapts its search and reasoning strategy to the data at hand. This allows us to deliver state-of-the-art performance across vastly different benchmarks without requiring any underlying architectural changes.
Here is a look at how we built it.
The Motivation: Why Semantic Similarity Isn't Enough
For years, dense retrieval based on semantic similarity has been the standard for finding information. However, as the applications of retrieval expand, finding relevant documents goes beyond semantic similarity alone. Complex document search requires reasoning skills, an understanding of real-world systems, and iterative exploration.
There is a fundamental gap: LLMs are great at thinking and reasoning but cannot process millions of documents at once. Conversely, retrievers can easily sift through millions of documents but possess limited reasoning skills. Agentic retrieval bridges this gap by creating an active, iterative loop between the LLM and the retriever.
How It Works: The Agentic Loop
Agentic retrieval pipeline overview
Our agentic retrieval pipeline relies on a ReACT architecture. Instead of a single "one-and-done" query, the agent iteratively searches, evaluates, and refines its approach.
The agent utilizes built-in tools like think to plan its approach and final_results to output the exact documents needed for a given query, alongside a retrieve (query, top_k) tool to explore the corpus. Through this loop, we observed successful search patterns emerge naturally:
- Generating better queries: The agent dynamically adjusts its search queries based on newly discovered information.
- Persistent rephrasing: It continually rephrases queries until useful information is found.
- Breaking down complexity: It translates complex, multi-part queries into multiple simpler queries with clear goals.
Finally, to synthesize the iterative discoveries, the agent calls a final_results tool to output the most relevant documents, ranked by their relevance to a given query. As a safety net—for example, when the agent hits the maximum number of steps or the context length limit—the pipeline falls back to Reciprocal Rank Fusion (RRF), which scores documents based on their ranks across all retrieval attempts in the agent trajectory.
Engineering for Speed and Scale
Agentic workflows are notoriously slow and resource-intensive. To make this pipeline viable for leaderboard-scale evaluation, we had to rethink how the LLM agent and the retriever communicate.
Initially, the retriever was exposed to the agent via a Model Context Protocol (MCP) server—a natural choice, since MCP is designed precisely for giving LLMs access to external tools. But in practice, this architecture imposed a compounding tax on experiment velocity. Every run required spinning up a separate MCP server, loading the right dataset corpus into GPU memory, and orchestrating the lifecycle of both client and server. Each of these steps was an opportunity for silent misconfiguration or a server freeze under high-volume requests. The network round-trips added latency to every retrieval call, and the overall cognitive burden of managing the two-process setup made it significantly harder for other teams to adopt and iterate on the pipeline.
To resolve this, we replaced the MCP server with a thread-safe singleton retriever that lives in-process. The singleton loads the model and corpus embeddings once, protects all access with a reentrant lock, and exposes the same retrieve() interface to arbitrarily many concurrent agent tasks—achieving the key benefit of an MCP server (safe, shared access to a GPU-resident retriever from multiple threads) without incurring network serialization overhead or requiring a separate server process. This single architectural change eliminated an entire class of deployment errors and dramatically improved both GPU utilization and experiment throughput.
Generalization vs. Specialization Across Benchmarks
A common observation in modern retrieval evaluation is that solutions highly optimized for one specific type of task often experience a performance gap when applied to a completely different domain.
| Pipeline | ViDoRe v3 |
|---|---|
| NeMo Agentic Retrieval (Opus 4.5 + nemotron-colembed-vl-8b-v2) | 69.22 (#1) |
| Dense retrieval (nemotron-colembed-vl-8b-v2) | 64.36 |
| INF-X-Retriever (INF-Query-Aligner + nemotron-colembed-vl-8b-v2) | 62.31 |
| INF-X-Retriever | 51.01 |
| Pipeline | BRIGHT |
|---|---|
| INF-X-Retriever | 63.40 (#1) |
| NeMo Agentic Retrieval (Opus 4.5 + nemotron-reasoning-3b) | 50.90 (#2) |
We placed #2 on the reasoning-intensive BRIGHT leaderboard with an NDCG@10 of 50.90. The #1 solution on that leaderboard, INF-X-Retriever, achieves an impressive 63.40. However, to test cross-domain adaptability, we benchmarked the INF-X pipeline (coupled with the same nemotron-colembed-vl-8b-v2 embedding model used in our agentic pipeline) on ViDoRe v3, a dataset focusing on visually rich and diverse enterprise documents. On this different task, its performance landed at an NDCG@10 of 62.31, lower than the dense retrieval score 64.36. In other words, INF-Query-Aligner does not improve over the dense retrieval baseline on ViDoRe v3.
In contrast, our same agentic pipeline (pairing Opus 4.5 with nemotron-colembed-vl-8b-v2) achieved the #1 spot on ViDoRe v3 with a score of 69.22.
This highlights a core strength of our approach: generalizability. Rather than relying on dataset-specific heuristics or a query-rewriter/aligner, our agentic loop naturally adapts its strategy to the dataset at hand, whether it requires multi-step logical reasoning or parsing complex visual layouts.
Ablation Studies: Open vs. Closed Models
ViDoRe v3
| Agent | Embedding Model | NDCG @10 | Average sec/query | Total input tokens (M) | Total output tokens (M) | Average retrieval calls |
|---|---|---|---|---|---|---|
| Opus 4.5 | nemotron-colembed-vl-8b-v2 | 69.22 | 136.3 | 1837 | 15 | 9.2 |
| gpt-oss-120b | nemotron-colembed-vl-8b-v2 | 66.38 | 78.6 | 1860 | 13 | 2.4 |
| gpt-oss-120b | llama-nemotron-embed-vl-1b-v2 | 62.42 | 78.1 | 1459 | 13 | 2.5 |
| - | nemotron-colembed-vl-8b-v2 | 64.36 | 0.67 | - | - | - |
| - | llama-nemotron-embed-vl-1b-v2 | 55.83 | 0.02 | - | - | - |
BRIGHT
| Agent | Embedding Model | NDCG @10 | Average sec/query | Total input tokens (M) | Total output tokens (M) | Average retrieval calls |
|---|---|---|---|---|---|---|
| Opus 4.5 | llama-embed-nemotron-reasoning-3b | 50.79 | 148.2 | 1251 | 11 | 11.8 |
| gpt-oss-120b | llama-embed-nemotron-reasoning-3b | 41.27 | 92.8 | 1546 | 11 | 4.5 |
| gpt-oss-120b | llama-nemotron-embed-vl-1b-v2 | 33.85 | 139.1 | 1516 | 12 | 6.6 |
| - | llama-embed-nemotron-reasoning-3b | 38.28 | 0.11 | - | - | - |
| - | llama-nemotron-embed-vl-1b-v2 | 19.56 | 0.08 | - | - | - |
We conducted extensive ablations to understand the tradeoff between frontier closed models and open-weights alternatives:
- Model Choice: On ViDoRe v3, swapping Opus 4.5 for the open
gpt-oss-120bresulted in a small accuracy drop (69.22 to 66.38 NDCG@10) and makes much fewer retrieval calls. On BRIGHT, the gap was wider, indicating that deeper reasoning tasks still benefit heavily from frontier models like Opus 4.5. - Embeddings: Pairing the agent with specialized embeddings (
nemotron-colembed-vl-8b-v2for ViDoRe andllama-embed-nemotron-reasoning-3bfor BRIGHT) yielded the best results, proving that a strong baseline retriever provides a higher ceiling for the agent to reach. - It's also interesting to note that the agent can close the gap between stronger and weaker embedding models. For example, on ViDoRe, the gap between the stronger
nemotron-colembed-vl-8b-v2and the weakerllama-nemotron-embed-vl-1b-v2is about 8.5 in dense retrieval, but when coupled withgpt-oss-120bagent, the gap shrinks to about 4. Similarly,llama-embed-nemotron-reasoning-3bis about 19 points better thanllama-nemotron-embed-vl-1b-v2on BRIGHT, but the lead shrinks to about 7.5 when coupled withgpt-oss-120bagent.
The Cost of Autonomy and What's Next
There is no free lunch. Agentic retrieval is more expensive and slower than standard dense retrieval. Looking at our ViDoRe v3 results, the agent averages 136 seconds per query and consumes roughly 760k input and 6.3k output tokens per query. (Note: the sequential latency is measured on a single A100 GPU with a single concurrent Claude API call—i.e., not searching multiple queries at the same time—to reflect true search time in real-world use cases).
However, we believe agentic retrieval is a highly viable approach for high-stakes, complex queries. Our immediate next steps focus on cost reduction: we are actively researching ways to distill these agentic reasoning patterns into smaller, specialized open-weight agents. By fine-tuning smaller models to orchestrate the think and retrieve loop natively, we aim to deliver Opus-level accuracy at a fraction of the latency and cost.
Build Your Own Agentic Pipeline
While our leaderboard-topping runs explored combinations like Claude Opus and gpt-oss alongside our research embedding models, the true strength of this architecture is its modularity. For production-ready deployments, we highly encourage you to try pairing your agent of choice with our robust commercial embedding model llama-nemotron-embed-vl-1b-v2. To explore these models, dive into the tools, and start building your own highly generalizable retrieval workflows, visit the NeMo Retriever library today.
