About Real-World Application

#2
by Ideal319 - opened

Hi! This work indeed demonstrates promising performance. However, I wonder whether the inference speed can truly meet practical requirements. Although the embedding dimension is reduced, the average embedding length is approximately 1000 times that of a bi-encoder architecture with pooling. As a result, the computational burden of the late interaction stage may be difficult to afford in practice—even without considering the storage overhead.
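To make the cost concern concrete, here is a minimal sketch (toy dimensions, random vectors, plain Python; the actual model's dimensions and similarity function may differ) contrasting ColBERT-style late interaction, which scores one query-token vector against every document vector, with a single pooled-vector comparison:

```python
# Sketch: late interaction (MaxSim) vs. pooled bi-encoder scoring.
# Illustrative only -- toy sizes and random "embeddings".
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def maxsim_score(query_vecs, doc_vecs):
    # Each query token takes its best match over all doc vectors, then sum:
    # O(len(query_vecs) * len(doc_vecs)) dot products per document scored.
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)

def mean_pool(vecs):
    return [sum(col) / len(vecs) for col in zip(*vecs)]

random.seed(0)
dim = 8
query = [[random.random() for _ in range(dim)] for _ in range(16)]  # 16 query tokens
doc = [[random.random() for _ in range(dim)] for _ in range(1024)]  # ~1000 doc vectors

# Late interaction: 16 * 1024 = 16384 dot products for this one document.
score = maxsim_score(query, doc)

# Bi-encoder with pooling: one vector per side -> a single dot product.
pooled_score = dot(mean_pool(query), mean_pool(doc))
```

The ~1000x gap in stored vectors per document translates directly into that per-document dot-product count at query time, which is the overhead being discussed.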

Hi! Indeed, we discussed such overhead and trade-offs in our technical report section 5: https://arxiv.org/abs/2507.05513


Does pooled retrieval not work for this model? It does for colpali, colqwen and colnomic models.

NVIDIA org


Hello, we fine-tuned this model with ColBERT-like late interaction rather than a pooling method, so using late interaction achieves the best performance.


I understand. We can still apply pooling at the vector-database level to make retrieval more efficient.
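One hypothetical way to do this database-side pooling (the thread does not specify a method; this sketch uses simple fixed-window mean pooling with a pool factor `k`, and the helper name `pool_doc_vectors` is invented for illustration) is to compress each document's token embeddings before indexing, trading some late-interaction fidelity for storage:

```python
# Hypothetical sketch: reduce stored vectors per document by mean-pooling
# every k consecutive token embeddings before inserting into the vector DB.
def pool_doc_vectors(doc_vecs, k=4):
    """Mean-pool each window of k consecutive token vectors into one vector."""
    pooled = []
    for i in range(0, len(doc_vecs), k):
        window = doc_vecs[i:i + k]
        pooled.append([sum(col) / len(window) for col in zip(*window)])
    return pooled

doc = [[float(i + j) for j in range(4)] for i in range(10)]  # 10 toy vectors, dim 4
compressed = pool_doc_vectors(doc, k=4)
assert len(compressed) == 3  # 3 stored vectors instead of 10
```

With pool factor `k`, storage and late-interaction compute both shrink roughly by a factor of `k`; how much retrieval quality this costs for this particular model would need to be measured.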
