TensorBLEU - GPU-based vectorized BLEU score for in-training optimization
Today I published my next paper, introducing TensorBLEU: Vectorized GPU-based BLEU Score Implementation for Per-Sentence In-Training Evaluation (2510.05485), an optimization dedicated to Reinforcement Learning rewards based on BLEU score. It achieves over a 10x speedup over NLTK's version on a small T4 GPU, and even a 40x speedup on an A100 GPU.
It's not exactly linguistically correct BLEU, because it's based on token IDs rather than text n-grams. That's a conscious choice: it skips computationally expensive token decoding when the score serves only as a reward signal. This was previously possible with NLTK's `sentence_bleu`, but it required moving tensors with token IDs from GPU to CPU, converting them to lists and computing the scores in a Python loop, creating a significant performance bottleneck.
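For illustration, here's a minimal sketch of that baseline pattern, assuming padded ID tensors of shape (batch, seq_len); real code would also strip padding tokens before scoring:

```python
import torch
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def nltk_batch_bleu(hyp_ids: torch.Tensor, ref_ids: torch.Tensor) -> torch.Tensor:
    """Slow baseline: per-sentence BLEU on token IDs via NLTK."""
    smooth = SmoothingFunction().method1
    hyps = hyp_ids.tolist()  # device-to-host copy + Python list conversion
    refs = ref_ids.tolist()
    scores = [
        sentence_bleu([ref], hyp, smoothing_function=smooth)  # Python loop over the batch
        for hyp, ref in zip(hyps, refs)
    ]
    return torch.tensor(scores, device=hyp_ids.device)  # back to GPU for the reward
```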
In our case, at Reactive AI (ReactiveAI), we use BLEU as part of the reward in Memory Reinforcement Learning (MRL) of Reactive Transformer models (Reactive Transformer (RxT) - Stateful Real-Time Processing for Event-Driven Reactive Language Models (2510.03561)), combined with cosine similarity. To rate memory quality, we calculate BLEU and cosine similarity between the generated answer and the reference answer from the dataset, as well as between the generated answer and the previous interaction(s), to ensure that the current answer carries some information from previous time-steps. Cosine similarity is calculated on GPU, but the BLEU calculation with NLTK has to be performed on CPU, with a lot of data movement and conversion. When a whole episode (generating a batch of answers, memory updates and reward calculation) takes e.g. 6 seconds, even 0.5s for the reward is noticeable, so we decided to optimize it.
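A rough sketch of that reward shape, with hypothetical function names, mixing coefficients and weights (the actual MRL reward in rxlm may differ):

```python
import torch
import torch.nn.functional as F

def mrl_reward(gen_emb: torch.Tensor, ref_emb: torch.Tensor, prev_emb: torch.Tensor,
               bleu_ref: torch.Tensor, bleu_prev: torch.Tensor,
               w_ref: float = 0.5, w_prev: float = 0.5) -> torch.Tensor:
    # Semantic similarity terms, computed entirely on GPU.
    cos_ref = F.cosine_similarity(gen_emb, ref_emb, dim=-1)
    cos_prev = F.cosine_similarity(gen_emb, prev_emb, dim=-1)
    # Mix each similarity with the matching BLEU term (0.5/0.5 split is illustrative).
    r_ref = 0.5 * (cos_ref + bleu_ref)     # generated answer vs. reference answer
    r_prev = 0.5 * (cos_prev + bleu_prev)  # generated answer vs. previous interaction(s)
    return w_ref * r_ref + w_prev * r_prev
```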
The TensorBLEU calculation is performed on GPU for the whole batch, at the sentence or corpus level - `tensor_sentence_bleu` or `tensor_corpus_bleu` from `rxlm.metrics.tensorbleu` (https://github.com/RxAI-dev/rxlm).
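A usage sketch; the exact call signature is my assumption, so check the repo for the real API:

```python
import torch
from rxlm.metrics.tensorbleu import tensor_sentence_bleu

# Token-ID tensors for a batch of generated and reference answers (dummy data here).
hyp_ids = torch.randint(0, 50_000, (32, 128), device="cuda")
ref_ids = torch.randint(0, 50_000, (32, 128), device="cuda")

# Assumed call shape: one BLEU score per sequence, computed entirely on the GPU.
per_sentence_bleu = tensor_sentence_bleu(hyp_ids, ref_ids)
```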
Please check the paper and upvote it if you like it :)