
I present a structured review of the evidence on whether BLEU is a valid evaluation technique, that is, whether BLEU scores correlate with the real-world utility of NLP systems and with user satisfaction; …
Although people refer to “the” BLEU score, BLEU is in fact a parameterized metric whose values can vary wildly with changes to these parameters. These parameters are often not reported or are hard …
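To make the point about parameters concrete, the following is a minimal sketch of a simplified, single-reference BLEU in which two settings can be varied. The function name `simple_bleu`, the choice of `max_n` and `lowercase` as the illustrated parameters, the whitespace tokenization, the absence of smoothing, and the example sentences are all assumptions made for illustration; the excerpt does not say which parameters it has in mind.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def simple_bleu(candidate, reference, max_n=4, lowercase=False):
    """Simplified sentence-level BLEU against one reference (illustrative only).

    Assumptions: whitespace tokenization, uniform n-gram weights, no smoothing
    (any n-gram order with zero matches makes the whole score 0.0).
    """
    if lowercase:
        candidate, reference = candidate.lower(), reference.lower()
    cand, ref = candidate.split(), reference.split()

    log_precision_sum = 0.0
    for n in range(1, max_n + 1):
        cand_counts, ref_counts = ngrams(cand, n), ngrams(ref, n)
        # Clipped matches: credit each n-gram at most as often as the reference has it.
        matches = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        total = sum(cand_counts.values())
        if matches == 0 or total == 0:
            return 0.0
        log_precision_sum += math.log(matches / total)

    # Brevity penalty: penalize candidates shorter than the reference.
    bp = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * math.exp(log_precision_sum / max_n)


candidate = "The cat sat on the mat ."
reference = "the cat sat on a mat ."

scores = {
    "cased, max_n=4": simple_bleu(candidate, reference, max_n=4, lowercase=False),
    "lowercased, max_n=4": simple_bleu(candidate, reference, max_n=4, lowercase=True),
    "lowercased, max_n=2": simple_bleu(candidate, reference, max_n=2, lowercase=True),
}
for setting, score in scores.items():
    print(f"{setting}: {score:.3f}")
```

The same candidate and reference receive three different scores here, which is why a bare "BLEU" figure is hard to interpret unless the settings behind it are reported.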
It has been issued as a Research Report for early dissemination of its contents. In view of the transfer of copyright to the outside publisher, its distribution outside of IBM prior to publication should be limited …
Size: Included at least 5 NLP systems if BLEU scores are computed at the system level, or at least 5 texts if scores are computed at the text level. 5 data points is the minimum needed to be able to have …
The primary programming task for a BLEU implementor is to compare n-grams of the candidate with the n-grams of the reference translation and count the number of matches.
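The counting step described above can be sketched as follows. This is a minimal illustration, assuming whitespace-tokenized input and a single reference; the helper names `ngrams` and `clipped_matches` are hypothetical. Clipping (crediting each n-gram at most as many times as it occurs in the reference) is not mentioned in the sentence above; it is included because BLEU's modified n-gram precision relies on it.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def clipped_matches(candidate, reference, n):
    """Count candidate n-grams that also occur in the reference.

    Each distinct n-gram is credited at most as many times as it appears in
    the reference, so repeating a matching word does not inflate the count.
    Returns (number of matches, total number of candidate n-grams).
    """
    cand_counts = ngrams(candidate, n)
    ref_counts = ngrams(reference, n)
    matches = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
    return matches, sum(cand_counts.values())


candidate = "the the the cat".split()
reference = "the cat sat on the mat".split()

unigram = clipped_matches(candidate, reference, 1)
bigram = clipped_matches(candidate, reference, 2)
print(unigram)  # (3, 4): "the" is credited twice, not three times, plus "cat"
print(bigram)   # (1, 3): only ("the", "cat") matches
```

Dividing the first element by the second gives the modified precision for that n-gram order; with several references, the clip would instead use the maximum count of each n-gram across the references.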
In this paper, we introduce TensorBLEU, a novel implementation of the BLEU metric designed from the ground up for this specific use case. Our approach is fully vectorized for GPU-accelerated, per …
In this paper, we provide an experimental analysis of each component of BLEU, aiming to design better evaluation metrics for sentence-level MT evaluation and MT system tuning with a single reference.