OpenNMT with CTranslate2: the most efficient NMT system at WNGT 2020

For the second time, OpenNMT participated in the efficiency task of the WNGT 2020 workshop (previously WNMT 2018).

The goal of the task is to explore how accuracy (BLEU) and efficiency (speed, memory usage, model size) can be combined. Since the two are tightly coupled, competitive systems must sit on the Pareto frontier: for a given accuracy, no other system is more efficient.
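To make the Pareto criterion concrete, here is a minimal Python sketch of the domination test it implies; the (BLEU, words/second) values below are made up for illustration and are not task results:

```python
# A system dominates another if it is at least as good on every axis
# (here: BLEU and throughput) and strictly better on at least one.
# Systems that no other system dominates form the Pareto frontier.

def dominates(a, b):
    """Return True if system a dominates system b on (bleu, words_per_sec)."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

# Illustrative placeholder numbers only, not actual task results.
systems = {
    "A": (34.0, 10000),
    "B": (33.0, 8000),
    "C": (35.0, 2000),
}

frontier = [
    name for name, perf in systems.items()
    if not any(dominates(other, perf) for other in systems.values() if other is not perf)
]
print(frontier)  # ['A', 'C'] -- B is dominated by A (slower and lower BLEU)
```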

We submitted 4 systems that were evaluated on both the CPU and GPU tracks. The systems were trained with OpenNMT-tf and decoded with CTranslate2. We used several techniques:

  • Model distillation (also known as teacher-student training)
  • Optimized greedy decoding and caching
  • 8-bit quantization on CPU
  • Memory-efficient parallelism at the file level on CPU
  • Sorted and dynamic batches
  • Target vocabulary reduction

These techniques are described in the system description.
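To give a rough idea of how the inference-side techniques fit together, here is a minimal CTranslate2 decoding sketch in Python. The model path and tokenized inputs are placeholders, and the exact API may differ slightly between CTranslate2 versions:

```python
import ctranslate2

# Load a converted OpenNMT-tf model with 8-bit weights on CPU.
# (Quantization can also be requested at conversion time with the
# converter's --quantization int8 option.)
translator = ctranslate2.Translator(
    "ende_transformer_ct2/",  # placeholder path to a converted model
    device="cpu",
    compute_type="int8",
    inter_threads=4,          # translate several batches in parallel
)

# Pre-tokenized source sentences (placeholder SentencePiece-style tokens).
batch = [
    ["▁Hello", "▁world", "!"],
    ["▁This", "▁is", "▁a", "▁test", "."],
]

# beam_size=1 selects greedy decoding; batch_type="tokens" lets the
# runtime sort inputs by length and build batches with a roughly
# constant number of tokens.
results = translator.translate_batch(
    batch,
    beam_size=1,
    max_batch_size=1024,
    batch_type="tokens",
)

for result in results:
    print(" ".join(result.hypotheses[0]))
```

Greedy decoding (beam_size=1) removes the beam search bookkeeping entirely, while token-based batching keeps every batch close to the same computational cost and minimizes padding.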

The other systems in competition were:

  • University of Edinburgh (using Marian NMT)
  • NiuTrans

Results are presented at length in the overview paper, but here is a quick summary:

  • Almost all submitted systems are on the Pareto frontier, with remarkable numbers for the memory footprint (disk and RAM) and for Mw/USD (millions of words translated per USD).
  • As in 2018, the smallest engines reach the highest throughput on both CPU and GPU.

This task is now a rolling evaluation, meaning it will continue to accept new submissions! I will keep you informed if there is anything interesting.

The first update: we recently added FP16 support in CTranslate2, so I submitted a new GPU model to the task:

[Figure: GPU throughput in words per second vs. BLEU for the task submissions]

See the OpenNMT model at around 10k words/s and 34 BLEU? This is the new submission. It dominates 4 other models (both faster and higher quality).
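For completeness, here is a hedged sketch of what FP16 decoding looks like through the CTranslate2 Python API; the model path is a placeholder and a CUDA-capable GPU with FP16 support is assumed:

```python
import ctranslate2

# Same converted model, now loaded on GPU with FP16 compute.
# On recent GPUs this roughly halves the memory footprint and
# accelerates the matrix multiplications.
translator = ctranslate2.Translator(
    "ende_transformer_ct2/",  # placeholder path
    device="cuda",
    compute_type="float16",
)

results = translator.translate_batch(
    [["▁Hello", "▁world", "!"]],  # placeholder tokenized input
    beam_size=1,
)
print(" ".join(results[0].hypotheses[0]))
```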
