I am experimenting with English-French machine translation using Transformers. The dataset consists of 16M parallel sentences. After applying BPE, my vocabulary size comes down to roughly 92,000 for both English and French. I am training on 2 parallel GPUs with a batch size of 4096 per GPU.
I have two questions:
- One epoch with the above data should take 16M / 8192 ≈ 1953 steps, since the effective batch size across both GPUs is 2 × 4096 = 8192. (Please correct me if I am wrong; see the sketch after this list.)
- How many epochs are typically sufficient to achieve a respectable BLEU score?
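
For reference, here is a minimal sketch of the steps-per-epoch arithmetic. It assumes the batch size of 4096 is counted in sentences per GPU; note that many Transformer MT setups (including the original "Attention Is All You Need" configuration) instead specify the batch as a token budget, in which case steps per epoch would be total tokens divided by the effective token budget, not this figure.

```python
# Steps-per-epoch arithmetic, assuming 4096 means sentences per GPU
# per step (an assumption; token-based batching changes the numbers).

num_sentences = 16_000_000   # parallel sentence pairs in the dataset
batch_per_gpu = 4096         # sentences per GPU per step (assumption)
num_gpus = 2

effective_batch = batch_per_gpu * num_gpus          # 8192 sentences/step
steps_per_epoch = num_sentences / effective_batch   # ~1953 steps

print(f"Effective batch size: {effective_batch}")
print(f"Steps per epoch: {steps_per_epoch:.0f}")
```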
Many thanks in advance.