I am using a Transformer model to train on our dataset (5M sentences), but the total number of trainable parameters is 139M. In the "Attention Is All You Need" paper, the model has only 65M parameters. Why does mine have so many?
Could someone explain this to me?
Can you be more specific about the OpenNMT version you are using and your training options?
I use OpenNMT-tf with the TransformerAAN model, num_layers = 2, and the defaults for all other options. I use tensorflow-gpu version 1.12.
What is the size of your vocabulary?
Additionally, the TransformerAAN model is not the one used in Google's paper.
My vocab was built from about 6M sentences. I will try Transformer-base.
I am using the Transformer model. It runs with 150663034 trainable parameters; source vocab: 54569, target vocab: 76665.
Sounds about right. Is it an issue?
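For anyone wondering where the parameters go, here is a rough back-of-the-envelope count. It is only a sketch under assumptions that are not stated in this thread: base hyperparameters (model dimension 512, feed-forward dimension 2048, 6 layers), separate source and target embeddings, an output layer that is not tied to the target embeddings, biases on every dense layer, one final layer normalization per stack, and one extra out-of-vocabulary entry added to each vocabulary. The function name `transformer_param_count` is made up for this example.

```python
def transformer_param_count(src_vocab, tgt_vocab, d_model=512, d_ff=2048,
                            num_layers=6, oov_buckets=1):
    """Rough parameter count for an encoder-decoder Transformer.

    Assumes untied embeddings, biases on all dense layers, and one
    final layer norm per stack (assumptions, not confirmed settings).
    """
    # Assumption: each vocabulary gets extra out-of-vocabulary entries.
    src_size = src_vocab + oov_buckets
    tgt_size = tgt_vocab + oov_buckets

    # Query, key, value and output projections, each with a bias.
    attention = 4 * (d_model * d_model + d_model)
    # Two dense layers: d_model -> d_ff -> d_model, each with a bias.
    feed_forward = (d_model * d_ff + d_ff) + (d_ff * d_model + d_model)
    # Layer norm has a scale and an offset vector.
    layer_norm = 2 * d_model

    # Encoder layer: self-attention + feed-forward, one norm each.
    encoder_layer = attention + feed_forward + 2 * layer_norm
    # Decoder layer: self-attention + cross-attention + feed-forward.
    decoder_layer = 2 * attention + feed_forward + 3 * layer_norm

    counts = {
        "source_embeddings": src_size * d_model,
        "target_embeddings": tgt_size * d_model,
        "output_projection": tgt_size * d_model + tgt_size,
        "encoder": num_layers * encoder_layer + layer_norm,
        "decoder": num_layers * decoder_layer + layer_norm,
    }
    counts["total"] = sum(counts.values())
    return counts


# Vocabulary sizes reported above in this thread.
counts = transformer_param_count(54569, 76665)
for name, value in counts.items():
    print(f"{name:>18}: {value:>12,}")
# total → 150,663,034
```

Under these assumptions the total matches the 150663034 reported above, which suggests the assumptions are close, though it does not prove the actual layer layout. About 106.5M of the parameters sit in the two embedding tables and the output layer, and only about 44M in the encoder and decoder stacks. The paper's base model used a shared vocabulary of about 37,000 tokens with weights shared between the embeddings and the output layer, which is why its total is so much smaller.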