How to use BERT embeddings in an OpenNMT seq2seq model

Originally I tried a seq2seq model (GloVe embeddings + RNN encoder-decoder + copy generator) on a Text2SQL task with OpenNMT, and everything worked fine. I get an accuracy of ~60% on the GeoQuery benchmark, the cross-entropy on the training set drops to as low as 0.10, and token-level accuracy on the training set exceeds 90%.

When I add a BERT encoder and replace the GloVe embeddings with the last-layer output of BERT on the encoder side, the model seems to learn nothing during training. Token-level training accuracy never reaches 90%, and the cross-entropy stays around 0.3. During inference, the model predicts unreasonable SQL and barely achieves 1% accuracy on the test set.
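
For reference, the encoder change looks roughly like this. This is a minimal sketch rather than my exact code: the class name, hidden size, and freezing choice are illustrative, and I am using the older pytorch_pretrained_bert package since that is where BertAdam comes from.

```python
import torch
import torch.nn as nn
from pytorch_pretrained_bert import BertModel

class SourceEncoder(nn.Module):
    """Sketch: BERT last-layer output replaces GloVe embeddings before the LSTM."""
    def __init__(self, hidden_size=512):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        # BERT-base hidden size is 768; the LSTM consumes it directly.
        self.rnn = nn.LSTM(768, hidden_size, batch_first=True, bidirectional=True)

    def forward(self, input_ids, attention_mask):
        # With output_all_encoded_layers=False, only the last layer is returned.
        last_layer, _ = self.bert(
            input_ids,
            attention_mask=attention_mask,
            output_all_encoded_layers=False,
        )
        # One thing worth checking: whether BERT should be fine-tuned or frozen.
        # last_layer = last_layer.detach()  # uncomment to freeze BERT
        memory_bank, (h, c) = self.rnn(last_layer)
        return memory_bank, (h, c)
```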

I have investigated this issue for quite a long time. I double-checked my optimizers; I use two different ones (Adam with a learning rate of 1e-3 for the parameters in the LSTM part, and BertAdam with a learning rate of 1e-5 for the BERT part). For the encoding part, I copied the code directly from a published GitHub repo.
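
The optimizer setup is roughly the following. Again a sketch under assumptions, not my exact code: the model is the SourceEncoder from the sketch above, and num_train_steps / warmup are placeholder values.

```python
import torch
from pytorch_pretrained_bert.optimization import BertAdam

model = SourceEncoder()  # hypothetical encoder from the sketch above
num_train_steps = 10000  # assumed; set from dataset size and epochs

bert_params  = [p for n, p in model.named_parameters() if n.startswith("bert.")]
other_params = [p for n, p in model.named_parameters() if not n.startswith("bert.")]

optim_lstm = torch.optim.Adam(other_params, lr=1e-3)
optim_bert = BertAdam(bert_params, lr=1e-5, warmup=0.1, t_total=num_train_steps)

# Both optimizers must be zeroed and stepped every iteration; missing one of
# them is the kind of bug that makes the model look like it learns nothing.
def step(loss):
    optim_lstm.zero_grad()
    optim_bert.zero_grad()
    loss.backward()
    optim_lstm.step()
    optim_bert.step()
```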

I cannot think of any other place where my code might go wrong. Any help would be much appreciated!

BTW, I have posted my training information as well as the SQL prediction results in the replies for your reference.

Here is the training information for the original LSTM seq2seq model:

Here is the training information for the BERT + seq2seq model:

Here is the SQL prediction for the original seq2seq model. We can see variation in the lengths of the predicted SQL and in the values the model predicts in each query.

Here is what BERT + seq2seq predicts. Not only does it fail to predict the long SQL (1-3) that the original seq2seq handles, it also predicts the same value over and over again (15-25) for different questions. This looks really weird to me. Any ideas?