Hi everyone!
I am trying to translate Bengali to Nepali and vice versa using a custom model. The datasets are extremely small (about 1,000 sentences). I have searched this forum, followed similar posts, and tried many other things before finally posting. I apologise if this is a simple problem, but I have been stuck for a long time and really need help to move forward.
I was able to translate with the TinyTransformer, but now I want to use the POS-tagged dataset as well, so I tried a custom model. However, I get the following error:
File "/usr/local/lib/python3.8/dist-packages/opennmt/inputters/inputter.py", line 423, in get_dataset_size
    raise RuntimeError("Parallel datasets do not have the same size")
RuntimeError: Parallel datasets do not have the same size
I have checked the datasets and both of them seem to have the same number of sentences, so I can't figure out where I am going wrong.
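For reference, this is roughly how I compared the two feature files (a minimal sketch; the paths are the ones from my config below):

paths = [
    "/content/drive/MyDrive/corpus_bn-ne/Old_Corpus/bn-ne.train.clean.bn",
    "/content/drive/MyDrive/corpus_bn-ne/pos_corpus/bn-ne-NepTreebankwithPOS.train.clean.bn",
]
# get_dataset_size compares line counts, so count lines the same way
# and also look for stray blank lines that are easy to miss.
for path in paths:
    with open(path, encoding="utf-8") as f:
        lines = f.readlines()
    blanks = sum(1 for line in lines if not line.strip())
    print(path, "->", len(lines), "lines,", blanks, "blank")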
Here is my model (the first inputter is meant to read the plain sentences and the second one the POS-tagged file, matching source_1_vocabulary and source_2_vocabulary below):
import tensorflow as tf
import tensorflow_addons as tfa
import opennmt
from opennmt import decoders, encoders, inputters, layers
class MultiTransformer(opennmt.models.SequenceToSequence):
    def __init__(self):
        super().__init__(
            source_inputter=opennmt.inputters.ParallelInputter(
                [
                    opennmt.inputters.WordEmbedder(embedding_size=512),
                    opennmt.inputters.WordEmbedder(embedding_size=512),
                ],
                reducer=None,
            ),
            target_inputter=opennmt.inputters.WordEmbedder(embedding_size=512),
            encoder=encoders.RNNEncoder(
                num_layers=4,
                num_units=1000,
                dropout=0.2,
                residual_connections=False,
                cell_class=tf.keras.layers.LSTMCell,
                reducer=None,
            ),
            decoder=decoders.AttentionalRNNDecoder(
                num_layers=3,
                num_units=512,
                attention_mechanism_class=tfa.seq2seq.LuongMonotonicAttention,
                cell_class=tf.keras.layers.LSTMCell,
                dropout=0.3,
                residual_connections=False,
                first_layer_attention=True,
            ),
        )
model = MultiTransformer
Here is an excerpt of the train dataset:
ভালৈ কাটাচ্ছিলন দুজন ।
গান শুনত আমার ত ভাল লাগ ।
And here is how the POS-tagged dataset looks:
ভালৈ|ADV কাটাচ্ছিলন|VERB দুজন|NOUN ।|PUNCT
গান|NOUN শুনত|VERB আমার|PRON ত|NUM ভাল|ADJ লাগ|VERB ।|PUNCT
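Since the two feature files have to stay parallel, I also checked that every plain line has a matching POS-tagged line with the same number of tokens (again just a minimal sketch, splitting on whitespace and on the | separator):

plain = "/content/drive/MyDrive/corpus_bn-ne/Old_Corpus/bn-ne.train.clean.bn"
tagged = "/content/drive/MyDrive/corpus_bn-ne/pos_corpus/bn-ne-NepTreebankwithPOS.train.clean.bn"
with open(plain, encoding="utf-8") as f1, open(tagged, encoding="utf-8") as f2:
    for i, (a, b) in enumerate(zip(f1, f2), start=1):
        words = a.split()
        # each tagged token looks like word|TAG; rsplit keeps words containing |
        pairs = [tok.rsplit("|", 1) for tok in b.split()]
        if len(words) != len(pairs):
            print(i, ":", len(words), "words vs", len(pairs), "tagged tokens")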
Here is my yml file:
model_dir: bnne

data:
  train_features_file:
    - /content/drive/MyDrive/corpus_bn-ne/Old_Corpus/bn-ne.train.clean.bn
    - /content/drive/MyDrive/corpus_bn-ne/pos_corpus/bn-ne-NepTreebankwithPOS.train.clean.bn
  train_labels_file: /content/drive/MyDrive/corpus_bn-ne/Old_Corpus/bn-ne.train.clean.ne
  eval_features_file:
    - /content/drive/MyDrive/corpus_bn-ne/Old_Corpus/bn-ne.valid.clean.bn
    - /content/drive/MyDrive/corpus_bn-ne/pos_corpus/bn-ne-NepTreebankwithPOS.valid.bn
  eval_labels_file: /content/drive/MyDrive/corpus_bn-ne/Old_Corpus/bn-ne.valid.clean.ne
  source_1_vocabulary: /content/bnne/first_test/src-vocab1.txt
  source_2_vocabulary: /content/bnne/first_test/src-vocab2.txt
  target_vocabulary: /content/bnne/first_test/tgt-vocab.txt

params:
  optimizer: Adam
  optimizer_params:
    beta_1: 0.8
    beta_2: 0.998
  learning_rate: 1.0

train:
  batch_size: 1024
  effective_batch_size: 1024

eval:
  steps: 500
  scorers: bleu
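For completeness, this is how I launch training (assuming the model above is saved as model.py and the config as config.yml; those file names are just mine):

onmt-main --model model.py --config config.yml train --with_eval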