Unable to use POS tagged dataset for translation

samindra · January 19, 2023, 8:16am

Hi everyone!

I am trying to translate Bengali to Nepali and vice versa using a custom model. The datasets are extremely small—about 1000 sentences. I have searched and tried following similar posts in this forum and many other things before finally posting. I apologise if this is a simple problem, but I really need help to move forward as I have been stuck for a long time.

I was able to use the TinyTransformer for translating, but now I wanted to use the POS tagged dataset, so I tried using a custom model. But I get the following error:

File “/usr/local/lib/python3.8/dist-packages/opennmt/inputters/inputter.py”, line 423, in get_dataset_size
raise RuntimeError(“Parallel datasets do not have the same size”)
RuntimeError: Parallel datasets do not have the same size

I have checked the datasets and both of them seems to have equal number of sentences, so I can’t figure out where am I going wrong.

Here is my model:

import tensorflow as tf
import tensorflow_addons as tfa
import opennmt

from opennmt import decoders, encoders, inputters, layers

class MultiTransformer(opennmt.models.SequenceToSequence):
def init(self):
super().init(
source_inputter=opennmt.inputters.ParallelInputter(
[
opennmt.inputters.WordEmbedder(embedding_size=512),
opennmt.inputters.WordEmbedder(embedding_size=512),
],
reducer=None,
),
target_inputter=opennmt.inputters.WordEmbedder(embedding_size=512),
        encoder=encoders.RNNEncoder(
            num_layers=4,
            num_units=1000,
            dropout=0.2,
            residual_connections=False,
            cell_class=tf.keras.layers.LSTMCell,
            reducer=None
        ),
        decoder=decoders.AttentionalRNNDecoder(
            num_layers=3,
            num_units=512,
            attention_mechanism_class=tfa.seq2seq.LuongMonotonicAttention,
            cell_class=tf.keras.layers.LSTMCell,
            dropout=0.3,
            residual_connections=False,
            first_layer_attention=True,
        ),
        
    )
model = MultiTransformer

Here is an excerpt of the train dataset:

ভালৈ কাটাচ্ছিলন দুজন ।
গান শুনত আমার ত ভাল লাগ ।

And here is how the POS tagged dataset looks:

ভালৈ|ADV কাটাচ্ছিলন|VERB দুজন|NOUN ।|PUNCT
গান|NOUN শুনত|VERB আমার|PRON ত|NUM ভাল|ADJ লাগ|VERB ।|PUNCT

Here is my yml file:

model_dir: bnne

data:
train_features_file:
- /content/drive/MyDrive/corpus_bn-ne/Old_Corpus/bn-ne.train.clean.bn
- /content/drive/MyDrive/corpus_bn-ne/pos_corpus/bn-ne-NepTreebankwithPOS.train.clean.bn
train_labels_file: /content/drive/MyDrive/corpus_bn-ne/Old_Corpus/bn-ne.train.clean.ne
eval_features_file:
- /content/drive/MyDrive/corpus_bn-ne/Old_Corpus/bn-ne.valid.clean.bn
- /content/drive/MyDrive/corpus_bn-ne/pos_corpus/bn-ne-NepTreebankwithPOS.valid.bn
eval_labels_file: /content/drive/MyDrive/corpus_bn-ne/Old_Corpus/bn-ne.valid.clean.ne
source_1_vocabulary: /content/bnne/first_test/src-vocab1.txt
source_2_vocabulary: /content/bnne/first_test/src-vocab2.txt
target_vocabulary: /content/bnne/first_test/tgt-vocab.txt
params:
optimizer: Adam
optimizer_params:
beta_1: 0.8
beta_2: 0.998
learning_rate: 1.0

train:
batch_size: 1024
effective_batch_size: 1024

eval:
steps: 500
scorers: bleu

vince62s · January 19, 2023, 9:26am

with such a small dataset your best option is to use NLLB, search the forum there are two threads on this.

SamuelLacombe · January 19, 2023, 1:20pm

Hello,

I haven’t used the post tag yet, but from my recollection you need to create a separeted file witch has as many tokens as you have in your main file. If your using sentence piece, it’s mean that you have additional work to do.

Your shouldnt have the post tag combined with your text. In the post tag file there should be only the post tag, but it’s really important that it match the number of tokens on every line.

Another way to put it, the number of spaces between the 2 files are supposed to be equal.

Best regards,
Samuel

JOHW85 · January 22, 2023, 7:42am

Make sure that your POS tags are custom tokens in your sentence piece model when training it. You don’t want |VERB being broken into several SP tokens for instance