OpenNMT Forum

SentencePiece in Google Colab

Hello. I’m having some problems translating with the Europarl V7 en-es data. Training runs smoothly, but when I call the onmt_translate script, the same exact sentence is predicted no matter what the input is. I noticed on some other forums that SentencePiece is a must, especially with this kind of data. I haven’t been using SentencePiece, so I would like to get started with it. Since I’m running on Google Colab, how do I get the SentencePiece binaries up and running? After installing it with pip, do I run the cmake processes? Thanks

The same prediction is being used for almost every input test sentence. Why would this be the case? I’m running the script as follows:

!onmt_translate -model model/model_step_3000.pt -src europarl-v7.es-en.es.test -output en-sp/pred.txt -gpu 0 -verbose

[2021-04-29 04:34:42,200 INFO] Translating shard 0.
[2021-04-29 04:34:42,379 INFO] 
SENT 1: ['–', 'Me', 'complace', 'que', 'el', 'orador', 'anterior', 'hubiera', 'sido', 'informado.']
PRED 1: I would like to point out that this is a very important step towards the Commission and the Commission.
PRED SCORE: -27.6684

[2021-04-29 04:34:42,379 INFO] 
SENT 2: ['Este', 'sistema', 'debe', 'regularse', 'de', 'otro', 'modo.']
PRED 2: I would like to point out that the European Union is in a position to take account of the fact that the European Union and the Member States have a great deal of work in this area.
PRED SCORE: -55.3306

[2021-04-29 04:34:42,379 INFO] 
SENT 3: ['También', 'he', 'refrendado', 'las', 'enmiendas', 'destinadas', 'a', 'que', 'se', 'incluya', 'el', 'atún', 'rojo', 'en', 'el', 'Apéndice', 'II', 'de', 'la', 'CITES,', 'de', 'conformidad', 'con', 'las', 'recientes', 'recomendaciones', 'del', 'Comité', 'Especial', 'de', 'la', 'Organización', 'de', 'las', 'Naciones', 'Unidas', 'para', 'la', 'Agricultura', 'y', 'la', 'Alimentación', '(FAO),', 'que', 'apoyó', 'el', 'anuncio', 'de', 'la', 'inclusión', 'del', 'atún', 'rojo', 'en', 'el', 'Apéndice', 'II', 'de', 'la', 'CITES.']
PRED 3: I would like to point out that this is a very important step towards the Commission and the Commission.
PRED SCORE: -27.5454

[2021-04-29 04:34:42,379 INFO] 
SENT 4: ['El', 'acuerdo', 'no', 'es', 'tan', 'obligatorio', 'o', 'transparente', 'como', 'la', 'legislación', 'y', 'existe', 'una', 'verdadera', 'falta', 'de', 'confianza', 'expresada', 'por', 'los', 'cuerpos', 'de', 'protección', 'de', 'los', 'peatones.']
PRED 4: I would like to thank the rapporteur for his excellent work.
PRED SCORE: -14.1551

[2021-04-29 04:34:42,379 INFO] 
SENT 5: ['Situación', 'política', 'en', 'Myanmar', '(Birmania)']
PRED 5: I would like to point out that this is a very important step towards the Commission and the Commission.
PRED SCORE: -27.6174

[2021-04-29 04:34:42,380 INFO] 
SENT 6: ['Pero', 'no', 'es', 'la', 'primera', 'vez', 'que', 'se', 'formula', 'la', 'pregunta.']
PRED 6: I would like to point out that the European Union is in a position to take account of the fact that the European Union is in a position to be able to adopt a common position.
PRED SCORE: -55.5286

[2021-04-29 04:34:42,380 INFO] 
SENT 7: ['Esto', 'pretende', 'reducir', 'la', 'dependencia', 'de', 'la', 'UE', 'con', 'respecto', 'a', 'Estados', 'individuales', 'de', 'los', 'que', 'hasta', 'la', 'fecha', 'hemos', 'adquirido', 'nuestros', 'combustibles', 'fósiles.']
PRED 7: I would like to point out that this is a very important step towards the Commission and the Commission.
PRED SCORE: -27.6010

[2021-04-29 04:34:42,380 INFO] 
SENT 8: ['Se', 'calcula', 'que', 'hay', 'aproximadamente', 'otros', '200', 'presos', 'políticos', 'en', 'Cuba.']
PRED 8: I would like to point out that this is a very important step towards the Commission and the Commission.
PRED SCORE: -27.7359

[2021-04-29 04:34:42,380 INFO] 
SENT 9: ['La', 'mejora', 'propuesta', 'del', 'Sistema', 'estadístico', 'europeo', 'tiene', 'por', 'objeto', 'responder', 'a', 'las', 'legítimas', 'preocupaciones', 'expresadas', 'sobre', 'la', 'validez', 'y', 'el', 'control', 'de', 'los', 'datos', 'proporcionados', 'por', 'los', 'Estados', 'miembros.']
PRED 9: I would like to point out that this is a very important step towards the Commission and the Commission.
PRED SCORE: -27.5812

[2021-04-29 04:34:42,380 INFO] 
SENT 10: ['Hace', 'algunos', 'meses,', 'la', 'mayoría', 'de', 'los', 'responsable', 'políticos', 'negaban', 'simplemente', 'la', 'existencia', 'de', 'ECHELON.']
PRED 10: I would like to point out that this is a very important step towards the Commission and the Commission.
PRED SCORE: -27.7134

[2021-04-29 04:34:42,380 INFO] PRED AVG SCORE: -1.4676, PRED PPL: 4.3390
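To put a number on how repetitive the output is, one can count the distinct lines in the prediction file. This is a minimal sketch, assuming the predictions were written one per line to the file given to `-output` above; the helper name `summarize_predictions` is hypothetical.

```python
from collections import Counter


def summarize_predictions(lines):
    """Return (distinct_count, most_common_line, its_share) for predicted lines."""
    # Ignore blank lines and surrounding whitespace.
    cleaned = [line.strip() for line in lines if line.strip()]
    if not cleaned:
        return 0, None, 0.0
    counts = Counter(cleaned)
    top_line, top_count = counts.most_common(1)[0]
    return len(counts), top_line, top_count / len(cleaned)


# Usage (path taken from the -output flag in the command above):
# with open("en-sp/pred.txt", encoding="utf-8") as f:
#     distinct, top, share = summarize_predictions(f)
#     print(f"{distinct} distinct predictions; most common covers {share:.0%}")
```

For the ten sentences logged above, this would report 4 distinct predictions, with the most common one covering 70% of the lines.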

There must be something wrong with your model.

For SentencePiece in Colab, it would probably be easier to just use the Python module: https://colab.research.google.com/github/google/sentencepiece/blob/master/python/sentencepiece_python_module_example.ipynb
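For reference, here is a minimal sketch of that route using the `sentencepiece` package from pip, which does not need a cmake build. The `vocab_size` value, the file names in the usage comments, and all helper names are assumptions for illustration. `encode_lines` accepts any callable that maps a string to a list of pieces, so it can be tried without a trained model.

```python
def train_model(corpus_path, model_prefix, vocab_size=32000):
    """Train a SentencePiece model; writes <model_prefix>.model and <model_prefix>.vocab."""
    import sentencepiece as spm  # imported lazily so the other helpers work without it

    spm.SentencePieceTrainer.train(
        input=corpus_path, model_prefix=model_prefix, vocab_size=vocab_size
    )


def load_encoder(model_path):
    """Return a function that splits one sentence into subword pieces."""
    import sentencepiece as spm

    sp = spm.SentencePieceProcessor(model_file=model_path)
    return lambda text: sp.encode(text, out_type=str)


def encode_lines(encode, lines):
    """Yield one space-joined piece sequence per input line."""
    for line in lines:
        yield " ".join(encode(line.strip()))


def pieces_to_text(piece_line):
    """Undo subwording: join the pieces and turn the U+2581 marker back into spaces."""
    return "".join(piece_line.split()).replace("\u2581", " ").strip()


# Usage (hypothetical model prefix and output name):
# train_model("train_europarl-v7.es-en.es", "spm_es")
# encode = load_encoder("spm_es.model")
# with open("europarl-v7.es-en.es.test", encoding="utf-8") as src, \
#         open("test.es.sp", "w", encoding="utf-8") as out:
#     for encoded in encode_lines(encode, src):
#         out.write(encoded + "\n")
```

If the training data is encoded ahead of training like this, any file passed to `-src` needs the same encoding, and `pieces_to_text` can be applied to each predicted line afterwards.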

I was able to find a way to build SentencePiece with cmake in Google Colab using this link.

However, I’m getting this error after running this command from the tutorial:
!onmt_build_vocab -config en-sp.yaml -n_sample -1

Traceback (most recent call last):
  File "/usr/local/bin/onmt_build_vocab", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.7/dist-packages/onmt/bin/build_vocab.py", line 63, in main
    build_vocab_main(opts)
  File "/usr/local/lib/python3.7/dist-packages/onmt/bin/build_vocab.py", line 32, in build_vocab_main
    transforms = make_transforms(opts, transforms_cls, fields)
  File "/usr/local/lib/python3.7/dist-packages/onmt/transforms/transform.py", line 176, in make_transforms
    transform_obj.warm_up(vocabs)
  File "/usr/local/lib/python3.7/dist-packages/onmt/transforms/tokenize.py", line 110, in warm_up
    load_src_model.Load(self.src_subword_model)
  File "/usr/local/lib/python3.7/dist-packages/sentencepiece/__init__.py", line 367, in Load
    return self.LoadFromFile(model_file)
  File "/usr/local/lib/python3.7/dist-packages/sentencepiece/__init__.py", line 171, in LoadFromFile
    return _sentencepiece.SentencePieceProcessor_LoadFromFile(self, arg)
TypeError: not a string

As far as my model goes, I’m scripting everything. Would somebody mind taking a look to see if anything else is wrong? I used Yasmin’s data-splitting tool to get the test, dev, and train data from the Europarl V7 Spanish-English parallel corpus. Here is my config:

## Where the samples will be written
save_data: en-sp/run/example

## Where the vocab(s) will be written
src_vocab: en-sp/run/example.vocab.src
tgt_vocab: en-sp/run/example.vocab.tgt

## Where the model will be saved
save_model: drive/MyDrive/Europarl/model/model

# Allow overwriting existing files in the folder
overwrite: True

# Corpus opts:
data:
    europarl:
        path_src: train_europarl-v7.es-en.es
        path_tgt: train_europarl-v7.es-en.en
        transforms: [sentencepiece]
        weight: 1

    valid:
        path_src: dev_europarl-v7.es-en.es
        path_tgt: dev_europarl-v7.es-en.en
        transforms: [sentencepiece]

skip_empty_level: silent

world_size: 1
gpu_ranks: [0]

# General opts
report_every: 100
train_steps: 10000
valid_steps: 1000

# Optimizer
optim: adam
learning_rate: 0.001


# Logging

tensorboard: true
tensorboard_log_dir: logs
log_file: logs/log-file.txt
verbose: True
attn_debug: True
align_debug: True
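The `TypeError: not a string` in the traceback above is raised while loading `self.src_subword_model`, and the config posted here lists `sentencepiece` under `transforms` without setting any subword model path. A likely cause, then, is that `src_subword_model` and `tgt_subword_model` are missing from the YAML. The sketch below checks an already-parsed config dict for that; the function name and the model file names are hypothetical.

```python
REQUIRED_KEYS = ("src_subword_model", "tgt_subword_model")


def missing_subword_options(config):
    """List subword-model keys that are unset although the sentencepiece transform is used."""
    # Collect transforms declared at the top level and per corpus.
    used = set(config.get("transforms") or [])
    for corpus in (config.get("data") or {}).values():
        used.update(corpus.get("transforms") or [])
    if "sentencepiece" not in used:
        return []
    return [key for key in REQUIRED_KEYS if not config.get(key)]


# Mirrors the relevant part of the config above.
config = {
    "data": {
        "europarl": {"transforms": ["sentencepiece"]},
        "valid": {"transforms": ["sentencepiece"]},
    },
}
print(missing_subword_options(config))  # -> ['src_subword_model', 'tgt_subword_model']

# Adding the paths (hypothetical file names) clears the check.
config["src_subword_model"] = "spm_es.model"
config["tgt_subword_model"] = "spm_en.model"
print(missing_subword_options(config))  # -> []
```

In the YAML itself, this would correspond to two extra top-level lines naming trained SentencePiece model files for the source and target sides.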

To fix this issue, I followed the filtering, tokenizing, and training steps outlined in this helpful repository by Yasmin: GitHub - ymoslem/MT-Preparation: Machine Translation (MT) Preparation Scripts

I also used GitHub - google/sentencepiece (Unsupervised text tokenizer for Neural Network-based text generation) for reference.
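As a rough illustration of what the filtering step does (this is not the MT-Preparation script itself), here is a sketch that drops sentence pairs that are empty, duplicated, identical on both sides, too long, or badly mismatched in length. The thresholds and the function name are assumptions.

```python
def filter_pairs(src_lines, tgt_lines, max_len=200, max_ratio=3.0):
    """Return cleaned (source, target) pairs from two aligned sequences of lines."""
    seen = set()
    kept = []
    for src, tgt in zip(src_lines, tgt_lines):
        src, tgt = src.strip(), tgt.strip()
        if not src or not tgt:
            continue  # empty on either side
        if src == tgt or (src, tgt) in seen:
            continue  # untranslated copy or exact duplicate
        n_src, n_tgt = len(src.split()), len(tgt.split())
        if max(n_src, n_tgt) > max_len:
            continue  # too long, counted in whitespace tokens
        if max(n_src, n_tgt) / min(n_src, n_tgt) > max_ratio:
            continue  # lengths too different to be a plausible translation
        seen.add((src, tgt))
        kept.append((src, tgt))
    return kept


# Usage with the training files named in the config above:
# with open("train_europarl-v7.es-en.es", encoding="utf-8") as fs, \
#         open("train_europarl-v7.es-en.en", encoding="utf-8") as ft:
#     pairs = filter_pairs(fs, ft)
```

Filtering both sides together keeps the two files aligned line by line, which the `path_src` and `path_tgt` pairing relies on.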