The paper was interesting! As I understand it, they first determine which (sub)words across the source and target languages are similar and then push the embeddings of those (sub)words closer together. The advantage of this is that it makes encoding and decoding easier for the model.
For Argos Translate I put all of the data into one file and then train the tokenizer on that file:
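Roughly like this, a minimal sketch assuming a shared SentencePiece tokenizer (the file names and vocab size are placeholders, not the exact Argos Translate pipeline):

```python
import sentencepiece as spm

# Concatenate the source and target corpora into a single training file.
with open("all_data.txt", "w", encoding="utf-8") as out:
    for path in ["source.txt", "target.txt"]:
        with open(path, encoding="utf-8") as f:
            for line in f:
                out.write(line)

# Train one shared tokenizer on the combined file, so both languages
# share the same subword vocabulary.
spm.SentencePieceTrainer.train(
    input="all_data.txt",
    model_prefix="sentencepiece",
    vocab_size=32000,
    character_coverage=0.9995,
)
```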
The method described in the paper seems needlessly complex. They classify pairs of words in the source and target vocab as lexically similar, words of the same form, or unrelated. Why not just compare all words (source and target) based on some general measure of similarity? That would remove the arbitrary similarity boundaries and allow you to exploit similarities within a language.
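To illustrate what I mean, a hypothetical sketch: score every pair of vocabulary items (source and target mixed together, including pairs within the same language) with a single general string-similarity measure instead of discrete categories. The word list and the choice of measure here are just placeholders:

```python
from difflib import SequenceMatcher
from itertools import combinations

# Toy mixed vocabulary; in practice this would be the full source + target vocab.
vocab = ["nation", "nación", "nationale", "dog", "perro"]

def similarity(a: str, b: str) -> float:
    # Ratio of matching characters, in [0, 1]; any other measure would work.
    return SequenceMatcher(None, a, b).ratio()

# Score all pairs and list the most similar ones first.
scores = {(a, b): similarity(a, b) for a, b in combinations(vocab, 2)}
for (a, b), s in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{a:10s} {b:10s} {s:.2f}")
```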