Hello everyone. I am building a model to translate English to Vietnamese, and I am confused about building the vocabulary. In Vietnamese, two or three syllables often combine to form a single meaning. For example, the English word "PHONE" means "ĐIỆN THOẠI" in Vietnamese, but when the vocabulary is built, "ĐIỆN THOẠI" gets separated into "ĐIỆN" and "THOẠI", and those two pieces have no meaning when they stand alone. What can I do to improve this? What preprocessing should I apply, and what encoder and decoder should I use? If anyone has experience with this, please help.
Hi, have you tried training a model yet? With enough data your model should learn these relationships, i.e. that "ĐIỆN THOẠI" means "phone". I would start with the basic --auto_config setup as illustrated in the QuickStart.
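For reference, a minimal sketch of what that QuickStart-style --auto_config run looks like with OpenNMT-tf. The file names (src-train.txt, data.yml, etc.) and the vocabulary size are placeholders you would adapt to your own corpus:

```shell
# Build source and target vocabularies from the (tokenized) training files.
onmt-build-vocab --size 32000 --save_vocab src-vocab.txt src-train.txt
onmt-build-vocab --size 32000 --save_vocab tgt-vocab.txt tgt-train.txt

# data.yml points to the training/eval files and the vocabularies, e.g.:
#   data:
#     train_features_file: src-train.txt
#     train_labels_file: tgt-train.txt
#     source_vocabulary: src-vocab.txt
#     target_vocabulary: tgt-vocab.txt

# Train a Transformer with automatic hyperparameter defaults.
onmt-main --model_type Transformer --config data.yml --auto_config train --with_eval
```

With --auto_config you do not need to hand-pick encoder/decoder settings to get started; the Transformer defaults are applied for you.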
OK, thanks. I first tested with 10,000 rows and the results were not good. So I don't need any preprocessing, or any encoder/decoder settings in build_vocab or train?
If you follow the QuickStart, you will need to tokenize your data first. I use SentencePiece, but there are other ways to tokenize. You will not get good results with 10,000 sentence pairs; experience shows you will need around 1 million sentence pairs as a minimum to start getting usable results.
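On the original worry about "ĐIỆN THOẠI" being split: with SentencePiece-style subword tokenization, each piece carries a word-boundary marker ("▁", U+2581), and detokenization simply concatenates the pieces and turns the markers back into spaces. So the model is free to learn to emit the pieces for "ĐIỆN THOẠI" adjacently, and the full term is recovered exactly at decode time. A minimal pure-Python sketch of that rejoining convention (the pieces shown are a hypothetical decoder output, not real model output):

```python
def detokenize(pieces):
    """Join subword pieces back into plain text, SentencePiece-style:
    concatenate everything, then turn each word-boundary marker (U+2581)
    into a space."""
    return "".join(pieces).replace("\u2581", " ").strip()

# Hypothetical decoder output when translating "phone":
pieces = ["▁ĐIỆN", "▁THOẠI"]
print(detokenize(pieces))  # ĐIỆN THOẠI

# Pieces without the marker attach to the previous word:
print(detokenize(["▁I", "▁love", "▁cat", "s"]))  # I love cats
```

In other words, splitting "ĐIỆN THOẠI" in the vocabulary is not a problem in itself; the model learns the co-occurrence from data, which is why the amount of training data matters more than the vocabulary scheme here.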