Thanks, I see where I went wrong, and it works very well now.
Are there any language detection models available, like langdetect or langid?
Also, can this project or model support terminology intervention and corpus intervention?
It seems the original repository includes a fastText language identification model, available here under the CC-BY-NC licence, but consider this issue. As you can see in these examples, the original fastText models (lid.176.bin and lid.176.ftz) output more accurate language IDs than lid218e.bin. Moreover, the original fastText models are distributed under a different licence.
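On the language-detection question: the fastText LID models mentioned above can be loaded with the `fasttext` Python package. Below is a minimal sketch, not a definitive implementation; the filename `lid.176.ftz` in the working directory is an assumption, and the model file must be downloaded from the fastText site first.

```python
import os

def parse_fasttext_label(label: str) -> str:
    # fastText returns labels such as "__label__en"; strip the prefix
    # to recover the bare ISO 639-1 code.
    prefix = "__label__"
    return label[len(prefix):] if label.startswith(prefix) else label

def detect_language(model, text: str):
    # predict() rejects newlines, so flatten the text first.
    labels, probs = model.predict(text.replace("\n", " "))
    return parse_fasttext_label(labels[0]), float(probs[0])

if __name__ == "__main__" and os.path.exists("lid.176.ftz"):
    import fasttext  # pip install fasttext
    model = fasttext.load_model("lid.176.ftz")  # path is an assumption
    print(detect_language(model, "Bonjour tout le monde"))
```

The compressed `.ftz` model trades a little accuracy for a much smaller download than the `.bin` version, which may matter if you ship it with an application.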
    private final Locale locale;
    private final String iso639_1;
    private final String iso639_3;
    private final String chineseName;
    private final String englishName;
    private final String localName;

    Language(Locale locale, String iso639_1, String iso639_3, String chineseName, String englishName, String localName) {
        this.locale = locale;
        this.iso639_1 = iso639_1;
        this.iso639_3 = iso639_3;
        this.chineseName = chineseName;
        this.englishName = englishName;
        this.localName = localName;
    }

    public Locale getLocale() {
        return locale;
    }

    public String getIso639_1() {
        return iso639_1;
    }

    public String getIso639_3() {
        return iso639_3;
    }

    public String getChineseName() {
        return chineseName;
    }

    public String getEnglishName() {
        return englishName;
    }

    public String getLocalName() {
        return localName;
    }
}
I want to know about the 205 language codes of NLLB-200, such as:
ace_Arab Acehnese (Arabic script)
ace_Latn Acehnese (Latin script)
What standard is used?
Normal locale codes usually consist of a language code (ISO 639-1) and a country/region code (ISO 3166-1 alpha-2).
Do I need to build a mapping table from NLLB-200 language codes to standard language codes?
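For what it's worth, the `lang_Script` pattern used by NLLB-200 combines an ISO 639-3 language code with an ISO 15924 script code (e.g. eng_Latn, zho_Hans), so a mapping table down to ISO 639-1 locale codes is the usual approach. A minimal sketch with a handful of illustrative entries follows; the table itself is an assumption for demonstration, not a complete mapping.

```python
# Illustrative subset only; a real table needs all of the NLLB-200 codes.
NLLB_TO_LOCALE = {
    "eng_Latn": "en",
    "fra_Latn": "fr",
    "zho_Hans": "zh-CN",  # Simplified Chinese
    "zho_Hant": "zh-TW",  # Traditional Chinese
}

def to_locale_code(nllb_code: str) -> str:
    # Fall back to the ISO 639-3 part for languages that have no
    # ISO 639-1 code (e.g. "ace_Latn" -> "ace" for Acehnese).
    if nllb_code in NLLB_TO_LOCALE:
        return NLLB_TO_LOCALE[nllb_code]
    lang, _script = nllb_code.split("_", 1)
    return lang

print(to_locale_code("zho_Hans"))  # zh-CN
print(to_locale_code("ace_Latn"))  # ace
```

Note that several NLLB codes can map to the same base language (zho_Hans and zho_Hant both mean Chinese), so the script part is worth preserving when the distinction matters.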
import ctranslate2
import sentencepiece as spm

# [Modify] Set paths to the CTranslate2 and SentencePiece models
ct_model_path = "/root/models/nllb-200-3.3B-int8"
sp_model_path = "/root/models/flores200_sacrebleu_tokenizer_spm.model"
device = "cuda"  # or "cpu"

# Load the source SentencePiece model
sp = spm.SentencePieceProcessor()
sp.load(sp_model_path)

# Load the CTranslate2 model once at module level
translator = ctranslate2.Translator(ct_model_path, device=device)

def translation_function(source_sents, src_lang, tgt_lang):
    beam_size = 4
    source_sentences = [sent.strip() for sent in source_sents]
    target_prefix = [[tgt_lang]] * len(source_sentences)

    # Subword the source sentences
    source_sents_subworded = sp.encode_as_pieces(source_sentences)
    source_sents_subworded = [[src_lang] + sent + ["</s>"] for sent in source_sents_subworded]
    print(source_sents_subworded)
    # print("First subworded source sentence:", source_sents_subworded[0], sep="\n")

    # Translate the source sentences
    translations_subworded = translator.translate_batch(
        source_sents_subworded,
        batch_type="tokens",
        max_batch_size=2024,
        beam_size=beam_size,
        target_prefix=target_prefix,
    )
    print(translations_subworded)
    translations_subworded = [translation.hypotheses[0] for translation in translations_subworded]
    for translation in translations_subworded:
        if tgt_lang in translation:
            translation.remove(tgt_lang)

    # Desubword the target sentences
    translations = sp.decode(translations_subworded)
    for src_sent, tgt_sent in zip(source_sentences, translations):
        print(src_sent)
        print(tgt_sent)

if __name__ == '__main__':
    source_sents = [input('input text: ')]
    src_lang = input('input src_lang: ')
    tgt_lang = input('input tgt_lang: ')
    if not source_sents[0] or not src_lang or not tgt_lang:
        source_sents = [
            "In the Big Model era, big models should be used for translation, but the translation efficiency is too low, what to do?",
            # "大家好",
            # "你是谁"
        ]
        # src_lang = "kin_Latn"
        src_lang = "eng_Latn"
        tgt_lang = "zho_Hans"
    translation_function(source_sents, src_lang, tgt_lang)
My problem:
With the target language set to Simplified Chinese (tgt_lang = "zho_Hans"), the translation result is: 在大模型时代, 大模型应该用于翻译, ("In the Big Model era, big models should be used for translation,")
Changing the target language to Traditional Chinese (tgt_lang = "zho_Hant"), the translation result is: 但翻譯效率太低了, 我們該怎麼辦? ("but the translation efficiency is too low, what should we do?")
In each case only half of the sentence was translated: the first half in one setting and the second half in the other.
Please help me analyze the reason.
Yes, I could try MADLAD, but I want to know why nllb-200-3.3B produces this error.
An additional, unrelated question: I want to buy a server and host it in an overseas data center. Is there a way to arrange this? Of course, I could also just use an Amazon cloud server, but I would prefer to buy and colocate a server directly. Thank you very much for your patient replies.