"Appropriate" way to translate without reading from a file

I’m trying to use ONMT-py from within a small app. In previous versions (I’m talking about commit 26421ce20c6b626ceacafbb3282cad1d5dce04ca) I could simply do the following:

def translate(input_statement):
    translator = onmt.Translator(translator_args)
    response, score_raw, _, _, _ = translator.translate(
        srcBatch=[input_statement.text.split()], goldBatch=[])
    return response

And I would be able to get a translation from a string.

Now I’m trying to do the same with the latest version in the repo, and I’m not sure what the best way is to avoid depending on a file. The Translator.translate() method takes a batch object, which comes from an onmt.IO.OrderedIterator, which in turn comes from an onmt.IO.ONMTDataset, which requires a file to load the source from.

Is there a simpler way to get that same batch object, one that does not involve reading from a file, so I can use the translate() method? What am I missing?

Or is there an easy way to generate the ONMTDataset object from just a string? It seems a bit convoluted, but as far as I understand the structure, it’s either that or writing a custom translate().

I did look for documentation and searched the forum, but couldn’t find anything.

Thanks!

Hi, your proposed method is the only feasible (but yes, rather cumbersome) way for now.

Users have asked before for support for translating a single sentence. We thought
it was easy enough to fake it as a file and use the current interface, so we didn’t add that support.

But your requirement seems reasonable. Maybe later, when I refactor the translation part of
the code, I will implement this.
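
The "fake it as a file" workaround mentioned above can be sketched roughly as follows. This is only an illustration using the Python standard library; `string_as_file` is a hypothetical helper name, not part of ONMT-py, and how the resulting path is handed to the translator depends on the version you are running.

```python
import os
import tempfile
from contextlib import contextmanager


@contextmanager
def string_as_file(text, suffix=".txt"):
    """Write `text` to a temporary file, yield its path, then delete the file.

    Hypothetical helper: it only produces a one-line source file that a
    file-based interface can read.
    """
    fd, path = tempfile.mkstemp(suffix=suffix, text=True)
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            # One sentence per line, terminated by a single newline.
            handle.write(text.rstrip("\n") + "\n")
        yield path
    finally:
        os.remove(path)
```

Inside a `with string_as_file("some sentence") as src_path:` block, `src_path` can be passed wherever the file-based interface expects a source file. The temporary file is removed when the block exits.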

Thanks.

For whatever it’s worth (maybe it’ll help someone else), this is what I’m doing:

I created a subclass of ONMTDataset with only what I needed:

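import torchtext
from onmt.IO import ONMTDataset
# peek() is assumed to be available from the same module as ONMTDataset.


class StringONMTDataset(ONMTDataset):
    def __init__(self, src_path, fields):
        # Despite its name, src_path is the raw source string, not a file path.
        self.src_vocabs = []
        self.n_src_feats = 0
        self.n_tgt_feats = 0
        # A single example: the whitespace-tokenized string at index 0.
        src_base = [{"src": tuple(src_path.split()),
                     "indices": 0}]
        examples = (x for x in src_base)

        examples = self.dynamic_dict(examples)

        # Look at the first example to find out which fields are present.
        ex, examples = peek(examples)
        keys = ex.keys()

        fields = [(k, fields[k]) for k in keys]
        example_values = ([ex[k] for k in keys] for ex in examples)
        out_examples = (torchtext.data.Example.fromlist(ex_values, fields)
                        for ex_values in example_values)

        # Deliberately skip ONMTDataset.__init__ (which reads from a file)
        # and call the next __init__ in the MRO directly.
        super(ONMTDataset, self).__init__(
            out_examples,
            fields,
            None
        )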
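
The class above calls a `peek` helper that is not defined in the post. It presumably comes from the same IO module as ONMTDataset. If your checkout does not expose it, a minimal stand-in could look like the sketch below, assuming all it needs to do is return the first example together with an iterator that still yields every example:

```python
from itertools import chain


def peek(seq):
    """Return (first_item, iterator); the iterator still yields every item.

    Stand-in sketch, not necessarily identical to the helper in ONMT-py.
    """
    iterator = iter(seq)
    first = next(iterator)
    # Put the consumed first item back in front of the remaining items.
    return first, chain([first], iterator)
```

This matters because `examples` is a generator: reading the first element to inspect its keys would otherwise consume it, and the dataset would end up empty.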

And having that, I just initialize the object:

line = "Whatever i need to translate"
# Wrap the string in the in-memory dataset defined above.
data = StringONMTDataset(line, translator.fields)
test_data = onmt.IO.OrderedIterator(
    dataset=data, device=translator_args.gpu,
    batch_size=translator_args.batch_size,
    train=False, sort=False, shuffle=False)
test_data.create_batches()
# There is only one example, so the first batch holds the whole input.
prepared_string = torchtext.data.Batch(test_data.batches[0], data,
                                       translator_args.gpu, False)
pred_batch, gold_batch, pred_scores, gold_scores, attn, src \
    = translator.translate(prepared_string, data)

Maybe not too clean, but it works…

Hi, I am wondering if there has been any progress on this problem. I spent some time on it, but with no success. @juanshlm’s solution does not work anymore because of updates to the project. I tried to adapt it, but it does not work yet. Any help would be appreciated. I believe this would be an important feature for anyone trying to use the project in a demo application rather than in experiments. Thanks

Agreed. We will try to support this better.

Thanks for the update. I am looking forward to it.