Extending the translation server to add features in source text

pr-welcome

(Panos Kanavos) #1

Continuing the discussion from Using features for domain/client/subject adaptation, it would be great to extend the json array sent to the server to include more fields with user-provided features so the source text can be automatically tagged properly according to the model used. The main issue for a user is not adding the features themselves before sending the request, but the consistency in the tokenization which is currently performed at the server.


(Vincent Nguyen) #2

I think what you are asking for goes beyond the “server” functionality.

If a model accepts N features, you have to provide a source text with N features.

You need to add another custom layer on your client side to include the features.


(Panos Kanavos) #3

Hi @vince62s,

As I noted, the problem is not adding the features but the tokenization. Aren’t the features added after tokenization? If so, the feature functionality depends on the tokenization functionality, which is why it has to be done in the same place, with the same tool and rules, on the server side.

I realize that this does not fully implement the feature functionality for the source text, because the user cannot tag the source text in whatever way they want: each source word must be marked with the same feature(s). Still, it would be very useful for the sole purpose of allowing domain recognition.

Anyway, I looked at the code on GitHub and I think this can be easily implemented, though probably not by me, since I haven’t ever written a single line of Lua :slight_smile:. I will try to come up with something, though, if my request is not accepted. For starters, can you please verify that the starting point should be a new module that implements logic similar to case.lua, and in particular a function similar to case.addCase (toks)?

Thanks!


(Vincent Nguyen) #4

@panosk
I understand your use case, however tagging with some case specific features requires more parameters, like an external model or something (not something like the embedded case_feature).

The easiest option for you would be to use the zmq server, which does not tokenize, or we could add a flag on the rest_server that would bypass the tokenizer.

In the end you have to prepare the data by code.


(Panos Kanavos) #5

Maybe you could clarify why an external model would be needed; I may be missing something about the inner workings of the server. My idea is this: instead of sending the request as it is now
[{ "src" : "Hello World" }]
allow sending the request like this:

[{ "src" : "Hello World","features" : ["FEATURE1","FEATURE2"] }]
Then after tokenization, we could add the features in the array to each token and then send the final featurized string for translation.
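
A minimal sketch of how a server could read the proposed request format, written in Python only for illustration (the actual server is written in Lua). The `"features"` key is the extension proposed in this post, not an existing field, and `parse_request` is a hypothetical helper. Treating the key as optional is an assumption made here so that requests containing only `"src"` keep working.

```python
import json


def parse_request(body):
    """Parse a request body into a list of (src, features) pairs.

    "features" is the proposed, hypothetical extension; it defaults to an
    empty list so that old-style requests remain valid.
    """
    segments = json.loads(body)
    if not isinstance(segments, list):
        raise ValueError("request body must be a JSON array")
    parsed = []
    for segment in segments:
        src = segment["src"]
        features = segment.get("features", [])
        if not all(isinstance(f, str) for f in features):
            raise ValueError("every feature must be a string")
        parsed.append((src, features))
    return parsed


# The current request format and the proposed one, as written in this post.
old_style = parse_request('[{ "src" : "Hello World" }]')
new_style = parse_request(
    '[{ "src" : "Hello World","features" : ["FEATURE1","FEATURE2"] }]'
)
```

The features travel with the segment rather than with individual words, so the client never needs to know how the server will split the text.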


(Vincent Nguyen) #6

okay, I thought you wanted a tokenizer-like behavior on the server side to actually tag the source. (hence needing a model for that).

Now, I don’t understand your syntax. Features are attached to words, not segments, so you need features for “Hello” and for “World”, right?


(Guillaume Klein) #7

This seems like an issue. How do you know the number of tokens before tokenization?


(Panos Kanavos) #8

For domain recognition, and according to @jean.senellart’s paper (please see my link in my first post), that’s what has to be done. Here is a detailed example:

[{ "src" : "This is a sample sentence in a specific domain","features" : ["DOMAIN_NAME","SUBDOMAIN"] }]

After tokenization in the server, this would become

This|DOMAIN_NAME|SUBDOMAIN is|DOMAIN_NAME|SUBDOMAIN a|DOMAIN_NAME|SUBDOMAIN sample|DOMAIN_NAME|SUBDOMAIN sentence|DOMAIN_NAME|SUBDOMAIN in|DOMAIN_NAME|SUBDOMAIN a|DOMAIN_NAME|SUBDOMAIN specific|DOMAIN_NAME|SUBDOMAIN domain|DOMAIN_NAME|SUBDOMAIN
and it will be fed to a model created for this purpose.

The translator can choose the domain(s) for their project from a drop-down box. The client application can then use these values as features to send every translation request in the above format.
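
The client side of this could look roughly like the following Python sketch. `build_request` is a hypothetical helper, and the `"features"` field is the proposed extension, not something the server currently accepts; the domain values stand in for whatever the translator picked in the drop-down box.

```python
import json


def build_request(sentences, domain, subdomain):
    """Build the proposed request body: one object per source sentence,
    each carrying the same project-level features."""
    payload = [
        {"src": sentence, "features": [domain, subdomain]}
        for sentence in sentences
    ]
    return json.dumps(payload)


# Values chosen once per project, then reused for every request.
body = build_request(
    ["This is a sample sentence in a specific domain"],
    "DOMAIN_NAME",
    "SUBDOMAIN",
)
```

The client sends raw, untokenized text plus the chosen values, and tokenization stays entirely on the server.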


(Panos Kanavos) #9

Why do we need to know this? The string reaches the server untokenized and it is tokenized there. Then, if the -case_feature flag is on, case.lua comes into play and the table of tokens is passed to the function case.addCase (toks), which adds the case feature. That’s my rough understanding from the quick look I took at the code; I may well be missing something important, in which case I apologize…

Similarly, we could create another module src_features, add a new flag -src_feature that will trigger in the same way -case_feature triggers upon starting the server, and the module will have a function src_features.addFeatures(toks,features) that will append each feature in features to each token in toks.
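
To make the intended behavior concrete, here is a rough Python sketch of what such a function would do (the real module would be written in Lua). Everything in it is an assumption for illustration: `add_features` and `featurize` are hypothetical names, whitespace splitting stands in for the server's real tokenizer, the `|` separator is the one used in the examples in this thread and may not match the separator the toolkit actually uses, and `src_feature_enabled` mimics the proposed -src_feature flag.

```python
def add_features(toks, features):
    """Append every feature in `features` to every token in `toks`."""
    if not features:
        return list(toks)
    suffix = "|" + "|".join(features)
    return [tok + suffix for tok in toks]


def featurize(src, features, src_feature_enabled=True):
    """Tokenize first, then tag, so both steps share the same token boundaries."""
    toks = src.split()  # stand-in for the server's tokenizer (assumption)
    if src_feature_enabled:
        toks = add_features(toks, features)
    return " ".join(toks)


line = featurize(
    "This is a sample sentence in a specific domain",
    ["DOMAIN_NAME", "SUBDOMAIN"],
)
```

Because the features are applied after tokenization, the number of tokens does not need to be known in advance.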


(Guillaume Klein) #10

Sorry, I misunderstood: this is about adding the same features to every token in the sentence.


(Vincent Nguyen) #11

Back to question 1: you need a model to tag your features, so it is not the server’s role to do that.

you’d better pass the data (already tagged with features) with a flag “do not tokenize” to the server.


(Panos Kanavos) #12

Of course a model with appropriate features in the source side will have been trained and the server will work on this model. The whole point of this request is to make it work with properly trained models, not with any models.


(Vincent Nguyen) #13

I am not talking about a “translation model” but a “tagging model”.

I guess in your example
This|DOMAIN_NAME|SUBDOMAIN is|DOMAIN_NAME|SUBDOMAIN a|DOMAIN_NAME|SUBDOMAIN sample|DOMAIN_NAME|SUBDOMAIN sentence|DOMAIN_NAME|SUBDOMAIN in|DOMAIN_NAME|SUBDOMAIN a|DOMAIN_NAME|SUBDOMAIN specific|DOMAIN_NAME|SUBDOMAIN domain|DOMAIN_NAME|SUBDOMAIN

you expect each token to be tagged with its own DOMAIN_NAME and its own SUBDOMAIN
Am I correct ?

EDIT: ok I see you want the full sentence to belong to the same DOMAIN/SUBDOMAIN


(Panos Kanavos) #14

No. Consider DOMAIN_NAME and SUBDOMAIN as real features and as the only features that will be sent with every request. Every word in every request will be marked with only these two features.

The translation model will have been trained with two features on the source side, which could be any of the following:

DOMAIN_NAME, SUBDOMAIN, ANOTHERDOMAIN_NAME, ANOTHER_SUBDOMAIN

To get a better idea, please have a look at @jean.senellart’s paper to see how they did it.