For massively multilingual Neural Machine Translation, translation quality improves significantly with more model capacity. A simple approach to increase this capacity is a memory layer.
What do you think is the best way to implement Large Memory Layers with Product Keys? I would replace the feed-forward layer in the transformer with a minimalistic product-key memory layer, roughly along the lines of the sketch at the end of this post.
Should this be a separate model or just an optional configuration in the existing model?
Do you think the minimalistic version will satisfy our needs or do we want to test different options with this layer?
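Here is a rough PyTorch sketch of what I have in mind, boiled down to a single memory head and without the query batch normalization and multi-head reading used in the paper. The class name ProductKeyMemory and the default values are only illustrative; the layer keeps the same input/output shape as the feed-forward block it would replace.

```python
# Minimal, illustrative sketch of a product-key memory layer in the spirit of
# "Large Memory Layers with Product Keys"; not the paper's actual code.
import math
import torch
import torch.nn as nn
import torch.nn.functional as F


class ProductKeyMemory(nn.Module):
    def __init__(self, d_model, n_keys=512, k_dim=256, knn=32):
        super().__init__()
        self.n_keys = n_keys      # per sub-key set -> n_keys ** 2 memory slots
        self.k_dim = k_dim        # query/key dimension, split into two halves
        self.knn = knn            # number of memory slots read per position

        # Query network maps the hidden state into the key space.
        self.query = nn.Linear(d_model, k_dim)
        # Two sets of half-dimension sub-keys define the product key space.
        self.sub_keys = nn.Parameter(
            torch.randn(2, n_keys, k_dim // 2) / math.sqrt(k_dim // 2))
        # One value vector per memory slot, read as a weighted bag.
        self.values = nn.EmbeddingBag(n_keys ** 2, d_model, mode="sum")

    def forward(self, x):
        bsz, seq_len, d_model = x.shape
        q = self.query(x).view(-1, self.k_dim)       # (B*T, k_dim)
        q1, q2 = q.chunk(2, dim=-1)                   # two half-queries

        # Score each half-query against its sub-key set and keep candidates.
        s1, i1 = (q1 @ self.sub_keys[0].t()).topk(self.knn, dim=-1)
        s2, i2 = (q2 @ self.sub_keys[1].t()).topk(self.knn, dim=-1)

        # Combine the two candidate lists into knn * knn product scores.
        scores = s1.unsqueeze(-1) + s2.unsqueeze(1)   # (B*T, knn, knn)
        indices = i1.unsqueeze(-1) * self.n_keys + i2.unsqueeze(1)
        scores, best = scores.view(-1, self.knn ** 2).topk(self.knn, dim=-1)
        indices = indices.view(-1, self.knn ** 2).gather(-1, best)

        # Weighted sum of the selected value vectors.
        weights = F.softmax(scores, dim=-1)
        out = self.values(indices, per_sample_weights=weights)
        return out.view(bsz, seq_len, d_model)
```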
Are you targeting a specific implementation (e.g., OpenNMT-py, OpenNMT-tf)?
If it is a drop-in replacement for an existing layer, I think it is appropriate to expose an optional parameter. Then the implementation should probably use the hyperparameters that produce the best result according to the paper.
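Something along these lines, where use_product_key_memory is just a placeholder name and ProductKeyMemory refers to the sketch above; the wrapper is a simplified post-norm block, not actual OpenNMT layer code.

```python
import torch.nn as nn


class TransformerFFNBlock(nn.Module):
    """Hypothetical feed-forward block with an optional product-key memory."""

    def __init__(self, d_model, d_ff, use_product_key_memory=False):
        super().__init__()
        if use_product_key_memory:
            # Drop-in replacement: same input/output shape as the FFN.
            self.ffn = ProductKeyMemory(d_model)
        else:
            self.ffn = nn.Sequential(
                nn.Linear(d_model, d_ff),
                nn.ReLU(),
                nn.Linear(d_ff, d_model),
            )
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x):
        # Simplified post-norm residual wrapper around the sub-layer.
        return self.norm(x + self.ffn(x))
```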
The fastest way would be to port it to PyTorch, because it is essentially a copy-paste of the original minimalistic implementation that is used for most of the results in the linked paper. But I want to test it with some features of the TensorFlow version. Is it possible to use one model with different settings for each layer?
You would probably need to implement this behavior yourself, but that should not be complicated.
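For example, something like this, reusing the hypothetical TransformerFFNBlock from above and passing a per-layer option when building the stack; the function name and the default layer indices are arbitrary.

```python
import torch.nn as nn


def build_encoder_layers(num_layers, d_model, d_ff, memory_layers=(4, 5)):
    # Enable the product-key memory only in the listed layer indices;
    # all other layers keep the standard feed-forward sub-layer.
    return nn.ModuleList([
        TransformerFFNBlock(d_model, d_ff,
                            use_product_key_memory=(i in memory_layers))
        for i in range(num_layers)
    ])


# Example: a 6-layer stack where only the last two layers use the memory.
layers = build_encoder_layers(num_layers=6, d_model=512, d_ff=2048,
                              memory_layers=(4, 5))
```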