Newbie question on using OpenNMT

austin7lee · April 17, 2018, 8:07pm

Hello,
I’m researching a way to translate from English to a specific type of DSL using OpenNMT. My background in machine learning is limited but growing. I’ve been able to run the quick start for the OpenNMT-py version on an EC2 P2 instance using the GPU, which is much faster.

While exploring conversion to a DSL, I noticed that the translations tend to stick to one type of output. For example, in some really simple DSL queries for numeric comparisons you have phrases like “greater than”, “less than”, etc. For whatever reason, no matter what, the model tends to be biased towards one of these. My training and validation sets are really small, as I’m just exploring whether this is possible.
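
One way to check whether the data itself is skewed towards one comparison is to count how often each phrase appears in the training source file. This is only a rough sketch: the phrase list and the sample sentences are made up for illustration, and it assumes one sentence per line.

```python
from collections import Counter

# Hypothetical comparison phrases; adjust to whatever the DSL actually covers.
PHRASES = ["greater than", "less than", "equal to"]


def phrase_counts(lines, phrases=PHRASES):
    """Count how many lines contain each phrase (case-insensitive)."""
    counts = Counter({p: 0 for p in phrases})
    for line in lines:
        lowered = line.lower()
        for p in phrases:
            if p in lowered:
                counts[p] += 1
    return counts


# Made-up sample standing in for the lines of a training source file.
sample = [
    "show rows where price is greater than NUM",
    "find items with count greater than NUM",
    "list users with age less than NUM",
    "orders with total Greater Than NUM",
]

for phrase, n in phrase_counts(sample).items():
    print(f"{phrase}: {n}")
```

If one phrase dominates the counts, the bias may come from the data distribution rather than from the model setup.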

So my question is whether I’m doing anything wrong, or whether there are suggestions for how to set up my data. I trimmed down the entities by putting in placeholders to represent things like numbers. Or could it just be that I don’t have enough data? If I need more data, would repeating the sentences be OK, or should I find creative ways to keep introducing some form of diversity?
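
To make the placeholder idea concrete, here is a minimal sketch of that kind of substitution. The `<NUM>` token, the regex, and the function name `mask_numbers` are assumptions chosen for illustration; none of them come from OpenNMT itself.

```python
import re

# Matches unsigned integers and simple decimals such as 100 or 250.5.
NUMBER = re.compile(r"\d+(?:\.\d+)?")


def mask_numbers(sentence, token="<NUM>"):
    """Replace each number with a placeholder token.

    Returns the masked sentence and the original values in order,
    so they can be put back into the translated output later.
    """
    values = NUMBER.findall(sentence)
    masked = NUMBER.sub(token, sentence)
    return masked, values


masked, values = mask_numbers("price greater than 100 and less than 250.5")
print(masked)
print(values)
```

The same masking would need to be applied to both the English side and the DSL side, so that the placeholders line up between source and target.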

I’m also wondering about the vocabulary list option and how that would work. What types of words would I put in there? Can they be single words or multi-word phrases?

Thanks.