We’ve had a lot of fun running the software against the sample data you provided, using both a GPU (on Ubuntu Desktop 16.x) and a CPU (on macOS Sierra) — noting the vast difference in performance — and are now looking to attempt a much larger corpus. Next week we’ll try a full cross-linked GPU environment with a beefed-up processor and plenty of RAM. Thanks for releasing this as open source!
So, my question: the pre-process step requires a validation data set for source and target, and I was wondering where I could learn more about this data set and how I should prepare it before we kick off a multi-million-segment training. I’ve looked through the documentation but can’t seem to find anything on it. I know that in SMT we “hold back” similar data sets from training in order to validate the trained engines, but I’m not sure whether the same idea applies here.
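For what it’s worth, here is a rough sketch of how I was planning to carve a validation set out of the parallel corpus, on the assumption that it’s just a random held-out sample. The file names (corpus.src / corpus.tgt) and the 2,000-segment size are purely my own guesses, not anything from the docs — please correct me if the tool expects something different.

```python
# Hold out a random validation set from a line-aligned parallel corpus.
# File names and held-out size below are assumptions, not from the docs.
import random

SRC = "corpus.src"      # assumed source-side file, one segment per line
TGT = "corpus.tgt"      # assumed target-side file, line-aligned with SRC
VALID_SIZE = 2000       # arbitrary number of segments to hold back

with open(SRC, encoding="utf-8") as f:
    src_lines = f.readlines()
with open(TGT, encoding="utf-8") as f:
    tgt_lines = f.readlines()
assert len(src_lines) == len(tgt_lines), "source/target must be line-aligned"

# Pick random segment indices to hold back from training.
valid_idx = set(random.sample(range(len(src_lines)), VALID_SIZE))

for side, lines in (("src", src_lines), ("tgt", tgt_lines)):
    with open(f"train.{side}", "w", encoding="utf-8") as train_f, \
         open(f"valid.{side}", "w", encoding="utf-8") as valid_f:
        for i, line in enumerate(lines):
            (valid_f if i in valid_idx else train_f).write(line)
```

Is that roughly the right approach, or does the validation set need to be prepared differently (e.g. a specific size, domain-matched, or deduplicated against the training data)?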
Sorry if this is a noob-level question!