Validation Dataset - Content License

ymoslem · April 21, 2022, 6:25am

Hello!

My understanding is that non-commercial datasets may not be used to train commercial MT systems, but they can be used in the testing phase.

What about the development/validation dataset? As far as I can see, gradient calculations are disabled during the validation phase, and I am not aware of any parameters tuned based on the validation score (please correct me if I am wrong), so nothing from the validation data really feeds back into training.
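To make the point concrete, here is a minimal pure-Python sketch of a training loop that validates every few steps. It is not OpenNMT code; the names `train`, `validate`, and `valid_every` are made up for illustration, and it assumes the validation score is only logged. The validation pass reads the parameters but never updates them, so the validation data contributes no gradient.

```python
# Hypothetical toy example: fit y = w*x + b by gradient descent,
# scoring on a held-out set every `valid_every` steps.

def validate(w, b, valid_data):
    """Mean squared error on held-out data; reads w and b, never changes them."""
    return sum((w * x + b - y) ** 2 for x, y in valid_data) / len(valid_data)

def train(train_data, valid_data, steps=200, lr=0.01, valid_every=50):
    w, b = 0.0, 0.0
    history = []  # (step, validation score) pairs, kept for logging only
    n = len(train_data)
    for step in range(1, steps + 1):
        # Gradients are computed from the training data only.
        grad_w = sum(2 * (w * x + b - y) * x for x, y in train_data) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in train_data) / n
        w -= lr * grad_w
        b -= lr * grad_b
        if step % valid_every == 0:
            # No gradient and no parameter update in this branch.
            history.append((step, validate(w, b, valid_data)))
    return w, b, history

if __name__ == "__main__":
    train_data = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0)]
    valid_data = [(3.0, 7.0), (4.0, 9.0)]
    w, b, history = train(train_data, valid_data)
    for step, score in history:
        print(step, score)
```

In this sketch, swapping in a different validation set changes the logged scores but leaves the final `w` and `b` identical.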

So, is it generally acceptable to use non-commercial datasets for validation (i.e. after a number of training steps or epochs, as is done in OpenNMT and similar toolkits)?

I would appreciate it if you could share your point of view. Thanks!

Kind regards,
Yasmin

argosopentech · May 16, 2022, 11:16pm

I don’t think the legal status of copyright as it applies to machine learning data is clearly settled. Commercial models like GitHub Copilot and GPT-3 are trained on a lot of data covered by copyright.

If a dataset is released under an explicitly non-commercial license, the intent is likely that the data isn’t used in any part of a commercial process, but there is a lot of gray area on this issue.