Sure. FYI, the base model was built with the UN parallel corpus.
Details of the fine-tuning data:
This is data related to fire standards that we maintain internally within our organization. They are very similar to the legal language used in patents, and include measurements and metrics for building structures and equipment across various domains. We had historical translations of these standards into Spanish, which have been curated so that each English line is matched with its Spanish equivalent.
Train (both En and Es) - 37,000 Parallel Sentences
Validation - 1,000 Parallel Sentences
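For anyone wanting to reproduce a similar setup, here is a minimal sketch of how line-aligned data could be checked and split into train and validation sets. The function name `split_parallel`, the random shuffle, and the fixed seed are assumptions for illustration; the post does not say how the 1,000 validation sentences were actually chosen.

```python
import random


def split_parallel(en_lines, es_lines, n_valid, seed=0):
    """Split line-aligned parallel data into (train, valid) lists of pairs.

    Hypothetical helper: assumes line i of en_lines translates line i of es_lines.
    """
    if len(en_lines) != len(es_lines):
        raise ValueError("English and Spanish sides must have the same number of lines")
    if n_valid > len(en_lines):
        raise ValueError("n_valid is larger than the data set")
    # Zip first so each English line stays attached to its Spanish line.
    pairs = list(zip(en_lines, es_lines))
    # Assumption: a seeded random shuffle; any held-out selection would do.
    random.Random(seed).shuffle(pairs)
    return pairs[n_valid:], pairs[:n_valid]
```

With 38,000 aligned pairs, `split_parallel(en, es, n_valid=1000)` would give the 37,000 / 1,000 sizes listed above.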
Pre-processing: tokenized and BPE-encoded the data, then built the vocabulary on the training data set.
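To make the vocabulary step concrete, here is a rough sketch of building a vocabulary from the training side only, assuming the lines are already tokenized and BPE-encoded with tokens separated by spaces. The helper `build_vocab`, the `min_freq` parameter, and the special tokens are hypothetical and not taken from whatever toolkit was actually used; the `@@` continuation marker in the usage line is likewise only an assumed convention.

```python
from collections import Counter


def build_vocab(tokenized_lines, min_freq=1, specials=("<unk>", "<s>", "</s>")):
    """Map each token seen in the training lines to an integer id.

    Hypothetical helper: expects whitespace-separated, already BPE-encoded text.
    """
    counts = Counter(tok for line in tokenized_lines for tok in line.split())
    # Keep tokens that meet the frequency threshold and are not special symbols.
    kept = [t for t, c in counts.items() if c >= min_freq and t not in specials]
    # Most frequent first; ties broken alphabetically so ids are deterministic.
    kept.sort(key=lambda t: (-counts[t], t))
    return {tok: idx for idx, tok in enumerate(list(specials) + kept)}


# Usage: only training lines are passed in, so validation text never affects the ids.
vocab = build_vocab(["fire al@@ arm", "fire door"])
```

Tokens that appear only in the validation set would then fall back to `<unk>`, although with BPE most of them decompose into subwords already seen in training.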
Let me know if you need any specific information.