Hello, I wanted to ask about the effectiveness of transformer models on a small dataset (200K examples). I have been training a small transformer model (1 layer, 512 dim, 4 heads). I am trying to extract the amount and date from OCR text of receipts, so the source sentences are very long (~500 words) and the target sequences are just one word. I have run the training for 100K iterations, but the loss seems very high (~1.5). Should I keep running it for longer? Is a transformer model as effective as an RNN on a small dataset like mine?
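For concreteness, a rough sketch of one encoder layer with those dimensions in plain tf.keras; the vocabulary size, feed-forward width, and the omitted positional encoding/masking are placeholders, not the actual training code:

```python
import tensorflow as tf

# Sketch of a single encoder layer: d_model=512, 4 heads.
# Vocab size and dff are assumptions; positional encoding is omitted.
d_model, num_heads, dff, vocab_size = 512, 4, 2048, 8000

tokens = tf.keras.Input(shape=(None,), dtype=tf.int32)   # ~500 OCR tokens per receipt
x = tf.keras.layers.Embedding(vocab_size, d_model)(tokens)
attn = tf.keras.layers.MultiHeadAttention(
    num_heads=num_heads, key_dim=d_model // num_heads)(x, x)
x = tf.keras.layers.LayerNormalization()(x + attn)
ffn = tf.keras.layers.Dense(d_model)(
    tf.keras.layers.Dense(dff, activation="relu")(x))
x = tf.keras.layers.LayerNormalization()(x + ffn)
encoder = tf.keras.Model(tokens, x)
```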
At least for translation tasks, it is known that Transformer models perform worse than LSTMs in small-data regimes.
But a loss of 1.5 seems quite good to me? Did you try running inference on real data?
So I guess I need to augment the data.
I will do an evaluation on the test set to compute sequence accuracy today.
I had a look at the sequence accuracy and it is actually pretty terrible. Could the low loss be due to the small target lengths?
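By sequence accuracy I mean exact match of the full decoded target against the reference; a minimal sketch (the `predict` function in the usage comment is just a placeholder for the model's decoding step):

```python
def sequence_accuracy(predictions, references):
    """Fraction of examples whose decoded target matches the reference exactly."""
    assert len(predictions) == len(references)
    exact = sum(p.strip() == r.strip() for p, r in zip(predictions, references))
    return exact / len(references)

# Hypothetical usage:
# preds = [predict(src) for src in test_sources]
# print(sequence_accuracy(preds, test_targets))
```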
Hello Renjith,
A pretrained Transformer like BART can be fine-tuned on a small dataset. Which language are you training on?
How many target words do you have? Do they occur in the source sentence?
Are the same words used for multiple labels? If the targets form repetitive clusters, then you should try a classification model.
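As a rough sketch of the fine-tuning idea with the Hugging Face transformers library (the checkpoint name, max length, and learning rate are assumptions, not a tested recipe):

```python
import torch
from transformers import BartTokenizerFast, BartForConditionalGeneration

# Assumed checkpoint; for multilingual receipts an mBART checkpoint would be the analogue.
tokenizer = BartTokenizerFast.from_pretrained("facebook/bart-base")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-base")
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5)

def training_step(ocr_texts, targets):
    # OCR sources are long, so truncate to the model's maximum input length.
    batch = tokenizer(ocr_texts, padding=True, truncation=True,
                      max_length=1024, return_tensors="pt")
    labels = tokenizer(text_target=targets, padding=True, truncation=True,
                       return_tensors="pt").input_ids
    # (Ideally pad token ids in `labels` are replaced with -100 so they are ignored by the loss.)
    loss = model(**batch, labels=labels).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```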
Greetings from the multi translation space
https://bachstelze.gitlab.io/multisource/
The source dataset is in multiple languages (invoice OCR text).
The target words are all digits (amounts, dates). So my target vocab is quite small.
An example from the dataset would look like this:
Source:
Fl Qbuzz \ N# Vervoerbewijs 2,70 Eurokaartje VERKOCHT: 10/03/18 10:5lF 1 uur geldig in maximaal 2 zones Overstappen is toegestaan i PRIJS: € 2,70 Zonenummer: 1200 Ticketnummer: 01034531906151051001 Voertuignummer: 3453
Target 1 (amount):
2.70
Target 2 (date):
2018-03-10
The targets may not occur verbatim in the source due to differences in format (e.g. 2,70 -> 2.70), or due to OCR errors (2.T0 -> 2.70).
Are the multiple languages included in the pretraining of mBART?
But to me it seems like an OCR correction and extraction task, on top of the multilingual aspect of the problem.
The OCR text is mainly in English, German, Russian, etc.
Yes, it is an OCR correction and extraction task. My training data has date and amount labels in a consistent format (YYYY-MM-DD and %.2f).
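To picture the label format, a small sketch of how such targets could be normalized (hypothetical helpers; the real preprocessing has to handle more input formats than this):

```python
from datetime import datetime

def normalize_amount(raw: str) -> str:
    """'2,70' or '2.70' -> '2.70' (the %.2f label format)."""
    return f"{float(raw.replace(',', '.')):.2f}"

def normalize_date(raw: str, fmt: str = "%d/%m/%y") -> str:
    """'10/03/18' -> '2018-03-10' (the YYYY-MM-DD label format); input format is an assumption."""
    return datetime.strptime(raw, fmt).strftime("%Y-%m-%d")

# Example from the receipt above:
# normalize_amount("2,70")   -> "2.70"
# normalize_date("10/03/18") -> "2018-03-10"
```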
I used the transformer implementation from here (https://www.tensorflow.org/tutorials/text/transformer) with a subword tokenizer for the OCR/amount/date text, and achieved just 65% sequence accuracy.
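A rough sketch of how a shared subword vocabulary can be built with tensorflow_datasets, the way older versions of that tutorial did it (the module path depends on the tfds version, and the corpus and vocabulary size here are placeholders):

```python
import tensorflow_datasets as tfds

# Tiny placeholder corpus; in practice this would be the OCR sources plus
# the normalized amount/date targets.
corpus = [
    "PRIJS: € 2,70 Zonenummer: 1200",
    "VERKOCHT: 10/03/18",
    "2.70",
    "2018-03-10",
]

# SubwordTextEncoder moved to tfds.deprecated.text in recent tfds releases.
tokenizer = tfds.deprecated.text.SubwordTextEncoder.build_from_corpus(
    (line for line in corpus), target_vocab_size=256)

ids = tokenizer.encode("PRIJS: € 2,70")
print(tokenizer.decode(ids))
```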