Hello, I wanted to ask about the effectiveness of transformer models on a small dataset (200K examples). I have been training a small transformer model (1 layer, 512 dim, 4 heads). I am trying to extract the amount and date from OCR text of receipts, so the source sentences are very long (~500 words) and the target sequences are just one word. I have run the training for 100K iterations, but the loss seems very high (~1.5). Should I keep running it for longer? Is a transformer model as effective as an RNN on a small dataset like mine?
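For concreteness, a rough sketch of one encoder layer with those dimensions in plain tf.keras; the vocabulary size, feed-forward width, and the omitted positional encoding/masking are placeholders, not the actual training code:

```python
import tensorflow as tf

# Sketch of a single encoder layer: d_model=512, 4 heads.
# Vocab size and dff are assumptions; positional encoding is omitted.
d_model, num_heads, dff, vocab_size = 512, 4, 2048, 8000

tokens = tf.keras.Input(shape=(None,), dtype=tf.int32)   # ~500 OCR tokens per receipt
x = tf.keras.layers.Embedding(vocab_size, d_model)(tokens)
attn = tf.keras.layers.MultiHeadAttention(
    num_heads=num_heads, key_dim=d_model // num_heads)(x, x)
x = tf.keras.layers.LayerNormalization()(x + attn)
ffn = tf.keras.layers.Dense(d_model)(
    tf.keras.layers.Dense(dff, activation="relu")(x))
x = tf.keras.layers.LayerNormalization()(x + ffn)
encoder = tf.keras.Model(tokens, x)
```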
At least for translation tasks, it is known that Transformer models perform worse than LSTMs in small-data regimes.
But a loss of 1.5 seems quite good to me? Did you try running inference on real data?
So I guess I need to augment the data.
I will do an evaluation on the test set to compute sequence accuracy today.
I had a look at the sequence accuracy and it is actually pretty terrible. Could the low loss be due to the small target lengths?
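By sequence accuracy I mean exact match of the full decoded target against the reference; a minimal sketch (the `predict` function in the usage comment is just a placeholder for the model's decoding step):

```python
def sequence_accuracy(predictions, references):
    """Fraction of examples whose decoded target matches the reference exactly."""
    assert len(predictions) == len(references)
    exact = sum(p.strip() == r.strip() for p, r in zip(predictions, references))
    return exact / len(references)

# Hypothetical usage:
# preds = [predict(src) for src in test_sources]
# print(sequence_accuracy(preds, test_targets))
```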
Hello Renjith,
A pretrained Transformer like BART can be fine-tuned on a small dataset. Which language are you training on?
How many target words do you have? Do they occur in the source sentence?
Are the same words used for multiple labels? If the targets form repetitive clusters, then you should try a classification model.
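As a rough sketch of the fine-tuning idea with the Hugging Face transformers library (the checkpoint name, max length, and learning rate are assumptions, not a tested recipe):

```python
import torch
from transformers import BartTokenizerFast, BartForConditionalGeneration

# Assumed checkpoint; for multilingual receipts an mBART checkpoint would be the analogue.
tokenizer = BartTokenizerFast.from_pretrained("facebook/bart-base")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-base")
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5)

def training_step(ocr_texts, targets):
    # OCR sources are long, so truncate to the model's maximum input length.
    batch = tokenizer(ocr_texts, padding=True, truncation=True,
                      max_length=1024, return_tensors="pt")
    labels = tokenizer(text_target=targets, padding=True, truncation=True,
                       return_tensors="pt").input_ids
    # (Ideally pad token ids in `labels` are replaced with -100 so they are ignored by the loss.)
    loss = model(**batch, labels=labels).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```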
Greetings from the multi translation space
https://bachstelze.gitlab.io/multisource/
The source dataset is in multiple languages (invoice OCR text).
The target words are all digits (amounts, dates). So my target vocab is quite small.
An example from the dataset would look like this:
Source:
Fl Qbuzz \ N# Vervoerbewijs 2,70 Eurokaartje VERKOCHT: 10/03/18 10:5lF 1 uur geldig in maximaal 2 zones Overstappen is toegestaan i PRIJS: € 2,70 Zonenummer: 1200 Ticketnummer: 01034531906151051001 Voertuignummer: 3453
Target 1 (amount):
2.70
Target 2 (date):
2018-03-10
The targets may not occur verbatim in the source due to differences in format (e.g. 2,70 -> 2.70), or due to OCR errors (2.T0 -> 2.70).
Are the multiple languages included in the pretraining of mBART?
But to me it seems like an OCR correction and extraction task, on top of the multilingual aspect of the problem.
The OCR text is mainly in English, German, Russian, etc.
Yes, it is an OCR correction and extraction task. My training data has date and amount labels in a consistent format (YYYY-MM-DD and %.2f).
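To picture the label format, a small sketch of how such targets could be normalized (hypothetical helpers; the real preprocessing has to handle more input formats than this):

```python
from datetime import datetime

def normalize_amount(raw: str) -> str:
    """'2,70' or '2.70' -> '2.70' (the %.2f label format)."""
    return f"{float(raw.replace(',', '.')):.2f}"

def normalize_date(raw: str, fmt: str = "%d/%m/%y") -> str:
    """'10/03/18' -> '2018-03-10' (the YYYY-MM-DD label format); input format is an assumption."""
    return datetime.strptime(raw, fmt).strftime("%Y-%m-%d")

# Example from the receipt above:
# normalize_amount("2,70")   -> "2.70"
# normalize_date("10/03/18") -> "2018-03-10"
```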
I used the transformer implementation from here (https://www.tensorflow.org/tutorials/text/transformer) with a subword tokenizer for the OCR/amount/date text, and achieved just 65% sequence accuracy.
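A rough sketch of how a shared subword vocabulary can be built with tensorflow_datasets, the way older versions of that tutorial did it (the module path depends on the tfds version, and the corpus and vocabulary size here are placeholders):

```python
import tensorflow_datasets as tfds

# Tiny placeholder corpus; in practice this would be the OCR sources plus
# the normalized amount/date targets.
corpus = [
    "PRIJS: € 2,70 Zonenummer: 1200",
    "VERKOCHT: 10/03/18",
    "2.70",
    "2018-03-10",
]

# SubwordTextEncoder moved to tfds.deprecated.text in recent tfds releases.
tokenizer = tfds.deprecated.text.SubwordTextEncoder.build_from_corpus(
    (line for line in corpus), target_vocab_size=256)

ids = tokenizer.encode("PRIJS: € 2,70")
print(tokenizer.decode(ids))
```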