Token concept in batch_size, training_step, epoch

Hi,
When I set up the Transformer config file, I wonder how I can calculate a proper training_step.
The tokens batch_type confuses me.
If I use batch_type of sentence, or think in epochs, I can easily tell how far training has progressed through the full dataset (one epoch).
But with batch_type of tokens, I don't know how to do that.

  1. How can I know the total number of tokens in the training dataset and set a proper training_step?
  2. And what is the effect of accum_count? (Does it just multiply batch_size at the same time? Does it use more GPU memory?)

ex)

batch_size: 4096
batch_type: tokens
accum_count: 8
max_generator_batches: 2

# total sentences in the training dataset: 2M

The training log at the beginning:

[2022-02-11 14:28:16,031 INFO] Step 1000/500000; acc:  15.54; ppl: 445.55; xent: 6.10; lr: 0.00012; 9709/9217 tok/s;   2307 sec
[2022-02-11 15:05:29,537 INFO] Step 2000/500000; acc:  33.13; ppl: 43.33; xent: 3.77; lr: 0.00025; 10909/10052 tok/s;   4541 sec
  1. What is tok/s in the training log? (Are the two numbers tokens and sentences? It looks odd.)

Hi,

The update step of the model is based on the effective batch size (with gradient accumulation, roughly batch_size × accum_count tokens per update), not on batch_size alone. Empirically, a sentence contains about twenty-five tokens on average, so with an effective batch size of 25000 tokens the model is updated every 25000/25 = 1000 sentences. Of course, you can count the number of tokens before training; if you use the sentencepiece package to tokenize the corpus, you can do this easily.
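
For example, a minimal sketch of counting tokens with the sentencepiece Python package (the model file name and corpus path below are placeholders, not from this thread):

import sentencepiece as spm

# Load a trained SentencePiece model; "spm.model" and "train.src" are placeholder names.
sp = spm.SentencePieceProcessor(model_file="spm.model")

total_tokens = 0
total_sentences = 0
with open("train.src", encoding="utf-8") as f:
    for line in f:
        total_tokens += len(sp.encode(line.strip()))  # subword tokens in this sentence
        total_sentences += 1

avg_len = total_tokens / max(total_sentences, 1)
print(f"sentences={total_sentences} tokens={total_tokens} avg tokens/sentence={avg_len:.1f}")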

Kind regards,
Liu Xiaofeng


Thanks for answering my question! I didn't know about effective_batch_size before.
But I still can't understand the Step 1000/500000 in the training log.
500k is the total number of training steps that I set.
If Step 1000 means 1000 sentences, then 500k training steps would mean 500k sentences.
But my dataset consists of 2.5M sentences, and 500k training steps is more than 20 epochs.

Hi,

In the training log, the two step numbers are effective update steps, which are based on the effective batch size.

Here is an example. If batch_type=tokens, batch_size=2000, effective_batch_size=20000, max_step=500K, the corpus size is 2.5M sentences, and the average sentence length is 20 tokens, then:
the model is updated every 20000/20 = 1K sentences, i.e. each step consumes 1K sentences. Thus, one epoch contains 2.5M/1K = 2.5K steps, and 500K steps correspond to 500K/2.5K = 200 epochs.
This is based on my experience with OpenNMT-tf, and I hope it helps.
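
Applying the same arithmetic to the config in the first post, here is a rough sketch (assuming a single GPU, an effective batch of batch_size * accum_count tokens, and a guessed average of 25 tokens per sentence; none of these numbers are confirmed in the thread):

batch_size = 4096               # tokens per batch (batch_type: tokens)
accum_count = 8
avg_tokens_per_sentence = 25    # assumption, measure your own corpus
corpus_sentences = 2_000_000
train_steps = 500_000

effective_batch_tokens = batch_size * accum_count             # 32768 tokens per update
sentences_per_step = effective_batch_tokens / avg_tokens_per_sentence
steps_per_epoch = corpus_sentences / sentences_per_step
epochs = train_steps / steps_per_epoch

print(f"tokens per update:  {effective_batch_tokens}")
print(f"sentences per step: {sentences_per_step:.0f}")
print(f"steps per epoch:    {steps_per_epoch:.0f}")
print(f"epochs in {train_steps} steps: {epochs:.0f}")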
