Part II - The starting point
“Capacity for information is the key. Machines will absorb not only more raw data, but more feelings, as soon as we understand them. When that happens we will be able to love and hate far more passionately than humans. Our music will be greater, our paintings more magnificent. Once we achieve complete self-awareness, thinking machines will create the greatest renaissance in history.”
— Erasmus, Legends of Dune
The original experiment
The original paper, published in 2009, uses a very simple and tiny (by today's standards) neural network, as described in the following figure:
Unlike previous approaches to neural language modeling, and following (Collobert, 2008), this language model does not try to predict the probability of the word to score; it instead implements a max-margin loss function and uses ranking as an evaluation criterion.
The max-margin loss function is defined by the following equation:

$$L(s, w) = \max\big(0,\; 1 - f(\text{context}, s) + f(\text{context}, w)\big)$$

where s is the word we are scoring (the last word of the window), w is a counter-example word randomly sampled from the vocabulary, and f is the score computed by the network. The loss simply tries to push the score of the context with the random word w to be at least 1 smaller than the score of the context with the original word s. This naturally teaches the neural network to rank the possible words that can follow the context: the higher the score, the more likely the word.
The evaluation of the performance of the network is then naturally performed using a ranking criterion: scoring all the possible vocabulary words following a given context, what is the rank of the actual word?
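To make the loss and the evaluation criterion concrete, here is a minimal NumPy sketch (my own illustration, not the original code): the pairwise hinge loss for one (s, w) pair, and the log-rank of the true word when the context is scored against the whole vocabulary.

```python
import numpy as np

def pairwise_hinge_loss(score_true, score_random):
    # Push the score of (context, s) to be at least 1 above the score of (context, w)
    return max(0.0, 1.0 - score_true + score_random)

def log_rank(scores_over_vocab, true_word_id):
    # Rank of the true word among all vocabulary candidates (1 = best), in log scale
    order = np.argsort(-scores_over_vocab)
    rank = int(np.where(order == true_word_id)[0][0]) + 1
    return np.log(rank)

# Toy example: a 20k-word vocabulary scored for one context
scores = np.random.randn(20000)
print(pairwise_hinge_loss(score_true=scores[42], score_random=scores[7]))
print(log_rank(scores, true_word_id=42))
```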
This simple architecture has been a key milestone in the development of word embeddings since most of the parameters of the network sit in the lookup table (1M parameters for a 20k vocabulary, compared to 25100 additional parameters for the 2 downstream dense layers): what the model learns is essentially a good word vector representation fitting the ranking task.
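As a quick check of these numbers, here is the parameter-count arithmetic, assuming the 50-dimensional embeddings and 100-unit hidden layer used in my reproduction below:

```python
vocab, embedding_size, window_size, dense_size = 20000, 50, 5, 100

lookup_table = vocab * embedding_size                                  # 20,000 x 50 = 1,000,000
hidden_layer = window_size * embedding_size * dense_size + dense_size  # 250 x 100 + 100 = 25,100
scoring_unit = dense_size + 1                                          # the final scalar score adds ~100 more

print(lookup_table, hidden_layer, scoring_unit)  # 1000000 25100 101
```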
The training process is pretty straightforward: in the original paper, the vocabulary is made of lowercased plain words, limited to the 20k most frequent. The training corpus is the 2008 English Wikipedia, making 631M 5-word windows (excluding windows with out-of-vocabulary tokens).
The baseline training (no curriculum) runs for ~3 full epochs over the training corpus, iteratively sampling pairs (s, w) used to compute the loss for standard SGD optimization.
In contrast, the curriculum learning run performs 3 iterations over the same data, but for each iteration the vocabulary is limited to 1/ the 5000 most frequent words, 2/ the 10000 most frequent words, and then 3/ the full vocabulary. In each of these iterations, the training windows are therefore limited to the windows containing only tokens from the restricted vocabulary (respectively 270M, 370M and 638M windows). For the evaluation, the complete vocabulary is of course kept.
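A minimal sketch of this window-selection logic (my own reconstruction, not code from the paper), assuming word ids are assigned by decreasing frequency so that a vocabulary cut is simply a threshold on the ids:

```python
def select_windows(windows, vocab_limit):
    # Keep only the 5-word windows made exclusively of the `vocab_limit` most frequent words.
    # Word ids are assumed to be assigned by decreasing frequency (id 0 = most frequent).
    return [w for w in windows if max(w) < vocab_limit]

# Toy example: 3 windows over a 20k-word vocabulary
all_windows = [[12, 3, 410, 7, 4999], [12, 3, 410, 7, 15000], [2, 2, 2, 2, 19999]]

# Curriculum: one pass per vocabulary limit, ending with the full vocabulary
for vocab_limit in [5000, 10000, 20000]:
    selected = select_windows(all_windows, vocab_limit)
    print(vocab_limit, len(selected))  # 1, 1 and 3 windows selected respectively
```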
The results are the following:
The authors conclude: “we observe that the log rank on the target distribution with the curriculum strategy crosses the error of the no-curriculum strategy after about 1 billion updates, shortly after switching to the target vocabulary size of 20,000 words, and the difference keeps increasing afterwards. The final test set average log-ranks are 2.78 and 2.83 respectively, and the difference is statistically significant”.
Interpretation of the result:
- we can clearly see the different phases of the training in the “curriculum” curve. In particular, we can see that the model slows down at the end of each intermediate phase: this is expected, as the model does not yet know the complete vocabulary and is therefore unable to guess the rank of words unseen in training (for which the lookup table is still completely random).
- After 1500 million updates, for both models, the learning has still not converged. The average log-rank is around 2.8, corresponding to a rank of ~16 (note that the average of the log-rank is not the log of the average rank, so the average rank is not as good as ~16).
- On the no-curriculum curve, we can see a first phase (between 0-100M updates) where the learning seems to slow down before restarting more steadily. The same effect can be observed at the very beginning of the curriculum curve; I will shed more light on this effect in my own experiment.
- The paper does not indicate why the difference can be concluded to be statistically significant. We must however note that the difference corresponds to about 1 position in non-log ranking (16.12 vs 16.95, see the quick check below). It is a win, but not a huge one, and we can suppose that the two models would eventually have converged to the same value. However, we can also see that the final value reached after 1500 million updates by the no-curriculum curve is reached at around 1100 million updates by the curriculum curve. So the acceleration of the training is really significant here (a ~2-day win out of a reported 7-day training time).
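For reference, the quick check converting the reported average log-ranks back to plain ranks (assuming natural logarithms):

```python
import math
print(math.exp(2.78), math.exp(2.83))  # ~16.12 vs ~16.95: about one rank position apart
```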
Last, it is interesting to note that the model is really tiny by our 2020 standards: with 1M parameters, it is 500 to 1000 times smaller than a real-scale transformer model!
My reproduction of the experiment
In the spirit of a faithful reproduction of the original paper, I reproduced the same network and experiment. The commented code is here.
However, to reduce the differences with the following experiments on Neural Machine Translation, I use the same preprocessed English corpus as the one used for the reference WMT English-German training defined here. The details are:
- Corpus - 4.5M sentences from:
- Preprocessing: Sentence Piece with 32k vocabulary
For the LM training with the top 20k tokens, this makes a total of 117M 5-token windows.
Let us note that, working with sentence piece tokens, the scoring task with only 4 context tokens is far harder: the actual context is smaller than the 4 full words of the original experiment, and the 20k vocabulary covers a far larger share of the token space (20k out of the total 32k sentence piece vocabulary, compared to 20k words out of … 840,791 different tokens in the simple lowercased tokenized corpus). So we expect the Language Model to struggle far more.
For instance, for the sentence “Airbus says the competing version of its A350 will carry 350 people in 18-inch-wide economy seat laid out 9 abreast.”, the OOV tokens and the extracted windows are (a sketch of this extraction is given after the table):
| | Sentence Piece (top 20k vocab) | Simple Tokenization (top 20k vocab) |
|---|---|---|
| OOV | (none) | a350, 18-inch-wide |
| windows | ▁Airbus ▁says ▁the ▁competing ▁version | airbus says the competing version |
| | ▁says ▁the ▁competing ▁version ▁of | says the competing version of |
| | ▁the ▁competing ▁version ▁of ▁its | the competing version of its |
| | ▁competing ▁version ▁of ▁its ▁A | will carry 350 people in |
| | ▁version ▁of ▁its ▁A 350 | economy seat laid out 9 |
| | ▁of ▁its ▁A 350 ▁will | seat laid out 9 abreast |
| | ▁its ▁A 350 ▁will ▁carry | laid out 9 abreast . |
| | ▁A 350 ▁will ▁carry ▁350 | |
| | 350 ▁will ▁carry ▁350 ▁people | |
| | ▁will ▁carry ▁350 ▁people ▁in | |
| | ▁carry ▁350 ▁people ▁in ▁18 | |
| | ▁350 ▁people ▁in ▁18 - | |
| | ▁people ▁in ▁18 - in | |
| | ▁in ▁18 - in ch | |
| | ▁18 - in ch - | |
| | - in ch - wide | |
| | in ch - wide ▁economy | |
| | ch - wide ▁economy ▁seat | |
| | - wide ▁economy ▁seat ▁laid | |
| | wide ▁economy ▁seat ▁laid ▁out | |
| | ▁economy ▁seat ▁laid ▁out ▁9 | |
| | ▁seat ▁laid ▁out ▁9 ▁a | |
| | ▁laid ▁out ▁9 ▁a bre | |
| | ▁out ▁9 ▁a bre ast | |
| | ▁9 ▁a bre ast . | |
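The simple-tokenization column above can be reproduced with a sliding-window extraction that drops any window containing an out-of-vocabulary token. Here is a sketch (the vocabulary set below is just a stand-in for the real top-20k word list):

```python
def extract_windows(tokens, vocab, size=5):
    # Slide a fixed-size window over the token sequence and keep only
    # the windows whose tokens are all in the (top-20k) vocabulary.
    windows = [tokens[i:i + size] for i in range(len(tokens) - size + 1)]
    return [w for w in windows if all(t in vocab for t in w)]

sentence = ("airbus says the competing version of its a350 will carry 350 people "
            "in 18-inch-wide economy seat laid out 9 abreast .")
tokens = sentence.split()
vocab = set(tokens) - {"a350", "18-inch-wide"}  # placeholder for the real top-20k word list
for w in extract_windows(tokens, vocab):
    print(" ".join(w))  # prints the 7 windows of the simple-tokenization column
```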
The results, directly exported from TensorBoard, are represented by the following graph:
These graphs show the log-rank evolution with the number of presented examples. The difference between the experiments is how the covered vocabulary evolves, as represented by the following graph:
- the blue curve does not select the windows based on vocabulary: from the very first updates, all examples are used to train the model
- the 3-steps-manual curve (red) follows exactly the curriculum of the paper: during a first pass over the complete corpus, only the windows containing only the 5000 most frequent tokens are selected (about 71M windows); in a second pass, the windows restricted to the 10000 most frequent tokens (about 100M windows); and afterwards, all of the windows are kept.
- the green curve follows a uniform 4-step curriculum, adding an extra phase limited to windows containing only the top-15000 tokens before moving to the full vocabulary (the schedules are summarized in the sketch below)
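The three schedules can be summarized as simple vocabulary-threshold lists (my own notation; the run names match the curves above, and each threshold corresponds to one pass of window selection as sketched earlier):

```python
# Vocabulary threshold used for window selection at each phase of training
curricula = {
    "baseline":       [20000],                       # full vocabulary from the very first update
    "3-steps-manual": [5000, 10000, 20000],          # the curriculum of the original paper
    "4-steps":        [5000, 10000, 15000, 20000],   # uniform 4-step variant (green curve)
}
```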
The parameters used for this experiment are given by the following configuration:
batch_size = 256
steps_per_epoch = 16384
sgd_learning_rate = 0.01
sgd_momentum = 0.9
sgd_decay = 1e-06
buffer_size = 500000
vocab = 20000
include_unk = False
window_size = 5
embedding_size = 50
dense_size = 100
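For reference, here is a minimal sketch of how a scoring network matching this configuration could be assembled in Keras. This is my own reconstruction, not necessarily identical to the commented code linked above; the tanh activation is an assumption, and the `decay` argument follows the older tf.keras SGD signature.

```python
import tensorflow as tf
from tensorflow.keras import layers

window_size, vocab, embedding_size, dense_size = 5, 20000, 50, 100

inputs = layers.Input(shape=(window_size,), dtype="int32")   # a 5-token window of token ids
x = layers.Embedding(vocab, embedding_size)(inputs)          # the ~1M-parameter lookup table
x = layers.Flatten()(x)                                      # 5 x 50 = 250 features
x = layers.Dense(dense_size, activation="tanh")(x)           # hidden layer (25,100 parameters)
score = layers.Dense(1)(x)                                   # scalar score of the window

model = tf.keras.Model(inputs, score)
optimizer = tf.keras.optimizers.SGD(learning_rate=0.01, momentum=0.9, decay=1e-06)
```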
Exactly like in the paper, we observe the same impact of curriculum learning:
- In the very first phase(s), the performance of the curriculum-learning runs is worse than the baseline. This is totally expected since the model is only trained with a limited vocabulary, and is therefore completely ignorant when dealing with test-set windows containing rare tokens.
- Every time the vocabulary increases, the learning speeds up and quickly catches up with the full-vocabulary run.
- All of the experiments with curriculum learning start the last phase of their training with a huge head start on the baseline training: at only 200M updates, the 3 curriculum-learning experiments already reach the score that the baseline only reaches after 300M updates.
As hypothesized, however, the head start given by curriculum learning gradually shrinks and eventually wears out completely.
An independent effect, more clearly visible in this experiment, is that the very first phase of the learning seems to stall before finally kicking in around 50M updates. During this initial phase, very visible on the training-loss evolution below and also observable in the test outputs, the model seems to simply be learning to sort the tokens, which is indeed an easy first quick win.
Some examples of ranked tokens after 50M updates:
| Context | w | rank of w | #1 | #2 | #3 | #4 | #5 | #6 | #7 |
|---|---|---|---|---|---|---|---|---|---|
| ▁But ▁the ▁victim ' | s | 3 | . | , | s | - | ▁and | ▁in | ' |
| ▁would ▁want ▁to ▁hurt | ▁him | 176 | . | , | ▁of | ▁to | ▁in | ▁and | ▁the |
| ▁line ▁cook ▁in ▁Boston | , | 2 | . | , | ▁of | ▁in | - | s | ▁and |
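For completeness, the rankings in this table can be obtained by scoring the same 4-token context completed with every candidate token and sorting the scores. Here is a sketch, assuming a trained `model` like the one sketched earlier and hypothetical `token_to_id` / `id_to_token` mappings:

```python
import numpy as np

def rank_candidates(model, context_ids, vocab_size=20000, top_n=7):
    # Build one window per candidate fifth token and score them all in a single batch.
    windows = np.array([context_ids + [cand] for cand in range(vocab_size)])
    scores = model.predict(windows, verbose=0).reshape(-1)
    return np.argsort(-scores)[:top_n]  # ids of the top-ranked candidates, best first

# Hypothetical mappings between sentence piece tokens and ids
context = [token_to_id[t] for t in ["▁But", "▁the", "▁victim", "'"]]
for token_id in rank_candidates(model, context):
    print(id_to_token[token_id])
```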
Conclusion
The curriculum learning effect presented in the original paper is fully reproduced in our experiment, even though our task is harder than the original one. The main effect of curriculum learning is a large boost in convergence speed, but we also show that, for this experiment, training with and without curriculum learning eventually reaches the same performance.
The question we will try to answer in the following posts is whether the same approach works with modern neural network architectures and with tasks like Neural Machine Translation…