OpenNMT-py BERT Tutorial

How do I use BERT?

BERT is a general-purpose “language understanding” model introduced by Google. It can be used for various downstream NLP tasks and easily adapted to a new task using transfer learning. Using BERT involves two stages: pre-training and fine-tuning. As pre-training is extremely expensive, we do not recommend pre-training a BERT model from scratch; instead, load weights from an existing pretrained model and fine-tune them. Currently we support the sentence(-pair) classification and token tagging downstream tasks.

Use pretrained BERT weights

To use weights from an existing huggingface pretrained model, we provide a script to convert huggingface’s BERT model weights into our format.

Usage:

python bert_ckp_convert.py   --layers NUMBER_LAYER
                             --bert_model_weights_file HUGGINGFACE_BERT_WEIGHTS
                             --output_name OUTPUT_FILE
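
For example, assuming you have downloaded the 12-layer bert-base-uncased weights from huggingface (the file and output names below are illustrative), the conversion could look like:

python bert_ckp_convert.py   --layers 12
                             --bert_model_weights_file bert-base-uncased/pytorch_model.bin
                             --output_name bert_base_uncased_onmt.pt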

Preprocess train/dev dataset

To generate train/dev data for BERT, you can use preprocess_bert.py: provide raw data in the expected format and choose a BERT tokenizer model (-vm) consistent with the pretrained model you load.

Classification

For classification datasets, we support input files in csv or plain text format.

  • For a csv file, each line should contain one instance with one or two sentence columns and one label column, as in the GLUE datasets; other csv-format datasets should also be compatible. A typical csv file looks like:

    ID SENTENCE_A SENTENCE_B(Optional) LABEL
    0 sentence a of instance 0 sentence b of instance 0 class 2
    1 sentence a of instance 1 sentence b of instance 1 class 1

    Then call preprocess_bert.py, providing the input sentence columns and the label column:

    python preprocess_bert.py --task classification --corpus_type {train, valid}
                              --file_type csv [--delimiter '\t'] [--skip_head]
                              --input_columns 1 2 --label_column 3
                              --data DATA_DIR/FILENAME.tsv
                              --save_data dataset
                              -vm bert-base-cased --max_seq_len 256 [--do_lower_case]
                              [--sort_label_vocab] [--do_shuffle]
    
  • For the plain text format, we accept multiple files as input, where each file contains the instances of one specific class. Each line of a file contains one instance, which can be a single sentence or two sentences separated by |||. All input files should be arranged in the following way (a concrete sketch is given after this list):

       .
       ├── LABEL_A
       │   └── FILE_WITH_INSTANCE_A
       └── LABEL_B
           └── FILE_WITH_INSTANCE_B
    

    Then call preprocess_bert.py as follows to generate training data:

    python preprocess_bert.py --task classification --corpus_type {'train', 'valid'}
                              --file_type txt [--delimiter ' ||| ']
                              --data DIR_BASE/LABEL_1/FILENAME1 ... DIR_BASE/LABEL_N/FILENAME2
                              --save_data dataset
                              --vocab_model {bert-base-uncased,...}
                              --max_seq_len 256 [--do_lower_case]
                              [--sort_label_vocab] [--do_shuffle]
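
As a concrete sketch (the directory, file, and label names below are purely illustrative), a two-class sentiment corpus, with labels taken from the parent directory names as in the layout above, could be arranged and preprocessed like this:

.
├── positive
│   └── reviews.txt
└── negative
    └── reviews.txt

python preprocess_bert.py --task classification --corpus_type train
                          --file_type txt --delimiter ' ||| '
                          --data ./positive/reviews.txt ./negative/reviews.txt
                          --save_data dataset
                          --vocab_model bert-base-uncased
                          --max_seq_len 256 --do_lower_case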
    

Tagging

For tagging datasets, we support input files in plain text format.

Each line of the input file should contain one token and its tags; fields should be separated by a delimiter (space by default), while sentences are separated by a blank line.

An example of an input file is given below (Token X X Label):

-DOCSTART- -X- O O

CRICKET NNP I-NP O
- : O O
LEICESTERSHIRE NNP I-NP I-ORG
TAKE NNP I-NP O
OVER IN I-PP O
AT NNP I-NP O
TOP NNP I-NP O
AFTER NNP I-NP O
INNINGS NNP I-NP O
VICTORY NN I-NP O
. . O O

LONDON NNP I-NP I-LOC
1996-08-30 CD I-NP O

Then call preprocess_bert.py, providing the token column and the label column, as follows to generate training data for the token tagging task:

python preprocess_bert.py --task tagging --corpus_type {'train', 'valid'}
                          --file_type txt [--delimiter ' ']
                          --input_columns 1 --label_column 3
                          --data DATA_DIR/FILENAME
                          --save_data dataset
                          --vocab_model {bert-base-uncased,...}
                          --max_seq_len 256 [--do_lower_case]
                          [--sort_label_vocab] [--do_shuffle]

Pretraining objective

Even though it is not recommended, we also provide a script to generate a pretraining dataset, as you may want to fine-tune an existing pretrained model on masked language modeling and next sentence prediction.

The script expects a single file as input, consisting of untokenized text, with one sentence per line, and one blank line between documents.
A usage example is given below:

python pregenerate_bert_training_data.py  --input_file INPUT_FILE
                                          --output_dir OUTPUT_DIR
                                          --output_name OUTPUT_FILE_PREFIX
                                          --corpus_type {'train', 'valid'}
                                          --vocab_model {bert-base-uncased,...}
                                          [--do_lower_case] [--do_whole_word_mask] [--reduce_memory]
                                          --epochs_to_generate 2
                                          --max_seq_len 128
                                          --short_seq_prob 0.1 --masked_lm_prob 0.15
                                          --max_predictions_per_seq 20
                                          [--save_json]
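
For reference, the expected input is plain text like the following sketch (the sentences are made up), with one untokenized sentence per line and a blank line between documents:

This is the first sentence of document one.
Here is its second sentence.

Document two starts after the blank line.
It also has one sentence per line.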

Training

After the preprocessed data have been generated, you can load weights from a pretrained BERT and transfer them to a downstream task with a task-specific output head. This task-specific head will be initialized by the method you choose if no such architecture exists in the weights file specified by --train_from. Among all available optimizers, we suggest --optim bertadam, as it is the method used to train BERT. If you use the linear decay method, warmup_steps can be set to 1% of train_steps, as in the original paper (e.g. 40 warmup steps for 4000 train steps, as in the example below).

A usage example is given below:

python train.py  --is_bert --task_type {pretraining, classification, tagging}
                 --data PREPROCESSED_DATAFILE
                 --train_from CONVERTED_CHECKPOINT.pt [--param_init 0.1]
                 --save_model MODEL_PREFIX --save_checkpoint_steps 1000
                 [--world_size 2] [--gpu_ranks 0 1]
                 --word_vec_size 768 --rnn_size 768
                 --layers 12 --heads 8 --transformer_ff 3072
                 --activation gelu --dropout 0.1 --average_decay 0.0001
                 --batch_size 8 [--accum_count 4] --optim bertadam [--max_grad_norm 0]
                 --learning_rate 2e-5 --learning_rate_decay 0.99 --decay_method linear
                 --train_steps 4000 --valid_steps 200 --warmup_steps 40
                 [--report_every 10] [--seed 3435]
                 [--tensorboard] [--tensorboard_log_dir LOGDIR]
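
Note that, as in standard OpenNMT-py training, --accum_count accumulates gradients over several batches before each optimizer update and --world_size/--gpu_ranks enable data-parallel training, so with the bracketed values above the effective batch size per update is roughly batch_size × accum_count × world_size = 8 × 4 × 2 = 64 examples; adjust these options to fit your hardware.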

Predicting

After training, you can use predict.py to generate predictions for a raw file. Make sure to use the same BERT tokenizer model --vocab_model as for the training data.

For the classification task, the file to be predicted should contain one sentence(-pair) per line, with ||| separating the sentences of a pair.
For the tagging task, each line should be a tokenized sentence with tokens separated by spaces.
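
For illustration (the sentences below are made up), a classification input file for a sentence-pair task would look like:

A man is playing a guitar. ||| Someone is playing an instrument.
The weather was terrible today. ||| It was sunny all day.

while a tagging input file is simply one tokenized sentence per line:

CRICKET - LEICESTERSHIRE TAKE OVER AT TOP AFTER INNINGS VICTORY .
LONDON 1996-08-30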

Usage:

python predict.py  --task {classification, tagging}
                   --model ONMT_BERT_CHECKPOINT.pt
                   --vocab_model bert-base-uncased [--do_lower_case]
                   --data DATA_2_PREDICT [--delimiter {' ||| ', ' '}] --max_seq_len 256
                   --output PREDICT.txt [--batch_size 8] [--gpu 1] [--seed 3435]

Thank you for the very interesting information.
How do I get bert_ckp_convert.py, preprocess_bert.py, pregenerate_bert_training_data.py?

Please check these scripts in PR Bert #1543.

Is this URL ok?

Where is pytorch-pretrained-BERT? #1543

Check: https://github.com/OpenNMT/OpenNMT-py/pull/1543

Thank you for providing information.
I found a script in Zenglinxiao/OpenNMT-py ( https://github.com/Zenglinxiao/OpenNMT-py/tree/bert ).

Excuse me, can you give me a model file that has already been trained? I need to use a pretrained model because I don’t have time to retrain; I’m in urgent need, please help. Thanks.

You can check huggingface’s dedicated site for available pretrained models.


Thank you.
Let me ask more about this problem: I am setting up the Bottom-Up Summarization model. According to its author, it uses a pretrained OpenNMT-py model, so I only need to give the Bottom-Up model an OpenNMT model file as input. I wonder whether the OpenNMT-py BERT model can be used with the Bottom-Up Summarization model.

These are different architectures. You can’t replace one with the other.

Hello,

Is this tutorial still up to date? Or has BERT undergone major changes that would render it useless?

The link to "modeling_bert.py" is not working anymore.

Best regards,
Samuel

Hello Samuel,
Since the PR for this change was never merged, we have not made further updates to it.
The link you mentioned is broken because they (huggingface) made a major change to their repo, but you can find the equivalent file at transformers/modeling_bert.py.
It’s only a place to find available pretrained models; the PR itself does not depend on huggingface/transformers.

One alternative is to use GPT-2 in OpenNMT-py, which is integrated.

Hi Zenglinxiao,
Thanks for a very nice workaround to perform classification with OpenNMT.
Is there a way to generate my own vocab_model using my own data’s vocabulary file? Or to feed the model a text file with already tokenized data?