I am training on the tokenized Europarl v7 corpus and evaluating on the tokenized WMT newstest2013 set. To save time, I use only the first 10 sentences of the evaluation set. The full configuration is at the end of this post.
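To be explicit about what "first 10 sentences" means: I keep the first 10 lines of each tokenized evaluation file. A minimal sketch of that step, using the paths from my config and an arbitrary output suffix:

# Keep the first 10 sentences (lines) of each tokenized evaluation file.
# Paths match the config at the end of this post; the .head10 suffix is arbitrary.
for path in (
    "Data/Evaluation/Tokenized/newstest-2013.fr_tokenized.txt",
    "Data/Evaluation/Tokenized/newstest-2013.en_tokenized.txt",
):
    with open(path, encoding="utf-8") as src:
        head = [next(src) for _ in range(10)]
    with open(path + ".head10", "w", encoding="utf-8") as dst:
        dst.writelines(head)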
At the beginning (after a few checkpoints, say at ckpt-5000), the BLEU score looked fine: it was increasing and had reached about 9.0. But after one night of training, at ckpt-70000, the BLEU score dropped to 0.
With a relatively small dataset (2M sentences), is this expected? Or is there something wrong with my settings?
The following are target sentences and predictions at about ckpt-5000 and at ckpt-70000:
target: A Republican strategy to counter the re-election of Obama
ckpt-5000: An interim strategy for combating <unk> is <unk> .
ckpt-70000: This is an important issue .

target: Unlike in Canada , the American States are responsible for the organisation of federal elections in the United States .
ckpt-5000: On the contrary , the American States are responsible for elections in the United States .
ckpt-70000: The European Union ’ s security policy is a good example of the European Union .

target: Republican leaders justified their policy by the need to combat electoral fraud .
ckpt-5000: The political leaders of their political leaders have to fight against fraud .
ckpt-70000: The vote will take place tomorrow at 12 noon .
And the following are the 10 predictions around ckpt-70000, which seem strange:
This is an important issue .
The vote will take place tomorrow at 12 noon .
It is also important that the European Union has a duty to do so .
I would like to thank the rapporteur for his excellent work .
It is a matter of urgency .
It is also important for the European Union to play a leading role in this area .
The European Union ’ s security policy is a good example of the European Union .
It is also important to ensure that the European Union ’ s security policy .
The Commission ’ s proposal for a directive on the implementation of the Lisbon Strategy .
The European Union ’ s security strategy is a key issue .
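That said, an exact 0 is at least arithmetically possible with only 10 sentences: standard corpus-level BLEU is a brevity penalty times the geometric mean of the 1- to 4-gram precisions, so if no prediction shares even one 4-gram with its reference, the whole score collapses to 0. As a sanity check I can score the saved predictions myself; a sketch with sacrebleu (the prediction file name is hypothetical, and sacrebleu is just a stand-in, not necessarily what the BLEU external evaluator runs internally):

import sacrebleu

# Hypothetical file name; save_eval_predictions: True stores predictions
# under model_dir/eval/ (see the config below).
hyp_path = "Model/TransformerBig/eval/predictions.ckpt-70000.txt"
ref_path = "Data/Evaluation/Tokenized/newstest-2013.en_tokenized.txt"

with open(hyp_path, encoding="utf-8") as f:
    hyps = [line.strip() for line in f]
with open(ref_path, encoding="utf-8") as f:
    refs = [line.strip() for line in f][:len(hyps)]

# Disable sacrebleu's default smoothing and tokenization so this mirrors
# plain BLEU on already-tokenized text; one zero n-gram precision => BLEU 0.
print(sacrebleu.corpus_bleu(hyps, [refs], smooth_method="none", tokenize="none").score)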
# The directory where models and summaries will be saved. It is created if it does not exist.
model_dir: Model/TransformerBig/

data:
  # (required for train_and_eval and train run types).
  train_features_file: Data/Training/Tokenized/europarl-v7.fr-en.fr_tokenized
  train_labels_file: Data/Training/Tokenized/europarl-v7.fr-en.en_tokenized
  # (required for train_and_eval and eval run types).
  eval_features_file: Data/Evaluation/Tokenized/newstest-2013.fr_tokenized.txt
  eval_labels_file: Data/Evaluation/Tokenized/newstest-2013.en_tokenized.txt
  # (optional) Models may require additional resource files (e.g. vocabularies).
  source_words_vocabulary: fr-vocab-30000-tokenized.txt
  target_words_vocabulary: en-vocab-30000-tokenized.txt
  # source_tokenizer_config: config-tokenization.yml
  # target_tokenizer_config: config-tokenization.yml

params:
  gradients_accum: 1

# Training options.
train:
  batch_size: 3000
  # (optional) Batch size is the number of "examples" or "tokens" (default: "examples").
  batch_type: tokens
  # (optional) Save a checkpoint every this many steps.
  save_checkpoints_steps: 100
  # (optional) How many checkpoints to keep on disk.
  keep_checkpoint_max: 10
  # (optional) Save summaries every this many steps.
  save_summary_steps: 100
  # (optional) Train for this many steps. If not set, train forever.
  train_steps: 100000
  # (optional) The number of threads to use for processing data in parallel (default: 4).
  num_threads: 4
  # (optional) The number of elements from which to sample during shuffling (default: 500000).
  # Set 0 or null to disable shuffling, -1 to match the number of training examples.
  sample_buffer_size: 0
  # (optional) Number of checkpoints to average at the end of the training to the directory
  # model_dir/avg (default: 0).
  average_last_checkpoints: 0

# (optional) Evaluation options.
eval:
  # (optional) The batch size to use (default: 32).
  batch_size: 10
  # (optional) The number of threads to use for processing data in parallel (default: 1).
  num_threads: 4
  # (optional) Evaluate every this many seconds (default: 18000).
  eval_delay: 0
  # (optional) Save evaluation predictions in model_dir/eval/.
  save_eval_predictions: True
  # (optional) Evaluator or list of evaluators that are called on the saved evaluation predictions.
  # Available evaluators: BLEU, BLEU-detok, ROUGE
  external_evaluators: [BLEU]
  # (optional) Model exporter(s) to use during the training and evaluation loop:
  # last, final, best, or null (default: last).
  exporters: last
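One setting I'm unsure about is sample_buffer_size: 0, which per the comment above disables shuffling, so training examples are read in corpus order. To check my own understanding of what the buffer does, here is a toy model of buffered shuffling in plain Python (my mental model only, not OpenNMT-tf's actual implementation):

import random

def buffered_shuffle(stream, buffer_size, seed=0):
    # Toy model of a shuffle buffer: fill up to buffer_size items, then
    # repeatedly emit a random element while refilling from the stream,
    # and drain whatever is left at the end.
    rng = random.Random(seed)
    buf = []
    for item in stream:
        buf.append(item)
        if len(buf) >= buffer_size:
            yield buf.pop(rng.randrange(len(buf)))
    rng.shuffle(buf)
    yield from buf

print(list(buffered_shuffle(range(10), buffer_size=1)))  # identity order: effectively no shuffling
print(list(buffered_shuffle(range(10), buffer_size=5)))  # locally shuffled

With a buffer of 1 the output is the identity order, which is effectively what disabling the buffer gives; whether reading Europarl in corpus order could contribute to the degenerate predictions above is part of my question.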