The data preprocessing script in OpenNMT-Lua is not working, and the Im2Latex-100K dataset link in the OpenNMT-py docs is broken

Hi there,

I have a couple of issues for which I need your help.

  1. The Im2Latex-100K dataset link provided at https://github.com/OpenNMT/OpenNMT-py/blob/v1.2.0/docs/source/im2text.md is broken. Copying and pasting the command provided downloads an HTML file, not a tar archive.
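A quick way to confirm the download is an HTML error page rather than the archive is to look at the file's magic bytes. This is a small sketch; the filename is an assumption (use whatever name the command saved).

```python
import os

def looks_like_gzip(path):
    """True if the file starts with the gzip magic bytes 0x1f 0x8b."""
    with open(path, "rb") as f:
        return f.read(2) == b"\x1f\x8b"

# Hypothetical download name -- adjust to whatever the command saved.
archive = "im2latex_formulas.tar.gz"
if os.path.exists(archive):
    kind = "gzip archive" if looks_like_gzip(archive) else "not gzip (likely an HTML error page)"
    print(archive, "->", kind)
```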

  2. I found the 2014 OpenNMT-Lua repository, which has a working link to the dataset. But that dataset is not tokenized, so it has to be preprocessed using the command in OpenNMT-Lua. The main issue arises after preprocessing: the final normalized/tokenized file “formulas.norm.lst” has more lines than the input, i.e. the input file has ~103K equations while the output file has ~104K lines. While analyzing the issue, I found that a couple of things are happening:
    (i) Some equations are not normalized properly. These equations are missing from the final “formulas.norm.lst” file; in their place there is only a blank line. This can be handled.
    (ii) Some large equations are not normalized properly or not written properly to the final file, so we find blank lines as well as equations broken across separate lines. This creates indexing problems with the train, test, and validation files, which pair each eqn_index with its image_name.
    For example, in the “formulas.norm.lst” file: at line 156 there is a blank line. Around lines 178, 180, and 181 there are two blank lines but only one equation is missing; the rest are shifted for no apparent reason. The equation at line 864 is broken into 4 chunks separated by blank lines. This means the script does not tokenize such equations in one shot but splits them into multiple chunks, which again creates indexing issues. It is impossible to track these down manually for 100K+ equations.
    Could you please help me with this? Any help would be appreciated. Thank you!
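To at least locate the drift automatically rather than by hand, a small diagnostic can compare line counts and list every blank line in the normalized output. This is a sketch; the filenames are assumptions based on the names mentioned above.

```python
import os

def blank_line_report(raw_path, norm_path):
    """Count lines in the raw and normalized formula files and
    return the 1-based positions of blank lines in the output."""
    with open(raw_path, encoding="utf-8", errors="replace") as f:
        n_raw = sum(1 for _ in f)
    with open(norm_path, encoding="utf-8", errors="replace") as f:
        norm_lines = f.read().splitlines()
    blanks = [i for i, line in enumerate(norm_lines, start=1) if not line.strip()]
    return n_raw, len(norm_lines), blanks

# Assumed filenames from the preprocessing step described above.
if os.path.exists("formulas.lst") and os.path.exists("formulas.norm.lst"):
    n_raw, n_norm, blanks = blank_line_report("formulas.lst", "formulas.norm.lst")
    print(f"raw: {n_raw} lines, normalized: {n_norm} lines, "
          f"{len(blanks)} blank lines, first few at {blanks[:5]}")
```

Any position where the output has a blank line, or where `n_norm` exceeds `n_raw`, marks a candidate for the missing or chunk-split equations.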

Hi,

Maybe this can help:

As a matter of fact, we dropped img2text in v2.0.

I would like to find some people to re-implement img and audio/video in v3.0.

If you are interested, then the starting point would be to implement the dataloader part which should not be so difficult.

Let me know.

Otherwise you’ll have to stick with v1.2.

Thank you @vince62s.
Yes! That’s what I use. It has ~94K preprocessed equations.
It cannot be used for my research purposes. It is difficult to tokenize the raw LaTeX equations in the original Im2Latex-100K dataset into the proper format the way the OpenNMT-Lua script does: that script uses JavaScript and a MathML web server to do the normalization, and I am not very familiar with that.
As a matter of fact, I tried that script with other LaTeX equation datasets, and it worked properly. It is only with the Im2Latex-100K equations that it creates problems, possibly because of some really large equations. But removing them would alter the original dataset.
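If removing the broken equations is ever acceptable, the indexing damage can at least be contained by remapping the split files at the same time. This is a sketch under assumptions: I assume each split-file row pairs a 0-based formula index with an image name; adjust to the actual format. Note it still alters the dataset, since the broken equations are dropped.

```python
def realign(norm_lines, split_rows):
    """Drop blank formulas from norm_lines and remap the (index, image_name)
    rows in split_rows so each kept index points at the right formula.
    Rows whose formula was blank (i.e. lost in normalization) are dropped."""
    old_to_new, kept = {}, []
    for old_idx, line in enumerate(norm_lines):
        if line.strip():
            old_to_new[old_idx] = len(kept)
            kept.append(line)
    new_rows = [(old_to_new[i], name) for i, name in split_rows if i in old_to_new]
    return kept, new_rows
```

For example, with formulas `["x", "", "y"]` and rows for indices 0, 1, 2, the blank formula at index 1 and its row are dropped, and the row for index 2 is remapped to point at the second kept formula.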