Hi there,
I have a couple of issues for which I need your help.
-
The Im2Latex-100K dataset link provided at https://github.com/OpenNMT/OpenNMT-py/blob/v1.2.0/docs/source/im2text.md is broken. If I copy and paste the command provided there, it downloads an HTML file instead of a tar archive.
-
I found the 2014 OpenNMT-Lua repository, which has a working link to the dataset. However, that dataset is not tokenized, so it has to be preprocessed with the command from OpenNMT-Lua. The main issue arises after preprocessing: the final normalized/tokenized file “formulas.norm.lst” ends up with more lines than there are input equations, i.e. the input file has ~103K equations while the output file has ~104K lines. While analyzing this, I found that a couple of things are happening:
(i) Some equations are not normalized properly. These equations are missing from the final “formulas.norm.lst” file; instead there is a blank line in their place. This case can be handled.
(ii) Some large equations are not normalized, or not written to the final file, properly. Because of this, there are blank lines and equations broken across several lines. This breaks the indexing between the train, test, and validation files, which map each eqn_index to its image_name.
For example, in the “formulas.norm.lst” file: line 156 is blank. Around lines 178, 180, and 181 there are 2 blank lines but only one equation is missing; the remaining equations are shifted for no apparent reason. The equation at line 864 is broken into 4 chunks separated by blank lines. So the script does not tokenize each equation in one shot but splits some into multiple chunks, which again creates indexing issues. It is impractical to track these down manually for 100K+ equations.
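For reference, here is a minimal Python sketch of the kind of check I run to quantify the mismatch; it only compares line counts and lists blank lines. The raw file name (im2latex_formulas.lst) is an assumption and may need to be adjusted to the actual dataset files.

```python
# Minimal diagnostic sketch (not part of OpenNMT): compare the raw formula list
# with the normalized output and report blank lines, which correspond to
# equations that failed to normalize or were split across several lines.
# File names are assumptions -- adjust them to your local copies.

RAW_FILE = "im2latex_formulas.lst"   # original list, one equation per line (assumed name)
NORM_FILE = "formulas.norm.lst"      # output of the Lua preprocessing script

with open(RAW_FILE, encoding="utf-8", errors="replace") as f:
    raw_lines = f.read().splitlines()
with open(NORM_FILE, encoding="utf-8", errors="replace") as f:
    norm_lines = f.read().splitlines()

print(f"raw equations:        {len(raw_lines)}")
print(f"normalized lines:     {len(norm_lines)}")
print(f"extra lines produced: {len(norm_lines) - len(raw_lines)}")

# Blank (or whitespace-only) lines in the normalized file break the
# eqn_index -> image_name mapping used by the train/test/validation splits.
blank_idx = [i for i, line in enumerate(norm_lines) if not line.strip()]
print(f"blank lines: {len(blank_idx)}")
print("first few blank line numbers (0-based):", blank_idx[:20])
```

This only detects blank lines; it cannot tell which non-blank lines are fragments of a split equation, which is why the indexing is so hard to recover.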
Could you please help me with this? Any help would be appreciated. Thank you!