Extracting specific values from inside documents

I have about 30,000 domain-specific documents that are very different from each other. I need to extract specific values found inside them. These values are unique to each document and can be numbers, short strings, dates, etc.
As input data for training, I have each word in the document along with its relative bounding box coordinates. Each document has about 1000 words.
As target data, I have a single value per document to be extracted. Here are two examples of such values:

SHKGEA902498 / A
IC - 00708751

The target value is found inside the source data, so the job of the model is to use the input word data and generate the target value for each document’s data.

I cannot use regular word embeddings to create the vocabulary, because the target values are unique to each document and a word-level vocabulary would never contain them. I guess the best approach is to use character embeddings for this task.

My question is: how can I set up the model so that it only uses character embeddings for both source and target data? Basically, the vocabulary should consist only of single characters (letters, digits, symbols), not words. Also, is it possible to pass the bounding box coordinates of each word to the model as an input feature? If so, how can I do that? The bounding box is relevant because the position of each word in the document can influence which value is the target.

I’m not sure I understood whether or not you can detect those “special chars”.

But if they are to stay “as is” between source and target, I would write a script that uses a regex to identify them and replace them with a tag (<tag>) in both source and target, and then train my model on that.

So you would just need to replace each special string with <tag> before calling your model, and then replace the <tag> in the output with the special string you initially replaced.
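A minimal sketch of that replace-and-restore idea, assuming the special strings can be matched by a regex at all. The pattern below is purely hypothetical (two capital letters, an optional " - ", then eight digits) and the helper names `mask` and `unmask` are made up for illustration:

```python
import re

# Hypothetical pattern, not taken from the real data: adjust it to whatever
# actually identifies the special strings in your documents.
ID_PATTERN = re.compile(r"[A-Z]{2} ?- ?\d{8}")


def mask(text):
    """Replace every match with <tag> and return the masked text plus the matches."""
    found = ID_PATTERN.findall(text)
    return ID_PATTERN.sub("<tag>", text), found


def unmask(text, found):
    """Put the saved strings back, one <tag> at a time, in their original order."""
    for value in found:
        text = text.replace("<tag>", value, 1)
    return text


masked, saved = mask("Invoice IC - 00708751 issued")
restored = unmask(masked, saved)
```

This only works when the model is expected to copy the tags through unchanged, so it does not help if the model itself has to decide which string is the value.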

Hope this helps!

To answer your question about how to train the model with characters only: I believe you could use BPE from SentencePiece and force every character to be kept as is, but I never tried that myself, and you might run into performance issues!

There are no “special chars”. The source data is basically the OCR output with positions (hOCR). The target value that the model should generate is found inside this hOCR data itself, and each target value is unique. If I masked out characters with a tag (<tag>) in the source data, the model would lose context and it would be impossible to generate the target values.


You can simply split your data on each character (adding a space between each character) and build the vocabulary from this character-tokenized data.
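As a rough sketch of that manual step: the helper names below are made up, and using "▁" as a placeholder for the original spaces is an assumption on my side (without some marker, the spaces inside a value such as "IC - 00708751" could not be restored after splitting).

```python
SPACE_MARK = "▁"  # assumed placeholder for original spaces; any unused symbol works


def char_tokenize(text):
    """Put a space between every character, marking original spaces explicitly."""
    return " ".join(SPACE_MARK if ch == " " else ch for ch in text)


def char_detokenize(tokens):
    """Invert char_tokenize: drop the separators and restore the original spaces."""
    return "".join(" " if tok == SPACE_MARK else tok for tok in tokens.split(" "))


def build_vocab(lines):
    """Collect the sorted set of character tokens seen in already tokenized lines."""
    vocab = set()
    for line in lines:
        vocab.update(line.split(" "))
    return sorted(vocab)


tokenized = char_tokenize("IC - 00708751")
vocab = build_vocab([tokenized])
```

Writing one token per line from `build_vocab` gives a plain vocabulary file; any special tokens the toolkit expects still need to be added the way its documentation describes.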

OpenNMT-tf also has a tokenizer CharacterTokenizer that you can set when building the vocabulary and running the training. However, to get started I suggest manually tokenizing your data before the training so that you can see the result.

OpenNMT-tf accepts multiple input features. The secondary input features can be a list of labels, or a numerical vector that is concatenated to the main embeddings. There is more information in the documentation.
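For the "list of labels" variant, one way to prepare the data is to emit a second stream that has exactly one label per character token. The sketch below is only an illustration under my own assumptions: coordinates are relative values in [0, 1], the box centre is quantized into a 10x10 grid, and the names `quantize` and `aligned_streams` are hypothetical. How the two streams are then declared in the model and data configuration is described in the documentation.

```python
SPACE_MARK = "▁"  # assumed placeholder for the space between two words


def quantize(value, bins=10):
    """Map a relative coordinate in [0, 1] to a bin index in [0, bins - 1]."""
    return min(int(value * bins), bins - 1)


def aligned_streams(words, bins=10):
    """Build a character stream and a position-label stream of equal length.

    words: list of (text, (x0, y0, x1, y1)) with relative coordinates.
    Every character of a word receives the label of that word's box centre.
    """
    chars, labels = [], []
    for i, (text, (x0, y0, x1, y1)) in enumerate(words):
        if i > 0:
            chars.append(SPACE_MARK)
            labels.append("sep")  # separator tokens get their own label
        cx, cy = (x0 + x1) / 2, (y0 + y1) / 2
        label = f"x{quantize(cx, bins)}y{quantize(cy, bins)}"
        for ch in text:
            chars.append(ch)
            labels.append(label)
    return " ".join(chars), " ".join(labels)


words = [("IC", (0.10, 0.20, 0.20, 0.25)), ("-", (0.22, 0.20, 0.24, 0.25))]
char_line, label_line = aligned_streams(words)
```

Quantizing loses precision; the numerical vector variant would keep the raw coordinates instead, at the cost of a different input format.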
