I have about 30,000 domain-specific documents which are very different from each other. I need to extract specific values found inside them. These values are unique for each document and they can be numbers, short strings, dates, etc.
As input data for training, I have each word in the document along with the relative coordinates of its bounding box. Each document has about 1,000 words.
As target data, I have a single value per document to be extracted. Here is a list of such values:
83-3165719 60221372 915418 150 5816 F631618 97017601 30000002 73201903 SHKGEA902498 / A 10835 4566575 1260691403 205503725 12828189 813709067 IC - 00708751 2291
The target value is found inside the source data, so the job of the model is to use the input word data and generate the target value for each document’s data.
I cannot use a regular word embedding to create the vocabulary, because the target values are unique for each document and a word-level vocabulary would never contain them. I guess the best approach is to use character embeddings for this task.
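To make it concrete, here is a minimal sketch of the kind of character-level vocabulary I have in mind. The helper names (`build_char_vocab`, `encode`, `decode`) and the special tokens are my own assumptions for illustration, not part of any library, and in practice the vocabulary would be built from the characters of all documents rather than from three sample targets.

```python
# Special tokens are an assumption; a seq2seq setup usually needs something like them.
SPECIALS = ["<pad>", "<sos>", "<eos>", "<unk>"]

def build_char_vocab(texts):
    """Collect every distinct character and assign it an integer ID."""
    chars = sorted({ch for text in texts for ch in text})
    itos = SPECIALS + chars                      # ID -> token
    stoi = {tok: i for i, tok in enumerate(itos)}  # token -> ID
    return stoi, itos

def encode(text, stoi):
    """Turn a string into a list of character IDs, wrapped in <sos>/<eos>."""
    unk = stoi["<unk>"]
    return [stoi["<sos>"]] + [stoi.get(ch, unk) for ch in text] + [stoi["<eos>"]]

def decode(ids, itos):
    """Turn character IDs back into a string, dropping special tokens."""
    return "".join(itos[i] for i in ids if itos[i] not in SPECIALS)

# A few of the target values listed above, used only to illustrate the idea.
targets = ["83-3165719", "F631618", "IC - 00708751"]
stoi, itos = build_char_vocab(targets)
ids = encode("F631618", stoi)
print(ids, decode(ids, itos))
```

With this scheme the same `stoi` mapping would be shared by the source and the target side, so the vocabulary stays at a few dozen symbols no matter how many distinct values appear.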
My question is: how can I set up the model so that it only uses character embeddings for both source and target data? Basically, the vocabulary should consist only of single characters (letters, digits, punctuation), not words. Also, is it possible to pass the bounding box coordinates of each word to the model as an input feature? If so, how can I do that? The bounding box is relevant, because the position of a word in the document can determine whether it is the target value.
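For the bounding box part, this is roughly the input layout I am imagining, as a sketch only: every character becomes one row holding its character ID plus the four relative coordinates of the word it belongs to. The function names and the sample words and coordinates are made up for illustration. The assumption is that the model would embed the ID and concatenate the four coordinates to the embedded vector, but I do not know whether that is the right way to do it.

```python
def build_stoi(words):
    """Character -> ID mapping built from the words of a document; ID 0 is <unk>."""
    chars = sorted({ch for text, _ in words for ch in text})
    return {"<unk>": 0, **{ch: i + 1 for i, ch in enumerate(chars)}}

def char_rows(words, stoi):
    """Expand (word, box) pairs into per-character rows.

    Each row is (char_id, x0, y0, x1, y1); all characters of a word
    share that word's bounding box, given in relative page coordinates.
    """
    rows = []
    for text, (x0, y0, x1, y1) in words:
        for ch in text:
            rows.append((stoi.get(ch, stoi["<unk>"]), x0, y0, x1, y1))
    return rows

# Hypothetical sample: three words with relative bounding boxes in [0, 1].
words = [
    ("Invoice", (0.10, 0.05, 0.22, 0.07)),
    ("No.",     (0.23, 0.05, 0.27, 0.07)),
    ("915418",  (0.28, 0.05, 0.36, 0.07)),
]
stoi = build_stoi(words)
rows = char_rows(words, stoi)
print(len(rows), rows[0])
```

Word boundaries are only visible here through the change in coordinates between consecutive rows; a separator character between words might be needed as well.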