OpenNMT Forum

Should I use all available words for vectorization (word2vec/GloVe)?

For example, there are a lot of words that carry almost no meaning, such as the articles a/the.
If I want to create an Eng-Rus NMT system, I don’t care whether it is “a car” or “the car”: in Russian (and most other languages) it translates the same. And it could harm my system: sometimes instead of “the car” -> “автомобиль” it will write something like “тот самый автомобиль” (that specific car). That would be correct, but any extra words are harmful for text understanding.
Or am I worrying too much, and a smart and complex LSTM/Transformer will do the job for me?

Related question:
What is the best tactic for handling apostrophes?
If we just tokenize on the apostrophe, it will create three entries instead of just one:
one's -> ["one", "'", "s"]
Obviously, we can replace contractions like:
n’t -> " not"
’m -> " am"
, etc
But what about apostrophes for possessives:
children’s
parents’
, etc

Different guides say that we should do this:
["children", "'", "s"]
But it creates 3 entries! Why not just:
["children's"]
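
The contraction replacement described above could be sketched roughly like this. It is only a minimal illustration: the rule table and the function name `expand_contractions` are made up for this example, it assumes whitespace-separated words with no attached punctuation, and it deliberately leaves 's alone because that can mean "is", "has" or possession.

```python
# Irregular forms that the generic "n't" rule would get wrong
# (e.g. "can't" would become "ca not"). Illustrative, not exhaustive.
IRREGULAR = {"can't": "cannot", "won't": "will not", "shan't": "shall not"}

# Generic suffix rules. "'s" is left out on purpose: it is ambiguous.
SUFFIXES = [
    ("n't", " not"),
    ("'m", " am"),
    ("'re", " are"),
    ("'ve", " have"),
    ("'ll", " will"),
]


def expand_contractions(text):
    """Expand common English contractions in a whitespace-separated string."""
    text = text.replace("\u2019", "'")  # normalize curly apostrophes
    out = []
    for word in text.split():
        lower = word.lower()
        if lower in IRREGULAR:
            # Note: the original capitalization is not preserved here.
            out.append(IRREGULAR[lower])
            continue
        for suffix, replacement in SUFFIXES:
            if lower.endswith(suffix):
                word = word[: -len(suffix)] + replacement
                break
        out.append(word)
    return " ".join(out)


print(expand_contractions("I'm sure they don't know"))
```

Possessives such as children's pass through unchanged, so they still need a separate tokenization decision.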

How is your question related to word2vec or GloVe? It seems like you are just referring to the word vocabulary.

In general you want to split “children’s” into “children ; ’ ; s”. If you keep “children’s” as one word, the vocabulary size increases dramatically, because every noun gets a second entry for its possessive form, and the system has to learn that “children” and “children’s” are related.
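
To make the vocabulary argument concrete, here is a small sketch that compares the two tokenizations on a toy corpus. The regex tokenizer and the sample sentences are assumptions for illustration only; this is not the tokenizer OpenNMT ships.

```python
import re


def tokenize(text):
    """Split into word runs and single punctuation symbols."""
    text = text.replace("\u2019", "'")  # normalize curly apostrophes
    # A token is either a run of word characters or one non-space symbol.
    return re.findall(r"\w+|[^\w\s]", text)


def vocab(sentences, tokenizer):
    """Collect the set of distinct tokens produced by a tokenizer."""
    return {tok for sentence in sentences for tok in tokenizer(sentence)}


sentences = [
    "the children's toys",
    "the parents' car",
    "the dog's bowl",
    "children and parents and a dog",
]

split_vocab = vocab(sentences, tokenize)   # apostrophes split off
whole_vocab = vocab(sentences, str.split)  # possessives kept whole

print(tokenize("children's"))
print(len(split_vocab), len(whole_vocab))
```

With splitting, all possessives share the same "'" and "s" entries. Without it, each noun that appears in possessive form adds its own extra vocabulary entry, and the gap grows with the number of nouns.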

I think it is related to both vocabulary and word vectorization.
My thought was: vectorization should create some vector for " ’ ", “the”, “an”, etc. But sometimes these tokens have no meaning.
Thanks for the useful comment about vocab size. Sure, it is better to tokenize apostrophes.
But what about articles? Maybe it is much easier to remove them even before onmt preprocess?
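
For reference, removing articles from the English side before preprocessing could look like the sketch below. The function name `strip_articles` is hypothetical and the input is assumed to be whitespace-tokenized. Note that this throws away the definite/indefinite distinction for good, so it is worth comparing against a baseline trained on unmodified text.

```python
ARTICLES = {"a", "an", "the"}


def strip_articles(sentence):
    """Drop English articles from a whitespace-tokenized sentence."""
    return " ".join(w for w in sentence.split() if w.lower() not in ARTICLES)


print(strip_articles("The car is in a garage"))
```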

Consider your own example: in “children’s”, the apostrophe-s marks possession. In other cases “his”, “her” or “their” denote possession. “A”, “the”, etc. distinguish between “an anonymous item” and “a particular item”.

I think one way to attack this problem is to find a very good word labeler for both languages, and use word+label as your token rather than just the word. This also gives you a quality check, in that you can audit the output sequences by checking for a legitimate sequence of labels (whether or not they are the right words). On a Mac laptop I got 200 sentences/second out of the AllenNLP parse tree labeler.

Sounds interesting, but what do you mean by “label”?
Create some classes of words (nouns, verbs, articles, names, etc) and use pairs ['word', 'label'] as tokens?
I am hungry -> [{'I': 'pronoun'}, {'am': 'verb'}, {'hungry': 'adjective'}]
Like this?

Yes, it’s called ‘part-of-speech tagging’ in NLP. Check out the ‘spaCy’ library if you are a Python user.
‘Dependency parsing’ and ‘constituency parsing’ are related tools: they describe how the words in a sentence connect to each other, rather than labeling each word on its own.
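
A rough sketch of the word+label idea, for illustration only. The tags are written by hand here; in practice they would come from a tagger (with spaCy, for instance, each token exposes `token.text` and `token.pos_`). The separator character and the helper names `attach_labels` and `labels_of` are assumptions, so check which feature format your OpenNMT version expects.

```python
SEP = "\uffe8"  # assumed word/label separator, purely illustrative


def attach_labels(tagged):
    """Turn (word, label) pairs into single 'word<SEP>label' tokens."""
    return [f"{word}{SEP}{label}" for word, label in tagged]


def labels_of(tokens):
    """Recover the label sequence, e.g. to audit model output."""
    return [tok.rsplit(SEP, 1)[1] for tok in tokens]


# Hand-written tags standing in for real tagger output.
tagged = [("I", "PRON"), ("am", "AUX"), ("hungry", "ADJ")]
tokens = attach_labels(tagged)

print(" ".join(tokens))
print(labels_of(tokens))
```

The recovered label sequence is what you would inspect for the audit mentioned earlier: an output whose labels form an implausible sequence can be flagged even without checking the words themselves.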