OpenNMT Forum

Should I use all available words for vectorization (word2vec/GloVe)?

For example, there are a lot of words that carry almost no meaning on their own.
Articles, for instance: a/the. If I want to build an Eng-Rus NMT system, I don't care whether it is "a car" or "the car": in Russian (and most other languages) both translate the same. And it could even harm my system: sometimes, instead of "the car" -> "автомобиль", it will write something like "тот самый автомобиль" (that specific car). That would be correct, but any extra words hurt text comprehension.
Or am I worrying too much, and a smart, complex LSTM/Transformer will do the job for me?

Related question:
What is the best tactic for handling apostrophes?
If we just tokenize it, we get three entries instead of one:
one's -> ["one", "'", "s"]
Obviously, we can expand contractions, e.g.:
n't -> " not"
'm -> " am"
etc.
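
A minimal sketch of such a replacement pass (plain Python with `re`; the rule list and the `expand_contractions` name are just illustrative, and irregular forms like "can't"/"won't" would need dedicated rules):

```python
import re

# Illustrative contraction rules; a real list would be longer and
# handle irregular forms ("can't" -> "cannot", "won't" -> "will not").
CONTRACTIONS = [
    (re.compile(r"n't\b"), " not"),
    (re.compile(r"'m\b"), " am"),
    (re.compile(r"'re\b"), " are"),
]

def expand_contractions(text: str) -> str:
    text = text.replace("\u2019", "'")  # normalize curly apostrophes first
    for pattern, replacement in CONTRACTIONS:
        text = pattern.sub(replacement, text)
    return text

print(expand_contractions("I'm sure they don't care"))
# -> "I am sure they do not care"
```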
But what about apostrophes for possessives:
children's
parents'
etc.

Different guides say that we should do this:
["children", "'", "s"]
But that creates 3 entries! Why not just:
["children's"]

How is your question related to word2vec or GloVe? It seems like you are just referring to the word vocabulary.

In general you want to split "children's" into "children ; ' ; s". If you keep "children's" as a single token, the vocabulary size will grow dramatically, and the system would have to learn on its own that "children" and "children's" are related.
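
The effect is easy to see on a toy corpus (a rough sketch; `split_apostrophes` is a hypothetical helper, and in practice the OpenNMT tokenizer would do this splitting):

```python
import re

def split_apostrophes(text: str) -> list:
    # Split apostrophes off as separate tokens:
    # "children's" -> ["children", "'", "s"]
    return re.findall(r"[^\s']+|'", text)

corpus = [
    "the children's toys", "the children played",
    "my parents' car", "my parents drive a car",
    "the dog's bone", "the dog barked",
]

kept  = {tok for line in corpus for tok in line.split()}
split = {tok for line in corpus for tok in split_apostrophes(line)}

print(len(kept), len(split))  # 15 14: "children's", "parents'", "dog's"
# collapse into the shared tokens "'" and "s"; on a real corpus the gap
# grows with every noun that occurs both bare and in possessive form.
```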

I think it is related to both the vocabulary and word vectorization.
My thought was: vectorization will create a vector for " ' ", "the", "an", etc., but sometimes these tokens carry no meaning.
Thanks for the useful comment about vocabulary size. Sure, it is better to split apostrophes off.
But what about articles? Maybe it is easier to remove them even before onmt preprocess?
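
For reference, such a pre-filter is only a few lines (a sketch; `drop_articles` is a hypothetical helper, and whether removing articles helps or hurts is exactly the open question, since the encoder may still use them as context):

```python
import re

# Match standalone English articles, case-insensitively.
ARTICLES = re.compile(r"\b(a|an|the)\b", flags=re.IGNORECASE)

def drop_articles(line: str) -> str:
    # Strip articles from the English source side before preprocessing.
    return " ".join(ARTICLES.sub(" ", line).split())

print(drop_articles("The car is parked near a tree"))
# -> "car is parked near tree"
```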