Natural Language Processing (NLP) is a field of Machine Learning and Artificial Intelligence concerned with enabling computers to understand human languages. Before Tomas Mikolov (from Google Inc.) developed the word2vec technique in 2013, many Natural Language Processing systems treated words without any notion of similarity between them.
1. Word Embeddings
One of the first ideas for representing words in NLP was to treat them as discrete atomic symbols. This approach has some technical advantages: simplicity, robustness, and the fact that simple models trained on huge amounts of data have outperformed complex models trained on small data sets. To illustrate the technique, consider how two words might be encoded:
sedan => Id782
truck => Id921
As you can easily see, these encodings provide no information about the meaning of the words or the relationship between them. This representation gives the model no way to learn a connection between the meanings of sedan and truck. Moreover, unique and discrete indices lead to data sparsity, which forces us to gather more data to train a model successfully.
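To make the limitation concrete, here is a minimal sketch that turns discrete ids into one-hot vectors and compares them. The three-word vocabulary, the ids, and the helper names `one_hot` and `dot` are assumptions made for illustration; they are not part of any library.

```python
# Hypothetical vocabulary: each word gets an arbitrary discrete id.
vocab = {"sedan": 0, "truck": 1, "banana": 2}

def one_hot(word, vocab):
    """Return a vector with a single 1 at the position of the word's id."""
    vec = [0] * len(vocab)
    vec[vocab[word]] = 1
    return vec

def dot(a, b):
    """Dot product, used here as a simple similarity score."""
    return sum(x * y for x, y in zip(a, b))

sedan = one_hot("sedan", vocab)
truck = one_hot("truck", vocab)
banana = one_hot("banana", vocab)

# Every pair of distinct words gets the same score of 0, so the encoding
# cannot express that a sedan is closer in meaning to a truck than to a banana.
print(dot(sedan, truck))
print(dot(sedan, banana))
```

Whatever ids are chosen, any two different words end up equally dissimilar, which is exactly the problem that continuous representations try to solve.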
At this point the concept of Vector Space Models (VSM) should be introduced. The idea behind VSM is quite simple: we represent (embed) words in a continuous vector space, where similar words are mapped to nearby points. Of course, we need a method that finds the semantic relationships between those words automatically. The usual way to do this is to rely on the Distributional Hypothesis, which states that words appearing in the same contexts share semantic meaning. Methods based on it include count-based ones, like Latent Semantic Analysis (LSA), and predictive ones, like neural probabilistic language models. Let's look at the first technique (LSA), which is count-based: we count how often words appear in the same context and assume that words which share contexts also share some semantic meaning. Next, we map these count statistics into a dense vector for each word. The second technique belongs to the predictive methods; apart from Word2vec, described in the next section, they are not covered in this article.
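The counting step of a count-based method can be sketched as follows. This is only an illustration under stated assumptions: the three-sentence corpus, the window size, and the function names `cooccurrence_vectors` and `cosine` are invented for the example. The final step of LSA, compressing the counts into short dense vectors (typically with a truncated singular value decomposition), is left out; the raw count vectors are compared directly.

```python
from collections import Counter, defaultdict
import math

# Toy corpus, assumed for illustration only.
corpus = [
    "the sedan drives on the road",
    "the truck drives on the road",
    "the banana grows on the tree",
]

def cooccurrence_vectors(sentences, window=2):
    """Count, for each word, how often every other word appears
    within `window` positions of it."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.split()
        for i, word in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[word][tokens[j]] += 1
    return vectors

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)

vectors = cooccurrence_vectors(corpus)
sim_truck = cosine(vectors["sedan"], vectors["truck"])
sim_banana = cosine(vectors["sedan"], vectors["banana"])

# sedan and truck share all of their contexts in this corpus,
# so they score higher than sedan and banana.
print(sim_truck, sim_banana)
```

Unlike the discrete ids, these vectors place sedan closer to truck than to banana, simply because the two vehicles occur in the same contexts.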
2. Word to Vector
Word2vec is a very efficient predictive model for learning word embeddings from raw text. It comes in two basic variants: the Continuous Bag-of-Words (CBOW) model and the Skip-Gram model.
The algorithmic structure of these models is very similar. The main difference is that CBOW predicts target words from source context words, whereas Skip-Gram predicts source context words from target words. In other words, Skip-Gram is the inverse of CBOW, and this inversion has a statistical effect. CBOW smooths over a lot of distributional information by treating an entire context as a single observation, while Skip-Gram treats each context-target pair as a new observation. These characteristics lead to an important conclusion: CBOW tends to work best on small data sets, while Skip-Gram tends to perform best on large ones. Word2vec looks similar to an autoencoder, because it encodes each word in the document into a vector. However, the model is not trained to reconstruct the input word; it is trained to predict the neighboring words in the input document.
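The difference between the two variants is easiest to see in the training examples they produce from the same text. The sketch below only generates those examples and does not train a model; the sentence, the window size of 1, and the function names `skipgram_pairs` and `cbow_examples` are assumptions chosen for illustration.

```python
def context_of(tokens, i, window):
    """Words within `window` positions of index i, excluding the word itself."""
    lo = max(0, i - window)
    hi = min(len(tokens), i + window + 1)
    return [tokens[j] for j in range(lo, hi) if j != i]

def skipgram_pairs(tokens, window=1):
    """Skip-Gram: one (target, context_word) observation per context word."""
    return [
        (target, ctx)
        for i, target in enumerate(tokens)
        for ctx in context_of(tokens, i, window)
    ]

def cbow_examples(tokens, window=1):
    """CBOW: one (context_words, target) observation per position."""
    return [
        (context_of(tokens, i, window), target)
        for i, target in enumerate(tokens)
    ]

tokens = "the quick brown fox".split()

for pair in skipgram_pairs(tokens):
    print("skip-gram:", pair)
for example in cbow_examples(tokens):
    print("cbow:     ", example)
```

For these four tokens, Skip-Gram yields six separate observations while CBOW yields four, each of which merges its whole context into one input. This is the sense in which CBOW smooths over distributional information.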
For further reading, we suggest: