Mikolov et al. 2013 - Distributed Representations of Words and Phrases and their Compositionality
These are notes for mikolov2013distributed.
This paper introduces word2vec: an approach for learning word embeddings. Compare with bag-of-words, term-document matrix, and other vector space models of semantics (turney2010frequency).
1. Training
1.1. Skip-gram model
The model learns a vector representation for each word; these vectors parameterize a distribution over the words that appear in its context. For a sequence of training words \(w_1, w_2, \dots, w_T\), the objective of the skip-gram model is to maximize the average log probability: \[ \frac{1}{T} \sum_{i=1}^{T} \sum_{\substack{-c \le j \le c \\ j \neq 0}} \log p(w_{i+j} \mid w_{i}) \]
Where \(c\) is the size of the context window (the number of words considered on either side of the center word). Each word \(w\) has an input vector \(v_w\) and an output vector \(v'_w\), and the probability \(p(w_O \mid w_I)\) is computed with a softmax over the vocabulary \(W\): \[ p(w_O \mid w_I) = \frac{\exp({v'_{w_O}}^{\intercal} v_{w_I})}{\sum_{w=1}^{|W|} \exp({v'_{w}}^{\intercal} v_{w_I})} \]
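Below is a minimal sketch of this objective, assuming NumPy; `V_in` and `V_out` are hypothetical input/output embedding matrices and the corpus is just a list of word indices.

```python
import numpy as np

def softmax_logprob(V_in, V_out, center, context):
    """log p(context | center) under the full softmax (hypothetical embedding matrices)."""
    scores = V_out @ V_in[center]            # inner products with every vocabulary word
    scores -= scores.max()                   # for numerical stability
    return scores[context] - np.log(np.exp(scores).sum())

def skipgram_objective(V_in, V_out, words, c=2):
    """Average log-probability of the words within c positions of each center word."""
    total = 0.0
    for i, center in enumerate(words):
        for j in range(-c, c + 1):
            if j != 0 and 0 <= i + j < len(words):
                total += softmax_logprob(V_in, V_out, center, words[i + j])
    return total / len(words)

# toy usage: 5-word vocabulary, 8-dimensional vectors, corpus of word indices
rng = np.random.default_rng(0)
V_in, V_out = rng.normal(size=(5, 8)), rng.normal(size=(5, 8))
print(skipgram_objective(V_in, V_out, [0, 1, 2, 3, 4, 1, 2], c=2))
```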
1.2. Hierarchical Softmax
Computing the softmax normalizer requires \(|W|\) inner products per training example, which is expensive for a large vocabulary. Hierarchical softmax instead arranges the vocabulary as a binary tree (a Huffman tree in the paper) and defines \(p(w_O \mid w_I)\) as a product of sigmoids along the root-to-leaf path of \(w_O\), so each evaluation costs only about \(\log_2 |W|\) operations.
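A minimal sketch of evaluating one hierarchical-softmax probability, assuming the tree has already been built; `node_vecs`, `path`, and `signs` are hypothetical names for the inner-node vectors, the indices of the nodes on \(w_O\)'s root-to-leaf path, and the left/right turn codes.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def hs_logprob(v_in, node_vecs, path, signs):
    """log p(w_O | w_I) as a sum of log-sigmoids along w_O's root-to-leaf path.

    v_in      : input vector of w_I
    node_vecs : (|W| - 1, d) matrix of inner-node vectors (these replace the output vectors)
    path      : indices of the inner nodes on w_O's path, root first
    signs     : +1 or -1 per node, encoding which child the path descends to
    """
    logp = 0.0
    for n, s in zip(path, signs):
        logp += np.log(sigmoid(s * (node_vecs[n] @ v_in)))
    return logp

# toy usage: a vocabulary of 8 words has 7 inner nodes; a balanced path has length 3,
# so each evaluation touches ~log2(|W|) nodes instead of all |W| output vectors
rng = np.random.default_rng(0)
node_vecs, v_in = rng.normal(size=(7, 16)), rng.normal(size=16)
print(hs_logprob(v_in, node_vecs, path=[0, 1, 4], signs=[+1, -1, +1]))
```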
1.3. Negative Sampling
The model is trained to distinguish the true context word from words drawn from random noise. Let \(v_w\) and \(v'_w\) be the input and output vector representations of \(w\). The following objective replaces \(\log p(w_O \mid w_I)\) in the skip-gram objective above: \[\log \sigma ({v'_{w_O}}^{\intercal} v_{w_I}) + \sum_{i=1}^{k} \mathbb{E}_{w_i \sim P_n(w)} \left[ \log \sigma(-{v'_{w_i}}^{\intercal} v_{w_I}) \right]\] The first term is the log probability, under a logistic model, that \(w_O\) really is a context word of the input word \(w_I\). The second term is the part introduced by negative sampling: \(P_{n}\) is the noise distribution (the paper uses the unigram distribution raised to the \(3/4\) power), from which \(k\) noise words are sampled per training pair. This term pushes \({v'_{w_i}}^{\intercal} v_{w_I}\) down for noise words, so the model learns to separate the true context word from words outside the context.
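A minimal sketch of the negative-sampling objective for a single \((w_I, w_O)\) pair, again assuming NumPy and hypothetical `V_in`/`V_out` embedding matrices; the noise distribution follows the paper's unigram\(^{3/4}\) heuristic and the expectation is estimated with \(k\) samples, as in training.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def neg_sampling_objective(V_in, V_out, w_I, w_O, noise_probs, k=5, rng=None):
    """Negative-sampling objective for a single (input word, context word) pair."""
    rng = rng or np.random.default_rng()
    pos = np.log(sigmoid(V_out[w_O] @ V_in[w_I]))                 # log sigma(v'_{w_O} . v_{w_I})
    noise = rng.choice(len(noise_probs), size=k, p=noise_probs)   # w_i ~ P_n(w)
    neg = np.log(sigmoid(-(V_out[noise] @ V_in[w_I]))).sum()      # sum of log sigma(-v'_{w_i} . v_{w_I})
    return pos + neg

# noise distribution: unigram counts raised to the 3/4 power, then normalized (as in the paper)
counts = np.array([50.0, 20.0, 10.0, 10.0, 5.0])
noise_probs = counts**0.75 / (counts**0.75).sum()

rng = np.random.default_rng(0)
V_in, V_out = rng.normal(size=(5, 8)), rng.normal(size=(5, 8))
print(neg_sampling_objective(V_in, V_out, w_I=0, w_O=1, noise_probs=noise_probs, k=3, rng=rng))
```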