Turney and Pantel 2010 - From frequency to meaning: Vector space models of semantics
Notes for turney2010frequency
1. Word-Document Matrices
In a word-document matrix \(\mathbf{X}\), the rows correspond to words (or terms) and the columns correspond to documents. For term \(i\) and document \(j\), \(\mathbf{X}_{i,j}\) is the frequency of \(i\) in \(j\).
\(\mathbf{X}_{:,j}\) (the \(j\)-th column) characterizes a document and \(\mathbf{X}_{i,:}\) (the \(i\)-th row) characterizes a word. The values in this matrix can be given a TF-IDF weighting.
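As a minimal sketch, TF-IDF reweighting of a word-document count matrix can be computed directly with numpy; the toy matrix and the unsmoothed log-IDF formula below are illustrative assumptions (one of several common variants), not details from the paper.

```python
import numpy as np

# Toy word-document count matrix X: rows = terms, columns = documents.
# X[i, j] is the frequency of term i in document j.
X = np.array([
    [2, 0, 1],   # "wood"
    [0, 3, 0],   # "honey"
    [1, 1, 1],   # "the"
])

n_docs = X.shape[1]
# Document frequency: number of documents containing each term.
df = np.count_nonzero(X, axis=1)
# Inverse document frequency; the log dampens the effect of common terms.
idf = np.log(n_docs / df)
# TF-IDF: reweight raw counts so a term occurring in every document
# (like "the", with df = 3 and hence idf = 0) contributes nothing.
X_tfidf = X * idf[:, np.newaxis]

print(X_tfidf)
```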
1.1. Relation to Linguistic Theories
- Distributional hypothesis: words that occur in similar contexts tend to have similar meanings. Wittgenstein's language games are meant to show how word meanings arise from word use in context. Firth says that "you shall know a word by the company it keeps" (see Firth 1957 - A Synopsis of Linguistic Theory).
- Bag-of-words hypothesis: documents with similar word frequencies tend to be about the same topic.
2. Pair-pattern matrices
The rows correspond to pairs of words, such as "wood":"ax". The columns correspond to patterns (relation templates) such as "\(X\) cuts \(Y\)".
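A hedged sketch of how such a matrix might be populated: the toy sentences, the single hand-written pattern, and the regex matching are all illustrative assumptions; the systems surveyed in the paper (e.g., Latent Relational Analysis) instead derive patterns from the corpus itself.

```python
import re
from collections import Counter

# Toy corpus and a single hand-written pattern; both are illustrative.
sentences = [
    "an ax cuts wood",
    "a saw cuts wood",
    "a knife cuts bread",
]
patterns = ["X cuts Y"]

# counts[((x, y), pattern)] -> frequency; pairs index the rows of the
# pair-pattern matrix and patterns index its columns.
counts = Counter()
for pattern in patterns:
    # Turn the template "X cuts Y" into a regex with a capture group
    # for each slot filler.
    regex = pattern.replace("X", r"(\w+)").replace("Y", r"(\w+)")
    for sentence in sentences:
        for x, y in re.findall(regex, sentence):
            counts[((x, y), pattern)] += 1

print(counts)
# e.g. Counter({(('ax', 'wood'), 'X cuts Y'): 1,
#               (('saw', 'wood'), 'X cuts Y'): 1,
#               (('knife', 'bread'), 'X cuts Y'): 1})
```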
2.1. Relation to Linguistic Theories
Latent relation hypothesis: pairs of words that co-occur in similar patterns tend to have similar semantic relations.
3. Summary
The above-mentioned hypotheses can be subsumed under a "statistical semantics hypothesis", which claims that semantics can be learned from the statistical analysis of human word usage.
Word-context matrices are suited for measuring attributional similarity, i.e. "do these two words have similar properties?". Pair-pattern matrices are suited for measuring relational similarity, i.e. "do these two pairs ("cow":"moo" :: "dog":"bark") stand in the same relation to each other?"
Semantic relations between word pairs include:
- synonyms
- antonyms
- functionally related
- meronyms (part-whole: a wheel is part of a car; see Synecdoche vs Metonymy)
- hypernyms (a car is a vehicle)
Words are syntagmatic associates when they appear together (bee and honey). They are paradigmatic parallels when they appear in similar contexts (doctor and nurse).
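To make these distinctions concrete, here is a small sketch on an invented word-context matrix: attributional similarity (and paradigmatic parallelism) shows up as a high cosine between row vectors, while syntagmatic association shows up as a large direct co-occurrence count. All values and vocabulary below are made up.

```python
import numpy as np

# Toy word-context co-occurrence matrix (values invented):
# rows = target words, columns = context words they occur with.
words = ["doctor", "nurse", "bee"]
contexts = ["hospital", "patient", "honey", "hive"]
C = np.array([
    [8, 6, 0, 0],   # doctor
    [7, 7, 0, 0],   # nurse
    [0, 0, 9, 5],   # bee
])

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

d, n, b = (words.index(w) for w in ["doctor", "nurse", "bee"])

# Paradigmatic parallels: doctor and nurse occur in similar contexts,
# so their row vectors have a high cosine similarity.
print("cos(doctor, nurse) =", round(cosine(C[d], C[n]), 3))  # approx. 0.99
print("cos(doctor, bee)   =", round(cosine(C[d], C[b]), 3))  # 0.0

# Syntagmatic associates: bee and honey appear together, which shows up
# as a large direct co-occurrence count rather than a similar row vector.
print("count(bee, honey)  =", C[b, contexts.index("honey")])  # 9
```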
4. Text processing considerations
- Tokenization: how to deal with multi-word terms, punctuation, stop words (words with low information content)
- Normalization: upper/lower case, stemming
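A hedged sketch of these preprocessing steps; the regex tokenizer, the tiny stop-word list, and the crude suffix-stripping stemmer are illustrative stand-ins for real components such as a Porter stemmer, and multi-word terms are not handled.

```python
import re

# Illustrative stop-word list; real systems use much larger ones.
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in"}

def tokenize(text):
    # Crude regex tokenizer: keeps word characters, drops punctuation.
    # Multi-word terms ("New York") would need a phrase lexicon.
    return re.findall(r"\w+", text)

def stem(token):
    # Toy suffix stripper standing in for a real stemmer (e.g., Porter).
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def preprocess(text):
    tokens = tokenize(text.lower())                       # case normalization
    tokens = [t for t in tokens if t not in STOP_WORDS]   # stop-word removal
    return [stem(t) for t in tokens]                      # stemming

print(preprocess("The doctors were treating patients in the hospital."))
# -> ['doctor', 'were', 'treat', 'patient', 'hospital']
```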