Turney and Pantel 2010 - From frequency to meaning: Vector space models of semantics
Notes for turney2010frequency
1. Word-Document Matrices
In a word-document matrix \(\mathbf{X}\), the rows correspond to words (or terms) and the columns correspond to documents. For term \(i\) and document \(j\), \(\mathbf{X}_{i,j}\) is the frequency of \(i\) in \(j\).
\(\mathbf{X}_{:,j}\) (the \(j\)-th column) characterizes a document and \(\mathbf{X}_{i,:}\) (the \(i\)-th row) characterizes a word. The values in this matrix can be given a TF-IDF weighting.
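As a minimal sketch, TF-IDF reweighting of a word-document count matrix can be computed directly with numpy; the toy matrix and the unsmoothed log-IDF formula below are illustrative assumptions (one of several common variants), not details from the paper.

```python
import numpy as np

# Toy word-document count matrix X: rows = terms, columns = documents.
# X[i, j] is the frequency of term i in document j.
X = np.array([
    [2, 0, 1],   # "wood"
    [0, 3, 0],   # "honey"
    [1, 1, 1],   # "the"
])

n_docs = X.shape[1]
# Document frequency: number of documents containing each term.
df = np.count_nonzero(X, axis=1)
# Inverse document frequency; the log dampens the effect of common terms.
idf = np.log(n_docs / df)
# TF-IDF: reweight raw counts so a term occurring in every document
# (like "the", with df = 3 and hence idf = 0) contributes nothing.
X_tfidf = X * idf[:, np.newaxis]

print(X_tfidf)
```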
1.1. Relation to Linguistic Theories
- Distributional hypothesis: words that occur in similar contexts tend to have similar meanings. Wittgenstein's language games are meant to show how word meanings arise from word use in context. Firth says that "you shall know a word by the company it keeps" (see Firth 1957 - A Synopsis of Linguistic Theory).
- Bag-of-words hypothesis: documents with similar word frequencies tend to be about the same topic.
2. Pair-pattern matrices
The rows correspond to pairs of words, such as "wood":"ax". The columns correspond to patterns (relation templates) such as "\(X\) cuts \(Y\)".
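A hedged sketch of how such a matrix might be populated: the toy sentences, the single hand-written pattern, and the regex matching are all illustrative assumptions; the systems surveyed in the paper (e.g., Latent Relational Analysis) instead derive patterns from the corpus itself.

```python
import re
from collections import Counter

# Toy corpus and a single hand-written pattern; both are illustrative.
sentences = [
    "an ax cuts wood",
    "a saw cuts wood",
    "a knife cuts bread",
]
patterns = ["X cuts Y"]

# counts[((x, y), pattern)] -> frequency; pairs index the rows of the
# pair-pattern matrix and patterns index its columns.
counts = Counter()
for pattern in patterns:
    # Turn the template "X cuts Y" into a regex with a capture group
    # for each slot filler.
    regex = pattern.replace("X", r"(\w+)").replace("Y", r"(\w+)")
    for sentence in sentences:
        for x, y in re.findall(regex, sentence):
            counts[((x, y), pattern)] += 1

print(counts)
# e.g. Counter({(('ax', 'wood'), 'X cuts Y'): 1,
#               (('saw', 'wood'), 'X cuts Y'): 1,
#               (('knife', 'bread'), 'X cuts Y'): 1})
```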
2.1. Relation to Linguistic Theories
Latent relation hypothesis: pairs of words that co-occur in similar patterns tend to have similar semantic relations.
3. Summary
The above-mentioned hypotheses can be subsumed under a "statistical semantics hypothesis", which claims that semantics can be learned from the statistical analysis of human word usage.
Word-context matrices are suited for measuring attributional similarity, i.e. "do these two words have similar properties?". Pair-pattern matrices are suited for measuring relational similarity, i.e. "do these two pairs ("cow":"moo" :: "dog":"bark") stand in the same relation to each other?"
Semantic relations between word pairs include:
- synonyms
- antonyms
- functionally related
- meronyms (part-whole: a wheel is part of a car; see Synecdoche vs Metonymy)
- hypernyms (a car is a vehicle)
Words are syntagmatic associates when they appear together (bee and honey). They are paradigmatic parallels when they appear in similar contexts (doctor and nurse).
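To make these distinctions concrete, here is a small sketch on an invented word-context matrix: attributional similarity (and paradigmatic parallelism) shows up as a high cosine between row vectors, while syntagmatic association shows up as a large direct co-occurrence count. All values and vocabulary below are made up.

```python
import numpy as np

# Toy word-context co-occurrence matrix (values invented):
# rows = target words, columns = context words they occur with.
words = ["doctor", "nurse", "bee"]
contexts = ["hospital", "patient", "honey", "hive"]
C = np.array([
    [8, 6, 0, 0],   # doctor
    [7, 7, 0, 0],   # nurse
    [0, 0, 9, 5],   # bee
])

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

d, n, b = (words.index(w) for w in ["doctor", "nurse", "bee"])

# Paradigmatic parallels: doctor and nurse occur in similar contexts,
# so their row vectors have a high cosine similarity.
print("cos(doctor, nurse) =", round(cosine(C[d], C[n]), 3))  # approx. 0.99
print("cos(doctor, bee)   =", round(cosine(C[d], C[b]), 3))  # 0.0

# Syntagmatic associates: bee and honey appear together, which shows up
# as a large direct co-occurrence count rather than a similar row vector.
print("count(bee, honey)  =", C[b, contexts.index("honey")])  # 9
```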
4. Text processing considerations
- Tokenization: how to deal with multi-word terms, punctuation, stop words (words with low information content)
- Normalization: upper/lower case, stemming
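A hedged sketch of these preprocessing steps; the regex tokenizer, the tiny stop-word list, and the crude suffix-stripping stemmer are illustrative stand-ins for real components such as a Porter stemmer, and multi-word terms are not handled.

```python
import re

# Illustrative stop-word list; real systems use much larger ones.
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in"}

def tokenize(text):
    # Crude regex tokenizer: keeps word characters, drops punctuation.
    # Multi-word terms ("New York") would need a phrase lexicon.
    return re.findall(r"\w+", text)

def stem(token):
    # Toy suffix stripper standing in for a real stemmer (e.g., Porter).
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def preprocess(text):
    tokens = tokenize(text.lower())                       # case normalization
    tokens = [t for t in tokens if t not in STOP_WORDS]   # stop-word removal
    return [stem(t) for t in tokens]                      # stemming

print(preprocess("The doctors were treating patients in the hospital."))
# -> ['doctor', 'were', 'treat', 'patient', 'hospital']
```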