Vector Symbolic Architectures

Monday, July 25, 2005

Interesting papers at ACL

Hal Daume III posts at Machine Learning (Theory) a list of interesting papers from ACL 2005, which appear to have much to do with learning structure (which is as would be expected - a machine learning person reporting from a language conference). Here is what he posted:


David Chiang, A Hierarchical Phrase-Based Model for Statistical Machine Translation. (Best paper award.) This paper takes the standard phrase-based MT model that is popular in our field (basically, translate a sentence by individually translating phrases and reordering them according to a complicated statistical model) and extends it to take into account hierarchy in phrases, so that you can learn things like “X ’s Y” -> “Y de X” in chinese, where X and Y are arbitrary phrases. This takes a step toward linguistic syntax for MT, which our group is working strongly on, but doesn’t require any linguists to sit down and write out grammars or parse sentences.



Rie Kubota Ando and Tong Zhang, A High-Performance Semi-Supervised Learning Method for Text Chunking. This is more of a machine learning style paper, where they improve a sequence labeling task by augmenting it with models from related tasks for which data is free. I.e., I might train a model that, given a context with a missing word, will predict the word (eg., “The ____ gave a speech” might want you to insert “president".) By doing so, you can use these other models to give additional useful information to your main task.


Noah A. Smith and Jason Eisner, Contrastive Estimation: Training og-Linear Models on Unlabeled Data. This paper talks about training sequence labeling models in an unsupervised fashion, basically by contrasting what the model does on the correct string with what the model does on a corrupted version of the string. They get significantly better results than just by using EM in an HMM, and the idea is pretty nice.


Patrick Pantel, Inducing Ontological Co-occurrence Vectors. This is a pretty neat idea (though I’m biased — Patrick is a friend)
where one attempts to come up with feature vectors that describe nodes in a semantic hierarchy (ontology) that could enable you to figure out where to insert new words that are not in your ontology. The results are pretty good, and the method is fairly simple; I’d imagine that a more complex model/learning framework could improve the model even further.

Wednesday, July 20, 2005

No Magic!

One thing that sometimes confuses people about HRRs is the nature of the inverse operation.

For example, if we have a binding of two vectors, say A*B (where '*' is a circular convolution operations in HRRs), then how can we recover A (or B) from this binding, and exactly what information is stored in the binding?

The short answer is “there is no magic!” -- the amount of information stored in a binding is the same as the amount of information stored in the vector A or B.

This means that just given the binding alone, it is impossible to discover that is is constructed out of A and B (because then the binding would have had twice as much information as A and B).

However, given the binding A*B and one of the vectors, say A, it is possible to reconstruct the other member of the binding, i.e., B. This can be done using the “convolution inverse”, which can be written A*.

The convolution inverse (called “involution” in HRRs) is actually just a permutation of the elements of the vector – in ordinary HRRs it keeps the first element in place and reverses the order of the remaining elements. In terms of matrix linear algebra, the matrix corresponding to the convolution inverse of vector A is the transpose of the matrix corresponding to vector A.

Note that we assume that the “cue” vector (that we are using to decode the binding) is in fact part of the binding. It is also possible to use a a different cue vector, say C, to try to decode the binding A*B. If we do that, we get the vector C* * A*B, which is a just another vector. To identify that this is actually a nonsense vector, we need to have some way of identifying “valid” vectors. This is the one of the function of the “cleanup memory” in HRRs.

Where does hierarchical structure occur?

Vector representations are widely used for representing unstructured objects, e.g., bag-of-words style vector representations as used in information retrieval, but the emphasis in VSA's is on representation and processing of hierarchical compositional structure.

But, where does hierarchical compositional structure occur?
  • In language constructs that refer to other entity's states of mind (cf Robin Dunbar's “Gossip” theory of the evolution of language), e.g., "Pat thinks Tom loves Robin"
  • In planning (cf Mark Steedman's work on planning)

What are some examples of VSA's?

  • Holographic Reduced Representation (HRR)
  • Binary Spatter Codes
  • Associative-Projective Neural Networks (aka Context Dependent Thinning)

Where do VSA's fit into AI?

John Langford discusses different aspect of machine learning in a post Languages of Learning:

...

In addition there are languages related to various aspects of learning.


  1. Reductions: A language for translating between varying real-world losses and core learning algorithm optimizations.
  2. Feature Languages: Exactly how features are specified varies from on learning algorithm to another. Several people have been working on languages for features that cope with sparsity or the cross-product nature of databases.
  3. Data interaction languages: The statistical query model of learning algorithms provides a standardized interface between data and learning algorithm.

...

It seems VSA's fit right into the "Feature Languages" category, and they provide solutions to the problem of representing combinatorial features (which are very sparse).