Vector Symbolic Architectures

This blog has moved

This blog is now located at http://d-reps.blogspot.com/.
You will be automatically redirected in 30 seconds, or you may click here.

For feed subscribers, please update your feed subscriptions to
http://d-reps.blogspot.com/feeds/posts/default.

George Dyson's visit to Google

http://www.edge.org/3rd_culture/dyson05/dyson05_index.html

Human non-markov statistical language learning

Amy Perfors (or was it James Greiner?) posts at Social Sciences Statistics Blog on Human Statistical Learning:

From the modeling perspective, this result can be captured by Markov models in which the learner keeps track of the string of syllables and the transition probabilities between them, updating the transition probabilities as they hear more data. More recent work has begun to investigate whether humans are capable of statistical learning that cannot be captured by a Markov model -- that is, learning nonadjacent dependencies (dependencies between syllables that do not directly follow each other) in a stream of speech. For instance, papers by Gomez et. al. and Onnis et. al. provide evidence that discovering even nonadjacent dependencies is possible through statistical learning, as long as the variability of the intervening items is low or high enough. This has obvious implications for how statistical learning might help in acquiring grammar (in which many dependencies are nonadjacent), but it also opens up new modeling issues, since simple Markov models are no longer applicable. What more sophisticated statistical and computational tools are necessary in order to capture own unconscious, amazing abilities?

Step right up, step right up, Confabulation Theory...

Was just reading a couple of Hecht-Nielsen's papers on Confabulation Theory (available at http://inc2.ucsd.edu/addedpages/techreports.html : "Confabulation Theory Synopis" (short, in 05.01.pdf), and "A Theory of Cortex" (long, in 04.04.pdf).

I was indirectly prompted to look up Hecht-Nielsen's stuff by a email from Ross Gayler who pointed me to Sahlgren's paper on random indexing, which had a reference to a 1994 Hecht-Nielsen paper:

Hecht-Nielsen (1994), … demonstrated that there are many more nearly orthogonal than truly orthogonal directions in a high-dimensional space.

It's undeniable that Hecht-Nielsen has chutzpa. The synopsis paper was somewhat tantalizing, but I found myself wondering "where's the beef?" Seems solidly in the tradition of scruffy AI -- thinking up a useful mechanism to solve an informally stated problem, rather than mathematically defining a problem and then deriving a way of solving it. He claims this theoretical architecture can do everything, but doesn't really specify how to do anything. I'd like to see some concrete examples of performing an interesting computation.

His profile at UCSD starts with:

An authority on neural networks, he introduced the first comprehensive theory of the mammalian cerebral cortex and thalamus in 2002.

Also left me wondering about what exactly is "Confabulation theory". He claims it is the principle that we choose the H that maximises p(E | H). Says this is different from Maximum A-Posteriori reasoning. So, what's the difference? He takes into account priors over different possible H's? So then it's Bayesian reasoning? If so, why not just say so? Bayesian reasoning is so widely known now that I would call it a jargonistic offence of the first order to readers to not state this if this is so.

"Confabulation Theory: A Synopsis" (TR#0501) cites "A Theory of Thalamocortex" for details (Chapter 4 of "Computational Models for Neuroscience" , ed Robert Hecht-Nielsen & Thomas McKenna, Springer, 2003). This chapter starts in wonderfully Hecht-Nielsenly way:

This chapter presents the first comprehensive high-level theory of
the information processing function of mammalian cortex and
thalamus; ... READER WARNING: The content of this chapter is
complicated and almost entirely novel and unfamiliar. ...

Although this chapter doesn't use the term "Confabulation Theory", it does contain "the principle of mammalian induction", which I think is the same thing. This is the reasoning rule that, when faced with pieces of evidence E1 through E_n, and possible hypothesis, H_1 through H_k, chooses the hypothesis H_i that maximizes min over j P(E_j | H_i). So, this is different to both Bayesian reasoning (which takes the joint probability of all evidence into account, as well as priors on the hypothesis) and MAP reasoning (which just takes the joint probability of all evidence into account). As Hecht-Nielsen says, this proposed reasoning rule could generate lots of testable hypothesis. At first glance, it does provide a better model of the common (incorrect) conclusion that people arrive at in the case of a positive outcome of a test for a rare disease (most people, including doctors, fail to take the priors into account and estimate a erroneously high probability of actually having the disease.) However, I'd be sort of surprised if this reasoning rule is novel.

Some other comments about confabulation theory:

interesting excerpts from a psychological work on confabulation; uncritical general article about Hecht-Nielsen's theory.

Personally, I suspect confabulation is far more central to human cognition (memory and reasoning) than most people think. So, maybe Hecht-Nielsen is onto something, or maybe it's just good marketing.

Random Indexing

Ross Gayler pointed out two interesting statements in An Introduction to Random Indexing (Magnus Sahlgren):

“Hecht-Nielsen (1994), … demonstrated that there are many more nearly orthogonal than truly orthogonal directions in a high-dimensional space.” (Hecht-Nielsen, R.; Context vectors; general purpose approximate meaning representations self-organized from raw data. In Zurada, J. M.; R. J. Marks II; C. J. Robinson; Computational intelligence: imitating life. IEEE Press, 1994.)
“the Johnson-Lindenstrauss lemma (Johnson & Lindenstrauss, 1984) – that states that if we project points in a vector space into a randomly selected subspace of sufficiently high dimensionality, the distances between the points are approximately preserved.” (Johnson, W. B. & J. Lindenstrauss; Extensions to Lipshitz mapping into Hilbert space. Contemporary mathematics. 26, 1984.)

Both these are very powerful properties of vector space representations, and these are particularly nice statemnts of these properties.

Interesting papers at ACL

Hal Daume III posts at Machine Learning (Theory) a list of interesting papers from ACL 2005, which appear to have much to do with learning structure (which is as would be expected - a machine learning person reporting from a language conference). Here is what he posted:

David Chiang, A Hierarchical Phrase-Based Model for Statistical Machine Translation. (Best paper award.) This paper takes the standard phrase-based MT model that is popular in our field (basically, translate a sentence by individually translating phrases and reordering them according to a complicated statistical model) and extends it to take into account hierarchy in phrases, so that you can learn things like “X ’s Y” -> “Y de X” in chinese, where X and Y are arbitrary phrases. This takes a step toward linguistic syntax for MT, which our group is working strongly on, but doesn’t require any linguists to sit down and write out grammars or parse sentences.

Rie Kubota Ando and Tong Zhang, A High-Performance Semi-Supervised Learning Method for Text Chunking. This is more of a machine learning style paper, where they improve a sequence labeling task by augmenting it with models from related tasks for which data is free. I.e., I might train a model that, given a context with a missing word, will predict the word (eg., “The ____ gave a speech” might want you to insert “president".) By doing so, you can use these other models to give additional useful information to your main task.

Noah A. Smith and Jason Eisner, Contrastive Estimation: Training og-Linear Models on Unlabeled Data. This paper talks about training sequence labeling models in an unsupervised fashion, basically by contrasting what the model does on the correct string with what the model does on a corrupted version of the string. They get significantly better results than just by using EM in an HMM, and the idea is pretty nice.

Patrick Pantel, Inducing Ontological Co-occurrence Vectors. This is a pretty neat idea (though I’m biased — Patrick is a friend)
where one attempts to come up with feature vectors that describe nodes in a semantic hierarchy (ontology) that could enable you to figure out where to insert new words that are not in your ontology. The results are pretty good, and the method is fairly simple; I’d imagine that a more complex model/learning framework could improve the model even further.

No Magic!

One thing that sometimes confuses people about HRRs is the nature of the inverse operation.

For example, if we have a binding of two vectors, say A*B (where '*' is a circular convolution operations in HRRs), then how can we recover A (or B) from this binding, and exactly what information is stored in the binding?

The short answer is “there is no magic!” -- the amount of information stored in a binding is the same as the amount of information stored in the vector A or B.

This means that just given the binding alone, it is impossible to discover that is is constructed out of A and B (because then the binding would have had twice as much information as A and B).

However, given the binding A*B and one of the vectors, say A, it is possible to reconstruct the other member of the binding, i.e., B. This can be done using the “convolution inverse”, which can be written A*.

The convolution inverse (called “involution” in HRRs) is actually just a permutation of the elements of the vector – in ordinary HRRs it keeps the first element in place and reverses the order of the remaining elements. In terms of matrix linear algebra, the matrix corresponding to the convolution inverse of vector A is the transpose of the matrix corresponding to vector A.

Note that we assume that the “cue” vector (that we are using to decode the binding) is in fact part of the binding. It is also possible to use a a different cue vector, say C, to try to decode the binding A*B. If we do that, we get the vector C* * A*B, which is a just another vector. To identify that this is actually a nonsense vector, we need to have some way of identifying “valid” vectors. This is the one of the function of the “cleanup memory” in HRRs.

Where does hierarchical structure occur?

Vector representations are widely used for representing unstructured objects, e.g., bag-of-words style vector representations as used in information retrieval, but the emphasis in VSA's is on representation and processing of hierarchical compositional structure.

But, where does hierarchical compositional structure occur?

In language constructs that refer to other entity's states of mind (cf Robin Dunbar's “Gossip” theory of the evolution of language), e.g., "Pat thinks Tom loves Robin"
In planning (cf Mark Steedman's work on planning)

What are some examples of VSA's?

Holographic Reduced Representation (HRR)
Binary Spatter Codes
Associative-Projective Neural Networks (aka Context Dependent Thinning)

Where do VSA's fit into AI?

John Langford discusses different aspect of machine learning in a post Languages of Learning:

...
In addition there are languages related to various aspects of learning.

Reductions: A language for translating between varying real-world losses and core learning algorithm optimizations.

Feature Languages: Exactly how features are specified varies from on learning algorithm to another. Several people have been working on languages for features that cope with sparsity or the cross-product nature of databases.

Data interaction languages: The statistical query model of learning algorithms provides a standardized interface between data and learning algorithm.

...

It seems VSA's fit right into the "Feature Languages" category, and they provide solutions to the problem of representing combinatorial features (which are very sparse).

Vector Symbolic Architectures

Tuesday, April 27, 2010

This blog has moved

Tuesday, November 08, 2005

George Dyson's visit to Google

Wednesday, November 02, 2005

Human non-markov statistical language learning

Thursday, October 20, 2005

Step right up, step right up, Confabulation Theory...

Friday, October 14, 2005

Random Indexing

Monday, July 25, 2005

Interesting papers at ACL

Wednesday, July 20, 2005

No Magic!

Where does hierarchical structure occur?

What are some examples of VSA's?

Where do VSA's fit into AI?

About

About Me

Links

Previous

Archives