Randomized co-training: from cortical neurons to machine learning and back again
Despite its size and complexity, the human cortex exhibits striking anatomical regularities, suggesting there may be simple meta-algorithms underlying cortical learning and computation. We expect such meta-algorithms to be of interest since they need to operate quickly, scalably and effectively with little-to-no specialized assumptions.
This note focuses on a specific question: How can neurons use vast quantities of unlabeled data to speed up learning from the comparatively rare labels provided by reward systems? As a partial answer, we propose randomized co-training as a biologically plausible meta-algorithm satisfying the above requirements. As evidence, we describe a biologically-inspired algorithm, Correlated Nyström Views (XNV), that achieves state-of-the-art performance in semi-supervised learning, and sketch work in progress on a neuronal implementation.
Although staggeringly complex, the human cortex has a remarkably regular structure [1, 2]. For example, even expert anatomists find it difficult-to-impossible to distinguish between slices of tissue taken from, say, visual and prefrontal areas. This suggests there may be simple, powerful meta-algorithms underlying neuronal learning.
Consider one problem such a meta-algorithm should solve: taking advantage of massive quantities of unlabeled data. Evolution has provided mammals with neuromodulatory systems, such as the dopaminergic system, that assign labels (e.g. pleasure or pain) to certain outcomes. However, these labels are rare; an organism’s interactions with its environment are typically indifferent. Nevertheless, mammals often generalize accurately from just a few good or bad outcomes. Our problem is therefore to understand how organisms, and more specifically individual neurons, use unlabeled data to learn quickly and accurately.
Next, consider some properties a semi-supervised neuronal learning algorithm should have. It should be:
broadly applicable (that is, requiring few-to-no specialized assumptions); and
The first four requirements are of course desirable properties in any learning algorithm. The fourth requirement is particularly important due to the wide range of environments, both stochastic and adversarial, that organisms are exposed to.
Regarding the fifth requirement, it is unlikely that evolution has optimized all of the cortex’s connections, especially given the explosive growth in brain size over the last few million years. A simpler explanation, fitting neurophysiological evidence, is that the macroscopic connectivity (largely the white matter) was optimized, and the details are filled in randomly.
This note proposes randomized co-training as a semi-supervised meta-algorithm that satisfies the five criteria above. In particular, we argue that randomizing cortical connectivity is not only necessary but also beneficial.
A co-training algorithm takes labeled and unlabeled data consisting of two views. Examples of views are audio and visual recordings of objects or photographs taken from different angles. The key insight is that good predictors will agree on both views whereas bad predictors may not. Co-training algorithms therefore use unlabeled data to eliminate predictors disagreeing across views, shrinking the search space used on the labeled data, which results in better generalization bounds and improved empirical performance [5, 6, 7, 8, 9, 10, 11].
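The agreement principle can be made concrete with a minimal sketch. The example below is co-regularized least squares, one standard formulation from the co-training literature rather than the algorithm proposed in this note. The function names, the penalty weights `gamma` and `lam`, and the data layout are illustrative assumptions. Labeled points fit a linear predictor in each view, while unlabeled points contribute only a penalty on disagreement between the two views.

```python
import numpy as np

def co_regularized_least_squares(X1, X2, y, U1, U2, gamma=1e-2, lam=1.0):
    """Fit linear predictors w1 (view 1) and w2 (view 2).

    Minimizes (illustrative objective, an assumption of this sketch):
        ||X1 w1 - y||^2 + ||X2 w2 - y||^2
        + gamma * (||w1||^2 + ||w2||^2)
        + lam * ||U1 w1 - U2 w2||^2
    where (X1, X2, y) are labeled and (U1, U2) are unlabeled view pairs.
    """
    d1, d2 = X1.shape[1], X2.shape[1]
    # Normal equations for the stacked weight vector [w1; w2].
    A = np.block([
        [X1.T @ X1 + gamma * np.eye(d1) + lam * U1.T @ U1, -lam * U1.T @ U2],
        [-lam * U2.T @ U1, X2.T @ X2 + gamma * np.eye(d2) + lam * U2.T @ U2],
    ])
    b = np.concatenate([X1.T @ y, X2.T @ y])
    w = np.linalg.solve(A, b)
    return w[:d1], w[d1:]

def disagreement(w1, w2, U1, U2):
    """Mean squared disagreement between the views on unlabeled data."""
    return float(np.mean((U1 @ w1 - U2 @ w2) ** 2))
```

Setting `lam=0` recovers two independent ridge regressions. Increasing `lam` uses the unlabeled data to rule out predictor pairs that disagree across views.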
The most spectacular application of co-training is perhaps never-ending language learning (NELL), a semi-autonomous agent that updates a massive database of beliefs about English language categories and relations based on information extracted from the web [12, 13, 14].
However, despite its conceptual elegance, co-training remains a niche method. A possible reason for the lack of applications is that it is difficult to find naturally occurring views satisfying the technical assumptions required to improve performance. Constructing randomized views is a cheap workaround that dramatically extends co-training’s applicability.
§1 shows that discretizing standard models of synaptic dynamics and plasticity leads to a small tweak on linear regression. Incorporating NMDA synapses into the model as a second view leads to a neuronal co-training algorithm.
§2 reviews recent work which translated the above observations about neurons into a learning algorithm. We introduce Correlated Nyström Views (XNV), a state-of-the-art semi-supervised learning algorithm that combines multi-view regression with random views via the Nyström method.
Finally, §3 returns to neurons and sketches preliminary work analyzing the benefits of randomized co-training in cortex.
1 Modeling neurons as selective linear regressors
The perceptron was introduced in the 50s as a simple model of how neurons learn. It has been extremely influential, counting both deep learning architectures and support vector machines amongst its descendants. Unfortunately, the perceptron and related models badly misrepresent important features of neurons. In particular, they treat outputs symmetrically ($\pm 1$) rather than asymmetrically (0/1). This is crucial since spikes (1s) are more metabolically expensive than silences (0s). Further, by and large, neurons only update their synapses after spiking. By contrast, perceptrons update their weights after every misclassification.
To build a tighter link between learning algorithms and neurocomputational models, we recently discretized standard models of neuronal dynamics and plasticity to obtain the selectron. This section shows that, suitably regularized, the selectron is almost identical to linear regression. The difference is a selectivity term, arising from the spike/silence asymmetry, that encourages neurons to specialize.
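Let $X$ denote the set of inputs and $W$ the set of possible synaptic weights. Let $\langle w, x\rangle - \vartheta$ denote the excess current, where threshold $\vartheta$ is fixed. Given $x \in X$, a neuron with synaptic weights $w \in W$ outputs 0 or 1 according to
$$f_w(x) = H\big(\langle w, x\rangle - \vartheta\big), \qquad (1)$$
where $H$ is the Heaviside step function.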
We model neuromodulatory signals via a real-valued function $\nu(x)$ of the input, with positive values corresponding to desirable outcomes and negative values to undesirable ones. Signals may arrive after a delay of a few hundred milliseconds, which we do not model explicitly.
Definition 1 (selectron).
A threshold neuron is a selectron if its reward function takes the form
The reward function is continuously differentiable (in fact, linear) as a function of everywhere except at the kink where it is continuous but not differentiable. We can therefore maximize the reward via gradient ascent to obtain synaptic updates
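As a hedged illustration of the reward and update just described, the sketch below assumes a reward of the form $\nu(x)\,(\langle w, x\rangle - \vartheta)\, f_w(x)$, which is linear in $w$ away from the kink at the threshold, and a gradient-ascent update proportional to $\nu(x)\, x\, f_w(x)$. The symbols `w`, `x`, `theta`, `nu` and the learning rate `lr` are our notation for this sketch, not a verbatim restatement of equations (2) and (3).

```python
import numpy as np

def spike(w, x, theta):
    """Threshold neuron: output 1 if the input current exceeds theta, else 0."""
    return 1.0 if w @ x > theta else 0.0

def reward(w, x, nu, theta):
    """Assumed selectron reward: neuromodulator times excess current, gated by spiking."""
    return nu * (w @ x - theta) * spike(w, x, theta)

def selectron_update(w, x, nu, theta, lr=0.1):
    """One gradient-ascent step on the assumed reward.

    Away from the kink the gradient with respect to w is nu * x * spike,
    so synapses change only when the neuron spikes.
    """
    return w + lr * nu * x * spike(w, x, theta)
```

Under this reading, a run of positive `nu` values only ever increases the weights on active inputs, which is the overpotentiation problem that regularization is meant to address.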
Theorem 1 (discretized neurons).
If rewards are more common than punishments then (3) leads to overpotentiation (and eventually epileptic seizures). Neuroscientists therefore introduced a depotentiation bias into STDP. Alternatively, prior work introduced a norm constraint on synaptic weights that is enforced during sleep. Below, we interpolate between the two approaches by replacing the constraint with a regularizer.
We first recall linear regression to aid the comparison. Given data , regression finds parameters minimizing the mean squared error
One way to solve this is by gradient descent using
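For comparison with the neuronal updates, here is a minimal sketch of linear regression by gradient descent. It assumes the objective $\frac{1}{2n}\sum_i (\langle w, x_i\rangle - y_i)^2$. The scaling by $\frac{1}{2n}$, the step size `lr` and the function names are choices made for this sketch.

```python
import numpy as np

def mse_gradient_step(w, X, y, lr):
    """One gradient-descent step on (1/2n) * sum_i (<w, x_i> - y_i)^2."""
    residual = X @ w - y            # prediction error on every data point
    grad = X.T @ residual / len(y)  # gradient of the assumed objective
    return w - lr * grad

def fit_by_gradient_descent(X, y, lr=0.1, steps=1000):
    """Run a fixed number of gradient steps starting from zero weights."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        w = mse_gradient_step(w, X, y, lr)
    return w
```

Note that every data point contributes to every update. The selective variant discussed next differs in that only inputs on which the neuron spikes contribute.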
Selective linear regression.
Now, if we regularize the reward in (2) as follows
the result is selective linear regression: a neuron’s excess current predicts neuromodulation when it spikes. Since synaptic weights are non-negative, the neuron will not fire for such that . Computing gradient ascent obtains
The synaptic updates are . In other words, if neuron receives spike and produces spike , then it modifies synapse proportional to how much greater the rescaled neuromodulatory signal is than the excess current.
The selectivity term in (6) makes biological sense. There are billions of neurons in cortex, so it is necessary that they specialize. Neuromodulatory signals are thus ignored unless the neuron spikes – providing a niche wherein the neuron operates.
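The selective update described above can be sketched as follows. The code follows the verbal description only: when the neuron spikes, each active synapse changes in proportion to the difference between the rescaled neuromodulatory signal and the excess current, and silent neurons do not update. The names `nu_tilde`, `theta` and `lr`, and the clipping used to keep weights non-negative, are assumptions of this sketch.

```python
import numpy as np

def excess_current(w, x, theta):
    """Input current minus the firing threshold."""
    return float(w @ x - theta)

def selective_regression_update(w, x, nu_tilde, theta, lr=0.1):
    """One selective linear regression step (illustrative reading of the text).

    The neuron regresses its excess current onto the rescaled
    neuromodulatory signal nu_tilde, but only on inputs that make it spike.
    """
    e = excess_current(w, x, theta)
    if e <= 0:                 # no spike: the selectivity term switches learning off
        return w
    w_new = w + lr * (nu_tilde - e) * x
    return np.maximum(w_new, 0.0)  # assumed: clip to keep synaptic weights non-negative
```

Repeating the update on a single spiking input drives the excess current towards `nu_tilde`, which is the sense in which the excess current predicts neuromodulation.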
Multi-view learning in cortex.
There are two main types of excitatory synapse: AMPA and NMDA. So far we have modeled AMPA synapses which are typically feedforward and “driving”: they cause neurons to initiate spikes. NMDA synapses differ from AMPA in that
they multiplicatively modulate synaptic updates and
they prolong, but do not initiate, spiking activity.
We model the two types of synaptic inputs as AMPA, , and NMDA, , views with synaptic weights and respectively. In accord with the observations above, we extend (5) by adding a multiplicative modulation term. The NMDA view is encouraged to align with neuromodulators and is regularized the same as AMPA. The NMDA view has no selectivity term since it does not initiate spikes.
Finally, we obtain a (discretized) neuronal co-optimization algorithm, which simultaneously attempts to maximize how well each view predicts neuromodulatory signals and aligns the two views on unlabeled data:
The next section describes a semi-supervised regression algorithm inspired by (7).
2 Correlated random features for fast semi-supervised learning
This section translates the multiview optimization above into a workable learning algorithm.
From neurons to machine learning.
We make three observations about cortical neurons:
The observations suggest neurons perform an analog of kernelized multiview regression with random views and a CCA penalty.
To check that these form a viable combination, we put the pieces together to develop Correlated Nyström Views (XNV); see Algorithm 1. Reassuringly, XNV beats the state of the art in semi-supervised learning.
The main ingredient in XNV is multiview regression, which we now describe. Suppose the loss of the best regressor in each view is within of the best joint estimator.
Introduce the canonical norm where are orthogonal solutions to (8) with correlation coefficients . Multiview regression is then
Penalizing with the canonical norm biases the estimator towards features that are correlated across both views (the signal) and away from features that are uncorrelated (the noise). Multiview regression is thus a specific instantiation of the general co-training principle that good regressors agree across views whereas bad regressors may not.
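A minimal sketch of regression with a canonical-norm penalty is given below. It assumes the penalty takes the form $\sum_j \frac{1-\lambda_j}{\lambda_j}\beta_j^2$ in the canonical basis of the first view, where $\lambda_j$ are the canonical correlation coefficients estimated from paired data, plus a small ridge term `gamma`. This form, the helper names, and the small `ridge` added to the covariances for numerical stability are our assumptions, not a restatement of equations (8) and (9).

```python
import numpy as np

def inv_sqrt(C):
    """Inverse square root of a symmetric positive-definite matrix."""
    vals, vecs = np.linalg.eigh(C)
    return vecs @ np.diag(1.0 / np.sqrt(vals)) @ vecs.T

def cca(Z1, Z2, ridge=1e-6):
    """Canonical basis for view 1 and correlation coefficients from paired views."""
    n = Z1.shape[0]
    mu1, mu2 = Z1.mean(axis=0), Z2.mean(axis=0)
    A, B = Z1 - mu1, Z2 - mu2
    C11 = A.T @ A / n + ridge * np.eye(A.shape[1])
    C22 = B.T @ B / n + ridge * np.eye(B.shape[1])
    C12 = A.T @ B / n
    W1, W2 = inv_sqrt(C11), inv_sqrt(C22)
    # Singular values of the whitened cross-covariance are the correlations.
    U, s, _ = np.linalg.svd(W1 @ C12 @ W2, full_matrices=False)
    return mu1, W1 @ U, np.clip(s, 1e-6, 1.0)

def canonical_ridge(X1, y, mu1, B1, lam, gamma=1e-3):
    """Least squares in the canonical basis with the assumed canonical-norm penalty."""
    Zc = (X1 - mu1) @ B1                      # labeled data in canonical coordinates
    n = Zc.shape[0]
    penalty = np.diag((1.0 - lam) / lam + gamma)
    return np.linalg.solve(Zc.T @ Zc / n + penalty, Zc.T @ y / n)
```

Directions with correlation near 1 are almost unpenalized, while weakly correlated directions are shrunk heavily, which is the bias towards signal and away from noise described above.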
Theorem 2 (multiview regression).
The multiview estimator’s error, (9), compared to the best linear predictor , is bounded by
According to the theorem, a slight increase in bias compared to ordinary regression is potentially more than compensated for by a large drop in variance. The reduction in variance depends on how quickly the correlation coefficients decay. For example, in the trivial case where the two views are identical, there is no benefit from multiview regression. To work well, the algorithm requires sufficiently different views (where most basis vectors are uncorrelated) that nevertheless both contain good regressors (that is, the few correlated directions are of high quality).
To convert multiview regression into a general-purpose tool we need to construct pairs of views – for any data – satisfying two requirements. First, they should contain good regressors. Second, they should differ enough that their correlation coefficients decay rapidly.
A computationally cheap approach that does not depend on specific properties of the data is to generate random views. To do so, we used the Nyström method and random kitchen sinks. A recent theorem of Bach implies Nyström views contain good regressors in expectation, and a similar result holds for random kitchen sinks. Although there are currently no results on correlation coefficients across random views, recent lower bounds on the Frobenius norm of the Nyström approximation suggest “medium-sized” views (a few hundred dimensional) differ sufficiently. Empirical performance is discussed below.
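The construction of random Nyström views can be sketched as follows. Each view is built from its own random subset of landmark points: a point is represented by its kernel evaluations against the landmarks, whitened by the landmark Gram matrix. The Gaussian kernel, the bandwidth `sigma`, the use of disjoint landmark sets for the two views, and all function names are assumptions made for illustration.

```python
import numpy as np

def rbf_kernel(A, B, sigma=1.0):
    """Gaussian kernel matrix between the rows of A and the rows of B."""
    sq = (A ** 2).sum(axis=1)[:, None] + (B ** 2).sum(axis=1)[None, :] - 2 * A @ B.T
    return np.exp(-np.maximum(sq, 0.0) / (2 * sigma ** 2))

def nystrom_view(X, landmarks, sigma=1.0, tol=1e-10):
    """Nystrom feature map: Z @ Z.T approximates the kernel matrix of X."""
    K_mm = rbf_kernel(landmarks, landmarks, sigma)
    vals, vecs = np.linalg.eigh(K_mm)
    keep = vals > tol                          # drop numerically null directions
    proj = vecs[:, keep] / np.sqrt(vals[keep])
    return rbf_kernel(X, landmarks, sigma) @ proj

def two_random_views(X, m, rng, sigma=1.0):
    """Two views from disjoint random landmark sets (an assumption of this sketch)."""
    idx = rng.choice(len(X), size=2 * m, replace=False)
    view1 = nystrom_view(X, X[idx[:m]], sigma)
    view2 = nystrom_view(X, X[idx[m:]], sigma)
    return view1, view2
```

Because the landmarks are sampled independently of the labels, both views can be computed from unlabeled data alone, and their size `m` controls the trade-off between approximation quality and cost.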
Table 1 summarizes experiments evaluating XNV on 18 datasets; see [11] for details. We do not report on random kitchen sinks, since they performed worse than Nyström views. Performance is evaluated against kernel ridge regression (KRR) and a randomized version of SSSL, a semi-supervised algorithm with state-of-the-art performance.
XNV typically outperforms SSSL by between 10% and 15%, with about 30% less variance. Both semi-supervised algorithms achieve less than half the error of kernel ridge regression.
| Avg reduction in error vs KRR | 56% | 62% | 63% | 63% | 63% |
| Avg reduction in error vs SSSL | 11% | 16% | 15% | 12% | 9% |
| Avg reduction in standard error vs SSSL | 15% | 30% | 31% | 33% | 30% |
Importantly, XNV is fast. For points, XNV runs in on a laptop whereas (unrandomized) SSSL takes . For points, XNV’s runtime is whereas SSSL takes unfeasibly long.
3 Randomized cortical co-training
We describe work in progress on multiview neuronal regression.
Selective co-regularized least squares.
The model closely resembles XNV, with a penalty that is easier for neurons to implement.
The selectivity term in (10) ensures that neurons only predict the neuromodulatory signals when they spike. In other words, neurons have the flexibility to search for an AMPA view containing a good regressor. The NMDA weights are then simultaneously aligned with the neuromodulatory signal and the AMPA weights by the remaining two terms.
The benefit from co-training depends on the extent to which co-regularizing with unlabeled data shrinks the function space applied to the labeled data. In short, it depends on the Rademacher complexity of
Denote the Gram matrices of the two views by
where and are dot-products of unlabeled data on the respective views, and other blocks are similarly constructed using mixed and labeled data. It is shown in  that
The last term is of particular interest. Each column represents a labeled point in the first view by its dot-product with unlabeled points, and similarly for in the second view. The greater the difference between the representations in the two views, as measured by (11), the lower the Rademacher complexity and the better the generalization bounds on co-training.
It remains to be seen how selective multiview regression performs empirically, and to what extent (11) provides a good guide to the improvement in generalization performance of the original undiscretized models.
Co-training and randomization are two simple, powerful methods that complement each other well – and which neurons appear to use in conjunction. No doubt there are more tricks waiting to be discovered. It is particularly intriguing that NELL, one of the more ambitious AI projects in recent years, uses various co-training strategies as basic building blocks [12, 13, 14]. Despite a large body of research on how humans learn categories and relations, it remains unknown how (or whether) individual neurons learn categories.
Although the results sketched here are suggestive, they fall far short of the full story. For example, since neurons learn online – and only when they spike – they face similar explore/exploit dilemmas to those investigated in the literature on bandits. It will be interesting to see if new (randomized) bandit algorithms can be extracted from models of synaptic plasticity.
I thank my co-authors Michel Besserve, Joachim Buhmann and Brian McWilliams for their help developing these ideas.
- The randomized version of SSSL performs similarly to the original in a fraction of the runtime.
[1] Douglas RJ, Martin KAC: Neuronal Circuits of the Neocortex. Annu Rev Neurosci 2004, 27:419–451.
[2] Douglas RJ, Martin KAC: Mapping the Matrix: The Ways of Neocortex. Neuron 2007, 56.
[3] Blum A, Mitchell T: Combining labeled and unlabeled data with co-training. In COLT 1998.
[4] Balcan MF, Blum A: A Discriminative Model for Semi-Supervised Learning. J. ACM 2010.
[5] Sindhwani V, Niyogi P, Belkin M: A co-regularization approach to semi-supervised learning with multiple views. In ICML 2005.
[6] Farquhar JDR, Hardoon DR, Meng H, Shawe-Taylor J, Szedmak S: Two view learning: SVM-2K, theory and practice. In NIPS 2005.
[7] Brefeld U, Gärtner T, Scheffer T, Wrobel S: Efficient co-regularised least squares regression. In ICML 2006.
[8] Rosenberg D, Bartlett PL: The Rademacher complexity of co-regularized kernel classes. In AISTATS 2007.
[9] Kakade S, Foster DP: Multi-view Regression Via Canonical Correlation Analysis. In COLT 2007.
[10] Sridharan K, Kakade S: Information Theoretic Framework for Multi-view Learning. In COLT 2008.
[11] McWilliams B, Balduzzi D, Buhmann J: Correlated random features for fast semi-supervised learning. In Adv in Neural Information Processing Systems (NIPS) 2013.
[12] Carlson A, Betteridge J, Wang RC, Hruschka ER, Mitchell T: Coupled Semi-Supervised Learning for Information Extraction. In WSDM 2010.
[13] Carlson A, Betteridge J, Kisiel B, Settles B, Hruschka ER, Mitchell T: Towards an Architecture for Never-Ending Language Learning. In AAAI 2010.
[14] Balcan MF, Blum A, Mansour Y: Exploiting Ontology Structures and Unlabeled Data for Learning. In ICML 2013.
[15] Balduzzi D, Besserve M: Towards a learning-theoretic analysis of spike-timing dependent plasticity. In NIPS 2012.
[16] Rosenblatt F: The perceptron: a probabilistic model for information storage and organization in the brain. Psychol Rev 1958, 65(6):386–408.
[17] Gerstner W, Kistler W: Spiking Neuron Models. Cambridge University Press 2002.
[18] Song S, Miller KD, Abbott LF: Competitive Hebbian learning through spike-timing-dependent synaptic plasticity. Nat Neurosci 2000, 3(9).
[19] Roelfsema PR, van Ooyen A: Attention-gated reinforcement learning of internal representations for classification. Neural Comput 2005, 17(10):2176–2214.
[20] Schölkopf B, Smola AJ: Learning with Kernels. MIT Press 2002.
[21] Maass W, Natschläger T, Markram H: Real-time computing without stable states: a new framework for neural computation based on perturbations. Neural Comput 2002, 14(11):2531–2560.
[22] Hardoon DR, Szedmak S, Shawe-Taylor J: Canonical Correlation Analysis: An Overview with Application to Learning Methods. Neural Comput 2004, 16(12):2639–2664.
[23] Ji M, Yang T, Lin B, Jin R, Han J: A Simple Algorithm for Semi-supervised Learning with Improved Generalization Error Bound. In ICML 2012.
[24] Williams C, Seeger M: Using the Nyström method to speed up kernel machines. In NIPS 2001.
[25] Rahimi A, Recht B: Weighted sums of random kitchen sinks: Replacing minimization with randomization in learning. In Adv in Neural Information Processing Systems (NIPS) 2008.
[26] Bach F: Sharp analysis of low-rank kernel approximations. In COLT 2013.
[27] Wang S, Zhang Z: Improving CUR Matrix Decomposition and the Nyström Approximation via Adaptive Sampling. JMLR 2013, 14:2549–2589.