Randomized co-training: from cortical neurons to machine learning and back again
Abstract
Despite its size and complexity, the human cortex exhibits striking anatomical regularities, suggesting there may be simple meta-algorithms underlying cortical learning and computation. We expect such meta-algorithms to be of interest since they need to operate quickly, scalably and effectively with little-to-no specialized assumptions.
This note focuses on a specific question: How can neurons use vast quantities of unlabeled data to speed up learning from the comparatively rare labels provided by reward systems? As a partial answer, we propose randomized co-training as a biologically plausible meta-algorithm satisfying the above requirements. As evidence, we describe a biologically-inspired algorithm, Correlated Nyström Views (XNV), that achieves state-of-the-art performance in semi-supervised learning, and sketch work in progress on a neuronal implementation.
Although staggeringly complex, the human cortex has a remarkably regular structure [1, 2]. For example, even expert anatomists find it difficult-to-impossible to distinguish between slices of tissue taken from, say, visual and prefrontal areas. This suggests there may be simple, powerful meta-algorithms underlying neuronal learning.
Consider one problem such a meta-algorithm should solve: taking advantage of massive quantities of unlabeled data. Evolution has provided mammals with neuromodulatory systems, such as the dopaminergic system, that assign labels (e.g. pleasure or pain) to certain outcomes. However, these labels are rare; an organism’s interactions with its environment are typically indifferent. Nevertheless, mammals often generalize accurately from just a few good or bad outcomes. Our problem is therefore to understand how organisms, and more specifically individual neurons, use unlabeled data to learn quickly and accurately.
Next, consider some properties a semi-supervised neuronal learning algorithm should have. It should be:

fast;

scalable;

effective;

broadly applicable (that is, requiring few-to-no specialized assumptions); and

biologically plausible.
The first four requirements are of course desirable properties in any learning algorithm. The fourth requirement is particularly important due to the wide range of environments, both stochastic and adversarial, that organisms are exposed to.
Regarding the fifth requirement, it is unlikely that evolution has optimized all of the cortex’s connections, especially given the explosive growth in brain size over the last few million years. A simpler explanation, fitting neurophysiological evidence, is that the macroscopic connectivity (largely the white matter) was optimized, and the details are filled in randomly.
Contribution.
This note proposes randomized co-training as a semi-supervised meta-algorithm that satisfies the five criteria above. In particular, we argue that randomizing cortical connectivity is not only necessary but also beneficial.
Cotraining.
A co-training algorithm takes labeled and unlabeled data consisting of two views [3]. Examples of views are audio and visual recordings of objects, or photographs taken from different angles. The key insight is that good predictors will agree on both views whereas bad predictors may not [4]. Co-training algorithms therefore use unlabeled data to eliminate predictors disagreeing across views, shrinking the search space used on the labeled data – resulting in better generalization bounds and improved empirical performance [5, 6, 7, 8, 9, 10, 11].
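The principle can be illustrated with a toy sketch (a hypothetical setup for illustration, not the original algorithm of [3]): two noisy binary views of a latent label, a handful of candidate predictors per view, and unlabeled agreement used to filter predictor pairs before the few labels are consulted.

```python
import numpy as np

rng = np.random.default_rng(0)
z = rng.integers(0, 2, size=210)                     # latent label
flip = lambda v: np.where(rng.random(v.shape) < 0.05, 1 - v, v)
v1, v2 = flip(z), flip(z)                            # two noisy views of z
v1_u, v2_u = v1[:200], v2[:200]                      # plentiful unlabeled data
v1_l, y_l = v1[200:], z[200:]                        # only 10 labeled points

hyps = {"id": lambda v: v, "neg": lambda v: 1 - v}   # candidate predictors per view
pairs = [(a, b) for a in hyps for b in hyps]

# Step 1: unlabeled data eliminates pairs that disagree across views.
survivors = [(a, b) for a, b in pairs
             if np.mean(hyps[a](v1_u) == hyps[b](v2_u)) > 0.75]

# Step 2: the few labels select among the survivors.
best = max(survivors, key=lambda p: np.mean(hyps[p[0]](v1_l) == y_l))
```

Agreement on the unlabeled data eliminates the mismatched pairs before any label is used, so the labeled sample only has to distinguish between the two consistent candidates.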
The most spectacular application of co-training is perhaps never-ending language learning (NELL), a semi-autonomous agent that updates a massive database of beliefs about English language categories and relations based on information extracted from the web [12, 13, 14].
However, despite its conceptual elegance, co-training remains a niche method. A possible reason for the lack of applications is that it is difficult to find naturally occurring views satisfying the technical assumptions required to improve performance. Constructing randomized views is a cheap workaround that dramatically extends co-training’s applicability.
Outline.
§1 shows that discretizing standard models of synaptic dynamics and plasticity [15] leads to a small tweak on linear regression. Incorporating NMDA synapses into the model as a second view leads to a neuronal co-training algorithm.
§2 reviews recent work which translated the above observations about neurons into a learning algorithm. We introduce Correlated Nyström Views (XNV), a state-of-the-art semi-supervised learning algorithm that combines multi-view regression with random views via the Nyström method [11].
Finally, §3 returns to neurons and sketches preliminary work analyzing the benefits of randomized co-training in cortex.
1 Modeling neurons as selective linear regressors
The perceptron was introduced in the 1950s as a simple model of how neurons learn [16]. It has been extremely influential, counting both deep learning architectures and support vector machines amongst its descendants. Unfortunately, however, the perceptron and related models badly misrepresent important features of neurons. In particular, they treat outputs symmetrically (±1) rather than asymmetrically (0/1). This is crucial since spikes (1s) are more metabolically expensive than silences (0s). Further, by and large neurons only update their synapses after spiking. By contrast, perceptrons update their weights after every misclassification.
To build a tighter link between learning algorithms and neurocomputational models, we recently discretized standard models of neuronal dynamics and plasticity to obtain the selectron [15]. This section shows that, suitably regularized, the selectron is almost identical to linear regression. The difference is a selectivity term – arising from the spike/silence asymmetry – that encourages neurons to specialize.
The selectron.
Let 𝒳 denote the set of inputs and 𝒲 the set of possible synaptic weights, and let θ > 0 be a fixed threshold. Given input x ∈ 𝒳, a neuron with synaptic weights w ∈ 𝒲 outputs 0 or 1 according to
(1)  f_w(x) = 1 if ⟨w, x⟩ > θ, and f_w(x) = 0 otherwise.
We model neuromodulatory signals via a function ν : 𝒳 → ℝ, with positive values corresponding to desirable outcomes and conversely. Signals may arrive after a delay of a few hundred milliseconds, which we do not model explicitly.
Definition 1 (selectron).
A threshold neuron is a selectron if its reward function takes the form
(2)  R(w) = 𝔼[ ν(x) · f_w(x) · (⟨w, x⟩ − θ) ]
The reward function is continuously differentiable (in fact, linear) as a function of w everywhere except at the kink ⟨w, x⟩ = θ, where it is continuous but not differentiable. We can therefore maximize the reward via gradient ascent to obtain synaptic updates
(3)  Δw ∝ ν(x) · f_w(x) · x
Theorem 1 (discretized neurons, [15]).
If rewards are more common than punishments, then (3) leads to overpotentiation (and eventually epileptic seizures). Neuroscientists therefore introduced a depotentiation bias into STDP. Alternatively, [15] introduced a hard constraint on synaptic weights, enforced during sleep. Below, we interpolate between the two approaches by replacing the constraint with a regularizer.
Linear regression.
We first recall linear regression to aid the comparison. Given data {(x_i, y_i)}_{i=1}^n, regression finds parameters β minimizing the mean squared error (1/n) Σ_i (y_i − ⟨β, x_i⟩)².
One way to solve this is by gradient descent using
(4)  Δβ ∝ Σ_i (y_i − ⟨β, x_i⟩) · x_i
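As a concrete sketch of gradient descent on the mean squared error (the toy data, learning rate and sample size are arbitrary choices):

```python
import numpy as np

# Toy data: y is a noisy linear function of x.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
true_beta = np.array([1.0, -2.0, 0.5])
y = X @ true_beta + 0.1 * rng.normal(size=200)

# Gradient descent on the mean squared error.
beta = np.zeros(3)
lr = 0.1
for _ in range(500):
    residual = y - X @ beta               # y_i - <beta, x_i> for each i
    beta += lr * X.T @ residual / len(y)  # average gradient step
```

After a few hundred steps `beta` converges to the least-squares solution, which here is close to `true_beta`.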
Selective linear regression.
Now, if we regularize the reward in (2) as follows
(5)  R_reg(w) = 𝔼[ f_w(x) · ( ν(x) · (⟨w, x⟩ − θ) − ½ (⟨w, x⟩ − θ)² ) ]
the result is selective linear regression: a neuron’s excess current ⟨w, x⟩ − θ predicts neuromodulation when it spikes. Since synaptic weights are non-negative, the neuron simply does not fire, and so makes no prediction, for inputs x with ⟨w, x⟩ ≤ θ. Gradient ascent yields
(6)  Δw ∝ f_w(x) · ( ν(x) − (⟨w, x⟩ − θ) ) · x
The synaptic update is given componentwise by (6): if a neuron receives a spike on synapse j and itself spikes, then it modifies w_j in proportion to how much greater the (rescaled) neuromodulatory signal is than the excess current.
The selectivity term in (6) makes biological sense. There are billions of neurons in cortex, so it is necessary that they specialize. Neuromodulatory signals are thus ignored unless the neuron spikes – providing a niche wherein the neuron operates.
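A minimal sketch of the selective update rule (the threshold, learning rate and scaling of the neuromodulatory signal are hypothetical choices; any rescaling of ν is absorbed into its units):

```python
import numpy as np

def selectron_update(w, x, nu, theta=1.0, lr=0.01):
    """One selective update: synapses change only when the neuron spikes,
    moving the excess current toward the neuromodulatory signal nu."""
    excess = w @ x - theta
    if excess <= 0:                 # no spike: learning is gated off
        return w
    w = w + lr * x * (nu - excess)  # potentiate toward the reward signal
    return np.maximum(w, 0.0)       # synaptic weights stay non-negative

w = np.full(4, 0.5)
w = selectron_update(w, np.ones(4), nu=2.0)  # spike plus reward: potentiate
```

Calling the update on a silent neuron (excess current at or below zero) leaves its weights untouched, which is exactly the specialization behavior described above.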
Multi-view learning in cortex.
There are two main types of excitatory synapse: AMPA and NMDA. So far we have modeled AMPA synapses, which are typically feedforward and “driving”: they cause neurons to initiate spikes. NMDA synapses differ from AMPA in that [19]

they multiplicatively modulate synaptic updates and

they prolong, but do not initiate, spiking activity.
We model the two types of synaptic input as two views, an AMPA view and an NMDA view, each with its own synaptic weights. In accord with the observations above, we extend (5) by adding a multiplicative modulation term. The NMDA view is encouraged to align with neuromodulators and is regularized in the same way as the AMPA view. The NMDA view has no selectivity term since it does not initiate spikes.
Finally, we obtain a (discretized) neuronal cooptimization algorithm, which simultaneously attempts to maximize how well each view predicts neuromodulatory signals and aligns the two views on unlabeled data:
(7)  
The next section describes a semi-supervised regression algorithm inspired by (7).
2 Correlated random features for fast semi-supervised learning
This section translates the multi-view optimization above into a workable learning algorithm.
From neurons to machine learning.
Algorithm 1 (XNV) takes labeled and unlabeled data as input and outputs a semi-supervised estimator; the full pseudocode is given in [11].
We make three observations about cortical neurons, which together suggest that neurons perform an analog of

kernelized multi-view regression

with random views and

a CCA penalty.
To check that these form a viable combination, we put the pieces together to develop Correlated Nyström Views (XNV); see Algorithm 1 and [11]. Reassuringly, XNV beats the state-of-the-art in semi-supervised learning [23].
Multi-view regression.
The main ingredient in XNV is multi-view regression, which we now describe. Suppose that the loss of the best regressor in each view is within ε of the loss of the best joint estimator over both views (assumption A).
Introduce the canonical norm ‖β‖²_CCA = Σ_i ((1 − λ_i)/λ_i) · β_i², where the coordinates β_i are taken with respect to the orthogonal solutions of the CCA problem on the two views, with correlation coefficients λ_1 ≥ λ_2 ≥ ⋯. Multi-view regression is then
(9)  β̂ = argmin_β Σ_i (y_i − ⟨β, x_i⟩)² + ‖β‖²_CCA
Penalizing with the canonical norm biases the estimator towards features that are correlated across both views (the signal) and away from features that are uncorrelated (the noise). Multi-view regression is thus a specific instantiation of the general co-training principle that good regressors agree across views whereas bad regressors may not.
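The effect of the canonical-norm penalty can be sketched directly in the canonical basis (a toy illustration: we assume the data are already expressed in that basis, and use a per-direction penalty weight of (1 − λ)/λ; the example correlations are arbitrary):

```python
import numpy as np

def canonical_ridge(Z, y, lams, eps=1e-6):
    """Least squares in a canonical (CCA) basis. Direction i, with canonical
    correlation lams[i], is penalized by (1 - lam)/lam: nearly uncorrelated
    directions are shrunk hard, strongly correlated ones barely at all."""
    penalty = (1.0 - lams) / np.maximum(lams, eps)  # eps guards lam ~ 0
    A = Z.T @ Z + np.diag(penalty)
    return np.linalg.solve(A, Z.T @ y)

# Two canonical directions: one highly correlated across views (signal),
# one nearly uncorrelated (noise).
Z = np.eye(2)
y = np.array([1.0, 1.0])
beta = canonical_ridge(Z, y, lams=np.array([0.99, 0.01]))
```

The fitted coefficient on the correlated direction stays close to its least-squares value, while the coefficient on the uncorrelated direction is shrunk almost to zero.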
Theorem 2 (multi-view regression, [9]).
The error of the multi-view estimator (9), relative to the best linear predictor, is bounded by a bias term of order ε plus a variance term governed by the decay of the correlation coefficients.
According to the theorem, a slight increase in bias compared to ordinary regression is potentially more than compensated for by a large drop in variance. The reduction in variance depends on how quickly the correlation coefficients decay. For example, in the trivial case where the two views are identical, there is no benefit from multi-view regression. To work well, the algorithm requires sufficiently different views (where most basis vectors are uncorrelated) that nevertheless both contain good regressors (that is, the few correlated directions are of high quality).
Randomization.
To convert multi-view regression into a general-purpose tool we need to construct pairs of views – for any data – satisfying two requirements. First, they should contain good regressors. Second, they should differ enough that their correlation coefficients decay rapidly.
A computationally cheap approach that does not depend on specific properties of the data is to generate random views. To do so, we used the Nyström method [24] and random kitchen sinks [25]. A recent theorem of Bach implies Nyström views contain good regressors in expectation [26], and a similar result holds for random kitchen sinks. Although there are currently no results on correlation coefficients across random views, recent lower bounds on the Frobenius norm of the Nyström approximation suggest that “medium-sized” views (a few hundred dimensions) differ sufficiently [27]. Empirical performance is discussed below.
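A minimal sketch of constructing two random Nyström views (the RBF kernel, its bandwidth, and the landmark counts are illustrative choices):

```python
import numpy as np

def rbf(A, B, gamma=1.0):
    # Gaussian (RBF) kernel matrix between the rows of A and B.
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq_dists)

def nystrom_view(X, landmarks, gamma=1.0):
    """Nystrom feature map phi(x) = K(x, L) @ K(L, L)^{-1/2}, so that
    phi(X) @ phi(X).T approximates the full kernel matrix."""
    K_nm = rbf(X, landmarks, gamma)
    K_mm = rbf(landmarks, landmarks, gamma)
    U, s, _ = np.linalg.svd(K_mm)
    inv_sqrt = U @ np.diag(1.0 / np.sqrt(np.maximum(s, 1e-12))) @ U.T
    return K_nm @ inv_sqrt

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
idx = rng.permutation(100)
view1 = nystrom_view(X, X[idx[:20]])    # two views built from
view2 = nystrom_view(X, X[idx[20:40]])  # disjoint random landmark sets
```

By construction the kernel approximation is exact on each view’s own landmark points, and disjoint landmark sets give genuinely different views of the same data.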
Performance.
Table 1 summarizes experiments evaluating XNV on 18 datasets; see [11] for details. We do not report on random kitchen sinks, since they performed worse than Nyström views. Performance is evaluated against kernel ridge regression (KRR) and a randomized version of SSSL, a semi-supervised algorithm with state-of-the-art performance [23].
XNV typically outperforms SSSL by between 10% and 15%, with about 30% less variance. Both semi-supervised algorithms achieve less than half the error of kernel ridge regression.
Avg. reduction in error vs KRR: 56% | 62% | 63% | 63% | 63%
Avg. reduction in error vs SSSL: 11% | 16% | 15% | 12% | 9%
Avg. reduction in standard error vs SSSL: 15% | 30% | 31% | 33% | 30%
Importantly, XNV is fast. On a laptop it runs orders of magnitude faster than (unrandomized) SSSL, and it remains practical at scales where SSSL takes unfeasibly long.
3 Randomized cortical co-training
We describe work in progress on multi-view neuronal regression.
Selective co-regularized least squares.
The multi-view optimization in (7) can be rewritten, essentially, as co-regularized least squares [5] with a selectivity term encouraging specialization:
(10)  
The model closely resembles XNV, with a cross-view disagreement penalty that is easier for neurons to implement than the CCA norm.
The selectivity term in (10) ensures that neurons only predict the neuromodulatory signals when they spike. In other words, neurons have the flexibility to search for an AMPA view containing a good regressor. The NMDA weights are then simultaneously aligned with the neuromodulatory signal and the AMPA weights by the remaining two terms.
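Dropping the selectivity gate for simplicity, co-regularized least squares admits a closed form via its normal equations. A minimal sketch (the disagreement weight mu and the small ridge term are hypothetical choices):

```python
import numpy as np

def co_regularized_ls(X1, X2, y, U1, U2, mu=1.0, ridge=1e-6):
    """Co-regularized least squares: fit a linear regressor in each view
    on the labeled data (X1, X2, y) while penalizing, with weight mu,
    disagreement between the two views on unlabeled data (U1, U2)."""
    d1, d2 = X1.shape[1], X2.shape[1]
    A = np.zeros((d1 + d2, d1 + d2))
    A[:d1, :d1] = X1.T @ X1 + mu * U1.T @ U1 + ridge * np.eye(d1)
    A[d1:, d1:] = X2.T @ X2 + mu * U2.T @ U2 + ridge * np.eye(d2)
    A[:d1, d1:] = -mu * U1.T @ U2   # coupling terms induced by the
    A[d1:, :d1] = -mu * U2.T @ U1   # cross-view disagreement penalty
    b = np.concatenate([X1.T @ y, X2.T @ y])
    beta = np.linalg.solve(A, b)
    return beta[:d1], beta[d1:]
```

Setting mu = 0 recovers two independent least-squares fits; increasing mu forces the two regressors to agree on the unlabeled points.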
Guarantees.
Theoretical guarantees for co-regularization are provided in [8, 10]. As above, they depend on having good regressors in sufficiently different views. We briefly sketch how these apply.
The benefit from co-training depends on the extent to which co-regularizing with unlabeled data shrinks the function space applied to the labeled data. In short, it depends on the Rademacher complexity of the co-regularized function class.
Denote the Gram matrices of the two views in block form, where two blocks contain the dot-products of unlabeled data in the respective views and the remaining blocks are similarly constructed using mixed and labeled data. It is shown in [8] that
(11) 
The last term is of particular interest. Each column of the corresponding matrix represents a labeled point in the first view by its dot-products with the unlabeled points, and similarly in the second view. The greater the difference between the representations in the two views, as measured by (11), the lower the Rademacher complexity and the better the generalization bounds on co-training.
It remains to be seen how selective multi-view regression performs empirically, and to what extent (11) provides a good guide to the improvement in generalization performance of the original undiscretized models.
4 Discussion
Co-training and randomization are two simple, powerful methods that complement each other well – and which neurons appear to use in conjunction. No doubt there are more tricks waiting to be discovered. It is particularly intriguing that NELL, one of the more ambitious AI projects in recent years, uses various co-training strategies as basic building blocks [12, 13, 14]. Despite a large body of research on how humans learn categories and relations, it remains unknown how (or whether) individual neurons learn categories.
Although the results sketched here are suggestive, they fall far short of the full story. For example, since neurons learn online – and only when they spike – they face similar explore/exploit dilemmas to those investigated in the literature on bandits. It will be interesting to see if new (randomized) bandit algorithms can be extracted from models of synaptic plasticity.
Acknowledgements.
I thank my coauthors Michel Besserve, Joachim Buhmann and Brian McWilliams for their help developing these ideas.
Footnotes
The randomized version of SSSL used in our experiments performs similarly to the original in a fraction of the runtime.
References
 Douglas RJ, Martin KAC: Neuronal Circuits of the Neocortex. Annu Rev Neurosci 2004, 27:419–451.
 Douglas RJ, Martin KAC: Mapping the Matrix: The Ways of Neocortex. Neuron 2007, 56.
Blum A, Mitchell T: Combining labeled and unlabeled data with co-training. In COLT 1998.
Balcan MF, Blum A: A Discriminative Model for Semi-Supervised Learning. J. ACM 2010.
Sindhwani V, Niyogi P, Belkin M: A co-regularization approach to semi-supervised learning with multiple views. In ICML 2005.
Farquhar JDR, Hardoon DR, Meng H, Shawe-Taylor J, Szedmak S: Two view learning: SVM-2K, theory and practice. In NIPS 2005.
Brefeld U, Gärtner T, Scheffer T, Wrobel S: Efficient co-regularised least squares regression. In ICML 2006.
Rosenberg D, Bartlett PL: The Rademacher complexity of co-regularized kernel classes. In AISTATS 2007.
Kakade S, Foster DP: Multi-view Regression Via Canonical Correlation Analysis. In COLT 2007.
Sridharan K, Kakade S: An Information Theoretic Framework for Multi-view Learning. In COLT 2008.
McWilliams B, Balduzzi D, Buhmann J: Correlated random features for fast semi-supervised learning. In NIPS 2013.
Carlson A, Betteridge J, Wang RC, Hruschka ER, Mitchell T: Coupled Semi-Supervised Learning for Information Extraction. In WSDM 2010.
Carlson A, Betteridge J, Kisiel B, Settles B, Hruschka ER, Mitchell T: Toward an Architecture for Never-Ending Language Learning. In AAAI 2010.
Balcan MF, Blum A, Mansour Y: Exploiting Ontology Structures and Unlabeled Data for Learning. In ICML 2013.
Balduzzi D, Besserve M: Towards a learning-theoretic analysis of spike-timing dependent plasticity. In NIPS 2012.
 Rosenblatt F: The perceptron: a probabilistic model for information storage and organization in the brain. Psychol Rev 1958, 65(6):386–408.
 Gerstner W, Kistler W: Spiking Neuron Models. Cambridge University Press 2002.
Song S, Miller KD, Abbott LF: Competitive Hebbian learning through spike-timing-dependent synaptic plasticity. Nat Neurosci 2000, 3(9).
Roelfsema PR, van Ooyen A: Attention-gated reinforcement learning of internal representations for classification. Neural Comput 2005, 17(10):2176–2214.
 Schölkopf B, Smola AJ: Learning with Kernels. MIT Press 2002.
Maass W, Natschläger T, Markram H: Real-time computing without stable states: a new framework for neural computation based on perturbations. Neural Comput 2002, 14(11):2531–2560.
Hardoon DR, Szedmak S, Shawe-Taylor J: Canonical Correlation Analysis: An Overview with Application to Learning Methods. Neural Comput 2004, 16(12):2639–2664.
Ji M, Yang T, Lin B, Jin R, Han J: A Simple Algorithm for Semi-supervised Learning with Improved Generalization Error Bound. In ICML 2012.
 Williams C, Seeger M: Using the Nyström method to speed up kernel machines. In NIPS 2001.
Rahimi A, Recht B: Weighted sums of random kitchen sinks: Replacing minimization with randomization in learning. In NIPS 2008.
Bach F: Sharp analysis of low-rank kernel approximations. In COLT 2013.
Wang S, Zhang Z: Improving CUR Matrix Decomposition and the Nyström Approximation via Adaptive Sampling. JMLR 2013, 14:2549–2589.