Signatures of criticality arise in simple neural population models with correlations
Marcel Nonnenmacher^{1,2,3*}, Christian Behrens^{3,4}, Philipp Berens^{3,4,5,6}, Matthias Bethge^{2,3,4}, Jakob H. Macke^{1,2,3*}
1 research center caesar, an associate of the Max Planck Society, Bonn, Germany
2 Max Planck Institute for Biological Cybernetics, Tübingen, Germany
3 Bernstein Center for Computational Neuroscience, Tübingen
4 Centre for Integrative Neuroscience and Institute of Theoretical Physics, University of Tübingen;
5 Institute of Ophthalmic Research, University of Tübingen
6 Baylor College of Medicine, Houston, TX, USA
* marcel.nonnenmacher@caesar.de, jakob.macke@caesar.de
Abstract
Large-scale recordings of neuronal activity make it possible to gain insights into the collective activity of neural ensembles. It has been hypothesized that neural populations might be optimized to operate at a ‘thermodynamic critical point’, and that this property has implications for information processing. Support for this notion has come from a series of studies which identified statistical signatures of criticality in the ensemble activity of retinal ganglion cells. What are the underlying mechanisms that give rise to these observations?
Here we show that signatures of criticality arise even in simple feedforward models of retinal population activity. In particular, they occur whenever neural population data exhibits correlations and is randomly subsampled during data analysis. These results show that signatures of criticality are not necessarily indicative of an optimized coding strategy, and challenge the utility of analysis approaches based on equilibrium thermodynamics for understanding partially observed biological systems.
Signatures of criticality arise from random subsampling in simple population models.
PLoS Comput Biol 13(10): e1005718. https://doi.org/10.1371/journal.pcbi.1005718
1 Introduction
Recent advances in neural recording technology [1, 2] and computational tools for describing neural population activity [3] make it possible to empirically examine the statistics of large neural populations and search for principles underlying their collective dynamics [4]. One intriguing hypothesis that has emerged from this approach is the idea that neural populations might be poised at a thermodynamic critical point [5, 6, 7], and that this might have important consequences for how neural populations process and encode sensory information [7, 8]. As similar observations have been made in other biological systems (e.g. [9, 10, 11]), it has been suggested that this might reflect a more general organizing principle [12].
In the case of neural coding, evidence in favour of this hypothesis has been put forward by a series of studies which measured neural activity from large populations of retinal ganglion cells and reported that their statistics resemble those of physical systems at a critical point [7, 8]. Using large-scale multi-electrode array recordings [2], spike-sorting methods that scale to large populations [13, 2] and specially developed maximum entropy models [14, 15, 16, 17, 18, 12, 3, 19], Tkačik et al. observed that the specific heat—a global population statistic which measures the range of probabilities of spike patterns—diverges as a function of population size. In addition, when an artificial temperature parameter is introduced, the specific heat is maximised for the statistics of the observed data rather than for statistics which have been perturbed by changing the temperature parameter. These properties of retinal populations resemble the behaviour of physical systems at critical points, and gave rise to the hypothesis that neural systems might also be poised at critical points.
What neural mechanisms can explain these observations? It has been hypothesised that the properties of the system need to be finely tuned [12, 7] to keep the system at a critical point, for example through adaptation [20]. A competing hypothesis [21, 22, 23] states that generic mechanisms based on latent-variable models could be sufficient to give rise to activity data with these statistics, but neither of these theoretical studies investigated mechanistic models of retinal population activity. Thus, subsequent studies advocating criticality in the retina [7, 8] continued to interpret their observations as an indication that the retina is poised at a special state that is advantageous for coding. It therefore remains an open question whether previously reported signatures of criticality reveal a new mechanism of retinal coding, or whether they are a direct consequence of standard encoding models of retinal ganglion cell responses [24, 25, 26, 27, 28].
We here challenge the conclusions of studies which used tools from statistical physics to search for signatures of criticality, by applying exactly the same data analysis approach to a simplistic feedforward cascade model of retinal ganglion cell responses and showing that it exhibits the same effects. Focusing on how the specific heat of this simulated data varies with population size and temperature, we show that this simple model exhibits signatures of criticality and reproduces the experimentally reported dependence on different stimulus ensembles [7]. We provide a theoretical analysis of an analytically tractable model [29, 21, 30], and show mathematically that it exhibits signatures of criticality under a wide range of parameters.
This analysis also points to a subtle but important difference between how theoretical studies and practical neural data analyses treat the scaling behaviour of a system: whereas many theoretical studies describe different systems of size n, in practical neural data analysis populations of different size are typically constructed by randomly subsampling a large (but fixed) recording of neural activity. We show that this sampling process produces ’signatures of criticality’ whenever the neural data has non-zero correlations, which could arise from a shared stimulus drive, recurrent connectivity or global state fluctuations [31, 32, 33, 34].
2 Results
2.1 Signatures of criticality arise in a simple model of retinal ganglion cell activity
A hallmark of criticality is that the specific heat of the model diverges when the temperature reaches the critical temperature [5]. Tkačik et al. [7] developed a statistical approach for translating this concept to neural data analysis. In their analysis, neural populations of different size n are generated from the full recording by randomly subsampling the entire population. The statistics of activity for each population of size n are characterized using a maximum entropy model [14, 15, 17, 18, 3]. Finally, the maximum entropy models are perturbed by introducing a temperature parameter, and specific heat is computed for each population size and temperature from the (perturbed) maximum entropy model fit. Divergence of specific heat with population size n, and a peak of the specific heat near unit temperature (the ’temperature’ of the original data), are interpreted as evidence for the system being at a critical point [7].
To test if these signatures of criticality can be reproduced by canonical properties of retinal circuits, we first created a simple phenomenological model of retinal ganglion cell (RGC) activity based on linear-nonlinear neuron models [24, 25, 28]. In this model (Fig. 1a), we assumed retinal ganglion cells to have centre-surround receptive fields [35, 28] with linear spatial integration [36], sigmoid nonlinearities and stochastic binary spikes, i.e. in each time bin each neuron either emitted a spike (x_i = 1) or not (x_i = 0). We used a sequence of natural images as stimuli (see Methods 3.1 for details). In addition to the feedforward drive by the stimulus, nearby neurons received shared Gaussian noise, mimicking common input from bipolar cells [37]. Thus, cross-neural correlations in the model arise from correlations in the stimulus, receptive-field overlap and shared noise, but not from lateral connections between RGCs. Parameters of the model were chosen to approximate the statistics of receptive-field centre locations of RGCs (Fig. 1b), as well as histograms of firing rates, pairwise correlation coefficients and population spike-counts (Fig. 1d). Nevertheless, the model clearly cannot accurately capture all statistics of real RGC activity: our goal was not to provide a realistic model of retinal processing. Rather, we wanted to directly test whether canonical mechanisms of retinal processing (overlapping centre-surround receptive fields, spiking nonlinearities, shared Gaussian noise) are sufficient for signatures of criticality to arise, or whether this would require fine-tuning or sophisticated neural circuitry.
As a next step in the analysis, we subsampled populations of different size n by uniformly sampling cells from our simulated recording. For each population we fit a ’K-pairwise’ maximum entropy model [3]. This model assigns a probability to each spike pattern x. It is an extension of pairwise maximum entropy models (i.e. Ising models) [14, 15] which reproduces the firing rates and pairwise covariances, and which has additional terms that make sure the model also captures the population spike-count distribution of the data [3] (see Fig. 1d, and Methods 3.2 for details of model specification and parameterisation). As we needed to efficiently fit this model [38, 39, 40] to multiple simulated data sets, we developed an improved fitting algorithm based on maximum-likelihood techniques using Markov chain Monte Carlo (MCMC) (Matlab implementation available at https://github.com/mackelab/CorBinian), building on work by [16]. In particular, we made the most computationally expensive component of the algorithm, the estimation of pairwise covariances via MCMC sampling, more efficient by using a ’pairwise’ Gibbs-sampling scheme with Rao-Blackwellisation [41, 42] (see Methods 3.2 for details). Rao-Blackwellisation substantially reduced the number of samples (and the computation time) needed to achieve low-variance estimates of the covariances (Fig. 2a, Suppl. Inf. S1). After parameter fitting, the model reproduced the statistics of the simulated data relevant for the model (Fig. 2b). Using the formalism developed by Tkačik et al., we then introduced a temperature parameter T which rescales the probabilities of the model,
P_T(x) = \frac{1}{Z(T)} \, P(x)^{1/T} \qquad (1)
Here, temperature T = 1 corresponds to the statistics of the empirical data. By changing T to other values one can perturb the statistics of the system [43]: increasing the temperature leads to models with higher firing rates and weaker correlations (Fig. 2c), with P_T approaching the uniform distribution for very large T. If the temperature is decreased towards zero, P_T has most of its probability mass over the most probable spike patterns. In many probabilistic systems, lowering T leads to increasing correlations, as the system then ’jumps’ between several different patterns and the activation probabilities of different elements are therefore strongly dependent on each other. However, for the simulated RGC activity, the sparsity of the data leads to a decrease of correlations: at the short bin sizes used in such analyses [14], the most probable state is the silent state, followed by patterns in which exactly one neuron spikes. In an example population, the vast majority of observed spike patterns contained at most one spike, and when the temperature was decreased below T = 1, patterns with at most one spike dominated the system even more strongly. Thus, when the temperature is lowered, the shift in probability mass towards single-spike patterns decreases correlations.
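The effect of the temperature parameter can be illustrated on a toy distribution. The Python sketch below (with purely illustrative probabilities, not fit to any data) exponentiates pattern probabilities with 1/T and renormalises, as in equation (1):

```python
import numpy as np

def rescale_temperature(p, T):
    """Exponentiate pattern probabilities with 1/T and renormalise."""
    q = p ** (1.0 / T)
    return q / q.sum()

# toy distribution over four spike patterns (illustrative values)
p = np.array([0.7, 0.2, 0.05, 0.05])

hot = rescale_temperature(p, 100.0)   # large T: approaches the uniform distribution
cold = rescale_temperature(p, 0.01)   # small T: mass concentrates on the most probable pattern
```

For T → ∞ the rescaled distribution approaches uniform, and for T → 0 it concentrates on the most probable pattern, matching the two limits described above.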
We compute the specific heat c(T) of a population directly from the probabilistic model fit to the data [7], using
c(T) = \frac{1}{n} \, \mathrm{Var}_{x \sim P_T}\!\left[ \log P_T(x) \right] \qquad (2)
i.e. the variance of the log-probabilities of the model, normalised by the population size n [7]. The specific heat is minimal for data in which all patterns that occur are equally probable, and large for data in which pattern probabilities span a wide range. We used MCMC sampling to approximate the variance across all probabilities (see Methods 3.3), and used this approach to calculate, for each population of size n, the specific heat as a function of temperature (Fig. 2d).
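For small populations, the specific heat in equation (2) can be evaluated exactly by enumerating all 2^n patterns, which provides a useful check on sampling-based estimates. A minimal Python sketch with hypothetical, homogeneous parameters (not the fitted models from the text):

```python
import itertools
import numpy as np

def specific_heat(logp, n):
    """Variance of the log-probabilities under the model, normalised by n."""
    p = np.exp(logp)
    mean = np.sum(p * logp)
    return (np.sum(p * logp ** 2) - mean ** 2) / n

n = 10
states = np.array(list(itertools.product([0, 1], repeat=n)))
K = states.sum(axis=1)

# toy model whose log-probability depends only on the spike count
# (hypothetical parameters, chosen for illustration only)
energy = -2.0 * K + 0.05 * K * (K - 1)
logp = energy - np.log(np.exp(energy).sum())
c = specific_heat(logp, n)

# a uniform distribution spans no range of probabilities: zero specific heat
logp_uniform = np.full(2 ** n, -n * np.log(2.0))
```

The uniform-distribution case makes the "minimal for equally probable patterns" statement concrete.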
We found that the temperature curves obtained from the simulated data qualitatively reproduce the critical features of those observed in large-scale recordings from the salamander [7] and rat [8] retina: the peak of the curves diverges as the population size is increased, and moves closer to unit temperature for increasing n (Fig. 2e). Consistent with experimental findings, we found that the specific heat diverged linearly with population size (Fig. 2e). These results show that signatures of criticality arise in a simple feedforward LN cascade model based on generic properties of retinal ganglion cells, and do not require finely tuned parameters or sophisticated circuitry.
2.2 Specific heat diverges linearly in flat population models
In the phenomenological population model above, we observed that specific heat grew linearly with population size, as it did in previous studies built on experimental data [44, 7, 8]. Can we understand this phenomenon analytically in a simplified model? In particular, is the divergence indeed linear, and what determines its rate? To address these questions, we replaced the K-pairwise maximum entropy model by a model which only captures the population spike-count distribution [21, 30, 29, 32] of the data, and in which all neurons have the same mean firing rate and pairwise correlations. This ’flat’ model can be fit to data by matching its parameters to the population spike-count distribution, sidestepping the computational challenges of the K-pairwise model (see Methods 3.4 for details). We here introduce a new parametrised flat model in which the spike-count distribution is given by a beta-binomial distribution, reducing the number of free parameters from n + 1 to two. The beta-binomial model is a straightforward extension of an independent (i.e. binomial) population model: at each time point, a new firing probability p is drawn from a beta distribution with parameters α and β, and neurons then spike independently with probability p. The fact that the underlying fluctuations in p are shared across the population leads to correlations in neural activity. This beta-binomial model provided a good fit to the population spike-count distributions of the simulated data (Fig. 3a) across different population sizes (Fig. 3b). The best-fitting parameters α and β did not vary systematically across population sizes, and converged to stable values (Fig. 3c). The beta-binomial model also provided good fits to population spike-count distributions published in [30] and [32], [8] (Fig. 3d).
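The beta-binomial generative process described above is straightforward to simulate. The Python sketch below (with illustrative parameter values, not the fitted ones) draws a shared rate p per time bin and checks the resulting moments against the standard beta-binomial results: mean rate α/(α+β) and pairwise correlation 1/(α+β+1):

```python
import numpy as np

def sample_flat_betabinomial(n, alpha, beta, t_bins, rng):
    """Per time bin: draw p ~ Beta(alpha, beta), then n conditionally
    independent Bernoulli(p) spikes (a 'flat' beta-binomial model)."""
    p = rng.beta(alpha, beta, size=t_bins)
    return (rng.random((t_bins, n)) < p[:, None]).astype(np.int8)

rng = np.random.default_rng(0)
x = sample_flat_betabinomial(n=100, alpha=1.0, beta=20.0, t_bins=50_000, rng=rng)

rate = x.mean()                                # ~ alpha / (alpha + beta)
rho = np.corrcoef(x[:, 0], x[:, 1])[0, 1]      # ~ 1 / (alpha + beta + 1)
```

The shared fluctuations in p are what generate the pairwise correlations; setting the beta distribution to a point mass recovers the independent binomial model.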
When we applied this flat model to populations subsampled from the RGC simulation, we could qualitatively reproduce the heat curves of the K-pairwise model. In particular, we found a linearly diverging peak that moved closer to T = 1 as the population size was increased (Fig. 3e). Thus, linear divergence of specific heat is qualitatively captured by flat models. We note that the absolute values of the specific heat do not match those of the K-pairwise model or the simulated data, but are substantially bigger.
One of the difficulties of interpreting the scaling behaviour of maximum entropy models fit to neural data is that the construction of the limit in n differs from the constructions studied in statistical physics: in statistical physics, different n typically correspond to systems of different total size, and the parameters are scaled as a deterministic function of n (e.g. drawn from a Gaussian with variance proportional to 1/n in spin glasses [45, 46]). In studies using maximum entropy models for neural data analysis, populations of different size n are obtained by randomly subsampling a fixed large recording, and the parameters are fit to each subpopulation individually. Thus, there is no analytical relationship between population size and parameter values in this approach, and this has made it hard to determine whether the scaling observed in these studies is surprising or not.
With the flat model, it is possible to analytically characterise the behaviour of the specific heat for large population sizes under this sampling process: we assume that each population of size n is randomly drawn from an underlying, infinitely large flat population model [21, 29]. Using this approach, one can mathematically show (see Methods 3.4, Suppl. Inf. S2.3 and [21] for details) that for virtually all flat models, the specific heat diverges linearly at unit temperature, but at no other temperature (Suppl. Inf. S2.4). As a consequence, the peak must move to T = 1 as n is increased. Hence, almost any data set analysed with the methods developed by [7] will, under the flat model, exhibit signatures of criticality. These results hold irrespective of the details of the full populations that the subpopulations are sampled from, including full populations that are more weakly or more strongly correlated than real neural populations, and even models with unrealistic population spike-count distributions (see Suppl. Fig. S3 for an illustration). There are only two exceptions: the first is a model in which all neurons are independent (i.e. a binomial population model), and the second is a flat pairwise maximum entropy model—indeed, this is the only flat model with non-vanishing correlations for which the specific heat does not have its peak at unit temperature (see Fig. 3f and [21] for an illustration).
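This behaviour can be reproduced numerically without any sampling, because for a flat model the sum over patterns collapses onto a sum over spike counts. The Python/SciPy sketch below (with illustrative beta-binomial parameters) computes the exact heat curve c(T) for several population sizes and locates the peak:

```python
import numpy as np
from scipy.special import betaln, gammaln, logsumexp

def flat_heat_curve(n, a, b, temps):
    """Specific heat c(T) of a beta-binomial flat model, computed
    exactly by summing over the population spike count K."""
    K = np.arange(n + 1)
    logC = gammaln(n + 1) - gammaln(K + 1) - gammaln(n - K + 1)
    logp = betaln(K + a, n - K + b) - betaln(a, b)   # per-pattern log P(x)
    cs = []
    for T in temps:
        lw = logC + logp / T
        logZ = logsumexp(lw)
        Q = np.exp(lw - logZ)          # spike-count distribution at temperature T
        lpT = logp / T - logZ          # per-pattern log P_T(x)
        m = np.sum(Q * lpT)
        cs.append((np.sum(Q * lpT ** 2) - m ** 2) / n)
    return np.array(cs)

temps = np.linspace(0.5, 3.0, 251)
curves = {n: flat_heat_curve(n, 1.0, 20.0, temps) for n in (50, 100, 200)}
peaks = {n: temps[np.argmax(c)] for n, c in curves.items()}
```

As n grows, the peak height increases and the peak temperature should move towards T = 1, mirroring the behaviour in Fig. 3e.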
2.3 Strong neural correlations lead to fast divergence of specific heat
The rate at which the specific heat diverges provides a means of quantifying the ’strength’ of criticality. What is the relationship between correlations in a neural population and the rate of divergence? To study how the specific heat rate depends on the strength of correlations, we used a beta-binomial model to generate simulated data with a fixed mean firing rate and a range of population-wide pairwise correlation coefficients (population-wide, as all neuron pairs have the same correlation) (Fig. 4a). We found that the heat curves had the same shape as in the analyses above, with a peak that increases and moves to unit temperature (Fig. 4b). Comparing the results for different specified correlation strengths, we found that the specific heat rates increased strictly monotonically with correlation strength (Fig. 4b,c). For the beta-binomial model, the large-n value of the specific heat rate c(n)/n can be calculated analytically (see Suppl. Inf. S3.2 for details) as a function of the parameters α and β,
\lim_{n \to \infty} c(n)/n \;=\; \mathrm{Var}_{p \sim \mathrm{Beta}(\alpha, \beta)}\!\left[ \, p \log p + (1-p)\log(1-p) \, \right] \qquad (3)
This analytical evaluation of the rate (valid for large n) was in good agreement with numerical simulations (Fig. 4c left). In the case of weak correlations, equation (3) can be simplified: the specific heat rate becomes proportional to the strength of correlations ρ (see Suppl. Inf. S3.1 for details), i.e.
\lim_{n \to \infty} c(n)/n \;\approx\; \rho \, \bar{p}(1-\bar{p}) \left[ \log\!\frac{1-\bar{p}}{\bar{p}} \right]^2, \quad \text{with } \bar{p} = \alpha/(\alpha+\beta), \;\; \rho = 1/(\alpha+\beta+1) \qquad (4)
This expression can also be derived from the Gaussian model in equation (4) of [8], by inserting the expected values of the mean and variance of the population spike-count under random subsampling. Thus, at least for flat models and the specific-heat analysis proposed previously, ’being very critical’ is a consequence of ’being strongly correlated’.
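The approximate proportionality to ρ in the weak-correlation regime can be checked numerically: for beta-binomial flat models with the same mean rate, halving ρ should roughly halve the large-n specific heat rate. A Python/SciPy sketch with illustrative parameters (mean rate 0.1; ρ = 1/(α+β+1)):

```python
import numpy as np
from scipy.special import betaln, gammaln

def flat_heat_rate(n, a, b):
    """Specific heat rate c(n)/n at T = 1 for a beta-binomial flat model,
    computed exactly by summing over the spike count K."""
    K = np.arange(n + 1)
    logC = gammaln(n + 1) - gammaln(K + 1) - gammaln(n - K + 1)
    logp = betaln(K + a, n - K + b) - betaln(a, b)   # per-pattern log P(x)
    Q = np.exp(logC + logp)                          # beta-binomial pmf over K
    m = np.sum(Q * logp)
    return (np.sum(Q * logp ** 2) - m ** 2) / n ** 2

n = 10_000
# same mean rate (0.1), correlations rho = 1/201, 1/401, 1/801 (illustrative)
r1 = flat_heat_rate(n, 20.0, 180.0)
r2 = flat_heat_rate(n, 40.0, 360.0)
r3 = flat_heat_rate(n, 80.0, 720.0)
```

The rate should roughly double when ρ doubles in this weak-correlation regime, consistent with equation (4).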
2.4 Specific heat depends on average correlation strength in K-pairwise model
Is the relationship between the strength of correlations and the ’strength’ of criticality (i.e. the divergence rate of the specific heat) also true in more general models? In the original study [7], specific heat was computed from K-pairwise model fits to RGC activity recorded under three different kinds of stimuli: checkerboard stimuli (which do not have long-range spatial correlations, although stimulus-driven cross-neural correlations can arise from receptive field overlap), natural images (which exhibit strong spatial correlations), and full-field flicker (which constitutes an extreme case of spatial correlations, since all pixels in the display are identical). Tkačik et al. found that specific heat diverges in all three conditions, and interpreted this as evidence that signatures of criticality are not ‘inherited from the stimulus’ [7]. Comparing the specific heat values reported in [7] across stimulus conditions, Tkačik et al. found the smallest peak for checkerboard stimuli, an intermediate peak for natural images, and the strongest peak for full-field flicker.
We tested whether we find the same pattern of results in K-pairwise model fits to our retinal simulation. Specific heat divergence indeed followed the pattern predicted by the flat models (Fig. 4d): checkerboard stimuli yielded the weakest average correlations between neurons and the smallest specific heat peak, followed by natural images, with full-field flicker producing the strongest correlations and the largest peak. We conclude that the experimental evidence—which showed that the specific heat diverges, and how the speed of divergence depends on the stimulus ensemble—is entirely consistent with a simple, feedforward phenomenological model of retinal processing.
2.5 Sources of criticality-inducing correlations in neural activity
In the above, we showed that a beta-binomial spike-count distribution can be sufficient for signatures of criticality to arise. For this to hold, the variance of the population spike-count K must grow at least quadratically with population size, i.e. Var(K) ∝ n². The variance of the population spike-count is equal to the sum of all variances and covariances in the population, Var(K) = Σ_i Var(x_i) + Σ_{i≠j} Cov(x_i, x_j). A sufficient condition for signatures of criticality to arise in these models is therefore that the average covariance (and hence correlation) between neurons is independent of n [6, 5]. One possible correlation structure which has this property is so-called ’infinite range’ correlations (Fig. 5a): correlations between neurons do not drop off to zero at large spatial distances. In this case, adding more and more neurons to a population will not change the average pairwise correlation within the population (Fig. 5b).
In neural systems, at least two mechanisms can give rise to the required correlation structure. First, as shown above, the choice of stimuli has a clear effect on the heat capacity, indicating an important role for input-induced correlations: infinite-range correlations are to be expected for full-field flicker stimuli in particular, but white-noise input can also generate correlations of considerable extent through overlapping receptive fields. Second, even a neural population which does not have infinite-range correlations can appear critical if it is randomly subsampled during analysis. Suppose that different populations of size n are obtained, as above, by uniformly subsampling a large recording of size N. Then, for any correlation structure of the full recording (including limited-range correlations, Fig. 5c), the average correlation in a population of size n will be independent of n (Fig. 5c): if neurons are randomly subsampled from the large recording, then the pairwise correlations in each subpopulation are also a random subsample of the large correlation matrix. As a consequence, the average correlation will be independent of n, and the specific heat will diverge with constant slope (Fig. 5d). In contrast, if populations of different size are constructed by taking the spatial structure of the population into account (i.e. by iteratively adding neighbouring cells), then the average correlation in each subpopulation will drop with n, and the slope of specific heat growth will decrease with population size.
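The argument about subsampling schemes can be made concrete with a toy correlation matrix. The following Python sketch builds a hypothetical one-dimensional population with limited-range (exponentially decaying) correlations and compares the average correlation under random versus spatially contiguous subsampling:

```python
import numpy as np

rng = np.random.default_rng(1)

# hypothetical 1-D population with limited-range correlations (decay length 10)
N = 500
pos = np.arange(N)
C = np.exp(-np.abs(pos[:, None] - pos[None, :]) / 10.0)

def mean_offdiag(C, idx):
    """Average pairwise correlation within the subpopulation idx."""
    sub = C[np.ix_(idx, idx)]
    k = len(idx)
    return (sub.sum() - k) / (k * (k - 1))

rand_mean, cont_mean = {}, {}
for n in (50, 100, 200):
    draws = [mean_offdiag(C, rng.choice(N, n, replace=False)) for _ in range(10)]
    rand_mean[n] = np.mean(draws)                   # roughly independent of n
    cont_mean[n] = mean_offdiag(C, np.arange(n))    # shrinks with n
```

Random subsamples inherit the (fixed) average correlation of the full matrix, whereas spatially contiguous subsamples of increasing size dilute the short-range correlations, exactly the dichotomy described above.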
In our RGC simulation, correlations did drop off to zero with spatial distance for checkerboard and natural images, but not for full-field flicker (Fig. 5e). Correlations in the full-field flicker condition initially drop off due to distance-dependent shared noise, but eventually saturate at a level far above zero that is determined by the full-field stimulus. Due to these strong infinite-range correlations, both spatially structured sampling and uniform sampling give rise to linear growth of the specific heat (Fig. 5f left). For the other two stimulus conditions, however, the choice of subsampling scheme does result in markedly different behaviour of the specific heat: for both natural images and checkerboard stimuli, the rate of growth decreases for large n under spatially structured subsampling (Fig. 5f centre, right). This effect will be more pronounced for larger simulations, and in additional simulations we found the specific heat to saturate completely once populations are substantially bigger than the spatial range of correlations.
In summary, populations will exhibit critical behaviour if correlations have infinite range (over the size of the recording), irrespective of the sampling scheme. In addition, if a population is randomly subsampled (as was done in [7, 8]), then signatures of criticality will arise even if the underlying correlations have limited range.
3 Materials and Methods
3.1 Numerical simulation of retinal ganglion cell activity
Retina simulation:
We simulated a population of retinal ganglion cells as linear threshold neurons whose receptive fields were modelled by difference-of-Gaussian filters with ON-centres [36, 28, 25]. The simulation comprised two subgroups of cells with different receptive field sizes, with one third of the cells having large receptive fields. For both subgroups, the weight of the surround was a fixed fraction of the centre weight. Locations of receptive field centres were based on a reconstruction of soma locations from a patch of mouse retina [47]. As the reconstructed locations in that data set also comprised amacrine cell somata, we randomly discarded a corresponding fraction of the cell locations. The resulting patch of retina was mapped onto a grid of pixels in stimulus space. Correlated noise across neurons was modelled using correlated additive Gaussian noise whose correlations dropped off exponentially with soma distance, i.e. the noise covariance matrix was chosen as Σ_ij = σ² exp(−d_ij/τ), where d_ij is the distance between neurons i and j, σ² the noise variance and τ the decay constant. We modelled neural spiking in discrete time bins. In each bin t, the total input to neuron i was given by u_i(t) = w_i·s(t) + ε_i(t), where w_i is the receptive field of neuron i, s(t) the vectorised stimulus and ε_i(t) the input noise of neuron i. A neuron in a given bin is active (x_i(t) = 1) if u_i(t) > θ and inactive (x_i(t) = 0) otherwise, with offset θ. Parameters of the simulation (centre and surround sizes, relative strength of centre and surround, magnitude and correlations of noise, spiking threshold) were chosen to roughly match the statistics of neural spiking (firing rates, pairwise correlations, population activity counts) reported in studies of salamander retinal ganglion cells [14, 3, 2]. (Code will be available at www.mackelab.org/code.)
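The simulation pipeline above can be sketched compactly. The following Python code is a scaled-down illustration with a small grid and hypothetical parameter values (the simulation's actual sizes and constants are not reproduced here): difference-of-Gaussians receptive fields, additive Gaussian noise with exponentially decaying spatial correlations, and threshold spiking:

```python
import numpy as np

rng = np.random.default_rng(0)

def dog_filter(grid, center, sigma_c, sigma_s, w_s):
    """ON-centre difference-of-Gaussians receptive field over pixel grid."""
    d2 = ((grid - center) ** 2).sum(axis=-1)
    return np.exp(-d2 / (2 * sigma_c ** 2)) - w_s * np.exp(-d2 / (2 * sigma_s ** 2))

# hypothetical sizes and parameters, for illustration only
n, npix = 20, 16
xy = np.stack(np.meshgrid(np.arange(npix), np.arange(npix)), -1).reshape(-1, 2)
centers = rng.uniform(0, npix, size=(n, 2))
W = np.stack([dog_filter(xy, c, 1.5, 3.0, 0.5) for c in centers])  # n x pixels

# correlated additive input noise: Sigma_ij = s^2 * exp(-dist_ij / tau)
dist = np.linalg.norm(centers[:, None] - centers[None], axis=-1)
Sigma = 0.25 * np.exp(-dist / 5.0)
L = np.linalg.cholesky(Sigma + 1e-9 * np.eye(n))

def simulate(stim, theta=2.0):
    """Binary spikes: x_i(t) = 1 iff w_i . s(t) + eps_i(t) > theta."""
    t = stim.shape[0]
    noise = rng.standard_normal((t, n)) @ L.T
    drive = stim.reshape(t, -1) @ W.T + noise
    return (drive > theta).astype(int)

x = simulate(rng.standard_normal((1000, npix, npix)))
```

Correlations in the output arise only from stimulus correlations, receptive-field overlap and the shared noise, matching the feedforward structure described in the text.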
Stimuli:
We used three types of stimuli for this study: natural images, checkerboard patterns and full-field flicker. For the natural image stimuli, we used a sequence of images of meadow scenery taken from a low height. Each image was presented for a fixed duration, with 300 repetitions in total. The luminance histograms of the images were transformed to a normal distribution with mean 0.5 and pixel values between 0 and 1.
For the full-field flicker stimulus, luminance levels were drawn from a Gaussian distribution with fixed mean and variance. Checkerboard stimuli consisted of square tiles, with the luminance level of each tile independently set to one of two values with equal probability. The parameters of both stimulus sets were chosen to match the dynamic range of the simulated retinal ganglion cells. For both types of stimuli, image sequences were generated and presented repeatedly. To calculate specific heat as a function of increasing population size, we randomly selected subsamples of the full simulated population at a range of population sizes n, by uniformly drawing n neurons out of the full population without replacement.
3.2 Modeling neural population data with maximum entropy models
Model definition:
We modelled retinal ganglion cell activity using a ’K-pairwise’ maximum entropy model [3]. In a maximum entropy model [48], the probability of observing the binary spike word x ∈ {0, 1}^n given parameters (h, J, λ) is
P(x \mid h, J, \lambda) = \frac{1}{Z(h, J, \lambda)} \exp\!\left( \sum_{i=1}^{n} h_i x_i \;+\; \sum_{i < j} J_{ij} \, x_i x_j \;+\; \sum_{k=0}^{n} \lambda_k \, \mathbf{1}[K(x) = k] \right) \qquad (5)
Here, the parameter vector h (of size n) and the upper-triangular matrix J correspond to the bias terms and interaction terms of a pairwise maximum entropy model (also known as an Ising model or spin glass) [14]. The term K(x) = Σ_i x_i denotes the population spike-count, i.e. the total number of spikes across the population within a single time bin, and the indicator term 1[K(x) = k] is 1 whenever the population spike-count equals k, and 0 otherwise. The corresponding parameters λ_k were introduced by [3] to ensure that the model precisely captures the population spike-count distribution of the data, using n + 1 additional free parameters. The partition function Z(h, J, λ) is chosen such that the probabilities of the model sum to 1.
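For a population small enough to enumerate, the model in equation (5) can be written down and normalised directly. A minimal Python sketch with random, purely illustrative parameters:

```python
import itertools
import numpy as np

def kpairwise_logp_unnorm(x, h, J, lam):
    """h.x + sum_{i<j} J_ij x_i x_j + lambda_{K(x)}  (unnormalised log-probability)."""
    K = int(x.sum())
    return h @ x + x @ np.triu(J, 1) @ x + lam[K]

n = 5
rng = np.random.default_rng(0)
h = rng.normal(-1.0, 0.3, n)          # bias terms
J = rng.normal(0.0, 0.1, (n, n))      # couplings (only the upper triangle is used)
lam = np.zeros(n + 1)                 # spike-count terms lambda_0 .. lambda_n

# normalise by explicit summation over all 2^n states
states = np.array(list(itertools.product([0, 1], repeat=n)))
logps = np.array([kpairwise_logp_unnorm(s, h, J, lam) for s in states])
probs = np.exp(logps - logps.max())
probs /= probs.sum()
```

The explicit sum over 2^n states is exactly what becomes infeasible for large populations, motivating the MCMC approach described next.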
Parameter fitting:
To fit the model parameters to a data set of binary patterns x^(1), …, x^(m), we maximised the penalised log-likelihood [49, 50] of the data under the model,
\mathcal{L}(h, J, \lambda) = \sum_{t=1}^{m} \log P(x^{(t)} \mid h, J, \lambda) \;-\; \eta_2 \left( \|h\|_2^2 + \|J\|_2^2 \right) \;-\; \eta_1 \|J\|_1 \;-\; \eta_\lambda \sum_{k=0}^{n-1} (\lambda_{k+1} - \lambda_k)^2 \qquad (6)
Here, the L2 penalty controlled the magnitudes of the parameters h and J, the L1 term favoured sparse coupling matrices, and the regularisation term on the λ parameters ensured that the terms controlling the spike-count distribution vary smoothly in k (Suppl. Inf. S1). This smoothness prior is particularly important for large spike counts, as it makes it possible to interpolate parameters λ_k for which the number of observed counts is small.
In maximum entropy models, exact evaluation of the penalised log-likelihood and its gradients requires the calculation of expectations under the model: the mean activations E[x_i], the second moments E[x_i x_j] and the spike-count probabilities P(K = k) (Suppl. Inf. S1.1). This in turn requires summations over all 2^n possible states x, which is prohibitive for large populations. Following previous work [16], we used Gibbs sampling to approximate the relevant expectations (see Suppl. Inf. S1.1 for derivations and implementation details). We used two modifications over previous applications of Gibbs sampling to fitting maximum entropy models to neural population spike train data, with the goals of speeding up parameter learning and alleviating memory usage:
First, we used Rao-Blackwellisation [41, 42] to speed up convergence of the estimates of the covariances of x. We used pairwise Gibbs sampling (blocked Gibbs sampling with block size 2), where each new sample in the MCMC chain was obtained by updating two entries x_i and x_j of x at a time, rather than just a single entry. This allowed us to obtain estimates of the conditional probabilities P(x_i, x_j | x_rest), and to use them to speed up the estimation of the second moments E[x_i x_j] via empirical averages of these conditional probabilities (Suppl. Inf. S1.1).
Second, we used a variant of coordinate ascent that calculated all relevant quantities as running averages over the MCMC sample, and thereby avoided having to store the entire MCMC sample in memory [16]. Because all features of the maximum entropy model are either 0 or 1 (x_i, x_i x_j, and the indicator functions of the spike count), the gain in log-likelihood obtainable from updating a single element of h or J [16, 40], or from updating all λ_k simultaneously (but not from updating multiple entries of h and J), can be computed directly from MCMC estimates of E[x_i], E[x_i x_j] and P(K = k) (Suppl. Inf. S1.2). For each iteration, we calculated the gain in log-likelihood for each possible update of h, J and the full λ, and picked the update which led to the largest gain [51, 16].
We measured the length of Markov chains in sweeps, where one sweep corresponds to one round of Markov chain updates that encompasses all pairs of entries of x in random order. We set a learning schedule that started with a small number of sweeps for the first parameter update and doubled the number of sweeps in the chain after each set of parameter updates. We monitored convergence of the algorithm using normalised mean square errors between the empirical values of E[x_i], E[x_i x_j] and P(K = k) and their estimates from the MCMC sample. For normalisation, we used the average squared values of the target quantity, e.g. the average squared firing rate. We stopped the algorithm when preset thresholds were reached for all three quantities, or when the fitting algorithm exceeded a fixed budget of computation time on a single core (AMD Opteron 6276 processor) (Suppl. Fig. S1). An implementation of the fitting algorithms in MATLAB is available at https://github.com/mackelab/CorBinian.
3.3 Calculating specific heat and temperature curves
Specific heat calculations:
To investigate thermodynamic properties of neural population codes, Tkačik et al [7] introduced a temperature parameter $T$ for equation 5:

$$P_T(x) \;=\; \frac{1}{Z(T)}\,\exp\!\left(\frac{1}{T}\left(\sum_{i} h_i x_i + \frac{1}{2}\sum_{i \neq j} J_{ij}\, x_i x_j + \sum_{k=0}^{n} V_k\, \mathbb{1}\!\left[\sum_i x_i = k\right]\right)\right) \tag{7}$$
Model fits are obtained at unit temperature $T = 1$, and the temperature parameter is then varied to study the family of systems characterised by $P_T$ for $T \neq 1$. We note that varying $T$, in effect, modulates probabilities by exponentiating them with $1/T$,

$$P_T(x) \;\propto\; P(x)^{1/T}, \tag{8}$$

and that the family of probability distributions obtained by varying $T$ can be constructed for any distribution, not just maximum entropy models. For large temperatures, $P_T$ approaches the uniform distribution ($P_T(x) \to 2^{-n}$ for each $x$), whereas for small temperatures it converges to a singleton, with all probability mass on the most probable pattern.
The specific heat, as given in equation 2, can be obtained from the variance of the log-probabilities of the model. As this variance in practice cannot be computed exactly for populations beyond $n \approx 20$, we obtained estimates of $c(T)$ using a pairwise Gibbs sampler. We note that the specific heat does not depend on the partition function $Z$, as changing $Z$ results in a constant, additive shift of the log-probabilities which does not affect their variance. We tracked the variance of log-probabilities over an MCMC chain of length $m$ sampled at temperature $T$,

$$\hat{c}(T) \;=\; \frac{1}{n\,T^2}\left(\frac{1}{m}\sum_{t=1}^{m}\left(\log P\!\left(x^{(t)}\right)\right)^2 - \left(\frac{1}{m}\sum_{t=1}^{m}\log P\!\left(x^{(t)}\right)\right)^2\right), \qquad x^{(t)} \sim P_T. \tag{9}$$
For each population, we evaluated $\hat{c}(T)$ for a range of temperatures around $T = 1$, and found the Gibbs sampler to provide reliable estimates over this temperature range. We used a burn-in period, and ran the sampler for a fixed budget of CPU time per temperature.
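The estimator relies only on log-probabilities known up to an additive constant. A minimal Python sketch (illustrative; exact enumeration stands in for the Gibbs sampler used in the paper) makes both points explicit: $c(T)$ is the variance of log-probabilities under $P_T$, divided by $nT^2$, and shifting all log-probabilities by a constant (i.e. changing $Z$) leaves it unchanged:

```python
import itertools
import numpy as np

def specific_heat(logp_unnorm, T, n):
    """c(T) = Var_{P_T}[log P(x)] / (n T^2): P_T reweights each state by
    P(x)^(1/T); an additive shift of logp_unnorm (a change of Z) cancels."""
    lt = logp_unnorm / T
    w = np.exp(lt - lt.max())
    w /= w.sum()                      # state weights under P_T
    mean = w @ logp_unnorm
    return (w @ (logp_unnorm - mean) ** 2) / (n * T ** 2)

# toy system: n = 10 independent neurons with spike probability p = 0.1
n, p = 10, 0.1
states = np.array(list(itertools.product([0, 1], repeat=n)))
logp = states.sum(axis=1) * np.log(p / (1 - p)) + n * np.log(1 - p)
c1 = specific_heat(logp, 1.0, n)
c1_shifted = specific_heat(logp + 123.4, 1.0, n)  # different normalisation
```

Both calls return the same value, $p(1-p)\log^2\!\frac{p}{1-p}$, the size-independent specific heat of an independent population (cf. Suppl. S2.1).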
3.4 Simplified population models and the beta-binomial model
For the theoretical analysis, we adopted a class of population models (here referred to as ’flat’ models) in which all neurons have identical mean firing rates, pairwise correlations and higher-order correlations [29, 21, 52, 3, 53]. Such a model is fully specified by the population spike-count distribution $P(K = k)$, where $K = \sum_i x_i$, and all spike words with the same spike count are equally probable. As a result, the probabilities of individual patterns can be read off from the spike-count distribution by

$$P(x) \;=\; \binom{n}{k}^{-1} P(K = k) \tag{10}$$

whenever $\sum_i x_i = k$. In a maximum entropy formalism, this model can be obtained by setting $h_i = 0$ and $J_{ij} = 0$ for all $i, j$ and only optimising the entries of $V$. Without loss of generality, we fixed $V_0$ [30], resulting in $n$ degrees of freedom for the model.
In flat models, it is possible to explicitly construct a limit $n \to \infty$ which helps us understand population analyses performed on experimental data: We assume that there is a spike-count density $p(r)$, $r \in [0, 1]$, which describes the population spike-count distribution of an infinitely large population; $p(r)$ denotes the probability density of a fraction $r$ of neurons spiking simultaneously. Finite-size populations of $n$ cells are then obtained as random subsamples out of this infinitely large system. Based on previous findings by [21], we show in Suppl. Inf. S2.3 that, in this construction, flat models always exhibit a linear divergence of the specific heat, unless the limit $p(r)$ is given by either a single delta peak or a mixture of two symmetric delta peaks. These two cases correspond to systems that (for large $n$) either behave like a fully independent population (whose spike-count distribution converges to a single delta peak), or like a population described by a pure pairwise maximum entropy model (which converges to two delta peaks). In particular, any flat model with higher-order correlations [18, 52, 53, 27], or a non-degenerate $p(r)$, will exhibit ’signatures of criticality’. Furthermore, we show that, for continuous $p(r)$, the specific heat does not diverge at any temperature $T > 1$. In combination, these results show that the peak of the specific heat is mathematically bound to converge to $T = 1$ as $n \to \infty$ in this model class.
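Because all patterns with the same count share one probability, the specific heat of a flat model can be computed exactly from the $n+1$ count probabilities alone, without ever touching the $2^n$ patterns. A Python sketch (illustrative, not the paper's MATLAB code):

```python
import math
import numpy as np

def flat_specific_heat(pk, T):
    """Exact specific heat of a flat model given only pk[k] = P(K = k).
    Every pattern with count k has log-probability log pk[k] - log C(n, k),
    and P_T weights count k by C(n, k) times that pattern probability^(1/T)."""
    n = len(pk) - 1
    log_nk = np.array([math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
                       for k in range(n + 1)])
    logp_pattern = np.log(pk) - log_nk
    lt = logp_pattern / T + log_nk     # unnormalised log-mass of count k under P_T
    w = np.exp(lt - lt.max())
    w /= w.sum()
    mean = w @ logp_pattern
    return (w @ (logp_pattern - mean) ** 2) / (n * T ** 2)
```

For a binomial count distribution (independent neurons) this reproduces the size-independent value $p(1-p)\log^2\!\frac{p}{1-p}$, whereas count distributions whose width does not shrink with $n$ give a specific heat that grows with $n$.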
We further simplified the flat model by reparametrising $P(K)$ with a beta-binomial distribution, thereby reducing the number of parameters from $n$ to two, and—importantly—obtaining parameters which do not explicitly depend on $n$. In this model,

$$P(K = k) \;=\; \binom{n}{k}\,\frac{B(k + \alpha,\; n - k + \beta)}{B(\alpha, \beta)} \tag{11}$$

and

$$B(\alpha, \beta) \;=\; \frac{\Gamma(\alpha)\,\Gamma(\beta)}{\Gamma(\alpha + \beta)}. \tag{12}$$
For simulated data, we found the values of $\alpha$ and $\beta$ extracted from the beta-binomial fits to populations of different sizes to be stable over a large range of $n$ (Fig. 3b). We used the beta-binomial parameters obtained from the largest investigated $n$ to estimate the divergence rate of the specific heat as $n \to \infty$.
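Beta-binomial parameters can be extracted from a spike-count sample by moment matching: with $s = \alpha + \beta$ and $\mu = \alpha/s$, the beta-binomial has $\mathbb{E}[K] = n\mu$ and $\operatorname{Var}[K] = n\mu(1-\mu)(s+n)/(s+1)$, which can be inverted for $\alpha$ and $\beta$. A Python sketch (the parameter values below are illustrative, not those fitted in the paper):

```python
import numpy as np

def fit_betabinomial_moments(K, n):
    """Method-of-moments fit of beta-binomial parameters (alpha, beta) to
    spike counts K from a population of size n, using
    Var[K] = n*mu*(1-mu)*(s+n)/(s+1) with mu = alpha/s, s = alpha + beta."""
    mu = K.mean() / n
    R = K.var() / (n * mu * (1 - mu))   # variance inflation over binomial
    s = (n - R) / (R - 1)               # solve (s + n) / (s + 1) = R for s
    return mu * s, (1 - mu) * s

# stability check across population sizes, in the spirit of Fig. 3b
rng = np.random.default_rng(1)
alpha, beta = 1.0, 9.0                  # illustrative ground-truth values
for n in (50, 100, 200):
    p = rng.beta(alpha, beta, size=200_000)   # latent firing fraction
    K = rng.binomial(n, p)                    # beta-binomial spike counts
    print(n, fit_betabinomial_moments(K, n))
```

The recovered $(\alpha, \beta)$ are approximately the same for all three $n$, illustrating that the parametrisation does not explicitly depend on population size.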
4 Discussion
An intriguing hypothesis about the collective activity of large neural populations has been the idea that their statistics resemble those of physical systems at a critical point. Using a definition of criticality which is based on temporal dynamics with power-law statistics, numerous studies have reported and studied critical behaviour in neural population activity [54, 55, 8, 20]. Multiple possible mechanisms for these dynamics have been proposed (e.g. [56, 20, 57]). It has been argued that such temporal dynamics might be beneficial for neural computation and communication [58, 59, 20] (see [5] for an overview). More recently, a second line of studies [12, 5, 6, 11, 7, 8] has examined the statistics of time-instantaneous patterns of neural activity using tools from statistical mechanics, and argued that they also exhibit critical behaviour. This hypothesis opens up further questions on how the system maintains its critical state, and what implications this observation has for how neural populations encode sensory information and perform computations on it. Similarly, signatures of criticality have also been observed in natural images [11] and small cortical populations [6], and have been studied using the theory of finite-size scaling and critical exponents [6]. It has been argued that systems close to a critical point might be optimally sensitive to external perturbations [6] and that the large dynamic range of the code (i.e. the large variance of log-probabilities) might be beneficial for encoding sensory events which likewise have a wide distribution of occurrence probabilities [17].
Alternatively, generic mechanisms could be sufficient to give rise to activity data with these statistics. We demonstrated in a previous theoretical study [21] that simple models with common input can exhibit signatures of criticality. More recently, Schwab et al. [22] and Aitchison et al. [23] elaborated on these findings, showing that common input (or other latent variables which lead to shared modulations in firing rates) can give rise to Zipf-like scaling of pattern probabilities (a second signature of criticality). Mathematically, Zipf’s law is equivalent to stating that the plot of entropy vs. energy (i.e. log-probability) is a straight line with unit slope [22, 23]. Schwab et al. [22] showed that particular latent variable models give rise to Zipf’s law. This result was generalized by [23], who showed that, under fairly general circumstances, high-dimensional latent variable models exhibit a wide distribution of energies (i.e. log-probabilities) and hence a large specific heat. In addition, they showed that large fluctuations in energy are (under some additional assumptions) sufficient to achieve Zipfian scaling. It has also been argued that the use of datasets which are too small might give rise to spuriously large specific heats [60]. While this is true in principle, additional analyses, e.g. in [7], show that the reported results are robust with respect to dataset size.
However, neither of these previous theoretical studies analysed mechanistic models of neural population activity, nor did they have tools for studying population statistics in large simulations or recordings; they were therefore limited to studying very small systems. It has thus been an open question whether and how these theoretical considerations can account for the effects observed in retinal ganglion cells. We here showed that surprisingly simple mechanisms are sufficient for two key signatures of thermodynamic criticality—a divergence of the specific heat and a peak of the specific heat near unit temperature—to arise.
We found that neural population activity exhibits signatures of criticality whenever the average correlation in populations of different sizes is larger than zero and does not depend on population size. In the thermodynamic analysis of physical systems at equilibrium, long-range correlations typically vanish in the thermodynamic limit. In neural systems, however, such ’criticality-inducing’ correlations can arise as a consequence of various factors: in a local patch of retina, retinal ganglion cells have a large degree of receptive field overlap, and natural stimuli also contain strong spatial correlations. This can lead to correlations which have effectively unlimited range within the experimentally accessible length scales. Thus, fluctuations in the stimulus will lead to common activity modulations amongst neurons within the population. Empirically, activity correlations between pairs of retinal ganglion cells fall off only slowly with the distance between somas (or receptive field centres) [28]. Similarly, Mora et al [8] used a moving-bar stimulus with strong temporal correlations, and found that including activity from multiple time lags markedly increases the specific heat. We hypothesise that this increase in specific heat is a consequence of temporal correlations being stronger than inter-neuronal correlations in this stimulus condition. In addition, firing rates of cortical neurons are modulated by global fluctuations in excitability [31, 32, 33, 34], resulting in neural correlations with effectively infinite range.
Finally, we showed that criticality-inducing correlations arise as a consequence of constructing different subpopulations by uniformly subsampling a large recording with correlations. Signatures of criticality are thus entirely consistent with canonical properties of neural population activity, and require neither finely tuned parameters in the population, nor sophisticated circuitry or active mechanisms for keeping the system at the critical point. Signatures of criticality are therefore likely to be found not just in retinal ganglion cells, but in multiple brain areas and model systems. These observations raise the question of whether signatures of criticality are really indicative of an underlying principle, or rather a consequence of viewing the statistics of neural populations through the lens of equilibrium thermodynamics. In order to realise the potential of large-scale recordings of neural activity in the search for a theory of neural computation, we will need data-analysis methods which are adapted to the specific properties of biological data [4, 19].
5 Acknowledgments
Work was funded by the German Federal Ministry of Education and Research (BMBF; FKZ: 01GQ1002, Bernstein Center Tübingen, FKZ 01GQ1601 to PB), the German Research Foundation (BE 5601/11 to PB; SFB 1233, Robust Vision: Inference Principles and Neural Mechanisms, TP 14 to JHM), the Max Planck Society and the caesar foundation. We thank F. Franzen for help with figures and cluster computing, S. Buchholz, D. Greenberg and S. Turaga for comments on the manuscript and useful discussions. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
References
 1. Kerr JND, Denk W. Imaging in vivo: watching the brain in action. Nature Reviews Neurosci. 2008;9(3):195–205.
 2. Marre O, Amodei D, Deshmukh N, Sadeghi K, Soo F, Holy TE, et al. Mapping a complete neural population in the retina. The Journal of Neuroscience. 2012;32(43):14859–14873.
 3. Tkačik G, Marre O, Amodei D, Schneidman E, Bialek W, Berry MJ 2nd. Searching for collective behavior in a large network of sensory neurons. PLoS Comput Biol. 2014;10(1):e1003408.
 4. Gao P, Ganguli S. On simplicity and complexity in the brave new world of largescale neuroscience. Current opinion in neurobiology. 2015;32:148–155.
 5. Beggs JM, Timme N. Being critical of criticality in the brain. Frontiers in physiology. 2012;3.
 6. Yu S, Yang H, Shriki O, Plenz D. Universal organization of resting brain activity at the thermodynamic critical point. Front Syst Neurosci. 2013;7:42.
 7. Tkačik G, Mora T, Marre O, Amodei D, Palmer SE, Berry MJ, et al. Thermodynamics and signatures of criticality in a network of neurons. Proceedings of the National Academy of Sciences. 2015;112(37):11508–11513.
 8. Mora T, Deny S, Marre O. Dynamical criticality in the collective activity of a population of retinal neurons. Physical review letters. 2015;114(7):078105.
 9. Mora T, Walczak AM, Bialek W, Callan CG. Maximum entropy models for antibody diversity. Proceedings of the National Academy of Sciences. 2010;107(12):5405–5410.
 10. Bialek W, Cavagna A, Giardina I, Mora T, Silvestri E, Viale M, et al. Statistical mechanics for natural flocks of birds. Proceedings of the National Academy of Sciences. 2012;109(13):4786–4791.
 11. Stephens GJ, Mora T, Tkačik G, Bialek W. Statistical thermodynamics of natural images. Phys Rev Lett. 2013 Jan;110(1):018701.
 12. Mora T, Bialek W. Are biological systems poised at criticality? Journal of Statistical Physics. 2011;144(2):268–302.
 13. Segev R, Goodhouse J, Puchalla J, Berry MJn. Recording spikes from a large fraction of the ganglion cells in a retinal patch. Nature Neuroscience. 2004;7(10):1154–1161.
 14. Schneidman E, Berry MJn, Segev R, Bialek W. Weak pairwise correlations imply strongly correlated network states in a neural population. Nature. 2006;440(7087):1007–12.
 15. Shlens J, Field GD, Gauthier JL, Grivich MI, Petrusca D, Sher A, et al. The structure of multineuron firing patterns in primate retina. J Neurosci. 2006;26(32):8254–66.
 16. Broderick T, Dudik M, Tkacik G, Schapire RE, Bialek W. Faster solutions of the inverse pairwise Ising problem. arXiv. 2007;0712.2437v2.
 17. Tkacik G, Schneidman E, Berry MJ II, Bialek W. Spin glass models for a network of real neurons. arXiv:qbio/0611072v2. 2009;.
 18. Ohiorhenuan IE, Mechler F, Purpura KP, Schmid AM, Hu Q, Victor JD. Sparse coding and highorder correlations in finescale cortical networks. Nature. 2010;466(7306):617–621.
 19. Roudi Y, Dunn B, Hertz J. Multineuronal activity and functional connectivity in cell assemblies. Current opinion in neurobiology. 2015;32:38–44.
 20. Shew WL, Clawson WP, Pobst J, Karimipanah Y, Wright NC, Wessel R. Adaptation to sensory input tunes visual cortex to criticality. Nature Physics. 2015;11(8):659–663.
 21. Macke JH, Opper M, Bethge M. Common input explains higherorder correlations and entropy in a simple model of neural population activity. Physical Review Letters. 2011;106(20):208102.
 22. Schwab DJ, Nemenman I, Mehta P. Zipf’s law and criticality in multivariate data without finetuning. Physical review letters. 2014;113(6):068102.
 23. Aitchison L, Corradi N, Latham PE. Zipf’s law arises naturally in structured, highdimensional data. arXiv preprint. 2014;1407.7135.
 24. Chichilnisky E. A simple white noise analysis of neuronal light responses. Network: Computation in Neural Systems. 2001;12(2):199–213.
 25. Carandini M, Demb JB, Mante V, Tolhurst DJ, Dan Y, Olshausen BA, et al. Do we know what the early visual system does? J Neurosci. 2005;25(46):10577–10597.
 26. Pillow JW, Shlens J, Paninski L, Sher A, Litke AM, Chichilnisky EJ, et al. Spatiotemporal correlations and visual signalling in a complete neuronal population. Nature. 2008;454(7207):995–9.
 27. Leen DA, SheaBrown E. A Simple Mechanism for BeyondPairwise Correlations in IntegrateandFire Neurons. J Math Neurosci. 2015;5(1):30.
 28. Pitkow X, Meister M. Decorrelation and efficient coding by retinal ganglion cells. Nature neuroscience. 2012;15(4):628–635.
 29. Amari Si, Nakahara H, Wu S, Sakai Y. Synchronous firing and higherorder interactions in neuron pool. Neural Computation. 2003;15(1):127–142.
 30. Tkačik G, Marre O, Mora T, Amodei D, Berry II MJ, Bialek W. The simplest maximum entropy model for collective behavior in a neural network. Journal of Statistical Mechanics: Theory and Experiment. 2013;2013(03):P03011.
 31. Harris KD, Thiele A. Cortical state and attention. Nat Rev Neurosci. 2011;12(9):509–523.
 32. Okun M, Yger P, Marguet SL, GerardMercier F, Benucci A, Katzner S, et al. Population rate dynamics and multineuron firing patterns in sensory cortex. J Neurosci. 2012;32(48):17108–19.
 33. Ecker AS, Berens P, Cotton RJ, Subramaniyan M, Denfield GH, Cadwell CR, et al. State dependence of noise correlations in macaque primary visual cortex. Neuron. 2014;82(1):235–48.
 34. Schölvinck ML, Saleem AB, Benucci A, Harris KD, Carandini M. Cortical state determines global variability and correlations in visual cortex. J Neurosci. 2015 Jan;35(1):170–8.
 35. Kuffler SW. Discharge patterns and functional organization of mammalian retina. Journal of neurophysiology. 1953;16(1):37–68.
 36. Rodieck RW. Quantitative analysis of cat retinal ganglion cell response to visual stimuli. Vision research. 1965;5(12):583–601.
 37. Trong PK, Rieke F. Origin of correlated activity between parasol retinal ganglion cells. Nature Neuroscience. 2008;11(11):1343–1351.
 38. Ferrenberg AM, Swendsen RH. New Monte Carlo technique for studying phase transitions. Physical review letters. 1988;61(23):2635.
 39. SohlDickstein J, Battaglino PB, DeWeese MR. New method for parameter estimation in probabilistic models: minimum probability flow. Physical review letters. 2011;107(22):220601.
 40. Schwartz G, Macke J, Amodei D, Tang H, Berry MJ 2nd. Low error discrimination using a correlated population code. J Neurophysiol. 2012;108(4):1069–88.
 41. Radhakrishna Rao C. Information and accuracy attainable in the estimation of statistical parameters. Bulletin of the Calcutta Mathematical Society. 1945;37(3):81–91.
 42. Blackwell D. Conditional expectation and unbiased sequential estimation. The Annals of Mathematical Statistics. 1947;p. 105–110.
 43. Kirkpatrick S, Gelatt CD, Vecchi MP, et al. Optimization by simulated annealing. science. 1983;220(4598):671–680.
 44. Tkacik G, Schneidman E, Berry II MJ, Bialek W. Ising models for networks of real neurons. arXiv preprint. 2006;0611072v1.
 45. Sherrington D, Kirkpatrick S. Solvable model of a spinglass. Physical review letters. 1975;35(26):1792.
 46. Mezard M, Parisi G, Virasoro M. Spin Glass Theory and Beyond (Singapore: Word Scientific); 1987.
 47. Baden T, Berens P, RomanRoson M, Bethge M, Euler T. The functional diversity of mouse retinal ganglion cells. Nature. (in press);.
 48. Jaynes ET. Information theory and statistical mechanics. Physical review. 1957;106(4):620.
 49. Dudík M, Schapire RE. Maximum entropy distribution estimation with generalized regularization. In: Learning Theory. Springer; 2006. p. 123–138.
 50. Altun Y, Smola A. Unifying divergence minimization and statistical inference via convex duality. In: Learning theory. Springer; 2006. p. 139–153.
 51. Dudik M, Phillips SJ, Schapire RE. Performance guarantees for regularized maximum entropy density estimation. In: Learning Theory. Springer; 2004. p. 472–486.
 52. Yu S, Yang H, Nakahara H, Santos GS, Nikolic D, Plenz D. Higherorder interactions characterized in cortical activity. J Neurosci. 2011;31(48):17514–17526.
 53. Barreiro AK, Gjorgjieva J, Rieke F, SheaBrown E. When do microcircuits produce beyondpairwise correlations? Front Comput Neurosci. 2014;8:10.
 54. Beggs JM, Plenz D. Neuronal avalanches in neocortical circuits. The Journal of neuroscience. 2003;23(35):11167–11177.
 55. Petermann T, Thiagarajan TC, Lebedev MA, Nicolelis MAL, Chialvo DR, Plenz D. Spontaneous cortical activity in awake monkeys composed of neuronal avalanches. Proc Natl Acad Sci U S A. 2009 Sep;106(37):15921–6.
 56. Levina A, Herrmann JM, Geisel T. Dynamical synapses causing selforganized criticality in neural networks. Nature physics. 2007;3(12):857–860.
 57. Markram H, Muller E, Ramaswamy S, Reimann MW, Abdellah M, Sanchez CA, et al. Reconstruction and Simulation of Neocortical Microcircuitry. Cell. 2015;163(2):456–92.
 58. Bertschinger N, Natschläger T. Realtime computation at the edge of chaos in recurrent neural networks. Neural computation. 2004;16(7):1413–1436.
 59. Shew WL, Yang H, Yu S, Roy R, Plenz D. Information capacity and transmission are maximized in balanced cortical networks with neuronal avalanches. J Neurosci. 2011;31(1):55–63.
 60. Saremi S, Sejnowski TJ. On Criticality in HighDimensional Data. Neural Comput. 2014 Jul;26(7):1329–1339.
 61. Rasmussen C, Williams C. Gaussian Processes for Machine Learning. MIT Press; 2006.
 62. Berkson J. Tests of significance considered as evidence. Journal of the American Statistical Association. 1942;37(219):325–335.
 63. Archer E, Park IM, Pillow JW. Bayesian entropy estimation for countable discrete distributions. The Journal of Machine Learning Research. 2014;15(1):2833–2868.
 64. Schmidt M. minFunc: unconstrained differentiable multivariate optimization in Matlab; 2005. http://www.cs.ubc.ca/~schmidtm/Software/minFunc.html.
Supporting Information
Appendix S1 Fitting the K-pairwise maximum entropy model to data
To identify the values of the model parameters which yield the best fit of the maximum entropy model to data, we maximise the log-likelihood of the model given the data. The general form of the (average) log-likelihood of a maximum entropy model parametrised by the vector $\lambda$ is given by

$$L(\lambda) \;=\; \frac{1}{m}\sum_{t=1}^{m} \lambda^\top f\!\left(x^{(t)}\right) \;-\; \log Z(\lambda), \qquad Z(\lambda) = \sum_{x \in \{0,1\}^n} \exp\!\left(\lambda^\top f(x)\right), \tag{13}$$
for the spike-data vectors $x^{(t)}$, $t = 1, \ldots, m$. Any choice of the feature function $f$ defines a specific maximum entropy model over the $n$-dimensional binary space $\{0,1\}^n$. For the K-pairwise maximum entropy model used in this paper, $f$ is composed of

- first-order features $f_i(x) = x_i$, $i = 1, \ldots, n$, with corresponding parameters collected in $h$. The $h_i$, $i = 1, \ldots, n$, control single-cell firing rates (in units of spike probability per bin rather than Hz),

- second-order features $f_{ij}(x) = x_i x_j$, with parameters $J_{ij}$, $1 \leq i < j \leq n$, controlling pairwise neuronal correlations, and

- $n+1$ population-scale features $f_k(x) = \mathbb{1}\!\left[\sum_i x_i = k\right]$, $k = 0, \ldots, n$, with parameters $V_k$. The vector $V$ controls the distribution of the overall number of spikes in each temporal bin.
Note that there is some degeneracy between the parameter vector $V$ and both $h$ and $J$: a global upward shift of firing rates, for example, can be achieved both by adding a positive constant $c$ to each $h_i$, or by adding $c\,k$ to each of the $V_k$. Similarly, adding a constant $c$ to every $J_{ij}$ can be balanced by subtracting $c\,k(k-1)/2$ from each $V_k$. Since either manipulation of $V$ vanishes for $k = 0$, fixing $V_0$ is not sufficient for getting rid of this parameter degeneracy. As we never interpreted the parameter values themselves, but only the fit to data, we made no attempt to add additional constraints.
We can rewrite the K-pairwise model in the general maximum entropy form by stacking the feature functions $x_i$, $x_i x_j$ and $\mathbb{1}[\sum_i x_i = k]$ into the vector-valued feature function $f$, and doing the same with the parameters $h$, $J$ and $V$ to obtain $\lambda$. The derivative of the log-likelihood with respect to any single parameter $\lambda_i$ is given by

$$\frac{\partial L(\lambda)}{\partial \lambda_i} \;=\; \frac{1}{m}\sum_{t=1}^{m} f_i\!\left(x^{(t)}\right) \;-\; \mathbb{E}_{\lambda}\!\left[f_i(x)\right]. \tag{14}$$
As can be seen from equation (14), the gradient of the log-likelihood vanishes if and only if the data means of the features match their expectations under the model.
To deal with datasets of limited size, we maximised a regularised variant of the log-likelihood,

$$L_{\mathrm{reg}}(\lambda) \;=\; L(\lambda) \;-\; V^\top M V. \tag{15}$$

Here, the matrix $M$ implements a combined ridge and smoothing penalty over $V$, $M = \eta_1 I + \eta_2 S^{-1}$, with identity matrix $I$ and smoothing matrix $S$ corresponding to a squared-exponential kernel [61]. We fixed $V_0$ and accounted for this by conditioning the penalty on $V_0$, correspondingly subtracting the conditioned component from $V$. The regularisation weights and the kernel length-scale were held fixed across fits.
To fit maximum entropy models to large neural populations, one needs to

- efficiently estimate the expectations of the feature functions under the model, and

- find efficient methods for updating the parameters $\lambda$.

We introduce two modifications over previous approaches to fitting maximum entropy models to neural data [16] to improve computational efficiency:

- We used pairwise Gibbs sampling and Rao-Blackwellisation to considerably improve estimation of the second-order feature moments $\mathbb{E}[x_i x_j]$.

- The authors of [51] described a trick for efficiently updating the parameters in pairwise binary maximum entropy models: if one restricts updates to coordinate-wise updates, one can calculate the gain from updating a single variable in closed form, which makes it easy to select both the variable to update and the step length. We show how this trick can be extended to allow a joint update of all the population-count parameters $V$. In addition, the gain in log-likelihood depends on the data and the model only through the feature moments, which makes it possible to compute it from a running average over the MCMC sample, and avoids having to store the entire sample in memory at any point.
We describe these two contributions in sections S1.1 and S1.2, respectively.
S1.1 Pairwise Gibbs sampling and Rao-Blackwellisation
Following previous work [16], we used MCMC sampling to approximate the expectations of the feature functions under the K-pairwise model with parameters $\lambda$. These expected values are required to evaluate the gradients of the (penalised) log-likelihood, as well as the log-likelihood gains resulting from parameter updates. As the number of pairwise terms grows quadratically with population size $n$, for large $n$ most of the parameters of the model control pairwise moments $\mathbb{E}[x_i x_j]$. To make the estimation of these pairwise interactions more efficient, we implemented a pairwise Gibbs sampler that for each update step of the Markov chain samples two variables $x_i$ and $x_j$, $i \neq j$, jointly. This furthermore allowed us to ’Rao-Blackwellise’ the single-cell and pairwise feature components $x_i$ and $x_i x_j$ [41, 42, 62], i.e. to use the conditional probabilities $P(x_i = 1 \mid x_{\setminus\{i,j\}})$ and $P(x_i = 1, x_j = 1 \mid x_{\setminus\{i,j\}})$ for moment estimation, instead of the binary $x_i$ and $x_i x_j$.
Rao-Blackwellisation provably reduces the variance of the resulting estimators, and empirically resulted in substantially faster convergence of the MCMC-estimated model firing rates $\mathbb{E}[x_i]$ and second moments $\mathbb{E}[x_i x_j]$, and thus also of the covariances (see supplementary figure S1). Unlike the binary variables $x_i$, however, the conditional probabilities are real numbers from the interval $[0, 1]$ and cannot be stored in memory-efficient sparse matrices. We thus implemented a running average over conditional probabilities that discards the current chain element immediately after drawing the next one, while keeping track of the quantities

$$\frac{1}{t}\sum_{s=1}^{t} P\!\left(x_i = 1 \;\middle|\; x^{(s)}_{\setminus\{i,j\}}\right) \qquad \text{and} \qquad \frac{1}{t}\sum_{s=1}^{t} P\!\left(x_i = 1,\, x_j = 1 \;\middle|\; x^{(s)}_{\setminus\{i,j\}}\right)$$

as $t$ increases from $1$ to the MCMC sample size $m$. We also kept track of the non-Rao-Blackwellised estimates

$$\frac{1}{t}\sum_{s=1}^{t} \delta\!\left(\sum_i x^{(s)}_i,\; k\right)$$

for the expectations of the population-level indicator feature functions, with Kronecker delta function $\delta(a, b) = 1$ if $a = b$, and $0$ otherwise.
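The running averages can be maintained with a constant-memory incremental update; a minimal sketch:

```python
def running_mean(stream):
    """Constant-memory running average: each chain element is consumed once
    and then discarded, so the full MCMC sample is never stored."""
    mean, t = 0.0, 0
    for value in stream:
        t += 1
        mean += (value - mean) / t   # mean over the first t elements
    return mean
```

The same update applies to each tracked scalar (conditional probabilities or count indicators) independently.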
We quantified the advantage of Rao-Blackwellising the Gibbs sampler with long Markov chains drawn from the K-pairwise maximum entropy model fits to populations subsampled from the simulated RGC data. For each investigated parameter fit, we ran two chains under different conditions: a first chain for which we Rao-Blackwellised the single-cell and pairwise feature moments, and a second chain for which we did not. These Markov chains were run for orders of magnitude more sweeps than were used for the individual parameter updates within this study. The long sample runs served to approximate the ”true” expected values of the target quantities of interest to us: firing rates, covariances and the population spike-count distribution.
We quantified the speed of convergence of the estimates to the ”true” expected feature moments by the normalised MSE between sampler-derived feature moments after any given length of the MCMC chain and the results obtained after the full chain length. After the full run, the Rao-Blackwellised and non-Rao-Blackwellised estimates agreed closely for firing rates, covariances and population spike-count distributions. We computed the distance to the ”truth” for each condition as the normalised MSE to the estimate averaged over both conditions. We obtained MCMC estimates for the feature moments of the K-pairwise maximum entropy model fits to subsampled populations drawn from our retina simulation. Supplementary figure S1a displays the results for the two conditions, Rao-Blackwellised vs. non-Rao-Blackwellised, for each of the investigated fits.
MSEs of firing rates for single-cell features did not benefit from Rao-Blackwellisation. This is expected, as each $x_i$ is sampled $n - 1$ times per sweep, so these moments are already well estimated relative to the second-order features. For covariances, normalised MSEs showed a clear improvement under Rao-Blackwellisation, visible as an approximately constant offset between the averages over all parameter fits in the log-log domain, as seen in figure S1b. The normalised MSE was consistently higher for non-Rao-Blackwellised estimates (given by the downward offset of the normalised MSEs of the Rao-Blackwellised estimates), and correspondingly only a fraction of the samples from Rao-Blackwellised runs was needed to achieve the same normalised MSE on the pairwise moments as from non-Rao-Blackwellised runs (given by the leftward offset). Both the required sample-size fraction and the ratio of normalised MSEs were stable across chain lengths.
S1.2 Exploiting the structure of the K-pairwise feature functions allows block-wise parameter updates
As described in the previous section, we can use MCMC to obtain the expected values of the feature function that are needed to optimise the model parameters $\lambda$. To find the parameter setting which maximises the log-likelihood over the given data vectors $x^{(1)}, x^{(2)}, \ldots, x^{(m)}$, we follow an iterative update scheme introduced previously [51], and extend it to the K-pairwise model. The update scheme optimises parameter changes $\delta$ relative to a current parameter estimate $\lambda$, rather than the parameters directly. The benefit of this scheme over standard gradient ascent on the regularised log-likelihood as in eq. (14) is that we can give closed-form solutions for the optimal value of a single component of $\delta$ when temporarily holding all other components fixed.
Changing the current parameter estimate $\lambda$ to $\lambda + \delta$ leads to a change in log-likelihood of

$$L(\lambda + \delta) - L(\lambda) \;=\; \frac{1}{m}\sum_{t=1}^{m} \delta^\top f\!\left(x^{(t)}\right) \;-\; \log \mathbb{E}_{\lambda}\!\left[\exp\!\left(\delta^\top f(x)\right)\right]. \tag{16}$$
The only relevant expectations are w.r.t. the data distribution and $P_\lambda$, i.e. the model at the current parameter estimate. The term $\mathbb{E}_{\lambda}[\exp(\delta^\top f(x))]$ can be simplified when restricting the update vector $\delta$ to be non-zero only in selected components. In the simplest case, only a single component $\delta_i$ is updated. In this case, the fact that all components of the K-pairwise feature function are binary allows moving the exponential out of the expectation, a trick used by [51]:

$$\mathbb{E}_{\lambda}\!\left[e^{\delta_i f_i(x)}\right] \;=\; 1 + \left(e^{\delta_i} - 1\right)\mathbb{E}_{\lambda}\!\left[f_i(x)\right].$$

The resulting single-coordinate updates only require the feature moments $\mathbb{E}_{\lambda}[f_i(x)]$:

$$L(\lambda + \delta_i e_i) - L(\lambda) \;=\; \delta_i\, \hat{\mathbb{E}}_{\mathrm{data}}\!\left[f_i(x)\right] \;-\; \log\!\left(1 + \left(e^{\delta_i} - 1\right)\mathbb{E}_{\lambda}\!\left[f_i(x)\right]\right),$$

where $e_i$ denotes the $i$-th unit vector.
Equation (16) can now be solved analytically for the single free component $\delta_i$ that maximises the change in log-likelihood. A closed-form optimal solution is still possible when adding an $L_1$ penalty to the log-likelihood [51]. We use this regularised variant to calculate the possible gain in penalised log-likelihood for each possible update of the single-cell ($h_i$) and pairwise ($J_{ij}$) parameters, and then perform the update which yields the largest gain.
If we instead allow more than a single component of the update $\delta$ to be non-zero, we would in general have to deal with the term

$$\mathbb{E}_{\lambda}\!\left[\exp\!\left(\sum_{i \in I} \delta_i f_i(x)\right)\right],$$

which requires the higher-order moments $\mathbb{E}_{\lambda}\!\left[\prod_{i \in I'} f_i(x)\right]$ for all subsets $I' \subseteq I$, where $I$ is the index set of components of $\delta$ that are not set to zero.

The population spike-count features, however, are mutually exclusive (only one of the $n+1$ features can be non-zero at any time), and therefore we can update all parameters of $V$ jointly and still pull the exponential out of the expectation. For the population spike-count features, hereafter collectively called $f_V$, all such cross terms of order two or higher are zero due to the sparsity of $f_V$. When restricting the current parameter update $\delta$ to only update components corresponding to $V$, we have

$$\mathbb{E}_{\lambda}\!\left[\exp\!\left(\sum_{k=0}^{n} \delta_k\, \mathbb{1}\!\left[\sum_i x_i = k\right]\right)\right] \;=\; 1 + \sum_{k=0}^{n}\left(e^{\delta_k} - 1\right) P_\lambda(K = k)$$

and

$$L(\lambda + \delta) - L(\lambda) \;=\; \sum_{k=0}^{n} \delta_k\, \hat{P}_{\mathrm{data}}(K = k) \;-\; \log\!\left(1 + \sum_{k=0}^{n}\left(e^{\delta_k} - 1\right) P_\lambda(K = k)\right).$$
We obtained estimates of $P_\lambda(K = k)$ from the MCMC sample using the indicator functions $\mathbb{1}[\sum_i x^{(s)}_i = k]$, and optimised the gain w.r.t. $\delta_0, \delta_1, \ldots, \delta_n$ using gradient-based methods [64].
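Without the smoothing penalty on $V$, the joint update even has a closed form: since the count indicators are mutually exclusive and exhaustive, maximising $\sum_k \delta_k \hat{q}_k - \log \sum_k q_k e^{\delta_k}$ (with $\hat{q}$ the data and $q$ the model count distributions) gives $\delta_k = \log(\hat{q}_k / q_k)$ up to an additive constant, and the maximal gain equals the KL divergence $\mathrm{KL}(\hat{q}\,\|\,q)$. A Python sketch of this unpenalised special case (the paper itself optimises the penalised objective with gradient-based methods):

```python
import numpy as np

def joint_V_update(q_data, q_model):
    """Unpenalised closed-form joint update of all V_k. Because the count
    indicators are mutually exclusive and exhaustive,
    E_model[exp(sum_k delta_k 1[K=k])] = sum_k q_model[k] * exp(delta_k)."""
    delta = np.log(q_data / q_model)
    gain = q_data @ delta - np.log(q_model @ np.exp(delta))
    return delta, gain

# toy count distributions over a population of n = 4 neurons (illustrative)
q_model = np.array([0.50, 0.25, 0.15, 0.07, 0.03])
q_data  = np.array([0.40, 0.30, 0.20, 0.07, 0.03])
delta, gain = joint_V_update(q_data, q_model)
```

After this update the model count distribution matches the data count distribution exactly, which is why the gain is the full KL divergence.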
In summary, our update scheme for maximising the log-likelihood proceeds as follows: for a given parameter vector $\lambda$, we first estimate the expectations of the feature functions $x_i$, $x_i x_j$ and $\mathbb{1}[\sum_i x_i = k]$ using a running average over an MCMC sample with Rao-Blackwellisation. We then calculate, for each possible single-neuron parameter $h_i$ and each possible pairwise term $J_{ij}$, the gain in penalised log-likelihood that we would get from updating it, using the methods described above and derived in [51]. We additionally compute the gain in penalised log-likelihood that would result from optimising all of the free parameters of $V$ jointly, using convex optimisation. Finally, we choose the update that brings the largest gain, and either update a single $h_i$, a single $J_{ij}$, or all of $V$. Subsequently, we again estimate the feature moments using MCMC sampling given the current estimate of $\lambda$ before we update again. We initialised the algorithm assuming independent neurons (i.e. setting each $h_i$ using the firing rate of neuron $i$, and leaving $J$ and $V$ at zero). The algorithm then typically first updated all of $V$, before proceeding to jump between different $h$, $J$ and $V$ updates.
Appendix S2 Supplementary Text: Specific heat in simple models
We refer to a maximum entropy model as ’flat’ if it is fully specified by the population spike-count distribution $P(K = k)$, i.e. the model class studied in [21, 30, 29]. In this model class, all neurons have the same firing rate and pairwise correlation. As neuron identities become interchangeable, all possible patterns $x$ with $\sum_i x_i = k$ are assigned the same probability $P(K = k) / \binom{n}{k}$. In flat models, all relevant population properties can be computed by summing over the $n + 1$ different spike counts, and one never has to (explicitly) sum over all $2^n$ possible spike patterns.
S2.1 A non-critical special case: independent neurons
A special case of a flat model is an independent model in which all neurons have the same firing rates and zero correlations. Assuming independent spiking for each of the $n$ neurons and a shared probability $p$ to fire in a time bin, the distribution of population spike counts is given by a binomial distribution,

$$P(K = k) \;=\; \binom{n}{k}\, p^k (1 - p)^{n - k}.$$

To compute specific heat capacities for the underlying neural population of size $n$, we can rewrite the corresponding pattern distribution in maximum entropy form,

$$P(x) \;=\; \frac{1}{Z}\exp\!\left(h \sum_{i=1}^{n} x_i\right), \qquad Z = \left(1 + e^h\right)^n.$$
Reintroducing the parameter $p$ via $h = \log\frac{p}{1-p}$, we find

$$\log P(x) \;=\; K \log\frac{p}{1-p} \;+\; n \log(1 - p), \qquad K = \sum_i x_i,$$

and for the heat capacity, we get

$$c(T = 1) \;=\; \frac{1}{n}\operatorname{Var}\!\left[\log P(x)\right] \;=\; \frac{1}{n}\left(\log\frac{p}{1-p}\right)^2 \operatorname{Var}[K].$$

The binomial variance is $\operatorname{Var}[K] = n\, p (1 - p)$. We plug this in and see that at unit temperature $T = 1$, the specific heat is given by

$$c(1) \;=\; p\,(1 - p)\left(\log\frac{p}{1-p}\right)^2, \tag{17}$$

which is independent of population size $n$.
When explicitly introducing temperatures other than $T = 1$, we add a factor $1/T$ that scales the parameter $h$ and renormalise, yielding

$$P_T(x) \;=\; \frac{1}{Z(T)}\exp\!\left(\frac{h}{T}\sum_{i=1}^{n} x_i\right),$$

which is the same functional form as was given for the binomial distribution at $T = 1$, with only the parameter $h$ replaced by $h/T$. We can also go back to the standard binomial parametrisation and obtain

$$p_T \;=\; \frac{p^{1/T}}{p^{1/T} + (1 - p)^{1/T}}.$$

Changing the temperature thus retains the binomial form of the population model, and we can generalise the expression for the specific heat (17) of the independent flat model to any temperature:

$$c(T) \;=\; \frac{1}{T^2}\; p_T\,(1 - p_T)\left(\log\frac{p}{1-p}\right)^2.$$
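For independent neurons the specific heat at any temperature takes the closed form $c(T) = p_T(1 - p_T)\left(\log\frac{p}{1-p}\right)^2 / T^2$ with $p_T = p^{1/T} / (p^{1/T} + (1-p)^{1/T})$. A Python sketch (illustrative) checks this against a direct evaluation of the variance of log-probabilities over spike counts:

```python
import math
import numpy as np

def c_binomial_formula(p, T):
    """c(T) = p_T (1 - p_T) (log(p/(1-p)))^2 / T^2 for independent neurons,
    with p_T = p^(1/T) / (p^(1/T) + (1-p)^(1/T)); independent of n."""
    pT = p ** (1 / T) / (p ** (1 / T) + (1 - p) ** (1 / T))
    return pT * (1 - pT) * np.log(p / (1 - p)) ** 2 / T ** 2

def c_binomial_enumerated(n, p, T):
    """Same quantity via Var_{P_T}[log P(x)] / (n T^2), summing over counts."""
    k = np.arange(n + 1)
    log_nk = np.array([math.lgamma(n + 1) - math.lgamma(i + 1) - math.lgamma(n - i + 1)
                       for i in k])
    logp = k * np.log(p / (1 - p)) + n * np.log(1 - p)  # log-prob of one pattern
    lt = logp / T + log_nk                              # count mass under P_T
    w = np.exp(lt - lt.max())
    w /= w.sum()
    mean = w @ logp
    return (w @ (logp - mean) ** 2) / (n * T ** 2)
```

The two agree for any $n$, confirming that the independent population's specific heat neither diverges with population size nor peaks away from the trivial temperature dependence.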