Sharing deep generative representation for perceived image reconstruction from human brain activity
Abstract
Decoding human brain activities via functional magnetic resonance imaging (fMRI) has gained increasing attention in recent years. While encouraging results have been reported in brain-state classification tasks, reconstructing the details of human visual experience remains difficult. Two main challenges that hinder the development of effective models are the perplexing fMRI measurement noise and the high dimensionality of limited data instances. Existing methods generally suffer from one or both of these issues and yield unsatisfactory results. In this paper, we tackle this problem by casting the reconstruction of the visual stimulus as the Bayesian inference of a missing view in a multiview latent variable model. Sharing a common latent representation, our joint generative model of external stimulus and brain response is not only “deep” in extracting nonlinear features from visual images, but also powerful in capturing correlations among voxel activities of fMRI recordings. The nonlinearity and deep structure endow our model with strong representation ability, while the correlations of voxel activities are critical for suppressing noise and improving prediction. We devise an efficient variational Bayesian method to infer the latent variables and the model parameters. To further improve the reconstruction accuracy, the latent representations of testing instances are constrained to be close to those of their neighbours from the training set via posterior regularization. Experiments on three fMRI recording datasets demonstrate that our approach can more accurately reconstruct visual stimuli.
1 Introduction
Brain decoding, which aims to predict information about external stimuli from brain activity, plays an important role in brain-machine interfaces (BMIs). Recent developments in this area have shown promising results [Schoenmakers et al., 2015; Lee and Kuhl, 2016]. However, most previous studies focus only on predicting the category of the presented stimulus [Van Gerven et al., 2010a; Ng and Abugharbieh, 2011; Damarla and Just, 2013; Elahe’Yargholi, 2016]. Accurate reconstruction of visual stimuli from brain activity has not been examined adequately and still requires considerable effort to improve. Two main challenges that hinder the development of effective models are the perplexing measurement noise of functional magnetic resonance imaging (fMRI) and the high dimensionality of limited data instances. Existing methods generally suffer from one or both of these issues and yield unsatisfactory results.
Fujiwara et al. proposed to use Bayesian canonical correlation analysis (BCCA) for building the reconstruction model, where image bases are automatically extracted from the measured data [Fujiwara et al., 2013]. As a latent variable model interpretation of non-probabilistic CCA, BCCA assumes a linear observation model for visual images and a spherical covariance for the Gaussian distribution of voxel activities. In practice, however, a linear observation model for visual images has limited representation power, and a spherical covariance cannot capture the correlations among voxel activities. Since measurement noise is ubiquitous in voxel activities, exploiting the correlations among voxel activities is critical for suppressing noise and improving prediction performance.
On the other hand, introducing deep structure into multiview representation learning has attracted increasing attention recently [Wang et al., 2015; Chandar et al., 2016]. Deep canonically correlated autoencoders (DCCAE), which consist of two deep autoencoders and optimize the combination of the canonical correlation between the learned bottleneck representations and the reconstruction errors of the autoencoders, can extract nonlinear features from both views and reconstruct each view from the correlational bottleneck representations [Wang et al., 2015]. Nevertheless, DCCAE does not consider the cross-reconstruction between the two views, which limits its effectiveness in applications where a missing view needs to be reconstructed from the existing one. To our knowledge, no deep multiview learning model with a shared generative latent representation has been designed specifically for missing view reconstruction.
Focusing on these problems, we present a deep generative multiview model (DGMM), in which we cast the reconstruction of the perceived image as the Bayesian inference of the missing view. Sharing a common latent representation, DGMM allows us to generate visual images and fMRI activity patterns simultaneously. For visual images, unlike BCCA, we explore nonlinear observation models parameterized by deep neural networks (DNNs), which can be multilayer perceptrons (MLPs) or convolutional neural networks (CNNs). This nonlinearity and deep structure endow our model with strong representation ability. For fMRI activity patterns, we adopt a full covariance matrix for the Gaussian distribution of voxel activities. While the full covariance matrix has the advantage of capturing the correlations among voxels, it results in severe computational issues. To reduce the complexity, we impose a low-rank assumption on the covariance matrix. This helps suppress noise and improve prediction performance. Furthermore, we devise an efficient mean-field variational inference method to infer the latent variables and the model parameters. To further improve the reconstruction accuracy, the latent representations of testing instances are constrained to be close to those of their neighbours from the training set via posterior regularization [Zhu et al., 2014]. Compared with the non-probabilistic deep multiview representation learning models mentioned above [Wang et al., 2015; Chandar et al., 2016], our Bayesian model has the inherent advantage of avoiding overfitting to small training sets through model averaging. Finally, extensive experimental comparisons on three fMRI recording datasets demonstrate that our approach can reconstruct visual images from fMRI measurements more accurately.
2 Related work
In the brain decoding literature, relatively few studies have reported perceived image reconstructions to date. Miyawaki et al. reconstructed lower-order information such as binary contrast patterns using a combination of multi-scale local image bases whose shapes are predefined [Miyawaki et al., 2008]. Van Gerven et al. reconstructed handwritten digits using deep belief networks [Van Gerven et al., 2010b]. Schoenmakers et al. reconstructed handwritten characters using a straightforward linear Gaussian approach [Schoenmakers et al., 2013]. Fujiwara et al. proposed to build the reconstruction model in which image bases can be automatically estimated by Bayesian canonical correlation analysis (BCCA) [Fujiwara et al., 2013]. In addition, there are works that try to reconstruct movie clips [Nishimoto et al., 2011; Haiguang Wen and Liu, 2016].
Although Fujiwara et al. [Fujiwara et al., 2013] used a strategy similar to ours for visual image reconstruction, its linear observation model for visual images has limited representation power in practice. Several recently proposed deep multiview representation learning models can also be applied to visual image reconstruction [Wang et al., 2015; Chandar et al., 2016]. For example, deep canonically correlated autoencoders (DCCAE), with nonlinear observation models for both views, can learn deep correlational representations and reconstruct each view from the learned representations [Wang et al., 2015]. Compared with DCCAE, correlational neural networks (CorrNet) further consider the cross-reconstruction between the two views [Chandar et al., 2016]. However, directly applying the nonlinear maps of DCCAE and CorrNet to limited, noisy brain activity data is prone to overfitting.
Inspired by recent developments in deep generative models such as variational autoencoders (VAE) [Kingma and Welling, 2014], we present a deep generative multiview model (DGMM), which can be viewed as a nonlinear extension of the linear method BCCA. To the best of our knowledge, this paper is the first to study visual image reconstruction via Bayesian deep learning.
3 Perceived image reconstruction with DGMM
In this section, we cast the reconstruction of perceived images from human brain activity as the Bayesian inference of a missing view in a multiview latent variable model.
Assume the training set consists of paired observations from two distinct views (), denoted by () (), where is the training set size, and for . Here and denote the visual images and fMRI activity patterns, respectively. The presence of paired twoview data presents an opportunity to learn better representations by analyzing both views simultaneously. Therefore, we introduce the shared latent variables to relate the visual images to the fMRI activity patterns . The shared latent variables are treated as the following Gaussian prior distribution,
(1) 
Since the visual image and associated fMRI activity pattern are assumed to be generated from the same latent variables, we have two likelihood functions. One is for visual images, and the other is for fMRI activity patterns.
3.1 Deep generative model for perceived images
When observation noises for image pixels are assumed to follow a Gaussian distribution with zero mean and diagonal covariance, the likelihood function of visual images is
(2) 
where the mean and covariance are nonlinear functions of the latent variables . To allow the second moment of the data to be captured by the density model, we choose these nonlinear functions to be deep neural networks (DNNs), which we refer to as the generative network, parameterized by . Here the DNNs can be multilayer perceptrons (MLPs) or convolutional neural networks (CNNs). Compared with a linear observation model, DNNs can extract nonlinear features from visual images and capture the stages of human visual processing from early visual areas towards the ventral stream [Güçlü and van Gerven, 2015; Cichy et al., 2016]. This nonlinearity and deep structure endow our model with strong representation ability.
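As a rough illustration of the likelihood described above, the following sketch evaluates a diagonal-covariance Gaussian log-likelihood whose mean and log-variance come from a small MLP. It is a minimal sketch, not the authors' implementation: the layer widths (10-128-256-784), the `tanh` activation, the log-variance parameterization and all function names are assumptions chosen for illustration.

```python
import numpy as np

def init_layer(n_in, n_out, rng):
    """Small random weights and zero biases for one dense layer."""
    return rng.normal(0.0, 0.1, size=(n_in, n_out)), np.zeros(n_out)

def generative_network(z, hidden_layers, mean_head, logvar_head):
    """Map latent codes z (N x K) to a per-pixel mean and log-variance."""
    h = z
    for W, b in hidden_layers:
        h = np.tanh(h @ W + b)          # nonlinear hidden features
    mean = h @ mean_head[0] + mean_head[1]
    logvar = h @ logvar_head[0] + logvar_head[1]
    return mean, logvar

def diag_gaussian_loglik(x, mean, logvar):
    """log N(x | mean, diag(exp(logvar))), summed over pixels (one value per image)."""
    return -0.5 * np.sum(np.log(2.0 * np.pi) + logvar
                         + (x - mean) ** 2 / np.exp(logvar), axis=1)

rng = np.random.default_rng(0)
K, D = 10, 784                          # latent size and pixel count (assumed)
hidden_layers = [init_layer(K, 128, rng), init_layer(128, 256, rng)]
mean_head = init_layer(256, D, rng)
logvar_head = init_layer(256, D, rng)

z = rng.standard_normal((5, K))         # five latent codes
x = rng.random((5, D))                  # five stand-in images
mean, logvar = generative_network(z, hidden_layers, mean_head, logvar_head)
loglik = diag_gaussian_loglik(x, mean, logvar)
```

Predicting the log-variance rather than the variance keeps the covariance positive without extra constraints, which is a common choice in VAE-style models.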
3.2 Generative model for fMRI activity patterns
fMRI voxels are generally highly correlated, and the correlation can carry relevant information about stimuli or tasks, even in the absence of information in individual voxels [Yamashita et al., 2008; Hossein-Zadeh and others, 2016]. However, most existing methods [Fujiwara et al., 2013; Schoenmakers et al., 2013] simply assume a spherical or diagonal covariance for the Gaussian distribution of voxel activities, thus ignoring any correlations among voxels. Unlike them, we assume the observation noise of voxel activities follows a Gaussian distribution with zero mean and a full covariance matrix. While this difference might seem minor, it is critical for the model to be able to suppress noise and improve prediction performance. In addition, although nonlinear transformations of fMRI activity patterns are more powerful than linear transformations (in terms of the types of features they can learn), existing multi-voxel pattern analysis (MVPA) studies have not found a clear performance benefit for nonlinear over linear transformations. Therefore, we assume the likelihood function of fMRI activity patterns is
(3) 
The model should be further complemented with priors for the projection matrix and the covariance matrix . Popular choices would be automatic relevance determination (ARD) prior and Wishart distribution for and , respectively,
(4) 
where denotes gamma distribution with shape parameter and rate parameter , and denote the scale matrix and degrees of freedom for Wishart distribution, respectively.
While the above model has the advantage of capturing the correlations among voxels, it results in severe computational issues (the cost is cubic as a function of ). Fortunately, the problem of inferring a high-dimensional covariance matrix can be solved by introducing auxiliary latent variables [Archambeau and Bach, 2009],
(5) 
and rewriting the likelihood function in Eq.(3) as
(6) 
where ARD prior and simple gamma prior can be set for the extra projection matrix and variance parameter , respectively,
(7) 
The graphical models of DGMM are shown in Fig.1. Note that sparsity of the projection matrices and can be tuned by assigning suitable values to the hyperparameters and , respectively.
By integrating out auxiliary latent variables , Eq.(6) can be shown to be equivalent to imposing a low-rank assumption on the covariance matrix in Eq.(3) (), which allows decreasing the computational complexity. From another perspective, this low-rank assumption produces a full factorization of the variation in fMRI data into shared components and private components . The ability to identify what is shared and what is non-shared makes our model good at suppressing noise and improving prediction performance.
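The computational benefit of the low-rank assumption can be sketched numerically. Assuming the auxiliary latent variables have a standard normal prior and the residual noise is isotropic, marginalizing them yields a covariance of the form "low-rank term plus scaled identity", whose inverse follows from the Woodbury identity at the cost of solving only a small system. The names `H` (auxiliary projection matrix) and `gamma` (noise precision) are illustrative assumptions, not necessarily the paper's notation.

```python
import numpy as np

def lowrank_covariance(H, gamma):
    """Marginal voxel covariance H^T H + I / gamma for an auxiliary projection H (k x D)."""
    return H.T @ H + np.eye(H.shape[1]) / gamma

def lowrank_precision(H, gamma):
    """Inverse of the covariance above via the Woodbury identity.

    Only a k x k linear system is solved, so the expensive step scales
    with the number of auxiliary latent dimensions k, not with D.
    """
    k, d = H.shape
    inner = np.eye(k) + gamma * (H @ H.T)            # k x k
    return gamma * np.eye(d) - gamma ** 2 * (H.T @ np.linalg.solve(inner, H))

rng = np.random.default_rng(0)
H = rng.standard_normal((3, 50))                     # k = 3 latent dims, D = 50 voxels
gamma = 2.0
C = lowrank_covariance(H, gamma)
P = lowrank_precision(H, gamma)
```

For realistic voxel counts one would apply the Woodbury form directly to vectors instead of materializing the D x D matrix; the dense version here is only meant to make the identity easy to check.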
As shorthand notations, all hyperparameters in the model will be denoted by , while the priors by and the remaining variables by . Dependence on is omitted for clarity throughout the paper. Then we can get the following posterior distribution using Bayes’ rule
(8) 
where is the normalization constant.
4 Variational posterior inference
Given the above generative model, exact inference is intractable. Here we formulate a mean-field variational approximate inference method to infer the latent variables and model parameters. Specifically, we assume there is a family of factorized and free-form (except for ) variational distributions
and define as a product of multivariate Gaussian distributions with diagonal covariance (we also considered conditioning the posterior distribution on both and , but did not observe an obvious performance improvement), i.e.,
where the mean and covariance are outputs of the recognition network specified by another DNN with parameters . The objective is then to find the optimal distribution that minimizes the Kullback-Leibler (KL) divergence between the approximating distribution and the target posterior, i.e.,
where is the space of probability distributions. Equivalently, we can also bound the marginal likelihood:
(9) 
where we used the fact that KL divergence is guaranteed to be nonnegative, and
Intuitively, and can be interpreted as the (negative) expected reconstruction errors of visual images and fMRI activity patterns, respectively. Maximizing this lower bound strikes a balance between minimizing reconstruction errors of two views and minimizing the KL divergence between the approximate posterior and the prior.
4.1 Learning , and
Given the fixedform approximate posterior distribution for factor , can be computed exactly as:
On the other hand, and can be approximated by Monte Carlo sampling [Kingma and Welling, 2014; Kingma et al., 2014]. Instead of sampling directly from , is computed as a deterministic function of and some noise term such that has the desired distribution. Assuming we draw samples, () can be expressed as
where and denotes element-wise multiplication. Then the resulting Monte Carlo approximations are
Finally, the parameters of the DNNs ( and ) can be obtained by optimizing the objective function (based on mini-batches) using standard stochastic gradient-based optimization methods such as SGD, RMSprop or AdaGrad [Duchi et al., 2011].
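The two ingredients of this subsection, the closed-form KL term and the reparameterized sampling, can be sketched as follows. This is a minimal sketch under the assumption that the prior over the shared latent variables is a standard normal; the function names and the symbols `mu`, `sigma` and `eps` are illustrative.

```python
import numpy as np

def reparameterize(mu, sigma, n_samples, rng):
    """Draw z = mu + sigma * eps with eps ~ N(0, I), so z ~ N(mu, diag(sigma^2))."""
    eps = rng.standard_normal((n_samples,) + mu.shape)
    return mu + sigma * eps                      # element-wise multiplication

def kl_to_standard_normal(mu, sigma):
    """KL( N(mu, diag(sigma^2)) || N(0, I) ), computed in closed form."""
    return 0.5 * np.sum(mu ** 2 + sigma ** 2 - 1.0 - 2.0 * np.log(sigma), axis=-1)

rng = np.random.default_rng(0)
mu = np.array([1.0, -2.0])                       # stand-in recognition-network mean
sigma = np.array([0.5, 2.0])                     # stand-in recognition-network std
z = reparameterize(mu, sigma, 20000, rng)
kl = kl_to_standard_normal(mu, sigma)
```

Because the noise is sampled independently of `mu` and `sigma`, gradients of a Monte Carlo estimate can flow through both, which is what makes mini-batch stochastic optimization of the network parameters practical.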
4.2 Learning and
For a specific factor (except for ), it can be shown that when keeping all other factors fixed the optimal distribution satisfies
For our model, thanks to the conjugacy, the resulting optimal distribution of each factor follows the same distribution as the corresponding factor.
The optimal distributions of the projection parameters can be found as a product of multivariate Gaussian distributions:
(10) 
where notation denotes the expectation operator, i.e., means the expectation of over its current optimal distribution, and
The optimal distribution of the auxiliary latent variables can also be found as a product of multivariate Gaussian distributions:
(11) 
where
The optimal distributions of the precision variables can be formulated as:
(12) 
where .
4.3 Convergence
The inference mechanism sequentially updates the optimal distributions of the latent variables and the model parameters until convergence, which is guaranteed because the KL divergence is convex with respect to each of the factors.
4.4 Prediction
Using the estimated parameters, we can derive the predictive distribution for a visual image given a new brain activity . The predictive distribution can be formulated as follows,
(13) 
where the posterior distribution of latent variables can be derived by
(14) 
The posterior distribution can be equivalently obtained by solving the following information theoretical optimization problem:
(15) 
Expanding Eq.(15) and ignoring the term unrelated to , we further get
To ensure the latent representations of testing instances are close to those of their neighbours from the training set, we adopt the posterior regularization [Zhu et al., 2014] strategy to incorporate manifold regularization into the above posterior predictive distribution . Specifically, we define the following expected manifold regularization:
where is some similarity measure of instances and . Here we use a k-nearest-neighbor graph to effectively model the local geometric structure in the input space, and the affinity graph is defined as:
where denotes the k-nearest neighbors of .
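The neighbourhood graph and the manifold penalty it induces can be sketched as below. This is only an illustration under stated assumptions: the affinity is taken to be binary (1 for a k-nearest training neighbour, 0 otherwise), distances are Euclidean in the fMRI input space, and the penalty is evaluated on point estimates of the latent codes rather than as an expectation. All names are hypothetical.

```python
import numpy as np

def knn_affinity(Y_test, Y_train, k):
    """W[i, j] = 1 if training instance j is among the k nearest neighbours of test instance i."""
    d2 = ((Y_test[:, None, :] - Y_train[None, :, :]) ** 2).sum(axis=-1)
    idx = np.argsort(d2, axis=1)[:, :k]          # indices of the k closest training points
    W = np.zeros_like(d2)
    np.put_along_axis(W, idx, 1.0, axis=1)
    return W

def manifold_penalty(Z_test, Z_train, W):
    """Sum over (i, j) of W[i, j] * ||z_i - z_j||^2 for given latent codes."""
    d2 = ((Z_test[:, None, :] - Z_train[None, :, :]) ** 2).sum(axis=-1)
    return float((W * d2).sum())

Y_train = np.array([[0.0], [1.0], [10.0]])       # toy fMRI patterns (one voxel)
Y_test = np.array([[0.2]])
W = knn_affinity(Y_test, Y_train, k=2)

Z_train = np.array([[0.0], [2.0], [5.0]])        # toy latent codes
Z_test = np.array([[1.0]])
penalty = manifold_penalty(Z_test, Z_train, W)
```

A small penalty means the test latent code sits close to the codes of the training instances whose brain responses resemble the test response.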
Then our posterior regularization strategy can be formulated as
(16) 
where the parameter controls the expected scale. As a direct way to impose constraints and incorporate knowledge in Bayesian models, posterior regularization is more natural and general than specially designed priors. However, directly solving Eq.(16) with is difficult and inefficient. Let
then Eq.(16) can be rewritten as
(17) 
Solving problem Eq.(17), we can get the posterior distribution
(18) 
Because the multiple integral over the random variables , , and is intractable, we replace the random variables , and with the means of their estimated optimal distributions and , respectively, which eliminates the integral over , and . Then becomes
(19) 
Now the posterior distribution can be found as:
(20) 
where
However, with the likelihood of the visual image formulated by a DNN, the integral over the latent variables (Eq.(13)) cannot be computed analytically. As in the training phase, we can approximate this integral by Monte Carlo sampling. Finally, the reconstructed visual image is calculated by taking the mean of all predictions, i.e., , where is the output of the generative network, i.e., .
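The Monte Carlo prediction step described above can be sketched as follows. The sketch assumes the posterior over the test latent variables is Gaussian with a known mean and covariance, and it represents the mean output of the generative network by an arbitrary callable; `decoder_mean`, `post_mean` and `post_cov` are hypothetical names.

```python
import numpy as np

def reconstruct_image(decoder_mean, post_mean, post_cov, n_samples, rng):
    """Average the generative network's mean output over posterior samples of z."""
    z = rng.multivariate_normal(post_mean, post_cov, size=n_samples)
    return decoder_mean(z).mean(axis=0)

rng = np.random.default_rng(0)
A = np.array([[1.0, 2.0, 0.0],
              [0.0, -1.0, 3.0]])                 # toy linear "decoder" weights
post_mean = np.array([0.5, -0.5])
post_cov = 0.01 * np.eye(2)
image = reconstruct_image(lambda z: z @ A, post_mean, post_cov, 5000, rng)
```

With a linear decoder the average converges to the decoder applied to the posterior mean; with a nonlinear generative network the two generally differ, which is why the average over samples is taken.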
5 Experiments
In this section, we present extensive experimental results on fMRI recording datasets to demonstrate the effectiveness of the proposed framework for perceived image reconstruction from human brain activity. Specifically, we compare our DGMM with the following algorithms, which use either a shallow or a deep architecture:

(Miyawaki et al.): a specially designed method that reconstructs visual images by combining local image bases of multiple scales (, and pixels covering an entire image) [Miyawaki et al., 2008]. The shapes of these predefined image bases are fixed, so they may not be optimal for image reconstruction.

(BCCA): a probabilistic extension of the CCA model that relates the fMRI activity space to the visual image space via a set of latent variables [Fujiwara et al., 2013]. BCCA assumes a linear observation model for visual images and a spherical covariance for the Gaussian distribution of fMRI voxels.

(DCCAE): a recent deep multiview representation learning model that consists of two autoencoders and optimizes the combination of the canonical correlation between the learned bottleneck representations and the reconstruction errors of the autoencoders [Wang et al., 2015]. DCCAE does not consider the cross-reconstruction errors between the two views.

(DeCNN): a recent neural decoding method based on multivariate linear regression and a deconvolutional neural network [Haiguang Wen and Liu, 2016; Zeiler et al., 2011]. It is a two-stage cascade model, i.e., it first predicts feature maps by multivariate linear regression, then reconstructs images by feeding the estimated feature maps into a pretrained deconvolutional neural network.
5.1 Experimental testbed and setup
Data description. We conducted experiments on three public fMRI datasets obtained from Miyawaki et al. [Miyawaki et al., 2008] and van Gerven et al. [Van Gerven et al., 2010b; Schoenmakers et al., 2013]. Dataset 1, consisting of contrast-defined patches, contains two independent sessions [Miyawaki et al., 2008]. One is a ‘random image session’, in which spatially random patterns were sequentially presented. The other is a ‘figure image session’, in which alphabetical letters and simple geometric shapes were sequentially presented. We used fMRI data from primary visual area V1 of subject 1 (S1) for the analysis. Note that all compared algorithms were trained on the data from the ‘random image session’ and evaluated on the data from the ‘figure image session’. Dataset 2 contains a hundred handwritten grayscale digits (equal number of 6s and 9s) at a 28 × 28 pixel resolution taken from the training set of the MNIST database and the fMRI data from V1, V2 and V3 [Van Gerven et al., 2010b]. Dataset 3 contains 360 grayscale handwritten characters (equal number of Bs, Rs, As, Is, Ns, and Ss) at a pixel resolution taken from [Van der Maaten, 2009] and the fMRI data of V1, V2 taken from three subjects [Schoenmakers et al., 2013]. The visual images were downsampled from pixels to pixels in our experiments. The details of the three data sets used in our experiments are summarized in Table 1. See [Miyawaki et al., 2008; Van Gerven et al., 2010b; Schoenmakers et al., 2013] for more information, including fMRI data acquisition and preprocessing.
Table 1: Details of the three data sets used in our experiments.
Datasets   #Instances  #Pixels  #Voxels  #ROIs       #Training
Dataset 1  1400        100      797      V1          1320
Dataset 2  100         784      3092     V1, V2, V3  90
Dataset 3  360         784      2420     V1, V2      330
Voxel selection. Voxel selection is an important component of fMRI brain decoding because many voxels may not respond to the visual stimulus. A common approach is to choose those voxels that are maximally correlated with the visual images during training. We chose voxels for which the model provided better predictability (encoding performance). This codifies our intuition that the voxels better predicted from the visual images are those that should be included in the decoding model. The goodness-of-fit between model predictions and measured voxel activities was quantified using the coefficient of determination (R²), which indicates the proportion of variance that is explained by the model. In our experiments, we first computed the R² of each voxel using 10-fold cross-validation on the training data, then selected the voxels with positive R² for further analysis.
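The selection rule above (keep voxels whose cross-validated R² is positive) can be sketched as follows. The encoding model itself is not specified in the text, so this sketch substitutes a ridge regression from image pixels to voxel activities; the ridge penalty, the random fold assignment and all function names are assumptions.

```python
import numpy as np

def cv_r2_per_voxel(X, Y, n_folds=10, ridge=1.0, seed=0):
    """Cross-validated R^2 of a ridge encoding model X -> Y, one score per voxel."""
    n = X.shape[0]
    folds = np.array_split(np.random.default_rng(seed).permutation(n), n_folds)
    pred = np.empty_like(Y, dtype=float)
    for test_idx in folds:
        train_idx = np.setdiff1d(np.arange(n), test_idx)
        x_mean, y_mean = X[train_idx].mean(axis=0), Y[train_idx].mean(axis=0)
        Xc, Yc = X[train_idx] - x_mean, Y[train_idx] - y_mean
        # Ridge solution fitted on the training folds only
        W = np.linalg.solve(Xc.T @ Xc + ridge * np.eye(X.shape[1]), Xc.T @ Yc)
        pred[test_idx] = (X[test_idx] - x_mean) @ W + y_mean
    ss_res = ((Y - pred) ** 2).sum(axis=0)
    ss_tot = ((Y - Y.mean(axis=0)) ** 2).sum(axis=0)
    return 1.0 - ss_res / ss_tot

def select_voxels(X, Y, **kwargs):
    """Indices of voxels whose held-out R^2 is positive."""
    return np.flatnonzero(cv_r2_per_voxel(X, Y, **kwargs) > 0)
```

A call such as `select_voxels(images, voxels)` with `images` of shape (instances, pixels) and `voxels` of shape (instances, voxels) returns the column indices to keep. Because the scores come from held-out predictions, a voxel unrelated to the stimulus tends to score at or below zero and is dropped.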
Parameter setting. The hyperparameters of the proposed DGMM were set to for all data sets, while 5-fold cross-validation was conducted on the training sets to choose better regularization parameters from . For a fair comparison, the model parameters of the other methods were also tuned carefully. In our experiments, we used multilayer perceptrons (MLPs) as the recognition models. Inspired by the selectivity of visual areas to feature maps of varying complexity [Güçlü and van Gerven, 2015; Haiguang Wen and Liu, 2016], we set the structures of the recognition network for visual images to ‘100-200’, ‘784-256-128-10’ and ‘784-256-128-5’ for the three data sets, respectively. In particular, we considered two types of structures for DCCAE. One has an asymmetric shape (the same setup as our model for the image view and a single-layer setup for the fMRI view, DCCAE-A), which can mimic our model in structure and function. The other has a symmetric shape (the same setup for both views, DCCAE-S), which can explore deep nonlinear maps for fMRI data.
5.2 Performance evaluation
The geometric shapes and alphabet letters, handwritten digits, and handwritten characters reconstructed by the proposed DGMM and the other algorithms are shown in Fig.2, Fig.3 and Fig.4, respectively, where the first row shows the presented images and the rows below show the reconstructed images obtained from all compared algorithms.
Overall, the images reconstructed by DGMM captured the essential features of the presented images. In particular, they showed fine reconstructions for handwritten digits and characters. Although the reconstructed geometric shapes and alphabet letters had some noise in the peripheral regions, the main shapes can be clearly distinguished. Although the reconstructions of handwritten digits and characters share certain characteristics with their corresponding original images, there are subtle differences in the strokes. We attribute this phenomenon to the fact that the manifold regularization imposed on the latent representations may change the details of the reconstructed images. In contrast, the images reconstructed by Miyawaki’s method and BCCA were coarse for all image types, with noise scattered over the entire reconstructed image. Also, both DCCAE-S and DCCAE-A produced disappointing reconstructions that often lacked the shapes of the presented images, especially for geometric shapes and alphabet letters. This might be because nonlinear maps easily overfit the voxel activities.
To evaluate the reconstruction performance quantitatively, we used several standard image similarity metrics, including Pearson’s correlation coefficient (PCC), mean squared error (MSE) and the structural similarity index (SSIM) [Wang et al., 2004]. Note that MSE is not highly indicative of perceived similarity, while SSIM can address this shortcoming by taking texture into account. In addition, we also performed an image classification analysis to quantify the reconstruction accuracy from another perspective. Specifically, a linear support vector machine (SVM) and a convolutional neural network (CNN), both trained on the presented visual images, were used as classifiers to label the reconstructed images. The classification accuracies of the SVM (ACC-SVM) and the CNN (ACC-CNN) on the reconstructed images are reported. Performance comparisons are listed in Table 2. Note that we also list the time consumed in the training phase for all compared algorithms in the last column for reference. Several observations can be drawn as follows.
Table 2: Performance comparisons.
Datasets   Algorithms       PCC          MSE          SSIM         ACC-SVM      ACC-CNN      Time (s)
Dataset 1  Miyawaki et al.  .609 ± .151  .162 ± .025  .237 ± .105  -            -            19.4 ± 1.1
           BCCA             .438 ± .215  .253 ± .051  .181 ± .066  -            -            74.9 ± 3.0
           DCCAE-A          .455 ± .113  .234 ± .029  .166 ± .025  -            -            211.8 ± 7.5
           DCCAE-S          .401 ± .100  .240 ± .027  .175 ± .011  -            -            254.9 ± 9.8
           DeCNN            .469 ± .149  .263 ± .067  .224 ± .129  -            -            108.2 ± 2.2
           DGMM             .611 ± .183  .159 ± .112  .268 ± .106  -            -            118.4 ± 2.5
Dataset 2  Miyawaki et al.  .767 ± .033  .042 ± .007  .466 ± .030  1.00         1.00         39.9 ± 1.2
           BCCA             .411 ± .157  .119 ± .017  .192 ± .035  1.00         1.00         20.7 ± 1.0
           DCCAE-A          .548 ± .044  .074 ± .010  .358 ± .097  .900         .967 ± .047  12.7 ± 0.3
           DCCAE-S          .511 ± .057  .080 ± .016  .552 ± .088  1.00         1.00         19.4 ± 0.8
           DeCNN            .799 ± .062  .038 ± .010  .613 ± .043  1.00         1.00         35.8 ± 1.2
           DGMM             .803 ± .063  .037 ± .014  .645 ± .054  1.00         1.00         18.6 ± 1.2
Dataset 3  Miyawaki et al.  .481 ± .096  .067 ± .026  .191 ± .043  .655 ± .193  .655 ± .113  128.1 ± 4.6
           BCCA             .348 ± .138  .128 ± .049  .058 ± .042  .633 ± .034  .600 ± .098  32.9 ± 1.0
           DCCAE-A          .354 ± .167  .073 ± .036  .186 ± .234  .478 ± .126  .533 ± .072  38.1 ± 1.1
           DCCAE-S          .351 ± .153  .086 ± .031  .179 ± .117  .478 ± .051  .478 ± .155  59.5 ± 1.8
           DeCNN            .470 ± .149  .084 ± .035  .322 ± .118  .589 ± .135  .611 ± .128  96.8 ± 2.0
           DGMM             .498 ± .193  .058 ± .031  .340 ± .051  .767 ± .115  .778 ± .083  42.4 ± 4.2
First, by comparing DGMM against the other algorithms, we can find that DGMM performs considerably better on all three data sets. In particular, the SSIM values of DGMM significantly surpass the baseline algorithms in all cases.
Second, comparing DGMM with BCCA, which has a linear observation model for visual images, we find that DGMM always outperforms BCCA. This encouraging result shows that DGMM, with a DNN model for visual images, is able to extract nonlinear features from visual images.
Third, DGMM shows clearly better performance than DCCAE-A and DCCAE-S. Besides their neglect of cross-reconstruction, this is also because a linear map between voxel activities and the bottleneck representation is enough to achieve good performance, while nonlinear maps easily overfit given the high dimensionality of the limited fMRI data instances.
Fourth, the performance of DeCNN is moderate on all data sets. We attribute this to the fact that it is a two-stage method, which cannot optimize the model parameters of both stages jointly.
Finally, nearly perfect classification is possible for each algorithm on Dataset 2. We believe this is because the digits 6 and 9 are easy to distinguish from each other. On Dataset 3, the remarkably higher classification performance on the images reconstructed by our model again demonstrates the superiority of the proposed DGMM.
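The image similarity metrics reported in Table 2 can be sketched as below. PCC and MSE follow their standard definitions. The SSIM shown here is a simplified single-window (global) variant with the usual constants 0.01 and 0.03, whereas SSIM is normally computed over local windows and averaged, so this function is an assumption-laden stand-in rather than the exact metric used in the experiments.

```python
import numpy as np

def pcc(a, b):
    """Pearson's correlation coefficient between two images."""
    a = a.ravel() - a.mean()
    b = b.ravel() - b.mean()
    return float(a @ b / np.sqrt((a @ a) * (b @ b)))

def mse(a, b):
    """Mean squared error between two images."""
    return float(np.mean((a - b) ** 2))

def global_ssim(a, b, data_range=1.0):
    """Single-window SSIM computed from global means, variances and covariance."""
    c1, c2 = (0.01 * data_range) ** 2, (0.03 * data_range) ** 2
    mu_a, mu_b = a.mean(), b.mean()
    cov = ((a - mu_a) * (b - mu_b)).mean()
    return float((2 * mu_a * mu_b + c1) * (2 * cov + c2)
                 / ((mu_a ** 2 + mu_b ** 2 + c1) * (a.var() + b.var() + c2)))

rng = np.random.default_rng(0)
presented = rng.random((28, 28))                 # stand-in presented image in [0, 1]
reconstructed = np.clip(presented + 0.1 * rng.standard_normal((28, 28)), 0.0, 1.0)
scores = (pcc(presented, reconstructed),
          mse(presented, reconstructed),
          global_ssim(presented, reconstructed))
```

Higher PCC and SSIM and lower MSE indicate a reconstruction closer to the presented image.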
6 Conclusion and future work
We have proposed a deep generative multiview framework to tackle the perceived image reconstruction problem. In our framework, multiple correspondences between visual image pixels and fMRI voxels can be found via a set of latent variables. We also derived a predictive distribution that succeeded in reconstructing visual images from brain activity patterns. Although we focused on visual image reconstruction problem in this paper, our framework can also deal with brain encoding tasks. Extensive experimental studies have confirmed the superiority of the proposed framework.
Two challenging and promising directions can be considered in the future. First, by incorporating recurrent neural networks (RNNs) [Chung et al., 2015] into our framework, we can explore the reconstruction of dynamic vision. Second, by treating each subject’s fMRI measurements as one view, we can explore multi-subject decoding.
Acknowledgment
This work was supported by the National Natural Science Foundation of China (No. 91520202, 61602449) and the Youth Innovation Promotion Association CAS.
References
 [\citeauthoryearArchambeau and Bach2009] Cédric Archambeau and Francis R Bach. Sparse probabilistic projections. In NIPS, pages 73–80, 2009.
 [\citeauthoryearChandar et al.2016] Sarath Chandar, Mitesh M Khapra, Hugo Larochelle, and Balaraman Ravindran. Correlational neural networks. Neural computation, 2016.
 [\citeauthoryearChung et al.2015] Junyoung Chung, Kyle Kastner, Laurent Dinh, Kratarth Goel, Aaron C Courville, and Yoshua Bengio. A recurrent latent variable model for sequential data. In NIPS, pages 2980–2988, 2015.
 [\citeauthoryearCichy et al.2016] Radoslaw Martin Cichy, Aditya Khosla, Dimitrios Pantazis, Antonio Torralba, and Aude Oliva. Comparison of deep neural networks to spatiotemporal cortical dynamics of human visual object recognition reveals hierarchical correspondence. Scientific reports, 6, 2016.
 [\citeauthoryearDamarla and Just2013] Saudamini Roy Damarla and Marcel Adam Just. Decoding the representation of numerical values from brain activation patterns. Human brain mapping, 34(10):2624–2634, 2013.
 [\citeauthoryearDuchi et al.2011] John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(Jul):2121–2159, 2011.
 [\citeauthoryearElahe’Yargholi2016] Elahe’ Yargholi and Gholam-Ali Hossein-Zadeh. Brain decoding-classification of hand written digits from fMRI data employing Bayesian networks. Frontiers in Human Neuroscience, 10, 2016.
 [\citeauthoryearFujiwara et al.2013] Yusuke Fujiwara, Yoichi Miyawaki, and Yukiyasu Kamitani. Modular encoding and decoding models derived from Bayesian canonical correlation analysis. Neural computation, 25(4):979–1005, 2013.
 [\citeauthoryearGüçlü and van Gerven2015] Umut Güçlü and Marcel A. J. van Gerven. Deep neural networks reveal a gradient in the complexity of neural representations across the ventral stream. Journal of Neuroscience, 35(27):10005–10014, 2015.
 [\citeauthoryearHaiguang Wen and Liu2016] Haiguang Wen, Junxing Shi, Yizhen Zhang, Kun-Han Lu, and Zhongming Liu. Neural encoding and decoding with deep learning for dynamic natural vision. arXiv:1608.03425v1, 2016.
 [\citeauthoryearHosseinZadeh and others2016] Gholam-Ali Hossein-Zadeh et al. Reconstruction of digit images from human brain fMRI activity through connectivity informed Bayesian networks. Journal of neuroscience methods, 257:159–167, 2016.
 [\citeauthoryearKingma and Welling2014] Diederik P Kingma and Max Welling. Auto-encoding variational Bayes. In ICLR, 2014.
 [\citeauthoryearKingma et al.2014] Diederik P Kingma, Shakir Mohamed, Danilo Jimenez Rezende, and Max Welling. Semi-supervised learning with deep generative models. In NIPS, pages 3581–3589, 2014.
 [\citeauthoryearLee and Kuhl2016] Hongmi Lee and Brice A Kuhl. Reconstructing perceived and retrieved faces from activity patterns in lateral parietal cortex. The Journal of Neuroscience, 36(22):6069–6082, 2016.
 [\citeauthoryearMiyawaki et al.2008] Yoichi Miyawaki, Hajime Uchida, Okito Yamashita, Masaaki Sato, Yusuke Morito, Hiroki C Tanabe, Norihiro Sadato, and Yukiyasu Kamitani. Visual image reconstruction from human brain activity using a combination of multiscale local image decoders. Neuron, 60(5):915–929, 2008.
 [\citeauthoryearNg and Abugharbieh2011] Bernard Ng and Rafeef Abugharbieh. Generalized group sparse classifiers with application in fMRI brain decoding. In CVPR, pages 1065–1071, 2011.
 [\citeauthoryearNishimoto et al.2011] Shinji Nishimoto, An T Vu, Thomas Naselaris, Yuval Benjamini, Bin Yu, and Jack L Gallant. Reconstructing visual experiences from brain activity evoked by natural movies. Current Biology, 21(19):1641–1646, 2011.
 [\citeauthoryearSchoenmakers et al.2013] Sanne Schoenmakers, Markus Barth, Tom Heskes, and Marcel van Gerven. Linear reconstruction of perceived images from human brain activity. NeuroImage, 83:951–961, 2013.
 [\citeauthoryearSchoenmakers et al.2015] Sanne Schoenmakers, Umut Güçlü, Marcel Van Gerven, and Tom Heskes. Gaussian mixture models and semantic gating improve reconstructions from human brain activity. Frontiers in computational neuroscience, 8, 2015.
 [\citeauthoryearVan der Maaten2009] Laurens Van der Maaten. A new benchmark dataset for handwritten character recognition. Tilburg University, pages 2–5, 2009.
 [\citeauthoryearVan Gerven et al.2010a] Marcel AJ Van Gerven, Botond Cseke, Floris P De Lange, and Tom Heskes. Efficient Bayesian multivariate fMRI analysis using a sparsifying spatiotemporal prior. NeuroImage, 50(1):150–161, 2010.
 [\citeauthoryearVan Gerven et al.2010b] Marcel AJ Van Gerven, Floris P De Lange, and Tom Heskes. Neural decoding with hierarchical generative models. Neural computation, 22(12):3127–3142, 2010.
 [\citeauthoryearWang et al.2004] Zhou Wang, Alan C Bovik, Hamid R Sheikh, and Eero P Simoncelli. Image quality assessment: from error visibility to structural similarity. IEEE transactions on image processing, 13(4):600–612, 2004.
 [\citeauthoryearWang et al.2015] Weiran Wang, Raman Arora, Karen Livescu, and Jeff Bilmes. On deep multiview representation learning. In ICML, pages 1083–1092, 2015.
 [\citeauthoryearYamashita et al.2008] Okito Yamashita, Masaaki Sato, Taku Yoshioka, Frank Tong, and Yukiyasu Kamitani. Sparse estimation automatically selects voxels relevant for the decoding of fMRI activity patterns. NeuroImage, 42(4):1414–1429, 2008.
 [\citeauthoryearZeiler et al.2011] Matthew D Zeiler, Graham W Taylor, and Rob Fergus. Adaptive deconvolutional networks for mid and high level feature learning. In ICCV, pages 2018–2025, 2011.
 [\citeauthoryearZhu et al.2014] Jun Zhu, Ning Chen, and Eric P Xing. Bayesian inference with posterior regularization and applications to infinite latent SVMs. Journal of Machine Learning Research, 15(1):1799–1847, 2014.