Rate-Distortion Optimization Guided Autoencoder for Generative Approach with quantitatively measurable latent space

Keizo Kato, Akira Nakagawa
FUJITSU LABORATORIES LTD.
Kawasaki, Kanagawa, JAPAN
{kato.keizo,anaka}@fujitsu.com
Jing Zhou
Fujitsu R&D Center Co. Ltd.
Shanghai, CHINA
{zhoujing}@cn.fujitsu.com
Abstract

In the generative model approach of machine learning, it is essential to acquire an accurate probabilistic model of the data and to compress its dimension for tractable treatment. However, in conventional deep-autoencoder based generative models such as VAE, the probability of the real space cannot be obtained correctly from that of the latent space, because the scaling between the two spaces is not controlled. This has also been an obstacle to quantifying the impact of variations of latent variables on the data. In this paper, we propose a Rate-Distortion Optimization guided autoencoder, in which the Jacobi matrix from the real space to the latent space forms a constantly scaled orthonormal basis. It is proved theoretically and experimentally that (i) the probability distribution of the latent space obtained by this model is proportional to that of the real space, because the Jacobian between the two spaces is constant; (ii) the model behaves as a non-linear PCA, in which the energy of the acquired latent space is concentrated on several principal components and the influence of each component can be evaluated quantitatively. Furthermore, to verify its usefulness in practical applications, we evaluate its performance in unsupervised anomaly detection, where it outperforms current state-of-the-art methods.



1 Introduction

Capturing the inherent features of a dataset from high-dimensional and complex data is an essential issue in machine learning. The generative model approach learns the probability distribution of data, aiming at data generation by probabilistic sampling, unsupervised/weakly supervised learning, and acquiring meta-priors (general assumptions about how data can be summarized naturally, such as disentanglement, clustering, and hierarchical structure (Bengio et al., 2013; Tschannen et al., 2019)). It is generally difficult to directly estimate the probability density function (PDF) p_x(x) of real data x. Accordingly, one promising approach is to map x to a latent space with reduced dimension and capture the PDF p_z(z) of the latent variable z. In recent years, deep autoencoder based methods have made it possible to compress dimensions and derive latent variables. While there is remarkable progress in these areas (van den Oord et al., 2017; Kingma et al., 2014; Jiang et al., 2016), the relation between p_x(x) and p_z(z) in current deep generative models is still not clear.

VAE (Kingma and Welling, 2014) is one of the most successful generative models for capturing latent representations. In VAE, a lower bound of the log-likelihood of the data, the ELBO, is introduced, and the latent variable is obtained by maximizing it. Various methods (Alemi et al., 2018; Zhao et al., 2019; Brekelmans et al., 2019) have been proposed to maximize the ELBO. However, many previous works paid no attention to the value of the Jacobian between the two spaces, despite the fact that the ratio between p_x(x) and p_z(z) is equal to the Jacobian. Even in models that provide more flexible estimation (Johnson et al., 2016; Liao et al., 2018; Zong et al., 2018), this point is overlooked.

Here, when we turn our attention to acquiring meta-priors, it is natural to want to evaluate the quantitative influence of each latent variable on the distance between data points. To do so, the scale of the latent variables should be controlled so that a change in a latent variable is proportional to the resulting change of distance in the data space. In addition, this scaling should be adjusted according to the definition of the distance metric. For instance, with respect to image quality metrics, different meta-priors should be derived from MSE and SSIM. Considering the mechanism of PCA is one direction to solve this. In PCA, the components are derived by an optimal orthonormal transformation, and the importance of each component can then be identified quantitatively by its variance. It should be noted that orthogonality alone is not enough for quantitative analysis; orthonormality is required. This fact implies that, if the Jacobi matrix of an autoencoder has orthonormality, the characteristics of the acquired latent space can be evaluated quantitatively.

To deal with this, we propose RaDOGAGA (Rate-Distortion Optimization Guided Autoencoder for Generative Approach), which simultaneously learns a parametric probability distribution and an autoencoder based on rate-distortion optimization (RDO). In this paper, we show the effects of RaDOGAGA in the following steps.

(1) We prove that RaDOGAGA has the following properties theoretically and experimentally.

  • The Jacobi matrix between the real space and the latent space becomes constant-scaled orthonormal, so the response of the real-space data to a minute change of z is constant at any z.

  • Because the Jacobian (or pseudo-Jacobian) is constant, p_x(x) and p_z(z) are almost proportional. Therefore, p_x(x) can be estimated by directly maximizing the log-likelihood of a parametric PDF in the reduced-dimensional space, without considering the ELBO.

  • When a univariate independent distribution is used to estimate p_z(z) parametrically, the model behaves as a "continuous PCA", where the energy is concentrated on several principal components.

(2) Thanks to these properties, RaDOGAGA achieves state-of-the-art performance in an anomaly detection task on four public datasets, where probability density estimation is important.

(3) We show that our approach can directly evaluate the impact of each latent variable on the distance metric in the real space. This feature is promising for further interpretation of latent variables.

2 Related Work

Flow based model: Flow based generative models generate images of astonishing quality (Kingma and Dhariwal, 2018; Dinh et al., 2014). The flow mechanism explicitly takes into account the Jacobian between x and z. The transformation function f is learned while its Jacobian is calculated and preserved. Unlike an ordinary autoencoder, which maps z back to the data space with a function g different from f, the inverse function transforms z as x = f⁻¹(z). Since the model preserves the Jacobian, p_x(x) can be estimated by maximizing the log-likelihood of p_z(z) without considering the ELBO. In this approach, however, f needs to be a bijection; because of this limitation, it is difficult to fully utilize the flexibility of neural networks.

Interpretation of latent variables: While deep autoencoders are expected to acquire meta-priors, interpreting latent variables is still challenging. In recent years, research aiming to acquire disentangled latent variables, which encourages them to be independent, has flourished (Lopez et al., 2018; Chen et al., 2018; Kim and Mnih, 2018; Chen et al., 2016). With these methods, qualitative effects of disentanglement can be seen; for example, when a certain latent variable is displaced, the image changes in a specific attribute (size, color, etc.). Some works also propose quantitative metrics for meta-priors. In beta-VAE (Higgins et al., 2017), the metric evaluates the independence of latent variables by solving a classification task. Actually, the Jacobi matrix of a VAE is orthogonal (Rolínek et al., 2019), which implicitly makes the latent variables disentangled. However, orthonormality is not supported, and it is difficult to define a metric that directly evaluates the effect of a latent variable on the distance between data points.

Image compression with rate-distortion optimization: Rate-distortion theory is the part of Shannon's information theory for lossy compression that formulates the optimal trade-off between rate and distortion. According to the theory, the optimal way to encode data with a Gaussian distribution is as follows. First, the data are transformed to coefficients using the orthonormal PCA basis. These coefficients are then quantized such that the quantization noise of each channel is equal. Finally, optimal entropy coding is applied to the quantized coefficients, where the rate is given by the logarithm of the ratio between the variance of the coefficients and the quantization noise. Recently, a rate-distortion optimized VAE was proposed in order to maximize the ELBO (Brekelmans et al., 2019). In that work, instead of applying equal noise, the noise distribution is precisely controlled for each input so as to match a predefined prior distribution in the sense of rate-distortion optimization; as a result, the Jacobian is not constant. The works most related to our approach are rate-distortion optimization (RDO) based image compression with deep learning (Ballé et al., 2018; Zhou et al., 2019; Wen et al., 2019). In these works, the information entropy of a feature map extracted by a CNN is minimized. To calculate the entropy, Ballé et al. (2018) propose a method to estimate the probability distribution of the latent variables parametrically, assuming a univariate independent (factorized) distribution for each latent variable. We have discovered that, behind the success of these compression methods, RDO has the effect of making the Jacobi matrix orthonormal up to a constant factor. Inspired by this, we propose an autoencoder that scales latent variables according to the definition of the distance between data, without restricting the transformation function. Thanks to this feature, our scheme can estimate p_x(x) quantitatively, which is suitable for clustering and anomaly detection. Furthermore, when a factorized distribution is used for p_z(z), our model behaves as a continuous PCA. This property is expected to promote the interpretation of latent variables.



3 Method and Theory

The fundamental mechanism of the rate-distortion optimization guided autoencoder is to minimize a cost function which consists of (i) the reconstruction error between the input data and the decoder output when noise is added to the latent variable and (ii) the entropy of the latent variable. By doing so, the model automatically finds an appropriate scale of the latent space. The specific method is described below.

First, let x be M-dimensional domain data (x ∈ R^M) and p_x(x) the probability of x. Then x is converted to the N-dimensional latent space by the encoder. Let f_θ(·), g_φ(·), and P_ψ(·) be the parametric encoder, decoder, and probability distribution function of the latent space, with parameters θ, φ, and ψ. We assume that each function's parameters are rich enough for an ideal fit. Then the latent variable z and the decoded data x̂ are generated as below:

z = f_θ(x),  x̂ = g_φ(z)    (1)

Let ε = (ε_1, …, ε_N) be noise, with each dimension being independent (uncorrelated among different dimensions), with a mean of 0 and a variance of σ², as follows:

E[ε_i] = 0,  Var[ε_i] = σ²,  E[ε_i ε_j] = 0 for i ≠ j    (2)

Then, given the sum of the latent variable z and the noise ε, the decoder output x̌ is obtained as follows:

x̌ = g_φ(z + ε)    (3)

Here, the cost function L is defined by Eq. (4):

L = −log P_ψ(z) + λ₁ · h(D(x, x̂)) + λ₂ · D(x̂, x̌)    (4)

In this equation, the first term is the estimated entropy of the latent distribution. D(·,·) in the second and third terms is a distance function between two data points. It is assumed that this distance function can be approximated by the following quadratic form in the neighborhood of x, where Δx is an arbitrary minute variation of x, A(x) is a positive definite matrix depending on x, and L_A(x) is the Cholesky decomposition of A(x):

D(x, x + Δx) ≈ Δxᵀ A(x) Δx,  A(x) = L_A(x)ᵀ L_A(x)    (5)

For instance, when D is the square error as in Eq. (6), A(x) and L_A(x) are identity matrices.

D(x, x̂) = Σ_i (x_i − x̂_i)²    (6)

In the case of the SSIM (Wang et al. (2004)) metric, which is close to subjective image quality, the cost can also be approximated in a quadratic form with a positive definite matrix. This is explained in Appendix C. Let h(·) in the second term of Eq. (4) be a scale function such as the identity h(d) = d or the logarithm h(d) = log(d). The effect of h(·) is discussed in Appendix D. Then, Eq. (4) is averaged over the distributions of x and the noise ε. By deriving the parameters that minimize this value, the encoder, decoder, and probability distribution of the latent space are trained as in Eq. (7).

θ, φ, ψ = argmin_{θ,φ,ψ} E_{x∼p_x(x), ε}[ L ]    (7)
Figure 1: Architecture of RaDOGAGA
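To make the training objective concrete, the following is a minimal PyTorch-style sketch of the cost in Eq. (4), assuming the form L = −log P_ψ(z) + λ₁·h(D(x, x̂)) + λ₂·D(x̂, x̌) with the squared error as D. The encoder, decoder, and log_prior modules are placeholders for the networks described above, and the pairing of arguments in the third term is our reading of Section 3, not a quote of the original equation.

```python
import torch
import torch.nn as nn


class RaDOGAGACost(nn.Module):
    """Sketch of the RDO-guided cost of Eq. (4), assuming
    L = -log P_psi(z) + lambda1 * h(D(x, x_hat)) + lambda2 * D(x_hat, x_check),
    with x_hat = g(z) and x_check = g(z + eps)."""

    def __init__(self, lambda1=1.0, lambda2=1.0, sigma=0.1, h="log"):
        super().__init__()
        self.lambda1, self.lambda2, self.sigma, self.h = lambda1, lambda2, sigma, h

    def forward(self, x, encoder, decoder, log_prior):
        z = encoder(x)                                     # latent variable z = f_theta(x)
        # zero-mean uniform noise with variance sigma^2 per dimension (Eq. (2))
        eps = (torch.rand_like(z) - 0.5) * (12.0 ** 0.5) * self.sigma
        x_hat = decoder(z)                                 # clean reconstruction
        x_check = decoder(z + eps)                         # reconstruction from noisy latent
        rate = -log_prior(z).mean()                        # estimated entropy, first term
        d1 = ((x - x_hat) ** 2).flatten(1).sum(dim=1).mean()        # D(x, x_hat), squared error
        d2 = ((x_hat - x_check) ** 2).flatten(1).sum(dim=1).mean()  # D(x_hat, x_check)
        d1 = torch.log(d1 + 1e-9) if self.h == "log" else d1        # scale function h
        return rate + self.lambda1 * d1 + self.lambda2 * d2
```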

Here, we introduce u as a rescaled space of x according to the distance function D:

du = L_A(x) dx    (8)

When D is the square error, the two spaces x and u are equivalent. It turns out that each column of the Jacobian matrix between the latent space z and the rescaled space u is a constant multiple of an orthonormal basis vector, regardless of the value of z, after training based on Eqs. (4) and (7). Here, δ_ij denotes the Kronecker delta.

(∂u/∂z_i)ᵀ (∂u/∂z_j) = C² δ_ij    (9)

A more detailed proof is given in Appendix A. Then, by evaluating the distance function for z and z + Δz, the following relationship is derived from Eqs. (5), (8), and (9).

D(g_φ(z), g_φ(z + Δz)) ≈ ||Δu||² = C² ||Δz||²    (10)

This means that the latent space is scaled so that the amount of change in the real space in response to a minute change Δz is constant, independent of the value of z.

D(g_φ(z), g_φ(z + Δz)) / ||Δz||² ≈ C²    (11)

Next, the Jacobian between u and z is examined. First, when M = N, the Jacobian matrix is a square matrix, and each column is C times an orthonormal basis vector. For this reason, the Jacobian determinant is a constant regardless of the value of z, as shown below:

|det(∂u/∂z)| = C^N    (12)

In this case, the probability distributions of u and z are proportional because of the constant Jacobian. For the case of M > N, we assume the situation where most of the energy is efficiently and effectively mapped to the N-dimensional latent space. Then the product of the singular values of the SVD of the Jacobi matrix can be regarded as a pseudo-Jacobian between the real space and the latent space. Since all of the N singular values are equal to C, the pseudo-Jacobian is also a constant. As a result, the following equation holds, where J denotes the Jacobian or pseudo-Jacobian.

p_z(z) = J · p_x(x)    (13)

Let P_ψ(z) be the estimated probability of z. Because the Jacobian J is constant, p_x(x) can be approximated by p̂_x(x) = P_ψ(f_θ(x))/J. Accordingly, the average of the first term in Eq. (4) over p_x(x) can be transformed as follows.

E_{x∼p_x(x)}[ −log P_ψ(z) ] = E_{x∼p_x(x)}[ −log( J · p̂_x(x) ) ] = E_{x∼p_x(x)}[ −log p̂_x(x) ] − log J    (14)

Here, minimization of Eq. (14) is equivalent to maximization of the log-likelihood of p̂_x(x). As a result, the PDF of x can be estimated without maximizing the ELBO.
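As a numerical sanity check of Eqs. (9), (12), and (13), one can compute the decoder Jacobian with automatic differentiation and verify that its Gram matrix is close to a scaled identity and that its singular values (whose product is the pseudo-Jacobian) are nearly equal. The sketch below assumes a trained decoder mapping an N-dimensional z to an M-dimensional x; it is a diagnostic, not part of the training procedure.

```python
import torch


def check_jacobian(decoder, z):
    """Numerical check of Eqs. (9), (12), and (13): the Gram matrix of the decoder
    Jacobian G = dx/dz should be close to C^2 * I, and the product of its singular
    values (the pseudo-Jacobian J) should be nearly the same for any z."""
    G = torch.autograd.functional.jacobian(decoder, z)    # shape (M, N) for 1-D z
    gram = G.T @ G                                        # ideally C^2 times identity
    c_sq = gram.diagonal().mean()                         # estimate of C^2
    max_off_diag = (gram - torch.diag(gram.diagonal())).abs().max()
    pseudo_jacobian = torch.linalg.svdvals(G).prod()      # J, roughly C^N
    return c_sq, max_off_diag, pseudo_jacobian
```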

Regarding the parametric probability distribution, if a model such as a GMM is used, a multimodal probability distribution can be obtained flexibly. Moreover, when the following factorized probability model is used, the model shows a "continuous PCA" feature, where the energy is concentrated in several principal latent variables.

P_ψ(z) = ∏_{i=1}^{N} P_{ψ_i}(z_i)    (15)

The derivation of the "continuous PCA" feature is explained in Appendix B.
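The sketch below illustrates Eq. (15) with a simple learnable factorized prior and the "continuous PCA" readout of ranking latent dimensions by variance. The actual experiments use the more flexible nonparametric factorized model of Ballé et al. (2018), so the Gaussian factors here are only a stand-in.

```python
import math

import torch
import torch.nn as nn


class FactorizedGaussianPrior(nn.Module):
    """Stand-in for the factorized prior of Eq. (15), P_psi(z) = prod_i P_i(z_i),
    with each factor a learnable univariate Gaussian."""

    def __init__(self, dim):
        super().__init__()
        self.mu = nn.Parameter(torch.zeros(dim))
        self.log_sigma = nn.Parameter(torch.zeros(dim))

    def log_prob(self, z):
        var = torch.exp(2.0 * self.log_sigma)
        log_p = -0.5 * ((z - self.mu) ** 2 / var) - self.log_sigma - 0.5 * math.log(2.0 * math.pi)
        return log_p.sum(dim=-1)          # independence: sum of per-dimension log densities


def rank_latent_dimensions(z):
    """'Continuous PCA' readout: order latent dimensions by variance, which under
    Eq. (9) quantifies each component's contribution to the distance in data space."""
    var = z.var(dim=0)
    return torch.argsort(var, descending=True), var
```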

4 Experiment

4.1 Probability Density Estimation With Toy Data

In this section, we describe our experiment using toy data to demonstrate whether the probability density of the input data and the density estimated in the latent space are proportional to each other, as in the theory. First, we sample data m from a three-dimensional Gaussian mixture consisting of three components with mixture weights, means, and covariances indexed by the mixture component. Then, we scatter m with uniform random noise that is uncorrelated across dimensions, producing 16-dimensional input data x with 10,000 samples in the end.

(16)

The appearance probability p_x(x) of the input data is equivalent to the generation probability of m.
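Since the exact expansion of Eq. (16) is not reproduced here, the following numpy sketch generates data in the same spirit: samples from a three-component 3-D GMM are mapped to 16 dimensions and scattered with independent uniform noise. The mixture parameters and the linear expansion are illustrative assumptions, not the values used in the experiment.

```python
import numpy as np


def sample_toy_data(n=10000, data_dim=16, noise_scale=0.05, seed=0):
    """Illustrative re-creation of the toy data of Section 4.1: draw m from a
    three-component 3-D GMM, expand to 16 dimensions, and scatter each coordinate
    with independent uniform noise."""
    rng = np.random.default_rng(seed)
    weights = np.array([0.3, 0.3, 0.4])                      # assumed mixture weights
    means = rng.normal(0.0, 3.0, size=(3, 3))                # assumed component means
    comps = rng.choice(3, size=n, p=weights)
    m = means[comps] + rng.normal(0.0, 1.0, size=(n, 3))     # 3-D GMM samples
    expand = rng.normal(0.0, 1.0, size=(3, data_dim))        # expansion to 16-D
    u = rng.uniform(-noise_scale, noise_scale, size=(n, data_dim))  # uncorrelated noise
    return m @ expand + u, comps
```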

4.1.1 Configuration

In the experiment, we estimate P_ψ(z) using a GMM with parameter ψ, as in DAGMM (Zong et al., 2018). Instead of the EM algorithm, the GMM parameters are learned using an Estimation Network (EN), which consists of a multi-layer neural network. When the GMM has K mixture components and B is the number of batch samples, the EN outputs the mixture-component membership prediction γ as a K-dimensional vector for each sample as follows, where MLN(·) denotes the EN's multi-layer network:

γ = softmax(MLN(z))    (17)

The k-th mixture weight π_k, mean μ_k, covariance Σ_k, and the entropy term E(z) of z are further calculated by Eqs. (18) and (19).

π_k = (1/B) Σ_{b=1}^{B} γ_{bk},  μ_k = Σ_{b} γ_{bk} z_b / Σ_{b} γ_{bk},  Σ_k = Σ_{b} γ_{bk} (z_b − μ_k)(z_b − μ_k)ᵀ / Σ_{b} γ_{bk}    (18)
E(z) = −log( Σ_{k=1}^{K} π_k · N(z; μ_k, Σ_k) )    (19)
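A compact sketch of Eqs. (18) and (19) in the DAGMM style follows: given latent vectors and the EN's soft memberships γ, it computes the mixture weights, means, and covariances, and the per-sample energy −log P_ψ(z). The tensor shapes and the stabilizing diagonal term are assumptions of this sketch.

```python
import torch


def gmm_from_memberships(z, gamma, eps=1e-6):
    """DAGMM-style estimates of Eqs. (18)-(19): from latent vectors z (B, N) and the
    EN's soft memberships gamma (B, K), compute mixture weights pi_k, means mu_k,
    covariances Sigma_k, and the per-sample energy E(z) = -log P_psi(z)."""
    B, N = z.shape
    K = gamma.shape[1]
    nk = gamma.sum(dim=0)                                  # effective count per component
    pi = nk / B                                            # mixture weights
    mu = (gamma.T @ z) / nk.unsqueeze(1)                   # (K, N) component means
    diff = z.unsqueeze(1) - mu.unsqueeze(0)                # (B, K, N)
    cov = torch.einsum("bk,bki,bkj->kij", gamma, diff, diff) / nk.view(K, 1, 1)
    cov = cov + eps * torch.eye(N)                         # small diagonal for stability
    comp = torch.distributions.MultivariateNormal(mu, covariance_matrix=cov)
    log_comp = comp.log_prob(z.unsqueeze(1))               # (B, K) component log densities
    energy = -torch.logsumexp(torch.log(pi + eps) + log_comp, dim=1)
    return pi, mu, cov, energy
```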

The overall network is trained by Eqs. (4) and (7). In this experiment, we set D as the square error and test two types of h, h(d) = d and h(d) = log(d). We denote the models trained with these as RaDOGAGA(d) and RaDOGAGA(log(d)). As a comparison method, DAGMM is used. DAGMM also consists of an encoder, a decoder, and an EN. In DAGMM, to avoid the trivial solution in which the entropy is minimized when the diagonal components of the covariance matrices become 0, the inverse of the diagonal components is added to the cost function as in Eq. (20):

L_DAGMM = (1/B) Σ_{b=1}^{B} ||x_b − x̂_b||² + (λ₁/B) Σ_{b=1}^{B} E(z_b) + λ₂ Σ_{k=1}^{K} Σ_{i=1}^{N} 1/Σ_{k,ii}    (20)

For both RaDOGAGA and DAGMM, the autoencoder part is constructed with fully connected (FC) layers with sizes of 64, 32, 16, 3, 16, 32, and 64. Every FC layer except the last layers of the encoder and the decoder is followed by an activation function. The EN part is also constructed with FC layers with sizes of 10 and 3; the first layer is followed by the activation function and dropout (ratio = 0.5), and softmax is attached to the last one. The hyperparameters (λ₁, λ₂) are set individually for DAGMM, RaDOGAGA(d), and RaDOGAGA(log(d)). Optimization is done with the Adam optimizer (Kingma and Ba, 2014) using the same learning rate for all models, and the noise variance σ² is fixed in advance.

Figure 2: Plot of the latent variables z ((a) DAGMM, (b) RaDOGAGA(d), (c) RaDOGAGA(log(d)))
Figure 3: Plot of p_x(x) versus P_ψ(z) ((a) DAGMM, (b) RaDOGAGA(d), (c) RaDOGAGA(log(d)))

4.1.2 Result

Figure 2 displays the distribution of the latent variables z, and Figure 3 displays a plot of p_x(x) against P_ψ(z). Even though both methods capture that the data are generated from three mixture components, there is a difference in the PDFs. In our method, p_x(x) and P_ψ(z) are approximately proportional to each other as in the theory, while we cannot observe such proportionality in DAGMM. To compare RaDOGAGA(d) and RaDOGAGA(log(d)), we normalized P_ψ(z) to the range from 0 to 1, then calculated the residual of a linear regression with p_x(x) as the objective variable and P_ψ(z) as the explanatory variable. The residual of RaDOGAGA(log(d)), i.e., 0.0027, is smaller than that of RaDOGAGA(d), i.e., 0.0100. Indeed, when P_ψ(z) is sufficiently fitted, one choice of h makes p_x(x) and P_ψ(z) proportional more rigidly, while the other slightly bends the scale of the latent space in order to minimize the entropy term, allowing a more relaxed fitting of P_ψ(z). More detail is described in Appendix D. In the next section, we see how this trait performs on a real task.
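The proportionality check can be reproduced along the following lines, assuming the residual is the mean squared error of a least-squares fit of p_x(x) on the normalized P_ψ(z); the exact residual definition behind the reported 0.0027 and 0.0100 is not specified here.

```python
import numpy as np


def proportionality_residual(p_x, p_z_est):
    """Fit p_x(x) = a * P(z) + b after normalizing P(z) to [0, 1] and report the
    mean squared residual; a small value indicates p_x(x) is proportional to P(z)."""
    p = (p_z_est - p_z_est.min()) / (p_z_est.max() - p_z_est.min())
    X = np.stack([p, np.ones_like(p)], axis=1)             # explanatory variable + intercept
    coef, *_ = np.linalg.lstsq(X, p_x, rcond=None)
    return float(np.mean((p_x - X @ coef) ** 2))
```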

4.2 Anomaly Detection Task Using Real Data

In this section, we examine whether the clear relationship between p_x(x) and P_ψ(z) is useful in an anomaly detection task using real data. We use four public datasets: KDDCUP99, Thyroid, Arrhythmia, and KDDCUP-Rev (the datasets can be downloaded from https://kdd.ics.uci.edu/ and http://odds.cs.stonybrook.edu). The (instance number, dimension, anomaly ratio (%)) of each dataset is (494021, 121, 20), (3772, 6, 2.5), (452, 274, 15), and (121597, 121, 20), respectively. Details of the datasets are described in Appendix E.

4.2.1 Experimental Setup

We follow the setting in Zong et al. (2018). A randomly extracted 50% of the data is assigned to training and the rest to testing. During training, only normal data are used. During the test, the anomaly score −log P_ψ(z) is calculated for each sample, and if the anomaly score is higher than a threshold, the sample is detected as an anomaly. The threshold is determined by the ratio of anomaly data in each dataset; for example, in KDDCup99, data whose scores are in the top 20% are detected as anomalies. As metrics, precision, recall, and F1 score are calculated. We run 20 times for each dataset with splits from 20 different random seeds.
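The thresholding and scoring procedure can be summarized by the sketch below, where `scores` are the anomaly scores −log P_ψ(z), `labels` mark ground-truth anomalies with 1, and `anomaly_ratio` is the dataset's anomaly fraction (e.g., 0.2 for KDDCup99); the function names are our own.

```python
import numpy as np


def detect_and_score(scores, labels, anomaly_ratio):
    """Flag the top `anomaly_ratio` fraction of anomaly scores as anomalies and
    compute precision, recall, and F1 against ground-truth labels (1 = anomaly)."""
    threshold = np.percentile(scores, 100.0 * (1.0 - anomaly_ratio))
    pred = (scores > threshold).astype(int)
    tp = int(np.sum((pred == 1) & (labels == 1)))
    precision = tp / max(int(pred.sum()), 1)
    recall = tp / max(int(labels.sum()), 1)
    f1 = 2 * precision * recall / max(precision + recall, 1e-12)
    return precision, recall, f1
```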

4.2.2 Baseline Model

As in the previous section, DAGMM is taken as the baseline method. We also compare with the scores reported in previous works in which the same experiments were conducted (Zenati et al., 2018; Song and Ou, 2018; Liao et al., 2018).

4.2.3 Configuration

As in Zong et al. (2018), in addition to the output of the encoder, reconstruction-error based features are concatenated to z, and the result is sent to the EN. Note that z is sent to the decoder before the concatenation. The configuration other than the hyperparameters is the same as in the previous experiment. The hyperparameters for each dataset are described in Appendix E.

4.2.4 Results

Table 1 reports the average scores and standard deviations (in brackets). Comparing DAGMM and RaDOGAGA, RaDOGAGA performs better regardless of the type of h; simply introducing the RDO mechanism into the autoencoder is effective for anomaly detection. Moreover, our approach achieves state-of-the-art performance compared to other previous works that use the same datasets. The clear relationship between p_x(x) and P_ψ(z) in our model is considered to be effective in anomaly detection, where estimating the probability distribution is important. When we compare RaDOGAGA(d) and RaDOGAGA(log(d)), neither is consistently superior. As described in Section 4.1 and Appendix D, h can be chosen depending on the desired fitting flexibility of P_ψ(z).

Dataset      Method             Precision         Recall            F1
KDDCup       ALAD [1]           0.9427 (0.0018)   0.9577 (0.0018)   0.9501 (0.0018)
             INRF [1]           0.9452 (0.0105)   0.9600 (0.0113)   0.9525 (0.0108)
             GMVAE [1]          0.952             0.9141            0.9326
             DAGMM [1]          0.9297            0.9442            0.9369
             DAGMM+ [2]         0.9427 (0.0052)   0.9575 (0.0053)   0.9500 (0.0052)
             RaDOGAGA(d)        0.9550 (0.0037)   0.9700 (0.0038)   0.9624 (0.0038)
             RaDOGAGA(log(d))   0.9563 (0.0042)   0.9714 (0.0042)   0.9638 (0.0042)
Thyroid      GMVAE [1]          0.7105            0.5745            0.6353
             DAGMM [1]          0.4766            0.4834            0.4782
             DAGMM+ [2]         0.4656 (0.0481)   0.4859 (0.0502)   0.4755 (0.0491)
             RaDOGAGA(d)        0.6313 (0.0476)   0.6587 (0.0496)   0.6447 (0.0486)
             RaDOGAGA(log(d))   0.6562 (0.0572)   0.6848 (0.0597)   0.6702 (0.0585)
Arrhythmia   ALAD [1]           0.5000 (0.0208)   0.5313 (0.0221)   0.5152 (0.0214)
             GMVAE [1]          0.4375            0.4242            0.4308
             DAGMM [1]          0.4909            0.5078            0.4983
             DAGMM+ [2]         0.4985 (0.0389)   0.5136 (0.0401)   0.5060 (0.0395)
             RaDOGAGA(d)        0.5353 (0.0461)   0.5515 (0.0475)   0.5433 (0.0468)
             RaDOGAGA(log(d))   0.5294 (0.0405)   0.5455 (0.0418)   0.5373 (0.0411)
KDDCup-rev   DAGMM [1]          0.937             0.939             0.938
             DAGMM+ [2]         0.9778 (0.0018)   0.9779 (0.0017)   0.9779 (0.0018)
             RaDOGAGA(d)        0.9768 (0.0033)   0.9827 (0.0012)   0.9797 (0.0015)
             RaDOGAGA(log(d))   0.9864 (0.0009)   0.9865 (0.0009)   0.9865 (0.0009)

[1] Scores of ALAD, INRF, GMVAE, and DAGMM are cited from Zenati et al. (2018), Song and Ou (2018), Liao et al. (2018), and Zong et al. (2018), respectively.
[2] DAGMM+ is our implementation. Note that we also tested the same configuration as in Zong et al. (2018) and achieved similar scores to those reported (shown in Appendix E).

Table 1: Average and standard deviations (in brackets) of precision, recall, and F1

4.3 Quantifying the Impact of Latent Variables on Distance and Behavior as PCA

In this section, we verify that the impact of each latent variable on the distance function can be quantified. As described in Section 3, when Δz is a minute displacement, the ratio between the distance D(g_φ(z), g_φ(z + Δz)) and ||Δz||² is constant regardless of z. We verify whether this characteristic is observable in a model trained with real data. Let Δz_i be a vector in which only the i-th component has a minute value δ. Then D(g_φ(z), g_φ(z + Δz_i))/δ² gives the ratio for the i-th dimension. Once the model is trained, we encode an image x and obtain z. Then the ratio is calculated for each sample, and finally the average across all samples is measured. This operation is conducted for each dimension of z independently. We also observe the distribution of z and how the image changes in response to the variation of each component z_i.
To train the model, we use the CelebA dataset (Liu et al., 2015), which consists of 202,599 celebrity images. The images are center-cropped so that the image size is 64 x 64.
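The per-dimension measurement described above can be sketched as follows: each latent dimension is perturbed by a small δ, the decoder output is compared with the unperturbed reconstruction, and the ratio D/δ² is recorded; the default squared-error D here is a placeholder for the SSIM-based distance of Eq. (60), and the averaging over samples is done outside this function.

```python
import torch


def per_dimension_ratio(decoder, z, delta=1e-2, dist=None):
    """For each latent dimension i, perturb z_i by delta, decode, and measure
    D(x_hat, x_perturbed) / delta^2; under Eq. (10) this ratio is roughly constant
    across dimensions and samples for RaDOGAGA."""
    if dist is None:
        dist = lambda a, b: ((a - b) ** 2).sum()           # squared error as placeholder D
    x_hat = decoder(z)
    ratios = []
    for i in range(z.shape[-1]):
        dz = torch.zeros_like(z)
        dz[..., i] = delta
        ratios.append(dist(x_hat, decoder(z + dz)) / delta ** 2)
    return torch.stack(ratios)                             # one ratio per latent dimension
```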

4.3.1 Configuration

In this experiment, the factorized distribution of Ballé et al. (2018) is used to estimate P_ψ(z). For comparison, we evaluate beta-VAE. Both models are constructed with the same depth of CNN and FC layers, with a 256-dimensional z. Details of the networks and hyperparameters are given in Appendix F. For RaDOGAGA, D is set to the distance defined by Eq. (60), together with a scale function h. For beta-VAE, the reconstruction loss is also the distance defined by Eq. (60).

4.3.2 Result

Both models are trained so that the SSIM between the input and the reconstructed image is around 0.93. Figures 4(a) and 4(b) show the variance of each component of the latent variables arranged in descending order. The red line is the cumulative relative ratio of the variance. In Fig. 4(b), the variance is concentrated in a few specific dimensions. On the other hand, because beta-VAE is trained so that each latent variable fits a Gaussian distribution with fixed mean and variance, there is no significant difference in the variance of each latent variable in Fig. 4(a). Figures 4(c) and 4(d) respectively depict, for beta-VAE and RaDOGAGA, the average of the ratio D/δ² for each of the nine dimensions with the largest variance. In beta-VAE, the ratio varies drastically depending on the dimension i, while in RaDOGAGA it is approximately constant regardless of i. Figure 5 shows decoder outputs when each component z_i is traversed from -2 to 2, fixing the rest of z at their means. From the top, each row corresponds to z_1, z_2, ..., and the center column is the mean. In Fig. 5(b), the image changes visually in any dimension of z, while in Fig. 5(a), depending on the dimension, there are cases where no significant change can be seen. This result means that, in RaDOGAGA, the variance of each z_i directly corresponds to the visual impact and the distance D, behaving as PCA. Besides, since D/δ² is constant, the variance can be used as a quantifying metric. Although this is an unsupervised image reconstruction task, if the task were a semantic problem with a distance function defined to solve it, such as classification, the influence of each z_i on the semantics could be quantified in the same way.

Figure 4: Variance of the latent variables (left two panels: (a) beta-VAE, (b) RaDOGAGA) and the average ratio D/δ² (right two panels: (c) beta-VAE, (d) RaDOGAGA)
Figure 5: Latent space traversal ((a) beta-VAE, (b) RaDOGAGA)

5 Conclusion

In this paper, we proposed RaDOGAGA, which learns a parametric probability distribution and an autoencoder simultaneously based on rate-distortion optimization. We proved theoretically and experimentally that the probability distribution of the latent variables obtained by the proposed method is proportional to the probability distribution of the input data. This property was validated in anomaly detection, achieving state-of-the-art performance. Moreover, our model has a PCA-like trait which is likely to promote the interpretation of latent variables. For future work, we will conduct experiments with different types of distance functions derived from semantic tasks, such as categorical classification. Meanwhile, as mentioned in Tschannen et al. (2019), considering the usefulness of latent variables in downstream tasks is another research direction to explore.

References

  • A. Alemi, B. Poole, I. Fischer, J. Dillon, R. A. Saurous, and K. Murphy (2018) Fixing a broken ELBO. In Proceedings of the 35th International Conference on Machine Learning, pp. 159–168.
  • J. Ballé, V. Laparra, and E. P. Simoncelli (2015) Density modeling of images using a generalized normalization transformation. arXiv preprint arXiv:1511.06281.
  • J. Ballé, D. Minnen, S. Singh, S. J. Hwang, and N. Johnston (2018) Variational image compression with a scale hyperprior. In International Conference on Learning Representations.
  • Y. Bengio, A. Courville, and P. Vincent (2013) Representation learning: a review and new perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence 35, pp. 1798–1828.
  • R. Brekelmans, D. Moyer, A. Galstyan, and G. Ver Steeg (2019) Exact rate-distortion in autoencoders via echo noise. arXiv preprint arXiv:1904.07199v2.
  • T. Q. Chen, X. Li, R. B. Grosse, and D. K. Duvenaud (2018) Isolating sources of disentanglement in variational autoencoders. In Advances in Neural Information Processing Systems, pp. 2610–2620.
  • X. Chen, Y. Duan, R. Houthooft, J. Schulman, I. Sutskever, and P. Abbeel (2016) InfoGAN: interpretable representation learning by information maximizing generative adversarial nets. In Advances in Neural Information Processing Systems, pp. 2172–2180.
  • L. Dinh, D. Krueger, and Y. Bengio (2014) NICE: non-linear independent components estimation. arXiv preprint arXiv:1410.8516.
  • I. Higgins, L. Matthey, A. Pal, C. Burgess, X. Glorot, M. Botvinick, S. Mohamed, and A. Lerchner (2017) beta-VAE: learning basic visual concepts with a constrained variational framework. In ICLR 2017.
  • Z. Jiang, Y. Zheng, H. Tan, B. Tang, and H. Zhou (2016) Variational deep embedding: an unsupervised and generative approach to clustering. arXiv preprint arXiv:1611.05148.
  • M. Johnson, D. Duvenaud, A. B. Wiltschko, S. R. Datta, and R. P. Adams (2016) Composing graphical models with neural networks for structured representations and fast inference. In Advances in Neural Information Processing Systems.
  • H. Kim and A. Mnih (2018) Disentangling by factorising. In Proceedings of the International Conference on Machine Learning, pp. 2649–2658.
  • D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980.
  • D. P. Kingma, D. J. Rezende, S. Mohamed, and M. Welling (2014) Semi-supervised learning with deep generative models. In Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 2, NIPS'14, Cambridge, MA, USA, pp. 3581–3589.
  • D. P. Kingma and P. Dhariwal (2018) Glow: generative flow with invertible 1x1 convolutions. In Advances in Neural Information Processing Systems, pp. 10215–10224.
  • W. Liao, Y. Guo, X. Chen, and P. Li (2018) A unified unsupervised Gaussian mixture variational autoencoder for high dimensional outlier detection. In 2018 IEEE International Conference on Big Data (Big Data), pp. 1208–1217.
  • M. Lichman (2013) UCI machine learning repository. http://archive.ics.uci.edu/ml.
  • Z. Liu, P. Luo, X. Wang, and X. Tang (2015) Deep learning face attributes in the wild. In Proceedings of the International Conference on Computer Vision (ICCV).
  • R. Lopez, J. Regier, M. I. Jordan, and N. Yosef (2018) Information constraints on auto-encoding variational Bayes. In Advances in Neural Information Processing Systems, pp. 6114–6125.
  • D. P. Kingma and M. Welling (2014) Auto-encoding variational Bayes. In ICLR 2014, Banff, Canada.
  • K. R. Rao and P. Yip (Eds.) (2000) The transform and data compression handbook. CRC Press, Boca Raton, FL, USA. ISBN 0849336929.
  • M. Rolínek, D. Zietlow, and G. Martius (2019) Variational autoencoders pursue PCA directions (by accident). In Proceedings of Computer Vision and Pattern Recognition (CVPR).
  • Y. Song and Z. Ou (2018) Learning neural random fields with inclusive auxiliary generators. arXiv preprint arXiv:1806.00271.
  • M. Tschannen, O. Bachem, and M. Lucic (2019) Recent advances in autoencoder-based representation learning. In Third Workshop on Bayesian Deep Learning (NeurIPS 2018), Montreal, Canada.
  • A. van den Oord, O. Vinyals, et al. (2017) Neural discrete representation learning. In Advances in Neural Information Processing Systems, pp. 6306–6315.
  • Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli (2004) Image quality assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing 13, pp. 600–612.
  • S. Wen, J. Zhou, A. Nakagawa, K. Kazui, and Z. Tan (2019) Variational autoencoder based image compression with pyramidal features and context entropy model. In Workshop and Challenge on Learned Image Compression.
  • H. Zenati, M. Romain, C. Foo, B. Lecouat, and V. Chandrasekhar (2018) Adversarially learned anomaly detection. In 2018 IEEE International Conference on Data Mining (ICDM), pp. 727–736.
  • S. Zhao, J. Song, and S. Ermon (2019) InfoVAE: balancing learning and inference in variational autoencoders. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 5885–5892.
  • J. Zhou, S. Wen, A. Nakagawa, K. Kazui, and Z. Tan (2019) Multi-scale and context-adaptive entropy model for image compression. In Workshop and Challenge on Learned Image Compression.
  • B. Zong, Q. Song, M. R. Min, W. Cheng, C. Lumezanu, D. Cho, and H. Chen (2018) Deep autoencoding Gaussian mixture model for unsupervised anomaly detection. In International Conference on Learning Representations.



Appendix A How Jacobi Matrix Become a Constantly Scaled Orthonormal Basis

In this appendix, we prove that all of the column vectors of the decoder's Jacobi matrix have the same norm and are orthogonal to each other. In other words, the columns of the Jacobi matrix form a constantly scaled orthonormal basis.

Here we assume that a data-space sample x has dimension M, that is, x ∈ R^M, and is encoded to an N-dimensional latent-space sample z. We also assume that a fixed encoder function f(·) and a fixed decoder g(·) are given such that D(x, g(f(x))) becomes minimal.

(21)
(22)

We further assume that a fixed parametric PDF P_ψ(z) of the latent variable is also given.

(23)

Here, it is noted that this function does not need to be optimal in the sense of matching the actual PDF of z.

Under these conditions, we introduce a new latent variable, and z is transformed from it using the following scaling function.

(24)

Here, our goal is to prove Eq. (9) by examining the condition on this scaling function that minimizes the average of Eq. (4).

Because of the assumption that D(x, g(f(x))) is minimal, we ignore the second term of Eq. (4). Next, the PDF of the new latent variable can be derived using the Jacobian of the scaling function.

(25)

By applying these conditions and equations to Eq. (4), the cost whose average the scaling function should minimize is expressed as follows.

Next, the latent variable is fixed, and the following quantity is defined in the next equation.

(27)

Then the average of Eq. (A) is expressed as follows.

Then the condition on the scaling function in the neighborhood of the fixed point is examined when Eq. (A) is minimized. Here, it is noted that the first term on the right side does not depend on the scaling function.

Before examining Eq. (A), some preparatory equations are needed. First, the Jacobi matrix of the scaling function at the fixed point is defined using the notation of partial differentials.

(29)

The following vector is also defined.

(30)

It is clear by definition that they have the following relation.

(31)

A cofactor of the matrix with regard to each element is defined, and the corresponding vector of cofactors is also defined by the following equation.

(32)

The following equations hold by the definition of the cofactor.

(33)
(34)

When the indices differ, the inner product of the cofactor vector and the column vector becomes zero, because this value is the determinant of a singular matrix with two identical column vectors.

(35)

The Jacobi matrix of the decoder at this point is also defined as follows.

(36)

Using these equations, we proceed to expand Eq. (A). The first term on the right side of Eq. (A) can be expressed as follows, where the second term on the right side of Eq. (37) is constant by definition.

(37)

The next step is the expansion of the second term in Eq. (A). First, the following approximation holds.

(38)

Then, the second term of Eq. (A) can be transformed into the next equation by using Eqs. (38), (2), and (5).

(39)

As a result, Eq. (A) can be rewritten as Eq. (40).

(40)

By examining the minimization of Eq. (40), the condition on the optimal scaling function in the neighborhood of the fixed point is clarified. Here, the condition on the Jacobi matrix is examined instead of the scaling function itself. Eq. (40) is differentiated with respect to the corresponding column vector, and the result is set to zero. Then the following equation, Eq. (41), is derived.

(41)

Afterwards, Eq. (41) is multiplied by the cofactor vector from the left and divided by the corresponding factor. As a result, Eq. (42) is derived by using Eqs. (33) and (35).

(42)

Here, we define the following composite function of the decoder and the scaling function.

(43)

Then the next equation holds by definition.

(44)

It is noted that this equation holds for any value of the latent variable. As a result, the following equation, which is Eq. (9), can be derived.

(45)

If the encoder and decoder are trained well and D(x, g(f(x))) ≈ 0 holds, we can introduce the rescaled data space u determined by the distance function D as in Eq. (8), and the next equation holds.

(46)

In conclusion, all column vectors of the Jacobi matrix between the latent space and the rescaled data space have the same L2 norm, and all pairs of column vectors are orthogonal. In other words, when the column vectors of the Jacobi matrix are multiplied by the constant, the resulting vectors are orthonormal.

Appendix B Explanation of ”Continuous Pca” Feature

In this section, we explain that RaDOGAGA has a continuous PCA feature when the factorized probability density model below is used.

P_ψ(z) = ∏_{i=1}^{N} P_{ψ_i}(z_i)    (47)

Here, our definition of the "continuous PCA" feature means the following: 1) the mutual information between latent variables is minimal, and they are likely to be uncorrelated with each other; 2) the energy of the latent space is concentrated in several principal components, and the importance of each component can be determined.

Next, we explain why these features are acquired. As explained in Appendix A, all column vectors of the decoder's Jacobi matrix from the latent space to the data space have the same norm, and all pairwise combinations of vectors are orthogonal. In other words, when a constant value is multiplied, the resulting vectors are orthonormal. Because the encoder is ideally the inverse function of the decoder, each row vector of the encoder's Jacobi matrix should equal the corresponding column vector of the decoder's under the ideal condition. Here, we consider an encoder and decoder with these features. Because the latent variables depend on the encoder parameters, the latent variable and its PDF are written with explicit dependence on those parameters. The PDFs of the latent space and the data space have the following relation, where J is the Jacobian or pseudo-Jacobian between the two spaces, a constant value as explained in Appendix A.

(48)

As described before, P_ψ(z) is a parametric PDF of the latent space to be optimized with parameter ψ.

By applying the results of Eqs. (40) and (42), Eq. (4) can be transformed as Eq. (B).

(49)

Since the third term on the right side is constant, it can be removed from the cost function as follows.

(50)

Then the parameters of the network and the probability model can be obtained as follows.

(51)

The cost in Eq. (51) can be transformed as follows.

(52)