Rate-Distortion Optimization Guided Autoencoder for Generative Approach with Quantitatively Measurable Latent Space
Abstract
In the generative model approach of machine learning, it is essential to acquire an accurate probabilistic model and to compress the dimension of data for easy treatment. However, in conventional deep-autoencoder-based generative models such as the VAE, the probability of the real space cannot be obtained correctly from that of the latent space, because the scaling between the two spaces is not controlled. This has also been an obstacle to quantifying the impact of the variation of latent variables on data. In this paper, we propose a Rate-Distortion Optimization guided autoencoder, in which the Jacobi matrix from the real space to the latent space has orthonormality. It is proved theoretically and experimentally that (i) the probability distribution of the latent space obtained by this model is proportional to the probability distribution of the real space, because the Jacobian between the two spaces is constant; (ii) our model behaves as a nonlinear PCA, where the energy of the acquired latent space is concentrated on several principal components and the influence of each component can be evaluated quantitatively. Furthermore, to verify its usefulness in practical applications, we evaluate its performance in unsupervised anomaly detection, where it outperforms current state-of-the-art methods.
1 Introduction
Capturing the inherent features of a dataset from high-dimensional and complex data is an essential issue in machine learning. The generative model approach learns the probability distribution of data, aiming at data generation by probabilistic sampling, unsupervised/weakly supervised learning, and acquiring meta-priors (general assumptions about how data can be summarized naturally, such as disentanglement, clustering, and hierarchical structure (Bengio et al., 2013; Tschannen et al., 2019)). It is generally difficult to directly estimate the probability density function (PDF) p(x) of real data x. Accordingly, one promising approach is to map x to a latent space z with reduced dimension and capture its PDF p(z). In recent years, deep-autoencoder-based methods have made it possible to compress dimensions and derive latent variables. While there is remarkable progress in these areas (van den Oord et al., 2017; Kingma et al., 2014; Jiang et al., 2016), the relation between p(x) and p(z) in current deep generative models is still not clear.
The VAE (Kingma and Welling, 2014) is one of the most successful generative models for capturing latent representations. In the VAE, a lower bound of the log-likelihood of x is introduced as the ELBO, and the latent variable is obtained by maximizing it. Various methods to maximize the ELBO have been proposed (Alemi et al., 2018; Zhao et al., 2019; Brekelmans et al., 2019). However, many previous works did not care about the value of the Jacobian between the two spaces, despite the fact that the ratio between p(x) and p(z) is equal to the Jacobian. Even in models that provide more flexible estimation (Johnson et al., 2016; Liao et al., 2018; Zong et al., 2018), this point is overlooked.
Here, when we turn our sight toward acquiring meta-priors, a natural goal is to evaluate the quantitative influence of each latent variable on the distance between a data point and its reconstruction. To do so, the scale of the latent variables should be controlled so that changes in latent variables are proportional to changes of distance in the data space. In addition, this scaling should be adjusted according to the definition of the distance metric. For instance, with respect to image quality metrics, different meta-priors should be derived from MSE and SSIM. Considering the mechanism of PCA is one direction to solve this. In PCA, the components are derived by an optimal orthonormal transformation, so the importance of each component can be identified quantitatively by its variance. It should be noted that orthogonality alone is not enough for quantitative analysis; orthonormality is required. This fact implies that, if the Jacobi matrix of an autoencoder has orthonormality, the characteristics of the acquired latent space can be evaluated quantitatively.
To deal with this, we propose RaDOGAGA (Rate-Distortion Optimization Guided Autoencoder for Generative Approach), which simultaneously learns a parametric probability distribution and an autoencoder, based on rate-distortion optimization (RDO). In this paper, we show the effect of RaDOGAGA in the following steps.
(1) We prove that RaDOGAGA has the following properties theoretically and experimentally.

The Jacobi matrix between the real space and the latent space becomes constant-scaled orthonormal, so the response of the real-space data to a minute change of the latent variable z is constant at any z.

Because of the constant Jacobian (or pseudo-Jacobian), p(x) and p(z) are almost proportional. Therefore, p(x) can be estimated by directly maximizing the log-likelihood of a parametric PDF in the reduced-dimensional space, without considering the ELBO.

When a univariate independent distribution is used to estimate p(z) parametrically, the model behaves as a "continuous PCA", where energy is concentrated on several principal components.
(2) Thanks to these properties, RaDOGAGA achieves state-of-the-art performance in anomaly detection tasks on four public datasets, where probability density estimation is important.
(3) We show that our approach can directly evaluate the impact of each latent variable on the distance metric in the real space. This feature is promising for further interpretation of latent variables.
2 Related Work
Flow-based model: Flow-based generative models generate images of astonishing quality (Kingma and Dhariwal, 2018; Dinh et al., 2014). The flow mechanism explicitly takes the Jacobian between x and z into account: the transformation function is learned while calculating and preserving this Jacobian. Unlike an ordinary autoencoder, which reconstructs x with a decoder different from the encoder f, the inverse function transforms z back as x = f^-1(z). Since the model preserves the Jacobian, p(x) can be estimated by maximizing the log-likelihood of p(z) without considering the ELBO. However, in this approach, the transformation needs to be a bijection. Because of this limitation, it is difficult to fully utilize the flexibility of neural networks.
Interpretation of latent variables: While deep autoencoders are expected to acquire meta-priors, interpreting latent variables is still challenging. In recent years, research aiming to acquire disentangled latent variables, which encourages them to be independent, has been flourishing (Lopez et al., 2018; Chen et al., 2018; Kim and Mnih, 2018; Chen et al., 2016). With these methods, qualitative effects of disentanglement can be seen; for example, when a certain latent variable is displaced, the image changes corresponding to specific attributes (size, color, etc.). Some works also propose quantitative metrics for meta-priors. In beta-VAE (Higgins et al., 2017), the metric evaluates the independence of latent variables by solving a classification task. Actually, the Jacobi matrix of the VAE is orthogonal (Rolínek et al., 2019), which implicitly encourages latent variables to be disentangled. However, orthonormality is not supported, and it is difficult to define a metric that directly evaluates the effect of a latent variable on the distance between data.
Image compression with rate-distortion optimization: Rate-distortion theory is a part of Shannon's information theory for lossy compression, which formulates the optimal trade-off between rate and distortion. According to the theory, the optimal way to encode Gaussian data is as follows. First, the data are transformed to coefficients using the orthonormal PCA basis. Then these coefficients are quantized such that the quantization noise for each channel is equal. Finally, optimal entropy coding is applied to the quantized coefficients, where the rate can be calculated from the logarithm of the ratio between the variance of the coefficients and the quantization noise. Recently, a rate-distortion optimized VAE was proposed in order to maximize the ELBO (Brekelmans et al., 2019). In that work, instead of applying equal noise, the noise distribution is precisely controlled for each input so as to match a predefined prior distribution in the sense of rate-distortion optimization; as a result, the Jacobian is not constant. The works most related to our approach are RDO-based image compression with deep learning (Ballé et al., 2018; Zhou et al., 2019; Wen et al., 2019). In these works, the information entropy of the feature map extracted by a CNN is minimized. To calculate the entropy, Ballé et al. (2018) propose a method to estimate the probability distribution of the latent variables parametrically, assuming a univariate independent (factorized) distribution for each latent variable. Here, we have discovered that, behind the success of these compression methods, RDO has the effect of making the Jacobi matrix constant-multiplied orthonormal. Inspired by this, we propose an autoencoder that scales latent variables according to the definition of the distance of data, without restricting the transformation function. Thanks to this feature, our scheme can estimate p(x) quantitatively, which is suitable for clustering and anomaly detection.
Furthermore, in the case a factorized distribution is used for p(z), our model behaves as a continuous PCA. This property is considered to promote the interpretation of latent variables.
3 Method and Theory
The fundamental mechanism of the rate-distortion optimization guided autoencoder is minimizing a cost function which consists of (i) the reconstruction error between the input data and the decoder output given a noisy latent variable and (ii) the entropy of the latent variable. By doing so, the model automatically finds an appropriate scale of the latent space. The specific method is described below.
First, let x be M-dimensional domain data (x ∈ R^M) and p(x) the probability of x. Then x is converted to an N-dimensional latent variable by the encoder. Let f_θ(x), g_φ(z), and P_ψ(z) be the parametric encoder, decoder, and probability distribution function of the latent space with parameters θ, φ, and ψ. We assume that each function's parameters are rich enough to fit ideally. Then the latent variable z and the decoded data x̂ are generated as below:
(1)  z = f_θ(x),  x̂ = g_φ(z)
Let ε = (ε_1, ε_2, ..., ε_N) be noise with each dimension being independent (non-correlated among different dimensions) with an average of 0 and a variance of σ², as follows:
(2)  E[ε_i] = 0,  E[ε_i ε_j] = σ² δ_ij
Then, given the sum of the latent variable and noise z + ε, the decoder output x̂_ε is obtained as follows:
(3)  x̂_ε = g_φ(z + ε)
Here, the cost function is defined by Eq. (4)
(4)  L = -log(P_ψ(z)) + λ1 · h(d(x, x̂)) + λ2 · d(x̂, x̂_ε)
In this equation, the first term is the estimated entropy of the latent distribution. d(·, ·) in the second and the third term is a distance function between two data points. It is assumed that this distance function can be approximated by the following quadratic form in the neighborhood of x, where δx is an arbitrary micro variation of x, G_x is a positive definite matrix depending on x, and L_x is the Cholesky decomposition of G_x.
(5)  d(x, x + δx) ≃ δx^T G_x δx,  G_x = L_x^T L_x
For instance, when d(·, ·) is the square error as in Eq. (6), G_x and L_x are identity matrices.
(6)  d(x, x̂) = ||x - x̂||^2
In the case of the SSIM (Wang et al. (2014)) metric, which is close to subjective image quality, the cost can also be approximated in a quadratic form with a positive definite matrix. This is explained in Appendix C. Let h(·) in the second term of Eq. (4) be a scale function such as the identity h(d) = d, the logarithm h(d) = log(d), etc. The effect of h(·) is discussed in Appendix D. Then, Eq. (4) is averaged according to the distributions of x and ε. By deriving the parameters that minimize this average, the encoder, decoder, and probability distribution of the latent space are trained as in Eq. (7).
(7)  θ, φ, ψ = argmin_{θ,φ,ψ} E_{x∼p(x), ε}[ L ]
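As a concrete illustration, the per-sample cost of Eq. (4) can be sketched in NumPy. This is a minimal sketch under stated assumptions, not the authors' implementation: the encoder, decoder, and prior are passed in as callables, and all names (radogaga_cost, log_prior, etc.) are illustrative.

```python
import numpy as np

def radogaga_cost(x, encode, decode, log_prior, h,
                  sigma=0.1, lam1=1.0, lam2=1.0, rng=None):
    """One-sample cost: rate + lam1*h(d(x, x_hat)) + lam2*d(x_hat, x_hat_eps).

    encode/decode : callables standing in for f_theta / g_phi.
    log_prior     : callable returning log P_psi(z).
    h             : scale function, e.g. identity or np.log.
    """
    rng = np.random.default_rng() if rng is None else rng
    z = encode(x)
    x_hat = decode(z)
    eps = rng.normal(0.0, sigma, size=z.shape)   # independent noise, variance sigma^2
    x_hat_eps = decode(z + eps)
    d = lambda a, b: np.sum((a - b) ** 2)        # squared-error distance (Eq. 6)
    rate = -log_prior(z)                         # estimated entropy term
    return rate + lam1 * h(d(x, x_hat)) + lam2 * d(x_hat, x_hat_eps)
```

With an identity encoder/decoder and a Gaussian prior, the two distortion terms are nonnegative, so the cost is bounded below by the rate term.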
Here, we introduce y as a rescaled space of x according to the distance function d(·, ·):
(8)  ∂y = L_x ∂x
When d(·, ·) is the square error, the two spaces x and y are equivalent. It turns out that each column of the Jacobian matrix between the latent space z and the rescaled space y is a constant multiple of an orthonormal basis, regardless of the values of x and z, after training based on Eqs. (4) and (7). Here, δ_ij denotes the Kronecker delta.
(9)  (∂y/∂z_i)^T (∂y/∂z_j) = (1/(2λ2σ²)) δ_ij
A more detailed proof is described in Appendix A. Then, by calculating the distance function between the decoded points for z and z + δz, the following relationship is derived from Eqs. (5), (8) and (9).
(10)  d(g_φ(z), g_φ(z + δz)) ≃ δy^T δy = (1/(2λ2σ²)) ||δz||^2
This means that the latent space is scaled so that the amount of change in the real space in response to a minute change δz is constant, independent of the value of z.
(11)  ||δy|| / ||δz|| = (2λ2σ²)^(-1/2) = const.
Next, the Jacobian between x and z is examined. First, when M = N, the Jacobian matrix is square, and each column is a constant multiple of an orthonormal basis vector. For this reason, the Jacobian determinant is a constant regardless of the value of z, as shown below:
(12)  J = |det(∂y/∂z)| = (2λ2σ²)^(-N/2)
In this case, the probability distributions of z and y are proportional because of the constant Jacobian. For the case M > N, we assume the situation where most of the energy is efficiently and effectively mapped to the N-dimensional latent space. Then the product of the singular values of the SVD of the Jacobi matrix can be regarded as a pseudo-Jacobian between the real space and the latent space. Since all N singular values are equal to the same constant, the pseudo-Jacobian is also a constant. As a result, the following equation holds, where J is the Jacobian or pseudo-Jacobian.
(13)  p(z) = J · p(y) ∝ p(y)
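The constant-Jacobian argument can be checked numerically with a toy linear decoder whose Jacobian is a constant multiple of an orthonormal (rotation) matrix; the 2-D setting and the Gaussian data density below are illustrative assumptions, not from the paper.

```python
import numpy as np

# Decoder x = g(z) = c * Q @ z with Q orthonormal: the Jacobian is c*Q everywhere,
# so |det J| = c^N is constant and p(z) = |det J| * p(x) for every z.
c = 0.5
theta = 0.3
Q = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])   # rotation => orthonormal
J = c * Q

def p_x(x):
    """Example data density: standard 2-D Gaussian."""
    return np.exp(-0.5 * x @ x) / (2 * np.pi)

def p_z(z):
    """Latent density induced by the change-of-variables formula."""
    return p_x(J @ z) * abs(np.linalg.det(J))

# The ratio p_z(z) / p_x(g(z)) equals |det J| = c^2 at every z.
for z in [np.zeros(2), np.array([1.0, -1.0]), np.array([-2.0, 0.5])]:
    assert abs(p_z(z) / p_x(J @ z) - c ** 2) < 1e-12
```

Because the ratio is the same constant everywhere, maximizing the likelihood of the latent density is equivalent to maximizing that of the data density, which is the point of Eq. (13).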
Let P_ψ(z) be the estimated probability of z. Because the Jacobian J is constant, p(z) is proportional to p(x). Accordingly, the average of the first term in Eq. (4) over p(x) can be transformed as follows.
(14)  E_{x∼p(x)}[ -log P_ψ(z) ] = -∫ p(z) log P_ψ(z) dz
Here, minimization of Eq. (14) is equivalent to maximizing the log-likelihood of P_ψ(z). As a result, the PDF of the data can be estimated without maximizing the ELBO.
Regarding the parametric probability distribution, if a model such as a GMM is used, a multimodal probability distribution can be obtained flexibly. Besides, when the following factorized probability model is used, the model shows a "continuous PCA" feature, where the energy is concentrated in several principal latent variables.
(15)  P_ψ(z) = ∏_{i=1}^{N} P_ψi(z_i)
The derivation of the "continuous PCA" feature is explained in Appendix B.
4 Experiment
4.1 Probability Density Estimation With Toy Data
In this section, we describe our experiment using toy data to demonstrate whether the probability density of the input data and the density estimated in the latent space are proportional to each other, as in theory. First, we sample data s from a three-dimensional Gaussian mixture distribution with three components, with mixture weights π_k, means μ_k, and covariances Σ_k, where k is the index of the mixture component. Then, we scatter s with uniform random noise n, where i and j are the indices for the dimensions of the sampled and the scattered data, respectively. The noises are uncorrelated with each other. In the end, we produce 16-dimensional input data with a sample number of 10,000.
(16) 
The appearance probability of the input data is equivalent to the generation probability of s.
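A sketch of this toy-data construction in NumPy is given below. Since Eq. (16) itself was not recoverable here, the exact scatter map (copying one of the three GMM coordinates into each of the 16 output dimensions and adding uniform noise) and all numeric values are assumptions for illustration only.

```python
import numpy as np

def make_toy_data(n_samples=10000, out_dim=16, seed=0):
    """Sample s from a 3-component, 3-D GMM, then scatter each sample into
    `out_dim` dimensions with independent uniform noise.  The scatter map and
    the GMM parameters are illustrative assumptions, not the paper's values."""
    rng = np.random.default_rng(seed)
    weights = np.array([0.3, 0.3, 0.4])                   # mixture weights pi_k
    means = np.array([[0., 0., 0.], [4., 4., 0.], [-4., 0., 4.]])
    k = rng.choice(3, size=n_samples, p=weights)          # component index per sample
    s = means[k] + rng.normal(0.0, 1.0, size=(n_samples, 3))
    assign = rng.integers(0, 3, size=out_dim)             # which coordinate feeds each dim
    noise = rng.uniform(-0.05, 0.05, size=(n_samples, out_dim))
    return s[:, assign] + noise, k

x, k = make_toy_data()
```

Because the uniform noise is small and independent, the appearance probability of each 16-dimensional sample is governed by the generation probability of its underlying s, as stated above.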
4.1.1 Configuration
In the experiment, we estimate P_ψ(z) using a GMM, as in DAGMM (Zong et al., 2018). Instead of the EM algorithm, the GMM parameters are learned using an Estimation Network (EN), which consists of a multi-layer neural network. When the GMM has K mixture components and B is the batch size, the EN outputs the mixture-component membership prediction γ̂ as a K-dimensional vector as follows:
(17)  γ̂ = softmax(MLN(z))
The k-th mixture weight π̂_k, mean μ̂_k, covariance Σ̂_k, and the entropy of z are further calculated by Eqs. (18) and (19).
(18)  π̂_k = (1/B) Σ_b γ̂_bk,  μ̂_k = Σ_b γ̂_bk z_b / Σ_b γ̂_bk,  Σ̂_k = Σ_b γ̂_bk (z_b - μ̂_k)(z_b - μ̂_k)^T / Σ_b γ̂_bk
(19)  -log P_ψ(z) = -log( Σ_{k=1}^{K} π̂_k N(z; μ̂_k, Σ̂_k) )
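The EN-driven GMM estimate of Eqs. (18) and (19) can be sketched as follows, assuming DAGMM-style soft membership weights; the function name and the diagonal regularizer eps are illustrative.

```python
import numpy as np

def gmm_from_membership(z, gamma, eps=1e-6):
    """Estimate GMM parameters from membership predictions gamma (B x K) as in
    Eq. (18), then return the per-sample negative log-likelihood of Eq. (19)."""
    B, N = z.shape
    K = gamma.shape[1]
    Nk = gamma.sum(axis=0)                       # soft counts per component
    pi = Nk / B                                  # mixture weights
    mu = (gamma.T @ z) / Nk[:, None]             # component means
    sigma = np.zeros((K, N, N))
    logp = np.zeros((B, K))
    for k in range(K):
        diff = z - mu[k]
        sigma[k] = (gamma[:, k, None, None] *
                    (diff[:, :, None] * diff[:, None, :])).sum(axis=0) / Nk[k]
        sigma[k] += eps * np.eye(N)              # regularize the diagonal
        inv = np.linalg.inv(sigma[k])
        _, logdet = np.linalg.slogdet(sigma[k])
        maha = np.einsum('bi,ij,bj->b', diff, inv, diff)
        logp[:, k] = np.log(pi[k]) - 0.5 * (maha + logdet + N * np.log(2 * np.pi))
    m = logp.max(axis=1, keepdims=True)          # log-sum-exp gives -log P(z)
    energy = -(m[:, 0] + np.log(np.exp(logp - m).sum(axis=1)))
    return pi, mu, sigma, energy
```

In training, gamma comes from the EN's softmax output of Eq. (17), so the GMM parameters are differentiable functions of the batch.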
The overall network is trained by Eqs. (4) and (7). In this experiment, we set d as the square error and test two types of h: h(d) = d and h(d) = log(d). We denote the models trained with these as RaDOGAGA(d) and RaDOGAGA(log(d)). As a comparison method, DAGMM is used. DAGMM also consists of an encoder, a decoder, and an EN. In DAGMM, to avoid the trivial solution in which the entropy is minimized when the diagonal components of the covariance matrix are 0, the inverse of the diagonal components is added to the cost function as in Eq. (20):
(20)  J = (1/B) Σ_b ||x_b - x̂_b||^2 - (λ1/B) Σ_b log P(z_b) + λ2 Σ_k Σ_i 1/(Σ̂_k)_ii
For both RaDOGAGA and DAGMM, the autoencoder part is constructed with fully connected (FC) layers with sizes of 64, 32, 16, 3, 16, 32, and 64. For all FC layers except the last of the encoder and the decoder, we attach an activation function. The EN part is also constructed with FC layers with sizes of 10 and 3. For the first layer, we attach an activation function and dropout (ratio = 0.5); for the last one, softmax is attached. (λ1, λ2) is set separately for DAGMM, RaDOGAGA(d), and RaDOGAGA(log(d)). Optimization is done by the Adam optimizer (Kingma and Ba, 2014) with the same learning rate for all models, and σ² is fixed.
4.1.2 Result
Figure 2 displays the distribution of the latent variable z, and Figure 3 displays a plot of p(x) (x-axis) against the estimated P_ψ(z) (y-axis). Even though both methods capture that the data are generated from three mixture components, there is a difference in PDF. In our method, p(x) and P_ψ(z) are approximately proportional to each other, as in theory, while we cannot observe such proportionality in DAGMM. To compare RaDOGAGA(d) and RaDOGAGA(log(d)), we normalized both quantities to the range from 0 to 1, then calculated the residual of a linear regression with P_ψ(z) as the objective variable and p(x) as the explanatory variable. The residual of RaDOGAGA(log(d)), i.e., 0.0027, is smaller than that of RaDOGAGA(d), i.e., 0.0100. Actually, when P_ψ(z) is sufficiently fitted, h(d) = log(d) makes p(x) and P_ψ(z) proportional more rigidly. On the other hand, h(d) = d slightly bends the scale of the latent space in order to minimize the entropy term, allowing a relaxed fitting of P_ψ(z). More detail is described in Appendix D. In the next section, we see how this trait performs on a real task.
4.2 Anomaly Detection Task Using Real Data
In this section, we examine whether the clear relationship between p(x) and P_ψ(z) is useful in the anomaly detection task using real data. We use four public datasets: KDDCUP99, Thyroid, Arrhythmia, and KDDCUP-Rev (the datasets can be downloaded from https://kdd.ics.uci.edu/ and http://odds.cs.stonybrook.edu). The (instance number, dimension, anomaly ratio (%)) of each dataset is (494021, 121, 20), (3772, 6, 2.5), (452, 274, 15), and (121597, 121, 20), respectively. Details of the datasets are described in Appendix E.
4.2.1 Experimental Setup
We follow the setting in Zong et al. (2018). A randomly extracted 50% of the data is assigned to training and the rest to testing. During training, only normal data are used. During the test, the anomaly score (the negative log-likelihood of the latent representation) is calculated for each sample, and if the score is higher than a threshold, the sample is detected as an anomaly. The threshold is determined by the ratio of anomaly data in each dataset; for example, in KDDCup99, data with scores in the top 20% are detected as anomalies. As metrics, precision, recall, and F1 score are calculated. We run 20 times for each dataset, split by 20 different random seeds.
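The thresholding and scoring procedure above can be sketched as follows; the function name and the use of np.quantile for the ratio-based threshold are illustrative assumptions.

```python
import numpy as np

def detect_and_score(scores, labels, anomaly_ratio):
    """Flag the top `anomaly_ratio` fraction of anomaly scores as anomalies
    and compute precision / recall / F1 against ground truth (1 = anomaly)."""
    thresh = np.quantile(scores, 1.0 - anomaly_ratio)   # ratio-based threshold
    pred = (scores > thresh).astype(int)
    tp = int(np.sum((pred == 1) & (labels == 1)))
    fp = int(np.sum((pred == 1) & (labels == 0)))
    fn = int(np.sum((pred == 0) & (labels == 1)))
    precision = tp / max(tp + fp, 1)
    recall = tp / max(tp + fn, 1)
    f1 = 2 * precision * recall / max(precision + recall, 1e-12)
    return precision, recall, f1
```

For KDDCup99, anomaly_ratio would be 0.2, matching the 20% rule described above.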
4.2.2 Baseline Model
4.2.3 Configuration
As in Zong et al. (2018), in addition to the output from the encoder, the reconstruction-error features are concatenated to z before it is sent to the EN. Note that z is sent to the decoder before concatenation. The other configuration, except for the hyperparameters, is the same as in the previous experiment. The hyperparameters for each dataset are described in Appendix E.
4.2.4 Results
Table 1 reports the average scores and standard deviations (in brackets). Comparing DAGMM and RaDOGAGA, RaDOGAGA has better performance regardless of the type of h. Simply introducing the RDO mechanism into the autoencoder has a valid efficacy for anomaly detection. Moreover, our approach achieves state-of-the-art performance compared to other previous works in which the same datasets are used. The clear relationship between p(x) and P_ψ(z) in our model is considered to be effective in anomaly detection, where estimating the probability distribution is important. When we compare the results of RaDOGAGA(d) and RaDOGAGA(log(d)), neither is always superior. As described in Section 4.1 and Appendix D, h can be chosen depending on the fitting flexibility of P_ψ(z).
4.3 Quantifying the Impact of Latent Variables on Distance and Behavior as PCA
In this section, we verify that the impact of each latent variable on the distance function can be quantified.
As described in Section 3, when δz is a minute displacement, the ratio between the distance d(g_φ(z), g_φ(z + δz)) and ||δz||^2 is constant regardless of z. We verify whether this characteristic is observable in a model trained with real data. Let δz_i be a vector whose i-th component alone has a minute value δ. Then D_i = d(g_φ(z), g_φ(z + δz_i)) / δ^2 denotes this ratio for the i-th dimension.
Once the model is trained, we encode the image x and obtain z = f_θ(x). Then, D_i is calculated for each sample. Finally, the average across all samples is measured. This operation is conducted for each dimension of z independently. We also observe the distribution of z and how the image looks different in response to the variation of each component z_i.
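This finite-difference measurement of D_i can be sketched as below; the function name is illustrative, and the linear rotation decoder in the test is a stand-in for a trained g_φ.

```python
import numpy as np

def latent_impact(decode, z, d, delta=1e-3):
    """Estimate D_i = d(g(z), g(z + delta * e_i)) / delta**2 for each latent
    dimension i.  With a constant-scaled orthonormal Jacobian, D_i should be
    (nearly) the same constant for every i."""
    x0 = decode(z)
    D = np.zeros(z.shape[0])
    for i in range(z.shape[0]):
        dz = np.zeros_like(z)
        dz[i] = delta                      # perturb only the i-th component
        D[i] = d(x0, decode(z + dz)) / delta ** 2
    return D
```

For a linear decoder x = c·Qz with Q a rotation and d the squared error, every D_i equals c², which is the behavior the experiment checks for RaDOGAGA.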
To train the model, we use the CelebA dataset (Liu et al., 2015), which consists of 202,599 celebrity images. The images are center-cropped so that the image size is 64 x 64.
4.3.1 Configuration
In this experiment, the factorized distribution (Ballé et al., 2018) is used to estimate P_ψ(z). For comparison, we evaluate beta-VAE. Both models are constructed with the same depth of CNN and FC layers, with a 256-dimensional z. Details of the networks and hyperparameters are written in Appendix F. For RaDOGAGA, we set the distance d as defined by Eq. (60); for beta-VAE, the same d is used as the reconstruction loss.
4.3.2 Result
Both models are trained so that the SSIM between the input and the reconstructed image is around 0.93. Figure 4 shows the variance of each component of the latent variables for both models, arranged in descending order; the red line is the cumulative relative ratio of the variance. In RaDOGAGA, the variance is concentrated in a few specific dimensions. On the other hand, since beta-VAE is trained so that each latent variable fits a Gaussian distribution with fixed mean and variance, there is no significant difference in the variance of each latent variable. Figure 4 also depicts the average of D_i for each of the top nine dimensions with the largest variance of z_i. In beta-VAE, D_i varies drastically depending on the dimension i, while in RaDOGAGA it is approximately constant regardless of i. Figure 5 shows decoder outputs when each component z_i is traversed from -2 to 2, fixing the rest of z at the mean. From the top, each row corresponds to z_1, z_2, ..., and the center column is the mean. In RaDOGAGA, the image changes visually in any dimension of z, while in beta-VAE, depending on the dimension i, there are cases where no significant change can be seen. This result means that, in RaDOGAGA, the variance of z_i directly corresponds to the visual impact and the distance d, behaving as PCA. Besides, since D_i is constant, the variance can be used as a quantifying metric. Although this is an unsupervised image reconstruction task, if the task were a semantic problem with a distance function defined to solve it, such as classification, the influence of each z_i on the semantics would be expected to be quantified likewise.
5 Conclusion
In this paper, we proposed RaDOGAGA, which learns a parametric probability distribution and an autoencoder simultaneously based on rate-distortion optimization. It was proved theoretically and experimentally that the probability distribution of the latent variables obtained by the proposed method is proportional to the probability distribution of the input data. This property was validated in anomaly detection, achieving state-of-the-art performance. Moreover, our model has a PCA-like trait, which likely promotes the interpretation of latent variables. For future work, we will conduct experiments with different types of distance functions derived from semantic tasks, such as categorical classification. Meanwhile, as mentioned in Tschannen et al. (2019), considering the usefulness of latent variables in downstream tasks is another research direction to explore.
References
 Fixing a broken ELBO. In Proceedings of the 35th International Conference on Machine Learning, pp. 159–168.
 Density modeling of images using a generalized normalization transformation. arXiv preprint arXiv:1511.06281.
 Variational image compression with a scale hyperprior. In International Conference on Learning Representations.
 Representation learning: a review and new perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence 35, pp. 1798–1828.
 Exact rate-distortion in autoencoders via echo noise. arXiv preprint arXiv:1904.07199v2.
 Isolating sources of disentanglement in variational autoencoders. In Advances in Neural Information Processing Systems, pp. 2610–2620.
 InfoGAN: interpretable representation learning by information maximizing generative adversarial nets. In Advances in Neural Information Processing Systems, pp. 2172–2180.
 NICE: non-linear independent components estimation. arXiv preprint arXiv:1410.8516.
 beta-VAE: learning basic visual concepts with a constrained variational framework. In ICLR 2017.
 Variational deep embedding: an unsupervised and generative approach to clustering. arXiv preprint arXiv:1611.05148.
 Composing graphical models with neural networks for structured representations and fast inference. In Advances in Neural Information Processing Systems.
 Disentangling by factorising. In Proc. of the International Conference on Machine Learning, pp. 2649–2658.
 Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980.
 Semi-supervised learning with deep generative models. In Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 2, NIPS'14, Cambridge, MA, USA, pp. 3581–3589.
 Glow: generative flow with invertible 1x1 convolutions. In Advances in Neural Information Processing Systems, pp. 10215–10224.
 A unified unsupervised Gaussian mixture variational autoencoder for high dimensional outlier detection. In 2018 IEEE International Conference on Big Data (Big Data), pp. 1208–1217.
 UCI machine learning repository. http://archive.ics.uci.edu/ml.
 Deep learning face attributes in the wild. In Proceedings of International Conference on Computer Vision (ICCV).
 Information constraints on auto-encoding variational Bayes. In Advances in Neural Information Processing Systems, pp. 6114–6125.
 Auto-encoding variational Bayes. In ICLR 2014, Banff, Canada.
 The transform and data compression handbook. CRC Press, Inc., Boca Raton, FL, USA. ISBN 0849336929.
 Variational autoencoders pursue PCA directions (by accident). In Proceedings of Computer Vision and Pattern Recognition (CVPR).
 Learning neural random fields with inclusive auxiliary generators. arXiv preprint arXiv:1806.00271.
 Recent advances in autoencoder-based representation learning. In Third Workshop on Bayesian Deep Learning (NeurIPS 2018), Montreal, Canada.
 Neural discrete representation learning. In Advances in Neural Information Processing Systems, pp. 6306–6315.
 Image quality assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing, Vol. 13.
 Variational autoencoder based image compression with pyramidal features and context entropy model. In Workshop and Challenge on Learned Image Compression.
 Adversarially learned anomaly detection. In 2018 IEEE International Conference on Data Mining (ICDM), pp. 727–736.
 InfoVAE: balancing learning and inference in variational autoencoders. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 5885–5892.
 Multi-scale and context-adaptive entropy model for image compression. In Workshop and Challenge on Learned Image Compression.
 Deep autoencoding Gaussian mixture model for unsupervised anomaly detection. In International Conference on Learning Representations.
Appendix A How the Jacobi Matrix Becomes a Constantly Scaled Orthonormal Basis
In this appendix, we prove that all column vectors of the decoder's Jacobi matrix have the same norm and are orthogonal to each other. In other words, each column of the Jacobi matrix is a constantly scaled orthonormal basis vector.
Here we assume that a data-space sample x has M dimensions, that is, x ∈ R^M, and is encoded to an N-dimensional latent-space sample z. We also assume that a fixed encoder function f_θ(x) and a fixed decoder g_φ(z) are given such that the reconstruction error becomes minimal.
(21)  z = f_θ(x)
(22)  x̂ = g_φ(z)
We further assume that a fixed parametric PDF of the latent variable, P_ψ(z), is also given.
(23) 
Here, it is noted that this function does not need to be optimal in the sense of matching the actual PDF p(z) of z.
Under these conditions, we introduce a new latent variable w, where z is obtained from w using the following scaling function u(·).
(24)  z = u(w)
Here, our goal is to prove Eq. (9) by examining the condition on this scaling function which minimizes the average of Eq. (4) with regard to the noise ε.
Because of the assumption of minimal reconstruction error, we ignore the second term of Eq. (4). Next, the PDF of w can be derived using the Jacobian of u(·).
(25) 
By applying these conditions and equations to Eq. (4), the cost of scaling function to minimize the average is expressed as follows.
Next, the latent variable is fixed as w = w_0, and the corresponding latent point z_0 is defined in the next equation.
(27) 
Then the average of Eq. (A) with regard to ε is expressed as follows.
Then the condition on the scaling function in the neighborhood of w_0 is examined when Eq. (A) is minimized. Here, it is noted that the first term on the right side does not depend on the scaling function.
Before examining Eq. (A), some preparation is needed. At first, the Jacobi matrix of the scaling function at w_0 is defined using the notation of partial differentials.
(29) 
The vector is also defined as follows.
(30) 
It is clear by definition that they have the following relation.
(31) 
is defined as a cofactor of matrix with regard to the element , and is also defined by the following equation.
(32) 
The following equations hold by the definition of the cofactor.
(33)  
(34) 
In the case of different indices, the inner product becomes zero because this value is the determinant of a singular matrix with two equivalent column vectors.
(35) 
The Jacobi matrix of the decoder g_φ(z) at z_0 is defined as follows.
(36) 
Using these equations, we proceed to expand Eq. (A). The first term on the right side can be expressed as follows, where the second term of Eq. (37) on the right side is constant by definition.
(37) 
The next step is an expansion of the second term in Eq. (A). First, the following approximation holds.
(38)  
Then, the second term of Eq. (A) can be transformed to the next equation by using Eqs. (38), (2), and (5).
(40)  
By examining the minimization process of Eq. (40), the conditions on the optimal scaling function in the neighborhood of w_0 are clarified. Here, the condition on the Jacobi matrix is examined instead of the function itself. Eq. (40) is differentiated by the vector, and the result is set to zero. Then the following equation, Eq. (41), is derived.
(41) 
Afterwards, Eq. (41) is multiplied from the left and divided accordingly. As a result, Eq. (42) is derived by using Eqs. (33) and (35).
(42) 
Here, we define the following composite function of the decoder and the scaling function.
(43) 
Then the next equation holds by definition.
(44) 
It is noted that this equation holds at any value of w or z. As a result, the following equation, that is, Eq. (9), can be derived.
(45) 
If the encoder and decoder are trained well and x ≃ g_φ(f_θ(x)) holds, we can introduce the new rescaled data space y determined by the distance function d(·, ·), and the next equation holds.
(46) 
In conclusion, all column vectors of the Jacobi matrix between y and z have the same L2 norm, and all pairs of column vectors are orthogonal. In other words, when the column vectors of the Jacobi matrix are multiplied by a constant, the resulting vectors are orthonormal.
Appendix B Explanation of the "Continuous PCA" Feature
In this section, we explain that RaDOGAGA has a continuous-PCA feature when the factorized probability density model below is used.
(47)  P_ψ(z) = ∏_{i=1}^{N} P_ψi(z_i)
Here, our definition of the "continuous PCA" feature means the following: 1) the mutual information between latent variables is minimized, so they are likely to be uncorrelated with each other; 2) the energy of the latent space is concentrated in several principal components, and the importance of each component can be determined.
Next, we explain why these features are acquired. As explained in Appendix A, all column vectors of the Jacobi matrix of the decoder from the latent space to the data space have the same norm, and all pairs of vectors are orthogonal. In other words, when a constant value is multiplied, the resulting vectors are orthonormal. Because the encoder is ideally an inverse function of the decoder, each row vector of the encoder's Jacobi matrix should be the same as the corresponding column vector of the decoder under the ideal condition. Here, f_θ(x) and g_φ(z) are defined as the encoder and decoder with this feature. Because the latent variables depend on the encoder parameter θ, the latent variable is described as z_θ, and its PDF is defined as p(z_θ). The PDFs of the latent space and the data space have the following relation, where J is the Jacobian or pseudo-Jacobian between the two spaces, with a constant value as explained in Appendix A.
(48) 
As described before, P_ψ(z) is a parametric PDF of the latent space to be optimized with parameter ψ.
Because the third term on the right side is constant, it can be removed from the cost function as follows.
(50) 
Then the parameters of the network and the probability model can be obtained as follows.
(51) 
The cost in Eq. (51) can be transformed as follows.
(52)  