# Progressive VAE Training on Highly Sparse and Imbalanced Data

## Abstract

In this paper, we present a novel approach for training a Variational Autoencoder (VAE) on a highly imbalanced data set. The proposed training of a high-resolution VAE model begins with the training of a low-resolution core model, which can be successfully trained on an imbalanced data set. In subsequent training steps, new convolutional, upsampling, deconvolutional, and downsampling layers are iteratively attached to the model. In each iteration, the additional layers are trained based on the intermediate pretrained model, which is the result of the previous training iterations. Thus, the resolution of the model is progressively increased up to the required resolution level. The progressive VAE training is exploited for learning a latent representation with imbalanced, highly sparse data sets and, consequently, generating routes in a constrained 2D space. Routing problems (e.g., the vehicle routing problem, the travelling salesman problem, and arc routing) are of special significance in many modern applications (e.g., route planning, network maintenance, and the development of high-performance nanoelectronic systems) and are typically associated with sparse, imbalanced data. In this paper, the critical problem of routing billions of components in nanoelectronic devices is considered. The proposed approach exhibits a significant training speedup as compared with state-of-the-art VAE training methods, while generating the expected image outputs from unseen input data. Furthermore, the final progressive VAE models exhibit a much more precise output representation than Generative Adversarial Network (GAN) models trained with comparable training time. The proposed method is expected to be applicable to a wide range of applications, including but not limited to image inpainting, sentence interpolation, and semi-supervised learning.

## 1 Introduction

A Convolutional Neural Network (CNN) [12] is a machine learning (ML) architecture. Owing to local connectivity of convolutional layers, CNNs are commonly used for detecting complex local patterns within high-resolution 2D maps, as illustrated in Figure 1. In particular, image manipulation problems can be efficiently solved with CNNs [10]. Training a CNN model is, however, a non-trivial problem.

A typical CNN comprises multiple convolutional and deconvolutional layers separated by upsampling or downsampling layers. One of the most fundamental and common CNN configurations is the VAE, a CNN topology which enables the encoding of a high-dimensional 2D input into a low-dimensional inner representation. The high-dimensional output image is then reconstructed solely based on the low-resolution inner representation. The structure of a stacked autoencoder is illustrated in Figure 2.

In its simplest form, the multidimensional VAE input and output images are identical. In this case, the VAE acts as an image compression tool. Alternatively, more complex VAE configurations make processing of the input image possible by mapping one image onto another. For advanced image-to-image transformations, a reasonably deep neural network is required, yielding image processing with complex non-linear logic. Increasing the number of VAE layers, however, poses new challenges for the training process. For example, the vanishing gradient problem is a primary concern in deep VAE networks. This problem is a direct consequence of the backpropagation (i.e., gradient descent) algorithm, which minimizes the error function, $E$, by iteratively updating the network model weights, $w$, in the opposite direction to the gradient of the error function with respect to the network weights, $w \leftarrow w - \eta \, \partial E / \partial w$. Here $\eta$ is the learning rate of the algorithm. Intuitively, the gradients become smaller with increasing model accuracy. These small gradients tend to further decrease through repeated matrix multiplications in the inner network layers, significantly impeding the model training [9].
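The shrinking of gradients through successive layers can be illustrated with a minimal sketch. It assumes a hypothetical chain of single-unit sigmoid layers with unit weights and zero pre-activations; the function names and the chosen depths are illustrative and are not taken from the paper.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_grad(z: float) -> float:
    # Derivative of the sigmoid; its maximum value is 0.25 at z = 0.
    s = sigmoid(z)
    return s * (1.0 - s)


def backprop_gradient_scale(pre_activations, weights):
    """Product of w * sigmoid'(z) along a chain of single-unit layers.

    This is the factor by which an output gradient is scaled when it is
    propagated back to the first layer of the chain.
    """
    scale = 1.0
    for z, w in zip(pre_activations, weights):
        scale *= w * sigmoid_grad(z)
    return scale


def gradient_step(w: float, grad: float, eta: float) -> float:
    # One gradient descent update: move against the gradient.
    return w - eta * grad


# Assumed toy setting: unit weights, zero pre-activations, varying depth.
depths = [1, 5, 10, 20]
scales = {d: backprop_gradient_scale([0.0] * d, [1.0] * d) for d in depths}
for d in depths:
    print(f"depth {d:2d}: gradient scale {scales[d]:.3e}")
```

In this toy setting every layer multiplies the gradient by 0.25, so the weight update produced by `gradient_step` for the earliest layer becomes negligible as the depth grows.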

Furthermore, owing to the high dimensionality of the data and, accordingly, the large number of learning parameters, the early layers of a neural network (as shown by the green shaded substructure in Figure 2) are trained significantly more slowly than the later layers (as shown by the blue shaded substructure in Figure 2), intensifying the deep training problem. In particular, the surface of the error function becomes flatter with an increasing number of weights [18] and vanishing gradients, further decreasing the training speed [9]. Finally, backpropagation with a flat error function often saturates in a local minimum, as illustrated in Figure 3. Training speed and convergence with complex deep VAE architectures are, therefore, a primary concern and the main focus of this paper.

## 2 Related work

Several methods have been proposed to mitigate the issue of training speed and convergence. For example, reasonable training speedup and convergence have recently been demonstrated with ReLU-based activation functions in deep neural networks. The traditional sigmoid activation function saturates quickly at both large positive and large negative argument values, yielding vanishing gradient values within these argument ranges. In contrast, with the ReLU activation function [1], $f(x) = \max(0, x)$, the first derivative is unity when the function argument is positive. Thus, the ReLU activation function exhibits non-linear behavior (as required for complex transformations in neural networks) and no signal degradation through multiple neural network layers. The zero derivative of the ReLU function for a negative argument, however, increases the probability of vanishing gradients for negative inputs. Saturation of backpropagation at a zero local minimum is, therefore, still a primary concern with the ReLU activation function.

Momentum-based training is yet another approach to speed up deep training and mitigate the saturation in a local minimum with sparse input data. The effectiveness of the traditional gradient descent algorithms degrades with increasing sparsity of the training data [18]. Ideally, suboptimal local minima should be avoided. In practice, however, the error function is often a complex non-convex surface with saddle points surrounded by flat regions (i.e., regions with a constant error function). Escaping these regions is a significant challenge for gradient descent algorithms. Momentum-based functions, such as Adadelta [24] and RMSprop [18], have been demonstrated to accelerate gradient descent convergence and mitigate oscillations of the algorithm caused by ravine slopes around local minima. With these techniques, momentum is accumulated for those parameters with a similar gradient direction. As a result, oscillation is reduced, and the network model converges faster.
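As a concrete illustration, the sketch below applies the commonly published RMSprop update rule (a running average of squared gradients that rescales each step) to a ravine-shaped quadratic. The objective, the starting point, and the hyperparameters `lr`, `rho`, and `eps` are assumptions chosen for the example, not values used in this paper.

```python
import math


def loss(w):
    # Assumed ravine-shaped objective: 100x steeper along y than along x.
    x, y = w
    return 0.5 * (x * x + 100.0 * y * y)


def grad(w):
    x, y = w
    return [x, 100.0 * y]


def rmsprop(w, steps=500, lr=0.01, rho=0.9, eps=1e-8):
    """Minimize `loss` with the standard RMSprop update rule."""
    w = list(w)
    cache = [0.0] * len(w)  # running average of squared gradients
    for _ in range(steps):
        g = grad(w)
        for i in range(len(w)):
            cache[i] = rho * cache[i] + (1.0 - rho) * g[i] ** 2
            # Per-parameter step, normalized by the gradient magnitude.
            w[i] -= lr * g[i] / (math.sqrt(cache[i]) + eps)
    return w


start = [1.0, 1.0]
final = rmsprop(start)
print(f"loss before: {loss(start):.3f}, loss after: {loss(final):.6f}")
```

Because each step is normalized per parameter, the steep and the shallow directions of the ravine advance at a similar pace, which is the behavior the paragraph above attributes to these optimizers.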

Sophisticated methods with adaptive learning rates have also been considered. With these methods, deep learning training converges faster with default network parameters, eliminating the need for manually adjusting the learning rates. As a result, deep training performance with sparse input data is significantly increased with these approaches.

Yet another approach commonly used with computer vision problems is transfer learning [21]. To mitigate the complexity, transfer learning approaches rely heavily on pretrained reference models. A common practice is to lock the majority of the pretrained model layers associated with fundamental low-level features, attach new layers associated with application- or object-specific features, and retrain, repurposing the learned features of the reference model. To effectively leverage transfer learning with sparse data sets, the number of locked layers should be increased with the increasing sparsity of the data. In the limit, all the pretrained model layers are locked and only the additional new layers are trained.

The method proposed in this paper borrows from the transfer learning approach, as described in Section 3. The proposed method is experimentally verified and compared with existing state-of-the-art methods, yielding a significant increase in training speedup and performance, as described in Section 4.

## 3 Method

### 3.1 ML objective

A general image inpainting problem is formulated in this section as a supervised ML task. Let $X$ and $Y$ be the sets of, respectively, learning objectives and the corresponding inpainting solutions. For a single ML objective, $x \in X$, the corresponding output image, $y \in Y$, is an $n \times m$ bitmap of pixels indexed by $(i, j)$. Each pixel within an output image, $y_{ij}$, is associated with a binary score, $0$ or $1$, if the pixel is, respectively, excluded from or included within the output image.

The primary goal is to train an ML system that provides the conditional probability of each pixel, $y_{ij}$, to be either included within (i.e., $y_{ij} = 1$) or excluded from (i.e., $y_{ij} = 0$) the preferred ML solution,

$$P(y_{ij} = s \mid x, \theta), \quad s \in \{0, 1\}, \tag{1}$$

where $\theta$ is the trained model of ML weights. The training data set comprises synthetic ML objectives in the bitmap representation and the corresponding reference image outputs (i.e., the true labels). Note that the model is trained based on the joint probability distribution of the input features, $x$, and output observations, $y$, yielding a generative network.

### 3.2 Concept overview

In many image-to-image transformations, training sets are available with various resolution levels. While models trained on high-resolution data often suffer from vanishing gradients and other limitations (see Section 1), lower-resolution models typically exhibit high performance and convergence due to the lower number of layers and model parameters [6, 9, 18]. The proposed solution leverages iterative training with gradually increasing dimensionality as well as the principles of repurposing a pretrained model, as demonstrated with transfer learning methods.

To train a stacked deep VAE, an inner low-dimensional substructure of the autoencoder is first identified (referred to here as the core VAE). The objective in the first iteration is to determine an inner substructure, which (due to the lower dimensionality of the inner VAE layers) enables successful model training on sparse data. Once the core autoencoder is identified, the outer, high-dimensional layers are stripped, and the core model is trained on a low-dimensional data set. In subsequent iterations, the stripped layers are progressively added back, and the model is retrained on training sets with increasingly higher resolutions. Finally, a complete VAE is trained on the original, high-resolution data set. This approach is illustrated in Figure 4 with an eight-step training process, yielding the lowest-resolution core network (as shown with grey shade), intermediate networks with progressively higher resolutions, and the final high-resolution VAE network, comprising layers shared among all the networks.

#### VAE loss function

The mean square error (MSE) loss function is typically used with autoencoders for evaluating the sum of squared distances between the predicted values and the true labels [17]. This loss function is, however, less effective with highly imbalanced data, where an empty output exhibits a small MSE and thus appears to be a legitimate solution. In a typical routing problem, an expected routing output with a single, often quite short path exhibits high similarity with an empty (i.e., no path) solution, yielding low routability with the MSE metric. The problem escalates with increasing input resolution, further increasing the VAE training complexity and degrading the accuracy of the trained models.
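The effect can be reproduced with a small numerical sketch. The 32 x 32 grid and the 10-tile route below are hypothetical values chosen for illustration; they are not the resolutions or routes used in the experiments.

```python
def mse(pred, target):
    """Mean square error between two equally sized 2D bitmaps."""
    n = len(target) * len(target[0])
    return sum(
        (p - t) ** 2
        for pred_row, target_row in zip(pred, target)
        for p, t in zip(pred_row, target_row)
    ) / n


size = 32  # assumed grid resolution, for illustration only
target = [[0] * size for _ in range(size)]
for col in range(5, 15):
    target[10][col] = 1  # a short horizontal route of 10 tiles

empty = [[0] * size for _ in range(size)]  # the "no path" prediction

route_fraction = sum(map(sum, target)) / (size * size)
mse_empty = mse(empty, target)
print(f"fraction of routed tiles: {route_fraction:.4f}")
print(f"MSE of the empty output:  {mse_empty:.4f}")
```

With fewer than 1% of the tiles routed, the empty prediction already scores an MSE below 0.01, even though it connects no pins.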

To account for the specifics of routing problems with an imbalanced data set, a custom loss function is proposed. Note that this function is also relevant to other ML problems, such as image inpainting, super resolution, and style transfer [13, 25, 14, 4, 22]. The proposed loss function is designed to penalize the model if the number of tiles included by the model within a routing path is different from the number of tiles in a reference routing path. The penalties for exceeding and falling short of the reference number of tiles differ. A path with redundant tiles is not optimal in terms of its length. It also reduces the routing capacity of the 2D space beyond the expected. Yet, such a path is considered to be legal if it connects all the input/output pins. Alternatively, if the model includes fewer tiles than the reference path and the reference path is optimal, some components in the model solution are disconnected and the path is, therefore, incorrect. In particular, this penalization pertains to the "all-zeros" local minimum. The proposed loss function accounts for the difference in the number of tiles with one penalty rate, and for incomplete routes with an additional penalty rate, yielding the following loss function for a predicted routing output and a reference routing solution,

(2) |

where

(3) |

(4) |

Here is the Heaviside [23] step function. In this paper, the proposed loss function is used with and .
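To make the idea of an asymmetric tile-count penalty concrete, the following sketch shows one possible loss of this kind. It is an illustrative assumption and not the exact form of (2), (3), and (4): the combination with MSE, the normalization by the total number of tiles, the function name `sketch_loss`, and the rates `alpha` and `beta` are all hypothetical.

```python
def heaviside(x: float) -> float:
    # Step function: 1 for positive arguments, 0 otherwise.
    return 1.0 if x > 0 else 0.0


def sketch_loss(pred, ref, alpha=1.0, beta=10.0):
    """Hypothetical asymmetric loss: MSE plus tile-count penalties.

    alpha weights any tile-count mismatch; beta adds an extra penalty
    only when the prediction has fewer tiles than the reference.
    """
    flat_pred = [p for row in pred for p in row]
    flat_ref = [r for row in ref for r in row]
    total = len(flat_ref)
    mse = sum((p - r) ** 2 for p, r in zip(flat_pred, flat_ref)) / total
    n_pred, n_ref = sum(flat_pred), sum(flat_ref)
    mismatch = abs(n_pred - n_ref) / total
    deficit = heaviside(n_ref - n_pred) * (n_ref - n_pred) / total
    return mse + alpha * mismatch + beta * deficit


# Assumed 4 x 4 example with a 3-tile reference route.
ref = [[0, 0, 0, 0], [1, 1, 1, 0], [0, 0, 0, 0], [0, 0, 0, 0]]
empty = [[0] * 4 for _ in range(4)]
redundant = [[0, 0, 0, 0], [1, 1, 1, 1], [0, 0, 0, 0], [0, 0, 0, 0]]

loss_exact = sketch_loss(ref, ref)
loss_empty = sketch_loss(empty, ref)
loss_redundant = sketch_loss(redundant, ref)
print(loss_exact, loss_empty, loss_redundant)
```

Under these assumed rates, the all-zeros output is penalized far more heavily than a legal path with one redundant tile, which is the qualitative behavior described in this section.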

#### Formal training approach

The progressive training process starts with identifying the core autoencoder. This core VAE is trained with a training set of the lowest resolution. After training, the layers of the core VAE are locked and the next, larger network is trained with a training set of the next higher resolution. Note that the inner layers of this network are already trained and locked, and only the outer layers are trained. After all the intermediate networks are trained with the individual training sets of progressively higher resolution, the final VAE is trained with the desired resolution. The pseudocode of the training algorithm is described in Algorithm 1.
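The control flow described above can be sketched without any ML framework. The `Layer` and `Autoencoder` classes, the `train_step` callback, and the resolutions 8, 16, and 32 are hypothetical placeholders introduced for this example; in a real implementation the callback would run the optimizer on the training set of the given resolution.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Layer:
    name: str
    trainable: bool = True


@dataclass
class Autoencoder:
    layers: List[Layer] = field(default_factory=list)

    def lock(self) -> None:
        # Freeze every layer trained so far.
        for layer in self.layers:
            layer.trainable = False

    def wrap(self, level: int) -> None:
        # Attach one new outer layer on the encoder side and one on the
        # decoder side, around the already trained inner network.
        self.layers = (
            [Layer(f"enc_{level}")] + self.layers + [Layer(f"dec_{level}")]
        )


def progressive_train(
    resolutions: List[int],
    train_step: Callable[[Autoencoder, int], None],
) -> Autoencoder:
    """Train the core at the lowest resolution, then grow outward."""
    model = Autoencoder([Layer("core_enc"), Layer("core_dec")])
    train_step(model, resolutions[0])
    for level, resolution in enumerate(resolutions[1:], start=1):
        model.lock()
        model.wrap(level)
        train_step(model, resolution)
    return model


# Stand-in for real training: record which layers would be updated.
log = []


def record(model: Autoencoder, resolution: int) -> None:
    log.append((resolution, [l.name for l in model.layers if l.trainable]))


final = progressive_train([8, 16, 32], record)
for resolution, trained in log:
    print(resolution, trained)
```

In each iteration only the newly attached outer layers remain trainable, which mirrors the locking of the inner layers described in the text.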

Preliminary route-free training is another technique developed in this work for mitigating the probability of convergence to the "all-zeros" local minimum. For this purpose, a route-free training set is generated. The true label for each data point in this set comprises the tile indices of the input pins (and no additional routed tiles). Intuitively, this route-free training promotes inclusion of the pin tiles within the routing path. As a result, the solution space is shifted towards the non-flat regions of the loss function hyperplane, reducing the probability of convergence to the "all-zeros" solution.
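A route-free label of this kind can be built as in the following sketch. The helper name `route_free_label`, the 8 x 8 grid, and the pin coordinates are assumptions made for the example.

```python
def route_free_label(height, width, pins):
    """Return a bitmap that marks only the pin tiles as included."""
    label = [[0] * width for _ in range(height)]
    for row, col in pins:
        label[row][col] = 1
    return label


# Assumed two-pin net on an 8 x 8 grid of tiles.
pins = [(1, 1), (4, 6)]
label = route_free_label(8, 8, pins)
for row in label:
    print("".join(str(v) for v in row))
```

The resulting label is non-empty but contains no routed tiles, so a model pretrained on such labels is pushed away from the all-zeros output before it sees complete routes.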

Model training with stochastic gradient descent (SGD) and RMSprop is shown in Figure 5 with the MSE function and the customized function (see (2), (3), and (4)), exhibiting the convergence trends described in this section.

## 4 Experimental data

### 4.1 Experimental results

The proposed approach is demonstrated on a practical routing problem [11]. Routing is a major phase of the electronic circuit design process. During this phase, the electronic components that have previously been placed within a restricted space are connected with physical wires with respect to their intended functionality. For example, to implement a Boolean function (the inversion of the OR of two input signals), the two signal pins are connected with logic nets to the individual inputs of an OR gate, the output of the OR gate is connected with a net to the input of an inverter gate, and the output of the inverter gate is connected to the output pin of the system, as illustrated in Figure 6.

During the routing process, all the logic nets are implemented as physical routes within the technology constraints (e.g., wires can only be routed in vertical and horizontal directions, but not diagonally). The wire route within a constrained area with limited net capacity exhibits a complex path even with this simple two-pin function. In contrast, modern microprocessors comprise billions of Boolean gates and complex technology and routing constraints [15]. Routing in these systems is an NP-hard problem and a critical challenge for next-generation high-performance nanoelectronic systems [5].

In this paper, an input image is represented by an array of pixels (tiles). Each tile exhibits several characteristics, such as color channels and other special constraints. The input of an image inpainting problem is a set of per-pixel constraints and a set of all the logic nets, as defined by the physical input and output pin positions. The output of the routing problem is a set of tiles that should be included within the routing paths of the individual input nets. Note that optimal routing of a net with more than two pins (e.g., a net that connects the output of an inverter to the inputs of two other inverters) is an NP-hard problem. Traditionally, routing problems are solved with approximation methods, yielding suboptimal solutions [3]. We propose to solve this critical problem by formulating it as an ML problem. Consider the following definitions.

A general routing problem is mapped here onto the classical image inpainting problem, as defined in Section 3. Let $X$ and $Y$ be the sets of, respectively, single-path routing objectives (as defined by all the starting and ending points of a single path) and the corresponding single-net routing solutions. A routing objective is to find the preferred routing path of tiles, $y$, connecting a certain number of placed pins within a given vertical and horizontal per-tile capacity, as defined by $x$. For any routing objective, $x \in X$, the corresponding single-net routed output image, $y \in Y$, is an $n \times m$ bitmap of tiles indexed with their physical locations, $(i, j)$. Each tile within an output image, $y_{ij}$, is associated with a binary score, $0$ or $1$, if the routing tile is, respectively, excluded from or included within a routing path.

Note that a primary objective is maximizing the routability (i.e., the number of routed paths), while minimizing the overall wirelength (i.e., the total number of tiles included within all the routed paths) of the routing solution. For each routed wire, the number of tiles included within a path is significantly lower than the total number of tiles, yielding highly sparse and imbalanced data. These definitions are illustrated in Figure 7.

Examples of correct and incorrect routing solutions are shown in Figure 8.

A synthetically generated, exhaustively routed training set was produced with an auxiliary state-of-the-art router [16], and a regular VAE model was trained with the RMSprop [18] training method. With this approach, the training process did not converge, but saturated at a local minimum. The output of this model is, therefore, of limited use for routing high-performance computing devices. Due to the sparsity of the data, the most common outcome of the training is the saturation of the model in a local minimum that corresponds to an empty output with all output tiles in the same (excluded) class. The routing problem is selected in this paper as a demonstration vehicle due to its high impact on the microprocessor industry and our ability to perform perceptual evaluation of the results in addition to synthetic ML performance metrics. To verify the correctness of the models, individual routing solutions are traversed with a breadth-first search (BFS) algorithm, evaluating the connectivity of all the pins and nets. The number of successfully routed nets is used as the main metric due to its qualitative (i.e., it indicates the routing ability of the system) and quantitative characteristics. If the model tends to include fewer tiles than needed, pins may be left unconnected, and the number of routed nets is lower than the reference. Similarly, if the model tends to include more tiles in the routing path, the capacity of the design saturates faster, making it impossible to route later nets. In this paper, the routability of the individual models is measured as a percentage of the routability of a state-of-the-art deterministic router, FastRoute 4.1.
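A connectivity check of this kind can be sketched as follows. The function names, the 4-neighbour (vertical and horizontal) adjacency, and the toy grids are assumptions made for the example rather than details of the evaluation code used in this work.

```python
from collections import deque


def is_routed(grid, pins):
    """Return True if all pins lie on one connected set of routed tiles.

    grid is a 2D bitmap (1 = tile included in the path); pins is a list
    of (row, col) tuples. Only vertical and horizontal moves are allowed.
    """
    if not pins:
        return True
    if any(grid[r][c] != 1 for r, c in pins):
        return False
    height, width = len(grid), len(grid[0])
    start = pins[0]
    seen = {start}
    queue = deque([start])
    while queue:
        r, c = queue.popleft()
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if (
                0 <= nr < height
                and 0 <= nc < width
                and grid[nr][nc] == 1
                and (nr, nc) not in seen
            ):
                seen.add((nr, nc))
                queue.append((nr, nc))
    return all(pin in seen for pin in pins)


def routability_percent(model_routed: int, reference_routed: int) -> float:
    # Routed nets of the model relative to the reference router.
    return 100.0 * model_routed / reference_routed


connected = [[1, 1, 1], [0, 0, 1], [0, 0, 1]]
broken = [[1, 0, 1], [0, 0, 1], [0, 0, 1]]
pins = [(0, 0), (2, 2)]
print(is_routed(connected, pins), is_routed(broken, pins))
```

A net counts as routed only if the search started at one pin reaches every other pin, so an empty or interrupted path is reported as a failure.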

### 4.2 Implementation and performance comparison

The ISPD'98 benchmark "ibm02.modified" [2] is used for evaluation. With three features per tile (vertical and horizontal available net capacity and a binary pin metric), the input dimensionality is three times the number of tiles. Similarly, an output space with one dimension per tile is required to describe the inclusion or exclusion of each tile from the individual routing solutions. The architecture and a detailed description of each layer of the proposed VAE are given in Figure 10. A simple VAE, a GAN, and a progressive VAE network are implemented and evaluated with this benchmark. All the networks described in this section are prototyped in Python 3.7 using the Keras neural-network library [8] with a TensorFlow backend.

### 4.3 Variational autoencoder

A stacked VAE is designed to solve the critical problem of routing in nanoelectronic devices by utilizing ML imaging methods and the parallelization provided by GPU platforms. Input and output spaces are defined as described in Section 3. This model is trained on a training set of 12,000 nets defined within the routing space. Input nets are generated with random pin and obstacle locations. Here, synthetic obstacles are used to emulate the already placed wires, fully or partially occupying the available net capacity at certain tiles. The reference outputs (i.e., truth labels) are obtained with an auxiliary state-of-the-art router, FastRoute 4.1. The proposed VAE is trained on this synthetic data set. The fastest convergence rate is observed with the RMSprop method with a cyclical learning rate. The trained VAE is, however, unable to predict correct routing solutions either on new, previously unseen nets or on those nets selected from the training set. In all these cases, a trivial output of empty wiring paths is produced by the trained autoencoder. An example of a typical routing output with this method is shown in Figure 8(a). Benchmark routability with the simple VAE is, therefore, zero per cent of the routability with the reference router on the same benchmark [2]. Note that the trivial cases where all pins are placed in the same tile and no routing is required are excluded from the routability calculations.

### 4.4 Generative Adversarial Network

GAN systems are commonly used to generate previously unseen images. With these methods, training is approached as a minimax game between a generating and a discriminating mechanism. The objective of the generator is to produce output similar to the reference output. Conversely, the objective of the discriminator is to distinguish between generated outputs and reference outputs. The structure of the proposed GAN is shown in Figure 10.
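For reference, the minimax game is commonly written in the GAN literature [7] in the form below. The notation is assumed here and is not taken from this paper: $G$ denotes the generator, $D$ the discriminator, $x$ the generator input, and $y$ a reference output.

$$\min_{G} \max_{D} \; \mathbb{E}_{y}\left[\log D(y)\right] + \mathbb{E}_{x}\left[\log\left(1 - D(G(x))\right)\right]$$

The discriminator maximizes this value by assigning high scores to reference outputs and low scores to generated ones, while the generator minimizes it by producing outputs that the discriminator scores as references.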

The network is trained on the same input data as the simple autoencoder. The training process has not converged after 10,000 iterations, yielding a total runtime of ten hours on an Nvidia GTX1080 GPU. Oscillations between two local minima are observed [7], exhibiting a typical mode collapse behavior of a GAN system [19]. As a result, only a few modes of the multimodal data are generated, producing perceptually variable routing outputs, as illustrated in Figures 8(b) and 8(c). All the results, however, exhibit low performance based on the routability metric and perceptual evaluation. A total of 2.7% of all non-trivial nets are routed with this network, which is a better result than that of the VAE router (0% of non-trivial nets routed), but is still insufficient for use in real applications.

### 4.5 Progressive VAE

The proposed progressive VAE network is designed to utilize all available high-, intermediate-, and low-resolution training sets. Note that the time complexity of generating reference net routes for model training grows with the data resolution. For example, synthesizing an exhaustively routed training set at the full resolution is significantly more time consuming than synthesizing four training sets at lower resolutions. The low-resolution core router comprises the same layers as layers 7-15 in Figure 10 and an additional convolutional output layer. This router is trained on a low-resolution data set. With every progressive VAE iteration, an additional convolutional and pooling layer is attached to the input of the pretrained VAE network and an additional upsampling and deconvolutional layer is appended at the end of the network. Finally, an output layer is added to convert the dimensionality of the last convolutional layer to the routing output dimensionality of one.

## 5 Conclusion

This research introduces a new approach for iteratively training a VAE on highly sparse, imbalanced data with progressively increasing training data resolution. The proposed method has been evaluated on routing benchmarks [2], successfully generating routes between placed pins in a constrained 2D space with limited routing capacity. The proposed method exhibits faster convergence and 96% routability, as compared with 0% and 2.7% routability with the simple VAE and GAN networks, respectively.

### References

- [1] (2014) Learning activation functions to improve deep neural networks. arXiv preprint arXiv:1412.6830.
- [2] (1998) The ISPD98 circuit benchmark suite. In Proceedings of the 1998 International Symposium on Physical Design, pp. 80–85.
- [3] (2011) NCTU-GR: Efficient simulated evolution-based rerouting and congestion-relaxed layer assignment on 3-D global routing. IEEE Transactions on Very Large Scale Integration (VLSI) Systems 20 (3), pp. 459–472.
- [4] (2017) Learning diverse image colorization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6837–6845.
- [5] (2013) Advances in Steiner trees. Vol. 6, Springer Science & Business Media.
- [6] (2017) GAN and VAE from an optimal transport point of view. arXiv preprint arXiv:1706.01807.
- [7] (2016) NIPS 2016 tutorial: Generative adversarial networks. arXiv preprint arXiv:1701.00160.
- [8] (2017) Deep learning with Keras. Packt Publishing Ltd.
- [9] (1998) The vanishing gradient problem during learning recurrent neural nets and problem solutions. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems 6 (02), pp. 107–116.
- [10] (2017) Image-to-image translation with conditional adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1125–1134.
- [11] (2011) VLSI physical design: from graph partitioning to timing closure. Springer Science & Business Media.
- [12] (2012) ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pp. 1097–1105.
- [13] (2017) Photo-realistic single image super-resolution using a generative adversarial network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4681–4690.
- [14] (2015) Deep multi-scale video prediction beyond mean square error. arXiv preprint arXiv:1511.05440.
- [15] (2008) The coming of age of (academic) global routing. In Proceedings of the 2008 International Symposium on Physical Design, pp. 148–155.
- [16] (2012) FastRoute: an efficient and high-quality global router. VLSI Design 2012.
- [17] (1980) Some comments on the minimum mean square error as a criterion of estimation. Technical report, Pittsburgh Univ PA Inst for Statistics and Applications.
- [18] (2016) An overview of gradient descent optimization algorithms. arXiv preprint arXiv:1609.04747.
- [19] (2017) VEEGAN: reducing mode collapse in GANs using implicit variational learning. In Advances in Neural Information Processing Systems, pp. 3308–3318.
- [20] (2014) Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research 15 (1), pp. 1929–1958.
- [21] (2010) Transfer learning. In Handbook of Research on Machine Learning Applications and Trends: Algorithms, Methods, and Techniques, pp. 242–264.
- [22] (2016) Non-local auto-encoder with collaborative stabilization for image restoration. IEEE Transactions on Image Processing 25 (5), pp. 2117–2129.
- [23] (2002) Heaviside step function.
- [24] (2012) ADADELTA: an adaptive learning rate method. arXiv preprint arXiv:1212.5701.
- [25] (2016) Loss functions for image restoration with neural networks. IEEE Transactions on Computational Imaging 3 (1), pp. 47–57.