Semi-Supervised Neural Architecture Search
Neural architecture search (NAS) relies on a good controller to generate better architectures or predict the accuracy of given architectures. However, training the controller requires both abundant and high-quality pairs of architectures and their accuracy, while it is costly to evaluate an architecture and obtain its accuracy. In this paper, we propose SemiNAS, a semi-supervised NAS approach that leverages numerous unlabeled architectures (without evaluation and thus nearly no cost) to improve the controller. Specifically, SemiNAS 1) trains an initial controller with a small set of architecture-accuracy data pairs; 2) uses the trained controller to predict the accuracy of a large number of architectures (without evaluation); and 3) adds the generated data pairs to the original data to further improve the controller. SemiNAS has two advantages: 1) it reduces the computational cost under the same accuracy guarantee; 2) it achieves higher accuracy under the same computational cost. On the NASBench-101 benchmark dataset, it discovers a top architecture after evaluating roughly architectures, with only computational cost compared with regularized evolution and gradient-based methods. On ImageNet, it achieves top-1 error rate (under the mobile setting) using 4 GPU-days for search. We further apply it to the LJSpeech text to speech task and it achieves intelligibility rate in the low-resource setting and test error rate in the robustness setting, with , improvements over the baseline respectively. Our code is available at https://github.com/renqianluo/SemiNAS.
The work was done when the first author was an intern at Microsoft Research Asia.
Neural architecture search (NAS) for automatic architecture design has been successfully applied in several tasks including image classification and language modeling Zoph et al. (2018); So et al. (2019); Ghiasi et al. (2019). NAS typically contains two components: a controller (also called a generator) that controls the generation of new architectures, and an evaluator that trains candidate architectures and evaluates their accuracy. The controller learns from the resulting architecture-accuracy pairs.
However, collecting such architecture-accuracy pairs is expensive, since the evaluator must train each architecture to obtain an accurate estimate of its accuracy, and this training accounts for most of the computational cost in NAS. Popular methods usually consume hundreds to thousands of GPU days to discover good architectures Zoph & Le (2016); Real et al. (2018); Luo et al. (2018). To address this problem, one-shot NAS Bender et al. (2018); Pham et al. (2018); Liu et al. (2018); Xie et al. (2018) uses a supernet to include all candidate architectures via weight sharing and trains the supernet to reduce the training time. While this greatly reduces the computational cost, the quality of the training data (architectures and their corresponding accuracy) for the controller is degraded Sciuto et al. (2019), and thus these approaches suffer from accuracy decline on downstream tasks.
In various scenarios with limited labeled training data, semi-supervised learning Zhu & Goldberg (2009) is a popular approach to leverage unlabeled data to boost the training accuracy. In the scenario of NAS, unlabeled architectures can be obtained through random generation, mutation Real et al. (2018), or simply going through the whole search space Wen et al. (2019), which incur nearly zero additional cost. Inspired by semi-supervised learning, in this paper, we propose SemiNAS, a semi-supervised approach for NAS that leverages a large number of unlabeled architectures to help the training of the controller. Specifically, SemiNAS 1) trains an initial controller with a small set of architecture-accuracy data pairs; 2) uses the trained controller to predict the accuracy of a large number of unlabeled architectures; and 3) adds the generated architecture-accuracy pairs to the original data to further improve the controller.
SemiNAS can be applied to many NAS algorithms. We take the neural architecture optimization (NAO) (Luo et al., 2018) algorithm as an example, since NAO has the following advantages: 1) it takes architecture-accuracy pairs as training data to train a predictor to predict the accuracy of architectures, which can be directly reused by SemiNAS; 2) it supports both conventional methods which train each architecture from scratch Zoph et al. (2018); Real et al. (2018); Luo et al. (2018) and one-shot methods which train a supernet with weight sharing Pham et al. (2018); Luo et al. (2018); and 3) it is based on gradient optimization which has shown better effectiveness and efficiency. Although we implement SemiNAS on NAO, it can easily be applied to other NAS methods, such as reinforcement learning based methods Zoph et al. (2018); Pham et al. (2018) and evolutionary algorithm based methods Real et al. (2018).
SemiNAS shows advantages over both conventional NAS and one-shot NAS. Compared with conventional NAS, it significantly reduces computational cost to achieve similar accuracy, and achieves better accuracy with similar cost. Specifically, on the NASBench-101 benchmark, SemiNAS achieves similar accuracy (, ranking top ) as regularized evolution Real et al. (2018) and gradient based methods Luo et al. (2018) using only computational cost of them. Meanwhile, it discovers an architecture with accuracy (ranking top ) surpassing all the baselines when evaluating the same number of architectures (with the same computational cost). Compared with one-shot NAS, SemiNAS achieves higher accuracy using similar computational cost. For image classification, within GPU days for search, we achieve top-1 error rate on ImageNet under the mobile setting, which is the same as the current state-of-the-art. For text to speech (TTS), using GPU days for search, SemiNAS achieves intelligibility rate in the low-resource setting and sentence error rate in the robustness setting, which outperforms the human-designed model by and points respectively.
Our contributions can be summarized as follows:
We propose SemiNAS, a semi-supervised approach for NAS, which leverages a large number of unlabeled architectures to help the training of the controller. SemiNAS can reduce computational cost to achieve similar accuracy, and achieve higher accuracy with similar computational cost.
The effectiveness of SemiNAS is verified through experiments on image classification tasks including NASBench-101 (CIFAR) and ImageNet, as well as text to speech tasks including low-resource and robustness settings.
To the best of our knowledge, we are the first to develop NAS algorithms on text to speech (TTS) task. We carefully design the search space and search metric for TTS, and achieve significant improvements compared to human-designed architectures. We believe that our designed search space and metric are helpful for future studies on NAS for TTS.
2 Related Work
From the perspective of the computational cost of training candidate architectures, previous works on NAS can be categorized into conventional NAS and one-shot NAS.
Conventional NAS includes Zoph & Le (2016); Zoph et al. (2018); Real et al. (2018); Luo et al. (2018), which achieve significant improvements on several benchmark datasets. Obtaining the accuracy of the candidate architectures is expensive in conventional NAS, since they train every single architecture from scratch and usually require thousands of architectures to train. The total cost is usually more than hundreds of GPU days Zoph et al. (2018); Real et al. (2018); Luo et al. (2018), which is impracticable for most research institutions and companies.
To reduce the huge cost of NAS, one-shot NAS was proposed with the help of a weight sharing mechanism. Bender et al. (2018) propose to include all candidate operations in the search space within a supernet and share parameters among candidate architectures. Each candidate architecture is a sub-graph in the supernet and only activates the parameters associated with it. The algorithm trains the supernet rather than training each architecture from scratch, and then evaluates the accuracy of candidate architectures by the corresponding sub-graphs in the supernet. ENAS Pham et al. (2018) leverages the idea of weight sharing and searches by reinforcement learning. NAO Luo et al. (2018) also incorporates the idea of weight sharing into its gradient optimization based search method. DARTS Liu et al. (2018) searches via gradient optimization on a supernet. ProxylessNAS Cai et al. (2018) uses gating methods to reduce the memory cost of the supernet and therefore directly searches on the target task and device. Stamoulis et al. (2019) and Guo et al. (2019) propose to traverse one path in the supernet during the search.
Such a weight sharing mechanism successfully cuts down the computational cost to less than GPU days Pham et al. (2018); Liu et al. (2018); Cai et al. (2018); Xu et al. (2019). However, the supernet requires careful design and its training needs careful tuning. Moreover, it shows inferior performance and reproducibility compared to conventional NAS. One main cause is the short training time and inadequate update of individual architectures Li & Talwalkar (2019); Sciuto et al. (2019), which leads to an inaccurate ranking of the architectures, and provides relatively low-quality architecture-accuracy pairs for the controller. Considering that the key to a NAS algorithm is to discover better architectures in the search space based on the accuracy ranking, such one-shot NAS suffers from a decline in both accuracy and reproducibility Li & Talwalkar (2019); Sciuto et al. (2019).
To sum up, there exists a trade-off between computational cost and accuracy. We formalize the computational cost of the evaluator by , where is the number of architecture-accuracy pairs for the controller to learn, and is the training time of each candidate architecture. In conventional NAS, the evaluator trains each architecture from scratch and the is typically several epochs
In this section, we first describe the semi-supervised training of the controller, and then introduce the implementation of the proposed SemiNAS algorithm.
3.1 The Semi-Supervised Training of the Controller
SemiNAS trains the controller through semi-supervised learning. Specifically, we reduce the number of evaluated architectures but utilize a large number of unevaluated architectures to improve the controller. In this paper, we choose the controller used in NAO Luo et al. (2018), which consists of an encoder, a predictor, and a decoder. Such a controller is suitable because it learns directly from architecture-accuracy pairs and predicts the accuracy of an architecture at inference time, so it can predict accuracy for numerous architectures. More details of NAO will be introduced in Section 3.2.
In SemiNAS, the encoder and the predictor of the controller are leveraged to predict the accuracy of given architectures. The encoder is implemented as an LSTM network that maps the discrete architecture to a continuous embedding representation, and the predictor uses several fully connected layers to predict the accuracy of the architecture, taking the continuous embedding as input. The decoder decodes the continuous embedding back to a discrete architecture, which will be described in detail in Section 3.2. Mathematically, the loss function to train the encoder and the predictor can be described as:
where is the corresponding accuracy obtained from the evaluator.
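For concreteness, a squared-error objective consistent with this description can be written as below. This is a sketch, not necessarily the exact form of Eqn. 1, and every symbol is an assumption introduced here for illustration: $f_e$ is the encoder, $f_p$ the predictor, $x$ an architecture in the labeled set $X$, and $y$ its accuracy from the evaluator.

```latex
\mathcal{L}_p = \sum_{x \in X} \bigl( y - f_p(f_e(x)) \bigr)^2
```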
The semi-supervised learning of the controller can be decomposed into steps:
Generate architectures from the search space. Use the evaluator (conventional or weight sharing) to train and evaluate these architectures, and collect the corresponding accuracy . Train the encoder and predictor of the controller following Eqn. 1 with labeled dataset .
Generate unlabeled architectures and use the trained encoder and predictor to predict their accuracy as delegates of their true accuracy:
and get the generated dataset .
Combine the two datasets and together to train a better controller.
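The three steps above can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the released implementation: the search space (`NUM_LAYERS`, `NUM_OPS`), the closed-form `evaluate` function standing in for the expensive evaluator, and the additive `Predictor` (used instead of the LSTM encoder and fully connected predictor) are all hypothetical.

```python
import random

NUM_LAYERS, NUM_OPS = 6, 3  # hypothetical toy search space


def random_arch(rng):
    """Sample an architecture as one operation index per layer."""
    return tuple(rng.randrange(NUM_OPS) for _ in range(NUM_LAYERS))


def evaluate(arch):
    """Stand-in for the expensive evaluator (toy closed-form 'accuracy')."""
    return 0.5 + 0.05 * sum(1 for op in arch if op == 2) + 0.02 * (arch[0] == arch[-1])


class Predictor:
    """Additive per-layer, per-op score table trained by SGD on squared error."""

    def __init__(self):
        self.bias = 0.0
        self.w = [[0.0] * NUM_OPS for _ in range(NUM_LAYERS)]

    def predict(self, arch):
        return self.bias + sum(self.w[i][op] for i, op in enumerate(arch))

    def fit(self, pairs, epochs=200, lr=0.05, rng=None):
        rng = rng or random.Random(0)
        pairs = list(pairs)
        for _ in range(epochs):
            rng.shuffle(pairs)
            for arch, acc in pairs:
                err = self.predict(arch) - acc
                self.bias -= lr * err
                for i, op in enumerate(arch):
                    self.w[i][op] -= lr * err


def semi_supervised_train(n_labeled, m_unlabeled, seed=0):
    rng = random.Random(seed)
    # Step 1: evaluate a small labeled set and train the initial predictor.
    labeled = [(a, evaluate(a)) for a in (random_arch(rng) for _ in range(n_labeled))]
    predictor = Predictor()
    predictor.fit(labeled, rng=rng)
    # Step 2: pseudo-label many unlabeled architectures with the predictor.
    unlabeled = [random_arch(rng) for _ in range(m_unlabeled)]
    pseudo = [(a, predictor.predict(a)) for a in unlabeled]
    # Step 3: continue training on the union of labeled and pseudo-labeled pairs.
    predictor.fit(labeled + pseudo, rng=rng)
    return predictor, labeled, pseudo
```

Only step 1 calls `evaluate`, so the evaluation cost depends on the number of labeled architectures alone. The pseudo-labeled set can be made much larger at the cost of predictor inference.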
SemiNAS brings advantages over both conventional NAS and one-shot NAS, which can be illustrated under the computational cost formulation . Compared to conventional NAS which is costly, SemiNAS can reduce the computational cost with smaller but using more additional unlabeled architectures to avoid accuracy drop. Compared to one-shot NAS which has inferior accuracy, SemiNAS can improve the accuracy by using more unlabeled architectures under the same computational cost . In this setting, in order to get more accurate evaluation of architectures and improve the quality of architecture-accuracy pairs, we extend the average training time for each individual architecture to obtain better initial training data. Accordingly, we reduce the number of architectures to be trained (i.e., ) to keep the total budget unchanged.
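The budget trade-off in the one-shot comparison is simple arithmetic, sketched below. It assumes the evaluator cost is the number of evaluated architectures multiplied by the training time per architecture. The function name and the concrete budget values are hypothetical.

```python
def architectures_within_budget(total_budget, time_per_arch):
    """Number of architectures that fit when each costs `time_per_arch` units."""
    if time_per_arch <= 0:
        raise ValueError("time_per_arch must be positive")
    return int(total_budget // time_per_arch)


budget = 1000  # hypothetical total evaluator budget, in epochs
short_training_n = architectures_within_budget(budget, time_per_arch=1)
longer_training_n = architectures_within_budget(budget, time_per_arch=10)
```

Training each architecture ten times longer leaves room for one tenth as many labeled architectures under the same budget. SemiNAS fills that gap with unlabeled architectures.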
3.2 The Implementation of SemiNAS
We now describe the implementation of our SemiNAS algorithm. We take NAO Luo et al. (2018) as our implementation since it has the following advantages: 1) it contains an encoder-predictor-decoder framework, where the encoder and the predictor can predict the accuracy of a large number of architectures without evaluation; 2) it performs search by applying gradient ascent, which has shown better effectiveness and efficiency; 3) it can incorporate both conventional NAS (whose evaluator trains each architecture from scratch) and one-shot NAS (whose evaluator builds a supernet to train all the architectures via weight sharing).
As briefly described in the last subsection, NAO Luo et al. (2018) uses an encoder-predictor-decoder framework as the controller, where the encoder maps the discrete architecture representation into a continuous representation and the predictor predicts its accuracy. It then uses a decoder, implemented as a multi-layer LSTM, to reconstruct the original discrete architecture from the continuous representation in an auto-regressive manner. The training of the controller aims to minimize the prediction loss and the structure reconstruction loss:
where follows Eqn. 1 and is the cross entropy loss between the output of the decoder and the ground-truth architecture. is a trade-off parameter.
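One common way to write such a trade-off, and the form assumed here for illustration, is a convex combination of the two losses. The symbols are assumptions: $\mathcal{L}_p$ is the prediction loss, $\mathcal{L}_{rec}$ the reconstruction loss, and $\lambda \in [0, 1]$ the trade-off parameter.

```latex
\mathcal{L} = \lambda \, \mathcal{L}_p + (1 - \lambda) \, \mathcal{L}_{rec}
```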
After the controller is trained, for any given architecture as the input, NAO moves its representation towards the direction of the gradient ascent of the accuracy prediction to get a new and better continuous representation as follows:
where is a step size. can get higher prediction accuracy after gradient ascent. Finally it uses the decoder to decode into a new architecture , which is supposed to be better than architecture . The process of the architecture optimization is performed for iterations, where newly generated architectures at the end of each iteration are added to the architecture pool for evaluation and further used to train the controller in the next iteration. Finally, the best performing architecture in the architecture pool is selected out as the final result.
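The gradient-ascent update on the continuous representation can be illustrated with a toy differentiable predictor. The sketch below is an assumption-laden stand-in: the quadratic `predict` function, its parameters `w` and `b`, and the three-dimensional embedding are hypothetical, and a real controller would obtain the gradient by backpropagation through the trained predictor.

```python
def predict(h, w, b):
    """Toy differentiable predictor: a concave quadratic score of the embedding."""
    return b - sum((hi - wi) ** 2 for hi, wi in zip(h, w))


def grad_predict(h, w):
    """Analytic gradient of `predict` with respect to the embedding h."""
    return [-2.0 * (hi - wi) for hi, wi in zip(h, w)]


def ascend(h, w, step_size):
    """One gradient-ascent step: h' = h + step_size * d predict / d h."""
    return [hi + step_size * gi for hi, gi in zip(h, grad_predict(h, w))]


w, b = [0.3, -0.1, 0.8], 0.95  # hypothetical predictor parameters
h = [0.0, 0.0, 0.0]            # embedding of the current architecture
h_new = ascend(h, w, step_size=0.1)
```

With a sufficiently small step size the predicted accuracy of `h_new` is higher than that of `h`. In NAO the decoder then maps the moved embedding back to a discrete architecture.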
The detailed algorithm of SemiNAS based on NAO is shown in Alg. 1. Within each iteration, we train the controller with our proposed semi-supervised approach (line 5 to line 8). We pre-train the encoder and the predictor with a small set of architecture-accuracy pairs (line 5), and then randomly generate architectures and use the pre-trained encoder and predictor to predict the accuracy of these architectures (line 6). Then we use both the architecture-accuracy pairs obtained from the set and the generated set to train the controller (line 7 and line 8). We perform the gradient ascent optimization to generate better architectures based on current architectures (line 9 and line 10). After iterations, we output the best architecture from the architecture-accuracy pool we have obtained as the final discovered architecture.
Although our SemiNAS is implemented based on NAO, the key idea of utilizing the trained encoder and predictor to predict the accuracy of numerous unlabeled architectures can be extended to a variety of NAS methods. For reinforcement learning based algorithms Zoph & Le (2016); Zoph et al. (2018); Pham et al. (2018) where the controller is usually an RNN model, we can predict the accuracy of the architectures generated by the RNN and take the predicted accuracy as the reward to train the controller. For evolution based methods Real et al. (2018), we can predict the accuracy of the architectures generated through mutation and crossover, and then take the predicted accuracy as the fitness of the generated architectures. We leave the implementation of SemiNAS based on these NAS methods as future work.
4 Application to Image Classification
In this section, we demonstrate the effectiveness of SemiNAS on image classification tasks. We first conduct experiments on NASBench-101 Ying et al. (2019), which is a benchmark dataset to evaluate the effectiveness and efficiency of NAS algorithms, and then on the commonly used large-scale ImageNet dataset.
We first describe the experiment settings and results on NASBench-101. Furthermore, we conduct an experimental study to analyze the hyper-parameters of SemiNAS.
Datasets. NASBench-101 Ying et al. (2019) designs a cell-based search space for CIFAR-10 following the common practice Zoph et al. (2018); Luo et al. (2018); Liu et al. (2018). It includes architectures. It trains each architecture for times from scratch to full convergence, and reports its validation accuracy and test accuracy for each run. A query of the accuracy of an architecture from the dataset is equivalent to evaluating the architecture and will randomly get the accuracy of one of the runs. We hope to discover comparable architectures with less computational cost or better architectures with comparable computational cost. Specifically, on this dataset, reducing the computational cost can be regarded as decreasing the number of queries.
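The query semantics described above, where each query returns the accuracy of one randomly chosen run and counts as one evaluation, can be mimicked with a small stand-in class. This sketch does not use the real NASBench-101 API. `ToyBenchmark`, its methods, and the accuracy values are hypothetical.

```python
import random


class ToyBenchmark:
    """Tabular benchmark: each architecture maps to the accuracies of its runs."""

    def __init__(self, table, seed=0):
        self._table = table
        self._rng = random.Random(seed)
        self.num_queries = 0  # proxy for computational cost

    def query(self, arch):
        """Return the accuracy of one randomly chosen run and count the query."""
        self.num_queries += 1
        return self._rng.choice(self._table[arch])

    def mean_accuracy(self, arch):
        """Mean over all runs, used only for final reporting (not counted as a query)."""
        runs = self._table[arch]
        return sum(runs) / len(runs)


bench = ToyBenchmark({"arch_a": [0.91, 0.92, 0.93], "arch_b": [0.88, 0.90, 0.89]})
observed = bench.query("arch_a")
```

Counting queries in this way makes the cost of different search algorithms directly comparable on the benchmark.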
Training Details. 1) For the controller, both the encoder and the decoder consist of a single layer LSTM with a hidden size of , and the predictor is a three-layer fully connected network with hidden sizes of respectively. In the predictor, ReLU is inserted after the first layer to perform non-linearity. We use a dropout rate of to avoid over-fitting, and set in Eqn. 3 according to the validation performance. We use Adam optimizer with a learning rate of . 2) For the evaluator, we query the accuracy of an architecture from NASBench-101, which can be regarded as training the architecture once in practice. 3) For the final evaluation, we report the mean test accuracy of the selected architecture over the runs.
All the results are listed in Table 1. We also report the performance of random search which is shown to be a strong baseline Li & Talwalkar (2019), regularized evolution Real et al. (2018) which is the best-performing algorithm evaluated in Ying et al. (2019), and NAO Luo et al. (2018) on which our SemiNAS is based.
We report two settings of SemiNAS. For the first setting, we set and up-sample labeled data by x according to the validation performance. We generate new architectures based on top architectures following line 9 in Alg. 1 at each iteration and run for iterations. The algorithm totally queries around architectures from NASBench-101, similar to the baselines, and achieves mean test accuracy, surpassing all the baselines. This shows that with the help of numerous unlabeled architectures, SemiNAS can achieve better accuracy than the baselines under the similar computational cost. For the second setting, we use and up-sample labeled data by x according to the validation performance. We generate new architectures based on top architectures following line 9 in Alg. 1 at each iteration and run for iterations. The algorithm totally evaluates architectures. SemiNAS achieves mean test accuracy, which is on par with regularized evolution and NAO, but with only about computational cost ( architectures in total vs. architectures). This demonstrates that SemiNAS can greatly reduce the computational cost under the similar accuracy guarantee.
|Method||#Queries||Test Acc. (%)|
|RE Real et al. (2018)||2000||93.97|
|NAO Luo et al. (2018)||2000||93.87|
Study of SemiNAS
In this section, we conduct experiments on NASBench-101 to study SemiNAS, including the number of unlabeled architectures and the up-sampling ratio of labeled architectures.
|Model/Method||Top-1 Error (%)||Top-5 Error (%)||Params (Million)||FLOPS (Million)|
|MobileNetV2 Sandler et al. (2018)||25.3||-||6.9||585|
|ShuffleNet 2 (v2) Zhang et al. (2018)||25.1||-||5||591|
|NASNet-A Zoph & Le (2016)||26.0||8.4||5.3||564|
|AmoebaNet-A Real et al. (2018)||25.5||8.0||5.1||555|
|AmoebaNet-C Real et al. (2018)||24.3||7.6||6.4||570|
|MnasNet Tan et al. (2019)||25.2||8.0||4.4||388|
|PNAS Liu et al. (2017)||25.8||8.1||5.1||588|
|DARTS Liu et al. (2018)||26.9||9.0||4.9||595|
|SNAS Xie et al. (2018)||27.3||9.2||4.3||522|
|P-DARTS Chen et al. (2019)||24.4||7.4||4.9||557|
|Single-Path NAS Stamoulis et al. (2019)||25.0||7.8||-||-|
|Single Path One-shot Guo et al. (2019)||25.3||-||-||328|
|ProxylessNAS Cai et al. (2018)||24.9||7.5||7.12||465|
|PC-DARTS Xu et al. (2019)||24.2||7.3||5.3||597|
|NAO Luo et al. (2018)||25.7||8.2||11.35||584|
Number of unlabeled architectures . We study the effect of different on SemiNAS. Given following the second setting in the above experiments, we range within , and plot the results in Figure 1. We can see that the test accuracy increases as increases, indicating that utilizing unlabeled architectures indeed helps the training of the controller and generating better architectures.
Up-sampling ratio. Since is much smaller than , we do up-sampling to balance the data. We study how the up-sampling ratio affects the effectiveness of SemiNAS on NASBench-101. We set following the second setting in our experiments and range the up-sampling ratio in where means no up-sampling. The results are depicted in Figure 1. We can see that the final accuracy would benefit from up-sampling but will not continue to improve when the ratio is high (e.g., larger than ).
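Up-sampling here amounts to repeating the labeled pairs before mixing them with the pseudo-labeled pairs. A minimal sketch follows, assuming an integer ratio and plain list concatenation. The function name and the sample data are hypothetical.

```python
def build_training_set(labeled, pseudo_labeled, up_sample_ratio=1):
    """Repeat labeled pairs `up_sample_ratio` times, then append pseudo-labeled pairs."""
    if up_sample_ratio < 1:
        raise ValueError("up_sample_ratio must be at least 1 (1 means no up-sampling)")
    return list(labeled) * up_sample_ratio + list(pseudo_labeled)


labeled = [("arch_a", 0.93), ("arch_b", 0.91)]
pseudo_labeled = [(f"arch_u{i}", 0.90) for i in range(10)]
mixed = build_training_set(labeled, pseudo_labeled, up_sample_ratio=5)
```

A higher ratio gives the evaluated pairs more weight per epoch relative to the noisier predicted labels.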
Previous experiments on the NASBench-101 dataset verify the effectiveness and efficiency of SemiNAS in a well-controlled environment. We further evaluate our approach on the large-scale ImageNet dataset.
Dataset. ImageNet comprises approximately million images for training and images for testing, which are categorized into object classes. We randomly sample images from the training data as a validation set for architecture search.
Search space. We adopt the architecture search space in ProxylessNAS Cai et al. (2018), which is based on the MobileNet-V2 Sandler et al. (2018) network backbone. It consists of multiple stacked stages, and each stage contains multiple layers. We search the operation of each layer. Candidate operations include mobile inverted bottleneck convolution layers Sandler et al. (2018) with various kernel sizes and expansion ratios , as well as zero-out layer.
Training details. 1) For the controller, we set and and run the search process for iterations. In each iteration, new better architectures are generated based on top architectures following line 9 in Alg. 1. Other details are the same as in NASBench-101 experiments. 2) For the evaluator, since training ImageNet is too expensive, we use a weight sharing based evaluator Pham et al. (2018); Cai et al. (2018) in SemiNAS. We train the supernet on GPUs for steps with a batch size of per card. 3) For the final evaluation, we train the discovered architecture on the full ImageNet training set for epochs following exactly the same setting as in Cai et al. (2018) with a batch size of . We use the SGD optimizer with an initial learning rate of and a cosine learning rate schedule Loshchilov & Hutter (2016). The parameters are initialized with Kaiming initialization He et al. (2015).
We run the algorithm for day with the total cost of GPU days and evaluate the discovered architecture. The final discovered architecture is shown in the supplementary material. The results of SemiNAS and other methods are reported in Table 2. SemiNAS achieves top-1 test error rate on ImageNet under the mobile setting (FLOPS Million), which is the same as the current SOTA PC-DARTS Xu et al. (2019), and outperforms all the other NAS works. Specifically, it outperforms the baseline algorithm NAO on which SemiNAS is based and ProxylessNAS where our search space is based, by and respectively.
5 Application to Text to Speech
Previous experiments on NASBench-101 and ImageNet have shown promising results. In this section, we further explore the application of SemiNAS to a new task: text to speech.
Text to speech (TTS) Wang et al. (2017); Shen et al. (2018); Ping et al. (2017); Li et al. (2019); Ren et al. (2019a) is an important task aiming to synthesize intelligible and natural speech from text. The encoder-decoder based neural TTS (Shen et al., 2018) has achieved significant improvements. However, due to the different modalities of the input (text) and the output (speech), popular TTS models are still complicated and require considerable human expertise when designing the model architecture. Moreover, unlike many other sequence learning tasks (e.g., neural machine translation, language modeling) where the Transformer model Vaswani et al. (2017) is the dominant architecture, RNN based Tacotron Wang et al. (2017); Shen et al. (2018), CNN based Deep Voice Arik et al. (2017); Gibiansky et al. (2017); Ping et al. (2017), and Transformer based models Li et al. (2019) show comparable accuracy in TTS, without one being clearly better than the others.
The complexity of the model architecture in TTS indicates great potential for NAS on this task. However, applying NAS to TTS also has challenges, mainly in two aspects: 1) Current TTS model architectures are complicated, including many human designed components. It is difficult but important to design the network backbone and the corresponding search space for NAS. 2) Unlike other tasks (e.g., image classification) whose evaluation is objective and automatic, the evaluation of a TTS model requires subjective judgement and human evaluation in the loop (e.g., intelligibility rate for understandability and mean opinion score for naturalness). It is impractical to use human evaluation for thousands of architectures in NAS. Thus, it is difficult but also important to design a specific and appropriate objective metric as the reward of an architecture during the search process.
Next, we design the search space and evaluation metric for NAS on TTS, and apply SemiNAS on two specific TTS settings: low-resource setting and robustness setting.
5.1 Experiment Settings
Search space. After surveying the previous neural TTS models, we choose a multi-layer encoder-decoder based network as the network backbone for TTS. We search the operation of each layer of the encoder and the decoder. The search space includes candidate operations in total: convolution layer with kernel size , multi-head self-attention layer Vaswani et al. (2017) with number of heads of and LSTM layer. Specifically, we use unidirectional LSTM layer, causal convolution layer, causal self-attention layer in the decoder to avoid seeing the information in future positions. Besides, every decoder layer is inserted with an additional encoder-decoder-attention layer to catch the relationship between the source and target sequence, where the dot-product multi-head attention in Transformer Vaswani et al. (2017) is adopted.
Evaluation metric. It has been shown that the quality of the attention alignment between the encoder and decoder is an important influence factor on the quality of synthesized speech in previous works Ren et al. (2019a); Wang et al. (2017); Shen et al. (2018); Li et al. (2019); Ping et al. (2017), and misalignment can be observed for most mistakes (e.g., skipping and repeating). Accordingly, we consider the diagonal focus rate (DFR) of the attention map between the encoder and decoder as the metric of an architecture. DFR is defined as:
where denotes the attention map, and are the length of the source input sequence and the target output sequence, is the slope factor and is the width of the diagonal area in the attention map. DFR measures how much attention lies in the diagonal area with width in the attention matrix, and ranges in which is the larger the better. In addition, we have also tried validation loss as the search metric, but it was inferior to DFR in our preliminary experiments.
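A diagonal-focus measure of this kind can be sketched as the share of attention mass that falls inside a band around the diagonal. The code below is one plausible reading under stated assumptions, not the exact definition used in the paper: `attention[o][i]` is indexed by target position `o` and source position `i`, the slope is taken as target length divided by source length, and `width` is treated as a half-width on each side of the diagonal.

```python
def diagonal_focus_rate(attention, width):
    """Fraction of total attention mass within `width` of the line o = slope * i."""
    num_target = len(attention)
    num_source = len(attention[0])
    slope = num_target / num_source  # assumed slope factor
    total = sum(sum(row) for row in attention)
    if total == 0:
        return 0.0
    in_band = 0.0
    for o, row in enumerate(attention):
        for i, weight in enumerate(row):
            if abs(o - slope * i) <= width:
                in_band += weight
    return in_band / total


aligned = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
reversed_alignment = [[0.0, 0.0, 1.0], [0.0, 1.0, 0.0], [1.0, 0.0, 0.0]]
dfr_aligned = diagonal_focus_rate(aligned, width=0)
dfr_reversed = diagonal_focus_rate(reversed_alignment, width=0)
```

A perfectly monotonic alignment scores the maximum, while attention that wanders off the diagonal (as in skipping or repeating) scores lower. This is what makes the measure usable as an automatic search signal.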
Task setting. Current TTS systems are capable of achieving near human-parity quality when trained on adequate data and tested on regular sentences (Shen et al., 2018; Li et al., 2019). However, current TTS models perform poorly in two specific TTS settings: 1) the low-resource setting, where only a small amount of paired speech and text data is available; 2) the robustness setting, where the test sentences are not regular (e.g., too short, too long, or containing many word pieces that have the same pronunciations). Under these two settings, the synthesized speech of a human-designed TTS model is usually not accurate and robust (i.e., some words are skipped or repeated). Thus we apply SemiNAS to these two settings to improve the accuracy and robustness.
5.2 Results on Low-Resource Setting
Data. We conduct experiments on the LJSpeech dataset Ito (2017) which contains text and speech data pairs with approximately hours of speech audio. To simulate the low-resource scenario, we randomly split out paired speech and text samples as the training set, where the total audio length is less than hours. We also randomly split out paired samples as the validation/test set.
Training details. 1) For the controller, we follow the same configurations as in the ImageNet experiment. 2) For the evaluator, we adopt the weight sharing mechanism and train the supernet on 4 GPUs. On average, each architecture in the supernet is trained for epochs. Besides, we train vanilla NAO as a baseline where and each architecture is trained for epoch on average within the supernet to keep the total cost the same. 3) For the final evaluation, we train the discovered architecture on the training set for k steps on 4 GPUs, with batch size of K speech frames on each GPU. We use the Adam optimizer with and follow the same learning rate schedule in Li et al. (2019) with warmup steps. In the inference process, the output mel-spectrograms are transformed into audio samples using Griffin-Lim (Griffin & Lim, 1984).
|Model/Method||IR (%)||DFR (%)|
|Transformer TTS Li et al. (2019)||88||86|
|NAO Luo et al. (2018)||94||88|
Results. We test the performance of SemiNAS, NAO Luo et al. (2018) and Transformer TTS (following Li et al. (2019)) on the test sentences and report the results in Table 3. We measure the performance in terms of word level intelligibility rate (IR), which is a commonly used metric to evaluate the quality of generated audio Ren et al. (2019b). IR is defined as the percentage of test words whose pronunciation is considered to be correct and clear by humans. It is shown that SemiNAS achieves IR, with significant improvements of points over human designed Transformer TTS and points over NAO. We also list the DFR metric for each method in Table 3, where SemiNAS outperforms Transformer TTS and NAO in terms of DFR, which is consistent with the results on IR and indicates that our proposed search metric DFR can indeed guide NAS algorithms to achieve better accuracy. We also use MOS (mean opinion score) Streijl et al. (2016) to evaluate the naturalness of the synthesized speech. Using Griffin-Lim as the vocoder to synthesize the speech, the ground-truth mel-spectrograms achieve MOS, Transformer TTS achieves , NAO achieves and SemiNAS achieves . SemiNAS outperforms other methods in terms of MOS, which also demonstrates the advantages of SemiNAS. We also attach the architecture discovered by SemiNAS in the supplementary materials.
5.3 Results on Robustness Setting
Data. We use the whole LJSpeech dataset as the training data. For the robustness test, we select the sentences used in Ping et al. (2017) (attached in the supplementary materials) that are found to be hard for TTS models.
Training details. 1) For the controller, we follow the same configurations as in the ImageNet experiment. 2) For the evaluator, we train on the whole LJSpeech dataset to get DFR on the hard sentences. On average, each architecture is trained for epoch within the supernet. Other details of the evaluator are the same as in the low-resource TTS experiment. As in the low-resource setting, we also train vanilla NAO as a baseline. 3) For the final evaluation, we train the discovered architecture on the whole LJSpeech dataset and test on the selected sentences. Other training details are the same as in the low-resource TTS experiment. We also attach the discovered architecture in the supplementary materials.
|Model/Method|DFR (%)|Repeat|Skip|Error (%)|
|---|---|---|---|---|
|Transformer TTS Li et al. (2019)|
|NAO Luo et al. (2018)|
Results. We report the results in Table 4, including the DFR, the number of sentences with repeating and skipping words, and the sentence-level error rate. A sentence is counted as an error if it contains a repeating or skipping word. SemiNAS is better than Transformer TTS (Li et al., 2019) and NAO Luo et al. (2018) on all the metrics. It reduces the error rate by and compared to the Transformer TTS structure designed by human experts and the architecture searched by NAO, respectively.
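To make the sentence-level metrics concrete, the following minimal sketch computes the Repeat count, the Skip count, and the error rate from per-sentence annotations. The `SentenceResult` type and the `summarize` function are hypothetical names introduced only for illustration, and the annotations in the usage example are made up rather than taken from our experiments.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class SentenceResult:
    """Hypothetical per-sentence annotation produced by a human listener."""
    has_repeat: bool  # at least one repeated word
    has_skip: bool    # at least one skipped word


def summarize(results: List[SentenceResult]) -> Tuple[int, int, float]:
    """Return (#sentences with repeats, #sentences with skips, error rate in %)."""
    n_repeat = sum(r.has_repeat for r in results)
    n_skip = sum(r.has_skip for r in results)
    # A sentence is one error if it repeats OR skips, so a sentence
    # exhibiting both problems is counted only once.
    n_error = sum(r.has_repeat or r.has_skip for r in results)
    return n_repeat, n_skip, 100.0 * n_error / len(results)


if __name__ == "__main__":
    toy = [
        SentenceResult(True, False),
        SentenceResult(False, True),
        SentenceResult(True, True),
        SentenceResult(False, False),
    ]
    print(summarize(toy))
```

Because a sentence with both a repeat and a skip counts as a single error, the error count can be smaller than the sum of the Repeat and Skip columns.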
High-quality architecture-accuracy pairs are critical to NAS; however, accurately evaluating an architecture is costly. In this paper, we proposed SemiNAS, a semi-supervised learning method for NAS. It leverages a small set of high-quality architecture-accuracy pairs to train an initial controller, and then utilizes a large number of unlabeled architectures to further improve the controller. Experiments on image classification tasks (NASBench-101 and ImageNet) and text to speech tasks (the low-resource setting and robustness setting) demonstrate 1) the efficiency of SemiNAS in reducing the computational cost over conventional NAS while achieving similar accuracy and 2) its effectiveness in improving the accuracy of both conventional NAS and one-shot NAS under similar computational cost.
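The data flow summarized above (train on a few labeled pairs, pseudo-label many unlabeled architectures, retrain on the union) can be sketched on a toy problem. Everything in this sketch is an illustrative assumption: `toy_evaluate` is a cheap stand-in for the costly evaluator, `AdditivePredictor` is a deliberately simple stand-in for the controller, and the search space sizes are arbitrary. It is not the implementation used in our experiments.

```python
import random

NUM_OPS = 4      # hypothetical number of candidate operations per layer
NUM_LAYERS = 6   # hypothetical number of searchable layers


def toy_evaluate(arch):
    """Toy stand-in for the costly evaluator; returns a fake 'accuracy'."""
    return sum(op * (i + 1) for i, op in enumerate(arch)) / 100.0


class AdditivePredictor:
    """Toy accuracy predictor built from per-(layer, op) average residuals."""

    def fit(self, pairs):
        self.mean = sum(acc for _, acc in pairs) / len(pairs)
        sums, counts = {}, {}
        for arch, acc in pairs:
            for key in enumerate(arch):  # key = (layer index, op index)
                sums[key] = sums.get(key, 0.0) + (acc - self.mean)
                counts[key] = counts.get(key, 0) + 1
        self.effect = {key: sums[key] / counts[key] for key in sums}
        return self

    def predict(self, arch):
        effects = [self.effect.get(key, 0.0) for key in enumerate(arch)]
        return self.mean + sum(effects) / len(effects)


def random_arch(rng):
    """Sample an architecture as a tuple of op indices, one per layer."""
    return tuple(rng.randrange(NUM_OPS) for _ in range(NUM_LAYERS))


def semi_supervised_round(labeled, unlabeled):
    # 1) Train an initial predictor on the small labeled set.
    initial = AdditivePredictor().fit(labeled)
    # 2) Predict accuracies of unlabeled architectures (no evaluation).
    pseudo = [(arch, initial.predict(arch)) for arch in unlabeled]
    # 3) Retrain on the original pairs plus the generated pairs.
    retrained = AdditivePredictor().fit(labeled + pseudo)
    return retrained, pseudo


rng = random.Random(0)
labeled = [(a, toy_evaluate(a)) for a in (random_arch(rng) for _ in range(20))]
unlabeled = [random_arch(rng) for _ in range(200)]
predictor, pseudo = semi_supervised_round(labeled, unlabeled)
```

Only the 20 labeled architectures pass through `toy_evaluate`; the 200 pseudo-labeled pairs cost a single predictor call each.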
In the future, we will apply SemiNAS to more tasks such as automatic speech recognition, text summarization, etc. Furthermore, we will explore advanced semi-supervised learning methods to improve SemiNAS.
Appendix A Discovered Architectures
We show the discovered architectures for the tasks by SemiNAS.
We show the discovered architecture on NASBench-101 by SemiNAS, which has a mean test accuracy of . The connection matrix of the architecture is:
The operations are: input, conv1x1-bn-relu, conv3x3-bn-relu, conv3x3-bn-relu, conv3x3-bn-relu, conv1x1-bn-relu, output.
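As an illustration of the format, a NASBench-101 cell is specified by an upper-triangular adjacency (connection) matrix together with a per-vertex operation list. The sketch below checks the basic structural constraints of the benchmark (at most 7 vertices and 9 edges). The operation list is the one given above, but `PLACEHOLDER_MATRIX` is a simple chain used only to exercise the check and is not the connection matrix of the discovered architecture; `is_valid_cell` is a hypothetical helper, and it does not prune disconnected vertices as the benchmark tooling does.

```python
MAX_VERTICES = 7
MAX_EDGES = 9
ALLOWED_OPS = {"conv1x1-bn-relu", "conv3x3-bn-relu", "maxpool3x3"}

# Operation list of the discovered cell, as stated in the text.
OPS = ["input", "conv1x1-bn-relu", "conv3x3-bn-relu", "conv3x3-bn-relu",
       "conv3x3-bn-relu", "conv1x1-bn-relu", "output"]

# PLACEHOLDER: a plain chain 0 -> 1 -> ... -> 6, NOT the discovered matrix.
PLACEHOLDER_MATRIX = [[1 if j == i + 1 else 0 for j in range(7)] for i in range(7)]


def is_valid_cell(matrix, ops):
    """Check shape, labels, upper-triangularity and the edge budget."""
    n = len(ops)
    if not 2 <= n <= MAX_VERTICES:
        return False
    if len(matrix) != n or any(len(row) != n for row in matrix):
        return False
    if ops[0] != "input" or ops[-1] != "output":
        return False
    if any(op not in ALLOWED_OPS for op in ops[1:-1]):
        return False
    for i in range(n):
        for j in range(n):
            if matrix[i][j] not in (0, 1):
                return False
            if j <= i and matrix[i][j] != 0:  # edges must go forward only
                return False
    return sum(sum(row) for row in matrix) <= MAX_EDGES


if __name__ == "__main__":
    print(is_valid_cell(PLACEHOLDER_MATRIX, OPS))
```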
We adopt the ProxylessNAS Cai et al. (2018) search space which is built on the MobileNet-V2 Sandler et al. (2018) backbone. It contains several different stages and each stage consists of multiple layers. We search the operation of each individual layer. There are candidate operations in the search space:
MBConv (k=3, r=3)
MBConv (k=3, r=6)
MBConv (k=5, r=3)
MBConv (k=5, r=6)
MBConv (k=7, r=3)
MBConv (k=7, r=6)
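A layer-wise search space of this kind can be encoded as a sequence of operation indices, one per layer. The sketch below covers only the six MBConv variants listed above (k is the kernel size, r the expansion ratio); the helper names `encode` and `decode` and the three-layer example are hypothetical and chosen only for illustration.

```python
from itertools import product

KERNEL_SIZES = (3, 5, 7)
EXPANSION_RATIOS = (3, 6)

# Same order as the list above: (k=3, r=3), (k=3, r=6), (k=5, r=3), ...
CANDIDATE_OPS = [f"MBConv (k={k}, r={r})"
                 for k, r in product(KERNEL_SIZES, EXPANSION_RATIOS)]


def encode(arch):
    """Map a list of operation names (one per layer) to integer tokens."""
    return [CANDIDATE_OPS.index(op) for op in arch]


def decode(tokens):
    """Map integer tokens back to operation names."""
    return [CANDIDATE_OPS[t] for t in tokens]


if __name__ == "__main__":
    # Hypothetical 3-layer architecture, for illustration only.
    arch = ["MBConv (k=3, r=6)", "MBConv (k=7, r=3)", "MBConv (k=5, r=6)"]
    tokens = encode(arch)
    print(tokens, decode(tokens) == arch)
    # With L searchable layers, these variants alone give 6**L sequences.
    print(len(CANDIDATE_OPS) ** len(arch))
```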
We adopt an encoder-decoder based architecture as the backbone, and search the operation of each layer. Candidate operations include:
Convolution layer with kernel size of 1
Convolution layer with kernel size of 5
Convolution layer with kernel size of 9
Convolution layer with kernel size of 13
Convolution layer with kernel size of 17
Convolution layer with kernel size of 21
Convolution layer with kernel size of 25
Transformer layer with head number of 2
Transformer layer with head number of 4
Transformer layer with head number of 8
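Unlabeled architectures are cheap to obtain because they only need to be sampled from the search space, not trained. The sketch below samples random encoder-decoder architectures from the ten candidate operations listed above. The function name `sample_architecture` and the layer counts in the usage example are hypothetical, since the number of searchable layers is not restated here.

```python
import random

CONV_KERNEL_SIZES = (1, 5, 9, 13, 17, 21, 25)
TRANSFORMER_HEADS = (2, 4, 8)

# Each candidate is (layer type, kernel size or head number).
TTS_OPS = ([("conv", k) for k in CONV_KERNEL_SIZES]
           + [("transformer", h) for h in TRANSFORMER_HEADS])


def sample_architecture(num_encoder_layers, num_decoder_layers, rng):
    """Pick one candidate operation uniformly at random for every layer."""
    return {
        "encoder": [rng.choice(TTS_OPS) for _ in range(num_encoder_layers)],
        "decoder": [rng.choice(TTS_OPS) for _ in range(num_decoder_layers)],
    }


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical depths, for illustration only.
    for _ in range(3):
        print(sample_architecture(6, 6, rng))
```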
The architecture discovered by SemiNAS for the low-resource setting is shown in Fig. 3.
The architecture discovered by SemiNAS for the robustness setting is shown in Fig. 4.
Appendix B Robustness Test Sentences
We list the 100 sentences we use for the robustness setting:
a b c.
x y z.
is it free?
a debt runs.
christmas is coming.
a pet dilemma thinks.
how was the math test?
good to the last drop.
an m b a agent listens.
a compromise disappears.
an axis of x y or z freezers.
she did her best to help him.
a backbone contests the chaos.
two a greater than two n nine.
don’t step on the broken glass.
a damned flips into the patient.
a trade purges within the b b c.
i’d rather be a bird than a fish.
i hear that nancy is very pretty.
i want more detailed information.
please wait outside of the house.
n a s a exposure tunes the waffle.
a mist dictates within the monster.
a sketch ropes the middle ceremony.
every farewell explodes the career.
she folded here handkerchief neatly.
against the steam chooses the studio.
rock music approaches at high velocity.
nine adam baye study on the two pieces.
an unfriendly decay conveys the outcome.
abstraction is often one floor above you.
a played lady ranks any publicized preview.
he told us a very exciting adventure story.
on august twenty eight mary plays the piano.
into a controller beams a concrete terrorist.
i often see the time eleven eleven on clocks.
it was getting dark and we weren’t there yet.
against every rhyme starves a choral apparatus.
everyone was busy so i went to the movie alone.
i checked to make sure that he was still alive.
a dominant vegetarian shies away from the g o p.
joe made the sugar cookies susan decorated them.
i want to buy a onesie but know it won’t suit me.
a former override of q w e r t y outside the pope.
f b i says that c i a says i’ll stay way from it.
any climbing dish listens to a cumbersome formula.
she wrote him a long letter but he didn’t read it.
dear beauty is in the heat not physical i love you.
an appeal on january fifth duplicates a sharp queen.
a farewell solos on march twenty third shakes north.
he ran out of money so he had to stop playing poker.
for example a newspaper has only regional distribution t.
i currently have four windows open up and i don’t know why.
next to my indirect vocal declines every unbearable academic.
opposite her sounding bag is a m c’s configured thoroughfare.
from april eighth to the present i only smoke four cigarettes.
i will never be this young again every oh damn i just got older.
a generous continuum of amazon dot com is the conflicting worker.
she advised him to come back at once the wife lectures the blast.
a song can make or ruin a person’s day if they let it get to them.
she did not cheat on the test for it was not the right thing to do.
he said he was not there yesterday however many people saw him there.
should we start class now or should we wait for everyone to get here?
if purple people eaters are real where do they find purple people to eat?
on november eighteenth eighteen twenty one a glittering gem is not enough.
a rocket from space x interacts with the individual beneath the soft flaw.
malls are great places to shop i can find everything i need under one roof.
i think i will buy the red car or i will lease the blue one the faith nests.
italy is my favorite country in fact i plan to spend two weeks there next year.
i would have gotten w w w w dot google dot com but my attendance wasn’t good enough.
nineteen twenty is when we are unique together until we realise we are all the same.
my mum tries to be cool by saying h t t p colon slash slash w w w b a i d u dot com.
he turned in the research paper on friday otherwise he emailed a s d f at yahoo dot org.
she works two jobs to make ends meet at least that was her reason for no having time to join us.
a remarkable well promotes the alphabet into the adjusted luck the dress dodges across my assault.
a b c d e f g h i j k l m n o p q r s t u v w x y z one two three four five six seven eight nine ten.
across the waste persists the wrong pacifier the washed passenger parades under the incorrect computer.
if the easter bunny and the tooth fairy had babies would they take your teeth and leave chocolate for you?
sometimes all you need to do is completely make an ass of yourself and laugh it off to realise that life isn’t so bad after all.
she borrowed the book from him many years ago and hasn’t yet returned it why won’t the distinguishing love jump with the juvenile?
last friday in three week’s time i saw a spotted striped blue worm shake hands with a legless lizard the lake is a long way from here.
i was very proud of my nickname throughout high school but today i couldn’t be any different to what my nickname was the metal lusts the ranging captain charters the link.
i am happy to take your donation any amount will be greatly appreciated the waves were crashing on the shore it was a lovely sight the paradox sticks this bowl on top of a spontaneous tea.
a purple pig and a green donkey flew a kite in the middle of the night and ended up sunburn the contained error poses as a logical target the divorce attacks near a missing doom the opera fines the daily examiner into a murderer.
as the most famous singer-songwriter jay chou gave a perfect performance in beijing on may twenty fourth twenty fifth and twenty sixth twenty three all the fans thought highly of him and took pride in him all the tickets were sold out.
if you like tuna and tomato sauce try combining the two it’s really not as bad as it sounds the body may perhaps compensates for the loss of a true metaphysics the clock within this blog and the clock on my laptop are on hour different from each other.
someone i know recently combined maple syrup and buttered popcorn thinking it would taste like caramel popcorn it didn’t and they don’t recommend anyone else do it either the gentleman marches around the principal the divorce attacks near a missing doom the color misprints a circular worry across the controversy.
Appendix C Demo of TTS
We provide demos for both the low-resource setting and the robustness setting of the TTS experiments at this link.
Appendix D Implementation Details
We implement all the code in PyTorch Paszke et al. (2019), version 1.2. We implement the core architecture search algorithm following NAO Luo et al. (2018).
- Although a variety of metrics including accuracy, model size, and inference speed have been used as search criteria, the accuracy of an architecture is the most important and the most costly to obtain, while the other metrics can be calculated with almost zero computation cost. Therefore, we focus on accuracy in this work.
- One epoch means training on the whole dataset once.
References
- Arik, S. Ö., Chrzanowski, M., Coates, A., Diamos, G., Gibiansky, A., Kang, Y., Li, X., Miller, J., Ng, A., Raiman, J., et al. Deep voice: Real-time neural text-to-speech. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 195–204. JMLR.org, 2017.
- Bender, G., Kindermans, P.-J., Zoph, B., Vasudevan, V., and Le, Q. Understanding and simplifying one-shot architecture search. In International Conference on Machine Learning, pp. 549–558, 2018.
- Cai, H., Zhu, L., and Han, S. Proxylessnas: Direct neural architecture search on target task and hardware. arXiv preprint arXiv:1812.00332, 2018.
- Chen, X., Xie, L., Wu, J., and Tian, Q. Progressive differentiable architecture search: Bridging the depth gap between search and evaluation. arXiv preprint arXiv:1904.12760, 2019.
- Ghiasi, G., Lin, T.-Y., and Le, Q. V. Nas-fpn: Learning scalable feature pyramid architecture for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7036–7045, 2019.
- Gibiansky, A., Arik, S., Diamos, G., Miller, J., Peng, K., Ping, W., Raiman, J., and Zhou, Y. Deep voice 2: Multi-speaker neural text-to-speech. In Advances in neural information processing systems, pp. 2962–2970, 2017.
- Griffin, D. and Lim, J. Signal estimation from modified short-time fourier transform. IEEE Transactions on Acoustics, Speech, and Signal Processing, 32(2):236–243, 1984.
- Guo, Z., Zhang, X., Mu, H., Heng, W., Liu, Z., Wei, Y., and Sun, J. Single path one-shot neural architecture search with uniform sampling. arXiv preprint arXiv:1904.00420, 2019.
- He, K., Zhang, X., Ren, S., and Sun, J. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In Proceedings of the IEEE international conference on computer vision, pp. 1026–1034, 2015.
- Ito, K. The lj speech dataset. https://keithito.com/LJ-Speech-Dataset/, 2017.
- Li, L. and Talwalkar, A. Random search and reproducibility for neural architecture search. arXiv preprint arXiv:1902.07638, 2019.
- Li, N., Liu, S., Liu, Y., Zhao, S., and Liu, M. Neural speech synthesis with transformer network. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pp. 6706–6713, 2019.
- Liu, C., Zoph, B., Shlens, J., Hua, W., Li, L.-J., Fei-Fei, L., Yuille, A., Huang, J., and Murphy, K. Progressive neural architecture search. arXiv preprint arXiv:1712.00559, 2017.
- Liu, H., Simonyan, K., Yang, Y., and Liu, H. Darts: Differentiable architecture search. arXiv preprint arXiv:1806.09055, 2018.
- Loshchilov, I. and Hutter, F. Sgdr: Stochastic gradient descent with warm restarts. arXiv preprint arXiv:1608.03983, 2016.
- Luo, R., Tian, F., Qin, T., and Liu, T.-Y. Neural architecture optimization. arXiv preprint arXiv:1808.07233, 2018.
- Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., Desmaison, A., Kopf, A., Yang, E., DeVito, Z., Raison, M., Tejani, A., Chilamkurthy, S., Steiner, B., Fang, L., Bai, J., and Chintala, S. Pytorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems 32, pp. 8024–8035. Curran Associates, Inc., 2019. URL http://papers.neurips.cc/paper/9015-pytorch-an-imperative-style-high-performance-deep-learning-library.pdf.
- Pham, H., Guan, M., Zoph, B., Le, Q., and Dean, J. Efficient neural architecture search via parameter sharing. In International Conference on Machine Learning, pp. 4092–4101, 2018.
- Ping, W., Peng, K., Gibiansky, A., Arik, S. Ö., Kannan, A., Narang, S., Raiman, J., and Miller, J. Deep voice 3: Scaling text-to-speech with convolutional sequence learning. arXiv preprint arXiv:1710.07654, 2017.
- Real, E., Aggarwal, A., Huang, Y., and Le, Q. V. Regularized evolution for image classifier architecture search. arXiv preprint arXiv:1802.01548, 2018.
- Ren, Y., Ruan, Y., Tan, X., Qin, T., Zhao, S., Zhao, Z., and Liu, T.-Y. Fastspeech: Fast, robust and controllable text to speech. In Advances in Neural Information Processing Systems, pp. 3165–3174, 2019a.
- Ren, Y., Tan, X., Qin, T., Zhao, S., Zhao, Z., and Liu, T.-Y. Almost unsupervised text to speech and automatic speech recognition. arXiv preprint arXiv:1905.06791, 2019b.
- Sandler, M., Howard, A., Zhu, M., Zhmoginov, A., and Chen, L.-C. Mobilenetv2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4510–4520, 2018.
- Sciuto, C., Yu, K., Jaggi, M., Musat, C., and Salzmann, M. Evaluating the search phase of neural architecture search. arXiv preprint arXiv:1902.08142, 2019.
- Shen, J., Pang, R., Weiss, R. J., Schuster, M., Jaitly, N., Yang, Z., Chen, Z., Zhang, Y., Wang, Y., Skerry-Ryan, R., et al. Natural tts synthesis by conditioning wavenet on mel spectrogram predictions. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4779–4783. IEEE, 2018.
- So, D., Le, Q., and Liang, C. The evolved transformer. In International Conference on Machine Learning, pp. 5877–5886, 2019.
- Stamoulis, D., Ding, R., Wang, D., Lymberopoulos, D., Priyantha, B., Liu, J., and Marculescu, D. Single-path nas: Designing hardware-efficient convnets in less than 4 hours. arXiv preprint arXiv:1904.02877, 2019.
- Streijl, R. C., Winkler, S., and Hands, D. S. Mean opinion score (mos) revisited: methods and applications, limitations and alternatives. Multimedia Systems, 22(2):213–227, 2016.
- Tan, M., Chen, B., Pang, R., Vasudevan, V., Sandler, M., Howard, A., and Le, Q. V. Mnasnet: Platform-aware neural architecture search for mobile. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2820–2828, 2019.
- Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. Attention is all you need. In Advances in neural information processing systems, pp. 5998–6008, 2017.
- Wang, Y., Skerry-Ryan, R., Stanton, D., Wu, Y., Weiss, R. J., Jaitly, N., Yang, Z., Xiao, Y., Chen, Z., Bengio, S., et al. Tacotron: Towards end-to-end speech synthesis. arXiv preprint arXiv:1703.10135, 2017.
- Wen, W., Liu, H., Li, H., Chen, Y., Bender, G., and Kindermans, P.-J. Neural predictor for neural architecture search. arXiv preprint arXiv:1912.00848, 2019.
- Xie, S., Zheng, H., Liu, C., and Lin, L. Snas: Stochastic neural architecture search, 2018.
- Xu, Y., Xie, L., Zhang, X., Chen, X., Qi, G.-J., Tian, Q., and Xiong, H. Pc-darts: Partial channel connections for memory-efficient differentiable architecture search, 2019.
- Ying, C., Klein, A., Christiansen, E., Real, E., Murphy, K., and Hutter, F. NAS-bench-101: Towards reproducible neural architecture search. In Chaudhuri, K. and Salakhutdinov, R. (eds.), Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pp. 7105–7114, Long Beach, California, USA, 09–15 Jun 2019. PMLR. URL http://proceedings.mlr.press/v97/ying19a.html.
- Zhang, X., Zhou, X., Lin, M., and Sun, J. Shufflenet: An extremely efficient convolutional neural network for mobile devices. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6848–6856, 2018.
- Zhou, H., Yang, M., Wang, J., and Pan, W. Bayesnas: A bayesian approach for neural architecture search. arXiv preprint arXiv:1905.04919, 2019.
- Zhu, X. and Goldberg, A. B. Introduction to semi-supervised learning. Synthesis lectures on artificial intelligence and machine learning, 3(1):1–130, 2009.
- Zoph, B. and Le, Q. V. Neural architecture search with reinforcement learning. arXiv preprint arXiv:1611.01578, 2016.
- Zoph, B., Vasudevan, V., Shlens, J., and Le, Q. V. Learning transferable architectures for scalable image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 8697–8710, 2018.