Robust Optimization over Multiple Domains


Qi Qian Shenghuo Zhu Jiasheng Tang Rong Jin Baigui Sun Hao Li
Alibaba Group, Bellevue, WA, 98004, USA
{qi.qian, shenghuo.zhu, jiasheng.tjs, jinrong.jr, baigui.sbg, lihao.lh}@alibaba-inc.com
Abstract

Recently, machine learning has become an important component of cloud computing services. Users of cloud computing can benefit from the sophisticated machine learning models provided by the service. Considering that users can come from different domains with the same problem, an ideal model has to be applicable over multiple domains. In this work, we propose to address this challenge by developing a framework of robust optimization. In lieu of minimizing the empirical risk, we aim to learn a model optimized over an adversarial distribution over multiple domains. Besides the convex model, we analyze the convergence rate of learning a robust non-convex model due to its dominating performance on many real-world applications. Furthermore, we demonstrate that both the robustness of the framework and the convergence rate can be enhanced by introducing appropriate regularizers for the adversarial distribution. The empirical study on real-world fine-grained visual categorization and digits recognition tasks verifies the effectiveness and efficiency of the proposed framework.

 

Preprint. Work in progress.

1 Introduction

Cloud computing has witnessed the development of machine learning in recent years. By leveraging the sophisticated models provided by the cloud computing service, users can improve the performance of their own applications conveniently and effectively. With the success of cloud computing, the robustness of the deployed models becomes a challenge. To maintain the scalability of the service, only a single model will exist in the cloud for the same problem. In real-world applications, the same problem prevalently appears in multiple domains, which means users from different domains will adopt the same model. Therefore, the model has to perform consistently well over different domains. For example, given a model for digits recognition, some users may call it to identify handwritten digits while others may try to recognize printed digits (e.g., house numbers). A satisfactory model has to handle both domains (i.e., handwritten digits and printed digits) well in the modern architecture of cloud computing. Note that this problem is different from multi-task learning Zhang and Yang (2017), which aims to learn different models (i.e., multiple models) for different tasks by exploiting the shared information between related tasks.

In a conventional learning procedure, an algorithm may mix the data from multiple domains by assigning an ad-hoc weight to each example, and then learn a model accordingly. The weight is pre-defined and can be uniform over examples, which is known as empirical risk minimization. Evidently, the learned model can handle certain domains well but perform arbitrarily poorly on the others. Considering the scenario described above, the unsatisfactory performance will result in business interruption from those users. Moreover, the strategy that assigns even weights to all examples can suffer from the data imbalance problem when the examples from certain domains dominate.

Recently, distributionally robust optimization has attracted much attention Chen et al. (2017); Namkoong and Duchi (2016); Shalev-Shwartz and Wexler (2016). Unlike the conventional strategy with the uniform distribution, it aims to optimize the performance of the model under the worst case distribution over examples. The learned model is explicitly more robust by focusing on the hard examples. To learn a robust model, many existing works apply convex loss functions, while the state-of-the-art performance on several important practical problems is reported by methods with non-convex loss functions, e.g., deep neural networks He et al. (2016); Krizhevsky et al. (2012); Szegedy et al. (2015). Chen et al. (2017) propose an algorithm to solve the non-convex problem, but their analysis relies on a near-optimal oracle for the non-convex subproblem. This kind of oracle is infeasible for most non-convex problems. Besides, their algorithm has to go through the whole data set at least once to update the parameters at every iteration, which makes it too expensive for large-scale data sets.

In this work, we propose a framework to learn a robust non-convex model over multiple domains rather than over examples. By learning the model and the adversarial distribution simultaneously, the algorithm can trade the performance between different domains adaptively. Compared with the previous work, the empirical data distribution within each domain remains unchanged and only the distribution over multiple domains is learned in our framework. Introducing an adversarial distribution over the examples of each domain may mislead the learning procedure due to its large freedom, so the applied formulation is more appropriate for real-world applications. Considering the efficiency, we adopt stochastic gradient descent (SGD) for optimization and get rid of the dependence on the oracle. We prove that the proposed method converges at a rate of $O(1/T^{1/4})$. To further improve the robustness of the framework, we introduce a regularizer for the adversarial distribution. An appropriate regularizer not only prevents the model from a trivial solution but also accelerates the convergence rate to $O(\log T/\sqrt{T})$. To the best of our knowledge, we, for the first time, propose a practical algorithm to learn a robust non-convex model over multiple domains with a theoretical guarantee. Note that our analysis applies to conventional distributionally robust optimization by degenerating each domain to a single example. The empirical study on training deep neural networks for pets categorization and digits recognition demonstrates the effectiveness and efficiency of the proposed method.

The rest of the paper is organized as follows: Section 2 summarizes the related work of distributionally robust optimization. Section 3 describes the details of the proposed method with the theoretical analysis for the convex and non-convex loss functions. Section 4 illustrates the performance of the proposed method on several real-world applications and Section 5 concludes this work.

2 Related Work

Robust optimization has been extensively studied in the past decades Bertsimas et al. (2011). Recently, it has been investigated to improve the performance of the model under the worst case data distribution, which can also be interpreted as regularizing the variance Duchi et al. (2016). For a set of convex loss functions (e.g., a single data set), Namkoong and Duchi (2016) and Shalev-Shwartz and Wexler (2016) propose to optimize the maximal loss, which is equivalent to minimizing the loss under the worst case distribution generated from the empirical distribution of the data. Namkoong and Duchi (2016) show that for the $f$-divergence constraint, a standard stochastic mirror descent algorithm can converge at the rate of $O(1/\sqrt{T})$ for the convex loss. In Shalev-Shwartz and Wexler (2016), the analysis indicates that minimizing the maximal loss can improve the generalization performance. In contrast to a single data set, we focus on dealing with multiple data sets and propose to learn a non-convex model in this work.

To address the problem of the non-convex loss, Chen et al. (2017) propose to apply a near-optimal oracle. At each iteration, the algorithm calls the oracle to return a near-optimal model for the given distribution. After that, the adversarial distribution over examples is updated according to the model from the oracle. With an $\epsilon$-optimal oracle, the authors prove that the algorithm can converge to an $\epsilon$-optimal solution at the rate of $O(1/\sqrt{T})$, where $T$ is the number of iterations. The limitation is that even if we assume a near-optimal oracle is accessible for the non-convex problem, the algorithm is too expensive for real-world applications, because it has to enumerate the whole data set to update the parameters once. To improve the efficiency, we propose to optimize the minimax problem by a stochastic algorithm. Without a near-optimal oracle, we prove that the proposed method can converge at a rate of $O(\log T/\sqrt{T})$ by setting a regularizer appropriately. Note that the proof in this work is completely different from that in Chen et al. (2017) due to the absence of an oracle.

3 Robust Optimization over Multiple Domains

For the $i$-th domain, we denote the training set as $\mathcal{D}_i = \{(\mathbf{x}_j^i, y_j^i)\}_{j=1}^{n_i}$, where $\mathbf{x}_j^i$ is an example and $y_j^i$ is the corresponding label. Given $m$ domains, the complete data set consists of $\{\mathcal{D}_1, \dots, \mathcal{D}_m\}$. We aim to learn a model that performs well over all domains. It can be cast as a robust optimization problem as follows.

$$\min_{\mathbf{w}} \max_{i \in [m]} f_i(\mathbf{w})$$

where $\mathbf{w}$ is the prediction model. $f_i(\mathbf{w})$ is the empirical risk of the $i$-th domain as

$$f_i(\mathbf{w}) = \frac{1}{n_i} \sum_{j=1}^{n_i} \ell(\mathbf{w}; \mathbf{x}_j^i, y_j^i)$$

and $\ell(\cdot)$ can be any non-negative loss function.

The problem is equivalent to the following minimax problem

$$\min_{\mathbf{w}} \max_{\mathbf{p} \in \Delta} F(\mathbf{w}, \mathbf{p}) := \sum_{i=1}^{m} p_i f_i(\mathbf{w}) \qquad (1)$$

where $F(\mathbf{w}, \mathbf{p})$ denotes the weighted empirical risk. $\mathbf{p}$ is an adversarial distribution over multiple domains and $\mathbf{p} \in \Delta$, where $\Delta$ is the simplex as $\Delta = \{\mathbf{p} \in \mathbb{R}^m : \sum_i p_i = 1,\ \forall i,\ p_i \ge 0\}$.

The minimax problem can be solved in an alternating manner, which applies gradient descent to learn the model and gradient ascent to update the adversarial distribution. Considering the large number of examples in each data set, we adopt SGD to obtain an unbiased estimate of the original gradient, which avoids enumerating the whole data set. Specifically, at the $t$-th iteration, a mini-batch of size $k$ is randomly sampled from each domain. The loss of the mini-batch from the $i$-th domain is

$$\hat{f}_i(\mathbf{w}) = \frac{1}{k} \sum_{j=1}^{k} \ell(\mathbf{w}; \mathbf{x}_j^i, y_j^i)$$

It is apparent that $\mathbb{E}[\hat{f}_i(\mathbf{w})] = f_i(\mathbf{w})$ and $\mathbb{E}[\nabla \hat{f}_i(\mathbf{w})] = \nabla f_i(\mathbf{w})$.

After sampling, we first update the model by gradient descent as

$$\mathbf{w}_{t+1} = \mathbf{w}_t - \eta_w \sum_{i=1}^{m} p_i^t \nabla_{\mathbf{w}} \hat{f}_i(\mathbf{w}_t) \qquad (2)$$

Then, the distribution is updated in an adversarial way. Since $\mathbf{p}$ lies in the simplex, we adopt the multiplicative updating criterion Arora et al. (2012) to update it as

$$p_i^{t+1} = \frac{p_i^t \exp\big(\eta_p \hat{f}_i(\mathbf{w}_t)\big)}{\sum_{i'=1}^{m} p_{i'}^t \exp\big(\eta_p \hat{f}_{i'}(\mathbf{w}_t)\big)} \qquad (3)$$
  Input: data sets $\{\mathcal{D}_i\}_{i=1}^m$, size of mini-batch $k$, step-sizes $\eta_w$, $\eta_p$, number of iterations $T$
  Initialize $\mathbf{w}_1$ and $\mathbf{p}_1 = [1/m, \dots, 1/m]^\top$
  for $t = 1$ to $T$ do
     Sample $k$ examples from each domain
     Update $\mathbf{w}_{t+1}$ as in Eqn. 2
     Update $\mathbf{p}_{t+1}$ as in Eqn. 3
  end for
  return $\bar{\mathbf{w}} = \frac{1}{T}\sum_t \mathbf{w}_t$, $\bar{\mathbf{p}} = \frac{1}{T}\sum_t \mathbf{p}_t$
Algorithm 1 Framework of Robust Optimization over Multiple Domains
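
To make Alg. 1 concrete, the following is a minimal NumPy sketch of the training loop, assuming user-supplied `loss_val` and `loss_grad` callables for the model; all names here are illustrative rather than the authors' implementation.

import numpy as np

def robust_multi_domain_sgd(domains, loss_val, loss_grad, w0,
                            T=1000, k=32, eta_w=0.01, eta_p=0.01):
    """Sketch of Alg. 1: SGD on the model w, multiplicative updates on p."""
    m = len(domains)
    w = w0.copy()
    p = np.full(m, 1.0 / m)  # p_1: uniform distribution over the m domains
    for t in range(T):
        # sample a mini-batch of k examples from each domain
        batches = []
        for X, y in domains:
            idx = np.random.choice(len(X), size=k, replace=False)
            batches.append((X[idx], y[idx]))
        # Eqn. 2: descend along the p-weighted stochastic gradient
        grad = sum(p[i] * loss_grad(w, Xb, yb)
                   for i, (Xb, yb) in enumerate(batches))
        w = w - eta_w * grad
        # Eqn. 3: multiplicative (exponentiated-gradient) ascent on p
        losses = np.array([loss_val(w, Xb, yb) for Xb, yb in batches])
        p = p * np.exp(eta_p * losses)
        p = p / p.sum()  # renormalize onto the simplex
    return w, p

Note that the per-iteration cost depends only on the mini-batch size, not on the total number of examples, which is the efficiency advantage over the oracle-based update discussed in Section 3.3.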

Alg. 1 summarizes the main steps of the approach. For the convex loss functions, the convergence rate is well known Nemirovski et al. (2009) and we provide a high probability bound for completeness. All detailed proofs can be found in the appendix.

Lemma 1.

Assume that the stochastic gradient and the function value are bounded as $\|\nabla_{\mathbf{w}} \hat{f}_i(\mathbf{w})\|_2 \le G$ and $0 \le \hat{f}_i(\mathbf{w}) \le c$ for every domain $i$. Let $\bar{\mathbf{w}}, \bar{\mathbf{p}}$ denote the results returned by Alg. 1 after $T$ iterations. Set the step-sizes as $\eta_w = O(1/\sqrt{T})$ and $\eta_p = O(1/\sqrt{T})$. Then, with a probability $1 - 2\delta$, for any fixed $\mathbf{w}$ and $\mathbf{p} \in \Delta$ we have

$$F(\bar{\mathbf{w}}, \mathbf{p}) - F(\mathbf{w}, \bar{\mathbf{p}}) \le O\!\left(\sqrt{\frac{\log(1/\delta)}{T}}\right)$$

where the hidden constant depends on $G$, $c$, $\log m$, and the diameter of the feasible set of $\mathbf{w}$.

Given the convex loss, Lemma 1 shows that the proposed method can converge to the saddle point at the rate of $O(1/\sqrt{T})$ with high probability, which is a stronger result than the expectation bound of Namkoong and Duchi (2016).

3.1 Non-convex Loss

Despite the extensive studies of the convex loss, there is little research on the minimax problem with a non-convex loss. To prove the convergence rate for the non-convex problem, we first have the following lemma.

Lemma 2.

With the same assumptions as in Lemma 1, if the loss is non-convex but $L$-smooth, we have

$$\frac{1}{T} \sum_{t=1}^{T} \mathbb{E}\big[\|\nabla_{\mathbf{w}} F(\mathbf{w}_t, \mathbf{p}_t)\|_2^2\big] \le O\!\left(\frac{1}{\eta_w T} + \eta_w + \frac{1}{\eta_w}\Big(\frac{1}{\eta_p T} + \eta_p\Big)\right)$$

Since the loss is non-convex, the convergence is measured by the norm of the gradient (i.e., convergence to a stationary point), which is a standard criterion for the analysis of non-convex problems Ghadimi and Lan (2013). Lemma 2 indicates that $\mathbf{w}$ can converge to a stationary point at which $\mathbf{p}$ is a qualified adversary by setting the step-sizes elaborately. Furthermore, it demonstrates that the convergence rate of $\mathbf{w}$ is influenced by the convergence rate of $\mathbf{p}$ via the last term in the bound.

With Lemma 2, we prove the convergence of the non-convex minimax problem as follows.

Theorem 1.

With the same assumptions as in Lemma 2 and setting the step-sizes as $\eta_w = O(1/T^{1/4})$ and $\eta_p = O(1/\sqrt{T})$, we have

$$\frac{1}{T} \sum_{t=1}^{T} \mathbb{E}\big[\|\nabla_{\mathbf{w}} F(\mathbf{w}_t, \mathbf{p}_t)\|_2^2\big] \le O\!\left(\frac{1}{T^{1/4}}\right)$$

Remark

Compared with the result in Lemma 1, the convergence rate is reduced from $O(1/\sqrt{T})$ to $O(1/T^{1/4})$. Moreover, the convergence rate of general minimization problems with a smooth non-convex loss can be up to $O(1/\sqrt{T})$ Ghadimi and Lan (2013). It demonstrates that obtaining the solution for the non-convex loss is more difficult in minimax problems.

Different step-sizes can lead to different convergence rates for the two players. For example, increasing the step-size $\eta_p$ for updating $\mathbf{p}$ while decreasing the step-size $\eta_w$ accelerates the convergence of $\mathbf{p}$ but degenerates the convergence of $\mathbf{w}$, and vice versa. Therefore, if a sufficiently small step-size is applicable for $\mathbf{p}$, the convergence of $\mathbf{w}$ can be significantly improved. We discuss the strategy to utilize this observation in the next subsection.

3.2 Regularized Non-convex Optimization

A critical problem in minimax optimization is that the formulation is very sensitive to outliers. For example, if there is a domain with significantly worse performance than the others, it will dominate the learning procedure according to Eqn. 1 (i.e., $\mathbf{p}$ becomes a one-hot vector). Besides the issue of robustness, it is prevalent in real-world applications that the importance of domains differs according to their budgets, popularity, etc. Incorporating this side information into the formulation is essential for success in practice. Given a prior distribution, the problem can be written as

$$\min_{\mathbf{w}}\ \max_{\mathbf{p} \in \Delta:\, D(\mathbf{p}, \mathbf{p}_o) \le \epsilon}\ \sum_{i=1}^{m} p_i f_i(\mathbf{w})$$

where $\mathbf{p}_o$ is the prior distribution, which can be a distribution defined from the side information or a uniform distribution for robustness, and $D(\cdot,\cdot)$ defines the distance between two distributions. According to duality theory Boyd and Vandenberghe (2004), for each $\epsilon$, we can obtain the equivalent problem with a specified $\lambda$

$$\min_{\mathbf{w}}\ \max_{\mathbf{p} \in \Delta}\ \sum_{i=1}^{m} p_i f_i(\mathbf{w}) - \lambda D(\mathbf{p}, \mathbf{p}_o) \qquad (4)$$
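
To spell out the duality step, introducing a multiplier $\lambda \ge 0$ for the constraint $D(\mathbf{p}, \mathbf{p}_o) \le \epsilon$ gives the standard Lagrangian identity

$$\max_{\mathbf{p} \in \Delta:\, D(\mathbf{p}, \mathbf{p}_o) \le \epsilon}\ \sum_i p_i f_i(\mathbf{w}) \;=\; \min_{\lambda \ge 0}\ \max_{\mathbf{p} \in \Delta}\ \Big[\sum_i p_i f_i(\mathbf{w}) - \lambda\big(D(\mathbf{p}, \mathbf{p}_o) - \epsilon\big)\Big]$$

where the equality holds by strong duality since the inner problem is concave in $\mathbf{p}$ over a convex set; fixing the optimal $\lambda$ and dropping the constant $\lambda\epsilon$ yields the inner maximization in Eqn. 4.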

Compared with the formulation in Eqn. 1, we introduce a regularizer for the adversarial distribution. It can be an $f$-divergence, e.g., the KL-divergence

$$D(\mathbf{p}, \mathbf{p}_o) = \sum_{i} p_i \log\frac{p_i}{p_{o,i}}$$

or other distances defined on distributions, e.g., the optimal transportation distance

$$D(\mathbf{p}, \mathbf{p}_o) = \min_{T \in \Pi(\mathbf{p}, \mathbf{p}_o)} \langle T, C \rangle$$

where $\Pi(\mathbf{p}, \mathbf{p}_o)$ denotes the set of couplings with marginals $\mathbf{p}$ and $\mathbf{p}_o$ and $C$ is a cost matrix between domains. For computational efficiency, we use the version with an entropy regularizer Cuturi (2013) and we have

Proposition 1.

Define the regularizer as

$$D(\mathbf{p}, \mathbf{p}_o) = \min_{T \in \Pi(\mathbf{p}, \mathbf{p}_o)} \langle T, C \rangle + \gamma \langle T, \log T \rangle$$

and it is convex in $\mathbf{p}$.
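
As a sketch of how this entropy-regularized optimal transport regularizer can be evaluated with Sinkhorn iterations Cuturi (2013); the cost matrix `C` between domains is an assumed input, since its choice is left open here:

import numpy as np

def sinkhorn_ot(p, q, C, gamma=0.1, n_iters=200):
    """Entropy-regularized OT value <T, C> + gamma * <T, log T>
    between distributions p and q, via Sinkhorn matrix scaling."""
    K = np.exp(-C / gamma)           # Gibbs kernel of the cost matrix
    u = np.ones_like(p)
    for _ in range(n_iters):         # alternate the two marginal constraints
        v = q / (K.T @ u)
        u = p / (K @ v)
    T = u[:, None] * K * v[None, :]  # transport plan with marginals p and q
    return float((T * C).sum() + gamma * (T * np.log(T + 1e-30)).sum())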

If $D(\mathbf{p}, \mathbf{p}_o)$ is convex in $\mathbf{p}$, the result in Theorem 1 can be obtained with a similar analysis. Moreover, according to the research on SGD, strong convexity is the key to achieving the optimal convergence rate Rakhlin et al. (2012). Hence, we consider adopting a strongly convex regularizer for the distribution. We analyze the squared $\ell_2$ regularizer $D(\mathbf{p}, \mathbf{p}_o) = \frac{1}{2}\|\mathbf{p} - \mathbf{p}_o\|_2^2$ in this work. The convergence rate for other strongly convex regularizers can be obtained with a similar analysis by defining the smoothness and the strong convexity with the corresponding norm.

The new problem can be solved by a procedure similar to Alg. 1, while we adopt standard gradient ascent to update the adversarial distribution as

$$\mathbf{p}_{t+1} = \Pi_{\Delta}\Big[\mathbf{p}_t + \eta_p \nabla_{\mathbf{p}}\Big(\sum_{i=1}^{m} p_i^t \hat{f}_i(\mathbf{w}_t) - \lambda D(\mathbf{p}_t, \mathbf{p}_o)\Big)\Big]$$

where $\Pi_{\Delta}[\cdot]$ projects the vector onto the simplex. The standard algorithm can be found in Duchi et al. (2008), as sketched below.
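
For reference, a compact sketch of the Euclidean projection $\Pi_{\Delta}[\cdot]$ onto the simplex following Duchi et al. (2008), which sorts the vector once and thresholds it:

import numpy as np

def project_simplex(v):
    """Return argmin_{p in simplex} ||p - v||_2 (Duchi et al., 2008)."""
    u = np.sort(v)[::-1]                     # sort in decreasing order
    css = np.cumsum(u)
    # largest index rho with u_rho > (sum_{j <= rho} u_j - 1) / rho
    rho = np.nonzero(u * np.arange(1, len(v) + 1) > (css - 1.0))[0][-1]
    theta = (css[rho] - 1.0) / (rho + 1.0)   # shift that enforces sum = 1
    return np.maximum(v - theta, 0.0)

# e.g., one ascent step with the squared l2 regularizer assumed above:
# p = project_simplex(p + eta_p * (losses - lam * (p - prior)))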

Since the regularizer is strongly convex, the convergence of $\mathbf{p}$ can be accelerated dramatically, which leads to a better convergence rate for the minimax problem. The theoretical result is as follows.

Theorem 2.

With the same assumptions as in Theorem 1 and assuming the regularized objective $F(\mathbf{w}, \mathbf{p}) - \lambda D(\mathbf{p}, \mathbf{p}_o)$ is $\lambda$-strongly concave in $\mathbf{p}$. When setting the step-sizes as $\eta_w = O(1/\sqrt{T})$ and $\eta_p^t = O(1/(\lambda t))$, we have

$$\frac{1}{T} \sum_{t=1}^{T} \mathbb{E}\big[\|\nabla_{\mathbf{w}} F(\mathbf{w}_t, \mathbf{p}_t)\|_2^2\big] \le O\!\left(\frac{\log T}{\sqrt{T}}\right)$$

Remark

With the strongly convex regularizer, it is not surprising to obtain an $O(\log T / T)$ convergence rate for $\mathbf{p}$ Rakhlin et al. (2012). As we discussed after Lemma 2, a fast convergence rate of $\mathbf{p}$ can improve that of $\mathbf{w}$. In Theorem 2, the convergence rate of $\mathbf{w}$ is improved from $O(1/T^{1/4})$ to $O(\log T/\sqrt{T})$, which is only worse than the standard result of non-convex optimization Ghadimi and Lan (2013) by a small factor of $\log T$. It shows that the applied regularizer not only improves the robustness of the proposed framework but also accelerates the learning procedure.

3.3 Trade Efficiency for Convergence

Finally, we study whether we can recover the optimal convergence rate for the general non-convex problem as in Ghadimi and Lan (2013). Note that Chen et al. (2017) apply a near-optimal oracle to achieve the $O(1/\sqrt{T})$ convergence rate. Given a distribution, it is hard to obtain such an oracle for the non-convex model. In contrast, obtaining the near-optimal adversarial distribution with a fixed model is feasible. For the original problem in Eqn. 1, the solution is trivial: return the index of the domain with the largest empirical loss. For the problem with the regularizer in Eqn. 4, the near-optimal $\mathbf{p}$ can be obtained efficiently by any first-order method Boyd and Vandenberghe (2004). Therefore, we can change the updating criterion for the distribution at the $t$-th iteration to

$$\mathbf{p}_t = \arg\max_{\mathbf{p} \in \Delta}\ \sum_{i=1}^{m} p_i f_i(\mathbf{w}_t) - \lambda D(\mathbf{p}, \mathbf{p}_o) \qquad (5)$$

With the new updating criterion, solved up to an accuracy of $\epsilon = O(1/T)$, we can obtain a better convergence rate as follows.

Theorem 3.

With the same assumptions as in Theorem 1 and updating $\mathbf{p}_t$ as in Eqn. 5, where $\mathbf{p}_t$ is an $\epsilon$-optimal solution of the inner problem. When setting the step-size as $\eta_w = O(1/\sqrt{T})$, we have

$$\frac{1}{T} \sum_{t=1}^{T} \mathbb{E}\big[\|\nabla_{\mathbf{w}} F(\mathbf{w}_t, \mathbf{p}_t)\|_2^2\big] \le O\!\left(\frac{1}{\sqrt{T}}\right)$$

For the problem in Eqn. 1, $\mathbf{p}_t$ can be obtained exactly (i.e., $\epsilon = 0$) by a single pass through the whole data set. It shows that with an expensive but feasible operator as in Eqn. 5, the proposed method can recover the optimal convergence rate for the non-convex problem, as illustrated below for one choice of $D$.
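
As one concrete instance of Eqn. 5 (an illustration, not the only choice the text allows): if $D$ is taken to be the KL-divergence to the prior, the inner maximization has a closed form, a Gibbs distribution over the exact per-domain risks, which makes the update feasible albeit at the cost of a full pass over the data:

import numpy as np

def near_optimal_adversary(full_losses, prior, lam):
    """argmax_{p in simplex} <p, f> - lam * KL(p || prior)
    = softmax(log prior + f / lam), the Gibbs distribution."""
    logits = np.log(prior) + np.asarray(full_losses) / lam
    logits -= logits.max()        # subtract the max for numerical stability
    p = np.exp(logits)
    return p / p.sum()

# full_losses[i] = f_i(w_t) requires one pass over the i-th data set,
# which is the expensive step traded for the faster convergence rate.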

4 Experiments

We conduct the experiments on training deep neural networks over multiple domains. The methods in the comparison are summarized as follows.

  • Individual: it learns the model from an individual data set; one such model is trained per domain.

  • Mixture_uni: it learns the model from multiple domains with even weights, which is equivalent to fixing $\mathbf{p}$ as the uniform distribution.

  • Mixture_adv: it implements the approach proposed in Section 3.2, which learns the model and the adversarial distribution over multiple domains simultaneously.

Deep models are trained with SGD with a fixed mini-batch size. For the methods learning with multiple domains, each mini-batch contains the same number of examples from every domain. Compared with the strategy of sampling examples according to the learned distribution, the applied strategy is deterministic and does not introduce extra noise. The methods are evaluated by investigating the worst case performance among multiple domains: the worst case accuracy is defined as $\min_i \mathrm{Acc}_i$ and the worst case loss as $\max_i f_i(\mathbf{w})$, as sketched below. All experiments are implemented on an NVIDIA Tesla P100 GPU.
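
For clarity, the evaluation protocol in code form; the dictionary containers and values are hypothetical placeholders:

def worst_case_metrics(acc_per_domain, loss_per_domain):
    """Worst case accuracy min_i Acc_i and worst case loss max_i f_i(w)."""
    return min(acc_per_domain.values()), max(loss_per_domain.values())

# toy values for illustration only:
# worst_case_metrics({'domain_a': 0.97, 'domain_b': 0.89},
#                    {'domain_a': 0.12, 'domain_b': 0.41})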

4.1 Pets Categorization

First, we compare the methods on a fine-grained visual categorization task. Given the data sets of VGG cats&dogs Parkhi et al. (2012) and ImageNet Russakovsky et al. (2015), we extract the shared labels between them and then generate the subsets with the desired labels from each of them. The resulting data sets consist of 24 classes and the task is to assign an image of a pet to one of these classes. Each class in ImageNet contains many more training images than its counterpart in VGG, so we apply data augmentation by flipping (horizontal and vertical) and rotating the VGG images to avoid overfitting. After that, the number of images in VGG is similar to that of ImageNet. Some exemplar images from these data sets are illustrated in Fig. 1. We can find that the task in ImageNet is more challenging than that in VGG due to complex backgrounds.

Figure 1: Exemplar images from ImageNet and VGG.
Figure 2: Comparison of discrepancy in losses.
Figure 3: Comparison of running time.

We adopt ResNet18 He et al. (2016) as the base model in this experiment. It is initialized with the parameters learned from ILSVRC2012 Russakovsky et al. (2015) and fine-tuned with a small learning rate. Considering the size of the data sets, we also include the method of Chen et al. (2017) in the comparison, denoted as Mixture_orc. Since the near-optimal oracle is infeasible for Mixture_orc, we approximate it by running multiple SGD iterations instead, as suggested in Chen et al. (2017). The prior distribution in the regularizer is set to the uniform distribution and $\lambda$ is set as in Theorem 2 for Mixture_adv.

Figure 4: Illustration of the worst case training loss: (a), (b) pets categorization; (c), (d) digits recognition.

Fig. 4 summarizes the worst case training loss among multiple domains for the methods in the comparison. Since the performance of the models learned from multiple domains is significantly better than those learned from an individual set, we illustrate the results in separate figures. Fig. 4 (a) compares the proposed method to those trained on an individual data set. It is evident that the proposed method has superior performance and that learning with an individual data set cannot handle the data from other domains well. Fig. 4 (b) shows the results of the methods learning with multiple data sets. First, we find that both Mixture_adv and Mixture_orc achieve a lower worst case loss than Mixture_uni, which confirms the effectiveness of the robust optimization. Second, Mixture_adv performs best among all of these methods, which demonstrates that the proposed method can optimize the performance over the adversarial distribution. To further investigate the discrepancy between the performances on the two domains, we illustrate the result in Fig. 2. The discrepancy is measured by the gap between the training losses of the two domains. We can find that the loss on ImageNet is smaller than that on VGG at the beginning, because the model is initialized with the parameters pre-trained on ImageNet; however, the task in VGG is easier than that in ImageNet, and the loss on VGG drops faster after a few iterations. Compared with the benchmark methods, the discrepancy from the proposed method is an order of magnitude smaller throughout the learning procedure. It verifies the robustness of Mixture_adv and also shows that the proposed method can handle the drifting between multiple domains well. Finally, to compare the performance explicitly, we include the detailed results in Table 1. Compared with Mixture_uni, we observe that Mixture_adv pays more attention to ImageNet than to VGG and trades the performance between them.

Table 1: Comparison on pets categorization. We report the loss and accuracy (%) of each method (Individual per domain, Mixture_uni, Mixture_orc, and Mixture_adv) on ImageNet and VGG; the best accuracies, 97.36 and 89.35, are obtained by the proposed Mixture_adv.

After the comparison of performance, we illustrate the influence of the parameter $\lambda$ in Fig. 5. The parameter appears in Eqn. 4 and constrains the distance of the adversarial distribution to the prior distribution. Besides the regularizer applied in Mixture_adv, we also include the results of the optimal transport regularizer defined in Proposition 1, denoted as Mixture_OT. Fig. 5 (a) and (c) compare the discrepancy between the losses as in the previous experiments. It is obvious that the smaller the $\lambda$, the smaller the gap between the two domains. Fig. 5 (b) and (d) summarize the drifting of the adversarial distribution over iterations. Evidently, the learned adversarial distribution switches adaptively according to the performance of the current model, and the importance of multiple domains can be constrained well by setting $\lambda$ appropriately.

Finally, we compare the running time in Fig. 3. Due to the lightweight update for the adversarial distribution, Mixture_adv and Mixture_OT have almost the same running time as Mixture_uni. Mixture_orc has to enumerate the whole data set after every round of SGD iterations to update the current distribution; hence, even with far fewer complete iterations, its running time is many times larger than that of the proposed method on these small data sets.

Figure 5: Illustration of the influence of the regularizer (panels (a)-(d)).

4.2 Digits Recognition

In this experiment, we examine the methods on the task of digits recognition, which is to identify the 10 digits (i.e., 0-9) from images. There are two benchmark data sets for the task: MNIST and SVHN. MNIST LeCun et al. (1998) is collected for recognizing handwritten digits. It contains 60,000 images for training and 10,000 images for test. SVHN Netzer et al. (2011) is for identifying house numbers from Google Street View images and consists of 73,257 training images and 26,032 test images. Note that the examples in MNIST are grayscale images while those in SVHN are color images. To make the format consistent, we resize the images in MNIST to 32x32 and repeat the gray channel across the RGB channels to generate color images. Considering that the task is more straightforward than pets categorization, we apply AlexNet Krizhevsky et al. (2012) as the base model in this experiment with a small learning rate. With a different deep model, we also demonstrate that the proposed framework can incorporate various deep models.
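
A small sketch of the preprocessing described above; the bilinear interpolation mode is an assumption, and PIL is used only for illustration:

import numpy as np
from PIL import Image

def mnist_to_svhn_format(gray_digit):
    """Resize a 28x28 grayscale MNIST digit to 32x32 and repeat the
    gray channel across RGB so it matches the SVHN input format."""
    img = Image.fromarray(gray_digit).resize((32, 32), Image.BILINEAR)
    arr = np.asarray(img)
    return np.stack([arr, arr, arr], axis=-1)  # shape (32, 32, 3)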

Table 2: Comparison on digits recognition. We report the loss and accuracy (%) of each method (Individual per domain, Mixture_uni, and Mixture_adv) on MNIST and SVHN; the best accuracies, 97.05 and 92.14, are obtained by the proposed Mixture_adv.

Fig. 4 (c) and (d) show the comparison of the worst case training loss and Table 2 summarizes the detailed results. We can observe conclusions similar to those of the experiments on pets categorization. Mixture_uni can achieve good performance on these simple domains, while the proposed method further improves the worst case performance and provides a more reliable model for multiple domains.

5 Conclusion

In this work, we propose a framework to learn a robust model over multiple domains, which is essential for cloud computing services. The introduced algorithm learns the non-convex model and the adversarial distribution simultaneously, for which we provide a theoretical guarantee on the convergence rate. The empirical study on real-world applications confirms that the proposed method can obtain a robust non-convex model. In the future, we plan to examine the performance of the method on more applications. Besides, extending the framework to multiple domains with partially overlapped labels is also important for real-world applications.

References

  • Arora et al. [2012] S. Arora, E. Hazan, and S. Kale. The multiplicative weights update method: a meta-algorithm and applications. Theory of Computing, 8(1):121–164, 2012.
  • Bertsimas et al. [2011] D. Bertsimas, D. B. Brown, and C. Caramanis. Theory and applications of robust optimization. SIAM Review, 53(3):464–501, 2011.
  • Boyd and Vandenberghe [2004] S. Boyd and L. Vandenberghe. Convex optimization. Cambridge university press, 2004.
  • Cesa-Bianchi and Lugosi [2006] N. Cesa-Bianchi and G. Lugosi. Prediction, learning, and games. Cambridge university press, 2006.
  • Chen et al. [2017] R. S. Chen, B. Lucier, Y. Singer, and V. Syrgkanis. Robust optimization for non-convex objectives. In NIPS, pages 4708–4717, 2017.
  • Cuturi [2013] M. Cuturi. Sinkhorn distances: Lightspeed computation of optimal transport. In NIPS, pages 2292–2300, 2013.
  • Duchi et al. [2008] J. C. Duchi, S. Shalev-Shwartz, Y. Singer, and T. Chandra. Efficient projections onto the $\ell_1$-ball for learning in high dimensions. In ICML, pages 272–279, 2008.
  • Duchi et al. [2016] J. C. Duchi, P. Glynn, and H. Namkoong. Statistics of Robust Optimization: A Generalized Empirical Likelihood Approach. ArXiv e-prints, 2016.
  • Ghadimi and Lan [2013] S. Ghadimi and G. Lan. Stochastic first- and zeroth-order methods for nonconvex stochastic programming. SIAM Journal on Optimization, 23(4):2341–2368, 2013.
  • He et al. [2016] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, pages 770–778, 2016.
  • Krizhevsky et al. [2012] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In NIPS, pages 1106–1114, 2012.
  • LeCun et al. [1998] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
  • Namkoong and Duchi [2016] H. Namkoong and J. C. Duchi. Stochastic gradient methods for distributionally robust optimization with f-divergences. In NIPS, pages 2208–2216, 2016.
  • Nemirovski et al. [2009] A. Nemirovski, A. Juditsky, G. Lan, and A. Shapiro. Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization, 19(4):1574–1609, 2009.
  • Netzer et al. [2011] Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, and A. Y. Ng. Reading digits in natural images with unsupervised feature learning. In NIPS workshop on deep learning and unsupervised feature learning, volume 2011, page 5, 2011.
  • Parkhi et al. [2012] O. M. Parkhi, A. Vedaldi, A. Zisserman, and C. V. Jawahar. Cats and dogs. In CVPR, 2012.
  • Rakhlin et al. [2012] A. Rakhlin, O. Shamir, and K. Sridharan. Making gradient descent optimal for strongly convex stochastic optimization. In ICML, 2012.
  • Russakovsky et al. [2015] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. IJCV, 115(3):211–252, 2015.
  • Shalev-Shwartz and Wexler [2016] S. Shalev-Shwartz and Y. Wexler. Minimizing the maximal loss: How and why. In ICML, pages 793–801, 2016.
  • Szegedy et al. [2015] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. E. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In CVPR, pages 1–9, 2015.
  • Zhang and Yang [2017] Y. Zhang and Q. Yang. A survey on multi-task learning. CoRR, abs/1707.08114, 2017.

Appendix A Theoretical Analysis

A.1 Proof of Lemma 1

Proof.

According to the updating criterion, we have

(6)

where $\mathrm{KL}(\mathbf{p} \,\|\, \mathbf{p}_t)$ denotes the KL-divergence between the distributions $\mathbf{p}$ and $\mathbf{p}_t$. Note that for the bounded mini-batch losses, we have

Therefore

Since the mini-batch losses are non-negative and bounded by $c$, we have

Substituting it back into Eqn. 6, we have

(7)

Therefore, for an arbitrary distribution $\mathbf{p}$, we have

(8)

On the other hand, due to the convexity of the loss function, we have the following inequality for an arbitrary model $\mathbf{w}$

(9)

Combining Eqn. 8 and Eqn. 9 and summing from $t = 1$ to $T$,

where we use $\mathrm{KL}(\mathbf{p} \,\|\, \mathbf{p}_1) \le \log m$ with the fact that $\mathbf{p}_1$ is the uniform distribution.

Note that the mini-batch losses and gradients are unbiased estimates, so their deviations form a martingale difference sequence. According to the Hoeffding-Azuma inequality for martingale difference sequences Cesa-Bianchi and Lugosi [2006], with a probability $1 - \delta$, we have
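
For reference, the Hoeffding-Azuma inequality in the form used here: for a martingale difference sequence $V_1, \dots, V_T$ with $|V_t| \le b$, with a probability at least $1 - \delta$,

$$\sum_{t=1}^{T} V_t \;\le\; b\sqrt{2T\log(1/\delta)}$$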

By a similar analysis, with a probability $1 - \delta$, we have

Therefore, when setting the step-sizes $\eta_w$ and $\eta_p$ as in the lemma, with a probability $1 - 2\delta$, we have

where the two residual terms are

Due to the convexity of $F(\mathbf{w}, \mathbf{p})$ in $\mathbf{w}$ and its concavity in $\mathbf{p}$, with a probability $1 - 2\delta$, we have

We finish the proof by taking the desired $\mathbf{w}$ and $\mathbf{p}$ into the inequality. ∎

A.2 Proof of Lemma 2

Proof.

We first present some necessary definitions.

Definition 1.

A function $f(\cdot)$ is called $L$-smooth w.r.t. a norm $\|\cdot\|$ if there is a constant $L$ such that for any $\mathbf{x}$ and $\mathbf{y}$, it holds that

$$f(\mathbf{y}) \le f(\mathbf{x}) + \langle \nabla f(\mathbf{x}), \mathbf{y} - \mathbf{x} \rangle + \frac{L}{2}\|\mathbf{y} - \mathbf{x}\|^2$$

Definition 2.

A function $f(\cdot)$ is called $\lambda$-strongly convex w.r.t. a norm $\|\cdot\|$ if there is a constant $\lambda$ such that for any $\mathbf{x}$ and $\mathbf{y}$, it holds that

$$f(\mathbf{y}) \ge f(\mathbf{x}) + \langle \nabla f(\mathbf{x}), \mathbf{y} - \mathbf{x} \rangle + \frac{\lambda}{2}\|\mathbf{y} - \mathbf{x}\|^2$$

According to the $L$-smoothness of the loss function, we have

So we have

(10)

Now we bound the difference between $F(\mathbf{w}_t, \mathbf{p}_t)$ and $\max_{\mathbf{p} \in \Delta} F(\mathbf{w}_t, \mathbf{p})$

(11)
(12)

Eqn. 11 follows from Pinsker's inequality and Eqn. 12 from the inequality in Eqn. 7 by taking the maximizing distribution.
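
For reference, Pinsker's inequality bounds the $\ell_1$ distance between two distributions by their KL-divergence:

$$\|\mathbf{p} - \mathbf{q}\|_1^2 \;\le\; 2\,\mathrm{KL}(\mathbf{p} \,\|\, \mathbf{q})$$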

Summing Eqn. 10 from $t = 1$ to $T$ together with Eqn. 12, we have

On the other hand, with a similar analysis as in the proof of Lemma 1, we have

A.3 Proof of Theorem 2

Proof.

Since the regularized objective is $\lambda$-strongly concave in $\mathbf{p}$, we have

Taking the maximizer and summing the inequality from $t = 1$ to $T$, we have

On the other hand, we have

Substituting it back and summing from $t = 1$ to $T$, we have

We finish the proof by setting the step-sizes as in the theorem. ∎

A.4 Proof of Theorem 3

Proof.

According to the $L$-smoothness of the loss function, we have

So we have

Summing the inequalities from $t = 1$ to $T$ completes the proof. ∎