Abstract
Sequential change-point detection when the distribution parameters are unknown is a fundamental problem in statistics and machine learning. When the underlying distributions belong to the exponential family, we show that detection procedures based on sequential likelihood ratios with simple one-sample update estimates, such as online mirror descent, are nearly second-order asymptotically optimal, under mild conditions on the expected Kullback-Leibler divergence between the estimators and the true parameters. This means that the upper bound on the expected detection delay of the algorithm, subject to a false-alarm constraint (measured by the average run length), meets the lower bound asymptotically up to a log-log factor as the threshold tends to infinity. This is a blessing: although generalized likelihood ratio (GLR) statistics are asymptotically optimal in theory, they cannot be computed recursively, and their exact computation can be time-consuming. We prove the nearly second-order asymptotic optimality by making a connection between sequential change-point detection and online convex optimization, and by leveraging the logarithmic regret bound of the online mirror descent algorithm. Numerical and real-data examples validate our theory.
Nearly second-order asymptotic optimality of sequential change-point detection with one-sample updates
Yang Cao, Liyan Xie, Yao Xie*, and Huan Xu
*Correspondence: yao.xie@isye.gatech.edu
1 Introduction
Sequential analysis is a classic topic in statistics concerning online inference from a sequence of observations. The goal is to make statistical inference as quickly as possible, while controlling the false-alarm rate. An important and commonly studied sequential analysis problem is sequential change-point detection Siegmund [1985]. It arises in various applications including online anomaly detection, statistical quality control, biosurveillance, financial arbitrage detection, and network security monitoring (see, e.g., [Siegmund, 2013, Tartakovsky et al., 2014]).
We are interested in the sequential change-point detection problem with known pre-change parameters but unknown post-change parameters. Specifically, given a sequence of samples, we assume that they are independent and identically distributed (i.i.d.) according to a distribution from a parametric family, with parameter values that differ before and after some unknown time called the change-point. We further assume that the parameters before the change-point are known. This is reasonable because it is usually relatively easy to obtain reference data for the normal state, so that the parameters in the normal state can be estimated with good accuracy. After the change-point, however, the parameters switch to some unknown values, which represent an anomaly or novelty that needs to be discovered.
1.1 Motivation: Dilemma of CUSUM and generalized likelihood ratio (GLR) statistics
Consider change-point detection with unknown post-change parameters. A commonly used change-point detection method is the so-called CUSUM procedure [Tartakovsky et al., 2014], which can be derived from likelihood ratios. Assume that before the change the samples follow one distribution, and after the change they follow another. The CUSUM procedure has a recursive structure: initialized at zero, the likelihood-ratio statistic can be updated recursively, and a change-point is detected whenever the statistic exceeds a pre-specified threshold. Due to this recursive structure, CUSUM is memory- and computation-efficient: it does not need to store the historical data and only needs to record the current value of the statistic. The performance of CUSUM depends on the choice of the assumed post-change parameter; in particular, there must be a well-defined notion of "distance" between the pre-change and post-change parameters. However, this choice is somewhat subjective. Even though in practice a reasonable choice is the "smallest" change of interest, in the multidimensional setting it is hard to define what the "smallest" change would mean. Moreover, when the assumed parameter deviates significantly from the true parameter value, CUSUM may suffer severe performance degradation Granjon [2013].
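To make the recursion concrete, here is a minimal sketch of CUSUM for a Gaussian mean shift; the pre-change mean 0, assumed post-change mean 1, unit variance, and threshold are all illustrative choices of ours, not values from the paper.

```python
def cusum(samples, mu0=0.0, mu1=1.0, threshold=10.0, sigma=1.0):
    """Recursive CUSUM for a Gaussian mean shift from mu0 to an assumed mu1.

    Returns the first time the statistic exceeds the threshold,
    or None if no change is declared.
    """
    w = 0.0  # initial statistic W_0 = 0
    for t, x in enumerate(samples, start=1):
        # log-likelihood ratio of x under N(mu1, sigma^2) vs N(mu0, sigma^2)
        llr = (mu1 - mu0) * (x - (mu0 + mu1) / 2.0) / sigma ** 2
        w = max(w + llr, 0.0)  # recursive CUSUM update
        if w > threshold:
            return t  # declare a change at time t
    return None
```

Only the scalar `w` is stored between steps, which is exactly the O(1) memory property discussed above.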
An alternative approach is the generalized likelihood ratio (GLR) statistic based procedure Basseville et al. [1993]. The GLR statistic finds the maximum likelihood estimate (MLE) of the post-change parameter and plugs it back into the likelihood ratio to form the detection statistic. More precisely, for each hypothetical change-point location, one forms the MLE from the corresponding post-change samples. Since we do not know beforehand whether and where the change occurs, the GLR statistic maximizes over all possible change locations, and a change is announced whenever the statistic exceeds a pre-specified threshold. The GLR statistic is more robust than CUSUM [Lai, 1998], and it is particularly useful when the post-change parameter may vary from one situation to another. However, a drawback of the GLR statistic is that it is not memory-efficient and in general cannot be computed recursively. Moreover, when there is a constraint on the maximum likelihood estimator (such as sparsity), the MLE may not have a closed-form solution; one then has to store the historical data and recompute the MLE whenever new data arrive. As a remedy, the window-limited GLR is usually considered, where one keeps only the most recent samples within a window and restricts the maximization to change locations inside the window. However, even with the window-limited GLR, one still has to recompute the MLEs from historical data whenever new data are added.
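For contrast, here is a sketch of the window-limited GLR for the same Gaussian mean-shift setting (the window size, threshold, and data model are our illustrative choices). Note that, unlike CUSUM, every step re-estimates the MLE for each putative change-point in the window from stored samples:

```python
def window_glr(samples, mu0=0.0, window=20, threshold=10.0):
    """Window-limited GLR for a shift in a Gaussian mean (unit variance).

    For each putative change-point k inside the window, the post-change
    mean is re-estimated by its MLE (the sample mean of the segment), and
    the GLR statistic maximizes the resulting log-likelihood ratio.
    """
    history = []
    for t, x in enumerate(samples, start=1):
        history.append(x)
        history = history[-window:]  # keep only the last `window` samples
        glr = 0.0
        for k in range(len(history)):  # scan putative change-points
            seg = history[k:]
            mle = sum(seg) / len(seg)  # MLE of the post-change mean
            # log-likelihood ratio with the MLE plugged in:
            # sum_i [ (x_i - mu0)^2 - (x_i - mle)^2 ] / 2
            llr = sum((xi - mu0) ** 2 - (xi - mle) ** 2 for xi in seg) / 2.0
            glr = max(glr, llr)
        if glr > threshold:
            return t
    return None
```

Each step stores the window and recomputes MLEs for every putative change-point, illustrating the O(window) memory and per-step computation cost discussed above.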
Besides CUSUM and GLR, various one-sample update schemes have been considered to reduce the memory and computation cost of online change-point detection procedures. A one-sample update computes each new estimate using only the most recent sample and the previous estimate. One-sample update schemes estimate the unknown parameters online and plug the estimates into the likelihood ratio statistic to perform detection. The one-sample update enjoys efficient computation, as the information from each new sample can be incorporated at low computational cost. It is also memory-efficient, since the update only needs the most recent sample. For instance, the nonanticipating estimator in Lorden and Pollak [2005] was constructed by a moving average for the Gaussian mean and for Gamma distribution parameters; the authors of Raginsky et al. [2009, 2012] construct estimators using the general online mirror descent approach (for an online outlier detection problem, which differs from the persistent-change setting here). For general settings, online mirror descent provides a good framework for constructing one-sample update schemes, since it can be computed efficiently via online convex optimization and even has a closed-form solution in various cases. The one-sample update estimators may not correspond to the exact MLE, but they tend to yield good detection performance. The authors of Lorden and Pollak [2005] establish a general result on asymptotic detection delay for asymptotically efficient estimators. However, online mirror descent based estimators may not satisfy this requirement, and hence no performance guarantees exist in general for such approaches. The three approaches are compared in Table 1; the justification can be found in Appendix A.
Table 1. Comparison of the three approaches.

                              Memory Requirement   Computation Requirement   Robust Performance
CUSUM                         Low                  Low                       No
GLR with exact MLE            High                 High                      Yes
One-sample update schemes     Low                  Low                       Yes
It is clear that one-sample update schemes require less memory and computation than GLR. However, an important question remains to be answered: how much performance do we lose by using one-sample update schemes rather than the exact GLR?
1.2 Application scenario: Social network change-point detection
The widespread use of social networks (such as Twitter) produces a large amount of user-generated data continuously. One important task is to detect change-points in streaming social network data. These change-points may represent the collective anticipation of, or response to, external events or system "shocks" Peel and Clauset [2015]. Detecting such changes can provide a better understanding of patterns of social life. In social networks, a common form of data is discrete events over continuous time. As a simplification, each event contains a time label and a user label in the network. In our prior work Li et al. [2017], we model discrete-event data using network point processes, which capture the influence between users through an influence matrix, and we cast the problem as detecting changes in the influence matrix. We assume that the influence matrix in the normal state (before the change) can be estimated from reference data. After the change, the influence matrix is unknown, since it is caused by an anomaly, and it has to be estimated online. Because the scale of the network can be large, computational and memory constraints mean that we do not want to store the entire history of data; rather, we would like to compute the detection statistic in real time. In Li et al. [2017], we develop a one-sample update scheme that estimates the influence matrix using expectation-maximization algorithms and then forms the likelihood ratio detection statistic. However, the theoretical performance of the algorithm has not been well understood.
1.3 Contributions
This paper aims to address the above question by proving the nearly second-order asymptotic optimality of one-sample update schemes for the one-sided sequential hypothesis test and for sequential change-point detection in the exponential family. While similar questions have been considered previously in Robbins and Siegmund [1974], Lorden and Pollak [2005, 2008], we consider likelihood ratios with plug-in online mirror descent (OMD) estimators (similar to those in Raginsky et al. [2009, 2012]). Nearly second-order asymptotic optimality [Tartakovsky et al., 2014] means that the upper bound on performance matches the lower bound up to a log-log factor as the false-alarm rate goes to zero. Here we focus on OMD estimators, but the results can be generalized to other schemes such as online gradient descent. The proof leverages the logarithmic regret property of online mirror descent and the lower bounds established in the statistical sequential change-point literature [Siegmund and Yakir, 2008, Tartakovsky et al., 2014]. Synthetic examples validate the performance of one-sample update schemes.
The contributions of this paper are summarized as follows:

Inspired by the existing connection between sequential analysis and online convex optimization in Cesa-Bianchi and Lugosi [2006], Hazan [2016], we provide a general upper bound for one-sided sequential hypothesis test and change-point detection procedures with one-sample update schemes. The upper bound explicitly captures the impact of estimation on detection through an estimation-algorithm-dependent factor. This factor shows up as an additional term in the upper bound on the expected detection delay, and it corresponds to the regret incurred by the one-sample update estimators. This establishes an interesting link between sequential change-point detection and online convex optimization. (Although both fields study sequential data, the precise connection between them has been unclear, partly because the performance metrics are different: the former concerns the tradeoff between average run length and detection delay, whereas the latter focuses on bounding the cumulative loss incurred by the sequence of estimators through a regret bound [Azoury and Warmuth, 2001, Hazan, 2016].)

Using our upper bound and existing lower bounds, we show that the one-sample update schemes are nearly second-order optimal for the exponential family. Moreover, numerical examples verify the good performance of one-sample update schemes. They can perform better, and are more robust, than likelihood ratio methods with pre-specified parameters (e.g., CUSUM for change-point detection). They are also computationally efficient alternatives to the GLR statistic and incur little performance loss relative to GLR.
1.4 Literature and related work
Sequential change-point detection is a classic subject with an extensive literature. Much success has been achieved when the pre-change and post-change distributions are exactly specified. Examples include the CUSUM procedure [Page, 1954], which enjoys first-order asymptotic optimality Lorden [1971] and exact optimality Moustakides [1986] in the minimax sense, and the Shiryayev-Roberts (SR) procedure Shiryaev [1963], which is derived from a Bayesian principle and also enjoys various optimality properties. Both the CUSUM and SR procedures rely on likelihood ratios between the specified pre-change and post-change distributions.
The GLR statistic [Lai, 1995, 1998] enjoys certain optimality properties, but it cannot be computed recursively in most cases Lai [2004]. To address the infinite-memory issue, Willsky and Jones [1976], Lai [1998] studied the window-limited GLR procedure. Another approach to the same issue is the Shiryayev-Roberts-Robbins-Siegmund (SRRS) procedure Lorden and Pollak [2005]. The main idea of SRRS dates back to the "one-sided" sequential test Robbins and Siegmund [1974]: instead of plugging in the MLE obtained from all samples up to the current moment, as done in the GLR procedure, the SRRS procedure uses a sequence of nonanticipating estimators, formed by dropping the most recent sample (hence the name "nonanticipating"). The test statistic can then be computed recursively.
The seminal work Lorden and Pollak [2005] laid a theoretical foundation, while constructions of the nonanticipating estimators were given for two specific examples, the Gaussian and Gamma distributions, based on moving averages. Our work considers a more general approach for constructing nonanticipating estimators based on online mirror descent (OMD), which can handle multidimensional parameters and constraints on parameters such as sparsity and smoothness. For a one-dimensional Gaussian mean shift, our approach reduces to the nonanticipating estimator constructed in Lorden and Pollak [2005]. Later, the authors of Lorden and Pollak [2008] generalized the results to the exponential family by introducing a new registering technique and used it to prove second-order asymptotic optimality for the Gaussian mean shift. Compared with Lorden and Pollak [2008], our work provides an alternative proof of the nearly second-order asymptotic optimality by making a connection to online convex optimization and leveraging regret-bound-type results Hazan [2016]. For a one-dimensional Gaussian mean shift without any constraint, we recover the second-order asymptotic optimality, namely Theorem 3.3 in Lorden and Pollak [2008].
Another related problem is sequential joint estimation and detection, but the goal there is different: one aims to achieve both good detection and good estimation performance, whereas in our setting estimation is needed only to compute the detection statistics. These works include Pollak [1987], which developed a modified SR procedure by introducing a prior distribution on the unknown parameters, and Yilmaz et al. [2015] and Yılmaz et al. [2016], which study a joint detection and estimation problem of a specific form that arises in many applications, such as spectrum sensing Yilmaz et al. [2014], image observations Vo et al. [2010], and MIMO radar Tajer et al. [2010]: a linear scalar observation model with Gaussian noise, where under the alternative hypothesis there is an unknown multiplicative parameter. The paper Yilmaz et al. [2015] demonstrates that solving the joint problem by treating detection and estimation separately, each with its corresponding optimal procedure, does not yield overall optimal performance, and it provides an elegant closed-form optimal detector; Yılmaz et al. [2016] later generalized these results. There are also other approaches to the joint detection-estimation problem using multiple hypothesis testing [Baygun and Hero, 1995, Vo et al., 2010] and Bayesian formulations Moustakides et al. [2012].
Related work using online convex optimization for anomaly detection includes Raginsky et al. [2009], which develops an efficient detector for the exponential family using online mirror descent and proves a logarithmic regret bound, and Raginsky et al. [2012], which dynamically adjusts the detection threshold to incorporate feedback about decision outcomes. However, these works consider a different setting, in which the change is a transient outlier rather than a persistent change, as assumed in the classic statistical change-point detection literature. When there is a persistent change, it is important to accumulate "evidence" by pooling the post-change samples (our work considers the persistent change).
Extensive work has been done on parameter estimation in the online setting. This includes online density estimation over the exponential family by regret minimization [Azoury and Warmuth, 2001, Raginsky et al., 2009, 2012], sequential prediction of individual sequences with the logarithmic loss [Cesa-Bianchi and Lugosi, 2006, Kotlowski and Grünwald, 2011], online prediction for time series O. Anava and Shamir [2013], and sequential NML (SNML) prediction Kotlowski and Grünwald [2011], which achieves the optimal regret bound. Our problem is different from the above in that estimation is not the end goal; one only performs parameter estimation to plug the estimates back into the likelihood function for detection. Moreover, a subtle but important difference is our choice of loss function, which is made to retain the martingale property of the detection statistic; this property is essential for establishing the nearly second-order asymptotic optimality.
At a high level, our work is also related to the universal source coding problem [Cover and Thomas, 2012, Cesa-Bianchi and Lugosi, 2001] and the minimum description length (MDL) principle [Rissanen, 1985, Barron et al., 1998]. In the universal source coding problem, the goal is to minimize the cumulative Kullback-Leibler (KL) loss.
2 Preliminaries
Assume a sequence of i.i.d. random variables with a probability density function of a parametric form, where the parameter may be unknown. We consider two related problems: the one-sided sequential hypothesis test and sequential change-point detection. The detection statistic relies on a sequence of estimators constructed using online mirror descent. OMD uses a simple one-sample update: each new estimate is computed from the previous estimate and the current sample only. This is the main difference from the traditional generalized likelihood ratio (GLR) statistic Lai [1998], where each estimate is computed from the historical samples. In the following, we present detailed descriptions of the two problems. We consider exponential family distributions and present our nonanticipating estimator based on the one-sample update.
2.1 One-sided sequential hypothesis test
First, we consider a one-sided sequential hypothesis test where the goal is only to reject the null hypothesis. This is a special case of the change-detection problem in which the change-point is either at the very beginning or at infinity (meaning the change never occurs). Studying this special case will give us an important intermediate step towards solving the sequential change-detection problem.
Consider the null hypothesis versus the alternative, under which the parameter is unknown. The classic approach to this problem is the one-sided sequential probability ratio test (SPRT) [Wald and Wolfowitz, 1948]: at each time, given the samples observed so far, the decision is either to reject the null hypothesis or, if the rejection decision cannot be made confidently, to take more samples. Here, we introduce a modified one-sided SPRT with a sequence of nonanticipating plug-in estimators:
(1) 
Define the test statistic at time as
(2) 
The test statistic has a simple recursive implementation:
Define a sequence of σ-algebras generated by the samples. The test statistic has the martingale property due to its nonanticipating nature, where the expectation is taken with the samples being i.i.d. random variables drawn from the null distribution. The decision rule is a stopping time
(3) 
where the threshold is pre-specified. We reject the null hypothesis whenever the statistic exceeds the threshold. The goal is to reject the null hypothesis using as few samples as possible, subject to the false-alarm rate (or Type-I error) constraint.
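As a minimal illustration of the test (2)-(3), the following sketch implements the one-sided SPRT with nonanticipating plug-in estimates for a unit-variance Gaussian mean; the running-mean estimator and the initial guess are illustrative choices of ours (here the running mean happens to coincide with the MLE):

```python
def one_sided_sprt(samples, mu0=0.0, threshold=10.0, theta_init=1.0):
    """One-sided SPRT with nonanticipating plug-in estimates (Gaussian mean).

    Before observing x_t, the post-change mean is predicted by the estimate
    computed from x_1, ..., x_{t-1} only; this nonanticipating structure is
    what keeps the likelihood ratio a martingale under the null.
    """
    log_lr = 0.0
    theta = theta_init  # initial guess before any data (an assumption)
    n = 0
    for t, x in enumerate(samples, start=1):
        # plug in the nonanticipating estimate theta (does NOT use x yet)
        log_lr += (theta - mu0) * (x - (theta + mu0) / 2.0)
        if log_lr > threshold:
            return t  # reject the null at time t
        # one-sample update of the estimate (here: running mean)
        n += 1
        theta += (x - theta) / n
    return None
```

Note the order of operations: the statistic is updated first, and only then is the estimate refreshed with the new sample.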
2.2 Sequential change-point detection
Now we consider the sequential change-point detection problem. A change may occur at an unknown time, changing the underlying distribution of the data. One would like to detect such a change as quickly as possible. Formally, change-point detection can be cast as the following hypothesis test:
(4) 
Here the post-change parameter is unknown and represents the anomaly. The goal is to detect the change as quickly as possible after it occurs, subject to the false-alarm rate constraint. We consider likelihood ratio based detection procedures adapted from two existing types, which we call the adaptive CUSUM (ACM) procedure and the adaptive SRRS (ASR) procedure.
For change-point detection, the post-change parameter is estimated using the post-change samples. This means that, for each putative change-point location before the current time, with a slight abuse of notation, the post-change parameter is estimated from the corresponding post-change samples as
(5) 
Therefore, when the hypothetical change-point is at the very beginning, the statistic reduces to the one defined in (2) for the one-sided SPRT. The likelihood ratio at the current time for a hypothetical change-point location is given by (with suitable initialization)
(6) 
where can be computed recursively similar to (2).
Since we do not know the change-point location, following the maximum likelihood principle, we take the maximum of the statistics over all possible change locations. This gives the ACM procedure:
(7) 
where is a prespecified threshold.
Similarly, replacing the maximization in (7) with a summation, we obtain the following ASR procedure [Lorden and Pollak, 2005], which can be interpreted as a Bayesian statistic similar to the Shiryayev-Roberts procedure:
(8) 
where the threshold is pre-specified. The computation of the detection statistics and of the estimators is discussed in Section 2.4. For a fixed change-point location, the comparison between our methods and GLR is illustrated in Figure 1.
To prevent the memory and computation complexity from blowing up as time goes to infinity, we can use window-limited versions of the detection procedures in (7) and (8), obtained by restricting the maximization in (7) and the summation in (8) to change locations within a prescribed window of the most recent samples. Although we do not provide a theoretical analysis of the window-limited versions, we refer readers to Lai [1998] for the choice of the window size in window-limited GLR procedures.
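The window-limited ACM and ASR statistics can be sketched as follows for the unit-variance Gaussian mean case; each putative change-point in the window keeps its own nonanticipating running estimate, updated with one sample per step, so the per-step cost is O(window) with no MLE recomputation. The running-mean estimator and the initial guess of 1 are our illustrative choices:

```python
import math

def acm_asr(samples, mu0=0.0, theta_init=1.0, window=20):
    """Window-limited ACM and ASR statistics for a unit-variance Gaussian
    mean shift. Each entry of `state` tracks one putative change-point as
    [log_lr, theta, n], where theta is a nonanticipating running-mean
    estimate of the post-change mean.
    """
    state = []
    acm_path, asr_path = [], []
    for x in samples:
        # open a new hypothesis: the change occurs right before this sample
        state.append([0.0, theta_init, 0])
        state = state[-window:]  # window-limited: drop the oldest hypotheses
        for s in state:
            # one-sample update of the log-likelihood ratio; theta is the
            # prediction made BEFORE seeing x (nonanticipating)
            s[0] += (s[1] - mu0) * (x - (s[1] + mu0) / 2.0)
            s[2] += 1
            s[1] += (x - s[1]) / s[2]  # running-mean estimate (our choice)
        acm_path.append(max(s[0] for s in state))            # ACM: max over k
        asr_path.append(sum(math.exp(s[0]) for s in state))  # ASR: sum of LRs
    return acm_path, asr_path
```

A stopping rule would declare a change once `acm_path` (or `asr_path`) crosses a pre-specified threshold.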
2.3 Exponential family
In this paper, we focus on the exponential family for the following reasons: (i) the exponential family [Raginsky et al., 2012] represents a very rich class of parametric models and even many nonparametric statistical models [Barron and Sheu, 1991]; (ii) the negative log-likelihood of the exponential family is convex, which allows us to perform online convex optimization. Some useful properties of the exponential family are briefly summarized below; full proofs can be found in Wainwright et al. [2008], Raginsky et al. [2012].
Consider an observation space equipped with a σ-algebra and a σ-finite measure. Assume the number of parameters is fixed, and let superscript T denote the transpose of a vector or matrix. Let the sufficient statistic be a measurable function on the observation space, and let the parameter space be given. Then, a set of probability distributions with respect to the base measure is said to be a multivariate exponential family with natural parameter if the probability density function of each member with respect to the base measure can be expressed in the standard exponential form. In this definition, the so-called log-partition function is given by
To ensure a well-defined probability density, we consider the following two sets of parameters:
and
Note that the log-partition function is strongly convex over the restricted parameter set. Its gradient corresponds to the mean of the sufficient statistic, and its Hessian corresponds to the covariance matrix of the sufficient statistic; therefore, the Hessian is positive semidefinite and the log-partition function is convex. Moreover, it is a Legendre function, which means that it is strongly convex, continuously differentiable, and essentially smooth Wainwright et al. [2008]. Its Legendre-Fenchel dual is defined as
The gradient mapping of the dual is the inverse of the gradient mapping of the log-partition function [Beck and Teboulle, 2003]. Moreover, if the log-partition function is strongly convex, then its dual has a Lipschitz-continuous gradient.
A general measure of proximity used in OMD is the so-called Bregman divergence, a nonnegative function induced by a Legendre function (see, e.g., [Wainwright et al., 2008, Raginsky et al., 2012]) defined as
(9) 
For the exponential family, a natural choice of the Bregman divergence is the Kullback-Leibler (KL) divergence. Let the expectation be taken under the distribution with a given parameter, and define the KL divergence between two distributions with densities in the family for any pair of parameters. Then
(10) 
It can be shown that, for the exponential family, the KL divergence can be expressed through the log-partition function. Using the definition (9), this means that
(11) 
is a Bregman divergence. This property is quite useful for constructing mirror descent estimators for the exponential family [Nemirovskii et al., 1983, Beck and Teboulle, 2003].
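As a quick numerical check of this identity for one concrete member of the family, take the unit-variance Gaussian in its natural parameterization, whose log-partition function is quadratic; the Bregman divergence it induces matches the closed-form Gaussian KL divergence (this one-dimensional example is our illustration, not a construction from the paper):

```python
def log_partition(theta):
    # log-partition of N(theta, 1) in the natural parameterization
    return theta ** 2 / 2.0

def grad_log_partition(theta):
    return theta  # mean of the sufficient statistic phi(x) = x

def bregman(u, v):
    """Bregman divergence B_Phi(u, v) induced by the log-partition."""
    return log_partition(u) - log_partition(v) - grad_log_partition(v) * (u - v)

def kl_gaussian(a, b):
    """Closed-form KL( N(a,1) || N(b,1) ) = (a - b)^2 / 2."""
    return (a - b) ** 2 / 2.0
```

For this family, `kl_gaussian(a, b)` agrees with `bregman(b, a)` exactly (the Gaussian case is symmetric in its two arguments, so the argument order is not visible here).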
2.4 Online mirror descent (OMD) for nonanticipating estimators
Next, we discuss how to construct the nonanticipating estimators in (1) and (5) using OMD. OMD is a generic procedure for solving the online convex optimization (OCO) problem Hazan [2016], Shalev-Shwartz et al. [2012]. Our problem of finding an approximate maximum likelihood estimator can be cast as an OCO problem with the loss function being the negative log-likelihood, which corresponds to the log-loss in Cesa-Bianchi and Lugosi [2006].
The main idea of OMD is the following. At each time step, the estimator is updated using the new sample by balancing the tendency to stay close to the previous estimate against the tendency to move in the direction of the greatest local decrease of the loss function. For the loss function defined above, a sequence of OMD estimators is constructed by
(12) 
where the Bregman divergence is defined in (11). Here the feasible set is a closed convex set, which is problem-specific and encourages certain parameter structure such as sparsity (see Section 4 for examples). Remark: Similar to (12), for any fixed putative change-point location, we can compute the corresponding estimators via OMD for sequential change-point detection; the only difference is the choice of the first sample from which the recursive update (12) is started.
There is an equivalent form of OMD, presented as the original formulation in Nemirovskii et al. [1983]. This equivalent form is sometimes easier to use for algorithm development, and it consists of four steps: (1) compute the dual variable; (2) perform the dual update; (3) compute the primal variable; (4) perform the projected primal update. The equivalence between this form of OMD and the nonlinear projected subgradient approach in (12) is proved in Beck and Teboulle [2003]. We adopt this form when deriving our algorithm and follow the same strategy as Raginsky et al. [2009]. Our algorithm is presented in Algorithm 1.
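The four steps can be sketched for the unit-variance Gaussian mean, where the gradient of the log-partition function is the identity, so the primal and dual variables coincide and the Bregman projection reduces to a Euclidean one; the box constraint and the step-size schedule below are our illustrative assumptions:

```python
def omd_update(theta, x, eta, lower=-5.0, upper=5.0):
    """One OMD step for a unit-variance Gaussian mean under log-loss.

    Steps: (1) dual variable mu = grad Phi(theta) = theta;
           (2) dual gradient update with the log-loss gradient (mu - x);
           (3) map back to the primal (the identity here);
           (4) project onto the constraint set (a box, assumed for
               illustration; the projection is Euclidean since Phi is
               quadratic).
    """
    mu = theta                      # (1) dual variable
    mu = mu - eta * (mu - x)        # (2) = (1 - eta) * mu + eta * x
    theta = mu                      # (3) primal variable
    return min(max(theta, lower), upper)  # (4) projection

def omd_estimates(samples, theta0=0.0):
    """Nonanticipating estimates; step size 1/t recovers the running mean."""
    theta, out = theta0, []
    for t, x in enumerate(samples, start=1):
        out.append(theta)               # prediction made before seeing x_t
        theta = omd_update(theta, x, eta=1.0 / t)
    return out
```

With step size 1/t and a box wide enough to be inactive, the dual update is an average of the sufficient statistics, so the estimate coincides with the sample mean; a constant step size instead yields an exponential moving average.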
A standard performance metric for an OCO algorithm is the regret: the difference between the total cost that the online algorithm incurs and that of the best fixed decision in hindsight. Given the samples, the regret for a sequence of estimators is defined as
(13) 
For strongly convex loss functions, the regret of many OCO algorithms, including OMD, grows at most logarithmically in the number of samples, with a constant that depends on the problem parameters [Agarwal and Duchi, 2011, Raginsky et al., 2012]. Note that for the exponential family, the loss function is the negative log-likelihood, which is strongly convex over the restricted parameter set. Hence, we have the logarithmic regret property.
3 Nearly second-order asymptotic optimality of one-sample update schemes
Below we prove the nearly second-order asymptotic optimality of the one-sample update schemes. More precisely, nearly second-order asymptotic optimality means that the algorithm attains the lower performance bound asymptotically, up to a log-log factor in the false-alarm rate, as the false-alarm rate tends to zero (in many cases the log-log factor is a small number).
We first introduce some necessary notation. Denote the probability measure and the expectation when the change occurs at a given time with a given post-change parameter, i.e., when the pre-change samples are i.i.d. with the pre-change density and the post-change samples are i.i.d. with the post-change density. Moreover, denote the probability measure and the expectation when there is no change, i.e., when all samples are i.i.d. with the pre-change density. Finally, denote the σ-algebra generated by the samples observed up to each time.
3.1 "One-sided" sequential hypothesis test
The two standard performance metrics are the false-alarm rate and the expected detection delay (i.e., the expected number of samples needed to reject the null). A meaningful test should make both quantities small. Usually, one adjusts the threshold to control the false-alarm rate to be below a certain level.
Intuitively, a reasonable sequence of estimators should move closer to the true parameter as more data are collected. This is reflected by the following regularity condition (the same assumption is made in (5.84) of [Tartakovsky et al., 2014]):
(14) 
for some constant that characterizes the convergence rate of the estimators. This is a mild assumption that is satisfied by many sequences of estimators.
The assumption (14) is mild, since it holds whenever the expected KL divergence between the estimate and the true parameter decays polynomially in the number of samples. Even though we do not prove here that all OMD estimators satisfy (14), we give one example to demonstrate that it can be satisfied by many OMD estimators, since OMD estimators are inherently designed to approach the true parameter quickly Hazan [2016]. Consider detection of a Gaussian mean shift, with unit variance and without any constraint on the mean parameter. A quick check shows that the KL divergence is half the squared Euclidean distance between the means. Moreover, the OMD estimators correspond to the MLE, the sample mean. By the asymptotic efficiency of the MLE, the expected KL divergence decays at the parametric rate, so assumption (14) is satisfied. In practice, a case-by-case validation is recommended.
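The Gaussian example can be checked numerically: for the mean of k i.i.d. d-dimensional standard Gaussian samples, the KL divergence to the truth is half the squared estimation error, so its expectation should be d/(2k). A Monte-Carlo sketch (the dimension, trial counts, and seed are our choices):

```python
import random

def expected_kl_gaussian_mle(k, d=3, trials=20000, seed=7):
    """Monte-Carlo estimate of E[ KL(f_theta_hat || f_theta) ] when theta_hat
    is the mean of k i.i.d. N(theta, I_d) samples.

    For this Gaussian family, KL = ||theta_hat - theta||^2 / 2, and the
    theoretical value of the expectation is d / (2k).
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        # sample mean of k standard-normal draws per dimension (theta = 0
        # without loss of generality)
        mean = [sum(rng.gauss(0.0, 1.0) for _ in range(k)) / k
                for _ in range(d)]
        total += sum(m * m for m in mean) / 2.0
    return total / trials
```

The estimate decays like 1/k, which is exactly the polynomial decay that makes the sum in (14) well behaved.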
Our main result is the following. As observed by Lai [2004], there is a loss of statistical efficiency in using one-sample update estimators relative to the GLR approach, which uses all past samples. The theorem below shows that this loss corresponds to the expected regret given in (13).
Theorem 3.1 (Upper bound for OMD-based SPRT). Let the sequence of estimators be generated by Algorithm 1. When (14) holds, as the threshold tends to infinity,
(15) 
Here the last term is upper-bounded by an absolute constant as the threshold tends to infinity.
The main idea of the proof is to decompose the statistic defining the stopping time into a few terms that form martingales, and then invoke Wald's theorem for the stopped process. Remark: Even though Theorem 3.1 is stated for OMD, the inequality (15) is valid for any nonanticipating estimators generated by an OCO algorithm, as long as (14) holds. Moreover, (15) gives an explicit connection between the expected detection delay for the one-sided sequential hypothesis test (the left-hand side of (15)) and the regret for the OCO problem (the second term on the right-hand side of (15)). This clearly illustrates the impact of estimation on detection through an estimation-algorithm-dependent factor.
Note that in the statement of Theorem 3.1, the stopping time appears on the right-hand side of the inequality (15). For OMD, the expected sample size is usually small. By combining (15) with a specific regret bound, we can bound the expected stopping time, as discussed in Section 4. The most important case is when the estimation algorithm has a logarithmic expected regret. For the exponential family, as shown in Section 3.3, Algorithm 1 achieves a logarithmic regret for any positive integer number of samples. To obtain a more explicit order for the upper bound as the threshold grows, we establish an upper bound on the expected stopping time as a function of the threshold, which yields the following Corollary 3.1.
Let be the sequence of estimators generated by Algorithm 1. Assume that for any positive integer and some constant , when (14) holds, we have
(16) 
Here, is a vanishing term as . Corollary 3.1 shows that, beyond the well-known first-order approximation Lorden [1971], Lorden and Pollak [2005], the expected detection delay is bounded by an additional term that is on the order of if the estimation algorithm has a logarithmic regret. This term plays an important role in establishing the optimality properties later. To show the optimality properties of the detection procedures, we first select a set of detection procedures with false-alarm rates lower than a prescribed value, and then prove that among all procedures in this set, the expected detection delays of our proposed procedures are the smallest. Thus, we can choose a threshold to uniformly control the false-alarm rate of . {Lemma}[False-alarm rate of ] Let be any sequence of nonanticipating estimators (e.g., generated by Algorithm 1). For any , .
Lemma 3.1 shows that, as increases, the false-alarm rate of decays exponentially fast. We can therefore set to make the false-alarm rate of less than some prescribed . Next, leveraging an existing lower bound for the general SPRT presented in Section 5.5.1.1 of [Tartakovsky et al., 2014], we establish the nearly second-order asymptotic optimality of the OMD-based SPRT as follows:
[Nearly second-order optimality of the OMD-based SPRT] Let be the sequence of estimators generated by Algorithm 1. Assume that for any positive integer and some constant , and that (14) holds. Define the set . For , by Lemma 3.1, . For such a choice, is nearly second-order asymptotically optimal in the sense that for any , as ,
(17) 
This result means that, compared with any procedure (including the optimal one) calibrated to have a false-alarm rate less than , our procedure incurs an increase of at most in the expected detection delay, which is usually a small number. For instance, even in a conservative case where we set to control the false-alarm rate, this number is .
3.2 Sequential changepoint detection
We now proceed by leveraging the close connection Lorden [1971] between sequential changepoint detection and the one-sided hypothesis test. For sequential changepoint detection, the two commonly used performance metrics [Tartakovsky et al., 2014] are the average run length (ARL), denoted by , and the maximal conditional average delay to detection (CADD), denoted by . The ARL is the expected number of samples between two successive false alarms, and the CADD is the expected number of samples needed to detect the change after it occurs. A good procedure should have a large ARL and a small CADD. As in the one-sided hypothesis test, one usually chooses the threshold large enough so that the ARL exceeds a prespecified level.
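Both metrics can be estimated by Monte Carlo. The sketch below does so for a plain CUSUM procedure with a known post-change Gaussian mean, a simplification of the adaptive procedures studied here, purely to illustrate how the ARL and the detection delay are measured; the threshold and shift size are arbitrary choices.

```python
import numpy as np

def cusum_stop(xs, b, theta1):
    """CUSUM stopping time for a mean shift 0 -> theta1 in N(., 1) data
    (known post-change mean; used only to illustrate ARL / delay)."""
    w = 0.0
    for i, x in enumerate(xs, start=1):
        w = max(0.0, w + theta1 * x - theta1 ** 2 / 2)  # log-LR increment
        if w >= b:
            return i
    return len(xs)

rng = np.random.default_rng(2)
b, theta1, n_rep = 4.0, 1.0, 400

# ARL: expected stopping time when no change ever occurs (all N(0, 1) data).
arl = np.mean([cusum_stop(rng.normal(0.0, 1.0, 5000), b, theta1)
               for _ in range(n_rep)])

# Detection delay: change at time 0, so all data come from N(theta1, 1).
delay = np.mean([cusum_stop(rng.normal(theta1, 1.0, 5000), b, theta1)
                 for _ in range(n_rep)])

print(f"ARL ~ {arl:.1f}, delay ~ {delay:.1f}")  # large ARL, small delay
```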
Similar to Theorem 3.1, we provide an upper bound for the CADD of our ASR and ACM procedures.
Consider the changepoint detection procedures in (7) and in (8). For any fixed , let be the sequence of estimators generated by OMD. Assume that for any positive integer and some constant , and that (14) holds. Let . As , we have that
(18) 
To prove Theorem 3.2, we relate the ASR and ACM procedures to the one-sided hypothesis test and use the fact that when the measure is known, is attained at for both the ASR and the ACM procedures. Above, we may apply an argument similar to that of Corollary 3.1 to remove the dependence on on the right-hand side of the inequality. We establish the following lower bound for the ARL of the detection procedures, which is needed to prove Theorem 3.2: {Lemma}[ARL] Consider the changepoint detection procedures in (7) and in (8). For any fixed , let be any sequence of nonanticipating estimators (e.g., generated by OMD). Let . Given a prescribed lower bound for the ARL, we have
provided that .
Lemma 3.2 shows that, given a required lower bound for the ARL, we can choose to make the ARL greater than . This is consistent with earlier works Pollak [1987], Lorden and Pollak [2005], which show that the smallest threshold such that is approximately . However, the bound in Lemma 3.2 is not tight, since in practice we can set for some to ensure that the ARL is greater than .
Combining the upper bound in Theorem 3.2 with an existing lower bound for the CADD of the SRRS procedure in [Siegmund and Yakir, 2008], we obtain the following optimality properties. {Corollary}[Nearly second-order asymptotic optimality of ACM and ASR] Consider the changepoint detection procedures in (7) and in (8). For any fixed , let be the sequence of estimators generated by OMD. Assume that for any positive integer and some constant , and that (14) holds. Let . Define . For , by Lemma 3.2, both and belong to . For such , both and are nearly second-order asymptotically optimal in the sense that for any ,
(19) 
A similar expression holds for . This result means that, compared with any procedure (including the optimal one) calibrated to have a fixed ARL larger than , our procedure incurs an increase of at most in the CADD. Comparing (19) with (17), we note that the ARL plays the same role as , because is roughly the false-alarm rate for sequential changepoint detection Lorden [1971].
3.3 Example: Regret bound for specific cases
In this subsection, we show that the regret bound can be expressed as a weighted sum of Bregman divergences between consecutive estimators. This form of is useful for establishing the logarithmic regret of OMD. The following result is a modification of Azoury and Warmuth [2001].
Assume that are i.i.d. random variables with density function . Let in Algorithm 1. Assume that are obtained using Algorithm 1 and that (defined in steps 7 and 8 of Algorithm 1) for any . Then for any and ,
where , for some .
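For reference, the Bregman divergence appearing in the bound has the generic form B_psi(u, v) = psi(u) - psi(v) - <grad psi(v), u - v>. A minimal implementation, specialized to the Gaussian log-partition function (where it reduces to half the squared Euclidean distance), is sketched below; the test points are arbitrary.

```python
import numpy as np

def bregman(psi, grad_psi, u, v):
    """Bregman divergence B_psi(u, v) = psi(u) - psi(v) - <grad psi(v), u - v>."""
    return psi(u) - psi(v) - grad_psi(v) @ (u - v)

# Gaussian (identity covariance) log-partition: psi(theta) = ||theta||^2 / 2,
# whose Bregman divergence is half the squared Euclidean distance.
psi = lambda th: 0.5 * th @ th
grad_psi = lambda th: th

u, v = np.array([1.0, 2.0]), np.array([0.0, 1.0])
print(bregman(psi, grad_psi, u, v))   # 0.5 * ||u - v||^2 = 1.0
```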
Next, we apply Theorem 3.3 to a concrete example. The multivariate normal distribution, denoted by , is parametrized by an unknown mean parameter and a known covariance matrix ( is the identity matrix). Following the notation in Subsection 2.3, we know that , , for any , , and , where denotes the determinant of a matrix and is the probability measure under which the sample follows . When the covariance matrix is known to be some , one can “whiten” the vectors by multiplying by to recover this setting. {Corollary}[Upper bound for the expected regret, Gaussian] Assume are i.i.d. following with some . Assume that are obtained using Algorithm 1 with and . For any , we have that for some constant that depends on ,
The following calculations justify Corollary 3.3 and also serve as an example of how to use the regret bound. First, the assumption in Theorem 3.3 is satisfied for the following reasons. Consider the case where is the full space. According to Algorithm 1, using the nonnegativity of the Bregman divergence, we have . Then the regret bound can be written as
Since the step size , the second term in the above equation can be written as:
Combining the above, we have
Finally, since for any , we obtain the desired result. Thus, with i.i.d. multivariate normal samples, the expected regret grows logarithmically in the number of samples.
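A quick simulation is consistent with this logarithmic growth. For one-dimensional Gaussian data, the cumulative (quadratic part of the negative log-likelihood) loss of the running-mean predictor, minus that of the best fixed parameter in hindsight (the final sample mean), grows roughly like (1/2) log t; the true mean and horizon below are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(3)
theta, t_max, n_rep = 1.5, 1000, 200   # hypothetical 1-d Gaussian mean

avg_regret = np.zeros(t_max)
for _ in range(n_rep):
    x = rng.normal(theta, 1.0, t_max)
    idx = np.arange(1, t_max + 1)
    cum_x, cum_x2 = np.cumsum(x), np.cumsum(x ** 2)
    # Running-mean estimate theta_hat_{i-1} used to predict x_i (theta_hat_0 = 0).
    mean_prev = np.concatenate(([0.0], cum_x[:-1] / idx[:-1]))
    online_loss = np.cumsum(0.5 * (x - mean_prev) ** 2)
    # Loss of the best fixed parameter in hindsight (the sample mean at time t):
    # sum_{i<=t} (x_i - xbar_t)^2 / 2 = (cum_x2 - cum_x^2 / t) / 2.
    best_loss = 0.5 * (cum_x2 - cum_x ** 2 / idx)
    avg_regret += online_loss - best_loss
avg_regret /= n_rep

for t in (10, 100, 1000):
    print(t, avg_regret[t - 1])   # grows roughly like (1/2) * log t
```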
Using similar calculations, we can also bound the expected regret in the general case. As shown in the proof of Corollary 3.3 above, the dominating term of can be rewritten as
where is a convex combination of and . For an arbitrary distribution, the term can be viewed as a local normal distribution with the changing curvature