###### Abstract

Sequential change-point detection when the distribution parameters are unknown is a fundamental problem in statistics and machine learning. When the underlying distributions belong to the exponential family, we show that detection procedures based on sequential likelihood ratios with simple one-sample update estimates, such as online mirror descent, are nearly second-order asymptotically optimal, under mild conditions on the expected Kullback-Leibler divergence between the estimators and the true parameters. This means that the upper bound on the expected detection delay of the algorithm, subject to a false-alarm constraint measured by the average run length, meets the lower bound asymptotically up to a log-log factor as the threshold tends to infinity. This is a blessing: although the generalized likelihood ratio (GLR) statistic is asymptotically optimal in theory, it cannot be computed recursively, so its exact computation can be time-consuming. We prove the nearly second-order asymptotic optimality by making a connection between sequential change-point detection and online convex optimization and by leveraging the logarithmic regret bound of the online mirror descent algorithm. Numerical and real data examples validate our theory.

Keywords: sequential methods; change-point detection; online algorithms
Title: Nearly second-order asymptotic optimality of sequential change-point detection with one-sample updates

Authors: Yang Cao, Liyan Xie, Yao Xie*, Huan Xu (2017)

Correspondence: yao.xie@isye.gatech.edu

## 1 Introduction

Sequential analysis is a classic topic in statistics concerning online inference from a sequence of observations. The goal is to make statistical inference as quickly as possible, while controlling the false-alarm rate. An important sequential analysis problem commonly studied is sequential change-point detection Siegmund [1985]. It arises from various applications including online anomaly detection, statistical quality control, biosurveillance, financial arbitrage detection and network security monitoring (see, e.g., [Siegmund, 2013, Tartakovsky et al., 2014]).

We are interested in the sequential change-point detection problem with known pre-change parameters but unknown post-change parameters. Specifically, given a sequence of samples $X_1, X_2, \ldots$, we assume that they are independent and identically distributed (i.i.d.) with a certain distribution parameterized by $\theta$, and the value of $\theta$ differs before and after some unknown time called the change-point. We further assume that the parameter $\theta_0$ before the change-point is known. This is reasonable since it is usually relatively easy to obtain reference data for the normal state, so that the parameters in the normal state can be estimated with good accuracy. After the change-point, however, the parameter switches to some unknown value, which represents an anomaly or novelty that needs to be discovered.

### 1.1 Motivation: Dilemma of CUSUM and generalized likelihood ratio (GLR) statistics

Consider change-point detection with unknown post-change parameters. A commonly used change-point detection method is the so-called CUSUM procedure [Tartakovsky et al., 2014]. It can be derived from likelihood ratios. Assume that before the change the samples follow a distribution $f_{\theta_0}$ and after the change the samples follow another distribution $f_{\theta_1}$. The CUSUM procedure has a recursive structure: initialized with $W_0 = 0$, the likelihood-ratio statistic can be computed according to $W_t = \max\{W_{t-1} + \log(f_{\theta_1}(X_t)/f_{\theta_0}(X_t)),\, 0\}$, and a change-point is detected whenever $W_t$ exceeds a pre-specified threshold. Due to the recursive structure, CUSUM is memory and computation efficient, since it does not need to store the historical data and only needs to record the value of $W_t$. The performance of CUSUM depends on the choice of the post-change parameter $\theta_1$; in particular, there must be a well-defined notion of “distance” between $\theta_0$ and $\theta_1$. However, the choice of $\theta_1$ is somewhat subjective. Even if in practice a reasonable choice of $\theta_1$ is the “smallest” change-of-interest, in the multi-dimensional setting it is hard to define what the “smallest” change would mean. Moreover, when the assumed parameter $\theta_1$ deviates significantly from the true post-change parameter, CUSUM may suffer severe performance degradation Granjon [2013].
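To make the recursion concrete, the following sketch implements CUSUM for a one-dimensional Gaussian mean shift; the pre-change mean `mu0`, the assumed post-change mean `mu1`, and the threshold `b` are illustrative choices, not values prescribed by the paper.

```python
def cusum(xs, mu0=0.0, mu1=1.0, sigma=1.0, b=5.0):
    """Recursive CUSUM for a Gaussian mean shift (illustrative sketch).

    Maintains W_t = max(W_{t-1} + log f_{mu1}(x_t)/f_{mu0}(x_t), 0) and
    raises an alarm at the first t with W_t >= b.
    """
    W = 0.0
    for t, x in enumerate(xs, start=1):
        # Gaussian log-likelihood ratio of one sample: N(mu1, sigma^2) vs N(mu0, sigma^2)
        llr = ((x - mu0) ** 2 - (x - mu1) ** 2) / (2 * sigma ** 2)
        W = max(W + llr, 0.0)  # one-sample recursive update; only W is stored
        if W >= b:
            return t  # alarm (detected change) at time t
    return None  # no alarm raised
```

Only the scalar `W` is carried between samples, which is exactly the memory efficiency described above.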

An alternative approach is the generalized likelihood ratio (GLR) statistic based procedure Basseville et al. [1993]. The GLR statistic finds the maximum likelihood estimate (MLE) of the post-change parameter and plugs it back into the likelihood ratio to form the detection statistic. To be more precise, for each hypothetical change-point location $k$, the corresponding post-change samples are $X_k, \ldots, X_t$. Using these samples, one can form the MLE, denoted as $\hat\theta_{k,t}$. Without knowing beforehand whether the change occurs and where it occurs, when forming the GLR statistic we have to maximize over all possible change locations. The GLR statistic is given by $\max_{1\le k\le t} \sum_{i=k}^{t} \log\big(f_{\hat\theta_{k,t}}(X_i)/f_{\theta_0}(X_i)\big)$, and a change is announced whenever it exceeds a pre-specified threshold. The GLR statistic is more robust than CUSUM [Lai, 1998], and it is particularly useful when the post-change parameter may vary from one situation to another. However, a drawback of the GLR statistic is that it is not memory efficient and cannot be computed recursively in general. Moreover, when there is a constraint on the maximum likelihood estimator (such as sparsity), the MLE may not have a closed-form solution; one has to store the historical data and recompute the MLE whenever new data arrive. As a remedy, the window-limited GLR is usually considered, where one only keeps the most recent $w$ samples and the maximization is restricted to $t - w \le k \le t$. However, even with the window-limited GLR, one still has to recompute $\hat\theta_{k,t}$ using historical data whenever new data are added.
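For contrast, a window-limited GLR for the same Gaussian mean-shift setting must re-estimate the post-change MLE for every candidate change point at every step; the window size `w` and threshold `b` below are illustrative assumptions.

```python
def window_glr(xs, w=20, b=5.0, mu0=0.0, sigma=1.0):
    """Window-limited GLR for a Gaussian mean shift (illustrative sketch).

    For each candidate change point k in the window, the post-change MLE is
    the sample mean of x_{k+1..t}, and the GLR statistic has the closed form
    (t - k) * (mean - mu0)^2 / (2 sigma^2).
    """
    for t in range(1, len(xs) + 1):
        stat = 0.0
        for k in range(max(0, t - w), t):   # candidate change points in the window
            seg = xs[k:t]                   # hypothesized post-change samples
            mle = sum(seg) / len(seg)       # MLE recomputed from stored history
            stat = max(stat, len(seg) * (mle - mu0) ** 2 / (2 * sigma ** 2))
        if stat >= b:
            return t
    return None
```

The inner loop over stored samples is exactly the cost that the one-sample update schemes below avoid.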

Besides CUSUM and GLR, various one-sample update schemes have been considered to reduce the memory and computation cost of online change-point detection procedures. A one-sample update takes the form $\hat\theta_t = \psi(\hat\theta_{t-1}, X_t)$ for some function $\psi$ that uses only the most recent data point and the previous estimate. One-sample update schemes perform online estimation of the unknown parameters and plug the estimates into the likelihood-ratio statistic to perform detection. The one-sample update enjoys efficient computation, as the information from new data can be incorporated at low computational cost. It is also memory efficient, since the update only needs the most recent sample. For instance, the non-anticipating estimator in Lorden and Pollak [2005] was constructed by a moving average for the Gaussian mean and Gamma distribution parameters; the authors of Raginsky et al. [2009, 2012] construct estimators using the general online mirror descent approach (for the online outlier detection problem, which differs from the persistent-change setting here). For general settings, online mirror descent provides a good framework for constructing one-sample update schemes, since it can be computed efficiently via online convex optimization and even has closed-form solutions in various cases. The one-sample update estimators may not correspond to the exact MLE, but they tend to result in good detection performance. The authors of Lorden and Pollak [2005] establish a general result on asymptotic detection delay for asymptotically efficient estimators. However, the online mirror descent based estimators may not satisfy this requirement, and hence no performance guarantees exist in general for such an approach. The comparison of the three approaches is summarized in Table 1, and the justification can be found in Appendix A.

It is clear that one-sample update schemes require less memory and computation than GLR. However, an important question remains to be answered: how much performance do we lose by using one-sample update schemes rather than the exact GLR?

### 1.2 Application scenario: Social network change-point detection

The widespread use of social networks (such as Twitter) produces a large amount of user-generated data continuously. One important task is to detect change-points in streaming social network data. These change-points may represent the collective anticipation of, or response to, external events or system “shocks” Peel and Clauset [2015]. Detecting such changes can provide a better understanding of patterns of social life. In social networks, a common form of the data is discrete events over continuous time. As a simplification, each event contains a time label and a user label in the network. In our prior work Li et al. [2017], we model discrete event data using network point processes, which capture the influence between users through an influence matrix. We then cast the problem as detecting changes in the influence matrix. We assume that the influence matrix in the normal state (before the change) can be estimated from reference data. After the change, the influence matrix is unknown, since it is due to an anomaly, and it has to be estimated online. Because the scale of the network can be large, computational burden and memory constraints mean we do not want to store the entire historical data, and we would rather compute the statistic in real time. In Li et al. [2017], we developed a one-sample update scheme to estimate the influence matrix and then formed the likelihood-ratio detection statistic based on expectation-maximization algorithms. However, the theoretical performance of that algorithm has not been well understood.

### 1.3 Contributions

This paper aims to address the above question by proving the nearly second-order asymptotic optimality of one-sample update schemes for the one-sided sequential hypothesis test and sequential change-point detection for the exponential family. While similar questions have been considered previously in Robbins and Siegmund [1974], Lorden and Pollak [2005, 2008], we consider likelihood ratios with plug-in online mirror descent (OMD) estimators (similar to those in Raginsky et al. [2009, 2012]). Nearly second-order asymptotic optimality [Tartakovsky et al., 2014] means that the upper bound on performance matches the lower bound up to a log-log factor as the false-alarm rate goes to zero. Here we focus on OMD estimators, but the results can be generalized to other schemes such as online gradient descent. The proof leverages the logarithmic regret property of online mirror descent and the lower bounds established in the statistical sequential change-point literature [Siegmund and Yakir, 2008, Tartakovsky et al., 2014]. Synthetic examples validate the performance of one-sample update schemes.

The contributions of the paper are summarized as follows:

• Inspired by the existing connection between sequential analysis and online convex optimization in Cesa-Bianchi and Lugosi [2006], Hazan [2016], we provide a general upper bound for one-sided sequential hypothesis test and change-point detection procedures with one-sample update schemes. The upper bound explicitly captures the impact of estimation on detection through an estimation-algorithm-dependent factor. This factor shows up as an additional term in the upper bound for the expected detection delay, and it corresponds to the regret incurred by the one-sample update estimators. This establishes an interesting linkage between sequential change-point detection and online convex optimization. (Although both fields study sequential data, the precise connection between them has been unclear, partly because the performance metrics are different: the former concerns the tradeoff between average run length and detection delay, whereas the latter focuses on bounding the cumulative loss incurred by the sequence of estimators through a regret bound [Azoury and Warmuth, 2001, Hazan, 2016].)

• Using our upper bound and an existing lower bound, we show that one-sample update schemes are nearly second-order optimal for the exponential family. Moreover, numerical examples verify the good performance of one-sample update schemes. They can perform better and are more robust than likelihood-ratio methods with pre-specified parameters (e.g., CUSUM for change-point detection). Moreover, they are computationally efficient alternatives to the GLR statistic with little performance loss relative to GLR.

### 1.4 Literature and related work

Sequential change-point detection is a classic subject with an extensive literature. Much success has been achieved when the pre-change and post-change distributions are exactly specified. Examples include the CUSUM procedure [Page, 1954], which enjoys first-order asymptotic optimality Lorden [1971] and exact optimality Moustakides [1986] in the minimax sense, and the Shiryayev-Roberts (SR) procedure Shiryaev [1963], derived from a Bayesian principle, which also enjoys various optimality properties. Both the CUSUM and SR procedures rely on likelihood ratios between the specified pre-change and post-change distributions.

The GLR statistic [Lai, 1995, 1998] enjoys certain optimality properties, but it cannot be computed recursively in most cases Lai [2004]. To address the infinite-memory issue, Willsky and Jones [1976], Lai [1998] studied the window-limited GLR procedure. Another approach aiming to address the issue is the Shiryayev-Roberts-Robbins-Siegmund (SRRS) procedure Lorden and Pollak [2005]. The main idea of SRRS dates back to the “one-sided” sequential test Robbins and Siegmund [1974]: instead of plugging in the MLE obtained using all samples up to the current moment, as done in the GLR procedure, the SRRS procedure uses a sequence of non-anticipating estimators. The non-anticipating estimators are formed by dropping the most recent sample (hence the name “non-anticipating”). The test statistic can then be computed recursively.

The seminal work Lorden and Pollak [2005] laid the theoretical foundation, with constructions of the non-anticipating estimators given for two specific examples, the Gaussian and Gamma distributions, based on moving averages. Our work considers a more general approach for constructing non-anticipating estimators based on online mirror descent (OMD), which can handle multi-dimensional parameters and constraints on parameters such as sparsity and smoothness. For a one-dimensional Gaussian mean shift, our approach reduces to the non-anticipating estimator constructed in Lorden and Pollak [2005]. Later, the authors of Lorden and Pollak [2008] generalized the results to the exponential family by introducing a new registering technique and used it to prove second-order asymptotic optimality for the Gaussian mean shift. Compared to Lorden and Pollak [2008], our work provides an alternative proof of the nearly second-order asymptotic optimality by making a connection to online convex optimization and leveraging regret-bound-type results Hazan [2016]. For a one-dimensional Gaussian mean shift without any constraint, we recover the second-order asymptotic optimality, namely, Theorem 3.3 in Lorden and Pollak [2008].

Another related problem is sequential joint estimation and detection, but the goal is different in that one aims to achieve both good detection and good estimation performance, whereas in our setting estimation is only needed to compute the detection statistics. These works include Pollak [1987], which developed a modified SR procedure by introducing a prior distribution to the unknown parameters; Yilmaz et al. [2015] and Yılmaz et al. [2016], which study the joint detection and estimation problem of a specific form that arises from many applications such as spectrum sensing Yilmaz et al. [2014], image observations Vo et al. [2010], and MIMO radar Tajer et al. [2010]: a linear scalar observation model with Gaussian noise, and under the alternative hypothesis there is an unknown multiplicative parameter. The paper of Yilmaz et al. [2015] demonstrates that solving the joint problem by treating detection and estimation separately with the corresponding optimal procedure does not yield an overall optimum performance, and provides an elegant closed-form optimal detector. Later on Yılmaz et al. [2016] generalizes the results. There are also other approaches solving the joint detection-estimation problem using multiple hypotheses testing [Baygun and Hero, 1995, Vo et al., 2010] and Bayesian formulations Moustakides et al. [2012].

Related work using online convex optimization for anomaly detection includes Raginsky et al. [2009], which develops an efficient detector for the exponential family using online mirror descent and proves a logarithmic regret bound, and Raginsky et al. [2012], which dynamically adjusts the detection threshold to incorporate feedback about the decision outcomes. However, these works consider a different setting in which the change is a transient outlier instead of a persistent change, as assumed in the classic statistical change-point detection literature. When there is a persistent change, it is important to accumulate “evidence” by pooling the post-change samples (our work considers the persistent change).

Extensive work has been done on parameter estimation in the online setting. This includes online density estimation over the exponential family by regret minimization [Azoury and Warmuth, 2001, Raginsky et al., 2009, 2012], sequential prediction of individual sequences under the logarithmic loss [Cesa-Bianchi and Lugosi, 2006, Kotlowski and Grünwald, 2011], online prediction for time series O. Anava and Shamir [2013], and sequential NML (SNML) prediction Kotlowski and Grünwald [2011], which achieves the optimal regret bound. Our problem is different from the above, in that estimation is not the end goal; one only performs parameter estimation to plug the estimates back into the likelihood function for detection. Moreover, a subtle but important difference of our work is that the loss function for online density estimation is $-\log f_{\hat\theta_i}(X_i)$, whereas our loss function is $-\log f_{\hat\theta_{i-1}}(X_i)$ in order to retain the martingale property, which is essential to establishing the nearly second-order asymptotic optimality.

On a high level, our work is also related to the universal source coding problem [Cover and Thomas, 2012, Cesa-Bianchi and Lugosi, 2001] or the minimum description length (MDL) problem [Rissanen, 1985, Barron et al., 1998]. In the universal source coding problem, the goal is to minimize the cumulative Kullback-Leibler (KL) loss.

## 2 Preliminaries

Assume a sequence of i.i.d. random variables $X_1, X_2, \ldots$ with a probability density function of a parametric form $f_\theta$, $\theta \in \Theta$. The parameter $\theta$ may be unknown. Consider two related problems: the one-sided sequential hypothesis test and sequential change-point detection. The detection statistic relies on a sequence of estimators constructed using online mirror descent. OMD uses a simple one-sample update: the update from $\hat\theta_{t-1}$ to $\hat\theta_t$ only uses the current sample $X_t$. This is the main difference from the traditional generalized likelihood ratio (GLR) statistic Lai [1998], where each estimate is formed using all historical samples. In the following, we present detailed descriptions of the two problems. We consider exponential family distributions and present our non-anticipating estimator based on the one-sample update.

### 2.1 One-sided sequential hypothesis test

First, we consider a one-sided sequential hypothesis test where the goal is only to reject the null hypothesis. This is a special case of the change-detection problem where the change-point can either be $\nu = 0$ or $\nu = \infty$ (meaning the change never occurs). Studying this special case gives us an important intermediate step towards solving the sequential change-detection problem.

Consider the null hypothesis $H_0: \theta = \theta_0$ versus the alternative $H_1: \theta \neq \theta_0$; hence, the parameter under the alternative distribution is unknown. The classic approach to this problem is the one-sided sequential probability ratio test (SPRT) [Wald and Wolfowitz, 1948]: at each time, given samples $X_1, \ldots, X_t$, the decision is either to reject $H_0$ or to take more samples if the rejection decision cannot be made confidently. Here, we introduce a modified one-sided SPRT with a sequence of non-anticipating plug-in estimators:

$$\hat\theta_t := \hat\theta_t(X_1,\ldots,X_t), \qquad t = 1, 2, \ldots \tag{1}$$

Define the test statistic at time $t$ as

$$\Lambda_t = \prod_{i=1}^{t} \frac{f_{\hat\theta_{i-1}}(X_i)}{f_{\theta_0}(X_i)}, \qquad t \ge 1. \tag{2}$$

The test statistic has a simple recursive implementation:

$$\Lambda_t = \Lambda_{t-1} \cdot \frac{f_{\hat\theta_{t-1}}(X_t)}{f_{\theta_0}(X_t)}.$$

Define a sequence of $\sigma$-algebras $\{\mathcal{F}_t\}$, where $\mathcal{F}_t = \sigma(X_1,\ldots,X_t)$. The test statistic has the martingale property due to its non-anticipating nature: $\mathbb{E}[\Lambda_t \mid \mathcal{F}_{t-1}] = \Lambda_{t-1}$, where the expectation is taken when $X_1, X_2, \ldots$ are i.i.d. random variables drawn from $f_{\theta_0}$. The decision rule is a stopping time

$$\tau(b) = \min\{t \ge 1: \log\Lambda_t \ge b\}, \tag{3}$$

where $b$ is a pre-specified threshold. We reject the null hypothesis whenever the statistic exceeds the threshold. The goal is to reject the null hypothesis using as few samples as possible under a false-alarm rate (or Type-I error) constraint.
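A minimal sketch of the recursion (2)-(3) for a Gaussian mean: the non-anticipating estimate used at time $t$ depends only on $X_1,\ldots,X_{t-1}$ (here a running mean, initialized at the pre-change value as an illustrative convention), so the statistic admits a one-sample update.

```python
def one_sided_sprt(xs, b, mu0=0.0, sigma=1.0):
    """One-sided SPRT with non-anticipating plug-in estimates (sketch).

    log Lambda_t = log Lambda_{t-1}
                   + log f_{theta_hat_{t-1}}(x_t) / f_{theta_0}(x_t),
    where theta_hat_{t-1} uses only x_1..x_{t-1}. For the Gaussian mean,
    the running-mean estimate coincides with the MLE.
    """
    log_lam, theta_hat, s = 0.0, mu0, 0.0  # theta_hat_0 = mu0 (assumed init)
    for t, x in enumerate(xs, start=1):
        log_lam += ((x - mu0) ** 2 - (x - theta_hat) ** 2) / (2 * sigma ** 2)
        if log_lam >= b:
            return t                       # stopping time tau(b): reject H0
        s += x
        theta_hat = s / t                  # one-sample update of the estimate
    return None                            # never rejected on this sample path
```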

### 2.2 Sequential change-point detection

Now we consider the sequential change-point detection problem. A change may occur at an unknown time $\nu$ and changes the underlying distribution of the data. One would like to detect such a change as quickly as possible. Formally, change-point detection can be cast as the following hypothesis test:

$$H_0:\ X_1, X_2, \ldots \overset{\text{i.i.d.}}{\sim} f_{\theta_0}, \qquad H_1:\ X_1, \ldots, X_\nu \overset{\text{i.i.d.}}{\sim} f_{\theta_0},\quad X_{\nu+1}, X_{\nu+2}, \ldots \overset{\text{i.i.d.}}{\sim} f_{\theta}. \tag{4}$$

Here we assume an unknown post-change parameter $\theta \neq \theta_0$ that represents the anomaly. The goal is to detect the change as quickly as possible after it occurs, under a false-alarm rate constraint. We will consider likelihood-ratio based detection procedures adapted from two types of existing ones, which we call the adaptive CUSUM (ACM) and the adaptive SRRS (ASR) procedures.

For change-point detection, the post-change parameter is estimated using post-change samples. This means that, for each putative change-point location $k$ before the current time $t$, the post-change samples are $X_k, \ldots, X_t$; with a slight abuse of notation, the post-change parameter is estimated as

$$\hat\theta_{k,i} = \hat\theta_{k,i}(X_k,\ldots,X_i), \qquad i \ge k. \tag{5}$$

Therefore, for $k = 1$, $\Lambda_{1,t}$ becomes $\Lambda_t$ defined in (2) for the one-sided SPRT. The likelihood ratio at time $t$ for a hypothetical change-point location $k$ is given by (initialized with $\Lambda_{k,k-1} = 1$)

$$\Lambda_{k,t} = \prod_{i=k}^{t} \frac{f_{\hat\theta_{k,i-1}}(X_i)}{f_{\theta_0}(X_i)}, \tag{6}$$

where $\Lambda_{k,t}$ can be computed recursively, similar to (2).

Since we do not know the change-point location $\nu$, following the maximum likelihood principle, we take the maximum of the statistics over all possible values of $k$. This gives the ACM procedure:

$$T_{\mathrm{ACM}}(b_1) = \inf\Big\{t \ge 1: \max_{1\le k\le t} \log\Lambda_{k,t} > b_1\Big\}, \tag{7}$$

where $b_1$ is a pre-specified threshold.

Similarly, by replacing the maximization over $k$ in (7) with a summation, we obtain the following ASR procedure [Lorden and Pollak, 2005], which can be interpreted as a Bayesian statistic similar to the Shiryaev-Roberts procedure:

$$T_{\mathrm{ASR}}(b_2) = \inf\Big\{t \ge 1: \log\Big(\sum_{k=1}^{t}\Lambda_{k,t}\Big) > b_2\Big\}, \tag{8}$$

where $b_2$ is a pre-specified threshold. The computation of $\Lambda_{k,t}$ and the estimators $\hat\theta_{k,i}$ is discussed later in Section 2.4. For a fixed $k$, the comparison between our methods and GLR is illustrated in Figure 1.
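The following sketch maintains one recursive statistic $\log\Lambda_{k,t}$ per candidate change point for a Gaussian mean shift, from which the ACM (max) and ASR (log-sum) statistics in (7)-(8) are read off; the per-candidate running-mean estimate and its initialization at `mu0` are illustrative assumptions.

```python
import math

def acm_asr_step(state, x, mu0=0.0, sigma=1.0):
    """One step of the ACM/ASR statistics for a Gaussian mean shift (sketch).

    `state` holds one tuple (log_lambda, running_sum, count) per candidate
    change point k <= t. Each log Lambda_{k,t} is updated recursively with
    its own non-anticipating estimate theta_hat_{k,t-1} (a running mean,
    initialized at mu0). Returns (new_state, acm_stat, asr_stat).
    """
    state = state + [(0.0, 0.0, 0)]        # new candidate change point k = t
    new_state = []
    for log_lam, s, n in state:
        theta_hat = s / n if n > 0 else mu0
        log_lam += ((x - mu0) ** 2 - (x - theta_hat) ** 2) / (2 * sigma ** 2)
        new_state.append((log_lam, s + x, n + 1))
    acm = max(l for l, _, _ in new_state)              # statistic in (7)
    # log-sum-exp evaluation of log(sum_k Lambda_{k,t}) for (8)
    asr = acm + math.log(sum(math.exp(l - acm) for l, _, _ in new_state))
    return new_state, acm, asr
```

A window-limited variant simply drops candidates whose count exceeds the window size.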

{Remark}

To prevent the memory and computation complexity from blowing up as time goes to infinity, we can use window-limited versions of the detection procedures in (7) and (8). The window-limited versions are obtained by replacing $\max_{1\le k\le t}$ with $\max_{t-w\le k\le t}$ in (7) and by replacing $\sum_{k=1}^{t}$ with $\sum_{k=t-w}^{t}$ in (8), where $w$ is a prescribed window size. Although we do not provide a theoretical analysis of the window-limited versions, we refer the reader to Lai [1998] for the choice of the window size in window-limited GLR procedures.

### 2.3 Exponential family

In this paper, we focus on $f_\theta$ belonging to the exponential family, for the following reasons: (i) the exponential family [Raginsky et al., 2012] represents a very rich class of parametric and even many nonparametric statistical models [Barron and Sheu, 1991]; (ii) the negative log-likelihood function of the exponential family is convex, which allows us to perform online convex optimization. Some useful properties of the exponential family are briefly summarized below; full proofs can be found in Wainwright et al. [2008], Raginsky et al. [2012].

Consider an observation space $\mathcal{X}$ equipped with a sigma-algebra $\mathcal{B}$ and a sigma-finite measure $H$ on $(\mathcal{X}, \mathcal{B})$. Assume the number of parameters is $d$. Let $u^\intercal$ denote the transpose of a vector or matrix $u$. Let $\phi: \mathcal{X} \to \mathbb{R}^d$ be a $\mathcal{B}$-measurable function; here $\phi(x)$ corresponds to the sufficient statistic for $\theta$. Let $\Theta$ denote the parameter space in $\mathbb{R}^d$. Let $\{f_\theta, \theta \in \Theta\}$ be a set of probability distributions with respect to the measure $H$. Then $\{f_\theta, \theta \in \Theta\}$ is said to be a multivariate exponential family with natural parameter $\theta$ if the probability density function of each $f_\theta$ with respect to $H$ can be expressed as $f_\theta(x) = \exp\{\theta^\intercal\phi(x) - \Phi(\theta)\}$. In this definition, the so-called log-partition function $\Phi(\theta)$ is given by

$$\Phi(\theta) := \log\int_{\mathcal{X}} \exp(\theta^\intercal\phi(x))\, dH(x).$$

To ensure a well-defined probability density, we consider the following two sets of parameters:

$$\Theta = \Big\{\theta \in \mathbb{R}^d : \log\int_{\mathcal{X}} \exp(\theta^\intercal\phi(x))\, dH(x) < +\infty\Big\},$$

and

$$\Theta_\sigma = \big\{\theta \in \Theta : \nabla^2\Phi(\theta) \succeq \sigma I_{d\times d}\big\}.$$

Note that $\Phi$ is $\sigma$-strongly convex over $\Theta_\sigma$. Its gradient $\nabla\Phi(\theta)$ corresponds to $\mathbb{E}_\theta[\phi(X)]$, and its Hessian $\nabla^2\Phi(\theta)$ corresponds to the covariance matrix of the vector $\phi(X)$. Therefore, $\nabla^2\Phi(\theta)$ is positive semidefinite and $\Phi$ is convex. Moreover, $\Phi$ is a Legendre function, which means that it is strongly convex, continuously differentiable, and essentially smooth Wainwright et al. [2008]. The Legendre-Fenchel dual $\Phi^*$ is defined as

$$\Phi^*(z) = \sup_{u\in\Theta}\{u^\intercal z - \Phi(u)\}.$$

The mapping $\nabla\Phi^*$ is an inverse mapping of $\nabla\Phi$ [Beck and Teboulle, 2003]. Moreover, if $\Phi$ is a $\sigma$-strongly convex function, then $\nabla\Phi^*$ is Lipschitz with constant $1/\sigma$.

A general measure of proximity used in OMD is the so-called Bregman divergence $B_F$, a nonnegative function induced by a Legendre function $F$ (see, e.g., [Wainwright et al., 2008, Raginsky et al., 2012]), defined as

$$B_F(u,v) := F(u) - F(v) - \langle\nabla F(v), u - v\rangle. \tag{9}$$

For the exponential family, a natural choice of the Bregman divergence is the Kullback-Leibler (KL) divergence. Define $\mathbb{E}_{\theta_1}$ as the expectation when $X$ is a random variable with density $f_{\theta_1}$, and $I(\theta_1,\theta_2)$ as the KL divergence between the two distributions with densities $f_{\theta_1}$ and $f_{\theta_2}$, for any $\theta_1, \theta_2 \in \Theta$. Then

$$I(\theta_1,\theta_2) = \mathbb{E}_{\theta_1}\big[\log\big(f_{\theta_1}(X)/f_{\theta_2}(X)\big)\big]. \tag{10}$$

It can be shown that, for the exponential family, $I(\theta_1,\theta_2) = \Phi(\theta_2) - \Phi(\theta_1) - \langle\nabla\Phi(\theta_1), \theta_2 - \theta_1\rangle$. Using the definition (9), this means that

$$B_\Phi(\theta_1,\theta_2) := I(\theta_2,\theta_1) \tag{11}$$

is a Bregman divergence. This property is quite useful for constructing the mirror descent estimator for the exponential family [Nemirovskii et al., 1983, Beck and Teboulle, 2003].
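The identity (11) can be checked numerically in the simplest exponential family, $\mathcal{N}(\theta, 1)$ with natural parameter $\theta$, sufficient statistic $\phi(x) = x$, and log-partition $\Phi(\theta) = \theta^2/2$ (a sketch; the specific family is chosen only for illustration):

```python
def Phi(theta):
    # log-partition function of N(theta, 1): Phi(theta) = theta^2 / 2
    return theta ** 2 / 2

def grad_Phi(theta):
    # grad Phi(theta) = E_theta[phi(X)] = theta, since phi(x) = x
    return theta

def bregman(u, v):
    # definition (9): B_Phi(u, v) = Phi(u) - Phi(v) - <grad Phi(v), u - v>
    return Phi(u) - Phi(v) - grad_Phi(v) * (u - v)

def kl(theta1, theta2):
    # KL divergence (10) between N(theta1, 1) and N(theta2, 1)
    return (theta1 - theta2) ** 2 / 2

# relation (11): B_Phi(theta1, theta2) equals I(theta2, theta1)
theta1, theta2 = 0.3, -1.2
assert abs(bregman(theta1, theta2) - kl(theta2, theta1)) < 1e-12
```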

### 2.4 Online mirror descent (OMD) for non-anticipating estimators

Next, we discuss how to construct the non-anticipating estimators $\hat\theta_t$ in (1) and $\hat\theta_{k,i}$ in (5) using OMD. OMD is a generic procedure for solving the online convex optimization (OCO) problem Hazan [2016], Shalev-Shwartz et al. [2012]. Our problem of finding an approximate maximum likelihood estimator can be cast as an OCO problem with the loss function being the negative log-likelihood $\ell_t(\theta) = -\log f_\theta(X_t)$, which corresponds to the log-loss in Cesa-Bianchi and Lugosi [2006].

The main idea of OMD is the following. At each time step, the estimator is updated using the new sample $X_t$, by balancing the tendency to stay close to the previous estimate against the tendency to move in the direction of the greatest local decrease of the loss function. For the loss function defined above, a sequence of OMD estimators is constructed by

$$\hat\theta_t = \operatorname*{arg\,min}_{u\in\Gamma}\Big[u^\intercal\nabla\ell_t(\hat\theta_{t-1}) + \frac{1}{\eta_t} B_\Phi(u, \hat\theta_{t-1})\Big], \tag{12}$$

where $B_\Phi$ is defined in (11) and $\eta_t$ is a step size. Here $\Gamma \subseteq \Theta$ is a closed convex set, which is problem-specific and encourages certain parameter structures such as sparsity (see Section 4 for examples). {Remark} Similar to (12), for any fixed $k$, we can compute $\hat\theta_{k,i}$ via OMD for sequential change-point detection. The only difference is that $\hat\theta_{k,i}$ is computed using $X_k$ as the first sample and then applying the recursive update (12) to the subsequent samples. For $k = 1$, we use $X_1$ as our first sample.

There is an equivalent form of OMD, presented as the original formulation in Nemirovskii et al. [1983]. The equivalent form is sometimes easier to use for algorithm development, and it consists of four steps: (1) compute the dual variable: $\mu_{t-1} = \nabla\Phi(\hat\theta_{t-1})$; (2) perform the dual update: $\tilde\mu_t = \mu_{t-1} - \eta_t\nabla\ell_t(\hat\theta_{t-1})$; (3) compute the primal variable: $\tilde\theta_t = \nabla\Phi^*(\tilde\mu_t)$; (4) perform the projected primal update: $\hat\theta_t = \operatorname*{arg\,min}_{u\in\Gamma} B_\Phi(u, \tilde\theta_t)$. The equivalence between this form of OMD and the nonlinear projected subgradient approach in (12) is proved in Beck and Teboulle [2003]. We adopt this approach when deriving our algorithm and follow the same strategy as Raginsky et al. [2009]. Our algorithm is presented in Algorithm 1.
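For the Gaussian mean with $\Phi(\theta) = \theta^2/2$, the dual mapping $\nabla\Phi$ is the identity and the four steps collapse to a simple recursion; the step size $\eta_t = 1/t$ and the box constraint standing in for $\Gamma$ below are illustrative choices, under which the unconstrained update reproduces the running mean.

```python
def omd_gaussian_mean(xs, theta0=0.0, lo=None, hi=None):
    """Four-step OMD for the Gaussian mean, Phi(theta) = theta^2/2 (sketch).

    With eta_t = 1/t the unconstrained update reduces to the running mean.
    The optional box [lo, hi] plays the role of the convex set Gamma; since
    B_Phi is the squared distance here, the Bregman projection is a clip.
    """
    theta = theta0
    estimates = []
    for t, x in enumerate(xs, start=1):
        eta = 1.0 / t
        mu = theta                 # (1) dual variable: mu = grad Phi(theta)
        grad = theta - x           # grad of l_t(theta) = Phi(theta) - theta * x
        mu = mu - eta * grad       # (2) dual update
        theta = mu                 # (3) primal variable: theta = grad Phi*(mu)
        if lo is not None and hi is not None:
            theta = max(lo, min(hi, theta))  # (4) projection onto Gamma
        estimates.append(theta)
    return estimates
```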

A standard performance metric for an OCO algorithm is the regret: the difference between the total cost that the online algorithm has incurred and that of the best fixed decision in hindsight. Given samples $X_1, \ldots, X_t$, the regret for a sequence of estimators $\{\hat\theta_i\}$ is defined as

$$R_t = \sum_{i=1}^{t}\big\{-\log f_{\hat\theta_{i-1}}(X_i)\big\} - \inf_{\tilde\theta\in\Theta}\sum_{i=1}^{t}\big\{-\log f_{\tilde\theta}(X_i)\big\}. \tag{13}$$

For strongly convex loss functions, the regret of many OCO algorithms, including OMD, has the property that $R_t \le C\log t$ for some constant $C$ (depending on $\sigma$ and $d$) and any positive integer $t$ [Agarwal and Duchi, 2011, Raginsky et al., 2012]. Note that for the exponential family, the loss function is the negative log-likelihood, which is strongly convex over $\Theta_\sigma$. Hence, we have the logarithmic regret property.
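The regret (13) is easy to compute for the Gaussian running-mean estimator, since the Gaussian negative log-likelihood is a squared error up to a constant that cancels in the difference; this sketch (with $\hat\theta_0 = 0$ as an assumed initialization) can be used to observe the logarithmic growth numerically.

```python
def regret_running_mean(xs, theta0=0.0):
    """Regret (13) of the running-mean estimator for N(theta, 1) (sketch).

    -log f_theta(x) = (x - theta)^2 / 2 up to an additive constant that
    cancels in the regret; the best fixed parameter in hindsight is the
    overall sample mean.
    """
    theta, s, online_loss = theta0, 0.0, 0.0
    for t, x in enumerate(xs, start=1):
        online_loss += (x - theta) ** 2 / 2   # loss of the plug-in estimate
        s += x
        theta = s / t                          # one-sample update
    xbar = s / len(xs)
    best_fixed = sum((x - xbar) ** 2 / 2 for x in xs)
    return online_loss - best_fixed
```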

## 3 Nearly second-order asymptotic optimality of one-sample update schemes

Below we prove the nearly second-order asymptotic optimality of the one-sample update schemes. More precisely, nearly second-order asymptotic optimality means that the algorithm attains the lower performance bound asymptotically up to a log-log factor in the false-alarm rate, as the false-alarm rate tends to zero (in many cases the log-log factor is a small number).

We first introduce some necessary notation. Denote by $\mathbb{P}_{\theta,\nu}$ and $\mathbb{E}_{\theta,\nu}$ the probability measure and the expectation when the change occurs at time $\nu$ and the post-change parameter is $\theta$, i.e., when $X_1, \ldots, X_\nu$ are i.i.d. random variables with density $f_{\theta_0}$ and $X_{\nu+1}, X_{\nu+2}, \ldots$ are i.i.d. random variables with density $f_\theta$. Moreover, let $\mathbb{P}_\infty$ and $\mathbb{E}_\infty$ denote the probability measure and the expectation when there is no change, i.e., $X_1, X_2, \ldots$ are i.i.d. random variables with density $f_{\theta_0}$. Finally, let $\mathcal{F}_t$ denote the $\sigma$-algebra generated by $X_1, \ldots, X_t$ for $t \ge 1$.

### 3.1 “One-sided” sequential hypothesis test

The two standard performance metrics are the false-alarm rate, denoted as $\mathbb{P}_\infty(\tau(b) < \infty)$, and the expected detection delay (i.e., the expected number of samples needed to reject the null), denoted as $\mathbb{E}_{\theta,0}[\tau(b)]$. A meaningful test should have both a small false-alarm rate and a small expected detection delay. Usually, one adjusts the threshold $b$ to control the false-alarm rate to be below a certain level.

Intuitively, a reasonable sequence of estimators should move closer to the true parameter as we collect more data. This is reflected by the following regularity condition (the same assumption is made in (5.84) of [Tartakovsky et al., 2014]):

$$\sum_{t=1}^{\infty}\big(\mathbb{E}_{\theta,0}[I(\theta,\hat\theta_t)]\big)^{r} < \infty, \tag{14}$$

for some constant $r > 0$ that characterizes the convergence rate of $\hat\theta_t$ to $\theta$. This is a mild assumption that is satisfied by many sequences of estimators.

{Remark}

The assumption (14) is mild since it holds whenever $\mathbb{E}_{\theta,0}[I(\theta,\hat\theta_t)] = O(t^{-\kappa})$ for some $\kappa > 0$ as $t$ goes to infinity, i.e., whenever the convergence rate of the estimators to the true parameter is polynomial. Even though we do not prove here that all OMD estimators satisfy (14), we give one example to demonstrate that it may be satisfied by many OMD estimators, since OMD estimators are designed inherently to approach the true parameters quickly Hazan [2016]. Consider detection of a Gaussian mean without any constraint on the mean parameters, i.e., $f_\theta$ is the density of $\mathcal{N}(\theta, I_d)$ with $\theta \in \mathbb{R}^d$. A quick check gives that $I(\theta, \hat\theta_t) = \|\hat\theta_t - \theta\|^2/2$, where $\|\cdot\|$ is the $\ell_2$ norm. Moreover, the OMD estimators correspond to the MLE $\hat\theta_t = t^{-1}\sum_{i=1}^{t} X_i$. By the asymptotic efficiency of the MLE, we have $\mathbb{E}_{\theta,0}\|\hat\theta_t - \theta\|^2 = d/t$, which corresponds to $\mathbb{E}_{\theta,0}[I(\theta,\hat\theta_t)] = d/(2t)$. Therefore, assumption (14) is satisfied for any $r > 1$. In practice, a case-by-case validation is recommended.
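The rate $\mathbb{E}_{\theta,0}[I(\theta,\hat\theta_t)] = d/(2t)$ above can be checked by a small Monte Carlo experiment in the scalar case $d = 1$ (the sample size, seed, and tolerance below are illustrative choices):

```python
import random

def mc_expected_kl(theta=1.0, t=10, reps=20000, seed=0):
    """Monte Carlo estimate of E[I(theta, theta_hat_t)] for N(theta, 1) (sketch).

    theta_hat_t is the MLE (sample mean of t observations), and
    I(theta, theta_hat_t) = (theta_hat_t - theta)^2 / 2, whose expectation
    is 1/(2t); hence the series in (14) converges for any r > 1.
    """
    rng = random.Random(seed)
    acc = 0.0
    for _ in range(reps):
        theta_hat = sum(rng.gauss(theta, 1.0) for _ in range(t)) / t
        acc += (theta_hat - theta) ** 2 / 2
    return acc / reps
```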

Our main result is the following. As observed by Lai [2004], there is a loss in statistical efficiency from using one-sample update estimators relative to the GLR approach, which uses all past samples. The theorem below shows that this loss corresponds to the expected regret given in (13).

{Theorem}

[Upper bound for the OMD-based SPRT] Let $\{\hat\theta_t\}$ be a sequence of estimators generated by Algorithm 1. When (14) holds, as $b\to\infty$,

$$\mathbb{E}_{\theta,0}[\tau(b)]\le\left(I(\theta,\theta_0)\right)^{-1}\left(b+\mathbb{E}_{\theta,0}\left[R_{\tau(b)}\right]+O(1)\right). \qquad (15)$$

Here $O(1)$ denotes a term upper-bounded by an absolute constant as $b\to\infty$.

The main idea of the proof is to decompose the statistic defining $\tau(b)$ into a few terms that form martingales, and then invoke Wald's theorem for the stopped process. {Remark} Even though Theorem 3.1 is stated for OMD, the inequality (15) is valid for any non-anticipating estimators generated by an OCO algorithm, as long as (14) holds. Moreover, (15) gives an explicit connection between the expected detection delay of the one-sided sequential hypothesis test (the left-hand side of (15)) and the regret of the OCO procedure (the second term on the right-hand side of (15)). This clearly illustrates the impact of estimation on detection through an algorithm-dependent factor.

Note that, in the statement of Theorem 3.1, the stopping time $\tau(b)$ appears on the right-hand side of inequality (15). For OMD, the expected sample size $\mathbb{E}_{\theta,0}[\tau(b)]$ is usually small. By combining (15) with a specific regret bound for $R_t$, we can bound $\mathbb{E}_{\theta,0}[R_{\tau(b)}]$, as discussed in Section 4. The most important case is when the estimation algorithm has a logarithmic expected regret. For the exponential family, as shown in Section 3.3, Algorithm 1 can achieve $\mathbb{E}_{\theta,0}[R_t]\le C\log t$ for any positive integer $t$. To obtain a more specific order of the upper bound for $\mathbb{E}_{\theta,0}[\tau(b)]$ as $b$ grows, we establish an upper bound for $\mathbb{E}_{\theta,0}[R_{\tau(b)}]$ as a function of $b$, yielding the following Corollary 3.1.
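To illustrate how the delay bound behaves, the sketch below simulates a one-sided SPRT for a one-dimensional Gaussian mean shift, using the running mean as the one-sample-update (plug-in) estimator. All concrete values here ($\theta$, $\theta_0$, thresholds, replication counts) are hypothetical choices for the simulation, not values from the paper; the observed average delay should track the first-order term $b/I(\theta,\theta_0)$ plus a modest additive correction.

```python
import numpy as np

rng = np.random.default_rng(1)
theta, theta0 = 1.0, 0.0           # hypothetical post-/pre-change means, unit variance
I = (theta - theta0) ** 2 / 2      # KL divergence I(theta, theta0) for N(., 1)

def sprt_delay(b, n_reps=500, horizon=4000):
    """Average stopping time of the plug-in one-sided SPRT at threshold b."""
    delays = []
    for _ in range(n_reps):
        s, mean = 0.0, theta0       # plug-in estimate starts at theta0
        for t in range(1, horizon + 1):
            x = rng.normal(theta, 1.0)
            # log-likelihood-ratio increment with the non-anticipating estimate
            s += (mean - theta0) * x - (mean**2 - theta0**2) / 2
            mean += (x - mean) / t  # one-sample update (running mean = MLE)
            if s >= b:
                delays.append(t)
                break
    return float(np.mean(delays))

for b in (5.0, 10.0, 20.0):
    print(b, round(sprt_delay(b), 1), b / I)
```

For each threshold, the observed delay exceeds $b/I(\theta,\theta_0)$ by a small, slowly growing amount, consistent with the logarithmic correction term in Corollary 3.1.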

{Corollary}

Let $\{\hat\theta_t\}$ be the sequence of estimators generated by Algorithm 1. Assume that $\mathbb{E}_{\theta,0}[R_t]\le C\log t$ for any positive integer $t$ and some constant $C>0$. When (14) holds, we have

$$\mathbb{E}_{\theta,0}[\tau(b)]\le\frac{b}{I(\theta,\theta_0)}+\frac{C\log b}{I(\theta,\theta_0)}(1+o(1)). \qquad (16)$$

Here $o(1)$ is a vanishing term as $b\to\infty$. Corollary 3.1 shows that, beyond the well-known first-order approximation Lorden [1971], Lorden and Pollak [2005], the expected detection delay is bounded by an additional term of order $\log b$ when the estimation algorithm has a logarithmic regret. This term plays an important role in establishing the optimality properties later. To show the optimality properties of the detection procedures, we first select a set of detection procedures whose false-alarm rates are below a prescribed value, and then prove that, among all procedures in the set, the expected detection delays of our proposed procedures are the smallest. Thus, we can choose a threshold to uniformly control the false-alarm rate of $\tau(b)$. {Lemma}[False-alarm rate of $\tau(b)$] Let $\{\hat\theta_t\}$ be any sequence of non-anticipating estimators (e.g., generated by Algorithm 1). For any $b>0$, $\mathbb{P}_\infty(\tau(b)<\infty)\le e^{-b}$.

Lemma 3.1 shows that, as $b$ increases, the false-alarm rate of $\tau(b)$ decays exponentially fast. We can set $b=\log(1/\alpha)$ to make the false-alarm rate of $\tau(b)$ less than a prescribed $\alpha\in(0,1)$. Next, leveraging an existing lower bound for general SPRTs presented in Section 5.5.1.1 of [Tartakovsky et al., 2014], we establish the nearly second-order asymptotic optimality of the OMD-based SPRT as follows:

{Corollary}

[Nearly second-order optimality of the OMD-based SPRT] Let $\{\hat\theta_t\}$ be the sequence of estimators generated by Algorithm 1. Assume that $\mathbb{E}_{\theta,0}[R_t]\le C\log t$ for any positive integer $t$ and some constant $C>0$, and that (14) holds. Define the set $C(\alpha)=\{T:\mathbb{P}_\infty(T<\infty)\le\alpha\}$. For $b=\log(1/\alpha)$, due to Lemma 3.1, $\tau(b)\in C(\alpha)$. For such a choice of $b$, $\tau(b)$ is nearly second-order asymptotically optimal in the sense that, for any $\theta$, as $\alpha\to 0$,

$$\mathbb{E}_{\theta,0}[\tau(b)]-\inf_{T\in C(\alpha)}\mathbb{E}_{\theta,0}[T]=O(\log(\log(1/\alpha))). \qquad (17)$$

The result means that, compared with any procedure (including the optimal procedure) calibrated to have a false-alarm rate less than $\alpha$, our procedure incurs at most an $O(\log\log(1/\alpha))$ increase in the expected detection delay, which is usually a small number: even for a very conservative choice of $\alpha$, the quantity $\log\log(1/\alpha)$ remains a single-digit number.
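The slow growth of $\log\log(1/\alpha)$ is easy to see numerically (the sample values of $\alpha$ below are arbitrary illustrations):

```python
import math

# log log(1/alpha) grows extremely slowly: even aggressive false-alarm
# targets cost only a few extra samples of expected delay.
for alpha in (1e-2, 1e-4, 1e-8, 1e-16):
    print(f"alpha={alpha:.0e}  loglog(1/alpha)={math.log(math.log(1 / alpha)):.2f}")
```

Tightening $\alpha$ from $10^{-2}$ to $10^{-16}$ only roughly doubles this penalty term.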

### 3.2 Sequential change-point detection

Now we leverage the close connection Lorden [1971] between sequential change-point detection and the one-sided hypothesis test. For sequential change-point detection, the two commonly used performance metrics [Tartakovsky et al., 2014] are the average run length (ARL), denoted by $\mathbb{E}_\infty[T]$, and the maximal conditional average delay to detection (CADD), denoted by $\sup_{\nu\ge 0}\mathbb{E}_{\theta,\nu}[T-\nu\mid T>\nu]$. The ARL is the expected number of samples between two successive false alarms, and the CADD is the expected number of samples needed to detect the change after it occurs. A good procedure should have a large ARL and a small CADD. As in the one-sided hypothesis test, one usually chooses the threshold large enough so that the ARL exceeds a pre-specified level.

Similar to Theorem 3.1, we provide an upper bound for the CADD of our ASR and ACM procedures.

{Theorem}

Consider the change-point detection procedures $T_{\mathrm{ASR}}(b)$ in (7) and $T_{\mathrm{ACM}}(b)$ in (8). For any fixed post-change parameter $\theta$, let $\{\hat\theta_t\}$ be the sequence of estimators generated by OMD. Assume that $\mathbb{E}_{\theta,0}[R_t]\le C\log t$ for any positive integer $t$ and some constant $C>0$, and that (14) holds. Then, as $b\to\infty$, we have that

$$\sup_{\nu\ge 0}\mathbb{E}_{\theta,\nu}\left[T_{\mathrm{ASR}}(b)-\nu\mid T_{\mathrm{ASR}}(b)>\nu\right]\le\sup_{\nu\ge 0}\mathbb{E}_{\theta,\nu}\left[T_{\mathrm{ACM}}(b)-\nu\mid T_{\mathrm{ACM}}(b)>\nu\right]\le\left(I(\theta,\theta_0)\right)^{-1}\left(b+\mathbb{E}_{\theta,0}\left[R_{\tau(b)}\right]+O(1)\right). \qquad (18)$$

To prove Theorem 3.2, we relate the ASR and ACM procedures to the one-sided hypothesis test and use the fact that, when the measure $\mathbb{P}_{\theta,\nu}$ is known, the supremum over $\nu$ is attained at $\nu=0$ for both the ASR and the ACM procedures. Above, we may apply a similar argument as in Corollary 3.1 to remove the dependence on $\tau(b)$ on the right-hand side of the inequality. We establish the following lower bound for the ARL of the detection procedures, which is needed for proving Theorem 3.2: {Lemma}[ARL] Consider the change-point detection procedures $T_{\mathrm{ASR}}(b)$ in (7) and $T_{\mathrm{ACM}}(b)$ in (8). For any fixed $\theta_0$, let $\{\hat\theta_t\}$ be any sequence of non-anticipating estimators (e.g., generated by OMD). Given a prescribed lower bound $\gamma>0$ for the ARL, we have

$$\mathbb{E}_\infty[T_{\mathrm{ACM}}(b)]\ge\mathbb{E}_\infty[T_{\mathrm{ASR}}(b)]\ge\gamma,$$

provided that $b\ge\log\gamma$.

Lemma 3.2 shows that, given a required lower bound $\gamma$ for the ARL, we can choose $b=\log\gamma$ to make the ARL greater than $\gamma$. This is consistent with earlier works Pollak [1987], Lorden and Pollak [2005], which show that the smallest threshold such that the ARL is at least $\gamma$ is approximately $\log\gamma$. However, the bound in Lemma 3.2 is not tight, since in practice we can set $b=c\log\gamma$ for some $c<1$ and still ensure that the ARL is greater than $\gamma$.

Combining the upper bound in Theorem 3.2 with an existing lower bound for the CADD of the SRRS procedure in [Siegmund and Yakir, 2008], we obtain the following optimality properties. {Corollary}[Nearly second-order asymptotic optimality of ACM and ASR] Consider the change-point detection procedures $T_{\mathrm{ASR}}(b)$ in (7) and $T_{\mathrm{ACM}}(b)$ in (8). For any fixed post-change parameter $\theta$, let $\{\hat\theta_t\}$ be the sequence of estimators generated by OMD. Assume that $\mathbb{E}_{\theta,0}[R_t]\le C\log t$ for any positive integer $t$ and some constant $C>0$, and that (14) holds. Define $S(\gamma)=\{T:\mathbb{E}_\infty[T]\ge\gamma\}$. For $b=\log\gamma$, due to Lemma 3.2, both $T_{\mathrm{ASR}}(b)$ and $T_{\mathrm{ACM}}(b)$ belong to $S(\gamma)$. For such $b$, both $T_{\mathrm{ASR}}(b)$ and $T_{\mathrm{ACM}}(b)$ are nearly second-order asymptotically optimal in the sense that, for any $\theta$, as $\gamma\to\infty$,

$$\sup_{\nu\ge 1}\mathbb{E}_{\theta,\nu}\left[T_{\mathrm{ASR}}(b)-\nu+1\mid T_{\mathrm{ASR}}(b)\ge\nu\right]-\inf_{T(b)\in S(\gamma)}\sup_{\nu\ge 1}\mathbb{E}_{\theta,\nu}\left[T(b)-\nu+1\mid T(b)\ge\nu\right]=O(\log\log\gamma). \qquad (19)$$

A similar expression holds for $T_{\mathrm{ACM}}(b)$. The result means that, compared with any procedure (including the optimal procedure) calibrated to have an ARL larger than $\gamma$, our procedure incurs at most an $O(\log\log\gamma)$ increase in the CADD. Comparing (19) with (17), we note that the ARL $\gamma$ plays the same role as $1/\alpha$, because $1/\gamma$ is roughly the false-alarm rate in sequential change-point detection Lorden [1971].

### 3.3 Example: Regret bound for specific cases

In this subsection, we show that the regret $R_t$ can be expressed as a weighted sum of Bregman divergences between consecutive estimators. This form of $R_t$ is useful for establishing the logarithmic regret of OMD. The following result is a modification of Azoury and Warmuth [2001].

{Theorem}

Assume that $X_1,X_2,\dots$ are i.i.d. random variables with density function $f_\theta$. Let the step size be $\eta_i=1/i$ in Algorithm 1. Assume that $\{\hat\mu_i\}$ are obtained using Algorithm 1 and that each $\hat\mu_i$ equals the unprojected update (defined in steps 7 and 8 of Algorithm 1) for any $i$, i.e., the projection step is inactive. Then for any $t$ and any $\theta$,

$$R_t=\sum_{i=1}^{t}i\cdot B_{\Phi^*}(\hat\mu_i,\hat\mu_{i-1})=\frac{1}{2}\sum_{i=1}^{t}i\cdot(\hat\mu_i-\hat\mu_{i-1})^\top\left[\nabla^2\Phi^*(\tilde\mu_i)\right](\hat\mu_i-\hat\mu_{i-1}),$$

where $\tilde\mu_i=\lambda_i\hat\mu_i+(1-\lambda_i)\hat\mu_{i-1}$, for some $\lambda_i\in[0,1]$, by Taylor's theorem.
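The mean-value form above can be sanity-checked numerically. The sketch below uses the Poisson family as an illustrative (hypothetical) example, where $\Phi^*(\mu)=\mu\log\mu-\mu$, so $B_{\Phi^*}(a,b)=a\log(a/b)-a+b$ and $(\Phi^*)''(\mu)=1/\mu$; solving $B=\frac12(a-b)^2(\Phi^*)''(\tilde\mu)$ for $\tilde\mu$ always yields a point between $a$ and $b$:

```python
import math

# Poisson exponential family: Phi*(mu) = mu*log(mu) - mu, so
# B_{Phi*}(a, b) = a*log(a/b) - a + b and (Phi*)''(mu) = 1/mu.
def bregman(a, b):
    return a * math.log(a / b) - a + b

# Mean-value form: B(a, b) = (1/2)(a - b)^2 * (Phi*)''(mu_tilde)
# for some mu_tilde between a and b; solve for mu_tilde and verify.
for a, b in [(2.0, 1.0), (0.5, 3.0), (4.0, 3.9)]:
    B = bregman(a, b)
    mu_tilde = (a - b) ** 2 / (2 * B)     # since (Phi*)''(mu) = 1/mu
    assert min(a, b) <= mu_tilde <= max(a, b)
    print(a, b, round(mu_tilde, 4))
```

For the Gaussian family, $\Phi^*$ is quadratic and the Hessian is constant, so $\tilde\mu_i$ plays no role; the Poisson case shows the intermediate point genuinely depends on the pair of estimates.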

Next, we use Theorem 3.3 on a concrete example. The multivariate normal distribution, denoted by $\mathcal{N}(\theta,I_d)$, is parametrized by an unknown mean parameter $\theta\in\mathbb{R}^d$ and a known covariance matrix $I_d$ ($I_d$ is the $d\times d$ identity matrix). Following the notations in Subsection 2.3, we know that $\phi(x)=x$, $\Phi(\theta)=\|\theta\|^2/2$, $\Phi^*(\mu)=\|\mu\|^2/2$, $\nabla\Phi^*(\mu)=\mu$ for any $\mu$, $I(\theta_1,\theta_2)=\|\theta_1-\theta_2\|^2/2$, and $f_\theta(x)=\left((2\pi)^d|\Sigma|\right)^{-1/2}\exp\left(-(x-\theta)^\top\Sigma^{-1}(x-\theta)/2\right)$ with $\Sigma=I_d$, where $|\cdot|$ denotes the determinant of a matrix, and $\mathbb{P}_\theta$ is a probability measure under which the sample follows $\mathcal{N}(\theta,I_d)$. When the covariance matrix is known to be some $\Sigma$, one can "whiten" the vectors by multiplying by $\Sigma^{-1/2}$ to obtain the situation here. {Corollary}[Upper bound for the expected regret, Gaussian] Assume $X_1,X_2,\dots$ are i.i.d. following $\mathcal{N}(\theta,I_d)$ with some $\theta\in\mathbb{R}^d$. Assume that $\{\hat\mu_i\}$ are obtained using Algorithm 1 with $\eta_i=1/i$ and $\hat\mu_0=\theta_0$. For any $t$, we have that for some constant $C_1$ that depends on $\theta$,

$$\mathbb{E}_{\theta,0}[R_t]\le C_1\,d\,(\log t)/2.$$

The following calculations justify Corollary 3.3, and they also serve as an example of how to use the regret bound. First, the assumption in Theorem 3.3 that the projection step is inactive holds here: the feasible set is the full space $\mathbb{R}^d$, so, by Algorithm 1 and the non-negativity of the Bregman divergence, the projection never alters the iterate. Then the regret can be written as

$$R_t=\frac{1}{2}(\hat\mu_1-\hat\mu_0)^\top(\hat\mu_1-\hat\mu_0)+\frac{1}{2}\sum_{i=2}^{t}\left[i\cdot(\hat\mu_i-\hat\mu_{i-1})^\top(\hat\mu_i-\hat\mu_{i-1})\right]=\frac{1}{2}(X_1-\theta_0)^\top(X_1-\theta_0)+\frac{1}{2}\sum_{i=2}^{t}(\hat\mu_i-\hat\mu_{i-1})^\top(\phi(X_i)-\hat\mu_{i-1}).$$

Since the step size is $\eta_i=1/i$, we have $\hat\mu_i-\hat\mu_{i-1}=(\phi(X_i)-\hat\mu_{i-1})/i$, and the second term in the above equation can be written as:

$$\begin{aligned}\frac{1}{2}\sum_{i=2}^{t}(\hat\mu_i-\hat\mu_{i-1})^\top(\phi(X_i)-\hat\mu_{i-1})&=\frac{1}{2}\sum_{i=2}^{t}(\hat\mu_i-\hat\mu_{i-1})^\top(\phi(X_i)+\hat\mu_i)-\sum_{i=2}^{t}\frac{1}{2}(\hat\mu_i-\hat\mu_{i-1})^\top(\hat\mu_{i-1}+\hat\mu_i)\\&=\sum_{i=2}^{t}\frac{1}{2(i-1)}(\phi(X_i)-\hat\mu_i)^\top(\phi(X_i)+\hat\mu_i)+\sum_{i=2}^{t}\frac{1}{2}\left(\|\hat\mu_{i-1}\|^2-\|\hat\mu_i\|^2\right)\\&=\sum_{i=2}^{t}\frac{1}{2(i-1)}\|X_i\|^2-\sum_{i=2}^{t}\frac{1}{2(i-1)}\|\hat\mu_i\|^2+\frac{1}{2}\|\hat\mu_1\|^2-\frac{1}{2}\|\hat\mu_t\|^2.\end{aligned}$$

Combining the above, we have

$$\mathbb{E}_{\theta,0}[R_t]\le\frac{1}{2}\mathbb{E}_{\theta,0}\left[(X_1-\theta_0)^\top(X_1-\theta_0)\right]+\frac{1}{2}\sum_{i=2}^{t}\frac{1}{i-1}\mathbb{E}_{\theta,0}\left[\|X_i\|^2\right]+\frac{1}{2}\mathbb{E}_{\theta,0}\left[\|X_1\|^2\right].$$

Finally, since $\mathbb{E}_{\theta,0}[\|X_i\|^2]=\|\theta\|^2+d<\infty$ for any $i$, we obtain the desired result. Thus, with i.i.d. multivariate normal samples, the expected regret grows logarithmically with the number of samples.
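A quick Monte Carlo check of this logarithmic growth (all parameter values below are arbitrary illustrative choices, not from the paper): the empirical $\mathbb{E}_{\theta,0}[R_t]$, computed from the Bregman form $R_t=\frac{1}{2}\sum_{i\le t} i\,\|\hat\mu_i-\hat\mu_{i-1}\|^2$ specialized to the Gaussian family, stays within a constant factor of $(d/2)\log t$:

```python
import numpy as np

rng = np.random.default_rng(2)
d, theta, theta0 = 4, 0.5, 0.0       # hypothetical: true mean 0.5 per coordinate
n_reps, t_max = 2000, 512

X = rng.normal(theta, 1.0, size=(n_reps, t_max, d))
idx = np.arange(1, t_max + 1)
means = np.cumsum(X, axis=1) / idx[None, :, None]      # OMD with eta_i = 1/i (running mean)
prev = np.concatenate([np.full((n_reps, 1, d), theta0), means[:, :-1]], axis=1)

# R_t = (1/2) * sum_{i<=t} i * ||mu_i - mu_{i-1}||^2  (Gaussian Bregman form)
terms = 0.5 * idx[None, :] * np.sum((means - prev) ** 2, axis=2)
regret = np.cumsum(terms, axis=1).mean(axis=0)         # empirical E[R_t]

for t in (8, 64, 512):
    print(t, round(regret[t - 1] / (0.5 * d * np.log(t)), 3))  # bounded ratio
```

The printed ratio stays bounded (slowly decreasing toward a constant), consistent with the $C_1 d(\log t)/2$ bound of Corollary 3.3.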

Using similar calculations, we can also bound the expected regret in the general case. As shown in the proof of Corollary 3.3 above, the dominating term of $R_t$ can be rewritten as

$$\sum_{i=2}^{t}\frac{1}{2(i-1)}(\phi(X_i)-\hat\mu_i)^\top\left[\nabla^2\Phi^*(\tilde\mu_i)\right](\phi(X_i)+\hat\mu_i),$$

where $\tilde\mu_i$ is a convex combination of $\hat\mu_i$ and $\hat\mu_{i-1}$. For an arbitrary distribution in the exponential family, this term can be viewed as arising from a local normal approximation with the changing curvature $\nabla^2\Phi^*(\tilde\mu_i)$.