Block-Wise MAP Inference for Determinantal Point Processes
with Application to Change-Point Detection
Abstract
Existing MAP inference algorithms for determinantal point processes (DPPs) need to calculate determinants or conduct eigenvalue decomposition generally at the scale of the full kernel, which presents a great challenge for real-world applications. In this paper, we introduce a class of DPPs, called BwDPPs, that are characterized by an almost block diagonal kernel matrix and thus can allow efficient block-wise MAP inference. Furthermore, BwDPPs are successfully applied to address the difficulty of selecting change-points in the problem of change-point detection (CPD), which results in a new BwDPP-based CPD method, named BwDppCpd. In BwDppCpd, a preliminary set of change-point candidates is first created based on existing well-studied metrics. Then, these change-point candidates are treated as DPP items, and DPP-based subset selection is conducted to give the final estimate of the change-points that favours both quality and diversity. The effectiveness of BwDppCpd is demonstrated through extensive experiments on five real-world datasets.
Jinye Zhang and Zhijian Ou
Speech Processing and Machine Intelligence Laboratory, Tsinghua University, Beijing, China, 100085
Introduction
The determinantal point processes (DPPs) are elegant probabilistic models for subset selection problems where both quality and diversity are considered. Formally, given a set of items $\mathcal{Y} = \{1, 2, \ldots, N\}$, a DPP defines a probability measure on $2^{\mathcal{Y}}$, the set of all subsets of $\mathcal{Y}$. For every subset $Y \subseteq \mathcal{Y}$ we have
$$P_L(Y) = \frac{\det(L_Y)}{\det(L + I)} \qquad (1)$$
where the L-ensemble kernel $L$ is an $N \times N$ positive semidefinite matrix, $L_Y$ denotes the submatrix of $L$ indexed by $Y$, and $I$ is the identity matrix. By writing $L = B^\top B$ as a Gram matrix, $\det(L_Y)$ could be viewed as the squared volume spanned by the column vectors $B_i$ for $i \in Y$. By defining $B_i = q_i \phi_i$, a popular decomposition of the kernel is given as
$$L_{ij} = q_i \phi_i^\top \phi_j q_j \qquad (2)$$
where $q_i \in \mathbb{R}^+$ measures the quality (magnitude) of item $i$ in $\mathcal{Y}$, and $\phi_i \in \mathbb{R}^D$, $\|\phi_i\| = 1$, can be viewed as the angle vector of diversity features, so that $S_{ij} = \phi_i^\top \phi_j$ measures the similarity between items $i$ and $j$. It can be shown that the probability of including items $i$ and $j$ increases with the quality of $i$ and $j$ and with the diversity between $i$ and $j$. As a result, a DPP assigns high probability to subsets that are both of good quality and diverse (?).
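The quality-diversity behaviour above can be illustrated numerically. The sketch below (in numpy; the helper names `qd_kernel` and `dpp_prob` are ours, purely illustrative) builds a kernel from qualities and diversity features and checks that, at equal quality, a diverse pair receives a higher unnormalized probability than a similar pair:

```python
import numpy as np

def qd_kernel(quality, features):
    """Build L_ij = q_i * phi_i^T phi_j * q_j from qualities and diversity features."""
    phi = features / np.linalg.norm(features, axis=0)  # normalize columns to unit length
    S = phi.T @ phi                                    # similarity matrix S_ij
    return np.outer(quality, quality) * S              # quality-diversity decomposition

def dpp_prob(L, Y):
    """Unnormalized DPP probability of subset Y: det(L_Y)."""
    idx = np.array(sorted(Y))
    return np.linalg.det(L[np.ix_(idx, idx)])

q = np.array([1.0, 1.0, 1.0])
# items 0 and 1 nearly parallel (similar); item 2 orthogonal to item 0 (diverse)
B = np.array([[1.0, 0.99, 0.0],
              [0.0, 0.10, 1.0]])
L = qd_kernel(q, B)
# a diverse pair gets higher probability than a similar pair of equal quality
assert dpp_prob(L, {0, 2}) > dpp_prob(L, {0, 1})
```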
For DPPs, the maximum a posteriori (MAP) problem $Y^{\mathrm{MAP}} = \arg\max_{Y \subseteq \mathcal{Y}} \det(L_Y)$, aiming at finding the subset with the highest probability, has attracted much attention due to its broad range of potential applications. Noting that this is an NP-hard problem (?), a number of approximate inference methods have been proposed, including greedy methods for optimizing the submodular function (?; ?), optimization via continuous relaxation (?), and minimum Bayes risk decoding that minimizes an application-specific loss function (?).
These existing methods need to calculate determinants or conduct eigenvalue decomposition. Both computations are taken at the scale of the kernel size $N$, with a cost of around $O(N^3)$ time, which becomes intolerably high when $N$ becomes large, e.g. thousands. Nevertheless, we find that for a class of DPPs where the kernel is almost block diagonal (Fig. 1 (b)), the MAP inference with the whole kernel could be replaced by a series of sub-inferences with its subkernels. Since the sizes of the subkernels become smaller, the overall computational cost can be significantly reduced. Such DPPs are often defined over a line, where items are only similar to their neighbours on the line and significantly different from those far away. Since the MAP inference for such DPPs is conducted in a block-wise manner, we refer to them as BwDPPs (block-wise DPPs) in the rest of the paper.
The above observation is mainly motivated by the problem of change-point detection (CPD), which aims at detecting abrupt changes in time-series data (?). In CPD, the period of time between two consecutive change-points, often referred to as a segment or a state, has homogeneous properties of interest (e.g. the same speaker in a speech (?) or the same behaviour in human activity data (?)). After choosing a number of change-point candidates without much difficulty, we can treat these change-point candidates as DPP items, and select a subset from them to be our final estimate of the change-points. Each change-point candidate has its own quality of being a change-point. Moreover, the true locations of change-points along the timeline tend to be diverse, since states (e.g. speakers in Fig. 1 (a)) would not change rapidly. Therefore, it is preferred to conduct change-point selection that incorporates both quality and diversity. DPP-based subset selection clearly suits this purpose well. Meanwhile, the corresponding kernel will then become almost block diagonal (e.g. Fig. 1 (b)), as neighbouring items are less diversified and items far apart are more diversified. In this case, the DPP becomes a BwDPP.
The problem of CPD has been actively studied for decades, and various CPD methods can be broadly classified into the Bayesian or the frequentist approach. In the Bayesian approach, the CPD problem is reduced to estimating the posterior distribution of the change-point locations given the time-series data (?). Other posteriors to be estimated include the 0/1 indicator sequence (?) and the "run length" (?). Although many improvements have been made, e.g. using advanced Monte Carlo methods, the efficiency of estimating these posteriors is still a big challenge for real-world tasks.
In the frequentist approach, the core idea is hypothesis testing, and the general strategy is to first define a metric (test statistic) by considering the observations over past and present windows. As both windows move forward, change-points are selected when the metric value exceeds a threshold. Some widely-used metrics include the cumulative sum (?), the generalized likelihood ratio (?), the Bayesian information criterion (BIC) (?), the Kullback-Leibler divergence (?), and more recently, subspace-based metrics (?; ?), kernel-based metrics (?), and density-ratio metrics (?; ?). While various metrics have been explored, how to choose thresholds and perform change-point selection, which is also a determining factor for detection performance, is relatively less studied. Heuristic rules or procedures are dominant and do not perform well, e.g. selecting local peaks above a threshold (?), discarding the lower one if two peaks are close (?), or requiring the metric differences between change-points and their neighbouring valleys to be above a threshold (?).
In this paper, we propose to apply DPPs to address the difficulty of selecting change-points. Based on existing well-studied metrics, we can create a preliminary set of change-point candidates without much difficulty. Then, we treat these change-point candidates as DPP items, and conduct DPP-based subset selection to obtain the final estimate of the change-points that favours both quality and diversity.
The contribution of this paper is twofold. First, we introduce a class of DPPs, called BwDPPs, that are characterized by an almost block diagonal kernel matrix and thus allow efficient block-wise MAP inference. Second, BwDPPs are successfully applied to address the difficult problem of selecting change-points, which results in a new BwDPP-based CPD method, named BwDppCpd.
The rest of the paper is organized as follows. After brief preliminaries, we introduce BwDPPs and give our theoretical results on the BwDPP-MAP method. Next, we introduce BwDppCpd and present evaluation results on a number of real-world datasets. Finally, we conclude the paper with a discussion of potential future directions.
Preliminaries
Throughout the paper, we are interested in MAP inference for BwDPPs, a particular class of DPPs where the L-ensemble kernel is almost block diagonal^1 (^1 Such matrices could also be defined as a particular class of block tridiagonal matrices, where the off-diagonal submatrices have only a few nonzero entries at the bottom left.), namely
$$L = \begin{bmatrix} L_1 & C_1 & & \\ C_1^\top & L_2 & \ddots & \\ & \ddots & \ddots & C_{m-1} \\ & & C_{m-1}^\top & L_m \end{bmatrix} \qquad (3)$$
where the diagonal submatrices $L_1, \ldots, L_m$ are subkernels containing DPP items that are mutually similar, and the off-diagonal submatrices $C_1, \ldots, C_{m-1}$ are sparse submatrices with nonzero entries only at the bottom left, representing the connections between adjacent subkernels. Fig. 2 (a) gives a good example of such matrices.
Let $\mathcal{Y} = \{1, \ldots, N\}$ be the set of all indices of $L$ and let $\mathcal{Y}_i$ be that of $L_i$ correspondingly. For any sets of indices $A, B \subseteq \mathcal{Y}$, we use $L_A$ to denote the square submatrix of $L$ indexed by $A$, and $L_{A,B}$ the submatrix with rows indexed by $A$ and columns by $B$. Following general notations, by $\mathrm{diag}(L_1, \ldots, L_m)$ we mean the block diagonal matrix consisting of submatrices $L_1, \ldots, L_m$, and $L \succeq 0$ means that $L$ is positive semidefinite.
MAP Inference for BwDPPs
Strictly Block Diagonal Kernel
We first consider the motivating case where the kernel is strictly block diagonal, i.e. all elements in the off-diagonal submatrices are zero. It can be easily seen that the following divide-and-conquer theorem holds.
Theorem 1
For a DPP with a block diagonal kernel $L = \mathrm{diag}(L_1, \ldots, L_m)$ over ground set $\mathcal{Y} = \mathcal{Y}_1 \cup \cdots \cup \mathcal{Y}_m$, which is partitioned correspondingly, the MAP solution can be obtained as:
$$Y^{\mathrm{MAP}} = \bigcup_{i=1}^{m} Y_i^{\mathrm{MAP}} \qquad (4)$$
where $Y^{\mathrm{MAP}} = \arg\max_{Y \subseteq \mathcal{Y}} \det(L_Y)$, and $Y_i^{\mathrm{MAP}} = \arg\max_{Y_i \subseteq \mathcal{Y}_i} \det\big((L_i)_{Y_i}\big)$.
Theorem 1 tells us that the MAP inference with a strictly block diagonal kernel can be decomposed into a series of sub-inferences with its subkernels. In this way, the overall computation cost can be largely reduced. Noting that no exact DPP-MAP algorithms are available so far, any approximate DPP-MAP algorithm could be used in a plug-and-play way for the sub-inferences.
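The factorization behind the theorem can be checked numerically: for a strictly block diagonal kernel, $\det(L_Y)$ is the product of the per-block determinants, so each block can be optimized on its own. A minimal numpy check (all names illustrative):

```python
import numpy as np

# Toy check of the divide-and-conquer theorem: with a strictly block diagonal
# kernel, det(L_Y) factorizes over blocks, so MAP decomposes into sub-inferences.
rng = np.random.default_rng(0)

def random_psd(n):
    """A random positive semidefinite matrix A A^T."""
    A = rng.standard_normal((n, n))
    return A @ A.T

L1, L2 = random_psd(3), random_psd(4)
L = np.block([[L1, np.zeros((3, 4))],
              [np.zeros((4, 3)), L2]])

Y1, Y2 = [0, 2], [4, 6]           # Y2 indexes into the second block (offset 3)
Y = Y1 + Y2
det_joint = np.linalg.det(L[np.ix_(Y, Y)])
det_blocks = (np.linalg.det(L1[np.ix_(Y1, Y1)])
              * np.linalg.det(L2[np.ix_([1, 3], [1, 3])]))
assert np.isclose(det_joint, det_blocks)
```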
Almost Block Diagonal Kernel
Now we analyze the MAP inference for a BwDPP with an almost block diagonal kernel as defined in (3). Let $Y \subseteq \mathcal{Y}$ be the hypothesized subset to be selected from $\mathcal{Y}$ and let $Y_i \subseteq \mathcal{Y}_i$ be that from $\mathcal{Y}_i$ correspondingly, where $Y = Y_1 \cup \cdots \cup Y_m$. Without loss of generality, we assume $(\bar{L}_i)_{Y_i}$ is invertible^2 (^2 That simply assumes that we only consider the nontrivial subsets selected with a DPP kernel $L$, i.e. $\det(L_Y) > 0$.) for $i = 1, \ldots, m$. By defining $\bar{L}_i$ recursively as
$$\bar{L}_1 = L_1, \qquad \bar{L}_i = L_i - (C_{i-1})_{Y_{i-1},:}^\top \big((\bar{L}_{i-1})_{Y_{i-1}}\big)^{-1} (C_{i-1})_{Y_{i-1},:}, \quad i = 2, \ldots, m, \qquad (5)$$
where $(C_{i-1})_{Y_{i-1},:}$ denotes the rows of $C_{i-1}$ indexed by $Y_{i-1}$,
one could rewrite the MAP objective function:
$$\det(L_Y) = \det\!\begin{bmatrix} (\bar{L}_1)_{Y_1} & (C_1)_{Y_1,Y_2} & \mathbf{0} \\ (C_1)_{Y_1,Y_2}^\top & (L_2)_{Y_2} & \cdots \\ \mathbf{0} & \vdots & \ddots \end{bmatrix} = \det\big((\bar{L}_1)_{Y_1}\big) \cdot \det\!\begin{bmatrix} (\bar{L}_2)_{Y_2} & (C_2)_{Y_2,Y_3} & \mathbf{0} \\ (C_2)_{Y_2,Y_3}^\top & (L_3)_{Y_3} & \cdots \\ \mathbf{0} & \vdots & \ddots \end{bmatrix} \qquad (6)$$
where $\mathbf{0}$ represents a zero matrix of appropriate size that fills the corresponding area with zeros. The key to the second equation above is $L_{Y_1, Y_j} = \mathbf{0}$ for $j \geq 3$, since $L$ is an almost block diagonal kernel. Continuing this recursion,
$$\det(L_Y) = \prod_{i=1}^{m} \det\big((\bar{L}_i)_{Y_i}\big) \qquad (7)$$
Hence, the MAP objective function is reduced to:
$$Y^{\mathrm{MAP}} = \operatorname*{arg\,max}_{Y_1 \subseteq \mathcal{Y}_1, \ldots, Y_m \subseteq \mathcal{Y}_m} \prod_{i=1}^{m} \det\big((\bar{L}_i)_{Y_i}\big) \qquad (8)$$
As $\bar{L}_i$ depends on $Y_1, \ldots, Y_{i-1}$, we cannot optimize the $Y_i$ separately. Alternatively, we provide an approximate method that optimizes over $Y_1, \ldots, Y_m$ sequentially, named the BwDPP-MAP method, which is a depth-first greedy search method in essence. BwDPP-MAP is described in Table 1, where $\hat{Y}_i = \arg\max_{Y_i \subseteq \mathcal{Y}_i} \det\big((\hat{L}_i)_{Y_i}\big)$ denotes optimizing over $Y_i$ with the value of $Y_j$ fixed as $\hat{Y}_j$ for $j < i$, and the subkernel^3 (^3 Both $\bar{L}_i$ and $\hat{L}_i$ are called subkernels.) $\hat{L}_i$ is given similarly as $\bar{L}_i$, namely
$$\hat{L}_1 = L_1, \qquad \hat{L}_i = L_i - (C_{i-1})_{\hat{Y}_{i-1},:}^\top \big((\hat{L}_{i-1})_{\hat{Y}_{i-1}}\big)^{-1} (C_{i-1})_{\hat{Y}_{i-1},:}, \quad i = 2, \ldots, m. \qquad (9)$$
One may notice that $\hat{L}_1$ is equivalent to $\bar{L}_1$.
Table 1: The BwDPP-MAP method.
Input: $L$ as defined in (3);
Output: Subset of items $\hat{Y}$.
For $i = 1, \ldots, m$:
  Compute $\hat{L}_i$ via (9);
  Perform sub-inference over $\mathcal{Y}_i$ via
    $\hat{Y}_i = \arg\max_{Y_i \subseteq \mathcal{Y}_i} \det\big((\hat{L}_i)_{Y_i}\big)$;
Return: $\hat{Y} = \hat{Y}_1 \cup \cdots \cup \hat{Y}_m$.
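The procedure above can be sketched in a few lines of numpy. This is a hypothetical illustration assuming a brute-force exact solver as the plug-in sub-inference (any DPP-MAP routine could be substituted); `blocks` holds the diagonal subkernels and `couplings` the off-diagonal blocks connecting adjacent subkernels:

```python
import itertools
import numpy as np

def exact_map(L):
    """Brute-force DPP-MAP over a (small) kernel: argmax_Y det(L_Y)."""
    n = L.shape[0]
    best, best_det = (), 1.0  # det over the empty set is 1
    for r in range(1, n + 1):
        for Y in itertools.combinations(range(n), r):
            d = np.linalg.det(L[np.ix_(Y, Y)])
            if d > best_det:
                best, best_det = Y, d
    return list(best)

def bwdpp_map(blocks, couplings, subinference=exact_map):
    """Sketch of block-wise MAP: blocks[i] is L_i, couplings[i] connects
    block i to block i+1 (rows indexed by block i, columns by block i+1)."""
    selected, offset = [], 0
    L_hat, prev_sel = blocks[0], None
    for i, L_i in enumerate(blocks):
        if i > 0:
            C = couplings[i - 1][prev_sel, :]            # rows of selected items
            inv = np.linalg.inv(L_hat[np.ix_(prev_sel, prev_sel)])
            L_hat = L_i - C.T @ inv @ C                  # subkernel update as in (9)
        prev_sel = subinference(L_hat)                   # plug-and-play sub-inference
        selected += [offset + j for j in prev_sel]
        offset += L_i.shape[0]
    return selected
```

With zero couplings the kernel is strictly block diagonal, so the block-wise result coincides with running the exact solver on the full kernel.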
In conclusion, similar to the MAP inference with a strictly block diagonal kernel, by using BwDPP-MAP, the MAP inference for an almost block diagonal kernel can be decomposed into a series of sub-inferences for the subkernels as well. We make four comments on this conclusion.
First, it should be noted that the above BwDPP-MAP method is an approximate optimization method, even if each sub-inference step is conducted exactly. This is because $\hat{L}_i$ depends on $\hat{Y}_1, \ldots, \hat{Y}_{i-1}$. We provide an empirical evaluation later, showing that through block-wise operation, the greedy search in BwDPP-MAP can achieve computation speedup with marginal sacrifice of accuracy.
Second, by the following Lemma 1, we show that each subkernel is positive semidefinite, so that it is theoretically guaranteed that we can conduct each sub-inference via existing DPP-MAP algorithms, e.g. the greedy DPP-MAP algorithm (Table 2) (?). One may find the proof of Lemma 1 in the appendix.
Lemma 1
$\hat{L}_i \succeq 0$, for $i = 1, \ldots, m$.
Third, in order to apply BwDPP-MAP, we need to first partition a given DPP kernel into the form of an almost block diagonal matrix as defined in (3). The partition is not unique. A trivial partition for an arbitrary DPP kernel is no partition, i.e., regarding the whole matrix as a single block. We leave the study of finding the optimal partition for future work. Here we provide a heuristic rule for partition, which is called $\epsilon$-partition and performs well in our experiments.
Definition 1
($\epsilon$-partition) An $\epsilon$-partition is defined by partitioning a DPP kernel into the almost block diagonal form as defined in (3) with the maximum number of blocks (i.e. the largest possible $m$)^4 (^4 Generally speaking, a partition of a kernel of size $N$ into $m$ subkernels will approximately reduce the computational complexity $m^2$ times. A larger $m$ implies larger computation reduction.), where for every off-diagonal matrix $C_i$, the nonzero area is only at the bottom left and its size does not exceed $\epsilon \times \epsilon$.
A heuristic way to obtain an $\epsilon$-partition for a kernel L is to first identify as many non-overlapping dense square submatrices along the main diagonal as possible. Next, two adjacent square submatrices on the main diagonal are merged if the size of the nonzero area in their corresponding off-diagonal submatrix exceeds $\epsilon \times \epsilon$.
It should be noted that a kernel could be subject to $\epsilon$-partition in one or more ways with different values of $\epsilon$. By taking $\epsilon$-partitions of a kernel with different values of $\epsilon$, we can obtain a balance between computation cost and optimization accuracy. A smaller $\epsilon$ implies a smaller achievable $m$ in the $\epsilon$-partition, and thus smaller computation reduction. On the other hand, a smaller $\epsilon$ means a smaller degree of interaction between adjacent sub-inferences, and thus better optimization accuracy.
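A simple version of the partition heuristic can be sketched as follows. The scan below is our own illustrative implementation, not the paper's exact procedure: it cuts the index range wherever all cross terms outside a bottom-left eps-by-eps corner of the cross submatrix are (numerically) zero:

```python
import numpy as np

def eps_partition(L, eps, tol=1e-12):
    """Greedy sketch of an eps-partition: cut after index k whenever all
    cross terms L[:k+1, k+1:] outside the bottom-left eps-by-eps corner
    vanish, yielding as many blocks as this scan can find."""
    n = L.shape[0]
    cuts = []
    for k in range(n - 1):
        top = L[:k + 1, k + 1:]
        # exempt the allowed bottom-left eps x eps corner, then test the rest
        mask = np.ones_like(top, dtype=bool)
        mask[max(0, top.shape[0] - eps):, :eps] = False
        if np.all(np.abs(top[mask]) < tol):
            cuts.append(k + 1)
    return cuts  # block boundaries; blocks are [0:c1), [c1:c2), ...
```

For a kernel made of two dense 3x3 blocks coupled by a single corner entry, the scan with eps = 1 recovers the boundary between the blocks.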
Fourth, an empirical illustration of BwDPP-MAP is given in Fig. 2, where the greedy MAP algorithm (Table 2) (?) is used for the sub-inferences in BwDPP-MAP. The synthetic kernel size is fixed. For each realization, the area of nonzero entries in the kernel is first specified by uniformly randomly choosing the sizes of the subkernels and the sizes of the nonzero areas in the off-diagonal submatrices. Next, a vector $\phi_i$ is generated for each item $i$ separately, following a standard normal distribution. Finally, for all nonzero entries $(i, j)$ specified in the previous step, the entry value is given by $L_{ij} = \phi_i^\top \phi_j$. Fig. 2 (a) provides an example of such a synthetic kernel.
We generate 1000 synthetic kernels as described above. For each synthetic kernel, we take $\epsilon$-partitions with different values of $\epsilon$, and then run BwDPP-MAP. The performance of directly applying the greedy MAP algorithm on the original unpartitioned kernel is used as the baseline. The results in Fig. 2 (b) show that BwDPP-MAP runs much faster than the baseline. With the increase of $\epsilon$, the runtime drops while the inference accuracy degrades within a tolerable range.
Connection between BwDPP-MAP and its Sub-inference Algorithm
Any DPP-MAP inference algorithm can be used in a plug-and-play fashion for the sub-inference procedure of BwDPP-MAP. It is natural to ask about the connection between BwDPP-MAP and its corresponding DPP-MAP algorithm. The relation is given by the following result.
Theorem 2
Let $\mathcal{A}$ be any DPP-MAP algorithm for BwDPP-MAP sub-inference, where $\mathcal{A}$ maps a positive semidefinite matrix to a subset of its indices, i.e. $\hat{Y} = \mathcal{A}(L)$. BwDPP-MAP (Table 1) is equivalent to applying the following steps successively to the almost block diagonal kernel $L$ as defined in (3):
$$\hat{Y}_1 = \mathcal{A}(L_{\mathcal{Z}_1}) \qquad (10)$$
and for ,
$$\hat{Y}_i = \mathcal{A}\big(L_{\mathcal{Z}_i}^{\hat{Z}_{i-1}}\big) \qquad (11)$$
where $\mathcal{Z}_1 = \mathcal{Y}_1$, $\mathcal{Z}_i = \hat{Z}_{i-1} \cup \mathcal{Y}_i$ with $\hat{Z}_{i-1} = \hat{Y}_1 \cup \cdots \cup \hat{Y}_{i-1}$, and the input of $\mathcal{A}$ in (11) is the conditional kernel^5 (^5 The conditional distribution (over sets $B \subseteq \bar{A}$, with $\bar{A} = \mathcal{Y} \setminus A$) of the DPP defined by $L$,
$$P(Y = A \cup B \mid A \subseteq Y) = \frac{\det(L^A_B)}{\det(L^A + I)} \qquad (12)$$
is also a DPP (?), and the corresponding kernel, $L^A = \big([(L + I_{\bar{A}})^{-1}]_{\bar{A}}\big)^{-1} - I$, is called the conditional kernel.).
The proof of Theorem 2 is in the appendix. Theorem 2 states that BwDPP-MAP is essentially a series of Bayesian belief updates, where in each update a conditional kernel is fed into $\mathcal{A}$ that contains the information of the previous selection results. The equivalent form allows us to compare BwDPP-MAP directly with the method of applying $\mathcal{A}$ on the entire kernel. The latter does inference on the entire set $\mathcal{Y}$ at one time, while the former does the inference on a sequence of smaller subsets $\mathcal{Y}_1, \ldots, \mathcal{Y}_m$. Concretely, in the $i$-th update, a subset $\mathcal{Y}_i$ is added to form the kernel $L_{\mathcal{Z}_i}$. Then the information of the previous selection results is incorporated into the kernel to generate the conditional kernel. Finally, the DPP-MAP inference is performed on the conditional kernel to select $\hat{Y}_i$ from $\mathcal{Y}_i$.
Table 2: The greedy DPP-MAP algorithm.
Input: $L$; Output: $\hat{Y}$.
Initialization: Set $\hat{Y} = \emptyset$, $U = \mathcal{Y}$;
While $U$ is not empty:
  $i^* = \arg\max_{i \in U} \det(L_{\hat{Y} \cup \{i\}})$;
  If $\det(L_{\hat{Y} \cup \{i^*\}}) \leq \det(L_{\hat{Y}})$: break;
  $\hat{Y} \leftarrow \hat{Y} \cup \{i^*\}$; $U \leftarrow U \setminus \{i^*\}$;
Return: $\hat{Y}$.
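The greedy algorithm can be sketched in numpy as follows (an illustrative implementation; the repeated determinant evaluation is kept naive for clarity rather than using rank-one updates):

```python
import numpy as np

def greedy_dpp_map(L):
    """Greedy DPP-MAP sketch: repeatedly add the item that yields the
    largest det(L_Y); stop as soon as no item increases the determinant."""
    n = L.shape[0]
    Y, U = [], set(range(n))
    best_det = 1.0  # det over the empty set
    while U:
        gains = {i: np.linalg.det(L[np.ix_(Y + [i], Y + [i])]) for i in U}
        i_star = max(gains, key=gains.get)
        if gains[i_star] <= best_det:
            break  # no remaining item increases the determinant
        Y.append(i_star)
        best_det = gains[i_star]
        U.remove(i_star)
    return sorted(Y)
```

On a small kernel with two near-duplicate items and one orthogonal item, the greedy pass keeps one of the duplicates and the orthogonal item.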
BwDPPbased ChangePoint Detection
Let $X = (x_1, x_2, \ldots, x_T)$ be the time-series observations, where $x_t \in \mathbb{R}^d$ represents the $d$-dimensional observation at time $t$, and let $X_{[a,b]}$ denote the segment of observations in the time interval $[a, b]$. We further use $X_1, X_2, \ldots$ to represent different segments of observations at different intervals, when explicitly denoting the beginning and ending times of the intervals is not necessary. The new CPD method will build on existing metrics. A dissimilarity metric is denoted as $d(X_1, X_2)$, which measures the dissimilarity between two arbitrary segments $X_1$ and $X_2$.
QualityDiversity Decomposition of Kernel
Given a set of items $\mathcal{Y} = \{1, \ldots, N\}$, the DPP kernel can be written as a Gram matrix $L = B^\top B$, where $B_i$, the columns of $B$, are vectors representing the items in $\mathcal{Y}$.
A popular decomposition of the kernel is to define $B_i = q_i \phi_i$, where $q_i \in \mathbb{R}^+$ measures the quality (magnitude) of item $i$ in $\mathcal{Y}$, and $\phi_i \in \mathbb{R}^D$, $\|\phi_i\| = 1$, can be viewed as the angle vector of diversity features, so that $S_{ij} = \phi_i^\top \phi_j$ measures the similarity between items $i$ and $j$. Therefore, $L$ is defined as
$$L = \mathrm{diag}(q) \cdot S \cdot \mathrm{diag}(q) \qquad (13)$$
where $q = [q_1, \ldots, q_N]^\top$ is the quality vector consisting of the $q_i$, and $S$ is the similarity matrix consisting of the $S_{ij}$. The quality-diversity decomposition allows us to construct $q$ and $S$ separately to address different concerns, which is utilized below to construct the kernel for CPD.
BwDppCpd
BwDppCpd is a two-step CPD method, described as follows.
Step 1: Based on a dissimilarity metric $d$, a preliminary set of change-point candidates is created. Consider moving a pair of adjacent windows, $X_{[t-w, t-1]}$ and $X_{[t, t+w-1]}$, along $X$, where $w$ is the size of the local windows. Then, a large value $d_t = d(X_{[t-w, t-1]}, X_{[t, t+w-1]})$ for the adjacent windows suggests that a change-point is likely to occur at time $t$. After we obtain the series of $d_t$ values, local peaks above the mean of the values are marked, and the corresponding locations, say $t_1 < t_2 < \cdots < t_N$, are selected to form the preliminary set of change-point candidates $\mathcal{Y} = \{t_1, \ldots, t_N\}$.
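Step 1 can be sketched as follows for a 1-D series, with the metric passed in as a callable (the function name and the simple strict-local-peak rule are our assumptions, kept minimal for illustration):

```python
import numpy as np

def candidate_changepoints(x, w, metric):
    """Slide adjacent windows of size w over the series x, score each time t
    by metric(past_window, future_window), then keep local peaks above the
    mean score as change-point candidates."""
    T = len(x)
    scores = np.zeros(T)
    for t in range(w, T - w + 1):
        scores[t] = metric(x[t - w:t], x[t:t + w])
    mean = scores[w:T - w + 1].mean()
    peaks = [t for t in range(w, T - w + 1)
             if scores[t] > mean
             and scores[t] >= scores[t - 1] and scores[t] >= scores[t + 1]]
    return peaks, scores
```

For example, on a step signal with a single mean shift, an absolute-mean-difference metric peaks exactly at the shift.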
Step 2: Treat the change-point candidates as BwDPP items, and select a subset from them to be our final estimate of the change-points.
The BwDPP kernel is built via the quality-diversity decomposition. We use the dissimilarity metric once more to measure the quality of a candidate change-point being a true one. Specifically, we define
$$q_i = d\big(X_{[t_{i-1}, t_i - 1]}, X_{[t_i, t_{i+1} - 1]}\big) \qquad (14)$$
The higher the value $q_i$ is, the sharper the contrast around the change-point candidate $t_i$, and the better the quality of $t_i$.
Next, the BwDPP similarity matrix is defined to address the fact that the true locations of change-points along the timeline tend to be diverse, since states would not change rapidly. This is done by assigning a high similarity score to items that are close to each other. Specifically, we define
$$S_{ij} = \exp\left(-\frac{(t_i - t_j)^2}{\rho^2}\right) \qquad (15)$$
where $\rho$ is a parameter representing the position diversity level. Finally, after taking an $\epsilon$-partition of the kernel into the almost block diagonal form, BwDPP-MAP is used to select a set of change-points that favours both quality and diversity (Fig. 3 (b)).
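A minimal construction of the CPD kernel might look as follows, assuming a Gaussian form for the position similarity (the exact decay form of (15) is an assumption here; `rho` sets the diversity level, and all names are illustrative):

```python
import numpy as np

def cpd_kernel(times, qualities, rho):
    """Build the CPD kernel via the quality-diversity decomposition (13):
    L = diag(q) S diag(q), with a position-based similarity that decays
    with the distance between candidate locations."""
    t = np.asarray(times, dtype=float)
    S = np.exp(-((t[:, None] - t[None, :]) / rho) ** 2)  # close candidates: similar
    q = np.asarray(qualities, dtype=float)
    return q[:, None] * S * q[None, :]
```

Candidates far apart along the timeline get near-zero cross terms, which is exactly what makes the resulting kernel almost block diagonal and amenable to the partition step.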
Discussion
There is a rich body of studies of metrics for the CPD problem. The choice of the dissimilarity metric is flexible and could be well tailored to the characteristics of the data. We present two examples that are used in our experiments.

Symmetric Kullback-Leibler Divergence (SymKL):
If the two segments $X_1, X_2$ to be compared are assumed to follow Gaussian distributions, the SymKL metric is given by:
$$d_{\mathrm{SymKL}}(X_1, X_2) = \frac{1}{2}\mathrm{tr}\big(\Sigma_1 \Sigma_2^{-1} + \Sigma_2 \Sigma_1^{-1} - 2I\big) + \frac{1}{2}(\mu_1 - \mu_2)^\top \big(\Sigma_1^{-1} + \Sigma_2^{-1}\big)(\mu_1 - \mu_2) \qquad (16)$$
where $\mu_i$ and $\Sigma_i$ ($i = 1, 2$) are the corresponding sample means and covariances.
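The SymKL metric can be computed directly from sample means and covariances; in the sketch below a small regularizer (our addition, not part of the formula) keeps the covariances invertible:

```python
import numpy as np

def sym_kl(X1, X2, eps=1e-6):
    """Symmetric KL divergence between Gaussians fitted to two segments;
    rows of X1 and X2 are observations, eps regularizes the covariances."""
    mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
    d = X1.shape[1]
    S1 = np.cov(X1, rowvar=False) + eps * np.eye(d)
    S2 = np.cov(X2, rowvar=False) + eps * np.eye(d)
    P1, P2 = np.linalg.inv(S1), np.linalg.inv(S2)
    dm = mu1 - mu2
    # trace term compares the covariances; quadratic term compares the means
    return 0.5 * (np.trace(S1 @ P2 + S2 @ P1) - 2 * d) + 0.5 * dm @ (P1 + P2) @ dm
```

The metric is zero for identical segments and grows with a mean shift, which is the behaviour the sliding-window scan relies on.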

Generalized Likelihood Ratio (GLR):
Generally, the GLR metric is given by the likelihood ratio:
$$d_{\mathrm{GLR}}(X_1, X_2) = \log\frac{L(X_1; M_1)\, L(X_2; M_2)}{L(X_0; M_0)} \qquad (17)$$
The numerator is the likelihood that the two segments follow two different models $M_1$ and $M_2$ respectively, while the denominator is the likelihood that the two segments together (denoted as $X_0$) follow a single model $M_0$. In practice, we plug in the maximum likelihood estimates (MLE) of the parameters of $M_1$, $M_2$, and $M_0$. E.g. if we assume that a time-series segment $X = (s_1, \ldots, s_n)$ follows a homogeneous Poisson process with rate $\lambda$, where $s_k$ is the occurring time of the $k$-th event, then the log-likelihood of $X$ is
$$\log L(X; \hat{\lambda}) = n \log \hat{\lambda} - \hat{\lambda} s_n \qquad (18)$$
where the MLE of $\lambda$ is used, $\hat{\lambda} = n / s_n$.
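Under the homogeneous Poisson assumption, the log GLR reduces to a few rate estimates. A sketch (the segment-splitting interface is our assumption; each segment is assumed to contain at least one event):

```python
import numpy as np

def poisson_loglik(n, duration):
    """Log-likelihood of n events from a homogeneous Poisson process over
    `duration`, with the MLE rate n / duration plugged in."""
    lam = n / duration
    return n * np.log(lam) - lam * duration

def poisson_glr(event_times, split, total):
    """Log GLR for a single change at `split`, for events observed in
    [0, total]: separate rates before/after the split vs one pooled rate."""
    t = np.asarray(event_times)
    n1 = int((t < split).sum())
    n2 = len(t) - n1
    return (poisson_loglik(n1, split) + poisson_loglik(n2, total - split)
            - poisson_loglik(len(t), total))
```

The GLR is zero when the rate is the same on both sides of the split and grows when the event rate actually changes there.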
Experiments
The BwDppCpd method is evaluated on five real-world time-series datasets. Firstly, three classic datasets are examined for CPD, namely the Well-Log data, the Coal Mine Disaster data, and the Dow Jones Industrial Average Return (DJIA) data, where a small $\epsilon$ is used due to the small data size.
Next, we experiment with human activity detection and speech segmentation, where the data size becomes larger and there is no accurate model to characterize the data, making the CPD task harder. In both experiments, the number of DPP items varies from hundreds to thousands, where, except for BwDPP-MAP, no other algorithm can perform MAP inference within a reasonable cost of time due to the large kernel scale. We set $\epsilon$ separately for human activity detection and for speech segmentation to provide a comparison.
As for the dissimilarity metric $d$, Poisson processes and GLR are used for the Coal Mine Disaster data, and for the other experiments, Gaussian models and SymKL are used.
Well-Log Data
The Well-Log data contains 4050 measurements of nuclear magnetic response taken during the drilling of a well. It is an example of varying Gaussian mean, where the changes reflect the stratification of the earth's crust (?). Outliers are removed prior to the experiment. As shown in Fig. 4 (a), all changes are detected by BwDppCpd.
Coal Mine Disaster Data
Coal Mine Disaster (?), a standard dataset for testing CPD methods, consists of 191 accidents from 1851 to 1962. The occurring rates of accidents are believed to have changed a few times, and the task is to detect them. The BwDppCpd detection result, as shown in Fig. 4 (b), agrees with that in (?).
1972-75 Dow Jones Industrial Average Return
DJIA contains daily return rates of the Dow Jones Industrial Average from 1972 to 1975. It is an example of varying Gaussian variance, where the changes are caused by big events that have potential macroeconomic effects. Four changes in the data are detected by BwDppCpd, which match well with important events (Fig. 4 (c)). Compared to (?), one more change is detected (the rightmost), which corresponds to the date that the 1973-74 stock market crash ended^6 (^6 http://en.wikipedia.org/wiki/197374_stock_market_crash). This shows that BwDppCpd discovers more information from the data.
Table 3: CPD results on the HASC data.
            PRC     RCL     F
BwDppCpd    93.05   87.88   0.9039
RuLSIF      86.36   83.84   0.8508
Human Activity Detection
HASC^7 (^7 http://hasc.jp/hc2011/) contains human activity data collected by portable three-axis accelerometers, and the task is to segment the data according to human behaviour changes. Fig. 3 (b) shows an example of the HASC data. The performance of the best algorithm in (?), RuLSIF, is used for comparison, and the precision (PRC), recall (RCL), and $F$ measure (?) are used for evaluation:
$$\mathrm{PRC} = \frac{n_{\mathrm{cr}}}{n_{\mathrm{dt}}}, \qquad \mathrm{RCL} = \frac{n_{\mathrm{cr}}}{n_{\mathrm{gt}}} \qquad (19)$$
$$F = \frac{2 \cdot \mathrm{PRC} \cdot \mathrm{RCL}}{\mathrm{PRC} + \mathrm{RCL}} \qquad (20)$$
where $n_{\mathrm{cr}}$ is the number of correctly found changes, $n_{\mathrm{dt}}$ is the number of detected changes, and $n_{\mathrm{gt}}$ is the number of ground-truth changes. The $F$ score could be viewed as an overall score that balances PRC and RCL. The CPD result is shown in Table 3, where the parameters are set to attain the best results for both algorithms.
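Computing these scores requires matching detected changes to ground-truth changes; a common convention (assumed here, with a tolerance window and greedy one-to-one matching, since the paper does not spell out its matching rule) is:

```python
def cpd_scores(detected, truth, tol):
    """PRC/RCL/F sketch: a detected change counts as correct if it lies
    within `tol` of a not-yet-matched ground-truth change."""
    unmatched = list(truth)
    n_cr = 0
    for t in sorted(detected):
        hit = next((g for g in unmatched if abs(g - t) <= tol), None)
        if hit is not None:
            unmatched.remove(hit)  # one-to-one matching
            n_cr += 1
    prc = n_cr / len(detected) if detected else 0.0
    rcl = n_cr / len(truth) if truth else 0.0
    f = 2 * prc * rcl / (prc + rcl) if prc + rcl > 0 else 0.0
    return prc, rcl, f
```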
The receiver operating characteristic (ROC) curve is often used to evaluate performance under different levels of the true positive rate (TPR) and false positive rate (FPR). For BwDppCpd, different levels of TPR and FPR are obtained by tuning the position diversity parameter $\rho$, and for RuLSIF by tuning the threshold (?).
Speech Segmentation
We tested two datasets for speech segmentation. The first dataset, called Hub4m97, is a subset (around 5 hours) of the 1997 Mandarin Broadcast News Speech (HUB4NE) released by LDC^8 (^8 http://catalog.ldc.upenn.edu/LDC98S73). The second dataset, called TelRecord, consists of 216 telephone conversations, each around 2 min long, collected from real-world call centres. Acoustic features of 12-order MFCCs (mel-frequency cepstral coefficients) are extracted as the time-series data.
Speech segmentation is to segment the audio data into acoustically homogeneous segments, e.g. utterances from a single speaker or non-speech portions. The two datasets contain utterances with hesitations and a variety of changing background noises, presenting a great challenge for CPD.
The BwDppCpd method with different $\epsilon$ for kernel partition (denoted as Bw$_\epsilon$ in Table 4) is tested, and two classic segmentation methods, BIC (?) and DISTBIC (?), are used for comparison. As in (Delacourt and Wellekens 2000), a post-processing step based on BIC values is also taken to reduce the false alarms for BwDppCpd.
The experiment results in Table 4 show that BwDppCpd outperforms BIC and DISTBIC on both datasets. In addition, comparing the results obtained with the two values of $\epsilon$, using the larger $\epsilon$ is found to be faster but has a slightly worse performance. This agrees with our analysis of BwDPP-MAP using different $\epsilon$-partitions to trade off speed and accuracy.
Table 4: Speech segmentation results.
            BIC      DistBIC   Bw$_{\epsilon_1}$   Bw$_{\epsilon_2}$
Hub4m97
  PRC       59.40    64.29     65.29     65.12
  RCL       78.24    74.98     78.49     78.39
  F         0.6753   0.6922    0.7128    0.7114
TelRecord
  PRC       54.05    61.39     66.54     66.47
  RCL       79.97    81.72     85.47     84.83
  F         0.6451   0.7011    0.7483    0.7454
Conclusion
In this paper, we introduced BwDPPs, a class of DPPs where the kernel is almost block diagonal, which thus allows efficient block-wise MAP inference. Moreover, BwDPPs are demonstrated to be useful in the change-point detection problem. The BwDPP-based change-point detection method, BwDppCpd, shows superior performance in experiments on several real-world datasets.
The almost block diagonal kernels suit the change-point detection problem well, but BwDPPs may achieve more than that. Theoretically, BwDPP-MAP could be applied to any block tridiagonal matrix without modification. Theoretical issues regarding the exact or approximate partition of a DPP kernel into the form of an almost block diagonal matrix remain to be studied (?). Other potential BwDPP applications are also worth further exploration.
Appendix: Proof of Lemma 1
Proof
Define
$$\tilde{L}_1 = L, \qquad \tilde{L}_i = G_{i-1} \,/\, (\hat{L}_{i-1})_{\hat{Y}_{i-1}}, \quad i = 2, \ldots, m, \qquad (21)$$
where $G_{i-1} = (\tilde{L}_{i-1})_{\hat{Y}_{i-1} \cup \mathcal{Y}_i \cup \cdots \cup \mathcal{Y}_m}$ and $/$ denotes the Schur complement.
For $i \geq 2$, $\tilde{L}_i$ is the Schur complement of $(\hat{L}_{i-1})_{\hat{Y}_{i-1}}$ in $G_{i-1}$, the submatrix of $\tilde{L}_{i-1}$, and $\hat{L}_i$ is the leading block of $\tilde{L}_i$. We next prove the lemma using the first principle of mathematical induction. State the predicate $P(i)$ as:

$P(i)$: $\hat{L}_i$ and $\tilde{L}_i$ are positive semidefinite (PSD).
$P(1)$ trivially holds, as $\hat{L}_1 = L_1$ and $\tilde{L}_1 = L$ are PSD.
Assume $P(i-1)$ holds. $(\hat{L}_{i-1})_{\hat{Y}_{i-1}}$ is PSD because $\hat{L}_{i-1}$ is PSD. Since $\det\big((\hat{L}_{i-1})_{\hat{Y}_{i-1}}\big) > 0$ (footnote 2) and $\tilde{L}_i$ is the Schur complement of $(\hat{L}_{i-1})_{\hat{Y}_{i-1}}$ in the PSD matrix $G_{i-1}$, $\tilde{L}_i$ is PSD. Being a submatrix of $\tilde{L}_i$, $\hat{L}_i$ is also PSD. Hence, $P(i)$ holds.
Therefore, for $i = 1, \ldots, m$, $\hat{L}_i$ is PSD.
Appendix: Proof of Theorem 2
For preparation, we first quote a result from (?): the conditional kernel is given by
$$L^A = \big([(L + I_{\bar{A}})^{-1}]_{\bar{A}}\big)^{-1} - I \qquad (22)$$
where $\bar{A} = \mathcal{Y} \setminus A$ and $I_{\bar{A}}$ is the diagonal matrix with ones on the entries indexed by $\bar{A}$ and zeros elsewhere; equivalently, $L^A = L_{\bar{A}} - L_{\bar{A},A}(L_A)^{-1} L_{A,\bar{A}}$.
Next, we use the following lemma:
Lemma 2
$L_{\mathcal{Z}_i}^{\hat{Z}_{i-1}} = \hat{L}_i$, for $i = 2, \ldots, m$, where $\hat{L}_i$ is defined by (9).
Proof
The proof is given by mathematical induction. When $i = 2$, the result trivially holds:
$$L_{\mathcal{Z}_2}^{\hat{Z}_1} = L_2 - (C_1)_{\hat{Y}_1,:}^\top \big((L_1)_{\hat{Y}_1}\big)^{-1} (C_1)_{\hat{Y}_1,:} = \hat{L}_2 \qquad (23)$$
Assume the result holds for $i - 1$, i.e.,
$$L_{\mathcal{Z}_{i-1}}^{\hat{Z}_{i-2}} = \hat{L}_{i-1} \qquad (24)$$
Consider the case of $i$. One has
$$L_{\mathcal{Z}_i}^{\hat{Z}_{i-1}} = L_i - L_{\mathcal{Y}_i, \hat{Z}_{i-1}}\big(L_{\hat{Z}_{i-1}}\big)^{-1} L_{\hat{Z}_{i-1}, \mathcal{Y}_i} = L_i - (C_{i-1})_{\hat{Y}_{i-1},:}^\top \big((\hat{L}_{i-1})_{\hat{Y}_{i-1}}\big)^{-1} (C_{i-1})_{\hat{Y}_{i-1},:} = \hat{L}_i \qquad (25)$$
where the second equality uses the quotient property of Schur complements, the induction hypothesis (24), and the fact that $L_{\mathcal{Y}_j, \mathcal{Y}_i} = \mathbf{0}$ for $j \leq i - 2$. Therefore, the result holds for $i$.
References
 [Acer, Kayaaslan, and Aykanat 2013] Acer, S.; Kayaaslan, E.; and Aykanat, C. 2013. A recursive bipartitioning algorithm for permuting sparse square matrices into block diagonal form with overlap. SIAM Journal on Scientific Computing 35(1):C99–C121.
 [Adams and MacKay 2007] Adams, R. P., and MacKay, D. J. 2007. Bayesian online changepoint detection. arXiv preprint arXiv:0710.3742.
 [Basseville, Nikiforov, and others 1993] Basseville, M.; Nikiforov, I. V.; et al. 1993. Detection of abrupt changes: theory and application, volume 104. Prentice Hall Englewood Cliffs.
 [Buchbinder et al. 2012] Buchbinder, N.; Feldman, M.; Naor, J.; and Schwartz, R. 2012. A tight linear time (1/2)approximation for unconstrained submodular maximization. In Foundations of Computer Science (FOCS), 2012 IEEE 53rd Annual Symposium on, 649–658. IEEE.
 [Chen and Gopalakrishnan 1998] Chen, S., and Gopalakrishnan, P. 1998. Speaker, environment and channel change detection and clustering via the bayesian information criterion. In Proc. DARPA Broadcast News Transcription and Understanding Workshop, 8. Virginia, USA.
 [Delacourt and Wellekens 2000] Delacourt, P., and Wellekens, C. J. 2000. Distbic: A speakerbased segmentation for audio data indexing. Speech communication 32(1):111–126.
 [Desobry, Davy, and Doncarli 2005] Desobry, F.; Davy, M.; and Doncarli, C. 2005. An online kernel change detection algorithm. Signal Processing, IEEE Transactions on 53(8):2961–2974.
 [Gillenwater, Kulesza, and Taskar 2012] Gillenwater, J.; Kulesza, A.; and Taskar, B. 2012. Nearoptimal map inference for determinantal point processes. In Advances in Neural Information Processing Systems, 2735–2743.
 [Green 1995] Green, P. J. 1995. Reversible jump markov chain monte carlo computation and bayesian model determination. Biometrika 82(4):711–732.
 [Gustafsson and Gustafsson 2000] Gustafsson, F., and Gustafsson, F. 2000. Adaptive filtering and change detection, volume 1. Wiley New York.
 [Gustafsson 1996] Gustafsson, F. 1996. The marginalized likelihood ratio test for detecting abrupt changes. Automatic Control, IEEE Transactions on 41(1):66–78.
 [Idé and Tsuda 2007] Idé, T., and Tsuda, K. 2007. Changepoint detection using krylov subspace learning. In SDM, 515–520. SIAM.
 [Jarrett 1979] Jarrett, R. 1979. A note on the intervals between coalmining disasters. Biometrika 66(1):191–193.
 [Kanamori, Suzuki, and Sugiyama 2010] Kanamori, T.; Suzuki, T.; and Sugiyama, M. 2010. Theoretical analysis of density ratio estimation. IEICE transactions on fundamentals of electronics, communications and computer sciences 93(4):787–798.
 [Kawahara and Sugiyama 2012] Kawahara, Y., and Sugiyama, M. 2012. Sequential changepoint detection based on direct densityratio estimation. Statistical Analysis and Data Mining 5(2):114–127.
 [Kawahara, Yairi, and Machida 2007] Kawahara, Y.; Yairi, T.; and Machida, K. 2007. Changepoint detection in timeseries data based on subspace identification. In Data Mining, 2007. ICDM 2007. Seventh IEEE International Conference on, 559–564. IEEE.
 [Ko, Lee, and Queyranne 1995] Ko, C.W.; Lee, J.; and Queyranne, M. 1995. An exact algorithm for maximum entropy sampling. Operations Research 43(4):684–691.
 [Kotti, Moschou, and Kotropoulos 2008] Kotti, M.; Moschou, V.; and Kotropoulos, C. 2008. Speaker segmentation and clustering. Signal processing 88(5):1091–1124.
 [Kulesza and Taskar 2012] Kulesza, A., and Taskar, B. 2012. Determinantal point processes for machine learning. arXiv preprint arXiv:1207.6083.
 [Lavielle and Lebarbier 2001] Lavielle, M., and Lebarbier, E. 2001. An application of mcmc methods for the multiple changepoints problem. Signal Processing 81(1):39–53.
 [Liu et al. 2013] Liu, S.; Yamada, M.; Collier, N.; and Sugiyama, M. 2013. Changepoint detection in timeseries data by relative densityratio estimation. Neural Networks 43:72–83.
 [Nemhauser, Wolsey, and Fisher 1978] Nemhauser, G. L.; Wolsey, L. A.; and Fisher, M. L. 1978. An analysis of approximations for maximizing submodular set functions - I. Mathematical Programming 14(1):265–294.