Identification of chromosomal translocation hotspots via scan statistics
The detection of genomic regions unusually rich in a given pattern
is an important undertaking in the analysis of next generation
sequencing data. Recent studies of chromosomal translocations in
activated B lymphocytes have identified regions that are frequently
translocated to the c-myc oncogene. A quantitative method for the
identification of translocation hotspots was crucial to this
study. Here we improve this analysis by using a simple probabilistic
model and the framework provided by scan statistics to define the
number and location of translocation breakpoint hotspots. A key
feature of our method is that it provides a global chromosome-wide
significance level to clustering, as opposed to previous methods
based on local criteria. Whilst being motivated by a specific
application, the detection of unusual clusters is a widespread
problem in bioinformatics. We expect our method to be useful in the
analysis of data from other experimental approaches such as
ChIP-seq and 4C-seq.
Results: The analysis of translocations from B lymphocytes with the method described here reveals the presence of longer hotspots than those defined previously. Further, we show that the hotspot size changes substantially in the absence of the DNA repair protein 53BP1. When 53BP1 deficiency is combined with overexpression of activation induced cytidine deaminase (AID), the hotspot length increases even further. These changes are not detected by previous methods that use local significance criteria for clustering. Our method is also able to identify several exclusive translocation hotspots located in genes of known tumor suppressors.
Availability: The detection of translocation hotspots is done with hot_scan, a program implemented in R and Perl. Source code and documentation are freely available for download at https://github.com/itojal/hot_scan.
Key words and phrases: genome translocations, deep sequencing, scan statistics
2010 Mathematics Subject Classification: Primary: 92D20, Secondary: 62P10, 62M30
Israel T. Silva (to whom correspondence should be addressed), Rafael A. Rosales, Adriano J. Holanda,
Michel C. Nussenzweig and Mila Jankovic
This is an un-refereed author version of an article submitted for publication in Bioinformatics.
Laboratory of Molecular Immunology, The Rockefeller University
1230 York Avenue, New York, NY 10065
Departamento de Computação e Matemática, Universidade de São Paulo
Av. Bandeirantes, 3900, Ribeirão Preto, CEP 14049-901, SP, Brazil
National Institute of Science and Technology in Stem Cell and Cell Therapy
and Center for Cell Based Therapy
Rua Catão Roxo, 2501, Ribeirão Preto, CEP 14051-140, SP, Brazil
The identification of genomic regions that are unusually rich in a given pattern is a recurring problem in bioinformatics, and is particularly common in the analysis of data generated by deep sequencing. An example is the detection of regions with an unusually high clustering of chromosomal translocation breakpoints ([18, 19, 20]). Recurrent chromosomal translocations are associated with hematopoietic malignancies such as leukemia and lymphoma and with some sarcomas and carcinomas ([23, 24, 30, 33, 22]). There is growing evidence that translocations are not random. Among the basic determinants of these events are the existence of chromosome territories, active transcription and, most prominently, targeted DNA damage ([20, 9, 17]). DNA double strand breaks (DSBs) are necessary intermediates in chromosome rearrangements. They occur in the cell during normal metabolic processes, and can also be induced by genotoxic agents or during physiological DNA recombination in lymphocytes. The majority of human lymphomas are of mature B cell origin and many of them carry balanced chromosomal translocations that involve immunoglobulin genes (). This susceptibility is most likely dependent on activation-induced cytidine deaminase (AID), the B lymphocyte specific enzyme that initiates class switch recombination (CSR) and somatic hypermutation (SHM), two processes necessary for antibody diversification ([36, 34]). AID initiates SHM and CSR by deaminating cytosines in immunoglobulin genes during stalled transcription ([8, 41, 31]). Several DNA repair pathways process the resulting U:G mismatch to introduce mutations or produce targeted DSBs ([11, 40]). Besides being targeted to immunoglobulin genes, AID targets a large number of non-immunoglobulin genes ([25, 31, 46]). AID induced DSBs are recognised by DNA damage response (DDR) proteins and repaired by non-homologous end joining (NHEJ), a process that can fail and result in chromosomal translocations ([30, 47, 16]).
Libraries of AID dependent translocations from primary B cells revealed many discrete sites throughout the genome that are targeted by AID. Some of these targets are known translocation partners identified in mature B cell lymphomas [20, 9]. Mutations in the components of DNA repair pathways that process AID induced breaks can lead to defective CSR, and the most severe defect is documented in 53BP1 deficient B cells. 53BP1 is a DNA repair protein that regulates DSB processing and is required for genomic stability. It does so by facilitating distal DSB joining and by protecting DNA ends from resection ([13, 6, 5]). The landscape of AID induced translocations in 53BP1 deficient B cells is different from the one found in wild type cells. Deep sequencing of translocation capture libraries from primary 53BP1 deficient B lymphocytes has shown that the profile of translocation hotspots changes, most likely due to increased DNA end resection ().
A quantitative method to determine the clustering of translocations was essential for the analysis of chromosomal rearrangements in  and . Translocation hotspots were determined by using a technique similar to that used to define the coordinates of enriched protein binding regions in ChIP-seq experiments. A translocation cluster is defined by concatenating closely spaced adjacent breakpoints, and its significance is then determined by using a test based on the negative binomial distribution. This method assumes that the observed breakpoints are a realisation of a Bernoulli process. Taking advantage of this model, we consider here a different approach for the detection of hotspots based on the use of scan statistics ([15, 1]). The scan statistic is particularly suited to the current setting because it provides a genome-wide significance level to breakpoint clustering. Using our method we are able to show that translocation hotspots induced by AID in activated B lymphocytes are longer than those previously identified by the local method. Furthermore, our method shows that long hotspots are more frequent in the absence of 53BP1. The frequency of the long hotspots is further increased if AID is overexpressed in 53BP1 deficient B cells. We also discover a set of hotspots found exclusively by the scan statistic and discuss the potential biological relevance of our findings.
Several methods have been developed to detect clustering of events when the observations arise from a spatial or temporal point process. This section describes the application of the scan statistic for the detection of genomic regions with a particularly high density of translocations. We also describe explicitly a technique previously used by , and , which attributes a local significance level to a hotspot. A major difference between these methods is that the scan statistic provides a global chromosome-wide significance level to clustering.
2.1. Scan statistic approximations
We model the occurrence of translocation breakpoints in a chromosome of length as realisation of an independent and identically distributed sequence of 0-1 Bernoulli random variables, with and , for . The event occurs if there is a translocation breakpoint at the -th base. We refer hereafter to this model as the global chromosome null hypothesis, . Let be a positive integer and then let
be the running number of successes in a window of width . The scan statistic is defined as the maximum number of successes within any of the consecutive windows,
The significance of a cluster of translocation events in a window of width can be assessed by the probability of the tail event . Small probabilities for this event indicate departures from the Bernoulli model consistent with and could therefore be used to detect hotspots. A considerable effort has been made to derive the distribution of under . Even in this simple case, its form has remained elusive. Several approximations and asymptotics for the distribution of the scan statistic have been derived under the Bernoulli model, particularly when the number of observed events in trials, , is known. Following [28, 15], the conditional probability may be approximated by the function
with as the hypergeometric distribution,
Although the expression in (1) already provides a method to quantify the significance of a cluster, the following asymptotic version avoids the use of the hypergeometric distribution and allows for an efficient implementation. For sufficiently large and , the function in (1) may be approximated by
where denotes the Binomial distribution for trials and success probability . This approximation is ensured by weak convergence of the hypergeometric distribution towards the binomial law for large populations and becomes very accurate in the current application where and . Furthermore, for large values of , namely for as is the case for most chromosomes in the data sets considered here, the summation in (2) may be evaluated as
with as Gauss hypergeometric function, that is, for and , ,
and as the th Pochhammer symbol of , i. e. . Note that the second argument of is always negative or zero because . The series defining is thus finite. With this simplification the desired -value
We observe that (2) is the approximation for the probability of described by  in the well known continuous case, namely when points are drawn uniformly from and is the largest number of points to be found in any subinterval of of length . Despite the existence of several other approximations for the probability of ,  observes that this approximation is quite precise when the right side of (2) is less than or equal to 0.01 and recommends its use in this regime.
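The computations of this section can be illustrated with a minimal Python sketch. It is not the hot_scan implementation. It assumes that approximation (2) has the continuous-case form known from the scan statistics literature, that is, twice the upper binomial tail plus (k/p - N - 1) times the binomial probability of k, with p = w/L. The names k (observed maximum count), n (total number of breakpoints), w (window width) and L (chromosome length) are our own labels, since the notation of the text is not reproduced here. The binomial tail is evaluated with the finite Gauss hypergeometric series mentioned above.

```python
import math

def scan_statistic(positions, w):
    """Largest number of breakpoints in any window of w consecutive bases.

    Assumes w is no larger than the chromosome length.
    """
    xs = sorted(positions)
    best, lo = 0, 0
    for hi, x in enumerate(xs):
        # Drop points that fall outside the window [x - w + 1, x].
        while xs[lo] < x - w + 1:
            lo += 1
        best = max(best, hi - lo + 1)
    return best

def log_binom_pmf(k, n, p):
    """Logarithm of the Binomial(n, p) probability of k successes."""
    return (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
            + k * math.log(p) + (n - k) * math.log1p(-p))

def binom_tail(k, n, p):
    """P(B >= k) for B ~ Binomial(n, p), written as
    b(k; n, p) * 2F1(1, k - n; k + 1; -p / (1 - p)).
    The series is finite because k - n is a non-positive integer.
    """
    r = p / (1.0 - p)
    term, total = 1.0, 1.0
    for j in range(n - k):
        term *= (n - k - j) / (k + 1 + j) * r
        total += term
    return math.exp(log_binom_pmf(k, n, p)) * total

def scan_pvalue(k, n, w, L):
    """Approximate P(scan statistic >= k | n events), assumed form of (2)."""
    if k <= 0:
        return 1.0
    if k > n:
        return 0.0
    p = w / L
    b_k = math.exp(log_binom_pmf(k, n, p))
    approx = 2.0 * binom_tail(k, n, p) + (k / p - n - 1) * b_k
    # The approximation is not a true probability, so clip it to [0, 1].
    return min(1.0, max(0.0, approx))
```

A call such as `scan_pvalue(scan_statistic(positions, w), len(positions), w, L)` then gives an approximate chromosome-wide significance for the densest window. In line with the remark above, the value is most trustworthy when it is small.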
The detection of chromosomal translocation breakpoints via scan statistics has also been considered by several authors in the analysis of leukemias, see [4, 7, 18, 35, 45]. These analyses are based on the method described by , by following a large deviation approximation for the probability of described in . Although derived using rather different arguments,  observe that this approximation and the one in (1) produce similar results.
The method outlined in Section 2.1 provides a significance test for the existence of hotspots, but their actual number and location remain to be determined. Here we describe a method to infer the coordinates of these events.
A chromosome-wide scan with a window of width leads to the consideration of the following sequence of local null hypotheses. For ,
Let be the observed number of translocation events in the -th window, . The hypothesis is rejected at a prescribed level if , with computed according to (3). This procedure partitions the chromosome into two regions
The connected components of are prospective hotspot candidates formed by one or more scanning windows of width . To account for the bias involved in the simultaneous rejection of the multiple hypotheses in a given component of , we adjusted the corresponding -values by using the Benjamini-Yekutieli correction, . This correction accounts for the possibility of having a positive dependence structure among the considered set of hypotheses from overlapping scan windows. A similar control of the false discovery rate associated with the large number of tests produced by the scan statistic has previously been considered while scanning for clusters in random fields by . Denote by the adjusted -value for the th window, so that for the level , the corrected -values define the set
The inclusion follows because . Depending on the value of , each element of may end on a translocation event or not. In the latter case, the extra bases starting after the last translocation event are deleted. Let be the remaining connected regions in after trimming. The set is finally the group of hotspots. The significance of each element in is computed by observing its length, say , via (3) by taking .
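As an illustration of the multiple-testing adjustment used above, the following is a minimal sketch of the standard Benjamini-Yekutieli step-up adjustment applied to a list of window p-values. The function name and the choice of which family of windows is passed in are assumptions of this sketch and not a description of the hot_scan code.

```python
def benjamini_yekutieli(pvalues):
    """Benjamini-Yekutieli adjusted p-values, returned in the input order.

    With m tests and c(m) = 1 + 1/2 + ... + 1/m, the adjusted value of the
    i-th smallest p-value is min over j >= i of p_(j) * m * c(m) / j,
    capped at 1.
    """
    m = len(pvalues)
    if m == 0:
        return []
    c_m = sum(1.0 / i for i in range(1, m + 1))
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so that monotonicity is enforced.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m * c_m / rank)
        adjusted[idx] = running_min
    return adjusted
```

Every adjusted value is at least as large as the raw p-value it replaces, so a window that survives the adjustment at a given level was also rejected before it. This is the reason the corrected set is contained in the uncorrected one.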
The hotspots here were defined by using . The method based on the scan statistic with window width (in base pairs) is denoted SS hereafter. Although we considered several widths, we only present results for the cases SS and SS.
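The partition and trimming steps of this section can be sketched as follows. This is a hedged illustration and not the hot_scan source. It takes the start coordinates of the windows whose adjusted p-values fall below the chosen level, merges overlapping or directly adjacent windows into connected regions, and removes the bases that follow the last breakpoint of each region. Treating adjacent windows as connected is an assumption of this sketch, as are the function names.

```python
import bisect

def merge_windows(starts, w):
    """Connected regions formed by the windows [t, t + w - 1], t in starts."""
    regions = []
    for t in sorted(starts):
        end = t + w - 1
        if regions and t <= regions[-1][1] + 1:
            # Overlapping or adjacent window: extend the current region.
            regions[-1][1] = max(regions[-1][1], end)
        else:
            regions.append([t, end])
    return [tuple(r) for r in regions]

def trim_right(regions, breakpoints):
    """Cut each region so that it ends on its last breakpoint."""
    xs = sorted(breakpoints)
    trimmed = []
    for start, end in regions:
        i = bisect.bisect_right(xs, end)  # number of breakpoints <= end
        if i > 0 and xs[i - 1] >= start:
            trimmed.append((start, xs[i - 1]))
    return trimmed
```

For example, `trim_right(merge_windows(rejected_starts, w), breakpoints)` returns the candidate hotspot coordinates, where `rejected_starts` is a hypothetical list of window starts that survived the correction. The significance of each returned region would then be recomputed from its length, as described above.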
2.3. A local approach to hotspot detection
The probabilistic model for the occurrence of translocations described in section 2.1 is implicit in previous work by  and  while analysing hotspots. The data consisting of the genome translocation breakpoints is represented as a Bernoulli process with success probability , estimated as with as the genome length and as the total number of observed translocation events. Suppose , , are the coordinates of the translocations and let for be the number of bases between and inclusive. The random variable records therefore the length until the next translocation starting at , namely . The independence of the underlying Bernoulli process implies that is a geometric random variable with parameter , that is , . Small values for
may thus be used to detect unusual short distances between successive translocation events. In this sense, a hotspot can be defined by concatenating adjacent segments for which , where is a given significance level specified in advance. Suppose that a given sequence of adjacent segments of widths is identified as a hotspot. Let
so that the significance of this hotspot may be quantified by the -value
This probability is directly available because is a negative binomial random variable with parameters and , that is
This method was used by  and 
to define a set of potential hotspots by taking . Any
candidate of this set is then identified as a hotspot if:
(i) it has more than 3 translocation breakpoints,
(ii) it has at least one read from each of the two sides of the bait, and
(iii) at least 10% of the translocations come from each side of the bait.
Hereafter, we refer to this procedure as the local method and denote it by NB. We describe results obtained with NB and NB.
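The core of the local method can be sketched in a few lines of Python. This is an illustration written for this section and not the script used in the cited studies. It assumes that the gap between consecutive breakpoints is measured as the difference of their coordinates, that the geometric p-value of a gap d is 1 - (1 - p)^d, and that the hotspot p-value is the lower tail of the negative binomial distribution of the total length. The additional filters (i) to (iii) are not implemented, and all function names are hypothetical.

```python
import math

def geometric_cdf(d, p):
    """P(D <= d) = 1 - (1 - p)^d for a geometric variable on 1, 2, ..."""
    return -math.expm1(d * math.log1p(-p))

def negbin_cdf(y, m, p):
    """P(Y <= y), where Y is the sum of m independent geometric(p) gaps."""
    total = 0.0
    for j in range(m, y + 1):
        # log of C(j - 1, m - 1) * p^m * (1 - p)^(j - m)
        log_term = (math.lgamma(j) - math.lgamma(m) - math.lgamma(j - m + 1)
                    + m * math.log(p) + (j - m) * math.log1p(-p))
        total += math.exp(log_term)
    return min(1.0, total)

def local_candidates(positions, genome_length, alpha):
    """Return (start, end, n_breakpoints, p_value) for each candidate."""
    xs = sorted(positions)
    if not xs:
        return []
    p = len(xs) / genome_length  # estimated success probability
    clusters, current = [], [xs[0]]
    for prev, nxt in zip(xs, xs[1:]):
        if geometric_cdf(nxt - prev, p) < alpha:
            current.append(nxt)  # unusually short gap: extend the cluster
        else:
            if len(current) > 1:
                clusters.append(current)
            current = [nxt]
    if len(current) > 1:
        clusters.append(current)
    return [(c[0], c[-1], len(c), negbin_cdf(c[-1] - c[0], len(c) - 1, p))
            for c in clusters]
```

Because each gap is tested against the same fixed threshold, the decision to extend a cluster depends only on neighbouring breakpoints. This is the sense in which the significance attributed by this method is local.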
2.4. Software Implementation
The genome-wide scan for hotspots according to the method described in Section 2.2 is implemented by a program we call hot_scan. hot_scan is written in Perl and R, and depends on the Perl modules Parallel::ForkManager and Math::GSL::SF, available via CPAN search (http://search.cpan.org). The former is required for simple parallelisation and the latter to evaluate the hypergeometric function in (3).
2.5. TC-Seq and ChIP libraries
The TC-Seq data sets analysed here are those described by  and . These are deposited at SRA (http://www.ncbi.nlm.nih.gov/sra) under accession numbers SRA061477 and SRA039959. These data sets are from four different translocation libraries: 1. a library from activated B cells infected with AID expressing retrovirus (denoted hereafter as ), 2. a library from AID deficient B cells (denoted as ), 3. a library from 53BP1 deficient B cells infected with AID retrovirus (denoted as ) and 4. a library from 53BP1 deficient AID deficient B cells (denoted as ). The list with the translocation breakpoints passed on to hot_scan in BED format was generated by mapping onto the reference genome as described in .
The association between translocation hotspots and RNA polymerase II (PolII) accumulation was examined by using ChIP-seq experiments deposited at the Gene Expression Omnibus database (http://www.ncbi.nlm.nih.gov/geo) under the accession number GSE24178.
2.6. Enrichment analysis
The analysis of genes targeted by AID that are discovered by hot_scan and NB methods was made by using WebGestalt (). The set of genes targeted by AID was compared with the mouse genome using the hypergeometric test followed by correction for multiple testing using the Benjamini & Hochberg method at a significance level of Top10. The high level functional classification was based upon GO Slim for all three major GO term categories, namely biological process, cellular component and molecular function.
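For readers unfamiliar with this type of test, the following minimal sketch shows a hypergeometric enrichment p-value for one functional category, followed by the Benjamini-Hochberg adjustment across categories. It does not reproduce the WebGestalt implementation. The argument names are assumptions of this sketch.

```python
from math import comb

def hypergeom_upper_tail(k, M, K, n):
    """P(X >= k) when n genes are drawn from M, of which K are annotated."""
    top = min(K, n)
    favourable = sum(comb(K, i) * comb(M - K, n - i)
                     for i in range(max(k, 0), top + 1))
    return favourable / comb(M, n)

def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted
```

Here `M` would be the number of genes in the reference genome, `K` the number annotated to a given GO Slim category, `n` the number of genes in the hotspot list, and `k` the number of those that carry the annotation.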
3.1. Scan statistic & local method
The methods described in sections 2.2 and 2.3 are compared by plotting the distribution of the hotspot lengths defined by each. Because the observed hotspot lengths vary across several orders of magnitude, we considered the logarithm of their actual length. The results obtained by analysing the four data sets described in section 2.5 are presented in Figure 1. The analysis done with SS shows that the hotspot length distribution is roughly characterised by two components, one with a mean length of base pairs and the other with mean length equal to . Table S1 presents the means, variances and the weights of these components. Let and be the weight of the short and long hotspots components respectively. While the mean position of these components remains almost the same, the relative weight of long to short hotspots, , does show significant changes. A comparison of the data set (Figure 1.D) and the data set (Figure 1.B) reveals that the relative frequency of long hotspots is higher in the absence of 53BP1. Indeed, the value for in the sample is 1.95, while for the sample it is four times smaller, namely . We conclude that 53BP1 decreases the proportion of long hotspots. A similar effect of 53BP1 deficiency is observed in the absence of AID. Indeed, the sample (Figure 1.A) is characterised by while the sample (Figure 1.C) by . We conclude that in the absence of 53BP1 longer hotspots are more frequent regardless of AID expression. This effect can be attributed to the role of 53BP1 in DNA end protection. In the absence of this protein DNA end resection is increased, resulting in longer hotspots as suggested previously . A comparison of the plots that present the and samples (Figures 1.C, 1.D respectively) shows no substantial changes in the proportion of short to long hotspots, with and . However, in the absence of 53BP1 the frequency of longer hotspots increases significantly when AID is overexpressed (Figures 1.A, 1.B).
Thus, proper DNA repair that is dependent on 53BP1 ensures the predominance of short hotspots, even when AID is overexpressed.
Most of these results are not observed when analysing the same data by the local method described in Section 2.3. This becomes clear by inspection of the dashed lines in Figure 1, which correspond to the length distributions for hotspots detected by NB and NB. Even at , for which one would expect longer hotspots, the local method is unable to detect the changes in the frequency of the long hotspots to the extent brought by hot_scan. It is important to note that by following (4), the local method would classify two consecutive breakpoints as being part of a hotspot if their distance is smaller than . Larger values for thus allow for larger gaps. Values above 0.05 would however correspond to tests with Type I error rates higher than what is commonly accepted.
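The dependence of the largest admissible gap on the significance level can be made concrete with a short calculation. The sketch below assumes that (4) is the geometric probability 1 - (1 - p)^d of observing a gap of at most d bases, so that two breakpoints are linked whenever d is smaller than log(1 - alpha) / log(1 - p). The numbers used in the usage line are purely illustrative and are not taken from the data sets analysed here.

```python
import math

def max_linking_gap(n_events, genome_length, alpha):
    """Largest gap d (in bases) with 1 - (1 - p)^d < alpha, p = n / length."""
    p = n_events / genome_length
    bound = math.log1p(-alpha) / math.log1p(-p)
    # Largest integer strictly below the bound.
    return max(math.ceil(bound) - 1, 0)

# Illustrative values only: 10,000 breakpoints in a 2.7 Gb genome.
gap_strict = max_linking_gap(10_000, 2_700_000_000, 0.01)
gap_loose = max_linking_gap(10_000, 2_700_000_000, 0.05)
```

Under these assumptions the admissible gap grows roughly in proportion to alpha when alpha is small, which is why the local method can only reach longer hotspots by relaxing its Type I error.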
The results in Figure 1 present the differences in the hotspot lengths defined by the scan statistic and the local method. However, they do not provide any information about the relative positions of the hotspots detected by either technique. To address this aspect, we analysed the relative hotspot positions for all four data sets described in Section 2.5. As an example, Figure 2 presents the hotspots for chromosome 9 estimated via , SS and SS. A comparison of the hotspots found by NB (Figure 2.A) and those by the scan statistic with a relatively small window, namely by SS (Figure 2.C), shows that the hotspots defined by either method share the same location. However, the analysis with SS (Figure 2.C) reveals the existence of longer hotspots which include one or a few smaller hotspots found by NB. The merging of several smaller hotspots into a larger one is justified by the sparsity of the data, which only becomes apparent at larger scales. These features are clearly overlooked by the NB method because of its local nature. A few examples of the scaling effect are shown in Figures 2.D, 2.F and 2.G. These results are consistent with those observed for other chromosomes (see Figures S1-S2).
Overexpression of AID in the absence of 53BP1 results not only in an increase in the number of translocations but also defines larger regions where these events cluster. In addition, AID overexpression in cells results in an elongation of pre-existing hotspot regions. This is apparent when comparing the outermost track and the neighboring one on the circular graph that corresponds to the analysis with SS (Figure 2.C) for the and the data. The analysis of the same data with a larger scanning window, namely with SS, gives a similar result but the affected regions are much larger (Figure 2.B). The length of most hotspots in this situation is greatly reduced in 53BP1 sufficient samples. This is apparent for the hotspots from the and the data. Interestingly, the effect of 53BP1 correlates with the significance of the hotspots.
We conclude that hotspot length is dependent on 53BP1 and that AID overexpression in the absence of 53BP1 results in translocations that cluster over large regions.
3.2. Exclusive hotspots
Most of the more prominent hotspots are defined by both the scan statistic and the local method (Figures 2.A, C and Figures S1.A, C and S2.A, C). However, both methods reveal exclusive clustering regions (Tables S2, S3). In order to identify AID-dependent hotspots that are exclusive to each method, we compared the and the data. We found 36 exclusive hotspots with hot_scan (see Table S2) and 27 exclusive hotspots with NB and NB (see Table S3). The exclusive hotspots obtained by the scan statistic were defined using different window widths (50, 100, 150, 250, 500, 1000, 2500, and 5000 bp). Regions that are identified as exclusive AID hotspots were also analysed for several biologically relevant markers. First, we analysed whether our exclusive hotspots correlate with Replication Protein A (RPA) binding sites in activated B cells [46, 17]. The sites of RPA accumulation have been shown to overlap well with AID targets genome-wide and it was proposed that RPA marks AID induced DNA double strand breaks. Further, we analysed the overlap with sites where RNA Polymerase II (PolII) accumulates, as it was shown that transcription is necessary for AID targeting (). We also analysed the overlap with known fragile regions, namely Early Replicating Fragile Sites (ERFS) and Common Fragile Sites (CFS) . The results of all these comparisons are summarised in Tables S2 and S3. A total of 20 of the 36 (55.6%) exclusive hotspots found by the scan statistic were common to all sites. Notably, all of these sites are associated with the PolII signal (Table S2). On the other hand, only 8 of the 27 (29.6%) exclusive hotspots of the local method fall within these sites and 6 are associated with the PolII signal (Table S3). Thus, the hotspots defined by the scan statistic show higher correlation with active transcription, RPA accumulation and common fragile sites than those defined by the local method.
AID leads to the accumulation of somatic mutations in a large number of non-immunoglobulin genes . An analysis for the presence of somatic hypermutations in 1,496,058 bp from activated B cells  revealed a number of non-immunoglobulin genes with AID dependent mutations: Il4ra, Grap, Hist1h1c, Ly6e, Gadd45g and Il4i1. Three of these, namely Il4ra, Grap and Ly6e, were detected as genes with AID dependent hotspots by both methods, but a hotspot in Hist1h1c (mutation rate in Ig-AID Ung: 79.7 ) was only found by hot_scan (Table S2 and Figure S3). Three other genes associated with chromosomal translocations were detected exclusively by hot_scan, namely Fli1, Dlx5 and Birc3. The Fli1 (Friend leukemia integration 1) gene (Figure 2.E) is translocated in 90% of Ewing sarcomas and is important in tumorigenesis . Dlx5 (distal-less homeobox 5) is implicated in T-cell lymphomas (). Finally, the Birc3 (baculoviral IAP repeat-containing 3) gene encodes an apoptosis inhibitor that is associated with MALT lymphomas . A complementary enrichment analysis  for the genes identified by hot_scan is included in Figures S5, S8 and Tables S4, S5. The functional categories associated with the scan statistic hotspots indicate that the top ranked genes are important in B lymphocytes.
Here we describe a method for the identification of chromosomal translocation hotspots. In contrast to previous methods, the significance level for a cluster is defined on a chromosome-wide basis by using scan statistics. We show that this has important consequences in the analysis of translocation hotspots in primary B cells in the presence or absence of the 53BP1 repair protein.
The previous study by  showed that 53BP1 deficiency results in an increase of rearrangements to intergenic regions and changes the frequency and distribution of translocations in , immunoglobulin switch regions and 16 other prominent hotspots. Our analysis adds to these findings by showing that 53BP1 deficiency results in an overall enrichment of longer hotspots. These results support the previous conclusion that 53BP1 prevents the resection of DNA, thus resulting in shorter hotspots . Our analysis also shows that an increased amount of AID results in a substantial enlargement of pre-existing hotspot regions. These changes can only be observed with wider scanning windows, here , and are not detected by previous methods because of their local characterization of clustering. The success of the scan statistic here stems from its ability to detect events spread across several scales, as is shown by the analyses made with several scanning window widths. Our analysis with the scan statistic is able to identify several exclusive hotspots whose authenticity is supported by independent experimental approaches. Some of these exclusive events are localised in genes that are known to be relevant in tumorigenesis.
The approach presented here may be applied to a variety of questions related to the detection of unusual clustering of a given pattern throughout the genome. A few recent examples of particular interest are the detection of enriched genomic interaction regions such as those defined via ChIP-seq experiments , 4C-seq experiments  and DNA-DNA contact sites . We expect our method to be especially useful for the analysis of data where a global significance to clustering can be considered.
I. T. S. wishes to thank T. Oliveira for kindly providing the script for the local method described in section 2.3. R. A. R. thanks K. J. Abraham for useful discussions. M. C. N. is a Howard Hughes Medical Institute Investigator.
-  N. Balakrishnan and M. V. Koutras. Runs and Scans with Applications, volume 415 of Wiley series in statistics. Wiley, New York, 2001.
-  J. H. Barlow, R. B. Faryabi, E. Callen, N. Wong, A. Malhowski, H. T. Chen, G. Gutierrez-Cruz, H. W. Sun, P. McKinnon, G. Wright, R. Casellas, D. F. Robbiani, L. Staudt, O. Fernandez-Capetillo, and A. Nussenzweig. Identification of early replicating fragile sites that contribute to genome instability. Cell, 152(3):620--632, Jan 2013.
-  Y. Benjamini and D. Yekutieli. The control of the false discovery rate in multiple testing under dependency. Ann. Stat., 29(4):1165--1188, 2001.
-  M. Berger, U. Dirksen, A. Braeuninger, G. Koehler, H. Juergens, M. Krumbholz, and M. Metzler. Genomic ews-fli1 fusion sequences in Ewing sarcoma resemble breakpoint characteristics of immature lymphoid malignancies. PLoS ONE, 8(2), 2013.
-  A. Bothmer, D. F. Robbiani, M. Di Virgilio, S. F. Bunting, I. A. Klein, N. Feldhahn, J. Barlow, H.T. Chen, D. Bosque, E. Callen, A. Nussenzweig, and M. C. Nussenzweig. Regulation of DNA end joining, resection, and immunoglobulin class switch recombination by 53BP1. Mol. Cell, 42(3):319--329, May 2011.
-  S. F. Bunting, E. Callen, N. Wong, H. T. Chen, F. Polato, A. Gunn, A. Bothmer, N. Feldhahn, O. Fernandez-Capetillo, L. Cao, X. Xu, C. X. Deng, T. Finkel, M. Nussenzweig, J. M. Stark, and A. Nussenzweig. 53BP1 inhibits homologous recombination in Brca1-deficient cells by blocking resection of DNA breaks. Cell, 141(2):243--254, Apr 2010.
-  K. Busch, T. Keller, U. Fuchs, R-F Yeh, J. Harbott, I. Klose, J. Wiemels, A. Novosel, A. Reiter, and A. Borkhardt. Identification of two distinct MYC breakpoint clusters and their association with IGH breakpoint regions in the t(8; 14) translocations in sporadic Burkitt-lymphoma. Leukemia, 21:1739--1751, 2007.
-  J. Chaudhuri, C. Khuong, and F. W. Alt. Replication protein A interacts with AID to promote deamination of somatic hypermutation targets. Nature, 430(7003):992--998, Aug 2004.
-  R. Chiarle, Y. Zhang, R. L. Frock, S. M. Lewis, B. Molinie, Y. J. Ho, D. R. Myers, V. W. Choi, M. Compagno, D. J. Malkin, D. Neuberg, S. Monti, C. C. Giallourakis, M. Gostissa, and F. W. Alt. Genome-wide translocation sequencing reveals mechanisms of chromosome breaks and rearrangements in B cells. Cell, 147(1):107--119, Sep 2011.
-  E. de Wit and W. de Laat. A decade of 3C technologies: insights into nuclear organization. Genes Dev., 26(1):11--24, Jan 2012.
-  J. M. Di Noia and M. S. Neuberger. Molecular mechanisms of antibody somatic hypermutation. Annu. Rev. Biochem., 76:1--22, 2007.
-  J. Dierlamm, M. Baens, I. Wlodarska, M. Stefanova-Ouzounova, J. M. Hernandez, D. K. Hossfeld, C. De Wolf-Peeters, A. Hagemeijer, H. Van den Berghe, and P. Marynen. The apoptosis inhibitor gene API2 and a novel 18q gene, MLT, are recurrently rearranged in the t(11;18)(q21;q21) associated with mucosa-associated lymphoid tissue lymphomas. Blood, 93(11):3601--3609, Jun 1999.
-  S. Difilippantonio, E. Gapud, N. Wong, C. Y. Huang, G. Mahowald, H. T. Chen, M. J. Kruhlak, E. Callen, F. Livak, M. C. Nussenzweig, B. P. Sleckman, and A. Nussenzweig. 53BP1 facilitates long-range DNA end-joining during V(D)J recombination. Nature, 456(7221):529--533, Nov 2008.
-  J. Glaz. Approximations and bounds for the distribution of the scan statistic. J. Amer. Stat. Assoc., 84(406):560--566, 1989.
-  J. Glaz, J. Naus, and S. Wallenstein. Scan Statistics. Springer series in Statistics. Springer Verlag, New York, 2001.
-  M. Gostissa, F. W. Alt, and R. Chiarle. Mechanisms that promote and suppress chromosomal translocations in lymphocytes. Annu. Rev. Immunol., 29:319--350, 2011.
-  O. Hakim, W. Resch, A. Yamane, I. Klein, K. R. Kieffer-Kwon, M. Jankovic, T. Oliveira, A. Bothmer, T. C. Voss, C. Ansarah-Sobrinho, E. Mathe, G. Liang, J. Cobell, H. Nakahashi, D. F. Robbiani, A. Nussenzweig, G. L. Hager, M. C. Nussenzweig, and R. Casellas. DNA damage defines sites of recurrent chromosomal translocations in B lymphocytes. Nature, 484(7392):69--74, Apr 2012.
-  S. K. Hasan, T. Ottone, R. F. Schlenk, Y. Xiao, J. L. Wiemels, M. E. Mitra, P. Bernasconi, F. Di Raimondo, M. T. L. Stanghellini, P. Marco, A. N. Mays, H. Döhner, M. A. Sanz, S. Amadori, D. Grimwade, and F. Lo-Coco. Analysis of t(15;17) chromosomal breakpoint sequences in therapy-related versus de novo acute promyelocytic leukemia: association of DNA breaks with specific DNA motifs at PML and RARA loci. Genes Chromosomes Cancer, 49(8):726--32, 2010.
-  M. Jankovic, N. Feldhahn, T. Y. Oliveira, I. T. Silva, Kyong-Rim Kieffer-Kwon, A. Yamane, W. Resch, I. Klein, D. F. Robbiani, R. Casellas, and M. C. Nussenzweig. 53BP1 alters the landscape of DNA rearrangements and suppresses AID-induced B cell lymphoma. Mol. Cell, 49(4):623--631, 2013.
-  I. A. Klein, W. Resch, M. Jankovic, T. Oliveira, A. Yamane, H. Nakahashi, M. Di Virgilio, A. Bothmer, A. Nussenzweig, D. F. Robbiani, R. Casellas, and M. C. Nussenzweig. Translocation-capture sequencing reveals the extent and nature of chromosomal rearrangements in B lymphocytes. Cell, 147(1):95--106, Sep 2011.
-  M. I. Krzywinski, J. E. Schein, I. Birol, J. Connors, R. Gascoyne, D. Horsman, S. J. Jones, and M. A. Marra. Circos: An information aesthetic for comparative genomics. Genome Research, 19(9), 2009.
-  C. Kumar-Sinha, S. A. Tomlins, and A. M. Chinnaiyan. Recurrent gene fusions in prostate cancer. Nat. Rev. Cancer, 8(7):497--511, Jul 2008.
-  R. Küppers. Mechanisms of B-cell lymphoma pathogenesis. Nat. Rev. Cancer, 5(4):251--262, Apr 2005.
-  R. Küppers and R. Dalla-Favera. Mechanisms of chromosomal translocations in B cell lymphomas. Oncogene, 20(40):5580--5594, Sep 2001.
-  M. Liu, J. L. Duke, D. J. Richter, C. G. Vinuesa, C. C. Goodnow, S. H. Kleinstein, and D. G. Schatz. Two levels of protection for the B cell genome during somatic hypermutation. Nature, 451(7180):841--845, Feb 2008.
-  C. R. Loader. Large deviation approximations to the distribution of scan statistics. Adv. Appl. Prob., 23(4):751--771, 1991.
-  W. Ma and W. H. Wong. The analysis of ChIP-Seq data. Meth. Enzymol., 497:51--73, 2011.
-  J. I. Naus. Probabilities for a generalized birthday problem. J. Amer. Stat. Assoc., 69:810--815, 1974.
-  J. I. Naus and S. Wallenstein. Multiple window and cluster size scan procedures. Methodology and Computing in Applied Probability, 6:389--400, 2004.
-  A. Nussenzweig and M. C. Nussenzweig. Origin of chromosomal translocations in lymphoid cancer. Cell, 141:27--38, 2010.
-  R. Pavri, A. Gazumyan, M. Jankovic, M. Di Virgilio, I. Klein, C. Ansarah-Sobrinho, W. Resch, A. Yamane, B. Reina San-Martin, V. Barreto, T. J. Nieland, D. E. Root, R. Casellas, and M. C. Nussenzweig. Activation-induced cytidine deaminase targets DNA at sites of RNA polymerase II stalling by interaction with Spt5. Cell, 143(1):122--133, Oct 2010.
-  M. Perone Pacifico, C. Genovese, I. Verdinelli, and L. Wasserman. Scan clustering: a false discovery approach. J. Multivariate Anal., 98(7):1441--1469, 2007.
-  T. H. Rabbitts. Commonality but diversity in cancer gene fusions. Cell, 137(3):391--395, May 2009.
-  A. R. Ramiro, M. Jankovic, E. Callen, S. Difilippantonio, H. T. Chen, K. M. McBride, T. R. Eisenreich, J. Chen, R. A. Dickins, S. W. Lowe, A. Nussenzweig, and M. C. Nussenzweig. Role of genomic instability and p53 in AID-induced c-myc-Igh translocations. Nature, 440(7080):105--109, Mar 2006.
-  A. Reiter, S. Saußele, D. Grimwade, J. L. Wiemels, M. R. Segal, M. Lafage-Pochitaloff, C. Walz, A. Weisser, A. Hochhaus, A. Willer, A. Reichert, T. Büchner, E. Lengfelder, R. Hehlmann, and N. C. P. Cross. Genomic anatomy of the specific reciprocal translocation t(15;17) in acute promyelocytic leukemia. Genes Chromosomes Cancer, 36(2):175--188, 2003.
-  P. Revy, T. Muto, Y. Levy, F. Geissmann, A. Plebani, O. Sanal, N. Catalan, M. Forveille, R. Dufourcq-Labelouse, A. Gennery, I. Tezcan, F. Ersoy, H. Kayserili, A. G. Ugazio, N. Brousse, M. Muramatsu, L. D. Notarangelo, K. Kinoshita, T. Honjo, A. Fischer, and A. Durandy. Activation-induced cytidine deaminase (AID) deficiency causes the autosomal recessive form of the Hyper-IgM syndrome (HIGM2). Cell, 102(5):565--575, Sep 2000.
-  N. Riggi and I. Stamenkovic. The Biology of Ewing sarcoma. Cancer Lett., 254(1):1--10, Aug 2007.
-  M. R. Segal and J. L. Wiemels. Clustering of translocation breakpoints. J. Amer. Stat. Assoc., 97(457):66--76, 2002.
-  M. Simonis, P. Klous, E. Splinter, Y. Moshkin, R. Willemsen, E. de Wit, B. van Steensel, and W. de Laat. Nuclear organization of active and inactive chromatin domains uncovered by chromosome conformation capture-on-chip (4C). Nat. Genet., 38(11):1348--1354, Nov 2006.
-  J. Stavnezer, J. E. Guikema, and C. E. Schrader. Mechanism and regulation of class switch recombination. Annu. Rev. Immunol., 26:261--292, 2008.
-  U. Storb, H. M. Shen, S. Longerich, S. Ratnam, A. Tanaka, G. Bozek, and S. Pylawka. Targeting of AID to immunoglobulin genes. Adv. Exp. Med. Biol., 596:83--91, 2007.
-  Y. Tan, R. A. Timakhov, M. Rao, D. A. Altomare, J. Xu, Z. Liu, Q. Gao, S. C. Jhanwar, A. Di Cristofano, D. L. Wiest, J. E. Knepper, and J. R. Testa. A novel recurrent chromosomal inversion implicates the homeobox gene Dlx5 in T-cell lymphomas from Lck-Akt2 transgenic mice. Cancer Res., 68(5):1296--1302, Mar 2008.
-  S. Wallenstein and N. Neff. An approximation for the distribution of the scan statistic. Stat. Med., 6:197--207, 1987.
-  J. Wang, D. Duncan, Z. Shi, and B. Zhang. WEB-based GEne SeT AnaLysis Toolkit (WebGestalt): update 2013. Nucleic Acids Res., 41(Web Server issue):77--83, Jul 2013.
-  J. L. Wiemels, Y. Leonard, B. C. Wang, M. R. Segal, S. P. Hunger, M. T. Smith, V. Crouse, X. Ma, P. A. Buffler, and S. R. Pine. Site-specific translocation and evidence of postnatal origin of the t(1;19) E2A-PBX1 fusion in childhood acute lymphoblastic leukemia. PNAS, 99(23):15101--15106, 2002.
-  A. Yamane, W. Resch, N. Kuo, S. Kuchen, Z. Li, H. W. Sun, D. F. Robbiani, K. McBride, M. C. Nussenzweig, and R. Casellas. Deep-sequencing identification of the genomic targets of the cytidine deaminase AID and its cofactor RPA in B lymphocytes. Nat. Immunol., 12(1):62--69, Jan 2011.
-  Y. Zhang, M. Gostissa, D. G. Hildebrand, M. S. Becker, C. Boboila, R. Chiarle, S. Lewis, and F. W. Alt. The role of mechanistic factors in promoting chromosomal translocations found in lymphoid and other cancers. Adv. Immunol., 106:93--133, 2010.