Faster mass decomposition
Metabolomics complements investigation of the genome, transcriptome, and proteome of an organism. Today, the vast majority of metabolites remain unknown, in particular for non-model organisms. Mass spectrometry is one of the predominant techniques for analyzing small molecules such as metabolites. A fundamental step for identifying a small molecule is to determine its molecular formula.
Here, we present and evaluate three algorithm engineering techniques that speed up the molecular formula determination. For that, we modify an existing algorithm for decomposing the monoisotopic mass of a molecule. These techniques lead to a four-fold reduction of running times, and reduce memory consumption by up to . In comparison to the classical search tree algorithm, our algorithm reaches a 1000-fold speedup.
Metabolomics complements investigation of the genome, transcriptome, and proteome of an organism. Today, the vast majority of metabolites remain unknown, and this is particularly the case for non-model organisms and secondary metabolites: for many organisms, there is a striking discrepancy between the number of identified metabolites and the prediction of secondary metabolite-related biosynthetic pathways through recent genome-sequencing results, see for example . The structural diversity of metabolites is extraordinarily large, and much larger than for biopolymers such as proteins. In almost all cases, we cannot deduce the structure of metabolites from genome sequences, as is done for proteins. Mass spectrometry (MS) is one of the two predominant experimental analysis techniques for detecting and identifying metabolites and other small molecules, the other being nuclear magnetic resonance (NMR). The most important advantage of MS over NMR is that it is orders of magnitude more sensitive, making it the method of choice for medium- to high-throughput screening applications . Newly identified metabolites often serve as leads in drug design , in particular for novel antibiotics .
In a mass spectrometry experiment, we measure the mass-to-charge ratio (m/z) of each peak, corresponding to an ion of the intact molecule or one of its fragments. Here, we omit the analysis of the charge state and assume that the mass of the ion is known: In most cases, small molecules receive only a single charge, and other cases can be detected either by analyzing the isotope pattern of the metabolite, or simply by iterating over the few possible charge states.
One of the most basic — but nevertheless highly important — steps when analyzing a molecule is to determine its molecular formula. We note in passing that mass spectrometry does not record the mass of the uncharged molecule but rather the mass of the corresponding ion; for the sake of clarity, we ignore this difference in the following. Common approaches first compute all candidate molecules with mass sufficiently close to a peak mass in the measured spectrum, using an alphabet of potential elements. In a second step, additional information is used to score the different candidate molecules, for example isotope pattern or fragmentation pattern information. The identified molecular formulas may then serve as a basis for subsequent identification steps. The problem of decomposing peak masses lies at the core of practically every approach for the interpretation of small-molecule MS data that does not directly depend on a spectral library; see for example [6, 3, 21, 25, 18, 23, 13, 19].
First approaches for decomposing masses date back to at least the 1970s [22, 9], where the naïve search tree algorithm described below is mentioned for the first time. Running times of this algorithm are often prohibitive, particularly for large alphabets of elements. Fürst et al. proposed a faster decomposition algorithm which, unfortunately, is limited to the four elements CHNO. For integer-valued masses, the problem is closely related to unbounded integer knapsacks . Here, an algorithm that works for arbitrary alphabets of elements is “folklore” in computer science, and can solve the problem in pseudo-polynomial running time . Böcker and Lipták [4, 5] presented an algorithm that requires only little memory and is swift in practice. Decomposing real-valued masses using the integer-mass approaches was introduced in . See also the review .
In this paper, we present three algorithm engineering techniques to speed up the decomposition of peak masses in practice. First, we replace the recursive decomposition algorithm by an iterative version that mimics the recursive enumeration, but is faster in practice. Second, we show how to minimize rounding error accumulation when transforming real-valued masses to their integer-valued counterparts. Finally, we modify the algorithm from  to decompose intervals instead of single masses, based on ideas from . Together, these improvements result in -fold decreased running times, compared to the previously fastest approach . We evaluate this on four experimental datasets.
In the following, let denote the masses of our alphabet ; see Table 1 for masses of elements. For readability, we will denote the real-valued masses by and the integer-valued masses by . We usually assume that these masses are ordered and, in particular, that is minimum. We want to decompose the mass of a peak in a measured spectrum, but we have to take into account measurement inaccuracies. To this end, we assume that we are given an interval , and want to find all decompositions such that .
Analogously, we can decompose integer masses over an alphabet of integer masses . Again, we assume that masses are ordered and that is minimum. We want to find all decompositions such that where are integer.
Böcker et al.  describe how to transform an instance of the real-valued mass decomposition problem into an integer-valued instance, see there for details. We briefly recapitulate the method: For a given blowup factor we transform real-valued masses into integer masses . (Different from  we will round down here, as this presentation appears to be somewhat easier to follow.) We want to find all real-valued decompositions in the interval . Regarding the upper bound we have and, as the left side is integer, . For the lower bound, we have to take into account rounding error accumulation: We define relative rounding errors
and note that . Let . Then, implies , see  for details. To this end, we can decompose integer masses in the interval to . Doing so, we guarantee that no real-valued decomposition will be missed. The list of integer decompositions will contain false positive decompositions, but these can be easily filtered out by checking for each integer decomposition.
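As an illustration of this transformation, the following Python sketch (function names and the exact form of the bounds are ours, derived from the reasoning above, not taken from the paper's implementation) blows up a real-valued alphabet, derives a safe integer interval from the maximal relative rounding error, and filters out false positive decompositions by re-checking the real-valued mass:

```python
import math

def decompose_interval_real(alphabet, lo, hi, blowup, int_decompose):
    """Find all decompositions over real-valued `alphabet` with mass in
    [lo, hi], by reduction to integer decomposition; `int_decompose(a_int, m)`
    must enumerate all integer decompositions of m over alphabet a_int."""
    a_int = [math.floor(blowup * a) for a in alphabet]
    # maximal relative rounding error Delta = max_i delta_i
    delta = max((blowup * a - ai) / ai for a, ai in zip(alphabet, a_int))
    # upper bound: the integer mass never exceeds blowup times the real mass
    hi_int = math.floor(blowup * hi)
    # lower bound, corrected for rounding error accumulation:
    # blowup * mass <= m_int * (1 + Delta), hence m_int >= blowup * lo / (1 + Delta)
    lo_int = math.ceil(blowup * lo / (1 + delta))
    result = []
    for m in range(lo_int, hi_int + 1):
        for counts in int_decompose(a_int, m):
            # filter false positives by checking the real-valued mass
            mass = sum(c * a for c, a in zip(counts, alphabet))
            if lo <= mass <= hi:
                result.append(counts)
    return result

def naive_int_decompose(a_int, m):
    """Naive enumeration over the integer alphabet (for illustration only)."""
    out = []
    def rec(i, m, counts):
        if i == len(a_int) - 1:
            if m % a_int[i] == 0:
                out.append(tuple(counts + [m // a_int[i]]))
            return
        for c in range(m // a_int[i] + 1):
            rec(i + 1, m - c * a_int[i], counts + [c])
    rec(0, m, [])
    return out
```

For example, over the two-element alphabet H (1.007825 Da) and C (12.0 Da) with blowup factor 100, the real interval [11.99, 12.01] yields only the decomposition C1; the integer instance also produces H12 (integer mass 1200), which the final real-valued check removes as a false positive.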
Mass accuracy of an MS instrument depends linearly on the mass that we measure, and is usually given in “parts per million” (ppm). Formally, for a given mass and some , we want to find all masses in the interval with and . Consequently, the width of the interval that we want to decompose is linear in the mass .
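For instance, a relative accuracy of 20 ppm at mass 500 Da corresponds to an absolute window of ±0.01 Da; a small helper (our naming, for illustration) makes this explicit:

```python
def ppm_interval(mass, ppm):
    """Return the interval (lo, hi) of masses within `ppm` parts per million
    of `mass`; its width 2 * mass * ppm * 1e-6 grows linearly with the mass."""
    eps = mass * ppm * 1e-6
    return mass - eps, mass + eps
```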
For integer masses, the number of decompositions of some mass asymptotically equals . This leads to a similar estimate for real-valued masses . In general, this asymptotic estimate is accurate only for very large masses; for molecular formulas, it is a relatively good estimate even for small masses .
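For reference, the classical asymptotic (obtainable via generating functions, see for example Wilf's generatingfunctionology) for integer masses $a_1,\dots,a_k$ with $\gcd(a_1,\dots,a_k)=1$ can be stated as:

```latex
\gamma(m) \;\sim\; \frac{m^{k-1}}{(k-1)!\; a_1 a_2 \cdots a_k}
\qquad (m \to \infty),
```

where $\gamma(m)$ denotes the number of decompositions of $m$; the real-valued estimate mentioned in the text follows analogously.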
3 Algorithms for decomposing masses
The conceptually simplest algorithm for decomposing masses is a search tree that recursively builds up the decompositions (molecular formulas), taking into account the mass accuracy. The algorithm is very similar to FindAllRecursive in Fig. 1, we leave out the straightforward details. This algorithm has been suggested several times in the literature [22, 9, 1]. The major disadvantage of this algorithm is that its running time is not output-sensitive: For a constant alphabet of size , the algorithm requires time, even if there is not a single decomposition.
To decompose an integer over an alphabet of integer masses, we can use algorithm FindAllRecursive in Fig. 1. This algorithm requires an oracle such that if and only if is decomposable over the sub-alphabet . We can build this oracle using a dynamic programming table : We initialize , for , and use the recurrence for and otherwise. This approach requires memory to store and time to compute it, where is the largest mass that we want to decompose. The algorithm has polynomial time and space with regard to the mass we want to decompose. (Precisely speaking, running time is pseudo-polynomial in the input , as polynomial running time would require polynomial dependency on . Since the decision version of the problem (“is there a decomposition of mass ?”) is weakly NP-hard , there is little hope for an algorithm with running time polynomial in . We will ignore this detail in the following.) Time for computing each decomposition is . The decomposition algorithm has polynomial delay and, hence, is output-sensitive.
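A sketch of this dynamic program and of a recursive enumeration in the spirit of FindAllRecursive (a simplified Python rendering for illustration, not the paper's exact pseudocode) may look as follows:

```python
def knapsack_table(alphabet, max_mass):
    """D[i][m] is True iff m is decomposable over alphabet[: i + 1]."""
    k = len(alphabet)
    D = [[False] * (max_mass + 1) for _ in range(k)]
    for i in range(k):
        D[i][0] = True
    for i, a in enumerate(alphabet):
        for m in range(1, max_mass + 1):
            D[i][m] = (i > 0 and D[i - 1][m]) or (m >= a and D[i][m - a])
    return D

def find_all_recursive(alphabet, D, mass):
    """Enumerate all decompositions of `mass`, pruning every branch where
    the oracle D reports that no decomposition can exist."""
    k = len(alphabet)
    counts = [0] * k
    out = []
    def rec(i, m):
        if i == 0:
            if m % alphabet[0] == 0:        # remaining mass goes to a_1
                counts[0] = m // alphabet[0]
                out.append(counts.copy())
                counts[0] = 0
            return
        if D[i - 1][m]:                     # branch 1: proceed to next element
            rec(i - 1, m)
        if m >= alphabet[i] and D[i][m - alphabet[i]]:
            counts[i] += 1                  # branch 2: use one more copy of a_i
            rec(i, m - alphabet[i])
            counts[i] -= 1
    rec(k - 1, mass)
    return out
```

Over the toy alphabet (2, 3), mass 12 has exactly the three decompositions 6·2, 3·2 + 2·3, and 4·3, and no oracle-pruned branch is ever entered in vain.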
A more memory-efficient way to build the required oracle, is to use the extended residue table (ERT) from : For an integer , let denote the residue of modulo , where . We define the extended residue table by
where we define if no such number exists, that is, if the minimum is taken over the empty set. Now, we can define the oracle by
Storing the ERT requires space, and the table can be computed in time. The time for computing each decomposition using algorithm FindAllRecursive is again but can be reduced to [4, 5]. The conceptual advantage of this approach is that we do not have to decide upon some “largest mass” during preprocessing, and that both time and space no longer depend on the mass that we want to decompose.
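A possible construction of the extended residue table, following the round-robin idea of [4, 5] (a simplified sketch; variable names and the exact pass structure are ours):

```python
import math

def extended_residue_table(alphabet):
    """ert[i][r] is the smallest integer m with m % a1 == r that is
    decomposable over alphabet[: i + 1] (math.inf if none exists)."""
    a1, k = alphabet[0], len(alphabet)
    ert = [[math.inf] * a1 for _ in range(k)]
    ert[0][0] = 0
    for i in range(1, k):
        ert[i] = ert[i - 1][:]
        ai = alphabet[i]
        d = math.gcd(a1, ai)
        for r0 in range(d):
            # residues r0, r0 + ai, r0 + 2*ai, ... (mod a1) form a cycle;
            # starting the round-robin pass at the cycle's minimum entry
            # makes a single pass around the cycle sufficient
            cycle, r = [], r0
            while True:
                cycle.append(r)
                r = (r + ai) % a1
                if r == r0:
                    break
            start = min(range(len(cycle)), key=lambda j: ert[i][cycle[j]])
            m = ert[i][cycle[start]]
            for j in range(1, len(cycle) + 1):
                r = cycle[(start + j) % len(cycle)]
                m = min(ert[i][r], m + ai)
                ert[i][r] = m
    return ert

def is_decomposable(ert, a1, i, m):
    """Oracle: m is decomposable over the first i + 1 masses iff m is at
    least the smallest decomposable number in its residue class."""
    return m >= ert[i][m % a1]
```

For the toy alphabet (5, 3), the second row of the table is [0, 6, 12, 3, 9]: for example, 6 = 3 + 3 is the smallest decomposable number congruent to 1 modulo 5. The oracle then reports mass 8 = 5 + 3 as decomposable and mass 7 as not.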
4 Iterative version of the decomposition algorithm
The naïve search tree algorithm for decomposing masses can easily be made iterative using nested for-loops. Replacing algorithm FindAllRecursive by an iterative version is slightly more complicated, as we have to avoid “empty branches” of the search tree, where no decomposition can be found. We could use an auxiliary Boolean vector that stores which of the two alternative recursive calls from FindAllRecursive has been executed last. We present a more involved version of the iterative algorithm in Fig. 2: we avoid the auxiliary vector by deciding on the alternative calls directly from the decomposition .
The algorithm is independent of the actual implementation of the oracle . Asymptotically, worst-case running time is identical to that of the recursive version; in practice, the iterative version is nevertheless considerably faster than its recursive counterpart, as we avoid the stack handling.
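The following Python sketch illustrates the idea (our reconstruction for illustration, not the paper's Fig. 2): a descend/backtrack loop that infers which branch to resume from the current partial decomposition `counts`, so that neither an auxiliary vector nor a call stack is needed. The oracle takes an element index i and a mass m and reports whether m is decomposable over the first i + 1 masses:

```python
def decompose_iterative(alphabet, oracle, mass):
    """Iterative counterpart of the recursive enumeration.  On backtracking,
    counts[i] > 0 tells us that we arrived via the 'one more a_i' branch,
    counts[i] == 0 that we arrived via the 'next element' branch."""
    k = len(alphabet)
    counts = [0] * k
    out = []
    if not oracle(k - 1, mass):
        return out
    i, m, descending = k - 1, mass, True
    while True:
        if descending:
            if i == 0:
                if m % alphabet[0] == 0:    # leaf: remaining mass goes to a_1
                    counts[0] = m // alphabet[0]
                    out.append(counts.copy())
                    counts[0] = 0
                descending = False
            elif oracle(i - 1, m):
                i -= 1                      # branch 1: proceed to next element
            elif m >= alphabet[i] and oracle(i, m - alphabet[i]):
                counts[i] += 1              # branch 2: one more copy of a_i
                m -= alphabet[i]
            else:
                descending = False          # dead end: backtrack
        else:
            if counts[i] > 0:
                counts[i] -= 1              # undo branch 2 and keep rising
                m += alphabet[i]
            else:
                if i == k - 1:
                    return out              # back at the root: done
                i += 1                      # we were branch 1; try sibling
                if m >= alphabet[i] and oracle(i, m - alphabet[i]):
                    counts[i] += 1
                    m -= alphabet[i]
                    descending = True
```

With a direct oracle for the alphabet (2, 3), the enumeration visits exactly the same decompositions, in the same order, as the recursive version.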
5 Selecting optimal blowup factors
Transforming the real-valued decomposition problem into its integer-valued counterpart requires that we choose some blowup factor . Due to the rounding error correction, we have to decompose roughly “auxiliary” integers in addition to the “regular” integers, where . It is reasonable to ask for a blowup factor such that the ratio of additional integers is minimum. For “sufficiently small” we have and, hence, .
Since is bounded, we can make arbitrarily small by choosing an arbitrarily large blowup factor . But this is not realistic in applications, as memory requirements increase linearly with . To this end, we suppose that memory considerations imply an upper bound of . We want to find such that is minimized. We can explicitly find an optimal as follows: First, we consider the functions
for all . Each is a piecewise linear function with discontinuities . In every interval, this function has slope . Next, we set and for , we define as the maximum of and . Then, is a piecewise linear function with discontinuities. Finally, is a piecewise linear function with discontinuities. We sweep over the discontinuities from left to right, and for each discontinuity , we calculate all and . This can be easily achieved in time , where is the largest mass in the alphabet. On every linear piece of , the minimum must be attained at one of the endpoints, so it suffices to test the discontinuities to find the minimum of .
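The sweep can be sketched numerically as follows (a simplified version for illustration, under the assumption that the relevant discontinuities are the points where some floor term jumps; a tiny tolerance guards the floor against floating-point noise):

```python
import math

def best_blowup(alphabet, b_lo, b_hi):
    """Sweep the discontinuities b = n / a_i of floor(b * a_i) within
    [b_lo, b_hi] and return the blowup factor minimizing the largest
    relative rounding error Delta(b) = max_i delta_i(b).  Between two
    consecutive discontinuities every delta_i increases linearly, so the
    candidate minima are exactly the discontinuities themselves."""
    eps = 1e-9                              # guard against floating-point noise
    candidates = set()
    for a in alphabet:
        n = math.ceil(b_lo * a)
        while n / a <= b_hi:
            candidates.add(n / a)
            n += 1
    def delta(b):
        return max((b * a - math.floor(b * a + eps)) / math.floor(b * a + eps)
                   for a in alphabet)
    return min(sorted(candidates), key=delta)
```

For the toy alphabet (1.5, 2.0) and the range [1, 3], the sweep returns blowup 2.0, where both blown-up masses are integers and the rounding error vanishes.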
We compute optimal blowup factors for the default alphabet CHNOPS, and for the extended alphabet CHNOPSClBrI suggested in . Since we can find arbitrarily small blowup factors by increasing , any blowup factor can only be locally optimal: that is, for an upper bound and all we then have . See Table 2 for all locally optimal blowup factors in the range . We do not report blowup factors below as, for the mass accuracies considered here, such blowup factors result in a dramatic increase of false positive decompositions and, hence, are not useful in practice.
6 Range decompositions
Agarwal et al. suggested decomposing a range of masses for integers , instead of decomposing each mass individually. In theory, this does not noticeably improve running times: Using the approaches described above, we can iterate over all masses . Let denote the number of decompositions in this range; then this results in a total running time of for the approach of [4, 5]. Clearly, the additive term can be ignored in practice.
But from an algorithm engineering perspective, decomposing a range instead of an integer may result in considerable time savings: For the algorithm FindAllRecursive, this can significantly reduce the number of recursive function calls. To this end, given a range and an alphabet we assume an oracle with if and only if there is at least one that is decomposable over . Solely for the sake of clarity, we will assume to be fixed, although it obviously depends on the mass that we want to decompose. With this new oracle, we can reuse the algorithms from Fig. 1 and 2 without further changes.
In the following, let denote the original oracle for a single mass . Then, a straightforward oracle for the range decomposition is
But this requires calls of the oracle and results in a multiplicative factor of in the running time. Agarwal et al. suggested modifying the integer knapsack recurrence mentioned above, to capture the mass range: To this end, we initialize for and for . We use the same recurrence as above, namely for and otherwise. Unfortunately, for each that we want to consider for decomposing, this requires preprocessing and storing a dynamic programming table. Again, this is not desirable in applications, as the mass error of the measurement increases with mass. So, we have to compute and store a table for every .
But with a small trick, we can reduce the multiplicative factor for space from to : Let be an oracle with if and only if there is at least one that is decomposable over . Then, we can “recover” the oracle for the range as
We will now show how to use the extended residue table from  for range decompositions: We define a family of extended residue tables for , where is the minimum of all with , such that some is decomposable over . Again, if no such number exists. Now, we can re-use the oracle from (2): We have if and only if . Storing all extended residue tables requires space, and the tables can be computed in time using the following simple recurrence: We initialize and use
for , , and . Here, refers to the extended residue table for the single integer decomposition problem.
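One way to obtain this family of tables is the identity: some value in {m, ..., m + j} is decomposable iff some value in {m, ..., m + j - 1} is, or m + j itself is. The following sketch is our derivation of the resulting recurrence (not the paper's exact table layout); the ordinary ERT is hardcoded here for the toy alphabet (5, 3) to keep the example self-contained:

```python
import math

# ordinary ERT for the alphabet (5, 3): ERT0[i][r] is the smallest m with
# m % 5 == r that is decomposable over the first i + 1 masses
ERT0 = [[0, math.inf, math.inf, math.inf, math.inf],
        [0, 6, 12, 3, 9]]

def range_tables(ert0, a1, j_max):
    """tables[j][i][r] is a bound B such that some value in {m, ..., m + j}
    is decomposable over the first i + 1 masses iff m >= B, for m % a1 == r.
    Entries may become negative, meaning every such nonnegative m works."""
    tables = [ert0]
    for j in range(1, j_max + 1):
        prev = tables[-1]
        # m + j is decomposable iff m >= ert0[i][(r + j) % a1] - j
        cur = [[min(prev[i][r], ert0[i][(r + j) % a1] - j)
                for r in range(a1)] for i in range(len(ert0))]
        tables.append(cur)
    return tables

def range_decomposable(tables, a1, i, lo, hi):
    """Oracle for the interval [lo, hi]; the width hi - lo is assumed to be
    among the precomputed values, as fixed in advance in the text."""
    return lo >= tables[hi - lo][i][lo % a1]
```

Over (5, 3), the interval [7, 8] contains the decomposable mass 8 = 5 + 3, whereas the single mass 7 is not decomposable; the interval [1, 2] contains no decomposable mass at all.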
The iterative algorithm FindAllIterative does not consider the degenerate case where the width of the interval we want to decompose is large compared to the masses of the alphabet. In particular, every decomposition with mass at most can then be “completed” using element to find a decomposition with mass in . For this case, we have to adapt the algorithm by replacing the line marked in Fig. 2 by a loop over appropriate numbers of elements . But in this degenerate case, we find a decomposition for every leaf of the naïve search tree algorithm; so, the iterative version of this algorithm outperforms all other, more involved algorithms.
We implement the algorithms mentioned above in Java 1.6. SearchTree denotes the naïve search tree algorithm. We distinguish two algorithms based on the recursive decomposition (Fig. 1), namely Recursive+Knapsack using the knapsack DP, and Recursive+ERT using the extended residue table. We also implement two versions of the iterative decomposition (Fig. 2), namely Iterative+ERT and Iterative+Range which uses range decompositions from Sec. 6. For brevity, we exclude other combinations, as these will in all likelihood not result in better running times.
We evaluate the algorithms on four datasets: The Orbitrap dataset  contains 97 compounds measured on a Thermo Scientific Orbitrap XL instrument. The MassBank dataset  consists of 370 compounds measured on a Waters Q-Tof Premier spectrometer. The Eawag dataset  contains 60 compounds measured on an LTQ Orbitrap XL (Thermo Scientific) instrument and is also accessible from MassBank. The Hill dataset  consists of 102 compounds with 502 spectra measured on a Waters Micromass QTOF II instrument. We omit experimental details. All of these datasets are used in computations that require the decomposition of peak masses. See Table 3 for details.
We discard peaks with mass below Da because for such masses, the problem becomes easy regardless of the algorithm used. We report running times for decomposing a single peak mass as well as all peaks in a dataset. For algorithms working on integer masses, we generate integer-valued instances as described in Sec. 2. We use a mass accuracy of 20 ppm, so . For Recursive+Knapsack, Recursive+ERT, and Iterative+ERT, we decompose intervals by decomposing all integer values separately.
Running time measurements are done on an Intel Xeon E5645 with 48 GB RAM. For each algorithm, we repeat computations five times and report minimum running times. For the complete datasets, total running times can be found in Table 4 and Figure 3 for alphabets CHNOPSClBrI and CHNOPS. We find that the fastest ERT-based algorithm Iterative+Range was -fold faster than the SearchTree algorithm for alphabet CHNOPS; this increases to a -fold speedup for alphabet CHNOPSClBrI.
Replacing the knapsack DP by an ERT table results in a -fold speedup. In addition, memory requirements decrease considerably: For example, to decompose the maximal mass of with a blowup of and the extended alphabet, the integer knapsack DP table requires megabyte, whereas the ERT requires only megabyte.
Replacing the recursive search by its iterative counterpart has only limited impact: We find that the iterative algorithm Iterative+ERT is merely -fold faster than its recursive counterpart, Recursive+ERT.
We observe that the blowup factor has a major impact on running times. Choosing locally optimal blowup factors does in fact significantly reduce running times: For example, blowup factor results in -fold faster running times than . We test all locally optimal blowup factors from Table 2. We observe the best running time for blowup , whereas larger and smaller blowup factors result in increased running times. We refrain from reporting all running times; see Fig. 3 and Table 4 for some examples. The best blowup factor results in a two-fold speedup when compared to the “default” blowup factor from . As a pleasant side effect, this decreases the memory requirements of the algorithm by .
The range decomposition improves running times by about -fold for the best blowup factor. A stronger improvement is achieved when more integers are to be decomposed using, say, a larger blowup factor. Using blowup factor , memory increases to MB for storing ten ERT tables.
Finally, we repeat our experiments for an improved mass accuracy of ppm. For all algorithms except SearchTree, this results in roughly a - to -fold decrease of running time, whereas SearchTree running times do not change. For this mass accuracy and alphabet CHNOPSClBrI, the best algorithm Iterative+Range is -fold faster than the naïve SearchTree algorithm.
We suggest three techniques to improve the running time for decomposing real masses. We measure the improvements on four different datasets. All techniques together result in a -fold improvement in running time, compared to the Recursive+ERT algorithm from [4, 5]. We note in passing that the implementation of the Recursive+ERT algorithm used in this evaluation was two-fold faster than the one provided as part of SIRIUS . The competitive edge of the new method is even larger for “hard” problem instances, e.g. high masses, large mass deviations, and larger alphabets. Compared to the naïve search tree algorithm, we reach improvements between 56-fold and 1000-fold, reducing the total running times from hours to minutes or even seconds.
Regarding the degenerate case mentioned in Sec. 6, we argue that this case is of no interest, from either the practical or the theoretical side: Modern MS instruments easily reach mass accuracies of ppm and below, whereas metabolite and even peptide masses rarely exceed Da. Even a peptide mass of Da can be measured with an accuracy of at least Da, well below the mass of a single ¹H atom. From the theoretical side, we would have to deal with a humongous number of decompositions, rendering the time to compute the decompositions irrelevant in comparison to subsequent analysis steps.
We thank Tim White for proofreading earlier versions of this work.
-  D. Agarwal, F. Cazals, and N. Malod-Dognin. Stoichiometry determination for mass-spectrometry data: the interval cases. Research report 8101, Inria, Research Centre Sophia Antipolis – Méditerranée, Oct. 2012.
-  G. Audi, A. Wapstra, and C. Thibault. The AME2003 atomic mass evaluation (ii): Tables, graphs, and references. Nucl Phys A, 729:129–336, 2003.
-  S. Böcker, M. Letzel, Zs. Lipták, and A. Pervukhin. SIRIUS: Decomposing isotope patterns for metabolite identification. Bioinformatics, 25(2):218–224, 2009.
-  S. Böcker and Zs. Lipták. Efficient mass decomposition. In Proc. of ACM Symposium on Applied Computing (ACM SAC 2005), pages 151–157. ACM press, New York, 2005.
-  S. Böcker and Zs. Lipták. A fast and simple algorithm for the Money Changing Problem. Algorithmica, 48(4):413–432, 2007.
-  S. Böcker and F. Rasche. Towards de novo identification of metabolites by analyzing tandem mass spectra. Bioinformatics, 24:I49–I55, 2008. Proc. of European Conference on Computational Biology (ECCB 2008).
-  M. A. Cooper and D. Shlaes. Fix the antibiotics pipeline. Nature, 472(7341):32, 2011.
-  N. S. Cortina, D. Krug, A. Plaza, O. Revermann, and R. Müller. Myxoprincomide: a natural product from Myxococcus xanthus discovered by comprehensive analysis of the secondary metabolome. Angew Chem Int Ed Engl, 51(3):811–816, Jan 2012.
-  R. G. Dromey and G. T. Foyster. Calculation of elemental compositions from high resolution mass spectral data. Anal Chem, 52(3):394–398, 1980.
-  A. Fürst, J.-T. Clerc, and E. Pretsch. A computer program for the computation of the molecular formula. Chemom Intell Lab Syst, 5:329–334, 1989.
-  D. W. Hill, T. M. Kertesz, D. Fontaine, R. Friedman, and D. F. Grant. Mass spectral metabonomics beyond elemental formula: Chemical database querying by matching experimental with computational fragmentation spectra. Anal Chem, 80(14):5574–5582, 2008.
-  H. Horai, M. Arita, S. Kanaya, Y. Nihei, T. Ikeda, K. Suwa, Y. Ojima, K. Tanaka, S. Tanaka, K. Aoshima, Y. Oda, Y. Kakazu, M. Kusano, T. Tohge, F. Matsuda, Y. Sawada, M. Y. Hirai, H. Nakanishi, K. Ikeda, N. Akimoto, T. Maoka, H. Takahashi, T. Ara, N. Sakurai, H. Suzuki, D. Shibata, S. Neumann, T. Iida, K. Tanaka, K. Funatsu, F. Matsuura, T. Soga, R. Taguchi, K. Saito, and T. Nishioka. MassBank: A public repository for sharing mass spectral data for life sciences. J Mass Spectrom, 45(7):703–714, 2010.
-  S. Jarussophon, S. Acoca, J.-M. Gao, C. Deprez, T. Kiyota, C. Draghici, E. Purisima, and Y. Konishi. Automated molecular formula determination by tandem mass spectrometry (MS/MS). Analyst, 134(4):690–700, 2009.
-  R. L. Last, A. D. Jones, and Y. Shachar-Hill. Towards the plant metabolome and beyond. Nat Rev Mol Cell Biol, 8:167–174, 2007.
-  J. W.-H. Li and J. C. Vederas. Drug discovery and natural products: End of an era or an endless frontier? Science, 325(5937):161–165, 2009.
-  G. S. Lueker. Two NP-complete problems in nonnegative integer programming. Technical Report TR-178, Department of Electrical Engineering, Princeton University, Mar. 1975.
-  S. Martello and P. Toth. Knapsack Problems: Algorithms and Computer Implementations. John Wiley & Sons, Chichester, 1990.
-  M. Meringer, S. Reinker, J. Zhang, and A. Muller. MS/MS data improves automated determination of molecular formulas by mass spectrometry. MATCH-Commun Math Co, 65:259–290, 2011.
-  T. Pluskal, T. Uehara, and M. Yanagida. Highly accurate chemical formula prediction tool utilizing high-resolution mass spectra, MS/MS fragmentation, heuristic rules, and isotope pattern matching. Anal Chem, 84(10):4396–4403, 2012.
-  F. Rasche, K. Scheubert, F. Hufsky, T. Zichner, M. Kai, A. Svatoš, and S. Böcker. Identifying the unknowns by aligning fragmentation trees. Anal Chem, 84(7):3417–3426, 2012.
-  F. Rasche, A. Svatoš, R. K. Maddula, C. Böttcher, and S. Böcker. Computing fragmentation trees from tandem mass spectrometry data. Anal Chem, 83(4):1243–1251, 2011.
-  A. L. Robertson and M. C. Hamming. MASSFORM: a computer program for the assignment of elemental compositions to high resolution mass spectral data. Biomed Mass Spectrom, 4(4):203–208, 1977.
-  M. Rojas-Chertó, P. T. Kasper, E. L. Willighagen, R. J. Vreeken, T. Hankemeier, and T. H. Reijmers. Elemental composition determination based on MS. Bioinformatics, 27:2376–2383, 2011.
-  K. Scheubert, F. Hufsky, and S. Böcker. Computational mass spectrometry for small molecules. J Cheminform, 5:12, 2013.
-  M. A. Stravs, E. L. Schymanski, H. P. Singer, and J. Hollender. Automatic recalibration and processing of tandem mass spectra using formula annotation. J Mass Spectrom, 48(1):89–99, 2013.
-  H. Wilf. generatingfunctionology. Academic Press, second edition, 1994. Freely available from http://www.math.upenn.edu/~wilf/DownldGF.html.