Estimating Cellular Goals from
HighDimensional Biological Data
Abstract.
Optimizationbased models have been used to predict cellular behavior for over 25 years. The constraints in these models are derived from genome annotations, measured macromolecular composition of cells, and by measuring the cell’s growth rate and metabolism in different conditions. The cellular goal (the optimization problem that the cell is trying to solve) can be challenging to derive experimentally for many organisms, including human or mammalian cells, which have complex metabolic capabilities and are not well understood. Existing approaches to learning goals from data include (a) estimating a linear objective function, or (b) estimating linear constraints that model complex biochemical reactions and constrain the cell’s operation. The latter approach is important because often the known/observed biochemical reactions are not enough to explain observations, and hence there is a need to extend automatically the model complexity by learning new chemical reactions. However, this leads to nonconvex optimization problems, and existing tools cannot scale to realistically large metabolic models. Hence, constraint estimation is still used sparingly despite its benefits for modeling cell metabolism, which is important for developing novel antimicrobials against pathogens, discovering cancer drug targets, and producing valueadded chemicals. Here, we develop the first approach to estimating constraint reactions from data that can scale to realistically large metabolic models. Previous tools have been used on problems having less than 75 biochemical reactions and 60 metabolites, which limits reallifesize applications. We perform extensive experiments using 75 largescale metabolic network models for different organisms (including bacteria, yeasts, and mammals) and show that our algorithm can recover cellular constraint reactions. The recovered constraints enable accurate prediction of metabolic states in hundreds of growth environments not seen in training data, and we recover useful cellular goals even when some measurements are missing.
1. Introduction and Related Work
Engineered microbial and mammalian cells are used as production platforms to synthesize commodity or specialty chemicals and pharmaceuticals. Accurate computational models of cell metabolism and protein expression are important to design cell factories and to maximize product yield and quality (Yim et al., 2011; Kuo et al., 2018). Similarly, mathematical models of cell metabolism have been used to identify strategies to improve the efficacy of existing antibiotics (Brynildsen et al., 2013).
The ability of engineers to predict microbial behavior was facilitated in 1990 by the observation that overflow metabolism—an industriallyrelevant metabolic behavior—in Escherichia coli could be predicted by a relatively simple network of reactions with capacityconstrained flows (Majewski and Domach, 1990). Since then, this constrained optimization model of cellular metabolism (often called COBRA) has been applied to over species across the tree of life (Monk et al., 2014). Metabolic reconstructions today are “genomescale”—i.e., they account for the majority of metabolic genes annotated in the organism’s genome—and consist of hundreds to thousands of biochemical reactions and metabolites. For example, the most recent reconstruction of human metabolism includes 13,543 metabolic reactions involving 4,140 metabolites (Brunk et al., 2018), while the latest multiscale model of metabolism and protein expression for E. coli (Lloyd et al., 2018) consists of 12,655 reactions and 7,031 components including macromolecules like protein, RNA (ribonucleic acid), and ribosome.
As of mid2013, over 640 published studies used COBRA for experimental investigation in various domains of engineering and health sciences (Bordbar et al., 2014). Successful applications of COBRA include engineering microbes to produce commodity or valuable chemicals (Yim et al., 2011), developing novel antimicrobials against infectious disease (Brynildsen et al., 2013), and discovering new drug targets against cancer (Folger et al., 2011).
In its basic form, COBRA predicts reaction fluxes (reaction rate normalized by dry weight of biomass) in a metabolic network, consisting of metabolites and reactions, by solving a linear program that models cellular goal:
(1) 
where is a vector of objective coefficients, denotes the transpose, is a matrix of stoichiometric coefficients (one column per biochemical reaction), is a vector of metabolite accumulation or depletion rates, and are lower and upper flux bounds.
In order to make accurate predictions, these models require an accurate matrix, which is nowadays reliably reconstructed from extensive knowledgebases of enzyme biochemistry, and genome annotations for thousands of species and strains organisms. The other key ingredient is the cellular objective function, . Other objectives, including nonlinear functions, have been tested (Schuetz et al., 2007). For many microbes cultured in nutrientlimiting conditions, maximization of cell growth rate (per unit of limiting nutrient) is an accurate objective function (Schuetz et al., 2007). This particular function corresponds to a that is an indicator vector with a in the component associated with the reaction that corresponds to the cell growth, and zero everywhere else.
Currently, systemlevel understanding of living organisms requires the analysis of largescale measurements originating from disparate biological processes operating at multiple length and time scales. Such measurements are collectively referred to as “omics”, as they involve measuring the complete makeup of a given biological variable (e.g., proteomics attempts to measure the complete protein composition of a cell). Analysis of such omics measurements has shown that, despite the complexity inherent in living systems, relatively simple models are accurate enough for several biological studies. E.g., the gene expression profiles of various cell types can be described in terms of relatively few biological functions (Hart et al., 2015).
Similarly, the metabolic fluxes in a cell can be predicted by assuming that the cell is solving an optimization problem shaped by its evolutionary history (Feist and Palsson, 2016). The problem includes constraints that model biochemical “reactions” consuming and producing metabolites at specific stoichiometric ratios, and an objective that depends on the fluxes through the reactions. For example, the metabolism of fastgrowing microbes (midlog phase) is predicted accurately by a linear problem such as (1). As another example, the metabolism of mammalian cells (e.g., hybridoma), is explained well by minimization of the flux through the reaction that makes reactive oxygen species (Feist and Palsson, 2016). Studies have identified alternative objective functions that best predict microbial metabolism under different growth conditions (Schuetz et al., 2007). These objectives include maximizing ATP yield per flux unit and maximizing ATP or biomass yield.
While the aforementioned studies have focused on using predefined optimization problems, a number of studies have investigated datadriven estimation of cellular goals. Two types of method exist. Methods of the first type estimate how important the different chemical reactions are for the cell’s operation, i.e., estimate in (1). For example, ObjFind (Burgard and Maranas, 2003) does this through a nonconvex optimization formulation, and the more recent invFBA (Zhao et al., 2016) solves a linear program. The second type, and the focus of our work, estimate the stoichiometric coefficients of a new chemical reaction, i.e., estimate a new column for matrix in (1). This approach is important because often the known biochemical reactions are not enough to explain observed data. We want to extend the model complexity automatically by learning new chemical reactions, i.e., new columns for .
The main drawback of estimating new reactions is that it requires solving nonconvex optimization problems. Currently, only the BOSS tool (Gianchandani et al., 2008) does this. BOSS was shown to recover known biomass reactions successfully in synthetic experiments involving less than reactions and metabolites. This is not large enough for reallife applications, which can involve thousands of reactions/metabolites. Note that BOSS (a) uses an offtheshelf solver (e.g. MINOS) that cannot exploit parallelism, (b) cannot induce sparsity on the learnt reaction, and (c) has not been tested on large problems.
Here we address these important limitations. Our main contributions are the following.

We develop BIGBOSS (https://github.com/laurenceyang33/cellgoal), the first tool that can learn biochemical reactions in large models (with up to 4,456 metabolites and 6,663 reactions).

BIGBOSS is based on the alternating direction method of multipliers (ADMM). It employs an adaptive penalty and a preconditioning scheme, and hence requires no tuning. Further, it can exploit multiple CPUs to speed up computations on large problems.

BIGBOSS uses a sparsityinducing regularization to allow users to control the complexity of inferred reactions.

We test and report the accuracy of BIGBOSS on 75 genomescale models of cell metabolism including prokaryotes, eukaryotes, and mammalian cells, using both clean and corrupted input datasets.
2. Reaction estimation as an optimization problem
We estimate a cellular goal using measured fluxes (cf. in (1)) as input. These fluxes are measured for different cell growth environments, e.g., growth on glucose, or with/without oxygen. The part of the cellular goal that we estimate here is one new metabolic reaction, in particular its stoichiometric coefficients. In other words, we want to learn a new column for in (1) in order to explain observations better. We focus on one reaction for simplicity, but BIGBOSS can be used to estimate simultaneously the stoichiometric coefficients of multiple new biochemical reactions.
2.1. Bilevel optimization problem
We formulate our problem as a bilevel optimization problem (Bard, 2013). We start from fluxes measured across growth conditions, which is accomplished by feeding an organism carbon13 labeled nutrients and tracing their distribution using spectroscopy/spectrometry (Antoniewicz, 2015). Each vector contains the fluxes through a subset of all reactions taking place in the cell. We also fix the flux , through the new undefined reaction whose stoichiometric coefficients we want to estimate. Given a set of coefficients , and for each growth condition , we assume that the cell fluxes optimize a standard COBRA model:
(2)  
(3)  subject to  
(4) 
where and (known) are the lower and upper flux bounds, (known) is the rate of accumulation or depletion of different metabolites, (known) is the relative importance of the different known chemical reactions in the model, (known) is the relative importance of the new chemical reaction we want to learn, and (known) are the stoichiometric coefficients of the known chemical reactions in the model.
Our goal is to find sparse reaction coefficients (unknown) for the new reaction, such that the resulting fluxes explain the measured fluxes as well as possible. For each growth condition , we keep track of which of the fluxes we measure in a binary matrix that, when applied to , selects the correct measured components. This leads to the formulation that BIGBOSS solves:
(5)  
subject to 
where controls sparsity. Larger encourages reactions having few reactants and products.
We reformulate this bilevel optimization problem as a singlelevel optimization problem.
Theorem 2.1 ().
The proof of Theorem 2.1 follows the same ideas as in (Gianchandani et al., 2008). In a nutshell, we write the KarushKuhnTucker (KKT) conditions for the convex problem (2), and add constraint (10) to say that the primal and dual objective values for (2) are equal, which is true by strong duality.
The bilinear constraints (9) make the problem nonconvex. Furthermore, all of the constraints are coupled by the variable , the new chemical reaction that we want to learn. If it were not for (9), we could decouple problem (6) into independent generalized LASSO problems (Xu et al., 2017). To solve (6)–(12), we use the Alternating Direction Method of Multipliers (ADMM) (Derbinsky et al., 2013b; Bento et al., 2013; Boyd et al., 2011).
3. Summary of the notation used
All of the variables and parameters that we use are defined in Table 1. Vectors are written in bold, matrices are capitalized, and scalars are lowercase.
A few variables are defined locally, and their definitions only hold inside the section, or subsection, where they are defined. These are the only variables not listed in the notation tables.
Variable  Space  Description 
Model variables  
n  number of fluxes in the cell  
m  number of metabolites in the cell  
k  number of experimental conditions  
p  number of measured fluxes  
condition  
,  all fluxes in model  
,  measured fluxes  
relative importance of reaction  
stoichiometric coefficients  
,  rate of accumulation or depletion of metabolites (typically zero)  
,  flux lower bounds  
,  flux upper bounds  
coefficients of the new reaction, which we want to learn from data  
,  fixed flux through new reaction  
,  dual variables for linear constraints  
,  dual variables for flux lower bounds,  
,  dual variables for flux upper bounds,  
,  relative importance of the new reaction  
sparsity regularization weight  
Factorgraph and ADMM variables  
QP  Quadratic functionnode  
Bilinear functionnode  
BD  Boundary functionnode  
SP  Lnorm functionnode  
FX  Fluxes variablenode  
NR  New reaction Variablenode  
DB  Dual variable for flux bounds variablenode  
Dual variable for equilibrium constraint variablenode  
Primal iterate  
Consensus iterate  
Dual iterate  
Proximal operator (PO) map  
ADMM PO tuning parameter  
ADMM overrelaxation tuning parameter  
ADMM stepsize tuning parameter  
,  Input to PO map 
4. Solution procedure using ADMM
BIGBOSS’s inner workings are based on the overrelaxed, distributed consensus, ADMM algorithm (Boyd et al., 2011; Nishihara et al., 2015). This choice is based on the fact that ADMM (i) allows us to easily parallelize BIGBOSS, (ii) has good empirical performance for several nonsmooth, nonconvex problems (Bento et al., 2013; Hao et al., 2016; Derbinsky et al., 2013b; Mathy et al., 2015; Zoran et al., 2014; Derbinsky et al., 2013a), and (iii) has global optimality guarantees (under mild convexity assumptions) and a convergence rate that (under proper tuning) equals the convergence rate of the fastest possible firstorder method (França and Bento, 2016), although itself not a firstorder method.
Let and . Let be a function that takes the value if the constraint (7) is not satisfied, and zero if it is. Let be defined analogously. For the special case of constraints (9), we define one function for each condition in (9). To explain how we solve problem (6)–(12) using ADMM, it is convenient to perform two simple rerepresentations of (6)–(12) using the functions .
First, we rewrite problem (6)–(12) as
Second, we represent this unconstrained problem in a factorgraph form, i.e., a bipartite graph connecting objective functions (functionnodes) to the variables that they depend on (variablenodes), as shown in Figure 1. There are functionnodes and variablenodes in total.
We interpret ADMM as an iterative scheme that operates on iterates that live on the edges/nodes of the factorgraph in Figure 1, similar to the approaches in (Derbinsky et al., 2013b; Derbinsky et al., 2013a; Hao et al., 2016). The functionnodes have the following labels: quadratic (QP), sparse (SP), bilinear (), and bound (BD). The variablenodes have the following labels: flux (FX), new reaction (NR), dual equilibrium (), and dual bound (DB). Each edge (functionnode, variablenode), such as , has two associated iterates: a primal iterate and a dual iterate . Both iterates are vectors whose dimension matches the variable in the variablenode that they connect to. Each edge also has an associated parameter . For example, the dual iterate and parameter associated with the edge are (dimension ) and respectively. Each variablenode has one associated iterate: a consensus iterate whose dimension matches the variable associated with the variablenode. For example, the consensus iterate associated with node DB is (dimension ).
Essential to ADMM is the concept of a proximal operator (PO) of a function with parameter . This is a map that takes as input and outputs
(13) 
We denote the POs for the function in each functionnode as , , , and respectively. All POs have uniquely defined outputs.
Given this setup, the overrelaxed distributedconsensus ADMM algorithm leads directly to Algorithm 1, which is the basis of BIGBOSS. In Algorithm 1, if is a variablenode, represents all edges incident on . Similarly, if is a functionnode, represents all edges incident on . Note that each forloop in Algorithm 1 can be computed in parallel.
The rate of convergence of ADMM is influenced by the multiple ’s, , and , and there are various methods for choosing good values or adapting them while ADMM is running (He et al., 2000; Boyd et al., 2011; Derbinsky et al., 2013b; Xu et al., 2016, 2017; Stellato et al., 2017; Wohlberg, 2017). As described in Section 4.6, we implement the residual balancing method (Stellato et al., 2017). Although this scheme is not provably optimal, the algorithm converges very quickly in practice, and the BIGBOSS user need not worry about tuning in general. In BIGBOSS, all of the ’s corresponding to edges incident on the same functionnode in the factorgraph have the same value, , throughout ADMM’s execution. See Section 4.6 for more details. In this work, we run ADMM until all of the iterates are identical between iterations to a numerical absolute tolerance of . This is our stopping criterion.
Most updates in Algorithm 1 are linear. The exceptions are updates involving POs, which require solving (13) for different functions . A vital part of our work was to find efficient solvers for each PO. Our decomposition of problem (6)(12) into the specific form of Figure 1 is not accidental. It leads to a very small number of POs, all of which can be evaluated quickly. ( amounts to solving a linear system, and the others can be computed in closed form.) We expand on this in the next sections.
4.1. Proximal operator for node Qp
The functionnode QP is a quadratic objective subject to some linear constraints. To explain how we compute , we start by defining for each condition , writing
and writing the constraints as , , where
Evaluating for input vectors now amounts to finding the unique minimizer of a strongly convex quadratic program with linear constraints, namely,
We find this unique minimizer by solving a linear system obtained from the KKT conditions. We then have linear systems that are coupled by the variable that is common to each . We can write each linear system in block form as
(14) 
where is the Lagrange multiplier for the linear constraints . Finally, we stack the linear systems into one large linear system^{1}^{1}1If is very large, we can also split the functionnode QP into functionnodes, one per condition . This will result in smaller linear systems, now decoupled, that can be solved in parallel.. For numerical stability reasons, we add , , to the system’s matrix. This ensures a unique solution. As described in Section 4.5, we also implemented a preconditioner for this linear system to improve its condition number.
To solve the linear system, we compared two different numerical libraries. One is the C library UMFPACK, which we called from Fortran using the Fortran 2003 interface in mUMFPACK 1.0 (Hanyk, [n. d.]). The other is the symmetric indefinite solver MA57 (Duff, 2004) from HSL (HSL, 2013). For both solvers, the symbolic factorization needs to be computed only once. When is updated, only the numerical factorization is recomputed. As other groups have reported (Gould et al., 2007), we found that MA57’s solve phase was faster than UMFPACK’s. We therefore used MA57 for all results in this study.
4.2. Proximal operator for node Db
Observe that the constraint is equivalent to constraints of the type for some constant vectors : one for each condition and each of the variables (for index ), ( for index ), and (for index ). Hence, computing for some input amounts to solving independent problems
This problem has a unique minimizer given explicitly by
where and are taken componentwise.
4.3. Proximal operator for node Sp
The PO for node SP, which has , requires finding the minimizer of
This minimizer is , where is the elementwise softthresholding operator (Boyd et al., 2011) defined, for input scalar , as
(15) 
4.4. Proximal operator for node
The th bilinear proximal operator receives input vectors , and outputs the minimizer of
(16)  
(17)  subject to 
To solve this problem, we consider two cases. If , we know that the minimizer is . Otherwise, we know that the constraint (17) is active, and we compute the minimizer from among the set of solutions of the KKT conditions for the (now constrained) problem. The KKT conditions are
where is the dual variable for the constraint . By defining and reformulating the KKT system, we arrive at the following quartic equation that determines the value of :
(18) 
We solve this quartic equation in closed form, using the Quartic Polynomial Program Fortran code (Morris Jr, 1993; Carmichael, [n. d.]). For each real root, , we compute and as
(19) 
and we keep the solution that minimizes the objective function (16).
4.5. Preconditioner
The convergence rate of the ADMM can be improved by preconditioning the data matrices associated with the node QP (Stellato et al., 2017; Takapoui and Javadi, 2016).
We use the Ruiz equilibration method (Ruiz, 2001) to compute positive definite diagonal matrices, and , with which we scale all of the variables in our formulation, as well as the constraints , associated with the node QP. The new scaled variables are defined through the change of variable , and the new constraints then transform as . Note that all of the copies of inside each get scaled exactly in the same way.
This scaling affects the different proximal operators in different ways. The node QP, that requires computing , now requires solving a modified version of (14), namely, the following system of linear equations (coupled by ),
For , let , , be the diagonal block of that scales the variables (), (), and (). We now solve problems of the type
(20)  
(21) 
which has a unique minimizer
For the node SP, let be the diagonal block of that scales the variable . The new now outputs
where, for the th component of , we use the th diagonal element of in the operator , that is applied componentwise.
For the th bilinear proximal operator , we now solve
subject to 
where is the diagonal block of that scales the variable . We modify the Ruiz equilibration method such that , and hence we can solve this PO using the same method as before.
4.6. Adaptive penalty
Our ADMM scheme allows for one per each edge in the factorgraph in Fig. 1. In BIGBOSS, these ’s are chosen automatically, such that a user does not need to worry about tuning. All of the ’s associated with edges incident on the same functionnode have the same value. Hence, we refer to the different ’s by their respective PO. We update the ’s for each PO dynamically every iterations using the residual balancing method (Boyd et al., 2011), which we now explain.
Similar to (Stellato et al., 2017; Wohlberg, 2017), we define the vector of relative primal residuals at iteration , and for any proximal operator , as
where, we recall, are all the edges incident on node .
The vector of relative dual residuals for the proximal operator at iteration is defined as follows. Let
We define
If is a PO other than QP, we define
since and do not exist when is not QP.
The update rule is then defined as follows. Once the norm of both the primal and the dual residuals falls below a threshold of , we start updating as
where
where and , which in this study we set to , is a threshold to prevent the from changing too frequently. Note that a frequentlychanging requires many numerical refactorization for solving the linear system in the node QP. Furthermore, a frequentlychanging might compromise the convergence of the ADMM.
4.7. Step size
The step size in Algorithm 1 can be adjusted to speed up convergence. We used a default step size of 1.0 for all problems. If the problem did not converge to the desired tolerance within 2 million iterations, we reran the problem with a simple step size heuristic. We use for 10,000 iterations, for 100,000 iterations, then until convergence, or the iteration limit is reached.
5. Experimental results
5.1. BOSS versus BIGBOSS
We compare BIGBOSS with BOSS in terms of running speed, and accuracy. To make a valid comparison, we set and , such that BIGBOSS is doing the same inference as BOSS (Gianchandani et al., 2008). BOSS requires choosing an offtheshelf nonlinear programming solver, we choose IPOPT (Wächter and Biegler, 2006) (v24.7.1), an opensource, interiorpoint solver used to solve largescale nonlinear programs.
We generate synthetic flux measurements, , using the latest metabolic network of E. coli (Monk et al., 2017) called iML1515. This network has metabolites and reactions. We then hide one goal reaction from the network, , and try to infer it from flux data.
Using a single CPU core (Intel i7, GHz), BOSS is able to learn for this model in seconds (wall time) using IPOPT to primal and dual residuals of and . Using the same CPU core, BIGBOSS is able to learn in seconds to primal and dual residuals of and . In Appendix A.3, we report the runtime for other accuracy values. Although BIGBOSS is slower than BOSS, it is still fast, and, more importantly, allows , and the use of multiple cores for large models.
Now we look at how well the recovered reaction coefficients compare to the true coefficients that we hide, both in BOSS and BIGBOSS. We do so by (i) directly comparing the ground truth with the recovered , as well as, (ii) using the recovered to predict measured fluxes (training fluxes).
For BOSS, the recovered goal reaction coefficients closely resemble the true reaction coefficients, with Spearman rank correlation, (Fig. 2a). With the recovered back into the model, we then solve (1) to simulate fluxes. For BOSS, the estimated goal reaction has zero flux, resulting in simulated fluxes that do not resemble the training fluxes (Fig. 2c). To help BOSS, we make sparse by zeroing coefficients with absolute value smaller than a fixed threshold . Since this cannot be done in a real setting, it shows a limitation in BOSS. We tested thresholds between to and found that values between to result in accurate flux predictions ().
5.2. Experiments with largescale models
We test our algorithm’s performance on largescale metabolic networks in the BiGG database (King et al., 2015). Model sizes ranged from to reactions and to metabolites. The median model size is reactions and metabolites. For each model, we generate synthetic training fluxes by solving Problem (1) using the existing cellular goal reaction. To generate test data, we do as above but in a different simulated nutrient environment. This is accomplished by changing the bounds and on nutrient transport reactions. Our algorithm then receives as input the training fluxes, the bounds, and an incomplete stoichiometric matrix, which is missing the column corresponding to the “ground truth” cellular goal reaction that we want to estimate.
We evaluate our algorithm’s performance based on three metrics: (i) how well the estimated reaction allows to predict training fluxes, (ii) how well the estimated reaction allows to predict testing fluxes, and (iii) how close the estimated reaction coefficients are to the ground truth coefficients. To evaluate metrics (i) and (ii), we insert the estimated reaction into the metabolic network, having first removed the true goal reaction, and then simulate fluxes.
Based on the coefficient of determination () between measured and predicted fluxes, both training and test performances were high (median of and ) (Table 2). To assess the significance of these values, we compare them with fluxes generated from random goal reactions. These random goal reactions are obtained by randomly shuffling the coefficients of the ground truth coefficients. We say that the based on our algorithm is significant if it exceeds 95% of the values that were computed using the 1,000 random goal reactions. Overall, values are significant for of the models for training data and of the models for test data (Table 2). The ground truth goal reaction is recovered accurately: across models, the median Pearson correlation was (Table 3). The Spearman rank correlation was lower, with median . The fact that predicted fluxes are accurate indicates that metabolic flux predictions are relatively insensitive to the precise recovery of the true coefficients in .
In biological experiments, the majority of fluxes are unmeasured, so missing flux measurements is common (Antoniewicz, 2015). We test the effect of missing flux measurements by randomly removing % and % of measurements, repeated five times. The main consequence of missing measurements is that recovering the ground truth goal is more difficult. With and of missing measurements, median Pearson correlations of and are obtained (Table 2). The accuracy of predicted fluxes also deteriorate when measurements are missing (Table 2). However, depending on which reactions are measured, fluxes can be predicted with for of the cases with of missing measurements, and for of the cases with of missing measurements, in the case of test data. This result shows that certain fluxes are more important than others, and that BIGBOSS can recover goal reactions when those important fluxes are measured.
% missing  % cases,  % cases,  

Data  fluxes  mean  median  significant  
Train  0.0  0.711  0.993  63.8  70.1 
10.0  6.89  0.56  12.2  19.7  
50.0  38.9  0.604  4.98  6.45  
Test  0.0  0.666  0.977  59.4  67.2 
10.0  4.7  0.33  6.67  15  
50.0  29.2  0.431  3.16  6.56 
% missing  Metric  min  max  mean  median 

fluxes  
0.0  Pearson  0.021  1  0.96  1 
Spearman  0.1  0.76  0.48  0.48  
10.0  Pearson  0.74  1  0.063  0.023 
Spearman  0.12  0.9  0.59  0.63  
50.0  Pearson  0.92  0.82  0.032  0.0012 
Spearman  0.27  0.76  0.38  0.4 
5.3. Effect of multiple data types
Next, we investigate if including data types other than metabolic fluxes can improve reaction estimation. In particular, the concentration of over of cell proteins and transcripts (RNA) can be measured (Schmidt et al., 2016). These data types have been used to identify statistically enriched biological functions (Hart et al., 2015) but not to estimate reactions. Thus, to include protein concentrations in our algorithm, we extend (1), similar to (Beg et al., 2007). (Details in Supplement A.1.)
We estimated goal reactions with three different data sets: only metabolic fluxes, only proteins, and both. To reflect realistic data, we include only metabolic fluxes for central metabolism, and include all proteins—this is achievable using proteomics, or with RNA sequencing where RNA is used to estimate protein concentrations. We then use the estimated goal reaction to predict all metabolic fluxes and protein concentrations (including values not seen by the algorithm) in leftout conditions to assess the test performance.
When only metabolic fluxes are provided, the prediction accuracy is low (mean ) (Table 4). When only proteins are provided, the accuracy is even lower (mean ). One reason for the low accuracy is that protein concentrations are not always directly proportional to flux—they only provide an upper bound. When both fluxes and proteins are provided, the average test accuracy is , which is double the accuracy with only flux measurements.
Coverage (%)  

Flux  Protein  mean  min  max 
57  100  0.567  0.108  0.77 
57  0  0.281  0.225  0.322 
0  100  16.375  23.556  11.293 
5.4. Multicore speedup
Problem (5) can grow rapidly in size and can benefit from parallel computing. BOSS cannot exploit parallel computational resources, but BIGBOSS can. We thus tested how BIGBOSS scales with multiple processors, using the same metabolic network as in Section 5.1. To maximize parallelism, we implemented a finegrained version of our algorithm, similar to (Hao et al., 2016). The fact that we are using ADMM is important to achieve this. This implementation involves splitting node QP: one for every constraint. By combining this finegrained version of QP with BD and BI, we have our finegrained BIGBOSS. This implementation of BIGBOSS gave 1.6 speedup with 2 CPUs, and 7 speedup with 64 CPUs (Fig. 3). With the speedup was greater, achieving with 32 and 64 CPUs.
In these experiments, we did not perform preconditioning or adaptive penalty to prevent factors besides CPU count from affecting speed. To parallelize our code, we used OpenACC and the PGI compiler 18.10. These tests were performed on Google Compute Engine virtual machines with 32core (64thread) Intel Xeon 2.30 GHz processors and 64 GB RAM.
6. Conclusion
In this work, we addressed the problem of estimating cellular goal reactions from measured fluxes in the context of constraintbased modeling of cell metabolism. This problem was previously addressed by the BOSS algorithm (Gianchandani et al., 2008), which successfully demonstrated the viability of this approach on a model of Saccharomyces cerevisiae central metabolism composed of 60 metabolites and 62 reactions. Here, we developed a new method that extends BOSS and demonstrates its performance on 75 metabolic networks having up to 10,600 reactions and 5,835 metabolites.
Our method successfully estimated goal reactions that enabled accurate prediction of metabolic fluxes in new nutrient environments (median coefficient of determination, ). Furthermore, the stoichiometric coefficients of the estimated reactions matched those of the ground truth reactions that were used to generate the training fluxes (median Pearson correlation, ).
As with the original BOSS, our algorithm involves a nonconvex optimization problem with bilinear constraints. ADMM is an efficient approach for solving problems with many bilinear constraints, and more generally, nonconvex quadratic constraints (Huang and Sidiropoulos, 2016). For our problem, the number of bilinear constraints can be large, as it increases with the number of conditions in the data. Besides scalability, BIGBOSS allows for regularization, modularity—enabling extensions, and can use multiple data types (fluxes and protein concentrations), which improves prediction accuracy.
Genomescale metabolic network models have been invaluable in numerous studies including infectious disease and cancer metabolism (Bordbar et al., 2014). The success of these studies depends on the availability of a reaction representing cellular goals. For many important cellular systemss, such as human tissue or microbiomes, such reactions are still challenging to determine experimentally (Feist and Palsson, 2016). Our study shows that a datadriven approach for cellular goal estimation is promising. This approach is most effective when we have high coverage flux measurements or complementary data types, such as flux measurements for parts of the network and protein or RNA abundance measurements for most of the metabolic genes. BIGBOSS is thus suited for analyzing large gene expression data sets, e.g., from cancer tissue. BIGBOSS is available at https://github.com/laurenceyang33/cellgoal.
7. Acknowledgements
This work was supported by the NIH grants 2R01GM057089 and 1U01AI124302, NSF grant IIS1741129, and Novo Nordisk Foundation grant NNF10CC1016517.
References
 (1)
 Antoniewicz (2015) M. R. Antoniewicz. 2015. Methods and advances in metabolic flux analysis: a minireview. J Ind Microbiol Biot 42, 3 (2015), 317–325.
 Bard (2013) J. F. Bard. 2013. Practical bilevel optimization: algorithms and applications. Vol. 30. Springer Science & Business Media.
 Beg et al. (2007) Q. K. Beg, A. Vazquez, J. Ernst, M. A. de Menezes, Z. BarJoseph, A.L. Barabasi, and Z. N. Oltvai. 2007. Intracellular crowding defines the mode and sequence of substrate uptake by Escherichia coli and constrains its metabolic activity. Proc. Natl. Acad. Sci. USA 104 (2007), 12663–12668.
 Bento et al. (2013) J. Bento, N. Derbinsky, J. AlonsoMora, and J. S. Yedidia. 2013. A messagepassing algorithm for multiagent trajectory planning. In Adv Neur In. 521–529.
 Bordbar et al. (2014) A. Bordbar, J. M. Monk, Z. A. King, and B. O. Palsson. 2014. Constraintbased models predict metabolic and associated cellular functions. Nat Rev Genet 15 (2014), 107–120.
 Boyd et al. (2011) S. Boyd, N. Parikh, E. Chu, B. Peleato, J. Eckstein, et al. 2011. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine learning 3, 1 (2011), 1–122.
 Brunk et al. (2018) E. Brunk, S. Sahoo, D. C. Zielinski, A. Altunkaya, A. Dräger, et al. 2018. Recon3D enables a threedimensional view of gene variation in human metabolism. Nat Biotechnol 36, 3 (2018), 272.
 Brynildsen et al. (2013) M. P. Brynildsen, J. A. Winkler, C. S. Spina, I. C. MacDonald, and J. J. Collins. 2013. Potentiating antibacterial activity by predictably enhancing endogenous microbial ROS production. Nat Biotechnol 31, 2 (2013), 160–165.
 Burgard and Maranas (2003) A. P. Burgard and C. D. Maranas. 2003. Optimizationbased framework for inferring and testing hypothesized metabolic objective functions. Biotechnol Bioeng 82 (2003), 670–677.
 Carmichael ([n. d.]) R. L. Carmichael. [n. d.]. Public Domain Aeronautical Software. http://www.pdas.com. Accessed: 20180812.
 Derbinsky et al. (2013b) N. Derbinsky, J. Bento, V. Elser, and J. S. Yedidia. 2013b. An improved threeweight messagepassing algorithm. arXiv preprint arXiv:1305.1961 (2013).
 Derbinsky et al. (2013a) N. Derbinsky, J. Bento, and J. S. Yedidia. 2013a. Methods for integrating knowledge with the threeweight optimization algorithm for hybrid cognitive processing. arXiv preprint arXiv:1311.4064 (2013).
 Duff (2004) I. S. Duff. 2004. MA57—a code for the solution of sparse symmetric definite and indefinite systems. ACM T Math Soft 30, 2 (2004), 118–144.
 Feist and Palsson (2016) A. M. Feist and B. O. Palsson. 2016. What do cells actually want? Genome biology 17, 1 (2016), 110.
 Folger et al. (2011) O. Folger, L. Jerby, C. Frezza, E. Gottlieb, E. Ruppin, and T. Shlomi. 2011. Predicting selective drug targets in cancer through metabolic networks. Mol Syst Biol 7, 1 (2011), 501.
 França and Bento (2016) G. França and J. Bento. 2016. An explicit rate bound for overrelaxed ADMM. In IEEE International Symposium on Information Theory. IEEE, 2104–2108.
 Gianchandani et al. (2008) E. P. Gianchandani, M. A. Oberhardt, A. P. Burgard, C. D. Maranas, and J. A. Papin. 2008. Predicting biological system objectives de novo from internal state measurements. BMC bioinformatics 9, 1 (2008), 43.
 Gould et al. (2007) N. I. M. Gould, J. A. Scott, and Y. Hu. 2007. A numerical evaluation of sparse direct solvers for the solution of large sparse symmetric linear systems of equations. ACM T Math Software 33, 2 (2007), 10.
 Hanyk ([n. d.]) L Hanyk. [n. d.]. UMFPACK Fortran interface. http://geo.mff.cuni.cz/~lh/Fortran/UMFPACK/. Accessed: 20180818.
 Hao et al. (2016) N. Hao, A. Oghbaee, M. Rostami, N. Derbinsky, and J. Bento. 2016. Testing finegrained parallelism for the ADMM on a factorgraph. In IEEE International Parallel and Distributed Processing Symposium Workshops. IEEE, 835–844.
 Hart et al. (2015) Y. Hart, H. Sheftel, J. Hausser, P. Szekely, N. B. BenMoshe, Y. Korem, A. Tendler, A. E. Mayo, and U. Alon. 2015. Inferring biological tasks using Pareto analysis of highdimensional data. Nature Methods 12 (2015), 233–235.
 He et al. (2000) B. S. He, H. Yang, and S. L. Wang. 2000. Alternating direction method with selfadaptive penalty parameters for monotone variational inequalities. Journal of Optimization Theory and Applications 106, 2 (2000), 337–356.
 HSL (2013) HSL. 2013. A collection of Fortran codes for large scale scientific computation. http://www.hsl.rl.ac.uk
 Huang and Sidiropoulos (2016) K. Huang and N. D. Sidiropoulos. 2016. ConsensusADMM for general quadratically constrained quadratic programming. IEEE Transactions on Signal Processing 64, 20 (2016), 5297–5310.
 King et al. (2015) Z. A. King, J. Lu, A. Dräger, P. Miller, S. Federowicz, et al. 2015. BiGG Models: A platform for integrating, standardizing and sharing genomescale models. Nucleic Acids Res 44, D1 (2015), D515–D522.
 Kuo et al. (2018) C. Kuo, A. W. T. Chiang, I. Shamie, M. Samoudi, J. M. Gutierrez, and N. E. Lewis. 2018. The emerging role of systems biology for engineering protein production in CHO cells. Curr Opin Biotechnol 51 (2018), 64–69.
 Lloyd et al. (2018) C. J. Lloyd, A. Ebrahim, L. Yang, Z. A. King, E. Catoiu, et al. 2018. COBRAme: A computational framework for genomescale models of metabolism and gene expression. PLoS Comput Biol 14, 7 (2018), e1006302.
 Majewski and Domach (1990) R. A. Majewski and M. M. Domach. 1990. Simple constrainedoptimization view of acetate overflow in E. coli. Biotechnol Bioeng 35 (1990), 732–738.
 Mathy et al. (2015) C. J. M. Mathy, F. Gonda, D. Schmidt, N. Derbinsky, A. A. Alemi, J. Bento, et al. 2015. SPARTA: Fast global planning of collisionavoiding robot trajectories. In NIPS 2015 Workshop on Learning, Inference, and Control of Multiagent Systems.
 Monk et al. (2014) J. Monk, J. Nogales, and B. O. Palsson. 2014. Optimizing genomescale network reconstructions. Nat Biotechnol 32 (2014), 447–452.
 Monk et al. (2017) J. M. Monk, C. J. Lloyd, E. Brunk, N. Mih, A. Sastry, Z. King, R. Takeuchi, W. Nomura, Z. Zhang, H. Mori, et al. 2017. iML1515, a knowledgebase that computes Escherichia coli traits. Nat Biotechnol 35, 10 (2017), 904.
 Morris Jr (1993) A. H. Morris Jr. 1993. NSWC library of mathematics subroutines. Technical Report. NAVAL SURFACE WARFARE CENTER DAHLGREN VA.
 Nishihara et al. (2015) R. Nishihara, L. Lessard, B. Recht, A. Packard, and M. I. Jordan. 2015. A general analysis of the convergence of ADMM. (2015). arXiv:1502.02009
 Ruiz (2001) D. Ruiz. 2001. A scaling algorithm to equilibrate both rows and columns norms in matrices. Technical Report. CMP00040415.
 Schmidt et al. (2016) A. Schmidt, K. Kochanowski, S. Vedelaar, E. Ahrné, B. Volkmer, et al. 2016. The quantitative and conditiondependent Escherichia coli proteome. Nat Biotechnol 34 (2016), 104–110.
 Schuetz et al. (2007) R. Schuetz, L. Kuepfer, and U. Sauer. 2007. Systematic evaluation of objective functions for predicting intracellular fluxes in Escherichia coli. Mol Syst Biol 3 (2007), 119.
 Stellato et al. (2017) B. Stellato, G. Banjac, P. Goulart, A. Bemporad, and S. Boyd. 2017. OSQP: An Operator Splitting Solver for Quadratic Programs. (Nov. 2017). arXiv:math.OC/1711.08013
 Takapoui and Javadi (2016) R. Takapoui and H. Javadi. 2016. Preconditioning via diagonal scaling. (2016). arXiv:1610.03871
 Wächter and Biegler (2006) A. Wächter and L. T. Biegler. 2006. On the implementation of an interiorpoint filter linesearch algorithm for largescale nonlinear programming. Math Program 106, 1 (2006), 25–57.
 Wohlberg (2017) B. Wohlberg. 2017. ADMM penalty parameter selection by residual balancing. (2017). arXiv:1704.06209
 Xu et al. (2017) Y. Xu, M. Liu, Q. Lin, and T. Yang. 2017. ADMM without a Fixed Penalty Parameter: Faster Convergence with New Adaptive Penalization. In Adv Neur In. 1267–1277.
 Xu et al. (2016) Z. Xu, M. A. T. Figueiredo, and T. Goldstein. 2016. Adaptive ADMM with spectral penalty parameter selection. (2016). arXiv:1605.07246
 Yim et al. (2011) H. Yim, R. Haselbeck, W. Niu, C. PujolBaxley, A. Burgard, et al. 2011. Metabolic engineering of Escherichia coli for direct production of 1,4butanediol. Nat Chem Biol 7 (2011), 445–452.
 Zhao et al. (2016) Q. Zhao, A. I. Stettner, E. Reznik, I. Ch. Paschalidis, and D. Segrè. 2016. Mapping the landscape of metabolic goals of a cell. Genome Biol 17, 1 (2016), 109.
 Zoran et al. (2014) D. Zoran, D. Krishnan, J. Bento, and B. Freeman. 2014. Shape and illumination from shading using the generic viewpoint assumption. In Adv Neur In. 226–234.
Appendix A Details for reproducibility
a.1. Model formulation including proteins
Our model formulation for including protein concentrations is based on (Beg et al., 2007) and (Lloyd et al., 2018). First, we denote as the vector of protein concentrations, whose length is the number of proteins in the metabolic network. Each protein has a molecular weight, which is stored in . One or more proteins combine to form enzyme complexes, , which are ultimately the molecules that catalyze metabolic reactions. The rate of a metabolic reaction is limited by the catalytic efficiency of an associated enzyme and that enzyme’s concentration. Since multiple enzymes can sometimes catalyze the same reaction, we have a matrix of catalytic efficiencies, mapping reactions to all possible enzymes. The total protein mass of a cell, , can be measured, and this quantity imposes an upper bound on the sum of individual protein concentrations multiplied by their molecular weights, . Finally, each protein can be part of one or more enzyme complexes, and each enzyme complex is comprised of a fixed number of a particular protein. This protein to enzyme mapping is encoded in the matrix .
By adding these relationships to , we have the following linear problem:
(22) 
where is the cell’s relative importance of different proteins, and are nonnegative slack variables.
By defining ,
, , , and , we can write (22) as
(23) 
which is (1) with a change of variables. Therefore, BIGBOSS can be applied directly to the problem with proteins and fluxes.
The matrix is typically not known fully. Here, we used a default value of second, which are in units of enzyme turnover rate and represents an average catalytic efficiency for all enzymes in E. coli. Values for matrix were determined from the geneproteinreaction mappings that are included with each metabolic model in the BiGG database (King et al., 2015). Without additional information we assumed that each complex requires one unit of each protein in the geneprotein mapping. For , we chose a default value of 0.30, indicating that the model accounts for % of the cell’s protein mass. For molecular weights , we used a default value of 77 kDa for every protein, based on the average molecular weight of proteins in E. coli. (Note that can also be determined directly from the protein’s amino acid sequence.)
a.2. Online Resources
Code and documentation for BIGBOSS are available at https://github.com/laurenceyang33/cellgoal. Detailed installation instructions are available there. Briefly, users will need a Fortran compiler, CMake, and a linear system solver. We tested our algorithm with both UMFPACK, which is part of SuiteSparse, and MA57 (Duff, 2004) from HSL (HSL, 2013). For parallel computing features, users will need OpenACC and the PGI Compiler. We have tested all code using PGI Community Edition version 18.10. In the code repository, we include test suites to run BIGBOSS on metabolic network models that users can download from the BiGG database (King et al., 2015).