Quantifying total uncertainty in physics-informed neural networks for solving forward and inverse stochastic problems
Abstract
Physics-informed neural networks (PINNs) have recently emerged as an alternative way of solving partial differential equations (PDEs) without the need of building elaborate grids, instead, using a straightforward implementation. In particular, in addition to the deep neural network (DNN) for the solution, a second DNN is considered that represents the residual of the PDE. The residual is then combined with the mismatch in the given data of the solution in order to formulate the loss function. This framework is effective but lacks uncertainty quantification of the solution due to the inherent randomness in the data or due to the approximation limitations of the DNN architecture. Here, we propose a new method with the objective of endowing the DNN with uncertainty quantification for both sources of uncertainty, i.e., the parametric uncertainty and the approximation uncertainty. We first account for the parametric uncertainty when the parameter in the differential equation is represented as a stochastic process. Multiple DNNs are designed to learn the modal functions of the arbitrary polynomial chaos (aPC) expansion of its solution by using stochastic data from sparse sensors. We can then make predictions from new sensor measurements very efficiently with the trained DNNs. Moreover, we employ dropout to correct the overfitting and also to quantify the uncertainty of DNNs in approximating the modal functions. We then design an active learning strategy based on the dropout uncertainty to place new sensors in the domain in order to improve the predictions of DNNs. Several numerical tests are conducted for both the forward and the inverse problems to quantify the effectiveness of PINNs combined with uncertainty quantification. This new paradigm of physics-informed deep learning with uncertainty quantification, which we call NN-aPC, can be readily applied to other types of stochastic PDEs in multi-dimensions.
keywords:
physics-informed neural networks, uncertainty quantification, stochastic differential equations, arbitrary polynomial chaos, dropout

1 Introduction
How to make the best use of existing data, while exploiting the information from classical mathematical models or even empirical correlations developed within a discipline, is an important issue as data-driven modeling is emerging as a powerful paradigm for physical and biological systems. For example, in geophysics, researchers have been using remote sensing data collected from multispectral satellites, together with a top-of-atmosphere reflectance model as a calibration of the data, to study soil salinization Fan et al. (2015), or estimating the Earth's heat loss based on heat flow measurements and a model of the hydrothermal circulation in the oceanic crust Pollack et al. (1993). Data can be used to provide closures in nonlinear models or to estimate parameters or functions in mathematical models. Moreover, mathematical models can be used as additional knowledge to formulate “informative priors” in statistical estimation methods, or be encoded in specially designed machine learning tools so that a smaller amount of data is required for inference or system identification. There has been recent progress for both forward (inference) and inverse (identification) problems using different methods. For example, for the forward problem, some of the popular choices of machine learning tools are Gaussian processes Graepel (2003); Särkkä (2011); Bilionis (2016); Raissi et al. (2018); Pang et al. (2018) and deep neural networks (DNNs) Lagaris et al. (1998, 2000); Khoo et al. (2017); Raissi et al. (2017); Nabian and Meidani (2018). For inverse problems, similar methods have been advanced, e.g., Bayesian estimation Stuart (2010) and variational Bayes inference Zhu and Zabaras (2018), and have been proposed for a wide variety of objectives, from parameter estimation Raissi et al. (2017) to discovering partial differential equations Rudy et al. (2017, 2018); Raissi et al. (2017); Raissi and Karniadakis (2018) to learning constitutive relationships Tartakovsky et al. (2018).
In this work we focus on DNNs, and in particular on the physics-informed neural networks (PINNs) for forward and inverse problems, first introduced in Raissi et al. (2017, 2017). However, in those works the mathematical models were deterministic differential equations, so here we consider stochastic differential equations that model either random microstructure in a medium or the lack of complete knowledge (“uncertainty”), e.g., of the material property. Only very few works have been published on solving stochastic differential equations using DNNs, e.g., E et al. (2017); Raissi (2018), for forward problems. Here we study the special and perhaps the most complex case where some of the physics is known, namely via the stochastic differential equations, and the parameter in the equation is represented as a stochastic process, introducing parametric uncertainty. First, we solve the forward stochastic Poisson equation, where there is uncertainty associated with the driving force. Subsequently, we consider the inverse stochastic elliptic equation, where the diffusivity is modeled as a random process. In the latter case, we have only partial information of the diffusivity from scattered sensors but we have much more data available for the solution, and we aim to infer the stochastic processes of not only the solution but also the diffusivity, and quantify their uncertainties given the randomness in the data. An additional uncertainty is due to the DNN approximation, which we will refer to as the approximation uncertainty. Taken together, we refer to the parametric uncertainty and the approximation uncertainty as the total uncertainty. To the best of our knowledge, the current work is the first to address total uncertainty in solving stochastic forward and inverse problems using DNNs.
In particular, in this paper we combine the arbitrary polynomial chaos (aPC) with PINNs for both the forward and the inverse stochastic problems. One of the most popular methods for uncertainty quantification studies is polynomial chaos Ghanem and Spanos (1990); Xiu and Karniadakis (2002) because it has been very effective in representing correlated stochastic fields. However, aPC Zheng et al. (2015); Wan and Karniadakis (2006); Oladyshkin and Nowak (2012); Lei et al. (2018); Witteveen and Bijl (2006) is more suitable for building the orthogonal basis from an arbitrary random space, without the need of any assumption on the distribution of the data. Therefore, in the current work, we employ the aPC to develop a combined method that we call NN-aPC, where we use DNNs to learn each individual mode of the aPC expansion. More importantly, after training, the proposed method can be used to predict new realizations of the solution based only on very few measurements.
Treatment of the DNN approximation uncertainty has been addressed using different methods in the past. The traditional way to estimate uncertainty in DNNs is via Bayes' theorem, e.g., the Bayesian neural networks (BNNs) MacKay (1992); Neal (1996). BNNs are standard DNNs with prior probability distributions placed over their weights, and given observed data, inference is then performed on the weights. Because the inference is not tractable in general, variational inference is often used to approximate it Jordan et al. (1998); Paisley et al. (2012); Kingma and Welling (2013); Rezende et al. (2014); Titsias and Lázaro-Gredilla (2014); Hoffman et al. (2013). However, these models have a very high additional computational cost because they require more parameters for the same network size and more time for the DNN parameters to converge. Recently, Gal et al. developed a new way to quantify uncertainty in DNNs by using dropout Gal and Ghahramani (2016); Yarin Gal and Kendall (2017); Li and Gal (2017), which is largely used as a regularization technique Hinton et al. (2012); Srivastava et al. (2014) to address the problem of overfitting. Gal et al. Gal and Ghahramani (2016) showed that a DNN with dropout is mathematically equivalent to approximating a probabilistic deep Gaussian process Damianou and Lawrence (2013), no matter what network architecture and nonlinearities are used. Moreover, dropout does not induce much computational overhead and thus has been used as a practical tool to obtain uncertainty estimates effectively in real applications, including language modelling Gal and Ghahramani (2016), computer vision Kendall et al. (2015); Kendall and Gal (2017) and medical applications Angermueller et al. (2017); Yang et al. (2016). In this paper, dropout is used to estimate the uncertainty in approximating each aPC mode.
Based on the magnitude of this uncertainty we set up an active learning strategy and deploy additional sensors to obtain more measurements of the quantity of interest (QoI), in order to improve the predictability of PINNs.
The organization of this paper is as follows. In Section 2, we set up the data-driven forward and inverse problems. In Section 3, we introduce the PINNs for solving deterministic differential equations, followed by our main algorithm, the NN-aPC, and the method of dropout for uncertainty. In Section 4, we provide a detailed study of the accuracy and performance of the NN-aPC method for solving both the forward and inverse stochastic diffusion equation and demonstrate the effectiveness of active learning via dropout-induced uncertainty. Finally, we conclude with a brief discussion in Section 5.
2 Problem Setup
Suppose we have a stochastic differential equation:
N_x[u(x; ω); k(x; ω)] = 0,  x ∈ D, ω ∈ Ω;   B_x[u(x; ω)] = 0,  x ∈ Γ, (1)
where N_x is the general form of a differential operator that could be nonlinear, D is a d-dimensional physical domain in R^d, Ω is the random space, and u(x; ω) is the solution to this equation. The boundary condition is imposed through the generalized boundary condition operator B_x at the domain boundary Γ. The random parameter k(x; ω) is the source of parametric uncertainty, which could be represented by either a few random variables or by an infinite-dimensional (in the random space) random process.
We consider two types of problems here: first, a forward problem, where we know exactly the distribution of k everywhere in the domain and u is our QoI; and second, an inverse problem, where we assume that we have incomplete information on k but some extra knowledge on u, and we are interested in inferring the full stochastic profile of k. In practice, both problems are data-driven, since the information usually comes from data collected via sensor measurements. Here, we summarize the different scenarios of the sensor placement for each type of problem:

Forward problem: The u sensors are placed only at the boundary Γ to provide the boundary condition, while the k sensors are virtual (since we know the distribution of k); thus we can have as many k sensors as we want and they can be placed anywhere in D.

Inverse problem: In addition to having u sensors at the boundary Γ, we have a limited number of extra u sensors that can be placed in the domain D, whereas we only have a limited number of k sensors.
In this paper, we address both types of problems but we will focus more on solving the inverse problem.
3 Methodology
3.1 Physicsinformed neural network
In this part, we briefly review using DNNs to solve deterministic differential equations Lagaris et al. (1998, 2000); Raissi et al. (2017), and its generalization for solving deterministic inverse problems in Raissi et al. (2017). To this end, reconsider Equation 1 but replace the random input k(x; ω) with a deterministic approximation described by a finite set of parameters λ, leading to a parameterized differential equation:
N_x[u(x); λ] = 0,  x ∈ D;   B_x[u(x)] = 0,  x ∈ Γ, (2)
where u(x) is the solution and λ denotes the parameters.
A DNN, denoted by û(x; θ), is constructed as a surrogate of the solution u(x); it takes the coordinate x as the input and outputs a vector that has the same dimension as u. Here we use θ to denote the DNN parameters that will be tuned at the training stage, namely, θ contains all the weights and biases in û. For this surrogate network û, we can take its derivatives with respect to its input by applying the chain rule for differentiating compositions of functions via automatic differentiation, which is conveniently integrated in many machine learning packages such as TensorFlow Abadi et al. (2016). The restrictions on û are twofold: first, given the set of scattered data of the u observations, the network should be able to reproduce the observed values when taking the associated x as input; second, û should comply with the physics imposed by Equation 2. The second part is achieved by defining a residual network:
r̂(x; θ, λ) := N_x[û(x; θ); λ], (3)
which is computed from û straightforwardly with automatic differentiation. This residual network r̂, also named the physics-informed neural network (PINN), shares the same parameters θ with the network û and should output the constant 0 for any input x. Figure 1 shows a sketch of the PINN. At the training stage, the shared parameters θ (and also λ, if it is also to be inferred) are fine-tuned to minimize a loss function that reflects the above two constraints.
Suppose we have a total number of N_u observations on u, collected at locations {x_u^(i)}_{i=1}^{N_u}, and N_f is the number of collocation points {x_f^(i)}_{i=1}^{N_f} where we evaluate the residual r̂. We shall use a pair (x, u) to represent a single instance of training data, where the first entry denotes the input and the second entry denotes the anticipated output (also called the “label”). The workflow of solving a differential equation with the PINN can be summarized as follows:
loss(θ; λ) = MSE_u + MSE_r, (4)
MSE_u = (1/N_u) Σ_{i=1}^{N_u} |û(x_u^(i); θ) − u^(i)|²,   MSE_r = (1/N_f) Σ_{i=1}^{N_f} |r̂(x_f^(i); θ, λ)|². (5)
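To make this workflow concrete, here is a minimal sketch of the loss composition in Equations 4-5 for a one-dimensional Poisson problem −u″ = f with homogeneous boundary conditions. The linear-in-parameters polynomial surrogate (standing in for the DNN û, with analytic derivatives standing in for automatic differentiation) and all variable names are our own illustrative assumptions, not the paper's implementation:

```python
import numpy as np

# Surrogate: u_hat(x) = sum_j c_j x**j, linear in its parameters c -- a stand-in
# for the DNN u_hat(x; theta); its derivatives are known analytically, playing
# the role of automatic differentiation.
deg = 9
xf = np.linspace(0.0, 1.0, 50)          # collocation points for the residual
xb = np.array([0.0, 1.0])               # boundary "sensors" (u = 0 there)
f = np.pi**2 * np.sin(np.pi * xf)       # forcing for -u'' = f; exact u = sin(pi x)

# Design matrices: rows of V evaluate u_hat(x); rows of D evaluate u_hat''(x).
V = np.vander(xb, deg + 1, increasing=True)
D = np.zeros((len(xf), deg + 1))
for j in range(2, deg + 1):
    D[:, j] = j * (j - 1) * xf**(j - 2)

# Minimize MSE_u + MSE_r (Eqs. 4-5) over c: boundary-mismatch rows stacked on
# top of PDE-residual rows, solved as one linear least-squares problem.
A = np.vstack([V, -D])
b = np.concatenate([np.zeros(len(xb)), f])
c, *_ = np.linalg.lstsq(A, b, rcond=None)

xt = np.linspace(0.0, 1.0, 101)
u_pred = np.vander(xt, deg + 1, increasing=True) @ c
err = np.max(np.abs(u_pred - np.sin(np.pi * xt)))
```

Because the surrogate is linear in its parameters, the joint minimization reduces to least squares here; for a DNN the same composite loss is minimized iteratively with a gradient-based optimizer.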
3.2 NN-aPC: Combining arbitrary polynomial chaos with neural networks
We generalize the PINN method to solve stochastic differential equations for both forward and inverse problems, i.e., we aim to infer continuous random processes. Assume that a sensor will generate a sequence of measurements after being installed, and that when the data are recorded, all sensors are read simultaneously. We denote the measurements from all the sensors at the same instant by a snapshot of the sensor data. Although the measurement results change from one measurement to the next due to randomness, it is reasonable to believe that every snapshot of sensor data corresponds to the same random event in the random space. We also assume that when the number of snapshots is large enough, the empirical distribution approximates the true distribution.
Let us consider Equation 1. Suppose we have N_u sensors for u placed at {x_u^(i)}_{i=1}^{N_u}, N_k sensors for k placed at {x_k^(i)}_{i=1}^{N_k}, and N_f collocation points at {x_f^(i)}_{i=1}^{N_f} that are used to calculate the residual of Equation 1. A total number of N snapshots of measurements are made from all these sensors. Let u_t^(i) and k_t^(i) (t = 1, 2, …, N) be the t-th measurements of u and k at locations x_u^(i) and x_k^(i), respectively, and let ω_t be the random instance at the t-th measurement, i.e., u_t^(i) = u(x_u^(i); ω_t) and k_t^(i) = k(x_k^(i); ω_t). The training data set can be represented by
S = {(x_u^(i), u_t^(i))}_{i=1,…,N_u; t=1,…,N} ∪ {(x_k^(i), k_t^(i))}_{i=1,…,N_k; t=1,…,N} ∪ {x_f^(i)}_{i=1,…,N_f}. (6)
The proposed NN-aPC method consists of the following steps:

dimension reduction;

constructing the aPC basis;

building the NN-aPC as a surrogate model of the aPC modes and training a network for each mode.
The trained NN-aPC can then be used to calculate the statistics of our QoI and to predict new instances of the continuous trajectories of the QoI from newly collected sensor data. We will explain each of the three steps and the prediction procedure below.
3.2.1 Dimension reduction with principal component analysis
As the first step, we find a lower-dimensional random space spanned by a set of hidden random variables ξ for the dimension reduction of our QoI. The most convenient way to do this is via principal component analysis (PCA). Naturally, we would analyze the data of k, which is the source of randomness in Equation 1. Let C be the covariance matrix for the sensor measurements on k, i.e.,
C_{i,j} = Cov(k(x_k^(i); ·), k(x_k^(j); ·)),  i, j = 1, 2, …, N_k. (7)
Let λ_l and φ_l be the l-th largest eigenvalue and its associated normalized eigenvector of C, respectively. Then, PCA yields
C = Φ Λ Φ^T, (8)
where Φ = [φ_1, φ_2, …, φ_{N_k}] is an orthonormal matrix and Λ = diag(λ_1, λ_2, …, λ_{N_k}) is a diagonal matrix. Let k_t = (k_t^(1), k_t^(2), …, k_t^(N_k))^T be the results of the measurements of the t-th snapshot; then
ξ_t = Λ^{−1/2} Φ^T (k_t − k̄) (9)
is an uncorrelated random vector, and hence k_t can be rewritten as a reduced M-dimensional expansion
k_t^M = k̄ + Σ_{l=1}^{M} √λ_l ξ_{t,l} φ_l, (10)
where k̄ is the vector of the means of each sensor's measurements. The choice of M depends on how much energy should be maintained in the low-dimensional random space. For reasons of simplicity, we shall always use the reduced M-dimensional representation and omit the superscript M. We note that Equation 10 can also be written in the form of the Karhunen-Loève expansion:
k_t(x) ≈ k̄(x) + Σ_{l=1}^{M} √λ_l k_l(x) ξ_{t,l}, (11)
where k̄(x) is the mean of the measurements at x, k_l(x) is the l-th mode function of k whose value at x_k^(i) coincides with the i-th entry of the eigenvector φ_l, and ξ_{t,l} is the l-th entry of the random vector ξ_t. We want to extend the range of the k_l(x) to the entire domain D in order to approximate continuous samples of k.
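The PCA steps in Equations 7-10 can be sketched as follows. The synthetic snapshot matrix, the 95% energy threshold, and all variable names are illustrative assumptions used only to exercise the whitening and truncation:

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-in for N snapshots from N_k sensors of a correlated process k:
# mean 1.0, fluctuation std 0.1, squared-exponential spatial correlation.
N, Nk = 2000, 7
xs = np.linspace(0.0, 1.0, Nk)
cov = np.exp(-(xs[:, None] - xs[None, :])**2 / (2 * 0.25**2))
K = 1.0 + rng.multivariate_normal(np.zeros(Nk), 0.1**2 * cov, size=N)

kbar = K.mean(axis=0)                     # per-sensor mean
C = np.cov(K, rowvar=False)               # sensor covariance matrix (Eq. 7)
lam, Phi = np.linalg.eigh(C)              # eigh returns ascending order...
lam, Phi = lam[::-1], Phi[:, ::-1]        # ...so sort descending

M = int(np.searchsorted(np.cumsum(lam) / lam.sum(), 0.95)) + 1   # keep ~95% energy
xi = (K - kbar) @ Phi[:, :M] / np.sqrt(lam[:M])        # hidden variables (Eq. 9)
K_rec = kbar + (xi * np.sqrt(lam[:M])) @ Phi[:, :M].T  # reduced expansion (Eq. 10)
```

By construction the sample covariance of the hidden variables ξ is the identity, and the truncated reconstruction retains the prescribed fraction of the stochastic energy.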
3.2.2 Arbitrary polynomial chaos
Assume that we have a set of M-dimensional samples of random vectors {ξ_t}_{t=1}^{N} with hidden probability measure ρ(ξ). Given a sufficiently large number of snapshots, we can approximate the underlying probability measure ρ by the discrete measure ν_N,
ν_N(ξ) = (1/N) Σ_{t=1}^{N} δ_{ξ_t}(ξ), (12)
where δ is the Dirac measure. Then, a set of multivariate orthonormal polynomial basis functions {ψ_α} can be constructed via the Gram-Schmidt orthogonalization process following Witteveen and Bijl (2006); Lei et al. (2018). The subscript α is the graded lexicographic multi-index, and the number of basis functions, P + 1, depends on the highest allowed polynomial order r in the M-dimensional space, following the formula
P + 1 = (M + r)! / (M! r!). (13)
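Equation 13 is the standard count of multivariate polynomials of total order at most r in M dimensions; as a quick sanity check (the function name is ours):

```python
from math import comb

def num_apc_basis(M, r):
    """P + 1 from Eq. 13: number of multivariate polynomials of total order <= r
    in M dimensions, (M + r)! / (M! r!)."""
    return comb(M + r, r)

# e.g. M = 4 hidden random variables and a 2nd-order expansion give 15 basis
# polynomials, so a mode network would output 15 values per location x.
n = num_apc_basis(4, 2)   # -> 15
```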
Specifically, the basis functions are constructed using the recursive algorithm
ψ_α(ξ) = p_α(ξ) − Σ_{β < α} ⟨p_α, ψ_β⟩ ψ_β(ξ), followed by normalization, (14)
where p_α(ξ) represents the multivariate monomial basis function; the coefficients are determined by imposing the orthonormality condition with respect to the discrete measure ν_N, i.e.,
⟨ψ_α, ψ_β⟩ := (1/N) Σ_{t=1}^{N} ψ_α(ξ_t) ψ_β(ξ_t) = δ_{α,β}. (15)
With the polynomial basis functions {ψ_α} that are automatically adapted to the distribution of ξ, we can write any function f(x; ω) in the form of the aPC expansion,
f(x; ω) ≈ Σ_{α} f_α(x) ψ_α(ξ(ω)), (16)
where the functions f_α(x) are called the aPC modes of f and can be calculated by
f_α(x) = (1/N) Σ_{t=1}^{N} f(x; ω_t) ψ_α(ξ_t). (17)
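A minimal sketch of the Gram-Schmidt construction (Equations 14-15) and the mode projection (Equation 17) with respect to a discrete measure; the one-dimensional setting and the Beta-distributed samples (a stand-in for the empirical sensor-derived measure) are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
# Samples xi_t from an arbitrary hidden distribution; no distributional
# assumption enters the construction itself.
xi = rng.beta(2.0, 5.0, size=5000)
N, r = len(xi), 3                       # snapshots and highest polynomial order

# Monomial design matrix: column a holds p_a(xi_t) = xi_t**a.
P = np.vander(xi, r + 1, increasing=True)

def inner(f, g):
    """Inner product w.r.t. the discrete measure nu_N (Eq. 15)."""
    return np.mean(f * g)

# Modified Gram-Schmidt against nu_N: orthogonalize each monomial, normalize.
psi = np.empty_like(P)
for a in range(r + 1):
    v = P[:, a].copy()
    for b in range(a):
        v -= inner(v, psi[:, b]) * psi[:, b]
    psi[:, a] = v / np.sqrt(inner(v, v))

G = psi.T @ psi / N                     # Gram matrix; should be the identity

# aPC modes of a test function f(xi) = xi**2 by projection (Eq. 17); since f
# lies in the span of the basis, the expansion (Eq. 16) reproduces it exactly.
f_vals = xi**2
coef = psi.T @ f_vals / N
recon = psi @ coef
```

In the NN-aPC method the projection in Equation 17 is only used to compute reference modes; the modes themselves are represented by DNNs, as described next.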
3.2.3 Learning stochastic modes
The key to our method is to train DNNs that predict the stochastic modes of our QoI. In this section, we focus on the inverse problem, where we have to learn both the modes of u and the modes of k. (Solving a forward problem is similar and more straightforward, and will be briefly discussed in Section 4.1.1.) Two disjoint DNNs are constructed: the network ũ(x; θ_u), which takes the coordinate x as the input and outputs a vector of the aPC modes of u evaluated at x, and the network k̃(x; θ_k), which also takes the coordinate x as the input and outputs a vector of the k modes (we take k̄(x) in Equation 11 as the 0th mode of k). Then, we can approximate u and k at the t-th snapshot by
û_t(x) = Σ_{α=0}^{P} ũ_α(x) ψ_α(ξ_t) (18)
and
k̂_t(x) = k̃_0(x) + Σ_{l=1}^{M} √λ_l k̃_l(x) ξ_{t,l}. (19)
Similar to the PINN method, we construct the residual network via automatic differentiation and arithmetic operations of DNNs by substituting u and k in Equation 1 with û_t and k̂_t. This residual network is aPC-informed: instead of being an uninterpretable black box, this network is designed to reflect the essence of the aPC expansion (see Figure 2 for a schematic of the NN-aPC).
In practice, for k, we separate the mean from the rest of the modes and learn it with a small-scale DNN, and for u, we group the modes corresponding to the same order of the aPC expansion together and learn each group of modes with a separate DNN, as depicted in Figure 3. This is due to the fact that the mean and the modes of different orders often correspond to vastly different scales.
The loss function is defined as a sum of the mean squared errors (MSE):
MSE = MSE_u + MSE_k + MSE_r, (20)
where
MSE_u = 1/(N N_u) Σ_{t=1}^{N} Σ_{i=1}^{N_u} [û_t(x_u^(i)) − u_t^(i)]², (21)
MSE_k = 1/(N N_k) Σ_{t=1}^{N} Σ_{i=1}^{N_k} [k̂_t(x_k^(i)) − k_t^(i)]², (22)
and
MSE_r = 1/(N N_f) Σ_{t=1}^{N} Σ_{i=1}^{N_f} [r̂_t(x_f^(i))]². (23)
So far, we have specified the training data, constructed the DNNs, and formalized the loss function; we are now ready to train the DNNs.
3.2.4 Predicting stochastic realizations
In real applications, the training set could be generated from historical data, and the training process should be performed at the offline stage. The fine-tuned model shall then be used at the online stage to predict new random instances, provided with new snapshots of sensor data. Suppose we have a new snapshot of k sensor data; the first step is to extract the hidden random variables ξ from it using Equation 9. Then, for any assigned location x, we can predict the modal functions for both u and k from the trained DNNs ũ and k̃. Finally, the predictions of u and k for the new random instance can be made via Equations 18 and 19, respectively.
3.3 Dropout for uncertainty
Although DNNs can be used to approximate any measurable function accurately, standard DNNs do not capture model uncertainty. Dropout is one convenient way to quantify the approximation uncertainty in DNNs. The key idea of dropout is to drop units from the DNN independently and randomly with a preselected probability p. In the original work, dropout was only used during training, while no units were dropped at test time, i.e., the prediction of the DNN for unknown data was deterministic. In dropout for uncertainty, the units are also dropped at test time, resulting in a stochastic prediction each time. The loss in dropout inference is the summation of the original loss and a regularization term over the DNN parameters:
L_dropout = (1/N) Σ_{i=1}^{N} E(y_i, ŷ_i) + λ Σ_j (‖W_j‖² + ‖b_j‖²), (24)
where N is the number of training points, ŷ_i is the prediction, y_i is the true value, E is the loss for a single prediction, W_j and b_j denote the weights and biases, and λ is the regularization rate. During prediction, the mean of the output is directly estimated by the Monte Carlo (MC) method,
E[y(x)] ≈ (1/T) Σ_{t=1}^{T} ŷ_t(x), (25)
where ŷ_t is the output of the dropped neural network at the t-th prediction and T is the number of MC samples. The output variance is also estimated from these MC outputs.
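Test-time dropout can be sketched as follows; the tiny fixed MLP with random (untrained) weights is purely illustrative, standing in for a trained network:

```python
import numpy as np

rng = np.random.default_rng(2)
p = 0.1                                   # dropout probability

# A tiny fixed MLP standing in for a trained network (the random, untrained
# weights are purely illustrative).
W1, b1 = rng.standard_normal((1, 50)), np.zeros(50)
W2, b2 = rng.standard_normal((50, 1)), np.zeros(1)

def forward(x, drop=False):
    h = np.tanh(x @ W1 + b1)
    if drop:                              # units are dropped at TEST time as well
        mask = rng.random(h.shape) > p
        h = h * mask / (1.0 - p)          # inverted-dropout scaling keeps E[h] fixed
    return h @ W2 + b2

x = np.linspace(-1.0, 1.0, 20).reshape(-1, 1)
T = 500                                   # number of stochastic forward passes
samples = np.stack([forward(x, drop=True) for _ in range(T)])

mean = samples.mean(axis=0)               # MC estimate of the predictive mean (Eq. 25)
std = samples.std(axis=0)                 # dropout-induced uncertainty estimate
```

The spread of the T stochastic passes provides the pointwise uncertainty estimate used below for active learning.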
Figure 4 shows an example of using the dropout DNN for regression. We use the dropout strategy in the NN-aPC method to estimate the uncertainty of our DNN model and as guidance for active learning.
4 Numerical Examples
4.1 Solving stochastic PDEs
We first demonstrate the effectiveness of solving stochastic differential equations with the NN-aPC method for the forward and inverse problems.
4.1.1 Forward problem: stochastic Poisson equation (a pedagogical example)
Consider the following onedimensional stochastic Poisson equation with homogeneous boundary conditions:
Δu(x; ω) = f(x; ω),  x ∈ D, ω ∈ Ω;   u(x; ω) = 0,  x ∈ ∂D. (26)
Here Ω is the random space, and the forcing term f(x; ω) is a Gaussian random process with mean f̄(x) and a squared exponential covariance function
Cov(f(x), f(x′)) = σ² exp(−(x − x′)² / (2 l²)), (27)
where σ is the standard deviation and l is the correlation length.
In this and the following examples, all data are generated by the MC sampling method. Specifically, we sample N snapshots of continuous f trajectories and extract from them the values where the (virtual) f sensors are located. For every f trajectory, we solve for its corresponding solution trajectory u using the finite difference method, and we use the statistics of these u trajectories as our reference. To evaluate the performance of the trained model, we collect additional snapshots of continuous f and u trajectories, independently of the training data. These pairs of trajectories form our test sample set. Similarly, we extract the sensor data from every snapshot in the test set as the input at the predicting stage. We shall use the same numbers of training and test snapshots in the following tests, if not explicitly mentioned otherwise.
We place equidistant sensors of f in the domain and, after performing PCA, keep the leading principal random variables that account for most of the stochastic energy. The solution u is approximated with a first-order aPC expansion. Figure 5 shows the scatter plots of the measurements from the first three f sensors and of the first three arbitrary polynomial basis functions. It is evident that the raw data from the measurements are correlated while their induced polynomials are not; thus the induced polynomials serve as a valid set of basis functions in the random space.
The DNNs used to approximate the modes of u are constructed as in Figure 3, where we use an isolated small-scale DNN of 2 hidden layers with 4 neurons per hidden layer to approximate the mean profile, and a DNN of 4 hidden layers with 32 neurons per hidden layer to model the modes. The tanh function is selected as the default activation function because it is smooth, so that the second-order derivatives required by the residual network are well defined. Then, a DNN for the residual can be constructed via automatic differentiation and arithmetic operations. The training set is constructed analogously to Equation 6,
where the u data are collected only at the boundaries to provide the boundary conditions. The loss function is slightly modified from Equation 20 to add a regularization term. At the training stage, we choose a small regularization rate, and use the Adam Kingma and Ba (2014) optimizer to train our model. Figure 6 shows the predicted mean and standard deviation of the solution versus the reference. Figure 7 shows our DNN predictions of three modes, where the reference modes are calculated by Equation 17. We can see that the NN-aPC method makes accurate predictions of the mean and standard deviation of the solution and learns the arbitrary polynomial chaos modes.
The trained model is then used to predict the QoI at any location given new snapshots of sensor data, with only one forward evaluation of the DNNs with the new location x as the input, as described in Section 3.2.4. Figure 7(a) illustrates the predicted solutions for three different snapshots of the sensor data in the test samples; our predictions recover the true solutions very well. In Figure 7(b), we use an increasing number of f sensors to study the effect of the number of sensors on the accuracy of the predicted solutions; we observe that when more sensors are deployed, we can achieve better accuracy.
4.1.2 Inverse problem: stochastic elliptic equation
We solve the one-dimensional stochastic elliptic equation as an inverse problem, where we have some extra information on the solution u but incomplete information on the diffusion coefficient k. The equation reads
d/dx [ k(x; ω) du/dx(x; ω) ] = f(x),  x ∈ D, ω ∈ Ω. (28)
In this example, we use a constant forcing term f. The randomness comes from the diffusion coefficient k(x; ω), for which we only have limited information at the locations where we place the k sensors. Here, k is modeled and sampled from a non-Gaussian random process such that
k(x; ω) = exp(g(x; ω)), (29)
where g(x; ω) is a Gaussian random process whose mean and covariance function have the same form as in Equation 27, with the standard deviation and correlation length set for this example. We use the same strategy as in Section 4.1.1 to generate the training and testing samples, and the training set is constructed as in Equation 6.
For this case, two groups of DNNs, ũ and k̃, are built to calculate the modes of u and k, respectively. Again, we use the Adam optimizer with learning rate 0.001 to train the DNNs for 50000 epochs. In Figure 8(a), we use the 1st-order aPC expansion and study the impact of using different regularization rates λ and different shapes of DNNs. We only change the DNNs that learn the stochastic modes, while the DNNs that learn the mean profiles are fixed to have 2 hidden layers and 4 neurons per hidden layer; this is the default setting for the subsequent examples as well. In these plots, we compare the averaged relative error of predicting u and k in the test set. The results indicate that a moderate choice of λ (0.0005) gives us the most accurate predictions, since too small/large a choice of λ causes over/underfitting. A suitable choice of DNN shape (4 hidden layers with 32 neurons per hidden layer) produces the best trained model. For the rest of this numerical example, we shall adopt these optimal DNN settings.
In Figure 8(b), we use the 1st-order aPC expansion and compare the averaged relative error in the predictions when different numbers of u and k sensors are deployed to collect the training data. In general, the proposed method makes more accurate predictions after training with data collected from more u and k sensors. One reason is that a larger number of sensors supports a greater variety of inputs in the training data, thus feeding more information to the model and reducing the probability of overfitting. Also, more k sensors allow for a higher effective random dimension of the aPC expansion for better approximation. Figure 8(b) also shows that a 2nd-order aPC expansion helps to improve the predictions. We note that this is not the case when we use only three k sensors. The bottleneck there is the insufficient random dimension, so without enough training information, adopting the 2nd-order aPC expansion doubles the number of network outputs and would only increase the risk of overfitting.
            mean    std     mode 1  mode 2  mode 3  mode 4
1st-order   0.54%   5.26%   4.15%   5.39%   12.03%  42.81%
2nd-order   0.45%   1.87%   1.28%   1.95%    3.67%  29.95%
1st-order   0.14%   1.83%   0.98%   3.60%    4.34%  45.56%
2nd-order   0.14%   0.51%   0.08%   0.80%    1.04%  32.16%
We use a combination of 7 u sensors and 4 k sensors. Figures 9(a) and 10(a) compare the mean and standard deviation of u and k calculated by the trained DNNs when we use the 1st- and 2nd-order aPC expansions. Figure 9(b) shows the first four common aPC modes of u for both expansions, where we can see that the 2nd-order aPC expansion helps to improve the accuracy of the predicted lower-order modes. The 2nd-order aPC expansion also helps with learning the k modes, as depicted in Figure 10(b), even though the k̃ network is disjoint from the ũ network. Table 1 lists the relative errors of the trained model in calculating the mean, standard deviation and all the common modes of u and k. We can see that using a 2nd-order aPC significantly improves the accuracy. We also note that in Figure 10(a), due to the placement of the k sensors, the measurements from each sensor yield almost the same mean (1.0) and standard deviation (0.1), but the trained network reveals the nontrivial wavy structure of the mean and standard deviation of k in the entire domain, containing more information than the training data alone could provide. This indicates that there is information fusion of all three types of training data and the stochastic differential equation, rather than a simple interpolation. Figure 12 shows that the higher-order aPC modes can also be captured, even though they exist at a much smaller scale compared to the lower-order, more energetic modes.
Figure 13 shows the predictions of u and k for three arbitrary snapshots in the test sample set. Every pair of predictions is made based on a single-time measurement from the 4 k and 7 u sensors, and they agree with the true reference values. Again, the predicting stage takes little computational time, since only one forward evaluation of the well-trained DNNs is needed for every input x, no matter how many snapshots of predictions are to be made.
4.2 Solving deterministic differential equations with PINN and dropout
In this part, we first show that dropout reduces overfitting in solving differential equations. Then, more importantly, we show that the dropout-induced uncertainty serves as useful guidance for active learning.
4.2.1 Reducing overfitting
To show how dropout reduces overfitting, we implement the dropout neural networks to solve both a forward Poisson equation and an inverse elliptic equation; they are the deterministic versions of Equations 26 and 28. For the forward Poisson equation, we choose a fixed forcing term f(x). A dropout neural network with 6 hidden layers and 100 neurons per hidden layer is constructed to model the solution u. The training data consist of 2 measurements at the two boundaries and 6 sensors in the domain. For the inverse elliptic equation, we choose, as before, a constant forcing term and a fixed hidden diffusion coefficient k(x), and we use 5 k sensors and 7 u sensors. Two separate DNNs are constructed: a small-scale regular DNN that has 2 hidden layers with 4 neurons per hidden layer to model the function u, and a large-scale dropout neural network with 6 hidden layers and 100 neurons per hidden layer to model the function k. The dropout rate in both examples is fixed at 0.01.
For both the forward and the inverse problem, we train the DNNs as described in Section 3.3, using the Adam optimizer for 30000 epochs. Due to the lack of a sufficient number of sensors, there is a high chance that overfitting occurs at the training stage if we use just regular DNNs. Figure 14 shows a comparison of the results when we train the networks with and without dropout. As we can see in the plots, the results from the regular DNNs differ greatly from each other, showing irregular jumps of large amplitude, while the results from the dropout DNNs are similar to each other and closer to the truth. This shows that dropout works as an effective means of reducing overfitting.
Estimating DNN approximation uncertainty
More importantly, the uncertainty introduced by dropout serves as useful guidance for active learning. Again, we solve the same inverse elliptic equation, but this time we are provided with additional sensors. At the training stage, we add a regularization term to the loss function. At the predicting stage, we evaluate the dropout network repeatedly to estimate the mean and standard deviation of its output. Next, a new sensor is placed where the standard deviation attains its maximum. If the location of the new sensor falls within a threshold distance of an existing sensor, we do not add the new sensor; instead, we count the nearest existing sensor twice, as if we had added a virtual sensor at the same location. This automatic iterative procedure for active learning reduces the cost of adding new sensors by efficiently reusing the existing ones.
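The placement rule just described can be sketched as follows. The grid, the toy uncertainty profile, and the threshold value are illustrative assumptions (the threshold actually used in the experiments is not recoverable from this text); only the argmax-of-std rule and the virtual-sensor bookkeeping come from the method.

```python
import numpy as np

def place_sensor(x_grid, std, sensors, counts, threshold=0.02):
    """Active-learning rule: propose the point of maximum dropout std.

    If the proposed point lies within `threshold` of an existing sensor,
    that sensor's weight is incremented instead ("virtual" sensor);
    otherwise the new location is appended with weight 1.
    """
    x_new = x_grid[np.argmax(std)]
    dists = np.abs(np.asarray(sensors) - x_new)
    j = int(np.argmin(dists))
    if dists[j] < threshold:
        counts[j] += 1            # reuse the nearest existing sensor
    else:
        sensors.append(float(x_new))
        counts.append(1)
    return sensors, counts

x_grid = np.linspace(0.0, 1.0, 101)
sensors, counts = [0.0, 0.5, 1.0], [1, 1, 1]

# Uncertainty peaked far from existing sensors: a new sensor is added.
std1 = np.exp(-((x_grid - 0.30) ** 2) / 0.002)
sensors, counts = place_sensor(x_grid, std1, sensors, counts)

# Uncertainty peaked next to an existing sensor: it is counted twice.
std2 = np.exp(-((x_grid - 0.005) ** 2) / 0.002)
sensors, counts = place_sensor(x_grid, std2, sensors, counts)
```

Iterating this rule after each retraining step gives the automatic procedure described above.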
Figure 15 shows the initial prediction and the prediction after iterating the above algorithm for 15 steps. In Figure 15(b), the sensors cluster where the curvature of the target function is large, which is consistent with the intuition that more sensors should be placed where the function changes rapidly. Figure 16 shows that during these 15 iterations only 8 new sensors are deployed, while the relative error of the predictions is reduced substantially.
4.3 Active learning for inverse stochastic elliptic problems
We consider the inverse stochastic elliptic problem described in Equation 28 for active learning, but in this example the diffusion coefficient is modeled by a Gaussian random process with a prescribed correlation length. To start, we have 1000 snapshots of data from 3, 7, and 21 sensors equidistantly distributed in the physical domain, and our goal is to infer the diffusion coefficient in the entire domain. Suppose we are then provided with additional sensors; we shall allocate them according to the uncertainty induced by the dropout neural networks. In practice, we model the modes of the diffusion coefficient with a dropout neural network that has 4 hidden layers with 128 neurons per hidden layer and a dropout rate of 0.01. The solution is expanded with a first-order aPC expansion, and its modes are modeled by a regular DNN with 4 hidden layers and 32 neurons per hidden layer. The reasons for implementing dropout only on the diffusion-coefficient network are as follows:

- Dropout can be used to efficiently reduce overfitting, and overfitting is more severe for the diffusion-coefficient network than for the solution network.

- As this is an inverse problem, our main goal is to identify the diffusion coefficient, and then to identify where more sensors should be added to enhance the accuracy of its prediction.
We use the Adam optimizer with learning rate 0.0005 to train the networks for 50,000 epochs. The mean and standard deviation of the modes are estimated from 10,000 independent evaluations of the dropout neural network. We place each additional sensor where the standard deviation of the first mode attains its maximum, because the first mode is associated with the largest eigenvalue and therefore has the largest impact on the stochastic structure; it is thus the most important mode to learn accurately.
We carry out the active-learning steps until the prediction error no longer decreases after a new sensor is added. To better illustrate the process of adding new sensors, Figures 16(a), 16(b), and 16(c) show the learned first mode of the diffusion coefficient and its associated standard deviation from the dropout uncertainty; three steps of active learning are displayed. Note that the shapes of the first modes do not stay the same: every time a new sensor is added, the principal components in Equation 8 change accordingly. Figures 17(a) and 17(b) compare the predicted standard deviations of the diffusion coefficient and the solution in the first three steps and in the last step, respectively. It is evident that automatically adding extra sensors according to the dropout-induced uncertainty improves the accuracy of the predicted standard deviation for both quantities. Finally, the trained model is used to predict continuous trajectories in the test samples. We conclude from Table 2 that adding extra sensors based on dropout-driven active learning yields better predictions.
Table 2: Relative errors of the mean, standard deviation, and prediction for the two quantities of interest at active-learning Steps 0, 1, 2, and 11.
5 Summary
We have presented a new approach to quantify the parametric uncertainty in physics-informed neural networks (PINNs) by employing the arbitrary polynomial chaos (aPC) expansion to represent the stochastic solution. We use the data collected from sensor measurements to build a set of arbitrary polynomial basis functions and learn the modal functions of the aPC expansion through PINNs, i.e., DNNs that encode the underlying stochastic differential equation. The proposed data-driven method can be used to solve forward problems but, more importantly, it can handle stochastic inverse problems. In the classical inverse problem, typically all the available information concerns the solution, and the aim is to identify the parameters. Here we chose to solve inverse problems in which partial information is available for both the solution and the parameter, which is a stochastic process. Once the model is trained with existing sensor data, i.e., historical data, it can predict new instances of trajectories of the quantity of interest (solution or parameter) at very small additional computational cost.
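The data-driven polynomial basis mentioned above can be sketched by a (modified) Gram-Schmidt orthonormalization of the monomials with respect to the empirical measure of the sensor samples, in the spirit of the aPC construction. The sample distribution and expansion order below are illustrative assumptions; only the orthonormalization idea comes from the text.

```python
import numpy as np

def apc_basis(samples, order):
    """Build aPC basis polynomials orthonormal w.r.t. the empirical
    measure of `samples`, via modified Gram-Schmidt on 1, x, x^2, ...

    Returns a matrix V whose columns are the basis polynomials evaluated
    at the samples, satisfying V.T @ V / n ~ I (orthonormality).
    """
    n = len(samples)
    V = np.vander(samples, order + 1, increasing=True).astype(float)
    for j in range(order + 1):
        for k in range(j):                          # remove earlier components
            V[:, j] -= (V[:, k] @ V[:, j] / n) * V[:, k]
        V[:, j] /= np.sqrt(V[:, j] @ V[:, j] / n)   # normalize
    return V

rng = np.random.default_rng(1)
xi = rng.random(5000) ** 2          # arbitrary (non-classical) sample data
V = apc_basis(xi, order=3)
gram = V.T @ V / len(xi)            # close to the identity if orthonormal
```

Because the basis is built directly from the sample statistics, no parametric form of the underlying distribution is assumed, which is the defining feature of aPC.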
We aim at quantifying two different types of uncertainty: the parametric uncertainty due to the stochastic equation, and the approximation uncertainty of the PINN. The latter reflects how well the PINN is trained and how robust it is at the predicting stage. To this end, we adopt the dropout strategy for estimating the approximation uncertainty. Dropout is typically used to combat overfitting, but it can also be exploited to quantify the approximation uncertainty at no additional cost. In our examples, we use DNNs to learn the modal functions of the stochastic parameter and the dropout strategy to quantify their associated uncertainty. Based on this, we propose an iterative method for actively learning where to place new sensors so as to enhance the approximation accuracy of PINNs. The numerical results demonstrate the effectiveness of this active-learning strategy, not only in placing new sensors but also in making better use of the existing ones.
There are other possible methods for quantifying parametric and approximation uncertainties. For example, we could abandon aPC and use the stochastic data directly as the input, e.g., by considering the random space together with the physical space. Standard DNNs, and in particular the deterministic PINNs developed in Raissi et al. (2017, 2017), could then be applied directly to stochastic differential equations. However, we did not choose this approach for two reasons: first, it does not lead to explicit expressions for the stochasticity of the quantity of interest (QoI); and second, unlike sampling in the physical space, where locations can always be marked by their coordinates, marking random instances with random variables is much harder. Moreover, the dimension of the random space is usually much higher than that of the physical space, requiring a dimension-reduction procedure before the problem is actually solved. Although to a lesser extent, high dimensionality is also a limitation of the aPC approach, because it requires high-dimensional DNN outputs, which can make the training process harder. Other promising methods that may be able to handle high-dimensional stochastic problems are generative adversarial networks (GANs) Goodfellow et al. (2014), and we plan to investigate this line of research systematically in future work.
6 Acknowledgement
This work is supported by DARPA (N66001-15-2-4055), Air Force (FA9550-17-1-0013), ARL (W911NF-12-2-0023), NSF of China (No. 11671265), and the Science Challenge Project (No. TZ2018001). In addition, we would like to thank Liu Yang for his generous advice.
References
 X. Fan, Y. Liu, J. Tao, Y. Weng, Soil salinity retrieval from advanced multispectral sensor with partial least square regression, Remote Sensing 7 (2015) 488–511.
 H. N. Pollack, S. J. Hurter, J. R. Johnson, Heat flow from the Earth’s interior: Analysis of the global data set, Reviews of Geophysics 31 (1993) 267–280.
 T. Graepel, Solving noisy linear operator equations by Gaussian processes: Application to ordinary and partial differential equations, in: International Conference on Machine Learning (2003), pp. 234–241.
 S. Särkkä, Linear operators and stochastic partial differential equations in Gaussian process regression, in: International Conference on Artificial Neural Networks (2011), Springer, pp. 151–158.
 I. Bilionis, Probabilistic solvers for partial differential equations, arXiv preprint (2016) arXiv:1607.03526.
 M. Raissi, P. Perdikaris, G. E. Karniadakis, Numerical Gaussian processes for time-dependent and nonlinear partial differential equations, SIAM Journal on Scientific Computing 40 (2018) A172–A198.
 G. Pang, L. Yang, G. E. Karniadakis, Neural-net-induced Gaussian process regression for function approximation and PDE solution, arXiv preprint (2018) arXiv:1806.11187.
 I. E. Lagaris, A. C. Likas, D. I. Fotiadis, Artificial neural networks for solving ordinary and partial differential equations, IEEE Transactions on Neural Networks 9 (1998) 987–1000.
 I. E. Lagaris, A. C. Likas, D. G. Papageorgiou, Neural-network methods for boundary value problems with irregular boundaries, IEEE Transactions on Neural Networks 11 (2000) 1041–1049.
 Y. Khoo, J. Lu, L. Ying, Solving parametric PDE problems with artificial neural networks, arXiv preprint (2017) arXiv:1707.03351.
 M. Raissi, P. Perdikaris, G. E. Karniadakis, Physics informed deep learning (part I): Data-driven solutions of nonlinear partial differential equations, arXiv preprint (2017) arXiv:1711.10561.
 M. A. Nabian, H. Meidani, A deep neural network surrogate for high-dimensional random partial differential equations, arXiv preprint (2018) arXiv:1806.02957.
 A. M. Stuart, Inverse problems: A Bayesian perspective, Acta Numerica 19 (2010) 451–559.
 Y. Zhu, N. Zabaras, Bayesian deep convolutional encoder-decoder networks for surrogate modeling and uncertainty quantification, Journal of Computational Physics 366 (2018) 415–447.
 M. Raissi, P. Perdikaris, G. E. Karniadakis, Machine learning of linear differential equations using Gaussian processes, Journal of Computational Physics 348 (2017) 683–693.
 S. H. Rudy, S. L. Brunton, J. L. Proctor, J. N. Kutz, Data-driven discovery of partial differential equations, Science Advances 3 (2017) e1602614.
 S. Rudy, A. Alla, S. L. Brunton, J. N. Kutz, Data-driven identification of parametric partial differential equations, arXiv preprint (2018) arXiv:1806.00732.
 M. Raissi, P. Perdikaris, G. E. Karniadakis, Physics informed deep learning (part II): Data-driven discovery of nonlinear partial differential equations, arXiv preprint (2017) arXiv:1711.10566.
 M. Raissi, G. Karniadakis, Hidden physics models: Machine learning of nonlinear partial differential equations, Journal of Computational Physics 357 (2018) 125–141.
 A. M. Tartakovsky, C. O. Marrero, P. Perdikaris, G. D. Tartakovsky, D. Barajas-Solano, Learning parameters and constitutive relationships with physics informed deep neural networks, arXiv preprint (2018) arXiv:1808.03398v2.
 W. E, J. Han, A. Jentzen, Deep learning-based numerical methods for high-dimensional parabolic partial differential equations and backward stochastic differential equations, arXiv preprint (2017) arXiv:1706.04702.
 M. Raissi, Forward-backward stochastic neural networks: Deep learning of high-dimensional partial differential equations, arXiv preprint (2018) arXiv:1804.07010.
 R. Ghanem, P. D. Spanos, Polynomial chaos in stochastic finite elements, Journal of Applied Mechanics 57 (1990) 197–202.
 D. Xiu, G. E. Karniadakis, The Wiener-Askey polynomial chaos for stochastic differential equations, SIAM Journal on Scientific Computing 24 (2002) 619–644.
 M. Zheng, X. Wan, G. E. Karniadakis, Adaptive multi-element polynomial chaos with discrete measure: Algorithms and application to SPDEs, Applied Numerical Mathematics 90 (2015) 91–110.
 X. Wan, G. Karniadakis, Multi-element generalized polynomial chaos for arbitrary probability measures, SIAM Journal on Scientific Computing 28 (2006) 901–928.
 S. Oladyshkin, W. Nowak, Data-driven uncertainty quantification using the arbitrary polynomial chaos expansion, Reliability Engineering and System Safety 106 (2012) 179–190.
 H. Lei, J. Li, P. Gao, P. Stinis, N. Baker, Data-driven approach of quantifying uncertainty in complex systems with arbitrary randomness, arXiv preprint (2018) arXiv:1804.08609.
 J. A. Witteveen, H. Bijl, Modeling arbitrary uncertainties using Gram-Schmidt polynomial chaos, in: 44th AIAA Aerospace Sciences Meeting and Exhibit (2006), p. 896.
 D. J. MacKay, A practical Bayesian framework for backpropagation networks, Neural computation 4 (1992) 448–472.
 R. M. Neal, Bayesian learning for neural networks, volume 118 of Lecture Notes in Statistics, Springer, (1996).
 M. I. Jordan, Z. Ghahramani, T. S. Jaakkola, L. K. Saul, An introduction to variational methods for graphical models, in: Learning in graphical models, Springer, (1998), pp. 105–161.
 J. Paisley, D. Blei, M. Jordan, Variational Bayesian inference with stochastic search, arXiv preprint (2012) arXiv:1206.6430.
 D. P. Kingma, M. Welling, Auto-encoding variational Bayes, arXiv preprint (2013) arXiv:1312.6114.
 D. J. Rezende, S. Mohamed, D. Wierstra, Stochastic backpropagation and approximate inference in deep generative models, arXiv preprint (2014) arXiv:1401.4082.
 M. Titsias, M. Lázaro-Gredilla, Doubly stochastic variational Bayes for non-conjugate inference, in: International Conference on Machine Learning (2014), pp. 1971–1979.
 M. D. Hoffman, D. M. Blei, C. Wang, J. Paisley, Stochastic variational inference, The Journal of Machine Learning Research 14 (2013) 1303–1347.
 Y. Gal, Z. Ghahramani, Dropout as a Bayesian approximation: Representing model uncertainty in deep learning, in: International Conference on Machine Learning (2016), pp. 1050–1059.
 Y. Gal, J. Hron, A. Kendall, Concrete dropout, in: Advances in Neural Information Processing Systems (2017), pp. 3584–3593.
 Y. Li, Y. Gal, Dropout inference in Bayesian neural networks with alpha-divergences, in: International Conference on Machine Learning (2017), pp. 2052–2061.
 G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, R. R. Salakhutdinov, Improving neural networks by preventing co-adaptation of feature detectors, arXiv preprint (2012) arXiv:1207.0580.
 N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, R. Salakhutdinov, Dropout: A simple way to prevent neural networks from overfitting, The Journal of Machine Learning Research 15 (2014) 1929–1958.
 A. Damianou, N. Lawrence, Deep Gaussian processes, in: Artificial Intelligence and Statistics (2013), pp. 207–215.
 Y. Gal, Z. Ghahramani, A theoretically grounded application of dropout in recurrent neural networks, in: Advances in Neural Information Processing Systems (2016), pp. 1019–1027.
 A. Kendall, V. Badrinarayanan, R. Cipolla, Bayesian SegNet: Model uncertainty in deep convolutional encoder-decoder architectures for scene understanding, arXiv preprint (2015) arXiv:1511.02680.
 A. Kendall, Y. Gal, What uncertainties do we need in Bayesian deep learning for computer vision?, in: Advances in Neural Information Processing Systems (2017), pp. 5580–5590.
 C. Angermueller, H. J. Lee, W. Reik, O. Stegle, DeepCpG: Accurate prediction of single-cell DNA methylation states using deep learning, Genome Biology 18 (2017) 67.
 X. Yang, R. Kwitt, M. Niethammer, Fast predictive image registration, in: Deep Learning and Data Labeling for Medical Applications, Springer, 2016, pp. 48–57.
 M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, M. Kudlur, J. Levenberg, R. Monga, S. Moore, D. G. Murray, B. Steiner, P. Tucker, V. Vasudevan, P. Warden, M. Wicke, Y. Yu, X. Zheng, TensorFlow: A system for large-scale machine learning, in: 12th USENIX Symposium on Operating Systems Design and Implementation (2016), pp. 265–283.
 D. P. Kingma, J. Ba, Adam: A method for stochastic optimization, arXiv preprint (2014) arXiv:1412.6980.
 I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, Y. Bengio, Generative adversarial nets, in: Advances in Neural Information Processing Systems (2014), pp. 2672–2680.