# Physics-Informed Regularization of Deep Neural Networks

###### Abstract

This paper presents a novel physics-informed regularization method for training deep neural networks (DNNs). In particular, we focus on the DNN representation of the response of a physical or biological system for which a set of governing laws is known. These laws often appear in the form of differential equations, derived from first principles, empirically validated laws, and/or domain expertise. We propose a DNN training approach that utilizes these known differential equations in addition to the measurement data, by adding a penalty term to the training loss function that penalizes divergence from the governing laws. Through three numerical examples, we show that the proposed regularization produces surrogates that are physically interpretable and have smaller generalization errors than those obtained with other common regularization methods.

###### keywords:

Deep neural networks, regularization, physics-informed, deep learning, predictive modeling, nonlinear dynamics, surrogates, metamodels.


## 1 Introduction

Many science and engineering problems require repeated simulation runs of a model with different input values. Examples include design optimization, model calibration, sensitivity analysis, what-if analysis, and design space exploration. However, in many real-world problems, obtaining a reliable outcome requires a large number of these solves (typically of a partial differential equation, PDE), which can be prohibitive given the available resources. One way to alleviate this burden is to construct surrogate models koziel2013surrogate () that mimic the solution or response surface. One example is building an analytical polynomial function for the displacement of a 2D plate at different locations.

A surrogate serves as an approximating model for the solution of the PDE, or quantity of interest (QoI), especially when that QoI cannot be easily computed or measured. Let the QoI be denoted by $f(\mathbf{x})$, and the global approximation provided by the surrogate be denoted by $\hat{f}(\mathbf{x})$. The surrogate is typically built by using a set of exact model evaluations $\{y_i = f(\mathbf{x}_i)\}_{i=1}^{n}$ at the $d$-dimensional input locations $\{\mathbf{x}_i \in \mathcal{X}\}_{i=1}^{n}$, where $\mathcal{X}$ is the domain of the problem. Various surrogate techniques have been used in the literature simpson2001metamodels (). Among the most popular ones are polynomial response surfaces (e.g. queipo2005surrogate (); shanock2010polynomial ()), radial basis functions (e.g. buhmann2000radial (); wild2008orbit ()), polynomial chaos expansions (e.g. xiu2002wiener (); marzouk2009dimensionality ()), Kriging (e.g. simpson2001kriging (); jeong2005efficient ()), Gradient-Enhanced Kriging (GEK) (e.g. bouhlel2017gradient (); de2012efficient ()), Support Vector Regression (SVR) (e.g. smola2004tutorial (); clarke2005analysis ()), and deep neural networks (e.g. lecun2015deep (); goodfellow2016deep (); nabian2018deep2 (); nabian2018deep1 ()). Our focus in this paper is on deep neural network surrogates.

A major challenge to the successful construction of deep neural network surrogates with many parameters (i.e. many layers or many units) is that they tend to overfit easily. That is, even though the model fits the training data very well, it fails to capture the underlying relationship in the data and, as a result, does not generalize well to unobserved test data. To overcome this difficulty, several regularization techniques have been developed to prevent deep neural networks from overfitting. Regularizers typically act during training by applying penalties to layer parameters and incorporating these penalties in the loss function that is minimized. Popular choices of regularization methods for deep neural networks include $L_1$ and $L_2$ regularizations lecun2015deep (), and dropout srivastava2014dropout ().

This paper presents a novel physics-informed approach for the regularization of deep neural network surrogates for systems that are subject to known governing laws in the form of a PDE. These governing equations are obtained from first principles, empirically validated laws, and/or domain expertise. In data-driven modeling of physical and biological systems, this prior knowledge is usually available, but is not directly used in training the models (e.g. guo2016convolutional (); hennigh2017lat (); hennigh2017automated ()). In the construction of data-driven deep neural network models, in particular, the prior knowledge about the governing equations can be effectively utilized to “push” the trained models to satisfy the governing laws. Specifically, we do so by creating a regularization term that accounts for the underlying physics by penalizing divergence from the governing equations. It is shown through numerical examples that the proposed regularization method offers two main advantages: (1) it effectively prevents overfitting and results in significantly smaller generalization errors, when compared to other regularization methods; and (2) it produces surrogates that are physically interpretable, as opposed to the ones that are trained using purely data-driven approaches.

The remainder of this paper is organized as follows. Feed-forward fully-connected deep neural networks are explained in Section 2. In Section 3, a number of commonly-used regularization methods for deep neural networks are discussed. The proposed PI regularization method is then introduced in Section 4. Section 5 includes three numerical examples, on which the performance of the proposed PI regularization method is compared with other common alternatives. Finally, Section 6 concludes with discussion on the relative advantages and limitations of the proposed method and potential future works.

## 2 Regression using feed-forward fully-connected deep neural networks

In a regression task, the objective is to approximate an unknown function $f$ given a training dataset consisting of $n$ input samples $\mathbf{x}_i$ and their corresponding outputs $\mathbf{y}_i$, $i = 1, \dots, n$. Specifically, we assume that the following relationship holds for any data point $i$:

$$\mathbf{y}_i = f(\mathbf{x}_i) + \boldsymbol{\epsilon}_i, \tag{1}$$

where $f$ is the unknown nonlinear function and $\boldsymbol{\epsilon}_i$ is the measurement or simulation noise. In our case, $f$ is the solution of a PDE represented as a function of input variables, such as time and/or spatial coordinates, and our objective is to approximate this function by a deep neural network.

For notational brevity, let us first define the single hidden layer neural network, since its generalization to a network with multiple hidden layers, effectively a deep neural network, is straightforward. Specifically, given an input $\mathbf{x}$, a standard single hidden layer neural network approximates the $D$-dimensional response $\mathbf{y}$ according to

$$\hat{\mathbf{y}} = \sigma\left(\mathbf{x}\mathbf{W}_1 + \mathbf{b}_1\right)\mathbf{W}_2 + \mathbf{b}_2, \tag{2}$$

where $\mathbf{W}_1$ and $\mathbf{W}_2$ are weight matrices of size $d \times q$ and $q \times D$, and $\mathbf{b}_1$ and $\mathbf{b}_2$ are bias vectors of size $1 \times q$ and $1 \times D$, respectively, with $q$ denoting the number of hidden units. The function $\sigma$ is an element-wise nonlinear function, commonly known as the activation function. Popular choices of activation functions include Sigmoid, hyperbolic tangent (Tanh), and Rectified Linear Unit (ReLU). In deep neural networks, the output of each activation function is transformed by a new weight matrix and a new bias, and is then fed to another activation function. Each new set of a weight matrix and a bias that is added to (2) constitutes a new hidden layer in the neural network. Generally, the capability of neural networks to approximate complex nonlinear functions can be improved by adding more hidden layers or increasing the dimensionality of the hidden layers lecun2015deep (); goodfellow2016deep ().

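In order to calibrate the weight matrices and biases, we use a Euclidean loss function as follows

$$J(\Theta; X, Y) = \frac{1}{2n}\sum_{i=1}^{n}\left\|\mathbf{y}_i - \hat{\mathbf{y}}_i\right\|_2^2, \tag{3}$$

where $J$ is the loss function, $X = \{\mathbf{x}_1, \dots, \mathbf{x}_n\}$, $Y = \{\mathbf{y}_1, \dots, \mathbf{y}_n\}$, and $\Theta = \{\mathbf{W}_1, \mathbf{W}_2, \mathbf{b}_1, \mathbf{b}_2\}$.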

The model parameters can be calibrated according to the following optimization problem

$$\hat{\Theta} = \underset{\Theta}{\arg\min}\; J(\Theta; X, Y), \tag{4}$$

where $\hat{\Theta}$ are the estimated parameter values at the end of the training. The optimization is performed iteratively using Stochastic Gradient Descent (SGD) and its variants bottou2012stochastic (); kingma2014adam (); duchi2011adaptive (); zeiler2012adadelta (); sutskever2013importance (). Specifically, at the $k^{\text{th}}$ iteration, the model parameters are updated according to

$$\Theta^{(k+1)} = \Theta^{(k)} - \eta^{(k)}\,\nabla_{\Theta} J\left(\Theta^{(k)}; X, Y\right), \tag{5}$$

where $\eta^{(k)}$ is the step size in the $k^{\text{th}}$ iteration. At each iteration, $\nabla_{\Theta} J$ is calculated using backpropagation lecun2015deep (), where the gradients of the objective function with respect to the weights and biases of a deep neural network are calculated by starting from the network output and propagating towards the input layer, layer by layer, using the chain rule. More details on feed-forward fully-connected deep neural networks can be found in lecun2015deep (); goodfellow2016deep ().
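The training procedure of this section can be illustrated with a minimal NumPy sketch. It assumes a single hidden layer with Tanh activation, a loss normalized as $1/(2n)$, plain full-batch gradient descent with a constant step size, and a toy target $\sin(3x)$; the names `forward`, `loss_and_grads`, and `eta`, as well as all numerical values, are illustrative choices rather than the setup used in the numerical examples.

```python
import numpy as np

def forward(x, W1, b1, W2, b2):
    """Single hidden layer network: y_hat = tanh(x W1 + b1) W2 + b2."""
    h = np.tanh(x @ W1 + b1)                 # hidden activations, shape (n, q)
    return h @ W2 + b2, h

def loss_and_grads(x, y, W1, b1, W2, b2):
    """Euclidean loss J = 1/(2n) * sum ||y - y_hat||^2 and its gradients."""
    n = x.shape[0]
    y_hat, h = forward(x, W1, b1, W2, b2)
    err = y_hat - y                          # shape (n, D)
    J = 0.5 * np.sum(err ** 2) / n
    # Backpropagation: start at the output and move towards the input.
    dW2 = h.T @ err / n
    db2 = err.sum(axis=0, keepdims=True) / n
    dh = (err @ W2.T) * (1.0 - h ** 2)       # tanh'(a) = 1 - tanh(a)^2
    dW1 = x.T @ dh / n
    db1 = dh.sum(axis=0, keepdims=True) / n
    return J, (dW1, db1, dW2, db2)

# Toy data (assumption): noise-free samples of sin(3x) on [-1, 1].
rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=(64, 1))
y = np.sin(3.0 * x)

# d = 1 input, q = 16 hidden units, D = 1 output (illustrative sizes).
params = [rng.normal(0.0, 0.5, (1, 16)), np.zeros((1, 16)),
          rng.normal(0.0, 0.5, (16, 1)), np.zeros((1, 1))]

eta = 0.05                                   # constant step size (assumption)
J_initial, _ = loss_and_grads(x, y, *params)
for _ in range(500):
    _, grads = loss_and_grads(x, y, *params)
    # Gradient-descent update: Theta <- Theta - eta * grad J
    params = [p - eta * g for p, g in zip(params, grads)]
J_final, _ = loss_and_grads(x, y, *params)
```

In practice the gradients would be obtained by automatic differentiation and the update by a stochastic optimizer such as Adam; the hand-written backward pass is shown only to make the chain-rule structure explicit.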

## 3 Regularization of deep neural networks

In this section, a number of regularization methods for the training of deep neural networks are briefly introduced. In particular, we discuss parameter norm regularization (specifically $L_1$ and $L_2$ regularizations) and dropout. These are commonly used methods for the regularization of deep neural networks, among others (e.g. dropconnect wan2013regularization (), early stopping caruana2001overfitting (), and dataset augmentation salamon2017deep ()).

### 3.1 Parameter norm regularization

Most of the regularization methods limit the flexibility of deep neural network models by adding a parameter norm penalty term $\Omega(\Theta)$ to the loss function $J$. The regularized loss function, denoted by $\tilde{J}$, can be expressed as

$$\tilde{J}(\Theta; X, Y) = J(\Theta; X, Y) + \lambda\,\Omega(\Theta), \tag{6}$$

where $\lambda \in [0, \infty)$ is a hyperparameter controlling the contribution of the parameter norm penalty term relative to the standard loss function $J$. $L_1$ and $L_2$ regularizations are among the most common parameter norm regularizations. We note that for deep neural networks, parameter norm regularization usually penalizes only the weights, and the biases remain unregularized. This is done by forming the penalty terms as a function of the weights only. The biases usually require significantly less training data than the weights in order to be fitted accurately. Additionally, regularizing the biases can result in significant underfitting. More discussion in this regard is provided in goodfellow2016deep ().

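The $L_2$ parameter regularization is performed by setting the penalty term $\Omega(\Theta) = \frac{1}{2}\left\|\mathbf{W}\right\|_2^2$. The $L_2$-regularized loss function therefore takes the following form

$$\tilde{J}(\Theta; X, Y) = J(\Theta; X, Y) + \frac{\lambda}{2}\left\|\mathbf{W}\right\|_2^2, \tag{7}$$

with the corresponding parameter gradient

$$\nabla_{\mathbf{W}}\tilde{J}(\Theta; X, Y) = \nabla_{\mathbf{W}}J(\Theta; X, Y) + \lambda\,\mathbf{W}. \tag{8}$$

The addition of the weight decay term modifies the learning rule so that the weights are multiplicatively shrunk by a constant factor on each training iteration: a gradient step with step size $\eta$ first scales $\mathbf{W}$ by $(1 - \eta\lambda)$ and then applies the usual gradient update. Therefore, $L_2$ regularization forces the deep neural network parameters toward relatively small values.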

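The $L_1$ parameter regularization consists of setting the penalty term $\Omega(\Theta) = \left\|\mathbf{W}\right\|_1 = \sum_i |w_i|$, where $w_i$ are the individual weight parameters of the neural network. The $L_1$-regularized loss function therefore takes the following form

$$\tilde{J}(\Theta; X, Y) = J(\Theta; X, Y) + \lambda\left\|\mathbf{W}\right\|_1, \tag{9}$$

with the corresponding parameter gradient

$$\nabla_{\mathbf{W}}\tilde{J}(\Theta; X, Y) = \nabla_{\mathbf{W}}J(\Theta; X, Y) + \lambda\,\text{sign}(\mathbf{W}), \tag{10}$$

where $\text{sign}(\mathbf{W})$ is simply the sign function of $\mathbf{W}$ applied element-wise. In comparison to $L_2$ regularization, the $L_1$ regularization contribution to the loss gradient no longer scales linearly with $w_i$; instead, it is a constant whose sign is determined by $\text{sign}(w_i)$. As a result, the regularization effectively promotes sparsity in the weight matrix $\mathbf{W}$.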
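The contrast between the two parameter norm penalties can be seen in a small NumPy sketch that evaluates only the penalty terms and their gradients. The function names `l2_penalty` and `l1_penalty` and the numerical values of `lam` and `eta` are illustrative assumptions; the data-fit part of the gradient is left out so that the effect of the penalty alone is visible.

```python
import numpy as np

def l2_penalty(W, lam):
    """Return (lam/2) * ||W||_2^2 and its gradient lam * W."""
    return 0.5 * lam * np.sum(W ** 2), lam * W

def l1_penalty(W, lam):
    """Return lam * ||W||_1 and its (sub)gradient lam * sign(W)."""
    return lam * np.sum(np.abs(W)), lam * np.sign(W)

W = np.array([[0.5, -2.0],
              [0.0,  4.0]])
lam, eta = 0.1, 0.5                     # illustrative values (assumption)

_, grad_l2 = l2_penalty(W, lam)
_, grad_l1 = l1_penalty(W, lam)

# L2: every weight is scaled by the same factor (1 - eta * lam).
W_after_l2 = W - eta * grad_l2
# L1: every nonzero weight moves by the same amount eta * lam towards zero.
W_after_l1 = W - eta * grad_l1
```

Large weights are shrunk the most in absolute terms by the $L_2$ step, whereas the $L_1$ step moves all nonzero weights by the same amount, which is what drives small weights to zero.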

### 3.2 Dropout

A relatively recent regularization technique, dropout involves removing components of each layer randomly with probability $p$ during model optimization, independently for each forward-backward pass (i.e. each iteration that updates the model parameters) srivastava2014dropout (). This prevents units from co-adapting excessively hinton2012improving (). The dropped-out components do not contribute to the forward pass, and weight updates are not applied to them on the backward pass. As a result of applying dropout, an exponential number of different thinned networks are effectively sampled. At test time, a single unthinned network (including all the units) is used to approximate the average of the predictions of all these thinned networks srivastava2014dropout ().

The standard single hidden layer neural network defined in Equation 2 with dropout applied to the hidden layer takes the following form

$$\hat{\mathbf{y}} = \left(\sigma\left(\mathbf{x}\mathbf{W}_1 + \mathbf{b}_1\right) \circ \mathbf{r}\right)\mathbf{W}_2 + \mathbf{b}_2, \tag{11}$$

where $\mathbf{r}$ is a binary mask vector with one entry per hidden unit, $\circ$ denotes the element-wise product, and $r_j \sim \text{Bernoulli}(1 - p)$, so that each hidden unit is removed with probability $p$. Dropout is shown to improve the performance of deep neural networks in a variety of supervised learning tasks in speech recognition, vision, document classification, and computational biology srivastava2014dropout (); hinton2012improving (); dahl2013improving (); krizhevsky2012imagenet (); pham2014dropout (). It is shown in wager2013dropout () that dropout applied to linear regression is equivalent to $L_2$ regularization, with a different weight decay coefficient for each input feature, where the magnitude of each weight decay coefficient is determined by the variance of the corresponding feature.

## 4 Physics-Informed (PI) regularization

As stated earlier in Section 2, in DNN regression we seek to approximate the unknown response function $f$. We consider cases where the response function obeys a governing law, as follows

$$\mathcal{N}\left(\mathbf{x}, f(\mathbf{x})\right) = 0, \tag{12}$$

where $\mathcal{N}$ is a general differential operator that may consist of partial derivatives and linear and nonlinear terms. Let us denote the DNN approximation by $\hat{f}(\mathbf{x}; \Theta)$. The PI-regularized loss function is then defined as follows

$$\tilde{J}(\Theta; X, Y) = J(\Theta; X, Y) + \lambda\,\Omega_{PI}(\Theta; X), \tag{13}$$

where $\lambda$ is a hyperparameter controlling the contribution of the physics-informed penalty term $\Omega_{PI}$ that is defined as

$$\Omega_{PI}(\Theta; X) = \frac{1}{n}\sum_{i=1}^{n}\left\|\mathcal{N}\left(\mathbf{x}_i, \hat{f}(\mathbf{x}_i; \Theta)\right)\right\|_2^2, \tag{14}$$

in which the term $\mathcal{N}(\mathbf{x}_i, \hat{f}(\mathbf{x}_i; \Theta))$ measures the divergence of the DNN solution from the governing laws at input location $\mathbf{x}_i$. By adding the PI-regularization term to the standard loss function, the standard supervised learning task is converted to a semi-supervised learning task, for which the supervised objective minimizes the mean squared differences between model predictions and measurements (as reflected in $J$), and the unsupervised objective minimizes divergence from the governing laws (as reflected in $\Omega_{PI}$).

It is shown through numerical examples that the proposed PI regularization method effectively prevents deep neural networks from overfitting, and also results in surrogates that are more physically interpretable. That is, the surrogate estimates more accurately the partial derivatives of the response, which carry physical meaning and can be utilized in subsequent calculations, such as sensitivity analysis. Although the PI regularization term in Equation 14 introduces an unsupervised learning task over the same inputs as those of the standard loss function, this is not a requirement. A different and possibly larger set of input data (especially when labeled data is scarce) may be used to perform this unsupervised learning task. Additionally, the deep neural network parameters may be trained sequentially, using the standard loss function first and the PI-regularized loss function at a later stage in training. It is also worth mentioning that the proposed PI regularization method can generally be combined with other regularization methods. For example, we can use a hybrid PI-$L_1$ regularization in order to push the deep neural network model to satisfy the governing laws and at the same time promote model sparsity.
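As a minimal sketch of how a PI-regularized loss of the kind in Equations 13 and 14 can be evaluated, the NumPy code below uses a single hidden layer Tanh network with scalar input and output, whose input derivative is available in closed form. The governing law $\mathrm{d}f/\mathrm{d}x + f = 0$ is a hypothetical stand-in chosen only for illustration, and the function names and loss normalization are assumptions; in practice the derivatives would typically come from automatic differentiation.

```python
import numpy as np

def net_and_derivative(x, W1, b1, W2, b2):
    """Return f_hat(x) and d f_hat / dx for a scalar-input tanh network."""
    h = np.tanh(x @ W1 + b1)                 # shape (n, q)
    f = h @ W2 + b2                          # shape (n, 1)
    # Chain rule: d/dx tanh(x w_j + b_j) = (1 - h_j^2) * w_j
    df_dx = ((1.0 - h ** 2) * W1) @ W2       # W1 has shape (1, q)
    return f, df_dx

def pi_regularized_loss(x, y, params, lam):
    """Data loss J plus lam times the physics-informed penalty."""
    f, df_dx = net_and_derivative(x, *params)
    J = 0.5 * np.mean((y - f) ** 2)          # supervised part
    residual = df_dx + f                     # hypothetical law: df/dx + f = 0
    omega_pi = np.mean(residual ** 2)        # unsupervised part
    return J + lam * omega_pi, J, omega_pi

rng = np.random.default_rng(0)
params = [rng.normal(size=(1, 8)), np.zeros((1, 8)),
          rng.normal(size=(8, 1)), np.zeros((1, 1))]
x = np.linspace(0.0, 1.0, 20).reshape(-1, 1)
y = np.exp(-x)                               # exact solution of the toy law
total, J, omega_pi = pi_regularized_loss(x, y, params, lam=1.0)
```

Because the penalty depends only on the inputs, `omega_pi` could equally be evaluated on a separate, larger set of unlabeled points than the one used for `J`.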

## 5 Numerical examples

In this section, we numerically study the performance of the proposed PI regularization in constructing accurate DNN surrogates for systems governed by physical laws. In the first and second examples, we consider systems governed by the Burgers’ and Navier-Stokes equations. In both of the examples, DNN surrogates are constructed using different regularizations including the PI regularization, and results are compared with each other. In the third example, we construct a DNN surrogate using the PI regularization method that can be used for vehicle aerodynamic optimization, and show that our proposed method results in smaller generalization error compared to the current state of the practice.

### 5.1 DNN surrogate for a system governed by the Burgers’ equation

Let us start with the Burgers’ equation, which arises in various areas of engineering, such as traffic flow, fluid mechanics, and acoustics. The Burgers’ equation considered in this example is expressed as

$$u_t + u\,u_x - \nu\,u_{xx} = 0, \tag{15}$$

where $u(t, x)$ is the solution, subscripts denote partial derivatives with respect to time $t$ and space $x$, and $\nu$ is the viscosity coefficient.

In order to generate training, evaluation, and test datasets, we solve this equation with spectral methods, using the source code provided by raissi2018deep (). Specifically, the Chebfun package driscoll2014chebfun () is used with a spectral Fourier discretization with 256 modes and an explicit Runge-Kutta time-stepping scheme, and the solution is saved at intervals of 0.05 s. The solution dataset is depicted in Figure 1. From the solution dataset, we randomly select 10,500 samples, out of which 500 samples are reserved for training, 5,000 samples are reserved for evaluation, and the remaining 5,000 are reserved for testing (note that, compared to the size of the training dataset, we chose large evaluation and test datasets to eliminate the need for cross-validation when performing hyperparameter tuning). Gaussian noise is also added to the solution $u$, with zero mean and a standard deviation of $\alpha\bar{u}$, where $\bar{u}$ is the mean value of $u$ in the training dataset, and $\alpha$ is a constant that controls the noise level. In this example we consider three different noise levels.
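The sampling and noise procedure described above can be sketched in NumPy as follows. The function name `split_and_add_noise` is illustrative, the noise standard deviation is taken here as a constant `alpha` times the magnitude of the training-set mean of the solution, and the noise is applied to all selected samples; these are assumptions about details that the text does not fully specify.

```python
import numpy as np

def split_and_add_noise(u, alpha, n_train=500, n_eval=5000, n_test=5000, seed=0):
    """Randomly pick disjoint train/eval/test indices and add Gaussian noise.

    u     : 1-D array of solution values, one per (t, x) sample
    alpha : constant controlling the noise level (alpha = 0 means no noise)
    """
    rng = np.random.default_rng(seed)
    picked = rng.choice(len(u), size=n_train + n_eval + n_test, replace=False)
    train, evaluation, test = np.split(picked, [n_train, n_train + n_eval])
    # Noise standard deviation tied to the training-set mean (assumption).
    sigma = alpha * abs(u[train].mean())
    noisy_u = u + sigma * rng.standard_normal(u.shape)
    return {"train": train, "eval": evaluation, "test": test}, noisy_u

# Placeholder solution values standing in for the solver output.
u_all = np.linspace(0.0, 1.0, 20000)
indices, u_noisy = split_and_add_noise(u_all, alpha=0.05)
```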

We construct five different surrogate models with five different regularization choices: no regularization, $L_1$ regularization, $L_2$ regularization, dropout, and the proposed PI regularization. The architecture of these deep neural network surrogates is fixed and consists of 4 hidden layers, each with 32 units with Tanh nonlinearities. The Adam optimization algorithm kingma2014adam () is used to solve the optimization problem defined in Equation 4, with the Adam parameters $\beta_1$ and $\beta_2$ set to 0.9 and 0.999, respectively. Batch size is set to 50. For the PI regularization, the following penalty term is used

$$\Omega_{PI} = \frac{1}{n}\sum_{i=1}^{n}\left(\hat{u}_t + \hat{u}\,\hat{u}_x - \nu\,\hat{u}_{xx}\right)^2\Big|_{(t_i, x_i)}, \tag{16}$$

where $\hat{u}$ denotes the deep neural network surrogate, $\nu$ is the viscosity coefficient of the Burgers’ equation, and the residual is evaluated at the $n$ training inputs $(t_i, x_i)$.
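A small NumPy sketch of a Burgers’ residual and the corresponding mean-squared penalty is given below. It assumes the viscous form $u_t + u\,u_x - \nu\,u_{xx} = 0$ with a generic viscosity `nu`, and takes the derivatives as precomputed arrays; in training they would be obtained by differentiating the network with respect to its inputs. The function names are illustrative.

```python
import numpy as np

def burgers_residual(u, u_t, u_x, u_xx, nu):
    """Pointwise residual of u_t + u * u_x - nu * u_xx = 0."""
    return u_t + u * u_x - nu * u_xx

def pi_penalty(residual):
    """Mean squared residual over the evaluation points."""
    return np.mean(residual ** 2)

# Sanity check with an exact solution: u = x / (1 + t) satisfies the
# equation for any nu, since u_xx = 0 and u_t = -u * u_x.
t, x = np.meshgrid(np.linspace(0.0, 1.0, 11), np.linspace(-1.0, 1.0, 21))
u = x / (1.0 + t)
u_t = -x / (1.0 + t) ** 2
u_x = 1.0 / (1.0 + t)
u_xx = np.zeros_like(u)

penalty_exact = pi_penalty(burgers_residual(u, u_t, u_x, u_xx, nu=0.1))
# A field that violates the equation receives a positive penalty.
penalty_wrong = pi_penalty(burgers_residual(u, np.zeros_like(u), u_x, u_xx, nu=0.1))
```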

Table 1 shows the hyperparameters that are tuned for each of the surrogates, together with the search domain for each of the hyperparameters. Training is performed for 8 different numbers of epochs, starting from 25,000 epochs and ending with 200,000 epochs. For each regularization method and each number of epochs, we train 100 models on the training dataset. The model that yields the lowest relative norm on the evaluation dataset is then selected as the best surrogate model for the given number of training epochs.
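The selection step can be sketched as follows, assuming that the relative norm is the Euclidean norm of the prediction error divided by the Euclidean norm of the reference values; the exact norm and the function names below are assumptions made for illustration.

```python
import numpy as np

def relative_error_norm(y_true, y_pred):
    """||y_pred - y_true|| / ||y_true|| using the Euclidean norm (assumption)."""
    return np.linalg.norm(y_pred - y_true) / np.linalg.norm(y_true)

def select_best(eval_targets, candidate_predictions):
    """Return the index and error of the candidate with the lowest error."""
    errors = [relative_error_norm(eval_targets, p) for p in candidate_predictions]
    best = int(np.argmin(errors))
    return best, errors[best]

# Toy evaluation set and predictions from three hypothetical trained models.
targets = np.array([3.0, 4.0])
predictions = [np.array([3.0, 4.5]), np.array([3.0, 4.0]), np.array([0.0, 0.0])]
best_index, best_error = select_best(targets, predictions)
```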

Regularization method | No Reg. | $L_1$ Reg. | $L_2$ Reg. | Dropout | PI Reg. |
---|---|---|---|---|---|
Hyperparameters | | | | | |

Figure 2 shows a comparison between the performance of each of the regularization methods for different noise levels in the dataset. For each regularization method, we train three different surrogates, each using a different random selection of training, evaluation, and test datasets, and the results for each of the surrogates as well as the average results are shown. It is evident that the PI regularization method provides superior accuracy compared to the other regularization methods at all the noise levels. Furthermore, Figure 3 compares the performance of the different regularization methods in predicting the first- and second-order derivatives of the solution to the Burgers’ equation in the noise-free case (noise-level constant set to 0). As can be seen, all the regularization methods, except PI regularization, fail to provide accurate derivative values. This is a remarkable feature of the proposed PI regularization method: it produces physically interpretable derivatives that can be used reliably in subsequent calculations such as sensitivity analysis.

### 5.2 DNN surrogate for a system governed by the Navier-Stokes equation

In this example we consider the vorticity equation medjo1995vorticity () given explicitly by

$$w_t + u\,w_x + v\,w_y = \frac{1}{Re}\left(w_{xx} + w_{yy}\right), \tag{17}$$

where $u$ and $v$ are respectively the $x$- and $y$-components of the velocity field, $w$ is the vorticity, defined to be the curl of the velocity vector, and $Re$ is the Reynolds number. We use the source code provided by raissi2018deep () to generate training, evaluation, and test datasets. Specifically, we use the Immersed Boundary Projection Method taira2007immersed (); colonius2008fast () to simulate the 2D fluid flow past a circular cylinder at a fixed Reynolds number. Following the procedure presented in kutz2016dynamic (), a multi-domain scheme with four nested domains is used, with each successive grid twice as large as the previous one. Time and length are nondimensionalized. The flow has unit velocity and the cylinder has unit diameter. Data is collected on the highest-resolution domain. The Navier-Stokes solver uses a 3rd-order Runge-Kutta (RK3) time-stepping scheme. Once the simulation converges to steady periodic vortex shedding, 151 flow snapshots are saved. A small portion of the resulting dataset is then subsampled for the construction of deep neural network surrogates. Specifically, we subsample 50,000 data points, of which 5,000 are used for training, 15,000 for evaluation, and 30,000 for testing.

Again in this example we construct five different surrogate models with five different regularization choices: no regularization, $L_1$ regularization, $L_2$ regularization, dropout, and the proposed PI regularization. The architecture of these deep neural network surrogates is fixed and consists of 4 hidden layers, each with 128 units with Tanh nonlinearities. The output of this surrogate is 3-dimensional, consisting of $\hat{u}$, $\hat{v}$, and $\hat{w}$. The Adam optimization algorithm kingma2014adam () is used to solve the optimization problem defined in Equation 4, with the Adam parameters $\beta_1$ and $\beta_2$ set to 0.9 and 0.999, respectively. The batch size is set to 50. For the PI regularization, the following penalty term is used

$$\Omega_{PI} = \frac{1}{n}\sum_{i=1}^{n}\left(\hat{w}_t + \hat{u}\,\hat{w}_x + \hat{v}\,\hat{w}_y - \frac{1}{Re}\left(\hat{w}_{xx} + \hat{w}_{yy}\right)\right)^2\Big|_{(t_i, x_i, y_i)}, \tag{18}$$

where the residual of the vorticity equation at Reynolds number $Re$ is evaluated at the $n$ training inputs $(t_i, x_i, y_i)$.
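The residual of the vorticity equation can be sketched in NumPy as below, assuming the nondimensional form $w_t + u\,w_x + v\,w_y = (w_{xx} + w_{yy})/Re$ and precomputed derivative arrays; the function name and the value of `reynolds` are illustrative. The Taylor-Green vortex, a classical exact solution of the 2D Navier-Stokes equations, is used only as a sanity check and is not part of the cylinder-flow dataset.

```python
import numpy as np

def vorticity_residual(u, v, w_t, w_x, w_y, w_xx, w_yy, reynolds):
    """Pointwise residual of w_t + u w_x + v w_y - (w_xx + w_yy) / Re = 0."""
    return w_t + u * w_x + v * w_y - (w_xx + w_yy) / reynolds

# Taylor-Green vortex evaluated at time t on a grid of (x, y) points.
reynolds, t = 100.0, 0.3                     # illustrative values (assumption)
nu = 1.0 / reynolds
x, y = np.meshgrid(np.linspace(0.0, 2 * np.pi, 16), np.linspace(0.0, 2 * np.pi, 16))
decay = np.exp(-2.0 * nu * t)
u = np.cos(x) * np.sin(y) * decay
v = -np.sin(x) * np.cos(y) * decay
w = -2.0 * np.cos(x) * np.cos(y) * decay     # vorticity w = v_x - u_y
w_t = -2.0 * nu * w
w_x = 2.0 * np.sin(x) * np.cos(y) * decay
w_y = 2.0 * np.cos(x) * np.sin(y) * decay
w_xx, w_yy = -w, -w

residual = vorticity_residual(u, v, w_t, w_x, w_y, w_xx, w_yy, reynolds)
penalty = np.mean(residual ** 2)             # mean squared residual
```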

Table 2 shows the hyperparameters that are tuned for each of the surrogates, together with the search domain for each of the hyperparameters. Training is performed for 9 different numbers of epochs, starting from 5,000 epochs and ending with 45,000 epochs. For each regularization method and each number of epochs, we train 100 models on the training dataset. The model that yields the lowest relative norm on the evaluation dataset is then selected as the best surrogate model for the given number of training epochs.

Regularization method | No Reg. | $L_1$ Reg. | $L_2$ Reg. | Dropout | PI Reg. |
---|---|---|---|---|---|
Hyperparameters | | | | | |

Figure 5 shows a comparison between the performance of different regularization methods. For each regularization method, we train three different surrogates, each trained using a different random selection of training, evaluation, and test datasets, and the results for each of the surrogates as well as the average results are shown. Once again, it is evident that the PI regularization method provides superior accuracies compared to other regularization methods.

#### 5.2.1 Note on the poor performance of dropout

It is observed through the first two numerical examples that surrogates trained with dropout are less accurate than surrogates trained with no regularization. A similar observation has been previously reported in other studies, e.g. kingma2014adam (). Several reasons can explain this observation. First, the success of dropout regularization has been shown in the literature mainly on classification tasks rather than regression tasks, and for Convolutional Neural Networks (CNNs) rather than fully-connected DNNs srivastava2014dropout (); krizhevsky2012imagenet (); goodfellow2016deep (). Second, as stated earlier, at test time a single unthinned network is used by implementing a weight scaling rule. However, the weight scaling rule is only an approximation for deep neural network models. It has only been shown empirically (mostly on CNNs) that the weight scaling rule performs well, and this has not been studied theoretically goodfellow2016deep (). It is stated in goodfellow2016deep () that the optimal choice of inference approximation for dropout networks is problem dependent, and the weight scaling rule does not necessarily perform well for all problems. Finally, dropout networks, compared to networks with no regularization, are known to require a relatively larger number of units/layers and to need training for a relatively larger number of epochs kingma2014adam (). This requirement is not accommodated in our examples, where the network architecture and the number of training epochs are kept the same for all the surrogates.

### 5.3 Surrogate modeling in CFD-based design optimization

In aerodynamic analysis and design problems, fluid flow is simulated by Computational Fluid Dynamics (CFD) solvers. This is done by solving the Navier-Stokes equations, which consist of mass and momentum conservation equations tu2018computational (); nabian2016multiphase (). In the Eulerian framework, for 2D steady laminar flows, the mass conservation equation is given by

$$\nabla \cdot \mathbf{u} = 0, \tag{19}$$

and the momentum conservation equation is defined as

$$\rho\left(\mathbf{u} \cdot \nabla\right)\mathbf{u} = -\nabla p + \mu\,\nabla^2\mathbf{u}, \tag{20}$$

where $\mathbf{u}$ is the 2D velocity vector, $p$ is the pressure, $\rho$ is the fluid density, and $\mu$ is the fluid viscosity. CFD simulation is typically intensive in terms of computational time and memory usage, which can make CFD-based design space exploration prohibitively costly. As an alternative, surrogate models can serve as substitutes for fast fluid flow prediction and enable engineers and designers to perform design space exploration efficiently, especially at the early stages of design optimization when there is no need for high-fidelity simulations. In this example, we show how a PI-regularized Convolutional Neural Network (CNN) surrogate can produce accurate results towards this end.

Specifically, we reconstruct the CNN model that was proposed in guo2016convolutional () and implemented in hennigh2018git () for the prediction of the velocity field in 2D non-uniform steady laminar flows in the presence of rigid bodies, and show how applying PI regularization to the CNN loss function can improve accuracy compared to the state of the practice, as reported in hennigh2018git (). For a fair comparison, we use the same implementation as hennigh2018git (), with no changes to the datasets, network architecture, hyperparameters, or the number of training epochs. The only changes we made pertain to the regularization used in training.

In the previous two examples, we regularized surrogates in the form of the fully-connected feed-forward deep neural network, a form briefly explained in Section 2. In this part we consider CNNs, which are a specific type of feed-forward deep neural network. A CNN consists of repeated application of convolution and pooling layers, followed by fully-connected layers at the end of the network (as described in Section 2). A convolution layer is a linear transformation that preserves spatial information in the input data. Pooling layers then simply reduce the dimensionality of the output of a convolution layer. More discussion on CNNs can be found in lecun2015deep (); goodfellow2016deep ().

The training and validation datasets of hennigh2018git (); guo2016convolutional (), used in this example, consist of five different types of simple geometric primitives: triangles, quadrilaterals, pentagons, hexagons, and dodecagons. Each sample is projected onto a Cartesian grid. The test dataset consists of different kinds of car prototypes including SUVs, vans, and sports cars. A binary representation of the geometry shapes is used, where a grid value is 1 if and only if it is within or on the boundaries of the geometry shapes, and 0 otherwise. Each sample consists of five matrices of equal size, where the first matrix represents the geometry shape and the second and third matrices represent the x- and y-components of the Cartesian grid, respectively. The last two matrices represent the ground-truth values for the x- and y-components of the velocity field, respectively, computed using the Lattice Boltzmann Method (LBM) chen1998lattice (). In all the experiments, the Reynolds number is set to 20. The no-slip boundary condition is applied to the geometry shape boundaries and horizontal walls. The training dataset contains 3,000 samples (600 samples for each type of primitive, differing in size, location, and orientation). The validation dataset consists of 300 samples (60 different samples for each type of primitive). Finally, the test dataset consists of 28 car prototypes.

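The loss function used in guo2016convolutional (); hennigh2018git () is in the form of

$$J = \frac{1}{n}\sum_{s=1}^{n}\left(\left\|\left(u^{s} - \hat{u}^{s}\right) \circ \mathbb{1}^{s}\right\|^2 + \left\|\left(v^{s} - \hat{v}^{s}\right) \circ \mathbb{1}^{s}\right\|^2\right), \tag{21}$$

where $u^{s}$ and $v^{s}$ are the ground-truth x- and y-components of the velocity field for sample $s$, $n$ is the number of samples, $\mathbb{1}^{s}$ is an indicator function (equal to 1 in the fluid region and 0 inside the geometry) that has the same size as $u^{s}$, $\circ$ denotes the element-wise product, and $\hat{u}^{s}$ and $\hat{v}^{s}$ are the CNN predictions for the x- and y-components of the velocity field for sample $s$. The loss function in Equation 21 is simply the Euclidean loss function that takes into account only the fluid part of the computational domain. As the competitor for our proposed method, we consider the surrogate trained with dropout, as implemented in hennigh2018git ().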
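A masked Euclidean loss of this kind can be sketched in NumPy as follows. The array layout `(samples, height, width)`, the convention that the mask is 1 in the fluid and 0 inside the body, the normalization by the number of samples, and the function name are assumptions made for illustration.

```python
import numpy as np

def masked_euclidean_loss(u, v, u_hat, v_hat, fluid_mask):
    """Squared velocity error summed over fluid cells, averaged over samples.

    All arrays have shape (n_samples, height, width); fluid_mask is 1 in the
    fluid region and 0 inside the rigid body.
    """
    squared_error = ((u - u_hat) ** 2 + (v - v_hat) ** 2) * fluid_mask
    return squared_error.sum() / u.shape[0]

# Two toy samples on a 4 x 6 grid with a 2 x 2 solid block.
n, height, width = 2, 4, 6
fluid_mask = np.ones((n, height, width))
fluid_mask[:, 1:3, 2:4] = 0.0
u = np.ones((n, height, width))
v = np.zeros((n, height, width))

# Errors located only inside the solid block do not contribute.
u_hat_solid_error = u.copy()
u_hat_solid_error[:, 1:3, 2:4] += 5.0
loss_solid_only = masked_euclidean_loss(u, v, u_hat_solid_error, v, fluid_mask)

# A uniform error of 0.5 in u is counted on the 20 fluid cells of each sample.
loss_uniform = masked_euclidean_loss(u, v, u + 0.5, v, fluid_mask)
```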

In order to apply the PI regularization, we only add a penalty term for the violation of the divergence-free condition (i.e. Equation 19) of the velocity field, and leave violations of the momentum conservation equation (Equation 20) unpenalized. This is because the second penalty term would necessitate another independent surrogate to be built for the pressure field $p$. Since no surrogate for the pressure field was constructed in the competing study hennigh2018git (), for the sake of a fair comparison, we only applied regularization to the velocity field surrogate. As a result, the PI regularization is given by

$$\Omega_{PI} = \frac{1}{n}\sum_{s=1}^{n}\left\|\frac{\partial \hat{u}^{s}}{\partial x} + \frac{\partial \hat{v}^{s}}{\partial y}\right\|^2, \tag{22}$$

where $\hat{u}^{s}$ and $\hat{v}^{s}$ are the CNN predictions for the x- and y-components of the velocity field for sample $s$, and $n$ is the number of samples.

It should be noted that in doing so, we incorporate only partial prior knowledge about the physics in the PI regularization, and further improvement can potentially be expected with the inclusion of the momentum penalty term.
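For a velocity field predicted on a regular grid, a divergence-free penalty can be sketched with finite differences in NumPy, as below. The use of `np.gradient` (central differences in the interior), the array layout `(samples, height, width)` with x along the last axis, the optional fluid mask, and the function name are assumptions for illustration; an actual CNN implementation would express the same stencil with differentiable tensor operations.

```python
import numpy as np

def divergence_penalty(u_hat, v_hat, dx=1.0, dy=1.0, fluid_mask=None):
    """Mean over samples of the summed squared divergence du/dx + dv/dy."""
    du_dx = np.gradient(u_hat, dx, axis=-1)   # x varies along the last axis
    dv_dy = np.gradient(v_hat, dy, axis=-2)   # y varies along the second-to-last axis
    divergence = du_dx + dv_dy
    if fluid_mask is not None:
        divergence = divergence * fluid_mask  # ignore cells inside the body
    return np.mean(np.sum(divergence ** 2, axis=(-2, -1)))

# Two toy samples on a 4 x 5 grid with unit spacing.
yy, xx = np.mgrid[0:4, 0:5].astype(float)
x_grid = np.stack([xx, xx])
y_grid = np.stack([yy, yy])

# (u, v) = (x, -y) is divergence free; (u, v) = (x, y) has divergence 2.
penalty_div_free = divergence_penalty(x_grid, -y_grid)
penalty_expanding = divergence_penalty(x_grid, y_grid)
```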

For the CNN surrogate, a U-network approach is used with residual layers he2016deep (), similar to Pixel-CNN++ van2016conditional (); Salimans2017PixeCNN (), which is a class of powerful generative models goodfellow2016nips (). For implementation, we used the source code provided by hennigh2018git (). The Adam optimization algorithm is used to solve the optimization problem defined in Equation 4, with the Adam parameters $\beta_1$ and $\beta_2$ set to 0.9 and 0.999, respectively. Batch size is set to 8.

Figure 6 shows a visualization of the velocity field for the test data. The first column shows the LBM ground truth results. The second column shows the CNN prediction results using the proposed PI regularization method. The third column shows the norm of the difference between ground truth and predicted results, averaged over three independent training runs. It is evident that the results are in close agreement. Table 3 compares the performance of the surrogate models trained with different regularization methods. The third column of the table shows the state of the practice hennigh2018git (). It can be seen that dropout regularization (third column) and PI regularization (fourth column) perform similarly. It should be noted again that the applied PI regularization only incorporates the partial prior knowledge pertaining to mass conservation, and does not regularize based on the momentum equation. Nevertheless, the best performance is obtained when PI regularization is applied in addition to dropout: applying our PI regularization method to the dropout implementation of hennigh2018git () reduces the relative norm.

Regularization method | No Reg. | Dropout | PI Reg. | Dropout & PI Reg. |
---|---|---|---|---|
Relative norm | | | | |

## 6 Conclusion

In this work, we presented a novel method for physics-informed regularization of deep neural networks. It has been shown through three numerical examples (systems governed by the Burgers’ and Navier-Stokes equations) that the proposed PI regularization method yields surrogates that are physically interpretable and that, compared to other common regularization methods, have significantly smaller generalization errors. This is achieved by adding a regularization term to the optimization loss function that penalizes violations of the governing laws by the surrogate.

One limitation of the proposed PI regularization method is that, in order to promote DNN surrogates that satisfy the governing equations, an independent surrogate has to be constructed for each of the physical variables that appear in the governing laws. This was discussed in Section 5.3, where a separate ‘pressure field’ surrogate was needed to enforce momentum conservation. When labeled training data are available for all the physical variables that appear in the governing laws, we can construct a separate surrogate for each physical variable in order to enable the PI regularization. This can be done even if some of those physical variables are not our QoIs. Doing so increases the computational cost compared to other regularization alternatives, which account only for the QoIs. However, whether the accuracy improvement justifies the extra cost can be investigated for each application. Also, in the absence of labeled training data, the use of the PI regularization is limited, unless we make use of fully unsupervised algorithms, such as the one proposed by the authors in nabian2018deep2 (), where a surrogate is constructed without using any training data, but only by minimizing divergence from the governing physical laws.

There are several research opportunities to pursue in future studies as extensions of this work, including the following: (1) An exciting avenue of future research is to propose hybrid regularization techniques that combine the PI regularization with $L_1$, $L_2$, and/or dropout regularization, and to investigate the performance of these hybrid methods compared to the proposed PI regularization alone. A glimpse of this hybrid use is already discussed in Section 5.3; however, more comprehensive studies are needed; (2) In some cases, when preparing the dataset for surrogate training and evaluation, there is the option to choose the input variables for which the simulations or experiments are conducted. It would be interesting to investigate optimal sampling strategies in order to reduce the surrogate generalization error and also improve the convergence rate when using the proposed PI regularization method; (3) It would also be worthwhile to investigate the performance of the proposed method in constructing accurate surrogates for nonlinear dynamic systems with varying system parameters, for which some variations of the system parameters may lead the system to go through bifurcations (such as a system with varying Reynolds number that is governed by the Navier-Stokes equations); and finally (4) A modified version of the proposed method may be developed that, instead of applying the regularizer term to the loss function (Equation 13) from the beginning of the training phase, applies the regularizer starting at some optimal point during training in order to improve the rate of convergence.
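
To make item (4) concrete, a delayed regularizer could be sketched as a step-dependent weight on the physics term, as below. This is a hypothetical illustration only: the hard switch, the names `pi_weight`, `start_step`, and `lam`, and the example values are assumptions of this sketch, and how to choose the switching point is precisely the open question.

```python
def pi_weight(step, start_step, lam):
    """Weight of the PI regularizer: zero before start_step, lam afterwards."""
    return lam if step >= start_step else 0.0


def total_loss(data_loss, physics_residual_loss, step, start_step, lam):
    """Data loss plus the PI penalty, with the penalty switched on at start_step."""
    return data_loss + pi_weight(step, start_step, lam) * physics_residual_loss


# Usage: the physics term is ignored at step 0 and active from step 10 on.
print(total_loss(1.0, 2.0, step=0, start_step=10, lam=0.5))
print(total_loss(1.0, 2.0, step=10, start_step=10, lam=0.5))
```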

## References

- (1) S. Koziel, L. Leifsson, Surrogate-based modeling and optimization, Applications in Engineering.
- (2) T. W. Simpson, J. Poplinski, P. N. Koch, J. K. Allen, Metamodels for computer-based engineering design: survey and recommendations, Engineering with computers 17 (2) (2001) 129–150.
- (3) N. V. Queipo, R. T. Haftka, W. Shyy, T. Goel, R. Vaidyanathan, P. K. Tucker, Surrogate-based analysis and optimization, Progress in aerospace sciences 41 (1) (2005) 1–28.
- (4) L. R. Shanock, B. E. Baran, W. A. Gentry, S. C. Pattison, E. D. Heggestad, Polynomial regression with response surface analysis: A powerful approach for examining moderation and overcoming limitations of difference scores, Journal of Business and Psychology 25 (4) (2010) 543–554.
- (5) M. D. Buhmann, Radial basis functions, Acta numerica 9 (2000) 1–38.
- (6) S. M. Wild, R. G. Regis, C. A. Shoemaker, Orbit: Optimization by radial basis function interpolation in trust-regions, SIAM Journal on Scientific Computing 30 (6) (2008) 3197–3219.
- (7) D. Xiu, G. E. Karniadakis, The Wiener–Askey polynomial chaos for stochastic differential equations, SIAM Journal on Scientific Computing 24 (2) (2002) 619–644.
- (8) Y. M. Marzouk, H. N. Najm, Dimensionality reduction and polynomial chaos acceleration of Bayesian inference in inverse problems, Journal of Computational Physics 228 (6) (2009) 1862–1902.
- (9) T. W. Simpson, T. M. Mauery, J. J. Korte, F. Mistree, Kriging models for global approximation in simulation-based multidisciplinary design optimization, AIAA journal 39 (12) (2001) 2233–2241.
- (10) S. Jeong, M. Murayama, K. Yamamoto, Efficient optimization design method using Kriging model, Journal of aircraft 42 (2) (2005) 413–420.
- (11) M. A. Bouhlel, J. R. Martins, Gradient-enhanced Kriging for high-dimensional problems, Engineering with Computers (2017) 1–17.
- (12) J. De Baar, T. P. Scholcz, C. V. Verhoosel, R. P. Dwight, A. H. van Zuijlen, H. Bijl, Efficient uncertainty quantification with gradient-enhanced Kriging: Applications in FSI, ECCOMAS Vienna.
- (13) A. J. Smola, B. Schölkopf, A tutorial on support vector regression, Statistics and computing 14 (3) (2004) 199–222.
- (14) S. M. Clarke, J. H. Griebsch, T. W. Simpson, Analysis of support vector regression for approximation of complex engineering analyses, Journal of mechanical design 127 (6) (2005) 1077–1087.
- (15) Y. LeCun, Y. Bengio, G. Hinton, Deep learning, Nature 521 (7553) (2015) 436–444.
- (16) I. Goodfellow, Y. Bengio, A. Courville, Deep learning, MIT press, 2016.
- (17) M. A. Nabian, H. Meidani, A deep neural network surrogate for high-dimensional random partial differential equations, arXiv preprint arXiv:1806.02957.
- (18) M. A. Nabian, H. Meidani, Deep learning for accelerated seismic reliability analysis of transportation networks, Computer-Aided Civil and Infrastructure Engineering 33 (6) (2018) 443–458.
- (19) N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, R. Salakhutdinov, Dropout: a simple way to prevent neural networks from overfitting, The Journal of Machine Learning Research 15 (1) (2014) 1929–1958.
- (20) X. Guo, W. Li, F. Iorio, Convolutional neural networks for steady flow approximation, in: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, 2016, pp. 481–490.
- (21) O. Hennigh, Lat-Net: Compressing lattice Boltzmann flow simulations using deep neural networks, arXiv preprint arXiv:1705.09036.
- (22) O. Hennigh, Automated design using neural networks and gradient descent, arXiv preprint arXiv:1710.10352.
- (23) L. Bottou, Stochastic gradient descent tricks, in: Neural networks: Tricks of the trade, Springer, 2012, pp. 421–436.
- (24) D. P. Kingma, J. Ba, Adam: A method for stochastic optimization, arXiv preprint arXiv:1412.6980.
- (25) J. Duchi, E. Hazan, Y. Singer, Adaptive subgradient methods for online learning and stochastic optimization, Journal of Machine Learning Research 12 (Jul) (2011) 2121–2159.
- (26) M. D. Zeiler, Adadelta: an adaptive learning rate method, arXiv preprint arXiv:1212.5701.
- (27) I. Sutskever, J. Martens, G. Dahl, G. Hinton, On the importance of initialization and momentum in deep learning, in: International conference on machine learning, 2013, pp. 1139–1147.
- (28) L. Wan, M. Zeiler, S. Zhang, Y. Le Cun, R. Fergus, Regularization of neural networks using DropConnect, in: International Conference on Machine Learning, 2013, pp. 1058–1066.
- (29) R. Caruana, S. Lawrence, C. L. Giles, Overfitting in neural nets: Backpropagation, conjugate gradient, and early stopping, in: Advances in neural information processing systems, 2001, pp. 402–408.
- (30) J. Salamon, J. P. Bello, Deep convolutional neural networks and data augmentation for environmental sound classification, IEEE Signal Processing Letters 24 (3) (2017) 279–283.
- (31) G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, R. R. Salakhutdinov, Improving neural networks by preventing co-adaptation of feature detectors, arXiv preprint arXiv:1207.0580.
- (32) G. E. Dahl, T. N. Sainath, G. E. Hinton, Improving deep neural networks for LVCSR using rectified linear units and dropout, in: Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on, IEEE, 2013, pp. 8609–8613.
- (33) A. Krizhevsky, I. Sutskever, G. E. Hinton, ImageNet classification with deep convolutional neural networks, in: Advances in neural information processing systems, 2012, pp. 1097–1105.
- (34) V. Pham, T. Bluche, C. Kermorvant, J. Louradour, Dropout improves recurrent neural networks for handwriting recognition, in: Frontiers in Handwriting Recognition (ICFHR), 2014 14th International Conference on, IEEE, 2014, pp. 285–290.
- (35) S. Wager, S. Wang, P. S. Liang, Dropout training as adaptive regularization, in: Advances in neural information processing systems, 2013, pp. 351–359.
- (36) M. Raissi, Deep hidden physics models: Deep learning of nonlinear partial differential equations, arXiv preprint arXiv:1801.06637.
- (37) T. A. Driscoll, N. Hale, L. N. Trefethen, Chebfun guide (2014).
- (38) T. T. Medjo, Vorticity-velocity formulation for the stationary Navier-Stokes equations: the three-dimensional case, Applied mathematics letters 8 (4) (1995) 63–66.
- (39) K. Taira, T. Colonius, The immersed boundary method: a projection approach, Journal of Computational Physics 225 (2) (2007) 2118–2137.
- (40) T. Colonius, K. Taira, A fast immersed boundary method using a nullspace approach and multi-domain far-field boundary conditions, Computer Methods in Applied Mechanics and Engineering 197 (25-28) (2008) 2131–2146.
- (41) J. N. Kutz, S. L. Brunton, B. W. Brunton, J. L. Proctor, Dynamic mode decomposition: data-driven modeling of complex systems, Vol. 149, SIAM, 2016.
- (42) J. Tu, G. H. Yeoh, C. Liu, Computational fluid dynamics: a practical approach, Butterworth-Heinemann, 2018.
- (43) M. A. Nabian, L. Farhadi, Multiphase mesh-free particle method for simulating granular flows and sediment transport, Journal of Hydraulic Engineering 143 (4) (2016) 04016102.
- (44) O. Hennigh, Steady-state-flow-with-neural-nets, https://github.com/loliverhennigh/Steady-State-Flow-With-Neural-Nets.git (2018).
- (45) S. Chen, G. D. Doolen, Lattice Boltzmann method for fluid flows, Annual review of fluid mechanics 30 (1) (1998) 329–364.
- (46) K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.
- (47) A. van den Oord, N. Kalchbrenner, L. Espeholt, O. Vinyals, A. Graves, et al., Conditional image generation with PixelCNN decoders, in: Advances in Neural Information Processing Systems, 2016, pp. 4790–4798.
- (48) T. Salimans, A. Karpathy, X. Chen, D. P. Kingma, PixelCNN++: A PixelCNN implementation with discretized logistic mixture likelihood and other modifications, in: ICLR, 2017.
- (49) I. Goodfellow, NIPS 2016 tutorial: Generative adversarial networks, arXiv preprint arXiv:1701.00160.