A Expectation-Maximization for the \boldsymbol{\mu} prior

# Sparse Variational Bayesian Approximations for Nonlinear Inverse Problems: applications in nonlinear elastography

## Abstract

This paper presents an efficient Bayesian framework for solving nonlinear, high-dimensional model calibration problems. It is based on a Variational Bayesian formulation that aims at approximating the exact posterior by means of solving an optimization problem over an appropriately selected family of distributions. The goal is two-fold. Firstly, to find lower-dimensional representations of the unknown parameter vector that capture as much as possible of the associated posterior density, and secondly to enable the computation of the approximate posterior density with as few forward calls as possible. We discuss how these objectives can be achieved by using a fully Bayesian argumentation and employing the marginal likelihood or evidence as the ultimate model validation metric for any proposed dimensionality reduction. We demonstrate the performance of the proposed methodology for problems in nonlinear elastography where the identification of the mechanical properties of biological materials can inform non-invasive, medical diagnosis. An Importance Sampling scheme is finally employed in order to validate the results and assess the efficacy of the approximations provided.

###### keywords:
Uncertainty Quantification, Variational Bayesian, Inverse Problem, Dimensionality reduction, Elastography, Dictionary Learning
1\cortext

[cor1]Corresponding Author. Tel: +49-89-289-16690 url]http://www.contmech.mw.tum.de

## 1 Introduction

The extensive use of large-scale computational models poses several challenges in model calibration as the accuracy of the predictions provided depends strongly on assigning proper values to the various model parameters. In mechanics of materials, accurate mechanical property identification can guide damage detection and an informed assessment of the system’s reliability (1). Identifying property-cross correlations can lead to the design of multi-functional materials (2). Permeability estimation for soil transport processes can assist in detection of contaminants, oil exploration (3).

Deterministic optimization techniques which have been developed to address these problems (4), lead to point estimates for the unknowns without rigorously considering the statistical nature of the problem and without providing quantification of the uncertainty in the inverse solution. Statistical approaches based on the Bayesian paradigm (5) on the other hand, aim at computing a (posterior) probability distribution on the parameters of interest. Bayesian formulations offer several advantages as they provide a unified framework for dealing with the uncertainty introduced by the incomplete and noisy measurements. Significant successes have been noted in applications such as geological tomography (6) , medical tomography (7), petroleum engineering (8) , as well as a host of other physical, biological, or social systems (9); (10). Representations of the parametric fields in existing deterministic and Bayesian approaches (artificially) impose a minimum length scale of variability usually determined by the discretization size of the governing PDEs (11). As a result they give rise to a very large vector of unknowns. Inference in high-dimensional spaces using standard Markov Chain Monte Carlo (MCMC) schemes is generally impractical as it requires an exuberant number of calls to the forward simulator in order to achieve convergence. Advanced schemes such as those employing Sequential Monte Carlo samplers (12); (13), adaptive MCMC (14), accelerated MCMC methods (15) or spectral methods (16) can alleviate some of these difficulties particularly when the posterior is multi-modal but still pose significant challenges in terms of the computational cost (17).

This work is particularly concerned with the identification of the mechanical properties of biological materials, in the context non-invasive medical diagnosis. While in certain cases mechanical properties can also be measured directly by excising multiple tissue samples, non-invasive procedures offer obvious advantages in terms of ease, cost and reducing risk of complications to the patient. Rather than x-ray techniques which capture variations in density, the identification of stiffness or mechanical properties in general, can potentially lead to earlier and more accurate diagnosis (18); (19), provide valuable insights that differentiate between modalities of the same pathology (20) and monitor the progress of treatments. In this paper we do not propose new imaging techniques but rather aim at developing rigorous statistical models and efficient computational tools that can make use of the data/observables (i.e. noisy displacements of deformed tissue) from existing imaging modalities (such as magnetic resonance (21), ultrasonic) in order to produce certifiable estimates of mechanical properties. The primary imaging modality considered in this project is ultrasound elasticity imaging (elastography (22); (23)). It is based on ultrasound tracking of pre- and post-compression images to obtain a map of position changes and deformations of the specimen due to an external pressure/load. The pioneering work of Ophir and coworkers (24) followed by several clinical studies (25); (26); (27) have demonstrated that the resulting strain images typically improve the diagnostic accuracy over ultrasound alone.

Beyond a mere strain imaging there are two approaches for inferring the constitutive material parameters. In the direct approach, the equations of equilibrium are interpreted as equations for the material parameters of interest, where the inferred strains and their derivatives appear as coefficients (28). While such an approach provides a computationally efficient strategy, it does not use the raw data (i.e. noisy displacements) but transformed versions i.e. strain fields (or even-worse, strain derivatives) which arise by applying sometimes ad hoc filtering and smoothing operators. As a result the informational content of the data is compromised and the quantification of the effect of observation noise is cumbersome. Furthermore, the smoothing employed can smear regions with sharply varying properties and hinder proper identification.

The alternative to direct methods, i.e. indirect or iterative procedures admit an inverse problem formulation where the discrepancy between observed and model-predicted displacements is minimized with respect to the material fields of interest (29); (30); (31); (32). While these approaches utilize directly the raw data, they generally imply an increased computational cost as the forward problem and potentially derivatives have to be solved/computed several times. This effort is amplified when stochastic/statistical formulations are employed as those arising in the Bayesian paradigm. Technological advances have led to the development of hand-carried ultrasound systems in the size of a smartphone (33). Naturally their accuracy and resolution does not compare with the more expensive traditional ultrasound machines or even more so MRI systems. If however computational tools are available that can distill the informational content from noisy and incomplete data then this would constitute a major advance. Furthermore, significant progress is needed in improving the computational efficiency of these tools if they are to be made applicable on a patient-specific basis.

In this work we advocate a Variational Bayesian (VB) perspective (34); (35). Such methods have risen into prominence for probabilistic inference tasks in the machine learning community (36); (37); (38) but have recently been employed also in the context of inverse problems (39); (40). They provide approximate inference results by solving an optimization problem over a family of appropriately selected probability densities with the objective of minimizing the Kullback-Leibler divergence (41) with the exact posterior. The success of such an approach hinges upon the selection of appropriate densities that have the capacity of providing good approximations while enabling efficient (and preferably) closed-form optimization with regards to their parameters. We note that an alternative optimization strategy originating from a different perspective and founded on map-based representations of the posterior has been proposed in (42).

A pivotal role in Variational Bayesian strategies or any other inference method, is dimensionality reduction i.e. the identification of lower-dimensional features that provide the strongest signature to the unknowns and the corresponding posterior. Discovering a sparse set of features has attracted great interest in many applications as in the representation of natural images (43) and a host of algorithms have been developed not only for finding such representations but also an appropriate dictionary for achieving this goal (44). While all these tools are pertinent to the present problem they differ in a fundamental way. They are based on several data/observations/instantiations of the vector that we seek to represent. In our problem however we do not have such direct observations i.e. the data available pertains to the output of a model which is nonlinearly and implicitly dependent on the vector of unknowns. Furthermore we are primarily interested in approximating the posterior of this vector rather than the dimensionality reduction itself. We demonstrate how this can be done by using a fully Bayesian argumentation and employing the marginal likelihood or evidence as the ultimate model validation metric for any proposed dimensionality reduction.

The paper is organized as follows: The next section (Section 2) presents the essential ingredients of the forward model (Section 2.1) which are common with a wide range of nonlinear, high-dimensional problems encountered in several simulation contexts. We also discuss the VB framework advocated, the dimensionality reduction scheme proposed, the prior densities for all model parameters, an iterative, coordinate-ascent algorithm that enables the identification of all the unknowns (Section 2.2) as well as an information-theoretic criterion for determining the number of dimensions needed (Section 2.3). We finally describe a Monte Carlo scheme based on Importance Sampling that can provide statistics of the exact posterior as well as a quantitative assessment of the VB approximation (Section 2.4). Section 3 demonstrates the performance and features of the proposed methodology in two problems from solid mechanics that are of relevance to the elastography settings. Various signal-to-noise ratios are considered and the performance of the method, in terms of forward calls and accuracy, is assessed.

## 2 Methodology

The motivating application is related to continuum mechanics in the nonlinear elasticity regime. We describe below the governing equations in terms of conservation laws and the constitutive equations. The proposed model calibration process can be readily adapted to other forward models. As it will be shown, the only information utilized by the Bayesian inference engine proposed is a) the response quantities at the locations where measurements are available, and b) their derivatives with respect to the unknown model parameters.

### 2.1 Forward model - Governing equations

The following expressions are formulated in the general case which includes nonlinear material behavior and large deformations. Our physical domain is described by in in the reference configuration. Let denote the coordinates of the continuum particles in the undeformed configuration and in the deformed. Their relation (in the static case) is provided by the deformation map such that: . The displacement field is defined as . The gradient of the deformation map is denoted by and is then the Lagrangian (finite) strain tensor used as the primary kinematic state variable in our constitutive law (45); (46); (47). The governing equations consist of the conservation of linear momentum:

 Invalid decimal number (1)

and the Dirichlet and Neumann boundary conditions as

 u=^uonΓu (2)
 FS⋅N=^TonΓS. (3)

b is body force vector (per unit mass), is the initial density, S is the second Piola-Kirchhoff stress tensor and is the outward normal at . and are subsets of the boundary, , on which displacement and traction boundary data, and , respectively, are specified. For a hyperelastic material, it is assumed that the strain energy density function exists and depends on the invariants of the Lagrangian strain tensor E and the constitutive material parameters . We note that the latter in general exhibit spatial variability which we intend to estimate using the methods discussed. The conjugate stress variables described by the second Piola-Kirchhoff stress tensor can be found as:

 S=∂w∂E=S(E;ψ). (4)

The aforementioned governing equations should be complemented with any other information about the problem or the material, such as incompressibility. In fact incompressibility is frequently encountered in bio-materials and corresponds to the condition at all points in the problem problem domain.

The governing equations presented thus far cannot be solved analytically for the vast majority of problems and one must resort to numerical techniques that discretize these equations and the associated fields. The most prominent such approach is the Finite Element Method (FEM) which is employed in this study as well. In the first step, the weak form of the PDEs needs to be derived. To that end, we define the usual function spaces and for the set of admissible solutions and weighting functions respectively, as follows (48):

 S={u|ui∈H1(Ω0):ui=^uionΓu},V={v|vi∈H1(Ω0):vi=0onΓu} (5)

where denotes the Sobolev space of square integrable functions with square integrable derivatives in (49). By multiplying Equation (1) with a weighting function , integrating by parts and exploiting the essential and non-essential boundary conditions above, we obtain:

 ∫Ω0FiKSKLvi,L dΩ0=∫ΓS^Tivi dΓS+∫Ω0ρ0bivi dΩ0 (6)

In the incompressible case, pressure must be taken into account and for that purpose the pressure trial solutions and weighting functions should also be introduced (50).

Subsequently the problem domain is discretized into finite elements and shape functions are used for interpolating the unknown fields. As this is a very mature subject, from a theoretical and computational point of view, we do not provide further details here but refer the interested reader to one of many books available (51); (52) or more specifically in the context of inverse problems for (in)compressible elasticity in (48); (50). Most often all unknowns are expressed in terms of the discretized displacement field denoted here by a vector . An approximate solution can be found by solving an dimensional system of nonlinear algebraic equations which in residual form can be written as:

 r(U;Ψ)=0. (7)

We denote here by the residuals and by , the discretized vector of constitutive material parameters .

The discretizations can be done in many different ways. For example if the same mesh and shape functions as for the discretization of the displacements are adopted, then each entry of the vector corresponds to the value of the material parameter of interest at each nodal point. Frequently it is assumed that the value of the constitutive parameters are constant within each finite element in which case coincides with the number of elements in the FE mesh. While the representation of is discussed in detail in the sequence, we point out that the discretization of does not need to be associated with the discretization used for the governing equations. Usually in practice the two are related, but if one aims at inferring as many details about the variability of that the discretized equations can provide, a finer discretization might be employed for . We note however that if the material properties exhibit significant variability within each finite element i.e. if , special care has to be taken in formulating the finite element solution and multiscale schemes might need to be employed (53).

We note here:

• Frequently the size of the system of the equations that need to be solved is large. This is necessitated by accuracy requirements in capturing the underlying physics and mathematics. It can impose a significant computational burden as in general repeated solutions of this system, under different values of , are needed. If for example a Newton-Raphson method is employed then repeated solutions of the linearized Equation (7) will need to be performed:

 Unknown environment '% (8)

where is the iteration number and is the Jacobian matrix. Hence for large as in applications of interest, the number of such forward solutions is usually what dictates the overall computational cost and this is what we report in subsequent numerical experiments. Depending on the particular solution method employed, converged solutions at a certain stage of the inversion procedure can be used as initial guesses for subsequent solutions under different reducing as a result the overall cost. In this work we do not make use of such techniques.

• The data available generally concerns a subset or more generally a lower-dimensional function of . In this work, the experimental measurements/ observations are (noisy) displacements at specific locations in the physical domain. We denote these displacements by and they can be formally expressed as where is a Boolean matrix which picks out the entries of interest from . Naturally, since depends on , is also a function of i.e. . We emphasize that this function is generally highly nonlinear and most often than not, many-to-one (54). The unavailability of the inverse as well as the high nonlinearity constitute two of the basic difficulties of the associated inverse problem.

• In addition to the solution vector , the proposed inference scheme will make use of the derivatives . The computation of derivatives of the response with respect to model parameters is a well-studied subject in the context of PDE-constrained optimization (55); (56); (57) and we make use of it in this work. For any scalar function , one can employ the adjoint form of Equation (7) according to which:

 dfdΨk=−νj∂ri∂Ψk (9)

where is defined such as:

 νj∂rj∂Ui=∂f∂UiorJTν=∂f∂U. (10)

We note that is the Jacobian of the residuals in Equation (7) evaluated at the solution . We point out that if a direct solver for the solution of the linear system in is employed, then the additional cost of evaluating is minimal as the Jacobian would not need to be re-factorized for solving Equation (10) 2. In the context of the problems considered in this paper (see Section 3), repeated use of Equation (10) is made where is a different component of the observables and as such the overall cost increases proportionally with the number of observables (displacements in our problems) that are available. In problems where is so large that it precludes the use of direct solvers, then the cost of its solution of the adjoint equations can be augmented but nevertheless comparable to the cost of a forward solution. In cases where both as well as the dimension of are high, then advanced iterative solvers, suitable for multiple right-hand sides must be employed (58); (59). These imply an added computational burden which nevertheless scales sublinearly with the dimension of .

### 2.2 Bayesian Model

The following discussion is formulated in general terms and can be applied for the calibration of any model with parameters represented by the vector when output is available. We also presuppose the availability of the derivatives . For problems of practical interest, it is assumed that the dimension of the unknowns is very large which poses a significant hindrance in the solution of the associated inverse problem as well as in finding proper regularization (in deterministic settings (60)) or in specifying appropriate priors (in probabilistic settings (61); (62)). The primary focus of the Bayesian model developed is two-fold:

• find lower-dimensional representations of the unknown parameter vector that capture as much as possible of the associated posterior density

• enable the computation of the posterior density with as few forward calls (i.e. evaluations of ) as possible.

We denote the vector of observations/measurements. In the context of elastography the observations are displacements (in the static case) and/or velocities (in the dynamics). The extraction of this data from images (ultrasound or MRI) is a challenging topic that requires sophisticated image registration techniques (63); (64). Naturally, these compromise the informational content of the raw data (i.e. the images). In this study we ignore the error introduced by the image registration process, as the emphasis is on the inversion of the continuum mechanics, PDE-based, model, and assume that the displacement data are contaminated with noise. We postulate the presence of i.i.d. Gaussian noise denoted here by the random vector such that:

 Extra open brace or missing close brace (11)

We assume that each entry of has zero mean and an unknown variance which will also be inferred from the data. We note that other models can also be employed as for example impulsive noise to account for outliers due to instrument calibration or experimental conditions (65). Generally the difference between observed and model-predicted outputs can be attributed not only to observation errors (noise) but also to model discrepancies (66); (67); (68). In this work such model errors are lumped with observation errors in the -term.

The likelihood function of the observed data i.e. its conditional probability density given the model parameters (and implicitly the model itself as described by the aforementioned governing Equations (7)) and is:

 p(^y|Ψ,τ)=(τ2π)dy/2e−τ2|^y−y(Ψ)|2. (12)

In the Bayesian framework advocated one would also need to specify priors on the unknown parameters. We defer a detailed discussion of the priors associated with for the next section where the dimensionality reduction aspects are discussed. With regards to the noise precision we employ a (conditionally) conjugate Gamma prior i.e.

 τ∼Gamma(a0,b0). (13)

The values of the parameters are taken in the following examples. This corresponds to a limiting case where the density degenerates to an improper, non-informative Jeffreys prior i.e. that is scale invariant (69). Naturally more informative choices can be made if such information is available a priori.

#### Dimensionality Reduction for Ψ

As mentioned earlier one of the primary goals of the present work to is identify, with the least number of forward calls, a lower-dimensional subspace in on which the posterior probability density can be sufficiently-well approximated. Dimensionality reduction could be enforced directly by appropriate prior specification. For example in (70) the Fourier transform coefficients of corresponding to small-wavelength fluctuations were turned-off by assigning zero prior probability to non-zero values. While such an approach achieves the goal of dimensionality reduction it does not take into account the forward model in doing so. The nonlinear map as well as the available data provide varying amounts of information for identifying different features of . One would expect the likelihood (which measures the degree of fit of model predictions with the data) to exhibit different levels of sensitivity along different directions in the -space. Consider for example Laplace’s method which is based on a semi-analytic Gaussian approximation around the Maximum-A-Posteriori estimate (71); (35). The negative of the Hessian of the log-posterior (assuming this is positive-definite) serves as the covariance matrix. As it was shown in (72) in many inverse problems this covariance matrix exhibits a significant discrepancy in its eigenvalues which was exploited in constructing low-rank approximations. At one extreme, there would be principal directions (with small variance) along which the slightest change from from would cause a huge decrease in the posterior and on the other, there would principal directions (with large variance) along which the posterior would remain almost constant. Such principal directions will naturally encapsulate the effect of the log-prior. In the proposed scheme however, only the data log-likelihood affects the directions with the maximal posterior variance (73). More importantly perhaps we propose a unified framework where the identification of the subspace with the largest posterior variance is performed simultaneously with the inference of the posterior under the same Variational Bayesian objective. This yields not only a highly efficient algorithm (in terms of the number of forward solves) but also a highly extendable framework as discussed in the conclusions.

The inference and dimensionality reduction problems are approached by employing fully Bayesian argumentation and invoking the quality of the approximation to the posterior as our guiding objective. To that end we postulate the following representation for the high-dimensional vector of unknowns :

 ΨdΨ×1=μdΨ×1+WdΨ×dΘΘdΘ×1. (14)

The motivation behind such a decomposition is quite intuitive as it resembles a Principal Component Analysis (PCA) model (74). The vector represents the mean value of the representation of whereas the reduced (and latent) coordinates of along the linear subspace spanned by the columns of the matrix . The linear decomposition of a high-dimensional vector such as has received a lot of attention in several different fields. Most commonly represents a high-dimensional signal (e.g. an image, an audio/video recording) and consists of an over- or under-complete basis set (43); (75) which attempts to encode the signal as sparsely as possible. Significant advances in Compressed Sensing (76) or Sparse Bayesian Learning (77) have been achieved in recent years along these lines. A host of deterministic (78) or probabilistic (79) algorithms have been developed for identifying the reduced-coordinates (or their posterior) as well as techniques for learning the most appropriate set of basis (dictionary learning) i.e. the one that can lead to the sparsest possible representation. While all these tools are pertinent to the present problem they differ in a fundamental way. They are based on several data/ observations/instantiations of whereas in our problem we do not have such direct observations i.e. the data available pertains to which is nonlinearly and implicitly dependent on . Furthermore we are primarily interested in approximating the posterior on rather than the dimensionality reduction itself.

We focus now on the representation of Equation (14) and proceed to discuss the identification of , and . In a fully Bayesian setting these parameters would be equipped with priors, say respectively3, and their joint posterior would be sought:

 p(μ,W,Θ,τ|^y)∝p(^y|μ,W,Θ,τ) p(μ)p(W)p(Θ)p(τ) (15)

where represents the Gamma prior for discussed in Equation (13). Such an inference problem would in general be formidable particularly with regards to and whose dimension is dominated by . To address this difficulty we propose computing point estimates for and while inferring the whole posterior of . In doing so for and the natural objective function would be the marginal posterior :

 p(μ,W|^y)=∫p(μ,W,Θ,τ|^y) dΘdτ. (16)

In such a case the point estimates for would be the Maximum-a-Posteriori-Estimates (MAP). We note that (up to an additive constant):

 logp(μ,W|^y)=log∫p(μ,W,Θ,τ|^y) dΘdτ=log∫p(^y|μ,W,Θ,τ) p(μ)p(W)p(Θ)p(τ) dΘdτ=log∫p(^y|μ,W,Θ,τ)p(Θ)p(τ) dΘdτ+logp(μ)+logp(W)=log∫(τ2π)dy/2e−τ2|^y−y(μ+WΘ)|2 p(Θ) p(τ) dΘ dτ+logp(μ)+logp(W). (17)

We note that such an integration is analytically impossible primarily due to the nonlinear and implicit nature of and secondarily due to the coupling of and . To that end we employ a Variational Bayesian approximation (35) to the integral in Equation (17). We provide further details in the next section. We note that similar approximations have been employed in previous works (40); (65); (39) in order to expedite Bayesian inference. The novel element of this work pertains to the dimensionality reduction that can be achieved.

#### Variational Bayesian approximation

Consider an arbitrary joint density on the latent variables . Then by employing Jensen’s inequality one can construct a lower bound to the log-marginal-posterior in Equation (17) as follows:

 logp(μ,W|^y)=log∫p(μ,W,Θ,τ|^y) dΘdτ=log∫q(Θ,τ)p(μ,W,Θ,τ|^y)q(Θ,τ) dΘdτ≥∫q(Θ,τ)logp(μ,W,Θ,τ|^y)q(Θ,τ) dΘdτ=F(q(Θ,τ),μ,W). (18)

We note here that the variational lower-bound has an intimate connection with the Kullback-Leibler divergence between and the (conditional) posterior on :

 p(Θ,τ|^y,μ,W)=p(μ,W,Θ,τ|^y)p(μ,W|^y). (19)

In particular, if we denote by is the expectation with regards to :

 Missing or unrecognized delimiter for \left (20)

By definition the KL-divergence is non-negative and it becomes when . Hence , for a given , constructing a good approximation to the conditional posterior (in the KL-divergence sense) is equivalent to maximizing the lower bound with regards to .

The aforementioned discussion suggests an iterative optimization scheme that resembles the Variational Bayes - Expectation-Maximization (VB-EM) methods that have appeared in Machine Learning literature (34). At each iteration , one alternates between (Figure 1):

• VB-Expectation: Given , find:

 q(t)(Θ,τ)=argmaxqF(q(Θ,τ)),μ(t−1),W(t−1)) (21)
• VB-Maximization: Given , find:

 (μ(t),W(t))=argmaxμ,WF(q(t)(Θ,τ),μ,W). (22)

In plain terms, the strategy advocated in order to carry out the inference task can be described as a generalized coordinate ascent with regards to (Figure 2).

From Equations (15) and (18), we have that:

 F(q(Θ,τ),μ,W)=∫q(Θ,τ)logp(μ,W,Θ,τ|^y)q(Θ,τ) dΘdτ=∫q(Θ,τ)logp(^y|μ,W,Θ,τ)p(Θ)p(τ)q(Θ,τ) dΘdτ+logp(μ)+logp(W)=Eq[logp(^y|Θ,τ,μ,W) p(Θ) p(τ)q(Θ,τ)]+logp(μ)+logp(W)=^F(q(Θ,τ),μ,W)+logp(μ)+logp(W) (23)

where (up to an additive constant):

 ^F(q(Θ,τ),μ,W)=Eq[logp(^y|Θ,τ,μ,W) p(Θ) p(τ)q(Θ,τ)]=Eq[log(τ2π)dy/2e−τ2|^y−y(μ+WΘ)|2]+Eq[logp(Θ) p(τ)q(Θ,τ)]. (24)

To alleviate the difficulties with the log-likelihood integral above we employ the following approximations:

• We linearize the map at . Hence:

 Invalid decimal number (25)

where is the gradient of the map at .

By keeping the first order terms from Equation (25) , the term in the exponent of the likelihood becomes:

 |^y−y(μ+WΘ)|2=|^y−y(μ)−GWΘ|2=|^y−y(μ)|2−2(^y−y(μ))TGWΘ+WTGTGW:ΘΘT. (26)

We note here that a quadratic expression with respect to could also be obtained by considering the order Taylor series of around directly. In particular if we denote by and and keeping only up to second order terms yields:

 |^y−y(μ+WΘ)|2=|^y−y(μ)|2+gTWΘ+12WTHW:ΘΘT. (27)

The computation of order derivatives can also be addressed within the adjoint framework. We refer the interested reader to (56); (80) as we do not pursue this possibility further in this work. The ensuing expressions are based on Equation (26) but can be readily adjusted to include the terms in Equation (27) instead 4.

We note that by making use of the linearization of the map and the Variational Bayesian approximation, one can obtain a tractable approximations of the posterior of the latent parameters and . This will enable us to ultimately identify all model parameters and through this process the optimal subspace for approximating the posterior on . This will be explained in detail when the final algorithm is presented in section 2.2.4.

• The aforementioned equations for the VB-Expectation step imply that probabilistic inference can be expressed in terms of a parametric optimization problem. One can adopt a functional form for depending on an appropriate set of parameters and identify their optimal value by minimizing the KL-divergence with the posterior or equivalently maximizing . We adopt a mean-field approximation where one looks for factorized densities of the form:

 q(Θ,τ)=q(Θ)q(τ). (28)

Variational mean-field approximations have their origin in statistical physics (81). We make these expressions more specific in the next sections where we discuss the prior for as well.

#### Prior Specification for Θ,μ and W

We discuss first the prior specification on . Its columns span the subspace over which an approximation of is sought. We note that depends on the product which would remain invariant by appropriate rescaling of each pair of and for any . Hence, to resolve identifiability issues we require that is orthogonal i.e. where is the dimensional identity matrix. This is equivalent to employing a uniform prior on on the Stiefel manifold (82).

The latent, reduced coordinates capture the variation of around its mean along the directions of as implied by Equation (14). It is therefore reasonable to assume that, a priori, these should have zero mean and should be uncorrelated (74). For that purpose we adopt a multivariate Gaussian prior (denoted by in the Equations of the previous section) with a diagonal covariance denoted by . We select prior variances such that . This induces a natural (stochastic) ordering to the reduced coordinates since is invariant to permutations of the entries of the and the columns of (Equation (14)). As a result of this ordering, is associated with the direction along which the largest variance in is attained, with the direction with the second largest variance and so on. We discuss the particular values given to prior hyperparameters in the sequel (Section 3) and in Section 2.3 the possibility of an adaptive decomposition is also presented. This enables the sequential addition of reduced coordinates until a sufficiently good approximation to the posterior is attained.

The final aspect of the prior model pertains to . We use a hierarchical prior that induces the requisite smoothness given that represents the spatial variability of the material parameters. In particular the prior model employed penalizes the jumps in the values of and which correspond to neighboring sites/locations . The definition of a neighborhood can be adjusted depending on the problem, but in this work we assume that sites/locations belong to the neighborhood if the they correspond to adjacent pixels/voxels. Suppose is the total number of jumps or neighboring pairs. Then for if and denote the corresponding neighboring pair:

 p(μkj−μlj|ϕj)=√ϕj2πe−ϕj2(μkj−μlj)2. (29)

The strength of the penalty is proportional to the hyperparameter , i.e. smaller values of induce a weaker penalty and vice versa. Let the denote the Boolean matrix that can be used to produce the vector of all jumps (as the one above) between all neighboring sites from the vector as , and the diagonal matrix containing all the hyperparameters associated with each of these jumps. We can represent the combined prior on as:

 p(μ|Φ)∝|Φ|1/2e−12μTLTΦLμ. (30)

A conjugate prior of the hyperparameters is a product of Gamma distributions:

 p(Φ)=dL∏j=1Gamma(aϕ,bϕ). (31)

As in (83) the independence is motivated by the absence of correlation (a priori) with regards to the locations of the jumps. In this work we use which corresponds to a limiting case of a Jeffreys prior that is scale invariant. We note that in contrast to previous works where such priors have been employed for the vector of unknowns and MAP estimates have been obtained (5), we employ this here for which is only part of the overall decomposition in Equation (14). We discuss in the following section the update equations for and the associated hyper-parameters as well as for the remaining model variables.

#### Update equations for q(Θ),q(τ),μ,W

We postulate first that the reduced coordinates should a posteriori have zero mean as they capture variability around . For that purpose we confine our search for to distributions with zero mean. Given the aforementioned prior and the linearization discussed in the previous section, we can readily deduce from Equation (23) that the optimal approximate posteriors and under the mean-field Variational Bayesian scheme adopted will be:

 qopt(Θ)≡N(0,Λ−1),qopt(τ)≡Gamma(a,b). (32)

The associated parameters are given by the following iterative Equations:

 a=a0+dy/2b=b0+12|^y−y(μ)|2+12tr(WTGTGWΛ−1) (33)
 Λ=Λ0+<τ>WTGTGW (34)

where .

As a result of the aforementioned Equations and Equation (14), one can establish that the posterior of is approximated by a Gaussian with mean and covariance given by:

 E[Ψ]=E[μ+WΘ]=μ+WΘCov[Ψ]=WΛ−1WT. (35)

We note that if we diagonalize i.e. where is diagonal and is orthogonal with columns equal to the eigenvectors of , then:

 Cov[Ψ]=WVDVTWT=~WD~WT (36)

where is also orthogonal (i.e. ) and contains the principal directions of the posterior covariance of . Hence it suffices to consider approximate posteriors with covariance that is diagonal i.e. . In this case the update equations for in Equation (34) reduce to:

 λi=λ0,i+<τ>wTiGTGwi. (37)

We note that despite the prior assumption on uncorrelated , the posterior on exhibits correlation and captures the principal directions along which the variance is largest. Furthermore, implicit to the aforementioned derivations is the assumption of a unimodal posterior on and subsequently on . This assumption can be relaxed by employing a mixture of Gaussians (e.g. (84)) that will enable the approximation of highly non-Gaussian and potentially multi-modal posteriors. Such approximations could also be combined with the employment of different basis sets for each of the mixture component which would provide a wide range of possibilities. We defer further discussions along these lines to future work. In the examined elastography applications, the unimodal assumption seems to be a reasonable one due to the generally large of amounts of data/observations obtained from various imaging modalities.

Given the aforementioned results one can obtain an expression for the variational lower bound in Equation (23). For economy of notation we use to denote expectations with respect to and/or as implied by the arguments:

 F(q(Θ)q(τ),μ,W)=Eq[logp(^y|Θ,τ,μ,W) p(Θ) p(τ)q(Θ,τ)]+logp(μ)+logp(W)=−dy2log2π+dy2−<τ>2<|^y−y(μ)−G WΘ|2>+12log|Λ0|−12Λ0:<ΘΘT>+(a0−1)−b0<τ>−logZ(a0,b0)−12log|Λ|+dΘ2−(a−1)+b<τ>+logZ(a,b)+logp(μ)+logp(W) (38)

where is the normalization constant of a distribution with parameters . The aforementioned equation can be further simplified by making use of the following expectations: , :

 F(qopt(Θ)qopt(τ),μ,W)=−dy2log2π+dy2−<τ>2|^y−y(μ)|2−<τ>2WTGTG W:Λ−1+12log|Λ0|−12Λ0:Λ−1+(a0−1)−b0<τ>−logZ(a0,b0)−12log|Λ|+dΘ2−(a−1)+b<τ>+logZ(a,b)+logp(μ)+logp(W). (39)

In order to update in the VB-Maximization step, it suffices to consider only the terms of that depend on it which we denote by i.e.:

 FW(W)=−<τ>2WTGTG W:Λ−1+logp(W) (40)

As discussed earlier the prior enforces the orthogonality constraint on . To address this constrained optimization problem, we employ the iterative algorithm proposed in (85) which has proven highly efficient in terms of the number of iterations and the cost per iterations in several settings. It employs the Cayley transform to preserve the constraint during the optimization and makes use only of first order derivatives:

 ∂FW∂W=−<τ>GTGWΛ−1+logp(W) (41)

In brief, if is the skew-symmetric matrix:

 B=∂FW∂WWT−W∂FW∂WT (42)

the update equations are based on a Crank-Nicholson-like scheme:

 Wnew=(I+αW2B)−1(I+αW2B)Wold (43)

where is the step size. One notes that the aforementioned update preserves the orthogonality of ((85)). In order to derive a good step size we use the Barzilai-Borwein scheme (86) which results in a non-monotone line search algorithm:

 αW=|tr(ΔWΔ∂FW∂W)|tr(Δ∂FW∂WTΔ∂FW∂W) (44)

where represents the difference between the current parameter values as compared to the previous step. As discussed in detail in (85) the inversion of the matrix in Equation (43) can be efficiently performed by inverting a matrix of dimension which is much smaller than . We note that the updates of require no forward calls for the computation of or its derivatives . The updates/iterations are terminated when no further improvement to the objective is possible.

The final component involves the optimization of . As with we consider only the terms of (Equation (39)) that depend on which we denote by i.e.:

 Fμ(μ)=−<τ>2|^y−y(μ)|2+logp(μ). (45)

Due to the analytical unavailability of and its derivatives we employ here an Expectation-Maximization scheme (87); (88) which we describe in A for completeness. The output of this algorithm is also the posterior on the hyperparameters in Equation (29) which capture the locations of jumps in as well as the probabilities associated with them. The cost of the numerical operations is minimal and scales linearly with the number of neighboring pairs . In the following we simply make use of Equations (65) without further explanation.

Formally the determination of the optimal would require the derivatives in Equation (45). We note that depends on . Hence finding would require the computation of second-order derivatives of which poses significant computational difficulties in the high-dimensional setting considered. To avoid this and only for the purpose of the updates, we linearize Equation (45) around the current guess by ignoring the dependence of on or equivalently by assuming that remains constant in the vicinity of the current guess. In particular, let denote the value of at iteration , then in order to find the increment , we define the new objective as follows:

 F(t)μ(Δμ(t))=Fμ(μ(t)+Δμ(t))+logp(μ(t)+Δμ(t))=−<τ>2|^y−y(μ(t)+Δμ(t))|2−12(μ(t)+Δμ(t))TLT<Φ>L(μ(t)+Δμ(t))≈−<τ>2|^y−y(μ(t))−G(t)Δμ(t)|2−12(μ(t)+Δμ(t))TLT<Φ>L(μ(t)+Δμ(t)). (46)

We note that there is no approximation with regards to the prior term. By keeping only the terms depending on in the Equation above we obtain:

 F(t)μ(Δμ(t))=−<τ>2(Δμ(t))T(G(t))TG(t) Δμ(t)+<τ>(^y−y(μ(t)))TG(t) Δμ(t)−12(Δμ(t))TLT<Φ>L Δμ(t)−(μ(t))TLT<Φ>L Δμ(t). (47)

This is a concave and quadratic with respect to the unknown . The maximum can be found by setting which yields:

 (<τ>(G(t))TG(t)+LT<Φ>L)Δμ(t)=<τ>(G(t))T(^y−y(μ(t)))−LT<Φ>Lμ(t). (48)

We note that the exact objective is evaluated at and is accepted only if the value of the objective is larger than that at . Iterations at terminated when no further improvement is possible. Finally it was found that activating the regularization term () after five updates/iterations during which the optimization is performed solely on the basis of , enabled better exploration of the feasible solutions.

We summarize below the basic steps of the iterative Variational Bayesian scheme proposed in Algorithm 1.

With regards to the overall computational cost we note that the updates of are the most demanding as they require calls to the forward model to evaluate and the derivatives . The updates were terminated when no further increase in (Equation (39)) can be attained.

### 2.3 Adaptive learning - Cardinality of reduced coordinates

The presentation thus far was based on a fixed number of reduced coordinates . A natural question that arises is how many should one consider. In order to address this issue we propose an adaptive learning scheme. According to this the analysis is first performed with a few (even one) reduced coordinates and upon convergence additional reduced coordinates are introduced, either in small batches or even one-by-one. Critical to the implementation of such a scheme is a metric for the progress achieved by the addition of reduced coordinates and basis vectors which can also be used as a termination criterion.

In this work we advocate the use of an information-theoretic criterion which measures the information gain between the prior beliefs on and the corresponding posterior. To measure such gains we employ again the KL-divergence between the aforementioned distributions (89). In particular if (section 2.2.3) and (Equation (37)) denote the dimensional prior and posterior respectively, we define the quantity as follows:

 I(dΘ)=KL(pdΘ(Θ)||qdΘ(Θ))−KL(pdΘ−1(Θ)||qdΘ−1(Θ))KL(pdΘ(Θ)||qdΘ(Θ)) (49)

which measures the (relative) information gain from to reduced coordinates. The KL divergence between and with and where are diagonal as explained previously follows with:

 KL(pdΘ(Θ)||qdΘ(Θ))=12dΘ∑i=1(−log(λiλ0,i)+λiλ0,i−1) (50)

and equation (49) becomes:

 I(dΘ)=∑dΘi=1(−log(λiλ0,i)+λiλ0,i−1)−∑dΘ−1i=1(−log(λiλ0,i)+λiλ0,i−1)∑dΘi=1(−log(λiλ0,i)+λiλ0,i−1). (51)

In the simulations performed in section 3, we demonstrate the evolution of this metric as reduced-coordinates/basis vectors are added one-by-one. The addition of reduced coordinates was terminated when was below for at least five consecutive . In Figure