# A Helmholtz equation solver using unsupervised learning: Application to transcranial ultrasound

## Abstract

Transcranial ultrasound therapy is increasingly used for the non-invasive treatment of brain disorders. However, conventional numerical wave solvers are currently too computationally expensive to be used online during treatments to predict the acoustic field passing through the skull (e.g., to account for subject-specific dose and targeting variations). As a step towards real-time predictions, in the current work, a fast iterative solver for the heterogeneous Helmholtz equation in 2D is developed using a fully-learned optimizer. The lightweight network architecture is based on a modified UNet that includes a learned hidden state. The network is trained using a physics-based loss function and a set of idealized sound speed distributions with fully unsupervised training (no knowledge of the true solution is required). The learned optimizer shows excellent performance on the test set, and is capable of generalization well outside the training examples, including to much larger computational domains, and more complex source and sound speed distributions, for example, those derived from x-ray computed tomography images of the skull.

Helmholtz equation learned optimizer unsupervised learning physics-based loss function transcranial ultrasound

## 1 Introduction and background

### 1.1 Motivation

Transcranial ultrasound therapy is a rapidly emerging technique for the noninvasive treatment of brain disorders in which ultrasound is used to cause functional or structural changes to brain tissue. Several different types of treatment are possible depending on the pattern of ultrasound pulses used and the addition of exogeneous microbubbles. This includes precisely destroying small regions of tissue [35], generating or suppressing electrical signals in the brain [30], and temporarily opening the blood-brain barrier to allow drugs to be delivered more effectively [1]. A major challenge for transcranial ultrasound therapies is the presence of the skull bone, which causes the ultrasound waves to be distorted and attenuated, even at low frequencies [24]. Critically, the skull morphology and acoustic properties vary both within and between individuals, which leads to undesirable changes in the position and intensity of the ultrasound focus [10, 3], and in some cases can destroy the focus entirely [48].

Using computational ultrasound models and knowledge of the geometric and acoustic properties of the skull (e.g., derived from an x-ray computed tomography image), it is possible to predict the ultrasound field inside the brain after propagating through the skull, and thus account for subject-specific dose and targeting variations [6, 29]. However, existing models based on conventional numerical techniques typically take tens of minutes to several hours to complete due to the large size of the computational domain compared to the size of the acoustic wavelength, in some cases generating models with billions of unknowns which require tens of thousands of iterations to solve [38, 39, 4, 43, 36]. This makes them too slow to be used for online calculations and corrections, i.e., while the subject is undergoing the therapy. Consequently, approximate models are often used (e.g., ray tracing) which trade off between accuracy and computational efficiency [12, 28].

In the current work, instead of using a classical numerical partial differential equation (PDE) solver, a recurrent neural network architecture is developed to rapidly solve the Helmholtz equation with a heterogeneous sound speed distribution representative of a human skull. The network is trained using a physics loss term based on the Helmholtz equation and a training set of idealized sound speed distributions. The use of a physics loss term, which plays an analogous role to the data consistency term in inverse problems [26, 20], avoids the need to run a large number of computationally expensive simulations using a conventional solver to generate training data for supervised training. A review of the relevant background to this approach is described in the remainder of §1, with the developed network architecture and training outlined in §2. Results are then given in §3 with discussion and outlook in §4.

### 1.2 Governing equations

In the most general case, the propagation of ultrasound waves through the skull and brain involves a heterogeneous distribution of material parameters, shear wave effects in the skull, nonlinear effects when high intensities are used, and acoustic absorption [55, 39]. However, if the ultrasound waves approach the skull close to normal incidence (this is often the case), then shear motion can be ignored [44]. Moreover, nonlinear effects are only important for ablative therapies and are restricted to a small region near the focus [46], and acoustic absorption can be considered a second-order effect [52]. In addition, for many therapeutic applications, the applied ultrasound signals are at a single frequency and last for many milliseconds or seconds, which is typically much longer than the time taken for the acoustic field to reach a steady-state, and thus time-independent models can be used.

Considering the above, a simplified model of wave propagation through the skull and brain can be described by the heterogeneous Helmholtz equation subject to the Sommerfeld radiation condition at infinity:

(1) | ||||

(2) |

Here is the number of spatial dimensions, is the speed of sound, is the angular frequency of the source, is a general space coordinate, is the source distribution, and is the complex acoustic wavefield. Here, it is assumed that the speed of sound distribution is heterogeneous in a bounded region of the domain, while it is uniform and equal to outside of it. In practice, the solution to Eq. (1) is sought within a domain of interest as shown in Fig. (1).

The aim of the current work is to find a learned iterative solver capable of generating a field which satisfies Eqs. (1) and (2). We will focus on the 2D version of the Helmholtz problem, keeping in mind that the ultimate goal is to translate our findings to large three-dimensional simulations. Extensions to 3D and to other wave equations, e.g., to include the effects of nonlinearity, changes in mass density, and acoustic absorption, will be considered as part of future work.

### 1.3 Numerical solution techniques

The Helmholtz equation given in Eq. (1) can be written in the general form

(3) |

where is a linear forward operator which depends on the speed of sound distribution . There are many approaches to discretize , including finite difference methods [53], boundary element methods [19], and finite-element methods [8]. In many cases, direct inversion of the forward operator is not feasible due to the size and conditioning of the problem, and thus iterative schemes are used. In this case, the solution of the PDE is cast as an optimization problem via a suitable loss function, which is solved using a minimization algorithm.

The most widely used loss is the squared norm of the residual , which is equivalent to the mean squared error (MSE) up to a scaling factor. Given the solution at iteration over the domain of interest , this can be calculated by [41]:

(4) |

Other objective functions may also be used depending on the required characteristics of the solution or the discretization model at hand. Common alternative choices are the root-MSE (RMSE) or the mean absolute error (MAE). Other examples relevant to the solution of PDEs include using the Dirichlet energy [54], augmenting the MSE loss with an term to enforce strong PDE solutions [7], and using physics-constrained loss functions [57].

Given a suitable discretization and loss function, a common approach to solving the system of equations is the use of Krylov subspace methods, such as the widely used conjugate gradient (CG) and generalized minimal residual (GMRES) methods [47]. However, Krylov subspace methods are known to have a slow convergence rate for the Helmholtz problem [16]. From an intuitive perspective, if the solution starts with an empty wavefield and a spatially localized source, the nature of the Helmholtz equation makes each update of Krylov methods local (due to the Laplacian).^{1}

To mitigate these issues, preconditioning methods can be used. Mathematically, this means finding a suitable change of basis that reduces the dynamic range of the singular values of the forward operator, which often leads to the use of a multiscale representation that can take into account long-range dependencies [15, 37]. However, finding suitable preconditioning methods for wave problems is still a challenging task [14]. An alternative idea towards improving Krylov-based iterative schemes is given in [42], where Krylov iterations are interleaved with a UNet [45] that boosts the current solution towards the true one. In [42], the UNet is trained on known examples, however, training on unlabeled examples using a physics loss is also suggested. A nice advantage of keeping some Krylov iterations is that they may act as regularizers, preventing the solution from diverging.

More generally, this leads to the following question: instead of using traditional optimizers to solve Eq. (3), is it possible to learn a suitable optimizer that can be iteratively applied to generate an accurate solution in a small number of iterations, while at the same time being simple enough to be used on large scale problems? While the general idea of learning an optimizer is not new [5, 40], fully learned optimizers for the Helmholtz equation have not previously been explored, and the task involves a problem-specific trade-off between speed (number of iterations) and accuracy (how close we are to the solution).

### 1.4 Learned optimizers

In the case of the heterogenous Helmholtz equation given in Eq. (1), assuming a fixed angular frequency , an iterative optimization scheme could be written in the form

(5) |

where is a learned function parametrised by the estimate of the wavefield after the iteration, the sound speed distribution (which is used in the forward operator ), and the source distribution . However, a key problem with this formulation is that it is hard for the network to directly manipulate the sound speed and source distributions to give an update for , as they belong to two very different domains. Although some learning-based methods have been proposed to combine samples from different domains, such as AUTOMAP [56], it is hard to design a function approximator with the right inductive biases while preserving the necessary flexibility required for fast inference. This often means that fully connected deep networks must be used, which are hard to scale to high dimensional input/output pairs and require a large amount of data to be trained.

Instead of using Eq. (5), we can instead leverage our knowledge of the forward operator to manipulate and to get a new input , which belongs to the same domain as . For example, we could use the residual signal as an input

(6) |

(Other choices are also possible for , for example, the derivative of the loss function.) This approach is in direct parallel to the inputs to many optimization algorithms. However, Eq. (6) still assumes that the combination of the current wavefield estimate and latest residual gives enough information to the network to specify the next update. This is unlikely to be true, as for different problems, the same residual could be observed given the same wavefield. A simple example is for an empty wavefield and a fixed source distribution , the residual will be the same regardless of the sound speed distribution. If we look at the choice of the sequence of updates as a Markov decision process (MDP) [31], this makes it only partially observable.

One way to restore the properties of a fully observable MDP is to construct a recurrent belief state which contains information from all preceding actions and observations [21]. Augmenting the problem with such a state can be done via the following update rule:

(7) |

Note that the wavefield update resembles a discrete Euler stepping scheme [33]. Augmenting the input with a hidden state is known to improve the representation capabilities of neural ODEs [13] and it is therefore reasonable to assume it also helps in their discretized counterpart.

If the function is considered as an iterative solver, the presence of the state variable allows several optimizers to be cast in this framework. For example, if stores the previous gradient and its magnitude, could in principle work as a quasi-Newton method [5, 40], while if it stores the collection of all the previous residuals, it may generalize Krylov subspace methods.

A learned iterative solver in this framework was proposed by Putzky and Welling [40] for various kinds of image restoration tasks in which was given by the gradient of the loss function (in their case, the log likelihood for a Gaussian prior). This scheme was later extended by Adler and Oktem [2] for non-linear operators and applied to a non-linear tomographic inversion problem.

## 2 Network architecture and training

### 2.1 Iterative solution to the Helmholtz equation

Building on the work discussed in §1.4, we propose an iterative method in the form of Eq. (7) to solve the Helmholtz equation using a learned optimizer. The discrete dynamical system that models the iterative solution of the heterogeneous Helmholtz equation given in Eq. (1) can be written as

(8) |

Here is the neural network with learnable parameters , is the residual computed from the heterogeneous Helmholtz equation, is the iterative update, and is a learned hidden or belief state. A diagram showing two unrolled iterations is given in Fig. 2. The network also uses an auxiliary input of the variable absorption coefficients that characterize the perfectly matched layer (PML) in each Cartesian direction (the PML is used as part of the discretized Laplacian operator to mimic the radiation condition in Eq. (2) as outlined in Appendix A). This allows the network to learn how to dampen the waves at the edges of the domain to simulate exterior open boundaries, and to appropriately weight the corresponding residuals.

The network is trained using a physics-based loss function. This avoids the need for labeled training data, e.g. created by running a large number of simulations using a conventional PDE solver to obtain ground truth solutions. While generating sufficient labeled training data may technically be feasible in 2D, for 3D problems, running a single simulation can take tens of minutes to many hours [39], which makes the generation of a large training set practically intractable.

If the total number of iterations is fixed to (i.e., a finite horizon), one approach for training the network would be to minimize , the final loss calculated after performing iterations. However, in practice, it is not computationally feasible to perform backpropagation through a large number of iterative steps of the learned optimizer. The optimal choice of is also not straightforward, as the accuracy required and the number of iterations to reach that accuracy are problem specific.

Instead, we choose to minimize the total loss function across all iterations, which is given by

(9) |

where is the physics-based loss function calculated after each iteration, and are a set of weights that define the importance of the loss at each step. Here, the loss from Eq. (4) is used with the residual from Eq. (8). The weights are set to , in other words, we aim to minimize the loss at every step of the iteration, on average, which yields

(10) |

To avoid unrolling the network for a large number of iterations for backpropagation, a replay buffer is used as discussed in detail in §2.3.

### 2.2 Neural network architecture

The neural network architecture for the optimizer is depicted in Fig. 3. The input to the network has the same spatial dimensions as the sound speed distribution (which is defined on a regularly spaced Cartesian grid) and contains six channels: two channels for the real and imaginary parts of the wavefield , two channels for the real and imaginary parts of the residual , and two channels for the variable absorption coefficients used in the PML in each Cartesian direction ( and defined in Appendix A).

The core building block of the network is the widely used double-convolution layer, which consists of two bidimensional convolutions with 3-by-3 kernels, interleaved by a parametrized-ReLU non-linear activation function [23]. Each encoding block contains two double-convolution layers. The first accepts two inputs: the network input representation at the current scale and a hidden state. The output is then passed to a second double-convolution layer (used to update the hidden state), the corresponding decoding block of the network, and a restriction operator which downsamples the output and feeds it to a deeper layer of the network. The restriction operator is represented by an 8-by-8 convolutional stage, applied with stride 2 in order to halve the dimension of the wavefield at each depth, reaching a total of 4 depths.

Internally, each encoding block stores its own hidden state, which from a functional point of view can be considered as being passed between iterations, as shown in Fig. 2. The size of the hidden state for each encoding block matches the size of its corresponding input, and has two channels. Note, the state variable in Eq. (8) and Fig. 2 refers to all the states stored across all encoding blocks.

The decoding blocks take an input from the layer below, upsamples it using transposed convolutions with 8-by-8 kernels and stride 2, and after concatenating it with the output from the corresponding encoding block (i.e., the skip connection), produces an output via another double-convolution layer. Finally, the last layer of the network is a 1-by-1 convolution that maps the output of the neural network to the wavefield domain. The output has the same spatial dimensions as the input and contains two channels for the real and imaginary parts of the wavefield.

There are several intuitions behind this choice of architecture. First, having a fully convolutional network implicitly imposes some degree of translation invariance to the iterative solver,^{2}

In total, the network has approximately 47k trainable parameters, with only 8 channels per convolutional block at every scale. The very small network size is possible because the solution is iteratively updated using the residuals of the true forward operator.

### 2.3 Training

The neural network is trained on a dataset of sound speed distributions containing idealized skulls. All calculations are performed in normalized units with a source frequency of rad/s and a background sound speed of 1 m/s. The idealized skulls are randomly generated with a hollow convex structure with a constant thickness and constant speed of sound, defined by summing up several circular harmonics of random amplitude and phase. Between examples, the skull thickness ranges from 2 to 10 m and the sound speed from 1.5 to 2 m/s giving a maximum sound speed contrast of 100 % (this matches the sound speed contrast between soft tissue and human skull bone [18]). The size of each example is 96 96 grid points with a normalized grid spacing of 1 m, giving points per acoustic wavelength (PPW). Note that while the training is performed using normalized units, the results can be re-scaled for any combination of grid spacing / source frequency / background sound speed, provided the PPW remains fixed at (an example of simulating a transcranial ultrasound field within an adult skull at 490 kHz is discussed in Sec. 3.3).

The training set contains 9000 sound speed distributions while both the validation and test sets contain 1000 distributions. Several examples are shown in Fig. 4. As a physics-based loss function is used, the training data only contains sound speed distributions—no ground truth wavefields (e.g., generated using another PDE solver) are required. During training, the source distribution is always fixed to a single grid point with magnitude 10 at the bottom of the domain.

From a practical point of view, it is not computationally feasible to perform backpropagation through a large number of iterative steps of the learned optimizer. To overcome this, the network is trained using a replay buffer and truncated backpropagation through time (TBPTT) [25], where TBPTT is implemented by unrolling 10 iterations. The replay buffer is initially filled with 600 triplets containing sound speed distributions randomly selected from the training set. For each sound speed example, the wavefield, and hidden state are initialized to zero, while the iteration index is initialized as a random integer between 0 and the maximum number of iterations, in this case set to .

During training, at each training step a mini-batch of triplets (containing examples with a range of different sound speed distributions and iteration indices) is randomly selected from the buffer. For each triplet, the loss is calculated over 10 iterative steps using Eq. (4), and the total loss is then summed over the mini-batch, where

(11) |

Here is the triplet index over the mini-batch, is the iteration index, and is the starting iteration index for the -th triplet. Finally a gradient descent step is performed over 10 unrolled iterations to update the network.

The calculation of the loss using Eq. (11) is performed using a Fourier collocation spectral method with a modified Laplacian that includes a PML. This allows the Sommerfeld radiation condition at infinity to be approximately satisfied while cropping the domain to a finite size . The discrete formulation used is given in Appendix A.

After each training step, for each example in the mini-batch, one of the iterative steps is randomly selected, and a new triplet is stored back into the buffer replacing the previously selected triplet. For computational reasons, the residual is also stored alongside the triplet to avoid needing to recalculate this when the triplet is later re-drawn from the buffer. During training, if the iteration index of any of the triplets to be stored back into the buffer exceeds , this is replaced with a new sound speed distribution randomly selected from the training set. The wavefield, hidden state, and iteration index for the new example are initialized to zero. The full training algorithm is given in Appendix B.

Note, the use of a small window for backpropagation can bias the network towards learning state representations which have short temporal dependency. While there are techniques for mitigating such bias [50], we didn’t find this to be a problem in practice. Storing the belief state in the replay buffer (rather than re-initializing it) has also been shown to improve the performance of recurrent networks trained using experience replay [27].

Gradient descent is performed using Adam with a batch size of 32, learning rate of , and gradient clipping at . The biases of all convolutional layers are initialized to zero to minimize the risk of divergence of the wavefield in the early iterations. We also don’t store triplets in the buffer if the loss goes above an arbitrary threshold value of , as this suggests that the wavefield is diverging. We found this to be especially important in the early phase of training.

The network and training were implemented using PyTorch and parallelized using PyTorch Lightning [17]. The training was performed using a cluster of 6 NVIDIA Tesla P40 graphics processing units (GPUs) in parallel. During training, at the end of every 10 epochs, the loss on the validation set was also evaluated, in this case by summing the loss at the last iteration over all examples in the validation set. However, in this case the source position for each example was moved to a random position on a circle to provide a simple test of network generalization. Since no input/output pairs were provided during training, inputs and outputs were scaled by a factor of and , respectively. These values were hand-tuned to roughly normalize the variance of the inputs and outputs across all iterations. Similarly, the loss function was amplified by a factor of . The training was run for 1000 epochs (52k training steps), and the network with the lowest validation loss was selected. The total training time was approximately 21 hours. A summary of the network and training hyperparameters is given in Appendix C.

## 3 Results

### 3.1 Model accuracy against a reference solution for the test set

To evaluate the performance of the trained network, a series of tests were performed. First, the accuracy of the network for the (unseen) sound speed maps in the test set was evaluated by comparing the wavefields calculated after 1000 iterations of the learned optimizer against a reference solution. The reference solution was calculated using the open-source k-Wave acoustics toolbox [51]. To obtain a time-independent solution to the wave equation, the time-domain solver `kspaceFirstOrder2DG`

was used with a continuous wave sinusoidal source term, the solution was run to steady state, and the complex wavefield extracted by temporal Fourier transform. The wavefields were then normalized to an amplitude of 1 and phase of zero at the source location to account for the different source scaling and relative phase between the two models. The accuracy was computed using the relative and average RMSE error norms calculated as

(12) |

where is the total number of pixels in the wavefield. As the learned optimizer and k-Wave use different formulations for the PML, this region was excluded from the error calculations.

A histogram of the error norms for the 1000 examples in the test set is shown in Fig. 5, with four examples of the calculated wavefields given in Fig. 6. The predicted wavefields have very low errors compared to the reference solution, with a mean error of 0.36%, and mean RMSE of . This demonstrates that the learned optimizer gives highly accurate results. Although not a focus of the current work, preliminary benchmarks show the learned model to be at least an order of magnitude faster than k-Wave for the same level of accuracy.

During the iterative procedure, it is possible to monitor the progression of the solution using metrics based on the computed residual. Figure 6 gives four examples of the evolution of the residual with iteration number, and Fig. 8 for all examples in the test set. Typically, a few hundred iterations are needed for the residual magnitude to reach a minimum. However, while a zero residual indicates convergence to the true solution, it is not immediately obvious how other values correspond to absolute accuracy. To investigate this, the evolution of the residual magnitude vs error with iteration number for the test set is plotted in Fig. 7. In general, the curves decrease with iteration number, meaning a lower residual magnitude gives a lower error for a given problem. However, while a high residual magnitude (e.g., ) implies a high error, and a very low residual (e.g., ) implies a low error, for intermediate values (e.g., ), there is significant spread. This will be investigated further in future work.

In both Fig. 5 (right panel) and Fig. 7, a small number of outliers can be seen with errors on the order of a few percent. In general, these outliers had a higher value for the final residual, and the evolution of the residual often displayed oscillations. Further investigation into these examples demonstrated the presence of a mode-like structure in the wavefield within and adjacent to the idealized skull, similar to whispering gallery modes. These examples were also much more difficult for k-Wave to compute, requiring at least twice as many time steps to reach an approximately steady state. Empirically it seems that examples in which highly resonant modes are supported take longer to converge, which suggests there may be a connection between the iterations and the time evolution of the field, as shown in Figs. 9 and 13. The discrepancy between the mean and median error in the early steps of optimization as shown in Fig. 8 also suggests that there are some particular sound speed distributions in the test set which are more challenging for the learned optimizer to solve. Interestingly, the same behavior was not observed for more complex sound speed distributions, e.g., based on real skulls.

### 3.2 Comparison with GMRES

To benchmark against classical methods, we compared the learned iterative solver with the widely used GMRES method for the sound speed examples in the test set. Figure 5 shows histograms and box plots of the and RMSE errors against k-Wave, while Fig. 8 shows the progression of the residual and against iteration number. The learned optimizer outperforms the generic solver both in terms of convergence speed with iteration number, and in terms of accuracy for a given number of iterations. After 200 iterations, the learned optimizer reaches an average residual magnitude of about and error of 1.0%. In comparison, GMRES reaches an average residual magnitude of about and error of 20.4%. Even after 1000 iterations (a five-fold increase), GMRES only reaches an average error of 2.8%.

The difference in the convergence rates can be understood by looking at the evolution of the solution with iteration number for a representative example as shown in Fig. 9. While both GMRES and the proposed solver construct the solution from the source location outwards, the spatial extent of the update made at each step by the two are very different: while GMRES tends to make very local updates (this is due to the nature of the Krylov iterations using a local forward operator, as discussed in §1.3), the learned iterative solver updates the solution over much larger spatial ranges.

### 3.3 Network generalizability

Having established the ability of the learned optimizer to give highly accurate solutions for the sound speed distributions in the test set, a preliminary investigation was performed into its generalization capabilities to solve previously unseen problems. Similar performance was also observed on a range of other examples analogous to the ones described below.

First, we evaluated the model on a speed of sound distribution containing a rectangular region with a speed of sound of 2, to test the ability of the network to deal with large homogeneous speed of sound regions (recall during training that only idealized skull shapes were used). Figure 10 shows the reference solution calculated using k-Wave, the prediction using the learned optimizer, and the evolution of the error with iteration number. For this example, the learned model reaches a very small final error on the order of 0.2%, and reaches an error below 1% extremely quickly (about 70 iterations).

Second, we tested the ability of the network to generalize to larger domains. A large speed of sound distribution with 480 480 grid points was created by patching together 24 distributions from the test set. Figure 11 shows the reference solution calculated using k-Wave, the prediction using the learned optimizer, and the evolution of the error with iteration number. Despite all the sound speed distributions in the training and validation sets having 96 96 grid points, the model is able to generalize to a much larger domain, reaching 1% error within 600 steps. This suggests that the hard task of training on large 3D volumes containing whole skulls for clinical applications can possibly be entirely bypassed by learning the network weights using much simpler and smaller problems, for example, by training using small skull patches.

Finally, while the previous example suggests that the network is able to generalize to different domain sizes, it is unclear how much diversity in the training set is required to ensure that the network still converges to a satisfactory solution with an arbitrary sound speed or source distribution. To test this, we performed a representative simulation of the intracranial wavefield for a transcranial focused ultrasound stimulation experiment [29]. We used a large 512 512 speed of sound distribution generated from a transverse CT slice from an adult skull from the Qure.ai CQ500 Dataset [11] converted using the approach outlined in [6]. The source distribution was defined as a focused transducer represented by a 1D arc [34] (recall that the network has only seen single-point sources in a fixed position during training). The transducer aperture diameter and radius of curvature were set to 60 mm, and the source frequency to 490 kHz.

Figure 12 shows the reference solution calculated using k-Wave, the prediction using the learned optimizer, and the evolution of the error with iteration number. Despite this example being well outside the training set (including a much larger spatial domain, a more complex speed of sound distribution, and a distributed source in an unseen position), the relative error compared to k-Wave is very low at 0.8%. This shows that the trained model can be used to solve problems of the scale and complexity needed to make the clinical problem tractable, albeit currently in 2D.

The evolution of the solution shown in Fig. 12 with iteration number is shown in Fig. 13. The solution is constructed from the arc-shaped source outwards. While it takes approximately three thousand iterations for the error across the whole domain to reach a minimum, most of the complex structure in the field can be seen after just 200 iterations. Moreover, the position of the focus and the focal pressure (which are not strongly affected by reverberation within the skull for this example) converge extremely quickly, within just 20 iterations. For the purpose of real-time treatment planning (for example, where several candidate positions for the ultrasound transducer may be being evaluated), the ability to generate a reasonable approximation for the acoustic field around the focus in a very short time, which can then be improved by letting the model continue to iterate, is highly desirable.

## 4 Summary and discussion

A lightweight neural network based on a modified UNet is proposed as a fully learned iterative optimizer and used to solve the heterogeneous Helmholtz equation. The network architecture is motivated by multi-scale solvers which utilize multi-scale memory, and Markov decision processes that utilize a belief state to make them fully observable. The network is trained using a physics-based loss function: no explicit knowledge of the true solution is given beyond the implicit consistency with the known source and sound speed.

The training set consists of idealized skulls in a small domain with 96 96 grid points, which makes training time manageable (less than one day using 6 GPUs). The learned optimizer shows excellent performance on the test set, and is capable of generalization well outside the training set, including to much larger domains, more complex sound speed distributions, and more complex source distributions.

In the current work, the network is used to solve the lossless Helmholtz equation in 2D. In future, we will look to extend this to 3D, and to include the effects of changes in mass density, acoustic absorption, and potentially the effects of acoustic nonlinearity. The examples given in §3.3 suggest that the model weights could be trained on small 3D patches, which will be essential to make the training tractable.

One challenge in the formulation that may need addressing is the limited number of unrolling steps used in the training phase. While a replay buffer has proven useful to extend the training horizon well beyond this limit, it may still induce biases or unwanted phenomena such as state representation drift [27]. This could be mitigated by borrowing ideas from the reinforcement learning community, which often deals with large or even infinite horizon tasks. In particular, Q-learning methods [49], such as temporal difference learning, can theoretically take into account the entire sequence of possible future states when providing the loss for a given output, by estimating the future loss. Furthermore, as they are designed to work with externally given reward signals, the loss function on which the network is trained doesn’t need to be differentiable. This makes it possible to use more elaborate training strategies, such as imposing a monotonically decreasing residual norm during inference.

Since we are mostly interested in the solution in a restricted region of space, such as the brain for transcranial applications, an interesting extension of the method would be to include a spatial estimate of the uncertainty of the solution at each step [57]. While being useful in and of itself, this may also allow the design of training procedures where the network focuses on solving the problem in the spatial region of interest.

In our experiments, the complexity of the network was kept very low by using a small number of parameters, and this is partially responsible for the good generalization performance. On the other hand, having a small number of parameters reduces the capacity of the network and as a consequence the speed at which a problem is solved. Conversely, having a large number of parameters may overspecialize the network on the distribution of problems encountered in training, which may require a more diverse training dataset to restore its generalization capabilities. Furthermore, the validation loss exhibits large oscillations throughout training, making evaluation and checkpointing of the network parameters subject to a low validation loss a crucial part of the training process. As part of future work, we will aim to understand and disentangle the effect of dataset diversity, network complexity, training strategies, and the generalization properties of the network.

Lastly, although the neural network solver already provides an advantage over many standard numerical methods by being fully differentiable, the computational performance of the network still needs to be properly assessed. In particular, profiling and optimization of the deployed network, and a comparison to more widespread solving procedures including a measure of their scaling properties for different problem sizes are required. While there may be several small design ideas that can be used to reduce the computational complexity of the network, such as learning the Laplacian operator [32] rather than evaluating it using spectral methods, a fair comparison will require a careful implementation of the method.

## Acknowledgments

This work was supported by the Engineering and Physical Sciences Research Council (EPSRC), UK. This work was also supported by European Union’s Horizon 2020 Research and Innovation Program H2020 ICT 2016-2017 (as an initiative of the Photonics Public Private Partnership) under Grant 732411.

## Appendix A Discrete solution of the forward operator

The calculation of the residual in Eq. (8) and the loss function in Eq. (11) require the discretization of the heterogeneous Helmholtz equation in Eq. (1). To allow the Sommerfeld radiation condition at infinity to be approximately satisfied while cropping the domain to a finite size , a perfectly matched layer or PML is used as defined in [8]. Assuming the original bounded domain is rectangular with size by , the domain is extended to a size of by , where is the thickness of the PML on each side of the domain (see Fig. 1). The derivative operators are then transformed to introduce absorption within the extended part of the domain:

(13) |

where and

(14) |

The absorption profile grows quadratically within the perfectly matched layer according to [8]

(15) |

The Laplacian including the PML then becomes

(16) |

where , which is computed analytically. To discretize Eq. (16), the Fourier collocation spectral method is used [9], where

(17) |

Here are the wavenumbers in the -direction (see e.g., [9]), and and represent the forward and inverse Fourier transform, in this case computed using the fast Fourier transform (FFT). Finally, this gives

(18) |

For a numerical calculations presented in the current work, the PML size is set to 8, and the absorption parameter is set to 2.

## Appendix B Training algorithm

The algorithm used for training is given in Algorithm. 1