Validating Quantum-Classical Programming Models with Tensor Network Simulations
Abstract
The exploration of hybrid quantum-classical algorithms and programming models on noisy near-term quantum hardware has begun. As hybrid programs scale towards classical intractability, validation and benchmarking are critical to understanding the utility of the hybrid computational model. In this paper, we demonstrate a newly developed quantum circuit simulator based on tensor network theory that enables intermediate-scale verification and validation of hybrid quantum-classical computing frameworks and programming models. We present our tensor-network quantum virtual machine (TNQVM) simulator, which stores a multi-qubit wavefunction in a compressed (factorized) form as a matrix product state, thus enabling single-node simulations of larger qubit registers, as compared to brute-force state-vector simulators. Our simulator is designed to be extensible in both the tensor network form and the classical hardware used to run the simulation (multicore, GPU, distributed). The extensibility of the TNQVM simulator with respect to the simulation hardware type is achieved via a pluggable interface for different numerical backends (e.g., the ITensor and ExaTENSOR numerical libraries). We demonstrate the utility of our TNQVM quantum circuit simulator through the verification of randomized quantum circuits and the variational quantum eigensolver algorithm, both expressed within the eXtreme-scale ACCelerator (XACC) programming model.
I Introduction
Quantum computing is a computation paradigm that relies on the principles of quantum mechanics in order to process information. Recent advances in both algorithmic research, which has found remarkable speedups for a growing number of applications Childs and van Dam (2010); Montanaro (2016); Biamonte and Bergholm (2017), and hardware development Linke et al. (2017); Friis et al. (2018) continue to progress the field of quantum information processing. The near-term state of quantum computing is defined by the noisy intermediate-scale quantum (NISQ) paradigm, which involves small-scale noisy quantum processors Preskill (2018) being used in a hybrid quantum-classical framework. In this context, recent experimental demonstrations Peruzzo et al. (2014); O’Malley et al. (2016a); Kandala et al. (2017); Otterbach et al. (2017); Dumitrescu et al. (2018) of hybrid computations have reinforced the need for robust programming models and classical validation frameworks.
The successful integration of quantum processors into conventional computational workloads is a complex task which depends on the programming and execution models that define how quantum resources interact with conventional computing systems Humble and Britt (2016); Britt and Humble (2017a). Many different models have been proposed for programming quantum computers, and a number of software development efforts have begun focusing on high-level hybrid programming mechanisms capable of integrating both conventional and quantum computing processors together Green et al. (2013); Javadi-Abhari et al. (2014); Wecker and Svore (2014); Humble et al. (2014); Smith et al. (2016a); Liu et al. (2017); Svore et al. (2018); PAKIN (2018). For example, recent efforts have focused on Python-based programming frameworks enabling the high-level expression of quantum programs in a classical context, which may target numerical simulators or a variety of physical quantum processing units (QPUs) Smith et al. (2016b); Steiger et al. (2016); Cross et al. (2017). The eXtreme-scale ACCelerator programming model (XACC) is a recently developed quantum-classical programming, compilation, and execution framework that enables programming across multiple languages and targets multiple virtual and physical QPUs McCaskey et al. (2017).
In all cases, the verification of quantum program correctness is a challenging and complex task due to the intrinsically noisy nature of near-term QPUs, and this is additionally complicated by remote hosting. As a remedy, numerical simulation techniques can greatly expedite the analysis of quantum-classical programming efforts by providing direct insight into the prepared quantum states, as well as serving to test a variety of quantum computing hardware models. Modeling and simulation is essential for designing effective program execution mechanisms because it provides a controlled environment for understanding how complex computational systems interact, subsequently generating feedback based on the state machine statistics. For example, the performance of existing QPUs is limited by the hardware connectivity Linke et al. (2017), and numerical simulations can draw on a broad range of parameterized models to test new processor layouts and architectures.
In practice, exact brute-force simulations of quantum computing are notoriously inefficient in memory complexity due to the exponential growth in resources with respect to system size. These brute-force methods explicitly solve the Schrodinger equation, or a mixed-state master equation, using a full representation of the quantum state in its underlying (exponentially large) Hilbert space. Limits on available memory place upper bounds on the size of the vectors or density operators that can physically be stored, severely restricting the size of the simulated quantum circuit. Even with the availability of current large-scale HPC systems, including the state-of-the-art supercomputing systems, recent records for quantum circuit simulations are limited to less than 50 qubits Häner and Steiger (2017); Pednault et al. (2017). The performance of the brute-force quantum circuit simulators on current supercomputing architectures is also limited by the inherently low arithmetic intensity (Flop/Byte ratio) of the underlying vector operations (sparse matrix-vector multiplications) required for simulating a discrete sequence of one- and two-qubit gates.
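The exponential memory wall is easy to make concrete: a full state vector of n qubits holds 2^n complex amplitudes (16 bytes each in double precision). The following back-of-the-envelope sketch (our own illustration) shows why simulations stall near 50 qubits:

```python
# Memory needed to store a full state vector of n qubits:
# 2**n complex amplitudes at 16 bytes each (complex128).
def statevector_bytes(n_qubits: int) -> int:
    return (2 ** n_qubits) * 16

for n in (30, 40, 50):
    gib = statevector_bytes(n) / 2**30
    print(f"{n} qubits: {gib:,.0f} GiB")
# → 30 qubits: 16 GiB; 40 qubits: 16,384 GiB; 50 qubits: 16,777,216 GiB
```

At 50 qubits the exact wavefunction alone occupies 16 PiB, beyond the aggregate memory of current supercomputers, which is what motivates compressed representations.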
The inherent inefficiency of the brute-force state-vector quantum circuit simulators has motivated a search for approximate numerical simulation techniques that increase the upper bound on the number of simulated qubits. As we are interested in general-purpose (universal) quantum circuit simulators, we will omit efficient specialized simulation algorithms that target certain subclasses of quantum circuits, for example, quantum circuits composed of only Clifford operations Aaronson and Gottesman (2004). As a general solution, we advocate the use of tensor network (TN) theory as a tool for constructing factorized approximations to the exact multi-qubit wavefunction tensor. The two main advantages offered by the tensor-network-based wavefunction factorization are that (1) the memory (space) and time complexity of the quantum circuit simulation reflect the level of entanglement in the quantum system, and (2) the numerical action of quantum gates on the factorized wavefunction representation results in numerical operations (tensor contractions) which become arithmetically intensive for entangled systems, thus potentially delivering close-to-peak utilization of modern HPC platforms.
II Quantum Circuit Simulation with Tensor Networks
Tensor network theory Orús (2014); Biamonte and Bergholm (2017) provides a versatile and modular approach for the dimensionality reduction of operators acting in high-dimensional linear spaces. For the following discussion, a tensor is a generalization of a vector and is defined in a linear space constructed as the tensor product of two or more primitive vector spaces. Consequently, the components of a tensor are enumerated by a tuple of indices, instead of by a single index as is the case for vectors. From the numerical perspective, a tensor can be viewed as a multi-dimensional array of objects, which may be real or complex numbers. In this work, following the physics nomenclature, we shall refer to the number of indices in a tensor as its rank. Each index represents a distinct vector space contributing to the composite space by the tensor product. The extent of the range of each index gives the dimension of the corresponding vector space. In essence, tensor networks aim at decomposing higher-rank tensors into a contraction over lower-rank tensors such that the factorized product accurately reconstructs properties of the original tensor (i.e., a variant of lossy compression in linear spaces). Any tensor can be approximated by a suitably chosen tensor network with arbitrary precision; however, the size of the tensor factors may grow exponentially in the worst case. Tensor factorizations, which we also refer to as decompositions, are not unique in general, and the problem of finding the optimal tensor decomposition is a difficult non-convex optimization problem Kolda (2015).
In practice, a tensor network factorization is typically specified by a graph in which the nodes are the tensor factors and the edges represent physical or auxiliary vector spaces which are associated with the indices of the corresponding tensor factors. A closed edge, that is, an edge connecting two nodes, represents a contracted index shared by two tensor factors over which a summation is to be performed. In a standalone tensor network, contracted indices are associated with auxiliary vector spaces. An open edge, that is, an edge connected to only one node, represents an uncontracted index of that tensor factor. Uncontracted indices are typically associated with physical vector spaces. Different tensor network architectures differ by the topology of their representative graphs. Furthermore, one can define even more general tensor network architectures by replacing graphs with hypergraphs, in which case an edge may connect three or more tensors. In the subsequent discussion, however, we will mostly deal with conventional graph topologies.
A quantum many-body wavefunction, including a multi-qubit wavefunction, is essentially a high-rank tensor (its rank is equal to the number of simulated quantum particles or quasiparticles) Orús (2014). A number of different tensor network architectures have been suggested for the purpose of factorizing quantum many-body wavefunctions, including the matrix-product state (MPS) White (1992); Schollwöck (2011), the projected entangled pair state (PEPS) Schuch et al. (2007); Verstraete et al. (2008), the tree tensor network state (TTNS) Murg et al. (2010); Nakatani and Chan (2013); Dumitrescu (2017), the multi-scale entanglement renormalization ansatz (MERA) Vidal (2006); Evenbly and Vidal (2009), as well as somewhat related non-conventional schemes like the complete-graph tensor network (CGTN) Marti et al. (2010). All of the above tensor network ansätze differ in the factorization topology, that is, in how the tensor factors are contracted with each other to form the final quantum many-body wavefunction. In a good tensor network factorization, the topology is induced by the entanglement structure of the particular quantum many-body system. Many physical systems are described by many-body Hamiltonians with only local interactions – in many cases, nearest neighbor only – with correlation functions decaying exponentially for non-critical states. In such cases, the locality structure of the many-body Hamiltonian induces the topology required to properly capture the quantum correlations present in the system of interest. The factorization topology also strongly affects the computational cost associated with the numerical evaluation/optimization of a specific tensor network architecture. Another important characteristic of a tensor network is its so-called maximal bond dimension, that is, the maximal dimension of the auxiliary linear spaces (auxiliary linear spaces are those contracted over).
Provided that the maximal bond dimension is bounded, many tensor network factorizations can be evaluated with a polynomial computational cost in the bond dimension. In practice, the entanglement structure of the underlying quantum manybody system determines the maximal bond dimension needed for a given error tolerance and a given tensor network topology. A poorly chosen tensor network topology will necessarily lead to rapidly increasing (exponentially at worst) bond dimensions in order to keep the factorization error within the error threshold.
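The connection between entanglement and the required bond dimension can be illustrated with a small NumPy sketch (our own illustration, not TNQVM code): bipartition a random 10-qubit state into two halves and measure the weight discarded when only the largest chi Schmidt coefficients are kept. A random state is highly entangled, so the discarded weight decays slowly with chi — exactly the regime where a poorly matched factorization forces rapidly growing bond dimensions:

```python
import numpy as np

# Bipartition a random 10-qubit state into 5+5 qubits; the singular values
# of the resulting 32x32 matrix are the Schmidt coefficients of the cut.
rng = np.random.default_rng(7)
psi = rng.normal(size=(2**5, 2**5)) + 1j * rng.normal(size=(2**5, 2**5))
psi /= np.linalg.norm(psi)

s = np.linalg.svd(psi, compute_uv=False)  # Schmidt coefficients
errs = []
for chi in (4, 8, 16, 32):
    # Discarded weight when keeping only the chi largest coefficients
    errs.append(np.sqrt(np.sum(s[chi:] ** 2)))
    print(chi, errs[-1])
```

For a weakly entangled state (e.g., a gapped ground state of a local Hamiltonian) the same experiment shows the error collapsing after a few coefficients, which is what makes a bounded chi viable.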
The entanglement structure in a multi-qubit wavefunction is determined by the quantum circuit and may be unknown in general. Consequently, there is no well-defined choice of a tensor network architecture (topology) that could work equally well for all quantum circuits, unless it is some kind of adaptive topology. In practice, the choice of a tensor network architecture for representing a multi-qubit wavefunction is often dictated by numerical convenience and ease of implementation. For example, one of the simplest tensor network architectures, the MPS ansatz, was used to simulate Shor’s algorithm for integer factorization Dang et al. (2017). Although the inherently one-dimensional chain topology of the MPS ansatz often results in severely growing bond dimensions, which can be remedied by a more judicious tensor network form Dumitrescu (2017), its computational convenience and well-understood theory make the MPS factorization an appealing first candidate for our quantum virtual machine (quantum circuit simulator). In the future, we plan on adding more advanced tensor network architectures.
In order to simulate a general quantum circuit over an N-qubit register with the tensor network machinery, the following steps are necessary (see Figure 2):

Specify the chosen tensor network graph that factorizes the rank-N wavefunction tensor into a contracted product of lower-rank tensors (factors).

Transform the quantum circuit into an equivalent quantum circuit augmented with SWAP gates in order to maximize the number of accelerated gate applications (see below). This is an optional step.

Group quantum gates into ordered aggregates (supergates) which will act as a whole on the qubit wavefunction. In the simplest case, all quantum gates will be applied one-by-one in order of appearance, with no aggregation. This is an optional step.

Sequentially apply aggregated supergates (or individual gates when no aggregation occurred) to the wavefunction tensor network, thus evolving it towards the output state.
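The four steps above can be sketched as a driver loop. The names below (MPSState, route_swaps, aggregate) are illustrative stand-ins rather than TNQVM's actual API; the "apply" step only records the action where a real implementation would contract gate tensors into the network and truncate:

```python
# Hypothetical driver for the four simulation steps (illustrative only).
class MPSState:
    def __init__(self, n_qubits):
        self.n = n_qubits
        self.applied = []
    def apply(self, supergate, max_bond):
        # A real backend would contract the gate tensors into the MPS and
        # truncate back to max_bond; here we only record the action.
        self.applied.append(supergate)

def route_swaps(circuit):
    return circuit  # optional step 2: insert SWAPs to make gates adjacent

def aggregate(circuit, group_size=2):
    # optional step 3: group consecutive gates into "supergates"
    return [tuple(circuit[i:i + group_size])
            for i in range(0, len(circuit), group_size)]

def simulate(circuit, n_qubits, chi_max=16):
    state = MPSState(n_qubits)                 # step 1: choose factorization
    supergates = aggregate(route_swaps(circuit))
    for sg in supergates:                      # step 4: evolve the network
        state.apply(sg, max_bond=chi_max)
    return state

st = simulate(['H 0', 'CX 0 1', 'X 1'], n_qubits=2)
```

With no aggregation (group_size=1) the loop reduces to applying gates one-by-one in order of appearance, as described above.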
In the above general algorithm, the application of a supergate (or just an individual gate) on a multi-qubit wavefunction tensor consists of the following steps:

Append the individual gates constituting the given supergate to the input wavefunction tensor network T, thus obtaining a larger tensor network T'.

If there are 2- or higher-body gates present, check whether they are applied to qubit pairs (or triples, etc.) that allow an accelerated gate application (for example, in the MPS factorization, these would be adjacent qubit pairs, triples, and so on). If yes, evaluate their action in an accelerated fashion (see below). Otherwise, resort to the general algorithm in the next steps.

Instantiate a new tensor network W by cloning the input network T; W will represent the output wavefunction.

Close T' with the conjugate of W, thus obtaining a closed tensor network whose scalar value is the overlap between W and the gated state T'.

Optimize the tensors of W to maximize the absolute value of this overlap.

If the overlap value is not acceptable, increase the dimensions of the auxiliary spaces in W and repeat Step 5.
In cases where an accelerated gate application is possible (for example, a 2-body gate is applied to adjacent qubits in the MPS-factorized wavefunction), one can restrict the update procedure to only those tensor factors directly affected by the gate action. In the case of the MPS factorization, in order to apply a 2-body gate to two adjacent qubits, one can contract the gate tensor with the two MPS tensors representing the affected qubits and then perform the singular value decomposition (SVD) on the tensor result, thus obtaining the new (updated) MPS tensors as illustrated in Figure 3.
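The accelerated two-site update just described can be sketched in plain NumPy (our own illustration, not TNQVM's actual routine): contract the two adjacent rank-3 MPS site tensors with the 4x4 gate tensor, then SVD the result back into two rank-3 tensors:

```python
import numpy as np

def apply_two_qubit_gate(gate, A, B, chi_max=None):
    """Apply a 4x4 gate to adjacent MPS sites A (l,2,m) and B (m,2,r)."""
    l, _, m = A.shape
    _, _, r = B.shape
    # Merge the two site tensors: theta has shape (l, 2, 2, r)
    theta = np.einsum('lpm,mqr->lpqr', A, B)
    # Contract the gate (reshaped to a rank-4 tensor) over both physical legs
    theta = np.einsum('stpq,lpqr->lstr', gate.reshape(2, 2, 2, 2), theta)
    # SVD back into two rank-3 tensors, optionally truncating the bond
    U, S, Vh = np.linalg.svd(theta.reshape(l * 2, 2 * r), full_matrices=False)
    chi = len(S) if chi_max is None else min(len(S), chi_max)
    A_new = U[:, :chi].reshape(l, 2, chi)
    B_new = (np.diag(S[:chi]) @ Vh[:chi]).reshape(chi, 2, r)
    return A_new, B_new

# Example: CNOT applied to |10> (stored as a product-state MPS) gives |11>
CNOT = np.eye(4)[[0, 1, 3, 2]]
A = np.array([0.0, 1.0]).reshape(1, 2, 1)   # qubit 0 in |1>
B = np.array([1.0, 0.0]).reshape(1, 2, 1)   # qubit 1 in |0>
A2, B2 = apply_two_qubit_gate(CNOT, A, B)
psi = np.einsum('apb,bqc->pq', A2, B2).reshape(4)  # → [0, 0, 0, 1]
```

A production implementation would additionally keep the diagonal matrix of singular values explicit (canonical form) and apply a truncation cutoff; this sketch absorbs the singular values into the right tensor for brevity.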
The above general algorithm demonstrates the procedure used by TNQVM for approximate simulation of quantum circuits based on the tensor network factorization. For the sake of completeness, we should also mention quantum circuit simulators which use tensor representations for a brute-force simulation of quantum circuits with no approximations Pednault et al. (2017); Fried et al. (2017). This is different from our approach, which is based on the explicit factorization of the multi-qubit wavefunction tensor. In these other tensor-based schemes, the entire quantum circuit as a collection of gate tensors is considered as a tensor network which is subsequently contracted over in order to compute observables or evaluate output probability distributions. In Ref. 27, a clever tensor slicing technique was introduced that avoided the evaluation of the full wavefunction tensor, thus reducing the memory footprint and bypassing the existing 45-qubit limit on large-scale HPC systems. Yet, despite enabling simulations of somewhat larger qubit counts, this technique does not lift the asymptotic bounds of the exact simulation cost.
III Quantum Virtual Machines
In order to evaluate the correctness of a quantum program and its implementation via a decomposition into primitive gate operations, it is necessary to model both the conventional computing and quantum computing elements of the system architecture. In particular, it is necessary to expose the interface to the available instruction set architecture (ISA) and methods to support quantum program execution, scheduling, and layout. There are currently many different technologies available for testing and evaluating quantum processing units, and each of these technologies presents different ISAs and methods for program execution Britt and Humble (2017b).
As shown in Fig. 4, a quantum virtual machine (QVM) provides a portable abstraction of technology-specific details for a broad variety of heterogeneous quantum-classical computing architectures. The hardware abstraction layer (HAL) defines a portable interface by which the underlying quantum processor technology as well as other hardware components such as memory are exposed to system libraries, runtimes and drivers running on the host conventional computer. The implementation of the HAL provides an explicit translation of quantum program instructions into native, hardware-specific syntax, which may be subsequently executed by the underlying quantum processor. The HAL serves as a convenience to ensure portability of programs across different QPU platforms, while the QVM encapsulates the environment in which applications can be developed independently from explicit knowledge of QPU details. This environment is provided by the integration of the HAL with programming tools, libraries, and frameworks as well as the host operating system.
Application performance within a QVM depends strongly on the efficiency with which host programs are translated into hardware-specific instructions. This includes the communication overhead between the HAL and hardware layers as well as the overhead costs for managing these interactions by the host operating system. Both algorithmic and hardware designs impact this performance by deciding when and how to allocate computational burden to specific devices. Presently, there is an emphasis on the development and validation of hybrid programs, which loosely integrate quantum processing with conventional post-processing tasks. This algorithmic design introduces a requirement for transferring memory buffers between the host and QPU systems. Memory management therefore becomes an important task for application behavior. While current QPUs are often accessed remotely through network interfaces, long-term improvements in application performance will require fine-grained control over memory management.
IV Tensor Network Quantum Virtual Machine
Our implementation of a QVM presented in this work is based on a previously developed hybrid quantum-classical programming framework, called XACC McCaskey et al. (2017), combined with a quantum circuit simulator that uses tensor network theory for compressing the multi-qubit wavefunction. We provide an overview of the Tensor Network Quantum Virtual Machine (TNQVM) and its applications, including its software architecture and integration with the XACC programming framework. Since XACC integrates directly with TNQVM, compiled programs can in principle be verified immediately on any classical computer, including workstations as well as HPC clusters and supercomputers. The support of different classical computer architectures (single-core, multicore, GPU, distributed) for performing numerical simulations is achieved by interchangeability of the numerical backends in our TNQVM simulator. These backends are numerical tensor algebra libraries which perform all underlying tensor computations on a supported classical computer. In this work, we detail the HAL implementation of TNQVM using ITensor ite () for serial simulations, with some example applications demonstrating the utility of TNQVM. We also sketch some details on the upcoming ExaTENSOR backend that will enable large-scale quantum circuit simulations on homogeneous and heterogeneous HPC systems. Independent verification of hybrid programs within TNQVM provides an increased confidence in the use of these codes to characterize and validate actual QPUs.
IV.1 XACC
The eXtreme-scale ACCelerator programming model (XACC) has been specifically designed for enabling near-term quantum acceleration within existing classical high-performance computing applications and workflows McCaskey et al. (2017, 2018). This programming model and associated open-source reference implementation follow the traditional co-processor model, akin to OpenCL or CUDA for GPUs, but take into account the subtleties and complexities arising from the interplay between classical and quantum hardware. XACC provides a high-level application programming interface (API) that enables classical applications to offload quantum programs (represented as quantum kernels, similar in structure to GPU kernels) to an attached quantum accelerator in a manner that is agnostic to both the quantum programming language and the quantum hardware. Hardware agnosticism enables quantum code portability and also aids in benchmarking, verification and validation, and performance studies for a wide array of virtual (simulators) and physical quantum platforms.
To achieve language and hardware interoperability, XACC defines three important abstractions: the quantum intermediate representation (IR), compilers, and accelerators. XACC compiler implementations map quantum source code to the IR – the in-memory object key to integrating a diverse set of languages with a diverse set of hardware. IR instances (and therefore compiled kernels) are executed by realizations of the accelerator concept, which defines an interface for injecting physical or virtual quantum hardware. Accelerators take this IR as input and delegate execution to vendor-supplied APIs for the QPU, or an associated API for a simulator. This forms the hardware abstraction layer, or abstract device driver, necessary for a general quantum (virtual) machine.
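The compiler-IR-accelerator flow can be condensed into a minimal sketch. The class names and the toy one-instruction-per-line "language" below are our own illustration, not XACC's actual types:

```python
# Minimal sketch of the three abstractions: a Compiler produces IR
# (a list of Instruction objects), and an Accelerator executes that IR.
class Instruction:
    def __init__(self, name, qubits):
        self.name, self.qubits = name, qubits

class Compiler:
    def compile(self, source: str):
        # Toy "compiler": one instruction per line, e.g. "H 0" or "CX 0 1"
        ir = []
        for line in source.strip().splitlines():
            name, *qubits = line.split()
            ir.append(Instruction(name, [int(q) for q in qubits]))
        return ir

class Accelerator:
    def execute(self, ir):
        # A real backend would delegate to a vendor QPU API or a simulator;
        # here we just report what would run.
        return [(inst.name, inst.qubits) for inst in ir]

ir = Compiler().compile("H 0\nCX 0 1")
print(Accelerator().execute(ir))  # → [('H', [0]), ('CX', [0, 1])]
```

Because the IR sits between the two, any compiler frontend can feed any accelerator backend — the hardware-agnosticism described above.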
The IR itself can be further decomposed into instruction and function concepts, with instructions forming the foundation of the IR infrastructure and functions serving as compositions of instructions (see Figure 5). Each instruction exposes a unique name and the set of qubits that it operates on. Functions are a subtype of the instruction abstraction that can contain further instructions. This setup, the familiar composite design pattern com (), forms an n-ary tree of instructions where function instances serve as nodes and concrete instruction instances serve as leaves.
Operating on this tree and executing program instructions amounts to a simple pre-order traversal of the IR tree. In order to enhance this tree of instructions with additional functionality, XACC provides a dynamic double-dispatch mechanism, specifically an implementation of the familiar visitor pattern Gamma et al. (1995). The visitor pattern provides a mechanism for adding virtual functions to a hierarchy of common data structures dynamically, at runtime, and without modifying the underlying type. This is accomplished via the introduction of a visitor type that exposes a public set of visit functions, each one taking a single argument that is a concrete subtype of the hierarchical data structure composition (see Figure 5). For gate-model quantum computing, XACC exposes a visitor class that provides a visit method for all concrete gate instructions (X, H, RZ, CX, etc.). All instructions expose an accept method that takes as input a general visitor instance and invokes the appropriate visit method on the visitor through double-dispatch. XACC instruction visitors thereby provide an extensible mechanism for dynamically operating on, analyzing, and transforming compiled IR instances at runtime.
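The composite tree plus double-dispatch described above can be sketched as follows (illustrative class names, not XACC's actual implementation): a Function node traverses its children in pre-order, each leaf's accept method dispatches to the matching visit method, and the visitor accumulates whatever analysis it implements — here, simple gate counting:

```python
# Sketch of the instruction/visitor double-dispatch pattern.
class Gate:
    def __init__(self, *qubits):
        self.qubits = list(qubits)
    def accept(self, visitor):
        # double dispatch: invoke the visit_* method matching this gate type
        getattr(visitor, 'visit_' + type(self).__name__)(self)

class H(Gate): pass
class CX(Gate): pass

class Function(Gate):  # composite node holding child instructions
    def __init__(self, *children):
        self.children = list(children)
    def accept(self, visitor):
        for child in self.children:   # pre-order traversal of the IR tree
            child.accept(visitor)

class CountingVisitor:
    def __init__(self):
        self.counts = {}
    def visit_H(self, gate):
        self.counts['H'] = self.counts.get('H', 0) + 1
    def visit_CX(self, gate):
        self.counts['CX'] = self.counts.get('CX', 0) + 1

kernel = Function(H(0), CX(0, 1), H(1))
v = CountingVisitor()
kernel.accept(v)
print(v.counts)  # → {'H': 2, 'CX': 1}
```

Swapping in a different visitor (e.g., one that applies each gate to a tensor network) changes the behavior without touching the instruction classes — the extensibility point TNQVM exploits.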
IV.2 Tensor Network Accelerator and Instruction Visitors
The integration of a tensor network quantum circuit simulator with XACC can be accomplished through extensions of appropriate XACC concepts. In essence, this is an extension of the quantum virtual machine hardware abstraction layer that enables existing highlevel programs and libraries to target a new virtual hardware instance. Injecting new simulators into the XACC framework requires a new implementation of the accelerator concept. Enabling that simulator to be extensible in the type of tensor networks, algorithmic execution, and the library backend requires different mappings of the IR to appropriate simulation data structures. This can be accomplished through individual implementations of the instruction visitor concept.
Our open-source implementation of the Tensor Network Quantum Virtual Machine (TNQVM) library extends the XACC accelerator concept with a new derived class that simulates pure-state quantum computation via tensor network theory McCaskey and Chen (2017). This library provides the TNAccelerator (Tensor Network Accelerator) that exposes an execute method that takes as input the XACC IR function to be executed. Generality in the tensor network graph structure and the simulation algorithm is enabled through appropriate implementations of the instruction visitor concept. For example, an instruction visitor can be implemented to map the incoming XACC IR tree to tensor operations on a matrix product state (MPS) ansatz. Walking the IR tree via pre-order traversal and invoking the instruction visitor accept mechanism on each instruction triggers invocation of the appropriate visit function via double dispatch. The implementation of these visit methods provides an extensible mechanism for performing instruction-specific tensor operations on a specific tensor network graph structure.
Furthermore, this visitor extension mechanism can be leveraged to not only provide new tensor network structures and operations, but also provide the means to leverage different tensor algebra backend libraries, and therefore introduce a classical parallel execution context. Different visitor implementations may provide a strictly serial simulation approach, while others can enable a massively parallel or heterogeneous simulation approach (incorporating the Message Passing Interface, OpenMP, and/or GPU acceleration via CUDA or OpenCL).
To date, we have implemented two instruction visitor backends for TNQVM and the TNAccelerator. We have leveraged the ITensor library ite () to provide a serial matrix product state simulator, and the ExaTENSOR library from the Oak Ridge Leadership Computing Facility (OLCF) to provide a matrix product state simulator that leverages MPI, OpenMP and CUDA for distributed parallel execution on GPU-accelerated heterogeneous HPC platforms. However, the ExaTENSOR library is currently undergoing final testing before its public release; thus, it has not yet been utilized as a fully functional backend of TNQVM. Nevertheless, we provide some details on the ExaTENSOR backend below.
IV.2.1 ITensor MPS Implementation
The ITensor MPS instruction visitor implementation provides a mechanism for the simulation of an N-qubit wavefunction via a matrix product state tensor network decomposition. The MPS provides a way to restrict the entanglement entropy through SVD and the associated truncation of Schmidt coefficients to reduce the overall Schmidt rank. With these MPS states, we need O(Nχ^2) numbers to represent N qubits, where χ is the largest Schmidt rank we keep. As long as χ is not too large (i.e., grows at most polynomially with system size), the space complexity remains feasible for classical simulation. For example, if the quantum register is used to store the gapped ground states of systems with local interactions, we can simulate a larger number of qubits and still adequately approximate the wavefunction by keeping χ small enough.
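The O(Nχ^2) vs 2^N gap is worth making concrete. A quick back-of-the-envelope comparison (our own sketch; the 2Nχ^2 figure is an upper bound counting N rank-3 site tensors of shape at most χ x 2 x χ):

```python
# Compare complex parameters stored by a chi-bounded MPS with those of a
# full state vector.
def mps_params(n_qubits: int, chi: int) -> int:
    return 2 * n_qubits * chi * chi  # upper bound; edge tensors are smaller

def full_params(n_qubits: int) -> int:
    return 2 ** n_qubits

for n in (20, 50, 100):
    print(n, mps_params(n, chi=64), full_params(n))
```

At 100 qubits with χ = 64, the MPS holds under a million numbers while the full vector would need 2^100 — the compression is what makes the simulation tractable whenever the entanglement keeps χ bounded.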
Our ITensor MPS visitor implementation begins by initializing a matrix product state tensor network using the serial tensor data structures provided by the ITensor library ite (). Simulation of the compiled IR program is run through a pre-order traversal of the instruction tree. At each leaf of this tree (a concrete instruction), the accept method on the instruction is invoked (see Figure 5), which dispatches a call to the correct visit method of the instruction visitor.
At this point, the appropriate gate tensor is contracted into the MPS representation, which maps onto itself under local quantum gates. Updating the MPS according to two-body entanglers involves two-qubit gates which act on two rank-3 tensors, and the full contraction results in a rank-4 tensor. We maintain the MPS structure by decomposing the rank-4 tensor into two rank-3 tensors and a diagonal matrix between them. Note that when the two qubits are not adjacent, we apply SWAP gates on intermediary qubits to bring them together. The gate is then applied, and reverse SWAPs bring the qubits back to their original positions. Otherwise, applying a gate to non-adjacent qubits would modify the underlying graph topology, complicating future evolution by adding a non-local loop to the tensor network.
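The SWAP insertion just described is purely mechanical: move one qubit next to the other, apply the gate, then undo the swaps. A small sketch (our own illustration, not TNQVM's routine) that emits the resulting gate sequence:

```python
# Bring qubit j adjacent to qubit i (i < j), apply the gate at (i, i+1),
# then undo the swaps to restore the original qubit ordering.
def with_swaps(gate_name, i, j):
    assert i < j
    swaps = [(k, k + 1) for k in range(j - 1, i, -1)]  # move j down to i+1
    circuit = [('SWAP', a, b) for a, b in swaps]
    circuit.append((gate_name, i, i + 1))
    circuit += [('SWAP', a, b) for a, b in reversed(swaps)]
    return circuit

print(with_swaps('CX', 0, 3))
# → [('SWAP', 2, 3), ('SWAP', 1, 2), ('CX', 0, 1), ('SWAP', 1, 2), ('SWAP', 2, 3)]
```

Each SWAP is itself a nearest-neighbor two-qubit gate, so every operation in the emitted sequence admits the accelerated adjacent-site update, at the cost of O(j - i) extra gates.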
The SVD is used to return the resulting rank-4 tensor to the canonical MPS form (N rank-3 tensors and N-1 diagonal matrices), with the singular values below a configurable cutoff threshold being truncated. The truncation over subspaces supporting exponentially small components of the wavefunction allows our MPS-based TNQVM to simulate large numbers of qubits, contingent on slowly growing entanglement. Examples and discussion may be found in the demonstrations in Sec. V.
IV.2.2 ExaTENSOR MPS Implementation
The ExaTENSOR numerical tensor algebra backend will enable larger-scale TNQVM quantum circuit simulations on GPU-enabled and other accelerated, as well as conventional multicore, HPC platforms. ExaTENSOR stores tensors in distributed memory (across many nodes) as a generally sparse collection of tensor slices in a hierarchical fashion. Such distributed tensor storage lifts the memory limitations pertinent to a single node, thus extending the maximal number of simulated qubits. Although we currently target the (distributed) MPS implementation, ExaTENSOR also provides a generic tensor network builder that can be used for constructing an arbitrary tensor network. The ExaTENSOR MPS visitor implementation provides a constructor that creates the MPS representation of the simulated multi-qubit wavefunction (all constituent MPS tensors are now distributed). Then the XACC IR tree traversal invokes the ExaTENSOR MPS visit method for each traversed node (instruction). The visit method implements lazy visiting, namely, it only caches the corresponding instruction (gate) in the instruction cache of the ExaTENSOR MPS visitor. Once the instruction cache has enough work to perform, the evaluate method of the ExaTENSOR visitor is invoked, which implements the generic gate action algorithm shown in Section II. Specifically, it allocates the output MPS tensor network, that is, the result of the action of the cached gates on the input MPS tensor network. Then it creates the inner product (closed) tensor network by joining the gate tensors to the input MPS tensor network and subsequently closing it with the output tensor network (see Figure 2). This closed tensor network is a scalar whose value needs to be maximized. The ExaTENSOR MPS visitor will utilize the standard gradient descent algorithm by evaluating the gradient with respect to each tensor constituting the output tensor network. Each of these gradients is an open tensor network itself that needs to be fully contracted.
Importantly, the computational cost of this contraction of many tensors strongly depends on the order in which the pairwise tensor contractions are performed. Finding the optimal tensor contraction sequence is an NP-hard problem. Instead, ExaTENSOR implements a heuristic algorithm that delivers the best found sequence of pairwise tensor contractions in a reasonable amount of time (sub-second). This pseudo-optimal sequence of pairwise tensor contractions is then cached for subsequent reuse, if needed. Given the sequence of pairwise tensor contractions, the ExaTENSOR library numerically evaluates all of them and returns the gradients, which are subsequently used to update the tensors of the output tensor network until the optimized inner product scalar reaches the desired value. In case it does not reach the desired value, the tensors constituting the output tensor network are reallocated with increased dimensions of the auxiliary spaces and the entire procedure is repeated. At this point, the early prototype implementation of the ExaTENSOR MPS visitor in TNQVM is based on the single-node version of the ExaTENSOR library; we are currently finishing the integration of TNQVM with the distributed version of the ExaTENSOR library, as well as performing the final testing of the ExaTENSOR library itself before the public release scheduled later this year.
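A minimal greedy heuristic over index dimensions (a simplification for illustration, not ExaTENSOR's actual algorithm) shows why contraction order matters: at each step it picks the cheapest pairwise contraction under a naive cost model.

```python
from itertools import combinations

def contraction_cost(a, b, dims):
    # Naive cost model: proportional to the product of the dimensions
    # of all indices carried by the pair of tensors being contracted.
    cost = 1
    for idx in a | b:
        cost *= dims[idx]
    return cost

def greedy_sequence(tensors, dims):
    """Greedily choose the cheapest pairwise contraction at each step.
    Tensors are sets of index labels; assumes each index appears on at
    most two tensors, so shared indices are summed out on contraction."""
    tensors = [set(t) for t in tensors]
    sequence = []
    while len(tensors) > 1:
        i, j = min(combinations(range(len(tensors)), 2),
                   key=lambda p: contraction_cost(tensors[p[0]],
                                                  tensors[p[1]], dims))
        sequence.append((i, j))
        merged = tensors[i] ^ tensors[j]  # shared indices are contracted away
        tensors = [t for k, t in enumerate(tensors) if k not in (i, j)]
        tensors.append(merged)
    return sequence
```

For an MPS-like chain with small bond dimensions at one end and large ones at the other, contracting the cheap end first keeps intermediate tensors small; a poor ordering can inflate the cost by orders of magnitude.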
V Demonstration
Here we demonstrate the utility of TNQVM by describing the overall memory scaling of our matrix product state TNQVM backend for varying levels of entanglement and system size. Our demonstrations show how TNQVM can be leveraged to validate hybrid quantum-classical programming models. Specifically, we focus on random circuit simulations and the variational quantum eigensolver (VQE) hybrid algorithm.
V.1 Profiling Random Circuit Simulations with MPS
We demonstrate the improved resource cost of representing quantum states with TNQVM (compressed MPS form vs. full state vector) by profiling the memory usage of simulating randomly generated circuits within an MPS formulation. We vary the entanglement structure of our random circuits by constructing time slices defined as rounds. The first round begins with a layer of Hadamard operations on all qubits, followed by a layer of single-qubit gates (Pauli gates and other general rotations), followed by a set of nearest-neighbor CNOT entangling operations. Multiple rounds constitute multiple iterations of generating these layers (excluding the Hadamards, which only appear in the first layer). Clearly, later rounds add layers of entangling CNOT operations and therefore generate states with a more complicated entanglement structure.
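The round structure just described can be sketched as a circuit generator. This is a structural sketch only: gate names are labels rather than matrices, and the single-qubit gate set and the particular nearest-neighbor CNOT pattern are illustrative choices, not necessarily those used in our experiments.

```python
import random

def random_round(n_qubits, first=False):
    """One 'round': a Hadamard layer (first round only), then a layer of
    random single-qubit gates, then nearest-neighbor CNOTs."""
    layer = []
    if first:
        layer += [("H", q) for q in range(n_qubits)]
    # Illustrative single-qubit gate set (labels only).
    layer += [(random.choice(["X", "Y", "Z", "T"]), q)
              for q in range(n_qubits)]
    # One illustrative pattern of nearest-neighbor entanglers.
    layer += [("CNOT", q, q + 1) for q in range(0, n_qubits - 1, 2)]
    return layer

def random_circuit(n_qubits, rounds):
    circ = []
    for r in range(rounds):
        circ += random_round(n_qubits, first=(r == 0))
    return circ
```

Each additional round appends another layer of entanglers, which is what drives the growth of the MPS bond dimension discussed next.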
We generate these random circuits over a range of qubit counts and numbers of rounds. For each (qubit count, round count) pair, we generate 10 random circuits, compute the heap memory usage, and compute the mean and standard deviation of the memory usage. The results are plotted in Figure 6. For lightly entangled systems (i.e., those generated by a small number of random rounds) we see that the MPS structure is able to encode the wavefunction of the system efficiently at a small cost. For example, for only two rounds the maximum bond dimension remains small and is independent of system size. As we increase the entanglement in our random circuits, the computational cost of the MPS simulations increases exponentially. This is because the circuits we sample from are designed to rapidly increase the entanglement, which saturates at the maximal entanglement of an n-qubit system undergoing many random rounds Boixo et al. (2018).
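A back-of-the-envelope model reproduces this qualitative behavior. Assuming, as a simplification, that each entangling round can at most double the bond dimension, capped by the exact Schmidt-rank bound min(2^k, 2^(n-k)) across each cut, the MPS footprint can be estimated and compared against the full state vector:

```python
def mps_memory_bytes(n, chi, d=2, bytes_per_amp=16):
    """Bytes to store an n-qubit MPS with maximum bond dimension chi,
    with each bond also capped by the exact Schmidt-rank bound."""
    total_amps = 0
    for site in range(n):
        chi_left = min(chi, 2**site, 2**(n - site))
        chi_right = min(chi, 2**(site + 1), 2**(n - site - 1))
        total_amps += d * chi_left * chi_right
    return total_amps * bytes_per_amp

def statevector_bytes(n, bytes_per_amp=16):
    # Brute-force storage: one complex double per basis amplitude.
    return bytes_per_amp * 2**n

def chi_after_rounds(n, rounds):
    # Simplified growth model: the bond dimension doubles per entangling
    # round until it saturates at the 2^(n//2) maximum.
    return min(2**rounds, 2**(n // 2))
```

Under this model, a lightly entangled 30-qubit register (two rounds, chi = 4) costs only kilobytes while the full state vector needs 16 GB; once chi saturates at 2^(n/2), the MPS no longer offers any compression, matching the exponential cost growth observed above.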
V.2 Variational Quantum Eigensolver
Finally, we demonstrate the utility of our tensor network simulation XACC Accelerator backend (the TNQVM library) in validating quantum-classical algorithms. It is this rapid feedback mechanism that is critical to understanding intended algorithmic results, and it enables confidence in the programming of larger systems. Here we demonstrate this programmability, along with its verification and validation, through a simple simulation of diatomic hydrogen via the variational quantum eigensolver algorithm. The quantum-classical program leveraging the TNQVM library is shown in the listing below.
This code listing demonstrates the integration of XACC and our tensor network accelerator implementation. The code shows how to program, compile, and execute the VQE algorithm to compute expectation values for the simplified (symmetry-reduced), two-qubit Hamiltonian (see O’Malley et al. (2016b)). We start off by defining the quantum source code as XACC quantum kernels (note: we have left out a few measurement kernels for brevity). Each of these kernels is parameterized by a single double representing the variational parameter for the problem ansatz circuit (the ansatz kernel in the h2_src string). Integration with the TNQVM simulation library is done through a public XACC API function (getAccelerator). This accelerator reference is used to compile the program and get references to executable kernels that delegate work to the TN Accelerator. We then loop over all values of the variational parameter and compute the expectation values for each Hamiltonian measurement term. Notice that this execution mechanism is agnostic to the accelerator subtype. This provides a way to quickly swap between verification and validation with TNQVM and physical hardware execution on quantum computers from IBM, Rigetti, etc.
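The control flow of that parameter sweep can be mimicked without XACC at all. The sketch below substitutes a toy dense two-qubit state-vector simulator and a simplified hardware-efficient ansatz (not the paper's XACC kernels, and not the exact O'Malley et al. ansatz) to sweep a variational parameter and evaluate one Pauli measurement term, here ⟨Z⊗I⟩:

```python
import math

X = [[0.0, 1.0], [1.0, 0.0]]

def Ry(theta):
    c, s = math.cos(theta / 2), math.sin(theta / 2)
    return [[c, -s], [s, c]]

def apply_1q(gate, q, state):
    """Apply a 2x2 gate to qubit q (q = 0 is the most significant bit)
    of a 2-qubit state stored as 4 amplitudes."""
    new = state[:]
    stride = 2 if q == 0 else 1
    for base in ((0, 1) if q == 0 else (0, 2)):
        a, b = state[base], state[base + stride]
        new[base] = gate[0][0] * a + gate[0][1] * b
        new[base + stride] = gate[1][0] * a + gate[1][1] * b
    return new

def cnot01(state):
    # CNOT with control qubit 0, target qubit 1: swaps |10> and |11>.
    s = state[:]
    s[2], s[3] = s[3], s[2]
    return s

def expect_z0(state):
    # <Z x I>: +1 on |0x> amplitudes, -1 on |1x> amplitudes.
    return sum((1 if i < 2 else -1) * abs(a) ** 2
               for i, a in enumerate(state))

def ansatz(theta):
    state = [1.0, 0.0, 0.0, 0.0]
    state = apply_1q(X, 1, state)          # prepare |01>
    state = apply_1q(Ry(theta), 0, state)  # variational rotation
    return cnot01(state)

# Sweep the variational parameter and evaluate the measurement term,
# mirroring the loop structure of the XACC listing.
sweep = [(theta, expect_z0(ansatz(theta)))
         for theta in (0.0, math.pi / 2, math.pi)]
```

In the real listing each kernel executes on whichever Accelerator was requested, so switching this loop between TNQVM validation and hardware execution changes only the argument passed to getAccelerator.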
VI Conclusion
In this work we have discussed the concept of a general quantum virtual machine and introduced a concrete implementation of the QVM that enables quantum-classical programming with validation through an extensible tensor network quantum circuit simulator (TNQVM). We have discussed the applicability and scalability of a matrix product state backend implementation for TNQVM, as well as the role of TNQVM in benchmarking quantum algorithms and hybrid quantum-classical applications, including the random circuit sequences used in quantum supremacy studies Boixo et al. (2018) and the variational quantum eigensolver Peruzzo et al. (2014). We have chosen a tensor-network-based quantum virtual machine due to the complexity reduction such a formalism provides for a broad range of problems. In general, TNQVM enables large-scale simulation of quantum circuits which generate states characterized by short-range entanglement. Studying systems with long-range entanglement will require further developments implementing more advanced tensor network decomposition types. We plan to investigate the applicability of the tree tensor network and the multi-scale entanglement renormalization ansatz in future work, in an effort to scale simulation capabilities to a larger number of qubits.
Acknowledgements
This work has been supported by the Laboratory Directed Research and Development Program of Oak Ridge National Laboratory and the US Department of Energy (DOE) Early Career Research Program. This research used resources of the Oak Ridge Leadership Computing Facility, which is a DOE Office of Science User Facility supported under Contract DE-AC05-00OR22725. This manuscript has been authored by UT-Battelle, LLC, under contract DE-AC05-00OR22725 with DOE. The US government retains and the publisher, by accepting the article for publication, acknowledges that the US government retains a non-exclusive, paid-up, irrevocable, worldwide license to publish or reproduce the published form of this manuscript, or allow others to do so, for US government purposes. DOE will provide public access to these results of federally sponsored research in accordance with the DOE Public Access Plan.
References
 Childs and van Dam (2010) A. M. Childs and W. van Dam, Rev. Mod. Phys. 82, 1 (2010), URL http://link.aps.org/doi/10.1103/RevModPhys.82.1.
 Montanaro (2016) A. Montanaro, npj Quantum Information 2, 15023 (2016), ISSN 2056-6387, URL http://www.nature.com/articles/npjqi201523.
 Biamonte and Bergholm (2017) J. Biamonte and V. Bergholm (2017), eprint 1708.00006, URL http://arxiv.org/abs/1708.00006.
 Linke et al. (2017) N. M. Linke, D. Maslov, M. Roetteler, S. Debnath, C. Figgatt, K. A. Landsman, K. Wright, and C. Monroe, Proceedings of the National Academy of Sciences p. 201618020 (2017).
 Friis et al. (2018) N. Friis, O. Marty, C. Maier, C. Hempel, M. Holzäpfel, P. Jurcevic, M. B. Plenio, M. Huber, C. Roos, R. Blatt, et al., Phys. Rev. X 8, 021012 (2018), URL https://link.aps.org/doi/10.1103/PhysRevX.8.021012.
 Preskill (2018) J. Preskill, arXiv preprint arXiv:1801.00862 (2018).
 Peruzzo et al. (2014) A. Peruzzo, J. McClean, P. Shadbolt, M.-H. Yung, X.-Q. Zhou, P. J. Love, A. Aspuru-Guzik, and J. L. O’Brien, Nature Communications 5, 4213 (2014), URL http://dx.doi.org/10.1038/ncomms5213.
 O’Malley et al. (2016a) P. J. J. O’Malley, R. Babbush, I. D. Kivlichan, J. Romero, J. R. McClean, R. Barends, J. Kelly, P. Roushan, A. Tranter, N. Ding, et al., Phys. Rev. X 6, 031007 (2016a), URL https://link.aps.org/doi/10.1103/PhysRevX.6.031007.
 Kandala et al. (2017) A. Kandala, A. Mezzacapo, K. Temme, M. Takita, M. Brink, J. M. Chow, and J. M. Gambetta, Nature 549, 242 (2017), URL http://dx.doi.org/10.1038/nature23879.
 Otterbach et al. (2017) J. Otterbach, R. Manenti, N. Alidoust, A. Bestwick, M. Block, B. Bloom, S. Caldwell, N. Didier, E. S. Fried, S. Hong, et al., arXiv preprint arXiv:1712.05771 (2017).
 Dumitrescu et al. (2018) E. F. Dumitrescu, A. J. McCaskey, G. Hagen, G. R. Jansen, T. D. Morris, T. Papenbrock, R. C. Pooser, D. J. Dean, and P. Lougovski, Phys. Rev. Lett. 120, 210501 (2018), URL https://link.aps.org/doi/10.1103/PhysRevLett.120.210501.
 Humble and Britt (2016) T. S. Humble and K. A. Britt, in 2016 IEEE High Performance Extreme Computing Conference (HPEC) (2016), pp. 1–8.
 Britt and Humble (2017a) K. A. Britt and T. S. Humble, ACM Journal on Emerging Technologies in Computing Systems (JETC) 13, 39 (2017a).
 Green et al. (2013) A. S. Green, P. L. Lumsdaine, N. J. Ross, P. Selinger, and B. Valiron, in ACM SIGPLAN Notices (ACM, 2013), vol. 48, pp. 333–342.
 Javadi-Abhari et al. (2014) A. Javadi-Abhari, S. Patil, D. Kudrow, J. Heckey, A. Lvov, F. T. Chong, and M. Martonosi, in Proceedings of the 11th ACM Conference on Computing Frontiers (ACM, 2014), p. 1.
 Wecker and Svore (2014) D. Wecker and K. M. Svore, arXiv preprint arXiv:1402.4467 (2014).
 Humble et al. (2014) T. S. Humble, A. J. McCaskey, R. S. Bennink, J. J. Billings, E. F. D’Azevedo, B. D. Sullivan, C. F. Klymko, and H. Seddiqi, Computational Science and Discovery 7, 015006 (2014), URL http://stacks.iop.org/1749-4699/7/i=1/a=015006.
 Smith et al. (2016a) R. S. Smith, M. J. Curtis, and W. J. Zeng, arXiv preprint arXiv:1608.03355 (2016a).
 Liu et al. (2017) S. Liu, X. Wang, L. Zhou, J. Guan, Y. Li, Y. He, R. Duan, and M. Ying, arXiv:1710.09500 (2017).
 Svore et al. (2018) K. Svore, A. Geller, M. Troyer, J. Azariah, C. Granade, B. Heim, V. Kliuchnikov, M. Mykhailova, A. Paz, and M. Roetteler, in Proceedings of the Real World Domain Specific Languages Workshop 2018 (ACM, New York, NY, USA, 2018), RWDSL2018, pp. 7:1–7:10, ISBN 9781450363556, URL http://doi.acm.org/10.1145/3183895.3183901.
 Pakin (2018) S. Pakin, Theory and Practice of Logic Programming pp. 1–22 (2018).
 Smith et al. (2016b) R. S. Smith, M. J. Curtis, and W. J. Zeng, A practical quantum instruction set architecture (2016b).
 Steiger et al. (2016) D. S. Steiger, T. Häner, and M. Troyer, arXiv e-prints (2016), eprint 1612.08091.
 Cross et al. (2017) A. W. Cross, L. S. Bishop, J. A. Smolin, and J. M. Gambetta, arXiv e-prints (2017), eprint 1707.03429.
 McCaskey et al. (2017) A. J. McCaskey, E. F. Dumitrescu, D. Liakh, M. Chen, W.-c. Feng, and T. S. Humble, arXiv e-prints (2017), eprint 1710.01794.
 Häner and Steiger (2017) T. Häner and D. S. Steiger, arXiv e-prints (2017), eprint 1704.01127.
 Pednault et al. (2017) E. Pednault, J. A. Gunnels, G. Nannicini, L. Horesh, T. Magerlein, E. Solomonik, and R. Wisnieff (2017), eprint 1710.05867, URL http://arxiv.org/abs/1710.05867.
 Aaronson and Gottesman (2004) S. Aaronson and D. Gottesman, Phys. Rev. A 70, 052328 (2004), URL https://link.aps.org/doi/10.1103/PhysRevA.70.052328.
 Orús (2014) R. Orús, Annals of Physics 349, 117 (2014), ISSN 1096-035X, eprint 1306.2164, URL http://linkinghub.elsevier.com/retrieve/pii/S0003491614001596.
 Kolda (2015) T. G. Kolda, Mathematical Programming 151, 225 (2015), ISSN 1436-4646, URL https://doi.org/10.1007/s10107-015-0895-0.
 White (1992) S. R. White, Physical Review Letters 69, 2863 (1992).
 Schollwöck (2011) U. Schollwöck, Annals of Physics 326, 96 (2011), ISSN 0003-4916, eprint 1008.3477, URL http://linkinghub.elsevier.com/retrieve/pii/S0003491610001752.
 Schuch et al. (2007) N. Schuch, M. M. Wolf, F. Verstraete, and J. I. Cirac, Physical Review Letters 98, 140506 (2007), ISSN 0031-9007, eprint 0708.1567, URL https://link.aps.org/doi/10.1103/PhysRevLett.98.140506.
 Verstraete et al. (2008) F. Verstraete, V. Murg, and J. Cirac, Advances in Physics 57, 143 (2008), ISSN 0001-8732, URL http://www.tandfonline.com/doi/abs/10.1080/14789940801912366.
 Murg et al. (2010) V. Murg, F. Verstraete, Ö. Legeza, and R. M. Noack, Physical Review B - Condensed Matter and Materials Physics 82, 205105 (2010), ISSN 1098-0121, eprint 1006.3095, URL https://link.aps.org/doi/10.1103/PhysRevB.82.205105.
 Nakatani and Chan (2013) N. Nakatani and G. K.-L. Chan, Journal of Chemical Physics 138, 134113 (2013), ISSN 0021-9606, eprint 1302.2298, URL http://aip.scitation.org/doi/10.1063/1.4798639.
 Dumitrescu (2017) E. Dumitrescu, Physical Review A 96, 062322 (2017), ISSN 2469-9934, eprint 1705.01140, URL https://link.aps.org/doi/10.1103/PhysRevA.96.062322.
 Vidal (2006) G. Vidal, Phys. Rev. Lett. 101, 110501 (2006), ISSN 0031-9007, eprint quant-ph/0610099, URL http://dx.doi.org/10.1103/PhysRevLett.101.110501.
 Evenbly and Vidal (2009) G. Evenbly and G. Vidal, Physical Review B 79, 144108 (2009), ISSN 1098-0121, eprint arXiv:0707.1454v3, URL http://prb.aps.org/abstract/PRB/v79/i14/e144108.
 Marti et al. (2010) K. H. Marti, B. Bauer, M. Reiher, M. Troyer, and F. Verstraete, New Journal of Physics 12, 103008 (2010), URL http://stacks.iop.org/1367-2630/12/i=10/a=103008.
 Dang et al. (2017) A. Dang, C. D. Hill, and L. C. L. Hollenberg (2017), eprint 1712.07311, URL http://arxiv.org/abs/1712.07311.
 Fried et al. (2017) E. S. Fried, N. P. D. Sawaya, Y. Cao, I. D. Kivlichan, J. Romero, and A. Aspuru-Guzik, arXiv preprint arXiv:1709.03636 (2017), eprint 1709.03636, URL http://arxiv.org/abs/1709.03636.
 Britt and Humble (2017b) K. A. Britt and T. S. Humble, in High Performance Computing, edited by J. M. Kunkel, R. Yokota, M. Taufer, and J. Shalf (Springer International Publishing, Cham, 2017b), pp. 98–105.
 (44) ITensor, URL http://itensor.org.
 McCaskey et al. (2018) A. McCaskey, E. Dumitrescu, D. Liakh, and T. Humble, arXiv e-prints (2018), eprint 1805.09279.
 (46) A look at the composite design pattern, https://www.javaworld.com/article/2074564/learn-java/a-look-at-the-composite-design-pattern.html, accessed: 2018-03-12.
 Gamma et al. (1995) E. Gamma, R. Helm, R. Johnson, and J. Vlissides, Design Patterns: Elements of Reusable Objectoriented Software (AddisonWesley Longman Publishing Co., Inc., Boston, MA, USA, 1995), ISBN 0201633612.
 McCaskey and Chen (2017) A. McCaskey and M. Chen, TNQVM - Tensor Network Quantum Virtual Machine, https://github.com/ORNL-QCI/tnqvm (2017).
 Boixo et al. (2018) S. Boixo, S. V. Isakov, V. N. Smelyanskiy, R. Babbush, N. Ding, Z. Jiang, M. J. Bremner, J. M. Martinis, and H. Neven, Nature Physics 14, 1 (2018), ISSN 1745-2481, eprint 1608.00263, URL https://doi.org/10.1038/s41567-018-0124-x.
 O’Malley et al. (2016b) P. J. J. O’Malley, R. Babbush, I. D. Kivlichan, J. Romero, J. R. McClean, R. Barends, J. Kelly, P. Roushan, A. Tranter, N. Ding, et al., Phys. Rev. X 6, 031007 (2016b), URL https://link.aps.org/doi/10.1103/PhysRevX.6.031007.