Quantum CISC Compilation by Optimal Control and Scalable Assembly of Complex Instruction Sets beyond Two-Qubit Gates



We present a quantum cisc compiler and show how to assemble complex instruction sets in a scalable way. Enlarging the toolbox of universal gates by optimised complex multi-qubit instruction sets thus paves the way to fight relaxation for realistic experimental settings.

Compiling a quantum module into the machine code for steering a concrete quantum hardware device lends itself to be tackled by means of optimal quantum control. To this end, there are two opposite approaches: (i) one may use a decomposition into the restricted instruction set (risc) of universal one- and two-qubit gates and translate them into the machine code, or (ii) one may prefer to generate the entire target module as a complex instruction set (cisc) directly by evolution under drift and available controls. Here we advocate direct compilation up to the limit of system size a classical high-performance parallel computer cluster can reasonably handle. For going beyond these limits, i.e. for large systems, we propose a combined way, namely (iii) to make recursive use of medium-sized building blocks generated by optimal control in the sense of a quantum cisc compiler.

The advantage of the method over standard risc compilations into one- and two-qubit universal gates is explored on the parallel cluster hlrb-ii (with a total linpack performance of TFlops/s) for the quantum Fourier transform, the indirect swap gate as well as for multiply-controlled not gates. Implications for upper limits to time complexities are also derived.

PACS numbers: 03.67.-a, 03.67.Lx, 03.65.Yz, 03.67.Pp; 82.56.-b, 82.56.Jn, 82.56.Dj, 82.56.Fk


Richard Feynman’s seminal conjecture of using experimentally controllable quantum systems to perform computational tasks Feynman (1982, 1996) is rooted in reducing the complexity of the problem when moving from a classical setting to a quantum setting. The most prominent pioneering example is Shor’s quantum algorithm of prime factorisation Shor (1994, 1997), which is of polynomial complexity (bqp) on quantum devices instead of showing non-polynomial complexity on classical ones Papadimitriou (1995). It is an example of a class of quantum algorithms Jozsa (1998); Cleve et al. (1998) that solve hidden subgroup problems in an efficient way Ettinger et al. (2004), where in the Abelian case, the speed-up hinges on the quantum Fourier transform (qft). Whereas the network complexity of the fast Fourier transform (fft) for classical bits is of order Cooley and Tukey (1965); Beth (1984), the qft for qubits shows a complexity of order . Moreover, Feynman’s second observation that quantum systems may be used to efficiently predict the behaviour of other quantum systems has inaugurated a branch of research dedicated to Hamiltonian simulation Lloyd (1996); Abrams and Lloyd (1997); Zalka (1998); Bennett et al. (2002); Masanes et al. (2002); Jané et al. (2003).

For implementing a quantum algorithm in an experimental setup, local operations and universal two-qubit quantum gates are required as a minimal set ensuring every unitary module can be realised Deutsch (1985). More recently, it turned out that generic qubit and qudit pair interaction Hamiltonians suffice to complement local actions to universal controls Dodd et al. (2002); Bremner et al. (2005). Common sets of quantum computational instructions comprise (i) local operations such as the Hadamard gate, the phase gate and (ii) the entangling operations cnot, controlled-phase gates, , swap as well as (iii) the swap operation. The number of elementary gates required for implementing a quantum module then gives the network or gate complexity.

As is well known, a generic -qubit operation requires exponentially many two-qubit gates to be implemented exactly Barenco et al. (1995); Knill (1995), the complexity being . Yet, as has been pointed out by Barenco et al., many quantum computationally pertinent gates can be decomposed into a number of one- and two-qubit gates increasing linearly with the number of qubits. At the expense of a single ancilla qubit this also holds for multiply controlled unitary gates Barenco et al. (1995) tantamount to error correction. For an overview, see e.g. Nielsen and Chuang (2000); Kitaev et al. (2002); Mermin (2007). Moreover, Blais Blais (2001) showed how to implement the QFT with linear gate complexity. Later, Solovay (Solovay (1995) quoted in Kitaev (1997) and Nielsen and Chuang (2000)) and then Kitaev addressed the problem of approximating arbitrary unitary gates by polynomially long 2-qubit gate sequences up to a given precision Kitaev (1997); Kitaev et al. (2002). More recently the bounds of approximating an arbitrary unitary were taken down to a polynomial of sixth order in the number of qubits and of third order in the geodesic distance of the unitary to unity Nielsen et al. (2006). Differential geometric aspects in terms of Finsler metrics have been raised in Nielsen (2006).

[Diagram. Left, classical: Program, Module, compiled to Machine Code. Right, quantum: Quantum Algorithm, Unitary Module, compiled to Machine Code of Quantum Evolutions under Drift and Controls.]

Figure 1: Compilation in classical computation (left) and quantum computation (right). Quantum machine code has to be time-optimal or protected against relaxation, otherwise the coherent superpositions are wiped out. A quantum risc-compiler (1) by universal gates leads to unnecessarily long machine code. Direct cisc-compilation into a single pulse sequence (2) exploits quantum control for a near time-optimal quantum machine code. Its classical complexity is np, so direct compilation by numerical optimal control resorting to a classical computer is unfeasible for large quantum systems. The third way (3) promoted here pushes quantum cisc-compilation to the limits of classical supercomputer clusters and then assembles the multi-qubit complex instruction sets recursively into time-optimised or relaxation-protected quantum machine code.

However, gate complexity often translates into too coarse an estimate for the actual time required to implement a quantum module (see e.g. Vidal et al. (2002); Childs et al. (2003); Zeier et al. (2004)), in particular, if the time scales of a specific experimental setting have to be matched. Instead, effort has been taken to give upper bounds on the actual time complexity Wocjan et al. (2002), e.g., by way of numerical optimal control Schulte-Herbrüggen et al. (2005).

Interestingly, in terms of quantum control theory, the existence of universal gates is equivalent to the statement that the quantum system is fully controllable as has first been pointed out in Ref. Ramakrishna and Rabitz (1995). This is, e.g., the case in systems of spin-1/2 qubits that form Ising-type weak-coupling topologies described by arbitrary connected graphs Schulte-Herbrüggen (1998); Glaser et al. (1998). Therefore the usual approach to quantum compilation in terms of local plus universal two-qubit operations Tucci (1999); Williams (2004); Shende et al. (2006); Svore et al. (2006); Tucci (2007) lends itself to be complemented by optimal-control based direct compilation into machine code: it may be seen as a technology-dependent optimiser in the sense of Ref. Svore et al. (2006), however, tailored to deal with more complex instruction sets than the usual local plus two-qubit building blocks. Not only is it adapted to the specific experimental setting, it also allows for fighting relaxation by either being near time-optimal or by exploiting relaxation-protected subspaces Schulte-Herbrüggen et al. (2006). Devising quantum compilation methods for optimised realisations of given quantum algorithms by admissible controls is therefore an issue of considerable practical interest. Here it is the goal to show how quantum compilation can favourably be accomplished by optimal control: the building blocks for gate synthesis will be extended from the usual set of restricted local plus universal two-qubit gates to a larger toolbox of scalable multi-qubit gates tailored to yield high fidelity in short time given concrete experimental settings.

Quantum Compilation as an Optimal Control Task

As shown in Fig. 1, the quantum compilation task can be addressed following different principal guidelines: (1) by the standard decomposition into local operations and universal two-qubit gates, which by analogy to classical computation was termed reduced instruction set quantum computation (risc) Sanders et al. (1999) or (2) by using direct compilation into one single complex instruction set (cisc) Sanders et al. (1999). The existence of such a single effective gate is guaranteed simply by the unitaries forming a group: a sequence of local plus universal gates is a product of unitaries and thus a single unitary itself.

As a consequence, cisc quantum compilation lends itself for resorting to numerical optimal control (on clusters of classical computers) for translating the unitary target module directly into the ‘machine code’ of evolutions of the quantum system under combinations of the drift Hamiltonian and experimentally available controls .

In a number of studies on quantum systems up to 10 qubits, we have shown that direct compilation by gradient-assisted optimal control Khaneja et al. (2005); Schulte-Herbrüggen et al. (2005); Spörl et al. (2007) allows for substantial speed-ups, e.g., by a factor of for a cnot and a factor of for a Toffoli-gate on coupled Josephson qubits Spörl et al. (2007). However, the direct approach naturally faces the limits of computing quantum systems on classical devices: upon parallelising our C++ code for high-performance clusters Gradl et al. (2006), we found that extending the quantum system by one qubit increases the cpu-time required for direct compilation into the quantum machine code of controls by roughly a factor of eight. So the classical complexity for optimal-control based quantum compilation is np.

Therefore, here we advocate a third approach (3) that uses direct compilation into multi-qubit complex instruction sets up to the cpu-time limits of optimal quantum control on classical computers: these building blocks are designed such as to allow for recursive scalable quantum compilation in large quantum systems (i.e. those beyond classical computability). In particular, the complex instruction sets may be optimised such as to fight relaxation by being near time-optimal, or, moreover, they may be devised such as to fight the specific imperfections of an experimental setting.


Before turning to optimal-control based cisc quantum compilation in more detail, it is important to ensure the quantum control system characterised by is in fact fully controllable.

Hamiltonian quantum dynamics following Schrödinger’s equation for the unitary image of a complete basis set of ‘state vectors’ representing a quantum gate


resembles the setting of a standard bilinear control system with state , drift , controls , and control amplitudes reading


where and . Clearly in the dynamics of closed quantum systems, the system Hamiltonian is the drift term, whereas the are the control Hamiltonians with as control amplitudes. In systems of qubits, , , and .
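The bilinear control structure described above can be made concrete for piecewise-constant control amplitudes. The following is a minimal sketch, assuming the standard propagator form of Schrödinger's equation, dU/dt = -i (H_d + Σ_j u_j(t) H_j) U with U(0) = 1 and ħ = 1. The names `expm_herm` and `propagate` are hypothetical and chosen for illustration only.

```python
import numpy as np

def expm_herm(H, t):
    """Return exp(-i t H) for a Hermitian matrix H via its eigendecomposition."""
    w, V = np.linalg.eigh(H)
    return (V * np.exp(-1j * t * w)) @ V.conj().T

def propagate(H_drift, H_ctrls, amplitudes, dt):
    """Propagator for piecewise-constant controls (illustrative assumption).

    amplitudes[k][j] is the amplitude u_j during time slice k of length dt.
    """
    U = np.eye(H_drift.shape[0], dtype=complex)
    for u_k in amplitudes:
        H = H_drift + sum(u * Hj for u, Hj in zip(u_k, H_ctrls))
        U = expm_herm(H, dt) @ U  # later time slices act from the left
    return U

# Usage: one qubit without drift, control Hamiltonian sigma_x / 2.
SX = np.array([[0, 1], [1, 0]], dtype=complex)
U_pi = propagate(np.zeros((2, 2), dtype=complex), [SX / 2], [[np.pi]], 1.0)
```

In this picture the list of amplitudes per time slice plays the role of the 'machine code', and the resulting unitary is the compiled module.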

A system is fully operator controllable, if to every initial state the entire unitary orbit can be reached. With density operators being Hermitian this means any final state can be reached from any initial state as long as both of them share the same spectrum of eigenvalues.

As established in Jurdjevic and Sussmann (1972), the bilinear system of Eqn. 2 is fully controllable if and only if the drift and controls are a generating set of by way of the commutator, i.e., .

Example 1 Consider a system of weakly coupled spin- qubits. Let , , be the Pauli matrices. In spins-, a for spin is tacitly embedded as where is at position . The same holds for , , and in the weak coupling terms with .

Now a system of qubits is fully controllable Schulte-Herbrüggen (1998), if e.g. the control Hamiltonians comprise the Pauli matrices on every single qubit selectively and the drift Hamiltonian encompasses the Ising pair interactions , where the coupling topology of may take the form of any connected graph. This theorem has meanwhile been generalised to other coupling types Schulte-Herbrüggen et al. (2002); Albertini and D’Alessandro (2002).
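For small chains, the controllability statement of Example 1 can be checked numerically by counting the dimension of the Lie algebra generated by drift and controls under commutation. The sketch below assumes a linear chain with σx and σy controls on every qubit and a drift given by the sum of nearest-neighbour σzσz terms; prefactors are dropped since they do not change the generated algebra. The function names are hypothetical. Full controllability corresponds to a dimension of 4^n - 1, the dimension of su(2^n).

```python
import numpy as np

I2 = np.eye(2, dtype=complex)
SX = np.array([[0, 1], [1, 0]], dtype=complex)
SY = np.array([[0, -1j], [1j, 0]], dtype=complex)
SZ = np.array([[1, 0], [0, -1]], dtype=complex)

def embed(ops, n):
    """Tensor product over n qubits; ops maps qubit position to a 2x2 operator."""
    out = np.eye(1, dtype=complex)
    for k in range(n):
        out = np.kron(out, ops.get(k, I2))
    return out

def ising_chain_generators(n, with_drift=True):
    """Skew-Hermitian generators: local x, y controls plus an Ising chain drift."""
    gens = []
    for k in range(n):
        gens.append(1j * embed({k: SX}, n))
        gens.append(1j * embed({k: SY}, n))
    if with_drift and n > 1:
        drift = sum(embed({k: SZ, k + 1: SZ}, n) for k in range(n - 1))
        gens.append(1j * drift)
    return gens

def lie_closure_dim(generators, tol=1e-9):
    """Dimension of the real Lie algebra generated by the given matrices."""
    elements = []  # orthonormal basis (as matrices) of the span found so far

    def try_add(m):
        # Gram-Schmidt with the real inner product Re tr(e^dagger m)
        for e in elements:
            m = m - np.real(np.vdot(e, m)) * e
        nrm = np.linalg.norm(m)
        if nrm > tol:
            elements.append(m / nrm)

    for g in generators:
        try_add(g)
    i = 0
    while i < len(elements):
        for j in range(i):
            a, b = elements[i], elements[j]
            try_add(a @ b - b @ a)  # commutator; kept only if it enlarges the span
        i += 1
    return len(elements)

dim_two_qubits = lie_closure_dim(ising_chain_generators(2))
```

Without the coupling term the same routine only finds the local algebras, which illustrates why the drift is needed as part of the generating set.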

In view of the compilation task in quantum computation we get the following synopsis:

Corollary 1

The following are equivalent:

  1. in a quantum system of coupled spins-, the drift and the controls form a generating set of ;

  2. the quantum system is operator controllable (in the sense of Ref. Albertini and D’Alessandro (2003));

  3. every unitary transformation can be realised by that system;

  4. there is a set of universal quantum gates for the quantum system.

Proof: The equivalence of (1) and (2) relies on the unitary group being a compact connected Lie group: compact connected Lie groups have no closed subsemigroups that are not groups themselves Jurdjevic and Sussmann (1972). Moreover, in compact connected Lie groups the exponential mapping is surjective, hence (1) ⇔ (3). Assertions (3) and (4) just re-express the same fact in different terminology.

Scope and Organisation of the Paper

The purpose of this paper is to show that optimal control theory can be put to good use for devising multi-qubit building blocks designed for scalable quantum computing in realistic settings. Note these building blocks are no longer meant to be universal in the practical sense that any arbitrary quantum module should be built from them (plus local controls). Rather they provide specialised sets of complex instructions tailored for breaking down typical tasks in quantum computation with substantial speed gains compared to the standard compilation by decomposition into one-qubit and two-qubit gates. Thus a cisc quantum compiler translates into significant progress in fighting relaxation.

For demonstrating quantum cisc compilation and scalable assembly, in this paper we choose systems with linear coupling topology, i.e., qubit chains coupled by nearest-neighbour Ising interactions. The paper is organised as follows: cisc quantum compilation by optimal control will be illustrated in three different, yet typical examples:

  1. the indirect -swap gate,

  2. the quantum Fourier transform (qft) ,

  3. the generalisation of the cnot and Toffoli gate to multiply-controlled not gates, cnot.

For every instance of -qubit systems, we analyse the effects of (i) sacrificing universality by going to special instruction sets tailored to the problem and of (ii) extending pair interaction gates to effective multi-qubit interaction gates, and (iii) we compare the time gain by recursive -qubit cisc-compilation () to the two limiting cases of the standard risc-approach () on the one hand and the (extrapolated) time-complexity inferred from single-cisc compilation (with ) on the other.


Time Standards

When comparing times to implement unitary target gates by the risc vs the cisc approach, we will assume for simplicity that local unitary operations are ‘infinitely’ fast compared to the duration of the Ising coupling evolution so that the total gate time is solely determined by the coupling evolutions unless stated otherwise. Let us emphasise, however, this stipulation only concerns the time standards. The optimal-control assisted cisc-compilation methods presented here are in no way limited to fast local controls. In particular, also the assembler step of concatenating the cisc-building blocks is independent of the ratio of times for local operations vs coupling interactions.

Overview on Gate and Time Complexities

For practical purposes, the complexity of a unitary quantum operation can be expressed in terms of two measures: the gate complexity counts the number of universal one- and two-qubit gates for exactly implementing the target operation in a circuit. Moreover, in view of fighting relaxation, we will estimate the time complexity in terms of consecutive time-slots with simultaneous -qubit modules required.

In order not to raise false expectations: upon changing from universal 2-qubit decompositions (risc) to -qubit cisc-implementations, the gate complexity for exact implementation of a generic -qubit unitary operation clearly remains np. It requires ‘exponentially many’ -qubit modules or -qubit modules () alike, yet a cut from the order of roughly necessary -qubit modules down to some -qubit modules (with up to ) is substantial and particularly valuable in few-qubit systems. More elaborate estimates will be given shortly. Likewise, in target modules with linear 2-qubit risc complexity, the -qubit cisc complexity remains linear, yet when translated into time complexity it may entail sizeable speed-ups: we will show examples where they allow for accelerations by more than a factor of .

Figure 2: Decomposition of an -qubit gate into a circuit of -qubit gates, where is a uniform block size and may consist of risc modules or cisc modules with . (a) Margolus pattern with integer, (b) or (c) , so integer.

To be more precise, a lower bound for the number of two-qubit gates necessary to exactly implement a generic -qubit unitary target module was given by Barenco et al. Barenco et al. (1995). Their parameter-counting argument is based on a gem, which deserves to be picked up for generalising it to realisations by -qubit modules as illustrated in Fig. 2. The key is that only in the first time slot the number of parameters directly relates to the unitary group, while from the second slot onwards the parameters have to be counted in terms of cosets of the form , if the -qubit module has overlaps of qubits and qubits with the two adjacent modules in the time slot before. The number of real parameters (denoted by for short) in the respective basic building blocks amounts to

-qubit operation with
no. of 2-qubit gates ()
no. of 2-qubit time slots ()
no. of 10-qubit gates ()
no. of 10-qubit time slots ()
Table 1: Lower Bounds to Gate Complexities and Time Complexities for Implementing Generic -Qubit Unitaries

With these stipulations one may readily determine the number of -qubit gates in a unitary network of the type of Fig. 2 a, where is integer, so as to exhaust the number of parameters of a generic -qubit target gate to be implemented. In the first time slot there are parallel -qubit gates (counting by the number of parameters in the group according to Eqn. 4), in the second time slot there are parallel -qubit gates. They contribute the number of parameters of the coset (Eqn. 6), where one is forced to choose for even and for odd in order to be efficient. Following the same Margolus pattern one adds as many -qubit gates (counting cosets) as required to exceed parameters. Using Gauss’ brackets one thus obtains the number of -qubit gates needed to implement a generic -qubit target gate


and the respective number of time slots by


For even with and Eqn. 7 specialises to reproduce the result of Ref. Barenco et al. (1995), i.e. .
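The parameter-counting argument for the pattern of Fig. 2 a can be sketched in a few lines. The sketch rests on assumptions that are labelled here because the explicit formulas are not reproduced above: a k-qubit gate in the first time slot contributes dim SU(2^k) = 4^k - 1 real parameters, every later gate contributes the coset count 4^k - 4^k1 - 4^k2 + 1 with k1 and k2 as close to k/2 as possible, and the block size k divides the number of qubits m. The function names are hypothetical.

```python
def group_params(k):
    """Number of real parameters of SU(2^k)."""
    return 4 ** k - 1

def coset_params(k, k1, k2):
    """Parameters of a k-qubit gate modulo the two overlapping subgroups (assumed form)."""
    return group_params(k) - group_params(k1) - group_params(k2)

def min_gates(m, k):
    """Lower bound on the number of k-qubit gates for a generic m-qubit unitary."""
    if m % k != 0:
        raise ValueError("this sketch only covers the case where k divides m")
    k1 = k // 2
    k2 = k - k1
    first_slot = m // k
    remaining = group_params(m) - first_slot * group_params(k)
    if remaining <= 0:
        return first_slot
    per_gate = coset_params(k, k1, k2)
    return first_slot + -(-remaining // per_gate)  # integer ceiling division

gates_4_qubits = min_gates(4, 2)
```

For k = 2 this count agrees with the closed form (4^m - 3m - 1)/9, which is the two-qubit bound commonly attributed to Barenco et al.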

Next, consider Fig. 2 b and its Margolus pattern with one overhead of qubits to be taken into account by Eqn. 5. Then the same arguments give


Finally, for a pattern with two such overheads as in Fig. 2 c, where , one likewise finds


With efficient implementations requiring to be closest to (vide supra), three overheads do not occur.

Since , one may use with the most efficient setting of for even or for odd as a lower bound for the number of unitary -qubit modules necessary to exactly implement an arbitrary generic -qubit target unitary.

In the limit of large , one thus obtains the bounds on gate complexities and so . Likewise the limiting time complexities and give a speed-up potential of in units of the ratio of single-gate times in the respective experimental setting. These limiting speed-up ratios are nearly reached already for , as the numbers given in Tab. 1 show. In this sense, accelerations may be taken as roughly constant over the entire range of interest.

Although in generic -qubit unitaries the cisc speed-up may appear overwhelming, quantum algorithms usually resort by construction to highly non-generic unitary building blocks, many of which have linear complexities Barenco et al. (1995). However, in these seemingly less rewarding yet practically relevant cases, cisc compilation will turn out to be highly advantageous, as demonstrated in three worked examples in the current study. Since generic and thus highly entangled states have recently turned out to be computationally of modest use Gross et al. (2008), recasting the above analysis in terms of -designs and -designs Dankert et al. (2006); Gross et al. (2007); Ambainis and Emerson (2007) and following concentrations of measure will give a more realistic estimate, which is part of a different project.

Figure 3: (Colour online) Comparison of error-propagation models for random unitary gates with qubits (a) and qubits (b) requiring representations with different scales. Single gate fidelity in the Monte Carlo simulations is . Repetition of the same gate (blue) is compared with repetitions of a sequence of four independent gates (black). Out of 10 Monte Carlo simulations (details see text), the median (solid lines) as well as the best and worst cases (dashed lines) are given. The red solid lines denote independent error propagation . Large systems () with several gates () resemble independent error propagation almost perfectly, as in (b) the black and the red solid lines virtually coincide.

Error Propagation and Relaxative Losses

As the main figure of merit we refer to a quality function


resulting from the fidelity and the relaxative decay with overall relaxation rate constant during a duration assuming independence of fidelity and decay. Moreover, for qubits one defines as the trace fidelity of an experimental unitary module with respect to the target gate the quantity


where both with . It follows via the simple relation to the Euclidean distance

the latter two identities invoking unitarity of . The reason for choosing the trace fidelity is its convenient Fréchet differentiability in view of gradient-flow techniques, see also Ref. Schulte-Herbrüggen et al. (2008a).
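The relation between trace fidelity and Euclidean distance mentioned above can be illustrated numerically. The sketch assumes the common definition F_tr = Re tr(U_target† U_exp) / N with N = 2^n, under which unitarity gives ||U_target - U_exp||² = 2N (1 - F_tr) in the Frobenius norm. The helper `random_unitary` is a hypothetical convenience and is not meant to sample the Haar measure.

```python
import numpy as np

def trace_fidelity(U_target, U_exp):
    """Re tr(U_target^dagger U_exp) / N (assumed definition of the trace fidelity)."""
    N = U_target.shape[0]
    return np.real(np.trace(U_target.conj().T @ U_exp)) / N

def random_unitary(N, rng):
    """Some unitary obtained from a QR decomposition (illustration only)."""
    Z = rng.standard_normal((N, N)) + 1j * rng.standard_normal((N, N))
    Q, _ = np.linalg.qr(Z)
    return Q

rng = np.random.default_rng(0)
N = 8  # three qubits
U, V = random_unitary(N, rng), random_unitary(N, rng)

distance_squared = np.linalg.norm(U - V, "fro") ** 2
from_fidelity = 2 * N * (1 - trace_fidelity(U, V))
```

Both quantities coincide up to rounding, so maximising the trace fidelity is the same as minimising the Euclidean distance between the two unitaries.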

Consider an -qubit-interaction module (cisc) with quality that decomposes into universal two-qubit gates (risc), out of which gates have to be performed sequentially. Moreover, each 2-qubit gate shall be carried out with the uniform quality . Henceforth we assume for simplicity equal relaxation rate constants, so are identified with . Then, as a first useful rule of thumb and assuming independent error propagation, it is advantageous to compile the -qubit module directly if . Or, more precisely, taking relaxation into account, if the module can be realised with a fidelity


A more refined picture emerges from Monte-Carlo simulations of error propagation. To this end, compare the above independent error estimates with two scenarios for a sequence of gates in total: (i) the -fold repetition of single unitary gates with individual errors meant to give with and (ii) the repetition of a sequence of four different gates again each with individual errors to give where . In the sequel, we refer to case (i) as and to case (ii) as .

Figure 4: (a) Simple starting point: building a swap gate from five swaps. (b) Generalisation: assembling a swap by four swaps for each type and one single swap so that .

For gates and errors to be generic, we use random unitaries (distributed according to the Haar measure following a recent modification Mezzadri (2007) of the qr-algorithm). To a given random unitary -qubit gate (defining its Hamiltonian via ) we simulate a generic error as follows: from another independent unitary take the matrix logarithm such that . Then to a given trace fidelity , a corresponding unitary with a Monte-Carlo random error (the error being introduced on the level of the Hamiltonian generators) can readily be obtained by solving


for . Along these lines one obtains the Monte-Carlo fidelities for repeating the -gate by




where the product runs from right to left. These Monte-Carlo simulations are compared to the simple model of independent errors according to


As shown in Fig. 3 a, for two-qubit gates the error propagates with a vast variance, which makes it virtually unpredictable. Thus assuming independence is always too optimistic for AAAA, while for ABCD it is still mostly optimistic, although there are cases in which the errors may compensate to give less effective loss than expected under independence.
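A strongly simplified version of such a Monte-Carlo comparison can be sketched as follows. Two points are assumptions and differ from the procedure described above: the error is applied as a multiplicative random unitary after each gate rather than on the level of the Hamiltonian generator, and the error generator is drawn from a Gaussian Hermitian ensemble. The Haar sampling follows the QR recipe with phase correction; all function names are hypothetical.

```python
import numpy as np

def haar_unitary(N, rng):
    """Haar-distributed unitary from a QR decomposition with phase correction."""
    Z = (rng.standard_normal((N, N)) + 1j * rng.standard_normal((N, N))) / np.sqrt(2)
    Q, R = np.linalg.qr(Z)
    d = np.diag(R)
    return Q * (d / np.abs(d))

def trace_fidelity(U, V):
    return np.real(np.trace(U.conj().T @ V)) / U.shape[0]

def error_unitary(N, F0, rng):
    """Random unitary whose trace fidelity with the identity equals F0."""
    G = rng.standard_normal((N, N)) + 1j * rng.standard_normal((N, N))
    kappa, W = np.linalg.eigh((G + G.conj().T) / 2)
    kappa = kappa / np.max(np.abs(kappa))  # eigenvalues scaled into [-1, 1]

    def f(eps):
        return np.mean(np.cos(eps * kappa))  # fidelity of exp(-i eps K) with identity

    lo, hi = 0.0, np.pi  # f decreases monotonically on this interval
    if f(hi) > F0:
        raise ValueError("F0 is too low for this bracket")
    for _ in range(60):  # bisection for the error strength
        mid = 0.5 * (lo + hi)
        if f(mid) > F0:
            lo = mid
        else:
            hi = mid
    eps = 0.5 * (lo + hi)
    return (W * np.exp(-1j * eps * kappa)) @ W.conj().T

def repeated_fidelity(U, F0, r, rng):
    """Fidelity of r noisy repetitions of U (fresh error each time) with the ideal U^r."""
    N = U.shape[0]
    ideal = np.eye(N, dtype=complex)
    noisy = np.eye(N, dtype=complex)
    for _ in range(r):
        ideal = U @ ideal
        noisy = U @ error_unitary(N, F0, rng) @ noisy
    return trace_fidelity(ideal, noisy)

rng = np.random.default_rng(1)
U = haar_unitary(4, rng)  # one random two-qubit gate
F0 = 0.99
monte_carlo = [repeated_fidelity(U, F0, r, rng) for r in range(1, 6)]
independent = [F0 ** r for r in range(1, 6)]
```

Comparing `monte_carlo` with `independent` over many random draws gives a rough impression of the spread around the independent-error model; the numbers from this simplified error model are not expected to reproduce Fig. 3.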

However, when moving to effective multi-qubit gates, i.e., cisc modules, the generic situation becomes more predictable. For example, in -qubit random unitary gates, Fig. 3 b shows that AAAA is significantly deviating from independent error propagation, whereas ABCD resembles independent error propagation almost perfectly. The situation is qualitatively exactly the same even if the single gate error is larger as tested by analogous Monte-Carlo simulations setting or (not shown).

In the sequel, we will, for the sake of simplicity, often assume independent error propagation at the expense of systematically underestimating the advantages of cisc compilation compared to the standard risc compilation into universal local and two-qubit gates.
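Under independent error propagation, the comparison between one cisc module and its risc decomposition reduces to simple arithmetic. The sketch below assumes that the quality is the product of a fidelity and an exponential decay exp(-rate × duration), that fidelities multiply over all two-qubit gates, and that decay accumulates only over the gates performed sequentially. The parameter names and the numbers in the usage lines are hypothetical.

```python
import math

def quality(fidelity, duration, rate):
    """Fidelity times relaxative decay (assumed product form)."""
    return fidelity * math.exp(-rate * duration)

def risc_quality(f2, t2, n_gates, n_sequential, rate):
    """Quality of a decomposition into n_gates two-qubit gates of fidelity f2
    and duration t2, of which n_sequential have to run one after another."""
    return (f2 ** n_gates) * math.exp(-rate * n_sequential * t2)

def cisc_breakeven_fidelity(f2, t2, n_gates, n_sequential, t_cisc, rate):
    """Fidelity above which a single module of duration t_cisc beats the decomposition."""
    return risc_quality(f2, t2, n_gates, n_sequential, rate) / math.exp(-rate * t_cisc)

# Illustrative numbers only: ten gates of fidelity 0.99, module twice as fast.
without_relaxation = cisc_breakeven_fidelity(0.99, 1.0, 10, 10, 5.0, 0.0)
with_relaxation = cisc_breakeven_fidelity(0.99, 1.0, 10, 10, 5.0, 0.01)
```

Without relaxation the threshold is simply the product of the two-qubit fidelities; a shorter module lowers the threshold further once relaxation is included.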

Computational Methods and Devices

Following the lines of our previous work on time complexity Schulte-Herbrüggen et al. (2005), we used the grape algorithm Khaneja et al. (2005) for direct cisc compilation. It tracks the fixed final times down to the shortest durations of controls still allowing for synthesising the unitary target gates with full fidelity. This gives currently the best known upper bounds to the minimal times required to realise a target module on a concrete hardware setting. We extended our parallelised c++ code of the grape package described in Gradl et al. (2006) by adding more flexibility allowing to efficiently exploit available parallel nodes independent of internal parameters Schulte-Herbrüggen et al. (2008b). Moreover, faster algorithms for matrix exponentials on high-dimensional systems based on approximations by Tchebychev series have been developed Waldherr (2007) specifically in view of application to large quantum systems Schulte-Herbrüggen et al. (2008b). Thus computations could be performed on the hlrb-ii supercomputer cluster at Leibniz Rechenzentrum of the Bavarian Academy of Sciences Munich. It provides an sgi Altix 4700 platform equipped with 9728 Intel Itanium2 Montecito Dual Core processors with a clock rate of GHz, which give a total linpack performance of TFlops/s. The present explorative study exploited the time allowance of approx. cpu hours.

I The SWAP Operation

Figure 5: (Colour online) (a): Times required for indirect swaps on linear chains of Ising-coupled qubits by assembling swap building blocks reaching from (risc) up to (cisc). Using linear regression, the dashed line is an extrapolation of the direct single-cisc compilations shown in the inset to large number of qubits, where direct cisc compilation is virtually impossible on classical computers. Time units are expressed as assuming the duration of local operations can be neglected compared to coupling evolutions (details in the text). (b): Translation of the effective gate times into overall quality figures for an effective gate assembled from components of single qualities (with the respective component fidelities homogeneously falling into a narrow interval for ). Data are shown for a uniform relaxation rate constant of .

The easiest and most basic examples to illustrate the pertinent effects of optimal-control based cisc-quantum compilation are the respective indirect swap gates in spin chains of qubits coupled by nearest-neighbour Ising interactions with denoting the coupling constant.

For the swap unit there is a standard textbook decomposition into three cnots. Thus for Ising-coupled systems and in the limit of fast local controls, the total time required for an swap is , and there is no faster implementation Khaneja et al. (2001, 2002). Note, however, that in systems coupled by the isotropic Heisenberg interaction , the swap may be directly implemented just by letting the system evolve for a time of only . Sacrificing universality, it may thus be advantageous to regard the swap as basic unit for the swap task rather than the universal cnot. Clearly, any even-order swap can be built from swaps along the lines of the most obvious scheme of Fig. 4 a. (The odd-order swaps follow, e.g., from swap by omitting qubit and all the gates connected to it.)
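The two statements about the two-qubit swap can be verified with small matrices. The sketch checks the textbook decomposition into three cnots and shows that evolution under an isotropic Heisenberg coupling yields the swap up to a global phase. The evolution is parametrised by a dimensionless angle, since the conversion to a physical time depends on the convention for the coupling constant, which is not assumed here.

```python
import numpy as np

SX = np.array([[0, 1], [1, 0]], dtype=complex)
SY = np.array([[0, -1j], [1j, 0]], dtype=complex)
SZ = np.array([[1, 0], [0, -1]], dtype=complex)

# Basis ordering |00>, |01>, |10>, |11> with the first qubit as the left bit.
CNOT_12 = np.array([[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 0, 1], [0, 0, 1, 0]], dtype=complex)
CNOT_21 = np.array([[1, 0, 0, 0], [0, 0, 0, 1], [0, 0, 1, 0], [0, 1, 0, 0]], dtype=complex)
SWAP = np.array([[1, 0, 0, 0], [0, 0, 1, 0], [0, 1, 0, 0], [0, 0, 0, 1]], dtype=complex)

def expm_herm(H, t):
    """exp(-i t H) for Hermitian H via eigendecomposition."""
    w, V = np.linalg.eigh(H)
    return (V * np.exp(-1j * t * w)) @ V.conj().T

# (1) swap from three cnots
swap_from_cnots = CNOT_12 @ CNOT_21 @ CNOT_12

# (2) swap from isotropic (Heisenberg) coupling, angle pi/4
H_iso = np.kron(SX, SX) + np.kron(SY, SY) + np.kron(SZ, SZ)
U_iso = expm_herm(H_iso, np.pi / 4)
global_phase = np.exp(-1j * np.pi / 4)
```

The second check relies on the identity σxσx + σyσy + σzσz = 2 SWAP - 1, so a single coupling evolution suffices where the Ising case needs three entangling steps.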

Moreover, the generalisation to decomposing a swap into a sequence with different swap building blocks (where ) as shown in Fig. 4 b is straightforward by ensuring . Due to its symmetry, the total duration then amounts to


and the overall quality as a function of the fidelities of the constituent gates reads


Now, the swap building blocks themselves can be precompiled into time-optimised single complex instruction sets by exploiting the grape-algorithm of optimal control up to the current limits of imposed by cpu-time allowance.

Proceeding in the next step to large , Fig. 5 underscores how the time required for swap gates decreases significantly by assembling precompiled swap building blocks as cisc units recursively up to a multi-qubit interaction size of , where the speed-up is by a factor of nearly . Clearly, such a set of swap building blocks with allows for efficiently synthesising any swap. Assuming for the moment that a linear time complexity of the swap can be taken for granted, one may extrapolate the results of direct cisc compilation from the range of the inset of Fig. 5 a to a large number of qubits. One thus obtains an estimated upper limit to the time complexity of the swap. This is indicated by the dashed line, the slope of which will be defined as . Likewise, the respective slopes of the -qubit decomposition are denoted by .

With these stipulations, we introduce as a measure for the potential of cisc compilation (versus risc compilation) the ratio of the slopes


and as a measure for the extent to which this potential has been exhausted by -qubit cisc compilation the ratio


thus providing as convenient measure of improvement


The data of Fig. 5 thus give a potential of ; by -qubit interactions it is already pretty well exhausted, as inferred from . The current cisc over risc improvement then amounts to .

On the other hand, deducing from Fig. 5 right away that the time complexity of swaps ought to be linear would be premature: although the slopes seem to converge to a non-zero limit, numerical optimal control may become systematically inefficient for larger interaction sizes . Therefore, although improbable, e.g., convergence of the slopes to a value of zero cannot be safely excluded on the current basis of findings. This also means a logarithmic time complexity can ultimately not be excluded either.

Figure 6: By rearranging the swaps and controlled phase gates, the standard decomposition of a -qubit quantum Fourier transform, qft (top trace) reduces to a realisation adapted to a coupling topology of linear nearest-neighbour interactions (lower trace) with a -qubit qft, -qubit cp-swaps (solid boxes), and an -qubit qft (dashed box). The notation is a shorthand for a rotation angle of .
Figure 7: For , a -qubit qft can be assembled from times an -qubit qft and instances of -qubit modules cp-swap, where the index of different phase-rotation angles takes the values . The dashed boxes correspond to Fig. 6 and show the induction .

Summarising the results for the indirect swaps in terms of the three criteria described in the introduction, we have the following: (i) in Ising coupled qubit chains, there is no speed-up by changing the basic unit from the universal cnot into a swap, whereas in isotropically coupled systems the speed-up amounts to a factor of three; (ii) extending the building blocks of swap from (risc) to (cisc) gives a speed-up by a factor of nearly two under Ising-type couplings; (iii) the numerical data are consistent with a time complexity converging to a linear limit for the swap task in Ising chains, however, there is no proof for it yet.

II The Quantum Fourier Transform (QFT)

Since many quantum algorithms take advantage of efficiently solving some hidden subgroup problem, the quantum Fourier transform plays a central role Jozsa (1998); Cleve et al. (1998); Ettinger et al. (2004).

In order to realise a qft on large qubit systems, our approach is the following: given an -qubit qft, we show that for obtaining a -qubit qft by recursively using multi-qubit building blocks, a second type of module is required, to wit a combination of controlled phase gates and swaps, which henceforth we dub -qubit cp-swap for short.

Figure 8: (Colour online) Comparison of cisc-compiled qft (red) with standard risc compilations () following the scheme by Saito et al. (2000) (black) or Blais (2001) (blue): (a) times for implementation translate into quality factors (b) for a relaxation rate constant of . Again, the dashed red line extrapolates from the direct single-cisc compilations shown in the upper inset of (a), the lower inset giving in logarithmic scales the times needed for the standard textbook risc compilation () on a linear Ising chain. Dotted red lines represent the less favourable results from qft variant II (Appendix A).

Here we present two alternatives: variant I with and, as a special case, variant II for even .

Choosing and for a start, the recursive construction is illustrated in Fig. 6. The top trace shows the standard textbook realisation of a -qubit qft. By shifting the final swap operations, it can be rearranged into the sequence of gates depicted in the lower trace. Note that the gates appearing in solid boxes constitute a -qubit qft (which itself is made of two -qubit qfts and a central -qubit cp-swap), while the ones in dashed boxes have to be added for a -qubit qft. For we have thus shown how a -qubit qft reduces to a -qubit qft, two -qubit cp-swaps, and an -qubit qft. So with providing a foundation, at the same time we have also illustrated the induction from a -qft to a -qft. Moreover, the same construction principle holds for any block size , which can readily be proven by a straightforward, but lengthy induction from to .

One thus arrives at the desired block decomposition of a general -qubit qft as shown in Fig. 7 (which is variant I; the less effective variant II can be found in Appendix A): it requires times the same -qubit qft interspersed with times an -qubit cp-swap, out of which show different phase-rotation angles. For all and , one finds the following observations:

  1. a cp-swap takes at least as long as a qft;

  2. a qft takes at least as long as a cp-swap;

  3. a cp-swap takes at least as long as a cp-swap .

Thus the duration of a -qubit qft built from -qubit and -qubit modules amounts to


Next, consider the overall quality of a -qubit qft in terms of its two types of building blocks, namely the basic -qubit qft as well as the constituent -qubit cp-swaps with their respective different rotation angles. It reads


In the following, we will neglect rotations as soon as their angle falls short of a threshold of . This approximation is safe since it is based on a calculation of a -qubit qft, where the truncation does not introduce any relative error beyond . According to the block decomposition of Fig. 7, thus three instances of cp-swaps are left, since all cp-swap elements with boil down to mere swap gates due to truncation of small rotation angles. The representation of these cp-swap modules is shown in Appendix B as Fig. .2.
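The effect of dropping small rotation angles can be checked numerically on a small example. The sketch below builds the textbook qft circuit (Hadamard gates, controlled phase rotations, final qubit reversal), omits every controlled phase whose angle lies below a cut-off, and compares the result with the exact transform. The function names, the qubit number and the deliberately coarse cut-off are assumptions chosen for illustration; they are not the threshold or system size used in the text.

```python
import numpy as np


def qft_exact(n):
    """Exact n-qubit qft as a (2**n x 2**n) unitary matrix."""
    dim = 2 ** n
    idx = np.arange(dim)
    return np.exp(2j * np.pi * np.outer(idx, idx) / dim) / np.sqrt(dim)


def qft_circuit(n, min_angle=0.0):
    """Textbook qft circuit; phase rotations below min_angle are dropped.

    Returns the circuit unitary and the list of dropped angles.
    """
    dim = 2 ** n
    hadamard = np.array([[1, 1], [1, -1]], dtype=complex) / np.sqrt(2)
    # Axes 0..n-1 are the qubits (axis 0 = most significant bit);
    # the last axis labels the computational basis input state.
    psi = np.eye(dim, dtype=complex).reshape((2,) * n + (dim,))
    dropped = []
    for q in range(n):
        # Hadamard on qubit q.
        psi = np.moveaxis(np.tensordot(hadamard, psi, axes=([1], [q])), 0, q)
        for r in range(q + 1, n):
            angle = np.pi / 2 ** (r - q)
            if angle < min_angle:
                dropped.append(angle)
                continue
            # Controlled phase: multiply where qubits q and r are both 1.
            index = [slice(None)] * (n + 1)
            index[q] = index[r] = 1
            psi[tuple(index)] *= np.exp(1j * angle)
    # Final swaps: reverse the order of the qubits.
    psi = psi.transpose(list(range(n - 1, -1, -1)) + [n])
    return psi.reshape(dim, dim), dropped


n_qubits = 6                       # illustrative size only
exact = qft_exact(n_qubits)
full, _ = qft_circuit(n_qubits)
approx, dropped = qft_circuit(n_qubits, min_angle=np.pi / 8)  # hypothetical cut-off

error = np.linalg.norm(approx - exact, ord=2)        # spectral-norm deviation
bound = sum(2 * np.sin(a / 2) for a in dropped)      # sum of single-gate deviations
print(len(dropped), error, bound)
```

The bound follows from the fact that replacing a controlled phase of angle θ by the identity changes the unitary by 2 sin(θ/2) in spectral norm, and that such deviations add up at most linearly along a gate sequence.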

With these stipulations, we address the task of assembling an -qubit qft, exploiting the limits of current allowances on the hlrb-ii cluster. This translates into using -qubit cp-swap building blocks () and the -qubit qft () in the sense of a -qubit qft. Its duration is readily obtained as in Eqn. 25 thus giving an overall quality of


Based on this relation, the numerical results of Fig. 8 show that a cisc-compiled qft is moderately superior to the standard risc versions Saito et al. (2000); Blais (2001). Although the potential of cisc compilation amounts to , recursively assembling -qubit qfts and -qubit cp-swaps only exploits about half of it as apparent in the value of .
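How fidelity and duration combine into a single figure of merit can be sketched as follows. The snippet assumes that the fidelities of the individual modules multiply and that relaxation enters through one exponential decay over the total duration with a uniform relaxation time; the function name and all numbers are hypothetical and do not reproduce the values behind Fig. 8.

```python
import math


def overall_quality(module_fidelities, total_duration, relaxation_time):
    """Product of module fidelities damped by exp(-duration / relaxation time).

    This form is an assumption made for illustration; duration and
    relaxation time must be given in the same units.
    """
    fidelity = math.prod(module_fidelities)
    return fidelity * math.exp(-total_duration / relaxation_time)


# Hypothetical comparison: fewer, longer-range cisc modules versus many risc gates.
q_cisc = overall_quality([0.9999] * 5, total_duration=20.0, relaxation_time=1000.0)
q_risc = overall_quality([0.9999] * 60, total_duration=45.0, relaxation_time=1000.0)
print(q_cisc, q_risc)
```

With fidelities this close to one, the exponential factor dominates, which is the regime where shortening the total duration pays off most.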

As pointed out by Zeier (2007), the decomposition of a many-qubit qft into smaller qfts and concatenations of a permutation matrix and a diagonal matrix goes back to a principle already used in the Cooley-Tukey algorithm Cooley and Tukey (1965) for the discrete Fourier transform (dft): Let . Then one obtains Clausen and Baum (1993); Egner (1997)


where is a permutation matrix. Moreover, setting , the diagonal matrix takes the form


Therefore, the qft decompositions made use of here exactly follow the classical scheme in the second line of Eqn. 28, the expression corresponding to the cp-swap.
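The classical factorisation can be verified numerically. The sketch below uses one standard form of the Cooley-Tukey identity, namely a dft of size nm written as (dft of size n ⊗ identity) · twiddle diagonal · (identity ⊗ dft of size m) · stride permutation; the ordering of factors and the sign convention of the exponent are assumptions and may differ from the form given in Eqn. 28. All function names are hypothetical.

```python
import numpy as np


def dft_matrix(size):
    """Unitary dft matrix with entries exp(2*pi*i*j*k/size) / sqrt(size)."""
    idx = np.arange(size)
    return np.exp(2j * np.pi * np.outer(idx, idx) / size) / np.sqrt(size)


def stride_permutation(n, m):
    """Permutation sending input index k2*n + k1 to position k1*m + k2."""
    perm = np.zeros((n * m, n * m))
    for k1 in range(n):
        for k2 in range(m):
            perm[k1 * m + k2, k2 * n + k1] = 1.0
    return perm


def twiddle_diagonal(n, m):
    """Diagonal matrix with entry exp(2*pi*i*k1*j2/(n*m)) at index k1*m + j2."""
    k1 = np.arange(n)[:, None]
    j2 = np.arange(m)[None, :]
    return np.diag(np.exp(2j * np.pi * (k1 * j2) / (n * m)).ravel())


def cooley_tukey(n, m):
    """Assemble the dft of size n*m from dfts of size n and m."""
    return (
        np.kron(dft_matrix(n), np.eye(m))
        @ twiddle_diagonal(n, m)
        @ np.kron(np.eye(n), dft_matrix(m))
        @ stride_permutation(n, m)
    )


# Example: a 5-qubit transform (size 32) from a 2-qubit and a 3-qubit one.
print(np.allclose(cooley_tukey(4, 8), dft_matrix(32)))
```

The two Kronecker factors play the role of the smaller qfts acting on disjoint groups of qubits, while the diagonal and the permutation carry the controlled phases and the swaps.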

III The Multiple-Controlled NOT Gate (CNot)

Multiply-controlled cnot gates generalise Toffoli’s gate. Here, we move from cnot to cnot in an -qubit system with one ancilla and one target qubit. The ancilla qubit is included because it reduces the problem to linear complexity Barenco et al. (1995). Moreover, in view of realistic large systems, we assume again a topology of a linear chain coupled by nearest-neighbour Ising interactions. Since cnot-gates frequently occur in error-correction schemes, they are highly relevant in practice.

Here we address the task of decomposing a cnot into lower cnots and indirect swap gates (see Sec. I).

To this end, we will generalise the basic principle of reducing a cnot to cnot gates with that can be demonstrated by decomposing a cnot into Toffoli gates according to the scheme of Fig. 9 devised by Barenco et al. (1995). Starting with any of the computational basis states (where , denotes addition , and being the usual scalar product) track the effect of the gate sequentially from state through state

Figure 9: Decomposition of a -NOT with one ancilla qubit into four -NOTs (Toffoli gates) according to ref. Barenco et al. (1995). States through are explained in the text.

to see the overall effect of the gate sequence is a cnot thus proving the decomposition.
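The state tracking described above can be reproduced by brute force on all computational basis states. The sketch assumes three control qubits, one ancilla and one target, and an alternating sequence of four Toffoli gates in the spirit of Barenco et al. (1995); the qubit layout, the gate ordering and the function names are assumptions, since the circuit of Fig. 9 is not reproduced here.

```python
from itertools import product


def toffoli(bits, control_a, control_b, target):
    """Flip bits[target] if both control bits are 1; return a new tuple."""
    out = list(bits)
    out[target] ^= out[control_a] & out[control_b]
    return tuple(out)


def triple_controlled_not_via_toffolis(bits):
    """Four Toffoli gates acting on (c1, c2, c3, ancilla, target)."""
    c1, c2, c3, anc, tgt = range(5)
    state = toffoli(bits, c1, c2, anc)    # ancilla ^= c1*c2
    state = toffoli(state, c3, anc, tgt)  # target  ^= c3*(ancilla ^ c1*c2)
    state = toffoli(state, c1, c2, anc)   # ancilla restored
    state = toffoli(state, c3, anc, tgt)  # target  ^= c3*ancilla, net c1*c2*c3
    return state


def triple_controlled_not(bits):
    """Reference: flip the target iff all three controls are 1."""
    c1, c2, c3, anc, tgt = bits
    return (c1, c2, c3, anc, tgt ^ (c1 & c2 & c3))


decomposition_ok = all(
    triple_controlled_not_via_toffolis(b) == triple_controlled_not(b)
    for b in product((0, 1), repeat=5)
)
print(decomposition_ok)
```

The check runs over both initial values of the ancilla, so it also confirms that the ancilla need not be prepared in a definite state and is returned unchanged.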


Figure 10: Decomposition of a cnot gate on a linear coupling topology: (a) reduction of the number of control qubits to four intermediate gates with fewer control qubits and (b) decomposition of the intermediate multiply-controlled not-gate appearing in (a). In an -qubit system, there is one target qubit, one ancilla qubit and control qubits; so with and . Read the brackets in (a) as to be expanded times and in (b) as expanded -fold.
Figure 11: (Colour online) Comparing implementations of cnots on a linear Ising spin chain using cnot and swap modules for the risc compilation or multi-qubit building blocks according to the cisc assembler scheme of Fig. 10. As a short-hand, the different numbers of control qubits are expressed by . Using the expansion of Fig. 10, the cisc results (black solid lines) are obtained for with and (for odd ) or (for even ), while thus ensuring . The red dotted line extrapolates again the direct cisc results beyond 10 qubits. In (a) deviations from straight lines occur, as the cases follow special concatenation patterns (see text), while are generic. The inset in (a) also shows results of a non-scalable recursive expansion that is confined up to qubits (blue circles). The step functions with periods indicated by tags represent a faster alternative explained in the next section, where the boxed part of trace (a) is blown up in Fig. 15.

Fig. 10 provides a generalisation of the scheme in Fig. 9: in the first place (a), the number of control qubits is reduced by introducing blocks with qubits that are left invariant. The price for this reduction is a four-fold occurrence of the reduced building blocks. In the second step (b), the reduced building blocks are expanded into a sequence with two central cnots, two terminal cnots and two lots of times cnot each. For part (a) and (b) can be expanded in a general concatenated way thus entailing an overall duration of


For completeness, note that the cases have to be treated separately, since they only allow for less and less densely concatenated expansions (not shown). Their respective durations are


However, the total number of gates only depends on , so that one obtains as the overall quality


Given the duration of the decomposition as in Eqn. 30, it is easy to see that implementing the control qubits comes with the lowest time weight (4) and without a time overhead of auxiliary gates. Implementing the control qubits, however, requires the same time weight (4), but entails the time for one auxiliary swap gate. In order to implement the control qubits, in turn, a sizeable number of auxiliary swaps is needed.

Therefore, whenever high fidelities can be reached (so that the quality is limited by relaxation not by fidelity), a good strategy of combining the expansive decomposition in Fig. 10 a with the recursive decomposition in part (b) is the following: given control qubits and with the current limitation from direct cisc compilation being , choose to be the largest, to be the second largest and such that one obtains an even number for .

In the next step, a decision has to be made in order to minimise the contributions in the last two lines of Eqn. 30, whenever there are several integer solutions . So for integer , this amounts to the ordinary minimisation task