Quantum CISC Compilation by Optimal Control and Scalable Assembly of Complex Instruction Sets beyond Two-Qubit Gates
Abstract
We present a quantum cisc compiler and show how to assemble complex instruction sets in a scalable way. Enlarging the toolbox of universal gates by optimised complex multi-qubit instruction sets thus paves the way to fight relaxation in realistic experimental settings.
Compiling a quantum module into the machine code for steering a concrete quantum hardware device lends itself to being tackled by means of optimal quantum control. To this end, there are two opposite approaches: (i) one may use a decomposition into the restricted instruction set (risc) of universal one- and two-qubit gates and translate them into the machine code, or (ii) one may prefer to generate the entire target module as a complex instruction set (cisc) directly by evolution under drift and available controls. Here we advocate direct compilation up to the limit of system size a classical high-performance parallel computer cluster can reasonably handle. For going beyond these limits, i.e. for large systems, we propose a combined way, namely (iii) to make recursive use of medium-sized building blocks generated by optimal control in the sense of a quantum cisc compiler.
The advantage of the method over standard risc compilations into one- and two-qubit universal gates is explored on the parallel cluster hlrbii (with a total linpack performance of TFlops/s) for the quantum Fourier transform, the indirect swap gate as well as for multiply-controlled not gates. Implications for upper limits to time complexities are also derived.
pacs:
03.67.-a, 03.67.Lx, 03.65.Yz, 03.67.Pp; 82.56.-b, 82.56.Jn, 82.56.Dj, 82.56.Fk
Introduction
Richard Feynman’s seminal conjecture of using experimentally controllable quantum systems to perform computational tasks Feynman (1982, 1996) is rooted in reducing the complexity of the problem when moving from a classical setting to a quantum setting. The most prominent pioneering example is Shor’s quantum algorithm of prime factorisation Shor (1994, 1997), which is of polynomial complexity (bqp) on quantum devices instead of showing nonpolynomial complexity on classical ones Papadimitriou (1995). It is an example of a class of quantum algorithms Jozsa (1998); Cleve et al. (1998) that solve hidden subgroup problems in an efficient way Ettinger et al. (2004), where in the Abelian case, the speedup hinges on the quantum Fourier transform (qft). Whereas the network complexity of the fast Fourier transform (fft) for $n$ classical bits is of order $O(n\,2^n)$ Cooley and Tukey (1965); Beth (1984), the qft for $n$ qubits shows a complexity of order $O(n^2)$. Moreover, Feynman’s second observation that quantum systems may be used to efficiently predict the behaviour of other quantum systems has inaugurated a branch of research dedicated to Hamiltonian simulation Lloyd (1996); Abrams and Lloyd (1997); Zalka (1998); Bennett et al. (2002); Masanes et al. (2002); Jané et al. (2003).
For implementing a quantum algorithm in an experimental setup, local operations and universal two-qubit quantum gates are required as a minimal set ensuring every unitary module can be realised Deutsch (1985). More recently, it turned out that generic qubit and qudit pair interaction Hamiltonians suffice to complement local actions to universal controls Dodd et al. (2002); Bremner et al. (2005). Common sets of quantum computational instructions comprise (i) local operations such as the Hadamard gate, the phase gate and (ii) the entangling operations cnot, controlled-phase gates, , swap as well as (iii) the swap operation. The number of elementary gates required for implementing a quantum module then gives the network or gate complexity.
As is well known, a generic qubit operation requires exponentially many two-qubit gates to be implemented exactly Barenco et al. (1995); Knill (1995), the complexity being . Yet, as has been pointed out by Barenco et al., many quantum computationally pertinent gates can be decomposed into a number of one- and two-qubit gates increasing linearly with the number of qubits. At the expense of a single ancilla qubit this also holds for multiply controlled unitary gates Barenco et al. (1995) tantamount to error correction. For an overview, see e.g. Nielsen and Chuang (2000); Kitaev et al. (2002); Mermin (2007). Moreover, Blais Blais (2001) showed how to implement the QFT with linear gate complexity. Later, Solovay (Solovay (1995) quoted in Kitaev (1997) and Nielsen and Chuang (2000)) and then Kitaev addressed the problem of approximating arbitrary unitary gates by polynomially long 2-qubit gate sequences up to a given precision Kitaev (1997); Kitaev et al. (2002). More recently the bounds of approximating an arbitrary unitary were taken down to a polynomial of sixth order in the number of qubits and of third order in the geodesic distance of the unitary to unity Nielsen et al. (2006). Differential geometric aspects in terms of Finsler metrics have been raised in Nielsen (2006).
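[Figure 1: Compilation in classical computation (program or module translated into machine code) compared with quantum computation (quantum algorithm or unitary module translated into the machine code of quantum evolutions under drift and controls).]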
However, gate complexity often translates into too coarse an estimate for the actual time required to implement a quantum module (see e.g. Vidal et al. (2002); Childs et al. (2003); Zeier et al. (2004)), in particular, if the time scales of a specific experimental setting have to be matched. Instead, effort has been taken to give upper bounds on the actual time complexity Wocjan et al. (2002), e.g., by way of numerical optimal control Schulte-Herbrüggen et al. (2005).
Interestingly, in terms of quantum control theory, the existence of universal gates is equivalent to the statement that the quantum system is fully controllable, as was first pointed out in Ref. Ramakrishna and Rabitz (1995). This is, e.g., the case in systems of spin qubits that form Ising-type weak-coupling topologies described by arbitrary connected graphs Schulte-Herbrüggen (1998); Glaser et al. (1998). Therefore the usual approach to quantum compilation in terms of local plus universal two-qubit operations Tucci (1999); Williams (2004); Shende et al. (2006); Svore et al. (2006); Tucci (2007) lends itself to be complemented by optimal-control based direct compilation into machine code: it may be seen as a technology-dependent optimiser in the sense of Ref. Svore et al. (2006), however, tailored to deal with more complex instruction sets than the usual local plus two-qubit building blocks. Not only is it adapted to the specific experimental setting, it also allows for fighting relaxation by either being near time-optimal or by exploiting relaxation-protected subspaces Schulte-Herbrüggen et al. (2006). Devising quantum compilation methods for optimised realisations of given quantum algorithms by admissible controls is therefore an issue of considerable practical interest. Here it is the goal to show how quantum compilation can favourably be accomplished by optimal control: the building blocks for gate synthesis will be extended from the usual set of restricted local plus universal two-qubit gates to a larger toolbox of scalable multi-qubit gates tailored to yield high fidelity in short time given concrete experimental settings.
Quantum Compilation as an Optimal Control Task
As shown in Fig. 1, the quantum compilation task can be addressed following different principal guidelines: (1) by the standard decomposition into local operations and universal two-qubit gates, which by analogy to classical computation was termed reduced instruction set quantum computation (risc) Sanders et al. (1999) or (2) by using direct compilation into one single complex instruction set (cisc) Sanders et al. (1999). The existence of such a single effective gate is guaranteed simply by the unitaries forming a group: a sequence of local plus universal gates is a product of unitaries and thus a single unitary itself.
As a consequence, cisc quantum compilation lends itself for resorting to numerical optimal control (on clusters of classical computers) for translating the unitary target module directly into the ‘machine code’ of evolutions of the quantum system under combinations of the drift Hamiltonian and experimentally available controls .
In a number of studies on quantum systems up to 10 qubits, we have shown that direct compilation by gradient-assisted optimal control Khaneja et al. (2005); Schulte-Herbrüggen et al. (2005); Spörl et al. (2007) allows for substantial speedups, e.g., by a factor of for a cnot and a factor of for a Toffoli gate on coupled Josephson qubits Spörl et al. (2007). However, the direct approach naturally faces the limits of computing quantum systems on classical devices: upon parallelising our C++ code for high-performance clusters Gradl et al. (2006), we found that extending the quantum system by one qubit increases the cpu time required for direct compilation into the quantum machine code of controls by roughly a factor of eight. So the classical complexity for optimal-control based quantum compilation is np.
Therefore, here we advocate a third approach (3) that uses direct compilation into multi-qubit complex instruction sets up to the cpu-time limits of optimal quantum control on classical computers: these building blocks are designed so as to allow for recursive scalable quantum compilation in large quantum systems (i.e. those beyond classical computability). In particular, the complex instruction sets may be optimised so as to fight relaxation by being near time-optimal, or, moreover, they may be devised so as to fight the specific imperfections of an experimental setting.
Controllability
Before turning to optimalcontrol based cisc quantum compilation in more detail, it is important to ensure the quantum control system characterised by is in fact fully controllable.
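Hamiltonian quantum dynamics following Schrödinger’s equation for the unitary image of a complete basis set of ‘state vectors’ representing a quantum gate
(1) $|\dot\psi(t)\rangle = -iH\,|\psi(t)\rangle$
(2) $\dot U(t) = -iH\,U(t)$
resembles the setting of a standard bilinear control system with state $X(t)$, drift $A$, controls $B_j$, and control amplitudes $u_j$ reading
(3) $\dot X(t) = \big(A + \sum_{j=1}^m u_j(t)\,B_j\big)\,X(t)$
where $X(t)\in GL_N(\mathbb{C})$ and $A, B_j \in \mathrm{Mat}_N(\mathbb{C})$. Clearly in the dynamics of closed quantum systems, the system Hamiltonian $H_d$ is the drift term, whereas the $H_j$ are the control Hamiltonians with $u_j(t)$ as control amplitudes, so that $H = H_d + \sum_j u_j(t)\,H_j$. In systems of $n$ qubits, $X(t)\in SU(N)$, $N := 2^n$, and $iH_d, iH_j \in \mathfrak{su}(N)$.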
A system is fully operator controllable, if to every initial state the entire unitary orbit can be reached. With density operators being Hermitian this means any final state can be reached from any initial state as long as both of them share the same spectrum of eigenvalues.
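As established in Jurdjevic and Sussmann (1972), the bilinear system of Eqn. 2 is fully controllable if and only if the drift and controls are a generating set of $\mathfrak{su}(N)$ by way of the commutator, i.e., the Lie algebra generated by the drift and control Hamiltonians (times $i$) is all of $\mathfrak{su}(N)$, where $N = 2^n$ for $n$ qubits.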
Example 1 Consider a system of $n$ weakly coupled spin qubits. Let $\sigma_x$, $\sigma_y$, $\sigma_z$ be the Pauli matrices. In $n$ spins, a $\sigma_{kx}$ for spin $k$ is tacitly embedded as $\mathbb{1}\otimes\cdots\otimes\mathbb{1}\otimes\sigma_x\otimes\mathbb{1}\otimes\cdots\otimes\mathbb{1}$ where $\sigma_x$ is at position $k$. The same holds for $\sigma_{ky}$, $\sigma_{kz}$, and in the weak coupling terms $\sigma_{kz}\sigma_{lz}$ with $1\le k<l\le n$.
Now a system of $n$ qubits is fully controllable Schulte-Herbrüggen (1998), if e.g. the control Hamiltonians comprise the Pauli matrices on every single qubit selectively and the drift Hamiltonian encompasses the Ising pair interactions $\sigma_{kz}\sigma_{lz}$ with coupling constants $J_{kl}$, where the coupling topology of $J_{kl}\neq 0$ may take the form of any connected graph. This theorem has meanwhile been generalised to other coupling types Schulte-Herbrüggen et al. (2002); Albertini and D’Alessandro (2002).
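The controllability condition of Example 1 can be checked mechanically for short chains. The following is a minimal sketch, not the code used in the paper: it assumes local x and y controls on every qubit and nearest-neighbour zz couplings, encodes every generator as a Pauli string (an encoding chosen here purely for illustration, as are all function names), and closes the set under commutators. Since the commutator of two Pauli strings is either zero or proportional to a single Pauli string, the number of strings reached equals the dimension of the generated Lie algebra, which should be $4^n-1$ for a fully controllable $n$-qubit system.

```python
def pauli_string(ops, n):
    """Build an n-site Pauli string from a dict {site: 'X' | 'Y' | 'Z'}."""
    return "".join(ops.get(k, "I") for k in range(n))


def multiply_site(a, b):
    """Product of two single-site Pauli labels, ignoring the phase factor."""
    if a == "I":
        return b
    if b == "I":
        return a
    if a == b:
        return "I"
    return ({"X", "Y", "Z"} - {a, b}).pop()


def commutator(p, q):
    """Pauli string proportional to [p, q], or None if p and q commute."""
    # Two Pauli strings anticommute iff they differ on an odd number of
    # sites where both act non-trivially.
    clashes = sum(1 for a, b in zip(p, q) if "I" not in (a, b) and a != b)
    if clashes % 2 == 0:
        return None
    return "".join(multiply_site(a, b) for a, b in zip(p, q))


def lie_closure(generators):
    """Set of Pauli strings spanning the Lie algebra generated by the input."""
    algebra = set(generators)
    frontier = list(algebra)
    while frontier:
        p = frontier.pop()
        for q in list(algebra):
            r = commutator(p, q)
            if r is not None and r not in algebra:
                algebra.add(r)
                frontier.append(r)
    return algebra


def local_controls(n):
    """Assumed controls: x and y Pauli operators on every single qubit."""
    return [pauli_string({k: axis}, n) for k in range(n) for axis in "XY"]


def ising_chain_generators(n):
    """Assumed drift terms: zz couplings between nearest neighbours."""
    couplings = [pauli_string({k: "Z", k + 1: "Z"}, n) for k in range(n - 1)]
    return local_controls(n) + couplings


if __name__ == "__main__":
    for n in (2, 3, 4):
        dim = len(lie_closure(ising_chain_generators(n)))
        print(n, dim, dim == 4**n - 1)
```

Without the coupling terms the closure only contains the $3n$ local Pauli operators, which illustrates why a connected coupling graph is needed.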
In view of the compilation task in quantum computation we get the following synopsis:
Corollary 1
The following are equivalent:
(1) in a quantum system of $n$ coupled spins, the drift and the controls form a generating set of $\mathfrak{su}(2^n)$;
(2) the quantum system is operator controllable (in the sense of Ref. Albertini and D’Alessandro (2003));
(3) every unitary transformation can be realised by that system;
(4) there is a set of universal quantum gates for the quantum system.
Proof: The equivalence of (1) and (2) relies on the unitary group being a compact connected Lie group: compact connected Lie groups have no closed subsemigroups that are not groups themselves Jurdjevic and Sussmann (1972). Moreover, in compact connected Lie groups the exponential mapping is surjective, hence (1) $\Leftrightarrow$ (3). Assertions (3) and (4) just re-express the same fact in different terminology.
Scope and Organisation of the Paper
The purpose of this paper is to show that optimal control theory can be put to good use for devising multi-qubit building blocks designed for scalable quantum computing in realistic settings. Note these building blocks are no longer meant to be universal in the practical sense that any arbitrary quantum module should be built from them (plus local controls). Rather they provide specialised sets of complex instructions tailored for breaking down typical tasks in quantum computation with substantial speed gains compared to the standard compilation by decomposition into one-qubit and two-qubit gates. Thus a cisc quantum compiler translates into significant progress in fighting relaxation.
For demonstrating quantum cisc compilation and scalable assembly, in this paper we choose systems with linear coupling topology, i.e., qubit chains coupled by nearest-neighbour Ising interactions. The paper is organised as follows: cisc quantum compilation by optimal control will be illustrated in three different, yet typical examples:
(1) the indirect swap gate,
(2) the quantum Fourier transform (qft),
(3) the generalisation of the cnot and Toffoli gate to multiply-controlled not gates, C$^n$not.
For every instance of qubit systems, we analyse the effects of (i) sacrificing universality by going to special instruction sets tailored to the problem, (ii) extending pair interaction gates to effective multi-qubit interaction gates, and (iii) we compare the time gain by recursive qubit cisc compilation () to the two limiting cases of the standard risc approach () on one hand and the (extrapolated) time complexity inferred from single-cisc compilation (with ) on the other.
Preliminaries
Time Standards
When comparing times to implement unitary target gates by the risc vs the cisc approach, we will assume for simplicity that local unitary operations are ‘infinitely’ fast compared to the duration of the Ising coupling evolution so that the total gate time is solely determined by the coupling evolutions unless stated otherwise. Let us emphasise, however, this stipulation only concerns the time standards. The optimal-control assisted cisc compilation methods presented here are in no way limited to fast local controls. In particular, also the assembler step of concatenating the cisc building blocks is independent of the ratio of times for local operations vs coupling interactions.
Overview on Gate and Time Complexities
For practical purposes, the complexity of a unitary quantum operation can be expressed in terms of two measures: the gate complexity counts the number of universal one- and two-qubit gates for exactly implementing the target operation in a circuit. Moreover, in view of fighting relaxation, we will estimate the time complexity in terms of consecutive time slots with simultaneous qubit modules required.
In order not to raise false expectations, upon changing from universal 2-qubit decompositions (risc) to qubit cisc implementations the gate complexity for exact implementation of a generic qubit unitary operation clearly remains np: it requires ‘exponentially many’ qubit modules or qubit modules () alike, yet a cut from the order of roughly necessary qubit modules down to some qubit modules (with up to ) is substantial and particularly valuable in few-qubit systems. More elaborate estimates will be given shortly. Likewise, also in target modules with linear 2-qubit risc complexity, qubit cisc complexity remains linear, yet when translated into time complexity it may entail sizeable speedups: we will show examples where they allow for accelerations by more than a factor of .
To be more precise, a lower bound for the number of two-qubit gates necessary to exactly implement a generic $n$-qubit unitary target module was given by Barenco et al. Barenco et al. (1995). Their parameter-counting argument is based on a gem, which deserves to be picked up for generalising it to realisations by $m$-qubit modules as illustrated in Fig. 2. The key is that only in the first time slot the number of parameters directly relates to the unitary group, while from the second slot onwards the parameters have to be counted in terms of cosets of the form $SU(2^m)/\big(SU(2^{m_1})\otimes SU(2^{m_2})\big)$, if the $m$-qubit module has overlaps of $m_1$ qubits and $m_2$ qubits with the two adjacent modules in the time slot before. The number of real parameters (denoted by $N$ for short) in the respective basic building blocks amounts to
(4) $N_{SU(2^m)} = 4^m - 1$
(5) $N_{SU(2^m)/SU(2^{m_1})} = 4^m - 4^{m_1}$
(6) $N_{SU(2^m)/(SU(2^{m_1})\otimes SU(2^{m_2}))} = 4^m - 4^{m_1} - 4^{m_2} + 1$
Table 1: Gate and time complexities of a generic qubit operation, listing the number of 2-qubit gates, the number of 2-qubit time slots, the number of 10-qubit gates, and the number of 10-qubit time slots.
With these stipulations one may readily determine the number of qubit gates in a unitary network of the type of Fig. 2 a, where is integer, so as to exhaust the number of parameters of a generic qubit target gate to be implemented. In the first time slot there are parallel qubit gates (counting by the number of parameters in the group according to Eqn. 4), in the second time slot there are parallel qubit gates. They contribute the number of parameters of the coset (Eqn. 6), where one is forced to choose for even and for odd in order to be efficient. Following the same Margolus pattern one adds as many qubit gates (counting cosets) as required to exceed parameters. Using Gauss’ brackets one thus obtains the number of qubit gates needed to implement a generic qubit target gate
(7) 
and the respective number of time slots by
(8) 
For even with and Eqn. 7 specialises to reproduce the result of Ref. Barenco et al. (1995), i.e. .
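The parameter-counting argument can be sketched numerically. The snippet below is an illustration under stated assumptions rather than a transcription of Eqn. 7: it assumes an even module size `m`, a qubit number `n` that is a multiple of `m`, and overlaps of `m/2` qubits with each neighbouring module, so that every gate after the first time slot contributes the coset dimension $4^m - 2\cdot 4^{m/2} + 1$. The function name `min_gates` is hypothetical. For `m = 2` the count reduces to the familiar $\lceil (4^n - 3n - 1)/9 \rceil$ for two-qubit gates.

```python
def min_gates(n, m):
    """Parameter-counting lower bound on the number of m-qubit gates needed
    for a generic n-qubit unitary (assumes m even, n a multiple of m,
    overlaps of m/2 qubits with both neighbouring modules)."""
    if m % 2 != 0 or n % m != 0:
        raise ValueError("sketch only covers even m and n divisible by m")
    target = 4**n - 1                      # real parameters of SU(2^n)
    first_slot_gates = n // m              # parallel gates in the first slot
    covered = first_slot_gates * (4**m - 1)
    per_coset_gate = 4**m - 2 * 4**(m // 2) + 1
    remaining = max(0, target - covered)
    extra_gates = -(-remaining // per_coset_gate)  # ceiling division
    return first_slot_gates + extra_gates


if __name__ == "__main__":
    for n in (2, 4, 6, 8):
        print(n, min_gates(n, 2))
```

Comparing `min_gates(n, 2)` with `min_gates(n, 10)` for `n` a multiple of 10 gives a feeling for how strongly the gate count of a generic unitary drops when larger modules are admitted.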
Next, consider Fig. 2 b and its Margolus pattern with one overhead of qubits to be taken into account by Eqn. 5. Then the same arguments give
(9) 
(10) 
Finally, for a pattern with two such overheads as in Fig. 2 c, where , one likewise finds
(11) 
(12) 
With efficient implementations requiring to be closest to (vide supra), three overheads do not occur.
Since , one may use with the most efficient setting of for even or for odd as a lower bound for the number of unitary qubit modules necessary to exactly implement an arbitrary generic qubit target unitary.
In the limit of large , one thus obtains the bounds on gate complexities and so . Likewise the limiting time complexities and give a speedup potential of in units of the ratio of singlegate times in the respective experimental setting. These limiting speedup ratios are nearly reached already for , as the numbers given in Tab. 1 show. In this sense, accelerations may be taken as roughly constant over the entire range of interest.
Although in generic qubit unitaries, the cisc speedup may appear overwhelming, quantum algorithms usually resort by construction to highly non-generic unitary building blocks, many of which have linear complexities Barenco et al. (1995). However, in these seemingly less rewarding yet practically relevant cases cisc compilation will turn out to be highly advantageous as demonstrated in three worked examples in the current study. Since generic and thus highly entangled states have recently turned out to be computationally of modest use Gross et al. (2008), recasting the above analysis in terms of designs and designs Dankert et al. (2006); Gross et al. (2007); Ambainis and Emerson (2007) and following concentrations of measure will give a more realistic estimate, which is part of a different project.
Error Propagation and Relaxative Losses
As the main figure of merit we refer to a quality function
(13) 
resulting from the fidelity and the relaxative decay with overall relaxation rate constant during a duration assuming independence of fidelity and decay. Moreover, for qubits one defines as the trace fidelity of an experimental unitary module with respect to the target gate the quantity
(14) 
where both with . It follows via the simple relation to the Euclidean distance
the latter two identities invoking unitarity of . The reason for choosing the trace fidelity is its convenient Fréchet differentiability in view of gradient-flow techniques, see also Ref. Schulte-Herbrüggen et al. (2008a).
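The link between trace fidelity and Euclidean distance can be illustrated with a small numerical check. This is a sketch under the assumption that the trace fidelity is defined as $F = \frac{1}{N}\,\mathrm{Re}\,\mathrm{tr}\{U_{\rm target}^\dagger U_{\rm exp}\}$, in which case unitarity gives $\|U_{\rm target}-U_{\rm exp}\|_2^2 = 2N\,(1-F)$ for the Frobenius norm. The helper names and the choice of single-qubit x rotations as test gates are made here for illustration only.

```python
import math


def rotation_x(theta):
    """Single-qubit rotation exp(-i * theta * sigma_x / 2) as a nested list."""
    c, s = math.cos(theta / 2), math.sin(theta / 2)
    return [[complex(c), -1j * s], [-1j * s, complex(c)]]


def trace_fidelity(u_target, u_exp):
    """Assumed definition: Re tr(U_target^dagger U_exp) / N."""
    dim = len(u_target)
    overlap = sum(
        u_target[i][j].conjugate() * u_exp[i][j]
        for i in range(dim)
        for j in range(dim)
    )
    return overlap.real / dim


def squared_distance(u, v):
    """Squared Frobenius (Euclidean) distance between two matrices."""
    dim = len(u)
    return sum(abs(u[i][j] - v[i][j]) ** 2 for i in range(dim) for j in range(dim))


if __name__ == "__main__":
    u, v = rotation_x(0.3), rotation_x(1.1)
    fidelity = trace_fidelity(u, v)
    print(squared_distance(u, v), 2 * len(u) * (1 - fidelity))
```

Both printed numbers agree up to rounding, so maximising the trace fidelity is the same as minimising the Euclidean distance to the target.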
Consider an qubit-interaction module (cisc) with quality that decomposes into universal two-qubit gates (risc), out of which gates have to be performed sequentially. Moreover, each 2-qubit gate shall be carried out with the uniform quality . Henceforth we assume for simplicity equal relaxation rate constants, so are identified with . Then, as a first useful rule of thumb and assuming independent error propagation, it is advantageous to compile the qubit module directly if . Or more precisely taking relaxation into account, if the module can be realised with a fidelity
(15) 
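The rule of thumb can be turned into a small decision helper. The sketch below assumes that the quality is the product of a fidelity and an exponential decay factor, that errors of the individual two-qubit gates propagate independently (so their fidelities multiply), and that relaxation is described by a single decay time `relaxation_time`; all parameter names and example numbers are hypothetical and not taken from the paper.

```python
import math


def quality(fidelity, duration, relaxation_time):
    """Assumed quality measure: fidelity times exponential relaxative decay."""
    return fidelity * math.exp(-duration / relaxation_time)


def prefer_direct_compilation(
    f_cisc, tau_cisc, f_gate, n_gates, n_sequential, tau_gate, relaxation_time
):
    """True if one directly compiled module beats its two-qubit decomposition.

    n_gates      : total number of two-qubit gates in the decomposition
    n_sequential : how many of them have to run one after the other
    """
    q_cisc = quality(f_cisc, tau_cisc, relaxation_time)
    # Independent error propagation: fidelities of all gates multiply,
    # while only the sequential gates add to the total duration.
    q_risc = quality(f_gate**n_gates, n_sequential * tau_gate, relaxation_time)
    return q_cisc > q_risc


if __name__ == "__main__":
    print(prefer_direct_compilation(0.95, 5.0, 0.99, 10, 10, 1.0, 100.0))
    print(prefer_direct_compilation(0.80, 5.0, 0.99, 10, 10, 1.0, 100.0))
```

In the first call the single module wins even though its fidelity is lower than that of any individual two-qubit gate, because ten gate errors and the longer duration accumulate in the decomposition.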
A more refined picture emerges from Monte Carlo simulations of error propagation. To this end, compare the above independent error estimates with two scenarios for a sequence of gates in total: (i) the fold repetition of single unitary gates with individual errors meant to give with and (ii) the repetition of a sequence of four different gates again each with individual errors to give where . In the sequel, we refer to case (i) as AAAA and to case (ii) as ABCD.
For gates and errors to be generic, we use random unitaries (distributed according to the Haar measure following a recent modification Mezzadri (2007) of the qr algorithm). To a given random unitary qubit gate (defining its Hamiltonian via ) we simulate a generic error as follows: from another independent unitary take the matrix logarithm such that . Then to a given trace fidelity , a corresponding unitary with a Monte Carlo random error (the error being introduced on the level of the Hamiltonian generators) can readily be obtained by solving
(16) 
for . Along these lines one obtains the Monte Carlo fidelities for repeating the gate by
(17) 
and
(18) 
where the product runs from right to left. These Monte Carlo simulations are compared to the simple model of independent errors according to
(19) 
As shown in Fig. 3 a, for two-qubit gates the error propagates with a vast variance, which makes it virtually unpredictable. Thus assuming independence is always too optimistic for AAAA, while for ABCD it is still mostly optimistic, although there are cases in which the errors may compensate to give less effective loss than expected under independence.
However, when moving to effective multi-qubit gates, i.e., cisc modules, the generic situation becomes more predictable. For example, in qubit random unitary gates, Fig. 3 b shows that AAAA deviates significantly from independent error propagation, whereas ABCD resembles independent error propagation almost perfectly. The situation is qualitatively the same even if the single gate error is larger, as tested by analogous Monte Carlo simulations setting or (not shown).
In the sequel, we will, for the sake of simplicity, often assume independent error propagation at the expense of systematically underestimating the pros of cisc compilation compared to the standard risc compilation into universal local and two-qubit gates.
Computational Methods and Devices
Following the lines of our previous work on time complexity Schulte-Herbrüggen et al. (2005), we used the grape algorithm Khaneja et al. (2005) for direct cisc compilation. It tracks the fixed final times down to the shortest durations of controls still allowing for synthesising the unitary target gates with full fidelity. This currently gives the best known upper bounds to the minimal times required to realise a target module on a concrete hardware setting. We extended our parallelised C++ code of the grape package described in Gradl et al. (2006) by adding more flexibility, allowing it to efficiently exploit available parallel nodes independent of internal parameters Schulte-Herbrüggen et al. (2008b). Moreover, faster algorithms for matrix exponentials on high-dimensional systems based on approximations by Tchebychev series have been developed Waldherr (2007) specifically in view of application to large quantum systems Schulte-Herbrüggen et al. (2008b). Thus computations could be performed on the hlrbii supercomputer cluster at Leibniz Rechenzentrum of the Bavarian Academy of Sciences Munich. It provides an sgi Altix 4700 platform equipped with 9728 Intel Itanium2 Montecito Dual Core processors with a clock rate of GHz, which give a total linpack performance of TFlops/s. The present explorative study exploited the time allowance of approx. cpu hours.
I The SWAP Operation
The easiest and most basic examples to illustrate the pertinent effects of optimal-control based cisc quantum compilation are the respective indirect swap gates in spin chains of qubits coupled by nearest-neighbour Ising interactions with denoting the coupling constant.
For the swap unit there is a standard textbook decomposition into three cnots. Thus for Ising-coupled systems and in the limit of fast local controls, the total time required for an swap is , and there is no faster implementation Khaneja et al. (2001, 2002). Note, however, that in systems coupled by the isotropic Heisenberg interaction , the swap may be directly implemented just by letting the system evolve for a time of only . Sacrificing universality, it may thus be advantageous to regard the swap as basic unit for the swap task rather than the universal cnot. Clearly, any even-order swap can be built from swaps along the lines of the most obvious scheme of Fig. 4 a. (The odd-order swaps follow, e.g., from swap by omitting qubit and all the gates connected to it.)
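How an indirect swap between the two ends of a chain can be assembled from nearest-neighbour swaps is sketched below. This is one obvious sequential scheme chosen for illustration; it is not claimed to coincide with the (parallelised) scheme of Fig. 4 a, and the function names are hypothetical. The end qubit is moved along the chain with $n-1$ neighbour swaps and the displaced qubits are moved back with $n-2$ further swaps, so that only the two end qubits are exchanged.

```python
def indirect_swap_sequence(n):
    """Nearest-neighbour swaps (0-based qubit pairs) exchanging qubits 0 and n-1."""
    forward = [(k, k + 1) for k in range(n - 1)]
    backward = forward[-2::-1]  # all but the last swap, in reverse order
    return forward + backward


def apply_swaps(n, sequence):
    """Track where the qubit labels end up after applying the swaps."""
    labels = list(range(n))
    for a, b in sequence:
        labels[a], labels[b] = labels[b], labels[a]
    return labels


if __name__ == "__main__":
    for n in (2, 3, 4, 6):
        sequence = indirect_swap_sequence(n)
        print(n, len(sequence), apply_swaps(n, sequence))
```

The sequence has length $2n-3$, which makes plain why precompiled building blocks acting on more than two qubits at a time can shorten the overall duration.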
Moreover, the generalisation to decomposing a swap into a sequence with different swap building blocks (where ) as shown in Fig. 4 b is straightforward by ensuring . Due to its symmetry, the total duration then amounts to
(20) 
and the overall quality as a function of the fidelities of the constituent gates reads
(21) 
Now, the swap building blocks themselves can be precompiled into time-optimised single complex instruction sets by exploiting the grape algorithm of optimal control up to the current limits of imposed by cpu-time allowance.
Proceeding in the next step to large , Fig. 5 underscores how the time required for swap gates decreases significantly by assembling precompiled swap building blocks as cisc units recursively up to a multi-qubit interaction size of , where the speedup is by a factor of nearly . Clearly, such a set of swap building blocks with allows for efficiently synthesising any swap. Assuming for the moment that a linear time complexity of the swap can be taken for granted, one may extrapolate the results of direct cisc compilation from the range of the inset of Fig. 5 a to a large number of qubits. One thus obtains an estimated upper limit to the time complexity of the swap. This is indicated by the dashed line, the slope of which will be defined as . Likewise, the respective slopes of the qubit decomposition are denoted by .
With these stipulations, we introduce as a measure for the potential of cisc compilation (versus risc compilation) the ratio of the slopes
(22) 
and as a measure for the extent to which this potential has been exhausted by qubit cisc compilation the ratio
(23) 
thus providing as convenient measure of improvement
(24) 
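The three slope-based measures can be sketched as follows. The snippet assumes that the potential is the ratio of the risc slope to the extrapolated single-cisc slope, that the exhausted fraction is the ratio of the extrapolated slope to the slope reached with the chosen block size, and that the improvement is the product of the two, i.e. the ratio of the risc slope to the slope actually reached. These definitions, the function names, and the example numbers are assumptions made for illustration; slopes are obtained here from an ordinary least-squares fit of duration against qubit number.

```python
def least_squares_slope(points):
    """Slope of the least-squares line through (qubit number, duration) pairs."""
    count = len(points)
    mean_x = sum(x for x, _ in points) / count
    mean_y = sum(y for _, y in points) / count
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in points)
    denominator = sum((x - mean_x) ** 2 for x, _ in points)
    return numerator / denominator


def cisc_measures(slope_risc, slope_block, slope_limit):
    """Return (potential, exhausted fraction, improvement) under the
    assumed definitions described in the lead-in."""
    potential = slope_risc / slope_limit
    exhausted = slope_limit / slope_block
    improvement = potential * exhausted  # equals slope_risc / slope_block
    return potential, exhausted, improvement


if __name__ == "__main__":
    made_up_durations = [(2, 3.0), (4, 6.0), (6, 9.0)]  # illustrative data only
    print(least_squares_slope(made_up_durations))
    print(cisc_measures(3.0, 2.0, 1.5))
```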
The data of Fig. 5 thus give a potential of ; by qubit interactions it is already pretty well exhausted, as inferred from . The current cisc over risc improvement then amounts to .
On the other hand, deducing from Fig. 5 right away that the time complexity of swaps ought to be linear would be premature: although the slopes seem to converge to a nonzero limit, numerical optimal control may become systematically inefficient for larger interaction sizes . Therefore, although improbable, e.g., convergence of the slopes to a value of zero cannot be safely excluded on the current basis of findings. This also means a logarithmic time complexity can ultimately not be excluded either.
Summarising the results for the indirect swaps in terms of the three criteria described in the introduction, we have the following: (i) in Ising-coupled qubit chains, there is no speedup by changing the basic unit from the universal cnot into a swap, whereas in isotropically coupled systems the speedup amounts to a factor of three; (ii) extending the building blocks of swap from (risc) to (cisc) gives a speedup by a factor of nearly two under Ising-type couplings; (iii) the numerical data are consistent with a time complexity converging to a linear limit for the swap task in Ising chains, however, there is no proof for it yet.
II The Quantum Fourier Transform (QFT)
Since many quantum algorithms take advantage of efficiently solving some hidden subgroup problem, the quantum Fourier transform plays a central role Jozsa (1998); Cleve et al. (1998); Ettinger et al. (2004).
In order to realise a qft on large qubit systems, our approach is the following: given an qubit qft, we show that for obtaining a qubit qft by recursively using multiqubit building blocks, a second type of module is required, to wit a combination of controlled phase gates and swaps, which henceforth we dub qubit cpswap for short.
Here we present two alternatives: variant I with and, as a special case, variant II for even .
Choosing and for a start, the recursive construction is illustrated in Fig. 6. The top trace shows the standard textbook realisation of a qubit qft. By shifting the final swap operations, it can be rearranged into the sequence of gates depicted in the lower trace. Note that the gates appearing in solid boxes constitute a qubit qft (which itself is made of two qubit qfts and a central qubit cpswap), while the ones in dashed boxes have to be added for a qubit qft. For we have thus shown how a qubit qft reduces to a qubit qft, two qubit cpswaps, and an qubit qft. So with providing a foundation, at the same time we have also illustrated the induction from a qft to a qft. Moreover, the same construction principle holds for any block size , which can readily be proven by a straightforward, but lengthy induction from to .
One thus arrives at the desired block decomposition of a general qubit qft as shown in Fig. 7 (which is variant I; the less effective variant II can be found in Appendix A): it requires times the same qubit qft interspersed with times an qubit cpswap, out of which show different phase-rotation angles. For all and , one finds the following observations:
(1) a cpswap takes at least as long as a qft;
(2) a qft takes at least as long as a cpswap;
(3) a cpswap takes at least as long as a cpswap.
Thus the duration of a qubit qft built from qubit and qubit modules amounts to
(25) 
Next, consider the overall quality of a qubit qft in terms of its two types of building blocks, namely the basic qubit qft as well as the constituent qubit cpswaps with their respective different rotation angles. It reads
(26) 
In the following, we will neglect rotations as soon as their angle falls short of a threshold of . This approximation is safe since it is based on a calculation of a qubit qft, where the truncation does not introduce any relative error beyond . According to the block decomposition of Fig. 7, thus three instances of cpswaps are left, since all cpswap elements with boil down to mere swap gates due to truncation of small rotation angles. The representation of these cpswap modules is shown in Appendix B as Fig. .2.
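The effect of truncating small rotation angles can be sketched by listing the controlled-phase angles of a textbook qft. The snippet assumes the standard convention in which the controlled phase between qubits `j` and `k` has angle $\pi/2^{k-j}$; the threshold used in the example is a placeholder and not the value used in the paper, and the function names are hypothetical.

```python
import math


def controlled_phase_angles(n):
    """Angles of the controlled-phase gates in a textbook n-qubit qft,
    keyed by the 0-based qubit pair (j, k) with j < k."""
    return {
        (j, k): math.pi / 2 ** (k - j)
        for j in range(n)
        for k in range(j + 1, n)
    }


def truncate(angles, threshold):
    """Keep only the rotations whose angle is at least the threshold."""
    return {pair: angle for pair, angle in angles.items() if angle >= threshold}


if __name__ == "__main__":
    angles = controlled_phase_angles(8)
    kept = truncate(angles, threshold=0.01)  # placeholder threshold
    print(len(angles), len(kept))
```

Since the angles fall off exponentially with the distance between the two qubits, a fixed threshold leaves only a band of short-range rotations, which is why distant cpswap modules reduce to plain swaps.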
With these stipulations, we address the task of assembling an qubit qft, exploiting the limits of current allowances on the hlrbii cluster. This translates into using qubit cpswap building blocks () and the qubit qft () in the sense of a qubit qft. Its duration is readily obtained as in Eqn. 25 thus giving an overall quality of
(27) 
Based on this relation, the numerical results of Fig. 8 show that a cisc-compiled qft is moderately superior to the standard risc versions Saito et al. (2000); Blais (2001). Although the potential of cisc compilation amounts to , recursively assembling qubit qfts and qubit cpswaps only exploits about half of it as apparent in the value of .
As has been pointed out by Zeier Zeier (2007), the decomposition of a many-qubit qft into smaller qfts and concatenations of a permutation matrix and a diagonal matrix goes back to a principle already used in the Cooley-Tukey algorithm Cooley and Tukey (1965) for the discrete Fourier transform (dft): Let . Then one obtains Clausen and Baum (1993); Egner (1997)
(28) 
where is a permutation matrix. Moreover, setting , the diagonal matrix takes the form
(29) 
Therefore, the qft decompositions made use of here exactly follow the classical scheme in the second line of Eqn. 28, the expression corresponding to the cpswap.
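The classical principle can be illustrated with its simplest instance, the recursive radix-2 Cooley-Tukey scheme, checked against a direct evaluation of the dft. This sketch is not the general factorisation of Eqn. 28: it only treats lengths that are powers of two, omits normalisation, and uses the classical sign convention $e^{-2\pi i jk/N}$. In it, splitting into even and odd samples plays the role of the permutation, the twiddle factors form the diagonal matrix, and the two half-size transforms are the smaller Fourier blocks.

```python
import cmath


def dft(samples):
    """Direct O(N^2) evaluation of the unnormalised discrete Fourier transform."""
    size = len(samples)
    return [
        sum(
            samples[j] * cmath.exp(-2j * cmath.pi * j * k / size)
            for j in range(size)
        )
        for k in range(size)
    ]


def fft_radix2(samples):
    """Recursive radix-2 Cooley-Tukey transform; length must be a power of two."""
    size = len(samples)
    if size == 1:
        return list(samples)
    even = fft_radix2(samples[0::2])  # permutation: even-indexed samples
    odd = fft_radix2(samples[1::2])   # permutation: odd-indexed samples
    half = size // 2
    result = [0j] * size
    for k in range(half):
        # Diagonal twiddle factor applied to the odd branch.
        twiddle = cmath.exp(-2j * cmath.pi * k / size) * odd[k]
        result[k] = even[k] + twiddle
        result[k + half] = even[k] - twiddle
    return result


if __name__ == "__main__":
    data = [1, 2, 3, 4, 0, -1, 2.5, 1j]
    deviation = max(abs(a - b) for a, b in zip(dft(data), fft_radix2(data)))
    print(deviation)
```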
III The Multiply-Controlled NOT Gate (C$^n$NOT)
Multiply-controlled cnot gates generalise Toffoli’s gate. Here, we move from cnot to cnot in an qubit system with one ancilla and one target qubit. The ancilla qubit is included because it turns the problem to linear complexity Barenco et al. (1995). Moreover, in view of realistic large systems, we assume again a topology of a linear chain coupled by nearest-neighbour Ising interactions. Since cnot gates frequently occur in error-correction schemes, they are highly relevant in practice.
Here we address the task of decomposing a cnot into lower cnots and indirect swap gates (see Sec. I).
To this end, we will generalise the basic principle of reducing a cnot to cnot gates with that can be demonstrated by decomposing a cnot into Toffoli gates according to the scheme of Fig. 9 devised by Barenco et al. in Barenco et al. (1995). Starting with any of the computational basis states (where , denotes addition , and being the usual scalar product) track the effect of the gate sequentially from state through state
to see the overall effect of the gate sequence is a cnot thus proving the decomposition.
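Tracking basis states through such a gate sequence is easy to automate. The sketch below checks a well-known four-gate construction of this kind, in which the controls are split into two groups and an ancilla in an arbitrary state is borrowed; it may differ in detail from the scheme of Fig. 9, and the function names are hypothetical. Because all gates involved are permutations of computational basis states without phases, verifying the action on every basis state suffices.

```python
from itertools import product


def multi_controlled_not(bits, controls, target):
    """Flip bits[target] in place if all control bits are 1."""
    if all(bits[c] for c in controls):
        bits[target] ^= 1


def split_decomposition(bits, controls_a, controls_b, ancilla, target):
    """Four smaller multiply-controlled nots acting as one large one.

    Group A toggles the ancilla, group B together with the ancilla toggles
    the target; repeating both steps restores the ancilla and leaves the
    target flipped iff all controls of A and B are set."""
    for _ in range(2):
        multi_controlled_not(bits, controls_a, ancilla)
        multi_controlled_not(bits, controls_b + [ancilla], target)


def decomposition_is_correct(n_a, n_b):
    """Compare the decomposition with the direct gate on all basis states."""
    n = n_a + n_b + 2
    controls_a = list(range(n_a))
    controls_b = list(range(n_a, n_a + n_b))
    ancilla, target = n - 2, n - 1
    for state in product((0, 1), repeat=n):
        expected = list(state)
        multi_controlled_not(expected, controls_a + controls_b, target)
        actual = list(state)
        split_decomposition(actual, controls_a, controls_b, ancilla, target)
        if actual != expected:
            return False
    return True


if __name__ == "__main__":
    print(decomposition_is_correct(2, 2))
```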
Fig. 10 provides a generalisation of the scheme in Fig. 9: in the first place (a), the number of control qubits is reduced by introducing blocks with qubits that are left invariant. The price for this reduction is a fourfold occurrence of the reduced building blocks. In the second step (b), the reduced building blocks are expanded into a sequence with two central cnots, two terminal cnots and two lots of times cnot each. For part (a) and (b) can be expanded in a general concatenated way thus entailing an overall duration of
(30) 
For completeness, note that the cases have to be treated separately, since they only allow for less and less densely concatenated expansions (not shown). Their respective durations are
(31) 
(32) 
(33) 
However, the total number of gates only depends on , so that one obtains as the overall quality
(34) 
Given the duration of the decomposition as in Eqn. 30, it is easy to see that implementing the control qubits comes with the lowest time weight (4) and without a time overhead of auxiliary gates. Implementing the control qubits, however, requires the same time weight (4), but entails the time for one auxiliary swap gate. In order to implement the control qubits, in turn, a sizeable amount of auxiliary swaps are needed.
Therefore, whenever high fidelities can be reached (so that the quality is limited by relaxation not by fidelity), a good strategy of combining the expansive decomposition in Fig. 10 a with the recursive decomposition in part (b) is the following: given control qubits and with the current limitation from direct cisc compilation being , choose to be the largest, to be the second largest and such that one obtains an even number for .
In the next step, a decision has to be made in order to minimise the contributions in the last two lines of Eqn. 30, whenever there are several integer solutions . So for integer , this amounts to the ordinary minimisation task