Cain: Automatic Code Generation for Simultaneous Convolutional Kernels on Focal-plane Sensor-processors
Abstract
Focal-plane Sensor-processors (FPSPs) are a camera technology that enables low-power, high-frame-rate computation, making them suitable for edge computation. Unfortunately, these devices’ limited instruction sets and registers make developing complex algorithms difficult. In this work, we present Cain
Keywords:
Convolution, SIMD, Image sensor, Analogue computing, Edge inference
1 Introduction
Real-time computer vision applications are currently bound to traditional camera sensors that transfer every pixel of every frame to a host where it is processed. This requires high-performance buses between the sensors and hosts, especially where high frame rates are required. A self-driving car may need to receive new information for every 1 cm travelled to be vigilant of unexpected scenarios, so at 80 km/h a frame rate of 2222 Hz would be required. A 2-megapixel camera with 10-bit pixel depth running at such a frame rate requires a bus capable of 45.6 Gbit/s, which is currently only possible with devices such as a PCIe x8 Gen3 interface [pcieCam]. For many applications, however, streaming data at such volumes is too demanding in both power and computation time, so an alternative solution is required.
Co-design of hardware and software for computer vision applications is an emerging research field that addresses the limitations of conventional systems [8436423]. Focal-plane Sensor-processors (FPSPs) are a promising avenue for reducing the data transfer between the camera and the processing unit. FPSPs, often synonymous with Cellular Processor Arrays (CPAs) and Pixel Processor Arrays (PPAs), perform processing on the sensor chip itself and are often designed for tasks which require high frame rates or low latency [fpsp1]. The principle behind them is that a small processor is embedded directly with each pixel of the sensor. While FPSPs come in various forms for specific applications, in this paper we explore a general-purpose fine-grain architecture, SCAMP5 [scamp1], but one can imagine alternatives that could be designed for various use cases.
One of the most widely used methods for image analysis is the convolutional kernel. From edge detection using Sobel filters to document recognition using Convolutional Neural Networks [lecun1998gradient], convolutional kernels are the foundation for many complex computer vision applications. Traditionally, convolutional kernels are applied to the image data on a CPU, but more recently GPUs and FPGAs have been used to accelerate the computations in parallel [abadi2016tensorflow], [chen2016eyeriss]. Several systems have been designed to optimise the processing of convolutional kernels on GPUs and FPGAs, leading to a vast array of techniques to reduce the number of operational cycles needed to apply kernels to input data. While these techniques have significantly increased throughput, they are still bounded in latency, as the image must make its way from the camera through to the host system. FPSPs, by contrast, process the data on the focal plane, which enables the kernels to be applied to the image data at very low latency. Furthermore, the unique ability to select the data which is transferred from the device to the host reduces the data volume, which allows for high frame rates. However, the technology is comparatively new. By design, FPSPs offer novel ways to interact with the data, and while work has been done to provide a domain-specific language and associated tools to program such hardware [martel], there has been less work so far on code generation systems that make efficient use of their architectural features when applying convolutional kernels in particular.
One such system that does exist, however, is AUKE [TomD]. Given a convolutional kernel, AUKE’s reverse-split algorithm generates code for SCAMP5 which applies the kernel efficiently to the captured image on the focal plane using analogue computation. AUKE is, however, limited to compiling a single convolutional kernel at a time, using a reduced instruction set that omits the more powerful instructions available in SCAMP5.
In this work, we present an improved alternative to AUKE, with the ability to produce code for applying multiple convolutional kernels at a time. The problem is presented as a dynamic graph search problem in which we must efficiently generate and traverse possible processor states to find a path that describes the relevant convolutional computation. By incorporating instruction selection and instruction scheduling into the core of the search process, we enable the use of more novel features of CPA architectures than AUKE is able to use. By optimising the code for multiple kernels simultaneously, common subexpressions between kernels can be exploited and produced only once rather than once for each kernel. This reduces the computational expense of applying the kernels, enabling applications to run at a faster frame rate.
The primary objective of this work is to push the boundary of code generation for FPSP devices through simultaneous kernel optimisation. We offer the following contributions:

Cain: A code generation algorithm which effectively makes use of common subexpressions across filters consisting of multiple convolutional kernels. Our graph search strategy, which enables Cain to efficiently search large graphs, combines instruction scheduling, instruction selection and register-allocation constraints into the core of the search to make better use of specific hardware capabilities in SIMD processors.

We show how this search can be tractable for problems of interest through a problem formulation based on AUKE’s multiset-of-Atoms problem representation, combined with a ranking heuristic and a hybrid graph-generator–graph-search exploration strategy.

We show how this approach allows flexible exploitation of hardware capabilities (such as three-operand adds and multi-step shifts), and makes very efficient use of additions to avoid multiplies.

Evaluation of the effectiveness of Cain on the SCAMP5 Focal-plane Sensor-processor. We compare against AUKE and test the effectiveness of simultaneous kernel optimisation. We conclude by exploring how our simultaneous kernel optimisation extends to future devices with more registers per pixel.
The remainder of the paper is organised as follows. Section 2 describes the SCAMP5 and its instruction sets, Section 3 explains our proposed code generation algorithm Cain, and Section 4 presents a detailed comparison between Cain and AUKE, together with an evaluation of the effectiveness of simultaneous kernel optimisation. Section 5 reviews the related work AUKE in detail. Finally, Section 6 concludes our work, with a discussion about potential future research.
2 Background: SCAMP5 Focalplane Sensorprocessor
In this section, we discuss the capabilities of the next generation camera technology SCAMP5, and give an overview of the functionality used by Cain.
SCAMP5 has been demonstrated in many different computer vision applications, ranging from Visual Odometry systems [murai2020bit], [bose_visual_2017], [debrunner2019Multiprog] and an end-to-end neural sensor which performs learnt pixel exposures [DBLP:journals/pami/MartelMCDW20] to Convolutional Neural Networks [wong2020analognet], [bose2019camera]. Its distinctive ability to perform computation on the focal plane reduces power consumption and data transfers, making the device promising for edge computation.
The SCAMP5 architecture is a general-purpose fine-grain SIMD FPSP [scamp2]. It has a 256×256 pixel array, and alongside each pixel is a small Processing Element (PE). All 65,536 processors execute the same instruction at one time. In addition to 14 binary registers, each PE has analogue registers A through to F as well as a NEWS register. Each PE can also address XN, XE, XS, and XW registers, which are actually the NEWS registers of that PE’s respective neighbours. Each PE uses an analogue bus to link its available analogue registers and, because values are stored as charge, analogue arithmetic is done directly on the bus that connects the registers rather than on a separate arithmetic unit.
Instructions in the architecture control how register values are let into and out of the bus, with the caveat that values are inverted due to the nature of the analogue electronics. Each macro instruction, such as add, sub, and mov, is made of multiple bus instructions that create the desired behaviour. A bus instruction follows the general rule that the values of the source registers are summed, negated, and divided equally between the receiving registers. Since a bus operation directly controls which registers are opened to the PE’s common analogue bus, a register may only appear once in each bus instruction. Each bus instruction also incurs significant noise and error factors, especially for bus2 and bus3 [scamp3].
Macro instruction arguments are written as if they are assignment statements. For example, the macro instruction add(A, B, C) means $A := B + C$ and is made up of two bus instructions: bus(NEWS, B, C), meaning the NEWS register now contains the value of $-(B + C)$; and then bus(A, NEWS), so that register A contains $B + C$. We can see here that the add instruction has additional constraints: the two operands cannot be the same register, and the NEWS register is overwritten and left containing $-(B + C)$ as a side effect. When using macro instructions, we restrict the registers to A to F, and allow the macros themselves to make use of the NEWS and neighbouring NEWS registers for us by means of a direction value. We use subscripts to denote the registers of neighbouring PEs. For example, mov2x(A, B, north, east) computes $A := B_{north,east}$ in two bus instructions: bus(XS, B); bus(A, XE). The first means that $XS := -B$, which is equivalent to $NEWS := -B_{north}$, and then the second instruction means $A := -XE = -NEWS_{east} = B_{north,east}$.
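The bus rule described above can be illustrated with a minimal single-PE model in Python. This is only a sketch under stated assumptions: the function names `bus` and `add` mirror the instruction names in the text, but the dictionary-based register file and the noise-free arithmetic are simplifications introduced here, not the SCAMP5 implementation.

```python
def bus(regs, dests, srcs):
    """Model one analogue bus instruction on a single PE.

    Assumed rule (taken from the prose): the source values are summed,
    negated, and divided equally between the receiving registers.
    """
    # A register may appear only once in a bus instruction.
    assert len(set(dests) | set(srcs)) == len(dests) + len(srcs)
    value = -sum(regs[s] for s in srcs) / len(dests)
    for d in dests:
        regs[d] = value


def add(regs, dst, x, y):
    """Macro add(dst, x, y), built from two bus instructions."""
    bus(regs, ["NEWS"], [x, y])  # NEWS := -(x + y)
    bus(regs, [dst], ["NEWS"])   # dst  := -NEWS = x + y


# Hypothetical register contents for one PE.
regs = {"A": 0.0, "B": 0.25, "C": 0.5, "NEWS": 0.0}
add(regs, "A", "B", "C")
```

After the call, `regs["A"]` holds the sum and `regs["NEWS"]` is left holding its negation, which is the side effect noted in the text. Calling `add(regs, "A", "B", "B")` trips the assertion, reflecting the constraint that the two operands cannot be the same register.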
While interesting uses of the bus instructions exist, allowing adding and subtracting from neighbouring PEs, individual macro instructions are still highly restricted in comparison to most modern instruction sets. Only primitive analogue operations are available to each PE, such as Move, Add, Subtract, Divide by two, and acquiring the value from the sensor [scamp3]. The lack of a multiplication instruction means the problem of generating convolutional filter code for SCAMP5 builds on the theory of multiplier-free FIR filters [CHANDRA2016212].
The chip has been shown to be capable of operating at 100,000 FPS, largely because it is not limited by the speed of an output bus to transfer all the pixel data [scamp1]. Instead of only offering an analogue or digitally encoded output of all pixels at a time, like traditional camera sensors, the SCAMP5 architecture allows binary outputs per pixel, and even event-driven outputs. This allows each PE to come to a judgement on its input pixel data and fire its own event that sends the coordinates of the PE to the host, allowing information transfer without divulging the actual image.
The architecture uses an off-chip controller to manage the fetch-decode-execute cycle, with every pixel’s processor receiving the same instruction, making it a single-instruction-multiple-data (SIMD) design. This has benefits in terms of simplicity and efficiency as none of the Processing Elements need to be able to fetch instructions for themselves. There is also provision for masking pixels such that only selected PEs execute instructions.
One important consideration to be made when using and designing algorithms related to the SCAMP5 chip is noise introduced by the nature of the analogue computation. Every use of the 7 analogue registers introduces noise to the values stored. This makes finding optimal code to perform the convolutions ever more vital for accurate results.
3 Cain
Cain is a framework for compiling convolutional filters, designed to search through a configurable Cellular Processor Array (CPA) instruction set to find efficient code. A fundamental concept Cain uses is to only consider a single arbitrary PE in the CPA, and perform everything relative to it. This works for SIMD architectures like SCAMP5 because every PE will be executing the same steps synchronously in parallel. The assumption we make when producing code is that the neighbours of our arbitrary PE will exist and so will have done the same work but at a relative offset in the input image.
The aim is to search through the graph of possible Processing Element states in such a way that common subexpressions in the given kernels are exploited and used to reduce the cost of any path from initial to final PE states. To do this, Cain searches backwards: it starts with a set of final kernels (the convolutional filter) and applies instructions in reverse to simplify the kernels until only the identity kernel remains.
3.1 Definitions
This section provides an overview of the notation and definitions used in this paper. Cain is designed such that different definitions could be used without changing the fundamental search algorithm, but the definitions we use here for SCAMP5 are based largely on AUKE’s, which provide an elegant way to conceptualise the convolutional kernels without multiplication.
Example 1
We will look at a simple example of how a convolutional kernel is represented in Cain. Here we use AnalogNet2 [wong2020analognet][analognet2] which is a CNN designed for SCAMP5.
(1) 
Since SCAMP5 does not have multiplication we must approximate the kernel, and because it does have division-by-two instructions the natural approximation to make is to find the nearest integer multiple of $\frac{1}{2^d}$ for each coefficient in the kernel, given some number of divisions $d$. In our example we have already extracted the common denominator such that $d = 2$, and this perfectly represents the kernel. The larger $d$ is, the larger the search space and complexity of the problem, so $d$ can be limited to allow an acceptable amount of approximation error such that the resulting program is shorter and the computational expense of compiling it is reduced.
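The approximation step can be sketched in a few lines of Python. The example below is illustrative only: the helper names `approximate` and `worst_error`, the parameter name `d` for the number of divisions, and the sample coefficients are assumptions made here, not part of Cain.

```python
def approximate(kernel, d):
    """Round each coefficient to the nearest integer multiple of 1/2**d.

    Returns the integer numerators, i.e. round(coefficient * 2**d).
    Note that Python's round() resolves exact ties to the even integer.
    """
    scale = 2 ** d
    return [[round(c * scale) for c in row] for row in kernel]


def worst_error(kernel, d):
    """Largest absolute difference between a coefficient and its approximation."""
    scale = 2 ** d
    return max(abs(c - round(c * scale) / scale) for row in kernel for c in row)


# Hypothetical 2x2 kernel; 0.3 is not exactly representable in quarters.
kernel = [[0.25, -0.75], [0.3, 1.0]]
numerators_d2 = approximate(kernel, 2)
numerators_d4 = approximate(kernel, 4)
```

With two divisions the coefficient 0.3 is rounded to one quarter, whereas four divisions bring it to five sixteenths, a smaller error bought at the price of four times as many units to account for in the search.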
Definition 1
Let an Atom, denoted as $(x, y, z, \pm)$, be a representation of $\frac{1}{2^d}$ of a pixel value at coordinate $(x, y)$, on the $z$th channel. $x$ and $y$ are coordinates relative to the arbitrary PE, and so also to the centre of the kernel, and $z$ refers to an image input channel. The sign is used to negate the value if necessary.
Definition 2
Let a Goal, denoted as $G$, be a multiset of Atoms. The Goal represents an arbitrary kernel, scaled by $2^d$. The aggregate of the values represented by each of the Atoms yields the same result as applying the scaled kernel.
Representing a convolutional kernel as a Goal is a convenient way to support a multiply-free instruction set, such as that of SCAMP5. One can simply view this as unrolling the multiply instruction into additions. Using Goals simply reframes the problem by scaling everything by $2^d$, and approximating coefficients to the nearest number of Atoms.
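As a rough illustration of unrolling multiplication into a multiset, the following Python sketch builds a Goal from a matrix of integer numerators using `collections.Counter`. The tuple layout `(x, y, z, sign)`, the axis orientation, the function name `kernel_to_goal`, and the example matrix are all assumptions made for this sketch.

```python
from collections import Counter


def kernel_to_goal(numerators, channel=0):
    """Build a Goal (multiset of Atoms) from integer kernel numerators.

    numerators[r][c] is a coefficient already scaled to an integer count.
    Atoms are (x, y, z, sign) tuples with (0, 0) at the kernel centre; the
    orientation (x to the right, y upwards) is an assumption of this sketch.
    """
    height, width = len(numerators), len(numerators[0])
    goal = Counter()
    for r, row in enumerate(numerators):
        for c, n in enumerate(row):
            if n != 0:
                atom = (c - width // 2, height // 2 - r, channel, 1 if n > 0 else -1)
                goal[atom] += abs(n)  # a coefficient of n becomes |n| identical Atoms
    return goal


# Hypothetical 3x3 kernel numerators, not taken from the paper.
example = [[0, 0, 0],
           [-3, 1, 0],
           [0, 0, 2]]
goal = kernel_to_goal(example)
```

Here a coefficient of -3 to the left of the centre becomes three negative Atoms at `(-1, 0)`, so applying the kernel amounts to summing six Atoms in total.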
Definition 3
Let a GoalBag, denoted as , be a multiset of Goals. The GoalBag is used to capture the state of our arbitrary PE. This includes defining the FinalGoals, the set of convolution kernels we wish to compute; and the InitialGoals, the set of Goals which the computation will start from.
Using these definitions of Goals and Atoms we see that the first kernel from Example 1 can be represented by
As our Goal notation is verbose, we provide a compact version that disambiguates Goals from kernels
(2) 
By repeating this process for the rest of the convolutional kernels in the AnalogNet2 filter, the FinalGoals GoalBag is produced:
(3) 
Since $d = 2$ in our example, the Goal representation of the identity kernel that makes up the InitialGoals is based on the approximation of the FinalGoals:
(4) 
Moving a value around the processor array is expressed by translating every Atom of a Goal. Addition and subtraction can be expressed by combining two Goals into one, making sure to cancel out positive and negative Atoms with the same coordinates. Since Cain searches backwards, we apply these operations in reverse. For 2-operand addition this means we take a Goal, $G$, that we wish to generate code for, then produce 2 new Goals that when added together produce $G$. Defining Goals as multisets of Atoms makes this process intuitive, as we can simply split the Atoms between two Goals in every possible permutation (or fewer if we are willing to assume some are non-optimal, or willing to miss potentially better code for the sake of more efficient code generation). This definition also restricts the reverse search process, since when splitting a Goal we cannot split an Atom. To compute the red Atoms in naively, PEs must sum them and read this value from the west thus translating the Atoms eastward.
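The three Goal operations described above (translation, addition with cancellation, and reverse splitting) can be sketched as follows. This is a minimal illustration, assuming Goals are `Counter` multisets of hypothetical `(x, y, z, sign)` tuples; the function names are introduced here and the exhaustive `splits` enumeration ignores the pruning a practical search would apply.

```python
from collections import Counter
from itertools import product


def translate(goal, dx, dy):
    """Shift every Atom of a Goal by (dx, dy)."""
    return Counter({(x + dx, y + dy, z, s): n for (x, y, z, s), n in goal.items()})


def add_goals(a, b):
    """Sum two Goals, cancelling opposite-signed Atoms at the same position."""
    total = Counter()
    for (x, y, z, s), n in list(a.items()) + list(b.items()):
        total[(x, y, z)] += s * n
    return Counter({(x, y, z, 1 if v > 0 else -1): abs(v)
                    for (x, y, z), v in total.items() if v != 0})


def splits(goal):
    """Yield every (left, right) pair of Goals with left + right == goal.

    Atoms are moved whole, never divided, and trivial splits with an
    empty side are included.
    """
    atoms = list(goal.items())
    for counts in product(*(range(n + 1) for _, n in atoms)):
        left = Counter({a: k for (a, _), k in zip(atoms, counts) if k})
        right = Counter({a: n - k for (a, n), k in zip(atoms, counts) if n - k})
        yield left, right


g = Counter({(0, 0, 0, 1): 2, (1, 0, 0, -1): 1})
all_splits = list(splits(g))
```

For this small Goal there are six ways to distribute the Atoms, which hints at how quickly the number of candidate splits grows with Goal size and why ordering heuristics matter.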
3.2 Search Strategy
Cain’s reverse search algorithm works iteratively taking the state of an arbitrary PE, defined as a GoalBag:
(5) 
This is a node in our search graph and represents the state we aim to achieve by executing the instructions that form a path from the initialGoals to this node. In the search graph, nodes are generated dynamically as the graph is explored. Fig. 2 shows a simplified view of how a graph might look as it is generated and searched. We simplify the exploration such that in each iteration of the search algorithm we produce a GoalBag Pair of an GoalBag and a GoalBag as well as an instruction, with the following constraints:
(6) 
This is in contrast to AUKE’s method, shown later in Equation 16. The new child node, , is then produced by applying the instruction in reverse using the following rule, with the instruction becoming an edge in the graph:
(7) 
Following our AnalogNet2 example from Equation 3, the first iteration of the search algorithm will start with and the Pair of GoalBags Cain produces is as follows:
(8)  
(9)  
(10) 
The multiset semantics here mean that if the Goals in are all already part of then the number of Goals to solve is reduced, and so by applying more pairs we traverse the graph of GoalBags, until we reach the initialstate, where the only Goal in the GoalBag is the identity Goal. In our example (Equation 10) we see that the subexpression of 3 negative Atoms is reused in and since applying a mov2x next could eliminate from . There is also further potential to reuse this by how we split . Once the initial GoalBag is found the path from the initial GoalBag back to the FinalGoals becomes the list of instructions that form our generated program.
After this point Cain continues searching for shorter paths, and can cull any nodes with longer paths. During the search the same GoalBags may be reproduced in different ways. We cull the current node any time a GoalBag is produced that has already been seen at a lower or equal cost, or if the GoalBag has more Goals than available registers.
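The two culling conditions can be expressed compactly. The sketch below is an assumption-laden illustration rather than Cain's code: GoalBags are modelled as lists of dictionaries mapping hypothetical Atom tuples to counts, and `canonical`, `should_cull`, and `best_cost_seen` are names invented here.

```python
def canonical(goal_bag):
    """Hashable, order-independent key for a GoalBag.

    A GoalBag is modelled as a list of Goals, each Goal a dict mapping an
    Atom tuple to its multiplicity.
    """
    return tuple(sorted(tuple(sorted(goal.items())) for goal in goal_bag))


def should_cull(goal_bag, cost, best_cost_seen, num_registers):
    """Return True if this node should be dropped from the search."""
    if len(goal_bag) > num_registers:
        return True  # more Goals than registers can hold
    key = canonical(goal_bag)
    if key in best_cost_seen and best_cost_seen[key] <= cost:
        return True  # already reached at a lower or equal cost
    best_cost_seen[key] = cost
    return False


seen = {}
g1 = {(0, 0, 0, 1): 4}
g2 = {(1, 0, 0, -1): 1}
first = should_cull([g1, g2], 5, seen, 6)
repeat = should_cull([g2, g1], 5, seen, 6)
cheaper = should_cull([g2, g1], 4, seen, 6)
too_big = should_cull([g1] * 7, 0, seen, 6)
```

Sorting the Goals inside `canonical` makes the check insensitive to the order in which Goals were produced, so the same GoalBag reached along two different paths is recognised as a repeat.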
The second part of the search strategy defines the search order. Each invocation of the reverse search algorithm produces one new node, and the input node is incremented to record how many of its children have been produced so far. Cain uses this simple definition to allow several graph traversal algorithms to be implemented. Using Depth-First Search (DFS), Cain can simply maintain a stack of the nodes. On each cycle the top node is popped off the stack and given to the reverse search algorithm. Then the incremented parent node is put back on the stack, followed by the new child node.
While DFS performs well in AUKE, it struggles in Cain because the number of child nodes at every level is far greater, since each edge is only one instruction and there are multiple kernels to consider. This means the size of the graph we would like to search is much larger and we are unable to search even a small fraction of it. To overcome this we use a graph-traversal algorithm that, for our purposes, we call Child-Generator-Deque-Search (CGDS). The aim of this algorithm is to ensure that the search does not end up ‘trapped’ in one small part of the graph, but can effectively traverse many children of many of the nodes that are found, whereas DFS will search all of the children of nodes at the extent of the paths it searches before searching the second children of nodes earlier in the graph. Algorithm 1 shows a pseudocode implementation of CGDS. In each cycle the front of the queue is polled. If the node has not been seen before, Cain checks whether it can be directly transformed from the initial-state GoalBag; this is the ‘node computation’. The node is then passed to the reverse search algorithm to attempt to produce the next new child node and to increment the parent node; this is implicit in calling ‘yield()’ on g. The child node, if it exists, is put on the front of the queue and the incremented parent node is put on the back. We do not claim that CGDS is novel, but we have found it superior to obvious alternatives and to the strategy used in [Linnea]; for details see [Stow2020].
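The deque discipline described above can be sketched in Python on a toy graph. This is an illustration of the traversal order only, under several assumptions: the function name `cgds`, its callback parameters, and the small tree are hypothetical, Python iterators stand in for the incrementing parent node, and cost-based culling is reduced to a simple seen-set.

```python
from collections import deque


def cgds(root, children, is_solution):
    """Child-generator deque search over a graph generated on demand.

    children(node) returns an iterable of child nodes, most promising first.
    Returns (visit_order, solutions).
    """
    order, solutions = [], []
    seen = {root}
    # Each entry: (node, its child iterator, whether it is newly discovered).
    queue = deque([(root, iter(children(root)), True)])
    while queue:
        node, gen, fresh = queue.popleft()
        if fresh:
            order.append(node)
            if is_solution(node):         # stands in for the 'node computation'
                solutions.append(node)
                continue
        child = next(gen, None)           # one new child per visit
        if child is None:
            continue                      # parent has no children left
        queue.append((node, gen, False))  # advanced parent goes to the back
        if child not in seen:             # skip nodes already discovered
            seen.add(child)
            queue.appendleft((child, iter(children(child)), True))
    return order, solutions


# Toy tree standing in for the dynamically generated GoalBag graph.
tree = {"r": ["a", "b"], "a": ["a1", "a2"], "b": ["b1"],
        "a1": [], "a2": [], "b1": []}
order, found = cgds("r", lambda n: tree[n], lambda n: n == "b1")
```

On this tree the visit order is r, a, a1, b, b1, a2, whereas a depth-first search would visit a2 before ever reaching b. The first children of many nodes are therefore tried before the later children of any one node, which is the behaviour the text motivates.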
3.3 Cost Function
In the reverse search algorithm we see that the GoalBag Pairs are produced one at a time. While this simplification allows us to produce more generic graph traversal implementations, what allows Cain to efficiently find solutions is the heuristics that order the pairs produced for a node from the most promising to the least. This type of heuristic provides the order of siblings to search, so we call it a ‘local heuristic’. It does not compare nodes in different parts of the graph; a heuristic that did so we would call a ‘global heuristic’. We found that we were unable to find effective global heuristics because traversal algorithms that take advantage of such heuristics end up producing huge frontier sets of nodes, making the memory requirements too large. The use of local heuristics drives the SCAMP5 code generation in Cain instead, though support for best-first search with global heuristics is available in Cain. The local heuristics used for SCAMP5 are based on generating every child node of the parent and then ordering them based on a cost function. There are 3 main components considered for the cost: Atom distance, repeated Goals, and divisions. A simplified formula is shown in Equation 11.
(11)  
(12)  
(13)  
(14) 
The Atom distance part counts up how many Atoms every Goal in has, and how far from the centre they are, with some relief if the Goal is a subGoal of another Goal in . The repeated Goals portion of the cost penalises by the square of number of Atoms in each Goal, unless that Goal is equal to a translation of another Goal in . The divisions component penalises for the number of division operations that would be required to produce the Goals from the identitykernel Goal, .
4 Evaluation
All performance evaluation is conducted on an Intel Core i7-7700HQ CPU (4 cores, 8 threads) with a base frequency of 2.80 GHz. The computer has 16 GB of RAM and runs Ubuntu 18, with Java 1.8 (Oracle) and Python 3.6 used to run Cain and AUKE respectively. The implementation of AUKE used, as developed by Debrunner, can be found on GitHub.
4.1 Performance Evaluation Against AUKE
Cain is compared against AUKE by comparing the code generated by the respective compilers, given the same input filters. Both compilers are given 60 seconds to find a solution using all 6 registers. Note that, as Cain supports multi-threading, it spawns 4 worker threads to perform the search.
As shown in Table 1, Cain significantly outperforms AUKE. Cain supports a wider set of instructions than AUKE, enabling the generation of more efficient code. Moreover, the search strategy used by Cain is more effective than AUKE’s: for the 5×5 Gaussian kernel, using the same set of instructions (Basic), the code generated by Cain is half the length of AUKE’s output. In further testing, however, AUKE is able to produce less inefficient code for this kernel when given fewer registers. When given multiple kernels, Cain is able to perform simultaneous kernel optimisation. For example, when combining the 5×5 and 3×3 Gaussian kernels, Cain, unlike AUKE, is able to utilise the common subexpressions between the kernels, thus generating shorter code than naively concatenating the code for each of the Gaussian kernels. Neither Cain nor AUKE performs a complete exhaustive search.
The AnalogNet2 filter consists of the kernels used in AnalogNet2 [wong2020analognet][analognet2], a CNN for SCAMP5 capable of MNIST digit recognition. Cain requires only 21 instructions, whereas AUKE produces kernel code with 49 instructions in total. Shorter code not only improves the execution time, but also reduces noise build-up, which is a significant problem, as discussed in [wong2020analognet]. If the aim of finding subexpressions is to eliminate redoing work, then the number of add and subtract operands is a proxy for how effective the search for subexpressions is, regardless of how translations are handled. Table 2 shows that AUKE’s code has 40 add or subtract operands whereas Cain’s code has only 27. We have compared the runtime of AnalogNet2’s convolution kernels, generated by AUKE and Cain, on the physical SCAMP5. Note that, as AUKE produces code which performs invalid register manipulation, the fixed code as used in [analognet2], which executes on the device, is 81 instructions long. On the device, the code produced by Cain showed an almost 4 times speedup over the code produced by AUKE.
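Table 1: Length of the code (number of instructions) generated by AUKE and Cain.

| Name | AUKE (Basic) | Cain (All) | Cain (Basic) |
|---|---|---|---|
| 3×3 Gauss | 12 | 10 | 12 |
| 5×5 Gauss | 50 | 19 | 25 |
| 5×5 and 3×3 Gauss | | 26 | 39 |
| AnalogNet2 | | 21 | 30 |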
Table 2: Code generated by AUKE and Cain for the AnalogNet2 filter.

AUKE (49 instructions)

Kernel 2
mov(B,A);
divq(B,B);
divq(B,B);
movx(C,B,north);
neg(C,C);
neg(D,C);
movx(E,D,west);
neg(E,E);
add(F,B,E);
movx(B,D,east);
add(B,B,E);
movx(D,E,south);
movx(D,D,south);
sub(B,B,D);
add(B,B,F);
add(B,C,B);
movx(C,C,west);
add(B,B,C);
movx(C,F,south);
add(B,C,B);
add(B,B,F);

Kernel 3
mov(C,A);
divq(C,C);
divq(C,C);
movx(D,C,south);
neg(D,D);
movx(E,C,east);
sub(D,D,E);
movx(E,C,north);
add(E,E,D);
add(D,D,D);
add(D,E,D);
movx(E,C,west);
sub(C,C,E);
add(D,D,C);
movx(C,C,north);
add(C,D,C);

Kernel 1
divq(A,A);
divq(A,A);
movx(D,A,west);
neg(D,D);
movx(E,D,south);
add(D,D,E);
add(E,A,D);
movx(A,A,south);
movx(A,A,east);
add(A,D,A);
add(A,A,A);
add(A,E,A);

Cain (21 instructions)

diva(A,D,E);
div(D,E,C,A);
movx(E,D,west);
movx(C,E,north);
neg(F,E);
subx(B,F,east,A);
addx(E,E,D,south);
add2x(D,F,D,north,north);
sub2x(F,D,south,south,C);
add2x(D,C,D,east,south);
add(E,E,D);
movx(D,A,north);
add2x(A,C,A,east,east);
movx(C,B,east);
add(D,F,D);
add2x(F,F,E,east,south);
movx(E,B,south);
addx(A,B,A,south);
addx(A,B,A,west);
add2x(B,F,B,north,west);
add(C,D,C,E);

4.2 Effectiveness of the Search Strategy
If Cain has an effective heuristic we will quickly see a point of diminishing returns in code length as Cain continues to search new nodes and takes more time. We can track the number of nodes that are explored before finding any plan in Cain, and so use this as a measure of the search strategy and heuristics that is more independent of physical compute performance. With this in mind we test the effectiveness of our heuristic by constructing 100 samples of randomly generated single-kernel filters as in Equation 15. Running Cain with the following configuration (Maximum Nodes to Explore: 20000, Maximum Search Time: 60s, Worker Threads: 1) allows us to collect as many plans as can be found in the given time limit. We then ran Cain again, but with Cain’s SCAMP5 heuristic disabled and replaced with a random sort. This allows us to compare Cain’s heuristics against an unaided benchmark.
(15) 
We found that Cain was unable to find any plan for any of the 100 sample filters without its heuristics, principally demonstrating that effective heuristics are required in Cain for any tangible progress to be made. We plot the lengths of the best plans found against the number of nodes expanded before the plan is found in Fig. 3. We can see that improvements are fewer and further between after the first 2500 nodes are explored. After this we see that we can expect at most a reduction equal to the reduction seen at 2500 for the rest of the nodes explored. This clearly demonstrates a point of diminishing returns for these filters. If the heuristic is effective we expect it to direct the search towards short plans first, and try instructions less likely to be optimal later. This model fits the data well as we see short plans are found quickly, and while improvements can be made, it is clear that they are found less often as the search continues.
4.3 Effectiveness of the Simultaneous Kernel Optimisation
One of the significant features of Cain is to efficiently generate code for filters with multiple kernels, and do this simultaneously such that shared common subexpressions can be reused. As it is possible for Cain to perform exhaustive searches for plans, given sufficient time, it will find a solution that simply computes the individual kernels independently, or find a solution with lower cost – utilising the common subexpressions.
First, we wish to test whether the length of generated code is sublinear in the number of input kernels. To test this, we again generate kernels using the method in Equation 15. For kernel counts from 1 to 4 we generate 25 filters each and test them all using the same configuration as before, except that we remove the maximum nodes explored constraint and allow 4 worker threads. We plot the results in Fig. 3 and see that the results appear worse than linear, suggesting that common subexpressions are not being taken advantage of effectively.
We hypothesise that the limited number of registers in the SCAMP5 architecture is the major limiting factor in producing efficient code. To test this we increase the number of available registers to 18. For filters with 1 kernel up to 10 kernels we generate 10 samples each. Every kernel in the 100 filters is produced as in Equation 15. For each sample, Cain compiles the kernels individually, given the appropriate number of registers such that other kernels in the filter would not be overwritten. Then we compile the kernels simultaneously using Cain. All compilations are given 60s to run, with 4 worker threads.
Fig. 4 shows the results of this test. We see clearly that when register limitations are not a restricting factor Cain is able to consistently improve the performance of filter implementations by compiling them simultaneously. We see that improvements grow with more kernels, and it appears that the length of code generated for simultaneously compiled kernels increases sublinearly. This supports the idea that with more kernels, ever more common subexpressions can be exploited.
5 Related Work: AUKE
In this section we look at how AUKE operates to provide extra context and contrast for Cain. Automatic Kernel Code Generation for Analogue SIMD (AUKE) is an algorithm for generating code given a single convolutional kernel created by T. Debrunner [TomD]. It can be characterised by 4 main steps: kernel approximation; the reverse split algorithm; graph relaxation; and finally register allocation. First, AUKE approximates the input kernel into the Goal representation. In this process Cain is similar to AUKE and the reasoning and mechanics have been discussed in Section 3.1.
Unlike in Cain, multiple instructions are represented by a single elemental transformation of Goals. These elemental transformations form edges of a graph that describe the translation, addition, subtraction and division of Goals to produce the desired convolutional filter. This abstraction allows AUKE to reduce the effective size of the search space at the cost of granularity in instruction selection and of extensibility to hardware features such as 3-operand addition. Debrunner called this the ‘Reverse-Split Algorithm’.
The graph of elemental transformations is dynamically generated via a recursive depthfirst search that tries to split a Goal , that needs to be produced, into 3 subGoals:
(16) 
This recursive algorithm then means that if the search can find solutions for and (two smaller problems) it can trivially create and therefore the desired Goal.
In the ideal case and so only needs to be produced and we save one addition. In the worst case and is a transformation of and so less useful work is done in that step. If two Goals are equal they are merged such that they aren’t calculated twice, to exploit common subexpressions in the Goals. This process is repeated until a single Goal, the initialGoal, is left. This algorithm is able to entirely search the relevant problem space, given a couple of assumptions. Most notably, the assumption that every subGoal generated is a subset of the FinalGoal. This reduces the search space significantly to the most promising but not necessarily the best solutions, allowing AUKE to find generally effective solutions.
The algorithm is made efficient and useful by intelligently selecting the order in which the subGoals are generated at every recursive step. By selecting pairs of subGoals that are likely to lead to efficient code, the algorithm can quickly find some path to the initialGoal. From then on the recursive search can stop early if a lower cost solution has already been found.
The Graph Relaxation step aims to mitigate missing optimal solutions because of the assumption that subGoals are always subsets of the FinalGoal by using a ‘retiming’ algorithm used in integrated circuit design. This is not needed in Cain since Cain searches instruction by instruction, and so any optimisations found via graph relaxation are already a part of the search space.
The final step is to perform register allocation on the graph to be able to generate usable code. A maximum bound on the number of registers is already accounted for in the search algorithm, since spilling is not an option for the SCAMP5 architecture. For this task, variable liveness is considered for each node of the graph representation, and a graph colouring algorithm is used to find a solution.
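To make the liveness-then-colouring idea concrete, here is a small Python sketch. It is not AUKE's implementation: it assumes a straight-line program of hypothetical `(dest, sources)` pairs over virtual value names, and it swaps in a simple greedy colouring of live ranges in order of definition. The names `live_ranges` and `allocate` are introduced here for illustration.

```python
def live_ranges(program):
    """Return {name: (definition index, last use index)} for each value.

    program is a list of (dest, sources) pairs; every source must have
    been defined by an earlier entry.
    """
    ranges = {}
    for i, (dest, srcs) in enumerate(program):
        for s in srcs:
            ranges[s] = (ranges[s][0], i)
        ranges[dest] = (i, i)
    return ranges


def allocate(program, outputs, registers=("A", "B", "C", "D", "E", "F")):
    """Greedily assign a register to each value; fail rather than spill."""
    ranges = live_ranges(program)
    for name in outputs:  # outputs stay live to the end of the program
        ranges[name] = (ranges[name][0], len(program))
    assignment = {}
    for name, (start, end) in sorted(ranges.items(), key=lambda kv: kv[1][0]):
        # Registers held by already-placed values whose ranges overlap this one.
        busy = {assignment[other] for other, (s, e) in ranges.items()
                if other in assignment and s < end and start < e}
        free = [r for r in registers if r not in busy]
        if not free:
            raise RuntimeError("out of registers: spilling is not possible")
        assignment[name] = free[0]
    return assignment


# Hypothetical four-step program: load, divide, shift, sum.
program = [("img", ()), ("q", ("img",)), ("w", ("q",)), ("s", ("q", "w"))]
assignment = allocate(program, outputs=["s"])
```

In this sketch a value whose last use coincides with the definition of another may share that register, which mirrors destinations that overwrite one of their own operands in the listings, such as add(B,B,E). Raising an error when no register is free reflects the statement that spilling is not an option.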
6 Conclusion
We have presented Cain, a compiler which produces SCAMP5 instructions from a set of convolutional kernels. Although the effectiveness of simultaneous kernel optimisation is limited on the current iteration of the SCAMP5, we demonstrate that, with an increased number of registers, the length of Cain’s output is sublinear in the number of kernels given. We have conducted an extensive comparison against AUKE, and we demonstrate that the code generated by Cain is more efficient and exhibits an almost 4x speed-up when the generated kernel code is executed on the SCAMP5 device. We believe that SCAMP5 is a strong candidate for edge computation, and by providing an easy-to-use yet efficient code generation toolkit, we hope to accelerate the relevant research in this field.
Acknowledgements
We would like to thank Piotr Dudek, Stephen J. Carey, and Jianing Chen at the University of Manchester for kindly providing access to SCAMP5, and their support in our work. This work was partially supported by the EPSRC, grant reference EP/P010040/1.
References
Footnotes
 Available at https://github.com/ed741/cain
 Single-entry matrix. Not to be confused with the identity matrix.
 github.com/najiji/auto_code_cpa/tree/75c017e5ad28c0f3f040fb9f84d7f8727d035baa