# High-Performance Parallel Implementation of Genetic Algorithm on FPGA

###### Abstract

Genetic Algorithms (GAs) are used to solve search and optimization problems in which an optimal solution can be found using an iterative process with probabilistic and non-deterministic transitions. However, depending on the nature of the problem, the time required to find a solution on sequential machines can be high due to the computational complexity of genetic algorithms. This work proposes a parallel implementation of a genetic algorithm on a field-programmable gate array (FPGA). Optimizing the system’s processing time is the main goal of this project. Results for processing time and area occupancy (on FPGA) for various population sizes are analyzed. The accuracy of the GA response when optimizing two-variable functions was also evaluated for the hardware implementation. Moreover, the high-performance implementation proposed in this paper can handle more variables after some adjustments to the hardware architecture.

###### keywords:

Parallel implementation, FPGA, Genetic algorithms, Reconfigurable computing.

## 1 Introduction

In recent years, the increasing number of critical applications involving real-time systems, in conjunction with the growth of integrated circuit density and the continuous reduction in power supply voltages, has made the development of new suitable computational solutions an even harder task. Due to the intense demand in the electronics market for higher processing speeds at shorter time frames, without neglecting energy savings, the technology industry has faced an extremely competitive and challenging scenario in terms of designing hardware solutions to meet this constantly growing demand. One way found by researchers and developers to address such demands is by using algorithm parallelization techniques. Parallel processing is used to manipulate data concurrently, so that while one section of the algorithm is being computed, other stations perform similar operations on another set of data (Rodriguez and Moreno, 2015). Combining hardware implementation with the parallelization of algorithms is often a satisfactory solution for high-performance and higher-speed applications when compared to sequential solutions.

Field Programmable Gate Arrays (FPGAs) are reconfigurable hardware devices suited to this scenario due to the nature of their architecture. Given that FPGAs are huge arrays of configurable gates, they can be programmed to operate as multiple parallel paths in hardware. In this way, there is true parallelization in which the running operations do not need to compete for the same resources, since each one is executed by different gates (Instruments, 2011). The increasing density and price reduction of FPGAs expand the opportunities for developers and researchers to use higher-density FPGA devices for hardware implementations (Jewajinda and Chongstitvatana, 2009); the use of such devices is advantageous since development time and costs are significantly reduced (Tirumalai et al., 2007).

The convergence among genetic algorithms, parallelization techniques and reconfigurable hardware implementation results in this work, which presents a proposal for a parallel implementation of a genetic algorithm on FPGA. This paper focuses on high-performance and critical applications that require nanosecond time constraints to be satisfied. On the other hand, in applications where processing speed is not the critical factor, or is less limiting than the need for low power consumption, it is possible to decrease energy utilization by reducing the clock rate, considering that dynamic power utilization is diminished when an operating frequency lower than the theoretical maximum is used (Uyemura, 2002). Applications that process large data flows can benefit from, and be accelerated by, the implementation developed here. Some application examples are: data mining, tactile internet, massive data processing and bioinformatics.

### 1.1 Related Work

Genetic algorithms and Artificial Intelligence (AI) have long been used in applications of the most diverse areas to optimize and find satisfactory solutions in computing, engineering and other fields. More recently, a wider range of applications and variations of genetic algorithms such as parallel and distributed applications, hardware implementations, new proposals for genetic operators, and hybrid (software and hardware) implementations of genetic algorithms have been observed within the research scenario.

In (Fernando et al., 2008), the authors propose an implementation of a customizable Intellectual Property core (IP core) for FPGA that implements a general-purpose genetic algorithm. In this work, the authors focused on the programmability of the genetic algorithm implemented in the IP core. The customization covers population size, number of generations, crossover and mutation rates, random number generator seeds and the fitness function. One of the work’s highlights is the support for multiple fitness functions. The proposed IP can be programmed with up to eight fitness functions, which can be synthesized in conjunction with the GA and implemented in the same FPGA device. The proposed core also has additional input/output ports that allow the user to add further fitness functions implemented on a second FPGA device or some other external device. The implementation utilized part of the available logical cells of a Xilinx Virtex II Pro (xc2vp30-7ff896). However, since a trade-off between performance and flexibility exists, and since the authors favoured flexibility over performance, the speedup over an analogous software implementation was only marginal.

Hardware implementations of genetic algorithms can also be observed in (Oliveira and Júnior, 2008), (Mengxu and Bin, 2015), (Vavouras et al., 2009) and (Zhu et al., 2007). The work detailed in (Zhu et al., 2007) presented OIMGA, whose strategy was to retain only the ideal individual from the population, drastically reducing memory requirements. The paper (Oliveira and Júnior, 2008) presented a compact implementation of a genetic algorithm on FPGA that represented the population of chromosomes as a vector of probabilities. That work focused on lower consumption of memory, power and space resources in hardware, but it was not fully implemented on FPGA, as it used software written in C++ to compute the fitness function values. The work (Vavouras et al., 2009) proposed a high-speed GA implementation on FPGA. The implementation was based on the HGA proposed by (Scott et al., 1995), the first known GA implementation on FPGA, and the authors claimed that the developed system surpassed any existing or proposed solution according to their experiments. The P-HGAv1, the version of the HGA developed by (Vavouras et al., 2009), was claimed to be parametric, to have low silicon requirements and to support multiple fitness functions. Although the authors focused on the speed of the algorithm and reached a time of milliseconds for each GA generation, this speed may not be compatible with real-time applications that require low latency.

The works presented by (Mingas et al., 2012), (Ionescu et al., 2015) and (Qu et al., 2013) showed applications of GAs to ground mobile robots, the first two being embedded implementations on FPGA. (Mingas et al., 2012) developed, according to the authors, the first GA-based hardware implementation of a simultaneous localization and mapping (SLAM) system for ground robots. The authors achieved significant hardware acceleration compared to a software implementation by exploiting the pipelining and parallelization capabilities of reconfigurable hardware. In this project, the GA genes that made up the population represented possible robot movements based on the previous position. Later, in the work developed by (Ionescu et al., 2015), the goal was to determine the optimal movements considering various aspects such as route tracking and low energy consumption, while avoiding collisions with obstacles. The authors pointed out that the implementation was suitable for real-time use and stated that all GA stages were implemented in hardware modules. The solution presented in this work offered a convergence time of less than milliseconds and used a fraction of the lookup tables (LUTs) available in the FPGA, but the frequency obtained after the synthesis process was not reported. (Qu et al., 2013) developed a genetic algorithm with a coevolutionary strategy for global trajectory planning of several mobile robots. According to (Koza, 1991), co-evolution is the process of mutual adaptation of two or more populations simultaneously, and it was used to reflect the fact that all species co-evolve simultaneously in a given physical environment. The implementation of (Qu et al., 2013) promised an improvement in the genetic operators of conventional GAs and proposed a new genetic modification operator, but these developments were not implemented in hardware.

The implementations seen in (Merabti and Massicotte, 2014), (Sehatbakhsh et al., 2013) and (Chen and Wu, 2011) were GA applications in digital signal processing and control systems embedded on FPGA. (Merabti and Massicotte, 2014) presented a real-time GA for an adaptive filtering application with all modules implemented in hardware, such as the fitness function, selection, crossover, mutation and random number generation functions. The implementation was designed in hardware and, after synthesis, a rate of thousands of generations per second was achieved. Meanwhile, (Sehatbakhsh et al., 2013) proposed a GA for multi-carrier dynamic systems based on filter banks. The authors of (Chen and Wu, 2011) proposed the design and implementation of a PID controller based on GA and FPGA. The researchers stated that the design method of the intelligent PID controller based on FPGA and GA was successfully verified and had advantages such as flexible design, automatic online tuning, high reliability, a short technical development cycle and high execution speed. In this case, each GA chromosome was coded with the controller’s set of gains. Details of FPGA area occupancy and obtained throughput were not reported.

Lastly, (Guo et al., 2016) and (Lotfi et al., 2016) presented parallel and distributed implementations of GAs using FPGAs. (Guo et al., 2016) proposed a solution for parallel genetic algorithms on multiple FPGAs. The use of multiple populations in parallel GAs was based on the idea that population isolation can maintain greater genetic diversity, while communication between populations can make the GAs work together to find good solutions. The implementation of (Guo et al., 2016) was applied to three different benchmarks, including the traveling salesman problem, and the authors stated, from the experimental results, that in a multi-FPGA configuration an average acceleration over a multi-core processor GA was achieved. (Lotfi et al., 2016) introduced GRATER, an automated design workflow for FPGA accelerators that leveraged imprecise computation to increase data-level parallelism and achieve higher computational throughput. In this work, the main idea was to establish a trade-off involving circuit area, energy and performance in exchange for reduced precision. This was achieved through imprecise implementations of specific hardware blocks such as adders and multipliers, since the hardware area reduction resulted in better utilization of data parallelism and, therefore, increased the yield. Also in (Lotfi et al., 2016), genetic programming was used to evolve variants of the input kernel until one was found with ideal assignments, reducing the synthesized kernel area while still stochastically satisfying the desired output quality. The synthesis results on an Altera Stratix V FPGA showed that the reduced area of the approximate kernels produced higher gains with little quality loss compared to the benchmarks.

It is essential to observe that the works presented in the literature propose solutions based on both software and hardware on FPGA. This kind of solution increases flexibility but decreases throughput. Thus, differently from the papers in the literature, this work proposes a high-performance parallel implementation of a GA. The implementation uses a fully parallel strategy in all GA operations in order to maximize the throughput.

### 1.2 Paper organization

In Section 2, a theoretical foundation for the meta-heuristic used here will be presented, covering genetic algorithms, their main characteristics, advantages and different applications. Section 3 will present a detailed description of the architecture development and implementation, describing the various hardware modules used to construct the parallel genetic algorithm. Later, in Section 4, a careful analysis of the results obtained from the implementation described in the previous section will be performed. Simulation results, synthesis on the FPGA and the validation of the proposed architecture will be presented. The analysis will cover parameters such as area occupancy and sampling frequency, taking into account different configurations of the proposed architecture embedded in the reconfigurable hardware. Following this, Section 5 shows a comparison of the results obtained with other similar works found in the state of the art. Finally, Section 6 will present final considerations and conclusions on the results obtained.

## 2 Genetic Algorithms

The GAs are used to solve search and optimization problems where an optimal solution can be found through an iterative process in which the search starts from an initial population and then, combining the best representatives of it, obtains a new population that replaces the previous one (Holland, 1975).

A GA is an iterative algorithm that starts from a population of randomly created chromosomes. The population size is even, in the case of this proposed work, in order to facilitate the implementation. In every -th iteration, also called generation or epoch, the chromosomes are evaluated, selected, recombined and mutated to form a new population of the same number of chromosomes, that is, the entire population of parents is replaced by the new offspring. Then, the new population is used as input to the algorithm’s next iteration (generation), and this procedure of population updating is repeated times, where is the number of GA generations.

Algorithm 1 displays the pseudocode of a GA. This code details all the variables and procedures that will be used in the implementation presented in the following sections. The variable represents the -th chromosome of bits in the -th generation and is a vector that stores all the chromosomes, that is,
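Algorithm 1 itself is not reproduced in this excerpt. As a hedged illustration only, a minimal software sketch of the loop it describes (with hypothetical helpers standing in for the FF, SF, CF and mutation procedures; the paper's actual parameters and operators may differ) could look like:

```python
import random

def run_ga(ff, n_chrom=8, n_bits=8, n_gen=50, seed=0):
    """Minimal GA sketch: evaluate, tournament-select, single-point
    crossover and mutate an even-sized population of n_bits chromosomes,
    minimizing the fitness function ff."""
    rng = random.Random(seed)
    pop = [rng.getrandbits(n_bits) for _ in range(n_chrom)]
    for _ in range(n_gen):
        fit = [ff(c) for c in pop]
        # tournament selection: best of two random contenders (minimization)
        def select():
            a, b = rng.randrange(n_chrom), rng.randrange(n_chrom)
            return pop[a] if fit[a] <= fit[b] else pop[b]
        new_pop = []
        for _ in range(n_chrom // 2):
            p1, p2 = select(), select()
            cut = rng.randrange(1, n_bits)   # single random cut point
            mask = (1 << cut) - 1            # selects the low `cut` bits
            new_pop += [(p1 & ~mask) | (p2 & mask),
                        (p2 & ~mask) | (p1 & mask)]
        # mutate the first individual by XOR with a random word
        # (the paper mutates a group of individuals, not just one)
        new_pop[0] ^= rng.getrandbits(n_bits)
        pop = new_pop
    fit = [ff(c) for c in pop]
    return min(zip(fit, pop))

best_fit, best_x = run_ga(lambda x: (x - 100) ** 2)
```

The full parent population is replaced by the offspring each generation, matching the generational replacement described above.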

(1) |

After the initialization process, the fitness function, called FF (Line 4 of Algorithm 1), calculates the fitness value of the chromosomes of the population. This operation is applied to all chromosomes and results in a respective value for each -th chromosome, where is the number of bits representing the fitness value. The better the fitness value of a chromosome, the more likely it is to remain in subsequent generations. The fitness values of all individuals are stored in

(2) |

After calculating the fitness value of each -th chromosome of the -th generation, the selection operation is performed. In GAs, the purpose of selection is to highlight the chromosomes, alongside their respective fitness values, in order to produce better future populations. There is a great variety of selection methods in the literature, among them selection by ranking, by tournament, roulette selection and elitism. The tournament selection method used in this implementation is one of the most widely used (Noraini and Geraghty, 2011); it holds a competition between two or more randomly chosen chromosomes from the population. This competition consists of comparing the strength (fitness) of all participating chromosomes, and the one holding the best fitness value proceeds in the algorithm to pass its genes forward. The selection function, called here SF (Line 7 of Algorithm 1), has the chromosome and fitness vectors from the -th generation as its inputs and, for each input value, outputs a variable that can assume the value of any of the stored chromosomes. All selected values are grouped in

(3) |

in order to be used in the crossover stage.

The crossover stage in the -th generation occurs after the selection of the fittest chromosomes in the population by the selection function and aims to originate new chromosomes which will, after the mutation stage, compose the next GA generation (updating the population vector). There are several crossover techniques in the literature; the strategy adopted in this implementation is single-point crossover. The crossover operation, called here CF (Line 10 of Algorithm 1), has as input pairs of elements from the selection vector of the -th generation and as output pairs of

(4) |

which stores the chromosomes after the crossover, that is, the new -th offspring.

The last GA step is the mutation operation, which changes the value of a group of chromosomes in order to provide greater diversity to the population, preventing the solution from stabilising in local minima. The mutation rate is the parameter responsible for controlling the number of mutated chromosomes. Normally, the mutation rate ranges from to . The value can be easily calculated by the expression

(5) |

## 3 Hardware Proposed

Figure 1 presents the general architecture of the proposed GA hardware implementation. The entire algorithm was developed using a parallel architecture focused on accelerating the processing speed, taking advantage of the available hardware resources, similarly to (Nedjah and de Macedo Mourelle, 2007). The figure details, in a block diagram, the main subsystems of the proposed implementation, which were encapsulated in order to make the overall visualization of the architecture less complex. It is possible to observe a population of chromosomes of bits in which each element represents the -th chromosome of the population in the -th generation, according to Algorithm 1. Each -th chromosome is stored in a -bit register whose value is updated with the new population of chromosomes produced after the processes of selection, crossover and mutation. This updating process occurs every time the synchronization module, called here SyncM, enables the registers to store new values.

Given that the implementation optimizes two-variable functions, each register stores the values of both binary inputs of the fitness function, using bit concatenation for such storage. The first bits represent the first input of the fitness function, while the second block of bits stores the second input. Thus,

(7) |

where is the concatenation operator.
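As a sketch of this packing scheme, the concatenation in Equation 7 can be modelled with shifts and masks. The ordering below (first variable in the most significant half) is an assumption for illustration, since the excerpt does not state which half holds which variable:

```python
def pack(x1, x2, half_bits):
    """Concatenate two half_bits-wide variables into one chromosome
    register: x1 in the high half, x2 in the low half (assumed order)."""
    mask = (1 << half_bits) - 1
    return ((x1 & mask) << half_bits) | (x2 & mask)

def unpack(chrom, half_bits):
    """Recover the two variables from a packed chromosome, mirroring
    the bit splitters described for the FFM."""
    mask = (1 << half_bits) - 1
    return (chrom >> half_bits) & mask, chrom & mask
```

In hardware this costs nothing at runtime: the "concatenation" is simply the wiring of the two variable buses into one register.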

The initial population of the algorithm is randomly chosen. All random values in the present implementation are generated by pseudo-random number generators based on Linear Feedback Shift Registers (LFSRs) (Deliparaschos et al., 2008; Goresky and Klapper, 2006). Independent -bit LFSRs based on the polynomial of (Goresky and Klapper, 2006) were used. Each generator is labelled CCLFSR, where CC and the indices denote its position in the circuit. In every -th generation, a random variable of bits is produced by each LFSR. To avoid repeating the same sequence of values, each generator has a different initial -bit seed.
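A behavioural sketch of such a generator follows. The tap polynomial below (x^16 + x^14 + x^13 + x^11 + 1, a commonly used maximal-length choice for 16-bit LFSRs) is an assumption for illustration, since the polynomial and bit width used in the paper are not given in this excerpt:

```python
def lfsr16_step(state, taps=(16, 14, 13, 11)):
    """One step of a 16-bit Fibonacci LFSR: XOR the tapped bits
    (numbered 1..16, MSB first) and shift the feedback bit in."""
    fb = 0
    for t in taps:
        fb ^= (state >> (t - 1)) & 1
    return ((state << 1) | fb) & 0xFFFF

def lfsr16_sequence(seed, n):
    """Generate n successive states from a nonzero seed; different
    seeds give different phases of the pseudo-random sequence, which
    is why each hardware LFSR receives a distinct initial value."""
    out, s = [], seed
    for _ in range(n):
        s = lfsr16_step(s)
        out.append(s)
    return out
```

Note that the all-zeros state is a fixed point, which is why hardware LFSRs are always seeded with a nonzero value.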

The notation used in the following diagrams will be of the form where the first element is the variable, the second is the bit word width, and the last represents the generation of the genetic algorithm, ranging from 1 to the total number of generations. In some cases only the bracketed notation will be shown, to represent the number of bits transferred on the bus.

The implementation consists of five main modules called: Fitness Function Module (FFM), the Selection Module (SM), Crossover Module (CM), Mutation Module (MM) and Synchronization Module (SyncM). Each module has its specific implementations that will be detailed in the following sections.

### 3.1 Fitness Function Module - FFM

The Fitness Function Module (Figure 2) has the purpose of calculating the fitness value of each -th chromosome from a fitness function. The proposed structure has one FF module per chromosome; each -th module, called here FFM, is associated with an individual and generates as output, in every -th generation, a fixed-point fitness value expressed by

(8) |

where represents the bit width (equivalent to Line 4 of Algorithm 1).

Not only in the Fitness Function Module but in all other stages, the proposed architecture is capable of solving one- or two-variable problems. In either case, the user does not need to make any adjustments to the input data. The difference between these two options reflects only in how the data is manipulated by the subsequent modules; this does not change the performance of the system and is transparent to the user.

Figure 2 details the operation of the -th FFM. The FFM input value, stored in the RX register, is divided into two halves of bits by the bit splitters FFMDIV1 and FFMDIV2, so that each variable can be operated on independently in the case of two-variable problems.

After the split, the first variable is directed to the ROM memory FFMROM1, which implements its function through a Look-Up Table (LUT), and the second variable is directed to FFMROM2, which implements its function in the same fashion.

After this, both values are added by the FFMADD adder resulting in the variable, where

(9) |

The variable is then directed to the LUT FFMROM3 where the function will be implemented, hence

(10) |

In general, the FFM shown in Figure 2 is able to solve any one- or two-variable problem in the format

(11) |

Expressions with a product between the two variables are not possible in the current approach, but they could be supported through a change in the structure of the FFM.
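The FFM datapath described above (two input ROMs, an adder, and an output ROM) can be sketched as follows. The function names g1, g2 and h are generic placeholders for the LUT contents, since the concrete functions depend on each benchmark; here the final stage is computed directly rather than through a third LUT:

```python
def make_ffm(g1, g2, h, n_bits):
    """Sketch of the FFM datapath, mirroring f(x1, x2) = h(g1(x1) + g2(x2)):
    two input ROMs (FFM_ROM1/FFM_ROM2), an adder (FFM_ADD), and an
    output stage standing in for FFM_ROM3."""
    size = 1 << n_bits
    rom1 = [g1(x) for x in range(size)]   # LUT for g1, indexed by x1
    rom2 = [g2(x) for x in range(size)]   # LUT for g2, indexed by x2

    def ffm(x1, x2):
        s = rom1[x1] + rom2[x2]           # the FFM_ADD stage
        return h(s)                        # FFM_ROM3, computed directly here
    return ffm
```

Changing the optimized function then amounts to rewriting the ROM contents, exactly as noted for the real hardware in Section 4.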

### 3.2 Selection Module - SM

The selection module (SM) implements the tournament selection method, as mentioned in Section 2, by holding a competition between two chromosomes. Similarly to the FFM, there is one SM per group of chromosomes. As detailed in Figure 3, each -th SM, here called SM, has as input the fitness values and chromosomes from the -th generation (equivalent to Line 7 of Algorithm 1).

Each -th SM has two random generators called SMLFSR1 and SMLFSR2. In addition to the random generators, this module is formed by three input multiplexers called here SMMUX1, SMMUX2 and SMMUX3, a bits comparator, called SMCOMP and three two-input multiplexers, called SMMUX4, SMMUX5 and SMMUX6.

The SMMUX1 and SMMUX2 multiplexers are driven by the output signals of the SMLFSR1 and SMLFSR2 generators, respectively. As shown in Figure 3, the output signal of each generator is truncated to its most significant bits in order to match the population size. The SMMUX1 and SMMUX2 multiplexers each select one fitness value, which is related to its corresponding chromosome by its index.

Finally, SMMUX3 selects the chromosome associated with the best fitness value, based on the output of SMMUX6, which determines whether the goal is to maximize or minimize the evaluation function through the SMMAXMIN variable.
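The behaviour of one SM can be sketched as below. The modulo operation stands in for the hardware's truncation of the LFSR outputs to index the population; the `maximize` flag plays the role of SM_MAXMIN:

```python
def tournament_select(pop, fit, r1, r2, maximize=False):
    """One selection module (SM): two pseudo-random words pick two
    contenders; the comparator (SM_COMP) keeps the chromosome whose
    fitness is better according to the maximize/minimize setting."""
    i, j = r1 % len(pop), r2 % len(pop)   # software stand-in for bit truncation
    if maximize:
        return pop[i] if fit[i] >= fit[j] else pop[j]
    return pop[i] if fit[i] <= fit[j] else pop[j]
```

In the hardware, all such modules run in parallel, one per selected output, each fed by its own pair of LFSRs.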

### 3.3 Crossover Module - CM

The crossover module detailed in this section implements single-point crossover. The architecture proposed here contains one crossover module per pair of chromosomes, and each one consists of four bit splitters, two identical crossover submodules, and two concatenators. Similarly to the FFM described in Section 3.1, the CM also has chromosome splitters in order to manipulate the two stored variables independently.

As seen in Figure 4, the two input chromosomes are each split into two halves. The first half of the first chromosome is sectioned by the splitter CMDIV1 and the second half by the splitter CMDIV2. The same happens with the second chromosome, which is sectioned through the splitters CMDIV3 and CMDIV4, respectively.

After separating the variables of each chromosome, they are forwarded to the CM submodules CMPQ1 and CMPQ2, where the crossing is performed. This is done in such a way that the crossover occurs between similar variables; that is, the first variable of one chromosome is crossed with the first variable of the other chromosome.

In the case of single-variable problems, the system works in an equivalent way. Only the least significant half of the variables contains useful data, and only block CMPQ2 handles non-zero data.

Figure 5 presents in detail the circuit of the -th crossover submodule, named CMPQ1. It is composed of a multi-input MUX, called here CMPQMUX, whose purpose is to randomly select one of the possible cutting points. The selection of each CMPQMUX is controlled by the pseudo-random number generator CMPQLFSR1, whose output signal is truncated to its most significant bits before entering the MUX selector.

The selection of the CMPQ1 cut-off point relies on a mask originated from a constant. This constant creates a vector of 1s the size of the chromosome half to be crossed. Then, a random, zero-padded right shift is performed according to the CMPQLFSR1 value. This shift transforms the vector of 1s into a concatenated vector of 0s and 1s of the same size. This mask and its inverse are responsible for carrying out the crossover operation, aided by the AND and OR logic gates also shown in Figure 5.

Equations 13 and 14 exemplify a case where CMPQMUX shifts the value three times

(12) |

(13) |

(14) |

In each -th generation, the two inputs of the module CMPQ1 are divided into head

(15) |

(16) |

and tail

(17) |

(18) |

where is the CMPQMUX output. After this step, the crossover is performed by concatenating the head of parent 1 with the tail of parent 2, and the head of parent 2 with the tail of parent 1, thus giving rise to two new chromosomes of the new population,

(19) |

and

(20) |

For the CMPQ2 submodule the equivalent happens. In this case, the input values will be and and the outputs will be and .

After the similar variables have been crossed within each submodule, they are directed to the outputs of the respective CMPQs, where the concatenators CMCCAT1 and CMCCAT2 give rise to new individuals (chromosomes) of the population by concatenating their constituent parts (Figure 4).
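The mask-based mechanism of Equations 15 through 20 can be sketched as follows. An all-ones word is right-shifted (zero-padded) by a pseudo-random amount, and the mask and its inverse extract the head of one parent and the tail of the other using AND/OR operations:

```python
def crossover_pair(p1, p2, shift, n_bits):
    """Mask-based single-point crossover as in the CMPQ submodules:
    mask = 0...01...1 selects the tail (low) bits, its inverse selects
    the head (high) bits; each child takes the head of one parent and
    the tail of the other."""
    ones = (1 << n_bits) - 1
    mask = ones >> shift                       # zero-padded right shift
    c1 = (p1 & ~mask & ones) | (p2 & mask)     # head(p1) | tail(p2)
    c2 = (p2 & ~mask & ones) | (p1 & mask)     # head(p2) | tail(p1)
    return c1, c2
```

Because the cut is realised with a mask and two gate layers rather than variable-width splitters, the same circuit handles every possible cut point selected by the LFSR.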

It is important to emphasize that after the CMs have performed their operations, new chromosomes forming a new population will have been created. Some of these individuals pass through the MM (described in Section 3.4) before the start of the next generation, but at the end of each iteration new individuals will have been created, so the GA population always keeps the same number of chromosomes.

### 3.4 Mutation Module - MM

As in Line 13 of Algorithm 1, the mutation operation is performed on a group of individuals; that is, there are mutation modules and each -th module, MM, changes the value of the chromosome to be mutated through an XOR operation with a number created randomly by an associated generator called MMLFSR (Figure 6). The MM modifies the first individuals of the population, as shown in Figure 1.

The output of the -th, MM, module in every -th generation can be expressed by

(21) |

where represents the pseudo random number generated by -th MMLFSR.

In the case of single-variable problem optimization, this mutation operation may assign non-zero values to the unused bits of the mutated chromosome. However, this is not a problem, since those bits are zeroed when passing through the FFM in the following generation.
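The XOR mutation is a one-liner in software; the sketch below mirrors one MM, with the random word standing in for the output of the module's MM_LFSR:

```python
def mutate(chrom, rand_word, n_bits):
    """One mutation module (MM): flip the chromosome bits selected by
    the pseudo-random word via XOR, keeping the result within n_bits."""
    return (chrom ^ rand_word) & ((1 << n_bits) - 1)
```

Each set bit in the random word flips the corresponding chromosome bit, so the expected number of flipped bits depends only on the statistics of the LFSR output.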

### 3.5 Synchronization Module - SyncM

Finally, the last module is the synchronization module. Its purpose is to enable the registers responsible for storing the population chromosomes of the genetic algorithm to receive new values. These new values result from the mutation and crossover processes of the previous generation and are stored in the RX registers to initiate a new iteration of the algorithm.

This module contains a counter, a constant value and a comparator, as shown in Figure 7. The enable variable is asserted when the comparison returns a true value, that is, when the counter value matches the value stored in the constant. The value of the constant is chosen according to the implemented design and adjusted to the delay the system needs to perform all of its operations and deliver a new set of chromosomes. The output of this module is a boolean value, while the counter and constant outputs are 2-bit values. This width was chosen because it corresponds to the maximum delay found in the implementation for an entire generation: one delay for each ROM of the FFM.

In all the tests performed in this work the GA operations were performed at a sampling rate

(22) |

where is the time for each -th generation be finished.

Although this is the maximum possible sample rate at which to operate the system, with the corresponding minimum generation time, Equation 22 divides these values by 3, since a new population is originated in the GA only every three clock cycles: there are two delays in the architecture between the beginning of a generation and its end. These two delays, caused by the LUTs contained in the FFM (Section 3.1), determine the frequency at which a new population is produced from an earlier one.

Generally, if the architecture contained components causing system delays, the sample rate of the system would be

(23) |
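The relationship between clock frequency, pipeline delays and generation throughput can be summarised in a small helper. Assuming Equation 23 has the form of a clock frequency divided by the number of delays plus one (consistent with the division by 3 for the two ROM delays described above):

```python
def sample_rate(f_clk_hz, n_delays):
    """Generations-per-second sketch: with n_delays pipeline delays
    between the start and end of a generation, a new population appears
    every n_delays + 1 clock cycles (two FFM ROM delays give Fclk / 3)."""
    return f_clk_hz / (n_delays + 1)
```

For example, with two ROM delays, a 300 MHz clock would yield 100 million generations per second under this model; the paper's actual synthesis frequencies are reported in Section 4.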

## 4 Results

Aiming to validate the proposed implementation of the GA on FPGA, simulations, analyses and syntheses were performed in the optimisation of different functions for various population sizes. The first function, called here F1, used in the tests to validate the proposal, was a one-variable function expressed as

(24) |

The second function, called here F2, was

(25) |

and, lastly, the last function, here called F3, was the function

(26) |

This work implemented and synthesised on FPGA the three functions previously mentioned, for populations of size , , , and and for chromosomes of , , , , and bits.

It is important to emphasise that these functions were chosen for comparison purposes, since they have already been used in previous works in the state of the art, which will be discussed next. However, the proposed implementation is capable of implementing any function in the format shown by Equation 11, requiring only the modification of the values stored in the memories.

All results were obtained using the development platform and an FPGA Virtex 7 xc7vx550t-1ffg1158. The Virtex 7 FPGA used has slices that group flip-flops, logical cells that can be used to implement logic functions or memories, and DSP cells with multipliers and accumulators.

As previously mentioned, three different functions were minimised to validate the implementation. The first one was the function F1, presented in Equation 24 and shown in Figure 8.

This function was chosen because it was previously used by (Vavouras et al., 2009) to validate their proposal of a high-speed genetic algorithm on FPGA. Regarding the implementation of this function as described in Equation 11, the following associations can be made:

(27) |

(28) |

(29) |

Therefore, F1 can be represented by

(30) |

The second function is presented in Figure 9 and was previously used in (Fernando et al., 2008) to validate the implementation of a customisable general-purpose GA IP core.

Regarding the implementation of F2 as described in Equation 11, the following associations can be made:

(31)

(32)

(33)

Therefore, F2 can be represented by

(34)

Finally, Figure 10 shows the last function used to validate the proposal presented here. This function was previously used for the same purpose in (Guo et al., 2016) and (Qu et al., 2013). Both works use GA, but only (Guo et al., 2016) implements the algorithm on FPGA.

Similarly, F3 can be expressed in terms of the parameters of Equation 11 as follows:

(35)

(36)

(37)

Therefore, F3 can be represented by

(38)

The parameters used in the experiments were based on configurations from previous experiments found in the literature together with some empirically obtained configurations. For the number of generations , it was experimentally observed that, for all evaluated functions, the minimum value sought was obtained before the GA generations were reached. This number agrees with the literature, as can be seen in (Vavouras et al., 2009), for example. Thus, was adopted as the default value for the optimisation experiments performed here.

Similarly, the GA population sizes to be implemented and synthesised on the FPGA were determined. As seen in (Nedjah and de Macedo Mourelle, 2007), the population size was . In (Scott et al., 1995) the population had size , and in (Deliparaschos et al., 2008) the GA was implemented with populations of sizes and . Thus, the architecture proposed here was implemented for the five population sizes mentioned before. The aim behind these different sizes of was to compare how much this parameter influences convergence, speed and area occupation on the FPGA.

Finally, for the same purpose of comparing how much a parameter influences convergence and synthesis characteristics, the bit length was varied for all population sizes, as also noted previously. Figures 11 and 12 depict the operation of the proposed GA for the fitness functions F1 and F3, respectively.

Fitness function 1 (Equation 24) was minimised using the GA with a population size of and . Thus, bits were used for each variable of the fitness function. Given that this is a single-variable problem, the function made use of only bits. For the minimisation shown in Figure 11, the range of F1 was to . Thus, the minimum possible value in the range is . As depicted, the global minimum was reached approximately halfway through the generations, demonstrating the functionality of the proposed system.

Similarly, fitness function 3 (Equation 26) was also minimised, as shown in Figure 12, but with a population size of and . In this case, F3 only allows results greater than or equal to zero when working in the domain of real numbers, so the smallest possible value is zero. The parallel genetic algorithm on FPGA proposed herein managed to minimise F3 in a little over iterations of the algorithm. This is not a fixed value, since the GA is a stochastic algorithm, but it gives an idea of the number of generations required for convergence.
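The convergence behaviour described above can be sketched with a minimal software GA. This is an illustrative model only: the population size, bit width, rates and the stand-in fitness function below are assumptions, not the paper's hardware parameters or F1-F3.

```python
import random

# Minimal sketch of the GA loop: truncation selection, one-point
# crossover and bit-flip mutation over fixed-width binary chromosomes.
# All parameters are illustrative assumptions.
N_BITS, POP, GENS = 16, 32, 300
random.seed(0)

def decode(c, lo=-4.0, hi=4.0):
    """Map an N_BITS integer chromosome to a real value in [lo, hi]."""
    return lo + (hi - lo) * c / (2 ** N_BITS - 1)

def fitness(c):
    x = decode(c)
    return x * x  # minimise f(x) = x^2, a stand-in convex function

def evolve():
    pop = [random.getrandbits(N_BITS) for _ in range(POP)]
    for _ in range(GENS):
        pop.sort(key=fitness)
        parents = pop[:POP // 2]               # keep the best half (elitist)
        children = []
        while len(children) < POP - len(parents):
            a, b = random.sample(parents, 2)
            cut = random.randrange(1, N_BITS)  # one-point crossover
            mask = (1 << cut) - 1
            child = (a & mask) | (b & ~mask)
            if random.random() < 0.05:         # bit-flip mutation
                child ^= 1 << random.randrange(N_BITS)
            children.append(child & (2 ** N_BITS - 1))
        pop = parents + children
    return min(fitness(c) for c in pop)

best = evolve()  # best fitness found; approaches 0 over the generations
```

Because the best individuals are always retained, the minimum fitness is non-increasing across generations, which is the monotone convergence visible in Figures 11 and 12.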

Both results were obtained from the average of multiple runs. It is also important to emphasise that parameters such as the range of values to be calculated, the bit width (), the decimal precision and the possibility of exploring negative numbers are all parameters of the LUT (Section 3.1) and configurable by the user. As already mentioned, the option of maximising or minimising the function to be optimised is also a configurable variable.

Table 1 presents the synthesis results on the target FPGA for various population sizes and = 20. It is clear that, in all scenarios, the area occupation, the clock and, consequently, the number of generations per second, , are parameters considerably sensitive to the population size, . Here, represents the number of generations performed by the GA per second, which can also be interpreted as the number of candidate solutions the system provides in that interval. Equation 22 states that this number is equal to , that is, the inverse of the time of each GA generation divided by three. This is explained by the two delays the system incurs when placing two ROM memories in series in the FFM described in Section 3.1. Consequently, a new GA population is generated only after three system clocks.

The clock is the maximum frequency at which the system runs for this architecture, and it does not take into account the delays required to generate a new population. The clock represents only the hardware speed for that specific implementation, so it is higher than the number of generations performed by the GA per second.

Registers (Flip-flops) | Logic Cells (LUTs) | Clock (MHz) | Generations per Second
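The relation between the synthesis clock and the generation rate described above can be written down directly. The clock value used below is illustrative, not a synthesis result from Table 1.

```python
# Sketch of the relation described in the text: with two ROM stages in
# series in the FFM, a new population appears only every 3 system
# clocks, so the generation rate is the clock frequency divided by 3.

PIPELINE_DELAY_CYCLES = 3  # 1 base cycle + 2 ROM delays in series

def generations_per_second(clock_hz):
    """Generation rate implied by the clock and the pipeline delay."""
    return clock_hz / PIPELINE_DELAY_CYCLES

f_clk = 300e6  # assumed 300 MHz synthesis clock (illustrative value)
gps = generations_per_second(f_clk)
```

Equivalently, the time per generation is three clock periods, so halving the pipeline depth would proportionally raise the generation rate at the same clock.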

The area occupation in registers (Figure 13), presented in the second column of Table 1, is mainly due to the storage of population values in RX (Figure 1) and the pseudo-random number generators. This occupation increases linearly with , since the larger the population, the greater the number of RX registers required, as well as the number of operations that require the pseudo-random number generators. Figure 13 shows this growth graphically with a linear interpolation.

The logical cell (LUT) occupation, presented in the third column of Table 1, grows with , and not linearly, as can be seen in Figure 14. This nonlinear growth is caused by the selection module (Subsection 3.2): for each -th module, SM, there are three -input multiplexers (SMMUX1, SMMUX2 and SMMUX3).

According to (Chapman, 2014), each Virtex 7 logical cell can implement a 4-to-1 MUX; thus, to build a -input multiplexer, approximately logical cells are required, totalling approximately cells for each SM (SMMUX4, SMMUX5 and SMMUX6 have not been considered). Since there are SM modules, there are approximately logical cells for each bit of the input bus of the MUX. This explains the nonlinear growth in the use of logical cells.
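The rough LUT accounting above can be sketched as follows. It assumes, per the Chapman application note, that one logical cell implements a 4-to-1 MUX, and builds a wide MUX as a tree of such cells; the population size, bus width and the three-MUX-per-SM count are taken from the text, while the tree construction is an illustrative assumption.

```python
import math

# Rough LUT-count estimate for the selection-module multiplexers,
# assuming one logical cell = one 4-to-1 MUX (Chapman, 2014) and that
# an n-input MUX is built as a tree of 4-to-1 stages.

def luts_per_mux(n_inputs):
    """Approximate LUTs for a 1-bit-wide n-input MUX (tree of 4:1 MUXes)."""
    luts, stage = 0, n_inputs
    while stage > 1:
        stage = math.ceil(stage / 4)  # each 4:1 cell merges four signals
        luts += stage
    return luts

def selection_module_luts(pop_size, bus_width, muxes_per_sm=3):
    """Total LUTs for the SM MUXes: one SM per individual, 3 MUXes each."""
    return pop_size * muxes_per_sm * luts_per_mux(pop_size) * bus_width
```

Since both the number of SM modules and the width of each MUX grow with the population size, the total is roughly quadratic in the population, matching the nonlinear LUT growth in Figure 14.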

In this context, it is important to note that the implementation with individuals does not reach even one fifth of the FPGA cells (around of the Virtex 7). This is a positive indicator for implementations with larger populations.

Finally, regarding the table, the last two columns show the clock and the number of generations performed by the GA per second for each value of ; a speed reduction is observed as the population grows. Theoretically, if all the modules were independent (specific to each individual) this reduction should not happen; however, in the selection modules, SM (Figure 3.2), there is a dependency between the chromosomes (due to information sharing) that couples parts of the circuit and thus increases the processing time. On the other hand, the reduction rate is not linear, which favours the implementation. Another important point is that, even with the reduction, each GA generation of chromosomes is generated in , in other words, millions of generations every ms. This result has a very significant impact and makes the use of the GA possible in several real-time embedded applications such as robotics, telecommunications and others.

Figure 15 shows the influence of the bit width on the clock for a GA with . A decrease in processing speed with the increase in the number of bits is noted; however, this drop is not significant. The clock variation is only slightly more than MHz when the implementation using is compared with the implementation using . The interpolation shown in the figure suggests a linear drop.

The last figure (Figure 16) illustrates the relationship between the increase of LUTs used in the FPGA and the variation of the bit width for three different population sizes. A larger difference is observed between the quantities of LUTs used in , mainly due to the nonlinear growth of these components compared to . However, as already seen in Figure 15, the increase of is also a factor responsible for slowing the processing speed, .

Analysing the synthesis results, it was noticed that different fitness functions such as F1, F2 and F3 did not result in significant differences in the consumption of LUTs and registers in the FPGA, and no significant differences were observed in either. This result was expected, since the only variation that occurs when changing the fitness function is the content of the FFM LUTs. Thus, it is possible to extend this reasoning and assert that the values of Table 1 hold for any other function, in the parameters of Equation 11, using bits.

## 5 Comparisons with state of the art works

In the following, the results obtained by the proposed implementation are compared with equivalent results found in state-of-the-art works. The comparisons shown below, summarised in Table 2, were made with parameters as similar as possible. The first column of the table lists the compared references, the next two columns show the parameters of the compared GAs, then the times obtained by the state-of-the-art works are shown and, finally, the results obtained by the implementation presented here and the respective speedups are displayed.

The system presented by (Vavouras et al., 2009), a high-speed implementation of a GA on FPGA, demonstrated a runtime of milliseconds for a GA implemented on FPGA with generations and a population of size . For the same settings, the system proposed here achieved a time of microseconds, which proves to be faster.

Similarly, the implementation presented by (Deliparaschos et al., 2008) also presented a GA on FPGA with population size , chromosome size and generations. The authors validated their proposal with the travelling salesman problem and reported a running time of s. Although a test with the same parameters of (Deliparaschos et al., 2008) was not performed here, a comparison can still be made due to the versatility of problems solved by different LUT contents, as shown in the FFM. A GA with , and can be solved in microseconds by the work presented here, meaning a time faster than in (Deliparaschos et al., 2008).

Likewise, the work of (Fernando et al., 2008) presented a highly programmable GA IP core on FPGA. For a setting of and , the authors stated a speedup of over an equivalent software implementation, which achieved a running time of milliseconds. For comparison, the implementation presented here handled the equivalent situation in microseconds, which represents a speedup of over the serial implementation shown in (Fernando et al., 2008), in other words, a time less than that of the GA IP core proposed by (Fernando et al., 2008).

Finally, the implementation proposed here can also be compared to the work published in (Zhu et al., 2007). As already mentioned in Section 1.1, this article presents the OIMGA, an FPGA implementation of a monogenetic algorithm that retains only the best chromosome of each generation. In one of the tests performed to validate the implementation, the authors optimised a one-variable function with a population of in seconds. In a scenario where the proposed parallel GA took this time to solve the same function, it would process million generations. Of course, this value is unreasonable for that function. As shown previously in the results, generations was the default value to optimise functions of one or two variables; thus, even if the number of generations needed to optimise the same function were (a generous estimate), the time resulting from the implementation proposed by (Zhu et al., 2007) would still be higher.

## 6 Conclusion

After the presentation of the results in Section 4, it can be affirmed that the proposed implementation was in fact validated and fulfilled its objective of being a high-performance parallel implementation of a GA. The synthesis results confirmed that the proposed parallel implementation of a GA on FPGA is able to optimise a wide range of functions in a viable time for critical applications that require short time constraints or a large amount of data to be processed in a short interval.

The comparisons with other implementations found in the literature in Section 5 reinforce the high speed achieved by the implementation developed. This enables the use of this system in a commercial context for applications such as the Tactile Internet, robotics, real-time applications and medical applications. In addition, this system has proven to be an acceleration tool for any hardware system that makes use of genetic algorithms.

As well as the high performance achieved, the small area consumption of the implementation developed here is a notable feature. This makes it possible for other systems to also be embedded in the FPGA, since the on-board GA occupies less than of the logic cells of the Virtex 7 used as a test. This low consumption of logical cells is essential for applications where area is the biggest constraint, such as space applications.

The experiments carried out showed that the tested sizes of are sufficient to solve most practical problems, as the literature indicates. It was found that the duration in iterations () of the GA does not need to be greater than a few hundred generations, or even . The parameter proved to be of great importance, since it directly affects the GA convergence speed, the area occupied on the FPGA, the response precision and the achieved .

## References

- Rodriguez and Moreno (2015) A. Rodriguez, F. Moreno, Evolutionary computing and particle filtering: A hardware-based motion estimation system, IEEE Transactions on Computers 64 (2015) 3140–3152.
- Instruments (2011) N. Instruments, 2011, Understanding parallel hardware: Multiprocessors, hyperthreading, dual-core, multicore and fpgas, URL: http://www.ni.com/tutorial/6097/en/.
- Jewajinda and Chongstitvatana (2009) Y. Jewajinda, P. Chongstitvatana, Hardware architecture and fpga implementation of a parallel elitism-based compact genetic algorithm, in: TENCON 2009-2009 IEEE Region 10 Conference, IEEE, 2009, pp. 1–6.
- Tirumalai et al. (2007) V. Tirumalai, K. G. Ricks, K. A. Woodbury, Using parallelization and hardware concurrency to improve the performance of a genetic algorithm, Concurrency and Computation: Practice and Experience 19 (2007) 443–462.
- Uyemura (2002) J. P. Uyemura, Introduction to VLSI circuits and systems, Wiley India, 2002.
- Fernando et al. (2008) P. Fernando, H. Sankaran, S. Katkoori, D. Keymeulen, A. Stoica, R. Zebulum, R. Rajeshuni, A customizable fpga ip core implementation of a general purpose genetic algorithm engine, in: Parallel and Distributed Processing, 2008. IPDPS 2008. IEEE International Symposium on, IEEE, 2008, pp. 1–8.
- Oliveira and Júnior (2008) T. C. Oliveira, V. P. Júnior, An implementation of compact genetic algorithm on fpga for extrinsic evolvable hardware, in: Programmable Logic, 2008 4th Southern Conference on, IEEE, 2008, pp. 187–190.
- Mengxu and Bin (2015) F. Mengxu, T. Bin, Fpga implementation of an adaptive genetic algorithm, in: 2015 12th International Conference on Service Systems and Service Management (ICSSSM), IEEE, 2015, pp. 1–5.
- Vavouras et al. (2009) M. Vavouras, K. Papadimitriou, I. Papaefstathiou, High-speed fpga-based implementations of a genetic algorithm, in: Systems, Architectures, Modeling, and Simulation, 2009. SAMOS’09. International Symposium on, IEEE, 2009, pp. 9–16.
- Zhu et al. (2007) Z. Zhu, D. J. Mulvaney, V. A. Chouliaras, Hardware implementation of a novel genetic algorithm, Neurocomputing 71 (2007) 95–106.
- Scott et al. (1995) S. D. Scott, A. Samal, S. Seth, Hga: A hardware-based genetic algorithm, in: Third International ACM Symposium on Field-Programmable Gate Arrays, 1995, pp. 53–59. doi:10.1109/FPGA.1995.241945.
- Mingas et al. (2012) G. Mingas, E. Tsardoulias, L. Petrou, An fpga implementation of the smg-slam algorithm, Microprocessors and Microsystems 36 (2012) 190–204.
- Ionescu et al. (2015) L.-M. Ionescu, A. Mazare, A.-I. Lita, G. Serban, Fully integrated artificial intelligence solution for real time route tracking, in: 2015 38th International Spring Seminar on Electronics Technology (ISSE), IEEE, 2015, pp. 536–540.
- Qu et al. (2013) H. Qu, K. Xing, T. Alexander, An improved genetic algorithm with co-evolutionary strategy for global path planning of multiple mobile robots, Neurocomputing 120 (2013) 509–517.
- Koza (1991) J. R. Koza, Genetic evolution and co-evolution of computer programs, Artificial life II 10 (1991) 603–629.
- Merabti and Massicotte (2014) H. Merabti, D. Massicotte, Hardware implementation of a real-time genetic algorithm for adaptive filtering applications, in: Electrical and Computer Engineering (CCECE), 2014 IEEE 27th Canadian Conference on, IEEE, 2014, pp. 1–5.
- Sehatbakhsh et al. (2013) N. Sehatbakhsh, M. Aliasgari, S. M. Fakhraie, Fpga implementation of genetic algorithm for dynamic filter-bank-based multicarrier systems, in: Design & Technology of Integrated Systems in Nanoscale Era (dtis), 2013 8th International Conference on, IEEE, 2013, pp. 72–77.
- Chen and Wu (2011) Y. Chen, Q. Wu, Design and implementation of pid controller based on fpga and genetic algorithm, in: Electronics and Optoelectronics (ICEOE), 2011 International Conference on, volume 4, IEEE, 2011, pp. V4–308.
- Guo et al. (2016) L. Guo, A. I. Funie, D. B. Thomas, H. Fu, W. Luk, Parallel genetic algorithms on multiple fpgas, ACM SIGARCH Computer Architecture News 43 (2016) 86–93.
- Lotfi et al. (2016) A. Lotfi, A. Rahimi, A. Yazdanbakhsh, H. Esmaeilzadeh, R. K. Gupta, Grater: An approximation workflow for exploiting data-level parallelism in fpga acceleration, in: 2016 Design, Automation & Test in Europe Conference & Exhibition (DATE), IEEE, 2016, pp. 1279–1284.
- Holland (1975) J. H. Holland, Adaptation in natural and artificial systems: an introductory analysis with applications to biology, control, and artificial intelligence., U Michigan Press, 1975.
- Noraini and Geraghty (2011) M. R. Noraini, J. Geraghty, Genetic algorithm performance with different selection strategies in solving tsp, World Congress on Engineering 2011 Vol II (2011).
- Nedjah and de Macedo Mourelle (2007) N. Nedjah, L. de Macedo Mourelle, An efficient problem-independent hardware implementation of genetic algorithms, Neurocomputing 71 (2007) 88–94.
- Deliparaschos et al. (2008) K. Deliparaschos, G. Doyamis, S. Tzafestas, A parameterised genetic algorithm ip core: Fpga design, implementation and performance evaluation, International Journal of Electronics 95 (2008) 1149–1166.
- Goresky and Klapper (2006) M. Goresky, A. Klapper, Pseudonoise sequences based on algebraic feedback shift registers, IEEE Transactions on Information Theory 52 (2006) 1649–1662.
- Chapman (2014) K. Chapman, Multiplexer design techniques for datapath performance with minimized routing resources, Application Note: Spartan-6 Family, Virtex-6 Family, 7 Series FPGAs, 2014.