Effective reinforcement learning based local search for the maximum k-plex problem
Abstract
The maximum k-plex problem is a computationally complex problem which emerged from graph-theoretic social network studies. This paper presents an effective hybrid local search for solving the maximum k-plex problem that combines the recently proposed breakout local search algorithm with a reinforcement learning strategy. The proposed approach includes distinguishing features such as: a unified neighborhood search based on a swapping operator, a distance-and-quality reward for actions and a new parameter control mechanism based on reinforcement learning. Extensive experiments for the maximum k-plex problem (k = 2, 3, 4, 5) on 80 benchmark instances from the second DIMACS Challenge demonstrate that the proposed approach can match the best-known results from the literature in all but four problem instances. In addition, the proposed algorithm is able to find 32 new best solutions.
Keywords: Heuristic, Local search, Reinforcement learning, NP-hard, k-plex
1 Introduction
Let G = (V, E) be a simple undirected graph, where V is the set of vertices and E is the set of edges. Given a positive integer k, a k-plex is a subset S of V such that each vertex of S is adjacent to at least |S| − k vertices in the subgraph induced by S. Formally, let N(v) = {u ∈ V : {u, v} ∈ E} be the set of vertices adjacent to v; the maximum k-plex problem with any fixed k (k ≥ 1) aims to find a k-plex S of maximum cardinality, such that |N(v) ∩ S| ≥ |S| − k holds for each v ∈ S. The k-plex model first arose in the context of graph-theoretic social network analysis [22] and has become popular in several other contexts [6, 7, 9]. The k-plex problem with any fixed positive integer k is NP-complete [2]; it reduces to the well-known maximum clique problem (MC) when k = 1, one of Karp's 21 NP-complete problems [15]. The k-plex problem has a number of applications in information retrieval, coding theory, signal transmission, social networks and classification theory, amongst others [8, 16, 20, 26].
Due in part to the wide variety of real-world applications, increased research effort is being devoted to solving this problem. Over the past few years, several exact algorithms have been proposed for finding the maximum k-plex of a given graph [2, 17, 19, 24, 26]. These methods can find optimal solutions for graphs with around a thousand vertices in a reasonable amount of computing time (within 3 hours). However, they often fail to solve larger instances of the problem. On the other hand, several heuristic approaches have also been presented, which are able to find high-quality solutions for larger problem instances [10, 18, 28]. Two of these approaches [10, 18] are based on the general GRASP framework [21], whilst the other [28] uses the tabu search metaheuristic.
One notices that, compared with the research effort on the MC problem, studies of the maximum k-plex problem are more recent and less abundant. In this work, we are interested in approximately solving representative large-scale k-plex instances by presenting an effective heuristic approach. Moreover, reinforcement learning techniques have been shown to improve the performance of local search algorithms [5, 27]; hence, we are also interested in investigating reinforcement learning based local search for solving the maximum k-plex problem. We apply the recent breakout local search and reinforcement learning together to explore the search space (the resulting algorithm is denoted BLS-RLE). It uses descent search to discover local optima before applying sophisticated diversification strategies to move to unexplored regions of the search space. BLS-RLE integrates several distinguishing features to ensure that the search process is effective. Firstly, a reinforcement learning technique is applied to adaptively and interdependently control three parameters deciding the type of perturbation to be applied and the magnitude of that perturbation. Secondly, the search is driven by a unified (p,1)-swap operator, used to explore the constrained neighborhoods of the problem. Finally, a distance-and-quality reward is used to maintain a high-quality set of parameters based on previous experience.
We evaluate BLS-RLE for the k-plex problem (with k = 2, 3, 4, 5) on a set of 80 large and dense graphs from the second DIMACS Challenge benchmark. Comparisons are performed against a number of state-of-the-art methods from the literature, and the computational results show that BLS-RLE is able to achieve the best-known results for all of the instances tested except 4 cases. In particular, BLS-RLE finds 11 new best solutions for k = 2, 7 for k = 3, 7 for k = 4 and 7 for k = 5.
The rest of the paper is organized as follows. Section 2 presents the proposed reinforcement learning based local search algorithm for solving the maximum k-plex problem. Sections 3 and 4 describe the experimental setup and the computational results of BLS-RLE for the k-plex problem with k = 2, 3, 4, 5, together with comparisons against state-of-the-art algorithms from the literature. Before concluding, Section 5 investigates and analyzes the influence of reinforcement learning on the local search algorithm.
2 The reinforcement learning based local search algorithm
The proposed reinforcement learning based local search algorithm (BLS-RLE) follows the recent general learning-based local search framework, which was first introduced in [5] and applied to the vertex separator problem. BLS-RLE combines an intensification stage that applies descent local search with a distinctive diversification stage that adaptively selects between two or more perturbation operators with a particular perturbation magnitude. The type of perturbation operator, the number of perturbation moves (the depth of perturbation) and the degree of random perturbation are three important self-adaptive parameters that control the degree of diversification during perturbation. Determining the values of these three parameters independently may not yield the most suitable degree of diversification at a given stage of the search. Moreover, [12] highlights that automated configuration and parameter tuning techniques are very important for effectively solving difficult computational problems. Hence, unlike the parameter tuning techniques introduced in [12], BLS-RLE uses a parameter control mechanism based on reinforcement learning [1] to adaptively and interdependently determine the parameter values.
2.1 General procedure
The overall BLS-RLE algorithm is summarized in Algorithm 1. Following the pre-learning phase, the first step of each iteration of BLS-RLE consists of selecting an action (a parameter triple) according to the Softmax action-selection rule [23] (line 4). Although we use Softmax, any action-selection model could be used as an alternative. The diversification procedure is then applied to the current local optimum using the selected triple (line 5), followed by descent local search to improve the quality of the perturbed solution (line 6). Furthermore, a global variable is used to record the best k-plex solution discovered during the search (lines 7-9). BLS-RLE then applies the reward procedure to determine the reward for the selected action (parameter triple) by considering both the quality and the diversity criteria (line 10). Finally, the action value corresponding to the selected triple is updated (line 11).
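The Softmax (Boltzmann) action-selection step of line 4 can be sketched as follows; this is a minimal illustration of the rule, not the authors' implementation, and the temperature value and action values used below are illustrative only.

```python
import math
import random

def softmax_select(values, tau=2.0, rng=random):
    """Pick an action index with probability proportional to
    exp(value / tau), the Gibbs/Boltzmann distribution; tau is the
    temperature controlling how greedy the selection is."""
    weights = [math.exp(v / tau) for v in values]
    total = sum(weights)
    r = rng.random() * total
    acc = 0.0
    for i, w in enumerate(weights):
        acc += w
        if r <= acc:
            return i
    return len(values) - 1  # guard against floating-point round-off
```

With a low temperature the action with the highest value is selected almost deterministically; a high temperature makes the selection close to uniform.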
A parameter triple determines the perturbation type, the perturbation magnitude and the degree of diversification: its components give the number of perturbation moves, the probability of selecting one type of perturbation operator over the other, and the degree of random perturbation, which adjusts the strength of random perturbation. A set collects all the parameter triples to be generated. Here we use one of two types of perturbation operator, directed or random. A novel method for generating parameter triples is proposed, with the number of perturbation moves taking values from the piecewise function in Eq. (1). The linear nature of the first component reflects a more fine-grained approach to diversification, using a larger number of values when the perturbation level is low. As the perturbation level increases, the number of potential values decreases, with the values themselves increasing exponentially. This provides a greater level of diversification within a smaller number of potential values at the higher end.
(1) 
This method of defining the set of potential depth values, and subsequently the set of triples, contrasts with previous work [5], where a linear relationship between consecutive values is maintained throughout the range. As for the two remaining parameters, their values range from 95 to 100 and from 70 to 90, respectively.
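The triple-generation scheme can be sketched as below. The breakpoint (10), the number of exponential levels and the concrete probability/strength values are assumptions chosen to mirror the ranges quoted above; the paper's exact constants are not reproduced here.

```python
import itertools

def perturbation_depths(low_max=10, levels=4):
    """Hypothetical schedule for the number of perturbation moves:
    fine-grained consecutive values while the depth is small, then
    exponentially spaced values at the higher end."""
    depths = list(range(1, low_max + 1))                 # linear part: 1..low_max
    depths += [low_max * 2 ** i for i in range(1, levels + 1)]  # 20, 40, 80, 160
    return depths

def make_triples(depths, directed_probs=(0.70, 0.80, 0.90),
                 random_strengths=(0.95, 1.00)):
    """Cartesian product of the three parameter ranges; the sample
    points mirror the 70-90 and 95-100 ranges quoted in the text."""
    return list(itertools.product(depths, directed_probs, random_strengths))
```

Each element of the product is one candidate action (depth, perturbation-type probability, random-perturbation strength).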
The algorithm uses a one-time procedure to evaluate the degree of diversification introduced by each parameter triple, which can then be exploited throughout the search process. For this purpose, the degree of diversification refers to the ability of the search to discover new local optima. To limit the number of possible actions and to reduce the time required to learn, BLS-RLE maintains a small subset of parameter triples, representing the potential actions that can be performed at a given time. This learning list is periodically updated (every 4000 iterations, see Table 1) with a new parameter triple from the full set, based on the action values learned during the pre-learning procedure.
2.2 Pre-learning procedure
Iterated local search is used to assess the diversification capability of each parameter triple, based on how frequently a new local optimum is found. The number of times a previously encountered local optimum is visited by each triple is recorded, with the counter for each triple initially set to 0. The detailed procedure is as follows:

1. A perturbation operator is selected and applied, based on the perturbation-probability parameter of the triple. The operator makes the prescribed number of moves, with a hash table used to record recently visited local optima as historical information.

2. Following this, descent local search attempts to improve the solution, returning a local optimum.

3. Finally, a check is performed to see whether this local optimum has already been encountered. If so, the count of revisited local optima for the corresponding triple is increased by 1.

4. Steps 1 to 3 are repeated until each parameter triple has been used.

This process is repeated for a fixed number of rounds, chosen to provide enough samples from which a ranking of triples can be derived. Once completed, the parameter triples are sorted in ascending order of the number of times they revisited previously encountered local optima, with the best-ranked triples kept in the learning set. Note that this value is experimentally set to 2375, and that this ranking will also be used by the reward and value functions.
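The pre-learning loop above can be sketched as follows; `perturb` and `descent` stand in for the problem-specific procedures, solutions are modelled as vertex sets, and all names are ours (a sketch of the bookkeeping, not the authors' code).

```python
def prelearn(triples, perturb, descent, start, rounds=100, keep=6):
    """Rank parameter triples by how rarely they lead back to an
    already-visited local optimum; `perturb(sol, t)` and `descent(sol)`
    are problem-specific callbacks supplied by the caller."""
    revisits = {t: 0 for t in triples}
    seen = set()                        # hash table of visited local optima
    sol = descent(set(start))
    seen.add(frozenset(sol))
    for _ in range(rounds):
        for t in triples:
            sol = descent(perturb(sol, t))
            key = frozenset(sol)
            if key in seen:
                revisits[t] += 1        # this triple revisited a known optimum
            else:
                seen.add(key)
    ranked = sorted(triples, key=lambda t: revisits[t])
    return ranked[:keep], revisits      # learning list and the raw counts
```

Triples with the lowest revisit counts are the most diversifying and form the initial learning list.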
2.3 Intensification of search
The intensification stage of BLS-RLE aims to find better solutions through descent local search. For this purpose, BLS-RLE employs the (0,1)-swap move operator to try to improve a solution S. Recall that N(v) is the set of vertices adjacent to v; let V \ S be the complement of S and N(S) be the set of vertices in V \ S that are adjacent to at least one vertex in S, i.e., N(S) = {v ∈ V \ S : N(v) ∩ S ≠ ∅}. For a vertex u in S, if |N(u) ∩ S| = |S| − k, then u is called a critical vertex; let C(S) denote the set of critical vertices. The set M, shown in Eq. (2), consists of all vertices in N(S) that are adjacent to at least |S| + 1 − k vertices in S and are also adjacent to all of the critical vertices [24, 28]. Descent local search includes a new vertex from M to increase the cardinality of the current k-plex S, while maintaining the feasibility of the k-plex. When M is empty, the intensification stage is complete and the local optimum found is returned.
M = {v ∈ N(S) : |N(v) ∩ S| ≥ |S| + 1 − k and C(S) ⊆ N(v)}    (2)
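The descent step and the construction of the candidate set can be sketched as follows; `adj` maps each vertex to its neighbour set, the set and function names are ours, and the deterministic vertex choice is an assumption for reproducibility (a sketch, not the authors' implementation).

```python
def addable_vertices(adj, S, k):
    """Vertices that may join S while keeping it a k-plex: a candidate v
    needs at least |S|+1-k neighbours in S and must be adjacent to every
    critical vertex (a member of S with exactly |S|-k neighbours in S)."""
    critical = {u for u in S if len(adj[u] & S) == len(S) - k}
    return {v for v in adj if v not in S
            and len(adj[v] & S) >= len(S) + 1 - k
            and critical <= adj[v]}

def descent(adj, S, k):
    """Intensification by repeated (0,1)-swaps: keep adding a vertex
    from the candidate set until it is empty, then return the local
    optimum."""
    S = set(S)
    while True:
        M = addable_vertices(adj, S, k)
        if not M:
            return S
        S.add(min(M))   # deterministic choice for reproducibility
```

For k = 1 the critical-vertex condition forces every candidate to be adjacent to all of S, so the procedure degenerates to greedy clique growing, as expected from Section 1.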
2.4 Perturbation operators
In order to escape from local optima, directed or random perturbations are used to guide the search towards unexplored regions. Directed perturbation aims to minimize the degradation of the k-plex cardinality, while random perturbation aims to move the search away from its current location. The choice of operator is governed by the probability of using the directed perturbation operator, with the random perturbation applied otherwise. If the random perturbation is chosen, the degree-of-randomness parameter controls its strength (at its maximum value the random perturbation amounts to a random restart). The number of perturbation moves is the third parameter, with all three parameters controlled interdependently by the reinforcement learning strategy.
The proposed approach uses a unified (p,1)-swap move to explore the search space, which exchanges one entering vertex against p leaving vertices. Four vertex sets are involved in this procedure; the set M has already been presented above. The set B consists of vertices v ∈ N(S) that are adjacent to at least |S| − k vertices in S and for which there is a unique critical vertex u not adjacent to v, as shown in Eq. (3). The set C consists of vertices v ∈ N(S) that are adjacent to exactly |S| − k vertices in S; these vertices must be adjacent to all of the critical vertices, as shown in Eq. (4) [28]. A corresponding exchanged vertex u is randomly selected from S \ N(v). A vertex from C can be employed by the (1,1)-swap move operator, such that the search is guided to a new region while leaving the quality of the solution unchanged. The set T, shown in Eq. (5), consists of the vertices v ∈ N(S) that are not included in the sets M, B and C. A corresponding exchanged vertex is selected from S, since the k-plex would become infeasible after including such a v. A vertex from T can be used by the (2,1)-swap move in order to discover new and promising regions of the search space.
B = {v ∈ N(S) : |N(v) ∩ S| ≥ |S| − k and |C(S) \ N(v)| = 1}    (3)
C = {v ∈ N(S) : |N(v) ∩ S| = |S| − k and C(S) ⊆ N(v)}    (4)
T = N(S) \ (M ∪ B ∪ C)    (5)
Directed perturbation applies a move using the sets B and C, favoring moves that minimize the degradation of the cardinality of the current k-plex. Random perturbation applies a random move from T. Each move is only accepted when the quality is not worse than a given threshold, determined by the value of the degree-of-randomness parameter. A tabu list is used to prevent the search from returning to previously visited locations. More precisely, each time a vertex is removed from the k-plex by the (2,1)-swap, this vertex is prevented from moving back to S for the next γ iterations (γ is called the tabu tenure and γ = 7). On the other hand, each time a vertex is removed from the k-plex by the (1,1)-swap, the tabu tenure is set to γ + r, where r is a random integer from 1 to a fixed upper bound.
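The tabu bookkeeping described above can be sketched as follows. The fixed tenure 7 comes from the text; the upper bound of the random tenure component (`rand_span`) is an assumption, as the paper's bound is not given here, and the function names are ours.

```python
import random

GAMMA = 7   # fixed tabu tenure quoted in the text

def do_swap(S, v_in, v_out, tabu, iteration, swap_type, rng=random,
            rand_span=10):
    """Apply a swap and record the tabu tenure of the removed vertex:
    the (1,1)-swap gets tenure GAMMA plus a random integer in
    1..rand_span, while the stronger swap gets the fixed tenure GAMMA."""
    S.discard(v_out)
    S.add(v_in)
    if swap_type == "(1,1)":
        tabu[v_out] = iteration + GAMMA + rng.randint(1, rand_span)
    else:
        tabu[v_out] = iteration + GAMMA
    return S

def is_tabu(v, tabu, iteration):
    """A vertex may not re-enter S until its tenure expires."""
    return tabu.get(v, -1) >= iteration
```

The randomized tenure for the quality-preserving (1,1)-swap makes cycling between equal-quality solutions less likely than a fixed tenure would.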
2.5 Reward and value functions
After a locally optimal solution S is returned by descent local search, we apply a distance-and-quality reward to the selected parameter triple (action). This reward considers both the quality of the solution and the degree to which new areas of the search space are explored. The distance-and-quality reward is given in Eq. (6) [3]. The core motivation behind this strategy is to give the highest reward when new local optima are discovered with the minimum amount of diversification introduced. Recall that the parameter triples are sorted in increasing order during the pre-learning phase; the distance component is given by 1/i, where i is the index of the selected parameter triple in that order, and it is granted only when the returned local optimum S is not already stored in the hash table. The quality component is evaluated by |S| / |S_best|, where |S| is the cardinality of the current k-plex and S_best is the best solution found so far during the search.
r = α · d + β · q,  with d = 1/i if S is not in the hash table and d = 0 otherwise, and q = |S| / |S_best|    (6)

where α and β are the reward coefficients listed in Table 1.
As soon as the reward for the parameter triple is computed, the value-update procedure is used to update the credit attributed to the action and, accordingly, to compute the action value. The action values are initialized to 1, and the 100 latest rewards ascribed to an action are used to estimate the credit that quantifies the performance of that action over a given period. According to this performance summary, the action value is updated.
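The reward and the sliding-window credit can be sketched as follows. This is a hedged reading of the text: the way the distance and quality terms are combined (a weighted sum with the Table 1 coefficients) is an assumption, and the names are ours.

```python
from collections import deque

class ActionValue:
    """Credit for one action: mean of its most recent rewards
    (window of 100), with the value initialised to 1 as in the text."""
    def __init__(self, window=100):
        self.rewards = deque(maxlen=window)  # old rewards fall off automatically
        self.value = 1.0

    def update(self, reward):
        self.rewards.append(reward)
        self.value = sum(self.rewards) / len(self.rewards)

def reward(rank, is_new_optimum, size, best_size, alpha=2.0, beta=1.0):
    """Distance-and-quality reward: the distance term 1/rank (rank =
    position of the triple in the pre-learning order) is granted only
    when a new local optimum is found; quality is |S| / |S_best|."""
    d = 1.0 / rank if is_new_optimum else 0.0
    q = size / best_size
    return alpha * d + beta * q
```

Under this reading, low-rank (weakly diversifying) triples that still discover new local optima collect the largest rewards, matching the stated motivation.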
2.6 Update of the parameter triple learning list
In order to prevent the search process from premature convergence, the parameter triple learning list is updated periodically. Afterwards, the Softmax action-selection model [23] is used to select an action for the next iteration of the search.
The Softmax-based approach applies the Gibbs or Boltzmann distribution to assign a probability to each action in the learning list [3]. The updating rule for the triple learning list first selects the worst parameter triple, i.e., the one with the lowest probability of selection. Then, the updating rule determines a promising replacement triple from the full triple set. According to the probabilities of the actions in the learning list, we estimate the probability of each action in the full set by a linear correspondence. For example, given 12 actions in the full set of which 4 belong to the learning list, the probabilities of these 4 actions are known. We use a line to connect each pair of adjacent actions whose probabilities are known, as shown in Figure 1. For an action preceding the first known action, the estimated probability is obtained by dividing the probability of the first known action by the index of the action. Similarly, for an action following the last known action, the estimated probability is obtained by dividing the probability of the last known action by the difference between 11 and the index of the action. Finally, we estimate the probability of all actions in the full set and select an action with a high probability to replace the worst action in the learning list. After the parameter triple learning list is updated, the actions are re-sorted, the action values are reset to 1, and the next round of search is triggered.
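The interpolation of probabilities over the full triple set can be sketched as below. Note a simplification: the paper's rule for actions outside the known range divides the first/last known probability by an index, whereas this sketch simply extrapolates flat at both ends; the interior segments follow the same straight-line correspondence.

```python
def estimate_probs(known_idx, known_p, total):
    """Estimate a selection probability for every candidate index
    0..total-1 from the known (index, probability) points of the
    learning-list members, joining adjacent known points with straight
    lines and extrapolating flat beyond the first and last points."""
    pts = sorted(zip(known_idx, known_p))
    out = []
    for i in range(total):
        if i <= pts[0][0]:
            out.append(pts[0][1])          # before the first known point
        elif i >= pts[-1][0]:
            out.append(pts[-1][1])         # after the last known point
        else:
            for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
                if x0 <= i <= x1:          # interior: linear interpolation
                    frac = (i - x0) / (x1 - x0)
                    out.append(y0 + frac * (y1 - y0))
                    break
    return out
```

The candidate with the highest estimated probability (outside the current learning list) would then replace the worst list member.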
3 Experimental Setup
In order to evaluate the performance of BLS-RLE, we conduct experiments on 80 benchmark instances (https://turing.cs.hbg.psu.edu/txn131/clique.html) from the Second DIMACS Implementation Challenge [13]. These instances were first established for the maximum clique (MC) problem and are frequently used for evaluating MC algorithms. As previously mentioned, the maximum k-plex problem generalizes MC, hence these instances are also quite popular and challenging for evaluating solution methods for the k-plex problem.
The BLS-RLE algorithm is implemented in C++ and compiled using g++ with the -O3 option under GNU/Linux, running on an Intel Xeon E5440 processor (2.83 GHz and 4 GB RAM). When the DIMACS machine benchmarks (ftp://dimacs.rutgers.edu/pub/dsj/clique/) are run on our machine, the run times required are 0.44, 2.63 and 9.85 seconds for graphs r300.5, r400.5 and r500.5, respectively. Each instance is run 20 times independently. The algorithm stops when a fixed cutoff time (180 seconds) is reached. This experimental protocol is also used by a state-of-the-art heuristic in the literature [28]. BLS-RLE uses the self-adaptive parameter settings given in Table 1; all computational results were obtained with the same settings. The adopted parameter settings are inspired by a previous method from the literature [5], which provided a parameter sensitivity analysis.
Table 1: Parameter settings of BLS-RLE.

Description  Value
size of the learning parameter triple set  6
number of pre-learning iterations for each triple  100
temperature of the Softmax-based adaptive procedure  2
update frequency of the parameter triple learning list  4000
first coefficient of the reward function  2
second coefficient of the reward function  1
In addition to BLS-RLE, we also provide the results of BLS-RND, a variant of BLS where the values of the three parameters are determined randomly. [14] highlighted the importance of comparing adaptive parameter control mechanisms against random sampling of the parameter space: when introducing a dynamic parameter control strategy rather than static parameter values, it is not clear whether it is the adaptive strategy, or simply the fact that the parameter values change dynamically over time, that affects performance.
4 Experimental Results
To assess the performance of BLS-RLE, we compare it with state-of-the-art algorithms from the literature [2, 17, 19, 24, 28]. The experiments of [2] were performed on a Dell Precision PWS690 machine with a 2.66 GHz Xeon processor, 3 GB RAM and a 120 GB HDD. The experiments of [17] and [19] were performed on a 2.2 GHz Dual-Core AMD Opteron processor with 3 GB of memory and on an AMD Athlon 64 3700+ machine at 2.2 GHz with 3 GB of memory, respectively. The experiments of [24] were run on an Intel Core 2 Quad 3 GHz processor with 4 GB RAM and on a 2.5 GHz Intel Core i5-3210M processor with 4 GB of memory, respectively. The experiments of [28] were run on an AMD Opteron 4184 processor at 2.8 GHz with 2 GB RAM. Tables 2, 9, 10 and 11 compare BLS-RLE to several recent best-performing algorithms and to the heuristic method proposed in [28] (FDTS) for k = 2, 3, 4, 5, covering the best-known results for all of the instances tested.

Table 2: Detailed comparative results for the k-plex problem with k = 2 (– : no result reported).

BEV  FDTS Max (Avg.)  Time(s)  BLS-RLE Max (Avg.)  Time(s)
25  26  0.15  26  0.04
13  13  0.01  13  0.01
17  17  0.02  17  0.02
20  20  0.06  20  0.03
23  30  0.42  30  0.16
27  30  0.51  30  0.11
–  30  0.35  30  0.20
27  33 (31.2)  64.18  31  23.83
–  25  10.90  25  2.26
–  25  11.36  25  1.82
–  25  12.68  25  2.28
–  26 (25.55)  56.21  26 (25.9)  51.07
–  81 (80.55)  39.65  82 (81.75)  51.66
–  43  0.00  43  0.01
–  19 (18.95)  27.96  20 (19.05)  15.00
–  90 (88.9)  65.21  93 (91.7)  76.41
–  55  8.43  55  12.96
–  20  39.23  21 (20.15)  11.86
–  69  10.67  69  2.78
12  12  0.00  12 (10.75)  0.01
24  24  0.01  24 (22.25)  0.02
58  58  0.01  58 (57.4)  0.01
126  126  0.08  126 (124.95)  0.02
14  14  0.00  14 (12.4)  0.01
26  26  0.00  26 (25.2)  0.01
64  64  0.02  64 (62.45)  0.01
–  18  26.61  18  13.37
–  16  0.29  16  0.18
–  53  0.14  53  0.38
–  57  0.02  57  0.04
–  68 (67.7)  65.76  68  40.14
–  73 (71.6)  28.01  74 (72.75)  44.32
–  79 (78.05)  30.62  80 (79.1)  35.22
512  512  8.97  512  1.96
41  48  1.53  48  0.45
32  32  0.00  32  0.01
6  6  0.00  6  0.01
128  128  0.09  128  0.02
16  16  0.01  16  0.01
10  10  0.00  10  0.00
–  21  0.02  21  0.00
5  5  0.00  5  0.00
14  14  0.00  14  0.02
15  15  0.00  15  0.01
–  31  0.09  31  0.10
–  63  3.60  63  3.19
236  236 (235.9)  12.64  236  0.13
662  662 (661.4)  5.46  662  6.61
–  2162 (2113.9)  139.35  2162  12.33
26  26  0.00  26  0.01
–  13  0.28  13  0.08
–  56  0.43  56  0.32
–  82  0.33  82  0.17
–  14  1.46  14  0.33
–  80  1.75  80  0.95
–  114  0.61  114  0.16
10  10  0.00  10  0.02
30  30  0.01  30  0.03
43  44  0.07  44  0.05
12  12  0.06  12  0.03
–  42  0.02  42  0.02
–  62  0.18  62  0.06
13  13  0.06  13  0.07
50  52  0.06  52  0.03
73  76  0.54  76  0.14
–  17  9.39  18 (16.75)  50.68
–  31  0.59  31  0.05
24  26 (25.4)  9.74  26  0.10
90  90  0.01  90  0.01
–  71  2.47  71 (69.35)  2.03
–  54 (53.95)  72.31  54  7.16
–  15  1.24  15 (14.9)  9.08
–  41  0.20  42 (41.55)  10.27
–  32  7.22  33 (32.45)  36.36
–  27 (26.3)  12.27  28 (27.75)  66.10
–  102 (101.3)  9.09  103 (102.6)  39.24
–  22  0.01  22  0.04
–  51  0.61  51  0.17
–  15  0.02  15  0.03
–  26  1.03  26  0.34

Table 3: Summary of results by instance family for k = 2.

Family  BEV #max  FDTS #max  FDTS #avg.  BLS-RLE #max  BLS-RLE #avg.
brock (12)  3  12  10  11  10
C (7)  0  3  3  7  3
c-fat (7)  7  7  7  7  0
DSJC (2)  0  2  2  2  2
gen (5)  0  3  2  5  3
hamming (6)  5  6  6  6  6
johnson (4)  3  4  4  4  4
keller (3)  1  3  3  3  3
MANN (4)  3  4  1  4  4
p_hat (15)  4  15  15  15  15
san (11)  1  6  4  11  4
sanr (4)  0  4  4  4  4

Table 4: Summary of results by instance family for k = 3.

Family  BEV #Max  FDTS #Max  FDTS #Avg.  BLS-RLE #Max  BLS-RLE #Avg.
brock (12)  1  11  9  12  9
C (7)  0  3  2  7  3
c-fat (7)  7  7  7  7  0
DSJC (2)  0  2  2  2  2
gen (5)  0  5  4  5  2
hamming (6)  5  6  6  6  6
johnson (4)  3  4  4  4  4
keller (3)  1  2  2  3  3
MANN (4)  3  4  3  4  4
p_hat (15)  2  15  15  15  15
san (11)  1  10  8  11  2
sanr (4)  0  4  4  4  4

Table 5: Summary of results by instance family for k = 4.

Family  BEV #Max  FDTS #Max  FDTS #Avg.  BLS-RLE #Max  BLS-RLE #Avg.
brock (12)  0  11  8  12  9
C (7)  0  2  2  7  2
c-fat (7)  7  7  7  7  0
DSJC (2)  0  2  1  2  1
gen (5)  0  5  5  5  2
hamming (6)  3  6  5  5  4
johnson (4)  3  4  4  4  4
keller (3)  0  2  1  3  1
MANN (4)  3  4  3  4  4
p_hat (15)  1  15  14  15  15
san (11)  1  11  9  11  2
sanr (4)  0  4  4  4  4

Table 6: Summary of results by instance family for k = 5.

Family  BEV #Max  FDTS #Max  FDTS #Avg.  BLS-RLE #Max  BLS-RLE #Avg.
brock (12)  0  12  8  12  11
C (7)  0  3  2  7  3
c-fat (7)  7  7  7  7  1
DSJC (2)  0  2  1  2  1
gen (5)  0  5  5  5  2
hamming (6)  2  6  4  6  4
johnson (4)  1  4  4  4  4
keller (3)  0  3  2  3  1
MANN (4)  2  3  3  4  4
p_hat (15)  0  13  12  15  12
san (11)  1  11  10  10  6
sanr (4)  0  4  4  4  4
Table 7: The 32 instances for which a new best solution is found (previous best vs. new best).

Instance  Previous Max  New Max
k = 2
C1000.9  81  82
C2000.5  19  20
C2000.9  90  93
C4000.5  20  21
gen400_p0.9_65  73  74
gen400_p0.9_75  79  80
san1000  17  18
san400_0.7_1  41  42
san400_0.7_2  32  33
san400_0.7_3  27  28
san400_0.9_1  102  103
k = 3
brock800_1  30  –
C1000.9  95  96
C2000.5  22  23
C2000.9  105  109
C4000.5  23  24
keller6  90  93
san400_0.7_3  38  39
k = 4
brock800_4  33  34
C1000.9  107  109
C2000.5  25  26
C2000.9  118  123
C4000.5  26  27
C500.9  92  93
keller6  107  112
k = 5
C1000.9  119  122
C2000.9  132  137
C4000.5  29  30
C500.9  103  104
MANN_a81  3135  3240
p_hat1500-3  164  165
p_hat500-3  89  90
BEV lists the best-known results obtained by the four reference exact algorithms of [2, 17, 19, 24]. Each of these exact algorithms terminates after a maximum of 3 hours of run time (except the algorithm in [17], which terminates after 1 hour). We acknowledge that the experimental platforms used by these methods vary; however, given the large difference in running times, the impact of differing CPU speeds on the results is negligible. As such, we have extracted the computational results of the exact algorithms directly from the corresponding papers. FDTS gives the results of the heuristic method presented in [28], which uses the same termination criterion as our method, stopping after 180 seconds. For FDTS and BLS-RLE, the best and average results achieved over 20 runs are given. The time column provides the average time in seconds required to find the best solution, taken over those runs that obtained the best result.
From this table, we observe that the reference exact algorithms are able to solve a subset of the 80 instances to optimality; however, these consist mainly of small instances. In general, both the reference heuristic FDTS and BLS-RLE obtain good results for these instances in a short period of time. In most cases the average performance matches the best performance, with the same results achieved over all 20 runs. New best-known solutions are found for eleven of the 80 instances: C1000.9, C2000.5, C2000.9, C4000.5, gen400_p0.9_65, gen400_p0.9_75, san1000, san400_0.7_1, san400_0.7_2, san400_0.7_3 and san400_0.9_1. Interestingly, all of these instances are in the minority for which the average and best performance differ. It may be that these instances are intrinsically more difficult than the others in the set; indeed, the average time to find a best solution is also generally longer for these cases.
Tables 4, 5 and 6 summarize the computational results of BLS-RLE for the k-plex problem with k = 3, 4 and 5, respectively. Each cell represents the number of instances in that family for which the best-known value was found, and #Avg. gives the corresponding count for the average result over the runs of the reference algorithm. For BLS-RLE, results are highlighted in bold when BLS-RLE outperforms FDTS, no highlighting is applied when the results are the same, and results are shown in italics when FDTS outperforms BLS-RLE.
Over all 80 DIMACS instances, the BLS-RLE algorithm matches the best-known results for the k-plex problem with k = 2, 3, 4, 5 in all but four cases. In particular, BLS-RLE improves the best-known results for 32 instances, finding eleven new best solutions for k = 2 (in bold in Table 2) and seven new best solutions for each of k = 3, 4 and 5. This comparison shows that BLS-RLE offers highly competitive performance on these problem instances. Table 7 provides the specific results for the 32 instances where a new best solution has been found.
5 Analysis of the reinforcement learning
Table 8 provides a direct comparison between BLS-RLE and BLS-RND (where the three parameters are selected randomly at each step, over the same ranges as in BLS-RLE), across all 80 benchmark instances. The number of instances in each benchmark family is provided in parentheses. Despite its simplicity, randomly determining the perturbation operator and perturbation magnitude is clearly an effective strategy in this problem domain, with BLS-RND often matching BLS-RLE in terms of both maximum and average results obtained. This performance is not obviously linked to k, with both methods scaling in a similar manner as the value of k increases. Nevertheless, BLS-RLE performs better than BLS-RND on the hard instances; e.g., when k = 4, BLS-RLE finds the new best results 34 and 26 for brock800_4 and C2000.5, while BLS-RND only achieves the previous best-known results 33 and 25.
Table 8: Comparison of BLS-RLE with BLS-RND by instance family.

Family  k = 2: BLS-RLE #max #avg. | BLS-RND #max #avg.  k = 3: BLS-RLE #max #avg. | BLS-RND #max #avg.
brock (12)  11 10 | 11 10  12 9 | 12 9
C (7)  7 3 | 7 4  7 3 | 6 3
c-fat (7)  7 0 | 7 0  7 0 | 7 0
DSJC (2)  2 2 | 2 2  2 2 | 2 2
gen (5)  5 3 | 5 3  5 2 | 5 2
hamming (6)  6 6 | 6 6  6 6 | 6 6
johnson (4)  4 4 | 4 4  4 4 | 4 4
keller (3)  3 3 | 3 3  3 3 | 3 2
MANN (4)  4 4 | 4 4  4 4 | 4 4
p_hat (15)  15 15 | 15 15  15 15 | 15 15
san (11)  10 4 | 10 4  11 2 | 11 3
sanr (4)  4 4 | 4 4  4 4 | 4 4

Family  k = 4: BLS-RLE #max #avg. | BLS-RND #max #avg.  k = 5: BLS-RLE #max #avg. | BLS-RND #max #avg.
brock (12)  12 9 | 11 10  12 11 | 12 10
C (7)  7 2 | 6 2  7 3 | 7 3
c-fat (7)  7 0 | 7 0  7 1 | 7 1
DSJC (2)  2 1 | 2 1  2 1 | 2 1
gen (5)  5 2 | 5 2  5 2 | 5 2
hamming (6)  5 4 | 5 5  6 4 | 6 3
johnson (4)  4 4 | 4 4  4 4 | 4 4
keller (3)  3 1 | 3 1  3 1 | 3 1
MANN (4)  3 3 | 4 4  4 4 | 4 4
p_hat (15)  15 15 | 15 15  15 12 | 15 14
san (11)  11 2 | 10 3  10 6 | 10 6
sanr (4)  4 4 | 4 4  4 4 | 4 4
6 Conclusion
The BLS-RLE method presented in this paper is the first study of the maximum k-plex problem focusing on a cooperative approach between a local search procedure and a reinforcement learning strategy. BLS-RLE alternates between an intensification stage based on descent local search and a diversification stage based on directed or random perturbations. A reinforcement learning mechanism is employed to interdependently control three parameters: the probability of using a particular type of perturbation operator, the degree of random perturbation, and the number of perturbation moves used to escape local-optimum traps. A novel strategy for enumerating the possible perturbation depths and a new parameter control mechanism for generating the triples have been proposed, differing from the existing approaches in the literature.
Experimental evaluations over 80 benchmark instances for k = 2, 3, 4, 5 showed that the proposed BLS-RLE algorithm is highly competitive with state-of-the-art exact and heuristic algorithms for the maximum k-plex problem. In particular, BLS-RLE improves the best-known results for 32 instances. Although BLS-RLE is competitive with existing approaches, random sampling of the parameters has also shown promising performance (BLS-RND), with the added benefit of reducing the number of design choices required and the burden of parameter tuning. Our current work in this area focuses on using different action-selection models, comparing the ability of different approaches to learn good parameter triples. In future work we will examine other automated parameter tuning methods, such as irace [11], to manage the parameter combinations considered during the search process.
References
 [1] P. Auer, N. Cesa-Bianchi, P. Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2-3):235–256, 2002.
 [2] B. Balasundaram, S. Butenko, I.V. Hicks. Clique relaxations in social network analysis: The maximum k-plex problem. Operations Research, 59(1):133–142, 2011.
 [3] U. Benlic, J.K. Hao. Breakout local search for maximum clique problems. Computers & Operations Research, 40(1):192–206, 2013.
 [4] U. Benlic, J.K. Hao. Breakout local search for the max-cut problem. Engineering Applications of Artificial Intelligence, 26(3):1162–1173, 2013.
 [5] U. Benlic, M.G. Epitropakis, E.K. Burke. A hybrid breakout local search and reinforcement learning approach to the vertex separator problem. European Journal of Operational Research, 261(3):803–818, 2017.
 [6] N. Berry, T. Ko, T. Moy, J. Smrcka, J. Turnley, B. Wu. Emergent clique formation in terrorist recruitment. In The AAAI04 Workshop on Agent Organizations: Theory and Practice, 2004.
 [7] V. Boginski, S. Butenko, O. Shirokikh, S. Trukhanov, J.G. Lafuente. A networkbased data mining approach to portfolio selection via weighted clique relaxations. Annals of Operations Research, 216(1):23–34, 2014.
 [8] N. Du, B. Wu, X. Pei, B. Wang, L. Xu. Community detection in largescale social networks. In Proceedings of the 9th WebKDD and 1st SNAKDD 2007 workshop on Web mining and social network analysis, pages 16–25. ACM, 2007.
 [9] D. Gibson, R. Kumar, A. Tomkins. Discovering large dense subgraphs in massive graphs. In Proceedings of the 31st international conference on Very large data bases, pages 721–732. VLDB Endowment, 2005.
 [10] K.R. Gujjula, K.A. Seshadrinathan, A. Meisami. A hybrid metaheuristic for the maximum kPlex problem. Examining Robustness and Vulnerability of Networked Systems, pages 83–92, 2014.
 [11] M. LópezIbán̈ez, J. DuboisLacoste, L.P. Cáceres, M. Birattari, T. Stützle. The irace package: Iterated racing for automatic algorithm configuration. Operations Research Perspectives, 3: 4358, 2016.
 [12] H.H. Holger. Automated Algorithm Configuration and Parameter Tuning. Autonomous Search. Springer Berlin Heidelberg, pages 37–71, 2011.
 [13] D.S. Johnson, M.A. Trick. Cliques, coloring, and satisfiability: second DIMACS implementation challenge, October 1113, 1993, volume 26. American Mathematical Soc., 1996.
 [14] G. Karafotias, M. Hoogendoorn, A.E. Eiben. Why parameter control mechanisms should be benchmarked against random variation. In Evolutionary Computation (CEC), 2013 IEEE Congress on, pages 349–355. IEEE, 2013.
 [15] R.M. Karp. Reducibility among combinatorial problems. In Complexity of computer computations, pages 85–103. Springer, 1972.
 [16] V.E. Krebs. Mapping networks of terrorist cells. Connections, 24(3):43–52, 2002.
 [17] B. McClosky, I.V. Hicks. Combinatorial algorithms for the maximum kplex problem. Journal of combinatorial optimization, 23(1):29–49, 2012.
 [18] Z. Miao, B. Balasundaram. Cluster detection in largescale social networks using kplexes. In IIE Annual Conference. Proceedings, page 1. Institute of Industrial and Systems Engineers (IISE), 2012.
 [19] H. Moser, R. Niedermeier, M. Sorge. Exact combinatorial algorithms and experiments for finding maximum kplexes. Journal of combinatorial optimization, 24(3):347–373, 2012.
 [20] M.E.J. Newman. The structure of scientific collaboration networks. Proceedings of the National Academy of Sciences, 98(2):404–409, 2001.
 [21] M.G.C. Resende, C.C. Ribeiro. Greedy randomized adaptive search procedures: Advances, hybridizations, and applications. In Handbook of metaheuristics, pages 283–319. Springer, 2010.
 [22] S.B. Seidman, B.L. Foster. A graphtheoretic generalization of the clique concept. Journal of Mathematical Sociology, 6(1):139–154, 1978.
 [23] R.S. Sutton and A.G. Barto. Introduction to reinforcement learning, volume 135. MIT Press Cambridge, 1998.
 [24] S. Trukhanov, C. Balasubramaniam, B. Balasundaram, Se. Butenko. Algorithms for detecting optimal hereditary structures in graphs, with application to clique relaxations. Computational Optimization and Applications, 56(1):113–130, 2013.
 [25] Q. Wu and J.K. Hao. A review on algorithms for maximum clique problems. European Journal of Operational Research, 242(3):693–709, 2015.
 [26] M. Xiao. On a generalization of nemhauser and Trotter’s local optimization theorem. Journal of Computer and System Sciences, 84:97–106, 2017.
 [27] Y. Zhou, J.K. Hao, B. Duval. Reinforcement learning based local search for grouping problems: A case study on graph coloring. Expert Systems with Applications, 64:412–422, 2016.
 [28] Y. Zhou and J.K. Hao. Frequencydriven tabu search for the maximum splex problem. Computers & Operations Research, 86:65–78, 2017.
Appendix A

BKV | FDTS Max(Avg.) | FDTS time(s) | BLS-RLE Max(Avg.) | BLS-RLE time(s)
--- | --- | --- | --- | ---
24 | 30 | 0.05 | 30 | 0.028
16 | 16 | 0.27 | 16 | 0.112
19 | 20 | 0.02 | 20 | 0.0235
20 | 23 | 0.07 | 23 | 0.05
23 | 36 | 7.24 | 36 | 2.0655
27 | 36 | 15.37 | 36 | 2.635
– | 36 | 6.59 | 36 | 0.957
27 | 36 | 4.60 | 36 | 0.752
– | 29 | 12.35 | 30(29.05) | 9.6445
– | 30(29.30) | 31.20 | 30(29.95) | 50.016
– | 30(29.20) | 11.71 | 30(29.95) | 55.7225
– | 29 | 14.35 | 29 | 1.0425
– | 95(93.75) | 58.59 | 96(95.55) | 54.1065
– | 51 | 1.89 | 51 | 0.4015
– | 22(21.90) | 62.35 | 23(22.05) | 10.8295
– | 105(103.40) | 69.41 | 109(107.3) | 69.554
– | 65 | 22.29 | 65 | 9.86
– | 23 | 69.37 | 24(23.3) | 21.172
– | 81(80.95) | 57.72 | 81 | 6.72
12 | 12 | 0.00 | 12(10.95) | 0.0095
24 | 24 | 0.01 | 24(22.3) | 0.0125
58 | 58 | 0.01 | 58(56.85) | 0.0135
126 | 126 | 0.22 | 126(124.9) | 0.018
14 | 14 | 0.00 | 14(12.4) | 0.021
26 | 26 | 0.00 | 26(25.25) | 0.0215
64 | 64 | 0.02 | 64(62.45) | 0.022
– | 21 | 25.35 | 21 | 8.4675
– | 19 | 2.81 | 19 | 0.6185
– | 66 | 0.09 | 66 | 0.0365
– | 64 | 0.14 | 64 | 0.0815
– | 87 | 26.62 | 87(86.2) | 8.13
– | 101(100.45) | 10.68 | 101(99.35) | 0.457
– | 114 | 0.35 | 114(112.5) | 0.185
512 | 512 | 4.66 | 512 | 2.2325
46 | 64 | 1.13 | 64 | 0.4625
32 | 32 | 0.00 | 32 | 0.0215
8 | 8 | 0.00 | 8 | 0.0185
128 | 128 | 0.09 | 128 | 0.0185
20 | 20 | 0.01 | 20 | 0.015
16 | 16 | 0.00 | 16 | 0.0045
– | 32 | 0.05 | 32 | 0.016
8 | 8 | 0.00 | 8 | 0.007
18 | 18 | 0.00 | 18 | 0.02
21 | 21 | 0.09 | 21 | 0.015
– | 45 | 8.19 | 45 | 1.361
– | 90(87.80) | 66.21 | 93 | 47.53
351 | 351 | 0.41 | 351 | 0.049
990 | 990 | 7.40 | 990 | 0.84
– | 3240(3125.35) | 138.36 | 3240 | 31.3
36 | 36 | 0.00 | 36 | 0.007
– | 15 | 0.15 | 15 | 0.038
– | 67 | 0.81 | 67 | 0.118
– | 98 | 2.19 | 98 | 0.9785
– | 17 | 22.47 | 17 | 6.555
– | 93 | 0.41 | 93 | 0.115
– | 133 | 17.85 | 133 | 0.7665
12 | 12 | 0.00 | 12 | 0.0095
30 | 36 | 0.02 | 36 | 0.0315
43 | 52 | 0.06 | 52 | 0.0805
14 | 14 | 0.20 | 14 | 0.142
– | 50 | 0.12 | 50 | 0.062
– | 72 | 1.24 | 72 | 0.2565
13 | 15 | 0.13 | 15 | 0.0625
50 | 62 | 1.35 | 62 | 0.327
73 | 89 | 2.13 | 89 | 0.163
– | 25 | 6.73 | 25(23.4) | 3.7185
– | 46(45.70) | 1.51 | 46(45.7) | 0.7615
36 | 37 | 0.21 | 37 | 0.107
125 | 125 | 0.02 | 125(122.5) | 0.02
– | 105 | 0.02 | 105(102) | 0.011
– | 73 | 7.48 | 73(69.5) | 11.0195
– | 22 | 3.61 | 22(20.75) | 24.101
– | 61 | 2.49 | 61(60.9) | 60.1145
– | 47(46.10) | 0.46 | 47(46.8) | 38.1925
– | 38 | 11.17 | 39(37.35) | 21.7735
– | 150 | 0.09 | 150 | 0.0345
– | 26 | 0.03 | 26 | 0.039
– | 61 | 2.25 | 61 | 0.448
– | 18 | 0.09 | 18 | 0.021
– | 30 | 0.45 | 30 | 0.104

BKV | FDTS Max(Avg.) | FDTS time(s) | BLS-RLE Max(Avg.) | BLS-RLE time(s)
--- | --- | --- | --- | ---
27 | 35 | 3.59 | 35 | 0.3945
17 | 18 | 0.06 | 18 | 0.061
19 | 23 | 0.03 | 23 | 0.027
21 | 26 | 0.03 | 26 | 0.0215
23 | 41 | 41.37 | 41 | 5.678
29 | 41 | 33.51 | 41 | 1.2835
– | 41 | 5.90 | 41 | 0.376
30 | 41 | 1.76 | 41 | 0.4995
– | 34(33.20) | 24.43 | 34 | 30.1935
– | 34(33.15) | 26.40 | 34(33.75) | 47.5185
– | 34(33.15) | 28.46 | 34(33.9) | 47.9195
– | 33 | 32.04 | 34(33.05) | 5.788
– | 107(106.00) | 48.91 | 109(108.75) | 41.6625
– | 58 | 0.07 | 58 | 0.022
– | 25(24.50) | 22.79 | 26(25.05) | 14.574
– | 118(116.80) | 63.85 | 123(121.6) | 63.8995
– | 75 | 4.81 | 75 | 0.4745
– | 26(25.55) | 53.64 | 27(26.1) | 23.646
– | 92(91.75) | 55.11 | 93(92.45) | 36.314
12 | 12 | 0.00 | 12(11.5) | 0.005
24 | 24 | 0.02 | 24(22.3) | 0.0165
58 | 58 | 0.00 | 58(57.45) | 0.0115
126 | 126 | 0.23 | 126(125.45) | 0.0465
14 | 14 | 0.00 | 14(12.9) | 0.0165
26 | 26 | 0.00 | 26(25.2) | 0.011
64 | 64 | 0.03 | 64(62.5) | 0.0305
– | 24(23.05) | 5.63 | 24(23.4) | 25.494
– | 21 | 0.12 | 21 | 0.2195
– | 76 | 0.03 | 76 | 0.0245
– | 73 | 0.12 | 73 | 0.035
– | 112 | 0.19 | 112(110.2) | 0.1505
– | 132 | 0.25 | 132(129.8) | 0.0325
– | 136 | 0.10 | 136(133.9) | 0.039
512 | 512 | 8.62 | 512 | 1.499
51 | 68(67.20) | 20.64 | 68(67.95) | 53.0775
40 | 40 | 0.00 | 40 | 0.011
10 | 10 | 0.00 | 10 | 0.011
128 | 129 | 22.41 | 128 | 0.0655
20 | 25 | 0.28 | 25 | 0.2625
19 | 19 | 0.00 | 19 | 0.0145
– | 38 | 0.38 | 38 | 0.118
9 | 9 | 0.00 | 9 | 0.001
22 | 22 | 0.00 | 22 | 0.0115
22 | 23 | 0.06 | 23 | 0.213
– | 53(52.75) | 53.67 | 53(52.75) | 49.822
– | 107(103.45) | 67.88 | 112(106.8) | 70.3805
351 | 351 | 0.45 | 351 | 0.1055
990 | 990 | 7.48 | 990 | 1.954
– | 3240(2788.70) | 147.51 | 3240 | 29.2
36 | 36 | 0.00 | 36 | 0
– | 18 | 4.09 | 18 | 1.9085
– | 76 | 28.89 | 76 | 0.5585
– | 111 | 3.50 | 111 | 0.311
– | 19 | 3.73 | 19 | 0.83
– | 107(106.50) | 29.46 | 107 | 9.111
– | 150 | 3.90 | 150 | 0.542
14 | 14 | 0.01 | 14 | 0.0145
33 | 41 | 0.06 | 41 | 0.019
43 | 59 | 0.09 | 59 | 0.0545
14 | 16 | 0.11 | 16 | 0.0655
– | 57 | 0.07 | 57 | 0.0255
– | 81 | 1.94 | 81 | 0.136
13 | 17 | 0.45 | 17 | 0.2405
50 | 70 | 0.26 | 70 | 0.0735
73 | 100 | 1.46 | 100 | 0.144
– | 33 | 1.70 | 33(30.9) | 0.0215
– | 60 | 0.01 | 60 | 0.0155
48 | 49 | 1.90 | 49(48.95) | 9.4395
125 | 125 | 0.03 | 125 | 0.022
– | 105 | 0.02 | 105(104) | 0.0245
– | 96 | 0.08 | 96(91.5) | 0.0245
– | 29 | 4.92 | 29(26.25) | 0.0075
– | 81(80.45) | 17.54 | 81(80.05) | 0.0225
– | 61 | 0.57 | 61(60.75) | 36.4435
– | 50(49.45) | 24.71 | 50(47.4) | 22.13
– | 200 | 0.12 | 200(195) | 0.05
– | 30 | 0.14 | 30 | 0.035
– | 69 | 0.07 | 69 | 0.0485
– | 21 | 0.56 | 21 | 0.1105
– | 35 | 5.31 | 35 | 5.346

BKV | FDTS Max(Avg.) | FDTS time(s) | BLS-RLE Max(Avg.) | BLS-RLE time(s)
--- | --- | --- | --- | ---
27 | 39 | 2.36 | 39 | 0.966
17 | 20 | 0.04 | 20 | 0.022
19 | 26 | 0.09 | 26 | 0.0535
21 | 30 | 0.20 | 30 | 0.0565
23 | 46(45.50) | 36.57 | 46 | 5.3855
29 | 45 | 2.14 | 45 | 0.245
– | 46(45.90) | 55.95 | 46 | 3.172
30 | 46 | 25.67 | 46 | 5.104
– | 37 | 27.12 | 37 | 0.7105
– | 38(37.15) | 32.46 | 38(37.75) | 52.655
– | 38(37.10) | 36.78 | 38 | 68.3545
– | 37 | 33.58 | 37 | 4.0245
– | 119(118.15) | 60.39 | 122(121) | 53.142
– | 65 | 0.38 | 65 | 0.02
– | 28(27.15) | 11.72 | 28 | 33.969
– | 132(129.65) | 76.66 | 137(135.35) | 67.061
– | 84 | 4.72 | 84 | 0.3695
– | 29(28.20) | 26.93 | 30(29.05) | 6.135
– | 103(102.25) | 36.65 | 104(103.1) | 11.4765
14 | 14 | 0.00 | 14 | 0.0145
24 | 24 | 0.01 | 24(22.1) | 0.0155
58 | 58 | 0.00 | 58(56.95) | 0.0175
126 | 126 | 0.16 | 126(124.9) | 0.026
15 | 15 | 0.00 | 15(14.35) | 0.0105
26 | 26 | 0.00 | 26(25.3) | 0.016
64 | 64 | 0.04 | 64(62.7) | 0.0145
– | 27(26.25) | 32.14 | 27(26.85) | 36.39
– | 24 | 1.07 | 24 | 0.4795
– | 84 | 0.05 | 84 | 0.0355
– | 80 | 0.19 | 80 | 0.0815
– | 124 | 0.16 | 124(122.65) | 0.121
– | 138 | 0.15 | 138(135) | 0.0925
– | 136 | 0.12 | 136(133.15) | 0.0835
512 | 513(512.15) | 16.82 | 513(512.1) | 1.8635
51 | 79(78.05) | 34.37 | 79(78.65) | 41.8755
48 | 48 | 0.00 | 48 | 0.023
12 | 12 | 0.00 | 12 | 0.011
128 | 152 | 0.97 | 152 | 0.902
20 | 32 | 0.02 | 32 | 0.0185
21 | 24 | 0.00 | 24 | 0.0075
– | 48 | 0.09 | 48 | 0.021
12 | 12 | 0.00 | 12 | 0.009
24 | 28 | 0.00 | 28 | 0.01
22 | 28 | 0.02 | 28 | 0.0505
– | 61 | 5.04 | 61(60.55) | 10.423
– | 125(123.20) | 73.91 | 125(124.05) | 58.103
351 | 351 | 0.45 | 351 | 0.0675
990 | 990 | 7.44 | 990 | 2.507
– | 3135(2660.75) | 190.01 | 3240 | 73.26
44 | 45 | 0.00 | 45 | 0.0005
– | 20 | 7.17 | 20(19.5) | 0.5625
– | 84 | 1.30 | 84 | 0.13
– | 122 | 29.84 | 122 | 0.536
– | 21 | 1.43 | 21 | 0.6005
– | 117(116.55) | 41.90 | 117 | 3.613
– | 164 | 48.76 | 165(164.9) | 47.2155
14 | 16 | 0.02 | 16 | 0.0195
33 | 46 | 0.03 | 46 | 0.0205
43 | 65 | 0.11 | 65 | 0.0285
14 | 18 | 0.15 | 18 | 0.076
– | 62 | 0.25 | 62 | 0.049
– | 89 | 1.73 | 90(89.95) | 35.301
13 | 19 | 2.00 | 19 | 14.0355
50 | 79 | 9.42 | 79 | 0.7935
73 | 109 | 1.35 | 109 | 0.112
– | 41 | 6.39 | 41(39.15) | 0.0415
– | 75 | 0.01 | 75 | 0.0145
48 | 60 | 0.02 | 60 | 0.011
125 | 125 | 0.03 | 125 | 0.0275
– | 105 | 0.03 | 105(104.5) | 0.0255
– | 100 | 0.03 | 100 | 0.019
– | 36(35.70) | 55.40 | 35(33.75) | 0.016
– | 100 | 0.06 | 100 | 0.0325
– | 76 | 10.34 | 76(75.25) | 5.833
– | 61 | 0.29 | 61(57.65) | 14.6585
– | 200 | 0.15 | 200 | 0.059
– | 33 | 0.05 | 33 | 0.028
– | 77 | 5.74 | 77 | 0.408
– | 24 | 1.63 | 24 | 0.1875
– | 39 | 31.01 | 39 | 2.3455