MCN
Learning to solve the Multilevel Critical Node Problem. Code for the paper https://arxiv.org/pdf/2007.03151.pdf
Learning heuristics for combinatorial optimization problems through graph neural networks has recently shown promising results on some classic NP-hard problems. These are single-level optimization problems with only one player. Multilevel combinatorial optimization problems are their generalization, encompassing situations with multiple players taking decisions sequentially. By framing them in a multi-agent reinforcement learning setting, we devise a value-based method to learn to solve multilevel budgeted combinatorial problems involving two players in a zero-sum game over a graph. Our framework is based on a simple curriculum: if an agent knows how to estimate the value of instances with budgets up to B, then solving instances with budget B+1 can be done in polynomial time regardless of the direction of the optimization by checking the value of every possible after-state. Thus, in a bottom-up approach, we generate datasets of heuristically solved instances with increasingly larger budgets to train our agent. We report results close to optimality on graphs up to 100 nodes and a 185× speed-up on average compared to the quickest exact solver known for the Multilevel Critical Node problem, a max-min-max trilevel problem that has been shown to be at least Σ₂ᵖ-hard.
The design of heuristics to tackle real-world instances of NP-hard combinatorial optimization problems over graphs has attracted the attention of many computer scientists over the years Gonzalez (2007). With advances in Deep Learning Goodfellow et al. (2016) and Graph Neural Networks Wu et al. (2020), the idea of leveraging the recurrent structures appearing in the combinatorial objects belonging to a distribution of instances of a given problem to learn efficient heuristics within a Reinforcement Learning (RL) framework has received increased interest Bengio et al. (2018); Mazyavkina et al. (2020). Although these approaches show promising results on many fundamental NP-hard problems over graphs, such as Maximum Cut Barrett et al. (2020) or the Traveling Salesman Problem Kool et al. (2019), the range of combinatorial challenges on which they are directly applicable is still limited. Indeed, most of the combinatorial problems over graphs solved heuristically with Deep Learning Bai et al. (2020); Barrett et al. (2020); Bello et al. (2016); Cappart et al. (2019); Khalil et al. (2017); Kool et al. (2019); Li et al. (2018); Ma et al. (2019)
are classic NP-hard problems for which the canonical optimization formulation is a single-level Mixed Integer Linear Program: there is one decision-maker¹
¹ We will use the words decision-maker, agent and player interchangeably. Note that decision-maker, player and agent are usually used in Operations Research, Game Theory and Reinforcement Learning, respectively. Similarly for the words decision, strategy and policy.
seeking to minimize a linear cost subject to linear constraints and integrality requirements. However, in many real-world situations, decision-makers interact with each other. A particular case of such a setting is sequential games with a hierarchy between players: an upper-level authority (a leader) optimizes its goal subject to the response of a sequence of followers seeking to optimize their own objectives given the actions previously made by others higher in the hierarchy. These problems are naturally modeled as Multilevel Programming problems (MPs) and can be seen as a succession of nested optimization tasks, i.e. mathematical programs with optimization problems in the constraints Bracken and McGill (1973); Zhang et al. (2015). Thus, finding an optimal strategy for the leader in the multilevel setting may be harder than for single-level problems, as evaluating the cost of a given strategy might not be possible in polynomial time: it requires solving the followers' optimization problems. In fact, even Multilevel Linear Programming with a sequence of players (levels) is hard Blair (1992); Dudás et al. (1998); Jeroslow (1985). In practice, exact methods capable of tackling medium-sized instances in reasonable time have been developed for max-min-max trilevel, min-max bilevel and more general Bilevel Programs (e.g., Carvalho et al. (2020); Fischetti et al. (2017, 2019); Lozano and Smith (2017); Tahernejad et al. (2020)).
Despite the computational challenges intrinsic to MPs, these formulations are of practical interest as they properly model hierarchical decision problems. Originally appearing in economics in the bilevel form, designated as Stackelberg competitions von Stackelberg (1934), they have since been extended to more than two agents and seen their use explode in Operations Research Lachhwani and Dwivedi (2017); Sinha et al. (2018). Thus, research efforts have been directed at finding good quality heuristics to solve those problems, e.g. DeNegre (2011); Fischetti et al. (2018); Forghani et al. (2020); Talbi (2013). Hence, one can ask whether we can make an agent learn how to solve a wide range of instances of a given multilevel problem, extending the success of recent Deep Learning approaches on solving singlelevel combinatorial problems to higher levels.
In this paper, we propose a simple curriculum to learn to solve a common type of multilevel combinatorial optimization problem: budgeted ones that are zero-sum games played on a graph. Although the framework we devise is set to be general, we center our attention on the Multilevel Critical Node problem (MCN) Baggio et al. (2020) and its variants. The reasons for such a choice are manifold. First, the MCN is an example of a Defender-Attacker-Defender game Brown et al. (2006), a class which has received much attention lately as it aims to find the best preventive strategies to defend critical network infrastructures against malicious attacks. As it falls under the global framework of network interdiction games, it is also related to many other interdiction problems, with applications ranging from flood control Ratliff et al. (1975) to the decomposition of matrices into blocks Furini et al. (2019). Moreover, an exact method to solve the problem has been presented in Baggio et al. (2020) along with a publicly available dataset of solved instances², which we can use to assess the quality of our heuristic. Lastly, complexity results are available for several variants and subproblems of MCN, indicating its challenging nature Nabli et al. (2020).
² https://github.com/mxmmargarida/CriticalNodeProblem
Contributions. Our contribution rests on several steps. First, we frame generic Multilevel Budgeted Combinatorial problems (MBC) as Alternating Markov Games Littman (1996); Littman and Szepesvari (1996). This allows us to devise a first algorithm, MultiL-DQN, to learn state-action values. By leveraging both the combinatorial setting (the environment is deterministic) and the budgeted case (the length of an episode is known in advance), we motivate a curriculum, MultiL-Cur. Introducing a Graph Neural Network based agent, we empirically demonstrate the efficiency of our curriculum on versions of the MCN, reporting results close to optimality on graphs of size up to 100 nodes.
Paper structure. Section 2 formalizes the MBC problem. In Section 3, we provide an overview of the relevant literature. The MBC is put in a Multi-Agent RL setting in Section 4, along with the presentation of our algorithmic approaches, MultiL-DQN and MultiL-Cur. Section 5 presents the particular game, MCN, on which our methodology is validated in Section 6.
The general setting for the MPs we are considering is the following: given a graph G = (V, E), two concurrent players, the leader and the follower, compete over the same combinatorial quantity f, with the leader wanting to maximize it and the follower to minimize it. They are given a total number of moves T and a sequence of budgets k^1, ..., k^T. Although our study and algorithms also apply to general integer cost functions, for the sake of clarity, we will only consider situations where the cost of a move is its cardinality. We focus on perfect-information games, i.e. both players have full knowledge of the budgets allocated and the previous moves. The leader always begins and the last move is attributed by the parity of T. At each turn t, the player concerned makes a set of decisions about the graph. This set is denoted by x^t and constrained by the previous moves x^1, ..., x^{t-1}. We consider games where players can only improve their objective by taking a decision: there is no incentive to pass. Without loss of generality, we can assume that T is odd. Then, the Multilevel Budgeted Combinatorial problem (MBC) can be formalized as:

    max_{x^1 : |x^1| ≤ k^1}  min_{x^2 : |x^2| ≤ k^2}  ...  max_{x^T : |x^T| ≤ k^T}  f(G; x^1, ..., x^T),    (1)

where each x^t additionally ranges over the moves allowed given x^1, ..., x^{t-1}. MBC is a zero-sum game, as both leader and follower have the same objective function but their directions of optimization are opposite. A particular combinatorial optimization problem is defined by specifying the quantity f, fixing T, and by characterizing the nature of both the graph (e.g. directed, weighted) and of the actions allowed at each turn (e.g. labeling edges, removing nodes). The problem being fixed, a distribution of instances D is determined by setting a sampling law for random graphs and for the other parameters, with bounds on the graph size, the number of moves and the budgets specified beforehand. Our ultimate goal is thus to learn good-quality heuristics that manage to solve each instance drawn from D.
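On toy instances, the nested max/min structure of (1) can be made concrete with an exhaustive minimax recursion. This is only an illustrative sketch, not the paper's method: the state encoding (a set of surviving nodes), the move set (subset removal) and the trivial objective are all assumptions chosen for brevity.

```python
from itertools import combinations

def mbc_value(nodes, budgets, f, level=0):
    """Exhaustive minimax value of a tiny MBC instance.

    nodes:   frozenset of nodes still in the graph (the 'state')
    budgets: remaining per-level budgets (k^t, ..., k^T)
    f:       terminal objective, evaluated once all moves are made
    The leader (maximizer) plays at even levels, the follower at odd ones;
    a move is any subset of nodes of cardinality at most the level budget.
    """
    if level == len(budgets):
        return f(nodes)
    opt = max if level % 2 == 0 else min
    values = []
    for size in range(budgets[level] + 1):          # a player may use less than k^t
        for move in combinations(sorted(nodes), size):
            values.append(mbc_value(nodes - set(move), budgets, f, level + 1))
    return opt(values)

# Toy objective: number of surviving nodes. The leader has no incentive to
# remove anything; the follower removes as much as its budget allows.
val = mbc_value(frozenset(range(4)), (1, 1, 1), len)   # val == 3
```

The doubly exponential cost of this enumeration is exactly why the rest of the paper replaces it with learned value estimates.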
In order to achieve that, we aim to leverage the recurrent structures appearing in the combinatorial objects in the distribution by learning graph embeddings that can guide the decision process. As data is usually very scarce (datasets of exactly solved instances being hard to produce), the go-to framework to learn useful representations in these situations is Reinforcement Learning Sutton and Barto (1998).
The combination of graph embeddings with reinforcement learning to learn to solve distributions of instances of combinatorial problems was introduced by Dai et al. Khalil et al. (2017). Thanks to their S2V-DQN meta-algorithm, they managed to show promising results on three classic budget-free NP-hard problems. Since then, a growing number of methods have been proposed to either improve upon S2V-DQN's results Barrett et al. (2020); Cappart et al. (2019); Kool et al. (2019); Li et al. (2018); Ma et al. (2019) or tackle other types of NP-hard problems on graphs Bai et al. (2020). As all these approaches focus on single-player games, they are not directly applicable to MBC.
To tackle the multi-player case, Multi-Agent Reinforcement Learning (MARL) Littman (1994); Shoham et al. (2007) appears as the natural toolbox. The combination of Deep Learning with RL recently led to one of the most significant breakthroughs in perfect-information, sequential two-player games: AlphaGo Silver et al. (2017). Although neural network based agents have managed to exceed human abilities on other combinatorial games (e.g. backgammon Tesauro (2002)), these approaches focus on one fixed board game. Thus, they effectively learn to solve only one (particularly hard) instance of a combinatorial problem, whereas we aim to solve a whole distribution of them. Hence, the MBC problem we propose to study is at a crossroads between previous works on MARL and deep learning for combinatorial optimization.
Finally, taking another direction, some works shifted their attention from specific problems to general-purpose solvers. For example, methodologies have been proposed to speed up branch-and-bound implementations for (single-level) linear combinatorial problems by learning to branch Balcan et al. (2018), using Graph Convolutional Neural Networks Gasse et al. (2019), Deep Neural Networks Zarpellon et al. (2020) or RL Etheve et al. (2020), to name some recent works; see the surveys Bengio et al. (2018); Lodi and Zarpellon (2017). To the best of our knowledge, the literature on machine learning approaches for general multilevel optimization is restricted to the linear non-combinatorial case. For instance, in He et al. (2014); Shih et al. (2004), the linear multilevel problems are converted into a system of differential equations and solved using recurrent neural networks.
Whereas single-agent RL is usually described with Markov Decision Processes, the framework needs to be extended to account for multiple agents. This was done in the seminal work of Shapley Shapley (1953) by introducing Markov Games. In our case, we want to model two-player games in which moves are not played simultaneously but alternately. The natural setting for such a situation was introduced by Littman in Littman (1994, 1996); Littman and Szepesvari (1996) under the name of Alternating Markov Games. An Alternating Markov Game involves two players: a maximizer and a minimizer. It is defined by a tuple with the sets of states and actions A^1, A^2 for each player, a transition function P mapping state-action couples to probabilities of next states, and a reward function r. For a state s, we define V(s) as the expected reward of the concerned agent for following the optimal minimax policy against an optimal opponent starting from state s. In a similar fashion, Q(s, a) is the expected reward for the player taking action a in state s and both agents behaving optimally thereafter. Finally, with the introduction of a discount factor γ, we can write the generalized Bellman equations for Alternating Markov Games Littman (1996):

    V(s) = max_{a ∈ A^1} Q(s, a) if the maximizer plays in s,  V(s) = min_{a ∈ A^2} Q(s, a) otherwise,    (2)

    Q(s, a) = Σ_{s'} P(s' | s, a) [ r(s, a, s') + γ V(s') ].    (3)
We now have all the elements to frame the MBC in the Alternating Markov Game framework. The leader is the maximizer and the follower the minimizer. The states consist of a graph and a tuple of remaining budgets, beginning with the initial state s_0 = (G, (k^1, ..., k^T)). Thus, the optimal value of problem (1) is the value of the initial state:

    MBC(G, k^1, ..., k^T) = V(s_0).    (4)
The game is naturally sequential, with an episode length of T: each time step corresponds to a level. The challenge of such a formulation is the size of the action space, which can become large quickly. Indeed, in a graph with n nodes and a first-player budget of k^1, if the action that he/she can perform is "removing a set of nodes from the graph" (a common move in network interdiction games), then the size of the action space for the first move of the game is C(n, k^1), the number of subsets of k^1 nodes. To remedy this, we consider the sets of individual decisions available at each level. Then, we make the simplifying observation that a player making a set of decisions in one move is the same as him/her making a sequence of individual decisions in one strike. More formally, we have the simple lemma (proof in Appendix A.1):
Lemma 4.1. The Multilevel Budgeted Combinatorial optimization problem (1) is equivalent to its sequential form, in which each move x^t of cardinality at most k^t is replaced by a sequence of at most k^t individual decisions.
In this setting, the length of an episode is no longer T but B = k^1 + ... + k^T: the leader makes the k^1 first sequential actions, then the follower the k^2 following ones, etc. To simplify the notation, we redefine the action sets as the sets of individual actions available to the agent playing at time t. As each action takes place on the graph, the identity of the current player is readable from the state. Moreover, the action space at each step now grows at most linearly with the number of nodes.
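A quick calculation illustrates why this decomposition matters: a joint move of k nodes among n is one of C(n, k) actions, while the sequential reformulation examines at most n candidate decisions per step. The numbers below are illustrative, not taken from the paper's experiments.

```python
from math import comb

n, k = 100, 5           # a 100-node graph and a first-move budget of 5
set_moves = comb(n, k)  # one joint action per subset of k nodes
# 75,287,520 possible first moves as a single set action...
sequential = n * k      # ...versus at most n decisions at each of k steps
```

The price of the reformulation is a longer episode (B steps instead of T), which the curriculum of Section 4 exploits rather than suffers from.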
The environments considered in MBC are deterministic and their dynamics are completely known. Indeed, given a graph, a tuple of remaining budgets and a chosen action a (e.g. removing a node v), the subsequent graph, tuple of budgets and next player are completely and uniquely determined. Thus, we can introduce the next-state function next(s, a) that maps state-action couples to the resulting after-state, and player(s) as the function that maps the current state to the player whose turn it is to play. As early rewards weigh the same as late ones, we set γ = 1. Finally, we can rewrite equations (2) and (3) as:

    V(s) = max_a Q(s, a) if player(s) is the leader,  V(s) = min_a Q(s, a) otherwise,    (5)

    Q(s, a) = r(s, a) + V(next(s, a)).    (6)

The definition of the reward r depends on the combinatorial quantity f and the nature of the actions allowed.
Having framed the MBC in the Markov Game framework, the next step is to look at established algorithms to learn in this setup. Littman originally presented minimax Q-learning Littman (1994, 1996) to do so, but in matrix games. An extension using a neural network to estimate Q has been discussed in Fan et al. (2019). However, their algorithm, Minimax-DQN, is suited to the simultaneous setting and not the alternating one. The main difference is that the former requires the extra work of solving a Nash game between the two players at each step, which is unnecessary in the latter as a greedy policy exists Littman (1994). To bring Minimax-DQN to the alternating case, we present MultiL-DQN, an algorithm inspired by S2V-DQN Khalil et al. (2017) but extended to the multilevel setting. See Appendix B.1 for the pseudocode.
With MultiL-DQN, the learning agent directly tries to solve instances drawn from D, which can be very hard theoretically speaking. However, Lemma 4.1 shows that, at the finest level of granularity, MBC is actually made of nested subproblems. As we know T, the maximum number of levels considered in D, instead of directly trying to learn the values of the instances from this distribution, we can ask whether beginning by focusing on the simpler subproblems and gradually building our way up to the hard ones would yield better final results.
This reasoning is motivated by the work of Bengio et al. Bengio et al. (2009) on Curriculum Learning. Indeed, it has been shown empirically that breaking down the target training distribution into a sequence of increasingly harder ones actually results in better generalization abilities for the learning agent. But, contrary to the continuous setting devised in their work, where the parameter governing the hardness (entropy) of the distributions considered varies continuously, here we have a natural discrete sequence of increasingly harder distributions to sample instances from.
Indeed, our ultimate goal is to learn a function approximating the solution of (5) (or, equivalently, of (6)) so that we can apply the subsequent greedy policy to take a decision. Thus, the network has to estimate the values of every instance appearing along a sequence of decisions. Although the leader makes the first move on instances from D, as our game is played on the graph itself, the distribution of instances on which the second decision is made is no longer D but the distribution of instances from D on which a first optimal action of the leader has been performed. If we introduce, for each remaining total budget b, the distribution

    D_b = distribution of the after-states obtained from instances of D by playing B − b optimal individual actions,    (7)

then, from top to bottom, we want to estimate the values of taking an action starting from the states of D_B = D where the first action is made, then from those of D_{B−1} where the second action is made, and all the way down to the states where the last action is made:

    D_B, D_{B−1}, ..., D_1.    (8)
As the maximum total budget is B, the network effectively has to learn to estimate values from B different distributions of instances, one for each possible remaining budget. But the instances in these distributions are not all equally hard to solve. Actually, the tendency is that the deeper in the sequence of decisions a distribution is, the easier to solve are the instances sampled from it. For example, the last distribution, D_1, contains all the instances with a total remaining budget of 1 that it is possible to obtain for the last move of the game when every previous action was optimal. The values of these instances can be computed exactly in polynomial time⁴ by checking the reward obtained with every possible action. Thus, if we had access to D_1, a natural curriculum would be to begin by sampling a dataset of instances from it, find their exact values in polynomial time, and train a neural network on the resulting instance-value couples. Once this is done, we could pass to D_2. As these instances have a total remaining budget of 2, we can heuristically solve them by generating every possible after-state and, by using the freshly trained network, take a greedy decision to obtain their approximate target values. In a bottom-up approach, we could continue until the network is trained on the B different distributions. The challenge of this setting is that we do not know the optimal actions, and hence the D_b are not available. To remedy this, we use a proxy, obtained by following the random policy instead of an optimal one for the sequence of previous moves. Doing so is provably sound (proof in Appendix A.2):
⁴ Assuming the quantity f is computable in polynomial time.
Every instance in the support of D_b is generated with a strictly positive probability under the random-policy proxy.
Thus, by learning the values of instances sampled from the proxy distributions, we also learn the values of instances from the D_b. To avoid the pitfall of catastrophic forgetting Lange et al. (2019), which happens when a neural network switches training distribution, each time it finishes learning from one budget level, and before the transition to the next, we freeze a copy of the network and save it in memory as an "expert of level b". All this leads to Algorithm 1.
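The bottom-up scheme above can be sketched on a deliberately tiny stand-in game (a ±1 number game, not MCN). The tabular "network", the enumeration of the proxy distribution's support and all names are hypothetical simplifications of Algorithm 1; greedy targets for budget b are computed from the frozen expert of budget b − 1, and budget-1 states are solved exactly.

```python
# Toy stand-in for an MBC instance: state = (x, budget_left, is_max_turn);
# each move adds or subtracts 1 from x, the terminal value is x itself.
START = [(0, 2, True), (5, 2, True)]

def after_states(s):
    x, b, is_max = s
    return [(x + d, b - 1, not is_max) for d in (+1, -1)]

def terminal_value(s):
    return s[0]

def proxy_instances(b):
    """Support of the random-policy rollout: every state with remaining
    budget b reachable from a start state (the proxy distribution)."""
    states = set(START)
    while states and next(iter(states))[1] > b:
        states = {s2 for s in states for s2 in after_states(s)}
    return states

def train_curriculum(max_budget):
    net, experts = {}, []
    for b in range(1, max_budget + 1):
        prev = experts[-1] if experts else None
        for s in proxy_instances(b):
            opt = max if s[2] else min          # leader maximizes, follower minimizes
            targets = [terminal_value(s2) if b == 1 else prev[s2]
                       for s2 in after_states(s)]
            net[s] = opt(targets)               # stand-in for a regression step
        experts.append(dict(net))               # freeze the "expert of level b"
    return experts

experts = train_curriculum(2)
```

With alternating +1/−1 moves the minimax value of a start state equals its initial x, which the top-level expert recovers exactly in this toy setting.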
The MCN Baggio et al. (2020) is a trilevel budgeted combinatorial problem on a weighted graph. The leader is called the defender and the follower the attacker. The defender begins by removing (vaccinating) a set D of nodes, then the attacker labels a set I of nodes as attacked (infected), and finally the defender removes (protects) a set P of new nodes that were not attacked. Once all the moves are done, the attacked nodes are the source of a cascade of infections that propagate through arcs from node to node. All nodes that are not infected are saved. As the nodes in D and P were removed from the graph, they are automatically saved, and the leader receives their weights as reward when performing those actions. The quantity that the defender seeks to maximize is thus the sum of the weights of the nodes saved at the end of the game, while the attacker aims to minimize it. An example of a game is presented in Figure 1. Denoting by Ω, Φ and Λ the vaccination, attack and protection budgets, the problem can be written as:

    max_{|D| ≤ Ω}  min_{|I| ≤ Φ}  max_{|P| ≤ Λ}  Σ_{v saved} w_v.    (9)

With unit weights, the MCN has been shown to be at least NP-hard on undirected graphs and at least Σ₂ᵖ-hard on directed ones. In the more general version, with positive weights and costs associated to each node, the problem is Σ₃ᵖ-complete Nabli et al. (2020).
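For fixed moves (D, I, P), evaluating the quantity inside the game reduces to an infection propagation, which a BFS over the residual graph captures. The sketch below is an illustrative reimplementation for the undirected unit-cost case; all function and variable names are our own, not the paper's.

```python
from collections import deque

def mcn_value(n, edges, weights, vaccinated, attacked, protected):
    """Weight saved for fixed MCN moves: vaccinated and protected nodes are
    removed, then infection spreads from attacked nodes along the remaining
    edges; every node that ends up non-infected is saved."""
    removed = set(vaccinated) | set(protected)
    adj = {v: [] for v in range(n)}
    for u, v in edges:
        if u not in removed and v not in removed:
            adj[u].append(v)
            adj[v].append(u)            # undirected variant
    infected, queue = set(attacked) - removed, deque(attacked)
    while queue:                        # breadth-first infection cascade
        u = queue.popleft()
        for v in adj[u]:
            if v not in infected:
                infected.add(v)
                queue.append(v)
    return sum(w for v, w in enumerate(weights) if v not in infected)

# Path 0-1-2-3-4, unit weights: vaccinating node 2 cuts the graph, so
# attacking node 1 only infects {0, 1}; nodes 2, 3 and 4 are saved.
saved = mcn_value(5, [(0, 1), (1, 2), (2, 3), (3, 4)], [1] * 5,
                  vaccinated={2}, attacked={1}, protected=set())   # saved == 3
```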
We studied three versions of the MCN: undirected with unit weights (MCN), undirected with positive integer weights (the weighted MCN), and directed with unit weights (the directed MCN). The first distribution of instances considered is constituted of Erdős–Rényi graphs Erdős and Rényi (1960) of small size and fixed arc density. The second distribution of instances focuses on larger graphs. To compare our results with exact ones, we used the budgets reported in the experiments of the original MCN paper Baggio et al. (2020).
The architecture presented in Figure 2 was implemented with PyTorch Geometric Fey and Lenssen (2019) and PyTorch 1.4 Paszke et al. (2019). At the beginning, a node has only two features: its weight and an indicator of whether it is attacked or not. The first step of our node embedding method is to concatenate the node's two features with its Local Degree Profile Cai and Wang (2018), consisting of simple statistics on its degree. Following the success of attention on routing problems reported in Kool et al. (2019), we then apply their Attention Layer. As the infected nodes are the ones in the same connected component as attacked ones in the residual graph, we sought to propagate the information of each node to all the others it is connected to. That way, the attacker can know which nodes are already infected before spending the rest of his/her budget, and the defender can realize which nodes to protect in his/her last move. So, after the Attention Layers, we used an APPNP layer Klicpera et al. (2019) that, given the matrix of node embeddings H⁰, the adjacency matrix with inserted self-loops Ã, its corresponding diagonal degree matrix D̃, and a teleport coefficient α, recursively applies K times:

    H^{l+1} = (1 − α) D̃^{−1/2} Ã D̃^{−1/2} H^l + α H⁰.    (10)
To achieve our goal, the value of K must be at least equal to the size of the largest connected component that can appear in the distribution of instances, and we set it accordingly. Finally, the graph embedding method we used is the one presented in Li et al. (2016). Given two neural networks g and p which compute, respectively, a score and a projection of each node embedding h_v, the graph-level representation vector it outputs is:

    h_G = Σ_{v ∈ V} σ(g(h_v)) ⊙ p(h_v).    (11)
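The APPNP recursion (10) is a teleport-style propagation. A minimal NumPy sketch (dense matrices, toy sizes, hyperparameters assumed) shows how K applications mix a node's initial features with those of its own connected component only, which is the property the architecture relies on:

```python
import numpy as np

def appnp(H0, A, K=10, alpha=0.1):
    """APPNP propagation: H^{l+1} = (1 - alpha) * A_hat @ H^l + alpha * H^0,
    with A_hat the symmetrically normalized adjacency with self-loops."""
    A_tilde = A + np.eye(A.shape[0])            # insert self-loops
    d_inv_sqrt = np.diag(1.0 / np.sqrt(A_tilde.sum(axis=1)))
    A_hat = d_inv_sqrt @ A_tilde @ d_inv_sqrt
    H = H0
    for _ in range(K):
        H = (1 - alpha) * (A_hat @ H) + alpha * H0
    return H

# Nodes 0-1 are connected, node 2 is isolated: the "attacked" feature of
# node 0 reaches node 1 but never leaks to node 2.
A = np.array([[0., 1., 0.], [1., 0., 0.], [0., 0., 0.]])
H0 = np.array([[1.0], [0.0], [0.0]])            # node 0 carries the signal
H = appnp(H0, A)
```

In the actual model this propagation is performed by `torch_geometric.nn.APPNP` on learned embeddings rather than raw indicators.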
To train our agent and at inference, we used one GPU of a cluster of NVIDIA V100-SXM2 with 16 GB of memory⁵. Further details of the implementation are discussed in Appendix D.
⁵ We make our code publicly available: https://github.com/AdelNabli/MCN.
We compared our algorithms on the first distribution and trained our best-performing one on the second. But as it is, comparing MultiL-DQN with MultiL-Cur may be unfair. Indeed, MultiL-DQN uses a state-action value network Q whereas MultiL-Cur uses a state value network V. The reasons why we used V instead of Q in our second algorithm are twofold. First, as our curriculum leans on the abilities of experts trained on smaller budgets to create the next training dataset, computing values of after-states is necessary to heuristically solve instances with larger budgets. Second, as MCN is a game with one player removing nodes from the graph, symmetries can be leveraged in the after-states. Indeed, given the graph resulting from a node deletion, many couples of graph and node to delete could have resulted in it. Thus, Q has to learn that all these possibilities are similar, while V only needs to learn the value of the shared after-state, which is more efficient Sutton and Barto (1998). To fairly compare the algorithms, we thus introduce MultiL-MC, a version of MultiL-DQN based on a value network and using Monte-Carlo samples as in MultiL-Cur. Its pseudocode is available in Appendix B.3.
Results from Table 1 indicate that MultiL-Cur is the best performing algorithm on the first distribution. Thus, we trained our learning agent with it on the second distribution and tested its performance on the datasets generated in Baggio et al. (2020). We compare the results with other heuristics: the random policy (for each instance, we average the value given by several random episodes), and the DAAD heuristic Baggio et al. (2020). The latter consists in separately solving the two bilevel problems inside MCN: the vaccination D is chosen by setting Λ to 0 and exactly solving the resulting Defender-Attacker problem, while I and P are determined by solving the subsequent Attacker-Defender problem. The metrics we use are the average optimality gap ε and the average approximation ratio ρ, computed over the instances of a given type for which the optimal value is available. In Table 2, we report the inference times in seconds for our trained agents. The ones for the exact method and DAAD are from Baggio et al. (2020).
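Assuming ε is the mean relative gap (in %) and ρ the mean ratio between the heuristic and optimal values over the exactly solved instances, the two metrics can be sketched as follows; the paper's exact definitions may differ slightly:

```python
def metrics(values, optima):
    """Average optimality gap (in %) and approximation ratio over the
    instances whose optimal value is known. Assumed definitions:
    gap_i = |v_i - opt_i| / opt_i, ratio_i = max(v_i, opt_i) / min(v_i, opt_i),
    so the ratio is at least 1 whichever side the heuristic errs on."""
    n = len(optima)
    eps = 100.0 * sum(abs(v - o) / o for v, o in zip(values, optima)) / n
    rho = sum(max(v, o) / min(v, o) for v, o in zip(values, optima)) / n
    return eps, rho

eps, rho = metrics([9.0, 10.0], [10.0, 10.0])   # one 10% miss, one exact hit
```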
| n | exact: time (s) | random: ε (%) | random: ρ | DAAD: time (s) | DAAD: ε (%) | DAAD: ρ | cur: time (s) | cur: ε (%) | cur: ρ | cur (weighted): ε (%) | cur (weighted): ρ | cur (directed): ε (%) | cur (directed): ρ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 20 | 29 | 68 | 3.32 | 6 | 0.3 | 1.00 | 0.4 | 0.5 | 1.00 | 5.7 | 1.07 | 6.9 | 1.07 |
| 40 | 241 | 52 | 2.64 | 13 | 7.6 | 1.09 | 0.9 | 5.0 | 1.06 | 11.9 | 1.13 | 6.5 | 1.07 |
| 60 | 405 | 68 | 3.24 | 38 | 7.3 | 1.09 | 1.5 | 4.4 | 1.05 | 4.4 | 1.05 | 3.7 | 1.04 |
| 80 | 636 | 55 | 2.28 | 60 | 3.8 | 1.04 | 2.8 | 2.7 | 1.03 | 1.6 | 1.02 | 2.8 | 1.03 |
| 100 | 848 | 45 | 1.86 | 207 | 2.7 | 1.03 | 8.7 | 49.6 | 1.50 | 1.8 | 1.02 | 4.1 | 1.05 |
Although the results in Table 1 are the outcome of the same total number of episodes and optimization steps for all algorithms, our experiments show that we can substantially reduce both the data and the number of steps without sacrificing much of the results for the curriculum, which cannot be said of MultiL-DQN, which is data-hungry; see Appendix E for details. The major drawback of MultiL-Cur is that it needs to compute all possible after-states at each step of the rollout. This does not scale well with the graph's size: having n nodes at the first step means that there are n graphs of size n − 1 to check. Thus, the curriculum we present is a promising step towards the automatic design of heuristics, while opening new research directions on restricting the exploration of rollouts.
Table 2 reveals that the results given by the MultiL-Cur algorithm are close to optimal for a fraction of the time necessary to both DAAD and the quickest exact solver known, presented in Baggio et al. (2020). For the unit-weight undirected MCN instances, the jump in the metrics for graphs of size 100 is due to one outlier among the exactly solved instances of this size; when it is removed, the values of ε and ρ drop to levels in line with the other sizes. The performances measured are consistent across the different problems, as we also report low values of ε and ρ for the weighted and directed variants. The curriculum we devised is thus a robust and efficient way to train agents in a Multilevel Budgeted setting.

The methodology presented in this work bridges different research fields: Combinatorial Optimization, Game Theory and Reinforcement Learning. Indeed, it contributes to tackling Combinatorial Optimization problems where players' rational behavior is considered, while acknowledging the intractability of exact solutions and, consequently, taking advantage of the recent advances in Reinforcement Learning to approximate them. In particular, advances in those areas will lead to improvements of our methodology. For instance, an interesting major benefit of our method is its ability to incorporate, and thus take advantage of, any exact method developed for problems with fewer levels. Hence, advancements in optimization directly translate into improvements of our approach.
Multilevel Budgeted Combinatorial problems are of very practical interest. Namely, they can be interpreted as models of robustness, given that the follower always selects the worst-case scenario for the leader. Therefore, the approaches presented in this paper represent a novel direction for building robust solutions to combinatorial problems.
Finally, we highlight the need for, and difficulty of, scaling up methods for multilevel optimization, due to their theoretical and empirical complexity. Note that scalability is of utmost importance, as many graph problems of practical interest are associated with social networks, which are extremely large. This motivates the development of heuristics, like the one presented here, that can both serve as standalone methods and warm-start or guide exact solvers. The main drawback of heuristics for multilevel optimization is their evaluation: given a leader's strategy, its associated reward (value) can only be evaluated if the remaining levels are solved to optimality. This means that, in contrast to single-level optimization, one must be very careful in interpreting the estimated reward associated with a heuristic method: we can be overestimating or underestimating it. In other words, in the remaining levels, players might be failing to behave in an optimal way. Consequently, further research is necessary to provide guarantees on the quality of the obtained solution, namely on dual bounds.
The authors wish to thank the support of the Institut de valorisation des données and Fonds de Recherche du Québec through the FRQ–IVADO Research Chair in Data Science for Combinatorial Game Theory, and the Natural Sciences and Engineering Research Council of Canada through the discovery grant 201904557. This research was enabled in part by support provided by Calcul Québec (
www.calculquebec.ca) and Compute Canada (www.computecanada.ca).

The Multilevel Budgeted Combinatorial optimization problem (1) is equivalent to:
We immediately have the following relation: optimizing a set move of cardinality at most k^t is equivalent to optimizing a sequence of at most k^t individual decisions made by the same player, as any order of the individual decisions leads to the same after-state. As the same reasoning holds at every subsequent level, we can apply it recursively, which closes the proof. ∎
For all t and all reachable states, we define the set of optimal actions at time t for the player concerned, making evident its dependence on the previous actions. As, by assumption, we consider games where players can only improve their objective by taking a decision, this set is non-empty. Recall that the random policy is defined as a random variable uniform over the available actions. Given an instance, take one of the possible sequences of optimal decisions. Then, using the chain rule, it is easy to show by recurrence that the random policy generates it with a strictly positive probability. In words, every optimal sequence of decisions is generated with a strictly positive probability. ∎

As the player currently playing is completely determined from the state, we can use the same neural network to estimate all the state-action values, regardless of the player. We call B the sum of all the budgets, such that an episode stops when the remaining total budget reaches 0.
As we use Monte-Carlo samples as targets, the values of the targets sampled from the replay memory do not depend on a current target network as in DQN Mnih et al. (2015), but on a previous version of our network, which can become outdated quickly. Thus, to easily control the number of times an old estimate is used, we decided to perform an epoch on the memory every time a batch of new samples was pushed, and chose the capacity so that the total number of times a Monte-Carlo sample is seen is directly controlled.
In order to constitute a test set to compare the results given by our heuristics with exact ones, we used the exact method described in Baggio et al. (2020) to solve a small number of instances. The algorithm they describe was designed for the unit-weight undirected MCN problem, but is directly applicable without change to the weighted case. However, in order to monitor the learning at each stage of the curriculum as in Table 1, there is a need to solve instances where node infections were already performed in the sequence of previous moves but there is still some budget left to spend for the attacker, which is not possible with their method as is. Moreover, small changes need to be made in order to solve instances of the directed variant.
We denote by J the set of nodes that are already infected at the attack stage and, for each node v, the indicator of whether v is in J or not. Then, the total set of infected nodes after the attacker spends his/her remaining budget and infects new nodes is I ∪ J. In order to find it, we use the AP algorithm of Baggio et al. (2020), with the following modification to the rlxAP optimization problem: