Learning Constructive Primitives for Online Level Generation and Real-time Content Adaptation in Super Mario Bros
Procedural content generation (PCG) is of great interest to game design and development as it generates game content automatically. Motivated by the recent learning-based PCG framework and other existing PCG works, we propose an alternative approach to online content generation and adaptation in Super Mario Bros (SMB). Unlike most existing works in SMB, our approach exploits the synergy between rule-based and learning-based methods to produce constructive primitives, i.e., quality yet controllable game segments in SMB. As a result, a complete quality game level can be generated online by integrating relevant constructive primitives via controllable parameters regarding geometrical features and procedure-level properties. Adaptive content can also be generated in real time by dynamically selecting proper constructive primitives via an adaptation criterion, e.g., dynamic difficulty adjustment (DDA). Our approach has several favorable properties in terms of content quality assurance, generation efficiency and controllability. Extensive simulation results demonstrate that the proposed approach can generate controllable yet quality game levels online and adaptable content for DDA in real time.
In the video game industry, game content construction and generation are laborious and enormously costly. As a potential solution, procedural content generation (PCG) is a game design and development methodology that aims to generate game content automatically via algorithms, and it has received a lot of attention from both academic research communities and the game industry. It is anticipated that the proper use of PCG techniques would reduce the cost of game design and development dramatically. Moreover, PCG could also provide a way to automatically generate personalized games that adapt content to a player’s preferences and optimize their cognitive/affective experience.
In PCG research, a number of methods have been proposed for game content generation. These techniques have been applied to different game genres ranging from platform games to first person shooters (see  for review) with different styles, e.g., online or offline content generation. The offline style tends to speed up the content construction process, while the online style can be used for endless or adaptive game generation. Although a wide variety of PCG techniques have been proposed, there are still some open challenges for content generation, including game quality assurance, generation efficiency, and controllability.
Super Mario Bros (SMB), a classic 2D platform game, has become a popular test bed for PCG-related research. In this platform game, a player runs from the left side of the screen to the right side, fights enemies, and rescues Princess Peach. SMB has a number of different game elements (e.g., enemies, coins, tubes and cannons) that can be generated via PCG. In recent years, several SMB level generators have been developed for the level generation track of the Mario AI Championship as well as presented in publications [5, 6, 7, 8, 9, 10]. While those level generators can generate complete levels in SMB, there are still several issues to be addressed:
Quality assurance. The quality of procedural SMB levels is generally not as high as handcrafted levels in real games ; levels generated by existing generators often contain aesthetically unappealing items (e.g., weird enemy/coin/box decoration), unplayable or unbalanced structures, unreachable resources (e.g., coins and boxes), and unexplainable difficulty curves. Such problems could result in negative gameplay experience.
Controllability. The geometrical features (e.g., coordinate of each enemy and tube) and some properties of procedural levels (e.g., linearity  and density ) are not directly controllable in a number of existing level generators. Instead a game designer has to encode the desired properties in handcrafted rules (e.g., ) in order to evaluate or control the procedural levels . However, those heuristic rules may be difficult to formulate  and could even slow down the level generation .
Unlike most of the existing works, we propose an alternative approach in this paper in order to tackle the aforementioned issues. Motivated by the learning-based PCG (LBPCG) framework  and other existing works [5, 6, 7, 8, 9, 10], we explore the content space in SMB from a different perspective by taking short game segments into account. To address the quality assurance issue, we exploit the synergy between rule-based and learning-based methods: easy-to-design rules are employed to remove apparently unappealing game segments, and then active learning, which encodes a game designer’s knowledge implicitly, is applied to obtain high quality game segments, hereinafter named constructive primitives (CPs). Those CPs not only provide quality building blocks but also enable effective control of the local geometry and the level properties in SMB level generation. As a result, a complete quality game level can be efficiently generated online by integrating relevant constructive primitives via controllable parameters on geometrical features, e.g., coordinates of enemies and cannons, and level properties, e.g., linearity, density and leniency. Moreover, adaptable content can be generated in real time by choosing the proper CPs with a given adaptation criterion, e.g., dynamic difficulty adjustment by matching content difficulty and a player’s performance. It is worth stating that the approach presented in this paper differs significantly from existing segment-based works in SMB (e.g., ) in terms of quality assurance, efficiency and controllability, which is discussed later on. Thus, learning CPs paves an alternative way of addressing those PCG issues in SMB.
The main contributions of this paper are summarized as follows: a) a novel approach to producing quality yet controllable game segments or CPs in SMB; b) a controllable online level generator and enabling techniques for real-time content adaptation based on CPs; and c) a thorough evaluation on our proposed approach.
The rest of the paper is organized as follows. Sect. II reviews the previous works. Sect. III presents our approach to learning CPs in SMB. Sect. IV describes two methods for applications of CPs in online level generation and real-time content adaptation. Sect. V reports simulation results, and the last section discusses relevant issues and implications of this research.
II Related Work
We review the relevant works in PCG and content adaptation that motivate our work presented in this paper.
II-A Procedural Content Generation
First of all, many SMB level generators were developed under the search-based procedural content generation (SBPCG) framework. In such approaches, game developers first construct a content space that contains all possible procedural levels and then employ heuristic/stochastic search algorithms, e.g., genetic algorithms, to find high quality levels in the content space. For instance, Sorenson et al. proposed a search-based SMB level generator for the level generation track of the Mario AI Championship. Their approach used a set of handcrafted constraints and a challenge-based metric as evaluation functions. During content generation, game levels were first tested against these constraints. Survivors of the test were then used in the search-based optimization. As a result, only game levels with the highest challenge-related fitness values were released to players. In this approach, the properties of procedural levels can be controlled by parameters used in evaluation functions. However, their level generator takes several minutes to generate 50 levels due to the huge content space and the use of multiple evaluation functions that may slow down the level generation. In addition, identifying proper constraints and formulating heuristic evaluation functions require game developers’ expertise and great effort. This process is often tedious and time-consuming as game content is observable but hard to explain and abstract. Hence, it is difficult to build up an explicit relationship between game content and its quality measurement via handcrafted constraints and heuristic evaluation functions. Similarly, other search-based SMB level generators, e.g., [6, 7, 8], suffer from the same problems and some of them do not adequately address the quality assurance issue pointed out in Sect. I.
As an alternative methodology in PCG, constructive PCG is also applied in the development of SMB level generators. In such approaches, a set of constructive rules is designed by human experts and then used to convert high-level game parameters, e.g., the number of enemies, game difficulty, style, and random seed, into concrete game levels. Furthermore, constructive rules have to assure the game quality. The Parameterized Notch generator  is a typical constructive level generator for SMB. In this approach, a procedural level is controlled and generated via constructive rules working on several content features, e.g., the number of enemies, the average width of gaps, and the spatial diversity of gaps. In comparison to the SBPCG approaches, creating constructive rules might be even more difficult; constructive rules are often more complicated than observable constraints. Hence, imperfect constructive rules may lead to low quality and limited controllability over game levels. This fundamental limitation also exists in other constructive SMB generators, e.g., [16, 17, 4, 10].
Unlike the SBPCG and the constructive PCG approaches that generate a game level at a global level, there are SMB level generators that first generate segments by exploring the local properties and then merge segments to generate a complete game level. We name such level generators segment-based approaches. For example, Smith et al. developed the Launchpad level generator  for generic 2D platform games including SMB. Instead of generating a complete level directly, they first used a grammar-based method that generates game segments named “rhythm groups”. Then they merged these rhythm groups together to form a complete level. Before releasing a generated level to a player, they used several critics to examine whether the generated game level is acceptable. As a salient characteristic, this approach provides human designers/users with a variety of parameters to control the procedural levels, ranging from frequencies of geometry components to the level path equation. However, some procedural level properties, e.g., linearity and component frequencies, had to be controlled by designer-specified evaluation functions, which often slows down the generation process. In addition, coordinates of game elements are not directly controllable in this approach. Although our proposed approach is motivated by the segment-based approaches, it differs considerably from the existing works in several aspects, including the controllability of geometrical features (e.g., coordinates of each enemy, cannon and hill) and procedural level properties (e.g., density, leniency and linearity). In particular, our approach generates quality segments of a fixed size by combining rule-based and learning-based evaluation functions rather than the handcrafted ones used in the existing segment-based methods, e.g., .
II-B Content Adaptation
Content adaptation is demanded in generating the personalized content to optimize cognitive/affective experience during gameplay . In general, such techniques can be applied to different types of game content ranging from non-player character behaviour or game AI to game geometry (see  for a review). As our work concerns only the game geometry adaptation, we briefly review the relevant work regarding SMB in this context.
A typical example of geometry adaptation is the personalized level generator in SMB [17, 18]. This level generator is controllable so that various game levels can be generated with a number of controllable features. For content adaptation, an affective model was created to map a player’s behaviour and the controllable content features onto each of the player’s affective states. With such a model, the level generator can generate personalized game levels for an individual. For an effective mapping, the controllable content features were selected based on their impact on the affective states. With the selected features, the game developers then designed a number of constructive rules to control these content features. As argued in Sect. II-A, proper constructive rules are difficult to handcraft and the use of evaluation functions to control a complete procedural level could slow down content generation. Like this work [17, 18], other geometry adaptation methods in SMB, e.g., [20, 19, 21, 4], are subject to the same limitations. By making use of CPs, we propose a method that tends to overcome such limitations in content adaptation.
III Learning Constructive Primitives
In this section, we first describe the motivation underlying our approach and then present our approach to producing constructive primitives (CPs) in SMB.
As summarized in Sect. I, there are three non-trivial issues in existing SMB level generators: game quality assurance, generation efficiency, and procedural level controllability.
Taking a close look at the existing SMB level generators reviewed in Sect. II, we observe that the content space of all complete procedural levels is huge. As there is an enormous variety of combinations among game elements and structures in procedural levels, an approach working on such a content space inevitably faces a greater challenge in managing quality assurance and efficiency in PCG. Thus, controlling the level generation may be rather complicated and difficult. Nevertheless, a complete procedural level in SMB can be decomposed into a number of segments, as is evident in segment-based level generators. Thus, partitioning a procedural level into fixed-size game segments without relying on any concepts, e.g., rhythm, allows us to explore the SMB content space from an alternative perspective. As a result, all the possible segments form a new content space of lower complexity. We believe that it is less difficult to understand the properties of short game segments and hence the use of those segments as building blocks would facilitate tackling the three non-trivial issues in SMB.
For quality assurance, there are generally two methodologies for developing such a mechanism in PCG [1, 14]: deductive vs. inductive. To adopt the deductive methodology, game developers have to understand the content space fully and know how to formulate/encode their knowledge into rules, fitness functions or constraints explicitly. In the presence of a huge content space, however, it would be extremely difficult to understand the entire content space, so that either only a small region of the content space is taken into account or less accurate (even conflicting) rules/constraints have to be used in PCG. The former could significantly limit the number of games generated by PCG while the latter is responsible for low quality games generated by PCG. Nevertheless, we observe that some rules/constraints are easy to design/identify while a complete set of rules for evaluating the content quality is hard to handcraft. For example, overlapping tubes in SMB are unacceptable and can be easily detected with a simple rule. On the other hand, a learning-based PCG (LBPCG) framework  was recently proposed where an inductive methodology, i.e., learning from data, was advocated for quality assurance. As game content is observable but less explainable, it is easier for game developers to make a judgement on quality for a specific game by applying their knowledge implicitly than to encode their knowledge into rules or constraints. Thus, the LBPCG suggests that a quality evaluation function should be learned from data annotated by game developers. Nevertheless, the annotation for producing learning examples may be more time-consuming than designing simple rules to detect those apparently low quality games. Hence, a hybrid approach to quality assurance would allow us to exploit the synergy between rule-based and learning-based methods.
With the motivation described above, we propose a hybrid approach to producing CPs, i.e., quality yet controllable game segments, in SMB. Fig. 1 illustrates the main steps of our approach. First of all, game developers choose a region of interest from the entire content space via control parameters. Then game segments in the region of interest are evaluated by a set of easy-to-design handcrafted conflict resolution rules and subsequently by a data-driven quality evaluation function that deals with more complicated quality issues. Game segments that survive both evaluations become CPs.
III-B Content Space
Based on our observation from empirical studies, it is sufficient to cover rich yet diverse types of games by using a game segment of 20 in length and 15 in height. Some typical game segment instances are illustrated in Fig. 2.
The SMB content is naturally specified by a 2D grid similar to an image. However, this leads to a 300-dimensional content space in our case, where there is a lot of redundancy, e.g., the uniform background. Motivated by the previous work , we employ a list of design elements as our content space representation, where a design element refers to an atomic unit used in procedural level generation [1, 6], e.g., enemy, box, coin, cannon, gap and so on. By using this representation, we can not only specify the content space concisely but also gain direct controllability over low-level content features, e.g., coordinates of enemies and coins. As listed in Table I, 85 controllable features are employed in our representation. Such a representation is similar to the previous work . In our content space, the design elements of each type are sorted in decreasing order along dimension.
| ID | Description | ID | Description |
|---|---|---|---|
| 1 | of initial platform | 42 - 47 | , , , , and of the second tube |
| 2 | number of gaps | 48 - 53 | , , , , and of the third tube |
| 3 - 5 | , and of the first gap | 54 | number of boxes |
| 6 - 8 | , and of the second gap | 55 - 58 | , , and of the first boxes |
| 9 - 11 | , and of the third gap | 59 - 62 | , , and of the second boxes |
| 12 | number of hills | 63 | number of enemies |
| 13 - 15 | , and of the first hill | 64 - 66 | , and of the first enemy |
| 16 - 18 | , and of the second hill | 67 - 69 | , and of the second enemy |
| 19 | number of cannons | 70 - 72 | , and of the third enemy |
| 20 - 24 | , , , and of the first cannon | 73 - 75 | , and of the fourth enemy |
| 25 - 29 | , , , and of the second cannon | 76 - 78 | , and of the fifth enemy |
| 30 - 34 | , , , and of the third cannon | 79 | number of coins |
| 35 | number of tubes | 80 - 82 | , and of the first coins |
| 36 - 41 | , , , , and of the first tube | 83 - 85 | , and of the second coins |
While the design element parameters in Table I have a wide range that specifies the entire content space, we confine our content space of interest to a non-trivial region of the entire content space by setting the maximum numbers of gaps, hills, tubes, cannons, boxes, coins and enemies that appear in a game segment to 3, 2, 3, 3, 2, 2 and 5, respectively. Consequently, there are roughly game segments in our content space. This content space should be sufficient for generating content with the variety of geometrical features, level structures and difficulties required by SMB.
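The index layout of Table I follows directly from these maximum counts and from the number of attributes listed per element type. The following minimal sketch recomputes it; the names `ELEMENT_SPEC` and `feature_layout` are hypothetical and introduced here only for illustration, and the attribute counts are read off the index ranges in Table I rather than taken from any released implementation.

```python
# (element type, maximum count per segment, attributes per element),
# listed in the order in which the element types appear in Table I.
ELEMENT_SPEC = [
    ("gap", 3, 3),
    ("hill", 2, 3),
    ("cannon", 3, 5),
    ("tube", 3, 6),
    ("box", 2, 4),
    ("enemy", 5, 3),
    ("coin", 2, 3),
]


def feature_layout(spec=ELEMENT_SPEC):
    """Return ({type: (count_index, first_attr, last_attr)}, total), 1-based."""
    layout = {"initial_platform": (1, 1, 1)}  # feature 1 describes the platform
    idx = 2
    for name, max_count, n_attrs in spec:
        count_idx = idx                      # "number of <type>" feature
        first = idx + 1                      # first attribute of the first element
        last = idx + max_count * n_attrs     # last attribute of the last element
        layout[name] = (count_idx, first, last)
        idx = last + 1
    return layout, idx - 1


layout, total = feature_layout()
print(total)           # total number of controllable features
print(layout["tube"])  # index of the tube counter and the tube attribute range
```

Under these assumptions the total comes out at 85, which agrees with the feature count stated above.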
III-C Conflict Resolution
In our content space, there are quite a number of game segments that contain conflicting design elements. For instance, “…Tube(7,2,3,4,4,flower)…Cannon(7,3,1,3,2)…” represents a game segment with at least one tube and one cannon whose coordinates coincide. Thus, the cannon and the tube overlap, and this conflicting situation makes the segment aesthetically unappealing.
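A conflict resolution rule of this kind can be as simple as a bounding-box overlap test. The sketch below is only an illustration under stated assumptions: the `Element` record with `x`, `y`, `width` and `height` fields is hypothetical, since the attribute order inside `Tube(...)` and `Cannon(...)` is not spelled out here, and the coordinates in the usage lines are made up.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Element:
    """Hypothetical design element described by an axis-aligned bounding box."""
    kind: str
    x: int
    y: int
    width: int
    height: int


def overlaps(a, b):
    """True if the bounding boxes of a and b share at least one grid cell."""
    return (a.x < b.x + b.width and b.x < a.x + a.width
            and a.y < b.y + b.height and b.y < a.y + a.height)


def find_conflicts(elements):
    """Return every pair of design elements that overlap each other."""
    return [(a, b) for a, b in combinations(elements, 2) if overlaps(a, b)]


segment = [
    Element("tube", x=7, y=2, width=2, height=3),
    Element("cannon", x=7, y=3, width=1, height=2),
    Element("coin", x=15, y=5, width=1, height=1),
]
conflicts = find_conflicts(segment)
print([(a.kind, b.kind) for a, b in conflicts])
```

A segment would be rejected as soon as `find_conflicts` returns a non-empty list, so such rules are cheap to evaluate before any learned classifier is consulted.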
III-D Learning Constructive Primitives
After filtering out those obviously unappealing game segments, the tailored content space still contains a lot of low quality segments, e.g., segments with unreachable coins and boxes, segments that are too difficult or too easy to play, and segments with unbalanced resources or aesthetically unappealing structures. Inspired by the LBPCG work , we learn a quality evaluation function from annotated game segments to remove unplayable/unacceptable segments. To carry out this idea, a binary classifier is trained whose input is the 85D feature vector of a game segment and whose output is a binary label that predicts the quality of the game segment. Game segments labeled as positive are CPs and are used for online level generation and content adaptation as described in Sect. IV.
To establish a data-driven evaluation function, training examples are required but have to be provided from game developers. As the tailored content space is still huge, it is infeasible to annotate all possible games in this content space. To keep the content space manageable, a proper sampling can be applied to achieve a much smaller data set of the same properties as the content space. Motivated by the success in the LBPCG work , we conduct clustering analysis on the data set and further employ active learning based on the clustering results to minimize a game developer’s efforts in data annotation. In summary, this CP learning process is depicted in Fig. 3.
For sampling, we apply simple random sampling with replacement, an unbiased sampling technique, to the tailored content space to obtain a manageable data set. To this end, we randomly set all the controllable features in the tailored content space to ensure that each game segment in this content space has an equal probability of being selected. The size of the data set is determined via the sample size determination (SSD) algorithm suggested in . With its theoretical justification, the SSD can decide the size of a sampled data set without loss of non-trivial information. By applying the SSD to our tailored content space, it is suggested that a data set of 19,000 games should be sufficient.
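As a minimal sketch of this sampling step, each feature can be drawn independently and uniformly from its admissible range, so every point of the product space is equally likely and repeated draws are allowed. The function name `sample_segments` and the three example ranges are hypothetical; the real feature ranges, and any dependency between a counter feature and the attributes it activates, are not modelled here.

```python
import random


def sample_segments(feature_ranges, n_samples, seed=None):
    """Draw n_samples feature vectors uniformly at random, with replacement.

    feature_ranges is a list of inclusive (low, high) integer bounds,
    one pair per controllable feature.
    """
    rng = random.Random(seed)
    return [[rng.randint(low, high) for low, high in feature_ranges]
            for _ in range(n_samples)]


# Illustrative ranges only: a counter in 0..3, an x coordinate, a y coordinate.
example_ranges = [(0, 3), (0, 19), (1, 15)]
data_set = sample_segments(example_ranges, n_samples=19000, seed=42)
print(len(data_set))
```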
We apply the CURE algorithm  to the sampled game set for clustering analysis since this hierarchical clustering algorithm can deal with a large data set and discover clusters of different sizes and complex shapes. There are four parameters in the CURE algorithm: the number of clusters, the sampling rate, the shrink factor and the number of representative points. By using the resulting dendrogram, the number of clusters is automatically decided based on the longest k-cluster lifetime. The remaining parameters are set to the defaults suggested in , i.e., 2.5% for the sampling rate, 0.5 for the shrink factor and 10 representative points, respectively. Due to the existence of two different feature types, i.e., nominal and ordinal, we employ the mixed-variable distance metric  in CURE. After clustering, we found 106 clusters in this sampled data set, and the clustering results are used to facilitate active learning.
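The longest-lifetime criterion can be illustrated independently of CURE. In the sketch below, which reflects our reading of the criterion rather than the exact procedure used in the experiments, the lifetime of a k-cluster partition is taken to be the range of dendrogram heights over which exactly k clusters exist, and the k with the largest range is selected. The function name and the merge heights in the usage lines are made up for illustration.

```python
def longest_lifetime_k(merge_heights):
    """Pick the number of clusters with the longest lifetime in a dendrogram.

    merge_heights holds the n - 1 heights at which an agglomerative
    clustering of n items merges two clusters. After the i-th merge
    (1-based, in ascending order of height) there are n - i clusters,
    and they persist until the next merge happens.
    """
    heights = sorted(merge_heights)
    if len(heights) < 2:
        raise ValueError("need at least two merges to compare lifetimes")
    n_items = len(heights) + 1
    best_k, best_lifetime = None, -1.0
    for i in range(len(heights) - 1):
        k = n_items - (i + 1)                    # clusters left after this merge
        lifetime = heights[i + 1] - heights[i]   # height range with k clusters
        if lifetime > best_lifetime:
            best_k, best_lifetime = k, lifetime
    return best_k, best_lifetime


k, lifetime = longest_lifetime_k([1.0, 1.2, 1.5, 4.0, 4.3])
print(k, lifetime)
```

The single-cluster partition is excluded by construction because its lifetime is unbounded above the final merge.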
III-D3 Active Learning
For binary classification, there are two error types: false negative (type-I error), where a high quality segment is misclassified as low quality, and false positive (type-II error), where a low quality segment is misclassified as high quality. Obviously, a type-II error could have a catastrophic effect while a type-I error simply shrinks the content space slightly. As a result, we formulate our classification as a cost-sensitive learning problem where the type-II error incurs a higher cost. By looking into several state-of-the-art classification techniques, we found that the weighted random forests (WRFs) , a cost-sensitive oblique random forests  classifier, fully meets our requirements for active learning. In our work, the parameters of WRFs  are set via validation as follows: 2:1, 50, 5, 10 and 9 for the cost ratio, the number of trees, the number of combined features, the number of feature groups selected at each node, and the depth of trees, respectively.
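To make the effect of an asymmetric cost concrete, the following sketch applies the standard minimum-expected-cost decision rule to a predicted probability. It is not the mechanism inside WRFs, which we do not reproduce here; the function name is hypothetical and the 2:1 default simply mirrors the cost ratio quoted above.

```python
def min_cost_label(p_high_quality, cost_false_positive=2.0,
                   cost_false_negative=1.0):
    """Return 1 (accept as high quality) or 0 (reject) at minimum expected cost.

    Accepting a segment risks a false positive with probability
    1 - p_high_quality; rejecting it risks a false negative with
    probability p_high_quality.
    """
    expected_cost_accept = (1.0 - p_high_quality) * cost_false_positive
    expected_cost_reject = p_high_quality * cost_false_negative
    return 1 if expected_cost_accept < expected_cost_reject else 0


print(min_cost_label(0.6))   # rejected under a 2:1 cost ratio
print(min_cost_label(0.7))   # accepted under a 2:1 cost ratio
```

With a 2:1 ratio a segment is accepted only when its predicted probability of being high quality exceeds 2/3, which trades a smaller content space for fewer low quality segments slipping through.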
After clustering, a small number of segments are selected from each cluster to form a validation set. The number of segments selected from each cluster is proportional to the cluster size. In total, there are 800 segments in the validation set. We annotate each game in the validation set by visual inspection in order to evaluate the generalization performance of a classifier during active learning.
During active learning, we randomly choose 100 segments and annotate them by visual inspection to train the initial WRFs. In each iteration, we find the 100 segments with the highest uncertainty scores, defined by where is the predicted label of segment , and is the probability of this prediction, and annotate them as examples for re-training the WRFs. The active learning stops when the accuracy of the WRFs on the validation set no longer increases. Our active learning algorithm is summarized in Algorithm 1.
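The query selection and the stopping test of such a loop can be sketched as follows. Since the exact uncertainty score is not reproduced in the text above, the sketch assumes a common choice, namely a score that peaks when the predicted probability of the positive class is 0.5; the function names are hypothetical as well.

```python
def uncertainty(p_positive):
    """Assumed score: 1 at p = 0.5, falling to 0 at p = 0 or p = 1."""
    return 1.0 - abs(2.0 * p_positive - 1.0)


def select_queries(pool_probabilities, batch_size=100):
    """Return the indices of the batch_size most uncertain pool segments."""
    ranked = sorted(range(len(pool_probabilities)),
                    key=lambda i: uncertainty(pool_probabilities[i]),
                    reverse=True)
    return ranked[:batch_size]


def should_stop(validation_accuracies):
    """Stop once the latest validation accuracy no longer increases."""
    return (len(validation_accuracies) >= 2
            and validation_accuracies[-1] <= validation_accuracies[-2])


pool = [0.90, 0.55, 0.10, 0.48, 0.70]
print(select_queries(pool, batch_size=2))
print(should_stop([0.80, 0.85, 0.85]))
```

In each round the selected segments would be annotated, added to the training set, and the classifier re-trained before `should_stop` is checked against the validation set.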
Once the learning-based evaluation function is constructed, we use it along with the aforementioned rules to produce CPs with favorable properties. In particular, CPs can be controlled directly via the relevant design elements in terms of geometrical features, level structures and difficulties.
IV Online Generation and Real-time Adaptation
In this section, we present techniques for applying constructive primitives to online procedural level generation and real-time content adaptation in terms of DDA.
IV-A Online Game Generation
As described in Sect. III, CPs provide quality building blocks and hence lumping them together can easily lead to a procedural level of aesthetically appealing content with a path between entrance and exit. In SMB, there are a variety of procedural levels that can be categorized based on a number of properties, e.g., density , leniency  and linearity . As our CPs are represented by design elements, we can generate a procedural level of pre-setting property via controllable level generation parameters.
Motivated by the previous works [7, 11], we employ three controllable level generation parameters, i.e., density, leniency and linearity, to generate a variety of levels online. The density controls the complexity of geometrical structures, e.g., a high density leads to many overlapping hills. The leniency decides the level difficulty in gameplay; intuitively, a high leniency results in an easy-to-play level. The linearity is yet another parameter that ensures there is a linear structure in a generated level; a large value leads to a level of highly linear structures. Each level generation parameter is carried out by setting the proper values to relevant design elements in CPs as follows:
Leniency. This parameter is implemented via controlling the number and type of enemies, number and width of gaps, number of cannons and tube flowers in CPs.
Density. This parameter is decided by the number and coordinates of hills in CPs.
Linearity. This parameter operates by specifying the height of platform, the number of hills, coordinates of tubes and cannons in CPs.
Each level generation parameter is set to , which divides our CPs into three categories reflecting the different properties specified by a specific parameter.
To generate a complete level, we first specify the desired values of the level generation parameters, which fix the parameter values of the relevant design elements, and set the other, irrelevant design elements in CPs randomly. Then, an iterative process is undertaken by merging CPs with the specified properties until a pre-specified length or a pre-specified number of CPs is reached. As each CP is produced very efficiently, our level generator works online. The online level generation algorithm is summarized in Algorithm 2. It is worth stating that linearity may conflict with density. Hence, we stipulate that the value set for density overrides that of linearity for their shared design elements. In other words, we do not want to generate highly linear procedural levels, which are often considered aesthetically unappealing.
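The iterative process can be viewed as a generate-and-test loop. The sketch below is a simplified illustration and not Algorithm 2 itself: `sample_segment`, `passes_rules` and `is_high_quality` are hypothetical callables standing in for the parameter-constrained sampler, the conflict resolution rules and the learned quality classifier, and the toy stand-ins in the usage lines have no game meaning.

```python
def generate_level(sample_segment, passes_rules, is_high_quality,
                   n_segments, max_tries=10000):
    """Collect n_segments constructive primitives by generate-and-test.

    sample_segment() returns a candidate segment whose relevant design
    elements already respect the chosen level generation parameters.
    A candidate is kept only if it passes the handcrafted rules and is
    accepted by the quality classifier.
    """
    level = []
    tries = 0
    while len(level) < n_segments:
        if tries >= max_tries:
            raise RuntimeError("not enough constructive primitives found")
        tries += 1
        candidate = sample_segment()
        if passes_rules(candidate) and is_high_quality(candidate):
            level.append(candidate)
    return level


# Toy stand-ins: integers play the role of segments.
candidates = iter(range(100))
level = generate_level(
    sample_segment=lambda: next(candidates),
    passes_rules=lambda s: s % 2 == 0,
    is_high_quality=lambda s: s < 50,
    n_segments=5,
)
print(level)
```

Running the cheap rule check before the classifier keeps the per-candidate cost low, which is what makes such a loop suitable for online use.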
IV-B Real-time Content Adaptation
For content adaptation, we confine our work on CPs to only the dynamic difficulty adjustment (DDA) where the game difficulty is automatically adjusted according to a player’s performance.
For DDA, it is essential to define content difficulty and measure a player’s performance. In our work, a difficulty parameter with five levels is defined for each CP based on its relevant design elements that affect a player’s performance, such as the number and the type of enemies, the number and the width of gaps, and the number of cannons and tube flowers. For instance, a CP of the highest difficulty level contains at least two enemies, one cannon and one gap, while there are at most one Goomba and one flower tube in a CP of the lowest difficulty level. Measuring a player’s performance may be complicated if all aspects need to be considered. In our work, we simplify this by only taking the survival rate on CPs into account.
By means of the segment-based methodology, we make use of our CPs for real-time DDA, i.e., the performance is measured locally on a CP instead of a level and the DDA is done instantly in response to a player’s local performance. This is naturally a sequential decision process and we formulate it as a Markovian decision process (MDP). Thus, our goal is to find an optimal policy that maximizes the expected long-term performance via adjusting the difficulty of generated CPs. To attain this goal, we define a regret $\rho(T)$ at time $T$ (i.e., after $T$ CPs have been played) as the absolute difference between an expected survival rate and the expectation of rewards:
$$\rho(T) = \left| r^{*} - \mathbb{E}\left[ \frac{1}{T} \sum_{t=1}^{T} r_t \right] \right| \qquad (1)$$
where $r^{*}$ is a pre-set optimal survival rate and $r_t$ is a binary reward: $r_t = 1$ if the player survives and $r_t = 0$ otherwise. For instance, we set $r^{*} = 0.8$ if a survival rate of 0.8 is expected. For a given $r^{*}$, levels of proper difficulties can be generated for a player to attain the expected performance. Thus, minimizing this regret is key to our content adaptation. To solve this optimization problem, we employ Thompson sampling, an effective and efficient heuristic method used in the binary reward case in MDP.
Let $\theta_k$ denote a player’s survival rate at difficulty level $k$ ($k = 1, \dots, 5$). Thus, a player survives with probability $\theta_k$ and dies with probability $1 - \theta_k$ when they play a CP of difficulty level $k$. During gameplay, one plays a number of CPs of different difficulty levels sequentially. When a CP of difficulty level $d_t$ is completed at time $t$, a binary reward $r_t$ is assigned. Given the player’s survival rate $\theta_{d_t}$ corresponding to $d_t$, the reward likelihood is
$$P(r_t \mid d_t, \theta_{d_t}) = \theta_{d_t}^{\,r_t} \, (1 - \theta_{d_t})^{1 - r_t} \qquad (2)$$
Furthermore, let $H_T = \{(d_t, r_t)\}_{t=1}^{T}$ denote the historical profile of difficulties and corresponding rewards after $T$ CPs have been played. By using a conjugate prior, $\theta_k \sim \mathrm{Beta}(\alpha_0, \beta_0)$, the posterior distribution of the survival rate based on the likelihood in Eq. (2) is $P(\theta_k \mid H_T) \propto P(\theta_k) \prod_{t:\, d_t = k} P(r_t \mid d_t, \theta_k)$, which leads to $\theta_k \mid H_T \sim \mathrm{Beta}(\alpha_0 + S_k, \beta_0 + F_k)$, where $S_k$ and $F_k$ are the numbers of survivals and deaths when playing the CPs of difficulty level $k$ in $H_T$.
For content adaptation, we follow the typical setting in real SMB games: at the beginning, a player is put in the small state, i.e., the weakest form of Mario, where the player is not allowed to use powerful weapons (e.g., throwing fireballs), and then turns into other states by powering up with a mushroom or fire flower. Based on the gameplay information recorded in $H_T$, CPs are randomly produced according to $P(\theta_k \mid H_T)$ and the CP of difficulty level $k$ is chosen as the game segment to play if it results in the least regret defined in Eq. (1). After a CP of difficulty level $k$ is played, the posterior $P(\theta_k \mid H_T)$ is updated based on the performance on the CP. This content adaptation process continues until a player quits, as summarized in Algorithm 3. It is worth stating that this algorithm is presented for a life-long gameplay scenario but is easily adapted to multiple gameplay episodes by substituting the Beta conjugate prior, $\mathrm{Beta}(\alpha_0, \beta_0)$, with the posterior obtained from the last episode, $\mathrm{Beta}(\alpha_0 + S_k, \beta_0 + F_k)$, in the initialization whenever a new episode starts.
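The adaptation loop can be sketched with a small Thompson sampling routine for binary rewards. This is an illustration under stated assumptions rather than Algorithm 3: the class name `ThompsonDDA`, the uniform Beta(1, 1) prior and the simulated player in the usage lines are all hypothetical choices made for the example.

```python
import random


class ThompsonDDA:
    """Thompson sampling over difficulty levels with Beta posteriors."""

    def __init__(self, n_levels=5, target=0.8, prior=(1.0, 1.0), seed=None):
        self.target = target                 # desired survival rate
        self.alpha = [prior[0]] * n_levels   # prior count plus survivals
        self.beta = [prior[1]] * n_levels    # prior count plus deaths
        self.rng = random.Random(seed)

    def choose_level(self):
        """Sample a survival rate per level; pick the one nearest the target."""
        draws = [self.rng.betavariate(a, b)
                 for a, b in zip(self.alpha, self.beta)]
        return min(range(len(draws)),
                   key=lambda k: abs(self.target - draws[k]))

    def update(self, level, survived):
        """Update the posterior of the played level with the binary outcome."""
        if survived:
            self.alpha[level] += 1
        else:
            self.beta[level] += 1


# Simulated player whose true survival rate drops with difficulty (made up).
true_rates = [0.95, 0.90, 0.80, 0.50, 0.20]
dda = ThompsonDDA(seed=0)
player = random.Random(1)
for _ in range(500):
    level = dda.choose_level()
    dda.update(level, survived=player.random() < true_rates[level])
print(dda.alpha, dda.beta)
```

For a multi-episode scenario, the `alpha` and `beta` lists from the previous episode would simply be kept as the starting point of the next one.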
V Experimental Results
In this section, we report results on CP learning, online level generation and simulated DDA. The game engine adopted in our experiments is a modified version of the open-source Infinite Mario Bros used in the Mario AI Championship [4, 30]. The level generators that yield the results reported in this section are publicly available on our project website (http://staff.cs.manchester.ac.uk/~shipa/mario.html).
V-A Results on Constructive Primitive Learning
| No. | Design element | No. | Design element | No. | Design element |
|---|---|---|---|---|---|
| 1 | coordinate of the first gap | 11 | number of enemies | 21 | coordinate of the first cannon |
| 2 | of the first hill | 12 | number of cannons | 22 | coordinate of the first box |
| 3 | coordinate of the first enemy | 13 | coordinate of the first cannon | 23 | of the first box |
| 4 | of the first hill | 14 | coordinate of the first enemy | 24 | of the third enemy |
| 5 | of the first gap | 15 | of the second hill | 25 | coordinate of the first coin |
| 6 | of the first cannon | 16 | of the first tube | 26 | of the second gap |
| 7 | number of gaps | 17 | coordinate of the first coin | 27 | of the second cannon |
| 8 | of the first cannon | 18 | of initial platform | 28 | of the second cannon |
| 9 | type of the first enemy | 19 | coordinate of the second gap | 29 | coordinate of the third cannon |
| 10 | of the first cannon | 20 | number of hills | 30 | coordinate of the first tube |
Based on the learning algorithm described in Sect. III, Fig. 4 illustrates how the performance of our active learning evolves on the validation set, including the type-I and type-II error rates as well as their average, the half total error rate (HTER). From Fig. 4, it is observed that the active learning converges after 1100 data points.
While the final HTER is around 11.67%, the corresponding type-I error rate is around 19.66%, and some misclassified segment instances are shown in Fig. 5(A). By analyzing the clustering results, we find that those segments are concentrated in a quite small region of the content space. Therefore, they are unlikely to be sampled and so were missed in establishing the validation set. It is evident from Fig. 4 that our cost-sensitive classifier performs well in minimizing the type-II error; the type-II error rate is approximately 3.69%. Among those low-quality segments misclassified as high quality, about 0.74% of segments contain unreachable resources (e.g., coins and boxes), and the remaining segments are either too easy or too difficult to play or less appealing aesthetically, as exemplified in Fig. 5(B). While such segments may result in a negative gameplay experience, fortunately, none of the unplayable segments in the validation set was misclassified.
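As a quick check on how the three reported figures relate, the sketch below computes the two error rates and their average from labelled predictions. It assumes that label 1 means high quality and 0 means low quality, and that a type-I error is a high-quality segment rejected as low quality; the text only states the type-II direction explicitly. The function names are illustrative.

```python
def hter(type1_rate, type2_rate):
    """Half total error rate: the plain average of the two error rates."""
    return (type1_rate + type2_rate) / 2.0


def error_rates(y_true, y_pred):
    """Returns (type-I rate, type-II rate, HTER) for binary quality labels.

    Assumption: 1 = high quality, 0 = low quality.
    Type I  = high-quality segment classified as low quality (assumed direction).
    Type II = low-quality segment classified as high quality.
    """
    n_high = sum(1 for t in y_true if t == 1)
    n_low = len(y_true) - n_high
    missed_high = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    accepted_low = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    type1 = missed_high / n_high
    type2 = accepted_low / n_low
    return type1, type2, hter(type1, type2)
```

With the rates quoted above, `hter(0.1966, 0.0369)` gives 0.11675, which matches the reported HTER of about 11.67%.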
In this experiment, we use the permutation test pertaining to RFs to measure the importance of design elements in the CP learning. As a result, Table II lists the top 30 design elements that significantly affect the quality of CPs. From Table II, it is observed that there is a close connection between design elements such as gaps, hills and cannons and the quality of a game segment. In addition, the x and y coordinates and geometrical properties of game elements (e.g., their width and height) are among the most important features for the CP prediction.
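The idea behind a permutation test for feature importance can be illustrated with a model-agnostic sketch: shuffle one feature column, re-evaluate the classifier, and record how much accuracy drops. This is a simplified version written for illustration. The random-forest variant typically evaluates on out-of-bag samples per tree, which is not reproduced here, and the `predict` callable is a hypothetical stand-in for a trained classifier.

```python
import random


def accuracy(predict, X, y):
    """Fraction of rows for which the classifier output equals the label."""
    return sum(1 for row, label in zip(X, y) if predict(row) == label) / len(y)


def permutation_importance(predict, X, y, n_repeats=10, seed=0):
    """Mean accuracy drop per feature when that feature's column is shuffled.

    X is a list of feature rows (lists); predict maps one row to a label.
    """
    rng = random.Random(seed)
    baseline = accuracy(predict, X, y)
    importances = []
    for j in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)  # break the link between feature j and the label
            X_perm = [row[:j] + [value] + row[j + 1:] for row, value in zip(X, column)]
            drops.append(baseline - accuracy(predict, X_perm, y))
        importances.append(sum(drops) / n_repeats)
    return importances
```

A feature the classifier never consults receives an importance of exactly zero, while a feature it relies on shows a positive drop; ranking features by this drop yields a list of the kind shown in Table II.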
V-B Results on Online Level Generation
In all our experiments, a game level is confined to a 2D map of 200 in length and 15 in height, the same setting as in previous works, e.g., [5, 6, 7, 8, 9]. We evaluate our online level generator in terms of expressive range, controllability and generation efficiency.
V-B1 Expressive Range
Expressive range refers to the range and variation of procedural levels according to an evaluation metric that measures a property of procedural levels. By using a specific metric, the expressive range is often used to visualize the space of all possible procedural games. For evaluation, we use linearity, density, leniency and compression distance as our metrics. Such metrics allow us to reveal global properties of the game levels generated by a level generator.
For linearity, we follow a previously suggested method to find a line that fits the profile of a procedural level, and the coefficient of determination is used to estimate the degree of linearity. For density, we count the number of all possible standing positions in a game level. For leniency, we assign a value to each type of game element, the same as used in [5, 7] (i.e., enemy: -1.0; gap, cannon or flower tube: -0.5; and powerup: +1.0). The overall leniency score is the sum of these values. Finally, the normalized compression distance (NCD) is employed to measure the dissimilarity between two game levels; the higher this distance, the more dissimilar the two levels are.
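Three of these metrics are simple enough to sketch directly. The code below is an illustration under assumptions rather than the evaluation code used in the experiments: a level profile is taken to be a list of ground heights per column, game elements are given as a list of type names, levels are serialised to bytes before compression, and `zlib` stands in for whichever compressor was actually used.

```python
import zlib

# Leniency weights as listed in the text; the key names are illustrative.
LENIENCY_WEIGHTS = {
    "enemy": -1.0,
    "gap": -0.5,
    "cannon": -0.5,
    "flower_tube": -0.5,
    "powerup": 1.0,
}


def linearity(profile):
    """Coefficient of determination (R^2) of a least-squares line through the profile."""
    n = len(profile)
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(profile) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, profile))
    ss_tot = sum((y - mean_y) ** 2 for y in profile)
    if ss_tot == 0:
        return 1.0  # a perfectly flat profile lies exactly on a line
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, profile))
    return 1.0 - ss_res / ss_tot


def leniency(elements):
    """Sum of the per-element leniency values over all elements in a level."""
    return sum(LENIENCY_WEIGHTS[e] for e in elements)


def ncd(level_a, level_b):
    """Normalized compression distance between two byte-serialised levels."""
    c_a = len(zlib.compress(level_a))
    c_b = len(zlib.compress(level_b))
    c_ab = len(zlib.compress(level_a + level_b))
    return (c_ab - min(c_a, c_b)) / max(c_a, c_b)
```

Raw scores computed this way would still need to be rescaled before generators are compared on a common axis.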
For a thorough evaluation, we compare our level generator with a number of typical SMB level generators reviewed in Sect. II, including the Notch, Parameterized Notch (PN), Grammar Evolution (GE) and Launchpad generators. For fairness, we adopt a previously suggested experimental protocol. As a result, each level generator generates 200 procedural levels for evaluation in terms of linearity, density and leniency, and the NCD metric is applied to 200 pairs of levels. (The levels generated by the other generators were provided by N. Shaker, and only 200 procedural level images are available.) The level generation parameters used in ours are set randomly, and the others use their default settings. To our knowledge, the parameters in PN and Launchpad were set randomly, and there is no parameter in GE. The scores measured with a metric are normalized to the range [0, 1].
The results on expressive ranges are shown in Fig. 6. It is observed from Fig. 6 that Launchpad can generate both linear and nonlinear levels, while the remaining generators tend to generate nonlinear levels. As described in Sect. IV.A, the setting in our generator prefers nonlinear levels. Regarding density, the expressive ranges of GE and ours are wider than those of the others, which tend to generate levels with a low density. Regarding leniency, Launchpad and PN tend to generate levels of medium difficulty. In contrast, the others, including ours, may generate both difficult and easy game levels. Among all the generators, the expressive range of ours is the widest in terms of leniency and density. The differences in expressive range among those level generators are attributable to the use of different content spaces and level generation parameters. From Fig. 6, it is evident that PN receives the lowest NCD score, which implies that relatively similar levels are generated. In contrast, the others generate levels of greater diversity, as there are larger NCD distances among the levels they generate, and ours generates levels of medium diversity in comparison to the others.
We further evaluate the controllability of our level generator. In general, controllability can be reflected in the expressive ranges of procedural levels generated with different level generation parameter settings. As ours has three level generation parameters and each may take one of three values as described in Sect. IV.A, we exhaustively generate nine sets of levels by fixing one parameter to a specific value and randomly setting all other parameters each time. To see the controlling effect clearly, we also generate a set of levels by setting all the parameters randomly. Thus, we obtain 10 level sets, each of which contains 1000 levels for reliability. In terms of linearity, density and leniency, the expressive ranges of levels controlled by different parameters are shown in Fig. 7, where it is clearly seen that levels of a specific property are generated by properly controlling a parameter.
For exemplification, Figs. 8-10 illustrate some levels generated by controlling parameters in a specific way. By visual inspection, we observe that the level shown in Fig. 8 (C) is smoother than those in Fig. 8 (A) and (B) since it is generated by using the highest linearity value. From Fig. 9, it is seen that the level shown in Fig. 9 (C) is easier to play (e.g., fewer enemies and narrower gaps) than those shown in Fig. 9 (A) and (B) due to the use of the highest leniency value. It is evident from Fig. 10 that there are more complicated geometrical structures (e.g., more overlapping hills) in the level shown in Fig. 10 (C) than in those shown in Fig. 10 (A) and (B), as the highest density value is applied.
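The experimental design behind the ten level sets can be made concrete with a short enumeration. The parameter names follow the three properties examined in the figures, while the value labels `low`, `medium` and `high` are hypothetical placeholders for the three values defined in Sect. IV.A.

```python
import random

PARAMETERS = ("linearity", "density", "leniency")
VALUES = ("low", "medium", "high")  # hypothetical labels for the three settings


def build_level_set_configs():
    """Nine configs with one parameter fixed, plus one with everything random.

    A value of None means "draw this parameter at random for every level".
    """
    configs = []
    for fixed_param in PARAMETERS:
        for value in VALUES:
            config = {p: None for p in PARAMETERS}
            config[fixed_param] = value
            configs.append(config)
    configs.append({p: None for p in PARAMETERS})  # fully random baseline set
    return configs


def sample_parameters(config, rng):
    """Resolves one concrete parameter setting for a single level."""
    return {p: (v if v is not None else rng.choice(VALUES)) for p, v in config.items()}
```

Calling `sample_parameters` 1000 times per config would give the parameter settings for one level set; comparing each fixed-parameter set against the fully random one isolates the effect of that parameter.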
Generation efficiency is often evaluated by the actual time taken to generate a level. As we do not have the code of the level generators used for comparison in our experiments, we can only report the result on ours. Tested on a PC (Intel Core i5-3470 processor with 8GB memory), our level generator takes only 0.057 sec on average to generate a procedural level (a 2D map), which should be able to meet the online generation requirements.
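A measurement of this kind can be taken with a small timing harness such as the one below. It is only a sketch: `dummy_generate_level` is a hypothetical stand-in that fills a 200 by 15 grid at random and has nothing to do with the CP-based generator, so the times it produces say nothing about the figure reported above.

```python
import random
import time


def mean_generation_time(generate_level, n_runs=100):
    """Average wall-clock seconds per call to generate_level over n_runs calls."""
    start = time.perf_counter()
    for _ in range(n_runs):
        generate_level()
    return (time.perf_counter() - start) / n_runs


def dummy_generate_level(length=200, height=15, rng=random):
    """Hypothetical placeholder generator: a random grid of empty and solid tiles."""
    return [[rng.choice(".#") for _ in range(length)] for _ in range(height)]
```

Averaging over many runs, as here, smooths out scheduling noise, which matters when a single generation takes only tens of milliseconds.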
V-C Results on Real-time Content Adaptation
For the evaluation of our proposed real-time content adaptation for DDA, we conduct simulations with sophisticated Mario controllers of different types instead of human players, as previously suggested. The use of agents for DDA tests may benefit from the agents’ stable behavior, their diversified playing styles and a wide range of skills. As a result, we employ 15 agents submitted to the Gameplay track of the Mario AI Competition. To test our adaptive generator, we use the completion rate, i.e., the ratio of the actual distance travelled to the length of the game level being played, as an evaluation criterion for DDA. Moreover, we further examine the online learning performance of our adaptation algorithm based on the completion rates of three typical agents in response to adaptive content generated for a pre-set optimal survival rate. In our simulation, a level generated with our adaptation algorithm is limited to a maximum length of 200.
For reliability, each of the 15 agents played three sets of adaptive games generated by Algorithm 3 with three different optimal survival rates, where each set consists of 200 levels. For a baseline, we also randomly generated 200 levels of the same length and refer to them as static games since no DDA is applied; each agent also plays the games in the static game set.
Fig. 11 illustrates the mean and the standard deviation of the completion rates achieved on the four game sets by the 15 agents. (The slight difference between the previously reported performance and that reported here is due to different Mario start-state settings: the fire state in theirs but the small state in ours.) As shown in Fig. 11, Peter’s, Andy’s and Robin’s agents outperform the other agents in terms of the average completion rate thanks to the A* algorithm used in their implementation. Hence, we regard these three agents as “skilful players” and all the rest as “novices”. It is observed from Fig. 11 that our real-time adaptation algorithm works well for all the novice agents, given that their completion rates on the adaptive game sets are higher than those on the static game set, which implies that easier levels were dynamically generated to improve their performance. In contrast, the DDA performance varies for the three skilful agents. While the completion rates of Robin’s agent are achieved as expected, our adaptation algorithm does not work well for Peter’s and Andy’s agents. As previously reported, these two agents can survive nearly all the procedural levels no matter how difficult they are. Thus, our algorithm could hardly find games to challenge them. From Fig. 11, it is also evident that a higher optimal survival rate generally leads to a higher completion rate, as required by DDA.
To evaluate the online learning performance of our adaptation algorithm, we adopted Rafael’s, Sergio’s and Robin’s agents since they represent agents at different levels in light of playing styles and skills: Rafael’s and Sergio’s agents are novices and Robin’s agent is a skilful player. In our experiment, each agent played 30 successive games generated in a gameplay episode by our adaptation algorithm with a pre-set optimal survival rate. For comparison, the three agents also played 30 randomly generated static game levels of the same length. For reliability, we repeated this experiment for 30 trials, and the mean and the standard deviation of the completion rates achieved by the three agents are illustrated in Fig. 12. From Fig. 12(A) and (B), it is seen that the average completion rates of Rafael’s and Sergio’s agents on adaptive games are always higher than those on static games thanks to the adaptation that generates easier levels. In particular, the completion rates on adaptive games gradually increase and become stable after roughly 5-8 games were played. In contrast, the completion rates of Robin’s agent gradually decrease, as observed from Fig. 12(C), where the completion rates on adaptive games are always lower than those on their counterpart static games after 14 games were played, thanks to the adaptation that keeps generating more difficult games.
For demonstration, we exhibit three procedural levels generated by our adaptation algorithm for the three aforementioned agents in Fig. 13. In the level for Rafael’s agent shown in Fig. 13(A), there are fewer enemies than in those generated for Sergio’s and Robin’s agents shown in Figs. 13(B) and 13(C). The level shown in Fig. 13(B) is at an intermediate difficulty level, with several enemies and flower tubes, while the one shown in Fig. 13(C) contains a lot of enemies and gaps and is apparently a difficult level to play.
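The completion-rate criterion is straightforward to compute. The helper below is a minimal sketch with illustrative names; the use of the sample standard deviation is an assumption, since the text does not say which estimator was used for the reported spread.

```python
from statistics import mean, stdev


def completion_rate(distance_travelled, level_length=200):
    """Ratio of the distance an agent travelled to the length of the level."""
    return min(distance_travelled, level_length) / level_length


def summarise_completion(distances, level_length=200):
    """Mean and sample standard deviation of completion rates over a game set."""
    rates = [completion_rate(d, level_length) for d in distances]
    return mean(rates), stdev(rates)
```

For example, an agent that dies halfway through a level of length 200 scores 0.5, and one that reaches the end scores 1.0.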
In summary, the experimental results reported in this section demonstrate that our proposed algorithms based on CPs work effectively in generating quality and adaptable SMB games.
In this section, we discuss the issues arising from our work and relate ours to previous works.
It is well known that automatically generated procedural SMB levels are generally worse than handcrafted levels. This is further exacerbated by other simultaneous requirements in generation efficiency and controllability. By exploiting the nature of 2-D platform games, we effectively limit this problem to a smaller yet manageable content space consisting of short game segments without compromising the content diversity. Furthermore, we explore and exploit the synergy between rule-based and learning-based methodologies to produce quality building blocks, i.e., constructive primitives (CPs). While our hybrid quality assurance method appears effective, there is an issue to be addressed: how to trade off between the two methodologies. On the one hand, strict yet aggressive rules may completely remove all the unplayable content but often get rid of playable content as well, so that the diversity of playable content is significantly limited. Due to the high complexity of the content space, it is extremely difficult to formulate rules, in particular, rules concerning all the aspects of game quality. On the other hand, a learning-based approach may address all the quality assurance issues with a single evaluation function. It works efficiently with game developers’ judgment on the quality of training examples instead of handcrafting constraints and deductive rules based on the understanding of an entire content space [5, 6, 7, 8, 9, 17, 16, 15]. For example, the developer took only two hours to annotate 1900 games via visual inspection for our CP active learning as described in Sect. III. However, a learning-based approach rarely yields error-free performance.
Thus, our work presented in this paper suggests the ultimate goal for a hybrid approach as follows: without compromising the content diversity, efficient rule-based methods should concentrate on unplayability only to avoid catastrophic failures, and learning-based methods should take charge of other quality assurance aspects after unplayable content removal.
While our online level generator clearly benefits from CPs in efficiency, it gains remarkable controllability via design elements, a direct content representation concerning low-level geometrical features, working at a local level for CPs. By controlling relevant design elements directly, ours generates a procedural level efficiently by integrating CPs of desired properties. It is noticed that similar design elements were used as content representation in previous works [6, 7], where those attributes are specified for an entire procedural level. As a result, such approaches have to formulate constructive rules or evaluation functions to control level properties globally. Those rules or evaluation functions have to be handcrafted with great effort [12, 31], and they may work less efficiently, especially for generating procedural levels of a considerable length. Thus, our approach differs significantly from those working at a global level [6, 7] in terms of content representation and resultant controllability.
In general, our online level generator may be viewed as a hybrid PCG approach if we position it in light of the existing taxonomy. On the one hand, we use a generate-and-test method to produce CPs for quality assurance. On the other hand, a procedural level is constructively generated via a number of controllable parameters for efficiency. Apparently, ours differs from those generate-and-test (e.g., [5, 6, 7, 8]) or constructive (e.g., [10, 17, 16, 15]) SMB level generators. As a hybrid approach, however, the desired level properties have to be specified via setting controllable parameters at a local level. This would be a potential weakness when such properties are unknown or hard to specify.
For real-time DDA, our algorithm seems to resemble some previous works [20, 19]. However, they differ in difficulty controllability and performance measurement. While a CP of certain difficulty is achieved by setting a controllable parameter in ours, a proper segment in theirs has to be generated via a rhythm-based mechanism or a set of constructive rules [20, 19]. In addition, we use the survival rate, an objective metric, for measuring performance, while they adopt the player’s subjective feedback to decide the appropriateness of a difficulty level. While our CP-based adaptation algorithm is promising in performance-driven DDA, it is subject to a number of limitations: (a) the current evaluation is solely based on agents instead of human players, and the adaptation process does not seem rapid; (b) for DDA, the assumption used in our simulation, a strong relationship between a player’s performance and experience, may be questionable as suggested in recent studies; and (c) it is unclear how to adapt content via CPs in the presence of subjective feedback on an entire procedural level (e.g., [17, 18, 21]), which has been studied under the EDPCG framework.
In conclusion, we have presented a novel approach to online level generation and real-time content adaptation in SMB via learning constructive primitives. Our approach has been evaluated by comparison with state-of-the-art SMB level generators in terms of quality, efficiency and controllability. Experimental results suggest that it meets the online level generation requirements and has potential in real-time content adaptation. In our ongoing work, we aim to address the issues discussed above and to further investigate our algorithms in different applications, e.g., experience-driven DDA. Although segment-based experience-driven content adaptation has been studied for SMB (e.g., [19, 20]), that work relies entirely on self-reported feedback on each short segment, which severely interrupts the gameplay experience. In our work, we are going to overcome this fundamental weakness by exploring a player’s behavior and relevant content features at the CP level for a scenario in which only self-reported feedback on a complete procedural level is available. We anticipate that such real-time content adaptation minimizes gameplay interruption and, in turn, yields an optimal gameplay experience. Furthermore, we would like to investigate the feasibility of extending the approach proposed in this paper to other game genres such as first-person shooters.
The authors are grateful to N. Shaker for providing SMB procedural level images used in our comparative study and to those members in the Google PCG Interest Group who responded to our enquiries for useful discussions.
- [1] J. Togelius, G. N. Yannakakis, K. O. Stanley, and C. Browne, “Search-based procedural content generation: A taxonomy and survey,” IEEE Trans. Comput. Intell. AI in Games, vol. 3, no. 3, pp. 172-186, 2011.
- [2] J. Togelius, A. J. Champandard, P. L. Lanzi, M. Mateas, A. Paiva, M. Preuss, and K. O. Stanley, “Procedural content generation: goals, challenges and actionable steps,” Artificial and Comput. Intell. in Games, vol. 6, pp. 61-75, 2013.
- [3] Nintendo EAD, “Super Mario World,” (Game) 1990.
- [4] N. Shaker, J. Togelius, G. N. Yannakakis, B. Weber, T. Shimizu, T. Hashiyama, N. Sorenson, P. Pasquier, P. Mawhorter, G. Takahashi, G. Smith, and R. Baumgarten, “The 2010 Mario AI championship: Level generation track,” IEEE Trans. Comput. Intell. AI in Games, vol. 3, no. 4, pp. 332-347, Sep. 2011.
- [5] G. Smith, J. Whitehead, M. Mateas, M. Treanor, J. March, and M. Cha, “Launchpad: A rhythm-based level generator for 2-d platformers,” IEEE Trans. Comput. Intell. AI in Games, vol. 3, no. 1, pp. 1-16, Mar. 2011.
- [6] N. Sorenson, P. Pasquier, and S. DiPaola, “A generic approach to challenge modeling for the procedural creation of video game levels,” IEEE Trans. Comput. Intell. AI in Games, vol. 3, no. 3, pp. 229-244, Sep. 2011.
- [7] N. Shaker, M. Nicolau, G. N. Yannakakis, J. Togelius, and M. O’Neill, “Evolving levels for Super Mario Bros using grammatical evolution,” in Proc. IEEE Conf. Comput. Intell. Games, 2012, pp. 304-311.
- [8] S. Dahlskog and J. Togelius, “A multi-level level generator,” in Proc. IEEE Conf. Comput. Intell. Games, 2014, pp. 1-8.
- [9] P. Mawhorter and M. Mateas, “Procedural level generation using occupancy regulated extension,” in Proc. IEEE Conf. Comput. Intell. Games, 2010, pp. 351-358.
- [10] S. Snodgrass and S. Ontañón, “Experiments in map generation using markov chains,” in Proc. FDG Workshop on Procedural Content Generation, 2014.
- [11] G. Smith and J. Whitehead, “Analyzing the expressive range of a level generator,” in Proc. Workshop on Procedural Content Generat. Games, 2010.
- [12] G. N. Yannakakis and J. Togelius, “Experience-driven procedural content generation,” IEEE Trans. Affective Comput., vol. 2, no. 3, pp. 147-161, 2011.
- [13] B. Horn, S. Dahlskog, N. Shaker, G. Smith, and J. Togelius, “A comparative evaluation of procedural level generators in Mario AI framework,” in Proc. FDG Workshop on Procedural Content Generation, 2014.
- [14] J. Roberts and K. Chen, “Learning-based procedural content generation,” IEEE Trans. Comput. Intell. AI in Games, vol. 7, no. 1, pp. 88-101, 2015.
- [15] N. Shaker, G. N. Yannakakis, and J. Togelius, “Feature analysis for modeling game content quality,” in Proc. IEEE Conf. Comput. Intell. Games, 2011, pp. 126-133.
- [16] P. Persson, “Infinite Mario bros,” [Online]. Available: http://www.mojang.com/notch/mario/.
- [17] C. Pedersen, J. Togelius, and G. N. Yannakakis, “Modeling player experience in super mario bros,” in Proc. IEEE Conf. Comput. Intell. Games, 2009, pp. 132-139.
- [18] N. Shaker, G. N. Yannakakis, and J. Togelius, “Towards automatic personalized content generation for platform games,” in Proc. Artif. Intell. Interact. Digital Entertain., 2010.
- [19] S. Bakkes and S. Whiteson, “Towards challenge balancing for personalised game spaces,” in Proc. FDG Workshop on Procedural Content Generation, 2014.
- [20] M. Jennings-Teats, G. Smith, and N. Wardrip-Fruin, “Polymorph: A model for dynamic level generation,” in Proc. Artif. Intell. Interact. Digital Entertain., 2010, pp. 138-143.
- [21] N. Shaker, G. N. Yannakakis, J. Togelius, M. Nicolau, and M. O’Neill, “Evolving personalized content for super mario bros using grammatical evolution,” in Proc. Artif. Intell. Interact. Digital Entertain., 2012.
- [22] S. K. Thompson, Sampling (second edition). New York: Wiley, 2002.
- [23] S. Guha, R. Rastogi, and K. Shim, “CURE: an efficient clustering algorithm for large databases,” ACM SIGMOD Record, vol. 27, no. 2, 1998.
- [24] A. L. N. Fred and A. K. Jain, “Combining multiple clusterings using evidence accumulation,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 27, no. 6, pp. 835-850, 2005.
- [25] J. Han and M. Kamber, Data mining. San Francisco, CA: Morgan Kaufmann, 2001.
- [26] C. Chen, A. Liaw, and L. Breiman, “Using random forest to learn imbalanced data,” Univ. California, Berkeley, CA, Tech. Rep., 2004.
- [27] L. Breiman, “Random forests,” Machine learning, vol. 45, no. 1, pp. 5-32, 2001.
- [28] R. Bellman, “A Markovian decision process,” Journal of Mathematics and Mechanics, vol. 6, pp. 679-684, 1957.
- [29] W. R. Thompson, “On the likelihood that one unknown probability exceeds another in view of the evidence of two samples,” Biometrika, pp. 285-294, 1933.
- [30] J. Togelius, S. Karakowskiy, and R. Baumgarten, “The 2009 Mario AI Competition,” in Proc. IEEE Congr. Evol. Comput., 2009.
- [31] N. Shaker, J. Togelius, and M. J. Nelson, Procedural Content Generation in Games, A textbook and an overview of current research. Springer. (to appear)
- [32] D. Buckley, K. Chen, and J. Knowles, “Rapid skill capture in a first-person shooter,” IEEE Trans. Comput. Intell. AI in Games, vol. 8, 2016. (to appear)