Reinforcement Learning for Super Smash Bros. Melee

Robert Olson, Logan Hensley
University of Oklahoma, 1800 Parrington Oval, Norman, OK
[email protected], [email protected]
Abstract

Video games have long been the subject of reinforcement learning problems, going all the way back to Pong, and as games have grown in complexity, so have the reinforcement learning problems surrounding them. We implemented two approaches to reinforcement learning in the complex fighting game Super Smash Bros. Melee: the first uses Q-Learning and the second uses SARSA learning. Both represent their state spaces using a tree structure in which each node branches on a particular state attribute. We compare the learning of these approaches to a random agent, a heuristic agent included in the game, and an agent implemented by researchers at MIT. The agents were run in the Dolphin Nintendo GameCube emulator environment, with open-source software used to retrieve state information from the game.
1. Learning Task

1.1 Goal

In the long term, our goal was to create an agent that plays Super Smash Bros. Melee at a high level, capable of competing with humans. However, the scope of this project covers only the lowest level of training, wherein the learning agent is set against the lowest-difficulty heuristic opponent included in the game. Thus, our goal for this project is for the learning agent to outperform the heuristic opponent in head-to-head matchups.
1.2 Criteria

In order to evaluate the learning of our agents, it is necessary to track several key variables. First, so that it is possible to determine whether the implementation itself is working as intended, we track the average reward earned by the agent over the last 10,000 frames. Should this value grow over the course of training, we can conclude that the learning algorithms themselves are at least implemented correctly. Second, we track the self-destructs of the agent over the last 10,000 frames. For context, in Super Smash Bros. Melee, players do not win by simply inflicting damage on the opponent. Rather, they attempt to force each other off the edges of a stage in order to achieve victory (similar to sumo wrestling). Accumulated damage increases a player's knock-back, making it more likely that they are sent off the stage when hit by the opponent. Self-destructs, then, occur when the player falls off the stage without having been knocked off by the opponent. Tracking the agent's self-destructs gives us a metric that should correlate inversely with learning and serves as an objective measure of good or bad play. Even if the agent shows a good reward curve, that alone does not imply the agent is playing well, since our reward function could be suboptimal; self-destructs over time provide a check against this.
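To make this concrete, the following is a minimal sketch, in Python, of how these two metrics could be tracked as rolling statistics over the most recent 10,000 frames. The class and variable names here are our own illustrative choices, not those of our actual implementation.

from collections import deque

WINDOW = 10_000  # evaluation window, in frames

class MetricTracker:
    def __init__(self, window=WINDOW):
        self.rewards = deque(maxlen=window)         # per-frame rewards in the window
        self.self_destructs = deque(maxlen=window)  # 1 if a self-destruct occurred that frame

    def record(self, reward, self_destructed):
        self.rewards.append(reward)
        self.self_destructs.append(1 if self_destructed else 0)

    def average_reward(self):
        # Mean reward over the last `window` frames (0 if nothing recorded yet).
        return sum(self.rewards) / len(self.rewards) if self.rewards else 0.0

    def self_destruct_count(self):
        # Number of self-destructs in the last `window` frames.
        return sum(self.self_destructs)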
2. Literature Review

The domain of fighting games in general has been explored often by researchers in the past, though the specific game of Super Smash Bros. Melee has been the subject of relatively little research. Discussion of the importance of tracking multiple recent states, and of the data-size difficulties this produces, informed the design of our state representation (Ricciardi, 2008), while several discussions of multiple levels of reward for different actions informed our reward implementation (Lee, 2005). Specifically, it appears that the domain of fighting games relies more heavily on multiple reward functions than other domains to achieve strong learning (Mendonca, 2015). Despite the rarity of SARSA implementations in fighting games, one approach for a different game used SARSA and showed several similarities to our domain, which indicated to us that SARSA had the potential to be a superior method (Graepel, 2004). Q-Learning was also shown to be a suitable learning approach for the Super Smash Bros. domain by researchers at MIT, and their use of supercomputers in conjunction with many concurrent agents indicated to us that such resources might be necessary for the success of our own implementation (Firoiu, 2017). There was also some discussion of training agents to play more similarly to humans (Saini, 2014). Should our agent begin to perform well without such additional constraints, we would consider human-like play another goal of interest for the future.
3. Q-Learning Approach

3.1 Motivation

The motivation for pursuing Q-Learning in our project was Phillip, the Super Smash Bros. AI written by researchers at MIT. Using TensorFlow, the MIT team combined Q-Learning with a neural network to predict Q-values for unvisited states, and their agent ultimately yielded successful results. They had trouble getting meaningful results until they ran the agent on a supercomputer for a week; afterwards, the researchers were no longer able to beat Phillip at the game. Aside from Phillip, we wanted to use Q-Learning because it learns values under the assumption that the agent will always take the action with the maximum expected return. There are many situations in competitive Super Smash Bros. that are guaranteed to result in an opening for dealing damage to your opponent or killing them, and in those situations it is always beneficial to take the action that gives maximum reward.
3.2 Design

Both learning approaches utilized a novel state representation. Because of our large state space, we built a tree representation of it, wherein each leaf represents a single state and each internal node branches on some attribute of the state. This significantly shrank our effective state space: each node branches based on a discretization of the attribute it represents, and new nodes and leaves are added to the tree only as they are reached, keeping the stored state space only as large as the number of states actually visited. The specific attributes of our state space are: x position of the agent, y position of the agent, distance of the opponent from the agent, direction of the opponent relative to the agent, direction the agent is facing, direction the opponent is facing, and percent damage on the opponent.

The tree is structured with an assumption about the most important attributes in mind. This ordering provides an easy way to generalize over states by applying the Q update to all states within a given tree branch. For instance, a self-destruct depends only on the attributes of the agent, not those of the opponent; an example would be our agent running off the ledge. Since the opponent did not influence our agent to self-destruct, the negative reward should be applied to all states that share the same agent attributes. By organizing the tree so that all opponent attributes occur in the bottom half of the tree, we can apply the self-destruct reward to every leaf beneath a prefix that depends only on the agent's attributes, as shown in the sketch below.
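The following is a simplified sketch of this tree structure. The attribute names and discretizations are illustrative rather than the exact ones used in our implementation; the point is that agent-only attributes come first, so a reward that depends only on the agent (such as the self-destruct penalty) can be applied to every leaf beneath the matching agent-only prefix.

# Illustrative attribute ordering: agent attributes first, opponent attributes last.
AGENT_ATTRS = ["agent_x_bin", "agent_y_bin", "agent_facing"]
OPPONENT_ATTRS = ["opp_distance_bin", "opp_direction", "opp_facing", "opp_damage_bin"]
ATTR_ORDER = AGENT_ATTRS + OPPONENT_ATTRS

class StateNode:
    def __init__(self, depth=0):
        self.depth = depth
        self.children = {}   # discretized attribute value -> child node
        self.q_values = {}   # populated at leaves: action -> Q estimate

    def leaf_for(self, state):
        """Walk (and lazily grow) the tree to the leaf for a full state dict."""
        node = self
        for attr in ATTR_ORDER:
            key = state[attr]
            if key not in node.children:
                node.children[key] = StateNode(node.depth + 1)
            node = node.children[key]
        return node

    def leaves_under(self, agent_state):
        """All existing leaves whose agent attributes match `agent_state`."""
        node = self
        for attr in AGENT_ATTRS:
            node = node.children.get(agent_state[attr])
            if node is None:
                return []
        # Collect every leaf in this subtree (i.e., any opponent configuration).
        stack, leaves = [node], []
        while stack:
            n = stack.pop()
            if n.children:
                stack.extend(n.children.values())
            else:
                leaves.append(n)
        return leaves

def apply_self_destruct_penalty(root, agent_state, action, penalty=-1.0):
    # Generalize the self-destruct penalty across all opponent configurations.
    for leaf in root.leaves_under(agent_state):
        leaf.q_values[action] = leaf.q_values.get(action, 0.0) + penalty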
4. SARSA Learning Approach

4.1 Motivation

Q-Learning is widely considered to be the superior learning algorithm for fighting-style games because of the aggressive play it usually incentivizes. Despite this, we wanted to introduce some novelty by implementing the SARSA algorithm, because of the unique nature of Super Smash Bros. relative to other fighting games. Unlike most other fighting games, Super Smash Bros. has environmental hazards that the player must avoid in order to achieve victory: the player loses a life if they fall off either end of the stage, and this should be avoided above all else. Because of this mechanic, we felt that SARSA had the potential to be more effective than Q-Learning, as it tends to incentivize cautious play by considering the rewards actually received, rather than the best possible rewards, when updating Q-values. We hoped that this would cause the agent to more quickly identify risky plays near the edge of the stage and better avoid accidentally killing itself. Additionally, MIT researchers had already shown that Q-Learning could be effectively implemented in Super Smash Bros., and we wanted to demonstrate that SARSA could be effective as well.
4.2 Design

As with the Q-Learning approach, we used the tree previously described to construct our state space. It was reasonably simple to add SARSA to the agent, as the general structure of exploring states, receiving a reward, and updating Q remains the same. The primary difference is that SARSA defers the update of a state-action pair until the next action has actually been chosen, bootstrapping from the value of that chosen action rather than from the maximum over all actions in the next state. The two update rules are sketched below.
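The sketch below contrasts the two update rules in tabular form, with the usual learning rate alpha and discount factor gamma; it is a minimal illustration rather than our exact code, where the Q lookups go through the state tree described earlier.

def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.99):
    # Off-policy: bootstrap from the best action available in s_next.
    best_next = max(Q.get((s_next, a2), 0.0) for a2 in actions)
    Q[(s, a)] = Q.get((s, a), 0.0) + alpha * (r + gamma * best_next - Q.get((s, a), 0.0))

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
    # On-policy: bootstrap from the action actually chosen in s_next, so the
    # update can only be made once that next action is known.
    target = r + gamma * Q.get((s_next, a_next), 0.0)
    Q[(s, a)] = Q.get((s, a), 0.0) + alpha * (target - Q.get((s, a), 0.0))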
5. Results

Our first experiment entailed the implementation of the simple Q-Learning agent. This agent uses the tree structure previously described, where each leaf of the tree has a list of possible actions, and computes Q through exploration of its state and action space. We compare its results to a random agent, a simple heuristic agent, and Phillip, the agent implemented by the aforementioned MIT researchers in late 2016. The random and heuristic agents show the expected results: very low, relatively constant reward for the random agent and mediocre, constant reward for the heuristic. These agents therefore serve as reliable baselines against which to compare our own results. Phillip, on the other hand, shows strong growth over the course of several days through many simultaneously running agents and represents a very successful learning agent in this domain.

Our second experiment was to implement the SARSA algorithm, comparing it to the same benchmarks as the Q-Learning algorithm as well as to Q-Learning itself. Other experiments were run using Q-Learning with some adjustments. Specifically, we performed tests with varying values of α, one with a significantly reduced set of state attributes, and one with a function used to generalize over state-action pairs where self-destructs occurred.

Unfortunately, several difficulties limited our ability to run the agents effectively. In pursuit of strong results like those displayed by Phillip, we had initially hoped to utilize some of the methodologies employed by the MIT researchers. Specifically, we had intended to build the code so that it could run without a game display in Docker, allowing it to run more quickly and on any OS. We then hoped to load many Docker instances onto the OU supercomputer so that we could run many simulations in parallel, as was the case with Phillip, which was run in over 50 instances on the MIT supercomputer for a week, with every instance sharing learning information through a custom setup. We were able to get the Docker version of the code working, which significantly sped up simulation times on our own systems, but we were denied installation of Docker on the supercomputer for security reasons. As a result, we were not able to run the agents for a sufficient amount of time or with a sufficient number of trials to show strong learning. However, we were able to take advantage of a special game mode that allowed us to train up to four agents at once rather than just two. This increased the learning rate of our agents while keeping the required processing power the same.
For basic Q-Learning, our initial hypothesis was that, with sufficient time, the agent would be able to outperform the heuristic agent it was trained against on average.

Figure 1: Reward per 1,000 frames, Q-Learning

As can be seen in Figure 1, our Q-Learning agent did not appear to learn in any meaningful capacity in the amount of time it was given. On the other hand, our SARSA agent performed far above expectations. While we were unable to gather a significant sample size because of the limitations explained above, the run we were able to complete for SARSA over the course of ~3 days shows a great deal of promise, especially compared to Q-Learning: there is a clear upward trend in reward and a clear downward trend in self-destructs (Figures 2 and 3). Given more time, we would like to perform further testing, as this appears to be the most promising approach. Our hypothesis was that the SARSA agent would be able to defeat the heuristic agent but would perform worse overall than Q-Learning. While we were not able to show conclusively that it could defeat the heuristic agent, we did show that it far outperformed Q-Learning.

Figure 2: Reward per 10,000 frames, SARSA

Figure 3: Self-Destructs per 10,000 frames, SARSA
As previously described, the three other experiments were built on Q-Learning as the baseline, which weakened their results significantly. We expected the self-destruct generalization to minimize self-destructs and cause the agent to perform better than basic Q-Learning, and we expected that a severely reduced state space would help the agent learn more quickly but less effectively. However, because of the very poor performance of the basic Q-Learning they were built on, even if these experiments did perform better, the difference is essentially invisible; for the most part, their results appear nearly identical to the reward chart for Q-Learning in Figure 1. Despite our limited simulation power, however, we were able to run the agent with the self-destruct generalization for an extended length of time on the single machine we had available for simulations. This run was particularly noteworthy, as it does show a very slight learning curve and a drop in self-destructs, indicating that it was somewhat successful. Given that this experiment was run for an extended length of time (~5 days) compared to the others and shows some promise, we feel it is possible that normal Q-Learning, despite the current appearance of its learning curve, has the potential to learn given a longer run time. As explained previously, however, the resources necessary to verify this have simply been unavailable to us so far.
Figure 6: Reward per 1,000 frames, reduced states
6. Analysis
Figure 4: Reward per 10,000 frames, Self-Destruct Handling
Figure 4 shows an extremely small growth in reward over time, but Phillip shows us that an extremely long amount of simulation time is required to achieve any real learning in this domain. Note that the reward here is higher than that of simple Q-Learning because of an adjustment to the rewards conferred between the two runs. This may have had an effect on the learning, but we are inclined to believe that the difference stems from the longer run time providing a much clearer picture of the long-term trend. It is difficult to say whether this upward trend in reward would continue if the agent were run for as long as Phillip was, but in its current state it appears the agent may have achieved some learning after all, just at a very slow rate.
Figure 5: Reward per 10,000 frames, baseline agents
Our final run of Q-Learning with the updated reward system had an average reward of just over -0.001. This is a very slight improvement over the random agent, which has an average reward of just under -0.001 (Figure 5). Our agent also has higher peaks than the heuristic agent and lower valleys than the random agent. This large range is probably a result of drastically limiting the state space in order to improve learning speed, at the cost of making the environment appear more stochastic to the agent.

This research proved to be a learning experience in many forms. All of our hypotheses were wrong in some way or another. As mentioned earlier, we had expected SARSA to be the inferior learning algorithm because of the defensive play it encourages. Instead, SARSA showed a sharp increase in learning at the beginning of the trial, with a plateau shortly afterwards. We believe this is because SARSA learned to avoid the hazardous ledges and stay closer to center stage, minimizing the negative reward from falling off the edge. Afterwards, however, we believe it struggled to take an offensive approach against the heuristic agent, causing the learning to plateau.

Given the chance to continue this work, our first action would be to find a way to run more experiments in a shorter amount of time. This way we could get more definitive results from separate experiments and draw better conclusions. Another approach we would like to try is a deep neural network that could handle the vast amount of input data that Super Smash Bros. is capable of producing.

We are disappointed by the results shown by our agents and feel that they underperformed our expectations by a fairly large amount. We faced a number of technical difficulties that hindered our progress significantly, due primarily to the complexity of the domain and the sheer amount of simulation time an agent needs in order to learn effectively. Given more time, and perhaps the use of a supercomputer to run simulations on, we feel that we could develop a much better learning agent. As it stands, however, the scope of our hypothesis was out of reach given the time available to us.
7. References

Firoiu, V., Whitney, W. E., & Tenenbaum, J. B. (2017). Beating the World's Best at Super Smash Bros. Melee with Deep Reinforcement Learning. CoRR.

Graepel, T., Herbrich, R., & Gold, J. (2004). Learning to Fight. Proceedings of the International Conference on Computer Games: Artificial Intelligence, Design and Education (pp. 193-200). Cambridge: Microsoft Research Ltd.

Lee, L. (2005). Adaptive Behavior for Fighting Game Characters. San Jose: San Jose State University.

Mendonca, M. R., Bernardino, H. S., & Neto, R. F. (2015). Simulating Human Behavior in Fighting Games using Reinforcement Learning and Artificial Neural Networks. 14th Brazilian Symposium on Computer Games and Digital Entertainment (pp. 152-159). Piaui: IEEE.

Ricciardi, A., & Thill, P. (2008). Adaptive AI for Fighting Games. Stanford: Stanford University.

Saini, S. S. (2014). Mimicking Human Player Strategies in Fighting Games Using Game Artificial Intelligence Techniques. Loughborough: Loughborough University.

spxtr (2017, January 20). Super Smash Bros. Melee CPU. https://github.com/spxtr/p3