The Great Big Shuffler Experiment


One of the big concerns of players in any competitive online game
is whether there are exploits that give an unfair advantage to those who know about them. In MTG Arena, one area where this concern has persisted since the beginning is the shuffler. This article introduces a new, crowdsourced effort to investigate the issue. The effort covers multiple facets of the problem, but to keep things simple, this article focuses on only one of them. The following sections detail this facet, its rationale, its strengths and weaknesses, and the ways you can contribute to it (since it is a crowdsourced effort).


The Issue


MTGA, like any CCG simulator, uses an algorithm to “shuffle” your library and draw your opening hand. According to the devs, MTGA uses the Fisher-Yates shuffling algorithm, which is a valid choice. However, no system for generating randomness is perfect, and as the Wikipedia article on the algorithm notes, there are potential sources of bias to watch out for, which can be detected through proper experimentation.
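To make the algorithm concrete, here is a minimal Python sketch of the modern Fisher-Yates shuffle. This is an illustration of the textbook algorithm, not MTGA's actual implementation, which we cannot see:

```python
import random

def fisher_yates_shuffle(deck, rng=None):
    """Return a shuffled copy of `deck` using the modern Fisher-Yates algorithm."""
    rng = rng or random.Random()
    cards = list(deck)  # work on a copy so the input is untouched
    for i in range(len(cards) - 1, 0, -1):
        # Choose j uniformly from 0..i (inclusive). Classic implementation
        # bugs -- e.g. drawing j from the full range on every iteration, or
        # feeding the loop a weak RNG -- are exactly the kinds of bias an
        # experiment like this one is meant to detect.
        j = rng.randrange(i + 1)
        cards[i], cards[j] = cards[j], cards[i]
    return cards
```

Implemented correctly with a good RNG, every ordering is equally likely; the bias sources mentioned above come from getting one of those details wrong.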

The specific issue is described as follows. Any deck that MTGA shuffles starts with an original ordering which is based on how you created the deck. Consider the following decklist:

30 Plains (ELD) 250
30 Swamp (ELD) 258

If you import this decklist exactly, you will create a deck that always starts out as a pile with 30 Plains on top and 30 Swamps on the bottom. This is the ordering MTGA starts with before shuffling.

The potential exploit of interest is whether the shuffler favors one half of the deck over the other; that is, whether there is a difference between the distributions of Plains and Swamps that end up in your opening hand.

Why is this important? Suppose there is a discrepancy and, say, the bottom half ends up in your opening hand more often than the top half. This could be exploited by arranging your starting deck so that the cards you want to see early are on the bottom and those you want to see late are on the top.
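As a baseline for comparison, a quick simulation shows what an unbiased shuffle produces: with a 60-card deck, each card in a 7-card hand is equally likely to come from either half, so the bottom half should contribute 3.5 cards per hand on average. This sketch uses Python's built-in uniform sampling as a stand-in for a fair shuffle:

```python
import random

def mean_bottom_half_cards(trials=100_000, deck_size=60, hand_size=7, seed=0):
    """Average number of bottom-half cards per opening hand under a fair shuffle."""
    rng = random.Random(seed)
    deck = list(range(deck_size))  # positions 0-29 = top half, 30-59 = bottom half
    total = 0
    for _ in range(trials):
        # A uniformly random 7-card sample is what a correct shuffle delivers.
        hand = rng.sample(deck, hand_size)
        total += sum(1 for card in hand if card >= deck_size // 2)
    return total / trials  # should hover around hand_size / 2 = 3.5
```

If pooled real-game data drifts meaningfully away from this 3.5 baseline, that is exactly the kind of evidence the experiment below is designed to surface.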


Background on the investigative effort
While issues like this one have been debated over and over across all the MTGA discussion forums, little has been done to provide actual evidence that such exploits exist. Perhaps the most popularly cited investigations are those conducted by redditor Douglasjm, posted on reddit here and here. However, both of these, especially the second one, which deals with the same issue detailed in the previous section, have been heavily criticized. The second study contained a clear, fundamental misuse of statistical analysis, which I will describe briefly: one is not supposed to use inferential methods with the intention of proving a null hypothesis.
Anyway, yesterday, one contributor from the MTG Arena forums, DeadlyK1tten, posted a thread proposing a crowdsourced approach to investigating the shuffler. The post is located here. DeadlyK1tten presented the issues proposed for testing and graciously provided Python code that automates extracting the relevant data from MTGA logfiles. This means that anyone interested can simply follow the procedure in the thread's first post and post the automatically generated results on the thread, where they can be aggregated to produce reasonable inferences. In addition, the data needed for the specific issue described in the previous section can easily be gathered manually, so people can also contribute by following the instructions I provide in this article, without needing to work with Python.


Instructions for Manually Conducting the Experiment
The question we want to answer in this experiment is as follows: is there sufficient evidence to indicate that the shuffler favors one half of a deck over the other half?

Procedure:
Step 1: Construct the following deck by copying it and then importing it on the MTG Arena client

30 Plains (ELD) 250
30 Swamp (ELD) 258

Step 2: Go to Sparky and play a best-of-one game.

Step 3: Count the number of swamps in your opening hand

Step 4: Accept the hand, concede, and go back to Step 2. Repeat this process as many times as you like (I did it 20 times, and it took no more than 15 minutes).

Step 5: Tally your results by recording the number of times you got each possible number of Swamps (from 0 to 7) in your opening hand, separated by commas. For example: 0,1,3,6,3,3,2,0. This output means there were 0 games where I got no Swamps in my opening hand, 1 game where I got 1 Swamp, 3 games where I got 2 Swamps, 6 games where I got 3 Swamps, and so on. Results should be reported in this format so that they are consistent with the output of the Python script.

Step 6: Post your result on the MTGA forum thread linked again here.
Note that it is essential to follow all the steps exactly so that the trials can be assumed to come from identical distributions. Note also that by running the experiment with the instructions in DeadlyK1tten’s post, you can skip the tally in Step 5 (the code will do it for you); you will just need to copy and paste the result from the code’s output onto the forum.
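If you run the games manually but would rather not tally by hand, a few lines of Python can convert a list of per-game Swamp counts into the report format from Step 5. This is a hypothetical helper of my own, not part of DeadlyK1tten's code:

```python
def tally_counts(swamp_counts, hand_size=7):
    """Convert per-game Swamp counts into the comma-separated Step 5 format."""
    bins = [0] * (hand_size + 1)  # one bin for each possible count, 0 through 7
    for count in swamp_counts:
        if not 0 <= count <= hand_size:
            raise ValueError(f"impossible Swamp count: {count}")
        bins[count] += 1
    return ",".join(str(b) for b in bins)

# Example: Swamp counts jotted down over six games
print(tally_counts([3, 4, 2, 3, 3, 5]))  # -> 0,0,1,3,1,1,0,0
```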


Why this will work
As in any study, the more data you get, the better you can approximate the truth, as long as the data is honest. So suppose everyone simply follows the instructions above and plays even as few as 10 games each: if 100 people do it, we will have 1,000 total samples; if 1,000 people do it, we will have 10,000. That should be enough to draw a clear picture of whether the claim has merit (that is, whether one half of the deck is favored by the shuffler over the other half in this specific setting).
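Once the tallies are pooled, there is a clean benchmark to compare them against: under a perfectly fair shuffle, the number of Swamps in a 7-card hand drawn from a 30 Plains / 30 Swamps deck follows a hypergeometric distribution. A sketch of the null distribution, using only the standard library:

```python
from math import comb

def fair_shuffle_pmf(deck_size=60, swamps=30, hand_size=7):
    """P(k Swamps in the opening hand) under a fair shuffle (hypergeometric)."""
    total = comb(deck_size, hand_size)
    return [comb(swamps, k) * comb(deck_size - swamps, hand_size - k) / total
            for k in range(hand_size + 1)]

def expected_tally(n_games, **kwargs):
    """Expected count for each Swamp total, to set against the pooled tallies."""
    return [n_games * p for p in fair_shuffle_pmf(**kwargs)]
```

With a 30/30 deck the distribution is symmetric around 3.5, so any consistent lean toward low or high Swamp counts in the pooled data would stand out against these expected counts (for example, via a chi-square goodness-of-fit test).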


Issues with the experiment
One issue, of course, is how much we can trust the data to be honest. That is, it is possible for a person to 1) make a mistake while doing the experiment, or 2) purposely report fictitious results. This issue can be addressed by having a large enough number of contributors, assuming the chance of either happening is very small: the much larger amount of honest data can either drown out the erroneous or dishonest data, or make such data clearly distinguishable from the honest data.

The second issue is that this experiment is limited to a very specific setting. We need you to use the same decklist, and this decklist (30 Plains, 30 Swamps) is not something you would actually take to a ranked match. It is therefore entirely possible for a bias to be absent or undetectable in this setting yet exist in other, more natural settings. This is a weakness. However, you can also choose to contribute to the other experiments in the same thread, which deal with more natural settings. For example, one is the natural Brawl format setting, where you can just play Brawl games normally and collect results (using DeadlyK1tten's code, since it would not be easy to do the tally for this one manually).

Also, note that, as I said before, statistical methodologies are not built to prove null hypotheses. That is, our objective here is to expose evidence (if such evidence exists) that the shuffler is rigged, not to provide proof that the shuffler is not rigged. This means that if we do find evidence, it should serve as a basis for the devs to act and fix the problem. If we do not, that should at least provide some peace of mind to those concerned, since we were unable to detect sufficient cause for concern. If this motivates more people to conduct their own investigations using more sophisticated settings, that would be very welcome.

and as always... May the shuffler be with you.
