# Tireless Tracker: Analyzing Your Own Cube Drafts

I’ve always adored data and statistics in the context of Magic (I’m a huge Frank Karsten fan). When I fell in love with Cube about three years ago, I was acutely interested in what data-based resources existed for Cube design. But Cube is a subjective enterprise. Different Cubes have different goals in gameplay and drafting, and drafters have preferred playstyles and varied skill levels that can shift a Cube metagame. Couple this with Magic’s inherent complexity, and you have a recipe for disagreements in both card and archetype evaluation.

I’ve been collecting data on my own Cube for almost two years, and I currently help the XMage Cube Group curate and analyze their 3-0 decklist dataset. I’ve analyzed cards like Experimental Frenzy and Demonlord Belzenlok to evaluate their power level through simulation. If a Cube problem exists and can be looked at with data or simulation, I’ll always give it a go.

This is the first installment in a series titled *Tireless Tracker*. The goal of
this series is to answer questions about Magic using real world data with a
particular focus on laying a statistical foundation for Cube discussion. This is
an ambitious goal, as data analysis is never truly objective and is riddled with
biases and noise. We’ll be fighting these shortcomings by clearly defining the
questions we’re attempting to answer and acknowledging the limitations of our
approaches.

Each installment of *Tireless Tracker* will have at least four components.

**Problem**: This section will discuss what aspect of Magic or Cube we’re trying to investigate. I’ll discuss the problem’s features and complexities and what answering the question might tell us.

**Data**: In this section, we’ll discuss the dataset. I’ll explain where the data comes from and how it was analyzed. I’ll also cover any confounders in the dataset, which are features that make analysis difficult.

**Analysis**: This is where I’ll discuss my analysis of the dataset, what conclusions we can draw, and how confident we are in these conclusions.

**Resources**: This section will contain links to the dataset, any code I used to analyze it, and any other resources pertaining to the article.

I talk often with people on the MTG Cube Talk Discord about tracking my own Cube data, and I’ve been delighted to see that many also do this or are interested in doing so. The major hurdle everyone faces is finding a way to keep track of Cube data in a way that is both efficient and fruitful. Keeping track of everyone’s first and last picks from each pack is easy but may not tell you much. Keeping track of every individual pick and each decklist’s winrates is too time-consuming, even if it contains lots of useful data.

How should a Cube owner collect data? I’ve been using this method for about two years now: after every draft, I ask the drafters to take a picture of their deck and send it to me. Later, I transcribe the decklists into a simple text file on my computer, noting the colors of the deck, its archetype, its match and game record, and the cards it contains. I then use Python to parse all these files and combine their information into one dataset. I will provide the Python script and instructions for use in Resources.
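As a rough sketch of the parsing step, here is a minimal decklist parser in Python. The file format assumed here (a `Colors:` line, an `Archetype:` line, a `Record:` line, then one card name per line) is a hypothetical stand-in, not necessarily the format my script expects:

```python
from dataclasses import dataclass, field

@dataclass
class Deck:
    colors: str
    archetypes: list
    wins: int
    losses: int
    cards: list = field(default_factory=list)

def parse_decklist(text: str) -> Deck:
    """Parse one transcribed decklist file (hypothetical format:
    'Colors: UB' / 'Archetype: Control,Reanimator' / 'Record: 4-2'
    followed by one card name per line)."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    colors = lines[0].split(":", 1)[1].strip()
    archetypes = [a.strip() for a in lines[1].split(":", 1)[1].split(",")]
    wins, losses = (int(n) for n in lines[2].split(":", 1)[1].split("-"))
    return Deck(colors, archetypes, wins, losses, lines[3:])
```

Once every decklist is parsed into a `Deck`, the aggregate analyses below reduce to counting and averaging over a list of these objects.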

Currently, the script outputs a few different analyses:

**Archetype Analysis**: This analysis covers the win rates of each archetype and which cards appear most frequently in that archetype. It allows for sub-archetypes — a UB Control Reanimator deck can be classified as both Control and Reanimator. If dates are given to the decklists, it will analyze their win rates over time.

**Card Analysis**: This analysis covers which cards appear most in your decklists and their maindeck vs. sideboard rates if sideboards are given. It will also output individual win rates for each card (this feature comes with *significant* limitations in interpretability; see *Analysis*).

**Color Balance**: This analysis covers which colors are most often drafted in your Cube, based both on the decks themselves and the cards in the decks. For example, if I have one UB Control deck in my dataset, blue and black share an equal archetype representation (0.5-0.5). But if that deck is playing 5 black cards and 18 blue cards, black will have a 5/23 = 0.217 *card* representation.
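The *card* representation calculation from the Color Balance example can be sketched as follows; the function name and input format are my own, for illustration:

```python
from collections import Counter

def card_representation(deck_card_colors):
    """Fraction of a deck's cards in each color, given a list
    with one color letter per card."""
    counts = Counter(deck_card_colors)
    total = sum(counts.values())
    return {color: n / total for color, n in counts.items()}

# The UB Control example: a 23-card deck with 5 black cards.
rep = card_representation(["B"] * 5 + ["U"] * 18)
print(round(rep["B"], 3))  # → 0.217, i.e. 5/23
```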

Over the past two years, I’ve collected 404 decklists from drafts of my Cube. It is a Strix Scale 8-F unpowered Cube, and I aim to maximize power and efficiency within the unpowered design restriction (for example, I include Mana Drain and Mind Twist). Reanimator and creature-cheat strategies are well supported, and I have no planeswalker quotas. I typically draft with 4-6 friends, occasionally running a full 8-person draft or a 2-3 player Winston draft. You can find all the decklists used in this article here.

A confounder is any feature of a dataset that prevents accurate analysis. There are two main types of confounders: bias and noise.

A bias is a trend in the data that exists as a result of some external force. In the case of drafted decklists, the primary bias is drafter preference. As a player, I love drafting aggressive decks and will actively pick Goblin Guide and Sulfuric Vortex over most cards in the Cube. Because I am the most experienced player in my playgroup at drafting my Cube, this means that aggressive decks may be overrepresented in terms of win rate.

A similar bias exists for individual cards. If skilled players think that a mediocre card is good, that card may have a high win rate because skilled players draft it. The same is true in reverse — good cards that are underdrafted by skilled players may end up in the hands of less experienced players, resulting in a lower win rate.

Whereas bias refers to a consistent variation in a direction, noise refers to random variation. There are innumerable sources of noise in this dataset. For example, an aggro deck can do poorly in a draft because no aggro cards are opened, or an incredible deck can lose all its games due to random chance. Unlike bias, noise can be reduced with a large enough dataset, but it is never truly eliminated. Noise will be ubiquitous in all the analyses that we do.

In my own Cube, I’ve chosen to keep a higher order archetype breakdown (Aggro, Midrange, Control), and a sub-archetype breakdown (Ramp, Combo, Reanimator). This means that all decks are either Aggro, Midrange, or Control, but some have sub-archetypes (Control-Reanimator, Midrange-Ramp, etc). Here are their win rates in my Cube:

Archetype | Decks | Game Record | Win Rate |
---|---|---|---|
Aggro | 102 | 445-328 | 0.58 ± 0.03 |
Midrange | 169 | 656-658 | 0.50 ± 0.03 |
Control | 135 | 490-532 | 0.48 ± 0.03 |
Ramp | 63 | 300-214 | 0.58 ± 0.04 |
Combo | 31 | 122-116 | 0.51 ± 0.06 |
Reanimator | 25 | 74-97 | 0.43 ± 0.07 |

The win rates do not average to 0.50 because of sub-archetypes. For example, I classify all Ramp decks as Midrange decks. This leads to Ramp decks being counted “twice” in the above table, increasing the overall win rate of Midrange. It is clear that Aggro and Ramp decks are the top dogs in my Cube. Given the natural speed of these decks, this has led to a faster-paced Cube environment than many other unpowered Cubes that I have played.
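For reference, the ± values in the table are consistent with a 95% normal-approximation binomial half-width on the game win rate. A sketch, assuming that is how they were computed:

```python
import math

def win_rate_ci(wins: int, losses: int):
    """Game win rate plus a 95% normal-approximation binomial
    confidence half-width (1.96 standard errors)."""
    n = wins + losses
    p = wins / n
    return p, 1.96 * math.sqrt(p * (1 - p) / n)

p, hw = win_rate_ci(445, 328)  # the Aggro row
print(f"{p:.2f} ± {hw:.2f}")   # → 0.58 ± 0.03
```

Note how the smaller Reanimator sample (171 games) yields the wider ±0.07 interval: uncertainty shrinks with the square root of the number of games.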

We can also investigate the common cards in each archetype:

Archetype | Most Common Cards |
---|---|
Aggro | Strip Mine, Sulfuric Vortex, Porcelain Legionnaire |
Midrange | Polluted Delta, Recurring Nightmare, Demonic Tutor |
Control | Ponder, Coldsteel Heart, Azorius Signet |
Ramp | Birds of Paradise, Craterhoof Behemoth, Gaea's Cradle |
Combo | Sneak Attack, Emrakul, the Aeons Torn, Oath of Druids |
Reanimator | Entomb, Reanimate, Griselbrand |

The most common cards in each archetype tend to either be archetype enablers (Sneak Attack, Reanimate), powerhouses in the archetype (Sulfuric Vortex, Craterhoof Behemoth), or flexible cards that will fit any color deck in the archetype (Strip Mine, Coldsteel Heart). This makes sense given that these cards either pull you into an archetype or are flexible in terms of color commitment.

I keep track of the dates of the decklists, so we can also interrogate the
change in archetype win rates over time. To do this, I use a *rolling average*,
which examines the average win rates of archetypes in a certain window of time.
This enables us to make comparisons between time frames. In this case I use a 70
deck rolling average.
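A rolling average over decks is simple to compute directly; here is a minimal sketch, assuming the decklists are stored as date-sorted (wins, losses) pairs:

```python
def rolling_win_rate(decks, window=70):
    """Game win rate over each consecutive window of decks.
    `decks` is a date-sorted list of (wins, losses) pairs."""
    rates = []
    for i in range(len(decks) - window + 1):
        chunk = decks[i:i + window]
        wins = sum(w for w, _ in chunk)
        games = sum(w + l for w, l in chunk)
        rates.append(wins / games)
    return rates
```

Computing this per archetype and plotting the results gives win-rate-over-time curves like the one discussed here.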

We can see from this graph that aggro has always been a powerhouse in my Cube, although there have been times when it was not a top performer. I’ve taken a look at decklists from those time frames and discovered that they coincide with an increase in the number of aggro decks per draft. The natural conclusion is that aggro’s average win rate decreased because people fought for the archetype. I suspect that this is why aggro does so well in my Cube generally — there are usually only one or two people drafting it. In theory, archetype win rates are self-correcting: players will realize which archetypes are the best and will compete to draft them, lowering their average win rate. Aggro decks likely dodge this self-correction, as many players who play Cube simply do not like playing aggro even if it is “optimal”.

As a Cube designer, this presents a conundrum. Do I give other archetypes tools against aggro, or does that punish drafters for recognizing that aggro is underestimated and drafting it? While I’ve introduced some tools against aggro and ramp decks, like Pyroclasm, Whipflare, and Plague Engineer, these are questions I haven’t yet answered for myself.

This analysis also supports something I’ve suspected for some time — the performance of an archetype in a Cube depends not just on the cards in the Cube but also on the playgroup. I have played many Cubes where midrange or control strategies were dominant, despite these Cubes supporting aggro and ramp just as well as my own. This is likely because the experienced players in those playgroups enjoy drafting those strategies. This difference trickles down to card evaluations as well; spot removal and wraths are very important in my Cube, where you need to interact or die against aggro and ramp decks. Slower value engines, however, are more important in a Cube where midrange or control decks are popular. This can dramatically affect how we perceive the power level of cards and is important to remember in discussions.

When I first started collecting data on my Cube, I hoped to evaluate the
strength of individual cards. In theory, strong cards lead to strong decks, so
maybe looking at the cards in winning decks could identify the performers and
the duds. The easiest approach is to look at card “win rates”. For example, if
every deck that has Tinker in it wins every game it plays, then the “win
rate” of Tinker is 100%. I analyzed the cards that have been in
*more than 20 decks* (272 cards). I’ve chosen some illuminating examples to show
below, but you can view the full table
here.

Rank (Out of 272) | Card | Games | Win Rate |
---|---|---|---|
1 | Fireblast | 192 | 0.646 |
2 | Hellrider | 172 | 0.628 |
6 | Jackal Pup | 207 | 0.614 |
10 | Carnage Tyrant | 158 | 0.608 |
11 | Joraga Treespeaker | 298 | 0.607 |
⋮ | ⋮ | ⋮ | ⋮ |
159 | Mind Twist | 342 | 0.500 |
188 | Mana Drain | 310 | 0.484 |
⋮ | ⋮ | ⋮ | ⋮ |
260 | Vampiric Tutor | 282 | 0.433 |
270 | Toxic Deluge | 268 | 0.398 |

You might notice that the individual card win rates do not have any associated uncertainties. This can make it difficult to interpret how confident we are in these win rates. When we calculate uncertainties, we make implicit assumptions about underlying properties of the data. For example, you may have seen standard deviation used to quantify uncertainty. This makes strong assumptions about how the data were generated (namely, that the data are from a normal distribution). Calculating uncertainty in traditional ways does not work well with these data. Simulation is a much better way to explore uncertainty in this case, but first let’s cover the most obvious thing about this analysis:
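One simulation-based way to attach uncertainty to a card’s win rate is the bootstrap: resample the decks containing the card with replacement many times and look at the spread of the recomputed win rates. A sketch, with the function name and input format chosen for illustration:

```python
import random

def bootstrap_win_rate(records, n_boot=10_000, seed=0):
    """Bootstrap a card's game win rate from the (wins, losses)
    records of decks containing it; returns an approximate 95%
    interval (2.5th and 97.5th percentiles of resampled rates)."""
    rng = random.Random(seed)
    rates = []
    for _ in range(n_boot):
        sample = [rng.choice(records) for _ in records]
        wins = sum(w for w, _ in sample)
        games = sum(w + l for w, l in sample)
        rates.append(wins / games)
    rates.sort()
    return rates[int(0.025 * n_boot)], rates[int(0.975 * n_boot)]
```

Unlike a standard-deviation error bar, this makes no normality assumption about the per-deck records; it only assumes the observed decks are representative of the decks that card tends to end up in.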

**The “win rates” of individual cards are entirely reflective of archetype win
rates, not individual card strength**.

Mind Twist and Mana Drain are probably the strongest cards in my Cube, yet their win rates pale in comparison to the mighty Jackal Pup. While I’m still a fan of Jackal Pup in cube, its win rate is so high because it goes in aggro decks, which dominate in my Cube. Jackal Pup even “outperforms” other strictly better aggro cards; like a bad poker player winning a few hands, this is because of noise. Ramp cards like Carnage Tyrant and Joraga Treespeaker similarly overperform because of the strength of the ramp archetype. Conversely, Vampiric Tutor and Toxic Deluge aren’t bad cards; they just see play in decks that do poorly in my Cube compared to aggro or ramp.

We can examine the effect of archetype imbalance on these data by *normalizing*
individual card win rates to archetype win rates. To do this, we simply divide
individual card win rates by the average archetype win rate of decks that
contain that card. For example, before normalization Sulfuric Vortex and
Jace, the Mind Sculptor have win rates of 0.58 and 0.48 respectively.
Sulfuric Vortex only sees play in aggro decks (archetype win rate = 0.58),
while Jace, the Mind Sculptor sees play in mostly control (archetype win
rate = 0.48). After normalization, they share a normalized win rate of 1.00.
Cards that see play in multiple archetypes are normalized to a weighted average
of archetype win rates.
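A sketch of this normalization, assuming the weighted average is taken over one archetype win rate per deck the card appeared in (so more common archetypes weigh more heavily):

```python
def normalized_win_rate(card_rate, deck_archetype_rates):
    """Divide a card's raw win rate by the average archetype win
    rate of the decks it appeared in. `deck_archetype_rates` has
    one entry per deck, weighting archetypes by deck count."""
    baseline = sum(deck_archetype_rates) / len(deck_archetype_rates)
    return card_rate / baseline

# Sulfuric Vortex: aggro-only card with a 0.58 raw win rate,
# normalized against the 0.58 aggro archetype win rate.
print(round(normalized_win_rate(0.58, [0.58] * 40), 2))  # → 1.0
```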

Rank (Out of 272) | Card | Games | Normalized Win Rate |
---|---|---|---|
1 | Upheaval | 211 | 1.17 |
2 | Balance | 183 | 1.16 |
3 | Reanimate | 211 | 1.16 |
8 | Fireblast | 192 | 1.13 |
35 | Joraga Treespeaker | 298 | 1.08 |
⋮ | ⋮ | ⋮ | ⋮ |
93 | Mana Drain | 310 | 1.03 |
104 | Mind Twist | 342 | 1.02 |
⋮ | ⋮ | ⋮ | ⋮ |
250 | Vampiric Tutor | 282 | 0.92 |
267 | Toxic Deluge | 268 | 0.87 |

You can view the full table here. At first glance, this table seems more reflective of individual card strength, as several Cube heavyweights like Upheaval and Balance rise to the top. Yet Mana Drain and Mind Twist are still ranked in the middle of the pack, and Vampiric Tutor doesn’t budge an inch. I believe these to be some of the most powerful cards in my Cube, so their low ranking indicates either that my beliefs are wrong or that biases remain after normalization. While this approach reduces bias due to different archetype win rates, it does not reduce noise; for example, Jackal Pup still ranks higher than other strictly better aggro cards.

In any case, we can’t even trust this table to evaluate individual card strength. This is because predicting a card’s strength from the raw win rates of decks that contain it only works with an enormous amount of data. Estimating parameters from input data is a common problem in statistics; in our case, we’re trying to estimate 450 parameters (card strengths) with only about 400 pieces of input data (decklists and their win rates). This is like trying to predict an NBA player’s skill by watching them play for 15 minutes — there is simply too much noise. To determine how many decklists we would need to accurately estimate card strength, we can use a simulation.

In statistics, simulation is the process of modeling a process with a computer to learn more about its properties. Every simulation needs a generative model, or a sequence of steps that produces data. We can devise a generative model for how card win rates are produced:

- We’ll have a 450 card Cube, with each card having a number indicating its “strength” from 0-10. Most cards have a strength around 5. Specifically, the card strengths will be normally distributed with mean 5, standard deviation 2.
- To make a deck, we randomly choose 45 cards from the Cube, then choose the best 23 of those cards. The deck’s total strength is the sum of the strengths of its cards.
- Decks play against each other in best of three. For a given game, a deck has a probability to win equal to its strength divided by the total strength of the two decks. For example, if a deck with strength 100 plays against one with strength 150, it has a 100/250 = 40% chance of winning a game.
- In each “tournament”, eight decks play against each other, with each deck playing three matches.

To evaluate how many games we need to see strong cards have high win rates, we can devise a test. Let’s remove 20 random cards and replace them with cards that have strength 10 (better than 99% of the Cube). This is known as a “spike-in”. We can test how many games it takes to see these spike-in cards develop higher win rates than other cards.

We can simulate a set number of tournaments and identify both the average rank of the spike-in cards and the percentage of spike-in cards that ended up in the top 100 win rates:
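The generative model and spike-in test above can be sketched in Python. Details the description leaves open, such as random pairings each round and crediting a deck’s game wins to every card in it, are my assumptions:

```python
import random

def run_simulation(n_tournaments=100, seed=0):
    """Simulate the generative model and return the average
    win-rate rank of the 20 strength-10 spike-in cards."""
    rng = random.Random(seed)
    # 450-card cube with strengths ~ Normal(5, 2); overwrite 20
    # random cards with strength 10 (the spike-ins).
    strengths = [rng.gauss(5, 2) for _ in range(450)]
    spikes = rng.sample(range(450), 20)
    for i in spikes:
        strengths[i] = 10.0
    wins, games = [0] * 450, [0] * 450

    def deal_decks():
        # deal 45 cards to each of 8 drafters; keep the best 23
        shuffled = rng.sample(range(450), 450)
        decks = []
        for d in range(8):
            pool = sorted(shuffled[d * 45:(d + 1) * 45],
                          key=lambda i: strengths[i], reverse=True)
            decks.append(pool[:23])
        return decks

    def play_match(a, b):
        sa = sum(strengths[i] for i in a)
        sb = sum(strengths[i] for i in b)
        wa = wb = 0
        while wa < 2 and wb < 2:  # best of three games
            if rng.random() < sa / (sa + sb):
                wa += 1
            else:
                wb += 1
            for i in a + b:
                games[i] += 1
        for i in a:  # credit game wins to every card in the deck
            wins[i] += wa
        for i in b:
            wins[i] += wb

    for _ in range(n_tournaments):
        decks = deal_decks()
        for _ in range(3):  # three matches per deck, random pairings
            rng.shuffle(decks)
            for j in range(0, 8, 2):
                play_match(decks[j], decks[j + 1])

    # rank every card by observed game win rate (rank 1 = best)
    rate = [wins[i] / games[i] if games[i] else 0.0 for i in range(450)]
    order = sorted(range(450), key=lambda i: rate[i], reverse=True)
    ranks = {card: r + 1 for r, card in enumerate(order)}
    return sum(ranks[i] for i in spikes) / len(spikes)
```

Averaging the returned rank over several runs should qualitatively reproduce the trend in the table below: the average spike-in rank drifts from ~225 (random expectation) toward the top as tournaments accumulate.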

Number of Tournaments | Average # of Games/Card | Average Spike-in Rank (out of 450) | % of Spike-ins in Top 100 |
---|---|---|---|
0 | 0 | 225 | 22% |
100 | 300 | 124 | 43% |
500 | 1,500 | 99 | 57% |
1,000 | 3,000 | 83 | 66% |
10,000 | 30,000 | 32 | 99% |
∞ | ∞ | 10 | 100% |

We can see that only when we approach the 30,000 game threshold do we start to see fully accurate ranking of the spike-in cards. This means that in our simulated model of Cube, it takes 30,000 games to recognize that the best cards are indeed the best based on win rates alone. Furthermore, this model is an extreme simplification of Magic; it does not account for variance in player skill, the fact that a card’s power level can fluctuate based on matchup and game context, or the fact that you only draw part of your deck in a given game. These additional sources of variation make accurate ranking of cards even more difficult. The test is also designed to detect *very* strong cards (10s on a scale of 0-10), since these cards separate from the rest the fastest. The plot below shows our confidence in the estimated rank of cards based on their true rank.

To make this plot, I ran a 10,000 tournament simulation, noting how the cards ranked based on win rate (estimated rank) against their true rank. I then repeated this process 20,000 times to determine the average estimated rank and the confidence interval around this value. The graph shows that we can be very confident in strong cards (rank ≈ 1) and weak cards (rank ≈ 450) after 10,000 tournaments, as indicated by the narrow confidence interval. But for cards that are in the middle (rank ≈ 200), the confidence interval is huge. This means that if a card is actually the 225th strongest card in your Cube, you’re likely to see it rank anywhere from 30th to 400th in terms of win rate alone after 30,000 games.

I estimate that you would need upwards of 100,000 games to evaluate strong or
very weak cards in *real* Cubes and probably millions to evaluate average ones.
So despite my hopes, it’s basically impossible to evaluate the strength of
individual cards with just raw win rates.

But not all hope is lost. By looking at just the win rates, we ignore other important variables (most notably, other cards present in the decks) and hope they just average out. There are statistical frameworks that can account for these variables. We could, for example, fit our data to the generative model described above. This approach allows us to account for all the cards in a deck and would estimate the actual strength of cards on a 0-10 scale. While I suspect we will still be statistically underpowered, we’ll explore this method in a future installment.

The script also analyzes the colors of the decks. Looking at my dataset:

Color | Deck % | Average Card % |
---|---|---|
White | 38% | 44% |
Blue | 60% | 37% |
Black | 42% | 40% |
Red | 40% | 44% |
Green | 31% | 56% |

Clearly, blue is at the top of my Cube in terms of what cards are drafted. But while 60% of decks play some amount of blue, those decks tend to play fewer blue cards relative to their other colors. This implies that blue is being fought over and is often splashed. Conversely, decks playing green play 56% green cards, which means that green decks tend to be heavy green. This makes sense, given that green is not a color splashed often in my Cube and the decks are often Mono Green.

Of the 404 decklists, 261 are complete with main deck and sideboards. We can examine how often each individual card makes the main deck of a draft. Like the earlier win rate analysis, I only consider cards present in more than 20 decks:

Rank (Out of 272) | Card | Games | Main % |
---|---|---|---|
1 | Llanowar Elves | 288 | 1.00 |
1 | Lightning Bolt | 333 | 1.00 |
1 | Ponder | 211 | 1.00 |
1 | Fractured Identity | 288 | 1.00 |
1 | Swords to Plowshares | 286 | 1.00 |
⋮ | ⋮ | ⋮ | ⋮ |
267 | Acidic Slime | 170 | 0.44 |
268 | Brimaz, King of Oreskos | 172 | 0.43 |
271 | Hellrider | 172 | 0.41 |
272 | Ravages of War | 167 | 0.40 |

The entire table can be found here. There are 28 cards that have never been sideboarded (rank 1), and they tend to be either ramp cards, efficient removal, or card selection. As we move into the lower ranks, we see aggro cards, multicolor cards, and narrow archetype enablers emerge. These cards often go very late in packs when no one is drafting their archetypes. Cards I consider to be on the lower end of my Cube’s power level also have lower maindeck rates (Cryptic Command, etc). I haven’t decided if I should let this inform my Cube design. Should I cut cards that have low maindeck rates? Am I willing to cube narrow cards with low maindeck rates if they generate interesting gameplay, enable unique archetypes, or are very powerful? These are questions every Cube designer must tackle, but I’m confident having these data will allow me to make more informed decisions.

This analysis toolkit is clearly most helpful in tracking archetype win rates and balance in your Cube. Before doing this analysis, I had assumed that Ramp was underperforming based on my personal experience; I never seemed to do well with Mono Green or G/X ramp decks. As a result, I introduced changes several months ago that were designed to boost ramp decks (more one-mana elves and more planeswalkers). Those changes caused Ramp deck win rates to skyrocket; in one time frame, they were pushing 70%. Upon looking at the data, I was surprised to find that Ramp had always done well, and my changes turned a good archetype into one that was upsetting the balance of my Cube. Ramp’s win rate is gradually dropping as I introduce tools against it and the metagame self-corrects, but this serves as a warning against making sweeping changes based on human perception.

The individual card analysis was not as fruitful as I’d hoped. We can’t rank cards based on win rates alone, and the win rates are mostly reflective of archetype balance anyway. But there are still interesting things to learn about individual cards; examining the maindeck vs. sideboard rates of cards has taught me what my playgroup is drafting and what cards are popular. While this data-based approach to cubing isn’t for everyone, it has enabled me to learn more about my environment and make more informed adjustments to my Cube.

The decklists and the code used for analyzing my decklists and creating the simulations can be found on GitHub here. If you have no coding experience, feel free to reach out via Reddit (u/Tjornan) or email ([email protected]) and we can work together to analyze your decklists!