This is better than an A/B test
Everyone loves A/B tests and believes they are the best way to see which of two (or more) options is best. But are they? Here I’m going to show a modern method that beats the old A/B test.

What is an A/B test?
The idea of an A/B test is to figure out which of two (or more) things is best. For example, you might have four different app logos. To find out which of these logos drives the most downloads, we need to test each variation. Users are divided into four segments, and each segment is shown a different logo. In the end, there should be a significant difference between the segments. There are dozens of calculators online that tell you how big the difference needs to be given the number of users in each segment.
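For a rough idea of what those calculators compute, here is a minimal sketch using the standard two-proportion sample-size formula; the conversion rates, significance level, and power below are made-up example values:

```python
from scipy.stats import norm

def sample_size_per_segment(p1, p2, alpha=0.05, power=0.8):
    """Users needed per segment to detect p1 vs. p2 (two-sided test)."""
    z_alpha = norm.ppf(1 - alpha / 2)  # critical value for significance level
    z_beta = norm.ppf(power)           # critical value for statistical power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return (z_alpha + z_beta) ** 2 * variance / (p1 - p2) ** 2

# e.g. detecting a lift from a 5% to a 6% download rate
print(sample_size_per_segment(0.05, 0.06))  # ≈ 8,155 users per segment
```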

Overall this sounds great, right? The problem, though, is that we show each logo to the same number of users. What if we have already shown one logo to 10% of the users and there are no downloads? Should we keep showing it? No! There are zero downloads, so we should instead split that segment between the three other segments to get more data for those. This way we don’t waste users on a logo that clearly isn’t going to work.

Multi-Armed Bandit
The multi-armed bandit problem is often explained with a casino example. You go to a casino where there are five (or any other number of) slot machines (“one-armed bandit” is an old name for a slot machine), and each machine has a different payout rate (a payout rate of 98% means that, in theory, for every $1 you put in you get back 98 cents). You want to find the machine that pays back as much money as possible. You have a limited amount of money, and you need to maximize the amount you have left after playing for a certain time. The best strategy, of course, is not to play at all, but for the sake of the problem we assume you have a gambling problem and need to play at least a dozen rounds.
This problem could be approached with an A/B test. If we assume that the payout rate differences between machines are really small, we need to pull each lever many times. Too many times. Because you don’t have that much money, you might just barely find the best machine, and by then you are almost broke and can’t exploit it.
To use the proper terms, an A/B test only explores and never exploits. Bandit algorithms try to solve this by exploiting at the same time as they explore. This way we use the best (or at least one of the best) machines more than the worst ones. There are many different ways to do this, but I’m going to show one simple yet effective method called epsilon-greedy.

The intuition is that with probability epsilon (ε) we choose a random action, and with probability 1-ε we choose the best action so far. Q is an array where we keep track of the estimated value of each action; the value estimate is just the average of the rewards seen so far. N is an array where we keep track of how many times we have chosen each action, which we need for the average calculation. This way we exploit what we know so far but also keep learning by exploring.
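In code, a minimal sketch of this epsilon-greedy rule might look like the following; the number of actions and the epsilon value are placeholders, and the reward would come from your own experiment:

```python
import numpy as np

n_actions = 5            # e.g. five slot machines (placeholder)
epsilon = 0.01           # probability of choosing a random action
Q = np.zeros(n_actions)  # estimated value (average reward) of each action
N = np.zeros(n_actions)  # how many times each action has been chosen

def choose_action():
    if np.random.random() < epsilon:
        return np.random.randint(n_actions)  # explore: random action
    return int(np.argmax(Q))                 # exploit: best action so far

def update(action, reward):
    N[action] += 1
    # incremental average: new mean = old mean + (reward - old mean) / count
    Q[action] += (reward - Q[action]) / N[action]
```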

That’s it. Next, let’s show that this beats A/B testing.
Which medicine works best?
Let’s set up a situation. Our task is to figure out which of five different cancer medicines works best. The tests are done on humans, so we need to find the best medicine as fast as possible.

Let’s imagine that we have 10,000 patients to test with, although that would be a high number in real life. First, we need an environment to test against; basically, it just defines the probability that each medicine works.
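Something like the following sketch captures the idea; the success probabilities are made-up numbers for illustration:

```python
import numpy as np

class MedicineEnv:
    """Each medicine has a hidden probability of curing a patient."""

    def __init__(self, probs=(0.2, 0.3, 0.5, 0.6, 0.65)):
        self.probs = probs  # hidden cure rates for medicines A to E (invented)

    def step(self, action):
        # reward 1 (patient cured) with the medicine's cure probability, else 0
        return 1 if np.random.random() < self.probs[action] else 0
```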
Let’s start with A/B testing. Because we have 10,000 patients and 5 medicines, that’s 2,000 patients per medicine.
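A sketch of that A/B test, reusing the MedicineEnv class from above:

```python
env = MedicineEnv()       # environment from the sketch above
n_patients = 10_000
n_medicines = 5
per_arm = n_patients // n_medicines  # 2,000 patients per medicine

cured = np.zeros(n_medicines)
for medicine in range(n_medicines):
    for _ in range(per_arm):
        cured[medicine] += env.step(medicine)

print(cured / per_arm)  # observed cure rate per medicine
print(cured.sum())      # total number of patients cured
```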
Next, let’s test the simple bandit algorithm we just learned. Epsilon controls the amount of exploration. I chose 0.01 (i.e., it picks a random action 1% of the time) because it is a commonly used value. In real life, you would tune it on test data.
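And a sketch of the epsilon-greedy version, where the algorithm picks a medicine for each patient and updates its estimates as it goes:

```python
env = MedicineEnv()   # same environment as above
epsilon = 0.01        # explore 1% of the time
Q = np.zeros(n_medicines)
N = np.zeros(n_medicines)
total_cured = 0

for _ in range(n_patients):
    if np.random.random() < epsilon:
        action = np.random.randint(n_medicines)  # explore
    else:
        action = int(np.argmax(Q))               # exploit
    reward = env.step(action)
    N[action] += 1
    Q[action] += (reward - Q[action]) / N[action]  # update running average
    total_cured += reward

print(N)            # times each medicine (A to E) was chosen
print(total_cured)  # total number of patients cured
```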

Conclusion
We saw how the bandit algorithm can save more lives when testing different medicines. The original question was how to find the best logo as fast as possible, and the same approach applies there: the bandit shifts users toward the winning logo much sooner than an A/B test would.

The rank-1 tensor above (the N array) shows the number of times the bandit algorithm chose each medicine from A to E. As we can see, it chose the best medicine far more often than the others, which shows that we could have stopped the experiment much earlier and gone all-in on the best medicine.
So is this what you should use? I think so: it finds the best option faster than an A/B test, and in many cases we want to exploit the best option at the same time as we are still looking for it. As I mentioned earlier, there are many different bandit algorithms, and if you are interested in reading more about them, I recommend the second chapter of this free book.
~Lankinen