The correct framework for AB testing your mobile app
AB tests are a great way to experiment and find improvements for your app. Since no two apps are exactly alike, there’s no one right way to run AB tests or make decisions to improve your app. We’ve taken the learnings from our customers and put together the following guide to help app developers who want to start AB testing their app effectively.
Our framework is built upon the following key steps:
1. Know your data, inside and out.
2. Form a hypothesis about a change that will lead to an improvement.
3. Select the metric that will best measure whether there is an improvement.
4. Estimate the data required to run the AB test.
5. Run the AB test to test the hypothesis.
6. Analyze the results.
Know Your Data
Making your app better would be difficult without a clear understanding of what it does today. So it should go without saying that an understanding of the existing data is crucial for formulating good hypotheses and running AB tests. First, you should have a good sense for your:
(1) Analytics event schema – that is, the naming conventions, the areas in the app where the events fire, and the parameters and additional details of the events.
(2) User averages – the average number of actions, revenue earned, retention and other user KPIs within your app.
(3) KPI fluctuations – the daily, weekly and seasonal swings those KPIs experience. It’s not uncommon for ad revenue to fluctuate by 10% per day, and nearly 50% seasonally. It’s also crucial to analyze them by important slices of the data such as country, OS version or campaign source. How can you optimize a user’s behavior if you don’t understand the typical behavior to begin with?
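As a rough illustration of the third point, the sketch below summarizes how much a daily KPI moves around. The revenue figures and the function name `summarize_fluctuation` are made up for the example; you would substitute a daily export from your own analytics.

```python
from statistics import mean, pstdev

# Hypothetical daily ad revenue (USD) for two weeks; replace with your own export.
daily_revenue = [1020.0, 980.0, 1100.0, 1055.0, 990.0, 1210.0, 1250.0,
                 1005.0, 970.0, 1080.0, 1040.0, 1000.0, 1190.0, 1230.0]


def summarize_fluctuation(values):
    """Return the mean, the relative spread, and the largest day-over-day change."""
    avg = mean(values)
    # Coefficient of variation: standard deviation as a fraction of the mean.
    relative_spread = pstdev(values) / avg
    # Absolute day-over-day change, as a fraction of the previous day's value.
    day_over_day = [abs(today - prev) / prev
                    for prev, today in zip(values, values[1:])]
    return avg, relative_spread, max(day_over_day)


avg, spread, max_swing = summarize_fluctuation(daily_revenue)
print(f"mean: {avg:.0f}, relative spread: {spread:.1%}, "
      f"largest daily swing: {max_swing:.1%}")
```

Running the same summary separately per country, OS version or campaign source gives a first picture of which slices are stable and which are noisy.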
But knowing simple user counts isn’t enough. You also need an understanding of how user events relate to one another and what they tell you about your users’ behavior. For example, if you have a puzzle app that offers in-app purchases for hints, do you know what drives users to purchase? Are there specific users that will always buy hints? Or are the users buying hints on specific, difficult-to-solve puzzles? If you don’t have a good hypothesis about what drives user behavior, it will be very difficult to design an effective test to drive more in-app purchases, since you won’t have a good idea of what to change, or how to measure it.
Form a Hypothesis and Select a Metric
Red button vs. Blue button. Easy level vs. Hard level. Continue as guest vs. Create an account. All of these are ideas that we see customers frequently experiment with. You already have an idea of which events and actions make users valuable. Your hypothesis translates this idea into a specific variation or condition to test. Just as important as the idea being tested is choosing appropriate metrics to measure the outcome. You may be testing a new feature to see how it impacts user retention rate, or the revenue contribution from in-app purchases. Your hypothesis should clearly state what is being tested, and the metric should clearly articulate what is being compared to determine success.
Examples of a strong hypothesis and measurement metric:
-“Adding a new difficulty mode will increase sales of in-app purchases, as measured by Day 30 cumulative revenue.”
-“Prompting new players to create an account after playing 3 levels will improve Day 7 retention.”
Examples of poor stand-alone hypotheses — because they are not specific and do not focus on a metric to be measured:
-“Players like easier levels more than hard levels”
-“Users like blue prompts better than the red prompts”
Estimate the Data Required
Often, a component of running AB tests is to use some statistics to aid in analyzing the results and making a decision on the outcome. What is often overlooked is that the statistics come into play before the test is ever configured or started. First up is your sample size determination: choosing the number of observations needed to measure the outcome of the test. Estimating the sample size required for the AB test will help you understand how many data points count as “enough” data. By extension, if you know how many actions or data points occur in a given day, then you can estimate how long a test will need to run.
This is most easily illustrated with an example. Let’s say you want to run a test comparing two features, with your test metric being the Day 7 retention rate. Let’s assume your baseline Day 7 retention rate is 15% (you should be able to look this up easily, because you Know Your Data inside and out) and you are testing a change that should improve it to at least 20% (the minimum improvement you would accept as worthwhile to roll out). Assuming you’d like a 95% confidence level with 80% statistical power (the probability of detecting a real improvement rather than getting a false negative), you would need 903 data points for each of Variation A and Variation B, for a total of 1,806 data points, to detect that difference with statistical significance. If your app averages 500 users reaching Day 7 each day, you would have this data in 4 days.
If you were willing to accept a lower confidence level, say 90%, then you would only need 711 data points for each of Variation A and Variation B, and would have the data in 3 days. Counterintuitively, if you “relax” the goal of your test to measure *any* gain over the baseline 15% retention rate (even improving to 16% would be a big win), you would need 20,557 data points per Variation, or 83 days of data. This is because reaching statistical significance on a change of 1 percentage point takes much more data than a more “obvious” change of 5 percentage points, from 15% to 20%.
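The figures above can be reproduced with a short calculation. The sketch below assumes the common normal-approximation formula for comparing two proportions with a two-sided test; the article does not say which calculator was used, so treat the formula choice and the function names as assumptions that happen to match the quoted numbers.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_variation(baseline, target, confidence=0.95, power=0.80):
    """Users needed per variation to detect a move from `baseline` to `target`.

    Assumes a two-sided test and the unpooled normal approximation.
    """
    z_alpha = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    z_beta = NormalDist().inv_cdf(power)
    # Sum of the binomial variances of the two variations.
    variance = baseline * (1 - baseline) + target * (1 - target)
    return ceil((z_alpha + z_beta) ** 2 * variance / (target - baseline) ** 2)


def days_to_collect(n_per_variation, users_per_day, variations=2):
    """Whole days needed to gather the data at a given daily volume."""
    return ceil(n_per_variation * variations / users_per_day)


for target, confidence in [(0.20, 0.95), (0.20, 0.90), (0.16, 0.95)]:
    n = sample_size_per_variation(0.15, target, confidence)
    print(f"15% -> {target:.0%} at {confidence:.0%} confidence: "
          f"{n} per variation, {days_to_collect(n, 500)} days")
```

Shrinking the gap between baseline and target from 5 points to 1 point divides the denominator by 25, which is why the required sample grows from hundreds to tens of thousands.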
Run the AB Test
This is the step where you finally set up and run the AB test. You may run this via a custom change to your app or with the help of specialized tools like Firebase or Optimizely. Depending on what you’re changing, it may also require app updates so that the app can be configured to test your hypothesis. We also find it sometimes makes sense to roll out these changes to a very small percentage of your users first, to ensure all is well with the change, the data is being gathered, and there are no technical hiccups in the way. Then sit back and don’t peek.
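If you go the custom route, one common way to split users is deterministic hashing, so that a given user always sees the same variation and a small rollout can be widened later. The sketch below is only an illustration of that idea, not how Firebase or Optimizely work internally; the function names, the experiment name and the 5% default are all hypothetical.

```python
import hashlib


def _bucket(key: str) -> float:
    """Map a string to a stable, roughly uniform number in [0, 1)."""
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return int(digest[:8], 16) / 0x100000000


def assign_variation(user_id: str, experiment: str, rollout_fraction: float = 0.05):
    """Return "A", "B", or None if the user is outside the rollout.

    Enrollment and variation use separate hashes, so widening the rollout
    later does not move already-enrolled users to the other variation.
    """
    if _bucket(f"{experiment}:enroll:{user_id}") >= rollout_fraction:
        return None
    return "A" if _bucket(f"{experiment}:variation:{user_id}") < 0.5 else "B"


# Hypothetical usage: start with 5% of users, then widen once the data looks sane.
print(assign_variation("user-123", "new_difficulty_mode", rollout_fraction=0.05))
print(assign_variation("user-123", "new_difficulty_mode", rollout_fraction=1.0))
```

Including the experiment name in the hash keeps assignments independent across experiments, so the same users do not always land in Variation A.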
Analyze the Results
Remember, no peeking. One of the biggest issues we come across is not estimating the minimum data required before running the test, then looking at the data too early and making knee-jerk decisions. If you flipped a coin 3 times and it came up heads each time, would you immediately conclude the coin has heads on both sides? Of course not! So don’t do the same with AB tests.
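To see why peeking is risky, the simulation sketch below runs many A/A tests, where both variations share the same true 15% rate and any “significant” result is therefore a false positive. It compares checking once at the planned sample size against checking after every 100 users and stopping at the first significant look. The parameters, the seed and the pooled z-test are illustrative assumptions.

```python
import random
from math import sqrt
from statistics import NormalDist


def is_significant(conv_a, conv_b, n, confidence=0.95):
    """Two-sided pooled z-test for two proportions with n users per variation."""
    pooled = (conv_a + conv_b) / (2 * n)
    if pooled in (0.0, 1.0):
        return False  # no variation in the data, nothing to test
    se = sqrt(pooled * (1 - pooled) * 2 / n)
    z = abs(conv_a - conv_b) / n / se
    return z > NormalDist().inv_cdf(1 - (1 - confidence) / 2)


def simulate(trials=500, n=1000, peek_every=100, rate=0.15, seed=7):
    """Return (false positive rate with one look, rate when peeking)."""
    rng = random.Random(seed)
    fixed_hits = peeking_hits = 0
    for _ in range(trials):
        conv_a = conv_b = 0
        flagged_at_any_peek = False
        for i in range(1, n + 1):
            conv_a += rng.random() < rate
            conv_b += rng.random() < rate
            if i % peek_every == 0 and is_significant(conv_a, conv_b, i):
                flagged_at_any_peek = True
        fixed_hits += is_significant(conv_a, conv_b, n)
        peeking_hits += flagged_at_any_peek
    return fixed_hits / trials, peeking_hits / trials


fixed_rate, peeking_rate = simulate()
print(f"false positives, single look: {fixed_rate:.1%}")
print(f"false positives, peeking every 100 users: {peeking_rate:.1%}")
```

Because the final look is also one of the peeks, the peeking rate can never be lower than the single-look rate; in practice it tends to be noticeably higher.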
The next common mistake we see is to keep running the test “until there’s 95% confidence of the results.” When comparing the results of Variation A and B, they are either significant at the 95% confidence level, or they are not. Either is telling: if the test outcome is not significant at the 95% confidence level, the data implies the experiment may not yield the same results if run again. In other words, the test outcome is inconclusive. Gathering more data won’t necessarily change this. But there are always exceptions to this as well. Say that you gathered 4 days of data, but your users fluctuate with a weekly pattern and you really should get at least 1 full week of data to capture that usage. This should be taken into account when estimating the sample size and time to gather the data, and the AB test should run accordingly.
So how should you think about confidence levels? A good shorthand way of thinking about confidence levels is the 3-per rule. Let’s assume we run an experiment every day for a year, and that none of the changes we test has any real effect. A 90% confidence level will give false positives roughly 3 times per month. A 95% confidence level will give us false positives ~3 times per quarter. A 99% confidence level will only give false positives ~3 times per year. For most purposes, 95% confidence is an accepted choice, but some very sensitive decisions may require 99% or even higher.
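The arithmetic behind the 3-per rule is simply the number of experiments multiplied by the false positive rate, assuming none of the tested changes has a real effect. The sketch below works it out; the period lengths (30, 91 and 365 days) and the function name are assumptions for the example.

```python
def expected_false_positives(experiments, confidence):
    """Expected false positives when no experiment has a real effect."""
    return experiments * (1 - confidence)


periods = [("month", 30, 0.90), ("quarter", 91, 0.95), ("year", 365, 0.99)]
for label, experiments, confidence in periods:
    count = expected_false_positives(experiments, confidence)
    print(f"{confidence:.0%} confidence: about {count:.1f} per {label}")
```

The monthly figure comes out at 3 and the yearly figure at about 3.7, while the quarterly figure is closer to 4.5, so the rule is best read as a rough mnemonic rather than an exact count.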
What about the stats themselves, such as p-values or the t-test? There are lots of tools available, from simple websites, to Excel, to enterprise software. Is your data statistically significant? Great! Not significant? That’s an answer as well: while statistically inconclusive, it can still provide valuable information for future iterations. In addition, the data might be statistically significant at less rigorous confidence levels. Perhaps it’s significant at the 90% confidence level and that rate of false positives is acceptable for this particular test. Regardless of what the stats say, this information helps inform the next version of the test to be run.
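For a rate metric such as Day 7 retention, one standard check is a two-proportion z-test. The sketch below is a minimal standard-library version; the counts are invented for illustration, and a dedicated statistics package or online calculator would serve equally well.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (z, two-sided p-value) for the difference between two rates."""
    rate_a, rate_b = conv_a / n_a, conv_b / n_b
    # Pooled rate under the null hypothesis that both variations are equal.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_b - rate_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Hypothetical results: retained users out of 903 per variation.
z, p = two_proportion_z_test(conv_a=135, n_a=903, conv_b=181, n_b=903)
for confidence in (0.90, 0.95, 0.99):
    verdict = "significant" if p < 1 - confidence else "not significant"
    print(f"{confidence:.0%} confidence: {verdict} (p = {p:.4f})")
```

Printing the verdict at several confidence levels makes it easy to see when a result clears 90% but not 95%, which is the situation described above.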
It’s unlikely that you’ll find a massive uplift from a single AB test. Instead, it’s much more likely that you’ll make small, incremental improvements on targeted segments of your users, and many tests will come up inconclusive or even lose to the baseline. After running an AB test, try to understand why you did or didn’t get a statistically significant result. Is it inconclusive because it doesn’t actually impact behavior? Or did it positively impact one metric but hurt another? Did a holiday in a key geo skew the results? Or maybe only a specific subset of users changed their behavior? All of these possibilities need to be studied to help inform the next iteration of the AB test to run. This is why it’s so important to have specific hypotheses with specific metrics to measure the AB test, so that you can easily isolate and focus on variables and metrics to test for a specific outcome, and easily adjust them for subsequent tests.
Anecdotally, our advisor, data scientist Andre Cohen, and a customer data scientist, Alex McArdle, both have the same recommendation: set up your test with the idea of having an obvious outcome, even if it’s a failure. It’s easier to iterate on failed AB tests than to waste time running useless ones.
This is meant to offer a basic framework that you can tailor to your own specific needs. One of the biggest downsides of focusing on the statistics as part of an AB testing framework is the frequency with which the stats say the results are inconclusive, or the measured differences are small and feel meaningful while the math says something else. Just like every other important business decision you make, the statistics should just be one of many factors you use when deciding how to proceed. It is also important to not discount your experience and judgment in these matters, and ultimately you should make the user experience that you want to make.
The Netflix tech blog recently began a series on experimentation and decision making, providing a great primer on the various factors that go into how they design, plan and execute AB testing on their service. You can read it here.