Firebase AB Testing your app: tips from a pro
Firebase AB testing has become a de facto method for driving iterative changes in a mobile app. App developers can build experiments to increase retention, engagement, and user LTV by tweaking anything from design to mechanics to the app economy.
But apps are complex and there are millions of possible iterations, so where do you start?
Watch the full discussion below.
Audio Transcript Below
(lightly edited for length and clarity)
- 6:01 – What is AB testing to Random Logic Games
- 8:12 – Toolsets used
- 14:00 – Types of tests Random Logic Games runs
- 20:11 – Measuring success
- 25:55 – Testing app economy changes
- 29:55 – Best advice for getting started
Adam Landis 0:02
Welcome back, everyone. Today I’d like to welcome Alex McArdle from Random Logic Games. Thank you for joining, Alex. Today’s topic is all about AB testing, and Alex will take us through his day-to-day and his AB testing goals: what they do, how he looks at the market, how he looks at testing apps. But before we jump into that: Alex, tell us a little bit about yourself. How did you find your way into Random Logic Games?
Alex McArdle 0:34
So I went to the University of Alabama at Birmingham, which is the little-brother college of Alabama, the one with the football team. I was doing neuroscience and anthropology there. My current boss, Andrew Stone, was a graduate of UAB, and he posted an internship on a job site saying he was looking to fill a position at a gaming company. I had a background in science, as I said, so I figured I could do some data-based stuff. I interned there and I really liked the company. And they paid us as interns, which was nice. That was probably 2017. Yeah, I would say 2017.
Adam Landis 1:28
Going on five years now…
Alex McArdle 1:30
Yeah. And then yeah, hired pretty soon after that. So yeah, coming up on five, I guess.
Adam Landis 1:36
And the company Random Logic Games for those of–
Alex McArdle 1:37
–you who don’t know, right. Random Logic Games: we’re a mobile gaming company, mainly board games and puzzle games. We’re based out of Birmingham; I guess we used to have an office and we no longer do, so now we’re based out of wherever my boss lives. So mostly just a mobile gaming company, with apps like Infinite Word Search and Infinite Connections right now.
Adam Landis 2:11
And Guess The Emoji–
Alex McArdle 2:14
Which probably reached its peak some number of years ago, I’m not sure. Just from the sound of how they talk about it, it was before me. We had a ton of games, word games for the world; we’re all over the place. And we’re always building more.
Adam Landis 2:35
So your role specifically: you started out as an intern, but what is your day-to-day like?
Alex McArdle 2:36
I do live-ops and on-the-fly changes and UI stuff: not necessarily design, but coming up with concepts and then testing those concepts, or testing an economy change. So basically, anything that can affect the user is probably going through me at some point before it gets released in a build: planning out what to test and what to implement, running it by the team, and communicating with the developers and the designers. We’re a small company, so we all have a hand in a bit of everything. Sometimes it’s UA, so I’ll help on that side of things.
Adam Landis 2:50
So data-driven, market-data driven. So you’ve already mentioned the science background analytics and stuff. But when you’re saying your role is to help with user experience, what is your goal? Generally, when you have a task that associates with the user experience…
Alex McArdle 3:48
So with user experience, my goal is generally just to improve our monetization, but that comes in different ways. Let’s say you come up with an idea like: “hey, I think this will increase retention.” That’s also a fine thing, because that’s just gonna keep the user in the game longer. If my goal is to have a user do a certain task more, I might come up with an idea of how to make that a little bit more enticing, or something along those lines.
Adam Landis 4:21
Good, so the end goal is increased monetization, that is increased value per user, but that could be driven through retention–
Alex McArdle 4:30
–Retention, video watching, spending more on in-app purchases, subscribing (we have a subscription-based thing), anything that will help monetization. And it’s not always a strictly monetization focus, because some of these things go hand in hand, like retention. If you just increase engagement, the downstream effect is hopefully that it monetizes better, but–
Adam Landis 4:58
–and to monetize, you’ve got, you said, videos or rewarded videos, interstitials, and then banners?
Alex McArdle 5:07
Yeah, we still do. We are largely an ad-based company. I think ads are like 90% of revenue, or somewhere around there. We were looking recently, and we might not be giving in-app purchases enough credit; it might be a little bit more, but it’s hard to track. So we were mainly focused there, and that’s how we’ve always done it. We’re taking some steps more towards in-app purchases, and that’s the kind of stuff I could talk about later as an example of an AB test. But definitely ad-based, so retention and time spent in the game is going to monetize better, because users are forced to see interstitial ads at some point in time. And then there are other mechanics that focus on videos.
Adam Landis 5:51
You mentioned AB testing, and that’s the title of the talk. First, in your own words, define what AB testing means to you and to Random Logic Games.
Alex McArdle 6:01
Alright, so with the science background, you typically have your control and your test group, and I guess that’s a good way to think about AB tests. But I don’t always think of it like that, because sometimes you want to test something right out of the gates. Say we’re launching a new app, and we really can’t make up our minds on which of two experiences in the app will be better. As long as it’s a strict AB, where you have one group versus a different group, that’s still an AB test; it doesn’t necessarily have to have what you would call a control. Most people will be more familiar with the idea of: here’s the way my app functions now, because your app or your product probably already exists. Then you make a change; that change would be the B group, and the A group would be the control, what exists already. But I definitely don’t think you should launch an app without testing if you’re torn on an experience that’s a crucial economy thing. For example: how much should we charge for this hint? Or how many should we give them per stage? You can go out of the gates with a really stingy experience and a really giving experience, and that’s an AB test where you’re just going to measure the monetization between those groups. Where you need to focus more in an AB test is on not putting a bunch of variables in it at once: the classic independent and dependent variable thing.
Adam Landis 7:51
So in your words, it’s two or more variants with very specific changes that are trying to drive an outcome. But you said something interesting, which is, you wouldn’t launch an app with an important feature that’s not testable….
Alex McArdle 8:12
Yes. We try to do that now. I’m not involved in the design, but generally they’re like: “here’s the app.” And so I’m like, “Okay, well, should we do X, Y, or Z to this?” And if we can’t decide, even when we have a suspicion that one is the safer route, I’ll at least tell the developers to give me the ability (we use Firebase Remote Config) so that as soon as the app’s online, I can change the experience for X percent of users.
Adam Landis 8:58
That’s actually one of my upcoming questions: is Firebase the toolset you’re using?
Alex McArdle 8:59
Yes, for AB testing. It hasn’t always been. We’ve used plenty of others. deltaDNA was a big one, then deltaDNA version two after a different company (ed: Unity) took over. We’re happy with Firebase right now, because the upside is it’s super, super cheap.
Adam Landis 9:00
–and it works.
Alex McArdle 9:27
Yeah, it works. It does some cool statistical analysis for you. The downside is that it’s limited in the way it reports. An example is video ad views: they don’t tell you how many video ads were viewed, they tell you how many users viewed a video ad. That’s a weird way of reporting an event, and you’re limited in the number of events you can use. You can work around that if you program events in different ways, but ultimately you’re stuck with like five or six. And probably because it’s a free service, it’s not going to offer super crazy data capabilities, or you have to pay for an upper tier to get different stuff. But anyway, it’s limited.
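(ed: The difference between "users who viewed a video ad" and "total video ad views" is easy to see with a small sketch. The event log, the field layout, and the `summarize` helper below are all hypothetical and only illustrate the two ways of counting; they are not Firebase's or AdLibertas's actual data model.)

```python
from collections import defaultdict

# Hypothetical event log: (user_id, variant, event_name). Illustrative data only.
events = [
    ("u1", "A", "video_ad_view"),
    ("u1", "A", "video_ad_view"),
    ("u1", "A", "video_ad_view"),
    ("u2", "A", "video_ad_view"),
    ("u3", "B", "video_ad_view"),
    ("u3", "B", "level_complete"),
]


def summarize(events, event_name):
    """Return {variant: (unique_users, total_events)} for one event name."""
    users = defaultdict(set)    # variant -> set of users who fired the event
    totals = defaultdict(int)   # variant -> number of times the event fired
    for user_id, variant, name in events:
        if name == event_name:
            users[variant].add(user_id)
            totals[variant] += 1
    return {variant: (len(users[variant]), totals[variant]) for variant in totals}


summary = summarize(events, "video_ad_view")
for variant, (unique_users, total_views) in sorted(summary.items()):
    print(variant, "users who viewed:", unique_users, "total views:", total_views)
```

In this made-up log, one heavy viewer in group A accounts for most of the views, which a users-only count would hide.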
Adam Landis 10:21
I actually haven’t heard that. I’ve heard that “what it is” is out there, and that’s pretty similar to what I’ve heard elsewhere. One other downside is that Remote Config can take a long time to load on app startup, up to 30 seconds. If you’re trying to test an onboarding screen, red versus blue, and it takes 30 seconds to load, you’re going to miss your chance.
Alex McArdle 10:53
I have not experienced that personally. There are some people who slip through the cracks, it seems, so that’s probably a situation like that, but there’s never really been a significant number. We use Kochava as well to measure other event data, and Kochava tells us which Firebase group users are in. You can see that out of maybe 100,000 users, 95% or more are going to be in one of the test groups. Then occasionally there are a few whose group is reported as zero, which means they’re not in our test. So we know they probably didn’t fetch the config: their Wi-Fi was bad, or maybe they’re in a different country that has less reliable service. But we do some app startup stuff, and whenever I’m testing it here on Wi-Fi, it works pretty fast. And what I meant about paying for more analytics is that it’s not for your testing. You can integrate and pay for BigQuery or whatever, but that doesn’t necessarily help you with your AB testing.
Adam Landis 12:04
Yeah, it just gives you access to the raw data. And then for folks watching: obviously this is a sponsored webinar, so how do you use us (AdLibertas) in relation to AB tests with Firebase?
Alex McArdle 12:17
We’ve worked with Adam before on other products, and he came to me with this one. It actually solved a couple of the problems with Firebase I just talked about. Take video ad views: “I don’t want to know how many users watched a video, I want to know the total number of video ads.” That’s something super simple. But my favorite part of AdLibertas has really been the LTV curves and forecasting, and the straightforward way of setting it up. deltaDNA used to be able to do some of these things for us, but it was super complicated. Even going into Firebase and then a separate analytics tool is so much easier than using their one product was whenever we used them. So I’ve definitely enjoyed it. Revenue, LTV over time for this group, all the retention data: I can get any event I want in Firebase reported, and I’m not limited. So now when I set up my Firebase experiments, I don’t care about those limits, because I know I’m going to be using AdLibertas to analyze all of it.
Adam Landis 13:40
So the AB tests are run through Firebase. And then the success is measured by AdLibertas.
Alex McArdle 13:47
Yeah, AdLibertas is what I’m using to analyze pretty much anything and everything, really.
Adam Landis 13:52
And the types of tests you’re running. So you’re talking a lot about tests and talking a lot about variants and kind of success. What are some real life examples of the things that you’re testing today?
Alex McArdle 14:05
Okay, so a good example would be our app Infinite Connections, a tile-matching game. Just super simple: connect the pieces. The board sizes can vary widely. There are all sorts of other apps like this on the App Store, of course; it’s not a novel concept. But we played some of the competitor apps and saw that some of them were using giant boards, 10 by 10 grids, whereas we might have been using something like 8 by 4, and that’s dozens more tiles on one level than on another. I would personally play through competitors’ apps and I was like, “I don’t really like this,” but there’s got to be some justification for why they do it. So in my head I was like, “Alright, we’ll test a couple of variants.” It was an A-B-C-D test, but again, only one variable changed: basically, the groups either had giant board sizes or they didn’t.
Adam Landis 15:35
But it’s not just visual, like the board size, as you’re saying– more tiles– that increases complexity, so time in the app, like people are playing longer,
Alex McArdle 15:43
Right. The experiment setup is easier than figuring out all the things that change. On a 10 by 10 board you might spend three minutes, whereas on a different board you might spend 30 seconds. I got a little frustrated playing their apps, so the reason I tested the frequency of the large board sizes is that I thought less frequent big boards would be beneficial. I don’t really love them, but I can’t let my opinion drive the test. So how much does the user like them? Do you want every board huge? Or every other board, or every fifth? That’s kind of what the test was. There was regular gameplay, which maxed out at, let’s say, 60 icons. Then there was one that maxed out at, let’s say, 90 or 100 icons. And then there were other test groups that varied in how often they–
Adam Landis 15:46
Icons... I’m sorry, icons are tiles?
Alex McArdle 16:33
Yeah, you can say 60 tiles, 100 tiles, whatever. And then how often do they see those 100-tile giant boards? Turns out blasting them with giant boards was not the answer, so that was a good intuition. But I was still anti-big-board, and it turns out the big boards were better. We mainly measured this with LTV, which includes all our ad views and retention and everything, and the control group was a loser in this test. We found a reliable winner. And you ask, how do you find a reliable winner? Well, a variety of statistics, but mainly focused around LTV, while making sure we saw no drop-offs in certain placements, videos, interstitials, or retention.
Adam Landis 18:14
I’m getting lost in the detail, so I just want to make sure I’ve got this clear: you’re basically testing the complexity of the app and seeing how that impacts users in a variety of different ways. The overall result is that the winner is the one where the LTV of the user goes up, is that right?
Alex McArdle 18:31
Yes. Four different variants.
Adam Landis 18:33
And you found one that wins over the control. When you say winner, and you’re looking at LTV, what type of percent increase are you looking at here?
Alex McArdle 18:42
It can’t be neck and neck, is pretty much what I’ve determined. Especially with Firebase and other tests, you might see a 1% increase. If there are 1,000 users, it’s meaningless, right? But say there are 130,000 users and day-one retention increased by 1%. And I don’t mean that it went from 19% to 20%; I mean it increased by 1% of 19. That’s still not great. It’s not what we want to see. There is a point at which you get real statistical significance with enough users and enough time, but we’d rather see bigger jumps. And in this case, we did see big jumps. I think there were probably like three- or four-cent ARPU increases.
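(ed: The distinction Alex draws here is between an absolute change in percentage points and a relative change. A minimal sketch of the arithmetic, using the 19% retention figure from his example; the function names are our own.)

```python
def absolute_lift(control_rate, test_rate):
    """Difference in rates, e.g. 0.19 -> 0.20 is 0.01 (one percentage point)."""
    return test_rate - control_rate


def relative_lift(control_rate, test_rate):
    """Change as a fraction of the control rate."""
    return (test_rate - control_rate) / control_rate


control = 0.19

# Case 1: retention goes from 19% to 20% (one percentage point).
one_point_up = 0.20
# Case 2: retention increases by "1% of 19", i.e. 19% -> 19.19%.
one_percent_up = control * 1.01

print("19% -> 20%:    relative lift =", round(relative_lift(control, one_point_up), 4))
print("19% -> 19.19%: relative lift =", round(relative_lift(control, one_percent_up), 4))
```

Going from 19% to 20% is a relative lift of roughly 5.3%, which is why Alex treats it as a much bigger result than a 1% relative change.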
Adam Landis 19:36
Wow, per day?
Alex McArdle 19:38
Yeah, that was a long-tail 90-day test. Probably tens of thousands of users. It was significant in pretty much every data point we cared about.
Adam Landis 19:54
So we go about this internally quite a bit, and I’d be curious to get your feedback, being in the streets: statistical significance. To you, is that something that you actually run numbers on and get a mathematical output, or is that a feeling?
Alex McArdle 20:11
I’ll do a p-value. I’ll use math. There is a feeling to it, which is what I was trying to get at earlier: when you see a 1% change, *meh*. If you’re seeing a 20% change right out of the gates, it looks like it’s really, really helping. It’s not good practice to say “this is the winner” at that point, but you can start having your eye on a winner. A better way of doing it is to be objective and set a plan: “I’m going to run this experiment this many days, until I have this many users in the test, and until I can be 90% confident,” or whatever your threshold is. In scientific papers it’s typically going to be 95% or 99% confidence. “So I’m 99% confident that the change in this experiment is a result of the variable that I changed and not just random users.” What’s going on behind the scenes is that we’re buying people from different countries, we’re buying people from different networks, we’re buying all sorts of different types of users: retention-goal-based buying versus return-on-ad-spend-based buying.
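(ed: Alex does not say which test he runs. One common choice for comparing a rate such as day-one retention between two groups is a two-sided two-proportion z-test, sketched below with only the standard library. The user counts are invented to show how sample size changes the answer for the same observed rates; they are not Random Logic Games data.)

```python
import math


def two_proportion_p_value(successes_a, n_a, successes_b, n_b):
    """Two-sided p-value for the difference between two proportions (pooled z-test)."""
    p_a = successes_a / n_a
    p_b = successes_b / n_b
    pooled = (successes_a + successes_b) / (n_a + n_b)
    standard_error = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if standard_error == 0:
        return 1.0  # both groups are all-success or all-failure: no evidence of a difference
    z = (p_b - p_a) / standard_error
    # For a standard normal Z, P(|Z| > |z|) = erfc(|z| / sqrt(2)).
    return math.erfc(abs(z) / math.sqrt(2))


# Same observed retention (19% vs 20%), very different sample sizes.
p_small = two_proportion_p_value(190, 1_000, 200, 1_000)
p_large = two_proportion_p_value(12_350, 65_000, 13_000, 65_000)

ALPHA = 0.05  # corresponds to the "95%" threshold mentioned above
print("1,000 users per group:  p =", p_small, "significant:", p_small < ALPHA)
print("65,000 users per group: p =", p_large, "significant:", p_large < ALPHA)
```

With these made-up numbers the small test cannot separate the groups while the large one can, which matches the point about needing enough users and enough time. A passing test still says nothing about whether the lift is big enough to matter commercially.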
Adam Landis 21:45
So don’t these changes– the campaigns and other things– don’t they throw a big wrench in your measurements?
Alex McArdle 21:51
They throw a big wrench in your measurements, which is why you have to run experiments for a long time and get a lot of users. If your change is as small as my 1% example, I wouldn’t trust it. Typically, if it’s an ABCD test where A and B are really close and C and D are definitely out of the equation, I’ll ditch C and D, which increases the user count in both remaining groups, and run it again as a straight A versus B test. I do that if the difference is not significant, or if it is significant but the change is extremely small. Even an extremely small improvement may be worth incrementing forward on, because a 1% gain is a 1% gain. Then maybe I can change something else and get another 1%, and next thing you know, you’ve really changed revenue.
But I would prefer to run them head to head when the difference is very small. That comes from talks I’ve had with people in the industry, and in particular one guy I’ve often sought advice from. He’s like, “you want to see 5 to 10% changes, you don’t want to see 1% gains.” And I don’t mean in absolute terms: retention going from 19% to 20% is a big change. I meant if it’s 1% [better overall].
Adam Landis 22:52
You’re not talking about a percent of ARPDAU or something, because that might actually be more significant. You mean an LTV change after 30 days of 1% better is not enough for you to come to a conclusion.
Alex McArdle 23:52
Yeah, unless the LTV is extremely high. If it’s $100, a 1% change is fine. But that’s not the case here. If we’re dealing with countries like Mexico, we’re buying inventory for, let’s say, five to 10 cents, and we’re trying to make five to 10 cents; 1% of that is not a huge change. So I wouldn’t call a test there. But if we’ve increased it by 10% with thousands of people, then yes. Okay, now I’m moving forward.
Adam Landis 24:21
So, on the types of tests you’re finding valuable from your experience: the outsider’s perspective is like, “Oh, I’m gonna test this button, red versus blue.” Are there certain types of tests that you think are actually more valuable to start with?
Alex McArdle 24:46
Yeah, I think economy changes are big….
Adam Landis 24:50
And those are just like?
Alex McArdle 24:56
So you watch a video for coins in some of our apps, right? So, coins for a hint: we’ve often run a one-to-one ratio, one ad for one hint. Well, in some apps, it’ll take you three ad views to get enough coins to spend on one hint. Wordscapes is a big example of this. They have a reveal-letter hint, and you can’t watch one ad for it. You have to watch four ads before you get one “reveal letter.”
Adam Landis 25:55
So when you say economy, it’s anything that affects the overall paradigm?
Alex McArdle 26:00
Right. When I say economy, I mean the in-game currency economy, or changes to in-app purchase prices or the amount of rewards given by an in-app purchase. Those are big changes. Changing the appearance of your app is great. Sometimes we’ve seen some improvements by adding tile sets. But I don’t think swapping out specific tiles (one tile set changes a smiley-face emoji to a frowny-face emoji) is where you’re going to see results. Of course, there are psychological tricks like that which people use all the time: red for sales, certain buttons you want green, or whatever. But–
Adam Landis 27:01
The low-hanging fruit is the economy, testing the economy?
Alex McArdle 27:05
I can give another example. The board size, that’s something that really changes gameplay.
Adam Landis 27:13
That’s gameplay, totally different changing the complexity…
Alex McArdle 27:16
Okay, so let’s go to economy. We would consider ads an economy thing for us, because it’s currency when you run out of time. If you run out of time, you can watch a video for more time, right? We recently tested this thing where you had a dialog that said “Would you like to watch a video for more time?”, and then you had another dialog that had a countdown, 5, 4, 3, and you only had five seconds to click the button. That change: big change. Turning it blue: not a big change.
Adam Landis 28:09
Got it. For someone who’s just getting started and asking, “Where do I start with testing?”, you’re saying the visual, pretty nature of the app is a lot less impactful to bottom-line revenue than economy and gameplay or complexity metrics?
Alex McArdle 28:25
Yeah, and any sort of mechanics you can come up with.
Adam Landis 28:30
And that’s app-usage mechanics or game mechanics in your world,
Alex McArdle 28:33
Yeah, your app can’t look crappy….
Adam Landis 28:43
I have a client who makes apps look crappy on purpose, because he’s trying to drive the economy rather than the visual nature.
Alex McArdle 28:51
Well, okay. I mean, sure. Yeah. Different strategy. That makes sense. But if you’re developing a nice, pretty game in Unity, don’t just be like “Oh, it doesn’t need to look okay.” Especially if you have a competitor’s app to compare against, you want your app to be on par with the competition.
Adam Landis 29:18
Right. Okay. So we’re about out of time here, so let me close by asking: say you’re talking to someone who’s starting to set up AB tests, and they’re starting to engineer them using Firebase. What are some tips you’d give them, pointers, or “hey, this is what I would do” if I wanted to start AB testing in my app?
Alex McArdle 29:45
I’m going to share the best advice that was given to me, about something that I struggled with and was doing wrong, and that is: make big changes. Say you’re rewarding a hint every level and then you say, “Well, what happens if I reward a hint every 2 levels instead of 1?” Do 10 levels instead of 1; don’t do 2. Don’t increment by one. Go for big changes and get big results. And then you have big confidence.
Adam Landis 30:17
And big confidence, that’s actually key. I’ve heard this too: the most common outcome you’ll have in an AB test is no viable outcome either way.
Alex McArdle 30:30
Yeah, a lot of times I spent 90 days for no damn reason, there’s no change.
Adam Landis 30:41
Start with big swings. I like that.
Alex McArdle 30:43
I would rather you see a big decrease, because then you’re like, “Whoa, crap, that’s not the way to go,” and you move on to the next idea. But say you exposed the test to 50% of your revenue-generating users and it really hurt the revenue; that’s bad. That’s why I would also say do a smaller or weighted test. You can do 90% on the control and 10% on the test and still see a swing. You don’t have to give it to everybody. You can weight the tests: make incremental changes, or make a giant change for a smaller number of users. That’s something I’ve taken advantage of. I can run a test where LTV has taken a hit over 90 days, but it was only exposed to thousands of people rather than hundreds of thousands.
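(ed: In Firebase the split percentages are set when the experiment is configured, so no code is needed for the 90/10 weighting Alex describes. Purely to illustrate the idea, here is a hypothetical, deterministic hash-based assignment. The experiment name, user IDs, and `assign_variant` helper are assumptions for this sketch and do not reflect how Firebase assigns users internally.)

```python
import hashlib


def assign_variant(user_id, experiment, weights):
    """Deterministically map a user to a variant.

    weights is a list of (variant_name, percent) pairs whose percents sum to 100.
    """
    if sum(percent for _, percent in weights) != 100:
        raise ValueError("weights must sum to 100")
    # Hash experiment + user so the same user can land differently across experiments.
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100  # stable bucket in 0..99
    cumulative = 0
    for name, percent in weights:
        cumulative += percent
        if bucket < cumulative:
            return name
    raise RuntimeError("unreachable: bucket is always below 100")


# 90% of users keep the current experience; 10% get the big, risky change.
WEIGHTS = [("control", 90), ("big_change", 10)]

assignments = [assign_variant(f"user-{i}", "hint_frequency", WEIGHTS) for i in range(10_000)]
test_share = assignments.count("big_change") / len(assignments)
print("share of users in the test group:", test_share)
```

Because the bucket comes from a hash rather than a random draw, a returning user always sees the same variant, and only about a tenth of users are exposed if the change turns out to hurt revenue.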
Adam Landis 32:32
Look for big swings, you can mitigate risk by cutting out a lot.
Alex McArdle 32:37
Yeah, and then, like I said, don’t get too caught up on text changes. Text changes work, but they’re not going to make as big a difference as a different mechanic, like the ability to use or interact with that dialog. A dialog forcing an answer in 5 seconds is going to have more impact than the text that says “hey, want to watch an ad?”
Adam Landis 33:28
Economy, playability, usability versus text / image, look for big swings
Alex McArdle 33:37
But do it everywhere. If you’ve got the time and resources, we’ve tested different push notifications. But the clear winners are in gameplay changes, economy changes, UI-changes — but big changes, not just you know, the side menu is green instead of blue.
Adam Landis 34:35
I think if anyone looks at this, they’re going to say the complexity is getting bigger, but the reality is, this is what you do all day, every day. I think you summarized it well, and we can close on that: “focus on big changes, mitigate your risk, focus on gameplay, focus on big monetization changes, and then it won’t be too broad.”
Alex McArdle 34:56
Yeah, I’m getting bogged down by all the options!
Adam Landis 35:04
Well, thank you. Your time and expertise are very helpful. I will be chasing you down about some new features. I think you gave me some ideas for how to help you measure AB tests better. So that’ll be in a future call.
Alex McArdle 35:19
Alright, sounds good.
Adam Landis 35:20
Alright Alex, thank you for joining.
Transcribed by https://otter.ai