Margaret-Ann Seger, Head of Product at Statsig, gave this presentation at the Product-Led Summit in San Francisco in 2023.
I lead the product team at a company called Statsig. Statsig is an experimentation and feature flagging platform that helps companies like Notion, Figma, Atlassian, and even Microsoft be more data-driven in how they build products.
- Why experimentation was important at Uber
- Three arguments against experimentation
- Effect size vs. sample size
- Leveraging smart defaults
- Setting your experiment up for success
- Consider a rebrand
- Using advanced statistical techniques
- Why top-line metrics are a trap
- Using tools as an agent of change
- The importance of being pragmatic
Why experimentation was important at Uber
A lot of folks are launching products without experimenting and without being very data-driven.
I tend to learn things the hard way. One of my early experiences really hit home on why it's important to experiment. The year was 2016. I was working on our international markets at Uber, and the whole company had been working together on this big rider app redesign.
In 2016, Uber turned the rider app on its head by going destination-first. This was a big cross-company effort: about 12+ months of work and 10 big feature updates, all rolled into this beautiful new app.
We launched it, we measured it as an experiment, and we started tracking the metric impact across different regions. And something interesting happened. The redesign was very positive in most regions, but in India and Southeast Asia, it was very bad for trips. Trips were down, revenue was down, and everything was regressing.
I was responsible for those regions at the time, and I was trying to figure out what was going on. I started to unpack all those changes, but the problem was that we’d bundled 10+ product changes into this redesign, so pinpointing exactly which one was causing these metric regressions was pretty tough.
So we started to undo them and tried to figure out which one was the culprit, and it turned out that the crux of the entire redesign was at fault.
Previously, the flow was: you entered your current location or dropped a pin, the driver came to you, and then you told them where you were going. You could enter the destination in the app, but it was optional. We’d flipped that and said, “No, you have to enter your destination upfront. It’s mandatory.”
A bunch of great things came from this. You could give the rider an accurate upfront price, and the driver knew where the trip was going, so they could choose whether to accept based on the distance.
But in regions like India and Southeast Asia, it turned out that that just wasn't the culture around ride-hailing. People wanted the driver to come to them, and they’d tell the driver something like, “I'm in this neighborhood, three doors down after the church on the right.” It was all point of interest-based.
Furthermore, Google Maps locations wouldn’t snap to places that people recognized. Maps coverage just wasn't as good back in 2016.
I learned a lot from this experience, but the number one thing I learned was how important it is to be iterative and data-driven in your development. Launch one change at a time, test it, understand it, and then move on to the next because you can't just bundle these things in together and expect it to work seamlessly.
Three arguments against experimentation
When you think of world-class experimentation, you probably think of companies like Meta, Google, and Amazon, companies that have billions of users to experiment on, in-house, super-optimized platforms, plus decades of experimentation culture. The reality is that everyone's running experiments, but the problem is that only some of them have control groups and randomization.
Anytime we're launching a new feature as a PM, we're experimenting. But if you're not measuring the impact of that, you can't optimize it.
I’m going to talk about how we, as PMs, can be that voice of experimentation at a company and how we can counter a lot of the often-cited arguments that people bring up to say that they don't have the time or energy to invest in experimentation.
The arguments we’ll cover are:
- We're too small to experiment.
- We can't afford to slow down.
- We just don't have the resources.
So, let's jump in.
Effect size vs. sample size
Statistical power is the probability of detecting an effect if there actually is a true effect to detect.
This matters because it essentially dictates how many samples you need to enroll in your experiment to reach a conclusion you can trust.
The number of samples that you enroll dictates how long you have to run an experiment for. If you're a big company, you might be able to enroll a ton of samples really quickly. But if you're a small company, it might take you significantly longer to reach that same sample size.
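The relationship between sample size and experiment duration can be sketched in a couple of lines. The traffic numbers below are illustrative assumptions, not figures from the talk:

```python
import math

def days_to_enroll(required_samples: int, daily_eligible_users: int) -> int:
    """Rough experiment duration: days until enough users have been
    enrolled to reach the required sample size."""
    return math.ceil(required_samples / daily_eligible_users)

# The same 100k-sample experiment finishes in a day at a large company
# but takes over three months at a small one (numbers are illustrative).
big_co_days = days_to_enroll(100_000, 2_000_000)   # 1 day
startup_days = days_to_enroll(100_000, 1_000)      # 100 days
```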
People often talk about sample size and say, “We're small. We just don't have the people to experiment on.” But the reality is that there's another concept at play called ‘effect size,’ which is just as important, if not more important, than sample size.
Effect size is the size of the lift you’re trying to detect. If you’re Google or Facebook, you're probably working on super-optimized flows, and you want to detect a 0.1% lift. But if you're a startup, these are brand new flows, you're new, and you need to see big impacts for it to move your top line, so you might be going for a 20, 30, or 40% lift.
This is important because the equation that dictates power depends linearly on effect size but only on the square root of sample size: doubling the effect you’re chasing buys as much power as quadrupling your sample. So sample size is much less important than effect size when you're deciding whether you can experiment.
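The talk doesn't spell out the equation, but a standard two-proportion normal-approximation sample-size formula illustrates the point. The 5% baseline conversion rate and the lifts below are illustrative assumptions, not Statsig numbers:

```python
import math
from statistics import NormalDist

def samples_per_group(baseline_rate: float, relative_lift: float,
                      alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate per-group sample size for a two-proportion z-test,
    using the usual normal approximation."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # two-sided significance threshold
    z_power = z.inv_cdf(power)
    p = baseline_rate
    delta = p * relative_lift           # absolute effect to detect
    return math.ceil(2 * (z_alpha + z_power) ** 2 * p * (1 - p) / delta ** 2)

# A startup chasing a 20% lift on a 5% baseline needs a few thousand
# users per group...
startup_n = samples_per_group(0.05, 0.20)
# ...while detecting a 0.1% lift needs hundreds of millions per group,
# because required n scales with 1 / effect_size**2 — here, a
# 200x smaller effect means roughly 40,000x more samples.
bigco_n = samples_per_group(0.05, 0.001)
```

This is why a small company going for big lifts can often run trustworthy experiments despite having a tiny fraction of a big company's traffic.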