It’s been a busy few months but I have a new topic, and it’s something I’ve been thinking a lot about as we build out the team and culture at my current company.
One thing that really highlights the vibe at a startup is its attitude towards the F-word… and nope, I’m not talking about whether swearing is acceptable in the workplace, I’m talking about Failure.
We can illustrate some of the perils of being afraid of failure with a game played with schoolchildren to teach the importance of boundary testing. It goes something like this:
I’m going to come up with a rule, or series of rules, that determines whether a set of 3 numbers you provide meets some hidden criteria. You can try to discover the rule by presenting me with a set of 3 numbers, and I will reply “yes” or “no” depending on whether it meets my rules.
Your job is to guess the criteria.
The most common approach is that the guesser picks 3 initial numbers in somewhat of a pattern, for instance [3, 6, 9], and I reply “yes”. They immediately form a hypothesis (e.g. consecutive integer multiples of the first number) and test it with another set that fits it, such as [5, 10, 15], to which I also reply “yes”. In their mind I have now confirmed that they were right all along, and they may offer another similar test case, which will also be confirmed, or just proclaim straight away that they know the answer.
The reason this happens much more often than not (try it with your friends if you don’t believe me) is based on a quirk of human psychology — we like being told we are right!
The flaw, in this case, is that the hypothesized rule was not, in fact, correct. The actual rule could have been that I wanted 3 increasing numbers with a gap of at least 2 between them, so the set [0.5, 9, 23.76] would also have resulted in a “yes”. The problem is that our guesser never tested the negative case: they never asked anything to which they expected a “no”, and therefore were unable to home in on the actual boundaries of the system.
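To make the trap concrete, here is a minimal Python sketch of the game, using the two rules from the example above (the function names are mine, purely for illustration):

```python
def hidden_rule(nums):
    """The actual rule: 3 increasing numbers with gaps of at least 2."""
    a, b, c = nums
    return b - a >= 2 and c - b >= 2

def hypothesis(nums):
    """The guesser's theory: consecutive integer multiples of the first number."""
    a, b, c = nums
    return b == 2 * a and c == 3 * a

# Positive-only tests, chosen to fit the hypothesis: both rules say "yes",
# so these tests cannot tell the two rules apart.
for guess in ([3, 6, 9], [5, 10, 15]):
    print(guess, hidden_rule(guess), hypothesis(guess))  # True, True both times

# A negative test: a set the guesser expects to get a "no".
# The answers diverge, exposing the flaw in the hypothesis.
probe = [0.5, 9, 23.76]
print(probe, hidden_rule(probe), hypothesis(probe))  # True, False
```

Only the probe the guesser expects to fail actually separates the two rules; the confirming tests, however many we run, never can.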
The same applies on a much grander scale when we look at company cultures and the way we try to develop and build products.
The value in enabling failure is not only that it allows people to feel comfortable enough to innovate even when we are not 100% certain of the results, but, more broadly, that it helps us find the boundaries of what is possible within our own systems.
But having acknowledged this, we must also not ignore the valid reasons that contribute to the traditional resistance to failure in any form. Specifically, these can be broken down into a handful of areas, each of which can be addressed to enable healthy failures to happen:
#1 Sunk cost fallacy
When a great deal of work and expense has already gone into pursuing a specific route, and some members of your company then identify it as the incorrect one, this can create a failure-resistant environment: nobody wants to be the person who suggests that all the work done so far was a waste and should be ‘scrapped’.
This reluctance causes work to continue along the same lines even further, incurring more cost and compounding the problem. It is often rooted in poor target setting, long realization-of-value timelines, and inappropriate metric tracking, all of which hinder early course correction.
Addressing these areas by moving towards shorter cycle times, prioritizing quick user value, and building an expectation that frequent course correction is normal and encouraged will all help reduce failure resistance from this cause.
#2 Repetitive failures (not learning from outcomes)
When the same failure, or type of failure, happens repeatedly, it is a signal that the organization is not learning from the failure and not making the changes necessary to reduce it.
While we should, to a certain extent, encourage failure as a route to learning (a common backronym for ‘FAIL’ is First Attempt In Learning), repeated failures of the same type show us that this learning is not happening. In that case, it is important to identify the steps needed to prevent the failure from recurring.
There are variations of this issue: the same person causing the same failure over and over again, the same team doing so, or new colleagues in a certain role always failing in the same manner at an early point in that role.
Each of these variants calls for a slightly different approach to resolution, but when failures repeat like this they are more likely to stem from structural issues than to be a positive sign of learning.
#3 High cost of failure
While we want to encourage failures that help us learn, there must still be a limit to the risk we can take in order to capture that learning, and this should be balanced against the value of that learning.
For instance, experimenting in a way that risks the existence of the entire organization in order to learn a relatively insignificant tidbit about user behavior is probably inappropriate.
However, this same risked outcome may be more appropriate for an early-stage startup considering a ‘pivot’, which may indeed put the future of the company on the line, but in exchange for significantly greater learning.
We should endeavor wherever possible to build systems and structures that somewhat isolate the risk of damage from the risk of failure.
#4 Unlimited blast radii from failure
Beyond a certain scale, it may not be acceptable to risk the entire organization’s existence on a single experiment. We therefore need to build mechanisms into the organization that allow such tests and experiments to fail safely, so that we can be confident the impact of any failure will be constrained.
Note that this is not an attempt to remove all damage or pain resulting from a failure and create a complete “sandbox” environment, but rather an isolation mechanism that limits the scope of where that pain is felt so that other parts of the wider organization can continue to function and lend support, even when there are failures and resulting damage in one area.
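In software terms, one common isolation mechanism is gating a risky change behind a deterministic rollout flag, so that a failure can only reach a small slice of users. Here is a toy sketch; the names and the 5% threshold are my own invention, not any specific product’s API:

```python
import hashlib

EXPERIMENT_ROLLOUT_PERCENT = 5  # the "blast radius" we are willing to accept

def in_experiment(user_id: str) -> bool:
    """Deterministically bucket a user into [0, 100) and gate on the rollout %."""
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
    return bucket < EXPERIMENT_ROLLOUT_PERCENT

def risky_new_flow(user_id: str) -> str:
    return f"experimental response for {user_id}"  # may fail in new ways

def stable_flow(user_id: str) -> str:
    return f"stable response for {user_id}"

def handle_request(user_id: str) -> str:
    # Only ~5% of users can be hurt by a failure in the new flow;
    # the rest of the organization keeps functioning and can lend support.
    return risky_new_flow(user_id) if in_experiment(user_id) else stable_flow(user_id)
```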
#5 Poor failure recovery
The pain felt from a failure comes not only from the initial event, which could be a system going offline or an app crashing, but also from how well the recovery process works.
By building systems with robust failover protocols and resilient design, we can limit the impact of outages, reduce the time to recover from a failure, and get back to a well-functioning system, while minimizing the data loss or infrastructure damage that may have occurred.
When we view failures as an inevitable and even encouraged part of a system and organization development lifecycle, then we can design resilient systems which handle many failures gracefully.
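As one illustration of what resilient design can look like in code, here is a minimal failover-with-retry sketch; the function, retry counts, and backoff values are assumptions of mine rather than a prescribed protocol:

```python
import time

def fetch_with_failover(fetch_primary, fetch_replica, retries=3, backoff_s=0.5):
    """Try the primary source, retrying with backoff; fall back to a replica.

    A failure in the primary degrades the response instead of taking the
    whole request down, shortening recovery and limiting the damage.
    """
    for attempt in range(retries):
        try:
            return fetch_primary()
        except ConnectionError:
            time.sleep(backoff_s * (2 ** attempt))  # exponential backoff
    return fetch_replica()  # primary is down: degrade gracefully

# Toy usage: a primary that always fails, a replica serving stale data.
def flaky_primary():
    raise ConnectionError("primary unreachable")

def replica():
    return "cached (possibly stale) result"

print(fetch_with_failover(flaky_primary, replica, retries=2, backoff_s=0.1))
```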
#6 Excessively risky testing
When we are experimenting somewhere we think there is a strong likelihood of failure, and we do not have the requisite precautions outlined above in place, it is sometimes prudent to create a separate “sandbox” environment for this kind of testing.
This allows us to learn more about a subject before experimenting on our real systems. Where possible, and where the experimentation effort is similar, we should balance the likelihood of failure against the ease of generating the same learning in a lower-risk environment, rather than testing in excessively risky environments just because we can.
Allow yourself to fail
It’s obvious at any level that failure has a cost; by definition, a failure is when we haven’t succeeded at our goal. But taking a step back, we see a far greater cost that is no less real: only trying things where we don’t think we will fail.
In the long run, the price we may pay in failure pales in comparison to the price we'll pay if we don’t let ourselves even try. As scary as it can be sometimes, it’s something I’ve been trying to keep in mind more frequently as we take this calculated risk on a new startup.