17 Jan 2021 / ~6 min

Tests Should Build Confidence

In automated software testing, the default approach among developers is bottom-up, and the aims are high coverage and working software.

This approach misunderstands the goal of testing and often fails to deliver on the goal of reliably shipping working software.

The idea of the bottom-up approach is that you use tests to show that individual parts are correct, that they integrate, and that your system is correct as a result.

This approach leads to the typical test pyramid with many tests for small parts at the bottom (unit), fewer tests of larger parts in the middle (integration) and even fewer broad tests at the top (end-to-end). The tests at the bottom are small and quick and the tests at the top are broad and slow. ¹

The goal of this approach is “working software”, implying focus on correctness. To be successful, the test coverage has to be nearly complete. ² We evaluate each test by whether it increases coverage and adds to the evidence of correctness.

However, if we consider the purpose of tests with nuance, we can find a better approach.

The goal of a working software engineer isn’t to show that the system is correct; it is to be confident that the system is working as expected. ³ That’s why we write tests.

We don’t just want to produce working software, we want to be confident that the software is working. It’s not enough for the software to work, we have to know it. That confidence is what lets us evolve our software to meet the needs of its users without unnecessary stress and delays, and confidence should be the goal.

With that goal in mind, the way to evaluate a test is by how much it increases our confidence that the system is working.

Consider this thought experiment: we have a piece of software with no tests, and our goal is to convince ourselves that it works.

The first thing to do — the thing that would increase our confidence the most — would be to run it and check that it doesn’t crash.

Next, we would supply different inputs and check the outputs. If we have a UI, click around and take some actions. After a while, we might not have convinced ourselves that the system behaves correctly in all edge cases, but we know that it works.
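The first step of that experiment, running the program and checking that it doesn’t crash, is easy to automate. Here is a minimal sketch in Python, assuming the software can be started from the command line. The `smoke_test` helper and the inline stand-in program are hypothetical names used only for illustration.

```python
import subprocess
import sys


def smoke_test(command, timeout=30):
    """Run the program once and report whether it exited cleanly."""
    result = subprocess.run(
        command, capture_output=True, text=True, timeout=timeout
    )
    return result.returncode == 0, result.stdout, result.stderr


# Stand-in for "our software": a tiny inline Python program (hypothetical).
# In a real project this would be the command that starts your service or CLI.
healthy_program = [sys.executable, "-c", "print('started')"]
crashing_program = [sys.executable, "-c", "raise SystemExit(1)"]

ok, out, err = smoke_test(healthy_program)
crashed_ok, _, _ = smoke_test(crashing_program)
```

A check like this says nothing about edge cases, but it answers the first question we would ask by hand: does it start at all?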

Contrast that with the traditional pyramid approach.

We would start by adding unit tests for individual functions and modules, building up a comprehensive test suite from the bottom up. After a while, we know that individual modules work correctly in all edge cases, but we don’t know whether the program crashes when we run it.

While the example is contrived, the reality of many teams isn’t that different. They write many tests in the name of following best practices, but when it comes time to deploy something they aren’t confident.

When we look at their test suites, we find common problems.

As the pyramid prescribes, they have good unit test coverage. The trouble begins with the integration tests.

Because the good practice is to test each unit in isolation, internal dependencies are mocked. Because external dependencies like databases and APIs are hard to test against and make the tests slow, they are mocked or skipped. ⁴
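To make the cost of mocking concrete, here is a small Python sketch using the standard library’s `unittest.mock`. The `UserStore` class and `greeting_for` function are hypothetical. The code under test misspells a method name, the mock-based test repeats the misspelling, and the test passes anyway. Only the real object exposes the bug.

```python
from unittest.mock import Mock


class UserStore:
    """Hypothetical internal dependency of the code under test."""

    def find_by_email(self, email):
        return {"name": "Ada", "email": email}


def greeting_for(store, email):
    # Bug: the real method is find_by_email, not find_by_mail.
    user = store.find_by_mail(email)
    return f"Hello, {user['name']}!"


def test_greeting_with_mock():
    store = Mock()
    # The test repeats the typo, so the mock happily plays along.
    store.find_by_mail.return_value = {"name": "Ada"}
    assert greeting_for(store, "ada@example.com") == "Hello, Ada!"
    store.find_by_mail.assert_called_once_with("ada@example.com")


def fails_with_real_store():
    """Return True if the same call breaks against the real dependency."""
    try:
        greeting_for(UserStore(), "ada@example.com")
    except AttributeError:
        return True
    return False
```

The mocked test is green while the code would fail on its first real call, which is exactly the gap in confidence described above.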

Because running end-to-end tests with a moderately complex service that uses a database, calls other services and uses external APIs is hard, it is not done. And admittedly it is hard. Working with external dependencies makes tests hard to set up, it makes them slow and it makes them flaky. ⁵

The cost of this approach is great. When the tests pass, no one can be sure that the system won’t crash on startup in production or that it won’t throw errors when accessing the database. They get around this by running the system locally or doing manual tests in staging, but it is a red flag.

Another red flag is writing tests that feel tedious or pointless. Viewed through the lens of building confidence, if a test passing doesn’t make you any more confident about the final product working, there’s no reason to add it. ⁶

Following the confidence-building approach, a typical test suite looks like an hourglass, not a pyramid.

We still have many small unit tests. We want to be confident that we’ve covered all the edge cases in our logic, and testing those is easiest closest to the source. ⁷ In this layer we also include tests that call the database because we want to test that integration thoroughly. ⁸
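As a rough illustration of a unit-level test that calls a real database instead of a mock, here is a sketch in Python. It assumes SQLite only because the `sqlite3` module ships with Python. In practice you would point the test at the same database engine you run in production. The table and the helper functions are hypothetical.

```python
import sqlite3


def create_schema(conn):
    conn.execute(
        "CREATE TABLE users ("
        "id INTEGER PRIMARY KEY, "
        "email TEXT NOT NULL UNIQUE)"
    )


def add_user(conn, email):
    cur = conn.execute("INSERT INTO users (email) VALUES (?)", (email,))
    conn.commit()
    return cur.lastrowid


def find_user_id(conn, email):
    row = conn.execute(
        "SELECT id FROM users WHERE email = ?", (email,)
    ).fetchone()
    return row[0] if row else None


def test_add_and_find_user():
    # A fresh in-memory database per test keeps the test fast and isolated.
    conn = sqlite3.connect(":memory:")
    try:
        create_schema(conn)
        user_id = add_user(conn, "ada@example.com")
        assert find_user_id(conn, "ada@example.com") == user_id
        assert find_user_id(conn, "nobody@example.com") is None

        # The UNIQUE constraint is enforced by the database itself,
        # which is something a mock would never tell us.
        duplicate_rejected = False
        try:
            add_user(conn, "ada@example.com")
        except sqlite3.IntegrityError:
            duplicate_rejected = True
        assert duplicate_rejected
    finally:
        conn.close()
```

Because the SQL actually runs, a typo in a column name or a missing constraint fails the test instead of surfacing in production.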

On the other end, we have many end-to-end tests.

For a web service, those would be tests that use a browser to access the service as a user would. For an API, they would be calling it as a client. We use live dependencies for everything unless technically impossible (like a payment gateway that doesn’t have a test mode — in that case you should find a better one).
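For the API case, “calling it as a client” can be sketched as follows. This is a minimal, self-contained example using only Python’s standard library. The `/health` endpoint and the handler are hypothetical stand-ins for a real service, which you would normally start the same way it starts in production.

```python
import http.client
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


class HealthHandler(BaseHTTPRequestHandler):
    """Hypothetical service with a single JSON endpoint."""

    def do_GET(self):
        if self.path == "/health":
            body = json.dumps({"status": "ok"}).encode()
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_error(404)

    def log_message(self, format, *args):
        pass  # keep test output quiet


def call_health_endpoint():
    # Port 0 asks the operating system for any free port.
    server = HTTPServer(("127.0.0.1", 0), HealthHandler)
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    try:
        # Talk to the service over real HTTP, as a client would.
        conn = http.client.HTTPConnection(
            "127.0.0.1", server.server_port, timeout=5
        )
        conn.request("GET", "/health")
        response = conn.getresponse()
        status, payload = response.status, json.loads(response.read())
        conn.close()
        return status, payload
    finally:
        server.shutdown()
        server.server_close()
```

The test goes through the network stack, routing and serialisation in one pass, so a passing result says the feature works from the outside and not only that its parts do.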

The end-to-end tests should aim to exercise every feature at least once. In dynamic languages, you should be running all your code to make sure there aren’t any crashes. If you have a UI, it has to be tested.

What’s largely absent is a middle layer. In my experience, the ultimate test of modules “integrating” is that they work in an end-to-end scenario. Testing them beyond that doesn’t provide additional confidence.

One downside of this approach is that confidence isn’t easy to measure, but there are questions we can ask. Would you be nervous about your system getting deployed or shipped when your tests pass? Are there any (implicit) manual testing steps? Do you have to watch for problems after every deploy?

If you answered yes to any of these, you aren’t confident. Think about what tests would increase your confidence. What would you check manually? Automate it.

What is the right level of confidence? This is my favourite test: would your tests make you confident enough to deploy at 5pm on Friday, close your laptop and go home?

See Martin Fowler’s test pyramid for a comprehensive example. ↩
Because this approach relies on high coverage, it can degrade into an unquestioned “tests are good” attitude where high test coverage is the goal without any regard to its usefulness.

Less experienced teams are especially prone to blindly following the approach, leading to rigidity and testing for testing’s sake.

In practice, tests are not universally good. Like any code, they serve a purpose and have a cost. We have to write them and they are a liability to maintain. The cost can outweigh the benefit.

How many times have you seen a test that feels like assert(1 == 1) that was justified by the dogmatic “but we have to have a test for this”? How many times have you seen the same thing tested 3 times at 3 different layers for the sake of every class having tests?

This is wasteful and it slows down progress. Every time we change that code, we have to update 3 sets of tests instead of 1, making changes harder. ↩
In theory, if you could prove that a system is correct, you would be confident that it’s working. But in practice you can’t (see the halting problem).

Even if you got close, there are many pitfalls. You proved that your logic is correct, but can you be sure that the UI works as expected without testing it end-to-end? ↩
One common anti-pattern is the excessive use of mocks — mocking all external dependencies of a module and making assertions about those mocks (like which ones got called). In that case we aren’t testing much.

This is most salient in dynamic languages, where we can’t rely on a type system to catch basics such as the same typo appearing in both the method we call and the test that mocks it. ↩
The best practice of avoiding slow and flaky tests is right. Your tests should never be flaky: if they fail randomly, you learn not to trust them and they become pointless. Treat fixing those as high priority.

Slowness is subtler. Fast tests are preferable for a tight feedback loop when writing code. Waiting more than about 10 seconds for tests to run disrupts your flow.

On the other hand, end-to-end tests aren’t always the tests you’d run in a tight feedback loop. The vast majority won’t need to be run locally on every change, only during CI.

This is where advances in tooling can help. Cypress is a new, amazing tool for browser testing. It is designed to solve the flakiness and slowness problems shared by all of its predecessors like Selenium. ↩
For me, writing tests is never “fun”, but it shouldn’t feel like a tedious chore you’re doing to tick a box. Every good test is a service to your future self. If it feels like it isn’t then it’s ok to skip it. ↩
A mistake that I’ve made before is relying too much on the end-to-end tests to test all logic. When you start from E2E, it’s easy to keep adding test cases, but then you do end up with a test suite that’s slower and more complex than it needs to be.

A good approach is to test business logic exhaustively at the unit level, and then test the whole feature at least once at the end-to-end level. You could visualise it as an upside-down T shape. ↩
As mentioned in ⁴, I think that having tests that mock the database is a mistake. It’s useful to draw a distinction between external dependencies that are an integral part of your application — it couldn’t work without them — and those that aren’t. Don’t mock the integral parts.

For example, you shouldn’t mock the database, it is an integral part and it is under your control. It is ok to mock (in unit-level tests) an external API you use to look up the weather in one feature, if it is not integral and is likely used in a simple way. ↩