Scientific Tests

Algorithms are really easy to mess up. Take your pick for how: overfitting to training data, having bad training data, having too little training data, encoding human bias from your training data in the model and calling it “objective”. Feeding in new data that’s in the wrong format. Typos, subtle typos, nightmarishly subtle typos. Your cat stepping on the keyboard when you’re out of the room.

Having done this for a while, my first impulse any time I get amazing performance with an algorithm is to be deeply suspicious. This isn’t because algorithms can’t be incredibly powerful; you really can get amazing performance if you do them right. But you can also get seemingly amazing performance if you do them wrong, and there are a lot of ways to be wrong.

The core issue, I think, is that there are so many choices involved in the making and maintaining of an algorithm, and if the algorithm is trying to do something complicated, those choices can have complicated downstream effects. You can’t readily anticipate what these effects might be (this problem is hard enough that you’re building an algorithm for it, after all), but your brain tells you that there can’t be that much of a difference between a threshold being 0.3 and that same threshold being set to 0.4. So you blithely make the change to 0.4, expecting minimal effects, and then the whole thing just collapses underneath you.

I’m saying this because, while developers on the whole have gotten pretty on board with the concept of unit tests and test coverage for code, I’m not as sure about what currently exists around tests for data science algorithms in biology, medicine, and health. I’m not talking about tests that confirm a function gives the number we expect it to based on an old run of it (e.g. asserting that f(3.4) = 6.83 because we ran it once with 3.4 and got 6.83 as the answer). Those kinds of tests basically act like a flag that something’s changed, and if you change your function f, you can just paste in the new output from f to make the test pass.
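For concreteness, here's what that kind of brittle, snapshot-style test looks like (the function `f` below is just a placeholder that happens to reproduce the numbers above, not anything from a real codebase):

```python
# A toy "snapshot" test: it pins the output of one historical run of f.
# If f changes, the easy "fix" is to paste in the new number and move on,
# which is exactly why this kind of test doesn't tell you much.

def f(x):
    return 2.0 * x + 0.03  # placeholder implementation

def test_f_matches_old_run():
    assert abs(f(3.4) - 6.83) < 1e-6
```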

I’m talking about using a lot of data (a representative sample of what you’ve collected) in your code’s tests to assert that some macro-property of the algorithm’s output is preserved. If you make a change to the algorithm, and that macro-property changes, your code should let you know about it. If this sounds like one flavor of functional testing, that’s because it is. But the key thing I’m arguing for is that the performance of the data-science components of a product, the code two steps removed from sklearn or torch or R, be tested functionally as well.

Let’s talk through an example. Imagine you’ve got an algorithm for distinguishing sleep from wake over the course of the night. In my own experience, algorithms of this kind can have an unfortunate tendency to flip quickly, in response to a threshold being changed, from thinking a person slept a lot over the night to thinking that person barely slept at all. There are ways to address this, and it’s not always an issue, but it’s still something to look out for.

So how could you look out for it? Take a representative sample of data from a group of sleeping people, and add a test that runs your sleep algorithm on all of them and asserts that everyone is detected as getting at least some baseline amount of sleep. This way you don’t have to worry that changes to the algorithm made things better for some people but unexpectedly, dramatically worse for others. You can rest easy because assurances of this kind are built into your development pipeline.
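Here’s a minimal, runnable sketch of what that test could look like. Everything in it (the toy scorer, the synthetic “recordings”, the three-hour floor) is a placeholder: in practice you’d load deidentified recordings from real studies and call your actual sleep/wake algorithm.

```python
import numpy as np
import pytest

EPOCH_MINUTES = 1          # one sleep/wake score per minute of the night
MIN_SLEEP_HOURS = 3.0      # assumed floor for a representative sleeper

def score_sleep_hours(activity_counts, threshold=0.3):
    """Toy stand-in for a sleep/wake algorithm: epochs with activity below
    the threshold are scored as sleep; returns total sleep time in hours."""
    asleep = activity_counts < threshold
    return asleep.sum() * EPOCH_MINUTES / 60.0

def representative_nights():
    """Placeholder for a representative, deidentified sample of real nights."""
    rng = np.random.default_rng(0)
    return {f"subject_{i}": rng.random(480) * 0.6 for i in range(5)}

@pytest.mark.parametrize("subject_id", list(representative_nights()))
def test_everyone_gets_some_sleep(subject_id):
    night = representative_nights()[subject_id]
    total_sleep = score_sleep_hours(night)
    # Macro-property: no one in the sample should flip to "barely slept at all"
    # just because a threshold inside the algorithm moved.
    assert total_sleep >= MIN_SLEEP_HOURS, (
        f"{subject_id} scored only {total_sleep:.1f} h of sleep"
    )
```

Run it under pytest and any change that tanks a particular subject’s night shows up as a named, failing test instead of a surprise in production.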

At Arcascope, a lot of our scientific tests center around the mean absolute error of our prediction of melatonin onset. We want to predict when somebody’s melatonin onset is happening so we can map it to other quantities of interest: minimum core body temperature, peak athletic performance, peak fatigue, you name it. But we also want to do continual development work on these algorithms, to keep bringing the mean absolute error down over time. How can we make sure that our changes actually make things better? How can we make sure that a change that improves performance for some people isn’t sabotaging others? Tests that confirm the properties we care about are preserved, no matter what we do in the backend.
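Here’s a hedged sketch of what one of those checks might look like, with toy stand-ins for the real model, the validation data, and the thresholds (none of this is the actual Arcascope code):

```python
import numpy as np

MAX_MEAN_ABS_ERROR_HOURS = 1.0   # overall performance must not regress past this
MAX_SUBJECT_ERROR_HOURS = 2.5    # and no individual should get dramatically worse

def predict_onset_hours(light_history):
    """Toy stand-in for the real model: predicts melatonin onset in clock hours."""
    return 21.0 + 0.1 * float(np.mean(light_history))

def load_validation_set():
    """Placeholder for labeled data: (input signal, lab-measured onset) pairs."""
    rng = np.random.default_rng(1)
    return [(rng.random(96) * 10, 21.5 + rng.normal(0, 0.3)) for _ in range(20)]

def test_melatonin_onset_error_does_not_regress():
    data = load_validation_set()
    errors = np.array([abs(predict_onset_hours(x) - y) for x, y in data])
    # Macro-property one: the average error stays below the line we've reached.
    assert errors.mean() <= MAX_MEAN_ABS_ERROR_HOURS, f"MAE rose to {errors.mean():.2f} h"
    # Macro-property two: a change that helps the average can still sabotage
    # individuals, so bound the worst case too.
    assert errors.max() <= MAX_SUBJECT_ERROR_HOURS, f"Worst case rose to {errors.max():.2f} h"
```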

I mentioned above that I don’t know what’s out there elsewhere in digital health, and I don’t. Maybe a lot of people are writing tests of this kind in their code! All I know is that when we realized we could do this (add fundamental performance checks for the algorithms that make up our backend, embedding deidentified human data from our studies directly into our testing suite), it was a very cool moment. It helps me sleep a lot better at night knowing those tests are there.

Which is good, because sleeping more means I’m introducing fewer nightmarishly subtle typos to our code. Wins all around.