Recently, the FDA released a discussion paper on AI software (you can find the discussion paper, and leave your own comments, here).
Leaving the debate itself aside, one thing that is clear to us is that core parts of this conversation aren’t well understood outside of the more technical AI circles. And that leaves a number of important voices out of the commentary: domain experts such as doctors, life science professionals, and others who aren’t as well versed in machine learning.
In order to understand the questions the FDA is asking, it’s important to understand the motivation behind all of this discussion in the first place. At its core, the challenge related to AI and Machine Learning (ML) is change. And how to deal with change requires discussion.
Unlike traditional products, AI and ML software adapts based on either (a) changes to the algorithms or (b) changes to the training data (remember, these tools are generally trained to do a task by learning from labeled examples). This adaptation will have known effects (e.g., predictions become more accurate or apply in more cases), but also unknown ones (since seeing inside how the model works can range from challenging to nearly impossible).
As an example of unforeseen circumstances, researchers have shown that certain image recognizers can be tricked into producing different predictions just by adding “static” to images. For instance, in the example in this article, adding a carefully crafted layer of static makes the system classify a picture of a panda as a gibbon.
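The panda-to-gibbon trick can be sketched in a few lines. The “classifier” below is a hypothetical toy (a single linear score over made-up pixel values, not a real image model), but the idea is the same one behind gradient-sign attacks on real recognizers: nudge every pixel slightly in whichever direction pushes the score toward the wrong answer.

```python
import numpy as np

# Hypothetical toy "image classifier": a single linear score over 100
# pixel values. Score > 0 means one label; score <= 0 means the other.
rng = np.random.default_rng(0)
weights = rng.normal(size=100)
image = rng.normal(size=100)       # stand-in for a flattened image

score = weights @ image

# Craft the "static": move each pixel a small amount (epsilon) in the
# direction that pushes the score the opposite way.
epsilon = 0.5
static = -np.sign(score) * np.sign(weights) * epsilon

new_score = weights @ (image + static)
print(score, new_score)   # the sign -- and therefore the label -- flips
```

Each pixel moves by at most 0.5, yet the accumulated effect across all 100 pixels is enough to swing the score past zero and change the prediction.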
And this is the core of the FDA document – what types of changes require the FDA to re-analyze the software? How does one measure the change? How can you ensure that the software performs at least as well as when it was initially approved?
The document is insightful and careful, but to make sense of it requires a core understanding of the two most common ways that AI and ML software can change.
An interesting aspect of AI and ML software is that the fundamental component that does the prediction, classification, etc. (called the “model”) can be changed and updated, while the surrounding software continues to function the same way.
For instance, consider software that identifies malignant tumors using a picture from a mobile phone (an example from the FDA document). You snap a picture, and the software renders its judgment – benign or malignant.
And perhaps the company has 10,000 examples of benign images and 10,000 examples of malignant images to train its algorithms.
Using the same example data, the software developers can create completely different ML algorithms that produce different, hopefully better, models. These models will then power the judgment made by the mobile app.
So, the data is the same (same 10,000 examples of each), the software functionality is the same (take a picture, get a judgment), but behind the scenes the model can change completely.
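A sketch of what “same data, different model” means in practice: the toy below trains two completely different algorithms on the exact same labeled examples (hypothetical synthetic numbers standing in for the tumor images). Both expose the identical interface — take an image, return a judgment — yet what sits behind that interface is entirely different.

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical stand-in for the labeled examples: class 0 ("benign")
# clusters near -1, class 1 ("malignant") near +1, 20 features each.
X = np.vstack([rng.normal(-1.0, 1.0, (200, 20)),
               rng.normal(+1.0, 1.0, (200, 20))])
y = np.array([0] * 200 + [1] * 200)

def train_nearest_centroid(X, y):
    """Model 1: remember the average of each class, pick the closer one."""
    c0, c1 = X[y == 0].mean(axis=0), X[y == 1].mean(axis=0)
    return lambda img: int(np.linalg.norm(img - c1) < np.linalg.norm(img - c0))

def train_logistic(X, y, steps=500, lr=0.1):
    """Model 2: learn a weighted vote over features by gradient descent."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-X @ w))
        w -= lr * (X.T @ (p - y)) / len(y)
    return lambda img: int(img @ w > 0)

# Same data in, same interface out -- different models behind the scenes.
model_a = train_nearest_centroid(X, y)
model_b = train_logistic(X, y)
```

On clear-cut inputs the two models agree, but their internals (two stored averages versus a learned weight vector) have nothing in common — which is exactly why “the app looks the same” tells you little about how much the model has changed.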
So, why is this an issue? First, hopefully the new model is better. But according to the FDA, we need to consider what “better” means. Also, perhaps the software gained new functionality (maybe it works in a wider range of cases). But how can we be sure it works in these new areas? And are the changes in behavior so great as to warrant a new approval? Going further, how can we be sure that a model that got better at one task won’t actually perform worse in the real world?
Perhaps the biggest difference between AI software and ordinary software (or, really, most products) is that the product can adapt, even after it’s made. Using your billing software doesn’t fundamentally change its performance. It still tracks bills.
But with AI and ML, you can create a computer program that adapts either through use (and feedback) or by giving it even more training examples. For instance, most of us have had articles suggested to us based on ones we’ve clicked on in the past. This is an example of an adaptive algorithm.
As a more relevant example, consider software that identifies something in an X-ray (to use another example from the FDA). The exact same software (including both the workflow and the underlying algorithm) could be significantly more accurate 6 months later after being given thousands more training examples. So, the model has improved solely based on new data, though the software and algorithm haven’t changed at all. This is tricky to manage because new training data alone has changed something that was already approved for use in a clinical setting.
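One way to picture “same software, shifting model” is a classifier whose code and interface never change after deployment, but whose internal state moves with every new labeled example it is given. This is a hypothetical sketch (a simple class-average model on synthetic numbers, not a real X-ray system):

```python
import numpy as np

class XRayFlagger:
    """Toy stand-in for the FDA's X-ray example. This class never
    changes after "deployment" -- but the model it holds (one running
    average per class) shifts every time a new labeled scan arrives."""

    def __init__(self, n_features):
        self.sums = np.zeros((2, n_features))
        self.counts = np.array([0.0, 0.0])

    def learn(self, scan, label):
        self.sums[label] += scan
        self.counts[label] += 1

    def predict(self, scan):
        centroids = self.sums / self.counts[:, None]
        return int(np.argmin(np.linalg.norm(centroids - scan, axis=1)))

rng = np.random.default_rng(2)
flagger = XRayFlagger(n_features=4)

# Initial (hypothetical) training set: a handful of labeled scans.
for _ in range(5):
    flagger.learn(rng.normal(-1.0, 1.0, 4), 0)   # "normal" scans
    flagger.learn(rng.normal(+1.0, 1.0, 4), 1)   # "flagged" scans
snapshot = flagger.sums.copy()

# Six months later: same code, same algorithm -- thousands more examples.
for _ in range(2000):
    flagger.learn(rng.normal(-1.0, 1.0, 4), 0)
    flagger.learn(rng.normal(+1.0, 1.0, 4), 1)
# The approved software is unchanged; the model inside it is not.
```

Nothing about `XRayFlagger` was edited between the two training phases, yet its internal state — and therefore its judgments on borderline scans — is no longer what was originally evaluated.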
On the one hand, if the accuracy improves, that is great. But, as we mentioned with algorithm changes, a pitfall could be that certain biases become further reinforced. So, while the model seems to improve, it may have learned to focus too narrowly on the given examples, at the expense of other images it might see (we call this “overfitting”). In fact, all of the issues related to change when the algorithm is different apply here too, but interestingly, the only thing that has changed is the data used to train and refine the algorithm!
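Overfitting is easy to demonstrate with a toy model that simply memorizes its training examples — here, a hypothetical 1-nearest-neighbor classifier on synthetic, deliberately overlapping data. It scores perfectly on the examples it has seen, yet noticeably worse on examples it hasn’t:

```python
import numpy as np

rng = np.random.default_rng(3)

def make_data(n):
    """Synthetic, deliberately overlapping classes (means at -0.5 / +0.5)."""
    X = np.vstack([rng.normal(-0.5, 1.0, (n, 2)),
                   rng.normal(+0.5, 1.0, (n, 2))])
    y = np.array([0] * n + [1] * n)
    return X, y

X_train, y_train = make_data(100)
X_test, y_test = make_data(100)

def predict_1nn(img):
    """Memorize everything: answer with the label of the closest
    training example (an easy way to overfit)."""
    return y_train[np.argmin(np.linalg.norm(X_train - img, axis=1))]

train_acc = np.mean([predict_1nn(x) == t for x, t in zip(X_train, y_train)])
test_acc = np.mean([predict_1nn(x) == t for x, t in zip(X_test, y_test)])
print(train_acc, test_acc)   # perfect on seen data, worse on unseen data
```

The gap between the two accuracies is the warning sign: a model evaluated only on data it was trained on can look far better than it will behave in the clinic.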
Our point is not to render an opinion on the FDA document or answer their questions. That is for you to do, dear reader. But we want to make sure everyone understands the issues a bit more clearly so that the voices being heard aren’t only those with AI backgrounds, but those with other interests in healthcare as well.