The FDA has approved AI-based PET/MRI “denoising”. How safe is this technology?

By LUKE OAKDEN-RAYNER, MD

Super-resolution* promises to be one of the most impactful medical imaging AI technologies, but only if it is safe.

Last week we saw the FDA approve the first MRI super-resolution product, from the same company that received approval for a similar PET product last year. This news seems as good a reason as any to talk about the safety concerns that I and many other people have with these systems.

Disclaimer: the majority of this piece is about medical super-resolution in general, and not about the SubtleMR system itself. That specific system is addressed directly near the end.

Zoom, enhance

Super-resolution is, quite literally, the “zoom and enhance” CSI meme in the gif at the top of this piece. You give the computer a low quality image and it turns it into a high resolution one. Pretty cool stuff, especially because it actually kind of works.

In medical imaging though, it’s better than cool. You ever wonder why an MRI costs so much and can have long wait times? Well, it is because you can only do one scan every 20-30 minutes (with some scans taking an hour or more). The capital and running costs are only spread across one to two dozen patients per day.

So what if you could get an MRI of the same quality in 5 minutes? Maybe two to five times more scans (the “getting patient ready for the scan” time becomes the bottleneck), meaning less cost and more throughput.
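For a rough sense of scale, here is a back-of-the-envelope calculation. The operating hours and changeover time below are my own assumed numbers for illustration, not figures from any real scanner department:

```python
# Back-of-the-envelope scanner throughput. All numbers are assumptions
# for illustration, not real departmental figures.
HOURS_PER_DAY = 10       # assumed scanner operating hours per day
CHANGEOVER_MIN = 10      # assumed "getting the patient ready" time per slot

def scans_per_day(scan_minutes, changeover=CHANGEOVER_MIN, hours=HOURS_PER_DAY):
    """Patients scanned per day if each slot = scan time + changeover time."""
    return int(hours * 60 // (scan_minutes + changeover))

print(scans_per_day(25))  # ~17 patients/day with 25 minute scans
print(scans_per_day(5))   # ~40 patients/day with 5 minute scans (roughly 2.3x)
```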

This is the dream of medical super-resolution.

But this isn’t AI making magical MRI machines. There is a hard limit to the speed of MRI – the time it takes to acquire an image is determined by how long it takes for spinning protons to relax after they get energised with radio waves. We are running up against some pretty fundamental subatomic physics here.

AI in this context works after you take the image, as a form of post-processing. We can actually do 5 minute scans already, you just get a very noisy picture (those pesky protons aren’t relaxed enough). In radiology, we call this sort of image “non-diagnostic”, since most of the detail is hidden in the grainy fuzz. In modern practice, we would simply repeat the study rather than reporting a study we can’t safely interpret.

MRI noise – the image on the left is a simulated example of what the image on the right would look like if you scanned for 1/10th of the time. From here.
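A simulation like the one above can be produced in a few lines of code. This is a minimal sketch only, assuming that SNR scales with the square root of acquisition time and using simple Rician-style magnitude noise; real MRI noise is more complicated than this.

```python
import numpy as np

def simulate_fast_scan(image, time_fraction=0.1, full_scan_sigma=0.01, rng=None):
    """
    Crudely simulate a magnitude MR image acquired in `time_fraction` of the
    original scan time. Assumes SNR scales with sqrt(acquisition time), so the
    noise standard deviation grows by 1/sqrt(time_fraction). Complex Gaussian
    noise is added and the magnitude taken, giving Rician-like noise.
    """
    rng = np.random.default_rng() if rng is None else rng
    sigma = full_scan_sigma / np.sqrt(time_fraction)
    noise_re = rng.normal(0.0, sigma, image.shape)
    noise_im = rng.normal(0.0, sigma, image.shape)
    return np.sqrt((image + noise_re) ** 2 + noise_im ** 2)

# e.g. noisy = simulate_fast_scan(clean_image, time_fraction=0.1)
```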

Enter, artificial intelligence.

If CSI taught us anything, it is that really good computers can fix this problem. Proton spins? Pish posh, we have computers! The whole idea, phrased a bit cynically but not actually misrepresented, is that the AI system can look at the low quality image and … guess what it would have looked like if you scanned it for longer.

Super-resolution! This is an AI made image – they feed in the grainy one from above and try to produce the high quality one.
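Under the hood, the recipe is conceptually simple. The sketch below is a generic illustration of how such models are typically trained, not a description of any particular vendor's system: collect pairs of fast, noisy scans and slow, diagnostic-quality scans, and train a network to map one to the other with a pixel-wise loss. The `paired_loader` is a hypothetical data loader yielding such pairs.

```python
import torch
import torch.nn as nn

# A deliberately tiny residual CNN: it predicts the "clean" image from the
# noisy fast-scan image. Real products use far larger networks, but the
# recipe -- learn a mapping from (fast, noisy) to (slow, clean) pairs -- is broadly the same.
class TinyDenoiser(nn.Module):
    def __init__(self, channels=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, channels, 3, padding=1), nn.ReLU(),
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(),
            nn.Conv2d(channels, 1, 3, padding=1),
        )

    def forward(self, noisy):
        return noisy + self.net(noisy)   # residual: predict the correction to apply

model = TinyDenoiser()
optimiser = torch.optim.Adam(model.parameters(), lr=1e-4)
loss_fn = nn.MSELoss()   # pixel-wise loss: the crux of the "fill in the blanks" behaviour

# `paired_loader` is a hypothetical DataLoader yielding (noisy, clean) image pairs,
# e.g. simulated fast scans paired with their full-length diagnostic scans.
def train_epoch(paired_loader):
    for noisy, clean in paired_loader:
        optimiser.zero_grad()
        loss = loss_fn(model(noisy), clean)
        loss.backward()
        optimiser.step()
```

Note that nothing in this recipe rewards the network for being right about rare findings; it is rewarded for being plausible on average.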

Now, obviously, “guessing” doesn’t sound ideal in a risk critical situation like medical imaging.

Well … it gets worse.


Out-of-distribution errors

The real problem is not that AI makes up stuff in general; it is that it might fail to correctly guess the important bits.

Here is the side by side comparison of the AI generated “denoised” image (on the left) and the diagnostic quality image (on the right).

If you look closely you can see that it isn’t perfect (there is still some blurring present), but the overall anatomy looks pretty reasonable. Nothing is horribly out of place, and maybe the blurring/quality issue is solvable with improved AI algorithms.

But here is the key question:

As a diagnostic radiologist, would you prefer a picture that:

  • a) shows you the normal anatomy clearly?
  • b) shows you disease clearly (if it is present)?

This example image doesn’t contain any disease, so you might think it can’t tell us anything about whether this sort of AI can cope with disease features. On the contrary, this picture shows us pretty convincingly that it will probably fail to generate the features of disease, particularly for small volume, rare diseases.

This, like most things in AI, is all about the training data.

These models work, at a base level, by filling in the blanks. You have an area where the pixel values are corrupted by noise? The job of the model is to pick “realistic” pixel values to replace the ones already there. To understand this, let’s look at an example: the scalp over the forehead.

Here we have the noisy version on the left, and we see there are some diagonal stripes. In particular, there are two bright stripes. In the middle image (the diagnostic scan) we can see that the top-left stripe is less bright than the one below it, but on the noisy image they don’t look as different.

So how come the AI image on the right gets the correct brightness for this layer?

This AI model would have been trained on thousands of MRI head images. For every single example it has ever seen, there is a dull layer and a bright layer under it. It just fills it in with what is consistently there.

What about for visual features that aren’t always there though? Let’s look at the back of the neck.

One thing you notice in the original scan (middle) is the black dot. This is a blood vessel (imagine looking at a tube end-on).

But this blood vessel doesn’t seem to be there on the AI version (right). Why?

Well, blood vessels are rare and highly variable in position. They never occur in the same place, apart from the biggest ones (and even then they vary more than almost all other anatomy). So when an AI model looks at the noisy image (left), which doesn’t give a strong indication that a blood vessel was there, what will it do?

Will it fill in normal fat, which is seen in this area in 99.99999…% of people, or will it randomly decide to put a blood vessel there?

We would say that this image feature (a blood vessel dot in that exact location) is outside of the training distribution. It has never (or rarely) been seen before, and as such, the model has no reason to change the pixel values to be “blood vessel pixels”.

It is likely that AI super-resolution algorithms will create images of people without any superficial blood vessels. While this is clearly unnatural (these people would die quite quickly if they existed), it isn’t really a problem. Blood vessels, like most anatomy, are almost never relevant in radiology practice.

But this hints at an underlying flaw. Things that are small, and rare, won’t be reproduced accurately.

And unfortunately, that pretty much describes most disease.
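You don’t need a neural network to see why this happens. Here is a toy calculation with invented numbers: if a pixel is “fat” in 99.99% of training subjects and “blood vessel” in 0.01%, and the noisy input gives the model no real hint which one is present, then the prediction that minimises a pixel-wise squared error is the conditional mean, which is essentially just fat.

```python
import numpy as np

# Toy illustration of why MSE-trained models erase rare features.
# Assume a pixel where the true (clean) intensity is "fat" (0.8) in 99.99% of
# training subjects and "blood vessel" (0.1) in 0.01%, and assume the noisy
# input gives no useful hint about which one is present. All numbers invented.
p_vessel = 0.0001
fat, vessel = 0.8, 0.1

# With a pixel-wise squared-error loss, the best possible prediction for an
# uninformative input is the conditional mean over the training distribution:
best_mse_prediction = (1 - p_vessel) * fat + p_vessel * vessel
print(best_mse_prediction)   # ~0.79993 -> indistinguishable from fat

# The model "fills in" fat essentially every time, so the rare dark dot
# (a vessel here, a disease feature elsewhere) simply never appears.
```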


Preventing the worst case scenario

I won’t go into too much detail here, since I hope it is obvious that people with any given disease are much less common than people without that disease. It is also true that many diseases (but not all) have features that vary in location and appearance, or are very small.

The “swallow-tail sign” in Parkinson’s disease. For reference, this structure is about 1-2cm in size, the gap between the swallow’s tail feathers (the dark bands on the left image) is maybe a millimetre.

So what is the risk of these super-resolution systems in practice?

Well, these AIs could straight up delete the visual features of diseases, while making images that look like they are diagnostic quality. This is really the worst possible outcome, since human assessment of image quality is a safety net in current practice. Destroying our ability to know when an image is flawed isn’t a good idea.

So far, I’ve sounded pretty negative about this technology. I’m not, at least not entirely. I think it could work, and I definitely hope that it does. The promise is so huge in terms of cost savings and expanded access that it is worth exploring, at least.

But I am concerned about safety. To achieve the promise of this technology, we need to mitigate the risks.

The only real answer is to do testing to make sure these models are safe. Ideally you would run a clinical trial and show that patients have similar outcomes with AI-generated images as with normal diagnostic quality ones, but as I said in a recent piece, we are unlikely to see pre-marketing clinical trials any time soon.

Instead, we will need to focus on the major risk: not “are the images visually similar?” but “are the images diagnostic?”.

Unlike the earlier images, these black dots are actually signs of a disease (amyloid angiopathy). Do we really think the AI will be better with these dots than it was with blood vessels?

I don’t want to pick on the company with the FDA approval, but they are the only group I can talk about (as I have said before about other first-mover developers, it isn’t their fault that this technology is new and untested).

As far as I can tell (acknowledging that 510(k) decision summaries are often incomplete), the FDA only required Subtle Medical to show that their system produces decent-looking images. The summary is here (pdf link); I’ve included an excerpt below:

The main performance study, utilizing retrospective clinical data, was divided into two tests.

For the noise reduction performance test, acceptance criteria were that (i) signal-to-noise ratio (SNR) of a selected region of interest (ROI) in each test dataset is on average improved by greater than or equal to 5% after SubtleMR enhancement compared to the original images, and (ii) the visibility of small structures in the test datasets before and after SubtleMR is on average less than or equal to 0.5 Likert scale points. This test passed.

For the sharpness increase performance test, acceptance criteria were that the thickness of anatomic structure and the sharpness of structure boundaries are improved after SubtleMR enhancement in at least 90% of the test datasets. This test passed.
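To make concrete what the noise reduction test does and does not measure, here is a sketch of an ROI-based SNR check of this kind. The summary does not spell out the exact SNR definition used, so the mean-over-ROI divided by background standard deviation below is just one common convention:

```python
import numpy as np

def roi_snr(image, roi_mask, background_mask):
    """
    One common convention for image SNR: mean signal inside an ROI divided by
    the standard deviation of a background (air) region. The 510(k) summary
    doesn't define its exact formula, so treat this as illustrative only.
    """
    return image[roi_mask].mean() / image[background_mask].std()

def passes_snr_criterion(originals, enhanced, roi_masks, bg_masks, threshold=0.05):
    """Is the average relative SNR improvement across test datasets >= 5%?"""
    gains = [
        roi_snr(enh, roi, bg) / roi_snr(orig, roi, bg) - 1.0
        for orig, enh, roi, bg in zip(originals, enhanced, roi_masks, bg_masks)
    ]
    return np.mean(gains) >= threshold

# Note what this does *not* measure: whether a radiologist can still see disease.
```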

So they showed that small structures look fairly similar, but they make no mention of diagnostic accuracy. At minimum, in my opinion, you should need to show that doctors can diagnose disease with these images. Of course, that is a horribly high bar, since there are thousands of diseases that can occur on a brain MRI^. But you need a high bar. Even if the model is fine for most diseases, you only need one disease that is not visualised 5% of the time and patients are at risk.

Now, to be fair to Subtle Medical and the FDA, the SubtleMR product is not licensed for use with non-diagnostic input images. The summary specifically states “The device’s inputs are standard of care MRI images. The outputs are images with enhanced image quality.” In this setting, the radiologist will always be able to double check the original image (although assuming they will do so is a bit silly).

But this is not the goal of these systems; the goal is to use non-diagnostic images to produce high-quality ones. The alternative (making diagnostic images look better) is a questionable value proposition at best.

I note that the earlier product (SubtlePET) does not dismiss this goal – in fact they say on their website that the AI model allows for 4x faster PET imaging. I also note that the summary for that FDA approval does not mention clinical testing either.

In my opinion, we need urgent independent clinical testing of the core medical super-resolution technology, probably by an academic centre or group of centres. Whether this involves testing the diagnosability of all diseases or just a high-risk subset^, until some form of clinical assessment has been done, we can’t know if super-resolution is safe.

I truly do want this technology to work. Until it is appropriately tested, however, I will be sceptical. The risk is much higher than is immediately obvious for these “simple denoising algorithms”*.


* I might get some folks a bit upset by calling this super-resolution throughout the piece, as it is often called denoising in this context. I honestly think calling these systems “denoisers” is a bit cheeky. It sounds like all they are doing is removing noise from an otherwise normal image. As we have seen, this isn’t the case at all. In fact, a more appropriate term used in machine learning circles is “AI hallucination”. I doubt anyone will start marketing a “medical image hallucinator” anytime soon though.

^ yet another situation where having expert defined “high risk image features” would be worthwhile.

Luke Oakden-Rayner is a radiologist (medical specialist) in South Australia, undertaking a Ph.D in Medicine with the School of Public Health at the University of Adelaide. This post originally appeared on his blog here.