Clinical Depth: The Power of Knowing More than the Minimum

In medicine, contrary to common belief, it is not usually enough to know the diagnosis and its best treatment or procedure. Guidelines, checklists and protocols only go so far when you are treating real people with diverse constitutions for multiple problems under a variety of circumstances.

Everyone would agree, I think, that the more you know about unusual presentations of common diseases, the more likely you are to make the correct diagnosis. Likewise, the more you know about the rare diseases that can look like the common one you think you’re seeing in front of you, rather than just having a memorized list of rule-outs, the better you are at deciding how much extra testing is practical and cost effective in each situation.

Not everyone with high blood pressure needs to be tested in detail for pheochromocytoma, renal artery stenosis, coarctation of the aorta, Cushing’s syndrome, hyperaldosteronism, hyperparathyroidism or thyroiditis. But you need to know enough about all of these things to have them in mind, automatically and naturally, when you see someone with high blood pressure.

Just having a lifeless list in your pocket or your EMR, void of vivid details and depth of understanding, puts you at risk of being a burned-out, shallow healthcare worker someday replaced by apps or artificial intelligence.

The power of knowing these exceptions to the common rules in enough detail to naturally be able to reference them is what makes a doctor a “docere”, a true learned professional.

I recently came across the term “airmanship”, which is when you intimately know your plane, the weather and the gravitational, centrifugal and all the other physical forces that can alter your flight. Airmanship is taught in rigorous military training that brings you close to the limits of what can be done, and far beyond what you will see most days as a commercial pilot, in order to prepare you for those times when everything depends on your judgement.

Primary care medicine may not seem like heroic aerial acrobatics, but it can actually involve a fair amount of flying by the seat of our pants, which must be a real expression straight out of advanced flight school.

Only experience and in-depth knowledge empower you with an appreciation for nuances. Is it necessary to treat mild renal artery stenosis if the blood pressure is easily controlled with medication? A patient with low potassium and high blood pressure probably does have hyperaldosteronism, but do you have to do anything more than prescribe spironolactone regardless of why the potassium runs low?

There is another side to having deep knowledge, besides making you a cost effective clinician. Patients trust you more if you show that you know a lot about why you’re recommending a certain intervention. And that is not a trivial consideration. Opinions on everything from when life begins and ends to whether coconut oil is good or bad vary so much that what your family doctor says is only one in a crowded field of competing views.

Even guidelines for the most common diseases we treat change too often for patients to feel comfortable just because we tell them that the target numbers or best practices have changed since the last time we saw them.

So, on the most basic level, our demonstrated knowledge in diagnosis and treatment builds case-specific credibility.

Patients usually take great comfort in seeing that you have considered reasonable differential diagnoses and know how the treatment you recommend works and also what to do if the treatment doesn’t work.

But the other consideration is that if we demonstrate a breadth and depth of our medical and scientific knowledge, we also gain a broader credibility and authority when we apply our knowledge and understanding to related areas. Obviously, we shouldn’t claim authority in unrelated areas like fashion or finance. That phenomenon, called ultracrepidarianism, has always been rampant in our culture, for instance in advertisements claiming that more doctors smoked Camels than any other brand of cigarette. But we do have a role as well educated generalists in helping patients evaluate medical news, for example.

The third level is distinctly different from ultracrepidarianism, and that is the authority patients place in our general wisdom, for lack of a more politically correct word; years of schooling and experience with life, disease and death allow us to say things people need to hear in certain situations. Our words of encouragement, our little gestures of caring and kindness can have much greater impact because of the position of authority we may have earned in people’s lives.

I just read a senior psychiatrist’s list of 50 pieces of advice for younger colleagues and his Number 15 really resonated with me:

“15. Try to create rare magic moments—things you say to patients that they will remember always and use in changing their lives.” – Allen Frances, MD, Professor Emeritus and former Chair, Department of Psychiatry, Duke University

This is an earned power that needs to be carefully considered, because we can just as easily hurt or undermine our patients if we speak carelessly or impulsively.

Hans Duvefelt is a Swedish-born rural Family Physician in Maine. This post originally appeared on his blog, A Country Doctor Writes, here.

The Health Data Goldilocks Dilemma | Vince Kuraitis & Deven McGraw

Which is better: sharing access to all health data across platforms so that interoperability is achieved, or protecting some data for the sake of privacy? Health data privacy experts Vince Kuraitis, founder of Better Health Technologies, and Deven McGraw, Chief Regulatory Officer at Ciitzen, are crowdsourcing opinions and insights on what they are calling The Health Data Goldilocks Dilemma. How much data protection is ‘juuuust right’? What should be regulated? And, by whom? The duo talks through their views on the data protection debate and urges others to join in via their blog series, “The Health Data Goldilocks Dilemma,” on The Health Care Blog.

Filmed at the HIMSS Health 2.0 Conference in Santa Clara, CA in September 2019.

Prehab Tool and AI Win Big at the 2019 RWJF Live Pitch

Six finalists competed in an exciting live pitch for the Robert Wood Johnson Foundation’s 2019 Innovation Challenges at the 2019 Health 2.0 Annual Conference. They demoed their technologies in front of an audience of health care professionals, investors, provider organizations, and members of the media. The Home and Community Based Care Challenge sought technologies that support the advancement of at-home or community based care. The Social Determinants of Health Innovation Challenge called for solutions that increase access to services related to social determinants of health.

During the 3-day Conference, Jessica DaMassa, Executive Producer & Host of @WTF_Health, spoke with the finalists about their experience competing in the RWJF Innovation Challenges, their personal highlights, and what’s next!

Home and Community Based Care Innovation Challenge Finalists

First Place:

Ooney’s home-based web-app for older adults, Prehab Pal, delivers individualized prehabilitation to accelerate postoperative functional recovery and return to independence after surgery.

Second Place:

Wizeview uses artificial intelligence to automate and organize information collected during home visits, supporting the management of medically complex populations at the lowest cost per encounter.

Third Place:

Heal doctor house calls, paired with Heal Hub remote patient monitoring and telemedicine, offer a complete connected care solution for patients with chronic conditions.

Social Determinants of Health Innovation Challenge Finalists

First Place:

Social Impact AI Lab is a consortium of nonprofit social services agencies and technology providers with artificial intelligence solutions to address social disconnection in child welfare.

Second Place:

Community Resource Network’s Social Determinants of Health Client Profile creates a whole-person picture across physical, behavioral, and social domains to expedite help for those most at risk, fill in the gaps in care, and optimize well-being.

Third Place:

Open City Labs matches patients with community services and government benefits that address SDoH seamlessly. The platform will integrate with HIEs to automate referral, eligibility screening & benefits enrollment.

Congratulations to the six winners and thank you to all of the participants involved in both Innovation Challenges. To learn more about these efforts, you can visit the Home and Community Based Care and Social Determinants of Health Challenge websites.

Catalyst @ Health 2.0 (“Catalyst”) is the industry leader in digital health strategic partnering, hosting competitive innovation “challenge” events, as well as developing and implementing programs for piloting and commercializing novel healthcare technologies.

The FDA has approved AI-based PET/MRI “denoising”. How safe is this technology?

Super-resolution* promises to be one of the most impactful medical imaging AI technologies, but only if it is safe.

Last week we saw the FDA approve the first MRI super-resolution product, from the same company that received approval for a similar PET product last year. This news seems as good a reason as any to talk about the safety concerns that I and many other people have with these systems.

Disclaimer: the majority of this piece is about medical super-resolution in general, and not about the SubtleMR system itself. That specific system is addressed directly near the end.

Zoom, enhance

Super-resolution is, quite literally, the “zoom and enhance” CSI meme in the gif at the top of this piece. You give the computer a low quality image and it turns it into a high resolution one. Pretty cool stuff, especially because it actually kind of works.

In medical imaging though, it’s better than cool. You ever wonder why an MRI costs so much and can have long wait times? Well, it is because you can only do one scan every 20-30 minutes (with some scans taking an hour or more). The capital and running costs are only spread across one to two dozen patients per day.

So what if you could get an MRI of the same quality in 5 minutes? Maybe two to five times more scans (the “getting patient ready for the scan” time becomes the bottleneck), meaning less cost and more throughput.

This is the dream of medical super-resolution.

But this isn’t AI making magical MRI machines. There is a hard limit to the speed of MRI – the time it takes to acquire an image is determined by how long it takes for spinning protons to relax after they get energised with radio waves. We are running up against some pretty fundamental subatomic physics here.

AI in this context works after you take the image, as a form of post-processing. We can actually do 5 minute scans already, you just get a very noisy picture (those pesky protons aren’t relaxed enough). In radiology, we call this sort of image “non-diagnostic”, since most of the detail is hidden in the grainy fuzz. In modern practice, we would simply repeat the study rather than reporting a study we can’t safely interpret.

MRI noise – the image on the left is a simulated example of what the image on the right would look like if you scanned for 1/10th of the time. From here.
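
The relationship between scan time and noise can be sketched numerically. This is a toy simulation, not the method behind the figure above: it assumes purely Gaussian noise and the standard MRI rule of thumb that SNR scales with the square root of acquisition time.

```python
import numpy as np

def simulate_fast_scan(image, time_fraction, full_scan_sigma=0.02, seed=0):
    """Add Gaussian noise whose standard deviation grows as 1/sqrt(time).

    MRI SNR scales roughly with the square root of acquisition time, so a
    scan at 1/10th the time carries about sqrt(10) ~ 3.2x the noise.
    """
    rng = np.random.default_rng(seed)
    sigma = full_scan_sigma / np.sqrt(time_fraction)
    return image + rng.normal(0.0, sigma, image.shape)

clean = np.ones((64, 64))              # stand-in for a diagnostic-quality image
fast = simulate_fast_scan(clean, 0.1)  # a "1/10th of the time" scan: much noisier
```

At `time_fraction=0.1` the added noise is roughly 3.2 times that of the full-length scan, which is the kind of degradation the simulated image illustrates.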

Enter, artificial intelligence.

If CSI taught us anything, it is that really good computers can fix this problem. Proton spins? Pish posh, we have computers! The whole idea, phrased a bit cynically but not actually misrepresented, is that the AI system can look at the low quality image and … guess what it would have looked like if you scanned it for longer.

Super-resolution! This is an AI made image – they feed in the grainy one from above and try to produce the high quality one.

Now, obviously, “guessing” doesn’t sound ideal in a risk critical situation like medical imaging.

Well … it gets worse.

Out-of-distribution errors

The real problem is not that AI makes things up in general; it is that it might fail to correctly guess the important bits.

Here is the side by side comparison of the AI generated “denoised” image (on the left) and the diagnostic quality image (on the right).

If you look closely you can see that it isn’t perfect (there is still some blurring present), but the overall anatomy looks pretty reasonable. Nothing is horribly out of place, and maybe the blurring/quality issue is solvable with improved AI algorithms.

But here is the key question:

As a diagnostic radiologist, would you prefer a picture that:

  • a) shows you the normal anatomy clearly?
  • b) shows you disease clearly (if it is present)?

This example image doesn’t contain any disease, so you might think it can’t tell us anything about whether this sort of AI can cope with disease features. On the contrary, this picture shows us pretty convincingly that it will probably fail to generate the features of disease, particularly for small volume, rare diseases.

This, like most things in AI, is all about the training data.

These models work, at a base level, by filling in the blanks. You have an area where the pixel values are corrupted by noise? The job of the model is to pick “realistic” pixel values to replace the ones already there. To understand this, let’s look at this example at the scalp on the forehead.

Here we have the noisy version on the left, and we see there are some diagonal stripes. In particular, there are two bright stripes. In the middle image (the diagnostic scan) we can see that the top-left stripe is less bright than the one below it, but on the noisy image they don’t look as different.

So how come the AI image on the right gets the correct brightness for this layer?

This AI model would have been trained on thousands of MRI head images. For every single example it has ever seen, there is a dull layer and a bright layer under it. It just fills it in with what is consistently there.

What about for visual features that aren’t always there though? Let’s look at the back of the neck.

One thing you notice in the original scan (middle) is the black dot. This is a blood vessel (imagine looking at a tube end-on).

But this blood vessel doesn’t seem to be there on the AI version (right). Why?

Well, blood vessels are rare and highly variable in position. They never occur in the same place, apart from the biggest ones (and even then they vary more than almost all other anatomy). So when an AI model looks at the noisy image (left) which doesn’t give a strong indication a blood vessel was there, what will it do?

Will it fill in normal fat, which is seen in this area in 99.99999…% of people, or will it randomly decide to put a blood vessel there?

We would say that this image feature (a blood vessel dot in that exact location) is outside of the training distribution. It has never (or rarely) been seen before, and as such, the model has no reason to change the pixel values to be “blood vessel pixels”.

It is likely that any AI super-resolution algorithms will create images of people without any superficial blood vessels. While this is clearly unnatural (these people would die quite quickly if they existed), it isn’t really a problem. Blood vessels, like most anatomy, are almost never relevant in radiology practice.

But this hints at an underlying flaw. Things that are small, and rare, won’t be reproduced accurately.

And unfortunately, that pretty much describes most disease.
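
A toy numerical sketch makes the point concrete. Assume, purely for illustration, a model that fills in each pixel with its most plausible (i.e. average) training value:

```python
import numpy as np

rng = np.random.default_rng(0)

# 10,000 training "scans": a flat 1-D profile, except that 1% of them
# carry a small bright spot (a rare lesion) at position 50
n, length = 10_000, 100
train = np.zeros((n, length))
has_lesion = rng.random(n) < 0.01
train[has_lesion, 50] = 1.0

# A model that fills each pixel with its most plausible (mean) training
# value reproduces the common anatomy perfectly, and all but erases the
# rare feature: fill_in[50] comes out near 0.01, ~99% attenuated.
fill_in = train.mean(axis=0)
```

Real denoisers are far more sophisticated than a pixel-wise average, but the same pressure applies: when the noisy input gives only a weak hint that a rare feature is present, the training distribution pulls the output towards "normal".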

Preventing the worst case scenario

I won’t go into too much detail here, since I hope it is obvious that people with any given disease are much less common than people without that disease. It is also true that many diseases (but not all) have features that vary in location and appearance, or are very small.

The “swallow-tail sign” in Parkinson’s disease. For reference, this structure is about 1-2cm in size, the gap between the swallow’s tail feathers (the dark bands on the left image) is maybe a millimetre.

So what is the risk of these super-resolution systems in practice?

Well, these AIs could straight up delete the visual features of diseases, while making images that look like they are diagnostic quality. This is really the worst possible outcome, since human assessment of image quality is a safety net in current practice. Destroying our ability to know when an image is flawed isn’t a good idea.

So far, I’ve sounded pretty negative about this technology. I’m not, at least not entirely. I think it could work, and I definitely hope that it does. The promise is so huge in terms of cost savings and expanded access that it is worth exploring, at least.

But I am concerned about safety. To achieve the promise of this technology, we need to mitigate the risks.

The only real answer is to do testing to make sure these models are safe. Ideally you would run a clinical trial and show that patients have similar outcomes with AI generated images or normal diagnostic quality ones, but as I said in a recent piece we are unlikely to see pre-marketing clinical trials any time soon.

Instead, we will need to focus on the major risk: not “are the images visually similar?” but “are the images diagnostic?”.

Unlike the earlier images, these black dots are actually signs of a disease (amyloid angiopathy). Do we really think the AI will be better with these dots than it was with blood vessels?

I don’t want to pick on the company with the FDA approval, but they are the only group I can talk about (as I have said before for other first mover developers, it isn’t their fault that this technology is new and untested).

As far as I can tell (acknowledging that 510(k) decision summaries are often incomplete), the FDA only required Subtle Medical to show that their system produces decent looking images. The summary is here (pdf link), I’ve included an excerpt below:

The main performance study, utilizing retrospective clinical data, was divided into two tests.

For the noise reduction performance test, acceptance criteria were that (i) the signal-to-noise ratio (SNR) of a selected region of interest (ROI) in each test dataset is on average improved by greater than or equal to 5% after SubtleMR enhancement compared to the original images, and (ii) the visibility of small structures in the test datasets before and after SubtleMR differs on average by less than or equal to 0.5 Likert scale points. This test passed.

For the sharpness increase performance test, acceptance criteria were that the thickness of anatomic structure and the sharpness of structure boundaries are improved after SubtleMR enhancement in at least 90% of the test datasets. This test passed.
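
For context, an ROI-based SNR comparison of that sort can be computed as follows. This is a generic sketch on synthetic images; the actual ROIs, formula, and data used in the 510(k) study are not public:

```python
import numpy as np

def roi_snr(image, roi, background):
    """SNR as mean signal in a region of interest divided by the
    standard deviation of a signal-free background region."""
    return image[roi].mean() / image[background].std()

def snr_improvement_pct(before, after, roi, background):
    """Percent SNR change after enhancement (the summary's threshold is >= 5%)."""
    return 100.0 * (roi_snr(after, roi, background)
                    / roi_snr(before, roi, background) - 1.0)

# Toy images: identical signal, half the background noise after "enhancement"
rng = np.random.default_rng(0)
roi = (slice(10, 20), slice(10, 20))   # bright "anatomy" patch
bg = (slice(0, 8), slice(None))        # signal-free background rows
signal = np.zeros((32, 32))
signal[roi] = 1.0
before = signal + rng.normal(0, 0.10, signal.shape)
after = signal + rng.normal(0, 0.05, signal.shape)
```

Note what this metric does and does not capture: it rewards smoother backgrounds, but says nothing about whether a small disease feature inside the ROI survived the processing.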

So they showed that small structures look fairly similar, but they make no mention of diagnostic accuracy. At minimum, in my opinion, you should need to show that doctors can diagnose disease with these images. Of course, that is a horribly high bar, since there are thousands of diseases that can occur on a brain MRI^. But you need a high bar. Even if the model is fine for most diseases, you only need one disease that is not visualised 5% of the time and patients are at risk.

Now, to be fair to Subtle Medical and the FDA, the SubtleMR product is not licensed for use with non-diagnostic input images. The summary specifically states “The device’s inputs are standard of care MRI images. The outputs are images with enhanced image quality.” In this setting, the radiologist will always be able to double check the original image (although assuming they will do so is a bit silly).

But this is not the goal of these systems; the goal is to use non-diagnostic images to produce high-quality ones. The alternative (making diagnostic images look better) is a questionable value proposition at best.

I note that the earlier product (SubtlePET) does not dismiss this goal – in fact they say on their website that the AI model allows for 4x faster PET imaging. I also note that the summary for that FDA approval does not mention clinical testing either.

In my opinion, we need urgent independent clinical testing of the core medical super-resolution technology, probably by an academic centre or group of centres. Whether this involves testing the diagnosability of all diseases or just a high risk subset^, until some form of clinical assessment has been done we can’t know if super-resolution is safe.

I truly do want this technology to work. Until it is appropriately tested, however, I will be sceptical. The risk is much higher than is immediately obvious for these “simple denoising algorithms”*.

* I might get some folks a bit upset by calling this super-resolution throughout the piece, as it is often called denoising in this context. I honestly think calling these systems “denoisers” is a bit cheeky. It sounds like all they are doing is removing noise from an otherwise normal image. As we have seen, this isn’t the case at all. In fact, a more appropriate term used in machine learning circles is “AI hallucination”. I doubt anyone will start marketing a “medical image hallucinator” anytime soon though.

^ yet another situation where having expert defined “high risk image features” would be worthwhile.

Luke Oakden-Rayner is a radiologist (medical specialist) in South Australia, undertaking a Ph.D in Medicine with the School of Public Health at the University of Adelaide. This post originally appeared on his blog here.

ACCESS Act Points the Way to a Post-HIPAA World

The Oct. 22 announcement starts with: “U.S. Sens. Mark R. Warner (D-VA), Josh Hawley (R-MO) and Richard Blumenthal (D-CT) will introduce the Augmenting Compatibility and Competition by Enabling Service Switching (ACCESS) Act, bipartisan legislation that will encourage market-based competition to dominant social media platforms by requiring the largest companies to make user data portable – and their services interoperable – with other platforms, and to allow users to designate a trusted third-party service to manage their privacy and account settings, if they so choose.”

Although the scope of this bill is limited to the largest of the data brokers (messaging, multimedia sharing, and social networking) that currently mediate between us as individuals, it contains groundbreaking provisions for delegation by users that is a road map to privacy regulations in general for the 21st Century.

The bill’s Section 5: Delegation describes a new right for us as data subjects at the mercy of the institutions we are effectively forced to use. This is the right to choose and delegate authority to a third-party agent that can manage interactions with the institutions on our behalf. The third-party agent can be anyone we choose subject to their registration with the Federal Trade Commission. This right to digital representation by an entity of our choice with access to the full range of our direct control capabilities is unprecedented, as far as I know.

The problem with HIPAA, and with Europe’s General Data Protection Regulation (GDPR), is a lack of agency for the individual data subject. These regulatory approaches presume that all of the technology is controlled by our service providers and none of it is controlled by us as data subjects. There are major limitations to this approach.

First, it depends on regulation and bureaucracy around data uses (“notice and consent”) which typically lag the torrid pace of tech and business innovation. The alternative of mandating the technical ability to delegate, per this bill, reduces the scope of necessary regulation while still allowing the service providers to innovate.

Second, a right to delegate control gives the data subject a lot more market power in highly concentrated markets like communications or hospital networks where effective and differentiated competition is scarce. A patient, for example, will have a choice among hundreds of digital representatives even when that patient is in a market served by only one or two hospital networks. These digital representatives will compete on a national scale even as our provider choices are limited by geography or employment.

Third, the advent of patient-controlled technology enabled by mandated delegation means that machine learning, artificial intelligence, and expertise in general can now move closer to the patient. For example, patient groups that share a serious disease can organize as a cooperative to make the best use of their health records and hire expert physicians and engineers to design and operate the delegate.

Fourth, the right to specify a delegate means that, for the first time, our service providers will have to come to us. Under the current practice, patients are forced to navigate different user interfaces, portal designs, privacy statements, and associated dark patterns designed to manipulate us in different ways by each of our service providers. We are forced to figure out the idiosyncrasies of every service provider afresh. A right to delegation means that patients will have a consistent user interface and a more consistent user experience across our service providers even if the delegate is relatively dumb in the expert systems sense. 

Anyone who has sought the services of an attorney or a direct primary care physician understands the value of an expert fiduciary that is more-or-less substitutable if they fail to satisfy. These learned intermediaries are understood as essential when we face asymmetries of power relative to a court or hospital. The ACCESS Bill is a breakthrough because it extends our right to choose a delegate to the digital institutions that are now deeply embedded in our lives.

Adrian Gropper, MD, is the CTO of Patient Privacy Rights, a national organization representing 10.3 million patients and among the foremost open data advocates in the country. This post first appeared in Bill of Health here.

YTH Live 2020

There are many public health conferences that focus on young people, or that center around youth issues, but very few that actually include the young people’s voices that we claim to uplift as public health professionals.

There are also very few conferences that emphasize innovation in healthcare, that point toward solutions rather than discussing problems at length without clear ways of solving them.

These core issues are at the heart of the annual YTH Live conference. Each year (we’re on our twelfth!), we showcase the boldest technologies in health and cutting-edge research in all facets of youth health and wellness. We also have attendees that range from IT professionals to high school students, with over 25% of last year’s attendees and speakers being young people themselves.

YTH’s Communications Coordinator Erin McKelle has first-hand experience of this. “I first attended YTH Live when I was a senior in high school. It was the first conference I ever spoke at, and all of my fears about being the only young person in the room were quickly put to rest once I saw that YTH plans a youth conference that actually centers around youth voices,” she says. “I’m proud to now be working for the organization years later, after serving on the Youth Advisory Board, paying the mission of youth empowerment forward to the next generation of youth leaders.”

YTH Live 2020 will focus on the overall health and wellness of youth in the US and around the world, seeking to understand how innovative technology can be leveraged to improve health outcomes for all and promote health equity. Whether you are an Executive Director, developer, or youth advocate, YTH Live can help you learn about the latest trends in health, innovation, and technology, as well as facilitate connections: we host a networking event each year for our 450+ attendees to find new partnerships, make new contacts, and share their work with like-minded professionals in a more focused setting.

If you have an innovative health technology you’d like to share, interesting research, or a project you’d like to signal-boost, we invite you to submit a proposal in our open call for abstracts for next year’s conference. As for tips on what to submit, McKelle also offers words of wisdom. “We really look for ideas that are centered in innovation, pushing the envelope on what is normally done in health and wellness,” Erin advises. “Topically, the Program Committee will be evaluating proposals that hit all three areas of youth-centered design or impact, health, and technology,” she explains.

You have until Wednesday, November 6 at 11:59pm PST to submit your proposal, which will then be reviewed by our Program Committee. Learn more about YTH Live and submit your abstract by the 6th! We look forward to seeing your ideas at YTH Live 2020.

Erin McKelle serves as Communications Coordinator at ETR for the YTH initiative.

Health in 2 Point 00, Episode 99 | (Reverse) Takeover Edition with Bayer G4A

Today on Health in 2 Point 00… hold on, where’s Jess? On Episode 99, I do a reverse takeover with Priyanka Kashyap and Sophie Park at Bayer’s office in Berlin. Priyanka tells us about what Bayer G4A is doing these days with the 5 startups in their Advance Track: Blackford Analysis in radiology; Carepay and RelianceHMO improving affordability and access for patients in Africa; NeuroTracker, which is in the neuro space but is working with the oncology team at Bayer; and Prevencio, a diagnostic solution in the cardiovascular space. Sophie also gives us a rundown of the 6 startups in the Growth Track at G4A: Wellthy, a digital therapeutics company out of India; Litesprite, for mental health; BioLum, a pulmonology startup working on detecting nitric oxide levels in the blood; Upside Health with its chronic pain management software; and finally Visotec and Okko Health in ophthalmology. —Matthew Holt

Another MCQ Test on the USMLE

One of the most fun things about the USMLE pass/fail debate is that it’s accessible to everyone. Some controversies in medicine are discussed only by the initiated few – but if we’re talking USMLE, everyone can participate.

Simultaneously, one of the most frustrating things about the USMLE pass/fail debate is that everyone’s an expert. See, everyone in medicine has experience with the exam, and on the basis of that, we all think that we know everything there is to know about it.

Unfortunately, there’s a lot of misinformation out there – especially when we’re talking about Step 1 score interpretation. In fact, some of the loudest voices in this debate are the most likely to repeat misconceptions and outright untruths.

Hey, I’m not pointing fingers. Six months ago, I thought I knew all that I needed to know about the USMLE, too – just because I’d taken the exams in the past.

But I’ve learned a lot about the USMLE since then, and in the interest of helping you interpret Step 1 scores in an evidence-based manner, I’d like to share some of that with you here.


If you think I’m just going to freely give up this information, you’re sorely mistaken. Just as I’ve done in the past, I’m going to make you work for it, one USMLE-style multiple choice question at a time.

Question 1

A 25-year-old medical student takes USMLE Step 1. She scores a 240, and fears that this score will be insufficient to match at her preferred residency program. Because examinees who pass the test are not allowed to retake the examination, she constructs a time machine; travels back in time; and retakes Step 1 without any additional study or preparation.

Which of the following represents the 95% confidence interval for the examinee’s repeat score, assuming the repeat test has different questions but covers similar content?

A) 239-241

B) 237-243

C) 234-246

D) 228-252


The correct answer is D, 228-252.

No estimate is perfectly precise. But that’s what the USMLE (or any other test) gives us: a point estimate of the test-taker’s true knowledge.

So how precise is that estimate? That is, if we let an examinee take the test over and over, how closely would the scores cluster?

To answer that question, we need to know the standard error of measurement (SEM) for the test.

The SEM is a function of both the standard deviation and reliability of the test, and represents how much an individual examinee’s observed score might vary if he or she took the test repeatedly using different questions covering similar material.

So what’s the SEM for Step 1? According to the USMLE’s Score Interpretation Guidelines, the SEM for the USMLE is 6 points.

Around 68% of scores will fall within +/- 1 SEM, and around 95% will fall within +/- 2 SEM. Thus, if we accept the student’s original Step 1 score as our best estimate of her true knowledge, then we’d expect a repeat score to fall between 234 and 246 around two-thirds of the time. And 95% of the time, her score would fall between 228 and 252.
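(If you want to check that arithmetic yourself, here’s a minimal sketch in Python using the published SEM of 6. The function name is mine, not anything official from the NBME.)

```python
# Approximate confidence intervals for a USMLE Step 1 score, using the
# standard error of measurement (SEM) from the USMLE's published
# Score Interpretation Guidelines (SEM = 6 points).
def score_interval(score, sem=6, n_sem=2):
    """Return the (low, high) range of +/- n_sem SEMs around a score."""
    return (score - n_sem * sem, score + n_sem * sem)

print(score_interval(240, n_sem=1))  # ~68% interval: (234, 246)
print(score_interval(240, n_sem=2))  # ~95% interval: (228, 252)
```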

Think about that range for a moment.

The +/- 1 SEM range is 12 points; the +/- 2 SEM range is 24 points. Even if you believe that Step 1 tests meaningful information that is necessary for successful participation in a selective residency program, how many people are getting screened out of those programs by random chance alone?

(To their credit, the NBME began reporting a confidence interval to examinees with the 2019 update to the USMLE score report.)

Learning Objective: Step 1 scores are not perfectly precise measures of knowledge – and that imprecision should be considered when interpreting their values.


Question 2

A 46-year-old program director seeks to recruit only residents of the highest caliber for a selective residency training program. To accomplish this, he reviews the USMLE Step 1 scores of three pairs of applicants, shown below.

  1. 230 vs. 235
  2. 232 vs. 242
  3. 234 vs. 249

For how many of these candidate pairs can the program director conclude that there is a statistical difference in knowledge between the applicants?

A) Pairs 1, 2, and 3

B) Pairs 2 and 3

C) Pair 3 only

D) None of the above

The correct answer is D, none of the above.

As we learned in Question 1, Step 1 scores are not perfectly precise. In a mathematical sense, an individual’s Step 1 score on a given day represents just one sampling from the distribution centered around their true mean score (if the test were taken repeatedly).

So how far apart do two individual samples have to be for us to confidently conclude that they came from distributions with different means? In other words, how far apart do two candidates’ Step 1 scores have to be for us to know that there is really a significant difference between the knowledge of each?

We can answer this by using the standard error of difference (SED). When the two samples are >/= 2 SED apart, then we can be confident that there is a statistical difference between those samples.

So what’s the SED for Step 1? Again, according to the USMLE’s statisticians, it’s 8 points.

That means that, for us to have 95% confidence that two candidates really have a difference in knowledge, their Step 1 scores must be 16 or more points apart.
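(For the skeptics, the 2-SED rule is easy to sketch in code. The pairs below are the ones from the vignette; the helper function is just illustrative.)

```python
# Check whether two Step 1 scores differ by at least 2 standard errors
# of difference (SED = 8, per the USMLE's statisticians), i.e. by 16+
# points -- the threshold for 95% confidence in a real difference.
def statistically_different(score_a, score_b, sed=8):
    return abs(score_a - score_b) >= 2 * sed

pairs = [(230, 235), (232, 242), (234, 249)]
print([statistically_different(a, b) for a, b in pairs])
# [False, False, False] -- none of the pairs differ by 16+ points
```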

Now, is that how you hear people talking about Step 1 scores in real life? I don’t think so. I frequently hear people discussing how a 5-10 point difference in scores is a major difference that totally determines success or failure within a program or specialty.

And you know what? Mathematics aside, they’re not wrong. Because when programs use rigid cutoffs for screening, only the point estimate matters – not the confidence interval. If your dream program has a cutoff score of 235, and you show up with a 220 or a 225, your score might not be statistically different – but your dream is over.

Learning Objective: To confidently conclude that two students’ Step 1 scores really reflect a difference in knowledge, they must be >/= 16 points apart.


Question 3

A physician took USMLE Step 1 in 1994, and passed with a score of 225. Now he serves as program director for a selective residency program, where he routinely screens out applicants with scores lower than 230. When asked about his own Step 1 score, he explains that today’s USMLE scores are “inflated” relative to those of 25 years ago, and that if he took the test today, his score would be much higher.

Assuming that neither the test’s content nor the physician’s knowledge had changed since 1994, which of the following is the most likely score the physician would attain if he took Step 1 in 2019?

A) 205

B) 225

C) 245

D) 265

The correct answer is B, 225.


I hear this kind of claim all the time on Twitter. So once and for all, let’s separate fact from fiction.

FACT: Step 1 scores for U.S. medical students are rising.

See the graphic below.

FICTION: The rise in scores reflects a change in the test or the way it’s scored.

See, the USMLE has never undergone a “recentering” like the old SAT did. Students score higher on Step 1 today than they did 25 years ago because students today answer more questions correctly than those 25 years ago.

Why? Because Step 1 scores matter more now than they used to. Accordingly, students spend more time in dedicated test prep (using more efficient study resources) than they did back in the day. The net result? The bell curve of Step 1 scores shifts a little farther to the right each year.

Just how far the distribution has already shifted is impressive.

When the USMLE began in the early 1990s, a score of 200 was a perfectly respectable score. Matter of fact, it put you exactly at the mean for U.S. medical students.

Know what a score of 200 gets you today?

A score in the 9th percentile, and screened out of almost any residency program that uses cut scores. (And nearly two-thirds of all programs do.)

So the program director in the vignette above did pretty well for himself by scoring a 225 twenty-five years ago. A score that high (1.25 standard deviations above the mean) would have placed him around the 90th percentile for U.S. students. To hit the same percentile today, he’d need to drop a 255.
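(If you’d like to verify that percentile claim, here’s a quick sketch using Python’s standard library. I’m assuming scores were roughly normally distributed with a mean of 200 and, as the 1.25-standard-deviation figure implies, a standard deviation of about 20; those distribution parameters are my inference, not published values.)

```python
from statistics import NormalDist

# Rough percentile of a 225 in the early-1990s score distribution,
# assuming mean ~200 and SD ~20 (implied by the 1.25-SD figure above).
early_90s = NormalDist(mu=200, sigma=20)
pct = early_90s.cdf(225)   # fraction of examinees scoring below 225
print(round(pct * 100))    # ~89, i.e. roughly the 90th percentile
```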

Now, can you make the argument that the type of student who scored in the 90th percentile in the past would score in the 90th percentile today? Sure. He might – but not without devoting a lot more time to test prep.

As I’ve discussed in the past, this is one of my biggest concerns with Step 1 Mania. Students are trapped in an arms race with no logical end, competing to distinguish themselves on the metric we’ve told them matters. They spend more and more time learning basic science that’s less and less clinically relevant, all at the expense (if not outright exclusion) of material that might actually benefit them in their future careers.

(If you’re not concerned about the rising temperature in the Step 1 frog pot, just sit tight for a few years. The mean Step 1 score is rising at around 0.9 points per year. Just come on back in a while once things get hot enough for you.)

Learning Objective: Step 1 scores are rising – not because of a change in test scoring, but because of honest-to-God higher performance.


Question 4

Two medical students take USMLE Step 1. One scores a 220 and is screened out of his preferred residency program. The other scores a 250 and is invited for an interview.

Which of the following represents the most likely absolute difference in correctly-answered test items for this pair of examinees?

A) 5

B) 30

C) 60

D) 110


The correct answer is B, 30.

How many questions do you have to answer correctly to pass USMLE Step 1? What percentage do you have to get right to score a 250, or a 270? We don’t know.

See, the NBME does not disclose how it arrives at a three digit score. And I don’t have any inside information on this subject. But we can use logic and common sense to shed some light on the general processes and data involved and arrive at a pretty good guess.

First, we need to briefly review how the minimum passing score for the USMLE is set, using a modified Angoff procedure.

The Angoff procedure involves presenting items on the test to subject matter experts (SMEs). The SMEs review each question item and predict what percentage of minimally competent examinees would answer the question correctly.

Here’s an example of what Angoff data look like (the slide is from a recent lecture).

As you can see, Judge A suspected that 59% of minimally competent candidates – the bare minimum we could tolerate being gainfully engaged in the practice of medicine – would answer Item 1 correctly. Judge B thought 52% of the same group would get it right, and so on.

Now, here’s the thing about the version of the Angoff procedure used to set the USMLE’s passing standard. Judges don’t just blurt out a guess off the top of their head and call it a day. They get to review data regarding real-life examinee performance, and are permitted to use that to adjust their initial probabilities.

Here’s an example of the performance data that USMLE subject matter experts receive. This graphic shows that test-takers who were in the bottom 10% of overall USMLE scores answered a particular item correctly 63% of the time.

(As a sidenote, when judges are shown data on actual examinee performance, their predictions shift toward the data they’ve been shown. In theory, that’s a good thing. But multiple studies – including one done by the NBME – show that judges change their original probabilities even when they’re given totally fictitious data on examinee performance.)

For the moment, let’s accept the modified Angoff procedure as being valid. Because if we do, it gives us the number we need to set the minimum passing score. All we have to do is calculate the mean of all the probabilities assigned for that group of items by the subject matter experts.

In the slide above, the mean probability that a minimally competent examinee would correctly answer these 10 items was 0.653 (red box). In other words, if you took this 10 question test, you’d need to score better than 65% (i.e., 7 items correct) to pass.
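(Here’s the same calculation in code. The judge probabilities below are illustrative stand-ins chosen to reproduce the slide’s mean of 0.653, not the actual values from the lecture.)

```python
# Modified Angoff: the passing standard is simply the mean of the
# judges' item-level probability estimates. These values are
# illustrative, constructed to average to the 0.653 in the slide.
judge_probs = [0.59, 0.52, 0.70, 0.65, 0.61, 0.68, 0.72, 0.66, 0.64, 0.76]
passing_fraction = sum(judge_probs) / len(judge_probs)
print(passing_fraction)  # ~0.653 -> need better than 65% to pass
```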

And if we wanted to assign scores to examinees who performed better than the passing standard, we could. But, we’ll only have 3 questions with which to do it, since we used 7 of the 10 questions to define the minimally competent candidate.

So how many questions do we have to assign scores to examinees who pass USMLE Step 1?

Well, Step 1 includes 7 sections with up to 40 questions in each, so there are at most 280 questions on the exam.

However, around 10% of these are “experimental” items. These questions do not count toward the examinee’s score – they’re on the test to generate performance data (like Figure 1 above) to present in the future to subject matter experts. Once these items have been “Angoffed”, they will become scored items on future Step 1 tests, and a new wave of experimental items will be introduced.

If we take away the 10% of items that are experimental, then we have at most 252 questions to score.

How many of these questions must be answered correctly to pass? Here, we have to use common sense to make a ballpark estimate.

After all, a candidate with no medical knowledge who just guessed answers at random might get 25% of the questions correct. Intuitively, it seems like the lower bound of knowledge to be licensed as a physician has to be north of 50% of items, right?

At the same time, we know that the USMLE doesn’t include very many creampuff questions that everyone gets right. Those questions provide no discriminatory value. Actually, I’d wager that most Step 1 questions have performance data that looks very similar to Figure 1 above (which was taken from an NBME paper).

A question like the one shown – which 82% of examinees answered correctly – has a nice spread of performance across the deciles of exam performance, ranging from 63% among low performers to 95% of high performers. That’s a question with useful discrimination for an exam like the USMLE.

Still, anyone who’s taken Step 1 knows that some questions will be much harder, and that fewer than 82% of examinees will answer correctly. If we conservatively assume that there are only a few of these “hard questions” on the exam, then we might estimate that the average Step 1 taker is probably getting around ~75% of questions right. (It’s hard to make a convincing argument that the average examinee could possibly be scoring much higher. And in fact, one of the few studies that mentions this issue actually reports that the mean item difficulty was 76%.)

The minimum passing standard has to be lower than the average performance – so let’s ballpark that to be around 65%. (Bear in mind, this is just an estimate – and I think, a reasonably conservative one. But you can run the calculations with lower or higher percentages if you want. The final numbers I show below won’t be that much different than yours unless you use numbers that are implausible.)

Everyone still with me? Great.

Now, if a minimally competent examinee has to answer 65% of questions correctly to pass, then we have only 35% of the ~252 scorable questions available to assign scores among all of the examinees with more than minimal competence.

In other words, we’re left with ~85 questions to help us assign scores in the passing range.

The current minimum passing score for Step 1 is 194. And while the maximum score is 300 in theory, the real world distribution goes up to around 275.

Think about that. We have ~85 questions to determine scores over around an 81 point range. That’s approximately one point per question.
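(Here’s the whole back-of-the-envelope calculation in one place. Every number is an estimate from the discussion above; depending on how you round, you get roughly 85-90 questions, and the conclusion is the same either way.)

```python
# Ballpark the "points per question" in the passing range, using the
# estimates from the text. All of these inputs are approximations.
total_items = 280                       # 7 sections x 40 questions
scorable = round(total_items * 0.9)     # ~10% are unscored experimental items
passing_fraction = 0.65                 # estimated minimum passing performance
items_above_pass = round(scorable * (1 - passing_fraction))
score_range = 275 - 194                 # realistic top score minus passing score

print(scorable)                         # 252 scorable questions
print(items_above_pass)                 # ~88 questions left for the passing range
print(score_range / items_above_pass)   # ~0.9 points per question
```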

Folks, this is what drives #Step1Mania.

Note, however, that the majority of Step 1 scores for U.S./Canadian students fall across a thirty point range from 220 to 250.

That means that, despite the power we give to USMLE Step 1 in residency selection, the absolute performance of most applicants is similar. In terms of raw number of questions answered, most U.S. medical students differ by fewer than 30 correctly-answered multiple choice questions. That’s around 10% of a seven hour, 280 question test administered on a single day.

And what important topics might those 30 questions test? Well, I’ve discussed that in the past.

Learning Objective: In terms of raw performance, most U.S. medical students likely differ by 30 or fewer correctly-answered questions on USMLE Step 1 (~10% of a 280 question test).


Question 5

A U.S. medical student takes USMLE Step 1. Her score is 191. Because the passing score is 194, she cannot seek licensure.

Which of the following reflects the probability that this examinee will pass the test if she takes it again?

A) 0%

B) 32%

C) 64%

D) 96%

The correct answer is C, 64%.

In 2016, 96% of first-time test takers from U.S. allopathic medical schools passed Step 1. For those who repeated the test, the pass rate was 64%. What that means is that >98% of U.S. allopathic medical students ultimately pass the exam.
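(That >98% figure follows directly from combining the two pass rates; a two-attempt sketch:)

```python
# Combine a 96% first-time pass rate with a 64% repeat pass rate:
# everyone who passes the first time, plus the fraction of the
# first-time failers who pass on the retake.
first_time = 0.96
repeat = 0.64
after_two_attempts = first_time + (1 - first_time) * repeat
print(round(after_two_attempts, 4))  # 0.9856 -> >98% pass within two attempts
```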

I bring this up to highlight again how the Step 1 score is an estimate of knowledge at a specific point in time. And yet, we often treat Step 1 scores as if they are an immutable personality characteristic – a medical IQ, stamped on our foreheads for posterity.

But medical knowledge changes over time. I took Step 1 in 2005. If I took the test today, I would absolutely score lower than I did back then. I might even fail the test altogether.

But here’s the thing: which version of me would you want caring for your child? The 2005 version or the 2019 version?

The more I’ve thought about it, the stranger it seems that we even use this test for licensure (let alone residency selection). After all, if our goal is to evaluate competency for medical practice, shouldn’t a doctor in practice be able to pass the exam? I mean, if we gave a test of basketball competency to an NBA veteran, wouldn’t he do better than a player just starting his career? If we gave a test of musical competency to a concert pianist with a decade of professional experience, shouldn’t she score higher than a novice?

If we accept that the facts tested on Step 1 are essential for the safe and effective practice of medicine, is there really a practical difference between an examinee who doesn’t know these facts initially and one who knew them once but forgets them over time? If the exam truly tests competency, aren’t both of these examinees equally incompetent?

We have made the Step 1 score into the biggest false god in medical education.

By itself, Step 1 is neither good nor bad. It’s just a multiple choice test of medically-oriented basic science facts. It measures something – and if we appropriately interpret the measurement in context with the test’s content and limitations, it may provide some useful information, just like any other test might.

It’s our idolatry of the test that is harmful. We pretend that the test measures things that it doesn’t – because it makes life easier to do so. After all, it’s hard to thin a giant pile of residency applications with nuance and confidence intervals. An applicant with a 235 may be no better (or even, no different) than an applicant with a 230 – but by God, a 235 is higher.

It’s well beyond time to critically appraise this kind of idol worship. Whether you support a pass/fail Step 1 or not, let’s at least commit to sensible use of psychometric instruments.

Learning Objective: A Step 1 score is a measurement of knowledge at a specific point in time. But knowledge changes over time.


Score Report

So how’d you do?

I realize that some readers may support a pass/fail Step 1, while others may want to maintain a scored test. So to be sure everyone receives results of this test in their preferred format, I made a score report for both groups.



Just like the real test, each question above is worth 1 point. And while some of you may say it’s non-evidence based, this is my test, and I say that one point differences in performance allow me to make broad and sweeping categorizations about you.


But thanks for playing. Good luck in the SOAP!


Nice job. You’ve got what it takes to be licensed. (Or at least, you did on a particular day.)


Sure, the content of these questions may have essentially nothing to do with your chosen discipline, but your solid performance got your foot in the door. Good work.


You’re not just a high scorer – you’re a hero and a legend.


Wow! You’re a USMLE expert. You should celebrate your outstanding performance with some $45 tequila shots while dancing at eye-level with the city skyline.




You regard USMLE Step 1 scores with a kind of magical thinking. They are not simply a one-time point estimate of basic science knowledge, or a tool that can somewhat usefully be applied to thin a pile of residency applications. Nay, they are a robust and reproducible glimpse into the very being of a physician, a perfectly predictive vocational aptitude test that is beyond reproach or criticism.


You realize that, whatever Step 1 measures, it is rather imprecise in measuring that thing. You further appreciate that, when Step 1 scores are used for whatever purpose, there are certain practical and theoretical limitations on their utility. You understand – in real terms – what a Step 1 score really means.

(I only hope that the pass rate for this exam is as high as the real Step 1 pass rate.)

Dr. Carmody is a pediatric nephrologist and medical educator at Eastern Virginia Medical School. This post originally appeared on The Sheriff of Sodium here.

Climate Change is not an ‘Equal Opportunity’ Crisis

Sam Aptekar
Phuoc Le


In the last fifteen years, we have witnessed dozens of natural disasters affecting our most vulnerable patients, from post-hurricane victims in Haiti to drought and famine refugees in Malawi. The vast majority of these patients suffered from acute on chronic disasters, culminating in life-threatening medical illnesses. Yet, during the course of providing clinical care and comfort, we rarely, if ever, pointed to climate change as the root cause of their conditions. The evidence for climate change is not new, but the movement for climate justice is now emerging on a large scale, and clinicians should play an active role.

Let’s be clear: there is no such thing as an “equal opportunity” disaster. Yes, climate change poses an existential threat to us all, but not on equal terms. When nature strikes, it has always been the poor and historically underserved who are most vulnerable to its wrath. Hurricane Katrina provides an example of how natural disasters target their victims along racial and socioeconomic lines even in the wealthiest nations. As one account of the storm notes, “A black homeowner in New Orleans was more than three times as likely to have been flooded as a white homeowner. That wasn’t due to bad luck; because of racially discriminatory housing practices, the high ground was taken by the time banks started loaning money to African Americans who wanted to buy a home.” Throughout the world, historically marginalized communities have been pushed to overcrowded, poorly built, and unsanitary neighborhoods where natural disasters inflict much greater harm.

Photo from video on Democracy Now! article: “New Orleans After Katrina: Inequality Soars as Poor Continue to Be Left Behind in City’s ‘Recovery”

The poor also tend to work more physically demanding jobs that become particularly dangerous with rising temperatures. Scientific American has reported that more than 20,000 workers in Central America and southern Mexico have died from chronic kidney disease caused by extreme temperatures and punishing working conditions. According to the World Health Organization, climate change is expected to cause 250,000 additional deaths per year from diarrhea, malnutrition, malaria, and heat stress.

Figure from May 2019 Somalia Humanitarian Bulletin (OCHA)

Moreover, resource-denied countries have the greatest economic reliance on agriculture, which is by far the industry most vulnerable to anthropogenic weather changes. Throughout the Horn of Africa, droughts have been recorded at historically intense levels (the 2016-17 rains in Somalia were the driest on record) and have destroyed the economic sustenance of millions of farmers. According to Oxfam, “The region was hit by an 18-month drought caused by El Niño and higher temperatures linked to climate change.” They estimate that 10.7 million people currently face severe hunger throughout Ethiopia, Kenya, Somalia, and Somaliland as their crops and cattle die. With resource-denied countries such as these relying so strongly on agriculture to keep their economies afloat, the World Bank has reported that climate change could push more than 100 million people into poverty by 2030. More than 23.3 million people are already in need of humanitarian aid in the Horn of Africa.

Volunteers in Freeport, Grand Bahama, Bahamas rescuing families during Hurricane Dorian (AP Photo/Ramon Espinosa)

Climate change has already made certain regions of the world uninhabitable and threatens the sociopolitical stability of numerous others. According to the Internal Displacement Monitoring Centre, there were 18.8 million climate-related displacements in 2017 alone. The Syrian Civil War, which left millions in search of a new home and catalyzed political instability throughout the region, counts climate change among its many contributing factors.

Unfortunately, these patterns show no signs of slowing down. Globally, the number of weather-related disasters has tripled since the 1960s. In September, Hurricane Dorian battered the Bahamas and left a humanitarian crisis in its wake; thousands are homeless, without food, water, and electricity as the islands remain flooded. This was just two years after Hurricane Maria destroyed thousands of homes in Puerto Rico and three years after Hurricane Matthew killed 49 people and caused $10.8 billion in damage in North Carolina. The United States Geological Survey reports, “With increasing global surface temperatures the possibility of more droughts and increased intensity of storms will occur.”

The classification of hurricanes, tornadoes, and droughts as “natural” disasters suggests their origins are separate from human behavior, that they exist purely in the realm of nature where man has no influence. But if we look at the destruction they have caused historically, we see that their effects are almost completely determined by human action, specifically the social, economic, and political policies that continue to leave some more vulnerable than others. While Silicon Valley dreams of future technological solutions to climate change, there are social policies that we, as healthcare professionals, can address right now.

Climate change is a public health emergency, and as guardians of the public’s health, it is our role as healthcare professionals to continuously stress the magnitude of the situation. We must assert with medical expertise that as “natural” disasters intensify and transform entire ecosystems, the poor and historically underserved have been, and will continue to be, the hardest hit. By providing honest, evidence-driven accounts of climate change and its health consequences, healthcare professionals can elevate the voices of millions who are left out of most contemporary climate movements and bring their stories to the fore as we continue to fight climate change together.

An internist, pediatrician, and Associate Professor at UCSF, Dr. Le is also the co-founder of two health equity organizations, the HEAL Initiative and Arc Health.

Sam Aptekar is a recent graduate of UC Berkeley and a current
content marketing and blogging affiliate for Arc Health Justice.

This post originally appeared on Arc Health here.

Leveraging Time by Doing Less in Each Chronic Care Visit



So many primary care patients have several multifaceted problems these days, and the more or less unspoken expectation is that we must touch on everything in every visit. I often do the opposite.

It’s not that I don’t pack a lot into each visit. I do, but I tend to go deep on one topic, instead of just a few minutes or maybe even moments each on weight, blood sugar, blood pressure, lipids, symptoms and health maintenance.

When patients are doing well, that broad overview is perhaps all that needs to be done, but when the overview reveals several problem areas, I don’t try to cover them all. I “chunk it down”, and I work with my patient to set priorities.

What non-clinicians don’t seem to think of is that primary care is relationship-based care, delivered over a continuum that may span many years or, if we are fortunate, decades.

Whether you are treating patients, coaching athletes, raising children or housebreaking puppies, the most effective way to bring about change is just about always incremental. We need to keep that in mind in our daily clinic work. Small steps, small successes create positive feedback loops, cement relationships and pave the way for bigger subsequent accomplishments.

Sometimes I avoid the biggest “problem” and work with patients to identify and improve a smaller, more manageable one just to create some positive momentum. That may seem like an inefficient use of time, but it can be a way of creating leverage for greater change in the next visit.

I actually think the healthcare culture has become counterintuitive and counterproductive in many ways; it helps me to focus intensely on the patient in front of me, forget my list of “shoulds” (target values, health maintenance reminders and all of that), and first lay the foundation for greater accomplishments with less effort in the long run.

Six months ago I wrote this about how I try to start each patient visit. And in my Christmas reflection seven years ago I wrote about the moment when a physician prepares to enter an exam room:

I have three fellow human beings to interact with and offer some sort of healing to in three very brief visits. Three times I pause at the doorway before entering my exam room, the space temporarily occupied by someone who has come for my assessment or advice. Three times I summarize to myself what I know before clearing my mind and opening myself up to what I may not know or understand with my intellect alone. Three times I quietly invoke the source of my calling.

It’s all about the patient, the flesh and blood one in front of you in that very moment and what he or she needs most from us today. In physics I learned that you get better leverage when your force is applied a greater distance from the fulcrum. In human relationships and in medicine it is the opposite; the closer you are, the greater leverage you achieve.

Hans Duvefelt is a Swedish-born rural Family Physician in Maine. This post originally appeared on his blog, A Country Doctor Writes, here.