Bland-Altman review of Pulse Oximeter

Last week we tested the new Pulse Oximeter app from digiDoc and showed a poor correlation when measured against our standard ICU/ED monitors.

The developers of the Oximeter app contacted the LITFL team and suggested we use an alternate method of analysis – Bland-Altman, rather than Pearson’s correlation…so I solicited the help of my medical statistics gurus, Lisa Woolfson and Bob Phillips, and we worked out another way of looking at the stats. 

Why is Bland–Altman better than a correlation coefficient?

Bland and Altman discuss in their original paper the reasons that a correlation isn’t suitable for testing a new diagnostic tool against an old one.

A correlation just tells you if there is a linear relationship between two tests, i.e. as the results get higher in one, they also get higher in the other. So if you plotted the old and new test results against each other, you’d get a straight line. This shows simple correlation, but it doesn’t show accuracy for clinical purposes.

For example, if the Oximeter app gave a HR of double what our standard monitors give, it could still have a high correlation. The table below shows an example of this (fabricated results).

[Table: fabricated results – the app reading exactly double the monitor's HR]

Running Pearson’s correlation on these results gives r=1 – a perfect positive correlation. But we can see at a glance that the app isn’t accurate. Pearson’s looks for a straight line; it doesn’t show whether the results are actually close.
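To see this in code, here is a minimal sketch using hypothetical numbers in the spirit of the fabricated table: an app that reads exactly double the monitor still scores a perfect Pearson r.

```python
import numpy as np

# Hypothetical readings: the app reports exactly double the monitor's HR
monitor_hr = np.array([60, 70, 80, 90, 100])
app_hr = 2 * monitor_hr  # 120, 140, 160, 180, 200

# Pearson's r only asks "do the points sit on a straight line?"
r = np.corrcoef(monitor_hr, app_hr)[0, 1]
print(round(r, 3))  # 1.0 -- perfect correlation, yet clinically useless
```

Any exact linear relationship gives r=1, no matter how far the two sets of numbers are apart.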

Will this help Oximeter results?

Even if the Pearson correlation were r=1, Bland-Altman might show that the new tool is actually not useful at all. This was what Bland and Altman demonstrated in their Lancet paper – Pearson’s coefficient is not stringent enough for assessing clinical accuracy.

Unfortunately, the Pearson’s correlation for the Oximeter app was low: 0.59 for HR and 0.37 for sats. The correlation is poor to begin with, so running a Bland-Altman analysis is unlikely to make things look any better.

What are the Bland-Altman results for the Oximeter app?

[Bland-Altman plots for HR and sats]

Erm, so what do these charts actually show? 

Well, the charts show the difference between the two measurements (app and monitor), plotted against the average of the pair. There are a few lines on these charts:

  • Solid blue line – best guess of the ‘average difference’ (the bias)
  • Dotty blue lines – the 95% CI of this best guess of the ‘average difference’
  • Red dotty lines – the 95% limits of agreement, i.e. the range within which 95% of the individual differences are expected to fall
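As a rough sketch of how those three sets of lines are computed (hypothetical paired readings, not the actual study data):

```python
import numpy as np

# Hypothetical paired HR readings -- illustrative values only
monitor = np.array([62.0, 75.0, 88.0, 95.0, 110.0, 124.0])
app = np.array([55.0, 70.0, 75.0, 80.0, 92.0, 100.0])

diff = app - monitor             # per-pair difference (y-axis)
mean_pair = (app + monitor) / 2  # per-pair average (x-axis)

bias = diff.mean()               # solid blue line
sd = diff.std(ddof=1)
n = len(diff)

# Dotty blue lines: 95% CI of the bias (normal approximation)
ci_bias = (bias - 1.96 * sd / np.sqrt(n), bias + 1.96 * sd / np.sqrt(n))

# Red dotty lines: 95% limits of agreement for individual differences
loa = (bias - 1.96 * sd, bias + 1.96 * sd)

print(round(bias, 2), [round(x, 2) for x in loa])
```

Note how the blue CI lines sit much closer to the bias than the red limits of agreement: an average is estimated far more precisely than any single difference.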

The difference between these two types of interval is quite confusing at first sight, but it makes more sense if you think of a different context.

Imagine instead a graph plotting a test result against (year based) school class.

The blue line is the average score on a test, rising as the children get older, and the dotty blue line is the 95% CI of this average. The dotty red line reflects the estimated variability in the scores of the individual children within each class.

Heart Rate:

  • Bias – on average the app reads about 10 bpm lower than our standard HR monitor (the blue line).
  • Imprecision – the individual differences are very large (the red dotty lines).
  • Changing bias – the app gets increasingly inaccurate as the heart rate gets higher, so the bias is not stable across the range of pulse rates, and the red dotty lines actually underestimate the real difference between the measurements.

Oxygen Saturations: 

  • Bias – the app reads on average 1% lower than our standard sats monitors.
  • Imprecision – an individual reading could be up to 4% higher or 6% lower (bias of −1, limits of ±5) – better than the HR.
  • Changing bias – unlike with HR, the bias doesn’t change as the sats get higher; the scatter in the points is much the same across the range.

The oximeter app is certainly better at measuring saturations than heart rate. Overall, though, it is biased, imprecise, and gets less reliable as the HR gets higher.

Where does Root Mean Square (RMS) come into this? 

The developers have been told that to get CE/FDA approval for the app as a medical device, its RMS error must be no more than 2. The RMS of the sats measurements is 2.12; the RMS of the HR is 12.791.
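RMS combines bias and scatter into a single number, so a device that is either biased or imprecise scores badly. A minimal sketch with hypothetical sats values (not the study data):

```python
import numpy as np

def rms_error(device, reference):
    """Root-mean-square of the differences: squaring first means
    positive and negative errors can't cancel each other out."""
    diff = np.asarray(device, dtype=float) - np.asarray(reference, dtype=float)
    return float(np.sqrt(np.mean(diff ** 2)))

# Hypothetical paired sats readings (%)
print(round(rms_error([95, 96, 93, 98], [96, 98, 95, 97]), 3))  # 1.581
```

With this toy data the RMS of 1.581 would scrape under a limit of 2; the real app’s sats RMS of 2.12 does not.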

So, whichever way you look at it, the app isn’t good enough for FDA approval.

What’s the conclusion?

  • The conclusion… is the same as last week. The app is a great idea, and would be fabulous if it worked, but it’s simply not good enough to be clinically useful.
  • I know the developers are working on improving this, so hopefully they will be able to bring it up to standard so that it can benefit us in clinical practice.
  • In fact, a new revision and update has been produced today.

References


  • Altman DG, Bland JM. (1983). Measurement in medicine: the analysis of method comparison studies. The Statistician 32, 307-317 [PDF]
  • Bland JM, Altman DG. (1986). Statistical methods for assessing agreement between two methods of clinical measurement. Lancet, i, 307-310 [PDF] [Online]
  • Bland JM. Applying the Right Statistics: Analyses of Measurement Studies [Online]
  • Bland JM, Altman DG. (1999) Measuring agreement in method comparison studies. Statistical Methods in Medical Research 8, 135-160 [Online]