Deficiencies in Clinical Natural Language Processing Software – A Review of 5 Systems

We have recently assessed the accuracy of clinical NLP software, available either as open-source projects or as commercial demonstration systems, at processing pathology reports. The attached whitepaper discusses the twenty-eight deficiencies we observed in our testing of five different systems.

Deficiencies in Current Practices of Clinical Natural Language Processing – White Paper

Our analysis is based on the need for industrial-strength language engineering, which must cope with a greater variety of real-world problems than research solutions typically encounter. In a research setting, users can tailor their data and pre-processing to answer a very specific investigation question, unlike real-world usage where there is little or no control over input data. As a simple example, in a language engineering application the data could be delivered in a standard messaging format, say HL7, that has to be processed no matter what vagaries it embodies. In a research project that data could be curated to overcome the uncertainties created by this delivery mechanism, for instance by removing the HL7 components before the CNLP processing is invoked – a fix not available in a standard clinical setting.
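To make the HL7 point concrete, here is a minimal sketch of the kind of curation step a research project can apply but a production pipeline cannot rely on: stripping HL7 v2 message framing to recover the free-text report before any NLP runs. The message below is invented for illustration; real feeds vary widely in segments, encoding characters, and escaping.

```python
# Minimal sketch: stripping HL7 v2 framing to recover free text before NLP.
# The message content below is invented for illustration.
hl7_message = (
    "MSH|^~\\&|LAB|HOSP|EHR|HOSP|202301010930||ORU^R01|123|P|2.3\r"
    "PID|1||987654||DOE^JANE\r"
    "OBX|1|TX|22636-5^Pathology report||Invasive ductal carcinoma, grade 2.\r"
    "OBX|2|TX|22636-5^Pathology report||Margins clear of tumour.\r"
)

def extract_report_text(message: str) -> str:
    """Collect the observation-value field (OBX-5) from each OBX segment."""
    lines = []
    for segment in message.split("\r"):
        fields = segment.split("|")
        if fields[0] == "OBX" and len(fields) > 5:
            lines.append(fields[5])
    return "\n".join(lines)

print(extract_report_text(hl7_message))
# Invasive ductal carcinoma, grade 2.
# Margins clear of tumour.
```

A production CNLP system has no such luxury: it must tolerate whatever framing, escaping, and segment variation the feed embodies.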

When an organisation intends to apply a CNLP system to its data, the topics discussed in this document need to be assessed for their potential impact on the desired outcomes.

The evaluations were based on two key principles:

  • There is a primary function to be performed by CNLP: Clinical Entity Recognition (CER).
  • There is one secondary function: Relationship Identification.

Any other clinical NLP processing will rely on one or both of these two functions. For the purposes of this discussion we exclude “text mining”, which uses a “bag-of-words” approach to language analysis and is woefully inadequate in a clinical setting.
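The distinction between the two functions can be illustrated with a toy example (not drawn from the whitepaper, and not representative of how any assessed system works): a tiny dictionary-based recogniser plays the CER role, and a trivial adjacency rule plays the relationship-identification role. The vocabularies are invented.

```python
import re

# Toy dictionary-based clinical entity recogniser (CER) plus a trivial
# relationship rule; purely illustrative, with an invented vocabulary.
FINDINGS = {"carcinoma", "necrosis"}
MODIFIERS = {"invasive", "no"}

def recognise(text: str):
    """Primary function: tag known findings and modifiers with their positions."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [(i, t, "finding" if t in FINDINGS else "modifier")
            for i, t in enumerate(tokens) if t in FINDINGS or t in MODIFIERS]

def relate(entities):
    """Secondary function: link each modifier to the next recognised finding."""
    pairs = []
    for i, (pos, tok, kind) in enumerate(entities):
        if kind == "modifier":
            for pos2, tok2, kind2 in entities[i + 1:]:
                if kind2 == "finding":
                    pairs.append((tok, tok2))
                    break
    return pairs

ents = recognise("Invasive carcinoma present; no necrosis seen.")
print(relate(ents))  # [('invasive', 'carcinoma'), ('no', 'necrosis')]
```

The second pair also shows why a bag-of-words treatment fails clinically: without the relationship, the negation in “no necrosis” is lost and the report appears to assert necrosis.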

Assessed Software: 

  • Amazon Comprehend Medical
  • Stanford NLP + Metathesaurus
  • OpenNLP + Metathesaurus
  • GATE + Metathesaurus
  • cTAKES

The systems exhibit the listed deficiencies to a greater or lesser extent; no single system has all of them. The deficiencies are grouped under the following five headings:

  • Deficiencies in Understanding Document Structure
  • Deficiencies in Tokenisation
  • Deficiencies in Grammatical Understanding
  • Deficiencies in the CER Algorithms
  • Deficiencies in Semantics and Interpreting Medical Terminology
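As a flavour of the tokenisation category, the sketch below (an invented example, not taken from the whitepaper) shows how a naive word-pattern tokeniser mishandles clinical shorthand: some compound codes survive, but receptor-status symbols and hyphenated terms are damaged in ways that mislead downstream entity recognition.

```python
import re

# Illustration of the tokenisation deficiency category: a naive
# word-pattern tokeniser mishandles clinical shorthand. Text is invented.
text = "Staged pT2N0M0; ER+ (90%), HER2-negative. Tamoxifen 20mg daily."

naive = re.findall(r"\w+", text)
print(naive)
# 'pT2N0M0' survives intact, but 'ER+' loses its '+', 'HER2-negative'
# splits in two, and '20mg' leaves dose and unit fused in one token.
```

Robust clinical tokenisation needs rules (or learned models) for exactly these patterns; general-purpose tokenisers rarely have them.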

3 thoughts on “Deficiencies in Clinical Natural Language Processing Software – A Review of 5 Systems”

  1. Hi Jon,

    Your whitepaper lists 28 specific deficiencies you found across the 5 different clinical NER systems you tested; however, it doesn’t mention *which* of these systems exhibited each deficiency, or to what extent. Would you consider publishing those results?

    That data would be rather valuable not only in evaluating which approach to use for a given project, but also to determine what additional preprocessing steps could be taken to circumvent some of those deficiencies.

    Also, can you speak a bit to what data you used to test these systems and identify those 28 deficiencies?


    1. Hi Adam, you make a very good point on the usefulness of being more specific. However, because we are a commercial organisation and some of the systems are from other vendors, it would be seen as poor form for us to directly criticise a potential competitor. Hence we decided to pool all results rather than single out individual software offerings.
      We tested some systems on a very small collection of cancer pathology reports which, to us, represented a range of more complex cases, as we believe accuracy on the most difficult material is the only meaningful assessment – in other words, it’s comparatively easy to get 70% accuracy for many things; higher is much more difficult.
      Our original aim was to get a picture of what could be done with other systems in comparison to our own technology. Instead we discovered that many systems had their own intrinsic deficiencies, so we switched our focus to discovering what we could find whilst giving ourselves a limited time to complete the investigation. Hence the study is not comprehensive across any particular data set, nor across all functionalities of the software; that is, potentially more deficiencies are waiting to be discovered.
      In terms of our own in-house software, we don’t have any of the explicit deficiencies (e.g. in tokenisation), but we are still challenged by the statistical and contextual issues for which there is no deterministic solution or reliable heuristic.

      1. Thanks for your reply, Jon! Given your organizational ties, avoiding directly criticizing potential competitors certainly makes sense.

        Still, with Comprehend Medical’s recent release, I would love to see an impartial comparison of these systems across a wider dataset, demonstrating the potential pitfalls of each approach (and whether those deficiencies are common, or only occur in the more complex cases you alluded to). Perhaps in the future another unaffiliated team can use your whitepaper as a jumping-off point for that analysis.
