AI Assessment in Clinical Applications

There is a considerable amount of talk about bias in AI as applied to clinical settings and how to eliminate it. The problem has now attracted the attention of standards bodies, with a view to regulating how AI systems should be validated and certified.

AI consists of three types of logic: inductive, deductive and abductive.

However, most modern references to AI really mean the inductive type, better known within the industry as Machine Learning (ML), which allocates data objects to classes based on the statistical distribution of their features.

Bias in machine learning cannot be eliminated, as it is intrinsic to the method. ML takes a sample of data objects whose feature (attribute) values and correct classes are known, and trains a predictive classification model on that training data. Hence, any biases in collecting the sample are built into the model.
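To make this concrete, here is a minimal sketch of the idea, using a nearest-centroid classifier on entirely hypothetical data: the model can only reflect whatever distribution the training sample happened to capture, so a biased sample yields a biased classifier.

```python
# Minimal sketch (hypothetical data): a nearest-centroid classifier.
# Predictions can only reflect the statistical distribution of the
# training sample, so any bias in collecting that sample is intrinsic
# to the resulting model.

def centroid(rows):
    n = len(rows)
    return tuple(sum(col) / n for col in zip(*rows))

def train(samples):
    """samples: dict mapping class label -> list of feature vectors."""
    return {label: centroid(rows) for label, rows in samples.items()}

def predict(model, x):
    def dist(c):
        return sum((a - b) ** 2 for a, b in zip(x, c))
    return min(model, key=lambda label: dist(model[label]))

# A training sample with illustrative class labels; whatever
# sub-populations this sample missed, the model will misjudge.
training = {
    "benign":    [(1.0, 1.1), (0.9, 0.8), (1.2, 1.0)],
    "malignant": [(3.0, 3.2), (2.8, 3.1), (3.1, 2.9)],
}
model = train(training)
print(predict(model, (1.1, 0.9)))   # near the benign centroid
print(predict(model, (2.9, 3.0)))   # near the malignant centroid
```

The class labels, features, and values here are invented for illustration; real clinical classifiers differ only in scale, not in this dependence on the sample.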

But biases beyond the data collection process are also introduced along the pathway of developing a working ML predictive classifier.

I don’t believe you can make learning algorithms bias-free. Just as with drugs, where there is always a percentage of people who have an adverse reaction, there will be a percentage of people for whom the classifier’s prediction is wrong because of:

a. The incompleteness of the learning sets;
b. The conflicting and contradictory content within the training set;
c. The limitations of the learning model to represent the distributions in the training data;
d. The choice of attributes to represent the training examples;
e. The normalisations made of the training data;
f. The time scope of the chosen training data;
g. The limitations of the expertise of those experts deciding on the gold standard values;
h. The use of “large data” sets that are poorly curated, where “good data” is needed.

If legislators are going to leave the AI technology regulation-free, then they should at least require the algorithms to be made available on an accessible web site where “users” can submit their own data to test a tool’s validity on their own data sets. Then users can judge with confidence whether the specific tool is appropriate for their setting.

Training set data is typically collected in one of two ways.

1. Curated data collection: careful selection and curation of data examples to span a specific range of variables. Its most serious drawback is the manual labour needed to collate and curate the collection, so it is favoured for more targeted objectives using smaller training sets.

2. Mass data collection: data is collected en masse from a range of sources, and data elements are captured by a “deep learning” strategy of compiling large feature vectors for each classification object, using an automated method of collation and compilation.

This approach is popular because it can be highly automated, and because it supports the fallacy that more data means better-quality results.

What we don’t need is more “big data”, but rather we need more “good data”.

How do we get GOOD DATA?

Delivering a machine learning system into production means a supply of potential learning material is flowing through the system. Any sensible, well-engineered system will have a mechanism for identifying samples that are too borderline to classify safely, and will divert those samples into a Manual processing category.

The challenge with the materials in the Manual class is how to use them to improve the resident system. A good deal of research has gone into Active Learning: the task of selecting, from a large pool of unlabelled samples, new materials to add to the training set.

There are two major dimensions to this selection process: samples that fall near the class boundaries, and samples whose attribute profiles are significantly different from anything in the training materials.

Automated detection of these two types of samples requires two different analytical methods, known respectively as Uncertainty Sampling and Diversity Sampling. An excellent text on these processes is Robert Munro, Human-in-the-Loop Machine Learning, MEAP Edition Version 7, Manning Publications, 2020.
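The two strategies can be caricatured in a few lines. This is a rough sketch with hypothetical confidence scores and feature vectors, not Munro’s implementation: uncertainty sampling picks the items the classifier is least sure about, diversity sampling picks the items least like anything already trained on.

```python
# Illustrative sketch (hypothetical scores and vectors) of the two
# Active Learning selection strategies described above.

def uncertainty_sample(scored, k):
    """scored: list of (sample_id, top_class_probability).
    Pick the k samples the classifier is least confident about."""
    return [sid for sid, p in sorted(scored, key=lambda s: s[1])[:k]]

def diversity_sample(candidates, training, k):
    """Pick the k candidates whose feature vectors lie furthest
    from anything already in the training set."""
    def nearest(v):
        return min(sum((a - b) ** 2 for a, b in zip(v, t)) for t in training)
    ranked = sorted(candidates, key=lambda c: nearest(c[1]), reverse=True)
    return [cid for cid, _ in ranked[:k]]

# Borderline reports (probability near 0.5) are the most informative:
scores = [("r1", 0.51), ("r2", 0.97), ("r3", 0.62)]
print(uncertainty_sample(scores, 2))          # r1 and r3 selected

# A candidate far from all training vectors is a novel profile:
training = [(0.0, 0.0), (1.0, 1.0)]
cands = [("c1", (0.1, 0.1)), ("c2", (5.0, 5.0))]
print(diversity_sample(cands, training, 1))   # c2 selected
```

In production the two rankings are usually combined, so that the Manual queue feeds both boundary cases and genuinely novel cases back into training.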

GOOD DATA can be accumulated through a strategy of Continuous Process Improvement (CPI), and any worthwhile clinical Machine Learning system will need to have that in place. Otherwise the system is trapped in an inescapable universe of self-fulfilling prophecies, with no way of learning what it doesn’t know, and should know, to be safe. A clinical ML system without CPI is hobbled by “unknown unknowns”, which can lead to errors both trivial and catastrophic.

Deficiencies in Clinical Natural Language Processing Software – A Review of 5 Systems

We have recently assessed the accuracy of clinical NLP software, available either as open-source projects or as commercial demonstration systems, at processing pathology reports. The attached whitepaper discusses the twenty-eight deficiencies we observed in our testing of five different systems.

Deficiencies in Current Practices of Clinical Natural Language Processing – White Paper

Our analysis is based on the need for industrial-strength language engineering, which must cope with a greater variety of real-world problems than research solutions ever experience. In a research setting, users can tailor their data and pre-processing to answer a very specific investigative question; in real-world usage there is little, or no, control over input data. As a simple example, in a language engineering application the data could be delivered in a standard messaging format, say HL7, that has to be processed no matter what vagaries it embodies. In a research project that data could be curated to remove the HL7 components before the CNLP processing was invoked, a fix not available in a standard clinical setting.

When an organisation intends to apply a CNLP system to its data, the topics discussed in this document need to be assessed for their potential impact on the desired outcomes.

The evaluations were based on two key principles:

  • There is a primary function to be performed by CNLP, that is, Clinical Entity Recognition (CER).
  • There is one secondary function and that is Relationship Identification.

Any other clinical NLP processing will rely on one or both of these two functions. For the purposes of this discussion we exclude “text mining”, which uses a “bag-of-words” approach to language analysis and is woefully inadequate in a clinical setting.

Assessed Software: 

  • Amazon Comprehend Medical
  • Stanford NLP + Metathesaurus
  • OpenNLP + Metathesaurus
  • GATE + Metathesaurus
  • cTAKES

Each system has these deficiencies to a greater or lesser extent; no system has all of them. The deficiencies observed across the five systems are grouped under the following headings:

  • Deficiencies in Understanding Document Structure
  • Deficiencies in Tokenisation
  • Deficiencies in Grammatical Understanding
  • Deficiencies in the CER Algorithms
  • Deficiencies in Semantics and Interpreting Medical Terminology

How much document detritus do you have to sift through to find what you need in a Cancer Registry?

Many registries have to report to at least three authorities, usually a State Registry, the Commission on Cancer, and the CDC, while others may also report to SEER.

While the content is mostly common, the differences can lie both in the data required and in the rules defining that data. To compound the problem, these complexities can often be dwarfed by two other content filters: a. separating the reportable cancer documents from the other two classes of reports, namely non-reportable cancer reports and non-cancer reports; and b. finding the date of initial diagnosis in a complex patient record.

Dealing with the first of these problems, the separation of document types, is particularly acute for regional and central registries.

The Greater California CR has determined that about 60% of the documents it receives are irrelevant to its needs. At a volume of greater than 600,000 reports per annum, that amounts to over 360,000 irrelevant documents: a huge overburden of work before the CTRs can get down to the task they have been trained (and paid) for.

As part of our research into the size of this problem, we’d like to understand the extent of the overburden in your registry, so here’s a very short questionnaire:

  1. Is your institution a Hospital/Regional CR/ Central CR?
  2. How many reports per annum do you receive?
  3. What proportion of those reports are irrelevant to the registry work?

You can answer the questionnaire here. Your help will be gratefully received.

Cross-posted to LinkedIn.

The Many Faces of Natural Language Processing

In the last 10 years the NLP label has been forced from the lab into mainstream usage, fuelled principally by the work of Google and others in providing us with tremendously powerful search engines, often masquerading as NLP.

It also happens that Google’s processes don’t have their origins in NLP but in Information Retrieval (IR), a field in which a number of methods are anathema to true NLPers: ignoring grammar, truncating words to stems, ignoring structure, and generally treating a text as just a big bag of words with little to no interrelationships.

True NLP has a history as deep as computing itself with the very earliest computer usage for language processing occurring in the 1950s. Since then it has emerged as the discipline of Computational Linguistics with serious attention to all the lexical, grammatical and semantic characteristics of language, a far more complicated task than the simple text mining that Google, Amazon and other search engines have taken on.

A richer description of the history of NLP and its rivalry with Information Retrieval is presented in a previous blog, but here is a quick and easy guide to current phrase usage around language processing, ranked in ascending order of complexity:

  • Text Mining aka IR – rules, regular expressions, bag of words
  • True NLP aka Computational Linguistics – the field of computing the structure of language
  • Statistical NLP (SNLP) – True NLP plus Machine Learning
  • Neural NLP aka Deep Learning – Statistical NLP using particular types of Machine Learning to broaden domain knowledge
  • Language Engineering – Building production grade SNLP solutions

As a quick and pragmatic way to separate IR and NLP consider the result of searching for “liver cancer” in a range of clinical documents. Text Mining will correctly identify any document that contains the literal string “liver cancer” and possibly “cancer of the liver” if it is using grammatical searching as an adjunct to Information Retrieval. NLP will return a superset of documents containing not only the Text Mining results, but “liver core biopsy shows adenocarcinoma”, “hepatocellular carcinoma”, “liver mets”, “liver CA”, “HCC” and many more lexicogrammatical representations of the concept of “liver cancer”.
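The contrast above can be caricatured in a few lines of code, with a small hypothetical synonym table standing in for the lexicogrammatical machinery of real NLP:

```python
# Toy contrast (hypothetical synonym table) between literal Text Mining
# matching and the concept-level matching that NLP aims for.

# A real NLP system derives these equivalences lexicogrammatically;
# this flat lookup table merely illustrates the difference in recall.
LIVER_CANCER_TERMS = {
    "liver cancer", "cancer of the liver", "hepatocellular carcinoma",
    "hcc", "liver mets", "liver ca",
}

def text_mining_match(doc):
    # Literal string search, as in bag-of-words Information Retrieval.
    return "liver cancer" in doc.lower()

def concept_match(doc):
    # Matches any known surface form of the concept "liver cancer".
    text = doc.lower()
    return any(term in text for term in LIVER_CANCER_TERMS)

docs = [
    "Biopsy confirms liver cancer.",
    "Findings consistent with hepatocellular carcinoma.",
]
print([text_mining_match(d) for d in docs])  # misses the second report
print([concept_match(d) for d in docs])      # finds both reports
```

Even this toy version shows why NLP returns a superset of the Text Mining results; a production system would also handle negation, abbreviation ambiguity, and grammatical variation, none of which a lookup table can.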

Cross-posted to LinkedIn.

AI-based coding the only way for Registries to handle increasing loads and CTR shortages

I have spoken at a number of conferences in the past few years and attended even more as a delegate. One perennial topic raised in conversations is the worrying shortage of CTRs and its impact on the efficiency of registries and the need to stretch budgets further than ever.

The shortages and shortcomings are mostly due to factors outside the control of registries, and have resulted in increasing reports of growing backlogs, staff burnout, and reduced registry efficiency.

Recruitment of new staff has proven problematic with the cohort of senior CTRs moving into retirement age, exacerbated by the introduction of more and more coding requirements. Young potential staff have more work and education opportunities available to them, so they find clinical coding a less attractive occupation. The higher mobility of these young staff makes it all the more difficult to retain them even once they are recruited.

Delays in recruiting new staff have their own knock-on effects, increasing backlogs and making it more difficult to provide timely statistics on the distribution and growth of disease cohorts.

The value of registries has been increasing as they provide valuable information for health planning authorities, but the registries themselves are looking for more ways to contribute to the care of cancer patients. As the most comprehensive source of information about a patient, they represent a valuable resource for clinical carers making decisions for patients. The efforts made by registries to extend their capabilities have led to further pressure to make more information available. The expansion in medical investigations, especially in imaging and pathology technologies, has also increased the demand on registries to collect and analyse even more data.

Large pharmaceutical organisations and other medical research groups increasingly want a larger range of data made available to them, and this puts even further demand on registries.

The question for all registry directors is how to cope with:

  • decreasing staff availability;
  • increased data volumes;
  • increased data requests;
  • faster turnaround requirements; and,
  • diversification of their functions.

The popular answer, amid today’s enthusiasm for medical technology, is Artificial Intelligence (AI). But can it really help, and what can it actually do? Here are some suggestions, with their advantages:


  • Sort out reportable cancer reports from non-cancer reports and non-reportables;
  • Extract the five core attributes of {Site, Histology, Grade, Behaviour, Laterality} from the cancer reports;
  • Codify the five core attributes to ICD-O-3.

The advantages of these functions for a cancer registry would be highly significant. The first function would remove a serious and substantial bottleneck to fulfilling the main task of Case Identification and Coding. In some states the proportion of irrelevant reports can reach as high as 60%, so in a state like California this can represent about 600,000 unwanted reports, a nuisance that clutters up the principal work of the registry.

The second function should be designed to process simple reports automatically and move the cryptic ones to a Manual processing workflow stream. In this way the mundane work would be removed from the registrars’ desktops and they could spend additional time on the more difficult cases.
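A toy sketch of the extraction step may help. The keyword rules, attribute values, and sample report below are entirely hypothetical; a real system would use Clinical Entity Recognition over the full report, not keyword matching:

```python
# Hypothetical sketch of extracting the five core attributes
# {Site, Histology, Grade, Behaviour, Laterality} from a report.
# Keyword rules stand in for real Clinical Entity Recognition.
import re

def extract_core_attributes(report):
    text = report.lower()
    attrs = {"Site": None, "Histology": None, "Grade": None,
             "Behaviour": None, "Laterality": None}
    if "liver" in text:
        attrs["Site"] = "liver"
    if "adenocarcinoma" in text:
        attrs["Histology"] = "adenocarcinoma"
    m = re.search(r"grade\s+(i{1,3}|[1-4])", text)
    if m:
        attrs["Grade"] = m.group(1)
    if "invasive" in text or "malignant" in text:
        attrs["Behaviour"] = "malignant"
    for side in ("left", "right", "bilateral"):
        if side in text:
            attrs["Laterality"] = side
    return attrs

print(extract_core_attributes(
    "Invasive adenocarcinoma, grade 2, right lobe of liver."))
```

A report for which one or more attributes come back empty, or ambiguous, is exactly the kind of “cryptic” case that should be diverted to the Manual stream rather than coded automatically.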

The third function is important in semi-automating the coding of reports. This requires sophisticated AI, as there are over 800 histology codes in ICD-O-3 and many of them are defined by complex combinations of content and rules. This labyrinthine standard is better served by a strong AI solution that makes fewer mistakes, with a mechanism for passing the most difficult cases to Manual processing.
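The auto/manual split described above reduces, at its simplest, to a confidence threshold. The threshold value, report identifiers, and codes in this sketch are illustrative only; a real registry would tune the operating point against its own error tolerances:

```python
# Sketch (hypothetical threshold and codes) of routing coded reports:
# high-confidence predictions are auto-processed, the rest are queued
# for a CTR in the Manual processing stream.

AUTO_THRESHOLD = 0.95  # assumed operating point, tuned per registry

def route(report_id, predicted_code, confidence):
    if confidence >= AUTO_THRESHOLD:
        return ("auto", report_id, predicted_code)
    return ("manual", report_id, predicted_code)

print(route("path-001", "8170/3", 0.99))  # auto-coded
print(route("path-002", "8000/3", 0.71))  # queued for Manual processing
```

The Manual queue this produces is also the natural feed for the Active Learning loop described earlier, closing the Continuous Process Improvement cycle.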

AI holds out a promise of increasing the efficiency of cancer registries by releasing staff to do the most difficult work. It is not self-evident that this use of AI will disrupt the workforce; rather, it should act as a support mechanism for staff already beset by a shrinking workforce, increasing demands, and a flooding data lake they cannot escape.

I would welcome the chance to hear of your particular challenges and explore ways we could work together to enhance the productivity of your registry.

Cross-posted to LinkedIn.