The Search for Reliable and Deployable Clinical AI

The recent announcement by AMIA of the “2022 Artificial Intelligence Evaluation Showcase” is no doubt welcomed by research experimenters, but will it reveal how to conduct better and more effective Clinical AI studies that produce truly valuable operational deployments? (See amia-2022-artificial-intelligence-evaluation-showcase/artificial-intelligence)

The showcase is divided into 3 phases, with results from each phase to be presented at traditional conferences conducted by AMIA.
Phase 1 involves presenting at the AMIA 2022 Informatics Summit held in March a “system description, results from a study of algorithm performance, and an outline of the methods of the full evaluation plan”.
Phase 2 involves a presentation at the AMIA 2022 Clinical Informatics Conference in May “to address usability and workflow aspects of the AI system, including prototype usability or satisfaction, algorithm explainability, implementation lessons, and/or system use in context”.
Phase 3 involves presenting a submission at the AMIA 2022 Annual Symposium in November “to summarize the comprehensive evaluation to include research from prior submissions with new research results that measure the impact of the tool”.

So in coalescing these three statements and drawing on the organisers’ other comments, I would like to reframe their words into these prospective and admirable outcomes:
a. improve the scale and scope of the evaluation of AI tools so that we get fewer limited and poor-quality AI publications; and,
b. encourage the development of multidisciplinary teams with a wider range of expertise that would improve the quality of AI evaluation.

However, what might be the unexpected outcomes of the Showcase?
Certainly some researchers will gain publication for their work, which they will welcome and which is not without merit in itself, but if that is the only objective then why run a special Showcase? Why not use the normal mechanisms AMIA has available, such as running a special issue of JAMIA on AI/ML? Will the three-phase submission and presentation format provide something that normal publication channels don’t? It’s not self-evident from the published promotional materials for the Showcase. If the objective is to promote research into putative AI solutions for clinical data processing tasks, then it offers nothing different from the publication avenues currently available. If the aim is to bring forward the adoption of AI technologies into working environments, then there are a number of unspoken obstacles not addressed by the Showcase call to arms.

So I ask the question: will the Showcase create motivations that drive us in the wrong direction for the improvement of productive Clinical AI solutions? That may seem unfair to the organisers, who no doubt are working hard for legitimate outcomes, but the world of AI development has a number of deficits that may be reinforced rather than diminished by this well-intentioned initiative.

You might wonder why I am so disbelieving that this honourable initiative will provide useful outputs that push the Clinical AI industry further in a positive way. You might ask why I feel compelled to make a submission to the Showcase yet deep down think the exercise will be futile and a waste of time. I am buoyed in my misgivings by the recent article in the MIT Technology Review with the headline “Hundreds of AI tools have been built to catch COVID. None of them helped.”, as well as by my previous review of the topic (see https://www.jon-patrick.com/2021/04/ai-assessment-in-clinical-applications/).

In this conversation we will restrict our interpretation of “AI” to “supervised machine learning” (ML), the most common form of AI technology in use for analysing clinical data, and we draw on our experience in Clinical Natural Language Processing (CNLP) to formulate our analysis. It will be up to others to decide how applicable this commentary is to their own ML contexts.

Here are some of my musings over the obstacles facing the Clinical AI industry that it would be helpful for the showcase to specifically address.

1. DOES THE DATA MATCH THE OBJECTIVES? Research projects exploring the use of ML techniques on clinical case data are a FAR CRY from building industrial-quality technology that clinicians find trustworthy enough to use. Research projects are conducted under a number of limitations which often are not clearly understood by the research teams. Typically, the data sets used for training the ML models are flawed and inadequate in terms of the project objectives without obviously being so. The data can be flawed because it doesn’t cover all the corner cases of the problem space; that is, the training sample is not properly representative of the problem space. This commonly occurs when the data has been provided opportunistically rather than selected according to the project objectives.
The data can also be flawed because the values are poorly expressed for ML purposes. For example, in one data set of GP notes the medicines files held the precise pharmacy descriptions, which caused all data values to be virtually unique across 60K records and therefore unsuitable as a classifying feature. The remediation required a physician to go through the records and create values of {normal, weaker than normal, stronger than normal} as a surrogate for the prescription details that were thought to be meaningful to the project objective.
Providing a justification of each variable and its domain range used in a model would be a useful validation criterion.
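
As a rough illustration of the medicines example above, the following sketch flags variables whose values are nearly all distinct and so carry little signal as a classifying feature. The column names, the sample values and the 0.9 threshold are invented for illustration; they are not taken from the GP notes data set described here.

```python
def distinct_ratio(values):
    """Number of distinct values divided by the number of records."""
    values = list(values)
    if not values:
        raise ValueError("no values supplied")
    return len(set(values)) / len(values)


def flag_near_unique(columns, threshold=0.9):
    """Return the names of columns whose distinct-value ratio meets the threshold.

    The threshold is an arbitrary assumption; a project would choose its own.
    """
    return sorted(
        name for name, values in columns.items()
        if distinct_ratio(values) >= threshold
    )


# Hypothetical records: free-text prescription details versus a
# physician-assigned surrogate category.
columns = {
    "medicine_description": [
        "drug-A 500 mg cap, 1 three times daily, 7 days",
        "drug-A 250 mg/5 mL susp, 5 mL three times daily",
        "drug-B 20 mg tab, 1 at night",
        "drug-B 40 mg tab, 1 at night",
        "drug-C 500 mg tab, 1 twice daily",
        "drug-C 1 g XR tab, 1 daily",
    ],
    "dose_strength": [
        "normal", "weaker than normal", "normal",
        "stronger than normal", "normal", "stronger than normal",
    ],
}

flagged = flag_near_unique(columns)
print(flagged)  # ['medicine_description']
```

A check like this only detects the symptom. Deciding what surrogate values are meaningful for the project objective still needs clinical judgement, as in the physician review described above.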

2. IS THE ML ALGORITHM APPROPRIATE FOR THE TASK? Might the showcase lead to a plethora of studies using the currently popular fad of Deep Learning, which can be inappropriate in many health circumstances? Deep Learning is metaphorically a heavyweight technology that suits the needs of steel workers assembling a new skyscraper, whereas many clinical case studies need to be assembled with a watchmaker’s toolkit of minute components, put together with delicacy to achieve the high accuracy required to effectively support clinical work. Deep learning techniques have been reported as useful in some settings, especially imaging, and are said to have great power because they are trained on very large data sets, but this also raises a number of questions, e.g. How are models corrected when small but important errors are found? How are gold standard values established to be 99.99% pure gold (better than 24 karat of 99.9%) for such large data sets? How does Deep Learning incorporate the specific knowledge, rules and standards of professional practice when those practices vary from year to year, especially when the large training set only becomes available year(s) after the fact? How does it correctly identify the extremely rare event (like certain diseases) that is definitionally much like a common event?
As a generalisation Deep Learning provides the least transparency of all ML algorithms, yet at the same time as a counterpoint, there are researchers endeavouring to travel in the opposite direction and increase the explicability of AI applications. See the $1m prize awarded to Cynthia Rudin of Duke University, for research into ML systems that people can understand (https://www.wsj.com/articles/duke-professor-recognized-for-bringing-more-clarity-to-ai-decision-making-11634229889).
Assessing the ML algorithm for its appropriateness to the applied task would be a useful evaluation criterion.

3. BAD DATA AND POWERFUL ALGORITHMS JUST SET US BACK. Researchers can be pressed to use data that is on hand and so either tackle a problem of no great value or misinterpret the meaning, value and generalisability of their outcomes. This situation can lead to routine processes being applied to data that is of poor quality both in its definition and in its gold standard training classes, so that whilst results are produced their value is limited. It must be accepted that good researchers (especially young researchers) will learn from these misguided efforts and go on to do better work next time round, so the exercise has good educational and praxis value, but the interim output can be of limited research value and so waste a great deal of the time of all the external people in its assessment chain when it is presented for publication or deployment. Assessing the meaningfulness of the data in the context of the problem space would be a useful assessment criterion. To give them their due, the showcase organisers might well have faith that this will be achieved in Phases 2 and 3 of their programme.

4. RESEARCH EFFORT DOES NOT EQUAL SATISFACTORY INDUSTRIAL PERFORMANCE. The requirements for producing industrial-quality technology are often beyond the competency and experience of research teams. This can be even more true for research teams embedded in large corporations who treat their techniques as the only way to resolve the task and, like a hammer, treat everything in the world as a nail (see https://www.jon-patrick.com/2019/02/deficiencies-in-clinical-natural-language-processing-software-a-review-of-5-systems/ for an example). A working solution that is costly for staff to integrate into existing workflows requires considerable planning for adoption, but even more ingenuity and experience to apply the best software engineering techniques. The supply of the source data for the working solution has to be secured and monitored on a daily basis once the operational system is in place. The continuous storage of incoming data, its efficient application to the ML algorithms and the delivery of outputs to the points of usage are all complex engineering and organisational matters, which researchers are commonly insensitive to if not entirely inexperienced with. The software engineering of complex workable solutions is just as important to successful industrial-quality solutions as the ML algorithms, data sampling and model optimisation, but is invariably ignored in research publications and by the researchers themselves. The cry so often is “It is all about the data”, which is so far from the truth for real solutions.
Assessing the software engineering development and maintenance requirements of a proposed AI solution would be a useful evaluation criterion.

5. WILL A CLINICIAN CHANGE THEIR EXPERT OPINION? The produced systems are rarely tested on the real criterion of success: will a clinician actually use this technology to correct their own expert opinion? Just asking them if they approve of the solution is not sufficient. ML projects are normally tested for their accuracy, where the most common test used, 10-fold cross validation (10CV), is probably the weakest test that could be applied. In our work we ignore it as a test as it provides little information that could be the basis of action to improve processing. Even experiments that use a held-out set are little better. The best computational test is validation against a post-implementation test, that is, new data that has never been seen and is drawn from the world of real practice. This approach then necessitates more infrastructure and ongoing commitment to improving the solution. Has the user client committed to that effort, and for how long?

However, the ultimate test is the client. Will they suspend their own judgement to accept or even consider the judgement of the delivered tool? If not, then not all is lost. An ML system designed to replace human tasks can readily become an advisory to the human, prodding them to think of things they might not have otherwise thought of. But also one has to be careful not to overreach – the recent AMIA Daily Downloads (25th November 2021 EADT) headlines “Epic’s sepsis algorithm may have caused alert fatigue with 43% alert increase during pandemic”.
Assessing the extent to which clinicians will revise their opinions would be a useful verification criterion.
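
To make the post-implementation test mentioned above concrete, here is a minimal sketch that keeps records seen after a deployment date apart from the development data and scores the two sets separately. The records, the keyword rule standing in for a trained model, and the deployment date are all invented for illustration; a real project would substitute its own model and data feed.

```python
from datetime import date


def split_at_deployment(records, deployment_date):
    """Separate development records from those first seen on or after deployment."""
    development = [r for r in records if r["seen_on"] < deployment_date]
    post_implementation = [r for r in records if r["seen_on"] >= deployment_date]
    return development, post_implementation


def accuracy(records, predict):
    """Fraction of records for which the prediction matches the gold label."""
    if not records:
        raise ValueError("no records to score")
    return sum(1 for r in records if predict(r) == r["label"]) / len(records)


def keyword_rule(record):
    """Stand-in for a trained classifier: flags a report that mentions 'carcinoma'."""
    return "carcinoma" in record["text"]


# Hypothetical reports; label is True when the report describes a cancer.
records = [
    {"seen_on": date(2021, 1, 10), "text": "invasive carcinoma", "label": True},
    {"seen_on": date(2021, 2, 3), "text": "benign naevus", "label": False},
    {"seen_on": date(2021, 3, 15), "text": "ductal carcinoma in situ", "label": True},
    {"seen_on": date(2021, 4, 1), "text": "no malignancy seen", "label": False},
    {"seen_on": date(2021, 7, 9), "text": "adenoca of colon", "label": True},
    {"seen_on": date(2021, 8, 20), "text": "squamous cell carcinoma", "label": True},
    {"seen_on": date(2021, 9, 2), "text": "negative for carcinoma", "label": False},
    {"seen_on": date(2021, 10, 11), "text": "normal mucosa", "label": False},
]

development, post_implementation = split_at_deployment(records, date(2021, 6, 1))
dev_acc = accuracy(development, keyword_rule)
post_acc = accuracy(post_implementation, keyword_rule)
print(dev_acc, post_acc)  # 1.0 0.5
```

In this toy case the rule looks perfect on the development data and fails on an abbreviation and a negation that only appear later. The numbers are contrived, but the separation of the two sets is the point: the post-implementation score is the one that reflects real practice.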

6. AI IS USED IN MANY HEALTH SETTINGS THAT ARE NOT CLINICAL CARE. Many AI/ML applications that are used in health are for Public Health purposes or other secondary usage. These will not be able to show improved health outcomes, as required by Phase 3 of the Showcase, but rather they contribute to greater efficiency and productivity in the workplace. At best they represent second-order effects on health outcomes. The narrowness of the Showcase call for participation appears to be based on a limited view of the breadth of ML applications in the health sector as distinct from the clinical sector. One just needs to look at the papers presented in past AMIA conferences to realise there are probably more applications of ML to secondary use of clinical data than to primary use in clinical practice. Cancer Registries are a good example of secondary use: they draw on most, and ideally all, pathology reports generated ACROSS the whole country that describe some aspect of cancer cases. If registries of all shades and persuasions are to keep up with the increasing count of patients and methods of treatment, then Clinical NLP using ML will be a vital tool in their analytics armoury.
Assessing the extent to which an AI technology makes work more efficient or reliable would be a useful productivity criterion.

7. PRESS REPORTS IGNORE ACCURACY. It is frustrating to read press reports that laud the “accomplishments” of an AI application without any content on the reliability of the application. Errors mean different things in different contexts. The meaningfulness of false positives (FPs) and false negatives (FNs) is generally undersold and often ignored. In clinical work an FP can have as serious a consequence as an FN: an FP means a patient receives inappropriate care that endangers their health, perhaps even their life, just as an FN leads to a missed diagnosis and a failure to deliver appropriate care. However, in population-based health applications FNs usually need to be minimised, and a certain higher level of FPs can be tolerated as a compromise. Assessment of the importance of FNs and FPs to the acceptability of the AI application would be a useful reliability criterion.
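
The trade-off described above can be sketched with a few lines of arithmetic. The two operating points below and the cost weights are invented numbers for a hypothetical classifier scored on 1,000 cases, 100 of which are true cases; they only show how the preferred setting changes once FPs and FNs are weighted for the context.

```python
def error_profile(tp, fp, fn, tn):
    """Standard rates derived from confusion-matrix counts."""
    return {
        "sensitivity": tp / (tp + fn),  # share of true cases that are found
        "specificity": tn / (tn + fp),  # share of non-cases correctly passed over
        "ppv": tp / (tp + fp),          # share of alerts that are true cases
    }


def weighted_error_cost(fp, fn, fp_cost, fn_cost):
    """Total cost of errors under context-specific weights (weights are assumptions)."""
    return fp * fp_cost + fn * fn_cost


# Two hypothetical operating points of the same classifier.
strict = {"tp": 80, "fp": 10, "fn": 20, "tn": 890}
lenient = {"tp": 98, "fp": 90, "fn": 2, "tn": 810}

# Clinical care: an FP is treated as being as harmful as an FN.
clinical = {
    name: weighted_error_cost(p["fp"], p["fn"], fp_cost=1, fn_cost=1)
    for name, p in (("strict", strict), ("lenient", lenient))
}

# Population health: a missed case is weighted ten times worse than a false alarm.
population = {
    name: weighted_error_cost(p["fp"], p["fn"], fp_cost=1, fn_cost=10)
    for name, p in (("strict", strict), ("lenient", lenient))
}

print(clinical)    # {'strict': 30, 'lenient': 92}
print(population)  # {'strict': 210, 'lenient': 110}
print(error_profile(**strict)["sensitivity"], error_profile(**lenient)["sensitivity"])
```

Under equal weights the strict setting has the lower cost; once missed cases are weighted more heavily the lenient setting wins, despite nine times as many false positives. A report that quotes a single accuracy figure hides this choice entirely.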

While I feel the motivation and projected outcomes for the Showcase are hazy, individuals will have to assess for themselves whether it sufficiently addresses the difficulties in the field for participation to provide reciprocal value for the gargantuan effort and cost needed by contributors to be involved. It is the question we are asking ourselves at the moment.

Just making something different isn’t sufficient,

someone else has to use it meaningfully for it to have value.

How can we assess the value of a work from its Abstract – examples drawn from the AMIA 2022 AI Showcase

The AMIA 2022 AI Showcase has been devised as a 3-stage submission process where participants will be assessed at submissions 1 and 2 for their progression to the 2nd and 3rd presentations. The stages coincide with the three AMIA conferences:

Informatics Summit, March 21-24, 2022; Clinical Informatics Conference (CIF), May 24-26, 2022; and the Annual Symposium, November 5-9, 2022.

Submission requirements for each stage are:

Stage 1. System description, results from a study of algorithm performance, and an outline of the methods of the full evaluation plan

Stage 2. Submission to address usability and workflow aspects of the AI system, including prototype usability or satisfaction, algorithm explainability, implementation lessons, and/or system use in context.

Stage 3. Submissions to summarize the comprehensive evaluation to include research from prior submissions with new research results that measure the impact of the tool.

Now that we have the abstracts for the Stage 1 conference we can analyse the extent to which the community of practice could satisfy those criteria. We can also identify how the abstracts might fall short of the submission requirements and so aid the authors in closing the gap between their submission and the putative requirements. The authors will then be able to ensure their poster or presentation fills some of the gaps where they have the information.

There is also the open question of whether the “call to arms” in the Showcase was effective and how it might be improved in the future.

A succinct definition of the desirable contents of an Abstract might well be:

  1. What was done?
  2. Why was it done?
  3. What are the results?
  4. What is the practical significance?
  5. What is the theoretical significance?

The Clinical AI Showcase has now provided a set of abstracts for the presentations and posters, so how might we rate the informativeness of the abstracts against these criteria?

A review of the 22 abstracts has been conducted and a summary based on these 5 content attributes with 4&5 being merged has been provided in Table 2. The summaries have been designated by their key themes and collated into Table 1.

Using the publicised criteria for each of the three Stages of the AI Showcase it would appear that only 9 of the submissions are conformant, i.e. the papers categorised as Extended ML model, Development of ML model and Developed and Deployed. It is an open question as to whether the abstracts classed as Comparison of ML models fulfil the Stage 1 criteria. The authors would do well to clarify the coverage of their work in their full presentation/poster.

The other categories appear to exist in a certain type of limbo. The Methods study only group appears out of place given the objectives of the AI Showcase. The Design only group would appear to be stepping out somewhat prematurely, although the broader advertising for the Showcase certainly encouraged early-stage entries. As mentioned in my previous blog (blog reference), it would be exceedingly difficult for teams to deliver the purported content by the deadlines for Stages 1 and 2 if the ideas for the project are at an embryonic design stage.

Teams with nearly or fully completed projects were able to submit works that also fulfilled many of the criteria for Stage 2 of the Showcase. The Developed and Deployed group showed progression in their projects that had reached deployment but in no case reported usability or workflow aspects with the exception of one paper that claimed their solution was installed at the bedside.

Two abstracts did not describe clinical applications of their ML but rather secondary use and these papers were doing NLP.

Good Abstract Writing

Most abstracts provided reasonable descriptions of the work they had done or intended to do. It was rare for abstracts to describe their results or the significance of their work. This can undoubtedly be corrected in Stages 2 or 3 of the Showcase, where they are required to report on their assessment of their tool’s practical use. Only one paper provided information on all four desirable abstract content items.

What can the Showcase Learn and do better

This Showcase has the admirable objective of encouraging researchers and clinical teams to perform AI projects to a better quality and in a more conclusive manner. However, its Stages cover a cornucopia of objectives set out in a timeline that is unrealistic for projects just starting and poorly co-ordinated for projects near to or at completion. This is well evidenced by the 40 or more ML projects included in the Conference programme that are not part of the Showcase. If the Showcase is to continue, as it should, then a more considered approach to staged objectives, encouragement of appropriate teams, and more thoughtful timing would be a great spur to its future success.

Might I Humbly Suggest (MIHS) that a more refined definition of the stages be spelled out so that

            a. groups just starting ML projects are provided with more systematic guidelines and milestones, and;

            b. that groups in the middle of projects can ensure that they have planned for an appropriate level of completeness to their work.

Stage 1. What is the intended deliverable and Why it is necessary – Which clinical community has agreed to deployment and assessment.

Stage 2. What was done in the development – How the deliverable was created and what bench top verifications were conducted.

Stage 3. Deployment and Clinical Assessment – What were the issues in deployment. What were the methods and results of the clinical assessment. What arrangements have been made for the maintenance and improvement of the deliverable.

This definition excludes groups performing ML projects purely for their own investigative interest but without a specific participating clinical community. The place for their work is within the general programme of AMIA conferences. It also means that strictly speaking only 3 of the current acceptances would fit this definition for Stage 1, although 3 of the others could be contracted to fit this definition.

A concerning factor in the current timeline design is the short 2-month span between the deliverables for Stages 1 and 2. A group would pretty much have to have completed Stage 2 to submit to Stage 1 and be ready to submit to Stage 2 in 2 months.

Lastly the cost of attending 3 AMIA conferences in the one year would be excessively taxing especially to many younger scholars in the field. AMIA should provide a two-thirds discount to each conference to those team representatives who commit to entering the Showcase. This would be a great encouragement to get more teams involved.

| Paper Topic | N |
|---|---|
| Design only | 3 |
| Comparison of ML models | 5 |
| Extended ML model | 1 |
| Development of ML model | 5 |
| Developed and Deployed – No operational assessment | 3 |
| Methods study only | 1 |
| Abstract Unavailable | 2 |
Table 1. Category of key themes on the 22 Abstracts accepted into the AI Showcase.

| Paper | What was done | Why was it done | What are their results | What is the significance | Comments | Category |
|---|---|---|---|---|---|---|
| Overgaard – CDS Tool for asthma – Presentation | Design of desirable features of AI solution. | To make review of the patient record more efficient. | Unspecified | Unknown – no clinical deployment. | Paper is about the putative design for a risk assessment tool and data extraction from the EHR. | Design only |
| Rossi – 90-day Mortality Prediction Model – Presentation | Deployed 90-day mortality prediction model. ** | To align patient preferences for advance care directives with therapeutic delivery, and improve rates of hospice admission and length of stay. | Unspecified | Unknown – clinical deployment planned. | Model is partially implemented with operationally endorsed workflows. | Development of ML model – Planned Deployment |
| Estiri – Unrecognised Bias in COVID Prediction Models – Presentation | Investigation of four COVID-19 prediction models for bias using an AI evaluation framework. ** | AI algorithm biases could exacerbate health inequalities. | Unspecified | Unknown – no clinical deployment. | Two bias topics are defined: (a) if the developed models show biases; (b) has the bias changed over time for patients not used in the development. | Comparison of ML models |
| Liu – Explainable ML to predict ICU Delirium – Presentation | A range of features were identified and a variety of MLs evaluated. Three prediction models were evaluated 6,12,24 hours. **** | To more accurately predict the onset of delirium. | Described but numerics not provided | Implied due to described implementation design | Paper describes some aspect of all characteristics but not always completely. | Comparison of ML models |
| Patel – Explainable ML to predict periodontal disease – Presentation | “New” variables added to existing models revealed new associations. | Discover new information about risk factors. | Described but associations not provided | Unknown. No clinical assessment of the discovered associations, no clinical deployment. | AI methods not described. | Extended ML model |
| Liang – Wait time prediction for ER/ED – Virtual | Develop ML classifier (Tensorflow) to predict ED/ER wait times. Training and test sets described. | No explanation | Unspecified | Unknown – clinical deployment status unclear | Focus is on the ML processes with little other information. | Development of ML model |
| Patrick – Deep Understanding for coding pathology reports – Virtual | Built a system to identify cancer pathology reports and code them for 5 data items (Site, Histology, Grade, Behaviour, Laterality). | California Cancer Registry requested an automated NLP pipeline to improve production line efficiencies. | Various accuracies provided | Improvements over manual case identification and coding provided. | The work of this blog author. | Developed and Deployed – no operational assessment – Not clinical application |
| Eftekhari/Carlin – ML sepsis prediction for hematopoietic cell transplant recipients – Poster | Deployed an early warning system that used EMR data for HCT patients. Pipeline of processing extending to clinical workflows. | Sepsis in HCT patients has a different manifestation to sepsis in other settings. | Only specified result is deployment | Unknown – no clinical assessment. | Deployment is described showing its complexity. No evaluations. | Developed and Deployed – no operational assessment |
| Luo – Evaluation of Deep Phenotype concept recognitions on external EHR datasets – Poster | Recognises Human Phenotype Ontology concepts in biomedical texts. | No explanation | Unspecified | Unknown – no clinical deployment. | Abstract is the least informative. One sentence only. | Development of ML model – Not clinical application |
| Pillai – Quality Assurance for ML in Radiation Oncology – Poster | Five ML models were built and a voting system devised to decide if a radiology treatment plan was Difficult or Not Difficult. Feature extraction was provided. | To improve clinical staff’s scrutiny of difficult plans to reduce errors downstream. Feature extraction to improve interpretability and transparency. **** | Unspecified | System planned to be integrated into clinical workflow. | Mostly about ML approach but shows some forethought into downstream adoption. | Comparison of ML models – Deployment planned |
| Chen – Validation of prediction of Age-related macular degeneration – Poster | ML model to predict later AMD degeneration using 80K images from 3K patients. | To predict the risk of progression to vision-threatening late AMD in subsequent years. | Unspecified | Unknown – no clinical deployment. | Focus is on the ML processes with little other information. | Development of ML model |
| Saleh – Comparison of predictive models for paediatric deterioration – Poster | Plan to develop and implement ML model to augment prediction of paediatric clinical deterioration within the clinical workflow. | Detecting deterioration in paediatric cases is effective at only 41% using existing tools. | Unspecified – planning stage only | System planned to be integrated into clinical workflow. | Early conceptualisation stage. Well framed objective and attentive to clinical acceptability. No framing of datasets, variables and ML methods. | Design only |
| Shah – ML for medical education simulation of chest radiography – Poster | not available | | | | | |
| Mathur – Translational aspects of AI – Poster | Evaluation of the TEHAI framework compared to other frameworks for AI with emphasis on translational and ethical features of model development and its deployment. | A lack of standard training data and the clinical barriers to introducing AI into the workplace warrant the development of an AI evaluation framework. | Qualitative assessment of 25 features by reviewers. | No in vitro evaluation – only qualitative assessment | This is an attempt to improve the evaluation criteria we should be using on AI systems. It fails to make a convincing case that it is a better method than alternatives. | Methods study only |
| Yu – Evaluating Pediatric sepsis predictive model – Poster | not available | | | | | |
| Tsui – ML prediction for clinical deterioration and intervention – Poster | Built an ML for an intensive care warning system for deterioration events and pharmacy interventions. It uses bedside monitor and EHR data, providing results in real time. | No explanation | Unspecified | Unknown – no clinical assessment. | Operational system. Only a description of the deliverables – no evaluations. | Developed and Deployed – no operational assessment |
| Rasmy – Evaluation of DL model for COVID-19 outcomes – Poster | A DL algorithm developed to predict for COVID-19 cases on admission: in-hospital mortality, need for mechanical ventilation, and long hospitalization. | No explanation | Unspecified – no numerics supplied | Unknown – no clinical deployment | Seems to concentrate solely on the DL modelling. | Development of ML model |
| Wu – ML for predicting 30-day cancer readmissions – Poster | ML models built to identify 30-day unplanned admissions for cancer patients. | Unplanned cancer readmissions have significantly poorer outcomes so the aim is to reduce them. | No results but promised in the poster/presentation | Unknown – no clinical assessment. | No ML details, just a justification in Abstract | Comparison of ML models |
| Mao – DL model for Vancomycin monitoring – Poster | A DL pharmacokinetic model for Vancomycin was compared to a Bayesian model. | To provide a more accurate model of Vancomycin monitoring. | The DL model performed better than the Bayesian model. No numerics provided. | Unknown – no clinical assessment. | Focus is on the ML processes with little other information. | Comparison of ML models |
| Ramnarine – Policy for Mass Casualty Trauma triage – Poster | Design of a strategy to build an ML for ER/ED triage and retriage categorisation for mass casualty incidents. | To fill a void in national standards for triage and retriage. | Unspecified – design proposal only | Unknown – no clinical assessment of practicality of acceptance. | This is a proposal with no concrete strategy for implementation and what would be used in the investigation from either a data source or ML strategy type. | Design only |

Table 2. Summary of the 22 Abstracts according to 4 content attributes ascribed for good abstracts.

Deep Understanding – Where does it come from?

Modern language processing technologies have arisen from two separate pathways in algorithm history.

The first language processing efforts in the 1960s were conducted by linguists using early computers to process linguistic structures. The linguists built algorithms that identified parts of speech and parsed sentences according to language structure. As this field developed it became known as Computational Linguistics and continues today as a professional research community.

In the 1990s computer scientists introduced machine learning methods into the field of computational linguistics, following the spirit of corpus linguistics, where they studied the statistical characteristics of corpora. These methods, known as Statistical Natural Language Processing, became an important subfield of what had also become known as Natural Language Processing (NLP).

In the second half of the ‘90s the rise of Google spurred the adoption of Text Mining, an approach that ignored the linguistic character of language and worked on just the text strings. This has become a fashionable approach due to the ease with which programmers can become engaged as enthusiasts in the processing strategies without the need to engage in the field of linguistics. This approach has become so popular it has assumed the title of NLP, so today NLP is a title that conflates two methodological paradigms.

Google has also fuelled a new approach since the early 2010s with its development of neural net machine learning methods called Deep Learning. This fits with their general approach of using extremely large data sets to characterize language texts. Whilst this approach has shown some useful features it is not universally applicable and some of its limitations are now beginning to emerge.

The two heritages of language processing are now Computational Linguistics and Text Mining each having separate origins and algorithmic philosophies and each now claiming the rubric of Natural Language Processing.

However algorithmic processing of text is not an end in itself but rather part of a journey to a functionality that is of value to users. The two most dominant user needs are to recognize the nature of a document (aka document classification), and, identify entities of some interest within the text (aka Semantic Entity Recognition). But the meaningfulness of this processing comes about from how the NLP analysis is then utilized, not the NLP of itself. It is this use of the NLP analysis, whether it be from the computational linguistics paradigm or the text mining paradigm, that creates the value for the user.

DEEP UNDERSTANDING takes its heritage from computational linguistics and, for processing clinical documents, moves beyond text mining methods to bring to bear the full value of:
  • coalescing grammatical understanding, semantic entity recognition, extensive clinical knowledge stores, and machine learning algorithmic power,
  • so as to transform the native textual prose into a meaningful conceptual realm of interest that mimics the work of the people responsible for interpreting the clinical texts,
  • at an accuracy level equal to or better than that of the human experts who do the task,
  • while knowing its own limitations and passing the task back to the human processor when it cannot complete the task accurately enough, and
  • providing an automatic feedback mechanism that identifies potential errors and improves performance as the body of processed material grows, so as to sustain Continuous Process Improvement.
Figure: Deep Understanding 4-stage processing system for coding pathology reports to ICD-O-3 and the NAACCR rule book, with a recursive cycle of Active Learning for Continuous Process Improvement.

An example is the work we do for cancer registries that need to translate cancer pathology reports into an agreed set of codes according to an international coding system known as the International Classification of Diseases for Oncology, Third Edition (ICD-O-3).

Our Deep Understanding pipeline consists of 4 major components as shown in the diagram:

1. A machine learning document classifier which determines if a pathology report is identified as a cancer report or not.

2. A clinical entity recognizer, itself a machine learner, that identifies forty-two different semantic classes of clinical content needed for correctly coding the report.

3. A coding inductive inference engine which holds the knowledge of the ICD-O-3 codes and many hundreds of prescribed rules for coding. It receives access to the text annotated for clinical entities and applies computational linguistics algorithms to parse the text and map the entities to the classification content to arrive at ICD-O-3 codes. The task is complicated by needing to code for 5 semantic objects:

Semantic Object | Number of Codes
Body site for the location of the cancer in the body | 300+
Disease histology | 800+
Behaviour of the cancer | 5
Grade or severity of the disease | 20+
Laterality, or side of the body where the disease is located, if a bilateral organ | 6

4. A process to cycle knowledge learnt from the operational use of the pipeline process back to the machine learning NLP components to maintain Continuous Process Improvement (CPI) or Continuous Learning.

Each process has been periodically retrained and tuned over time to attain an accuracy rivalling, and in some aspects bettering, human coding performance.
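The way the four components hand work to one another can be sketched roughly as follows. This is only an illustrative skeleton under stated assumptions, not the production system: every name (Pipeline, CodedReport, relearning_queue, the toy_* functions) is hypothetical, the stubs are keyword matchers standing in for trained models and the rule-based inference engine, and the code values are placeholders rather than real ICD-O-3 codes.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple

Entity = Tuple[str, str]  # (semantic class, matched text); hypothetical shape


@dataclass
class CodedReport:
    # The five semantic objects from the table; placeholder values only.
    site: str
    histology: str
    behaviour: str
    grade: str
    laterality: str


@dataclass
class Pipeline:
    is_cancer_report: Callable[[str], bool]                            # component 1
    find_entities: Callable[[str], List[Entity]]                       # component 2
    infer_codes: Callable[[str, List[Entity]], Optional[CodedReport]]  # component 3
    relearning_queue: List[str] = field(default_factory=list)          # feeds component 4

    def run(self, report: str) -> Optional[CodedReport]:
        if not self.is_cancer_report(report):
            return None  # not a cancer report: nothing to code
        entities = self.find_entities(report)
        coded = self.infer_codes(report, entities)
        if coded is None:
            # Could not be coded: keep the report for human review and retraining.
            self.relearning_queue.append(report)
        return coded


# Toy stand-ins (assumptions for illustration only).
def toy_classifier(report: str) -> bool:
    return "carcinoma" in report.lower()


def toy_entities(report: str) -> List[Entity]:
    vocabulary = (("breast", "SITE"), ("carcinoma", "HISTOLOGY"), ("left", "LATERALITY"))
    text = report.lower()
    return [(cls, word) for word, cls in vocabulary if word in text]


def toy_inference(report: str, entities: List[Entity]) -> Optional[CodedReport]:
    found = {cls for cls, _ in entities}
    if not {"SITE", "HISTOLOGY"} <= found:
        return None  # too little evidence to assign codes
    return CodedReport("SITE-CODE", "HISTOLOGY-CODE", "BEHAVIOUR-CODE",
                       "GRADE-CODE", "LATERALITY-CODE")


pipeline = Pipeline(toy_classifier, toy_entities, toy_inference)
coded = pipeline.run("Left breast: invasive carcinoma")
skipped = pipeline.run("Benign naevus")
uncoded = pipeline.run("Carcinoma, site not stated")
```

In this sketch the report that cannot be coded is the only one that ends up in relearning_queue, which is where a Continuous Process Improvement cycle would pick it up.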

While Deep Learning is a new and innovative method for learning the characteristics of a large corpus of texts, it can be, and usually is, only an NLP subsystem serving a larger processing objective. If that objective is to compute an understanding of the corpus content to the extent that the system can intelligently transform the text into higher-order conceptual content at an accuracy level achieved by humans, then it is performing DEEP UNDERSTANDING.

AI Assessment in Clinical Applications

There is a considerable amount of talk about bias in AI as applied to clinical settings and how to eliminate it. This problem has now attracted the attention of standards bodies with a view to legislating how AI systems should be validated and certified.

AI consists of three types of logic: inductive, deductive and abductive.

However, most modern references to AI are really talking about the inductive type better known within the industry as Machine Learning (ML), which allocates data objects into classes based on the statistical distribution of their features.

Bias in machine learning cannot be eliminated as it is an intrinsic aspect of the method. ML takes a sample of data objects whose features (attribute values) and correct classes are known, and trains a predictive classification model on this training data. Hence, any biases in collecting the sample become intrinsic to the model.
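A toy illustration may make the point concrete. The sketch below is an assumption-laden example, not a clinical model: it uses a one-feature nearest-centroid classifier and two invented samples that differ only in which malignant cases happened to be collected. All names and numbers are hypothetical.

```python
from statistics import mean


def train_centroids(samples):
    """samples: list of (value, label). Returns {label: mean value of that class}."""
    by_label = {}
    for value, label in samples:
        by_label.setdefault(label, []).append(value)
    return {label: mean(values) for label, values in by_label.items()}


def predict(centroids, value):
    """Assign the class whose centroid is nearest to the value."""
    return min(centroids, key=lambda label: abs(value - centroids[label]))


# Sample A: malignant cases span early and advanced disease (invented values).
sample_a = [(1.0, "benign"), (3.0, "benign"), (7.0, "malignant"), (9.0, "malignant")]
# Sample B: same benign cases, but only advanced malignant cases were collected.
sample_b = [(1.0, "benign"), (3.0, "benign"), (11.0, "malignant"), (13.0, "malignant")]

model_a = train_centroids(sample_a)
model_b = train_centroids(sample_b)

# The same new case receives a different class depending on the sample used.
case = 6.0
label_from_a = predict(model_a, case)
label_from_b = predict(model_b, case)
```

The learning method is identical in both runs; only the sample changed, and with it the position of the class boundary.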

But other biases apart from the data collection process are also introduced along the pathway of developing a working ML predictive classifier.

I don’t believe you can make learning algorithms bias-free. Just like drugs where there is always a percentage of people who have an adverse reaction, so there will be a percentage of people for whom the classifier prediction will be wrong because of:

a. The incompleteness of the learning sets;
b. The conflicting and contradictory content within the training set;
c. The limitations of the learning model to represent the distributions in the training data;
d. The choice of attributes to represent the training examples;
e. The normalisations made of the training data;
f. The time scope of the chosen training data;
g. The limitations of the expertise of those experts deciding on the gold standard values;
h. The use of “large data” sets that are poorly curated, where “good data” is needed.

If legislators are going to make the AI technology regulation-free, then they should at least require the algorithms to be provided on an accessible web site where “users” can submit their own data to test the algorithms' validity for their own data sets. Users can then judge for themselves whether the specific tool is appropriate for their setting.
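What such a local test might report can be sketched in a few lines. This is a minimal example under assumptions: the vendor model is represented by a hypothetical predict function (here a simple threshold), the local data set is invented, and per-class precision and recall are just one reasonable choice of measure.

```python
from collections import Counter


def local_validation(predict, samples):
    """samples: iterable of (features, true_label).
    Returns {label: (precision, recall)}; None where a ratio is undefined."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for features, truth in samples:
        guess = predict(features)
        if guess == truth:
            tp[truth] += 1
        else:
            fp[guess] += 1   # predicted class gains a false positive
            fn[truth] += 1   # true class gains a false negative
    report = {}
    for label in set(tp) | set(fp) | set(fn):
        predicted = tp[label] + fp[label]
        actual = tp[label] + fn[label]
        precision = tp[label] / predicted if predicted else None
        recall = tp[label] / actual if actual else None
        report[label] = (precision, recall)
    return report


# Hypothetical supplied model: a fixed threshold on a single feature.
def vendor_predict(value):
    return "cancer" if value >= 5.0 else "benign"


# Invented local data whose distribution differs from the vendor's training data.
local_samples = [
    (1.0, "benign"), (2.0, "benign"), (6.0, "benign"),
    (4.0, "cancer"), (7.0, "cancer"), (8.0, "cancer"),
]

report = local_validation(vendor_predict, local_samples)
```

A site would compare figures like these with the performance claimed by the supplier before deciding whether the tool suits its own data.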

Training set data is typically collected in one of two ways.

1. Curated data collection: careful selection and curation of data examples to span a specific range of variables. Its most serious drawback is the manual effort needed to collate and curate the data collection, and hence it is popular for more targeted objectives using smaller training sets.

2. Mass data collection: data is collected en masse from a range of sources, and data elements are captured by a “deep learning” strategy of compiling large feature vectors for each classification object using an automatic method of collation and compilation.

This approach is popular because it can be highly automated and supports belief in the fallacy that more data means better quality results.

What we don’t need is more “big data”, but rather we need more “good data”.

How do we get GOOD DATA?

The delivery of a machine learning processing system into production means a supply of potential learning material is flowing through the system. Any sensible and well-engineered system will have a mechanism for identifying samples that are too borderline to safely classify. We divert those samples into a Manual processing category.
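One common way to implement such a diversion is a confidence threshold on the classifier's output. The sketch below assumes the classifier returns a probability per class; the function name route and the threshold of 0.9 are illustrative assumptions, and in practice the threshold would be chosen from validation data.

```python
def route(probabilities, threshold=0.9):
    """probabilities: {label: probability} from some classifier.
    Returns (label, "auto") when confident enough, else (None, "manual")."""
    label, confidence = max(probabilities.items(), key=lambda kv: kv[1])
    if confidence >= threshold:
        return label, "auto"
    # Too borderline to classify safely: hand the sample to a human.
    return None, "manual"


confident = route({"cancer": 0.97, "not_cancer": 0.03})
borderline = route({"cancer": 0.55, "not_cancer": 0.45})
```

Samples routed to "manual" are exactly the material the following paragraphs propose to reuse for improvement.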

The challenge with the materials in the Manual class is how to utilise them for improvement in the resident system. A good deal of research has gone into the process of Active Learning, which is the task of selecting new materials to add to the training materials from a large set of untrained samples.

There are two major dimensions to this selection process: samples that fall near the class boundaries, and samples whose attribute profiles are significantly different from anything in the training materials.

Automated detection of these two types of samples requires two different analytical methods, known as Uncertainty Sampling and Diversity Sampling respectively. An excellent text on these processes is Robert Munro, Human-in-the-Loop Machine Learning, MEAP Edition Version 7, 2020, Manning Publications.
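The two selection criteria can be illustrated with simple stand-ins. The sketch below is not taken from the book: it uses margin of confidence as one form of uncertainty sampling and distance to the nearest training example as a crude proxy for diversity sampling. The function names and the toy data are assumptions.

```python
import math


def margin_uncertainty(probs):
    """1 - (top probability - runner-up); higher means closer to a class boundary."""
    top, second = sorted(probs, reverse=True)[:2]
    return 1.0 - (top - second)


def novelty(features, training_features):
    """Euclidean distance to the nearest training example;
    higher means less like anything already in the training set."""
    return min(math.dist(features, t) for t in training_features)


def select_for_labelling(pool, training_features, k=1):
    """pool: list of (sample_id, features, class_probabilities).
    Returns (ids of the k most uncertain samples, ids of the k most novel samples)."""
    by_uncertainty = sorted(pool, key=lambda s: margin_uncertainty(s[2]), reverse=True)
    by_novelty = sorted(pool, key=lambda s: novelty(s[1], training_features), reverse=True)
    return [s[0] for s in by_uncertainty[:k]], [s[0] for s in by_novelty[:k]]


training_features = [(0.0, 0.0), (1.0, 1.0)]
pool = [
    ("a", (0.1, 0.1), [0.95, 0.05]),  # familiar and confidently classified
    ("b", (0.5, 0.5), [0.52, 0.48]),  # sits near the class boundary
    ("c", (9.0, 9.0), [0.90, 0.10]),  # confident, yet unlike any training example
]

most_uncertain, most_novel = select_for_labelling(pool, training_features)
```

Sample "c" shows why both methods are needed: the classifier is confident about it, so uncertainty sampling alone would never surface it, yet it lies far from everything the model was trained on.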

GOOD DATA can be accumulated through a strategy of Continuous Process Improvement (CPI) and any worthwhile clinical Machine Learning system will need to have that in place, otherwise the system is trapped in an inescapable universe of self-fulfilling prophecies with no way of learning what it doesn’t know, and should know, to be safe. A clinical ML system without CPI is hobbled by “unknown unknowns” which can lead to errors, both trivial and catastrophic.