The Search for Reliable and Deployable Clinical AI

The recent announcement by AMIA of the “2022 Artificial Intelligence Evaluation Showcase” is no doubt welcomed by research experimenters but will it provide revelations in how to conduct better and more effective Clinical AI studies to produce truly valuable operational deployments? (See amia-2022-artificial-intelligence-evaluation-showcase/artificial-intelligence)

The showcase is divided into 3 phases with results from each phase to be presented at traditional conferences conducted by AMIA.
Phase I involves presenting at the AMIA 2022 Informatics Conference held in March a “system description, results from a study of algorithm performance, and an outline of the methods of the full evaluation plan” .
Phase 2 involves a presentation at the AMIA 2022 Clinical Informatics Conference in May “to address usability and workflow aspects of the AI system, including prototype usability or satisfaction, algorithm explainability, implementation lessons, and/or system use in context” .
Phase 3 involves presenting a submission at the AMIA 2022 Annual Symposium in November “to summarize the comprehensive evaluation to include research from prior submissions with new research results that measure the impact of the tool”.

So in coalescing these three statements and drawing on the organisers other comments I would like to reframe their words into these prospective and admirable outcomes:
a. improve scale and scope of the evaluation of AI tools so that we get fewer limited and poor quality AI publications; and,
b. encourage the development of multidisciplinary teams with a wider range of expertises that would improve the quality of AI evaluation.

However what might be the unexpected outcomes of the Showcase?
Certainly some researchers will gain publication for their work which they will welcome and is not without merit of itself, but if that is the only objective then why run a special Showcase. Why not use the normal mechanisms AMIA has available, such as run a special issue of JAMIA on AI/ML. Will the three phase submission and presentation format provide something that normal publication channels don’t provide – it’s not self evident from the published promotional materials for the Showcase. If the objective is to promote research into putative AI solutions for clinical data processing tasks then it is not anything different to the publication avenues currently available. If the aim is to bring forward the adoption of AI technologies into working environments then there are a number of unspoken obstacles not addressed by the Showcase call to arms.

So I ask the question will the Showcase create motivations that will drive us in the wrong direction for the improvement of productive Clinical AI solutions. That may seem unfair to the organisers who no doubt are working hard for legitimate outcomes, but the world of AI development has a number of deficits that may be reinforced rather than diminished by this well intentioned initiative.

You might wonder why am I so disbelieving that this honourable initiative will provide useful outputs that will push the Clinical AI industry further in a positive way. You might ask why I feel compelled to make a submission to the Showcase yet deep down think the exercise will be futile and a waste of time. I am buoyed in my misgivings by the recent article in the MIT Technology Review with the byline : “Hundreds of the AI tools have been built to catch COVID. None of them helped.” As well as my previous review of the topic (see

In this conversation we will restrict our interpretation of “AI” to “supervised machine learning” (ML) the most common form of AI technology in use for analysing clinical data, and we draw on our experience in Clinical Natural Language Processing (CNLP) to formulate our analysis. It will be up to others to decide how applicable this commentary is to their own ML contexts.

Here are some of my musings over the obstacles facing the Clinical AI industry that it would be helpful for the showcase to specifically address.

1. DOES THE DATA MATCH THE OBJECTIVES. Research projects exploring the use of ML techniques for clinical case data is a FAR CRY from building industrial quality technology that clinicians find trustworthy enough to use. Research projects are conducted under a number of limitations which often are not clearly understood by the research teams. Typically, the data set used for training the ML models are flawed and inadequate in terms of the project objectives without obviously being so. The data can be flawed because it doesn’t cover all the corner cases of the problem space; that is, where the training sample is not properly representative of the problem space. This commonly occurs when the data has been provided opportunistically rather than selectively according to the project objectives.
The data can also be flawed because the values are poorly expressed for ML purposes. For example, in one data set of GP notes the medicines files held the precise pharmacy descriptions, which caused all data values to be virtually unique across 60K records and therefore unsuitable as a classifying feature. The remediation required a physician to go through the records and create values of {normal, weaker than normal, stronger than normal} as a surrogate for the prescription details that were thought to be meaningful to the project objective.
Providing a justification of each variable and its domain range used in a model would be a useful validation criterion.

2. IS THE ML ALGORITHM APPROPRIATE FOR THE TASK. Might the showcase lead to a plethora of studies using the current popular fad of Deep Learning which can be inappropriate in many health circumstances. Deep Learning is metaphorically a heavyweight technology that suits the needs of steel workers assembling a new skyscraper, whereas many clinical case studies need to be assembled with a watchmakers toolkit of minute components assembled with delicacy to achieve the highest accuracy required to effectively support clinical work. Deep learning techniques have been reported as useful in some settings, especially imaging, and are said to have great power because they are trained on very large data sets, but it also begs a number of questions e.g. How are models corrected when small but important errors are found? How are gold standard values established to be 99.99% pure gold (better than 24 karat of 99.9%) for such large data sets? How does Deep Learning incorporate the specific knowledge, rules and standards of professional practices when those practices vary from year to year especially when the large training set only becomes available year(s) after the fact? How does it correctly identify the extremely rare event (like certain diseases) that are definitionally much like a common event.
As a generalisation Deep Learning provides the least transparency of all ML algorithms, yet at the same time as a counterpoint, there are researchers endeavouring to travel in the opposite direction and increase the explicability of AI applications. See the $1m prize awarded to Cynthia Rudin of Duke University, for research into ML systems that people can understand (
Assessing the ML algorithm for its appropriateness to the applied task would be a useful evaluation criterion.

3. BAD DATA AND POWERFUL ALGORITHMS JUST SET US BACK. Researchers can be pressed to use data that is available on hand and so either tackle a problem of not great value or misinterpret the meaning, value and generalisation of their outcomes. This situation can lead to routine processes being used on poor quality data both in its definition and in its gold standard training classes so that whilst results are produced their value is limited. It must be accepted that good researchers (especially young researchers) will learn from these misguided efforts and go on to do better work next time round, so it has good educational and praxis value, but the interim impact can be of limited research value and so waste a great deal of time of all the external people in its assessment chain when presented for publication or deployment. Assessing the meaningfulness of the data in the context of the problem space would be a useful assessment criteria. To give them their due the showcase organisers might well have faith that will be achieved in Phases 2 and 3 of their programme.

4. RESEARCH EFFORT DOES NOT EQUAL SATISFACTORY INDUSTRIAL PERFORMANCE. The requirements for producing industrial quality technology is often beyond the competency and experience of research teams. This can be even more true for research teams embedded in large corporations who treat their techniques as the only way to resolve the task and like a hammer treat everything in the world as a nail. (see for an example). A working solution that is costly for staff to integrate into existing workflows requires considerable planning for adoption but even more ingenuity and experience to provide the best software engineering techniques. The supply of the source data for the working solution has to be secured and monitored on a daily basis once the operational system is in place. The continuous storage of incoming data, its efficient application to the ML algorithms and the delivery of outputs to the points of usage are all complex engineering and organisational matters, which researchers are commonly insensitive to if not entirely inexperienced with. The software engineering of complex workable solutions is just as important to successful industrial quality solutions as the ML algorithms, data sampling and model optimisation, but is invariably ignored in research publications and by the researchers themselves. The cry so often is – “It is all about the data” – which is so far from the truth for real solutions.
Assessing the software engineering development and maintenance requirements of a proposed AI solution would be a useful evaluation criteria.

5. WILL A CLINICIAN CHANGE THEIR EXPERT OPINION. The produced systems are rarely tested on the real criterion of success – will a clinician actually use this technology to correct their own expert opinion? Just asking them if they approve of the solution is not sufficient. ML projects are normally tested for their accuracy, where the most common test used, 10-fold cross validation (10CV), is probably the weakest test that could be applied. In our work we ignore it as a test as it provides little information that could be the basis of action to improve processing. Even experiments that use a held out set are little better. The best computational test is validation against a post implementation test, that is, new data that has never been seen and is drawn from the world of real practice. This approach then necessitates more infrastructure and ongoing commitment to improving the solution – has the user client committed to that effort and for how long?

However, the ultimate test is the client. Will they suspend their own judgement to accept or even consider the judgement of the delivered tool? If not, then not all is lost. An ML system designed to replace human tasks can readily become an advisory to the human, prodding them to think of things they might not have otherwise thought of. But also one has to be careful not to overreach – the recent AMIA Daily Downloads (25th November 2021 EADT) headlines “Epic’s sepsis algorithm may have caused alert fatigue with 43% alert increase during pandemic”.
Assessing the extent to which clinicians will revise their opinions would be a useful verification criteria.

6. AI IS USED IN MANY HEALTH SETTINGS THAT ARE NOT CLINICAL CARE. Many AI/ML applications that are used in health are for Public Health purposes or other secondary usage. These will not be able to show improved health outcomes, as required by Phase 3 of the Showcase, but rather they contribute to greater efficiency and productivity in the workplace. At best they represent second order effects on health outcomes. The narrowness of the Showcase call for participation appears to be based on a limited view of the breadth of ML applications in the health sector as distinct to the clinical sector. One just needs to look at the papers presented in past AMIA conferences to realise there is probably more applications of ML to secondary use of clinical data than primary use in clinical practice. Cancer Registries are a good example of the secondary use of most and ideally all pathology reports generated ACROSS the whole country that describe some aspect of cancer cases. If registries of all shades and persuasion are to keep up with the increasing count of patients and methods of treatment then Clincal NLP using ML will be a vital tool in their analytics armoury.
Assessing the extent to which an AI technology makes work more efficient or reliable would be a useful productivity criteria.

7. PRESS REPORTS IGNORE ACCURACY. It is frustrating to read press reports that laude the “accomplishments” of an AI application without any content on the reliability of the application. Errors mean different things in different contexts. The meaningfulness of false positives(FPs) and false negatives(FNs) are generally undersold and often ignored. In clinical work an FP can have as serious a consequence as an FN as it means a patient receives inappropriate care endangering their health, even life perhaps, as much as an FN which would lead to a missed diagnosis and failure to deliver appropriate care. However in population based health applications usually FNs need to be minimised and a certain higher level of FPs can be tolerated as a compromise to minimise FNs. Assessment of the importance of FNs and FPs to the acceptability of the AI application would be a useful reliability criterion.

While I feel the motivation and projected outcomes for the Showcase are hazy, individuals will have to asses for themselves if it sufficiently addresses the difficulties in the field for participation to provide reciprocal value for the gargantuan effort and cost needed by contributors to be involved. It is the question we are asking ourselves at the moment.

Just making something different isn’t sufficient,

someone else has to use it meaningfully for it to have value.