How can we assess the value of a work from its Abstract – examples drawn from the AMIA 2022 AI Showcase

The AMIA 2022 AI Showcase has been devised as a 3-stage submission process in which participants will be assessed at submissions 1 and 2 for their progression to the 2nd and 3rd presentations. The stages coincide with the three AMIA conferences:

Informatics Summit, March 21-24, 2022; Clinical Informatics Conference (CIC), May 24-26, 2022; and the Annual Symposium, November 5-9, 2022.

Submission requirements for each stage are:

Stage 1. System description, results from a study of algorithm performance, and an outline of the methods of the full evaluation plan.

Stage 2. Submission to address usability and workflow aspects of the AI system, including prototype usability or satisfaction, algorithm explainability, implementation lessons, and/or system use in context.

Stage 3. Submissions to summarize the comprehensive evaluation to include research from prior submissions with new research results that measure the impact of the tool.

Now that we have the abstracts for the Stage 1 conference, we can analyse the extent to which the community of practice could satisfy those criteria. We can also identify how the abstracts fall short of the submission requirements, and so help the authors close the gap between their submission and the putative requirements. The authors can then ensure their poster or presentation fills some of those gaps where they have the information.

There is also the open question of whether the “call to arms” in the Showcase was effective and how it might be improved in the future.

A succinct definition of the desirable contents of an Abstract might well be:

  1. What was done?
  2. Why was it done?
  3. What are the results?
  4. What is the practical significance?
  5. What is the theoretical significance?

The Clinical AI Showcase has now provided a set of abstracts for the presentations and posters, so how might we rate the informativeness of the abstracts against these criteria?

A review of the 22 abstracts has been conducted. Table 2 summarises each abstract against these five content attributes, with items 4 and 5 merged into a single significance attribute. The summaries have been designated by their key themes, which are collated in Table 1.

Using the publicised criteria for each of the three Stages of the AI Showcase, it would appear that only 9 of the submissions are conformant, i.e. the papers categorised as Extended ML model, Development of ML model, and Developed and Deployed. It is an open question whether the abstracts classed as Comparison of ML models fulfil the Stage 1 criteria. The authors would do well to clarify the coverage of their work in their full presentation/poster.
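The count of 9 can be checked against the category counts in Table 1. The following is a minimal sketch in Python; the dictionary name, the function name and the shortened category labels are illustrative choices made here, not part of the Showcase material.

```python
# Category counts copied from Table 1 of this post.
# "Developed and Deployed" is shortened from
# "Developed and Deployed – No operational assessment".
table1_counts = {
    "Design only": 3,
    "Comparison of ML models": 5,
    "Extended ML model": 1,
    "Development of ML model": 5,
    "Developed and Deployed": 3,
    "Methods study only": 1,
    "Abstract Unavailable": 2,
}

# Categories this post treats as conformant with the Stage 1 criteria.
CONFORMANT = (
    "Extended ML model",
    "Development of ML model",
    "Developed and Deployed",
)


def conformant_total(counts, categories=CONFORMANT):
    """Sum the counts of the categories judged conformant."""
    return sum(counts[name] for name in categories)


print(conformant_total(table1_counts))  # 1 + 5 + 3
```

If the Comparison of ML models group were also judged to meet Stage 1, adding that label to the tuple would raise the total by its 5 abstracts.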

The other categories appear to exist in a certain type of limbo. The Methods study only group appears out of place given the objectives of the AI Showcase. The Design only group would appear to be stepping out somewhat prematurely, although the broader advertising for the Showcase certainly encouraged early stage entries. As mentioned in my previous blog (blog reference), it would be exceedingly difficult for teams to deliver the purported content by the deadlines for Stages 1 and 2 if the ideas for the project are at an embryonic design stage.

Teams with nearly or fully completed projects were able to submit works that also fulfilled many of the criteria for Stage 2 of the Showcase. The projects in the Developed and Deployed group had progressed as far as deployment, but none reported usability or workflow aspects, with the exception of one paper that claimed its solution was installed at the bedside.

Two abstracts did not describe clinical applications of their ML but rather secondary use of clinical data; both of these papers were NLP projects.

Good Abstract Writing

Most abstracts provided reasonable descriptions of the work they had done or intended to do. It was rare for abstracts to describe their results or the significance of their work. This can undoubtedly be corrected in Stages 2 or 3 of the Showcase, where the authors are required to report on their assessment of their tools' practical use. Only one paper provided information on all four desirable abstract content items.

What Can the Showcase Learn and Do Better

This Showcase has the admirable objective of encouraging researchers and clinical teams to perform AI projects to a better quality and in a more conclusive manner. However, its Stages cover a cornucopia of objectives set out in a timeline that is unrealistic for projects just starting and poorly co-ordinated for projects near to or at completion. This is well evidenced by the 40+ ML projects included in the Conference programme that are not part of the Showcase. If the Showcase is to continue, as it should, then a more considered approach to staged objectives, encouragement of appropriate teams, and more thoughtful timing would be a great spur to its future success.

Might I Humbly Suggest (MIHS) that a more refined definition of the stages be spelled out so that

            a. groups just starting ML projects are provided with more systematic guidelines and milestones; and

            b. groups in the middle of projects can ensure that they have planned for an appropriate level of completeness to their work.

Stage 1. What is the intended deliverable and Why it is necessary – Which clinical community has agreed to deployment and assessment.

Stage 2. What was done in the development – How the deliverable was created and what bench top verifications were conducted.

Stage 3. Deployment and Clinical Assessment – What were the issues in deployment. What were the methods and results of the clinical assessment. What arrangements have been made for the maintenance and improvement of the deliverable.

This definition excludes groups performing ML projects purely for their own investigative interest without a specific participating clinical community. The place for their work is within the general programme of AMIA conferences. It also means that, strictly speaking, only 3 of the current acceptances would fit this definition for Stage 1, although 3 of the others could be construed to fit it.

A concerning factor in the current timeline design is the short 2-month span between the deliverables for Stages 1 and 2. A group would pretty much have to have completed Stage 2 to submit to Stage 1 and be ready to submit to Stage 2 in 2 months.
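To put rough numbers on that span, the sketch below computes the gaps between the three conferences in Python. It assumes the conference start dates listed at the top of this post stand in for the stage deliverables; the actual submission deadlines are not given here, so the figures are indicative only. The names `conference_starts` and `gaps_in_days` are illustrative.

```python
from datetime import date

# Conference start dates as listed at the top of this post.
# Assumption: these are a proxy for when each stage's work is presented;
# the real submission deadlines fall earlier and are not stated here.
conference_starts = [
    ("Stage 1: Informatics Summit", date(2022, 3, 21)),
    ("Stage 2: Clinical Informatics Conference", date(2022, 5, 24)),
    ("Stage 3: Annual Symposium", date(2022, 11, 5)),
]


def gaps_in_days(events):
    """Return the number of days between each consecutive pair of events."""
    return [
        (later[1] - earlier[1]).days
        for earlier, later in zip(events, events[1:])
    ]


for (name, _), gap in zip(conference_starts[1:], gaps_in_days(conference_starts)):
    print(f"{gap} days before {name}")
```

On these dates the gap before Stage 2 is roughly nine weeks, against more than five months before Stage 3.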

Lastly, the cost of attending 3 AMIA conferences in one year would be excessively taxing, especially for many younger scholars in the field. AMIA should provide a two-thirds discount on each conference to those team representatives who commit to entering the Showcase. This would be a great encouragement to get more teams involved.

| Paper Topic | N |
|---|---|
| Design only | 3 |
| Comparison of ML models | 5 |
| Extended ML model | 1 |
| Development of ML model | 5 |
| Developed and Deployed – No operational assessment | 3 |
| Methods study only | 1 |
| Abstract Unavailable | 2 |

Table 1. Category of key themes on the 22 Abstracts accepted into the AI Showcase.

| Paper | What was done | Why was it done | What are their results | What is the significance | Comments | Category |
|---|---|---|---|---|---|---|
| Overgaard – CDS Tool for asthma – Presentation | Design of desirable features of AI. | To make review of the patient record more efficient. | Unspecified | Unknown – no clinical deployment. | Paper is about the putative design for a risk assessment tool and data extraction from the EHR. | Design only |
| Rossi – 90-day Mortality Prediction Model – Presentation | Deployed 90-day mortality prediction model. | To align patient preferences for advance care directives with therapeutic delivery, and improve rates of hospice admission and length of stay. | Unspecified | Unknown – clinical deployment planned. | Model is partially implemented with operationally endorsed workflows. | Development of ML model – Planned Deployment |
| Estiri – Unrecognised Bias in COVID Prediction Models – Presentation | Investigation of four COVID-19 prediction models for bias using an AI evaluation framework. | AI algorithm biases could exacerbate health inequalities. | Unspecified | Unknown – no clinical deployment. | Two bias topics are defined: (a) if the developed models show biases; (b) has the bias changed over time for patients not used in the development. | Comparison of ML models |
| Liu – Explainable ML to predict ICU Delirium – Presentation | A range of features were identified and a variety of MLs evaluated. Three prediction models were evaluated, at 6, 12 and 24 hours. | To more accurately predict the onset of delirium. | Described but numerics not provided | Implied due to described implementation design | Paper describes some aspect of all characteristics but not always completely. | Comparison of ML models |
| Patel – Explainable ML to predict periodontal disease – Presentation | “New” variables added to existing models revealed new associations. | Discover new information about risk factors. | Described but associations not provided | Unknown. No clinical assessment of the discovered associations, no clinical deployment. | AI methods not described. | Extended ML model |
| Liang – Wait time prediction for ER/ED – Virtual | Develop ML classifier (TensorFlow) to predict ED/ER wait times. Training and test sets described. | No explanation | Unspecified | Unknown – clinical deployment status unclear | Focus is on the ML processes with little other information. | Development of ML model |
| Patrick – Deep Understanding for coding pathology reports – Virtual | Built a system to identify cancer pathology reports and code them for 5 data items (Site, Histology, Grade, Behaviour, Laterality). | California Cancer Registry requested an automated NLP pipeline to improve production line efficiencies. | Various accuracies provided | Improvements over manual case identification and coding provided. | The work of this blog author. | Developed and Deployed – no operational assessment – Not clinical application |
| Eftekhari/Carlin – ML sepsis prediction for hematopoietic cell transplant recipients – Poster | Deployed an early warning system that used EMR data for HCT patients. Pipeline of processing extending to clinical workflows. | Sepsis in HCT patients has a different manifestation to sepsis in other settings. | Only specified result is deployment | Unknown – no clinical assessment. | Deployment is described showing its complexity. No evaluations. | Developed and Deployed – no operational assessment |
| Luo – Evaluation of Deep Phenotype concept recognitions on external EHR datasets – Poster | Recognises Human Phenotype Ontology concepts in biomedical texts. | No explanation | Unspecified | Unknown – no clinical deployment. | Abstract is the least informative. One sentence only. | Development of ML model – Not clinical application |
| Pillai – Quality Assurance for ML in Radiation Oncology – Poster | Five ML models were built and a voting system devised to decide if a radiology treatment plan was Difficult or Not Difficult. Feature extraction was provided. | To improve clinical staff's scrutiny of difficult plans to reduce errors downstream. Feature extraction to improve interpretability and transparency. | Unspecified | System planned to be integrated into clinical workflow. | Mostly about ML approach but shows some forethought into downstream adoption. | Comparison of ML models – Deployment planned |
| Chen – Validation of prediction of Age-related macular degeneration – Poster | ML model to predict later AMD degeneration using 80K images from 3K | To predict the risk of progression to vision-threatening late AMD in subsequent years. | Unspecified | Unknown – no clinical deployment. | Focus is on the ML processes with little other information. | Development of ML model |
| Saleh – Comparison of predictive models for paediatric deterioration – Poster | Plan to develop and implement ML model to augment prediction of paediatric clinical deterioration within the clinical workflow. | Detecting deterioration in paediatric cases is effective at only 41% using existing tools. | Unspecified – planning stage only | System planned to be integrated into clinical workflow. | Early conceptualisation stage. Well framed objective and attentive to clinical acceptability. No framing of datasets, variables and ML methods. | Design only |
| Shah – ML for medical education simulation of chest radiography – Poster | Not available | | | | | Abstract Unavailable |
| Mathur – Translational aspects of AI – Poster | Evaluation of the TEHAI framework compared to other frameworks for AI with emphasis on translational and ethical features of model development and its deployment. | A lack of standard training data and the clinical barriers to introducing AI into the workplace warrant the development of an AI evaluation framework. | Qualitative assessment of 25 features by reviewers. | No in vitro evaluation – only qualitative assessment | This is an attempt to improve the evaluation criteria we should be using on AI systems. It fails to make a convincing case that it is a better method than alternatives. | Methods study only |
| Yu – Evaluating Pediatric sepsis predictive model – Poster | Not available | | | | | Abstract Unavailable |
| Tsui – ML prediction for clinical deterioration and intervention – Poster | Built an ML for an intensive care warning system for deterioration events and pharmacy interventions. It uses bedside monitor and EHR data, providing results in real time. | No explanation | Unspecified | Unknown – no clinical assessment. | Operational system. Only a description of the deliverables – no evaluations. | Developed and Deployed – no operational assessment |
| Rasmy – Evaluation of DL model for COVID-19 outcomes – Poster | A DL algorithm developed to predict, for COVID-19 cases on admission: in-hospital mortality, need for mechanical ventilation, and long hospitalization. | No explanation | Unspecified – no numerics supplied | Unknown – no clinical deployment | Seems to concentrate solely on the DL modelling. | Development of ML model |
| Wu – ML for predicting 30-day cancer readmissions – Poster | ML models built to identify 30-day unplanned admissions for cancer patients. | Unplanned cancer readmissions have significantly poorer outcomes, so the aim is to reduce them. | No results, but promised in the poster/presentation | Unknown – no clinical assessment. | No ML details, just a justification in the Abstract. | Comparison of ML models |
| Mao – DL model for Vancomycin monitoring – Poster | A DL pharmacokinetic model for Vancomycin was compared to a Bayesian model. | To provide a more accurate model of Vancomycin monitoring. | The DL model performed better than the Bayesian model. No numerics provided. | Unknown – no clinical assessment. | Focus is on the ML processes with little other information. | Comparison of ML models |
| Ramnarine – Policy for Mass Casualty Trauma triage – Poster | Design of a strategy to build an ML for ER/ED triage and retriage categorisation for mass casualty incidents. | To fill a void in national standards for triage and retriage. | Unspecified – design proposal only | Unknown – no clinical assessment of practicality of acceptance. | This is a proposal with no concrete strategy for implementation or for what would be used in the investigation, in terms of either data source or ML strategy type. | Design only |

Table 2. Summary of the 22 Abstracts according to 4 content attributes ascribed for good abstracts.