Deep Understanding – Where does it come from?

Modern language processing technologies have arisen from two separate pathways in algorithm history.

The first language processing efforts in the 1960s were conducted by linguists using early computers to process linguistic structures. The linguists built algorithms that identified parts of speech and parsed sentences according to language structure. As this field developed, it became known as Computational Linguistics, and it continues today as a professional research community.

In the 1990s computer scientists introduced machine learning methods into the field of computational linguistics, following the spirit of corpus linguistics, in which the statistical characteristics of corpora are studied. These methods, known as Statistical Natural Language Processing, became an important subfield of what had also become known as Natural Language Processing (NLP).

In the second half of the 1990s the rise of Google spurred the adoption of Text Mining, an approach that ignored the linguistic character of language and worked on just the text strings. It has become fashionable because programmers can engage with the processing strategies as enthusiasts without needing to engage with the field of linguistics. The approach has become so popular that it has assumed the title of NLP, so today NLP is a title that conflates two methodological paradigms.

Google has also fuelled a new approach since the early 2010s with its development of neural net machine learning methods called Deep Learning. This fits with their general approach of using extremely large data sets to characterize language texts. Whilst this approach has shown some useful features it is not universally applicable and some of its limitations are now beginning to emerge.

The two heritages of language processing are now Computational Linguistics and Text Mining, each having separate origins and algorithmic philosophies, and each now claiming the rubric of Natural Language Processing.

However, algorithmic processing of text is not an end in itself but rather part of a journey to functionality that is of value to users. The two most dominant user needs are to recognize the nature of a document (aka document classification) and to identify entities of interest within the text (aka Semantic Entity Recognition). But the meaningfulness of this processing comes from how the NLP analysis is then utilized, not from the NLP itself. It is this use of the NLP analysis, whether it comes from the computational linguistics paradigm or the text mining paradigm, that creates the value for the user.

DEEP UNDERSTANDING takes its heritage from computational linguistics and, for processing clinical documents, moves beyond text mining methods by:

  • coalescing grammatical understanding, semantic entity recognition, extensive clinical knowledge stores, and machine learning algorithmic power,
  • so as to transform the native textual prose into a meaningful conceptual realm of interest, mimicking the work of the people responsible for interpreting the clinical texts,
  • at an accuracy level equal to or better than the human experts who do the task,
  • while knowing its own limitations and passing the task back to the human processor when it cannot complete the task accurately enough, and
  • providing an automatic feedback mechanism that identifies potential errors, improving performance as the body of processing materials grows, so as to sustain Continuous Process Improvement.

Figure: Deep Understanding 4-stage processing system for coding pathology reports to ICD-O-3 and the NAACCR Rule book, with a recursive cycle of Active Learning for Continuous Process Improvement.

An example is the work we do for cancer registries that need to translate cancer pathology reports into an agreed set of codes according to an international coding system known as the International Classification of Diseases – Oncology Version 3 (ICD-O-3).

Our Deep Understanding pipeline consists of 4 major components as shown in the diagram:

1. A machine learning document classifier which determines if a pathology report is identified as a cancer report or not.

2. A clinical entity recognizer, itself a machine learner, that identifies forty-two different semantic classes of clinical content needed for correctly coding the report.

3. A coding inductive inference engine which holds the knowledge of the ICD-O-3 codes and many hundreds of prescribed rules for coding. It receives access to the text annotated for clinical entities and applies computational linguistics algorithms to parse the text and map the entities to the classification content to arrive at ICD-O-3 codes. The task is complicated by needing to code for 5 semantic objects:

Semantic Object | Number of Codes
Body site for the location of the cancer in the body | 300+
Disease histology | 800+
Behaviour of the cancer | 5
Grade or severity of the disease | 20+
Laterality, or side of the body where the disease is located, if a bilateral organ | 6

4. A process to cycle knowledge learnt from the operational use of the pipeline process back to the machine learning NLP components to maintain Continuous Process Improvement (CPI) or Continuous Learning.

Each process has been periodically retrained and tuned over time to attain an accuracy that rivals, and in some respects betters, human coding performance.
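To make the flow between the four components concrete, here is a minimal sketch in Python. It is not our production code: the function names, the `Coding` and `Result` classes, and the toy stand-in components are all assumptions made up for illustration, and the code strings are placeholders rather than real ICD-O-3 codes.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class Coding:
    """The five semantic objects the coding stage must produce."""
    site: str
    histology: str
    behaviour: str
    grade: str
    laterality: str


@dataclass
class Result:
    is_cancer: bool
    coding: Optional[Coding] = None
    needs_manual_review: bool = False


def run_pipeline(
    report: str,
    classify: Callable[[str], bool],
    recognise: Callable[[str], Dict[str, str]],
    code: Callable[[Dict[str, str]], Optional[Coding]],
    feedback_queue: List[str],
) -> Result:
    # Stage 1: document classification (cancer report or not).
    if not classify(report):
        return Result(is_cancer=False)
    # Stage 2: clinical entity recognition.
    entities = recognise(report)
    # Stage 3: map the entities to codes; None means "could not code safely".
    coding = code(entities)
    if coding is None:
        # Stage 4: keep the report so it can feed later retraining.
        feedback_queue.append(report)
        return Result(is_cancer=True, needs_manual_review=True)
    return Result(is_cancer=True, coding=coding)


# Toy stand-ins for the real components (hypothetical, illustration only).
def toy_classify(report: str) -> bool:
    return "carcinoma" in report.lower()


def toy_recognise(report: str) -> Dict[str, str]:
    return {"site": "liver"} if "liver" in report.lower() else {}


def toy_code(entities: Dict[str, str]) -> Optional[Coding]:
    if "site" not in entities:
        return None
    # Placeholder strings, not real ICD-O-3 codes.
    return Coding("SITE", "HISTOLOGY", "BEHAVIOUR", "GRADE", "LATERALITY")


queue: List[str] = []
coded = run_pipeline("Liver biopsy: carcinoma.", toy_classify, toy_recognise, toy_code, queue)
manual = run_pipeline("Carcinoma, site not stated.", toy_classify, toy_recognise, toy_code, queue)
benign = run_pipeline("Benign naevus.", toy_classify, toy_recognise, toy_code, queue)
```

The point of the shape is that every report ends in one of three states: not a cancer report, coded automatically, or handed back to a person, with the handed-back reports retained as learning material.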

While Deep Learning is a new and innovative method for learning the characteristics of a large corpus of texts it can be and usually is only an NLP subsystem of a larger processing objective. If that objective is to compute an understanding of the corpus content to the extent it can intelligently transform the text into higher order conceptual content to an accuracy level achieved by humans, then it is performing DEEP UNDERSTANDING.

AI Assessment in Clinical Applications

There is a considerable amount of talk about bias in AI as applied to clinical settings and how to eliminate it. This problem has now attracted the attention of standards bodies with a view to legislating how AI systems should be validated and certified.

AI consists of three types of logic: inductive, deductive and abductive.

However, most modern references to AI are really talking about the inductive type better known within the industry as Machine Learning (ML), which allocates data objects into classes based on the statistical distribution of their features.

Bias in machine learning cannot be eliminated as it is an intrinsic aspect of the method. ML uses a sample of data objects for which it has their features or attribute values and knows their correct classes, and so trains a classification predictive model on this training data. Hence, biases in collecting the sample are intrinsic.

But other biases apart from the data collection process are also introduced along the pathway of developing a working ML predictive classifier.

I don’t believe you can make learning algorithms bias-free. Just like drugs where there is always a percentage of people who have an adverse reaction, so there will be a percentage of people for whom the classifier prediction will be wrong because of:

a. The incompleteness of the learning sets;
b. The conflicting and contradictory content within the training set;
c. The limitations of the learning model to represent the distributions in the training data;
d. The choice of attributes to represent the training examples;
e. The normalisations made of the training data;
f. The time scope of the chosen training data;
g. The limitations of the expertise of those experts deciding on the gold standard values;
h. The use of “large data” sets that are poorly curated, where “good data” is needed.

If legislators are going to make the AI technology regulation-free, then they should at least require the algorithms to be provided on an accessible web site where “users” can submit their own data to test its validity for their own data sets. Then the users can have confidence or not that the specific tool is appropriate for their setting.

Training set data is typically collected in one of two ways.

1. Curated data collection: careful selection and curation of data examples to span a specific range of variables. Its most serious drawback is the manual labour needed to collate and curate the data collection, and hence it is popular for more targeted objectives using smaller training sets.

2. Mass data collection: data is collected en masse from a range of sources and data elements are captured by a “deep learning” strategy of compiling large feature vectors for each classification object using an automatic method of collation and compilation.

This approach is popular because it can be highly automated and supports belief in the fallacy that more data means better quality results.

What we don’t need is more “big data”, but rather we need more “good data”.

How do we get GOOD DATA?

The delivery of a machine learning processing system into production means a supply of potential learning material is flowing through the system. Any sensible and well-engineered system will have a mechanism for identifying samples that are too borderline to safely classify. We divert those samples into a Manual processing category.
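One simple way to picture the diversion of borderline samples is a confidence threshold on the classifier's output. The sketch below assumes the classifier returns a probability per class; the function name `route` and the 0.9 threshold are illustrative assumptions, not values from our system.

```python
from typing import Dict, Tuple


def route(class_probs: Dict[str, float], min_confidence: float = 0.9) -> Tuple[str, str]:
    """Return ("auto" or "manual", best label) for one classified sample."""
    # Pick the most probable class and its probability.
    label, prob = max(class_probs.items(), key=lambda kv: kv[1])
    # Too borderline to classify safely: divert to the Manual category.
    if prob < min_confidence:
        return ("manual", label)
    return ("auto", label)


confident = route({"cancer": 0.97, "non-cancer": 0.03})
borderline = route({"cancer": 0.55, "non-cancer": 0.45})
```

In practice the threshold would be tuned against the error rate the registry is prepared to accept, and a raw model probability is only a rough proxy for true confidence.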

The challenge with the materials in the Manual class is how to utilise them for improvement in the resident system. A good deal of research has gone into the process of Active Learning, which is the task of selecting new materials to add to the training materials from a large set of untrained samples.

There are two major dimensions to this selection process: samples that fall near the class boundaries, and samples that represent attribute profiles significantly different to anything in the training materials.

Automated detection of these two types of samples requires two different analytical methods, known as Uncertainty Sampling and Diversity Sampling respectively. An excellent text on these processes is Robert Munro, Human-in-the-Loop Machine Learning, MEAP Edition Version 7, 2020, Manning Publications.
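As a rough illustration of the two ideas, the sketch below scores a sample in two ways: a margin-based uncertainty score (how close the top two class probabilities are) and a novelty score (distance to the nearest training example). These are just one common choice for each family, picked here for brevity; the function names and the toy feature vectors are assumptions, and the book covers many more refined variants.

```python
import math
from typing import List, Sequence


def margin_uncertainty(probs: Sequence[float]) -> float:
    """Uncertainty Sampling: 1 - (top - second). Higher means nearer a class boundary.

    Assumes at least two class probabilities are supplied.
    """
    top, second = sorted(probs, reverse=True)[:2]
    return 1.0 - (top - second)


def novelty(sample: Sequence[float], training: List[Sequence[float]]) -> float:
    """Diversity Sampling: Euclidean distance to the nearest training example.

    Higher means the sample looks less like anything already trained on.
    """
    return min(math.dist(sample, t) for t in training)


# Toy data, for illustration only.
training_vectors = [[0.0, 0.0], [10.0, 10.0]]
on_boundary = margin_uncertainty([0.5, 0.5])
clear_cut = margin_uncertainty([1.0, 0.0])
distance = novelty([3.0, 4.0], training_vectors)
```

Samples from the Manual class that score highly on either measure are the natural first candidates to annotate and add to the training set.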

GOOD DATA can be accumulated through a strategy of Continuous Process Improvement (CPI) and any worthwhile clinical Machine Learning system will need to have that in place, otherwise the system is trapped in an inescapable universe of self-fulfilling prophecies with no way of learning what it doesn’t know, and should know, to be safe. A clinical ML system without CPI is hobbled by “unknown unknowns” which can lead to errors, both trivial and catastrophic.

Deficiencies in Clinical Natural Language Processing Software – A Review of 5 Systems

We have recently assessed the accuracy of Clinical NLP software available through either open source projects or commercial demonstration systems at processing pathology reports. The attached whitepaper discusses the twenty-eight deficiencies we observed in our testing of five different systems.

Deficiencies in Current Practices of Clinical Natural Language Processing – White Paper

Our analysis is based on the need for industrial strength language engineering that must cope with a greater variety of real-world problems than that experienced by research solutions. In a research setting, users can tailor their data and pre-processing solutions to answer a very specific investigative question, unlike real-world usage, where there is little or no control over input data. As a simple example, in a language engineering application the data could be delivered in a standard messaging format, say HL7, that has to be processed no matter what vagaries it embodies. In a research project that data could be curated to overcome the uncertainties created by this delivery mechanism by removing the HL7 components before the CNLP processing is invoked, a fix not available in a standard clinical setting.
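To give a feel for what "removing the HL7 components" involves, here is a deliberately naive sketch for HL7 v2 messages, where segments are separated by carriage returns, fields by "|", and the observation value sits in the fifth field of an OBX segment. The function name and the sample message are made up for illustration, and the sketch ignores escape sequences, repeated fields and non-default delimiters, which are exactly the vagaries a production system cannot ignore.

```python
def extract_report_text(hl7_message: str) -> str:
    """Collect the OBX-5 (observation value) field of every OBX segment."""
    lines = []
    # Tolerate messages that arrive with newlines instead of carriage returns.
    for segment in hl7_message.replace("\n", "\r").split("\r"):
        fields = segment.split("|")
        if fields[0] == "OBX" and len(fields) > 5:
            lines.append(fields[5])
    return "\n".join(lines)


# A made-up message, for illustration only.
sample = "\r".join([
    "MSH|^~\\&|LAB|HOSPITAL|REGISTRY|STATE|202001011200||ORU^R01|0001|P|2.5",
    "PID|1||000001",
    "OBX|1|TX|REPORT||Liver core biopsy.",
    "OBX|2|TX|REPORT||Adenocarcinoma identified.",
])
report_text = extract_report_text(sample)
```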

When an organisation is intending to apply a CNLP system to their data the topics discussed in this document need to be assessed for their potential impact on their desired outcomes.

The evaluations were based on two key principles:

  • There is a primary function to be performed by CNLP, that is, Clinical Entity Recognition (CER).
  • There is one secondary function and that is Relationship Identification.

Any other clinical NLP processing will rely on one or both of these two functions. For the purposes of this conversation we exclude “text mining”, which uses a “bag-of-words” approach to language analysis and is woefully inadequate in a clinical setting.
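A tiny example shows why a bag of words is risky for clinical text. The two sentences below are invented for illustration and mean opposite things, yet once word order is discarded they are indistinguishable. The `bag_of_words` helper is a simplistic assumption of how such a representation is built.

```python
from collections import Counter


def bag_of_words(text: str) -> Counter:
    """Lower-case the text, drop commas and full stops, and count the words."""
    cleaned = text.lower().replace(",", " ").replace(".", " ")
    return Counter(cleaned.split())


first = "Carcinoma present, no metastasis."
second = "Metastasis present, no carcinoma."

# Both sentences reduce to the same four word counts.
same_bag = bag_of_words(first) == bag_of_words(second)
```

Which finding the word "no" applies to is carried entirely by word order and grammar, and that is precisely the information a bag-of-words model throws away.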

Assessed Software: 

  • Amazon Comprehend Medical
  • Stanford NLP + Metathesaurus
  • OpenNLP + Metathesaurus
  • GATE + Metathesaurus
  • cTAKES

The systems have the listed deficiencies to a greater or lesser extent, and no single system has all of them. The deficiencies discussed are grouped under the following 5 topic headings:

  • Deficiencies in Understanding Document Structure
  • Deficiencies in Tokenisation
  • Deficiencies in Grammatical Understanding
  • Deficiencies in the CER Algorithms
  • Deficiencies in Semantics and Interpreting Medical Terminology

How much document detritus do you have to sift through to find what you need in a Cancer Registry?

Many registries have to report to at least 3 authorities, usually: State Registry, Commission on Cancer and CDC while others may also report to SEER.

While there is mostly common content, the differences can be both in the data required and the rules defining that data. To compound the problem, these complexities can often be dwarfed by two other content filters: a. separating the reportable cancer documents from the other two classes of reports, namely the non-reportable cancer reports and the non-cancer reports, and, b. finding the date of initial diagnosis in a complex patient record.

Dealing with the first of these problems, the separation of document types, is particularly acute for regional and central registries.

The Greater California CR has determined that about 60% of the documents they receive are irrelevant to their needs. In a volume of greater than 600,000 reports per annum that amounts to over 360,000 irrelevant documents, a huge overburden of work before the CTRs can get down to doing the task they have been trained (and paid) for.

As part of our research into the size of this problem, we’d like to understand the extent of the overburden in your registry, so here’s a very short questionnaire:

  1. Is your institution a Hospital / Regional CR / Central CR?
  2. How many reports per annum do you receive?
  3. What proportion of those reports are irrelevant to the registry work?

You can answer the questionnaire here. Your help will be gratefully accepted.

Cross-posted to LinkedIn.

The Many Faces of Natural Language Processing

In the last 10 years the NLP label has been forced from the lab into mainstream usage fuelled principally by the work of Google and others in providing us with tremendously powerful search engines, often masquerading as NLP.

It also happens that Google’s processes don’t have their origins in NLP but rather in Information Retrieval (IR), a field in which a number of methods are anathema to true NLPers, namely ignoring grammar, truncating words to stems, ignoring structure and generally treating a text as just a big bag of words with little to no interrelationships.

True NLP has a history as deep as computing itself with the very earliest computer usage for language processing occurring in the 1950s. Since then it has emerged as the discipline of Computational Linguistics with serious attention to all the lexical, grammatical and semantic characteristics of language, a far more complicated task than the simple text mining that Google, Amazon and other search engines have taken on.

A richer description of the history of NLP and its rivalry with Information Retrieval is presented in a previous blog but here is a quick and easy guide to understanding the current phrase usage around language processing, ranked in ascending levels of complexity:

  • Text Mining aka IR – rules, regular expressions, bag of words
  • True NLP aka Computational Linguistics – the field of computing the structure of language
  • Statistical NLP (SNLP) – True NLP plus Machine Learning
  • Neural NLP aka Deep Learning – Statistical NLP using particular types of Machine Learning to broaden domain knowledge
  • Language Engineering – Building production grade SNLP solutions

As a quick and pragmatic way to separate IR and NLP consider the result of searching for “liver cancer” in a range of clinical documents. Text Mining will correctly identify any document that contains the literal string “liver cancer” and possibly “cancer of the liver” if it is using grammatical searching as an adjunct to Information Retrieval. NLP will return a superset of documents containing not only the Text Mining results, but “liver core biopsy shows adenocarcinoma”, “hepatocellular carcinoma”, “liver mets”, “liver CA”, “HCC” and many more lexicogrammatical representations of the concept of “liver cancer”.
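The difference in result sets can be sketched in a few lines of Python. Note the big caveat: the "concept" search below is only a hand-written list of patterns standing in for what a real NLP system derives from grammar and clinical terminology, so it shows what comes back, not how NLP gets there. The documents, the pattern list and the function names are all invented for illustration.

```python
import re
from typing import Dict, Set

# Made-up snippets, for illustration only.
docs: Dict[str, str] = {
    "d1": "History of liver cancer.",
    "d2": "Liver core biopsy shows adenocarcinoma.",
    "d3": "Findings consistent with hepatocellular carcinoma (HCC).",
    "d4": "Liver mets noted on imaging.",
    "d5": "No abnormality of the kidney.",
}


def literal_search(query: str, documents: Dict[str, str]) -> Set[str]:
    """Text Mining style: return documents containing the literal string."""
    return {name for name, text in documents.items() if query.lower() in text.lower()}


# Deliberately incomplete stand-in for a concept-level understanding of "liver cancer".
LIVER_CANCER_PATTERNS = [
    r"\bliver cancer\b",
    r"\bcancer of the liver\b",
    r"\bhepatocellular carcinoma\b",
    r"\bHCC\b",
    r"\bliver (mets|CA)\b",
    r"\bliver\b.*\b(adeno)?carcinoma\b",
]


def concept_search(documents: Dict[str, str]) -> Set[str]:
    """Return documents matching any expression of the concept."""
    return {
        name
        for name, text in documents.items()
        if any(re.search(p, text, flags=re.IGNORECASE) for p in LIVER_CANCER_PATTERNS)
    }


literal_hits = literal_search("liver cancer", docs)
concept_hits = concept_search(docs)
```

The literal search finds only the first document, while the concept search returns it plus the three others that express the same idea in different words. A pattern list like this also has no notion of negation or context, which is one more reason real NLP needs grammar.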

Cross-posted to LinkedIn.