How much document detritus do you have to sift through to find what you need in a Cancer Registry?

Many registries have to report to at least three authorities, usually the State Registry, the Commission on Cancer and the CDC, while others may also report to SEER.

While most of the content is common, the differences can lie both in the data required and in the rules defining that data. To compound the problem, these complexities are often dwarfed by two other content filters: (a) separating the reportable cancer documents from the other two classes of reports, namely non-reportable cancer reports and non-cancer reports, and (b) finding the date of initial diagnosis in a complex patient record.

The first of these problems, the separation of document types, is particularly acute for regional and central registries.

The Greater California CR has determined that about 60% of the documents it receives are irrelevant to its needs. At a volume of more than 600,000 reports per annum, that amounts to over 360,000 irrelevant documents: a huge overburden of work before the CTRs can get down to the task they have been trained (and paid) for.

As part of our research into the size of this problem, we’d like to understand the extent of the overburden in your registry, so here’s a very short questionnaire:

  1. Is your institution a Hospital, a Regional CR, or a Central CR?
  2. How many reports per annum do you receive?
  3. What proportion of those reports are irrelevant to the registry work?

You can answer the questionnaire here. Your help will be gratefully accepted.

Cross-posted to LinkedIn.

The Many Faces of Natural Language Processing

In the last 10 years the NLP label has been forced from the lab into mainstream usage, fuelled principally by the work of Google and others in providing us with tremendously powerful search engines, often masquerading as NLP.

It also happens that Google’s processes don’t have their origins in NLP but rather in Information Retrieval (IR), a field in which a number of methods are anathema to true NLPers: ignoring grammar, truncating words to stems, ignoring structure, and generally treating a text as just a big bag of words with few or no interrelationships.

True NLP has a history as deep as computing itself, with the very earliest computer use for language processing occurring in the 1950s. Since then it has emerged as the discipline of Computational Linguistics, with serious attention to all the lexical, grammatical and semantic characteristics of language, a far more complicated task than the simple text mining that Google, Amazon and other search engines have taken on.

A richer description of the history of NLP and its rivalry with Information Retrieval is given in a previous blog, but here is a quick and easy guide to current phrase usage around language processing, ranked in ascending order of complexity:

  • Text Mining aka IR – rules, regular expressions, bag of words (see the sketch after this list)
  • True NLP aka Computational Linguistics – the field of computing the structure of language
  • Statistical NLP (SNLP) – True NLP plus Machine Learning
  • Neural NLP aka Deep Learning – Statistical NLP using particular types of Machine Learning to broaden domain knowledge
  • Language Engineering – building production-grade SNLP solutions
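To make the bottom rung of that ladder concrete, here is a minimal sketch of Text Mining in the IR style: regex tokenization, crude suffix-stripping in place of a real stemmer, and a bag-of-words count that throws away all grammar and structure. The report snippet and the stemming rule are invented for illustration.

```python
import re
from collections import Counter

def bag_of_words(text):
    """IR-style processing: lowercase, regex-tokenize, truncate words
    to rough stems, and discard all order, grammar, and structure."""
    tokens = re.findall(r"[a-z]+", text.lower())
    # Naive suffix-stripping stands in for a real stemmer (e.g. Porter).
    stems = [re.sub(r"(ing|ed|s)$", "", t) for t in tokens]
    return Counter(stems)

# Hypothetical report snippet, for illustration only.
print(bag_of_words("Liver core biopsy shows adenocarcinoma of the liver."))
# Counter({'liver': 2, 'core': 1, 'biopsy': 1, 'show': 1,
#          'adenocarcinoma': 1, 'of': 1, 'the': 1})
```

Everything an NLPer cares about, who biopsied what and what the finding implies, is gone; only the word counts remain.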

As a quick and pragmatic way to separate IR and NLP consider the result of searching for “liver cancer” in a range of clinical documents. Text Mining will correctly identify any document that contains the literal string “liver cancer” and possibly “cancer of the liver” if it is using grammatical searching as an adjunct to Information Retrieval. NLP will return a superset of documents containing not only the Text Mining results, but “liver core biopsy shows adenocarcinoma”, “hepatocellular carcinoma”, “liver mets”, “liver CA”, “HCC” and many more lexicogrammatical representations of the concept of “liver cancer”.
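To see that superset in code, here is a toy contrast, a sketch rather than a real NLP engine: literal substring matching on the IR side, and a tiny hand-built concept dictionary standing in for the lexical, grammatical and semantic machinery a genuine NLP system would apply. The snippets and the term list are invented for illustration.

```python
# Toy contrast between literal matching (IR) and concept matching (NLP).
# The term list is a hand-built stand-in for a real NLP engine's lexicon,
# grammar, and ontology (e.g. mappings to a terminology such as SNOMED CT).
LIVER_CANCER_TERMS = {
    "liver cancer", "cancer of the liver", "hepatocellular carcinoma",
    "hcc", "liver mets", "liver ca",
    "adenocarcinoma",  # simplified: real NLP infers the liver site from context
}

snippets = [
    "Pathology: liver cancer confirmed.",
    "Liver core biopsy shows adenocarcinoma.",
    "Impression: HCC, segment VII.",
    "Known liver mets from colon primary.",
]

ir_hits = [s for s in snippets if "liver cancer" in s.lower()]
nlp_hits = [s for s in snippets
            if any(term in s.lower() for term in LIVER_CANCER_TERMS)]

print(len(ir_hits))   # 1 -- only the literal string matches
print(len(nlp_hits))  # 4 -- a superset, found via concept-level matching
```

A production clinical NLP engine would resolve these variants through lexicons, grammar and an ontology rather than a flat term list, but the shape of the result is the same: NLP returns the Text Mining hits plus every other surface form of the concept.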

Cross-posted to LinkedIn.