natural language processing

Claims that a piece of software does NLP are common in medical technology circles. Not all of these claims are valid.

NLP has a long history. In its earliest days it was driven by linguists who wanted to automate the grammar-rule analyses they performed on language corpora. From the 1980s onwards, however, linguists have been superseded by computer scientists and the algorithms they invent to manipulate data. As the web began to expand, computer scientists saw that they could bypass the horrendously difficult linguistic pathway of analysing the structure of language by setting it aside and working only on the statistical characteristics of language. This led to two processing strategies. The first is Information Retrieval (IR), which treats documents purely as a bag of words. The second is statistical NLP (SNLP), which identifies the structure in language by its statistical characteristics rather than by grammatical rules.

So far IR has won the day for popularity, as best manifested by the Google search engine. However, SNLP is steadily and effectively building towards superseding IR.

The basic feature of IR is that it recognises documents that cover a topic of interest. It relies on achieving an exact match to the string of letters the user types into the search engine. It has no understanding of the linguistic or conceptual meaning of those letters, so it can’t tell that the singular and plural forms of a word are the same word. One technology devised to overcome this inherent limitation of IR is searching with regular expressions rather than literal strings. This lets the user design a search from multiple patterns rather than the single literal string of a Google search. Regular expressions extend the variety of strings you can search for, but they do not escape the inherent limitation that only a fixed set of string patterns is searched for.
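The contrast between literal matching and regular expressions can be sketched in a few lines of Python. This is only an illustration: the sample sentences and the function names `exact_word_search` and `regex_search` are invented for the example, and real search engines are far more elaborate.

```python
import re

# Invented sample documents, for illustration only.
documents = [
    "The patient reported a fracture of the left wrist.",
    "Multiple fractures were seen on the X-ray.",
    "The bone was fractured in two places.",
    "The wrist is broken.",
]

def tokens(text):
    """Split text into a bag of lower-case words."""
    return re.findall(r"[a-z]+", text.lower())

def exact_word_search(query, docs):
    """Return the documents that contain the query as an exact word."""
    q = query.lower()
    return [d for d in docs if q in tokens(d)]

def regex_search(pattern, docs):
    """Return the documents that match a regular expression."""
    rx = re.compile(pattern, re.IGNORECASE)
    return [d for d in docs if rx.search(d)]

# Literal matching finds only the singular form.
exact_hits = exact_word_search("fracture", documents)

# A pattern can cover the plural as well, but only because we wrote it in.
regex_hits = regex_search(r"\bfractures?\b", documents)

print(len(exact_hits), len(regex_hits))
```

The pattern picks up the plural, yet it still misses "fractured" and "broken", because neither was anticipated when the pattern was written.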

NLP has a different origin to IR. It holds that the structure of a sentence is important to understanding it and that the semantics of the words also affect the meaning. Hence NLP relies on methods for parsing sentences into their grammatical components and then retrieving content based on an understanding of that structure. SNLP has bolstered NLP by showing that some patterns of language usage are defined by statistical properties that can be exploited to correctly identify the grammatical role of words in a sentence. A good example is the use of statistical patterns to recognise the part of speech of an unknown word in a sentence from the behaviour of the words around it.
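The part-of-speech example can be made concrete with a deliberately tiny sketch. Everything in it is an assumption made for illustration: the four hand-tagged sentences, the tag names, and the helper `guess_tag` are invented, and a real statistical tagger would be trained on a large annotated corpus with a much richer model. The idea is simply to count which tag tends to appear between a given pair of neighbouring tags.

```python
from collections import Counter, defaultdict

# A toy hand-tagged corpus (invented for this example).
tagged_sentences = [
    [("the", "DET"), ("patient", "NOUN"), ("has", "VERB"), ("a", "DET"), ("fever", "NOUN")],
    [("the", "DET"), ("doctor", "NOUN"), ("ordered", "VERB"), ("a", "DET"), ("scan", "NOUN")],
    [("a", "DET"), ("nurse", "NOUN"), ("gave", "VERB"), ("the", "DET"), ("injection", "NOUN")],
    [("the", "DET"), ("scan", "NOUN"), ("showed", "VERB"), ("a", "DET"), ("small", "ADJ"), ("lesion", "NOUN")],
]

START, END = "<s>", "</s>"

word_tag_counts = defaultdict(Counter)     # word -> counts of its tags
context_tag_counts = defaultdict(Counter)  # (previous tag, next tag) -> counts of the tag in between

for sentence in tagged_sentences:
    tags = [START] + [tag for _, tag in sentence] + [END]
    for i, (word, tag) in enumerate(sentence):
        word_tag_counts[word][tag] += 1
        context_tag_counts[(tags[i], tags[i + 2])][tag] += 1

def most_likely_tag(word):
    """Most frequent tag of a known word, or None if the word was never seen."""
    counts = word_tag_counts.get(word)
    return counts.most_common(1)[0][0] if counts else None

def guess_tag(words, position):
    """Guess the tag of the word at `position` from the tags of its neighbours."""
    prev_tag = START if position == 0 else most_likely_tag(words[position - 1])
    next_tag = END if position == len(words) - 1 else most_likely_tag(words[position + 1])
    counts = context_tag_counts.get((prev_tag, next_tag))
    return counts.most_common(1)[0][0] if counts else None

# "radiologist" and "administered" never occur in the toy corpus.
noun_guess = guess_tag("the radiologist ordered a scan".split(), 1)
verb_guess = guess_tag("the nurse administered the injection".split(), 2)
print(noun_guess, verb_guess)
```

Neither unknown word is listed anywhere, yet each receives a plausible tag purely from the statistics of its surroundings, which is something a fixed string or pattern list cannot do.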

Unfortunately, the claim that a system uses NLP is so widespread that the value of the methods is being obscured.

Some techniques that are falsely claimed to be NLP:

1. Use of strings in rules to find desired content.

2. Use of regular expressions to find desired content.

3. Use of IR to find content.

The reason strings and rules are not classifiable as NLP is crucial to understanding the true value of SNLP.

Why are the differences important?

String- and rule-based IR systems have the advantage that they are quick and cheap to build, but their crucial disadvantage is that they can only identify what has already been defined. They also become encumbered once their rule set grows too large, because interactions between the rules make the effect of any change hard to predict. They also have a pronounced tendency to over-produce results, yielding many false positives in their searches.

Statistical NLP has the key advantage that it can identify content it has never seen, so it has a serious discovery advantage. Furthermore, as more knowledge is acquired it can be incorporated into the processing stream, giving the system a growing knowledge base. Its disadvantages are that it needs gold-standard annotated text to reach very high accuracies and that building the computational model takes more resources than the IR methods, so you need more extensive knowledge and training in the methods to exploit them effectively. Rule systems that have been in use for many years will serve their restricted objectives well and will initially provide better Precision (correctly identifying the items requested, and not retrieving too many false hits) than SNLP systems. Ultimately, though, they will always be behind in Recall (finding all the items requested) and will eventually slip behind on Precision once the SNLP system has had sufficient training.
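Precision and Recall are simple ratios, and a small worked example may help. The document identifiers and the two result sets below are hypothetical and chosen only to show the arithmetic; they are not measurements of any real system.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a set of retrieved items.

    precision = correct hits / everything retrieved
    recall    = correct hits / everything that should have been found
    """
    retrieved, relevant = set(retrieved), set(relevant)
    true_positives = len(retrieved & relevant)
    precision = true_positives / len(retrieved) if retrieved else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical data: four documents are truly relevant to the request.
relevant = {"doc1", "doc2", "doc3", "doc4"}

# A cautious system returns few items, all of them correct.
cautious_hits = {"doc1", "doc2"}

# A broader system finds more of the relevant items but lets in one false hit.
broader_hits = {"doc1", "doc2", "doc3", "doc5"}

cautious_scores = precision_recall(cautious_hits, relevant)
broader_scores = precision_recall(broader_hits, relevant)
print(cautious_scores, broader_scores)
```

In this made-up case the cautious system has perfect Precision but finds only half of what was requested, while the broader system trades a little Precision for noticeably better Recall.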

So the next time you see someone touting their technology as NLP, check whether it is really a rule-based IR technology or truly statistical.
