In the last 10 years, the NLP label has been pushed from the lab into mainstream usage, fuelled principally by the work of Google and others in providing us with tremendously powerful search engines that often masquerade as NLP.
As it happens, Google's processes have their origins not in NLP but in Information Retrieval (IR), a field in which a number of methods are anathema to true NLPers: ignoring grammar, truncating words to stems, ignoring structure and generally treating a text as just a big bag of words with few or no interrelationships.
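To make the bag-of-words idea concrete, here is a minimal Python sketch. It is only an illustration: `crude_stem` is a hypothetical, deliberately naive suffix stripper written for this post, not a real stemming algorithm, and nothing here is meant to reflect how any actual search engine works.

```python
import re
from collections import Counter


def crude_stem(word: str) -> str:
    """Hypothetical, deliberately naive stemmer: strip a few common suffixes."""
    for suffix in ("ing", "es", "s"):
        # Only strip if at least three characters of the word remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def bag_of_words(text: str) -> Counter:
    """Lowercase, tokenise on letters, stem, and count; word order is discarded."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(crude_stem(token) for token in tokens)


a = bag_of_words("The dog bites the man")
b = bag_of_words("The man bites the dog")

# The two sentences mean different things, yet their bags are identical.
print(a == b)
```

Because the counts carry no information about who did what to whom, the two sentences become indistinguishable, which is exactly the loss of grammar and structure described above.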
True NLP has a history as deep as computing itself, with the very earliest computer usage for language processing occurring in the 1950s. Since then it has emerged as the discipline of Computational Linguistics, with serious attention to all the lexical, grammatical and semantic characteristics of language, a far more complicated task than the simple text mining that Google, Amazon and other search providers have taken on.
A richer description of the history of NLP and its rivalry with Information Retrieval is presented in a previous blog post. Here is a quick and easy guide to the terms currently used around language processing, ranked in ascending order of complexity:
- Text Mining aka IR – rules, regular expressions, bag of words
- True NLP aka Computational Linguistics – the field of computing the structure of language
- Statistical NLP (SNLP) – True NLP plus Machine Learning
- Neural NLP aka Deep Learning – Statistical NLP using particular types of Machine Learning to broaden domain knowledge
- Language Engineering – Building production-grade SNLP solutions
As a quick and pragmatic way to separate IR and NLP, consider the result of searching for “liver cancer” in a range of clinical documents. Text Mining will correctly identify any document that contains the literal string “liver cancer”, and possibly “cancer of the liver” if it uses grammatical searching as an adjunct to Information Retrieval. NLP will return a superset: not only the Text Mining results but also documents containing “liver core biopsy shows adenocarcinoma”, “hepatocellular carcinoma”, “liver mets”, “liver CA”, “HCC” and many more lexicogrammatical representations of the concept of “liver cancer”.
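The superset relationship can be sketched in a few lines of Python. This is a toy under stated assumptions, not an NLP system: `LIVER_CANCER_TERMS` is a hypothetical hand-written lexicon standing in for the concept recognition a real pipeline would perform, and the sample documents are invented for illustration.

```python
import re

# Hypothetical, hand-written lexicon of surface forms for the concept "liver cancer".
LIVER_CANCER_TERMS = [
    "liver cancer",
    "cancer of the liver",
    "hepatocellular carcinoma",
    "liver mets",
    "liver ca",
    "hcc",
]


def literal_match(doc: str, query: str = "liver cancer") -> bool:
    """Text Mining style: the document must contain the literal query string."""
    return query in doc.lower()


def concept_match(doc: str) -> bool:
    """Stand-in for concept recognition: any known surface form, on word boundaries."""
    text = doc.lower()
    return any(
        re.search(r"\b" + re.escape(term) + r"\b", text)
        for term in LIVER_CANCER_TERMS
    )


# Invented sample documents.
docs = [
    "History of liver cancer.",
    "Imaging consistent with hepatocellular carcinoma.",
    "Known HCC, under review.",
    "Liver core biopsy shows adenocarcinoma.",
    "Liver function tests normal.",
]

literal_hits = [d for d in docs if literal_match(d)]
concept_hits = [d for d in docs if concept_match(d)]

print(len(literal_hits), len(concept_hits))
```

Every literal hit is also a concept hit, but not the other way round. Note that the lexicon still misses “Liver core biopsy shows adenocarcinoma.”: recognising that sentence as liver cancer requires relating the anatomical site to the finding, which is the kind of lexicogrammatical analysis a word list cannot supply.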
Cross-posted to LinkedIn.