Chapter 1 - Reining in the wonderment at ChatGPT and LLMs

We have all heard the great excitement generated by the release of the ChatGPT generative AI platform.

But is it as exceptional as the pundits claim?

What might it be used for?

What are its limitations?

What do we need to understand about it to evaluate this sort of language processing technology?

There are two major approaches to computer-driven language processing that have developed over the past 70 years.

The first approach was developed with linguistic knowledge and focused on building algorithms that computed the features of language that linguists had defined over 300 years of investigation. We know these as the lexicon, grammar and semantics of language. This approach has been called computational linguistics (CL) and natural language processing (NLP). In the 1990s it was complemented by machine learning methods that improved the recognition of target content, and the combination became known as statistical natural language processing. The aim of this approach was to show understanding of text, potentially at the level of human beings; this is yet to be achieved in anything but specialised settings.

However, in the background of the major thrust of CL was a small group of scholars working on text generation as part of computational language translation. These scholars initially worked on mapping lexica and grammar structures between languages. They then discovered that machine learning algorithms trained on matched translations to predict the “next word” in the construction of a sentence gave them a serious gain in productivity and accuracy in machine translation. This work has now culminated in many effective automatic translation engines, with the Google engine being the most prominent. A second group of scholars worked on text summarisation, focusing on interpreting the key aspects of a text and restating them in a briefer form.
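The “next word” idea can be illustrated with a very small sketch. The code below is not how translation engines or LLMs are built; it is a simple bigram counter, swapped in here because it is the smallest technique that still predicts a next word from training text. The corpus and the function names are invented for illustration.

```python
from collections import Counter, defaultdict

def train_bigram_model(sentences):
    """Count, for every word, which words follow it in the training text."""
    followers = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.lower().split()
        for current_word, next_word in zip(words, words[1:]):
            followers[current_word][next_word] += 1
    return followers

def predict_next(followers, word):
    """Return the most frequent follower of `word`, or None if it was never seen."""
    counts = followers.get(word.lower())
    if not counts:
        return None
    return counts.most_common(1)[0][0]

# Toy corpus, invented for illustration only.
corpus = [
    "the cat sat on the mat",
    "the cat chased the mouse",
    "the dog sat on the rug",
]
model = train_bigram_model(corpus)
print(predict_next(model, "sat"))    # on
print(predict_next(model, "the"))    # cat
print(predict_next(model, "zebra"))  # None
```

Note that the model can only echo what it has counted: a word absent from the training text gets no prediction at all, and a word present in it gets the most common continuation whether or not that continuation is true of the world.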

The second approach arrived in the late 20th century with the rise of search engines, best exemplified by Google. This approach treated text as a bag of words and the search problem as a string matching task, and discarded the need for syntactic and semantic parsing to process language. It was commonly labelled string processing and eventually became known as Text Mining. In the 2000s, this approach was turned on its head by the addition of large-scale neural network machine learners, used to create a more extensive characterisation of a word and its usage contexts in a body of text.
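To see what is lost when text is treated as a bag of words, consider the following minimal sketch. It assumes whitespace tokenisation and a simple overlap score; real search engines use far more elaborate ranking, and the function names and documents here are invented for illustration.

```python
from collections import Counter

def bag_of_words(text):
    """Reduce a text to word counts, discarding word order and grammar."""
    return Counter(text.lower().split())

def score(query, document):
    """Count how many query word occurrences are matched in the document."""
    query_bag = bag_of_words(query)
    doc_bag = bag_of_words(document)
    return sum(min(count, doc_bag[word]) for word, count in query_bag.items())

def search(query, documents):
    """Return matching documents ranked by overlap score, best first."""
    scored = [(score(query, doc), doc) for doc in documents]
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    return [doc for s, doc in ranked if s > 0]

# Toy documents, invented for illustration only.
documents = [
    "dog bites man",
    "man bites dog",
    "cat sleeps all day",
]

# The two sentences mean opposite things but have identical bags of words.
print(bag_of_words(documents[0]) == bag_of_words(documents[1]))  # True
print(search("dog bites", documents))
```

The first two documents are indistinguishable to this method, which is exactly the syntactic and semantic information the approach chose to discard.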

In a quirk of fate, this turn actually brought the field back to linguistics by capturing the spirit of corpus linguistics, whose method is expressed in the famous saying of the 20th century linguist J.R. Firth: “You shall know a word by the company it keeps”.
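Firth's saying can be made concrete with a small sketch that characterises each word by the words found near it. This is a count-based stand-in, not a neural network: it only shows the underlying idea that words used in similar company end up with similar descriptions. The window size, the function names and the sentences are assumptions made for illustration.

```python
from collections import Counter, defaultdict

def context_counts(sentences, window=2):
    """For each word, count the words appearing within `window` positions of it."""
    contexts = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.lower().split()
        for i, word in enumerate(words):
            start = max(0, i - window)
            neighbours = words[start:i] + words[i + 1:i + 1 + window]
            contexts[word].update(neighbours)
    return contexts

def shared_context(contexts, a, b):
    """Number of distinct context words that the two words have in common."""
    return len(set(contexts[a]) & set(contexts[b]))

# Toy corpus, invented for illustration only.
corpus = [
    "the doctor examined the patient",
    "the nurse examined the patient",
    "the banana ripened in the sun",
]
contexts = context_counts(corpus)
print(shared_context(contexts, "doctor", "nurse"))   # 2
print(shared_context(contexts, "doctor", "banana"))  # 1
```

Here “doctor” and “nurse” keep the same company and so look alike, while “banana” shares only the word “the” with them.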

This approach has now grown into what is known as Deep Learning. The name reflects the aim of representing the rich contexts of words in a neural network of such complexity (read: depth) that its operational characteristics are now unfathomable.

ChatGPT is the inheritor of most of these technologies. Crucially, however, it has stepped up its “learnings” by mining the breadth of the internet for language examples, giving it an almost immeasurable scale of texts to learn from across a broad range of natural languages. The huge scale of this mining means that the learning process is so demanding that it costs more than $100 million to create a language model. These models are now called large language models (LLMs), and they represent the positive and negative contexts of all (or maybe most) words found in their training materials. We don’t know the word pruning strategy used by their algorithm, so we can’t determine which words will be poorly represented by the model.

If we wind the clock back to the early tasks defined by scholars when natural language processing began, we can build more effective strategies for assessing ChatGPT and competing LLMs.

Putting aside the algorithmic work on part-of-speech tagging and parsing, the key semantic tasks were named entity recognition and relationship recognition.

The question to ask is how ChatGPT performs on these tasks.

I asked ChatGPT to generate some sample pathology reports for various cancers and the results appeared presentable. I even asked it to provide a report in an HL7 format and was impressed by the detail of the report, although the cancer content was thin. It left me puzzled as to how it could have learnt such formatting when no learning materials should be available, as all real-world reports would be held confidentially by health organisations. Later a colleague at the CDC pointed out to me that there were examples in some of the cancer registry advisory manuals that it could have learnt from, maybe.

At that point I decided I needed to test it on content I knew the truth of, so I asked it what it knew about me. It gave a respectable answer based on my professional profile, which would be available across the Internet. It identified that I worked at Sydney University, which was the first clue to a problem. It is 12 years since I was employed by Sydney University, so did ChatGPT have a problem with getting time relationships correct?

I next asked it what it knew about my work in Sydney (Australia), and it identified that I worked at three universities in Sydney. I mused that it might well have found my name on lists of attendees or speakers at each of these universities and thereby constructed a relationship that was incorrect.

So I went on a test run to establish what it would relate me to. I tried a query on my work in France, and it found that I worked at a university there (untrue), although it was one at which I had given a seminar.

Next I asked it about my work in Mexico, and it found that I had worked at a particular institute in Mexico. In fact I have NEVER visited Mexico in any capacity.
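This kind of probing can be made systematic by writing each claim down as a (subject, relation, object) triple and comparing it with facts you personally know. The sketch below is one possible way to do that. Everything in it is hypothetical: the names person, org_A, org_B and org_C are placeholders, not the actual answers I received, and the verdict labels are my own.

```python
def check_claims(claims, known_entities, known_relations):
    """Classify each (subject, relation, object) claim against facts you know."""
    results = []
    for claim in claims:
        subject, relation, obj = claim
        if obj not in known_entities:
            verdict = "unknown entity"
        elif claim in known_relations:
            verdict = "supported"
        else:
            verdict = "entity real, relation unsupported"
        results.append((claim, verdict))
    return results

# Hypothetical ground truth: placeholders, not real data.
known_entities = {"org_A", "org_B"}
known_relations = {("person", "gave_seminar_at", "org_A")}

# Hypothetical claims extracted from a model's answers.
claims = [
    ("person", "gave_seminar_at", "org_A"),
    ("person", "employed_by", "org_A"),
    ("person", "employed_by", "org_C"),
]

for claim, verdict in check_claims(claims, known_entities, known_relations):
    print(claim, "->", verdict)
```

Separating the entity check from the relation check matters, because a claim can name a perfectly real organisation and still assert a relationship with it that never existed.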

My conclusions from this investigation, some of which may of course be incorrect, are:

ChatGPT can identify entities with reasonable reliability.

ChatGPT has a serious problem with establishing the correct relationship between entities.

ChatGPT assumes there is a truth relationship between the entities in a query, and possibly across a sequence of queries.

ChatGPT has a semantic hierarchy of entities that it uses to create relationships, and these relationships can be meaningless.

When you draw these characteristics together, ChatGPT is problematic because the truth of any statement it produces is potentially unreliable.

WARNING 1: Don’t believe anything that is produced by LLMs. Only believe what you know to be true from your own experience.