ChatGPT: Deploying large language models on big data

April 5, 2023 by The Signoi Team

It won’t have escaped your notice that there has been something of a furore recently about artificial intelligence in the shape of our new ChatBot Overlords at ChatGPT!

Well, it’s not perfect – but it’s certainly good, and it has lots of interesting applications. So at Signoi we are now working on an enhancement to integrate GPT functionality into our projects. We have been working with models like this for some time, and now we’re pointing them at a specific issue: how do we quickly extract and summarize further meaning from hundreds of thousands of text comments such as Tweets, adding another layer of intuitive insight to the analysis?

First – what is GPT? And what is ChatGPT?

GPT stands for Generative Pre-trained Transformer. It is a family of large language models (LLMs) that generate human-like text in response to prompts. There are many LLMs – the most famous recent one, ChatGPT, is an AI-powered language model developed by OpenAI with a chat window user interface. It has been trained on a massive amount of text data from the internet.

Are these models truly intelligent? No. Do they work? Yes. Are they useful for us? Absolutely! We therefore prefer to look at all this as a sort of Augmented Intelligence…

GPT models and large datasets

The challenge with analysing very large datasets is always data reduction into key themes. Signoi uses various semantic clustering mechanisms to do this, ranging from vectorization to LDA (Latent Dirichlet Allocation). The screenshot below shows the outputs from one example: a clustering of 200,000+ Tweets on health-related matters. Word count is north of 10,000,000 – that’s 100 novels’ worth of Tweets.
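To make the idea of vectorization-based clustering concrete, here is a minimal sketch in Python. It is not Signoi's pipeline: it assumes plain bag-of-words vectors, cosine similarity, and a simple greedy "leader" clustering as a stand-in technique. The function names, the 0.3 threshold and the sample Tweets are all made up for illustration.

```python
import math
import re
from collections import Counter


def vectorize(text):
    """Turn a text into a bag-of-words vector (word -> count)."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def leader_cluster(texts, threshold=0.3):
    """Greedy one-pass clustering: each text joins the first cluster whose
    leader (its first member) is similar enough, else it starts a new one."""
    leaders = []   # vector of the first text in each cluster
    clusters = []  # the texts assigned to each cluster
    for text in texts:
        vec = vectorize(text)
        for i, leader in enumerate(leaders):
            if cosine(vec, leader) >= threshold:
                clusters[i].append(text)
                break
        else:
            # No existing cluster was close enough: open a new one.
            leaders.append(vec)
            clusters.append([text])
    return clusters


# Hypothetical sample data, invented for this sketch.
tweets = [
    "my anxiety is so bad today",
    "anxiety is keeping me awake again",
    "flu jab booked for monday",
    "got my flu jab this morning",
]
clusters = leader_cluster(tweets)
for number, members in enumerate(clusters, start=1):
    print(number, members)
```

On a real dataset you would probably want stop-word removal and a more robust method (LDA or embedding-based clustering, for instance), but the shape is the same: turn text into vectors, then group vectors that sit close together.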

Our AI firepower will cluster them for sure – but then the human analysts need to interpret and name the clusters. Example below…

Each cluster contains c. 7,000 Tweets representing an underlying theme. For example, a few Tweets from this cluster (number 31, bottom right) quickly tell us why it was cunningly named, by humans, “Anxiety and other disorders”…

But what can GPT do to help us humans quickly understand what’s in this cluster? Especially as in large datasets there are often many more than 31?

Once we’ve run this data reduction, the ask is then for the language model to do a few useful things for us:

  • Summarise the themes in each of the clusters
  • Give each cluster a name
  • Pull out a few useful bullet points
  • Write a micro summary, maybe in Haiku form, for fun and inspiration
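As a rough illustration, the four asks above might be bundled into a single prompt per cluster. The sketch below only builds the prompt text. It makes no call to any model, and the function name, the sample size and the wording of the prompt are assumptions rather than Signoi's actual prompt. It samples a handful of Tweets on the assumption that a whole cluster may be too long to send in one request.

```python
import random


def build_cluster_prompt(tweets, sample_size=20, seed=0):
    """Assemble one prompt asking a language model for the four outputs."""
    rng = random.Random(seed)  # fixed seed so the sample is repeatable
    sample = rng.sample(tweets, min(sample_size, len(tweets)))
    numbered = "\n".join(f"{i}. {t}" for i, t in enumerate(sample, start=1))
    return (
        "The following Tweets all belong to one topic cluster.\n\n"
        f"{numbered}\n\n"
        "Please provide:\n"
        "1. A short summary of the themes in this cluster\n"
        "2. A name for the cluster\n"
        "3. A few useful bullet points\n"
        "4. A micro summary in haiku form\n"
    )


# Hypothetical Tweets, invented for this sketch.
example = ["Can't sleep, anxiety again", "Panic attack on the train today"]
prompt = build_cluster_prompt(example)
print(prompt)
```

The resulting string can then be sent to whichever language model you use, once per cluster, with a human checking what comes back.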

And separately, using a generative visual AI, bring this theme to life through a mood board.

Here are the results for this topic cluster:

What do you think?

You can hopefully see immediately that it has made a good stab at naming, summarizing, and capturing the main elements of the theme. Of course it still needs checking and refining. As we said at the beginning, it’s not perfect – but we don’t let perfect get in the way of good! It’s a thought starter for humans, and we think this integration is going to save people a lot of time.

And as we train it further to meet our demands, it can only get better!

So our hot take on GPT and large language models generally is that they create huge opportunities:

  • For democratization of AI
  • For thought starters and creative inspiration
  • For automation of boring tasks
  • For increased productivity across the research process

Our philosophy has always been to automate 80% of the analysis to give humans more time to do what they’re best at – thinking. These advances in large language models can only help our cause.

We’ll be writing more on this and other AI-related subjects shortly. Meanwhile, please do get in touch if you found this interesting…

Get in touch today to learn more about what this powerful new approach can do for you.

Please contact us at