Using Word Clouds for Text Analysis
I’ve never been much of a fan of traditional word clouds, but they remain the entry-level tool of choice for text analysis. While they can look very pretty, they often don’t convey a lot of information about the text they are based on. For instance, if you have a data set derived from asking people about companies, you will end up with a word cloud like the one below:
If you were asking about smartphones, it is a near certainty that you would end up with a word cloud dominated by the words “iphone” and “android”. This is entirely logical: it reflects the fact that people will talk about the subject you ask them about.
In the example above, the word “company” is the most frequent word in this collection of text, but it doesn’t give us much information.
Information theory tells us that the more frequent an event is, the LESS information it contains. For example, suppose we only have two pieces of information about each day. The first is whether the sun rose or not. The second is the phase of the moon on that day. Obviously the sun rises every day, so it doesn’t give us any information to differentiate one day from another. Every day is the same if we use sunrise as the only information about a day. The phase of the moon varies over a cycle of about four weeks (call it 28 days for simplicity), so it gives us much more information about each day. We can place days into 28 categories using the phase of the moon, but only one using the sunrise. Less frequent events, in our case the phase of the moon, give us more information.
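To put rough numbers on the sunrise and moon example, here is a minimal sketch using the standard self-information formula, log2(1/p) bits for an event with probability p. The function name `self_information` and the assumption that all 28 phases are equally likely are mine, for illustration only.

```python
import math


def self_information(probability: float) -> float:
    """Return the self-information, in bits, of an event with this probability."""
    if not 0.0 < probability <= 1.0:
        raise ValueError("probability must be in (0, 1]")
    # log2(1/p) rather than -log2(p), so a certain event gives 0.0 and not -0.0.
    return math.log2(1.0 / probability)


sunrise_bits = self_information(1.0)          # the sun rises every day
moon_phase_bits = self_information(1.0 / 28)  # assumed: 28 equally likely phases

print(f"sunrise:    {sunrise_bits:.2f} bits")
print(f"moon phase: {moon_phase_bits:.2f} bits")
```

The certain event carries zero bits, while knowing the moon phase is worth a little under five bits. The same reasoning applies to a word that turns up in nearly every response.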
When we are doing word frequency analysis, as in word clouds, the most frequent words are not giving us the most information. We need to look a little deeper.
Our word cloud is dominated by “company” and “jobs”. As discussed earlier these give us little information because they are the most frequent items. The next step is to eliminate the top two most frequent items and see how the word cloud looks:
By eliminating the top two most frequent words we can see a more nuanced view of the data. We can go further with this. Here is a word cloud of the same data with the top four most frequent items eliminated:
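Dropping the top few words is easy to try on the raw counts before drawing anything. This is only a sketch of one way to do it with the Python standard library. The sample text, the tiny stop word list, and the helper names `word_frequencies` and `drop_top_n` are all invented for illustration and are not the data or tooling behind the word clouds in this post.

```python
import re
from collections import Counter


def word_frequencies(text: str) -> Counter:
    """Count lower-cased words, using a deliberately simple tokenizer."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def drop_top_n(frequencies: Counter, n: int) -> Counter:
    """Return a copy of the counts without the n most frequent words."""
    top_words = {word for word, _ in frequencies.most_common(n)}
    return Counter({w: c for w, c in frequencies.items() if w not in top_words})


# Invented sample text, standing in for real survey responses.
sample = (
    "The company is hiring. The company has new jobs. "
    "National company jobs are growing. New jobs mean hiring."
)
stop_words = {"the", "is", "has", "are", "mean"}  # tiny illustrative list
counts = Counter(
    {w: c for w, c in word_frequencies(sample).items() if w not in stop_words}
)

print(counts.most_common(2))                # the words that dominate the cloud
print(drop_top_n(counts, 2).most_common())  # what is left once they are gone
```

The trimmed counts can then be handed to whatever word cloud tool you prefer. Note that when several words tie at the cut-off, which of them counts as "top" depends on the order they were first seen.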
One other variation is to try to identify parts of speech such as verbs and adjectives in the text and use those for analysis. Part-of-speech analysis depends a lot on the structure of the text presented, which means that some of the word classifications may not be exact. Below is a word cloud using just adjectives from the “company” data set.
We get a different view of the data here, with the themes of “new” and “national” standing out. These adjectives can’t be picked out of the overall word cloud.
We can always go to the other end of the frequency spectrum and skip over the top 50 adjectives and see what is at the tail end of the frequency data for our example:
Both ends of the frequency range can be useful.
We can also look at the verbs in the data, shown below:
Again we see a very different picture from the overall word cloud. Evidently “hiring” is a strong theme.
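The part-of-speech views above could be sketched roughly as follows. I am assuming the text has already been run through some part-of-speech tagger that emits Penn Treebank style tags, where adjective tags start with "JJ" and verb tags start with "VB". The tagged tokens and the helper name `count_by_tag_prefix` are made up for illustration.

```python
from collections import Counter

# Hypothetical tagger output: (word, tag) pairs with Penn Treebank style tags.
tagged_tokens = [
    ("new", "JJ"), ("company", "NN"), ("is", "VBZ"), ("hiring", "VBG"),
    ("national", "JJ"), ("jobs", "NNS"), ("new", "JJ"), ("offices", "NNS"),
    ("opened", "VBD"), ("hiring", "VBG"), ("bigger", "JJR"),
]


def count_by_tag_prefix(tokens, prefix: str) -> Counter:
    """Count words whose tag starts with the prefix ("JJ" adjectives, "VB" verbs)."""
    return Counter(word.lower() for word, tag in tokens if tag.startswith(prefix))


adjectives = count_by_tag_prefix(tagged_tokens, "JJ")
verbs = count_by_tag_prefix(tagged_tokens, "VB")

# Skip the single most frequent adjective to look at the tail of the list;
# on real data the slice would start much further along, e.g. [50:].
adjective_tail = adjectives.most_common()[1:]

print(adjectives.most_common())
print(adjective_tail)
print(verbs.most_common())
```

Because taggers make mistakes on short or informal text, it is worth eyeballing a sample of the tagged output before trusting the counts.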
For a fast read on any collection of text, word frequency data is an accessible approach, and word clouds don’t seem to be going away as the visualisation of choice for it. With a little bit of work, word clouds can show more than just what is mentioned the most. By eliminating the highest frequency words, which convey the least information, we can learn more about the text data. Splitting text into parts of speech can give much deeper insights, because not all parts of speech are the same.
So I’ve learned to love word clouds: with a little bit of effort you can get a lot of information out of them.