Social media sampling and automated analysis
So, what is a valid sample for the analysis of social media posts, or of any document corpus?
How do we structure the sample? What characteristics of the population do we use to generate the sample? How big should a sample be? We don’t have all the answers, but we do have a solution to the niggling issue of unstructured data.
Social Media Sampling – how much is too much?
As we know, the challenge with social media data is not getting the raw information, since thousands of social media postings can be obtained easily and very quickly. The sampling issue comes into play as a consequence of trying to analyse 100,000 comments or postings. Physically reading them all is expensive, time-consuming and impractical; it is impossible to read and summarise 100,000 postings in any reasonable time. At a very conservative estimate of seven words per post, 100,000 posts would be the equivalent of a novel of 700,000 words or more. According to Amazon, the median length of a novel is about 64,000 words, so 700,000 words is more than 10 novels to read and summarise.
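The arithmetic behind that comparison can be checked with a short sketch. The figure of seven words per post is not a measurement; it is simply what the numbers above imply (700,000 words spread across 100,000 posts), and the variable names are illustrative.

```python
# Figures from the text above. words_per_post is an assumption implied by
# 700,000 words / 100,000 posts, not a measured average.
posts = 100_000
words_per_post = 7
median_novel_words = 64_000  # median novel length quoted from Amazon

total_words = posts * words_per_post
novels = total_words / median_novel_words

print(f"{total_words:,} words is roughly {novels:.1f} median-length novels")
```

Raising the assumed words per post only makes the reading task larger, which is why the estimate is described as conservative.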
The problem grows when there are embedded links in the posts to websites and images. The tendrils of social media posts can be vast. Some sort of sampling therefore seems logical: take a subset of the posts, controlled in some way, and analyse the subset. A tenth of 100,000 posts is 10,000, but that is still a large number to code and analyse. It also seems a pity to ignore the other 90,000 posts.
Sampling in the survey world is a consequence of the difficulty of obtaining data. What is hardly ever done is discarding completed survey data once we have it. We analyse all completed (and sometimes incomplete) survey responses, so why not analyse all the social media postings? Unlike with survey data, the problem with social media is not the acquisition of the data; it is the analysis that is challenging.
Changing the lens through which we see social media
I think the way forward is to look at social media as a behavioural data stream. Posting is a behaviour, and it should be analysed and quantified like any other form of behavioural data. We need a theory of why people post; it can’t be random. Throwing away the vast majority of the social media posts collected simply because analytical tools are not being used doesn’t seem like a good idea to me. Automated metrics and analyses can be generated for 100,000 posts fairly easily, whereas manually coding 100,000 posts is expensive and of limited value.
There is nothing magical about words. Languages are structured systems of signs, and they are amenable to summary and analysis in the same way as survey data. The automated methods may be very different from the ones used for survey data, but they are available. This is not an attempt to remove human analysts from their role, though. Statistics are just results; they need to be interpreted. In the same way, automated analyses of social media data need interpretation.
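As a rough illustration of what automated metrics over a full set of posts might look like, here is a minimal Python sketch using only the standard library. It is not Signoi's method. The function name summarise_posts, the choice of metrics (post count, link count, mean length, most frequent terms) and the three example posts are all assumptions made up for this illustration.

```python
import re
from collections import Counter

URL_PATTERN = re.compile(r"https?://\S+")
WORD_PATTERN = re.compile(r"[a-z0-9']+")

def summarise_posts(posts, top_n=5):
    """Return a few simple corpus-level metrics for a list of post strings."""
    term_counts = Counter()
    total_words = 0
    posts_with_links = 0
    for post in posts:
        text = post.lower()
        if URL_PATTERN.search(text):
            posts_with_links += 1
        # Remove links so their fragments are not counted as words.
        text = URL_PATTERN.sub(" ", text)
        words = WORD_PATTERN.findall(text)
        total_words += len(words)
        term_counts.update(words)
    return {
        "posts": len(posts),
        "posts_with_links": posts_with_links,
        "total_words": total_words,
        "mean_words_per_post": total_words / len(posts) if posts else 0.0,
        "top_terms": term_counts.most_common(top_n),
    }

# Hypothetical posts, invented for this example.
example_posts = [
    "Love the new phone, the camera is great",
    "The battery on the new phone is awful https://example.com/review",
    "Great camera, awful battery",
]
summary = summarise_posts(example_posts)
print(summary)
```

The same function runs unchanged over 100,000 posts, since it makes a single pass and keeps only running counts. The numbers it returns are just results, and they still need an analyst to interpret them.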
The role of semiotics
I’ve often read about how human interpretation of social media data is the only way, applying qualitative techniques to social media as if it were some vast focus group. Writers, rightly, talk about how powerful our minds are when it comes to the interpretation of language and culture. What no one seems to mention is the limitations and biases that the human mind brings to any analysis of text. Charles Peirce, a seminal thinker in the field of semiotics, talked of the understanding or effect of the linguistic sign. He was saying that signs, the elements of language, can have different meanings for, and effects on, the interpreter of the sign. The limits of human memory mean it’s not possible to read the equivalent of “War and Peace”, about 560,000 words, in a couple of days and summarise all of it. And this is a small amount of text; a corpus of 1.5 million words is entirely possible.
The more we learn about the structure of social media posts, the better we can segment and analyse them. Automated analysis of text is a great way to build a foundation for better understanding the sea of text we are now faced with. After all, it’s not like we can read it all.
Find out how Signoi can help you surface meaning from your unstructured data.