Author: Frank Buckler, Ph.D.
Published on: October 9, 2021 * 7 min read
Apparently, unstructured data appears chaotic at first glance, but with new forms of AI data analysis, it can be tamed to solve business problems.
First we look at how to set up a codebook correctly.
FREE for all client-side Insights professionals.
We ship your hardcopy to USA, CA, UK, GER, FR, IT, and ESP.
Be granular and think about actions – There are various types of coding. Manual coding and supervised coding require the human to build a codebook, and there are some do’s and don’ts around that. Of course, it depends on the use case, but in general, the recommendation needs to be granular and you have to think about the actions.Therefore, it’s not enough to have a code that is correct. It should also be useful and it becomes useful if you are more specific. It’s okay to categorize something as “poor quality”, but it would be good to know how to improve this quality. So, granularity is the king and growing a narrative translates into actionability.
Always have codes with the direction – It means that the code itself has the sentiment. Why? It’s because you can have codes like price and quality. But when people actually read it, they don’t know how to understand that. When you build categories, always construct them in a way that they have a direction like bad pricing, good pricing, bad quality, and good quality. This way, everyone who reads it, understands what it means when you say quality and price. So, it becomes an exercise, and later it enables you to detect granularities by having direction-based codes.
Be mutually exclusive – It means make sure when you choose a certain category, the other one can’t be the same. For instance, friendly service is not mutually exclusive with great service. You have to understand that one specific verbatim belongs to only one category. There is no doubt whether or not it belongs to. If you are not clear, write a description consisting of one or two sentences as it helps the coder to remember the real meaning. But if you can’t describe it, that’s probably not a good piece of code.
Cluster categories in category groups – You have to cluster categories in category groups to define a hierarchy as it enables you to have overview summaries. You can group the categories by classifying them into different clusters like the parent category and the child category. The important thing is to train the child of the specific category.
Label as customers speak – You have to be specific when you label a certain category. Typically, we tend to use general terms that are hard to understand. So, you have to be descriptive and use the language of the customer for the below reason:
Use AND + OR when labeling – Sometimes, you have certain categories, and they can not be differentiated from one another. For example, brand, great brand, reputed brand are hard to differentiate from a customer’s point of view. It is important to use AND and OR the right way when you have certain categories. AND means the customer mentioned both categories, and OR means the customer mentioned one of them. It is necessary to give these details to the person who reads the label as well as the AI the correct guidance.
OTHERS category – What’s very much used are OTHERS category. The only use of it is to count how much we did not categorize specifically. But, it should only be used for manual coding because there are two reasons for that:
Non-informative categories – The non-informative categories are those that don’t belong to our codebook. Typically, customers may say:
“I gave it zero because I don’t recommend it in general.”
You need to pay attention to what the customer says because it helps you explain the outcomes.
Minimum frequency per category – There is a question:
What’s the minimum frequency of a category?
It depends on the usage and it’s something you need to decide. You always build categories and open a new category if it’s mutually distinctive to something that is already mentioned in the other categories. At the end of the exercise, you can still decide to skip some of the categories.
Manual Coding – For manual coding, you have to read all comments. You build your codebook, and when new data arrives, you need to categorize everything from scratch and look for new categories. You also code everything that doesn’t fit to your codebook and visit the OTHERS category to find the new categories.
Unsupervised Categorization – It always finds new categories but changes the definition of the old ones. Therefore, it does not maintain consistency. So, if you can find a software that can fix your old ones, it may be able to find the new categories without changing the old ones.
Supervised Categorization – Supervised learning is trained by a domain expert. So, it needs humans to find the new categories, but there is a trick called smart sorting ( smart categorization). It sorts the verbatims in a way that you can be quite sure about them i-e., you will be sure what is the quality, and what is the price etc. So, supervised learning can help you detect new topics or categories very quickly and easily by getting rid of the OTHERS category.
Unsupervised learning – You know that unsupervised learning finds new categories by retraining and does not maintain consistency. You need to avoid retraining and find a tool that can fix category definitions.
Manual Coding – How can you manage consistency in manual coding?
Supervised Categorization – The same thing applies for supervised learning because the training process involves manual coding. Focus on quality coding and make sure you train the machine right, so it will be consistently correct. In short,
If you are an international company, you have multi-language feedback. The question is how are you dealing with that? How are you categorizing it?
Native Coders – We need native people or native coders who can take a look and categorize. It’s because that is the best you can do as they are the best source to understand what the verbatim means.
Translate first into core language – The alternative approach is to take all verbatims and translate them into one core language. You may think that you lose information.
Yes, you do. But you have one understanding of categories, and one native quarter categorizing everything.
So, if you have multiple languages (more than three), you can use this method as it gives better results than native codes.
So far, we discussed that you need to build and maintain a codebook because it is important in managing a data analysis project. The following are the steps of building a codebook:
Further we discussed that you can find new categories by using the following categorization schemes:
You can manage consistency in these methods by practicing some necessary steps. Further, you can deal with multiple languages by using two methods:
Our Group: www.Success-Drivers.com