Why Are Your Customers Unhappy?

Imagine you are the CEO of a multinational clothing retailer that has rolled out its winter-season apparel lines in North America and Europe. Your corporation collects customer experience feedback at checkout via a receipt-based survey. Each survey takes a customer about a minute to complete and asks her to rate overall satisfaction on a five-point scale. A second, open-ended question asks for a textual response.

The first month after the season rollout, 5 percent of all open-ended responses are about poor sweater quality, specifically pilling, stretching, and unraveling. The overall satisfaction score is slightly lower than last year's. Because of the sheer volume of open-ended responses—about 50,000, or 16 percent of all responses—the sweater problem stays hidden in the data and no action is taken. By the second month, sweater quality accounts for 30 percent of all textual responses and customer satisfaction is plummeting.

This example illustrates one of the key challenges corporations face: knowing what is driving poor customer experience. If a sizable portion of customers have the same negative experience, market share is lost and is difficult to regain.

Gathering customer experience feedback is the first and most critical step. Asking customers to provide a rating on a five-point scale or a Net Promoter Score (NPS) scale helps corporations identify that a problem exists (as in our example, where 16 percent of all customers are dissatisfied and are prompted for the reason behind their dissatisfaction), but it doesn't answer the critical question of why they are dissatisfied.

Until recently, corporations that collected voluminous open-ended feedback had three options: manually categorize each response, categorize a random sample of open-ended answers, or simply use the open-ended responses as a qualitative source to reference while reporting on customer experience. Manually categorizing each response is costly and time-consuming; by the time 50,000 responses are coded, the next month's data has already been collected. Categorizing only a random sample can dilute an early trend or signal in the data—such as sweater-pilling—that will have a strong negative impact on the company throughout the season.

Collecting responses using an open-ended text question places a heavy cognitive burden on respondents when compared to other question types, such as picking choices from a list. In addition, the question text that sets up the open-ended response request is often non-specific, such as "Why did you give a score of X to the last question?" In this case, respondents often give vague answers, answer off-topic, or simply answer "I don't know." More than any other question type, open-ended text questions cause respondents to break off and leave the survey response incomplete.

Unlocking Insights with Natural Language Processing

In the past few years, natural language processing (NLP), an artificial intelligence (AI) field based on machine learning techniques, has unlocked the hidden potential of open-ended survey data. NLP techniques can optimize a respondent's experience with open-ended text questions while the respondent is taking the survey, and they can automate the coding process during analysis. To optimize an open-ended question so that it solicits a strong, on-topic narrative response, methodological best practices can be combined with a number of technical advances that apply NLP in real time.

Traditionally, it has been a best practice to make open-ended responses optional, so that respondents facing a heavy cognitive burden can skip the question rather than abandon the survey. With NLP, however, prompting and probing the respondent with starter language can improve response rates and drive more valuable insight.

Once the respondent enters the open-ended response, an NLP algorithm can execute before the next survey question is asked. The algorithm examines factors of the response such as perceived sentiment, length, and unrecognized words, and combines them into a probabilistic quality score. If the score indicates that the response is likely of low quality, the survey can prompt the respondent through a series of probes designed to elicit a more complete, on-topic response.

The algorithm takes many factors of the response into account because no single factor provides enough dimensionality to judge response quality. For example, not all long responses are on-topic and well written, and not all short responses are poorly constructed. Brevity can be the soul of wit.
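To make this concrete, here is a minimal sketch of a multi-factor quality score of this kind. It is an illustrative assumption, not the production algorithm: the toy vocabulary, the factor weights, the probe threshold, and the functions quality_score and needs_probe are all hypothetical names and values chosen for the example.

```python
# A minimal sketch of a multi-factor response-quality score.
# All weights, the vocabulary, and the threshold are illustrative assumptions.
import math
import re

# Toy vocabulary standing in for the model's known-word list.
VOCAB = {"the", "sweater", "quality", "poor", "pilling", "after", "one", "wash",
         "i", "was", "disappointed", "with", "my", "it", "started", "a", "and"}

def quality_score(response: str) -> float:
    """Return a 0-1 score estimating whether the response is usable."""
    tokens = re.findall(r"[a-z']+", response.lower())
    if not tokens:
        return 0.0
    length_factor = min(len(tokens) / 20.0, 1.0)        # longer is (weakly) better, capped
    unknown = sum(1 for t in tokens if t not in VOCAB)
    known_factor = 1.0 - unknown / len(tokens)           # penalize unrecognized words
    vague = {"idk", "don't", "dont", "know", "nothing", "na"}
    vagueness_penalty = 1.0 if any(t in vague for t in tokens) else 0.0
    # Combine the factors with illustrative weights into a logistic "probability".
    z = 2.5 * length_factor + 2.0 * known_factor - 3.0 * vagueness_penalty - 1.5
    return 1.0 / (1.0 + math.exp(-z))

def needs_probe(response: str, threshold: float = 0.5) -> bool:
    """If the score falls below the threshold, the survey would show a follow-up probe."""
    return quality_score(response) < threshold
```

With these assumed weights, a reply such as "I don't know" falls well below the threshold and would trigger a probe, while a longer pilling complaint built from the toy vocabulary scores high and would not.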

Once the open-ended responses have been collected and a minimum number of completed surveys are available in aggregate, NLP techniques can be applied again, this time to detect any new trends in dissatisfaction that might be present.

Precision Drives Understanding

For this type of textual response categorization, the goal is to maximize precision. In text analytics, precision and recall are the key technical measures when categorizing responses that have related but different meanings. Precision is the share of responses assigned to a category that actually belong to it; recall is the share of responses belonging to a category that the algorithm actually finds. For example, say you have 100 textual responses and manually identify 25 that belong to the same category (e.g., sweater-pilling). Suppose that when you apply NLP techniques, the algorithm assigns only eight responses to the sweater-pilling category, but all eight are genuinely about sweater-pilling: precision is 100 percent. The algorithm missed the other 17 sweater-pilling responses, however, so recall is quite low. For survey response data, precision is more important than recall: better that a response is not categorized at all than categorized incorrectly.
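Expressed directly in code, the arithmetic of that example looks like this (the variable names are purely for illustration):

```python
# Precision and recall for the sweater-pilling example: 25 responses truly belong
# to the category, the algorithm tags 8 of them, and all 8 are correct.
true_in_category = 25      # manually identified sweater-pilling responses
tagged = 8                 # responses the algorithm assigned to the category
correctly_tagged = 8       # of those, how many actually belong

precision = correctly_tagged / tagged             # 8 / 8  = 1.00 (100 percent)
recall = correctly_tagged / true_in_category      # 8 / 25 = 0.32 (32 percent)
print(f"precision={precision:.2f}, recall={recall:.2f}")
```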

Given the importance placed on precision, one best-practice technique is semi-automated categorization using topic modeling. This technique sorts through written feedback and groups related terms, relying on context and co-occurrence rather than external dictionaries, so that responses containing industry-specific words and phrases are categorized appropriately. The NLP algorithm uses topic models and the proximity of terms to identify these initial categories, and a number of techniques can improve its ability to create correct topic categories, including lemmatization and neural networks.
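As one possible illustration, the sketch below runs scikit-learn's off-the-shelf LDA topic model over a handful of made-up responses. The sample text, the choice of three topics, and the analyst naming step are assumptions for the example, not a description of a production pipeline.

```python
# A minimal sketch of semi-automated topic discovery using scikit-learn's LDA.
# The responses and topic count are illustrative; a real pipeline would also
# lemmatize the text and tune the model.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

responses = [
    "sweater started pilling after one wash",
    "pilling ruined the sweater quality",
    "checkout line was far too long",
    "long wait at checkout, not enough staff",
    "loved the jacket fit and color",
]

# Term counts; co-occurrence of terms within responses is what the topic model exploits.
vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(responses)

lda = LatentDirichletAllocation(n_components=3, random_state=0)
lda.fit(X)

# Show the top terms per discovered topic; an analyst would then name and confirm
# each topic (e.g., "sweater-pilling") before responses are tagged with it.
terms = vectorizer.get_feature_names_out()
for i, weights in enumerate(lda.components_):
    top = [terms[j] for j in weights.argsort()[::-1][:4]]
    print(f"topic {i}: {', '.join(top)}")
```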

Lemmatization is the process of reducing a word to its base form (its lemma), so that words with the same meaning but different endings are categorized together. This ensures responses are tagged correctly—increasing both precision and recall—without having to use complicated query searches.

Returning to our sweater-pilling example, suppose we want to capture all comments about pilling without having to think through every variation—pills, pilled, piller, pill balls, and so on. With lemmatization, NLP reduces those words to the base word, pill, and then tags responses appropriately.
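As a toy illustration of that collapsing step, the sketch below uses a few hand-written suffix rules as a simplified stand-in for a real lemmatizer (such as NLTK's WordNetLemmatizer or spaCy's). The sample responses, suffix list, and category mapping are assumptions made for the example.

```python
# A toy sketch of how lemmatized forms collapse into one category tag.
# The suffix rules below are a simplified stand-in for a proper lemmatizer.
import re

def naive_lemma(word: str) -> str:
    """Strip a few common inflectional endings to approximate a lemma."""
    for suffix in ("ings", "ing", "ers", "er", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: len(word) - len(suffix)]
    return word

responses = [
    "the sweater is pilling badly",
    "pills all over the sleeves",
    "it pilled after one wash",
]

CATEGORY_LEMMAS = {"pill"}  # lemmas that map to the hypothetical sweater-pilling category

for text in responses:
    lemmas = {naive_lemma(w) for w in re.findall(r"[a-z]+", text.lower())}
    tagged = "sweater-pilling" if lemmas & CATEGORY_LEMMAS else "uncategorized"
    print(f"{tagged}: {text}")
```

Here pilling, pills, and pilled all reduce to pill, so all three responses land in the same category without an analyst writing a query for every variant.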

Neural networks can correct for spelling errors and variations. If pilling is spelled piling, the model draws on the frequency of similarly spelled words, the surrounding context, and the structure of the overall response to accommodate the misspelling and categorize the response under sweater-pilling.
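A full neural spelling model is beyond a short sketch, so the example below substitutes simple edit-distance matching from Python's standard difflib module to show the underlying idea of mapping a misspelling such as piling onto a known category term. The term list and the similarity cutoff are illustrative assumptions.

```python
# Edit-distance matching as a simplified stand-in for neural spelling correction:
# a misspelling such as "piling" is mapped onto the known category term "pilling".
import difflib

KNOWN_TERMS = ["pilling", "stretching", "unraveling", "checkout", "sweater"]

def normalize_term(word: str) -> str:
    """Map a possibly misspelled word onto the closest known term, if any."""
    match = difflib.get_close_matches(word.lower(), KNOWN_TERMS, n=1, cutoff=0.8)
    return match[0] if match else word

print(normalize_term("piling"))      # -> "pilling"
print(normalize_term("stretchng"))   # -> "stretching"
print(normalize_term("jacket"))      # -> "jacket" (no close match, left as-is)
```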

Right now, only a few forward-thinking organizations are applying NLP techniques to both the collection and the analysis of their survey data, and using that information to identify new trends that will impact their businesses. These organizations are closing the experience gap—that is, bringing customer experience more in line with customer expectations—by discovering and addressing the issues that matter most to customers.


Carol Haney is a senior research and data scientist at Qualtrics. She has been a leader in market and social (government) research for more than 20 years. Her principal research areas are online quantitative research, including non-probability sampling, and textual analysis. She is a co-author of multiple chapters in Social Media, Sociality, and Survey Research, published by Wiley in 2013, with a second printing in 2016. She currently leads all the formative research for the Centers for Disease Control's hard-hitting anti-smoking ads. Prior to Qualtrics, Haney worked in executive positions at Toluna, Harris Interactive, TNS, SPSS, and the National Opinion Research Center at the University of Chicago.