Say ta. Say da. Now repeat the sounds, in each case paying attention to how you’re making them in your mouth. What’s the difference?
Trick question! There isn’t one. It’s not what’s happening in your mouth that makes these sounds different. It’s the “voice onset time”—the time between when you start moving your tongue and when you start vibrating your vocal cords. If that time is greater than roughly 40 milliseconds, English-speakers will hear ta. If it’s less than 40 milliseconds, they’ll hear da.
What’s amazing is that you never hear anything other than ta or da. If two speakers fall on the same side of the 40-millisecond dividing line, it doesn’t matter if their voice onset times differ dramatically. One person’s time might be 80 milliseconds, and the other’s might be only 50 milliseconds, but in both cases you’ll hear ta. If their times fall on opposite sides of the divide, however, a difference of just 10 milliseconds can be transformative. If one person’s voice onset time is 45 milliseconds, you’ll hear ta. If the other person’s time is 35 milliseconds, you’ll hear da. Strange but true.
People have had a lot of fun on the internet recently with the tricks our either-or minds play on us. Think of the audio clip of the word that people hear as either Yanny or Laurel. Or the dress that people see as either black-and-blue or white-and-gold. In these cases, as with ta and da, people fall on one side or the other of the categorical dividing line, and they’re practically willing to stake their lives on the idea that their perception is “right.”
Your mind is a categorization machine, busy all the time taking in voluminous amounts of messy data and then simplifying and structuring it so that you can make sense of the world. This is one of the mind’s most important capabilities; it’s incredibly valuable to be able to tell at a glance whether something is a snake or a stick.
For a categorization to have value, two things must be true: First, it must be valid. You can’t just arbitrarily divide a homogeneous group. As Plato put it, valid categories “carve nature at its joints”—as with snakes and sticks. Second, it must be useful. The categories must behave differently in some way you care about. It’s useful to differentiate snakes from sticks, because that will help you survive a walk in the woods.
So far, so good. But in business we often create and rely on categories that are invalid, not useful, or both—and this can lead to major errors in decision making.
Consider the Myers-Briggs Type Indicator, a personality assessment tool that, according to its publisher, informs HR decision making at more than 80% of Fortune 500 companies. It asks employees to answer 93 questions that have two possible responses and then, on the basis of their answers, places them in one of 16 personality categories. The problem is that these questions demand complex, continual assessment. Do you go more by facts or by intuition? Most of us would probably answer, “Well, it depends”—but that’s not an option on the test. So respondents have to choose one camp or the other, making choices they might not reproduce if they were to take the test again. Answers to the questions are summed up, and the respondent is labeled, say, an “extravert” rather than an “introvert” or a “judger” rather than a “perceiver.” These categorizations simply aren’t valid. The test isn’t useful either: Personality type does not predict outcomes such as job success and satisfaction.
Why, then, is Myers-Briggs so popular? Because categorical thinking generates powerful illusions.
Categorical thinking can be dangerous in four important ways. It can lead you to compress the members of a category, treating them as if they were more alike than they are; amplify differences between members of different categories; discriminate, favoring certain categories over others; and fossilize, treating the categorical structure you’ve imposed as if it were static.
Compression
When you categorize, you think in terms of prototypes. But that makes it easy to forget the multitude of variations that exist within the category you’ve established.
The myth of the target customer.
According to a story that Todd Rose tells in his book The End of Average, a newspaper in Cleveland ran a contest in 1945 to find the anatomically prototypical woman. Not long before, a study had determined the average values for a variety of anatomical measurements, and the paper’s editors used those measurements to define their prototype. A total of 3,864 women submitted their measurements. Want to guess how many of them were close to the average on every dimension?
None. People vary on so many dimensions that it’s highly unlikely that any single person will be close to the average on every one of them.
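A back-of-the-envelope calculation suggests why the answer is so small. The sketch below is only an illustration: it assumes that the measurements are independent of one another and that a fixed share of entrants counts as "close to average" on any single one. Neither the 30% share nor the dimension counts come from the study, and real body measurements are correlated, so the true numbers would differ.

```python
def expected_all_average(n_people, share_close, n_dimensions):
    """Expected number of people who are close to average on every dimension,
    assuming the dimensions are independent (a simplifying assumption)."""
    return n_people * share_close ** n_dimensions

# 3,864 entrants comes from the story; the 30% share and the
# dimension counts are hypothetical values chosen for illustration.
for k in (1, 3, 5, 9):
    print(k, round(expected_all_average(3864, 0.30, k), 2))
```

Under these assumptions the expected count drops below one person well before nine dimensions are reached.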
The same holds true for customers. Consider what happens in segmentation studies—one of the most common tools used by marketing departments. The goal of a segmentation study is to separate customers into categories and then identify target customers—that is, the category that deserves special attention and strategic focus.
Segmentation studies typically begin by asking customers about their behavior, desires, and demographic characteristics. A clustering algorithm then divides respondents into groups according to similarities in how they answered. This kind of analysis rarely yields highly differentiated categories. But instead of seriously evaluating whether the clusters are valid, marketers just move on to the next steps in the segmentation process: determining average values, profiling, and creating personas.
This is how “minivan moms” and other such categories are born. After conducting a survey, somebody in marketing identifies an interesting-looking cluster in which, say, 60% of the respondents are female, with an average age in the early 40s and an average of 2.75 kids. Looking at those averages, it’s easy to drift away from the data and start dreaming of a prototypical customer with those very attributes: the minivan mom.
Such labels blind us to the variation that exists within categories. Researchers in a 2011 study, for example, presented participants with an image of women’s silhouettes at nine equidistant points along the spectrum of the body mass index. The participants were shown the silhouettes twice—once just as they appear in Figure 1, and once with the labels “anorexic,” “normal,” and “obese,” as shown in Figure 2.
At each viewing, the participants were asked to rate the images on various dimensions. They saw the women differently when they were labeled than when they were not—even though nothing about the women themselves had changed. For instance, participants assumed that the personality and lifestyle of woman 7 was more like that of woman 9 when the two were labeled obese. Similarly, women 4 and 6 were seen as more alike when they were labeled normal.
As with body types, the segments that most businesses work with are not as clear-cut as they seem. Customers in a segment often behave very differently. To resist the effects of compression, analysts and managers might ask, How likely is it that two customers from different clusters are more similar than two customers from the same cluster? For instance, what is the probability that a minivan mom’s favorite clothing brand is more like that of a maverick mom than like that of another minivan mom? That probability is often closer to 50% than to 0%.
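That question can be put to a data set directly. As a minimal sketch, the simulation below stands in for real survey data: it assumes each customer has a single preference score drawn from a normal distribution, with the two clusters differing only in their mean. The function name, the score model, and the gap values are all hypothetical.

```python
import random

def prob_cross_pair_closer(mean_gap, n_trials=20000, seed=1):
    """Estimate how often a customer is closer to someone from the other
    cluster than to someone from her own cluster.
    Scores are N(mean, 1); the cluster means differ by mean_gap (hypothetical model)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_trials):
        a = rng.gauss(0.0, 1.0)            # a customer in cluster 1
        same = rng.gauss(0.0, 1.0)         # another customer in cluster 1
        other = rng.gauss(mean_gap, 1.0)   # a customer in cluster 2
        if abs(a - other) < abs(a - same):
            hits += 1
    return hits / n_trials

for gap in (0.0, 0.5, 1.0, 3.0):
    print(gap, round(prob_cross_pair_closer(gap), 3))
```

When the clusters overlap heavily (a small gap), the estimate stays near 50%. Only when the clusters are far apart does it approach zero, which is what a well-separated segmentation would look like.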
The screening effect.
Compression can also distort recruiting decisions. Imagine that you’re responsible for hiring at your company. You recently posted a job announcement, and 20 people applied. You do a first screening, ranking candidates in terms of their technical skills, and invite the five highest-ranked candidates in for an interview.
Even though technical skills vary considerably among the five, you’re not much influenced by that now in deciding whom to hire. Once you’ve screened candidates on the basis of technical skill, those who made it to the next stage all seem similar to you on that dimension. Affected in this way by categorical thinking, you’ll decide primarily on the basis of the soft skills the candidates demonstrate in interviews: how personable they are, how effectively they communicate, and so on. Those skills are important, of course, but the top requirement for many jobs is the highest possible technical skills, and the screening effect hampers your ability to pinpoint them.
Anomalies in financial investments.
Compression also occurs in financial markets. Investors roughly categorize assets according to size (small-cap or large-cap stocks), industry (energy, say, or health care), geography, and so on. Those classifications help investors sift through the vast number of available investment options, and that’s important. But they also lead investors to allocate capital inefficiently in terms of risk and return. During the internet bubble of the late 1990s, for example, people invested heavily and almost immediately in companies that had adopted dot-com names, even when nothing else had changed about those businesses. That mistake cost many investors dearly. Another example: When a company’s stock is added to the S&P 500, it starts moving more closely with the stock prices of other companies in the index, even if nothing about the company or its stock has actually changed.
Amplification
Categorical thinking encourages you to exaggerate differences across category boundaries. That can lead you to stereotype people from other groups, set arbitrary thresholds for decisions, and draw inaccurate conclusions.
Amplification can have serious consequences when it affects how you think about members of social or political groups. Studies show that people affiliated with opposing political parties tend to overestimate the extremity of each other’s views.
Who do you think cares more about social equality: liberals or conservatives? If you answered liberals, you’re correct. On average, liberals rate social equality as more important than conservatives do. But some conservatives care more about social equality than some liberals do. Suppose we take two random people on the street—first somebody who votes conservative, and then somebody who votes liberal. What’s the probability that the first person rates social equality as more important than the second does? Much closer to 50% than you might think. Averages mask the overlap between groups, amplifying perceived differences. Despite the average in this case, many conservatives actually care more about social equality than many liberals do.
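The "closer to 50%" claim can be made concrete with a small calculation. The sketch below assumes, for illustration only, that ratings in each group are normally distributed with the same spread; the group means used in the example are hypothetical and not taken from any survey.

```python
from statistics import NormalDist

def prob_first_rates_higher(mean_first, mean_second, sd=1.0):
    """P(X > Y) for independent X ~ N(mean_first, sd) and Y ~ N(mean_second, sd).
    The difference X - Y is normal with standard deviation sd * sqrt(2)."""
    diff = NormalDist(mean_first - mean_second, sd * 2 ** 0.5)
    return 1.0 - diff.cdf(0.0)

# Hypothetical: the second group's average is 0.3 standard deviations higher.
print(round(prob_first_rates_higher(0.0, 0.3), 3))
```

With a gap of that size between the averages, a randomly chosen member of the lower-scoring group still outranks a randomly chosen member of the higher-scoring group roughly four times in ten.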
If you’re a liberal in the United States, you’re likely to assume that all conservatives oppose abortion, gun control, and the social safety net. If you’re a conservative, you’re likely to assume that all liberals want open borders and government-run universal health care. The reality, of course, is that ideologies and policy positions exist on a spectrum.
Amplification due to categorical thinking is especially worrisome in today’s age of big data and customer profiling. Facebook, for example, is known to assign political labels to its users according to their browsing history (“moderate,” “conservative,” or “liberal”) and to provide that information to advertisers. That can lead advertisers to assume that differences among Facebook’s categories of users are bigger than they actually are—which, ironically, can widen the true differences, by giving advertisers an incentive to deliver a highly tailored message to each group. That’s what seems to have happened in 2016, during the U.S. presidential election and the Brexit campaign, when Facebook fed “conservatives” and “liberals” thousands of divisive communications.
Many companies struggle internally with similar amplification dynamics. Success often hinges on creating interdepartmental synergies. But categorical thinking may cause you to seriously underestimate how well your teams can do cross-silo work together. If, say, you assume that your data scientists have lots of technical expertise but little understanding of how the business works, and that your marketing managers have the domain knowledge but can’t wrangle data, you might rarely think about having them team up. That’s one reason so many analytics initiatives fail.
Amplification also has subtler consequences for managerial decisions. Consider that NBA coaches are 17% more likely to change their starting lineup in a game following a close loss (100–101) than they are following a close win (100–99), even though the difference in the other team’s scores is only two points. But few coaches would change a lineup because their team lost 100–106 rather than 100–108, even though the difference is still only two points. A loss feels qualitatively different from a win, because you don’t think about sports outcomes as being on a continuum.
Whenever you make a decision using a cutoff along some continuous dimension, you’re likely to amplify small differences. After the financial crisis in 2008, the Belgian government bailed out Fortis, now a subsidiary of BNP Paribas. As a result, the government owned millions of shares of BNP Paribas. According to the Belgian newspaper De Standaard, at the end of January 2018, when the stock price was a little over €67, the government decided that it would sell its shares if they reached €68 again. But they never did; instead the price plummeted, and those shares are now worth only €44.
Nobody in the Belgian government could have predicted that the stock price would fall so much. But the government’s mistake was to make selling its shares an all-or-nothing affair. A better approach would have been to sell some of the stock at one price, some at a second price, and so on.
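The contrast between a single cutoff and a staged rule can be sketched in a few lines. Everything specific here is an assumption made for illustration: the price path is invented (only the rough start near €67 and the end at €44 echo the story), and the trigger prices and tranche sizes are hypothetical, not a recommendation.

```python
def staged_sale(prices, tranches):
    """Sell each tranche the first time the price reaches its trigger.
    tranches: list of (trigger_price, fraction_of_holding).
    Returns (fraction_sold, proceeds_per_share_held)."""
    remaining = sorted(tranches)
    sold_fraction, proceeds = 0.0, 0.0
    for price in prices:
        still_waiting = []
        for trigger, fraction in remaining:
            if price >= trigger:
                sold_fraction += fraction
                proceeds += fraction * price
            else:
                still_waiting.append((trigger, fraction))
        remaining = still_waiting
    return sold_fraction, proceeds

# Invented price path: hovers just under 68, then falls to 44.
PRICE_PATH = [67.2, 67.6, 67.9, 66.0, 60.0, 52.0, 44.0]

print(staged_sale(PRICE_PATH, [(68.0, 1.0)]))               # all-or-nothing rule
print(staged_sale(PRICE_PATH, [(67.5, 0.5), (68.0, 0.5)]))  # two-step rule
```

On this path the all-or-nothing rule sells nothing, while the two-step rule sells half the holding near the peak. A different path could favor the single cutoff; the point is only that a staged rule makes the outcome less sensitive to a near miss.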
With the rising influence of behavioral economics and data science, companies increasingly rely on A/B testing to evaluate effectiveness. In part that’s because A/B tests are easy to implement and analyze: You create two versions of the world that are identical except for one factor; you assign one group of participants to experience version A and one to experience version B; and then you measure whether behavior differs substantially between the groups. There will always be some difference between the groups due simply to chance, even if your manipulation had no effect. So, to determine whether the difference is large enough to indicate that the manipulation did have an effect, you apply a statistical test. The outcome of the test is the probability that you would have observed a difference of that magnitude if the manipulation had no effect. This probability is known as the p-value. The closer a p-value is to zero, the more comfortably you can conclude that any difference can be attributed to the factor you manipulated, not just to chance. But how close to zero is close enough?
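For a concrete sense of what that statistical test involves, here is a minimal sketch of one common choice, a pooled two-proportion z-test on conversion counts. The function name and the example counts are hypothetical, and a real analysis might call for a different test.

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value for the difference between two conversion rates,
    using a pooled z-test. Assumes at least one conversion and one
    non-conversion overall, so the standard error is not zero."""
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (conv_b / n_b - conv_a / n_a) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

# Hypothetical counts: 100 of 1,000 convert on version A, 130 of 1,000 on version B.
print(round(two_proportion_p_value(100, 1000, 130, 1000), 4))
```

The returned value is the probability of seeing a gap at least this large if the two versions truly performed the same.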
In 1925 Sir Ronald Fisher, a British statistician and geneticist, decided arbitrarily that .05 was a convenient threshold. Fisher might just as easily have picked .03, and in fact he recommended that the p-value threshold be dependent on the specifics of whatever study was being conducted. But few people paid attention to that. Instead, in the decades that followed, entire scientific disciplines blindly adopted .05 as the magical boundary that separates signal from noise, and it has become the norm in business practice.
That’s a problem. When an A/B test yields a p-value of .04, an intervention might be adopted, but at .06 it might be skipped—even though the difference between p=.04 and p=.06 is not in itself meaningful. Making matters worse, many experimenters peek at the data regularly to test for statistical significance, stopping data collection when they see a p-value below .05. This practice greatly increases the likelihood of concluding that an intervention is effective when in fact it isn’t. A recent study examining the practices of experimenters who use a popular online platform for A/B testing found that the majority engage in such “p-hacking,” increasing false discovery rates from 33% to 42%.
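The effect of peeking can be demonstrated with a small simulation. The sketch below runs A/A tests, in which both versions share the same true conversion rate, so every "significant" result is a false positive. The number of looks, the batch size, and the conversion rate are arbitrary illustrative values, and the simulation does not attempt to reproduce the figures from the study cited above.

```python
import random
from math import sqrt
from statistics import NormalDist

def p_value(conv_a, conv_b, n):
    """Two-sided p-value from a pooled z-test; both arms have n users."""
    pooled = (conv_a + conv_b) / (2 * n)
    if pooled <= 0.0 or pooled >= 1.0:
        return 1.0  # no variation yet, so no evidence of a difference
    se = sqrt(pooled * (1 - pooled) * 2 / n)
    z = (conv_b - conv_a) / n / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

def false_positive_rates(n_tests=400, looks=10, batch=100, rate=0.10, seed=3):
    """Simulate A/A tests (no true effect). Returns (peeking_rate, fixed_rate)."""
    rng = random.Random(seed)
    peeking_hits = fixed_hits = 0
    for _ in range(n_tests):
        conv_a = conv_b = 0
        saw_significance = False
        p = 1.0
        for look in range(1, looks + 1):
            conv_a += sum(rng.random() < rate for _ in range(batch))
            conv_b += sum(rng.random() < rate for _ in range(batch))
            p = p_value(conv_a, conv_b, look * batch)
            if p < 0.05:
                saw_significance = True  # a peeker would stop and declare a win here
        peeking_hits += saw_significance
        fixed_hits += p < 0.05  # only the final, pre-planned look counts
    return peeking_hits / n_tests, fixed_hits / n_tests

peeking_rate, fixed_rate = false_positive_rates()
print(f"peeking: {peeking_rate:.1%}, single planned look: {fixed_rate:.1%}")
```

Because the final look is one of the ten peeks, the peeking rate can never be lower than the single-look rate, and in practice it is several times higher.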
Discrimination
Once you’ve imposed a categorical structure, you tend to favor certain categories over others. But insufficiently attending to other categories can be harmful.
Imagine that you’re the digital marketing director for an online retailer that sells home furnishings with unique and creative designs. You’ve done a segmentation study and identified a target customer segment with the following characteristics: male professionals aged 18 to 34 with creative jobs in fashion, marketing, or media and with medium disposable income. You have $10,000 to spend on digital ads, and you’re considering three plans: (1) No targeting. The ad is served with equal probability to all Facebook users and will cost 40 cents per click. (2) Full targeting. The ad is served only to your target segment and will cost 60 cents per click. (3) Partial targeting. You invest half your budget in marketing to your target segment and the other half in mass marketing, which will cost 48 cents per click.
Which plan should you choose? Probably plan 2 or plan 3, because each allows you to narrow your target, right?
Wrong. The best option is probably plan 1, the broadest target. Why? Because targeting broadly often yields a higher ROI than targeting narrowly. Researchers have found that online ads tend to increase purchase probability by only a small fraction of a percent. If the chance that someone will buy your product without seeing an ad is 0.10%, exposure to an ad might move the probability up to 0.13%. The positive impact of the ad may be a bit greater for target customers, but in many cases it won’t compensate for the additional cost per click. Marketers, however, get obsessed with their target customers, ignoring the value that can be extracted from everyone else.
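The arithmetic behind that conclusion is easy to check. The sketch below uses the budget, click costs, and purchase probabilities given in the example, and adds one simplifying assumption of its own: each paid click is treated as one ad exposure that raises purchase probability by the stated lift. The function names are hypothetical.

```python
BUDGET = 10_000.0
BASE_LIFT = 0.0013 - 0.0010  # ad moves purchase probability from 0.10% to 0.13%

def clicks(budget, cost_per_click):
    """Number of clicks a budget buys at a given cost per click."""
    return budget / cost_per_click

def incremental_sales(n_clicks, lift):
    """Extra purchases attributable to the ad (assumes one exposure per click)."""
    return n_clicks * lift

def break_even_target_lift(cpc_broad=0.40, cpc_target=0.60, broad_lift=BASE_LIFT):
    """Lift per targeted click needed to match the untargeted plan."""
    return broad_lift * cpc_target / cpc_broad

broad_clicks = clicks(BUDGET, 0.40)
target_clicks = clicks(BUDGET, 0.60)
blended_cpc = BUDGET / (clicks(BUDGET / 2, 0.60) + clicks(BUDGET / 2, 0.40))

print(round(broad_clicks), round(target_clicks))
print(round(incremental_sales(broad_clicks, BASE_LIFT), 2))
print(round(blended_cpc, 2))
print(f"{break_even_target_lift():.3%}")
```

Under these assumptions, full targeting buys only two-thirds as many clicks, so the ad would need to be about 1.5 times as effective on target customers just to break even with the untargeted plan.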
Facebook has been engaged in a concerted effort to teach its advertising customers about the importance of reach relative to narrow targeting. It cites the case of a beer brand that traditionally focused on men. When the brand moved onto digital media platforms, it was able to narrow its targeting, which seemed like a good thing. But in fact that severely limited the reach of its campaigns, and the brand started performing poorly. After some investigation the company realized that a significant proportion of people consuming its product were women. Once it broadened its targeting and creative messaging, it saw immediate positive results.
Net Promoter Score.
Discrimination can distort how data is interpreted. When we teach classes on data analytics, we often ask our students whether they’ve heard of the Net Promoter Score (NPS) and whether their companies use the metric in some way. Invariably most hands go up, and for good reason. After Frederick F. Reichheld introduced the concept, in this magazine (“The One Number You Need to Grow,” December 2003), it quickly became one of the most important key performance indicators in business, and it still is.
What is NPS, and how does it work? Companies ask customers (or employees) to indicate on a 0–10 scale how likely they are to recommend the company to relatives or friends. Zero means “not at all likely,” and 10 means “extremely likely.” After responding, customers are grouped into three categories—detractors (0–6), passives (7–8), and promoters (9–10). The NPS is arrived at by determining the percentage of customers in each category and then subtracting the percentage of detractors from the percentage of promoters. If 60% of your customers are promoters and 10% are detractors, your NPS is 50.
There are good reasons to use NPS. It’s straightforward and easy to understand. Also, it helps avoid the amplification bias that comes with categorical thinking—or, as Reichheld put it in his article, “the ‘grade inflation’ that often infects traditional customer-satisfaction assessments, in which someone a molecule north of neutral is considered ‘satisfied.’”
That’s helpful. But the NPS system actually exhibits the sort of amplification bias that it’s supposed to help companies avoid. Customers who score a 6, for example, are much closer to a 7 than a 0, but nonetheless they get lumped in with the detractors rather than the passives. Small differences across category boundaries matter in determining the score, in other words—whereas the same or larger differences within a category don’t.
NPS has another categorical-thinking problem: It disregards the number of passives it finds. Consider two extreme survey results: One company has 0% detractors and 0% promoters. Another company has 50% detractors and 50% promoters. The NPS for both is the same, but clearly their customer bases are very different and should be managed in different ways.
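Both problems are easy to see once the calculation is written out. The sketch below implements the scoring rule as described above; the sample responses are invented for illustration.

```python
def nps(scores):
    """Net Promoter Score from a list of 0-10 responses:
    percentage of promoters (9-10) minus percentage of detractors (0-6)."""
    promoters = sum(1 for s in scores if s >= 9)
    detractors = sum(1 for s in scores if s <= 6)
    return 100 * (promoters - detractors) / len(scores)

# 60% promoters, 30% passives, 10% detractors
print(nps([10] * 60 + [8] * 30 + [3] * 10))

# Two very different customer bases with the same score
all_passive = [7] * 100
polarized = [0] * 50 + [10] * 50
print(nps(all_passive), nps(polarized))

# A one-point difference across the boundary versus six points within a category
print(nps([6] * 100), nps([7] * 100), nps([0] * 100))
```

The all-passive and polarized companies both score zero, and a base of uniform 6s scores exactly the same as a base of uniform 0s, while a base of uniform 7s scores 100 points higher.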
Biased interpretation of correlations.
Categorical thinking can also distort how you interpret data. Imagine that you’re responsible for managing a service desk. You believe that the satisfaction of your agents may have an effect on customer satisfaction, so you commission a study. A few weeks later a team from HR analytics sends you the data, visualized in a scatterplot that looks like Figure 1.
How would you evaluate the strength of the relationship between agent satisfaction and customer satisfaction? Most people see a moderately strong relationship.
But what if the results were different, and you were sent the scatterplot in Figure 2? How would you evaluate the strength of the relationship now?
Most people see a much weaker relationship or none at all. But the strength of the relationship is actually about the same. The scatterplots are identical except for eight data points that have moved from the upper-right quadrant in the first one to the lower-left quadrant in the second.
So why do people see a stronger relationship in the first graph? Because they tend to privilege the upper-right quadrant. In the first scatterplot they see many satisfied agents with satisfied customers, so they conclude that the correlation is fairly strong. In the second scatterplot they see few satisfied agents with satisfied customers, so they conclude that the correlation is weaker. There’s a lesson here: Failing to attend equally to all categories harms your ability to accurately uncover relationships between variables.
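A small experiment shows how the correlation can stay put while the picture changes. The data below are synthetic and built for illustration, not taken from the scatterplots described above: a base cloud that is symmetric around the center, plus eight extra points placed either in the upper-right quadrant or mirrored into the lower-left.

```python
import random
from math import sqrt

def pearson(points):
    """Pearson correlation coefficient for a list of (x, y) pairs."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxy = sum((x - mx) * (y - my) for x, y in points)
    sxx = sum((x - mx) ** 2 for x, _ in points)
    syy = sum((y - my) ** 2 for _, y in points)
    return sxy / sqrt(sxx * syy)

def upper_right(points):
    """Count points where both scores are above the center (zero)."""
    return sum(1 for x, y in points if x > 0 and y > 0)

rng = random.Random(7)
# Synthetic, centered satisfaction scores with a moderate positive relationship.
half = []
for _ in range(30):
    x = rng.gauss(0, 1)
    half.append((x, 0.5 * x + rng.gauss(0, 1)))
base = half + [(-x, -y) for x, y in half]  # symmetric around the center

extra = [(rng.uniform(0.5, 2.0), rng.uniform(0.5, 2.0)) for _ in range(8)]
plot_1 = base + extra                           # eight extra points, upper right
plot_2 = base + [(-x, -y) for x, y in extra]    # same points mirrored, lower left

print(upper_right(plot_1), round(pearson(plot_1), 4))
print(upper_right(plot_2), round(pearson(plot_2), 4))
```

Because the second data set is a mirror image of the first, the two correlations agree up to rounding, even though one plot has eight more points in the quadrant that viewers tend to privilege.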
Fossilization
Categories lead to a fixed worldview. They give us a sense that this is how things are, rather than how someone decided to organize the world. John Maynard Keynes articulated the point beautifully. “The difficulty lies, not in the new ideas,” he wrote, “but in escaping from the old ones.”
In the 1950s the Schwinn Bicycle Company dominated the U.S. bicycle market. Schwinn focused on the youth market, building heavy, chrome-encrusted, large-tired bicycles for kids to pedal around the neighborhood. But the market changed markedly from the 1950s to the 1970s. Many adults took up cycling for sport and sought lighter, higher-performance bikes. Schwinn failed to adapt, and U.S. consumers gravitated toward European and Japanese bicycle makers. This was the beginning of Schwinn’s grinding and painful decline into obsolescence. The company’s view of the consumer landscape had fossilized from decades of success selling bikes to children, blinding Schwinn to the tectonic changes under way.
Innovation is about breaking the tendency to think categorically. Many businesses aim to increase the efficiency of their operations through categorization. They assign tasks to people, people to departments, and so on. Such disciplinary boundaries serve a purpose, but they also come at a cost. Future business problems don’t fall neatly within the boundaries that were created to help solve past problems. And thinking only within existing categories can slow down the creation of knowledge, because it interferes with people’s ability to combine elements in new ways.
Consider what researchers from the University of Toronto discovered in 2016, when they asked about 200 participants to build an alien with Legos. Some participants were asked to use blocks that had been organized into groups, and others were asked to use blocks in a random assortment. A third group was then asked to rate the creativity of the solutions—and declared the aliens made using uncategorized blocks to be more creative.
When categories fossilize, they can impede innovation in another way, by making it hard to think about using objects (or ideas) in atypical ways. This is the problem of functional fixedness. If you were given a screw and a wrench and asked to insert the screw in a wall, what would you do? You might try to clamp the head of the screw with the wrench and twist the screw into the wall—with predictably awkward and ineffective results. The most effective approach—using the wrench to hammer the screw in like a nail—might not occur to you.
Limiting the Dangers of Categorical Thinking
So how can a thoughtful leader avoid the harm that comes from categorical thinking? We propose a four-step process:
1. Increase awareness.
We all think categorically, and for good reason. But anybody who makes decisions needs to be aware of the alluring oversimplifications and distortions that categorical thinking encourages, the sense of easy understanding it invites, and the invisible biases it creates. The companies that best avoid those pitfalls will be the ones that help their employees be more comfortable with uncertainty, nuance, and complexity. “Is a categorization valid?” and “Is it useful?” are questions that should be part of the decision-making mantra.
2. Develop capabilities to analyze data continuously.
To avoid the decision-making errors that stem from categorical thinking, good continuous analytics are key. But many companies lack the know-how. When it comes to segmentation, for example, they outsource the analytics to specialized companies but then improperly interpret the information they’ve bought. That’s relatively easy to fix. Well-established metrics for evaluating the validity of a defined segment can be applied with a little bit of training. Any company that uses segmentation studies as a major part of its marketing research or strategic planning should employ such metrics and do such trainings; they represent a golden opportunity for smart organizations to develop in-house expertise and reap competitive advantage.
3. Audit decision criteria.
Many companies decide they will act only after they pass some arbitrary threshold on a continuum. This has two drawbacks.
First, it increases risk. Imagine that a company is doing market research to determine whether a new product is likely to succeed. It might move forward with a launch if consumer evaluations hit a predetermined threshold during a large-scale survey or if the results of an experiment yield a p-value smaller than the magic number .05. But because the difference between just hitting and just missing the threshold is minuscule, the company may have crossed it simply because of random variation in the sample or some small bias in the data collection method. A tiny and fundamentally meaningless difference can thus lead to a dramatically different decision—and, as the Belgian government learned when it failed to reach its stock-sale threshold, possibly the wrong one. In such a situation, a staged approach is far sounder. The Belgians could have scaled the amount of investment to the weight of the evidence instead of using a binary cutoff.
Second, an arbitrary threshold can impede learning. Consider a company that plans to make organizational changes if it doesn’t hit a certain revenue target. If it just barely fails to hit that target, it assumes that something is wrong and so makes the changes. But if the company just barely makes its target, it assumes that things are OK and carries on with business as usual, even though the two cases’ numbers are almost identical.
To avoid these problems, we recommend that you perform an audit of decision-making criteria throughout your organization. You’ll probably be surprised at how many decisions are made according to go/no-go criteria. Sometimes that’s unavoidable. But usually alternatives exist, and they represent another opportunity to reap competitive advantage.
4. Schedule regular “defossilization” meetings.
Even if you follow the three steps above, fossilization is still a danger. To avoid it, hold regular brainstorming meetings at which you scrutinize your most basic beliefs about what is happening in your industry. Is your model of the customer landscape still relevant? Are customer needs and desires changing?
One way to innovate is to reflect on the individual components that make up existing categories and imagine new functions for them. For instance, cars transport people from A to B, and postal workers transport mail from A to B, right?
Well, yes—but if you think that way, you’re probably overlooking interesting opportunities. Amazon recognized this. When the company questioned the function of cars, it realized that they could be used to receive packages, so in the United States it began to deliver mail to the trunks of cars belonging to Prime members. Similarly, in the Netherlands, when PostNL considered the function of its postal workers, it recognized that while walking their routes they could regularly photograph weeds to better assess the effectiveness of herbicidal treatments—a valuable new function that categorical thinking would never have allowed the company to see.
Categories are how we make sense of the world and communicate our ideas to others. But we are such categorization machines that we often see categories where none exist. That warps our view of the world, and our decision making suffers. In the old days, businesses might have been able to get by despite these errors. But today, as the data revolution progresses, a key to success will be learning to mitigate the consequences of categorical thinking.