Diversity Leads to Better Solutions. Here’s why.

The advantage of using data to drive innovation and business results is often discussed, with mounds of evidence backing up its efficacy. However, data by itself has zero value. It is people who dissect data (digital or analog) to extract the information required to solve tough problems. It makes sense, then, to talk about how the human component of our systems affects the quality of our solutions.

Among the human factors that lead to extraordinary performance, team diversity is one of the least understood and most under-utilized levers at our disposal.

I’m not writing this post to convince you that maximizing diversity will improve business performance. There is an overwhelming amount of evidence that diverse teams produce more favourable results (yes, those are all links) in business and scientific domains.

I want to explore the mechanism behind the phenomenon. How does diversity lead to desirable outcomes? We can use formal models derived from evolutionary biology and statistics to arrive at the fundamental forces leading to this interesting outcome.

If your medium of choice is video, feel free to skip the rest of the post altogether and just watch me present these concepts with animated hand-waving.

View this blog as a video presentation on YouTube

If you prefer to read the material instead, the rest of the blog covers the exact same material in textual form.

Problem-Solving Formalized

If we want to talk about how diversity affects the quality of our solutions, we have to first formally define what it means to solve a problem.

We have some problem P

We propose some solution S

Our solution has some value to us, V(S)

We want to maximize that value in a limited timeframe

If we were to visualize the value of all potential solutions to our problem it might look something like this:

Value of hypothetical solutions 1-7


You can look at a particular problem from different perspectives. Perspectives are nothing new. For example, to represent a particular point on a plane, you might use a Cartesian perspective, with the point represented as X and Y coordinates, or a polar perspective, where the point is defined by an angle and a length.

Alternate perspectives for a point in space

The choice of perspective will greatly influence your ability to represent solutions to the problem. If you want to represent a straight line, it makes a lot more sense to use the Cartesian perspective, whereas to represent arcs, the polar perspective might be a more appropriate encoding.

The choice of perspective affects the simplicity of the solution
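
To make the two perspectives concrete, here is a minimal Python sketch that converts a point between them. The function names `to_polar` and `to_cartesian` are my own, picked for illustration, and the arc of radius 2 is a made-up example rather than the one in the figure.

```python
import math

def to_polar(x, y):
    """Cartesian (x, y) -> polar (length, angle in radians)."""
    return math.hypot(x, y), math.atan2(y, x)

def to_cartesian(length, angle):
    """Polar (length, angle in radians) -> Cartesian (x, y)."""
    return length * math.cos(angle), length * math.sin(angle)

# Hypothetical example: a quarter arc of radius 2.
# In the polar perspective only the angle changes; the length stays fixed.
arc_polar = [(2.0, math.radians(deg)) for deg in (0, 30, 60, 90)]

# The same arc in the Cartesian perspective: both coordinates change at every step.
arc_cartesian = [to_cartesian(length, angle) for length, angle in arc_polar]

for (length, angle), (x, y) in zip(arc_polar, arc_cartesian):
    print(f"polar: ({length:.1f}, {math.degrees(angle):4.0f} deg)  "
          f"cartesian: ({x:.3f}, {y:.3f})")
```

The arc is described by one changing number in the polar encoding and two in the Cartesian one, which is the sense in which the polar perspective is the simpler encoding for arcs.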

Here’s another example. Let’s say you’re apartment-hunting. The perspective you choose to use will have a big impact on your satisfaction with the apartment you will find.

  • Luxury → Square Footage
  • Convenience → Proximity to the subway line
  • Education → Quality of nearby schools

When you view a problem through a particular perspective, you end up with a continuous landscape of potential solutions defined by that perspective.

A continuous solution landscape

For example, suppose you’re a chocolate bar manufacturer and your value function V(S) is the number of bars a new product would sell.

You can look at the problem from the perspective of calories or chewiness of the new chocolate bar. The choice of perspective impacts the shape of your landscape.

The choice of perspective has an effect on the shape of the landscape

With calories, you have a peak in value on the left for the calorie-conscious snackers and another peak for the sugar lovers, but once the amount of calories per bar becomes absurdly high, nobody would buy it.

Chewiness yields many more peaks because there’s no clear or useful relationship between sales and the bar’s chewiness. These peaks are called local optima, and better perspectives tend to have fewer peaks, making the landscape easier to navigate.

The caloric landscape has fewer peaks than the chewiness landscape

Fewer peaks are desirable because when you propose a solution and all adjacent possible solutions have lower values, finding a better solution is tricky unless you think outside the box. The fewer the peaks, the less chance you have of getting stuck.
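
Here is a rough sketch of what "counting peaks" means, assuming we sample each landscape at a handful of points. The sales figures below are invented for illustration and are not taken from the figures; `local_optima` is a helper name I made up.

```python
def local_optima(values):
    """Return the indices whose value is strictly higher than every adjacent value."""
    peaks = []
    for i, v in enumerate(values):
        left = values[i - 1] if i > 0 else float("-inf")
        right = values[i + 1] if i < len(values) - 1 else float("-inf")
        if v > left and v > right:
            peaks.append(i)
    return peaks

# Hypothetical sales numbers sampled along each perspective.
calories_landscape = [2, 5, 3, 4, 7, 6, 1]
chewiness_landscape = [3, 1, 4, 2, 5, 1, 6]

print("calorie peaks at:", local_optima(calories_landscape))
print("chewiness peaks at:", local_optima(chewiness_landscape))
```

With these made-up numbers the calorie perspective has two peaks and the chewiness perspective has four, so a search along chewiness has twice as many places to get stuck.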

The Perfect Perspective

There is a perfect perspective that yields a landscape with only 1 optimum. This landscape is known as a Mt. Fuji landscape.

An example of a Mt. Fuji landscape is the size of a shovel when moving snow. When the shovel is too small, little snow can be moved, and when it is too large, the snow becomes too heavy to displace.

Mt. Fuji landscape for the amount of snow displaced from the perspective of shovel size

Mt. Fuji landscapes do not only apply to simple problems. Consider the following 2-player strategy game. We have cards 1 through 9 and each player takes a turn to pick a card, trying to end up with 3 cards which add to 15.

Sum-to-15 strategy game

You can imagine the amount of strategy that might be involved to prevent your opponent from getting the right cards.

With the right perspective, it doesn’t have to be difficult. There is a Mt. Fuji perspective that arranges the cards into a magic square: a 3×3 grid of the numbers 1 through 9 in which every row, column, and diagonal adds up to 15.

Cards ordered into a magic square

All of a sudden, our difficult strategy game turned into a game of tic-tac-toe with a simple optimal decision tree.
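
If you want to convince yourself that the two games really are the same, the sketch below checks it by brute force. It uses one standard arrangement of the 3×3 magic square (the grid in the figure may be rotated or mirrored), and the helper name `winning_lines` is my own.

```python
from itertools import combinations

# One standard 3x3 magic square; rotations and reflections work equally well.
MAGIC_SQUARE = [
    [2, 7, 6],
    [9, 5, 1],
    [4, 3, 8],
]

def winning_lines(square):
    """All tic-tac-toe lines of the square: 3 rows, 3 columns, 2 diagonals."""
    rows = [tuple(row) for row in square]
    cols = [tuple(col) for col in zip(*square)]
    diagonals = [
        tuple(square[i][i] for i in range(3)),
        tuple(square[i][2 - i] for i in range(3)),
    ]
    return {frozenset(line) for line in rows + cols + diagonals}

# Every way to hold three distinct cards from 1..9 that add up to 15.
sum_to_15_hands = {
    frozenset(hand) for hand in combinations(range(1, 10), 3) if sum(hand) == 15
}

print(len(sum_to_15_hands), "winning hands")
print("same as tic-tac-toe lines:", sum_to_15_hands == winning_lines(MAGIC_SQUARE))
```

Because the set of winning hands is exactly the set of lines on the grid, picking a card is the same move as marking a square.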

The neat thing is that we know a Mt. Fuji landscape exists for every problem out there.

Savant Existence Theorem
For any problem, there exist many perspectives that create Mount Fuji landscapes.


One simply has to order the potential solutions in a way that yields such a landscape. If the ordering exists, so does the perspective.

The problem is that this perfect perspective is extremely difficult to find in most cases.

Consider this. With only 15 potential solutions to some problem, there are 15! ways to order the solutions. That’s about 1.3 trillion perspectives.
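
The arithmetic is easy to check with Python's standard library:

```python
import math

# Number of distinct orderings (perspectives) of 15 candidate solutions.
orderings = math.factorial(15)
print(f"{orderings:,}")  # 1,307,674,368,000
```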

We need some way to more efficiently navigate the solution space.

This is where we come to heuristics


A heuristic is an imperfect but practical approach to problem-solving that is known to work most of the time for reaching an approximately optimal solution.

There are many heuristics you’re probably familiar with.

The most common heuristic is “hill-climbing” or gradient ascent. Using this heuristic, you move towards the closest adjacent possible solution to the current solution that yields a noticeable improvement. Most incremental improvements fall in this category.

Hill-climbing heuristic

As you can see, this can work well but runs the risk of getting stuck at a local optimum. This is well known to machine learning practitioners who rely on gradient descent to find parameters to a function that minimizes prediction error.
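
Here is a minimal sketch of hill-climbing on a one-dimensional landscape. It is the "steepest step" variant (always move to the best adjacent solution), which is only one of several ways to implement the heuristic, and the landscape values are invented for illustration.

```python
def hill_climb(values, start):
    """Move to the best adjacent index until no neighbour is an improvement."""
    i = start
    while True:
        neighbours = [j for j in (i - 1, i + 1) if 0 <= j < len(values)]
        if not neighbours:
            return i
        best = max(neighbours, key=lambda j: values[j])
        if values[best] <= values[i]:
            return i  # every adjacent solution is no better: we are on a peak
        i = best

# Hypothetical landscape with a small peak at index 1 and the best solution at index 5.
landscape = [1, 3, 2, 4, 6, 9, 5]

print("starting at 0, we stop at index", hill_climb(landscape, 0))
print("starting at 3, we stop at index", hill_climb(landscape, 3))
```

Starting from the left edge, the climb stops on the small peak and never sees the much better solution further right. The outcome depends entirely on where you start.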

Another heuristic could be “do the opposite.” You may have practiced some form of this by playing “devil’s advocate”.

One example of this can be seen in the field of marketing. Most marketers put their most valuable assets behind an email-capture form to collect leads.

Some marketers do the opposite. Instead of giving value after some obstacle, they deliver value upfront, hooking the audience with the quality of information provided, leading them to willingly give their email to learn more. This solution can have superior engagement and even collect more emails.

“Do the opposite” heuristic for email-capture

Yet another example of a heuristic, this time from the realm of management, is “big rocks first”. If you have a bunch of goals for the year and all your effort is spent on the small goals, the big items on that list will not get the attention they need to move the needle.

You may be asking yourself the million dollar question:

“Are there heuristics that are better than others for all problems?”

Unfortunately, the answer is no. This has been proved by the No Free Lunch Theorem, which states that unless you know something about the problem being solved, no search algorithm performs better than any other when averaged over all possible problems.

The Effect of Team Diversity

Now that we have the vocabulary, we can bring it all together. Up until now, I’ve been vaguely saying “diversity” leads to better outcomes. What kind of diversity?

In order to come up with better solutions we need diverse heuristics as well as diverse perspectives. Of course, these are highly correlated with gender and ethnic diversity.

Since no single heuristic is better than any other across all problems, we need diverse heuristics in order to find more optima in our landscapes. For example, if our solution space is a grid and I only look up, down, left, and right to find the best solution, we may converge on an optimal solution much faster if you look diagonally as well.

Diagonal vs row-column heuristics
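
A small sketch of the same idea, assuming a toy 3×3 grid whose values I made up so that neither heuristic alone reaches the best cell. The names `ROW_COLUMN`, `DIAGONAL`, and `climb` are hypothetical.

```python
ROW_COLUMN = [(-1, 0), (1, 0), (0, -1), (0, 1)]   # up, down, left, right
DIAGONAL = [(-1, -1), (-1, 1), (1, -1), (1, 1)]   # the four diagonal moves

def climb(grid, start, moves):
    """Hill-climb on a grid using only the allowed moves; return the final cell."""
    r, c = start
    while True:
        candidates = [
            (r + dr, c + dc)
            for dr, dc in moves
            if 0 <= r + dr < len(grid) and 0 <= c + dc < len(grid[0])
        ]
        best = max(candidates, key=lambda p: grid[p[0]][p[1]], default=(r, c))
        if grid[best[0]][best[1]] <= grid[r][c]:
            return (r, c)
        r, c = best

# Hypothetical solution values; the best solution (9) sits at row 1, column 2.
grid = [
    [5, 6, 1],
    [1, 1, 9],
    [1, 1, 1],
]

for name, moves in [
    ("row-column only", ROW_COLUMN),
    ("diagonal only", DIAGONAL),
    ("both heuristics", ROW_COLUMN + DIAGONAL),
]:
    r, c = climb(grid, (0, 0), moves)
    print(f"{name}: stops at {(r, c)} with value {grid[r][c]}")
```

On this grid the row-column search stalls at 6 and the diagonal search never leaves the starting cell, while the combined search takes one move of each kind and reaches 9.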

Diverse perspectives have a very interesting effect on the shape of the solution landscape of the entire team. Imagine we have 2 team members Hans and Hanna, each with a unique perspective.

The landscapes resulting from Hans’ and Hanna’s perspectives

Note that solutions A and B in Hans’ landscape represent the exact same solutions A and B in Hanna’s landscape. In other words, the different perspectives yield the same locally optimal solutions in both landscapes.

Who has a better perspective? Hanna has fewer peaks (meaning fewer places to get stuck), so her perspective may be more useful. We can also look at this from the “perspective” of the average value of local optima for each person. The average value of each person’s optima is known as the individual ability.

Looking at ability, we can see that Hanna’s locally optimal solutions also have a higher average value than Hans’.

Individual ability of Hans and Hanna

Having a diverse set of local optima across team members means that if one team member gets stuck at an optimum, they can check if any other team members have a better solution, and simply continue the search from there.

In general, we can make the following claim:

A team can only get stuck on a local optimum that’s shared by every single member.

Team ability tends to be higher than any individual’s ability because the team keeps only the local optima that all of its members share. The low-value peaks that trap one member but not another drop out, so they no longer erode the average value.

Team ability tends to be higher than individual ability

This effect grows as more team members are added. However, the beneficial effect on team ability only materializes if there is diversity in individual perspectives. If all team members share the same perspective, they also share the same local optima, so the team gets stuck exactly where each individual would and the gains disappear.
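
The claim can be sketched with plain sets. The solution values and each person's peaks below are invented for illustration (they are not read off the figures), and `ability` is my own helper name.

```python
from statistics import mean

# Hypothetical value V(S) of five candidate solutions.
values = {"A": 7, "B": 10, "C": 3, "D": 5, "E": 2}

# Hypothetical local optima under each person's perspective.
hans_optima = {"A", "B", "C", "E"}
hanna_optima = {"A", "B", "D"}

def ability(optima):
    """Average value of the solutions where a searcher can get stuck."""
    return mean(values[s] for s in optima)

# The team only gets stuck where every member is stuck.
team_optima = hans_optima & hanna_optima

print("Hans: ", ability(hans_optima))
print("Hanna:", round(ability(hanna_optima), 2))
print("Team: ", ability(team_optima), "stuck only at", sorted(team_optima))

# A second Hans adds nothing: identical perspectives share identical peaks.
print("Hans + Hans:", ability(hans_optima & hans_optima))
```

With these numbers the team's ability is higher than either member's, and pairing Hans with a clone of himself leaves his ability unchanged.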

The Fine Print

The model above neglects some factors that influence the effectiveness of diversity on outcomes.

Imperfect Communication

We are not perfect communicators. It is not possible for me to transplant my ideas into your head for you to understand the value of my solution. One way to minimize the error this introduces is to skip verbal communication and produce an artifact representing the solution so you can observe it directly.

Lack of Inclusion

If I invite you to the meeting but don’t include you in a meaningful way, your unique perspectives, heuristics, and solutions go to waste.

The Value Oracle

Until now I assumed that the value of any particular solution is obvious to the whole team. In reality, we do not have some “value oracle” giving us the exact value of our potential solutions.

In these cases, we have to rely on the aggregation of our experts’ predictions to give an accurate estimate of the value of each solution. This brings us to our next topic, the “wisdom of crowds.”

The Wisdom of Crowds

Although predictions made by individuals tend to be inaccurate, when you start combining the predictions of multiple individuals, you tend to get a prediction that is quite close to the mark.

For example, let’s say we have 3 individuals making predictions for an outcome with an actual value of 18.

Individual predictions vs average prediction

Individually, they do not do a good job with their predictions, but they are only off by 1 in their average prediction.

We can define the error of each individual prediction as the difference between the prediction and the actual value. In order to prevent positive differences to the actual value from cancelling out the negative differences when calculating the average individual error, we can square the individual errors to get rid of negative values.

Error = (Prediction - Actual)^2

Similarly, we can calculate the diversity of the predictions by measuring the difference between each prediction and the average prediction. We square the terms once again to dispose of negative values. In statistics, the average of these squared differences is known as the variance of the predictions.

Prediction Diversity = (Prediction - Average Prediction)^2

Having defined error and diversity, we can now unlock the diversity prediction theorem.

Diversity Prediction Theorem
Crowd’s Error = Average Individual Error – Prediction Diversity

Although this looks like pseudo-math, the definitions I’ve given you are all that’s necessary to expand the term and prove that the equivalence is valid.

Formally, the theorem is expressed by the following formula:

\[ (c - \theta)^2 = \frac{\sum{(s_i - \theta)^2}}{n} - \frac{\sum{(s_i - c)^2}}{n} \]
Diversity Prediction Theorem

c = The average prediction of the crowd
𝜽 = The actual value
s_i = The prediction of the ith individual
n = The number of individuals
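
To see the theorem at work on actual numbers, here is a small sketch. The three predictions are hypothetical, chosen so that the actual value is 18 and the crowd's average is off by 1; they are not necessarily the numbers from the figure above. The function names are my own.

```python
def crowd_error(predictions, actual):
    """Squared error of the crowd's average prediction: (c - theta)^2."""
    c = sum(predictions) / len(predictions)
    return (c - actual) ** 2

def average_individual_error(predictions, actual):
    """Mean of the squared individual errors: sum((s_i - theta)^2) / n."""
    return sum((s - actual) ** 2 for s in predictions) / len(predictions)

def prediction_diversity(predictions):
    """Mean squared distance from the crowd's average: sum((s_i - c)^2) / n."""
    c = sum(predictions) / len(predictions)
    return sum((s - c) ** 2 for s in predictions) / len(predictions)

predictions = [10, 16, 31]  # hypothetical guesses
actual = 18

lhs = crowd_error(predictions, actual)
rhs = average_individual_error(predictions, actual) - prediction_diversity(predictions)

print("crowd's error:           ", lhs)
print("average individual error:", average_individual_error(predictions, actual))
print("prediction diversity:    ", prediction_diversity(predictions))
print("theorem holds:", abs(lhs - rhs) < 1e-9)
```

Here the average individual error is 79 and the diversity is 78, leaving a crowd error of just 1. The individuals are far off, but they are far off in different directions.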

This simple formula continues to tell us the same story: without diversity, there is no wisdom in crowds.


When the average individual error is small, the crowd’s error is also small, and diversity doesn’t play much of a role, because our problem is easy.

When our problem is difficult, the individual predictions will not be accurate, driving our crowd’s error up. The only way to reduce the crowd error is to add diversity to the mix.

Signal Independence

Sometimes, even if you introduce diversity, you can be missing out on all the benefits by not having processes in place that extract value from the unique perspectives and heuristics.

For example, during a brainstorming session, if everyone in the room is participating in the same discussion, we tend to fall into the trap of groupthink and lose the individual genius that comes from our diverse team. It would be much better to have everyone ideate in parallel and then work together to converge on the highest-value solutions.

This notion is well-supported by research that demonstrates that even with a slight correlation between team members’ predictions, you are getting only a small fraction of the signal that could have been harvested from independent perspectives.

The equivalent number of independent experts given different correlations between experts

It’s quite astounding that even with a correlation of 0.2, 9 experts add no more signal than 3 experts with independent perspectives. With a higher correlation of 0.4, you’ll have a tough time getting anything more than the equivalent of 2 independent experts.
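
For a rough feel of where numbers like these come from, here is a sketch under a simplifying assumption of mine: every expert has the same error variance and every pair of experts shares the same correlation. Under that assumption the variance of the average of n predictions is the same as that of n / (1 + (n - 1) × correlation) independent experts. The research behind the figure may use a different model, so treat the exact outputs as illustrative.

```python
def equivalent_independent_experts(n, correlation):
    """Assumed model: n experts, equal error variance, one shared pairwise correlation."""
    return n / (1 + (n - 1) * correlation)

for correlation in (0.0, 0.2, 0.4):
    row = ", ".join(
        f"{n} experts -> {equivalent_independent_experts(n, correlation):.2f}"
        for n in (3, 9, 100)
    )
    print(f"correlation {correlation}: {row}")
```

Under this model, 9 experts correlated at 0.2 are worth about 3.5 independent ones, and at 0.4 they are worth about 2.1. Adding more experts barely helps, because the equivalent count can never exceed 1 / correlation.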

How does this correlation come about? It can come from the experts discussing the problem together or even with a person they know in common. It can also arise from sociocultural factors such as sharing a similar cultural or educational background.

Further Reading

This has been only a shallow dive into the topic of diversity and how it affects our organizations. We have covered some of the why, but the hard work of making diversity a normative part of our work culture remains a challenge worth examining.

Most of the material for this post is based on work done by Scott E. Page. Specifically, the application of ideas from evolutionary biology (fitness landscapes) and computer science (heuristics) to the field of innovation was derived from his book “The Model Thinker.”

There are also 2 other books by the author that expand on the ideas presented here in great detail.

Books by Scott E. Page | Image from Scott’s website

If you would like to explore work done around the tangible business benefits of diversity, you may enjoy these resources:

If you have insights about implementing the systems required to make diversity a first-class concern, please contribute to the conversation in the comments or on the social network of your choice.

2 Responses

  1. If in the formula “Crowd’s Error = Average Individual Error – Diversity” you can decrease the crowd error by increasing the diversity independent from the individual error, then you can increase the diversity to the point that the crowd error is zero! Or even negative. The formula is correct though. So maybe something is wrong with the assumption that you can increase diversity independent from the individual error?

    1. Your final intuition is correct! They are not independent! You can see that in the fact that both terms, “Average Individual Error” and “Prediction Diversity”, depend on the individuals’ predictions.

      The proof is embedded in the formula. The algebraic manipulation is a bit gnarly, but it can be done!

      First, we’ll start by expanding the right-hand side of the equation, which consists of two terms: the average individual error and the prediction diversity.

      1. Average Individual Error:

      \[ \frac{\sum{(s_i - \theta)^2}}{n} = \frac{\sum{(s_i^2 - 2s_i\theta + \theta^2)}}{n} \]

      2. Prediction Diversity:

      \[ \frac{\sum{(s_i - c)^2}}{n} = \frac{\sum{(s_i^2 - 2s_i c + c^2)}}{n} \]

      Subtract the prediction diversity from the average individual error:

      \[ \left( \frac{\sum{(s_i^2 - 2s_i\theta + \theta^2)}}{n} \right) - \left( \frac{\sum{(s_i^2 - 2s_i c + c^2)}}{n} \right) \]

      \[ = \frac{\sum{s_i^2} - 2\theta\sum{s_i} + n\theta^2 - \sum{s_i^2} + 2c\sum{s_i} - nc^2}{n} \]

      Simplify the equation by cancelling out the \(\sum{s_i^2}\) terms and dividing each term by \(n\):

      \[ = \frac{- 2\theta\sum{s_i} + n\theta^2 + 2c\sum{s_i} - nc^2}{n} \]

      \[ = -2\theta\frac{\sum{s_i}}{n} + \theta^2 + 2c\frac{\sum{s_i}}{n} - c^2 \]

      Now remember that \(c = \frac{\sum{s_i}}{n}\), so we can replace \(\frac{\sum{s_i}}{n}\) with \(c\):

      \[ = -2\theta c + \theta^2 + 2c^2 - c^2 \]

      \[ = \theta^2 - 2\theta c + c^2 \]

      This is the expanded form of \((c - \theta)^2\), which is the left side of the equation.

      So we have shown that the left and right side of the equation are equivalent!
