In trying to make sense of intelligent behavior, it is tempting to

try something like this: We begin by looking at the most common cases of the behavior we can think of, and figure out what

it would take to handle them. Then we build on it: We refine our

account to handle more and more. We stop when our account

crosses some threshold and appears to handle (say) 99.9 percent

of the cases we are concerned with.

This might be called an *engineering* strategy. We produce a

rough form of the behavior we are after, and then we engineer it

to make it work better, handle more. We see this quite clearly in

mechanical design. Given a rocket with a thrust of

*X*, how can it be refined to produce a thrust of *Y*? Given a
bridge that will support load *X*, how can it be bolstered to
support load *Y*?

This engineering strategy does work well with a number of

phenomena related to intelligent behavior. For example, when

learning to walk, we do indeed start with simple, common cases,

like walking on the floor or on hard ground, and eventually

graduate to walking on trickier surfaces like soft sand and ice.

Similarly, when learning a first language, we start by listening to

baby talk, not the latest episode of *The McLaughlin Group* (or the

Dana Carvey parody, for that matter).

There are phenomena where this engineering strategy does

not work well at all, however. We call a distribution of events

*long-tailed* if there are events that are extremely rare individually

but remain significant overall. These rare events are what

Nassim Taleb calls “black swans.” (At one time, it was believed in

Europe that all swans were white.) If we start by concentrating

on the common cases, and then turn our attention to the slightly

less common ones, and so on, we may never get to see black

swans, which can be important nonetheless. From a purely statistical point of view, it might appear that we are doing well, but

in fact, we may be doing quite poorly.

**A numeric example**

To get a better sense of what it is like to deal with a long-tailed

phenomenon, it is useful to consider a somewhat extreme imaginary example involving numbers.

Suppose we are trying to estimate the average of a very large

collection of numbers. To make things easy, I am going to start

by giving away the secret about these numbers: There are a trillion of them in the collection, and the average will turn out to

be 100,000. But most of the numbers in the collection will be

very small. The reason the average is so large is that there will

be 1,000 numbers among the trillion whose values are

extremely large, in the range of 100 trillion. (I am using

the American names here: “trillion” means 10¹² and “billion”
means 10⁹.)
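
Before going on, it is worth checking that arithmetic; here is a one-line Python check using the figures just given:

```python
# 1,000 values of roughly 100 trillion, spread across a trillion
# numbers, contribute about 100,000 apiece to the average; the
# tiny common numbers add almost nothing on top of that.
print(1_000 * 100_000_000_000_000 / 1_000_000_000_000)  # 100000.0
```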

Let us now suppose we know nothing of this, and that we are

trying to get a sense of what a typical number is by sampling.

The first 10 numbers we sample from the collection are the

following:

2, 1, 1, 54, 2, 1, 3, 1, 934, 1.

The most common number here is 1. The median (that is, the
midpoint value) is 1.5, only slightly larger. But the average does

appear to be larger than 1. In looking for the average, we want

to find a number where some of the numbers are below and

some are above, but where the total differences below are balanced by the total differences above. We can calculate a “sample

average” by adding all the numbers seen so far and dividing by

the number of them. After the first five numbers, we get a sample average of 12. We might have thought that this pattern

would continue. But after 10 numbers, we see that 12 as the

average is too low; the sample average over the 10 numbers is

in fact 100. There is only one number so far that is larger than

100, but it is large enough to compensate for the other nine that

are much closer to 100.
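
Here is a minimal Python sketch (mine, not the author's) that recomputes these sample statistics for the ten numbers above; nothing in it goes beyond the figures already quoted:

```python
from statistics import median

samples = [2, 1, 1, 54, 2, 1, 3, 1, 934, 1]
print(median(samples))  # 1.5 (the two middle values are 1 and 2)

# The running "sample average" after each new number:
total = 0
for i, x in enumerate(samples, start=1):
    total += x
    print(f"after {i:2d} samples: average = {total / i:6.1f}")
# after  5 samples: average =   12.0
# ...
# after 10 samples: average =  100.0
```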

We continue to sample to see if this guess is correct. For

a while, the sample average hovers around 100. But suppose

that after 1,000 sample numbers, we get a much larger

one: 1 million. Once we take this number into account,

our estimate of the overall average now ends up being more

like 1,000.
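
The arithmetic behind the jump is simple; a quick check using round figures from the text:

```python
# 1,000 samples hovering around 100 sum to roughly 100,000;
# a single sample of 1 million then dominates the total.
print((1_000 * 100 + 1_000_000) / 1_001)  # ~1099, i.e. "more like 1,000"
```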

Imagine that this pattern continues: The vast majority of the

numbers we see look like the first 10 shown above, but every

thousand samples or so, we get a large one (in the range of 1

million). After seeing 1 million samples, we are just about ready

to stop and announce the overall average to be about 1,000,

when suddenly we see an even larger number: 10 billion. This is

an extremely rare occurrence. We have not seen anything like it

in 1 million tries. But because the number is so big, the sample

average now has to change to 10,000. Because this was so unexpected, we decide to continue sampling, until finally, after having seen a total of 1 billion samples, the sample average

appears to be quite steady at 10,000.

This is what it is like to sample from a distribution with a long

tail. For a very long time, we might feel that we understand the

collection well enough to be confident about its statistical properties. We might say something like this:

Although we cannot calculate properties of the collection as a whole, we

can estimate those properties by sampling. After extensive sampling, we

see that the average is about 10,000, although in fact, most of the numbers are much smaller, well under 100. There were some very rare large

outliers, of course, but they never exceeded 10 billion, and a number of

that size was extremely rare, a once-in-a-million occurrence. This has

now been confirmed with over 1 billion test samples. So we can confidently talk about what to expect.

But this is quite wrong! The problem with a long-tailed phenomenon is that the longer we look at it, the less we understand what to expect. The more we sampled, the bigger the

average turned out to be. Why should we think that stopping at

1 billion samples will be enough?
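
To see this drift in action, here is a small simulation sketch (my own construction, not the author's: the tier values follow the text, the scale is reduced to a few million samples so it runs quickly, and the exact printout varies from run to run):

```python
import numpy as np

rng = np.random.default_rng()
n = 5_000_000

# Mostly tiny numbers, with rare huge ones mixed in at roughly the
# rates the text describes. The rarest tier (one in a billion, at
# 100 trillion) is left out here -- just as a billion samples may miss it.
values = rng.choice([1.0, 2.0, 3.0], size=n)  # the common small numbers
values[rng.random(n) < 1e-3] = 1e6            # about one in a thousand
values[rng.random(n) < 1e-6] = 1e10           # about one in a million

running_totals = np.cumsum(values)
for k in (10, 1_000, 100_000, n):
    print(f"after {k:>9,} samples: average ≈ {running_totals[k - 1] / k:>12,.0f}")
```

Run it a few times: until a 10-billion value happens to appear, the sample average sits comfortably near 1,000; one draw later it is near 10,000.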

To see these numbers in more vivid terms, imagine that we

are considering using some new technology that appears to be

quite beneficial. The technology is being considered to combat

some problem that kills about 36,000 people per year. (This

is the number of traffic-related deaths in the United States in

2012.) We cannot calculate exactly how many people will die

once the new technology is introduced, but we can perform

some simulations and see how they come out. Suppose that each

number sampled above corresponds to the number of people

who will die in one year with the technology in place (in one run of the simulation). The big question here is: Should the new

technology be introduced?

According to the sampling above, most of the time, we would

get well below 100 deaths per year with the new technology,

which would be phenomenally better than the current 36,000.

In fact, the simulations show that 99.9 percent of the time, we

would get well below 10,000 deaths. So that looks terrific. Unfortunately, the simulations also show us that there is a one-in-a-thousand chance of getting 1 million deaths, which would be

indescribably horrible. And if that were not all, there also appears

to be a one-in-a-million chance of wiping out all of humanity.

Not so good after all!
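
Putting those simulation statistics together as an expected value shows where the earlier sample average of 10,000 comes from. In the sketch below, the probabilities and large outcomes are the text's; the figure of 100 for the common case is a stand-in of my own:

```python
# Expected deaths per year, tier by tier. Almost all of the
# expectation comes from the once-in-a-million catastrophe.
tiers = [
    ("common case (well under 100)", 0.999, 100),
    ("one in a thousand: 1 million", 1e-3,  1e6),
    ("one in a million: 10 billion", 1e-6,  1e10),
]

expected = 0.0
for label, p, deaths in tiers:
    expected += p * deaths
    print(f"{label:32s} contributes {p * deaths:>9,.0f}")
print(f"expected deaths per year: about {expected:,.0f}")
```

The 10,000 figure is not typical of any single year; it is almost entirely the contribution of the rarest outcome seen so far.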

Someone might say:

We have to look at this realistically, and not spend undue time on events

that are incredibly rare. After all, a comet might hit the Earth too! Forget

about those black swans; they are not going to bother us. Ask yourself:

What do we really expect to happen overall? Currently 36,000 people

will die if we do not use the technology. Do we expect to be better off

with or without it?

This is not an unreasonable position. We cannot live our lives

by considering only the very *worst* eventualities. The problem
with a long-tailed phenomenon is getting a sense of what a more
*typical* case would be like. But just how big is a typical case?
Half of the numbers seen were below 10, but this is misleading

because the other half was wildly different. Ninety-nine percent

of the numbers were below 1,000, but the remaining 1 percent was wildly different. Just how much of the sampling can we

afford to ignore? The sample average has the distinct advantage

of representing the sampling process as a whole. Not only are

virtually all the numbers below 10,000, but the total weight of all the numbers above 10,000, big as they were, is matched by

the total weight of those below. And 10,000 deaths as a typical

number is still much better than 36,000 deaths.

But suppose, for the sake of argument, that after 1 billion

runs, the very next number we see is one of the *huge* ones mentioned at the outset: 100 trillion. (It is not clear that

there can be a technology that can kill 100 trillion

things, but let that pass.) Even though this is a one-in-a-billion

chance, because the number is so large, we have to adjust our

estimate of the average yet again, this time to be 100,000. And

getting 100,000 deaths would be much worse than 36,000

deaths.
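
Once more, the arithmetic of the jump, checked with the text's figures:

```python
# A billion samples averaging 10,000 sum to about 10 trillion;
# a single value of 100 trillion swamps that total.
print((1_000_000_000 * 10_000 + 100_000_000_000_000) / 1_000_000_001)
# ~110,000 -- the "100,000" of the text, in round figures
```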

This, in a nutshell, is the problem of trying to deal in a practical way with a long-tailed phenomenon. If all of your expertise

derives from sampling, you may never get to see events that are

rare but can make a big difference overall.

*From* Common Sense, the Turing Test, and the Quest for Real AI, *by Hector J. Levesque, published by the MIT Press.*
