In trying to make sense of intelligent behavior, it is tempting to
try something like this: We begin by looking at the most common cases of the behavior we can think of, and figure out what
it would take to handle them. Then we build on it: We refine our
account to handle more and more. We stop when our account
crosses some threshold and appears to handle (say) 99.9 percent
of the cases we are concerned with.
This might be called an engineering strategy. We produce a
rough form of the behavior we are after, and then we engineer it
to make it work better, handle more. We see this quite clearly in
mechanical design. Given a rocket with a thrust of X, how can it
be refined to produce a thrust of Y? Given a bridge that will support load X, how can it be bolstered to support load Y?
This engineering strategy does work well with a number of
phenomena related to intelligent behavior. For example, when
learning to walk, we do indeed start with simple, common cases,
like walking on the floor or on hard ground, and eventually
graduate to walking on trickier surfaces like soft sand and ice.
Similarly, when learning a first language, we start by listening to
baby talk, not the latest episode of The McLaughlin Group (or the
Dana Carvey parody, for that matter).
There are phenomena where this engineering strategy does
not work well at all, however. We call a distribution of events
long-tailed if there are events that are extremely rare individually
but remain significant overall. These rare events are what
Nassim Taleb calls “black swans.” (At one time, it was believed in
Europe that all swans were white.) If we start by concentrating
on the common cases, and then turn our attention to the slightly
less common ones, and so on, we may never get to see black
swans, which can be important nonetheless. From a purely statistical point of view, it might appear that we are doing well, but
in fact, we may be doing quite poorly.
A numeric example
To get a better sense of what it is like to deal with a long-tailed
phenomenon, it is useful to consider a somewhat extreme imaginary example involving numbers.
Suppose we are trying to estimate the average of a very large
collection of numbers. To make things easy, I am going to start
by giving away the secret about these numbers: There are a trillion of them in the collection, and the average will turn out to
be 100,000. But most of the numbers in the collection will be
very small. The reason the average is so large is that there will
be 1,000 numbers among the trillion whose values are
extremely large, in the range of 100 trillion. (I am using
the American names here: “trillion” means 1012 and “billion”
Let us now suppose we know nothing of this, and that we are
trying to get a sense of what a typical number is by sampling.
The first 10 numbers we sample from the collection are the
2, 1, 1, 54, 2, 1, 3, 1, 934, 1.
The most common number here is 1. The median (that is, the
midpoint number) also happens to be 1. But the average does
appear to be larger than 1. In looking for the average, we want
to find a number where some of the numbers are below and
some are above, but where the total differences below are balanced by the total differences above. We can calculate a “sample
average” by adding all the numbers seen so far and dividing by
the number of them. After the first five numbers, we get a sample average of 12. We might have thought that this pattern
would continue. But after 10 numbers, we see that 12 as the
average is too low; the sample average over the 10 numbers is
in fact 100. There is only one number so far that is larger than
100, but it is large enough to compensate for the other nine that
are much closer to 100.
The problem with a long-tailed phenomenon is that the longer we looked at it, the less we understood what to expect.
But we continue to sample to see if this guess is correct. For
a while, the sample average hovers around 100. But suppose
that after 1,000 sample numbers, we get a much larger
one: 1 million. Once we take this number into account,
our estimate of the overall average now ends up being more
Imagine that this pattern continues: The vast majority of the
numbers we see look like the first 10 shown above, but every
thousand samples or so, we get a large one (in the range of 1
million). After seeing 1 million samples, we are just about ready
to stop and announce the overall average to be about 1,000,
when suddenly we see an even larger number: 10 billion. This is
an extremely rare occurrence. We have not seen anything like it
in 1 million tries. But because the number is so big, the sample
average now has to change to 10,000. Because this was so unexpected, we decide to continue sampling, until finally, after having seen a total of 1 billion samples, the sample average
appears to be quite steady at 10,000.
This is what it is like to sample from a distribution with a long
tail. For a very long time, we might feel that we understand the
collection well enough to be confident about its statistical properties. We might say something like this:
Although we cannot calculate properties of the collection as a whole, we
can estimate those properties by sampling. After extensive sampling, we
see that the average is about 10,000, although in fact, most of the numbers are much smaller, well under 100. There were some very rare large
outliers, of course, but they never exceeded 10 billion, and a number of
that size was extremely rare, a once-in-a-million occurrence. This has
now been confirmed with over 1 billion test samples. So we can confidently talk about what to expect.
But this is quite wrong! The problem with a long-tailed phenomenon is that the longer we looked at it, the less we understood what to expect. The more we sampled, the bigger the
average turned out to be. Why should we think that stopping at
1 billion samples will be enough?
To see these numbers in more vivid terms, imagine that we
are considering using some new technology that appears to be
quite beneficial. The technology is being considered to combat
some problem that kills about 36,000 people per year. (This
is the number of traffic-related deaths in the United States in
2012.) We cannot calculate exactly how many people will die
once the new technology is introduced, but we can perform
some simulations and see how they come out. Suppose that each
number sampled above corresponds to the number of people
who will die in one year with the technology in place (in one run of the simulation). The big question here is: Should the new
technology be introduced?
According to the sampling above, most of the time, we would
get well below 100 deaths per year with the new technology,
which would be phenomenally better than the current 36,000.
In fact, the simulations show that 99.9 percent of the time, we
would get well below 10,000 deaths. So that looks terrific. Unfortunately, the simulations also show us that there is a one-in-a-thousand chance of getting 1 million deaths, which would be
indescribably horrible. And if that were not all, there also appears
to be a one-in-a-million chance of wiping out all of humanity.
Not so good after all!
Someone might say:
We have to look at this realistically, and not spend undue time on events
that are incredibly rare. After all, a comet might hit the Earth too! Forget
about those black swans; they are not going to bother us. Ask yourself:
What do we really expect to happen overall? Currently 36,000 people
will die if we do not use the technology. Do we expect to be better off
with or without it?
This is not an unreasonable position. We cannot live our lives
by considering only the very worst eventualities. The problem
with a long-tailed phenomenon is getting a sense of what a more
typical case would be like. But just how big is a typical case? Half
of the numbers seen were below 10, but this is misleading
because the other half was wildly different. Ninety-nine percent
of the numbers were below 1,000, but the remaining 1 percent was wildly different. Just how much of the sampling can we
afford to ignore? The sample average has the distinct advantage
of representing the sampling process as a whole. Not only are
virtually all the numbers below 10,000, but the total weight of all the numbers above 10,000, big as they were, is matched by
the total weight of those below. And 10,000 deaths as a typical
number is still much better than 36,000 deaths.
But suppose, for the sake of argument, that after 1 billion
runs, the very next number we see is one of the huge ones mentioned at the outset: 100 trillion. (It is not clear that
there can be a technology that can kill 100 trillion
things, but let that pass.) Even though this is a one-in-a-billion
chance, because the number is so large, we have to adjust our
estimate of the average yet again, this time to be 100,000. And
getting 100,000 deaths would be much worse than 36,000
This, in a nutshell, is the problem of trying to deal in a practical way with a long-tailed phenomena. If all of your expertise
derives from sampling, you may never get to see events that are
rare but can make a big difference overall.
From Common Sense, the Turing Test, and the Quest for Real AI by Hector J. Levesque, published by the MIT Press.
Lead image courtesy of Mopic / Shutterstock