The Buckets and Balls Problem

Suppose that you are given N buckets and start throwing balls into buckets one at a time, each throw being done randomly, ignoring whether or not the buckets already have a ball in them. How many balls do you think you will have to throw before you can expect all the buckets to contain at least one ball?

Obviously, you need at least N balls, but since you are ignoring whether the bucket already contains a ball then it seems reasonable that the throwing will be very inefficient, and some buckets will get many balls before the last buckets is occupied. Maybe the number of balls will average to N2 or maybe even exponential in N?

Also suppose that you were to do this many times. How much would you expect the number of needed balls to vary between each attempt? If N were 1000 then maybe one attempt would take 10000 balls and the next attempt twice as many balls.

In fact the results might be considered surprising:

The number of balls needed is about N*log(N), but the variation between attempts is only about 2N.

Here are some examples:

N Average Width Width/Average
100 460 200 43%
1,000 6,900 2,000 30%
10,000 92,000 20,000 21%
100,0001,151,292 200,000 17%
1,000,00013,815,510 2,000,000 14%

For very large N the average number of balls needed, N*log(N), will be much larger than than the variation of 2N between attempts. This high level of predictability is a good sign that the problem exhibits an exact threshold, or phase transition.

To make things more precise we first rephrase the problem as a decision problem:

INSTANCE: Given N buckets, L balls, and a sequence of random choices for destination of each ball.

QUESTION: Does every bucket end up containing at least one ball?

and we can define an associated "control parameter", a, by

a = L/(N*log(N))
For sufficiently large N, we can now say, for example, that
a < 0.999 gives "almost certainly one bucket is left empty"

a > 1.001 gives "almost certainly no buckets are left empty"

That is, we have a sharp threshold at a=1.

Exact Probability Results

We are interested how we the chance of getting a "yes" answer (that is, all the buckets are occupied) varies as a function of N and L. So we define
Definition: P(N,L) is the probability of a "yes" answer, i.e. that everyone on the buckets will get at least one ball.
This problem is sufficiently simple that we can explicitly calculate P(N,L) using simple probability arguments.

Firstly,

   prob(bucket 1 contains ball 1) =  1/N
and so
   prob(bucket 1 does NOT contain ball 1) =  (1-1/N)
Now
   prob(bucket 1 is empty) =
         prob(bucket 1 does NOT contain ball 1) * 
         prob(bucket 1 does NOT contain ball 2) *
         ...
	 prob(bucket 1 does NOT contain ball L)
and since all the balls have the same associated probability we have
   prob(bucket 1 is empty) = (1-1/N)L
so that
   prob(bucket 1 is NOT empty) = 1 - (1-1/N)L
Now, all the buckets independently have this same probability so we have the main result:
   P(N,L) = prob(all buckets are NOT empty) 
          = ( 1 - (1-1/N)L)N

Interpretation of the result

The expression above is somewhat hard to read (an understatement). So first consider some approximations. Firstly, when N is large we can replace (1-1/N) with e-1/N giving
   P(N,L) =~ ( 1 - e-L/N)N
suppose we put L= N*log(N) then we have P =~ ( 1 - 1/N)N and in the limit of N goes to infinity this is just 1/e, a constant. This suggests that maybe it is the ratio L/(N*log(N)) that matters. so lets just try making some plots

we see that as N increases then the sharpness of the transition increases

To investigate the width we will define a new parameter x by

   x = (L-N*log(N))/N
so that now
   L = N*log(N) + x N
then in the limit of large N we find that
P(N,L) = e-(e-x)
The right hand side is a function which looks like

that is, it does a fairly rapid transition from zero to one.

Putting all this together we have that

   for  L <  N*log(N) - 2N     then      P(N,L) ~ 0 

   for  L >  N*log(N) + 2N     then      P(N,L) ~ 1 
justifying the results we summarized at the top of the page.

Diminishing Returns

Suppose that we measure the rate of progress rate by the number of throws that are needed before we hit an empty bucket. Getting the first new bucket takes precisely one throw as it does not matter which bucket we hit. Getting the last new bucket requires hitting one specific bucket out of N and so might be expected to take around N throws. Hence, the problem shows diminishing returns: getting the first few buckets goes much faster than getting the last few buckets.

Slightly more precisely, suppose at some point that k buckets are empty. Then hittng the next empty bucket is likely to take N/k throws.

With N=100 we get a progress curve looking like

Note that this estimate of the progress rate also give a quick way to estimate the position of the phase transition: Just add up the expected number of throws needed for each of the new buckets. That is, we can expect that the number of throws needed to get all the buckets is roughly

   sum from k=N to k=1 of N/k
which is approximately N*log(N). (To see this, approximate the sum over 1/k with an integral over 1/k).

Remarks

A reasonable complaint about this phase transition is that it is not very sharp for reasonable sizes of N. This seems to be typical of
  1. cases where exact calculations are possible
  2. cases where the problem instances are not computationally hard (on average)
Phase transitions in computationally hard problems tend to be a lot sharper at much smaller sizes, but then they are harder to solve and no exact calculations seem to be possible. If the problems are naturally easier then you need to make much bigger ones to see the transition clearly (obviously this is due to a conspiracy of the CPU manufacturers :-) ).

Although trivial in many senses, this problem might well serve as a first test for some ideas about phase transitions.


Back to ANTS@CIRL project home page: http://www.cirl.uoregon.edu/ANTS/

Page last modified: 2000-09-18 by Andrew Parkes. Please feel free to email corrections/suggestions.