Introduction
I run a juggling club. The fact that it is dedicated to juggling isn’t important here. What is important is that the format is very casual. If you turn up, you pay a small fee to cover the cost of the hall rental and other incidentals. If you don’t turn up, you pay nothing. Each week you have an individual decision about whether to turn up.
As a result, the attendance varies. On a great night, we might have two dozen jugglers turn up. On a poor night, we might have seven. The fee is the same each time, so I carry the risk: breaking even requires roughly 15 paying jugglers on average. I am not aiming for a profit; in fact, I have carried minor losses over, and will again, to keep the thing going – because *I* enjoy it.
I’ve noticed a phenomenon where the occasional jugglers think that it could be run cheaper, because they have rosy impressions of how many people turn up on average.
I have attributed part of this misunderstanding to a sampling error. The nights that are very busy are witnessed by a large number of jugglers. The nights that are quiet are witnessed by only a small number of jugglers. So the average perception of the jugglers is higher than the real figure.
I’ve never convinced myself how big that effect is, or even that it is true. (Each individual seems to take a genuine random sample, so how could there be a bias?) I decided to do a Monte Carlo simulation. This is the result.
Method
I made a few dangerously large assumptions.
First, I assumed each juggler made the decision to attend independently. In fact, that’s not the case, with couples and cliques tending to turn up as groups rather than as individuals. This also ignores the effects of weather, major sporting matches, juggling events and the like, which tend to synchronise attendance behaviour. Both factors would further accentuate the effect.
Second, I assumed that each juggler made the decision to attend each week independently of their previous record and the previous attendances. It was based purely on an internal probability checked each week. This discounts people going through phases of attending each week, or going away, etc. It also discounts people deciding whether to come the next week based on the number of people who turned up the previous week. (Massive over-attendance turns off some of the jugglers because the hall gets too busy for their comfort. Under-attendance turns off many more who come for social interaction.)
Finally, I had to assign the number of jugglers and their likelihood of attendance, based on a finger-in-the-air guess. This modelling decision affected the magnitude of the result greatly, and should be treated with suspicion!
I decided there was one person who attended 100% of the time (me). Two attended 90% of the time, four attended 70%, six 50%, eight 30%, ten 10% and fifty 1% of the time.
That was arbitrary, and gave an expected average attendance of 12.5 people. Slightly less than real life, but close enough to be going with.
Using this model, I did a Monte Carlo simulation of 10,000 weeks.
For each juggler, I averaged the attendance they witnessed over the weeks they attended, and then averaged those figures within each attendance-probability cohort.
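The method above can be sketched in a few lines of Python. The cohort figures are the post’s guesses; the function and variable names are mine, and a seed is fixed so the run is repeatable:

```python
import random

# Cohort sizes and per-week attendance probabilities, as guessed in the post:
# 1 @ 100%, 2 @ 90%, 4 @ 70%, 6 @ 50%, 8 @ 30%, 10 @ 10%, 50 @ 1%.
COHORTS = [(1, 1.0), (2, 0.9), (4, 0.7), (6, 0.5), (8, 0.3), (10, 0.1), (50, 0.01)]

def simulate(weeks=10_000, seed=42):
    rng = random.Random(seed)
    probs = [p for n, p in COHORTS for _ in range(n)]   # one entry per juggler
    seen_sum = [0] * len(probs)    # total attendance witnessed by each juggler
    seen_weeks = [0] * len(probs)  # number of weeks each juggler attended
    grand_total = 0
    for _ in range(weeks):
        # Each juggler flips their own coin, independently, every week.
        present = [rng.random() < p for p in probs]
        attendance = sum(present)
        grand_total += attendance
        for i, here in enumerate(present):
            if here:
                seen_sum[i] += attendance
                seen_weeks[i] += 1
    true_avg = grand_total / weeks
    # Average witnessed attendance, per attendance-probability cohort.
    perceived = {}
    for p in sorted(set(probs), reverse=True):
        vals = [s / w for s, w, q in zip(seen_sum, seen_weeks, probs)
                if q == p and w > 0]
        perceived[p] = sum(vals) / len(vals)
    return true_avg, perceived
```

Calling `simulate()` gives a true average near the expected 12.5; the 100% attendee’s perceived average matches it exactly, while the 1% cohort’s comes out roughly one person higher.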
Results
- The actual average attendance over the 10,000 virtual weeks was 12.4971 (compared to an expected 12.5).
- Obviously, the person who attended 100% of the time had an accurate sample.
- The regulars who attended 90% of the time witnessed an average attendance of 12.60. They still had a good understanding, but a slight overestimate is already visible.
- By the time you get to the 50% attendees, the estimate has grown by half a person to 13.00.
- At the extremes of 1% attendees, they witnessed an average of 13.49 – basically one extra person.
Conclusion
I was right – even though each individual attendee appears, from their perspective, to be randomly sampling the attendance, their perception is biased towards a higher average. I wonder what the name is for this paradox.
However, the size of the effect was actually smaller than I predicted. I would have guessed the rarer attendees saw 2 or 3 more attendees on average than I did. An optimistic impression of one extra attendee doesn’t explain the opinions about profitability. (Variability – over a real-world sample far smaller than 10,000 weeks, and not averaged across up to 50 jugglers in a cohort – would explain a lot more.)
Comment by Aristotle Pagaltzis on December 22, 2011
Of course attendees who are affected by grouping factors should show a more significant overestimate, especially those who tend to attend as a couple or clique. This particular factor should be fairly easy to add to your simulation – just make a pair of agents show up together 80% or so of the time, or something like that. (Obviously this is only interesting if the couple are fairly rare attendees; otherwise both will be frequent attendees, so they will tend toward accuracy just as any other agent who attends frequently.) Maybe it’s worth rerunning your numbers and seeing how their particular perception is affected?
Comment by Julian on December 22, 2011
Aristotle,
I agree with every part of your analysis, except the idea that it is worth re-running the numbers.
The magnitude of the results was very sensitive to the arbitrary numbers I guesstimated as the individual likelihoods of people turning up. Any improvements in the area you describe would probably be washed out by the bigger impact of those modelling choices. The effort to overcome that isn’t reaching my “worth the effort” threshold, I am afraid.
I have to admit I am more interested in the qualitative nature of the paradox rather than the quantitative one. Each individual took (at worst) 100 samples, at random, of the attendance at the club. From their perspective, it seems like a perfectly legitimate method of sampling, with a reasonably small expected error margin. But, in fact, they are getting a biased view. Weird. How would I know if my experiments in other areas (or papers that I read) are suffering from the same sampling problem?
This isn’t even a case of someone assuming that the post office queue is always long because they only sample at lunch-times. Every session has an equal chance of them turning up.
Comment by Andrew on December 22, 2011
If the average attendance of everyone else is a, and mine is completely uncorrelated, and I turn up a fraction f of the time, then I will see on average a+1 people, but the true average will be a+f. So I see an excess of 1-f.
That is, 0 for constant attendees, nearly 1 for infrequent attendees, 0.5 for 50% attendees. For a perfectly correlated n-person group, it’s obviously n(1-f). More generally I think for each other person you’re correlated with by a fraction c, you’ll see an extra c-f disparity, but this margin is too small to contain the proof.
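Andrew’s `1 − f` excess can be checked numerically with a quick sketch. The “others” process here is an arbitrary stand-in (20 independent 50% attendees, so the others’ mean attendance is a = 10); my own attendance is an independent coin flip with probability f:

```python
import random

def excess(f, weeks=100_000, seed=7):
    """Average attendance I witness, minus the true average attendance."""
    rng = random.Random(seed)
    seen_sum = seen_weeks = grand_total = 0
    for _ in range(weeks):
        others = sum(rng.random() < 0.5 for _ in range(20))  # mean a = 10
        me = rng.random() < f
        total = others + (1 if me else 0)
        grand_total += total
        if me:
            seen_sum += total
            seen_weeks += 1
    # I see a + 1 on average; the true average is a + f; excess is 1 - f.
    return seen_sum / seen_weeks - grand_total / weeks
```

With these assumed parameters, `excess(0.5)` comes out near 0.5 and `excess(0.1)` near 0.9, matching the formula.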
It’s a more interesting effect where there’s a non-linearity, such as pickup sport where you need n people to make credible teams.
Comment by Julian on December 22, 2011
Andrew,
An excellent analysis. Fits perfectly with the data. Explains why I shouldn’t expect more than +1 bias (ignoring couples and cliques).
Thank you very much.
Comment by Jon D on December 23, 2011
The original problem (people over-estimating attendance and therefore under-estimating the per-person costs) could probably be solved by opening your books. Just make a Google spreadsheet, linked from your group’s page, listing the number of attendees, the revenue, and the costs for each week. Then recalculate the fee every six months or so, based on whether you have extra money or not enough at the end.
Comment by Julian on December 26, 2011
There’s another part that I need to clarify/confess. When I did some trial runs with different numbers of attendees, I found that this effect ranges from about 50% of the average (with a mean of 2 attendees) down to only 2% (with a mean of 50 attendees). I concluded that the number and distribution of attendees had a large effect on the relative error in the estimates.
What I didn’t notice, but perhaps should have, was that the absolute effect was always just under one attendee (for the people who turned up rarely), and the number of attendees didn’t affect that.
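That is consistent with Andrew’s formula above: for a rare attendee (f near 0) the absolute excess is roughly one person regardless of the mean attendance, so the relative error scales inversely with the mean:

```latex
\text{relative excess} \approx \frac{1-f}{\bar a}
\quad\Rightarrow\quad
\bar a = 2 \;\to\; \tfrac{1}{2} = 50\%,
\qquad
\bar a = 50 \;\to\; \tfrac{1}{50} = 2\%.
```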