From the Archives: Benchmarks and measurement bias
In this post, originally written by Glenn Paulley and posted to sybase.com in March of 2009, Glenn talks about the perils of simplistic benchmarks and focuses specifically on measurement bias.
In the past few weeks I’ve witnessed a number of published performance analyses, both with and without SQL Anywhere. By and large these “benchmarks” have been exceedingly simplistic, which is unsurprising since a simple benchmark requires significantly less development effort than a complex one.
One performance analysis I see frequently, for example, simply inserts a number of rows into a table as quickly as possible. Knowing the resulting insert rate is a Good Thing (TM), and in the lab we have specific tests to determine this and other values. However, for performance analysis of database applications, I would make two points:
- All too often such simplistic tests are not very representative of application behaviour, and so are relatively meaningless; and
- Fine-grained tests of simple operations can be subject to a wide variety of performance factors, where even minute differences in efficiency can skew the test results significantly.
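As a rough illustration of the second point, the sketch below times the same row-insert workload under two commit strategies, repeating each several times. It uses Python's built-in sqlite3 module purely as a stand-in (it is not SQL Anywhere), and the table name, row count, and trial count are arbitrary assumptions. The absolute numbers mean little; the point is how far the results move with one small change in setup, and how much they vary from run to run.

```python
import sqlite3
import statistics
import time


def time_inserts(n_rows, batch_commit):
    """Insert n_rows into a fresh in-memory table and return elapsed seconds."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, payload TEXT)")
    start = time.perf_counter()
    for i in range(n_rows):
        conn.execute("INSERT INTO t (id, payload) VALUES (?, ?)", (i, "x" * 32))
        if not batch_commit:
            conn.commit()  # one transaction per row
    conn.commit()  # commits the single large transaction when batching
    elapsed = time.perf_counter() - start
    count = conn.execute("SELECT COUNT(*) FROM t").fetchone()[0]
    conn.close()
    if count != n_rows:
        raise RuntimeError("unexpected row count")
    return elapsed


def summarize(samples):
    """Report the spread of the samples, not just a single number."""
    return {
        "min": min(samples),
        "median": statistics.median(samples),
        "max": max(samples),
    }


def run_trials(n_rows=2000, trials=7):
    """Time both commit strategies over several trials each."""
    results = {}
    for label, batch in (("commit per row", False), ("single commit", True)):
        samples = [time_inserts(n_rows, batch) for _ in range(trials)]
        results[label] = summarize(samples)
    return results


if __name__ == "__main__":
    for label, stats in run_trials().items():
        print(f"{label:>15}: min={stats['min']:.4f}s "
              f"median={stats['median']:.4f}s max={stats['max']:.4f}s")
```

Reporting the minimum, median, and maximum rather than one figure at least makes the variability visible, although it says nothing about whether the workload resembles a real application.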
With respect to the latter, here is a quote from Chapter 2, “Common Mistakes and How to Avoid Them”, of Raj Jain’s performance analysis book [1]:
It is important to understand the randomness of various systems and workload parameters that affect the performance. Some of these parameters are better understood than others. For example, an analyst may know the distribution for page references in a computer system but have no idea of the distribution of disk references. In such a case, a common mistake would be to use the page reference distribution as a factor but ignore disk reference distribution even though the disk may be the bottleneck and may have more influence on performance than the page references. The choice of factors should be based on their relevance and not on the analyst’s knowledge of the factors.
The impact of implicit or explicit bias on experimental setups should not be underestimated. Indeed, a recent paper [2] illustrates that simple compiler benchmarks, such as the SPEC CPU2006 benchmark suite, are not diverse enough to avoid the problems of what the authors call “measurement bias”. Here is the paper’s abstract:
This paper presents a surprising result: changing a seemingly innocuous aspect of an experimental setup can cause a systems researcher to draw wrong conclusions from an experiment. What appears to be an innocuous aspect in the experimental setup may in fact introduce a significant bias in an evaluation. This phenomenon is called measurement bias in the natural and social sciences. Our results demonstrate that measurement bias is significant and commonplace in computer systems evaluation. By significant, we mean that measurement bias can lead to a performance analysis that either over-states an effect or even yields an incorrect conclusion. By commonplace we mean that measurement bias occurs in all architectures that we tried (Pentium 4, Core 2, and m5 O3CPU), both compilers that we tried (gcc and Intel’s C compiler), and most of the SPEC CPU2006 C programs. Thus, we cannot ignore measurement bias. Nevertheless, in a literature survey of 133 recent papers from ASPLOS, PACT, PLDI, and CGO, we determined that none of the papers with experimental results adequately consider measurement bias.
Mytkowicz et al.’s results show that the measurement bias caused by environmental factors in the test trumps the performance factor to be measured, which in their case is the effectiveness of O3 compiler optimizations. Two such environmental factors are:
(a) the amount of memory needed for environment variables on the machine, which can affect the alignment of the program stack; and
(b) the link order of the compiled objects within the final executable, which can impact the cache-line alignment of “hot” loops.
Their results show that modifying the experimental setup can itself yield a performance speedup of 0.8 to 1.1 – that is, their test can experience a 20% slowdown, or a 10% speedup, depending on these unmeasured factors.
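To make that arithmetic concrete, here is a minimal sketch that computes a speedup ratio per experimental setup and checks whether the range straddles 1.0, which is the case where the setup alone decides whether the optimization looks like a win or a loss. The setup names and timings are invented for illustration and are not taken from the paper; speedup is assumed here to mean baseline time divided by optimized time.

```python
def speedup(baseline_seconds, optimized_seconds):
    """Speedup ratio: above 1.0 means the optimized run was faster."""
    return baseline_seconds / optimized_seconds


def speedup_range(timings):
    """Return (lowest, highest, per-setup ratios) for a dict of timings.

    timings maps a setup name to (baseline_seconds, optimized_seconds).
    """
    ratios = {name: speedup(b, o) for name, (b, o) in timings.items()}
    return min(ratios.values()), max(ratios.values()), ratios


def conclusion_flips(low, high):
    """True when some setups show a slowdown and others show a speedup."""
    return low < 1.0 < high


if __name__ == "__main__":
    # Hypothetical timings in seconds, chosen only to illustrate the arithmetic.
    made_up_timings = {
        "setup_a": (10.0, 12.5),
        "setup_b": (10.0, 10.0),
        "setup_c": (11.0, 10.0),
    }
    low, high, ratios = speedup_range(made_up_timings)
    for name, ratio in ratios.items():
        print(f"{name}: speedup {ratio:.2f}")
    print(f"range {low:.2f} to {high:.2f}, "
          f"conclusion flips: {conclusion_flips(low, high)}")
```

A single setup drawn from such a range could support either conclusion, which is why one reported number is not enough.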
The authors offer three suggestions for avoiding or detecting measurement bias, which in my view are equally applicable to benchmarks of database applications. They are:
- Utilizing a larger benchmark suite, sufficiently diverse to factor out measurement bias;
- Generating a large number of experimental setups, varying parameters known to cause measurement bias, and analyzing the results using statistical methods; and
- Using causal analysis (intervene, measure, confirm) to gain confidence that the conclusions being drawn are valid, even in the face of measurement bias.
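The second suggestion can be sketched roughly as follows: run the same workload under many setups that differ only in the size of the environment, then summarize the timings with an interval rather than a single number. Everything named here is an assumption for illustration: the workload is a trivial child Python process, `BENCH_PADDING` is a hypothetical variable name, and the interval uses a normal approximation, which is crude for small samples. It is not the authors' tooling.

```python
import os
import random
import statistics
import subprocess
import sys
import time

# Hypothetical stand-in workload: a child Python process running a small loop.
WORKLOAD = [sys.executable, "-c", "sum(i * i for i in range(200_000))"]


def timed_run(command, env_padding_bytes):
    """Run command once with an extra environment variable of the given size."""
    env = dict(os.environ)
    env["BENCH_PADDING"] = "x" * env_padding_bytes  # hypothetical variable name
    start = time.perf_counter()
    subprocess.run(command, env=env, check=True)
    return time.perf_counter() - start


def sample_setups(command, n_setups, max_padding=4096, seed=0):
    """Time the command under n_setups randomly chosen environment sizes."""
    rng = random.Random(seed)  # seeded so the chosen setups are reproducible
    samples = []
    for _ in range(n_setups):
        padding = rng.randrange(0, max_padding + 1)
        samples.append((padding, timed_run(command, padding)))
    return samples


def confidence_interval(values, level=0.95):
    """Normal-approximation interval for the mean; needs at least two values."""
    mean = statistics.mean(values)
    z = statistics.NormalDist().inv_cdf(0.5 + level / 2)
    half_width = z * statistics.stdev(values) / len(values) ** 0.5
    return mean - half_width, mean + half_width


if __name__ == "__main__":
    samples = sample_setups(WORKLOAD, n_setups=20)
    times = [elapsed for _, elapsed in samples]
    low, high = confidence_interval(times)
    print(f"mean {statistics.mean(times):.4f}s, "
          f"approx. 95% interval [{low:.4f}s, {high:.4f}s]")
```

To compare two configurations, one would collect such a sample for each and check whether the intervals overlap before claiming a difference. Link order and other factors would need their own randomization, which this sketch does not attempt.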
My thanks to colleague Nathan Auch for bringing [2] to my attention.
[1] Raj Jain (1991). The Art of Computer Systems Performance Evaluation, John Wiley and Sons, New York. ISBN 0-471-50336-3.
[2] Todd Mytkowicz, Amer Diwan, Matthias Hauswirth, and Peter F. Sweeney (March 2009). Producing Wrong Data Without Doing Anything Obviously Wrong! In Proceedings, 14th International Conference on Architectural Support for Programming Languages and Operating Systems, Washington, DC, pp. 265-276.