13 benchmarking sins

Avoid these benchmarking blunders if you want useful data from your system tests

Complex benchmark tools

It is important that the benchmark tool not hinder benchmark analysis by its own complexity. Ideally, the program is open source so that it can be studied, and short enough that it can be read and understood quickly.

For micro-benchmarks, it is recommended to pick those written in the C programming language. For client simulation benchmarks, it is recommended to use the same programming language as the client, to minimize differences.

A common problem is one of benchmarking the benchmark -- where the result reported is limited by the benchmark software itself. Complex benchmark suites can make this difficult to identify, due to the sheer volume of code to comprehend and analyze.
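
To make the contrast concrete, here is a minimal, hypothetical sketch of the kind of micro-benchmark that passes this test: it times repeated memcpy() copies of a 64MB buffer and reports throughput. The file name, buffer size and iteration count are arbitrary choices for illustration, and a real tool would add warm-up passes and repeated trials -- but the entire program can be read and verified in a few minutes.

    /* membench.c: a deliberately tiny memory-copy micro-benchmark sketch.
     * It times ITERS memcpy() calls over a SIZE-byte buffer and reports GB/s.
     * Build: gcc -O2 -o membench membench.c
     */
    #define _POSIX_C_SOURCE 200112L   /* for clock_gettime() */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <time.h>

    #define SIZE  (64 * 1024 * 1024)  /* 64MB working set (illustrative) */
    #define ITERS 50                  /* number of copies to time */

    int main(void)
    {
        char *src = malloc(SIZE), *dst = malloc(SIZE);
        struct timespec t0, t1;

        if (!src || !dst) {
            fprintf(stderr, "malloc failed\n");
            return 1;
        }
        memset(src, 'A', SIZE);       /* touch pages so they are resident */
        memset(dst, 'B', SIZE);

        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (int i = 0; i < ITERS; i++)
            memcpy(dst, src, SIZE);
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
        double gb   = (double)SIZE * ITERS / (1024.0 * 1024 * 1024);
        /* Print a byte of dst so the copies cannot be optimized away. */
        printf("memcpy: %.2f GB in %.3f s = %.2f GB/s (check: %d)\n",
               gb, secs, gb / secs, dst[0]);

        free(src);
        free(dst);
        return 0;
    }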

Testing the wrong thing

While there are numerous benchmark tools available to test a variety of workloads, many of them may not be relevant for the target application.

For example, a common mistake is to test disk performance -- based on the availability of disk benchmark tools -- even though the target environment workload is expected to run entirely out of file system cache and not be related to disk I/O.
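
As an illustrative sketch (not taken from any benchmark suite), the following program shows how easily the same "disk" test can measure two different things. It reads a file twice: once after asking the kernel to drop that file's cached pages via posix_fadvise(POSIX_FADV_DONTNEED), and once while the file is warm in the file system cache. If the target workload runs out of cache, only the second number is relevant; note that caching in other layers, such as a storage controller, can still blur the cold result.

    /* cache_vs_disk.c: a sketch showing how the same read test can measure two
     * very different things: the disk, or the file system cache. The first pass
     * asks the kernel to drop the file's cached pages (cold, closer to disk);
     * the second re-reads the now-cached file (warm).
     * Usage: ./cache_vs_disk <file>
     */
    #define _XOPEN_SOURCE 600         /* for posix_fadvise() and clock_gettime() */
    #include <stdio.h>
    #include <stdlib.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <time.h>

    #define BUFSZ (1024 * 1024)       /* 1MB read size */

    /* Sequentially read the whole file and return throughput in MB/s. */
    static double read_pass(const char *path)
    {
        static char buf[BUFSZ];
        struct timespec t0, t1;
        long long total = 0;
        ssize_t n;
        int fd = open(path, O_RDONLY);

        if (fd < 0) { perror("open"); exit(1); }
        clock_gettime(CLOCK_MONOTONIC, &t0);
        while ((n = read(fd, buf, BUFSZ)) > 0)
            total += n;
        clock_gettime(CLOCK_MONOTONIC, &t1);
        close(fd);

        double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
        return total / secs / (1024 * 1024);
    }

    int main(int argc, char *argv[])
    {
        if (argc != 2) {
            fprintf(stderr, "usage: %s <file>\n", argv[0]);
            return 1;
        }
        /* Drop this file's cached pages so the first pass mostly hits disk. */
        int fd = open(argv[1], O_RDONLY);
        if (fd >= 0) {
            posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);
            close(fd);
        }
        printf("cold (mostly disk):       %.1f MB/s\n", read_pass(argv[1]));
        printf("warm (file system cache): %.1f MB/s\n", read_pass(argv[1]));
        return 0;
    }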

Similarly, an engineering team developing a product may standardize on a particular benchmark and spend all of its performance engineering effort improving the results of that benchmark. If the benchmark doesn't actually resemble customer workloads, however, that effort will optimize for the wrong behavior.

A benchmark may have tested an appropriate workload once upon a time but, after going years without updates, may now be testing the wrong thing. The article Eulogy for a Benchmark describes how a version of the SPEC SFS industry benchmark, commonly cited during the 2000s, was based on a customer usage study from 1986.

Ignoring errors

Just because a benchmark tool produces a result doesn't mean the result reflects a successful test. Some -- or even all -- of the requests may have resulted in an error. While this issue is covered by the previous sins, this one in particular is so common that it's worth singling out.

I was reminded of this during a recent benchmark of Web server performance. Those running the test reported that the average latency of the Web server was too high for their needs: over one second. Some quick analysis determined what went wrong: the Web server did nothing at all during the test, as all requests were blocked by a firewall. All requests. The latency shown was the time it took for the benchmark client to time out and report an error.
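
The hypothetical client sketch below shows the kind of bookkeeping that catches this: it issues requests with libcurl, classifies each one as a success (HTTP 2xx) or an error (transport failure, timeout, or any other status), and averages latency over successes only. The URL, request count and 5-second timeout are placeholder values; the warning printed when every request fails is exactly the check that was missing in the firewall incident above.

    /* checked_http_bench.c: a client-loop sketch that refuses to lump errors in
     * with successes. Each request is classified as a success (HTTP 2xx) or an
     * error (transport failure, timeout, or other status code), and latency is
     * averaged over successes only.
     * Build: gcc -O2 -o checked_http_bench checked_http_bench.c -lcurl
     */
    #include <stdio.h>
    #include <curl/curl.h>

    #define REQUESTS 100              /* placeholder request count */

    /* Discard the response body; only status and timing matter here. */
    static size_t discard(void *ptr, size_t size, size_t nmemb, void *userdata)
    {
        (void)ptr; (void)userdata;
        return size * nmemb;
    }

    int main(int argc, char *argv[])
    {
        const char *url = (argc > 1) ? argv[1] : "http://localhost/";
        int ok = 0, errors = 0;
        double ok_time = 0.0;

        curl_global_init(CURL_GLOBAL_DEFAULT);
        CURL *curl = curl_easy_init();
        if (!curl)
            return 1;
        curl_easy_setopt(curl, CURLOPT_URL, url);
        curl_easy_setopt(curl, CURLOPT_TIMEOUT, 5L);          /* 5 s client timeout */
        curl_easy_setopt(curl, CURLOPT_WRITEFUNCTION, discard);

        for (int i = 0; i < REQUESTS; i++) {
            long status = 0;
            double t = 0.0;
            CURLcode rc = curl_easy_perform(curl);
            curl_easy_getinfo(curl, CURLINFO_RESPONSE_CODE, &status);
            curl_easy_getinfo(curl, CURLINFO_TOTAL_TIME, &t);

            if (rc == CURLE_OK && status >= 200 && status < 300) {
                ok++;
                ok_time += t;
            } else {
                errors++;   /* count timeouts and failures; don't average them in */
            }
        }

        printf("successes: %d, errors: %d\n", ok, errors);
        if (ok > 0)
            printf("avg latency (successes only): %.3f s\n", ok_time / ok);
        if (errors == REQUESTS)
            printf("WARNING: every request failed -- no valid result\n");

        curl_easy_cleanup(curl);
        curl_global_cleanup();
        return 0;
    }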

Ignoring variance

Benchmark tools, especially micro-benchmarks, often apply a steady and consistent workload, based on the average of a series of real-world measurements, such as measurements taken at different times of day or across an interval. For example, a disk workload may be found to have average rates of 500 reads/sec and 50 writes/sec. A benchmark tool may then either simulate this rate, or simulate the 10:1 read/write ratio so that higher rates can be tested.

This approach ignores variance: The rate of operations may be variable. The types of operations may also vary, and some types may occur orthogonally. For example, writes may be applied in bursts every 10 seconds (asynchronous write-back data flushing), whereas synchronous reads are steady. Bursts of writes may cause real issues in production, such as by queueing the reads, but are not simulated if the benchmark applies steady average rates.
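
The sketch below makes this concrete using the numbers from the example above: a steady 50 writes/sec schedule and a schedule that flushes 500 writes every 10 seconds have the same average rate but very different peaks. The one-minute window and the exact burst shape are assumptions chosen for illustration.

    /* burst_vs_steady.c: a sketch of how one average write rate (50 writes/sec)
     * can describe two very different workloads. It builds one minute of
     * per-second counts for (a) a steady 50/sec stream and (b) a burst of
     * 500 writes every 10 seconds, then prints the average and peak of each.
     */
    #include <stdio.h>

    #define SECONDS 60

    /* Print the average and peak one-second rate of a per-second schedule. */
    static void report(const char *label, const int per_sec[SECONDS])
    {
        int total = 0, peak = 0;
        for (int s = 0; s < SECONDS; s++) {
            total += per_sec[s];
            if (per_sec[s] > peak)
                peak = per_sec[s];
        }
        printf("%-8s avg %.0f writes/sec, peak %d writes/sec\n",
               label, (double)total / SECONDS, peak);
    }

    int main(void)
    {
        int steady[SECONDS] = {0}, bursty[SECONDS] = {0};

        for (int s = 0; s < SECONDS; s++)
            steady[s] = 50;            /* constant 50 writes every second */
        for (int s = 0; s < SECONDS; s += 10)
            bursty[s] = 500;           /* 500 writes flushed every 10 seconds */

        report("steady:", steady);
        report("bursty:", bursty);
        /* Both schedules average 50 writes/sec, but the bursty one has a 10x
         * higher peak -- the kind of spike that queues reads in production yet
         * never appears in a benchmark that replays only the average. */
        return 0;
    }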
