13 benchmarking sins

Avoid these benchmarking blunders if you want useful data from your system tests


Ignoring perturbations

Consider what external perturbations may be affecting results. Will a timed system activity, such as a system backup, execute during the benchmark run? For the cloud, a perturbation may be caused by unseen tenants on the same system.

A common strategy for ironing out perturbations is to make the benchmark runs longer -- minutes instead of seconds. As a rule, the duration of a benchmark should not be shorter than one second. Short tests might be unusually perturbed by device interrupts (pinning the thread while performing interrupt service routines), kernel CPU scheduling decisions (waiting before migrating queued threads to preserve CPU affinity) and CPU cache warmth effects. Try running the benchmark test several times and examining the standard deviation. This should be as small as possible, to ensure repeatability.
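
As a rough illustration, the following minimal Python sketch repeats a timed run and reports the mean and standard deviation; run_benchmark here is a hypothetical stand-in for your real workload:

    import statistics
    import time

    def run_benchmark():
        # Hypothetical stand-in for the real workload; in practice it should
        # run for at least a second, per the guidance above.
        return sum(range(20_000_000))

    durations = []
    for _ in range(10):                      # repeat the run several times
        start = time.perf_counter()
        run_benchmark()
        durations.append(time.perf_counter() - start)

    mean = statistics.mean(durations)
    stdev = statistics.stdev(durations)
    print(f"mean {mean:.3f}s, stdev {stdev:.3f}s ({100 * stdev / mean:.1f}% of mean)")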

Also collect data so that perturbations, if present, can be studied. This might include collecting the distribution of operation latency -- not just the total runtime for the benchmark -- so that outliers can be seen and their details recorded.
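
One way to do this is to record each per-operation latency and report percentiles alongside the maximum, so outliers stand out. In this sketch, do_operation is a hypothetical stand-in for a single benchmark operation:

    import statistics
    import time

    def do_operation():
        # Hypothetical single operation; replace with one unit of real work.
        sum(range(100_000))

    latencies_ms = []
    for _ in range(1_000):
        start = time.perf_counter()
        do_operation()
        latencies_ms.append((time.perf_counter() - start) * 1000)

    latencies_ms.sort()
    pct = statistics.quantiles(latencies_ms, n=100)   # cut points at 1%..99%
    print(f"p50 {pct[49]:.3f} ms  p95 {pct[94]:.3f} ms  "
          f"p99 {pct[98]:.3f} ms  max {latencies_ms[-1]:.3f} ms")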

Changing multiple factors

When comparing benchmark results from two tests, be careful to understand all the factors that are different between the two.

For example, if two hosts are benchmarked over the network, is the network between them identical? What if one host was more hops away, over a slower network, or over a more congested network? Any such extra factors could make the benchmark result bogus.
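
A quick sanity check is to measure the baseline round-trip time to each host before benchmarking and confirm the paths are comparable. The sketch below assumes a Linux-style ping command and uses hypothetical hostnames:

    import subprocess

    def avg_rtt_ms(host, count=10):
        """Return the average ping RTT to host in milliseconds (Linux ping output)."""
        out = subprocess.run(["ping", "-c", str(count), host],
                             capture_output=True, text=True, check=True).stdout
        # Final line looks like: rtt min/avg/max/mdev = 0.045/0.062/0.083/0.012 ms
        summary = out.strip().splitlines()[-1]
        return float(summary.split("=")[1].split("/")[1])

    for host in ["hostA.example.com", "hostB.example.com"]:   # hypothetical targets
        print(f"{host}: {avg_rtt_ms(host):.3f} ms average RTT")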

In the cloud, benchmarks are sometimes performed by creating instances, testing them, and then destroying them. This creates the potential for many unseen factors: Instances may be created on faster or slower systems, or on systems with higher load and contention from other tenants. It is recommended to test multiple instances and take the average (or better, record the distribution) to avoid outliers caused by testing one unusually fast or slow system.
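
In practice this can be as simple as tagging each result with the instance that produced it and reporting the spread rather than a single number, as in this sketch (the per-instance throughput figures are made up for illustration):

    import statistics

    # Hypothetical throughput results (ops/sec), one benchmark run per instance.
    results_by_instance = {
        "i-0a1": 9800, "i-0a2": 10150, "i-0a3": 7200,
        "i-0a4": 9950, "i-0a5": 10050,
    }

    values = sorted(results_by_instance.values())
    median = statistics.median(values)
    print(f"min {values[0]}  median {median}  max {values[-1]} ops/sec")
    for instance, ops in results_by_instance.items():
        if abs(ops - median) > 0.2 * median:
            print(f"{instance}: {ops} ops/sec is >20% from the median; "
                  "possibly an unusually slow or fast host")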

Benchmarking the competition

Your marketing department would like benchmark results showing how your product beats the competition. This is usually a bad idea, for reasons I'm about to explain.

When customers pick a product, they don't use it for 5 minutes; they use it for months. During that time, they analyze and tune the product for performance, perhaps shaking out the worst issues in the first few weeks.

You don't have a few weeks to spend analyzing and tuning your competitor. In the time available, you can only gather untuned -- and therefore unrealistic -- results. The customers of your competitor -- the target of this marketing activity -- may well see that you've posted untuned results, so your company loses credibility with the very people it was trying to impress.

If you must benchmark the competition, you'll want to spend serious time tuning their product. Also search for best practices, customer forums, and bug databases. You may even want to bring in outside expertise to tune the system. Then make the same effort for your own company before you finally perform head-to-head benchmarks.

Friendly fire

When benchmarking your own products, make every effort to ensure that the top-performing system and configuration have been tested, and that the system has been driven to its true limit. Share the results with the engineering team before publication; they may spot configuration items that you have missed. And if you are on the engineering team, be on the lookout for benchmark efforts -- either from your company or from contracted third parties -- and help them out.

Consider this hypothetical situation: An engineering team has worked hard to develop a high-performing product. Key to its performance is a new technology they have developed that has yet to be documented. For the product launch, a benchmark team has been asked to provide the numbers. Because they don't understand the new technology (it isn't documented), they misconfigure it, and then they publish numbers that undersell the product.

Sometimes the system may be configured correctly but simply hasn't been pushed to its limit. Ask the question: what is the bottleneck for this benchmark? It may be a physical resource, such as CPUs, disks, or an interconnect, that has been driven to 100% utilization and can be identified through analysis.
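
For example, sampling system utilization while the benchmark runs shows whether any resource is actually pegged. The sketch below assumes the third-party psutil package is installed:

    import psutil

    # Sample utilization once per second while the benchmark runs elsewhere.
    for _ in range(30):
        cpu = psutil.cpu_percent(interval=1)       # % CPU over the last second
        disk = psutil.disk_io_counters()
        print(f"cpu {cpu:5.1f}%  disk reads {disk.read_count}  writes {disk.write_count}")
        if cpu > 95:
            print("  -> CPUs near saturation; likely the limiter for this benchmark")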

Another friendly fire issue arises when benchmarking older versions of the software that have performance issues fixed in later versions, or benchmarking on whatever limited equipment happens to be available, producing a result that is not the best possible (as may be expected of a company's own benchmark).
