Measuring Rust Runtime Performance: cargo bench vs. getrusage()
TLDR: Comparing cargo bench results to a slightly more robust method eliminates a lot of the noise, but there still appear to be a few performance regressions that both methods agree on. If someone has the statistics expertise to set me straight or help me take this further or both, please get in touch.
- tool used to collect cargo bench results
- “tool” used to collect getrusage() results
- repo that has the data files in JSON
Some background: I’m curious about the historical runtime performance of the binaries produced by rustc. About a week ago, I ran a bunch of benchmarks and made some graphs using cargo bench on a few different pinned crate revisions using nightly toolchains from 9/1/15 - 2/11/16. This produced some pretty noisy data, and while it’s possible to visually inspect the graphs and find some interesting information, a statistical analysis of performance regressions is pretty difficult. Posting that pseudo-analysis to the Rust subreddit yielded some great feedback, so I thought I’d do it again now that I have more data.
In the GitHub issue that inspired me to try benchmarking historical Rust runtime performance with cargo bench, eddyb made some nice suggestions about a benchmark method to reduce the noise in the data that cargo bench was producing:
My 2 cents: don’t measure wall-clock time in a multi-threading OS. On Linux, ru_utime field will measure time spent in userspace, which helps to remove I/O and scheduling noise. It’s not used by #[bench] because it’s not portable (AFAIK).
The ideal solution would involve pinning a single thread to a single core exclusively (i.e. no other threads can run on that core), doing as little I/O as possible, and locking the CPU frequency.
Well, I attempted some of that. There were some other points he made about memory usage, but I haven’t even tried to track that yet. The results are mixed. Some not-so-brief observations about this method:
- libc’s getrusage() ru_utime reports times in increments of 1-4ms on the machines I’ve tried it on (that is, without setting CONFIG_HIGH_RES_TIMERS and maybe recompiling my kernel). This meant that I needed to pick a number of iterations for each benchmark that would make sure that 3-4ms represented 1% or less of the total runtime for that benchmark function. I probably could have automated this, but I didn’t. Oh well.
- For a few reasons, some of the benchmarks I ported over were not compiling on various nightlies even though they’d worked using cargo bench. I opted to remove them from my suite rather than figure the problems out, so the graphs below are only for crates whose benchmarks I ran with both cargo bench and the hacked-together benchmarks.
- It’s pretty easy to make sure that frequency scaling is disabled on my motherboard, and it’s also pretty easy to use the nix crate to set process affinity to a single core. It’s also pretty easy to use libc to call getrusage() and find the time the process has spent in userspace. I am not aware of an easy way to kick all other processes off the pinned core. FWIW, I left htop running, and any of the 0.1% CPU usage processes which came up from idle appeared to be scheduled on other cores while I was running the benchmark.
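The iteration-count arithmetic from the first bullet can be sketched as below. This is a hypothetical helper, not code from the benchmark suite: the function name and the example per-iteration cost are made up for illustration, and the 4ms tick and 1% budget are the figures mentioned above.

```rust
/// Smallest iteration count for which one timer tick is at most 1% of the
/// benchmark's total runtime. `tick_ns` is the assumed timer granularity and
/// `per_iter_ns` is a rough estimate of one iteration's cost (both hypothetical inputs).
fn min_iterations(tick_ns: u64, per_iter_ns: u64) -> u64 {
    // One tick must be <= 1% of the total, so the total must span >= 100 ticks.
    let min_total_ns = tick_ns * 100;
    // Ceiling division: round up so the budget is never undershot.
    (min_total_ns + per_iter_ns - 1) / per_iter_ns
}

fn main() {
    // A 4ms tick needs at least 400ms of total runtime, so a function that
    // costs about 2.5us per iteration needs 160,000 iterations.
    println!("{}", min_iterations(4_000_000, 2_500));
}
```

In other words, the coarser the timer and the faster the function, the more iterations each benchmark needs before the reading is trustworthy to about 1%.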
So, I wrote a tiny macro to track time used by a given number of iterations, and ported a bunch of crates’ benchmark code to use it rather than libtest (I will pay my penance to the licensing gods another day). The code (oh, please don’t look at it, but if you must) is available on GitHub, and the JSON files from both benchmark runs are in the repo root: example-benchmarks.json for those using cargo bench, and getrusage-benchmarks.json for…you guessed it.
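For a rough idea of what such a macro could look like, here is a minimal sketch. It is not the macro from the repo: the names `user_time_us` and `bench_user_time!` are hypothetical, and to keep the example dependency-free it declares getrusage() by hand with a struct layout that assumes 64-bit Linux with glibc. Real code would more likely use the libc crate’s definitions, as the post describes.

```rust
use std::os::raw::{c_int, c_long};

// Hand-written mirror of `struct timeval` / `struct rusage` as laid out on
// 64-bit Linux with glibc. This layout is an assumption of the sketch.
#[repr(C)]
#[derive(Clone, Copy)]
#[allow(dead_code)]
struct Timeval {
    tv_sec: c_long,
    tv_usec: c_long,
}

#[repr(C)]
#[allow(dead_code)]
struct Rusage {
    ru_utime: Timeval,  // time spent in userspace
    ru_stime: Timeval,  // time spent in the kernel
    rest: [c_long; 14], // ru_maxrss through ru_nivcsw, unused here
}

const RUSAGE_SELF: c_int = 0;

unsafe extern "C" {
    fn getrusage(who: c_int, usage: *mut Rusage) -> c_int;
}

/// Userspace CPU time consumed by this process so far, in microseconds.
fn user_time_us() -> u64 {
    let zero = Timeval { tv_sec: 0, tv_usec: 0 };
    let mut usage = Rusage { ru_utime: zero, ru_stime: zero, rest: [0; 14] };
    let rc = unsafe { getrusage(RUSAGE_SELF, &mut usage) };
    assert_eq!(rc, 0, "getrusage failed");
    usage.ru_utime.tv_sec as u64 * 1_000_000 + usage.ru_utime.tv_usec as u64
}

/// Hypothetical macro: evaluate `$body` `$iters` times and return the
/// userspace time spent, in microseconds, for the whole run.
macro_rules! bench_user_time {
    ($iters:expr, $body:expr) => {{
        let start = user_time_us();
        for _ in 0..$iters {
            // black_box keeps the optimizer from discarding the result.
            std::hint::black_box($body);
        }
        user_time_us() - start
    }};
}

fn main() {
    let elapsed = bench_user_time!(20_000, (0..std::hint::black_box(1_000u64)).sum::<u64>());
    println!("user time: {} us", elapsed);
}
```

Because the timer only advances in ticks of a few milliseconds, a short run like this can legitimately report 0; the iteration count has to be large enough for the reading to mean anything.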
Just like last time, I first took the average runtime for each benchmark function (cargo reports this in nanoseconds/iteration, my suite reports it in microseconds overall) and normalized it against the first point in the relevant series. This produced a number close to 1.0 for each benchmark function, and then for each nightly compiler date, I took the geometric mean to produce a single index value for that crate’s benchmark results for that particular nightly compiler. This index is what’s plotted below.
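The index calculation can be sketched as follows. The function names and the sample timings are made up for illustration; only the normalize-then-geometric-mean procedure comes from the description above.

```rust
/// Divide every point in a benchmark's series by its first measurement.
fn normalize(series: &[f64]) -> Vec<f64> {
    let baseline = series[0];
    series.iter().map(|t| t / baseline).collect()
}

/// Geometric mean, computed through logarithms to avoid overflow.
fn geometric_mean(values: &[f64]) -> f64 {
    let log_sum: f64 = values.iter().map(|v| v.ln()).sum();
    (log_sum / values.len() as f64).exp()
}

/// One index value per nightly: the geometric mean, across all benchmark
/// functions of a crate, of the normalized runtime on that nightly.
/// Assumes every series covers the same nightlies in the same order.
fn crate_index(benchmarks: &[Vec<f64>]) -> Vec<f64> {
    let normalized: Vec<Vec<f64>> = benchmarks.iter().map(|s| normalize(s)).collect();
    let nightlies = normalized[0].len();
    (0..nightlies)
        .map(|d| {
            let column: Vec<f64> = normalized.iter().map(|s| s[d]).collect();
            geometric_mean(&column)
        })
        .collect()
}

fn main() {
    // Made-up timings for two benchmark functions over three nightlies.
    let benchmarks = vec![
        vec![100.0, 110.0, 100.0],
        vec![2000.0, 2000.0, 1800.0],
    ];
    for (i, value) in crate_index(&benchmarks).iter().enumerate() {
        println!("nightly {}: {:.3}", i, value);
    }
}
```

Since each series is divided by its own first point, the different units (nanoseconds per iteration versus microseconds overall) cancel out, which is what makes the two methods comparable on the same axis.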
Next time: exploring different methods for automating the discovery of knowledge from the large amount of information that these methods produce. First up is the 1st order gaussian derivative. I would be SUPER excited if someone with a more-than-just-googling understanding of time series analysis has some ideas about the most reliable way to process and analyze performance data like this (or memory usage, or cache misses, or whatever else can eventually be benchmarked).
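As a rough preview, one common reading of “1st order gaussian derivative” is convolving the series with the first derivative of a Gaussian, which gives a smoothed slope that spikes where the index steps up or down. The sketch below assumes that interpretation; the function names, the 3-sigma kernel radius, and the edge handling (repeating the first and last values) are all choices made here for illustration, not anything taken from the tools above.

```rust
/// Samples of the first derivative of a Gaussian with standard deviation
/// `sigma`, taken at integer offsets from -3*sigma to +3*sigma.
fn gaussian_derivative_kernel(sigma: f64) -> Vec<f64> {
    let radius = (3.0 * sigma).ceil() as i64;
    let norm = sigma.powi(3) * (2.0 * std::f64::consts::PI).sqrt();
    (-radius..=radius)
        .map(|k| {
            let x = k as f64;
            -x / norm * (-x * x / (2.0 * sigma * sigma)).exp()
        })
        .collect()
}

/// Convolve `series` with the Gaussian derivative kernel. Positive output
/// means the series is rising around that point, negative means falling.
fn smoothed_derivative(series: &[f64], sigma: f64) -> Vec<f64> {
    let kernel = gaussian_derivative_kernel(sigma);
    let radius = (kernel.len() / 2) as i64;
    let last = series.len() as i64 - 1;
    (0..series.len() as i64)
        .map(|t| {
            kernel
                .iter()
                .enumerate()
                .map(|(i, w)| {
                    let k = i as i64 - radius;
                    // Repeat the edge values so a flat series yields zero.
                    let idx = (t - k).clamp(0, last) as usize;
                    w * series[idx]
                })
                .sum::<f64>()
        })
        .collect()
}

fn main() {
    // Made-up index series: flat at 1.0, then a 20% slowdown.
    let mut series = vec![1.0; 10];
    series.extend(vec![1.2; 10]);
    for (i, d) in smoothed_derivative(&series, 1.0).iter().enumerate() {
        println!("{:2}: {:+.4}", i, d);
    }
}
```

A larger `sigma` smooths away more day-to-day noise at the cost of blurring exactly when a change happened, so picking it is part of the analysis rather than a detail.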
A few notes about what the graphs mean:
- The values are normalized against the measurements from the first point in the series when data was collected. So a value of 1.0 for a benchmark means “approximately the same overall as when we measured in early September 2015.”
- The benchmark code that was run at each data point was identical; the only variation was in which compiler and standard library version was used. These were downloaded and run using multirust because I already had it installed.
- No other significant processes were running on the desktop where these benchmarks ran; the only significant resource usage was htop. Both benchmark methods were run on the same machine, same kernel version, etc. The machine in question runs Arch Linux.
The code used to generate each graphic is at the end of this post.
The two benchmark methods agree most of the time, although the disagreement around Christmas 2015 is interesting, as they diverged in opposite directions. It’s not even a matter of one holding steady while the other moves. However, shortly thereafter both methods agree that cbor got about 5% faster across the board in mid-January. There was also a fairly distinct and long-lived performance improvement in early October.
Nothing to see here. Move along.
Both methods seem to agree that csv’s performance has been getting ever so slightly slower, up until late January 2016, when it went back to ~1.0.
While the noisy portions of the series seem to spend more of their time above 1.0 than below it, there’s nothing immediately obvious that I see here.
The method I trust a little more seems to think that memchr had some hefty slowdowns in January that it hasn’t yet recovered from. That might be an interesting period of time to investigate for a performance regression.
Nothing to see here, move along.
NOTE: The y-axis for this graph is scaled differently than the others due to how far the performance indices moved.
This is the only example I’ve seen of a really serious performance regression that both benchmark methods agree on. Thankfully it appears to be back to normal, but something was causing a ~75% slowdown in permutohedron from mid-September 2015 to late January 2016.
Nothing super interesting here.
The two methods just do not agree at all about suffix’s performance, but they seem to agree that something changed in late November 2015, and again in late December, and again in mid-January 2016. Of course, the graph suggests that exactly the opposite thing happened on each of those occasions, which is weird.
There’s a definite slowdown trend from January 2016 to mid-February 2016 that both methods agree on, even having nearly identical movements on the same days. It looks here like there was a series of successive changes which decreased uuid’s runtime performance, but there doesn’t appear to be a smoking gun as there is with permutohedron.