Comparing cargo bench results to a slightly more robust method eliminates a lot of the noise, but there still appear to be a few performance regressions that both methods agree on. If someone has the statistics expertise to set me straight, to help me take this further, or both, please get in touch.
Some background: I'm curious about the historical runtime performance of the binaries produced by rustc. About a week ago, I ran a bunch of benchmarks and made some graphs by running cargo bench on a few pinned crate revisions with nightly toolchains from 9/1/15 through 2/11/16. This produced some pretty noisy data, and while it's possible to visually inspect the graphs and find some interesting information, a statistical analysis of performance regressions is pretty difficult. Posting that pseudo-analysis to the Rust subreddit yielded some great feedback, so I thought I'd do it again now that I have more data.
In the GitHub issue that inspired me to try benchmarking historical Rust runtime performance with cargo bench, eddyb made some nice suggestions about a benchmark method to reduce the noise in the data that cargo bench was producing:
My 2 cents: don't measure wall-clock time in a multi-threading OS. On Linux, getrusage's ru_utime field will measure time spent in userspace, which helps to remove I/O and scheduling noise. It's not used by libtest's #[bench] because it's not portable (AFAIK).
The ideal solution would involve pinning a single thread to a single core exclusively (i.e. no other threads can run on that core), doing as little I/O as possible, and locking the CPU frequency.
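I haven't done the exclusive core pinning or frequency locking, but for anyone curious what the ru_utime approach looks like in practice, here's a minimal sketch of reading it from Rust via the libc crate (an illustration of the idea, not the harness from my repo; assumes libc as a dependency and a Linux target):

extern crate libc;

// Userspace CPU time consumed by this process so far, in microseconds.
fn ru_utime_micros() -> u64 {
    // zeroed rusage struct for the kernel to fill in
    let mut usage: libc::rusage = unsafe { std::mem::zeroed() };
    let ret = unsafe { libc::getrusage(libc::RUSAGE_SELF, &mut usage) };
    assert_eq!(ret, 0, "getrusage() failed");
    // ru_utime is a timeval: whole seconds plus microseconds of userspace time
    usage.ru_utime.tv_sec as u64 * 1_000_000 + usage.ru_utime.tv_usec as u64
}

fn main() {
    let before = ru_utime_micros();
    // ... run the code being measured here ...
    let after = ru_utime_micros();
    println!("userspace CPU time: {}us", after - before);
}

Because this only counts time the process spends executing in userspace, time lost to scheduler preemption or spent blocked on I/O mostly drops out of the measurement.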
Well, I attempted some of that. There were some other points he made about memory usage, but I haven't even tried to track that yet. The results are mixed. Some not-so-brief observations about this method:
- ru_utime reports times in increments of 1-4ms on the machines I've tried it on (that is, without setting CONFIG_HIGH_RES_TIMERS and maybe recompiling my kernel). This meant that I needed to pick a number of iterations for each benchmark that would make sure that 3-4ms represented 1% or less of the total runtime for that benchmark function. I probably could have automated this, but I didn't. Oh well.
- cargo bench and the hacked together benchmarks.

So, I wrote a tiny macro to track time used by a given number of iterations, and ported a bunch of crates' benchmark code to use it rather than libtest (I will pay my penance to the licensing gods another day). The code (oh, please don't look at it, but if you must) is available on GitHub, and the JSON files from both benchmark runs are in the repo root: example-benchmarks.json for those using cargo bench, and getrusage-benchmarks.json for...you guessed it.
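To give a flavor of the approach without making you read the repo, the macro is shaped roughly like this (a sketch, not the actual code; it leans on the ru_utime_micros helper from the earlier snippet and skips details like warm-up and keeping the optimizer honest):

// Rough sketch only: run a fixed number of iterations of an expression and
// report the total ru_utime consumed by the batch, in microseconds.
macro_rules! bench_iters {
    ($name:expr, $iters:expr, $body:expr) => {{
        let start = ru_utime_micros(); // helper from the getrusage sketch above
        for _ in 0..$iters {
            // real code would need a black_box-style trick so the optimizer
            // can't delete the work; omitted here for brevity
            $body;
        }
        let elapsed = ru_utime_micros() - start;
        println!("{}: {} iterations in {}us", $name, $iters, elapsed);
        elapsed
    }};
}

// Hypothetical usage (names made up): pick $iters large enough that the
// ~1-4ms timer granularity stays under 1% of the batch's total runtime.
// let total_us = bench_iters!("decode_small_doc", 50_000, decode_small_doc());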
Just like last time, I first took the average runtime for each benchmark function (cargo reports this in nanoseconds per iteration; my suite reports it in microseconds overall) and normalized it against the first point in the relevant series. This produced a number close to 1.0 for each benchmark function. Then, for each nightly compiler date, I took the geometric mean across a crate's benchmark functions to produce a single index value for that crate on that particular nightly compiler. This index is what's plotted below.
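In symbols: if $t_i(d)$ is the average runtime of benchmark function $i$ on the nightly from date $d$, and $d_0$ is the first date in the series, the plotted index for a crate is

$$ \text{index}(d) = \left( \prod_{i=1}^{n} \frac{t_i(d)}{t_i(d_0)} \right)^{1/n} $$

so a value of 1.0 means the crate's benchmarks ran about as fast as they did on the first nightly, and 1.05 means roughly 5% slower overall.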
Next time: exploring different methods for automating the discovery of knowledge from the large amount of information that these methods produce. First up is the first-order Gaussian derivative. I would be SUPER excited if someone with a more-than-just-googling understanding of time series analysis has some ideas about the most reliable way to process and analyze performance data like this (or memory usage, or cache misses, or whatever else can eventually be benchmarked).
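(For anyone who hasn't run into the term: as I understand the technique, the idea is to convolve each index series with the first derivative of a Gaussian kernel, which acts as a smoothed derivative, so abrupt level shifts like the permutohedron regression below should show up as peaks in the filtered output:

$$ G_\sigma(t) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-t^2/2\sigma^2}, \qquad G'_\sigma(t) = -\frac{t}{\sigma^2}\, G_\sigma(t), \qquad r(d) = (\text{index} * G'_\sigma)(d) $$

If I'm applying it wrong, that's exactly the kind of correction I'm hoping to get.)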
A few notes about what the graphs mean:
The code used to generate each graphic is at the end of this post.
The two benchmark methods agree most of the time, although the disagreement around Christmas 2015 is interesting, as they diverged in opposite directions. It's not even a matter of one holding steady while the other moves. However, shortly thereafter both methods agree that cbor got about 5% faster across the board in mid-January. There was also a fairly distinct and long-lived performance improvement in early October.
plot_benchmark(cargo_benches['cbor'], custom_benches['cbor'])
Nothing to see here. Move along.
plot_benchmark(cargo_benches['crc'], custom_benches['crc'])
Both methods seem to agree that csv's performance has been getting ever so slightly slower, up until late January 2016, when it went back to ~1.0.
plot_benchmark(cargo_benches['csv'], custom_benches['csv'])
While the noisy portions of the series seem to spend more time above 1.0 than below it, there's nothing immediately obvious that I see here.
plot_benchmark(cargo_benches['itertools'], custom_benches['itertools'])
The method I trust a little more seems to think that memchr had some hefty slowdowns in January that it hasn't yet recovered from. That might be an interesting period of time to investigate for a performance regression.
plot_benchmark(cargo_benches['memchr'], custom_benches['memchr'])
Nothing to see here, move along.
plot_benchmark(cargo_benches['optional'], custom_benches['optional'])
NOTE: The y-axis for this graph is scaled differently than the others due to how far the performance indices moved.
This is the only example I've seen of a really serious performance regression that both benchmark methods agree on. Thankfully, it appears to be back to normal, but something was causing a ~75% slowdown in permutohedron from mid-September 2015 to late January 2016.
plot_benchmark(cargo_benches['permutohedron'], custom_benches['permutohedron'])
plt.ylim([0.5, 2.5])
Nothing super interesting here.
plot_benchmark(cargo_benches['rand'], custom_benches['rand'])
The two methods just do not agree at all about suffix's performance, but they seem to agree that something changed in late November 2015, and again in late December, and again in mid-January 2016. Of course, the graph suggests that exactly the opposite thing happened on each of those occasions, which is weird.
plot_benchmark(cargo_benches['suffix'], custom_benches['suffix'])
There's a definite slowdown trend from January 2016 to mid-February 2016 that both methods agree on, even having nearly identical movements on the same days. It looks like there was a series of successive changes that degraded uuid's runtime performance, but there doesn't appear to be a smoking gun as there is with permutohedron.
plot_benchmark(cargo_benches['uuid'], custom_benches['uuid'])
import datetime
import json

import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
import numpy as np
from scipy.stats.mstats import gmean

matplotlib.style.use('ggplot')

cargo_benches = {}

# process the JSON saved by secondstring
with open('example-benchmarks.json') as m:
    manifest_data = json.loads(''.join(m.readlines()))
    raw_benches = manifest_data['benches']

    for b in raw_benches:
        name = b['name']
        cargo_benches[name] = {}
        results = b['results']

        # for successful benchmarks, accumulate a list of all function average times
        for date, r in results.items():
            if r.get('Success', None) is not None:
                bench_run = r['Success']['times']
                for bench_fn in bench_run:
                    if date not in cargo_benches[name]:
                        cargo_benches[name][date] = []
                    time = bench_run[bench_fn]['avg_time_ns']
                    cargo_benches[name][date].append(time)

import statistics as stats

custom_benches = {}

# process the json saved by the custom bench suite
with open('getrusage-benchmarks.json') as f:
    benches = json.loads(''.join(f.readlines()))
    for date, all_benches in benches.items():
        for bench_fn, times in all_benches.items():
            if bench_fn not in custom_benches:
                custom_benches[bench_fn] = {}
            if date not in custom_benches[bench_fn]:
                custom_benches[bench_fn][date] = {}
            custom_benches[bench_fn][date] = {n: stats.mean(t) for n, t in times.items()}

def normalize_cargo_against_first(bench_results):
    tpls = [(d, avgs) for d, avgs in bench_results.items()]
    tpls = sorted(tpls)
    first = tpls[0]

    normalized = {}
    for date, avgs in bench_results.items():
        normalized[date] = [a / f for a, f in zip(avgs, first[1])]
    return normalized

def normalize_custom_against_first(bench_results):
    tpls = [(d, avgs) for d, avgs in bench_results.items()]
    tpls = sorted(tpls)
    first = tpls[0]

    normalized = {}
    for date, avgs in bench_results.items():
        normalized[date] = []
        for fn in avgs:
            normalized[date].append(avgs[fn] / first[1][fn])
    return normalized

# prep a pandas dataframe of each benchmark suite
def data_frame_for_benchmark(bench_results, cargo=True):
    # parse the dates and calculate the geometric mean
    if cargo:
        bench_results = normalize_cargo_against_first(bench_results)
    else:
        bench_results = normalize_custom_against_first(bench_results)

    bench_results = {datetime.datetime.strptime(d, '%Y-%m-%d'): np.array(v)
                     for d, v in bench_results.items()}
    bench_results = {d: gmean(times) for d, times in bench_results.items()}

    # create a dataframe for plotting
    frame = pd.DataFrame.from_dict(bench_results, orient='index')

    if frame.size > 0:
        # sort by date
        frame = frame.sort_index()
        return frame
    else:
        # handle benchmarks that had all of their functions ignored
        return None

# plot a time series for a given benchmark
def plot_benchmark(cargo_bench_times, custom_bench_times):
    cargo_frame = data_frame_for_benchmark(cargo_bench_times, cargo=True)
    custom_frame = data_frame_for_benchmark(custom_bench_times, cargo=False)

    frame = pd.concat({'cargo bench': cargo_frame, 'getrusage()': custom_frame}, axis=1)

    if frame is not None:
        plt.figure()
        frame.plot(figsize=(15, 4))
        plt.ylim([0.8, 1.2])

        start_date = datetime.datetime(2015, 9, 1)
        end_date = datetime.datetime(2016, 2, 11)
        plt.xlim([start_date, end_date])