Published Aug 02, 2021
I have an embedded processor that calls a C++ function per frame of a video stream.
We need to optimize this function!
How can I benchmark each optimization trick I try?
I first set the RNG seed to generate a fixed dataset for each of 1000 trials.
I can bench the original function offline or online.
In offline, I can generate datasets and bench the original once beforehand, so tests run faster.
However, I must assume the test environment is controlled, or that I can trust an absolute time from last week instead of retiming.
I cannot assume a controlled environment.
Also, I develop off-device on a beefier machine which runs 1000 test trials in under 1 second.
Thus, I choose to bench online.
For a month, I use online benchmarks and optimize the time ratio.
One trick I try is switching from double to float.
Tests show just 0.2% accuracy loss but 40% speedup relative to the original.
It sounds too good to be true. After looking at absolute times, I saw the optimized has no speedup.
The original slows down!
Hereโs the high-level C++ code I run each trial.
SummaryStruct trial(bool is_optimized, DatasetStruct* dataset) {
// construct problem
ProblemStruct* problem = construct_problem(dataset);
// run function and return summary stats
SummaryStruct summary;
if (is_optimized) {
summary = optimized_solve(problem);
} else {
summary = original_solve(problem);
}
// logging
log_summary(summary);
return summary;
}
trial(true, dataset) // optimizedtrial(false, dataset) // originalMy tests show that changing optimized_solve() from double to float slows down original_solve().
How is that possible?
I define these functions independently, so the bug is likely in trial().
What if I just bench trial(false, dataset)? Then I only call original_solve().
Now I switch from double to float in optimized_solve().
Even though I never call optimized_solve(), I see the same slowdown in original_solve().
Huh?
cachegrind shows similar cache miss rates for both original_solve() with double and optimized_solve() with float.
However, the instruction count of the latter are much higher then the former.
50% higher.
Out of curiosity, I try the following.
I first split trial() in two.
SummaryStruct optimized_trial(DatasetStruct* dataset) {
// construct problem
ProblemStruct* problem = construct_problem(dataset);
// run function and return summary stats
SummaryStruct summary = optimized_solve(problem);
// logging
log_summary(summary);
return summary;
}
SummaryStruct original_trial(DatasetStruct* dataset) {
// construct problem
ProblemStruct* problem = construct_problem(dataset);
// run function and return summary stats
SummaryStruct summary = original_solve(problem);
// logging
log_summary(summary);
return summary;
}
original_trial(dataset) // originalSince I only bench original_trial(), I only call original_solve().
Now I switch from double to float in optimized_solve().
As expected, I see the same slowdown in original_solve().
Now I simply comment out optimized_trial() before I bench original_trial().
Oddly enough, I see no slowdown in original_trial()!
I believe the compiler combines original_solve() and optimized_solve().
When I switch optimized_solve() from double to float, original_solve() is still double.
Thus, I can see how original_solve() would be slower and have higher instruction count.
There probably are a ridiculous number of casts from float to double and back.
How do I prevent the compiler from merging these functions?
My hypothesis is that the compiler only merges functions in the same file.
optimized_solve() and original_solve() both live in solve.cpp.
What if I break the two functions into separate files optimized_solve.cpp and original_solve.cpp?
Now with optimized_solve() using float, I see that original_trial() runs no slower.
Nice! The fix seems too simple to be true, but Iโll take it.
Of course, Iโm not done yet.
I have two functions with duplicate code that I should abstract out.
I like analogies.
optimized_trial() and original_trial() have identical code in the top and bottom but differ in the middle.
Identical code is the bread, and differing code is the meat.
I donโt like having two trial() functions.
Can I still use one trial() function without the compiler noticing?
My idea is to abstract optimized_solve() and original_solve() as object methods solve() of subclasses.
I must access data across function calls, and classes package data nicely with one pointer.
Tests show that using float in optimized solve() once again slows down trial(false, dataset).
The compiler sees past my ruse!
Maybe I need separate trial() functions to trick the compiler.
Can I instead abstract the bread?
My solution is to move construction and logging into shared util methods.
Tests show that using float in optimized_solve() does not slow down original_trial().
Thank goodness!
My work in optimization is often more empirical than exact.
I brainstorm many tricks to reduce computations, but I only know if they work after I implement and benchmark them.
I especially treat the C++ compiler as a blackbox.
Taking a class like 6.035 Computer Language Engineering seems a fun way to look inside.
Of course, compilers in practice are way more complicated, but 6.035 should be a great start.