Published Aug 02, 2021
I have an embedded processor that calls a C++ function per frame of a video stream.
We need to optimize this function!
How can I benchmark each optimization trick I try?
I first set the RNG seed to generate a fixed dataset for each of 1000 trials.
I can bench the original function offline or online.
Offline, I can generate datasets and bench the original function once beforehand, so tests run faster.
However, that assumes a controlled test environment, where I can trust an absolute time from last week instead of retiming.
I cannot assume a controlled environment.
Also, I develop off-device on a beefier machine that runs 1000 test trials in under 1 second.
Thus, I choose to bench online.
For a month, I use online benchmarks and optimize the time ratio.
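To make "optimize the time ratio" concrete, here is a minimal sketch of the kind of online harness I mean. The dataset and solver bodies are toy stand-ins, not my real code; only the fixed seeding, the trial loop, and the ratio are the point.

#include <chrono>
#include <cstdio>
#include <random>
#include <vector>

// Toy stand-ins for the real dataset and solvers.
using DatasetStruct = std::vector<double>;

DatasetStruct generate_dataset(std::mt19937& rng) {
    std::uniform_real_distribution<double> dist(0.0, 1.0);
    DatasetStruct dataset(4096);
    for (double& x : dataset) x = dist(rng);
    return dataset;
}

double original_solve(const DatasetStruct& d) {
    double acc = 0.0;
    for (double x : d) acc += x * x;
    return acc;
}

double optimized_solve(const DatasetStruct& d) {
    float acc = 0.0f;  // the float trick under test
    for (double x : d) acc += static_cast<float>(x * x);
    return acc;
}

int main() {
    std::mt19937 rng(42);  // fixed seed: identical datasets every run
    double checksum = 0.0;
    double original_ns = 0.0;
    double optimized_ns = 0.0;
    for (int i = 0; i < 1000; ++i) {  // 1000 trials
        DatasetStruct dataset = generate_dataset(rng);
        auto t0 = std::chrono::steady_clock::now();
        checksum += original_solve(dataset);
        auto t1 = std::chrono::steady_clock::now();
        checksum += optimized_solve(dataset);
        auto t2 = std::chrono::steady_clock::now();
        original_ns += std::chrono::duration<double, std::nano>(t1 - t0).count();
        optimized_ns += std::chrono::duration<double, std::nano>(t2 - t1).count();
    }
    std::printf("checksum: %f\n", checksum);  // keeps the solver calls observable
    // Unlike absolute times, the ratio stays comparable across machines and days.
    std::printf("optimized/original time ratio: %.3f\n", optimized_ns / original_ns);
    return 0;
}

Because both variants are retimed in the same run, a noisy machine slows both measurements, so the ratio stays trustworthy.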
One trick I try is switching from double to float.
Tests show just 0.2% accuracy loss but a 40% speedup relative to the original.
It sounds too good to be true. After looking at absolute times, I see that the optimized version has no speedup.
The original slows down!
Here's the high-level C++ code I run each trial.
SummaryStruct trial(bool is_optimized, DatasetStruct* dataset) {
    // construct problem
    ProblemStruct* problem = construct_problem(dataset);
    // run function and return summary stats
    SummaryStruct summary;
    if (is_optimized) {
        summary = optimized_solve(problem);
    } else {
        summary = original_solve(problem);
    }
    // logging
    log_summary(summary);
    return summary;
}
trial(true, dataset);   // optimized
trial(false, dataset);  // original
My tests show that changing optimized_solve() from double to float slows down original_solve().
How is that possible?
I define these functions independently, so the bug is likely in trial().
What if I just bench trial(false, dataset)? Then I only call original_solve().
Now I switch from double to float in optimized_solve().
Even though I never call optimized_solve(), I see the same slowdown in original_solve().
Huh?
cachegrind shows similar cache miss rates for both original_solve() with double and optimized_solve() with float.
However, the instruction count of the latter is much higher than that of the former: 50% higher.
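For reference, this is the usual cachegrind workflow; the benchmark binary name here is a placeholder.

valgrind --tool=cachegrind ./bench    # writes cachegrind.out.<pid>
cg_annotate cachegrind.out.<pid>      # per-function instruction and cache miss counts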
Out of curiosity, I try the following.
I first split trial() in two.
SummaryStruct optimized_trial(DatasetStruct* dataset) {
    // construct problem
    ProblemStruct* problem = construct_problem(dataset);
    // run function and return summary stats
    SummaryStruct summary = optimized_solve(problem);
    // logging
    log_summary(summary);
    return summary;
}

SummaryStruct original_trial(DatasetStruct* dataset) {
    // construct problem
    ProblemStruct* problem = construct_problem(dataset);
    // run function and return summary stats
    SummaryStruct summary = original_solve(problem);
    // logging
    log_summary(summary);
    return summary;
}
original_trial(dataset);  // original
Since I only bench original_trial(), I only call original_solve().
Now I switch from double to float in optimized_solve().
As expected, I see the same slowdown in original_solve().
Now I simply comment out optimized_trial() before I bench original_trial().
Oddly enough, I see no slowdown in original_trial()!
I believe the compiler combines original_solve() and optimized_solve().
When I switch optimized_solve() from double to float, original_solve() is still double.
Thus, I can see how original_solve() would be slower and have a higher instruction count.
There are probably a ridiculous number of casts from float to double and back.
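I cannot see the merged code directly, but a tiny hypothetical example shows why mixing precisions in one body is expensive: every iteration pays conversions between float state and double data.

// Hypothetical illustration of mixed-precision cast overhead,
// not actual compiler output.
double mixed_sum(const double* x, int n) {
    float acc = 0.0f;                     // float state...
    for (int i = 0; i < n; ++i) {
        acc += static_cast<float>(x[i]);  // ...fed by double data: one cast per element
    }
    return static_cast<double>(acc);      // and a cast back on the way out
}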
How do I prevent the compiler from merging these functions?
My hypothesis is that the compiler only merges functions in the same file.
optimized_solve() and original_solve() both live in solve.cpp.
What if I break the two functions into separate files, optimized_solve.cpp and original_solve.cpp?
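A minimal sketch of the layout I mean; the shared header name is my own invention.

// solve.h: shared declarations
SummaryStruct optimized_solve(ProblemStruct* problem);
SummaryStruct original_solve(ProblemStruct* problem);

// optimized_solve.cpp
#include "solve.h"
SummaryStruct optimized_solve(ProblemStruct* problem) {
    SummaryStruct summary;
    // ... float implementation ...
    return summary;
}

// original_solve.cpp
#include "solve.h"
SummaryStruct original_solve(ProblemStruct* problem) {
    SummaryStruct summary;
    // ... double implementation ...
    return summary;
}

As long as each .cpp file is compiled as its own translation unit (and link-time optimization is off), the compiler never sees both bodies at once.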
Now with optimized_solve() using float, I see that original_trial() runs no slower.
Nice! The fix seems too simple to be true, but I'll take it.
Of course, I'm not done yet.
I have two functions with duplicate code that I should abstract out.
I like analogies.
optimized_trial() and original_trial() have identical code at the top and bottom but differ in the middle.
The identical code is the bread, and the differing code is the meat.
I don't like having two trial() functions.
Can I still use one trial() function without the compiler noticing?
My idea is to abstract optimized_solve() and original_solve() as solve() methods of subclasses.
I must access data across function calls, and classes package data nicely with one pointer.
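A minimal sketch of this design, assuming a virtual solve() method; the class names are illustrative, and SummaryStruct, ProblemStruct, construct_problem(), and log_summary() are as in the earlier snippets.

struct Solver {
    virtual ~Solver() = default;
    virtual SummaryStruct solve(ProblemStruct* problem) = 0;
};

struct OriginalSolver : Solver {
    SummaryStruct solve(ProblemStruct* problem) override {
        SummaryStruct summary;
        // ... double implementation ...
        return summary;
    }
};

struct OptimizedSolver : Solver {
    SummaryStruct solve(ProblemStruct* problem) override {
        SummaryStruct summary;
        // ... float implementation ...
        return summary;
    }
};

// One trial() for both: the subclass supplies the meat.
SummaryStruct trial(Solver& solver, DatasetStruct* dataset) {
    ProblemStruct* problem = construct_problem(dataset);
    SummaryStruct summary = solver.solve(problem);
    log_summary(summary);
    return summary;
}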
Tests show that using float in optimized_solve() once again slows down trial(false, dataset).
The compiler sees past my ruse!
Maybe I need separate trial() functions to trick the compiler.
Can I instead abstract the bread?
My solution is to move construction and logging into shared util methods.
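A minimal sketch of abstracting the bread; the util names are my own.

// util.cpp: the shared bread
ProblemStruct* util_construct(DatasetStruct* dataset) {
    return construct_problem(dataset);
}

void util_log(const SummaryStruct& summary) {
    log_summary(summary);
}

// Two thin trial functions: only the meat differs.
SummaryStruct optimized_trial(DatasetStruct* dataset) {
    ProblemStruct* problem = util_construct(dataset);
    SummaryStruct summary = optimized_solve(problem);
    util_log(summary);
    return summary;
}

SummaryStruct original_trial(DatasetStruct* dataset) {
    ProblemStruct* problem = util_construct(dataset);
    SummaryStruct summary = original_solve(problem);
    util_log(summary);
    return summary;
}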
Tests show that using float in optimized_solve() does not slow down original_trial().
Thank goodness!
My work in optimization is often more empirical than exact.
I brainstorm many tricks to reduce computations, but I only know if they work after I implement and benchmark them.
I especially treat the C++ compiler as a black box.
Taking a class like 6.035 Computer Language Engineering seems like a fun way to look inside.
Of course, compilers in practice are way more complicated, but 6.035 should be a great start.