A rambling account of a day-long bughunt.
Yesterday a colleague discovered that when running under ecto, PCL's ascii-format pcd writer output files with corrupted colors... many of the points were incorrectly black. This pcd writer packs four bytes into a float and serializes the float with the standard C++ streaming operators. The program was driven by a Python script; at length Eitan discovered that if he simply omitted the import of ecto_ros, without recompiling anything, the files came out uncorrupted.
My first thought was an ODR violation, which yields Undefined Behavior, AKA Nasal Demons: bugs that manifest themselves in absolutely any way they choose. Probably it has nothing to do with the pcd writer per se, we thought. Maybe a locale issue, or something peculiar to the configuration of this colleague's machine, we thought... we were wrong.
It was completely reproducible, but difficult to debug: the process loaded a Python interpreter, ecto, ecto_pcl, ~20 PCL shared libraries, maybe ten more ROS shared libraries, ecto_ros, 12 OpenCV shared libraries... something like 80 in total. The import of ecto_ros accounted for roughly 50 of them.
An ODR violation is tough to track down when much of the code is already compiled; you can't easily preprocess the translation units and compare them. Getting everything checked out and built from source would have been a several-hour exercise, during which the problem would probably have disappeared, leaving us back where we started. We looked at ldd output and rpaths, and tried to run nm on the shared objects, only to find they had been compiled release and stripped. We fired up valgrind and gdb and chased irrelevant global-destructor-time invalid deletes in ROS libraries, walking the stack but not seeing much.
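For the record, the triage commands were of this sort, shown here against /bin/ls since the original binaries aren't reproducible; the commands, not the target, are the point:

```shell
ldd /bin/ls | sort                      # which shared objects get loaded?
readelf -d /bin/ls | grep -i 'r.*path' || true   # any baked-in RPATH/RUNPATH?

# nm on a stripped release library shows nothing without -D, which
# reads the dynamic symbol table instead of the (stripped) symtab:
libc=$(ldd /bin/ls | awk '/libc\.so/ { print $3 }')
nm -D "$libc" | head -n 3
```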
We constructed a standalone binary that loaded a pcd file and saved an ascii version... at first, no error, until we linked against ecto_ros. Good: we could reproduce it. We built a test harness that recompiled our binary, executed it, and compared the checksum of the saved file against one known to be uncorrupted. Still thinking of ODR violations, we wanted to see whether there were duplicate but differing headers on the include path (say, two copies of sensor_msgs/PointCloud2.h), but the path was thirty directories long, and again we couldn't easily inspect the preprocessed source that had been compiled into the shared libraries we were running against.
We whittled away at the list of libraries: comment out, recompile, relink, rerun. We got down to OpenCV, PCL, and a couple of ROS libraries that PCL required. That is still twenty-something libraries.
Many iterations later, we had narrowed it down to OpenCV, which ecto_ros used in some nodes that convert between sensor_msgs/Image and cv::Mat. Next, of those 12 OpenCV libraries, the mere presence of libopencv_core.so.2.3.1 in the link list would toggle the bug.
We built OpenCV from scratch and compiled against that: still corrupted. We next tried to whittle down the contents of this library until we found the code that toggled the bug... but none of this code was being executed anyway, and after some effort we discovered that we weren't going to be able to subdivide the codebase, thanks to unresolved symbols at dynamic link time... our only option would have been to manually comment out the bodies of every function in every translation unit.
In desperation, we did what any reasonable software engineer would do: we randomly flipped compile flags. To our astonishment, the debug build of libopencv_core did not provoke the buggy behavior, but the release build did. This, despite the fact that no code inside the library was ever being run.
Was it assert() statements with side effects, executed at dlopen() time from the constructors of static objects, that masked the error condition in the debug build? Other nefarious code conditional on NDEBUG? SSE optimization flags? We tested each: no, no, and no. Twenty or so recompiles later, the culprit was isolated:
Google demonstrated its omniscience by turning up a site called programerror.com as the first hit for gcc fast-math. Another hour later we'd found that, of the various -f flags that -ffast-math enables, the specific offender was -funsafe-math-optimizations. An excerpt from the gcc docs:
-funsafe-math-optimizations Allow optimizations for floating-point arithmetic that (a) assume that arguments and results are valid and (b) may violate IEEE or ANSI standards. When used at link-time, it may include libraries or startup files that change the default FPU control word or other similar optimizations. This option is not turned on by any -O option since it can result in incorrect output for programs which depend on an exact implementation of IEEE or ISO rules/specifications for math functions. It may, however, yield faster code for programs that do not require the guarantees of these specifications. Enables -fno-signed-zeros, -fno-trapping-math, -fassociative-math and -freciprocal-math. The default is -fno-unsafe-math-optimizations.
So. -ffast-math would appear to be rather evil. If you use this flag to build your library, programs that link against it and expect correct math may break, and when they break, they will do so with perfect consistency, yet it will be very tough to figure out why.
So. If you've skipped to the bottom of this message, the short story is: DO NOT USE -ffast-math.