When faced with weird results coming from everyday tools, it can be very tempting to discard those results in disbelief by calling them a bug. The fact that those weird results are rare occurrences, that they may not be reproducible in the developer’s environment, and most importantly, that the developer cannot explain them, are strong incentives to discard those reports. And yet, it is very important to fight the urge to mistrust your tools because in most cases, they are absolutely right.
Very recently, I helped a coworker with this kind of subtle problem. His team had a crash log coming in from customers showing that our plugin for some other software was crashing the application while calling the Boost libraries. To illustrate this situation, I’ve reproduced a simplified crash log whose stack frames read as follows:
4 boost::system::some_function()   // Path to system-wide: libboost_system.so
3 MyPlugin::initialization()       // Path to: MyPlugin.so
2 MyPlugin::onload()               // Path to: MyPlugin.so
1 ThirdPartyApp::pluginManager()   // Path to: ThirdPartyApp
0 ThirdPartyApp::main()            // Path to: ThirdPartyApp
Now the important detail about that stack trace is the call from frame 3 to frame 4. What proved challenging was that the team could not explain how this situation could be possible.
MyPlugin.so statically links to the Boost system library, and they believed the path to the system-wide dynamic library was a bug of the crash reporter or the symbolicator. And so, they were looking for new hypotheses to work with. That’s where I stopped them, before any more work was done. We needed to make sure that they were right about that conclusion before digging elsewhere.
Now fortunately for us, the team had one environment where they successfully reproduced the problem. The plugin worked fine everywhere else. This meant we could experiment with the problematic environment and perform tests in order to identify the critical difference which led to the problem. Our first step was to make sure that libboost_system.so does exist at the path reported by the crash log. And it did, on both a stable environment and the problematic environment.
We then set out to prove that the static link did work by making sure the Boost symbol reported by the crash log does exist within our binary (MyPlugin.so). To get that done, we used the following command line:

nm MyPlugin.so | c++filt | grep boost::system::some_function

The nm utility lists the symbols recorded in a binary’s symbol table, which includes every symbol the binary exports. The output is then piped through c++filt, which demangles the symbol names and makes them easier for humans to read. Finally, grep filters the output looking for our search string. For this check to be meaningful, the symbol must be listed as defined, since nm marks symbols that are only referenced with a U. Sure enough, we found the symbol we were looking for.
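To make the c++filt step less of a black box, here is a minimal sketch of the same demangling done in code. It assumes a GCC or Clang toolchain, whose runtime exposes abi::__cxa_demangle in <cxxabi.h>. The helper name demangle is mine, and the mangled string in the usage note is simply what the Itanium C++ ABI produces for a function named boost::system::some_function taking no arguments; the real Boost symbol from the crash log would look different.

```cpp
#include <cassert>
#include <cstdlib>
#include <cxxabi.h>
#include <memory>
#include <string>

// Hypothetical helper: turns a mangled C++ symbol name (as printed by nm)
// into its human-readable form, which is roughly what c++filt does per line.
// Falls back to the input unchanged when it is not a valid mangled name.
std::string demangle(const char* mangled) {
    assert(mangled != nullptr);
    int status = 0;
    // __cxa_demangle allocates the result with malloc, so it is released with free.
    std::unique_ptr<char, void (*)(void*)> text(
        abi::__cxa_demangle(mangled, nullptr, nullptr, &status), std::free);
    if (status != 0 || !text) {
        return std::string(mangled);
    }
    return std::string(text.get());
}
```

Calling demangle("_ZN5boost6system13some_functionEv") gives back boost::system::some_function(), which is the form grep was matching against in the command above.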
Up to this point, we have proved that there are two different locations where the code exists. But we have yet to prove which one really gets called at runtime in both environments. Some might argue that the crash log proves what is happening, but this is precisely the hypothesis we are trying to challenge.
One way to prove what is happening is to hook up a debugger and set a breakpoint. But there is another way we can achieve the same result without installing any developer tools on the problematic environment: activating some debugging features of the dynamic linker.
Various logging and lookup features of the dynamic linker can be configured with environment variables. In this particular case, it is the symbol bindings that are of interest. Linux’s dynamic linker, ld.so, produces these logs when the LD_DEBUG=bindings environment variable is set. On macOS, use DYLD_PRINT_BINDINGS=1 to achieve the same result.
Here’s a trimmed-down excerpt of that logging on macOS:
dyld: bind: ThirdPartyApp:0x10B93C000 = libdyld.dylib:dyld_stub_binder, *0x10B93C000 = 0x7FFFB1483168
dyld: lazy bind: ThirdPartyApp:0x10B93C010 = libdyld.dylib:_dlopen, *0x10B93C010 = 0x7FFFB14847F7
dyld: forced lazy bind: ThirdPartyApp:0x10B93C080 = libboost_system.dylib:some_function(), *0x10B93C080 = 0x10B972310
dyld: bind: MyPlugin.dylib:0x10B973000 = libc++.1.dylib:std::__1::cout, *0x10B973000 = 0x7FFFBA136660
dyld: bind: MyPlugin.dylib:0x10B973010 = libc++abi.dylib:___gxx_personality_v0, *0x10B973010 = 0x7FFFB0091FC0
dyld: bind: MyPlugin.dylib:0x10B973018 = libdyld.dylib:dyld_stub_binder, *0x10B973018 = 0x7FFFB1483168
dyld: forced lazy bind: MyPlugin.dylib:0x10B973028 = libunwind.dylib:__Unwind_Resume, *0x10B973028 = 0x7FFFB16CEE8E
[…]
dyld: forced lazy bind: MyPlugin.dylib:0x10B973030 = libboost_system.dylib:some_function(), *0x10B973030 = 0x10B972310
[…]
dyld: forced lazy bind: MyPlugin.dylib:0x10B9730A8 = libc++abi.dylib:std::terminate(), *0x10B9730A8 = 0x7FFFB0091D90
dyld: forced lazy bind: MyPlugin.dylib:0x10B9730B0 = libc++abi.dylib:___cxa_begin_catch, *0x10B9730B0 = 0x7FFFB00917E1
dyld: forced lazy bind: MyPlugin.dylib:0x10B9730B8 = libc++abi.dylib:___cxa_end_catch, *0x10B9730B8 = 0x7FFFB0091855
dyld: forced lazy bind: MyPlugin.dylib:0x10B9730C0 = libsystem_platform.dylib:_memset, *0x10B9730C0 = 0x7FFFB169B34E
dyld: forced lazy bind: MyPlugin.dylib:0x10B9730C8 = libsystem_c.dylib:_strlen, *0x10B9730C8 = 0x7FFFB14BDB40
dyld: lazy bind: ThirdPartyApp:0x10B93C018 = libdyld.dylib:_dlsym, *0x10B93C018 = 0x7FFFB1484888
I have isolated one line in the middle of the output which shows what we were looking for. There is a symbol within MyPlugin.dylib (the macOS equivalent of MyPlugin.so) being bound to the system-wide Boost dynamic library. So that’s our proof that the tools are right. The crash log is not misleading, and rather than searching for a solution blindly while ignoring the best lead, the investigation can get back on track.
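For situations where neither a debugger nor the linker’s environment variables are an option, a similar question can be asked from inside the process. This is not what we did during the investigation; it is an alternative sketch that relies on dladdr() from <dlfcn.h>, which exists on both Linux and macOS. The helper names image_containing and image_of_function are hypothetical.

```cpp
#include <cassert>
#include <dlfcn.h>
#include <string>

// Hypothetical helper: returns the path of the loaded image (executable or
// shared library) whose address range contains `address`, or an empty string
// when dladdr() cannot match the address to any loaded image.
std::string image_containing(const void* address) {
    Dl_info info{};
    if (dladdr(address, &info) == 0 || info.dli_fname == nullptr) {
        return std::string();
    }
    return std::string(info.dli_fname);
}

// Convenience wrapper for function pointers. Converting a function pointer to
// an object pointer is only conditionally supported by the C++ standard, but
// it works on the POSIX platforms where dladdr() is available.
template <typename Fn>
std::string image_of_function(Fn* fn) {
    assert(fn != nullptr);
    return image_containing(reinterpret_cast<const void*>(fn));
}
```

Called from within the plugin on a pointer to the Boost function in question, this would be expected to report the image that the pointer was resolved to, which is the same information the binding log gives. On older glibc versions the program has to be linked with -ldl.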
Now, my point with this article is that one should never dismiss results based on mere assumptions. Proof can go a long way toward saving precious time. But I understand some of you might still be puzzled about how a function call into a static library could get rebound. You have two options. The first option is to read this great article on the technical details explaining the problem. The second option is to read the following paragraph which summarizes it much more briefly. In both cases, you may want to toy around with the sample program I made that reproduces the issue.
Now if you are still reading, it means you want the spoiler version. Basically, the problem is that MyPlugin.so didn’t properly hide its internal symbols: everything was global. So even though Boost::System was statically linked into MyPlugin.so, the dynamic linker is still allowed to rebind the call from the statically linked Boost to the other Boost instance that was previously linked in by ThirdPartyApp. Yes, exactly. Maybe you missed it the first time you read the excerpt a few paragraphs back, but there was indeed a line showing ThirdPartyApp binding the same symbol name from Boost::System as the one MyPlugin.so is going to use. Now this problem is easy to reproduce on a Linux system, but there is an important additional detail that concerns the Mac. macOS uses a two-level namespace system meant to prevent this type of accidental binding from happening. You will find an additional linker flag meant to deactivate this protection in the CMakeLists.txt of my sample program on GitHub.
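To give an idea of what hiding internal symbols can look like in source code, here is a minimal sketch assuming GCC or Clang, whose visibility attribute controls whether a symbol ends up in the shared library’s dynamic symbol table. The macro names, the myplugin namespace and the myplugin_onload entry point are all hypothetical; the real plugin’s interface is not shown in this article.

```cpp
#include <cassert>

// Hypothetical macros for this sketch; the attribute syntax is GCC/Clang-specific.
#if defined(__GNUC__) || defined(__clang__)
#define MYPLUGIN_EXPORT __attribute__((visibility("default")))
#define MYPLUGIN_LOCAL __attribute__((visibility("hidden")))
#else
#define MYPLUGIN_EXPORT
#define MYPLUGIN_LOCAL
#endif

namespace myplugin {
// Internal helper: with hidden visibility it is not exported from the shared
// library, so calls to it are bound inside the library itself and cannot be
// redirected to a same-named symbol from another image.
MYPLUGIN_LOCAL int initialization() { return 0; }
}  // namespace myplugin

// The only symbol the host application is meant to look up (hypothetical name).
extern "C" MYPLUGIN_EXPORT int myplugin_onload() {
    const int status = myplugin::initialization();
    assert(status == 0);  // this sketch's initialization() always succeeds
    return status;
}
```

Annotations like these, or compiling with -fvisibility=hidden, only cover code you compile yourself. For symbols pulled in from a static archive such as Boost, GNU ld’s --exclude-libs option or a linker version script are common ways to keep them out of the exported set on Linux. Which of these fits a given plugin depends on its build, so treat this as a starting point rather than the fix that was applied here.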