I. From Milind Chabbi
I was developing a shared-memory synchronization algorithm, which was recursive in nature and involved complicated interactions between multiple threads via shared-memory updates. The code ran into a livelock, and the bug was neither apparent from inspecting the algorithm and code nor possible to isolate with traditional debugging techniques. Debugging was further complicated by the lack of reproducibility of the bug, the need for several threads to reproduce it, and run-to-run non-determinism. Debugging tricks such as watchpoints, page protection, and assertions could only identify symptoms of the problem but failed to reveal its root cause, even for parallel programming experts.
Intel's PinPlay served as a savior by helping identify the root cause: a data race. With PinPlay, I was able to run the code several times and record the log of a buggy execution. PinPlay's deterministic replay of multi-threaded code, in conjunction with the Pin framework's powerful facilities for sophisticated analysis during replay, allowed me to break into the debugger just in time to observe, step by step, the memory updates and thread interleaving that caused the data race. The cause of the data race was the following: the programmer assumed that 64-bit, cache-line-aligned memory writes are atomically visible on x86_64 machines, whereas the compiler (GNU C++ 4.4.5) took the liberty of splitting a 64-bit write of an immediate value into two independent 32-bit writes, violating that assumption. This left a small window of two instructions during which a shared variable was in an inconsistent state, leading to the occasional data race and the eventual livelock.
Like Intel's Pin framework, PinPlay is robust and works on real code on real machines, making it my tool of choice for debugging parallel programs. I would most certainly recommend PinPlay to both novice and expert programmers for debugging code that exhibits non-determinism. In fact, we plan to introduce PinPlay in one of the advanced multi-core programming classes here at Rice University.
Affiliation:
Milind Chabbi is a doctoral candidate advised by Prof. John Mellor-Crummey in the Department of Computer Science at Rice University. Milind is a member of Rice University's HPCToolkit team, where he develops tools and techniques for performance analysis of complex software systems.
If you are curious, this was the issue:
My C++ source code:

    cache_line_aligned_64_bit_variable = 0xdffffffffffffffd; // Expected atomic write

g++ generated assembly on a 64-bit machine:

    movl $0xfffffffd,(%rax)      // lower 32-bit update
    movl $0xdfffffff,0x4(%rax)   // higher 32-bit update
As you can see, the intended atomic write was turned into a non-atomic write during machine-code generation! I can't really blame the compiler for taking the liberty to do so.
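For anyone facing a similar problem, the usual remedy is to make the atomicity requirement explicit to the compiler rather than relying on alignment alone. The sketch below is not from the original program; it reuses the variable name above in a hypothetical, minimal C++11 example using std::atomic (which GCC 4.4.5 predates, so a newer compiler is assumed). On x86_64, a lock-free 64-bit atomic store compiles to a single mov, so the write can no longer be torn into two 32-bit halves.

    #include <atomic>
    #include <cstdint>

    // Hypothetical stand-in for the shared variable in the example above.
    // Declaring it std::atomic<uint64_t> tells the compiler the store must
    // not be split, so on x86_64 it is emitted as one 64-bit mov.
    std::atomic<uint64_t> cache_line_aligned_64_bit_variable(0);

    void publish() {
        // A relaxed store is enough to keep the write in a single instruction;
        // a stronger memory order can be chosen if the surrounding algorithm
        // also needs ordering guarantees.
        cache_line_aligned_64_bit_variable.store(0xdffffffffffffffdULL,
                                                 std::memory_order_relaxed);
    }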