[Xorp-hackers] Heap profiling at call site

Bruce Simpson bms at incunabulum.net
Sun Nov 29 04:17:45 PST 2009


Bruce Simpson wrote:
> I found mpatrol, when used as an LD_PRELOAD, can have some problems with 
> symbol retrieval. I'll see if adding it to the link line (like Google's 
> cpu profiler) can overcome this issue.
>   

mpatrol will log and backtrace at the call site just fine under 
FreeBSD/i386, which suggests it's probably an x86-64 ABI issue.

The mpatrol author isn't an ABI-head; there's been some list churn about 
it, and it sounds like it affects Linux too. Everyone in free-tool land, 
it seems, is waiting for libunwind to cut a fully working x86-64 
release. Given that the libunwind project originated at HP, it's not too 
surprising that they focused on Itanic^WItanium first.

Re call-site heap profiling: it's largely academic just now, although 
having it would save us a lot of time in tracking these things down.
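
To make the idea concrete, here is a rough, hypothetical sketch of what 
call-site heap profiling boils down to: override the global operator new 
and capture a backtrace per allocation. This is not how mpatrol does it 
internally, and a real profiler has to avoid re-entering the allocator 
from the capture and logging paths; it only shows the shape of the 
thing. backtrace() comes from <execinfo.h> (glibc, or libexecinfo on 
FreeBSD).

    #include <execinfo.h>
    #include <cstdio>
    #include <cstdlib>
    #include <new>

    void*
    operator new(std::size_t size) throw (std::bad_alloc)
    {
        // Capture the call sites leading to this allocation.
        void* frames[16];
        int depth = backtrace(frames, 16);

        void* p = std::malloc(size);
        if (p == 0)
            throw std::bad_alloc();

        // A real profiler would hash the frame array into a table
        // keyed by call site; here we just log the allocation.
        std::fprintf(stderr, "new(%lu) = %p (%d frames deep)\n",
                     (unsigned long)size, p, depth);
        return p;
    }

    void
    operator delete(void* p) throw ()
    {
        std::free(p);
    }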

I should mention at this point that I'm still going with Marko's old 
hunch that it is allocator churn (operator new) which brings XRL I/O 
performance down. The anecdotal evidence seems to bear this out (looking 
at the hits on the allocators in the call traces).
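
By "churn" I mean the kind of thing below: a purely illustrative, 
made-up marshalling helper (not the real XRL code), where every dispatch 
builds and tears down a handful of std::string temporaries, so each XRL 
costs several trips through operator new/delete before any I/O happens:

    #include <string>

    // Hypothetical example only.  'out' and the appends below all go
    // through operator new (no SSO in the old GCC COW string), and
    // everything is freed again almost immediately after the send.
    std::string
    marshal_xrl(const std::string& target, const std::string& method)
    {
        std::string out("finder://");   // heap allocation
        out += target;                  // may reallocate
        out += "/";
        out += method;                  // may reallocate again
        return out;                     // and possibly a copy for the caller
    }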

Valgrind (in callgrind mode) will give accurate call counts for 
'operator new()', 'malloc()' and friends; that's what really matters. If 
we take a callgrind-format sample from a real box (using oprofile or 
pmcstat) and then cross-reference the two, that will quickly give us 
some insight into whether or not the XRL I/O paths are wedging on 
excessive allocations.
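
For reference, the callgrind half of that is just something along these 
lines (from memory, so check the Valgrind manual; the output file name 
carries the PID, and substitute whichever XORP binary is actually being 
measured):

    valgrind --tool=callgrind ./xorp_fea
    callgrind_annotate callgrind.out.<pid>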

I would just like to have these things automated, so that when I get the 
Thrift TTransport code banged out, I can tell at a glance that I am not 
comparing apples with oranges, and so that the improvements can be 
quantified more quickly.

What's likely to give a performance boost in the short term is cutting 
over to UNIX domain stream sockets for XRL. These should behave much 
like pipes: at least in FreeBSD, a pipe is handled as a fifofs vnode 
which shims directly onto the UNIX domain stream socket I/O path, and 
that path is zero-copy inside the kernel because the I/O is strictly 
local.

I believe Linux has since adopted similar optimizations in its pipe and 
UNIX domain socket implementations.
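
For concreteness, a rough sketch of the listener side (illustrative 
only; the function name and socket path handling here are made up, not 
the eventual XRL API) would look like this; the client end is just 
socket() plus connect() on the same path, in place of a TCP connect to 
127.0.0.1:

    #include <sys/socket.h>
    #include <sys/un.h>
    #include <unistd.h>
    #include <cstring>
    #include <cstdio>

    // Create a listening UNIX domain stream socket bound at 'path'.
    // Returns the listening fd, or -1 on error.
    int
    make_local_xrl_listener(const char* path)
    {
        int fd = socket(AF_UNIX, SOCK_STREAM, 0);
        if (fd < 0) {
            std::perror("socket");
            return -1;
        }

        struct sockaddr_un addr;
        std::memset(&addr, 0, sizeof(addr));
        addr.sun_family = AF_UNIX;
        std::strncpy(addr.sun_path, path, sizeof(addr.sun_path) - 1);

        unlink(path);   // clear out a stale socket from a previous run

        if (bind(fd, (struct sockaddr*)&addr, sizeof(addr)) < 0 ||
            listen(fd, 5) < 0) {
            std::perror("bind/listen");
            close(fd);
            return -1;
        }
        return fd;      // accept() local XRL clients on this fd
    }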

Local TCP can't offer such optimizations. The rules say that if it's 
TCP, it has to act like TCP. Even going over loopback means taking more 
locks and running a full TCP state machine, so zero-copy is not as 
easily implemented on such paths.

cheers,
BMS


