[Xorp-hackers] Heap profiling at call site
Bruce Simpson
bms at incunabulum.net
Sun Nov 29 04:17:45 PST 2009
Bruce Simpson wrote:
> I found mpatrol, when used as an LD_PRELOAD, can have some problems with
> symbol retrieval. I'll see if adding it to the link line (like Google's
> cpu profiler) can overcome this issue.
>
mpatrol will log and backtrace at the call site just fine under
FreeBSD/i386, which suggests the failure is probably an x86-64 ABI issue.
The mpatrol author isn't an ABI-head; there's been some list churn about
it, and it sounds like it affects Linux too. Everyone in free-tool
land, it seems, is waiting for libunwind to cut a fully working x86-64
release. Given that the libunwind project originated at HP, it's not
much of a surprise that they focused on Itanic^WItanium first.
Re call-site heap profiling: it's largely academic just now, although
having it would save us a lot of time in tracking these things down.
I should mention at this point that I'm still going with Marko's old
hunch: that allocator churn (operator new) is what brings XRL I/O
performance down. Anecdotal evidence, i.e. the number of hits on the
allocators in the call traces, seems to bear this out.
Valgrind (in callgrind mode) will give accurate call counts on 'operator
new()', 'malloc()' and friends; that's what really matters. If we take a
callgrind-format sample from a real box (using oprofile or pmcstat)
and then cross-reference, that'll quickly give us some insight into
whether or not the XRL I/O paths are wedging on excessive allocations.
I would just like to have these things automated, so that when I get the
Thrift TTransport code banged out, I can tell at a glance that I am
not comparing apples with oranges, and that the improvements can be
quantified more quickly.
What's likely to give a performance boost, in the short term, is cutting
over to UNIX domain stream sockets for XRL. These are likely to function
like pipes. At least in FreeBSD, pipes are handled as a fifofs vnode,
which shims directly to the UNIX domain stream socket I/O path; these
are zero-copy inside the kernel, because the I/O is strictly local.
I believe Linux has since adopted similar optimizations in its pipe and
UNIX domain socket implementations.
Local TCP can't offer such optimizations. The rules say that if it's
TCP, it has to act like TCP. Even going over loopback means taking more
locks and running a full TCP state machine, so zero-copy is not as
easily implemented on such paths.
cheers,
BMS