[Xorp-hackers] Crash due to stale cached xrl sender pointer.

Ben Greear greearb at candelatech.com
Mon Nov 9 10:04:23 PST 2009


On 11/09/2009 09:13 AM, Bruce Simpson wrote:
> Bruce Simpson wrote:
>> It would be good to know if changing the Xrl allocation in this way
>> helps the situation with the race you saw...
>
> A bit more on the 'static Xrl*' mechanism.
>
> This is a lazy allocation mechanism, which amortizes the cost of
> instantiating class Xrl. The change I posted yesterday just moves the
> pointer to this allocation into the XrlFooClient itself, where we can
> manage its lifecycle; the semantics of its use don't change.
>
> Some data points:
> * There is only one instance of cached 'class Xrl*' for each
> XrlFooClient method.
> * Each method's wrapper lazy-allocates it once per runtime, when the
> method wrapper is called.
> * Without my change, this lazy allocation is non-reentrant, and happens
> once per runtime.
> * With my change, it happens once per method call, for each XrlFooClient
> instance, which is reentrant (but still not thread-safe).
> * It sets up the Xrl::args() field. That's it. No additional
> amortization of up-front costs of sending an XRL.
> * If the consumer of XrlFooClient::send_foo(target, arg) changes its
> 'target' argument, the resolved XrlSender is very likely to change,
> forcing a FinderDBEntry lookup in XrlRouter::send() anyway.
> * No upfront resolution is performed when the target changes.
>
> In practice, this optimization probably doesn't offer much, but it's
> better than nothing. class Xrl is still an intermediate representation,
> and until the Xrl is actually sent on the wire, it doesn't get 'packed'
> into its on-wire representation.
>
> There is a lost optimization opportunity here. XRL clients which use the
> C++ client stubs never need to tunnel XRLs through the Finder, so they
> can always use binary representation. However, there is nowhere to put
> it, until we actually hit the output buffers for the destination
> transport. [1]
>
> I haven't measured cycle-for-cycle yet the costs of sending an XRL. I
> probably will do this when profiling the Thrift run; I'm still in
> analysis, trying to come up with the least invasive change to avoid
> rewriting code I don't/won't need to.
>
> Thread safety hasn't really been a consideration to date. I'm making
> notes, but I'm not going to substantially change the behaviour towards
> making existing code thread-safe.

I think it is way too early to worry about micro-optimizations.  If
you want to move to Thrift, then let's do so.  As long as the performance
isn't noticeably worse, that's OK.

Don't try to get everything figured out up front... get something working,
and then we can see what needs tweaking later.

With regard to larger issues in the IPC, I dislike the current callback
logic.  I would like to get rid of all the auto-generated template code
and use real callback objects.  Each XRL call would hold a pointer to a
callback object and invoke it at the appropriate times.
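
[A minimal sketch of the "real callback object" idea. All names here (XrlCallback, XrlCall, on_reply, on_error) are hypothetical, invented for illustration; the existing XORP API is not like this. The shape is just: one abstract callback interface instead of per-method template glue.]

```cpp
#include <cassert>
#include <string>

// Hypothetical callback interface: each in-flight XRL call carries a
// pointer to one of these and invokes it when the call completes.
struct XrlCallback {
    virtual ~XrlCallback() {}
    virtual void on_reply(const std::string& result) = 0;
    virtual void on_error(const std::string& reason) = 0;
};

// Hypothetical call object: holds a non-owning callback pointer and
// fires it at the appropriate time.
class XrlCall {
public:
    explicit XrlCall(XrlCallback* cb) : _cb(cb) {}
    void complete(const std::string& result) { if (_cb) _cb->on_reply(result); }
    void fail(const std::string& reason)     { if (_cb) _cb->on_error(reason); }
private:
    XrlCallback* _cb;  // non-owning; caller manages lifetime
};

// Example consumer: counts replies instead of going through
// auto-generated template dispatch.
struct CountingCallback : XrlCallback {
    int replies = 0;
    std::string last;
    void on_reply(const std::string& r) override { ++replies; last = r; }
    void on_error(const std::string&)  override {}
};
```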

If/when you get your Thrift stuff in, I might attempt to work on
the callback logic.

I don't think we should worry about any async message support at the transport
level.  As long as we have useful callbacks, only the clients and servers need
to worry about async.  For instance, the client says 'do-big-work()'.
The server immediately responds 'working' (the RPC call is now done).
Later, the server can send more messages to the client as the big work completes.

XRL doesn't need to know anything about this.
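
[The pattern above, sketched in miniature. Everything here is illustrative (Server, do_big_work, finish_all are made-up names): the RPC handler acknowledges immediately and queues the job, and progress arrives later through ordinary messages, so the transport stays synchronous.]

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <vector>

// Illustrative server: the RPC itself returns "working" right away;
// completion is reported later via messages pushed to the client.
class Server {
public:
    // RPC handler for 'do-big-work()': acknowledge and queue, don't block.
    std::string do_big_work(std::function<void(const std::string&)> notify) {
        _pending.push_back(notify);
        return "working";   // from XRL's point of view, the call is done
    }
    // Later, as the big work completes, send messages to the clients.
    void finish_all() {
        for (std::size_t i = 0; i < _pending.size(); ++i)
            _pending[i]("done");
        _pending.clear();
    }
private:
    std::vector<std::function<void(const std::string&)> > _pending;
};
```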

Don't do any work to add functionality that *may* be needed in the future,
such as extra transport mechanisms, thread safety, etc.  I really hate
threads unless they are absolutely required.  They make things way too
hard to debug.

If/when it's actually needed, we can do that work then.  (For threads,
this would likely require a rewrite of the entire application in
question, as well as hacking on the various libraries.)


> [1] We can probably leverage this opportunity using Thrift, as we can
> then render the binary representation upfront. We can also use a mixin
> with TTransport to do scatter/gather I/O with multiple TMemoryBuffers,
> so we only hit writev() or send() to keep syscall count minimal. XRL
> will currently try to do this, but it's tied to stream semantics.
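
[The scatter/gather idea in [1] boils down to handing several separately-built buffers to one writev() call. POSIX writev() is real; the buffer contents and the helper's name below are illustrative only.]

```cpp
#include <cassert>
#include <cstring>
#include <sys/uio.h>
#include <unistd.h>

// Gather a pre-rendered header buffer and a serialized body buffer
// into a single writev() call: two buffers, one syscall.
ssize_t send_header_and_body(int fd, const char* hdr, const char* body) {
    struct iovec iov[2];
    iov[0].iov_base = const_cast<char*>(hdr);
    iov[0].iov_len  = strlen(hdr);
    iov[1].iov_base = const_cast<char*>(body);
    iov[1].iov_len  = strlen(body);
    return writev(fd, iov, 2);
}
```

With Thrift's TMemoryBuffer instances holding the pre-rendered pieces, a transport mixin could collect their backing buffers into the iovec the same way.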

"Premature optimization is the root of all evil."

Let's get something stable, then figure out a big workload, and run
some real performance tests (with oprofile, gprof, custom profiling, etc.).

For instance, in my case, just adding that netlink filtering took system load
from 300+ down to 20, effectively turning an O(N^2) algorithm into O(N).  This
sort of thing is way more important than saving a few cycles encoding and
decoding messages we send once per second.

Most of the slowdowns I've seen are stupid things like sleeping for
a timeout instead of doing work, probably to work around some race
no one felt like debugging.

Thanks,
Ben

-- 
Ben Greear <greearb at candelatech.com>
Candela Technologies Inc  http://www.candelatech.com
