[Xorp-hackers] Crash due to stale cached xrl sender pointer.
Bruce Simpson
bms at incunabulum.net
Thu Oct 29 10:10:03 PDT 2009
Hi Ben,
Not really meant to be spending time on this at the moment, but I
shall... it is not too far off from what I'm actually doing, and I think
we probably do need to go over this again ground a bit, given that I am
effectively rewriting the affected code right now.
Ben Greear wrote:
>
> In general, I dislike smart pointers, but in this case, they seem
> tailor made for the problem.
I would disagree that smart pointers are even necessarily the right
answer to the issue you've found; in some cases, they can do more harm
than good.
I got the electronic equivalent of dirty looks, at first, when I had
to work around a problem in template class Spt using ref_ptr<..>&.
However, that was an existing, isolated use of ref_ptr within the tree.
I'd prefer not to use ref_ptr for new code if at all possible.
XrlPFSender is a stereotype of an object which should not be created
and destroyed trivially. If something in libxipc is tripping over it, it
is possibly a race condition, or updates not being propagated elsewhere.
In this scenario, the Xrl blob contains a cached pointer to a
transport channel (XrlPFSender) which has now potentially gone away.
Given that Xrl instances are like confetti, it would be difficult to
track them all, and I'm not sure a refcount is the most appropriate way
to deal with that (see below).
It's a little trickier in this scenario because of how class Xrl is
treated in the code base.
>>
>> Is there a simpler workaround possible for the issue? I'd rather not
>> get too deep into reviewing a patch which cuts fairly deep into
>> internals which are probably about to get rewritten.
> I doubt it...don't know where all the xrls are stored..would have to
> search all of them and clean out any with
> pointers to the sender that is to be deleted.
There are several layers of indirection and caching going on in the XRL
layer, but the important ones here are:-
1) the cached FinderDBEntry used to hold the previous results of an
indirect XRL call through the Finder. The most fundamental cache
mechanism in libxipc.
2) the cached resolved_sender() pointer in Xrl -- what we're
interested in here.
The Xrl instance involved, based on your backtrace, seems to be
allocated by XrlRawPacket4V0p1Client::send_register_receiver(), and held
in a statically declared pointer.
When an XRL is to be sent through the C++ bindings, it will call back
into XrlRouter to see if there is a cached XrlPFSender for the given
XRL. The lookup is done w/o arguments.
One glaring blot on the landscape beckons this question:
* Are any processes sharing the segment that this 'static Xrl*' pointer
happens to be in? The pointer looks like it should be in BSS and thus
subject to copy-on-write, so this should not be an issue. However, if
multiple entities in the same process are calling the C++ bindings, this
COULD be a reentrancy issue.
If the Finder learns that an XRL target has gone away, it should blow
away the FinderClient cache entries, and then the XrlRouter::send()
method should notice this.
Unfortunately, this may not help in the failure case we're
examining. If this check is raced by the XRL being withdrawn and later
re-advertised by its target (e.g. its host process got restarted), then
obviously the cached XrlPFSender is going to be invalid in the XRL.
It's not 100% impossible that this notification has been raced. If
the code in your OSPF process which wants to send the XRL, is running
from a timer callback, and this callback happens to collide with the
FinderClient learning about the XRL target moving somewhere else in the
system -- then where the XRL data is going to get sent, will be affected
by which point in time it races the FinderClient cache update.
In many ways, the fact that the problem exists, is an artefact of
how method call resolution is working in the XRL RPC layer; it is
per-method rather than per-service, and this is really one of the things
I'm trying to address through the Thrift rewrite.
What the code is trying to do, is to cache the transport pointer
right next to the outgoing data. In principle this would be fine, were
it not for the fact that the transport can go away for a variety of
reasons. XrlPFSender has no knowledge of Xrl referencing it, and no
meaningful way to convey the failure mode to Xrl. It's really
XrlRouter's role to deal with this.
In the situation above, even if we held a ref_ptr on an XrlPFSender,
we wouldn't even know if the underlying network transport is still
valid. The "right thing" to do would be to force the inner Xrl's cached
resolved_sender pointer to be invalidated -- or validate the pointer
upfront when it's used. Again, this is really XrlRouter's responsibility.
It's possible for the Xrl's target to be known, and its XRL method
resolved, but its destination transport still unresolved, which is what
XrlRouter::get_sender() is trying to deal with.
For what it's worth, class Xrl largely exists because XORP RPC calls
can't be expressed as simple binary blobs. There are *two* RPC protocols
running in tandem inside libxipc, and one of them is textual. To my
mind, Xrl should be a more lightweight class than it actually is.
Caching the transport pointer (XrlPFSender) in the Xrl itself is
just asking for trouble in situations like this, given that we have no
means of telling the Xrl 'your resolved sender has gone away' -- it's
buried in libfooxif.so's copy-on-write BSS segment.
It's a non-trivial issue to fix. Using the ref_ptr seems deceptively
simple -- we get a handle on the transport, and so the code doesn't blow
up, but we probably don't fix the underlying issue (unless I'm missing
something).
Something interesting to try might be to modify clnt-gen to do a few
things in the client shims:
%%%
return _sender->send(*x, callback(....));
%%%
to become
%%%
bool retval = _sender->send(*x, callback(....));
x->set_resolved(false);
return retval;
%%%
This, however, doesn't fix the root problem either - it just makes
it possible to work around the issue without changing the allocation
semantics for XrlPFSender, by deprecating one of the cache mechanisms in
libxipc.
The current cache mechanism is fubar, because it apparently can't
deal with something in the XrlPFSender life cycle which causes it to be
deleted.
In summary:
I strongly believe that what you're actually seeing is a race which
class Xrl is not able to defend itself against... because the
responsibility for it belongs in class XrlRouter.
It would be good to get a handle first on who introduced the
secondary caching mechanism, and why. Most likely this was to avoid any
STL container traversals when an XRL is actually being sent, but given
that you've probably run into a race which blows this mechanism up, it
needs revisiting. (Yes, this has fallen under the axe in the Thrift
branch...!)
cheers,
BMS
More information about the Xorp-hackers
mailing list