[Xorp-hackers] Crash due to stale cached xrl sender pointer.

Bruce Simpson bms at incunabulum.net
Thu Oct 29 10:10:03 PDT 2009


Hi Ben,

    Not really meant to be spending time on this at the moment, but I 
shall... it is not too far off from what I'm actually doing, and I think 
we probably do need to go over this again ground a bit, given that I am 
effectively rewriting the affected code right now.

Ben Greear wrote:
>
> In general, I dislike smart pointers, but in this case, they seem 
> tailor made for the problem.

    I would disagree that smart pointers are even necessarily the right 
answer to the issue you've found; in some cases, they can do more harm 
than good.

    I got the electronic equivalent of dirty looks, at first, when I had 
to work around a problem in template class Spt using ref_ptr<..>&. 
However, that was an existing, isolated use of ref_ptr within the tree. 
I'd prefer not to use ref_ptr for new code if at all possible.

    XrlPFSender is a stereotype of an object which should not be created 
and destroyed trivially. If something in libxipc is tripping over it, it 
is possibly a race condition, or updates not being propagated elsewhere.

    In this scenario, the Xrl blob contains a cached pointer to a 
transport channel (XrlPFSender) which has now potentially gone away. 
Given that Xrl instances are like confetti, it would be difficult to 
track them all, and I'm not sure a refcount is the most appropriate way 
to deal with that (see below).

It's a little trickier in this scenario because of how class Xrl is 
treated in the code base.

>>
>> Is there a simpler workaround possible for the issue? I'd rather not 
>> get too deep into reviewing a patch which cuts fairly deep into 
>> internals which are probably about to get rewritten.
> I doubt it...don't know where all the xrls are stored..would have to 
> search all of them and clean out any with
> pointers to the sender that is to be deleted.

There are several layers of indirection and caching going on in the XRL 
layer, but the important ones here are:-
  1) the cached FinderDBEntry used to hold the previous results of an 
indirect XRL call through the Finder. The most fundamental cache 
mechanism in libxipc.
  2) the cached resolved_sender() pointer in Xrl  -- what we're 
interested in here.

The Xrl instance involved, based on your backtrace, seems to be 
allocated by XrlRawPacket4V0p1Client::send_register_receiver(), and held 
in a statically declared pointer.

When an XRL is to be sent through the C++ bindings, it will call back 
into XrlRouter to see if there is a cached XrlPFSender for the given 
XRL. The lookup is done w/o arguments.

One glaring blot on the landscape beckons this question:
 * Are any processes sharing the segment that this 'static Xrl*' pointer 
happens to be in? The pointer looks like it should be in BSS and thus 
subject to copy-on-write, so this should not be an issue. However, if 
multiple entities in the same process are calling the C++ bindings, this 
COULD be a reentrancy issue.

If the Finder learns that an XRL target has gone away, it should blow 
away the FinderClient cache entries, and then the XrlRouter::send() 
method should notice this.

    Unfortunately, this may not help in the failure case we're 
examining. If this check is raced by the XRL being withdrawn and later 
re-advertised by its target (e.g. its host process got restarted), then 
obviously the cached XrlPFSender is going to be invalid in the XRL.

    It's not 100% impossible that this notification has been raced. If 
the code in your OSPF process which wants to send the XRL, is running 
from a timer callback, and this callback happens to collide with the 
FinderClient learning about the XRL target moving somewhere else in the 
system -- then where the XRL data is going to get sent, will be affected 
by which point in time it races the FinderClient cache update.

    In many ways, the fact that the problem exists, is an artefact of 
how method call resolution is working in the XRL RPC layer; it is 
per-method rather than per-service, and this is really one of the things 
I'm trying to address through the Thrift rewrite.

    What the code is trying to do, is to cache the transport pointer 
right next to the outgoing data. In principle this would be fine, were 
it not for the fact that the transport can go away for a variety of 
reasons. XrlPFSender has no knowledge of Xrl referencing it, and no 
meaningful way to convey the failure mode to Xrl. It's really 
XrlRouter's role to deal with this.

    In the situation above, even if we held a ref_ptr on an XrlPFSender, 
we wouldn't even know if the underlying network transport is still 
valid. The "right thing" to do would be to force the inner Xrl's cached 
resolved_sender pointer to be invalidated -- or validate the pointer 
upfront when it's used. Again, this is really XrlRouter's responsibility.

    It's possible for the Xrl's target to be known, and its XRL method 
resolved, but its destination transport still unresolved, which is what 
XrlRouter::get_sender() is trying to deal with.

    For what it's worth, class Xrl largely exists because XORP RPC calls 
can't be expressed as simple binary blobs. There are *two* RPC protocols 
running in tandem inside libxipc, and one of them is textual. To my 
mind, Xrl should be a more lightweight class than it actually is.

    Caching the transport pointer (XrlPFSender) in the Xrl itself is 
just asking for trouble in situations like this, given that we have no 
means of telling the Xrl 'your resolved sender has gone away' -- it's 
buried in libfooxif.so's copy-on-write BSS segment.

    It's a non-trivial issue to fix. Using the ref_ptr seems deceptively 
simple -- we get a handle on the transport, and so the code doesn't blow 
up, but we probably don't fix the underlying issue (unless I'm missing 
something).

    Something interesting to try might be to modify clnt-gen to do a few 
things in the client shims:
%%%
return _sender->send(*x, callback(....));
%%%
to become
%%%
bool retval = _sender->send(*x, callback(....));
x->set_resolved(false);
return retval;
%%%

    This, however, doesn't fix the root problem either - it just makes 
it possible to work around the issue without changing the allocation 
semantics for XrlPFSender, by deprecating one of the cache mechanisms in 
libxipc.
    The current cache mechanism is fubar, because it apparently can't 
deal with something in the XrlPFSender life cycle which causes it to be 
deleted.

In summary:
    I strongly believe that what you're actually seeing is a race which 
class Xrl is not able to defend itself against... because the 
responsibility for it belongs in class XrlRouter.

    It would be good to get a handle first on who introduced the 
secondary caching mechanism, and why. Most likely this was to avoid any 
STL container traversals when an XRL is actually being sent, but given 
that you've probably run into a race which blows this mechanism up, it 
needs revisiting. (Yes, this has fallen under the axe in the Thrift 
branch...!)

cheers,
BMS



More information about the Xorp-hackers mailing list