[Xorp-hackers] XRL call serialization

Bruce Simpson bms at incunabulum.net
Thu Oct 29 12:09:16 PDT 2009


Hi Ben,

Time for more devil's advocate action.

Ben Greear wrote:
>
> It seems that the router-mgr *might* could read and queue several xrl
> requests, and then possibly answer them out of order.

Based on my recent footprints in XrlAction, that is very likely. But not 
because XRL is doing anything wrong.

One of the things I wanted to mention in my previous reply on this 
thread: if you keep calling different XRL methods, reentrancy in the 
client isn't a problem -- you can tell your own requests apart just 
fine, they're for different methods.

But: if we have multiple XRL calls in-flight for the *same* method, this 
breaks down. The dispatch of the callback ('here's the answer to my 
question') happens on a per-call basis, and so the only guarantee of 
in-order dispatch you get is the fact that the XRL transport is using a 
stream (TCP out of the box).
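To make the per-call correlation concrete, here is a hypothetical sketch (not the real XRL client API -- `CallTracker`, `send()` and `on_response()` are invented names): if each outgoing call carries a client-generated sequence number, replies to the *same* method can be told apart no matter what order they come back in.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <map>
#include <string>

// Hypothetical sketch: correlate in-flight calls to replies by a
// client-generated sequence number rather than by method name.
class CallTracker {
public:
    // Record an in-flight call; the returned id would go out on the wire.
    uint64_t send(const std::string& method,
                  std::function<void(const std::string&)> cb) {
        uint64_t seqno = _next_seqno++;
        _pending[seqno] = PendingCall{method, cb};
        return seqno;
    }

    // Dispatch a reply by sequence id, not by method name.
    bool on_response(uint64_t seqno, const std::string& answer) {
        std::map<uint64_t, PendingCall>::iterator it = _pending.find(seqno);
        if (it == _pending.end())
            return false;   // unknown or already-answered call
        it->second.callback(answer);
        _pending.erase(it);
        return true;
    }

    std::size_t in_flight() const { return _pending.size(); }

private:
    struct PendingCall {
        std::string method;
        std::function<void(const std::string&)> callback;
    };
    uint64_t _next_seqno = 1;
    std::map<uint64_t, PendingCall> _pending;
};
```

With this in place the transport is free to deliver answers in any order; the client still pairs each answer with the right question.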

If you mix possibly co-dependent operations and fire them off, problems 
may happen. [Although the XRLs in these scenarios aren't being batched.]

This is why the Router Manager is pretty tight about its timings; keeping 
the XRL actions tied down to particular commit steps is pretty critical to 
making sure things don't go out of control.

    Again, it might be worth revisiting Pavlin's original idea, that we 
teach the routing processes to keep their own snapshots of state and 
implement commit/rollback there. The more I stare at Thrift and XRL, the 
more I believe that's a good idea. It simplifies the Router Manager 
interface with the other processes.

    Although as you point out, we still need to keep those snapshots 
around in the Router Manager so that the process can restart OK -- 
either that or we give processes some abstract form of non-volatile 
storage we can easily propagate back to the management point at the 
point of commit.

However, you're quite right -- I see no reason why you can't introduce 
funk into the system from the Router Manager, the same way that olsr's 
register_rib() method might.

Consider this scenario -- let's imagine that xorp_olsr has crashed. It 
left a whole bunch of OLSR routes in the RIB. It is using a non-default 
admin distance. For whatever reason, this was configured on-the-fly, and 
was an uncommitted change. That process is restarted.

Along comes the existing register_rib() function. Let's assume the 
set-admin-distance step modifies the old origin table from the previous 
incarnation of xorp_olsr. Let's also assume that there is a 
redistribution policy in effect for OSPF, which is redistributing routes 
above a given admin distance on another interface to an OSPF backbone area.

You can see how that gets really interesting. As soon as the call to 
change the admin distance has fired, the routes will be rewritten to 
contain the new admin distance, the RIB will redistribute the routes 
(via policy) to xorp_ospf, and we've got a fair amount of system 
activity going on, just due to a process restart.

Fortunately, the RIB method to set the admin distance does not rewrite 
existing routes at the moment, and that was deliberately left unfinished 
(although not for this reason). So this scenario, whilst it's been 
elaborated on somewhat, isn't possible just now with the mechanisms I've 
described.

But it does point towards the need either to have a configurable policy 
for method disposition, or strong guarantees about the RPC layer 
behaving in-order.

You end up having to rely on a reliable network transport. You can 
assume that the XRL request you just got is to be executed right away, 
but only insofar as the transport you read from has not re-ordered 
anything in transit.

Reliability doesn't imply in-order delivery to the user process. If you 
receive XRL requests out of order, you'll need to buffer them. If your 
transport isn't reliable, you have no way of knowing that you won't get 
an earlier message -- without implementing the concept of a time-out; 
i.e. if the server doesn't see the out-of-order message within N time 
units, it times the request out and sends you a NACK, to stop blocking 
all other access to the resource. [Sounds like kernel driver locking to me...]
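The buffer-and-time-out scheme above can be sketched as follows. This is a hypothetical illustration, not anything in libxipc -- `ReorderBuffer`, `receive()` and `tick()` are invented names, and "ticks" stand in for whatever clock the server uses. Messages arriving ahead of a hole are held; if the hole is not filled within the time-out, the missing sequence number is abandoned (the point at which a real server would send the NACK).

```cpp
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Hypothetical sketch: deliver sequenced messages in order, holding
// out-of-order arrivals; give up on a missing message after a time-out.
class ReorderBuffer {
public:
    explicit ReorderBuffer(uint64_t timeout_ticks) : _timeout(timeout_ticks) {}

    // Accept message `seq` at time `now`; returns whatever is now
    // deliverable, in sequence order.
    std::vector<std::string> receive(uint64_t seq, const std::string& msg,
                                     uint64_t now) {
        if (seq >= _next)
            _held.emplace(seq, msg);   // old duplicates are dropped
        return drain(now);
    }

    // Poll the clock: skip the missing seqno once we've waited too long.
    std::vector<std::string> tick(uint64_t now) {
        if (_gap_open && now - _gap_since >= _timeout) {
            ++_next;            // NACK point: abandon the lost message
            _gap_open = false;
        }
        return drain(now);
    }

private:
    std::vector<std::string> drain(uint64_t now) {
        std::vector<std::string> out;
        while (_held.count(_next)) {
            out.push_back(_held[_next]);
            _held.erase(_next++);
            _gap_open = false;
        }
        if (!_held.empty() && !_gap_open) {
            _gap_open = true;   // something newer is waiting on a hole
            _gap_since = now;
        }
        return out;
    }

    uint64_t _timeout;
    uint64_t _next = 1;        // next sequence number owed to the user
    bool _gap_open = false;
    uint64_t _gap_since = 0;
    std::map<uint64_t, std::string> _held;
};
```

Note what the time-out costs you: after it fires, the abandoned message can never be delivered, so the application has to cope with the loss -- which is exactly the complexity TCP hides from us today.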

Up until now, we have relied on TCP to do all of this for us behind the 
scenes. The price we pay for that is some inefficiency in the 
implementation: head-of-line blocking, and being unable to preserve RPC 
method boundaries.

(This is why the AMQP guys have the hots for SCTP, but the SCTP guys 
can't do much about pushing the model forward until Microsoft sit up and 
take notice -- no-one's shipping SCTP as a Windows 7 NDIS/TDI driver, as 
far as I know.)

You can see why stuff like TIPC happens. But I seriously disagree with 
their approach. Pushing all asynchrony into the kernel isn't the answer, 
and it limits your client uptake -- Linux is not the only game in town, 
and there are very good reasons for that which I won't go into here. 
Just using the existing Berkeley Sockets API is cute, but far from 
perfect -- it has holes of its own.

Also, they never really tackled the cross-language interop issue the way 
Thrift has.

...


So I guess it boils down to: caveat implementor. If you use XRL, don't 
rely on call serialization from the API. If you need to cross the road 
after pushing the button, do so. Otherwise, you might end up in a traffic 
accident. :-)
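If you do need strict ordering, the safe pattern is to enforce it yourself on the client side: queue the calls and only fire the next one once the previous answer has come back. A hypothetical sketch (again, invented names -- `SerialCaller` is not part of the XRL API; `fire` stands for whatever actually sends the RPC):

```cpp
#include <deque>
#include <functional>
#include <vector>

// Hypothetical sketch: serialize RPC calls by holding each one until the
// previous call's reply callback has been delivered.
class SerialCaller {
public:
    // `fire` sends the RPC; its completion must eventually invoke
    // on_reply() from the transport's callback.
    void call(std::function<void()> fire) {
        _queue.push_back(fire);
        if (!_busy)
            fire_next();
    }

    // Invoke when the in-flight call completes; releases the next call.
    void on_reply() {
        _busy = false;
        fire_next();
    }

    bool busy() const { return _busy; }

private:
    void fire_next() {
        if (_queue.empty())
            return;
        _busy = true;
        std::function<void()> f = _queue.front();
        _queue.pop_front();
        f();   // hand the call to the transport
    }

    bool _busy = false;
    std::deque<std::function<void()>> _queue;
};
```

The cost is obvious -- you give up pipelining for the calls you route through the queue -- but for co-dependent operations like the admin-distance scenario above, that is exactly the trade you want.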

>   (Been a few
> days since I poked at the router mgr code, not sure I fully understood
> it when I did).

There is a lot going on in there.

XRLs should be dispatched in the order in which they are received. 
However, there are actually no guarantees for this behaviour -- it is 
'best effort'.

    When an XRL call is received, for example, STCPRequestHandler will 
attempt to dispatch it immediately, in line with further reception.
    XRL targets are internally synchronous. The method call dispatch 
happens in the context of XrlRouter's event I/O callbacks, which are 
registered with the outer EventLoop.

So from the server's point of view, XRL is pretty much synchronous. But, 
even on the same host, that dispatch could happen on another CPU. [As 
I've probably mentioned elsewhere, most of XORP's inter-process sync in 
the time domain is actually pinned on the host's socket buffer locks.]

The uncertainty in the whole system, where time and call dispatch are 
concerned, is however localized:
 * When/how did that XRL get fired off?
 * How are my socket buffers?
 * How many cores do I have?
 * How's my scheduling?

    Just out of interest, I will reveal that, as of this week, I have 
written most of the code generator needed to shim XRL calls directly 
into Thrift ones.

    This is so that adopting Thrift does not mean a dragnet across all 
400+ KLOCs of XORP, but should make it a mostly drop-in replacement for 
XRL in the existing code.

    I have yet to write most of the new libxipc, though, which is why 
I'm feeling that space out just now, and being pretty conservative in 
what I'm disclosing (people have got a whiff of what I'm doing; 
knowledge breeds expectations; expectations pump up the volume).

    Thrift's C++ RPC libraries are actually pretty well written. They make 
it possible to pull off a few tricks for making the method calls a bit more 
scalable, and for providing guarantees about call serialization in a 
scalable system.

    However, making that work requires some additional movement. As you 
can see, there are a few assumptions about how the whole system actually 
behaves, which are incorrect in some places.

cheers,
BMS


