[Xorp-hackers] xorp error handling

Wed Aug 22 20:57:49 PDT 2012

On 08/22/2012 08:19 PM, Chan, Anthony wrote:
> Hi,
>
> We are running XORP with a custom routing protocol and can hit an error condition in which a SEND_FAILED error code is generated
> (“XrlPFSTCPSender died: Keepalive timeout”).  After this occurs, XORP is effectively non-operational, even though all of its processes are still present
> and in a running state.  Our platform is resource constrained, so this is fairly easy for us to reproduce once we inject enough routes.
>
> I believe what is happening is that the RIB process becomes too busy to acknowledge the IPC keepalive between it and the routing process.  However, the
> rtrmgr/Finder does not restart the RIB, because by then the RIB has become responsive again and has acknowledged the keepalive from the rtrmgr/Finder
> process.  Since the IPC between the routing process and the RIB is now down, and the rtrmgr/Finder cannot detect any process issue, nothing can be done to
> recover from this state.  Do you believe this is a possible scenario with the current XORP error handling?
>
> We are using 1.8.5 without setting the environment variable “XORP_SENDER_KEEPALIVE_TIME”, so we are using the default keepalive interval of 10 seconds.

Does the problem go away if you set the keep-alive higher?

You could also throttle the routes you are sending to the RIB to keep
from overworking it.
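
As a rough illustration of the throttling idea only: the sketch below sends
routes in small batches and pauses between batches, so the receiver gets a
chance to service its keepalives. This is not XORP code. The function name
send_throttled, the send_batch callback, and the batch_size and pause_s
values are all hypothetical and would need tuning for a real platform.

```python
import time
from typing import Callable, Iterable, List


def send_throttled(
    routes: Iterable,
    send_batch: Callable[[List], None],
    batch_size: int = 100,      # hypothetical value, tune per platform
    pause_s: float = 0.05,      # hypothetical value, tune per platform
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Send routes in fixed-size batches, pausing after each full batch.

    send_batch stands in for whatever call actually hands routes to the
    RIB; it is an assumption of this sketch, not a XORP API.
    Returns the number of routes sent.
    """
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")

    batch: List = []
    sent = 0
    for route in routes:
        batch.append(route)
        if len(batch) >= batch_size:
            send_batch(batch)
            sent += len(batch)
            batch = []          # start a fresh list; the sent one is not reused
            sleep(pause_s)      # give the receiver time to catch up
    if batch:
        # Flush the final partial batch; no pause is needed after it.
        send_batch(batch)
        sent += len(batch)
    return sent
```

With 250 routes and a batch size of 100, the callback is invoked three times
(100, 100, and 50 routes) with a pause after each of the first two batches.
A fixed pause is the simplest choice; pacing on acknowledgements from the
receiver would adapt better to load, but needs feedback this sketch does not
have.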

In general, restarting xorp processes never works right anyway, so if one dies (or times out),
you usually just have to restart xorp completely.

Thanks,
Ben

-- 
Ben Greear <greearb at candelatech.com>
Candela Technologies Inc  http://www.candelatech.com