[Xorp-users] questions about xorp

Wed, 29 Mar 2006 19:11:25 -0800

> I have a firewall cluster using xorp for multicast. I have an script that
> get up xorp processses in the active node. Really I have only run xorp in
> the active node. When there is a failover my script run the xorp in the new
> active node and kill xorp in the passive one. This is the xorp processes
> that I start:
> 
>  1308 ?        S      0:01 /opt/bladefusion/xorp/bin/xorp_rtrmgr
>  1310 ?        S      0:49 /opt/bladefusion/xorp/fea/xorp_fea
>  1381 ?        S      0:00 /opt/bladefusion/xorp/rib/xorp_rib
>  1399 ?        S      0:00 /opt/bladefusion/xorp/fib2mrib/xorp_fib2mrib
>  1417 ?        S      0:28 /opt/bladefusion/xorp/mld6igmp/xorp_igmp
>  1435 ?        S      0:05 /opt/bladefusion/xorp/pim/xorp_pimsm4
> 
> My question is: Is necessary use xorp_fib2mrib in my system? I haven´t
> to sync xorp states between nodes because I have only one node running
> xorp .

For all practical purposes, the answer is "yes".
The xorp_fib2mrib process is used to obtain the unicast forwarding state
from the kernel (via the FEA) and push it into the Multicast RIB
(which is needed by PIM-SM for the reverse-path forwarding check).
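
As a rough illustration of why PIM-SM needs those routes, here is a
minimal Python sketch of a reverse-path forwarding check against a
table of unicast prefixes. This is not XORP code: the names rpf_lookup,
rpf_check and the example table are hypothetical, and the table is just
a dict standing in for the Multicast RIB.

```python
import ipaddress


def rpf_lookup(mrib, source):
    """Return (network, interface) for the longest prefix covering source.

    mrib is a hypothetical stand-in for the Multicast RIB: a dict that
    maps prefix strings to the interface used to reach that prefix.
    Returns None when no prefix covers the source address.
    """
    addr = ipaddress.ip_address(source)
    best = None
    for prefix, iface in mrib.items():
        net = ipaddress.ip_network(prefix)
        if addr in net and (best is None or net.prefixlen > best[0].prefixlen):
            best = (net, iface)
    return best


def rpf_check(mrib, source, arrival_iface):
    """Accept a packet only if it arrived on the interface toward its source."""
    match = rpf_lookup(mrib, source)
    return match is not None and match[1] == arrival_iface


# Hypothetical table contents, for illustration only.
example_mrib = {"10.0.0.0/8": "eth0", "10.1.0.0/16": "eth1"}
```

With an empty table (roughly the situation if nothing populates the
Multicast RIB), rpf_check() rejects every source, which is why the
answer above is "yes".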

> My other question is about an error in my /var/log/messages, when my xorp
> died
> 
> Mar 28 02:01:41 fw1bjscpd BF-PIM: [ 2006/03/28 02:01:41 ERROR
> xorp_rtrmgr:22712 XRL +629 xrl_pf_stcp.cc die ] XrlPFSTCPSender died:
> Keepalive timeout

This error (like all the other errors) indicates a problem.
The message shows an XRL communication problem: the keepalive to
some of the other XORP processes has timed out.
All other errors are probably a direct or indirect result of those
XRL timeouts.

Do you see those errors during the switchover, or well after the
switchover was completed?

If they happen during the switchover then something has gone wrong.
E.g., if you are starting a new instance of XORP, you must first
make sure that all processes from the old instance have been killed.
Otherwise, they may interfere with the new XORP instance.
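
One way a failover script could verify this is sketched below in
Python. It is only an illustration: the function name and the set of
binary names are assumptions taken from the process listing quoted
above, and the parser assumes the same "PID TTY STAT TIME COMMAND"
layout as that listing.

```python
# Binary names taken from the process listing quoted earlier in this
# message; adjust the set if a different group of XORP modules is started.
XORP_BINARIES = {
    "xorp_rtrmgr", "xorp_fea", "xorp_rib",
    "xorp_fib2mrib", "xorp_igmp", "xorp_pimsm4",
}


def leftover_xorp_processes(ps_output):
    """Return [(pid, name), ...] for XORP processes found in ps output.

    Expects lines of the form "PID TTY STAT TIME COMMAND"; header lines
    and anything that does not start with a numeric PID are skipped.
    """
    leftovers = []
    for line in ps_output.splitlines():
        fields = line.split()
        if len(fields) < 5 or not fields[0].isdigit():
            continue
        name = fields[4].rsplit("/", 1)[-1]  # strip the directory part
        if name in XORP_BINARIES:
            leftovers.append((int(fields[0]), name))
    return leftovers


sample = """\
  PID TTY      STAT   TIME COMMAND
 1308 ?        S      0:01 /opt/bladefusion/xorp/bin/xorp_rtrmgr
 1310 ?        S      0:49 /opt/bladefusion/xorp/fea/xorp_fea
 2001 ?        S      0:00 /usr/sbin/sshd
"""
```

The script would feed this function the output of "ps ax" after killing
the old instance, and start the new xorp_rtrmgr only once the returned
list is empty.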

If you start seeing the errors well after the switchover has
completed, then the following log message might be a suspect:

> Mar 28 02:02:13 fw1bjscpd ntpdate[24855]: adjust time server 55.1.1.8
> offset -0.054280 sec

In general, adjusting the time backwards doesn't play well with
XORP, so the above adjustment _may_ have something to do with the XRL
keepalive timeout. The easiest way to test this is to turn off NTP
temporarily and see whether you still get keepalive timeouts.
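
To see how the two kinds of message line up in /var/log/messages, a
small Python sketch like the following could help. It is a hypothetical
helper, not part of XORP: it assumes each log entry is on a single line,
that month names are in English, and that the year (which syslog omits)
is passed in by the caller.

```python
import re
from datetime import datetime

# Leading syslog timestamp, e.g. "Mar 28 02:01:41".
SYSLOG_TS = re.compile(r"^(\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s")
# ntpdate report with a signed offset in seconds.
NTP_OFFSET = re.compile(r"ntpdate\[\d+\]: .*offset (-?\d+\.\d+) sec")


def extract_events(lines, year=2006):
    """Return time-ordered (timestamp, kind, offset) tuples.

    kind is "keepalive_timeout" (offset is None) or "clock_back"
    (offset is the negative ntpdate offset in seconds).
    """
    events = []
    for line in lines:
        m = SYSLOG_TS.match(line)
        if not m:
            continue
        ts = datetime.strptime(f"{year} {m.group(1)}", "%Y %b %d %H:%M:%S")
        if "Keepalive timeout" in line:
            events.append((ts, "keepalive_timeout", None))
            continue
        n = NTP_OFFSET.search(line)
        if n and float(n.group(1)) < 0:
            events.append((ts, "clock_back", float(n.group(1))))
    return sorted(events, key=lambda e: e[0])


# The two log entries quoted in this message, joined onto single lines.
sample_log = [
    "Mar 28 02:01:41 fw1bjscpd BF-PIM: [ 2006/03/28 02:01:41 ERROR "
    "xorp_rtrmgr:22712 XRL +629 xrl_pf_stcp.cc die ] "
    "XrlPFSTCPSender died: Keepalive timeout",
    "Mar 28 02:02:13 fw1bjscpd ntpdate[24855]: adjust time server "
    "55.1.1.8 offset -0.054280 sec",
]
```

Running extract_events() over the whole log shows whether keepalive
timeouts tend to appear close to backward clock adjustments; turning
NTP off temporarily remains the more direct test.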

Pavlin