[Xorp-users] Problems with Linux kernel and OSPF ???

Aidan Walton awalton at wires3.net
Sat Dec 8 11:04:38 PST 2007


Whoops, it's done it again. Here are both ends of the link, and attached are
the two replay files. Again, neither side had an interface failure. It's very
strange that this now seems to occur every day or so, just as I started
this conversation with you.

I agree that we need to get to the bottom of this but I also need a
stable network. Do you think there is anything significant in the
changes to ospf that may be related?

root@woodside-relay> show ospf4 neighbor detail
  Address         Interface             State      ID              Pri  Dead
89.248.141.193   ath0/ath0              Full      89.248.141.223   128    38
  Area 0.0.0.0, opt 0x2, DR 89.248.141.194, BDR 89.248.141.193
  Up 103:53:13, adjacent 103:49:09
89.248.141.198   ath1/ath1              Full      89.248.141.221   128    39
  Area 0.0.0.0, opt 0x2, DR 89.248.141.198, BDR 0.0.0.0
  Up 103:53:11, adjacent 00:04:30
root@woodside-relay> show ospf4 database detail
   OSPF link state database, Area 0.0.0.0
Router-LSA:
LS age  285 Options  0x2 DC: 0 EA: 0 N/P: 0 MC: 0 E: 1 LS type 0x1 Link State ID 89.248.141.222 Advertising Router 89.248.141.222 LS sequence number 0x8000098b LS checksum 0x3f92 length 60
        bit Nt false
        bit V false
        bit E false
        bit B false
        Type 2 Transit network IP address of Designated router 89.248.141.194 Routers interface address 89.248.141.194 Metric 1
        Type 3 Stub network Subnet number 89.248.141.222 Mask 255.255.255.255 Metric 1
        Type 2 Transit network IP address of Designated router 89.248.141.197 Routers interface address 89.248.141.197 Metric 1
Router-LSA:
LS age 1153 Options  0x2 DC: 0 EA: 0 N/P: 0 MC: 0 E: 1 LS type 0x1 Link State ID 89.248.141.223 Advertising Router 89.248.141.223 LS sequence number 0x80001059 LS checksum 0x76dd length 48
        bit Nt false
        bit V false
        bit E true
        bit B false
        Type 2 Transit network IP address of Designated router 89.248.141.194 Routers interface address 89.248.141.193 Metric 1
        Type 3 Stub network Subnet number 89.248.141.223 Mask 255.255.255.255 Metric 1
Router-LSA:
LS age  286 Options  0x2 DC: 0 EA: 0 N/P: 0 MC: 0 E: 1 LS type 0x1 Link State ID 89.248.141.221 Advertising Router 89.248.141.221 LS sequence number 0x800008a2 LS checksum 0x5eb1 length 48
        bit Nt false
        bit V false
        bit E true
        bit B false
        Type 2 Transit network IP address of Designated router 89.248.141.198 Routers interface address 89.248.141.198 Metric 1
        Type 3 Stub network Subnet number 89.248.141.221 Mask 255.255.255.255 Metric 1
Network-LSA:
LS age 1163 Options  0x2 DC: 0 EA: 0 N/P: 0 MC: 0 E: 1 LS type 0x2 Link State ID 89.248.141.194 Advertising Router 89.248.141.222 LS sequence number 0x800000d0 LS checksum 0xbeee length 32
        Network Mask 0xfffffffc
        Attached Router 89.248.141.222
        Attached Router 89.248.141.223
Network-LSA:
LS age  286 Options  0x2 DC: 0 EA: 0 N/P: 0 MC: 0 E: 1 LS type 0x2 Link State ID 89.248.141.198 Advertising Router 89.248.141.221 LS sequence number 0x80000001 LS checksum 0x2458 length 32
        Network Mask 0xfffffffc
        Attached Router 89.248.141.221
        Attached Router 89.248.141.222
As-External-LSA:
LS age 1163 Options  0x2 DC: 0 EA: 0 N/P: 0 MC: 0 E: 1 LS type 0x5 Link State ID 0.0.0.0 Advertising Router 89.248.141.223 LS sequence number 0x80001016 LS checksum 0xd350 length 36
        Network Mask 0
        bit E true
        Metric 10 0xa
        Forwarding address 89.248.141.223
        External Route Tag 0
As-External-LSA:
LS age  705 Options  0x2 DC: 0 EA: 0 N/P: 0 MC: 0 E: 1 LS type 0x5 Link State ID 89.248.141.224 Advertising Router 89.248.141.221 LS sequence number 0x800000cf LS checksum 0x4e98 length 36
        Network Mask 0xffffffe0
        bit E true
        Metric 0 0
        Forwarding address 89.248.141.221
        External Route Tag 0
As-External-LSA:
LS age  705 Options  0x2 DC: 0 EA: 0 N/P: 0 MC: 0 E: 1 LS type 0x5 Link State ID 89.248.142.128 Advertising Router 89.248.141.221 LS sequence number 0x800000cf LS checksum 0xfb28 length 36
        Network Mask 0xfffffff8
        bit E true
        Metric 10 0xa
        Forwarding address 89.248.141.221
        External Route Tag 0
Network-LSA:
LS age  286 Options  0x2 DC: 0 EA: 0 N/P: 0 MC: 0 E: 1 LS type 0x2 Link State ID 89.248.141.197 Advertising Router 89.248.141.222 LS sequence number 0x80000093 LS checksum 0xfeea length 32
        Network Mask 0xfffffffc
        Attached Router 89.248.141.222
        Attached Router 89.248.141.221

root@lodgefarm> show ospf4 neighbor detail
  Address         Interface             State      ID              Pri  Dead
89.248.141.197   ath0/ath0              Full      89.248.141.222   128    33
  Area 0.0.0.0, opt 0x2, DR 89.248.141.197, BDR 0.0.0.0
  Up 25:26:07, adjacent 00:09:17
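
A quick way to check a dump like the ones above for the symptom discussed
in this thread (more than one Network-LSA for the same subnet) is to AND
each Link State ID with its Network Mask and group by the result. What
follows is only a sketch in Python, not part of XORP: the tuples are copied
by hand from the woodside-relay output, and the function names are made up
for illustration.

```python
import ipaddress
from collections import defaultdict

# (Link State ID, Advertising Router, Network Mask), copied by hand from the
# Network-LSA entries in the woodside-relay "show ospf4 database detail" output.
NETWORK_LSAS = [
    ("89.248.141.194", "89.248.141.222", 0xFFFFFFFC),
    ("89.248.141.198", "89.248.141.221", 0xFFFFFFFC),
    ("89.248.141.197", "89.248.141.222", 0xFFFFFFFC),
]


def subnet_of(link_state_id, mask):
    """Return the subnet a Network-LSA describes (Link State ID AND mask)."""
    addr = int(ipaddress.IPv4Address(link_state_id))
    prefix_len = bin(mask).count("1")
    return ipaddress.IPv4Network((addr & mask, prefix_len))


def duplicate_network_lsas(lsas):
    """Map subnet -> advertising routers, keeping only subnets that have
    more than one Network-LSA."""
    by_subnet = defaultdict(list)
    for link_state_id, adv_router, mask in lsas:
        by_subnet[subnet_of(link_state_id, mask)].append(adv_router)
    return {str(net): routers
            for net, routers in by_subnet.items() if len(routers) > 1}


if __name__ == "__main__":
    for net, routers in duplicate_network_lsas(NETWORK_LSAS).items():
        print(net, "has Network-LSAs from", ", ".join(routers))
```

Run against the values above, it should report 89.248.141.196/30 with
Network-LSAs from both 89.248.141.221 and 89.248.141.222.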



On Fri, 2007-12-07 at 15:30 -0800, Atanu Ghosh wrote:
> Hi,
> 
> I did think about suggesting that you switch to point-to-point, but
> then we would be ignoring the problem. XORP's OSPF requires that the
> neighbours be explicitly configured when point-to-point is selected;
> this may not have been obvious when you tried it before.
> 
> Decreasing the hello interval causes more hello packets to be sent
> within the router-dead interval, so more packets must be lost
> before an adjacency is lost.
> 
> The only module that is sensitive to operating system differences is the
> FEA (Forwarding Engine Abstraction). For example all the protocols send
> and receive packets through the FEA, so the vagaries of how to send and
> receive raw IP packets are dealt with by the FEA among other things. It
> is therefore certainly safe to build OSPF on any of your hosts running
> the same operating system. It should also be safe to build the FEA on
> any of your hosts that are running the same OS with a similar version
> number.
> 
> The router manager, the process that controls all the XORP processes,
> reads template files on startup that define how a process can be
> configured. These files are in the template directory and have the .tp
> extension; the OSPFv2 template file is ospfv2.tp. In general, if you
> update a binary, its matching template file should also be installed. In
> this case ospfv2.tp has not changed since the last release. The
> template directory also contains files with the extension .cmds, which
> define the operational commands that are available for a protocol. The
> file ospfv2.cmds has changed, as the "clear ospf4 database" command has
> been added. You will need to update this file in the templates
> directory. You should also update the operational command programs
> themselves: print_neighbours, print_lsas and clear_database.
> 
> In answer to your question it should be safe to just update the OSPF
> components of the system. We hope to one day support the updating of
> protocols without requiring a restart. 
> 
> 	   Atanu.
> 
> >>>>> "Aidan" == Aidan Walton <awalton at wires3.net> writes:
> 
>     Aidan> Hi, This makes sense, I did not see the net LSA from .221. To
>     Aidan> be honest even if I had I'm pretty sure I would not have
>     Aidan> thought anything was wrong. It has been a long time since I
>     Aidan> studied OSPF's workings. I suppose I have ensured myself a
>     Aidan> certain re-baptism of fire. If I'm not mistaken, I already
>     Aidan> tried to configure these links as point-to-point
>     Aidan> connections: 1. to avoid the DR election process, 2. to save
>     Aidan> on IP addresses.  For some reason, I forget why, it did not
>     Aidan> seem to work. This would be the best approach to stability, I
>     Aidan> would have thought, as we avoid DR election altogether, but as
>     Aidan> you say, if I lengthen the dead-interval it may ride out whatever
>     Aidan> caused the adjacency to drop. The trouble is I don't think
>     Aidan> anything on the physical side caused the adjacency to drop in
>     Aidan> the first place, so just increasing the dead time may well
>     Aidan> not achieve stability.  Why decrease the hello interval?
> 
>     Aidan> Regarding the new code: this may be a stupid question, but
>     Aidan> here goes. As you know, these routers are live, and as I am the
>     Aidan> world's worst admin, I have different kernels on different
>     Aidan> routers. If I remember correctly, the xorp code I'm running on
>     Aidan> the different routers was compiled off-line on a different
>     Aidan> machine, so they are running code that was not compiled
>     Aidan> against the headers of the kernel actually running. Xorp does
>     Aidan> not produce kernel modules, so will this ever matter? How far
>     Aidan> can I push this?
> 
>     Aidan> The important question: is each binary produced at compile
>     Aidan> time stand-alone? I.e. if I compile the CVS sources off-line
>     Aidan> and then take just the ospfv2 binary, will this work with the
>     Aidan> other binaries that are already running on the live network?
>     Aidan> Can I just replace the ospf module, or are there dependencies
>     Aidan> in the other binaries that will cause this approach to break
>     Aidan> things? The modular approach would of course be nice!
> 
>     Aidan> Thanks for the help so far, and I'm working on the logging
>     Aidan> issue. I had this working in the past and now it doesn't.
>     Aidan> Hmmm, strange?
> 
>     Aidan> Thanks again Aidan
> 
> 
>     Aidan> On Thu, 2007-12-06 at 20:13 -0800, Atanu Ghosh wrote:
>     >> Hi,
>     >> 
>     >> The problem that I see is that both routers (89.248.141.221
>     >> and 89.248.141.222) are announcing a Network-LSA for the subnet
>     >> 89.248.141.196/30. There should only be one Network-LSA per
>     >> subnet. An unfortunate side effect of this behaviour is that, from
>     >> the perspective of the LSA database, routers 89.248.141.221 and
>     >> 89.248.141.222 are not connected, hence no routes.
>     >> 
>     >> Only the designated router (DR) for a subnet should be announcing
>     >> a Network-LSA. From the output of "show ospf4 neighbor detail",
>     >> router 89.248.141.221 doesn't consider itself to be the DR
>     >> (router 89.248.141.222 is the DR, 89.248.141.197 is its interface
>     >> address).
>     >> 
>     >> From the sequence number and age of the Network-LSA generated by
>     >> router 89.248.141.221 we can see that it was initially announced
>     >> 6 mins 48 secs ago, which is similar to the 8 mins 32 secs the
>     >> adjacency has existed.
>     >> 
>     >> Router 89.248.141.221 is announcing a Network-LSA even though it
>     >> is not the DR. The odd part is that the Network-LSA was initially
>     >> announced after the adjacency was formed and a DR had already
>     >> been selected.
>     >> 
>     >> Could you try running the latest code from CVS? Some DR election
>     >> issues have been fixed since the last release.
>     >> 
>     >> My guess would be that the problem only occurs after a loss of
>     >> adjacency. The default settings for hello-interval and
>     >> router-dead-interval are 10 and 40 seconds respectively. You
>     >> could try decreasing the hello-interval and increasing
>     >> router-dead-interval until we get to the bottom of this.
>     >> 
>     >> Next time you see a problem:
>     >> 1) $ print_lsas -S save.lsas
>     >> 2) show ospf4 neighbor detail (on the router in question and the
>     >>    neighbour)
>     >> Just to confirm the problem.
>     >> 
>     >> I will try and reproduce the problem.
>     >> 
>     >> Atanu.
>     >> 
>     >> OSPF link state database, Area 0.0.0.0
>     >> Type     ID                Adv Rtr         Seq         Age  Opt  Cksum   Len
>     >> Network   89.248.141.197   89.248.141.222  0x8000003b  409  0x2  0xaf92  32
>     >> Network  *89.248.141.198   89.248.141.221  0x80000001  409  0x2  0x2458  32
>     >> 
>     >> >>>>> "Aidan" == Aidan Walton <awalton at wires3.net> writes:
>     >> 
>     Aidan> Hi Atanu, now that I'm over the panic of collecting the data
>     Aidan> and recovering the network, I can see that the neighbour
>     Aidan> seems to have been up for 29hrs but adjacent for only
>     Aidan> 8mins. Clearly the adjacency has been dropped and even though
>     Aidan> it has recovered it no longer imports the routes from the
>     Aidan> database. Unfortunately because I do not have the logs, as
>     Aidan> explained in the last email, I have no trace of the failure
>     Aidan> :( Note the tcpdump of the hellos in both directions. I
>     Aidan> thought if the adjacency failed a database update would be
>     Aidan> forced when it is re-established.
>     >>
>     Aidan> Oh BTW I checked my logs on the interfaces and the physicals
>     Aidan> have been up all the time.
>     >>
>     Aidan> What's up with ospf?  Aidan
>     >>
>     Aidan> On Wed, 2007-12-05 at 01:00 -0800, Atanu Ghosh wrote:
>     >> >> Hi,
>     >> >> 
>     >> >> The output that it would be good to see before and after the
>     >> >> problem occurs:
>     >> >> 1) $ netstat -nr
>     >> >> 2) Xorp> show interfaces
>     >> >> 3) Xorp> show route table ipv4 unicast final
>     >> >> 4) Xorp> show ospf4 neighbor detail
>     >> >> 5) Xorp> show ospf4 database detail
>     >> >> 6) $ print_lsas -S save.lsas
>     >> >> The print_lsas program can be found in the ospf/tools directory.
>     >> >> The program stores the LSA database in a form that can be
>     >> >> replayed.
>     >> >> 
>     >> >> You can also enable tracing in ospf:
>     >> >> traceoptions { flag { all { disable: false } } }
>     >> >> 
>     >> >> Which should show routes being added and deleted.
>     >> >> 
>     >> >> The latest code in CVS has a "clear ospf4 database" command;
>     >> >> it would be interesting to know whether this solves the problem
>     >> >> once it occurs.
>     >> >> 
>     >> >> It might also be interesting to keep the "ip mon" command
>     >> >> running to track routes being added and deleted.
>     >> >> 
>     >> >> Would it be possible at some off-peak time to flap the ADSL
>     >> >> link to see if this replicates the problem? I know that you
>     >> >> have stated that there were no ADSL issues when the problem
>     >> >> occurred, but I do wonder if we are seeing some issue related
>     >> >> to dynamic interfaces.
>     >> >> 
>     >> >> Atanu.
>     >> >> 
>     >> >> >>>>> "Aidan" == Aidan Walton <awalton at wires3.net> writes:
>     >> >> 
>     Aidan> Hi, The adjacency runs over a wireless link between the
>     Aidan> routers. It can, very possibly, drop in and out, but as far
>     Aidan> as I can see this did not happen and to be honest in the 9
>     Aidan> months I have had this system up I have never seen the
>     Aidan> wireless link drop, but packet corruption could be a
>     Aidan> possibility and this may be less easy to diagnose. It is a
>     Aidan> high power 5.8GHz connection, here in the UK this is a
>     Aidan> licensed band (and yes I have a license). So I don't think
>     Aidan> interference is the likely cause, though I wouldn't rule this
>     Aidan> out. If I look at the logs from the same period I see
>     Aidan> nothing to indicate the interface flapped; I would see the
>     Aidan> wireless dis-associate and re-associate and cypher exchange
>     Aidan> and this did not happen. But as I say there could be a period
>     Aidan> of high BER on the links. I thought ospf would handle this
>     Aidan> reasonably gracefully? I have to say heavy BER was not
>     Aidan> evident when I came to repair the network, or at least I
>     Aidan> didn't detect it and in the past I have run ospf over another
>     Aidan> one of my wireless links with stations 10km apart with the
>     Aidan> wireless link almost non-functional, dropping packets left
>     Aidan> right and centre and re-associating over and over, but xorp's
>     Aidan> ospf never complained!  I was beginning to suspect that this
>     Aidan> was related to my adsl link on the suspect router, as this is
>     Aidan> a dynamic interface and I have this defined independently of
>     Aidan> xorp. If this interface flaps then the default route
>     Aidan> associated with the adsl ppp session is withdrawn. The
>     Aidan> default from the adsl line is not propagated into ospf
>     Aidan> though, instead I use a static default with a higher metric
>     Aidan> pointed at the loopback and inject this into ospf
>     Aidan> instead. Then the flaps of the adsl line do not cause churn
>     Aidan> in the ospf domain. I was starting to think that the addition
>     Aidan> and removal of the default from the adsl line was affecting
>     Aidan> the kernel table and this was upsetting xorp's ospf. However
>     Aidan> this morning when this happened the adsl line was stable. As
>     Aidan> far as my logs look it suddenly decided to stop functioning
>     Aidan> with no correlated events from other system processes. The
>     Aidan> only thing in the logs at the same time is iptables dropping
>     Aidan> DOS attacks, but this is normal, unfortunately far too normal.
>     Aidan> show ospf4 neighbour simply stated 'full'; there is only one
>     Aidan> neighbour defined on this router. I didn't look this time at
>     Aidan> show interfaces, but from memory of the last time this
>     Aidan> happened this also was normal.  The problem is that these
>     Aidan> routers are mounted 10m high up telegraph poles. If I lose
>     Aidan> connectivity it requires a ladder and a climbing harness to
>     Aidan> get at them, this is not to mention my upset customers who,
>     Aidan> as is normal with customers, do not delay in telling me they
>     Aidan> have lost their Internet links.  I suppose what I'm trying to
>     Aidan> understand is how to be best prepared for next time, logging,
>     Aidan> processes and checks during the failure period to grab as
>     Aidan> much useful info before I am forced to restart xorp and get
>     Aidan> my customers up and running again. This is a very short
>     Aidan> period I have to say. I have a small group of business units
>     Aidan> supported on this router and all hell breaks loose if this
>     Aidan> happens during working hours.  How can I get the maximum
>     Aidan> logging info from the xorp processes?  Anything I can do in
>     Aidan> order that you can help me, will be dutifully carried
>     Aidan> out. What next, any suggestions?  Thanks Aidan I will
>     >> >>
>     Aidan> On Tue, 2007-12-04 at 12:19 -0800, Atanu Ghosh wrote:
>     >> >>
>     Atanu> Hi,
>     >> >>
>     Atanu> The scenario that you describe would be perfectly normal if
>     Atanu> the connectivity between the "suspect" router and the
>     Atanu> "adjacent" router is lost. Although I would expect the "show
>     Atanu> ospf4 neighbor" to show the state of the adjacency to be
>     Atanu> "Down" not "Full". When an OSPF router loses its adjacencies
>     Atanu> the LSA database will slowly timeout, however, the routes
>     Atanu> will be withdrawn as soon as the adjacencies are lost.
>     >> >>
>     Atanu> We will require more information to diagnose the problem. Next
>     Atanu> time the problem occurs, the output of "show interfaces" and
>     Atanu> "show ospf4 neighbor" would be very useful.
>     >> >>
>     Atanu> XORP tracks the state of interfaces, in particular the carrier
>     Atanu> state. If OSPF believes that the Ethernet has been
>     Atanu> disconnected it will stop attempting to send hello
>     Atanu> packets. Is it possible that there is a problem with an
>     Atanu> interface or cable between the two routers?
>     >> >>
>     Atanu> Atanu.
>     >> >> >>>>> "Aidan" == Aidan Walton <awalton at wires3.net> writes:
>     >> >> 
>     Aidan> Hi All, I am using xorp in a production environment,
>     Aidan> admittedly a small one. I operate a local WISP and xorp is
>     Aidan> running on my wireless nodes. I have a very simple
>     Aidan> configuration and really I could probably get away with
>     Aidan> static routing throughout the entire network, but I wanted to
>     Aidan> try xorp and see just how stable it was. However as I expand
>     Aidan> the network I am having second thoughts. It is not good at
>     Aidan> all when a network goes up in smoke and I can't explain why
>     Aidan> or predict when and what the causes are.  The network has
>     Aidan> been in operation 24x7 for around 9 months. I am running on a
>     Aidan> Linux kernel 2.6.18-4 and for the vast majority of the time I
>     Aidan> have no issues. However now for the fourth time I see the
>     Aidan> same problem: Suddenly the Linux kernel and the xorp rib
>     Aidan> become detached. Normally all routes in the kernel match
>     Aidan> those that xorp is generating, receiving and electing as
>     Aidan> active. I am running OSPF and the neighbour states remain
>     Aidan> 'full' throughout but if I am not mistaken I see ospf hellos
>     Aidan> only in one direction (i.e. nothing being transmitted from the
>     Aidan> router I suspect). The lsdb of OSPF on the suspect and
>     Aidan> adjacent routers contain all the routes but they are aging
>     Aidan> out slowly on the adjacent router. When I look at the kernel
>     Aidan> routes those from OSPF have already vanished.  I can see the
>     Aidan> ospf process running on the offending router? and again I can
>     Aidan> see the ospf lsdb intact and correct. When I restart xorp the
>     Aidan> system recovers and the routes appear in the kernel again. I
>     Aidan> suspect a problem with ospf. I tried enabling traceoptions on
>     Aidan> the ospf process, but in fact I needed to restart all the
>     Aidan> xorp processes before this actually became active. I now have
>     Aidan> this running so if/when it happens again I might be able to
>     Aidan> offer some more information.  Does anyone have any experience
>     Aidan> of ospf being unstable? Any suggestions on how I might more
>     Aidan> effectively capture some logs from this event. I do not see
>     Aidan> any options for logging the fea process. Is there anything I
>     Aidan> can enable to help diagnose the issue?  Many thanks, and of
>     Aidan> course cheers for the code in the first place.  Aidan
>     Aidan> _______________________________________________ Xorp-users
>     Aidan> mailing list Xorp-users at xorp.org
>     Aidan> http://mailman.ICSI.Berkeley.EDU/mailman/listinfo/xorp-users
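
On the hello-interval question in the quoted thread: the adjacency is only
declared dead when no hello arrives for a whole router-dead interval, so
roughly router-dead-interval / hello-interval consecutive hellos have to go
missing. A rough Python sketch of that arithmetic follows; the function name
is hypothetical, and the 5 s / 60 s pair is just an illustrative choice, not
a value recommended anywhere in the thread.

```python
def hellos_per_dead_interval(hello_interval: int, router_dead_interval: int) -> int:
    """Roughly how many consecutive hellos must be lost before the
    router-dead timer expires (both values in seconds)."""
    if hello_interval <= 0 or router_dead_interval <= 0:
        raise ValueError("intervals must be positive")
    return router_dead_interval // hello_interval


if __name__ == "__main__":
    # Defaults quoted in the thread: hello 10 s, router-dead 40 s.
    print(hellos_per_dead_interval(10, 40))
    # Illustrative tuning only: shorter hello, longer dead interval.
    print(hellos_per_dead_interval(5, 60))
```

With the defaults that is about 4 hellos in a row; shortening the hello
interval and lengthening the dead interval both raise that count.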
-------------- next part --------------
A non-text attachment was scrubbed...
Name: save.lsas.lodgefarm
Type: application/octet-stream
Size: 481 bytes
Desc: not available
Url : http://mailman.ICSI.Berkeley.EDU/pipermail/xorp-users/attachments/20071208/7e145042/attachment-0002.obj 
-------------- next part --------------
A non-text attachment was scrubbed...
Name: save.lsas.woodside-relay
Type: application/octet-stream
Size: 481 bytes
Desc: not available
Url : http://mailman.ICSI.Berkeley.EDU/pipermail/xorp-users/attachments/20071208/7e145042/attachment-0003.obj 

