[Xorp-users] Problems with Linux kernel and OSPF ???

Atanu Ghosh atanu at ICSI.Berkeley.EDU
Sat Dec 8 11:28:08 PST 2007


Hi,

I need to spend a little more time looking at this but at first glance
it looks as if both routers have selected themselves as the DR.

I think this problem is fixed in CVS:
<http://xorpc.icir.org/cgi-bin/cvsweb.cgi/xorp/ospf/peer.cc?rev=1.290&content-type=text/x-cvsweb-markup>.

I think it would make sense for you to take the latest version of OSPF
from the CVS repository.

  Atanu.

>>>>> "Aidan" == Aidan Walton <awalton at wires3.net> writes:

    Aidan> Whoops, it's done it again. Here are both ends of the link, and attached
    Aidan> are the two replay files; again neither side had an interface failure. It's
    Aidan> very strange that this now seems to occur every day or so, just as I
    Aidan> started this conversation with you:

    Aidan> I agree that we need to get to the bottom of this but I also need a
    Aidan> stable network. Do you think there is anything significant in the
    Aidan> changes to ospf that may be related?

    Aidan> root@woodside-relay> show ospf4 neighbor detail
    Aidan>   Address         Interface             State      ID              Pri  Dead
    Aidan> 89.248.141.193   ath0/ath0              Full      89.248.141.223   128    38
    Aidan>   Area 0.0.0.0, opt 0x2, DR 89.248.141.194, BDR 89.248.141.193
    Aidan>   Up 103:53:13, adjacent 103:49:09
    Aidan> 89.248.141.198   ath1/ath1              Full      89.248.141.221   128    39
    Aidan>   Area 0.0.0.0, opt 0x2, DR 89.248.141.198, BDR 0.0.0.0
    Aidan>   Up 103:53:11, adjacent 00:04:30
    Aidan> root@woodside-relay> show ospf4 database detail
    Aidan> OSPF link state database, Area 0.0.0.0
    Aidan> Router-LSA:
    Aidan> LS age  285 Options  0x2 DC: 0 EA: 0 N/P: 0 MC: 0 E: 1 LS type 0x1 Link
    Aidan> State ID 89.248.141.222 Advertising Router 89.248.141.222 LS sequence
    Aidan> number 0x8000098b LS checksum 0x3f92 length 60
    Aidan> bit Nt false
    Aidan> bit V false
    Aidan> bit E false
    Aidan> bit B false
    Aidan> Type 2 Transit network IP address of Designated router
    Aidan> 89.248.141.194 Routers interface address 89.248.141.194 Metric 1
    Aidan> Type 3 Stub network Subnet number 89.248.141.222 Mask
    Aidan> 255.255.255.255 Metric 1
    Aidan> Type 2 Transit network IP address of Designated router
    Aidan> 89.248.141.197 Routers interface address 89.248.141.197 Metric 1
    Aidan> Router-LSA:
    Aidan> LS age 1153 Options  0x2 DC: 0 EA: 0 N/P: 0 MC: 0 E: 1 LS type 0x1 Link
    Aidan> State ID 89.248.141.223 Advertising Router 89.248.141.223 LS sequence
    Aidan> number 0x80001059 LS checksum 0x76dd length 48
    Aidan> bit Nt false
    Aidan> bit V false
    Aidan> bit E true
    Aidan> bit B false
    Aidan> Type 2 Transit network IP address of Designated router
    Aidan> 89.248.141.194 Routers interface address 89.248.141.193 Metric 1
    Aidan> Type 3 Stub network Subnet number 89.248.141.223 Mask
    Aidan> 255.255.255.255 Metric 1
    Aidan> Router-LSA:
    Aidan> LS age  286 Options  0x2 DC: 0 EA: 0 N/P: 0 MC: 0 E: 1 LS type 0x1 Link
    Aidan> State ID 89.248.141.221 Advertising Router 89.248.141.221 LS sequence
    Aidan> number 0x800008a2 LS checksum 0x5eb1 length 48
    Aidan> bit Nt false
    Aidan> bit V false
    Aidan> bit E true
    Aidan> bit B false
    Aidan> Type 2 Transit network IP address of Designated router
    Aidan> 89.248.141.198 Routers interface address 89.248.141.198 Metric 1
    Aidan> Type 3 Stub network Subnet number 89.248.141.221 Mask
    Aidan> 255.255.255.255 Metric 1
    Aidan> Network-LSA:
    Aidan> LS age 1163 Options  0x2 DC: 0 EA: 0 N/P: 0 MC: 0 E: 1 LS type 0x2 Link
    Aidan> State ID 89.248.141.194 Advertising Router 89.248.141.222 LS sequence
    Aidan> number 0x800000d0 LS checksum 0xbeee length 32
    Aidan> Network Mask 0xfffffffc
    Aidan> Attached Router 89.248.141.222
    Aidan> Attached Router 89.248.141.223
    Aidan> Network-LSA:
    Aidan> LS age  286 Options  0x2 DC: 0 EA: 0 N/P: 0 MC: 0 E: 1 LS type 0x2 Link
    Aidan> State ID 89.248.141.198 Advertising Router 89.248.141.221 LS sequence
    Aidan> number 0x80000001 LS checksum 0x2458 length 32
    Aidan> Network Mask 0xfffffffc
    Aidan> Attached Router 89.248.141.221
    Aidan> Attached Router 89.248.141.222
    Aidan> As-External-LSA:
    Aidan> LS age 1163 Options  0x2 DC: 0 EA: 0 N/P: 0 MC: 0 E: 1 LS type 0x5 Link
    Aidan> State ID 0.0.0.0 Advertising Router 89.248.141.223 LS sequence number
    Aidan> 0x80001016 LS checksum 0xd350 length 36
    Aidan> Network Mask 0
    Aidan> bit E true
    Aidan> Metric 10 0xa
    Aidan> Forwarding address 89.248.141.223
    Aidan> External Route Tag 0
    Aidan> As-External-LSA:
    Aidan> LS age  705 Options  0x2 DC: 0 EA: 0 N/P: 0 MC: 0 E: 1 LS type 0x5 Link
    Aidan> State ID 89.248.141.224 Advertising Router 89.248.141.221 LS sequence
    Aidan> number 0x800000cf LS checksum 0x4e98 length 36
    Aidan> Network Mask 0xffffffe0
    Aidan> bit E true
    Aidan> Metric 0 0
    Aidan> Forwarding address 89.248.141.221
    Aidan> External Route Tag 0
    Aidan> As-External-LSA:
    Aidan> LS age  705 Options  0x2 DC: 0 EA: 0 N/P: 0 MC: 0 E: 1 LS type 0x5 Link
    Aidan> State ID 89.248.142.128 Advertising Router 89.248.141.221 LS sequence
    Aidan> number 0x800000cf LS checksum 0xfb28 length 36
    Aidan> Network Mask 0xfffffff8
    Aidan> bit E true
    Aidan> Metric 10 0xa
    Aidan> Forwarding address 89.248.141.221
    Aidan> External Route Tag 0
    Aidan> Network-LSA:
    Aidan> LS age  286 Options  0x2 DC: 0 EA: 0 N/P: 0 MC: 0 E: 1 LS type 0x2 Link
    Aidan> State ID 89.248.141.197 Advertising Router 89.248.141.222 LS sequence
    Aidan> number 0x80000093 LS checksum 0xfeea length 32
    Aidan> Network Mask 0xfffffffc
    Aidan> Attached Router 89.248.141.222
    Aidan> Attached Router 89.248.141.221

    Aidan> root@lodgefarm> show ospf4 neighbor detail
    Aidan>   Address         Interface             State      ID              Pri  Dead
    Aidan> 89.248.141.197   ath0/ath0              Full      89.248.141.222   128    33
    Aidan>   Area 0.0.0.0, opt 0x2, DR 89.248.141.197, BDR 0.0.0.0
    Aidan>   Up 25:26:07, adjacent 00:09:17
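
The two neighbor listings above can be compared mechanically. The following is a minimal Python sketch, not part of XORP: the `NeighborEntry` type and the `dr_disagreement` helper are hypothetical names, and the values are hand-copied from the output for the 89.248.141.196/30 link. It only checks whether the two ends of one link report different DR addresses, which is the symptom described at the top of this message.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NeighborEntry:
    reported_on: str  # router whose CLI printed the entry
    neighbor: str     # neighbor interface address
    dr: str           # DR field from "show ospf4 neighbor detail"
    bdr: str          # BDR field from the same output


def dr_disagreement(entries):
    """Return the sorted distinct DR addresses if the entries disagree, else []."""
    drs = sorted({e.dr for e in entries})
    return drs if len(drs) > 1 else []


# Entries for the woodside-relay <-> lodgefarm link, copied from the listings.
link_196_30 = [
    NeighborEntry("woodside-relay", "89.248.141.198", "89.248.141.198", "0.0.0.0"),
    NeighborEntry("lodgefarm", "89.248.141.197", "89.248.141.197", "0.0.0.0"),
]

print(dr_disagreement(link_196_30))
# ['89.248.141.197', '89.248.141.198']
```

A healthy broadcast link would yield an empty list, since both ends should agree on a single DR.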



    Aidan> On Fri, 2007-12-07 at 15:30 -0800, Atanu Ghosh wrote:
    >> Hi,
    >> 
    >> I did think about suggesting that you switch to point-to-point, but
    >> then we would be ignoring the problem. XORP's OSPF requires that the
    >> neighbours are explicitly configured when point-to-point is selected;
    >> this may not have been obvious when you tried it before.
    >> 
    >> Decreasing the hello interval will cause more hello packets to be sent
    >> within each router-dead interval, so more consecutive packets will
    >> need to be lost before an adjacency is lost.
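
The arithmetic behind this advice can be sketched in a few lines of Python. This is only an illustration: `hellos_per_dead_interval` is a hypothetical helper, the 10 and 40 second values are the defaults quoted elsewhere in this thread, and the other values are made-up examples. The result is roughly the number of consecutive hellos that must go missing before the neighbour is declared dead.

```python
def hellos_per_dead_interval(hello_interval, router_dead_interval):
    """Number of hello packets sent during one router-dead interval (seconds)."""
    return router_dead_interval // hello_interval


print(hellos_per_dead_interval(10, 40))  # defaults quoted in this thread: 4
print(hellos_per_dead_interval(5, 40))   # shorter hello interval (example): 8
print(hellos_per_dead_interval(5, 60))   # and a longer dead interval (example): 12
```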
    >> 
    >> The only module that is sensitive to operating system differences is the
    >> FEA (Forwarding Engine Abstraction). For example all the protocols send
    >> and receive packets through the FEA, so the vagaries of how to send and
    >> receive raw IP packets are dealt with by the FEA among other things. It
    >> is therefore certainly safe to build OSPF on any of your hosts running
    >> the same operating system. It should also be safe to build the FEA on
    >> any of your hosts that are running the same OS with a similar version
    >> number.
    >> 
    >> The router manager, the process that controls all the XORP processes,
    >> reads template files on startup that define how a process can be
    >> configured. These files are in the template directory and have the .tp
    >> extension; the OSPFv2 template file is ospfv2.tp. In general, if you
    >> update a binary, its matching template file should also be installed. In
    >> this case ospfv2.tp has not changed since the last release. In the
    >> template directory there are also files with the extension .cmds; these
    >> define the operational commands that are available for a protocol. The
    >> file ospfv2.cmds has changed, as the "clear ospf4 database" command has
    >> been added. You will need to update this file in the templates
    >> directory. You should also update the operational command programs
    >> themselves: print_neighbours, print_lsas and clear_database.
    >> 
    >> In answer to your question it should be safe to just update the OSPF
    >> components of the system. We hope to one day support the updating of
    >> protocols without requiring a restart. 
    >> 
    >> Atanu.
    >> 
    >> >>>>> "Aidan" == Aidan Walton <awalton at wires3.net> writes:
    >> 
    Aidan> Hi, This makes sense, I did not see the net LSA from .221. To
    Aidan> be honest even if I had I'm pretty sure I would not have
    Aidan> thought anything was wrong. It has been a long time since I
    Aidan> studied OSPF's workings. I suppose I have ensured myself a
    Aidan> certain re-baptism of fire. If I'm not mistaken I tried
    Aidan> already to configure these links as point-to-point
    Aidan> connections.  1. to avoid the DR elections process 2. to save
    Aidan> on IP addresses.  For some reason, I forget, it did not seem
    Aidan> to work. This would be the best approach to stability I would
    Aidan> have thought as we avoid DR election altogether, but as you
    Aidan> say if I lengthen the dead-interval it may ride out whatever
    Aidan> caused the adjacency to drop. The trouble is I don't think
    Aidan> anything on the physical side caused the adjacency to drop in
    Aidan> the first place, so just increasing the dead time may well
    Aidan> not achieve stability.  Why decrease the hello interval?
    >> 
    Aidan> Regarding the new code. This may be a stupid question but
    Aidan> here goes. As you know these routers are live and as I am the
    Aidan> world's worst admin, I have different kernels on different
    Aidan> routers. If I remember correctly the xorp code I'm running on
    Aidan> the different routers was compiled off-line on a different
    Aidan> machine, so they are running code that was not compiled
    Aidan> against the headers of the kernel actually running. Xorp does not
    Aidan> produce kernel modules so will this ever matter? How far can
    Aidan> I push this?
    >> 
    Aidan> The important question: Is each binary produced at compile
    Aidan> time stand-alone? I.e. if I compile the CVS sources off-line
    Aidan> and then take just the ospfv2 binary, will this work with the
    Aidan> other binaries that are already running on the live network?
    Aidan> Can I just replace the ospf module or are there dependencies
    Aidan> in the other binaries that will cause this approach to break
    Aidan> things. The modular approach would of course be nice !
    >> 
    Aidan> Thanks for help so far, and I'm working on the logging
    Aidan> issue. I had this working in the past and now it doesn't hmmm
    Aidan> strange?
    >> 
    Aidan> Thanks again Aidan
    >> 
    >> 
    Aidan> On Thu, 2007-12-06 at 20:13 -0800, Atanu Ghosh wrote:
    >> >> Hi,
    >> >> 
    >> >> The problem that I see is that both routers (89.248.141.221
    >> >> and 89.248.141.222) are announcing a Network-LSA for the subnet
    >> >> 89.248.141.196/30. There should only be one Network-LSA per
    >> >> subnet. An unfortunate side effect of this behaviour is that, from
    >> >> the perspective of the LSA database, routers 89.248.141.221 and
    >> >> 89.248.141.222 are not connected, hence no routes.
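
A check for this condition could look like the Python sketch below. It is an illustration only, not XORP code: `subnet_of` and `duplicate_network_lsas` are hypothetical helpers, and the tuples are hand-copied from the Network-LSA entries in the `show ospf4 database detail` output quoted earlier in this thread. It groups Network-LSAs by the subnet derived from Link State ID and Network Mask and reports any subnet announced by more than one router.

```python
import ipaddress
from collections import defaultdict


def subnet_of(link_state_id, mask_hex):
    """Derive the subnet from a Network-LSA's Link State ID and hex mask."""
    mask = ipaddress.IPv4Address(int(mask_hex, 16))
    return ipaddress.ip_network(f"{link_state_id}/{mask}", strict=False)


def duplicate_network_lsas(lsas):
    """Map subnet -> advertising routers, for subnets with more than one LSA."""
    by_subnet = defaultdict(list)
    for ls_id, adv_router, mask in lsas:
        by_subnet[subnet_of(ls_id, mask)].append(adv_router)
    return {
        str(net): sorted(routers)
        for net, routers in by_subnet.items()
        if len(routers) > 1
    }


# (Link State ID, Advertising Router, Network Mask) from the database dump.
network_lsas = [
    ("89.248.141.194", "89.248.141.222", "0xfffffffc"),
    ("89.248.141.198", "89.248.141.221", "0xfffffffc"),
    ("89.248.141.197", "89.248.141.222", "0xfffffffc"),
]

print(duplicate_network_lsas(network_lsas))
# {'89.248.141.196/30': ['89.248.141.221', '89.248.141.222']}
```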
    >> >> 
    >> >> Only the designated router (DR) for a subnet should be announcing
    >> >> a Network-LSA. From the output of "show ospf4 neighbor detail",
    >> >> router 89.248.141.221 doesn't consider itself to be the DR
    >> >> (router 89.248.141.222 is the DR, 89.248.141.197 is its interface
    >> >> address).
    >> >> 
    >> >> From the sequence number and age of the Network-LSA generated by
    >> >> router 89.248.141.221 we can see that it was initially announced
    >> >> 6 mins 48 secs ago, which is similar to the 8 mins 32 secs the
    >> >> adjacency has existed.
    >> >> 
    >> >> Router 89.248.141.221 is announcing a Network-LSA even though it
    >> >> is not the DR. The odd part is that the Network-LSA was initially
    >> >> announced after the adjacency was formed and a DR had already
    >> >> been selected.
    >> >> 
    >> >> Could you try running the latest code from CVS? Some DR election
    >> >> issues have been fixed since the last release.
    >> >> 
    >> >> My guess would be that the problem only occurs after a loss of
    >> >> adjacency. The default settings for hello-interval and
    >> >> router-dead-interval are 10 and 40 seconds respectively. You
    >> >> could try decreasing the hello-interval and increasing
    >> >> router-dead-interval until we get to the bottom of this.
    >> >> 
    >> >> Next time you see a problem:
    >> >>   1) $ print_lsas -S save.lsas
    >> >>   2) show ospf4 neighbor detail (on the router in question and
    >> >>      the neighbour)
    >> >> Just to confirm the problem.
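
Since the capture has to happen quickly during an outage, the shell-level steps could be wrapped in a small script. The Python sketch below is only an assumption about how one might do that: `snapshot_commands` and `take_snapshot` are hypothetical names, the timestamped file name is an invented convention, and it assumes `netstat` and `print_lsas` are on the PATH. The `show ospf4 ...` commands still have to be run from the XORP shell by hand.

```python
import datetime
import subprocess


def snapshot_commands(stamp):
    """Shell-level commands from this thread, with a timestamped LSA save file."""
    return [
        ["netstat", "-nr"],
        ["print_lsas", "-S", f"save-{stamp}.lsas"],
    ]


def take_snapshot(run=subprocess.run):
    """Run each command in turn and return the timestamp used in the file name."""
    stamp = datetime.datetime.now().strftime("%Y%m%dT%H%M%S")
    for argv in snapshot_commands(stamp):
        # check=False: a non-zero exit status from one command does not stop
        # the rest. A missing program still raises FileNotFoundError.
        run(argv, check=False)
    return stamp
```

Calling `take_snapshot()` on the affected router and again on its neighbour would leave one replayable `.lsas` file per router, named by capture time.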
    >> >> 
    >> >> I will try and reproduce the problem.
    >> >> 
    >> >> Atanu.
    >> >> 
    >> >> OSPF link state database, Area 0.0.0.0
    >> >>  Type       ID               Adv Rtr           Seq      Age  Opt  Cksum  Len
    >> >> Network  89.248.141.197   89.248.141.222   0x8000003b  409  0x2  0xaf92  32
    >> >> Network *89.248.141.198   89.248.141.221   0x80000001  409  0x2  0x2458  32
    >> >> 
    >> >> >>>>> "Aidan" == Aidan Walton <awalton at wires3.net> writes:
    >> >> 
    Aidan> Hi Atanu, Now that I'm over the panic of collecting the data
    Aidan> and recovering the network, I can see that the neighbour
    Aidan> seems to have been up for 29hrs but adjacent for only
    Aidan> 8mins. Clearly the adjacency has been dropped and even though
    Aidan> it has recovered it no longer imports the routes from the
    Aidan> database. Unfortunately because I do not have the logs, as
    Aidan> explained in the last email, I have no trace of the failure
    Aidan> :( Note the tcpdump of the hellos in both directions. I
    Aidan> thought if the adjacency failed a database update would be
    Aidan> forced when it is re-established.
    >> >>
    Aidan> Oh BTW I checked my logs on the interfaces and the physicals
    Aidan> have been up all the time.
    >> >>
    Aidan> What's up with ospf?  Aidan
    >> >>
    Aidan> On Wed, 2007-12-05 at 01:00 -0800, Atanu Ghosh wrote:
    >> >> >> Hi,
    >> >> >> 
    >> >> >> The output that it would be good to see before and after the
    >> >> >> problem occurs:
    >> >> >>   1) $ netstat -nr
    >> >> >>   2) Xorp> show interfaces
    >> >> >>   3) Xorp> show route table ipv4 unicast final
    >> >> >>   4) Xorp> show ospf4 neighbor detail
    >> >> >>   5) Xorp> show ospf4 database detail
    >> >> >>   6) $ print_lsas -S save.lsas
    >> >> >> The print_lsas program can be found in ospf/tools directory.
    >> >> >> The program stores the LSA database in a form that can be
    >> >> >> replayed.
    >> >> >> 
    >> >> >> You can also enable tracing in ospf:
    >> >> >>   traceoptions {
    >> >> >>       flag {
    >> >> >>           all {
    >> >> >>               disable: false
    >> >> >>           }
    >> >> >>       }
    >> >> >>   }
    >> >> >> 
    >> >> >> Which should show routes being added and deleted.
    >> >> >> 
    >> >> >> The latest code in CVS has a "clear ospf4 database" command;
    >> >> >> it would be interesting to know whether, once the problem
    >> >> >> occurs, this solves the problem.
    >> >> >> 
    >> >> >> It might also be interesting to keep the "ip mon" command
    >> >> >> running to track routes being added and deleted.
    >> >> >> 
    >> >> >> Would it be possible at some off-peak time to flap the ADSL
    >> >> >> link to see if this replicates the problem? I know that you
    >> >> >> have stated that there were no ADSL issues when the problem
    >> >> >> occurred, but I do wonder if we are seeing some issue related
    >> >> >> to dynamic interfaces.
    >> >> >> 
    >> >> >> Atanu.
    >> >> >> 
    >> >> >> >>>>> "Aidan" == Aidan Walton <awalton at wires3.net> writes:
    >> >> >> 
    Aidan> Hi, The adjacency runs over a wireless link between the
    Aidan> routers. It can, very possibly, drop in and out, but as far
    Aidan> as I can see this did not happen and to be honest in the 9
    Aidan> months I have had this system up I have never seen the
    Aidan> wireless link drop, but packet corruption could be a
    Aidan> possibility and this may be less easy to diagnose. It is a
    Aidan> high power 5.8GHz connection, here in the UK this is a
    Aidan> licensed band (and yes I have a license). So I don't think
    Aidan> interference is the likely cause, though I wouldn't rule this
    Aidan> out. If I look at the logs from the same period I see
    Aidan> nothing to indicate the interface flapped, I would see the
    Aidan> wireless dis-associate and re-associate and cypher exchange
    Aidan> and this did not happen. But as I say there could be a period
    Aidan> of high BER on the links. I thought ospf would handle this
    Aidan> reasonably gracefully? I have to say heavy BER was not
    Aidan> evident when I came to repair the network, or at least I
    Aidan> didn't detect it and in the past I have run ospf over another
    Aidan> one of my wireless links with stations 10km apart with the
    Aidan> wireless link almost non-functional, dropping packets left
    Aidan> right and centre and re-associating over and over, but xorp's
    Aidan> ospf never complained!
    Aidan>
    Aidan> I was beginning to suspect that this
    Aidan> was related to my adsl link on the suspect router, as this is
    Aidan> a dynamic interface and I have this defined independently of
    Aidan> xorp. If this interface flaps then the default route
    Aidan> associated with the adsl ppp session is withdrawn. The
    Aidan> default from the adsl line is not propagated into ospf
    Aidan> though, instead I use a static default with a higher metric
    Aidan> pointed at the loopback and inject this into ospf
    Aidan> instead. Then the flaps of the adsl line do not cause churn
    Aidan> in the ospf domain. I was starting to think that the addition
    Aidan> and removal of the default from the adsl line was affecting
    Aidan> the kernel table and this was upsetting xorp's ospf. However
    Aidan> this morning when this happened the adsl line was stable. As
    Aidan> far as my logs look it suddenly decided to stop functioning
    Aidan> with no correlated events from other system processes. The
    Aidan> only thing in the logs at the same time is iptables dropping
    Aidan> DOS attacks, but this is normal, unfortunately far too normal.
    Aidan> show ospf4 neighbour simply stated 'full'; there is only one
    Aidan> neighbour defined on this router. I didn't look this time at
    Aidan> show interfaces, but from memory of the last time this
    Aidan> happened this also was normal.
    Aidan>
    Aidan> The problem is that these
    Aidan> routers are mounted 10m high up telegraph poles. If I lose
    Aidan> connectivity it requires a ladder and a climbing harness to
    Aidan> get at them, this is not to mention my upset customers who,
    Aidan> as is normal with customers, do not delay in telling me they
    Aidan> have lost their Internet links.
    Aidan>
    Aidan> I suppose what I'm trying to
    Aidan> understand is how to be best prepared for next time, logging,
    Aidan> processes and checks during the failure period to grab as
    Aidan> much useful info before I am forced to restart xorp and get
    Aidan> my customers up and running again. This is a very short
    Aidan> period I have to say. I have a small group of business units
    Aidan> supported on this router and all hell breaks loose if this
    Aidan> happens during working hours.
    Aidan>
    Aidan> How can I get the maximum
    Aidan> logging info from the xorp processes?  Anything I can do in
    Aidan> order that you can help me, will be dutifully carried
    Aidan> out. What next, any suggestions?
    Aidan>
    Aidan> Thanks Aidan
    Aidan>
    Aidan> On Tue, 2007-12-04 at 12:19 -0800, Atanu Ghosh wrote:
    >> >> >>
    Atanu> Hi,
    >> >> >>
    Atanu> The scenario that you describe would be perfectly normal if
    Atanu> the connectivity between the "suspect" router and the
    Atanu> "adjacent" router is lost. Although I would expect the "show
    Atanu> ospf4 neighbor" to show the state of the adjacency to be
    Atanu> "Down" not "Full". When an OSPF router loses its adjacencies
    Atanu> the LSA database will slowly timeout, however, the routes
    Atanu> will be withdrawn as soon as the adjacencies are lost.
    >> >> >>
    Atanu> We will require more information to diagnose the problem. Next
    Atanu> time the problem occurs, the output of "show interfaces" and
    Atanu> "show ospf4 neighbor" would be very useful.
    >> >> >>
    Atanu> XORP tracks the state of interfaces, in particular the carrier
    Atanu> state. If OSPF believes that the Ethernet has been
    Atanu> disconnected it will stop attempting to send hello
    Atanu> packets. Is it possible that there is a problem with an
    Atanu> interface or cable between the two routers?
    >> >> >>
    Atanu> Atanu.
    >> >> >> >>>>> "Aidan" == Aidan Walton <awalton at wires3.net> writes:
    >> >> >> 
    Aidan> Hi All, I am using xorp in a production environment,
    Aidan> admittedly a small one. I operate a local WISP and xorp is
    Aidan> running on my wireless nodes. I have a very simple
    Aidan> configuration and really I could probably get away with
    Aidan> static routing throughout the entire network, but I wanted to
    Aidan> try xorp and see just how stable it was. However as I expand
    Aidan> the network I am having second thoughts. It is not good at
    Aidan> all when a network goes up in smoke and I can't explain why
    Aidan> or predict when and what the causes are.  The network has
    Aidan> been in operation 24x7 for around 9 months. I am running on a
    Aidan> Linux kernel 2.6.18-4 and for the vast majority of the time I
    Aidan> have no issues. However now for the fourth time I see the
    Aidan> same problem: Suddenly the Linux kernel and the xorp rib
    Aidan> become detached. Normally all routes in the kernel match
    Aidan> those that xorp is generating, receiving and electing as
    Aidan> active. I am running OSPF and the neighbour states remain
    Aidan> 'full' throughout but if I am not mistaken I see ospf hellos
    Aidan> only in one direction (i.e. nothing being transmitted from the
    Aidan> router I suspect). The lsdb of OSPF on the suspect and
    Aidan> adjacent routers contain all the routes but they are aging
    Aidan> out slowly on the adjacent router. When I look at the kernel
    Aidan> routes those from OSPF have already vanished.  I can see the
    Aidan> ospf process running on the offending router, and again I can
    Aidan> see the ospf lsdb intact and correct. When I restart xorp the
    Aidan> system recovers and the routes appear in the kernel again. I
    Aidan> suspect a problem with ospf. I tried enabling traceoptions on
    Aidan> the ospf process, but in fact I needed to restart all the
    Aidan> xorp processes before this actually became active. I now have
    Aidan> this running so if/when it happens again I might be able to
    Aidan> offer some more information.  Does anyone have any experience
    Aidan> of ospf being unstable? Any suggestions on how I might more
    Aidan> effectively capture some logs from this event? I do not see
    Aidan> any options for logging the fea process. Is there anything I
    Aidan> can enable to help diagnose the issue?  Many thanks, and of
    Aidan> course cheers for the code in the first place.  Aidan


