[Xorp-users] Problems with Linux kernel and OSPF ???
Atanu Ghosh
atanu at ICSI.Berkeley.EDU
Sat Dec 8 11:28:08 PST 2007
Hi,
I need to spend a little more time looking at this but at first glance
it looks as if both routers have selected themselves as the DR.
I think this problem is fixed in CVS:
<http://xorpc.icir.org/cgi-bin/cvsweb.cgi/xorp/ospf/peer.cc?rev=1.290&content-type=text/x-cvsweb-markup>.
I think it would make sense for you to take the latest version of OSPF
from the CVS repository.
Atanu.
>>>>> "Aidan" == Aidan Walton <awalton at wires3.net> writes:
Aidan> Whoops, it's done it again. Here are both ends of the link, and attached are the
Aidan> two replay files. Again, neither side had an interface failure. It's very
Aidan> strange that this now seems to occur every day or so, just as I started
Aidan> this conversation with you:
Aidan> I agree that we need to get to the bottom of this but I also need a
Aidan> stable network. Do you think there is anything significant in the
Aidan> changes to ospf that may be related?
Aidan> root at woodside-relay> show ospf4 neighbor detail
Aidan> Address         Interface  State  ID              Pri  Dead
Aidan> 89.248.141.193  ath0/ath0  Full   89.248.141.223  128  38
Aidan> Area 0.0.0.0, opt 0x2, DR 89.248.141.194, BDR 89.248.141.193
Aidan> Up 103:53:13, adjacent 103:49:09
Aidan> 89.248.141.198  ath1/ath1  Full   89.248.141.221  128  39
Aidan> Area 0.0.0.0, opt 0x2, DR 89.248.141.198, BDR 0.0.0.0
Aidan> Up 103:53:11, adjacent 00:04:30
Aidan> root at woodside-relay> show ospf4 database detail
Aidan> OSPF link state database, Area 0.0.0.0
Aidan> Router-LSA:
Aidan> LS age 285 Options 0x2 DC: 0 EA: 0 N/P: 0 MC: 0 E: 1 LS type 0x1 Link
Aidan> State ID 89.248.141.222 Advertising Router 89.248.141.222 LS sequence
Aidan> number 0x8000098b LS checksum 0x3f92 length 60
Aidan> bit Nt false
Aidan> bit V false
Aidan> bit E false
Aidan> bit B false
Aidan> Type 2 Transit network IP address of Designated router
Aidan> 89.248.141.194 Routers interface address 89.248.141.194 Metric 1
Aidan> Type 3 Stub network Subnet number 89.248.141.222 Mask
Aidan> 255.255.255.255 Metric 1
Aidan> Type 2 Transit network IP address of Designated router
Aidan> 89.248.141.197 Routers interface address 89.248.141.197 Metric 1
Aidan> Router-LSA:
Aidan> LS age 1153 Options 0x2 DC: 0 EA: 0 N/P: 0 MC: 0 E: 1 LS type 0x1 Link
Aidan> State ID 89.248.141.223 Advertising Router 89.248.141.223 LS sequence
Aidan> number 0x80001059 LS checksum 0x76dd length 48
Aidan> bit Nt false
Aidan> bit V false
Aidan> bit E true
Aidan> bit B false
Aidan> Type 2 Transit network IP address of Designated router
Aidan> 89.248.141.194 Routers interface address 89.248.141.193 Metric 1
Aidan> Type 3 Stub network Subnet number 89.248.141.223 Mask
Aidan> 255.255.255.255 Metric 1
Aidan> Router-LSA:
Aidan> LS age 286 Options 0x2 DC: 0 EA: 0 N/P: 0 MC: 0 E: 1 LS type 0x1 Link
Aidan> State ID 89.248.141.221 Advertising Router 89.248.141.221 LS sequence
Aidan> number 0x800008a2 LS checksum 0x5eb1 length 48
Aidan> bit Nt false
Aidan> bit V false
Aidan> bit E true
Aidan> bit B false
Aidan> Type 2 Transit network IP address of Designated router
Aidan> 89.248.141.198 Routers interface address 89.248.141.198 Metric 1
Aidan> Type 3 Stub network Subnet number 89.248.141.221 Mask
Aidan> 255.255.255.255 Metric 1
Aidan> Network-LSA:
Aidan> LS age 1163 Options 0x2 DC: 0 EA: 0 N/P: 0 MC: 0 E: 1 LS type 0x2 Link
Aidan> State ID 89.248.141.194 Advertising Router 89.248.141.222 LS sequence
Aidan> number 0x800000d0 LS checksum 0xbeee length 32
Aidan> Network Mask 0xfffffffc
Aidan> Attached Router 89.248.141.222
Aidan> Attached Router 89.248.141.223
Aidan> Network-LSA:
Aidan> LS age 286 Options 0x2 DC: 0 EA: 0 N/P: 0 MC: 0 E: 1 LS type 0x2 Link
Aidan> State ID 89.248.141.198 Advertising Router 89.248.141.221 LS sequence
Aidan> number 0x80000001 LS checksum 0x2458 length 32
Aidan> Network Mask 0xfffffffc
Aidan> Attached Router 89.248.141.221
Aidan> Attached Router 89.248.141.222
Aidan> As-External-LSA:
Aidan> LS age 1163 Options 0x2 DC: 0 EA: 0 N/P: 0 MC: 0 E: 1 LS type 0x5 Link
Aidan> State ID 0.0.0.0 Advertising Router 89.248.141.223 LS sequence number
Aidan> 0x80001016 LS checksum 0xd350 length 36
Aidan> Network Mask 0
Aidan> bit E true
Aidan> Metric 10 0xa
Aidan> Forwarding address 89.248.141.223
Aidan> External Route Tag 0
Aidan> As-External-LSA:
Aidan> LS age 705 Options 0x2 DC: 0 EA: 0 N/P: 0 MC: 0 E: 1 LS type 0x5 Link
Aidan> State ID 89.248.141.224 Advertising Router 89.248.141.221 LS sequence
Aidan> number 0x800000cf LS checksum 0x4e98 length 36
Aidan> Network Mask 0xffffffe0
Aidan> bit E true
Aidan> Metric 0 0
Aidan> Forwarding address 89.248.141.221
Aidan> External Route Tag 0
Aidan> As-External-LSA:
Aidan> LS age 705 Options 0x2 DC: 0 EA: 0 N/P: 0 MC: 0 E: 1 LS type 0x5 Link
Aidan> State ID 89.248.142.128 Advertising Router 89.248.141.221 LS sequence
Aidan> number 0x800000cf LS checksum 0xfb28 length 36
Aidan> Network Mask 0xfffffff8
Aidan> bit E true
Aidan> Metric 10 0xa
Aidan> Forwarding address 89.248.141.221
Aidan> External Route Tag 0
Aidan> Network-LSA:
Aidan> LS age 286 Options 0x2 DC: 0 EA: 0 N/P: 0 MC: 0 E: 1 LS type 0x2 Link
Aidan> State ID 89.248.141.197 Advertising Router 89.248.141.222 LS sequence
Aidan> number 0x80000093 LS checksum 0xfeea length 32
Aidan> Network Mask 0xfffffffc
Aidan> Attached Router 89.248.141.222
Aidan> Attached Router 89.248.141.221
Aidan> root at lodgefarm> show ospf4 neighbor detail
Aidan> Address         Interface  State  ID              Pri  Dead
Aidan> 89.248.141.197  ath0/ath0  Full   89.248.141.222  128  33
Aidan> Area 0.0.0.0, opt 0x2, DR 89.248.141.197, BDR 0.0.0.0
Aidan> Up 25:26:07, adjacent 00:09:17
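
The first-glance diagnosis (both routers selecting themselves as DR) can be cross-checked against the Router-LSAs in the database output above. The sketch below is only an illustration: the tuples are copied by hand from the two transit-link entries for the 89.248.141.196/30 link, and the helper name `self_declared_drs` is hypothetical, not part of XORP.

```python
# Transit links on 89.248.141.196/30, copied from the two Router-LSAs above:
# (advertising router, DR address listed, router's own interface address)
transit_links = [
    ("89.248.141.222", "89.248.141.197", "89.248.141.197"),
    ("89.248.141.221", "89.248.141.198", "89.248.141.198"),
]


def self_declared_drs(links):
    """Return the routers whose Router-LSA names their own interface as the DR."""
    return [router for router, dr, own in links if dr == own]


claimants = self_declared_drs(transit_links)
# More than one claimant on the same link means the DR election has not converged.
dr_conflict = len(claimants) > 1
print(claimants, dr_conflict)
```

With the values from this thread, both routers end up in `claimants`, which matches the symptom described at the top of the message.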
Aidan> On Fri, 2007-12-07 at 15:30 -0800, Atanu Ghosh wrote:
>> Hi,
>>
>> I did think about suggesting that you switch to point-to-point, but
>> then we would be ignoring the problem. XORP's OSPF requires that
>> neighbours be explicitly configured when point-to-point is selected;
>> this may not have been obvious when you tried it before.
>>
>> Decreasing the hello interval will cause more hello packets to be sent
>> within the router-dead interval, so more packets will have to be lost
>> before an adjacency is lost.
>>
>> The only module that is sensitive to operating system differences is the
>> FEA (Forwarding Engine Abstraction). For example, all the protocols send
>> and receive packets through the FEA, so the vagaries of how to send and
>> receive raw IP packets are dealt with by the FEA among other things. It
>> is therefore certainly safe to build OSPF on any of your hosts running
>> the same operating system. It should also be safe to build the FEA on
>> any of your hosts that are running the same OS with a similar version
>> number.
>>
>> The router manager, the process that controls all the XORP processes,
>> reads template files on startup that define how a process can be
>> configured. These files are in the template directory and have the .tp
>> extension; the OSPFv2 template file is ospfv2.tp. In general, if you
>> update a binary, its matching template file should also be installed. In
>> this case ospfv2.tp has not changed since the last release. The
>> template directory also contains files with the extension .cmds, which
>> define the operational commands that are available for a protocol. The
>> file ospfv2.cmds has changed, as the "clear ospf4 database" command has
>> been added. You will need to update this file in the templates
>> directory. You should also update the operational command programs
>> themselves: print_neighbours, print_lsas and clear_database.
>>
>> In answer to your question it should be safe to just update the OSPF
>> components of the system. We hope to one day support the updating of
>> protocols without requiring a restart.
>>
>> Atanu.
>>
>> >>>>> "Aidan" == Aidan Walton <awalton at wires3.net> writes:
>>
Aidan> Hi, This makes sense, I did not see the net LSA from .221. To
Aidan> be honest even if I had I'm pretty sure I would not have
Aidan> thought anything was wrong. It has been a long time since I
Aidan> studied OSPF's workings. I suppose I have ensured myself a
Aidan> certain re-baptism of fire. If I'm not mistaken, I already
Aidan> tried to configure these links as point-to-point
Aidan> connections: (1) to avoid the DR election process and (2) to
Aidan> save on IP addresses. For some reason, I forget why, it did not
Aidan> seem to work. This would be the best approach to stability, I
Aidan> would have thought, as we avoid DR election altogether, but as you
Aidan> say if I lengthen the dead-interval it may ride out whatever
Aidan> caused the adjacency to drop. The trouble is I don't think
Aidan> anything on the physical side caused the adjacency to drop in
Aidan> the first place, so just increasing the dead time may well
Aidan> not achieve stability. Why decrease the hello interval?
>>
Aidan> Regarding the new code. This may be a stupid question but
Aidan> here goes. As you know these routers are live and as I am the
Aidan> world's worst admin, I have different kernels on different
Aidan> routers. If I remember correctly the xorp code I'm running on
Aidan> the different routers was compiled off-line on a different
Aidan> machine, so they are running code that was not compiled
Aidan> against the actual running kernel headers. Xorp does not
Aidan> produce kernel modules, so will this ever matter? How far can
Aidan> I push this?
>>
Aidan> The important question: Is each binary produced at compile
Aidan> time stand-alone? I.e., if I compile the CVS sources off-line
Aidan> and then take just the ospfv2 binary, will this work with the
Aidan> other binaries that are already running on the live network?
Aidan> Can I just replace the ospf module or are there dependencies
Aidan> in the other binaries that will cause this approach to break
Aidan> things? The modular approach would of course be nice!
>>
Aidan> Thanks for help so far, and I'm working on the logging
Aidan> issue. I had this working in the past and now it doesn't hmmm
Aidan> strange?
>>
Aidan> Thanks again Aidan
>>
>>
Aidan> On Thu, 2007-12-06 at 20:13 -0800, Atanu Ghosh wrote:
>> >> Hi,
>> >>
>> >> The problem that I see is that both routers (89.248.141.221
>> >> and 89.248.141.222) are announcing a Network-LSA for the subnet
>> >> 89.248.141.196/30. There should only be one Network-LSA per
>> >> subnet. An unfortunate side effect of this behaviour is that, from
>> >> the perspective of the LSA database, routers 89.248.141.221 and
>> >> 89.248.141.222 are not connected, hence no routes.
>> >>
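
The "one Network-LSA per subnet" rule can be checked mechanically against a database dump. The following is a minimal sketch, assuming the Network-LSA entries have already been parsed into tuples; the values are copied by hand from the database output quoted in this thread, and the function name `originators_by_subnet` is hypothetical.

```python
import ipaddress
from collections import defaultdict

# (Link State ID, advertising router, network mask) for each Network-LSA
network_lsas = [
    ("89.248.141.194", "89.248.141.222", 0xFFFFFFFC),
    ("89.248.141.198", "89.248.141.221", 0xFFFFFFFC),
    ("89.248.141.197", "89.248.141.222", 0xFFFFFFFC),
]


def originators_by_subnet(lsas):
    """Group advertising routers by the subnet their Network-LSA describes."""
    subnets = defaultdict(set)
    for link_state_id, adv_router, mask in lsas:
        netmask = str(ipaddress.IPv4Address(mask))
        # strict=False masks off the host bits of the Link State ID
        subnet = ipaddress.ip_network(f"{link_state_id}/{netmask}", strict=False)
        subnets[str(subnet)].add(adv_router)
    return dict(subnets)


by_subnet = originators_by_subnet(network_lsas)
# Any subnet with more than one originator violates the one-DR-per-subnet rule.
duplicates = {s: sorted(r) for s, r in by_subnet.items() if len(r) > 1}
print(duplicates)
```

On this data only 89.248.141.196/30 is flagged, with both 89.248.141.221 and 89.248.141.222 as originators.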
>> >> Only the designated router (DR) for a subnet should be announcing
>> >> a Network-LSA. From the output of "show ospf4 neighbor detail",
>> >> router 89.248.141.221 doesn't consider itself to be the DR
>> >> (router 89.248.141.222 is the DR, 89.248.141.197 is its interface
>> >> address).
>> >>
>> >> From the sequence number and age of the Network-LSA generated by
>> >> router 89.248.141.221 we can see that it was initially announced
>> >> 6 mins 48 secs ago, which is similar to the 8 mins 32 secs the
>> >> adjacency has existed.
>> >>
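
The reasoning from sequence number and age can be written out as arithmetic. This is a sketch only: the durations are the ones stated in the paragraph above, and 0x80000001 is the initial LS sequence number defined by the OSPFv2 specification (RFC 2328), so an LSA still carrying it has not been re-originated since it first appeared.

```python
# First sequence number an LSA can carry (InitialSequenceNumber, RFC 2328)
INITIAL_SEQUENCE_NUMBER = 0x80000001


def to_seconds(minutes, seconds):
    """Convert a minutes/seconds pair to seconds."""
    return minutes * 60 + seconds


lsa_seq = 0x80000001               # Network-LSA from router 89.248.141.221
lsa_age = to_seconds(6, 48)        # time since the LSA was announced
adjacency_age = to_seconds(8, 32)  # time the adjacency has existed

# An LSA still at the initial sequence number is its first instance,
# so its age is the time since it was first announced.
first_instance = lsa_seq == INITIAL_SEQUENCE_NUMBER
# How long after the adjacency formed the LSA first appeared.
lag = adjacency_age - lsa_age
print(first_instance, lag)
```

The result is a first-instance LSA that appeared 104 seconds after the adjacency formed, i.e. after a DR should already have been chosen.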
>> >> Router 89.248.141.221 is announcing a Network-LSA even though it
>> >> is not the DR. The odd part is that the Network-LSA was initially
>> >> announced after the adjacency was formed and a DR had already
>> >> been selected.
>> >>
>> >> Could you try running the latest code from CVS? Some DR election
>> >> issues have been fixed since the last release.
>> >>
>> >> My guess would be that the problem only occurs after a loss of
>> >> adjacency. The default settings for hello-interval and
>> >> router-dead-interval are 10 and 40 seconds respectively. You
>> >> could try decreasing the hello-interval and increasing
>> >> router-dead-interval until we get to the bottom of this.
>> >>
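
To make the effect of the two timers concrete, the sketch below counts how many hellos fit into one router-dead interval. The "tuned" values of 5 and 80 seconds are hypothetical, chosen only for illustration, and the loss estimate assumes each hello is lost independently, which is a simplification for a wireless link.

```python
def hellos_per_dead_interval(hello_interval, router_dead_interval):
    """Number of hellos a neighbour sends within one router-dead interval."""
    return router_dead_interval // hello_interval


def chance_all_lost(loss_rate, hello_interval, router_dead_interval):
    """Probability that every hello in one dead interval is lost.

    Assumes independent losses with the same loss_rate per packet.
    """
    return loss_rate ** hellos_per_dead_interval(hello_interval, router_dead_interval)


default = hellos_per_dead_interval(10, 40)  # defaults quoted above
tuned = hellos_per_dead_interval(5, 80)     # hypothetical tuned values
print(default, tuned)
print(chance_all_lost(0.5, 10, 40))
```

With the defaults, four consecutive hellos must be lost before the adjacency drops; with the hypothetical tuned values it takes sixteen.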
>> >> Next time you see a problem:
>> >> 1) $ print_lsas -S save.lsas
>> >> 2) show ospf4 neighbor detail (On the router in question and the
>> >>    neighbour)
>> >> Just to confirm the problem.
>> >>
>> >> I will try and reproduce the problem.
>> >>
>> >> Atanu.
>> >>
>> >> OSPF link state database, Area 0.0.0.0
>> >> Type     ID               Adv Rtr         Seq         Age  Opt  Cksum   Len
>> >> Network  89.248.141.197   89.248.141.222  0x8000003b  409  0x2  0xaf92  32
>> >> Network  *89.248.141.198  89.248.141.221  0x80000001  409  0x2  0x2458  32
>> >>
>> >> >>>>> "Aidan" == Aidan Walton <awalton at wires3.net> writes:
>> >>
Aidan> Hi Atanu, Now that I'm over the panic of collecting the data
Aidan> and recovering the network, I can see that the neighbour
Aidan> seems to have been up for 29hrs but adjacent for only
Aidan> 8mins. Clearly the adjacency has been dropped and even though
Aidan> it has recovered it no longer imports the routes from the
Aidan> database. Unfortunately because I do not have the logs, as
Aidan> explained in the last email, I have no trace of the failure
Aidan> :( Note the tcpdump of the hellos in both directions. I
Aidan> thought if the adjacency failed a database update would be
Aidan> forced when it is re-established.
>> >>
Aidan> Oh BTW I checked my logs on the interfaces and the physicals
Aidan> have been up all the time.
>> >>
Aidan> What's up with ospf? Aidan
>> >>
Aidan> On Wed, 2007-12-05 at 01:00 -0800, Atanu Ghosh wrote:
>> >> >> Hi,
>> >> >>
>> >> >> The output that it would be good to see before and after the
>> >> >> problem occurs.
>> >> >> 1) $ netstat -nr
>> >> >> 2) Xorp> show interfaces
>> >> >> 3) Xorp> show route table ipv4 unicast final
>> >> >> 4) Xorp> show ospf4 neighbor detail
>> >> >> 5) Xorp> show ospf4 database detail
>> >> >> 6) $ print_lsas -S save.lsas
>> >> >> The print_lsas program can be found in ospf/tools directory.
>> >> >> The program stores the LSA database in a form that can be
>> >> >> replayed.
>> >> >>
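
Since the time available during an outage is short, the six captures listed above could be scripted ahead of time. The sketch below is one possible way to do that and rests on assumptions that are not confirmed in this thread: that `xorpsh` accepts `-c <command>` to run a single operational command, and that `print_lsas` is on the PATH (it lives in the ospf/tools directory). The function names and the output directory layout are hypothetical.

```python
import subprocess
from pathlib import Path

# Operational commands from the list above, to be run through xorpsh
XORPSH_COMMANDS = [
    "show interfaces",
    "show route table ipv4 unicast final",
    "show ospf4 neighbor detail",
    "show ospf4 database detail",
]


def build_commands():
    """Return the six captures in the order given in the list above."""
    commands = [["netstat", "-nr"]]
    # Assumption: xorpsh runs one operational command when given "-c"
    commands += [["xorpsh", "-c", command] for command in XORPSH_COMMANDS]
    # Assumption: print_lsas (from ospf/tools) is on the PATH
    commands.append(["print_lsas", "-S", "save.lsas"])
    return commands


def capture(outdir, run=subprocess.run):
    """Run every command and store its output in a numbered file under outdir."""
    out = Path(outdir)
    out.mkdir(parents=True, exist_ok=True)
    for index, command in enumerate(build_commands(), start=1):
        result = run(command, capture_output=True, text=True)
        (out / f"{index:02d}.txt").write_text(result.stdout + result.stderr)
```

Calling `capture("before")` while the network is healthy and `capture("after")` once the problem appears would give two directories that can be compared file by file.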
>> >> >> You can also enable tracing in ospf:
>> >> >> traceoptions {
>> >> >>     flag {
>> >> >>         all {
>> >> >>             disable: false
>> >> >>         }
>> >> >>     }
>> >> >> }
>> >> >>
>> >> >> Which should show routes being added and deleted.
>> >> >>
>> >> >> The latest code in CVS has a "clear ospf4 database" command;
>> >> >> it would be interesting to know whether running it solves the
>> >> >> problem once the problem has occurred.
>> >> >>
>> >> >> It might also be interesting to keep the "ip mon" command
>> >> >> running to track routes being added and deleted.
>> >> >>
>> >> >> Would it be possible at some off peak time to flap the ADSL
>> >> >> link to see if this replicates the problem? I know that you
>> >> >> have stated that there were no ADSL issues when the problem
>> >> >> occurred, but I do wonder if we are seeing some issue related
>> >> >> to dynamic interfaces.
>> >> >>
>> >> >> Atanu.
>> >> >>
>> >> >> >>>>> "Aidan" == Aidan Walton <awalton at wires3.net> writes:
>> >> >>
Aidan> Hi, The adjacency runs over a wireless link between the
Aidan> routers. It can, very possibly, drop in and out, but as far
Aidan> as I can see this did not happen and to be honest in the 9
Aidan> months I have had this system up I have never seen the
Aidan> wireless link drop, but packet corruption could be a
Aidan> possibility and this may be less easy to diagnose. It is a
Aidan> high power 5.8GHz connection, here in the UK this is a
Aidan> licensed band (and yes I have a license). So I don't think
Aidan> interference is the likely cause, though I wouldn't rule this
Aidan> out. If I look at the logs from the same period I see
Aidan> nothing to indicate the interface flapped, I would see the
Aidan> wireless dis-associate and re-associate and cypher exchange
Aidan> and this did not happen. But as I say there could be a period
Aidan> of high BER on the links. I thought ospf would handle this
Aidan> reasonably gracefully? I have to say heavy BER was not
Aidan> evident when I came to repair the network, or at least I
Aidan> didn't detect it and in the past I have run ospf over another
Aidan> one of my wireless links with stations 10km apart with the
Aidan> wireless link almost non-functional, dropping packets left
Aidan> right and centre and re-associating over and over, but xorp's
Aidan> ospf never complained! I was beginning to suspect that this
Aidan> was related to my adsl link on the suspect router, as this is
Aidan> a dynamic interface and I have this defined independently of
Aidan> xorp. If this interface flaps then the default route
Aidan> associated with the adsl ppp session is withdrawn. The
Aidan> default from the adsl line is not propagated into ospf
Aidan> though, instead I use a static default with a higher metric
Aidan> pointed at the loopback and inject this into ospf
Aidan> instead. Then the flaps of the adsl line do not cause churn
Aidan> in the ospf domain. I was starting to think that the addition
Aidan> and removal of the default from the adsl line was affecting
Aidan> the kernel table and this was upsetting xorp's ospf. However
Aidan> this morning when this happened the adsl line was stable. As
Aidan> far as my logs look it suddenly decided to stop functioning
Aidan> with no correlated events from other system processes. The
Aidan> only thing in the logs at the same time is iptables dropping
Aidan> DoS attacks, but this is normal, unfortunately far too normal.
Aidan> show ospf4 neighbour simply stated 'full'; there is only one
Aidan> neighbour defined on this router. I didn't look this time at
Aidan> show interfaces, but from memory of the last time this
Aidan> happened this also was normal. The problem is that these
Aidan> routers are mounted 10m high up telegraph poles. If I lose
Aidan> connectivity it requires a ladder and a climbing harness to
Aidan> get at them, this is not to mention my upset customers who,
Aidan> as is normal with customers, do not delay in telling me they
Aidan> have lost their Internet links. I suppose what I'm trying to
Aidan> understand is how to be best prepared for next time, logging,
Aidan> processes and checks during the failure period to grab as
Aidan> much useful info before I am forced to restart xorp and get
Aidan> my customers up and running again. This is a very short
Aidan> period I have to say. I have a small group of business units
Aidan> supported on this router and all hell breaks loose if this
Aidan> happens during working hours. How can I get the maximum
Aidan> logging info from the xorp processes? Anything I can do in
Aidan> order that you can help me, will be dutifully carried
Aidan> out. What next, any suggestions? Thanks Aidan I will On Tue,
Aidan> 2007-12-04 at 12:19 -0800, Atanu Ghosh wrote:
>> >> >>
Atanu> Hi,
>> >> >>
Atanu> The scenario that you describe would be perfectly normal if
Atanu> the connectivity between the "suspect" router and the
Atanu> "adjacent" router is lost. Although I would expect the "show
Atanu> ospf4 neighbor" to show the state of the adjacency to be
Atanu> "Down" not "Full". When an OSPF router loses its adjacencies
Atanu> the LSA database will slowly time out; however, the routes
Atanu> will be withdrawn as soon as the adjacencies are lost.
>> >> >>
Atanu> We will require more information to diagnose the problem. Next
Atanu> time the problem occurs, the output of "show interfaces" and
Atanu> "show ospf4 neighbor" would be very useful.
>> >> >>
Atanu> XORP tracks the state of interfaces, in particular the carrier
Atanu> state. If OSPF believes that the Ethernet has been
Atanu> disconnected it will stop attempting to send hello
Atanu> packets. Is it possible that there is a problem with an
Atanu> interface or cable between the two routers?
>> >> >>
Atanu> Atanu.
>> >> >> >>>>> "Aidan" == Aidan Walton <awalton at wires3.net> writes:
>> >> >>
Aidan> Hi All, I am using xorp in a production environment,
Aidan> admittedly a small one. I operate a local WISP and xorp is
Aidan> running on my wireless nodes. I have a very simple
Aidan> configuration and really I could probably get away with
Aidan> static routing throughout the entire network, but I wanted to
Aidan> try xorp and see just how stable it was. However as I expand
Aidan> the network I am having second thoughts. It is not good at
Aidan> all when a network goes up in smoke and I can't explain why
Aidan> or predict when and what the causes are. The network has
Aidan> been in operation 24x7 for around 9 months. I am running on a
Aidan> Linux kernel 2.6.18-4 and for the vast majority of the time I
Aidan> have no issues. However now for the fourth time I see the
Aidan> same problem: Suddenly the Linux kernel and the xorp rib
Aidan> become detached. Normally all routes in the kernel match
Aidan> those that xorp is generating, receiving and electing as
Aidan> active. I am running OSPF and the neighbour states remain
Aidan> 'full' throughout but if I am not mistaken I see ospf hellos
Aidan> only in one direction (i.e. nothing being transmitted from the
Aidan> router I suspect). The lsdb of OSPF on the suspect and
Aidan> adjacent routers contain all the routes but they are aging
Aidan> out slowly on the adjacent router. When I look at the kernel
Aidan> routes those from OSPF have already vanished. I can see the
Aidan> ospf process running on the offending router, and again I can
Aidan> see the ospf lsdb intact and correct. When I restart xorp the
Aidan> system recovers and the routes appear in the kernel again. I
Aidan> suspect a problem with ospf. I tried enabling traceoptions on
Aidan> the ospf process, but in fact I needed to restart all the
Aidan> xorp processes before this actually became active. I now have
Aidan> this running so if/when it happens again I might be able to
Aidan> offer some more information. Does anyone have any experience
Aidan> of ospf being unstable? Any suggestions how I might more
Aidan> effectively capture some logs from this event? I do not see
Aidan> any options for logging the fea process. Is there anything I
Aidan> can enable to help diagnose the issue? Many thanks, and of
Aidan> course cheers for the code in the first place. Aidan