[Xorp-hackers] xorp-ospf performance issues: busy-spins & packet floods.
Ben Greear
greearb at candelatech.com
Fri Oct 19 20:58:46 PDT 2007
So, after I got fea to quit crashing, I was able to do some tests with
dynamic interfaces & OSPF.
With 15 routers, the system load goes to ~20 and ospf cannot seem to
get out of init.
I notice that ospf and fea are taking large amounts of CPU during
and after xorpsh activity.
I did an strace on xorp_ospf2 and it looks like it is sitting in
a loop doing selects but not actually reading/writing the descriptors
for much of the time.
If I remember correctly, this was the same problem that xorpsh showed
before Pavlin put the fix in to sleep for 10ms. Maybe a similar
fix is needed for ospf2?
Even then, it seems that we should not need to busy-spin even at 10ms.
We should be able to set a longer timeout and wait on select to tell
us when we have messages and/or can send data.
Here' strace from xorp_ospfv2:
clock_gettime(CLOCK_MONOTONIC, {19988, 604437813}) = 0
clock_gettime(CLOCK_MONOTONIC, {19988, 604521828}) = 0
select(48, [25 31 33 36 38 40 42 45 46 47], [46 47], [], {0, 0}) = 3 (in [46], out [46 47], left {0, 0})
select(48, [25 31 33 36 38 40 42 45 46 47], [46 47], [], {0, 0}) = 3 (in [46], out [46 47], left {0, 0})
clock_gettime(CLOCK_MONOTONIC, {19988, 604834293}) = 0
clock_gettime(CLOCK_MONOTONIC, {19988, 604918543}) = 0
clock_gettime(CLOCK_MONOTONIC, {19988, 605048836}) = 0
time(NULL) = 1192851387
time(NULL) = 1192851387
clock_gettime(CLOCK_MONOTONIC, {19988, 605290834}) = 0
clock_gettime(CLOCK_MONOTONIC, {19988, 605397338}) = 0
select(48, [25 31 33 36 38 40 42 45 46 47], [46 47], [], {0, 0}) = 3 (in [46], out [46 47], left {0, 0})
select(48, [25 31 33 36 38 40 42 45 46 47], [46 47], [], {0, 0}) = 3 (in [46], out [46 47], left {0, 0})
clock_gettime(CLOCK_MONOTONIC, {19988, 605731294}) = 0
clock_gettime(CLOCK_MONOTONIC, {19988, 605842029}) = 0
clock_gettime(CLOCK_MONOTONIC, {19988, 606542801}) = 0
time(NULL) = 1192851387
time(NULL) = 1192851387
clock_gettime(CLOCK_MONOTONIC, {19988, 606813953}) = 0
clock_gettime(CLOCK_MONOTONIC, {19988, 606927272}) = 0
select(48, [25 31 33 36 38 40 42 45 46 47], [46 47], [], {0, 0}) = 3 (in [46], out [46 47], left {0, 0})
select(48, [25 31 33 36 38 40 42 45 46 47], [46 47], [], {0, 0}) = 3 (in [46], out [46 47], left {0, 0})
clock_gettime(CLOCK_MONOTONIC, {19988, 607256007}) = 0
clock_gettime(CLOCK_MONOTONIC, {19988, 607339372}) = 0
clock_gettime(CLOCK_MONOTONIC, {19988, 608347317}) = 0
time(NULL) = 1192851387
time(NULL) = 1192851387
clock_gettime(CLOCK_MONOTONIC, {19988, 608595070}) = 0
clock_gettime(CLOCK_MONOTONIC, {19988, 608678070}) = 0
select(48, [25 31 33 36 38 40 42 45 46 47], [46 47], [], {0, 0}) = 3 (in [46], out [46 47], left {0, 0})
select(48, [25 31 33 36 38 40 42 45 46 47], [46 47], [], {0, 0}) = 3 (in [46], out [46 47], left {0, 0})
clock_gettime(CLOCK_MONOTONIC, {19988, 608989096}) = 0
clock_gettime(CLOCK_MONOTONIC, {19988, 609073655}) = 0
clock_gettime(CLOCK_MONOTONIC, {19988, 609176878}) = 0
time(NULL) = 1192851387
time(NULL) = 1192851387
clock_gettime(CLOCK_MONOTONIC, {19988, 609418106}) = 0
clock_gettime(CLOCK_MONOTONIC, {19988, 609500577}) = 0
select(48, [25 31 33 36 38 40 42 45 46 47], [46 47], [], {0, 0}) = 3 (in [46], out [46 47], left {0, 0})
select(48, [25 31 33 36 38 40 42 45 46 47], [46 47], [], {0, 0}) = 3 (in [46], out [46 47], left {0, 0})
clock_gettime(CLOCK_MONOTONIC, {19988, 609814101}) = 0
clock_gettime(CLOCK_MONOTONIC, {19988, 609934343}) = 0
clock_gettime(CLOCK_MONOTONIC, {19988, 610732073}) = 0
time(NULL) = 1192851387
time(NULL) = 1192851387
clock_gettime(CLOCK_MONOTONIC, {19988, 610919469}) = 0
clock_gettime(CLOCK_MONOTONIC, {19988, 610982308}) = 0
select(48, [25 31 33 36 38 40 42 45 46 47], [46 47], [], {0, 0}) = 3 (in [46], out [46 47], left {0, 0})
select(48, [25 31 33 36 38 40 42 45 46 47], [46 47], [], {0, 0}) = 3 (in [46], out [46 47], left {0, 0})
clock_gettime(CLOCK_MONOTONIC, {19988, 611307881}) = 0
clock_gettime(CLOCK_MONOTONIC, {19988, 611391965}) = 0
clock_gettime(CLOCK_MONOTONIC, {19988, 612422222}) = 0
fea is also taking up lots of CPU, but it seems like it might be doing
at least some real work. It is running multiple selects before it actually
reads from the FD, so it still seems like it could be optimized.
clock_gettime(CLOCK_MONOTONIC, {20125, 745575383}) = 0
clock_gettime(CLOCK_MONOTONIC, {20125, 745683166}) = 0
select(60, [24 25 26 27 28 29 30 31 32 33 34 35 36 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 59], [], [], {0, 0}) = 1 (in [57], left {0, 0})
select(60, [24 25 26 27 28 29 30 31 32 33 34 35 36 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 59], [], [], {0, 0}) = 1 (in [57], left {0, 0})
select(60, [24 25 26 27 28 29 30 31 32 33 34 35 36 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 59], [], [], {3, 441222}) = 1 (in [57], left {3, 441222})
clock_gettime(CLOCK_MONOTONIC, {20125, 746149812}) = 0
recvmsg(57, {msg_name(16)={sa_family=AF_INET, sin_port=htons(0), sin_addr=inet_addr("10.13.14.13")}, msg_iov(1)=[{"E\300\0P\253\314\0\0\1Y\24\252\n\r\16\r\340\0\0\5\2\4\0"..., 65536}], msg_controllen=24, {cmsg_len=24, cmsg_level=SOL_IP, cmsg_type=, ...}, msg_flags=0}, 0) = 80
time(NULL) = 1192851524
time(NULL) = 1192851524
clock_gettime(CLOCK_MONOTONIC, {20125, 746558226}) = 0
clock_gettime(CLOCK_MONOTONIC, {20125, 746641971}) = 0
At least part of fea's problem is that it is getting ~2000 packets per second for
interfaces not it's own. I'm not sure at all why all the traffic...as under 'normal'
load it was only getting about 100 per second.
Here's a snippet of a packet capture. Looks like two routers have
gone insane and are flooding each other with LS Updates.
[root at lf1016-55 local]# tshark -n -i 11.16.11
Capturing on 11.16.11
0.000000 99.1.1.10 -> 224.0.0.5 OSPF LS Update
0.000111 99.1.1.10 -> 224.0.0.5 OSPF LS Update
0.000135 99.1.1.11 -> 224.0.0.5 OSPF LS Acknowledge
0.000157 99.1.1.11 -> 224.0.0.5 OSPF LS Update
0.000178 99.1.1.11 -> 224.0.0.5 OSPF LS Update
0.000198 99.1.1.11 -> 224.0.0.5 OSPF LS Acknowledge
0.000219 99.1.1.11 -> 224.0.0.5 OSPF LS Update
0.000239 99.1.1.10 -> 224.0.0.5 OSPF LS Update
0.000260 99.1.1.10 -> 224.0.0.5 OSPF LS Update
0.000285 99.1.1.11 -> 224.0.0.5 OSPF LS Update
0.000306 99.1.1.11 -> 224.0.0.5 OSPF LS Acknowledge
0.000326 99.1.1.11 -> 224.0.0.5 OSPF LS Update
0.000346 99.1.1.11 -> 224.0.0.5 OSPF LS Update
Thanks,
Ben
--
Ben Greear <greearb at candelatech.com>
Candela Technologies Inc http://www.candelatech.com
More information about the Xorp-hackers
mailing list