[Xorp-hackers] xorp-ospf performance issues: busy-spins & packet floods.

Pavlin Radoslavov pavlin at icir.org
Fri Oct 19 22:03:58 PDT 2007


Ben Greear <greearb at candelatech.com> wrote:

> So, after I got fea to quit crashing, I was able to do some tests with
> dynamic interfaces & OSPF.
> 
> With 15 routers, the system load goes to ~20 and ospf cannot seem to
> get out of init.
> 
> I notice that ospf and fea are taking large amounts of CPU during
> and after xorpsh activity.
> 
> I did an strace on xorp_ospf2 and it looks like it is sitting in
> a loop doing selects but not actually reading/writing the descriptors
> for much of the time.
> 
> If I remember correctly, this was the same problem that xorpsh showed
> before Pavlin put the fix in to sleep for 10ms.  Maybe a similar
> fix is needed for ospf2?

The xorpsh fix is semi-hackish and xorpsh-specific so it shouldn't
be applied to OSPF or FEA.

> Even then, it seems that we should not need to busy-spin even at 10ms.
> We should be able to set a longer timeout and wait on select to tell
> us when we have messages and/or can send data.

The eventloop is used to handle different events (I/O, timers,
tasks) and in the process it is calling select(2) probably more than
necessary. E.g., it needs to find the event with the highest
priority and the login for doing that probably could be optimized to
reduce some of the system calls.
This is probably the reason why you see the FEA calling select(2)
several times, but this is just a speculation.

Thanks,
Pavlin

> Here' strace from xorp_ospfv2:
> 
> clock_gettime(CLOCK_MONOTONIC, {19988, 604437813}) = 0
> clock_gettime(CLOCK_MONOTONIC, {19988, 604521828}) = 0
> select(48, [25 31 33 36 38 40 42 45 46 47], [46 47], [], {0, 0}) = 3 (in [46], out [46 47], left {0, 0})
> select(48, [25 31 33 36 38 40 42 45 46 47], [46 47], [], {0, 0}) = 3 (in [46], out [46 47], left {0, 0})
> clock_gettime(CLOCK_MONOTONIC, {19988, 604834293}) = 0
> clock_gettime(CLOCK_MONOTONIC, {19988, 604918543}) = 0
> clock_gettime(CLOCK_MONOTONIC, {19988, 605048836}) = 0
> time(NULL)                              = 1192851387
> time(NULL)                              = 1192851387
> clock_gettime(CLOCK_MONOTONIC, {19988, 605290834}) = 0
> clock_gettime(CLOCK_MONOTONIC, {19988, 605397338}) = 0
> select(48, [25 31 33 36 38 40 42 45 46 47], [46 47], [], {0, 0}) = 3 (in [46], out [46 47], left {0, 0})
> select(48, [25 31 33 36 38 40 42 45 46 47], [46 47], [], {0, 0}) = 3 (in [46], out [46 47], left {0, 0})
> clock_gettime(CLOCK_MONOTONIC, {19988, 605731294}) = 0
> clock_gettime(CLOCK_MONOTONIC, {19988, 605842029}) = 0
> clock_gettime(CLOCK_MONOTONIC, {19988, 606542801}) = 0
> time(NULL)                              = 1192851387
> time(NULL)                              = 1192851387
> clock_gettime(CLOCK_MONOTONIC, {19988, 606813953}) = 0
> clock_gettime(CLOCK_MONOTONIC, {19988, 606927272}) = 0
> select(48, [25 31 33 36 38 40 42 45 46 47], [46 47], [], {0, 0}) = 3 (in [46], out [46 47], left {0, 0})
> select(48, [25 31 33 36 38 40 42 45 46 47], [46 47], [], {0, 0}) = 3 (in [46], out [46 47], left {0, 0})
> clock_gettime(CLOCK_MONOTONIC, {19988, 607256007}) = 0
> clock_gettime(CLOCK_MONOTONIC, {19988, 607339372}) = 0
> clock_gettime(CLOCK_MONOTONIC, {19988, 608347317}) = 0
> time(NULL)                              = 1192851387
> time(NULL)                              = 1192851387
> clock_gettime(CLOCK_MONOTONIC, {19988, 608595070}) = 0
> clock_gettime(CLOCK_MONOTONIC, {19988, 608678070}) = 0
> select(48, [25 31 33 36 38 40 42 45 46 47], [46 47], [], {0, 0}) = 3 (in [46], out [46 47], left {0, 0})
> select(48, [25 31 33 36 38 40 42 45 46 47], [46 47], [], {0, 0}) = 3 (in [46], out [46 47], left {0, 0})
> clock_gettime(CLOCK_MONOTONIC, {19988, 608989096}) = 0
> clock_gettime(CLOCK_MONOTONIC, {19988, 609073655}) = 0
> clock_gettime(CLOCK_MONOTONIC, {19988, 609176878}) = 0
> time(NULL)                              = 1192851387
> time(NULL)                              = 1192851387
> clock_gettime(CLOCK_MONOTONIC, {19988, 609418106}) = 0
> clock_gettime(CLOCK_MONOTONIC, {19988, 609500577}) = 0
> select(48, [25 31 33 36 38 40 42 45 46 47], [46 47], [], {0, 0}) = 3 (in [46], out [46 47], left {0, 0})
> select(48, [25 31 33 36 38 40 42 45 46 47], [46 47], [], {0, 0}) = 3 (in [46], out [46 47], left {0, 0})
> clock_gettime(CLOCK_MONOTONIC, {19988, 609814101}) = 0
> clock_gettime(CLOCK_MONOTONIC, {19988, 609934343}) = 0
> clock_gettime(CLOCK_MONOTONIC, {19988, 610732073}) = 0
> time(NULL)                              = 1192851387
> time(NULL)                              = 1192851387
> clock_gettime(CLOCK_MONOTONIC, {19988, 610919469}) = 0
> clock_gettime(CLOCK_MONOTONIC, {19988, 610982308}) = 0
> select(48, [25 31 33 36 38 40 42 45 46 47], [46 47], [], {0, 0}) = 3 (in [46], out [46 47], left {0, 0})
> select(48, [25 31 33 36 38 40 42 45 46 47], [46 47], [], {0, 0}) = 3 (in [46], out [46 47], left {0, 0})
> clock_gettime(CLOCK_MONOTONIC, {19988, 611307881}) = 0
> clock_gettime(CLOCK_MONOTONIC, {19988, 611391965}) = 0
> clock_gettime(CLOCK_MONOTONIC, {19988, 612422222}) = 0
> 
> 
> fea is also taking up lots of CPU, but it seems like it might be doing
> at least some real work.  It is running multiple selects before it actually
> reads from the FD, so it still seems like it could be optimized.
> 
> clock_gettime(CLOCK_MONOTONIC, {20125, 745575383}) = 0
> clock_gettime(CLOCK_MONOTONIC, {20125, 745683166}) = 0
> select(60, [24 25 26 27 28 29 30 31 32 33 34 35 36 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 59], [], [], {0, 0}) = 1 (in [57], left {0, 0})
> select(60, [24 25 26 27 28 29 30 31 32 33 34 35 36 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 59], [], [], {0, 0}) = 1 (in [57], left {0, 0})
> select(60, [24 25 26 27 28 29 30 31 32 33 34 35 36 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 59], [], [], {3, 441222}) = 1 (in [57], left {3, 441222})
> clock_gettime(CLOCK_MONOTONIC, {20125, 746149812}) = 0
> recvmsg(57, {msg_name(16)={sa_family=AF_INET, sin_port=htons(0), sin_addr=inet_addr("10.13.14.13")}, msg_iov(1)=[{"E\300\0P\253\314\0\0\1Y\24\252\n\r\16\r\340\0\0\5\2\4\0"..., 65536}], msg_controllen=24, {cmsg_len=24, cmsg_level=SOL_IP, cmsg_type=, ...}, msg_flags=0}, 0) = 80
> time(NULL)                              = 1192851524
> time(NULL)                              = 1192851524
> clock_gettime(CLOCK_MONOTONIC, {20125, 746558226}) = 0
> clock_gettime(CLOCK_MONOTONIC, {20125, 746641971}) = 0
> 
> At least part of fea's problem is that it is getting ~2000 packets per second for
> interfaces not it's own.  I'm not sure at all why all the traffic...as under 'normal'
> load it was only getting about 100 per second.
> 
> Here's a snippet of a packet capture.  Looks like two routers have
> gone insane and are flooding each other with LS Updates.
> 
> 
> [root at lf1016-55 local]# tshark -n -i 11.16.11
> Capturing on 11.16.11
>    0.000000    99.1.1.10 -> 224.0.0.5    OSPF LS Update
>    0.000111    99.1.1.10 -> 224.0.0.5    OSPF LS Update
>    0.000135    99.1.1.11 -> 224.0.0.5    OSPF LS Acknowledge
>    0.000157    99.1.1.11 -> 224.0.0.5    OSPF LS Update
>    0.000178    99.1.1.11 -> 224.0.0.5    OSPF LS Update
>    0.000198    99.1.1.11 -> 224.0.0.5    OSPF LS Acknowledge
>    0.000219    99.1.1.11 -> 224.0.0.5    OSPF LS Update
>    0.000239    99.1.1.10 -> 224.0.0.5    OSPF LS Update
>    0.000260    99.1.1.10 -> 224.0.0.5    OSPF LS Update
>    0.000285    99.1.1.11 -> 224.0.0.5    OSPF LS Update
>    0.000306    99.1.1.11 -> 224.0.0.5    OSPF LS Acknowledge
>    0.000326    99.1.1.11 -> 224.0.0.5    OSPF LS Update
>    0.000346    99.1.1.11 -> 224.0.0.5    OSPF LS Update
> 
> Thanks,
> Ben
> 
> 
> -- 
> Ben Greear <greearb at candelatech.com>
> Candela Technologies Inc  http://www.candelatech.com
> 
> _______________________________________________
> Xorp-hackers mailing list
> Xorp-hackers at icir.org
> http://mailman.ICSI.Berkeley.EDU/mailman/listinfo/xorp-hackers



More information about the Xorp-hackers mailing list