[Xorp-hackers] XORP IPC mechanism is stressed by data plane traffic

edrt edrt@citiz.net
Thu, 24 Mar 2005 22:08:45 +0800


>
>:( It appears that once multicast forwarding is enabled in the
>kernel, the burst of multicast data packets quickly exhausts the
>system resources and triggers the XRL failures.
>If you want to push this solution further, then you could introduce
>another XRL from PIM to the MFEA that will be called by PIM after
>its configuration is completed. This XRL itself will trigger the
>enabling of multicast forwarding in the kernel.
>
>However, there is no guarantee this will indeed fix the problem.
>In your case, it happens during the startup configuration, but in
>general there is the possibility this may happen even during normal
>router operation (e.g., the XRLs exchanged among the XORP modules
>may be affected).
>

Yes.

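For what it's worth, here is a rough sketch of what such a "configuration
done" hook could look like on the MFEA side. All names are hypothetical
and the real code would go through the generated XRL interfaces, so treat
it as an illustration of the idea rather than a patch:

    // Hypothetical sketch, not the real MFEA code or XRL interface:
    // the idea is to defer enabling kernel multicast forwarding until
    // PIM reports that its configuration is complete.
    #include <set>
    #include <string>

    class Mfea {
    public:
        Mfea() : _forwarding_enabled(false) {}

        // Handler for a new XRL, e.g. "mfea/0.1/protocol_config_done",
        // called by PIM once its startup configuration has finished.
        int protocol_config_done(const std::string& protocol) {
            _configured.insert(protocol);
            if (_configured.count("PIM") > 0 && !_forwarding_enabled) {
                // Only now turn on multicast forwarding in the kernel,
                // so the burst of data packets cannot race the
                // remaining startup XRLs.
                start_kernel_multicast_forwarding();
                _forwarding_enabled = true;
            }
            return 0;   // the real handler would return an XrlCmdError
        }

    private:
        void start_kernel_multicast_forwarding() {
            // In the real MFEA this is where multicast forwarding is
            // switched on in the kernel (e.g. the MRT_INIT setsockopt).
        }

        std::set<std::string> _configured;
        bool                  _forwarding_enabled;
    };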

>> Today I tried some tuning, and it seems to fix most of the problems:
>> 
>>  1) Increase the network mbufs; this suppresses most of the ENOBUFS errors.
>> 
>>  2) Add EWOULDBLOCK to is_pseudo_error; call_xrl then usually returns
>>    successfully after a second/third... read attempt, but it consumes
>>    considerable time.
>> 
>>  3) Increase the XORP tasks' priority above that of the data forwarding
>>    task (i.e. the task doing most of the IP stack processing); this makes
>>    call_xrl return successfully almost immediately.
>> 
>> I suspect these tunings may well cause other problems, but considering
>> only the original problem I encountered, they make all the XORP
>> components work normally even with overloading external multicast
>> traffic.
>> 
>> Could anyone comment on the possible side effects of these tunings?
>> (They are cheap solutions, and I might use them.) What I can think of
>> right now:
>> 
>>  #1 seems harmless, but it decreases the available free system memory.
>> 
>>  #2 I have no idea what problems it might cause; any ideas?
>> 
>>  #3 this may starve the data forwarding task if the protocol tasks
>>     consume too much time in their processing. Besides checking each
>>     protocol task's implementation, is there any other way to solve
>>     this problem? (EventLoop::run might help to detect some of the
>>     problems.)
>
>If you don't increase the XORP tasks' priority does it still work?
>Also, if you add ENOBUFS to is_pseudo_error without increasing the
>network mbuf does it still work?

 * With low-priority XORP tasks, call_xrl is very slow: reads fail
   repeatedly with EWOULDBLOCK until call_xrl eventually times out.

 * With high-priority XORP tasks, a low mbuf limit, and ENOBUFS ignored
   in is_pseudo_error, all XORP components complain about ENOBUFS when
   they communicate.
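
For reference, this is roughly what the is_pseudo_error change amounts
to. It is a simplified sketch; the real helper in libxipc may have a
different name, signature, or set of cases:

    #include <cerrno>

    // Sketch of tuning #2: treat EWOULDBLOCK (and, optionally, ENOBUFS
    // when the mbuf pool is not enlarged) as transient, retryable
    // conditions rather than fatal I/O errors.
    static bool
    is_pseudo_error(int error_num)
    {
        switch (error_num) {
        case EINTR:                 // interrupted system call
        case EAGAIN:                // nothing to read/write yet
    #if defined(EWOULDBLOCK) && (EWOULDBLOCK != EAGAIN)
        case EWOULDBLOCK:           // distinct from EAGAIN on some systems
    #endif
        case ENOBUFS:               // transient mbuf exhaustion
            return true;            // caller should retry later
        default:
            return false;           // genuine error, fail the transport
        }
    }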

>What I am afraid of is that even if increasing the mbuf limit in your
>kernel appears to help fix the problem in your setup, this may not be
>true if the amount of multicast traffic is higher.
>

Emm, I'll research this more. But at first sight, there may be a stable
packet receive rate that the device driver can sustain when the device
is overloaded, and that rate is not always proportional to the external
packet injection rate. If so, we can size the mbuf pool based on this
assumption.
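
To make that concrete with purely illustrative numbers (the rate R, the
hold time T, and the reserve figure below are my assumptions and would
have to be measured on the target box):

    // Back-of-envelope mbuf sizing under the assumption of a stable
    // receive rate when the device is overloaded.  Illustrative only.
    constexpr double R = 50000.0;   // sustained receive rate, packets/sec
    constexpr double T = 0.002;     // worst-case time a packet holds an mbuf, sec
    constexpr double headroom = 2.0;        // safety factor
    constexpr int    xrl_reserve = 256;     // buffers reserved for XRL/IPC traffic
    constexpr int    needed_mbufs =
        static_cast<int>(R * T * headroom) + xrl_reserve;   // ~456 here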


>> 
>> BTW,
>> I encountered the following problem during stress testing. Because I'm
>> using ported XORP source code that is NOT based on the latest CVS HEAD,
>> I cannot guarantee you can reproduce the problem, but just in case
>> anybody is interested in tracking it...
>> 
>>  * in a high-volume data traffic environment
>>  * without the tuning above
>>  * try to send XRL commands to a XORP component through call_xrl
>>  * call_xrl's XrlPFSTCPSender dies with EWOULDBLOCK/ENOBUFS...
>> 
>> the task calling call_xrl core dumps, and the stack looks something
>> like
>
>Is this with the original XORP call_xrl binary called by a script
>(or exec()-ed), or is this in the in-process code that has been
>derived from call_xrl?

It is the in-process code derived from call_xrl.


>Also, could you tell the particular reason for the coredump (e.g.,
>invalid pointer, etc), because I cannot decode the log below.
>

I haven't found the reason (that's why I pasted the stack information),
but it is triggered by XrlPFSTCPSender::die, and from the stack
information it seems that response_handler doesn't return properly.
Again, I'm not sure it can be reproduced on the XORP CVS HEAD version.
Anyway, can we move this to Bugzilla and track it there?
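
Purely to illustrate the kind of failure I suspect (generic C++, not
XORP code, and only an assumption until the core dump is decoded): if
die() ends up destroying the sender, or state that the pending
response_handler still references, while that handler is on the call
stack, then the handler's return path walks over freed memory.

    #include <cerrno>

    // Generic illustration of the suspected failure mode: die() deletes
    // the object while one of its own callbacks is still executing, so
    // anything the callback touches afterwards is freed memory.
    struct Sender {
        int pending_requests;

        void die() {
            delete this;                // 'this' now dangles
        }

        void handle_response(int error) {
            if (error != 0)             // e.g. the EWOULDBLOCK/ENOBUFS path
                die();
            --pending_requests;         // use-after-free if die() ran
        }
    };

    int main() {
        Sender* s = new Sender{1};
        s->handle_response(ENOBUFS);    // undefined behaviour (use of freed memory)
        return 0;
    }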



Thanks
Eddy