[Xorp-hackers] Patch to get rid of two system calls per asyncio send.

Bruce M Simpson bms at incunabulum.net
Mon Mar 24 13:25:20 PDT 2008


Hi,

This seems like a good time and place to lay down the law about how 
asyncio.cc got more complicated, when I was dragged into the game to 
make it work inside Windows...

Pavlin Radoslavov wrote:
> Ben Greear <greearb at candelatech.com> wrote:
>   
>> Asyncio was disabling and enabling SIGPIPE for each send.  At least on Linux
>> (and probably BSD), we can use MSG_NOSIGNAL in most cases.  Attached is a patch
>> that implements this.  Not specifically benchmarked, but it's always good to
>> get rid of
>> extra system calls...
>>     
>
> I agree that we should get rid of extra system calls.
> However, this part of the code is very critical and we want to be
> very careful with it (e.g., it has been changed by a number of
> people in the past and it might be quite fragile).
>   

I second Pavlin. This code is risky to modify without performing 
detailed testing across all the supported platforms.
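
For anyone following along, the POSIX-side idea in Ben's patch can be 
sketched roughly as below. This is not the patch itself and not XORP's 
actual code: send_nosigpipe() and errno_after_peer_close() are 
hypothetical names made up for illustration, and the sketch assumes a 
platform that defines MSG_NOSIGNAL (Linux does; others may not).

```cpp
#include <cassert>
#include <cerrno>
#include <cstddef>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

// Assumption: where MSG_NOSIGNAL is missing we fall back to flags of 0,
// and the caller must still block or ignore SIGPIPE around the send.
#ifdef MSG_NOSIGNAL
static const int kSendFlags = MSG_NOSIGNAL;
#else
static const int kSendFlags = 0;
#endif

// Hypothetical helper: send without raising SIGPIPE on a dead peer.
// Returns bytes sent, or -1 with errno set (EPIPE if the peer is gone).
ssize_t send_nosigpipe(int fd, const void* buf, std::size_t len)
{
    return ::send(fd, buf, len, kSendFlags);
}

// Demonstration: close one end of a socket pair, then send on the other.
// Returns the errno observed by the sender, 0 if the send succeeded,
// or -1 if the socket pair could not be created.
int errno_after_peer_close()
{
    int sv[2];
    if (::socketpair(AF_UNIX, SOCK_STREAM, 0, sv) != 0)
        return -1;
    ::close(sv[1]);                       // peer goes away

    const char msg[] = "x";
    ssize_t n = send_nosigpipe(sv[0], msg, sizeof(msg));
    int err = (n < 0) ? errno : 0;        // save errno before close()
    ::close(sv[0]);
    return err;
}
```

With MSG_NOSIGNAL available, the failed send reports EPIPE through 
errno instead of delivering a signal, which is what saves the two 
extra system calls per send. None of this says anything about the 
Windows path, which is the subject of the rest of this mail.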

It took MONTHS of pain to get asyncio.cc working correctly under 
Windows, and even then, I didn't completely understand what was going on.

So, I cheated. What follows is a tour down my memory lane...

At one point I was proposing turning the I/O model upside down to fit 
what NT does. I had to reconsider that approach, as it would have taken 
too much development time, as well as being an overly intrusive change.

There is some special magic going on there, which is necessary to make 
sure data gets in and out of Winsock's I/O thread without resorting to 
a radical design change.

To summarise:

1. In NT, all read and write operations block -- there is no such thing 
as non-blocking I/O for "ordinary" NT file descriptors.
 Winsock attempts to emulate it up to a point, but only for very 
specific APIs.
 The MSDN documentation explicitly states, in a number of places, that 
I/O Completion Ports are the preferred mechanism for high volume/low 
latency Winsock processing.
 [We do more special magic to enable XORP processes, such as xorpsh, to 
read from an NT console or pipe in an apparently non-blocking way; see 
win_con_read() and win_pipe_read() in win_io.c.]

2. In Winsock, socket events dispatched using the WSAEventSelect() 
mechanism are edge-triggered, not level-triggered (in the sense of 
digital logic design).
 The NT synchronisation primitives used to actually signal conditions 
are Event objects, created via the WSACreateEvent() API.

3. The generation of IOT_READ ("this file descriptor has data pending to 
be read") requires that a context switch to Winsock's thread is forced 
in order for background I/O processing to happen.
 Attempting to read data without such a context switch will simply cause 
the process's primary thread to block forever.
 Furthermore, it is possible for unread data to sit in one of Winsock's 
buffers *without* the IOT_READ event having been generated. In that case 
taking the context switch is unnecessarily expensive, and slows things 
down until the Winsock I/O thread effects a poll on our behalf ("Oh, I 
forgot to tell you, there's data waiting for you..."). This is why the 
call to FIONREAD is there; otherwise it plays havoc with XRL latency.
 See the EDGE_TRIGGERED_READ_LATENCY define for the code which 
implements this path.

4. The disposition of IOT_WRITE ("this file descriptor may be written 
to") is edge triggered in Winsock, not level triggered as POSIX select() 
is; writes are also handled in the Winsock I/O thread.
  We cannot simply write() as much as we can, block, and have our event 
handler invoked as is the case in POSIX environments; instead we must 
reenter the EventLoop, causing a call to WaitForMultipleObjects() and 
thus a context switch.
  As such it's necessary to add a XorpTask upfront in order to service 
writes, as there is no way of knowing that the descriptor is ready to 
write to, *until* we have forced a context switch, giving Winsock a 
chance to tell us that it is!
  See the EDGE_TRIGGERED_WRITE define for the code which implements this 
path.

5. IOT_DISCONNECT is signalled as a separate Winsock event; see 
BufferedAsyncReader.
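
To make the edge-triggered versus level-triggered distinction in 
points 2 and 4 concrete, here is a toy model in portable C++. It is 
NOT Winsock code and not what asyncio.cc does: FakeSocket and its 
methods are hypothetical, and the re-arming rule (a new edge only after 
a write came up short and space was later freed) is an assumption that 
only loosely mirrors how write readiness is signalled.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Toy stand-in for a socket send buffer (hypothetical, illustration only).
class FakeSocket {
public:
    explicit FakeSocket(std::size_t capacity)
        : capacity_(capacity), used_(0),
          edge_pending_(true),     // one initial "writable" edge
          write_blocked_(false) {}

    // Accept up to n bytes; returns how many were taken.
    std::size_t write(std::size_t n) {
        std::size_t taken = std::min(n, capacity_ - used_);
        used_ += taken;
        if (taken < n)
            write_blocked_ = true; // the writer hit a full buffer
        return taken;
    }

    // The peer consumes up to n bytes, freeing buffer space.
    void drain(std::size_t n) {
        std::size_t freed = std::min(n, used_);
        used_ -= freed;
        if (write_blocked_ && freed > 0) {
            write_blocked_ = false;
            edge_pending_ = true;  // state changed: raise a new edge
        }
    }

    // Level-triggered view, like POSIX select(): true whenever there is room.
    bool level_writable() const { return used_ < capacity_; }

    // Edge-triggered view: reports each transition once, then goes quiet.
    bool edge_writable() {
        bool fired = edge_pending_;
        edge_pending_ = false;
        return fired;
    }

private:
    std::size_t capacity_;
    std::size_t used_;
    bool edge_pending_;
    bool write_blocked_;
};
```

In this model a second call to edge_writable() returns false even 
though level_writable() still returns true. A caller that waits only 
for the edge can therefore stall on a descriptor that is perfectly 
writable, which is the kind of situation the up-front XorpTask in 
point 4 is there to avoid.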

The above probably sounds very clear, and straightforward, in hindsight, 
but it's worth bearing in mind it took several months of speculative 
work to pull it off.

We had to make these design changes because the emulation of select() in 
Windows may only be used with sockets, and furthermore, it cannot deal 
with mixed address families, which was a dealbreaker for IPv6 support.

Obviously these techniques aren't necessary if using NT I/O Completion 
Ports or NT threads as the dispatch mechanism. However, those are out of 
scope for XORP, for reasons which should be self-explanatory from the 
above; if not, read the future thread on cross-language support.

The knowledge herein should probably be more widely disseminated, for 
the benefit of folk porting POSIX applications to native Windows.

Please don't break any of it :-)

cheers
BMS


