[Xorp-hackers] Patch to get rid of two system calls per asyncio send.
Bruce M Simpson
bms at incunabulum.net
Mon Mar 24 13:25:20 PDT 2008
This seems like a good time and place to lay down the law about how
asyncio.cc got more complicated when I was dragged into the game to
make it work on Windows...
Pavlin Radoslavov wrote:
> Ben Greear <greearb at candelatech.com> wrote:
>> Asyncio was disabling and enabling SIGPIPE for each send. At least on Linux
>> (and probably BSD), we can use MSG_NOSIGNAL in most cases. Attached is a patch
>> that implements this. Not specifically benchmarked, but it's always good to
>> get rid of extra system calls...
> I agree that we should get rid of extra system calls.
> However, this part of the code is very critical and we want to be
> very careful with it (e.g., it has been changed by a number of
> people in the past and it might be quite fragile).
I second Pavlin. This code is risky to modify without detailed testing
across all the supported platforms.
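For reference, here is a minimal POSIX sketch of the two approaches under discussion. It is not Ben's patch and not the asyncio.cc code; the function names are hypothetical. The old path brackets every send() with two sigaction() calls to ignore SIGPIPE; the proposed path passes MSG_NOSIGNAL, which Linux and some BSDs support but not every platform does (macOS lacks it), hence the fallback.

```cpp
// Hypothetical helpers illustrating the two ways of stopping a send() to a
// dead peer from killing the process with SIGPIPE. Not taken from asyncio.cc.
#include <cerrno>
#include <signal.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

// Old approach: ignore SIGPIPE around each send(), at the cost of two
// extra sigaction() system calls per write.
ssize_t send_with_sigpipe_toggle(int fd, const void* buf, size_t len)
{
    struct sigaction ignore_action {};
    struct sigaction saved_action {};
    ignore_action.sa_handler = SIG_IGN;
    sigemptyset(&ignore_action.sa_mask);
    sigaction(SIGPIPE, &ignore_action, &saved_action);  // extra syscall #1
    ssize_t n = send(fd, buf, len, 0);
    int saved_errno = errno;                             // keep send()'s error
    sigaction(SIGPIPE, &saved_action, nullptr);          // extra syscall #2
    errno = saved_errno;
    return n;
}

// Proposed approach: ask the kernel not to raise SIGPIPE for this call only.
// Fall back to the toggling path where MSG_NOSIGNAL is not defined.
ssize_t send_without_sigpipe(int fd, const void* buf, size_t len)
{
#ifdef MSG_NOSIGNAL
    return send(fd, buf, len, MSG_NOSIGNAL);
#else
    return send_with_sigpipe_toggle(fd, buf, len);
#endif
}

// Usage: write to a socket whose peer has already gone away. Returns the
// errno reported by send_without_sigpipe(), or -1 if setup failed.
int demo_send_to_dead_peer()
{
    int sv[2];
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) != 0)
        return -1;
    close(sv[1]);                          // the peer disappears
    errno = 0;
    ssize_t n = send_without_sigpipe(sv[0], "x", 1);
    int err = (n < 0) ? errno : 0;         // expect EPIPE, and no signal
    close(sv[0]);
    return err;
}
```

Either way the caller sees EPIPE instead of a signal; the difference is purely the per-send system call count, which is why the change looks attractive on POSIX.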
It took MONTHS of pain to get asyncio.cc working correctly under
Windows, and even then, I didn't completely understand what was going on.
So, I cheated. What follows is a tour down my memory lane...
At one point I was proposing turning the I/O model upside down to fit
what NT does; obviously I had to reconsider my approach, as this would
have taken too much development time, as well as being an overly
radical design change.
There is some special magic going on there, which is necessary to make
sure data gets in and out of Winsock's I/O thread without resorting to
such a change.
1. In NT, all read and write operations block -- there is no such thing
as non-blocking I/O for "ordinary" NT file descriptors.
Winsock attempts to emulate it up to a point, however only for very [...]
The MSDN documentation explicitly states, in a number of places, that
I/O Completion Ports are the preferred mechanism for high volume/low
latency Winsock processing.
[We do more special magic to enable XORP processes, such as xorpsh, to
read from an NT console or pipe in an apparently non-blocking way, see
win_con_read() and win_pipe_read() in win_io.c.]
2. In Winsock, socket events dispatched using the WSAEventSelect()
mechanism are edge-triggered, not level-triggered (in the sense of
digital logic design).
The NT synchronisation primitives used to actually signal conditions
are Event objects, created via the WSACreateEvent() API.
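The difference between edge- and level-triggered readiness can be shown with a toy model. This is not Winsock code and is deliberately simplified: it signals only on the empty to non-empty transition of a receive buffer, while real Winsock has its own rules for when FD_READ is re-armed, which this model does not reproduce.

```cpp
// Toy model, not Winsock code: one receive buffer observed two ways.
#include <cstddef>

struct ReceiveBuffer {
    std::size_t pending = 0;      // bytes waiting to be read
    bool edge_latched = false;    // one-shot "something arrived" signal

    // Data arrives from the network.
    void deliver(std::size_t n)
    {
        if (pending == 0 && n > 0)
            edge_latched = true;  // only the empty -> non-empty transition signals
        pending += n;
    }

    // Level-triggered view (as with POSIX select()): ready while data is pending.
    bool level_ready() const { return pending > 0; }

    // Edge-triggered view: consuming the signal resets it.
    bool take_edge()
    {
        bool was_latched = edge_latched;
        edge_latched = false;
        return was_latched;
    }

    // The application reads up to max_bytes.
    std::size_t consume(std::size_t max_bytes)
    {
        std::size_t n = pending < max_bytes ? pending : max_bytes;
        pending -= n;
        return n;
    }
};
```

If 10 bytes arrive and the application reads only 4, the level-triggered view still reports the remaining 6 bytes, but the edge has already been consumed. A loop that waits only for the next edge would stall with data still buffered, which is the situation points 3 and 4 deal with.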
3. The generation of IOT_READ ("this file descriptor has data pending to
be read") requires that a context switch to Winsock's thread is forced
in order for background I/O processing to happen.
Attempting to read data without such a context switch will simply cause
the process's primary thread to block forever.
Furthermore, it is possible for unread data to sit in one of Winsock's
buffers *without* the IOT_READ event having been generated. In that case,
taking the context switch is unnecessarily expensive, and slows things
down until the Winsock I/O thread effects a poll on our behalf ("Oh, I
forgot to tell you, there's data waiting for you...") -- this is why the
call to FIONREAD is there; without it, the delay plays havoc with XRL latency.
See the EDGE_TRIGGERED_READ_LATENCY define for the code which
implements this path.
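The FIONREAD idea from point 3 can be sketched as "ask how many bytes are already buffered before going back to wait". FIONREAD exists both as a POSIX ioctl() request and as a Winsock ioctlsocket() command. pending_bytes() and should_read_now() are hypothetical names; this is not the EDGE_TRIGGERED_READ_LATENCY code itself.

```cpp
// Sketch: query buffered-but-unread bytes before waiting for a read event.
#ifdef _WIN32
#include <winsock2.h>

long pending_bytes(SOCKET s)
{
    u_long n = 0;
    return ioctlsocket(s, FIONREAD, &n) == 0 ? static_cast<long>(n) : -1;
}
#else
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <unistd.h>

long pending_bytes(int fd)
{
    int n = 0;
    return ioctl(fd, FIONREAD, &n) == 0 ? static_cast<long>(n) : -1;
}
#endif

// If data is already sitting in the buffer, read it now rather than
// waiting for a read event that an edge-triggered backend may deliver
// late (or not at all).
bool should_read_now(long pending)
{
    return pending > 0;
}

#ifndef _WIN32
// Usage: bytes written by the peer show up via FIONREAD before any
// read() and without any readiness event being consumed.
long demo_pending_after_write()
{
    int sv[2];
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) != 0)
        return -1;
    long before = pending_bytes(sv[0]);   // nothing sent yet
    long after = -1;
    if (write(sv[1], "hello", 5) == 5)
        after = pending_bytes(sv[0]);     // 5 bytes now buffered
    close(sv[0]);
    close(sv[1]);
    return before == 0 ? after : -1;
}
#endif
```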
4. The disposition of IOT_WRITE ("this file descriptor may be written
to") is edge triggered in Winsock, not level triggered as POSIX select()
is; writes are also handled in the Winsock I/O thread.
We cannot simply write() as much as we can until the call would block,
and then have our event handler invoked when the descriptor becomes
writable again, as is the case in POSIX environments; instead we must
re-enter the EventLoop, causing a call to WaitForMultipleObjects() and
thus a context switch.
As such, it's necessary to add an XorpTask up front in order to service
writes, as there is no way of knowing that the descriptor is ready to
write to *until* we have forced a context switch, giving Winsock a
chance to tell us that it is!
See the EDGE_TRIGGERED_WRITE define for the code which implements this
path.
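The "add a task up front" pattern from point 4 can be modelled without any Winsock code. In this sketch, enqueuing data schedules a task so the next event loop pass attempts the write, instead of waiting for a writable notification that an edge-triggered backend might never send. ToyEventLoop and ToyWriter are hypothetical stand-ins, not XORP's EventLoop or XorpTask API, and the "send" is simulated by appending to a string.

```cpp
// Toy model of servicing writes via a task scheduled at enqueue time.
#include <deque>
#include <functional>
#include <string>
#include <utility>

class ToyEventLoop {
public:
    void add_task(std::function<void()> t) { tasks_.push_back(std::move(t)); }

    // One pass of the loop: run the tasks queued before this pass began.
    void run_once()
    {
        std::deque<std::function<void()>> batch;
        batch.swap(tasks_);
        for (auto& t : batch)
            t();
    }

private:
    std::deque<std::function<void()>> tasks_;
};

class ToyWriter {
public:
    explicit ToyWriter(ToyEventLoop& loop) : loop_(loop) {}

    void enqueue(const std::string& data)
    {
        queued_ += data;
        if (!task_pending_) {
            task_pending_ = true;
            // Schedule the write attempt now, rather than waiting for an
            // IOT_WRITE-style readiness event.
            loop_.add_task([this] { flush(); });
        }
    }

    const std::string& sent() const { return sent_; }

private:
    void flush()
    {
        task_pending_ = false;
        sent_ += queued_;   // stand-in for the real send()
        queued_.clear();
    }

    ToyEventLoop& loop_;
    std::string queued_;
    std::string sent_;
    bool task_pending_ = false;
};
```

Two enqueues before the loop runs share one scheduled task, and nothing is "sent" until the loop takes its next pass, which mirrors the forced trip back through the EventLoop described above.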
5. IOT_DISCONNECT is signalled as a separate Winsock event, see [...]
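By way of contrast with point 5: on POSIX, an orderly disconnect is not a separate event at all. The descriptor polls as readable and recv() returns 0, whereas Winsock reports connection closure through its own FD_CLOSE network event. The helper below is a hypothetical POSIX-only sketch of that detection, not code from asyncio.cc.

```cpp
// POSIX-only sketch: detect an orderly peer shutdown without a dedicated event.
#include <poll.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

// Returns true if fd is readable and the peer has shut down cleanly.
bool peer_has_disconnected(int fd)
{
    pollfd p{};
    p.fd = fd;
    p.events = POLLIN;
    if (poll(&p, 1, 0) <= 0)
        return false;                          // nothing to report yet
    char c;
    return recv(fd, &c, 1, MSG_PEEK) == 0;     // 0 bytes means orderly shutdown
}

// Usage: no disconnect while the peer is open, one after it closes.
bool demo_disconnect_is_detected()
{
    int sv[2];
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) != 0)
        return false;
    bool before = peer_has_disconnected(sv[0]);
    close(sv[1]);
    bool after = peer_has_disconnected(sv[0]);
    close(sv[0]);
    return !before && after;
}
```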
The above probably sounds clear and straightforward in hindsight,
but it's worth bearing in mind that it took several months of speculative
work to pull it off.
We had to make these design changes because the emulation of select() in
Windows may only be used with sockets, and furthermore, it cannot deal
with mixed address families, which was a dealbreaker for IPv6 support.
Obviously these techniques aren't necessary if using NT I/O Completion
Ports or NT threads as the dispatch mechanism; however, those are out of
scope for XORP, for reasons which should be self-explanatory from the
above. If not, read the forthcoming thread on cross-language support.
The knowledge herein should probably be more widely disseminated, for
the benefit of folk porting POSIX applications to native Windows.
Please don't break any of it :-)