error response from select on Windows
Nate Rosenblum
2014-02-27 23:21:01 UTC
Hi,

Recently I've been investigating the mysterious termination of my event
dispatch loop on a Windows system, with libevent 2.0.21. This has happened
extremely rarely, and only in an EC2 virtualized environment. The system is
not OOMing, so I suspect an error return from select in win32_dispatch
(which is the only place that dispatch could return an error). Assume for
the sake of discussion that I'm not passing invalid socket descriptors or
timeouts, and that I have plenty of events queued to keep the loop from
exiting because it is empty (a minidump of the process shows 100 events queued on
the event_base). The MSDN documentation lists several possible errors (e.g.
WSAENETDOWN), but certainly I never see those returned during transient
network failures (e.g. switching wireless networks). Does anyone have any
ideas?

Best,

--nate
Nate Rosenblum
2014-02-28 21:12:50 UTC
Ok, I have tracked this down to a call to `bufferevent_free` on a
bufferevent that is outstanding in win32_dispatch's invocation of select.
This is happening at the behest of an application-level timeout. The
algorithm looks roughly like this:

bufev = bufferevent_socket_new(...);
bufferevent_setcb(..., connectionCb, ...);
bufferevent_socket_connect_hostname(...);
// Above repeated for several alternative connections

// ...

// Register application-level global timeout
timeout = evtimer_new(base, timeoutCb, ...);
evtimer_add(timeout, tv);

// ...

void timeoutCb(evutil_socket_t fd, short what, void *arg) {
    // bufev is the bufferevent for the connection above
    // (really several are processed)
    bufferevent_free(bufev);
}

If the timeout fires while we're still waiting for a response on the
connect for the underlying fd and we're using a select-based backend, the
close will cause select to return an error and the dispatch loop will bail
out. This is certainly the case for both select and win32select backends; I
have not checked whether closing the descriptor also causes the kqueue or
*poll interfaces to bail out.
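
As a rough illustration of the failure mode described above, the following POSIX-only sketch shows select() failing with EBADF when its read set still names a descriptor that has already been closed. It is not libevent code, and the function name select_errno_for_closed_fd is made up for this example. On Winsock the corresponding error is documented as WSAENOTSOCK; that path is not exercised here.

```c
#include <errno.h>
#include <sys/select.h>
#include <unistd.h>

/* Hypothetical helper: returns the errno that select() reports when its
 * read set contains a descriptor that was closed beforehand, 0 if
 * select() unexpectedly succeeds, or -1 if the pipe cannot be created. */
int select_errno_for_closed_fd(void)
{
    int fds[2];
    if (pipe(fds) != 0)
        return -1;

    int stale = fds[0];
    close(fds[0]);
    close(fds[1]);

    fd_set readset;
    FD_ZERO(&readset);
    FD_SET(stale, &readset);        /* the set still names the closed fd */

    struct timeval tv = { 0, 0 };   /* poll, do not block */
    int rc = select(stale + 1, &readset, NULL, NULL, &tv);
    return rc < 0 ? errno : 0;
}
```

A dispatch loop that treats every negative return from select() as fatal would exit at this point, which matches the behaviour reported in this message.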

Is what I am doing even reasonable? The documentation for bufferevent_free
implicitly suggests that it's ok to call while an operation is outstanding,
but it looks to me like doing so will break any select-based implementation.

Best,

--nate
Nick Mathewson
2014-03-01 18:28:35 UTC
There have been some longstanding issues trying to get
bufferevent_free() to work from one thread while the bufferevent is
active in another. We've been trying to get them all straightened out
in the latest libevent 2.1 master branch, but apparently there are
some problems left.

It seems to me that the right response here may be for the select loop
to treat this error as a non-error, and retry. (If we're worried about
looping forever, we could ignore the error only when there's a
pending notification from another thread, indicating that some other
thread has changed the list of pending events.)
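
One possible shape for that retry rule, written against plain POSIX select() rather than libevent's backend code. Everything named here (should_retry_select, select_with_retry, rebuild_fn, changes_pending) is hypothetical, and a real implementation would need proper synchronization around the flag; this is only a sketch of the policy.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/select.h>

/* Hypothetical policy: retry on EINTR always, and on EBADF only when
 * another thread has signalled that the event list changed. */
int should_retry_select(int err, int changes_pending)
{
    if (err == EINTR)
        return 1;
    if (err == EBADF && changes_pending)
        return 1;
    return 0;
}

/* Caller-supplied: refills readset from the current event list and
 * returns the nfds argument for select(). */
typedef int (*rebuild_fn)(fd_set *readset, void *ctx);

/* Hypothetical wrapper around select(). The pending flag is consumed on
 * each failure, so a descriptor that stays bad fails on the next pass
 * instead of looping forever. */
int select_with_retry(rebuild_fn rebuild, void *ctx, fd_set *readset,
                      int *changes_pending, const struct timeval *timeout)
{
    for (;;) {
        int nfds = rebuild(readset, ctx);
        struct timeval tv = *timeout;   /* select() may modify its copy */
        int rc = select(nfds, readset, NULL, NULL, &tv);
        if (rc >= 0)
            return rc;

        int err = errno;
        int pending = *changes_pending;
        *changes_pending = 0;
        if (!should_retry_select(err, pending))
            return -1;
    }
}
```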

The alternative is for the close() to happen in the "finalize
callback" that happens from the main thread in the new 0., and make
sure that happens after the event_del calls have had their effect.

Any interest in helping to track this down? :) One thing that might
help is a little test program that tries to provoke this bug. Beyond
that, improving the finalize-callback code in the master branch would
also be of benefit.

best wishes,
--
Nick
Nate Rosenblum
2014-03-03 18:31:09 UTC
Hi Nick, thanks for the feedback.

Post by Nick Mathewson
The alternative is for the close() to happen in the "finalize
callback" that happens from the main thread in the new 0., and make
sure that happens after the event_del calls have had their effect.
I think this sounds like the right approach for closing socket descriptors;
in general I'm more comfortable when resource deallocation is happening in
a single-threaded context. What's more, I don't think that we should try to
fix this issue by looping the select, as EBADF (or its WSA equivalent) is
probably not a benign error in most cases, but an indication of a
programming error.
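
To make the single-threaded deallocation idea concrete, here is a toy sketch (POSIX only). It is not libevent's finalize-callback code, and the watcher type and its functions are invented for illustration. A callback only marks a descriptor for closing; the loop thread removes it from the watched list and closes it before building the next fd_set, so select() never sees a stale descriptor.

```c
#include <sys/select.h>
#include <unistd.h>

#define MAX_WATCHED 16

/* Toy stand-in for an event loop's list of watched descriptors. */
struct watcher {
    int fds[MAX_WATCHED];
    int closing[MAX_WATCHED];   /* 1 once a close has been requested */
    int count;
};

void watcher_init(struct watcher *w) { w->count = 0; }

int watcher_add(struct watcher *w, int fd)
{
    if (w->count == MAX_WATCHED)
        return -1;
    w->fds[w->count] = fd;
    w->closing[w->count] = 0;
    w->count++;
    return 0;
}

/* Safe to call from a callback: records the request, closes nothing. */
void watcher_request_close(struct watcher *w, int fd)
{
    for (int i = 0; i < w->count; i++)
        if (w->fds[i] == fd)
            w->closing[i] = 1;
}

/* Run by the loop thread before it builds the next fd_set. Marked
 * descriptors are closed and removed from the list, so a later
 * watcher_fill() cannot name them. Returns the number closed. */
int watcher_finalize(struct watcher *w)
{
    int kept = 0, closed = 0;
    for (int i = 0; i < w->count; i++) {
        if (w->closing[i]) {
            close(w->fds[i]);
            closed++;
        } else {
            w->fds[kept] = w->fds[i];
            w->closing[kept] = 0;
            kept++;
        }
    }
    w->count = kept;
    return closed;
}

/* Fills readset from the live list; returns the nfds value for select(). */
int watcher_fill(const struct watcher *w, fd_set *readset)
{
    int maxfd = -1;
    FD_ZERO(readset);
    for (int i = 0; i < w->count; i++) {
        FD_SET(w->fds[i], readset);
        if (w->fds[i] > maxfd)
            maxfd = w->fds[i];
    }
    return maxfd + 1;
}
```

In this arrangement a timeout callback would call watcher_request_close(), and the loop would call watcher_finalize() followed by watcher_fill() at the top of each iteration.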

I do think that checking each of the backends for other non-zero non-error
codes and looping on those is a good idea (as already done with EINTR, for
example). In my first message I mentioned WSAENETDOWN; internet chatter
suggests that might happen occasionally when coming out of sleep and
probably should not terminate the select-based event loop.
Post by Nick Mathewson
Any interest in helping to track this down? :) One thing that might
help is a little test program that tries to provoke this bug. Beyond
that, improving the finalize-callback code in the master branch would
also be of benefit.
Yeah, I'd be happy to help. I have a test program that I can adapt and
publish. I haven't been following the development of bufferevent
finalization in 2.1 but I will start taking a look.

Best,

--nate
