[Xorp-hackers] potential race condition in run_command.cc?

Michael Larson mike at vyatta.com
Thu Jul 20 16:24:21 PDT 2006


Pavlin,

OK--here are the details that I can provide (to get additional details I'll need to created another image).

It looks as if the call to block_child_signals() does not block signals from the child. Below is the sequence of events that I observed:

1) main thread calls popen2()
2) main thread spawns child process and child executes command
3) child_handler is called and waitpid returns the pid of the created process in step #2 above
4) child_handler fails to find pid in pid2command map after calling the non-blocking waitpid a second time
5) popen2() returns and execution continues within RunCommandBase::execute(), where the pid is now inserted into the pid2command
6) at this point the rtrmgr will hang. I have not isolated the exact location, but I know that the done callback is never called for this command when these sequence of steps occur.

Now with that said you've given my an idea what is up here. In a version of xorp here we are linking to pthreads--not doing much with threads beyond linking to the library at this point, but I'm wondering if this is what is causing sigprocmask() to fail in block_child_signals()?

Mike

----- Original Message -----
From: Pavlin Radoslavov <pavlin at icir.org>
To: Michael Larson <mike at vyatta.com>
Cc: xorp-hackers at xorp.org
Sent: Thursday, July 20, 2006 3:50:21 PM GMT-0800
Subject: Re: [Xorp-hackers] potential race condition in run_command.cc?

> We've been occasionally seeing a hang of the rtrmgr on startup. In
> tracing the cause it looks as if there is a race condition in how
> the run_command.cc executes the popen2() method.
> 
> Specifically,
> 
> libxorp/run_command.cc (snippet from RunCommandBase::execute()):
> 
>     _pid = popen2(_command, _argument_list, _stdout_stream, _stderr_stream,
>                   redirect_stderr_to_stdout());
>     //    XLOG_TRACE(true, "RunCommandBase::execute() Executing program: 6");
>     if (_stdout_stream == NULL) {
>         XLOG_ERROR("Failed to execute command \"%s\"", final_command.c_str());
>         cleanup();
>         _exec_id.restore_saved_exec_id(error_msg);
>         return (XORP_ERROR);
>     }
>     // Insert the new process to the map of processes
>     XLOG_ASSERT(pid2command.find(_pid) == pid2command.end());
>     pid2command[_pid] = this;
> 
> 
> 
> The std::map pid2command inserts the pid after executing
> popen2(). Inside of popen2() the process is forked and the
> _command is executed in the child process.
> 
> When the process has completed it is expected to signal the parent
> process through the method child_handler() (in run_command.cc):
> 
> child_handler(int signo)
> {
>     XLOG_ASSERT(signo == SIGCHLD);
> 
>     //
>     // XXX: Wait for any child process.
>     // If we are executing any child process outside of the RunProcess
>     // mechanism, then the waitpid() here may create a wait() blocking
>     // problem for those processes. If this ever becomes an issue, then
>     // we should try non-blocking waitpid() for each pid in the
>     // pid2command map.
>     //
>     do {
>         pid_t pid = 0;
>         int wait_status = 0;
>         map<pid_t, RunCommandBase *>::iterator iter;
> 
>         pid = waitpid(-1, &wait_status, WUNTRACED | WNOHANG);
>         debug_msg("pid=%d, wait status=%d\n", XORP_INT_CAST(pid), wait_status);
>         if (pid <= 0)
>             return;     // XXX: no more child processes
> 
>         XLOG_ASSERT(pid > 0);
>         popen2_mark_as_closed(pid, wait_status);
>         iter = pid2command.find(pid);
>         if (iter == pid2command.end()) {
> 
> 
> 
> However, if the pid is not found in the map pid2command the
> child_handler() continues to loop and wait for the child process
> to complete (and signal its completion via waitpid).
> 
> So, the problem is that on commands that immediately
> return/complete the child_handler() before the pid2command map is
> updated, the child_handler() will be called before the pid2command
> has the new pid and the command will never complete.

What you describe in the above paragraph shouldn't happen, because
we call block_child_signals() right before popen2().

Can you provide more specific details about the problem you are
seeing. E.g., is the rtrmgr stuck forever in the "do..while" loop,
or are there child processes which have terminated, but the
waitpid() never detects them?
If this is not the case, then are you saying that child_handler() is
indeed called before popen2() completes despite the
block_child_signals().

Note that child_handler() might be called when any child terminates,
but then the "do..while" loop will take care of waitpid() for all
child processes that have terminated. Hence, the long comment inside
child_handler() is probably not true (i.e., the code should work
even if there was a child process that wasn't executed by the
RunCommand mechanism).

Regards,
Pavlin

> Our version of xorp has more program statements in the template
> files therefore it seems as if it is a bit easier for us to
> stumble on this condition. Finally, I have a fix for this
> condition, but  it is more of a patch than an elegant fix. I think
> the fix should be something along the lines where the pid2command
> needs to be updated before the exec() call is made in popen.

_______________________________________________
Xorp-hackers mailing list
Xorp-hackers at icir.org
http://mailman.ICSI.Berkeley.EDU/mailman/listinfo/xorp-hackers



More information about the Xorp-hackers mailing list