[Xorp-hackers] potential race condition in run_command.cc?

Pavlin Radoslavov pavlin at icir.org
Thu Jul 20 15:50:21 PDT 2006


> We've been occasionally seeing a hang of the rtrmgr on startup. In
> tracing the cause it looks as if there is a race condition in how
> the run_command.cc executes the popen2() method.
> 
> Specifically,
> 
> libxorp/run_command.cc (snippet from RunCommandBase::execute()):
> 
>     _pid = popen2(_command, _argument_list, _stdout_stream, _stderr_stream,
>                   redirect_stderr_to_stdout());
>     //    XLOG_TRACE(true, "RunCommandBase::execute() Executing program: 6");
>     if (_stdout_stream == NULL) {
>         XLOG_ERROR("Failed to execute command \"%s\"", final_command.c_str());
>         cleanup();
>         _exec_id.restore_saved_exec_id(error_msg);
>         return (XORP_ERROR);
>     }
>     // Insert the new process to the map of processes
>     XLOG_ASSERT(pid2command.find(_pid) == pid2command.end());
>     pid2command[_pid] = this;
> 
> 
> 
> The std::map pid2command inserts the pid after executing
> popen2(). Inside of popen2() the process is forked and the
> _command is executed in the child process.
> 
> When the process has completed it is expected to signal the parent
> process through the method child_handler() (in run_command.cc):
> 
> child_handler(int signo)
> {
>     XLOG_ASSERT(signo == SIGCHLD);
> 
>     //
>     // XXX: Wait for any child process.
>     // If we are executing any child process outside of the RunProcess
>     // mechanism, then the waitpid() here may create a wait() blocking
>     // problem for those processes. If this ever becomes an issue, then
>     // we should try non-blocking waitpid() for each pid in the
>     // pid2command map.
>     //
>     do {
>         pid_t pid = 0;
>         int wait_status = 0;
>         map<pid_t, RunCommandBase *>::iterator iter;
> 
>         pid = waitpid(-1, &wait_status, WUNTRACED | WNOHANG);
>         debug_msg("pid=%d, wait status=%d\n", XORP_INT_CAST(pid), wait_status);
>         if (pid <= 0)
>             return;     // XXX: no more child processes
> 
>         XLOG_ASSERT(pid > 0);
>         popen2_mark_as_closed(pid, wait_status);
>         iter = pid2command.find(pid);
>         if (iter == pid2command.end()) {
> 
> 
> 
> However, if the pid is not found in the map pid2command the
> child_handler() continues to loop and wait for the child process
> to complete (and signal its completion via waitpid).
> 
> So, the problem is that on commands that immediately
> return/complete the child_handler() before the pid2command map is
> updated, the child_handler() will be called before the pid2command
> has the new pid and the command will never complete.

What you describe in the above paragraph shouldn't happen, because
we call block_child_signals() right before popen2().

Can you provide more specific details about the problem you are
seeing. E.g., is the rtrmgr stuck forever in the "do..while" loop,
or are there child processes which have terminated, but the
waitpid() never detects them?
If this is not the case, then are you saying that child_handler() is
indeed called before popen2() completes despite the
block_child_signals().

Note that child_handler() might be called when any child terminates,
but then the "do..while" loop will take care of waitpid() for all
child processes that have terminated. Hence, the long comment inside
child_handler() is probably not true (i.e., the code should work
even if there was a child process that wasn't executed by the
RunCommand mechanism).

Regards,
Pavlin

> Our version of xorp has more program statements in the template
> files therefore it seems as if it is a bit easier for us to
> stumble on this condition. Finally, I have a fix for this
> condition, but  it is more of a patch than an elegant fix. I think
> the fix should be something along the lines where the pid2command
> needs to be updated before the exec() call is made in popen.



More information about the Xorp-hackers mailing list