[Xorp-hackers] Runtime diagnostics and callbacks in backtraces

Wed Nov 18 06:58:23 PST 2009

Ben Greear wrote:
>
> If you just had the original XRL in text format and/or binary format,
> you could use gdb and/or live error-handling code to print it out.  You
> could post-process this manually or programatically to decode what
> the XRL command is, for instance.

I'm confused why preserving the XRL would make a difference here.

The XRL target handlers aren't callbacks; they normally aren't invoked 
by any other means other than the XrlRouter class handling an incoming 
RPC request. In any event, the handlers themselves never get to see the 
raw XRL; they just get passed its arguments.

These issues always arise in any async programming environment, e.g. 
including Twisted Python; however that language has built-in 
introspection for call stack frames.

>
> Given the current call chain, the backtrace is likely to stop at the
> timer handler or task dispatcher, which still gives you no clue about
> what actually created the timer or task.

This is certainly the case for in-process XorpTimers  or XorpTasks, 
which is probably the root of our shared frustration.

>
> A first-class object would let us easily add a __LINE__ and __FILE__ 
> where it
> was created at, for instance.  This could be printed out later with or 
> without
> fancy backtrace support in the toolchain.

Have you looked at DEBUG_CALLBACKS, which already does some of this?

The problem with it is that it doesn't have much fine granularity -- if 
you turn it on for everything, the code will wedge.
It's also a bit invasive for production use.

>
> But, what would be even better is to decrease the number of callbacks
> and just process code in a linear manner and pass back correct
> return codes and/or message strings.  Then the backtraces will
> actually show the originator of the actions.

It is a more general pattern, and I would be quite surprised if 
solutions haven't already been found elsewhere.

However, asking for asynchronous code to be made linear, is pretty much 
calling for a design change which won't happen. The whole XORP 
architecture is designed to run in an event-driven manner. In all of the 
code that I've read, I haven't seen any which was async when it didn't 
need to be.

One of the quirks which Quagga's design was infamous for, was that 
management commands would block out the router itself from actually 
running. For example, it was possible to tie up the BGP process by 
dumping out its routing table, miss updates, and drop the BGP session as 
a result.

XORP isn't completely immune to this condition, because there's still a 
possibility of racing an update timer; XRL server-side routines are 
synchronous, and there is no preemption. By and large it avoids it, 
though, by being event-driven.

>
> I wouldn't mind re-writing the router manager to do this more
> to my liking, but I'll wait until you post your xorp-ipc rewrite
> first.

It might interest you to know that the Router Manager is not being used 
in the commercial XORP product. Although this has more to do with how 
it's being deployed.

I think it was a design mistake to implement the process control in C++, 
but that's just my opinion.

> ...
> That still has template and typedef shit all over it.  I really don't
> think we should need templates for this, and a very few, if any,
> typedefs.

Callback libraries are intended to be generic, and in C++, templates are 
usually the mechanism by which generic constructs are realized, as 
they're type-agnostic. Boost.Function is no different from libxorp in 
this respect.

Code which dispatches callbacks normally doesn't need to know about 
bound arguments, although it might conceivably have to know about 
dispatch-time arguments. Those are however wholly dependent on the 
context in which the function template "callback()" is being used.

We could take a position whether they're semantic sugar, or just a means 
of getting the job done.

Templates aren't going away from C++ any time soon (or the XORP code 
base, for that matter), so let's make what incremental changes we can to 
improve the situation now.

 >  With proper accounting for object ownership, we could probably get
rid of ref-ptrs w/out undue pain as well.

Now that's a good point.

Something which the smart pointer(s) lack is a means of actually dumping 
out the shared ownership. In kernel code scenarios, we typically need to 
track exactly who the consumers of a given resource are, so that we can 
gracefully shut down user processes or other threads.

Given that code which needs to dispatch callbacks, will need to hold a 
refcount on what is a dynamically allocated Memento holding the 
callback's state, there's a clear notion of ownership.

When I peeked at the Boost internals last week, class sp_counted_base 
was used to track the watch/hold counts, but it doesn't track the owners 
by default. e.g. most mutex implementations will support tracking the 
resource owners. Turns out it does, if you define 
BOOST_SP_ENABLE_DEBUG_HOOKS, but it's an 'int' id space, and doesn't 
track the context in which the reference is actually held, which is 
really what we're asking for here.

So having said all that, it's probably best to make the __FILE__ and 
__LINE__ glue part of callback()...
which is something we already have, actually. It just has to be enabled 
appropriately.

regards,
BMS