From lizhaous2000 at yahoo.com Mon Nov 2 09:31:42 2009
From: lizhaous2000 at yahoo.com (Li Zhao)
Date: Mon, 2 Nov 2009 09:31:42 -0800 (PST)
Subject: [Xorp-hackers] rtrmgr crash on SIGABRT because of pop_front in task_done
Message-ID: <47960.88678.qm@web58702.mail.re1.yahoo.com>

This is a good link which might be interesting.

http://www.ece.ucsb.edu/~kastner/labyrinth/bug1.txt

--- On Fri, 10/30/09, Li Zhao wrote:

> From: Li Zhao
> Subject: Re: [Xorp-hackers] rtrmgr crash on SIGABRT because of pop_front in task_done
> To: "Ben Greear"
> Cc: xorp-hackers at icir.org
> Date: Friday, October 30, 2009, 10:30 AM
> I thought the task manager was fine. But it might be that the first node was deleted twice, once by this pop_front and once somewhere less obvious.
>
> --- On Thu, 10/29/09, Ben Greear wrote:
>
> > From: Ben Greear
> > Subject: Re: [Xorp-hackers] rtrmgr crash on SIGABRT because of pop_front in task_done
> > To: "Li Zhao"
> > Cc: xorp-hackers at icir.org
> > Date: Thursday, October 29, 2009, 1:26 PM
> > On 10/29/2009 08:16 AM, Li Zhao wrote:
> > > I am puzzled by operator delete(ptr=0x0). But inside deallocate(this=0x8d55238, __p=0x8d55238), the __p is not 0x0. pop_front means "removes and deletes". So was this list node deleted again somewhere else?
> > >
> > > --- On Thu, 10/29/09, Li Zhao wrote:
> > >
> > >> From: Li Zhao
> > >> Subject: [Xorp-hackers] rtrmgr crash on SIGABRT because of pop_front in task_done
> > >> To: xorp-hackers at icir.org
> > >> Date: Thursday, October 29, 2009, 10:54 AM
> > >> I added a new protocol, and I can start it in the CLI with the command "create protocol XXX", but the rtrmgr crashed after the command "delete protocol XXX".
> > >> I can also easily reproduce exactly the same crash with the following steps:
> > >>
> > >> 0. I am running the xorp processes on an embedded system.
> > >> 1. Start rtrmgr from the Linux shell on the system.
> > >> 2. Manually start xorp_static_routes from the Linux shell. This static instance will hijack the XRL channels to rtrmgr.
> > >> 3. Use the CLI command "create protocol static" to start a second xorp_static_routes.
> > >> 4. Use the CLI command "delete protocol static" to stop static.
> > >>    Both xorp_static_routes processes were terminated. Dependent processes like fea, rib and policy were also terminated, and rtrmgr crashed.
> >
> > I ran it under valgrind, and saw this:
> >
> > ==27820== Invalid free() / delete / delete[]
> > ==27820==    at 0x4A05E3F: operator delete(void*) (vg_replace_malloc.c:342)
> > ==27820==    by 0x463531: __gnu_cxx::new_allocator >::deallocate(std::_List_node*, unsigned long) (new_allocator.h:95)
> > ==27820==    by 0x462427: std::_List_base std::allocator >::_M_put_node(std::_List_node*) (stl_list.h:320)
> > ==27820==    by 0x46143B: std::list std::allocator >::_M_erase(std::_List_iterator) (stl_list.h:1431)
> > ==27820==    by 0x45FF0B: std::list std::allocator >::pop_front() (stl_list.h:906)
> > ==27820==    by 0x45DB73: TaskManager::task_done(bool, std::string const&) (task.cc:2256)
> > ==27820==    by 0x465970: XorpMemberCallback2B0 bool, std::string const&>::dispatch(bool, std::string const&) (callback_nodebug.hh:4636)
> > ==27820==    by 0x45C540: Task::step8_report() (task.cc:1998)
> > ==27820==    by 0x4659DF: XorpMemberCallback0B0 Task>::dispatch() (callback_nodebug.hh:306)
> > ==27820==    by 0x449613: Module::terminate_with_prejudice(ref_ptr >) (module_manager.cc:218)
> > ==27820==    by 0x44F63C: XorpMemberCallback0B1 ref_ptr >::dispatch() (callback_nodebug.hh:598)
> > ==27820==    by 0x549D72: OneoffTimerNode2::expire(XorpTimer&, void*) (timer.cc:167)
> > ==27820==  Address 0x50c9340 is 80 bytes inside a block of size 200 alloc'd
> > ==27820==    at 0x4A06FFC: operator new(unsigned long) (vg_replace_malloc.c:230)
> > ==27820==    by 0x42C81F: MasterConfigTree::MasterConfigTree(std::string const&, MasterTemplateTree*, ModuleManager&, XorpClient&, bool, bool) (master_conf_tree.cc:119)
> > ==27820==    by 0x406ED6: Rtrmgr::run() (main_rtrmgr.cc:319)
> > ==27820==    by 0x407E57: main (main_rtrmgr.cc:665)
> >
> > It appears to me that the task-manager object (this) is already deleted when the TaskManager::task_done() method is called.
> >
> > Could probably add some debugging to the destructors and constructors of TaskManager to verify. I have some other things to do first, but will look at this a bit later if no one beats me to it.
> >
> > Thanks,
> > Ben
> >
> > --
> > Ben Greear
> > Candela Technologies Inc  http://www.candelatech.com

From lizhaous2000 at yahoo.com Mon Nov 2 10:10:14 2009
From: lizhaous2000 at yahoo.com (Li Zhao)
Date: Mon, 2 Nov 2009 10:10:14 -0800 (PST)
Subject: [Xorp-hackers] rtrmgr crash on SIGABRT because of pop_front in task_done
In-Reply-To: <47960.88678.qm@web58702.mail.re1.yahoo.com>
Message-ID: <308722.4433.qm@web58707.mail.re1.yahoo.com>

I have tried the example code in the link, and the SIGABRT stack is very similar, except that in frame #4, in function malloc_printerr(), str = munmap_chunk() instead of free().

--- On Mon, 11/2/09, Li Zhao wrote:

> From: Li Zhao
> Subject: Re: [Xorp-hackers] rtrmgr crash on SIGABRT because of pop_front in task_done
> To: "Ben Greear"
> Cc: xorp-hackers at icir.org
> Date: Monday, November 2, 2009, 12:31 PM
> This is a good link which might be interesting.
>
> http://www.ece.ucsb.edu/~kastner/labyrinth/bug1.txt

From bms at incunabulum.net Mon Nov 2 10:12:17 2009
From: bms at incunabulum.net (Bruce Simpson)
Date: Mon, 02 Nov 2009 18:12:17 +0000
Subject: [Xorp-hackers] XRL/Thrift (was: Re: Omitting XrlDB from Router Manager)
In-Reply-To: <4AECBF88.7030704@candelatech.com>
References: <4AE7178B.9000709@incunabulum.net> <4AEAF863.7000500@candelatech.com> <4AEC6BA1.3010007@incunabulum.net> <4AECBF88.7030704@candelatech.com>
Message-ID: <4AEF2101.8070506@incunabulum.net>

Ben Greear wrote:
>
> Do you have an estimate for when you plan to post your changes?

That's a very good question. At the moment, no, but I'll pick one out of the hat.

I'll be conservative, and say XRL/Thrift may be ready to go sometime in mid-December; that's not an official milestone, and please don't hold me to it.

It's the beginning of November now.
I am the only developer, as far as I know, working near-full-time on the community XORP branch, so I'm sure you can appreciate there's a lot of pressure involved.

It's difficult to stay focused on that goal and still provide lifeline support here, so I am trying to keep ahead as much as possible. I know I've had support requests from people on these lists which I have sadly had to drop packets on for this reason.

What I'll do now is post an update about where I'm at, and try to think of tasks which other people could potentially pitch in on.

For the foreseeable future, this is going to involve C++ hacking. That's just how it is -- but there are other ways in which people can get involved with the project, e.g. documentation, support, and tracking down bugs.

One really interesting thing which people have asked for is a Linux version of the LiveCD. This would be a great third-party developer contribution for someone to get involved with.

thanks,
BMS

From bms at incunabulum.net Mon Nov 2 10:21:17 2009
From: bms at incunabulum.net (Bruce Simpson)
Date: Mon, 02 Nov 2009 18:21:17 +0000
Subject: [Xorp-hackers] XORP/Thrift technical background
Message-ID: <4AEF231D.8080404@incunabulum.net>

Executive-level objectives:
 * Find and fix known performance bottlenecks;
 * Remove barriers to uptake by the developer community.

Metaphorically, this is a spinal resection: surgeons will tell you it is one of the more intensive surgical interventions that can be performed on a human being, because of all the inter-system coupling involved.

The last round of metrics from Ohloh.net indicates that XORP is at around the 400 KLOC mark of C++. This includes libxipc, which is around 40 KLOC. This is sizeable by any project's standards.

It's obvious that the solution SHOULD NOT require a rewrite of existing process code. This rules out Boost.ASIO right off the bat, and constrains the scope to just implementing Thrift.

Before this project fully commenced, there was research into the available alternatives. There aren't many; a list can be produced, if that's something people care to see.

Where Thrift stands out above all the others is in cross-language interop. You can use it from just about anything (except C, although that's being worked on). As time goes on, this is probably going to become more important, because of who's using it and why: Spotify, Last.FM, Facebook and others. If you look at http://thriftpuzzle.facebook.com/ you'll see Thrift being cunningly used as a Facebook recruitment tool.

In this project, we're using it to streamline the RPC marshaling in a (possibly distributed) router. There is plenty of common ground here, because of overlapping requirements for low latency and compact representation.

The Thrift developers make it pretty clear that they expect Thrift to run on a reliable, ordered stream protocol (e.g. UNIX file handles, pipes, HTTP sessions, TCP sessions, UNIX domain stream sockets) -- for now. Of course, the price you pay for using a stream to deliver messages is head-of-line blocking: a message can't be handed up until every byte queued ahead of it on the stream has arrived. This dependency comes about largely because Thrift was developed for web services; however, because Thrift is itself message-oriented, there's no reason why a message-oriented transport couldn't be used for the RPC calls in future.

Of course, the trade-off with cross-language interop is that not all languages/frameworks implement asynchronous dispatch, or do so in the same way. It is the proverbial 'garden rake on the ground' -- as we've seen with the Net-SNMP code, if you keep stepping on it, it will hit you in the face time and time again. The above-named startups all do slightly different things to implement server-side scalability in their service offerings, which is where the value is. So, Thrift has left the async programming model as an unanswered question, to date.

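To make the stream-versus-message point above concrete: a message-oriented protocol rides on a byte stream by framing each message, which is what Thrift's TFramedTransport does with a 4-byte big-endian length prefix. The sketch below illustrates only that framing convention; append_frame() and extract_frames() are hypothetical helpers, not Thrift's API.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper: append one message to an outgoing byte stream as
// <4-byte big-endian length><payload>, the layout TFramedTransport uses.
void append_frame(std::string& stream, const std::string& payload) {
    const std::uint32_t n = static_cast<std::uint32_t>(payload.size());
    for (int shift = 24; shift >= 0; shift -= 8) {
        stream.push_back(static_cast<char>((n >> shift) & 0xffu));
    }
    stream += payload;
}

// Hypothetical helper: split as many complete frames as possible off the
// front of a receive buffer. Bytes of an incomplete trailing frame stay in
// 'buf' until more data arrives from the stream.
std::vector<std::string> extract_frames(std::string& buf) {
    std::vector<std::string> frames;
    std::size_t pos = 0;
    while (buf.size() - pos >= 4) {
        std::uint32_t n = 0;
        for (std::size_t i = 0; i < 4; ++i) {
            n = (n << 8) | static_cast<unsigned char>(buf[pos + i]);
        }
        if (buf.size() - pos - 4 < n) {
            break;  // header seen, payload not fully received yet
        }
        frames.push_back(buf.substr(pos + 4, n));
        pos += 4 + n;
    }
    buf.erase(0, pos);
    return frames;
}
```

Frames leave the stream strictly in the order they were written, so a large or partly received frame holds up every frame behind it; that is the head-of-line blocking mentioned above. A message-oriented transport could instead deliver each frame as an independent unit.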
That said, what's in Facebook's tree stays in Facebook's tree, and isn't something we get to see just yet.

Esteve Fernandez (whom I had the pleasure of meeting at LShift, Ltd. over the summer) has done some excellent work on getting Thrift to run on top of AMQP in Twisted Python, with asynchronous dispatch. It's likely that we can borrow from this conceptually if AMQP is implemented in future, but that's far off at the moment. It's an open question.

thanks,
BMS

From bms at incunabulum.net Mon Nov 2 10:34:48 2009
From: bms at incunabulum.net (Bruce Simpson)
Date: Mon, 02 Nov 2009 18:34:48 +0000
Subject: [Xorp-hackers] XRL/Thrift: using Thrift to talk to XORP processes (future)
Message-ID: <4AEF2648.4050509@incunabulum.net>

A few words on using Thrift externally to talk to XORP processes, and what shape that is likely to take.

 * The intention behind the current effort is to replace XRL with the Thrift protocol.
 * In Thrift-speak, every XORP process advertising a service will have the equivalent of a TServerSocket endpoint open for reaching that service and sending it RPC requests.
 * This will use TFramedTransport with TBinaryProtocol.
   There are going to be some issues with this in the beginning which need to be resolved. Rather than try to resolve them myself up front, I'm going to document them as we go along.
 * If you look at the tutorials shipped with Thrift, they're very simple:
   * The Calculator client example opens a single socket to a known Thrift service port.
   * No service discovery (RPC name resolution) is performed.

Now, because XORP processes need to participate in the Finder protocol to discover where the components are (and in some cases, as we've seen, even to communicate with them), it isn't a simple matter of just instantiating a TSocket as Calculator does.

 * It would be unreasonable to expect developers to clone all of the logic in XrlRouter for naming/discovery, so XrlRouter is still going to be needed in these situations.
 * There may be situations where developers need to implement an XRL target in their third-party code, without necessarily getting involved in the sync/async split.
 * This is one very constrained scenario where threading could be useful.
 * In Thrift clients, as they are currently implemented, calls are purely blocking (synchronous), but language-native.
 * However: Thrift's client stubs always pass a 0 sequence number (seqid) in their T_CALL messages.
 * This is absolutely fine if you only ever have one request in flight from the client to the server in a session, but it pretty much kills any possible parallelism or asynchrony, because replies to concurrent requests can't be matched back to the calls that produced them. In XORP's integration of Thrift, we *will* be relying on the ability to tell requests from the same client apart.
 * We could delve into the blob and rewrite it, but we can only do that if we have control of the transport. We have no such control in the Thrift libraries as shipped.
 * Also, if we're using ring buffers, that's gnarly. We would ideally like to hit a send() or writev() once and be done with it, to avoid increasing syscall overhead in situations where we need high performance.
 * As we've seen, the use of a TCP stream (where in-order delivery is guaranteed) forces serialization of RPC calls, but this isn't evident at the API level.
 * This may break message-oriented transports, and it's difficult to mix and match libxipc's pseudo-async I/O mode with a Thrift client's sync I/O mode without tripping over it.
 * The assumption in their programming model seems to be that threads will be used in a Java-like way: i.e. threads are cheap, don't share much state, and what they do share is explicitly 'synchronized'.
 * So we'd have to be very careful about how libxipc gets entered. The code is not currently thread-safe.
 * So what we'll probably have to do is support one model or the other, but not both at once.
 * Add a blocking API for service lookup to XrlRouter -- we can't rely on the EventLoop being run whilst we're in blocking Thrift APIs.
 * The 0 cseqid is only really an issue if we have more than one RPC in flight on the same transport from the same client.
 * Either force clients to build their own TTransport to connect to it, or add a hook to XrlRouter to place the I/O streams in a blocking mode.

Alternatively, we could add an API to Thrift internally to request a new cseqid for each outgoing request, but that is something which needs to be discussed with the Thrift developers through the Apache JIRA process.

thanks,
BMS

From bms at incunabulum.net Mon Nov 2 10:35:39 2009
From: bms at incunabulum.net (Bruce Simpson)
Date: Mon, 02 Nov 2009 18:35:39 +0000
Subject: [Xorp-hackers] XRL/Thrift: points for further work
Message-ID: <4AEF267B.2050903@incunabulum.net>

Here I'll outline a few points for further work. Some of these might not make much sense until I've posted an update on where the XRL/Thrift work is currently at (to explain what's going on in XRL).

First, a few words on scalability:

 * It means different things to different people. In this case, it probably means the ability to parallelize, which may or may not involve threads.
 * One of the things which could be done, and which is pretty much a prerequisite for threading, is to decouple the direct dispatch on the server side of XORP RPC method calls, and implement a scheme similar to what we need in the XRL clients -- that is, we buffer the blob, giving us an opportunity to ship the request off to other threads, e.g. using work queues.
 * I am shortly going to re-post here some of what was discussed privately on xorp-dev@ a year ago to underline this.

A few words on clustering and AMQP:

 * AMQP is outside the scope of this project, but it's worth a little forward thinking, given that it facilitates building scalable, fault-tolerant service clusters.
 * You can regard the use of AMQP for Thrift RPC method calls as a form of tunneled method call.
 * Please refer to my upcoming email on XRL/Thrift technical specifics, where I explain what tunneling the method calls involves.
 * Generally, only one transport-layer session to the AMQP broker is active at any time from a single client.
 * The tunneling wouldn't happen in libxipc as such; rather, we'd open an AMQP session per XRL target, and multiplex to/from this session for each service supported by that XRL target. Thrift, and some yet-to-be-decided AMQP library, would take care of the representation in that model.
 * The implementation constraints are likely to be similar to those of any other message-based transport.
 * See my other message re third-party Thrift use, about why message-oriented semantics for Thrift clients are an issue.
 * The AMQP broker itself isn't a naming service; it's an RPC router.
 * An AMQP-native Finder would be interesting, though. We have a number of producer/consumer relationships between XRL targets which would be better implemented, in an AMQP world, using AMQP exchanges and bindings and its publish/subscribe idioms.

thanks,
BMS

From bms at incunabulum.net Mon Nov 2 10:49:32 2009
From: bms at incunabulum.net (Bruce Simpson)
Date: Mon, 02 Nov 2009 18:49:32 +0000
Subject: [Xorp-hackers] XORP 1.7 status
Message-ID: <4AEF29BC.3070707@incunabulum.net>

Hi all,

A few words on how a XORP 1.7 release could be pushed out.

It's likely that I will break the Thrift work up further, because a XORP 1.7-RC is overdue. It's also likely that the RC would be limited to required bug-fix and infrastructure work only, with no new functionality planned.

Please understand that I am oversubscribed :-) I am currently the only active developer on the community branch, apart from JT Conklin, who is occasionally active on the SCons and build engineering parts of the branch. If we have volunteers to test RCs and offer help with integration, that process is likely to get much easier and produce a 1.7 milestone release more quickly.

So it's very likely I will push back on patches in the interim.

The 1.7-RC would involve shipping at least a source code tarball and a USB memory stick build.

Patches and bug reports stand more chance of being acted upon quickly if they follow the house code style, include test cases (and preferably logs of the reproduction), and have a Trac ticket open for them.

We understand the LiveCD is popular with folk who are building test network configurations, e.g. in VMware or similar virtualization environments. A show of hands on this would be great. Putting the LiveCD together currently involves a small amount of manual patching of a FreeBSD 7.2-STABLE tree [1]. There is a documented procedure for this, although if any of the trees involved are in flux, it can be tedious to deal with.

What I'm likely to do is call a freeze if/when we've reached consensus that the tree is ready for 1.7 and that a LiveCD snapshot is desirable; the result will then require testing before we raise the release flag.

thanks,
BMS

[1] One of the things I've been trying to do, as time goes on, is push generic code back to the upstream project(s) where it really belongs, or from which we derived it. The FreeBSD-based LiveCD generation is one such piece of work. I've had a little discussion with phk@ about it, but we didn't reach agreement about the way forward for that work. I hope other FreeBSD developers can pitch in on that; there's been some interest in it, as it would just make everyone's lives that little bit easier.

From bms at incunabulum.net Mon Nov 2 11:01:43 2009
From: bms at incunabulum.net (Bruce Simpson)
Date: Mon, 02 Nov 2009 19:01:43 +0000
Subject: [Xorp-hackers] Xrl/Thrift: Technical implementation details
Message-ID: <4AEF2C97.407@incunabulum.net>

So far, the following has happened in the development branch:

 * I have pretty much read all of libxipc and Thrift, and digested most of it. That has been the most time-consuming part to date.
 * I reverse-engineered what Thrift was generating for typical RPC calls, and compared it with XRL's behaviour.
 * Once it was clear there was a fit, work proceeded further.
 * libxipc has been cut down to the bare minimum API surface required for routing processes to link against.
 * This involves cutting out a lot of what will now be dead wood in libxipc.
 * We in fact have more than one RPC protocol in libxipc if you count the textual Finder protocol, and more than two if you count protocol families other than XrlPFSTCP.
 * XRL keeps a lot of state for in-flight RPC calls around, as C++ objects in an intermediate representation, because of this Finder protocol split.
 * A lot of the protocol machinery in libxipc exists to deal with that conceptual split.
 * The Router Manager has been cut out of the dev branch whilst the work is ongoing.
 * The reason for this is that, as we've seen in other recent threads on this list, it uses XRL in very different ways from the rest of the tree, and relies on a number of libxipc APIs which the rest of the tree never touches.
 * In xrl/scripts, clnt-gen and tgt-gen are changed to generate the equivalent Thrift binary blobs.
 * They use Thrift's TProtocol interface, instead of building an Xrl (with help from XrlArgs / XrlAtom / XrlAtomList etc.) as they currently do in mainline XORP.
 * Thrift methods can potentially return structs.
 * XRL methods don't return structs; instead, return arguments are listed after the '->' token in the *.xif IDL file.
 * The approach used is to turn the XRL return arguments into a Thrift struct, which can be accessed from Thrift using a compatible 'struct {}' declaration in a *.thrift IDL file.
 * This is transparent inside the XRL/Thrift port, and only exposed in thrift-gen (see below).

In XRL clients, method dispatch is still pseudo-asynchronous:

 * The trickiest part is dealing with asynchronous method call resolution.
 * In XRL, this is per-method, and pushes a lot of state around.
 * In Thrift, this can happen per-service.
 * In the old XRL world, the XrlFooClient::send_foo() stubs build up an XRL, which is then sent using the XrlSender interface.
 * This send may happen over a variety of destination transports, because of the Finder protocol vs. XrlPF* split.
 * The XRL send MAY be deferred if the method can't be resolved yet (i.e. it's pending Finder target resolution).
 * In the new XRL/Thrift world, these stubs marshal the body of a binary T_CALL message directly into a binary buffer, and then call into XrlRouter (via the XrlSender interface) to ship it off to the correct destination.
 * Because RPC endpoints can come and go, the FinderClient needs to track the endpoints based on what it learns from the Finder.
 * Because we're tied to TCP (for now) as a network transport, method dispatch is not fully asynchronous (nor do we want it to be, for now).

In XRL targets, method dispatch is still synchronous, and doesn't change [yet]:

 * A request comes in; libxipc parses it from the server's RPC endpoint and ships it off to the Thrifted XRL target stubs (handle_*()).
 * In the Thrift case, instead of flipping class Xrl instances around, the handle_*() callbacks read directly from the buffered Thrift binary blob (T_CALL message) we just read from the transport.
 * There are no plans to break this up further at this time; see below re scalability.
 * If the process's method handler returns XrlCmdError::OKAY(), then we marshal the result out using a Thrift T_REPLY.
 * Otherwise, XrlCmdErrors get translated into Thrift T_EXCEPTION reply messages by the XRL target.
 * This translation doesn't happen in the stubs themselves; rather, we preserve the existing XRL APIs and deal with it in libxipc.

A few words on third-party process interop:

 * At wire level, XRL is conceptually a subset of Thrift's wire-level protocol.
 * A thrift-gen translator has been written which takes a XORP *.xif XRL IDL file and generates a compatible *.thrift IDL file.
 * Whilst Thrift has language-level exception support, we don't use this here.
   It requires an additional struct in the T_REPLY. It made more sense to keep things simple, as using it would also bloat the translated Thrift service definitions.
 * Whilst thrift-gen is not immediately useful (see other message re using Thrift to talk to XORP), it'll be needed later on for writing Thrifted code which talks to XORP components directly.
 * It made sense to do this first, to get familiar with the Xif parser in Python and get more of a feel for the Thrift syntax, as well as to identify the conceptual overlap with XRL.

Some performance issues exist with XRL/Thrift which require modifying Thrift itself:

 * Thrift's TProtocol binary read/write methods take only std::string as arguments.
 * They also do some legwork to allocate intermediate storage from the C++ runtime heap, rather than on the stack.
 * There is no clean way to cast a vector, or similar scoped array types, to std::string without introducing an intermediate copy on the stack.
 * It isn't something which needs to be resolved immediately, as the only affected XRL method calls are those which ship packet payloads.
 * All other XIF native types should cast cleanly into their new Thrift representation without significant intermediate copies on the stack, apart from libxorp's Mac, which is just a 6-byte binary quantity.
 * [Note re scalability: unmarshalling need only happen at the point of dispatch, assuming buffering mechanisms are in place.]
 * The likely resolution is that I'll send the Thrift developers a patch for TProtocol implementing a (void *, size_t) overload of the TProtocol::readBinary() and TProtocol::writeBinary() methods.

On the subject of invocation through the Finder, which isn't possible or relevant in Thrift:

 * The current libxipc supports the notion of tunneled XRL method calls.
 * These calls are routed directly to the Finder, which is then responsible for dispatching them to the target.
 * In Thrifted libxipc, this changes; the Finder just acts as a naming service.
 * This is an essential mechanism for the Router Manager, which invokes XRL methods based on the same textual representation of them that the current Finder protocol uses.
 * The plan is to ditch the XRLdb (or keep it around purely for testing purposes); refactor XrlAction and XorpClient in the Router Manager to generate Thrift method calls based on the textual XRL method call description; and execute those calls using the same lookup mechanism as we will use in the new libxipc.

thanks,
BMS

From bms at incunabulum.net Mon Nov 2 12:25:05 2009
From: bms at incunabulum.net (Bruce Simpson)
Date: Mon, 02 Nov 2009 20:25:05 +0000
Subject: [Xorp-hackers] [Fwd: Re: [Xorp-dev] XORP XRL performance]
Message-ID: <4AEF4021.3050604@incunabulum.net>

Here are a few words from Orion last year about implementing parallelism in XORP XRL targets. Consider carefully that the process code itself must be thread-safe before we consider libxipc thread safety.
-------------- next part --------------
An embedded message was scrubbed...
From: "Orion T Hodson"
Subject: Re: [Xorp-dev] XORP XRL performance
Date: Fri, 5 Sep 2008 09:12:52 -0600 (MDT)
Size: 5826
Url: http://mailman.ICSI.Berkeley.EDU/pipermail/xorp-hackers/attachments/20091102/5d6b2331/attachment.eml

From bms at incunabulum.net Mon Nov 2 12:26:53 2009
From: bms at incunabulum.net (Bruce Simpson)
Date: Mon, 02 Nov 2009 20:26:53 +0000
Subject: [Xorp-hackers] [Fwd: Re: [Xorp-dev] XORP XRL performance]
Message-ID: <4AEF408D.2010003@incunabulum.net>

Here are some of Marko's thoughts on the subject. They include a (still working) link to a patch against XORP (as it was at the time) which implements batch XRL updates. A similar change exists in the corporate version of XORP.
-------------- next part --------------
An embedded message was scrubbed...
From: Marko Zec Subject: Re: [Xorp-dev] XORP XRL performance Date: Fri, 5 Sep 2008 17:38:03 +0200 Size: 5223 Url: http://mailman.ICSI.Berkeley.EDU/pipermail/xorp-hackers/attachments/20091102/1aea6231/attachment.eml From greearb at candelatech.com Mon Nov 2 16:23:49 2009 From: greearb at candelatech.com (Ben Greear) Date: Mon, 02 Nov 2009 16:23:49 -0800 Subject: [Xorp-hackers] XRL/Thrift In-Reply-To: <4AEF2101.8070506@incunabulum.net> References: <4AE7178B.9000709@incunabulum.net> <4AEAF863.7000500@candelatech.com> <4AEC6BA1.3010007@incunabulum.net> <4AECBF88.7030704@candelatech.com> <4AEF2101.8070506@incunabulum.net> Message-ID: <4AEF7815.5040902@candelatech.com> On 11/02/2009 10:12 AM, Bruce Simpson wrote: > Ben Greear wrote: >> >> Do you have an estimate for when you plan to post your changes? > > That's a very good question. At the moment, no, but I'll pick one out of > the hat. > > I'll be conservative, and say XRL/Thrift may be ready to go sometime in > mid December; that's not an official milestone, and please, don't hold > me to it. > > It's the beginning of November now. I am the only developer, as far as I > know, working near-full-time on the community XORP branch, so I'm sure > you can appreciate there's a lot of pressure involved. > > It's difficult to stay focused on that goal, and still provide lifeline > support here, so I am trying to keep ahead as much as possible. I know > I've had support requests from people on these lists which I have sadly > had to drop packet on for this reason. > > What I'll do now is post an update about where I'm at, and try to think > of tasks which other people could potentially pitch in on. > > For the foreseeable future, this is going to involve C++ hacking. That's > just how it is -- but there are other ways in which people can get > involved with the project, e.g. documentation, support, tracking down bugs. > > One really interesting thing which people have asked for, is a Linux > version of the LiveCD. 
This would be a great third party developer > contribution for someone to get involved with. Our Ubuntu-based 9.10 live-cd has Xorp on it (as well as our LANforge product and some other things). But, one could simply ignore our stuff and run xorp from the command line... It's currently behind a password (freely given upon registration). I'll think about making it directly downloadable... Ben > > thanks, > BMS -- Ben Greear Candela Technologies Inc http://www.candelatech.com From greearb at candelatech.com Wed Nov 4 10:38:27 2009 From: greearb at candelatech.com (Ben Greear) Date: Wed, 04 Nov 2009 10:38:27 -0800 Subject: [Xorp-hackers] FEA: Filter netlink sockets for specific table routes. Message-ID: <4AF1CA23.1060901@candelatech.com> Implement packet filter on netlink interfaces so that FEA only gets route updates for the table(s) it cares about. When running 100 routers with around 300 routes each, this takes system load from 300+ down to max of around 20 (on a dual quad-core system). This should greatly help scalability as virtual routers increase, since number of netlink messages will be Routers * Routes, instead of Routers^2 * Routes. This patch also makes netlink socket reading non-blocking, which is required to keep fea from hanging when packets are filtered. Thanks, Ben -- Ben Greear Candela Technologies Inc http://www.candelatech.com -------------- next part -------------- An embedded and charset-unspecified text was scrubbed... 
Name: xorp-netlink-filter.patch Url: http://mailman.ICSI.Berkeley.EDU/pipermail/xorp-hackers/attachments/20091104/cf99bee9/attachment-0001.ksh From luca.belforte at student.uclouvain.be Thu Nov 5 10:33:09 2009 From: luca.belforte at student.uclouvain.be (Luca Belforte) Date: Thu, 05 Nov 2009 19:33:09 +0100 Subject: [Xorp-hackers] XML-RPC for retrieving information on the router Message-ID: <4AF31A65.4060601@student.uclouvain.be> Hello, I'm trying to write a XML-RPC process in xorp, to retrieve in XML format some "useful" information on the router. First of all, i'm interested if someone have already developed a module/process who do the same or something similar. Secondary, I was searching where the Routing Table are stored, but to be honest, I'm a little bit lost, so I'm searching how retrieve the current routing table on a router, with a XRL. Thanks Luca From bms at incunabulum.net Thu Nov 5 13:50:37 2009 From: bms at incunabulum.net (Bruce Simpson) Date: Thu, 05 Nov 2009 21:50:37 +0000 Subject: [Xorp-hackers] XML-RPC for retrieving information on the router In-Reply-To: <4AF31A65.4060601@student.uclouvain.be> References: <4AF31A65.4060601@student.uclouvain.be> Message-ID: <4AF348AD.2070401@incunabulum.net> Hi Luca, Luca Belforte wrote: > Hello, > > I'm trying to write a XML-RPC process in xorp, to retrieve in XML format > some "useful" information on the router. > First of all, i'm interested if someone have already developed a > module/process who do the same or something similar. > At the moment, we don't have this capability in the community branch. I believe there is something similar in the commercial product, though, but it is probably specific to how the product's built, rather than being a general XML export mechanism. There is the ongoing Thrift work which could make something like this a bit easier, but not until a bit further down the line. 
> Secondary, I was searching where the Routing Table are stored, but to be > honest, I'm a little bit lost, so I'm searching how retrieve the current > routing table on a router, with a XRL. > If you look at rib/tools/show_routes.cc, you'll see the source of the show_routes program, which is used to dump the routing tables. There's more than one RIB, one for each combination of (unicast, multicast) * (ipv4, ipv6), and there's more than one routing table in each RIB. That's urib4, urib6, mrib4, mrib6, normally -- 4 RIBs, each holding as many origin tables as you have routing protocols. Broadly, what that tool does is loop over each RIB and address family combination, then print the *given* table. Usually this is the RIB's final table (what it pushes to the forwarding plane). The thing is, because retrieval of an entire table is an RPC-intensive operation, it's split up into a callback interface. What actually happens is that the show_routes tool registers itself as accepting routes for redistribution, just like a routing protocol module. Then it requests redistribution from the RIB for that table; off goes the RIB and fires off each route as a client request. [libxipc aficionados will note this is mostly because the RIB can then dispatch each routing table entry in an asynchronous manner -- XRL has pseudo-asynchronous dispatch at the client end of a session, but only synchronous dispatch at the server end of a session. They will probably also note that this is a candidate for a batch operation.] What is essentially just a routing table dump is split up in XORP in this way, so as not to block the RIB (or other processes) from other tasks whilst the dump is in progress. thanks, BMS From mnunna0 at gmail.com Thu Nov 5 14:13:27 2009 From: mnunna0 at gmail.com (mahendra nunna) Date: Thu, 5 Nov 2009 17:13:27 -0500 Subject: [Xorp-hackers] does xorp replace linux routing deamons Message-ID: <6e49b4d40911051413x50960c78m81c79e210251853b@mail.gmail.com> hi When i run xorp....
will it handle all the packets i m trying to send out of my host.... ? I mean ..... does XORP send the packets out of my host instead the kernel ?.... -------------- next part -------------- An HTML attachment was scrubbed... URL: http://mailman.ICSI.Berkeley.EDU/pipermail/xorp-hackers/attachments/20091105/0f055e28/attachment.html From bms at incunabulum.net Thu Nov 5 21:21:07 2009 From: bms at incunabulum.net (Bruce Simpson) Date: Fri, 06 Nov 2009 05:21:07 +0000 Subject: [Xorp-hackers] does xorp replace linux routing deamons In-Reply-To: <6e49b4d40911051413x50960c78m81c79e210251853b@mail.gmail.com> References: <6e49b4d40911051413x50960c78m81c79e210251853b@mail.gmail.com> Message-ID: <4AF3B243.2000102@incunabulum.net> mahendra nunna wrote: > hi > > When i run xorp.... will it handle all the packets i m trying to send > out of my host.... ? I mean ..... does XORP send the packets out of my > host instead the kernel ?.... No -- the host operating system is responsible for all packet I/O. cheers, BMS From bms at incunabulum.net Fri Nov 6 05:25:31 2009 From: bms at incunabulum.net (Bruce Simpson) Date: Fri, 06 Nov 2009 13:25:31 +0000 Subject: [Xorp-hackers] XML-RPC for retrieving information on the router In-Reply-To: <4AF3DC65.9000001@student.uclouvain.be> References: <4AF31A65.4060601@student.uclouvain.be> <4AF348AD.2070401@incunabulum.net> <4AF3DC65.9000001@student.uclouvain.be> Message-ID: <4AF423CB.6030305@incunabulum.net> Luca Belforte wrote: > Thanks Bruce for your reply. > > I will try to modify the code of show_routes.cc to get a xml file to > send by rpc ^^". > > Just is not easy to understand all the mechanism of XORP XRL. > Perhaps have you a document who explain it? (i read the documents on > XORP web site, but they not really help me) > There is a document on libxipc in PDF format on xorp.org (from an older tree), which you've probably seen. Normally, developers don't ever need to touch XRLs directly. 
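The redistribution pattern described in the earlier show_routes reply works like this: the client registers like a routing protocol and asks the RIB to redistribute a table, and the RIB then pushes one route at a time. It can be sketched in C++ as below, with the client collecting the routes into XML, which is what Luca intends. This is a minimal, self-contained illustration, not XORP code. ToyEventLoop, ToyRib, XmlDumpClient, Route and run_demo are hypothetical names, and real XORP would carry each route in an XRL rather than a direct method call.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <functional>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for XORP's EventLoop: runs queued tasks one at a time.
class ToyEventLoop {
public:
    void post(std::function<void()> task) { tasks_.push_back(std::move(task)); }
    void run() {
        while (!tasks_.empty()) {
            std::function<void()> task = std::move(tasks_.front());
            tasks_.pop_front();
            task();  // a task may post further tasks
        }
    }
private:
    std::deque<std::function<void()>> tasks_;
};

struct Route {
    std::string net, nexthop, protocol;
};

// Client side: registered for redistribution, as a routing protocol would be.
class XmlDumpClient {
public:
    explicit XmlDumpClient(std::vector<std::string>& trace) : trace_(trace) {}
    void add_route(const Route& r) {
        trace_.push_back("route " + r.net);
        body_ << "<route net=\"" << r.net << "\" nexthop=\"" << r.nexthop
              << "\" protocol=\"" << r.protocol << "\"/>";
    }
    void done() { trace_.push_back("done"); }
    std::string xml() const { return "<routes>" + body_.str() + "</routes>"; }
private:
    std::vector<std::string>& trace_;
    std::ostringstream body_;
};

// Server side: sends one route per event-loop task, so the RIB never blocks
// on the whole table and other work can run between routes.
class ToyRib {
public:
    ToyRib(ToyEventLoop& loop, std::vector<Route> table)
        : loop_(loop), table_(std::move(table)) {}
    void redistribute_to(XmlDumpClient& client) { send_from(client, 0); }
private:
    void send_from(XmlDumpClient& client, std::size_t i) {
        loop_.post([this, &client, i] {
            assert(i <= table_.size());
            if (i == table_.size()) { client.done(); return; }
            client.add_route(table_[i]);
            send_from(client, i + 1);  // next route queues behind pending work
        });
    }
    ToyEventLoop& loop_;
    std::vector<Route> table_;
};

// Dumps a two-route table while one unrelated task is also pending.
std::string run_demo(std::vector<std::string>& trace) {
    ToyEventLoop loop;
    ToyRib rib(loop, {{"10.0.0.0/8", "192.168.1.1", "static"},
                      {"172.16.0.0/12", "192.168.1.2", "ospf"}});
    XmlDumpClient client(trace);
    rib.redistribute_to(client);
    loop.post([&trace] { trace.push_back("other work"); });
    loop.run();
    return client.xml();
}
```

Each route is a separate event-loop task. As a result, the unrelated task queued after the dump began runs between the first and second routes. That interleaving is the point of splitting the dump up: the RIB is never tied up for the duration of a full table. A real XML exporter would also need to escape attribute values.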
I recently checked in a script, skel-gen, to the community branch, which will take a XORP .tgt file and generate skeleton server code. If you're writing a XORP service from scratch, this should help cut down the amount of manual hacking needed just to bootstrap a new service. I agree, the internals of libxipc are not so well documented. It's something I'm having to make notes on as I progress. There are a lot of classes in there which exist just to wrap asynchronous operations. This isn't something which threading would necessarily 'fix'. We're going to be sticking to the notion of XRL targets for some time; I'm aiming to touch as little code as possible to get Thrift up and running. thanks, BMS From bms at incunabulum.net Fri Nov 6 05:34:43 2009 From: bms at incunabulum.net (Bruce Simpson) Date: Fri, 06 Nov 2009 13:34:43 +0000 Subject: [Xorp-hackers] Linux LiveCD? In-Reply-To: <4AEF7815.5040902@candelatech.com> References: <4AE7178B.9000709@incunabulum.net> <4AEAF863.7000500@candelatech.com> <4AEC6BA1.3010007@incunabulum.net> <4AECBF88.7030704@candelatech.com> <4AEF2101.8070506@incunabulum.net> <4AEF7815.5040902@candelatech.com> Message-ID: <4AF425F3.2010706@incunabulum.net> Ben, We'd be very interested in this here; if you could post pointers, that would be a great starting point for a lot of folk. Ben Greear wrote: > > Our Ubuntu-based 9.10 live-cd has Xorp on it (as well as our LANforge > product and some other things). But, one could simply ignore our > stuff and run xorp from the command line... > > It's currently behind a password (freely given upon registration). I'll > think about making it directly downloadable... I believe Ubuntu is shipping a CD creation tool of some kind; is this something which you use to roll your CDs? I guess what I'm getting at is that it would be great to have a reproducible build that can exist as part of the XORP tree, which folk can check out and recreate on their own.
Similar to what we currently have with NanoBSD; it can be built completely from source, up to the point where the XORP package actually has to be created and pushed into the new system image. It might even be interesting to get shipped in pfSense. Mind you, I know the situation with reproducible, small Linux builds is not that great. I did do a port of NanoBSD to Gentoo Linux as a direct result of this. I sent this to some Gentoo developers, but got no response. Mostly I did this so I could shrinkwrap Linux for testing purposes in virtual machines, so it was very minimalist. thanks, BMS From greearb at candelatech.com Fri Nov 6 07:54:39 2009 From: greearb at candelatech.com (Ben Greear) Date: Fri, 06 Nov 2009 07:54:39 -0800 Subject: [Xorp-hackers] Linux LiveCD? In-Reply-To: <4AF425F3.2010706@incunabulum.net> References: <4AE7178B.9000709@incunabulum.net> <4AEAF863.7000500@candelatech.com> <4AEC6BA1.3010007@incunabulum.net> <4AECBF88.7030704@candelatech.com> <4AEF2101.8070506@incunabulum.net> <4AEF7815.5040902@candelatech.com> <4AF425F3.2010706@incunabulum.net> Message-ID: <4AF446BF.6090909@candelatech.com> Bruce Simpson wrote: > Ben, > > We'd be very interested in this here; if you could post pointers, that > would be a great starting point for a lot of folk. It's somewhat tricky to build it how I do, since I need to upgrade the kernel too for additional features. I do have some notes (attached), but they change from rls to rls, and also include LANforge related stuff. You can find the ISO image here (for now). http://www.candelatech.com/oss/xorp_binaries/ You'll find xorp in /usr/local/xorp Thanks, Ben -- Ben Greear Candela Technologies Inc http://www.candelatech.com -------------- next part -------------- An embedded and charset-unspecified text was scrubbed... 
Name: ubuntu-live-notes.txt Url: http://mailman.ICSI.Berkeley.EDU/pipermail/xorp-hackers/attachments/20091106/d76db76c/attachment-0001.txt From lizhaous2000 at yahoo.com Fri Nov 6 07:58:22 2009 From: lizhaous2000 at yahoo.com (Li Zhao) Date: Fri, 6 Nov 2009 07:58:22 -0800 (PST) Subject: [Xorp-hackers] does xorp replace linux routing deamons In-Reply-To: <6e49b4d40911051413x50960c78m81c79e210251853b@mail.gmail.com> Message-ID: <973791.18357.qm@web58708.mail.re1.yahoo.com> In simple terms, XORP processes can update the Linux kernel routing (forwarding) table. XORP processes are control-plane processes. Packets are still received and sent out by the Linux kernel, which consults the kernel forwarding table or cache. --- On Thu, 11/5/09, mahendra nunna wrote: > From: mahendra nunna > Subject: [Xorp-hackers] does xorp replace linux routing deamons > To: xorp-users at xorp.org > Cc: xorp-hackers at icir.org > Date: Thursday, November 5, 2009, 5:13 PM > hi > When i run xorp.... will it handle all the > packets i m trying to send out of my host.... ? I mean ..... > does XORP send the packets out of my host instead the kernel > ?.... > > _______________________________________________ > Xorp-hackers mailing list > Xorp-hackers at icir.org > http://mailman.ICSI.Berkeley.EDU/mailman/listinfo/xorp-hackers > From bms at incunabulum.net Sun Nov 8 03:17:18 2009 From: bms at incunabulum.net (Bruce Simpson) Date: Sun, 08 Nov 2009 11:17:18 +0000 Subject: [Xorp-hackers] Crash due to stale cached xrl sender pointer.
In-Reply-To: <4AE9EA19.5030702@candelatech.com> References: <4AE799E1.8010300@candelatech.com> <4AE7D31B.1040107@candelatech.com> <4AE9B7D1.803@incunabulum.net> <4AE9BAE5.20406@candelatech.com> <4AE9CC6B.9050302@incunabulum.net> <4AE9D45B.1090003@candelatech.com> <4AE9E020.4060403@incunabulum.net> <4AE9EA19.5030702@candelatech.com> Message-ID: <4AF6A8BE.2000603@incunabulum.net> I wonder here if we're better off just making the per-method cached Xrl a property of the XrlFooClient stub class itself. This would also fix a memory leak while we're at it, but means modifying the clnt-gen code generator. I believe you hit the problem with your multiple instance patch in the tree; perhaps if the re-entrancy problem in the libfooxif.so stubs is eliminated, your stuff will work w/o the ref_ptr patch. From bms at incunabulum.net Sun Nov 8 06:02:47 2009 From: bms at incunabulum.net (Bruce Simpson) Date: Sun, 08 Nov 2009 14:02:47 +0000 Subject: [Xorp-hackers] Crash due to stale cached xrl sender pointer. In-Reply-To: <4AF6A8BE.2000603@incunabulum.net> References: <4AE799E1.8010300@candelatech.com> <4AE7D31B.1040107@candelatech.com> <4AE9B7D1.803@incunabulum.net> <4AE9BAE5.20406@candelatech.com> <4AE9CC6B.9050302@incunabulum.net> <4AE9D45B.1090003@candelatech.com> <4AE9E020.4060403@incunabulum.net> <4AE9EA19.5030702@candelatech.com> <4AF6A8BE.2000603@incunabulum.net> Message-ID: <4AF6CF87.3040106@incunabulum.net> The attached patch changes the allocation semantics of the cached Xrl pointer to be per-client-instance rather than per-library-instance. It does this by modifying the Xrl client stub generator to include a named auto_ptr for each method in the Xrl client stub. Sizes are for FreeBSD 7.2/amd64 with gcc 4.2.1. Hopefully it should give others a feel for working with the code generator. 
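The sketch below contrasts the old per-library cache with the per-client-instance cache described here. Xrl, StaticCacheClient and MemberCacheClient are hypothetical stand-ins, not the clnt-gen output. std::auto_ptr has since been deprecated and removed from the language, so the sketch uses std::unique_ptr, which plays the same owning role.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>

// Hypothetical stand-in for a marshalled XRL request (not libxipc's Xrl).
struct Xrl {
    std::string target;
    std::string args;
};

// Before: one cached Xrl per *library*, in a function-local static. Every
// client instance shares it, and the object is never freed.
class StaticCacheClient {
public:
    explicit StaticCacheClient(std::string target) : target_(std::move(target)) {
        assert(!target_.empty());
    }
    Xrl* send_get_routes(const std::string& args) {
        static Xrl* cached = nullptr;  // static storage: shared by all instances
        if (cached == nullptr)
            cached = new Xrl;          // never deleted
        cached->target = target_;      // clobbers whatever the last caller set
        cached->args = args;
        return cached;                 // a real stub would pass this to a sender
    }
private:
    std::string target_;
};

// After: one cached Xrl per method per *client instance*, owned by a member
// smart pointer (std::auto_ptr in the 2009 patch; std::unique_ptr here).
class MemberCacheClient {
public:
    explicit MemberCacheClient(std::string target) : target_(std::move(target)) {
        assert(!target_.empty());
    }
    Xrl* send_get_routes(const std::string& args) {
        if (!cached_get_routes_)
            cached_get_routes_.reset(new Xrl);
        cached_get_routes_->target = target_;
        cached_get_routes_->args = args;
        return cached_get_routes_.get();
    }
private:
    std::string target_;
    std::unique_ptr<Xrl> cached_get_routes_;  // freed along with the client
};
```

With the static cache, two client objects share the same Xrl and overwrite each other's contents, and that Xrl is never freed. With the member cache, each client owns its own Xrl, which is released when the client is destroyed.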
This probably doesn't fix the original race, but it may make it easier to mitigate some other way, by giving us a handle on the cached Xrls which are causing the problem, rather than letting them dangle in the BSS segment. That, and it should fix the glaringly obvious memory leakage, and non-reentrancy, caused by polluting a library with 'static Xrl*' instances. I've tested this briefly with the current SVN tree, just by running a xorp_finder, call_xrl, and a xorp_rib. This exercises the reentrancy of the XRL client stubs, because the Finder Client is instantiated once for every XrlRouter. The RIB uses libfeaclient, which instantiates its *own* XrlRouter instance (yes, you can have more than one per process...) for getting interface updates from the FEA. This patch does increase the code size of the stubs very slightly; that's the trade-off. As we are no longer holding the cached 'static Xrl*' in the BSS segment, the BSS segment shrinks while the code segment grows; auto_ptr doesn't use any additional data storage beyond the pointer itself, but the stubs now explicitly check allocations and deallocations of Xrl. When building with shared libraries, the template expansions for auto_ptr are a lost opportunity for coalescing by the linker, although this only wastes ~128 bytes per library (for auto_ptr::get() and auto_ptr::reset()). Most of this seems to be NOP padding for cache line alignment. I'm still not 100% happy with how the XrlPFSender cache mechanism works. Because we're holding a pointer somewhere, to something whose lifecycle is managed somewhere else, there really isn't any other answer than the one you've already suggested. I'm not that happy about using ref_ptr to do it, for much the same reasons as I've already described earlier in this thread -- unlike Boost's shared_ptr/weak_ptr pair, there is no clean syntax for observing (rather than owning) the object behind a ref_ptr. I'm not at all happy about using ref_ptr&, because it is all too easy to ignore its specific meaning and introduce problems.
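For comparison, the observer idiom mentioned here pairs an owning shared_ptr, held by whoever manages the sender's lifecycle, with a non-owning weak_ptr in the cache. It might look like the sketch below. The sketch uses std::shared_ptr/std::weak_ptr, the standard-library counterparts of the Boost types. Sender, SenderCache and the resolver callback are hypothetical stand-ins for XrlPFSender and a Finder lookup, not libxipc APIs.

```cpp
#include <cassert>
#include <functional>
#include <memory>
#include <string>
#include <utility>

// Hypothetical stand-in for a protocol-family sender (cf. XrlPFSender) whose
// lifetime is owned by some other component.
struct Sender {
    explicit Sender(std::string address) : address(std::move(address)) {}
    std::string address;
};

// Caches the most recently resolved sender without owning it.
class SenderCache {
public:
    using Resolver = std::function<std::shared_ptr<Sender>()>;

    explicit SenderCache(Resolver resolve) : resolve_(std::move(resolve)) {
        assert(resolve_);  // a cache with no way to re-resolve is useless
    }

    // Returns a live sender, re-resolving if the cached one has been destroyed.
    std::shared_ptr<Sender> get() {
        std::shared_ptr<Sender> s = cached_.lock();  // empty if destroyed
        if (!s) {
            s = resolve_();
            cached_ = s;
            ++resolutions_;
        }
        return s;
    }

    int resolutions() const { return resolutions_; }

private:
    Resolver resolve_;
    std::weak_ptr<Sender> cached_;  // observes without extending lifetime
    int resolutions_ = 0;
};
```

The cache never extends the sender's lifetime. A destroyed sender shows up as an empty lock() result rather than a dangling pointer, at which point the cache simply resolves again.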
Certainly, the only reason I did it in Spt, was because that class was already using ref_ptr internally to track nodes in a container. Granted, this is code fairly deep down in the core of the tree, which many developers would never need to touch directly, but that makes it even more important to be careful. In Thrift, the binary blobs themselves can be decoupled from where they go. There is a potential chicken-and-egg problem if we support multiple TProtocol types, where we would need to know the sender before the stubs create the blob; if we just speak TBinaryProtocol to everything, we don't have this problem, but it does mean we can't just tell the XORP RPC endpoints 'speak JSON to this guy', 'speak XML-RPC to this guy', 'speak AMQP to this guy' etc. So the idea of caching the transport we'd prefer to transmit from, is still one that bears further scrutiny, even in a re-spin. The difference is, for maximum flexibility, Thrifted XRL stubs would actually want to see XrlPFSender's equivalent upfront, before XrlSender::send() is even called. This is after scrutinizing libxipc even further this week, and realizing most of what's there is to support asynchrony. -------------- next part -------------- An embedded and charset-unspecified text was scrubbed... Name: sizes-xif-autoptr.txt Url: http://mailman.ICSI.Berkeley.EDU/pipermail/xorp-hackers/attachments/20091108/fccc60fe/attachment.txt -------------- next part -------------- An embedded and charset-unspecified text was scrubbed... Name: sizes-xif-noautoptr.txt Url: http://mailman.ICSI.Berkeley.EDU/pipermail/xorp-hackers/attachments/20091108/fccc60fe/attachment-0001.txt -------------- next part -------------- An embedded and charset-unspecified text was scrubbed... 
Name: xrl-auto-ptr.diff Url: http://mailman.ICSI.Berkeley.EDU/pipermail/xorp-hackers/attachments/20091108/fccc60fe/attachment.ksh From bms at incunabulum.net Sun Nov 8 07:13:10 2009 From: bms at incunabulum.net (Bruce Simpson) Date: Sun, 08 Nov 2009 15:13:10 +0000 Subject: [Xorp-hackers] Crash due to stale cached xrl sender pointer. In-Reply-To: <4AF6CF87.3040106@incunabulum.net> References: <4AE799E1.8010300@candelatech.com> <4AE7D31B.1040107@candelatech.com> <4AE9B7D1.803@incunabulum.net> <4AE9BAE5.20406@candelatech.com> <4AE9CC6B.9050302@incunabulum.net> <4AE9D45B.1090003@candelatech.com> <4AE9E020.4060403@incunabulum.net> <4AE9EA19.5030702@candelatech.com> <4AF6A8BE.2000603@incunabulum.net> <4AF6CF87.3040106@incunabulum.net> Message-ID: <4AF6E006.7040506@incunabulum.net> Bruce Simpson wrote: > > I've tested this briefly with the current SVN tree, just by running > a xorp_finder, call_xrl, and a xorp_rib. This exercises the reentrancy > of the XRL client stubs, because the Finder Client is instantiated > once for every XrlRouter. The RIB uses libfeaclient, which > instantiates its *own* XrlRouter instance (yes, you can have more than > one per process...) for getting interface updates from the FEA. Anyway, back to this incremental change. For testing, 3 shells: 1. Start the Finder (tedious, LD_LIBRARY_PATH needed if doing it from the source tree, which I don't recommend normally):- %%% anglepoise:~/svn/xorp/xorp % !81 env LD_LIBRARY_PATH=obj/x86_64-unknown-freebsd7.2/libxipc:obj/x86_64-unknown-freebsd7.2/libcomm:obj/x86_64-unknown-freebsd7.2/libxorp ./obj/x86_64-unknown-freebsd7.2/libxipc/xorp_finder -v Finder \ %%% 2. 
Start the RIB: %%% anglepoise:~/svn/xorp/xorp % env LD_LIBRARY_PATH=obj/x86_64-unknown-freebsd7.2/libxipc:obj/x86_64-unknown-freebsd7.2/libcomm:obj/x86_64-unknown-freebsd7.2/libxorp:obj/x86_64-unknown-freebsd7.2/rib:obj/x86_64-unknown-freebsd7.2/libfeaclient:obj/x86_64-unknown-freebsd7.2/xrl/interfaces:obj/x86_64-unknown-freebsd7.2/xrl/targets:obj/x86_64-unknown-freebsd7.2/policy/backend:obj/x86_64-unknown-freebsd7.2/policy/common:obj/x86_64-unknown-freebsd7.2/libproto ./obj/x86_64-unknown-freebsd7.2/rib/xorp_rib %%% 2b. You should see the Finder complain about RIB asking where the non-existent FEA is in the first shell: %%% [ 2009/11/08 14:07:31 WARNING xorp_finder XrlFinderTarget ] Handling method for finder/0.2/resolve_xrl failed: XrlCmdError 102 Command failed Target "fea" does not exist or is not enabled. %%% 3. Use call_xrl to verify the RIB registered OK with the Finder, by asking for all XRL targets to be dumped out: %%% anglepoise:~/svn/xorp/xorp % !32 env LD_LIBRARY_PATH=obj/x86_64-unknown-freebsd7.2/libxipc:obj/x86_64-unknown-freebsd7.2/libcomm:obj/x86_64-unknown-freebsd7.2/libxorp ./obj/x86_64-unknown-freebsd7.2/libxipc/call_xrl finder://finder/finder/0.2/get_xrl_targets target_names:list=:txt=call_xrl-4fcff00b2aeeb4447a32f10d3fa6d02f at 127.0.0.1,:txt=finder,:txt=ifmgr_mirror-4bb726603db32aacf6bd6dc5e4dd5053 at 127.0.0.1,:txt=rib-afe2bbdbd98937f915fe36ecf37e30a6 at 127.0.0.1 %%% As you can see, the RIB came up OK in this case. Profiling with the FreeBSD valgrind snapshot revealed that the "still reachable loss records" for the send*() methods in the XIF client stubs had indeed gone away, a desired result -- it's good to fix memory leaks. I'll hold off on committing for now, though. It would be good to know if changing the Xrl allocation in this way helps the situation with the race you saw... 
From bms at incunabulum.net Mon Nov 9 05:30:32 2009 From: bms at incunabulum.net (Bruce Simpson) Date: Mon, 09 Nov 2009 13:30:32 +0000 Subject: [Xorp-hackers] [PATCH] add XRL class listing to libxipc/finder Message-ID: <4AF81978.9000408@incunabulum.net> Hi all, This is just a quick patch I wrote to expose the XRL class list within the Finder for debugging purposes. The Finder itself is currently excluded from its output. Some background: If a routing process needs to restart, there may be dependent processes requiring its services. Currently, this dependency is expressed as a set of process watches within the Finder itself. Every process contains at least one XRL target. When this target is registered with the Finder by class XrlRouter, it is given a class name (e.g. "rib", "fea"), and an MD5 HMAC of some other salt is appended to that name to form the instance name. The process watches are implemented as watches on both 'class' and 'instance'. For example, the RIB can register with the Finder to ask when a specific instance of the FEA (e.g. the one it's currently talking to) has gone away, and when other FEAs arrive on the scene (e.g. if the process is restarted by the Router Manager). By invoking the new XRLs from the command line, using the call_xrl tool, you can interrogate the Finder to find out which XRL process classes are currently registered, and which XRL targets (instances) implement those classes. The new XRLs are named 'get_xrl_classes' and 'get_xrl_class_instances'. Please see my posting yesterday about fixing the 'static Xrl*' leak for an example of using call_xrl for debugging purposes. cheers, BMS -------------- next part -------------- An embedded and charset-unspecified text was scrubbed...
Name: libxipc-list-classes.diff Url: http://mailman.ICSI.Berkeley.EDU/pipermail/xorp-hackers/attachments/20091109/3224f12e/attachment.ksh From bms at incunabulum.net Mon Nov 9 06:19:04 2009 From: bms at incunabulum.net (Bruce Simpson) Date: Mon, 09 Nov 2009 14:19:04 +0000 Subject: [Xorp-hackers] ProtoUnit: protocol framework Message-ID: <4AF824D8.1080305@incunabulum.net> Hi all, This is just to draw your attention to the ProtoUnit framework, within libproto. It looks as though some of the work here is unfinished, in the sense that whilst all of the multicast control plane components (PIM, IGMP and MLD) use it, it seems as though it was intended for use across the tree. I'm not going to recommend at this point that we try to finish this work. Rather, I just wanted to draw it to people's attention, as it is probably a useful building block for new protocols. Also, if anyone is following up on XRL, the class names used within the Finder namespace are passed to the XrlRouter constructor. Most consumers of XrlRouter don't instantiate it directly; rather, they use the XrlStdRouter convenience interface, which has defaults for the Finder's transport address. [1] For the multicast components, the XRL class name comes from a table in libproto/proto_unit.cc. cheers, BMS [1] P.S. Ben: I'd be interested to know how you deal with the Finder namespace within your virtualization changes. Are you running a new set of XORP processes for each virtualized router, or sharing state within the existing processes -- or am I missing something?
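The get_xrl_classes patch discusses class and instance names, and the earlier call_xrl test shows the get_xrl_targets output. A small client-side helper could tie the two together by grouping instance names by class. This is only a sketch built on two assumptions. First, it assumes instance names follow the "<class>-<digest>@<host>" pattern visible in that output (the list archive renders "@" as " at "). Second, it treats the call_xrl text rendering as a stable format, which it may not be. index_targets and class_of are hypothetical names. The new Finder XRLs presumably do this grouping from the Finder's own registration records instead.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <set>
#include <sstream>
#include <string>

// XRL class name -> set of registered instance (target) names.
using ClassIndex = std::map<std::string, std::set<std::string>>;

// Assumes instance names look like "<class>-<digest>@<host>"; a name with no
// '-' before the '@' (e.g. "finder") is treated as its own class.
std::string class_of(const std::string& instance) {
    std::string name = instance.substr(0, instance.find('@'));  // drop "@host"
    std::size_t dash = name.rfind('-');
    return dash == std::string::npos ? name : name.substr(0, dash);
}

// Groups the targets from call_xrl's rendering of finder/0.2/get_xrl_targets,
// e.g. "target_names:list=:txt=rib-afe2@127.0.0.1,:txt=finder".
ClassIndex index_targets(const std::string& reply) {
    const std::string list_prefix = "target_names:list=";
    const std::string item_prefix = ":txt=";
    assert(reply.compare(0, list_prefix.size(), list_prefix) == 0);

    ClassIndex index;
    std::istringstream items(reply.substr(list_prefix.size()));
    std::string item;
    while (std::getline(items, item, ',')) {
        if (item.compare(0, item_prefix.size(), item_prefix) == 0)
            item = item.substr(item_prefix.size());
        if (item.empty() || item == "finder")
            continue;  // leave the Finder out, as the new XRLs reportedly do
        index[class_of(item)].insert(item);
    }
    return index;
}
```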
From bms at incunabulum.net Mon Nov 9 06:57:52 2009 From: bms at incunabulum.net (Bruce Simpson) Date: Mon, 09 Nov 2009 14:57:52 +0000 Subject: [Xorp-hackers] Thrifted XrlSender In-Reply-To: <4AF6CF87.3040106@incunabulum.net> References: <4AE799E1.8010300@candelatech.com> <4AE7D31B.1040107@candelatech.com> <4AE9B7D1.803@incunabulum.net> <4AE9BAE5.20406@candelatech.com> <4AE9CC6B.9050302@incunabulum.net> <4AE9D45B.1090003@candelatech.com> <4AE9E020.4060403@incunabulum.net> <4AE9EA19.5030702@candelatech.com> <4AF6A8BE.2000603@incunabulum.net> <4AF6CF87.3040106@incunabulum.net> Message-ID: <4AF82DF0.90801@incunabulum.net> Bruce Simpson wrote: > I'm still not 100% happy with how the XrlPFSender cache mechanism > works. Because we're holding a pointer somewhere, to something whose > lifecycle is managed somewhere else, there really isn't any other > answer than the one you've already suggested. > ... > > In Thrift, the binary blobs themselves can be decoupled from where > they go. There is a potential chicken-and-egg problem if we support > multiple TProtocol types, where we would need to know the sender > before the stubs create the blob; if we just speak TBinaryProtocol to > everything, we don't have this problem, but it does mean we can't just > tell the XORP RPC endpoints 'speak JSON to this guy', 'speak XML-RPC > to this guy', 'speak AMQP to this guy' etc. > > So the idea of caching the transport we'd prefer to transmit from, > is still one that bears further scrutiny, even in a re-spin. The > difference is, for maximum flexibility, Thrifted XRL stubs would > actually want to see XrlPFSender's equivalent upfront, before > XrlSender::send() is even called. I'm thinking the best way forward here is to assume the use of TBinaryProtocol in all situations. 
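The rest of this message argues for a specific approach: marshal each call into a binary blob immediately, hold the blob while the asynchronous Finder lookup is outstanding, then transmit it unchanged. That approach might be sketched as below. BufferingSender, its Lookup and Transmit callbacks, and the endpoint strings are all hypothetical. The blob stands in for TBinaryProtocol output, and nothing here is libxipc or Thrift API.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <map>
#include <string>
#include <utility>
#include <vector>

using Blob = std::vector<std::uint8_t>;  // stands in for TBinaryProtocol output

// Hypothetical sender: each call is marshalled into a blob up front, and blobs
// for a target are queued until an asynchronous name lookup for it completes.
class BufferingSender {
public:
    using Lookup = std::function<void(const std::string& target)>;
    using Transmit = std::function<void(const std::string& endpoint, const Blob&)>;

    BufferingSender(Lookup start_lookup, Transmit transmit)
        : start_lookup_(std::move(start_lookup)), transmit_(std::move(transmit)) {}

    void send(const std::string& target, Blob blob) {
        auto known = resolved_.find(target);
        if (known != resolved_.end()) {
            transmit_(known->second, blob);  // already resolved: send at once
            return;
        }
        std::vector<Blob>& queue = pending_[target];
        queue.push_back(std::move(blob));    // what we buffer is what we transmit
        if (queue.size() == 1)
            start_lookup_(target);           // one outstanding lookup per target
    }

    // Invoked (e.g. from the event loop) when the lookup for `target` completes.
    void on_resolved(const std::string& target, const std::string& endpoint) {
        assert(!endpoint.empty());
        resolved_[target] = endpoint;
        auto it = pending_.find(target);
        if (it == pending_.end())
            return;
        std::vector<Blob> queue = std::move(it->second);
        pending_.erase(it);
        for (const Blob& b : queue)
            transmit_(endpoint, b);          // preserves submission order
    }

private:
    Lookup start_lookup_;
    Transmit transmit_;
    std::map<std::string, std::string> resolved_;
    std::map<std::string, std::vector<Blob>> pending_;
};
```

Only the first send to an unresolved target triggers a lookup, and later sends queue behind it. Once on_resolved() runs, the queued blobs go out in their original order, and subsequent sends bypass the queue.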
As libxipc currently stands, calls through the XrlSender::send() interface need not know the destination endpoint first; they can be temporarily buffered whilst the FinderClient lookup completes. That lookup is an asynchronous operation, whose completion is dispatched from the EventLoop. We need to buffer the outgoing RPC call at this point. In a language where the call stack is independent from the object stack, e.g. Python, it's easier to use a continuation here to deal with awaiting the result of a pending operation. However, in C++, we can't easily split the XRL output marshaling like this, so we buffer it. Now, the most likely source of performance issues with XRL is the intermediate representation. With Thrift, there is no intermediate representation -- what we buffer is what we transmit. If we stick to using TBinaryProtocol, we need only render the blob into a buffer, and dispatch that blob when the FinderClient lookup completes, which requires no change to the existing logic. Using TBinaryProtocol shouldn't be an impediment to future scalability or feature additions. AMQP is designed with relaying binary blobs in mind. A change in representation is only really useful if we need to interact with the processes using some other protocol, and there are better, more appropriate ways to do that. From bms at incunabulum.net Mon Nov 9 06:57:56 2009 From: bms at incunabulum.net (Bruce Simpson) Date: Mon, 09 Nov 2009 14:57:56 +0000 Subject: [Xorp-hackers] More on Thrift and 3rd party Java interop. Message-ID: <4AF82DF4.70305@incunabulum.net> At this point, I'm also trying to consider what other participants on this list have raised as desirable features. 1. Java support, specifically JNI. I'm not 100% sure what the original submitter meant. Trying to wrap the existing XORP router code as a set of JNI methods is probably a futile exercise, given that JNI code generally needs to be completely thread safe.
However, assuming that the objective is just to interact with the XORP processes using RPC, or even implement router processes in Java, that would certainly be possible with the Thrift changes. I just skimmed what's available. In Java, the use of blocking I/O operations generally isn't an issue, as threads are somewhat cheaper, and preferred in that language; it should be possible to just re-use the existing Java client libraries which ship with Thrift. For client processes, which just interact with the router components at a simple level, this is just fine. For implementing routing processes in Java, we have a whole other set of issues. In libfeaclient, libxipc, we've got a set of common operations in a framework for interacting with other components within XORP itself, and which could be candidates for invocation through a language-neutral interface. Wrapping in this way would be necessary, to avoid reimplementing all the logic. Both libfeaclient, and libxipc, rely on the XORP EventLoop as a means of realizing an event-driven programming framework. Thread safety of the APIs would be an issue here; some sort of service thread would be necessary to wrap them as components safely. I don't plan to look at this further as part of the current effort. It does require further work to support such use, and this is really out of scope for what's possible in the near future. 2. XML-RPC. Given that implementing Thrift would let an object scripting language talk directly to XORP processes, and that many of them have excellent XML support, it's probably better to let that be the translation mechanism. XML-RPC doesn't really have a place in the embeddable router core itself, for performance reasons; although, this guideline could be broken, see below re shared memory. 3. Bringing back SNMP support. In current SVN XORP, I've pulled SNMP support, mostly because it doesn't slot into the framework well at all, and doesn't provide much of the SNMP functionality anyway. 
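On the thread-safety point under item 1, a common way to expose an event-loop-driven, non-thread-safe library to threaded callers (such as a JVM via JNI) is to confine it to a single service thread and let other threads post work to it. The sketch below shows only that general technique. ServiceThread is a hypothetical name, and the Component parameter stands in for something like an EventLoop plus libfeaclient state; it is not an existing XORP interface.

```cpp
#include <cassert>
#include <condition_variable>
#include <deque>
#include <functional>
#include <future>
#include <memory>
#include <mutex>
#include <thread>
#include <utility>

// Hypothetical wrapper: a single service thread owns a non-thread-safe
// Component; any other thread may post work for it via call().
template <typename Component>
class ServiceThread {
public:
    ServiceThread() : worker_([this] { run(); }) {}

    ~ServiceThread() {
        {
            std::lock_guard<std::mutex> lock(mu_);
            stopping_ = true;
        }
        cv_.notify_one();
        worker_.join();  // run() drains queued jobs before returning
    }

    // Queues fn(component) for the service thread; callable from any thread.
    template <typename Fn>
    auto call(Fn fn) -> std::future<decltype(fn(std::declval<Component&>()))> {
        using R = decltype(fn(std::declval<Component&>()));
        auto task = std::make_shared<std::packaged_task<R()>>(
            [this, fn] { return fn(component_); });
        std::future<R> result = task->get_future();
        {
            std::lock_guard<std::mutex> lock(mu_);
            assert(!stopping_);
            jobs_.push_back([task] { (*task)(); });
        }
        cv_.notify_one();
        return result;
    }

private:
    void run() {
        for (;;) {
            std::function<void()> job;
            {
                std::unique_lock<std::mutex> lock(mu_);
                cv_.wait(lock, [this] { return stopping_ || !jobs_.empty(); });
                if (jobs_.empty())
                    return;  // stopping, and nothing left to do
                job = std::move(jobs_.front());
                jobs_.pop_front();
            }
            job();  // only this thread ever touches component_
        }
    }

    Component component_;
    std::mutex mu_;
    std::condition_variable cv_;
    std::deque<std::function<void()>> jobs_;
    bool stopping_ = false;
    std::thread worker_;  // declared last: starts after the members above exist
};
```

Only the service thread ever touches the component, so the component needs no locking of its own. The cost is a queue hop per call. A real wrapper would also have to interleave these posted jobs with the EventLoop's own timers and I/O.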
What I've said so far holds for incremental, fine-grained control data. In situations where we potentially need to ship an entire default-free-zone routing table across process boundaries, we're probably looking at a good case for shared memory. Certainly, the fact that we have a lot of critical data locked up in the address space of a process, where we can't otherwise get at it, has meant we've had to bend the rules in the past. There are no 100% clean ways of dealing with it.

cheers,
BMS

From greearb at candelatech.com Mon Nov 9 08:37:40 2009 From: greearb at candelatech.com (Ben Greear) Date: Mon, 09 Nov 2009 08:37:40 -0800 Subject: [Xorp-hackers] ProtoUnit: protocol framework In-Reply-To: <4AF824D8.1080305@incunabulum.net> References: <4AF824D8.1080305@incunabulum.net> Message-ID: <4AF84554.1080702@candelatech.com> Bruce Simpson wrote: > Hi all, > > This is just to bring your attention to the ProtoUnit framework, within > libproto. > > It looks as though some of the work here is unfinished, in the sense > that whilst all of the multicast control plane components (PIM, IGMP > and MLD) use it, it seems as though it was intended for use across the tree. > > I'm not going to recommend at this point that we try to finish this > work. Rather, I just wanted to draw it to people's attention, as it is > probably a useful building block for new protocols. > > Also, if anyone is following up on XRL, the class names used within the > Finder namespace are passed to the XrlRouter constructor. Most consumers > of XrlRouter don't instantiate it directly; rather, they use the > XrlStdRouter convenience interface, which has defaults for the Finder's > transport address. [1] > > For the multicast components, the XRL class name comes from a table in > libproto/proto_unit.cc. > > cheers, > BMS > > [1] P.S. Ben: I'd be interested to know how you deal with the Finder > namespace within your virtualization changes.
Are you running a new set > of XORP processes for each virtualized router, or sharing state within > the existing processes -- or am I missing something? > I run a new set of xorp (rtrmgr) on unique FINDER ports and with a unique routing table. Thanks, Ben > _______________________________________________ > Xorp-hackers mailing list > Xorp-hackers at icir.org > http://mailman.ICSI.Berkeley.EDU/mailman/listinfo/xorp-hackers > -- Ben Greear Candela Technologies Inc http://www.candelatech.com

From bms at incunabulum.net Mon Nov 9 09:04:04 2009 From: bms at incunabulum.net (Bruce Simpson) Date: Mon, 09 Nov 2009 17:04:04 +0000 Subject: [Xorp-hackers] ProtoUnit: protocol framework In-Reply-To: <4AF84554.1080702@candelatech.com> References: <4AF824D8.1080305@incunabulum.net> <4AF84554.1080702@candelatech.com> Message-ID: <4AF84B84.9040809@incunabulum.net> Ben Greear wrote: >> >> [1] P.S. Ben: I'd be interested to know how you deal with the Finder >> namespace within your virtualization changes. Are you running a new >> set of XORP processes for each virtualized router, or sharing state >> within the existing processes -- or am I missing something? >> > I run a new set of xorp (rtrmgr) on unique FINDER ports and with > a unique routing table. Thanks for clarifying this. As you've probably seen, the Finder namespace is completely flat, and doesn't have any tuples or extension fields which would let us easily inject an instance field. However, implementing one is certainly possible. It's out of scope for now, though. cheers, BMS

From bms at incunabulum.net Mon Nov 9 09:13:17 2009 From: bms at incunabulum.net (Bruce Simpson) Date: Mon, 09 Nov 2009 17:13:17 +0000 Subject: [Xorp-hackers] Crash due to stale cached xrl sender pointer.
In-Reply-To: <4AF6E006.7040506@incunabulum.net> References: <4AE799E1.8010300@candelatech.com> <4AE7D31B.1040107@candelatech.com> <4AE9B7D1.803@incunabulum.net> <4AE9BAE5.20406@candelatech.com> <4AE9CC6B.9050302@incunabulum.net> <4AE9D45B.1090003@candelatech.com> <4AE9E020.4060403@incunabulum.net> <4AE9EA19.5030702@candelatech.com> <4AF6A8BE.2000603@incunabulum.net> <4AF6CF87.3040106@incunabulum.net> <4AF6E006.7040506@incunabulum.net> Message-ID: <4AF84DAD.3090004@incunabulum.net>

Bruce Simpson wrote: > It would be good to know if changing the Xrl allocation in this way > helps the situation with the race you saw... >

A bit more on the 'static Xrl*' mechanism.

This is a lazy allocation mechanism, which amortizes the cost of instantiating class Xrl. The change I posted yesterday just moves the pointer to this allocation into the XrlFooClient itself, where we can manage its lifecycle; the semantics of its use don't change.

Some data points:
* There is only one instance of cached 'class Xrl*' for each XrlFooClient method.
* Each method's wrapper lazy-allocates it once per runtime, when the method wrapper is called.
* Without my change, this lazy allocation is non-reentrant, and happens once per runtime.
* With my change, it happens once per method, for each XrlFooClient instance, which is reentrant (but still not thread-safe).
* All it does is set up the Xrl::args() field. There is no further amortization of the up-front cost of sending an XRL.
* If the consumer of XrlFooClient::send_foo(target, arg) changes its 'target' argument, the resolved XrlSender is very likely to change, forcing a FinderDBEntry lookup in XrlRouter::send() anyway.
* No upfront resolution is performed when the target changes.

In practice, this optimization probably doesn't offer much, but it's better than nothing. class Xrl is still an intermediate representation, and until the Xrl is actually sent on the wire, it doesn't get 'packed' into its on-wire representation.
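A minimal C++ sketch of the two allocation patterns described above: a function-local static shared by every caller versus a pointer owned by each client instance. `Xrl`, `old_style_send_foo()` and `FooClient` are hypothetical stand-ins, not the generated XrlFooClient code; the real cached object would also carry the pre-built `Xrl::args()`.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>

// Hypothetical stand-in for class Xrl (not XORP's real class).
struct Xrl {
    explicit Xrl(std::string m) : method(std::move(m)) {}
    std::string method;
    int uses = 0;   // how many sends have reused this pre-built object
};

// Before: a function-local static shared by every client instance.
// Allocated on the first call, never freed, and nothing can invalidate it.
inline Xrl* old_style_send_foo() {
    static Xrl* cached = nullptr;
    if (cached == nullptr)
        cached = new Xrl("foo");
    cached->uses++;
    return cached;
}

// After: the cached pointer is a member, so each client instance owns its
// own copy and controls its lifetime. Still not thread-safe.
class FooClient {
public:
    Xrl* send_foo() {
        if (!_foo_xrl)
            _foo_xrl.reset(new Xrl("foo"));
        _foo_xrl->uses++;
        return _foo_xrl.get();
    }
    void invalidate() { _foo_xrl.reset(); }   // lifecycle is now manageable
private:
    std::unique_ptr<Xrl> _foo_xrl;
};
```

The member version costs one allocation per method per instance instead of one per process, in exchange for a cache that can actually be reset.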
There is a lost optimization opportunity here. XRL clients which use the C++ client stubs never need to tunnel XRLs through the Finder, so they can always use the binary representation. However, there is nowhere to keep it until we actually hit the output buffers for the destination transport. [1]

I haven't yet measured, cycle for cycle, the cost of sending an XRL. I will probably do this when profiling the Thrift run; I'm still in the analysis phase, trying to come up with the least invasive change so as to avoid rewriting code I don't need to.

Thread safety hasn't really been a consideration to date. I'm making notes, but I'm not going to substantially change the existing code to make it thread-safe.

cheers,
BMS

[1] We can probably leverage this opportunity using Thrift, as we can then render the binary representation upfront. We can also use a mixin with TTransport to do scatter/gather I/O with multiple TMemoryBuffers, so that each send needs only a single writev() or send() call, keeping the syscall count minimal. XRL will currently try to do this, but it's tied to stream semantics.

From greearb at candelatech.com Mon Nov 9 10:04:23 2009 From: greearb at candelatech.com (Ben Greear) Date: Mon, 09 Nov 2009 10:04:23 -0800 Subject: [Xorp-hackers] Crash due to stale cached xrl sender pointer. In-Reply-To: <4AF84DAD.3090004@incunabulum.net> References: <4AE799E1.8010300@candelatech.com> <4AE7D31B.1040107@candelatech.com> <4AE9B7D1.803@incunabulum.net> <4AE9BAE5.20406@candelatech.com> <4AE9CC6B.9050302@incunabulum.net> <4AE9D45B.1090003@candelatech.com> <4AE9E020.4060403@incunabulum.net> <4AE9EA19.5030702@candelatech.com> <4AF6A8BE.2000603@incunabulum.net> <4AF6CF87.3040106@incunabulum.net> <4AF6E006.7040506@incunabulum.net> <4AF84DAD.3090004@incunabulum.net> Message-ID: <4AF859A7.7070600@candelatech.com> On 11/09/2009 09:13 AM, Bruce Simpson wrote: > Bruce Simpson wrote: >> It would be good to know if changing the Xrl allocation in this way >> helps the situation with the race you saw...
> > A bit more on the 'static Xrl*' mechanism. > > This is a lazy allocation mechanism, which amortizes the cost of > instantiating class Xrl. The change I posted yesterday just moves the > pointer to this allocation into the XrlFooClient itself, where we can > manage its lifecycle; the semantics of its use don't change. > > Some data points: > * There is only one instance of cached 'class Xrl*' for each > XrlFooClient method. > * Each method's wrapper lazy-allocates it once per runtime, when the > method wrapper is called. > * Without my change, this lazy allocation is non-reentrant, and happens > once per runtime. > * With my change, it happens once per method call, for each XrlFooClient > instance, which is reentrant (but still not thread-safe). > * It sets up the Xrl::args() field. That's it. No additional > amortization of up-front costs of sending an XRL. > * If the consumer of XrlFooClient::send_foo(target, arg) changes its > 'target' argument, the resolved XrlSender is very likely to change, > forcing a FinderDBEntry lookup in XrlRouter::send() anyway. > * No upfront resolution is performed when the target changes. > > In practice, this optimization probably doesn't offer much, but it's > better than nothing. class Xrl is still an intermediate representation, > and until the Xrl is actually sent on the wire, it doesn't get 'packed' > into its on-wire representation. > > There is a lost optimization opportunity here. XRL clients which use the > C++ client stubs never need to tunnel XRLs through the Finder, so they > can always use binary representation. However, there is nowhere to put > it, until we actually hit the output buffers for the destination > transport. [1] > > I haven't measured cycle-for-cycle yet the costs of sending an XRL. I > probably will do this when profiling the Thrift run; I'm still in > analysis, trying to come up with the least invasive change to avoid > rewriting code I don't/won't need to. 
> > Thread safety hasn't really been a consideration to date. I'm making > notes, but I'm not going to substantially change the behaviour towards > making existing code thread-safe. I think it is way too early to worry about micro-optimizations. If you want to move to Thrift, then let's do so. As long as the performance isn't noticeably worse, that's OK. Don't try to get everything figured out up front...get something working and then see what needs tweaking later. With regard to larger issues of the IPC, I dislike the current callback logic. I would like to get rid of all the auto-generated template code and use real callback objects. Each XRL call would have a pointer to a callback object and would call it at the appropriate times. If/when you get your Thrift stuff in, I might attempt to work on the callback logic. I don't think we should worry about any async message support at the transport level. As long as we have useful callbacks, then only the clients and servers need to worry about async. For instance, client says 'do-big-work()'. Server immediately responds 'working' (RPC call is now done). Later, server can send more messages to client as the big work completes. XRL doesn't need to know anything about this. Don't do any work to add functionality that *may* be needed in the future, such as extra transport mechanisms, thread safety, etc. I really hate threads unless they are absolutely required. They make things way too hard to debug. If/when it's actually needed, we can do that work then. (For threads, likely this would require a re-write of the entire application in question, as well as hacking on the various libraries.) > [1] We can probably leverage this opportunity using Thrift, as we can > then render the binary representation upfront. We can also use a mixin > with TTransport to do scatter/gather I/O with multiple TMemoryBuffers, > so we only hit writev() or send() to keep syscall count minimal.
XRL > will currently try to do this, but it's tied to stream semantics. "Premature optimization is the root of all evil." Let's get something stable, then figure out a big workload, and run some real performance tests (with oprofile, gprof, custom profiling, etc.). For instance, in my case, just adding that netlink filtering took system load from 300+ to 20, effectively turning an O(N^2) operation into an O(N) one. This sort of thing is way more important than saving a few cycles on encode/decode of messages we send once per second. Most of the slowdowns I've seen are stupid things like sleeping for a timeout instead of doing work, probably to work around some race no one felt like debugging. Thanks, Ben -- Ben Greear Candela Technologies Inc http://www.candelatech.com

From bms at incunabulum.net Mon Nov 9 10:57:10 2009 From: bms at incunabulum.net (Bruce Simpson) Date: Mon, 09 Nov 2009 18:57:10 +0000 Subject: [Xorp-hackers] Crash due to stale cached xrl sender pointer. In-Reply-To: <4AF859A7.7070600@candelatech.com> References: <4AE799E1.8010300@candelatech.com> <4AE7D31B.1040107@candelatech.com> <4AE9B7D1.803@incunabulum.net> <4AE9BAE5.20406@candelatech.com> <4AE9CC6B.9050302@incunabulum.net> <4AE9D45B.1090003@candelatech.com> <4AE9E020.4060403@incunabulum.net> <4AE9EA19.5030702@candelatech.com> <4AF6A8BE.2000603@incunabulum.net> <4AF6CF87.3040106@incunabulum.net> <4AF6E006.7040506@incunabulum.net> <4AF84DAD.3090004@incunabulum.net> <4AF859A7.7070600@candelatech.com> Message-ID: <4AF86606.4050304@incunabulum.net> Ben Greear wrote: > > With regard to larger issues of the IPC, I dislike the current callback > logic. I would like to get rid of all the auto-generated template code > and use real callback objects. Each xrl call will have a pointer to a > callback > object and will call that at appropriate times. > > If/when you get your Thrift stuff in, I might attempt to work on > the callback logic. Can you expand on what a 'real callback object' is?
The callback() method in XORP is a template method which returns a real C++ object. FWIW, Boost.Function is actually implemented in a pretty similar way to XORP's callback library: http://www.boost.org/doc/libs/1_40_0/doc/html/function.html It is actually more generic than XORP's, which assumes that a callback() is never going to be invoked right away (like a functor). The code already maintains a callback pointer for XRL invocation; see XrlSender::send(). Because Xrl is an intermediate representation, and is sometimes used synchronously, the callback isn't maintained there. > > I don't think we should worry about any async message support in the > transport > level. As long as we have useful callbacks, then only the > client/servers need > to worry about async. For instance, client says 'do-big-work()'. > Server immediately responds 'working' (RPC call is now done). > Later, server can send more messages to client as the big work completes. > > XRL doesn't need to know anything about this. We already do. Asynchrony is just a fact of life in libxipc. To dispel any confusion: I'm not referring to asynchrony at the level of a socket or a UNIX file descriptor. When discussing asynchrony in my posts, I am writing about mechanisms already present in the code. It's been clear from past history, and private exchanges with other developers, that scalability has been a concern, particularly with BGP. Scalability is on the worktop now: something to be considered, though not necessarily acted on right now. I am documenting what I'm seeing as I read further into the code; the way we do XRL does influence our scalability. We may not always use streams for the transport, for such reasons. The big problem with Thrift has been that there is no asynchrony story in Thrift as it's currently shipped. As such, I'm 'retro-fitting' the protocol into XORP, following a principle of least invasive change.
I don't plan to change callback behaviour, nor other aspects of the existing programming model. > > If/when it's actually needed, we can do that work then. (For threads, > likely this would require a re-write of the entire application in > question, as well as hacking on the various libraries.) My current opinion is that making the tree thread-safe is feasible, but most likely means getting rid of some of the non-reentrant code which exists in libxorp. Thread safety is very unlikely to involve a ground-up rewrite of the whole tree, if appropriate constraints are in place. Boost could be leveraged to deliver some of that thread safety, but that's out-of-scope for the current effort. > > For instance, in my case, just adding that netlink filtering took > system load > from 300+ to 20, effectively making an O(N^2) to O(N). This sort of > thing is way more important than saving a few cycles on encode/decode > of messages we send once per second. In the case of XRL, pack/unpack is probably not the bottleneck; allocations probably are. I appreciate what you're doing with Netlink; I often wish myself that the BSD developers would just implement that API, as it would solve a number of problems they're now having with API stability. > > Most of the slowdowns I've seen are stupid things like sleeping for > a timeout instead of doing work, probably to work around some race > no one felt like debugging. Were these timeouts actually blocking execution of other pending tasks? If so, that's something which needs dealing with. thanks, BMS From bms at incunabulum.net Mon Nov 9 11:00:36 2009 From: bms at incunabulum.net (Bruce Simpson) Date: Mon, 09 Nov 2009 19:00:36 +0000 Subject: [Xorp-hackers] Crash due to stale cached xrl sender pointer. In-Reply-To: <4AF84B84.9040809@incunabulum.net> References: <4AF824D8.1080305@incunabulum.net> <4AF84554.1080702@candelatech.com> <4AF84B84.9040809@incunabulum.net> Message-ID: <4AF866D4.40303@incunabulum.net> So... back to the original problem. 
I'm glad non-reentrancy can be ruled out as a root cause, but that is always something that's good to tackle. It is possible that there's a race when the system is loaded, but the bug is all too obvious -- we are using a pointer as a cache, and there is no way to invalidate it. If there has been a race, I would really like to catch it. My best guess is that one of these ran before the send:
1. a FinderClient operation, which would invalidate the resolved sender transport. This is most likely.
2. a transport layer close notification inside an XrlPFSender. But this is unlikely. XrlPFSTCP uses BufferedAsyncFileReader. Only the Win32 transport code in that file will dispatch an async close, because that's what Winsock does.
...where XRL is concerned, these are the places I'd look more carefully for races.

Sorry to be a nazi about ref_ptr. It's one of those things I wish would just go away. There is a need for a refcounted object/pointer type. ref_ptr fills that gap now; it existed before Boost did. There are several places where its use is actually typedef'd away, making it harder to tell, at a glance, what's going on. One of these is in the FinderClient. I'm going to want to get rid of this; ref_ptr caused no end of problems for me in OLSR. I did say that Boost needs to be introduced carefully and incrementally; I'll be sticking to that. I took a look at how Boost implements the weak_ptr / shared_ptr split. It turns out they both reference an embedded sp_counted_base instance to implement the refcounts. They maintain separate counts of holders and watchers. It's thread safe, but it'll use atomic ops if you don't want to use threads, and it looks pretty efficient.

In the meantime, to avoid using ref_ptr, here's what I'd suggest. Before using 's', we can sanity-check it against an existing map, XrlRouter::_senders2, which seems to have been introduced to support batch XRL operations.
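A sketch of the suggested sanity check, assuming the router owns its senders in a map keyed by name (analogous to, but not the same as, `XrlRouter::_senders2`). `Router`, `Sender`, `add()`, `remove()` and `still_valid()` are hypothetical names: before a cached pointer is used, one map lookup confirms that the key still maps to exactly that object.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <string>
#include <utility>

// Hypothetical stand-ins: these are not the real XrlRouter/XrlPFSender types.
struct Sender {
    explicit Sender(std::string n) : name(std::move(n)) {}
    std::string name;
};

class Router {
public:
    // Creates (or replaces) the sender for 'key'; the router owns it.
    Sender* add(const std::string& key) {
        auto& slot = _senders[key];
        slot.reset(new Sender(key));
        return slot.get();
    }
    void remove(const std::string& key) { _senders.erase(key); }

    // Sanity check for a cached pointer: the key must still map to
    // exactly this object. One map lookup, no ownership changes.
    bool still_valid(const std::string& key, const Sender* s) const {
        auto it = _senders.find(key);
        return it != _senders.end() && it->second.get() == s;
    }

private:
    std::map<std::string, std::unique_ptr<Sender>> _senders;
};
```

A caveat on this design: comparing addresses cannot detect a freed sender whose memory was reused by a new one at the same address. A generation counter stored alongside the pointer, or `std::weak_ptr` (the standard counterpart of the Boost weak_ptr mentioned above), would close that gap.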
We know the key, and there is exactly one match, so the lookup should be quick enough. Given that we already maintain this map, I'm surprised we don't use it for XrlRouter::get_sender() anyway. But beware: that method name is overloaded. Of course, it's likely the change can be picked up later. I am looking at things with the Thrift goggles on right now, and seeing exactly the same code pattern I saw emerging in OLSR.

thanks,
BMS

From greearb at candelatech.com Mon Nov 9 11:27:53 2009 From: greearb at candelatech.com (Ben Greear) Date: Mon, 09 Nov 2009 11:27:53 -0800 Subject: [Xorp-hackers] Crash due to stale cached xrl sender pointer. In-Reply-To: <4AF86606.4050304@incunabulum.net> References: <4AE799E1.8010300@candelatech.com> <4AE7D31B.1040107@candelatech.com> <4AE9B7D1.803@incunabulum.net> <4AE9BAE5.20406@candelatech.com> <4AE9CC6B.9050302@incunabulum.net> <4AE9D45B.1090003@candelatech.com> <4AE9E020.4060403@incunabulum.net> <4AE9EA19.5030702@candelatech.com> <4AF6A8BE.2000603@incunabulum.net> <4AF6CF87.3040106@incunabulum.net> <4AF6E006.7040506@incunabulum.net> <4AF84DAD.3090004@incunabulum.net> <4AF859A7.7070600@candelatech.com> <4AF86606.4050304@incunabulum.net> Message-ID: <4AF86D39.50108@candelatech.com> On 11/09/2009 10:57 AM, Bruce Simpson wrote: > Ben Greear wrote: >> >> With regard to larger issues of the IPC, I dislike the current callback >> logic. I would like to get rid of all the auto-generated template code >> and use real callback objects. Each xrl call will have a pointer to a >> callback >> object and will call that at appropriate times. >> >> If/when you get your Thrift stuff in, I might attempt to work on >> the callback logic. > > Can you expand on what a 'real callback object' is? > The callback() method in XORP is a template method which returns a real > C++ object. The template code is impossible to understand. I'd have a base 'XorpCallBack' class and have others inherit from that.
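Ben's proposal of a plain base class for callbacks can be sketched as follows, combined with his point about the Router Manager's task completion: if the completion callback carries the identity of the task that finished, the manager can erase that task rather than assuming it is the front of the list (the `pop_front()` pattern seen in the valgrind trace earlier in this thread). Everything here is hypothetical: `XorpCallBack` is Ben's proposed name, and `Task`/`SimpleTaskManager` are simplified stand-ins, not rtrmgr's classes.

```cpp
#include <cassert>
#include <cstddef>
#include <list>
#include <string>
#include <utility>

// Hypothetical base class for completion callbacks (not XORP's API).
class XorpCallBack {
public:
    virtual ~XorpCallBack() = default;
    // Reports completion of the task identified by 'task_id'.
    virtual void dispatch(unsigned task_id, bool success,
                          const std::string& errmsg) = 0;
};

struct Task {
    Task(unsigned i, std::string n) : id(i), name(std::move(n)) {}
    unsigned id;
    std::string name;
};

// Erases the task that actually completed, not whatever is at the front.
class SimpleTaskManager : public XorpCallBack {
public:
    unsigned add(const std::string& name) {
        _tasks.emplace_back(_next_id, name);
        return _next_id++;
    }
    void dispatch(unsigned task_id, bool success,
                  const std::string& errmsg) override {
        for (auto it = _tasks.begin(); it != _tasks.end(); ++it) {
            if (it->id == task_id) {
                if (!success)
                    _last_error = errmsg;
                _tasks.erase(it);
                return;
            }
        }
        // Unknown or already-completed id: nothing is erased, so a
        // duplicate completion cannot remove some other task.
    }
    std::size_t size() const { return _tasks.size(); }
    const std::string& front_name() const { return _tasks.front().name; }
    const std::string& last_error() const { return _last_error; }
private:
    std::list<Task> _tasks;
    std::string _last_error;
    unsigned _next_id = 1;
};
```

Identifying tasks by id rather than by address also means a late or repeated completion is simply ignored instead of touching freed memory.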
In implementing clients/servers, I'd probably have a few methods like: foo::handleCallback(CallBackObject& obj) that would handle lots of different callbacks in one place. Maybe with a real object of obvious type, we could actually trace code flow. As it sits, the opaque templated callbacks make it impossible for me to really understand a backtrace, for instance. Anyway, that's just a rant. A more critical problem is things like task_done() in task.cc (rtr-mgr). It always assumes that the thing that completed was the front of _tasklist, but it is not always so (that is one cause of the crash I worked around with the ref-ptr hack you dislike so much). That task_done() should receive a task object so that it knows exactly what to remove from the _task_list. In fact, if tasks can be completed asynchronously, then that assumption in task_list is completely broken. If things cannot be completed async, then there is no need for all the callback indirection anyway, and we could greatly simplify logic flow by removing the callback logic. > FWIW, Boost.Function is actually implemented in a pretty similar way to > XORP's callback library: > http://www.boost.org/doc/libs/1_40_0/doc/html/function.html That's not helping me feel good about Boost :P >> I don't think we should worry about any async message support in the >> transport >> level. As long as we have useful callbacks, then only the >> client/servers need >> to worry about async. For instance, client says 'do-big-work()'. >> Server immediately responds 'working' (RPC call is now done). >> Later, server can send more messages to client as the big work completes. >> >> XRL doesn't need to know anything about this. > > We already do. Asynchrony is just a fact of life in libxipc. > > To dispel any confusion: I'm not referring to asynchrony at the level of > a socket or a UNIX file descriptor. When discussing asynchrony in my > posts, I am writing about mechanisms already present in the code. 
> > It's been clear from past history, and private exchanges with other > developers, that scalability has been a concern, particularly with BGP. > Scalability is something I consider on the worktop now, and to be > considered, but not necessarily taking action on it right now. Someone that uses BGP should make a test case, including whatever tools are needed to fake out a large BGP thing, and we should figure out exactly where the problem lies. Without a way to test this, we will have no idea if the Thrift/Boost thing (or anything else) is better or not. > In the case of XRL, pack/unpack is probably not the bottleneck; > allocations probably are. It's very easy to run under oprofile and/or gprof (now that I posted patches for gprof support). You don't have to assume anything. >> Most of the slowdowns I've seen are stupid things like sleeping for >> a timeout instead of doing work, probably to work around some race >> no one felt like debugging. > > Were these timeouts actually blocking execution of other pending tasks? > If so, that's something which needs dealing with. They were blocking the task I cared about (xorpsh 'commit'). Anyway, I fixed this already. I also fixed the other 2-second sleeps on process startup (both patches were posted some time back). Thanks, Ben -- Ben Greear Candela Technologies Inc http://www.candelatech.com From greearb at candelatech.com Mon Nov 9 11:39:08 2009 From: greearb at candelatech.com (Ben Greear) Date: Mon, 09 Nov 2009 11:39:08 -0800 Subject: [Xorp-hackers] Crash due to stale cached xrl sender pointer. In-Reply-To: <4AF866D4.40303@incunabulum.net> References: <4AF824D8.1080305@incunabulum.net> <4AF84554.1080702@candelatech.com> <4AF84B84.9040809@incunabulum.net> <4AF866D4.40303@incunabulum.net> Message-ID: <4AF86FDC.4040605@candelatech.com> On 11/09/2009 11:00 AM, Bruce Simpson wrote: > So... back to the original problem. 
> > I'm glad non-reentrancy can be ruled out as a root cause, but that is > always something that's good to tackle. > > It is possible that there's a race when the system is loaded, but the > bug is all too obvious -- we are using a pointer as a cache, and there > is no way to invalidate it. > > If there has been a race, I would really like to catch it. > > My best guess is that the send ran before one of these: > 1. a FinderClient operation, which would invalidate the resolved sender > transport, runs before the Xrl send. > This is most likely. > 2. a transport layer close notification inside an XrlPFSender. > But this is unlikely. XrlPFSTCP uses BufferedAsyncFileReader. Only the > Win32 transport code in that file will dispatch an async close, because > that's what Winsock does. > > ...where XRL is concerned, these are the places I'd look more carefully > for races. > > Sorry to be a nazi about ref_ptr. It's one of those things I wish > would just go away. There is a need for a refcounted object/pointer > type. ref_ptr fills that gap now; it existed before Boost did. If you have something that takes its place and works, then just make the conversion and commit the patch. It just needs a logical reference pointer...I don't care exactly how it's implemented. I posted patches to enable the XRL perf tests, so it's easy to run performance regression tests now. > There are several places where its use is actually typedef'd away, > making it harder to tell, at-a-glance, what's going on. One of these is > in the FinderClient. I'm going to want to get rid of this; ref_ptr > caused no end of problems for me in OLSR. It wouldn't hurt my feelings to remove every typedef in the code, except perhaps for typedefs of function pointers (which are ugly as sin no matter how you do them). > I did say that Boost needs to be introduced carefully and incrementally; > I'll be sticking to that. I took a look at how Boost implements the > weak_ptr / shared_ptr split.
It turns out they both reference an > embedded sp_counted_base instance, to implement the refcounts. They > maintain a separate count of holders and watchers. It's thread safe, but > it'll use atomic ops if you don't want to use threads; and looks pretty > efficient. I'll look at Boost when someone posts a patch to make it do something. I'll try to keep an open mind until then :) Thanks, Ben -- Ben Greear Candela Technologies Inc http://www.candelatech.com

From lizhaous2000 at yahoo.com Tue Nov 10 06:27:00 2009 From: lizhaous2000 at yahoo.com (Li Zhao) Date: Tue, 10 Nov 2009 06:27:00 -0800 (PST) Subject: [Xorp-hackers] autotool Message-ID: <325038.10683.qm@web58705.mail.re1.yahoo.com>

The XORP build system has an "Always use our own libtool" policy. This libtool is a little old, and it seems that by default it does not generate PIC code, so shared libraries were not built. How can I use the libtool on my host machine? I might figure this out later, but if someone can quickly point it out, that would be great. Thanks. Li

From lizhaous2000 at yahoo.com Tue Nov 10 07:50:32 2009 From: lizhaous2000 at yahoo.com (Li Zhao) Date: Tue, 10 Nov 2009 07:50:32 -0800 (PST) Subject: [Xorp-hackers] autotool In-Reply-To: <325038.10683.qm@web58705.mail.re1.yahoo.com> Message-ID: <899503.16340.qm@web58708.mail.re1.yahoo.com>

I figured it out. In config/ltmain.sh there are two parameters (VERSION, TIMESTAMP) which control which libtool will be built. In configure.in the AC_DISABLE_SHARED option was turned on. If we comment this out, PIC code will be compiled. --- On Tue, 11/10/09, Li Zhao wrote: > From: Li Zhao > Subject: [Xorp-hackers] autotool > To: xorp-hackers at icir.org > Date: Tuesday, November 10, 2009, 9:27 AM > In xorp build system, "Always use our > own libtool" policy was there. And this libtool > is a little bit old and it seems by default it did not > generate PIC code > so shared libraries were not built. How can I use libtool > on my host machine?
> I might figure out this later, but if some one can quickly > point it out, > That would be great. > > > Thanks. > > Li >

From bms at incunabulum.net Thu Nov 12 04:14:51 2009 From: bms at incunabulum.net (Bruce Simpson) Date: Thu, 12 Nov 2009 12:14:51 +0000 Subject: [Xorp-hackers] Crash due to stale cached xrl sender pointer. In-Reply-To: <4AF86D39.50108@candelatech.com> References: <4AE799E1.8010300@candelatech.com> <4AE7D31B.1040107@candelatech.com> <4AE9B7D1.803@incunabulum.net> <4AE9BAE5.20406@candelatech.com> <4AE9CC6B.9050302@incunabulum.net> <4AE9D45B.1090003@candelatech.com> <4AE9E020.4060403@incunabulum.net> <4AE9EA19.5030702@candelatech.com> <4AF6A8BE.2000603@incunabulum.net> <4AF6CF87.3040106@incunabulum.net> <4AF6E006.7040506@incunabulum.net> <4AF84DAD.3090004@incunabulum.net> <4AF859A7.7070600@candelatech.com> <4AF86606.4050304@incunabulum.net> <4AF86D39.50108@candelatech.com> Message-ID: <4AFBFC3B.4030400@incunabulum.net> Ben Greear wrote: > > The template code is impossible to understand. I'd have a base > 'XorpCallBack' > class and have others inherit from that. That approach is likely to introduce additional indirection, due to the nature of virtual functions in C++; see below. > In implementing clients/servers, > I'd probably have a few methods like: > foo::handleCallback(CallBackObject& obj) > that would handle lots of different callbacks in one place. > > Maybe with a real object of obvious type, we could actually trace > code flow. As it sits, the opaque templated callbacks make it > impossible for me to really understand a backtrace, for instance. This sounds like a job for stlfilt.
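For comparison with the base-class approach, here is a member callback bound entirely at compile time: the template fixes both the object type and the member function, so dispatch is a direct call through a member-function pointer with no virtual method involved. This is an illustration only, not XORP's callback library; `MemberCallback`, `make_callback` and `Commit` are made-up names.

```cpp
#include <cassert>
#include <string>

// Illustrative only (not XORP's callback library): binds an object and a
// member function at compile time; dispatch involves no virtual call.
template <class Obj, class Arg>
class MemberCallback {
public:
    using Method = void (Obj::*)(Arg);
    MemberCallback(Obj* obj, Method m) : _obj(obj), _m(m) {}
    void dispatch(Arg a) const { (_obj->*_m)(a); }
private:
    Obj* _obj;
    Method _m;
};

// Helper that deduces the template arguments, in the spirit of callback().
template <class Obj, class Arg>
MemberCallback<Obj, Arg> make_callback(Obj* obj, void (Obj::*m)(Arg)) {
    return MemberCallback<Obj, Arg>(obj, m);
}

struct Commit {
    std::string status;
    void done(const std::string& s) { status = s; }
};
```

The price is that every instantiation is a distinct type (here `MemberCallback<Commit, const std::string&>`), which is what produces long names in backtraces. Storing callbacks of different types in one container needs some form of type erasure, such as a common base class with a virtual method or `std::function`, which reintroduces an indirect call.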
In an ideal world, the tool chain we normally use, GNU C++, would be smart enough to run members which happen to refer to templates through the demangler; unfortunately, this doesn't always happen. One thing I find particularly frustrating is the inability to elide namespaces or default template arguments. I don't need to know all the time that method foo is using std::string with the given allocator, which happens to be the default STL allocator. Of course, another problem with implementing this is where the debug information needed to do it would end up. I suppose one could say this is part of the added value of commercially available toolchains, but of course those aren't always up to scratch either. I would hope LLVM is an improvement... > > Anyway, that's just a rant. It's something which has affected all of our work at some point. > > If things cannot be completed async, then there is no need for all > the callback indirection anyway, and we could greatly simplify logic > flow by removing the callback logic. I can't speak for the Router Manager right now, although in the case of XrlAction, the asynchrony is probably needed. In libxipc, it's very clear that asynchrony wasn't introduced unnecessarily. There's a clear need to have a continuation in the situation where an RPC endpoint isn't known straight away ('resolving an XRL'), and callbacks are the usual mechanism by which this is realized. I would agree that the syntax is not that elegant, and can be difficult to work with. If we had strong support for coroutines in the language, it would be an entirely different story. It might be worth looking at D or Go. Python certainly has coroutine/continuation support, in the form of the 'yield' keyword. One of the reasons I'm doing the work on Thrift is to give us a means of moving to other languages for implementing different parts of the system, and thus more options.
There are good reasons why C++ was chosen in the beginning, but I'd be the first to agree that C++ may not be the right tool for all of the components in the system. > >> FWIW, Boost.Function is actually implemented in a pretty similar way to >> XORP's callback library: >> http://www.boost.org/doc/libs/1_40_0/doc/html/function.html > > That's not helping me feel good about Boost :P One problem with writing callback libraries for C++ is that functions themselves are not first-class objects in C++: you can't refer to the argument list as a container, for example, and there is no concept of reflection (aka introspection) as Java or Python have. Implementing the callbacks using a base class usually means that there's a virtual method involved, which means another pointer indirection to dispatch it. Python and other languages don't have these problems, because their design is completely different. cheers, BMS From bms at incunabulum.net Thu Nov 12 04:16:20 2009 From: bms at incunabulum.net (Bruce Simpson) Date: Thu, 12 Nov 2009 12:16:20 +0000 Subject: [Xorp-hackers] autotool In-Reply-To: <899503.16340.qm@web58708.mail.re1.yahoo.com> References: <899503.16340.qm@web58708.mail.re1.yahoo.com> Message-ID: <4AFBFC94.5030504@incunabulum.net> Li Zhao wrote: > I figured it out. in config/ltmain.sh there are two Parameters (VERSION, TIMESTAP) > which control what libtool will be built. in configure.in the AC_DISABLE_SHARED option was turned on. > If we comment this out, the PIC code will be compiled. > Do you see any issues like this with the SCons build? AFAIK the SCons framework does not use libtool in any way, and should use the appropriate -fPIC vs -fpic flags for the target.
cheers, BMS From lizhaous2000 at yahoo.com Thu Nov 12 07:40:18 2009 From: lizhaous2000 at yahoo.com (Li Zhao) Date: Thu, 12 Nov 2009 07:40:18 -0800 (PST) Subject: [Xorp-hackers] autotool In-Reply-To: <4AFBFC94.5030504@incunabulum.net> Message-ID: <461295.61430.qm@web58701.mail.re1.yahoo.com> Sorry, I don't know what SCons is. I modified xorp building process a little bit. Each file needed to build a library is compiled in two versions (PIC and non-PIC), and SOME libraries are built in both static and shared form after I added "-rpath /usr/local/xorp/lib", since my platform (LINUX VM on x86) supports shared libraries. My initial plan was to build xorp executables selectively, i.e. my new protocol process would be built linking shared libraries and the original xorp processes would be built linking static libraries. But it is not that easy to achieve this. So if one library has a shared version, the shared library will be linked with each process which depends on it. So some processes are linked with some libraries that are shared and with others that are static. The build is running fine. Now I am testing whether they are working as before. --- On Thu, 11/12/09, Bruce Simpson wrote: > From: Bruce Simpson > Subject: Re: [Xorp-hackers] autotool > To: "Li Zhao" > Cc: xorp-hackers at icir.org > Date: Thursday, November 12, 2009, 7:16 AM > Li Zhao wrote: > > I figured it out. in config/ltmain.sh there are two > Parameters (VERSION, TIMESTAP) > > which control what libtool will be built. in > configure.in the AC_DISABLE_SHARED option was turned on. > > If we comment this out, the PIC code will be > compiled. > > > > Do you see any issues like this with the SCons build? AFAIK > the SCons framework does not use libtool in any way, and > should use the appropriate -fPIC vs -fpic flags for the > target.
> > cheers, > BMS > From bms at incunabulum.net Thu Nov 12 07:42:24 2009 From: bms at incunabulum.net (Bruce Simpson) Date: Thu, 12 Nov 2009 15:42:24 +0000 Subject: [Xorp-hackers] autotool In-Reply-To: <461295.61430.qm@web58701.mail.re1.yahoo.com> References: <461295.61430.qm@web58701.mail.re1.yahoo.com> Message-ID: <4AFC2CE0.5030507@incunabulum.net> Li Zhao wrote: > Sorry, I don't know what SCons is. I modified xorp building process a little bit. > The code in SVN is using SCons now instead of autotools, so libtool is not used. We've had some issues with the rpath, these were followed up in earlier threads on this list (although not everything has been dealt with, yet). cheers, BMS From lizhaous2000 at yahoo.com Thu Nov 12 07:52:08 2009 From: lizhaous2000 at yahoo.com (Li Zhao) Date: Thu, 12 Nov 2009 07:52:08 -0800 (PST) Subject: [Xorp-hackers] autotool In-Reply-To: <4AFC2CE0.5030507@incunabulum.net> Message-ID: <778286.12545.qm@web58704.mail.re1.yahoo.com> I downloaded xorp 1.6 from the internet and have been using it since then. I will probably stick to this xorp version for quite a while, because our product schedule requires a stable development process. I might update to a new version after we have done the necessary build and test. --- On Thu, 11/12/09, Bruce Simpson wrote: > From: Bruce Simpson > Subject: Re: [Xorp-hackers] autotool > To: "Li Zhao" > Cc: xorp-hackers at icir.org > Date: Thursday, November 12, 2009, 10:42 AM > Li Zhao wrote: > > Sorry, I don't know what SCons is. I modified xorp > building process a little bit. > > The code in SVN is using SCons now instead of autotools, so > libtool is not used. > > We've had some issues with the rpath, these were followed > up in earlier threads on this list (although not everything > has been dealt with, yet).
> > cheers, > BMS > From lizhaous2000 at yahoo.com Thu Nov 12 09:08:05 2009 From: lizhaous2000 at yahoo.com (Li Zhao) Date: Thu, 12 Nov 2009 09:08:05 -0800 (PST) Subject: [Xorp-hackers] autotool In-Reply-To: <4AFC2CE0.5030507@incunabulum.net> Message-ID: <965858.66121.qm@web58705.mail.re1.yahoo.com> The initial test shows that the xorp processes linked with shared libraries are running correctly and healthily. rpath is set correctly and the executable files are much smaller now. --- On Thu, 11/12/09, Bruce Simpson wrote: > From: Bruce Simpson > Subject: Re: [Xorp-hackers] autotool > To: "Li Zhao" > Cc: xorp-hackers at icir.org > Date: Thursday, November 12, 2009, 10:42 AM > Li Zhao wrote: > > Sorry, I don't know what SCons is. I modified xorp > building process a little bit. > > The code in SVN is using SCons now instead of autotools, so > libtool is not used. > > We've had some issues with the rpath, these were followed > up in earlier threads on this list (although not everything > has been dealt with, yet). > > cheers, > BMS > From bms at incunabulum.net Fri Nov 13 09:02:18 2009 From: bms at incunabulum.net (Bruce Simpson) Date: Fri, 13 Nov 2009 17:02:18 +0000 Subject: [Xorp-hackers] Boost and Win32 update Message-ID: <4AFD911A.307@incunabulum.net> Hi all, As of today I've done refactorings for boost::noncopyable, boost::polymorphic_cast and boost::polymorphic_downcast on the tree. I don't plan to commit any Boost parts until after a 1.7 release is branched, as, like the SCons changes, they are potential tree breakers. I also cut a patch for backing out the Windows support in its entirety. We still have no volunteers to take the Windows port on; I have no firm date to de-orbit it yet. In any event, if someone wanted to pick up on it, the changes will be in public SVN history.
cheers, BMS From bms at incunabulum.net Sun Nov 15 01:30:11 2009 From: bms at incunabulum.net (Bruce Simpson) Date: Sun, 15 Nov 2009 09:30:11 +0000 Subject: [Xorp-hackers] Boost and Win32 update In-Reply-To: <4AFD911A.307@incunabulum.net> References: <4AFD911A.307@incunabulum.net> Message-ID: <4AFFCA23.2020209@incunabulum.net> Bruce Simpson wrote: > Hi all, > > As of today I've done refactorings for boost::noncopyable, > boost::polymorphic_cast and boost::polymorphic_downcast on the tree. > A bit more about these things, and what they do. boost::noncopyable is a mixin class which can replace any occurrence of making the assignment operator and copy constructor private/protected for a class; the class simply inherits from it. Because inheritance is used to do this, without 'virtual', it should make absolutely no difference to performance or binary footprint. The reason for using this inline mixin is to make it absolutely clear, in source, that instances of the class are not copyable; and this should make for more readable code. boost::polymorphic_downcast() is a function template which replaces static_cast. It is used in situations where we are downcasting a pointer to a class to one of its derived classes, *without* runtime checking in release builds. What it does is quite simply to assert() on the equivalent dynamic_cast, before returning the result of static_cast. This lets us catch inheritance problems in debug builds, because static_cast is never checked against the inheritance graph for a class at runtime. It is not suitable for crosscasts, or for downcasts from a virtual base class, since static_cast cannot perform either; in those cases, we need to use a dynamic_cast. boost::polymorphic_cast() is a function template which replaces the use of dynamic_cast where we expect a downcast or crosscast to always succeed. If the cast fails, a std::bad_cast exception will be raised.
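The three Boost utilities described above are small enough to sketch in plain C++. This is not Boost's source: NonCopyable, downcast_checked_in_debug() and cast_or_throw() are hypothetical names that mimic boost::noncopyable, boost::polymorphic_downcast and boost::polymorphic_cast, the Base/Derived/Other classes are invented for illustration, and C++11 "= delete" is used where a 2009-era compiler would need private, undefined copy operations.

```cpp
#include <cassert>
#include <type_traits>
#include <typeinfo>  // std::bad_cast

// Hypothetical stand-in for boost::noncopyable: the copy operations are
// deleted here, so every class that inherits from it is non-copyable too.
class NonCopyable {
protected:
    NonCopyable() = default;
    ~NonCopyable() = default;  // not virtual: the mixin adds no vtable
public:
    NonCopyable(const NonCopyable&) = delete;
    NonCopyable& operator=(const NonCopyable&) = delete;
};

// Sketch of polymorphic_downcast(): the dynamic_cast is only evaluated
// inside assert(), so the check disappears when NDEBUG is defined.
template <typename Target, typename Source>
Target downcast_checked_in_debug(Source* x) {
    assert(dynamic_cast<Target>(x) == x);
    return static_cast<Target>(x);
}

// Sketch of polymorphic_cast(): always checked, and it throws instead of
// returning a null pointer when the object has the wrong type.
template <typename Target, typename Source>
Target cast_or_throw(Source* x) {
    Target result = dynamic_cast<Target>(x);
    if (result == nullptr)
        throw std::bad_cast();
    return result;
}

// Hypothetical class hierarchy used to exercise the casts.
struct Base : private NonCopyable {
    virtual ~Base() = default;  // polymorphic, so dynamic_cast is allowed
};
struct Derived : Base {
    int value = 42;
};
struct Other : Base {};

static_assert(!std::is_copy_constructible<Derived>::value,
              "NonCopyable makes every derived class non-copyable");
```

With Boost itself one would write class Foo : private boost::noncopyable, boost::polymorphic_downcast<Derived*>(b) and boost::polymorphic_cast<Derived*>(b), including <boost/noncopyable.hpp> and <boost/cast.hpp>. The downcast sketch only checks anything when NDEBUG is not defined, which matches the debug-build-only behaviour described above.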
Normally, this exception is only raised by the C++ runtime library when we perform a dynamic_cast<T> where T is a *reference*, not a pointer. The main benefit of the Boost cast is that it makes the intention clear in the source, and it also allows us to catch the exception at runtime for gracefully backing out of the operation where the cast failed. In XORP, this pattern is typically seen where we are performing a dynamic_cast on an object pointer, and then XLOG_ASSERT() on the result. It is *not* suitable in the situation where dynamic_cast is used to check if a pointed-to object supports a given interface or not, or where we actually depend on it returning NULL. There are a number of places in the code where we do this, and this is a different subject. The only other change to date I've made in my private Boost branch is to replace the use of pcreposix for regular expressions with Boost's C regex API. Boost.Regex also has a C++ API, although when I quickly hacked in its use, it didn't function as expected, so I reverted to the C API. As code in this branch has started to use Boost concepts and utility headers, it makes sense to use Boost's regex library. It also means one less third-party library to ship with the system. This stuff is all in my own Hg branch for the moment; when it's time to start rolling 1.7, I'll probably export these as SVN feature branches on SF. thanks, BMS From bms at incunabulum.net Sun Nov 15 01:30:54 2009 From: bms at incunabulum.net (Bruce Simpson) Date: Sun, 15 Nov 2009 09:30:54 +0000 Subject: [Xorp-hackers] On replacing ref_ptr with shared_ptr Message-ID: <4AFFCA4E.1090801@incunabulum.net> What I'd really like to do is to be able to replace ref_ptr with shared_ptr/weak_ptr, although this is a non-trivial refactoring. The code typedefs ref_ptr away in a number of places, so it's not 100% obvious where it is being used.
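A minimal sketch of the shared_ptr/weak_ptr pairing proposed here, including the enable_shared_from_this mixin discussed in the next paragraph. It uses the std:: versions that C++11 adopted from TR1 and Boost rather than boost::, and the Session class and still_alive() helper are hypothetical.

```cpp
#include <cassert>
#include <memory>

// Hypothetical refcounted object of the kind XORP currently holds via ref_ptr.
// enable_shared_from_this embeds a weak_ptr to 'this', which is filled in
// when the first shared_ptr takes ownership of the object.
class Session : public std::enable_shared_from_this<Session> {
public:
    // Hand out an owning pointer instead of a raw 'this'. It shares the
    // existing control block, so use_count() goes up by one while it lives;
    // shared_ptr<Session>(this) would instead start a second, independent
    // count and eventually delete the object twice.
    std::shared_ptr<Session> self() { return shared_from_this(); }
};

// Non-owning observer check: a weak_ptr does not keep the Session alive,
// and lock() returns an empty shared_ptr once the last owner is gone.
bool still_alive(const std::weak_ptr<Session>& observer) {
    return static_cast<bool>(observer.lock());
}
```

shared_from_this() is only valid once the object is already owned by a shared_ptr (here it would be created with std::make_shared); calling it on an object that no shared_ptr owns throws std::bad_weak_ptr in C++17 and is undefined behaviour before that.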
The callback code also uses ref_ptr extensively -- that's probably a candidate for the Boost mixin enable_shared_from_this, which is now part of TR1 and C++0x. It is used to embed a weak_ptr to 'this' in every class which is potentially shared as a refcounted object. The object can then use the shared_from_this() member to obtain a shared_ptr to itself which shares the existing refcount, instead of creating a second, independent count from a raw 'this' pointer. The reason Boost users usually do this is to return the shared_ptr to the object in situations where it would otherwise return or pass its 'this' pointer. We don't want the object to embed a shared_ptr to itself, because that self-reference would keep the refcount from ever reaching zero, and the object would never be freed; this is why the embedded pointer is a weak_ptr (declared 'mutable') instead. There are a number of places in the code where we use refcounting, or we need it and it isn't there; the ref_ptr& semantics are potentially harmful, as the intent isn't clear, especially for folk who are either new to the code, or are getting back into it after being away for months. From bms at incunabulum.net Sun Nov 15 01:31:21 2009 From: bms at incunabulum.net (Bruce Simpson) Date: Sun, 15 Nov 2009 09:31:21 +0000 Subject: [Xorp-hackers] dynamic_cast refactoring candidate in RIB Message-ID: <4AFFCA69.6080400@incunabulum.net> I picked up on this on a drive-by code review on Friday. It might interest someone trying to wring a few more cycles out of the code, or in deploying Boost in a useful way. The RIB has a table_has_name_and_type() function template which is currently unused. It could be removed. dynamic_cast is used in a number of places to check the type of a given RouteTable at runtime. A refactoring to replace these uses of dynamic_cast with a table_has_type() function template might be slightly more readable. Because the return value of this dynamic_cast is used as a hint as to whether the object supports a given interface or not, it is not suitable for a Boost polymorphic_cast, nor polymorphic_downcast.
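A sketch of what the suggested table_has_type() template might look like. The class names below are stand-ins rather than the real RIB classes, and any template parameters the real tables take are left out.

```cpp
#include <cassert>

// Hypothetical miniature of a route table hierarchy (not the real RIB classes).
class RouteTable {
public:
    virtual ~RouteTable() = default;  // polymorphic, so dynamic_cast works
};
class OriginTable : public RouteTable {};
class RedistTable : public RouteTable {};

// Sketch of table_has_type(): answers "is this table a T?". A null
// dynamic_cast result is an ordinary "no" here, not an error.
template <typename T>
bool table_has_type(const RouteTable* table) {
    return dynamic_cast<const T*>(table) != nullptr;
}
```

Because a null result is an ordinary answer, this is exactly the interface-check case that rules out polymorphic_cast. The single table_type() virtual mentioned next would replace the dynamic_cast with an ordinary virtual call, at the cost of keeping an enumeration of table types up to date by hand.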
It could be argued that dynamic_cast is potentially expensive, although in practice this is highly dependent upon the C++ compiler in use. As a rough heuristic, dynamic_cast is normally only a performance problem if the inheritance graph is deeper than ~4 classes. If object pointers are used to perform this check, dynamic_cast and RTTI must be used (which works here because the classes have virtual functions). It could conceivably be replaced with some other mechanism, e.g. a single table_type() virtual, although this refactoring probably isn't on a critical path. From bms at incunabulum.net Tue Nov 17 08:20:03 2009 From: bms at incunabulum.net (Bruce Simpson) Date: Tue, 17 Nov 2009 16:20:03 +0000 Subject: [Xorp-hackers] Runtime diagnostics and callbacks in backtraces In-Reply-To: <4AF86D39.50108@candelatech.com> References: <4AE799E1.8010300@candelatech.com> <4AE7D31B.1040107@candelatech.com> <4AE9B7D1.803@incunabulum.net> <4AE9BAE5.20406@candelatech.com> <4AE9CC6B.9050302@incunabulum.net> <4AE9D45B.1090003@candelatech.com> <4AE9E020.4060403@incunabulum.net> <4AE9EA19.5030702@candelatech.com> <4AF6A8BE.2000603@incunabulum.net> <4AF6CF87.3040106@incunabulum.net> <4AF6E006.7040506@incunabulum.net> <4AF84DAD.3090004@incunabulum.net> <4AF859A7.7070600@candelatech.com> <4AF86606.4050304@incunabulum.net> <4AF86D39.50108@candelatech.com> Message-ID: <4B02CD33.9050305@incunabulum.net> Ben Greear wrote: > > The template code is impossible to understand. I'd have a base > 'XorpCallBack' > class and have others inherit from that. In implementing > clients/servers, > I'd probably have a few methods like: > foo::handleCallback(CallBackObject& obj) > that would handle lots of different callbacks in one place. > > Maybe with a real object of obvious type, we could actually trace > code flow. As it sits, the opaque templated callbacks make it > impossible for me to really understand a backtrace, for instance.
I've been doing some research on a related topic today, as it's something which affects all users (including corporate). We've had a few folk on xorp-users@ who have installed XORP on production systems, and run into problems. In this situation, GDB often isn't accessible, or crash dumps are difficult to retrieve. One of the things KDE, another large C++ system, does is to try to retrieve crash dumps when an application launched from the KDE Desktop crashes. To do this, however, it requires GDB. More information here: http://techbase.kde.org/Development/Tutorials/Debugging/How_to_create_useful_crash_reports * Let's recap on callbacks: The XORP callback library, in libxorp, is a set of function templates which bind up C++ function pointers (including member function pointers) for deferred invocation. [The Boost.Function library operates in a similar way, although the representation of a callable object there is more decoupled from its implementation.] I believe part of the problem is GCC's 'deep typedef substitution' feature. After GNU C++ 3.x, the ABI changed considerably, and so did the runtime. Code generation also changed. * Let's recap that C++ templates have parameterized types: Most C++ toolchains, including GNU C++ since 3.x, will use the _canonical type name_ when producing debugging information. * Whilst this leads to complete and correct error messages for the compiler, they aren't particularly easy to read, so tools (e.g. stlfilt, gdb-stl-utils) abound to parse C++ compiler error output. These aren't really useful to us here; our problem is that of producing meaningful system diagnostics, when the router is deployed in the field. One answer to this might be to add explicit backtrace support which we can potentially ship in a production build. 
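The canonical-name behaviour described above is easy to observe, because typeid() reports the same type encoding that appears inside mangled symbol names, which is what a backtrace symbolizer works from. This sketch assumes GCC or Clang on an Itanium C++ ABI platform, where abi::__cxa_demangle() is declared in <cxxabi.h>; RouteNameMap and demangle() are made-up names. Demangling the type name of the typedef gives the fully spelled-out std::map instantiation, default template arguments included, and the typedef name never appears.

```cpp
#include <cassert>
#include <cstdlib>    // std::free
#include <cxxabi.h>   // abi::__cxa_demangle (GCC/Clang, Itanium C++ ABI)
#include <map>
#include <string>
#include <typeinfo>

// A made-up typedef, in the style XORP uses for its callback types.
typedef std::map<int, std::string> RouteNameMap;

// Demangle an Itanium-ABI type or symbol name; returns the input unchanged
// if demangling fails.
std::string demangle(const char* mangled) {
    int status = 0;
    char* readable = abi::__cxa_demangle(mangled, nullptr, nullptr, &status);
    std::string result = (status == 0 && readable) ? readable : mangled;
    std::free(readable);  // __cxa_demangle allocated it with malloc()
    return result;
}
```

For example, demangle(typeid(RouteNameMap).name()) yields a string beginning with the std::map<int, ...> instantiation and its std::less and std::allocator defaults, while demangle("_ZN3foo3barEv") yields foo::bar().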
Unfortunately, the typedef substitution issue can't really be dealt with here: producing a backtrace with arguments requires a lot more debugging information, and all we're likely to get out of backtraces is the call stack, but not the contents of the stack frames. So I've investigated GLIBC's backtrace(), libunwind, and libdwarf. [1] backtrace() itself has been part of GLIBC since 2.1; however, it doesn't demangle C++ symbols. There is sample code out there to wrap backtrace() and backtrace_symbols() with abi::__cxa_demangle() to produce meaningful C++ backtraces. GNU C++ will emit DWARF entries for typedef'd types at the point of their use, based on some quick experiments with the 'dwarfdump' utility on the 'call_xrl' binary. Let's walk through: * The XrlRecvCallback type gets its own DW_TAG_typedef tag in the DWARF segments, and any functions which reference it appear to do so as a DW_TAG_formal_parameter pointing to the typedef, NOT the canonical type. * However, the backtrace is going to contain a reference to the canonical type, NOT its typedef, due to how template name mangling works, which allows the linker to do its job with the code which is generated as the result of instantiating a template. * If it were possible to introduce an alias to the mangled symbol for the template expansion which *contained* the typedef, that would give us a hint, but debugging tools are probably still going to have to take their best guess using a heuristic. * Because the templatized callback object's dispatch() member could be called from conceivably anywhere, mapping the callback object's 'this' pointer back to a typedef is probably still going to require manual inspection, though. This is a very callback-specific problem. As you quite rightly pointed out originally, it's something which could potentially be solved with a first-class object, i.e.
instead of this: typedef XorpCallback2::RefPtr XrlRecvCallback; try to do something like this: class XrlRecvCallback : public XorpCallback2::RefPtr {} ... this may work, because RefPtr is also a typedef alias for the ref_ptr<T> object, where T is XorpCallback2 in the above example. (In template meta-programming, typedef is assignment when working with types.) This would, however, prevent the linker from doing any coalescing of function fragments for the instantiation of template XorpCallback2 above, which would lead to classic template bloat. [2] Also, the syntax is still really ugly. I guess what we'd love GNU C++ to do is to let us provide some sort of hint for the type name of a callable object, to make things more human-readable. This is a bit like what we want, but XrlRecvCallback is not just a specialization, but a typedef alias of a *member* of a specialization: http://wiki.dwarfstd.org/index.php?title=C%2B%2B0x:_Template_Aliases Of course, the name XrlRecvCallback is deceptive, because it's actually a ref_ptr to the callback itself. As you know, I'm pretty opposed to obscuring the use of a refcounted object pointer, because of the pain it's caused me during development. Fully specialized templates are legal C++, but I don't know if they are legal as template aliases in C++0x. If we had something like this: template<> using XorpRecvCallback = XorpCallback2; ...and then substituted this for the original use of XorpRecvCallback: shared_ptr<XorpRecvCallback> ...that might work for me. cheers, BMS [1] P.S. I reckon a backtrace dumper for production builds could be knocked up really quickly. [2] In practice this might not be as big an issue as one might think, because the linker may end up having to put multiple weak symbols for template instantiations into each dynamic object, where we're using shared libraries. See the recent auto_ptr patch I posted for the XRL client stubs, which has a similar problem.
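On the C++0x question above: an alias is written as a plain "using Name = Type;" with no template<> prefix, since alias templates cannot be explicitly specialized and "template<> using" is not valid syntax. An alias is only another spelling of the same type, whereas a derived class is a new type. The sketch below checks this with a made-up Callback2 template standing in for XorpCallback2, whose real template arguments are not reproduced here.

```cpp
#include <cassert>
#include <type_traits>
#include <typeinfo>

// Hypothetical stand-in for a two-argument callback template.
template <typename R, typename A1, typename A2>
struct Callback2 {
    virtual ~Callback2() = default;
    virtual R dispatch(A1, A2) = 0;
};

// C++11 alias declaration (what C++0x "template aliases" became): no
// template<> prefix, and no new type, just another name for the instantiation.
using RecvCallbackAlias = Callback2<void, int, const char*>;

// Deriving from the instantiation creates a distinct type whose own name
// appears in typeid() output and in the mangling of functions that take it.
struct RecvCallbackClass : Callback2<void, int, const char*> {};

static_assert(std::is_same<RecvCallbackAlias,
                           Callback2<void, int, const char*>>::value,
              "an alias names the same type");
static_assert(!std::is_same<RecvCallbackAlias, RecvCallbackClass>::value,
              "a derived class is a different type");
```

Because the alias is transparent, mangled names, and therefore backtraces, keep showing the full Callback2<...> instantiation. The derived class introduces a name of its own, but member functions it inherits from the base instantiation still appear under the base's name in a backtrace.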
From bms at incunabulum.net Tue Nov 17 08:24:59 2009 From: bms at incunabulum.net (Bruce Simpson) Date: Tue, 17 Nov 2009 16:24:59 +0000 Subject: [Xorp-hackers] Runtime diagnostics and callbacks in backtraces In-Reply-To: <4B02CD33.9050305@incunabulum.net> References: <4AE799E1.8010300@candelatech.com> <4AE7D31B.1040107@candelatech.com> <4AE9B7D1.803@incunabulum.net> <4AE9BAE5.20406@candelatech.com> <4AE9CC6B.9050302@incunabulum.net> <4AE9D45B.1090003@candelatech.com> <4AE9E020.4060403@incunabulum.net> <4AE9EA19.5030702@candelatech.com> <4AF6A8BE.2000603@incunabulum.net> <4AF6CF87.3040106@incunabulum.net> <4AF6E006.7040506@incunabulum.net> <4AF84DAD.3090004@incunabulum.net> <4AF859A7.7070600@candelatech.com> <4AF86606.4050304@incunabulum.net> <4AF86D39.50108@candelatech.com> <4B02CD33.9050305@incunabulum.net> Message-ID: <4B02CE5B.2060001@incunabulum.net> Bruce Simpson wrote: > > If we had something like this: > > **template<> using XorpRecvCallback = **XorpCallback2; > > ...and then substituted this for the original use of > ****XorpRecvCallback****: > shared_ptr<****XorpRecvCallback****> > Omit all asterisks except XrlArgs*, Thunderbird converted some bold that crept in. 
From bms at incunabulum.net Tue Nov 17 08:35:16 2009 From: bms at incunabulum.net (Bruce Simpson) Date: Tue, 17 Nov 2009 16:35:16 +0000 Subject: [Xorp-hackers] Runtime diagnostics and callbacks in backtraces In-Reply-To: <4B02CD33.9050305@incunabulum.net> References: <4AE799E1.8010300@candelatech.com> <4AE7D31B.1040107@candelatech.com> <4AE9B7D1.803@incunabulum.net> <4AE9BAE5.20406@candelatech.com> <4AE9CC6B.9050302@incunabulum.net> <4AE9D45B.1090003@candelatech.com> <4AE9E020.4060403@incunabulum.net> <4AE9EA19.5030702@candelatech.com> <4AF6A8BE.2000603@incunabulum.net> <4AF6CF87.3040106@incunabulum.net> <4AF6E006.7040506@incunabulum.net> <4AF84DAD.3090004@incunabulum.net> <4AF859A7.7070600@candelatech.com> <4AF86606.4050304@incunabulum.net> <4AF86D39.50108@candelatech.com> <4B02CD33.9050305@incunabulum.net> Message-ID: <4B02D0C4.9030404@incunabulum.net> Bruce Simpson wrote: > [1] P.S. I reckon a backtrace dumper for production builds could be > knocked up really quickly. This guy has the magic: http://idlebox.net/2008/0901-stacktrace-demangled/ ... so I guess a SCons check is needed for backtrace() and backtrace_symbols() in libexecinfo (the *BSD port of the GLIBC backtrace() module) and libc, __cxa_demangle() in libstdc++, plus header checks on and . I'd probably add an XLOG_BACKTRACE() macro to xlog.h, and omit the time/date prologue appropriately using something similar to XLOG_LEVEL_RTRMGR_ONLY_NO_PREAMBLE. I'd put the backtrace wrapper code in a xorp_backtrace() function in libxorp/debug.c. We can't really get away from using heap storage in this situation (because of all the string processing), so if the heap is toasted, all bets are off. XLOG_FATAL() and XLOG_ASSERT() could then be taught to call xorp_backtrace() in most situations. I'm off out the door now today. If I hadn't been stuck on C++0x template aliases, this would probably be done now.
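A minimal sketch of the dumper described here, assuming glibc (or libexecinfo on *BSD) for backtrace() and backtrace_symbols(), and a GCC/Clang runtime for abi::__cxa_demangle(). The names demangle_frame(), dump_backtrace() and XLOG_BACKTRACE_SKETCH() are hypothetical rather than existing XORP code, and the parsing assumes glibc's "binary(symbol+offset) [address]" line format.

```cpp
#include <cassert>
#include <cstdio>
#include <cstdlib>
#include <cxxabi.h>    // abi::__cxa_demangle
#include <execinfo.h>  // backtrace, backtrace_symbols
#include <string>

// Hypothetical helper: pull the mangled name out of one backtrace_symbols()
// line ("binary(mangled+0xoffset) [0xaddress]") and demangle it.
// Lines that do not match, or names that fail to demangle, are returned as-is.
std::string demangle_frame(const std::string& line) {
    std::string::size_type open = line.find('(');
    std::string::size_type plus = line.find('+', open);  // npos if open is npos
    if (open == std::string::npos || plus == std::string::npos || plus == open + 1)
        return line;
    std::string mangled = line.substr(open + 1, plus - open - 1);
    int status = 0;
    char* readable = abi::__cxa_demangle(mangled.c_str(), nullptr, nullptr, &status);
    if (status != 0 || readable == nullptr) {
        std::free(readable);
        return line;
    }
    std::string result = line.substr(0, open + 1) + readable + line.substr(plus);
    std::free(readable);
    return result;
}

// Sketch of a xorp_backtrace()-style dumper: writes the demangled call stack
// to stderr and returns the number of frames captured.
int dump_backtrace() {
    void* frames[64];
    int depth = backtrace(frames, 64);
    char** symbols = backtrace_symbols(frames, depth);  // malloc()s the strings
    if (symbols == nullptr)
        return depth;
    for (int i = 0; i < depth; ++i)
        std::fprintf(stderr, "  #%d %s\n", i, demangle_frame(symbols[i]).c_str());
    std::free(symbols);
    return depth;
}

// Hypothetical macro in the spirit of the proposed XLOG_BACKTRACE().
#define XLOG_BACKTRACE_SKETCH() ((void)dump_backtrace())
```

On Linux, function names only appear in backtrace_symbols() output for symbols exported to the dynamic symbol table, so the binary usually has to be linked with -rdynamic, and static functions still show up as bare addresses. backtrace_symbols() and the std::string handling both allocate, which is why this remains best-effort once the heap is corrupt.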
cheers, BMS From greearb at candelatech.com Tue Nov 17 18:17:47 2009 From: greearb at candelatech.com (Ben Greear) Date: Tue, 17 Nov 2009 18:17:47 -0800 Subject: [Xorp-hackers] Runtime diagnostics and callbacks in backtraces In-Reply-To: <4B02CD33.9050305@incunabulum.net> References: <4AE799E1.8010300@candelatech.com> <4AE7D31B.1040107@candelatech.com> <4AE9B7D1.803@incunabulum.net> <4AE9BAE5.20406@candelatech.com> <4AE9CC6B.9050302@incunabulum.net> <4AE9D45B.1090003@candelatech.com> <4AE9E020.4060403@incunabulum.net> <4AE9EA19.5030702@candelatech.com> <4AF6A8BE.2000603@incunabulum.net> <4AF6CF87.3040106@incunabulum.net> <4AF6E006.7040506@incunabulum.net> <4AF84DAD.3090004@incunabulum.net> <4AF859A7.7070600@candelatech.com> <4AF86606.4050304@incunabulum.net> <4AF86D39.50108@candelatech.com> <4B02CD33.9050305@incunabulum.net> Message-ID: <4B03594B.90608@candelatech.com> On 11/17/2009 08:20 AM, Bruce Simpson wrote: > These aren't really useful to us here; our problem is that of producing > meaningful system diagnostics, when the router is deployed in the field. > One answer to this might be to add explicit backtrace support which we > can potentially ship in a production build. If you just had the original XRL in text format and/or binary format, you could use gdb and/or live error-handling code to print it out. You could post-process this manually or programmatically to decode what the XRL command is, for instance. Given the current call chain, the backtrace is likely to stop at the timer handler or task dispatcher, which still gives you no clue about what actually created the timer or task. > This is a very callback-specific problem. > > As you quite rightly pointed out originally, it's something which could > potentially be solved with a first-class object, i.e. instead of this: A first-class object would let us easily add a __LINE__ and __FILE__ where it was created at, for instance.
This could be printed out later with or without fancy backtrace support in the toolchain. But, what would be even better is to decrease the number of callbacks and just process code in a linear manner and pass back correct return codes and/or message strings. Then the backtraces will actually show the originator of the actions. I wouldn't mind re-writing the router manager to do this more to my liking, but I'll wait until you post your xorp-ipc rewrite first. > typedef XorpCallback2 XrlArgs*>::RefPtr XrlRecvCallback; > try to do something like this: > class XrlRecvCallback : public XorpCallback2 XrlArgs&, XrlArgs*>::RefPtr {} That still has template and typedef shit all over it. I really don't think we should need templates for this, and a very few, if any, typedefs. With proper accounting for object ownership, we could probably get rid of ref-ptrs w/out undue pain as well. Thanks, Ben -- Ben Greear Candela Technologies Inc http://www.candelatech.com From bms at incunabulum.net Wed Nov 18 06:58:23 2009 From: bms at incunabulum.net (Bruce Simpson) Date: Wed, 18 Nov 2009 14:58:23 +0000 Subject: [Xorp-hackers] Runtime diagnostics and callbacks in backtraces In-Reply-To: <4B03594B.90608@candelatech.com> References: <4AE799E1.8010300@candelatech.com> <4AE7D31B.1040107@candelatech.com> <4AE9B7D1.803@incunabulum.net> <4AE9BAE5.20406@candelatech.com> <4AE9CC6B.9050302@incunabulum.net> <4AE9D45B.1090003@candelatech.com> <4AE9E020.4060403@incunabulum.net> <4AE9EA19.5030702@candelatech.com> <4AF6A8BE.2000603@incunabulum.net> <4AF6CF87.3040106@incunabulum.net> <4AF6E006.7040506@incunabulum.net> <4AF84DAD.3090004@incunabulum.net> <4AF859A7.7070600@candelatech.com> <4AF86606.4050304@incunabulum.net> <4AF86D39.50108@candelatech.com> <4B02CD33.9050305@incunabulum.net> <4B03594B.90608@candelatech.com> Message-ID: <4B040B8F.6040701@incunabulum.net> Ben Greear wrote: > > If you just had the original XRL in text format and/or binary format, > you could use gdb and/or live 
error-handling code to print it out. You > could post-process this manually or programmatically to decode what > the XRL command is, for instance. I'm confused why preserving the XRL would make a difference here. The XRL target handlers aren't callbacks; they normally aren't invoked by any means other than the XrlRouter class handling an incoming RPC request. In any event, the handlers themselves never get to see the raw XRL; they just get passed its arguments. These issues always arise in any async programming environment, including e.g. Twisted Python; however, that language has built-in introspection for call stack frames. > > Given the current call chain, the backtrace is likely to stop at the > timer handler or task dispatcher, which still gives you no clue about > what actually created the timer or task. This is certainly the case for in-process XorpTimers or XorpTasks, which is probably the root of our shared frustration. > > A first-class object would let us easily add a __LINE__ and __FILE__ > where it > was created at, for instance. This could be printed out later with or > without > fancy backtrace support in the toolchain. Have you looked at DEBUG_CALLBACKS, which already does some of this? The problem with it is that it doesn't offer fine granularity -- if you turn it on for everything, the code will wedge. It's also a bit invasive for production use. > > But, what would be even better is to decrease the number of callbacks > and just process code in a linear manner and pass back correct > return codes and/or message strings. Then the backtraces will > actually show the originator of the actions. It is a more general pattern, and I would be quite surprised if solutions haven't already been found elsewhere. However, asking for asynchronous code to be made linear is pretty much calling for a design change which won't happen. The whole XORP architecture is designed to run in an event-driven manner.
In all of the code that I've read, I haven't seen any which was async when it didn't need to be. One of the quirks which Quagga's design was infamous for was that management commands would block out the router itself from actually running. For example, it was possible to tie up the BGP process by dumping out its routing table, miss updates, and drop the BGP session as a result. XORP isn't completely immune to this condition, because there's still a possibility of racing an update timer; XRL server-side routines are synchronous, and there is no preemption. By and large it avoids it, though, by being event-driven. > > I wouldn't mind re-writing the router manager to do this more > to my liking, but I'll wait until you post your xorp-ipc rewrite > first. It might interest you to know that the Router Manager is not being used in the commercial XORP product, although this has more to do with how it's being deployed. I think it was a design mistake to implement the process control in C++, but that's just my opinion. > ... > That still has template and typedef shit all over it. I really don't > think we should need templates for this, and a very few, if any, > typedefs. Callback libraries are intended to be generic, and in C++, templates are usually the mechanism by which generic constructs are realized, as they're type-agnostic. Boost.Function is no different from libxorp in this respect. Code which dispatches callbacks normally doesn't need to know about bound arguments, although it might conceivably have to know about dispatch-time arguments. Those are however wholly dependent on the context in which the function template "callback()" is being used. We could argue about whether they're syntactic sugar, or just a means of getting the job done. Templates aren't going away from C++ any time soon (or the XORP code base, for that matter), so let's make what incremental changes we can to improve the situation now.
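To make the bound versus dispatch-time distinction concrete, here is a sketch using C++11's std::function and std::bind (std::function grew out of Boost.Function, mentioned earlier in the thread) instead of libxorp's callback() templates. Operation, Client and run_example() are hypothetical. The dispatcher only ever sees a void(int) signature; the tag string bound at creation time is invisible to it.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <utility>

// What the dispatcher knows: the dispatch-time signature only.
using DoneCallback = std::function<void(int status)>;

// Hypothetical event source: stores the callback and invokes it later,
// without knowing which arguments were bound when it was created.
class Operation {
public:
    explicit Operation(DoneCallback cb) : cb_(std::move(cb)) {}
    void complete(int status) { cb_(status); }  // status supplied at dispatch time

private:
    DoneCallback cb_;
};

// Hypothetical client with context it wants handed back on completion.
struct Client {
    std::string last;
    void on_done(const std::string& tag, int status) {
        last = tag + ":" + std::to_string(status);
    }
};

// Usage: "peer-a" is bound now; the status 7 arrives when the operation completes.
std::string run_example() {
    Client c;
    Operation op(std::bind(&Client::on_done, &c, std::string("peer-a"),
                           std::placeholders::_1));
    op.complete(7);
    return c.last;  // "peer-a:7"
}
```

A lambda such as [&c](int status) { c.on_done("peer-b", status); } would serve equally well as the bound callable; either way, Operation never needs to know about the bound tag.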
> With proper accounting for object ownership, we could probably get rid of ref-ptrs w/out undue pain as well. Now that's a good point. Something which the smart pointer(s) lack is a means of actually dumping out the shared ownership. In kernel code scenarios, we typically need to track exactly who the consumers of a given resource are, so that we can gracefully shut down user processes or other threads; most mutex implementations, for example, support tracking the resource owners. Given that code which needs to dispatch callbacks will need to hold a refcount on what is a dynamically allocated Memento holding the callback's state, there's a clear notion of ownership. When I peeked at the Boost internals last week, class sp_counted_base was used to track the watch/hold counts, but it doesn't track the owners by default. It turns out it can, if you define BOOST_SP_ENABLE_DEBUG_HOOKS, but it's an 'int' id space, and doesn't track the context in which the reference is actually held, which is really what we're asking for here. So having said all that, it's probably best to make the __FILE__ and __LINE__ glue part of callback()... which is something we already have, actually. It just has to be enabled appropriately.
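XORP's existing DEBUG_CALLBACKS support is not reproduced here. As an independent, hypothetical sketch (TracedCallback and TRACED_CALLBACK are made-up names), a callback object can carry its creation site for the cost of one pointer and one int, and only format text when it is asked to describe itself, for instance from a fatal-error handler.

```cpp
#include <cassert>
#include <cstdio>
#include <functional>
#include <utility>

// Hypothetical callback wrapper that remembers where it was created.
class TracedCallback {
public:
    TracedCallback(std::function<void()> fn, const char* file, int line)
        : fn_(std::move(fn)), file_(file), line_(line) {}

    void dispatch() const { fn_(); }

    // What an error handler could log next to a backtrace that, by itself,
    // would only show the timer or task dispatcher.
    void describe(std::FILE* out) const {
        std::fprintf(out, "callback created at %s:%d\n", file_, line_);
    }

    const char* file() const { return file_; }
    int line() const { return line_; }

private:
    std::function<void()> fn_;
    const char* file_;  // points at the __FILE__ string literal: nothing to free
    int line_;
};

// Capture the creation site automatically, as a callback() factory macro might.
#define TRACED_CALLBACK(fn) TracedCallback((fn), __FILE__, __LINE__)
```

Because __FILE__ expands to a string literal, storing its address needs no allocation or copying. Like any function-style macro, TRACED_CALLBACK() misparses a lambda whose capture list has a top-level comma, e.g. [&a, &b]; wrapping the lambda in an extra pair of parentheses avoids that.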
regards, BMS From greearb at candelatech.com Wed Nov 18 12:20:08 2009 From: greearb at candelatech.com (Ben Greear) Date: Wed, 18 Nov 2009 12:20:08 -0800 Subject: [Xorp-hackers] Runtime diagnostics and callbacks in backtraces In-Reply-To: <4B040B8F.6040701@incunabulum.net> References: <4AE799E1.8010300@candelatech.com> <4AE7D31B.1040107@candelatech.com> <4AE9B7D1.803@incunabulum.net> <4AE9BAE5.20406@candelatech.com> <4AE9CC6B.9050302@incunabulum.net> <4AE9D45B.1090003@candelatech.com> <4AE9E020.4060403@incunabulum.net> <4AE9EA19.5030702@candelatech.com> <4AF6A8BE.2000603@incunabulum.net> <4AF6CF87.3040106@incunabulum.net> <4AF6E006.7040506@incunabulum.net> <4AF84DAD.3090004@incunabulum.net> <4AF859A7.7070600@candelatech.com> <4AF86606.4050304@incunabulum.net> <4AF86D39.50108@candelatech.com> <4B02CD33.9050305@incunabulum.net> <4B03594B.90608@candelatech.com> <4B040B8F.6040701@incunabulum.net> Message-ID: <4B0456F8.1060801@candelatech.com> On 11/18/2009 06:58 AM, Bruce Simpson wrote: > Ben Greear wrote: >> >> If you just had the original XRL in text format and/or binary format, >> you could use gdb and/or live error-handling code to print it out. You >> could post-process this manually or programmatically to decode what >> the XRL command is, for instance. > > I'm confused why preserving the XRL would make a difference here. I'd like to know what triggered a chain of events leading up to an assert/crash/whatever. It's probably a timer (which could store its __LINE__ and __FILE__, and/or other debug info, on creation), or it's the result of some request from outside (XRL). >> A first-class object would let us easily add a __LINE__ and __FILE__ >> where it >> was created at, for instance. This could be printed out later with or >> without >> fancy backtrace support in the toolchain. > > Have you looked at DEBUG_CALLBACKS, which already does some of this?
> > The problem with it is that it doesn't have much fine granularity -- if > you turn it on for everything, the code will wedge. > It's also a bit invasive for production use. If you can't turn it on for production, then it's worthless for getting bug reports out of the field. If we're only storing strings in objects, and only writing them out to logs when a problem actually occurs, it shouldn't be _too_ much overhead. > However, asking for asynchronous code to be made linear, is pretty much > calling for a design change which won't happen. The whole XORP > architecture is designed to run in an event-driven manner. In all of the > code that I've read, I haven't seen any which was async when it didn't > need to be. Maybe this is the case, but from my grubbing around in rtr-mgr, it seems like things could be made more straightforward in many cases. No need to worry about this now, though; I'm not planning to hack on this anytime soon. >> I wouldn't mind re-writing the router manager to do this more >> to my liking, but I'll wait until you post your xorp-ipc rewrite >> first. > > It might interest you to know that the Router Manager is not being used > in the commercial XORP product, although this has more to do with how > it's being deployed. I have no knowledge of how commercial XORP is supposed to work. Until I see actual code being merged back into the open project, I'm going to assume commercial xorp has forked and gone away. As for the rest of the discussion, I'm trying to hold off on any unneeded xorp coding until you get your changes merged. Hopefully after that we can merge at least some of my tree upstream. And, after _that_, maybe I'll have time and interest in hacking on callbacks, ref-ptrs and such. Until then, I'm not doing much good by commenting more...
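Ben's idea of recording where the triggering event came from (a timer's creation site, or the text of an incoming XRL) could be sketched roughly as follows, assuming XORP's single-threaded event loop so that one global "current origin" pointer is enough. EventOrigin, OriginScope, current_origin() and report_origin() are hypothetical names; the intent is that the dispatcher installs the origin around each event, and an assert or crash handler prints it.

```cpp
#include <cassert>
#include <cstdio>

// Hypothetical description of what started the current chain of events.
struct EventOrigin {
    const char* kind;   // e.g. "timer" or "xrl"
    const char* where;  // creation file for timers, XRL text for requests
    int         line;   // creation line for timers, 0 for XRLs
};

// Single-threaded event loop assumed: one "current origin" pointer is enough.
static const EventOrigin* g_current_origin = nullptr;

const EventOrigin* current_origin() { return g_current_origin; }

// RAII guard the dispatcher would wrap around each timer expiry or XRL
// dispatch; when it goes out of scope the outer origin is restored.
class OriginScope {
public:
    explicit OriginScope(const EventOrigin* o) : _saved(g_current_origin) {
        assert(o != nullptr);
        g_current_origin = o;
    }
    ~OriginScope() { g_current_origin = _saved; }

    OriginScope(const OriginScope&) = delete;
    OriginScope& operator=(const OriginScope&) = delete;

private:
    const EventOrigin* _saved;
};

// What an assert or crash handler could print before aborting.
void report_origin(std::FILE* out) {
    const EventOrigin* o = current_origin();
    if (o == nullptr)
        std::fprintf(out, "origin: unknown\n");
    else
        std::fprintf(out, "origin: %s %s:%d\n", o->kind, o->where, o->line);
}
```

Because the origin records only point at string literals (or at an XRL buffer that outlives the dispatch), installing one costs a pointer swap per event and allocates nothing.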
Thanks, Ben -- Ben Greear Candela Technologies Inc http://www.candelatech.com From bms at incunabulum.net Fri Nov 20 07:35:00 2009 From: bms at incunabulum.net (Bruce Simpson) Date: Fri, 20 Nov 2009 15:35:00 +0000 Subject: [Xorp-hackers] Runtime diagnostics and callbacks in backtraces In-Reply-To: <4B0456F8.1060801@candelatech.com> References: <4AE799E1.8010300@candelatech.com> <4AE7D31B.1040107@candelatech.com> <4AE9B7D1.803@incunabulum.net> <4AE9BAE5.20406@candelatech.com> <4AE9CC6B.9050302@incunabulum.net> <4AE9D45B.1090003@candelatech.com> <4AE9E020.4060403@incunabulum.net> <4AE9EA19.5030702@candelatech.com> <4AF6A8BE.2000603@incunabulum.net> <4AF6CF87.3040106@incunabulum.net> <4AF6E006.7040506@incunabulum.net> <4AF84DAD.3090004@incunabulum.net> <4AF859A7.7070600@candelatech.com> <4AF86606.4050304@incunabulum.net> <4AF86D39.50108@candelatech.com> <4B02CD33.9050305@incunabulum.net> <4B03594B.90608@candelatech.com> <4B040B8F.6040701@incunabulum.net> <4B0456F8.1060801@candelatech.com> Message-ID: <4B06B724.50703@incunabulum.net> Hi Ben, To summarise: whilst DEBUG_CALLBACKS already takes appropriate steps to maintain the information we need in an efficient way, it unconditionally prints this information, which is not useful for production deployment. I guess what we really need is a means of selectively printing and extracting this information, in situations where we may want to unwind the stack to discover where the callback was actually instantiated and invoked. That said, it's a more general problem than XORP's, and I'd be very surprised if there weren't already projects out there which had dealt with this in a reasonable way. I'd be interested to hear what others come up with. I did start on a xorp_backtrace() patch this week; however, it is a total distraction from what I'm currently meant to be doing. Ben Greear wrote: > > I'd like to know what triggered a chain of events leading up to > an assert/crash/whatever.
It's probably a timer (which could > store its __LINE__ and __FILE__, and/or other debug info, > on creation), or it's the result of some request from outside (XRL). This is a reasonable need. Correlating what's happening in one process with what's happening in another would be non-trivial in the case where the processes are distributed. I think the best we're going to get out of this is function names for now, though. In the case of an XRL target, we really don't need to do any additional work in the target; its method names already match the RPC calls involved. JT has strong feelings that the logging framework could do with being changed to support uses like this. This is really a job for something like libunwind. We're still going to depend on a valid runtime C/C++ heap for this. In an ideal world, we could take a snapshot of the call stack when callback() is instantiated. Then, if we encounter a problem during the callback dispatch, we can unwind the stack, use DWARF debug information to identify whether we hit the dispatch() method of a callback class, and look for the stack frame which instantiated the callback object itself. Re the backtrace patch this week: the holdup there was that libexecinfo was producing differently formatted output from GLIBC's implementation of backtrace_symbols(), which makes the output of either pretty useless for portable post-processing unless I parse both formats. I emailed the maintainers/authors, but have received no response as yet. libunwind seems much more together; however, it neither compiles nor works on the BSDs, although it seems extremely portable between architectures. [DEBUG_CALLBACKS] > > If you can't turn it on for production, then it's worthless for getting > bug reports out of the field. If we're only storing strings in objects, > and only writing them out to logs when a problem actually occurs, it > shouldn't > be _too_ much overhead.
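Ben's "store it, but only write it out when a problem occurs" idea, combined with the no-allocation constraint Bruce raises in reply, could look roughly like this: a fixed-size ring of the most recent dispatch sites, holding only pointers to __FILE__ literals plus line numbers. SiteRing, CallbackSite and RECORD_DISPATCH() are hypothetical names; dump() would be called from an error or assert path rather than on every dispatch.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>

// Hypothetical record of one callback dispatch site.
struct CallbackSite {
    const char* file;   // string literal from __FILE__, lives in .rodata
    int         line;   // __LINE__
};

// Fixed-size ring: recording overwrites the oldest entry and never allocates.
template <std::size_t N>
class SiteRing {
    static_assert(N > 0, "ring needs at least one slot");

public:
    SiteRing() : _next(0), _count(0) {}

    void record(const char* file, int line) {
        _sites[_next].file = file;
        _sites[_next].line = line;
        _next = (_next + 1) % N;
        if (_count < N)
            ++_count;
    }

    std::size_t size() const { return _count; }

    // i == 0 is the most recent entry, i == size() - 1 the oldest kept.
    const CallbackSite& recent(std::size_t i) const {
        assert(i < _count);
        return _sites[(_next + N - 1 - i) % N];
    }

    // Only called when something has gone wrong.
    void dump(std::FILE* out) const {
        for (std::size_t i = 0; i < _count; ++i)
            std::fprintf(out, "#%zu %s:%d\n", i, recent(i).file, recent(i).line);
    }

private:
    CallbackSite _sites[N];
    std::size_t  _next;    // slot the next record() will write
    std::size_t  _count;   // number of valid entries, at most N
};

#define RECORD_DISPATCH(ring) (ring).record(__FILE__, __LINE__)
```

The per-dispatch cost is two stores and an index update; formatting happens only in dump(), which keeps the common path cheap enough to leave enabled in production builds.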
We _definitely_ want to avoid any allocations on this path; this rules out placing a std::string in the callback object. DEBUG_CALLBACKS already does this. When DEBUG_CALLBACKS is defined, callback() becomes a macro, which is what we need in order to see the correct information at the point where the callback is instantiated. I did a few quick experiments to identify what the cost of this is likely to be. I assumed that the runtime cost is likely to be sizeof(const char*) * 2 plus sizeof(int32_t) per callback object, at a minimum. So, GCC seems to generate string literals in the .rodata segment for __FILE__ and __func__. __LINE__ expands to a decimal integer literal, so it has type int (32 bits on common platforms), *NOT* size_t. Both __FILE__ and __LINE__ are handled entirely by the preprocessor; implementing __func__ needs help from the compiler. __func__ is special, because it gets its own symbol name in .rodata, consisting of the enclosing function's C++ ABI mangled name with a __func__ suffix. When GCC sees these constructs, it substitutes the const char[] itself by value. This normally decays to a const char*, but I should point out that it's by value: sizeof(__func__) returns the size of the const char[] (the name's length plus the terminating NUL), not sizeof(const char*). __FILE__ and __LINE__ on their own are probably not going to lend themselves well to automated traceback, though. That's where __func__ helps: C99 standardizes it, although as a predefined identifier supplied by the compiler rather than as a preprocessor macro. G++ was fine with this in -ansi, -pedantic, -std=c++98, -std=c++0x modes. One of the quirks of C99 __func__ is that it only yields the name of the function or method, NOT the class containing it. __PRETTY_FUNCTION__ is probably what we want, as it preserves the full C++ name, including the class and the signature. However, the string literal it places in .rodata is the demangled (human-readable) form rather than the ABI mangled name, and it's a GCC-specific extension. The symbols are produced in a similar way to __func__; only the suffix changes, to __PRETTY_FUNCTION__.
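The DEBUG_CALLBACKS-style record described above can be sketched as follows. CallbackDebugInfo, CALLBACK_DEBUG_INFO(), Task and func_name_size() are hypothetical names rather than the actual libxorp macros (step8_report() is borrowed from the rtrmgr backtrace purely as a stand-in). The record holds two pointers into .rodata plus an int, so building it allocates nothing, and it falls back to the standard __func__ when __PRETTY_FUNCTION__ is unavailable.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical per-callback debug record: two pointers to compiler-provided
// strings plus an int, so creating one never touches the heap.
struct CallbackDebugInfo {
    const char* file;   // from __FILE__
    const char* func;   // from __PRETTY_FUNCTION__ or __func__
    int         line;   // from __LINE__ (type int, not size_t)
};

// This has to be a macro: the predefined names must be evaluated where the
// callback is created, not inside some helper function.
#if defined(__GNUC__)
#define CALLBACK_DEBUG_INFO() \
    CallbackDebugInfo{ __FILE__, __PRETTY_FUNCTION__, __LINE__ }
#else
#define CALLBACK_DEBUG_INFO() \
    CallbackDebugInfo{ __FILE__, __func__, __LINE__ }
#endif

struct Task {
    // Stand-in for a method that would create a callback.
    CallbackDebugInfo step8_report() { return CALLBACK_DEBUG_INFO(); }
};

// __func__ behaves like a function-local static const char[], so sizeof
// yields the array size (name length + 1 for the NUL), not sizeof(char*).
std::size_t func_name_size() { return sizeof(__func__); }
```

With GCC, the func field for step8_report() would read something like "CallbackDebugInfo Task::step8_report()", which carries the class name that __func__ alone drops.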
That said, it's probably 'good enough' for tracing where a callback was sourced, without tedious manual cross-referencing with filenames and line numbers in source navigation tools, and without getting into the technical specifics of preserving the stack frame where the callback() was instantiated. That's about all I've got time for on the subject of callbacks and backtraces, for the moment. cheers, BMS -------------- next part -------------- An embedded and charset-unspecified text was scrubbed... Name: foo.cc Url: http://mailman.ICSI.Berkeley.EDU/pipermail/xorp-hackers/attachments/20091120/f7def5b7/attachment.ksh From bms at incunabulum.net Fri Nov 20 11:15:37 2009 From: bms at incunabulum.net (Bruce Simpson) Date: Fri, 20 Nov 2009 19:15:37 +0000 Subject: [Xorp-hackers] Template aliases: no gcc support Message-ID: <4B06EAD9.3070103@incunabulum.net> Regarding 'template <...> using' (C++0x template aliases) to get a first-class object for a callback template instantiation: http://gcc.gnu.org/gcc-4.5/cxx0x_status.html ...template aliases aren't supported by the current GCC 4.5 drop anyway.
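For reference, this is roughly what a template alias would buy us once the compiler supports it (GCC added alias templates in 4.7): a single first-class name for "a callback taking these dispatch-time arguments". This sketch uses C++11 std::function as a stand-in for the libxorp callback types, and Callback, TaskDoneCallback and run_task() are hypothetical names.

```cpp
#include <cassert>
#include <functional>
#include <string>

// Alias template: one name for "callback taking Args..., returning void".
// Requires C++11 alias templates; not available in the GCC 4.5 drop.
template <typename... Args>
using Callback = std::function<void(Args...)>;

// A plain alias for one frequently used instantiation, shaped like the
// task_done(bool, const std::string&) signature seen in the rtrmgr trace.
using TaskDoneCallback = Callback<bool, const std::string&>;

// Code that dispatches the callback only needs the alias.
void run_task(const TaskDoneCallback& done) { done(true, "ok"); }
```

The alias does not remove the template machinery; it just lets declarations name the instantiation once instead of repeating the full template argument list at every use.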
From bms at incunabulum.net Mon Nov 23 19:47:11 2009 From: bms at incunabulum.net (Bruce Simpson) Date: Tue, 24 Nov 2009 03:47:11 +0000 Subject: [Xorp-hackers] Runtime diagnostics and callbacks in backtraces In-Reply-To: <4B06B724.50703@incunabulum.net> References: <4AE799E1.8010300@candelatech.com> <4AE7D31B.1040107@candelatech.com> <4AE9B7D1.803@incunabulum.net> <4AE9BAE5.20406@candelatech.com> <4AE9CC6B.9050302@incunabulum.net> <4AE9D45B.1090003@candelatech.com> <4AE9E020.4060403@incunabulum.net> <4AE9EA19.5030702@candelatech.com> <4AF6A8BE.2000603@incunabulum.net> <4AF6CF87.3040106@incunabulum.net> <4AF6E006.7040506@incunabulum.net> <4AF84DAD.3090004@incunabulum.net> <4AF859A7.7070600@candelatech.com> <4AF86606.4050304@incunabulum.net> <4AF86D39.50108@candelatech.com> <4B02CD33.9050305@incunabulum.net> <4B03594B.90608@candelatech.com> <4B040B8F.6040701@incunabulum.net> <4B0456F8.1060801@candelatech.com> <4B06B724.50703@incunabulum.net> Message-ID: <4B0B573F.6020401@incunabulum.net> Bruce Simpson wrote: > > I did start on a xorp_backtrace() patch this week; however, it is a > total distraction from what I'm currently meant to be doing. I got a ping back from the libexecinfo maintainers, and sent them a patch which aligns its output with GLIBC's backtrace_symbols(). On another note, I'm following up on thrift-users@ about some intermediate copies which don't go away when Thrift is used for RPC. Even with these copies, Thrift is probably still cheaper per cycle than the nested Xrl/XrlAtom/XrlArgs instances. I've sent them a patch for eliminating the copies from the write path; the read path needs a little more thought. later, BMS
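To see the formatting issue being discussed, a minimal capture using the GLIBC <execinfo.h> functions backtrace() and backtrace_symbols() might look like this; capture_backtrace() is a hypothetical helper. The layout of each returned string is implementation-defined, which is exactly where libexecinfo and GLIBC differed. Note that backtrace_symbols() calls malloc(), whereas backtrace_symbols_fd() writes straight to a file descriptor without allocating, which suits a crash path better. Function names may also be missing from the output unless the binary is linked with -rdynamic.

```cpp
#include <cassert>
#include <cstdlib>
#include <execinfo.h>
#include <string>
#include <vector>

// Hypothetical helper: capture up to max_frames return addresses and
// convert them to the implementation-defined symbol strings.
std::vector<std::string> capture_backtrace(int max_frames = 64)
{
    std::vector<std::string> out;
    if (max_frames <= 0)
        return out;

    std::vector<void*> frames(max_frames);
    int n = backtrace(frames.data(), max_frames);

    char** symbols = backtrace_symbols(frames.data(), n);
    if (symbols == nullptr)
        return out;                 // allocation failed inside backtrace_symbols()

    for (int i = 0; i < n; ++i)
        out.push_back(symbols[i]);  // format differs between GLIBC and libexecinfo

    std::free(symbols);             // one malloc()ed block; strings need no separate free
    return out;
}
```

Any tool that post-processes these strings has to cope with both formats, which is why aligning libexecinfo's output with GLIBC's matters.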