From bms at incunabulum.net Thu Oct 1 03:06:36 2009
From: bms at incunabulum.net (Bruce Simpson)
Date: Thu, 01 Oct 2009 11:06:36 +0100
Subject: [Xorp-hackers] OLSR assert
In-Reply-To: <4AC3CEA6.5040207@candelatech.com>
References: <4AC3B065.3070300@candelatech.com> <4AC3C015.5070703@incunabulum.net> <4AC3C3B7.40204@candelatech.com> <4AC3C60A.7060708@incunabulum.net> <4AC3CEA6.5040207@candelatech.com>
Message-ID: <4AC47F2C.8010903@incunabulum.net>

Ben Greear wrote:
>
> Here's an attached patch that seems to fix things. I believe the main
> error
> was checking for (!is_mpr()) in consider_remaining_cand_mprs
>
> I can't see why that check helps anything, and it was excluding from
> consideration the mpr
> that was needed to find the 2-hop neighbor in my setup.

I'm not 100% sure about this. It's been a long time since that code was written, so I'm hazy on details.

[..reads code..]

Good catch. I'd conservatively check it in, given that the real hard work of MPR set computation in OLSR, is in fact in minimizing the set.

As you can see, OLSR is tricky to do in an event-driven way, and it's easy to introduce bugs.

The bug is (in English): Just because a node was selected to cover a poorly covered N2, should not exclude it from consideration for other N2.

The is_mpr flag is cleared on every new MPR recount. It should only be set by the MPR recount code.

The check for !is_mpr() was probably there as an optimization against the work already done by the consider_poorly_covered_twohops() and consider_persistent_cand_mprs().

Yes, this could cause otherwise valid MPRs to be skipped in Neighborhood::consider_remaining_cand_mprs(), given that all MPRs for a subset of N2 have to be considered anyway; the notion of 'persistent' only really applies to N (- WILL_ALWAYS.

When considering all other candidate MPRs, the CandMprOrderPred will return the first (highest) match anyway. Multiple candidates get filtered out when the MPR set is later minimized anyway.
It might be better just to get rid of consider_persistent_cand_mprs() in this case.

later,
BMS

From bms at incunabulum.net Thu Oct 1 03:32:45 2009
From: bms at incunabulum.net (Bruce Simpson)
Date: Thu, 01 Oct 2009 11:32:45 +0100
Subject: [Xorp-hackers] OLSR assert
In-Reply-To: <4AC3C015.5070703@incunabulum.net>
References: <4AC3B065.3070300@candelatech.com> <4AC3C015.5070703@incunabulum.net>
Message-ID: <4AC4854D.7050908@incunabulum.net>

Bruce Simpson wrote:
> Ben Greear wrote:
>
>> The reset_twohop_mpr_state counts neighbors that are strict and reachable.
>> But, the consider_poorly_covered method checks for reachability == 1.
>> In the log below, neighbor 10.7.7.7 is not counted in poorly_covered.
>> Should we maybe check for reachability() > 0 instead of == 1?
>>
>>
>
> Off the top of my head, for classical OLSR, as specified in the RFC, it
> needs to be covered by a minimum of 1 neighbour, in terms of links.
>
> I don't have the code in front of me, obviously a test of reachability
> == 1 would be naive. If the fix is that simple, that's great.
>

This is logically correct, a poorly covered N2 is one which has reachability of 1. When computing the MPR set, N which are the only means of reaching those N2 need to be considered first.

It's the is_essential_mpr() predicate (within minimize_mpr_set()) which is responsible for making sure that those critical links aren't thrown out, when pruning the MPR set to reduce flooding.

Most of the work involved in computing MPRs upfront is done to limit (minimize?) the work minimize_mpr_set() has to do.
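
[Editorial sketch, not part of the original message.] The pruning rule described above can be illustrated with a small C++ example. This is only a sketch under stated assumptions: the names Coverage, reachability, is_essential and minimize are hypothetical and are not XORP's Neighborhood code, and willingness and candidate ordering are left out entirely. It shows that an MPR is kept whenever some two-hop neighbour it covers has a reachability of exactly 1 among the current MPR set.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>

// Hypothetical model: one-hop neighbour -> two-hop neighbours reachable via it.
using Coverage = std::map<std::string, std::set<std::string>>;

// Number of nodes in `among` that provide a path to two-hop node `n2`.
int reachability(const Coverage& cov, const std::set<std::string>& among,
                 const std::string& n2) {
    int count = 0;
    for (const auto& n : among) {
        auto it = cov.find(n);
        if (it != cov.end() && it->second.count(n2) != 0)
            ++count;
    }
    return count;
}

// An MPR is essential if it is the only MPR covering some two-hop node.
bool is_essential(const Coverage& cov, const std::set<std::string>& mprs,
                  const std::string& n) {
    auto it = cov.find(n);
    if (it == cov.end())
        return false;
    for (const auto& n2 : it->second) {
        if (reachability(cov, mprs, n2) == 1)
            return true;
    }
    return false;
}

// Drop redundant MPRs one at a time; an essential MPR is never dropped,
// so every covered two-hop node stays covered.
std::set<std::string> minimize(const Coverage& cov, std::set<std::string> mprs) {
    for (auto it = mprs.begin(); it != mprs.end(); ) {
        if (!is_essential(cov, mprs, *it))
            it = mprs.erase(it);
        else
            ++it;
    }
    return mprs;
}
```

With neighbours A covering {x, y}, B covering {y} and C covering {z}, the sketch keeps A and C and drops B, because y remains reachable through A.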
From bms at incunabulum.net Thu Oct 1 04:11:21 2009
From: bms at incunabulum.net (Bruce Simpson)
Date: Thu, 01 Oct 2009 12:11:21 +0100
Subject: [Xorp-hackers] valgrind: selector.cc: Reading free'd memory
In-Reply-To: <4AC38E63.6030308@candelatech.com>
References: <4AC2AA1F.1080308@candelatech.com> <4AC2BFEC.6010802@candelatech.com> <4AC324FE.7010700@incunabulum.net> <4AC37746.2080004@candelatech.com> <4AC37E69.4040407@incunabulum.net> <4AC38E63.6030308@candelatech.com>
Message-ID: <4AC48E59.9080405@incunabulum.net>

Ben Greear wrote:
>
> The problem is that a method called by an object can cause that object
> to be deleted, and when that method continues, it is accessing deleted
> memory.

SelectorList::Node::run_hooks(), right? That one *is* nasty... (re comment)

WinDispatcher doesn't have this problem; there, the callbacks are held in separate maps, and the ref_ptr for the callback protects the callback itself; where multiple dispatches are taking place within a for-block, the iterators involved are protected also.

SelectorList::Node does not have such protection -- it's entirely possible that the callback will go off and try to remove an event, but as soon as it does, it can invalidate the SelectorList::Node. The protection in run_hooks() seems insufficient...

Are there specific places where this is triggered? The comment would seem to indicate it's only an issue if more than one callback runs on the same FD, which is certainly possible even if they're *not* for the same IoEventType.

From lizhaous2000 at yahoo.com Thu Oct 1 06:27:19 2009
From: lizhaous2000 at yahoo.com (Li Zhao)
Date: Thu, 1 Oct 2009 06:27:19 -0700 (PDT)
Subject: [Xorp-hackers] rtrmgr restart
Message-ID: <321304.32172.qm@web58703.mail.re1.yahoo.com>

Correct me if I am wrong. When the router is dying, rtrmgr does not terminate the other processes gracefully. After the router comes back to life, the latest running configuration cannot be picked up.
The only way I can think of to save the running config is through xorpsh (some scripts), but I cannot get this script to run successfully when rtrmgr is dying.

Thanks.
Li

From greearb at candelatech.com Thu Oct 1 09:03:46 2009
From: greearb at candelatech.com (Ben Greear)
Date: Thu, 01 Oct 2009 09:03:46 -0700
Subject: [Xorp-hackers] valgrind: selector.cc: Reading free'd memory
In-Reply-To: <4AC48E59.9080405@incunabulum.net>
References: <4AC2AA1F.1080308@candelatech.com> <4AC2BFEC.6010802@candelatech.com> <4AC324FE.7010700@incunabulum.net> <4AC37746.2080004@candelatech.com> <4AC37E69.4040407@incunabulum.net> <4AC38E63.6030308@candelatech.com> <4AC48E59.9080405@incunabulum.net>
Message-ID: <4AC4D2E2.5090302@candelatech.com>

On 10/01/2009 04:11 AM, Bruce Simpson wrote:
> Ben Greear wrote:
>>
>> The problem is that a method called by an object can cause that object
>> to be deleted, and when that method continues, it is accessing deleted
>> memory.
>
> SelectorList::Node::run_hooks(), right? That one *is* nasty... (re comment)
>
> WinDispatcher doesn't have this problem; there, the callbacks are held
> in separate maps, and the ref_ptr for the callback protects the callback
> itself; where multiple dispatches are taking place within a for-block,
> the iterators involved are protected also.
>
> SelectorList::Node does not have such protection -- it's entirely
> possible that the callback will go off and try to remove an event, but
> as soon as it does, it can invalidate the SelectorList::Node. The
> protection in run_hooks() seems insufficient...
>
> Are there specific places where this is triggered? The comment would
> seem to indicate it's only an issue if more than one callback runs on
> the same FD, which is certainly possible even if they're *not* for the
> same IoEventType.
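
[Editorial sketch, not part of the original message.] The hazard in the quoted text, a callback removing the very entry that is being dispatched, can be shown with a small C++ example. Everything here is an assumption for illustration: Dispatcher, add, remove, run_hooks and has are hypothetical names, not XORP's SelectorList or WinDispatcher, and std::shared_ptr stands in for a reference-counted callback. The loop works on a snapshot of the ref-counted callbacks, so a callback that removes its own fd cannot free memory the loop still touches.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical stand-in for an fd dispatcher; not XORP's SelectorList.
class Dispatcher {
public:
    using Callback = std::function<void(int fd)>;

    void add(int fd, Callback cb) {
        _hooks[fd].push_back(std::make_shared<Callback>(std::move(cb)));
    }

    // May be called from inside a running callback.
    void remove(int fd) { _hooks.erase(fd); }

    bool has(int fd) const { return _hooks.count(fd) != 0; }

    // Runs the hooks registered on fd; returns how many were invoked.
    int run_hooks(int fd) {
        auto it = _hooks.find(fd);
        if (it == _hooks.end())
            return 0;
        // Snapshot the ref-counted callbacks. If a callback calls
        // remove(fd), the map entry goes away but the snapshot keeps
        // the callback objects alive until the loop is finished.
        std::vector<std::shared_ptr<Callback>> snapshot = it->second;
        int ran = 0;
        for (const auto& cb : snapshot) {
            // Re-check the map each time instead of trusting `it`,
            // which is invalid once the entry has been erased.
            if (_hooks.find(fd) == _hooks.end())
                break;
            (*cb)(fd);
            ++ran;
        }
        return ran;
    }

private:
    std::map<int, std::vector<std::shared_ptr<Callback>>> _hooks;
};
```

The snapshot trades a small copy per dispatch for safety. Preallocating the storage so it never moves, as discussed in this thread, is a different way to get the same guarantee.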
As soon as the memory is free'd due to resize, all bets are off and the loop might think it should continue even when it shouldn't, because something else has acquired and written to the memory before the loop completes.

We have to ensure that the Node memory can never be deleted while that method is running. My work-around solves this for any sane amount of file-descriptors (up to 1024).

My patch is better than what previously existed, but some day we could revisit the whole logic in that area perhaps.

Thanks,
Ben

--
Ben Greear
Candela Technologies Inc http://www.candelatech.com

From greearb at candelatech.com Thu Oct 1 15:49:56 2009
From: greearb at candelatech.com (Ben Greear)
Date: Thu, 01 Oct 2009 15:49:56 -0700
Subject: [Xorp-hackers] PATCH: XrlRouter timeout needs to be allowed higher.
Message-ID: <4AC53214.3020600@candelatech.com>

Please note that the old 'max timeout' that you could set as an environment variable was only 6 seconds. This is less than the default of 30 seconds, which makes no sense at all.

The attached patch fixes this:

Give user a better clue as to why xrl router timed out.
Allow user to set up to 2 minute timeout..helps with running lots of instances under valgrind and other strange things.

Thanks,
Ben

--
Ben Greear
Candela Technologies Inc http://www.candelatech.com

-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: xrl_router_timeout.patch
Url: http://mailman.ICSI.Berkeley.EDU/pipermail/xorp-hackers/attachments/20091001/88394222/attachment.ksh

From greearb at candelatech.com Thu Oct 1 17:13:36 2009
From: greearb at candelatech.com (Ben Greear)
Date: Thu, 01 Oct 2009 17:13:36 -0700
Subject: [Xorp-hackers] PATCH: Fix uninitialized memory, found by valgrind
Message-ID: <4AC545B0.8080403@candelatech.com>

This patch fixes some errors relating to not initializing memory properly. I found these by using valgrind.

Thanks,
Ben

--
Ben Greear
Candela Technologies Inc http://www.candelatech.com

-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: xorp_uninit_memory.patch
Url: http://mailman.ICSI.Berkeley.EDU/pipermail/xorp-hackers/attachments/20091001/fda8ef09/attachment.ksh

From bms at incunabulum.net Fri Oct 2 04:34:54 2009
From: bms at incunabulum.net (Bruce Simpson)
Date: Fri, 02 Oct 2009 12:34:54 +0100
Subject: [Xorp-hackers] PATCH: XrlRouter timeout needs to be allowed higher.
In-Reply-To: <4AC53214.3020600@candelatech.com>
References: <4AC53214.3020600@candelatech.com>
Message-ID: <4AC5E55E.1020803@incunabulum.net>

I've checked in the logic part of this patch. Thanks!

From bms at incunabulum.net Fri Oct 2 04:42:24 2009
From: bms at incunabulum.net (Bruce Simpson)
Date: Fri, 02 Oct 2009 12:42:24 +0100
Subject: [Xorp-hackers] Stub licensing
In-Reply-To: <4AC3EFFA.8040705@incunabulum.net>
References: <4AC2AA1F.1080308@candelatech.com> <4AC2BFEC.6010802@candelatech.com> <4AC324FE.7010700@incunabulum.net> <4AC37746.2080004@candelatech.com> <4AC37E69.4040407@incunabulum.net> <4AC38E63.6030308@candelatech.com> <4AC3C4C5.7040307@incunabulum.net> <4AC3C8EC.7090001@candelatech.com> <4AC3EFFA.8040705@incunabulum.net>
Message-ID: <4AC5E720.3060407@incunabulum.net>

Bruce Simpson wrote:
> ...
> The scope of the GPL was purely limited to individual routing processes,
> not the core libraries, which are LGPL. The XRL RPC stubs don't actually
> have an explicit license, and should probably be updated to reflect
> either LGPL or public domain status.
>

Correction: The generated RPC stubs contain a reference to the LGPL, but don't embed the license text itself.

From greearb at candelatech.com Fri Oct 2 11:34:24 2009
From: greearb at candelatech.com (Ben Greear)
Date: Fri, 02 Oct 2009 11:34:24 -0700
Subject: [Xorp-hackers] Question on startup errors/warnings.
Message-ID: <4AC647B0.4060007@candelatech.com>

I'm crawling through xorp logs trying to clean or explain xorp errors.

Any idea what this indicates?

[ 2009/10/02 11:21:53 WARNING xorp_rtrmgr:28398 XrlFinderTarget obj/x86_64-linux-public17/xrl/targets/finder_base.cc:715 handle_finder_event_notifier_0_1_register_class_event_interest ] Handling method for finder_event_notifier/0.1/register_class_event_interest failed: XrlCmdError 102 Command failed failed to add watch
[ 2009/10/02 11:21:53 ERROR xorp_rtrmgr:28398 RTRMGR rtrmgr/xrl_rtrmgr_interface.cc:334 finder_register_done ] Failed to register with finder about XRL xorpsh-28454-i7-dqc-1 (err: Command failed)
[ 2009/10/02 11:21:53 INFO xorp_rtrmgr:28398 RTRMGR rtrmgr/module_manager.cc:101 execute ] Executing module: igmp (mld6igmp/xorp_igmp)
[ 2009/10/02 11:21:53 WARNING xorp_rtrmgr:28398 XrlFinderTarget obj/x86_64-linux-public17/xrl/targets/finder_base.cc:453 handle_finder_0_2_resolve_xrl ] Handling method for finder/0.2/resolve_xrl failed: XrlCmdError 102 Command failed Target "IGMP" does not exist or is not enabled.
[ 2009/10/02 11:21:53 WARNING xorp_rtrmgr:28398 RTRMGR rtrmgr/task.cc:212 xrl_done ] Failed to receive reply, code: 201 Resolve failed retries: 0 max_retries: 30

Is this a real error, or just complaints because everything hasn't properly started yet?

Thanks,
Ben

--
Ben Greear
Candela Technologies Inc http://www.candelatech.com

From bms at incunabulum.net Sat Oct 3 03:30:16 2009
From: bms at incunabulum.net (Bruce Simpson)
Date: Sat, 03 Oct 2009 11:30:16 +0100
Subject: [Xorp-hackers] valgrind: selector.cc: Reading free'd memory
In-Reply-To: <4AC4D2E2.5090302@candelatech.com>
References: <4AC2AA1F.1080308@candelatech.com> <4AC2BFEC.6010802@candelatech.com> <4AC324FE.7010700@incunabulum.net> <4AC37746.2080004@candelatech.com> <4AC37E69.4040407@incunabulum.net> <4AC38E63.6030308@candelatech.com> <4AC48E59.9080405@incunabulum.net> <4AC4D2E2.5090302@candelatech.com>
Message-ID: <4AC727B8.7030705@incunabulum.net>

Ben Greear wrote:
>
> We have to ensure that the Node memory can never be deleted while that
> method
> is running. My work-around solves this for any sane amount of
> file-descriptors
> (up to 1024).

I've committed the part of the change which preallocates _selector_entries, but limited it to 256 file descriptors to keep the memory wastage down. thanks!

From bms at incunabulum.net Sat Oct 3 03:33:10 2009
From: bms at incunabulum.net (Bruce Simpson)
Date: Sat, 03 Oct 2009 11:33:10 +0100
Subject: [Xorp-hackers] Question on startup errors/warnings.
In-Reply-To: <4AC647B0.4060007@candelatech.com>
References: <4AC647B0.4060007@candelatech.com>
Message-ID: <4AC72866.2020206@incunabulum.net>

Ben Greear wrote:
> I'm crawling through xorp logs trying to clean or explain xorp errors.
>
> Any idea what this indicates?
>
> [ 2009/10/02 11:21:53 WARNING xorp_rtrmgr:28398 XrlFinderTarget obj/x86_64-linux-public17/xrl/targets/finder_base.cc:715
> handle_finder_event_notifier_0_1_register_class_event_interest ] Handling method for finder_event_notifier/0.1/register_class_event_interest failed: XrlCmdError
> 102 Command failed failed to add watch
> [ 2009/10/02 11:21:53 ERROR xorp_rtrmgr:28398 RTRMGR rtrmgr/xrl_rtrmgr_interface.cc:334 finder_register_done ] Failed to register with finder about XRL
> xorpsh-28454-i7-dqc-1 (err: Command failed)
>

This could just be xorpsh startup racing with the Router Manager finishing its initial configuration tree pass.

> [ 2009/10/02 11:21:53 INFO xorp_rtrmgr:28398 RTRMGR rtrmgr/module_manager.cc:101 execute ] Executing module: igmp (mld6igmp/xorp_igmp)
> [ 2009/10/02 11:21:53 WARNING xorp_rtrmgr:28398 XrlFinderTarget obj/x86_64-linux-public17/xrl/targets/finder_base.cc:453 handle_finder_0_2_resolve_xrl ]
> Handling method for finder/0.2/resolve_xrl failed: XrlCmdError 102 Command failed Target "IGMP" does not exist or is not enabled.
> [ 2009/10/02 11:21:53 WARNING xorp_rtrmgr:28398 RTRMGR rtrmgr/task.cc:212 xrl_done ] Failed to receive reply, code: 201 Resolve failed retries: 0 max_retries: 30
>
>
> Is this a real error, or just complaints because everything hasn't properly started yet?
>

This could be the same situation with the igmp child process. I've seen similar log verbiage when there are debug hooks active in the system.

From bms at incunabulum.net Sat Oct 3 03:41:49 2009
From: bms at incunabulum.net (Bruce Simpson)
Date: Sat, 03 Oct 2009 11:41:49 +0100
Subject: [Xorp-hackers] Patch to update build notes slightly
In-Reply-To: <4AC1126D.4030907@candelatech.com>
References: <4AC1126D.4030907@candelatech.com>
Message-ID: <4AC72A6D.8020108@incunabulum.net>

An appropriate update has been committed for now.

From bms at incunabulum.net Sat Oct 3 04:10:33 2009
From: bms at incunabulum.net (Bruce Simpson)
Date: Sat, 03 Oct 2009 12:10:33 +0100
Subject: [Xorp-hackers] PATCH: Fix uninitialized memory, found by valgrind
In-Reply-To: <4AC545B0.8080403@candelatech.com>
References: <4AC545B0.8080403@candelatech.com>
Message-ID: <4AC73129.4060905@incunabulum.net>

Ben Greear wrote:
> This patch fixes some errors relating to not initializing memory
> properly. I found these by using valgrind.

A few questions/points:

* Why is the initializer for TransactionManager::_next_tid required?
  This integer key is never exposed outside of TransactionManager, and the std::map it indexes doesn't make any assumptions about the key space. Can you provide the valgrind hit?

* Why is the initializer for IfConfigTransactionManager::_tid_exec required?
  This member is only referenced in two places: when it's set on the pre_commit, and when the operation result callback fires, it gets passed by value. There are other places in the FEA using the TransactionManager. Are they also affected/is there coverage?

* Can you provide the valgrind hits which are fixed by the memset() calls in io_ip_socket.cc?

The CMSG macros should notice if a buffer, passed to a socket call, didn't return any data. If they aren't, that could be a bug elsewhere.

We really need to understand the problems these fixes address before taking them. It is normally good practice to clear buffers, when needed, but it's OK to omit that step for performance if and only if it doesn't cause stale state to get picked up.

cheers,
BMS

From greearb at candelatech.com Sat Oct 3 08:43:26 2009
From: greearb at candelatech.com (Ben Greear)
Date: Sat, 03 Oct 2009 08:43:26 -0700
Subject: [Xorp-hackers] valgrind: selector.cc: Reading free'd memory
In-Reply-To: <4AC727B8.7030705@incunabulum.net>
References: <4AC2AA1F.1080308@candelatech.com> <4AC2BFEC.6010802@candelatech.com> <4AC324FE.7010700@incunabulum.net> <4AC37746.2080004@candelatech.com> <4AC37E69.4040407@incunabulum.net> <4AC38E63.6030308@candelatech.com> <4AC48E59.9080405@incunabulum.net> <4AC4D2E2.5090302@candelatech.com> <4AC727B8.7030705@incunabulum.net>
Message-ID: <4AC7711E.3080702@candelatech.com>

Bruce Simpson wrote:
> Ben Greear wrote:
>>
>> We have to ensure that the Node memory can never be deleted while
>> that method
>> is running. My work-around solves this for any sane amount of
>> file-descriptors
>> (up to 1024).
>
> I've committed the part of the change which preallocates
> _selector_entries, but limited it to 256 file descriptors to keep the
> memory wastage down. thanks!

Hopefully no one will ever get a descriptor bigger than 256! It should certainly be better than before, but I'm going to leave my tree at 1024 since I open a descriptor per interface and sometimes run lots of protocols.

Thanks,
Ben

--
Ben Greear
Candela Technologies Inc http://www.candelatech.com

From greearb at candelatech.com Sat Oct 3 08:55:51 2009
From: greearb at candelatech.com (Ben Greear)
Date: Sat, 03 Oct 2009 08:55:51 -0700
Subject: [Xorp-hackers] PATCH: Fix uninitialized memory, found by valgrind
In-Reply-To: <4AC73129.4060905@incunabulum.net>
References: <4AC545B0.8080403@candelatech.com> <4AC73129.4060905@incunabulum.net>
Message-ID: <4AC77407.2050205@candelatech.com>

Bruce Simpson wrote:
> Ben Greear wrote:
>> This patch fixes some errors relating to not initializing memory
>> properly. I found these by using valgrind.
>
> A few questions/points:
>
> * Why is the initializer for TransactionManager::_next_tid required?
> This integer key is never exposed outside of TransactionManager, and
> the std::map it indexes doesn't make any assumptions about the key
> space. Can you provide the valgrind hit?
>
> * Why is the initializer for IfConfigTransactionManager::_tid_exec
> required? This member is only referenced in two places: when it's set
> on the pre_commit, and when the operation result callback fires, it
> gets passed by value. There are other places in the FEA using the
> TransactionManager. Are they also affected/is there coverage?
>
> * Can you provide the valgrind hits which are fixed by the memset()
> calls in io_ip_socket.cc?
>
> The CMSG macros should notice if a buffer, passed to a socket call,
> didn't return any data. If they aren't, that could be a bug elsewhere.
>
> We really need to understand the problems these fixes address before
> taking them. It is normally good practice to clear buffers, when
> needed, but it's OK to omit that step for performance if and only if
> it doesn't cause stale state to get picked up.

Run rtrmgr under valgrind with OSPF (though it's not OSPF related), and you should see these errors. I don't think any of them are critical, but they make valgrind noisy so you can't see other errors that might be real.

At any rate, it isn't clean code to leave member variables un-initialized. It's just asking for weird problems some day when someone starts using the variables differently.

The changes are not in any hot path, so they are not going to hurt any performance.
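
[Editorial sketch, not part of the original message.] The point about member initialization can be made concrete with a few lines of C++. The class TidAllocator and its starting value of 0 are hypothetical, chosen only for illustration; this is not XORP's TransactionManager. Without the initializer, the first read of _next_tid would use an indeterminate value, which is the kind of read a tool like valgrind flags when the value is later used.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical class for illustration; not XORP's TransactionManager.
class TidAllocator {
public:
    // The member initializer gives _next_tid a defined starting value.
    // Leaving it out would make the first allocate() read garbage.
    TidAllocator() : _next_tid(0) {}

    // Hands out the current id and advances the counter.
    uint32_t allocate() { return _next_tid++; }

private:
    uint32_t _next_tid;
};
```

The cost is one store at construction time, which is consistent with the remark above that the change is not on a hot path.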

Here's my valgrind start command:

valgrind --trace-children=yes --log-file=valgrind_xorp_$XORP_FINDER_SERVER_PORT.%p.txt --leak-check=full --track-origins=yes --track-fds=yes xorp_rtrmgr -p $XORP_FINDER_SERVER_PORT -b $CFG_FILE -P $PIDFILE.rtrmgr

Thanks,
Ben

>
> cheers,
> BMS

--
Ben Greear
Candela Technologies Inc http://www.candelatech.com

From greearb at candelatech.com Mon Oct 5 11:31:35 2009
From: greearb at candelatech.com (Ben Greear)
Date: Mon, 05 Oct 2009 11:31:35 -0700
Subject: [Xorp-hackers] PATCH: Remove some dead code, unlink pid-file on exit.
Message-ID: <4ACA3B87.70309@candelatech.com>

This is mostly just a cleanup patch. It removes some dead code and changes around the pidfile logic a bit. It also allows unlinking the pid-file on exit using the atexit call. Tested on Linux.

Thanks,
Ben

--
Ben Greear
Candela Technologies Inc http://www.candelatech.com

-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: unlink_pidfile.patch
Url: http://mailman.ICSI.Berkeley.EDU/pipermail/xorp-hackers/attachments/20091005/119c5b08/attachment.ksh

From greearb at candelatech.com Mon Oct 5 14:54:35 2009
From: greearb at candelatech.com (Ben Greear)
Date: Mon, 05 Oct 2009 14:54:35 -0700
Subject: [Xorp-hackers] rtrmgr and TaskManager
Message-ID: <4ACA6B1B.8020604@candelatech.com>

I'm trying to figure out why xorpsh commits take so long, and in doing so, I'm trying to understand the TaskManager.

There is one part that is particularly confusing:

There is a _completion_cb that is assigned when a task is queued up, but the next task to run isn't necessarily that task if there are other higher priority tasks running.

Seems like that could be a problem to me?

Thanks,
Ben

--
Ben Greear
Candela Technologies Inc http://www.candelatech.com

From greearb at candelatech.com Mon Oct 5 15:34:51 2009
From: greearb at candelatech.com (Ben Greear)
Date: Mon, 05 Oct 2009 15:34:51 -0700
Subject: [Xorp-hackers] rtrmgr and TaskManager
In-Reply-To: <4ACA6B1B.8020604@candelatech.com>
References: <4ACA6B1B.8020604@candelatech.com>
Message-ID: <4ACA748B.6070308@candelatech.com>

On 10/05/2009 02:54 PM, Ben Greear wrote:
> I'm trying to figure out why xorpsh commits take so long, and in doing so,
> I'm trying to understand the TaskManager.
>
> There is one part that is particularly confusing:
>
> There is a _completion_cb that is assigned when a task is queued up, but
> the next task to run isn't necessarily that task if there are other higher
> priority tasks running.
>
> Seems like that could be a problem to me?

Well, here's the slow-down I'm seeing..but WTF would someone add a 1-second sleep here???

task.cc: XrlStatusValidation::validate

    } else {
        //
        // When we're running with do_exec == false, we want to
        // exercise most of the same machinery, but we want to ensure
        // that the xrl_done response gets the right arguments even
        // though we're not going to call the XRL.
        //
        _retry_timer = eventloop().new_oneoff_after_ms(1000,
            callback(this, &XrlStatusValidation::dummy_response));
    }
}

--
Ben Greear
Candela Technologies Inc http://www.candelatech.com

From greearb at candelatech.com Mon Oct 5 15:58:33 2009
From: greearb at candelatech.com (Ben Greear)
Date: Mon, 05 Oct 2009 15:58:33 -0700
Subject: [Xorp-hackers] PATCH: Logging improvements, fix artificial deal for xorpsh commit.
Message-ID: <4ACA7A19.30909@candelatech.com>

The attached patch has these improvements:

1) Fix logging & tracing to show micro-seconds, greatly aids debugging performance issues.
2) Change some pass-by-value string arguments to const string& in router-mgr. This will improve performance and a small bit of memory usage.
3) Remove 1 second timeout in 'commit' path.
At best, the timeout might have worked around a race condition, but I can see no reason to leave it in. I tested with it set to zero timeout and things work fine. This makes commits last around 200ms instead of 1.2s, which is a big improvement when scripting xorp.

Thanks,
Ben

--
Ben Greear
Candela Technologies Inc http://www.candelatech.com

-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: logging_commit_timeout.patch
Url: http://mailman.ICSI.Berkeley.EDU/pipermail/xorp-hackers/attachments/20091005/b39b9a69/attachment.ksh

From greearb at candelatech.com Mon Oct 5 16:42:36 2009
From: greearb at candelatech.com (Ben Greear)
Date: Mon, 05 Oct 2009 16:42:36 -0700
Subject: [Xorp-hackers] PATCH: Fix commit failure on device removal race, related to IGMP.
Message-ID: <4ACA846C.7040908@candelatech.com>

If an interface is removed from the system, then you can no longer remove it from xorp igmp configuration because the commit will fail (due to lack of vif). This is a race of some sort or another, and was fairly difficult to reproduce even on our setup.

Here's the fix:

* Don't fail vif_stop in Mld6igmpNode::stop_vif if the interface is already removed. Log the inconsistency, but return XORP_OK so the commit can continue.

This is similar to code I've had in 'mfea_node' for several years.

Thanks,
Ben

--
Ben Greear
Candela Technologies Inc http://www.candelatech.com

-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: igmp_commit_fail.patch
Url: http://mailman.ICSI.Berkeley.EDU/pipermail/xorp-hackers/attachments/20091005/0c356418/attachment.ksh

From greearb at candelatech.com Mon Oct 5 21:26:41 2009
From: greearb at candelatech.com (Ben Greear)
Date: Mon, 05 Oct 2009 21:26:41 -0700
Subject: [Xorp-hackers] PATCH: Don't fail commit on multicast address removal failure.
Message-ID: <4ACAC701.1070300@candelatech.com>

This patch fixes a bug where a commit can fail if the multicast addresses trying to be removed are already gone (probably because an entire network device disappeared shortly ago). If it's already gone, log a warning, but don't fail the commit.

--
Ben Greear
Candela Technologies Inc http://www.candelatech.com

-------------- next part --------------
A non-text attachment was scrubbed...
Name: multicast_fea_rm.patch
Type: text/x-patch
Size: 1821 bytes
Desc: not available
Url : http://mailman.ICSI.Berkeley.EDU/pipermail/xorp-hackers/attachments/20091005/149b009a/attachment.bin

From greearb at candelatech.com Mon Oct 5 21:31:02 2009
From: greearb at candelatech.com (Ben Greear)
Date: Mon, 05 Oct 2009 21:31:02 -0700
Subject: [Xorp-hackers] PATCH: Add startup methods for faster startup.
Message-ID: <4ACAC806.1000000@candelatech.com>

If there is no status and no startup method in a xorp target, the router-mgr uses a 2-second sleep for 'verification'. This slows down startup of Xorp quite a bit when you have lots of protocols running.

This patch adds startup methods to many of the common targets. There are still more to go, however.

Thanks,
Ben

--
Ben Greear
Candela Technologies Inc http://www.candelatech.com

-------------- next part --------------
A non-text attachment was scrubbed...
Name: startup_methods.patch
Type: text/x-patch
Size: 6324 bytes
Desc: not available
Url : http://mailman.ICSI.Berkeley.EDU/pipermail/xorp-hackers/attachments/20091005/9098b29a/attachment-0001.bin

From bms at incunabulum.net Tue Oct 6 04:50:57 2009
From: bms at incunabulum.net (Bruce Simpson)
Date: Tue, 06 Oct 2009 12:50:57 +0100
Subject: [Xorp-hackers] PATCH: Add startup methods for faster startup.
In-Reply-To: <4ACAC806.1000000@candelatech.com>
References: <4ACAC806.1000000@candelatech.com>
Message-ID: <4ACB2F21.7080603@incunabulum.net>

Ben Greear wrote:
> If there is no status and no startup method in a xorp target, the
> router-mgr uses a 2-second
> sleep for 'verification'. This slows down startup of Xorp quite a bit
> when you have lots
> of protocols running.
>
> This patch adds startup methods to many of the common targets. There
> are still more to
> go, however.

Thanks for tracking this down; yes, I've noticed that process startup is slower than it could be, but have only had free time / mindspace to look at the XRL specifics.

Could this be made a more general change? If the XIF method for startup you are adding is not specific to a particular protocol, it might be an idea to make it part of the common.xif -- which is where most of the process control knobs are.

I'd rather not get too far into the machinery here, because I'm about to take a badly needed break.

I guess the firewall and ifmgr modules are a special case, because they're separate service bundles located in the FEA process.

On a more general note: One of the things Pavlin raised in an old BugZilla ticket, is the fact that the Router Manager is fairly complex because it implements transactions on the config tree itself. If this is pushed into the protocols themselves (they'd have to keep their own config snapshot, and adopt a commit-rollback transaction model in the XIF RPC interfaces), then the Router Manager gets a bit simpler overall.

cheers,
BMS

From bms at incunabulum.net Tue Oct 6 05:44:02 2009
From: bms at incunabulum.net (Bruce Simpson)
Date: Tue, 06 Oct 2009 13:44:02 +0100
Subject: [Xorp-hackers] PATCH: Logging improvements, fix artificial deal for xorpsh commit.
In-Reply-To: <4ACA7A19.30909@candelatech.com>
References: <4ACA7A19.30909@candelatech.com>
Message-ID: <4ACB3B92.3050505@incunabulum.net>

Ben Greear wrote:
> The attached patch has these improvements:
>
> 1) Fix logging & tracing to show micro-seconds, greatly aids
> debugging performance issues.
> 2) Change some pass-by-value string arguments to const string& in
> router-mgr. This will improve
> performance and a small bit of memory usage.
> 3) Remove 1 second timeout in 'commit' path. At best, the timeout
> might have worked around
> a race condition, but I can see no reason to leave it in. I
> tested with it set to zero
> timeout and things work fine. This makes commits last around
> 200ms instead of 1.2s, which is
> a big improvement when scripting xorp.

Comments:

* It should be possible to turn off the millisecond logging if desired. Whilst it's certainly a useful feature to have when debugging time contingent code, it does add clutter to the output.
* Perhaps putting it under the other debug knobs in SConstruct would be a good idea?
* %llu is not a portable format specifier, and 'unsigned long long' is not a portable type, please don't use them in portable code.
* Perhaps the code which prints the timeval is a candidate for a function like xlog_localtime2string_short()?
* xlog_localtime2string_short() is still defined in xlog.c; so why comment out its prototype, are you getting warnings from the compiler?
* A XorpTimer of 0 is a possible candidate for a XorpTask. I can't really delve further into that change at the moment, though.
* Yes, it may be useful to constify the string arguments in those callback functions, but this change is considered low priority.
* Please avoid introducing unnecessary whitespace changes in diffs.

Can you please raise a Trac item for these suggested improvements? I probably won't have time to look at the Router Manager in detail for at least 4 weeks.

Sorry for the bureaucracy...
I appreciate you're doing what you can in the here and now to improve the code, however, it makes reviewing patches and applying them that much easier, and we do need to keep the code alignment and type clean, etc.

thanks,
BMS

From bms at incunabulum.net Tue Oct 6 05:45:33 2009
From: bms at incunabulum.net (Bruce Simpson)
Date: Tue, 06 Oct 2009 13:45:33 +0100
Subject: [Xorp-hackers] PATCH: Fix commit failure on device removal race, related to IGMP.
In-Reply-To: <4ACA846C.7040908@candelatech.com>
References: <4ACA846C.7040908@candelatech.com>
Message-ID: <4ACB3BED.2090100@incunabulum.net>

Ben,

Can you please raise a Trac ticket about this issue, and attach your patch?

Ben Greear wrote:
> If an interface is removed from the system, then you can no longer remove
> it from xorp igmp configuration because the commit will fail (due to
> lack of vif). This is a race of some sort or another, and was fairly
> difficult
> to reproduce even on our setup.
>
> Here's the fix:
>
> * Don't fail vif_stop in Mld6igmpNode::stop_vif if the interface is
> already removed.
> Log the inconsistency, but return XORP_OK so the commit can continue.

Thank you
BMS

From bms at incunabulum.net Tue Oct 6 05:47:09 2009
From: bms at incunabulum.net (Bruce Simpson)
Date: Tue, 06 Oct 2009 13:47:09 +0100
Subject: [Xorp-hackers] PATCH: Don't fail commit on multicast address removal failure.
In-Reply-To: <4ACAC701.1070300@candelatech.com>
References: <4ACAC701.1070300@candelatech.com>
Message-ID: <4ACB3C4D.3020304@incunabulum.net>

Hi Ben,

Thanks for your patch.

Ben Greear wrote:
> This patch fixes a bug where a commit can fail if the multicast
> addresses trying to be
> removed are already gone (probably because an entire network device
> disappeared
> shortly ago). If it's already gone, log a warning, but don't fail the
> commit.

Can you please attach this to a Trac ticket for the interface removal condition?
It would be easier to tackle this on the basis of 'this is a specific
problem which needs to be solved', rather than doing piecemeal commits.

thanks,
BMS

From bms at incunabulum.net Tue Oct 6 06:07:43 2009
From: bms at incunabulum.net (Bruce Simpson)
Date: Tue, 06 Oct 2009 14:07:43 +0100
Subject: [Xorp-hackers] PATCH: Remove some dead code, unlink pid-file on exit.
In-Reply-To: <4ACA3B87.70309@candelatech.com>
References: <4ACA3B87.70309@candelatech.com>
Message-ID: <4ACB411F.9070309@incunabulum.net>

Ben,

A few comments about this patch:
* Do you have a specific distribution which relies on the behaviour of
removing the pid file after the daemon terminates?
For the BSD distributions at least, when running XORP from an rc.subr
init script, this shouldn't be an issue; the file is just ignored, and
overwritten on the next run.
It's good filesystem hygiene to remove it, though.

Ben Greear wrote:
> This is mostly just a cleanup patch. It removes some dead code and
> changes around the pidfile logic a bit. It also allows unlinking the
> pid-file on exit using the atexit call. Tested on Linux.

* atexit() is better specified now, though, although a check for a C99
compliant implementation would be useful (for folk trying to link
against it on embedded platforms):
http://www.opengroup.org/onlinepubs/009695399/functions/atexit.html
* handle_atexit() should be renamed to reflect what it's doing through
the atexit() mechanism. We are indeed daemonizing the whole process, not
Rtrmgr, so it certainly belongs at the C top level scope.
* Please don't use cout. C stdio is used elsewhere in this file, so no
point in pulling in iostreams; C stdio should still be accessible. In
practice this isn't an issue if libc is shared, however, it does pull in
parts of the C++ runtime we don't immediately need.
* The reason the pid gets written out from the parent is because on
most distributions, we are writing to an absolute path under /var
(usually /var/run/.pid).
If the child is chrooted it may not have access to this absolute path,
which breaks POLA.

JT indicated that he wasn't 100% happy with some POLA elements of how
XORP daemonizes. For example, it won't chdir() to /. This potentially
breaks chroot()-ed operation, or at least means that the rtrmgr still
holds a vnode lock on the place it was started from. So hopefully he
will chime in on this.

cheers,
BMS

From bms at incunabulum.net Tue Oct 6 06:15:06 2009
From: bms at incunabulum.net (Bruce Simpson)
Date: Tue, 06 Oct 2009 14:15:06 +0100
Subject: [Xorp-hackers] PATCH: Remove some dead code, unlink pid-file on exit.
In-Reply-To: <4ACB411F.9070309@incunabulum.net>
References: <4ACA3B87.70309@candelatech.com> <4ACB411F.9070309@incunabulum.net>
Message-ID: <4ACB42DA.2070604@incunabulum.net>

Bruce Simpson wrote:
> * Do you have a specific distribution which relies on the behaviour of
> removing the pid file after the daemon terminates?
> For the BSD distributions at least, when running XORP from an
> rc.subr init script, this shouldn't be an issue; the file is just
> ignored, and overwritten on the next run.
> It's good filesystem hygiene to remove it, though.
>

P.S. can you please raise a Trac enhancement request for this issue?
Thanks.

From bms at incunabulum.net Tue Oct 6 06:21:09 2009
From: bms at incunabulum.net (Bruce Simpson)
Date: Tue, 06 Oct 2009 14:21:09 +0100
Subject: [Xorp-hackers] PATCH: Fix uninitialized memory, found by valgrind
In-Reply-To: <4AC77407.2050205@candelatech.com>
References: <4AC545B0.8080403@candelatech.com> <4AC73129.4060905@incunabulum.net> <4AC77407.2050205@candelatech.com>
Message-ID: <4ACB4445.5050307@incunabulum.net>

Ben,

Thanks for raising the uninitialized buffer issue. Unfortunately I won't
have free time to perform valgrind runs on the code before I leave for
my trip.

Ben Greear wrote:
> ...
>>
>> * Can you provide the valgrind hits which are fixed by the memset()
>> calls in io_ip_socket.cc?
>>
> Run rtrmgr under valgrind with OSPF (though it's not OSPF related),
> and you should see these errors.

It would be really useful if you could attach the valgrind logs (or at
least the relevant excerpt) to Trac ticket(s) so that the issue can be
investigated further. I really need to stay focused on the XRL code when
I get back from my trip, however.

Keeping the relevant information with the issue, in the Trac database,
is really helpful, as it helps others to pick up and investigate in an
ongoing way; they may not have all the context/state from what you are
actually trying to do at that moment.

I do appreciate the work you're doing in tracking down possible issues
with valgrind, and hope you understand it is easy to get overwhelmed by
issue reports individually.

thanks,
BMS

From bms at incunabulum.net Tue Oct 6 06:31:35 2009
From: bms at incunabulum.net (Bruce Simpson)
Date: Tue, 06 Oct 2009 14:31:35 +0100
Subject: [Xorp-hackers] Away 7th Oct - 21st Oct
Message-ID: <4ACB46B7.90707@incunabulum.net>

Hi all,

I'll be away on a trip in Scotland from 7th Oct - 21st Oct, getting a
well needed rest, so will only have sporadic access to email during that
time, and will be unavailable for support requests. I'll endeavour to
respond to email in detail when I get back.

A 1.7-RC for the community code beginning late November is a
possibility, this is depending on how much of the XRL replacement work
can be finished by then.

thanks,
BMS

From bms at incunabulum.net Tue Oct 6 06:55:49 2009
From: bms at incunabulum.net (Bruce Simpson)
Date: Tue, 06 Oct 2009 14:55:49 +0100
Subject: [Xorp-hackers] rtrmgr and TaskManager
In-Reply-To: <4ACA748B.6070308@candelatech.com>
References: <4ACA6B1B.8020604@candelatech.com> <4ACA748B.6070308@candelatech.com>
Message-ID: <4ACB4C65.1080303@incunabulum.net>

Ben Greear wrote:
> Well, here's the slow-down I'm seeing..but WTF would someone add a 1-second
> sleep here???
>

I don't really have time to delve into the Router Manager specifics
right now, but I'll share some of what's evident from the XRL
replacement work.

My guess here is that the Router Manager code is letting the EventLoop
run so other task(s) can get serviced. At this point in execution, the
processes are not being started; instead, the XRLs being fired off to
configure the process during startup, are shimmed.

One reason why this is required, is because XRL is trying to be
completely asynchronous. There's a fair amount of complexity in XRL, and
the Router Manager, to deal with the fact that XRL method resolution
happens on a per-method basis, and is completely asynchronous. The
EventLoop needs to be run() in order for things to happen, mostly
because C++ doesn't have continuations.

The lack of a synchronous model for coding to XRL, as an RPC mechanism,
means that we have some complexity in the 'show_*' tools. These are also
written in C++, because XRL is tied to C++ as an implementation language.

Any XRLs invoked by the Router Manager come from the template files; the
*.xrls files are used to validate XRL invocation against the targets.
These are always XRLs of the form 'finder://' which forces the
resolution to go via the Finder (an indirect method call using the
textual Finder protocol).

...

I'd argue that the Router Manager really needs to be revisited entirely,
and it has been in the commercial product, although not to the extent
I'd argue needs doing to make the routing processes useful outside of
the framework they're embedded in right now, as the code is realized in
the community branch.

It is purely configuration space stuff, and involves a text parser for a
configuration language, a configuration tree containing the current
router config state, and the marshaling/pushing of that state to and
from the routing processes.

One might argue that it doesn't even need to be written in C++, and an
object scripting language (e.g.
Python, Ruby) would be sufficiently mature (and fast) to do what a
Router Manager needs to do. All of these things could be realized in an
OO scripting language.

Of course, we don't really have free time on the board right now to deal
with this right off the bat. The timer you mention here as an issue
probably could be speeded up, however a time gap there is probably still
necessary to let other callbacks run.

I'm wary of wading into it too much before a 1.7-RC is cut, although if
you find that cutting corners in these areas helps, and doesn't disturb
functionality, it is something we can consider at that time.

cheers,
BMS

From greearb at candelatech.com Tue Oct 6 09:26:34 2009
From: greearb at candelatech.com (Ben Greear)
Date: Tue, 06 Oct 2009 09:26:34 -0700
Subject: [Xorp-hackers] rtrmgr and TaskManager
In-Reply-To: <4ACB4C65.1080303@incunabulum.net>
References: <4ACA6B1B.8020604@candelatech.com> <4ACA748B.6070308@candelatech.com> <4ACB4C65.1080303@incunabulum.net>
Message-ID: <4ACB6FBA.5060109@candelatech.com>

On 10/06/2009 06:55 AM, Bruce Simpson wrote:
> Ben Greear wrote:
>> Well, here's the slow-down I'm seeing..but WTF would someone add a
>> 1-second
>> sleep here???
>
> I don't really have time to delve into the Router Manager specifics
> right now, but I'll share some of what's evident from the XRL
> replacement work.
>
> My guess here is that the Router Manager code is letting the EventLoop
> run so other task(s) can get serviced. At this point in execution, the
> processes are not being started; instead, the XRLs being fired off to
> configure the process during startup, are shimmed.
>
> One reason why this is required, is because XRL is trying to be
> completely asynchronous. There's a fair amount of complexity in XRL, and
> the Router Manager, to deal with the fact that XRL method resolution
> happens on a per-method basis, and is completely asynchronous.
> The EventLoop needs to be run() in order for things to happen, mostly
> because C++ doesn't have continuations.
>
> The lack of a synchronous model for coding to XRL, as an RPC mechanism,
> means that we have some complexity in the 'show_*' tools. These are also
> written in C++, because XRL is tied to C++ as an implementation language.
>
> Any XRLs invoked by the Router Manager come from the template files; the
> *.xrls files are used to validate XRL invocation against the targets.
> These are always XRLs of the form 'finder://' which forces the
> resolution to go via the Finder (an indirect method call using the
> textual Finder protocol).
>
> ...
>
> I'd argue that the Router Manager really needs to be revisited entirely,
> and it has been in the commercial product, although not to the extent
> I'd argue needs doing to make the routing processes useful outside of
> the framework they're embedded in right now, as the code is realized in
> the community branch.
>
> It is purely configuration space stuff, and involves a text parser for a
> configuration language, a configuration tree containing the current
> router config state, and the marshaling/pushing of that state to and
> from the routing processes.
>
> One might argue that it doesn't even need to be written in C++, and an
> object scripting language (e.g. Python, Ruby) would be sufficiently
> mature (and fast) to do what a Router Manager needs to do. All of these
> things could be realized in an OO scripting language.
>
> Of course, we don't really have free time on the board right now to deal
> with this right off the bat. The timer you mention here as an issue
> probably could be speeded up, however a time gap there is probably still
> necessary to let other callbacks run.
>
> I'm wary of wading into it too much before a 1.7-RC is cut, although if
> you find that cutting corners in these areas helps, and doesn't disturb
> functionality, it is something we can consider at that time.

Anything that depends on waiting for other tasks to run by just sleeping
for a while is a broken algorithm, so I'd prefer to see the problems
sooner than later. From my poking at the code, I can't see any reason it
should need to sleep though...other tasks can run just fine after that
one completes. If there are others that *must* run first, hopefully they
are properly chained with callbacks (the commit seems to be done thus).

I'm going to run with zero timer there and see if any problems shake
out. After several hours yesterday, I had seen no problems, but saw
significant speed-up in 'commit' xorpsh commands which is very useful
for me.

With regard to re-architecting rtr-mgr: Networking is asynchronous by
design and considering that external events (interfaces coming & going,
link state bouncing, etc) can happen at any time, the code just needs to
deal properly with async events. The one thing I'd work towards is more
of a 'desired' v/s 'actual' config. Users could always configure any
logical configuration and the system will try to make this happen, but
it will also deal properly with 'phantom' things like interfaces that
don't exist currently. A different programming language isn't going to
help any of that I think..and I'd very much like to keep with c/c++.

Thanks,
Ben

>
> cheers,
> BMS

--
Ben Greear
Candela Technologies Inc http://www.candelatech.com

From greearb at candelatech.com Tue Oct 6 09:39:47 2009
From: greearb at candelatech.com (Ben Greear)
Date: Tue, 06 Oct 2009 09:39:47 -0700
Subject: [Xorp-hackers] [Xorp-users] Away 7th Oct - 21st Oct
In-Reply-To: <4ACB46B7.90707@incunabulum.net>
References: <4ACB46B7.90707@incunabulum.net>
Message-ID: <4ACB72D3.40300@candelatech.com>

On 10/06/2009 06:31 AM, Bruce Simpson wrote:
> Hi all,
>
> I'll be away on a trip in Scotland from 7th Oct - 21st Oct, getting
> a well needed rest, so will only have sporadic access to email during
> that time, and will be unavailable for support requests.
> I'll endeavour to respond to email in detail when I get back.
> A 1.7-RC for the community code beginning late November is a
> possibility, this is depending on how much of the XRL replacement work
> can be finished by then.

Enjoy, and please ignore all my emails during that time :)

Ben

>
> thanks,
> BMS
>
> _______________________________________________
> Xorp-users mailing list
> Xorp-users at xorp.org
> http://mailman.ICSI.Berkeley.EDU/mailman/listinfo/xorp-users

--
Ben Greear
Candela Technologies Inc http://www.candelatech.com

From greearb at candelatech.com Tue Oct 6 09:42:19 2009
From: greearb at candelatech.com (Ben Greear)
Date: Tue, 06 Oct 2009 09:42:19 -0700
Subject: [Xorp-hackers] PATCH: Remove some dead code, unlink pid-file on exit.
In-Reply-To: <4ACB411F.9070309@incunabulum.net>
References: <4ACA3B87.70309@candelatech.com> <4ACB411F.9070309@incunabulum.net>
Message-ID: <4ACB736B.5090304@candelatech.com>

On 10/06/2009 06:07 AM, Bruce Simpson wrote:
> Ben,
>
> A few comments about this patch:
> * Do you have a specific distribution which relies on the behaviour of
> removing the pid file after the daemon terminates?
> For the BSD distributions at least, when running XORP from an rc.subr
> init script, this shouldn't be an issue; the file is just ignored, and
> overwritten on the next run.
> It's good filesystem hygiene to remove it, though.

I have my own xorp startup logic, and having valid pid files makes
things slightly more efficient. More of a hygiene thing though. You can
never absolutely depend on atexit working, so you still need to check
pid file contents against /proc/[pid]/ to be certain.

Ben

--
Ben Greear
Candela Technologies Inc http://www.candelatech.com

From greearb at candelatech.com Tue Oct 6 09:51:25 2009
From: greearb at candelatech.com (Ben Greear)
Date: Tue, 06 Oct 2009 09:51:25 -0700
Subject: [Xorp-hackers] PATCH: Add startup methods for faster startup.
In-Reply-To: <4ACB2F21.7080603@incunabulum.net>
References: <4ACAC806.1000000@candelatech.com> <4ACB2F21.7080603@incunabulum.net>
Message-ID: <4ACB758D.1060302@candelatech.com>

On 10/06/2009 04:50 AM, Bruce Simpson wrote:
> Ben Greear wrote:
>> If there is no status and no startup method in a xorp target, the
>> router-mgr uses a 2-second
>> sleep for 'verification'. This slows down startup of Xorp quite a bit
>> when you have lots
>> of protocols running.
>>
>> This patch adds startup methods to many of the common targets. There
>> are still more to
>> go, however.
>
> Thanks for tracking this down; yes, I've noticed that process startup is
> slower than it could be, but have only had free time / mindspace to look
> at the XRL specifics.
>
> Could this be made a more general change? If the XIF method for startup
> you are adding is not specific to a particular protocol, it might be an
> idea to make it part of the common.xif -- which is where most of the
> process control knobs are.
> I'd rather not get too far into the machinery here, because I'm about to
> take a badly needed break. I guess the firewall and ifmgr modules are a
> special case, because they're separate service bundles located in the
> FEA process.

It could probably be put in common code. I'm just learning this code
myself...likely can get a cleaner patch later when I understand things
better.

>
> On a more general note:
> One of the things Pavlin raised in an old BugZilla ticket, is the fact
> that the Router Manager is fairly complex because it implements
> transactions on the config tree itself.
> If this is pushed into the protocols themselves (they'd have to keep
> their own config snapshot, and adopt a commit-rollback transaction model
> in the XIF RPC interfaces), then the Router Manager gets a bit simpler
> overall.

I dislike that because then it would become virtually impossible to
restart failed protocol processes. I think the rtr-mgr should hold all
config state.
As mentioned earlier, I think the commit/rollback thing has been
somewhat over-thought as well. There are way too many ways to fail a
commit...I think it should only fail if there are logical issues (and in
that case, the rtr-mgr shouldn't even try to 'commit': it should not
even accept the change in the first place.)

If it tries to push something to a module and that module reports error,
then we flag that piece of configuration as pending, or invalid, or
similar and these flags would show up when the user did a 'show run' or
similar. Then they know they need to fix it, perhaps by re-configuring,
or perhaps by fixing broken hardware, or some other external thing.

My various patches to not fail commits when we don't have to is my
ongoing effort towards this type of behaviour....

Ben

>
> cheers,
> BMS

--
Ben Greear
Candela Technologies Inc http://www.candelatech.com

From bms at incunabulum.net Tue Oct 6 09:54:12 2009
From: bms at incunabulum.net (Bruce Simpson)
Date: Tue, 06 Oct 2009 17:54:12 +0100
Subject: [Xorp-hackers] rtrmgr and TaskManager
In-Reply-To: <4ACB6FBA.5060109@candelatech.com>
References: <4ACA6B1B.8020604@candelatech.com> <4ACA748B.6070308@candelatech.com> <4ACB4C65.1080303@incunabulum.net> <4ACB6FBA.5060109@candelatech.com>
Message-ID: <4ACB7634.8040701@incunabulum.net>

Ben Greear wrote:
>
> Anything that depends on waiting for other tasks to run by just sleeping
> for a while is a broken algorithm, so I'd prefer to see the problems
> sooner
> than later.

That's not entirely true. Let me clarify.

In a threaded environment, this comment is valid; threads can starve
each other of resources, or cause deadlock/livelock/experience race
conditions, if synchronization is incorrect. So yes, your point about
tasks sleeping to achieve synchronization being a flawed mechanism, is
valid, in a threaded environment.

In a coroutine based environment, which is what XORP uses, this isn't a
valid comment.
Explicit yield points are necessary to allow other tasks to run, and
'synchronization' is achieved using state variables of some kind. This
is the case in C++ as with any other language which implements
coroutines -- there is only a single thread of execution, so in effect,
nothing is ever sleeping. The 'synchronization' point, if you like, is
when select() finally gets called. This is pretty much what the
io_service idiom in Boost C++ is doing.

Continuations offer language support for the coroutine construct, which
is something C++ doesn't have; see here and further on in this reply:
http://en.wikipedia.org/wiki/Coroutine

In the case of the Router Manager, I wouldn't be entirely surprised if
there were callbacks stacked up waiting for dispatch in the background,
however given how serial it is in nature (in terms of process bringup
and trying to avoid thundering herd problems for OS resources), I'm not
surprised it errs on the side of conservatism, hence the large timeout
thresholds.

> From my poking at the code, I can't see any reason it should
> need to sleep though...other tasks can run just fine after that one
> completes. If there are others that *must* run first, hopefully they
> are properly chained with callbacks (the commit seems to be done thus).
> I'm going to run with zero timer there and see if any problems
> shake out. After several hours yesterday, I had seen no problems, but
> saw significant
> speed-up in 'commit' xorpsh commands which is very useful for me.

I buy the argument, but I'm sure you can understand my hands-off /
kid-gloves position with regards to the Router Manager and taking
changes for it -- it is a large C++ subsystem which I'm not entirely
familiar with, and when I've made changes to it in the past, mostly when
porting to Win32, it's been a case of get in, get out, stay focused, get
it over with, and survive it.
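
[Editorial sketch] To make the cooperative-scheduling point above a bit
more concrete, here is a minimal single-threaded loop in which a
callback "yields" simply by returning, and follow-up work is chained by
queueing another callback rather than by sleeping. This is only an
illustration under stated assumptions: MiniLoop, defer(), run_one(),
run_until_idle() and demo_commit_order() are hypothetical names invented
here, not XORP's EventLoop, XorpTimer or XorpTask API, and a real loop
would also wait in select() for I/O and timers.

```cpp
#include <cassert>
#include <deque>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical miniature of a single-threaded, callback-driven loop.
// None of these names are XORP's; they only illustrate the idea.
class MiniLoop {
public:
    using Callback = std::function<void()>;

    // Queue a callback for a later turn of the loop; this plays the
    // role of a zero-delay timer or a deferred task.
    void defer(Callback cb) {
        assert(cb && "callback must be callable");
        ready_.push_back(std::move(cb));
    }

    // Run exactly one queued callback. Returns false when idle.
    bool run_one() {
        if (ready_.empty())
            return false;
        Callback cb = std::move(ready_.front());
        ready_.pop_front();
        cb();  // runs to completion; returning is the yield point
        return true;
    }

    // Drain the queue, including work queued by the callbacks themselves.
    void run_until_idle() {
        while (run_one()) {
        }
    }

private:
    std::deque<Callback> ready_;
};

// A two-step "commit": step 2 is chained from step 1 through the loop,
// so an unrelated task queued in between is still serviced, and nothing
// ever sleeps.
inline std::vector<std::string> demo_commit_order() {
    MiniLoop loop;
    std::vector<std::string> log;

    loop.defer([&] {
        log.push_back("commit: step 1");
        loop.defer([&] { log.push_back("commit: step 2"); });
    });
    loop.defer([&] { log.push_back("other task"); });

    loop.run_until_idle();
    return log;
}
```

In this sketch the ordering comes purely from how the callbacks are
chained; a non-zero delay would only matter if some step depended on
work that is not expressed as a callback in the queue.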

If you experiment with turning those timeouts down, and it works for
you, that's great, but I really need to have a clear picture of what's
going on, if I'm to be expected to support it on an ongoing basis.

>
> With regard to re-architecting rtr-mgr: Networking is asynchronous by
> design
> and considering that external events (interfaces coming & going, link
> state
> bouncing, etc) can happen at any time, the code just needs to deal
> properly
> with async events.

In XORP's case, more engineering time seems to have been burnt up on
getting the XRL layer written than on these external events you mention.
The FEA in theory handles all of these events, it is something of a
kitchen sink. What could do with better realization is how these events
are propagated to the rest of the system -- which is why I've been
focused on looking at XRL.

> The one thing I'd work towards is more of a 'desired'
> v/s 'actual' config. Users could always configure any logical
> configuration
> and the system will try to make this happen, but it will also deal
> properly
> with 'phantom' things like interfaces that don't exist currently. A
> different
> programming language isn't going to help any of that I think..and I'd
> very
> much like to keep with c/c++.

As you've probably already seen, the Router Manager code is non-trivial,
and there's a lot of complexity in there to deal with the asynchrony of
the XRL RPC calls.

I agree that the configuration model needs serious looking at for things
like dynamic interfaces (VPN, wireless, hot-swappable cards etc) and
it's something which I raised several times as an agenda point during my
time at ICSI. Unfortunately, the development focus has been in other
areas, and I haven't been in a position to call the shots on where the
effort went. I certainly got the impression that this put some folk off
from trying XORP in the here and now.
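
[Editorial sketch] The 'desired' v/s 'actual' split quoted above could
look roughly like the following. It is a sketch only, assuming the
desired config is a map from interface name to address and the actual
state is the set of interfaces the system currently reports; ConfigState
and reconcile() are hypothetical names, not anything in the rtrmgr or
the FEA.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>

// Hypothetical per-item status, as it might be shown next to each
// piece of configuration.
enum class ConfigState { Applied, Pending };

// desired: interface name -> configured address (what the user asked for)
// present: interfaces the system currently reports (what exists now)
// A desired item whose interface is missing is a 'phantom': it is kept
// and marked Pending instead of causing a failure.
inline std::map<std::string, ConfigState>
reconcile(const std::map<std::string, std::string>& desired,
          const std::set<std::string>& present) {
    std::map<std::string, ConfigState> status;
    for (const auto& entry : desired) {
        assert(!entry.first.empty() && "interface names must be non-empty");
        const bool exists = present.count(entry.first) > 0;
        status[entry.first] =
            exists ? ConfigState::Applied : ConfigState::Pending;
    }
    return status;
}
```

Re-running reconcile() whenever an interface appears or disappears would
move items between Pending and Applied without the user having to
re-enter anything.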

Regarding the use of C/C++ for development: XORP is strongly tied to the
concept of continuations, even if it doesn't have language support.
Twisted Python at least has the benefit of strong language support for
continuations, in the form of how it overloads the 'yield' operator.
This allows a call stack frame to be easily tucked away and restored at
a later point in time, and in an exception safe way.

There have been efforts over the years to try to do this in C++, e.g.
uC++, Concurrent C++ and others, but none of them have matured
sufficiently for production use. What we have in XORP is a compromise,
and it's largely tied to the semantics of how I/O happens in a UNIX-like
system.

cheers
BMS

From greearb at candelatech.com Tue Oct 6 10:05:43 2009
From: greearb at candelatech.com (Ben Greear)
Date: Tue, 06 Oct 2009 10:05:43 -0700
Subject: [Xorp-hackers] PATCH: Fix commit failure on device removal race, related to IGMP.
In-Reply-To: <4ACB3BED.2090100@incunabulum.net>
References: <4ACA846C.7040908@candelatech.com> <4ACB3BED.2090100@incunabulum.net>
Message-ID: <4ACB78E7.6010402@candelatech.com>

On 10/06/2009 05:45 AM, Bruce Simpson wrote:
> Ben,
>
> Can you please raise a Trac ticket about this issue, and attach your patch?

These bugs are all over the place...I think it will be a waste of effort
to open bugs and attach patches for each instance. (Notice bug-trac:
10599, open for more than a year, with the simplest possible patch
attached).

I think it's best that I post patches, get feedback, fix them as much as
possible, and keep them in my tree for continuous testing. When you have
time to review & commit this sort of stuff, we can deal with it in
larger chunks. By then I should have more of the issues found and fixed.

Thanks,
Ben

--
Ben Greear
Candela Technologies Inc http://www.candelatech.com

From greearb at candelatech.com Tue Oct 6 10:34:12 2009
From: greearb at candelatech.com (Ben Greear)
Date: Tue, 06 Oct 2009 10:34:12 -0700
Subject: [Xorp-hackers] rtrmgr and TaskManager
In-Reply-To: <4ACB7634.8040701@incunabulum.net>
References: <4ACA6B1B.8020604@candelatech.com> <4ACA748B.6070308@candelatech.com> <4ACB4C65.1080303@incunabulum.net> <4ACB6FBA.5060109@candelatech.com> <4ACB7634.8040701@incunabulum.net>
Message-ID: <4ACB7F94.10103@candelatech.com>

On 10/06/2009 09:54 AM, Bruce Simpson wrote:
> Ben Greear wrote:

> I buy the argument, but I'm sure you can understand my hands-off /
> kid-gloves position with regards to the Router Manager and taking
> changes for it -- it is a large C++ subsystem which I'm not entirely
> familiar with, and when I've made changes to it in the past, mostly when
> porting to Win32, it's been a case of get in, get out, stay focused, get
> it over with, and survive it.
>
> If you experiment with turning those timeouts down, and it works for
> you, that's great, but I really need to have a clear picture of what's
> going on, if I'm to be expected to support it on an ongoing basis.

You are welcome to expect me to support it, but then you'll need to
accept my patches, and I'm liable to get medieval on it :)
As you can tell, I don't mind changing things I don't well understand :)

And, I'll probably work towards my ideas about desired v/s actual and
definitely not towards more fine-grained threading (which is what that
'continuation' stuff you talk about sounds like to me.)

I do like select loops with event handling though, and I am continuously
grateful that there are no pthreads in xorp!

> In XORP's case, more engineering time seems to have been burnt up on
> getting the XRL layer written than on these external events you mention.
> The FEA in theory handles all of these events, it is something of a
> kitchen sink.
> What could do with better realization is how these events are
> propagated to the rest of the system -- which is why I've been
> focused on looking at XRL.

I think joining fea and rtr-mgr into a single process makes a lot of
sense. Let the protocols remain separate.

>> The one thing I'd work towards is more of a 'desired'
>> v/s 'actual' config. Users could always configure any logical
>> configuration
>> and the system will try to make this happen, but it will also deal
>> properly
>> with 'phantom' things like interfaces that don't exist currently. A
>> different
>> programming language isn't going to help any of that I think..and I'd
>> very
>> much like to keep with c/c++.
>
> As you've probably already seen, the Router Manager code is non-trivial,
> and there's a lot of complexity in there to deal with the asynchrony of
> the XRL RPC calls.

The XRL RPC basically just works, as far as I can tell. The logic needed
to deal with these dynamic events should be entirely outside of the RPC
mechanism..it's just a transport.

The bugs I find in this area are in protocols and FEA, mostly because
they always expect they know everything and return errors and/or assert
whenever something unexpected happens. I'm (slowly) fixing this because
I need it for my own efforts.

Someday I'll add a XORP_WARNING return instead of just OK and ERROR so
that we can return warning messages w/out failing commands.

> I agree that the configuration model needs serious looking at for things
> like dynamic interfaces (VPN, wireless, hot-swappable cards etc) and
> it's something which I raised several times as an agenda point during my
> time at ICSI. Unfortunately, the development focus has been in other
> areas, and I haven't been in a position to call the shots on where the
> effort went. I certainly got the impression that this put some folk off
> from trying XORP in the here and now.

Well, I'm in a position to fix this in my tree, and I'm doing so.
I've no idea who has position to do that to the public tree if you
don't!

Thanks,
Ben

--
Ben Greear
Candela Technologies Inc http://www.candelatech.com

From greearb at candelatech.com Tue Oct 6 10:56:02 2009
From: greearb at candelatech.com (Ben Greear)
Date: Tue, 06 Oct 2009 10:56:02 -0700
Subject: [Xorp-hackers] PATCH: Logging improvements, fix artificial deal for xorpsh commit.
In-Reply-To: <4ACB3B92.3050505@incunabulum.net>
References: <4ACA7A19.30909@candelatech.com> <4ACB3B92.3050505@incunabulum.net>
Message-ID: <4ACB84B2.2010909@candelatech.com>

On 10/06/2009 05:44 AM, Bruce Simpson wrote:
> Ben Greear wrote:

> Comments:
> * It should be possible to turn off the millisecond logging if desired.
> Whilst it's certainly a useful feature to have when debugging time
> contingent code, it does add clutter to the output.
> * Perhaps putting it under the other debug knobs in SConstruct would be
> a good idea?

Will add an #ifdef that could be twiddled in scons.

> * %llu is not a portable format specifier, and 'unsigned long long' is
> not a portable type, please don't use them in portable code.

Ok, I can use uint64_t, but what do you use instead of %llu to print it?

> * Perhaps the code which prints the timeval is a candidate for a
> function like xlog_localtime2string_short() ?
>
> * xlog_localtime2string_short() is still defined in xlog.c; so why
> comment out its prototype, are you getting warnings from the compiler?

It was all commented out...I removed entirely now.

> * A XorpTimer of 0 is a possible candidate for a XorpTask. I can't
> really delve further into that change at the moment, though.
> * Yes, it may be useful to constify the string arguments in those
> callback functions, but this change considered low priority.
> * Please avoid introducing unnecessary whitespace changes in diffs.
>
>
> Can you please raise a Trac item for these suggested improvements?
> I probably won't have time to look at the Router Manager in detail for
> at least 4 weeks.
>
> Sorry for the bureaucracy... I appreciate you're doing what you can in
> the here and now to improve the code, however, it makes reviewing
> patches and applying them that much easier, and we do need to keep the
> code alignment and type clean, etc.

I'll let these changes perk in my tree..plz let me know when you're
ready to work on it (no hurry) and I'll make diffs against upstream and
we can quickly resolve any issues and commit the code.

In the meantime, these reviews are appreciated and will help make the
eventual merge easier I think.

Thanks,
Ben

--
Ben Greear
Candela Technologies Inc http://www.candelatech.com

From greearb at candelatech.com Tue Oct 6 10:58:36 2009
From: greearb at candelatech.com (Ben Greear)
Date: Tue, 06 Oct 2009 10:58:36 -0700
Subject: [Xorp-hackers] PATCH: Add startup methods for faster startup.
In-Reply-To: <4ACB2F21.7080603@incunabulum.net>
References: <4ACAC806.1000000@candelatech.com> <4ACB2F21.7080603@incunabulum.net>
Message-ID: <4ACB854C.3070008@candelatech.com>

On 10/06/2009 04:50 AM, Bruce Simpson wrote:
> Ben Greear wrote:
>> If there is no status and no startup method in a xorp target, the
>> router-mgr uses a 2-second
>> sleep for 'verification'. This slows down startup of Xorp quite a bit
>> when you have lots
>> of protocols running.
>>
>> This patch adds startup methods to many of the common targets. There
>> are still more to
>> go, however.
>
> Thanks for tracking this down; yes, I've noticed that process startup is
> slower than it could be, but have only had free time / mindspace to look
> at the XRL specifics.

By the way, I did a quick oprofile run the other day. Xorpsh was the top
offender by far...and I don't remember seeing xrl anywhere near the top
of the list. Not all configurations will show the same performance
graphs, of course...but at least it isn't a large problem in all cases.

I'll be testing with oprofile some more..will post results next time I
get something interesting.
Thanks, Ben -- Ben Greear Candela Technologies Inc http://www.candelatech.com From bms at incunabulum.net Wed Oct 7 01:36:16 2009 From: bms at incunabulum.net (Bruce Simpson) Date: Wed, 07 Oct 2009 09:36:16 +0100 Subject: [Xorp-hackers] PATCH: Logging improvements, fix artificial deal for xorpsh commit. In-Reply-To: <4ACB84B2.2010909@candelatech.com> References: <4ACA7A19.30909@candelatech.com> <4ACB3B92.3050505@incunabulum.net> <4ACB84B2.2010909@candelatech.com> Message-ID: <4ACC5300.3010703@incunabulum.net> Ben Greear wrote: > > Will add an #ifdef that could be twiddled in scons. Excellent... > >> * %llu is not a portable format specifier, and 'unsigned long long' is >> not a portable type, please don't use them in portable code. > > Ok, I can use uint64_t, but what do you use instead of %llu to print it? %j and intmax_t is ISO C99 portable. It sucks because it means casting to the widest integer type on the platform, but it's a known quantity. 'long long' has been a problem since well before Sun brought out SPARCV9. cheers, BMS From bms at incunabulum.net Wed Oct 7 01:49:51 2009 From: bms at incunabulum.net (Bruce Simpson) Date: Wed, 07 Oct 2009 09:49:51 +0100 Subject: [Xorp-hackers] rtrmgr and TaskManager In-Reply-To: <4ACB7F94.10103@candelatech.com> References: <4ACA6B1B.8020604@candelatech.com> <4ACA748B.6070308@candelatech.com> <4ACB4C65.1080303@incunabulum.net> <4ACB6FBA.5060109@candelatech.com> <4ACB7634.8040701@incunabulum.net> <4ACB7F94.10103@candelatech.com> Message-ID: <4ACC562F.9010807@incunabulum.net> Ben Greear wrote: > > You are welcome to expect me to support it, but then you'll need > to accept my patches, and I'm liable to get medieval on it :) > As you can tell, I don't mind changing things I don't well understand :) Yeah, that's what gets things done in the end. 
> And, I'll probably work towards my ideas about desired v/s actual and > definitely not towards more fine-grained threading (which is what > that 'continuation' stuff you talk about sounds like to me.) It is a real problem. Facebook are shipping a C++ concurrency library in Thrift which is largely modeled on Java constructs. Fortunately, the splice for XORP doesn't need to use this -- you can easily end up with several such models which don't overlap. There have been efforts, around and under the umbrella of the Boost project, to come up with better frameworks for implementing state machines e.g. Boost.Statechart. It's kind of sad that in CS education that Java has been pushed over languages which force students into a situation where they don't learn about computer architecture (I was a bit of a rebel at uni, and spent my time learning THIS stuff instead of the syllabus, by that time this 'dumbing down' element was already happening in Scottish higher education -- I could go on and on about how I was reading Jay Sussman's 'wizard book' at 16, and never touched it at uni, but I'd just sound like a bitter geek). Sometimes there is no substitute for a finite state machine (FSM), if you want tight code; threads have their own penalties. Erlang is an interesting exception because they plain did away with both notions of coroutine and thread. Tasks are extremely lightweight in Erlang, although the scheduler is purely best-effort (at least in the openly-available Erlang Open Telecoms Platform (OTP), open sourced by Ericsson), and tasks can't even share variables; they communicate by message passing. So they are somewhere between those two extremes -- scheduling is not necessarily cooperative. Erlang also has language framework support for FSMs, and there are nice abstractions for tying protocol decode (at a bitfield level) to Erlang variables. This just eliminates a whole layer of complexity in the code developers have to write for communication apps. 
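The "no substitute for an FSM" remark above can be illustrated with a minimal table-driven sketch in C++. The states and events here (a three-state peer session) are hypothetical and chosen only for illustration; they are not taken from XORP, Boost.Statechart, or Erlang.

```cpp
// Hypothetical states and events, for illustration only (not XORP code).
enum State { IDLE, CONNECTING, ESTABLISHED, NUM_STATES };
enum Event { EV_START, EV_CONNECTED, EV_ERROR, NUM_EVENTS };

// Transition table indexed as kNextState[current state][event].
static const State kNextState[NUM_STATES][NUM_EVENTS] = {
    /* IDLE        */ { CONNECTING,  IDLE,        IDLE },
    /* CONNECTING  */ { CONNECTING,  ESTABLISHED, IDLE },
    /* ESTABLISHED */ { ESTABLISHED, ESTABLISHED, IDLE },
};

// One step of the machine: a single table lookup, no threads or locks.
inline State fsm_step(State s, Event e) {
    return kNextState[s][e];
}
```

An event loop would call fsm_step() once per event and act on the returned state, which is why this style fits select-loop code without needing threads.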
Education is great, more important, the will and intent to just DO THINGS, and sometimes that means side-stepping what is known already -- or applying it appropriately on a jagged path, a bit like forked lightning. > I do > like select loops with event handling though, and I am continuously > grateful > that there are no pthreads in xorp! Yes, appropriately threaded code is harder to debug, and inappropriate locking can really rain on your parade. > > I think joining fea and rtr-mgr into a single process makes a lot of > sense. Let the protocols remain separate. There's a lot of state in there which makes that non-trivial. I've played with the idea of making certain components 'in-process servers' COM style, i.e. loadable .so's. Thrift should speed up RPC performance, so I'm not P.S. Robert Watson is being funded by Google to finish SOCK_SEQPACKET for AF_UNIX on FreeBSD which helps. Chrome is using it under the hood, it turns out. There's a little bit of additional complexity in async RPC (both Thrift and XRL) to deal with out-of-order delivery. Not fully implemented, though. And not all kernels will dispatch async in flight. This is where you see the design schism between the UNIX-like ones (Linux, BSD) and Windows (NT), which is fully async under the hood, and reordering of any local IPC can be a real issue (I/O completion ports). > ... > Someday I'll add a XORP_WARNING return instead of just OK and ERROR so > that we can return warning messages w/out failing commands. More appropriate use of exceptions might be better. Orion has argued that removing exceptions keeps the footprint down, I wonder if it's worth the churn. > > Well, I'm in a position to fix this in my tree, and I'm doing so. > I've no idea who has position to do that to the public tree if you don't! 
My personal agenda is that we have a whole load of stuff in XORP which makes it easy for people to do things in the routing space, we just need to work on the realization of the goal of folk actually using it. Got a train to catch... BMS From greearb at candelatech.com Wed Oct 7 09:36:07 2009 From: greearb at candelatech.com (Ben Greear) Date: Wed, 07 Oct 2009 09:36:07 -0700 Subject: [Xorp-hackers] rtrmgr and TaskManager In-Reply-To: <4ACC562F.9010807@incunabulum.net> References: <4ACA6B1B.8020604@candelatech.com> <4ACA748B.6070308@candelatech.com> <4ACB4C65.1080303@incunabulum.net> <4ACB6FBA.5060109@candelatech.com> <4ACB7634.8040701@incunabulum.net> <4ACB7F94.10103@candelatech.com> <4ACC562F.9010807@incunabulum.net> Message-ID: <4ACCC377.6040207@candelatech.com> On 10/07/2009 01:49 AM, Bruce Simpson wrote: > There's a little bit of additional complexity in async RPC (both Thrift > and XRL) to deal with out-of-order delivery. Not fully implemented, though. I think as long as each process's events are (or can be) ordered, it isn't so big of a deal if there is reordering in time among different processes. A single process, say xorpsh, could then just wait for completion of the previous request before making a new one to ensure serialization. I *think* that the 'commit' is serialized like this already, but if not, I'll need to make it so. For other applications, it would be good to turn on serialization by default (for instance, the router daemons often make xrl calls and appear to expect the previous one to complete before the second one is attempted). Based on my brief look at rtr-mgr, if the client process doesn't wait for completion, then it's *possible* for requests from the same process to be reordered. >> Someday I'll add a XORP_WARNING return instead of just OK and ERROR so >> that we can return warning messages w/out failing commands. > > More appropriate use of exceptions might be better. 
Orion has argued > that removing exceptions keeps the footprint down, I wonder if it's > worth the churn. You couldn't throw an exception across an RPC, but you could return proper error codes and text strings to describe the error/warning/info/etc. Either way, I don't like C++ exceptions and prefer using return values and/or passing an error-reporting construct by reference to be filled out by lower calls (like passing in a string& err_msg, and using the return value to know if an error actually happened.) We could use a more formal construct, maybe consisting of something like:
class foo {
    int severity; // enum, how bad was it? info,warning,error,fatal ??
    int error_code; // enum, like errno perhaps? no-such-route,no-such-vif,invalid-request, ...
    string message;
};
Thanks, Ben -- Ben Greear Candela Technologies Inc http://www.candelatech.com From greearb at candelatech.com Wed Oct 7 09:51:03 2009 From: greearb at candelatech.com (Ben Greear) Date: Wed, 07 Oct 2009 09:51:03 -0700 Subject: [Xorp-hackers] PATCH: Logging improvements, fix artificial deal for xorpsh commit. In-Reply-To: <4ACC5300.3010703@incunabulum.net> References: <4ACA7A19.30909@candelatech.com> <4ACB3B92.3050505@incunabulum.net> <4ACB84B2.2010909@candelatech.com> <4ACC5300.3010703@incunabulum.net> Message-ID: <4ACCC6F7.2020404@candelatech.com> On 10/07/2009 01:36 AM, Bruce Simpson wrote: > Ben Greear wrote: >> >> Will add an #ifdef that could be twiddled in scons. > > Excellent... > >> >>> * %llu is not a portable format specifier, and 'unsigned long long' is >>> not a portable type, please don't use them in portable code. >> >> Ok, I can use uint64_t, but what do you use instead of %llu to print it? > > %j and intmax_t is ISO C99 portable. It sucks because it means casting > to the widest integer type on the platform, but it's a known quantity. > 'long long' has been a problem since well before Sun brought out SPARCV9. From MS's page, they may not support %j (or %ll for that matter).
Maybe they just don't document it: http://msdn.microsoft.com/en-us/library/hf4y5e3w%28VS.71%29.aspx Anyway, I think I'll leave it %llu for now. It's not the end of the world if some obscure platform uses something other than a 64-bit value for this, and if it breaks compile due to snprintf limitations on some platform, we can fix it then with #ifdef or some other kludge. (Using uint64_t and %llu is a compile warning for F11, 64-bit, btw, but unsigned long long and %llu works fine on 32 and 64 bit.) Ben -- Ben Greear Candela Technologies Inc http://www.candelatech.com From greearb at candelatech.com Wed Oct 7 21:31:18 2009 From: greearb at candelatech.com (Ben Greear) Date: Wed, 07 Oct 2009 21:31:18 -0700 Subject: [Xorp-hackers] PATCH: Allow delayed start of PIM vif Message-ID: <4ACD6B16.6080500@candelatech.com> Here's an example of not failing a commit because the network interface isn't ready. Needs a bit more testing, but this is the behaviour I'm trying to move toward. Many of the other protocols need similar work, but I'm just posting a single patch for comment now. Thanks, Ben -- Ben Greear Candela Technologies Inc http://www.candelatech.com -------------- next part -------------- A non-text attachment was scrubbed... Name: patch0.patch Type: text/x-patch Size: 2802 bytes Desc: not available Url : http://mailman.ICSI.Berkeley.EDU/pipermail/xorp-hackers/attachments/20091007/e838c051/attachment.bin From lizhaous2000 at yahoo.com Thu Oct 8 08:22:04 2009 From: lizhaous2000 at yahoo.com (Li Zhao) Date: Thu, 8 Oct 2009 08:22:04 -0700 (PDT) Subject: [Xorp-hackers] static xrl interface calls Message-ID: <569389.38414.qm@web58706.mail.re1.yahoo.com> As the documentation says, XrlStaticRoutesV0p1Client::send_add_route4 is called from rtrmgr. But actually I do not see that symbol in rtrmgr. In fact, I do not see any process calling this method. On the other hand, the target method XrlStaticRoutsNode::static_routes_0_1_add_route4 was called in xorp_static_routes.
I do not know how this was triggered. Can anybody explain this to me? Thanks. Li From jtc at acorntoolworks.com Fri Oct 9 07:40:50 2009 From: jtc at acorntoolworks.com (J.T. Conklin) Date: Fri, 09 Oct 2009 07:40:50 -0700 Subject: [Xorp-hackers] PATCH: Logging improvements, fix artificial deal for xorpsh commit. In-Reply-To: <4ACB3B92.3050505@incunabulum.net> (Bruce Simpson's message of "Tue, 06 Oct 2009 13:44:02 +0100") References: <4ACA7A19.30909@candelatech.com> <4ACB3B92.3050505@incunabulum.net> Message-ID: <8763aobpfx.fsf@orac.acorntoolworks.com> Bruce Simpson writes: > Comments: > * It should be possible to turn off the millisecond logging if > desired. Whilst it's certainly a useful feature to have when debugging > time contingent code, it does add clutter to the output. I think that now systems are so fast that sub-second timestamps should almost always be used for log/event timestamps. This is especially true in distributed systems where log messages from separate programs are/need to be merged into one stream. Without sub-second timestamps, everything appears to happen at one time. For what it's worth, the new RFC5424 says the "originator SHOULD include TIME-SECFRAC if its clock accuracy and performance permit". While we don't currently format log messages according to the RFC, we should probably open a Trac ticket along those lines. Emitting RFC compliant log messages will make it easier for automated log analysis and database systems to handle XORP output. > * Perhaps putting it under the other debug knobs in SConstruct would > be a good idea? As for configure knobs to control this (and other) log behavior... I think it's worth considering totally rototilling XORP's log subsystem and using log4cxx (or log4j/log4cxx inspired code of our own). If we took full advantage of the framework, we could have much finer-grained control of log messages by defining logger hierarchies (e.g., we could enable messages just from the xorp.bgp.foo.bar logger).
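The logger-hierarchy idea can be sketched as follows. This is a minimal illustration of log4j-style name inheritance, assuming a hypothetical LoggerLevels class; it is not the log4cxx API and not existing XORP code.

```cpp
#include <map>
#include <string>

// Hypothetical registry of per-logger enable flags, keyed by dotted name.
// It mimics the log4j-style hierarchy idea; it is NOT the log4cxx API.
class LoggerLevels {
public:
    void set_enabled(const std::string& name, bool on) { _flags[name] = on; }

    // A logger inherits the setting of its nearest configured ancestor,
    // e.g. "xorp.bgp.foo.bar" falls back to "xorp.bgp.foo", then "xorp.bgp".
    bool is_enabled(const std::string& name) const {
        std::string n = name;
        for (;;) {
            std::map<std::string, bool>::const_iterator it = _flags.find(n);
            if (it != _flags.end())
                return it->second;
            std::string::size_type dot = n.rfind('.');
            if (dot == std::string::npos)
                return false;   // nothing configured on the path: default off
            n.erase(dot);       // strip the last component and retry
        }
    }

private:
    std::map<std::string, bool> _flags;
};
```

With this shape, enabling only "xorp.bgp.foo.bar" turns on that logger and its children while leaving "xorp.bgp.foo" and every other subsystem quiet.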
We could also define/select format specifications with a config file, avoiding compiling in behavior like whether sub-second timestamps would be used. It's a big project, but I think it has the potential of similarly big rewards. In the short term, I think we should change the log output to be RFC 5424 compliant, including sub-second timestamps. > * %llu is not a portable format specifier, and 'unsigned long long' > is not a portable type, please don't use them in portable code. As Ben discovered, your suggestion to cast to intmax_t and use the %j format specifier doesn't work on older systems. I think fixed-size integral types like int64_t, uint64_t, etc. and the corresponding macros like PRId64, PRIu64, etc. were introduced in C99, and should be the most portable. And, if we run into any systems that don't have them, it should be easy enough to define the types and macros in xorp_config.h. --jtc -- J.T. Conklin From jtc at acorntoolworks.com Fri Oct 9 07:58:48 2009 From: jtc at acorntoolworks.com (J.T. Conklin) Date: Fri, 09 Oct 2009 07:58:48 -0700 Subject: [Xorp-hackers] PATCH: Logging improvements, fix artificial deal for xorpsh commit. In-Reply-To: <4ACA7A19.30909@candelatech.com> (Ben Greear's message of "Mon, 05 Oct 2009 15:58:33 -0700") References: <4ACA7A19.30909@candelatech.com> Message-ID: <871vlcbolz.fsf@orac.acorntoolworks.com> Ben Greear writes: > 2) Change some pass-by-value string arguments to const string& in > router-mgr. This will improve performance and a small bit of memory > usage. Hi Ben, This, and passing string literals to functions/methods that expected string parameters, was identified as being responsible for a huge proportion of the static footprint bloat during the "XORP on a Diet" project I did while at the company. Most, if not all, of the problems I found and fixed made it back to the public sources. I used rather rudimentary tools (grep, nm, perl scripts, etc.)
to identify the sources, so I'm not terribly surprised that others still exist. Fortunately, these tend to be uncontroversial and quite easy to fix. --jtc -- J.T. Conklin From greearb at candelatech.com Fri Oct 9 08:10:53 2009 From: greearb at candelatech.com (Ben Greear) Date: Fri, 09 Oct 2009 08:10:53 -0700 Subject: [Xorp-hackers] PATCH: Logging improvements, fix artificial deal for xorpsh commit. In-Reply-To: <871vlcbolz.fsf@orac.acorntoolworks.com> References: <4ACA7A19.30909@candelatech.com> <871vlcbolz.fsf@orac.acorntoolworks.com> Message-ID: <4ACF527D.9070202@candelatech.com> J.T. Conklin wrote: > Ben Greear writes: > >> 2) Change some pass-by-value string arguments to const string& in >> router-mgr. This will improve performance and a small bit of memory >> usage. >> > > Hi Ben, > > This, and passing string literals to functions/methods that expected > string parameters, was identified as being responsible for a huge > proportion of the static footprint bloat during the "XORP on a Diet" > project I did while at the company. Most, if not all, of the problems > I found and fixed made it back to the public sources. > > I used rather rudimentary tools (grep, nm, perl scripts, etc.) to > identify the sources, so I'm not terribly surprised that others still > exist. > > Fortunately, these tend to be uncontroversial and quite easy to fix. > Yeah, it was a trivial fix and almost certainly harmless. I plan to continue fixing such problems as I find them. I'll remember to keep an eye out for passing string literals too..haven't been watching for those. I've found no problems with the removal of delays for xorpsh commit and the startup logic in 3 days of heavy testing, by the way. It's possible it's uncovered a few races I wouldn't have noticed otherwise...but the bugs were there regardless (and I'm fixing the bugs as I find them.)
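The pass-by-value versus const string& point discussed above can be made concrete with a small sketch. CountedString is a hypothetical stand-in for std::string that counts copy constructions so the difference is observable; the function names are likewise invented and are not rtrmgr code.

```cpp
#include <string>

// Hypothetical stand-in for std::string that counts copy constructions.
struct CountedString {
    static int copies;
    std::string value;
    explicit CountedString(const std::string& v) : value(v) {}
    CountedString(const CountedString& o) : value(o.value) { ++copies; }
};
int CountedString::copies = 0;

// Pass-by-value: the argument is copy-constructed on every call.
inline std::string::size_type len_by_value(CountedString s) {
    return s.value.size();
}

// Pass-by-const-reference: the callee reads the caller's object, no copy.
inline std::string::size_type len_by_cref(const CountedString& s) {
    return s.value.size();
}
```

Calling len_by_value() with an existing object bumps CountedString::copies by one per call, while len_by_cref() leaves it unchanged. Passing a string literal where a string parameter is expected still constructs a temporary at each call site, which is the footprint cost mentioned in the thread.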
Thanks, Ben -- Ben Greear Candela Technologies Inc http://www.candelatech.com From jtc at acorntoolworks.com Fri Oct 9 08:30:41 2009 From: jtc at acorntoolworks.com (J.T. Conklin) Date: Fri, 09 Oct 2009 08:30:41 -0700 Subject: [Xorp-hackers] Pending SCons configure change Message-ID: <87pr8wa8ke.fsf@orac.acorntoolworks.com> Hi, I have a change to the SCons command line option/variable processing pending in my workspace that should be ready to commit this weekend, and I wanted to give everyone the heads up (and a chance to voice objections). Currently SCons takes arch=..., os=..., cross=..., and rel=... command line options. arch=, os=, and cross= are for (initial) support of cross compiling, and set the CPU architecture and OS for the host system. rel= is supposedly used for the "release" number, but is really only used to append to the build directory. I'm planning on completely removing rel=. Currently it defaults to "public17", so the build directory defaults to obj/--public17. IMO, this doesn't add any value. In fact it could be considered harmful. When we change the default, either for the 1.7 release candidate or for 1.8 development, it will orphan the current build directory and force everything to be rebuilt, even though no (or few) changes have been made. I'm planning on completely removing cross=. This is currently not used. After this change, we'll be able to determine that we're cross compiling if different build= and host= options were specified. I'm planning on replacing the arch= and os= options with build= and host=, which would take the GNU system triple (<cpu>-<vendor>-<os>) or alias, just like the --build= and --host= options to a configure script. The build= option will allow the user to specify the build system, instead of having it guessed as it is today.
While we could re-implement the function of the config.guess and config.sub scripts in python and execute them within SCons, at least for the time being I've added them to the build and have made the SConstruct use them. This ensures the behavior and the accepted system triples are the same as every other GNU project. Like before, if no arguments are passed on the SCons command line, build and host systems are guessed, and a native XORP installation is built. The default build directory will now be obj/. Since host will now be the standard GNU system triple, this may result in a rebuild and a new object directory (orphaning any objdirs with the old name). --jtc -- J.T. Conklin From greearb at candelatech.com Fri Oct 9 11:58:04 2009 From: greearb at candelatech.com (Ben Greear) Date: Fri, 09 Oct 2009 11:58:04 -0700 Subject: [Xorp-hackers] Pending SCons configure change In-Reply-To: <87pr8wa8ke.fsf@orac.acorntoolworks.com> References: <87pr8wa8ke.fsf@orac.acorntoolworks.com> Message-ID: <4ACF87BC.1060601@candelatech.com> On 10/09/2009 08:30 AM, J.T. Conklin wrote: > Hi, > > I have a change to the SCons command line option/variable processing > pending in my workspace that should be ready to commit this weekend, > and I wanted to give everyone the heads up (and a chance to voice > objections). This all sounds fine to me. One small gripe about scons in general: I liked the old ./configure method because you figured out your configuration once, and then all you had to do was type 'make' and not remember all of your options each build. I wonder if you could set up scons to do something like: scons config foo=bar blah=baz ... This would write out a small config file with the supplied options. Then, when you run 'scons', it would read the config file if it exists and use that configuration.
Thanks, Ben -- Ben Greear Candela Technologies Inc http://www.candelatech.com From lizhaous2000 at yahoo.com Fri Oct 9 13:05:16 2009 From: lizhaous2000 at yahoo.com (Li Zhao) Date: Fri, 9 Oct 2009 13:05:16 -0700 (PDT) Subject: [Xorp-hackers] static xrl interface calls In-Reply-To: <569389.38414.qm@web58706.mail.re1.yahoo.com> Message-ID: <935513.83788.qm@web58707.mail.re1.yahoo.com> Actually this is a generic question. For any new config coming from xorpsh, how are these xrl client functions sent to the target process from rtrmgr? --- On Thu, 10/8/09, Li Zhao wrote: > From: Li Zhao > Subject: [Xorp-hackers] static xrl interface calls > To: xorp-hackers at icir.org > Date: Thursday, October 8, 2009, 11:22 AM > As document said, > XrlStaticRoutesV0p1Client::send_add_route4 is called from > rtrmgr. But actually i do not see that symbol in rtrmgr. > Actually i do not see any process is calling this method. On > the other hand, target call > XrlStaticRoutsNode::static_routes_0_1_add_route4 was called > on xorp_static_routes. I do not know how was this triggered. > Can any body explain to me? Thanks. > > Li > > > > > _______________________________________________ > Xorp-hackers mailing list > Xorp-hackers at icir.org > http://mailman.ICSI.Berkeley.EDU/mailman/listinfo/xorp-hackers > From greearb at candelatech.com Fri Oct 9 14:23:07 2009 From: greearb at candelatech.com (Ben Greear) Date: Fri, 09 Oct 2009 14:23:07 -0700 Subject: [Xorp-hackers] static xrl interface calls In-Reply-To: <935513.83788.qm@web58707.mail.re1.yahoo.com> References: <935513.83788.qm@web58707.mail.re1.yahoo.com> Message-ID: <4ACFA9BB.3030806@candelatech.com> On 10/09/2009 01:05 PM, Li Zhao wrote: > Actually this is a generic question. For any new config coming from xorpsh, how are these xrl client functions sent to the target process from rtrmgr? Search for 'commit'. There is some logic in that code to send updates to modules through xrl commands.
I think programs also talk directly with fea...I don't understand it all that well myself at this time. Thanks, Ben > > --- On Thu, 10/8/09, Li Zhao wrote: > >> From: Li Zhao >> Subject: [Xorp-hackers] static xrl interface calls >> To: xorp-hackers at icir.org >> Date: Thursday, October 8, 2009, 11:22 AM >> As document said, >> XrlStaticRoutesV0p1Client::send_add_route4 is called from >> rtrmgr. But actually i do not see that symbol in rtrmgr. >> Actually i do not see any process is calling this method. On >> the other hand, target call >> XrlStaticRoutsNode::static_routes_0_1_add_route4 was called >> on xorp_static_routes. I do not know how was this triggered. >> Can any body explain to me? Thanks. >> >> Li >> >> >> >> >> _______________________________________________ >> Xorp-hackers mailing list >> Xorp-hackers at icir.org >> http://mailman.ICSI.Berkeley.EDU/mailman/listinfo/xorp-hackers >> > > > > > _______________________________________________ > Xorp-hackers mailing list > Xorp-hackers at icir.org > http://mailman.ICSI.Berkeley.EDU/mailman/listinfo/xorp-hackers -- Ben Greear Candela Technologies Inc http://www.candelatech.com From jtc at acorntoolworks.com Fri Oct 9 19:15:50 2009 From: jtc at acorntoolworks.com (J.T. Conklin) Date: Fri, 09 Oct 2009 19:15:50 -0700 Subject: [Xorp-hackers] Pending SCons configure change In-Reply-To: <4ACF87BC.1060601@candelatech.com> (Ben Greear's message of "Fri, 09 Oct 2009 11:58:04 -0700") References: <87pr8wa8ke.fsf@orac.acorntoolworks.com> <4ACF87BC.1060601@candelatech.com> Message-ID: <87vdio3sfd.fsf@orac.acorntoolworks.com> Ben Greear writes: > This all sounds fine to me. One small gripe about scons in general: > > I liked the old ./configure method because you figured out your > configuration once, and then all you had to do was type 'make' > and not remember all of your options each build. > > I wonder if you could set up scons to do something like: > > scons config foo=bar blah=baz ... 
> > This would write out a small config file with the supplied > options. > > Then, when you run 'scons', it would read the config file > if it exists and use that configuration. Hi Ben, SCons has the ability to cache command line variables that are set via Variables(). Unfortunately, we are still using the older ARGUMENTS array for most, including the new host= and build= variables I'll be introducing in my upcoming patch. There's still cleanup that must be done first, but I hope to convert command line variable processing to use Variables() relatively soon. When done, I'll definitely be adding code to cache variables between scons invocations. --jtc -- J.T. Conklin From lizhaous2000 at yahoo.com Mon Oct 12 07:10:26 2009 From: lizhaous2000 at yahoo.com (Li Zhao) Date: Mon, 12 Oct 2009 07:10:26 -0700 (PDT) Subject: [Xorp-hackers] static xrl interface calls In-Reply-To: <4ACFA9BB.3030806@candelatech.com> Message-ID: <474379.59821.qm@web58708.mail.re1.yahoo.com> I have used gdb and cscope to trace the code flow as follows: commit_changes -> send_apply_config_change -> | rtrmgr_0_1_apply_config_change -> apply_config_change -> change_config -> commit_change_pass1 -> commit_change_pass2 -> commit_changes. But I still cannot find the code in rtrmgr explicitly calling (ANY) xrl interface functions to any target module. On the other hand, the target module did receive STCP I/O and the corresponding target functions were called. I do not think that, in the case of adding a static route, rtrmgr can talk to fea directly. The only puzzle was how on earth rtrmgr called the function xrlStaticRouteV0p1Client::send_add_route4. Thanks for your reply. Li --- On Fri, 10/9/09, Ben Greear wrote: > From: Ben Greear > Subject: Re: [Xorp-hackers] static xrl interface calls > To: "Li Zhao" > Cc: xorp-hackers at icir.org > Date: Friday, October 9, 2009, 5:23 PM > On 10/09/2009 01:05 PM, Li Zhao > wrote: > > Actually this is a generic question.
For any new > config coming from xorpsh, how are these xrl client > functions sent to the target process from rtrmgr? > > Search for 'commit'. There is some logic in that code > to send updates to > modules through xrl commands. > > I think programs also talk directly with fea...I don't > understand it all that well > myself at this time. > > Thanks, > Ben > > > > > --- On Thu, 10/8/09, Li Zhao > wrote: > > > >> From: Li Zhao > >> Subject: [Xorp-hackers] static xrl interface > calls > >> To: xorp-hackers at icir.org > >> Date: Thursday, October 8, 2009, 11:22 AM > >> As document said, > >> XrlStaticRoutesV0p1Client::send_add_route4 is > called from > >> rtrmgr. But actually i do not see that symbol in > rtrmgr. > >> Actually i do not see any process is calling this > method. On > >> the other hand, target call > >> XrlStaticRoutsNode::static_routes_0_1_add_route4 > was called > >> on xorp_static_routes. I do not know how was this > triggered. > >> Can any body explain to me? Thanks. > >> > >> Li > >> > >> > >> > >> > >> _______________________________________________ > >> Xorp-hackers mailing list > >> Xorp-hackers at icir.org > >> http://mailman.ICSI.Berkeley.EDU/mailman/listinfo/xorp-hackers > >> > > > > > > > > > > _______________________________________________ > > Xorp-hackers mailing list > > Xorp-hackers at icir.org > > http://mailman.ICSI.Berkeley.EDU/mailman/listinfo/xorp-hackers > > > -- > Ben Greear > Candela Technologies Inc
http://www.candelatech.com > > From greearb at candelatech.com Mon Oct 12 08:44:23 2009 From: greearb at candelatech.com (Ben Greear) Date: Mon, 12 Oct 2009 08:44:23 -0700 Subject: [Xorp-hackers] static xrl interface calls In-Reply-To: <474379.59821.qm@web58708.mail.re1.yahoo.com> References: <474379.59821.qm@web58708.mail.re1.yahoo.com> Message-ID: <4AD34ED7.4090902@candelatech.com> Li Zhao wrote: > I have used gdb and cscope to trace the code flow as following: > commit_changes -> send_apply_config_change -> | rtrmgr_0_1_apply_config_change ->apply_config_change -> change_config -> commit_change_pass1 -> commit_change_pass2 -> commit_changes. > > But i still can not find the code in rtrmgr explicitly calling (ANY) xrl interface functions to any target module. > On the other hand the target mudule did receive STCP ios and the corresponding target functions were called. > > I do not think in the case of adding static route rtrmgr can talk to fea directly. The only puzzle was how on the earth rtrmgr called the function xrlStaticRouteV0p1Client::send_add_route4. > > Thanks for you reply. > Damn...what complicated code. Just spent an hour trying to follow the commit logic. Anyway, I think it comes down to TaskXrlItem. An entry point to this code might be:
template_commands.cc:
int XrlAction::execute(const MasterConfigTreeNode& ctn, TaskManager& task_manager, XrlRouter::XrlCallback cb) const
called from:
module_command.cc:
void ModuleCommand::add_action(const list& action, const XRLdb& xrldb) throw (ParseError) {
I cannot figure out exactly how this ties back in, but I think all of this must be called from:
master_conf_tree_node.cc:
bool MasterConfigTreeNode::commit_changes(TaskManager& task_manager, bool do_commit, int depth, int last_depth, string& error_msg, bool& needs_activate, bool& needs_update) {
Commands are added directly by some parser, probably of the .xif files or something like that.
Probably would take enabling logging and then reading the logs very carefully to figure out exactly how it actually works. Thanks, Ben -- Ben Greear Candela Technologies Inc http://www.candelatech.com From lizhaous2000 at yahoo.com Mon Oct 12 09:50:31 2009 From: lizhaous2000 at yahoo.com (Li Zhao) Date: Mon, 12 Oct 2009 09:50:31 -0700 (PDT) Subject: [Xorp-hackers] static xrl interface calls In-Reply-To: <4AD34ED7.4090902@candelatech.com> Message-ID: <457624.23473.qm@web58702.mail.re1.yahoo.com> You are right. We are finding the same thing. What I found is that the configuration tree is traversed node by node. Please look at etc/template/static_routes.tp: route @: ipv4net { %create xrl "$(static.targetname)/static_routes/0.1/add_route4?..." What happens is that when the leaf node is processed, the corresponding command calls Command::execute, which in turn calls XrlAction::execute. That adds an xrl to the task manager, so the task manager has a pending action. That is why I cannot see an explicit call to XrlStaticRouteV0p1Client methods. I am now studying how the task manager maps from xrl->_action->_request to the real xrl calls. I am getting much closer now. Thanks. Li --- On Mon, 10/12/09, Ben Greear wrote: > From: Ben Greear > Subject: Re: [Xorp-hackers] static xrl interface calls > To: "Li Zhao" > Cc: xorp-hackers at icir.org > Date: Monday, October 12, 2009, 11:44 AM > Li Zhao wrote: > > I have used gdb and cscope to trace the code flow as > following: > > commit_changes -> send_apply_config_change -> | > rtrmgr_0_1_apply_config_change ->apply_config_change > -> change_config -> commit_change_pass1 -> > commit_change_pass2 -> commit_changes. > > > > But i still can not find the code in rtrmgr explicitly > calling (ANY) xrl interface functions to any target module. > > On the other hand the target module did receive STCP > ios and the corresponding target functions were called.
From lizhaous2000 at yahoo.com Mon Oct 12 12:45:22 2009 From: lizhaous2000 at yahoo.com (Li Zhao) Date: Mon, 12 Oct 2009 12:45:22 -0700 (PDT) Subject: [Xorp-hackers] static xrl interface calls In-Reply-To: <457624.23473.qm@web58702.mail.re1.yahoo.com> Message-ID: <3814.27208.qm@web58703.mail.re1.yahoo.com> The last piece of puzzle was solved. The task which was added to the task manager was a TaskXrlItem.
When this task was fired, the execute method in TaskXrlItem asked _xorp_client to send an unresolved xrl request.

The reason why XrlStaticRoutesV0p1Client methods were not called, I guess, is that rtrmgr needs to use its own task and task manager mechanism. A process which does not have a module, task, or task manager can use XrlStaticRoutesV0p1Client methods directly, and will eventually follow a similar code flow.

Li

--- On Mon, 10/12/09, Li Zhao wrote:

> From: Li Zhao
> Subject: Re: [Xorp-hackers] static xrl interface calls
> To: "Ben Greear"
> Cc: xorp-hackers at icir.org
> Date: Monday, October 12, 2009, 12:50 PM
> You are right. We are finding the same thing.
> What I found is the configure tree node is traversed.
> Please watch etc/template/static_routes.tp:
>
> route @: ipv4net {
>     %create xrl "$(static.targetname)/static_routes/0.1/add_route4?..."
>
> What happened was when the leaf node was processed, the corresponding command will call Command::execute which in turn will call XrlAction::execute. It was adding an xrl to the task manager so the task manager will have a pending action. That is why I can not see an explicit call to XrlStaticRoutesV0p1Client methods.
>
> I am studying now how the task manager is mapping from xrl->_action->_request to the real xrl calls.
>
> I am getting much closer now.
>
> Thanks.
>
> Li
>
>
> --- On Mon, 10/12/09, Ben Greear wrote:
>
> > From: Ben Greear
> > Subject: Re: [Xorp-hackers] static xrl interface calls
> > To: "Li Zhao"
> > Cc: xorp-hackers at icir.org
> > Date: Monday, October 12, 2009, 11:44 AM
> > Li Zhao wrote:
> > > I have used gdb and cscope to trace the code flow as following:
> > > commit_changes -> send_apply_config_change -> | rtrmgr_0_1_apply_config_change ->apply_config_change -> change_config -> commit_change_pass1 -> commit_change_pass2 -> commit_changes.
> > _______________________________________________ > Xorp-hackers mailing list > Xorp-hackers at icir.org > http://mailman.ICSI.Berkeley.EDU/mailman/listinfo/xorp-hackers > From lizhaous2000 at yahoo.com Mon Oct 12 12:52:10 2009 From: lizhaous2000 at yahoo.com (Li Zhao) Date: Mon, 12 Oct 2009 12:52:10 -0700 (PDT) Subject: [Xorp-hackers] static xrl interface calls In-Reply-To: <4ACFA9BB.3030806@candelatech.com> Message-ID: <269824.42514.qm@web58707.mail.re1.yahoo.com> I have used gdb and cscope to trace the code flow as following: commit_changes -> send_apply_config_change -> | rtrmgr_0_1_apply_config_change ->apply_config_change -> change_config -> commit_change_pass1 -> commit_change_pass2 -> commit_changes. But i still can not find the code in rtrmgr explicitly calling (ANY) xrl interface functions to any target module. On the other hand the target mudule did receive STCP ios and the corresponding target functions were called. I do not think in the case of adding static route rtrmgr can talk to fea directly. The only puzzle was how on the earth rtrmgr called the function xrlStaticRouteV0p1Client::send_add_route4. Thanks for you reply. Li --- On Fri, 10/9/09, Ben Greear wrote: > From: Ben Greear > Subject: Re: [Xorp-hackers] static xrl interface calls > To: "Li Zhao" > Cc: xorp-hackers at icir.org > Date: Friday, October 9, 2009, 5:23 PM > On 10/09/2009 01:05 PM, Li Zhao > wrote: > > Actually this is a generic question. For any new > config coming from xorpsh, how are these xrl client > functions sent to the target process from rtrmgr? > > Search for 'commit'.? There is some logic in that code > to send updates to > modules through xrl commands. > > I think programs also talk directly with fea...I don't > understand it all that well > myself at this time. > > Thanks, > Ben > > > > > --- On Thu, 10/8/09, Li Zhao? 
> wrote: > > > >> From: Li Zhao > >> Subject: [Xorp-hackers] static xrl interface > calls > >> To: xorp-hackers at icir.org > >> Date: Thursday, October 8, 2009, 11:22 AM > >> As document said, > >> XrlStaticRoutesV0p1Client::send_add_route4 is > called from > >> rtrmgr. But actually i do not see that symbol in > rtrmgr. > >> Actually i do not see any process is calling this > method. On > >> the other hand, target call > >> XrlStaticRoutsNode::static_routes_0_1_add_route4 > was called > >> on xorp_static_routes. I do not know how was this > triggered. > >> Can any body explain to me? Thanks. > >> > >> Li > >> > >> > >> > >> > >> _______________________________________________ > >> Xorp-hackers mailing list > >> Xorp-hackers at icir.org > >> http://mailman.ICSI.Berkeley.EDU/mailman/listinfo/xorp-hackers > >> > > > > > > > > > > _______________________________________________ > > Xorp-hackers mailing list > > Xorp-hackers at icir.org > > http://mailman.ICSI.Berkeley.EDU/mailman/listinfo/xorp-hackers > > > -- > Ben Greear > Candela Technologies Inc? http://www.candelatech.com > > From lizhaous2000 at yahoo.com Mon Oct 12 12:55:04 2009 From: lizhaous2000 at yahoo.com (Li Zhao) Date: Mon, 12 Oct 2009 12:55:04 -0700 (PDT) Subject: [Xorp-hackers] Fw: Re: static xrl interface calls Message-ID: <338944.44997.qm@web58707.mail.re1.yahoo.com> --- On Mon, 10/12/09, Ben Greear wrote: From: Ben Greear Subject: Re: [Xorp-hackers] static xrl interface calls To: "Li Zhao" Cc: xorp-hackers at icir.org Date: Monday, October 12, 2009, 11:44 AM Li Zhao wrote: > > I have used gdb and cscope to trace the code flow as > following: > > commit_changes -> send_apply_config_change -> | > rtrmgr_0_1_apply_config_change ->apply_config_change > -> change_config -> commit_change_pass1 -> > commit_change_pass2 -> commit_changes. > > > > But i still can not find the code in rtrmgr explicitly > calling (ANY) xrl interface functions to any target module. 
> > On the other hand the target mudule did receive STCP > ios and the corresponding target functions were called. > > > > I do not think in the case of adding static route > rtrmgr can talk to fea directly. The only puzzle was how on > the earth rtrmgr called the function > xrlStaticRouteV0p1Client::send_add_route4. > > > > Thanks for you reply. > >??? > Damn...what complicated code.? Just spent an hours trying to follow the commit logic. Anyway, I think it comes down to TaskXrlItem An entry point to this code might be: template_commands.cc: int XrlAction::execute(const MasterConfigTreeNode& ctn, ? ? ? ? ? TaskManager& task_manager, ? ? ? ? ? XrlRouter::XrlCallback cb) const called from: module_command.cc: void ModuleCommand::add_action(const list& action, const XRLdb& xrldb) ???throw (ParseError) { I cannot figure exactly how this ties back in, but I think all of this must be called from: master_conf_tree_node.cc: bool MasterConfigTreeNode::commit_changes(TaskManager& task_manager, ? ? ? ? ? ? ? ? ? ? bool do_commit, ? ? ? ? ? ? ? ? ? ? int depth, int last_depth, ? ? ? ? ? ? ? ? ? ? string& error_msg, ? ? ? ? ? ? ? ? ? ? bool& needs_activate, ? ? ? ? ? ? ? ? ? ? bool& needs_update) { Commands are added directly by some parser, probably of the .xif files or something like that. Probably would take enabling logging and then reading the logs very carefully to figure out exactly how it actually works. Thanks, Ben -- Ben Greear Candela Technologies Inc? http://www.candelatech.com From lizhaous2000 at yahoo.com Mon Oct 12 12:56:33 2009 From: lizhaous2000 at yahoo.com (Li Zhao) Date: Mon, 12 Oct 2009 12:56:33 -0700 (PDT) Subject: [Xorp-hackers] static xrl interface calls In-Reply-To: <4AD34ED7.4090902@candelatech.com> Message-ID: <285054.67829.qm@web58702.mail.re1.yahoo.com> You are right. We are finding the same thing. What I found is the configure tree node is traversed. 
Please watch etc/template/static_routes.tp: route @: ipv4net { %create xrl "$(static.targetname)/static_routes/0.1/add_route4?..." What happened was when the leaf node was processed, the corresponding command will call Command::execute which in turn will call XrlAction::execute. It was adding an xrl to the task manager so the task manager will have a penfing action. That is why I can not see explicate call to XrlStaticRouteV0p1Client methods. I am studing now how the task manager is mapping from xrl->_action->_request to the real xrl calls. I am getting much closer now. Thanks. Li --- On Mon, 10/12/09, Ben Greear wrote: > From: Ben Greear > Subject: Re: [Xorp-hackers] static xrl interface calls > To: "Li Zhao" > Cc: xorp-hackers at icir.org > Date: Monday, October 12, 2009, 11:44 AM > Li Zhao wrote: > > I have used gdb and cscope to trace the code flow as > following: > > commit_changes -> send_apply_config_change -> | > rtrmgr_0_1_apply_config_change ->apply_config_change > -> change_config -> commit_change_pass1 -> > commit_change_pass2 -> commit_changes. > > > > But i still can not find the code in rtrmgr explicitly > calling (ANY) xrl interface functions to any target module. > > On the other hand the target mudule did receive STCP > ios and the corresponding target functions were called. > > > > I do not think in the case of adding static route > rtrmgr can talk to fea directly. The only puzzle was how on > the earth rtrmgr called the function > xrlStaticRouteV0p1Client::send_add_route4. > > > > Thanks for you reply. > >??? > > Damn...what complicated code.? Just spent an hours > trying to follow the commit > logic. > > Anyway, I think it comes down to TaskXrlItem > > An entry point to this code might be: > > template_commands.cc: > int > XrlAction::execute(const MasterConfigTreeNode& ctn, > ? ? ? ? ? TaskManager& > task_manager, > ? ? ? ? ? 
XrlRouter::XrlCallback > cb) const > > called from: > module_command.cc: > void > ModuleCommand::add_action(const list& > action, const XRLdb& xrldb) > ???throw (ParseError) > { > > I cannot figure exactly how this ties back in, but I think > all of this must be called from: > > master_conf_tree_node.cc: > bool > MasterConfigTreeNode::commit_changes(TaskManager& > task_manager, > ? ? ? ? ? ? ? ? > ? ? bool do_commit, > ? ? ? ? ? ? ? ? > ? ? int depth, int last_depth, > ? ? ? ? ? ? ? ? > ? ? string& error_msg, > ? ? ? ? ? ? ? ? > ? ? bool& needs_activate, > ? ? ? ? ? ? ? ? > ? ? bool& needs_update) > { > > > Commands are added directly by some parser, probably of the > .xif files or something like that. > > Probably would take enabling logging and then reading the > logs very carefully to figure out > exactly how it actually works. > > Thanks, > Ben > > -- Ben Greear > Candela Technologies Inc? http://www.candelatech.com > > > From lizhaous2000 at yahoo.com Mon Oct 12 12:58:28 2009 From: lizhaous2000 at yahoo.com (Li Zhao) Date: Mon, 12 Oct 2009 12:58:28 -0700 (PDT) Subject: [Xorp-hackers] static xrl interface calls Message-ID: <947168.479.qm@web58705.mail.re1.yahoo.com> The last piece of puzzle was solved. The task which was added to the task manager was a TaskXrlItem. When this task was fired, the execute method in TaskXrlItem was asking _xorp_client to send a unresolved xrl request. The reason why XrlStaticRoutesV0p1Client methods were not called, I guess, was because rtrmgr needs to utilize its task and taskmanager mechanism. If there is another process which does not have moduel, task, or taskmanager, then XrlStaticRoutesV0p1Client methods can be used directly and will have the similar code flow eventually. Li --- On Mon, 10/12/09, Li Zhao wrote: > From: Li Zhao > Subject: Re: [Xorp-hackers] static xrl interface calls > To: "Ben Greear" > Cc: xorp-hackers at icir.org > Date: Monday, October 12, 2009, 12:50 PM > You are right. We are finding the > same thing. 
From greearb at candelatech.com Tue Oct 13 10:15:58 2009 From: greearb at candelatech.com (Ben Greear) Date: Tue, 13 Oct 2009 10:15:58 -0700 Subject: [Xorp-hackers] static xrl interface calls In-Reply-To: <947168.479.qm@web58705.mail.re1.yahoo.com> References: <947168.479.qm@web58705.mail.re1.yahoo.com> Message-ID: <4AD4B5CE.1030602@candelatech.com> On 10/12/2009 12:58 PM, Li Zhao wrote: > The last piece of puzzle was solved. The task which was added to the task manager was a TaskXrlItem. When this task was fired, the execute method in TaskXrlItem was asking _xorp_client to send a unresolved xrl request.
>
> The reason why XrlStaticRoutesV0p1Client methods were not called, I guess, was because rtrmgr needs to utilize its task and taskmanager mechanism. If there is another process which does not have module, task, or taskmanager, then XrlStaticRoutesV0p1Client methods can be used directly and will have the similar code flow eventually.
>
> Li

So, did you get this working? If you have a patch, please post it...

Thanks,
Ben

--
Ben Greear
Candela Technologies Inc http://www.candelatech.com

From lizhaous2000 at yahoo.com Tue Oct 13 11:37:31 2009
From: lizhaous2000 at yahoo.com (Li Zhao)
Date: Tue, 13 Oct 2009 11:37:31 -0700 (PDT)
Subject: [Xorp-hackers] static xrl interface calls
In-Reply-To: <4AD4B5CE.1030602@candelatech.com>
Message-ID: <193113.5326.qm@web58707.mail.re1.yahoo.com>

I am still studying the code, so I have not coded anything yet. What I am working on is a control plane process which will add and delete some special static routes. These static routes can be redistributed by ospf etc. The new daemon will use the xrl interface calls. I do not want this process to talk to rtrmgr, because the config tree structure adds unnecessary complexity. This new process can be started by rtrmgr when rtrmgr starts. Then I want this new process to update the static routes directly in xorp_static_routes. The problem then is how to start xorp_static_routes and the processes it depends on, like fea/fib/policy, and make them work properly with the xrl finder. This is really a pain for me because I only started learning xorp a few weeks ago.

I am wondering if there is a simple API by which a process other than xorpsh can ask rtrmgr to start static_routes.

Another problem: commit takes an awkwardly long time.

Thanks.
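For readers following the thread, the deferred dispatch described earlier (an action is queued while the config tree is walked, and the xrl is only sent when the queued task fires) can be sketched roughly as below. This is a minimal, self-contained illustration and not the rtrmgr code: SketchTaskManager, SketchXrlTask, add_xrl() and run() are hypothetical names, and the sender callback stands in for whatever actually transmits the xrl.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <functional>
#include <string>
#include <utility>

// Hypothetical stand-in for a queued xrl task: it only stores the textual,
// not yet resolved xrl until the task is run.
struct SketchXrlTask {
    std::string unresolved_xrl;
};

// Hypothetical stand-in for a task manager: actions are queued while the
// configuration tree is walked and are dispatched afterwards.
class SketchTaskManager {
public:
    typedef std::function<void(const std::string&)> Sender;

    explicit SketchTaskManager(Sender sender)
        : _sender(std::move(sender))
    {
        assert(_sender != nullptr);
    }

    // Called during the tree walk: the xrl is recorded, nothing is sent.
    void
    add_xrl(const std::string& xrl)
    {
        SketchXrlTask task;
        task.unresolved_xrl = xrl;
        _pending.push_back(task);
    }

    std::size_t
    pending() const
    {
        return _pending.size();
    }

    // Called after the walk: each queued xrl is handed to the sender in
    // the order in which it was added.
    void
    run()
    {
        while (!_pending.empty()) {
            SketchXrlTask task = _pending.front();
            _pending.pop_front();
            _sender(task.unresolved_xrl);
        }
    }

private:
    Sender _sender;
    std::deque<SketchXrlTask> _pending;
};
```

In this sketch, queuing a placeholder string such as "static_routes/static_routes/0.1/add_route4" with add_xrl() sends nothing; the callback only sees the string once run() is called. A design like that would be consistent with a debugger never showing a direct call to a generated client stub.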
--- On Tue, 10/13/09, Ben Greear wrote: > From: Ben Greear > Subject: Re: [Xorp-hackers] static xrl interface calls > To: "Li Zhao" > Cc: xorp-hackers at icir.org > Date: Tuesday, October 13, 2009, 1:15 PM > On 10/12/2009 12:58 PM, Li Zhao > wrote: > > The last piece of puzzle was solved. The task which > was added to the task manager was a TaskXrlItem. When this > task was fired, the execute method in TaskXrlItem was asking > _xorp_client to send a unresolved xrl request. > > > > The reason why XrlStaticRoutesV0p1Client methods were > not called, I guess, was because rtrmgr needs to utilize its > task and taskmanager mechanism. If there is another process > which does not have moduel, task, or taskmanager, then > XrlStaticRoutesV0p1Client methods can be used directly and > will have the similar code flow eventually. > > > > Li > > So, did you get this working?? If you have a patch, > please post it... > > Thanks, > Ben > > -- > Ben Greear > Candela Technologies Inc? http://www.candelatech.com > >
From greearb at candelatech.com Tue Oct 13 11:51:49 2009 From: greearb at candelatech.com (Ben Greear) Date: Tue, 13 Oct 2009 11:51:49 -0700 Subject: [Xorp-hackers] static xrl interface calls In-Reply-To: <193113.5326.qm@web58707.mail.re1.yahoo.com> References: <193113.5326.qm@web58707.mail.re1.yahoo.com> Message-ID: <4AD4CC45.80603@candelatech.com> On 10/13/2009 11:37 AM, Li Zhao wrote: > > I am studying the code so I have not coded anything. What I am working on is to write a control plane process which will add and delete some special static routes. These static routes can be redistributed by ospf etc. The the new daemon will use the xrl interface calls.
I do not want this process to talk to rtrmgr because the config tree structure is adding unnecessary complexity. This new process can be started by rtrmgr when rtrmgr starts. Then I want this new process to update the static routes directly to xorp_static_routes. Then the problem is how to start xorp_static_routes and its depending processes like fea/fib/policy and make them work properly with the xrl finder. This is really a pain for me because I have just started to learn xorp for a few weeks.

Can you just have the control plane process call xorpsh to have it update routes in the existing static-routes logic? I've used xorpsh in a similar manner to update IPs, interfaces, etc. and it has worked reasonably well (after I fixed a lot of bugs with dynamic interfaces!)

> I am thinking if there is a simple API by which a process other than xorpsh can ask rtrmgr to start static_routes.
>
> Another problem. Commit is taking awkwardly long time.

I fixed the commit problem in my tree:

http://www.candelatech.com/oss/xorp-ct.html

I get commit times of about 0.10 to 0.20 seconds now (counting launching xorpsh).

Thanks,
Ben

--
Ben Greear
Candela Technologies Inc http://www.candelatech.com

From lizhaous2000 at yahoo.com Tue Oct 13 12:22:25 2009
From: lizhaous2000 at yahoo.com (Li Zhao)
Date: Tue, 13 Oct 2009 12:22:25 -0700 (PDT)
Subject: [Xorp-hackers] static xrl interface calls
In-Reply-To: <4AD4CC45.80603@candelatech.com>
Message-ID: <267244.59385.qm@web58704.mail.re1.yahoo.com>

That was my first plan. But I did not want the unnecessary complexity related to config control, so I tried to first ask rtrmgr to start static_routes, and then use the channel between my daemon and static_routes directly to update static routes. But a big problem is that if a user uses the xorpsh CLI to "delete protocol static", then my daemon will not only lose the channel to static_routes, which is terminated by the CLI, but will also lose all the static routes installed by my daemon.
Basically xorpsh CLI sessions can not cooperate with my daemon. I am still looking for a good design. --- On Tue, 10/13/09, Ben Greear wrote: > From: Ben Greear > Subject: Re: [Xorp-hackers] static xrl interface calls > To: "Li Zhao" > Cc: xorp-hackers at icir.org > Date: Tuesday, October 13, 2009, 2:51 PM > On 10/13/2009 11:37 AM, Li Zhao > wrote: > > > > I am studying the code so I have not coded anything. > What I am working on is to write a control plane process > which will add and delete some special static routes. These > static routes can be redistributed by ospf etc. The the new > daemon will use the xrl interface calls. I do not want this > process talk to rtrmgr because the config tree structure is > adding unnecessary complixity. This new process can be > started by rtrmgr when rtrmgr starts. Then I want this new > process update the static routes directly to > xorp_static_routes. Then the problem is how to start > xorp_static_routes and its depending processes like > fea/fib/policy and make them working properly with xrl > finder. This is a really a pain for me because I have just > started to learn xorp for a few weeks. > > Can you just have the control plane process call xorpsh to > have it update > routes in the existing static-routes logic?? I've used > xorpsh in similar manner > to update IPs, interfaces, etc and it has worked reasonably > well (after I fixed > a lot of bugs with dynamic interfaces!) > > > I am thinking if there is a simple API by which a > process other than xorpsh can ask rtrmgr to start > static_routes. > > > > Another problem. Commit is taking awkawrdly long > time. > > I fixed the commit problem in my tree: > > http://www.candelatech.com/oss/xorp-ct.html > > I get commit times of about 0.10 to 0.20 seconds now > (counting launching xorpsh). > > Thanks, > Ben > > -- > Ben Greear > Candela Technologies Inc? 
http://www.candelatech.com > > From greearb at candelatech.com Tue Oct 13 13:36:08 2009 From: greearb at candelatech.com (Ben Greear) Date: Tue, 13 Oct 2009 13:36:08 -0700 Subject: [Xorp-hackers] static xrl interface calls In-Reply-To: <267244.59385.qm@web58704.mail.re1.yahoo.com> References: <267244.59385.qm@web58704.mail.re1.yahoo.com> Message-ID: <4AD4E4B8.2030008@candelatech.com> On 10/13/2009 12:22 PM, Li Zhao wrote: > That was my first plan. But I thought I do not want unnecessay complexities related to config control, so I tried to first ask rtrmgr to start static_routes, then use the channel between daemon and static_routes directly to update static routes. But a big problem is that if a user use xorpsh CLI to "delete protocol static", then my daemon will not only lose the channel to static_routes which is terminated by CLI, but also will lose all the static routes installed by my daemon. Basically xorpsh CLI sessions can not cooperate with my daemon. > > I am still looking for a good design. If your daemon communicates to xorp through xorpsh, it seems like it would work OK. A user could always screw something by manually messing with xorpsh (or doing worse things on the linux command-line, for example). Maybe you are worried about concurrent xorpsh usage by your script and a user? I'm not sure how that would work..but I can imagine it being a problem. Thanks, Ben -- Ben Greear Candela Technologies Inc http://www.candelatech.com From lizhaous2000 at yahoo.com Tue Oct 13 18:49:54 2009 From: lizhaous2000 at yahoo.com (Li Zhao) Date: Tue, 13 Oct 2009 18:49:54 -0700 (PDT) Subject: [Xorp-hackers] static xrl interface calls In-Reply-To: <4AD4E4B8.2030008@candelatech.com> Message-ID: <375084.43781.qm@web58703.mail.re1.yahoo.com> Basically I am adding a new application process to the xorp linux router. That application requires xorp_static_routes running and it periodically updates the static routes through xrl interface API. 
Because it is a router, an administrator can easily configure CLI via command "delete protocol static" and it will end up with terminating xorp_static_routes and removing static routes from rib. --- On Tue, 10/13/09, Ben Greear wrote: > From: Ben Greear > Subject: Re: [Xorp-hackers] static xrl interface calls > To: "Li Zhao" > Cc: xorp-hackers at icir.org > Date: Tuesday, October 13, 2009, 4:36 PM > On 10/13/2009 12:22 PM, Li Zhao > wrote: > > That was my first plan. But I thought I do not want > unnecessay complexities related to config control, so I > tried to first ask rtrmgr to start static_routes, then use > the channel between daemon and static_routes directly to > update static routes. But a big problem is that if a user > use xorpsh CLI to "delete protocol static", then my daemon > will not only lose the channel to static_routes which is > terminated by CLI, but also will lose all the static routes > installed by my daemon. Basically xorpsh CLI sessions can > not cooperate with my daemon. > > > > I am still looking for a good design. > > If your daemon communicates to xorp through xorpsh, it > seems like it would work OK. > > A user could always screw something by manually messing > with xorpsh (or > doing worse things on the linux command-line, for > example). > > Maybe you are worried about concurrent xorpsh usage by your > script and > a user?? I'm not sure how that would work..but I can > imagine it being > a problem. > > Thanks, > Ben > > -- > Ben Greear > Candela Technologies Inc? http://www.candelatech.com > > From globalcouchsurfer at gmail.com Mon Oct 19 03:14:52 2009 From: globalcouchsurfer at gmail.com (CouchSurfer) Date: Mon, 19 Oct 2009 11:14:52 +0100 Subject: [Xorp-hackers] IPv4 Message-ID: <2148c4de0910190314m76ae69c1k6053e6a67bb37bf0@mail.gmail.com> Hi guys, Just wanted to say that I tried to disable IPv6 when compiling xorp on a CentOS box and it produced a number of ipv6 error messages. 
Thanks From globalcouchsurfer at gmail.com Mon Oct 19 03:23:07 2009 From: globalcouchsurfer at gmail.com (CouchSurfer) Date: Mon, 19 Oct 2009 11:23:07 +0100 Subject: [Xorp-hackers] vlan Config Message-ID: <2148c4de0910190323r50407e2ct35b2c247fb95c859@mail.gmail.com> I am having trouble configuring vlan interfaces - or indeed vifs in general - in xorp. For example, if I have something like

interfaces {
    interface eth0 {
        vif eth0 {
            ...
        }
        vif xxx {
            [vlan {
                vlan-id: yyy
            }]
            ...
        }
    }
}

I get an error saying "cannot create interface eth0/xxx" regardless of how I name the vif. (I have tried both including and excluding the clause inside the square brackets.) The only way I can get around this is to create the interface manually in the system (such as eth0.20 etc) and use it in the config file without the vlan clause. Can anyone tell me whether doing the above would affect vlan tagging? Thanks From globalcouchsurfer at gmail.com Mon Oct 19 03:37:22 2009 From: globalcouchsurfer at gmail.com (CouchSurfer) Date: Mon, 19 Oct 2009 11:37:22 +0100 Subject: [Xorp-hackers] BGP Config Message-ID: <2148c4de0910190337y5d1c39bctb3d8199c1b24373b@mail.gmail.com> I am having a problem with my BGP configuration. So far I have configured the basic essentials such as AS numbers, peer IPs, next-hop and ipv4-unicast settings. On running xorp with this configuration, I can see the routes from my BGP peers. However, my own routes are apparently not being distributed. Basically, I want to (in cisco terms) redistribute my static (and connected) routes. I have created protocols->static. I have also created two policies (and applied them as export and import respectively) as follows:

policy
    policy-statement "to_bgp"
        term 0
            from protocol: static
            then accept
        term 1
            from protocol: bgp
            then accept
    policy-statement "from_bgp"
        term 0
            from protocol: static
            then accept
        term 1
            from
            then accept

However, my routes are still not being distributed. I was wondering if anyone can help me on this matter.
Thanks From bms at incunabulum.net Mon Oct 19 07:59:36 2009 From: bms at incunabulum.net (Bruce Simpson) Date: Mon, 19 Oct 2009 15:59:36 +0100 Subject: [Xorp-hackers] PATCH: Logging improvements, fix artificial deal for xorpsh commit. In-Reply-To: <4ACCC6F7.2020404@candelatech.com> References: <4ACA7A19.30909@candelatech.com> <4ACB3B92.3050505@incunabulum.net> <4ACB84B2.2010909@candelatech.com> <4ACC5300.3010703@incunabulum.net> <4ACCC6F7.2020404@candelatech.com> Message-ID: <4ADC7ED8.4020109@incunabulum.net> Ben Greear wrote: > ... >> >> %j and intmax_t is ISO C99 portable. It sucks because it means casting >> to the widest integer type on the platform, but it's a known quantity. >> 'long long' has been a problem since well before Sun brought out >> SPARCV9. > > From MS's page, they may not support %j (or %ll for that matter). Maybe > the just don't document it: There are a number of places where MS don't fully comply with the ISO C99 spec in either their CL.EXE compiler or the runtime library MSVCRT.DLL, this is but one of them. They have made more progress towards this in MSVC 7 and 8, but it's still far from ideal. The snprintf() behaviour took a bit of hacking to track down in the textual XRL code. I'd still be much happier if intmax_t is used, because it's a portable code construct. cheers, BMS From bms at incunabulum.net Mon Oct 19 08:04:55 2009 From: bms at incunabulum.net (Bruce Simpson) Date: Mon, 19 Oct 2009 16:04:55 +0100 Subject: [Xorp-hackers] PATCH: Allow delayed start of PIM vif In-Reply-To: <4ACD6B16.6080500@candelatech.com> References: <4ACD6B16.6080500@candelatech.com> Message-ID: <4ADC8017.9010507@incunabulum.net> Thanks for the patch. If you can preserve existing code style, then it's more likely changes can be taken as-is (i.e. don't use camelCase if possible, opening brace of {} block on separate line for methods, etc). I'd probably call the flag 'start_is_pending'. 
What I'm likely to do, when I return (I'm catching up on email now, although I'm still on my break, and might have some social stuff going on when I return to London) is to flag patches for possible future inclusion. I really need to finish what I've started with XRL; it's probably easier to deal with stuff like this as a sweep during a 1.7-RC. thanks, BMS From bms at incunabulum.net Mon Oct 19 08:06:51 2009 From: bms at incunabulum.net (Bruce Simpson) Date: Mon, 19 Oct 2009 16:06:51 +0100 Subject: [Xorp-hackers] static xrl interface calls In-Reply-To: <935513.83788.qm@web58707.mail.re1.yahoo.com> References: <935513.83788.qm@web58707.mail.re1.yahoo.com> Message-ID: <4ADC808B.5020302@incunabulum.net> Li Zhao wrote: > Actually this is a generic question. For any new config coming from xorpsh, how are these xrl client functions sent to the target process from rtrmgr? > The Router Manager uses the textual Finder protocol to make indirect XRL method calls, as it parses the configuration tree; it does not use the C++ bindings directly. Please see the '*.xrls' files generated as part of the XRL stubs. thanks, BMS From bms at incunabulum.net Mon Oct 19 08:32:52 2009 From: bms at incunabulum.net (Bruce Simpson) Date: Mon, 19 Oct 2009 16:32:52 +0100 Subject: [Xorp-hackers] IPv4 In-Reply-To: <2148c4de0910190314m76ae69c1k6053e6a67bb37bf0@mail.gmail.com> References: <2148c4de0910190314m76ae69c1k6053e6a67bb37bf0@mail.gmail.com> Message-ID: <4ADC86A4.2020609@incunabulum.net> CouchSurfer wrote: > Hi guys, > > Just wanted to say that I tried to disable IPv6 when compiling xorp on > a CentOS box and it produced a number of ipv6 error messages. > Patches were recently committed to the tree to fix the IPv6 build, please try updating your SVN sources. If this does not resolve the issue, can you please raise a Trac ticket on Sourceforge about this issue and someone can try to look at it during the 1.7-RC? Thanks. 
regards, BMS From bms at incunabulum.net Mon Oct 19 08:35:09 2009 From: bms at incunabulum.net (Bruce Simpson) Date: Mon, 19 Oct 2009 16:35:09 +0100 Subject: [Xorp-hackers] Pending SCons configure change In-Reply-To: <87pr8wa8ke.fsf@orac.acorntoolworks.com> References: <87pr8wa8ke.fsf@orac.acorntoolworks.com> Message-ID: <4ADC872D.3070003@incunabulum.net> J.T. Conklin wrote: > The default build directory will now be obj/. Since host will > now be the standard GNU system triple, this may result in a rebuild > and a new object directory (orphaning any objdirs with the old name). > I like this change, thanks for committing it. It does make us dependent on a POSIX shell, but since we've pretty much ditched Windows backwards compatibility, that's fine. regards, BMS From lizhaous2000 at yahoo.com Mon Oct 19 09:57:08 2009 From: lizhaous2000 at yahoo.com (Li Zhao) Date: Mon, 19 Oct 2009 09:57:08 -0700 (PDT) Subject: [Xorp-hackers] static xrl interface calls In-Reply-To: <4ADC808B.5020302@incunabulum.net> Message-ID: <909406.86958.qm@web58703.mail.re1.yahoo.com> Thanks for the reply. I have coded my prototype protocol process. There are two things I am still working on. First, starting the modules it depends on takes a long time. Second, if static routes has a dependent module, I don't want the CLI to delete xorp_static_routes. The C++ xrl interface functions are working fine. My process can use them to talk directly to static routes and update the routes. --- On Mon, 10/19/09, Bruce Simpson wrote: > From: Bruce Simpson > Subject: Re: [Xorp-hackers] static xrl interface calls > To: "Li Zhao" > Cc: xorp-hackers at icir.org > Date: Monday, October 19, 2009, 11:06 AM > Li Zhao wrote: > > Actually this is a generic question. For any new > config coming from xorpsh, how are these xrl client > functions sent to the target process from rtrmgr? > >
> > The Router Manager uses the textual Finder protocol to make > indirect XRL method calls, as it parses the configuration > tree; it does not use the C++ bindings directly. Please see > the '*.xrls' files generated as part of the XRL stubs. > > thanks, > BMS > > From greearb at candelatech.com Mon Oct 19 10:29:38 2009 From: greearb at candelatech.com (Ben Greear) Date: Mon, 19 Oct 2009 10:29:38 -0700 Subject: [Xorp-hackers] static xrl interface calls In-Reply-To: <909406.86958.qm@web58703.mail.re1.yahoo.com> References: <909406.86958.qm@web58703.mail.re1.yahoo.com> Message-ID: <4ADCA202.3040509@candelatech.com> On 10/19/2009 09:57 AM, Li Zhao wrote: > Thanks for the reply. I have coded my prototype protocol process. Two things I am still working on. In order to start dependended modules, it takes a long time. Sencond, it static routes is having a depending nodule, I dont want cli to delete xorp_static_routes. C++ xrl interface functions are working fine. My process can use them directly talking to static routes to update the routes. I also have patches in my tree to start up modules quicker...(removes a 2-second sleep for each module, basically). But, since this is a one-time cost, it shouldn't be too bad even w/out the patch? Thanks, Ben > > --- On Mon, 10/19/09, Bruce Simpson wrote: > >> From: Bruce Simpson >> Subject: Re: [Xorp-hackers] static xrl interface calls >> To: "Li Zhao" >> Cc: xorp-hackers at icir.org >> Date: Monday, October 19, 2009, 11:06 AM >> Li Zhao wrote: >>> Actually this is a generic question. For any new >> config coming from xorpsh, how are these xrl client >> functions sent to the target process from rtrmgr? >>> >> >> The Router Manager uses the textual Finder protocol to make >> indirect XRL method calls, as it parses the configuration >> tree; it does not use the C++ bindings directly. Please see >> the '*.xrls' files generated as part of the XRL stubs. 
>> >> thanks, >> BMS >> >> > > > > > _______________________________________________ > Xorp-hackers mailing list > Xorp-hackers at icir.org > http://mailman.ICSI.Berkeley.EDU/mailman/listinfo/xorp-hackers -- Ben Greear Candela Technologies Inc http://www.candelatech.com From greearb at candelatech.com Mon Oct 19 14:04:50 2009 From: greearb at candelatech.com (Ben Greear) Date: Mon, 19 Oct 2009 14:04:50 -0700 Subject: [Xorp-hackers] PATCH: Allow delayed start of PIM vif In-Reply-To: <4ADC8017.9010507@incunabulum.net> References: <4ACD6B16.6080500@candelatech.com> <4ADC8017.9010507@incunabulum.net> Message-ID: <4ADCD472.2020203@candelatech.com> On 10/19/2009 08:04 AM, Bruce Simpson wrote: > Thanks for the patch. > > If you can preserve existing code style, then it's more likely changes > can be taken as-is (i.e. don't use camelCase if possible, opening brace > of {} block on separate line for methods, etc). I'd probably call the > flag 'start_is_pending'. > > What I'm likely to do, when I return (I'm catching up on email now, > although I'm still on my break, and might have some social stuff going > on when I return to London) is to flag patches for possible future > inclusion. I really need to finish what I've started with XRL; it's > probably easier to deal with stuff like this as a sweep during a 1.7-RC. I can change the coding style, but this particular patch is useless without a bunch of other fixes relating to transient interfaces, since those hit before this one would be noticeable. Probably best to wait until the next dev cycle when we can work towards integrating more of my changes. With regard to XRL, I've a question: If an application makes 3 XRL calls: do_a() do_b() commit_all() Is there any guarantee that these are strictly delivered to the peer process in the order called? Code appears to expect this to be true, but I'm suspicious that perhaps it does not. 
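If strict ordering turns out not to be guaranteed, one defensive pattern is to issue each call only from the completion callback of the previous one. A minimal sketch, assuming a generic asynchronous sender; OrderedCalls, AsyncSend and Done are hypothetical names for illustration and are not part of libxipc:

```cpp
#include <deque>
#include <functional>
#include <string>
#include <utility>

// Hypothetical completion callback and asynchronous sender types.
using Done = std::function<void(bool ok)>;
using AsyncSend = std::function<void(const std::string& name, Done done)>;

// Issues queued calls strictly one at a time: call N+1 is sent only
// after the completion callback of call N has reported success.
class OrderedCalls {
public:
    explicit OrderedCalls(AsyncSend send) : _send(std::move(send)) {}

    void enqueue(const std::string& name) { _pending.push_back(name); }

    // Start draining the queue; all_done(false) is reported on the
    // first failure, all_done(true) once every call has completed.
    void start(Done all_done)
    {
        _all_done = std::move(all_done);
        send_next();
    }

private:
    void send_next()
    {
        if (_pending.empty()) {
            _all_done(true);
            return;
        }
        std::string name = _pending.front();
        _pending.pop_front();
        _send(name, [this](bool ok) {
            if (!ok) {
                _all_done(false);   // stop; later calls are never sent
                return;
            }
            send_next();
        });
    }

    AsyncSend _send;
    std::deque<std::string> _pending;
    Done _all_done;
};
```

With do_a, do_b and commit_all queued in that order, commit_all cannot reach the peer before do_b has completed, at the cost of one round trip per call.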
Thanks, Ben -- Ben Greear Candela Technologies Inc http://www.candelatech.com From lizhaous2000 at yahoo.com Tue Oct 20 05:53:03 2009 From: lizhaous2000 at yahoo.com (Li Zhao) Date: Tue, 20 Oct 2009 05:53:03 -0700 (PDT) Subject: [Xorp-hackers] static xrl interface calls In-Reply-To: <4ADCA202.3040509@candelatech.com> Message-ID: <442246.87308.qm@web58702.mail.re1.yahoo.com> If we pick 2 seconds as the sleep time, that might not be a good idea. --- On Mon, 10/19/09, Ben Greear wrote: > From: Ben Greear > Subject: Re: [Xorp-hackers] static xrl interface calls > To: "Li Zhao" > Cc: "Bruce Simpson" , xorp-hackers at icir.org > Date: Monday, October 19, 2009, 1:29 PM > On 10/19/2009 09:57 AM, Li Zhao > wrote: > > Thanks for the reply. I have coded my prototype > protocol process. Two things I am still working on. In order > to start dependent modules, it takes a long time. Second, > if static routes is having a depending module, I don't want > cli to delete xorp_static_routes. C++ xrl interface > functions are working fine. My process can use them directly > talking to static routes to update the routes. > > I also have patches in my tree to start up modules > quicker...(removes a 2-second sleep for each module, > basically). > > But, since this is a one-time cost, it shouldn't be too bad > even w/out the patch? > > Thanks, > Ben > > > > > > --- On Mon, 10/19/09, Bruce Simpson > wrote: > > > >> From: Bruce Simpson > >> Subject: Re: [Xorp-hackers] static xrl interface > calls > >> To: "Li Zhao" > >> Cc: xorp-hackers at icir.org > >> Date: Monday, October 19, 2009, 11:06 AM > >> Li Zhao wrote: > >>> Actually this is a generic question. For any > new > >> config coming from xorpsh, how are these xrl > client > >> functions sent to the target process from rtrmgr? > >>> > >> > >> The Router Manager uses the textual Finder > protocol to make > >> indirect XRL method calls, as it parses the > configuration > >> tree; it does not use the C++ bindings directly. 
> Please see > >> the '*.xrls' files generated as part of the XRL > stubs. > >> > >> thanks, > >> BMS > >> > >> > > > > > > > > > > _______________________________________________ > > Xorp-hackers mailing list > > Xorp-hackers at icir.org > > http://mailman.ICSI.Berkeley.EDU/mailman/listinfo/xorp-hackers > > > -- > Ben Greear > Candela Technologies Inc? http://www.candelatech.com > > From greearb at candelatech.com Tue Oct 20 08:26:20 2009 From: greearb at candelatech.com (Ben Greear) Date: Tue, 20 Oct 2009 08:26:20 -0700 Subject: [Xorp-hackers] static xrl interface calls In-Reply-To: <442246.87308.qm@web58702.mail.re1.yahoo.com> References: <442246.87308.qm@web58702.mail.re1.yahoo.com> Message-ID: <4ADDD69C.4080909@candelatech.com> Li Zhao wrote: > If we pick 2 second as sleep time, that might not a good idea. > I managed to remove it entirely in my tree...with no bad effects so far, but it requires a relatively large amount of (simple) changes. Ben -- Ben Greear Candela Technologies Inc http://www.candelatech.com From lizhaous2000 at yahoo.com Wed Oct 21 11:49:10 2009 From: lizhaous2000 at yahoo.com (Li Zhao) Date: Wed, 21 Oct 2009 11:49:10 -0700 (PDT) Subject: [Xorp-hackers] set_command_map Message-ID: <472319.96385.qm@web58708.mail.re1.yahoo.com> set_command_map was only invoked in three files: test_fea_rawlink.cc, test_xrl_sockets4_tcp.cc and test_xrl_sockets4_udp.cc. It is interesting to see that: in these three test_main functions: there is no wait_until_xrl_router_is_ready called. But magically they are working just fine. I have a process which has a class implemented interface socket4_user/0.1. I have wait_until_xrl_router_is_ready. But if I leave out set_command_map, then send_bind and send_listen does not really work properly. That is the packets delivered to the sockets are not passed to my implemented function socket4_user_0_1_inbound_connect_event or socket4_user_0_1_recv_event. But in olsr4 and rip, they dont have set_command_map. 
Maybe because they do not register send_bind or send_listen? From bms at incunabulum.net Thu Oct 22 04:21:48 2009 From: bms at incunabulum.net (Bruce Simpson) Date: Thu, 22 Oct 2009 12:21:48 +0100 Subject: [Xorp-hackers] set_command_map In-Reply-To: <472319.96385.qm@web58708.mail.re1.yahoo.com> References: <472319.96385.qm@web58708.mail.re1.yahoo.com> Message-ID: <4AE0404C.4010209@incunabulum.net> Li Zhao wrote: > I have a process which has a class implemented interface socket4_user/0.1. > I have wait_until_xrl_router_is_ready. But if I leave out set_command_map, then send_bind and send_listen does not really work properly. That is the packets delivered to the sockets are not passed to my implemented function > socket4_user_0_1_inbound_connect_event or socket4_user_0_1_recv_event. > The command map is used implicitly by the XRL target stubs.
All instances of XrlRouter embed a default implementation of it, which just shims to the basic functionality required by any process speaking XRL internally. Normally set_command_map() is called 'behind the scenes' by the XRL target stub, and there's no need to override it. However, if you are making an XRL endpoint look like a target on the fly, or need to switch between multiple XRL target implementations *in the same process*, you will need to call this method directly. From bms at incunabulum.net Tue Oct 27 08:53:47 2009 From: bms at incunabulum.net (Bruce Simpson) Date: Tue, 27 Oct 2009 15:53:47 +0000 Subject: [Xorp-hackers] Omitting XrlDB from Router Manager Message-ID: <4AE7178B.9000709@incunabulum.net> Hi all, I'm still looking at the XRL replacement since I got back from holiday, which is why I've been mostly silent on lists. Something came up in analysis, which broadly relates to Ben Greear's work on reducing Router Manager startup times, etc. and some of the questions Li Zhao has been asking in other threads on this list.

@Ben: It would be interesting to know what difference omitting the XRLDB code makes to your Router Manager startup times.
* The XRLDB seems to exist pretty much to validate what's in the template files and how the Router Manager uses them, although this is done completely at run time.
* I wonder if disabling this code would make a difference to performance.
* To do this, I'd hack rtrmgr/template_commands.cc, and comment out the calls to the XRLdb methods.
* The rtrmgr/xrldb.cc is the only place in the whole system where the '*.xrls' files are parsed and used. They are used only to validate the syntax and structure of potential XRL method calls.
* It would mean that there is no up-front validation of the XRLs, but in practice, this validation step is probably only of interest to people developing XORP, to catch problems with template files.
* It's probably best folded under a compile-time #define for developer use.
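For the last bullet above, folding the validation step under a compile-time flag might look roughly like the sketch below. All names here (DEBUG_XRLDB_VALIDATION, KnownXrls, check_template_xrl) are hypothetical stand-ins for illustration; this is not the actual rtrmgr or XrlDB code.

```cpp
#include <set>
#include <string>

// Hypothetical build flag; a developer build would pass
// -DDEBUG_XRLDB_VALIDATION (or uncomment the line below).
// #define DEBUG_XRLDB_VALIDATION

// Stand-in for the set of XRL signatures loaded from the '*.xrls' files.
struct KnownXrls {
    std::set<std::string> signatures;

    bool contains(const std::string& s) const
    {
        return signatures.count(s) != 0;
    }
};

// Returns true if a template action may use the given XRL.
// Without the flag, the up-front check is compiled out entirely and
// every XRL named in a template file is trusted.
bool
check_template_xrl(const KnownXrls& db, const std::string& xrl)
{
#ifdef DEBUG_XRLDB_VALIDATION
    return db.contains(xrl);
#else
    (void)db;
    (void)xrl;
    return true;
#endif
}
```

In a production build the '*.xrls' files would then not need to be parsed at startup at all, which is where any saving in Router Manager startup time would come from.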
@Li: You were looking for information on how XRLs are sent by the Router Manager to the XORP routing processes.
* I've been looking at this code with a view to replacement.
* This uses an indirect method call and lookup from the finder:// XRLs in the *.xrls files.
* Implementing Thrift directly affects the Router Manager: in particular, the core functionality which configures processes by sending XRLs to them, in rtrmgr/template_commands.cc, class XrlAction.
* In any event, because the Router Manager is trying to do method calls without an IDL or C++ stubs, using the textual Finder protocol, a different mechanism would be needed in Thrift.
* The '*.tp' template files explicitly identify all argument and result types used when configuring a XORP process via an XRL. If these are correct, then additional validation shouldn't be needed.
* Therefore: it's possible to construct a binary blob at runtime, using exactly the same techniques as in the clnt-gen Thrifted code generator.

cheers, BMS From greearb at candelatech.com Tue Oct 27 15:48:09 2009 From: greearb at candelatech.com (Ben Greear) Date: Tue, 27 Oct 2009 15:48:09 -0700 Subject: [Xorp-hackers] oprofile reports Message-ID: <4AE778A9.3020201@candelatech.com> I'm running 100 xorp instances on a dual-quad core system (E5530, 2.4Ghz) I let them run a while, which tends to hide the xrl stuff that is mainly a startup cost. This is my patched tree, by the way. FEA is the top entry on the entire OS (but, there are 100 FEAs running, so that's not un-expected) At least in my code tree, the get_ready_priority is a couple of for loops..and could probably be optimized, to better ignore fds that are not in use.
samples  %        image name  app name  symbol name
-------------------------------------------------------------------------------
  22        0.0766  xorp_fea  xorp_fea  EventLoop::do_work(bool)
  28687    99.9234  xorp_fea  xorp_fea  SelectorList::wait_and_dispatch(TimeVal& )
27759     6.5897  xorp_fea  xorp_fea  SelectorList::get_ready_priority(bool)
  27759    99.6196  xorp_fea  xorp_fea  SelectorList::get_ready_priority(bool) [self]
  54        0.1938  xorp_fea  xorp_fea  SelectorList::do_select(timeval*, bool)
  52        0.1866  xorp_fea  xorp_fea  std::vector >::operator[](unsigned long)

Nothing else really stands out, except that we are probably creating and deleting a lot of strings (or something else with an underlying vector in it), which calls memset. I can't tell from oprofile what the call chain for the memset usage is though, will look for other ways to get at that... Thanks, Ben -- Ben Greear Candela Technologies Inc http://www.candelatech.com From greearb at candelatech.com Tue Oct 27 18:09:53 2009 From: greearb at candelatech.com (Ben Greear) Date: Tue, 27 Oct 2009 18:09:53 -0700 Subject: [Xorp-hackers] Crash due to stale cached xrl sender pointer. Message-ID: <4AE799E1.8010300@candelatech.com> XRL caches a pointer to the resolved_sender, but when something deletes a sender, it doesn't appear to clean up any existing XRLs. This leads to a crash on a highly loaded system (where senders must be timing out or something like that). Looks like a good place for smart pointers. I'm going to attempt that unless someone has another idea... Thanks, Ben

XrlPFSender*
XrlRouter::get_sender(const Xrl& xrl, FinderDBEntry* dbe)
{
    const Xrl& x = dbe->xrls().front();
    XrlPFSender* s = NULL;

    // Use the cache pointer to the sender.
    if (xrl.resolved()) {
        s = xrl.resolved_sender();
>>> CRASH HERE, s is pointing to bogus memory..probably deleted and scribbled upon:
        if (s->alive())
            return s;

(gdb) bt
#0 0x00000000005e3eac in XrlRouter::get_sender (this=0x7fff13b9ae30, xrl=@0x18ca130, dbe=0x18caa38) at libxipc/xrl_router.cc:424
#1 0x00000000005e39f1 in XrlRouter::send_resolved (this=0x7fff13b9ae30, xrl=@0x18ca130, dbe=0x18caa38, cb=@0x7fff13b99730, direct_call=true) at libxipc/xrl_router.cc:391
#2 0x00000000005e4784 in XrlRouter::send (this=0x7fff13b9ae30, xrl=@0x18ca130, user_cb=@0x7fff13b99730) at libxipc/xrl_router.cc:630
#3 0x00000000005bdcc2 in XrlRawPacket4V0p1Client::send_register_receiver (this=0x7fff13b997b0, dst_xrl_target_name=0x189a938 "fea", xrl_target_instance_name="ospfv2-4a020f2e6b53955e3362796be672a55e at 127.0.0.1", if_name="17.100.17", vif_name="17.100.17", ip_protocol=@0x7fff13b99808, enable_multicast_loopback=@0x7fff13b99807, cb=@0x7fff13b997c0) at obj/x86_64-linux-public17/xrl/interfaces/fea_rawpkt4_xif.cc:111
#4 0x00000000004a2956 in XrlIO::enable_interface_vif (this=0x7fff13b9abd0, interface="17.100.17", vif="17.100.17") at ospf/xrl_io.cc:215
#5 0x0000000000420f73 in Ospf::enable_interface_vif (this=0x7fff13b9a8a0, interface="17.100.17", vif="17.100.17") at ospf/ospf.cc:130
#6 0x000000000045e578 in PeerOut::start_receiving_packets (this=0x18cdb50) at ospf/peer.cc:635
#7 0x000000000045ead4 in PeerOut::bring_up_peering (this=0x18cdb50) at ospf/peer.cc:566
#8 0x000000000045c158 in PeerOut::peer_change (this=0x18cdb50) at ospf/peer.cc:316
#9 0x000000000045c032 in PeerOut::set_link_status (this=0x18cdb50, state=true) at ospf/peer.cc:297
#10 0x000000000043ae82 in PeerManager::vif_status_change (this=0x7fff13b9a978, interface="17.100.17", vif="17.100.17", state=true) at ospf/peer_manager.cc:789
#11 0x000000000045670e in XorpMemberCallback3B0, std::string const&, std::string const&, bool>::dispatch (this=0x18c4450, a1="17.100.17", a2="17.100.17", a3=true) at ./libxorp/callback_nodebug.hh:6801
#12 0x00000000004a4ea0 in XrlIO::updates_made (this=0x7fff13b9abd0) at ospf/xrl_io.cc:1259
#13 0x0000000000596fa9 in IfMgrXrlMirror::do_updates (this=0x7fff13b9aca8) at libfeaclient/ifmgr_xrl_mirror.cc:1168
#14 0x0000000000596e21 in IfMgrXrlMirror::updates_made (this=0x7fff13b9aca8) at libfeaclient/ifmgr_xrl_mirror.cc:1145
#15 0x000000000059540e in IfMgrXrlMirrorTarget::fea_ifmgr_mirror_0_1_hint_updates_made (this=0x18b3c00) at libfeaclient/ifmgr_xrl_mirror.cc:927
#16 0x00000000005cae6a in XrlFeaIfmgrMirrorTargetBase::handle_fea_ifmgr_mirror_0_1_hint_updates_made (this=0x18b3c00, xa_inputs=@0x18bae28) at obj/x86_64-linux-public17/xrl/targets/fea_ifmgr_mirror_base.cc:1362
#17 0x00000000005cb57a in XorpMemberCallback2B0::dispatch (this=0x18b5f60, a1=@0x18bae28, a2=0x7fff13b99f80) at ./libxorp/callback_nodebug.hh:4616
#18 0x00000000005f9692 in XrlCmdEntry::dispatch (this=0x18b6008, inputs=@0x18bae28, outputs=0x7fff13b99f80) at libxipc/xrl_cmd_map.hh:44
#19 0x000000000060032c in XrlDispatcher::dispatch_xrl_fast (this=0x18b3420, xi=@0x18bae10, outputs=@0x7fff13b99f80) at libxipc/xrl_dispatcher.cc:83
#20 0x000000000060114a in STCPRequestHandler::do_dispatch (this=0x18bf610, packed_xrl=0x7f2947ae1776 "", packed_xrl_bytes=0, response=@0x7fff13b99f80) at libxipc/xrl_pf_stcp.cc:288
#21 0x0000000000601237 in STCPRequestHandler::dispatch_request (this=0x18bf610, seqno=518, batch=false, packed_xrl=0x7f2947ae171f , packed_xrl_bytes=87) at libxipc/xrl_pf_stcp.cc:300
#22 0x0000000000600dc9 in STCPRequestHandler::read_event (this=0x18bf610, ev=BufferedAsyncReader::DATA, buffer=0x7f2947ae1707 "STCP\1\1", buffer_bytes=111) at libxipc/xrl_pf_stcp.cc:234
#23 0x000000000060a12a in XorpMemberCallback4B0::dispatch (this=0x18b9410, a1=0x18bf620, a2=BufferedAsyncReader::DATA, a3=0x7f2947ae1707 "STCP\1\1", a4=111) at ./libxorp/callback_nodebug.hh:8966
#24 0x0000000000620728 in BufferedAsyncReader::announce_event (this=0x18bf620, ev=BufferedAsyncReader::DATA) at libxorp/buffered_asyncio.cc:261
#25 0x0000000000620600 in BufferedAsyncReader::io_event (this=0x18bf620, fd={_filedesc = 48}, type=IOT_READ) at libxorp/buffered_asyncio.cc:214
#26 0x0000000000620eda in XorpMemberCallback2B0::dispatch (this=0x18bde50, a1={_filedesc = 48}, a2=IOT_READ) at ./libxorp/callback_nodebug.hh:4636
#27 0x0000000000634b46 in SelectorList::Node::run_hooks (this=0x1885990, m=SEL_RD, fd={_filedesc = 48}) at libxorp/selector.cc:200
#28 0x0000000000634004 in SelectorList::wait_and_dispatch (this=0x7fff13b9a540, timeout=@0x7fff13b9a320) at libxorp/selector.cc:523
#29 0x0000000000622be9 in EventLoop::do_work (this=0x7fff13b9a3b0, can_block=true) at libxorp/eventloop.cc:147
#30 0x0000000000622a7e in EventLoop::run (this=0x7fff13b9a3b0) at libxorp/eventloop.cc:100
#31 0x000000000040514a in main (argv=0x7fff13b9b098) at ospf/xorp_ospfv2.cc:77
(gdb)

-- Ben Greear Candela Technologies Inc http://www.candelatech.com From greearb at candelatech.com Tue Oct 27 22:14:03 2009 From: greearb at candelatech.com (Ben Greear) Date: Tue, 27 Oct 2009 22:14:03 -0700 Subject: [Xorp-hackers] Crash due to stale cached xrl sender pointer. In-Reply-To: <4AE799E1.8010300@candelatech.com> References: <4AE799E1.8010300@candelatech.com> Message-ID: <4AE7D31B.1040107@candelatech.com> Ben Greear wrote: > XRL caches a pointer to the resolved_sender, but when something > deletes a sender, it doesn't appear to clean up any existing XRLs. > This leads to a crash on a highly loaded system (where senders must be > timing out > or something like that). > > Looks like a good place for smart pointers. I'm going to attempt that > unless > someone has another idea... > The attached patch seems to fix the problem. Thanks, Ben -- Ben Greear Candela Technologies Inc http://www.candelatech.com -------------- next part -------------- A non-text attachment was scrubbed...
Name: xorp_xrlsender_ref_ptr.patch Type: text/x-patch Size: 13637 bytes Desc: not available Url : http://mailman.ICSI.Berkeley.EDU/pipermail/xorp-hackers/attachments/20091027/d06b4c87/attachment.bin From greearb at candelatech.com Wed Oct 28 15:11:40 2009 From: greearb at candelatech.com (Ben Greear) Date: Wed, 28 Oct 2009 15:11:40 -0700 Subject: [Xorp-hackers] OLSR: Fix olsr/tools build problems. Message-ID: <4AE8C19C.4040009@candelatech.com> This patch, when layered on top of my previous OLSR related patches, lets the olsr/tools build as expected. Thanks, Ben -- Ben Greear Candela Technologies Inc http://www.candelatech.com -------------- next part -------------- An embedded and charset-unspecified text was scrubbed... Name: olsr_tools_scons.patch Url: http://mailman.ICSI.Berkeley.EDU/pipermail/xorp-hackers/attachments/20091028/b2629258/attachment.ksh From greearb at candelatech.com Wed Oct 28 16:48:57 2009 From: greearb at candelatech.com (Ben Greear) Date: Wed, 28 Oct 2009 16:48:57 -0700 Subject: [Xorp-hackers] [Xorp-users] Xorp installation fails on Ubuntu In-Reply-To: <6e49b4d40910281616k703782d1v37860204e0cbf96f@mail.gmail.com> References: <6e49b4d40910281554u4e2932f9mfcf21142e38588d4@mail.gmail.com> <4AE8CF0B.9070402@candelatech.com> <6e49b4d40910281616k703782d1v37860204e0cbf96f@mail.gmail.com> Message-ID: <4AE8D869.1080304@candelatech.com> On 10/28/2009 04:16 PM, mahendra nunna wrote: > thanks ben. I got the source (xorp-1.6.tar.gz) from > http://www.xorp.org/downloads.html.... > > is there a newer version than 1.6?.... > > and strangly ...... locate xorpsh gives me no result.... > > but i have done ./configure and make..... > > thanks They have later code on sourceforge. I just put some binaries I compiled on Fedora up at: http://www.candelatech.com/oss/xorp_binaries/ They might require that you install some different libraries. 
There is a xorp_install.bash script in the package that attempts to fix up some of the library issues and create proper users, etc. The files are meant to be un-tarred in /usr/local These are from our xorp tree, but should support everything that the vanilla xorp does. Info on our tree is at: http://www.candelatech.com/oss/xorp-ct.html Thanks, Ben -- Ben Greear Candela Technologies Inc http://www.candelatech.com From mnunna0 at gmail.com Wed Oct 28 22:36:51 2009 From: mnunna0 at gmail.com (mahendra nunna) Date: Thu, 29 Oct 2009 01:36:51 -0400 Subject: [Xorp-hackers] Using Java Native Interference with XORP Message-ID: <6e49b4d40910282236o4138d570ka7ab087ff2e75ea9@mail.gmail.com> hi I want to modify the xorp. Considering the complexities involved in the modification of native XORP code, it was proposed to use Java code on top of XORP, Use interfaces and manage the XORP behaviour through Java code. It could either be done as 1. Implementing the Java code and the native XORP code in the same process, using Java Native Interface (Faster Processing) 2. or having the java code and the native XORP code run in seperate process, using Inter Process Communication. is it good to do this... or should we proceed modifying the native xorp code and compile it Please advise us on this .... we need your opinion about this.... thanks mahen -------------- next part -------------- An HTML attachment was scrubbed... URL: http://mailman.ICSI.Berkeley.EDU/pipermail/xorp-hackers/attachments/20091029/2d84b961/attachment.html From greearb at candelatech.com Wed Oct 28 23:17:16 2009 From: greearb at candelatech.com (Ben Greear) Date: Wed, 28 Oct 2009 23:17:16 -0700 Subject: [Xorp-hackers] [Xorp-users] Using Java Native Interference with XORP In-Reply-To: <6e49b4d40910282236o4138d570ka7ab087ff2e75ea9@mail.gmail.com> References: <6e49b4d40910282236o4138d570ka7ab087ff2e75ea9@mail.gmail.com> Message-ID: <4AE9336C.5090107@candelatech.com> mahendra nunna wrote: > hi > > I want to modify the xorp. 
Considering the complexities involved in > the modification of native XORP code, it was proposed to use Java code > on top of XORP, Use interfaces and manage the XORP behaviour through > Java code. > It could either be done as > 1. Implementing the Java code and the native XORP code in the same > process, using Java Native Interface (Faster Processing) This seems like a bad idea...you'd have to understand Xorp well enough to bind to it, and then pay all the price of making JNI work on top of that. > 2. or having the java code and the native XORP code run in seperate > process, using Inter Process Communication. That might work, but probably painful to integrate with XRL since I don't think there is any automatic code generation for java. I'd just copy something relatively simple (maybe rip?) and start hacking C++ code, but perhaps I'm biased! Thanks, Ben -- Ben Greear Candela Technologies Inc http://www.candelatech.com From greearb at candelatech.com Wed Oct 28 23:20:28 2009 From: greearb at candelatech.com (Ben Greear) Date: Wed, 28 Oct 2009 23:20:28 -0700 Subject: [Xorp-hackers] [Xorp-users] Xorp installation fails on Ubuntu In-Reply-To: <4AE8D869.1080304@candelatech.com> References: <6e49b4d40910281554u4e2932f9mfcf21142e38588d4@mail.gmail.com> <4AE8CF0B.9070402@candelatech.com> <6e49b4d40910281616k703782d1v37860204e0cbf96f@mail.gmail.com> <4AE8D869.1080304@candelatech.com> Message-ID: <4AE9342C.2030904@candelatech.com> Ben Greear wrote: > They have later code on sourceforge. I just put some binaries > I compiled on Fedora up at: > > http://www.candelatech.com/oss/xorp_binaries/ > I just uploaded a lanforge-xorp .deb file to that directory. No idea if it actually works...will do some testing on it tomorrow if all goes well. 
Thanks, Ben -- Ben Greear Candela Technologies Inc http://www.candelatech.com From lizhaous2000 at yahoo.com Thu Oct 29 07:54:05 2009 From: lizhaous2000 at yahoo.com (Li Zhao) Date: Thu, 29 Oct 2009 07:54:05 -0700 (PDT) Subject: [Xorp-hackers] rtrmgr crash on SIGABRT because of pop_front in task_done Message-ID: <633162.71070.qm@web58707.mail.re1.yahoo.com> I added a new protocol and I can start it in CLI by command "create protocol XXX", but the rtrmgr crashed after command "delete protocol XXX". I can also easily reproduce the exactlt same crash via the following steps: 0. I am running xorp processes on an embedded system. 1. start rtrmgr from linux shell on the system; 2. manually start xorp_static_routes from linux shell. This static will hijack the xrl channels to rtrmgr; 3. use cli command "create protocol static" to start a second xorp_static_routes. 4. use cli command "delete protocol static" to stop static. both xorp_static_routes were terminated. depended process like fea, rib and policy were also terminated. rtrmgr crash. I am attaching two stack traces. the first one is for my new protocl XXX case and the second is for the static triggered case. Anybody has any clue? Thanks. Li case 1: (gdb) tar rem 10.65.1.117:6666 Remote debugging using 10.65.1.117:6666 0x0059a850 in _start () from /lib/ld-linux.so.2 Current language: auto; currently c (gdb) dis b (gdb) c Continuing. [New Thread 0] Program received signal SIGABRT, Aborted. [Switching to Thread 0] 0xb80cd424 in ?? () (gdb) bt #0 0xb80cd424 in ?? () #1 0xbffc2624 in ?? () #2 0x00000006 in ?? () #3 0x000017fe in ?? 
() #4 0x00a71450 in raise (sig=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:64 #5 0x00a72e18 in abort () at abort.c:88 #6 0x00aaefdd in __libc_message (do_abort=2, fmt=0xb89bc8 "*** glibc detected *** %s: %s: 0x%s ***\n") at ../sysdeps/unix/sysv/linux/libc_fatal.c:170 #7 0x00ab5394 in malloc_printerr (action=2, str=0xb86a88 "free(): invalid pointer", ptr=0x8d55238) at malloc.c:5994 #8 0x00ab7346 in __libc_free (mem=0x8d55238) at malloc.c:3625 #9 0x05438591 in operator delete (ptr=0x0) at ../../../../libstdc++-v3/libsupc++/del_op.cc:49 #10 0x080a2f5f in __gnu_cxx::new_allocator >::deallocate (this=0x8d55238, __p=0x8d55238) at /usr/lib/gcc/i386-redhat-linux/4.3.2/../../../../include/c++/4.3.2/ext/new_allocator.h:98 #11 0x080a2f84 in std::_List_base >::_M_put_node ( this=0x8d55238, __p=0x8d55238) at /usr/lib/gcc/i386-redhat-linux/4.3.2/../../../../include/c++/4.3.2/bits/stl_list.h:318 #12 0x080a6f39 in std::list >::_M_erase ( ---Type to continue, or q to quit--- this=0x8d55238, __position={_M_node = 0x8d55238}) at /usr/lib/gcc/i386-redhat-linux/4.3.2/../../../../include/c++/4.3.2/bits/stl_list.h:1361 #13 0x080a6f6b in std::list >::pop_front ( this=0x8d55238) at /usr/lib/gcc/i386-redhat-linux/4.3.2/../../../../include/c++/4.3.2/bits/stl_list.h:861 #14 0x08098c23 in TaskManager::task_done (this=0x8d55210, success=true, errmsg= {static npos = 4294967295, _M_dataplus = {> = {<__gnu_cxx::new_allocator> = {}, }, _M_p = 0x546ccd4 ""}}) at task.cc:2251 #15 0x080a5911 in XorpMemberCallback2B0::dispatch (this=0x8d60228, a1=true, a2= {static npos = 4294967295, _M_dataplus = {> = {<__gnu_cxx::new_allocator> = {}, }, _M_p = 0x546ccd4 ""}}) at ../libxorp/callback_nodebug.hh:4636 #16 0x08095bd1 in Task::step8_report (this=0x8d60460) at task.cc:1993 #17 0x080a22e7 in XorpMemberCallback0B0::dispatch (this=0x8d5fd90) at ../libxorp/callback_nodebug.hh:306 #18 0x0808b2c1 in Module::terminate_with_prejudice (this=0x8d58450, cb= {_M_ptr = 0x8d5fd90, _M_index = 110}) at 
module_manager.cc:218 #19 0x0808f36e in XorpMemberCallback0B1 > >::dispatch (this=0x8d60938) at ../libxorp/callback_nodebug.hh:598 ---Type to continue, or q to quit--- #20 0x081af7da in OneoffTimerNode2::expire (this=0x8d5ff28) at timer.cc:167 #21 0x081ae8ed in TimerList::expire_one (this=0xbffcce4c, worst_priority=4) at timer.cc:441 #22 0x081aea48 in TimerList::run (this=0xbffcce4c) at timer.cc:389 #23 0x08198564 in EventLoop::do_work (this=0xbffcce48, can_block=true) at eventloop.cc:153 #24 0x08198828 in EventLoop::run (this=0xbffcce48) at eventloop.cc:99 #25 0x080682df in Rtrmgr::run (this=0xbffcd4b4) at main_rtrmgr.cc:418 #26 0x08069432 in main (argc=6, argv=0xbffcd5c4) at main_rtrmgr.cc:725 (gdb) Case 2: Program received signal SIGABRT, Aborted. [Switching to Thread 0] 0xb80db424 in ?? () (gdb) bt #0 0xb80db424 in ?? () #1 0xbffceeb4 in ?? () #2 0x00000006 in ?? () #3 0x00001802 in ?? () #4 0x00a71450 in raise (sig=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:64 #5 0x00a72e18 in abort () at abort.c:88 #6 0x00aaefdd in __libc_message (do_abort=2, fmt=0xb89bc8 "*** glibc detected *** %s: %s: 0x%s ***\n") at ../sysdeps/unix/sysv/linux/libc_fatal.c:170 #7 0x00ab5394 in malloc_printerr (action=2, str=0xb89bf4 "munmap_chunk(): invalid pointer", ptr=0x93ed238) at malloc.c:5994 #8 0x05438591 in operator delete (ptr=0x0) at ../../../../libstdc++-v3/libsupc++/del_op.cc:49 #9 0x080a2f5f in __gnu_cxx::new_allocator >::deallocate (this=0x93ed238, __p=0x93ed238) at /usr/lib/gcc/i386-redhat-linux/4.3.2/../../../../include/c++/4.3.2/ext/new_allocator.h:98 #10 0x080a2f84 in std::_List_base >::_M_put_node ( this=0x93ed238, __p=0x93ed238) at /usr/lib/gcc/i386-redhat-linux/4.3.2/../../../../include/c++/4.3.2/bits/stl_list.h:318 #11 0x080a6f39 in std::list >::_M_erase ( ---Type to continue, or q to quit--- this=0x93ed238, __position={_M_node = 0x93ed238}) at /usr/lib/gcc/i386-redhat-linux/4.3.2/../../../../include/c++/4.3.2/bits/stl_list.h:1361 #12 0x080a6f6b in std::list 
>::pop_front ( this=0x93ed238) at /usr/lib/gcc/i386-redhat-linux/4.3.2/../../../../include/c++/4.3.2/bits/stl_list.h:861 #13 0x08098c23 in TaskManager::task_done (this=0x93ed210, success=true, errmsg= {static npos = 4294967295, _M_dataplus = {> = {<__gnu_cxx::new_allocator> = {}, }, _M_p = 0x546ccd4 ""}}) at task.cc:2251 #14 0x080a5911 in XorpMemberCallback2B0::dispatch (this=0x93f4e80, a1=true, a2= {static npos = 4294967295, _M_dataplus = {> = {<__gnu_cxx::new_allocator> = {}, }, _M_p = 0x546ccd4 ""}}) at ../libxorp/callback_nodebug.hh:4636 #15 0x08095bd1 in Task::step8_report (this=0x93f3c78) at task.cc:1993 #16 0x080a22e7 in XorpMemberCallback0B0::dispatch (this=0x93f4ba0) at ../libxorp/callback_nodebug.hh:306 #17 0x0808b64b in Module::terminate (this=0x93f39a0, cb= {_M_ptr = 0x93f4ba0, _M_index = 284}) at module_manager.cc:166 #18 0x0808c0a5 in ModuleManager::kill_module (this=0xbffdbb68, module_name=@0x93f3c80, cb={_M_ptr = 0x93f4ba0, _M_index = 284}) ---Type to continue, or q to quit--- at module_manager.cc:472 #19 0x08093e38 in Task::step7_kill (this=0x93f3c78) at task.cc:1983 #20 0x080a22e7 in XorpMemberCallback0B0::dispatch (this=0x93f3910) at ../libxorp/callback_nodebug.hh:306 #21 0x081af7da in OneoffTimerNode2::expire (this=0x942f198) at timer.cc:167 #22 0x081ae8ed in TimerList::expire_one (this=0xbffdb65c, worst_priority=4) at timer.cc:441 #23 0x081aea48 in TimerList::run (this=0xbffdb65c) at timer.cc:389 #24 0x08198564 in EventLoop::do_work (this=0xbffdb658, can_block=true) at eventloop.cc:153 #25 0x08198828 in EventLoop::run (this=0xbffdb658) at eventloop.cc:99 #26 0x080682df in Rtrmgr::run (this=0xbffdbcc4) at main_rtrmgr.cc:418 #27 0x08069432 in main (argc=6, argv=0xbffdbdd4) at main_rtrmgr.cc:725 (gdb) c Continuing. Program terminated with signal SIGABRT, Aborted. The program no longer exists. 
From lizhaous2000 at yahoo.com Thu Oct 29 08:16:32 2009 From: lizhaous2000 at yahoo.com (Li Zhao) Date: Thu, 29 Oct 2009 08:16:32 -0700 (PDT) Subject: [Xorp-hackers] rtrmgr crash on SIGABRT because of pop_front in task_done In-Reply-To: <633162.71070.qm@web58707.mail.re1.yahoo.com> Message-ID: <89697.2773.qm@web58705.mail.re1.yahoo.com> I am puzzled by operator delete(prt=0x0). But inside deallocate(this=0x8d55238, __p=0x8d55238), the __p is not 0x0. pop_front means "removes and deletes". So somewhere else this list node was deleted again? --- On Thu, 10/29/09, Li Zhao wrote: > From: Li Zhao > Subject: [Xorp-hackers] rtrmgr crash on SIGABRT because of pop_front in task_done > To: xorp-hackers at icir.org > Date: Thursday, October 29, 2009, 10:54 AM > I added a new protocol and I can > start it in CLI by command "create protocol XXX", but the > rtrmgr crashed after command "delete protocol XXX". > I can also easily reproduce the exactlt same crash via the > following steps: > > 0. I am running xorp processes on an embedded system. > 1. start rtrmgr from linux shell on the system; > 2. manually start xorp_static_routes from linux shell. This > static will hijack the xrl channels to rtrmgr; > 3. use cli command "create protocol static" to start a > second xorp_static_routes. > 4. use cli command "delete protocol static" to stop static. > both xorp_static_routes were terminated. depended process > like fea, rib and policy were also terminated. rtrmgr > crash. > > I am attaching two stack traces. the first one is for my > new protocl XXX case and the second is for the static > triggered case. > > Anybody has any clue? Thanks. > > Li
From bms at incunabulum.net Thu Oct 29 08:30:29 2009 From: bms at incunabulum.net (Bruce Simpson) Date: Thu, 29 Oct 2009 15:30:29 +0000 Subject: [Xorp-hackers] [Xorp-users] Using Java Native Interference with XORP In-Reply-To: <6e49b4d40910282236o4138d570ka7ab087ff2e75ea9@mail.gmail.com> References: <6e49b4d40910282236o4138d570ka7ab087ff2e75ea9@mail.gmail.com> Message-ID: <4AE9B515.1060901@incunabulum.net> mahendra nunna wrote: > ... > 1. Implementing the Java code and the native XORP code in the same > process, using Java Native Interface (Faster Processing) Regardless of JNI, cross-language interop isn't happening until the Thrift drop of XORP is done. I am edging closer to this goal.
cheers, BMS From bms at incunabulum.net Thu Oct 29 08:42:09 2009 From: bms at incunabulum.net (Bruce Simpson) Date: Thu, 29 Oct 2009 15:42:09 +0000 Subject: [Xorp-hackers] Crash due to stale cached xrl sender pointer. In-Reply-To: <4AE7D31B.1040107@candelatech.com> References: <4AE799E1.8010300@candelatech.com> <4AE7D31B.1040107@candelatech.com> Message-ID: <4AE9B7D1.803@incunabulum.net> Ben Greear wrote: > The attached patch seems to fix the problem. Thanks for the patch, and the analysis. This seems to introduce a ref_ptr -- a class I'm not 100% happy about. Are you sure that this patch does not leak any memory? Passing a ref_ptr around is bad, because every time it crosses a C++ scope boundary, the refcount is bumped -- Boost at least has a weak_ptr and a shared_ptr, which cleanly separates the smart pointer semantics between 'I am passing this around' and 'I am sharing ownership of the pointed-to object'. Is there a simpler workaround possible for the issue? I'd rather not get too deep into reviewing a patch which cuts fairly deep into internals which are probably about to get rewritten. thanks, BMS From greearb at candelatech.com Thu Oct 29 08:55:17 2009 From: greearb at candelatech.com (Ben Greear) Date: Thu, 29 Oct 2009 08:55:17 -0700 Subject: [Xorp-hackers] Crash due to stale cached xrl sender pointer. In-Reply-To: <4AE9B7D1.803@incunabulum.net> References: <4AE799E1.8010300@candelatech.com> <4AE7D31B.1040107@candelatech.com> <4AE9B7D1.803@incunabulum.net> Message-ID: <4AE9BAE5.20406@candelatech.com> Bruce Simpson wrote: > Ben Greear wrote: >> The attached patch seems to fix the problem. > > Thanks for the patch, and the analysis. > > This seems to introduce a ref_ptr -- a class I'm not 100% happy about. > Are you sure that this patch does not leak any memory? If it does, then xorp leaks memory everywhere it uses this ref_ptr. It does stop the crash...I haven't run valgrind on it lately, but if ref_ptr was broken, earlier valgrind runs should have seen it. 
> > Passing a ref_ptr around is bad, because every time it crosses a C++ > scope boundary, the refcount is bumped -- Boost at least has a > weak_ptr and a shared_ptr, which cleanly separates the smart pointer > semantics between 'I am passing this around' and 'I am sharing > ownership of the pointed-to object'. That's why I pass by reference...keeps ref counts from changing needlessly. Either way, a bit of addition and subtraction is cheap..not like we're doing millions of xrls a second! > > Is there a simpler workaround possible for the issue? I'd rather not > get too deep into reviewing a patch which cuts fairly deep into > internals which are probably about to get rewritten. I doubt it...don't know where all the xrls are stored..would have to search all of them and clean out any with pointers to the sender that is to be deleted. In general, I dislike smart pointers, but in this case, they seem tailor made for the problem. Thanks, Ben -- Ben Greear Candela Technologies Inc http://www.candelatech.com From greearb at candelatech.com Thu Oct 29 09:53:44 2009 From: greearb at candelatech.com (Ben Greear) Date: Thu, 29 Oct 2009 09:53:44 -0700 Subject: [Xorp-hackers] rtrmgr crash on SIGABRT because of pop_front in task_done In-Reply-To: <89697.2773.qm@web58705.mail.re1.yahoo.com> References: <89697.2773.qm@web58705.mail.re1.yahoo.com> Message-ID: <4AE9C898.9070100@candelatech.com> On 10/29/2009 08:16 AM, Li Zhao wrote: > I am puzzled by operator delete(prt=0x0). But inside deallocate(this=0x8d55238, __p=0x8d55238), the __p is not 0x0. pop_front means "removes and deletes". So somewhere else this list node was deleted again? 
> > --- On Thu, 10/29/09, Li Zhao wrote: > >> From: Li Zhao >> Subject: [Xorp-hackers] rtrmgr crash on SIGABRT because of pop_front in task_done >> To: xorp-hackers at icir.org >> Date: Thursday, October 29, 2009, 10:54 AM >> I added a new protocol and I can >> start it in CLI by command "create protocol XXX", but the >> rtrmgr crashed after command "delete protocol XXX". >> I can also easily reproduce the exactlt same crash via the >> following steps: >> >> 0. I am running xorp processes on an embedded system. >> 1. start rtrmgr from linux shell on the system; >> 2. manually start xorp_static_routes from linux shell. This >> static will hijack the xrl channels to rtrmgr; >> 3. use cli command "create protocol static" to start a >> second xorp_static_routes. >> 4. use cli command "delete protocol static" to stop static. >> both xorp_static_routes were terminated. depended process >> like fea, rib and policy were also terminated. rtrmgr >> crash. I can reproduce it here..will take a quick look to see if I can figure it out. Thanks, Ben -- Ben Greear Candela Technologies Inc http://www.candelatech.com From bms at incunabulum.net Thu Oct 29 10:10:03 2009 From: bms at incunabulum.net (Bruce Simpson) Date: Thu, 29 Oct 2009 17:10:03 +0000 Subject: [Xorp-hackers] Crash due to stale cached xrl sender pointer. In-Reply-To: <4AE9BAE5.20406@candelatech.com> References: <4AE799E1.8010300@candelatech.com> <4AE7D31B.1040107@candelatech.com> <4AE9B7D1.803@incunabulum.net> <4AE9BAE5.20406@candelatech.com> Message-ID: <4AE9CC6B.9050302@incunabulum.net> Hi Ben, Not really meant to be spending time on this at the moment, but I shall... it is not too far off from what I'm actually doing, and I think we probably do need to go over this again ground a bit, given that I am effectively rewriting the affected code right now. Ben Greear wrote: > > In general, I dislike smart pointers, but in this case, they seem > tailor made for the problem. 
I would disagree that smart pointers are even necessarily the right answer to the issue you've found; in some cases, they can do more harm than good. I got the electronic equivalent of dirty looks, at first, when I had to work around a problem in template class Spt using ref_ptr<..>&. However, that was an existing, isolated use of ref_ptr within the tree. I'd prefer not to use ref_ptr for new code if at all possible. XrlPFSender is a stereotype of an object which should not be created and destroyed trivially. If something in libxipc is tripping over it, it is possibly a race condition, or updates not being propagated elsewhere. In this scenario, the Xrl blob contains a cached pointer to a transport channel (XrlPFSender) which has now potentially gone away. Given that Xrl instances are like confetti, it would be difficult to track them all, and I'm not sure a refcount is the most appropriate way to deal with that (see below). It's a little trickier in this scenario because of how class Xrl is treated in the code base. >> >> Is there a simpler workaround possible for the issue? I'd rather not >> get too deep into reviewing a patch which cuts fairly deep into >> internals which are probably about to get rewritten. > I doubt it...don't know where all the xrls are stored..would have to > search all of them and clean out any with > pointers to the sender that is to be deleted. There are several layers of indirection and caching going on in the XRL layer, but the important ones here are:- 1) the cached FinderDBEntry used to hold the previous results of an indirect XRL call through the Finder. The most fundamental cache mechanism in libxipc. 2) the cached resolved_sender() pointer in Xrl -- what we're interested in here. The Xrl instance involved, based on your backtrace, seems to be allocated by XrlRawPacket4V0p1Client::send_register_receiver(), and held in a statically declared pointer. 
When an XRL is to be sent through the C++ bindings, it will call back into XrlRouter to see if there is a cached XrlPFSender for the given XRL. The lookup is done w/o arguments. One glaring blot on the landscape beckons this question: * Are any processes sharing the segment that this 'static Xrl*' pointer happens to be in? The pointer looks like it should be in BSS and thus subject to copy-on-write, so this should not be an issue. However, if multiple entities in the same process are calling the C++ bindings, this COULD be a reentrancy issue. If the Finder learns that an XRL target has gone away, it should blow away the FinderClient cache entries, and then the XrlRouter::send() method should notice this. Unfortunately, this may not help in the failure case we're examining. If this check is raced by the XRL being withdrawn and later re-advertised by its target (e.g. its host process got restarted), then obviously the cached XrlPFSender is going to be invalid in the XRL. It's not 100% impossible that this notification has been raced. If the code in your OSPF process which wants to send the XRL, is running from a timer callback, and this callback happens to collide with the FinderClient learning about the XRL target moving somewhere else in the system -- then where the XRL data is going to get sent, will be affected by which point in time it races the FinderClient cache update. In many ways, the fact that the problem exists, is an artefact of how method call resolution is working in the XRL RPC layer; it is per-method rather than per-service, and this is really one of the things I'm trying to address through the Thrift rewrite. What the code is trying to do, is to cache the transport pointer right next to the outgoing data. In principle this would be fine, were it not for the fact that the transport can go away for a variety of reasons. XrlPFSender has no knowledge of Xrl referencing it, and no meaningful way to convey the failure mode to Xrl. 
It's really XrlRouter's role to deal with this.

In the situation above, even if we held a ref_ptr on an XrlPFSender, we wouldn't even know if the underlying network transport is still valid. The "right thing" to do would be to force the inner Xrl's cached resolved_sender pointer to be invalidated -- or validate the pointer upfront when it's used. Again, this is really XrlRouter's responsibility. It's possible for the Xrl's target to be known, and its XRL method resolved, but its destination transport still unresolved, which is what XrlRouter::get_sender() is trying to deal with.

For what it's worth, class Xrl largely exists because XORP RPC calls can't be expressed as simple binary blobs. There are *two* RPC protocols running in tandem inside libxipc, and one of them is textual. To my mind, Xrl should be a more lightweight class than it actually is. Caching the transport pointer (XrlPFSender) in the Xrl itself is just asking for trouble in situations like this, given that we have no means of telling the Xrl 'your resolved sender has gone away' -- it's buried in libfooxif.so's copy-on-write BSS segment.

It's a non-trivial issue to fix. Using the ref_ptr seems deceptively simple -- we get a handle on the transport, and so the code doesn't blow up, but we probably don't fix the underlying issue (unless I'm missing something).

Something interesting to try might be to modify clnt-gen to do a few things in the client shims:

%%%
return _sender->send(*x, callback(....));
%%%

to become

%%%
bool retval = _sender->send(*x, callback(....));
x->set_resolved(false);
return retval;
%%%

This, however, doesn't fix the root problem either - it just makes it possible to work around the issue without changing the allocation semantics for XrlPFSender, by deprecating one of the cache mechanisms in libxipc. The current cache mechanism is fubar, because it apparently can't deal with something in the XrlPFSender life cycle which causes it to be deleted.
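To make the "validate the pointer upfront when it's used" option concrete, here is a minimal sketch. It assumes hypothetical stand-in classes (MiniSender, MiniXrl, MiniRouter) and a generation counter; none of these exist in libxipc, and the real XrlRouter, Xrl and XrlPFSender interfaces are different. The only point is that the router, not the Xrl, decides whether a cached transport pointer may still be used.

```cpp
#include <map>
#include <memory>
#include <string>
#include <utility>

// Hypothetical stand-in for a transport channel (not the real XrlPFSender).
struct MiniSender {
    explicit MiniSender(std::string t) : target(std::move(t)), sent(0) {}
    std::string target;
    int sent;
    void send(const std::string&) { ++sent; }
};

// Hypothetical stand-in for an Xrl that caches its resolved sender.
struct MiniXrl {
    explicit MiniXrl(std::string t)
        : target(std::move(t)), cached(nullptr), cached_gen(0) {}
    std::string target;
    MiniSender* cached;    // cached transport pointer; may be stale
    unsigned cached_gen;   // router generation at the time of caching
};

// Hypothetical router that owns the senders and validates cached pointers.
class MiniRouter {
public:
    MiniRouter() : _gen(1), _resolves(0) {}

    // A target went away or restarted: destroy its sender and bump the
    // generation so every previously cached pointer is treated as stale.
    void drop_sender(const std::string& target) {
        _senders.erase(target);
        ++_gen;
    }

    bool send(MiniXrl& x, const std::string& payload) {
        // Validate upfront: only trust the cache if nothing was dropped
        // since the pointer was stored. A stale pointer is never touched.
        if (x.cached == nullptr || x.cached_gen != _gen) {
            x.cached = resolve(x.target);
            x.cached_gen = _gen;
        }
        x.cached->send(payload);
        return true;
    }

    int resolves() const { return _resolves; }

private:
    MiniSender* resolve(const std::string& target) {
        ++_resolves;
        std::unique_ptr<MiniSender>& slot = _senders[target];
        if (!slot)
            slot.reset(new MiniSender(target));
        return slot.get();
    }

    std::map<std::string, std::unique_ptr<MiniSender> > _senders;
    unsigned _gen;
    int _resolves;
};
```

A single global generation is deliberately coarse: dropping any sender forces every cached pointer to be re-resolved once. That costs one extra lookup per Xrl after a drop, which seems a reasonable price if sends are infrequent, but a per-target generation would be the obvious refinement.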
In summary: I strongly believe that what you're actually seeing is a race which class Xrl is not able to defend itself against... because the responsibility for it belongs in class XrlRouter.

It would be good to get a handle first on who introduced the secondary caching mechanism, and why. Most likely this was to avoid any STL container traversals when an XRL is actually being sent, but given that you've probably run into a race which blows this mechanism up, it needs revisiting. (Yes, this has fallen under the axe in the Thrift branch...!)

cheers,
BMS

From greearb at candelatech.com Thu Oct 29 10:26:49 2009
From: greearb at candelatech.com (Ben Greear)
Date: Thu, 29 Oct 2009 10:26:49 -0700
Subject: [Xorp-hackers] rtrmgr crash on SIGABRT because of pop_front in task_done
In-Reply-To: <89697.2773.qm@web58705.mail.re1.yahoo.com>
References: <89697.2773.qm@web58705.mail.re1.yahoo.com>
Message-ID: <4AE9D059.20102@candelatech.com>

On 10/29/2009 08:16 AM, Li Zhao wrote:
> I am puzzled by operator delete(prt=0x0). But inside deallocate(this=0x8d55238, __p=0x8d55238), the __p is not 0x0. pop_front means "removes and deletes". So somewhere else this list node was deleted again?
>
> --- On Thu, 10/29/09, Li Zhao wrote:
>
>> From: Li Zhao
>> Subject: [Xorp-hackers] rtrmgr crash on SIGABRT because of pop_front in task_done
>> To: xorp-hackers at icir.org
>> Date: Thursday, October 29, 2009, 10:54 AM
>> I added a new protocol and I can
>> start it in CLI by command "create protocol XXX", but the
>> rtrmgr crashed after command "delete protocol XXX".
>> I can also easily reproduce the exactly same crash via the
>> following steps:
>>
>> 0. I am running xorp processes on an embedded system.
>> 1. start rtrmgr from linux shell on the system;
>> 2. manually start xorp_static_routes from linux shell. This
>> static will hijack the xrl channels to rtrmgr;
>> 3. use cli command "create protocol static" to start a
>> second xorp_static_routes.
>> 4. use cli command "delete protocol static" to stop static.
>> both xorp_static_routes were terminated. depended process
>> like fea, rib and policy were also terminated. rtrmgr
>> crash.

I ran under valgrind, and saw this info:

==27820== Invalid free() / delete / delete[]
==27820==    at 0x4A05E3F: operator delete(void*) (vg_replace_malloc.c:342)
==27820==    by 0x463531: __gnu_cxx::new_allocator >::deallocate(std::_List_node*, unsigned long) (new_allocator.h:95)
==27820==    by 0x462427: std::_List_base >::_M_put_node(std::_List_node*) (stl_list.h:320)
==27820==    by 0x46143B: std::list >::_M_erase(std::_List_iterator) (stl_list.h:1431)
==27820==    by 0x45FF0B: std::list >::pop_front() (stl_list.h:906)
==27820==    by 0x45DB73: TaskManager::task_done(bool, std::string const&) (task.cc:2256)
==27820==    by 0x465970: XorpMemberCallback2B0::dispatch(bool, std::string const&) (callback_nodebug.hh:4636)
==27820==    by 0x45C540: Task::step8_report() (task.cc:1998)
==27820==    by 0x4659DF: XorpMemberCallback0B0::dispatch() (callback_nodebug.hh:306)
==27820==    by 0x449613: Module::terminate_with_prejudice(ref_ptr >) (module_manager.cc:218)
==27820==    by 0x44F63C: XorpMemberCallback0B1 > >::dispatch() (callback_nodebug.hh:598)
==27820==    by 0x549D72: OneoffTimerNode2::expire(XorpTimer&, void*) (timer.cc:167)
==27820==  Address 0x50c9340 is 80 bytes inside a block of size 200 alloc'd
==27820==    at 0x4A06FFC: operator new(unsigned long) (vg_replace_malloc.c:230)
==27820==    by 0x42C81F: MasterConfigTree::MasterConfigTree(std::string const&, MasterTemplateTree*, ModuleManager&, XorpClient&, bool, bool) (master_conf_tree.cc:119)
==27820==    by 0x406ED6: Rtrmgr::run() (main_rtrmgr.cc:319)
==27820==    by 0x407E57: main (main_rtrmgr.cc:665)

It appears to me that the task-manager object (this) is already deleted when the taskmanager::task_done() method is called. Could probably add some debugging to the destructors and constructors of TaskManager to verify.
I have some other things to do first..but will look at this a bit later if no one beats me to it. Thanks, Ben -- Ben Greear Candela Technologies Inc http://www.candelatech.com From greearb at candelatech.com Thu Oct 29 10:43:55 2009 From: greearb at candelatech.com (Ben Greear) Date: Thu, 29 Oct 2009 10:43:55 -0700 Subject: [Xorp-hackers] Crash due to stale cached xrl sender pointer. In-Reply-To: <4AE9CC6B.9050302@incunabulum.net> References: <4AE799E1.8010300@candelatech.com> <4AE7D31B.1040107@candelatech.com> <4AE9B7D1.803@incunabulum.net> <4AE9BAE5.20406@candelatech.com> <4AE9CC6B.9050302@incunabulum.net> Message-ID: <4AE9D45B.1090003@candelatech.com> On 10/29/2009 10:10 AM, Bruce Simpson wrote: > Hi Ben, > > Not really meant to be spending time on this at the moment, but I > shall... it is not too far off from what I'm actually doing, and I think > we probably do need to go over this again ground a bit, given that I am > effectively rewriting the affected code right now. > > Ben Greear wrote: >> >> In general, I dislike smart pointers, but in this case, they seem >> tailor made for the problem. > > I would disagree that smart pointers are even necessarily the right > answer to the issue you've found; in some cases, they can do more harm > than good. > > I got the electronic equivalent of dirty looks, at first, when I had to > work around a problem in template class Spt using ref_ptr<..>&. However, > that was an existing, isolated use of ref_ptr within the tree. I'd > prefer not to use ref_ptr for new code if at all possible. > > XrlPFSender is a stereotype of an object which should not be created and > destroyed trivially. If something in libxipc is tripping over it, it is > possibly a race condition, or updates not being propagated elsewhere. > > In this scenario, the Xrl blob contains a cached pointer to a transport > channel (XrlPFSender) which has now potentially gone away. 
Given that > Xrl instances are like confetti, it would be difficult to track them > all, and I'm not sure a refcount is the most appropriate way to deal > with that (see below). The refcount just keeps the sender object from being destroyed until all xrls referencing it are cleaned up. The sender was probably destroyed because it timed out (I was starting 100 virtual router processes...loads the system very heavy). Please note that the sender will be marked in-active, so the XRL will not actually try to use it, but if the memory is gone, then it can't even check the foo->active() flag w/out crashing. It seems a pretty simple use-after-free bug, and the fix seems pretty trivial to me. > Caching the transport pointer (XrlPFSender) in the Xrl itself is just > asking for trouble in situations like this, given that we have no means > of telling the Xrl 'your resolved sender has gone away' -- it's buried > in libfooxif.so's copy-on-write BSS segment. > > It's a non-trivial issue to fix. Using the ref_ptr seems deceptively > simple -- we get a handle on the transport, and so the code doesn't blow > up, but we probably don't fix the underlying issue (unless I'm missing > something). Assuming a new sender is created, the Xrl will notice the cached one is inactive and search for a new one. Seems like it all works out to me. > It would be good to get a handle first on who introduced the secondary > caching mechanism, and why. Most likely this was to avoid any STL > container traversals when an XRL is actually being sent, but given that > you've probably run into a race which blows this mechanism up, it needs > revisiting. (Yes, this has fallen under the axe in the Thrift branch...!) I think you are over-thinking this one! 
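
The argument above can be sketched with a reference-counted handle. This is only an illustration under stated assumptions: it uses std::shared_ptr where the XORP tree would use ref_ptr, and Sender, Router and Call are hypothetical toy types, not the real XrlPFSender, XrlRouter or Xrl. The count keeps the old sender's memory around so the "is it still alive?" check is safe, and an inactive sender triggers a fresh lookup.

```cpp
#include <memory>

// Hypothetical toy types; not the real libxipc classes.
struct Sender {
    bool active = true;
    bool alive() const { return active; }
};

struct Router {
    std::shared_ptr<Sender> current = std::make_shared<Sender>();

    // Models a sender timing out: mark the old one inactive and drop
    // the router's reference. Anyone still holding a handle keeps the
    // memory valid, so checking alive() on it does not crash.
    void replace_sender() {
        current->active = false;
        current = std::make_shared<Sender>();
    }
};

struct Call {
    std::shared_ptr<Sender> cached;

    // Return a usable sender, re-resolving through the router when the
    // cached one is missing or has been marked inactive.
    std::shared_ptr<Sender> sender(Router& r) {
        if (!cached || !cached->alive())
            cached = r.current;
        return cached;
    }
};
```

The old sender is destroyed only once the last Call holding it lets go, which is the "refcount just keeps the sender object from being destroyed" behaviour described above.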
Thanks,
Ben

-- 
Ben Greear
Candela Technologies Inc http://www.candelatech.com

From greearb at candelatech.com Thu Oct 29 10:57:18 2009
From: greearb at candelatech.com (Ben Greear)
Date: Thu, 29 Oct 2009 10:57:18 -0700
Subject: [Xorp-hackers] PATCH: Small memory leak in CliCommand
Message-ID: <4AE9D77E.4050502@candelatech.com>

Found this while using valgrind. I think the patch will work, but I haven't actually tested it yet.

==27835== 6,184 (240 direct, 5,944 indirect) bytes in 1 blocks are definitely lost in loss record 48 of 59
==27835==    at 0x4A06FFC: operator new(unsigned long) (vg_replace_malloc.c:230)
==27835==    by 0x53D5DD: CliCommand::add_pipes(std::string&) (cli_command.cc:426)
==27835==    by 0x5215C3: CliNode::CliNode(int, xorp_module_id, EventLoop&) (cli_node.cc:94)
==27835==    by 0x40714F: XrlFeaNode::XrlFeaNode(EventLoop&, std::string const&, std::string const&, std::string const&, unsigned short, bool) (xrl_fea_node.cc:79)
==27835==    by 0x40638C: fea_main(std::string const&, unsigned short) (xorp_fea.cc:97)
==27835==    by 0x406681: main (xorp_fea.cc:181)
==27835==

[greearb at ben-dt2 xorp.ct]$ git diff
diff --git a/cli/cli_command.cc b/cli/cli_command.cc
index 99a003b..256157f 100644
--- a/cli/cli_command.cc
+++ b/cli/cli_command.cc
@@ -95,6 +95,7 @@ CliCommand::~CliCommand()
 {
     // Delete recursively all child commands
     delete_pointers_list(_child_command_list);
+    delete_pipes();
 }
 
 //
@@ -428,6 +429,7 @@ CliCommand::add_pipes(string& error_msg)
     if (com0 == NULL) {
         return (XORP_ERROR);
     }
+    delete_pipes(); // be sure to not leak memory if one is already set.
     set_cli_command_pipe(com0);
 
     cli_pipe = new CliPipe("count");

-- 
Ben Greear
Candela Technologies Inc http://www.candelatech.com

From bms at incunabulum.net Thu Oct 29 11:02:07 2009
From: bms at incunabulum.net (Bruce Simpson)
Date: Thu, 29 Oct 2009 18:02:07 +0000
Subject: [Xorp-hackers] XRL call serialization
In-Reply-To: <4ADCD472.2020203@candelatech.com>
References: <4ACD6B16.6080500@candelatech.com> <4ADC8017.9010507@incunabulum.net>
 <4ADCD472.2020203@candelatech.com>
Message-ID: <4AE9D89F.7080406@incunabulum.net>

Hi Ben,

Just saw this... sorry for the delay.

Ben Greear wrote:
> With regard to XRL, I've a question:
>
> If an application makes 3 XRL calls:
>
> do_a()
> do_b()
> commit_all()
>
> Is there any guarantee that these are strictly delivered to
> the peer process in the order called? Code appears to expect
> this to be true, but I'm suspicious that perhaps it does not.

XORP processes are intended to be asynchronous; this is realized using explicit coroutines, with no additional C++ runtime support other than UNIX system calls.

XRL is intended to be an asynchronous IPC layer. The calls in your example won't be guaranteed to be serialized, unless you explicitly serialize them in your process.

You can see this in XORP processes and tools in the form of a ping-pong between callback routines.

An example of forcing serialization can be found in contrib/olsr/tools/print_databases.cc, where you can see the EventLoop being run whilst the Getter does its thing. get() is called, this fires off an XRL, and the list_cb() will successively be called for each fetch, until Getter::_done is set to true by the final fetch.

BTW: One example of what NOT to do would be the XrlIO::register_rib() function in contrib/olsr/xrl_io.cc. The semantics behind those two XRL calls are co-dependent, and will be different depending on whether or not an OLSR origin table is already registered with the RIB. But you can see that two different XRLs can be fired off 'in parallel'.
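
As a rough illustration of the ping-pong style described above, here is a toy model. ToyLoop, async_call(), fire_all() and fire_chained() are hypothetical names made up for this sketch; nothing here is XORP's EventLoop or the XRL API. The toy loop deliberately runs the newest task first, purely as an assumption to make unordered delivery visible.

```cpp
#include <functional>
#include <string>
#include <utility>
#include <vector>

using Callback = std::function<void()>;

// Toy event loop (hypothetical). It takes the newest pending task
// first, to mimic a layer that gives no ordering guarantee.
struct ToyLoop {
    std::vector<Callback> pending;
    void post(Callback cb) { pending.push_back(std::move(cb)); }
    void run() {
        while (!pending.empty()) {
            Callback cb = std::move(pending.back());
            pending.pop_back();
            cb();
        }
    }
};

// Hypothetical async call: when it is eventually delivered, record
// its name in the log and then invoke the completion callback.
void async_call(ToyLoop& loop, std::vector<std::string>& log,
                const std::string& name, Callback done) {
    loop.post([&log, name, done]() {
        log.push_back(name);
        if (done)
            done();
    });
}

// Fire-and-forget: nothing orders the three calls.
void fire_all(ToyLoop& loop, std::vector<std::string>& log) {
    async_call(loop, log, "do_a", nullptr);
    async_call(loop, log, "do_b", nullptr);
    async_call(loop, log, "commit_all", nullptr);
}

// Ping-pong: each call is issued from the previous call's callback,
// so the order is fixed by the caller, not by the transport.
void fire_chained(ToyLoop& loop, std::vector<std::string>& log) {
    async_call(loop, log, "do_a", [&loop, &log]() {
        async_call(loop, log, "do_b", [&loop, &log]() {
            async_call(loop, log, "commit_all", nullptr);
        });
    });
}
```

With this toy loop, fire_all() delivers commit_all before do_a, while fire_chained() always delivers do_a, do_b, commit_all.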
Class Xrl has no notion of call/reply sequence numbers, which are necessary in order to deal with out-of-order delivery, as well as identifying individual method calls on-the-wire. However, the XORP application code is written with the expectation that XRL is async. The fact that a few things 'under the hood' in XRL prevent it being fully async, is largely academic -- the tutorial materials are pretty clear you shouldn't assume serial method call returns, etc.

In practice, what happens is that the XRL transport(s) themselves will stamp each call with a sequence ID. You can see this happening in XrlPFSTCPSender::send(). Although it *does* expect delivery in sequence (you can see this in XrlPFSTCPSender::read_event()), this is purely how it's been done here.

In this respect, XRL is totally tied to TCP semantics in its implementation, and RPCs should not be reordered, given that their dispatch in the XRL target is synchronous with their delivery -- there is no intermediate queueing, apart from the kernel's socket buffers. But there should be no expectation of this by application developers.

Indeed, if you look at the stubs which Thrift generates, the client code only allows 1 request in-flight; it always sets the sequence number to 0. In practice, this isn't a problem, because in Thrift, the servers tell clients apart per session. In XRL, we tell calls apart by method name. Something tells me this gets really interesting if we try to thread the RIB or otherwise move it into another process.

I should point out that XRL targets never actually get to see the Xrl itself -- they just get passed a bunch of arguments by the XrlRouter, and their handler function invoked.

On a Grim Code Reaper's note: This makes it pretty much impossible, using the existing code, to implement any serialization or parallelism policy within each XORP process, as well as making it impossible to decentralize the method call disposition, because it's tied to TCP streams.
Therefore: Synchronous dispatch of method calls doesn't change in a Thrifted XORP to begin with -- too much of the existing router code is written around this expectation. cheers, BMS From greearb at candelatech.com Thu Oct 29 11:15:33 2009 From: greearb at candelatech.com (Ben Greear) Date: Thu, 29 Oct 2009 11:15:33 -0700 Subject: [Xorp-hackers] XRL call serialization In-Reply-To: <4AE9D89F.7080406@incunabulum.net> References: <4ACD6B16.6080500@candelatech.com> <4ADC8017.9010507@incunabulum.net> <4ADCD472.2020203@candelatech.com> <4AE9D89F.7080406@incunabulum.net> Message-ID: <4AE9DBC5.8050600@candelatech.com> On 10/29/2009 11:02 AM, Bruce Simpson wrote: > Hi Ben, > > Just saw this... sorry for the delay. > > Ben Greear wrote: >> With regard to XRL, I've a question: >> >> If an application makes 3 XRL calls: >> >> do_a() >> do_b() >> commit_all() >> >> Is there any guarantee that these are strictly delivered to >> the peer process in the order called? Code appears to expect >> this to be true, but I'm suspicious that perhaps it does not. > > XORP processes are intended to be asynchronous; this is realized using > explicit coroutines, with no additional C++ runtime support other than > UNIX system calls. It seems that the router-mgr *might* could read and queue several xrl requests, and then possibly answer them out of order. (Been a few days since I poked at the router mgr code, not sure I fully understood it when I did). OSPF, at least, seems to expect the XRL calls (and responses) are serialized, at least in a few places. Considering TCP is the transport, if the rtr-mgr was made to be strictly serialized in handling requests for each client, that should do the trick. > You can see this in XORP processes and tools in the form of a ping-pong > between callback routines. Yes, I've seen this..but in other cases, it seems programmers got lazy and made assumptions that are *almost* always right. 
If I can reproduce the problems I saw in OSPF, I'll keep this async'ness in mind while debugging... Thanks, Ben -- Ben Greear Candela Technologies Inc http://www.candelatech.com From bms at incunabulum.net Thu Oct 29 11:34:08 2009 From: bms at incunabulum.net (Bruce Simpson) Date: Thu, 29 Oct 2009 18:34:08 +0000 Subject: [Xorp-hackers] Crash due to stale cached xrl sender pointer. In-Reply-To: <4AE9D45B.1090003@candelatech.com> References: <4AE799E1.8010300@candelatech.com> <4AE7D31B.1040107@candelatech.com> <4AE9B7D1.803@incunabulum.net> <4AE9BAE5.20406@candelatech.com> <4AE9CC6B.9050302@incunabulum.net> <4AE9D45B.1090003@candelatech.com> Message-ID: <4AE9E020.4060403@incunabulum.net> Ben Greear wrote: > ... > Please note that the sender will be marked in-active, so the XRL will > not actually > try to use it, but if the memory is gone, then it can't even check the > foo->active() > flag w/out crashing. > > It seems a pretty simple use-after-free bug, and the fix seems pretty > trivial to me. I'm pleased that you've found an issue, and come up with a fix that appears to work for you in the here and now. I would also class part of the issue you've run into as a design bug in XRL, and have tried to explain (as best I can) why I believe that is the case. I would prefer to know what the root cause of the transport pointer being invalidated is; this is mostly so that I can avoid introducing a similar situation in new code. However, I'm concerned that the suggested fix, actually makes the code more difficult to read than it already is. I'm not happy with ref_ptr, and it has been a source of problems for me in the past. Of course, it's worth bearing in mind that I am looking at this from a very critical viewpoint at the moment. 
;-) cheers, BMS From bms at incunabulum.net Thu Oct 29 12:09:16 2009 From: bms at incunabulum.net (Bruce Simpson) Date: Thu, 29 Oct 2009 19:09:16 +0000 Subject: [Xorp-hackers] XRL call serialization In-Reply-To: <4AE9DBC5.8050600@candelatech.com> References: <4ACD6B16.6080500@candelatech.com> <4ADC8017.9010507@incunabulum.net> <4ADCD472.2020203@candelatech.com> <4AE9D89F.7080406@incunabulum.net> <4AE9DBC5.8050600@candelatech.com> Message-ID: <4AE9E85C.60509@incunabulum.net> Hi Ben, Time for more devil's advocate action. Ben Greear wrote: > > It seems that the router-mgr *might* could read and queue several xrl > requests, and then possibly answer them out of order. Based on my recent footprints in XrlAction, that is very likely. But not because XRL is doing anything wrong. One of the things I wanted to mention in my previous reply on this thread: if you keep calling different XRL methods, reentrancy in the client isn't a problem -- you can tell your own requests apart just fine, they're for different methods. But but but: if we have multiple XRL calls in-flight, for the same method, this breaks down. Now the dispatch of the callback ('here's the answer to my question') will be on a per-call basis. And so the only guarantee you get of in-order dispatch, is the fact that XRL transport is using a stream (TCP out of the box). If you mix possibly co-dependent operations and fire them off, problems may happen. [Although the XRLs in these scenarios aren't being batched.] This is why the Router Manager is pretty tight about its timings, and keeping the XRL actions tied down to particular commit steps, is pretty critical to making sure stuff doesn't go out of control. Again, it might be worth revisiting Pavlin's original idea, that we teach the routing processes to keep their own snapshots of state and implement commit/rollback there. The more I stare at Thrift and XRL, the more I believe that's a good idea. 
It simplifies the Router Manager interface with the other processes. Although as you point out, we still need to keep those snapshots around in the Router Manager so that the process can restart OK -- either that or we give processes some abstract form of non-volatile storage we can easily propagate back to the management point at the point of commit.

However, you're quite right -- I see no reason why you can't introduce funk into the system from the Router Manager, the same way that olsr's register_rib() method might.

Consider this scenario -- let's imagine that xorp_olsr has crashed. It left a whole bunch of OLSR routes in the RIB. It is using a non-default admin distance. For whatever reason, this was configured on-the-fly, and was an uncommitted change. That process is restarted.

Along comes the existing register_rib() function. Let's assume the set-admin-distance step modifies the old origin table from the previous incarnation of xorp_olsr. Let's also assume that there is a redistribution policy in effect for OSPF, which is redistributing routes above a given admin distance on another interface to an OSPF backbone area.

You can see how that gets really interesting. As soon as the call to change the admin distance has fired, the routes will be rewritten to contain the new admin distance, the RIB will redistribute the routes (via policy) to xorp_ospf, and we've got a fair amount of system activity going on, just due to a process restart.

Fortunately, the RIB method to set the admin distance does not rewrite existing routes at the moment, and that was deliberately left unfinished (although not for this reason). So this scenario, whilst it's been elaborated on somewhat, isn't possible just now with the mechanisms I've described.

But it does point towards the need to either have a configurable policy for method disposition, or strong guarantees about the RPC layer behaving in-order. You end up having to rely on a reliable network transport.
You can assume that the XRL request you just got is to be executed right away, but only insofar as the transport you read from, has not re-ordered anything in transit. Reliability doesn't imply in-order delivery to the user process.

If you receive XRL requests out of order, you'll need to buffer them. If your transport isn't reliable, you have no way of knowing that you won't get an earlier message -- without implementing the concept of a time-out; i.e. if Mr Server don't see an out-of-order message within N time units, I will time it out and send you a NACK, to stop blocking all other access to the resource. [Sounds like kernel driver locking to me...]

Up until now, we have relied on TCP to do all of this for us behind the scenes. The price we pay for that is some inefficiency in the implementation: head-of-line blocking, and being unable to preserve RPC method boundaries. (This is why the AMQP guys have the hots for SCTP, but the SCTP guys can't do much about pushing the model forward until Microsoft sit up and take notice -- no-one's shipping SCTP as a Windows 7 NDIS/TDI driver, as far as I know.)

You can see why stuff like TIPC happens. But I seriously disagree with their approach. Pushing all asynchrony into the kernel isn't the answer, and it limits your client uptake -- Linux is not the only game in town, and there are very good reasons for that which I won't go into here. Just using the existing Berkeley Sockets API is cute, but far from perfect -- it has holes of its own. Also, they never really tackled the cross-language interop issue the way Thrift has.

...

So I guess it boils down to: caveat implementor. If you use XRL, don't rely on call serialization from the API. If you need to cross road after pushing button, do so. Otherwise, you might end up in a traffic accident. :-)

> (Been a few
> days since I poked at the router mgr code, not sure I fully understood
> it when I did).

There is a lot going on in there.
XRLs should be dispatched in the order in which they are received. However, there are actually no guarantees for this behaviour -- it is 'best effort'. When an XRL call is received, for example, STCPRequestHandler will attempt to dispatch it immediately, in line with further reception.

XRL targets are internally synchronous. The method call dispatch happens in the context of XrlRouter's event I/O callbacks, which are registered with the outer EventLoop. So from the server's point of view, XRL is pretty much synchronous. But, even on the same host, that dispatch could happen on another CPU. [As I've probably mentioned elsewhere, most of XORP's inter-process sync in the time domain, is actually pinioned on the host's socket buffer locks.]

The uncertainty in the whole system to do with time and call dispatch is however localized:
 * When/how did that XRL get fired off?
 * How are my socket buffers?
 * How many cores do I have?
 * How's my scheduling?

Just out of interest, I will reveal that as of this week, I have written most of the code generator needed to shim XRL calls directly into Thrift ones. This is so that adopting Thrift does not mean a dragnet across all 400+ KLOCs of XORP, but should make it a mostly drop-in replacement for XRL in the existing code.

I have yet to write most of the new libxipc, though, which is why I'm feeling that space out just now, and being pretty conservative in what I'm disclosing (people have got a whiff of what I'm doing; knowledge breeds expectations; expectations pump up the volume).

Thrift's C++ RPC libraries are actually pretty well written. They make it possible to pull off a few tricks for making the method calls a bit more scalable, and for providing guarantees about call serialization in a scalable system. However, making that work requires some additional movement.

As you can see, there are a few assumptions about how the whole system actually behaves, which are incorrect in some places.
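
The buffering mentioned earlier in this message ("If you receive XRL requests out of order, you'll need to buffer them") could look roughly like the sketch below. ReorderBuffer is a hypothetical class, not part of libxipc or Thrift; it assumes every request carries a sequence number starting at 0, and it leaves out the time-out/NACK handling entirely.

```cpp
#include <cstddef>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Hypothetical receiver-side reorder buffer. Requests are released to
// the caller strictly in sequence-number order; anything that arrives
// early is held until the gap before it has been filled.
class ReorderBuffer {
public:
    // Accept one request; return every request that is now deliverable.
    std::vector<std::string> receive(std::uint32_t seqno,
                                     const std::string& method) {
        std::vector<std::string> ready;
        if (seqno < next_)
            return ready;               // stale or duplicate: ignore it
        held_[seqno] = method;
        for (;;) {
            std::map<std::uint32_t, std::string>::iterator it =
                held_.find(next_);
            if (it == held_.end())
                break;                  // still waiting for this one
            ready.push_back(it->second);
            held_.erase(it);
            ++next_;
        }
        return ready;
    }

    // Number of requests parked while waiting for an earlier one.
    std::size_t held() const { return held_.size(); }

private:
    std::uint32_t next_ = 0;            // next sequence number to release
    std::map<std::uint32_t, std::string> held_;
};
```

Without a time-out, one lost request blocks everything behind it, which is the head-of-line problem the message attributes to relying on TCP.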
cheers,
BMS

From greearb at candelatech.com Thu Oct 29 12:16:41 2009
From: greearb at candelatech.com (Ben Greear)
Date: Thu, 29 Oct 2009 12:16:41 -0700
Subject: [Xorp-hackers] Crash due to stale cached xrl sender pointer.
In-Reply-To: <4AE9E020.4060403@incunabulum.net>
References: <4AE799E1.8010300@candelatech.com> <4AE7D31B.1040107@candelatech.com>
 <4AE9B7D1.803@incunabulum.net> <4AE9BAE5.20406@candelatech.com>
 <4AE9CC6B.9050302@incunabulum.net> <4AE9D45B.1090003@candelatech.com>
 <4AE9E020.4060403@incunabulum.net>
Message-ID: <4AE9EA19.5030702@candelatech.com>

On 10/29/2009 11:34 AM, Bruce Simpson wrote:
> Ben Greear wrote:
>> ...
>> Please note that the sender will be marked in-active, so the XRL will
>> not actually
>> try to use it, but if the memory is gone, then it can't even check the
>> foo->active()
>> flag w/out crashing.
>>
>> It seems a pretty simple use-after-free bug, and the fix seems pretty
>> trivial to me.
>
> I'm pleased that you've found an issue, and come up with a fix that
> appears to work for you in the here and now. I would also class part of
> the issue you've run into as a design bug in XRL, and have tried to
> explain (as best I can) why I believe that is the case.

If anything can ever delete a sender, and if we don't clean up outstanding XRLs when we delete the sender, then the bug exists.

Grep for 'destroy_sender' to see how xrl_router.cc can destroy them..because they are no longer 'live'. Base-class doesn't handle setting 'aliveness', so no idea what object is actually no longer thinking it is alive.

We either need to clean up those XRLs by invalidating their cache, remove the xrl sender cache entirely, or make sure the sender can't be deleted while XRLs referencing it exist.

The first is liable to be difficult. The second a performance penalty. The third used ref-ptrs and changed very little actual logic.
> I would prefer to know what the root cause of the transport pointer > being invalidated is; this is mostly so that I can avoid introducing a > similar situation in new code. > > However, I'm concerned that the suggested fix, actually makes the code > more difficult to read than it already is. I'm not happy with ref_ptr, > and it has been a source of problems for me in the past. Xorp is a royal pain in the arse to read, with all it's typedefs, deep class inheritance, auto-generated templated code (try to read the callback code some day..impossible), timers, xrl black hole, chained (and unchained, for that matter) callbacks, etc. It is one of the most hard to read pieces of code I've ever looked at (only the original Vocal project was just barely worse, primarily because it was threaded and full of bugs). Maybe Thrift will help..but if it's just yet another indirection or black hole of magic, I doubt it. > Of course, it's worth bearing in mind that I am looking at this from a > very critical viewpoint at the moment. ;-) I think if I spent more than 2 days looking at XRL I'd rip it's guts out and re-implement it entirely. I don't envy your task, and I hope it works out well. Thanks, Ben -- Ben Greear Candela Technologies Inc http://www.candelatech.com From bms at incunabulum.net Thu Oct 29 13:07:47 2009 From: bms at incunabulum.net (Bruce Simpson) Date: Thu, 29 Oct 2009 20:07:47 +0000 Subject: [Xorp-hackers] More on XRL and Thrift. 
In-Reply-To: <4AE9EA19.5030702@candelatech.com> References: <4AE799E1.8010300@candelatech.com> <4AE7D31B.1040107@candelatech.com> <4AE9B7D1.803@incunabulum.net> <4AE9BAE5.20406@candelatech.com> <4AE9CC6B.9050302@incunabulum.net> <4AE9D45B.1090003@candelatech.com> <4AE9E020.4060403@incunabulum.net> <4AE9EA19.5030702@candelatech.com> Message-ID: <4AE9F613.8080504@incunabulum.net> Ben Greear wrote: > > Xorp is a royal pain in the arse to read, with all it's typedefs, deep > class inheritance, > auto-generated templated code (try to read the callback code some > day..impossible), > timers, xrl black hole, chained (and unchained, for that matter) > callbacks, etc. > > It is one of the most hard to read pieces of code I've ever looked at > (only the original > Vocal project was just barely worse, primarily because it was threaded > and full of bugs). > Maybe Thrift will help..but if it's just yet another indirection or > black hole of > magic, I doubt it. XRL and Thrift bear some comparison. The one that sprang to mind just as I put my dinner on the hob, was this one: Thrift draws a clean distinction between the RPC transport, and the representation used on the transport. XRL does have such a distinction, but it isn't as clear cut for the transport, or the representation. So there is quite a bit of bleed-through across the code for each XrlPF. In places this leads to some dire performance problems due to the level of indirection involved. Thrift, on the other hand, is pretty lean and mean in its C++ library. What I'll aim to do is to release parts of the Thrifted XORP tree I'm comfortable with and which could use further review. Earlier in this thread, we saw a situation which could only arise because in XRL, we attempt to cache a pointer to the transport endpoint, in the client-side RPC stub itself, with no way that pointer could be cleanly invalidated. 
Xrl's use of XrlPFSender here is strictly as a cache -- class Xrl does not participate in the life cycle of XrlPFSender, beyond being used by it. This is actually a really good use of a Boost weak_ptr. It is quite literally an observation of a shared_ptr. That pointer can happen to be invalid. But the situation arose in the first place because of XRL's granularity being per-method only, which is why I'd argue it's a design bug. But let's flip back to how callback stubs are generated, and end up in libfubarxif.so. The XrlFooClient object is instantiated. Whilst it's associated with an XrlRouter to begin with, this is in fact an association with the stereotype XrlSender (a thing which can send XRLs). We don't interact with this object much beyond invoking its send() method, when the client calls XrlFooClient::send_foo(). Part of the problem here is that XRL attempts to do call resolution per-method. In a Thrifted world, the XrlSender can inform the XrlFooClient object that the endpoint changed, but this need only be on a per-service basis; there's no need to cache every single method call resolution, as the XRL foo_xif.cc stubs currently attempt to, and as you saw, this just caused problems when the endpoints themselves were possibly subject to a race. Broadly, in Thrift, the equivalent of that Xrl::resolved_sender() pointer is the output transport pointer. Because the transport is just something which can be written to, to issue an RPC call, we can deal with the semantics of moving to a new endpoint outside of the scope of that call. In fact, we may be best off providing the XORP apps a TMemoryBuffer to scribble the shimmed XRL calls into. This means a transport is always available. Because we're then dealing with a binary blob, rather than a local copy of an Xrl frankenblob, dispatching it is really easy, and we can cache the endpoint to our heart's content, probably using a boost::weak_ptr to boot. 
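
A small sketch of the weak-pointer idea, assuming std::weak_ptr and std::shared_ptr from C++11 as stand-ins for the Boost versions named above. Transport, Router and ClientStub are hypothetical toy types, not Thrift or libxipc classes. The stub only observes the transport; the router alone owns it, so when the router drops the endpoint the stub can tell, instead of holding a dangling raw pointer.

```cpp
#include <memory>
#include <string>

// Hypothetical toy types for illustration only.
struct Transport {
    std::string endpoint;
};

struct Router {
    std::shared_ptr<Transport> transport;   // the only owner

    void connect(const std::string& ep) {
        transport = std::make_shared<Transport>();
        transport->endpoint = ep;
    }
    void drop() { transport.reset(); }      // endpoint went away
};

struct ClientStub {
    std::weak_ptr<Transport> cached;        // observes, never owns

    // Return a live transport, refreshing the cache from the router
    // when the observed one has expired. May return an empty pointer
    // if the router has nothing to offer right now.
    std::shared_ptr<Transport> transport(Router& r) {
        std::shared_ptr<Transport> t = cached.lock();
        if (!t) {
            cached = r.transport;
            t = cached.lock();
        }
        return t;
    }
};
```

Unlike a reference-counted handle held by the stub, the cache here never extends the transport's lifetime; it only reports whether the transport is still there.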
We can then let libxipc decide how to route it, in a runtime scope where we are more in control of the endpoint situation, rather than being at the mercy of a lone pointer.

>
> I think if I spent more than 2 days looking at XRL I'd rip its guts
> out and
> re-implement it entirely. I don't envy your task, and I hope it works
> out well.

No comment. ;-)

I am generally pleased with how it's going, and got much closer to the action this week. It took a lot of reading to figure out that what is being attempted is in fact possible, but it's going to take a bit of effort to get it off the ground.

thanks,
BMS

From greearb at candelatech.com Thu Oct 29 20:24:42 2009
From: greearb at candelatech.com (Ben Greear)
Date: Thu, 29 Oct 2009 20:24:42 -0700
Subject: [Xorp-hackers] rtrmgr crash on SIGABRT because of pop_front in task_done
In-Reply-To: <633162.71070.qm@web58707.mail.re1.yahoo.com>
References: <633162.71070.qm@web58707.mail.re1.yahoo.com>
Message-ID: <4AEA5C7A.9070807@candelatech.com>

Li Zhao wrote:
> I added a new protocol and I can start it in CLI by command "create protocol XXX", but the rtrmgr crashed after command "delete protocol XXX".
> I can also easily reproduce the exactly same crash via the following steps:
>
> 0. I am running xorp processes on an embedded system.
> 1. start rtrmgr from linux shell on the system;
> 2. manually start xorp_static_routes from linux shell. This static will hijack the xrl channels to rtrmgr;
> 3. use cli command "create protocol static" to start a second xorp_static_routes.
> 4. use cli command "delete protocol static" to stop static. both xorp_static_routes were terminated. depended process like fea, rib and policy were also terminated. rtrmgr crash.
>

Ok, the crash is because if you do a pop_front() on an empty list, it's going to crash.

I'm not sure why the list is empty here. Seems to indicate task-manager logic is busted with regard to task list management and/or callbacks are being called against a wrong task-manager.
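
A minimal sketch of guarding the pop, with hypothetical names (ToyTaskQueue, spurious_done) that are not taken from rtrmgr's task.cc. Calling pop_front() on an empty std::list is undefined behaviour, so the completion handler checks first and counts the unexpected calls. A guard like this hides the crash; it does not explain why the list was empty.

```cpp
#include <list>
#include <string>

// Hypothetical stand-in for a task list; not the real TaskManager.
struct ToyTaskQueue {
    std::list<std::string> tasks;
    int spurious_done = 0;   // completions that arrived with no task queued

    // Completion handler: only pop when there is something to pop.
    // Returns false (and counts the event) for an unexpected completion.
    bool task_done() {
        if (tasks.empty()) {
            ++spurious_done;
            return false;
        }
        tasks.pop_front();
        return true;
    }
};
```

Counting the spurious completions keeps some evidence around for tracking down which caller fired the extra callback.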
Do you actually need to do this operation for your project? If so, you probably will want to investigate task-manager logic in detail to figure out why this is happening. The attached patch fixes the crash, but the underlying bug persists. Most of the patch is debugging code, but I'm leaving it in my tree because it will help next time we hit a similar problem. Thanks, Ben -- Ben Greear Candela Technologies Inc http://www.candelatech.com -------------- next part -------------- A non-text attachment was scrubbed... Name: patch0.patch Type: text/x-patch Size: 3306 bytes Desc: not available Url : http://mailman.ICSI.Berkeley.EDU/pipermail/xorp-hackers/attachments/20091029/eb2b6df3/attachment.bin From greearb at candelatech.com Thu Oct 29 22:30:25 2009 From: greearb at candelatech.com (Ben Greear) Date: Thu, 29 Oct 2009 22:30:25 -0700 Subject: [Xorp-hackers] PATCH: Enable libxipc tests Message-ID: <4AEA79F1.5050507@candelatech.com> FYI: Here are stats from my 2.4Ghz E5530 system: Patch to enable these tests is attached (and pushed to my tree). [root at i7-dqc-1 tests]# ./test_xrl_receiver& [root at i7-dqc-1 tests]# ./test_xrl_sender XrlAtoms per call = 1 Send method = pipeline start_transmission_cb 100 Okay Received 10000 XRLs; delta_time = 0.738458 secs; speed = 13541.731554 XRLs/s start_transmission_cb 100 Okay Received 10000 XRLs; delta_time = 0.439395 secs; speed = 22758.565755 XRLs/s start_transmission_cb 100 Okay Received 10000 XRLs; delta_time = 0.408516 secs; speed = 24478.845382 XRLs/s start_transmission_cb 100 Okay Received 10000 XRLs; delta_time = 0.407115 secs; speed = 24563.084141 XRLs/s Thanks, Ben -- Ben Greear Candela Technologies Inc http://www.candelatech.com -------------- next part -------------- A non-text attachment was scrubbed... 
Name: xorp_xipc_tests.patch Type: text/x-patch Size: 5858 bytes Desc: not available Url : http://mailman.ICSI.Berkeley.EDU/pipermail/xorp-hackers/attachments/20091029/7127b404/attachment-0001.bin From greearb at candelatech.com Thu Oct 29 22:47:43 2009 From: greearb at candelatech.com (Ben Greear) Date: Thu, 29 Oct 2009 22:47:43 -0700 Subject: [Xorp-hackers] PATCH: Enable libxipc tests In-Reply-To: <4AEA79F1.5050507@candelatech.com> References: <4AEA79F1.5050507@candelatech.com> Message-ID: <4AEA7DFF.5050002@candelatech.com> Ben Greear wrote: > FYI: Here are stats from my 2.4Ghz E5530 system: Here's oprofile output for a similar test (test_xrl_sender) -- Ben Greear Candela Technologies Inc http://www.candelatech.com -------------- next part -------------- An embedded and charset-unspecified text was scrubbed... Name: xrl_test_oprofile_summary.txt Url: http://mailman.ICSI.Berkeley.EDU/pipermail/xorp-hackers/attachments/20091029/c4a6c46e/attachment.txt From lizhaous2000 at yahoo.com Fri Oct 30 07:23:25 2009 From: lizhaous2000 at yahoo.com (Li Zhao) Date: Fri, 30 Oct 2009 07:23:25 -0700 (PDT) Subject: [Xorp-hackers] rtrmgr crash on SIGABRT because of pop_front in task_done In-Reply-To: <4AEA5C7A.9070807@candelatech.com> Message-ID: <150870.8613.qm@web58705.mail.re1.yahoo.com> I have three cases in which this crash occurred. The one you set up is one of them. I used your fix. It did prevent rtrmgr from crashing in all three cases. That is good. But I am afraid that is not the root cause, because the task manager always checks that the task list is not empty before it runs any task. I will keep debugging to look for the root cause and will let you know if I find anything. Thank you for spending time on this.
Li --- On Thu, 10/29/09, Ben Greear wrote: > From: Ben Greear > Subject: Re: [Xorp-hackers] rtrmgr crash on SIGABRT because of pop_front in task_done > To: "Li Zhao" > Cc: xorp-hackers at icir.org > Date: Thursday, October 29, 2009, 11:24 PM > Li Zhao wrote: > > I added a new protocol and I can start it in CLI by > command "create protocol XXX", but the rtrmgr crashed after > command "delete protocol XXX". > > I can also easily reproduce the exact same crash via > the following steps: > > > > 0. I am running xorp processes on an embedded system. > > 1. start rtrmgr from linux shell on the system; > > 2. manually start xorp_static_routes from linux shell. > This static will hijack the xrl channels to rtrmgr; > > 3. use cli command "create protocol static" to start a > second xorp_static_routes. > > 4. use cli command "delete protocol static" to stop > static. both xorp_static_routes were terminated. dependent > processes like fea, rib and policy were also terminated. > rtrmgr crashed. > > > Ok, the crash is because if you do a pop_front() on an > empty list, it's going to crash. > > I'm not sure why the list is empty here. Seems to > indicate task-manager logic is busted > with regard to task list management and/or callbacks are > being called against a wrong > task-manager. > > Do you actually need to do this operation for your > project? If so, you probably will want > to investigate task-manager logic in detail to figure out > why this is happening. > > The attached patch fixes the crash, but the underlying bug > persists. Most of the patch is debugging > code, but I'm leaving it in my tree because it will help > next time we hit a similar problem. > > Thanks, > Ben > > -- Ben Greear > Candela Technologies Inc
http://www.candelatech.com > > > > -----Inline Attachment Follows----- > > From greearb at candelatech.com Fri Oct 30 07:29:55 2009 From: greearb at candelatech.com (Ben Greear) Date: Fri, 30 Oct 2009 07:29:55 -0700 Subject: [Xorp-hackers] Omitting XrlDB from Router Manager In-Reply-To: <4AE7178B.9000709@incunabulum.net> References: <4AE7178B.9000709@incunabulum.net> Message-ID: <4AEAF863.7000500@candelatech.com> Bruce Simpson wrote: > Hi all, > > I'm still looking at the XRL replacement since I got back from > holiday, which is why I've been mostly silent on lists. > > Something came up in analysis, which broadly relates to Ben > Greear's work on reducing Router Manager startup times, etc. and some > of the questions Li Zhao has been asking in other threads on this list. > > @Ben: It would be interesting to know what difference omitting the > XRLDB code makes to your Router Manager startup times. > * The XRLDB seems to exist pretty much to validate what's in the > template files and how the Router Manager uses them, although this is > done completely at run time. > * I wonder if disabling this code would make a difference to performance. > * To do this, I'd hack rtrmgr/template_commands.cc, and comment out > the calls to the XRLdb methods. > * The rtrmgr/xrldb.cc is the only place in the whole system where the > '*.xrls' files are parsed and used. They are used only to validate the > syntax and structure of potential XRL method calls. > * It would mean that there is no up-front validation of the XRLs, but > in practice, this validation step is probably only of interest to > people developing XORP, to catch problems with template files. > * It's probably best folded under a compile-time #define for developer > use. Something like the attached patch? Thanks, Ben -- Ben Greear Candela Technologies Inc http://www.candelatech.com -------------- next part -------------- A non-text attachment was scrubbed... 
Name: xorp_xrldb_verification.patch Type: text/x-patch Size: 4454 bytes Desc: not available Url : http://mailman.ICSI.Berkeley.EDU/pipermail/xorp-hackers/attachments/20091030/28957ef2/attachment.bin From lizhaous2000 at yahoo.com Fri Oct 30 07:30:30 2009 From: lizhaous2000 at yahoo.com (Li Zhao) Date: Fri, 30 Oct 2009 07:30:30 -0700 (PDT) Subject: [Xorp-hackers] rtrmgr crash on SIGABRT because of pop_front in task_done In-Reply-To: <4AE9D059.20102@candelatech.com> Message-ID: <982408.82472.qm@web58702.mail.re1.yahoo.com> I thought task manager was fine. But it might be that the first node was deleted twice, one of which is this pop_front and another hidden one. --- On Thu, 10/29/09, Ben Greear wrote: > From: Ben Greear > Subject: Re: [Xorp-hackers] rtrmgr crash on SIGABRT because of pop_front in task_done > To: "Li Zhao" > Cc: xorp-hackers at icir.org > Date: Thursday, October 29, 2009, 1:26 PM > On 10/29/2009 08:16 AM, Li Zhao > wrote: > > I am puzzled by operator delete(prt=0x0). But inside > deallocate(this=0x8d55238, __p=0x8d55238), the __p is not > 0x0. pop_front means "removes and deletes". So somewhere > else this list node was deleted again? > > > > --- On Thu, 10/29/09, Li Zhao > wrote: > > > >> From: Li Zhao > >> Subject: [Xorp-hackers] rtrmgr crash on SIGABRT > because of pop_front in task_done > >> To: xorp-hackers at icir.org > >> Date: Thursday, October 29, 2009, 10:54 AM > >> I added a new protocol and I can > >> start it in CLI by command "create protocol XXX", > but the > >> rtrmgr crashed after command "delete protocol > XXX". > >> I can also easily reproduce the exact same crash > via the > >> following steps: > >> > >> 0. I am running xorp processes on an embedded > system. > >> 1. start rtrmgr from linux shell on the system; > >> 2. manually start xorp_static_routes from linux > shell. This > >> static will hijack the xrl channels to rtrmgr; > >> 3. use cli command "create protocol static" to > start a > >> second xorp_static_routes. > >> 4.
use cli command "delete protocol static" to > stop static. > >> both xorp_static_routes were terminated. dependent > processes > >> like fea, rib and policy were also terminated. > rtrmgr > >> crashed. > > I ran under valgrind, and saw this info: > >
> ==27820== Invalid free() / delete / delete[]
> ==27820==    at 0x4A05E3F: operator delete(void*) (vg_replace_malloc.c:342)
> ==27820==    by 0x463531: __gnu_cxx::new_allocator > >::deallocate(std::_List_node*, unsigned long) (new_allocator.h:95)
> ==27820==    by 0x462427: std::_List_base > >::_M_put_node(std::_List_node*) (stl_list.h:320)
> ==27820==    by 0x46143B: std::list std::allocator > >::_M_erase(std::_List_iterator) (stl_list.h:1431)
> ==27820==    by 0x45FF0B: std::list std::allocator >::pop_front() (stl_list.h:906)
> ==27820==    by 0x45DB73: TaskManager::task_done(bool, std::string const&) (task.cc:2256)
> ==27820==    by 0x465970: XorpMemberCallback2B0 std::string const&>::dispatch(bool, std::string const&) (callback_nodebug.hh:4636)
> ==27820==    by 0x45C540: Task::step8_report() (task.cc:1998)
> ==27820==    by 0x4659DF: XorpMemberCallback0B0::dispatch() (callback_nodebug.hh:306)
> ==27820==    by 0x449613: Module::terminate_with_prejudice(ref_ptr > >) (module_manager.cc:218)
> ==27820==    by 0x44F63C: XorpMemberCallback0B1 ref_ptr > >::dispatch() (callback_nodebug.hh:598)
> ==27820==    by 0x549D72: OneoffTimerNode2::expire(XorpTimer&, void*) (timer.cc:167)
> ==27820==  Address 0x50c9340 is 80 bytes inside a block of size 200 alloc'd
> ==27820==    at 0x4A06FFC: operator new(unsigned long) (vg_replace_malloc.c:230)
> ==27820==    by 0x42C81F: MasterConfigTree::MasterConfigTree(std::string const&, MasterTemplateTree*, ModuleManager&, XorpClient&, bool, bool) (master_conf_tree.cc:119)
> ==27820==    by 0x406ED6: Rtrmgr::run() (main_rtrmgr.cc:319)
> ==27820==    by 0x407E57: main (main_rtrmgr.cc:665)
> > > It appears to me that the task-manager object (this) is > already deleted when > the taskmanager::task_done() method is called. > > Could probably add some debugging to the destructors and > constructors of TaskManager > to verify. I have some other things to do first, but > will look at this a bit later > if no one beats me to it. > > Thanks, > Ben > > -- > Ben Greear > Candela Technologies Inc http://www.candelatech.com > > From greearb at candelatech.com Fri Oct 30 07:48:44 2009 From: greearb at candelatech.com (Ben Greear) Date: Fri, 30 Oct 2009 07:48:44 -0700 Subject: [Xorp-hackers] rtrmgr crash on SIGABRT because of pop_front in task_done In-Reply-To: <982408.82472.qm@web58702.mail.re1.yahoo.com> References: <982408.82472.qm@web58702.mail.re1.yahoo.com> Message-ID: <4AEAFCCC.8070802@candelatech.com> Li Zhao wrote: > I thought task manager was fine. But it might be that the first node was deleted twice, one of which is this pop_front and another hidden one. > > The task-manager is fine. (See the assert_not_deleted() check in my patch). I bet if you added print statements around adding/deleting tasks, and print out the 'this' pointer, you'd learn something interesting... Ben -- Ben Greear Candela Technologies Inc http://www.candelatech.com From greearb at candelatech.com Fri Oct 30 11:47:43 2009 From: greearb at candelatech.com (Ben Greear) Date: Fri, 30 Oct 2009 11:47:43 -0700 Subject: [Xorp-hackers] PATCH: Enable compiling with gprof support Message-ID: <4AEB34CF.5050500@candelatech.com> Also, this allows writing config vars to a file so an external program (maybe a packager or installer) can use them automatically. Thanks, Ben -- Ben Greear Candela Technologies Inc http://www.candelatech.com -------------- next part -------------- An embedded and charset-unspecified text was scrubbed...
Name: xorp_gprof.patch Url: http://mailman.ICSI.Berkeley.EDU/pipermail/xorp-hackers/attachments/20091030/d4c0d162/attachment.ksh From bms at incunabulum.net Sat Oct 31 09:53:53 2009 From: bms at incunabulum.net (Bruce Simpson) Date: Sat, 31 Oct 2009 16:53:53 +0000 Subject: [Xorp-hackers] Omitting XrlDB from Router Manager In-Reply-To: <4AEAF863.7000500@candelatech.com> References: <4AE7178B.9000709@incunabulum.net> <4AEAF863.7000500@candelatech.com> Message-ID: <4AEC6BA1.3010007@incunabulum.net> Ben Greear wrote: >> >> * The rtrmgr/xrldb.cc is the only place in the whole system where the >> '*.xrls' files are parsed and used. They are used only to validate >> the syntax and structure of potential XRL method calls. >> * It would mean that there is no up-front validation of the XRLs, but >> in practice, this validation step is probably only of interest to >> people developing XORP, to catch problems with template files. >> * It's probably best folded under a compile-time #define for >> developer use. > > Something like the attached patch? Great stuff :-) Does it work for you? Have you seen any measurable increase in performance for production systems? I have actually chopped the entire Router Manager from my dev branch. There are parts of libxipc which are neither used nor needed by anything but the Finder or Router Manager, and aren't essential for knitting processes together. I'll be merging it back on a piecemeal basis once I've actually got the Thrift protocol working.
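The compile-time guard suggested in the quoted text might look roughly like the sketch below. This is an illustration only and not the contents of xorp_xrldb_verification.patch: the macro name VALIDATE_XRLS, the functions xrl_syntax_ok and check_template_xrl, and the deliberately crude syntax test are all assumptions made for the example.

```cpp
#include <string>

// Hypothetical, deliberately loose syntax test standing in for the
// validation that the XRLdb performs against the '*.xrls' files.
bool xrl_syntax_ok(const std::string& xrl) {
    // Illustration only: require at least one '/' separator.
    return xrl.find('/') != std::string::npos;
}

// Hypothetical hook called while loading template files.
// Build with -DVALIDATE_XRLS (an assumed macro name) to turn the
// developer-time check on; otherwise it compiles down to "accept".
bool check_template_xrl(const std::string& xrl) {
#ifdef VALIDATE_XRLS
    return xrl_syntax_ok(xrl);
#else
    (void)xrl;      // validation compiled out
    return true;
#endif
}
```

With the macro left undefined, production builds skip the check entirely, while developers editing template files can still enable it.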
From greearb at candelatech.com Sat Oct 31 15:51:52 2009 From: greearb at candelatech.com (Ben Greear) Date: Sat, 31 Oct 2009 15:51:52 -0700 Subject: [Xorp-hackers] Omitting XrlDB from Router Manager In-Reply-To: <4AEC6BA1.3010007@incunabulum.net> References: <4AE7178B.9000709@incunabulum.net> <4AEAF863.7000500@candelatech.com> <4AEC6BA1.3010007@incunabulum.net> Message-ID: <4AECBF88.7030704@candelatech.com> Bruce Simpson wrote: > Ben Greear wrote: >>> >>> * The rtrmgr/xrldb.cc is the only place in the whole system where >>> the '*.xrls' files are parsed and used. They are used only to >>> validate the syntax and structure of potential XRL method calls. >>> * It would mean that there is no up-front validation of the XRLs, >>> but in practice, this validation step is probably only of interest >>> to people developing XORP, to catch problems with template files. >>> * It's probably best folded under a compile-time #define for >>> developer use. >> >> Something like the attached patch? > > Great stuff :-) Does it work for you? Have you seen any measurable > increase in performance for production systems? > > I have actually chopped the entire Router Manager from my dev branch. > There are parts of libxipc which are neither used nor needed by > anything but the Finder or Router Manager, and aren't essential for > knitting processes together. I'll be merging it back on a piecemeal > basis once I've actually got the Thrift protocol working. It can't hurt, but I didn't do any performance tests specifically for this change. It does seem to function fine, however. My bigger problem is an N^2 problem with routes and number of routers: with 100 routers and 300 routes each, I get extreme numbers of netlink route update messages on each router. I'm patching the kernel to allow netlink to bind to a particular routing table, so I should get rid of all the unneeded route updates for other routers' tables. Hope to test this in a day or two.
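A back-of-the-envelope sketch of the scaling described above. It rests on two assumptions that are not stated in the message: each route install produces exactly one netlink update, and without per-table binding every router process receives the updates for every routing table. The function names are hypothetical.

```cpp
#include <cstdint>

// Updates a single router process has to read.
// Assumption: one netlink message per installed route.
std::uint64_t updates_seen_per_router(std::uint64_t routers,
                                      std::uint64_t routes_each,
                                      bool per_table_binding) {
    // With binding, a router only sees updates for its own table.
    return per_table_binding ? routes_each : routers * routes_each;
}

// Message deliveries summed over all router processes.
std::uint64_t total_deliveries(std::uint64_t routers,
                               std::uint64_t routes_each,
                               bool per_table_binding) {
    return routers *
           updates_seen_per_router(routers, routes_each, per_table_binding);
}
```

Under those assumptions, 100 routers with 300 routes each means 30,000 updates per router and 3,000,000 deliveries in total; with per-table binding this falls to 300 per router and 30,000 in total, which is the quadratic-to-linear change the kernel patch is aiming for.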
Do you have an estimate for when you plan to post your changes? Thanks, Ben -- Ben Greear Candela Technologies Inc http://www.candelatech.com