[Xorp-hackers] Patch to fix issue with interface removal and multicast.

Pavlin Radoslavov pavlin at ICSI.Berkeley.EDU
Tue Sep 2 15:25:12 PDT 2008


Ben,

XORP has to be able to handle interfaces that disappear, including
pure software-created interfaces like tunnels and VPNs.
Hence, please submit a bug about the issue including a little bit
more detailed instructions how to reproduce it. E.g., please include
the starting configuration, and your monitoring script.

I am afraid that the patch itself cannot be accepted for
the following two reasons:
* It hides the real problem instead of fixing it.
* Performing error checking by testing the content of the error
  message itself is not acceptable: it is error prone and is not
  efficient.
However, the patch could be useful to point toward the problem,
hence feel free to include it in the Bugzilla entry.

Thanks,
Pavlin

Ben Greear <greearb at candelatech.com> wrote:

> The attached patch fixes (or hacks around) some problems I see with
> certain scenarios related to removing interfaces dynamically when
> multicast is enabled.  It's possible this problem is because of my other
> fea changes, but I think it's a problem with the base logic.
> 
> There is likely more that one issue here, but I think this use case described
> below would probably reproduce this on un-patched xorp configured with
> multicast.
> 
> My case is with my patched xorp, and the patch is on top of my other patches.
> 
> Here is the scenario:
> 
> Suppose a xorp instance has 3 interfaces, br0, eth0, eth1, and suppose eth0
> and eth1
> are usb NICs that can dissappear when removed from the machine (I'm using
> different virtual interfaces, but the logic is similar.)
> 
> I have a control program that notices (or causes) interface removal, and when
> it does,
> it sends cli commands to xorpsh to remove the config related to that interface.
> 
> Now, xorp is all fine with 3 interfaces.
> ...
> eth0 is removed
>  my monitor program initiates xorpsh commands to remove eth0 config info
> eth1 is removed
>  xorp notices eth0 is removed and cleans up fea logic.
>  xorp notices eth1 is removed and cleans up fea logic.
>  xorpsh commands above get to the 'commit' stage
>  monitor program queues additional commands to remove eth1 config info
> 
> The commit fails, because the commit tries to set info for eth1, and the
> device is no
> longer in the kernel.  Since this commit fails, the next one to remove eth1
> will as well
> (because eth0 is still in the config, and still not in the OS).
> 
> The scenario above can happen in several different orders, but the basic
> problem is that
> xorp commit cannot deal with interfaces being removed from the OS at arbitrary
> times.
> 
> The attached patch ignores failure cases related to interfaces being removed.
> I think
> it would be better if we could return a warning message to the xorpsh without
> failing the commit, but that is beyond the scope of what I have time to
> accomplish
> currently.
> 
> Please consider applying any part of this you see as useful, and/or coming up
> with something
> better.
> 
> Thanks,
> Ben
> 
> 
> -- 
> 
> Ben Greear <greearb at candelatech.com> Candela Technologies Inc
> http://www.candelatech.com
> 
> 
> diff --git a/fea/data_plane/ifconfig/ifconfig_set.cc b/fea/data_plane/ifconfig/ifconfig_set.cc
> index 5693373..9362d3c 100644
> --- a/fea/data_plane/ifconfig/ifconfig_set.cc
> +++ b/fea/data_plane/ifconfig/ifconfig_set.cc
> @@ -140,9 +140,11 @@ IfConfigSet::push_config(const IfTree& iftree)
>  		// XXX: ignore deleted interfaces that are not recognized
>  		continue;
>  	    }
> -	    error_reporter.interface_error(config_iface.ifname(),
> -					   "interface not recognized");
> -	    break;
> +	    // Maybe it was already deleted from xorp due to OS removal.  We should
> +	    // just ignore this one
> +	    //error_reporter.interface_error(config_iface.ifname(),
> +	    //			   "interface not recognized");
> +	    //break;
>  	}
>  
>  	//
> @@ -244,6 +246,7 @@ IfConfigSet::push_config(const IfTree& iftree)
>  
>  	    push_vif_creation(system_ifp, system_vifp, config_iface,
>  			      config_vif);
> +
>  	}
>      }
>  
> @@ -648,8 +651,24 @@ IfConfigSet::push_vif_address(const IfTreeInterface*	system_ifp,
>  			       config_iface, config_vif, config_addr,
>  			       error_msg)
>  	    != XORP_OK) {
> -	    error_msg = c_format("Failed to add address: %s",
> -				 error_msg.c_str());
> +	    if (strstr(error_msg.c_str(), "No such device")) {
> +		XLOG_ERROR("Failed to configure address because of device not found: %s",
> +			   error_msg.c_str());
> +		// We can't pass this back as an error to the cli because the device is gone,
> +		// and if we fail this set, the whole commit will fail.  This commit could be
> +		// deleting *another* device, with a (very near) future set of xorpsh commands
> +		// coming in to delete *this* interface in question.
> +		// This is a hack...we really need a way to pass warnings back to the
> +		// cli but NOT fail the commit if the logical state is correct but the
> +		// (transient, unpredictable, asynchronously discovered) state of the actual
> +		// OS network devices is currently out of state.
> +		// --Ben
> +		error_msg = "";
> +	    }
> +	    else {
> +		error_msg = c_format("Failed to add address, not device-no-found error: %s",
> +				     error_msg.c_str());
> +	    }
>  	}
>      } else {
>  	//
> @@ -720,8 +739,24 @@ IfConfigSet::push_vif_address(const IfTreeInterface*	system_ifp,
>  			       config_iface, config_vif, config_addr,
>  			       error_msg)
>  	    != XORP_OK) {
> -	    error_msg = c_format("Failed to configure address: %s",
> -				 error_msg.c_str());
> +	    if (strstr(error_msg.c_str(), "No such device")) {
> +		XLOG_ERROR("Failed to configure address because of device not found: %s",
> +			   error_msg.c_str());
> +		// We can't pass this back as an error to the cli because the device is gone,
> +		// and if we fail this set, the whole commit will fail.  This commit could be
> +		// deleting *another* device, with a (very near) future set of xorpsh commands
> +		// coming in to delete *this* interface in question.
> +		// This is a hack...we really need a way to pass warnings back to the
> +		// cli but NOT fail the commit if the logical state is correct but the
> +		// (transient, unpredictable, asynchronously discovered) state of the actual
> +		// OS network devices is currently out of state.
> +		// --Ben
> +		error_msg = "";
> +	    }
> +	    else {
> +		error_msg = c_format("Failed to configure address, not device-no-found error: %s",
> +				     error_msg.c_str());
> +	    }
>  	}
>      } else {
>  	//
> diff --git a/fea/data_plane/ifconfig/ifconfig_set_netlink_socket.cc b/fea/data_plane/ifconfig/ifconfig_set_netlink_socket.cc
> index 0880b5d..2381d6b 100644
> --- a/fea/data_plane/ifconfig/ifconfig_set_netlink_socket.cc
> +++ b/fea/data_plane/ifconfig/ifconfig_set_netlink_socket.cc
> @@ -961,7 +961,7 @@ IfConfigSetNetlinkSocket::add_addr(const string& ifname,
>      if (ns.sendto(&buffer, nlh->nlmsg_len, 0,
>  		  reinterpret_cast<struct sockaddr*>(&snl), sizeof(snl))
>  	!= (ssize_t)nlh->nlmsg_len) {
> -	error_msg = c_format("Cannot add address '%s' "
> +	error_msg = c_format("IfConfigSetNetlinkSocket::add_addr: sendto: Cannot add address '%s' "
>  			     "on interface '%s' vif '%s': %s",
>  			     addr.str().c_str(),
>  			     ifname.c_str(), vifname.c_str(), strerror(errno));
> @@ -970,7 +970,7 @@ IfConfigSetNetlinkSocket::add_addr(const string& ifname,
>      if (NlmUtils::check_netlink_request(_ns_reader, ns, nlh->nlmsg_seq,
>  					last_errno, error_msg)
>  	!= XORP_OK) {
> -	error_msg = c_format("Cannot add address '%s' "
> +	error_msg = c_format("IfConfigSetNetlinkSocket::add_addr: check_nl_req: Cannot add address '%s' "
>  			     "on interface '%s' vif '%s': %s",
>  			     addr.str().c_str(),
>  			     ifname.c_str(), vifname.c_str(),
> diff --git a/fea/mfea_node.cc b/fea/mfea_node.cc
> index 04027d7..a216414 100644
> --- a/fea/mfea_node.cc
> +++ b/fea/mfea_node.cc
> @@ -1125,7 +1125,7 @@ MfeaNode::enable_vif(const string& vif_name, string& error_msg)
>  {
>      MfeaVif *mfea_vif = vif_find_by_name(vif_name);
>      if (mfea_vif == NULL) {
> -	error_msg = c_format("Cannot enable vif %s: no such vif",
> +	error_msg = c_format("MfeaNode:  Cannot enable vif %s: no such vif",
>  			     vif_name.c_str());
>  	XLOG_ERROR("%s", error_msg.c_str());
>  	return (XORP_ERROR);
> @@ -1153,7 +1153,9 @@ MfeaNode::disable_vif(const string& vif_name, string& error_msg)
>  	error_msg = c_format("Cannot disable vif %s: no such vif",
>  			     vif_name.c_str());
>  	XLOG_ERROR("%s", error_msg.c_str());
> -	return (XORP_ERROR);
> +        // If it's gone, it's as disabled as it can be.
> +	//return (XORP_ERROR);
> +        return XORP_OK;
>      }
>      
>      mfea_vif->disable();
> @@ -1211,7 +1213,10 @@ MfeaNode::stop_vif(const string& vif_name, string& error_msg)
>  	error_msg = c_format("Cannot stop vif %s: no such vif",
>  			     vif_name.c_str());
>  	XLOG_ERROR("%s", error_msg.c_str());
> -	return (XORP_ERROR);
> +	// If it doesn't exist, it's as stopped as it's going to get.  Returning
> +	// error will cause entire commit to fail.
> +	//return (XORP_ERROR);
> +	return XORP_OK;
>      }
>      
>      if (mfea_vif->stop(error_msg) != XORP_OK) {
> diff --git a/fea/xrl_mfea_node.cc b/fea/xrl_mfea_node.cc
> index 79a1afd..b834f21 100644
> --- a/fea/xrl_mfea_node.cc
> +++ b/fea/xrl_mfea_node.cc
> @@ -1407,8 +1407,10 @@ XrlMfeaNode::mfea_0_1_enable_vif(
>      else
>  	ret_code = MfeaNode::disable_vif(vif_name, error_msg);
>  
> -    if (ret_code != XORP_OK)
> -	return XrlCmdError::COMMAND_FAILED(error_msg);
> +    if (ret_code != XORP_OK) {
> +        XLOG_ERROR("Mfea, enable/disable vif failed.  Allowing commit to succeed anyway since this is likely a race with a deleted interface, error: %s\n", error_msg.c_str());
> +	//return XrlCmdError::COMMAND_FAILED(error_msg);
> +    }
>      
>      return XrlCmdError::OKAY();
>  }
> diff --git a/mld6igmp/mld6igmp_node.cc b/mld6igmp/mld6igmp_node.cc
> index bd4610d..d0b70b6 100644
> --- a/mld6igmp/mld6igmp_node.cc
> +++ b/mld6igmp/mld6igmp_node.cc
> @@ -1028,7 +1028,7 @@ Mld6igmpNode::enable_vif(const string& vif_name, string& error_msg)
>  {
>      Mld6igmpVif *mld6igmp_vif = vif_find_by_name(vif_name);
>      if (mld6igmp_vif == NULL) {
> -	error_msg = c_format("Cannot enable vif %s: no such vif",
> +	error_msg = c_format("Mld6igmpNode:  Cannot enable vif %s: no such vif",
>  			     vif_name.c_str());
>  	XLOG_ERROR("%s", error_msg.c_str());
>  	return (XORP_ERROR);
> @@ -1056,7 +1056,9 @@ Mld6igmpNode::disable_vif(const string& vif_name, string& error_msg)
>  	error_msg = c_format("Cannot disable vif %s: no such vif",
>  			     vif_name.c_str());
>  	XLOG_ERROR("%s", error_msg.c_str());
> -	return (XORP_ERROR);
> +	// It's as disabled as it will get, don't fail commit.
> +	error_msg = "";
> +	return XORP_OK;
>      }
>      
>      mld6igmp_vif->disable();
> diff --git a/pim/pim_node.cc b/pim/pim_node.cc
> index 7ea543c..e18feee 100644
> --- a/pim/pim_node.cc
> +++ b/pim/pim_node.cc
> @@ -1155,7 +1155,7 @@ PimNode::enable_vif(const string& vif_name, string& error_msg)
>  {
>      PimVif *pim_vif = vif_find_by_name(vif_name);
>      if (pim_vif == NULL) {
> -	error_msg = c_format("Cannot enable vif %s: no such vif",
> +	error_msg = c_format("PimNode:  Cannot enable vif %s: no such vif",
>  			     vif_name.c_str());
>  	XLOG_ERROR("%s", error_msg.c_str());
>  	return (XORP_ERROR);
> @@ -1183,7 +1183,9 @@ PimNode::disable_vif(const string& vif_name, string& error_msg)
>  	error_msg = c_format("Cannot disable vif %s: no such vif",
>  			     vif_name.c_str());
>  	XLOG_ERROR("%s", error_msg.c_str());
> -	return (XORP_ERROR);
> +	// It's as disabled as it's going to get..don't fail the commit.
> +	error_msg = "";
> +	return XORP_OK;
>      }
>      
>      pim_vif->disable();
> diff --git a/pim/xrl_pim_node.cc b/pim/xrl_pim_node.cc
> index 45da17e..6cb38cd 100644
> --- a/pim/xrl_pim_node.cc
> +++ b/pim/xrl_pim_node.cc
> @@ -1322,8 +1322,9 @@ XrlPimNode::mfea_client_send_register_unregister_protocol_cb(
>  	//
>  	// If a command failed because the other side rejected it, this is
>  	// fatal.
> +	// Shouldn't be fatal, network device can suddenly dissappear, for instance.
>  	//
> -	XLOG_FATAL("Cannot %s protocol with the MFEA: %s",
> +	XLOG_ERROR("Cannot %s protocol with the MFEA: %s",
>  		   entry->operation_name(), xrl_error.str().c_str());
>  	break;
>  
> _______________________________________________
> Xorp-hackers mailing list
> Xorp-hackers at icir.org
> http://mailman.ICSI.Berkeley.EDU/mailman/listinfo/xorp-hackers



More information about the Xorp-hackers mailing list