[Xorp-hackers] Using Netlink to lookup forwarding entries in Linux kernel

Tue, 21 Oct 2003 12:54:16 -0700

I was testing how the FEA adds and deletes unicast forwarding
entries in the kernel, and I found the following problems
when using the Netlink mechanism with the Linux kernel:

* The kernel doesn't appear to support looking up a subnet
  address from user space.
  Example:
  If we install a route entry for 10.30.0.0/16 in the
  kernel, and we send a request to the kernel to look up subnet
  10.30.0.0/16, we would expect the kernel to return a
  netlink message that contains the previously installed
  information. E.g., the returned info should contain at least the
  subnet mask length of 16. Instead, the kernel returns the result
  of looking up address 10.30.0.0/32, which could be different from
  entry 10.30.0.0/16 (e.g., it could be based on the info of the more
  specific entry 10.30.0.0/24).
  BTW, the returned 10.30.0.0/32 entry is "cloned" (more on this
  subject later).

  Only when we fetch the whole forwarding table from the kernel does
  the information for each entry match the information that was
  installed.
  After reading the source code for iproute2 (which contains the
  "ip" utility and is presumably the example of how to use the Netlink
  interface), I found that the way it supports a command like
  "ip route list exact 10.30.0.0/24", which basically looks up the
  exact network routing entry, is to:
  1. Get the whole forwarding table
  2. Go through the list of all entries, and select the one that
  exactly matches the request (if such an entry exists).

  Obviously, the overhead of always fetching the forwarding table
  and then filtering/selecting in user space may increase
  considerably as the forwarding table grows.
  However, I couldn't find any other solution to the problem (there
  is no documentation, and reading the source code and playing with
  the Netlink interface were fruitless), hence I had to use the same
  mechanism inside the FEA.
  On the upside, looking up a specific subnet address is currently
  needed only for debugging purposes, hence I don't expect that we
  should worry much about the overhead of reading the whole
  forwarding table.
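
  To make the difference concrete, here is a minimal Python sketch of
  the two lookups described above. It is only an illustration, not
  the FEA or iproute2 code: the table is a hypothetical in-memory
  snapshot of a forwarding table dump, and the next-hop addresses and
  function names are made up for the example.

```python
import ipaddress

# Hypothetical snapshot of a forwarding table dump: (prefix, next hop).
# The next-hop addresses are invented for illustration.
TABLE = [
    (ipaddress.ip_network("10.30.0.0/16"), "192.168.1.1"),
    (ipaddress.ip_network("10.30.0.0/24"), "192.168.1.2"),
]


def lookup_longest_match(table, address):
    """Host-style lookup: return the most specific entry covering address."""
    addr = ipaddress.ip_address(address)
    candidates = [(net, nexthop) for net, nexthop in table if addr in net]
    if not candidates:
        return None
    # The entry with the longest prefix length wins.
    return max(candidates, key=lambda entry: entry[0].prefixlen)


def lookup_exact(table, prefix):
    """Exact lookup: walk the whole table and return the entry whose
    prefix is identical to the requested one, if any."""
    wanted = ipaddress.ip_network(prefix)
    for net, nexthop in table:
        if net == wanted:
            return (net, nexthop)
    return None
```

  With this sample table, lookup_longest_match(TABLE, "10.30.0.0")
  selects the /24 entry, while lookup_exact(TABLE, "10.30.0.0/16")
  returns the /16 entry as installed. The exact lookup costs one
  pass over the whole table, which is the overhead discussed above.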

* When we look up a host address from user space, the returned result
  is actually a "cloned" entry inside the kernel. E.g., if we have
  installed 10.30.0.0/16 in the kernel, and we look up destination
  address 10.30.0.10, the kernel will internally create a cloned
  entry for 10.30.0.10/32, and will return that result.
  However, if we delete entry 10.30.0.0/16, it looks like the
  cloned 10.30.0.10/32 entry will remain in the kernel for up to 2
  seconds or so, and then it will be automatically deleted.

  Hence, if I try to perform the following test:
  1. Install routing entry for 10.30.0.0/16
  2. Test that the kernel returns a valid route for destination
     10.30.0.10
  3. Delete routing entry for 10.30.0.0/16
  4. Test that the kernel does NOT have a valid route for
     destination 10.30.0.10 (assuming no other matching entries were
     installed previously in the kernel).

  the test will fail at step 4, because the kernel will return the
  obsolete cloned entry for 10.30.0.10/32.

  Only if I wait at least 2-3 seconds between step 3 and step 4
  will the test succeed.
  For now I don't have a reasonable solution to the problem except
  to explicitly modify the above test so that in step 4 we
  always wait for 3 seconds before sending the request to the
  FEA.
  However, looking up a specific host address is currently used only
  for debugging purposes, so I expect that the above behavior
  would create problems only in our test scripts (solvable by the
  "sleep 3" hack).

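  As a possible variation on the fixed wait in step 4, a test script
  could poll until the stale entry disappears, giving up after a
  deadline. The Python sketch below is only an assumption about how
  such a helper might look: has_route stands for a hypothetical
  callable that asks the FEA whether a route for the destination
  exists, and the timeout and interval values are arbitrary.

```python
import time


def wait_until_no_route(has_route, timeout=3.0, interval=0.1):
    """Poll has_route() until it reports no route or timeout expires.

    has_route is a hypothetical zero-argument callable that returns
    True while the kernel still answers the lookup (for example, from
    a stale cloned entry). Returns True if the route went away in
    time, False if it was still present at the deadline.
    """
    deadline = time.monotonic() + timeout
    while True:
        if not has_route():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

  Step 4 would then pass if wait_until_no_route(...) returns True.
  This keeps the worst case at about 3 seconds, but lets the test
  finish sooner when the cloned entry expires early.
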
FWIW, in *BSD the routing sockets interface doesn't appear to have
the above problems.

Any comments or suggestions on whether we should handle those
problems differently?

Pavlin