[Zeek] Performance hit with long lived flows

Nabil Memon nabilmemon.ec at gmail.com
Thu Apr 9 23:58:32 PDT 2020

Hi Zeek,

Hope you're all doing well.
I have a large 4 GB PCAP, and I am replaying it for many iterations at 10
Gbps through a load balancer, with multiple Bro instances running on top of
it.
One iteration takes around 3.5 seconds, which is close to the theoretical
minimum of 4 GB x 8 / 10 Gbps = 3.2 s. I have to run many iterations of the
same PCAP because I don't have a test network that can pump 10 Gbps of
traffic into my software, and I don't have a PCAP large enough that a
single pass would run for a long time.

Other than proper (SYN-SYNACK-ACK ... FIN/RST) TCP flows, Bro holds on to
all the other connections. If the run lasts, say, an hour, it only reports
those connections after the test is over. This particular scenario is
test-specific, but the need to handle long-lived flows is a valid one.

I tried the *connection_status_update* way of handling this. If the update
interval is configured to 10 min, Bro starts dropping packets at around 40
min; if the interval is kept to 1 min, the problems start at around 4 min.
I could not figure out why Bro misbehaves at roughly (interval * 4). One
parameter that I think plays a role is the time a PCAP takes to complete
one iteration, but that still doesn't help me come up with any theory.
Once the drops start, the number of Broccoli sockets keeps increasing, and
Bro then goes into an unresponsive state.
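
For reference, the way I enabled this looked roughly like the following
sketch. It assumes the standard connection_status_update_interval redef
(which defaults to 0 secs, i.e. disabled); the handler body is just a
placeholder:

```zeek
# Sketch: enable periodic status updates for every open connection.
# The event fires once per connection every update interval.
redef connection_status_update_interval = 10 min;

event connection_status_update(c: connection)
    {
    # Inspect the still-open connection here, e.g. report how long
    # it has been alive so far.
    print fmt("update for %s after %s", c$uid, c$duration);
    }
```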

I also tried connection polling using ConnPolling::watch(). From what I
observed, this approach is definitely better than
*connection_status_update*: it takes a bit longer before drops appear, and
it does not degenerate into the very bad state of ever-increasing sockets
and unresponsiveness.
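
The polling variant was along these lines (a sketch; as I understand the
API, the callback's return value is the delay until the next poll, and a
negative interval stops the polling for that connection):

```zeek
@load base/protocols/conn

function poll_cb(c: connection, cnt: count): interval
    {
    # Called on each polling round for this connection; per-connection
    # work (duration checks, logging, etc.) would go here.
    # Return the delay until the next poll; a negative interval
    # stops watching this connection.
    return 30 secs;
    }

event new_connection(c: connection)
    {
    # Start polling every new connection every 30 seconds.
    ConnPolling::watch(c, poll_cb, 0, 30 secs);
    }
```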

I also tried schedule, and it didn't serve my purpose either.

After trying everything Bro suggests for handling this, I came up with my
own implementation:
redef record connection += {
  loop_count: count &default=1;
};

global connection_status_interval = 1 min;
global connTable: table[string] of conn_id = table();

event connection_state_remove(c: connection)
  {
  delete connTable[c$uid];
  }

event new_connection(c: connection)
  {
  connTable[c$uid] = c$id;
  }

event checkConnectionInterval()
  {
  local conn: connection;
  for ( uid1 in connTable )
    {
    conn = lookup_connection(connTable[uid1]);
    if ( conn$duration >= connection_status_interval * conn$loop_count )
      {
      handle_connection_data(conn, T);  # my own handler
      conn$loop_count += 1;
      }
    }
  schedule 30 secs { checkConnectionInterval() };
  }

event zeek_init()
  {
  schedule 30 secs { checkConnectionInterval() };
  }

As you can see, I maintain my own connection table and, with the help of
schedule, I manage to scan the table every 30 seconds and compare each
connection's duration with the configured interval.
I haven't explored whether scheduled events run in the same main thread.
If they do, then obviously this scan can hold up Bro's main
packet-processing thread, and walking a big table of such entries could do
serious damage. To mitigate that, I also thought of scanning in batches,
using two such tables and a flag that selects which table new connections
are filled into. Scanning in batches will surely help the overall
balancing of the software.
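
The two-table batching idea could be sketched like this (the names are
hypothetical; each tick scans only the table that is not currently
collecting new connections, then flips the flag, so every tick touches
only about half the entries):

```zeek
global fill_into_A = T;
global connTableA: table[string] of conn_id = table();
global connTableB: table[string] of conn_id = table();

event new_connection(c: connection)
    {
    # New connections go into whichever table is currently "filling".
    if ( fill_into_A )
        connTableA[c$uid] = c$id;
    else
        connTableB[c$uid] = c$id;
    }

event connection_state_remove(c: connection)
    {
    # The connection is in exactly one table; deleting a missing
    # index is a no-op, so just delete from both.
    delete connTableA[c$uid];
    delete connTableB[c$uid];
    }

event scanBatch()
    {
    # Scan only the table that is *not* currently being filled.
    local scan = fill_into_A ? connTableB : connTableA;
    for ( uid in scan )
        {
        if ( ! connection_exists(scan[uid]) )
            next;
        local conn = lookup_connection(scan[uid]);
        # ... duration check as in the single-table version ...
        }
    fill_into_A = !fill_into_A;
    schedule 30 secs { scanBatch() };
    }
```

A connection stays in the table it was first added to, so each connection
is still checked regularly, just every other tick, halving the per-tick
work.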

With this approach, I am having a successful run.

I just want to know what you all think about this, keeping everything (the
test scenario, the overall system's condition, etc.) in mind.
Can Bro's suggested approaches work on real 10 Gbps network traffic?
Any suggestions on how I can simulate 10 Gbps of real network traffic
containing the protocols or conversations I am interested in?

Nabil Jada