[Bro-Dev] Queueing in Broker?

Wed Feb 14 08:39:47 PST 2018

>> And if you still planned on message routing/auto-forwarding being more
>> widely used, I think you would want to buffer the message while the
>> longest subscribed *path* has a down node?
> 
> I was thinking to do the buffering at the routing/hop-level. The
> message would get as far as it can at first. If a peer is down that a
> node would have normally forwarded to, it'd buffer for a bit until
> that comes back (but I realize this makes it even more fuzzy which
> peers to wait for: in a flexible topology peers could come and go all
> the time; see below).
> 
> That said, I'm now wondering if such buffering functionality should
> really be located inside CAF, as that's in charge of low-level message
> propagation.

CAF already implements cumulative ACKs. Combine this with send buffers, snapshotting and a cluster manager and you have fault-tolerant pipelines with automatic redeployment/failover - in theory. That’s all up in the air of course, since we don’t have the manpower to fully flesh this out at the moment. However, many prerequisites are already there (such as ACKs on a per-batch level and customization points in stream managers to deal with errors) that we could leverage for this use case.

I think your use case is simple enough that we can make a few additions to CAF and then implement this in Broker-land. Let me outline a solution here:
- on disconnect, keep the outbound path alive
- add new data to the path’s buffer up to a maximum size (or until a timeout)
- include some form of unique identifier (host name? configured ID?) in handshakes
- rebind and resume sends on an outbound path if a client reconnects

An outbound path in a CAF stream is essentially a buffer with additional state for batch ID and credit bookkeeping. Does the solution outlined above make sense? It would have "at least once" semantics: after a reconnect, the receiving peer can see a message a second time if it had already processed it but didn’t get the chance to ACK it before the disconnect. Just pointing it out.
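
To make the outline a bit more concrete, here is a minimal, self-contained sketch of the idea. It does not use CAF at all: BufferedPath, Batch and every member name are hypothetical stand-ins I made up for illustration, and the real outbound path also does credit bookkeeping that is left out here. It only shows the four steps: keep the path on disconnect, buffer up to a maximum, identify the peer by an ID from the handshake, and resend everything unACKed after a rebind.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <deque>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a batch on an outbound path (not a CAF type).
struct Batch {
  std::uint64_t id;
  std::string payload;
};

// Hypothetical outbound path that survives a disconnect.
class BufferedPath {
public:
  BufferedPath(std::string peer_id, std::size_t max_buffered)
      : peer_id_(std::move(peer_id)), max_buffered_(max_buffered) {
    assert(max_buffered_ > 0);
  }

  // Appends new data. Returns false once the buffer is full, so the
  // caller can decide to drop data or give up on the peer.
  bool enqueue(std::string payload) {
    if (unacked_.size() >= max_buffered_)
      return false;
    unacked_.push_back(Batch{next_id_++, std::move(payload)});
    return true;
  }

  // Returns the batches to put on the wire now. Nothing is sent while
  // the peer is disconnected; data just accumulates in the buffer.
  std::vector<Batch> flush() {
    std::vector<Batch> out;
    if (!connected_)
      return out;
    for (; sent_ < unacked_.size(); ++sent_)
      out.push_back(unacked_[sent_]);
    return out;
  }

  // Cumulative ACK: drops every batch with an ID up to and including id.
  void ack(std::uint64_t id) {
    while (!unacked_.empty() && unacked_.front().id <= id) {
      unacked_.pop_front();
      if (sent_ > 0)
        --sent_;
    }
  }

  // Keeps the path (and its buffer) alive instead of tearing it down.
  void on_disconnect() { connected_ = false; }

  // Rebinds only if the ID from the handshake matches. The send cursor
  // is rewound, so sent-but-unACKed batches go out again: this is where
  // the "at least once" duplicates come from.
  bool on_reconnect(const std::string& handshake_id) {
    if (handshake_id != peer_id_)
      return false;
    connected_ = true;
    sent_ = 0;
    return true;
  }

  std::size_t buffered() const { return unacked_.size(); }

private:
  std::string peer_id_;
  std::size_t max_buffered_;
  std::deque<Batch> unacked_;  // everything not yet ACKed by the peer
  std::size_t sent_ = 0;       // number of buffered batches already sent
  std::uint64_t next_id_ = 1;  // next batch ID to assign
  bool connected_ = true;
};
```

A timeout could be layered on top by recording the disconnect time in on_disconnect() and discarding the path if on_reconnect() comes too late. I left that out to keep the sketch short.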

Disclaimer: I’m weeks away from finishing work in my topic/streaming branch. After that point it’s straightforward to give you a scaffold for this.

>> Yeah, I'm also unclear if there's any way you can tell if the peer is
>> supposed to be permanent vs. transient in some cases.
> 
> We could make that an explicit endpoint option: "for this peer, on
> disconnect buffer stuff it would normally receive until it comes back
> (subject to some limits)". We may need a better way to identify the
> same peer though, just IP probably wouldn't work well. Maybe through
> some ID/name sent during the handshake? One would need to configure
> such a name for peers when turning on the buffering.

Yes, I think a custom ID via the caf-application.ini is the simplest solution. Using the hostname is an option too, as long as users make sure hostnames in their network are unique.

>> Last observation is that I think any of these types of changes would
>> be to the internal messaging pattern/protocol and so maybe reasonable
>> to change/improve in subsequent releases in a way that's transparent
>> to users.
> 
> Yeah, nothing to get in immediately, still needs some thinking. I'm
> getting the sense though that we'll need it for some applications,
> osquery being the main one on my mind.

That’s good to know. I will keep this in mind as a topic for later, when my topic branch is merged back to master.

    Dominik