[Zeek-Dev] Proposal: Improve Zeek's log-writing system with batch support and better status reporting

Fri Jul 17 02:54:04 PDT 2020

On Thu, Jul 16, 2020 at 17:15 -0700, Bob Murphy wrote:

> Here’s how it would work:

It would be helpful to see a draft API for the full batch writing
functionality to see how the pieces would work together. Could you
mock that up?

That said, a couple of thoughts:

> 2. The failure_type value would still indicate generally what
> happened, with predefined values indicating things like “network
> failure”, “protocol error”, “unable to write to disk”, or
> “unspecified failure”.

In my experience, such detailed numerical error codes are rarely
useful in practice. Different writers will implement them to different
degrees and associate different semantics with them, and callers will
never quite know what to expect and how to react.

Do you actually need to distinguish the semantics for all these
different cases? Seems an alternative would be having a small set of
possible "impact" values telling the caller what to do. To take a
stab:

    - temporary error: failed, but should try again with same log data
    - error: failed, and trying same log data again won't help; but ok to continue with new log data
    - fatal error: panic, shut down the writer.

Depending on who's going to log failures, we could also just include a
textual error message. Logging is where more context seems most
useful, I'd say.
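
To make that concrete, here is a rough sketch of an impact-style
status with an optional message. All names (WriteImpact, WriteStatus,
CallerAction, NextAction) are hypothetical and made up for
illustration; none of this is existing Zeek API.

```cpp
#include <string>

// Hypothetical: what a failed write means for the caller, rather than
// a detailed cause.
enum class WriteImpact {
    Ok,         // write succeeded
    Temporary,  // failed, but should try again with the same log data
    Error,      // failed, retrying the same data won't help; new data is ok
    Fatal,      // panic, shut down the writer
};

// Hypothetical status object returned by a writer.
struct WriteStatus {
    WriteImpact impact = WriteImpact::Ok;
    std::string message;  // free-form text, only meant for logging
};

// Hypothetical caller-side reactions, one per impact value.
enum class CallerAction { Continue, RetrySameData, DropAndContinue, ShutdownWriter };

// The caller only needs to look at the impact to decide what to do.
CallerAction NextAction(const WriteStatus& status) {
    switch (status.impact) {
        case WriteImpact::Ok:        return CallerAction::Continue;
        case WriteImpact::Temporary: return CallerAction::RetrySameData;
        case WriteImpact::Error:     return CallerAction::DropAndContinue;
        case WriteImpact::Fatal:     return CallerAction::ShutdownWriter;
    }
    // Not reached for valid enum values; be conservative otherwise.
    return CallerAction::ShutdownWriter;
}
```

The point is that the caller's logic stays a simple switch, and
anything writer-specific goes into the message.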

> 3. first_index and index_count would specify a range. That way, if
> several successive log records aren’t sent for the same reason, that
> could be represented by a single struct, instead of a different struct
> for each one.

One reason I'm asking about the full API is that I'm not sure who
owns the logs that fail to write. Is the writer keeping them? If so,
it could handle the retry case internally. If the writer discards
them after a failure, and the caller needs to send the data again,
I wonder if there's a simpler return type here where we just point
to the first failed entry in the batch. The writer would simply
abort on first failure (how likely is it really that the next write
succeeds immediately afterwards?).
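
A minimal sketch of that simpler return type, assuming the caller
keeps ownership of the batch. BatchResult, WriteBatch, Unsent, and the
write_one callback are all hypothetical names for illustration, not
part of the proposal or of Zeek.

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// Hypothetical result: every record before first_failed was written.
struct BatchResult {
    std::size_t first_failed;  // equals the batch size if nothing failed
};

// Hypothetical batch write that aborts on the first failure.
// write_one stands in for whatever a concrete writer does per record.
BatchResult WriteBatch(const std::vector<std::string>& batch,
                       const std::function<bool(const std::string&)>& write_one) {
    for (std::size_t i = 0; i < batch.size(); ++i) {
        if (!write_one(batch[i]))
            return BatchResult{i};  // stop here, don't try later records
    }
    return BatchResult{batch.size()};
}

// Caller side: the records still to be sent are [first_failed, end).
std::vector<std::string> Unsent(const std::vector<std::string>& batch,
                                const BatchResult& result) {
    auto first = batch.begin() + static_cast<std::ptrdiff_t>(result.first_failed);
    return std::vector<std::string>(first, batch.end());
}
```

With this, the caller gets back one index instead of a list of range
structs, and resending is just "everything from first_failed on".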

And just to be clear why I'm making all these comments: I'm worried
about the difficulty of using this API, on both ends. The more complex
we make the things being passed around, the more difficult it gets to
implement the logic correctly and efficiently.

Robin

-- 
Robin Sommer * Corelight, Inc. * robin at corelight.com * www.corelight.com