From robin at corelight.com Wed Jul 1 01:59:03 2020
From: robin at corelight.com (Robin Sommer)
Date: Wed, 1 Jul 2020 08:59:03 +0000
Subject: [Zeek-Dev] Log archival (Re: Zeek Supervisor: designing client and log archival behavior)
In-Reply-To:
References:
Message-ID: <20200701085903.GI33767@corelight.com>

On Tue, Jun 30, 2020 at 01:39 -0700, Jon Siwek wrote:

> * https://github.com/zeek/zeek/wiki/Zeek-Supervisor-Log-Handling

This overall sounds good to me. Some notes & questions:

> Log Rotation
>
> To help bridge/replace Step (4) and (5), suggest adding a new option:
> Log::default_rotation_dir. The Log::rotation_format_func() will use
> this as part of its default return value.

Seems we should then set this to "." by default, and have the cluster framework override it.

> The log_mgr will attempt to create necessary dirs just-in-time,
> failing to do so emits an error, but otherwise continues with rotation
> using working directory instead.

I'd extend this to any error case: if moving from the current location to Log::default_rotation_dir fails (e.g., because the latter is on a different file system), continue with the new name inside the current working directory (and report the error).

Once moved, I suppose we would continue to optionally run a post-processor, right? For a supervised cluster, we wouldn't use that and suggest that people go with "zeek-archiver" instead; but with ZeekControl we'd keep the current gzipping behavior so that we don't break any setups. We can implement that distinction through the post-processor function: the new default function would just do the rename according to the new scheme, and a separate legacy function for ZeekControl would spawn the "archive-log" script.

> zeek-archiver

I like making this a standard tool, but it seems like something we could postpone doing right now and prioritize getting the Zeek-side infrastructure in place.

> We can potentially have the Zeek Supervisor process configurable to
> auto-start and keep a zeek-archiver child alive.

I'd say that's a job for systemd (or whatever service manager). I know Seth disagrees. :-)

> Leftover Log Rotation
>
> The rotation for such a leftover log file uses the metadata in the
> shadow file to help it go through the exact rotation that should
> have occurred, including running the postprocessor function.

Not sure it's worth retaining the information about the post-processor function, and it could potentially lead to trouble if the function changed somehow in between (or disappeared). We could instead just run the leftovers through whatever the restarted config says to do with files.

Do we even need any other metadata at all in the new scheme? I'm wondering if we could simplify this all to: "If at open() time, X.log exists, first rotate it away through the currently configured postprocessor function". If we did that, we should probably have a global boolean that allows choosing between that and just overwriting existing files. The latter would be the default to retain current command-line behavior, and the cluster framework would enable leftover recovery.

Hmm, actually, there's a piece of metadata that we'll need: the opening timestamp, so that one can incorporate that into the name of the rotated file (assuming we want to retain that capability). Unless we parsed that out of the X.log itself ...

Robin

-- 
Robin Sommer * Corelight, Inc.
* robin at corelight.com * www.corelight.com

From robin at corelight.com Wed Jul 1 02:00:38 2020
From: robin at corelight.com (Robin Sommer)
Date: Wed, 1 Jul 2020 09:00:38 +0000
Subject: [Zeek-Dev] Supervisor client (Re: Zeek Supervisor: designing client and log archival behavior)
In-Reply-To:
References:
Message-ID: <20200701090038.GJ33767@corelight.com>

> * https://github.com/zeek/zeek/wiki/Zeek-Supervisor-Client

Some thoughts on the commands:

> $ zeekc status [all | ]
> Do we need to include any other metrics in the returned status?

That information is mostly static; it would be nice to get some dynamic information in there as well, like uptime and CPU/memory/traffic stats. No need to have that right away, but worth keeping in mind.

> # Do we need more categories to filter by (e.g. node type) ?

I'd skip that for now.

> # If there's downed nodes at this point, what do we expect users to do?
> # Check the standard services logs for stderr/stdout info? Check reporter.log ?

Yeah, it would be cool if zeekc had access to the stderr/stdout from the nodes through their supervisors. The supervisors could buffer that for a while and return it on request. More generally, the supervisor could get a "diagnostics buffer" that, over time, we could use for more stuff like storing backtraces etc. "reporter.log" is out, I'd say; that will go through the normal log rotation & archival, and be accessible that way.

> # A `zeekc diag` command could help gather information, like ask Zeek supervisor
> # to find core dumps and extract stack trace. Would it do more than that, like
> # show last N lines of downed nodes' stderr, or last N lines of reporter.log?

> $ zeekc check

I'm wondering which supervisor that would be talking to in a multi-system setup? All?

> $ zeekc terminate
> ...
> # Normally wouldn't terminate the supervisor if a service-manager is handling
> # the Zeek supervisor process itself and will just restart it, but `terminate`
> # would be helpful for anyone running a supervised Zeek cluster "manually".

Another use case: if for some reason one wants to restart the supervisor itself, "terminate" would kill it and the service manager would then restart it.

Robin

-- 
Robin Sommer * Corelight, Inc. * robin at corelight.com * www.corelight.com

From robin at corelight.com Wed Jul 1 02:02:08 2020
From: robin at corelight.com (Robin Sommer)
Date: Wed, 1 Jul 2020 09:02:08 +0000
Subject: [Zeek-Dev] Zeek Supervisor Command-Line Client
In-Reply-To:
References: <20200618071141.GH9200@corelight.com> <20200619083810.GE49063@corelight.com>
Message-ID: <20200701090208.GK33767@corelight.com>

On Tue, Jun 30, 2020 at 14:29 -0700, Jon Siwek wrote:

> Maybe the important observation is that the logic can be performed
> anywhere that has access to the Zeek-Supervisor process.

Agree.

> So where we put the logic at this point may not be important. If we
> can find a single-best-place for the logic to live, that's great

I believe that's what Seth is arguing for: have a Zeek-side script be the single point of that logic, rather than implement it multiple times and/or outside of Zeek. I can see doing that in Zeek, but I think there's a trade-off here: if we want to do the single-place approach with a multi-system setup, we'd need an authoritative place to run this logic and hence depend on *that* Zeek supervisor being up and running for performing the operation.
That may be a reasonable assumption (say, if we dedicated the supervisor running the manager to also be the cluster coordinator), but it's different from a world where the client can execute higher-level operations on its own.

Robin

-- 
Robin Sommer * Corelight, Inc. * robin at corelight.com * www.corelight.com

From jsiwek at corelight.com Wed Jul 1 14:03:52 2020
From: jsiwek at corelight.com (Jon Siwek)
Date: Wed, 1 Jul 2020 14:03:52 -0700
Subject: [Zeek-Dev] Log archival (Re: Zeek Supervisor: designing client and log archival behavior)
In-Reply-To: <20200701085903.GI33767@corelight.com>
References: <20200701085903.GI33767@corelight.com>
Message-ID:

On Wed, Jul 1, 2020 at 1:59 AM Robin Sommer wrote:
>
> > Log::default_rotation_dir
>
> Seems we should then set this to "." by default, and have the cluster
> framework override it.

Yes, exactly.

> Once moved, I suppose we would continue to optionally run a
> post-processor, right? For a supervised cluster, we wouldn't use that
> and suggest that people go with "zeek-archiver" instead; but with
> ZeekControl we'd keep the current gzipping behavior so
> that we don't break any setups.

Yes, with the proposed changes, custom postprocessors still work the same as before and everything is backwards compatible / equivalent in non-supervised mode. Supervised mode just picks some different default settings from non-supervised mode:

* don't use a postprocessing script (archive-log)
* rotate into a `Log::default_rotation_dir` of "log-queue" instead of "."

> Not sure it's worth retaining the information about the post-processor
> function, and it could potentially lead to trouble if the function
> changed somehow in between (or disappeared). We could instead just run
> the leftovers through whatever the restarted config says to do with
> files.

* Disappeared: easy to notice the function no longer exists and fall back to the default post-processor.
* Changed: running through a function of the same name that happened to get changed between restarts is probably still going to be closer to what the user expects than running it through the default post-processor, which is completely different?

> Do we even need any other metadata at all in the new scheme? I'm
> wondering if we could simplify this all to: "If at open() time, X.log
> exists, first rotate it away through the currently configured
> postprocessor function".

What if an open() rarely or never happens again for a given log? I'm thinking the rotation of leftover logs needs to happen once at startup rather than lazily.

> Hmm, actually, there's a piece of metadata that we'll need: the opening
> timestamp, so that one can incorporate that into the name of the
> rotated file (assuming we want to retain that capability). Unless we
> parsed that out of the X.log itself ...

Don't think we'd have the opening timestamp to parse from the log when LogAscii::use_json=T. So I still think it's necessary to obtain open-time metadata from a `.shadow.X.log`, either stored explicitly in there or derived from the file's modified time (essentially its creation time). The close-time of X.log is just taken as the last-modified time of X.log.

- Jon
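To make the shadow-file idea above concrete, here is a minimal C++ sketch of recovering the two timestamps for a leftover log. It assumes a hypothetical shadow-file layout (the open timestamp on the first line) and approximates the close time with the log's mtime, as described in the thread; none of this is Zeek's actual log manager code, and all names are illustrative.

    // Sketch only: hypothetical shadow-file format, not Zeek's implementation.
    #include <sys/stat.h>
    #include <fstream>
    #include <optional>
    #include <string>
    #include <utility>

    // Returns {open_time, close_time} for a leftover X.log, or nothing if the
    // shadow file or the log itself can't be read.
    std::optional<std::pair<time_t, time_t>>
    LeftoverLogTimes(const std::string& log_path, const std::string& shadow_path) {
        std::ifstream shadow(shadow_path);
        if ( ! shadow )
            return std::nullopt;

        long long open_time = 0;
        shadow >> open_time;  // assumed format: open timestamp on the first line

        struct stat st;
        if ( stat(log_path.c_str(), &st) != 0 )
            return std::nullopt;

        // Last-modified time of the leftover log stands in for its close time.
        return std::make_pair(static_cast<time_t>(open_time), st.st_mtime);
    }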
From robin at corelight.com Thu Jul 2 00:44:08 2020
From: robin at corelight.com (Robin Sommer)
Date: Thu, 2 Jul 2020 07:44:08 +0000
Subject: [Zeek-Dev] Log archival (Re: Zeek Supervisor: designing client and log archival behavior)
In-Reply-To:
References: <20200701085903.GI33767@corelight.com>
Message-ID: <20200702074408.GO33767@corelight.com>

On Wed, Jul 01, 2020 at 14:03 -0700, Jon Siwek wrote:

> What if an open() rarely or never happens again for a given log?

Ah, right, forgot about that case. So yeah, agree, the shadow files are useful for this and for retaining whatever information we need.

> * Changed: running through a function of the same name that happened to
> get changed between restarts is probably still going to be closer to
> what the user expects than running it through the default post-processor,
> which is completely different?

I was thinking not the default post-processor, but whatever is configured for the log file we are just opening (if we did it at open() time). But yeah, that won't work when the cleanup happens already before the new open.

Robin

-- 
Robin Sommer * Corelight, Inc. * robin at corelight.com * www.corelight.com

From petar.backovic.fit at gmail.com Sun Jul 5 01:19:33 2020
From: petar.backovic.fit at gmail.com (Petar Backovic)
Date: Sun, 5 Jul 2020 10:19:33 +0200
Subject: [Zeek-Dev] Email Zeek
In-Reply-To:
References:
Message-ID:

Respected devs,

I installed Zeek and configured the interface, email, private IP address, etc. I copied the script for SSH password guessing from the docs.zeek web site, and my listener on wlan0 works. When I failed to log in to SSH with PuTTY enough times, I never received an alert email. Every SSH login is in the ssh.log file, but nothing arrives by email. The Internet works on my Raspberry Pi.

Could you help me figure out where the problem is?

Thank you in advance,
Petar Backovic

From johanna at corelight.com Thu Jul 9 13:21:44 2020
From: johanna at corelight.com (Johanna Amann)
Date: Thu, 09 Jul 2020 13:21:44 -0700
Subject: [Zeek-Dev] Zeek Table Cluster distribution using broker ready for testing
Message-ID:

Hello everyone,

If you followed last year's Zeek Week, you might be aware that we have been working on a new way to more easily distribute Zeek table content in a cluster setup. We now have a working prototype - and I would be happy for feedback if someone wants to start playing with it.

We tried to make this feature as easy to use as possible. In the case that you just want to distribute a table over an entire Zeek cluster, you only have to add &backend=Broker::MEMORY to the table definition. So - for example:

global table_to_share: table[string] of count &backend=Broker::MEMORY;

This will automatically synchronize the table over the entire cluster. In the background, a Broker store (in this case a memory-backed store) is created and used for the actual data synchronization. Changes to the table are automatically sent to the Broker store and distributed over the cluster.

We also support persistent Broker stores. At the moment you need to specify the path in which the database should be stored for this feature. Example:

redef Broker::auto_store_db_directory = "[path]";
global table_to_share: table[string] of count &backend=Broker::SQLITE;

Data that is stored in the table will be persistent across restarts of Zeek.

Current limitations:

* There is no conflict resolution.
  Simultaneous inserts for the same key will probably lead to divergent state across the cluster. This is by design - if you need to be absolutely sure that you do not lose any data, or if you want conflict resolution for multiple inserts, you will still have to roll your own script-level logic using events.

* Tables can only have a single index; multi-indexed tables (like table[string, count] of X) are not yet supported.

* Tables can only have simple values. Tables that store records, tables, sets, or vectors are not supported. The reason for this is that we cannot track table changes in these cases.

* &expire_func cannot be used simultaneously. Normal expiry should work correctly.

* Documentation is basically still completely missing - I will write it over the next days.

If you want to try this you have to compile the topic/johanna/table-changes branch of the Zeek repository. To check out this branch into a new directory, use something like:

git clone https://github.com/zeek/zeek --branch topic/johanna/table-changes --recursive [target-directory]

Please let me know if you have any feedback/questions/problems :)

Johanna

From bob.murphy at corelight.com Thu Jul 9 16:57:04 2020
From: bob.murphy at corelight.com (Bob Murphy)
Date: Thu, 9 Jul 2020 16:57:04 -0700
Subject: [Zeek-Dev] Proposal: Make Zeek's debug logging thread-safe
Message-ID:

Right now, if you try to use Zeek's debug logging facilities in DebugLogger.h concurrently from multiple threads, the contents of debug.log can get mixed up and look like "word salad".

I've been working on log writers for Zeek. Those operate in different threads, and using Zeek's current open-source debug logging implementation, trying to make sense of debug logs from those was a real headache. So in my own code, I've made debug logging thread-safe, so log text from different threads winds up on different lines in the debug.log file. I've also added more convenience macros to make logging some kinds of debug information easier.

This proposal is to integrate those debug logging changes into open-source Zeek. I'd welcome any questions, suggestions or feedback.

Bob Murphy | Corelight, Inc. | bob.murphy at corelight.com | www.corelight.com

From bob.murphy at corelight.com Thu Jul 9 18:19:43 2020
From: bob.murphy at corelight.com (Bob Murphy)
Date: Thu, 9 Jul 2020 18:19:43 -0700
Subject: [Zeek-Dev] Proposal: Improve Zeek's log-writing system with batch support and better status reporting
Message-ID: <8D06AACD-8721-4EDA-95BD-DAB3D60ACD84@corelight.com>

Summary

This proposal is aimed at solving two intertwined problems in Zeek's log-writing system:

Problem: Batch writing code duplication
- Some log writers need to send multiple log records at a time in "batches". These include writers that send data to Elasticsearch, Splunk HEC, Kinesis, and various HTTP-based destinations.
- Right now, each of these log writers has similar or identical code to create and manage batches.
- This code duplication makes writing and maintaining "batching" log writers harder and more bug-prone.

Proposed Solution: Add a new optional API for writing a batch all at once, while still supporting older log writers that don't need to write batches.

Problem: Insufficient information about failures
- Different log writers can fail in a variety of ways.
- Some of these failure modes are amenable to automatic recovery within Zeek, and others could be corrected by an administrator if they knew about them.
- However, the current system for writing log records returns a boolean indicating only two log writer statuses: "true" means "Everything's fine!", and "false" means "Emergency!!! The log writer needs to be shut down!"

Proposed Solution:
a. For non-batching log writers, change the "false" status to just mean "There was an error writing a log record". The log writing system will then report those failures to other Zeek components such as plug-ins, so they can monitor a log writer's health and make more sophisticated decisions about whether a log writer can continue running or needs to be shut down.
b. Batching log writers will have a new API anyway, so that will let log writers report more detail about write failures, including suggestions about possible ways to recover.

--------------------------------------------------------------------------------

Design Details

Current Implementation

At present, log writers are C++ classes which descend from the WriterBackend pure-virtual superclass. Each log writer must override several pure virtual member functions, which include:

* DoInit: Writer-specific initialization method.
* DoWrite: Write one log record. Returns a boolean, where true means "everything's fine", and false means "things are so bad, the log writer needs to be shut down."

Log writers can also optionally override this virtual member function:

* DoWriteLogs: Possibly writer-specific output method that records zero or more log entries. The default implementation in the superclass simply calls DoWrite() in a loop.

New Implementation

This has two main goals:

* Provide a new base class for log writers that supports writing a batch of records at once, handles all the batch creation and write logic, and offers more sophisticated per-record reporting on failures.
* Provide backward compatibility so "legacy" (existing, non-batching) log writers can build and run without code changes, while changing the meaning of "false" when returned from DoWrite() to "sending this one log record failed."

These goals will be achieved using three writer backend classes:

1. BaseWriterBackend

This will be a virtual base class, and is a superclass for both legacy and batching log writers.
- It will have the same API signature as the existing WriterBackend, except it will omit DoWrite().
- It will also expose the existing DoWriteLogs() member function as a pure virtual function, so there's a standard interface for WriterBackend::Write() to call.

2. WriterBackend

This class will derive from BaseWriterBackend, and will support legacy log writers as a drop-in replacement for the existing WriterBackend class.
- It will add a pure virtual DoWrite member function to BaseWriterBackend, so its API signature will be identical to the existing WriterBackend class. That will let legacy log writers inherit from it with no code changes, and also support new log writers that don't need batching.
- The return semantics for DoWrite will change so that when it returns false, that will simply mean the argument record wasn't successfully written.
- Its specialization of DoWriteLogs will be nearly identical to Zeek's current implementation, except that when DoWrite returns false, DoWriteLogs will simply report the failure to the rest of Zeek, rather than triggering a log writer shutdown.
  Then, other Zeek components can monitor the writer's health and decide whether to shut down the log writer or let it continue.

3. BatchWriterBackend

This class will derive from BaseWriterBackend, and will write logs in batches.
- Instead of DoWrite, it will expose a DoWriteBatch pure virtual member function to accept logs in batches.
- Its specialization of DoWriteLogs will call DoWriteBatch.
- It will support configuring per-log-writer criteria that trigger flushing a batch, including:
  * Maximum age of the oldest cached log (default value TBD)
  * Maximum number of cached log records (default value TBD)
- DoWriteBatch will support rejecting logs at arbitrary indices in the batch, and will report details on which logs were rejected and why.

This is the proposed signature for DoWriteBatch:

int BatchWriterBackend::DoWriteBatch(
    int num_writes,
    threading::Value*** vals,
    BatchWriterBackend::status_vector& failures
);

where:
    num_writes = the number of log records in the batch
    vals = the values of the log records to be written
    failures = information about failed record writes

The return value is the number of log records actually written.

Compared to DoWriteLogs, DoWriteBatch omits the num_fields and fields arguments. Those aren't needed because the log writer already has those values, which were stored when they were supplied to its Init member function.

The failures argument is a reference to a std::vector of structs the log writer can fill in with details on failures to write individual records. The individual status structs will generally look like this:

struct status {
    int m_failed_record_index;
    uint32_t m_failure_reason;
    uint32_t m_recovery_suggestion;
};

where:
    m_failure_reason indicates the general reason for the failure
    m_recovery_suggestion might contain a suggestion about handling the failure

If DoWriteBatch() returns a number that's smaller than num_writes, and the failures vector is empty, the caller will assume all the failed records were at the end of the batch, and try to re-transmit them in a later batch.

--------------------------------------------------------------------------------

I'd welcome any questions, suggestions or feedback.

Bob Murphy | Corelight, Inc. | bob.murphy at corelight.com | www.corelight.com

From johanna at corelight.com Thu Jul 9 19:16:47 2020
From: johanna at corelight.com (Johanna Amann)
Date: Thu, 09 Jul 2020 19:16:47 -0700
Subject: [Zeek-Dev] Proposal: Make Zeek's debug logging thread-safe
In-Reply-To:
References:
Message-ID: <0E62D8DB-CD97-44D5-9173-02C7DD175320@corelight.com>

On 9 Jul 2020, at 16:57, Bob Murphy wrote:

> Right now, if you try to use Zeek's debug logging facilities in
> DebugLogger.h concurrently from multiple threads, the contents of
> debug.log can get mixed up and look like "word salad".

Is there a reason why you didn't just use the Debug call of the threading framework (which goes through the message queues and then ends up in debug.log)?
Johanna

From bob.murphy at corelight.com Fri Jul 10 10:53:37 2020
From: bob.murphy at corelight.com (Bob Murphy)
Date: Fri, 10 Jul 2020 10:53:37 -0700
Subject: [Zeek-Dev] Proposal: Make Zeek's debug logging thread-safe
In-Reply-To: <0E62D8DB-CD97-44D5-9173-02C7DD175320@corelight.com>
References: <0E62D8DB-CD97-44D5-9173-02C7DD175320@corelight.com>
Message-ID: <576BA5D9-0682-4619-B8FD-13D1BDAC4979@corelight.com>

Hi Johanna,

I wasn't aware of that call, but it also wouldn't have done what I needed.

If I understand the code correctly, each MsgThread has a FIFO queue that it pushes messages onto. Later on, the main thread occasionally runs a loop where it handles all the queued messages from the first MsgThread, then all the queued messages from the second MsgThread, etc.

The development I was doing sometimes required me to examine the debug messages from different threads in the chronological order they were generated. But if I understand it correctly, the threading framework's logging doesn't maintain that ordering.

Also, that work sometimes generated a LOT of debug messages - thousands or millions of lines of them - when only a tiny fraction of them were interesting. To cut down on the garbage, I used the DebugLogger class's member functions to selectively enable and disable individual streams when particular conditions occurred. However, those member functions take effect immediately, and because the threading framework's Debug member function emits log lines after a delay, it seems likely I would have missed debug output I wanted to see and seen debug output I didn't want to see.

Best regards,
Bob

> On Jul 9, 2020, at 7:16 PM, Johanna Amann wrote:
>
> On 9 Jul 2020, at 16:57, Bob Murphy wrote:
>
>> Right now, if you try to use Zeek's debug logging facilities in DebugLogger.h concurrently from multiple threads, the contents of debug.log can get mixed up and look like "word salad".
>
> Is there a reason why you didn't just use the Debug call of the threading framework (which goes through the message queues and then ends up in debug.log)?
>
> Johanna

From jsiwek at corelight.com Mon Jul 13 13:42:03 2020
From: jsiwek at corelight.com (Jon Siwek)
Date: Mon, 13 Jul 2020 13:42:03 -0700
Subject: [Zeek-Dev] Proposal: Make Zeek's debug logging thread-safe
In-Reply-To: <576BA5D9-0682-4619-B8FD-13D1BDAC4979@corelight.com>
References: <0E62D8DB-CD97-44D5-9173-02C7DD175320@corelight.com> <576BA5D9-0682-4619-B8FD-13D1BDAC4979@corelight.com>
Message-ID:

On Fri, Jul 10, 2020 at 11:00 AM Bob Murphy wrote:

> The development I was doing sometimes required me to examine the debug messages from different threads in the chronological order they were generated. But if I understand it correctly, the threading framework's logging doesn't maintain that ordering.

Yeah, or at least the time associated with a Debug message is its time-of-processing, not time-of-generation. Can see how the latter is more useful, but want to discuss the proposed solution in a bit more detail? Does it involve a locked mutex around only the underlying fprintf() or something more? I imagine it should be "something more" if the requirement is to make debug.log a convenient way of understanding operation ordering among many threads.
- Jon

From bob.murphy at corelight.com Tue Jul 14 08:05:20 2020
From: bob.murphy at corelight.com (Bob Murphy)
Date: Tue, 14 Jul 2020 08:05:20 -0700
Subject: [Zeek-Dev] Proposal: Make Zeek's debug logging thread-safe
In-Reply-To:
References: <0E62D8DB-CD97-44D5-9173-02C7DD175320@corelight.com> <576BA5D9-0682-4619-B8FD-13D1BDAC4979@corelight.com>
Message-ID:

> On Jul 13, 2020, at 1:42 PM, Jon Siwek wrote:
>
> On Fri, Jul 10, 2020 at 11:00 AM Bob Murphy wrote:
>
>> The development I was doing sometimes required me to examine the debug messages from different threads in the chronological order they were generated. But if I understand it correctly, the threading framework's logging doesn't maintain that ordering.
>
> Yeah, or at least the time associated with a Debug message is its
> time-of-processing, not time-of-generation. Can see how the latter is
> more useful, but want to discuss the proposed solution in a bit more
> detail? Does it involve a locked mutex around only the underlying
> fprintf() or something more? I imagine it should be "something more"
> if the requirement is to make debug.log a convenient way of
> understanding operation ordering among many threads.
>
> - Jon

My current implementation does just use a mutex to control access to the output file, and reports the time of generation.

Outside of this email thread, one person has suggested adding something to each debugging log line to identify its source thread. That could potentially be the thread ID, or the thread name, or both. Another person who runs multiple Zeek instances concurrently also suggested adding the process ID to each log line. So I was planning to add those to each debug log line before doing a pull request to merge my changes into Zeek master.

- Bob

P.S. If Zeek were to emit a lot of debugging log lines from enough threads very quickly, it's possible the mutex would add excessive overhead. Boost has a lock-free inter-thread queue that could be the nucleus of a solution for that, but that would be a lot more complicated than just using a mutex. So I don't want to look further into that unless and until we know it's really needed.

From jsiwek at corelight.com Tue Jul 14 11:35:24 2020
From: jsiwek at corelight.com (Jon Siwek)
Date: Tue, 14 Jul 2020 11:35:24 -0700
Subject: [Zeek-Dev] Proposal: Make Zeek's debug logging thread-safe
In-Reply-To:
References: <0E62D8DB-CD97-44D5-9173-02C7DD175320@corelight.com> <576BA5D9-0682-4619-B8FD-13D1BDAC4979@corelight.com>
Message-ID:

On Tue, Jul 14, 2020 at 8:05 AM Bob Murphy wrote:

> My current implementation does just use a mutex to control access to the output file, and reports the time of generation.

I was also trying to break down a couple distinct requirements and wondered if that actually covers the 2nd:

(1) Fix the "word salad"
(2) Ability to examine debug output from multiple threads in chronological order

Is it fine to just be able to understand the ordering of "when the fprintf() happened", or is what's really needed to understand the ordering of "when operations associated with debug messages happened"?

Thread 1:
    Foo();
    LockedDebugMsg("I did Foo.");

Thread 2:
    Bar();
    LockedDebugMsg("I did Bar.");

debug.log:
    [Timestamp 1] I did Foo.
    [Timestamp 2] I did Bar.

That debug.log doesn't really tell us whether Foo() happened before Bar(), right?

- Jon
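For readers following the thread, here is a minimal sketch of the kind of mutex-guarded write under discussion: the timestamp, process ID, and thread ID are captured at the call site (time of generation), and a single lock serializes the fprintf() so lines never interleave. The LockedDebugMsg name and the file handling here are hypothetical - this is not the code in Bob's branch or in Zeek.

    #include <chrono>
    #include <cstdio>
    #include <mutex>
    #include <sstream>
    #include <thread>
    #include <unistd.h>

    static std::mutex debug_log_mutex;      // serializes all writers of the debug log
    static FILE* debug_log_file = stderr;   // stand-in for the real debug.log handle

    // Hypothetical helper: capture time-of-generation, PID, and thread ID first,
    // then take the lock only for the actual write.
    void LockedDebugMsg(const char* msg) {
        double now = std::chrono::duration<double>(
            std::chrono::system_clock::now().time_since_epoch()).count();

        std::ostringstream tid;
        tid << std::this_thread::get_id();

        std::lock_guard<std::mutex> guard(debug_log_mutex);
        fprintf(debug_log_file, "%.6f [pid %d, thread %s] %s\n",
                now, static_cast<int>(getpid()), tid.str().c_str(), msg);
        fflush(debug_log_file);
    }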
From bob.murphy at corelight.com Tue Jul 14 11:56:55 2020
From: bob.murphy at corelight.com (Bob Murphy)
Date: Tue, 14 Jul 2020 11:56:55 -0700
Subject: [Zeek-Dev] Proposal: Make Zeek's debug logging thread-safe
In-Reply-To:
References: <0E62D8DB-CD97-44D5-9173-02C7DD175320@corelight.com> <576BA5D9-0682-4619-B8FD-13D1BDAC4979@corelight.com>
Message-ID:

> On Jul 14, 2020, at 11:35 AM, Jon Siwek wrote:
>
> On Tue, Jul 14, 2020 at 8:05 AM Bob Murphy wrote:
>
>> My current implementation does just use a mutex to control access to the output file, and reports the time of generation.
>
> I was also trying to break down a couple distinct requirements and
> wondered if that actually covers the 2nd:
>
> (1) Fix the "word salad"
> (2) Ability to examine debug output from multiple threads in chronological order
>
> Is it fine to just be able to understand the ordering of "when the
> fprintf() happened", or is what's really needed to understand the
> ordering of "when operations associated with debug messages happened"?
>
> Thread 1:
>     Foo();
>     LockedDebugMsg("I did Foo.");
>
> Thread 2:
>     Bar();
>     LockedDebugMsg("I did Bar.");
>
> debug.log:
>     [Timestamp 1] I did Foo.
>     [Timestamp 2] I did Bar.
>
> That debug.log doesn't really tell us whether Foo() happened before
> Bar(), right?
>
> - Jon

The version I have definitely fixes #1, the word salad. It also fixes #2 in the sense that the output is in the same chronological order as the calls to LockedDebugMsg.

The code you show should give correct ordering on when Foo() and Bar() finish. If you also want to know when they start, you could do:

Thread 1:
    LockedDebugMsg("About to do Foo.");
    Foo();
    LockedDebugMsg("I did Foo.");

Thread 2:
    LockedDebugMsg("About to do Bar.");
    Bar();
    LockedDebugMsg("I did Bar.");

From jsiwek at corelight.com Tue Jul 14 13:14:50 2020
From: jsiwek at corelight.com (Jon Siwek)
Date: Tue, 14 Jul 2020 13:14:50 -0700
Subject: [Zeek-Dev] Proposal: Make Zeek's debug logging thread-safe
In-Reply-To:
References: <0E62D8DB-CD97-44D5-9173-02C7DD175320@corelight.com> <576BA5D9-0682-4619-B8FD-13D1BDAC4979@corelight.com>
Message-ID:

On Tue, Jul 14, 2020 at 11:56 AM Bob Murphy wrote:

> The code you show should give correct ordering on when Foo() and Bar() finish.

Wondering what's meant by "correct ordering" here. Bar() can finish before Foo() and yet debug.log can report "I did Foo" before "I did Bar" for whatever thread-scheduling reasons happened to make that the case. Or Foo() and Bar() can execute together in complete concurrency and it's just the LockedDebugMsg() picking an arbitrary "winner".

- Jon

From bob.murphy at corelight.com Tue Jul 14 14:58:20 2020
From: bob.murphy at corelight.com (Bob Murphy)
Date: Tue, 14 Jul 2020 14:58:20 -0700
Subject: [Zeek-Dev] Proposal: Make Zeek's debug logging thread-safe
In-Reply-To:
References: <0E62D8DB-CD97-44D5-9173-02C7DD175320@corelight.com> <576BA5D9-0682-4619-B8FD-13D1BDAC4979@corelight.com>
Message-ID: <9585DDC6-82DD-4F42-935D-08B6F4100C3B@corelight.com>

> On Jul 14, 2020, at 1:14 PM, Jon Siwek wrote:
>
> On Tue, Jul 14, 2020 at 11:56 AM Bob Murphy wrote:
>
>> The code you show should give correct ordering on when Foo() and Bar() finish.
>
> Wondering what's meant by "correct ordering" here. Bar() can finish
> before Foo() and yet debug.log can report "I did Foo" before "I did
> Bar" for whatever thread-scheduling reasons happened to make that the
> case. Or Foo() and Bar() can execute together in complete concurrency
> and it's just the LockedDebugMsg() picking an arbitrary "winner".
>
> - Jon

I see your point.

For example:
a. Foo() in thread 1 finishes before Bar() in thread 2 finishes
b. The scheduler deactivates thread 1 for a while between the return from Foo() and the execution of LockedDebugMsg("I did Foo.")
c. Thread 2 proceeds from the return from Bar() without interruption

Then debug.log would contain the message "I did Bar" before "I did Foo".

So the ordering in the log file really reflects how the kernel sees the temporal order of mutex locking inside LockedDebugMsg. That's an inexact approximation of the temporal order of calls to LockedDebugMsg, and that's an even more inexact approximation of the temporal order of code executed before LockedDebugMsg.

For what I was doing, though, that proved to be good enough. :-)

I'd be very interested in ideas about how to improve that, especially if they're simple. I can think of a way to improve it, but it would be substantially more complicated than just a mutex.

From robin at corelight.com Wed Jul 15 00:52:17 2020
From: robin at corelight.com (Robin Sommer)
Date: Wed, 15 Jul 2020 07:52:17 +0000
Subject: [Zeek-Dev] Proposal: Make Zeek's debug logging thread-safe
In-Reply-To: <9585DDC6-82DD-4F42-935D-08B6F4100C3B@corelight.com>
References: <0E62D8DB-CD97-44D5-9173-02C7DD175320@corelight.com> <576BA5D9-0682-4619-B8FD-13D1BDAC4979@corelight.com> <9585DDC6-82DD-4F42-935D-08B6F4100C3B@corelight.com>
Message-ID: <20200715075217.GF41059@corelight.com>

Reading through this thread, I'm wondering if we should focus on improving identification of log lines in terms of where they come from and when they were generated, while continuing to go through the existing mechanism of sending messages back to the main process for output (so that we don't need the mutex). If we sent timestamps & thread IDs along with the Debug() messages, one could later post-process debug.log to get things sorted/split as desired.

This wouldn't support the use case of "millions of lines" very well, but I'm not convinced that's what we should be designing this for. A mutex becomes potentially problematic at that volume as well, and it also seems like a rare use case to begin with. In cases where it's really needed, a local patch to get logs into files directly (as you have done already) might just do the trick, no?

Robin

On Tue, Jul 14, 2020 at 14:58 -0700, Bob Murphy wrote:
>
> > On Jul 14, 2020, at 1:14 PM, Jon Siwek wrote:
> >
> > On Tue, Jul 14, 2020 at 11:56 AM Bob Murphy wrote:
> >
> >> The code you show should give correct ordering on when Foo() and Bar() finish.
> >
> > Wondering what's meant by "correct ordering" here. Bar() can finish
> > before Foo() and yet debug.log can report "I did Foo" before "I did
> > Bar" for whatever thread-scheduling reasons happened to make that the
> > case. Or Foo() and Bar() can execute together in complete concurrency
> > and it's just the LockedDebugMsg() picking an arbitrary "winner".
> >
> > - Jon
>
> I see your point.
>
> For example:
> a. Foo() in thread 1 finishes before Bar() in thread 2 finishes
> b. The scheduler deactivates thread 1 for a while between the return from Foo() and the execution of LockedDebugMsg("I did Foo.")
> c. Thread 2 proceeds from the return from Bar() without interruption
>
> Then debug.log would contain the message "I did Bar" before "I did Foo".
>
> So the ordering in the log file really reflects how the kernel sees the temporal order of mutex locking inside LockedDebugMsg.
> That's an inexact approximation of the temporal order of calls to LockedDebugMsg, and that's an even more inexact approximation of the temporal order of code executed before LockedDebugMsg.
>
> For what I was doing, though, that proved to be good enough. :-)
>
> I'd be very interested in ideas about how to improve that, especially if they're simple. I can think of a way to improve it, but it would be substantially more complicated than just a mutex.
>
> _______________________________________________
> Zeek-Dev mailing list
> Zeek-Dev at zeek.org
> http://mailman.icsi.berkeley.edu/mailman/listinfo/zeek-dev

-- 
Robin Sommer * Corelight, Inc. * robin at corelight.com * www.corelight.com

From robin at corelight.com Wed Jul 15 01:09:15 2020
From: robin at corelight.com (Robin Sommer)
Date: Wed, 15 Jul 2020 08:09:15 +0000
Subject: [Zeek-Dev] Proposal: Improve Zeek's log-writing system with batch support and better status reporting
In-Reply-To: <8D06AACD-8721-4EDA-95BD-DAB3D60ACD84@corelight.com>
References: <8D06AACD-8721-4EDA-95BD-DAB3D60ACD84@corelight.com>
Message-ID: <20200715080915.GG41059@corelight.com>

On Thu, Jul 09, 2020 at 18:19 -0700, Bob Murphy wrote:

> Proposed Solution: Add a new optional API for writing a batch all at once, while
> still supporting older log writers that don't need to write batches.

That sounds good to me; a PR with the proposed API would be great.

> a. For non-batching log writers, change the "false" status to just mean
> "There was an error writing a log record". The log writing system will then
> report those failures to other Zeek components such as plug-ins, so they can
> monitor a log writer's health and make more sophisticated decisions about
> whether a log writer can continue running or needs to be shut down.

Not quite sure what this would look like. Right now we just shut down the thread on error, right? Can you elaborate on how "report those failures to other Zeek components" and "make more sophisticated decisions" would look?

Could we just change the boolean result into a tri-state: (1) all good; (2) recoverable error; and (3) fatal error? Here, (2) would mean that the writer failed on an individual write, but remains prepared to receive further messages for output. We could then also implicitly treat a current "false" as (3), so that existing writers wouldn't even notice the difference (at the source code level at least).

> b. Batching log writers will have a new API anyway, so that will let log
> writers report more detail about write failures, including suggestions about
> possible ways to recover.

Similar question here: what would these "suggestions" look like?

Robin

-- 
Robin Sommer * Corelight, Inc.
* robin at corelight.com * www.corelight.com

From bob.murphy at corelight.com Wed Jul 15 14:57:36 2020
From: bob.murphy at corelight.com (Bob Murphy)
Date: Wed, 15 Jul 2020 14:57:36 -0700
Subject: [Zeek-Dev] Proposal: Make Zeek's debug logging thread-safe
In-Reply-To: <20200715075217.GF41059@corelight.com>
References: <0E62D8DB-CD97-44D5-9173-02C7DD175320@corelight.com> <576BA5D9-0682-4619-B8FD-13D1BDAC4979@corelight.com> <9585DDC6-82DD-4F42-935D-08B6F4100C3B@corelight.com> <20200715075217.GF41059@corelight.com>
Message-ID: <70CC6BD9-B068-4DBB-BD1F-D21208CA45CA@corelight.com>

> On Jul 15, 2020, at 12:52 AM, Robin Sommer wrote:
>
> Reading through this thread, I'm wondering if we should focus on
> improving identification of log lines in terms of where they come from
> and when they were generated, while continuing to go through the existing
> mechanism of sending messages back to the main process for output (so that
> we don't need the mutex). If we sent timestamps & thread IDs along
> with the Debug() messages, one could later post-process debug.log to
> get things sorted/split as desired.
>
> This wouldn't support the use case of "millions of lines" very well,
> but I'm not convinced that's what we should be designing this for. A
> mutex becomes potentially problematic at that volume as well, and it
> also seems like a rare use case to begin with. In cases where it's
> really needed, a local patch to get logs into files directly (as you
> have done already) might just do the trick, no?
>
> Robin

We could definitely change DebugLogger to improve the log line identification, and route it through the threading framework's Debug() call. That will avoid turning debug.log into "word salad".

However, that would also cause a delay in writing the log lines, and I've run into situations working on Zeek where that kind of delay would make debugging harder. For example, sometimes I run tail on the log file in a terminal window. Then, when the code hits a breakpoint in a debugger, I can analyze the program state by looking at log lines emitted right before the breakpoint triggers, and compare them to variable contents, the stack trace, etc. That won't work if logging is delayed.

There are multiple, conflicting use cases for logging in Zeek. Sometimes a developer might think:
- Maximized throughput is important, but a delay is okay
- No delay can be tolerated, but slower throughput is okay
- Correct temporal ordering in the log is (or isn't) important
- fflush() after every write is (or isn't) important
- Debug logging output should go to the debug.log file, or stdout, or somewhere else

This is a pretty common situation around logging, in my experience. One way to solve it, as Robin says, is for a developer with a use case Zeek doesn't support to apply a temporary local patch. Unfortunately, that doesn't help other developers who might have the same use case. Also, I personally hate to spend time writing code and getting it to work well, and then throw it away.

On other projects, I've used a different approach that's worked really well: use a single, common logging API, but let it send its output to different output mechanisms that support different use cases. Then a developer could pick the output mechanism that works best for their use case at runtime, using a command-line option or environment variable. I think it wouldn't be very complicated to add that to Zeek.

- Bob
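As a rough illustration of the "one API, multiple output mechanisms" idea, here is a hypothetical sketch in which an environment variable selects the output sink at startup. None of these class names, nor the ZEEK_DEBUG_LOG_SINK variable, exist in Zeek; this only shows the shape of the approach.

    #include <cstdio>
    #include <cstdlib>
    #include <memory>
    #include <string>

    // Hypothetical sink interface: one logging API writes through whichever sink was chosen.
    class DebugSink {
    public:
        virtual ~DebugSink() = default;
        virtual void Write(const std::string& line) = 0;
    };

    // Immediate, unbuffered output, convenient alongside a debugger or "tail -f".
    class StderrSink : public DebugSink {
    public:
        void Write(const std::string& line) override {
            fprintf(stderr, "%s\n", line.c_str());
            fflush(stderr);
        }
    };

    // Buffered file output for when throughput matters more than latency.
    class FileSink : public DebugSink {
    public:
        explicit FileSink(const char* path) : f(fopen(path, "a")) {}
        ~FileSink() override { if ( f ) fclose(f); }
        void Write(const std::string& line) override {
            if ( f )
                fprintf(f, "%s\n", line.c_str());
        }
    private:
        FILE* f = nullptr;
    };

    // Choose the sink once at startup from a made-up environment variable.
    std::unique_ptr<DebugSink> MakeDebugSink() {
        const char* choice = getenv("ZEEK_DEBUG_LOG_SINK");  // illustrative name only
        if ( choice && std::string(choice) == "stderr" )
            return std::make_unique<StderrSink>();
        return std::make_unique<FileSink>("debug.log");
    }

A command-line option could select the sink the same way; the point is just that call sites stay unchanged while the output behavior varies per use case.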
From bob.murphy at corelight.com Wed Jul 15 17:45:11 2020
From: bob.murphy at corelight.com (Bob Murphy)
Date: Wed, 15 Jul 2020 17:45:11 -0700
Subject: [Zeek-Dev] Proposal: Improve Zeek's log-writing system with batch support and better status reporting
In-Reply-To: <20200715080915.GG41059@corelight.com>
References: <8D06AACD-8721-4EDA-95BD-DAB3D60ACD84@corelight.com> <20200715080915.GG41059@corelight.com>
Message-ID:

> On Jul 15, 2020, at 1:09 AM, Robin Sommer wrote:
>
> On Thu, Jul 09, 2020 at 18:19 -0700, Bob Murphy wrote:
>
>> Proposed Solution: Add a new optional API for writing a batch all at once, while
>> still supporting older log writers that don't need to write batches.
>
> That sounds good to me; a PR with the proposed API would be great.

That sounds great. I wanted to bounce the ideas around with people who know more about Zeek than I do before going into detail on a proposed API.

>> a. For non-batching log writers, change the "false" status to just mean
>> "There was an error writing a log record". The log writing system will then
>> report those failures to other Zeek components such as plug-ins, so they can
>> monitor a log writer's health and make more sophisticated decisions about
>> whether a log writer can continue running or needs to be shut down.
>
> Not quite sure what this would look like. Right now we just shut down
> the thread on error, right? Can you elaborate on how "report those
> failures to other Zeek components" and "make more sophisticated
> decisions" would look?

Yes, right now, any writer error just shuts down the entire thread.

That's a good solution for destinations like a disk, because if a write fails, something really bad has probably happened. But Seth Hall pointed out that some log destinations can recover, and it's not a good solution for those. Here are a couple of examples:

1. A writer might send log records to a network destination. If the connection is temporarily congested, it would start working again when the congestion clears.
2. The logs go to another computer that's hung, and everything would work again if somebody rebooted it.

Seth's idea was to report the failures to a plugin that could be configured by an administrator. A plugin for a writer that goes to disk could shut down the writer on the first failure, like Zeek does now. And plugins for other writers could approach the examples above with a little more intelligence:

1. The plugin for the network destination writer could decide to shut down the writer only after no records have been successfully sent for a minimum of ten minutes.
2. The plugin for the remote-computer writer could alert an administrator to reboot the other computer. After that, the writer would successfully resume sending logs.

> Could we just change the boolean result into a tri-state: (1) all good;
> (2) recoverable error; and (3) fatal error? Here, (2) would mean that
> the writer failed on an individual write, but remains prepared to
> receive further messages for output. We could then also implicitly
> treat a current "false" as (3), so that existing writers wouldn't even
> notice the difference (at the source code level at least).

I don't think that would work, because the member function in question returns a bool. To change that return value to represent more than two states, we'd have to do one of two things:

1. Change that bool to some other type. If we did that, existing writers wouldn't compile any more.
2. Use casts or a union to store and retrieve values other than 0 and 1 in that bool, and hope those values will be preserved across the function return and into the code that needs to analyze them.

We can't count on values other than 0 or 1 being preserved, because the bool type in C++ is a little weird, and some behaviors are implementation-dependent. I wrote a test program using a pointer to store 0x0F into a bool, and other than looking at it in a debugger, everything I did to read the value out of that bool turned it into 0x01, including assigning it to another bool or an int. The only thing that saw 0x0F in there was taking a pointer to the bool, casting it to a pointer to char or uint8_t, and dereferencing that pointer.

>> b. Batching log writers will have a new API anyway, so that will let log
>> writers report more detail about write failures, including suggestions about
>> possible ways to recover.
>
> Similar question here: what would these "suggestions" look like?

For batching, I was thinking of having a way to send back a std::vector of structs that would be something like this:

struct failure_info {
    uint32_t index_in_batch;
    uint16_t failure_type;
    uint16_t recovery_suggestion;
};

The values of failure_type would be an enumeration indicating things like "fatal, shut down the writer", "log record exceeds protocol limit", "unable to send packet", "unable to write to disk", etc. Using a fixed-size struct member that's larger than the enum would allow extra values to be added in the future.

recovery_suggestion would be a similar enum-in-larger-type, and would let the writer convey more information based on what it knows about the log destination. That could indicate things like "the network connection has entirely dropped and no recovery is possible", "the network connection is busy, try again later", "this log record is too large for the protocol, but re-sending it might succeed if it's truncated or split up", etc.

- Bob

From seth at corelight.com Thu Jul 16 05:46:01 2020
From: seth at corelight.com (Seth Hall)
Date: Thu, 16 Jul 2020 08:46:01 -0400
Subject: [Zeek-Dev] Proposal: Improve Zeek's log-writing system with batch support and better status reporting
In-Reply-To:
References: <8D06AACD-8721-4EDA-95BD-DAB3D60ACD84@corelight.com> <20200715080915.GG41059@corelight.com>
Message-ID:

On 15 Jul 2020, at 20:45, Bob Murphy wrote:

>> On Jul 15, 2020, at 1:09 AM, Robin Sommer wrote:
>>
>> Not quite sure what this would look like. Right now we just shut down
>> the thread on error, right? Can you elaborate on how "report those
>> failures to other Zeek components" and "make more sophisticated
>> decisions" would look?
>
> Yes, right now, any writer error just shuts down the entire thread.
>
> That's a good solution for destinations like a disk, because if a
> write fails, something really bad has probably happened. But Seth Hall
> pointed out that some log destinations can recover, and it's not a
> good solution for those.

More or less, this is the same sort of thing I'm always pushing for: move more functionality into scripts. If I got an event in scriptland, I might be able to determine what resulting action to take in the script and whether or not to shut down the writer or let it keep going.
> For batching, I was thinking of having a way to send back a
> std::vector of structs that would be something like this:
>
> struct failure_info {
>     uint32_t index_in_batch;
>     uint16_t failure_type;
>     uint16_t recovery_suggestion;
> };

This is almost starting to sound a bit more complicated than it's worth. We may need to discuss this a bit more to figure out something simpler. The immediate problem that springs to mind is that as a developer, I don't think I'd have any clue what failure_types and recovery_suggestions could be common among export destinations.

  .Seth

-- 
Seth Hall * Corelight, Inc * www.corelight.com

From bob.murphy at corelight.com Thu Jul 16 17:15:38 2020
From: bob.murphy at corelight.com (Bob Murphy)
Date: Thu, 16 Jul 2020 17:15:38 -0700
Subject: [Zeek-Dev] Proposal: Improve Zeek's log-writing system with batch support and better status reporting
In-Reply-To:
References: <8D06AACD-8721-4EDA-95BD-DAB3D60ACD84@corelight.com> <20200715080915.GG41059@corelight.com>
Message-ID: <1509B814-58A1-4FF0-A334-579E5DE882AA@corelight.com>

>> For batching, I was thinking of having a way to send back a std::vector of structs that would be something like this:
>>
>> struct failure_info {
>>     uint32_t index_in_batch;
>>     uint16_t failure_type;
>>     uint16_t recovery_suggestion;
>> };
>
> This is almost starting to sound a bit more complicated than it's worth. We may need to discuss this a bit more to figure out something simpler. The immediate problem that springs to mind is that as a developer, I don't think I'd have any clue what failure_types and recovery_suggestions could be common among export destinations.

Seth and I were talking today, and came up with something like this:

struct failure_info {
    uint32_t first_index;
    uint16_t index_count;
    uint16_t failure_type;
};

Here's how it would work:

1. The batch writing function would return a std::vector of these. If the entire batch wrote successfully, the vector would be empty.

2. The failure_type value would still indicate generally what happened, with predefined values indicating things like "network failure", "protocol error", "unable to write to disk", or "unspecified failure". Seth thought we'd be likely to start out with about ten values like this. Using a 32-bit value for this provides lots of room for expansion :-) and maintains reasonable alignment within the struct.

3. first_index and index_count would specify a range. That way, if several successive log records aren't sent for the same reason, that could be represented by a single struct, instead of a different struct for each one.

This drops the recovery suggestion. The sizes of the struct fields are currently set to pack nicely into eight bytes, with no wasted space either within the struct or between structs in an array. We could make the fields different sizes, though.
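To make the range-based failure reporting concrete, here is a hypothetical sketch of a batch write loop that coalesces consecutive failures of the same type into a single failure_info entry. The failure codes and the send_record() helper are stand-ins invented for illustration; this is not the proposed Zeek API itself.

    #include <cstdint>
    #include <vector>

    struct failure_info {
        uint32_t first_index;   // index of the first failed record in the batch
        uint16_t index_count;   // number of consecutive records that failed
        uint16_t failure_type;  // e.g. 1 = network failure, 2 = protocol error (illustrative)
    };

    // Hypothetical per-record send, stubbed out here: returns 0 on success,
    // otherwise a failure_type code. A real writer would talk to its destination.
    static uint16_t send_record(int /* index */) { return 0; }

    // Returns the number of records written; `failures` receives coalesced ranges.
    int WriteBatchSketch(int num_writes, std::vector<failure_info>& failures) {
        int written = 0;

        for ( int i = 0; i < num_writes; ++i ) {
            uint16_t rc = send_record(i);

            if ( rc == 0 ) {
                ++written;
                continue;
            }

            // Extend the previous range if this failure is adjacent and of the same type.
            if ( ! failures.empty() &&
                 failures.back().failure_type == rc &&
                 failures.back().first_index + failures.back().index_count == static_cast<uint32_t>(i) )
                ++failures.back().index_count;
            else
                failures.push_back({ static_cast<uint32_t>(i), 1, rc });
        }

        return written;
    }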
From robin at corelight.com Fri Jul 17 02:54:04 2020
From: robin at corelight.com (Robin Sommer)
Date: Fri, 17 Jul 2020 09:54:04 +0000
Subject: [Zeek-Dev] Proposal: Improve Zeek's log-writing system with batch support and better status reporting
In-Reply-To: <1509B814-58A1-4FF0-A334-579E5DE882AA@corelight.com>
References: <8D06AACD-8721-4EDA-95BD-DAB3D60ACD84@corelight.com> <20200715080915.GG41059@corelight.com> <1509B814-58A1-4FF0-A334-579E5DE882AA@corelight.com>
Message-ID: <20200717095404.GC43266@corelight.com>

On Thu, Jul 16, 2020 at 17:15 -0700, Bob Murphy wrote:

> Here's how it would work:

It would be helpful to see a draft API for the full batch writing functionality to see how the pieces would work together. Could you mock that up? That said, a couple of thoughts:

> 2. The failure_type value would still indicate generally what
> happened, with predefined values indicating things like "network
> failure", "protocol error", "unable to write to disk", or
> "unspecified failure".

In my experience, such detailed numerical error codes are rarely useful in practice. Different writers will implement them to different degrees and associate different semantics with them, and callers will never quite know what to expect and how to react. Do you actually need to distinguish the semantics for all these different cases? Seems an alternative would be having a small set of possible "impact" values telling the caller what to do. To take a stab:

- temporary error: failed, but should try again with the same log data
- error: failed, and trying the same log data again won't help; but OK to continue with new log data
- fatal error: panic, shut down the writer

Depending on who's going to log failures, we could also just include a textual error message as well. Logging is where more context seems most useful, I'd say.

> 3. first_index and index_count would specify a range. That way, if
> several successive log records aren't sent for the same reason, that
> could be represented by a single struct, instead of a different struct
> for each one.

One reason I'm asking about the full API is that I'm not sure where the ownership of logs that fail to write resides. Is the writer keeping them? If so, it could handle the retry case internally. If the writer discards them after failure and the caller needs to send the data again, I'd wonder if there's a simpler return type here where we just point to the first failed entry in the batch. The writer would simply abort on the first failure (how likely is it really that the next succeeds immediately afterwards?).

And just to be clear why I'm making all these comments: I'm worried about the difficulty of using this API, on both ends. The more complex we make the things being passed around, the more difficult it gets to implement the logic correctly and efficiently.

Robin

-- 
Robin Sommer * Corelight, Inc. * robin at corelight.com * www.corelight.com
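For illustration, the "impact" idea above could be as small as an enum plus a count of what was written - a hypothetical sketch rather than a proposed Zeek API; all names here are made up.

    #include <cstddef>
    #include <string>

    // Hypothetical coarse-grained outcome of a batch write, mirroring the
    // temporary / permanent / fatal split suggested above.
    enum class WriteImpact {
        Success,          // everything was written
        TemporaryError,   // failed; retrying the same records later may work
        PermanentError,   // failed; retrying the same records won't help, new data is fine
        FatalError        // the writer is unusable and should be shut down
    };

    struct BatchWriteResult {
        WriteImpact impact = WriteImpact::Success;
        std::size_t num_written = 0;  // records successfully written before the first failure
        std::string message;          // optional human-readable context, mainly for logging
    };

A caller could then decide whether to retry, skip, or tear down the writer from the impact value alone, without interpreting writer-specific error codes.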
From robin at corelight.com Fri Jul 17 03:01:38 2020
From: robin at corelight.com (Robin Sommer)
Date: Fri, 17 Jul 2020 10:01:38 +0000
Subject: [Zeek-Dev] Proposal: Make Zeek's debug logging thread-safe
In-Reply-To: <70CC6BD9-B068-4DBB-BD1F-D21208CA45CA@corelight.com>
References: <0E62D8DB-CD97-44D5-9173-02C7DD175320@corelight.com> <576BA5D9-0682-4619-B8FD-13D1BDAC4979@corelight.com> <9585DDC6-82DD-4F42-935D-08B6F4100C3B@corelight.com> <20200715075217.GF41059@corelight.com> <70CC6BD9-B068-4DBB-BD1F-D21208CA45CA@corelight.com>
Message-ID: <20200717100138.GD43266@corelight.com>

On Wed, Jul 15, 2020 at 14:57 -0700, Bob Murphy wrote:

> use a single, common logging API, but let it send its output to
> different output mechanisms that support different use cases.

I get that in general. It's just that afaik this is the first time this need has come up. Adding a full-featured, thread-safe logging framework is a trade-off against complexity and maintenance costs. Not saying it's impossible, but I'd like to hear more people thinking this is a good idea before committing to such a route.

Robin

-- 
Robin Sommer * Corelight, Inc. * robin at corelight.com * www.corelight.com

From bob.murphy at corelight.com Sat Jul 18 13:48:33 2020
From: bob.murphy at corelight.com (Bob Murphy)
Date: Sat, 18 Jul 2020 13:48:33 -0700
Subject: [Zeek-Dev] Proposal: Make Zeek's debug logging thread-safe
In-Reply-To: <20200717100138.GD43266@corelight.com>
References: <0E62D8DB-CD97-44D5-9173-02C7DD175320@corelight.com> <576BA5D9-0682-4619-B8FD-13D1BDAC4979@corelight.com> <9585DDC6-82DD-4F42-935D-08B6F4100C3B@corelight.com> <20200715075217.GF41059@corelight.com> <70CC6BD9-B068-4DBB-BD1F-D21208CA45CA@corelight.com> <20200717100138.GD43266@corelight.com>
Message-ID:

> On Jul 17, 2020, at 3:01 AM, Robin Sommer wrote:
>
> On Wed, Jul 15, 2020 at 14:57 -0700, Bob Murphy wrote:
>
>> use a single, common logging API, but let it send its output to
>> different output mechanisms that support different use cases.
>
> I get that in general. It's just that afaik this is the first time
> this need has come up. Adding a full-featured, thread-safe logging
> framework is a trade-off against complexity and maintenance costs.
> Not saying it's impossible, but I'd like to hear more people thinking
> this is a good idea before committing to such a route.
>
> Robin

I completely agree about that trade-off, which is why the work I've done so far is pretty simple. It doesn't change the existing DebugLogger system other than adding thread safety. Then on the side, there are a few optional features like a scoping utility class and some preprocessor macros.

That said, different developers have different debugging styles, and I'm a big fan of using feature-rich debug logging frameworks with multiple operating modes and destinations, because they let me fix bugs and write new code much faster than I could otherwise.

Writing a powerful debug logging system does take time and effort, but my experience has been that once it's finished, it usually doesn't require much ongoing maintenance. Working on open-source and commercial projects with lifetimes of more than a few years, I've always seen that time and effort pay for itself many, many times over by making it quicker and easier to diagnose bugs, write new features, and do performance enhancements. That's especially been true when I've worked on code that handled large volumes of data, like Zeek does.
If I need to track down a bug in a stream of data that doesn't manifest until megabytes have gone by, I usually find the quickest approach is to run the software and search for a diagnostic pattern in a gigantic log file, compared to other approaches like spending hours hitting the same debugger breakpoint over and over again.

- Bob

From johanna at icir.org Wed Jul 22 11:32:00 2020
From: johanna at icir.org (Johanna Amann)
Date: Wed, 22 Jul 2020 11:32:00 -0700
Subject: [Zeek-Dev] Zeek mailing list move (zeek.org -> lists.zeek.org)
Message-ID: <0DB62DEF-66AE-4553-820F-14BAED24F084@icir.org>

Hello everyone,

We are going to switch the zeek.org mailing lists to a new provider on Monday the 27th. This change means that the domain part of all zeek.org mailing lists is going to change from "zeek.org" to "lists.zeek.org".

What changes does this entail / what does this mean for you:

* All zeek.org mailing list domains will switch to lists.zeek.org. So, "zeek at zeek.org" will be "zeek at lists.zeek.org" afterwards. However, you will still be able to send messages to the old list address for the foreseeable future - they will automatically be forwarded to the new address. If you are using mailing list filters to automatically sort Zeek mailing lists into folders, you will probably have to update them.

* The mailing list archives and administrative interface will move to https://lists.zeek.org/. The old interface at http://mailman.icsi.berkeley.edu/mailman/listinfo will no longer be available; archives will also no longer be available at the old address.

* Your subscription will automatically move; you do not have to take any action.

When will this happen:

* This change will happen on Monday the 27th of July, starting at approximately 9am PDT/noon EDT/4pm GMT/5pm BST/6pm CEST. Messages sent to the Zeek mailing lists during this time will be held. We will try to make sure that any messages sent during this timeframe make it over after the migration, but your message will probably make it faster if you wait until we are done.

* The change will take a few hours; I will send another message to the individual lists once the migration is done.

Why are we moving the mailing lists:

The current setup that we are using is being retired and we have to switch to a new provider. We are switching to a new domain because this makes our setup easier to maintain.

If you have any questions or concerns, please let me know.

Johanna