<div dir="ltr">Good morning everyone.<div><br></div><div>I&#39;m researching compression of Zeek data. I&#39;m currently dumping Zeek data into Parquet files, and one of the most challenging fields to compress is <font face="monospace">uid</font> because of its high entropy.</div><div><br></div><div>I&#39;m wondering if there&#39;s any interest in changing the format of the uid to something like <a href="https://github.com/ulid/spec" target="_blank">ULID</a>, for which there is already a <a href="https://github.com/suyash/ulid" target="_blank">C++ implementation</a>.</div><div><br></div><div>A ULID-based uid implementation would:</div><div><ul><li style="margin-left:15px">allow uids to be sorted lexicographically by creation time, which isn&#39;t helpful in and of itself but is very helpful for compression</li><li style="margin-left:15px">still be URL-safe</li><li style="margin-left:15px">always be 26 characters, for simpler storage</li><li style="margin-left:15px">be case-insensitive</li></ul></div><div><br></div><div>Looking through the code (<a href="https://github.com/bro/bro/blob/master/src/UID.h" target="_blank">UID.h</a> and <a href="https://github.com/bro/bro/blob/master/src/UID.cc" target="_blank">UID.cc</a>) and its usages, the change doesn&#39;t look technically difficult, but I&#39;m sure I&#39;m missing some reasons for the current design. For example, I noticed that prefixes such as the letter &#39;C&#39; are used to denote kinds of connections. Perhaps that information could be moved to a separate field instead?</div><div><br></div><div>Anyway, I&#39;m looking for thoughts, comments, suggestions, and anything else. Thank you!</div><div><br></div>-- <br><div dir="ltr" class="gmail_signature" data-smartmail="gmail_signature"><div dir="ltr"><div>Karl</div></div></div></div>