I'm also the kind of person who will spin up a virtual machine and build some file-based ZFS pools just to verify that the vdev names my code generates for raidz and draid vdevs match the names the zpool command generates. Draid vdevs have very complicated names, and I hope they can't be reshaped (which would presumably change those names).
(I have real machines running ZFS but I wasn't about to assemble some test pools on them; that goes on scratch VMs that can explode without problems. Just in case.)
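A minimal sketch of the kind of throwaway test setup I mean, done as root in the scratch VM; the pool name, file sizes, and the particular draid spec here are all arbitrary:

  # five 1 GB file-backed 'disks'
  truncate -s 1G /tmp/zd0 /tmp/zd1 /tmp/zd2 /tmp/zd3 /tmp/zd4
  # a draid vdev: parity 1, 2 data disks per group, 5 children, 1 spare
  zpool create ztest draid1:2d:5c:1s /tmp/zd0 /tmp/zd1 /tmp/zd2 /tmp/zd3 /tmp/zd4
  # zpool status then shows the vdev under a name like 'draid1:2d:5c:1s-0'
  zpool status ztest
  zpool destroy ztest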
It turns out that I am the kind of person who will make a big ball of intertwined changes, publish it as one lump to get it out there, and then go back to carefully construct a series of individual changes in a (mostly) plausible order, so that people could theoretically cherry-pick some of them and so that it looks neat.
(I did wind up with two changes intertwined because I wasn't going to actually build and test a split-apart version.)
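For the curious, the mechanics of the split are nothing special. One way to do it, with made-up branch names, is to re-apply the big ball's end state as uncommitted changes on a clean branch and then commit it piece by piece:

  # start a clean branch from the point where the big ball diverged
  git checkout -b split-up main
  # bring over the big ball's end state as uncommitted changes
  # (run from the top of the repo)
  git checkout bigball -- .
  # stage and commit hunks selectively, in a plausible order
  git add -p
  git commit

(Repeat the add -p and commit steps until nothing is left over.)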
I've now published the modified version of zfs_exporter that we're using on Linux to expose ZFS I/O stats to Prometheus: https://github.com/siebenmann/zfs_exporter (the cks-upstream branch, which should be the default).
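If you want to run it yourself, hooking it into Prometheus is the usual exporter dance; a sketch of the scrape configuration, where the job name and the target host are stand-ins and the port is whatever you tell the exporter to listen on:

  scrape_configs:
    - job_name: 'zfs'
      static_configs:
        - targets: ['yourzfshost:9134']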
It turns out that basically all of our SSH authentication probes come from one remote AWS machine hammering on the root account of one server (something that would never succeed). I only discovered this when I started poking through Grafana Loki out of curiosity.
(Now I have a new dashboard.)
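The query that turned this up was nothing fancy. A sketch of the sort of thing, where the label names depend on how your logs are ingested (ours come from the systemd journal, hence syslog_identifier):

  topk(10, sum by (ip) (count_over_time({syslog_identifier="sshd"} |= "Failed password" | regexp "from (?P<ip>[0-9a-fA-F.:]+)" [24h])))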
I'm an experienced sysadmin and only today, after <redacted> years, have I firewalled incoming IPv6 connections to client devices on my network. IPv4? Implicitly firewalled years ago. IPv6? Until now, sail on through, because it was just a bit too much of a pain to get around to (and who attacks things via IPv6, really).
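For what it's worth, the actual firewalling is not much nftables. A sketch of the idea on the router, where the WAN interface name and the client network's prefix are stand-ins for your own:

  table ip6 filter {
    chain forward {
      type filter hook forward priority 0; policy accept;
      # replies to connections that the inside machines started are fine
      ct state established,related accept
      # ICMPv6 is load-bearing in IPv6; don't blanket-drop it
      meta l4proto ipv6-icmp accept
      # drop new inbound connections from outside to the client network
      iifname "wan0" ip6 daddr 2001:db8::/64 drop
    }
  }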
At some point I'm going to write a techblog entry on why I'm finding Grafana Loki to be the lazy person's log grep+awk. It's not as straightforward an issue as you might think, since you do have to learn LogQL (which is not PromQL and has irritating differences and omissions).
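Part of the appeal is that the simple case really is about as little typing as grep. This is roughly 'grep error | grep -v pam' over one machine's logs, although the host name here is a stand-in:

  {host="somehost"} |= "error" != "pam"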
Stupid Grafana Loki trick of the time interval: how much do our server clocks drift over a day for machines that adjust them periodically via ntpdate?
sum(sum_over_time({syslog_identifier="ntpdate"} |= "adjust time server " | pattern "adjust time server <server> offset <delta> sec" | unwrap delta [24h])) by (host)
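(Unpacking that: the pattern stage parses the offset out of ntpdate's 'adjust time server <X> offset <Y> sec' lines, unwrap turns the extracted delta into a number, sum_over_time adds up a day's worth of adjustments per log stream, and the outer sum by (host) collapses that to one figure per machine. Note that the offsets can be negative, so adjustments in opposite directions partly cancel.)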
Current status: finished just over 85 km of bicycling, aka 'cooked'. TBN's "Richmond Hill and Dale" had a lot of hill (much of it a slow long ascent hidden in meandering through suburbia) and not as much dale as I like. We also got to see some still-surviving (for the moment) farmland.
https://ridewithgps.com/trips/98291172
(The route is only 54 km or so, but I rode to and from the start point, as I always do because I'm crazy.)
One reason Loki isn't a clear slam-dunk for us is that we run systems, not applications, and Linux system logs are both voluminous and very noisy (and everything logs its messages in its own format and way). Still, I could do basic things like put kernel log messages on our per-system dashboards. Maybe a log volume summary would also be useful.
I'll have to experiment to see what information is worth presenting (the eternal dashboard problem).
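Both of those would be one-liner queries. Sketches, assuming your ingestion exposes the journal's _TRANSPORT field as a 'transport' label and the dashboard has a $host variable; first a logs panel of a machine's kernel messages, then a per-host log volume rate:

  {host="$host", transport="kernel"}

  sum by (host) (count_over_time({job="systemd-journal"} [5m]))

(The job label is also an assumption about how the logs were scraped.)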
We now have a working (Grafana) Loki setup that's ingesting logs from our modest fleet (in addition to our existing central syslog server). It works, but I'm not sure what to do with it now.
(This is unlike Prometheus, which was clearly immediately useful for showing us the status and performance of various things.)
I should probably get a CO2 monitor or two, but every time I think about doing this I go stare at Amazon listings for a bit and then quietly back away, for various reasons. My kingdom for a 'Linux users, buy this' answer, or at least a 'this doesn't demand you install a phone app and is pleasant to use' one.