Of course it sounds nice in theory: the server wouldn't need to download a remote file, it would simply save a given hash and put it into an IPFS gateway URL. In practice, though, I don't believe that's sufficient. There's no guarantee that the software creating a post uses the same thumbnail dimensions Mastodon needs, so the file still needs to be downloaded to the server, converted, then re-uploaded to IPFS. And if that process is non-deterministic, it's not worth it...
Yeah, seems like it's down to timestamps in the metadata, but ImageMagick refuses to accept any options that are supposed to unset those. Plus it looks like all IPFS-related libraries in Ruby are both incomplete and unmaintained, so...
@chpietsch If I upload the same identical file twice, the thumbnails for it (same settings, dimensions) get different hashes.
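The problem can be reproduced with nothing but a hash function: content addressing only deduplicates byte-identical files, so a single differing metadata byte yields a completely different digest. A minimal Python sketch (the byte strings here are made up, not real JPEG data):

```python
import hashlib

# Hypothetical stand-in bytes, not a real JPEG.
thumb = b"\xff\xd8" + b"pixel data" * 50 + b"\xff\xd9"

# Identical bytes always produce an identical digest, so a
# content-addressed store (an IPFS CID works the same way)
# deduplicates them for free.
assert hashlib.sha256(thumb).digest() == hashlib.sha256(thumb).digest()

# A single differing byte, e.g. an embedded timestamp, changes the
# whole digest, which is why non-deterministic thumbnailing defeats
# deduplication.
stamped = thumb.replace(b"pixel", b"pixe1", 1)
assert hashlib.sha256(stamped).digest() != hashlib.sha256(thumb).digest()
```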
@Gargron Maybe it's time to trash ImageMagick then. I am surprised that you use it for heavy lifting. It also has a bad security record.
@Gargron Even if thumbnails are not deduplicated, original files are
@val Only if they are smaller than the threshold where they are downsized.
@Gargron yo IPFS is like Node-or-fuck-it and it's _so_ annoying
@Gargron if both tools are broken [in Ruby], maybe it's time to look at alternatives. It would be nice to have a mature IPFS client in Ruby, but it's a lot of work (maybe ?), and ImageMagick does not have the best reputation either. 😩
@Gargron Have you ever heard of exiftool? I think it's written in Perl. Privacy-conscious people generally use it to wipe EXIF data from photos and PDFs before uploading to sites.
@DJ_Pure_Applesauce @Gargron I haven't actually checked how Mastodon handles EXIF data in images, but a lot of sites will convert images to JPEG in order to embed their own data, which can aid in tracking the images and PDFs (the ones that use JPEG). And most conversions to formats like PNG, done either to save data or because it's a standard (JPEG is arguably not a free format), still retain that metadata. I just assume the worst and wipe before uploading, regardless of the site. Exiftool has many uses.
@Gargron Oh now THAT is annoying.
I wonder if the non-determinism in this case can be overridden somehow: by providing a seed, shadowing /dev/random, or feeding a false system time to ImageMagick... 🤔
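Faking the system time might work, but another angle is to normalize or strip the volatile metadata before hashing, so the digest only sees the pixel data. A toy sketch of the idea in Python (the "TOYI" container format is invented here, standing in for EXIF tags or PNG date chunks):

```python
import hashlib
import struct
import time

def make_toy_image(pixels, ts=None):
    """Toy container: 4-byte magic, 8-byte creation timestamp, payload.
    (Hypothetical format, standing in for real image metadata.)"""
    ts = time.time() if ts is None else ts
    return b"TOYI" + struct.pack(">d", ts) + pixels

def normalize(blob):
    """Zero out the timestamp field so hashing sees only pixel data."""
    return blob[:4] + b"\x00" * 8 + blob[12:]

a = make_toy_image(b"pixels", ts=1.0)
b = make_toy_image(b"pixels", ts=2.0)

# Raw blobs differ because of the timestamp...
assert hashlib.sha256(a).digest() != hashlib.sha256(b).digest()
# ...but normalized blobs hash identically, restoring deduplication.
assert hashlib.sha256(normalize(a)).digest() == hashlib.sha256(normalize(b)).digest()
```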
@Gargron That's peculiar and frustrating but I guess from an information theory standpoint it makes sense if you're using IM for compression?
Really interested to hear that you're looking into directions like this, though! I've often wondered how we might bridge the 'federated' and 'peer-to-peer' models of decentralisation...
@Gargron JPEG compression SHOULD be deterministic, IF you can hardcode the quantisation and Huffman tables to use, and make sure not to include any non-deterministic metadata (might as well strip it all).
Downscaling the image would have to be done carefully, so as not to run into any floating point issues, but should be doable.
@Gargron this is an interesting plan, and yes, the way image/graphics applications mess with the underlying data, even if the image is the same, is a concern. Hmm...🤔
tangent, not really solving
It's probably putting in a timestamp or something.
I had a similar problem where I wanted incremental updates for a Unity project. (work :/)
It seems that Unity randomly messes up a few bytes in the huge asset files it builds, so file-based de-dupe still resulted in 200 MB of redundant downloads.
With content-based slicing, most of the file's chunks were reusable.
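For context, here is a minimal sketch of content-defined chunking in Python. The windowed hash and the parameters are toy choices for illustration; real tools (rsync-style delta sync, IPFS chunkers) use a proper rolling hash such as a Rabin fingerprint, but the resynchronization behaviour is the same:

```python
import hashlib
import random

def chunk(data, window=16, mask=0x3F, min_size=16):
    """Toy content-defined chunking: cut wherever a hash of the last
    `window` bytes matches a bit pattern, so boundaries follow the
    content itself rather than fixed byte offsets."""
    chunks = []
    start = 0
    for i in range(window, len(data)):
        h = 0
        for byte in data[i - window:i]:
            h = (h * 31 + byte) & 0xFFFFFFFF
        if i - start >= min_size and (h & mask) == mask:
            chunks.append(data[start:i])
            start = i
    chunks.append(data[start:])
    return chunks

random.seed(0)
data = bytes(random.randrange(256) for _ in range(4000))
edited = b"XYZ" + data  # simulate a small edit near the start of the file

old = {hashlib.sha256(c).digest() for c in chunk(data)}
new = {hashlib.sha256(c).digest() for c in chunk(edited)}
# After the insertion the chunker resynchronizes, so most chunks are
# unchanged and only a few need re-downloading.
```

With fixed-size slicing, the three inserted bytes would shift every later chunk and invalidate all of them; here the boundaries move with the content, which is why most of the Unity build remained reusable.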