I wanted to find out if using IPFS as a storage backend would give file deduplication "for free", but unfortunately it looks like ImageMagick operations on the same input file are not deterministic, so you still end up with different hashes when the same file is uploaded more than once.

Of course, it sounds nice in theory if your server didn't need to download a remote file and could simply take a given hash and put it into an IPFS gateway URL, but I believe in practice that's not sufficient. There's no guarantee that the software creating a post uses the same thumbnail dimensions Mastodon needs, so the file still has to be downloaded to the server, converted, and re-uploaded to IPFS, and if that process is non-deterministic, it's not worth it...


Yeah, it seems to come down to timestamps in the metadata, but ImageMagick refuses to accept any of the options that are supposed to unset them. Plus it looks like all the IPFS-related libraries in Ruby are both incomplete and unmaintained, so...
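One conceivable workaround (not anything Mastodon actually does, just a sketch): derive the dedup key from the JPEG's non-metadata segments only, so that timestamps living in APPn/COM segments can't perturb the hash. This assumes the encoder emits identical table and scan data for identical input, i.e. that only the metadata is nondeterministic.

```python
import hashlib

# APP0..APP15 (0xE0-0xEF) and COM (0xFE) segments carry metadata
# (EXIF, comments, encoder timestamps); skip them when hashing.
METADATA_MARKERS = set(range(0xE0, 0xF0)) | {0xFE}

def content_hash(jpeg: bytes) -> str:
    """SHA-256 of a JPEG with APPn/COM segments skipped.
    Simplified: assumes no standalone markers before SOS."""
    h = hashlib.sha256()
    assert jpeg[0:2] == b"\xff\xd8"  # SOI marker
    h.update(jpeg[0:2])
    i = 2
    while i < len(jpeg) - 1:
        assert jpeg[i] == 0xFF
        marker = jpeg[i + 1]
        if marker == 0xDA:  # SOS: entropy-coded data follows, hash the rest
            h.update(jpeg[i:])
            break
        length = int.from_bytes(jpeg[i + 2:i + 4], "big")
        segment = jpeg[i:i + 2 + length]
        if marker not in METADATA_MARKERS:
            h.update(segment)
        i += 2 + length
    return h.hexdigest()
```

Two files that differ only in their APP1 payload then hash identically, while a change to a real segment (e.g. a quantization table) still changes the hash.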


@Gargron IPFS should surely give you deduplication for free. Do you strip image metadata before calling ImageMagick? You might want to use something like MAT2 for cleaning uploaded files:
(I observed that Twitter removes metadata from JPEG uploads.)

@chpietsch If I upload the identical file twice, the thumbnails for it (same settings, same dimensions) get different hashes.

@Gargron Maybe it's time to trash ImageMagick then. I am surprised that you use it for heavy lifting. It also has a bad security record.

@Gargron Even if thumbnails are not deduplicated, original files are.

@val Only if they are smaller than the threshold where they are downsized.

@Gargron if both tools are broken [in Ruby], maybe it's time to look at alternatives. It would be nice to have a mature IPFS client in Ruby, but it's a lot of work (maybe ?), and ImageMagick does not have the best reputation either. 😩

@Gargron Have you ever heard of exiftool? I think it uses Perl. Privacy-conscious people generally use it to wipe EXIF data from photos and PDFs before uploading them to sites.

@TheOuterLinux @Gargron so apparently this is not happening by default, already? jeezus...

@DJ_Pure_Applesauce @Gargron I haven't actually checked how Mastodon handles EXIF data in images, but a lot of sites convert images to JPEG in order to embed their own data, which can aid in tracking images and PDFs (the ones that use JPEG). And most conversions to formats like PNG, whether to save data or because it's a standard (JPEG is debatably not a free format), still retain that metadata. I just assume the worst and wipe before uploading, regardless of the site. Exiftool has many uses.

I wonder if the non-determinism in this case can be overridden somehow by providing a seed, or by shadowing /dev/random, or by feeding ImageMagick a false system time... 🤔

@cathal @gargron ImageMagick stores a timestamp in the metadata. I doubt the image data itself is nondeterministic. Post-processing with a metadata library or tool might do what you need.

@Gargron That's peculiar and frustrating but I guess from an information theory standpoint it makes sense if you're using IM for compression?

Really interested to hear that you're looking into directions like this though! I've often wondered how we might bridge the 'federated' and 'peer to peer' models of decentralisation...

@padraic_padraic @Gargron

I'm interested in finding a solution to this since I want to do the same for the Known CMS.

Searching around it appears that libpng/libjpeg might be at fault.

Can you compare the images with ImageMagick's 'compare' tool?

@Gargron JPEG compression SHOULD be deterministic, IF you can hardcode the quantisation and Huffman tables to use, and make sure not to include any non-deterministic metadata (might as well strip it all).

Downscaling the image would have to be done carefully, so as not to run into any floating point issues, but should be doable.
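Downscaling can indeed be made bit-exact if you stick to integer arithmetic. A toy sketch of the idea (grayscale only, integer scale factors only, and nothing like ImageMagick's actual floating-point resampling filters):

```python
def box_downscale(pixels: list[int], w: int, h: int, f: int):
    """Downscale a grayscale image (flat row-major list of w*h pixel
    values) by integer factor f using a box filter. Pure integer
    arithmetic, so the output is bit-identical on every platform."""
    out_w, out_h = w // f, h // f
    out = []
    for oy in range(out_h):
        for ox in range(out_w):
            total = 0
            # Sum the f*f source pixels covered by this output pixel.
            for dy in range(f):
                for dx in range(f):
                    total += pixels[(oy * f + dy) * w + (ox * f + dx)]
            out.append(total // (f * f))  # exact integer mean, truncated
    return out, out_w, out_h
```

A floating-point filter can produce different rounding on different CPUs or library versions; integer sums and divisions cannot, which is the property you'd need for stable content hashes.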

@Gargron this is an interesting plan, and yes, the way image/graphics applications mess with the underlying data, even if the image is the same, is a concern. Hmm...🤔

tangent, not really solving 

@r @Gargron In principle all useful algorithms that aren't explicitly random should be deterministic.

It's probably putting in a timestamp or something.

I had a similar problem where I wanted incremental updates for a Unity project. (work :/)

It seems that Unity randomly messes up a few bytes in the huge asset files it builds, so file-based de-dupe still resulted in 200 MB of redundant downloads.

With content-based slicing, most of the file's chunks were reusable.
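The content-based slicing described above can be sketched in a few lines: declare a chunk boundary wherever a hash of the last few bytes matches a target pattern, so boundaries depend on content rather than file offsets, and an edit near the start only disturbs nearby chunks. (Toy parameters and a naive window hash; real chunkers use a proper rolling hash and much larger windows.)

```python
import zlib

WINDOW = 8   # bytes of context that decide where a boundary falls
MASK = 0x0F  # boundary when the low 4 bits of the window hash are all
             # set, for an average chunk size of roughly 16 bytes (toy)

def cdc_chunks(data: bytes) -> list[bytes]:
    """Split data at content-defined boundaries. Recomputes the window
    hash from scratch at each position (O(n * WINDOW)); a real chunker
    would roll the hash forward in O(1) per byte."""
    chunks, start = [], 0
    for i in range(WINDOW, len(data)):
        if i - start < WINDOW:
            continue  # enforce a minimum chunk size
        if (zlib.adler32(data[i - WINDOW:i]) & MASK) == MASK:
            chunks.append(data[start:i])
            start = i
    chunks.append(data[start:])
    return chunks
```

Because the boundaries are chosen from the bytes themselves, prepending data shifts only the earliest chunks; later chunks keep the same content and therefore the same hashes, which is exactly what makes the re-download avoidable.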
