Does anyone know if ffmpeg is deterministic? If you convert the same input with the same settings, will the final hashsum of the file be the same?

@gargron I just answered this for someone in #ffmpeg on freenode!

In general no - some formats have random or time based metadata. You can turn that off with -flags +bitexact in some cases, but this shouldn't be generally used.

Also, some multithreaded encoders like x264 are non-deterministic.

@kepstin Thanks! Hm. I want to work on some deduplication for Mastodon attachments, but it looks like if the same GIF is uploaded multiple times, the converted results will not count as the same file. So perhaps I will simply store the hashsum of the original file instead of the end result.

@gargron yes, for this sort of usage, I'd always recommend hashing the original file, then generating derivative files indexed by the hash of the original.

@gargron Among other things - hashing the original file means that you can simply re-use the already encoded derivative media, rather than having to do the encode then realizing "oh, wait, I already have this" and throwing it away.
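A minimal sketch of the approach described above: hash the original upload first, and only run the encoder when that hash has not been seen before. The names `DerivativeCache`, `get_or_encode` and the `encode` callback are hypothetical and not part of Mastodon or ffmpeg; the in-memory dict stands in for whatever database would really hold the index.

```python
import hashlib
from pathlib import Path
from typing import Callable, Dict


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large uploads need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


class DerivativeCache:
    """Hypothetical index of encoded derivatives, keyed by the ORIGINAL file's hash."""

    def __init__(self, encode: Callable[[Path], Path]) -> None:
        # `encode` is an assumed callback that transcodes a file and
        # returns the path of the derivative (e.g. by calling ffmpeg).
        self._encode = encode
        self._by_source_hash: Dict[str, Path] = {}

    def get_or_encode(self, original: Path) -> Path:
        key = sha256_of_file(original)
        cached = self._by_source_hash.get(key)
        if cached is not None:
            # Same bytes were uploaded before: skip the encode entirely.
            return cached
        derivative = self._encode(original)
        self._by_source_hash[key] = derivative
        return derivative
```

Because the lookup happens before the encode, a repeated upload costs one hash pass over the file and no transcoding work.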

@gargron For new code, I wouldn't recommend any hash algorithm older than sha256.

There might be some security issues here - if someone knows the md5 hash of a piece of media, they could send another file with the same hash, and it'll show the old media (which might have been private?) instead of the new. Requires knowing the hash in advance, and if you know the hash/filename you could probably see it anyways?

@kepstin @Gargron It has the disadvantage that a slightly different video could be detected as a duplicate, which you probably don't want. I guess the best approach is to just rely on the source hash and the encoded hash, which is useful when someone downloads from Mastodon and then uploads again.

@jomo @gargron ah - yes! I like the idea of tracking the encoded hash as well, to detect re-uploads.
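Tracking both hashes could look roughly like the sketch below: each stored media item is registered under the hash of its original upload and the hash of its encoded output, so a file downloaded from the server and uploaded again still matches. `MediaIndex`, `register` and `lookup` are made-up names for illustration, and a real implementation would keep the mapping in a database rather than a dict.

```python
import hashlib
from typing import Dict, Optional


def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


class MediaIndex:
    """Hypothetical lookup table from file hash to a stored media id."""

    def __init__(self) -> None:
        self._media_id_by_hash: Dict[str, str] = {}

    def register(self, media_id: str, original: bytes, encoded: bytes) -> None:
        # Index the same media item under both hashes.
        self._media_id_by_hash[sha256_hex(original)] = media_id
        self._media_id_by_hash[sha256_hex(encoded)] = media_id

    def lookup(self, upload: bytes) -> Optional[str]:
        # Matches a repeat of the original upload OR a re-upload of the
        # encoded file; returns None for anything not seen before.
        return self._media_id_by_hash.get(sha256_hex(upload))
```

Only exact byte-for-byte matches are found this way, so a slightly different video is never treated as a duplicate.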

@fluffy seems to be codec and system dependent, but there is also the multithreading thing

@gargron ah, interesting. Didn't think that the data could be valid in different orderings.

@fluffy @Gargron Many modern audio codecs (e.g. AAC) use random noise as part of the input that goes into the encoder. So even the same audio track encoded twice from the same source material may come out different.

@Gargron @fluffy Beyond that, container formats usually contain a timestamp of when the file was mastered, which will be different each time.

@gargron Using the same binaries (including dynamic libraries), I'm not aware of any nondeterministic code path.. @tomas?

@Gargron When using the same ffmpeg binary on the same architecture, it should be the same.
If you change the ffmpeg version, recompile it, or change the architecture (x86 vs ARM), maybe not.

@gargron FFmpeg dev here. This depends on codec and container. Many tests in FFmpeg's automated test suite (FATE) just run some conversion and compare the resulting file's hash to a reference. MXF uses random UUIDs as part of its format, for example. I'd say: just test whether the formats you want to use have your desired property. I'd guess mp4 and webm work fine.
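One way to run that kind of test yourself is sketched below: convert the same source several times and check whether every output hashes to the same value. The helper names (`output_hashes`, `is_reproducible`, `ffmpeg_convert`) are invented for this example. The ffmpeg command line is an assumption as well; it uses the `-flags +bitexact` option mentioned earlier in the thread and needs an `ffmpeg` binary on the PATH.

```python
import hashlib
import subprocess
import tempfile
from pathlib import Path
from typing import Callable, Set


def output_hashes(convert: Callable[[Path, Path], None], source: Path,
                  runs: int = 3, out_suffix: str = ".mp4") -> Set[str]:
    """Run `convert(source, out)` several times and collect the output hashes."""
    hashes: Set[str] = set()
    with tempfile.TemporaryDirectory() as tmp:
        for i in range(runs):
            out = Path(tmp) / f"run{i}{out_suffix}"
            convert(source, out)
            hashes.add(hashlib.sha256(out.read_bytes()).hexdigest())
    return hashes


def is_reproducible(convert: Callable[[Path, Path], None], source: Path,
                    runs: int = 3, out_suffix: str = ".mp4") -> bool:
    # A single distinct hash means every run produced identical bytes.
    return len(output_hashes(convert, source, runs, out_suffix)) == 1


def ffmpeg_convert(source: Path, out: Path) -> None:
    """Assumed example conversion; adjust the options to your real settings."""
    subprocess.run(
        ["ffmpeg", "-y", "-i", str(source), "-flags", "+bitexact", str(out)],
        check=True,
    )
```

Calling `is_reproducible(ffmpeg_convert, Path("input.gif"))` then answers the question for one specific codec, container and ffmpeg build on one machine. It says nothing about other versions or architectures.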

@Gargron short answer: no.
Longer answer: it's complicated and depends on the threading, codecs and parameters in play. Hardware differences could be a factor too.

Toot 1/2
