Why I Don't Think Content Addressing Is The Answer
There's a lot of interest in content-addressed distributed stores today, where you punch in a hash of your content (a file) and get that file back.
tldr: Superficially this looks nice, but I think this is a bad idea and we will deeply regret it if we go this route.
The reason: you need BOTH a hash AND a formal 'location'. The hash is there to warn you if the document has been altered in transit.
It can't do that if it's ALSO the location!
@natecull if the hash fails you *retransmit and check against the same hash*
@kel there are two important words here: 'fail' and 'retransmit'.
In a hash-addressed storage system:
1. the hash never fails, because it IS the name/path/location of the document, so you can never use it to detect 'I got the wrong document' errors
2. there is no means of retransmitting. All you can do is ask for the same hash again, which will guarantee you get the same wrong document.
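A minimal sketch of that situation (a toy dict-backed store, not any particular system): the hash is simultaneously the lookup key and the only thing available to check the returned bytes against.

```python
import hashlib

# Toy content-addressed store: the SHA-256 hex digest of the content is the key.
store: dict[str, bytes] = {}

def put(content: bytes) -> str:
    """Store content under its own hash and return that hash as its 'address'."""
    address = hashlib.sha256(content).hexdigest()
    store[address] = content
    return address

def get(address: str) -> bytes:
    """Fetch by hash. The only possible integrity check is against the key itself."""
    content = store[address]
    assert hashlib.sha256(content).hexdigest() == address, "damaged in storage/transit"
    return content

doc = put(b"hello world")
print(get(doc) == b"hello world")  # True -- but asking again for `doc` can only
                                   # ever return whatever sits under that key
```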
@natecull the hash does fail, in several situations
1. the content isn't accessible
2. the content is modified or damaged during or before transit
Nearly all failed hashes, hash-addressed system or not, fall under number two, and the hash *is still checked against the actual content in every hash-addressed system I'm familiar with*. In either error case a retry can help: on a new request the content may arrive undamaged, or another provider may come online with an undamaged copy of the file.
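A sketch of the retry behaviour being described, assuming a hypothetical fetch_from_peer transport that occasionally corrupts bytes in transit:

```python
import hashlib
import random

def fetch_from_peer(address: str) -> bytes:
    """Hypothetical transport: sometimes returns bytes damaged in transit."""
    content = b"the real document"
    if random.random() < 0.3:  # simulate occasional corruption on the wire
        content = b"the real doXument"
    return content

def fetch_verified(address: str, max_attempts: int = 5) -> bytes:
    """Request, check the bytes against the requested hash, retry on mismatch."""
    for _ in range(max_attempts):
        content = fetch_from_peer(address)
        if hashlib.sha256(content).hexdigest() == address:
            return content  # transit damage (case 2) is caught and retried away
    raise IOError("could not obtain content matching the requested hash")

address = hashlib.sha256(b"the real document").hexdigest()
print(fetch_verified(address))
```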
If you're instead talking about *collisions*: that depends on a variety of variables and implementation differences. There are perfect hash functions which would eliminate the issue entirely.
@natecull and all of this is before we even get into namespaced content. I don't think that global content-addressing with imperfect hash functions is a great idea if you intend to build a worldwide network. However, if content were namespaced by cryptographic public keys... ;)
@kel Yep. If you add namespacing - preferably user-configurable, recursive namespacing - most of this immediately vanishes. Or at least you divide the collision space by a factor of at least 7.6 billion.
I would say you need:
* cryptographic ID
* arbitrarily long file path
* THEN hash
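A rough sketch of that three-part address (the field names and placeholder values are made up for illustration):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Address:
    """Hypothetical three-part address: who published it, where it lives
    in their namespace, and a hash used purely as an integrity check."""
    publisher_key: str   # cryptographic ID (e.g. a public key fingerprint)
    path: str            # arbitrarily long, human-meaningful file path
    content_hash: str    # hash of the expected content, NOT the lookup key

addr = Address(
    publisher_key="ed25519:placeholder-fingerprint",
    path="/notes/2017/content-addressing.md",
    content_hash="sha256:placeholder-digest",
)
print(addr)
```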
@natecull it is pretty mind-boggling to think of a distributed data store hashed with a dynamic perfect hash[1] whose function is produced by a static functor over a blockchain of transactions on that data store
@thomas_lord @natecull well that's silly. users need not know the proofs behind everything in their computers
@thomas_lord @natecull "a sustainable society" != "a functioning computer". take a look at the topic of the conversation again before you keep talking.
@thomas_lord @natecull from this post I believe. And check nate's timeline for the technical specifics. https://maxhallinan.com/posts/2017/12/28/silos-of-subjectivity/
@thomas_lord @natecull yes. the article just gave rise to nate's posts and our discussion
@natecull @thomas_lord the goal is the same either way, however. Distributed data preservation/availability
I think DHTs might have their place for, say, a small set of large files, where you get the hash of the file that you want to request from somewhere else entirely.
But the kind of data that we want to support is vastly bigger than that.
At some point, you're just simply going to have to support 'requesting a value for a key'.
Or 'sending a message to an object'
Or 'applying a function to an argument'
Things have names; you've got to get the name from *somewhere*.
@natecull @thomas_lord There's also, of course, the question of how versioning of files fits into the story. It'd be interesting for each file to be its own hash table, for example. Like others have brought up, changing a file on IPFS changes its address completely, which in some ways is useful: it's nice to be able to address a REVISION (citations would fare a lot better this way). But couldn't the network also cut down on used space in such a system by using hashed chunks? Store every chunk of the most-accessed version (or pick versions by some other heuristic), download only the differing chunks for the other versions, and let peering share a chunk across versions whenever that chunk is identical between them.
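A minimal sketch of that chunk-sharing idea (fixed-size chunks and a plain dict standing in for the network's chunk store; real systems would likely use content-defined chunking, and nothing here is IPFS-specific):

```python
import hashlib

CHUNK = 4                                    # absurdly small, just for the demo

def chunks(data: bytes) -> list[bytes]:
    return [data[i:i + CHUNK] for i in range(0, len(data), CHUNK)]

chunk_store: dict[str, bytes] = {}           # chunk hash -> chunk bytes

def store_version(data: bytes) -> list[str]:
    """Store a file version as a list of chunk hashes; identical chunks
    across versions land in the same slot and are stored only once."""
    manifest = []
    for c in chunks(data):
        h = hashlib.sha256(c).hexdigest()
        chunk_store[h] = c
        manifest.append(h)
    return manifest

v1 = store_version(b"AAAABBBBCCCC")
v2 = store_version(b"AAAAXXXXCCCC")          # only the middle chunk differs
shared = len(set(v1) & set(v2))
print(f"{shared} of {len(v1)} chunks shared between the two versions")
```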
I feel there are two separate 'searching a distributed planetary information space' questions being asked here. The question I'm interested in is different to the one that IPFS, git and other DHT advocates are asking, so naturally our ideas of what makes for a good answer are different.
Requesting things by content makes sense if you are dealing with large-ish (i.e. much larger than 256 or 1024 bits) chunks of opaque data.
Not so much if you want to ask 'does X assert Y'
@natecull @thomas_lord how do you suppose those questions differ, exactly?
@kel @thomas_lord I'm looking for something like a programming language, where I can ask arbitrary 'X . Y = what Z?' questions, where X does not necessarily have to be a 'server' which is currently online, but should be some kind of identity that can store or broadcast Y in some manner.
and then the chains of X . Y . Z ... could go on down, not just terminating at 'file' but at something like a JSON object.
I think the DHT people don't care about X's or Y's; they just want 'the bits for Z'.
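A rough sketch of the shape of those queries (the nested "published" tree and the resolve helper here are purely hypothetical, just to show what 'X . Y . Z terminating at a JSON object' might look like):

```python
# Hypothetical published data: X is an identity, and what it broadcasts is a
# tree of keys rather than opaque files.
published = {
    "natecull": {                      # X: an identity, not a live server
        "posts": {
            "2017-12-30": {"title": "Content addressing", "words": 1200},
        },
    },
}

def resolve(x: str, *path: str):
    """Answer 'X . Y . Z ... = what?' by walking the identity's published tree."""
    node = published[x]
    for key in path:
        node = node[key]
    return node

# 'does X assert Y?' style query, terminating at a JSON-like object, not a file
print(resolve("natecull", "posts", "2017-12-30"))
```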
@kel @natecull @thomas_lord The Tahoe-LAFS folks have done a lot of interesting work in this area. I think there's some additional property that Tahoe doesn't provide that you're looking for, but I haven't been able to figure out exactly what it is. Are you familiar with that work?
@nuttycom @natecull @thomas_lord I'm not. I'll take some time to look at it, but I'm especially interested in systems that can operate on less-trustworthy networks
@kel @natecull On the "preservation" part: I think that thermodynamics suggests preservation requires human intent -- which (e.g.) makes preservation of obscure documentary films precarious. To preserve and make such things accessible is traditionally the work of curators, such as in libraries. A change in storage media doesn't override the over-arching thermodynamic issue. Preservation requires social organization and deliberate, targeted effort.
@thomas_lord @natecull hasn't zooko's triangle been essentially refuted?
@natecull @thomas_lord in either case, I feel as though I'm missing the point of your comment, so if you could explain it to me in a different way that might be helpful
@thomas_lord @natecull I would say that the development of distributed storage platforms qualifies here as a "deliberate, targeted" effort: one that has to navigate the minefield of copyright law, organizational failure, and the death of preservers, all of which would otherwise reverse that work. A great example is all of the music that may never be recovered by the "public" after the death of what.cd.
@natecull @thomas_lord curators are unable to legally curate this content unless they own it, and if they want to curate it without owning it, have to do it privately and/or anonymously
"But [hash of the month] is cryptographically secure! Nobody can create a collision! It's protected by math!"
True. But that only protects you against *intentionally generated* collisions between *two files*.
If you create a flat planet-wide namespace, and start putting every human file (numbering in the trillions) into it, and use one hash to access everything, guess what happens?
You get a combinatorial explosion of collision chances. Not 1 in BIGNUM for your one document, but a chance for every PAIR of documents: trillions times trillions of pairs.
Protected By Math: The True Story Of How Technology Fuckin' Went Off The Rails Because People Didn't Understand Fuckin' Math™
by Nate Cull
(just a suggestion, I am an auteur and all so I figured I'd just mention it'd be a finesse-ulated title)
Remember the Birthday Paradox? In a group of n people, how likely is it that two of them share a birthday? The answer is a lot higher than you'd naively expect.
The same thing will happen with hashes in a planet-wide hash-addressed storage system with sufficiently large number of documents.
The probability not that YOUR document clashes but that ANY TWO documents clash rises with the square of the number of documents, not linearly.
And because documents can't be retransmitted, any collision will remain there forever.
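A back-of-the-envelope check of how fast that any-two-documents probability grows, using the standard birthday approximation (nothing here is specific to any particular storage system):

```python
import math

def p_any_collision(n: float, bits: float) -> float:
    """Approximate probability that any two of n random values collide in a
    space of 2**bits, using p ~= 1 - exp(-n*(n-1) / 2**(bits+1))."""
    return 1.0 - math.exp(-n * (n - 1) / 2.0 ** (bits + 1))

# Classic sanity check: 23 people, 365 possible birthdays (~8.5 bits) -> ~50%
print(round(p_any_collision(23, math.log2(365)), 2))

# Doubling the number of documents roughly quadruples the collision odds
for n in (1e6, 2e6, 4e6):
    print(int(n), p_any_collision(n, 64))
```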
@natecull I feel like I mostly understand what you are saying, but I'd really appreciate it if you could explain it in simpler terms, or point me towards a primer?
I get about 3/4 of the way to understanding, and then I just can't wrap my head around it.
@ajroach42 @natecull I think the bit about not being able to retransmit is somewhat separate from the collision problem. The latter is sort of the same problem as with IPv4, except instead of running out of addresses it's a crap shoot as to which document you'll get. And given the speed at which we seem to be creating addressable documents, collisions will be common in all the hashing algorithms I know about.
@ajroach42 @natecull content addressable storage is the term to look for. examples are #GNUnet, #IPFS and git. pro: immutable addresses, con: more complex routing
@natecull this is why we use 128-bit or higher UUIDs, to make the chance of that happening vanishingly small.
@natecull and if we really wanted we could use 1024-bit UUIDs or even 2048-bit UUIDs. The chances of a collision, given that UUIDs are pseudorandom, become close enough to zero that a transmission error (a real one, like a physically flipped bit) becomes more likely, even for more files than there are atoms in the visible universe.
@popefucker @natecull It's really fun because this problem is literally about entropy and the information limits of the universe. Not just a hyperbolic abstraction, but legitimately an issue of applied information theory.
There's a great talk on the subject that goes into the topic of content addressable storage in detail: https://youtu.be/lKXe3HUG2l4?t=31m23s (whole talk is good, but gets into implementation concerns at the 31m23s timestamp i linked to)
By the way, 'trillions' of files is probably WAY low.
We have 7.6 billion people on earth.
We'd get 7.6 trillion files if we allowed ONLY 1,000 unique documents, over an entire lifetime, per person.
For reference, I've made over 12,000 toots in the last year.
@natecull You’re above average!
@natecull Note that if you start chunking up the hashspace you start cutting into this problem /a lot/.
So, say, if you have hashspaces defined /by person/, then you've got only the number of files /that person/ has created. Which reduces the problem by a factor of roughly 10 billion. (I said "roughly".)
If you hash in, say, some date spec (day, month, year, ...), you're also confining collisions to that date bound. Hours: ~0.5 million / life. Another large reduction.
@natecull Also, note that "person" here might include numerous asserted IDs of an actual individual, so n > 7.6 billion. This depends on whether there is serial or single-instance-only anonymity. Or of course, 2+ people might share an identity (collaborative pseudonyms).
@natecull Any actual data on this?
Within large Git trees, say?
@dredmorbius Maybe! I don't have any to hand. I'd love to know.
I'm just using simple logic here.
The point remains: For the love of God, DO NOT ENGINEER A CASCADING, EXPONENTIALLY LIKELY, PERMANENT, RANDOM FAILURE MODE INTO THE FUNDAMENTAL PACKET TRANSMISSION ARCHITECTURE OF THE ENTIRE PLANET'S FUTURE INTERNET.
Seriously, just don't.
Just do something, anything, else.
@natecull @dredmorbius this isn't even a complaint about hash-addressed content in particular. Even if you do namespacing, that's still subject to the same problem. Two files have the same name, same checksum, same namespace, but different content. Same problem.
The problem is: you're using a small amount of information to specify a large amount of information. This is impossible to do in general; the best you can get is something that will almost never mess up, like compression or long hashes.
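A toy illustration of that pigeonhole limit, deliberately truncating SHA-256 to 24 bits so a collision is findable in a second or two (real systems use far longer hashes precisely to push this out of practical reach):

```python
import hashlib

def short_hash(data: bytes, bits: int = 24) -> int:
    """A deliberately tiny 'address': only the first `bits` bits of SHA-256."""
    digest = hashlib.sha256(data).digest()
    return int.from_bytes(digest, "big") >> (256 - bits)

seen: dict[int, bytes] = {}
i = 0
while True:
    doc = f"document #{i}".encode()
    h = short_hash(doc)
    if h in seen and seen[h] != doc:
        # Two different documents, one address: the small key cannot
        # distinguish them -- exactly the pigeonhole problem described above.
        print("collision:", seen[h], doc, hex(h))
        break
    seen[h] = doc
    i += 1
```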
@natecull @dredmorbius or have an upgrade path to longer hashes?
@natecull The birthday paradox is that in any group of n people, the odds that /some two of the group/ (not one particular pair) share a birthday are surprisingly large: 50% with 23 people.
For the hash-address situation, it's not that /any two documents/ will share a hash, but that /within a specific namespace/ there will be a collision.
Other hashing systems deal with this by detecting collisions and handling them, e.g., the way hash-table lookups do.
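For contrast, a sketch of how an ordinary in-process hash table copes with colliding keys: it keeps the full key next to the value, so a clash is detected and resolved rather than silently returning the wrong entry (a luxury a pure hash-as-address network doesn't have).

```python
class ChainedTable:
    """Tiny hash table with separate chaining: colliding keys share a bucket
    but are disambiguated by comparing the stored key itself."""

    def __init__(self, buckets: int = 8):
        self.buckets = [[] for _ in range(buckets)]

    def put(self, key, value):
        bucket = self.buckets[hash(key) % len(self.buckets)]
        for i, (k, _) in enumerate(bucket):
            if k == key:
                bucket[i] = (key, value)  # overwrite the existing entry
                return
        bucket.append((key, value))       # collision: just chain it

    def get(self, key):
        bucket = self.buckets[hash(key) % len(self.buckets)]
        for k, v in bucket:
            if k == key:                  # the key, not its hash, decides
                return v
        raise KeyError(key)

t = ChainedTable(buckets=2)               # tiny table guarantees collisions
t.put("alpha", 1)
t.put("beta", 2)
t.put("gamma", 3)
print(t.get("alpha"), t.get("beta"), t.get("gamma"))
```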
@natecull if you have lots of files you need a long hash. A quick search says the internet has ~50 billion websites. That's no problem with e.g. SHA-256. https://crypto.stackexchange.com/questions/24732/probability-of-sha256-collisions-for-certain-amount-of-hashed-values
@natecull Doing the actual number-crunching suggests this is unlikely to be a real problem. Suppose there are 10 trillion files; that is about 2^43. By the Birthday Paradox there is no significant risk of even *one* collision occurring by random chance if the hash has significantly more possible values than the square of that number. Thus the hashes ought to be at least 86 bits long. Modern crypto hashes are 160 bits minimum, often 256 bits. No random collisions.
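Plugging those numbers into the birthday approximation (the script below assumes 10^13 documents and compares a few hash lengths; nothing system-specific):

```python
import math

n = 1e13                                              # ~2^43 documents, as above
for bits in (86, 128, 160, 256):
    p = -math.expm1(-(n * n) / 2.0 ** (bits + 1))     # birthday approximation
    print(bits, p)
# 86 bits  -> ~0.48 (roughly the break-even length quoted above)
# 128 bits -> ~1e-13
# 160/256  -> ~3e-23 and ~4e-52: no random collisions in practice
```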
@natecull Here: Within a subnetwork of trust-but-verify, with widespread replication, a hash can be a "multiply-routable address" just fine. But in a fediverse, hashes as multiply-routable addresses require both network-wide sanitation crawling to detect hash clashes, and a peering substrate that is not based on content-based addressing. If that makes sense, to you. Not sure I'm saying it well.
@thomas_lord That sounds about right.
Global content-based hashing would be okay *as a secondary layer used in certain purposes for certain data*.
It just shouldn't be the primary layer, aiming to replace, eg, TCP/IP, on which we then build everything else.
@natecull A corollary to that is that the global fediverse needs (a) intentional peering -- people pick and choose who to peer with; (b) peering relations need ongoing - perpetual - wetspace verification. To quote a William S. Burroughs book: "It is necessary to travel."
Context:
Back in the 1980s BBS era, we all transferred files over slow, noisy modems A LOT.
(We also used modem protocols over audio jacks connected to low-fi cassette tapes. And we typed in raw hex code from magazine listings.)
We needed checksums to make sure files didn't get damaged, and retransmit if they were.
Checksums became CRCs became cryptographically secure hashes. The principle is the same. If the hash fails, *you retransmit*.
If the hash is also the address? You can't retransmit.