Nate Cull

Part of my grumbling about data formats comes from Chinese language learning, and from looking at these two (fortunately text! but raw, non-structured) datasets and asking what format would be best to link, modify and share them. Or lots of other datasets like these.


Like, I just want a standard dictionary showing translation *and decomposition information* for characters?

So would the best format be:

* raw line-delimited text with nonstandard field delimiters (what these two projects decided to use for some reason)

* S-expressions

* Excel/OpenOffice spreadsheet
* Word/OpenOffice document

* stick it in a proprietary binary database

PROBABLY (sigh) JSON is the only real option

@natecull RFC 7049 Concise Binary Object Representation

@h mmm, binary JSON

the 'nice' thing about JSON becoming The Universal Data Standard For Everything is that you can almost but not *quite* represent Lisp lists in it.

and you can almost but not *quite* represent sets in it

and you can almost but not *quite* represent arbitrary dictionaries in it (ie with non-text keys)

and it almost but not *quite* even has integers

and you really can't represent program code at all unless you like pain and suffering
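A few of these "almost but not quite" gaps can be seen with Python's standard `json` module. This is only a sketch of the behaviour of one encoder; the helper name `try_dump` is made up for illustration, and the last lines assume a consumer that reads JSON numbers as IEEE 754 doubles (as JavaScript does), which Python itself does not.

```python
import json

# Lisp-style nested tuples survive only as arrays: the tuple/list distinction is lost.
nested = json.loads(json.dumps((1, (2, 3))))

# Non-text keys: integer keys are silently coerced to strings on the way out.
round_tripped = json.loads(json.dumps({1: "one", 2: "two"}))


def try_dump(value):
    """Return the JSON text, or the name of the exception the encoder raised."""
    try:
        return json.dumps(value)
    except TypeError as exc:
        return type(exc).__name__


# Sets and compound (tuple) keys have no JSON form, so the encoder refuses them.
set_result = try_dump({1, 2, 3})
tuple_key_result = try_dump({(1, 2): "pair"})

# Integers: JSON only has "number". A parser that stores numbers as doubles
# cannot tell 2**53 + 1 apart from 2**53.
big = 2**53 + 1
as_double = float(big)
```

Here `round_tripped` comes back with the keys `"1"` and `"2"`, both `try_dump` calls report `TypeError`, and `int(as_double)` is no longer equal to `big`.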

@natecull CBOR sold me because of its ability to encode integer types predictably. (which JSON, notoriously, doesn't)
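As a rough illustration of what "predictably" means here, this is a minimal sketch of how CBOR lays out integers: a major type in the top three bits of the first byte (0 for unsigned, 1 for negative) and either the value itself or a length code in the low five bits. The function name `cbor_encode_int` is made up for this example, and bignum tags are left out.

```python
import struct


def cbor_encode_int(n: int) -> bytes:
    """Encode an integer as CBOR major type 0 (unsigned) or 1 (negative)."""
    if n >= 0:
        major, value = 0, n
    else:
        major, value = 1, -1 - n  # negative integers store -1 - n

    # Values below 24 fit directly in the low five bits of the first byte.
    if value < 24:
        return bytes([(major << 5) | value])

    # Otherwise the low bits say how many big-endian bytes follow: 1, 2, 4 or 8.
    widths = ((24, ">B", 1 << 8), (25, ">H", 1 << 16),
              (26, ">I", 1 << 32), (27, ">Q", 1 << 64))
    for additional, fmt, limit in widths:
        if value < limit:
            return bytes([(major << 5) | additional]) + struct.pack(fmt, value)

    raise OverflowError("needs a CBOR bignum tag, which this sketch omits")


print(cbor_encode_int(1000).hex())   # 1903e8
print(cbor_encode_int(-100).hex())   # 3863
```

The sketch always picks the shortest width, so a given integer has exactly one byte encoding here, and a decoder never has to guess whether it is looking at an integer or a float.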

@h man why couldn't they have called it CYBOR

someone really missed a golden opportunity there

@natecull IDK, I think it may have something to do with self-touting the name of the spec's author (Carsten Bormann). I like astronomers who name their planets after deities or their pets, so that's a little annoying. But the spec is a good quality RFC, and it's useful, so I don't mind.

@natecull @h Have you heard of Rivest's canonical s-expressions? Used in encryption software you use every day, whether you know it or not! Comes with a nice binary encoding!

@h @natecull Actually the IETF draft is way better to read

I'm not sure you actually want to use it, because you'd probably need to write your own parser... but they're easy to write!
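To give a sense of how small such a parser can be, here is a sketch for the canonical form only: lists in parentheses and atoms written as a decimal length, a colon, and that many raw bytes. The names `parse_csexp` and `dump_csexp` are made up for this example, and the optional display hint in square brackets is not handled.

```python
def _parse(data: bytes, pos: int):
    """Parse one expression starting at pos; return (value, next position)."""
    if pos >= len(data):
        raise ValueError("unexpected end of input")

    if data[pos:pos + 1] == b"(":
        pos += 1
        items = []
        while data[pos:pos + 1] != b")":
            if pos >= len(data):
                raise ValueError("unterminated list")
            item, pos = _parse(data, pos)
            items.append(item)
        return items, pos + 1  # skip the closing parenthesis

    # Otherwise this must be an atom: <decimal length>:<raw bytes>
    colon = data.find(b":", pos)
    if colon == -1 or not data[pos:colon].isdigit():
        raise ValueError(f"expected a length prefix at offset {pos}")
    start = colon + 1
    end = start + int(data[pos:colon])
    if end > len(data):
        raise ValueError("atom is longer than the remaining input")
    return data[start:end], end


def parse_csexp(data: bytes):
    """Parse a complete canonical s-expression into nested lists of bytes."""
    value, pos = _parse(data, 0)
    if pos != len(data):
        raise ValueError("trailing bytes after expression")
    return value


def dump_csexp(value) -> bytes:
    """Inverse of parse_csexp for nested lists of bytes."""
    if isinstance(value, bytes):
        return str(len(value)).encode("ascii") + b":" + value
    return b"(" + b"".join(dump_csexp(item) for item in value) + b")"


print(parse_csexp(b"(3:foo(3:bar1:x))"))  # [b'foo', [b'bar', b'x']]
```

Because atoms are length-prefixed, there is no quoting or escaping to deal with, which is most of what makes the format easy to parse and safe for arbitrary binary payloads.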

@natecull @h I am, in fact, dealing with canonical-sexps right now because I'm writing an http-signatures library for Guile and I'm using libgcrypt via guile-gcrypt... and it turns out libgcrypt uses canonical-sexps everywhere

@cwebber @natecull I'm curious, what would a typical payload look like expressed as s-exps? Structs with a LISP-y feel?

@h @natecull shows some real-world structures.

However, almost nobody is using canonical s-exps outside of crypto software. I think they're a cool undernoticed technology though.

@cwebber @natecull Ah, it seems that McCarthy himself wrote on this particular topic in 1975, as some sort of universal data exchange lingua franca.

@natecull @cwebber These s-exp examples look a lot like YAML once you have most parenthetical paraphernalia removed.

@cwebber @natecull I agree YAML is bad for the same reason JSON is bad: numbers/scalars.

@h @natecull It's far worse though because at least you can write a JSON parser reasonably easily. YAML is incredibly hard to parse. Goofy, goofy syntax.

@h @cwebber yep, and SXML is a thing of beauty

What I think I've found is that if you add one 'term' marker (eg /) to sexps you can get complete semantic closure for describing *all* expressions, including atoms, as terms.

eg let's say we describe a string as a list of ASCII codes:

ABC == (/ string 65 66 67)


(1 2 3 . ABC) == (1 2 3 / string 65 66 67)

This seems quite useful to me. Terms could be *anything*.
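One possible reading of the `/` marker, sketched in Python: everything after the marker is a single term (a head symbol plus arguments) sitting in the tail position of the list. The names `Term`, `read_items` and `decode_string` are made up for this illustration and are not part of any existing s-expression reader; the input is assumed to be already tokenised.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Term:
    """Everything after a '/' marker: a head symbol plus its arguments."""
    head: str
    args: tuple


def read_items(tokens):
    """Split a flat token list at the first '/' into (items, tail).

    The tail is a Term if a marker was present, or None for a plain list.
    """
    if "/" not in tokens:
        return list(tokens), None
    cut = tokens.index("/")
    head, *args = tokens[cut + 1:]  # fails if nothing follows the marker
    return list(tokens[:cut]), Term(head, tuple(args))


def decode_string(term: Term) -> str:
    """Interpret a (/ string 65 66 67) term as text."""
    if term.head != "string":
        raise ValueError(f"not a string term: {term.head}")
    return "".join(chr(code) for code in term.args)


# (1 2 3 / string 65 66 67)
items, tail = read_items([1, 2, 3, "/", "string", 65, 66, 67])
print(items, decode_string(tail))  # [1, 2, 3] ABC
```

A bare `(/ string 65 66 67)` is then just the case where the item list is empty and the whole expression is the term.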

@natecull @h SXML is just amazing, absolutely the best. has more (and interesting analysis) on why SXML is so well designed and pleasant to use.

@cwebber @h .. an important use case of this is for describing mixed list/set/dictionary structures in sexps;

eg JSON {"a": 1, "b":2} can be

(/ dict (a . 1) (b . 2))

but we could append this to a list, which you can't do in JSON:

(x y z / dict (a . 1) (b . 2))

And this maps quite cleanly onto logical statements, so I can see it being helpful for making RDF a little cleaner. Eg, we could use this instead of TURTLE.
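The mapping from JSON-like values to this notation can be sketched as a small renderer. The function names `render` and `render_with_tail` are hypothetical, and strings are written as bare symbols with no quoting, which a real format would need to address.

```python
def render(value) -> str:
    """Render dicts as (/ dict (k . v) ...), lists as (a b c), the rest via str()."""
    if isinstance(value, dict):
        pairs = [f"({key} . {render(item)})" for key, item in value.items()]
        return "(" + " ".join(["/", "dict", *pairs]) + ")"
    if isinstance(value, list):
        return "(" + " ".join(render(item) for item in value) + ")"
    return str(value)


def render_with_tail(items, tail) -> str:
    """Render a list whose tail is a dict term rather than the empty list."""
    if not isinstance(tail, dict):
        raise TypeError("this sketch only supports a dict term as the tail")
    term = render(tail)
    # A rendered term looks like "(/ dict ...)"; drop its outer parentheses
    # so that the marker and everything after it continue the enclosing list.
    parts = [render(item) for item in items] + [term[1:-1]]
    return "(" + " ".join(parts) + ")"


print(render({"a": 1, "b": 2}))                            # (/ dict (a . 1) (b . 2))
print(render_with_tail(["x", "y", "z"], {"a": 1, "b": 2}))  # (x y z / dict (a . 1) (b . 2))
```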

@cwebber @h My feeling is that s-exps are missing just one little thing that would make them even more useful, and that's a 'term' marker. Can easily be added of course just by reserving a symbol, but then you have to deal with that symbol being reserved.

You can get a whole lot done with sexps but for dealing with, eg, JSON-like intermixed lists and dictionaries, you kind of need some syntax to indicate that there's a difference between the two.

@h @cwebber almost all the annoying things I see with data formats come *when you try to cross-connect and intermix data between formats*.

within a single format, you can make a lot of assumptions.

but when you mix-n-match, suddenly you either have to wrap every 'foreign' bit of data in a whole mess of careful abstractions or you just have ambiguity

@natecull @h Notably that's why json-ld adds the context! We want to make sure that "run" a mile and "run" a program clearly mean two different things.

I've been toying with markup for sexp-ld... :P

@h @natecull I may regret posting this, but this is my WIP syntax for sexp-ld

Idea is similar to json-ld: your local "compacted" document has symbols that you know map to particular unique URI-bound properties. You have an environment locally that maps these. You can then "expand" (or transform to json-ld or back) for exchange between servers.

@natecull @h Unlike json-ld this isn't intended on its own to be a transport serialization... it's for applications that know how they want their data represented after consuming it from an unambiguous form such as RDF or expanded json-ld.
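The compact/expand idea can be shown with a toy version in Python. This is not the JSON-LD expansion algorithm, only the core move of swapping short local names for URIs through a context; the two contexts and their `example.org` URIs are invented for the illustration.

```python
# Hypothetical contexts: two vocabularies that both use the short name "run".
FITNESS_CONTEXT = {"run": "https://example.org/fitness#run"}
COMPUTING_CONTEXT = {"run": "https://example.org/computing#run"}


def expand(document: dict, context: dict) -> dict:
    """Replace each known short key with the URI the context binds it to."""
    return {context.get(key, key): value for key, value in document.items()}


def compact(document: dict, context: dict) -> dict:
    """Inverse of expand: replace known URIs with their local short names."""
    reverse = {uri: term for term, uri in context.items()}
    return {reverse.get(key, key): value for key, value in document.items()}


local = {"run": "a mile"}
on_the_wire = expand(local, FITNESS_CONTEXT)
print(on_the_wire)  # {'https://example.org/fitness#run': 'a mile'}
```

The same local document expanded under `COMPUTING_CONTEXT` produces a different key, which is what keeps the two meanings of "run" apart once data from both sources is mixed.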

@cwebber @natecull I am able to appreciate the mathematical rigour of McCarthy, but writing and reading LISP --to me personally-- feels like I'm pushing buttons on an IBM 704 to extract heads or tails. It's a cultural thing.

@h @natecull That's fine.. though IME parenthetical language becomes as easy or even easier to read than non-parenthetical language over time. :)

IMO your editor can help a lot too... rainbow-delimiters, smartparens (or paredit), parinfer, etc. can help a lot!

@cwebber @h I like sexps as a syntax MUCH more than I like either Lisp or Scheme as a language

@cwebber @h It's just not as... common about it as some.

@natecull @h inferring from lisp names I can tell that interlisp was written by the nethack hackers when they were hacking the first versions of the internet. If they had used common lisp instead we would have had the commonnet instead. #truefacts

@cwebber @h And Macintoshes were developed by the hackers who invented Maclisp

It all holds together

@cwebber @natecull No argument about that, tooling can help. Provided that the tooling matches your particular culture. I just think that we often tend to underestimate cultural factors. All these languages aren't mathematics, there is nothing universal about them. Interfaces that speak differently to different sensibilities, like Esperanto or Lojban. In my case, crossing the gulf between me and a lispy continent would mean I also have to convert to Emacs. Barriers to entry are considerable.

@h @cwebber I'm sure with you on Emacs being a barrier. I've tried multiple times and just can't get my head in its space. It is actively hostile to every modern desktop UI convention. Wants to be running on a VT100.

@natecull @cwebber I don't know. Using vim isn't exactly the height of modernity... vi has been around for about 30 years in different forms too. Personally, I'm not sure that it's something dependent on any particular technological limitation. It's a je ne sais quoi that makes some people gravitate to certain specific ways of doing things. I'm not sure I can put a finger on it.

@natecull @h $ here meant that the value should be transformed into some localized type. For instance, ($ #[uri] "") should transform the value into a proper <uri> type in Guile. So yeah even this is an interim representation that can be serialized to disk :)

One reason for this is that in json-ld you have *no way* to know whether a value is a URI or a string without fully expanding the context, which I hate.

@cwebber @h That seems very similar to my idea of / as a 'term' flag!

Have you considered allowing $ to appear *anywhere* in a list, not just at the start? Then you could put typed expressions in the CDR like a dotted pair. I think that would open a lot of possibilities.

@natecull @h In fact I think that's exactly how the display hint works in canonical s-exps...

@natecull @h Fun story, I once was explaining to a friend why lisp was so cool...

"See! Lisp uses prefix syntax everywhere! No infix, so everything is super consistent! It's beautiful!"
"What's that dot doing in the middle of that expression?"
"Oh... cons is infix...!"

@cwebber @natecull Yeah, regularity in languages sounds like a good idea on paper, but semantics, syntax, orthography, prosody, etc... they all conspire to form a language that allows us to reproduce ideas in our braincells more or less accurately as they were intended to be played back.

@natecull @h Yes! I totally agree there. So, canonical-sexps do sort of have something for that with the "display" property, but I'm not sure it's really as nice as it could be.


to which humans say, "argh"

which sets me up to say "JSON and the argh-o-nauts"

@natecull How structured is it? I've always liked bencoded data, provided no other serialization needs to happen. It's like netstrings with more types.

But it's hard to argue with the low barrier to entry of using JSON: pretty much every language has a parser readily available.
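For comparison, a bencode encoder is also only a few lines. This sketch covers the four bencode types (integers, byte strings, lists, dictionaries); the function name `bencode` and the choice to UTF-8 encode Python `str` values are assumptions of the example. Byte strings use the same length-prefixed shape as a netstring, minus the trailing comma.

```python
def bencode(value) -> bytes:
    """Encode ints, bytes/str, lists and dicts in bencode."""
    if isinstance(value, bool):
        raise TypeError("bencode has no boolean type")
    if isinstance(value, int):
        return b"i" + str(value).encode("ascii") + b"e"
    if isinstance(value, str):
        value = value.encode("utf-8")  # assumption: text is sent as UTF-8 bytes
    if isinstance(value, bytes):
        return str(len(value)).encode("ascii") + b":" + value
    if isinstance(value, list):
        return b"l" + b"".join(bencode(item) for item in value) + b"e"
    if isinstance(value, dict):
        # Keys are byte strings and must appear in sorted order.
        by_key = {
            (key.encode("utf-8") if isinstance(key, str) else key): item
            for key, item in value.items()
        }
        body = b"".join(bencode(key) + bencode(by_key[key]) for key in sorted(by_key))
        return b"d" + body + b"e"
    raise TypeError(f"cannot bencode {type(value).__name__}")


print(bencode({"cow": "moo", "spam": ["eggs", 42]}))
# b'd3:cow3:moo4:spaml4:eggsi42eee'
```

Integers are unambiguous here too, but there are no floats, booleans or nulls, which is part of why the format only works well when nothing else needs to be serialized.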