Nate Cull, a ghost in spring @natecull

Part of my grumbling about data formats comes from Chinese language learning: looking at these two (fortunately text! but raw, non-structured) datasets and asking what format would be best to link, modify and share them. Or lots of other datasets like these.

mdbg.net/chinese/dictionary?pa

github.com/amake/cjk-decomp


Like, I just want a standard dictionary showing translation *and decomposition information* for characters?

So would the best format be:

* raw line-delimited text with nonstandard field delimiters (what these two projects decided to use for some reason)

* CSV
* JSON
* XML
* S-expressions

* Excel/OpenOffice spreadsheet
* Word/OpenOffice document

* stick it in a proprietary binary database

PROBABLY (sigh) JSON is the only real option

@natecull RFC 7049 Concise Binary Object Representation cbor.io/

@h mmm, binary JSON

the 'nice' thing about JSON becoming The Universal Data Standard For Everything is that you can almost but not *quite* represent Lisp lists in it.

and you can almost but not *quite* represent sets in it

and you can almost but not *quite* represent arbitrary dictionaries in it (ie with non-text keys)

and it almost but not *quite* even has integers

and you really can't represent program code at all unless you like pain and suffering
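
A quick sketch of those "almost but not quite" points, using Python's standard json module (my illustration, not from the thread):

    import json

    # Sets: not representable at all.
    try:
        json.dumps({1, 2, 3})
    except TypeError as err:
        print(err)  # Object of type set is not JSON serializable

    # Arbitrary dictionaries: non-text keys are silently coerced to strings.
    assert json.loads(json.dumps({1: "a"})) == {"1": "a"}

    # Integers: JSON only has "number", and consumers that parse numbers as
    # IEEE doubles (e.g. JavaScript) cannot tell these two values apart.
    assert float(2**53 + 1) == float(2**53)

    # Lisp lists: a dotted pair like (1 . 2) has no JSON counterpart; a
    # two-element array is the closest approximation, and it loses the
    # distinction between a proper list and a dotted tail.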

@natecull CBOR sold me because of its ability to encode integer types predictably (which JSON, notoriously, doesn't).
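
For comparison, a minimal CBOR round-trip, assuming the third-party cbor2 package (pip install cbor2):

    import cbor2

    # Integers survive the round trip exactly; CBOR has distinct integer types.
    payload = cbor2.dumps({"n": 2**53 + 1})
    assert cbor2.loads(payload) == {"n": 2**53 + 1}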

@h man why couldn't they have called it CYBOR

someone really missed a golden opportunity there

@natecull IDK, I think it may have something to do with self-touting the name of the spec's author (Carsten Bormann). I like astronomers who name their planets after deities or their pets, so that's a little annoying. But the spec is a good quality RFC, and it's useful, so I don't mind.

@natecull @h Have you heard of Rivest's canonical s-expressions? Used in encryption software you use every day, whether you know it or not! Comes with a nice binary encoding! en.wikipedia.org/wiki/Canonica

@h @natecull Actually the IETF draft is way better to read: people.csail.mit.edu/rivest/Se

I'm not sure you actually want to use it, because you'd probably need to write your own parser... but they're easy to write!
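
They are indeed easy to write. A minimal sketch of a parser for the core csexp grammar (length-prefixed atoms only; no display hints or transport encoding):

    def parse_csexp(data: bytes, pos: int = 0):
        """Parse one canonical s-expression at pos; return (value, next_pos)."""
        if data[pos:pos + 1] == b"(":
            items, pos = [], pos + 1
            while data[pos:pos + 1] != b")":
                item, pos = parse_csexp(data, pos)
                items.append(item)
            return items, pos + 1
        # An atom is a decimal length, a colon, then exactly that many raw bytes.
        colon = data.index(b":", pos)
        length = int(data[pos:colon])
        start = colon + 1
        return data[start:start + length], start + length

    value, _ = parse_csexp(b"(3:abc(1:x1:y))")
    assert value == [b"abc", [b"x", b"y"]]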

@natecull @h I am, in fact, dealing with canonical-sexps right now because I'm writing an http-signatures library for Guile and I'm using libgcrypt via guile-gcrypt... and it turns out libgcrypt uses canonical-sexps everywhere

@cwebber @natecull I'm curious, what would a typical payload look like expressed as s-exps? Structs with a LISP-y feel?

@h @natecull gnupg.org/documentation/manual shows some real-world structures.

However, almost nobody is using canonical s-exps outside of crypto software. I think they're a cool undernoticed technology though.

@cwebber @natecull Ah, it seems that McCarthy himself wrote on this particular topic in 1975, proposing some sort of universal data-exchange lingua franca. www-formal.stanford.edu/jmc/cb

@natecull @cwebber These s-exp examples look a lot like YAML once most of the parenthetical paraphernalia is removed. en.wikipedia.org/wiki/YAML

@cwebber @natecull I agree YAML is bad for the same reason JSON is bad: numbers/scalars.

@h @natecull It's far worse though because at least you can write a JSON parser reasonably easily. YAML is incredibly hard to parse. Goofy, goofy syntax.

@h @cwebber yep, and SXML is a thing of beauty

en.wikipedia.org/wiki/SXML

What I think I've found is that if you add one 'term' marker (eg /) to sexps you can get complete semantic closure for describing *all* expressions, including atoms, as terms.

eg let's say we describe a string as a list of ASCII codes:

ABC == (/ string 65 66 67)

then

(1 2 3 . ABC) == (1 2 3 / string 65 66 67)

This seems quite useful to me. Terms could be *anything*.
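
A toy reading of the '/' marker in Python (my sketch of the idea above, with a term rendered as a (tag, args) pair):

    def read_term_list(sexp):
        """Split a '/'-marked list into (proper head, tail term or None)."""
        if "/" not in sexp:
            return sexp, None
        i = sexp.index("/")
        tag, *args = sexp[i + 1:]
        return sexp[:i], (tag, args)

    def term_to_value(term):
        tag, args = term
        if tag == "string":                 # (/ string 65 66 67) -> "ABC"
            return "".join(chr(c) for c in args)
        return term                         # unknown tags stay symbolic

    head, tail = read_term_list([1, 2, 3, "/", "string", 65, 66, 67])
    assert head == [1, 2, 3]
    assert term_to_value(tail) == "ABC"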

@natecull @h SXML is just amazing, absolutely the best.

more-magic.net/posts/lispy-dsl has more (and interesting analysis) on why SXML is so well designed and pleasant to use.

@cwebber @h .. an important use case of this is for describing mixed list/set/dictionary structures in sexps;

eg JSON {"a": 1, "b":2} can be

(/ dict (a . 1) (b . 2))

but we could append this to a list, which you can't do in JSON:

(x y z / dict (a . 1) (b . 2))

And this maps quite cleanly onto logical statements, so I can see it being helpful for making RDF a little cleaner. Eg, we could use this instead of Turtle.
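
The same toy reading extended to '/ dict' terms, with dotted pairs modeled as Python tuples; note how the term can hang off a list head, which plain JSON cannot express:

    def split_term(sexp):
        i = sexp.index("/")
        tag, *args = sexp[i + 1:]
        return sexp[:i], tag, args

    def to_python(sexp):
        head, tag, args = split_term(sexp)
        assert tag == "dict"
        value = dict(args)                  # [("a", 1), ("b", 2)] -> {"a": 1, "b": 2}
        return (head, value) if head else value

    assert to_python(["/", "dict", ("a", 1), ("b", 2)]) == {"a": 1, "b": 2}
    assert to_python(["x", "y", "z", "/", "dict", ("a", 1)]) == (["x", "y", "z"], {"a": 1})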

@cwebber @h My feeling is that s-exps are missing just one little thing that would make them even more useful, and that's a 'term' marker. Can easily be added of course just by reserving a symbol, but then you have to deal with that symbol being reserved.

You can get a whole lot done with sexps but for dealing with, eg, JSON-like intermixed lists and dictionaries, you kind of need some syntax to indicate that there's a difference between the two.

@h @cwebber almost all the annoying things I see with data formats come *when you try to cross-connect and intermix data between formats*.

within a single format, you can make a lot of assumptions.

but when you mix-n-match, suddenly you either have to wrap every 'foreign' bit of data in a whole mess of careful abstractions or you just have ambiguity

@natecull @h Notably that's why json-ld adds the context! We want to make sure that "run" a mile and "run" a program clearly mean two different things.

I've been toying with markup for sexp-ld... :P
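
A small demonstration of the context at work, assuming the pyld package and with made-up IRIs:

    from pyld import jsonld

    doc = {"run": "a mile"}
    sport = {"@context": {"run": "https://example.org/sport/run"}}
    code = {"@context": {"run": "https://example.org/software/run"}}

    # The same key expands to two different, unambiguous IRIs.
    print(jsonld.expand({**sport, **doc}))
    # [{'https://example.org/sport/run': [{'@value': 'a mile'}]}]
    print(jsonld.expand({**code, **doc}))
    # [{'https://example.org/software/run': [{'@value': 'a mile'}]}]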

@h @natecull I may regret posting this, but this is my WIP syntax for sexp-ld paste.lisp.org/display/355687

Idea is similar to json-ld: your local "compacted" document has symbols that you know map to particular unique URI-bound properties. You have an environment locally that maps these. You can then "expand" (or transform to json-ld or back) for exchange between servers.

@natecull @h Unlike json-ld this isn't intended on its own to be a transport serialization... it's for applications that know how they want their data represented after consuming it from an unambiguous form such as RDF or expanded json-ld.

@cwebber @natecull social.coop/media/SfOmvmWbCydT I am able to appreciate the mathematical rigour of McCarthy, but writing and reading LISP --to me personally-- feels like I'm pushing buttons on an IBM 704 to extract heads or tails. It's a cultural thing.

@h @natecull That's fine.. though IME parenthetical language becomes as easy or even easier to read than non-parenthetical language over time. :)

IMO your editor can help a lot too... rainbow-delimiters, smartparens (or paredit), parinfer, etc. can help a lot! dustycloud.org/tmp/emacs_lisp_

@cwebber @h I like sexps as a syntax MUCH more than I like either Lisp or Scheme as a language

@cwebber @h It's just not as... common about it as some.

@natecull @h inferring from lisp names, I can tell that interlisp was written by the nethack hackers when they were hacking the first versions of the internet. If they had used common lisp instead we would have had the commonnet instead. #truefacts

@cwebber @h And Macintoshes were developed by the hackers who invented Maclisp

It all holds together

@natecull @h Apple e-macs were designed by emacs enthusiasts coming from maclisp??? I think we're doing a great job of inferring history from language alone here and we should keep it up

@natecull @cwebber Then I'm more at home with a Wirth way of doing things, and Mac Pascal was a big thing on the Mac back then. I have the feeling that various currents and undercurrents flow in different directions, sometimes without direct relation to a specific platform. Although I do agree that technical devices *inform* ways of doing things, and they certainly influence *how* tools are used, they are not the tool themselves.

@cwebber @natecull Gotta go, thanks for the chat guys. Speak soon.

@h @cwebber I often think of modern operating systems (especially Linux) as a city with thousands of years (Internet time) of history embedded in layers of architecture. Warring empires, philosophies, junk piles, ruins... and more and more just built on top.

@natecull @h So just out of curiosity have you read the Zones of Thought series, particularly A Deepness in the Sky? Because based on this toot I'm guessing you'd love it 1000x

@cwebber @h Yep!

Programming as archeology. Everything riddled with vulnerabilities installed millennia ago by ancient unspeakable galactic evil.

Sounds about right.

@cwebber @natecull No argument about that, tooling can help. Provided that the tooling matches your particular culture. I just think that we often tend to underestimate cultural factors. All these languages aren't mathematics, there is nothing universal about them. Interfaces that speak differently to different sensibilities, like Esperanto or Lojban. In my case, crossing the gulf between me and a lispy continent would mean I also have to convert to Emacs. Barriers to entry are considerable.

@h @cwebber I'm with you on Emacs being a barrier. I've tried multiple times and just can't get my head into its space. It is actively hostile to every modern desktop UI convention. Wants to be running on a VT100.

@natecull @cwebber I don't know. Using vim isn't exactly the height of modernity... vi has been around for about 30 years in different forms too. Personally, I'm not sure that it's something dependent on any particular technological limitation. It's a je ne sais quoi that makes some people gravitate to certain specific ways of doing things. I'm not sure I can put a finger on it.

@natecull @h $ here meant that the value should be transformed into some localized type. For instance, ($ #[uri] "social.dustycloud.org/") should transform the value into a proper <uri> type in Guile. So yeah even this is an interim representation that can be serialized to disk :)

One reason for this is that in json-ld you have *no way* to know whether a value is a URI or a string without fully expanding the context, which I hate.
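
A sketch of that '$' transform in Python (my illustration; the coercion table and names are invented):

    from urllib.parse import urlparse, ParseResult

    COERCIONS = {"uri": urlparse}           # hypothetical local type table

    def localize(sexp):
        """Coerce ($ type value) nodes into local typed values."""
        if isinstance(sexp, list) and sexp[:1] == ["$"]:
            _, type_name, value = sexp
            return COERCIONS[type_name](value)
        if isinstance(sexp, list):
            return [localize(x) for x in sexp]
        return sexp

    v = localize(["$", "uri", "https://social.dustycloud.org/"])
    assert isinstance(v, ParseResult)       # a real URI object, not a bare string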

@cwebber @h That seems very similar to my idea of / as a 'term' flag!

Have you considered allowing $ to appear *anywhere* in a list, not just at the start? Then you could put typed expressions in the CDR like a dotted pair. I think that would open a lot of possibilities.

@natecull @h In fact I think that's exactly how the display hint works in canonical s-exps...

@natecull @h Fun story, I once was explaining to a friend why lisp was so cool...

"See! Lisp uses prefix syntax everywhere! No infix, so everything is super consistent! It's beautiful!"
"What's that dot doing in the middle of that expression?"
"Oh... cons is infix...!"

@cwebber @natecull Yeah, regularity in languages sounds like a good idea on paper, but semantics, syntax, orthography, prosody, etc... they all conspire to form a language that allows us to reproduce ideas in our brain cells more or less accurately, as they were intended to be played back.

@natecull @h Yes! I totally agree there. So, canonical-sexps do sort of have something for that with the "display" property, but I'm not sure it's really as nice as it could be.

@natecull

to which humans say, "argh"

which sets me up to say "JSON and the argh-o-nauts"

@natecull How structured is it? I've always liked bencoded data, provided no other serialization needs to happen. It's like netstrings with more types.

But it's hard to argue with the low barrier to entry of using JSON; pretty much every language has a parser readily available.
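
Bencode really is netstrings plus a few types; a minimal encoder fits in a dozen lines (a sketch, not a hardened implementation):

    def bencode(value) -> bytes:
        if isinstance(value, int):
            return b"i%de" % value
        if isinstance(value, bytes):
            return b"%d:%s" % (len(value), value)
        if isinstance(value, str):
            return bencode(value.encode("utf-8"))
        if isinstance(value, list):
            return b"l" + b"".join(bencode(v) for v in value) + b"e"
        if isinstance(value, dict):         # keys must be sorted strings
            out = b"d"
            for k in sorted(value):
                out += bencode(k) + bencode(value[k])
            return out + b"e"
        raise TypeError(f"cannot bencode {type(value).__name__}")

    assert bencode({"tone": 1, "word": "mao"}) == b"d4:tonei1e4:word3:maoe"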

@kensanata Only for Japanese - CC-CEDICT is the Chinese dictionary, a spinoff of EDICT, I believe.

cc-cedict.org/

But CEDICT doesn't have *character decomposition* information. That's in the unrelated CJK-DECOMP project - or rather, this fork of it:

github.com/amake/cjk-decomp

mdbg.net's online web page / app *has* apparently linked these datasets. But they don't share their data. :( I want to play with characters offline.

@kensanata The first thing I've found, now that I've parsed and linked the sets

(and why they couldn't have just been in JSON or even CSV instead of weird custom raw-text formats!)

is that CJK-DECOMP has *vastly* more characters in it than CC-CEDICT. Like, 40,000 or so to CC-CEDICT's 10,000? Something like that.
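
For the curious, both line formats are simple enough to parse by hand. A sketch, with the field layouts taken from my reading of each project's README (so treat the exact shapes as assumptions):

    import re

    # CC-CEDICT:  traditional simplified [pinyin] /def/def/
    CEDICT_LINE = re.compile(r"^(\S+) (\S+) \[([^\]]+)\] /(.+)/$")

    def parse_cedict_line(line: str) -> dict:
        trad, simp, pinyin, defs = CEDICT_LINE.match(line).groups()
        return {"trad": trad, "simp": simp, "pinyin": pinyin,
                "defs": defs.split("/")}

    # CJK-DECOMP:  character:type(part,part)
    def parse_decomp_line(line: str) -> dict:
        char, rest = line.split(":", 1)
        kind, parts = rest.rstrip(")").split("(", 1)
        return {"char": char, "type": kind, "parts": parts.split(",")}

    entry = parse_cedict_line("貓 猫 [mao1] /cat/")
    assert entry["pinyin"] == "mao1" and entry["defs"] == ["cat"]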

@natecull Well, wasn't the sheer number of Chinese characters (50k, including rare and historic variants) the reason for Han Unification at the beginning of the Unicode effort? That raised quite a few hackles.

@kensanata I think it was controversial, yeah. At least I gather by reading Wikipedia; since I'm still a very basic Chinese language learner, I'm just looking for something to help me get a grasp of the characters.

Even CC-CEDICT and CJK-DECOMP are missing a bunch of very important/useful information, e.g.:

* whether a character is a radical or not
* because basically you don't want to decompose radicals! Or you *can* but it's less meaningful
* a _short_ English definition for recognition

@kensanata My minimal merged file (only characters in CC-CEDICT) has 10,714 characters; the maximal one (all CJK-DECOMP characters) has 85,246 characters.

@kensanata .. but there are at least 10,000 or so CJK-DECOMP 'fake characters' which don't actually have Unicode codepoints; they're just the interim decompositions of other components. Necessary for lookup but not quite 'real'.

@natecull Aargh! I didn't know that. And Unicode doesn't let you just compose the radicals using some sort of magic Unicode code point?

@kensanata There is a whole separate Unicode composition system, I believe! and nobody uses it

which is sad and terrible

github.com/cjkvi/cjkvi-ids

en.wikipedia.org/wiki/Ideograp

in fact there's a bunch of other approaches:

en.wikipedia.org/wiki/Chinese_

I hope one of them wins one day. but in the meantime, everything in Chinese is Unicode characters. :(

not all components are *radicals*; radicals are 'special' ones that 'start' a character and are used as indexes.

they're not always on the top/left :(

@natecull thanks for the links. I read the section on Chinese characters at en.m.wikipedia.org/wiki/Precom and left it at that. The rabbit hole goes on and on.

@natecull Also, let me stand in as a warning: I was trying to learn Japanese for two years and ended up spending most of the time improving my learning software and parsing EDICT files and the like instead of ACTUALLY LEARNING THE LANGUAGE. Just saying. 🤷🏽‍♂️

@kensanata yeah, this may all be a waste of time

though mostly it's fun

My initial hope is to get a string derived from a character that I can run over the Chinese sentence flashcards I have in Anki, to give me cribs for character expansions. That may be a futile hope.

I just don't like going 'that character is made up of ... blag.. gorp.. splat... ugh'

at least if I knew what the parts were called it might help...?

but some of the parts have no names. :(

@kensanata so, first order of business: take the decompositions, or at least the decompositions that ARE characters... and just assign at least the sound to them, and MAYBE an English word.

so 猫 becomes at least 猫mao, and MAYBE 猫mao=犭quan:苗miao

@kensanata if I can at least get the sound for the word and its two major components, that's a big easy win

@kensanata then 苗miao-seedling (but used in 猫mao-cat just as the sound) decomposes into 卄nian-?:田tian-field

I really want visible triplets of character, sound, English. So I can associate the three.
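
A sketch of generating those cribs (the lookup tables here are tiny stand-ins for the real merged CC-CEDICT/CJK-DECOMP data):

    PINYIN = {"猫": "mao", "犭": "quan", "苗": "miao", "卄": "nian", "田": "tian"}
    PARTS = {"猫": ["犭", "苗"], "苗": ["卄", "田"]}

    def crib(char: str) -> str:
        """Render char + sound, plus its major components if known."""
        label = char + PINYIN.get(char, "?")
        parts = PARTS.get(char)
        if not parts:
            return label
        return label + "=" + ":".join(c + PINYIN.get(c, "?") for c in parts)

    assert crib("猫") == "猫mao=犭quan:苗miao"
    assert crib("苗") == "苗miao=卄nian:田tian"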

@kensanata and if I'm being super-stupid, I'd like to replace the standard 1,2,3,4 tone marks with _,!,?,~ - as an ASCII approximation to diacritics, because it roughly reflects the intonation.

so 猫mao_ cat is the flat tone 1 mao for 'cat'.

毛mao? hair is rising tone 2, because it sounds like a question in English

the big missing part so far is that I need single-syllable English cribs. Only need to type in 10,000 of those!

but still! Even just with the sound and tone, that's useful
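
A sketch of that substitution: from the examples here, tone 1 (flat) gets _ and tone 2 (rising) gets ?; the marks for tones 3 and 4 are my guess at the intended pairing.

    import re

    TONE_MARKS = {"1": "_", "2": "?", "3": "~", "4": "!"}  # tones 3/4 assumed

    def ascii_tones(pinyin: str) -> str:
        """Rewrite trailing pinyin tone digits as intonation-like punctuation."""
        return re.sub(r"([a-zü]+)([1-4])",
                      lambda m: m.group(1) + TONE_MARKS[m.group(2)],
                      pinyin)

    assert ascii_tones("mao1") == "mao_"    # 猫 māo, cat (flat)
    assert ascii_tones("mao2") == "mao?"    # 毛 máo, hair (rising)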

@natecull @kensanata No, miáo-seedling’s top component is 艹 cǎo.

@natecull @kensanata It’s important to remember that, while the idea of phonological components in Chinese characters is real, typically they gelled millennia ago, and since then there have been massive shifts in pronunciation. So there’s a lot of heartache when you use phonological components as a mnemonic.

@wrenpile @kensanata I'm aware the subcomponents beyond 'radical' don't have immediate semantic value.

But they're hugely important to me, a confused laowai staring at funny groups of lines and wanting to know *does that group of lines have a name*

If it has a name, *it can be searched*

@natecull I never meant to say components other than radicals are useless: quite the contrary.

For you and for me, what counts is what helps us remember what characters look like and mean, right?

The problems I pointed at were:

• radicals, for historical reasons, often don’t signify anything useful in the characters they inhabit;

• phonological components often bear only a tortured historical relationship to modern pronunciation.

@natecull Hah, Damien Elmes and Anki, I remember the time when he was a regular on the Emacs IRC channel. :)

@kensanata It's a cool program! I must install it on this machine

@natecull @kensanata Radicals aren’t laws of physics, they’re artifacts of an attempt to grasp Chinese characters hundreds of years ago. As semantic and etymological information, they’re insufficient and often misleading.

At the risk of telling you to go spend some money, these people are onto something: outlier-linguistics.com/