Welp.
I was loving the Parsimmon JavaScript library for parsing Chinese until I smacked headfirst into the character 𠂊, U+2008A
:(
Parsimmon does not understand the concept that Unicode characters exist outside the Basic Multilingual Plane.
(Aliens Newt voice)
but they do
@natecull You are not making me miss my old unicode days...
@browneyedgirl It is kind of fun!
A huge sprawling database constructed by multiple projects spread across multiple languages all with different ideas about what even a 'character' is...
Oh boy! Can you spare a moment to be introduced to Our Lords and Saviours, Parser Combinators?
I dunno how fast they are but I am enjoying that I can finally parse the Unspeakable Documents!
I think they might even work for these only mildly abominatory things:
what's a kind way to say 'die in a fire, regexes'?
I mean I am still using regex but only for terminals.
Well, if you're looking to make their acquaintance and use JavaScript, I recommend Parsimmon.
It's a very small library, and it seems to do what it does simply and neatly.
Took me a day to get my head around it but now it's absolutely what I was looking for - I just didn't know the keyword to search for.
It's a JavaScript thing, I think, at the moment. JS strings don't natively deal in true code points; they deal in 16-bit UTF-16 code units (UCS-2-style whatevers).
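To make the 16-bit thing concrete, here's a quick sketch you can paste into Node. It only uses built-in string methods; the variable names are just mine for illustration.

```javascript
// U+2008A written as an explicit surrogate pair, so the file encoding doesn't matter.
const astralSample = "\uD840\uDC8A"; // 𠂊

const codeUnitCount = astralSample.length;         // counts 16-bit code units, not characters
const firstUnit = astralSample.charCodeAt(0);      // only sees the high surrogate
const fullCodePoint = astralSample.codePointAt(0); // reads the whole code point
const byCodePoint = Array.from(astralSample);      // the string iterator walks code points

console.log(
  codeUnitCount,                 // 2
  firstUnit.toString(16),        // d840
  fullCodePoint.toString(16),    // 2008a
  byCodePoint.length             // 1
);
```

So anything that steps through a string one index at a time sees two "characters" here, which is presumably what trips up a parser that advances by code unit.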
Since the whole point of parser combinators is that they're recursive, a group of characters can be treated the same as a single one. So I worked around it fairly easily by defining an entity called 'astral character', which matches a high surrogate followed by a low surrogate. (In regex, actually; but it's okay to use regex for terminals.)
That's quite neat, I think.
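Roughly what that 'astral character' terminal could look like, as a sketch in plain JS with no library. The regex and the `splitChars` helper are my own illustration, not the exact code I'm running; in Parsimmon you'd presumably hand a pattern like this to `Parsimmon.regexp`.

```javascript
// One "astral character": a high surrogate followed by a low surrogate.
const ASTRAL = /[\uD800-\uDBFF][\uDC00-\uDFFF]/;

// One "character" for parsing purposes: an astral pair, or else any single code unit.
// The sticky flag (y) makes exec() match only at lastIndex, like a parser position.
const ANY_CHAR = /[\uD800-\uDBFF][\uDC00-\uDFFF]|[\s\S]/y;

// Hypothetical helper: walk the input and return one entry per "character".
function splitChars(input) {
  const out = [];
  let pos = 0;
  while (pos < input.length) {
    ANY_CHAR.lastIndex = pos;
    const match = ANY_CHAR.exec(input); // always matches while pos < input.length
    out.push(match[0]);
    pos += match[0].length;
  }
  return out;
}

const pieces = splitChars("a\uD840\uDC8Ab");
console.log(pieces.length); // 3: "a", the surrogate pair, "b"
```

The pair alternative has to come first in the alternation, otherwise the single-unit branch would grab the high surrogate on its own.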
If you're already using Racket, that's great! I'm sure there's a much better library in there somewhere.
The JavaScript parser combinator libraries use that weird 'fluent API' thing (i.e., chaining a bunch of object method calls together to construct an object which has the properties you want), which is a clever hack, and I understand it well enough, but probably Scheme would be much clearer and to the point.
For me, I'm happy to find that Node.js basically hits my sweet spot for 'a sufficiently powerful desktop scripting language I can just pick up and start tooling around with data and banging the rocks together, interactively'.
Since what I'm looking at and trying to decipher is *literally* a foreign language, it helps having a programming language that's a bit familiar and easy to please.
A really dumb, but really useful thing of a prototype OO language like JS is that I can just easily shove all my variables into objects as little namespaces.
I don't care that they're 'not Proper Objects' and the language doesn't get in my way (like, eg, Ruby or Python) and insist that a Dictionary is not a Hash is not an Object, etc.
Ah, I guess now is when I have to learn how to construct surrogate pairs.
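For the record, the arithmetic is the standard UTF-16 one: subtract 0x10000, then the top 10 bits go on 0xD800 and the bottom 10 bits go on 0xDC00. A minimal sketch (the `toSurrogatePair` name is just mine):

```javascript
// Split a code point above the BMP into its UTF-16 high and low surrogates.
function toSurrogatePair(codePoint) {
  if (codePoint < 0x10000 || codePoint > 0x10FFFF) {
    throw new RangeError("not an astral code point: " + codePoint.toString(16));
  }
  const offset = codePoint - 0x10000;       // 20-bit value
  const high = 0xD800 + (offset >> 10);     // top 10 bits
  const low = 0xDC00 + (offset & 0x3FF);    // bottom 10 bits
  return [high, low];
}

const [highUnit, lowUnit] = toSurrogatePair(0x2008A);
const builtChar = String.fromCharCode(highUnit, lowUnit);

console.log(highUnit.toString(16), lowUnit.toString(16)); // d840 dc8a
console.log(builtChar === String.fromCodePoint(0x2008A)); // true
```

In practice the built-in `String.fromCodePoint` does this for you; the manual version is only useful for seeing what's inside the pair.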