Nate Cull @natecull

Welp.

I was loving the Parsimmon Javascript library for parsing Chinese until I smacked head first into the character 𠂊, U+2008A

:(

Parsimmon does not understand the concept that Unicode characters exist outside the Basic Multilingual Plane.

(Aliens Newt voice)

but they do


Ah, I guess now is when I have to learn how to construct surrogate pairs.
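
For anyone following along, here is a minimal sketch of how a surrogate pair can be built by hand for a code point above U+FFFF, using the character from the post above. The helper name `toSurrogatePair` is made up for illustration; it is not part of Parsimmon or of JavaScript itself.

```javascript
// Hypothetical helper: split an astral code point (above U+FFFF)
// into its UTF-16 high and low surrogate code units.
function toSurrogatePair(codePoint) {
  if (codePoint < 0x10000 || codePoint > 0x10FFFF) {
    throw new RangeError('not an astral code point');
  }
  const offset = codePoint - 0x10000;      // 20 bits remain
  const high = 0xD800 + (offset >> 10);    // top 10 bits
  const low = 0xDC00 + (offset & 0x3FF);   // bottom 10 bits
  return [high, low];
}

const [high, low] = toSurrogatePair(0x2008A); // [0xD840, 0xDC8A]
const ch = String.fromCharCode(high, low);    // the single character U+2008A
const units = ch.length;                      // 2, because length counts 16-bit units
```

The built-in `String.fromCodePoint(0x2008A)` produces the same string, so the manual arithmetic is mostly useful for seeing why a regex or parser that works on 16-bit units sees two "characters" here.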

@browneyedgirl It is kind of fun!

A huge sprawling database constructed by multiple projects spread across multiple languages all with different ideas about what even a 'character' is...

@natecull @browneyedgirl "kind of" ... on the right kind of day, I guess. :-)
@natecull

https://duckduckgo.com/?q=xml+parse+regex+stackoverflow+he+comes

> If you parse HTML with regex you are giving in to Them and their blasphemous ways which doom us all to inhuman toil for the One whose Name cannot be expressed in the Basic Multilingual Plane, he comes.
@clacke @natecull while xml itself is not regular, the subset of xml documents that can be produced by a particular source is often regular.
@ayy @natecull This is true! That post is valid only for generic xml/html.

@clacke @ayy

Oh boy! Can you spare a moment to be introduced to Our Lords and Saviours, Parser Combinators?

I dunno how fast they are but I am enjoying that I can finally parse the Unspeakable Documents!

I think they might even work for these only mildly abominatory things:

en.wikipedia.org/wiki/Chinese_

what's a kind way to say 'die in a fire, regexes'?

I mean I am still using regex but only for terminals.

@natecull @ayy I intend to introduce myself to parser combinators at some point. :-)

But yeah, really the thing is to know what class of language you're dealing with. A regular language? A CFG? Something that requires a crazy-ass handwritten parser? ... and then use the appropriate tool.

And to reiterate the point already made, a restricted subset of XML can indeed be a regular language.

@clacke @ayy

Well, if you're looking to make their acquaintance and use Javascript, I recommend Parsimmon.

It's a very small library, and it seems to do what it does simply and neatly.

Took me a day to get my head around it but now it's absolutely what I was looking for - I just didn't know the keyword to search for.

@clacke @ayy

The thing about 'use the appropriate tool' is that, sometimes, a bunch of crazy people scatter all the data you need for one job in multiple documents around the web, each using a different class of language.

In that case, 'the appropriate tool for the job' is 'one tool'.

@natecull @ayy ... apart from the BMP thing apparently. :-)

But I wonder if that's a Parsimmon thing or just a JS/node/browser thing.

@clacke @ayy

It's a Javascript thing, I think, at the moment. JS doesn't natively deal in true codepoints; it deals in UCS-2 16-bit whatevers.

Since the whole point of parser combinators is that they're recursive, so a group of characters is the same as one, I worked around it fairly easily by just defining an entity called 'astral character' which detects a high-surrogate followed by a low-surrogate. (in regex, actually; but it's okay to use regex for terminals).

That's quite neat, I think.
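
A rough sketch of the 'astral character' idea described above, written with hand-rolled combinators rather than Parsimmon itself. The names `regexTerminal`, `alt`, `many`, `astralChar`, `bmpChar` and `anyChar` are all assumptions made for this illustration; the actual workaround in the thread may look different.

```javascript
// Hypothetical mini-combinators, NOT Parsimmon's API.
// A parser is a function (input, index) => { value, next } or null on failure.

// Terminal: match a regex exactly at the current index (sticky flag 'y').
const regexTerminal = (re) => {
  const sticky = new RegExp(re.source, 'y');
  return (input, index) => {
    sticky.lastIndex = index;
    const m = sticky.exec(input);
    return m ? { value: m[0], next: index + m[0].length } : null;
  };
};

// Try each parser in order and return the first success.
const alt = (...parsers) => (input, index) => {
  for (const p of parsers) {
    const r = p(input, index);
    if (r) return r;
  }
  return null;
};

// Apply a parser zero or more times, collecting the values.
const many = (parser) => (input, index) => {
  const values = [];
  let r;
  while ((r = parser(input, index)) !== null) {
    values.push(r.value);
    index = r.next;
  }
  return { value: values, next: index };
};

// A high surrogate followed by a low surrogate counts as ONE character.
const astralChar = regexTerminal(/[\uD800-\uDBFF][\uDC00-\uDFFF]/);
// Any single code unit that is not a surrogate.
const bmpChar = regexTerminal(/[^\uD800-\uDFFF]/);
const anyChar = alt(astralChar, bmpChar);

const result = many(anyChar)('a\u{2008A}\u4E2D', 0);
// result.value has three entries: 'a', U+2008A, U+4E2D
```

Because the rest of the grammar only ever asks for `anyChar`, it never needs to know that some characters occupy two code units.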

@natecull @ayy JavaScript having been made gratuitously Java-compatible in places (like the horrible Date) back in the 90s, it using UCS-2 is not super surprising.
@natecull @ayy I was originally going to learn parser combinators in OCaml and lately probably in Rust instead. But if they're in JS maybe that's a lower threshold.

But since racket has so much stuff and is even focused on PLDD[0], it probably has them too, and that's probably the best place for me to look.

Now I'm wondering whether Parsimmon or RacketScript with racket's presumed combinators would be the most useful in a browser context.

[0] Programming Language Driven Development

@clacke @ayy

If you're already using Racket, that's great! I'm sure there's a much better library in there somewhere.

The Javascript parsing combinator libraries use that weird 'fluent API' thing (ie, chaining a bunch of object method calls together to construct an object which has the properties you want), which is a clever hack, and I understand it well enough, but probably Scheme would be much clearer and to the point.
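
To make the 'fluent API' style concrete, here is a small sketch of what chaining method calls to build up a parser can look like. The `Parser` class and its `string`, `then` and `map` methods are invented for this example and only loosely resemble what real libraries offer; in particular, this `then` keeps both results, which is an assumption of the sketch and not a claim about Parsimmon.

```javascript
// Hypothetical fluent wrapper, NOT Parsimmon's real API.
class Parser {
  // run: (input, index) => { value, next } or null on failure
  constructor(run) {
    this.run = run;
  }

  // Match a literal string at the current index.
  static string(s) {
    return new Parser((input, i) =>
      input.startsWith(s, i) ? { value: s, next: i + s.length } : null);
  }

  // Run this parser, then another; keep both values as a pair.
  then(other) {
    return new Parser((input, i) => {
      const a = this.run(input, i);
      if (!a) return null;
      const b = other.run(input, a.next);
      return b ? { value: [a.value, b.value], next: b.next } : null;
    });
  }

  // Transform the parsed value.
  map(fn) {
    return new Parser((input, i) => {
      const r = this.run(input, i);
      return r ? { value: fn(r.value), next: r.next } : null;
    });
  }
}

// Each call returns a new Parser, so the calls chain left to right.
const greeting = Parser.string('ni')
  .then(Parser.string('hao'))
  .map((parts) => parts.join(' '));

const fluentResult = greeting.run('nihao', 0); // { value: 'ni hao', next: 5 }
```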

@clacke @ayy

For me, I'm happy to find that Node.js basically hits my sweet spot for 'a sufficiently powerful desktop scripting language I can just pick up and start tooling around with data and banging the rocks together, interactively'.

Since what I'm looking at and trying to decipher is *literally* a foreign language, it helps having a programming language that's a bit familiar and easy to please.

@clacke @ayy

A really dumb, but really useful thing of a prototype OO language like JS is that I can just easily shove all my variables into objects as little namespaces.

I don't care that they're 'not Proper Objects' and the language doesn't get in my way (like, eg, Ruby or Python) and insist that a Dictionary is not a Hash is not an Object, etc.
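
A tiny sketch of the 'objects as little namespaces' habit described above. The names `radicals`, `table`, `add` and `lookup` are made up for illustration.

```javascript
// A plain object literal used as an ad-hoc namespace:
// data and the functions that work on it live side by side.
const radicals = {
  table: {},
  add(char, meaning) {
    this.table[char] = meaning;
  },
  lookup(char) {
    return this.table[char];
  },
};

radicals.add('水', 'water');
const meaning = radicals.lookup('水'); // 'water'

// New members can be bolted on later without declaring a class first.
radicals.count = function () {
  return Object.keys(this.table).length;
};
const total = radicals.count(); // 1
```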

@natecull @ayy It's not dumb at all, I love the prototype-basedness and ubiquitous malleability and ad-hoc nature of JS.

I find the old object-in-an-anonymous-function scoping and namespacing pattern quite neat:

(function(){ return {
method: function() { doThing() },
otherMethod: function() { doOtherThing() }
}})()

Of course, now with proper module support and such, it's not necessary anymore. But I like that the initial language had enough flexibility and usefulness to provide it.
@shamar Your comments are in response to the Marx picture, but I think @ayy only posted that picture because you said "specter". I was in fact tempted to do the same thing, but got preempted before I had decided whether to do it. :-)

@natecull
@natecull @ayy Yes, JavaScript and Node have fantastic whipuptitude and pretty good readability.