@mmn What would happen if Mastodon started allowing UTF8 usernames? Would that be problematic for GS? We could try using Punycode if UTF8 directly would break things.

@mmn @Gargron please use punycode.

Straight utf-8 link support enables phishing.

@munin @mmn @gargron I think punycode is a monumentally bad idea—it makes nonascii users second class citizens. I think it's better (although more difficult) to carve out an acceptable segment of Unicode and use that instead.

@mmn @Gargron @nightpool I think a single unified encoding for every glyph is a bad idea - much better to modularize with a standard, easily portable format and different language bases.

though tbh I mostly object to homoglyph attack surfaces.

@munin @gargron @nightpool @mmn I don't plan on implementing it myself, the increased attack surface size is an unacceptable risk for such a relatively small gain.

@munin @mmn @gargron sure but like. We don't have that. Mastodon isn't going to pioneer a new text encoding mechanism. We have to work within the tools we have.

@mmn @Gargron @nightpool Hence punycode on anything linklike. That's the tool available to reduce the attack surface of confusables.

@munin @mmn @gargron I guarantee you that you would not feel the same if every Latin orthographic link was punycoded. That's just an untenable position. Also it's just straight up inconsistent: the average user can't tell xn--jsn638js from xn--9183jdji7.
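
To make the readability point concrete, here is a minimal sketch using Python's built-in "punycode" codec (RFC 3492). The helper names `to_ace` and `from_ace` and the sample names are made up for illustration. Real IDNA processing also applies normalization and case mapping before encoding, which this sketch skips.

```python
def to_ace(label: str) -> str:
    """Encode one Unicode label as Punycode with the IDNA "xn--" prefix."""
    return "xn--" + label.encode("punycode").decode("ascii")


def from_ace(ace: str) -> str:
    """Decode a label produced by to_ace back to Unicode."""
    return ace[len("xn--"):].encode("ascii").decode("punycode")


if __name__ == "__main__":
    # Made-up example names in three scripts.
    for name in ["münchen", "мастодон", "東京"]:
        ace = to_ace(name)
        print(f"{name} -> {ace} -> {from_ace(ace)}")
```

The encoded forms round-trip losslessly, but for names with no ASCII letters they carry nothing a human reader can recognize, which is the objection raised above.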

@mmn @Gargron @nightpool You said we have to work within the solution space we have. How are you going to prevent homoglyph attacks?

@munin @mmn @gargron by selecting specific Unicode code blocks that don't contain homoglyphs.

@mmn @Gargron @nightpool Not possible. For example: Cyrillic has several characters homoglyphic with Latin but at different codepoints.

Are you going to sit around normalizing everything that comes in? Because that kettle of worms is even worse.
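
A small sketch of both points, using only Python's standard `unicodedata` module. The spoofed string is a made-up example. It shows that a Cyrillic lookalike compares unequal to its Latin twin, and that NFKC normalization folds compatibility forms such as fullwidth letters but leaves cross-script lookalikes untouched.

```python
import unicodedata

LATIN = "paypal"
# Made-up spoof: both "a" letters replaced by CYRILLIC SMALL LETTER A (U+0430).
SPOOF = "p\u0430yp\u0430l"
# The same word written with fullwidth Latin letters (U+FF41..U+FF5A).
FULLWIDTH = "\uff50\uff41\uff59\uff50\uff41\uff4c"


def nfkc(text: str) -> str:
    """Apply Unicode NFKC normalization."""
    return unicodedata.normalize("NFKC", text)


def describe(text: str) -> list:
    """List each character as a (code point, Unicode name) pair."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in text]


if __name__ == "__main__":
    print(LATIN == SPOOF)            # the strings render alike but differ
    for entry in describe(SPOOF):
        print(entry)
    print(nfkc(FULLWIDTH) == LATIN)  # compatibility forms are folded
    print(nfkc(SPOOF) == LATIN)      # cross-script lookalikes are not
```

Catching the second case needs a confusables mapping on top of normalization, which is the extra work the post above is warning about.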

@munin @mmn @nightpool wew okay it sounds like I don't actually want to go that route, let's just leave usernames as ASCII

@mmn @nightpool @Gargron Probably the safer choice at the moment :-) And it's always something that could be revisited in the future if someone comes up with a solid, reliable means of normalization or summat.

@gargron Either way, I don't think there would be a problem for !GNUsocial with remote #Unicode (non-ASCII) characters. Nicknames are normalised as-is (for example removal of underscores which was another discussion). I don't remember if transliteration of nicknames is done, but we do that for #håshtägs (which has also sparked controversy due to incompatibility with farsi).

My desire is to implement unicode support for nicknames. I don't know about the best strategy for homoglyph attacks etc, but I bet the !xmpp community has lots to share there. There you can have at least a large set of Unicode in your user part of the XMPP ID. Maybe #Prosody developer @zash knows more about unicode in XMPP usernames etc?

cc: @munin @nightpool

@munin @mmn @gargron how is "use punycode, but allow a larger set of characters than just ascii to go unencoded" that bad of a suggestion? Even if it's just hjk blocks, that would be a good start.

@mmn @Gargron @nightpool How, precisely, is the data stored internally in such a scenario? Where is the decision to utf8 or punycode made?

There are several second order impacts, and such a parser is more complex than allowing utf8 - and with an even larger complexity comes a larger attack surface.

@munin @Gargron @nightpool

What if instance admins could decide which specific unicode pages (possibly minus non-printing character classes) to enable (for sign-up, and separately for display)? Don't homoglyph attacks rely on mixing different languages, etc.?

(For compatibility/fall-back where instances don't allow that page, we could send them as punycode (with the original in mouseover text))
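
A rough sketch of what such a per-instance allow-list could look like. The block ranges are real Unicode blocks, but the configuration shape and the function names (`block_of`, `check_username`) are hypothetical and not anything Mastodon provides.

```python
import unicodedata

# Real Unicode block ranges; the config layout itself is an assumption.
BLOCKS = {
    "basic_latin": (0x0000, 0x007F),
    "cyrillic": (0x0400, 0x04FF),
    "hiragana": (0x3040, 0x309F),
    "katakana": (0x30A0, 0x30FF),
    "cjk_unified": (0x4E00, 0x9FFF),
}


def block_of(ch: str):
    """Return the name of the known block containing ch, or None."""
    cp = ord(ch)
    for name, (lo, hi) in BLOCKS.items():
        if lo <= cp <= hi:
            return name
    return None


def is_name_char(ch: str) -> bool:
    """Allow letters, decimal digits and underscore; reject everything else,
    including control and format (non-printing) characters."""
    cat = unicodedata.category(ch)
    return ch == "_" or cat.startswith("L") or cat == "Nd"


def check_username(name: str, enabled: set) -> bool:
    """True if every character is a name character from an enabled block."""
    return bool(name) and all(
        is_name_char(ch) and block_of(ch) in enabled for ch in name
    )


if __name__ == "__main__":
    print(check_username("alice", {"basic_latin"}))
    print(check_username("p\u0430ypal", {"basic_latin"}))
    print(check_username("p\u0430ypal", {"basic_latin", "cyrillic"}))
```

With only one block enabled the mixed-script spoof is rejected. Once an admin enables both Latin and Cyrillic it passes, so a check like this narrows the problem without removing it, and names written entirely in one lookalike script would still get through.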

@gaditb @Gargron @nightpool

I suspect any functionality of that sort would be better handled by a plugin type module than anything in the base code.

@munin @Gargron @nightpool ..... I can KINDA see that maybe, but I feel like unicode pages/classes mostly provides that, without the need for redoing it.

@nightpool @Gargron @munin I'd really like Mastodon to try doing something to allow unicode but begin to handle homograph attacks, because, like, only ASCII *is* suboptimal for almost everyone and we should try to solve this --

@munin @Gargron @nightpool -- and Mastodon, with its (a) multiple, independent, community (and often nationality) distinct instances, (b) still a single codebase, (c) need to actually care about UI stuff, and (d) large international userbase, (and (e) a good-sized infosec community) would be a GREAT place to explore this.

@gaditb @Gargron @nightpool

Speaking from the infosec perspective, building that functionality into the base product is a nightmare waiting to happen, and I would -vehemently- discourage any inbuilt solution.

If you make it modular, potential issues are isolated to specific instances.

@munin @Gargron @nightpool that's why I was suggesting (... oh. oops, I didn't mention it in this thread.) that it's default off -- instance admins can enable specific pages for signup, and separately for display.

Instances which allow enough pages for spoofing to be a problem have it only be a problem *within their instance*.

@nightpool @Gargron @munin
And every other instance can experiment individually for which pages are safe to display together -- only eventually-eventually, maybe if people want to do it, agreeing on a tested subset that will be default enabled.

@gaditb @Gargron @nightpool

The presence of the capability in the base product opens up the possibility of attacks to inadvertently enable it.

Isolating it to a specific module requires that instance admins specifically choose to install such a module.

This is a case of relative attack surface size.

@munin @Gargron @nightpool oh! You were saying, like, opt-in to even have unicode-enable-ing switches at all, by installing the mod. Not like, diff mods for diff lang-packs or something.

That seems reasonable-ish, but would an admin of a large instance really stall that much more to install a module than to enable a default-off section of settings?

... maybe actually.

(Also it would require holding off until Mastodon has a plugin architecture...)

@gaditb @Gargron @nightpool

Having a structure that allows for plugins would allow for a lot more experimentation around the ecosystem in general - and yes, large instance admins may well install plugins if they're asked for by enough users, or if there's a clear benefit for them.

Also, it would encourage users to start up their own small instances to control their own plugins. Net benefit.

@munin @gaditb @nightpool Putting important functionality into extensions is one of the things that killed XMPP

@Gargron @gaditb @nightpool

's nothing that says that successful plugins cannot be later integrated into the core product.

@nightpool @gaditb @Gargron

Think of it this way: a modular architecture gets you the benefit of having competing methods of addressing the problem exist, so that their relative merits can be evaluated, and the more optimal routes can be chosen.

@gaditb @munin @nightpool The network has to stay compatible with each other, which it won't if some will use utf8 usernames through plugins and others won't.

@Gargron @munin @nightpool I was thinking we'd have the canonical forms of the usernames be punycode, which gets rendered as unicode (or partially-rendered or not-but-with-mouseover or however you wanna handle it) by the plugins and by instances which have no idea what's going on just come out as punycode.

Which still has the 2nd-class citizen problem that @nightpool mentioned, but...

@gargron I think gnusocial and friendica are already compatible with utf8 and emoji.

For friendica : since (and before : appear like a ? because mysql set emoji to ? with utf8 )

And for gnusocial (+qvitter) work out of the box since a long time ( 👌 ) for username too 😉

@Gargron @mmn you mean using emoticons in @<username> ?? Not a good idea… :/

@ClovisGauzy @mmn I wasn't thinking of emoticons - more like Hiragana/Katakana/Kanji/Chinese/Arabic etc.

@mmn @Gargron look at my nickname : i'm not using :emoticonstags: but directly UTF-8 emoticon.

Maybe I just misunderstood something…

@ClovisGauzy @mmn Afaik UTF8 e-mail addresses are a thing. It's just been brought to my attention that the ASCII nature of usernames is too Western-centered

@Gargron @mmn yep, i understand that. But, according to the Murphy's law… :p

@Gargron oh, another question : why images are not in hidden block when Content Warning is activated ?

@ClovisGauzy @Gargron @mmn you can still limit char ranges (having ASCII usernames does not allow us to put non-printable chars)

@Gargron @ClovisGauzy @mmn That can be pretty cool. Look at all these people with non-ASCII names on Wikipedia -- it's awesome.

@Gargron @mmn for usernames, you should stick to [A-Za-z0-9-_], unless you want to have dozens of issues impossible to resolve (mostly around display and similar chars).
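
For reference, a minimal sketch of that ASCII-only rule in Python. The function name `is_ascii_username` is made up for illustration. The hyphen is placed last in the character class so it cannot be read as a range, and `fullmatch` is used so a trailing newline is not accepted.

```python
import re

# Explicit ASCII class: in Python 3, \w would also match non-ASCII letters.
USERNAME_RE = re.compile(r"[A-Za-z0-9_-]+")


def is_ascii_username(name: str) -> bool:
    """True if name consists only of ASCII letters, digits, "_" and "-"."""
    return USERNAME_RE.fullmatch(name) is not None


if __name__ == "__main__":
    for candidate in ["Gargron", "night-pool_1", "h\u00e5sht\u00e4g", ""]:
        print(repr(candidate), is_ascii_username(candidate))
```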

@mmn @Gargron I'm usually a big advocate of Unicode but in this case I can see a really big load of bad things coming to you if you do this.

@Gargron @mmn I would, prior to approaching this, open a discussion in a Github issue. This is an extraordinary space for things to go wrong.

One basic example is quasi-control characters such as LEFT-TO-RIGHT EMBEDDING (U+202A).
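
A short sketch of how such characters could be flagged, using Python's standard `unicodedata` module. The function name `find_format_chars` is made up. It reports characters in the general categories Cc (control) and Cf (format), which covers U+202A.

```python
import unicodedata


def find_format_chars(name: str) -> list:
    """Return (index, Unicode name) for each control or format character."""
    hits = []
    for index, ch in enumerate(name):
        if unicodedata.category(ch) in ("Cc", "Cf"):
            # Control characters have no name, so supply a default.
            hits.append((index, unicodedata.name(ch, "<unnamed>")))
    return hits


if __name__ == "__main__":
    print(find_format_chars("ab\u202acd"))
    print(find_format_chars("alice"))
```

Rejecting every Cf character outright is a blunt rule, since some of them (ZERO WIDTH JOINER, for example) have legitimate uses in certain scripts and in emoji sequences.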

Declaring "safe" blocks of unicode would be the safest option, even if these are usually encoded into punycode or URI encoded. You'll still run into Han unification politics for CJK though.

Well if my messing around today is any indicator, it will break all the things. I tried putting just a - in a username and my instance temporarily exploded.

@Gargron I think it will make it harder for users of different l18ns to type each other's handles, so I guess there are more issues than just the technical compatibilities.

Of course, maybe the joy of Japanese (and non-ascii lexicon language) users of typing using their own language could compensate this.

@Gargron @mmn does IDN in mastodon domain names currently work? might be a good first step, to learn some of the pitfalls, if not.

@Gargron I think that is not a good idea. We Japanese need to get multiple accounts (in Alphabet, Hiragana, Katakana, Kanji) so as to preserve our identity on the instance. We have already used a unique expression of the alphabet in Twitter etc.

@moki Thank you. I will not go the UTF8/punycode route for usernames.

@Gargron @moki this should really be a larger discussion. Not supporting UTF8/punycode usernames is something that potentially leaves out a huge amount of humans in the long run. The amount of humans in the world that don't have Latin names is gigantic. There are many languages that have names you can't even express well with Latin transliteration.
