Eugen @Gargron

@mmn What would happen if Mastodon started allowing UTF8 usernames? Would that be problematic for GS? We could try using Punycode if UTF8 directly would break things.

@mmn @Gargron please use punycode.

Straight utf-8 link support enables phishing.

@munin @mmn @gargron I think punycode is a monumentally bad idea: it makes non-ASCII users second-class citizens. I think it's better (although more difficult) to carve out an acceptable segment of Unicode and use that instead.

@mmn @Gargron @nightpool I think a single unified encoding for every glyph is a bad idea - much better to modularize with a standard, easily portable format and different language bases.

though tbh I mostly object to homoglyph attack surfaces.

@munin @gargron @nightpool @mmn I don't plan on implementing it myself, the increased attack surface size is an unacceptable risk for such a relatively small gain.

@munin @mmn @gargron sure but like. We don't have that. Mastodon isn't going to pioneer a new text encoding mechanism. We have to work within the tools we have.

@mmn @Gargron @nightpool Hence punycode on anything linklike. That's the tool available to reduce the attack surface of confusables.

@munin @mmn @gargron I guarantee you that you would not feel the same if every Latin orthographic link was punycoded. That's just an untenable position. Also it's just straight up inconsistent: the average user can't tell xn--jsn638js from xn--9183jdji7.
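
(Illustration, not part of the thread: a minimal Python sketch of what Punycode does to a non-ASCII label, using the standard library's `idna` codec, which implements IDNA 2003. The sample labels and the helper names `to_ascii` and `to_unicode` are assumptions made up for this example.)

```python
# Round-trip single labels through Python's built-in "idna" codec
# (IDNA 2003, which uses Punycode for non-ASCII labels).

def to_ascii(label: str) -> str:
    """Return the ASCII-compatible form of one label (xn-- prefixed if needed)."""
    return label.encode("idna").decode("ascii")


def to_unicode(ascii_label: str) -> str:
    """Decode an ASCII-compatible label back to its Unicode form."""
    return ascii_label.encode("ascii").decode("idna")


# Illustrative sample labels (assumptions, not real accounts).
samples = ["alice", "b\u00fccher", "\u30b0\u30ea\u30c3\u30c1"]
encoded = {label: to_ascii(label) for label in samples}

for label, ascii_form in encoded.items():
    print(f"{label} -> {ascii_form}")
# "alice" stays "alice"; "bücher" becomes "xn--bcher-kva".
```

Pure-ASCII labels pass through unchanged, while anything else turns into an `xn--` string that carries no visual hint of the original, which is the readability complaint made above.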

@mmn @Gargron @nightpool You said we have to work within the solution space we have. How are you going to prevent homoglyph attacks?

@munin @mmn @gargron by selecting specific Unicode code blocks that don't contain homoglyphs.

@mmn @Gargron @nightpool Not possible. For example: Cyrillic has several characters homoglyphic with Latin but at different codepoints.

Are you going to sit around normalizing everything that comes in? Because that kettle of worms is even worse.
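
(Illustration, not part of the thread: a small Python sketch of the Cyrillic/Latin case. It assumes nothing beyond the standard `unicodedata` module; the spoofed name "admin" is a made-up example. It shows that standard Unicode normalization such as NFKC does not fold cross-script lookalikes together, so catching them would need separate confusables data such as the tables in Unicode UTS #39, which the standard library does not ship.)

```python
import unicodedata

latin_a = "\u0061"     # LATIN SMALL LETTER A
cyrillic_a = "\u0430"  # CYRILLIC SMALL LETTER A, renders like the Latin one

real = latin_a + "dmin"
spoof = cyrillic_a + "dmin"


def nfkc(text: str) -> str:
    """Apply Unicode compatibility normalization (NFKC)."""
    return unicodedata.normalize("NFKC", text)


print(unicodedata.name(latin_a))
print(unicodedata.name(cyrillic_a))
print("equal as strings:", real == spoof)            # False
print("equal after NFKC:", nfkc(real) == nfkc(spoof))  # still False
```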

@munin @mmn @nightpool wew okay it sounds like I don't actually want to go that route, let's just leave usernames as ASCII

@mmn @nightpool @Gargron Probably the safer choice at the moment :-) And it's always something that could be revisited in the future if someone comes up with a solid, reliable means of normalization or summat.

@gargron Either way, I don't think there would be a problem for !GNUsocial with remote #Unicode (non-ASCII) characters. Nicknames are normalised as-is (for example removal of underscores which was another discussion). I don't remember if transliteration of nicknames is done, but we do that for #håshtägs (which has also sparked controversy due to incompatibility with farsi).

My desire is to implement unicode support for nicknames. I don't know about the best strategy for homoglyph attacks etc, but I bet the !xmpp community has lots to share there. There you can have at least a large set of Unicode in your user part of the XMPP ID. Maybe #Prosody developer @zash knows more about unicode in XMPP usernames etc?

cc: @munin @nightpool

@munin @mmn @gargron how is "use punycode, but allow a larger set of characters than just ascii to go unencoded" that bad of a suggestion? Even if it's just hjk blocks, that would be a good start.

@mmn @Gargron @nightpool How, precisely, is the data stored internally in such a scenario? Where is the decision to utf8 or punycode made?

There are several second order impacts, and such a parser is more complex than allowing utf8 - and with an even larger complexity comes a larger attack surface.

@munin @Gargron @nightpool

What if instance admins could decide which specific unicode pages (possibly minus non-printing character classes) to enable (for sign-up, and separately for display)? Don't homoglyph attacks rely on mixing different languages, etc.?

(For compatibility/fall-back where instances don't allow that page, we could send them as punycode (with the original in mouseover text).)
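
(Illustration, not part of the thread: one way the "admin enables specific pages" idea could be sketched in Python. Everything here is hypothetical: the `BLOCKS` table, the `username_allowed` function and its `enabled` and `allow_mixed` parameters are invented for this example and are not Mastodon settings. The code point ranges are the standard Unicode block boundaries, except that the Latin range is narrowed to printable ASCII.)

```python
import unicodedata

# Hypothetical per-instance table: block name -> inclusive code point range.
BLOCKS = {
    "basic_latin": (0x0021, 0x007E),  # printable ASCII, no space
    "cyrillic": (0x0400, 0x04FF),
    "hebrew": (0x0590, 0x05FF),
    "hiragana": (0x3040, 0x309F),
    "katakana": (0x30A0, 0x30FF),
}


def block_of(ch: str):
    """Return the name of the known block containing ch, or None."""
    cp = ord(ch)
    for name, (lo, hi) in BLOCKS.items():
        if lo <= cp <= hi:
            return name
    return None


def username_allowed(username: str, enabled: set, allow_mixed: bool = False) -> bool:
    """Accept a username only if every character is in an enabled block.

    Characters in any "C" category (control, format, unassigned, ...) are
    always rejected. Unless allow_mixed is set, all characters must come
    from a single block, a crude guard against cross-script lookalikes.
    """
    if not username:
        return False
    seen = set()
    for ch in username:
        if unicodedata.category(ch).startswith("C"):
            return False
        block = block_of(ch)
        if block is None or block not in enabled:
            return False
        seen.add(block)
    return allow_mixed or len(seen) == 1


print(username_allowed("alice", {"basic_latin"}))                   # True
print(username_allowed("\u0430lice", {"basic_latin", "cyrillic"}))  # False
```

The single-block rule is deliberately blunt: it would also reject a name mixing Hiragana and Katakana, or Cyrillic letters with ASCII digits, so a real policy would need something finer than this sketch.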

@gaditb @Gargron @nightpool

I suspect any functionality of that sort would be better handled by a plugin type module than anything in the base code.

@munin @Gargron @nightpool ..... I can KINDA see that maybe, but I feel like unicode pages/classes mostly provides that, without the need for redoing it.

@nightpool @Gargron @munin I'd really like Mastodon to try doing something to allow unicode but begin to handle homograph attacks, because, like, only ASCII *is* suboptimal for almost everyone and we should try to solve this --

@munin @Gargron @nightpool -- and Mastodon, with its (a) multiple, independent, community (and often nationality) distinct instances, (b) still a single codebase, (c) need to actually care about UI stuff, and (d) large international userbase, (and (e) a good-sized infosec community) would be a GREAT place to explore this.

@gaditb @Gargron @nightpool

Speaking from the infosec perspective, building that functionality into the base product is a nightmare waiting to happen, and I would -vehemently- discourage any inbuilt solution.

If you make it modular, potential issues are isolated to specific instances.

@munin @Gargron @nightpool that's why I was suggesting (... oh. oops, I didn't mention it in this thread.) that it's default off -- instance admins can enable specific pages for signup, and separately for display.

Instances which allow enough pages for spoofing to be a problem have it only be a problem *within their instance*.

@nightpool @Gargron @munin
And every other instance can experiment individually for which pages are safe to display together -- only eventually-eventually, maybe if people want to do it, agreeing on a tested subset that will be default enabled.

@gaditb @Gargron @nightpool

The presence of the capability in the base product opens up the possibility of attacks to inadvertently enable it.

Isolating it to a specific module requires that instance admins specifically choose to install such a module.

This is a case of relative attack surface size.

@munin @Gargron @nightpool oh! You were saying, like, opt-in to even have unicode-enable-ing switches at all, by installing the mod. Not like, diff mods for diff lang-packs or something.

That seems reasonable-ish, but would an admin of a large instance really stall that much more to install a module than to enable a default-off section of settings?

... maybe actually.

(Also it would require holding off until Mastodon has a plugin architecture...)

@gaditb @Gargron @nightpool

Having a structure that allows for plugins would allow for a lot more experimentation around the ecosystem in general - and yes, large instance admins may well install plugins if they're asked for by enough users, or if there's a clear benefit for them.

Also, it would encourage users to start up their own small instances to control their own plugins. Net benefit.

@munin @gaditb @nightpool Putting important functionality into extensions is one of the things that killed XMPP

@Gargron @gaditb @nightpool

's nothing that says that successful plugins cannot be later integrated into the core product.

@nightpool @gaditb @Gargron

Think of it this way: a modular architecture gets you the benefit of having competing methods of addressing the problem exist, so that their relative merits can be evaluated, and the more optimal routes can be chosen.

@gaditb @munin @nightpool The network has to stay compatible with each other, which it won't if some will use utf8 usernames through plugins and others won't.

@Gargron @munin @nightpool I was thinking we'd have the canonical forms of the usernames be punycode, which gets rendered as unicode (or partially-rendered or not-but-with-mouseover or however you wanna handle it) by the plugins, and on instances which have no idea what's going on it just comes out as punycode.

Which still has the 2nd-class citizen problem that @nightpool mentioned, but...
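
(Illustration, not part of the thread: a rough Python sketch of the "canonical form is punycode, rendering is up to the instance" idea. The function `render_handle`, its `unicode_enabled` flag and the returned `text`/`title` keys are hypothetical names for this example; decoding uses the standard library's `idna` codec.)

```python
def render_handle(canonical: str, unicode_enabled: bool) -> dict:
    """Decide how to show a username whose stored form is always ASCII.

    canonical is the ASCII form that is stored and federated. The result
    holds the visible text plus a mouseover title with the other form.
    """
    if canonical.startswith("xn--"):
        decoded = canonical.encode("ascii").decode("idna")
    else:
        decoded = canonical  # plain ASCII name, nothing to decode

    if unicode_enabled:
        return {"text": decoded, "title": canonical}
    # Instances that know nothing about this just show the ASCII form.
    return {"text": canonical, "title": decoded}


print(render_handle("xn--bcher-kva", unicode_enabled=True))
print(render_handle("xn--bcher-kva", unicode_enabled=False))
```

Because only the ASCII form travels between servers, the decision to decode is made purely at display time, which is also where the second-class-citizen concern shows up.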

@gargron I think gnusocial and friendica are already compatible with utf8 and emoji.

For friendica: since https://github.com/friendica/friendica/commit/83cc56e71360c0e45f153576f8b5cbe45fd1786b (and before that, emoji appeared as a ? because mysql turns emoji into ? with utf8)

And gnusocial (+qvitter) has worked out of the box for a long time ( 👌 ), for usernames too 😉

@Gargron @mmn you mean using emoticons in @<username> ?? Not a good idea… :/

@ClovisGauzy @mmn I wasn't thinking of emoticons - more like Hiragana/Katakana/Kanji/Chinese/Arabic etc.

@mmn @Gargron look at my nickname: I'm not using :emoticonstags: but UTF-8 emoticons directly.

Maybe I just misunderstood something…

@ClovisGauzy @mmn Afaik UTF8 e-mail addresses are a thing. It's just been brought to my attention that the ASCII nature of usernames is too Western-centered

@Gargron @mmn yep, I understand that. But, according to Murphy's law… :p

@Gargron oh, another question: why aren't images put in the hidden block when a Content Warning is activated?

@ClovisGauzy @Gargron @mmn you can still limit char ranges (having ASCII usernames does not allow us to put non-printable chars)

@Gargron @ClovisGauzy @mmn That can be pretty cool. Look at all these people with non-ASCII names on Wikipedia -- it's awesome.

@Gargron @mmn for usernames, you should stick to [A-Za-z0-9-_], unless you want to have dozens of issues impossible to resolve (mostly around display and similar chars).
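
(Illustration, not part of the thread: the ASCII-only rule as a Python check. The names `USERNAME_RE` and `is_ascii_username` are made up for this sketch. The hyphen is moved to the end of the character class so it cannot be read as a range, and the class is spelled out because `\w` in Python 3 also matches non-ASCII letters by default.)

```python
import re

# Letters, digits, underscore and hyphen only; '-' sits last in the class
# so it is taken literally rather than as part of a range.
USERNAME_RE = re.compile(r"[A-Za-z0-9_-]+")


def is_ascii_username(name: str) -> bool:
    """True if name is non-empty and uses only the allowed ASCII characters."""
    return USERNAME_RE.fullmatch(name) is not None


print(is_ascii_username("Gargron"))            # True
print(is_ascii_username("h\u00e5sht\u00e4g"))  # False: contains å and ä
```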

@mmn @Gargron I'm usually a big advocate of Unicode but in this case I can see a really big load of bad things coming to you if you do this.

@Gargron @mmn I would, prior to approaching this, open a discussion in a Github issue. This is an extraordinary space for things to go wrong.

One basic example is quasi-control characters such as LEFT-TO-RIGHT EMBEDDING (U+202A).

Declaring "safe" blocks of unicode would be the safest option, even if these are usually encoded into punycode or URI encoded. You'll still run into Han unification politics for CJK though.
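
(Illustration, not part of the thread: a short Python sketch that flags invisible control and format characters such as U+202A, using the general categories `Cc` and `Cf` from the standard `unicodedata` module. The function name `invisible_chars` is an assumption for this example, and the category check is only one possible filter, not a complete defence.)

```python
import unicodedata


def invisible_chars(text: str) -> list:
    """Return (index, name) pairs for control (Cc) and format (Cf) characters."""
    found = []
    for index, ch in enumerate(text):
        if unicodedata.category(ch) in {"Cc", "Cf"}:
            # Control characters have no Unicode name, so fall back to U+XXXX.
            name = unicodedata.name(ch, f"U+{ord(ch):04X}")
            found.append((index, name))
    return found


print(invisible_chars("ab\u202acd"))  # [(2, 'LEFT-TO-RIGHT EMBEDDING')]
print(invisible_chars("alice"))       # []
```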

Well if my messing around today is any indicator, it will break all the things. I tried putting just a - in a username and my instance temporarily exploded.

@Gargron I think it will make it harder for users of different l10ns to type each other's handles, so I guess there are more issues than just the technical compatibilities.

Of course, maybe the joy Japanese (and other non-ASCII language) users get from typing in their own language could compensate for this.

@Gargron @mmn does IDN in mastodon domain names currently work? might be a good first step, to learn some of the pitfalls, if not.

@Gargron I think that is not a good idea. We Japanese would need to get multiple accounts (in Alphabet, Hiragana, Katakana, Kanji) so as to preserve our identity on the instance. We have already used a unique expression of the alphabet in Twitter etc.

@moki Thank you. I will not go the UTF8/punycode route for usernames.

@Gargron @moki this should really be a larger discussion. Not supporting UTF8/punycode usernames is something that potentially leaves out a huge amount of humans in the long run. The amount of humans in the world that don't have Latin names is gigantic. There are many languages that have names you can't even express well with Latin transliteration.

@wakest @moki to be fair all computers use ascii

@Gargron @moki well they use on and off. A universal concept.

@Gargron @wakest @moki I mean "America got there first so we'll just do it that way" seems like a silly way to make decisions.

My idea to consider at least: allow the admin to choose which pages (or possibly character classes?) of unicode to enable for sign-up or display?

Sign-up will be default off, display we'll pick which ones to turn on, and maybe not even have options for non-display characters.

Thoughts?

@moki @wakest @Gargron
also, cc-ing @yair in this discussion, in case he hasn't seen it and has any opinions since he admins an I-think-majority-Hebrew instance?

@wakest @Gargron Is it not okay to consider the display name and the account name separately? Currently, ASCII alone is sufficient for representing a mail address, and the same goes for a Twitter account. But personally I'd find it acceptable to make ISO 8859 (Latin-1) available :)

@moki @Gargron that is currently not true: you can use punycode 100% in an email address. This is my email address, please send me a message: グリッチ@グリッチ.みんな

@wakest @Gargron I thought that punycode is not mandatory, but are there not enough e-mail addresses in the current situation without punycode?

@Gargron @moki adding @nasser to this conversation cause they have been working for such things on the internet for ages.

@wakest @Gargron @moki it's a hard problem for sure (especially spoofing names with similar characters, which does not have a great solution as far as i can tell) but utf8 usernames could certainly be valuable to e.g. arabic speakers. i wonder if it could be allowed on a per-instance basis?

@nasser @wakest @Gargron I think a per-instance basis is the best solution if possible. In Japanese elementary schools, we learn to write our own names in the Alphabet, so Alphabetical notation is also one of the official representations of our names. Also, there are few people who want to use their real names on SNS. This may be one of the reasons why many Japanese don't feel the necessity of using Japanese notation in their account names.

@mmn @Gargron It would help platform success in regions not using the Latin alphabet. Although it might hinder search, and the 500 character limit loses a couple of characters with each UTF-8 character.

Help phishing, come on.

- You could host your own server with IDN domain name visually matching mastodon.social. IMHO usernames are a non issue.
- What would be the benefit of phishing? Everything is already public.

@mmn @Gargron @nemeciii Support for UTF in messages seems less problematic than in usernames. Identity is already a challenge in a federated ecosystem and allowing visual spoofing will complicate that. Let users write their "real" names and messages in unicode, but keep the usernames restricted. At the very least, leave that configurable by instance.

@ino @Gargron @nemeciii Then again, any good instance will be based on community and won't deal with identity theft because the admins, mods and users are somehow connected via trust.

@ino @mmn @Gargron @nemeciii Yes, the usernames don't really matter that much as long as you have utf8 display names.

@Gargron As an Asian, I don't think ASCII usernames are too western-centred: everyone nowadays has learned some English since childhood, and many languages have romanization systems. The unicode problem is more of an upstream one; the internet as a whole is just not well prepared for it now.
Maybe we can have an ascii id and an alt/local-language id: we can @, mention, etc. *directly* using the local id, and use the ascii one where it's needed (for uri, api). Like translating the app text, translate the id.

@Gargron plus: if I have a unicode id/username you can't input (without an input method, or not knowing how), and it even has some unprintable chars so you can't copy it properly, that will be an annoying situation. But if I have a transcribed/romanized id (all ascii) alongside, I think that will make communication smoother. We do need a global language from UI to underlying code, and English (ascii) does this job well, I think.