@mmn What would happen if Mastodon started allowing UTF8 usernames? Would that be problematic for GS? We could try using Punycode if UTF8 directly would break things.
@mmn @Gargron @nightpool I think a single unified encoding for every glyph is a bad idea - much better to modularize with a standard, easily portable format and different language bases.
though tbh I mostly object to homoglyph attack surfaces.
@mmn @Gargron @nightpool Hence punycode on anything linklike. That's the tool available to reduce the attack surface of confusables.
@mmn @Gargron @nightpool You said we have to work within the solution space we have. How are you going to prevent homoglyph attacks?
@mmn @Gargron @nightpool Not possible. For example: Cyrillic has several characters homoglyphic with Latin but at different codepoints.
Are you going to sit around normalizing everything that comes in? Because that kettle of worms is even worse.
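To make the homoglyph point concrete, here is a minimal sketch using only Python's standard `unicodedata` module. The `spoof` string and the `describe` helper are illustrative names, not anything from Mastodon or GNU social. It shows that Latin "a" and Cyrillic "а" are distinct codepoints and that standard Unicode normalization (NFKC here) does not fold one into the other.

```python
import unicodedata

latin = "a"          # U+0061 LATIN SMALL LETTER A
cyrillic = "\u0430"  # U+0430 CYRILLIC SMALL LETTER A

def describe(ch):
    """Return the codepoint and official Unicode name of one character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"

# The two letters render alike in most fonts but are different codepoints.
print(describe(latin))
print(describe(cyrillic))

# NFKC normalization leaves both unchanged, so they still compare unequal.
same_after_nfkc = (
    unicodedata.normalize("NFKC", latin) == unicodedata.normalize("NFKC", cyrillic)
)
print(same_after_nfkc)

# A username that looks like "paypal" but has a Cyrillic second letter.
spoof = "p\u0430ypal"
print(spoof == "paypal")
```

Catching this kind of lookalike needs a separate confusables mapping on top of normalization, which is the extra work being objected to here.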
@munin @mmn @nightpool wew okay it sounds like I don't actually want to go that route, let's just leave usernames as ASCII
@mmn @nightpool @Gargron Probably the safer choice at the moment :-) And it's always something that could be revisited in the future if someone comes up with a solid, reliable means of normalization or summat.
@mmn @Gargron @nightpool How, precisely, is the data stored internally in such a scenario? Where is the decision to utf8 or punycode made?
There are several second order impacts, and such a parser is more complex than allowing utf8 - and with an even larger complexity comes a larger attack surface.
What if instance admins could decide which specific unicode pages (possibly minus non-printing character classes) to enable (for sign-up, and separately for display)? Don't homoglyph attacks rely on mixing different languages, etc.?
(For compatibility/fall-back where instances don't allow that page, we could send them as punycode (with the original in mouseover text))
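A rough sketch of what such a per-instance switch could look like, assuming a hypothetical `ALLOWED_SIGNUP_SCRIPTS` setting and a hypothetical `signup_allowed` check. Python's standard library does not expose the Unicode Script property, so the first word of each character's Unicode name is used as a stand-in; a real implementation would want proper script data.

```python
import unicodedata

# Hypothetical per-instance setting: scripts the admin has enabled for sign-up.
ALLOWED_SIGNUP_SCRIPTS = {"LATIN", "CYRILLIC"}

def rough_script(ch):
    """Approximate a character's script from the first word of its Unicode name.

    Assumption: this is only a stand-in for the real Script property.
    """
    if ch in "0123456789_":
        return "COMMON"
    try:
        return unicodedata.name(ch).split()[0]
    except ValueError:  # characters without a name, e.g. control codes
        return "UNKNOWN"

def signup_allowed(username, allowed=ALLOWED_SIGNUP_SCRIPTS):
    """Accept a name only if it uses at most one script and that script is enabled."""
    scripts = {rough_script(ch) for ch in username} - {"COMMON"}
    return scripts <= set(allowed) and len(scripts) <= 1

print(signup_allowed("alice"))                       # all Latin
print(signup_allowed("\u0430\u043b\u0438\u0441\u0430"))  # all Cyrillic
print(signup_allowed("p\u0430ypal"))                 # Latin mixed with Cyrillic
```

Two caveats on this sketch: a single-script rule does not stop a name written entirely in one script from imitating another script's letters, and it would be too strict for Japanese, where Hiragana, Katakana and Kanji are legitimately mixed.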
I suspect any functionality of that sort would be better handled by a plugin type module than anything in the base code.
@munin @Gargron @nightpool ..... I can KINDA see that maybe, but I feel like unicode pages/classes mostly provides that, without the need for redoing it.
@nightpool @Gargron @munin I'd really like Mastodon to try doing something to allow unicode but begin to handle homograph attacks, because, like, only ASCII *is* suboptimal for almost everyone and we should try to solve this --
@munin @Gargron @nightpool -- and Mastodon, with its (a) multiple, independent, community- (and often nationality-) distinct instances, (b) still a single codebase, (c) need to actually care about UI stuff, and (d) large international userbase, (and (e) a good-sized infosec community) would be a GREAT place to explore this.
Speaking from the infosec perspective, building that functionality into the base product is a nightmare waiting to happen, and I would -vehemently- discourage any inbuilt solution.
If you make it modular, potential issues are isolated to specific instances.
@munin @Gargron @nightpool that's why I was suggesting (... oh. oops, I didn't mention it in this thread.) that it's default off -- instance admins can enable specific pages for signup, and separately for display.
Instances which allow enough pages for spoofing to be a problem have it only be a problem *within their instance*.
@nightpool @Gargron @munin
And every other instance can experiment individually for which pages are safe to display together -- only eventually-eventually, maybe if people want to do it, agreeing on a tested subset that will be default enabled.
The presence of the capability in the base product opens up the possibility of attacks to inadvertently enable it.
Isolating it to a specific module requires that instance admins specifically choose to install such a module.
This is a case of relative attack surface size.
@munin @Gargron @nightpool oh! You were saying, like, opt-in to even have unicode-enable-ing switches at all, by installing the mod. Not like, diff mods for diff lang-packs or something.
That seems reasonable-ish, but would an admin of a large instance really stall that much more at installing a module than at enabling a default-off section of settings?
... maybe actually.
(Also it would require holding off until Mastodon has a plugin architecture...)
Having a structure that allows for plugins would allow for a lot more experimentation around the ecosystem in general - and yes, large instance admins may well install plugins if they're asked for by enough users, or if there's a clear benefit for them.
Also, it would encourage users to start up their own small instances to control their own plugins. Net benefit.
@munin @gaditb @nightpool Putting important functionality into extensions is one of the things that killed XMPP
There's nothing that says that successful plugins cannot be later integrated into the core product.
Think of it this way: a modular architecture gets you the benefit of having competing methods of addressing the problem exist, so that their relative merits can be evaluated, and the more optimal routes can be chosen.
@gaditb @munin @nightpool The network has to stay compatible with each other, which it won't if some will use utf8 usernames through plugins and others won't.
@Gargron @munin @nightpool I was thinking we'd have the canonical forms of the usernames be punycode, which gets rendered as unicode (or partially-rendered or not-but-with-mouseover or however you wanna handle it) by the plugins and by instances which have no idea what's going on just come out as punycode.
Which still has the 2nd-class citizen problem that @nightpool mentioned, but...
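A minimal sketch of the "punycode as canonical form" idea, using Python's built-in `punycode` codec. The `xn--` prefix is borrowed from how internationalized domain names are written; applying it to usernames, and the helper names `to_canonical` and `to_display`, are assumptions for illustration only.

```python
def to_canonical(username):
    """Return the ASCII form that would be stored and federated.

    Plain ASCII names pass through unchanged; anything else is
    punycode-encoded with an assumed 'xn--' marker in front.
    """
    if username.isascii():
        return username
    return "xn--" + username.encode("punycode").decode("ascii")

def to_display(canonical):
    """Decode a canonical name back to Unicode, for instances that opt in."""
    if canonical.startswith("xn--"):
        return canonical[4:].encode("ascii").decode("punycode")
    return canonical

stored = to_canonical("m\u00fcnchen")
print(stored)              # what an instance without Unicode support would show
print(to_display(stored))  # what an instance with Unicode display would show
```

An instance that knows nothing about the scheme simply sees and displays the `xn--` string, which is where the second-class-citizen concern comes from.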
@ClovisGauzy @mmn I wasn't thinking of emoticons - more like Hiragana/Katakana/Kanji/Chinese/Arabic etc.
@ClovisGauzy @mmn Afaik UTF8 e-mail addresses are a thing. It's just been brought to my attention that the ASCII nature of usernames is too Western-centered
@Gargron oh, another question: why aren't images put in a hidden block when a Content Warning is activated?
@ClovisGauzy @Gargron @mmn you can still limit char ranges (having ASCII usernames does not allow us to put non-printable chars)
@Gargron @ClovisGauzy @mmn That can be pretty cool. Look at all these people with non-ASCII names on Wikipedia -- it's awesome.
@Gargron @mmn I would, prior to approaching this, open a discussion in a Github issue. This is an extraordinary space for things to go wrong.
One basic example is quasi-control characters such as LEFT-TO-RIGHT EMBEDDING (U+202A).
Declaring "safe" blocks of unicode would be the safest option, even if these are usually encoded into punycode or URI encoded. You'll still run into Han unification politics for CJK though.
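As a sketch of how quasi-control characters like U+202A could be flagged, the check below uses the Unicode general category from Python's `unicodedata` module. The `REJECTED_CATEGORIES` set and the `invisible_or_control` helper are hypothetical choices for illustration, not a vetted policy.

```python
import unicodedata

# Assumed policy: flag control (Cc), format (Cf), surrogate (Cs),
# private-use (Co) and unassigned (Cn) characters.
REJECTED_CATEGORIES = {"Cc", "Cf", "Cs", "Co", "Cn"}

def invisible_or_control(username):
    """List the codepoints in a name that fall into a rejected category."""
    return [
        f"U+{ord(ch):04X}"
        for ch in username
        if unicodedata.category(ch) in REJECTED_CATEGORIES
    ]

print(invisible_or_control("alice"))
print(invisible_or_control("ali\u202ace"))  # contains LEFT-TO-RIGHT EMBEDDING
```

Rejecting every format character is a blunt rule: some scripts rely on joiners such as U+200C and U+200D for correct spelling, so a per-script exception list would likely be needed.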
@Gargron I think it will make it harder for users of different l10ns to type each other's handles, so I guess there are more issues than just technical compatibility.
Of course, maybe the joy for Japanese users (and users of other non-ASCII languages) of typing in their own language could compensate for this.
@mlc @mmn not currently, but there is an open PR: https://github.com/tootsuite/mastodon/pull/2370
@Gargron I think that is not a good idea. We Japanese would need to get multiple accounts (in Alphabet, Hiragana, Katakana, Kanji) so as to preserve our identity on the instance. We have already used a unique expression in the alphabet on Twitter etc.
@moki Thank you. I will not go the UTF8/punycode route for usernames.
@Gargron I'm relieved to hear that :)
@Gargron @moki this should really be a larger discussion. Not supporting UTF8/punycode usernames is something that potentially leaves out a huge number of humans in the long run. The number of humans in the world who don't have Latin names is gigantic. There are many languages that have names you can't even express well with Latin transliteration.
@Gargron @wakest @moki I mean "America got there first so we'll just do it that way" seems like a silly way to make decisions.
My idea to consider at least: allow the admin to choose which pages (or possibly character classes?) of unicode to enable for sign-up or display?
Sign-up will be default off, display we'll pick which ones to turn on, and maybe not even have options for non-display characters.
Thoughts?
@nasser @wakest @Gargron I think a per-instance basis is the best solution if possible. In Japanese elementary schools, we learn to write our own names in the alphabet, so alphabetical notation is also one of the official representations of our names. Also, there are few people who want to use their real names on SNS. This may be one of the reasons why many Japanese don't feel the necessity of using Japanese notation in their account names.
@Gargron matrix uses utf8 usernames. ascii is a bit amerocentric :P
@mmn @Gargron It would help platform success in regions not using the Latin alphabet, although it might hinder search, and the 500 character limit loses a couple of characters with each UTF-8 character.
Help phishing, come on.
- You could host your own server with IDN domain name visually matching mastodon.social. IMHO usernames are a non issue.
- What would be the benefit of phishing? Everything is already public.
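On the 500-character point above: whether non-ASCII text "costs" extra depends on whether a limit counts characters or encoded bytes, and this thread does not say which one applies. A small sketch of the difference, with `LIMIT` and `fits` as illustrative names:

```python
LIMIT = 500  # the post length limit mentioned above; the counting unit is an assumption

def fits(text, unit="chars", limit=LIMIT):
    """Check a text against the limit, counting code points or UTF-8 bytes."""
    size = len(text) if unit == "chars" else len(text.encode("utf-8"))
    return size <= limit

post = "\u3053\u3093\u306b\u3061\u306f"  # five Hiragana characters

code_points = len(post)                # counts characters
utf8_bytes = len(post.encode("utf-8")) # counts encoded bytes
print(code_points, utf8_bytes)

long_post = post * 100                 # 500 characters, 1500 bytes
print(fits(long_post, "chars"), fits(long_post, "bytes"))
```

Only a byte-based limit penalizes non-Latin scripts; a limit counted in characters treats every script the same.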
@mmn @Gargron @nemeciii Support for UTF in messages seems less problematic than in usernames. Identity is already a challenge in a federated ecosystem and allowing visual spoofing will complicate that. Let users write their "real" names and messages in unicode, but keep the usernames restricted. At the very least, leave that configurable by instance.
@Gargron As an Asian, I don't think ASCII usernames are too Western-centred; everyone nowadays has learned some English since childhood, and many languages have a romanization system. The unicode problem is more of an upstream one: the internet as a whole is just not well prepared for it yet.
Maybe we can have an ASCII id and an alt/local-language id: we can @-mention etc. *directly* with the local id and use the ASCII one where it's needed (for URIs, the API). Like translating the app text, translate the id.
@Gargron plus: if I have a unicode id/username that you can't input (no input method, or you don't know how), and it even has some unprintable chars so you can't copy it properly, that will be an annoying situation. But if I have a transcribed/romanized id (all ASCII) alongside it, I think that will make communication smoother. We do need a global language from UI to underlying code, and English (ASCII) does this job well, I think.
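One way the "ASCII id plus local-language id" idea could be modelled, as a sketch only: the `AccountDirectory` class and its methods are hypothetical and not part of any existing Mastodon API. The ASCII id stays canonical for URIs and the API, and a local alias resolves to it when someone is mentioned.

```python
class AccountDirectory:
    """Hypothetical lookup table: canonical ASCII ids plus optional local aliases."""

    def __init__(self):
        self._ids = set()
        self._alias_to_id = {}

    def register(self, ascii_id, local_alias=None):
        if not ascii_id.isascii():
            raise ValueError("canonical id must be ASCII")
        if ascii_id in self._ids:
            raise ValueError("id already taken")
        if local_alias is not None and local_alias in self._alias_to_id:
            raise ValueError("alias already taken")
        self._ids.add(ascii_id)
        if local_alias is not None:
            self._alias_to_id[local_alias] = ascii_id

    def resolve(self, mention):
        """Map '@name' (ASCII id or local alias) to the canonical ASCII id, or None."""
        name = mention.lstrip("@")
        if name in self._ids:
            return name
        return self._alias_to_id.get(name)

directory = AccountDirectory()
directory.register("moki", "\u3082\u304d")
print(directory.resolve("@\u3082\u304d"))  # alias resolves to the ASCII id
print(directory.resolve("@moki"))          # the ASCII id resolves to itself
```

Because links and API calls would only ever carry the ASCII id, instances that cannot display or type the alias would still interoperate.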
@mmn @Gargron please use punycode.
Straight utf-8 link support enables phishing.