omg the Chinese characters for 'Ping Pong':

乒 ping
乓 pang

that is a thing of beauty

(finally I am starting to make progress again on my explore-the-Chinese-character-dataset project)

So on that note, here's a thing I can now get:

[ '候hou4{⿰⿰亻ren2丨gun3⿱ユ矢shi3} ', 'wait' ],
[ '假jia3{⿰亻ren2叚jia3} ', 'falsehood, deception' ],
[ '做zuo4{⿰亻ren2故gu4} ', 'work, make' ],
[ '停ting2{⿰亻ren2亭ting2} ', 'stop, suspend, delay' ],
[ '像xiang4{⿰亻ren2象xiang4} ', 'a picture, image, figure' ],
[ '元yuan2{⿱一yi1兀wu4} ', 'first' ]

A query of some Hong Kong Grade 1 characters (those that are also simplified forms), with pronunciation, decomposition, and a brief English crib.

The main point of this whole project (other than giving me an interesting dataset to try out ways of exploring, currently using JavaScript) is to break down Chinese characters into *named, pronounceable* components.

(since you can't find much about a character without already knowing its pronunciation)
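A minimal sketch of how an annotated string like the ones above could be built, assuming you already have a plain decomposition string and a pronunciation lookup. The `PINYIN` table and the `annotate` function are hypothetical names for illustration, not the project's actual code, and the range check only covers the original twelve Ideographic Description Characters (U+2FF0 to U+2FFB).

```javascript
// Hypothetical pronunciation lookup; a real one would be loaded from a dataset.
const PINYIN = new Map([
  ["候", "hou4"], ["亻", "ren2"], ["丨", "gun3"], ["矢", "shi3"],
  ["假", "jia3"], ["叚", "jia3"],
]);

// True for the layout operators ⿰ ⿱ ⿲ ... (U+2FF0..U+2FFB).
function isIDC(ch) {
  const cp = ch.codePointAt(0);
  return cp >= 0x2ff0 && cp <= 0x2ffb;
}

// Append a reading after every component that has one. Operators, and
// components with no known reading (like ユ here), pass through unchanged.
function annotate(char, ids, pinyin = PINYIN) {
  // Array.from iterates by code point, so characters outside the BMP stay whole.
  const parts = Array.from(ids, (ch) =>
    isIDC(ch) ? ch : ch + (pinyin.get(ch) ?? "")
  );
  return `${char}${pinyin.get(char) ?? ""}{${parts.join("")}}`;
}

console.log(annotate("候", "⿰⿰亻丨⿱ユ矢"));
// 候hou4{⿰⿰亻ren2丨gun3⿱ユ矢shi3}
```

Components without a reading are left bare rather than dropped, so gaps in the lookup stay visible in the output.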

First thing I notice: that 候hou4 decomposition is super weird! The radical is 人ren2, so one would think 亻 should be the left component.

Second thing: My wife *immediately* spots that

昌

has been incorrectly decomposed as

⿱日日

(two 'ri', or 'sun' characters)

when it should be

⿱日曰

(yue being almost exactly like ri, but wider; and 昌 obviously puts a smaller character on top)

So that's super interesting!

and it turns out that the error was introduced in the CJKVI project, which is derived from CHISE; the original CHISE data is correct.

So I'm now super suspicious of CJKVI. :(
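One way to narrow down what needs a second look is to diff the two sources and only eyeball the characters where they disagree. This is a rough sketch under assumptions: both sources are taken to be already loaded into `Map`s from character to decomposition string (the real CHISE and CJKVI file formats are not modelled here), and `sourceA`, `sourceB` and `diffDecompositions` are made-up names. The 昌 values are the two decompositions described in this thread; the 停 and 元 rows are placeholders.

```javascript
// Toy tables: character -> decomposition string.
const sourceA = new Map([
  ["昌", "⿱日曰"], // ri on top of yue
  ["停", "⿰亻亭"], // placeholder row where both sources agree
]);
const sourceB = new Map([
  ["昌", "⿱日日"], // two ri: the suspicious one
  ["停", "⿰亻亭"],
  ["元", "⿱一兀"], // only in B, so there is nothing to compare against
]);

// Return every character present in both tables whose decompositions differ.
function diffDecompositions(a, b) {
  const mismatches = [];
  for (const [char, idsA] of a) {
    if (!b.has(char)) continue; // skip characters missing from b
    const idsB = b.get(char);
    if (idsA !== idsB) mismatches.push({ char, a: idsA, b: idsB });
  }
  return mismatches;
}

for (const m of diffDecompositions(sourceA, sourceB)) {
  console.log(`${m.char}: ${m.a} vs ${m.b}`);
}
```

A plain string comparison will not say which side is right, and it cannot see errors that both sources share, so this only shortens the hand-checking list rather than replacing it.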

oh and the CHISE data also has 候hou4 correctly decomposed:


(that CDP thing means a character not in Unicode but in the Chinese Data Processing character set)

while CJKVI decomposes it as

⿰⿰亻丨⿱ユ矢

Which is fine on the right-hand side! But the left-hand side is just meaningless; it should be the single 亻 radical.
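The "radical should be the left component" expectation can be turned into a mechanical check by parsing the prefix-notation decomposition into a tree. A small sketch, assuming decompositions use only the original twelve Ideographic Description Characters (U+2FF0 to U+2FFB) and contain no entity references like the CDP ones; `parseIDS` and `hasRadicalOnLeft` are hypothetical helpers, not part of any existing library.

```javascript
// True for layout operators in U+2FF0..U+2FFB.
function isIDC(ch) {
  const cp = ch.codePointAt(0);
  return cp >= 0x2ff0 && cp <= 0x2ffb;
}

// ⿲ (U+2FF2) and ⿳ (U+2FF3) take three operands; the others take two.
function arity(op) {
  const cp = op.codePointAt(0);
  return cp === 0x2ff2 || cp === 0x2ff3 ? 3 : 2;
}

// Parse prefix notation into a tree: a leaf is a one-character string,
// an inner node is { op, children }.
function parseIDS(ids) {
  const chars = Array.from(ids);
  let pos = 0;
  function node() {
    if (pos >= chars.length) throw new Error("unexpected end of decomposition");
    const ch = chars[pos++];
    if (!isIDC(ch)) return ch;
    const children = [];
    for (let i = 0; i < arity(ch); i++) children.push(node());
    return { op: ch, children };
  }
  const tree = node();
  if (pos !== chars.length) throw new Error("trailing characters in decomposition");
  return tree;
}

// True when the top-level split is left-to-right (⿰ or ⿲) and its first
// child is exactly the expected radical form.
function hasRadicalOnLeft(ids, radicalForm) {
  const tree = parseIDS(ids);
  if (typeof tree === "string") return false;
  const cp = tree.op.codePointAt(0);
  return (cp === 0x2ff0 || cp === 0x2ff2) && tree.children[0] === radicalForm;
}

console.log(hasRadicalOnLeft("⿰亻叚", "亻"));         // true
console.log(hasRadicalOnLeft("⿰⿰亻丨⿱ユ矢", "亻")); // false
```

The second call returns false because the top-level left child is itself a nested ⿰亻丨 node rather than a bare 亻, which is exactly the oddity in the 候 entry.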

Sigh. It's gonna need a lot of hand-checking, this data, I think.

It absolutely amazes me there's no canonical dataset for this.

