hey so remember how I wanted a project gutenberg corpus with every plaintext file in an easy-to-use format? https://mastodon.social/@aparrish/100511033258021934
well I wanted it so bad I guess that I went ahead and made it https://github.com/aparrish/gutenberg-dammit
a quick exercise with this corpus: "Flower blank," alphabetized bigrams beginning with "flowers" from every Project Gutenberg book labelled as "Poetry"
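for anyone who wants to try something similar, here's a rough sketch of the bigram step. this is not the code behind the exercise above: the function name, the tokenizing regex, and the sample strings are all made up for illustration, and it assumes you've already loaded each poetry text as a plain string

```python
import re

def bigrams_starting_with(texts, first_word="flowers"):
    """Collect unique two-word sequences whose first word is `first_word`.

    Hypothetical helper: tokenization is a crude lowercase regex, so
    punctuation between the two words is ignored.
    """
    found = set()
    for text in texts:
        # Lowercase and keep runs of letters/apostrophes as words.
        words = re.findall(r"[a-z']+", text.lower())
        # Pair each word with the word that follows it.
        for first, second in zip(words, words[1:]):
            if first == first_word:
                found.add(f"{first} {second}")
    # Alphabetize the unique bigrams.
    return sorted(found)

# Made-up sample lines standing in for poetry texts from the corpus.
sample = [
    "Flowers bloom and flowers fade.",
    "She brought flowers, always flowers.",
]

for bigram in bigrams_starting_with(sample):
    print(bigram)
```

a set keeps each bigram once no matter how often it shows up, which seems like the right call for an alphabetized list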
if you were going to download this today, maybe hold off—I found a frustrating bug where some utf8-encoded texts were being decoded incorrectly with a different encoding, leading to hilarious mojibake when they came out the other end—will post a fix in a few hrs
none of this would be a problem if the reported charset in the metadata was always the correct charset. but there are a lot of texts that report "us-ascii" when what they really mean is "ascii with occasional 8-bit chars just for fun!"
also lots of files (>1%?) that say "yeah I'm utf8 sure whatever" but are actually ISO-8859-1 (according to chardet at least)
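a minimal sketch of one way to cope with unreliable charset labels, using only the Python standard library. this is not what gutenberg-dammit does (that's based on chardet, as mentioned above), and the function name and the order of fallbacks are assumptions for illustration: try the declared charset strictly, then utf-8, then ISO-8859-1 as a last resort

```python
def decode_with_fallback(raw: bytes, declared: str = "us-ascii"):
    """Return (text, encoding_used) for `raw`.

    Hypothetical helper: tries the declared charset first, then utf-8,
    then ISO-8859-1. Strict decoding means a wrong guess usually raises
    instead of silently producing mojibake.
    """
    for encoding in (declared, "utf-8", "iso-8859-1"):
        try:
            return raw.decode(encoding), encoding
        except (UnicodeDecodeError, LookupError):
            # UnicodeDecodeError: the bytes are invalid in this encoding.
            # LookupError: the declared charset name is unknown to Python.
            continue
    # ISO-8859-1 maps every byte to a character, so this is never reached.
    raise ValueError("could not decode input")

# Claims utf-8, but the lone 0xE9 byte is really ISO-8859-1.
print(decode_with_fallback(b"caf\xe9", declared="utf-8"))

# Claims us-ascii, but is actually utf-8.
print(decode_with_fallback("café".encode("utf-8"), declared="us-ascii"))
```

the catch: if the declared charset is itself ISO-8859-1 (or any other single-byte encoding), it will "succeed" on utf-8 bytes and hand back mojibake, so a statistical detector is still needed for those cases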
lessons learned: (a) never trust someone's claim about the encoding of a text file (b) character encodings are bad and trying to digitize text in the first place was a bad idea
okay I just pushed an update and uploaded a new archive with (more) correct character encoding—enjoy https://github.com/aparrish/gutenberg-dammit
@aparrish be the datastore you want to see in the world
@aparrish Thank you!
@aparrish omg thank you <3
@aparrish holy cow!
@aparrish Holy spit this is amazing and something I’ve wanted to do for a few months now, and you’ve done a better job than I was even imagining. Thank you for this!
@aparrish this is amazing!
@aparrish and chardet can't always get it right. e.g. I recently ran into CSV files that weren't encoded in ISO-8859-1, but instead a MacOS encoding from a similar era.
@aparrish I thought that "ascii with occasional 8-bit chars just for fun!" was the new mandatory charset standard.
@aparrish I'm definitely going to play with this later! Thanks for doing this!