hey so remember how I wanted a project gutenberg corpus with every plaintext file in an easy-to-use format? mastodon.social/@aparrish/1005

well I wanted it so bad I guess that I went ahead and made it github.com/aparrish/gutenberg-

if you were going to download this today, maybe hold off—I found a frustrating bug where some utf8-encoded texts were being decoded incorrectly with a different encoding, leading to hilarious mojibake when they came out the other end—will post a fix in a few hrs

none of this would be a problem if the reported charset in the metadata was always the correct charset. but there are a lot of texts that report "us-ascii" when what they really mean is "ascii with occasional 8-bit chars just for fun!"
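to make the failure mode concrete, here is a minimal sketch of one way to treat the reported charset as a hint rather than the truth. this is an assumption about how such a fallback could look, not the actual fix in the repo; `decode_text` and its parameter names are hypothetical. the idea: try strict utf-8 first (8-bit text in some other encoding very rarely validates as utf-8), then the declared charset, then latin-1, which can decode any byte sequence.

```python
def decode_text(raw: bytes, declared: str = "us-ascii") -> str:
    """Decode raw bytes, treating the declared charset as a hint only.

    Hypothetical helper for illustration; not taken from the corpus code.
    """
    # Strict UTF-8 first, then whatever the metadata claims.
    for encoding in ("utf-8", declared):
        try:
            return raw.decode(encoding)
        except (UnicodeDecodeError, LookupError):
            # UnicodeDecodeError: the bytes are not valid in this encoding.
            # LookupError: the declared charset name is unknown to Python.
            continue
    # Last resort: latin-1 maps every byte to a code point, so it never fails
    # (though the result may be wrong for text in some other 8-bit encoding).
    return raw.decode("latin-1")


# The mojibake from the bug: UTF-8 bytes read back as latin-1.
mojibake = "café".encode("utf-8").decode("latin-1")

# A file that says "us-ascii" but contains an 8-bit character.
recovered = decode_text("café".encode("latin-1"), declared="us-ascii")
```

the order matters here: putting latin-1 first would "succeed" on everything and reproduce exactly the mojibake described above.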

okay I just pushed an update and uploaded a new archive with (more) correct character encoding—enjoy github.com/aparrish/gutenberg-
