Allison Parrish

hey so remember how I wanted a project gutenberg corpus with every plaintext file in an easy-to-use format? mastodon.social/@aparrish/1005

well I wanted it so bad I guess that I went ahead and made it github.com/aparrish/gutenberg-

a quick exercise with this corpus: "Flower blank," alphabetized bigrams beginning with "flowers" from every Project Gutenberg book labelled as "Poetry"

gist.github.com/aparrish/fdcbd

excerpt:

flowers a
flowers ablaze
flowers about
flowers above
flowers absorb
flowers accompanying
flowers adorn
flowers advance
flowers afford
flowers affray
flowers aflame
flowers after
flowers again
flowers against
flowers alighting
flowers alive
flowers all
flowers allied
flowers ally
flowers almost
flowers aloft
...
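
For anyone who wants to try something like the "Flower blank" exercise, here is a minimal sketch in Python. It is not the code behind the gist: the function name `bigrams_starting_with`, the regex tokenizer, and the two sample strings are assumptions made for illustration. A real run would feed in the texts labelled "Poetry" from the corpus instead.

```python
import re


def bigrams_starting_with(texts, first_word):
    """Return sorted, de-duplicated bigrams whose first word is first_word."""
    found = set()
    for text in texts:
        # Naive tokenizer (an assumption): lowercase runs of letters and apostrophes.
        words = re.findall(r"[a-z']+", text.lower())
        # Pair each word with the word that follows it.
        for left, right in zip(words, words[1:]):
            if left == first_word:
                found.add(f"{left} {right}")
    return sorted(found)


# Hypothetical stand-ins for poetry texts pulled from the corpus.
sample_texts = [
    "The flowers all ablaze, and flowers about the door.",
    "Flowers above the wall; flowers all in bloom.",
]

result = bigrams_starting_with(sample_texts, "flowers")
print(result)  # ['flowers about', 'flowers above', 'flowers all']
```

This tokenizer ignores punctuation, so it also pairs words across sentence boundaries. That may or may not match what the gist does.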

@aparrish Holy spit this is amazing and something I’ve wanted to do for a few months now, and you’ve done a better job than I was even imagining. Thank you for this!

if you were going to download this today, maybe hold off—I found a frustrating bug where some utf8-encoded texts were being decoded incorrectly with a different encoding, leading to hilarious mojibake when they came out the other end—will post a fix in a few hrs

none of this would be a problem if the reported charset in the metadata was always the correct charset. but there are a lot of texts that report "us-ascii" when what they really mean is "ascii with occasional 8-bit chars just for fun!"

also lots of files (>1%?) that say "yeah I'm utf8 sure whatever" but are actually ISO-8859-1 (according to chardet at least)
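
One way to cope with unreliable charset labels like these, sketched in Python. This is not the fix that went into the repository, only an illustration: the function name `decode_with_fallback` and the order of attempts are assumptions. It tries strict UTF-8 first (invalid UTF-8 fails loudly instead of producing mojibake), then the charset the metadata declares, then ISO-8859-1, which accepts every byte.

```python
def decode_with_fallback(raw, declared_charset):
    """Decode bytes without fully trusting the declared charset.

    Returns (text, charset_used). The order of attempts is an assumption:
    strict UTF-8, then the declared charset, then ISO-8859-1.
    """
    candidates = []
    for charset in ("utf-8", declared_charset, "iso-8859-1"):
        if charset not in candidates:
            candidates.append(charset)

    for charset in candidates:
        try:
            return raw.decode(charset), charset
        except (UnicodeDecodeError, LookupError):
            # UnicodeDecodeError: the bytes are invalid in this charset.
            # LookupError: Python does not know a codec by this name.
            continue

    # Not expected to be reached, because ISO-8859-1 decodes every byte.
    return raw.decode("iso-8859-1", errors="replace"), "iso-8859-1"


# "us-ascii" in the metadata, but the file contains an 8-bit byte.
text, used = decode_with_fallback(b"caf\xe9", "us-ascii")
print(text, used)  # café iso-8859-1
```

The last fallback is a guess, not a detection. Bytes in some other single-byte encoding (an old Mac encoding, say) would still decode "successfully" and come out wrong, which is where a detector such as chardet, or a human, has to step in.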

lessons learned: (a) never trust someone's claim about the encoding of a text file (b) character encodings are bad and trying to digitize text in the first place was a bad idea

@aparrish (crawls from the pit of parsing Norwegian text files) ghaa so much this

@aparrish and chardet can't always get it right. e.g. I recently ran into CSV files that weren't encoded in ISO-8859-1, but instead a MacOS encoding from a similar era.

@aparrish I thought that "ascii with occasional 8-bit chars just for fun!" was the new mandatory charset standard.

okay I just pushed an update and uploaded a new archive with (more) correct character encoding—enjoy github.com/aparrish/gutenberg-

@aparrish I'm definitely going to play with this later! Thanks for doing this!
