Allison Parrish

all I want is a big flat directory with all of the project gutenberg files in plain text format, unambiguously numbered in order, and an easy-to-use file with the associated metadata. why has no one done this?

there's like seventeen hundred different project gutenberg corpus projects but none of them do this. there are lots of straight-up PG mirrors (inheriting PG's weird directory structure), lots of projects that break the files up and rewrite the PG RDF, tons of projects that just select a subset of the files and work with those only, etc etc etc etc. but nothing that's just file, file, file, metadata

@aparrish (thinks...) how big is all of project gutenberg? Like, this could just be an S3 bucket, right?

@simrob straight up *mirroring* isn't that difficult (gutenberg.org/wiki/Gutenberg:M), but there are several million files in a variety of languages and formats (including audiobooks). the real problem is dealing with the byzantine way that PG organizes the files (there are several overlapping schemes for metadata and determining where a file is located, not to mention different formats and text encodings) and getting just the plain text files (and just the ones in English, for my purposes)

@simrob by which I mean that storing/hosting my hypothetical well-organized-text-files gutenberg wouldn't really be the tough part

@aparrish
At one time, I used the Gutenberg bulk API to download all plaintext English-language Gutenberg documents. It gave me, in return, a bunch of nested zip files. When I uncompressed the zip files and flattened everything into a single directory, there were no name conflicts.
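
A minimal sketch of the flattening step described above, assuming the bulk download yields zip archives that may themselves contain zip archives, and that only `.txt` members are wanted. The name `flatten_zip` and the paths in the usage note are hypothetical. It uses only Python's standard `zipfile` module, and it reports basename collisions rather than overwriting files.

```python
import io
import zipfile
from pathlib import Path


def flatten_zip(source, dest_dir, seen=None):
    """Extract every .txt member of a (possibly nested) zip into one flat directory.

    `source` is a path or a binary file-like object.
    Returns the list of basenames that collided with an earlier file.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    seen = set() if seen is None else seen
    conflicts = []
    with zipfile.ZipFile(source) as zf:
        for info in zf.infolist():
            if info.is_dir():
                continue
            name = Path(info.filename).name  # drop the directory part
            data = zf.read(info)
            if name.lower().endswith(".zip"):
                # Nested archive: recurse on the in-memory bytes.
                conflicts += flatten_zip(io.BytesIO(data), dest, seen)
            elif name.lower().endswith(".txt"):
                if name in seen:
                    conflicts.append(name)  # keep the first copy, record the clash
                    continue
                seen.add(name)
                (dest / name).write_bytes(data)
    return conflicts
```

Something like `flatten_zip("dump.zip", "flat/")` would then give a flat directory plus a list of any clashing names, which makes the "no name conflicts" observation checkable rather than anecdotal.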

@aparrish
HOWEVER:
1) processing the text header to produce titles and authors was not totally reliable because of changes in the format over time (though it was mostly reliable)
2) I remember thinking the file basenames didn't quite match the gutenberg numbers either in some cases!

This was years ago so it might not apply, and also I was doing ad-hoc processing instead of looking at a spec so it might not have applied then either.
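
For illustration, ad-hoc header parsing of this kind might look like the sketch below. It assumes the common header layout where `Title:` and `Author:` lines appear before a `*** START OF` marker. As noted above, that layout has changed over time, so some files will come back with missing fields. The name `parse_header` and the `max_lines` cutoff are hypothetical.

```python
import re

# Assumed header conventions; older files may not follow them.
FIELD_RE = re.compile(r"^(Title|Author|Language):\s*(.+?)\s*$")
START_RE = re.compile(r"^\*\*\*\s*START OF")


def parse_header(text, max_lines=200):
    """Return a dict of header fields found before the start marker."""
    fields = {}
    for i, line in enumerate(text.splitlines()):
        if i >= max_lines or START_RE.match(line):
            break  # stop at the body so in-text "Title:" lines are ignored
        m = FIELD_RE.match(line)
        if m and m.group(1).lower() not in fields:
            fields[m.group(1).lower()] = m.group(2)
    return fields
```

Counting how many files return an empty or partial dict gives a rough measure of how unreliable the header approach is on a given dump.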

@aparrish
If you're OK with, like, 90% or more of them having metadata retrievable, just do what I did.

(Also maybe bulk download can also give you indexes with metadata? Not sure.)

@enkiv2 yeah, I've done exactly this before (or something similar without moving the files out of the directories from the dump I was working with). but the "just" there in "just do what I did" is pretty substantial, especially if I want the process to be repeatable (or if I want to include the results in an academic paper and be able to show that my use accurately reflects the contents of the archive in some way, as is my goal in this instance) and would love to have >90% accuracy on metadata

@aparrish
Fair enough. This method isn't repeatable or professional at all.

@enkiv2 the metadata is actually all available in RDF format that you can parse pretty easily! the trick is relating the RDFs back to the actual documents they refer to: gutenberg.org/wiki/Gutenberg:F
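
A rough sketch of the RDF side, using only the standard library. The element names and namespaces here (`pgterms:ebook`, `dcterms:title`, `dcterms:creator`, `dcterms:hasFormat`, `pgterms:file`) are assumptions about how the per-book RDF files are laid out and should be checked against a real file. The function name `summarize_rdf` is hypothetical. It pulls out the book number, title, authors, and any file URLs whose declared format starts with `text/plain`, which is one possible way to relate a record back to its documents.

```python
import xml.etree.ElementTree as ET

# Assumed namespaces for the catalog RDF; verify against an actual record.
NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "dcterms": "http://purl.org/dc/terms/",
    "pgterms": "http://www.gutenberg.org/2009/pgterms/",
}
RDF_ABOUT = "{%s}about" % NS["rdf"]


def summarize_rdf(xml_text):
    """Return id, title, authors and plain-text file URLs for one RDF record."""
    root = ET.fromstring(xml_text)
    ebook = root.find("pgterms:ebook", NS)
    if ebook is None:
        return None
    # rdf:about is assumed to look like "ebooks/12345".
    book_id = ebook.get(RDF_ABOUT, "").rsplit("/", 1)[-1]
    title = ebook.findtext("dcterms:title", default=None, namespaces=NS)
    authors = [
        n.text
        for n in ebook.findall("dcterms:creator/pgterms:agent/pgterms:name", NS)
    ]
    text_urls = []
    for f in ebook.findall("dcterms:hasFormat/pgterms:file", NS):
        mimes = [
            v.text or ""
            for v in f.findall("dcterms:format/rdf:Description/rdf:value", NS)
        ]
        if any(m.startswith("text/plain") for m in mimes):
            text_urls.append(f.get(RDF_ABOUT))
    return {"id": book_id, "title": title, "authors": authors, "text_urls": text_urls}
```

Matching the basenames of `text_urls` against the files in a flat directory would show which records have no corresponding text file and which files have no record.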

I dunno, I just feel like the whole process should be easier and less arcane

@aparrish

It's super specific (and likely not related to what you're wanting but still throwing it out there) but the Delphi Works series of ebooks are really good with regards to at least getting all the major classics. Most of those classics are super white and traditional but the big ones are there: Dickens, Conrad, Twain, Hawthorne, etc.

delphiclassics.com/

Can be bought there or pirated if you know the right places.

@sonicbooming this is cool but I'm specifically interested in using PG as a corpus for text analysis/generative text—not interested in actually, like, reading them, sitting down in an easy chair or whatever

@aparrish I figured, sorry I was butting in. That was pretty rude of me.

@sonicbooming @aparrish Is it really possible to "butt in" on a public social media post? 🙂

@Ricardus @aparrish I think so. Sometimes it's easy to try to just jump in and say some shit and make assumptions. I didn't want to be that way. Often people are just in their stream sharing some insight into a thing they're dealing with and we get a peek but that doesn't mean we're always invited to contribute. Hehe.

@sonicbooming It looked to me like you were trying to be helpful. I have very strong opinions about people who go on social media, and then complain when someone reaches out to them in a reasonable way.

If they don't want to be reached, maybe social media isn't the right thing for them.

@Ricardus Fair. I think I might be a bit aware as I've seen a lot of recent social media surrounding how people, specifically men, try to mansplain in other people's lives.

And I'm hoping I wasn't coming across that way.

You're right that if you're sharing some stuff in social media, you shouldn't be surprised when others jump in, but still, I could see how people constantly interrupting your life to try to explain or tell you some shit could get old/frustrating/annoying.

@Ricardus Absolute truth. I was second-guessing my comment and thinking, “Were you trying to 'solve' a stranger's shit and mansplain b/c sometimes you do that.”

I also have my own anxiety, so I'm sure that's there too.

@sonicbooming Well, I follow that person, and saw their lament about how the thing they wanted wasn't available, and they weren't clear about how they didn't want to read them, but wanted them for something else. I saw your response as helpful.

@Ricardus So welcome to living in my dumb brain, hahaha.

It's fun in here, all this anxiety and second-guessing. :D

@aparrish If you find (or build) this, I want to know!
