there's like seventeen hundred different project gutenberg corpus projects but none of them do this. there are lots of straight-up PG mirrors (inheriting PG's weird directory structure), lots of projects that break the files up and rewrite the PG RDF, tons of projects that just select a subset of the files and work with those only, etc etc etc etc. but nothing that's just file, file, file, metadata
@simrob straight up *mirroring* isn't that difficult (http://www.gutenberg.org/wiki/Gutenberg:Mirroring_How-To), but there are several million files in a variety of languages and formats (including audiobooks). the real problem is dealing with the byzantine way that PG organizes the files (there are several overlapping schemes for metadata and determining where a file is located, not to mention different formats and text encodings) and getting just the plain text files (and just the ones in English, for my purposes)
At one time, I used the Gutenberg bulk API to download all plaintext English-language Gutenberg documents. It gave me, in return, a bunch of nested zip files. When I uncompressed the zip files and flattened everything into a single directory, there were no name conflicts.
1) processing the text header to produce titles and authors was not totally reliable because of changes in the format over time (though it was mostly reliable)
2) I remember thinking the file basenames didn't quite match the Gutenberg numbers in some cases either!
This was years ago so it might not apply anymore, and I was also doing ad-hoc processing instead of working from a spec, so it might not have applied then either.
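For what it's worth, the header processing in (1) can be sketched roughly like this. It's a minimal sketch, not what I actually ran: it assumes the header has `Title:` and `Author:` lines somewhere before a `*** START OF` marker, and the `parse_header` name and `max_lines` cutoff are made up for illustration. Files that don't follow that layout come back with `None`, which is exactly the "mostly reliable" part.

```python
import re

# Assumed header layout: "Title: ..." and "Author: ..." lines that appear
# before a "*** START OF ..." marker. Older files may not follow this.
FIELD_RE = re.compile(r"^(Title|Author):\s*(.+?)\s*$")
START_RE = re.compile(r"^\*\*\*\s*START OF")


def parse_header(text, max_lines=200):
    """Return {'title': ..., 'author': ...}; a value is None if not found."""
    found = {"title": None, "author": None}
    for line in text.splitlines()[:max_lines]:
        # Stop at the start-of-text marker so body text is never scanned.
        if START_RE.match(line):
            break
        m = FIELD_RE.match(line)
        if m:
            key = m.group(1).lower()
            # Keep only the first occurrence of each field.
            if found[key] is None:
                found[key] = m.group(2)
    return found
```

Counting how many files come back with a `None` title or author gives a quick estimate of how unreliable the header approach is for a given dump.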
@enkiv2 yeah, I've done exactly this before (or something similar without moving the files out of the directories from the dump I was working with). but the "just" there in "just do what I did" is pretty substantial, especially if I want the process to be repeatable (or if I want to include the results in an academic paper and be able to show that my use accurately reflects the contents of the archive in some way, as is my goal in this instance). I'd also love to have >90% accuracy on metadata
@enkiv2 the metadata is actually all available in RDF format that you can parse pretty easily! the trick is relating the RDFs back to the actual documents they refer to https://www.gutenberg.org/wiki/Gutenberg:Feeds#The_Complete_Project_Gutenberg_Catalog
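A rough sketch of the "relating the RDFs back to the documents" step, using only the standard library. Everything specific here is an assumption to check against a real record: the namespace URIs, the `pgterms:ebook` / `dcterms:hasFormat` / `pgterms:file` element layout, and the idea that plain-text files can be spotted by `.txt` in the file name. `ebook_files` and `plain_text_urls` are hypothetical helper names.

```python
import xml.etree.ElementTree as ET

# Namespaces assumed for the per-book catalog RDF; verify against a real file.
NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "dcterms": "http://purl.org/dc/terms/",
    "pgterms": "http://www.gutenberg.org/2009/pgterms/",
}
RDF_ABOUT = "{%s}about" % NS["rdf"]


def ebook_files(rdf_xml):
    """Return (ebook_id, title, [file URLs]) for one RDF record, or None."""
    root = ET.fromstring(rdf_xml)
    ebook = root.find("pgterms:ebook", NS)
    if ebook is None:
        return None
    # rdf:about is assumed to look like "ebooks/1234"; keep the last segment.
    ebook_id = ebook.get(RDF_ABOUT, "").rsplit("/", 1)[-1]
    title = ebook.findtext("dcterms:title", default=None, namespaces=NS)
    urls = [
        f.get(RDF_ABOUT)
        for f in ebook.iterfind("dcterms:hasFormat/pgterms:file", NS)
    ]
    return ebook_id, title, urls


def plain_text_urls(urls):
    """Keep URLs whose file name contains '.txt' (a heuristic, not a spec)."""
    return [u for u in urls if u and ".txt" in u.rsplit("/", 1)[-1]]
```

The URLs this returns still have to be mapped onto wherever the files landed in a local mirror, which is the arcane part; the sketch only gets as far as "this record claims these files".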
I dunno, I just feel like the whole process should be easier and less arcane
It's super specific (and likely not related to what you're wanting but still throwing it out there) but the Delphi Works series of ebooks are really good with regards to at least getting all the major classics. Most of those classics are super white and traditional but the big ones are there: Dickens, Conrad, Twain, Hawthorne, etc.
Can be bought there or pirated if you know the right places.
@Ricardus @aparrish I think so. Sometimes it's easy to try to just jump in and say some shit and make assumptions. I didn't want to be that way. Often people are just in their stream sharing some insight into a thing they're dealing with and we get a peek, but that doesn't mean we're always invited to contribute. Hehe.
@sonicbooming It looked to me like you were trying to be helpful. I have very strong opinions about people who go on social media, and then complain when someone reaches out to them in a reasonable way.
If they don't want to be reached, maybe social media isn't the right thing for them.
@Ricardus Fair. I think I might be a bit aware as I've seen a lot of recent social media surrounding how people, specifically men, try to mansplain in other people's lives.
And I'm hoping I wasn't coming across that way.
You're right that if you're sharing some stuff on social media, you shouldn't be surprised when others jump in, but still, I could see how people constantly interrupting someone's life to explain or tell them some shit could get old/frustrating/annoying.