
top 25 keyword adjectives with length greater than seven characters from shakespeare's sonnets

(this is from my latest attempt to do the thing that I was talking about here mastodon.social/@aparrish/9966, i.e., simple keyword extraction from small corpora using only stuff you get with spaCy, in particular spaCy's English-wide unigram log probabilities. current solution: use G² [see e.g. tdunning.blogspot.com/2008/03/], as implemented in scipy's chi2_contingency function, using a wild guess for what the actual token count is for spaCy's unigram frequencies. I guessed 1000000 and it seems to work fine)
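(editor's sketch, not the author's code: one way the scoring step described above could look. It computes G² by hand on a 2x2 table of "this word vs. every other token" by "small corpus vs. reference", and turns a unigram log probability into a reference count using the guessed total of 1000000. The function names, the assumption that the log probability is a natural log, and every number in the usage lines are made up for illustration.)

```python
import math

# The thread's "wild guess" for how many tokens the reference frequencies cover.
ASSUMED_REFERENCE_TOTAL = 1_000_000


def g2(k1, n1, k2, n2):
    """Log-likelihood ratio statistic (G2) for a 2x2 contingency table.

    k1 of n1 tokens in the small corpus are the word of interest;
    k2 of n2 tokens in the reference are the word of interest.
    """
    observed = [[k1, n1 - k1], [k2, n2 - k2]]
    row_sums = [n1, n2]
    col_sums = [k1 + k2, (n1 - k1) + (n2 - k2)]
    total = n1 + n2
    stat = 0.0
    for i in range(2):
        for j in range(2):
            o = observed[i][j]
            if o > 0:  # a zero cell contributes nothing to the sum
                expected = row_sums[i] * col_sums[j] / total
                stat += o * math.log(o / expected)
    return 2.0 * stat


def keyword_score(count_in_corpus, corpus_size, log_prob,
                  reference_total=ASSUMED_REFERENCE_TOTAL):
    """Score one word, given its count in the small corpus and a unigram
    log probability (assumed here to be a natural log)."""
    # Convert the log probability into an estimated count in the reference.
    reference_count = math.exp(log_prob) * reference_total
    return g2(count_in_corpus, corpus_size, reference_count, reference_total)


# Hypothetical numbers, chosen only to show the shape of the result.
rare_in_english = keyword_score(count_in_corpus=10, corpus_size=17_500, log_prob=-14.0)
common_in_english = keyword_score(count_in_corpus=2, corpus_size=17_500, log_prob=-9.0)
print(rare_in_english > common_in_english)
```

(with scipy installed, `chi2_contingency(table, correction=False, lambda_="log-likelihood")` should return this same statistic as its first value; by default scipy applies a continuity correction to 2x2 tables, so its numbers differ slightly unless `correction=False` is passed.)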

tdunning.blogspot.com: Surprise and Coincidence. Some years ago, I wrote a simple paper, Accurate Methods for the Statistics of Surprise and Coincidence that has since seen quite a history...
Allison Parrish

(for comparison, the 7+ character adjectives deemed least keyword-like by this method are words like 'married', 'curious', 'necessary', 'private', 'religious', 'forward', 'beautiful', 'several', 'certain', 'different' etc. and of course it works with *all* kinds of words— limiting it to just 7+ character adjectives was just a fun experiment)

@aparrish So “beauteous” is way keywordier than “beautiful”? That’s…counterintuitive.

@wrenpile "keyword" in this context meaning words that are particular to the text in question, as opposed to some other randomly chosen English text. the insight here is that many texts contain the word "beautiful," but shakespeare's sonnets are special in that the word "beauteous" occurs frequently there but not in English in general

@aparrish Right, you made your methods quite clear earlier. I’m guessing that your goal for a keyword would be to stand out *semantically* from the corpus; if so, there’s some distance yet to be covered.

@wrenpile characterizing semantics would be a different set of techniques altogether (topic modeling). I'm personally more interested in keywords as a way of analyzing distinctive word *choices* in a text even if those words have similar meanings to higher-frequency words.

@aparrish Oh, ok — it makes perfect sense then.

@wrenpile @aparrish beauteous is an absolutely gorgiful word