**Allison Parrish** @aparrish · Mar 14, 2018

top 25 keyword adjectives with length of at least seven characters from shakespeare's sonnets

python code and output: ['beauteous', 'fairest', 'eternal', 'gracious', 'precious', 'sweetest', 'outward', 'mistress', 'strange', 'antique', 'forsworn', 'contented', 'virtuous', 'heavenly', 'sovereign', "imprison'd", "unfather'd", "confin'd", 'unrespected', 'outworn', 'bounteous', 'shouldst', 'inconstant', 'obsequious', 'evermore']

**Allison Parrish** @aparrish · Mar 14, 2018

(this is from my latest attempt to do the thing that I was talking about here https://mastodon.social/@aparrish/99660550376065760, i.e., simple keyword extraction from small corpora using only stuff you get with spaCy, in particular spaCy's English-wide unigram log probabilities. current solution: use G² [see e.g. http://tdunning.blogspot.com/2008/03/surprise-and-coincidence.html], as implemented in scipy's chi2_contingency function, using a wild guess for what the actual token count is for spaCy's unigram frequencies. I guessed 1000000 and it seems to work fine)
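
A minimal sketch of the scoring step described above, under stated assumptions rather than the author's actual code. It computes G² directly with the standard library instead of calling scipy (scipy's `chi2_contingency` accepts `lambda_="log-likelihood"` for the same statistic). The function names, the sign convention, and the natural-log assumption for the unigram log probability are all illustrative; the 1,000,000 reference token count is the guess mentioned in the post.

```python
import math

# Guessed size of the reference corpus behind the unigram log probabilities
# (the post says 1000000 "seems to work fine").
ASSUMED_REFERENCE_TOKENS = 1_000_000


def g2(table):
    """G² (log-likelihood ratio) statistic for a contingency table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            if observed > 0:  # a zero cell contributes nothing to the sum
                expected = row_totals[i] * col_totals[j] / total
                stat += observed * math.log(observed / expected)
    return 2.0 * stat


def keyword_score(count_in_text, text_tokens, log_prob,
                  reference_tokens=ASSUMED_REFERENCE_TOKENS):
    """Signed G² for one word (hypothetical helper, not from the post).

    log_prob is assumed to be a natural-log unigram probability, which is
    turned into a pseudo-count against the guessed reference corpus size.
    """
    ref_count = math.exp(log_prob) * reference_tokens
    table = [
        [count_in_text, text_tokens - count_in_text],
        [ref_count, reference_tokens - ref_count],
    ]
    score = g2(table)
    # Assumption: negate the score when the word is rarer in the text than
    # in the reference, so that sorting separates the two directions.
    if count_in_text / text_tokens < ref_count / reference_tokens:
        score = -score
    return score
```

Sorting a vocabulary by `keyword_score` in descending order would put words overrepresented in the small corpus first and words that are common in general English but rare in the corpus last.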

Allison Parrish @aparrish@mastodon.social

(for comparison, the 7+ character adjectives deemed least keyword-like by this method are words like 'married', 'curious', 'necessary', 'private', 'religious', 'forward', 'beautiful', 'several', 'certain', 'different' etc. and of course it works with *all* kinds of words— limiting it to just 7+ character adjectives was just a fun experiment)

Mar 14, 2018, 10:14 PM

**Lew Perin** @wrenpile · Mar 15, 2018

@aparrish So “beauteous” is way keywordier than “beautiful”? That’s…counterintuitive.

**Allison Parrish** @aparrish · Mar 15, 2018

@wrenpile "keyword" in this context meaning words that are particular to the text in question, as opposed to some other randomly chosen English text. the insight here is that many texts contain the word "beautiful," but shakespeare's sonnets are special in that the word "beauteous" occurs frequently there but not in English in general

**Lew Perin** @wrenpile · Mar 15, 2018

@aparrish Right, you made your methods quite clear earlier. I’m guessing that your goal for a keyword would be to stand out *semantically* from the corpus; if so, there’s some distance yet to be covered.

**Allison Parrish** @aparrish · Mar 15, 2018

@wrenpile characterizing semantics would be a different set of techniques altogether (topic modeling). I'm personally more interested in keywords as a way of analyzing distinctive word *choices* in a text even if those words have similar meanings to higher-frequency words.