Poetry & Data II: Identifying meter in poetry using Python

The code referred to in this post can be found on GitHub.

To identify which meter a poem uses, we need the computer to determine which syllables in a sentence are stressed and which are not. This turns out to be quite tricky. Compare the following two examples:

[Example 1, syllables as extracted: should have hur ried youth truth and moved quick ways]

[Example 2, syllables as extracted: hoped i'd have some oth year two]

Both examples contain the word "have", but in the first example it is unstressed, while in the second it is stressed. Why? Beats me. It's just what happens in my head when I read it. Sadly, that's not very useful information for a computer. Teaching a computer to find the right scansion (marking the stressed and unstressed syllables) turns out to be quite a complex problem. I found a Python library called pronouncing that can help determine the stressed and unstressed syllables of a single word. For example:

import pronouncing

# phones_for_word returns a list of possible pronunciations; use the first one
pronunciation = pronouncing.phones_for_word('have')
pronouncing.stresses(pronunciation[0])
> '1'

pronunciation = pronouncing.phones_for_word('another')
pronouncing.stresses(pronunciation[0])
> '010'

where 1 denotes a stressed syllable, and 0 denotes an unstressed syllable. For "have", it gives us a single stressed syllable, which as we saw in the example may or may not be correct based on the context. For "another" it returns 010, which is in line with the scansion of our second example. I think this is correct regardless of context; try to pronounce "another" like "ANother" or "anothER" and you'll understand why.
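To make the limitation concrete, here is a minimal sketch of building a line-level scansion by gluing per-word stress strings together. The names `TOY_STRESSES` and `line_scansion` are hypothetical, and the small table is a stand-in I made up for illustration; with pronouncing installed, the lookup could instead be filled from `phones_for_word` and `stresses` as shown above.

```python
# Toy stress table standing in for a pronouncing-dictionary lookup.
# The entries are illustrative assumptions, not output copied from a library.
TOY_STRESSES = {
    "i": "1",
    "should": "1",
    "have": "1",
    "another": "010",
    "year": "1",
}


def line_scansion(line, stress_lookup):
    """Concatenate the per-word stress strings of a line.

    Words missing from the lookup are marked with '?' so the gap stays visible.
    """
    pattern = []
    for word in line.lower().split():
        word = word.strip(".,;:!?\"'")  # drop punctuation at the word edges
        pattern.append(stress_lookup.get(word, "?"))
    return "".join(pattern)


print(line_scansion("I should have another year", TOY_STRESSES))  # 1110101
```

Because the lookup works word by word, "have" gets the same mark in every line, which is exactly why this approach alone cannot produce a perfect scansion.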

So we are definitely not going to get a perfect scansion for each poem by simply using this pronouncing library. However, I came up with a method to use it to at least get the poem's primary meter:

1. For each line, determine the scansion by determining the scansion of each word. This gives us a string of 0's and 1's.
2. Divide the lines of the poem into groups by the number of syllables per line. (So, for example, lines with 9 syllables are grouped together.)
3. Try to find the closest match for each line in a set of known meters, also represented as 0's and 1's.
4. Per group, find the known meter that was matched most often.
5. Now we have a set of meters that together make up the meter of the poem.
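The grouping, matching and voting steps can be sketched roughly as follows. This is not the code from the repository: the `KNOWN_METERS` catalogue, the function names, and the choice of "closest" as the fewest differing positions among meters of the same length are all assumptions made for illustration. The input is assumed to be scansion strings that contain only 0's and 1's.

```python
from collections import Counter, defaultdict

# Hypothetical catalogue of known meters as 0/1 strings (an assumption for
# illustration; a real catalogue would cover more meters and line lengths).
KNOWN_METERS = {
    "iambic tetrameter": "01" * 4,
    "trochaic tetrameter": "10" * 4,
    "iambic pentameter": "01" * 5,
    "anapestic tetrameter": "001" * 4,
}


def mismatches(a, b):
    """Number of positions where two equal-length patterns differ."""
    return sum(x != y for x, y in zip(a, b))


def closest_meter(scansion, meters=KNOWN_METERS):
    """Name of the same-length known meter with the fewest mismatches, or None."""
    candidates = [
        (mismatches(scansion, pattern), name)
        for name, pattern in meters.items()
        if len(pattern) == len(scansion)
    ]
    return min(candidates)[1] if candidates else None


def primary_meters(scansions):
    """Group lines by syllable count and pick the most common match per group."""
    groups = defaultdict(list)
    for scansion in scansions:
        groups[len(scansion)].append(scansion)
    result = {}
    for length, lines in groups.items():
        votes = Counter(closest_meter(line) for line in lines)
        votes.pop(None, None)  # ignore lines with no same-length known meter
        if votes:
            result[length] = votes.most_common(1)[0][0]
    return result


poem = ["01010101", "11010101", "01010111", "10101010", "0101010101"]
print(primary_meters(poem))
# {8: 'iambic tetrameter', 10: 'iambic pentameter'}
```

Voting per group means a few badly scanned lines (like the second and third above) do not change the outcome as long as most lines in the group land on the same meter.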

There are a few more steps in this process to make it work. In the first post on this subject, we saw that sprog quite often splits a line over two or even more lines. For example:

[Example with the split lines, syllables as extracted: throw off the chains pres sion said fair fet terred and free]

To accurately determine the meter of the poem, we can recognize that the latter two lines are actually one line split into two. If we merge them back together, we get two lines with the same meter:

[Example with the lines merged, syllables as extracted: throw off the chains pres sion said fair fet terred and free]

Now we recognize this as being anapestic tetrameter, with the first unstressed syllable omitted. This is called iambic substitution, since the first anapestic foot is replaced with an iambic foot.
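In the 0/1 notation used earlier, this is easy to check. The short sketch below (the variable names are my own) builds a full anapestic tetrameter line, drops its first unstressed syllable, and confirms that the result is the same as one iamb followed by three anapests.

```python
ANAPEST = "001"  # unstressed, unstressed, stressed
IAMB = "01"      # unstressed, stressed

full_line = ANAPEST * 4            # regular anapestic tetrameter
headless = full_line[1:]           # first unstressed syllable omitted
substituted = IAMB + ANAPEST * 3   # iamb in place of the first anapest

print(full_line, len(full_line))   # 001001001001 12
print(headless, len(headless))     # 01001001001 11
print(headless == substituted)     # True
```

The 11-syllable pattern is the one the merged lines in the example are matched against.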

In my code I have built a function that looks for these kinds of lines and combines them: it looks for lines that together have the same number of syllables as a longer line in that poem (in this example 6 + 5 = 11). If such a set of lines is found, they are merged into a single line. I'm not 100% sure this is the right thing to do when analyzing the meter of a poem, but it does make a lot of sense to me. Besides, if poets get to split lines and call that 'artistic freedom', I think data scientists are allowed some 'scientific freedom' to merge them back together.
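A simplified sketch of that idea is shown below. It is not the function from the repository: `merge_split_lines` is a hypothetical name, the input is assumed to be a list of `(text, syllable_count)` pairs with the counts already computed, and only adjacent pairs are merged, whereas a line may in practice be split over more than two lines.

```python
def merge_split_lines(lines):
    """Merge adjacent short lines whose syllable counts add up to the
    syllable count of another line in the poem.

    lines: list of (text, syllable_count) tuples in poem order.
    """
    targets = {count for _, count in lines}  # syllable counts seen in the poem
    merged = []
    i = 0
    while i < len(lines):
        text, count = lines[i]
        if i + 1 < len(lines):
            next_text, next_count = lines[i + 1]
            # Both parts must be non-empty, and their total must equal the
            # length of some (necessarily longer) line elsewhere in the poem.
            if count > 0 and next_count > 0 and count + next_count in targets:
                merged.append((text + " " + next_text, count + next_count))
                i += 2
                continue
        merged.append((text, count))
        i += 1
    return merged


# Placeholder texts; the counts mirror the 11 / 6 + 5 example above.
poem = [("line A", 11), ("line B", 6), ("line C", 5)]
print(merge_split_lines(poem))
# [('line A', 11), ('line B line C', 11)]
```

A rule this simple can also merge lines that were meant to be separate (two 4-syllable lines in a poem that also has an 8-syllable line, say), so it is best treated as a heuristic.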

Now, I can talk a lot more about this process, since it took me quite some time to build something that performs satisfactorily, but I propose we just continue with applying the logic to sprog's poems and take a look at the results! You can do so by returning to this post.