Automated detection of rhymes in texts (From rhymes to networks 4)


Having discussed how to annotate rhymes in last month's blog post, we can now discuss the automated detection of rhymes. I am fascinated by this topic, although I have not managed to find a proper approach yet. What fascinates me more, however, is how easily the problem is misunderstood. I have witnessed this a couple of times in discussions with colleagues. When mentioning my wish to create a magic algorithm that does the rhyme annotation for me, so that I no longer need to do it manually, nobody seems to agree with me that the problem is not trivial.

On the contrary, the problem seems to be so easy that it should have been solved a couple of years ago already. One typical answer is that I should just turn to artificial intelligence and neural networks, whatever this means concretely, and that they would certainly outperform any algorithm that was proposed in the past. Another typical answer, slightly more subtle, assumes that some kind of phonetic comparison should easily reveal what we are dealing with.

Unfortunately, neither of these approaches works. So, instead of presenting a magic algorithm that works, I will use this post to try and explain why I think that the problem of rhyme detection is far less trivial than people seem to think.

Defining the problem of automated rhyme detection

Before we can discuss potential solutions to rhyme detection, we need to define the problem. If we think of a rhyme annotation model that allows us to annotate rhymes at the level of specific word parts (not restricted to entire words), the most general rhyme detection problem can be presented as follows:
Given a rhyme corpus that is divided into poems, with poems divided into stanzas, and stanzas being divided into lines, find all of the word parts that clearly rhyme with each other within each stanza within each poem within the corpus.
With respect to machine learning strategies, we can further distinguish supervised versus unsupervised learning. While supervised learning for the rhyme detection problem would build on a large annotated rhyme corpus, in order to infer the best strategies to identify words that rhyme and words that do not rhyme, unsupervised approaches would not require any training data at all.

With respect to the application target, we should further specify whether we want our approach to work for a multilingual sample or just a single language. If we want the method to work on a truly multilingual (that is: cross-linguistic) basis, we would probably need to require a unified transcription for speech sounds as input. It is already obvious that, although the annotation schema I presented last month is quite general, it would not work for languages whose writing systems are not written from left to right, for example, not to speak of writing systems that are not alphabetic.

Why rhyme detection is difficult

It is obvious that the most general problem for rhyme detection would be the cross-linguistic unsupervised detection of rhymes within a corpus of poetry. Developing systems for monolingual rhyme detection seems to be a bit trivial, given that one could just assemble a big list of words that rhyme in a given language, and then find where they occur in a given corpus. However, given that the goal of poetry is also to avoid "boring" rhymes, and come up with creative surprises, it may turn out to be less trivial than it seems at first sight.

As an example, consider the following refrain from a recent hip-hop song by German comedian Carolin Kebekus, in which the text rhymes Gemeinden (community) with vereinen (unite), as well as Mädchen (girl) with Päpstin (female pope) (the video has English subtitles for those who are interested in the text but do not speak German).

Figure 1: Rhyme example from a recent German hip-hop song.

While one could argue about whether those words qualify as proper rhymes and were intended as such, I am quite convinced that the words were chosen for their near-rhyme similarity, and I am also convinced that most native speakers of German listening to the song will understand the intended rhyme here. Neither rhyme is perfect, but both are close enough, and they are beyond doubt creative and unexpected; it is extremely unlikely that one could find them in any German rhyme book. This example shows that humans' creative treatment of language constantly searches for similarities that have not been used by others before. As a result, we cannot simply use a static look-up table of licensed rhyme words to solve the problem of rhyme detection for a particular language.

What we need instead is some way to estimate the phonetic similarity of word parts, in order to check whether they could rhyme or not. However, since languages may have different rhyme rules, these similarities would have to be adjusted for each language. While phonetic similarity can be measured fairly well with the help of alignment algorithms applied to phonetic transcriptions, what counts as being similar may differ from language to language, and rhyme usually reflects local similarity of words.
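To illustrate what "local similarity" measured by alignment could look like, here is a minimal sketch of a Smith-Waterman style local alignment over sequences of sound segments. It is not a rhyme detector and not a method proposed in any of the studies mentioned in this post. The function name local_alignment_score, the scoring values, and the simplified segmentations of the two German words are assumptions made only for this example; a serious approach would need language-specific sound similarities instead of plain identity.

```python
from typing import Sequence


def local_alignment_score(
    a: Sequence[str],
    b: Sequence[str],
    match: int = 2,
    mismatch: int = -1,
    gap: int = -1,
) -> int:
    """Return the best local (Smith-Waterman) alignment score of two
    sequences of sound segments. Scores are illustrative assumptions."""
    # Only the previous row of the dynamic programming matrix is needed.
    previous = [0] * (len(b) + 1)
    best = 0
    for seg_a in a:
        current = [0]
        for j, seg_b in enumerate(b, start=1):
            diagonal = previous[j - 1] + (match if seg_a == seg_b else mismatch)
            # A local alignment may restart anywhere, hence the floor of 0.
            score = max(0, diagonal, previous[j] + gap, current[j - 1] + gap)
            current.append(score)
            best = max(best, score)
        previous = current
    return best


# Hypothetical, simplified segmentations (not authoritative transcriptions).
gemeinden = ["g", "ə", "m", "ai", "n", "d", "ə", "n"]
vereinen = ["f", "ɛ", "r", "ai", "n", "ə", "n"]

print(local_alignment_score(gemeinden, vereinen))
```

The score rewards the shared stretch at the end of both words and ignores the differing beginnings, which is the sense in which the similarity is local. Deciding which score is high enough to count as a rhyme in a given language is exactly the part that this sketch leaves open.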

Since rhyme is closely accompanied by rhythm and word or phrase stress, we would also need this information to be supplied from the original transcriptions. All in all, working on a general method for rhyme detection seems like a hell of an enterprise, specifically as long as we lack any datasets that we could use for testing and training.

Less interesting sub-problems and proposed solutions

While, to the best of my knowledge, nobody has ever tried to propose a solution for the general problem of rhyme detection as I outlined it above, there are some studies in which a sub-problem of rhyme detection has been tackled. This sub-problem can be presented as follows:
Given a rhyme corpus of poems that are divided into stanzas, which are themselves divided into lines, try to find the rhyme schemas underlying each stanza.
This problem, which has often been called rhyme scheme discovery, has been addressed using at least three approaches that I have been able to find. Reddy and Knight (2011) employ basic assumptions about the repetition of rhyme pairs in order to create an unsupervised method based on expectation maximization. Addanki and Wu (2013) test the usefulness of Hidden Markov Models for unsupervised rhyme scheme detection. Haider and Kuhn (2018) use Siamese Recurrent Networks for a supervised approach to the same problem. Additionally, Plecháč (2018) proposes a modification of the algorithm by Reddy and Knight, and tests it on three languages (English, Czech, and French).

One could go into the details, and discuss the advantages and disadvantages of these approaches. However, in my opinion it is much more important to emphasize the fundamental difference between the task of rhyme scheme detection and the problem of general rhyme detection, as I have outlined it above. Rhyme scheme detection does not seek to explain rhyme in terms of partial word similarity, but rather assumes that a general overarching structure (in terms of rhyme schemas) underlies all kinds of rhymed poetry.

There are immediate consequences to assuming that rhymed poetry needs to be organized by rhyme schemes. First, the underlying model does not accept rhymes that occur in any other place than the end of a given line, which is problematic, specifically when dealing with more recent genres like hip-hop. Second, if one assumes that rhyme scheme structure dominates rhymed poetry, the model does not accept any immediate, more spontaneous forms of rhyming, which, however, frequently occur in human language (compare the famous examples in political speech, discussed by Jakobson 1958).

Concentrating on rhyme schemes, instead of rhyme word detection, has immediate consequences for the algorithms. First, the methods need to be applied to "normal" poetry, given that any form of poetry that evades the strict dominance of rhyme schemes cannot be characterized properly by the underlying rhyme model. Second, all that the methods need as input are the words occurring at the end of a line, since these are the only ones that can rhyme (and the test datasets are all constructed in this way alone). Third, the methods are all trained in such a way that they need to identify rhymes in a text, so that they cannot be used to test whether a given text collection rhymes or not.
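The restriction to line-final words can be made concrete with a small sketch. It reduces each line of a stanza to its last word and assigns scheme letters by comparing the last three letters of the spelling. This is a deliberately naive toy written for this illustration, not the method of any of the studies cited above; the function names final_word and naive_scheme and the suffix length are assumptions. The stanza is the opening of Eleanor Farjeon's "Morning has broken".

```python
import string


def final_word(line: str) -> str:
    """Return the last word of a line, lowercased and without punctuation."""
    words = line.split()
    return words[-1].strip(string.punctuation).lower() if words else ""


def naive_scheme(stanza: list[str], suffix_length: int = 3) -> str:
    """Assign scheme letters to lines by identical orthographic endings."""
    labels: dict[str, str] = {}
    scheme = []
    for line in stanza:
        ending = final_word(line)[-suffix_length:]
        if ending not in labels:
            if len(labels) >= len(string.ascii_lowercase):
                raise ValueError("more than 26 distinct line endings")
            labels[ending] = string.ascii_lowercase[len(labels)]
        scheme.append(labels[ending])
    return "".join(scheme)


stanza = [
    "Morning has broken like the first morning",
    "Blackbird has spoken like the first bird",
    "Praise for the singing, praise for the morning",
    "Praise for them springing fresh from the Word",
]

print([final_word(line) for line in stanza])
print(naive_scheme(stanza))
```

Once the stanza has been reduced to the four line-final words, the line-internal pairs broken/spoken and singing/springing are no longer part of the input at all, so no amount of clever modelling on top of this representation can recover them.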

Outlook

In this post, I have tried to present what I consider to be the "ultimate" problem of rhyme detection, a problem that I consider to be the "general" rhyme detection problem in computational approaches to literature. In contrast, I think that the problem of detecting only rhyme schemes is much less interesting than the general rhyme detection problem. The focus on rhyme schemes, instead of focusing on the actual words that rhyme, reflects a certain lack of knowledge regarding the huge variation by which people rhyme words across different languages, cultures, styles, and epochs.

If all poetry followed the same rhyme schemes, then we would not need any rhyme detection methods at all. Think of Shakespeare's 154 sonnets, all coded in the same rhyme schema: no algorithm would be needed to detect the rhyme schema, as we already know it beforehand — for a perfect supervised method, it would be enough to pass the algorithm the line numbers and the resulting schema.

The picture changes, however, when working with different styles, especially those representing an emerging rather than an established tradition of poetry. Rhyme schemes in the most ancient Chinese inscriptions, for example, are far less fixed (Behr 2008). In modern hip-hop lyrics, which also represent a tradition that has only recently emerged, it does not make real sense to talk about rhyme schemes either, as can be easily seen from the following excerpt of Akhenaton's Mes soleils et mes lunes, which I have tried to annotate to the best of my knowledge.

Figure 2: First stanza from Akhenaton's Mes soleils et mes lunes

Surprisingly, both Haider and Kuhn (2018) and Addanki and Wu (2013) explicitly test their methods on hip-hop corpora. They interpret them as normal poems, extract the rhyme words, and classify them line by line. I would be curious what these methods would yield if they were fed non-rhyming text passages. For me, the ability of an algorithm to distinguish rhyming from non-rhyming texts is one of the crucial tests for its suitability. We do not need approaches that confirm what we already know.

Ultimately, we hope to find methods for rhyme detection that could actively help us to learn something about the difference between conscious rhyming and word similarities that arise by chance. But, given the huge differences in rhyming practice across languages and cultures, it is not clear if we will ever arrive at this point.

References

Addanki, Karteek and Wu, Dekai (2013) Unsupervised rhyme scheme identification in Hip Hop lyrics using Hidden Markov Models. In: Statistical Language and Speech Processing, pp. 39-50.

Behr, Wolfgang (2008) Reimende Bronzeinschriften und die Entstehung der Chinesischen Endreimdichtung. Bochum: Projekt Verlag.

Haider, Thomas and Kuhn, Jonas (2018) Supervised rhyme detection with Siamese recurrent networks. In: Proceedings of Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature, pp. 81-86.

Jakobson, Roman (1958) Typological studies and their contribution to historical comparative linguistics. In: Proceedings of the Eighth International Congress of Linguistics, pp. 17-35.

Plecháč, Petr (2018) A collocation-driven method of discovering rhymes (in Czech, English, and French poetry). In: Masako Fidler and Václav Cvrček (eds.) Taming the Corpus: From Inflection and Lexis to Interpretation. Cham: Springer, pp. 79-95.

Annotating rhymes in texts (From rhymes to networks 3)


Having discussed some general aspects of rhyming in a couple of different languages in last month's blog post, I devote this third post in the series to the question of how rhyme can be annotated. Annotation plays a crucial role in almost all fields of linguistics. The main idea is to add value to a given resource (Milà-Garcia 2018). What value we add to resources can differ widely, but as far as textual resources are concerned, we can say that the information that we add can usually not be extracted automatically from the resource.

In our case, the information we want to explicitly add to rhyme texts or rhyme corpora is the rhyme relations between words. Retrieving this information may be trivial, as in the case of Shakespeare's Sonnets, where we know the rhyme schema in advance, but it is considerably complicated when working with other, less strict types of rhyming.

One usually distinguishes two basic types of annotation: inline and stand-off (Eckart 2012). For inline annotation, we add our information directly into our textual resource, while stand-off annotation creates an index over the resource, and then adds the information in a separate resource that refers to the index of the original text.
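The difference between the two types can be sketched in a few lines of Python. In the sketch, the text itself is never touched: it is only indexed by whitespace tokens, and the rhyme labels live in a separate list that points to this index (stand-off). An inline view is then generated from both. The class RhymeAnnotation, the two functions, and the inline notation with a label in square brackets before the word are assumptions chosen only for this illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RhymeAnnotation:
    token_index: int  # position in the token index of the untouched text
    label: str        # label shared by all words that rhyme with each other


def build_index(text: str) -> list[str]:
    """Index a text by its whitespace-separated tokens without modifying it."""
    return text.split()


def render_inline(tokens: list[str], annotations: list[RhymeAnnotation]) -> str:
    """Create an inline view by writing each label directly before its token."""
    labels = {a.token_index: a.label for a in annotations}
    return " ".join(
        f"[{labels[i]}]{token}" if i in labels else token
        for i, token in enumerate(tokens)
    )


text = "Und um ihn die Großen der Krone, Und rings auf hohem Balkone"
tokens = build_index(text)

# Stand-off annotation: a separate resource that refers to the index.
standoff = [RhymeAnnotation(6, "c"), RhymeAnnotation(11, "c")]

print(render_inline(tokens, standoff))
```

The sketch also shows why stand-off annotation is slower to produce by hand: the annotator has to count token positions, whereas the inline view can be typed directly into the text.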

Both methods have their pros and cons. Stand-off annotation often seems to provide a cleaner solution (as one never knows how much a manual annotation added into a text might modify the text involuntarily). However, inline annotation has, in my experience, the advantage of allowing for a much faster annotation process, at least as long as the annotation has to be done in text files directly, without interfaces that could assist in the annotation process.

Overview of existing annotation practice

If we look at different practices that have been used to annotate rhymes in collections of poetry, we will find quite a variety of techniques that have been used so far.

Wáng (1980), for example, uses an inline annotation style in his corpus of the rhymes in the Book of Odes, as illustrated in the following example taken from List et al. (2019). In this annotation, rhyme words are indirectly annotated by providing reconstructed readings for the Chinese characters, which are supposed to approximate the original pronunciation. Whenever two rhyme words share the same main vowel, the author would have judged them to have rhymed in the original text.

Annotation in Wáng (1980)

Baxter (1992) uses a stand-off annotation, which is shown (again taken from List et al. 2019) in the following table. An advantage of Baxter's annotation is that it allows him to provide multiple layers of information for each rhyme word. A disadvantage is that a clear index to the words in the poem is lacking. While this is not entirely problematic, since it is usually easy to identify which words are in rhyme position, it is not entirely "safe", from an annotation point-of-view, as it may still create ambiguities.

Annotation in Baxter (1992)

In a study of automated rhyme word detection, Haider and Kuhn (2018) use annotated rhyme datasets from a variety of German styles (Hip Hop, contemporary lyrics, and more ancient lyrics). To annotate the data, they use the standard format of the Text Encoding Initiative, which is based essentially on XML. Unfortunately, however, they do not provide tags for each word that rhymes, but instead only add an attribute to each stanza, indicating the rhyme schema, as can be seen in the example below:
<lg rhyme="aabccb" type="stanza">
<l>Vor seinem Löwengarten,</l>
<l>Das Kampfspiel zu erwarten,</l>
<l>Saß König Franz,</l>
<l>Und um ihn die Großen der Krone,</l>
<l>Und rings auf hohem Balkone</l>
<l>Die Damen in schönem Kranz.</l>
</lg>
The drawback of this annotation style is that it places the annotation where it does not belong, assuming that a poem only rhymes the words that appear at the end of a line, and that there are no exceptions.
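What can be recovered from such a stanza-level attribute is easy to show with Python's standard library. The following sketch parses the stanza above and pairs each letter of the rhyme attribute with the last word of the corresponding line. It assumes a bare fragment without XML namespaces, as in the excerpt (complete TEI files usually declare a namespace, which would require adjusting the tag lookup), and the function name line_final_rhymes is hypothetical.

```python
import string
import xml.etree.ElementTree as ET

TEI_STANZA = """<lg rhyme="aabccb" type="stanza">
<l>Vor seinem Löwengarten,</l>
<l>Das Kampfspiel zu erwarten,</l>
<l>Saß König Franz,</l>
<l>Und um ihn die Großen der Krone,</l>
<l>Und rings auf hohem Balkone</l>
<l>Die Damen in schönem Kranz.</l>
</lg>"""


def line_final_rhymes(xml_text: str) -> list[tuple[str, str]]:
    """Pair each letter of the stanza's rhyme schema with the line-final word."""
    stanza = ET.fromstring(xml_text)
    scheme = stanza.get("rhyme", "")
    lines = [line.text or "" for line in stanza.findall("l")]
    if len(scheme) != len(lines):
        raise ValueError("rhyme schema and number of lines differ")
    pairs = []
    for label, line in zip(scheme, lines):
        words = line.split()
        last = words[-1].strip(string.punctuation) if words else ""
        pairs.append((label, last))
    return pairs


for label, word in line_final_rhymes(TEI_STANZA):
    print(label, word)
```

The pairing works only because of the implicit convention that the n-th letter of the schema belongs to the last word of the n-th line. Nothing in the markup itself says which word, or which part of a word, carries the rhyme.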

For French, I found an interesting website called métrique en ligne, offering a large number of phonetically analyzed texts in French. They offer a rhyme analysis in an interactive fashion: one can have a look at a poem in raw form and then see which parts of the words appear in rhyme relation. A screenshot of the website (with the poem "Les Phares" from Charles Baudelaire) illustrates this annotation:



It is very nice that the project offers the rhyme annotation in such a clear form, annotating explicitly those parts of the words (albeit in orthography) that are supposed to be responsible for the rhyming. However, the annotation has a clear drawback, in that it provides rhyme annotation only on the level of the stanza, although we know well that quite a few poems have recurring rhymes that are reused across many stanzas, and we would like to acknowledge that in our annotation.

The most complete annotation of poetry I have found so far is "MCFlow: A Digital Corpus of Rap Transcriptions" (Condit-Schultz 2017). The goal of the annotation was not to annotate rhyme in the primary instance, but to provide a corpus that also takes the musical and rhythmic aspects of rap into account. As a result it offers annotations along seven major aspects: rhythm, stress, tone, break, rhyme, pronunciation, and the lyrics themselves. The rhyme annotation itself is provided for each syllable (the texts themselves are all syllabified), with capital letters indicating stressed, and lower case letters indicating unstressed syllables. Rhyme units (usually, but not necessarily words) are marked by brackets. The following figure from Condit-Schultz (2017) illustrates this schema.

Annotation of rhymes by Condit-Schultz (2017)

What I do not entirely understand is the motivation for marking unstressed syllables with the lowercase version of the letter that is used for the stressed syllables in a rhyme sequence. Given that the information about stress is generally available from the annotation, it seems redundant to add it, and it is not clear to me what purpose it serves, especially because unstressed syllables do not necessarily rhyme in rhyme sequences. But apart from this, I find the information that this annotation schema provides quite convincing, although I find the format difficult to parse computationally, and I also imagine that it is quite difficult to annotate manually.

Initial reflections on rhyme annotation

When dealing with annotation schemas and trying to develop a framework for annotation, it is always useful to recall the Zen of Python, especially the first seven lines:
  • Beautiful is better than ugly.
  • Explicit is better than implicit.
  • Simple is better than complex.
  • Complex is better than complicated.
  • Flat is better than nested.
  • Sparse is better than dense.
  • Readability counts.
What I think we can extract from these seven lines are the following basic rules for an initial annotation schema for rhyme data.
  • First, ideally, we want an annotation schema that gives us the same look and feel that we know when reading a poem. This does not mean we need to store the full annotation in this schema, but for a quick editing of rhyme relations, such an annotation schema has many advantages.
  • Second, in order to maintain explicitness, all rhymes should be treated as rhyming globally inside a poem — we should never restrict annotation of rhymes to a single stanza, and we should also avoid brackets to mark rhyming sequences, as there are other ways to assign words to units.
  • Third, we should be explicit enough to show which parts of a word rhyme but, for now, I think it is not necessary to annotate all syllables at the same time. Since this would cost a lot of time, and specifically since syllabification differs from language to language, it seems better to add this information later on a language-specific basis, semi-automatically. Since many words repeat across poems, it is much easier to build a lookup table for syllabification from a corpus that has already been assembled than to add the information while preparing each poem.

Towards a Standardized Annotation of Rhyme Data

Last year, we proposed an annotation schema for rhyme annotation (List et al. 2019). Our basic idea was inspired by tabular formats. These are used in linguistic software packages dealing with problems in computational historical linguistics, such as LingPy. They are also used as the backbone of the Cross-Linguistic Data Formats Initiative (Forkel et al. 2018), which uses tabular formats in combination with metadata in order to render linguistic datasets (wordlists, information on structural features) cross-linguistically comparable. Essentially, the format can be seen as a stand-off annotation, where the original data are not modified directly. While our basic format was rather powerful with respect to what can be annotated, it is also very difficult to code data in this format, at least in the absence of a proper annotation tool.

At the same time, to ease the initial preparation of annotated rhyme data conforming to these standards, we proposed an intermediate format, in which a poem was provided just in text form, with minimal markup for metadata, and in which rhymes could be annotated inline. As an example, consider the first two stanzas of the poem "Morning has broken" by Eleanor Farjeon (1881-1965):
@ANNOTATOR: Mattis
@CREATED: 2020-06-26 06:09:04
@TITLE: Morning has broken
@AUTHOR: Eleanor Farjeon
@BIODATE: 1881-1965
@YEAR: before 1965
@MODIFIED: 2020-06-26 06:09:46
@LANGUAGE: English

Morning has [a]broken like the first morning
Blackbird has [a]spoken like the first [b]bird
Praise for the [c]singing, praise for the morning
Praise for them [c]springing fresh from the [b]Word

Sweet the rain's [e]new_[f]fall, sunlit from heaven
Like the first [e]dew_[f]fall on the first [g]grass
Praise for the [d]sweet[h]ness of the wet garden
Sprung in com[d]plete[h]ness where His feet [g]pass
As you can see from this example, we start with some metadata (which is more or less free-form, following the pattern @key: value), and then render the stanzas line by line, separating stanzas by one blank line. Rhymes are annotated by enclosing rhyme labels in square brackets before the part of the word responsible for the rhyme. If wanted, one can annotate rhymes for each syllable, as done in the rhyme words [d]sweet[h]ness and com[d]plete[h]ness, but one can also annotate the rhyme as a whole, as done in the rhyme words [a]broken and [a]spoken.

In order to assign words to rhyme units, an underscore can be used to indicate that two orthographic words are perceived as one unit in the rhyme, which is the case for [e]new_[f]fall rhyming with [e]dew_[f]fall. Furthermore, if a stanza reappears throughout a poem or song in the form of a refrain, this can be indicated by adding two spaces before all lines of the stanza.

Comments can be added by beginning a line with the hash symbol #, as shown in this small excerpt of Bob Dylan's "Sad-Eyed Lady of the Lowlands".
# [Verse 1]
With your mercury mouth in the missionary [c]times
And your eyes like smoke and your prayers like [c]rhymes
And your silver cross, and your voice like [c]chimes
Oh, who do they think could [i]bury_[j]you?
With your pockets well protected at [e]last
And your streetcar visions which ya' place on the [e]grass
And your flesh like silk, and your face like [e]glass
Who could they get to [i]carry_[j]you?

# [Chorus]
Sad-eyed lady of the lowlands
Where the sad-eyed prophet say that no man [a]comes
My warehouse eyes, my Arabian [a]drums
Should I put them by your [b]gate
Or, sad-eyed lady, should I [b]wait?
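
The file-level conventions described so far (metadata lines, comments, blank lines between stanzas, two leading spaces for refrains) are simple enough to read with a few lines of Python. The following is only a minimal sketch under those assumptions, not the code used for the actual database; the function name parse_poem and the returned dictionary keys are hypothetical.

```python
def parse_poem(text):
    """Split an inline-annotated poem into metadata, comments, and stanzas."""
    metadata, comments, stanzas = {}, [], []
    current = []  # lines of the stanza currently being collected
    for raw in text.splitlines():
        if raw.startswith("@"):
            # metadata line of the form "@KEY: value" (split at the first colon)
            key, _, value = raw[1:].partition(":")
            metadata[key.strip()] = value.strip()
        elif raw.startswith("#"):
            # comment line
            comments.append(raw[1:].strip())
        elif not raw.strip():
            # a blank line closes the current stanza, if there is one
            if current:
                stanzas.append(current)
                current = []
        else:
            current.append(raw)
    if current:
        stanzas.append(current)
    # assumption: a stanza is a refrain if all of its lines start with two spaces
    return metadata, comments, [
        {
            "refrain": all(line.startswith("  ") for line in stanza),
            "lines": [line.strip() for line in stanza],
        }
        for stanza in stanzas
    ]
```

Calling parse_poem on the text of one of the examples above returns the metadata as a dictionary, the comments as a list, and one entry per stanza, with the rhyme tags still embedded in the lines.
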
When testing this framework on many different kinds of poems from different languages and styles, I realized that the greedy rhyme annotation that I used (you place the rhyme tag before a word, and all letters that follow will be considered to belong to that very rhyme tag) has a disadvantage in those situations where syllables in multi-syllabic rhyme units essentially do not rhyme. As an example, consider the following lines from Eminem's "Not Afraid":
I'ma be what I set out to be, 
without a doubt, undoubtedly
And all those who look down on me,
I'm tearin' down your balcony
Here, the author plays with rhymes centering around the words out to be, undoubtedly, down on me, and balcony. Condit-Schultz has annotated the rhymes as follows (I use the rhyme schema inline for simplicity):
I'ma D|be what I set (C|out c|to D|be), 
wi(C|thout c|a) (C|doubt, c|un)(C|doub.c|ted.D|ly)
And all those who look (C|down c|on D|me),
I'm tearin' C|down your (C|bal.c|co.D|ny)
In my opinion, however, the parts annotated with c by Condit-Schultz do not really rhyme in these lines. They are mere fillers for the rhythm, while the most important rhyme parts, which are also perceived as such, are the stressed syllables with the main vowel ou. To mark that a syllable is not really rhyming, but also to mark the border of a rhyme (and thus to indicate that only the first syllable of a word rhymes with another word), I therefore decided to introduce a specific "empty" rhyme symbol, which is now represented by a plus. My annotation of the lines thus looks as follows:
I'ma be what I set [h]out_[+]to_[e]be, 
wi[h]thout a [h]doubt, un[h]doub[+]ted[e]ly
And all those who look [h]down_[d]on_[e]me
I'm tearin' down your bal[d]co[e]ny
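
To make the greedy rule and the empty symbol concrete, here is a small sketch of how the tagged word parts of a single line could be extracted. It is an illustration under my reading of the format, not the parser behind the tool: the name rhyme_parts is hypothetical, and stripping punctuation and underscores from the extracted parts is an assumption made for readability.

```python
import re

# a rhyme tag is a label in square brackets, e.g. [a] or [+]
TAG = re.compile(r"\[([^\]]+)\]")


def rhyme_parts(line):
    """Return (label, text) pairs for every annotated word part in a line."""
    parts = []
    # words joined by an underscore contain no space and stay in one token
    for token in line.split():
        # with one capture group, re.split yields
        # [prefix, label, text, label, text, ...]
        pieces = TAG.split(token)
        for label, chunk in zip(pieces[1::2], pieces[2::2]):
            if label == "+":
                # "empty" rhyme symbol: marks a border, not a rhyme
                continue
            # greedy rule: everything up to the next tag belongs to this label
            parts.append((label, chunk.strip("_.,;:!?")))
    return parts
```

For the line "I'ma be what I set [h]out_[+]to_[e]be," this yields the pairs (h, out) and (e, be), while the filler syllable marked with the plus is skipped.
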

An Interactive Tool for Rhyme Annotation

While I now consider the inline-annotation format rather complete (with all limitations resulting from inline annotation), I realized, when trying to annotate poems using the format, that it is not fun to edit text files in this way. I am not talking about small edits, like one stanza, or typing in some metadata. Annotating a whole rap song, however, can become very tedious and even problematic, as one may easily forget which rhyme tags one has already used, overlook which words have been annotated as rhyming, or forget brackets and the like.

As a result, I decided to write an interactive rhyme annotation tool that supports the inline-annotation format and can be edited both in the text and interactively at the same time. This is a bit similar to the text processing programs in blogging software, which allow writing both in the HTML source and in a more convenient version that shows you what you will get.

The following screenshot from the database, for example, shows how the rhymes in Shakespeare's Sonnet Number 98 are visually rendered.

Visual display of Shakespeare's Sonnet 98

This tool is now already available online. I call it RhyAnT, which is short for Rhyme Annotation Tool. I have been using it in combination with a small server, to populate a first database with rhymes in different languages, which already contains more than 350 annotated poems. This database can be accessed and inspected by everybody interested, at AntRhyme; but copyrighted texts from modern songs can — unfortunately — not be rendered yet (as I am not sure how many I would be allowed to share).

I do not want to claim that I am gifted as a designer (I am surely not), and it is possible that there are better ways to implement the whole interface. However, I find it important to note that the format itself, with the coloring of rhyme words, has dramatically increased my efficiency at annotating rhyme data, and also my accuracy in spotting similarities.

Annotating the same poem with RhyAnT, the interactive rhyme annotator

The above screenshot shows how I can edit the poem through my edit access to the database. Alternatively, one can just paste the text into the publicly accessible interface of the RhyAnT tool, edit the data there, and then copy-paste the result to store it. In this form, the interface can already be used by anybody who wants to annotate rhymes in their work.

Outlook

The current annotation framework that I have illustrated here is not almighty, specifically because it does not allow for multi-layered annotation (Bański and Witt 2019: 230f), which would allow us to add pronunciation, rhythm, and many aspects other than rhyming alone. However, I hope that many of these aspects can be added quickly later, by creating lookup tables and processing the annotated corpus automatically. Following the Zen of Python, this seems to be much simpler than investing a lot of time in the creation of a highly annotated dataset that would discourage working with the data from the beginning.

References

Bański, Piotr and Witt, Andreas (2019) Modeling and annotating complex data structures. In: Julia Flanders and Fotis Jannidis (eds) The Shape of Data in the Digital Humanities: Modeling Texts and Text-based Resources. Oxford and New York: Routledge, pp. 217-235.

Baxter, William H. (1992) A Handbook of Old Chinese Phonology. Berlin: de Gruyter.

Condit-Schultz, Nathaniel (2017) MCFlow: A Digital Corpus of Rap Transcriptions. Empirical Musicology Review 11.2: 124-147.

Eckart, Kerstin (2012) Resource annotations. In: Clarin-D, AP 5 (ed.) Berlin: DWDS, pp. 30-42.

Forkel, Robert and List, Johann-Mattis and Greenhill, Simon J. and Rzymski, Christoph and Bank, Sebastian and Cysouw, Michael and Hammarström, Harald and Haspelmath, Martin and Kaiping, Gereon A. and Gray, Russell D. (2018) Cross-Linguistic Data Formats, advancing data sharing and re-use in comparative linguistics. Scientific Data 5.180205: 1-10.

Haider, Thomas and Kuhn, Jonas (2018) Supervised rhyme detection with Siamese recurrent networks. In: Proceedings of Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature, pp. 81-86.

List, Johann-Mattis and Hill, Nathan W. and Foster, Christopher J. (2019) Towards a standardized annotation of rhyme judgments in Chinese historical phonology (and beyond). Journal of Language Relationship 17.1: 26-43.

Milà‐Garcia, Alba (2018) Pragmatic annotation for a multi-layered analysis of speech acts: a methodological proposal. Corpus Pragmatics 2.1: 265-287.

Wáng, Lì 王力 (2006) Hànyǔ shǐgǎo 漢語史稿 [History of the Chinese language]. Běijīng 北京:Zhōnghuá Shūjú 中华书局.

From rhymes to networks (A new blog series in six steps)


Whenever one feels stuck in solving a particular problem, it is useful to split this problem into parts, in order to identify exactly where the problems are. The problem that is vexing me at the moment is how to construct a network of rhymes from a set of annotated poems, either by one and the same author, or by many authors who wrote during the same epoch in a certain country using a certain language.

For me, a rhyme network is a network in which words (or parts of words) occur as nodes, and weighted links between the nodes indicate how often the linked words have been found to rhyme in a given corpus.

An example

As an example, the following figure illustrates this idea for the case of two Chinese poems, where the rhyme words represented by Chinese characters are linked to form a network (taken from List 2016).


Figure 1: Constructing a network of rhymes in Chinese poetry (List 2016)

One may think that it is silly to make a network from rhymes. However, experiments on Chinese rhyme networks (on which I have reported in the past) have proven to be quite interesting, specifically because they almost always show one large connected component. I find this fascinating, since I would have expected that we would see multiple connected components, representing very distinct rhymes.

It is obvious that some writers don't have a good feeling for rhymes and fail royally when they try to do it — this happens across all languages and cultures in which rhyming plays a role. However, it was much less obvious to me that rhyming can be seen to form at least some kind of a continuum, as you can see from the rhyme networks that we have constructed from Chinese poetry (again) in the past (taken from List et al. 2017).


Figure 2: A complete rhyme network of poems in the Book of Odes (ca. 1000 BC, List et al. 2017)

The current problem

My problem now is that I do not know how to do the same for rhyme collections in other languages. During recent months, I have thought a lot about the problem of constructing rhyme networks for languages such as English or German. However, I always came to a point where I felt stuck, where I realized that I actually did not know at all how to deal with this.

I thought, first, that I could write one blog post listing the problems; but the more I thought about it, I realized that there were so many problems that I could barely do it in one blogpost. So, I decided then that I could just do another series of blog posts (after the nice experience from the series on open problems in computational historical linguistics I posted last year), but this time devoted solely to the question of how one can get from rhymes into networks.

So for the next six months, I will discuss the four major issues that keep me from presenting German or English rhyme networks here and now. I hope that at the end of this discussion I may even have solved the problem, so that I will then be able to present a first rhyme network of Goethe, Shakespeare, or Bob Dylan. (I would not do Eminem, as the rhymes are quite complex, and tedious to annotate).

Summary of the series

Before we can start to think about the modeling of rhyme patterns in rhymed verse, we need to think about the problem in general, and discuss how rhyming shows up in different languages. So, I will start the series with the problem of rhyming in general, by discussing how languages rhyme, where these practices differ, and what we can learn from these differences. Having looked into this, we can think about ways of annotating rhymes in texts in order to acquire a first corpus of examples. So, the following post will deal with the problems that we encounter when trying to annotate the rhyme words that we identify in poetry collections.

If one knows how to annotate something, one will sooner or later get impatient, and long for faster ways to do these boring tasks. Since this also holds for the manual annotation of rhyme collections (which we need for our rhyme networks), it is obvious to think about automated ways of finding rhymes in corpora — that is, to think about the inference of rhyme patterns, which can also be done semi-automatically, of course. So the major problems related to automated rhyme detection will be discussed in a separate post.

Once this is worked out, and one has a reasonably large corpus of rhyme patterns, one wants to analyze it, and the way I want to analyze annotated rhyme corpora is with the help of network models. But, as I mentioned before, I realized that I was stuck when I started to think about rhyme networks of German and English (which are relatively easy languages, one should think). So, it will be important to discuss clearly what seems to be the best way to construct rhyme networks as a first step of analysis. This will therefore be dealt with in a separate blog post. In a final post, I then plan to tackle the second analysis step, by discussing very briefly what one can do with rhyme networks.

All in all, this makes for six posts (including this one); so we will be busy for the next six months, thinking about rhymes and poetry, which is probably not the worst thing one can do. I hope, but I cannot promise at this point, that this gives me enough time to stick to my ambitious annotation goals, and then present you with a real rhyme network of some poetry collection, other than the Chinese ones I already published in the past.

References

List, Johann-Mattis, Pathmanathan, Jananan Sylvestre, Hill, Nathan W., Bapteste, Eric, Lopez, Philippe (2017) Vowel purity and rhyme evidence in Old Chinese reconstruction. Lingua Sinica 3.1: 1-17.

List, Johann-Mattis (2016) Using network models to analyze Old Chinese rhyme data. Bulletin of Chinese Linguistics 9.2: 218-241.

Haikus generated based on your map location and OpenStreetMap data

Satellite Studio made a map thing that generates haikus based on OpenStreetMap data and your location. From the announcement:

[W]e automated making haikus about places. Looking at every aspect of the surroundings of a point, we can generate a poem about any place in the world. The result is sometimes fun, often weird, most of the time pretty terrible. Also probably horrifying for haiku purists (sorry).

This is pretty great. It’s neat how the poems generate on the fly.


Networks in Chinese poetry


Structure in Poetry

Dealing with poetry is a dangerous topic in science, since we never know whether the structures we propose are really there or not. Once it comes to the search of structure in poetry, Matthew and Luke were right, since the ones who search will find, provided they have enough creativity.

When I had Latin lessons in school, some of my classmates were incredibly diligent in trying to find alliterations (instances in which words in a sentence start with the same letter) in Cicero's speeches. This was less out of interest in the structure of the speeches, but more an attempt to divert the teacher's attention away from translation.

The problem with structure in poetry is that we never know in the end whether the people who created the poetry did things with purpose or not. Consider, for example, the following lines of a famous verse:


Apart from the fact that people might disagree whether songs by Eminem are poetry, it is interesting to look at the structures one may (or may not) detect. We know that rap and hip hop allow for rather loose rhyming schemes, which may give the impression that they were produced in an ad-hoc manner. We also know that the question of what counts as a rhyme is strictly cultural. In German, for example, employ could rhyme with supply (thanks to Goethe and other poets who would superimpose onto the standard language rhyme patterns that made sense in their home dialect). If I were given Eminem's poem in an exam, I would mark its rhyming structure as follows:


I do not know whether any teacher of English would agree that music can rhyme with own it, but if Germans can rhyme [ai] (as in supply) with [ɔi] (as in employ), why not allow [ɪk] (as in music) to rhyme with [ɪt] (as in own it)? I bet that if one made an investigation of all rhymes that Bob Dylan has produced so far, we would find at least a few instances where he would tolerate Eminem's rhyme pattern.

The point here is that rhymes are important evidence to infer how Ancient Chinese was pronounced.

The Pronunciation of Ancient Chinese

The Chinese writing system gives only minimal hints regarding the pronunciation of the characters. If one writes a character like 日 which means 'sun', the writing system gives us no clue as to its pronunciation; and from the modern form in which the character is written, it is also difficult to see the image of a sun in the character. Thus, the current situation in Chinese linguistics is that we have very ancient texts, dating at times back to 1000 BC, but we do not have a real clue as to how the language was pronounced by then.

That it was pronounced differently is clear from ancient Chinese poetry itself. When reading ancient poems with modern pronunciations, one often finds rhyme patterns which do not sound nice. Consider the poem from Ode 28 of the Book of Odes (Shījīng 詩經), an ancient collection of poems written between 1050 and 600 BC (translation from Karlgren 1950):


Here, we find modern rhymes between fēi and guī, which is fine, since the transliteration fails to give the real pronunciation, which is [fəi] versus [kuəi]; but we also find [in] rhyming with [nan], which is so strange (due to the strong difference in the vowels) that even Bob Dylan and Eminem probably would not tolerate it. But if we do not tolerate this rhyming pattern, and if we do not want to assume that the ancient masters of Chinese poetry would simply fail in rhyming, we need to search for some explanation as to why the words do not rhyme. The explanation is, of course, language evolution: the sound systems of languages constantly change, and if things do not rhyme with our modern pronunciation, they may have been perfect rhymes when they were originally created.

When Chinese scholars of the 16th century, who investigated their ancient poetry, became aware of this, they realized that the poetry could be a clue to reconstruct the ancient pronunciation of their language. Then they began to investigate the ancient poems of the Book of Odes systematically for their rhyme patterns. It is thanks to this work on early linguistic reconstruction by Chinese scholars, that we now have a rather clear picture of how Ancient Chinese was pronounced (see especially Baxter 1992, Sagart 1999, and Baxter and Sagart 2014).

Networks in Chinese Rhyme Patterns

But where are the networks in Chinese poetry, which I promised in the title of this post? They are in the rhyme patterns: it is rather straightforward to model rhyme patterns in poetry with the help of networks. Every node is a distinct word that rhymes in at least one poem with another word. Links between nodes are created whenever one word rhymes with another word in a given stanza of a poem. So, even if we take only two stanzas of two poems of the Book of Odes, we can already create a small network of rhyme transitions, as illustrated in the following figure:


One needs, of course, to be careful when modeling this kind of data, since specific kinds of normalizations are needed to avoid exaggerating the weight assigned to specific rhyme connections. It is possible that poets just used a certain rhyme pattern because they found it somewhere else. It is also not yet entirely clear to me how to best normalize those cases in which more than two words rhyme with each other in the same stanza.
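
To illustrate the construction and the normalization question, here is a minimal sketch in Python. It assumes that each stanza is given as a list of rhyme groups (lists of words judged to rhyme with each other). The normalization shown, where every pair in a group of n words receives a weight of 2/n, is only one possible choice that I make here for illustration, not an established solution; the function name and the example words are hypothetical as well.

```python
from collections import Counter
from itertools import combinations


def rhyme_network(stanzas, normalize=True):
    """Count how often two words rhyme; `stanzas` is a list of rhyme-group lists."""
    edges = Counter()
    for groups in stanzas:
        for group in groups:
            # sort so that each pair is always stored in the same order
            words = sorted(set(group))
            if len(words) < 2:
                continue
            # assumption: each pair in a group of n words gets weight 2 / n,
            # so a group contributes n - 1 in total and large groups are
            # not inflated; without normalization every pair counts as 1
            weight = 2 / len(words) if normalize else 1.0
            for a, b in combinations(words, 2):
                edges[a, b] += weight
    return edges


# illustrative data: three stanzas with their rhyme groups
stanzas = [
    [["broken", "spoken"], ["bird", "word"]],
    [["times", "rhymes", "chimes"]],
    [["broken", "spoken"]],
]
network = rhyme_network(stanzas)
```

With this choice, a pair of words that rhymes in two different stanzas receives a weight of 2, while each of the three pairs in a group of three words receives a weight of 2/3.
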

But apart from these rather technical questions, it is quite interesting to look at the patterns that evolve from collecting rhyme patterns of all poems found in the Book of Odes, and plotting them in a network. I prepared such a dataset, using the rhyme assessments by Baxter (1992). The whole data set is now available in the form of an interactive web-application at http://digling.org/shijing.

In this application, one can browse all characters that appear in potential rhyme positions in all 305 poems that constitute the Book of Odes. Additional meta-data, like reconstructions for the old pronunciations following Baxter and Sagart (2014), which were kindly provided by L. Sagart, have also been added. The core of the app is the "Poem View", by which one can see a poem, along with reconstructions for the rhyme words, and an explicit account of what experts think rhymed in the classical period, and what they think did not rhyme. The following image gives a screenshot of the second poem of the Book of Odes:



But let's now have a look at the big picture of the network we get when taking all words that rhyme into account. The following image was created with Cytoscape:



As we can see, the rhyme words in the 305 poems almost constitute a small world network, and we have a very large connected component. For me, this was quite surprising, since I was assuming that rhyme patterns would be more distinct. It would be very interesting to see a network of the works of Shakespeare or Goethe, and to compare the amount of connectivity.
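
How connected such a network is can be checked without special software. The sketch below finds the connected components of an undirected network given as a list of word pairs, using only the Python standard library. The function name and the toy edges are hypothetical, and this is not how the Cytoscape image was produced.

```python
from collections import defaultdict


def connected_components(edges):
    """Return the connected components of an undirected graph, largest first."""
    neighbours = defaultdict(set)
    for a, b in edges:
        neighbours[a].add(b)
        neighbours[b].add(a)
    seen, components = set(), []
    for start in neighbours:
        if start in seen:
            continue
        # depth-first search from every node that has not been visited yet
        seen.add(start)
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            component.add(node)
            for other in neighbours[node] - seen:
                seen.add(other)
                stack.append(other)
        components.append(component)
    return sorted(components, key=len, reverse=True)


# toy example: five words, two components
edges = [("a", "b"), ("b", "c"), ("d", "e")]
components = connected_components(edges)
# share of all nodes that sit in the largest component
share = len(components[0]) / sum(len(c) for c in components)
```

The share of nodes in the largest component gives a simple number for comparing the connectivity of different rhyme collections.
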

There are, of course, many things we can do to analyze this network of Chinese poetry, and I am currently trying to find out to what degree this may contribute to the reconstruction of the pronunciation of Ancient Chinese. But since this work is all in a preliminary stage, I will restrict this post to showing how the big network looks if we color the nodes in six different colors, based on which of the six main vowels ([a, e, i, o, u, ə]) scholars usually reconstruct in the rhyme word for Ancient Chinese:



As can be seen, even this simple annotation shows how interesting structures emerge, and how we see more than before.

Many more things can be done with this kind of data. This is for sure. We could compare the rhyme networks of different poets, maybe even the networks of one and the same poet at different stages of their life, asking questions like: "do people rhyme more sloppily, the older they get?" It's a pity that we don't have the data for this, since we lack automatic approaches to detect rhyme words in text, and there are no manual annotations of poem collections apart from the Book of Odes that I know of.

But maybe, one day, we can use networks to study the dynamics underlying the evolution of literature. We could trace the emergence of rap and hip hop, or the impact of the "Judas!"-call on Dylan's rhyme patterns, or the loss of structure in modern poetry. But that's music from the future, of course.

References
  • Baxter, William H. (1992) A handbook of Old Chinese phonology. Berlin: De Gruyter.
  • Baxter, William H. and Sagart, Laurent (2014) Old Chinese. A new reconstruction. Oxford: Oxford University Press.
  • Karlgren, Bernhard (1950) The Book of Odes. Stockholm: Museum of Far Eastern Antiquities.
  • Sagart, Laurent (1999) The roots of Old Chinese. Amsterdam: John Benjamins.

Science vs. Poetry

By Sam Illingworth

The estrangement between poetry and science can be traced back to 320 BC, when Aristotle laid down in his treatise Poetica that poetry paints an imaginative picture, whereas physical philosophy (i.e. science)

Astronomy + Poetry from CosmoAcademy

As you know*, we like to mix our science and our poetry. Mike has generously loaned this Philistine the reins to the Sunday Science Poem franchise, which I promptly moved to Tuesday; but I had to move it to Tuesday because I don’t want you to miss out.

CosmoQuest is offering an online course (via Google+ Hangouts) looking at the intersection of astronomy and poetry:

Astronomy has played a role in human culture for thousands of years and appears in literature from every era.  We can see not only the influence of the heavens on our writings, but also the influence of language itself on our conception of astronomy. Heralding the dawn of the International Year of Light in 2015, join us now to explore how light from the stars has been important to humans for millennia.  We will begin with Gilgamesh and Homer, and continue through Shakespeare, Robert Frost, Maya Angelou, and into contemporary music and literature.  Along the way, we will also examine how the structure of language has influenced the perception of astronomical phenomena. – CosmoQuest Academy

The classes start on Monday, 17 November 2014 at 9PM (ET). Sign-ups (cost $99) are open until Monday, but there are only 8 spots left.

HT: Matthew Francis

*Frankly, I’m tired of coddling you newbies**.

**Have we decided on a sarcasm font***?

***I imagine all those exchanges are constantly derailed by people writing, “I think this one really works” in a proposed font, and then wondering, “Do they really like it or are they being sarcastic****?”

****…which may actually be a sign that it is working.



On the underrepresentation of cheese in literature…

GK Chesterton expounds on the poetic nature of cheese and condemns its notable absence from poetry. The essay is well worth reading, and I particularly endorse this line with the proviso that it is applicable to man, woman, or child*:

…nor can I imagine why a man should want more than bread and cheese, if he can get enough of it.
-GK Chesterton

*My four and five-year olds are extremely fond of Stilton, which is how we know they are mine.

Hat tip to Steve Silberman.



Poetry, politics, plagiarism, and erotics add up to a retraction

Here’s a new category for us: Poetry. Comparative Studies of South Asia, Africa and the Middle East, a comparative studies journal, has retracted a paper on gender roles in Middle Eastern poetry due to plagiarism. Nizar Kabbani was a famed Syrian poet who wrote frankly about feminism, love, and sex. He’s well worth a read, […]