General remarks on rhyming (From rhymes to networks 2)


In this month's post, I want to provide some general remarks on rhyming and rhyme practice. I hope that they will help lay the foundations for tackling the problem of rhyme annotation in the next post. Ideally, I should provide a maximally unbiased overview that takes all languages and cultures into account. However, since this would be an impossible task at this time (at least for myself), I hope that I can, instead, look at the phenomenon from a viewpoint that is a bit broader than the naive prescriptive accounts of rhyming with which teachers mentally torture young school kids.

What is a rhyme?

It is not easy to give an exact and exhaustive definition of rhyme. As a starting point, one can have a look at Wikipedia, where we find the following definition:
A rhyme is a repetition of similar sounds (usually, exactly the same sound) in the final stressed syllables and any following syllables of two or more words. Most often, this kind of perfect rhyming is consciously used for effect in the final positions of lines of poems and songs. Wikipedia: s. v. "Rhyme", accessed on 21.05.2020
This definition is a good starting point, but it does not apply to rhyming in general, but rather to rhyming in English as a specific language. While stress, for example, seems to play an important role in English rhyming, we don't find stress being used in a similar way in Chinese, so if we tie a definition of rhyming to stress, we exclude all of those languages in which stress plays a minor role or no role at all.

Furthermore, the notion of similar and identical sounds is also problematic from a cross-linguistic perspective on rhyming. It is true that rhyming requires some degree of similarity of sounds, but where the boundaries are placed, and how the similarity is defined in the end, can differ from language to language and from tradition to tradition. Thus, while in German poetry it is fine to rhyme words like Mai [mai] and neu [noi], it is questionable whether English speakers would ever think that words like joy could form a rhyme with rye. Irish seems to be an extreme case of very complex rules underlying what counts as a rhyme, where consonants are clustered into certain classes (b, d, g, or ph, f, th, ch) that are defined to rhyme with each other (provided the vowels also rhyme), and as a result, words like oba and foda are judged to be good rhymes (Ó Cuív 1966).

When looking at philological descriptions of rhyme traditions in individual languages, we often find a distinction between perfect rhymes on the one hand and imperfect rhymes on the other. But what counts as perfect or imperfect often differs from language to language. Thus, while French largely accepts the rhyming of words that sound identical, this is considered less satisfactory in English and German, and studies seem to have confirmed that speakers of French and English indeed differ in their intuitions about rhyme in this regard (Wagner and McCurdy 2010).

Peust (2014) discusses rhyme practices across several languages and epochs, suggesting that similarity in rhyming is based on some sort of rhyme phonology that would account for the differences in rhyme judgments across languages. Just as the ordinary phonology of a language is the classical device in linguistics for determining which sounds are perceived as distinctive in a given language, a rhyme phonology would do the same for rhyming in individual languages.

While this idea has some appeal at first sight, given that the differences in rhyme practice across languages often follow very specific rules, I am afraid it may be too restrictive. Instead, I prefer to see rhyming as a continuum, in which a well-defined core of perfect rhymes is surrounded by various instances of less perfect rhymes, with language-specific patterns of variation that still have to be compared in detail.

Beyond perfection

If we accept that all languages have some notion of a perfect rhyme that they distinguish from less perfect rhymes, which will, nevertheless, still be accepted as rhymes, it is useful to have a quick look at differences in the deviations from the perfect. German, for example, is often cited as a language in which vowel differences in rhymes are treated rather loosely; and, indeed, we find that diphthongs like the above-mentioned [ai] and [oi] are perceived as rhyming well by most German speakers. In popular songs, however, we find additional deviations from the perceived norm, which are usually not discussed in philological descriptions of German rhyming. Thus, in the famous German Schlager Griechischer Wein by Udo Jürgens (1934-2014), we find the following introductory lines:
Es war schon dunkel, als ich durch Vorstadtstrassen heimwärts ging.
Da war ein Wirtshaus, aus dem das Licht noch auf den Gehsteig schien.
[Translation: It was already dark, when I went through the streets outside of the city. There was a pub which still emitted light that was shining on the street.]
There is no doubt that the artist intended these two lines to rhyme, given that the song follows a strict AABCCB rhyme scheme. So, in this particular case, the artist judged that rhyming ging [gɪŋ] with schien [ʃiːn] would be better than not attempting a rhyme at all, and this shows how difficult it is to assume one strict notion of rhyme phonology that guides all of the decisions that humans make when they create poems.

More extreme cases of permissive rhyming can be found in some traditions of English poetry, including Hip Hop (of course), but also the work of Bob Dylan, who has no problem rhyming time with fine, used with refused, or own with home, as in Like a Rolling Stone. In Spanish, where we also find a distinction between perfect rhyming (rima consonante) and imperfect rhyming (rima asonante), basically all that needs to coincide are the vowels, which allows Silvio Rodríguez to rhyme amor with canción in Te doy una canción.

While most languages agree on the notion of a perfect rhyme (notwithstanding certain differences due to general differences in their phonology), the interesting aspects of rhyming are those where languages allow for imperfection. Given that rhyming seems to reflect, at least to some extent, a general linguistic competence of native speakers, a comparison of rhyming practices across languages and cultures may help to shed light on general questions in linguistics.

Rhyming is linear

When discussing the idea of building annotated rhyme corpora with colleagues, I was repeatedly pointed to the worst cases, which I would supposedly never be able to capture. This is typical for linguists, who tend to see the complexities before they see what is simple, and who often prefer not even to try to tackle a problem before they feel they have understood all the sub-problems that could arise from the potential solution they might want to try.

One of the worst cases, when we developed our first annotation format as presented last year (List et al. 2019), was the problem of intransitive rhyming. The idea behind this is that imperfect rhyming may lead to a situation where one word rhymes with a word that follows, and this again rhymes with a word that follows that, but the first and the third would never really rhyme with each other. We find this clearly stated in Zwicky (1976: 677):
Imperfect rhymes can also be linked in a chain: X is rhymed (imperfectly) with Y, and Y with Z, so that X and Z may count as rhymes thanks to the mediation of Y, even when X and Z satisfy neither the feature nor the subsequence principle.
Intransitive rhyming is, indeed, a problem for annotation, since it would require that we think of very complex annotation schemas in which we assign words to individual rhyme chains instead of just assigning them to the same group of rhymes in a poem or a song. However, one thing that I realized afterwards, and which one should never forget, is that rhyming is linear: rhyming proceeds in a chain. We first hear one line, then we hear another line, and so on, so that each line is a succession of words that we perceive through time.
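
To illustrate the point with a toy example (this is only a sketch, with invented line labels, and not the annotation format of List et al. 2019): if each rhyme is recorded as a link to its closest preceding rhyme partner, intransitive chains come for free, and larger rhyme groups can still be recovered by simply following the links.

```python
# Minimal sketch (invented line labels): each rhyme is recorded as a link to
# its closest preceding rhyme partner, so X-Y and Y-Z can both hold without
# implying that X and Z rhyme directly.
rhyme_links = {
    "line2": "line1",   # Y rhymes (imperfectly) with X
    "line3": "line2",   # Z rhymes with Y, but not necessarily with X
}

def rhyme_chain(line, links):
    """Follow the links backwards to recover the chain a line belongs to."""
    chain = [line]
    while chain[-1] in links:
        chain.append(links[chain[-1]])
    return chain

print(rhyme_chain("line3", rhyme_links))  # ['line3', 'line2', 'line1']
```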

It is just as the famous Ferdinand de Saussure (1857-1913) said about the linguistic sign and its material representation, which can be measured in a single dimension ("c'est une ligne", Saussure 1916: 103). Since we perceive poetry and songs in a linear fashion, we should not be surprised that, when perceiving a rhyme, our attention is mainly on those words that are not too far away from each other in their temporal arrangement.

The same holds for the concrete comparison of words that rhyme: since words are sequences of sounds, the similarity of rhyme words is a similarity of sequences. This means that, when trying to analyze rhyming across different languages and traditions, we can make use of the typical methods for automated and computer-assisted sequence comparison in historical linguistics, which have been developed during the past twenty years (see the overview in List 2014).
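
As a toy illustration of what such a sequence-based comparison might look like (a deliberately crude sketch, not one of the alignment methods discussed in List 2014), one can score rhyme similarity by comparing sound sequences from the end of the word backwards:

```python
# Toy sketch: rhyme similarity as similarity of sound sequences, compared
# from the end of the word backwards. Transcriptions are simplified and
# purely illustrative.
def rhyme_similarity(word_a, word_b, window=3):
    """Compare the last `window` sounds of two words; 1.0 = identical ending."""
    tail_a, tail_b = word_a[-window:], word_b[-window:]
    matches = sum(a == b for a, b in zip(reversed(tail_a), reversed(tail_b)))
    return matches / max(len(tail_a), len(tail_b))

# Juergens: ging / schien scores 0.0 here, although it is clearly intended as
# a rhyme, which is exactly why measures this crude are not enough on their own.
print(rhyme_similarity(["g", "ɪ", "ŋ"], ["ʃ", "iː", "n"]))
# Dylan: time / fine share the stressed vowel and score 0.33.
print(rhyme_similarity(["t", "aɪ", "m"], ["f", "aɪ", "n"]))
```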

Conclusion

When writing this post, I realized that I still feel like I am swimming in an ocean of ignorance when it comes to rhyming and rhyming practices, and how to compare them in a way that takes linguistic aspects into account. I hope that I can make up for this in the follow-up post, where I will introduce my first solutions for a consistent annotation of poetry. By then, I hope, it will also have become clearer why I attach so much importance to the notion of imperfect rhymes and to the linearity of rhyming.

References

Ó Cuív, Brian (1966) The phonetic basis of Classical Modern Irish rhyme. Ériu 20: 94-103.

List, Johann-Mattis (2014) Sequence Comparison in Historical Linguistics. Düsseldorf: Düsseldorf University Press.

List, Johann-Mattis and Nathan W. Hill and Christopher J. Foster (2019) Towards a standardized annotation of rhyme judgments in Chinese historical phonology (and beyond). Journal of Language Relationship 17.1: 26-43.

Peust, Carsten (2014) Parametric variation of end rhyme across languages. In: Grossmann et al. Egyptian-Coptic Linguistics in Typological Perspective. Berlin: Mouton de Gruyter, pp. 341-385.

de Saussure, Ferdinand (1916) Cours de linguistique générale. Lausanne: Payot.

Wagner, M. and McCurdy, K. (2010) Poetic rhyme reflects cross-linguistic differences in information structure. Cognition 117.2: 166-175.

Zwicky, Arnold (1976) Well, this rock and roll has got to stop. Junior's head is hard as a rock. In: Papers from the Twelfth Regional Meeting of the Chicago Linguistic Society, pp. 676-697.

From words to deeds?


If you want to annoy a linguist, there are three easy ways to do so: ask them how many languages they speak; ask them for their opinion regarding the German spelling reform; or ask them whether it is true that the Eskimo language has 50 words for snow. What these three questions have in common is that they all touch upon big issues in linguistics, issues so big that they give us a headache whenever we are reminded of them.

The first question, about a linguist's language skills, touches upon the conviction of quite a few linguists that, in order to practice linguistics, one does not need to study many languages. One language is usually enough; and even if that language is only English, this may also be sufficient (at least according to some fanatics who practice syntax). To put it in different words: knowing only one language does not prevent a linguist from making claims about the evolution of whole language families. Knowing how to describe a language, or how to compare several languages, does not necessarily require anyone to be able to speak them. After all, mathematicians also pride themselves on not being able to calculate.

The second question, regarding the German spelling reform, marks the last time that German linguists failed royally at proving the importance of their studies to the broader public. The problem was that the German spelling reform, the first after some 100 years of linguistic peace, was carried out mostly without any linguistic input. Those who commented on it were, instead, novelists, poets and journalists, usually a bit older in age, who felt that the reform had been proposed mainly in order to annoy them personally. At the same time, and this was maybe no coincidence, more and more institutes for comparative linguistics disappeared from German universities. The reason was again that the field had not succeeded in explaining its importance to the public. However, historical language comparison can, indeed, be important when discussing the reform of a writing system that is used by millions of people, specifically because the investigation of historically evolving linguistic systems is one of the specialties of historical-comparative linguistics. This was completely ignored at the time.

The last question concerns the long-standing debate about the hypothesis commonly attributed to Edward Sapir (1884-1939) and Benjamin Lee Whorf (1897-1941). This says, in its strong form (Whorf 1950), that speaking influences thinking to such an extent that we might, for example, develop a different kind of Relativity Theory in physics if we practiced our science in languages other than English, French, and German. Given that Eskimo languages are said to have some 50 different words for snow (as people keep repeating), it should be clear enough that those speaking an Eskimo language must think completely differently from those of us who are starting to forget what snow is, after all.

The latter concept leads to an interesting use of networks, which I will discuss here.

Words versus deeds

The hypothesis of Sapir and Whorf annoys many linguists (including myself), because it was disproved long ago, at least in its strong, naive form. It was disproved by linguistic data, not by arguments; and the data were the very data that Whorf had used to prove his point in the first place. However, although there is little evidence for the hypothesis in its strong form, people keep repeating it, especially in non-linguistic circles, where it is often instrumentalized.

Whether we can find evidence for a weak form of the hypothesis — which would say that we can find some influence of speaking on thinking — is another question, which is, however, difficult to answer. It may well be that our thoughts are channeled to some degree by the material we use in order to express them. When a language distinguishes color shades such as light blue and dark blue by distinct words, such as goluboj and sin'ij in Russian or celeste and azul in Spanish, it may be that its speakers develop different thoughts when somebody talks about blue cheese, which is called dark-blue cheese in Spanish (queso azul).

But this does not mean that somebody who speaks English would never know that there is some difference between light and dark blue, just because the language does not primarily make the distinction between the two color tones. It is possible that the stricter distinction in Russian and Spanish triggers an increased attention among speakers, but we do not know how large the underlying effect is in the end, and how many people would be affected by it.

Particular languages are thus neither a template nor a mirror of human thinking — they do not necessarily channel our thoughts, and may only provide small hints as to how we perceive the things around us. For example, if a language expresses different concepts, such as "arm" and "hand", with the same word, this may be a hint that "arm" and "hand" are not that different from each other, or that they belong together functionally in some sense, which is why we may perceive them as a unit. This is the case in Russian, where we find only one expression, ruka, for both concepts. In daily conversation, this works pretty well, and there are rarely any situations where Russian speakers would not understand each other due to ambiguities, since most of the time the context in which people speak disambiguates what they want to express well enough.

[Figure: Colexification network with the central concept "MIND", and the geographical distribution of languages colexifying "MIND" and "BRAIN".]

These colexifications, as we now call the phenomenon (François 2008), occur frequently in the languages of the world. This is due to the polysemy of many of the words we use, since words rarely denote just one concept, but often denote several similar concepts at the same time. On the other hand, we also encounter identical word forms in the same language which express completely different things, resulting from coincidental processes by which originally different pronunciations came to sound alike (what would be called convergence in biology). Those colexifications that are not coincidental but result from polysemy are the most interesting ones for linguists, not least because the relations between the words form networks rather than trees (as shown above). When assembled in large enough numbers, across a sufficiently large sample of languages, they may allow us some interesting insights into human cognition.

The procedure for mining these insights from cross-linguistic data has already been discussed in a previous blog post, from 2018. The main idea is to collect colexifications for as many concepts and languages as possible, in order to construct a colexification network, in which each concept is represented by a node, and weighted links between the nodes represent how often each colexification between the linked concepts occurs; that is, they represent how often we find a language that expresses the two linked concepts with the same word.
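
In code, the basic construction step is simple enough to sketch (I use the networkx library here; the tiny word lists are invented for illustration and are not taken from CLICS):

```python
from collections import defaultdict
from itertools import combinations
import networkx as nx

# Invented toy data: for each language, which word form expresses which concept.
lexicon = {
    "Russian": {"ARM": "ruka", "HAND": "ruka", "TREE": "derevo"},
    "German":  {"ARM": "Arm",  "HAND": "Hand", "TREE": "Baum"},
    "Toy":     {"ARM": "lima", "HAND": "lima", "TREE": "kayu"},
}

graph = nx.Graph()
for language, words in lexicon.items():
    forms = defaultdict(list)
    for concept, form in words.items():
        forms[form].append(concept)
    # every pair of concepts expressed by the same form is one colexification
    for concepts in forms.values():
        for a, b in combinations(sorted(concepts), 2):
            if graph.has_edge(a, b):
                graph[a][b]["weight"] += 1
            else:
                graph.add_edge(a, b, weight=1)

print(list(graph.edges(data=True)))  # [('ARM', 'HAND', {'weight': 2})]
```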

Having proposed a first update of our Database of Cross-Linguistic Colexifications (CLICS) back in 2018, we have now been able to further increase the data. With this third installment of the database, we could double the number of language varieties, from 1,200 to 2,400. In addition, we could enhance the workflows that we use to aggregate data from different sources, in a rigorously reproducible way (Rzymski et al. 2020).

Current work

Even more interesting than these data, however, is a study initiated by colleagues in psychology at the University of North Carolina, which was recently published after more than two years of intensive collaboration (Jackson et al. 2019). In this study, the colexifications for emotion concepts, such as "love", "pity", "surprise", and "fear", were assembled, and the resulting networks were statistically compared across different language families. The surprising result was that the structures of the networks differed quite considerably from each other (an effect that we could not find for color concepts derived from the same data). Some language families, for example, tend to colexify "surprise" and "fear (fright)" (see our subgraph for "surprised"), while others colexify "love" and "pity" (see the subgraph for "pity").

Not all aspects of the network structures were different, though. An additional analysis involving informants showed that especially the criterion of valence (that is, whether something is perceived as negative or positive) played an important role in structuring the networks; and similar effects could be found for the degree of arousal.

These results show that the way in which we express emotion concepts in our languages is, on the one hand, strongly influenced by cultural factors, while on the other hand there are some cognitive aspects that seem to be reflected similarly across all languages.

What we cannot conclude from the results, however, is that those who speak languages in which "pity" and "love" are expressed by the same word will not know the difference between the two emotions. Here again, it is important to emphasize what I mentioned above with respect to color terms: if a particular distinction is not made in a given language, this does not mean that its speakers do not know the difference.

It may be tempting to dig out the old hypothesis of Sapir and Whorf in the context of the study on emotions; but the results do not, by any means, provide evidence that our thinking is directly shaped and restricted by the languages we speak. Many factors influence how we think, and language is just one aspect among many others. Instead of focusing too much on the question of which languages we speak, we may want to focus on how we speak the language in which we want to express our thoughts.

References

François, Alexandre (2008) Semantic maps and the typology of colexification: intertwining polysemous networks across languages. In: Vanhove, Martine (ed.): From polysemy to semantic change. Amsterdam: Benjamins, pp. 163-215.

Jackson, Joshua Conrad, Joseph Watts, Teague R. Henry, Johann-Mattis List, Peter J. Mucha, Robert Forkel, Simon J. Greenhill and Kristen Lindquist (2019) Emotion semantics show both cultural variation and universal structure. Science 366.6472: 1517-1522. DOI: 10.1126/science.aaw8160

Rzymski, Christoph, Tiago Tresoldi, Simon Greenhill, Mei-Shin Wu, Nathanael E. Schweikhard, Maria Koptjevskaja-Tamm, Volker Gast, Timotheus A. Bodt, Abbie Hantgan, Gereon A. Kaiping, Sophie Chang, Yunfan Lai, Natalia Morozova, Heini Arjava, Nataliia Hübler, Ezequiel Koile, Steve Pepper, Mariann Proos, Briana Van Epps, Ingrid Blanco, Carolin Hundt, Sergei Monakhov, Kristina Pianykh, Sallona Ramesh, Russell D. Gray, Robert Forkel and Johann-Mattis List (2020) The Database of Cross-Linguistic Colexifications, reproducible analysis of cross-linguistic polysemies. Scientific Data 7.13: 1-12. DOI: 10.1038/s41597-019-0341-x

Whorf, Benjamin Lee (1950) An American Indian Model of the Universe. International Journal of American Linguistics 16.2: 67-72.

Automatic morpheme segmentation (Open problems in computational diversity linguistics 1)


The first task on my list of 10 open problems in computational diversity linguistics deals with morphemes, that is, the minimal meaning-bearing parts in a language. A morpheme can be a word, but it does not have to be a word, since words may consist of more than one morpheme, and — depending on the language in question — may do so almost by default.

Examples of morphemes in English include clear-cut cases of compounding, where two words are joined to form a new word. Often, this is not even readily reflected in the spelling, and, as a result, speakers may at times think that a word like "primary school" is not a single word, although it is easy to determine from its semantics that it does indeed point to one uniform concept. Other examples include grammatical markers, such as the ending -s, which marks most English plurals as well as the third person singular of verbs. When confronted with a word form like walks, linguists will analyze it as consisting of two morphemes, illustrating this by adding a dash as a boundary marker: walk-s.

The problem

The task of automatic morpheme segmentation is thus a pretty straightforward one: given a list of words, potentially along with additional information, such as their meaning, or their frequency in the given language, try to identify all morpheme boundaries, and mark this by adding dash symbols where a boundary has been identified.

One may ask why the automatic identification of morphemes should be a problem at all — and some people commenting on my presentation of the 10 open problems last month did ask this. The problem is not unrecognized in the field of Natural Language Processing, and solutions have been discussed from the 1950s onwards (Harris 1955, Benden 2005, Bordag 2008, Hammarström 2006; see also the overview by Goldsmith et al. 2017).

Roughly speaking, all approaches build on statistics about n-grams, i.e., recurring symbol sequences of arbitrary length. Assuming that n-grams which represent meaning-bearing units should be distributed more frequently across the lexicon of a language, they assemble these statistics from the data, trying to infer the ones that "matter". With Morfessor (Creutz and Lagus 2005), there is also a popular family of algorithms available in the form of a very stable and easy-to-use Python library (Virpioja et al. 2013). Applying and testing methods for automatic morpheme segmentation is thus very straightforward nowadays.
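
For readers who want to try this themselves, the basic workflow with the Morfessor 2.0 Python package looks roughly like the following sketch (based on my reading of the package documentation; the corpus file name is a placeholder, and the exact method names should be checked against the version you install):

```python
import morfessor

# Sketch of the Morfessor 2.0 workflow; 'wordlist.txt' is a placeholder for
# a plain text file with one word form (or sentence) per line.
io = morfessor.MorfessorIO()
train_data = list(io.read_corpus_file("wordlist.txt"))

model = morfessor.BaselineModel()
model.load_data(train_data)
model.train_batch()

# viterbi_segment returns the best segmentation and its score
segments, score = model.viterbi_segment("auftürmen")
print(segments)
```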

The issue with all of these approaches and ideas is that they require a very large amount of training data, while our actual datasets are, by nature, small and sparse. As a result, all currently available algorithms fail miserably when it comes to determining the morphemes in datasets of fewer than 1,000 words.

Interestingly, even when they have been trained on large datasets, the algorithms still commit surprising errors, as can easily be seen when testing the online demo of the Morfessor software for German (https://asr.aalto.fi/morfessordemo/). When testing a word like auftürmen "pile up", for example, the algorithm yields the segmentation auf-türme-n, which is probably understandable from the fact that the word Türme "towers" is quite frequent in the German lexicon, thus confusing the algorithm; but for a German speaker, who knows that verbs end in -en in their infinitive, it is clear that auftürmen can only be segmented as auf-türm-en.

If I understand the information on the website correctly, the Morfessor algorithm offered online was trained on more than 1 million different German word forms. Given that in our linguistic approaches we usually have only some 1,000 words per language at our disposal, if not fewer, it is clear that these algorithms won't help us find the morphemes in our data.

To illustrate this, I ran a small test with the Morfessor software, using two datasets for training: one big dataset with about 50,000 words from Baayen et al. (1995), and one smaller dataset of about 600 words which I used as a cognate detection benchmark when writing my dissertation (List 2014). I used these two datasets to train the Morfessor software and then applied the trained models to segment a list of 10 German words (see the GitHub Gist here).

The results for the two models (Small data and Big data), as well as the segmentations proposed by the online application (Online), are given in the table below (with my own judgments on morpheme boundaries given in the column Word).

Number | Word         | Small data       | Big data     | Online
1      | hand         | hand             | hand         | hand
2      | hand-schuh   | hand-sch-uh      | hand-schuh   | hand-schuh
3      | hantel       | h-a-n-t-el       | hant-el      | han-tel
4      | hunger       | h-u-n-g-er       | hunger       | hunger
5      | lauf-en      | l-a-u-f-en       | laufen       | lauf-en
6      | geh-en       | gehen            | gehen        | gehen
7      | lieg-en      | l-i-e-g-en       | liegen       | liegen
8      | schlaf-en    | sch-lafen        | schlafen     | schlaf-en
9      | kind-er-arzt | kind-er-a-r-z-t  | kind-er-arzt | kinder-arzt
10     | grund-schule | g-rund-sch-u-l-e | grund-schule | grundschule

What can be seen clearly from the table is that none of the models does a convincing job of segmenting my ten test words. More importantly, however, the algorithm's problems increase drastically when dealing with small training data: the segmentations proposed in the Small data column are clearly the worst, splitting words into letters in a seemingly random fashion.

What is interesting in this context is that trained linguists would rarely fail at this task, even if all they were given for training was the small word list. That they do not fail is shown by the numerous studies in which linguistic fieldworkers have investigated previously under-documented languages and quickly figured out how their morphology works.

Why is it so difficult to find morpheme boundaries?

What makes the detection of morpheme boundaries so difficult, also for humans, is that they are inherently ambiguous. A final -s can mark the plural in German, especially on borrowings, as in Job-s, but it can likewise mark a short variant of es "it", where the vowel is deleted, as in ist's "it's"; and in many other cases it marks nothing at all, but is instead part of a larger morpheme, as in Haus "house". Whether or not a certain substring of sounds in a language can function as a morpheme depends on the meaning of the word, not on the substring itself. We can — once more — see one of the great differences between sequences in biology and sequences in linguistics here: linguistic sequences derive their "function" (i.e. their meaning) from the context in which they are used, not from their structure alone.

If speakers are no longer able to clearly understand the morphological structure of a given word, they may even start to change it, in order to make it more "transparent" in its denotation. Examples of this are the numerous cases of folk etymology, where speakers re-interpret the morphemes in a word, with English ham-burger as a prominent example, since the word originally seems to derive from the city of Hamburg, which has nothing to do with ham.

How do humans find morphemes?
 
The reason why human linguists can find morphemes in sparse data relatively easily, while machines cannot, is still not entirely clear to me (beyond the obvious point that humans are good at pattern recognition and machines are not). However, I do have some basic ideas about why humans largely outperform machines when it comes to morpheme segmentation; and I think that future approaches that take these ideas into account might drastically improve the performance of automatic morpheme segmentation methods.

As a first point, given the importance of meaning for determining morphemic structure, it seems almost absurd to me to try to identify morphemes in a given language corpus based on a pure analysis of the sequences, without taking their meaning into account. If we are confronted with two words like Spanish hermano "brother" and hermana "sister", it is clear — if we know what they mean — that the -o vs. -a most likely marks a distinction of gender. While machines compare potential similarities inside the words independently of semantics, humans will always start from those pairs where they expect to find interesting alternations. As long as the meanings are supplied, a human linguist — even one not familiar with the given language — can easily propose a more or less convincing segmentation of a list of only 500 words.
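
A minimal sketch of what "starting from the meanings" could look like computationally (the mini-wordlist is invented, and the pairing criterion, a hand-coded semantic field, is deliberately naive):

```python
# Naive sketch: only compare words whose meanings suggest a relation
# (here, a hand-coded semantic field), then extract the alternating part.
words = [
    ("hermano", "brother", "sibling"),
    ("hermana", "sister",  "sibling"),
    ("niño",    "boy",     "child"),
    ("niña",    "girl",    "child"),
]

def common_prefix(a, b):
    i = 0
    while i < min(len(a), len(b)) and a[i] == b[i]:
        i += 1
    return a[:i]

for i, (form_a, gloss_a, field_a) in enumerate(words):
    for form_b, gloss_b, field_b in words[i + 1:]:
        if field_a == field_b:  # semantic pre-filter: compare only related pairs
            stem = common_prefix(form_a, form_b)
            print(f"{gloss_a}/{gloss_b}: {stem}- + -{form_a[len(stem):]} vs. -{form_b[len(stem):]}")
# brother/sister: herman- + -o vs. -a
# boy/girl: niñ- + -o vs. -a
```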

A second point that is disregarded in current automatic approaches is the fact that morphological structures vary drastically among languages. In Chinese and many South-East Asian languages, for example, it is almost a rule that every syllable represents one morpheme (with minimal exceptions being attested and discussed in the literature). Since syllables, in turn, are easy to find in these languages, because words can only end in a small set of sounds, an algorithm to detect morphemes in such languages would not need any n-gram statistics, but just a theory of syllable structure. Instead of global strategies, we may thus have to opt for local strategies of morpheme segmentation, in which we identify different types of languages for which a given algorithm is suitable.
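
As a sketch of how little such a local strategy would need (the syllable template below is a made-up CV(N) pattern, not a description of any real language):

```python
import re

# Toy sketch: in a language where every syllable is one morpheme, segmentation
# reduces to syllabification. The pattern below is a made-up CV(N) template,
# not a description of any real language.
SYLLABLE = re.compile(r"[ptkmnls]?[aeiou][nŋ]?")

def segment(word):
    return SYLLABLE.findall(word)

print(segment("mapan"))   # ['ma', 'pan']
print(segment("laŋsan"))  # ['laŋ', 'san']
```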

This brings us to a third point. A peculiarity of linguistic sequences in spoken languages is that they are built according to specific phonotactic rules that govern their overall structure. Whether or not a language tolerates more than three consonants at the beginning of a word depends on its phonotactics, the set of rules by which its inventory of sounds is combined to form morphemes and words. Phonotactics itself can also give hints about morpheme boundaries, since it may prohibit, inside a morpheme, combinations of sounds that can occur where morphemes are joined to form words. German Ur-instinkt "basic instinct", for example, is pronounced with a glottal stop after the Ur-, a sound which can only occur at the beginning of German words and morphemes, thus clearly marking the word as a compound (otherwise the word could be parsed as Urin-stinkt "urine smells").

A fourth point that is also generally disregarded in current approaches to automatic morpheme segmentation is that of cross-linguistic evidence. In many cases, the speakers of a given language may themselves no longer be aware of the original morphological segmentation of some of their words, while the comparison with closely related languages can still reveal it. If we have a potentially multi-morphemic word in one language, for example, and one of the two potential morphemes is reflected as an independent word in a related language, this is good evidence that the potentially multi-morphemic word does, indeed, consist of multiple morphemes.

Suggestions

Linguists regularly use multiple types of evidence when trying to understand the morphological composition of the words in a given language. If we want to advance the field of automatic morpheme segmentation, it seems to me indispensable that we give up the idea of detecting the morphology of a language just by looking at the distribution of letters across word forms. Instead, we should make use of semantic, phonotactic, and comparative information. We should further give up the idea of designing universal morpheme segmentation algorithms, but rather study which approach works best on which morphological type. How these aspects can be combined in a unified framework, however, is still not entirely clear to me; and this is also the reason why I list automatic morpheme segmentation as the first of my ten open problems in computational diversity linguistics.

Even more important than strategies for solving the problem, however, is that we start to work on extensive datasets for the testing and training of new algorithms that seek to identify morpheme boundaries in sparse data. As of now, no such datasets exist. Approaches like Morfessor were designed to identify morpheme boundaries in written languages; they barely work with phonetic transcriptions. But if we had datasets for testing and training available, be it only some 20 or 40 languages from different language families, manually annotated by experts, segmented both with respect to the phonetics and to the morphemes, this would allow us to investigate both existing and new approaches much more profoundly, and I expect it could give a real boost to our discipline and greatly help us to develop advanced solutions to the problem.

References

Baayen, R. H. and Piepenbrock, R. and Gulikers, L. (eds.) (1995) The CELEX Lexical Database. Version 2. Philadelphia.

Benden, Christoph (2005) Automated detection of morphemes using distributional measurements. In: Claus Weihs and Wolfgang Gaul (eds.): Classification -- the Ubiquitous Challenge. Berlin and Heidelberg: Springer, pp. 490-497.

Bordag, Stefan (2008) Unsupervised and knowledge-free morpheme segmentation and analysis. In: Carol Peters, Valentin Jijkoun, Thomas Mandl, Henning Müller, Douglas W. Oard, Anselmo Peñas, Vivien Petras and Diana Santos (eds.): Advances in Multilingual and Multimodal Information Retrieval. Berlin and Heidelberg: Springer, pp. 881-891.

Creutz, M. and Lagus, K. (2005) Unsupervised morpheme segmentation and morphology induction from text corpora using Morfessor 1.0. Technical Report. Helsinki University of Technology.

Goldsmith, John A. and Lee, Jackson L. and Xanthos, Aris (2017) Computational learning of morphology. Annual Review of Linguistics 3.1: 85-106.

Hammarström, Harald (2006) A naive theory of affixation and an algorithm for extraction. In: Proceedings of the Eighth Meeting of the ACL Special Interest Group on Computational Phonology and Morphology at HLT-NAACL 2006, pp. 79-88.

Harris, Zellig S. (1955) From phoneme to morpheme. Language 31.2: 190-222.

List, Johann-Mattis (2014) Sequence Comparison in Historical Linguistics. Düsseldorf: Düsseldorf University Press.

Virpioja, Sami, Smit, Peter, Grönroos, Stig-Arne and Kurimo, Mikko (2013) Morfessor 2.0: Python Implementation and Extensions for Morfessor Baseline. Helsinki: Aalto University.

Dante and the tree model


I was preparing a blog post on phylogenetic methods for the study of the Divine Comedy, by Dante Alighieri (1265-1321), and it occurred to me that a note on Dante's contribution to the tree model might also be worthwhile. This medieval poet cannot, of course, be described as the father of the Stammbaum, but he should probably be listed among the many sources for the development of the model, and of the linguistic theories that it supported in the 19th century.

The study of Dante's works became almost an international mania with the rise of Romanticism in the 18th and 19th centuries, and scholars are no strangers to his more obscure works. One of these is an abandoned linguistic essay entitled De vulgari eloquentia ("On eloquence in the vernacular", circa 1305). In short, it is an unfinished manual on composition, a "poetics", with an introductory chapter discussing the appropriate language for poetry. The essay is written in Latin, but from the first paragraphs the author declares that Latin itself is not suitable for literature, as it is not a living language. Latin is thus reserved for scientific and philosophical matters, and the author "ventures in a quest" for a good literary vernacular.


The first paragraphs are full of medieval opinions on language, such as the confusion arising from the Tower of Babel, and a discussion about how a linguistic ability would be superfluous for demons and angels alike. Towards the end the author starts to favor an artificial vernacular language, concluding that no living language (among the 14 dialects of the Italian peninsula) would be good enough. This latter idea was not followed when he wrote the Divine Comedy (which was written in Tuscan, Dante's native dialect), and this probably explains why the essay was abandoned just when the composition of that poem was begun.

However, between the biblical linguistics and the poetic formalism, Dante explores linguistic matters with an almost modern (and sometimes surprising) mindset. For example, he discusses how birds don't talk but simply repeat air movements, he discusses how grammar (i.e. Latin and Greek) is a codification, he provides a detailed, if subjective, map of the Italian vernaculars of his time, and, what matters most for us here, he explains that not all linguistic differences are due to the "vengeful confusion" arising from the Tower of Babel. Being human constructions, he says, languages are unstable and, as such, change, as proved by many similarities that can't be random and don't really add much confusion (i.e. their differences are too feeble to be a consequence of the punishment of an almighty god). Our problem, he continues, is that the changes are gradual and subtle, and as such we don't perceive them; but they do exist, as someone who returns to a city after many years can confirm, or as can be observed when moving from city to city.

The (genealogical) tree model is implicit but undeniable in the eighth chapter of the first book, where the author uses words such as "root", "planted", and "branches". Here, I also report the original words in Latin, along with a translation adapted from Botterill (2006):
The confusion of languages [after the Tower of Babel] leads me [...] to the opinion that it was then that human beings were first scattered throughout the whole world, into every temperate zone and habitable region, right to its furthest corners. And since the principal root [radix] from which the human race has grown was planted [plantata] in the East, and from there our growth has spread, through many branches [palmites] and in all directions, finally reaching the furthest limits of the West [...]. [...] these people brought with them a tripartite language. Of those who brought it, some found their way to southern Europe and some to northern; and a third group, whom we now call Greeks, settled partly in Europe and partly in Asia. Later, from this tripartite language (which had been received in that vengeful confusion), different vernaculars developed, as I shall show later. For in that whole area that extends from the mouth of the Danube (or the Meotide marshes) to the westernmost shores of England, and which is defined by the boundaries of the Italians and the French, and by the [Atlantic] ocean, only one language prevailed, although later it was split up into many vernaculars by the Slavs, the Hungarians, the Teutons, the Saxons, the English, and several other nations. Only one sign of their common origin remains in almost all of them, namely that nearly all the nations listed above, when they answer in the affirmative, say [see the map above, from Elisabeth Burr]. Starting from the furthest point reached by this vernacular (that is, from the boundary of the Hungarians towards the east), another occupied all the rest of what, from there onwards, is called Europe; and it stretches even beyond that. All the rest of Europe that was not dominated by these two vernaculars was held by a third, although nowadays this itself seems to be divided into three: for some now say oc, some oïl, and some sì, when they answer in the affirmative; and these are the Hispanic, the French, and the Italians. Yet the sign that the vernaculars of these three peoples derive from one and the same language is plainly apparent: for they can be seen to use the same words to signify many things, such as 'God', 'heaven', 'love', 'sea', 'earth', 'is', 'lives', 'dies', 'loves', and almost all others. Of these peoples, those who say oc live in the western part of southern Europe, beginning from the boundaries of the Genoese. Those who say sì, however, live to the east of those boundaries, all the way to that outcrop of Italy from which the gulf of the Adriatic begins, and in Sicily. But those who say oïl live somewhat to the north of these others, for to the east they have the Germans, on the west and north they are hemmed in by the English sea and by the mountains of Aragon, and to the south they are enclosed by the people of Provence and the slopes of the Apennines.
The De vulgari eloquentia has routinely been printed alongside the Divine Comedy, and was studied, to give some examples, by Thomas Warton in his History of English Poetry (London, 1775), by Johann Gottfried Eichhorn in his Allgemeine Geschichte der Cultur und Litteratur des neueren Europa (Göttingen, 1796), and by August Pott (a student of Franz Bopp) in his Indogermanischer Sprachstamm (1840). The essay was copied in Germany even before the introduction of the printing press; and a German translation, Über die Volkssprache (K. L. Kannegießer, 1845), was published in Leipzig when August Schleicher was already active in linguistic studies.

By this time, it seems, the work was almost a commonplace topic of discussion — when defending his model for the Italian language around 1830, and complaining about people who proposed a 12th-century language for a 19th-century nation state, Alessandro Manzoni jokingly remarked that it was "one of those books which nobody actually read, but everybody discusses".

This is one more little note to our narrative on the evolution of the tree model.

References

Alighieri, D. De Vulgari Eloquentia. Edited and translated by Steven Botterill. Cambridge University Press, 2006.

Elisabeth Burr. Klassifizierung der romanischen Sprachen.

Phylogenetics versus historical linguistics


Google Trends looks at recent trends in web searches, and it has been used to study patterns in web activity for many concepts. This is similar to the Ngram Viewer in Google Books (see the post Ngrams and phylogenetics). Google Trends aggregates the number of web searches that have been performed for any given search term (or terms), and it can display the results as a time graph for any given geographical region. The Trends searches are somewhat restrictive, but they may show us something about the period 2004-2016 (inclusive).

So, I thought that it might be interesting to look at a few expressions of relevance to readers of this blog. The Trends graphs show changes in the relative proportion of searches for the given term (vertically) through time (horizontally). The vertical axis is scaled so that 100 marks the time of peak popularity as a fraction of the total number of searches (i.e. the scale shows the proportion of searches, with the maximum always set to 100, no matter how many searches there were).
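
The rescaling itself is trivial to reproduce (a sketch with invented numbers, not actual Trends data):

```python
# Invented monthly search shares; Trends rescales so that the maximum is 100.
raw_shares = [0.8, 1.2, 0.9, 2.4, 1.5]   # proportion of all searches, arbitrary units
scaled = [round(100 * value / max(raw_shares)) for value in raw_shares]
print(scaled)  # [33, 50, 38, 100, 62]
```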


As you can see, the term "phylogenetics" has maintained its popularity over "historical linguistics". However, it has decreased in popularity through time much more than has "historical linguistics". Nevertheless, both decreases are very small compared to that for the term "bioinformatics", as discussed in the blog post on Bioinformaticians look at bioinformatics.

It is not entirely clear to me why many technical terms have decreased in Google searches through time, although there are several possibilities. First, it could be Google itself. The Trends numbers represent search volume for a keyword relative to the total search volume on Google. So, the actual search numbers for the technical terms could be increasing while, as a fraction of the total search volume of the internet, they are decreasing, provided total Google search volume is increasing.

Alternatively, Business Insider has noted that "search is facing a huge challenge ... consumers are increasingly shifting [from desktop] to mobile. On mobile, consumers say they just don't search as much as they used to because they have apps that cater to their specific needs. They might still perform searches within those apps, but they're not doing as many searches on traditional search engines". Furthermore, "people are discovering content through social media. The top eight social networks drove more than 30% of traffic to sites in 2014".

The extra raggedness in search popularity in the first couple of years of the graph probably reflects inadequacies in the Google Trends dataset in the early years (as discussed by Wikipedia). The same is true for the next graph, as well.


The "phylogenetic tree" searches have been more popular than "evolutionary tree", just as was true for the Google Books usage discussed in the post Ngrams and phylogenetics. However, the "phylogenetic tree" searches show a distinctly bimodal pattern every year. This presumably reflects teaching semesters — few people search for technical terms out of term time!

Unfortunately, it is not possible to look at the term "phylogenetic network", because Google Trends tells me that there is "Not enough search volume to show results". How rude!

Once more on artificial intelligence and machine learning


In an earlier blog post, I expressed my scepticism regarding the scientific value of non-transparent machine learning approaches, which only provide a result but no transparent explanation of how they arrive at their conclusion. I am aware that I run the risk of giving the impression of abusing this blog for my own agenda, against artificial intelligence and machine learning approaches in the historical sciences, by bringing the problem up again. However, a recent post in Nature News (Castelvecchi 2016) further substantiates my original scepticism, providing some interesting new perspectives on the scientific and the practical consequences, so I could not resist mentioning it in my post for this month.

Deep learning approaches in artificial intelligence and machine learning have their roots in neural network research going back to the 1950s, and have now become so successful that they are starting to play an increasingly important role in our daily lives, be it that they are used to recommend to us yet another book that somebody else has bought along with the book we just want to buy, or that they allow us to take a little nap while driving fancy electric cars and saving carbon footprint for our next round-the-world trip. The same holds, of course, for science, and in particular for biology, where neural networks have been used for tasks like homolog detection (Bengio et al. 1990) or protein classification (Leslie et al. 2004). This is even more true for linguistics, where a complete subfield, usually called natural language processing, has emerged (see Hladka and Holub 2015 for an overview), in which algorithms are trained for various tasks related to language, ranging from word segmentation in Chinese texts (Cai and Zhao 2016) to the general task of morpheme detection, which seeks to find the smallest meaningful units in human languages (King 2016).

In the post by Castelvecchi, I found two aspects that triggered my interest. First, the author emphasizes that answers that can be easily, and often accurately, produced by machine learning approaches do not automatically provide real insights, quoting Vincenzo Innocente, a physicist at CERN:
As a scientist ... I am not satisfied with just distinguishing cats from dogs. A scientist wants to be able to say: "the difference is such and such." (Vincenzo Innocente, quoted by Castelvecchi 2016: 22)
This expresses precisely (and much more transparently) what I tried to emphasize in the earlier blog post, namely that science is primarily concerned with the questions why? and how?, and only peripherally with the question what?

The other interesting aspect is that these apparently powerful approaches can, in fact, be easily deceived. Given that they are trained on certain data, and that it is usually not known to the trainers which aspects of the training data effectively trigger a given classification, one can in turn use algorithms to generate data that will deceive an application, forcing it to give false responses. Castelvecchi mentions an experiment by Mahendran and Vedaldi (2015), which illustrates how "a network might see wiggly lines and classify them as a starfish, or mistake black-and-yellow stripes for a school bus" (Castelvecchi 2016: 23).

Putting aside the obvious consequences that arise from abusing the neural networks that are used in our daily lives, this problem is surely not unknown to us as human beings. We can likewise be easily deceived by our expectations, be it in daily life or in science. This, finally, brings us back to networks and trees, as we all know how difficult it is at times to see the forest behind the tree that our software gives us, or the tree inside the forest of incompletely sorted lineages.

References
  • Bengio, Y., S. Bengio, Y. Pouliot, and P. Agin (1990): A neural network to detect homologies in proteins. In: Touretzky, D. (ed.) Advances in Neural Information Processing Systems 2. Morgan-Kaufmann, pp. 423-430.
  • Cai, D. and H. Zhao (2016) Neural word segmentation learning for Chinese. In: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, pp. 409-420.
  • Castelvecchi, D. (2016) Can we open the black box of AI? Nature 538: 20-23.
  • Hladka, B. and M. Holub (2015) A gentle introduction to machine learning for natural language processing: how to start in 16 practical steps. Language and Linguistics Compass 9.2: 55-76.
  • King, D. (2016) Evaluating sequence alignment for learning inflectional morphology. In: Proceedings of the 14th Annual SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology, pp. 49–53.
  • Leslie, C., E. Eskin, A. Cohen, J. Weston, and W. Noble (2004) Mismatch string kernels for discriminative protein classification. Bioinformatics 20.4: 467-476.
  • Mahendran, A. and A. Vedaldi (2015) Understanding deep image representations by inverting them. In: 2015 IEEE conference on computer vision and pattern recognition (CVPR), pp. 5188-5196.

Capturing phylogenetic algorithms for linguistics


A little over a week ago I was at a workshop "Capturing phylogenetic algorithms for linguistics" at the Lorentz Centre in Leiden (NL). This is, as some of you will recall, the venue that hosted two earlier workshops on phylogenetic networks in 2012 and 2014.

I had been invited to participate and to give a talk, and I chose to discuss the possible relevance of phylogenetic networks (as opposed to phylogenetic trees) for linguistics. (My talk is here). This turned out to be a good choice because, although phylogenetic trees are now a firmly established part of contemporary linguistics, networks are much less prominent. Data-display networks (which visualize incongruence in a data-set, but do not model the genealogical process that gave rise to it) have found their way into some linguistic publications, and a number of the presentations earlier in the week showed various flavours of split networks. However, the idea of constructing "evolutionary" phylogenetic networks - e.g. modeling linguistic analogues of horizontal gene transfer - has not yet gained much traction in the field. In many senses this is not surprising, since tools for constructing evolutionary phylogenetic networks in biology are not yet widely used, either. As in biology, much of the reticence concerning these tools stems from uncertainty about whether models for reticulate evolution are sufficiently mature to be used 'out of the box'.

As far as this blog is concerned, the relevant word in linguistics is 'borrowing'. My layman's interpretation of this is that it denotes the process whereby words or terms are transferred horizontally from one language to another. (Mattis, feel free to correct me...) There were many discussions of how this process can confound the inference of concept and language trees, but other than it being a problem there was not a lot said about how to deal with it methodologically (or model it). One of the issues, I think, is that linguists are nervous about the interface between micro and macro levels of evolution, and at what scale of (language) evolution horizontal events could and should be modelled. To cite a biological analogue: if you study populations at the most microscopic level, evolution is usually reticulate (because of e.g. meiotic recombination), but at the macro level large parts of mammalian evolution are uncontroversially tree-like. In this sense, whether reticulate events are modeled depends on the event itself and the scale of the phylogenetic model concerned.

Are there analogues of population-genetic phenomena in linguistics, and are they foundations for phenomena observed at the macro level? Is there a risk of over-stating the parallels with biology? One participant told me that, while she felt that there was definitely scope for incorporating analogies of species and gene trees within linguistics - and many of the participants immediately recognized these concepts - the comparisons quickly break down at more micro levels of evolution.

I'm not the right person to comment on this of course, or to answer these questions, but in any case it's clear that linguistics has plenty of scope for continuing the horizontal/vertical discussions that have already been with us for many years in biology...

Last, but not least: it was a very enjoyable workshop and I'm grateful to the organizers for inviting me!

Capturing phylogenetic algorithms for linguistics


A little over a week ago I was at a workshop "Capturing phylogenetic algorithms for linguistics" at the Lorentz Centre in Leiden (NL). This is, as some of you will recall, the venue that hosted two earlier workshops on phylogenetic networks in 2012 and 2014.

I had been invited to participate and to give a talk and I chose to discuss the possible relevance of phylogenetic networks (as opposed to phylogenetic trees) for linguistics. (My talk is here). This turned out to be a good choice because, although phylogenetic trees are now a firmly established part of contemporary linguistics, networks are much less prominent. Data-display networks (which visualize incongruence in a data-set, but do not model the genealogical processs that gave rise to it) have found their way into some linguistic publications, and a number of the presentations earlier in the week showed various flavours of split networks. However, the idea of constructing "evolutionary" phylogenetic networks - e.g. modeling linguistic analogues of horizontal gene transfer - has not yet gained much traction in the field. In many senses this is not surprising, since tools for constructing evolutionary phylogenetic networks in biology are not yet widely used, either. As in biology, much of the reticence concerning these tools stems from uncertainty about whether models for reticulate evolution are sufficiently mature to be used 'out of the box'.

As far as this blog is concerned the relevant word in linguistics is 'borrowing'. My lay-man interpretation of this is that it denotes the process whereby words or terms are transferred horizontally from one language to another. (Mattis, feel free to correct me...) There were many discussions of how this proces can confound the inference of concept and language trees, but other than it being a problem there was not a lot a said about how to deal with it methodologically (or model it). One of the issues, I think, is that linguists are nervous about the interface between micro and macro levels of evolution and at what scale of (language) evolution horizontal events could and should be modelled. To cite a biological analogue: if you study populations at the most microscopic level evolution is usually reticulate (because of e.g. meiotic recombination) but at the macro level large parts of mammalian evolution are uncontroversially tree-like. In this sense whether reticulate events are modeled depends on the event itself and the scale of the phylogenetic model concerned.

Are there analogues of population-genetic phenomena in linguistics, and do they underpin the phenomena observed at the macro level? Is there a risk of overstating the parallels with biology? One participant told me that, while she felt that there was definitely scope for incorporating analogies of species and gene trees within linguistics - and many of the participants immediately recognized these concepts - comparisons quickly break down at more micro levels of evolution.

I'm not the right person to comment on this of course, or to answer these questions, but in any case it's clear that linguistics has plenty of scope for continuing the horizontal/vertical discussions that have already been with us for many years in biology...

Last, but not least: it was a very enjoyable workshop and I'm grateful to the organizers for inviting me!

The complexity of lexical change


Most computational approaches to historical linguistics, be it those producing networks or those producing trees, make use of lexical data. There are several reasons for this preference. Lexical data is much easier to handle than abstract grammatical data. Many linguists also think that lexical data is more representative of language evolution in general, and thus offers a much better starting point for inferences. Whether one likes the preference for lexical data or not, it seems worthwhile in this context to reflect a bit more on the nature of lexical data and the complexities of lexical change. This may help to get a clearer picture of the differences between language history and biological evolution.

What Makes a Word?

In a very simple language model, the lexicon of a language can be seen as a bag of words. A word, furthermore, is traditionally defined by two aspects: its form and its meaning. Thus, the French word arbre can be defined by its written form arbre or its phonetic form [ɑʁbʁə], and its meaning "tree". This is reflected in the famous sign model of Ferdinand de Saussure (Saussure 1916), which I have reproduced in [A] in the graphic below. In order to emphasize the importance of the two aspects, linguists often say that form and meaning of a word are like two sides of the same coin (see [B] in the graphic below). But we should not forget that a word is only a word if it belongs to a certain language! From the perspective of the German or the English language, for example, the sound chain [ɑʁbʁə] is just meaningless. So, instead of two major aspects of a word, we may better talk of three major aspects: form, meaning, and language. As a result, our bilateral sign model becomes a trilateral one, as I have tried to illustrate in [C] in the graphic below.
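To make the trilateral sign model a little more concrete, here is a minimal sketch in Python; the class and attribute names are ad-hoc choices of mine, not an established data model, and the French example is the one from the text.

```python
# Minimal sketch of the trilateral sign model: form, meaning, AND language.
from dataclasses import dataclass

@dataclass(frozen=True)
class Word:
    form: str      # phonetic (or orthographic) form
    meaning: str   # a gloss standing in for the concept expressed
    language: str  # the language to which the word belongs

arbre = Word(form="ɑʁbʁə", meaning="tree", language="French")
print(arbre)

# Without the language attribute, the form alone is not a word:
# [ɑʁbʁə] carries no meaning for speakers of English or German.
```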


What is Lexical Change?

If there were no lexical change, the lexicon of a language would remain stable over time. Words might change their forms by means of regular sound change, but there would always be an unbroken tradition of identical patterns of denotation. Since this is not the case, the lexicon of every language is constantly changing. Words are lost when speakers cease to use them, and new words enter the lexicon when new concepts arise, whether borrowed from other languages or created from native material via different morphological processes. Such processes of word loss and word gain are quite frequent, and can sometimes even be observed directly by the speakers of a language when they compare their own speech with that of an older or a younger generation.

An even more important process of lexical change, especially in quantitative historical linguistics, is lexical replacement. Lexical replacement refers to the process by which a given word A, which is commonly used to express a certain meaning x, ceases to express this meaning, while at the same time another word B, which was formerly used to express a meaning y, comes to express the meaning x. The notion of lexical replacement is thus nothing more than a shift in perspective on semantic change (one major dimension of lexical change, see below). While semantic change is usually described from a semasiological perspective, i.e. from the perspective of the form, lexical replacement describes semantic change from an onomasiological perspective, i.e. from the perspective of the meaning.
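The shift in perspective can be illustrated with a small sketch. The hound/dog case below is my own illustration (Old English hund was once the general word for "dog" and was largely replaced by dog in this meaning); it is not an example from the text, and the data structures are ad hoc.

```python
# Semasiological view: follow a FORM and record how its meaning changes.
semasiological = {
    "hound": {"earlier": "dog (in general)", "later": "hunting dog"},
    "dog":   {"earlier": "a particular kind of dog", "later": "dog (in general)"},
}

# Onomasiological view: follow a MEANING and record which form expresses it.
onomasiological = {
    "dog (in general)": {"earlier": "hound", "later": "dog"},
}

# Lexical replacement is simply the onomasiological reading of the same
# semantic changes that the semasiological view records form by form.
for meaning, forms in onomasiological.items():
    print(meaning, "->", forms)
```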

Three Dimensions of Lexical Change

Gévaudan (2007) distinguishes three dimensions of lexical change: the morphological dimension, the semantic dimension, and the stratic dimension. The morphological dimension points to changes in the outer form of the words which are not due to regular sound change. As an example of this type of change, consider English birth and its ancestral form Proto-Germanic *ga-burdi "birth": while the meaning of the word did not change (or at least only slightly), the English word apparently lost the prefix ga-. This prefix is still present in the German Geburt "birth", but it was lost without leaving a trace in English.

The loss of prefixes is not the only way in which words can change during language evolution. We also find that prefixes or suffixes are added, as, for example, in French soleil "sun", which goes back to Latin soliculus "small sun, sunny", which is itself a derivation of Latin sol "sun". The semantic dimension is illustrated by changes like the one from Proto-Germanic *sælig "happy" to English silly.

The stratic dimension refers to changes involving the exchange of words between languages, that is, processes of borrowing, in which a word is transferred from one stratum of a language to another. An example of this type of change is English mountain, which was borrowed from Old French montaigne "mountain".

Note that these three dimensions of lexical change correspond directly to the three major aspects constituting a linguistic sign (or a word) that I mentioned above: The morphological dimension changes the form of a word, the semantic dimension changes its meaning, and the stratic dimension its language. Thus, the three dimensions of lexical change, as proposed by Gévaudan (2007), find their direct reflection in the major dimensions according to which words can vary.
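To make this correspondence explicit, the following sketch tags each of the changes mentioned above with the dimension it primarily affects; the examples come from the text, while the plain-tuple representation is an ad-hoc choice of mine.

```python
# Each change event is recorded as (source word, target word, dimension),
# where a word is a (language, form, gloss) triple as discussed above.
MORPHOLOGICAL, SEMANTIC, STRATIC = "morphological", "semantic", "stratic"

changes = [
    (("Proto-Germanic", "*ga-burdi", "birth"), ("English", "birth", "birth"), MORPHOLOGICAL),
    (("Latin", "sol", "sun"), ("Latin", "soliculus", "small sun"), MORPHOLOGICAL),
    (("Proto-Germanic", "*sælig", "happy"), ("English", "silly", "foolish"), SEMANTIC),
    (("Old French", "montaigne", "mountain"), ("English", "mountain", "mountain"), STRATIC),
]

for (src_lang, src_form, src_gloss), (tgt_lang, tgt_form, tgt_gloss), dim in changes:
    print(f"{src_lang} {src_form} '{src_gloss}' > {tgt_lang} {tgt_form} '{tgt_gloss}' [{dim}]")
```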


During language evolution, lexical change processes interact in all three dimensions, and yield complex patterns which may be very hard to uncover for historical linguists. As an example of this complexity, consider the development of Proto-Indo-European *bʰreu̯Hg̑- "to use", as depicted in the graphic below, which was originally designed by Hans Geisler (Heinrich-Heine University, Düsseldorf), who kindly allowed me to reproduce it here. In the graphic, changes in the stratic dimension are illustrated with the help of dotted arcs (the legend labels this as "borrowed from"), and changes in the morphological dimension are indicated by double arcs (labelled as "derived from"). The semantic dimension is not specifically labelled as such, but one can easily detect it by comparing the meanings of the words.


Modeling Lexical Change

If we look at different historical relations from the perspective of the three dimensions of lexical change, it becomes obvious that the terminology we use in linguistics is rather fuzzy. I mentioned this in an earlier post, where I pointed to the different shades of cognacy, which have never really been settled in a satisfying way in historical linguistics. Looking at them again from the perspective of the three dimensions makes it much easier to see where these different historical relations between words come from.

If we investigate the different uses of the term "cognacy", for example, it becomes obvious that the differences result from controlling for one or more of the three dimensions of lexical change. The traditional Indo-Europeanist notion of cognacy controls the stratic dimension by requiring stratic continuity (no borrowing), but at the same time it is indifferent regarding the other two dimensions. Cognacy à la Swadesh (especially Swadesh 1955), as we know it from the popular computational approaches which model lexical change as a process of cognate loss and gain, is indifferent regarding morphological continuity, but controls the semantic and the stratic dimensions by only considering words that have the same meaning and have not been borrowed (at least in theory).

In the table below, I have attempted to illustrate how the different terms, including the biological terms homology, orthology, paralogy, and xenology, cover processes by each controlling for one or more of the three dimensions of lexical change (with "+" indicating that continuity is required, "-" indicating that change is required, and "+/-" indicating indifference). Contrasting the different dimensions of lexical change with the terminology used to refer to different relations between words shows not only the arbitrariness of the traditional linguistic terminology (why do we only cover two out of 3 × 3 × 3 = 27 different possible types? why do we only control by requiring continuity, not change? etc.), but also the fundamental difference between biological and linguistic terminology.
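For the two notions of cognacy spelled out above, the encoding behind such a table can be sketched as follows; the values for the biological terms are deliberately left out here, and the dictionary layout is my own.

```python
# "+"   = continuity required in this dimension
# "-"   = change required
# "+/-" = indifferent
relations = {
    #                              morphological  semantic  stratic
    "cognacy (Indo-Europeanist)": ("+/-",         "+/-",    "+"),
    "cognacy (Swadesh)":          ("+/-",         "+",      "+"),
}

# Three possible settings per dimension give 3 ** 3 = 27 conceivable
# relation types, of which traditional terminology covers only a few.
print(3 ** 3)
```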


Concluding Remarks

So far, all computational methods that have been proposed for historical linguistics are based on the strict Swadesh type of wordlist encoding, which in the end controls for the semantic and stratic dimensions of lexical change and is indifferent regarding morphology. Such an encoding is per se inconsistent, since there is no reason to assume that morphological change would be less frequent or less indicative of language history than any of the other types.

The reason why linguists tend to control for meaning when creating their datasets is mostly a matter of sampling: it is much easier to draw a set of words from a number of languages by starting from a given set of meanings. However, it may be useful to relax this criterion, since the restricted sets of only about 200 meanings on average necessarily hide vivid and interesting processes of lexical change.
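As a minimal sketch of what meaning-based sampling amounts to in practice (the concept list and the word forms below are mere placeholders of my own choosing):

```python
# Start from a fixed list of meanings and collect, per language, the word
# that expresses each of them.
concepts = ["tree", "sun", "mountain"]

wordlist = {
    "French":  {"tree": "arbre", "sun": "soleil", "mountain": "montagne"},
    "German":  {"tree": "Baum",  "sun": "Sonne",  "mountain": "Berg"},
    "English": {"tree": "tree",  "sun": "sun",    "mountain": "mountain"},
}

# Easy to assemble and to compare across languages, but any word that has
# drifted out of these meanings simply never enters the sample.
for concept in concepts:
    print(concept, {lang: words[concept] for lang, words in wordlist.items()})
```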

The reasons why linguists control for borrowing are largely historical, and in many cases such control is not even feasible, since our evidence for borrowing may be limited, especially where the majority of speakers is bilingual (which is more often the rule than the exception among the languages of the world). It seems much more fruitful to revive network thinking in linguistics and to invest in the development of high-quality datasets with a less arbitrary exclusion of certain dimensions of lexical change, and of transparent computational methods which do not stick exclusively to the tree model.

References

  • Gévaudan, P. (2007) Typologie des lexikalischen Wandels [Typology of lexical change]. Tübingen: Stauffenburg.
  • Swadesh, M. (1955) Towards greater accuracy in lexicostatistic dating. International Journal of American Linguistics 21(2), 121-137.
  • Saussure, F. de (1916) Cours de linguistique générale [Course on general linguistics]. Lausanne: Payot.

Productive and unproductive analogies between biology and linguistics


Genotypes or phenotypes?

In a blogpost from 2013, David investigated some of the popular analogies between anthropology (including linguistics) and biology. He rejected those analogies that compare the genotype with anthropological entities (like the common "words = genes" analogy). Instead, he proposed to draw the analogy between anthropological entities and the phenotype. I generally agree that we should be very careful about the analogies we draw between different disciplines, and I share the scepticism regarding those naive approaches in which genes are compared with words or sounds are compared with nucleotide bases. I am, however, sceptical whether the alternative analogy between phenotypes and anthropological entities offers a general solution for the study of language evolution.

Productive and unproductive analogies

My scepticism results from a general uncertainty about the transfer of models and methodologies among scientific disciplines. I am deeply convinced that such a transfer is useful and that it can be fruitful, but we seem to lack a proper understanding of how to carry out such a transfer. Apart from this general uncertainty as to how to do it properly, I think that for linguistics the analogy between phenotypes and linguistic entities is too broad to be successfully applied.

Instead of drawing general analogies between biology and linguistics, it would be more useful to carry out a fine-grained analysis of productive analogies between the two disciplines. By productive, I mean that the analogies should lead to an interdisciplinary transfer of models and methods that increases the insights about the entities in the discipline that imports them. If this is not the case for a given analogy, this does not mean that the analogy is false, but simply that it is unproductive, since an analogy is just a similarity between entities from different domains, and what we define as being "similar" crucially depends on our perspective. With enough imagination, we can draw analogies between all kinds of objects, and we never really know the degree to which we construct rather than detect them, as I have tried to illustrate in the graphic below.

Constructed or detected similarities?

Local productive analogies: alignment analyses

A productive analogy does not necessarily have to be global, offering a full-fledged account of shared similarities, as in the analogies which compare, for example, languages with organisms (Schleicher 1848) or languages with species (Mufwene 2001), or in the analogy between phenotypes and anthropological entities proposed by David. It is likewise possible to find very useful local analogies, which only hold to a certain extent, but offer enough insights to get started.

Consider, for example, the problem of sequence alignment in biology and linguistics. It is clear that both biologists and linguists carry out alignment analyses of some of the entities they are dealing with. Alignment analyses are used in both disciplines because both have to deal with entities that are best modeled as sequences, be it sequences of DNA, RNA, or amino acids in biology, or sequences of sounds in linguistics. In both cases, we are dealing with entities in which a limited number of symbols is linearly ordered, and an alignment analysis is a very intuitive and fruitful way to show which of the symbols in two different sequences correspond.

In this very general point, the analogy between words as sequences of sounds and genes as sequences of nucleic acids holds, and it seems straightforward to think of transferring models and methods between the disciplines (in this case from biology to linguistics, since automatic sequence alignment has a longer tradition in biology).

In the details, however, we will detect differences between biological and linguistic sequences, with the main differences lying in the alphabets (the collections of symbols) from which our sequences are drawn (discussed in more detail in List 2014: 61-75; a small illustration follows the list):
  • Biological alphabets are universal, that is, they are basically the same for all living creatures, while the alphabets of languages are specific for each and every language or dialect.
  • Biological alphabets are limited and small regarding the number of symbols, while linguistic alphabets vary widely and can be very large in size.
  • Biological alphabets are stable over time, with sequences changing by the replacement of symbols with other symbols drawn from the same pool of symbols, while linguistic alphabets are mutable: not only can they acquire new sounds or lose existing ones, but also the sounds themselves can change.
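To make the first point of the list concrete, here is a toy illustration (the miniature word samples and their simplified transcriptions are made up for the purpose): the DNA alphabet is fixed once and for all, whereas a sound inventory has to be extracted separately for every language.

```python
# The DNA alphabet is universal, fixed, and small.
DNA_ALPHABET = {"A", "C", "G", "T"}

# Sound inventories differ from language to language and have to be read
# off the data (space-separated segments in these toy transcriptions).
german_words = ["b aʊ m", "z ɔ n ə"]    # Baum, Sonne (simplified)
english_words = ["t r iː", "s ʌ n"]     # tree, sun (simplified)

def sound_alphabet(words):
    """Collect the set of sound symbols used in a list of segmented words."""
    return {segment for word in words for segment in word.split()}

print(sound_alphabet(german_words))   # one language-specific inventory ...
print(sound_alphabet(english_words))  # ... and a different one
```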

How similar are words and genes in the end?

What are the consequences of these differences in the word-gene analogy? Can we still profit from the long tradition of automatic alignment methods when dealing with phonetic alignment (the alignment of sound sequences, like words or morphemes) in linguistics? Yes, we can! But within limits!

Linguists can profit from the general frameworks for sequence alignment developed in biology, but we need to make sure that we adapt them to our linguistic needs. For alignment methods, this means, for example, that we can use the traditional frameworks of dynamic programming for pairwise alignment, which were developed in the 1970s and early 1980s (Needleman and Wunsch 1970, Smith and Waterman 1981). We can also use some of the frameworks for multiple sequence alignment, which were developed a bit later, starting from the end of the eighties, be it progressive (Feng and Doolittle 1987, Thompson et al. 1994, Notredame et al. 1998), iterative (Barton and Sternberg 1987, Edgar 2004), or probabilistic (Do et al. 2005). But we can only import the overall frameworks, not their details.
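To illustrate what importing the overall framework (but not its details) can look like, here is a minimal sketch of Needleman-Wunsch-style pairwise alignment applied to sound sequences. The scoring scheme (match/mismatch/gap constants) is deliberately crude and of my own choosing; a serious phonetic aligner would replace it with linguistically informed scores, for example based on sound classes, but the dynamic-programming scheme itself is the one borrowed from biology.

```python
# Pairwise global alignment of two sound sequences (lists of segments)
# with the classic Needleman-Wunsch dynamic-programming scheme.
def nw_align(seq_a, seq_b, match=1, mismatch=-1, gap=-1):
    n, m = len(seq_a), len(seq_b)
    # Fill the scoring matrix.
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if seq_a[i - 1] == seq_b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Trace back to recover one optimal alignment.
    aligned_a, aligned_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and seq_a[i - 1] == seq_b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            aligned_a.append(seq_a[i - 1]); aligned_b.append(seq_b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            aligned_a.append(seq_a[i - 1]); aligned_b.append("-")
            i -= 1
        else:
            aligned_a.append("-"); aligned_b.append(seq_b[j - 1])
            j -= 1
    return aligned_a[::-1], aligned_b[::-1]

# German Geburt vs. English birth in simplified transcriptions.
a, b = nw_align(list("gəbʊrt"), list("bɜrθ"))
print(" ".join(a))
print(" ".join(b))
```

The points listed in the next paragraph (context effects, secondary structures, metathesis, unalignable parts) are exactly the places where such a naive scoring scheme breaks down and linguistic adaptations become necessary.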

All algorithms for phonetic alignment that are supposed to be applicable to a wide range of data (and not merely serve as a proof of concept that handles a limited range of test datasets) need to address the specific characteristics of sound sequences. Apart from the differences in alphabet size and the mutable character of sound systems mentioned above, these differences include the important role that context plays in sound change (List 2014: 26-33), the problem of secondary sequence structures (List 2012a), the problem of metathesis (List 2012: 51f), and the problem of unalignable parts resulting from cases of partial and oblique homology in language evolution (see my recent blog post on this issue).

Concluding remarks

Drawing analogies between the research objects of different disciplines is not a bad idea, and it can be very inspiring, as multiple cases in the history of science show. When transferring models and methods from one discipline to another, however, we need to make sure that the analogies we use are productive, adding value to our research and understanding. We should never expect analogies to hold in all details. Instead, we need to be aware of their specific limits, and we need to be willing to adapt the models and methods we transfer to the needs of the target discipline. Only then can we make sure that the analogies we use are really productive in the end.

References

  • Barton, G. J. and M. J. E. Sternberg (1987). “A strategy for the rapid multiple alignment of protein sequences. Confidence levels from tertiary structure comparisons”. J. Mol. Biol. 198.2, 327–337.
  • Do, C. B., M. S. P. Mahabhashyam, M. Brudno, and S. Batzoglou (2005). “ProbCons. Probabilistic consistency-based multiple sequence alignment”. Genome Res. 15, 330–340.
  • Edgar, R. C. (2004). “MUSCLE. Multiple sequence alignment with high accuracy and high throughput”. Nucleic Acids Res. 32.5, 1792–1797.
  • Feng, D. F. and R. F. Doolittle (1987). “Progressive sequence alignment as a prerequisite to correct phylogenetic trees”. J. Mol. Evol. 25.4, 351–360.
  • List, J.-M. (2014). Sequence comparison in historical linguistics. Düsseldorf: Düsseldorf University Press.  
  • List, J.-M. (2012a). "Improving phonetic alignment by handling secondary sequence structures". In: Hinrichs, E. and Jäger, G.: Computational approaches to the study of dialectal and typological variation. Working papers submitted for the workshop organized as part of the ESSLLI 2012. 
  • List, J.-M. (2012b). “Multiple sequence alignment in historical linguistics. A sound class based approach”. In: Proceedings of ConSOLE XIX. “The 19th Conference of the Student Organization of Linguistics in Europe” (Groningen, 01/05–01/08/2011). Ed. by E. Boone, K. Linke, and M. Schulpen, 241–260.
  • Mufwene, S. S. (2001): The ecology of language evolution. Cambridge: Cambridge University Press.
  • Needleman, S. B. and C. D. Wunsch (1970). “A general method applicable to the search for similarities in the amino acid sequence of two proteins”. J. Mol. Biol. 48, 443–453.
  • Notredame, C., L. Holm, and D. G. Higgins (1998). “COFFEE. An objective function for multiple sequence alignment”. Bioinformatics 14.5, 407–422.
  • Schleicher, A. (1848). Zur vergleichenden Sprachengeschichte [On comparative language history]. Bonn: König.
  • Smith, T. F. and M. S. Waterman (1981). “Identification of common molecular subsequences”. J. Mol. Biol. 147, 195–197.
  • Thompson, J. D., D. G. Higgins, and T. J. Gibson (1994). “CLUSTAL W. Improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice”. Nucleic Acids Res. 22.22, 4673–4680.