From rhymes to networks (A new blog series in six steps)


Whenever one feels stuck in solving a particular problem, it is useful to split the problem into parts, in order to identify exactly where the difficulties lie. The problem vexing me at the moment is how to construct a network of rhymes from a set of annotated poems, either by one and the same author, or by many authors who wrote during the same epoch in a certain country using a certain language.

For me, a rhyme network is a network in which words (or parts of words) occur as nodes, and weighted links between the nodes indicate how often the linked words have been found to rhyme in a given corpus.
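To make this definition concrete, here is a minimal sketch in Python of how such a network could be built with the networkx library; the input format (one list of rhyme words per annotated rhyme group) and the example words are my own simplifying assumptions, not a fixed standard.

```python
# Minimal sketch: build a weighted rhyme network from annotated rhyme groups.
# The input format (one list of words per rhyme group) is an assumption.
from itertools import combinations
import networkx as nx

rhyme_groups = [                     # each inner list is one annotated rhyme group
    ["light", "night", "bright"],
    ["night", "delight"],
    ["light", "night"],
]

G = nx.Graph()
for group in rhyme_groups:
    # every pair of words in a group counts as one rhyme co-occurrence
    for w1, w2 in combinations(sorted(set(group)), 2):
        if G.has_edge(w1, w2):
            G[w1][w2]["weight"] += 1
        else:
            G.add_edge(w1, w2, weight=1)

print(nx.number_connected_components(G))  # 1: everything rhymes together here
print(G["light"]["night"]["weight"])      # 2: linked in two rhyme groups
```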

An example

As an example, the following figure illustrates this idea for the case of two Chinese poems, where the rhyme words represented by Chinese characters are linked to form a network (taken from List 2016).


Figure 1: Constructing a network of rhymes in Chinese poetry (List 2016)

One may think that it is silly to make a network from rhymes. However, experiments on Chinese rhyme networks (on which I have reported in the past) have proven to be quite interesting, specifically because they almost always show one large connected component. I find this fascinating, since I would have expected that we would see multiple connected components, representing very distinct rhymes.

It is obvious that some writers don't have a good feeling for rhymes and fail royally when they try to rhyme — this happens across all languages and cultures in which rhyming plays a role. However, it was much less obvious to me that rhyming can be seen to form at least some kind of a continuum, as you can see from the rhyme networks that we have constructed from Chinese poetry in the past (taken from List et al. 2017).


Figure 2: A complete rhyme network of poems in the Book of Odes (ca. 1000 BC, List et al. 2017)

The current problem

My problem now is that I do not know how to do the same for rhyme collections in other languages. During recent months, I have thought a lot about the problem of constructing rhyme networks for languages such as English or German. However, I always came to a point where I felt stuck, where I realized that I actually did not know at all how to deal with this.

I thought, first, that I could write one blog post listing the problems; but the more I thought about it, the more I realized that there were so many problems that I could barely do it in one blog post. So, I decided that I could just do another series of blog posts (after the nice experience with the series on open problems in computational diversity linguistics that I posted last year), but this time devoted solely to the question of how one can get from rhymes to networks.

So, for the next six months, I will discuss the four major issues that keep me from presenting German or English rhyme networks here and now. I hope that, at the end of this discussion, I may even have solved the problem, so that I will then be able to present a first rhyme network of Goethe, Shakespeare, or Bob Dylan. (I would not do Eminem, as the rhymes are quite complex and tedious to annotate.)

Summary of the series

Before we can start to think about the modeling of rhyme patterns in rhymed verse, we need to think about the problem in general, and discuss how rhyming shows up in different languages. So, I will start the series with the problem of rhyming in general, by discussing how languages rhyme, where these practices differ, and what we can learn from these differences. Having looked into this, we can think about ways of annotating rhymes in texts in order to acquire a first corpus of examples. So, the following post will deal with the problems that we encounter when trying to annotate the rhyme words that we identify in poetry collections.

If one knows how to annotate something, one will sooner or later get impatient, and long for faster ways to do these boring tasks. Since this also holds for the manual annotation of rhyme collections (which we need for our rhyme networks), it is natural to think about automated ways of finding rhymes in corpora — that is, to think about the inference of rhyme patterns, which can, of course, also be done semi-automatically. So the major problems related to automated rhyme detection will be discussed in a separate post.

Once this is worked out, and one has a reasonably large corpus of rhyme patterns, one wants to analyze it — and the way I want to analyze annotated rhyme corpora is with the help of network models. But, as I mentioned before, I realized that I was stuck when I started to think about rhyme networks of German and English (which, one should think, are relatively easy languages). So, it will be important to discuss clearly what seems to be the best way to construct rhyme networks as a first step of analysis. This will therefore be dealt with in a separate blog post. In a final post, I then plan to tackle the second analysis step, by discussing very briefly what one can do with rhyme networks.

All in all, this makes for six posts (including this one); so we will be busy for the next six months, thinking about rhymes and poetry, which is probably not the worst thing one can do. I hope, but I cannot promise at this point, that this gives me enough time to stick to my ambitious annotation goals, and then present you with a real rhyme network of some poetry collection, other than the Chinese ones I already published in the past.

References

List, Johann-Mattis and Pathmanathan, Jananan Sylvestre and Hill, Nathan W. and Bapteste, Eric and Lopez, Philippe (2017) Vowel purity and rhyme evidence in Old Chinese reconstruction. Lingua Sinica 3.1: 1-17.

List, Johann-Mattis (2016) Using network models to analyze Old Chinese rhyme data. Bulletin of Chinese Linguistics 9.2: 218-241.

Open problems in computational diversity linguistics: Conclusion and Outlook


One year has now passed since I discussed with David the idea of devoting a whole year of twelve blog posts to the topic of "Open problems in computational diversity linguistics". It is time to look back at this year and the topics that have been discussed.

Quantitative view

The following table lists the pageviews (or clicks) for each blog post (with all caveats as to what this actually entails), from January to November.

Problem | Month     | Title                                      | Clicks | Comments
0       | January   | Introduction                               | 535    | 4
1       | February  | Automatic morpheme detection               | 718    | 0
2       | March     | Automatic borrowing detection              | 422    | 1
3       | April     | Automatic sound law induction              | 522    | 2
4       | May       | Automatic phonological reconstruction      | 517    | 0
5       | June      | Simulation of lexical change               | 269    | 0
6       | July      | Simulation of sound change                 | 423    | 0
7       | August    | Statistical proof of language relatedness  | 383    | 1
8       | September | Typology of semantic change                | 372    | 2
9       | October   | Typology of sound change                   | 250    | 3
10      | November  | Typology of semantic promiscuity           | 217    | 2

The first thing to note is that people might have gotten tired of the problems, since the last two posts were not very well received in terms of readers (or not yet, anyway). One should, however, not forget that the numbers of clicks recorded by the system are cumulative, so an older post may have received more readers simply because it has been online for a longer time.

What does seem interesting, however, is the rather high number of readers for the February post; this appears to be related to the topic rather than the content. Morpheme detection is considered a very interesting problem by many practitioners of Natural Language Processing (NLP), and the field of NLP generally has many more followers than the field of historical linguistics.

Reader comments and discussions

For a few of the posts, I received interesting comments, and I replied to all of them where I found that a reply was in order. A few of them are worth emphasizing here.

As a first comment, in March, Guillaume Jacques replied in the form of a blog post of his own, in which he proposed a very explicit method for the detection of borrowings, which assumes that the data being compared include an ancestral language available in written sources (see here for the post). Since it will still take some time to prepare the data in the manner proposed by Guillaume, I have not had time to test this method for myself, but it is a very nice example of a new method for borrowing detection, which addresses one specific data type and has so far not been tested.

Thomas Pellard provided a very useful comment on my April post, emphasizing that automatic reconstruction based on regular expressions (as I had proposed it, more or less, as a riddle that should be solved) requires a "very precise chronology (order) of the sound changes", as well as "a perfect knowledge of all the sound changes having occurred". He concluded that a "regular expression-based approach may thus be rather suited for the final stage of a reconstruction rather than for exploratory purposes". What is remarkable about this comment is that it partly contradicts (at least in my opinion) the classical doctrine of historical language comparison, since we often assume that linguists apply their "sound laws" perfectly well, being able to explain the history of a given set of languages in full detail. The sparsity of the available literature, and the problems that even small experiments encounter, show that the idea of completely regular sound change that can be laid out in the form of transducers has always remained an idea, but was never really put into practice. It seems that it is time to leave the realm of theory and do more practical research on sound change, as suggested by Thomas.

In response to my post on problem number 7 (August), the statistical proof of language relatedness, Guillaume Jacques wrote: "although most historical linguists see inflectional morphology as the most convincing evidence for language relatedness, it is very difficult to conceive a statistical test that could be applied to morphological paradigms in any systematic way cross-linguistically". I think he is completely right on this point.

J. Pystynen made a very good point with respect to my post on the typology of semantic change (September), mentioning that semantic change may, similarly to sound change, also be subject to dynamics resulting from the fact that the lexicon of a given language at a given time is a system whose parts are determined by their relation to each other.

David Marjanović criticized my use (in October) of the Indo-European laryngeals as an example to make clear that the abstractionalist-realist debate about sound change has an impact on what scholars actually reconstruct, and that they are often content not to specify concrete sound values further, as long as they can be sure that there are distinct values for a given phenomenon. His main point was that — in his opinion — the reconstruction of sound values for the Indo-European laryngeals is much clearer than I presented it in my post. I think that Marjanović misunderstood the point I wanted to make; and I also think that he is not right regarding the certainty with which we can determine sound values for the laryngeal sounds.

As a last and very long comment, in November, Alex(andre) François (I assume that it was him, but he only left his first name) provided excellent feedback on the last problem, which I had labelled the problem of establishing a typology of "semantic promiscuity". Alex argues that I overemphasized the role of semantics in the discussion, and that the phenomenon I described might better be labelled the "lexical yield of roots". I think that he is right in this criticism, but I am not sure whether the term "lexical yield" is better than the notion of promiscuity. Given that we are searching for a counterpart of the mostly form-based term "productivity", which furthermore focuses on grammatical affixes, the term "promiscuity" emphasizes the success of certain form-concept pairs at being recycled during the process of word formation. Alex is right that we are in fact talking about the root here, a linguistic concept that is — unfortunately — not very strictly defined in linguistics. For the time being, I would propose either the term "root promiscuity" or "lexical promiscuity", but would avoid the term "yield", since it sounds too static to me.

Advances on particular problems

Although the problems that I posted are personal, and I am keen to try tackling them in at least some way in the future, I have not yet managed to advance on any of them in particular.

I have experimented with new approaches to borrowing detection, which are not yet in a state where they could be published, but the experiments helped me to re-think the whole matter in detail. Parts of the ideas shared in that blog post also appeared, in a deeper discussion, in an article that was published this year (List 2019a).

I have also played with the problem of morpheme detection, but none of the different approaches has been really convincing so far. However, I am still convinced that we can do better than "meaning-less" NLP approaches (which try to infer morphology from dictionaries alone, ignoring any semantic information).

A peripheral thought on automated phonological reconstruction, focusing on the question of how to evaluate a set of automated reconstructions against a set of human-annotated gold standard data, has now been published (List 2019b) as a comment on a target study by Jäger (2019). While my proposal can solve cases where two reconstruction systems differ only in their segment-wise phonological information, I had to conclude my comment by admitting that there are cases where two sets of words in different languages are equivalent in their structure, but not identical. Formally, that means that structurally identical sets of segmented strings can be converted from one set to the other with the help of simple replacement rules, while structurally equivalent sets of segmented strings (I am still unsure whether the two terms are well chosen) may require additional context rules.

Although I tried to advance on most of the problems mentioned throughout the year, and I carried out quite a few experiments, most of the things that I tested were not conclusive. Before I discuss them in detail, I should make sure that they actually work, or provide a larger study that emphasizes and explains why they do not work. At this stage, however, any sharing of information on the different experiments I ran would be premature, leading to confusion rather than clarification.

Strategies for problem solving

Those of you who have followed my treatment of all the problems over the year will have seen that I tend to be very careful about delegating problem solutions to classical machine learning approaches. I do this because I am convinced that most of the problems that I mentioned and discussed can, in fact, be handled in a very concrete manner. When dealing with problems that one thinks can ultimately be solved by an algorithm, one should not start by developing a machine learning algorithm, but rather search for the algorithm that really solves the problem.

Nobody would develop a machine learning approach to replace an abacus, although this may in fact be possible. In the same way, I believe that the practice of historical linguistics has sufficiently shown that most of the problems can be solved with the help of concrete methods, with the exception, perhaps, of phylogenetic reconstruction (see, for example, my graph-based solution for the sound correspondence pattern detection problem, presented in List 2019c). For this reason, I prefer to work on concrete solutions, avoiding probabilistic approaches or black-box methods, such as neural networks.


Retrospect and outlook

In retrospect, I enjoyed the series a lot. It had the advantage of being easier to plan, as I knew in advance what I had to write about. It was, however, also tedious at times, since I knew I could not just talk about a seemingly simpler topic in my monthly post, but had to develop the problem and share all of my thoughts on it. In some situations, I had the impression that I had failed, since I realized that there was not enough time to really think everything through. Here, the comments of colleagues were quite helpful.

Content-wise, the idea of looking at our field through the lens of unsolved problems turned out to be very useful. For quite a few of the problems, I have initial ideas (as I tried to indicate each time); and maybe there will be time in the next years to test them concretely, and potentially even to cross one or another problem off the big list.

Writing a series instead of a collection of unrelated posts turned out to have definite advantages. With my monthly goal of writing at least one contribution for the Genealogical World of Phylogenetic Networks, I never faced the problem of having to think hard about a topic that might be interesting for a broader readership. While this happened in the past, blog series have the disadvantage of not allowing for flexibility when something interesting comes up, especially if one sticks to one post per month and reserves this post for the series.

For the next year, I am still considering writing another series, but maybe this time I will handle it less strictly, allowing some room for surprise, since this, too, is one of the major advantages of writing scientific blogs: one is never really bound to follow beaten tracks.

But for now, I am happy that the year is over, since 2019 has been very busy for me in terms of work. Since this is the final post for the year, I would like to take the chance to thank all who read the posts, and specifically also all those who commented on them. But my greatest thanks go to David for being there, as always, reading my texts, correcting my errors in writing, and giving active feedback in the form of interesting and inspiring comments.

References

Jäger, Gerhard (2019) Computational historical linguistics. Theoretical Linguistics 45.3-4: 151-182.

List, Johann-Mattis (2019a) Automated methods for the investigation of language contact situations, with a focus on lexical borrowing. Language and Linguistics Compass 13.e12355: 1-16.

List, Johann-Mattis (2019b) Beyond Edit Distances: Comparing linguistic reconstruction systems. Theoretical Linguistics 45.3-4: 1-10.

List, Johann-Mattis (2019c) Automatic inference of sound correspondence patterns across multiple languages. Computational Linguistics 45.1: 137-161.

Typology of sound change (Open problems in computational diversity linguistics 9)


We are getting closer to the end of my list of open problems in computational diversity linguistics. After this post, there is only one left, for November, followed by an outlook and a wrap-up in December.

In last month's post, devoted to the typology of semantic change, I discussed the general aspects of a typology in linguistics, or — to be more precise — how I think linguists use the term. One of the necessary conditions for a typology to be meaningful is that the phenomenon in question shows enough similarities across the languages of the world, so that patterns or tendencies can be identified regardless of the historical relations between human languages.

Sound change in this context refers to a very peculiar phenomenon observed in the change of spoken languages, by which certain sounds in the inventory of a given language change their pronunciation over time. This may occur across all of the words in which these sounds recur, or only in those words where the sounds occur in specific phonetic contexts.

As I have discussed this phenomenon in quite a few past blog posts, I will not discuss it any further here, but will simply state the specific task that this problem entails:
Assuming (if needed) a given time frame, in which the change occurs, establish a general typology that informs about the universal tendencies by which sounds occurring in specific phonetic environments are subject to change.
Note that my view of "phonetic environment" in this context includes an environment that captures all possible contexts. When confronted with a sound change that seems to affect a sound in the same way in all phonetic contexts in which it occurs, linguists often speak of "unconditioned sound change", as they do not find any apparent condition for the change to happen. For a formal treatment, however, this is unsatisfying, since the lack of a specific phonetic environment is also a condition of sound change.

Why it is hard to establish a typology of sound change

As is also true for semantic change, discussed as Problem 8 last month, there are three major reasons why it is hard to establish a typology of sound change. As a first problem, we find, again, the issue of acquiring the data needed to establish the typology. As a second problem, it is also not clear how to handle the data appropriately in order to allow us to study sound change across different language families and different times. As a third problem, it is also very difficult to interpret sound change data when trying to identify cross-linguistic tendencies.

Problem 1

The problem of acquiring data about sound change processes in sufficient quantity is very similar to the corresponding problem for semantic change: most of what we know about sound change has been inferred by comparing languages, and we do not know how confident we can be with respect to those inferences. While semantic change is considered notoriously difficult to handle (Fox 1995: 111), scholars generally have more confidence in sound change and the power of linguistic reconstruction. The question remains, however, as to how confident we can really be, which divides the field into the so-called "realists" and the so-called "abstractionalists" (see Lass 2017 for a recent discussion of the debate).

As a typical representative of abstractionalism in linguistic reconstruction, consider the famous linguist Ferdinand de Saussure, who emphasized that the real sound values which scholars reconstructed for proposed ancient words in unattested languages like, for example, Indo-European, could just as well be replaced by numbers or other characters serving as identifiers (Saussure 1916: 303). The fundamental idea here, when reconstructing a word for a given proto-language, is that a reconstruction does not need to inform us about the likely pronunciation of a word, but rather about the structure of the word in contrast to other words.

This aspect of historical linguistics is often difficult to discuss with colleagues from other disciplines, since it seems to be very peculiar, but it is very important in order to understand the basic methodology. The general idea of structure versus substance is that, once we accept that the words in a language are built by drawing letters from an alphabet, the letters themselves do not have a substantial value, but only a value in contrast to other letters. This means that a sequence such as "ABBA" can be seen as structurally identical with "CDDC" or "OTTO". The similarity should be obvious: we have the same letter at the beginning and the end of each word, and the same letter repeated in the middle of each word (see List 2014: 58f for a closer discussion of this type of similarity).
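To make this notion of structural identity concrete, here is a minimal sketch in Python: two sequences count as structurally identical if one can be mapped onto the other by a consistent one-to-one replacement of symbols. The function name is my own, chosen for illustration.

```python
# Minimal sketch: test whether two sequences are structurally identical,
# i.e. convertible into each other by a one-to-one symbol replacement.
def structurally_identical(seq_a, seq_b):
    if len(seq_a) != len(seq_b):
        return False
    forward, backward = {}, {}
    for a, b in zip(seq_a, seq_b):
        # each symbol must map to exactly one symbol, in both directions
        if forward.setdefault(a, b) != b or backward.setdefault(b, a) != a:
            return False
    return True

print(structurally_identical("ABBA", "CDDC"))  # True
print(structurally_identical("ABBA", "OTTO"))  # True
print(structurally_identical("ABBA", "ABAB"))  # False
```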

Since sequence similarity is usually not discussed in purely structural terms, the abstract view of correspondences, as it is maintained by many historical linguists, is often difficult to discuss across disciplines. The reason why linguists tend to maintain it is that languages change not only their words by mutating individual sounds: whole sound systems change, and new sounds can be gained or lost during language evolution (see my blog post from March 2018 for a closer elaboration of the problem of sound change).

It is important to emphasize, however, that despite prominent abstractionalists such as Ferdinand de Saussure (1857-1913), and in part also Antoine Meillet (1866-1936), the majority of linguists think more realistically about their reconstructions. The reason is that the composition of words based on sounds in the spoken languages of the world usually follows specific rules, so-called phonotactic rules. These may vary to quite some degree among languages, but are also restricted by natural limits of pronounceability. Thus, although languages may show impressively long chains of one consonant following another, there is a certain limit to the number of consonants that can follow each other without a vowel. Sound change is thus believed to originate roughly in either production (speakers want to pronounce things in a simpler, more convenient way) or perception (listeners misunderstand words and store erroneous variants; see Ohala 1989 for details). Therefore, a reconstruction of a given sound system based on the comparison of multiple languages gains power from a realistic interpretation of sound values.

The problem with the abstractionalist-realist debate, however, is that linguists usually conduct some kind of a mixture between the two extremes. That means that they may reconstruct very concrete sound values for certain words, where they have very good evidence, but at the same time they may come up with abstract values that serve as placeholders in the absence of better evidence. The most famous example are the Indo-European "laryngeals", whose existence is beyond doubt for most historical linguists, but whose sound values cannot be reconstructed with high reliability. As a result, linguists tend to spell them with subscript numbers as *h₁, *h₂, and *h₃. Any attempt to assemble data about sound change processes in the languages of the world needs to find a way to cope with the different degrees of evidence we find in linguistic analyses.

Problem 2

This leads us directly to our second problem, that of handling sound change data appropriately in order to study sound change processes. Given that many linguists propose changes in the typical A > B / C (A becomes B in context C) notation, a possible way of establishing a first database of sound changes would consist of typing up these changes from the literature and making a catalog out of them. Apart from the interpretation of the data in abstractionalist-realist terms, however, such a way of collecting the data would have a couple of serious shortcomings.
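Before turning to the shortcomings, a minimal sketch may help to show what cataloguing rules in this notation amounts to. The rule format below (only the preceding segment as context, "#" for the word boundary, "_" for any context) is a deliberate simplification of my own, far poorer than what real sound laws require.

```python
# Minimal sketch: apply a sound change rule in A > B / C notation to a
# segmented word. Context here is just the preceding segment ("#" = word
# boundary, "_" = any context, i.e. an "unconditioned" change).
def apply_rule(word, source, target, context="_"):
    out = []
    for i, segment in enumerate(word):
        preceding = word[i - 1] if i > 0 else "#"
        if segment == source and context in ("_", preceding):
            out.append(target)
        else:
            out.append(segment)
    return out

# p > f in all contexts (unconditioned change)
print(apply_rule(["p", "a", "p", "a"], "p", "f"))        # ['f', 'a', 'f', 'a']
# t > ts only word-initially (context "#")
print(apply_rule(["t", "a", "t", "a"], "t", "ts", "#"))  # ['ts', 'a', 't', 'a']
```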

First, it would mean that the analysis of the linguist who proposed the sound change is taken as final, although we often find many debates about the specific triggers of sound change, and it is not clear whether there would be alternative sound change rules that could apply just as well (see Problem 3, on the task of automatic sound law induction, for details). Second, as linguists tend to report only what changes, while disregarding what does not change, we would face the same problem as in the traditional study of semantic change: the database would suffer from a sampling bias, as we could not learn anything about the stability of sounds. Third, since sound change depends not only on production and perception, but also on the system of the language in which the sounds are produced, listing sounds deprived of examples in real words would most likely make it impossible to take these systemic aspects of sound change into account.

Problem 3

This last point now leads us to the third general difficulty, the question of how to interpret sound change data, assuming that one has had the chance to acquire enough of it from a reasonably large sample of spoken languages. If we look at the general patterns of sound change observed for the languages of the world, we can distinguish two basic conditions of sound change: phonetic conditions and systemic conditions. Phonetic conditions can be further subdivided into articulatory (= production) and acoustic (= perception) conditions. When trying to explain why certain sound changes can be observed more frequently across different languages of the world, many linguists tend to invoke phonetic factors. If the sound p, for example, turns into an f, this is not necessarily surprising, given the strong similarity of the two sounds.

But similarity can be measured in two ways: one can compare the similarity with respect to the production of a sound by a speaker, and with respect to the perception of the sound by a listener. While the production of sounds is traditionally seen as the more important factor contributing to sound change (Hock 1991: 11), there are clear examples of sound change due to misperception and re-interpretation by listeners (Ohala 1989: 182). Some authors go as far as to claim that production-driven changes reflect regular internal language change, which happens gradually during acquisition or (depending on the theory) also in later stages (Bybee 2002), while perception-based changes rather reflect change happening in second-language acquisition and language contact (Mowrey and Pagliuca 1995: 48).

While the interaction of production and perception has been discussed in some detail in the linguistic literature, the influence of systemic factors has so far only rarely been considered. What I mean by this factor is the idea that certain changes in language evolution may be explained exclusively as resulting from systemic constellations. As a straightforward example, consider the difference in design space for the production of consonants, vowels, and tones. In order to maintain pronounceability and comprehensibility, it is useful for the sound system of a given language to fill those spots in the design space that are maximally different from each other. The larger the design space and the smaller the inventory, the easier it is to guarantee its functionality. Since the design spaces for vowels and tones are much smaller than for consonants, however, these sub-systems are more easily disturbed, which could be used to explain the presence of chain shifts of vowels, or flip-flops in tone systems (Wang 1967: 102). Systemic considerations play an increasingly important role in evolutionary theory and, as shown in List et al. (2016), can also be used as explanations for phenomena as strange as Sapir's drift (Sapir 1921).
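As a toy illustration of the design-space argument (with invented one-dimensional coordinates, not real phonetic values), consider how the minimal distance between inventory members shrinks when more members have to fit into the same space:

```python
# Toy illustration: in a fixed design space, a larger inventory forces
# its members closer together, so the system is more easily disturbed.
# The one-dimensional coordinates are invented, not phonetic values.
from itertools import combinations

def min_pairwise_distance(inventory):
    return min(abs(a - b) for a, b in combinations(inventory, 2))

five = [i / 4 for i in range(5)]   # five sounds spread over the space
ten = [i / 9 for i in range(10)]   # ten sounds in the same space
print(min_pairwise_distance(five))  # 0.25
print(min_pairwise_distance(ten))   # ~0.11: much less room for error
```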

However, the crucial question, when trying to establish a typology of sound change, is how these different effects could be measured. I think it is obvious that collections of individual sound changes proposed in the literature are not enough. But what data would be sufficient or needed to address the problem is not entirely clear to me either.

Traditional approaches

As the first traditional approach to the typology of sound change, one should mention the intuition inside the heads of the numerous historical linguists who study particular language families. Scholars trained in historical linguistics usually start to develop some kind of intuition about likely and unlikely tendencies in sound change, and for the most part they also agree. The problem with this intuition, however, is that it is not explicit, and it seems even that it was never the intention of the majority of historical linguists to make their knowledge explicit. The reasons for this reluctance with respect to formalization and transparency are two-fold. First, given that every individual has invested quite some time in order to grow their intuition, it is possible that the idea of having a resource that distributes this intuition in a rigorously data-driven and explicit manner yields a typical feeling of envy in quite a few people, who may then think: "I had to invest so much time in order to learn all this by heart. Why should young scholars now get all this knowledge for free?" Second, given the problems outlined in the previous section, many scholars also strongly believe that it is impossible to formalize the problem of sound change tendencies.

By far the largest traditional study of the typology of sound change is Kümmel's (2008) book Konsonantenwandel (Consonant Change), in which the author surveys sound change processes discussed in the literature on Indo-European and Semitic languages. As the title of the book suggests, it concentrates on the change of consonants, which are (probably due to their larger design space) also the class of sounds that shows the stronger cross-linguistic tendencies. The book is based on a thorough inspection of the literature on consonant change in Indo-European and Semitic linguistics. The procedure by which this collection was carried out can be seen as the gold standard that any future attempt to enlarge the collection should follow.

What is specifically important, and also very difficult to achieve, is the harmonization of the evidence, which is nicely reflected in Kümmel's introduction, where he mentions that one of the main problems was to determine what the scholars actually meant with respect to phonetics and phonology, when describing certain sound changes (Kümmel 2008: 35). The major drawback of the collection is that it is not (yet) available in digital form. Given the systematicity with which the data was collected, it should be generally possible to turn the collection into a database; and it is beyond doubt that this collection could offer interesting insights into certain tendencies of sound change.

Another collection of sound changes from the literature is the mysterious Index Diachronica, a collection of sound changes for various language families by a person who wishes to remain anonymous. By now, this collection even has a Searchable Index that allows scholars to click on a given sound and to see in which languages this sound is involved in some kind of sound change. What is a pity about the resource is that it is difficult to use, given that one does not really know where it actually comes from, and how the information was extracted from the sources. If the anonymous author would only put it (albeit anonymously, or under a pseudonym) on a public preprint server, such as, for example, Humanities Commons, this would be excellent, as it would give those who are interested in pursuing the idea of collecting sound changes from the literature a starting point to check the sources and to further digitize the resource.

Right now, this resource seems to be mostly used by conlangers, i.e., people who create artificial languages as a hobby (or profession). Conlangers are often refreshingly pragmatic, and may come up with very interesting and creative ideas about how to address certain data problems in linguistics, which "normal" linguists would refuse to do. There is a certain tendency in our field to ignore certain questions, either because scholars think it would be too tedious to collect the data needed to address them, or because they consider it impossible to do so "correctly" from the start.

As a last and fascinating example, I have to mention the study by Yang and Xu (2019), in which the authors review studies of concrete examples of tone change in South-East Asian languages, trying to identify cross-linguistic tendencies. Before I read this study, I was not aware that tone change had at all been studied concretely, since most linguists consider the evidence for any kind of tendency far too shaky, and reconstruct tone exclusively as an abstract entity. The survey by Yang and Xu, however, shows clearly that there seem to be at least some tendencies, and that they can be identified by invoking a careful degree of abstraction when comparing tone change across different languages.

For the reasons outlined above, I do not think that a collection of sound change examples from the literature addresses the problem of establishing a typology of sound change. Specifically, the fact that such collections usually provide neither tangible examples nor frequencies of a given sound change within the language where it occurred, nor any information on the tendencies of sounds to resist change, is a major drawback, and a major loss of evidence during data collection. Nevertheless, I consider these efforts valuable and important contributions to our field. Given that they allow us to learn a lot about some very general and well-confirmed tendencies of sound change, they are also an invaluable source of inspiration when it comes to working on alternative approaches.

Computational approaches

To my knowledge, there are no real computational approaches to the study of sound change so far. What one should mention, however, are initial attempts to measure certain aspects of sound change automatically. Thus, Brown et al. (2013) measure sound correspondences across the world's languages, based on a collection of 40-item wordlists for a very large sample of languages. The limitations of this study lie in the restricted alphabet being used: all languages are represented by a reduced transcription system of some 40 letters, called the ASJP code. While the code originally allowed representing more than just 40 sounds, since the graphemes can be combined, the collection was carried out inconsistently for different languages, which has now led to the situation that the majority of computational approaches treat each letter as a single sound, or consider only the first element of complex grapheme combinations.

While sound change is a directional process, sound correspondences reflect the correspondence of sounds in different languages as a result of sound change, and it is not trivial to extract directional information from sound correspondence data alone. Thus, while the study of Brown et al. is a very interesting contribution, also providing a very straightforward methodology, it does not address the actual problem of sound change.

The study also has other limitations. First, the approach measures only those cases where sounds differ in two languages, so we face the familiar problem that we cannot tell how likely it is that two identical sounds correspond. Second, the study ignores the phonetic environment (or context), which is an important factor in sound change tendencies (some sound changes, for example, tend to occur only at word endings). Third, the study considers only sound correspondences across language pairs, while it is clear that one can often find stronger evidence for sound correspondences when looking at multiple languages (List 2019).
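To illustrate the first of these points, here is a minimal sketch of pairwise correspondence counting that also counts identical matches; the aligned cognates (German and English) are real, but the segmentation is deliberately simplified.

```python
# Minimal sketch: count sound correspondences from aligned cognate pairs,
# including identical matches ("-" marks a gap). Segmentation simplified.
from collections import Counter

alignments = [
    (["ts", "ai", "t"], ["t", "ai", "d"]),  # German Zeit : English tide
    (["ts", "e:", "n"], ["t", "e", "n"]),   # German zehn : English ten
]

counts = Counter()
for seq_a, seq_b in alignments:
    for a, b in zip(seq_a, seq_b):
        if a != "-" and b != "-":
            counts[a, b] += 1

print(counts.most_common(3))  # [(('ts', 't'), 2), (('ai', 'ai'), 1), (('t', 'd'), 1)]
```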

Initial ideas for improvement

What we need in order to address the problem of establishing a true typology of sound change processes, are, in my opinion:
  1. a standardized transcription system for the representation of sounds across linguistic resources,
  2. increased amounts of readily coded data that adhere to the standard transcription system and list cognate sets of ancestral and descendant languages,
  3. good, dated phylogenies that allow us to measure how often sound changes appear in a certain time frame,
  4. methods to infer the sound change rules (Problem 3), and
  5. improved methods for ancestral state reconstruction that would allow us to identify sound change processes not only for the root and the descendant nodes, but also for intermediate stages.
It is possible that even these five points are not yet enough, as I am still trying to think about how one should best address the problem. But what I can say for sure is that one needs to address the problem step by step, starting with the issue of standardization — and that the only way to account for the problems mentioned above is to collect the pure empirical evidence on sound change, not the summarized results discussed in the literature. Thus, instead of noting that some source states that in German t became ts at some point, I want to see a dataset that provides this in the form of concrete examples, large enough to show the regularity of the findings, and ideally also listing the exceptions.
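A minimal sketch of what I mean: from concrete word pairs, the frequency and regularity of a change can be measured directly, rather than taken on trust. The miniature dataset below (pseudo-transcriptions loosely modeled on the German shift of word-initial t to ts) is invented for illustration only.

```python
# Minimal sketch: measure frequency and regularity of a change directly
# from concrete examples instead of a summary statement. The miniature
# dataset (pseudo-transcriptions of older vs. modern German forms) is
# invented for illustration.
pairs = [
    ("tehun", "tse:n"),   # "ten"    : zehn
    ("twai", "tsvai"),    # "two"    : zwei
    ("tunga", "tsunge"),  # "tongue" : Zunge
    ("dag", "ta:k"),      # "day"    : Tag (d > t, a different change)
]

with_initial_t = [(o, n) for o, n in pairs if o.startswith("t")]
changed = [(o, n) for o, n in with_initial_t if n.startswith("ts")]
print(f"t > ts word-initially: {len(changed)} of {len(with_initial_t)} cases")
```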

The advantage of this procedure is that the collection is independent of the typical errors that usually occur when data are collected from the literature (often by employing armies of students who do the "dirty" work for the scientists). It would also be independent of individual scholars' interpretations. Furthermore, it would be exhaustive — that is, one could measure not only the frequency of a given change, but also its regularity, the conditioning context, and the systemic properties.

The disadvantage is, of course, the need to acquire standardized data in a large-enough size for a critical number of languages and language families. But, then again, if there were no challenges involved in this endeavor, I would not present it as an open problem of computational diversity linguistics.

Outlook

With the newly published database of Cross-Linguistic Transcription Systems (CLTS, Anderson et al. 2018), the first step towards a rigorous standardization of transcription systems has already been made. With our efforts towards a standardization of wordlists, which can also be applied in the form of a retro-standardization to existing data (Forkel et al. 2018), we have proposed a further step towards collecting lexical data efficiently for a large sample of the world's spoken languages (see also List et al. 2018). Work on automated cognate detection and workflows for computer-assisted language comparison has also drastically increased the efficiency of historical language comparison.

So, we are advancing towards a larger collection of high-quality and historically compared datasets; and it is quite possible that, in a couple of years from now, we will arrive at a point where a typology of sound change is no longer a dream of mine and many colleagues', but something that may actually be extracted from historically annotated cross-linguistic data. But until then, many issues remain unsolved; and in order to address these, it would be useful to work towards pilot studies, to see how well the ideas for improvement outlined above can actually be implemented.

References

Anderson, Cormac and Tresoldi, Tiago and Chacon, Thiago Costa and Fehn, Anne-Maria and Walworth, Mary and Forkel, Robert and List, Johann-Mattis (2018) A Cross-Linguistic Database of Phonetic Transcription Systems. Yearbook of the Poznań Linguistic Meeting 4.1: 21-53.

Brown, Cecil H. and Holman, Eric W. and Wichmann, Søren (2013) Sound correspondences in the world's languages. Language 89.1: 4-29.

Bybee, Joan L. (2002) Word frequency and context of use in the lexical diffusion of phonetically conditioned sound change. Language Variation and Change 14: 261-290.

Forkel, Robert and List, Johann-Mattis and Greenhill, Simon J. and Rzymski, Christoph and Bank, Sebastian and Cysouw, Michael and Hammarström, Harald and Haspelmath, Martin and Kaiping, Gereon A. and Gray, Russell D. (2018) Cross-Linguistic Data Formats, advancing data sharing and re-use in comparative linguistics. Scientific Data 5.180205: 1-10.

Fox, Anthony (1995) Linguistic Reconstruction. An Introduction to Theory and Method. Oxford: Oxford University Press.

Hock, Hans Henrich (1991) Principles of Historical Linguistics. Berlin: Mouton de Gruyter.

Kümmel, Martin Joachim (2008) Konsonantenwandel [Consonant Change]. Wiesbaden: Reichert.

Lass, Roger (2017) Reality in a soft science: the metaphonology of historical reconstruction. Papers in Historical Phonology 2.1: 152-163.

List, Johann-Mattis (2014) Sequence Comparison in Historical Linguistics. Düsseldorf: Düsseldorf University Press.

List, Johann-Mattis and Pathmanathan, Jananan Sylvestre and Lopez, Philippe and Bapteste, Eric (2016) Unity and disunity in evolutionary sciences: process-based analogies open common research avenues for biology and linguistics. Biology Direct 11.39: 1-17.

List, Johann-Mattis and Greenhill, Simon J. and Anderson, Cormac and Mayer, Thomas and Tresoldi, Tiago and Forkel, Robert (2018) CLICS². An improved database of cross-linguistic colexifications assembling lexical data with help of cross-linguistic data formats. Linguistic Typology 22.2: 277-306.

List, Johann-Mattis (2019) Automatic inference of sound correspondence patterns across multiple languages. Computational Linguistics 45.1: 137-161.

Mowrey, Richard and Pagliuca, William (1995) The reductive character of articulatory evolution. Rivista di Linguistica 7: 37–124.

Ohala, John J. (1989) Sound change is drawn from a pool of synchronic variation. In: Breivik, L. E. and Jahr, E. H. (eds.) Language Change: Contributions to the Study of its Causes. Berlin: Mouton de Gruyter, pp. 173-198.

Sapir, Edward (1921[1953]) Language. An Introduction to the Study of Speech. New York: Harcourt, Brace.

de Saussure, Ferdinand (1916) Cours de linguistique générale. Lausanne: Payot.

Wang, William S-Y. (1967) Phonological features of tone. International Journal of American Linguistics 33.2: 93-105.

Yang, Cathryn and Xu, Yi (2019) A review of tone change studies in East and Southeast Asia. Diachronica 36.3: 417-459.

Typology of semantic change (Open problems in computational diversity linguistics 8)


With this month's problem, we are leaving the realm of modeling, which has been the basic aspect underlying the last three problems, discussed in June, July, and August, and entering the realm of typology, or general linguistics. The last three problems that I will discuss, in this and two follow-up posts, deal with the basic problem of making use of, or collecting, data that allow us to establish typologies — that is, to identify cross-linguistic tendencies for specific phenomena, such as semantic change (this post), sound change (October), or semantic promiscuity (November).

Cross-linguistic tendencies are here understood as tendencies that occur across all languages, independently of their specific phylogenetic affiliation, the place where they are spoken, or the time when they are spoken. Obviously, the uniformitarian requirement of independence of place and time is an idealization. As we know well, the capacity for language itself developed, potentially gradually, with the evolution of modern humans; as a result, it does not make sense to assume that the tendencies of semantic change or sound change have been the same through all time. This has, in fact, been shown in recent research illustrating that there may be a certain relationship between our diet and the speech sounds we use in our languages (Blasi et al. 2019).

Nevertheless, in the same way in which we simplify models in physics, as long as they yield good approximations of the phenomena we want to study, we can also assume a certain uniformity for language change. To guarantee this, we may have to restrict the time frame of language development that we want to discuss (e.g., the last 2,000 years), or the aspects of language we want to investigate (e.g., a certain selection of concepts that we know must have been expressed 5,000 years ago).

For the specific case of semantic change, the problem of establishing a typology of the phenomenon can thus be stated as follows:
Assuming a certain pre-selection of concepts that we assume were readily expressed in a given time frame, establish a general typology that informs about the universal tendencies by which a word expressing one concept changes its meaning, to later express another concept in the same language.
In theory, we could further relax the condition of universality and add the restrictions on time and place later, after having aggregated the data. Maybe this would even be the best idea for a practical investigation; but given that the time frames in which we have attested data for semantic changes are rather limited, I do not believe that it would make much of a difference.

Why it is hard to establish a typology of semantic change

There are three reasons why it is hard to establish a typology of semantic change. First, there is the problem of acquiring the data needed to establish the typology. Second, there is the problem of handling the data efficiently. Third, there is the problem of interpreting the data in order to identify cross-linguistic, universal tendencies.

The problem of data acquisition results from the fact that we lack data on observed processes of semantic change. Since there are only a few languages with a continuous tradition of written records spanning 500 years or more, we will never be able to derive any universal tendencies from those languages alone, even if languages such as Latin and its Romance descendants may be a good starting point, as has been shown by Blank (1997).

Accepting the fact that processes attested only for Romance languages are never enough to fill the huge semantic space covered by the world's languages, the only alternative would be using inferred processes of semantic change — that is, processes that have been reconstructed and proposed in the literature. While it is straightforward to show that the meanings of cognate words in different languages can vary quite drastically, it is much more difficult to infer the direction underlying the change. Handling the direction, however, is important for any typology of semantic change, since the data from observed changes suggest that there are specific directional tendencies. Thus, when confronted with cognates such as German selig "holy" and English silly, it is much less obvious whether the change happened from "holy" to "silly" or from "silly" to "holy", or even from an unknown ancient concept to both "holy" and "silly".

As a result, we can conclude that any collection of data on semantic change needs to make crystal-clear upon which types of evidence the inference of semantic change processes is based. Citing only the literature on different language families is definitely not enough. This brings us to the second problem, the handling of data on semantic shifts. Here, we face the general problem of the elicitation of meanings. Elicitation refers to the process in fieldwork where scholars use a questionnaire to ask their informants how certain meanings are expressed. The problem here is that linguists have never tried to standardize which meanings they actually elicit. What they use, instead, are elicitation glosses, which they think are common enough to allow other linguists to understand which meaning they refer to. As a result, it is extremely difficult to search in fieldwork notes, and even in wordlists or dictionaries, for specific meanings, since every linguist uses their own style, often without further explanation.

Our Concepticon project (List et al. 2019, https://concepticon.clld.org) can be seen as a first attempt to handle elicitation glosses consistently. What we do is link the elicitation glosses that we find in questionnaires, dictionaries, and fieldwork notes to so-called concept sets, each of which reflects a given concept that receives a unique identifier and a short definition. It would go too far to dive deeper into the problem of concept handling here; interested readers can have a look at a previous blog post I wrote on the topic (List 2018). In any case, any typology of semantic change will need to find a way to address the problem of handling the elicitation glosses in the literature, in one way or another.
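As a minimal sketch of what this linking amounts to in practice, consider mapping heterogeneous elicitation glosses to one standardized concept set; the identifier below is invented for illustration, not an actual Concepticon entry.

```python
# Minimal sketch: normalize heterogeneous elicitation glosses by linking
# them to a standardized concept set. The identifier is invented.
concept_map = {
    "sun": "0001_SUN",
    "the sun": "0001_SUN",
    "sun (celestial body)": "0001_SUN",
}

def lookup(gloss):
    """Return the concept set identifier for an elicitation gloss."""
    return concept_map.get(gloss.strip().lower(), None)

print(lookup("The sun"))  # 0001_SUN: different glosses, same concept set
print(lookup("soleil"))   # None: gloss not yet linked
```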

As a last problem, having assembled data that show semantic change processes across a sufficiently large sample of languages and concepts, there is the problem of analyzing the data themselves. While it seems obvious to identify cross-linguistic tendencies by looking for examples that occur in different language families and different parts of the world, it is not always easy to distinguish between the four major reasons for similarities among languages, namely: (1) coincidence, (2) universal tendencies, (3) inheritance, and (4) contact (List 2019). The only way to avoid having to rely on potentially unreliable statistics that squeeze the juice out of small datasets is to work with sufficiently large coverage of data, from as many language families and locations as possible. But given that there are no automated ways to infer directed semantic change processes across linguistic datasets, it is unlikely that a collection of data acquired from the literature alone will reach the critical mass needed for such an endeavor.

Traditional approaches

Apart from the above-mentioned work by Blank (1997), which is, unfortunately, rarely mentioned in the literature (potentially because it is written in German), there is an often-cited paper by Wilkins (1996), and preliminary work on directionality (Urban 2012). However, the attempt that addresses the problem most closely is the Database of Semantic Shifts (Zalizniak et al. 2012), which, according to the most recent information on the website, was established in 2002 and has been continuously updated since then.

The basic idea, as far as I understand the principle of the database, is to collect semantic shifts attested in the literature, and to note the type of evidence, as well as the direction, where it is known. The resource is unique — nobody else has tried to establish a collection of semantic shifts attested in the literature — and it is therefore incredibly valuable. However, it also shows what problems we face when trying to establish a typology of semantic shifts.

Apart from the typical technical problems found in many projects shared on the web (no download access to all the data underlying the website, no deposit of versions in public repositories, no versioning), the greatest problem of the project is that no apparent attempt was made to standardize the elicitation glosses. This became especially obvious when we tried to link an older version of the database, which is now no longer available, to our Concepticon project. In the end, I selected some 870 concepts from the database that were supported by more datapoints, but had to ignore more than 1,500 remaining elicitation glosses, since it was not possible to infer in reasonable time what the underlying concepts denote, not to speak of obvious cases where the same concept was denoted by slightly different elicitation glosses. As far as I can tell, this has not changed much with the most recent update of the database, which was published earlier this year.

Apart from the aforementioned problem of missing standardization of elicitation glosses, the database does not seem to annotate which type of evidence has been used to establish a given semantic shift. An even more important problem, which is typical of almost all attempts to establish databases of change in the field of diversity linguistics, is that the database shows only what has changed, while nothing can be found on what has stayed the same. A true typology of change, however, must show what has not changed along with what has changed. As a result, any attempt to pick proposed changes from the literature alone will fail to offer a true typology — a collection of universal tendencies.

To be fair: the Database of Semantic Shifts by no means claims to do this. What it offers is a collection of semantic change phenomena discussed in the linguistic literature. This is itself an extremely valuable, and extremely tedious, enterprise. While I wish that the authors would open their data, version it, standardize the elicitation glosses, and host it on stable public archives — to avoid what happened in the past (people quoting versions of the data that no longer exist), and to open the data for quantitative analyses — I deeply appreciate the attempt to approach the problem of semantic change from an empirical, data-driven perspective. To address the problem of establishing a typology of semantic shifts, however, I think that we need to start thinking beyond collecting what has been stated in the literature.

Computational approaches

As a first computational approach that comes in some way close to a typology of semantic shifts, there is the Database of Cross-Linguistic Colexifications (List et al. 2018), which was originally launched in 2014 and received a major update in 2018 (see List et al. 2018b for details). This CLICS database, which I have mentioned several times in the past, does not show diachronic data, i.e., data on semantic change phenomena, but instead lists automatically detectable polysemies and homophonies (also called colexifications).

While the approach taken by the Database of Semantic Shifts is bottom-up in some sense, as the authors start from the literature and add the concepts that are discussed there, CLICS is top-down, as it starts from a list of concepts (reflected as standardized Concepticon concept sets) and then checks which languages express more than one concept by one and the same word form.

The advantages of top-down approaches are that much more data can be processed, and that one can easily derive a balanced sample in which the same concepts are compared for as many languages as possible. The disadvantage is that such a database will ignore certain concepts a priori, if they do not occur in the data.
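A minimal sketch of the top-down check itself: given a wordlist for one language that maps concepts to forms, find the forms that express more than one concept. The data are a well-known real case (Russian ruka "arm/hand", noga "leg/foot"), but the wordlist format is my own simplification.

```python
# Minimal sketch: detect colexifications in a single language by finding
# word forms that express more than one concept. Wordlist format invented.
from collections import defaultdict

wordlist = {          # concept -> form, here for Russian
    "ARM": "ruka",
    "HAND": "ruka",
    "LEG": "noga",
    "FOOT": "noga",
    "HEAD": "golova",
}

by_form = defaultdict(list)
for concept, form in wordlist.items():
    by_form[form].append(concept)

colexified = {f: cs for f, cs in by_form.items() if len(cs) > 1}
print(colexified)  # {'ruka': ['ARM', 'HAND'], 'noga': ['LEG', 'FOOT']}
```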

Since CLICS lists synchronic patterns without further interpreting them, the database is potentially interesting for those who want to work on semantic change, but it does not itself solve the problem of establishing a typology of semantic change. To achieve this, one would have to go through all attested polysemies in the database and investigate them, searching for potential hints regarding the directions of change.

A potential way to infer directions of semantic shifts is presented by Dellert (2016), who applies causal inference techniques to polysemy networks to address this task. The problem, as far as I understand the techniques, is that the currently available polysemy databases barely offer enough information for these kinds of analyses. Furthermore, it would also be important to see how well the method actually performs in comparison to what we think we already know about the major patterns of semantic change.

Initial ideas for improvement

There does not seem to be a practical way to address our problem by means of computational solutions alone. What we need, instead, is a computer-assisted strategy that starts from a thorough investigation of the criteria that scholars use to infer directions of semantic change from linguistic data. Once these criteria are settled, more or less, one would need to find ways to operationalize them, in order to allow scholars to work with concrete etymological data, ideally comprising standardized wordlists for different language families, and to annotate them as closely as possible.

Ideally, scholars would propose larger etymological datasets in which they reconstruct whole language families, proposing semantic reconstructions for proto-forms. These would already contain the proposed directions of semantic change, and they would also automatically show where change does not happen. Since we currently lack automated workflows that fully account for this level of detail, one could start by applying methods for cognate detection across semantic slots (cross-semantic cognate detection), which would yield valuable data on semantic change processes without providing directions, and then add the directional information based on the principles that scholars use in their reconstruction methodology.
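To illustrate what cross-semantic cognate detection would look for, here is a deliberately crude sketch: word forms from related languages are grouped by a rough consonant-class skeleton regardless of the concepts they express, and groups spanning several concepts are flagged as candidates for semantic shift (without directions). The class table, the transcriptions, and the grouping criterion are all drastic simplifications of real cognate detection methods:

```python
from collections import defaultdict

# A drastically simplified consonant-class mapping, for illustration only
CLASSES = {"p": "P", "b": "P", "f": "P", "v": "P", "t": "T", "d": "T",
           "s": "S", "z": "S", "k": "K", "g": "K", "m": "M", "n": "N",
           "l": "R", "r": "R"}

def skeleton(form):
    """Reduce a form to the classes of its first two consonants."""
    return "".join(CLASSES[ch] for ch in form if ch in CLASSES)[:2]

# English deer vs. German Tier 'animal': a textbook semantic shift
words = [("English", "deer", "dir"), ("German", "animal", "tir"),
         ("Dutch", "animal", "dir")]

groups = defaultdict(list)
for language, concept, form in words:
    groups[skeleton(form)].append((language, concept, form))

for key, items in groups.items():
    if len({concept for _, concept, _ in items}) > 1:
        print("cognate candidates across meanings:", items)
```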

Outlook

Given the recent advances in detection of sound correspondence patterns, sequence comparison, and etymological annotation in the field of computational historical linguistics, it seems perfectly feasible to work on detailed etymological datasets of the languages of the world, in which all information required to derive a typology of semantic change is transparently available. The problem is, however, that it would still take a lot of time to actually analyze and annotate these data, and to find enough scholars who would agree to carry out linguistic reconstruction in a similar way, using transparent tools rather than convenient shortcuts.

References

Blank, Andreas (1997) Prinzipien des lexikalischen Bedeutungswandels am Beispiel der romanischen Sprachen. Tübingen: Niemeyer.

Blasi, Damián E. and Moran, Steven and Moisik, Scott R. and Widmer, Paul and Dediu, Dan and Bickel, Balthasar (2019) Human sound systems are shaped by post-Neolithic changes in bite configuration. Science 363.1192: 1-10.

Dellert, Johannes and Buch, Armin (2016) Using computational criteria to extract large Swadesh Lists for lexicostatistics. In: Proceedings of the Leiden Workshop on Capturing Phylogenetic Algorithms for Linguistics.

List, Johann-Mattis (2018) Towards a history of concept list compilation in historical linguistics. History and Philosophy of the Language Sciences 5.10: 1-14.

List, Johann-Mattis (2019) Automated methods for the investigation of language contact situations, with a focus on lexical borrowing. Language and Linguistics Compass 13.e12355: 1-16.

List, Johann-Mattis and Greenhill, Simon and Anderson, Cormac and Mayer, Thomas and Tresoldi, Tiago and Forkel, Robert (2018a) CLICS: Database of Cross-Linguistic Colexifications. Version 2.0. Jena: Max Planck Institute for the Science of Human History. http://clics.clld.org/.

List, Johann-Mattis and Greenhill, Simon J. and Anderson, Cormac and Mayer, Thomas and Tresoldi, Tiago and Forkel, Robert (2018b) CLICS². An improved database of cross-linguistic colexifications assembling lexical data with help of cross-linguistic data formats. Linguistic Typology 22.2: 277-306.

List, Johann-Mattis and Greenhill, Simon and Rzymski, Christoph and Schweikhard, Nathanael and Forkel, Robert (2019) Concepticon. A resource for the linking of concept lists (Version 2.1.0). Jena: Max Planck Institute for the Science of Human History. https://concepticon.clld.org/.

Urban, Matthias (2011) Asymmetries in overt marking and directionality in semantic change. Journal of Historical Linguistics 1.1: 3-47.

Wilkins, David P. (1996) Natural tendencies of semantic change and the search for cognates. In: Durie, Mark (ed.) The Comparative Method Reviewed: Regularity and Irregularity in Language Change. New York: Oxford University Press, pp. 264-304.

Zalizniak, Anna A. and Bulakh, M. and Ganenkov, Dimitrij and Gruntov, Ilya and Maisak, Timur and Russo, Maxim (2012) The catalogue of semantic shifts as a database for lexical semantic typology. Linguistics 50.3: 633-669.

Statistical proof of language relatedness (Open problems in computational diversity linguistics 7)


The more I advance with the problems I want to present during this year, the more I have to admit to myself that some of these problems are so difficult that I find it hard even to present the state of the art. This month's problem, number 7 in my list, is such an example: proving that two or more languages are "genetically related", as historical linguists (incorrectly) tend to say, is not only hard to do, it is also extremely difficult to summarize properly.

Typically, colleagues start with the famous but not very helpful quote of Sir William Jones, who, in a discourse delivered to the Asiatick Society, mentioned that there might be a deeper relationship between Sanskrit and some European languages (like Greek and Latin). The article, titled The third anniversary discourse, delivered 2 February, 1786, by the president (published in 1798), has by now been quoted so many times that it is better to avoid quoting it another time (but you will find the full quote with references in my reference library).

In contrast to later scholars like Jacob Grimm and Rasmus Rask, however, Jones did not prove anything; he just stated an opinion. The reason why scholars like to quote him is that he seems to talk about probability, since he mentions the impossibility that the resemblances between the languages he observed could have arisen by chance. Since a great deal of the discussion about language relationship centers around the question of how chance can be controlled for, it is a welcome quote from the olden times when writing a paper on statistics or quantitative methods. But this does not necessarily mean that Jones really knew what he was writing about, as one can read in detail in the very interesting book by Campbell and Poser (2008), which deals at length with the arguably overrated role that William Jones played in the early history of historical linguistics.

Macro-families

Returning to the topic at hand: the regularity of sound change, and the resulting possibility of proving language relationship in some cases, was an unexpected discovery by linguists of the early 19th century, but what many linguists have been dreaming of since then is to expand their methods to such a degree that even deeper relationships could be proven. While the evidence for the relationship of the core Indo-European languages was more or less convincing by itself (as rightfully pointed out by Nichols 1996), scholars have since proposed many relationships that are no longer accepted by the communis opinio. Among these long-range proposals for deep phylogenetic relations are theories that further unite fully established language families into large macro-families, such as Nostratic (uniting Semitic, Indo-European, and many more, depending on the respective version), Altaic (uniting Turkic, Mongolic, Tungusic, Japanese, and Korean, etc.), or Dene-Caucasian (uniting Sino-Tibetan, North Caucasian, and Na-Dene), which span incredibly large areas of the Earth.

Given that the majority of scholars mistrust these new and risky proposals, and that even scholars who work in the field of long-range comparison often disagree with each other, it is not surprising that at least some linguists became interested in the question of how long-range relationships could be proven in the end. One of the first attempts in this regard was presented by Aharon Dolgopolsky, a convinced Nostraticist, who proposed a very interesting heuristic procedure to determine deep cognates and deep language relationships by breaking sounds down into more abstract classes, in order to address the problem that, due to sound change, related words often no longer look similar (Dolgopolsky 1964).

Why it is hard to prove language relationship

Dolgopolsky did not use any statistics to prove his approach, but he emphasized the probabilistic aspect of his endeavor, and derived his "consonant classes" or "sound classes", as well as his very short list of stable concepts, from the empirical investigation of a large corpus. The core of his approach, fixing a list of semantic items that are presumably "stable" (i.e., changing slowly with respect to semantic shift), and reducing the complexity of phonetic transcriptions to a core meta-alphabet, has been the basis of many follow-up studies that take an explicitly quantitative (or statistical) approach.
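The core of the procedure is easy to sketch in code. The class table below is a reduced stand-in for Dolgopolsky's original system (which distinguishes ten classes and operates on proper phonetic transcriptions), and the crudely transcribed English-German pairs serve only to show the principle:

```python
# Reduced sound-class table (a stand-in for Dolgopolsky's ten classes)
DOLGO = {"p": "P", "b": "P", "f": "P", "v": "P",
         "t": "T", "d": "T", "s": "S", "z": "S",
         "k": "K", "g": "K", "x": "K", "m": "M", "n": "N",
         "l": "R", "r": "R", "w": "W", "j": "J", "h": "H"}

def first_classes(word, n=2):
    """Reduce a word to the classes of its first n consonants."""
    consonants = [DOLGO[ch] for ch in word.lower() if ch in DOLGO]
    return "".join(consonants[:n])

# Crude transcriptions of English and German translation pairs
pairs = [("hand", "hant"), ("stone", "stain"), ("water", "vaser")]
for eng, ger in pairs:
    print(eng, ger, first_classes(eng), first_classes(ger),
          first_classes(eng) == first_classes(ger))
# hand/hant and stone/stain match; water/vaser does not, since the
# High German sound shift turned medial *t into s, and [v] and [w]
# fall into different classes here
```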

As of now, most scholars, be they classical or computational, agree that the first stage of historical language comparison consists of the proof that the languages one wants to investigate are, indeed, historically related to each other (for the underlying workflow of historical language comparison, see Ross and Durie 1996). In a blog post published much earlier (Monogenesis, polygenesis, and militant agnosticism), I already pointed to this problem: the situation is quite different from biology, where independent evolution of life is usually not assumed by scholars, while linguistic research can never really exclude it.

While proving the relationship of closely related languages is often a complete no-brainer, it becomes especially hard once some critical time depth is exceeded. Where this time depth lies is not yet clear, but based on our observations regarding the pace at which languages replace existing words with new ones, borrow words, or lose and build grammatical structures, it is clear that a language group could theoretically have lost all hints of its ancestry after 5,000 to 10,000 years. Luckily, what is theoretically possible for one language does not necessarily happen to all languages in a given sample, and as a result we still find enough signal in quite a few language families of the world to draw conclusions that go back about 10,000 years in most cases, if not even deeper in some.

Traditional insights into the proof of language relationships

The difficulty of the task is probably obvious without further explanation: the more material a language acquires from its neighbors, and the more it loses or modifies the material it inherited from its ancestors, the more difficult it is for experts to find the evidence that convinces their colleagues of the phylogenetic affiliation of that language. While regular sound correspondences can easily convince people of phylogenetic relationship, the evidence that scholars propose for deeper linguistic groupings is rarely extensive enough to establish such correspondences.

As a result, scholars often resort to other types of evidence, such as certain grammatical peculiarities, certain similarities in the pronunciation of certain words, or external findings (e.g., from archaeology). As Handel (2008) points out, for example, a good indicator of a Sino-Tibetan language is that its words for five, I, and fish start with similar initial sounds and contain a similar vowel (compare Chinese 五, 吾, and 魚, going back to the Middle Chinese readings ŋjuX, ŋaX, and ŋjo). While these arguments are often intuitively very convincing (and may also be statistically convincing, as Nichols 1996 argues), this kind of evidence, as mentioned by Handel, is extremely difficult to detect, since the commonalities can be found in so many different regions of a human language system.

While linguists also use sound correspondences to prove and establish relationships, there are no convincing cases known to me in which sound correspondences were employed to prove relationships beyond a certain time depth. One can compare this endeavor, to some degree, with the work of police commissars who have to find a murderer, and can do so easily if the person responsible left DNA at the scene, while otherwise they have to spend many nights in pubs, drinking cheap beer and smoking bad cigarettes, waiting for the spark of inspiration that delivers the ultimate proof not based on DNA.

Computational and statistical approaches

Up to now, no computational methods are available to find signals of the kind presented by Handel for Sino-Tibetan, i.e., a general-purpose heuristic to search for what Nichols (1996) calls individual-identifying evidence. So computational and statistical methods have so far relied on very schematic approaches, which are almost exclusively based on wordlists. A wordlist can be thought of as a simple table with a certain number of concepts (arm, hand, stone, cinema) in the first column, and translation equivalents for these concepts listed for several different languages in the following columns (see List 2014: 22-24). This format can of course be enhanced (Forkel et al. 2018), but it represents the standard way in which many historical linguists still prepare and curate their data.
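In code, such a wordlist is little more than a table; a minimal sketch with orthographic entries:

```python
# A wordlist as described above: one concept per row, one language
# per column (the transcriptions are orthographic, for illustration)
wordlist = [
    # concept    English    German     French
    ("arm",      "arm",     "Arm",     "bras"),
    ("hand",     "hand",    "Hand",    "main"),
    ("stone",    "stone",   "Stein",   "pierre"),
    ("cinema",   "cinema",  "Kino",    "cinéma"),
]
```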

What scholars now try to do is to see whether they can find some kind of signal in the data that would be unlikely to arise by chance. In general, there are two ways that scholars have explored so far. In the approach proposed by Ringe (1992), the signals that are tested for in the wordlists are sound correspondences, and we can therefore call these correspondence-based approaches to proving language relationship. In the approach of Baxter and Manaster Ramer (2000), which follows the original idea of Dolgopolsky, the data are first converted to sound classes, and cognacy is assumed for words with identical sound classes. Sound-class-based approaches then try to show that the matches identified in this way are unlikely to be due to chance.

Both approaches have been discussed in quite a range of different papers, and scholars have also tried to propose improvements to the methods. Ringe's correspondence-based approach showed that it can be difficult to prove the relationship of languages formally, even where we have very good reasons to assume it based on our standard methods. Baxter and Manaster Ramer (2000) presented a more optimistic case study, in which they argue that their sound-class-based approach allows them to argue for the relationship of Hindi and English, even though the two languages have been separated for 10,000 years or more.

A general problem of Ringe's approach is that he used combinatorics to arrive at his statistical evaluation. This is similar to the way in which Henikoff and Henikoff (1992) developed their BLOSUM matrices for biology, by assuming that the only factor governing the combination of amino acids in biological sequences is their frequency. Ringe tried to estimate the likelihood of finding matches of word-initial consonants in his data by using a combinatorial approach based on simple sound frequencies in the wordlists he investigated. The general problem with linguistic sequences, however, is that they are not randomly arranged. Instead, every language has its own system of phonotactic rules, a rather simple grammar that restricts certain sound combinations and favors others. All spoken languages have such systems, and languages vary greatly with respect to their phonotactics. As a result, due to the inherent structure of sequences, a bag-of-symbols approach, as used by Ringe, can have unwanted side effects and yield misleading estimates of the probability of certain matches.
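The spirit of such a combinatorial estimate is easy to reproduce. The sketch below computes, under the problematic independence assumption just described, the chance probability that the word-initial consonant classes of a translation pair match, and the binomial tail probability of observing at least k matches in an n-item list; all frequencies are made up:

```python
from math import comb

def chance_match_prob(freqs1, freqs2):
    """Probability that two independently drawn word-initial classes agree."""
    return sum(f * freqs2.get(cls, 0.0) for cls, f in freqs1.items())

def p_at_least(n, k, p):
    """Binomial tail: P(at least k matches in n concept slots)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# Hypothetical word-initial class frequencies in two 100-item lists
f1 = {"T": 0.30, "K": 0.25, "P": 0.20, "S": 0.15, "M": 0.10}
f2 = {"T": 0.25, "K": 0.30, "P": 0.15, "S": 0.20, "M": 0.10}

p = chance_match_prob(f1, f2)
print(round(p, 3), p_at_least(100, 30, p))
```

The bag-of-symbols problem enters exactly here: since real word-initial consonants are not distributed independently of the rest of the word, the estimated p, and hence the whole significance calculation, can be systematically off.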

To avoid this problem, Kessler (2001) proposed the use of permutation tests, in which the random distribution against which the attested distribution is compared is generated by shuffling the lists. Instead of comparing translations for "apple" in one language with translations for "apple" in another language, one now compares translations for "pear" with translations for "apple", hoping that this, if done often enough, better approximates the random distribution (i.e., the situation in which one compares known unrelated languages with similar phoneme inventories).
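A permutation test in Kessler's spirit needs only a few lines. The sketch below reuses the first_classes helper from the sound-class sketch above: it counts how often a shuffled pairing of the two lists matches at least as well as the attested pairing, which yields an empirical p-value without any independence assumption:

```python
import random

def matches(list1, list2):
    """Count concept slots whose first two consonant classes agree."""
    return sum(first_classes(w1) == first_classes(w2)
               for w1, w2 in zip(list1, list2))

def permutation_p(list1, list2, runs=10_000):
    """Share of shuffled pairings that score at least as high as the
    attested pairing, i.e. an empirical p-value."""
    observed = matches(list1, list2)
    hits = sum(matches(list1, random.sample(list2, len(list2))) >= observed
               for _ in range(runs))
    return hits / runs
```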

Permutation is also the standard in sound-class-based approaches. In a recent paper, Kassian et al. (2015) used such an approach (first proposed by Turchin et al. 2010) to argue for the relationship of the Indo-European and Uralic languages by comparing reconstructed wordlists for Proto-Indo-European and Proto-Uralic. As can be seen from the discussion of these findings, involving multiple authors, people are still not automatically convinced by a significance test: scholars have criticized the choice of test concepts (Kassian et al. used the classical 110-item list by Yakhontov and Starostin), the choice of reconstruction system (they did not use the mysterious laryngeals in their comparison), and the possibility that the findings were due to other factors (early borrowing).

While there have been further attempts to improve the correspondence-based and the sound-class-based approaches (e.g., Kessler 2007, Kilani 2015, Mortarino 2009), it is unlikely that they will lead to the consolidation of contested proposals on macro-families any time soon. Apart from the general problems of many of the current tests, there seem to be too many unknowns that prevent the community from accepting findings, no matter how significant they appear. As can nicely be seen from the reaction to the paper by Kassian et al. (2015), a significant test result will first raise the typical questions regarding the quality of the data and the initial judgments (which may at times be biased). Even if all scholars agreed on a given case, i.e., if one could not criticize anything in the initial test setting, it would still be possible to claim that the findings reflect early language contact rather than phylogenetic relatedness.

Initial ideas for improvement

What I find unsatisfying about most existing tests is that they do not make full use of alignment methods. The sound-class-based approach is a shortcut for alignments, but it reduces words to two consonant classes only, and it requires an extensive analysis of the words in order to compare only the root morphemes. It therefore also opens the possibility of biasing the results (even if scholars do not intend this). While correspondence-based tests are much more elegant in general, they avoid alignments completely, and just pick the first segment of every word. The problem seems to be that, even when using permutations to generate the random distribution, nobody really knows how one should score the significance of sound correspondences in aligned words. I have to admit that I do not know either. Although the tools for automated sequence comparison that my colleagues and I have been developing in the past (List 2014, List et al. 2018) seem like the best starting point for improving the correspondence-based approach, it is not clear how the test should be performed in the end.

Additionally, I also assume that expanded, fully fledged tests will ultimately show what I reported back in my dissertation: if we work on limited wordlists, with only 200 items per language, the test drastically loses its power beyond certain time depths. While we can easily prove the relationship of English and German, even with only 100 words, we have a hard time doing the same for English and Albanian (see List 2014: 200-203). But expanding the wordlists carries another risk for the comparison (as pointed out to me by George Starostin): the more words we add, the more likely it is that they have been borrowed. Thus, we face a general dilemma in historical linguistics: we are forced to deal with sparse data, since languages tend to lose their historical signal rather quickly.

Outlook

While there is no doubt that it would be attractive to have a test that immediately tells one whether languages are related or not, I am becoming more and more skeptical about whether such a test would actually help us, specifically when concentrating on pairwise tests alone. The challenge of this problem is not just to design a test that makes sense and does not oversimplify. The challenge is to propagate the test in such a way that it convinces our colleagues that it really works. This, however, is a challenge greater than any of the other open problems I have discussed so far this year.

References

Baxter, William H. and Manaster Ramer, Alexis (2000) Beyond lumping and splitting: Probabilistic issues in historical linguistics. In: Renfrew, Colin and McMahon, April and Trask, Larry (eds.) Time Depth in Historical Linguistics. Cambridge: McDonald Institute for Archaeological Research, pp. 167-188.

Campbell, Lyle and Poser, William John (2008) Language Classification: History and Method. Cambridge: Cambridge University Press.

Dolgopolsky, Aron B. (1964) Gipoteza drevnejšego rodstva jazykovych semej Severnoj Evrazii s verojatnostej točky zrenija [A probabilistic hypothesis concerning the oldest relationships among the language families of Northern Eurasia]. Voprosy Jazykoznanija 2: 53-63.

Forkel, Robert and List, Johann-Mattis and Greenhill, Simon J. and Rzymski, Christoph and Bank, Sebastian and Cysouw, Michael and Hammarström, Harald and Haspelmath, Martin and Kaiping, Gereon A. and Gray, Russell D. (2018) Cross-linguistic data formats, advancing data sharing and re-use in comparative linguistics. Scientific Data 5: 1-10.

Handel, Zev (2008) What is Sino-Tibetan? Snapshot of a field and a language family in flux. Language and Linguistics Compass 2: 422-441.

Henikoff, Steven and Henikoff, Jorja G. (1992) Amino acid substitution matrices from protein blocks. Proceedings of the National Academy of Sciences 89: 10915-10919.

Jones, William (1798) The third anniversary discourse, delivered 2 February, 1786, by the president. On the Hindus. Asiatick Researches 1: 415-43.

Kassian, Alexei and Zhivlov, Mikhail and Starostin, George S. (2015) Proto-Indo-European-Uralic comparison from the probabilistic point of view. The Journal of Indo-European Studies 43: 301-347.

Kessler, Brett (2001) The Significance of Word Lists. Statistical Tests for Investigating Historical Connections Between Languages. Stanford: CSLI Publications.

Kessler, Brett (2007) Word similarity metrics and multilateral comparison. In: Proceedings of Ninth Meeting of the ACL Special Interest Group in Computational Morphology and Phonology, pp. 6-14.

Kilani, Marwan (2015) Calculating false cognates: An extension of the Baxter & Manaster-Ramer solution and its application to the case of Pre-Greek. Diachronica 32: 331-364.

List, Johann-Mattis (2014) Sequence Comparison in Historical Linguistics. Düsseldorf: Düsseldorf University Press.

List, Johann-Mattis and Walworth, Mary and Greenhill, Simon J. and Tresoldi, Tiago and Forkel, Robert (2018) Sequence comparison in computational historical linguistics. Journal of Language Evolution 3: 130–144.

Mortarino, Cinzia (2009) An improved statistical test for historical linguistics. Statistical Methods and Applications 18: 193-204.

Nichols, Johanna (1996) The comparative method as heuristic. In: Durie, Mark (ed.) The Comparative Method Reviewed. New York: Oxford University Press, pp. 39-71.

Ringe, Donald A. (1992) On calculating the factor of chance in language comparison. Transactions of the American Philosophical Society 82: 1-110.

Ross, Malcolm D. (1996) Contact-induced change and the comparative method. Cases from Papua New Guinea. In: Durie, Mark (ed.) The Comparative Method Reviewed. New York: Oxford University Press, pp. 180-217.

Turchin, Peter and Peiros, Ilja and Gell-Mann, Murray (2010) Analyzing genetic connections between languages by matching consonant classes. Journal of Language Relationship 3: 117-126.

Simulation of sound change (Open problems in computational diversity linguistics 6)


The sixth problem in my list of open problems in computational diversity linguistics is devoted to the simulation of sound change. When formulating the problem, one first has to decide what exactly should be simulated, as there are two possibilities for a concrete simulation: (i) one could think of the sound system of a given language and model how, through time, the sounds change into other sounds; or (ii) one could think of a bunch of words in the lexicon of a given language, and simulate how these words change through time, based on different kinds of sound change rules. I have the latter scenario in mind.
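To make scenario (ii) concrete, here is a minimal sketch: a toy lexicon and a handful of invented, ordered, context-sensitive sound laws written as regular expressions. Nothing here, neither the forms nor the laws, is meant to be empirically adequate:

```python
import re

def apply_laws(word, laws):
    """Apply an ordered list of regex sound laws to one word form."""
    form = f"#{word}#"          # '#' marks the word boundaries
    for pattern, replacement in laws:
        form = re.sub(pattern, replacement, form)
    return form.strip("#")

# Hypothetical sound laws, ordered as they would apply through time
LAWS = [
    (r"k(?=[ie])", "c"),                  # *k is palatalized before front vowels
    (r"p(?=#)", "f"),                     # word-final *p becomes f
    (r"(?<=[aeiou])t(?=[aeiou])", "d"),   # intervocalic *t is voiced
]

lexicon = ["kipu", "atap", "keti", "tuku"]
print([apply_laws(word, LAWS) for word in lexicon])
# -> ['cipu', 'adaf', 'cedi', 'tuku']
```

A serious simulation would have to draw such laws from an empirically grounded typology of sound change, and this is exactly what we are missing.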

Why simulating sound change is hard

The problem of simulating sound change is hard for a number of reasons. First of all, the problem is similar to that of sound law induction, since we have to find a simple and straightforward way to handle phonetic context (remember that a sound change may apply only to sounds that occur in a certain environment of other sounds). This is already difficult enough, but it could be handled with the help of what I have called multi-tiered sequence representations (List and Chacon 2015). Beyond this, there are four further problems that one would need to overcome (or at least be aware of) when trying to successfully simulate sound change.

The first of these extra problems is that of morphological change and analogy, which usually goes along with "normal" sound change, following what Anttila (1976) calls Sturtevant's paradox: regular sound change produces irregularity in language systems, while irregular analogy produces regularity in language systems. In historical linguistics, analogy serves as a cover term for various processes in which words or word parts are rendered more similar to other words than they had been before. Classical examples are children's "regular" plurals of nouns like mouse (e.g., mouses instead of mice) or "regular" past-tense forms of verbs like catch (e.g., catched instead of caught). In all these cases, perceived irregularities in the grammatical system, which often go back to ancient sound change processes, are regularized on an ad-hoc basis.

One could (maybe one should), of course, start with a model that deliberately ignores processes of morphological change and analogical leveling when drafting a first system for sound change simulation. However, one needs to be aware that it is difficult to separate morphological change from sound change, as our methods of inference require that we identify both of them properly.

The second extra problem is the question of the mechanism of sound change, on which competing theories exist. Some scholars emphasize that sound change is entirely regular, spreading over the whole lexicon (or changing one key on the typewriter, as it were), while others claim that sound change may slowly spread from word to word and at times not reach all words in a given lexicon. If one wants to profit from simulation studies, one would ideally allow for testing both mechanisms; but it seems difficult to model the idea of lexical diffusion (Wang 1969), given that it should depend on external parameters, like the frequency of word use, which are also not very well understood. A toy contrast of the two mechanisms is sketched below.
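The frequency-dependent adoption rate in the following sketch is just one conceivable operationalization of lexical diffusion, not an established model:

```python
import random

def regular_change(lexicon, rule):
    """Neogrammarian scenario: the rule hits every word at once."""
    return [rule(word) for word in lexicon]

def lexical_diffusion(lexicon, rule, frequencies, steps=20, rate=0.1):
    """Words adopt the change one by one; frequent words change first."""
    changed = set()
    for _ in range(steps):
        for i, word in enumerate(lexicon):
            if i not in changed and random.random() < rate * frequencies[i]:
                changed.add(i)
    return [rule(w) if i in changed else w for i, w in enumerate(lexicon)]

rule = lambda w: w.replace("a", "e")      # a toy sound change
lexicon = ["kata", "pata", "taka", "napa"]
frequencies = [0.9, 0.5, 0.3, 0.1]        # hypothetical usage rates
print(regular_change(lexicon, rule))
print(lexical_diffusion(lexicon, rule, frequencies))
```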

The third extra problem is that of the actual tendencies of sound change, which are also by no means well understood by linguists. Initial work on sound change has been carried out (Kümmel 2008), but the major task of comparing the tendencies of sound change processes across a large sample of the world's languages (i.e., the typology of sound change, which I plan to discuss separately in a later post) has not been accomplished so far. The reason why we are missing this typology is that we lack clear-cut machine-readable accounts of annotated, aligned data, in which scholars would provide their proto-forms for the reconstructed languages along with their proposed sound laws, in a system that can in fact be tested and run (which would also allow us to estimate the exceptions, and where those systems fail).

An account of the tendencies of sound change, in turn, raises a fourth important problem, apart from the lack of data from which we could draw a first typology of sound change processes: sound change tendencies are initiated not only by the general properties of speech sounds, but also by the linguistic systems in which these speech sounds are employed. While scholars occasionally mention this, there have been no real attempts to separate the two aspects in a concrete reconstruction of a particular language. A typology of sound change tendencies could thus not simply stop at listing tendencies resulting from the properties of speech sounds, but would also have to find a way to model diverging tendencies that result from systemic constraints.

Traditional insights into the process of sound change

When discussing sound change, we need to distinguish mechanisms, types, and patterns. The mechanisms refer to how the process proceeds, the types refer to the concrete manifestations of the process (like a certain, concrete change), and the patterns reflect the systemic perspective of changes (i.e., their impact on the sound system of a given language; see List 2014).

Figure 1: Lexical diffusion

The question regarding the mechanism is important, since it refers to the dispute over whether sound change happens simultaneously in the whole lexicon of a given language, that is, whether it reflects a change in the inventory of sounds, or whether it jumps from word to word, as the defenders of lexical diffusion, whom I mentioned above, propose (see also Chen 1972). While probably nobody would nowadays deny that sound change can proceed as a regular process (Labov 1981), it is less clear to which degree the idea of lexical diffusion can be confirmed. Technically, the theory is dangerous, since it allows a high degree of freedom in the analysis, which can have a deleterious impact on the inference of cognates (Hill 2016). But this does not mean, of course, that the process itself does not exist. In the two figures accompanying this section, I have tried to contrast the different perspectives on the phenomena.

Figure 2: Regular sound change

To gain a deeper understanding of the mechanisms of sound change, it seems indispensable to work more on models that try to explain how it is actuated in the first place. While most linguists agree that synchronic variation in our daily speech is what enables sound change to begin with, it is not entirely clear how certain new variants become fixed in a society. Interesting theories in this context come from Ohala (1989), who describes distinct scenarios in which sound change can be initiated either by the speaker or by the listener, scenarios which would in theory also yield predictable tendencies with respect to the typology of sound change.

The insights into the types and patterns of sound change are, as mentioned above, much more rudimentary, although one can say that most historical linguists have a rather good intuition with respect to what is possible and what is less likely to happen.

Computational approaches

We can find quite a few published papers devoted to the simulation of certain aspects of sound change, but so far we do not find (at least to my current knowledge) any comprehensive account that would try to feed some 1,000 words to a computer and see how this "language" develops: which sound laws can be observed to occur, and how they change the shape of the given language. What we find, instead, are a couple of very interesting accounts that deal with particular aspects of sound change.

Winter and Wedel, for example, test agent-based exemplar models in order to see how systems maintain contrast despite variation in realization (Hamann 2014: 259f gives a short overview of other recent articles). Au (2008) presents simulation studies that aim to test to which degree lexical diffusion and "regular" sound change interact in language evolution. Dediu and Moisik (2019) investigate, with the help of different models, to which degree the vocal tract anatomy of speakers may have an impact on the actuation of sound change. Stevens et al. (2019) present an agent-based simulation to investigate the change of /s/ to /ʃ/ in English.

This summary of literature is very eclectic, especially because I have only just started to read more about the different proposals out there. What is important for the problem of sound change simulation is that, to my knowledge, there is no approach yet ready to run the full simulation of a given lexicon for a given language, as stated above. Instead, the studies reported so far have a much more fine-grained focus, specifically concentrating on the dynamics of speaker interaction.

Initial ideas for improvement

I do not have concrete ideas for improvement, since the solution of this problem depends on quite a few other problems that would need to be solved first. But to address the idea of simulating sound change, albeit in a very simplified form, I think it will be important to work harder on our inferences, making transparent what is so far only implicitly stored in the heads of the many historical linguists, in the form of what they call their intuition.

During the past 200 years, after linguists started to apply the mysterious comparative method, which they had used so successfully to reconstruct Indo-European, to other language families, the amount of data and the number of reconstructions for the world's languages have drastically increased. Many different language families have now been studied intensively, and the results have been presented in etymological dictionaries, in numerous books and articles on particular questions, and at times even in databases.

Unfortunately, however, we rarely find attempts by scholars to provide their findings in a form that would allow the correctness of their predictions to be checked automatically. I am thinking in very simple terms here: a scholar who proposes a reconstruction for a given language family should deliver not only the proto-forms with their reflexes in the daughter languages, but also a detailed, testable account of how the proposed sound laws turn the proto-forms into the reflexes in the daughter languages.

While this could not easily be implemented in the past, it is in fact possible now, as we can see from a couple of studies in which scholars have tried to compute sound change (Hartmann 2003, Pyysalo 2017, see also Sims-Williams 2018 for an overview of further literature). Although these attempts are unsatisfying, given that they do not account for the cross-linguistic comparability of data (e.g., they use orthographies rather than unified transcriptions, as proposed by Anderson et al. 2018), they illustrate that it should in principle be possible to use transducers and similar technologies to formally check how well the data can be explained under a certain set of assumptions.
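What such a check could look like, in the simplest possible terms, is sketched below, reusing the apply_laws helper and the invented LAWS from the simulation sketch above; a real implementation would of course work on full etymological datasets:

```python
def check_sound_laws(etymologies, laws):
    """Compare predicted and attested reflexes, collecting exceptions."""
    exceptions = []
    for proto, attested in etymologies:
        predicted = apply_laws(proto, laws)   # see the sketch above
        if predicted != attested:
            exceptions.append((proto, predicted, attested))
    return exceptions

# Hypothetical proto-forms with attested reflexes in one daughter language
etymologies = [("kipu", "cipu"), ("atap", "adaf"), ("keti", "keti")]
for proto, predicted, attested in check_sound_laws(etymologies, LAWS):
    print(f"*{proto}: predicted {predicted}, attested {attested}")
# *keti: predicted cedi, attested keti -- an exception (e.g., a borrowing)
```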

Without cross-linguistic accounts of the diversity of sound change processes (i.e., a first solution to the problem of establishing a typology of sound change), attempts to simulate sound change will remain difficult. The only way to address this problem is to require more rigorous coding of data (both human- and machine-readable), and an increased openness on the part of scholars who work on the reconstruction of interesting language families, in order to make their data cross-linguistically comparable.

Sign languages

When drafting this post, I promised Guido and Justin to seize the opportunity, when talking about sound change, to say a few words about its peculiarities in contrast to other types of language change. The idea was that this would help us contribute to the mini-series on sign languages that Guido and Justin initiated this month (see posts number one, two, and three).

I do not think that I have completely succeeded in doing so, as what I have discussed today with respect to sound change does not really point out what makes it peculiar (if it is). But to provide a brief attempt before I finish this post: I think it is important to emphasize that the whole debate about the regularity of sound change is, in fact, not necessarily about regularity per se, but rather about the question of where the change occurs. As the words in spoken languages are composed from a fixed inventory of sounds, any change to this inventory will have an impact on the language as a whole. Synchronic variation in the pronunciation of these sounds offers the possibility of change (for example, during language acquisition); and once the pronunciation shifts in this way, all words that are affected will shift along, similar to a typewriter in which one changes a key.

As far as I understand, it is not yet clear whether a counterpart of this process exists in sign language evolution; but if one wanted to search for such a process, one should, in my opinion, do so by investigating to what degree signs can be considered to be composed of something similar to the phonemes of spoken languages. In my opinion, the existence of phonemes as minimal meaning-discriminating units in all human languages, spoken and signed alike, is far from proven. But if it should turn out that signed languages also recruit their meaning-discriminating units from a limited pool of possibilities, there might be a chance of uncovering phenomena similar to regular sound change.

References
    Anderson, Cormac and Tresoldi, Tiago and Chacon, Thiago Costa and Fehn, Anne-Maria and Walworth, Mary and Forkel, Robert and List, Johann-Mattis (2018) A cross-linguistic database of phonetic transcription systems. Yearbook of the Poznań Linguistic Meeting 4.1: 21-53.

    Anttila, Raimo (1976) The acceptance of sound change by linguistic structure. In: Fisiak, Jacek (ed.) Recent Developments in Historical Phonology. The Hague, Paris, New York: de Gruyter, pp. 43-56.

    Au, Ching-Pong (2008) Acquisition and Evolution of Phonological Systems. Academia Sinica: Taipei.

    Chen, Matthew (1972) The time dimension. Contribution toward a theory of sound change. Foundations of Language 8.4: 457-498.

    Dediu, Dan and Moisik, Scott (2019) Pushes and pulls from below: Anatomical variation, articulation and sound change. Glossa 4.1: 1-33.

    Hamann, Silke (2014) Phonological changes. In: Bowern, Claire (ed.) Routledge Handbook of Historical Linguistics. Routledge, pp. 249-263.

    Hartmann, Lee (2003) Phono. Software for modeling regular historical sound change. In: Actas VIII Simposio Internacional de Comunicación Social. Southern Illinois University, pp. 606-609.

    Hill, Nathan (2016) A refutation of Song’s (2014) explanation of the ‘stop coda problem’ in Old Chinese. International Journal of Chinese Linguistics 2.2: 270-281.

    Kümmel, Martin Joachim (2008) Konsonantenwandel [Consonant change]. Wiesbaden: Reichert.

    Labov, William (1981) Resolving the Neogrammarian Controversy. Language 57.2: 267-308.

    List, Johann-Mattis (2014) Sequence Comparison in Historical Linguistics. Düsseldorf: Düsseldorf University Press.

    List, Johann-Mattis and Chacon, Thiago (2015) Towards a cross-linguistic database for historical phonology? A proposal for a machine-readable modeling of phonetic context. Paper presented at the workshop "Historical Phonology and Phonological Theory [organized as part of the 48th annual meeting of the SLE]" (2015/09/04, Leiden, Societas Linguistica Europaea).

    Ohala, J. J. (1989) Sound change is drawn from a pool of synchronic variation. In: Breivik, L. E. and Jahr, E. H. (eds.) Language Change: Contributions to the Study of its Causes. Berlin: Mouton de Gruyter, pp. 173-198.

    Pyysalo, Jouna (2017) Proto-Indo-European Lexicon: The generative etymological dictionary of Indo-European languages. In: Proceedings of the 21st Nordic Conference of Computational Linguistics, pp. 259-262.

    Sims-Williams, Patrick (2018) Mechanising historical phonology. Transactions of the Philological Society 116.3: 555-573.

    Stevens, Mary and Harrington, Jonathan and Schiel, Florian (2019) Associating the origin and spread of sound change using agent-based modelling applied to /s/-retraction in English. Glossa 4.1: 1-30.

    Wang, William Shi-Yuan (1969) Competing changes as a cause of residue. Language 45.1: 9-25.

    Winter, Bodo and Wedel, Andrew (2016) The co-evolution of speech and the lexicon: Interaction of functional pressures, redundancy, and category variation. Topics in Cognitive Science 8: 503-513.

    Simulation of lexical change (Open problems in computational diversity linguistics 5)


    The fifth problem in my list of open problems in computational diversity linguistics is devoted to the problem of simulating lexical change. In a broad sense, lexical change refers to the way in which the lexicon of a human language evolves over time. In a narrower sense, we would reduce it to the major processes that constitute the changes that affect the words of human languages.

    Following Gévaudan (2007: 15-17), we can distinguish three different dimensions along which words can change, namely:
    • the semantic dimension — a given word can change its meaning
    • the morphological dimension — new words are formed from old words by combining existing words or deriving new words with the help of affixes, and
    • the stratic dimension — languages may acquire words from their neighbors and thus contain strata of contact.
    If we take these three dimensions as the basis of any linguistically meaningful system that simulates lexical change (and I would strongly argue that we should), the task of simulating lexical change can thus be worded as follows:
    Create a model of lexical change that simulates how the lexicon of a given language changes over time. This model may be simplifying, but it should account for change along the major dimensions of lexical change, including morphological change, semantic change, and lexical borrowing.
    Note that the focus on three dimensions along which a word can change deliberately excludes sound change (which I will treat as a separate problem in an upcoming blogpost). Excluding sound change is justified by the fact that, in the majority of cases, the process proceeds independently of semantic change, morphological change, and borrowing, while the latter three processes often interact.

    There are, of course, cases where sound change may trigger the other three processes — for example, when sound change leads to homophonous words in a language that express contrary meanings, a situation that is usually resolved by using another word form for one of the concepts. An example of this process can be found in Chinese, where shǒu (in modern pronunciation) came to mean both "head" and "hand" (spelled as 首 and 手). Nowadays, shǒu remains only in expressions like shǒudū 首都 "capital", while tóu 头 is the regular word for "head".

    Since the number of cases where we have sufficient evidence to infer that sound change triggered other changes is rather small, we will do better to ignore them when trying to design initial models of lexical change. Later models could, of course, combine sound change with lexical change in an overarching framework; but given how complex the modeling of lexical change already is with the three dimensions alone, it seems useful to put sound change aside for the moment and treat it as a separate problem.

    Why simulating lexical change is hard

    For historical linguists, it is obvious why it is hard to simulate lexical change in a computational model. The reason is that all three major processes of lexical change (semantic change, morphological change, and lexical borrowing) are already hard to model and understand in themselves.

    Morphological change is not only difficult to understand as a process, it is even difficult to infer; and it is for this reason that we find morphological segmentation as the first example in my list of open problems. The same holds for lexical borrowing, which I discussed as the second example in my list of open problems. The problem of common pathways of semantic change will be discussed in a later post, devoted to the general typology of semantic change processes.

    If each of the individual processes that constitute lexical change is itself either hard to model or to infer, it is no wonder that the simulation is also hard.

    Traditional insights into the process of lexical change

    Important work on lexical change goes back at least to the 1950s, when Morris Swadesh (1909-1967) proposed his theory of lexicostatistics and glottochronology (Swadesh 1952, 1955, Lees 1953). What was important in this context was not the idea that one could compute the divergence time of languages, but the data model that Swadesh introduced. This data model is represented by a word-list in which a particular list of concepts is translated into a particular range of languages. While former work on semantic change had been mostly semasiological — i.e., form-based, taking the word as the basic unit and asking how it would change its meaning over time — the new model used concepts as the comparandum, investigating how word forms replaced each other in expressing specific concepts over time. This onomasiological or concept-based perspective has the great advantage of drastically facilitating the sampling of language data from different languages.

    When comparing only specific word forms for cognacy, it is difficult to learn something about the dynamics of lexical change through time, since it is never clear how to sample the words that one wants to investigate more closely in a given study. With Swadesh's data model, the sampling process is reduced to the selection of concepts, regardless of which word forms one will encounter in a given sample of languages. Swadesh was by no means the first to propose this perspective, but he was the one who promulgated it.

    Swadesh's data model does not directly measure lexical change, but instead measures the results of lexical change, given that these results surface in the distribution of cognate sets across lexicostatistical word-lists. While historical linguists had previously focused mostly on sound change processes, often ignoring morphological and semantic change, the lexicostatistical data model moved semantic change, lexical borrowing, and (to a lesser degree) also morphological change into the spotlight of linguistic endeavors. As an example, consider the following quote from Lees (1953), discussing the investigation of change in vocabulary under the label of morpheme decay:
    The reasons for morpheme decay, ie. for change in vocabulary, have been classified by many authors; they include such processes as word tabu, phonemic confusion of etymologically distinct items close in meaning, change in material culture with loss of obsolete terms, rise of witty terms or slang, adoption of prestige forms from a superstratum language, and various gradual semantic shifts such as specialization, generalization, and pejoration. [Lees 1953: 114]
    In addition to lexicostatistics and the discussions that arose from it (including those that criticized the method harshly), I consider the aforementioned model of three dimensions of lexical change by Gévaudan (2007) to be very useful in this context, since it constitutes one of the few attempts to approach the question of lexical change in a formal (or formalizable) way.

    Computational approaches

    Among the most frequently used models in the historical linguistics literature are those in which lexical change is modeled as a process of cognate gain and cognate loss. Modeling lexical change as a process of word gain and word loss, or root gain and root loss, is in fact straightforward. We know well that languages may stop using certain words during their evolution, either because the things the words denote no longer exist (think of the word walkman, and then try to project the future of the word ipad), or because a specific word form is no longer used to denote a concept and therefore drops out of the language at some point (think of thorp, which meant something like "village", as a comparison with German Dorf "village" shows, but which now survives only as a suffix in place names).

    Since the gain-loss (or birth-death) model finds a direct counterpart in evolutionary biology, where genome evolution is often modeled as a process involving the gain and loss of gene families (Cohen et al. 2008), it is also very easy to apply it to linguistics. The major work on the stochastic description of different gain-loss models has already been done, and we can find very stable software that helps us employ gain-loss models to reconstruct phylogenetic trees (Ronquist and Huelsenbeck 2003).

    It is therefore not surprising that gain-loss models are very popular in computational approaches to historical linguistics. Starting from pioneering work by Gray and Jordan (2000) and Gray and Atkinson (2003), they have now been used on many language families, including Austronesian (Gray et al. 2007), Australian languages (Bowern and Atkinson 2012), and most recently also Sino-Tibetan (Sagart et al. 2019). Although scholars (including myself) have expressed skepticism about their usefulness (List 2016), the gain-loss model can be seen as reflecting the quasi-standard of phylogenetic reconstruction in contemporary quantitative historical linguistics.

    Despite their popularity for phylogenetic reconstruction, gain-loss models have been used only sporadically in simulation studies. The only attempts that I know of so far are a study by Greenhill et al. (2009), in which the authors used the TraitLab software (Nicholls et al. 2013) to simulate language change along with horizontal transfer events, and a study by Murawaki (2015), in which (if I understand the study correctly) a gain-loss model is used to model language contact.
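
    To make the gain-loss idea more tangible, here is a minimal sketch (with made-up rates and numeric root identifiers; a toy illustration of the general model, not a reimplementation of TraitLab or of any published study): a language is a set of roots, each root may be lost in every time step, and new roots enter at a rate proportional to the lexicon size.

        import itertools
        import random

        random.seed(42)
        new_roots = itertools.count(10**6)  # global supply of new root identifiers

        def simulate_gain_loss(lexicon, steps=100, loss_rate=0.005, gain_rate=0.005):
            """Evolve a bag of roots by random loss and gain events."""
            lexicon = set(lexicon)
            for _ in range(steps):
                # every root may be lost independently in each time step
                lexicon = {root for root in lexicon if random.random() > loss_rate}
                # new roots are gained in proportion to the current lexicon size
                gains = sum(random.random() < gain_rate for _ in range(len(lexicon)))
                lexicon.update(next(new_roots) for _ in range(gains))
            return lexicon

        ancestor = set(range(200))                 # 200 roots in the proto-language
        daughter_a = simulate_gain_loss(ancestor)  # two independent descendants
        daughter_b = simulate_gain_loss(ancestor)
        print(len(daughter_a & daughter_b), "shared roots remain")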

    Another approach is reflected in the more "classical" work on lexicostatistics, where lexical change is modeled as a process of lexical replacement within previously selected concept slots. I will call this model the concept-slot model. In this model (and potential variants of it), a language is not a bag of words whose contents change over time, but rather a chest of drawers, in which each drawer represents a specific concept, and the contents of a drawer represent the words that can be used to express that concept. In such a model, lexical change proceeds as a replacement process: a word within a given concept drawer is replaced by another word.

    This model represents the classical way in which Morris Swadesh used to view the evolution of a given language. It is still present in the work of scholars working in the original framework of lexicostatistics (Starostin 2000), but it is used almost exclusively within distance-based frameworks, since a character-based account of the model would require a potentially large number of character states, which usually exceeds the number of character states allowed in the classical software packages for phylogenetic reconstruction.

    Similar to the gain-loss model, there have not been many attempts to test the characteristics of this model in simulation studies. The only one known to me is a posthumously published letter from Sergei Starostin (1953-2005) to Murray Gell-Mann (Starostin 2007), in which he describes an attempt to account, in a computer simulation, for his theory that a word's replacement rate increases with the word's age (Starostin 2000).
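
    A toy version of the concept-slot model is just as easy to sketch. In the following snippet (again with made-up rates), the uniform replacement rate deliberately ignores Starostin's age-dependent refinement:

        import itertools
        import random

        random.seed(42)
        new_words = itertools.count(10**6)  # supply of new word identifiers

        def simulate_replacement(language, steps=100, replacement_rate=0.002):
            """Replace the word in each concept slot at a fixed rate per step."""
            language = dict(language)
            for _ in range(steps):
                for concept in language:
                    if random.random() < replacement_rate:
                        language[concept] = next(new_words)
            return language

        ancestor = {concept: concept for concept in range(100)}  # 100 concept slots
        daughter_a = simulate_replacement(ancestor)
        daughter_b = simulate_replacement(ancestor)
        shared = sum(daughter_a[c] == daughter_b[c] for c in ancestor)
        print(shared / len(ancestor), "of the concepts still show cognate words")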

    Problems with current models of lexical change

    Neither the gain-loss model nor the concept-slot model seems to be misleading when it comes to describing the process of lexical change. However, both obviously ignore specific and crucial aspects of lexical change that (according to the task stated above) any ambitious simulation of lexical change should try to account for. The gain-loss model, for example, deliberately ignores semantic change and morphological change. It can account for borrowings, which can easily be included in a simulation by allowing contemporary languages to exchange words with each other, but it cannot tell us (since it ignores the meaning of word forms) how the meaning of words changes over time, or how word forms change their shape due to morphological change.

    The concept-slot model can, in theory, account for semantic change, but only as far as the concept slots allow: the number of concepts in this model is fixed, and one does not usually assume that it would change. Furthermore, while borrowing can be included in this model, the model does not handle processes of morphological change.

    In phylogenetic approaches, both models also have clear disadvantages. The main problem of the gain-loss model is the sampling procedure. Since one cannot sample all the words of a language, scholars usually derive the cognate sets they use to reconstruct phylogenies from cognate-coded lexicostatistical word-lists. As I have tried to show earlier, in List (2016), this sampling procedure can lead to problems when homology is defined in a loose way. The problem of the concept-slot model is that it cannot easily be applied in phylogenetic inference based on likelihood models (like maximum likelihood or Bayesian inference), since the only straightforward way to handle it would be via multi-state models, which are generally difficult to handle.

    Initial ideas for improvement

    For the moment, I have no direct idea of how to model morphological change, and more research will be needed before we will be able to handle this in models of lexical change. The inability of the gain-loss and the concept-slot models to account for semantic change, however, can be overcome by turning to bipartite graph models of lexical change (see Newman 2010: 32f for details on bipartite graphs). In such a model, the lexicon of a human language is represented by a bipartite graph consisting of concepts as one type of node and word forms as the other type of node. The association strength of a given word node and a given concept node (or its "reference potential", see List 2014: 21f), i.e. the likelihood of a word being used by a speaker to denote a given concept, can be modeled with the help of weighted edges. This model naturally accounts for synonymy (where a meaning can be expressed by multiple words) and polysemy (where a word can express multiple meanings). Lexical change in such a model would consist of the re-arrangement of the weights in the network. Word gain and word loss would occur when a new word node is introduced into the network, or when an existing node becomes dissociated from all of the concepts.


    Sankoff's (1969) bipartite model of the lexicon of human languages

    We can find this idea of a bipartite modeling of a language's lexicon in the early linguistic work of Sankoff (1969: 28-53), as reflected in the figure above, taken from his dissertation (Figure 5, p. 36). Similarly, Smith (2004) used bipartite form-concept networks (which he describes as a matrix) in order to test the mechanisms by which vocabularies are transmitted, from the perspective of different theories of cultural evolution.
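
    The following minimal sketch (with hypothetical forms, concepts, and weights) illustrates the core of such a bipartite model: edges carry association weights, semantic change re-arranges them, and a word form is lost once it becomes dissociated from every concept.

        lexicon = {
            ("head", "HEAD"): 0.9,    # "head" mostly denotes the concept HEAD ...
            ("head", "LEADER"): 0.1,  # ... but is polysemous
            ("chief", "LEADER"): 0.9,
        }

        def shift_weight(lexicon, word, concept, amount):
            """Strengthen or weaken one word-concept edge; edges with
            weight zero are removed from the graph."""
            edge = (word, concept)
            lexicon[edge] = lexicon.get(edge, 0.0) + amount
            if lexicon[edge] <= 0:
                del lexicon[edge]  # dissociation: potential word loss

        # semantic change: "head" loses its LEADER reading, "chief" takes over
        shift_weight(lexicon, "head", "LEADER", -0.1)
        shift_weight(lexicon, "chief", "LEADER", 0.1)
        print(lexicon)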

    As I have never actively tried to review the large amount of literature devoted to simulation studies in historical linguistics, biology, and cultural evolution, it is quite possible that this blogpost lacks references to important studies devoted to the problem. Despite this possibility, we can clearly say that we are lacking simulation studies in historical linguistics. I am furthermore convinced that the problem of handling lexical change in simulation studies is a difficult one, and that we may well have to acquire more knowledge of the key processes of lexical change before we can address it sufficiently.

    While I understand the popularity of gain-loss models in recent work on phylogenetic reconstruction in historical linguistics, I hope that it might be possible to develop more realistic models in the future. It is quite possible that such studies will confirm the superiority of gain-loss models over alternative approaches. But instead of assuming this in an axiomatic way, as we seem to be doing for the time being, I would rather see some proof of this in simulation studies, or in studies where the data fed to the gain-loss algorithms are sampled differently.

    References

    Bowern, Claire and Atkinson, Quentin D. (2012) Computational phylogenetics and the internal structure of Pama-Nyungan. Language 88: 817-845.

    Cohen, Ofir and Rubinstein, Nimrod D. and Stern, Adi and Gophna, Uri and Pupko, Tal (2008) A likelihood framework to analyse phyletic patterns. Philosophical Transactions of the Royal Society B 363: 3903-3911.

    Gévaudan, Paul (2007) Typologie des lexikalischen Wandels. Bedeutungswandel, Wortbildung und Entlehnung am Beispiel der romanischen Sprachen. Tübingen: Stauffenburg.

    Gray, Russell D. and Jordan, Fiona M. (2000) Language trees support the express-train sequence of Austronesian expansion. Nature 405: 1052-1055.

    Gray, Russell D. and Atkinson, Quentin D. (2003) Language-tree divergence times support the Anatolian theory of Indo-European origin. Nature 426: 435-439.

    Gray, Russell D. and Greenhill, Simon J. and Ross, Malcolm D. (2007) The pleasures and perils of Darwinizing culture (with phylogenies). Biological Theory 2: 360-375.

    Greenhill, S. J. and Currie, T. E. and Gray, R. D. (2009) Does horizontal transmission invalidate cultural phylogenies? Proceedings of the Royal Society of London, Series B 276: 2299-2306.

    Lees, Robert B. (1953) The basis of glottochronology. Language 29: 113-127.

    List, Johann-Mattis (2016) Beyond cognacy: Historical relations between words and their implication for phylogenetic reconstruction. Journal of Language Evolution 1: 119-136.

    Murawaki, Yugo (2015) Spatial structure of evolutionary models of dialects in contact. PLoS One 10: e0134335.

    Newman, M. E. J. (2010) Networks: An Introduction. Oxford: Oxford University Press.

    Nicholls, Geoff K. and Ryder, Robin J. and Welch, David (2013) TraitLab: A MatLab package for fitting and simulating binary tree-like data.

    Ronquist, Fredrik and Huelsenbeck, J. P. (2003) MrBayes 3: Bayesian phylogenetic inference under mixed models. Bioinformatics 19: 1572-1574.

    Sagart, Laurent, Jacques, Guillaume, Lai, Yunfan, Ryder, Robin, Thouzeau, Valentin, Greenhill, Simon J., List, Johann-Mattis (2019) Dated language phylogenies shed light on the ancestry of Sino-Tibetan. Proceedings of the National Academy of Sciences of the United States of America 116: 10317-10322. DOI: 10.1073/pnas.1817972116

    Sankoff, David (1969) Historical Linguistics as Stochastic Process. McGill University: Montreal.

    Smith, Kenny (2004) The evolution of vocabulary. Journal of Theoretical Biology 228: 127-142.

    Starostin, Sergej Anatolévič (2000) Comparative-historical linguistics and lexicostatistics. In: Renfrew, Colin, McMahon, April, Trask, Larry (eds.) Time Depth in Historical Linguistics. Vol. 1. Cambridge: McDonald Institute for Archaeological Research, pp. 223-265.

    Starostin, Sergej A. (2007) Computer-based simulation of the glottochronological process (Letter to M. Gell-Mann). In: S. A. Starostin: Trudy po yazykoznaniyu [S. A. Starostin: Works in Linguistics]. LRC Publishing House, pp. 854-861.

    Swadesh, Morris (1952) Lexico-statistic dating of prehistoric ethnic contacts. With special reference to North American Indians and Eskimos. Proceedings of the American Philosophical Society 96: 452-463.

    Swadesh, Morris (1955) Towards greater accuracy in lexicostatistic dating. International Journal of American Linguistics 21.2: 121-137.

    Automatic phonological reconstruction (Open problems in computational diversity linguistics 4)


    The fourth problem in my list of open problems in computational diversity linguistics is devoted to the problem of linguistic reconstruction, or, more specifically, to the problem of phonological reconstruction, which can be characterized as follows:
    Given a set of cognate morphemes across a set of related languages, try to infer the hypothetical pronunciation of each morpheme in the proto-language.
    This task needs to be distinguished from the broader task of linguistic reconstruction, which would usually also include the reconstruction of full lexemes, i.e. lexical reconstruction — as opposed to single morphemes or "roots" — in an unknown ancestral language. In some cases, linguistic reconstruction is even used as a cover term for all reconstruction methods in historical linguistics, including such diverse approaches as phylogenetic reconstruction (finding the phylogeny of a language family), semantic reconstruction (finding the meaning of a reconstructed morpheme or root), or the task of demonstrating that languages are genetically related (see, e.g., the chapters in Fox 1995).

    Phonological and lexical reconstruction

    In order to understand the specific difference between phonological and lexical reconstruction, and why making this distinction is so important, consider the list of words meaning "yesterday" in five Burmish languages (taken from Hill and List 2017: 51).

    Figure 1: Cognate words in Burmish languages (taken from Hill and List 2017)

    Four of these languages express the word for "yesterday" with the help of more than one morpheme, indicated by the different colors in the table's phonetic transcriptions, which at the same time also indicate which words we consider to be homologous in this sample. Four of the languages have one morpheme which (as we confirmed from the detailed language data) independently means "day". This morpheme is given the label 2 in the last column of the table. From this, we can see that the motivation by which the word for "yesterday" is composed in these languages is similar to the one we observe in English, where we also find the word day being a part of the word yester-day.

    If we want to know how the word for "yesterday" was expressed in the ancestor of the Burmish languages, we could make an abstract estimation based on the lexical material we have at hand. We might assume that it was also a compound word, given the importance of compounding in all living Burmish languages. We could further hypothesize that one part of the ancient compound would have been the original word for "day". We could even make a guess and say that the word was similar in structure to the words in Bola and Lashi (although it would be difficult to justify doing so). In all of these cases, we would be proposing a lexical reconstruction of the word for "yesterday" in Proto-Burmish. We would be making an assumption with respect to what one could call the denotation structure, or the motivation structure, as we called it in Hill and List (2017: 67). This assumption would not need to provide an actual pronunciation of the word; it could be proposed entirely independently.

    If we want to reconstruct the pronunciation of the ancient word for "yesterday" as well, we have to compare the corresponding sounds, and build a phonological reconstruction for each of the morphemes separately. As a matter of fact, scholars working on South-East Asian languages rarely propose a full lexical reconstruction as part of their reconstruction systems (for a rare exception, see Mann 1998). Instead, they pick the homologous morphemes from their word comparisons, assign some rough meaning to them (this step would be called semantic reconstruction), and then propose an ancient pronunciation based on the correspondence patterns they observe.

    When listing phonological reconstruction as one of my ten problems, I deliberately distinguish this task from the tasks of lexical and semantic reconstruction, since they can (and probably should) be carried out independently. Furthermore, by describing the pronunciations of the morphemes as "hypothetical pronunciations" in the ancestral language, I want not only to emphasize that all reconstruction is hypothetical, but also to point to the fact that some of the morphemes for which one proposes a proto-form may not even have existed in the proto-language. They could have evolved only later, as innovations on certain branches in the history of the languages. For the task of phonological reconstruction, however, this does not matter, since the question of whether a morpheme existed in the most recent common ancestor becomes relevant only if one tries to reconstruct the lexicon of a given proto-language. Phonological reconstruction instead seeks to reconstruct the phonology, i.e. the sound inventory of the proto-language, along with the rules by which these sounds could be combined to form morphemes (the phonotactics).

    Why phonological reconstruction is hard

    That phonological reconstruction is hard should not be surprising. The task entails finding the most probable pronunciation for a bunch of morphemes in a language for which no written records exist. Imagine, as a biologist, that you wanted to find the DNA of LUCA, not even in its folded form with all of the pieces in place, but just a couple of chunks, in order to get a better picture of what this LUCA might have looked like. But while we can employ some weak version of uniformitarianism when trying to reconstruct at least some genes of our LUCA (we would still assume that it was using some kind of DNA, drawn from the typical alphabet of DNA letters), we face the specific problem in linguistics that we cannot even be sure about the letters.

    Only recently, Blasi et al. (2019) argued that sounds like f and v may have evolved later than the other sounds we find in the languages of the world, driven by post-Neolithic changes in bite configuration, which seem to depend on what we eat. As a rule, and independently of these findings, linguists tend not to reconstruct an f for the proto-language in those cases where they find it corresponding to a p, since we know that in almost all known cases a p can evolve into an f, but an f almost never becomes a p again. This can lead to the strange situation in which some linguists reconstruct a p for a given proto-language even though all descendants show an f, which is, of course, an exaggeration of the principle (see Guillaume Jacques' discussion of this problem).

    But the very idea that we may have good reasons to reconstruct something in an ancestral language that has been lost in all descendant languages is something completely normal for linguists. In 1879, for example, Ferdinand de Saussure (Saussure 1879) used internal and comparative evidence to propose the existence of what he called coefficients sonantiques in Proto-Indo-European. His proposal included the prediction that — if ever a language were found that retained these elements — the new sounds would surface as segmental elements, as distinctive sounds, in certain cognate sets where all known Indo-European languages had already lost the contrast.

    These sounds are nowadays known as laryngeals (*h₁, *h₂, *h₃, see Meier-Brügger 2002), and when Hittite was identified as an Indo-European language (Hrozný 1915), one of the two sounds predicted by Saussure could indeed be identified. I have discussed the problem of unattested character states in historical linguistics on this blog before, so there is no need to go into further detail here. What I want to emphasize is that this aspect of linguistic reconstruction in general, and phonological reconstruction specifically, is one of the many points that make the task really hard, since any algorithm to reconstruct the phonological system of some proto-language would have to find a way to formalize the complicated arguments by which linguists infer that there are traces of something that is no longer there.

    There are many more things that I could mention if I wanted to identify the difficulty of phonological reconstruction in its entirety. What I find most difficult to deal with is that the methodology is insufficiently formalized. Linguists have their success stories, which helped them to predict certain aspects of a given proto-language that could later be confirmed, and it is due to these success stories that we are confident that the task can, in principle, be done. But the methodological literature is sparse, and the rare cases where scholars have tried to formalize it are rarely discussed when it comes to evaluating concrete proposals (as an example of an attempt at formalization, see Hoenigswald 1960). Before this post becomes too long, I will therefore conclude by noting that scholars usually have a pretty good idea of how they should perform their phonological reconstructions, but that this knowledge of how one should reconstruct a proto-language is usually not seen as something that could be formalized completely.

    Traditional strategies for phonological reconstruction

    Given the lack of methodological literature on phonological reconstruction, it is not easy to describe how it should be done in an ideal scenario. What seems to me the most promising approach is to start from correspondence patterns. A correspondence pattern is an abstraction from individual alignment sites distributed over cognate sets drawn from related languages. As I have tried to show in a paper published earlier this year (List 2019), a correspondence pattern summarizes individual alignment sites in an abstract form in which missing data are imputed. I will avoid going into the details here but, as a shortcut, we can say that each correspondence pattern should, in theory, correspond to only one proto-sound in the language, although the same proto-sound may correspond to more than one correspondence pattern. As an example, consider the following table, showing three (fictive) patterns that would all be reconstructed with a *p.

     Proto-Form   L₁   L₂   L₃
     *p           p    p    f
     *p           p    p    p
     *p           b    p    p

    To justify that the same proto-sound *p is reconstructed for all three patterns, linguists invoke the context, by looking at the real words from which the patterns were derived. An example is shown in the next table.


     Proto-Form   L₁        L₂        L₃
     *p i a ŋ     p i a ŋ   p i u ŋ   f a n
     *p a t       p a t     p a t     p a t
     *a p a ŋ     a b a ŋ   a p a ŋ   a p a n

    What you should be able to see from the table is that we can find, in all three patterns, a conditioning factor that allows us to assume that the deviation from the original *p is secondary. In language L₃, the factor can be found in the palatal environment (the following front vowel *i) of the ancestral language. We would assume that this environment triggered the change from *p to f in this language. In the case of the change from *p to b in L₁, the triggering environment is that the p occurs inter-vocalically.

    To summarize: what linguists usually do in order to reconstruct proto-forms for ancestral languages that are not attested in written sources is to investigate the correspondence patterns, and to try to find a neat explanation of how they could have evolved, given a set of proto-forms along with the triggering contexts that explain the individual changes in the individual descendant languages.
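
    In its most naive form, the step from correspondence patterns to proto-sounds can be thought of as a lookup in a table that a linguist has annotated. The following sketch repeats the fictive patterns from above; the proto-sound assignments are exactly the kind of decision the linguist has to justify by invoking context:

        # fictive correspondence patterns across L1, L2, and L3, annotated
        # with the proto-sound a linguist would assign to each of them
        patterns = {
            ("p", "p", "f"): "*p",  # palatal environment in L3
            ("p", "p", "p"): "*p",  # default reflexes
            ("b", "p", "p"): "*p",  # intervocalic voicing in L1
        }

        def reconstruct(alignment_site):
            """Propose a proto-sound for one site of an alignment."""
            return patterns.get(tuple(alignment_site), "?")

        print(reconstruct(["p", "p", "f"]))  # *p
        print(reconstruct(["b", "b", "b"]))  # ? -- pattern not yet annotated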

    Computational strategies for phonological reconstruction

    Not many attempts have been made so far to automate the task of reconstruction. The most prominent proposal in this direction has been made by Bouchard-Côté et al. (2013). Their strategy differs radically from the strategy outlined above, since they do not make use of correspondence patterns, but instead use a stochastic transducer and known cognate words in the descendant languages, along with a known phylogenetic tree, which they traverse while inferring the most likely changes that could explain the observed distribution of cognate sets.

    So far, this method has been tested only on Austronesian languages and their subgroups, where it performed particularly well (with error rates between 0.25 and 0.12, using edit distance as the evaluation measure). Since it is not available as a software package that can be conveniently used and tested on other language families, it is difficult to tell how well it would perform when being presented with more challenging test cases.

    In a forthcoming paper, Gerhard Jäger illustrates how classical methods for ancestral state reconstruction applied to aligned cognate sets could be used for the same task (Jäger forthcoming). While Jäger's method is more in line with "linguistic thinking", in so far as he uses alignments, and applies ancestral state reconstructions to each column of the alignments, it does not make use of correspondence patterns, which would be the general way by which linguists would proceed. This may also explain the performance, which shows an error rate of 0.48 (also using edit distance for evaluation) — although this is also due to the fact that the method was tested on Romance languages and compared with Latin, which is believed to be older than the ancestor of all Romance languages.

    Problems with computational strategies for phonological reconstruction

    Both the method of Bouchard-Côté et al. and the approach of Jäger suffer from the problem of not being able to detect unobserved sounds in the data. Jäger side-steps this problem in theory, by using the reduced alphabet of only 40 characters proposed by the ASJP project, which has encoded more than half of the world's languages in this form. Bouchard-Côté's test data, Proto-Austronesian (and its subgroups), are fairly simple in this regard. It would therefore be interesting to see what would happen if the methods were tested with full phonetic (or phonological) representations of more challenging language families (for example, the Chinese dialects). While Jäger's approach assumes the independence of all alignment sites, Bouchard-Côté's stochastic transducers handle context on the level of bigrams (if I read their description properly). However, while bigrams can be seen as an improvement over ignoring conditioning context, they are not the way in which context is typically handled by linguists. As I tried to explain briefly in last month's post, context in historical linguistics calls for a handling of abstract contexts, for example, by treating sequences as layered entities, similar to musical scores.

    Apart from the handling of context and unobserved characters, the evaluation measure used in both approaches also seems problematic. Both approaches use the edit distance (Levenshtein 1965), which is equivalent to the Hamming distance (Hamming 1950) applied to aligned sequences. Given the problem of unobserved characters and the abstract nature of linguistic reconstruction systems, however, any measure that evaluates only the surface similarity of sequences is essentially wrong.

    To illustrate this point, consider the reconstruction of the Indo-European word for "sheep" by Kortlandt (2007), who gives *ʕʷ e u i s, as compared to Lühr (2008), who gives *h₂ ó w i s. The normalized edit distance between the two forms is the Hamming distance of their (trivial) alignment: they differ in three of five positions, which amounts to an unnormalized edit distance of three, and a normalized edit distance of 0.6. While this is pretty high, the two systems are mostly compatible, since Kortlandt reconstructs *ʕʷ in most cases where Lühr writes *h₂. Therefore, the distance should be much lower; in fact, it should be zero, since both authors agree on the structure of the form they reconstruct, in comparison with the structure of the other words they reconstruct for Proto-Indo-European.
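
    The calculation is easy to reproduce. The following snippet computes the Hamming distance over the trivial alignment of the two reconstructions:

        kortlandt = ["ʕʷ", "e", "u", "i", "s"]
        luehr     = ["h₂", "ó", "w", "i", "s"]

        differences = sum(a != b for a, b in zip(kortlandt, luehr))
        print(differences)                   # 3
        print(differences / len(kortlandt))  # 0.6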

    Since scholars do not necessarily select phonetic values in their reconstructions that derive directly from the descendant languages, and since they may often differ regarding the details of the phonetic values they propose, a valid evaluation of different reconstruction systems (including automatically derived ones) needs to compare the structure of the systems, not their substance (see List 2014: 48-50 for a discussion of structural and substantial differences between sequences).

    Currently, there is (to my knowledge) no accepted solution for the comparison of structural differences among aligned sequences. Finding an adequate evaluation measure for comparing reconstruction systems can therefore be seen as a sub-problem of the bigger problem of phonological reconstruction. To illustrate why it is so important to compare the structural information and not the pure substance, consider the three cases in which Jäger's reconstruction gives a v as opposed to a w in Latin (data here): while evaluating by the edit distance yields a score of 0.48, this score drops to 0.47 when the v instances are replaced with a w. Jäger's system is doing something right, but the edit distance cannot capture the fact that the system deviates from Latin systematically, not randomly.

    Initial ideas for improvement

    There are many things that we can easily improve when working on automatic methods for phonological reconstruction.

    As a first point, we should work on enhanced measures of evaluation, going beyond the edit distance as our main evaluation measure. In fact, this can easily be done. With B-Cubed scores (Amigó et al. 2009), we already have a straightforward measure of whether two reconstruction systems are structurally identical or similar. In order to apply these scores, the automatic reconstructions have to be aligned with the gold standard. If they are identical, although the symbols may differ, the scores will indicate this. The problem of comparing reconstruction systems is, of course, more difficult, as we can face cases where systems are not structurally identical (i.e., where one cannot directly replace every symbol a in system A by a symbol a' in system B to produce B from A, and vice versa), but the scores would be a start.
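
    As a minimal sketch of this idea (not the reference formulation of Amigó et al. 2009), the following snippet scores two aligned reconstruction systems by comparing only the groupings that their symbols induce over the alignment sites; the "sheep" example from above then comes out as structurally identical, despite the different symbols:

        def bcubed_precision(test, gold):
            """Average per-site precision of the groups induced by the test
            symbols against the groups induced by the gold symbols."""
            scores = []
            for i in range(len(test)):
                cluster = [j for j in range(len(test)) if test[j] == test[i]]
                hits = [j for j in cluster if gold[j] == gold[i]]
                scores.append(len(hits) / len(cluster))
            return sum(scores) / len(scores)

        def bcubed_f(test, gold):
            """Harmonic mean of B-Cubed precision and recall."""
            p = bcubed_precision(test, gold)
            r = bcubed_precision(gold, test)  # recall swaps the roles
            return 2 * p * r / (p + r)

        gold = ["h₂", "ó", "w", "i", "s"]  # one reconstruction system
        test = ["ʕʷ", "e", "u", "i", "s"]  # different symbols, same structure
        print(bcubed_f(test, gold))        # 1.0: structurally identical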

    Furthermore, given that we lack test cases, we might, in the meantime, want to work on semi-automatic instead of fully automatic methods. Given that we have a first method to infer sound correspondence patterns from aligned data (List 2019), we can infer all patterns and have linguists annotate each pattern with the proto-sound they think fits best — we are testing this at the moment. Having created enough datasets in this form, we could then think about discussing concrete algorithms that derive proto-forms from correspondence patterns, and use the semi-automatically created and manually corrected data as a gold standard.

    Last but not least, one straightforward way to formally create unknown sounds from known data is to represent sounds as vectors of phonological features instead of bare symbols (e.g., representing p as a voiceless bilabial plosive and b as a voiced bilabial plosive). If we then compare alignment sites or correspondence patterns on the basis of the feature vectors, we can check to what degree standard algorithms for ancestral state reconstruction propose unattested sounds similar to the ones proposed by experts. In order to do this, we would need to encode our data in transparent transcription systems. This is not the case for most current datasets, but with the Cross-Linguistic Transcription Systems initiative we already have a first attempt to provide features for the majority of sounds that we find in the languages of the world (Anderson et al. forthcoming).
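
    A minimal sketch of this last idea, with a hypothetical three-language correspondence pattern and simple majority voting over the features, shows how a feature-based reconstruction can yield a sound that is attested nowhere in the pattern itself:

        # hypothetical feature vectors (voice, place, manner) for the sounds
        # attested in one hypothetical correspondence pattern
        features = {
            "m": ("voiced", "bilabial", "nasal"),
            "d": ("voiced", "alveolar", "plosive"),
            "p": ("voiceless", "bilabial", "plosive"),
        }

        def majority_vector(pattern):
            """Reconstruct each feature independently by majority vote."""
            vectors = [features[sound] for sound in pattern]
            return tuple(max(set(values), key=values.count)
                         for values in zip(*vectors))

        # the majority bundle is (voiced, bilabial, plosive), i.e. a b --
        # a sound that occurs nowhere in the observed pattern
        print(majority_vector(["m", "d", "p"]))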

    Outlook

    It is difficult to tell how hard the problem of phonological reconstruction is in the end. Semi-automatic solutions are already feasible now, and we are currently testing them on different (smaller) groups of phylogenetically related languages. One crucial step in the future is to code up enough data to allow for a rigorous testing of the few automatic solutions that have been proposed so far. We are working on that as well. But how to propose an evaluation system that rigorously tests not only to what degree a given reconstruction is identical with a given gold standard, but also structurally equivalent, remains one of the crucial open problems in this regard.

    References
      Amigó, Enrique and Gonzalo, Julio and Artiles, Javier and Verdejo, Felisa (2009) A comparison of extrinsic clustering evaluation metrics based on formal constraints. Information Retrieval 12.4: 461-486.

      Anderson, Cormac and Tresoldi, Tiago and Chacon, Thiago Costa and Fehn, Anne-Maria and Walworth, Mary and Forkel, Robert and List, Johann-Mattis (forthcoming) A cross-linguistic database of phonetic transcription systems. Yearbook of the Poznań Linguistic Meeting, pp. 1-27.

      Blasi, Damián E. and Moran, Steven and Moisik, Scott R. and Widmer, Paul and Dediu, Dan and Bickel, Balthasar (2019) Human sound systems are shaped by post-Neolithic changes in bite configuration. Science 363.1192: 1-10.

      Bouchard-Côté, Alexandre and Hall, David and Griffiths, Thomas L. and Klein, Dan (2013) Automated reconstruction of ancient languages using probabilistic models of sound change. Proceedings of the National Academy of Sciences 110.11: 4224–4229.

      Fox, Anthony (1995) Linguistic Reconstruction: An Introduction to Theory and Method. Oxford: Oxford University Press.

      Hamming, Richard W. (1950) Error detecting and error correcting codes. Bell System Technical Journal 29.2: 147-160.

      Hill, Nathan W. and List, Johann-Mattis (2017) Challenges of annotation and analysis in computer-assisted language comparison: a case study on Burmish languages. Yearbook of the Poznań Linguistic Meeting 3.1: 47–76.

      Hoenigswald, Henry M. (1960) Phonetic similarity in internal reconstruction. Language 36.2: 191-192.

      Hrozný, Bedřich (1915) Die Lösung des hethitischen Problems [The solution of the Hittite problem]. Mitteilungen der Deutschen Orient-Gesellschaft 56: 17–50.

      Jäger, Gerhard (forthcoming) Computational historical linguistics. Theoretical Linguistics.

      Kortlandt, Frederik (2007) For Bernard Comrie.

      Levenshtein, V. I. (1965) Dvoičnye kody s ispravleniem vypadenij, vstavok i zameščenij simvolov [Binary codes with correction of deletions, insertions and replacements]. Doklady Akademij Nauk SSSR 163.4: 845-848.

      List, Johann-Mattis (2014) Sequence Comparison in Historical Linguistics. Düsseldorf: Düsseldorf University Press.

      List, Johann-Mattis (2019) Automatic inference of sound correspondence patterns across multiple languages. Computational Linguistics 45.1: 137-161.

      Lühr, Rosemarie (2008) Von Berthold Delbrück bis Ferdinand Sommer: Die Herausbildung der Indogermanistik in Jena. Vortrag im Rahmen einer Ringvorlesung zur Geschichte der Altertumswissenschaften (09.01.2008, FSU-Jena).

      Mann, Noel Walter (1998) A Phonological Reconstruction of Proto Northern Burmic. The University of Texas: Arlington.

      Meier-Brügger, Michael (2002) Indogermanische Sprachwissenschaft. Berlin and New York: de Gruyter.

      Saussure, Ferdinand de (1879) Mémoire sur le Système Primitif des Voyelles dans les Langues Indo-Européennes. Leipzig: Teubner.

      Automatic sound law induction (Open problems in computational diversity linguistics 3)


      The third problem in my list of ten open problems in computational diversity linguistics is a problem that has (to my knowledge) not even been considered a true problem in computational historical linguistics so far. Until now, it has been discussed by colleagues only indirectly. This problem, which I call the automatic induction of sound laws, can be described as follows:
      Starting from a list of words in a proto-language and their reflexes in a descendant language, try to find the rules by which the ancestral language is converted into the descendant language.
      Note that by "rules", in this context, I mean the classical notation that phonologists and historical linguists use in order to convert a source sound in a target sound in a specific environment (see Hall 2000: 73-75). If we consider the following ancestral and descendant words from a fictive language, we can easily find the laws by which the input should be converted into an output — namely, an a should be changed to an e, an e should be changed to an i, and a k changes to s if followed by an i but not if followed by an a.

      Input   Output
      papa    pepe
      mama    meme
      kaka    keke
      keke    sisi

      Short excursus on linguistic notation of sound laws

      Based on the general idea of sound change (or sound laws in classical historical linguistics) as some kind of function by which a source sound is taken as input and turned into a target sound as output, linguists use a specific notation system for sound laws. In the simplest form of the classical sound law notation, this process is described in the form s > t, where s is the source sound and t is the target sound. Since sound change often relies on the specific conditions of the surrounding context — i.e., it makes a difference whether a sound occurs at the beginning or the end of a word — context is added as a condition separated by a /, with an underscore _ marking the position of the sound in its phonetic environment. Thus, the phenomenon of voiced stops becoming unvoiced at the end of words in German (e.g., d becoming t) can be written as d > t / _$, where $ denotes the end of a word.

      One can see how close this notation comes to regular expressions, and, according to many scholars, the rules by which languages change with respect to their sound systems do not exceed the complexity of regular grammars. Nevertheless, sound change notation does differ in scope and in the rules for annotation. One notable difference is the possibility of describing how full classes of sounds change in a specific environment. The German rule of devoicing, for example, generally affects all voiced stops at the end of a word. As a result, one could also annotate it as G > K / _$, where G denotes the sounds [b, d, g] and K their counterparts [p, t, k]. Although we could easily write a single rule for each of the three phenomena here, the rule by which the sounds are grouped into two classes of voiced sounds and their unvoiced counterparts is linguistically more interesting, since it reminds us that the change by which word-final consonants lose the feature of voice is a systemic change, and not a phenomenon applying to some random selection of sounds in a given language.
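
      To connect this notation to its computational counterpart, the following minimal sketch (with hypothetical toy transcriptions) expands the class-based rule G > K / _$ into a regular expression over the class [b, d, g]:

          import re

          G, K = "bdg", "ptk"  # the two classes of the rule G > K / _$
          devoice = str.maketrans(G, K)

          def devoice_final(word):
              """Apply G > K / _$: devoice a word-final voiced stop."""
              return re.sub(f"[{G}]$", lambda m: m.group().translate(devoice), word)

          for word in ["hund", "tag", "lieb", "hunde"]:
              print(word, "->", devoice_final(word))
          # hund -> hunt, tag -> tak, lieb -> liep, hunde -> hunde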

      The problem with this systemic annotation, however, is that the grouping of sounds into classes that change in a similar form is often language-specific. As a result, scholars have to propose new groupings whenever they deal with another language. Since neither the notation of sound values nor the symbols used to group sounds into classes are standardized, it is extremely difficult to compare different proposals made in the literature. As a result, any attempt to solve the problem of automatic sound law induction in historical linguistics would, at the same time, have to make strict proposals for a standardization of the sound law notation used in our field. Standardization can thus be seen as one of the first major obstacles to solving this problem, with the problem of accounting for the systemic aspects of sound change as the second.

      Beyond regular expressions

      Even if we put the problems of inconsistent annotation and systemic changes to one side, the analogy with regular expressions cannot properly handle all aspects of sound change. When looking at the change from Middle Chinese to Mandarin Chinese, for example, we find a complex pattern, by which originally voiced sounds, like [b, d, g, dz] (among others), were either devoiced, becoming [p, t, k, ts], or devoiced and aspirated, becoming [pʰ, tʰ, kʰ, tsʰ]. While it is not uncommon that one sound can change into two variants, depending on the context in which it occurs, the Mandarin sound change in this case is interesting because the context is not a neighboring sound, but rather the Middle Chinese tone of the syllable in question — syllables with a flat tone (called píng tone in classical terminology) nowadays show the voiceless aspirated reflexes, and syllables with one of the three remaining Middle Chinese tones (called shǎng, qù, and rù) show the plain voiceless ones (see List 2019: 157 for examples).

      Since tone is a feature that applies to whole syllables, and not to single sound segments, we are dealing here with so-called supra-segmental features. As the term supra-segmental indicates, the features in question cannot be represented as part of the sequence of sounds, but need to be thought of as an additional layer, similar to other supra-segmental features in language, such as stress or juncture (indicating word or morpheme boundaries).
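
      A minimal sketch of such a layered representation (with a hypothetical syllable and simplified tone labels) shows how a rule can condition on the tone tier rather than on neighboring segments:

          syllable = {
              "segments": ["b", "a", "ŋ"],  # the segmental tier
              "tone": "ping",               # a supra-segmental tier: one value per syllable
          }

          def devoice_and_aspirate(syllable):
              """Simplified Mandarin-style development: a voiced initial stop
              becomes aspirated voiceless under the ping tone, and plain
              voiceless under the other tones."""
              mapping = {"b": "p", "d": "t", "g": "k"}
              segments = list(syllable["segments"])
              if segments[0] in mapping:
                  segments[0] = mapping[segments[0]]
                  if syllable["tone"] == "ping":
                      segments[0] += "ʰ"
              return {**syllable, "segments": segments}

          print(devoice_and_aspirate(syllable))  # initial pʰ, conditioned by the tone tier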

      In contrast to the sequences we meet in mathematics and informatics, linguistic sound sequences do not consist solely of letters drawn from an alphabet, lined up in some unique order. They are instead often composed of multiple layers, which are in part hierarchically ordered. Words, morphemes, and phrases in linguistics are thus multi-layered constructs, which cannot be represented by one sequence alone, but may be more fruitfully thought of as analogous to a partitura in music — the score of a piece of orchestral music, in which every voice of the orchestra is given its own sequence of notes, and all the different sequences are aligned with each other to form a whole.

      The multi-layered character of sound sequences can be seen as similar to a partitura in musical notation.

      This multi-layered character of sound sequences in spoken languages constitutes a third complication for the task of automatic sound law induction. Finding the individual laws that trigger the change from one stage of a language to a later stage cannot (always) be trivially reduced to the task of finding the finite state transducer that translates a set of input strings into a corresponding set of output strings. Since our input word forms in the proto-language are not simple strings, but rather alignments of the different layers of a word form, a method to induce sound laws needs to be able to handle the multi-layered character of linguistic sequences.

      Background for computational approaches to sound law induction

      To my knowledge, the question of how to induce sound laws from data on proto- and descendant languages has barely been addressed. What comes closest to the problem are attempts to model sound change from known ancestral languages, such as Latin, to daughter languages, such as Spanish. This is reflected, for example, in the PHONO program (Hartmann 2003), in which one can insert data for a proto-language along with a set of sound change rules (provided in a form similar to the one mentioned above), which need to be given in a specific order, and which are then checked to see whether they correctly predict the descendant forms.

      For teaching purposes, I adapted a JavaScript version of a similar system, the Sound Change Applier² (http://www.zompist.com/sca2.html) by Mark Rosenfelder from 2012, in which students can try to turn Old High German into modern German by writing simple rules of the kind traditionally used to describe sound change processes in the linguistic literature. This adaptation (which can be found at http://dighl.github.io/sound_change/SoundChanger.html) compares the attested output with the output generated by a given set of rules, and provides an assessment of the general accuracy of the proposed rule set. For example, when feeding the system the simple rule an > en /_#, which turns all final instances of -an into -en, 54 out of 517 Old High German words yield the expected output in modern Standard German.
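
      The evaluation logic behind this little tool is easy to sketch (the word pairs below are hypothetical stand-ins for the real Old High German data):

          import re

          def accuracy(pairs, rules):
              """Apply ordered rules to each source word and count exact matches."""
              hits = 0
              for source, target in pairs:
                  prediction = source
                  for pattern, replacement in rules:
                      prediction = re.sub(pattern, replacement, prediction)
                  hits += prediction == target
              return hits / len(pairs)

          # hypothetical stand-ins for the Old High German / Standard German list
          pairs = [("geban", "geben"), ("haban", "haben"), ("tag", "tag")]
          rules = [(r"an$", "en")]  # the rule an > en /_# from above
          print(accuracy(pairs, rules))  # 1.0 on this toy sample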

      The problem with these endeavors is, of course, the handling of exceptions, along with the comparison of different proposals. Since we can think of an infinite number of rule sets by which we could successfully turn a certain number of Old High German strings into Standard German strings, we need to ask ourselves how we could evaluate different proposals. That some kind of parsimony should play a role here is obvious. However, it is by no means clear (at least to me) how to evaluate the complexity of two systems, since the complexity is reflected not only in the number of rules, but also in the initial grouping of sounds into classes, which is commonly used to account for the systemic aspects of sound change. A system accounting for the problem of sound law induction would try to automate the task of finding such a set of rules. The fact that it is difficult even to compare two or more proposals based on human assessment further illustrates why I think that the problem is not trivial.

      Another class of approaches is that of word prediction experiments, such as the one by Ciobanu and Dinu (2018) (but see also Bodt and List 2019), in which training data consisting of the source and the target language are used to create a model, which is then successively applied to new data, in order to test how well this model predicts target words from source words. Since the model itself is not reported in these experiments, but is only used as a black box to predict new words, the task cannot be considered the same as the task of sound law induction — which I propose as one of my ten challenges for computational historical linguistics — given that we are interested in a method that explicitly returns the model, in order to allow linguists to inspect it.

      Problems with the current solutions to sound law induction

      Given that no real solutions to the problem exist so far, it seems somewhat useless to point to the problems of current solutions. What I want to mention in this context, however, are the problems of the solutions presented for word prediction experiments, be they fed by manual data on sound changes (Hartmann 2003), or based on inference procedures (Ciobanu and Dinu 2018, Dekker 2018). Manual solutions like PHONO suffer from the fact that they are tedious to apply, given that linguists have to present all sound changes in their data in an ordered fashion, with the program converting them step by step, always turning the whole input sequence into an intermediate output sequence. The word prediction approaches, in turn, suffer from limitations in feature design.

      The method by Ciobanu and Dinu (2018), for example, is based on orthographic data alone, using the Needleman-Wunsch algorithm for sequence alignment (Needleman and Wunsch 1970); and the approach by Dekker (2018) only allows for the limited alphabet of 40 symbols proposed by the ASJP project (Holman et al. 2008). In addition to this limited representation of linguistic sound sequences, be it by resorting to abstract orthography or to reduced phonetic alphabets, none of the methods can handle the kinds of contexts that result from the multi-layered character of speech. Since we know well that these aspects are vital for certain phenomena of sound change, the methods exclude from the beginning an aspect that traditional historical linguists, who might be interested in an automatic solution to the sound law induction problem, would put at the top of their wish-list of what the algorithm should be able to handle.
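      For readers unfamiliar with the alignment step, the following short Python sketch implements the classical Needleman-Wunsch algorithm on orthographic strings, with illustrative scores (match +1, mismatch and gap -1). The Latin-Spanish pair is just an example, not data from the studies mentioned above.

      def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
          """Globally align two strings; return the aligned strings."""
          n, m = len(a), len(b)
          # fill the dynamic-programming score matrix
          S = [[0] * (m + 1) for _ in range(n + 1)]
          for i in range(1, n + 1):
              S[i][0] = i * gap
          for j in range(1, m + 1):
              S[0][j] = j * gap
          for i in range(1, n + 1):
              for j in range(1, m + 1):
                  sub = match if a[i - 1] == b[j - 1] else mismatch
                  S[i][j] = max(S[i - 1][j - 1] + sub,
                                S[i - 1][j] + gap,
                                S[i][j - 1] + gap)
          # trace back from the bottom-right corner
          out_a, out_b, i, j = [], [], n, m
          while i > 0 or j > 0:
              sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
              if i > 0 and j > 0 and S[i][j] == S[i - 1][j - 1] + sub:
                  out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
              elif i > 0 and S[i][j] == S[i - 1][j] + gap:
                  out_a.append(a[i - 1]); out_b.append("-"); i -= 1
              else:
                  out_a.append("-"); out_b.append(b[j - 1]); j -= 1
          return "".join(reversed(out_a)), "".join(reversed(out_b))

      print(needleman_wunsch("annus", "año"))  # orthographic Latin vs. Spanish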

      Why is automatic sound law induction difficult?

      The handling of supra-segmental contexts, mentioned above, is in my opinion also the reason why sound law induction is so difficult, not only for machines, but also for humans. I have so far mentioned three major reasons why I think sound law induction is difficult. First, we face problems in defining the task properly in historical linguistics, due to a significant lack of standardization; this makes it difficult to decide on the exact output of a method for sound law induction. Second, we have problems in handling the systemic aspects of sound change properly; this applies not only to automatic approaches, but also to the evaluation of different proposals for the same data made by humans. Third, the multi-layered character of speech requires an enhanced modeling of linguistic sequences, which cannot be represented as mono-dimensional strings alone, but should rather be seen as alignments of different strings representing different layers (tonal layer, stress layer, sound layer, etc.).

      How humans detect sound laws

      There are only a few examples in the literature where scholars have tried to provide detailed lists of sound changes from a proto-language to a descendant language (Baxter 1992, Newman 1999). Individual sound laws proposed in the literature are rarely tested exhaustively against the data. As a result, it is difficult to assess what humans usually do in order to detect sound laws. What is clear is that historical linguists who have worked a lot on linguistic reconstruction tend to acquire a very good intuition that helps them to quickly apply sound laws to word forms in their heads and derive the output forms. This ability is developed in a learning-by-doing fashion, with no specific techniques ever being discussed in the classroom, which reflects the general tendency in historical linguistics to trust that students will, sooner or later, learn from examples how to become good linguists (Schwink 1994: 29). For this reason, it is difficult to take inspiration from current practice in historical linguistics when developing computer-assisted approaches to this task.

      Potential solutions to the problem

      What can we do in order to address the problem of sound law induction in automatic frameworks in the future?

      As a first step, we would have to standardize the notation system that we use to represent sound changes. This would need to come along with a standardized phonetic transcription system. Scholars often assume that phonetic transcription is standardized in linguistics, specifically due to the use of the International Phonetic Alphabet. As our investigations into the actual application of the IPA have shown, however, the IPA cannot be seen as a standard, but rather as a set of recommendations that are often only loosely followed by linguists. First attempts to standardize phonetic transcription systems for the purpose of cross-linguistic applications have been made, and will hopefully gain more acceptance in the future (Anderson et al. forthcoming, https://clts.clld.org).
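      To illustrate why such normalization is needed at all, consider the following toy sketch, which maps a few variant notations to canonical symbols. The mapping table is made up for the example; real resources such as CLTS cover existing transcription systems far more comprehensively.

      # made-up mapping from variant notations to canonical symbols
      CANONICAL = {
          "ʦ": "ts",   # affricate ligature vs. digraph
          "ʧ": "tʃ",
          "g": "ɡ",    # Latin lowercase g vs. IPA script g
      }

      def normalize(segments):
          """Replace variant symbols by their canonical counterparts."""
          return [CANONICAL.get(s, s) for s in segments]

      print(normalize(["ʦ", "a", "g"]))  # ['ts', 'a', 'ɡ']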

      As a second step, we should invest more time in investigating the systemic aspects of language change cross-linguistically. What I consider important in this context is the notion of distinctive features, by which linguists try to group sounds into classes. Since feature systems proposed by linguists differ greatly, with some debate as to whether features are innate and the same for all languages, or instead language-specific (see Mielke 2008 for an overview of the problem), a first step would again consist of making the data comparable, rather than trying to decide in favour of one of the numerous proposals in the literature.
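      As a toy illustration of what grouping sounds into classes by features looks like in practice, consider the following sketch. The feature values are simplified textbook-style assignments, not one of the actual proposals discussed by Mielke (2008).

      # simplified, textbook-style feature assignments (illustrative only)
      FEATURES = {
          "p": {"voiced": False, "nasal": False, "labial": True},
          "b": {"voiced": True,  "nasal": False, "labial": True},
          "m": {"voiced": True,  "nasal": True,  "labial": True},
          "t": {"voiced": False, "nasal": False, "labial": False},
          "n": {"voiced": True,  "nasal": True,  "labial": False},
      }

      def natural_class(**values):
          """Return all sounds whose features match the given values."""
          return sorted(s for s, fs in FEATURES.items()
                        if all(fs[f] == v for f, v in values.items()))

      print(natural_class(nasal=True))                # ['m', 'n']
      print(natural_class(labial=True, voiced=True))  # ['b', 'm']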

      As a third step, we need to work on ways to account for the multi-layered aspect of sound sequences. Here, I have already made a first proposal, labelled "multi-tiered sequence representation" (List and Chacon 2015), based on an idea that I had used for the phonetic alignment algorithm in my dissertation (List 2014), which itself goes back to the handling of hydrophilic sequences in ClustalW (Thompson et al. 1994). The idea is to define a sound sequence as a sequence of vectors, with each vector (called a tier) representing one distinct aspect of the original word. This representation allows for an extremely flexible modeling of context: context simply consists of an arbitrary number of vector dimensions that can account for aspects such as tone, stress, or preceding and following sounds. It thus allows us to treat words as sequences of sounds while at the same time accounting for their multi-layered structure. Although many questions remain open as to how this model of phonetic sequences can be exploited to induce sound laws from ancestor-descendant data, I consider it a first step in the direction of a solution to the problem.

      Figure: Multi-tiered sequence representation for a fictive word in Middle Chinese.
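      To make the idea more concrete, here is a minimal Python sketch of such a representation, with tiers for the segment itself, tone, stress, and the immediate context. The word and its values are invented, and the tier inventory is only a small subset of what List and Chacon (2015) discuss.

      def multi_tiers(segments, tone, stress):
          """Represent a word as a list of aligned tier vectors."""
          assert len(segments) == len(tone) == len(stress)
          vectors = []
          for i, seg in enumerate(segments):
              vectors.append({
                  "segment": seg,                # the sound itself
                  "tone": tone[i],               # tonal tier
                  "stress": stress[i],           # stress tier
                  # context tiers, derived from the segment tier:
                  "preceding": segments[i - 1] if i > 0 else "#",
                  "following": segments[i + 1] if i < len(segments) - 1 else "#",
              })
          return vectors

      # invented example: syllable "kan" with a high tone, unstressed
      for vector in multi_tiers(["k", "a", "n"], ["H", "H", "H"], ["0", "0", "0"]):
          print(vector)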

      Outlook

      Although it is not necessarily recognized by the field as a real problem of historical linguistics, I consider automatic sound law induction to be a very important problem for our field. If we could infer sound laws from a set of proposed proto-forms and a set of descendant forms, we could use them to test the quality of the proto-forms themselves, by inspecting the sound laws proposed by a given system. We could also compare sound laws across different language families to see whether we find cross-linguistic tendencies.

      Having inferred enough cross-linguistic data on sound laws, represented in unified models for sound law notation, we could also use the rules to search for cognate words that have so far been ignored. There is a lot to do, however, before we reach this point. Starting to think about the automatic, but also the manual, induction of sound laws as a specific task in computational historical linguistics can be seen as a first step in this direction.

      References
        Anderson, Cormac and Tresoldi, Tiago and Chacon, Thiago Costa and Fehn, Anne-Maria and Walworth, Mary and Forkel, Robert and List, Johann-Mattis (forthcoming) A Cross-Linguistic Database of Phonetic Transcription Systems. Yearbook of the Poznań Linguistic Meeting, pp 1-27.

        Baxter, William H. (1992) A handbook of Old Chinese Phonology. Berlin: de Gruyter.

        Bodt, Timotheus A. and List, Johann-Mattis (2019) Testing the predictive strength of the comparative method: An ongoing experiment on unattested words in Western Kho-Bwa languages. 1-22. [Preprint, under review, not peer-reviewed]

        Ciobanu, Alina Maria and Dinu, Liviu P. (2018) Simulating language evolution: A tool for historical linguistics. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp 68-72.

        Dekker, Peter (2018) Reconstructing Language Ancestry by Performing Word Prediction with Neural Networks. University of Amsterdam: Amsterdam.

        Hall, T. Alan (2000) Phonologie: Eine Einführung. Berlin and New York: de Gruyter.

        Hartmann, Lee (2003) Phono. Software for modeling regular historical sound change. In: Actas VIII Simposio Internacional de Comunicación Social. Southern Illinois University, pp 606-609.

        Holman, Eric W. and Wichmann, Søren and Brown, Cecil H. and Velupillai, Viveka and Müller, André and Bakker, Dik (2008) Explorations in automated lexicostatistics. Folia Linguistica 20.3: 116-121.

        List, Johann-Mattis (2014) Sequence Comparison in Historical Linguistics. Düsseldorf: Düsseldorf University Press.

        List, Johann-Mattis and Chacon, Thiago (2015) Towards a cross-linguistic database for historical phonology? A proposal for a machine-readable modeling of phonetic context. Paper, presented at the workshop Historical Phonology and Phonological Theory [organized as part of the 48th annual meeting of the SLE] (2015/09/04, Leiden, Societas Linguistica Europaea).

        List, Johann-Mattis (2019) Automatic inference of sound correspondence patterns across multiple languages. Computational Linguistics 1.45: 137-161.

        Mielke, Jeff (2008) The Emergence of Distinctive Features. Oxford: Oxford University Press.

        Needleman, Saul B. and Wunsch, Christian D. (1970) A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of Molecular Biology 48: 443-453.

        Newman, John and Raman, Anand V. (1999) Chinese Historical Phonology: Compendium of Beijing and Cantonese Pronunciations of Characters and their Derivations from Middle Chinese. München: LINCOM Europa.

        Schwink, Frederick (1994) Linguistic Typology, Universality and the Realism of Reconstruction. Washington: Institute for the Study of Man.

        Thompson, J. D. and Higgins, D. G. and Gibson, T. J. (1994) CLUSTAL W: Improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Research 22: 4673–4680.