Open problems in computational diversity linguistics: Conclusion and Outlook


One year has now passed since I discussed with David the idea of devoting a whole year, and twelve blog posts, to the topic of "Open problems in computational diversity linguistics". It is time to look back at this year and the topics that have been discussed.

Quantitative view

The following table lists the pageviews (or clicks) for each blogpost (with all caveats as to what this actually entails), from January to November.

Problem   Month       Title                                        Clicks   Comments
0         January     Introduction                                 535      4
1         February    Automatic morpheme detection                 718      0
2         March       Automatic borrowing detection                422      1
3         April       Automatic sound law induction                522      2
4         May         Automatic phonological reconstruction        517      0
5         June        Simulation of lexical change                 269      0
6         July        Simulation of sound change                   423      0
7         August      Statistical proof of language relatedness    383      1
8         September   Typology of semantic change                  372      2
9         October     Typology of sound change                     250      3
10        November    Typology of semantic promiscuity             217      2

The first thing to note is that people might have gotten tired of the problems, since the last two posts were not very well received in terms of readers (or not yet, anyway). One should not forget, however, that the click counts are cumulative, so an older post may have received more readers simply because it has been online for a longer time.

What seems to be interesting, however, is the rather high number of readers for the February post; and it seems that this is related to the topic rather than the content. Morpheme detection is considered to be a very interesting problem by many practitioners of Natural Language Processing (NLP), and the field of NLP generally has many more followers than the field of historical linguistics.

Reader comments and discussions

For a few of the posts, I received interesting comments, and I replied to all of them where I found that a reply was in order. A few of them are worth emphasizing here.

As a first comment, in March, Guillaume Jacques replied in the form of a blog post of his own, in which he proposed a very explicit method for the detection of borrowings, which assumes that the data being compared include an ancestral language that is available in written sources (see here for the post). Since it will still take some time to prepare the data in the manner proposed by Guillaume, I have not had time to test this method for myself, but it is a very nice example of a new method for borrowing detection, which addresses one specific data type and has so far not been tested.

Thomas Pellard provided a very useful comment on my April post, emphasizing that automatic reconstruction based on regular expressions (as I had proposed it, more or less, as a riddle that should be solved) requires a "very precise chronology (order) of the sound changes", as well as "a perfect knowledge of all the sound changes having occurred". He concluded that a "regular expression-based approach may thus be rather suited for the final stage of a reconstruction rather than for exploratory purposes". What is remarkable about this comment is that it partly contradicts (at least in my opinion) the classical doctrine of historical language comparison, since we often assume that linguists apply their "sound laws" perfectly well, being able to explain the history of a given set of languages in full detail. The sparsity of the available literature, and the problems that even small experiments encounter, show that the idea of completely regular sound change that can be laid out in the form of transducers has always remained an idea, but was never really put into practice. It seems that it is time to leave the realm of theory and do more practical research on sound change, as suggested by Thomas.
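To illustrate Thomas' point about chronology, here is a minimal sketch in Python (the forms and rules are invented for illustration, not real reconstructions) showing that the same two regular-expression "sound laws" can yield different outputs depending on the order in which they are applied:

```python
import re

# Two toy "sound laws", written as regular-expression rewrites.
rules = [
    (r"k(?=[ie])", "tʃ"),  # Rule A: k becomes tʃ before front vowels
    (r"tʃ", "ʃ"),          # Rule B: tʃ simplifies to ʃ
]

def derive(form, rules):
    """Apply an ordered cascade of sound laws to a proto-form."""
    for pattern, target in rules:
        form = re.sub(pattern, target, form)
    return form

proto = "kita"
print(derive(proto, rules))                   # A before B -> 'ʃita'
print(derive(proto, list(reversed(rules))))   # B before A -> 'tʃita'
```

Only when the full cascade and its ordering are known will such a derivation reproduce the attested forms, which is exactly the knowledge requirement that Thomas points to.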

In response to my post on problem number 7 (August), the proof of language relatedness, Guillaume Jacques wrote that: "although most historical linguists see inflectional morphology as the most convincing evidence for language relatedness, it is very difficult to conceive a statistical test that could be applied to morphological paradigms in any systematic way cross-linguistically". I think he is completely right with this point.

J. Pystynen made a very good point with respect to my post on the typology of semantic change (September), mentioning that semantic change may, like sound change, be subject to dynamics resulting from the fact that the lexicon of a given language at a given time is a system whose parts are determined by their relation to each other.

David Marjanović criticized my use (in October) of the Indo-European laryngeals as an example to make clear that the abstractionalist-realist problem in the debate about sound change has an impact on what scholars actually reconstruct, and that they are often content not to specify concrete sound values further, as long as they can be sure that there are distinctive values for a given phenomenon. His main point was that, in his opinion, the reconstruction of sound values for the Indo-European laryngeals is much clearer than I presented it in my post. I think that Marjanović misunderstood the point I wanted to make; and I also think that he is not right regarding the certainty with which we can determine sound values for the laryngeal sounds.

As a last, and very long, comment, in November, Alex(andre) François (I assume that it was him, but he only left his first name) provided excellent feedback on the last problem, which I had labelled the problem of establishing a typology of "semantic promiscuity". Alex argues that I overemphasized the role of semantics in the discussion, and that the phenomenon I described might better be labelled the "lexical yield of roots". I think that he is right in this criticism, but I am not sure whether the term "lexical yield" is better than the notion of promiscuity. Given that we are searching for a counterpart to the mostly form-based term "productivity", which furthermore focuses on grammatical affixes, the term "promiscuity" focuses on the success of certain form-concept pairs at being recycled during the process of word formation. Alex is right that we are in fact talking about the root here, a linguistic concept that is, unfortunately, not very strictly defined in linguistics. For the time being, I would propose either the term "root promiscuity" or "lexical promiscuity", but would avoid the term "yield", since it sounds too static to me.

Advances on particular problems

Although the problems that I posted are personal ones, and I am keen to try tackling them in at least some way in the future, I have not yet managed to advance substantially on any of them in particular.

I have experimented with new approaches to borrowing detection, which are not yet in a state where they could be published, but doing so helped me to re-think the whole matter in detail. Parts of the ideas shared in that blog post also appeared, in a deeper discussion, in an article that was published this year (List 2019a).

I have played with the problem of morpheme detection, but none of the different approaches has been really convincing so far. However, I am still convinced that we can do better than "meaning-less" NLP approaches (which try to infer morphology from dictionaries alone, ignoring any semantic information).

A peripheral thought on automated phonological reconstruction, focusing on the question of how to evaluate a set of automated reconstructions against a set of human-annotated gold standard data, has now been published (List 2019b) as a comment on a target study by Jäger (2019). While my proposal can solve cases where two reconstruction systems differ only in their segment-wise phonological information, I had to conclude my comment by admitting that there are cases where two sets of words in different languages are equivalent in their structure, but not identical. Formally, this means that structurally identical sets of segmented strings can be converted from one set into the other with the help of simple replacement rules, while structurally equivalent sets of segmented strings (I am still unsure whether the two terms are well chosen) may require additional context rules.
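As a minimal sketch of the distinction (with toy data, not taken from the published comment): two sets of segmented strings would count as structurally identical if a single context-free replacement table converts one into the other, and as only structurally equivalent if no such table exists:

```python
def replacement_table(set_a, set_b):
    """Try to find one context-free segment-to-segment replacement table
    that converts every segmented string in set_a into its counterpart in
    set_b.  Returns the table, or None if plain replacements do not
    suffice (i.e. the sets are at best structurally equivalent)."""
    table = {}
    for a, b in zip(set_a, set_b):
        if len(a) != len(b):
            return None
        for source, target in zip(a, b):
            if table.setdefault(source, target) != target:
                return None  # same segment would need two different targets
    return table

# Structurally identical: "a b b a" converts to "o t t o" with a -> o, b -> t.
print(replacement_table([["a", "b", "b", "a"]], [["o", "t", "t", "o"]]))

# Only structurally equivalent: "a" would need to become "o" word-initially
# but "e" word-finally, so a context rule would be required.
print(replacement_table([["a", "b", "b", "a"]], [["o", "t", "t", "e"]]))
```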

Although I tried to advance on most of the problems mentioned throughout the year, and I carried out quite a few experiments, most of the things that I tested were not conclusive. Before I discuss them in detail, I should make sure that they actually work, or provide a larger study that explains why they do not work. At this stage, however, any sharing of information on the different experiments I ran would be premature, leading to confusion rather than to clarification.

Strategies for problem solving

Those of you who have followed my treatment of all the problems over the year will see that I tend to be very careful in delegating problem solutions to classical machine learning approaches. I do this because I am convinced that most of the problems that I mentioned and discussed can, in fact, be handled in a very concrete manner. When dealing with problems that one thinks can ultimately be solved by an algorithm, one should not start by developing a machine learning algorithm, but rather search for the algorithm that really solves the problem.

Nobody would develop a machine learning approach to replace an abacus, although this may in fact be possible. In the same way, I believe that the practice of historical linguistics has sufficiently shown that most of the problems can be solved with the help of concrete methods, with the exception, perhaps, of phylogenetic reconstruction (see, for example, my graph-based solution to the sound correspondence pattern detection problem, presented in List 2019c). For this reason, I prefer to work on concrete solutions, avoiding probabilistic approaches or black-box methods, such as neural networks.


Retrospect and outlook

In retrospect, I enjoyed the series a lot. It had the advantage of being easier to plan, as I knew in advance what I had to write about. It was, however, also tedious at times, since I knew I could not just pick a seemingly simpler topic for my monthly post, but had to develop the problem and share all of my thoughts on it. In some situations, I had the impression that I failed, since I realized that there was not enough time to really think everything through. Here, the comments of colleagues were quite helpful.

Content-wise, the idea of looking at our field through the lens of unsolved problems turned out to be very useful. For quite a few of the problems, I have initial ideas (as I tried to indicate each time); and maybe there will be time in the coming years to test them concretely, and perhaps even to cross one or another problem off the big list.

Writing a series instead of a collection of unrelated posts also turned out to have definite advantages. Given my monthly goal of writing at least one contribution for the Genealogical World of Phylogenetic Networks, I never had to think too hard about what might be interesting for a broader readership, as has happened in the past. Blog series do, however, have the disadvantage of not allowing for flexibility when something interesting comes up, especially if one sticks to one post per month and reserves this post for the series.

For next year, I am still considering writing another series, but maybe this time I will handle it less strictly, allowing some room for surprises, since this is also one of the major advantages of writing scientific blogs: one is never really bound to follow the beaten track.

But for now, I am happy that the year is over, since 2019 has been very busy for me in terms of work. Since this is the final post for the year, I would like to take the chance to thank all who read the posts, and specifically also all those who commented on them. But my greatest thanks go to David for being there, as always, reading my texts, correcting my errors in writing, and giving active feedback in the form of interesting and inspiring comments.

References

Jäger, Gerhard (2019) Computational historical linguistics. Theoretical Linguistics 45.3-4: 151-182.

List, Johann-Mattis (2019a) Automated methods for the investigation of language contact situations, with a focus on lexical borrowing. Language and Linguistics Compass 13.e12355: 1-16.

List, Johann-Mattis (2019b) Beyond Edit Distances: Comparing linguistic reconstruction systems. Theoretical Linguistics 45.3-4: 1-10.

List, Johann-Mattis (2019c) Automatic inference of sound correspondence patterns across multiple languages. Computational Linguistics 45.1: 137-161.

A new playground for networks and exploratory data analysis


[This is a post by Guido with some help from David]

There tend to be two types of studies of inheritance and evolution. First, there is the evolution of organisms, either of the phenotype (morphology, anatomy, cell ultrastructure, etc.) or the genotype (chromosomes, nucleotides). The latter involves direct inheritance, but it is often treated as including all molecules, although it is the nucleotides (and chromosomes) that get inherited, not, for example, the amino acids.

Second, there are studies of the evolution of behaviour, which has focused mainly on humans, of course, but can include all species. For humans, this includes socio-cultural phenomena, particularly language (written as well as spoken), but also including cultural advancements such as social organization, tool use, agriculture, etc., which are inherited indirectly, by learning.

However, we rarely see studies that are multi-disciplinary in the sense of combining both physical and behavioural evolution. It is therefore very interesting to note the just-published preprint by:
Fernando Racimo, Martin Sikora, Hannes Schroeder, Carles Lalueza-Fox. 2019. Beyond broad strokes: sociocultural insights from the study of ancient genomes. arXiv.
These authors provide a review of the extent to which the analysis of ancient human genomes has provided new insights into socio-cultural evolution. This provides a platform for interesting future cross-disciplinary research.

The authors comment:
In this review, we summarize recent studies showcasing these types of insights, focusing on the methods used to infer sociocultural aspects of human behaviour. This work often involves working across disciplines that have, until recently, evolved in separation. We argue that multidisciplinary dialogue is crucial for a more integrated and richer reconstruction of human history, as it can yield extraordinary insights about past societies, reproductive behaviours and even lifestyle habits that would not have been possible to obtain otherwise.
Multi-disciplinary dialogue is a focal point here at the Genealogical World of Phylogenetic Networks; and since our blog embraces non-biological data, we have done a little brainstorming, to put forward some ideas based on Racimo et al.'s comments. The four figures contain some extra discussion, with some visual representations of the ideas.

Why it's important to correlate genetic, linguistic and socio-cultural data. The doodle shows a simple free-expansion model of a founder population with three genotypes (yellow, green, blue), a shared language (L), and two major cultural innovations (white stars). Because of drift and stochastic intra-population processes (sizes represent the size of the actively reproducing populace), the first expansion (light gray arrows) led to 'tribes' that already show some variation. The smaller ones close to the founder population still spoke the same language, while the ones further away used variants (dialects) of L (L', still close to L; L'', more distinct). Because of bottlenecks, geographic distance and differing levels of inbreeding (the smaller a population and the farther away from the source, the more likely are changes in genotype frequency), each population has a different genotype composition. The second expansion (mid-gray arrows), mixing two sources, leads to a grandchild that evolved a new language M and lost the blue genotype. Because the cultural innovations are beneficial, we find them in the entire group. In extreme cases of genetic sorting and linguistic evolution, such shared cultural innovations may be the only evidence clearly linking all these populations.

Social-cultural character matrices

Correlating different sets of data and (cross-)exploring the signal in these data can be facilitated by creating suitable character matrices. In phylogenetics, we primarily use characters that underlie (ideally) neutral evolution, such as nucleotide sequences and their transcripts, the amino-acid sequences. When using matrices scoring morphological traits, we relax the requirement of neutral evolution, but we are still scoring traits that are the product of biological evolution. However, we don't need to stop there: phylo-linguistics is an active field, even though languages involve different evolutionary constraints and processes than we meet in biology. Data-wise there are nonetheless many analogies, and phylogenetic methods seem to work fine.

So, why not also score socio-cultural traits in a character matrix? For instance, we can characterize cultures and populations by basic features including: the presence of agriculture, which crops were cultivated, which animals were domesticated, which technological advances were available, whether it was a stone-age, bronze-age, iron-age culture, etc. Linguistically, we could also develop matrices of local populations, with regional accents or dialects, etc.
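As a toy sketch of what such a matrix could look like (all group names, traits and values are invented purely for illustration), one could simply score presence/absence and categorical traits per cultural group:

```python
# A toy socio-cultural character matrix: rows are cultural groups,
# 1/0 encode presence or absence of a trait, the last character is categorical.
characters = ["agriculture", "wheat", "cattle", "bronze_working", "period"]
matrix = {
    "Group_A": [1, 1, 1, 0, "stone_age"],
    "Group_B": [1, 1, 1, 1, "bronze_age"],
    "Group_C": [0, 0, 1, 0, "stone_age"],
}
for group, row in matrix.items():
    print(group, dict(zip(characters, row)))
```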

Creating such a matrix should, of course, be informed by available objective information. As in the case of morphological matrices or non-biological matrices in general, we should not be concerned about character independence. We don't need to infer a phylogenetic tree from these matrices, as their purpose is just to sum up all available characteristics of a socio-cultural group.

Second phase: stabilization of the differentiation pattern. While the nearby tribes are still in contact with the mother population, the most distant one has lost contact. As a consequence, the gene pools of the L/L'-speaking communities become more similar, and new innovations acquired by the founder population (black star) are readily propagated within its cultural sphere. Re-migration from the larger M-speaking tribe to the struggling L''-speakers (a small population with high inbreeding levels) leads to the extinction of the blue genotype in the latter and increased 'borrowing' of M-words and concepts.

Distance calculations

Pairwise distance matrices are most versatile for comparing data across different data sets.

First, any character matrix can be quickly transformed into a distance matrix, and the right distance transformation can handle any sort of data: qualitative, categorical data as well as quantitative, continuous data.

Second, the signal in any distance matrix can be quickly visualized using Neighbor-nets. This blog has a long list of posts showing Neighbor-nets based on all sorts of sociological data that don't follow any strict pattern of evolution, and are heavily biased by socio-cultural constraints (eg. bikability, breast sizes, German politics, gun legislation, happiness, professional poker, spare-time activities). We have even included celestial bodies.

Third, distance matrices can be tested for correlation as-is, without any prior inference, using simple statistics, such as the Pearson correlation coefficient. To give just one example from our own research: in Göker and Grimm (BMC Evol. Biol. 2008), the latter was used to test the performance of character and distance transformations for cloned ITS data covering substantial intra-genomic diversity, by correlating the resulting individual-based distances with species-level morphological data matrices. (The internal transcribed spacers are multi-copy, nuclear-encoded, non-coding gene regions; in the simplest case each individual has two sets of copies, or arrays, one inherited from the father and the other from the mother, which may differ between but also within individuals.)

In the context of Racimo et al.'s paper, one could construct a genetic, a socio-cultural, a linguistic and a geographical matrix, determine the pairwise distances between what in phylogenetics are called OTUs (the operational taxonomic units), and test how well these data (or parts of them) correlate. The OTUs would be local human groups sharing the same culture and (if known) language.
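The following sketch illustrates the whole pipeline on a miniature scale, with invented data: a simple Gower-style distance is computed from a mixed character matrix for three OTUs, and two of the resulting distance matrices are then correlated via their upper triangles. The helper gower_distance is written just for this sketch (it is not a library function), and for real data a Mantel test with permutations would be the more rigorous choice.

```python
import numpy as np
from scipy.stats import pearsonr

def gower_distance(rows):
    """Gower-style distance for mixed characters: categorical characters
    contribute 0/1, continuous characters (assumed scaled to [0, 1])
    contribute their absolute difference."""
    n = len(rows)
    dist = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            diffs = [abs(a - b)
                     if isinstance(a, (int, float)) and isinstance(b, (int, float))
                     else float(a != b)
                     for a, b in zip(rows[i], rows[j])]
            dist[i, j] = dist[j, i] = sum(diffs) / len(diffs)
    return dist

# Invented characters for three local groups (OTUs).
cultural = gower_distance([[1, 0, "bronze"], [1, 1, "bronze"], [0, 0, "stone"]])
genetic = gower_distance([[0.9, 0.1], [0.7, 0.3], [0.2, 0.8]])

# Correlate the two distance matrices via their upper triangles.
iu = np.triu_indices(3, k=1)
r, p = pearsonr(cultural[iu], genetic[iu])
print(round(r, 2))
```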

Alternatively, one can just map the scored socio-cultural traits onto trees based on genetic data or linguistics.

A new culture with its own language (Λ), genotype (red) and innovations (ruby-red pentagon) migrates close to the settling area of the L-people. Because of raids, genotypes and innovations from the L-people get incorporated into the Λ-culture.

How to get the same set of OTUs

The Göker & Grimm paper mentioned above tested several options for character and distance transformations, because we faced a similar problem to what researchers will face when trying to correlate socio-cultural data with genetic profiles of our ancestors: a different set of leaves (the OTUs). We were interested in phylogenetic relationships between individuals using data representing the genetic heterogeneity within these individuals.

Genetic studies of human (ancient or modern) DNA use data from individuals, but socio-cultural and linguistic data can only be compiled at a (much) higher level: societies, or other groups of many individuals. In addition, these groups may also span a larger time frame. Since humans love to migrate, we are even more of a genetic mess than were the ITS data that we studied.

One potential alternative is to use the host-associate analysis framework of Göker & Grimm. Instead of using the individual genetic profiles (the associate data), one sums them across a socio-cultural unit (serving as host). The simplest method is to create a consensus of the data (in Göker & Grimm, we tested strict and modal consensuses). This produces sequences with a lot of ambiguity codes — genetic diversity within the population will be represented by intra-unit sequence polymorphism (IUSP). Standard distance and parsimony implementations do not deal with ambiguities, but maximum likelihood, as implemented in RAxML, does to some degree. A stopgap is the recoding of ambiguities as discrete states for phylogenetic analysis (tree and network inference), as done by Potts et al. (Syst. Biol. 2014 [PDF]) for 2ISPs ('twisps'), intra-individual site polymorphisms. It can't hurt to try out whether this works for IUSPs, too.
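A minimal sketch of the consensus idea (not the actual Göker & Grimm implementation, and with invented sequences): collapse the sequences of all individuals of one socio-cultural unit into a single consensus string, writing polymorphic sites as IUPAC ambiguity codes.

```python
# Two-state IUPAC ambiguity codes for DNA.
IUPAC = {
    frozenset("A"): "A", frozenset("C"): "C",
    frozenset("G"): "G", frozenset("T"): "T",
    frozenset("AG"): "R", frozenset("CT"): "Y",
    frozenset("AC"): "M", frozenset("GT"): "K",
    frozenset("AT"): "W", frozenset("CG"): "S",
}

def strict_consensus(sequences):
    """Per-site consensus; polymorphic sites become ambiguity codes."""
    out = []
    for site in zip(*sequences):
        out.append(IUPAC.get(frozenset(site), "N"))  # "N" for 3+ states
    return "".join(out)

# Three invented individuals from the same unit:
print(strict_consensus(["ACGT", "ACGT", "ATGT"]))   # -> "AYGT"
```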

Since humans (tribes, local groups) often differ in the frequency of certain genotypes, it would be straightforward to use these frequencies directly when putting up a host matrix. Instead of, for example, nucleotides or their ambiguity codes, the matrix would have the frequency of the different haplotypes. We can't infer trees from such a matrix (we need categorical data), but we can still calculate the distance matrix and infer a Neighbor-net.
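For example (with invented frequencies), one could compute ordinary Bray-Curtis distances between the haplotype-frequency profiles of the groups sketched in the figures. Note that this is the standard ecological distance, not the 'phylogenetic Bray-Curtis' transformation discussed below.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Invented haplotype frequencies per group (columns: haplotypes H1-H3).
groups = ["L-people", "M-people", "Lambda-people"]
freqs = np.array([
    [0.6, 0.3, 0.1],
    [0.5, 0.5, 0.0],
    [0.1, 0.1, 0.8],
])

# Ordinary Bray-Curtis distances between the frequency profiles.
dist = squareform(pdist(freqs, metric="braycurtis"))
for name, row in zip(groups, np.round(dist, 2)):
    print(name, row)
```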

The 'phylogenetic Bray-Curtis' (distance) transformation introduced in Göker & Grimm (2008) also keeps the information about within-host diversity when determining inter-host distances (see Reticulation at its best ...).


Transformations for genetic data from smaller to larger, more inclusive units are implemented in the software package POFAD by Joly et al. (Methods in Ecology & Evolution, 2015). Their paper also provides a comparison of different methods, including the ones tested in Göker & Grimm (2008, also implemented in the tiny executables g2cef and pbc, compiled for any platform).

The process of assimilation. The Λ-people subdued the L-culture, with the consequence that all innovations are shared within their sphere of influence. Since the invaders have a much smaller total population size, their language is largely lost, but the new common language L* still includes some Λ-elements (in a phylogenetic tree analysis, L* would be part of the L/M clade; using networks, L* would share edges with Λ in contrast to L and M). The L''/M-speaking remote population is re-integrated. The invaders' genotype (red) becomes part of the L-people's gene pool. Re-migration (forced or not) introduces L-genotypes into the original Λ-population. Only by comparing all available data, ideally covering more than one time period, can we deduce that the M-speakers represent an early isolated subpopulation of the L-people that was not affected by the Λ-invasion. With only the genetic data at hand, one might identify the M-speakers as one source and the Λ-tribe as another source for the L*-people, and infer that all L/M and Λ-tribes share a common origin (since the yellow genotype is found in both the M-population and the original Λ-population).

Conclusion

It therefore seems to us that there is enormous potential for multi-disciplinary work that truly combines organismal and socio-cultural evolution. We have provided a few practical suggestions here about how this might be done. We encourage you all to try out some of these ideas, to see where they lead us.

Typology of sound change (Open problems in computational diversity linguistics 9)


We are getting closer to the end of my list of open problems in computational diversity linguistics. After this post, there is only one left, for November, followed by an outlook and a wrap-up in December.

In last month's post, devoted to the typology of semantic change, I discussed the general aspects of a typology in linguistics, or — to be more precise — how I think that linguists use the term. One of the necessary conditions for a typology to be meaningful is that the phenomenon in question shows enough similarities across the languages of the world, so that patterns or tendencies can be identified regardless of the historical relations between human languages.

Sound change in this context refers to a very peculiar phenomenon observed in the change of spoken languages, by which certain sounds in the inventory of a given language change their pronunciation over time. This change often affects all of the words in which these sounds occur, or only those words in which the sounds occur in specific phonetic contexts.

As I have discussed this phenomenon in quite a few past blog posts, I will not discuss it any further here, but will rather simply state the specific task that this problem entails:
Assuming (if needed) a given time frame, in which the change occurs, establish a general typology that informs about the universal tendencies by which sounds occurring in specific phonetic environments are subject to change.
Note that my view of "phonetic environment" in this context includes an environment that would capture all possible contexts. When confronted with a sound change that seems to affect a sound in the same way in all phonetic contexts in which it occurs, linguists often speak of "unconditioned sound change", as they do not find any apparent condition for this change to happen. For a formal treatment, however, this is unsatisfying, since the lack of a phonetic environment is also a specific condition of sound change.

Why it is hard to establish a typology of sound change

As is also true for semantic change, discussed as Problem 8 last month, there are three major reasons why it is hard to establish a typology of sound change. As a first problem, we find, again, the issue of acquiring the data needed to establish the typology. As a second problem, it is also not clear how to handle the data appropriately in order to allow us to study sound change across different language families and different times. As a third problem, it is also very difficult to interpret sound change data when trying to identify cross-linguistic tendencies.

Problem 1

The problem of acquiring data about sound change processes in sufficient size is very similar to the problem of semantic change: most of what we know about sound change has been inferred by comparing languages, and we do not know how confident we can be with respect to those inferences. While semantic change is considered to be notoriously difficult to handle (Fox 1995: 111), scholars generally have more confidence in sound change and the power of linguistic reconstruction. The question remains, however, as to how confident we can really be, which divides the field into the so-called "realists" and the so-called "abstractionalists" (see Lass 2017 for a recent discussion of the debate).

As a typical representative of abstractionalism in linguistic reconstruction, consider the famous linguist Ferdinand de Saussure, who emphasized that the real sound values which scholars reconstructed for proposed ancient words in unattested languages like, for example, Indo-European, could as well be simply replaced by numbers or other characters, serving as identifiers (Saussure 1916: 303). The fundamental idea here, when reconstructing a word for a given proto-language, is that a reconstruction does not need to inform us about the likely pronunciation of a word, but rather about the structure of the word in contrast to other words.

This aspect of historical linguistics is often difficult to discuss with colleagues from other disciplines, since it seems to be very peculiar, but it is very important in order to understand the basic methodology. The general idea of structure versus substance is that, once we accept that the words in a language are built by drawing letters from an alphabet, the letters themselves do not have a substantial value, but only have a value in contrast to other letters. This means that a sequence such as "ABBA" can be seen as being structurally identical with "CDDC", or "OTTO". The similarity should be obvious: we have the same letter at the beginning and the end of each word, and the same letter being repeated in the middle of each word (see List 2014: 58f for a closer discussion of this type of similarity).
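A compact way to make this notion of structural identity concrete is to replace every letter by the rank of its first occurrence, so that words with the same repetition structure receive the same profile (a small sketch written for this post, not the method of List 2014):

```python
def structural_profile(word):
    """Replace each letter by the order in which it first appears,
    so that only the structure of repetitions is retained."""
    seen = {}
    return tuple(seen.setdefault(ch, len(seen)) for ch in word)

# "ABBA", "CDDC" and "OTTO" all reduce to the same profile (0, 1, 1, 0).
print(structural_profile("ABBA") == structural_profile("CDDC") == structural_profile("OTTO"))
```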

Since sequence similarity is usually not discussed in purely structural terms, the abstract view of correspondences, as it is maintained by many historical linguists, is often difficult to discuss across disciplines. The reason why linguists tend to maintain it is that languages do not change only their words, by mutating individual sounds; whole sound systems change, and new sounds can be gained or lost during language evolution (see my blogpost from March 2018 for a closer elaboration of the problem of sound change).

It is important to emphasize, however, that despite prominent abstractionalists such as Ferdinand de Saussure (1857-1913), and in part also Antoine Meillet (1866-1936), the majority of linguists think more realistically about their reconstructions. The reason is that the composition of words based on sounds in the spoken languages of the world usually follows specific rules, so-called phonotactic rules. These may vary to quite some degree among languages, but they are also restricted by some natural laws of pronounceability. Thus, although languages may show impressively long chains of one consonant following another, there is a certain limit to the number of consonants that can follow each other without a vowel. Sound change is thus believed to originate roughly in either production (speakers want to pronounce things in a simpler, more convenient way) or perception (listeners misunderstand words and store erroneous variants; see Ohala 1989 for details). Therefore, a reconstruction of a given sound system based on the comparison of multiple languages gains power from a realistic interpretation of sound values.

The problem with the abstractionalist-realist debate, however, is that linguists usually employ some kind of mixture of the two extremes. That means that they may reconstruct very concrete sound values for certain words, where they have very good evidence, but at the same time they may come up with abstract values that serve as placeholders in the absence of better evidence. The most famous example are the Indo-European "laryngeals", whose existence is beyond doubt for most historical linguists, but whose sound values cannot be reconstructed with high reliability. As a result, linguists tend to spell them with subscript numbers as *h₁, *h₂, and *h₃. Any attempt to assemble data about sound change processes in the languages of the world needs to find a way to cope with the different degrees of evidence we find in linguistic analyses.

Problem 2

This leads us directly to our second problem: handling sound change data appropriately in order to study sound change processes. Given that many linguists propose changes in the typical A > B / C notation (A becomes B in context C), a possible way of thinking about establishing a first database of sound changes would be to type up these changes from the literature and make a catalog out of them. Apart from the interpretation of the data in abstractionalist-realist terms, however, such a way of collecting the data would have a couple of serious shortcomings.

First, it would mean that the analysis of the linguist who proposed the sound change is taken as final, although we often find many debates about the specific triggers of sound change, and it is not clear whether there would be alternative sound change rules that could apply just as well (see Problem 3 on the task of automatic sound law induction for details). Second, as linguists tend to report only what changes, while disregarding what does not change, we would face the same problem as in the traditional study of semantic change: the database would suffer from a sampling bias, as we could not learn anything about the stability of sounds. Third, since sound change depends not only on production and perception, but also on the system of the language in which the sounds are produced, listing sound changes deprived of examples in real words would most likely make it impossible to take these systemic aspects of sound change into account.

Problem 3

This last point now leads us to the third general difficulty: the question of how to interpret sound change data, assuming that one has had the chance to acquire enough of it from a reasonably large sample of spoken languages. If we look at the general patterns of sound change observed for the languages of the world, we can distinguish two basic conditions of sound change: phonetic conditions and systemic conditions. Phonetic conditions can be further subdivided into articulatory (= production) and acoustic (= perception) conditions. When trying to explain why certain sound changes can be observed more frequently across different languages of the world, many linguists tend to invoke phonetic factors. If the sound p, for example, turns into an f, this is not necessarily surprising, given the strong similarity of the two sounds.

But similarity can be measured in two ways: one can compare the similarity with respect to the production of a sound by a speaker, and with respect to the perception of the sound by a listener. While the production of sounds is traditionally seen as the more important factor contributing to sound change (Hock 1991: 11), there are clear examples of sound change due to misperception and re-interpretation by the listeners (Ohala 1989: 182). Some authors go as far as to claim that production-driven changes reflect regular internal language change, which happens gradually during acquisition or (depending on the theory) also in later stages (Bybee 2002), while perception-based changes rather reflect change happening in second language acquisition and language contact (Mowrey and Pagliuca 1995: 48).

While the interaction of production and perception has been discussed in some detail in the linguistic literature, the influence of systemic factors has so far only rarely been considered. What I mean by this factor is the idea that certain changes in language evolution may be explained exclusively as resulting from systemic constellations. As a straightforward example, consider the difference in design space for the production of consonants, vowels, and tones. In order to maintain pronounceability and comprehensibility, it is useful for the sound system of a given language to fill those spots in the design space that are maximally different from each other. The larger the design space and the smaller the inventory, the easier it is to guarantee its functionality. Since the design spaces for vowels and tones are much smaller than that for consonants, however, these sub-systems are more easily disturbed, which could be used to explain the presence of chain shifts of vowels, or flip-flop in tone systems (Wang 1967: 102). Systemic considerations play an increasingly important role in evolutionary theory, and, as shown in List et al. (2016), can also be used as explanations for phenomena as strange as Sapir's drift (Sapir 1921).

However, the crucial question, when trying to establish a typology of sound change, is how these different effects could be measured. I think it is obvious that collections of individual sound changes proposed in the literature are not enough. But what data would be sufficient or needed to address the problem is not entirely clear to me either.

Traditional approaches

As the first traditional approach to the typology of sound change, one should mention the intuition inside the heads of the numerous historical linguists who study particular language families. Scholars trained in historical linguistics usually start to develop some kind of intuition about likely and unlikely tendencies in sound change, and for the most part they also agree on it. The problem with this intuition, however, is that it is not explicit, and it seems that it was never even the intention of the majority of historical linguists to make their knowledge explicit. The reasons for this reluctance with respect to formalization and transparency are two-fold. First, given that every individual has invested quite some time in order to develop their intuition, it is possible that the idea of having a resource that distributes this intuition in a rigorously data-driven and explicit manner provokes the typical feeling of envy in quite a few people, who may then think: «I had to invest so much time in order to learn all this by heart. Why should young scholars now get all this knowledge for free?» Second, given the problems outlined in the previous section, many scholars also strongly believe that it is impossible to formalize the problem of sound change tendencies.

By far the largest traditional study of the typology of sound change is Kümmel's (2008) book Konsonantenwandel (Consonant Change), in which the author surveys sound change processes discussed in the literature on Indo-European and Semitic languages. As the title of the book suggests, it concentrates on the change of consonants, which are (probably due to the larger design space) also the class of sounds that shows stronger cross-linguistic tendencies. The book is based on a thorough inspection of the literature on consonant change in Indo-European and Semitic linguistics. The procedure by which this collection was carried out can be seen as the gold standard by which any future attempt to enlarge the collection should be carried out.

What is specifically important, and also very difficult to achieve, is the harmonization of the evidence, which is nicely reflected in Kümmel's introduction, where he mentions that one of the main problems was to determine what the scholars actually meant with respect to phonetics and phonology, when describing certain sound changes (Kümmel 2008: 35). The major drawback of the collection is that it is not (yet) available in digital form. Given the systematicity with which the data was collected, it should be generally possible to turn the collection into a database; and it is beyond doubt that this collection could offer interesting insights into certain tendencies of sound change.

Another collection of sound changes taken from the literature is the mysterious Index Diachronica, a collection of sound changes assembled from various language families by a person who wishes to remain anonymous. By now, this collection even has a Searchable Index that allows scholars to click on a given sound and to see in which languages this sound is involved in some kind of sound change. What is a pity about the resource is that it is difficult to use, given that one does not really know where it actually comes from, and how the information was extracted from the sources. If the anonymous author would only decide to put it (albeit anonymously, or under a pseudonym) on a public preprint server, such as, for example, Humanities Commons, this would be excellent, as it would give those who are interested in pursuing the idea of collecting sound changes from the literature an excellent starting point from which to check the sources, and to further digitize the resource.

Right now, this resource seems to be mostly used by conlangers, i.e. people who create artificial languages as a hobby (or profession). Conlangers are often refreshingly pragmatic, and may come up with very interesting and creative ideas about how to address certain data problems in linguistics that "normal" linguists would refuse to tackle. There is a certain tendency in our field to ignore certain questions, either because scholars think it would be too tedious to collect the data needed to address the problem, or because they consider it impossible to do "correctly" from the start.

As a last and fascinating example, I have to mention the study by Yang and Xu (2019), in which the authors review studies of concrete examples of tone change in South-East Asian languages, trying to identify cross-linguistic tendencies. Before I read this study, I was not aware that tone change had at all been studied concretely, since most linguists consider the evidence for any kind of tendency far too shaky, and reconstruct tone exclusively as an abstract entity. The survey by Yang and Xu, however, shows clearly that there seem to be at least some tendencies, and that they can be identified by invoking a careful degree of abstraction when comparing tone change across different languages.

For the reasons outlined in the previous paragraphs, I do not think that a collection of sound change examples from the literature addresses the problem of establishing a typology of sound change. Specifically, the fact that sound change collections usually do not provide any tangible examples or frequencies of a given sound change within the language where it occurred, and also the fact that they do not record any tendencies of sounds to resist change, is a major drawback, and a major loss of evidence during data collection. However, I consider these efforts to be valuable and important contributions to our field. Given that they allow us to learn a lot about some very general and well-confirmed tendencies of sound change, they are also an invaluable source of inspiration when it comes to working on alternative approaches.

Computational approaches

To my knowledge, there are no real computational approaches to the study of sound change so far. What one should mention, however, are initial attempts to measure certain aspects of sound change automatically. Thus, Brown et al. (2013) measure sound correspondences across the world's languages, based on a collection of 40-item wordlists for a very large sample of languages. The limitations of this study can be found in the restricted alphabet being used (all languages are represented by a reduced transcription system of some 40 letters, called the ASJP code). While the code originally allowed representing more than just 40 sounds, since the graphemes can be combined, the collection was carried out inconsistently for different languages, which has now led to the situation that the majority of computational approaches treat each letter as a single sound, or consider only the first element of complex grapheme combinations.

While sound change is a directional process, sound correspondences reflect the correspondence of sounds in different languages as a result of sound change, and it is not trivial to extract directional information from sound correspondence data alone. Thus, while the study of Brown et al. is a very interesting contribution, also providing a very straightforward methodology, it does not address the actual problem of sound change.

The study also has other limitations. First, the approach only measures those cases where sounds differ in two languages, and thus we have the same problem that we cannot tell how likely it is that two identical sounds correspond. Second, the study ignores phonetic environment (or context), which is an important factor in sound change tendencies (some sound changes, for example, tend to occur only in word endings, etc.). Third, the study considers only sound correspondences across language pairs, while it is clear that one can often find stronger evidence for sound correspondences when looking at multiple languages (List 2019).

Initial ideas for improvement

What we need in order to address the problem of establishing a true typology of sound change processes is, in my opinion:
  1. a standardized transcription system for the representation of sounds across linguistic resources,
  2. increased amounts of readily coded data that adhere to the standard transcription system and list cognate sets of ancestral and descendant languages,
  3. good, dated phylogenies that allow us to measure how often sound changes appear in a certain time frame,
  4. methods to infer the sound change rules (Problem 3), and
  5. improved methods for ancestral state reconstruction that would allow us to identify sound change processes not only for the root and the descendant nodes, but also for intermediate stages.
It is possible that even these five points are not enough, as I am still trying to think about how best to address the problem. But what I can say for sure is that one needs to address the problem step by step, starting with the issue of standardization — and that the only way to account for the problems mentioned above is to collect the pure empirical evidence on sound change, not the summarized results discussed in the literature. Thus, instead of saying that some source reports that in German the t became a ts at some point, I want to see a dataset that documents this in the form of concrete examples, numerous enough to show the regularity of the finding, and that ideally also lists the exceptions.
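As a toy illustration (with deliberately simplified orthographic transcriptions rather than proper segmented data) of what such a dataset could look like for the German case, each record would list concrete cognate material, the environment, and whether the change applied, so that regularity and exceptions can be counted rather than merely asserted:

```python
# Toy records for the shift of Germanic *t to German ts (spelled <z>),
# simplified for illustration; a real dataset would use segmented IPA forms.
records = [
    # (English cognate, German cognate, environment, t shifted?)
    ("two",    "zwei",  "word-initial", True),
    ("ten",    "zehn",  "word-initial", True),
    ("tongue", "Zunge", "word-initial", True),
    ("stone",  "Stein", "after s",      False),
    ("true",   "treu",  "before r",     False),
]

shifted = sum(r[3] for r in records)
print(f"{shifted}/{len(records)} cognate pairs show the shift")
```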

The advantage of this procedure is that the collection would be independent of the typical errors that usually occur when data are collected from the literature (usually also by employing armies of students who do the "dirty" work for the scientists). It would also be independent of individual scholars' interpretations. Furthermore, it would be exhaustive — that is, one could measure not only the frequency of a given change, but also its regularity, its conditioning context, and its systemic properties.

The disadvantage is, of course, the need to acquire standardized data in a large-enough size for a critical number of languages and language families. But, then again, if there were no challenges involved in this endeavor, I would not present it as an open problem of computational diversity linguistics.

Outlook

With the newly published database of Cross-Linguistic Transcription Systems (CLTS, Anderson et al. 2018), the first step towards a rigorous standardization of transcription systems has already been made. With our efforts towards a standardization of wordlists, which can also be applied in the form of a retro-standardization to existing data (Forkel et al. 2018), we have proposed a further step showing how lexical data can be collected efficiently for a large sample of the world's spoken languages (see also List et al. 2018). Work on automated cognate detection and workflows for computer-assisted language comparison has also drastically increased the efficiency of historical language comparison.

So, we are advancing towards a larger collection of high-quality and historically compared datasets; and it is quite possible that, a couple of years from now, we will arrive at a point where the typology of sound change is no longer a dream of mine and of many colleagues, but something that may actually be feasible to extract from cross-linguistic data that have been historically annotated. But until then, many issues remain unsolved; and in order to address them, it would be useful to work towards pilot studies, in order to see how well the ideas for improvement outlined above can actually be implemented.

References

Anderson, Cormac and Tresoldi, Tiago and Chacon, Thiago Costa and Fehn, Anne-Maria and Walworth, Mary and Forkel, Robert and List, Johann-Mattis (2018) A Cross-Linguistic Database of Phonetic Transcription Systems. Yearbook of the Poznań Linguistic Meeting 4.1: 21-53.

Brown, Cecil H. and Holman, Eric W. and Wichmann, Søren (2013) Sound correspondences in the world's languages. Language 89.1: 4-29.

Bybee, Joan L. (2002) Word frequency and context of use in the lexical diffusion of phonetically conditioned sound change. Language Variation and Change 14: 261-290.

Forkel, Robert and List, Johann-Mattis and Greenhill, Simon J. and Rzymski, Christoph and Bank, Sebastian and Cysouw, Michael and Hammarström, Harald and Haspelmath, Martin and Kaiping, Gereon A. and Gray, Russell D. (2018) Cross-Linguistic Data Formats, advancing data sharing and re-use in comparative linguistics. Scientific Data 5.180205: 1-10.

Fox, Anthony (1995) Linguistic Reconstruction. An Introduction to Theory and Method. Oxford: Oxford University Press.

Hock, Hans Henrich (1991) Principles of Historical Linguistics. Berlin: Mouton de Gruyter.

Kümmel, Martin Joachim (2008) Konsonantenwandel [Consonant change]. Wiesbaden: Reichert.

Lass, Roger (2017) Reality in a soft science: the metaphonology of historical reconstruction. Papers in Historical Phonology 2.1: 152-163.

List, Johann-Mattis (2014) Sequence Comparison in Historical Linguistics. Düsseldorf: Düsseldorf University Press.

List, Johann-Mattis and Pathmanathan, Jananan Sylvestre and Lopez, Philippe and Bapteste, Eric (2016) Unity and disunity in evolutionary sciences: process-based analogies open common research avenues for biology and linguistics. Biology Direct 11.39: 1-17.

List, Johann-Mattis and Greenhill, Simon J. and Anderson, Cormac and Mayer, Thomas and Tresoldi, Tiago and Forkel, Robert (2018) CLICS². An improved database of cross-linguistic colexifications assembling lexical data with help of cross-linguistic data formats. Linguistic Typology 22.2: 277-306.

List, Johann-Mattis (2019) Automatic inference of sound correspondence patterns across multiple languages. Computational Linguistics 45.1: 137-161.

Mowrey, Richard and Pagliuca, William (1995) The reductive character of articulatory evolution. Rivista di Linguistica 7: 37–124.

Ohala, John J. (1989) Sound change is drawn from a pool of synchronic variation. In: Breivik, L. E. and Jahr, E. H. (eds.) Language Change: Contributions to the Study of its Causes. Berlin: Mouton de Gruyter, pp. 173-198.

Sapir, Edward (1921[1953]) Language. An Introduction to the Study of Speech.

de Saussure, Ferdinand (1916) Cours de linguistique générale. Lausanne: Payot.

Wang, William S-Y. (1967) Phonological features of tone. International Journal of American Linguistics 33.2: 93-105.

Yang, Cathryn and Xu, Yi (2019) A review of tone change studies in East and Southeast Asia. Diachronica 36.3: 417-459.

Typology of semantic change (Open problems in computational diversity linguistics 8)


With this month's problem, we are leaving the realm of modeling, which has been the basic aspect underlying the last three problems, discussed in June, July, and August, and entering the realm of typology, or general linguistics. The last three problems that I will discuss, in this and two follow-up posts, deal with the basic problem of making use of, or collecting, data that allow us to establish typologies, that is, to identify cross-linguistic tendencies for specific phenomena, such as semantic change (this post), sound change (October), or semantic promiscuity (November).

Cross-linguistic tendencies are here understood as tendencies that occur across all languages, independently of their specific phylogenetic affiliation, the place where they are spoken, or the time at which they are spoken. Obviously, the uniformitarian requirement of independence of place and time is an idealization. As we know well, the capacity for language itself developed, potentially gradually, with the evolution of modern humans, and as a result, it does not make sense to assume that the tendencies of semantic change or sound change have been the same throughout time. This has, in fact, been shown in recent research illustrating that there may be a certain relationship between our diet and the speech sounds that we use in our languages (Blasi et al. 2019).

Nevertheless, in the same way in which we simplify models in physics, as long as they yield good approximations of the phenomena we want to study, we can also assume a certain uniformity for language change. To guarantee this, we may have to restrict the time frame of language development that we want to discuss (eg. the last 2,000 years), or the aspects of language we want to investigate (eg. a certain selection of concepts that we know must have been expressed 5,000 years ago).

For the specific case of semantic change, the problem of establishing a typology of the phenomenon can thus be stated as follows:
Assuming a certain pre-selection of concepts that we assume were readily expressed in a given time frame, establish a general typology that informs about the universal tendencies by which a word expressing one concept changes its meaning, to later express another concept in the same language.
In theory, we can further relax the conditions of universality and add the restrictions on time and place later, after having aggregated the data. Maybe this would even be the best idea for a practical investigation; but given that the time frames in which we have attested data for semantic changes are rather limited, I do not believe that it would make much of a difference.

Why it is hard to establish a typology of semantic change

There are three reasons why it is hard to establish a typology of semantic change. First, there is the problem of acquiring the data needed to establish the typology. Second, there is the problem of handling the data efficiently. Third, there is the problem of interpreting the data in order to identify cross-linguistic, universal tendencies.

The problem of data acquisition results from the fact that we lack data on observed processes of semantic change. Since there are only a few languages with a continuous tradition of written records spanning 500 years or more, we will never be able to derive any universal tendencies from those languages alone, even if starting from languages like Latin and its Romance descendants may be a good first step, as has been shown by Blank (1997).

Accepting the fact that processes attested only for Romance languages are never enough to fill the huge semantic space covered by the world's languages, the only alternative would be using inferred processes of semantic change — that is, processes that have been reconstructed and proposed in the literature. While it is straightforward to show that the meanings of cognate words in different languages can vary quite drastically, it is much more difficult to infer the direction underlying the change. Handling the direction, however, is important for any typology of semantic change, since the data from observed changes suggests that there are specific directional tendencies. Thus, when confronted with cognates such as selig "holy" in German and silly in English, it is much less obvious whether the change happened from "holy" to "silly" or from "silly" to "holy", or even from an unknown ancient concept to both "holy" and "silly".

As a result, we can conclude that any collection of data on semantic change needs to make crystal-clear upon which types of evidence the inference of semantic change processes is based. Citing only the literature on different language families is definitely not enough. This brings us to the second problem, the handling of data on semantic shifts. Here, we face the general problem of the elicitation of meanings. Elicitation refers to the process in fieldwork where scholars use a questionnaire to ask their informants how certain meanings are expressed. The problem here is that linguists have never tried to standardize which meanings they actually elicit. What they use, instead, are elicitation glosses, which they think are common enough to allow other linguists to understand which meaning they refer to. As a result, it is extremely difficult to search fieldwork notes, and even wordlists or dictionaries, for specific meanings, since every linguist uses their own style, often without further explanations.

Our Concepticon project (List et al. 2019, https://concepticon.clld.org) can be seen as a first attempt to handle elicitation glosses consistently. What we do is to link those elicitation glosses that we find in questionnaires, dictionaries, and fieldwork notes to so-called concept sets, each of which represents a given concept and receives a unique identifier and a short definition. It would go too far to dive deeper into the problem of concept handling here. Interested readers can have a look at a previous blog post I wrote on the topic (List 2018). In any case, any typology of semantic change will need to find a way to address the problem of handling elicitation glosses in the literature, in one way or another.
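To make the idea of concept sets a bit more tangible, here is a minimal sketch in Python of what linking elicitation glosses to concept sets amounts to. The glosses, identifiers, and labels in the mapping are invented for illustration; the real Concepticon provides curated concept sets with definitions and far larger mappings.

```python
# Minimal sketch of linking elicitation glosses to concept sets.
# The mapping, identifiers, and labels below are invented for
# illustration; Concepticon provides curated concept sets instead.

GLOSS_TO_CONCEPTSET = {
    "to go (on foot)": ("0001", "WALK"),
    "walk":            ("0001", "WALK"),
    "go by foot":      ("0001", "WALK"),
    "the sun":         ("0002", "SUN"),
    "sun":             ("0002", "SUN"),
}

def normalize(gloss):
    """Reduce trivial variation in elicitation glosses."""
    return gloss.strip().lower()

def link_gloss(gloss):
    """Return the (id, label) of the concept set, or None if unknown."""
    return GLOSS_TO_CONCEPTSET.get(normalize(gloss))

for gloss in ["Walk", "the sun", "moon"]:
    print(gloss, "->", link_gloss(gloss))
```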

As a last problem, when having assembled data that show semantic change processes across a sufficiently large sample of languages and concepts, there is the problem of analyzing the data themselves. While the obvious way to identify cross-linguistic tendencies is to look for examples that occur in different language families and in different parts of the world, it is not always easy to distinguish between the four major reasons for similarities among languages, namely: (1) coincidence, (2) universal tendencies, (3) inheritance, and (4) contact (List 2019). The only way to avoid relying on potentially unreliable statistics that squeeze the juice out of small datasets is to achieve a sufficiently large coverage of data from as many language families and locations as possible. But given that there are no automated ways to infer directed semantic change processes across linguistic datasets, it is unlikely that a collection of data acquired from the literature alone will reach the critical mass needed for such an endeavor.

Traditional approaches

Apart from the above-mentioned work by Blank (1997), which is, unfortunately, rarely mentioned in the literature (potentially because it is written in German), there is an often-cited paper by Wilkins (1996), as well as preliminary work on directionality (Urban 2011). However, the attempt that addresses the problem most closely is the Database of Semantic Shifts (Zalizniak et al. 2012), which, according to the most recent information on the website, was established in 2002 and has been continuously updated since then.

The basic idea, as far as I understand the principle of the database, is to collect semantic shifts attested in the literature, and to note the type of evidence, as well as the direction, where it is known. The resource is unique, since nobody else has tried to establish a collection of semantic shifts attested in the literature, and it is therefore incredibly valuable. At the same time, it also shows what problems we face when trying to establish a typology of semantic shifts.

Apart from the typical technical problems found in many projects shared on the web (missing download access to all data underlying the website, missing deposit of versions on public repositories, missing versioning), the greatest problem of the project is that no apparent attempt was undertaken to standardize the elicitation glosses. This became especially obvious when we tried to link an older version of the database, which is now no longer available, to our Concepticon project. In the end, I selected some 870 concepts from the database that were supported by a larger number of data points, but had to ignore more than 1,500 remaining elicitation glosses, since it was not possible to infer in reasonable time what the underlying concepts denote, not to speak of obvious cases where the same concept was denoted by slightly different elicitation glosses. As far as I can tell, this has not changed much with the most recent update of the database, which was published earlier this year.

Apart from the aforementioned problem of missing standardization of elicitation glosses, the database does not seem to annotate which type of evidence has been used to establish a given semantic shift. An even more important problem, which is typical of almost all attempts to establish databases of change in the field of diversity linguistics, is that the database only shows what has changed, while nothing can be found on what has stayed the same. A true typology of change, however, must show what has not changed along with what has changed. As a result, any attempt to pick proposed changes from the literature alone will fail to offer a true typology, a collection of universal tendencies.

To be fair: the Database of Semantic Shifts is by no means claiming to do this. What it offers is a collection of semantic change phenomena discussed in the linguistic literature. This in itself is an extremely valuable, and extremely tedious, enterprise. While I wish that the authors would open their data, version it, standardize the elicitation glosses, and host it on stable public archives, in order to avoid what happened in the past (that people quote versions of the data which no longer exist) and to open the data up for quantitative analyses, I deeply appreciate the attempt to approach the problem of semantic change from an empirical, data-driven perspective. To address the problem of establishing a typology of semantic shifts, however, I think that we need to start thinking beyond collecting what has been stated in the literature.

Computational approaches

As a first computational approach that comes in some way close to a typology of semantic shifts, there is the Database of Cross-Linguistic Colexifications (List et al. 2018), which was originally launched in 2014 and received a major update in 2018 (see List et al. 2018b for details). This CLICS database, which I have mentioned several times in the past, does not show diachronic data, i.e. data on semantic change phenomena, but instead lists automatically detectable polysemies and homophonies (also called colexifications).

While the approach taken by the Database of Semantic Shifts is in some sense bottom-up, as the authors start from the literature and add those concepts that are discussed there, CLICS is top-down, as it starts from a list of concepts (reflected as standardized Concepticon concept sets) and then checks which languages express more than one concept by one and the same word form.

The advantages of top-down approaches are that much more data can be processed, and that one can easily derive a balanced sample in which the same concepts are compared for as many languages as possible. The disadvantage is that such a database will ignore certain concepts a priori, if they do not occur in the data.
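The top-down check that CLICS performs can be illustrated with a small sketch: given a wordlist in which forms are already linked to standardized concepts, we simply look for languages in which one and the same form expresses more than one concept. The wordlist below is invented for illustration; the real workflow operates on much larger, standardized datasets.

```python
from collections import defaultdict
from itertools import combinations

# Toy wordlist with rows (language, concept, form); data invented for illustration.
wordlist = [
    ("LangA", "TREE", "baum"),
    ("LangA", "WOOD", "baum"),
    ("LangA", "SUN",  "zon"),
    ("LangB", "TREE", "arbo"),
    ("LangB", "WOOD", "ligno"),
]

# Group the concepts expressed by each (language, form) pair.
concepts_by_form = defaultdict(set)
for language, concept, form in wordlist:
    concepts_by_form[language, form].add(concept)

# A form expressing two or more concepts in one language is a colexification.
colexifications = defaultdict(list)
for (language, form), concepts in concepts_by_form.items():
    for pair in combinations(sorted(concepts), 2):
        colexifications[pair].append((language, form))

for pair, attestations in colexifications.items():
    print(pair, attestations)  # e.g. ('TREE', 'WOOD') [('LangA', 'baum')]
```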

Since CLICS lists synchronic patterns without further interpreting them, the database is potentially interesting for those who want to work on semantic change, but it does not help solve the problem of establishing a typology of semantic change itself. In order to achieve this, one would have to go through all attested polysemies in the database and investigate them, searching for potential hints on directions.

A potential way to infer directions for semantic shifts is presented by Dellert (2016), who applies causal inference techniques to polysemy networks to address this task. The problem, as far as I understand the techniques, is that the currently available polysemy databases barely offer enough information for these kinds of analyses. Furthermore, it would also be important to see how well the method actually performs in comparison to what we think we already know about the major patterns of semantic change.

Initial ideas for improvement

There does not seem to be a practical way to address our problem by means of computational solutions alone. What we need, instead, is a computer-assisted strategy that starts from a thorough investigation of the criteria that scholars use to infer directions of semantic change from linguistic data. Once these criteria are settled, more or less, one would need to think of ways to operationalize them, in order to allow scholars to work with concrete etymological data, ideally comprising standardized word-lists for different language families, and to annotate them as closely as possible.

Ideally, scholars would propose larger etymological datasets in which they reconstruct whole language families, proposing semantic reconstructions for proto-forms. These would already contain the proposed directions of semantic change, and they would also automatically show where change does not happen. Since we currently lack automated workflows that fully account for this level of detail, one could start by applying methods for cognate detection across semantic slots (cross-semantic cognate detection), which would yield valuable data on semantic change processes without providing directions, and then add the directional information based on the principles that scholars use in their reconstruction methodology.
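As a minimal sketch of what such a cross-semantic first step could yield, assume that cognate classes have already been inferred across concept slots; cognate classes that span more than one concept then point to (still undirected) candidates for semantic change. The forms and cognate class labels below are hypothetical and only serve to make the idea concrete.

```python
from collections import defaultdict
from itertools import combinations

# Toy cognate judgments: (language, concept, form, cognate class).
# All forms and class labels are hypothetical.
rows = [
    ("German",  "HOLY",  "selig",  "c1"),
    ("English", "SILLY", "silly",  "c1"),  # cognate with German selig
    ("German",  "SILLY", "albern", "c2"),
    ("English", "HOLY",  "holy",   "c3"),
]

concepts_by_class = defaultdict(set)
for language, concept, form, cogclass in rows:
    concepts_by_class[cogclass].add(concept)

# Cognate classes linking several concepts are undirected shift candidates.
candidates = defaultdict(int)
for cogclass, concepts in concepts_by_class.items():
    for pair in combinations(sorted(concepts), 2):
        candidates[pair] += 1

print(dict(candidates))  # {('HOLY', 'SILLY'): 1}
```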

Outlook

Given the recent advances in detection of sound correspondence patterns, sequence comparison, and etymological annotation in the field of computational historical linguistics, it seems perfectly feasible to work on detailed etymological datasets of the languages of the world, in which all information required to derive a typology of semantic change is transparently available. The problem is, however, that it would still take a lot of time to actually analyze and annotate these data, and to find enough scholars who would agree to carry out linguistic reconstruction in a similar way, using transparent tools rather than convenient shortcuts.

References

Blank, Andreas (1997) Prinzipien des lexikalischen Bedeutungswandels am Beispiel der romanischen Sprachen. Tübingen: Niemeyer.

Blasi, Damián E. and Steven Moran and Scott R. Moisik and Paul Widmer and Dan Dediu and Balthasar Bickel (2019) Human sound systems are shaped by post-Neolithic changes in bite configuration. Science 363.1192: 1-10.

List, Johann-Mattis and Greenhill, Simon and Anderson, Cormac and Mayer, Thomas and Tresoldi, Tiago and Forkel, Robert (2018) CLICS: Database of Cross-Linguistic Colexifications. Version 2.0. Max Planck Institute for the Science of Human History. Jena: http://clics.clld.org/.

List, Johann-Mattis and Greenhill, Simon and Rzymski, Christoph and Schweikhard, Nathanael and Forkel, Robert (2019) Concepticon. A resource for the linking of concept lists (Version 2.1.0). Max Planck Institute for the Science of Human History. Jena: https://concepticon.clld.org/.

Dellert, Johannes and Buch, Armin (2016) Using computational criteria to extract large Swadesh Lists for lexicostatistics. In: Proceedings of the Leiden Workshop on Capturing Phylogenetic Algorithms for Linguistics.

List, Johann-Mattis and Greenhill, Simon J. and Anderson, Cormac and Mayer, Thomas and Tresoldi, Tiago and Forkel, Robert (2018) CLICS². An improved database of cross-linguistic colexifications assembling lexical data with help of cross-linguistic data formats. Linguistic Typology 22.2: 277-306.

List, Johann-Mattis (2018) Towards a history of concept list compilation in historical linguistics. History and Philosophy of the Language Sciences 5.10: 1-14.

List, Johann-Mattis (2019) Automated methods for the investigation of language contact situations, with a focus on lexical borrowing. Language and Linguistics Compass 13.e12355: 1-16.

Urban, Matthias (2011) Asymmetries in overt marking and directionality in semantic change. Journal of Historical Linguistics 1.1: 3-47.

Wilkins, David P. (1996) Natural tendencies of semantic change and the search for cognates. In: Durie, Mark (ed.) The Comparative Method Reviewed: Regularity and Irregularity in Language Change. New York: Oxford University Press, pp. 264-304.

Zalizniak, Anna A. and Bulakh, M. and Ganenkov, Dimitrij and Gruntov, Ilya and Maisak, Timur and Russo, Maxim (2012) The catalogue of semantic shifts as a database for lexical semantic typology. Linguistics 50.3: 633-669.

Simulation of sound change (Open problems in computational diversity linguistics 6)


The sixth problem in my list of open problems in computational diversity linguistics is devoted to the problem of simulating sound change. When formulating the problem, it is difficult to see what is actually meant, as there are two possibilities for a concrete simulation: (i) one could think of a sound system of a given language and then model how, through time, the sounds change into other sounds; or (ii) one could think of a bunch of words in the lexicon of a given language, and then simulate how these words are changed through time, based on different kinds of sound change rules. I have in mind the latter scenario.

Why simulating sound change is hard

The problem of simulating sound change is hard for a number of reasons. First of all, the problem is similar to the problem of sound law induction, since we have to find a simple and straightforward way to handle phonetic context (remember that sound change may often only apply to sounds that occur in a certain environment of other sounds). This is already difficult enough, but it could be handled with the help of what I called multi-tiered sequence representations (List and Chacon 2015). However, there are four further problems that one would need to overcome (or at least be aware of) when trying to successfully simulate sound change.
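Before turning to these further problems, a small sketch may help to show what a multi-tiered representation could look like: a word is stored not just as a sequence of segments, but with parallel tiers that make the phonetic context directly addressable, so that context-sensitive rules can refer to a tier instead of parsing a string. The choice of tiers here (just the preceding and following segment) is my own simplification for illustration, not the representation proposed by List and Chacon (2015).

```python
def to_tiers(segments):
    """Represent a segmented word as parallel tiers: besides the segments
    themselves, add a tier for the preceding and the following segment
    ('#' marks a word boundary). A simplification for illustration only."""
    left = ["#"] + segments[:-1]
    right = segments[1:] + ["#"]
    return {"segment": segments, "left": left, "right": right}

word = ["p", "i", "a", "ŋ"]
tiers = to_tiers(word)

# A context-sensitive rule such as "p becomes f before i" can now simply
# check the 'right' tier for every segment.
changed = [
    "f" if segment == "p" and following == "i" else segment
    for segment, following in zip(tiers["segment"], tiers["right"])
]
print(changed)  # ['f', 'i', 'a', 'ŋ']
```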

The first of these extra problems is that of morphological change and analogy, which usually goes along with "normal" sound change, following what Anttila (1976) calls Sturtevant's paradox — namely, that regular sound change produces irregularity in language systems, while irregular analogy produces regularity in language systems. In historical linguistics, analogy serves as a cover term for various processes in which words or word parts are rendered more similar to other words than they had been before. Classical examples are children's "regular" plurals of nouns like mouse (e.g. mouses instead of mice) or "regular" past tense forms of verbs like catch (e.g. catched instead of caught). In all these cases, perceived irregularities in the grammatical system, which often go back to ancient sound change processes, are regularized on an ad hoc basis.

One could (maybe one should), of course, start with a model that deliberately ignores processes of morphological change and analogical leveling, when drafting a first system for sound change simulation. However, one needs to be aware that it is difficult to separate morphological change from sound change, as our methods for inference require that we identify both of them properly.

The second extra problem is the question of the mechanism of sound change, where competing theories exist. Some scholars emphasize that sound change is entirely regular, spreading over the whole lexicon (or changing one key on the typewriter), while others claim that sound change may slowly spread from word to word and at times not reach all words in a given lexicon. If one wants to profit from simulation studies, one would ideally allow for a testing of both systems; but it seems difficult to model the idea of lexical diffusion (Wang 1969), given that it should depend on external parameters, like frequency of word use, which are also not very well understood.
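To make the contrast between the two mechanisms concrete, here is a minimal simulation sketch, assuming a single change (p becomes f) and a toy lexicon: under the regular, Neogrammarian mechanism the change applies to every word at once, while under lexical diffusion it spreads word by word, with a chance of applying that I arbitrarily tie to word frequency. All words and numbers are made up; this is an illustration of the two scenarios, not a model of any real language.

```python
import random

# Toy lexicon with invented relative usage frequencies.
lexicon = {"pata": 0.9, "puli": 0.5, "kapa": 0.2, "pimu": 0.1}

def regular_change(words):
    """Neogrammarian scenario: p > f applies to all words at once."""
    return {word.replace("p", "f"): freq for word, freq in words.items()}

def lexical_diffusion(words, generations=5, rate=0.5, seed=42):
    """Diffusion scenario: in every generation the change p > f may or may
    not reach a given word; the chance is (arbitrarily) tied to frequency."""
    rng = random.Random(seed)
    current = dict(words)
    for _ in range(generations):
        current = {
            (word.replace("p", "f")
             if "p" in word and rng.random() < rate * freq
             else word): freq
            for word, freq in current.items()
        }
    return current

print(regular_change(lexicon))     # all p have become f
print(lexical_diffusion(lexicon))  # some words changed, others not (yet)
```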

The third extra problem is that of the actual tendencies of sound change, which are also by no means well understood by linguists. Initial work on sound change has been carried out (Kümmel 2008), but the major task of finding a way to compare the tendencies of sound change processes across a large sample of the world's languages (i.e. the typology of sound change, which I plan to discuss separately in a later post) has not been accomplished so far. The reason why we are missing this typology is that we lack clear-cut machine-readable accounts of annotated, aligned data, in which scholars would provide their proto-forms for the reconstructed languages along with their proposed sound laws in a system that can in fact be tested and run (allowing us also to estimate the exceptions, or where those systems fail).

But having an account of the tendencies of sound change opens up a fourth important problem, apart from the lack of data that we could use to draw up a first typology of sound change processes: sound change tendencies are initiated not only by the general properties of speech sounds, but also by the linguistic systems in which these speech sounds are employed. While scholars occasionally mention this, there have been no real attempts to separate the two aspects in a concrete reconstruction of a particular language. The typology of sound change tendencies could thus not simply stop at listing tendencies resulting from the properties of speech sounds, but would also have to find a way to model diverging tendencies that result from systemic constraints.

Traditional insights into the process of sound change

When discussing sound change, we need to distinguish mechanisms, types, and patterns. Mechanisms refer to how the process "proceeds", the types refer to the concrete manifestations of the process (like a certain, concrete change), and patterns reflect the systematic perspective of changes (i.e. their impact on the sound system of a given language, see List 2014).

Figure 1: Lexical diffusion

The question regarding the mechanism is important, since it refers to the dispute over whether sound change is happening simultaneously for the whole lexicon of a given language — that is, whether it reflects a change in the inventory of sounds, or whether it jumps from word to word, as the defenders of lexical diffusion propose, whom I mentioned above (see also Chen 1972). While nobody would probably nowadays deny that sound change can proceed as a regular process (Labov 1981), it is less clear as to which degree the idea of lexical diffusion can be confirmed. Technically, the theory is dangerous, since it allows a high degree of freedom in the analysis, which can have a deleterious impact on the inference of cognates (Hill 2016). But this does not mean, of course, that the process itself does not exist. In these two figures, I have tried to contrast the different perspectives on the phenomena.

Figure 2: Regular sound change

To gain a deeper understanding of the mechanisms of sound change, it seems indispensable to work more on models trying to explain how it is actuated after all. While most linguists agree that synchronic variation in our daily speech is what enables sound change in the first place, it is not entirely clear how certain new variants are fixed in a society. Interesting theories in this context have been proposed by Ohala (1989), who describes distinct scenarios in which sound change can be initiated either by the speaker or by the listener, which would in theory also yield predictable tendencies with respect to the typology of sound change.

The insights into the types and patterns of sound change are, as mentioned above, much more rudimentary, although one can say that most historical linguists have a rather good intuition with respect to what is possible and what is less likely to happen.

Computational approaches

We can find quite a few published papers devoted to the simulation of certain aspects of sound change, but so far, we do not (at least to my current knowledge) find any comprehensive account that would try to feed some 1,000 words to a computer and see how this "language" develops — which sound laws can be observed to occur, and how they change the shape of the given language. What we find, instead, are a couple of very interesting accounts that try to deal with certain aspects of sound change.

Winter and Wedel (2016), for example, test agent-based exemplar models, in order to see how systems maintain contrast despite variation in the realization (Hamann 2014: 259f gives a short overview of other recent articles). Au (2008) presents simulation studies that aim to test to which degree lexical diffusion and "regular" sound change interact in language evolution. Dediu and Moisik (2019) investigate, with the help of different models, to which degree the vocal tract anatomy of speakers may have an impact on the actuation of sound change. Stevens et al. (2019) present an agent-based simulation to investigate the change of /s/ to /ʃ/ in English.

This summary of literature is very eclectic, especially because I have only just started to read more about the different proposals out there. What is important for the problem of sound change simulation is that, to my knowledge, there is no approach yet ready to run the full simulation of a given lexicon for a given language, as stated above. Instead, the studies reported so far have a much more fine-grained focus, specifically concentrating on the dynamics of speaker interaction.

Initial ideas for improvement

I do not have concrete ideas for improvement, since the problem's solution depends on quite a few other problems that would need to be solved first. But to address the idea of simulating sound change, albeit only in a very simplified form, I think it will be important to work harder on our inferences, by making transparent what is so far only implicitly stored in the heads of the many historical linguists, in the form of what they call their intuition.

During the past 200 years, after linguists started to apply to other language families the mysterious comparative method that they had successfully used to reconstruct Indo-European, the amount of data and the number of reconstructions for the world's languages have been increasing drastically. Many different language families have now been intensively studied, and the results have been presented in etymological dictionaries, numerous books and articles on particular questions, and at times even in databases.

Unfortunately, however, we rarely find attempts by scholars to actually provide their findings in a form that would allow us to check the correctness of their predictions automatically. I am thinking in very simple terms here — a scholar who proposes a reconstruction for a given language family should deliver not only the proto-forms with the reflexes in the daughter languages, but also a detailed, testable account of how the proposed sound laws turn the proto-forms into the reflexes we observe in the daughter languages.

While it is clear that this could not easily have been implemented in the past, it is in fact possible now, as we can see from a couple of studies where scholars have tried to compute sound change (Hartmann 2003, Pyysalo 2017, see also Sims-Williams 2018 for an overview of more literature). Although these attempts are unsatisfying, given that they do not account for cross-linguistic comparability of data (e.g. they use orthographies rather than unified transcriptions, as proposed by Anderson et al. 2018), they illustrate that it should in principle be possible to use transducers and similar technologies to formally check how well the data can be explained under a certain set of assumptions.
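A minimal sketch of the kind of check meant here, assuming that a scholar provides proto-forms, attested reflexes, and an ordered list of sound laws written as regular expressions: the laws are applied in order to each proto-form, and the predicted reflex is compared with the attested one, so that exceptions become immediately visible. The forms and laws below are invented for illustration and do not reproduce any published reconstruction.

```python
import re

# Ordered sound laws for one invented daughter language, written as
# (pattern, replacement) pairs that are applied in sequence with re.sub.
SOUND_LAWS = [
    (r"p(?=i)", "f"),   # p > f before i
    (r"a(?=n$)", "o"),  # a > o before word-final n
]

# Invented proto-forms and attested reflexes, for illustration only.
DATA = [
    ("pian", "fion"),
    ("pat",  "pat"),
    ("apan", "apon"),
    ("pita", "pita"),   # a deliberate exception: the laws predict 'fita'
]

def apply_laws(proto, laws=SOUND_LAWS):
    """Apply the ordered sound laws to a proto-form."""
    form = proto
    for pattern, replacement in laws:
        form = re.sub(pattern, replacement, form)
    return form

for proto, attested in DATA:
    predicted = apply_laws(proto)
    status = "OK" if predicted == attested else "EXCEPTION"
    print(f"*{proto} -> {predicted} (attested: {attested}) {status}")
```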

Without cross-linguistic accounts of the diversity of sound change processes (i.e. a first solution to the problem of establishing a first typology of sound change), attempts to simulate sound change will remain difficult. The only way to address this problem is to require a more rigorous coding of data (both human- and machine-readable), and an increased openness of scholars who work on the reconstruction of interesting language families, to help make their data cross-linguistically comparable.

Sign languages

When drafting this post, I promised Guido and Justin to seize the opportunity, when talking about sound change, to say a few words about its peculiarities in contrast to other types of language change. The idea was that this would help us to somehow contribute to the mini-series on sign languages, which Guido and Justin initiated this month (see post number one, two, and three).

I do not think that I have completely succeeded in doing so, as what I have discussed today with respect to sound change does not really point out what makes it peculiar (if it is). But to provide a brief attempt, before I finish this post, I think that it is important to emphasize that the whole debate about regularity of sound change is, in fact, not necessarily about regularity per se, but rather about the question of where the change occurs. As the words in spoken languages are composed of a fixed number of sounds, any change to this system will have an impact on the language as a whole. Synchronic variation of the pronunciation of these sounds offers the possibility of change (for example during language acquisition); and once the pronunciation shifts in this way, all words that are affected will shift along, similar to a typewriter in which you change a key.

As far as I understand, for the time being it is not clear whether a counterpart of this process exists in sign language evolution, but if one wanted to search for such a process, one should, in my opinion, do so by investigating to what degree the signs can be considered as being composed of something similar to phonemes in historical linguistics. In my opinion, the existence of phonemes as minimal meaning-discriminating units in all human languages, including spoken and signed ones, is far from being proven. But if it should turn out that signed languages also recruit meaning-discriminating units from a limited pool of possibilities, there might be the chance of uncovering phenomena similar to regular sound change.

References
    Anderson, Cormac and Tresoldi, Tiago and Chacon, Thiago Costa and Fehn, Anne-Maria and Walworth, Mary and Forkel, Robert and List, Johann-Mattis (2018) A cross-linguistic database of phonetic transcription systems. Yearbook of the Poznań Linguistic Meeting 4.1: 21-53.

    Anttila, Raimo (1976) The acceptance of sound change by linguistic structure. In: Fisiak, Jacek (ed.) Recent Developments in Historical Phonology. The Hague, Paris, New York: de Gruyter, pp. 43-56.

    Au, Ching-Pong (2008) Acquisition and Evolution of Phonological Systems. Academia Sinica: Taipei.

    Chen, Matthew (1972) The time dimension. Contribution toward a theory of sound change. Foundations of Language 8.4: 457-498.

    Dediu, Dan and Moisik, Scott (2019) Pushes and pulls from below: Anatomical variation, articulation and sound change. Glossa 4.1: 1-33.

    Hamann, Silke (2014) Phonological changes. In: Bowern, Claire (ed.) Routledge Handbook of Historical Linguistics. Routledge, pp. 249-263.

    Hartmann, Lee (2003) Phono. Software for modeling regular historical sound change. In: Actas VIII Simposio Internacional de Comunicación Social. Southern Illinois University, pp. 606-609.

    Hill, Nathan (2016) A refutation of Song’s (2014) explanation of the ‘stop coda problem’ in Old Chinese. International Journal of Chinese Linguistics 2.2: 270-281.

    Kümmel, Martin Joachim (2008) Konsonantenwandel [Consonant change]. Wiesbaden: Reichert.

    Labov, William (1981) Resolving the Neogrammarian Controversy. Language 57.2: 267-308.

    List, Johann-Mattis (2014) Sequence Comparison in Historical Linguistics. Düsseldorf: Düsseldorf University Press.

    List, Johann-Mattis and Chacon, Thiago (2015) Towards a cross-linguistic database for historical phonology? A proposal for a machine-readable modeling of phonetic context. Paper presented at the workshop "Historical Phonology and Phonological Theory [organized as part of the 48th annual meeting of the SLE]" (2015/09/04, Leiden, Societas Linguistica Europaea).

    Ohala, J. J. (1989) Sound change is drawn from a pool of synchronic variation. In: Breivik, L. E. and Jahr, E. H. (eds.) Language Change: Contributions to the Study of its Causes. Berlin: Mouton de Gruyter, pp. 173-198.

    Pyysalo, Jouna (2017) Proto-Indo-European Lexicon: The generative etymological dictionary of Indo-European languages. In: Proceedings of the 21st Nordic Conference of Computational Linguistics, pp. 259-262.

    Sims-Williams, Patrick (2018) Mechanising historical phonology. Transactions of the Philological Society 116.3: 555-573.

    Stevens, Mary and Harrington, Jonathan and Schiel, Florian (2019) Associating the origin and spread of sound change using agent-based modelling applied to /s/- retraction in English. Glossa 4.1: 1-30.

    Wang, William Shi-Yuan (1969) Competing changes as a cause of residue. Language 45.1: 9-25.

    Winter, Bodo and Wedel, Andrew (2016) The co-evolution of speech and the lexicon: Interaction of functional pressures, redundancy, and category variation. Topics in Cognitive Science 8:  503-513.

    Stacking networks based on sign language manual alphabets


    This post is the first of a mini-series on sign language manual alphabets. While the evolution of spoken languages has been studied intensively using phylogenetic methods, sign languages have not, as yet.

    In this post we will first introduce our readers to a set of stacked networks, and how it assists in establishing ancestor-descendant relationships in a pretty straightforward (but not trivial) case: the evolution of manual alphabets in sign languages. In the next post, I will demonstrate the use of networks for character mapping and for putting forward hypotheses about ancestor-descendant relationships.

    In 2004, Spencer et al. (Two papers you may want to read...) showed that Neighbor-nets outperform tree inferences when it comes to explicit ancestor-descendant relationships. The data set they used was quite particular: copies of written text. Here, scribes copy a text, and then other scribes, some of them ignorant of the language of the text they are copying, copy the copies. In the paper, the sequence of copies was recorded (the 'true tree'), and then the various texts were transferred into phylogenetic matrices, in order to infer trees and networks, and then this result was compared to the 'true tree'. The best fit of the data to the truth was the Neighbor-net.

    This is a compelling conclusion, because, as a planar network and in contrast to median networks, Neighbor-nets don't explicitly place taxa in ancestor-descendant relationships. However, we have shown for many cases here at the Genealogical World of Phylogenetic Networks how ancestors are often placed with respect to their descendants: they are often closer to the center of the graph, or the root when known, and thus they bridge the center or sister lineages and their descendants. We can thus see why Neighbor-nets might be useful in practice.

    In this context, the evolution of sign language manual alphabets, i.e. the hand-shapes used to represent letters of a written alphabet, should be relatively easy to reconstruct. Once an alphabet is established in a sign language school / community (the ancestor), it will be passed on to other "generations" within the community and to other schools / communities (the descendants). However, this is not necessarily a dichotomous process, as depicted in the first figure.

    A scheme depicting how manual alphabets may evolve and disperse.

    There are a few complications here: for example, hand-shapes may change in the course of being used (the hand-shape evolves); contact may lead to exchange or appropriation of hand-shapes (called "borrowing" in linguistics); and, in some cases, entire alphabets will need to be adapted to a particular use. The latter case occurs when changing from one script (Latin, say) to another (Cyrillic or Arabic) — the first formal school for the deaf was established in Paris, for example. As a teacher, I need to decide: do I take a hand-shape from the morphologically similar letter, or the phonetically similar one? As a scientist, I need to assess the homologies among such hand-shapes without inflicting systematic bias.

    Standardization will wipe out local customs and replace them with a multinational standard. For instance, Country 2 in the scheme above drops its original B-type manual alphabet (red) for an A-type (blue); and in Country 7 both traditions are fused. Over time, originally distinct sign languages may converge due to geographic proximity, or even just feasibility.

    The evolution of spoken languages has been studied intensively using phylogenetic methods, and in particular networks are much more commonly found in the linguistic literature than in the biological one. For sign languages we have made a first step in a recently published pre-print:
    Justin M. Power, Guido W. Grimm, and Johann-Mattis List (2019) Evolutionary dynamics in the dispersal of sign languages. Humanities Commons. http://dx.doi.org/10.17613/0smt-j414
    What excites me about our study is that it combines historical manual alphabets (going back to 1593), which are potential ancestors, with a set of modern-day alphabets, which are their likely descendants. The data set is thus an evolutionary paleontologist's dream (and, possibly, a cladist's nightmare, if we expect a simple tree-like set of relationships rather than a network). As a scientist, I simply love to boldly go where no-one has gone before.

    The next figure shows the all-inclusive network from our paper, but focusing on the age of the manual alphabets.

    For more linguistic details see the pre-print.
    * Historical version(s) of these lineages are not included in our data set

    Obviously, there has been quite a lot of evolutionary change, as well as standardization, going on, although some parts, like the Swedish SL (sign language), have stuck to their unique originals. Historical and contemporary Spanish / Catalan are still most similar to the oldest manual alphabets that Justin dug out for our study. On the other hand, the contemporary Norwegian SL is placed far apart from its historical counterparts, and lacks any obvious affinity. Austrian, Danish, and German look back on a long and diverse history, the green "Austrian-origin Group", but the contemporaries have been homogenized by standardization (note the closeness to the International Sign manual alphabet). If we use an analogy with common biological and biogeographical processes (such as range expansion, competition, extinction, etc), then the Austrian-origin Group only survived in a remote island population, where we still find a sort of living fossil, the Icelandic SL.

    In contrast to biological data, the old, putatively ancestral, manual alphabets, i.e. the oldest manual alphabets in our data set, are not placed closer to the graph's center. The reason for this seems to lie in the data itself and in how manual alphabets evolve, and this will be the topic of the next post(s).

    Still, we can isolate some evolutionary pathways, especially when we make time-wise taxon-filtered networks and stack them (see this introduction to stacking and this application using Osmundaceae, a data set including an even larger ratio of fossil taxa to modern taxa).

    Fig. 4 from Power et al. Coloring same as above: pink – Spanish; turquoise – French-origin; green – Austrian-origin; orange – Polish; red – Russian; light blue – Swedish Group. The English-origin and Afghan-Jordanian groups are not included, since not represented by historical manual alphabets in our data set

    Each of the three networks includes manual alphabets from a certain time period, starting with pre-1840 at the bottom, historical 19th-/20th-century manual alphabets in the middle, and post-1950 manual alphabets in the top network. The dotted links between the networks connect manual alphabets that are included in two of the networks.

    Even from these graphs alone, we can say a lot about how ancestors (original manual alphabets in a country) relate to descendants (later and contemporary manual alphabets) and their evolutionary pathways. Here are some examples.

    Shortly after the time when the first schools for the deaf were established in continental Europe (late 18th, early 19th centuries), manual alphabets showed quite a diversity, and were very different from their potential Spanish sources, such as Yebra 1593 and Bonet 1620, with the French and Austrian teachers and communities going different ways. The oldest Cyrillic alphabet, Russian 1835, is more closely related to (ancient) Austrian than it is to (ancient) French.

    The Swedish manual alphabet of 1866 is a fresh invention. Some hand-shapes may have been borrowed from one or another alphabet in use on the continent but, as we will see in the next post of the series, it includes genuinely new forms.

    The French tradition was dispersed into the New World (American SL appears to be a direct derivation from the French, while the Brazilian SL is an adaptation) but remained a relatively homogeneous group. On the other hand, the Austrian-origin languages diversified, in particular within the Danish influence zone. Politically, the Danish king ceded Norway to Sweden in the Treaty of Kiel in 1814 (note the distance between the Norwegian and Danish languages in the late 19th century), while Iceland was a Danish dependency until 1918, when the Danish-Icelandic Act of Union was signed. Furthermore, the German manual alphabets subsequently diverged from the Austrian source.

    The Polish manual alphabet, originally an adaptation of the Austrian-Danish manual alphabets (see the graph in the middle), became closer to the Russian group, with the Latvian sign language taking up an intermediate position. The Cyrillic alphabets evolved further away, too (top graph).

    In the following post(s) of this miniseries, we will explain what we learned from simple character mapping on the time-taxon-filtered networks, and how to score manual alphabets in the first place.



    Automatic phonological reconstruction (Open problems in computational diversity linguistics 4)


    The fourth problem in my list of open problems in computational diversity linguistics is devoted to the problem of linguistic reconstruction, or, more specifically, to the problem of phonological reconstruction, which can be characterized as follows:
    Given a set of cognate morphemes across a set of related languages, try to infer the hypothetical pronunciation of each morpheme in the proto-language.
    This task needs to be distinguished from the broader task of linguistic reconstruction, which would usually include also the reconstruction of full lexemes, i.e. lexical reconstruction — as opposed to single morphemes or "roots" in an unknown ancestral language. In some cases, linguistic reconstruction is even used as a cover term for all reconstruction methods in historical linguistics, including such diverse approaches as phylogenetic reconstruction (finding the phylogeny of a language family), semantic reconstruction (finding the meaning of a reconstructed morpheme or root), or the task of demonstrating that languages are genetically related (see, e.g., the chapters in Fox 1995).

    Phonological and lexical reconstruction

    In order to understand the specific difference between phonological and lexical reconstruction, and why making this distinction is so important, consider the list of words meaning "yesterday" in five Burmish languages (taken from Hill and List 2017: 51).

    Figure 1: Cognate words in Burmish languages (taken from Hill and List 2017)

    Four of these languages express the word "yesterday" with the help of more than one morpheme, indicated by using different colors in the table's phonetic transcriptions, which at the same time also indicate which words we consider to be homologous in this sample. Four of the languages have one morpheme which (as we confirmed from the detailed language data) means "day" independently. This morpheme is given the label 2 in the last column of the table. From this, we can see that the motivation by which the word for "yesterday" is composed in these languages is similar to the one we observe in English, where we also find the word day being a part of the word yester-day.

    If we want to know how the word "yesterday" was expressed in the ancestor of the Burmish languages, we could make an abstract estimation based on the lexical material we have at hand. We might assume that it was also a compound word, given the importance of compounding in all living Burmish languages. We could further hypothesize that one part of the ancient compound would have been the original word for "day". We could even make a guess and say the word was similar in structure to the words in Bola and Lashi (although it is difficult to find a justification for doing this). In all these cases, we would propose a lexical reconstruction for the word for "yesterday" in Proto-Burmish. We would make an assumption with respect to what one could call the denotation structure or the motivation structure, as we called it in Hill and List (2017: 67). This assumption would not need to provide an actual pronunciation of the word; it could be proposed entirely independently.

    If we want to reconstruct the pronunciation of the ancient word for "yesterday" as well, we have to compare the corresponding sounds, and build a phonological reconstruction for each of the morphemes separately. As a matter of fact, scholars working on South-East Asian languages rarely propose a full lexical reconstruction as part of their reconstruction systems (for a rare exception, see Mann 1998). Instead, they pick the homologous morphemes from their word comparisons, assign some rough meaning to them (this step would be called semantic reconstruction), and then propose an ancient pronunciation based on the correspondence patterns they observe.

    When listing phonological reconstruction as one of my ten problems, I am deliberately distinguishing this task from the tasks of lexical reconstruction or semantic reconstruction, since they can (and probably should) be carried out independently. Furthermore, by describing the pronunciation of the morphemes as "hypothetical pronunciations" in the ancestral language, I want not only to emphasize that all reconstruction is hypothetical, but also to point to the fact that it is very possible that some of the morphemes for which one proposes a proto-form may not even have existed in the proto-language. They could have evolved only later as innovations on certain branches in the history of the languages. For the task of phonological reconstruction, however, this would not matter, since the question of whether a morpheme existed in the most recent common ancestor becomes relevant only if one tries to reconstruct the lexicon of a given proto-language. But phonological reconstruction seeks to reconstruct its phonology, i.e. the sound inventory of the proto-language, and the rules by which these sounds could be combined to form morphemes (phonotactics).

    Why phonological reconstruction is hard

    That phonological reconstruction is hard should not be surprising. What the task entails is to find the most probable pronunciation for a bunch of morphemes in a language for which no written records exist. Imagine that, as a biologist, you wanted to find the DNA of LUCA, not even in its folded form with all of the pieces in place, but just a couple of chunks, in order to get a better picture of what this LUCA might have looked like. But while we can employ some weak version of uniformitarianism when trying to reconstruct at least some genes of our LUCA (we would still assume that it was using some kind of DNA, drawn from the typical alphabet of DNA letters), we face the specific problem in linguistics that we cannot even be sure about the letters.

    Only recently, Blasi et al. (2019) argued that sounds like f and v may have evolved later than the other sounds we can find in the languages of the world, driven by post-Neolithic changes in the bite configuration, which seem to depend on what we eat. As a rule, and independent of these findings, linguists do not tend to reconstruct an f for the proto-language in those cases where they find it corresponding to a p, since we know that in almost all known cases a p can evolve into an f, but an f almost never becomes a p again. This can lead to the strange situation where some linguists reconstruct a p for a given proto-language even though all descendants show an f, which is, of course, an exaggeration of the principle (see Guillaume Jacques' discussion on this problem).

    But the very idea that we may have good reasons to reconstruct something in our ancestral language that has been lost in all descendant languages is something completely normal for linguists. In 1879, for example, Ferdinand de Saussure (Saussure 1879) used internal and comparative evidence to propose the existence of what he called coefficients sonantiques in Proto-Indo-European. His proposal included the prediction that — if ever a language was found that retained these elements — these new sounds would surface as segmental elements, as distinctive sounds, in certain cognate sets, where all known Indo-European languages had already lost the contrast.

    These sounds are nowadays known as laryngeals (*h1, *h2, *h3, see Meier-Brügger 2002), and when Hittite was identified as an Indo-European language (Hrozný 1915), one of the two sounds predicted by Saussure could indeed be identified. I have discussed before on this blog the problem of unattested character states in historical linguistics, so there is no need to go into further detail. What I want to emphasize is that this aspect of linguistic reconstruction in general, and phonological reconstruction specifically, is one of the many points that makes the task really hard, since any algorithm to reconstruct the phonological system of some proto-language would have to find a way to formalize the complicated arguments by which linguists infer that there are traces of something that is no longer there.

    There are many more things that I could mention if I wanted to identify the difficulty of phonological reconstruction in its entirety. What I find most difficult to deal with is that the methodology is insufficiently formalized. Linguists have their success stories, which helped them to predict certain aspects of a given proto-language that could later be confirmed, and it is due to these success stories that we are confident that it can, in principle, be done. But the methodological literature is sparse, and the rare cases where scholars have tried to formalize it are rarely discussed when it comes to evaluating concrete proposals (as an example of an attempt at formalization, see Hoenigswald 1960). Before this post becomes too long, I will therefore conclude by noting that scholars usually have a pretty good idea of how they should perform their phonological reconstructions, but that this knowledge of how one should reconstruct a proto-language is usually not seen as something that could be formalized completely.

    Traditional strategies for phonological reconstruction

    Given the lack of methodological literature on phonological reconstruction, it is not easy to describe how it should be done in an ideal scenario. What seems to me to be the most promising approach is to start from correspondence patterns. A correspondence pattern is an abstraction from individual alignment sites distributed over cognate sets drawn from related languages. As I have tried to show in a paper published earlier this year (List 2019), a correspondence pattern summarizes individual alignment sites in an abstract form, where missing data are imputed. I will avoid going into the details here but, as a shortcut, we can say that each correspondence pattern should, in theory, only correspond to one proto-sound in the language, although the same proto-sound may correspond to more than one correspondence pattern. As an example, consider the following table, showing three (fictive) patterns that would all be reconstructed by a *p.

     Proto-Form   L₁   L₂   L₃
     *p           p    p    f
     *p           p    p    p
     *p           b    p    p

    To justify that the same proto-sound *p is reconstructed in all three patterns, linguists invoke the role of context, by looking at the real words from which the pattern was derived. An example is shown in the next table.


     Proto-Form   L₁        L₂        L₃
     *p i a ŋ     p i a ŋ   p i u ŋ   f a n
     *p a t       p a t     p a t     p a t
     *a p a ŋ     a b a ŋ   a p a ŋ   a p a n

    What you should be able to see from the table is that we can find in all three patterns a conditioning factor that allows us to assume that the deviation from the original *p is secondary. In language L₃, the factor can be found in the palatal environment (followed by the front vowel *i) that we find in the ancestral language. We would assume that this environment triggered the change from *p to f in this language. In the case of the change from *p to b in L₁, the triggering environment is that the p occurs inter-vocalically.

    To summarize: what linguists usually do in order to reconstruct proto-forms for ancestral languages that are not attested in written sources, is to investigate the correspondence patterns, and to try to find some neat explanation of how they could have evolved, given a set of proto-forms along with triggering contexts that explain individual changes in individual descendant languages.
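    A minimal sketch of the first step of this procedure may help: from aligned cognate sets, every alignment column yields a tuple of corresponding sounds across the languages, and recurring tuples are the correspondence patterns that one then tries to explain with a single proto-sound plus conditioning contexts. The alignments below are simplified toy data, loosely based on the tables above; real methods (such as List 2019) additionally have to handle missing data.

```python
from collections import Counter

# Toy aligned cognate sets for three languages; '-' marks a gap.
# The alignments are simplified toy data, loosely based on the tables above.
LANGUAGES = ("L1", "L2", "L3")
ALIGNMENTS = [
    (("p", "i", "a", "ŋ"), ("p", "i", "u", "ŋ"), ("f", "a", "-", "n")),
    (("p", "a", "t"),      ("p", "a", "t"),      ("p", "a", "t")),
    (("a", "b", "a", "ŋ"), ("a", "p", "a", "ŋ"), ("a", "p", "a", "n")),
]

# Every alignment column is one instance of a sound correspondence;
# recurring columns are the correspondence patterns to be explained.
patterns = Counter()
for alignment in ALIGNMENTS:
    for column in zip(*alignment):
        patterns[column] += 1

for pattern, count in patterns.most_common():
    print(dict(zip(LANGUAGES, pattern)), count)
```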

    Computational strategies for phonological reconstruction

    Not many attempts have been made so far to automate the task of reconstruction. The most prominent proposal in this direction has been made by Bouchard-Côté et al. (2013). Their strategy radically differs from the strategy outlined above, since they do not make use of correspondence patterns, but instead use a stochastic transducer and known cognate words in the descendant languages, along with a known phylogenetic tree that they traverse, inferring the most likely changes that could explain the observed distribution of cognate sets.

    So far, this method has been tested only on Austronesian languages and their subgroups, where it performed particularly well (with error rates between 0.25 and 0.12, using edit distance as the evaluation measure). Since it is not available as a software package that can be conveniently used and tested on other language families, it is difficult to tell how well it would perform when being presented with more challenging test cases.

    In a forthcoming paper, Gerhard Jäger illustrates how classical methods for ancestral state reconstruction applied to aligned cognate sets could be used for the same task (Jäger forthcoming). While Jäger's method is more in line with "linguistic thinking", in so far as he uses alignments, and applies ancestral state reconstructions to each column of the alignments, it does not make use of correspondence patterns, which would be the general way by which linguists would proceed. This may also explain the performance, which shows an error rate of 0.48 (also using edit distance for evaluation) — although this is also due to the fact that the method was tested on Romance languages and compared with Latin, which is believed to be older than the ancestor of all Romance languages.

    Problems with computational strategies for phonological reconstruction

    Both the method of Bouchard-Côté et al. and the approach of Jäger suffer from the problem of not being able to detect unobserved sounds in the data. Jäger side-steps this problem in theory, by using a shortened alphabet of only 40 characters, proposed by the ASJP project, which encoded more than half of the world's languages in this form. Bouchard-Côté's test data, Proto-Austronesian (and its subgroups), are fairly simple in this regard. It would therefore be interesting to see what would happen if the methods are tested with full phonetic (or phonological) representations of more challenging language families (for example, the Chinese dialects). While Jäger's approach assumes the independence of all alignment sites, Bouchard-Côté's stochastic transducers handle context on the level of bigrams (if I read their description properly). However, while bigrams can be seen as an improvement over ignoring conditioning context, they are not the way in which context is typically handled by linguists. As I tried to explain briefly in last month's post, context in historical linguistics calls for a handling of abstract contexts, for example, by treating sequences as layered entities, similar to music scores.

    Apart from the handling of context and unobserved characters, the evaluation measure used in both approaches seems also problematic. Both approaches used the edit distance (Levenshtein 1965), which is equivalent to the Hamming distance (Hamming 1950) applied to aligned sequences. Given the problem of unobserved characters and the abstract nature of linguistic reconstruction systems, however, any measure that evaluates the surface similarity of sequences is essentially wrong.

    To illustrate this point, consider the reconstruction of the Indo-European word for sheep by Kortlandt (2007), who gives *ʕʷ e u i s, as compared to Lühr (2008), who gives *h₂ ó w i s. The normalized edit distance between both systems is the Hamming distance of their (trivial) alignment: in three of five cases they differ, which makes for an unnormalized edit distance of three, and a normalized edit distance of 0.6. While this is pretty high, the two systems are mostly compatible, since Kortlandt reconstructs *ʕʷ in most cases where Lühr writes *h₂. Therefore, the distance should be much lower; in fact, it should be zero, since both authors agree on the structure of the form they reconstruct in comparison with the structure of other words they reconstruct for Proto-Indo-European.
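    The arithmetic of this example can be reproduced in a few lines; the point is simply that a surface measure like the normalized Hamming distance over the aligned reconstructions yields 0.6, although the two systems are largely notational variants of each other.

```python
def hamming_distance(seq_a, seq_b, normalized=True):
    """Hamming distance between two aligned, equally long sequences."""
    assert len(seq_a) == len(seq_b)
    mismatches = sum(a != b for a, b in zip(seq_a, seq_b))
    return mismatches / len(seq_a) if normalized else mismatches

kortlandt = ["ʕʷ", "e", "u", "i", "s"]   # Kortlandt (2007)
luehr     = ["h₂", "ó", "w", "i", "s"]   # Lühr (2008)

print(hamming_distance(kortlandt, luehr))  # 0.6
```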

    Since scholars do not necessarily select phonetic values in their reconstructions that derive directly from the descendant languages, and moreover they may differ often regarding the details of the phonetic values they propose, a valid evaluation of different reconstruction systems (including automatically derived ones) needs to compare the structure of the systems, not their substance (see List 2014: 48-50 for a discussion of structural and substantial differences between sequences).

    Currently, there is (to my knowledge) no accepted solution for the comparison of structural differences among aligned sequences. Finding an adequate evaluation measure to compare reconstruction systems can therefore be seen as a sub-problem of the bigger problem of phonological reconstruction. To illustrate why it is so important to compare the structural information and not the pure substance, consider the three cases in which Jäger's reconstruction gives a v as opposed to a w in Latin (data here): while evaluating by the edit distance yields a score of 0.48, this score will drop to 0.47 when replacing the v instances with a w. Jäger's system is doing something right, but the edit distance cannot capture the fact that the system is deviating systematically from Latin, not randomly.

    Initial ideas for improvement

    There are many things that we can easily improve when working on automatic methods for phonological reconstruction.

    As a first point, we should work on enhanced measures of evaluation, going beyond the edit distance as our main evaluation measure. In fact, this can easily be done. With B-Cubed scores (Amigó et al. 2009), we already have a straightforward measure to compare whether two reconstruction systems are structurally identical or similar. In order to apply these scores, the automatic reconstructions have to be aligned with the gold standard. If they are identical, although the symbols may differ, then the scores will indicate this. The problem of comparing reconstruction systems is, of course, more difficult, as we can face cases where systems are not structurally identical (structural identity meaning that any symbol a in system A can directly be replaced by a corresponding symbol a' in system B to produce B from A, and vice versa), but the B-Cubed scores would be a start.
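    To illustrate how B-Cubed scores capture structural rather than substantial identity, here is a small sketch, assuming that the automatic reconstruction and the gold standard have already been aligned position by position: each system is treated as a labeling of the aligned positions, and B-Cubed precision and recall only check whether the two labelings group the positions in the same way. Two systems that differ only by a consistent relabeling (say, *w written as *v throughout) thus receive a perfect score, while an inconsistent relabeling is penalized. The data are invented for illustration.

```python
def bcubed(labels_test, labels_gold):
    """B-Cubed precision, recall, and F-score for two labelings of the
    same items (here: positions in aligned reconstructions)."""
    assert len(labels_test) == len(labels_gold)
    n = len(labels_test)
    precision = recall = 0.0
    for i in range(n):
        same_test = {j for j in range(n) if labels_test[j] == labels_test[i]}
        same_gold = {j for j in range(n) if labels_gold[j] == labels_gold[i]}
        shared = len(same_test & same_gold)
        precision += shared / len(same_test)
        recall += shared / len(same_gold)
    precision, recall = precision / n, recall / n
    fscore = 2 * precision * recall / (precision + recall)
    return precision, recall, fscore

gold       = ["w", "i", "w", "a", "s"]
consistent = ["v", "i", "v", "a", "s"]   # always writes v for w
mixed      = ["v", "i", "w", "a", "s"]   # splits the class of gold *w

print(bcubed(consistent, gold))  # (1.0, 1.0, 1.0)
print(bcubed(mixed, gold))       # recall drops below 1.0
```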

    Furthermore, given that we lack test cases, we might want to work on semi-automatic instead of fully automatic methods, in the meantime. Given that we have a first method to infer sound correspondence patterns from aligned data (List 2019), we can infer all patterns and have linguists annotate each pattern by providing the proto-sound they think would fit best — we are testing this at the moment. Having created enough datasets in this form, we could then think of discussing concrete algorithms that would derive proto-forms from correspondence patterns, and use the semi-automatically created and manually corrected data as gold standard.

    Last but not least, one straightforward way by which it is possible to formally create unknown sounds from known data is to represent sounds as vectors of phonological features instead of bare symbols (e.g. representing p as a voiceless bilabial plosive and b as a voiced bilabial plosive). If we then compare alignment sites or correspondence patterns for the feature vectors, we could check to what degree standard algorithms for ancestral state reconstruction propose unattested sounds similar to the ones proposed by experts. In order to do this, we would need to encode our data in transparent transcription systems. This is not the case for most current datasets, but with the Cross-Linguistic Transcription Systems initiative we already have a first attempt to provide features for the majority of sounds that we find in the languages of the world (Anderson et al. forthcoming).
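    A minimal sketch of the feature-vector idea: sounds are represented as bundles of phonological features, an ancestral value is inferred feature by feature (here simply by majority vote, which is a gross simplification of real ancestral state reconstruction on a phylogeny), and the resulting bundle may correspond to a sound that is attested in none of the reflexes. The feature specifications below are rough and serve illustration only.

```python
from collections import Counter

# Rough feature bundles (voice, place, manner) for a handful of sounds;
# the specifications are simplified and for illustration only.
FEATURES = {
    "p": ("voiceless", "bilabial",    "plosive"),
    "b": ("voiced",    "bilabial",    "plosive"),
    "f": ("voiceless", "labiodental", "fricative"),
    "v": ("voiced",    "labiodental", "fricative"),
    "β": ("voiced",    "bilabial",    "fricative"),
}
SOUND_BY_FEATURES = {features: sound for sound, features in FEATURES.items()}

def reconstruct(pattern):
    """Infer a proto-sound for a correspondence pattern feature by feature
    via majority vote (a gross simplification of ancestral state
    reconstruction)."""
    columns = zip(*(FEATURES[sound] for sound in pattern))
    proto = tuple(Counter(column).most_common(1)[0][0] for column in columns)
    return proto, SOUND_BY_FEATURES.get(proto, "unattested sound")

# The majority bundle 'voiceless bilabial fricative' (IPA ɸ) is attested in
# none of the reflexes, yet it falls out of the feature-wise comparison:
print(reconstruct(["p", "f", "β"]))
```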

    Outlook

    It is difficult to tell how hard the problem of phonological reconstruction is in the end. Semi-automatic solutions are already feasible now, and we are currently testing them on different (smaller) groups of phylogenetically related languages. One crucial step in the future is to code up enough data to allow for a rigorous testing of the few automatic solutions that have been proposed so far. We are working on that as well. But how to design an evaluation system that rigorously tests not only to what degree a given reconstruction is identical to a given gold standard, but also to what degree it is structurally equivalent, remains one of the crucial open problems in this regard.

    References
      Amigó, Enrique and Gonzalo, Julio and Artiles, Javier and Verdejo, Felisa (2009) A comparison of extrinsic clustering evaluation metrics based on formal constraints. Information Retrieval 12.4: 461-486.

      Anderson, Cormac, Tresoldi, Tiago, Chacon, Thiago Costa, Fehn, Anne-Maria, Walworth, Mary, Forkel, Robert and List, Johann-Mattis (forthcoming) A Cross-Linguistic Database of Phonetic Transcription Systems. Yearbook of the Poznań Linguistic Meeting, pp. 1-27.

      Blasi, Damián E. , Steven Moran, Scott R. Moisik, Paul Widmer, Dan Dediu and Balthasar Bickel (2019) Human sound systems are shaped by post-Neolithic changes in bite configuration. Science 363.1192: 1-10.

      Bouchard-Côté, Alexandre and Hall, David and Griffiths, Thomas L. and Klein, Dan (2013) Automated reconstruction of ancient languages using probabilistic models of sound change. Proceedings of the National Academy of Sciences 110.11: 4224–4229.

      Fox, Anthony (1995) Linguistic Reconstruction: An Introduction to Theory and Method. Oxford: Oxford University Press.

      Hamming, Richard W. (1950) Error detecting and error correcting codes. Bell System Technical Journal 29.2: 147–160.

      Hill, Nathan W. and List, Johann-Mattis (2017) Challenges of annotation and analysis in computer-assisted language comparison: a case study on Burmish languages. Yearbook of the Poznań Linguistic Meeting 3.1: 47–76.

      Hoenigswald, Henry M. (1960) Phonetic similarity in internal reconstruction. Language 36.2: 191-192.

      Hrozný, Bedřich (1915) Die Lösung des hethitischen Problems [The solution of the Hittite problem]. Mitteilungen der Deutschen Orient-Gesellschaft 56: 17–50.

      Jäger, Gerhard (forthcoming) Computational historical linguistics. Theoretical Linguistics.

      Kortlandt, Frederik (2007) For Bernard Comrie.

      Levenshtein, V. I. (1965) Dvoičnye kody s ispravleniem vypadenij, vstavok i zameščenij simvolov [Binary codes with correction of deletions, insertions and replacements]. Doklady Akademij Nauk SSSR 163.4: 845-848.

      List, Johann-Mattis (2014) Sequence Comparison in Historical Linguistics. Düsseldorf: Düsseldorf University Press.

      List, Johann-Mattis (2019) Automatic inference of sound correspondence patterns across multiple languages. Computational Linguistics 1.45: 137-161.

      Lühr, Rosemarie (2008) Von Berthold Delbrück bis Ferdinand Sommer: Die Herausbildung der Indogermanistik in Jena. Vortrag im Rahmen einer Ringvorlesung zur Geschichte der Altertumswissenschaften (09.01.2008, FSU-Jena).

      Mann, Noel Walter (1998) A Phonological Reconstruction of Proto Northern Burmic. The University of Texas: Arlington.

      Meier-Brügger, Michael (2002) Indogermanische Sprachwissenschaft. Berlin and New York: de Gruyter.

      Saussure, Ferdinand de (1879) Mémoire sur le Système Primitif des Voyelles dans les Langues Indo-Européennes. Leipzig: Teubner.

      Automatic sound law induction (Open problems in computational diversity linguistics 3)


      The third problem in my list of ten open problems in computational diversity linguistics is a problem that has (to my knowledge) not even been considered as a true problem in computational historical linguistics, so far. Until now, it has been discussed by colleagues only indirectly. This problem, which I call the automatic induction of sound laws, can be described as follows:
      Starting from a list of words in a proto-language and their reflexes in a descendant language, try to find the rules by which the ancestral language is converted into the descendant language.
      Note that by "rules", in this context, I mean the classical notation that phonologists and historical linguists use in order to convert a source sound into a target sound in a specific environment (see Hall 2000: 73-75). If we consider the following ancestral and descendant words from a fictive language, we can easily find the laws by which the input should be converted into the output: an a should be changed to an e, an e should be changed to an i, and a k changes to s if followed by an i but not if followed by an a.

      Input    Output
      papa     pepe
      mama     meme
      kaka     keke
      keke     sisi

      Short excursus on linguistic notation of sound laws

      Based on the general idea of sound change (or sound laws in classical historical linguistics) as some kind of function by which a source sound is taken as input and turned into a target sound as output, linguists use a specific notation system for sound laws. In the simplest form of the classical sound law notation, this process is described in the form s > t, where s is the source sound and t is the target sound. Since sound change often relies on the specific conditions of the surrounding context — i.e. it makes a difference whether a sound occurs at the beginning or the end of a word — context is added as a condition separated by a /, with an underscore _ referring to the sound in its original phonetic environment. Thus, the phenomenon of voiced stops becoming unvoiced at the end of words in German (e.g. d becoming t) can be written as d > t / _$, where $ denotes the end of a word.

      One can see how close this notation comes to regular expressions, and, according to many scholars, the rules by which languages change with respect to their sound systems do not exceed the complexity of regular grammars. Nevertheless, sound change notation does differ in scope and in the rules for annotation. One notable difference is the possibility to state how full classes of sounds change in a specific environment. The German rule of devoicing, for example, generally affects all voiced stops at the end of a word. As a result, one could also annotate it as G > K / _$, where G would denote the sounds [b, d, g] and K their counterparts [p, t, k]. Although we could easily write a single rule for each of the three phenomena here, the rule by which the sounds are grouped into two classes of voiced sounds and their unvoiced counterparts is linguistically more interesting, since it reminds us that the change by which word-final consonants lose the feature of voice is a systemic change, and not a phenomenon applying to some random selection of sounds in a given language.
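      As a rough illustration of how such a class-based rule can be turned into a regular expression, consider the following Python sketch; the sound classes and example words are made up, and the words are given in a crude phonemic spelling.

          import re

          SOUND_CLASSES = {
              "G": ["b", "d", "g"],  # voiced stops
              "K": ["p", "t", "k"],  # their voiceless counterparts
          }

          def final_devoicing(word):
              """Apply G > K / _$ : replace a word-final voiced stop by its
              voiceless counterpart."""
              mapping = dict(zip(SOUND_CLASSES["G"], SOUND_CLASSES["K"]))
              return re.sub(r"[bdg]$", lambda m: mapping[m.group(0)], word)

          # made-up words in a crude phonemic spelling
          for word in ["hund", "tag", "lob", "hunde"]:
              print(word, "->", final_devoicing(word))
          # hund -> hunt, tag -> tak, lob -> lop, hunde -> hunde (d is not word-final)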

      The problem with this systemic annotation, however, is that the grouping of sounds into classes that change in a similar form is often language-specific. As a result, scholars have to propose new groupings whenever they deal with another language. Since neither the notation of sound values nor the symbols used to group sounds into classes are standardized, it is extremely difficult to compare different proposals made in the literature. As a result, any attempt to solve the problem of automatic sound law induction in historical linguistics would at the same time have to make strict proposals for a standardization of the sound law notations used in our field. Standardization can thus be seen as the first major obstacle to solving this problem, with the problem of accounting for systemic aspects of sound change as the second one.

      Beyond regular expressions

      Even if we put the problem of inconsistent annotation and systemic changes to one side, the analogy with regular expressions cannot properly handle all aspects of sound change. When looking at the change from Middle Chinese to Mandarin Chinese, for example, we find a complex pattern, by which originally voiced sounds, like [b, d, g, dz] (among others), were either devoiced, becoming [p, t, k, ts], or devoiced and aspirated, becoming [pʰ, tʰ, kʰ, tsʰ]. While it is not uncommon that one sound can change into two variants, depending on the context in which it occurs, the Mandarin sound change in this case is interesting because the context is not a neighboring sound, but is instead the Middle Chinese tone of the syllable in question: in syllables with a flat tone (called píng tone in classical terminology) these sounds are nowadays voiceless and aspirated, and in syllables with one of the three remaining Middle Chinese tones (called shǎng, qù, and rù) they are nowadays plain voiceless (see List 2019: 157 for examples).

      Since tone is a feature that applies to whole syllables, and not to single sound segments, we are dealing with so-called supra-segmental features here. As the term supra-segmental indicates, the features in question cannot be represented as part of a sequence of sounds, but need to be thought of as an additional layer, similar to other supra-segmental features in language, such as stress or juncture (which indicates word or morpheme boundaries).

      In contrast to sequences as we meet them in mathematics and informatics, linguistic sound sequences do not consist solely of letters drawn from an alphabet and lined up in some unique order. They are instead often composed of multiple layers, which are in part hierarchically ordered. Words, morphemes, and phrases in linguistics are thus multi-layered constructs, which cannot be represented by one sequence alone, but could more fruitfully be thought of as analogous to a partitura in music — the score of a piece of orchestral music, in which every voice of the orchestra is given its own sequence of sounds, and all the different sequences are aligned with each other to form a whole.

      [Figure: The multi-layered character of sound sequences can be seen as similar to a partitura in musical notation.]

      This multi-layered character of sound sequences in spoken languages constitutes a third complication for the task of automatic sound law induction. Finding the individual laws that turn one stage of a language into a later stage cannot (always) be trivially reduced to the task of finding the finite-state transducer that translates a set of input strings into a corresponding set of output strings. Since our input word forms in the proto-language are not simple strings, but rather alignments of the different layers of a word form, a method to induce sound laws needs to be able to handle the multi-layered character of linguistic sequences.

      Background for computational approaches to sound law induction

      To my knowledge, the question of how to induce sound laws from data on proto- and descendant languages has barely been addressed. What comes closest to the problem are attempts to model sound change from known ancestral languages, such as Latin, to daughter languages, such as Spanish. This is reflected, for example, in the PHONO program (Hartmann 2003), where one can insert data for a proto-language along with a set of sound change rules (provided in a similar form to that mentioned above), which need to be given in a specific order, and are then checked to see whether they correctly predict the descendant forms.

      For teaching purposes, I adapted a JavaScript version of a similar system, the Sound Change Applier² (http://www.zompist.com/sca2.html) by Mark Rosenfelder from 2012, in which students could try to turn Old High German into modern German by supplying simple rules of the kind traditionally used to describe sound change processes in the linguistic literature. This adaptation (which can be found at http://dighl.github.io/sound_change/SoundChanger.html) compares the attested output with the output generated by a given set of rules, and provides some assessment of the general accuracy of the proposed rule set. For example, when feeding the system the simple rule an > en / _#, which turns all final instances of -an into -en, 54 out of 517 Old High German words will yield the expected output in modern Standard German.
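      The logic behind this kind of assessment can be sketched in a few lines of Python; the rules and word pairs below are invented for illustration and do not reproduce the Old High German data of the teaching tool.

          import re

          # an ordered list of rules, each a (pattern, replacement) pair; here only
          # the single rule an > en / _# from the example above
          RULES = [
              (r"an$", "en"),
          ]

          def apply_rules(word, rules):
              for pattern, replacement in rules:
                  word = re.sub(pattern, replacement, word)
              return word

          def accuracy(pairs, rules):
              """Proportion of descendant forms predicted correctly by the rule set."""
              hits = sum(1 for source, target in pairs if apply_rules(source, rules) == target)
              return hits / len(pairs)

          # invented source-target pairs, standing in for Old High German and German
          data = [("geban", "geben"), ("neman", "nemen"), ("fisk", "fisch")]
          print(accuracy(data, RULES))  # 0.666...: two out of three words come out right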

      The problem with these endeavors is, of course, the handling of exceptions, along with the comparison of different proposals. Since we can think of an infinite number of rule sets by which we could successfully turn a certain number of Old High German strings into Standard German strings, we need to ask ourselves how we could evaluate different proposals. That some kind of parsimony should play a role here is obvious. However, it is by no means clear (at least to me) how to evaluate the complexity of two systems, since the complexity would not only be reflected in the number of rules, but also in the initial grouping of sounds into classes, which is commonly used to account for systemic aspects of sound change. A system addressing the problem of sound law induction would try to automate the task of finding such a set of rules. The fact that it is difficult even to compare two or more proposals based on human assessment further illustrates why I think that the problem is not trivial.

      Another class of approaches is that of word prediction experiments, such as the one by Ciobanu and Dinu (2018) (but see also Bodt and List 2019), in which training data consisting of the source and the target language are used to create a model, which is then successively applied to new data, in order to test how well this model predicts target words from the source words. Since the model itself is not reported in these experiments, but only used in the form of a black box to predict new words, the task cannot be considered to be the same as the task for sound law induction — which I propose as one of my ten challenges for computational historical linguistics — given that we are interested in a method that explicitly returns the model, in order to allow linguists to inspect it.

      Problems with the current solutions to sound law induction

      Given that no real solutions to the problem exist up to now, it seems somewhat pointless to discuss the problems of current solutions. What I want to mention in this context, however, are the problems of the solutions presented for word prediction experiments, be they fed by manual data on sound changes (Hartmann 2003) or based on inference procedures (Ciobanu and Dinu 2018, Dekker 2018). Manual solutions like PHONO suffer from the fact that they are tedious to apply, given that linguists have to present all sound changes in their data in an ordered fashion, with the program converting them step by step, always turning the whole input sequence into an intermediate output sequence. The word prediction approaches, in turn, suffer from limitations in feature design.

      The method by Ciobanu and Dinu (2018), for example, is based on orthographic data alone, using the Needleman-Wunsch algorithm for sequence alignment (Needleman and Wunsch 1970), and the approach by Dekker (2018) only allows for the use of the limited alphabet of 40 symbols proposed by the ASJP project (Holman et al. 2008). In addition to this limited representation of linguistic sound sequences, be it by resorting to abstract orthography or to abstract, reduced phonetic alphabets, none of the methods can handle the kinds of contexts that result from the multi-layered character of speech. Since we know well that these aspects are vital for certain phenomena of sound change, the methods exclude from the beginning an aspect that traditional historical linguists, who might be interested in an automatic solution to the sound law induction problem, would put at the top of their wish-list of what the algorithm should be able to handle.

      Why is automatic sound law induction difficult?

      The handling of supra-segmental contexts, mentioned above, is in my opinion also the reason why sound law induction is so difficult, not only for machines, but also for humans. I have so far mentioned three major reasons why I think sound law induction is difficult. First, we face problems in defining the task properly in historical linguistics, due to a significant lack of standardization. This makes it difficult to decide on the exact output of a method for sound law induction. Second, we have problems in handling the systemic aspects of sound change properly. This applies not only to automatic approaches, but also to the evaluation of different proposals for the same data made by humans. Third, the multi-layered character of speech requires an enhanced modeling of linguistic sequences, which cannot be modeled as mono-dimensional strings alone, but should rather be seen as alignments of different strings representing different layers (tonal layer, stress layer, sound layer, etc.).

      How humans detect sound laws

      There are only a few examples in the literature where scholars have tried to provide detailed lists of sound changes from a proto-language to a descendant language (Baxter 1992, Newman 1999). Individual sound laws proposed in the literature are rarely even tested exhaustively against the data. As a result, it is difficult to assess what humans usually do in order to detect sound laws. What is clear is that historical linguists who have worked a lot on linguistic reconstruction tend to acquire a very good intuition that helps them to quickly check sound laws applied to word forms in their head, and to derive the output forms. This ability is developed in a learning-by-doing fashion, with no specific techniques ever being discussed in the classroom, which reflects the general tendency in historical linguistics to trust that students will, sooner or later, learn from examples how to become good linguists (Schwink 1994: 29). For this reason, it is difficult to take inspiration from current practice in historical linguistics in order to develop computer-assisted approaches to solve this task.

      Potential solutions to the problem

      What can we do in order to address the problem of sound law induction in automatic frameworks in the future?

      As a first step, we would have to standardize the notation system that we use to represent sound changes. This would need to come along with a standardized phonetic transcription system. Scholars often think that phonetic transcription is standardized in linguistics, specifically due to the use of the International Phonetic Alphabet. As our investigations into the actual application of the IPA have shown, however, the IPA cannot be seen as a standard, but rather as a set of recommendations that are often only loosely followed by linguists. First attempts to standardize phonetic transcription systems for the purpose of cross-linguistic applications have, however, been made, and will hopefully gain more acceptance in the future (Anderson et al. forthcoming, https://clts.clld.org).

      As a second step, we should invest more time in investigating the systemic aspects of language change cross-linguistically. What I consider important in this context is the notion of distinctive features by which linguists try to group sounds into classes. Since feature systems proposed by linguists differ greatly, with some debate as to whether features are innate and the same for all languages, or instead language-specific (see Mielke 2008 for an overview on the problem), a first step would again consist of making the data comparable, rather than trying to decide in favour of one of the numerous proposals in the literature.

      As a third step, we need to work on ways to account for the multi-layered aspect of sound sequences. Here, a first proposal, labelled "multi-tiered sequence representation", has already been made by myself (List and Chacon 2015), based on an idea that I had already used for the phonetic alignment algorithm proposed in my dissertation (List 2014), which itself goes back to the handling of hydrophilic sequences in ClustalW (Thompson et al. 1994). The idea is to define a sound sequence as a sequence of vectors, with each vector dimension (called a tier) representing one distinct aspect of the original word. Since context in this representation just consists of an arbitrary number of vector dimensions, which can account for aspects such as tone, stress, or preceding and following sounds, it allows for an extremely flexible modeling of context, and it would allow us to treat words as sequences of sounds while at the same time accounting for their multi-layered structure. Although many aspects of how to exploit this model of phonetic sequences to induce sound laws from ancestor-descendant data remain unsolved, I consider it to be a first step in the direction of a solution to the problem.

      [Figure: Multi-tiered sequence representation for a fictive word in Middle Chinese.]
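      A very rough sketch of the idea, with invented tiers and values that do not reproduce the representation of List and Chacon (2015) in detail, might look as follows.

          # the word is a sequence of positions; each tier adds one value per position
          word = {
              "segments":  ["d", "a", "ŋ"],
              "tone":      ["²¹", "²¹", "²¹"],   # the (fictive) syllable tone, copied to every position
              "structure": ["initial", "nucleus", "coda"],
              "preceding": ["#", "d", "a"],      # left context as a tier of its own
          }

          # a "sound" is then the vector of tier values at one position
          position = 0
          print({tier: values[position] for tier, values in word.items()})
          # {'segments': 'd', 'tone': '²¹', 'structure': 'initial', 'preceding': '#'}

          # a conditioned sound law is simply a condition over some of these tiers,
          # e.g. "voiced initials in syllables carrying the flat tone"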

      Outlook

      Although it is not necessarily recognized by the field as a real problem of historical linguistics, I consider the problem of automatic sound law induction to be a very important one for our field. If we could infer sound laws from a set of proposed proto-forms and a set of descendant forms, we could use them to test the quality of the proto-forms themselves, by inspecting the sound laws proposed by a given system. We could also compare sound laws across different language families to see whether we find cross-linguistic tendencies.

      Having inferred enough cross-linguistic data on sound laws, represented in unified models of sound law notation, we could also use the rules to search for cognate words that have so far been ignored. There is a lot to do, however, before we reach this point. Starting to think about automatic, and also manual, induction of sound laws as a specific task in computational historical linguistics can be seen as a first step in this direction.

      References
        Anderson, Cormac and Tresoldi, Tiago and Chacon, Thiago Costa and Fehn, Anne-Maria and Walworth, Mary and Forkel, Robert and List, Johann-Mattis (forthcoming) A Cross-Linguistic Database of Phonetic Transcription Systems. Yearbook of the Poznań Linguistic Meeting, pp 1-27.

        Baxter, William H. (1992) A handbook of Old Chinese Phonology. Berlin: de Gruyter.

        Bodt, Timotheus A. and List, Johann-Mattis (2019) Testing the predictive strength of the comparative method: An ongoing experiment on unattested words in Western Kho-Bwa languages. 1-22. [Preprint, under review, not peer-reviewed]

        Ciobanu, Alina Maria and Dinu, Liviu P. (2018) Simulating language evolution: A tool for historical linguistics. In: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pp 68-72.

        Dekker, Peter (2018) Reconstructing Language Ancestry by Performing Word Prediction with Neural Networks. University of Amsterdam: Amsterdam.

        Hall, T. Alan (2000) Phonologie: Eine Einführung. Berlin and New York: de Gruyter.

        Hartmann, Lee (2003) Phono. Software for modeling regular historical sound change. In: Actas VIII Simposio Internacional de Comunicación Social. Southern Illinois University, pp 606-609.

        Holman, Eric W. and Wichmann, Søren and Brown, Cecil H. and Velupillai, Viveka and Müller, André and Bakker, Dik (2008) Explorations in automated lexicostatistics. Folia Linguistica 20.3: 116-121.

        List, Johann-Mattis (2014) Sequence Comparison in Historical Linguistics. Düsseldorf: Düsseldorf University Press.

        List, Johann-Mattis and Chacon, Thiago (2015) Towards a cross-linguistic database for historical phonology? A proposal for a machine-readable modeling of phonetic context. Paper, presented at the workshop Historical Phonology and Phonological Theory [organized as part of the 48th annual meeting of the SLE] (2015/09/04, Leiden, Societas Linguistica Europaea).

        List, Johann-Mattis (2019) Automatic inference of sound correspondence patterns across multiple languages. Computational Linguistics 1.45: 137-161.

        Mielke, Jeff (2008) The Emergence of Distinctive Features. Oxford: Oxford University Press.

        Needleman, Saul B. and Wunsch, Christian D. (1970) A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of Molecular Biology 48: 443-453.

        Newman, John and Raman, Anand V. (1999) Chinese Historical Phonology: Compendium of Beijing and Cantonese Pronunciations of Characters and their Derivations from Middle Chinese. München: LINCOM Europa.

        Schwink, Frederick (1994) Linguistic Typology, Universality and the Realism of Reconstruction. Washington: Institute for the Study of Man.

        Thompson, J. D. and Higgins, D. G. and Gibson, T. J. (1994) CLUSTAL W: Improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Research 22: 4673–4680.

        Automatic morpheme segmentation (Open problems in computational diversity linguistics 1)


        The first task on my list of 10 open problems in computational diversity linguistics deals with morphemes, that is, the minimal meaning-bearing parts in a language. A morpheme can be a word, but it does not have to be a word, since words may consist of more than one morpheme, and — depending on the language in question — may do so almost by default.

        Examples of morphemes in English include clear-cut cases of compounding, where two words are joined to form a new word. Often, this is not even readily reflected in spelling, and, as a result, speakers may at times think that a word like "primary school" is not a single word, although it is easy to determine from its semantics that it indeed points to one uniform concept. Other examples include grammatical markers, such as the ending -s, which marks most English plurals as well as the third person singular of verbs. When confronted with a word form like walks, linguists will analyze this word as consisting of two morphemes, illustrating this by adding a dash as a boundary marker: walk-s.

        The problem

        The task of automatic morpheme segmentation is thus a pretty straightforward one: given a list of words, potentially along with additional information, such as their meaning, or their frequency in the given language, try to identify all morpheme boundaries, and mark this by adding dash symbols where a boundary has been identified.

        One may ask why automatic identification of morphemes should be a problem —  and some people commenting on my presentation of the 10 open problems last month did ask this. The problem is not unrecognized in the field of Natural Language Processing, and solutions have been discussed from the 1950s onwards (Harris 1955, Benden 2005, Bordag 2008, Hammarström 2006, see also the overview by Goldsmith 2017).

        Roughly speaking, all approaches build on statistics about n-grams, i.e., recurring symbol sequences of arbitrary length. Assuming that n-grams representing meaning-bearing units should be distributed more frequently across the lexicon of a language, they assemble these statistics from the data, trying to infer the n-grams which "matter". With Morfessor (Creutz and Lagus 2005), a popular family of algorithms is also available in the form of a very stable and easy-to-use Python library (Virpioja et al. 2013). Applying and testing methods for automatic morpheme segmentation is thus very straightforward nowadays.
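        For readers who want to try this themselves, a baseline run with the Morfessor library looks roughly as follows; the function names follow the library's documented baseline workflow as I remember it, so please check the package documentation, and "wordlist.txt" is a hypothetical training file with one word form per line.

            import morfessor

            io = morfessor.MorfessorIO()
            train_data = list(io.read_corpus_file("wordlist.txt"))  # hypothetical training file

            model = morfessor.BaselineModel()
            model.load_data(train_data)
            model.train_batch()

            # segment an unseen word with the trained model
            segmentation, cost = model.viterbi_segment("auftürmen")
            print(segmentation)  # with little training data, expect poor segmentations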

        The issue with all of these approaches and ideas is that they require a very large amount of data for training, while our actual datasets are small and sparse by nature. As a result, all currently available algorithms fail badly when it comes to determining the morphemes in datasets of fewer than 1,000 words.

        Interestingly, even when they have been trained on large datasets, the algorithms still make surprising errors, as can easily be seen when testing the online demo of the Morfessor software for German (https://asr.aalto.fi/morfessordemo/). When testing a word like auftürmen "pile up", for example, the algorithm yields the segmentation auf-türme-n, which is probably understandable from the fact that the word Türme "towers" is quite frequent in the German lexicon, thus confusing the algorithm; but for a German speaker, who knows that verbs end in -en in their infinitive, it is clear that auftürmen can only be segmented as auf-türm-en.

        If I understand the information on the website correctly, the Morfessor algorithm offered online was trained on more than 1 million different word forms in German. Given that in our linguistic approaches we usually have only 1,000 words at our disposal, if not fewer, per language, it is clear that the algorithms won't provide help in finding the morphemes in our data.

        To illustrate this, I ran a small test on the Morfessor software, using two datasets for training: one big dataset with about 50,000 words from Baayen et al. (1995), and one smaller dataset of about 600 words which I used as a cognate detection benchmark when writing my dissertation (List 2014). I then used these two datasets to train the Morfessor software and applied the trained models to segment a list of 10 German words (see the GitHub Gist here).

        The results for the two models (small data and big data) as well as the segmentations proposed by the online application (online) are given in the table below (with my own judgments on morphemes given in the column word).

        Number  Word          Small data        Big data       Online
        1       hand          hand              hand           hand
        2       hand-schuh    hand-sch-uh       hand-schuh     hand-schuh
        3       hantel        h-a-n-t-el        hant-el        han-tel
        4       hunger        h-u-n-g-er        hunger         hunger
        5       lauf-en       l-a-u-f-en        laufen         lauf-en
        6       geh-en        gehen             gehen          gehen
        7       lieg-en       l-i-e-g-en        liegen         liegen
        8       schlaf-en     sch-lafen         schlafen       schlaf-en
        9       kind-er-arzt  kind-er-a-r-z-t   kind-er-arzt   kinder-arzt
        10      grund-schule  g-rund-sch-u-l-e  grund-schule   grundschule

        What can be seen clearly from the table is that none of the models does a convincing job of segmenting my ten test words. More importantly, however, we can clearly see that the algorithm's problems increase drastically when dealing with small training data: the segmentations proposed in the Small data column are clearly the worst, splitting words into letters in a seemingly random fashion.

        What is interesting in this context is that trained linguists would rarely fail at this task, even if all they were given for training were the small data list. That they do not fail is shown by the numerous studies in which linguistic fieldworkers have investigated so far under-investigated languages and quickly figured out how their morphology works.

        Why is it so difficult to find morpheme boundaries?

        What makes the detection of morpheme boundaries so difficult, also for humans, is that they are inherently ambiguous. A final -s can mark the plural in German, especially on borrowings, as in Job-s, but it can likewise mark a shortened variant of es "it", where the vowel is deleted, as in ist's "it's"; and in many other cases it marks nothing at all, but is instead part of a larger morpheme, as in Haus "house". Whether or not a certain substring of sounds in a language can function as a morpheme depends on the meaning of the word, not on the substring itself. We can, once more, see one of the great differences between sequences in biology and sequences in linguistics here: linguistic sequences derive their "function" (i.e. their meaning) from the context in which they are used, not from their structure alone.

        If speakers are no longer able to clearly understand the morphological structure of a given word, they may even start to change it, in order to make it more "transparent" in its denotation. Examples of this are the numerous cases of folk etymology, where speakers re-interpret the morphemes in a word, with English ham-burger as a prominent example, since the word originally seems to derive from the city of Hamburg, which has nothing to do with ham.

        How do humans find morphemes?
         
        The reasons why human linguists can find morphemes relatively easily in sparse data, while machines cannot, are still not entirely clear to me (beyond the general observation that humans are good at pattern recognition and machines are not). However, I do have some basic ideas about why humans largely outperform machines when it comes to morpheme segmentation; and I think that future approaches that try to take these ideas into account might drastically improve the performance of automatic morpheme segmentation methods.

        As a first point, given the importance of meaning in determining morphemic structure, it seems almost absurd to me to try to identify morphemes in a given language corpus based on a pure analysis of the sequences, without taking their meaning into account. If we are confronted with two words like Spanish hermano "brother" and hermana "sister", it is clear — if we know what they mean — that the -o vs. -a most likely denotes a distinction of gender. While machines compare potential similarities inside the words independently of semantics, humans will always start from those pairs in which they expect to find interesting alternations. As long as the meanings are supplied, a human linguist — even one not familiar with a given language — can easily propose a more or less convincing segmentation of a list of only 500 words.
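        The intuition can be sketched in a few lines of Python; the data and the (hand-picked) pairing of semantically related forms are invented for illustration.

            from os.path import commonprefix

            WORDS = [
                ("hermano", "brother"),
                ("hermana", "sister"),
                ("niño", "boy"),
                ("niña", "girl"),
            ]

            # pairs assumed to be semantically related (here simply picked by hand)
            PAIRS = [(0, 1), (2, 3)]

            for i, j in PAIRS:
                form_a, gloss_a = WORDS[i]
                form_b, gloss_b = WORDS[j]
                stem = commonprefix([form_a, form_b])
                print(form_a, gloss_a, "/", form_b, gloss_b, ":",
                      stem + "- plus -" + form_a[len(stem):], "vs. -" + form_b[len(stem):])
            # hermano brother / hermana sister : herman- plus -o vs. -a
            # niño boy / niña girl : niñ- plus -o vs. -a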

        A second point that is disregarded in current automatic approaches is the fact that morphological structures vary drastically across languages. In Chinese and many South-East Asian languages, for example, it is almost a rule that every syllable represents one morpheme (with minimal exceptions being attested and discussed in the literature). Since syllables are in turn easy to find in these languages, because words can only end in a restricted set of sounds, an algorithm to detect morphemes in those languages would not need any n-gram statistics, but just a theory of syllable structure. Instead of global strategies, we may thus rather have to opt for local strategies of morpheme segmentation, in which we identify different types of languages for which a given algorithm seems suitable.
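        A toy version of such a "local" strategy might look as follows; the syllable pattern is invented and far too crude for real data, but it shows how little machinery is needed once the syllable-equals-morpheme assumption holds.

            import re

            # one optional onset consonant, a vowel nucleus, an optional nasal coda;
            # invented inventory, far too crude for real data
            SYLLABLE = re.compile(r"[ptkbdgmnszlh]?[aeiou][nŋ]?")

            def syllable_morphemes(word):
                """Split a transcription into syllables, each treated as one morpheme."""
                return re.findall(SYLLABLE, word)

            print(syllable_morphemes("mintaŋkilo"))  # ['min', 'taŋ', 'ki', 'lo']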

        This brings us to a third point. A peculiarity of linguistic sequences in spoken languages is that they are built by specific phonotactic rules that govern their overall structure. Whether or not a language tolerates more than three consonants at the beginning of a word depends on its phonotactics, its set of rules by which the inventory of sounds is combined to form morphemes and words. Phonotactics itself can also give hints on morpheme boundaries, since it may prohibit combinations of sounds within morphemes that can occur when morphemes are joined to form words. German Ur-instinkt "basic instinct", for example, is pronounced with a glottal stop after the Ur-, a sound which can only occur at the beginning of German words and morphemes, thus marking the word clearly as a compound (otherwise the word could be parsed as Urin-stinkt "urine smells").

        A fourth point that is also generally disregarded in current approaches to automatic morpheme segmentation is that of cross-linguistic evidence. In many cases, the speakers of a given language may themselves no longer be aware of the original morphological segmentation of some of their words, while the comparison with closely related languages can still reveal it. If we have a potentially multi-morphemic word in one language, for example, and only one of the two potential morphemes reflected as a normal word in the other language, this is clear evidence that the potentially multi-morphemic word does, indeed, consist of multiple morphemes.

        Suggestions

        Linguists regularly use multiple types of evidence when trying to understand the morphological composition of the words in a given language. If we want to advance the field of automatic morpheme segmentation, it seems to me indispensable that we give up the idea of detecting the morphology of a language just by looking at the distribution of letters across word forms. Instead, we should make use of semantic, phonotactic, and comparative information. We should further give up the idea of designing universal morpheme segmentation algorithms, but rather study which approach works best on which morphological type. How these aspects can be combined in a unified framework, however, is still not entirely clear to me; and this is also the reason why I list automatic morpheme segmentation as the first of my ten open problems in computational diversity linguistics.

        Even more important than the strategies for solving the problem, however, is that we start to work on extensive datasets for testing and training new algorithms that seek to identify morpheme boundaries in sparse data. As of now, no such datasets exist. Approaches like Morfessor were designed to identify morpheme boundaries in written languages; they barely work with phonetic transcriptions. But if we had datasets for testing and training available, be it only some 20 or 40 languages from different language families, manually annotated by experts, segmented both with respect to the phonetics and to the morphemes, this would allow us to investigate both existing and new approaches much more profoundly, and I expect it could give a real boost to our discipline and greatly help us to develop advanced solutions for the problem.

        References

        Baayen, R. H. and Piepenbrock, R. and Gulikers, L. (eds.) (1995) The CELEX Lexical Database. Version 2. Philadelphia.

        Benden, Christoph (2005) Automated detection of morphemes using distributional measurements. In: Claus Weihs and Wolfgang Gaul (eds.): Classification -- the Ubiquitous Challenge. Berlin and Heidelberg: Springer, pp 490-497.

        Bordag, Stefan (2008) Unsupervised and knowledge-free morpheme segmentation and analysis. In: Carol Peters, Valentin Jijkoun, Thomas Mandl, Henning Müller, Douglas W. Oard, Anselmo Peñas, Vivien Petras and Diana Santos (eds.): Advances in Multilingual and Multimodal Information Retrieval. Berlin and Heidelberg: Springer, pp 881-891.

        Creutz, M. and Lagus, K. (2005) Unsupervised morpheme segmentation and morphology induction from text corpora using Morfessor 1.0. Technical Report. Helsinki University of Technology.

        Goldsmith, John A. and Lee, Jackson L. and Xanthos, Aris (2017) Computational learning of morphology. Annual Review of Linguistics 3.1: 85-106.

        Hammarström, Harald (2006) A Naive Theory of Affixation and an Algorithm for Extraction. In: Proceedings of the Eighth Meeting of the ACL Special Interest Group on Computational Phonology and Morphology at HLT-NAACL 2006 pp. 79-88.

        Harris, Zellig S. (1955) From phoneme to morpheme. Language 31.2: 190-222.

        List, Johann-Mattis (2014) Sequence Comparison in Historical Linguistics. Düsseldorf: Düsseldorf University Press.

        Virpioja, Sami, Smit, Peter, Grönroos, Stig-Arne and Kurimo, Mikko (2013) Morfessor 2.0: Python Implementation and Extensions for Morfessor Baseline. Helsinki: Aalto University.

        Future challenges for computational diversity linguistics


        At the end of each year, many people start to think of the things they want to do during the next year. While I am not very extreme in this respect, I tend to do the same thing at times; and last year it happened that I started — inspired by a discussion with students I had in Buenos Aires — thinking about the biggest challenges that I see for the field of computational diversity linguistics (i.e. historical and typological language comparison carried out in a formal or quantitative way). I thus sat down before my holidays started and made a short list of tasks that are challenging, but which I think can still be tackled in the nearer or more distant future.


        The idea of making such a list of questions is well known among mathematicians, who have their famous Hilbert Problems, proposed by David Hilbert in 1900. In linguistics, I first heard about them from Russell Gray, who himself was introduced to the idea by the linguist Martin Hilpert, who gave a talk on challenging questions for linguistics in 2014 (available online here), called "Challenges for 21st century linguistics". Russell Gray has since emphasized the importance of proposing "Hilbert" questions for the fields of linguistic and cultural evolution, and has also presented his own big challenges in the past.

        As somebody who considers himself a methodologist, I am not going to frame my questions as "big" or challenging in the way that Russell Gray or Martin Hilpert did. Instead, the problems I would like to see tackled are purely computational challenges that I think can be solved by algorithms or workflows. This does not mean, of course, that these problems are not challenging in the big sense, and it also does not automatically mean that they can be solved in the near future. But given that my own work, and that of colleagues in the field of computational and computer-assisted language comparison, progresses steadily, at times even at an impressive pace, I have some trust that these problems will indeed be solvable within the next 5-10 years.

        The problems I came up with are listed below:
        1. automatic morpheme segmentation
        2. automatic sound law induction
        3. automatic borrowing detection
        4. automatic phonological reconstruction
        5. simulating lexical change
        6. simulating sound change
        7. statistical proof of language relatedness
        8. typology of semantic change
        9. typology of semantic promiscuity
        10. typology of sound change.
        You can see that the way I worded the problems divides them into four major categories. The first four problems point to questions of inference, such as the inference of morpheme boundaries in a mono-lingual wordlist (# 1), the inference of laws by which sounds are changed from a parent to a daughter language (# 2), the inference of borrowings in multilingual datasets (# 3), and the inference of so far unattested proto-forms (# 4). The fifth and the sixth problems deal with simulation, and I distinguish the simulation of lexical change (# 5) and the simulation of sound change (# 6) as two separate tasks, although they could of course be combined later. The seventh problem is a bit different from the others, as it deals with the question of genealogical relationship among languages, and how we can test  it statistically (see Baxter and Manaster Ramer 2000 for an overview).

        The last three problems deal with general patterns that can, or could be, observed for change in semantics and phonology. Semantic change (# 8) shows highly interesting cross-linguistic tendencies that are not yet fully understood (see Wilkins 1996 for an early discussion). Furthermore (# 9), words are often re-used across the lexicon of a given language, and it is an open question whether languages show striking preferences for building many new words from just a few basic words denoting "promiscuous" concepts (like "fall" or "stand"; see Geisler 2018 and a recent blogpost by Schweikhard 2018 for an overview). Sound change (# 10) also follows cross-linguistic regularities, but the nature of these regularities is still not very well understood (see Kümmel 2008 for a pilot study on the topic).

        Discussing each task in detail would be far too long for a single post, given that I have reflected on these problems a lot during the last few years, and at times even have some ideas as to how they could be tackled concretely.

        So, based on my idea of making plans for 2019, I decided that I would try to discuss each of these ten problems in greater detail in separate blog posts throughout 2019. This post thus serves merely to introduce the problems. Over the next ten months, I will try to devote some time to each problem in a blog post of its own; and then I will review all of the problems again at the end of the year.

        I do not yet know how far this will go, and whether I will have the discipline to write a post on each topic within the coming months, especially since it may also turn out that I end up discarding problems from my list. However, I feel that this could turn into a nice road map for my research in 2019. If I have to devote at least half a day each month over the next year to thinking about problems in computational historical and typological language comparison, it might help not only myself but also some colleagues to come up with solutions to some of the problems.

        References

        Baxter, William H. and Manaster Ramer, Alexis (2000) Beyond lumping and splitting: probabilistic issues in historical linguistics. In: Renfrew, Colin and McMahon, April and Trask, Larry (eds.) Time Depth in Historical Linguistics. Cambridge: McDonald Institute for Archaeological Research, 167-188.

        Geisler, Hans (2018) Sind unsere Wörter von Sinnen? Überlegungen zu den sensomotorischen Grundlagen der Begriffsbildung. In: Kazzazi, Kerstin and Luttermann, Karin and Wahl, Sabine and Fritz, Thomas A. (eds.) Worte über Wörter. Festschrift zu Ehren von Elke Ronneberger-Sibold. Tübingen: Stauffenburg, 131-142.

        Schweikhard, Nathanael E. (2018) Semantic promiscuity as a factor of productivity in word formation. Computer-Assisted Language Comparison in Practice 1.11.19.

        Wilkins, David P. (1996) Natural tendencies of semantic change and the search for cognates. In: Durie, Mark (ed.) The Comparative Method Reviewed: Regularity and Irregularity in Language Change. New York: Oxford University Press, 264-304.