Structural data in historical linguistics

Most historical linguists compare words to reconstruct the history of languages. Apart from phylogenetic studies based on cognate sets, which reflect shared homologs across the languages under investigation, however, there is another data type that scholars have tried to explore in the past. This data type is difficult for non-linguists to understand, given its very abstract nature. In the past, it has led to a considerable amount of confusion, both among linguists and among non-linguists who tried to use it for quick (and often also dirty) phylogenetic approaches. For this reason, I figured it would be useful to introduce this type of data in more detail.

This data type can be called "structural". To enable interested readers to experiment with the data themselves, this blogpost is accompanied by two example datasets that we converted into a computer-readable format (with much help from David), since the original papers offered the data only as PDF files. In future blogposts, we will try to illustrate how the data can, and should, be explored with network methods. In this first blogpost, I will explain the basic structure of the data.

Structural data in historical linguistics and language typology

In order to illustrate the type of data we are dealing with here, let's have a look at a typical dataset, compiled by the famous linguist Jerry Norman to illustrate differences between Chinese dialects (Norman 2003). The table below shows a part of the data provided by Norman.

No.  Feature                                            Varieties
1    The third person pronoun is tā, or cognate to it   + - - -
4    Velars palatalize before high-front vowels         + + - -
7    The qu-tone lacks a register distinction           + - + -
12   The word for "stand" is zhàn or cognate to it      + - - -

In this example, the data is based on a questionnaire that provides specific questions; for each of the languages in the sample, the dataset answers each question with either + or -. Many of these datasets are binary in nature, but this is not a necessary condition, and questionnaires can also query categorical variables: the major type of word order, for example, might have three categories (subject-object-verb, subject-verb-object, or other).
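To make this concrete, such a questionnaire can be represented as a simple feature matrix. The sketch below (in Python) uses the feature numbers from the Norman excerpt above, but the variety labels A-D are placeholders for the four unnamed columns, and the categorical word-order values are illustrative assumptions, not part of Norman's data.

```python
# A minimal representation of a structural questionnaire: each variety
# maps feature IDs to answers. Binary features use "+" / "-", while a
# categorical feature (word order) shows that answers need not be binary.
# Variety names A-D are placeholders for the four columns shown above.
features = {
    "A": {"F1": "+", "F4": "+", "F7": "+", "F12": "+", "word_order": "SVO"},
    "B": {"F1": "-", "F4": "+", "F7": "-", "F12": "-", "word_order": "SVO"},
    "C": {"F1": "-", "F4": "-", "F7": "+", "F12": "-", "word_order": "SVO"},
    "D": {"F1": "-", "F4": "-", "F7": "-", "F12": "-", "word_order": "SVO"},
}

def answers(variety, feature_ids):
    """Return the answer string for a list of features, e.g. '+---'."""
    return "".join(features[variety][f] for f in feature_ids)

print(answers("A", ["F1", "F4", "F7", "F12"]))  # prints "++++"
```

Reading the matrix column-wise like this (one answer string per variety) is also the orientation that phylogenetic software expects.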

We can also see that the questions can be very diverse. While we often use more or less standardized concept lists for lexical research (such as fixed lists of basic concepts, List et al. 2016), this kind of dataset is much less standardized, due to the nature of the questionnaire: asking for the translation of a concept is more or less straightforward, and the number of possible concepts useful for historical research is quite constrained. Asking a question about the structure of a language, however, be it phonological, lexical, based on attested sound changes, or syntactic, opens up an incredible number of possibilities. As a result, it seems close to impossible to standardize these questions across different datasets.

Although scholars often call the data based on these questionnaires "grammatical" (since many questions are directed towards grammatical features, such as word order, or the presence or absence of articles), most datasets show a structure in which questions of phonology, lexicon, and grammar are mixed. For this reason, it is misleading to talk of "grammatical datasets"; the term "structural data" seems more adequate, since this is what the datasets were originally designed for: to investigate differences in the structure of different languages, as reflected in the famous World Atlas of Language Structures (Dryer and Haspelmath 2013).

Too much freedom is a restriction

In addition to mixed features that can be observed without knowing the history of the languages under investigation, many datasets (including the one by Norman shown above) also use explicitly "historical" (diachronic, in linguistic terminology) questions in their questionnaires. In the paper describing the dataset, Norman defends this practice, arguing that the goal of his study is to establish a historical classification of the Chinese dialects. With this goal in mind, it seems defensible to make use of historical knowledge, and to include observed phenomena of language change in general, and sound change in particular, when compiling a structural dataset for a group of related language varieties.

The extremely diverse nature of questionnaire items in structural datasets, however, makes their interpretation extremely difficult. This becomes especially evident when the data are used in combination with computational methods for phylogenetic reconstruction, which is problematic for two major reasons.
  1. Since the questions are by nature barely restricted in their content, scholars can easily pick and choose features in such a way that they confirm the theory the scholars want to confirm, rather than testing it objectively. Since suitable features can be selected from a virtually unlimited array of possibilities, it is extremely difficult to guarantee the objectivity of a given feature collection. 
  2. If features are mixed, phylogenetic methods based on explicit statistical models (gain and loss of character states, etc.) may be inadequate to model the evolution of the characters, especially if the characters are themselves historical. While a feature like "the language has an article" can be interpreted as a gain-loss process (at some point the language has no article, then it gains one, then it loses it again, etc.), features showing the results of processes, like "the words that originally started in [k] followed by a front vowel are now pronounced as []", cannot be interpreted as a process, since the feature itself already describes one.
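The gain-loss interpretation mentioned in the first case can be made explicit with a toy simulation. The sketch below (Python; the rates, step count, and lineage count are arbitrary assumptions, not estimates from any real dataset) evolves a single binary feature such as "the language has an article" along a lineage as a simple two-state process, something that is meaningful for state-like features, but not for features that already describe a completed change.

```python
import random

def evolve(state, gain_rate, loss_rate, steps, rng):
    """Evolve a binary feature (0 = absent, 1 = present) along a lineage.

    At each step the feature is gained with probability gain_rate when
    absent, and lost with probability loss_rate when present.
    """
    for _ in range(steps):
        if state == 0 and rng.random() < gain_rate:
            state = 1
        elif state == 1 and rng.random() < loss_rate:
            state = 0
    return state

rng = random.Random(42)
# Simulate 1000 independent lineages, all starting without an article.
final = [evolve(0, gain_rate=0.05, loss_rate=0.02, steps=100, rng=rng)
         for _ in range(1000)]
print(sum(final) / len(final))  # proportion of lineages ending with an article
```

A feature that records a completed sound change has no analogous state machine: once the change is part of the feature's definition, there is no "gain" or "loss" event left to model.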
For these reasons, all phylogenetic studies that make use of structural data, in contrast to purely lexical datasets, should be treated with great care, not only because they tend to yield unreliable results, but more importantly because such datasets are extremely difficult to compare across different language families, given the excessive freedom scholars have when compiling them. Feature collections provided in structural datasets are an interesting resource for diversity linguistics, but they should not be used to make primary claims about external language history or subgrouping.

Two structural datasets for Chinese dialects

Before I start to bore the already small circle of readers interested in these topics, it seems better to stop discussing the usefulness of structural data at this point, and to introduce the two datasets that were promised at the beginning of the post.

Both datasets target Chinese dialect classification: the first was proposed by Norman (2003), and the second reflects a new data collection recently used by Szeto et al. (2018) to propose a North-South split of the dialects of Mandarin Chinese with the help of a Neighbor-Net analysis (Bryant and Moulton 2004). Both datasets have been uploaded to Zenodo, and can be found in the newly established community collection cldf-datasets. The main idea of this collection is to gather the various structural datasets that have been published in the literature, and to give people interested in the data, be it for replication studies or to test alternative approaches, easy access to it in various formats.
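Neighbor-Net, as used by Szeto et al., operates on a distance matrix rather than on the characters themselves; from binary structural data, a simple Hamming distance is a natural starting point. The sketch below (Python; the variety names and answer strings are invented placeholders, not taken from either dataset) shows how such a matrix could be derived.

```python
def hamming(a, b):
    """Proportion of features on which two answer strings disagree."""
    assert len(a) == len(b)
    return sum(x != y for x, y in zip(a, b)) / len(a)

# Hypothetical four-variety toy data, one +/- answer string per variety.
data = {"VarietyA": "++++", "VarietyB": "-+--",
        "VarietyC": "--+-", "VarietyD": "----"}

taxa = sorted(data)
dist = {(s, t): hamming(data[s], data[t]) for s in taxa for t in taxa}
print(dist[("VarietyA", "VarietyB")])  # prints 0.75
```

A matrix like this can then be fed to an implementation of Neighbor-Net (for instance, in SplitsTree) to produce the planar network.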

The basic format is based on the format specifications laid out by the CLDF initiative (Forkel et al. 2018), which provides a software API, format specifications, and examples of best practice for both structural and lexical datasets in historical linguistics and language typology. The collection is curated on GitHub (cldf-datasets), and datasets are converted to CLDF (with all languages linked to the Glottolog database, Hammarström et al. 2018) and also to Nexus format. The datasets are versioned and may be updated in the future, and interested readers can study the code used to generate the specific data formats from the raw files, as well as the Nexus files, to learn how to submit their own datasets to our initiative.
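To give an impression of what the conversion to Nexus involves, the following sketch (Python; the variety names and the feature matrix are made-up placeholders, not the actual Norman or Szeto et al. data) writes a binary structural matrix to a minimal Nexus DATA block of the kind consumed by Neighbor-Net software.

```python
def to_nexus(matrix):
    """Convert {variety: '+-...' answer string} to a minimal Nexus string.

    '+' is coded as 1, '-' as 0, and anything else as '?' (missing).
    """
    ntax = len(matrix)
    nchar = len(next(iter(matrix.values())))
    coding = {"+": "1", "-": "0"}
    lines = [
        "#NEXUS",
        "BEGIN DATA;",
        f"DIMENSIONS NTAX={ntax} NCHAR={nchar};",
        'FORMAT DATATYPE=STANDARD SYMBOLS="01" MISSING=?;',
        "MATRIX",
    ]
    for taxon, answers in sorted(matrix.items()):
        row = "".join(coding.get(a, "?") for a in answers)
        lines.append(f"{taxon} {row}")
    lines += [";", "END;"]
    return "\n".join(lines)

# Hypothetical four-variety example:
print(to_nexus({"VarietyA": "++++", "VarietyB": "-+--",
                "VarietyC": "--+-", "VarietyD": "----"}))
```

The real conversion code in the repository handles more bookkeeping (language metadata, Glottolog links, missing values), but the core transformation is of this shape.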

Final remarks on publishing structural datasets online

Given that we provide only two initial datasets for an enterprise whose general usefulness is highly questionable, readers might ask themselves why we are going through the pain of making data created by other people accessible through the web.

The truth is that the situation in historical linguistics and language typology has long been very unsatisfactory. Most data-based research did not supply the data along with the paper, and authors often refuse outright to share the data when asked after publication (see also the post on Sharing supplementary data). In other cases, access to the data is hampered because it is provided only as PDF tables inside the paper (or, even worse, as long tables in the supplement), which forces scholars who wish to check a given analysis to reverse-engineer the data from the PDF.

Many colleagues think that it is time to change this, and we can only change it by offering standard ways to share our data. The CLDF and Nexus files in which the two Chinese datasets are now published in this open repository collection may hopefully serve as a starting point for larger collaboration among typologists and historical linguists. Ideally, all people who publish papers making use of structural datasets would, similar to the practice in biology, where scholars submit data to GenBank (Benson et al. 2013), submit their data in CLDF and Nexus format, so that their colleagues can easily build on their results and test them for potential errors.


Benson D., M. Cavanaugh, K. Clark, I. Karsch-Mizrachi, D. Lipman, J. Ostell, and E. Sayers (2013) GenBank. Nucleic Acids Res. 41.Database issue: 36-42.

Bryant D. and V. Moulton (2004) Neighbor-Net. An agglomerative method for the construction of phylogenetic networks. Molecular Biology and Evolution 21.2: 255-265.
Campbell L. and W. Poser (2008) Language classification: History and method. Cambridge University Press: Cambridge.

Cathcart C., G. Carling, F. Larson, R. Johansson, and E. Round (2018) Areal pressure in grammatical evolution. An Indo-European case study. Diachronica 35.1: 1-34.

Dryer M. and M. Haspelmath (2013) WALS Online. Max Planck Institute for Evolutionary Anthropology: Leipzig.

Forkel R., J.-M. List, S. Greenhill, C. Rzymski, S. Bank, M. Cysouw, H. Hammarström, M. Haspelmath, G. Kaiping, and R. Gray (forthcoming) Cross-Linguistic Data Formats, advancing data sharing and re-use in comparative linguistics. Scientific Data.

Hammarström H., R. Forkel, and M. Haspelmath (2018) Glottolog. Version 3.3. Max Planck Institute for Evolutionary Anthropology: Leipzig.

List J.-M., M. Cysouw, and R. Forkel (2016) Concepticon. A resource for the linking of concept lists. In: Proceedings of the Tenth International Conference on Language Resources and Evaluation, pp 2393-2400.

Norman J. (2003) The Chinese dialects. Phonology. In: Thurgood, G. and R. LaPolla (eds.) The Sino-Tibetan languages. Routledge: London and New York, pp 72-83.

Pritchard J., M. Stephens, and P. Donnelly (2000) Inference of population structure using multilocus genotype data. Genetics 155: 945–959.

Szeto P., U. Ansaldo, and S. Matthews (2018) Typological variation across Mandarin dialects: An areal perspective with a quantitative approach. Linguistic Typology 22.2: 233-275.

Zhang M., W. Pan, S. Yan, and L. Jin (2018) Phonemic evidence reveals interwoven evolution of Chinese dialects. bioRxiv.

Weekend reads: The study that never existed; turmoil at Cochrane; a plagiarist is appointed professor

Before we present this week’s Weekend Reads, a question: Do you enjoy our weekly roundup? If so, we could really use your help. Would you consider a tax-deductible donation to support Weekend Reads, and our daily work? Thanks in advance. The week at Retraction Watch featured a lot of news about Brian Wansink — six new …

Improving newborn sickle cell screening in Africa: ‘We can affect change there just like we did in the US’


by Kim Krisberg

In the US, nearly all children born with sickle cell disease survive into adulthood. Across the globe in sub-Saharan Africa, more than half of babies born with the genetic condition don’t survive until their fifth birthdays.

A major reason for the stark disparity is the region’s lack of newborn screening capacity, which allows for early detection and medical intervention. Here in the US, state public health laboratories automatically test babies for a number of genetic and metabolic disorders, including sickle cell disease, as part of their universal newborn screening programs. In sub-Saharan Africa, however, diagnostic and treatment capacity is severely limited, despite the region being home to more than 75% of the disease’s global burden.

Researchers estimate that about 240,000 babies are born with sickle cell disease in sub-Saharan Africa every year, with studies estimating that at least half of such children die before age five (though research finds the under-five mortality rate related to sickle cell disease in the region could be as high as 90%). Globally, the number of people with sickle cell disease is expected to grow by 30% by 2050. Early detection and diagnosis is critical to pushing that child mortality rate down, but to date, no country in sub-Saharan Africa has been able to establish universal newborn screening for any disease, including sickle cell disease.

Sickle cell disease is an inherited red blood cell disorder in which abnormally shaped red blood cells block the adequate flow of blood and oxygen throughout the body. The disease causes a number of adverse and debilitating effects, including anemia, chronic pain, delayed growth, vision problems and more frequent infections. The disease is manageable with access to relatively easy, low-cost interventions, such as folic acid supplementation, vaccines and antibiotics, pain treatment, dietary changes and high fluid intake.

“This is the same disease we screen for here in the US and we know that if we’re able to detect it early enough and provide the right treatment — prophylaxis penicillin and folic acid — it increases their chances of having a normal life enormously,” says Jelili Ojodu, MPH, director of newborn screening and genetics at APHL. “Sickle cell disease doesn’t have to be a death sentence, as it is now in these countries.”

This summer, the Sickle Cell Disease Coalition — APHL is a member of its steering committee — released a new public service announcement directing viewers to a library of global resources on sickle cell disease screening sites and treatment centers in African regions. Also unveiled was an eight-minute documentary from the American Society of Hematology on sickle cell disease newborn screening efforts now underway in Ghana and how families impacted by sickle cell disease can access appropriate care.

For more than a decade, APHL has been working with providers and health officials in sub-Saharan Africa to institute newborn screening for sickle cell disease, providing technical assistance and guidance on testing methodologies, facilitating relationships with laboratory vendors and in some cases, providing hands-on training in validating lab instruments. The goal, Ojodu said, is to help countries take the first steps in the slow scale-up toward universal newborn screening and foster small pilot projects that expand the evidence base and justification for further investment. For example, in Ghana, where sickle cell disease is endemic, APHL partnered with the Centers for Disease Control and Prevention and the Sickle Cell Foundation of Ghana to offer technical assistance on a variety of related screening activities, such as needs assessments, genetic counseling and educating providers and parents. The initiative, launched in 2011, began with a survey of community needs, which revealed a gap in the availability of genetic counselors who specialize in sickle cell disease.

In turn, APHL led a 2013 workshop on developing a sickle cell disease counselor training and certification program in Ghana, where participants helped tailor a culturally competent training program specific to the needs of Ghana’s communities. Then in 2015, APHL put together a curriculum and trained the first 15 counselors using the new Genetic Education and Counseling for Sickle Cell Conditions in Ghana. A second training workshop took place in Ghana in the summer of 2016.

In all, Ojodu said, APHL has worked with providers in about a half-dozen African nations to improve sickle cell disease outcomes and newborn screening, including Mali, Kenya, Nigeria, Liberia, Uganda and Tanzania. The work, he said, has shown that newborn sickle cell disease screening and counseling in sub-Saharan Africa is possible — the real sticking point is securing the funding and support to shift from small pilots at hospitals and universities to population-wide screening. (He added that most sickle cell disease screening in sub-Saharan Africa is happening in hospital labs, which he said might be the preferred setting for such newborn screening in the region, as public health agencies there must focus their limited resources on considerable communicable disease threats.)

In Ghana, Ojodu noted, providers use the same technology to screen for sickle cell disease as labs do in the US, which underscores the adaptability of current sickle cell disease screening techniques to a variety of settings.

“If we can do it here, they can do it there,” Ojodu said. “Of course, it will take time and coordinated efforts. It’s really a slow build-up of justifying that No. 1, this saves lives, and No. 2, it can be done.”

Venée Tubman, MD, MMSc, a member of the African Newborn Screening and Early Intervention Consortium, which came out of the American Society of Hematology’s Sickle Cell Disease Working Group on Global Issues, noted that a number of attempts have been made to start newborn screening programs in sub-Saharan Africa, but also reported that no country has yet succeeded in adopting a universal screening effort. She noted that based on progress in sickle cell disease survival rates in the US — where about 96% of babies with sickle cell disease now survive into adulthood — it’s reasonable to believe that similar improvements can be achieved for children in sub-Saharan Africa with the expansion of early detection and treatment. For instance, in the US, CDC reports that with the introduction of pneumococcal disease vaccination, sickle cell disease-related deaths among black children younger than four dropped by 42% between 1999 and 2002.

“The fact that we were able to implement some basic measures and increase survivability pretty dramatically leads me to believe that, yes, most of these deaths are preventable,” said Tubman, an assistant professor in pediatrics at Baylor College of Medicine.

She added that the existence of the consortium and the Sickle Cell Disease Coalition speaks to the progress being made to boost early detection and intervention in sub-Saharan Africa.

“Even beginning to strategize and organize around this problem — the infrastructure limitations and the myth and perceptions around sickle cell — is a sign of progress,” Tubman said. “We have a long way to go, but at least we’re on the road.”

Ojodu noted that with the elimination of CDC funding for global newborn screening development, APHL is looking for new funding partners to continue its work abroad.

“This is possible,” he said, referring to improving sickle cell disease survivability rates in sub-Saharan Africa. “We can affect change there just like we did in the US.”


*Header photo is a screenshot from the Sickle Cell Disease Coalition’s “Global Sickle Cell Disease Public Service Announcement.”

The post Improving newborn sickle cell screening in Africa: ‘We can affect change there just like we did in the US’ appeared first on APHL Lab Blog.

NASA Seeking Partner in Contest to Name Next Mars Rover

Artist's rendition depicts NASA's Mars 2020 rover

NASA is looking for corporate, nonprofit and educational organizations to team up for a contest enabling K-12 students to name the next rover to the Red Planet.

Wansink admits mistakes, but says there was “no fraud, no intentional misreporting”

Brian Wansink, the Cornell food marketing researcher who announced his resignation yesterday and has been found to have committed misconduct by the university, admits to mistakes and poor record-keeping in a statement released today. But he insists that there was “no fraud, no intentional misreporting, no plagiarism, or no misappropriation.” (See entire statement below.) As …



The rise and plummet of the name Heather

Hey, no one told me that baby name analysis was back in fashion. Dan Kopf for Quartz, using data from the Social Security Administration, describes the downfall of the name Heather. It exhibited the sharpest decline of all names since 1880.

Talking to Laura Wattenberg:

Wattenberg says the rise and fall of Heather is exemplary of the faddish nature of American names. “When fashion is ready for a name, even a tiny spark can make it take off,” she says. “Heather climbed gradually into popularity through the 1950s and ’60s, then took its biggest leap in 1969, a year that featured a popular Disney TV movie called Guns in the Heather. A whole generation of Heathers followed, at which point Heather became a ‘mom name’ and young parents pulled away.”




New Small Satellite Peers Inside Hurricane Florence

This animation combines the TEMPEST-D (Temporal Experiment for Storms and Tropical Systems Demonstration) data with a visual image of the storm from NOAA's GOES (Geostationary Operational Environmental Satellite) weather satellite.

A new satellite no bigger than a cereal box revealed the hidden interior of Hurricane Florence, using miniature technology that could change the future of storm monitoring.

Cornell finds that food marketing researcher Brian Wansink committed misconduct, as he announces retirement

A day after the JAMA family of journals retracted six of his studies, Cornell food marketing researcher Brian Wansink tells Retraction Watch that he will be retiring next year. And Cornell said today that it found that Wansink “committed academic misconduct in his research and scholarship, including misreporting of research data, problematic statistical techniques, failure …
