Election modeling explained

In election reporting, there’s a gap between real-time results and final results, so news orgs use statistical models to show where results appear to be headed. For The Washington Post, Adrian Blanco and Artur Galocha explain the basic concepts behind their model, using a fictional state called Voteland.


Agent-based modeling in JavaScript

Atomic Agents is a JavaScript library by Graham McNeill that can help simulate the interactions between people, places, and things in a two-dimensional space. Saving for later. Looks fun.


Teaching statistical models with wine tasting

For The Pudding, Lars Verspohl provides an introduction to statistical models disguised as a lesson on finding good wine. Start with a definition of wine, which becomes a way to describe it with numbers. Define what makes a wine good. Then find the wines that come closest to that definition.


How experts use disease modeling to help inform policymakers

Harry Stevens and John Muyskens for The Washington Post put you in the position of an epidemiologist receiving inquiries from policymakers about what might happen:

Imagine you are an epidemiologist, and one day the governor sends you an email about an emerging new disease that has just arrived in your state. To avoid the complexities of a real disease like covid-19, the illness caused by the novel coronavirus, we have created a fake disease called Simulitis. In the article below, we’ll give you the chance to model some scenarios — and see what epidemiologists are up against as they race to understand a new contagion.

Fuzzy numbers, meet real-world decisions.
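The dynamics behind a toy contagion like Simulitis can be captured in a few lines of code. Below is a minimal discrete-time SIR (susceptible-infected-recovered) sketch in Python; the function name and all parameter values are invented for illustration and are not taken from the Post's model.

```python
def simulate_sir(population=1000, infected=1, beta=0.3, gamma=0.1, days=120):
    """Discrete-time SIR model: returns daily (S, I, R) counts.

    beta is the transmission rate, gamma the recovery rate; both are
    illustrative values, not estimates for any real disease.
    """
    s, i, r = population - infected, infected, 0
    history = [(s, i, r)]
    for _ in range(days):
        new_infections = beta * s * i / population
        new_recoveries = gamma * i
        s -= new_infections
        i += new_infections - new_recoveries
        r += new_recoveries
        history.append((s, i, r))
    return history

history = simulate_sir()
# Day on which the number of simultaneously infected people peaks.
peak_day = max(range(len(history)), key=lambda d: history[d][1])
print(f"epidemic peaks around day {peak_day}")
```

Changing `beta` (for example, to mimic social distancing) flattens the infection curve, which is exactly the kind of scenario comparison the article lets readers run.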


Comparing Covid-19 models

FiveThirtyEight compared six Covid-19 models for a sense of where we might be headed. With different assumptions and different math, the trajectories vary, but they at least provide clues so that policymakers can make educated decisions.

If you’re interested in the data behind these models, check out the COVID-19 Forecast Hub maintained by the Reich Lab at the University of Massachusetts Amherst. They helped with the FiveThirtyEight comparisons and are also the source for the official CDC forecast page.


Challenges of making a reliable Covid-19 model

Predicted fatalities from Covid-19 range from the hundreds of thousands to the millions. Nobody knows for sure. These predictions come from statistical models, which are built on data that isn't consistent or reliable yet. FiveThirtyEight, whose bread and butter is models and forecasts, breaks down the challenges of building such a model and explains why they haven't published one.


Evolution unchained: The development of person names and the limits of sequences


What do person names like Jack and Hans have in common, and what unites Joe and Pepe? Both name pairs go back to a common ancestor. For Jack and Hans, this would be John (ultimately going back to Iōánnēs in Greek), and for Joe and Pepe, this would be Josef (originally from Hebrew). Given the striking dissimilarity of the names in their current form, the pathways of change by which they have evolved into their current shape are quite complicated.

While the German name Hans can easily be shown to be a short form of the German variant Johannes, the evolution of Jack is more complicated. First (at least this is what people on Wikipedia suppose), Iōánnēs became John in English, similar to the process that transformed German Johannes into Hans. Then, in an older form of English, a diminutive was built for John, which yielded the form Jenkin, with the diminutive suffix -kin that has a homologous counterpart in German -chen (which can be attached to Hans as well, yielding Hänschen). Etymologically, Jack is little Johnny.

While Joe in English is a shortening of Josef, the development of Pepe is again a bit more complex. First, we find the form Giuseppe as an Italian counterpart of Josef. How this form then yielded Pepe as a diminutive is not completely clear to me; but since we find the pe in the Italian form, we can think of a process by which Giuseppe becomes Giuseppepe, leaving Pepe after the deletion of the initial two syllables.

The complexity of person-name evolution

Even from these two examples alone, we can already see that the evolution of person names can easily become quite complex. If all words in all spoken languages in the world evolved in the same way in which our person names evolve, we would have a big problem in historical linguistics, since the amount of speculation in our etymologies would drastically increase.

When comparing etymologically related words from different languages, we generally assume that they show regular correspondences among their sound segments. This presupposes that there is still enough sound material that reflects these correspondences, allowing us to detect and assess them. But since the evolution of person names rarely consists of the regular modification of sounds, but rather results in the deletion, reduplication, and rearrangement of whole word parts, there is rarely enough left in the end that could be used as the basis for a classical sequence comparison.

With the name Tina in German being the short form of Bettina, Christina, and at times even Katharina, and with Bettina itself going back to Elisabeth, and with Tina becoming Tinchen, Tinka, or Tine, we face an almost insurmountable challenge when trying to model the complexity of the various patterns by which names can change.
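The point about classical sequence comparison can be made concrete with edit distances. The short Python sketch below implements the classical Levenshtein distance; it shows that related name pairs like Jack/Hans or Joe/Pepe retain so little shared sound material that they look almost maximally dissimilar, which is exactly why regular correspondence-based methods struggle here.

```python
def levenshtein(a, b):
    """Classical edit distance, computed by dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

# Both pairs descend from a common ancestor, yet share almost nothing:
print(levenshtein("jack", "hans"))   # related via John
print(levenshtein("joe", "pepe"))    # related via Josef
```

Three edits out of four (or fewer) characters means there is essentially no regular correspondence left to detect.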

Modeling word derivation with directed networks

That words do not evolve solely through the alteration of sounds, but also through different forms of derivation, is nothing new for historical linguistics. We face the problem, for example, when looking for etymologically related words in the basic lexicon of phylogenetically related languages. These phenomena, however, can easily be investigated with enhanced means of annotation. The evolution of person names, on the other hand, presents us with larger challenges.

While working as a research fellow in France in 2015-2016, I had the time to develop a small tool that allows us to represent derivational relations between related words with the help of a directed network, and thus to model these relations in a rough way. The graph is directed: the words are the nodes of the network, with edges drawn from the assumed ancestral word forms to their descendants. This tool, which I called DeriViz, is still available online, and makes it possible to visualize network relations between words.

I have now conducted a small experiment with this tool, taking the name variants of Elisabeth as they are listed in Wikipedia and trying to model them in a directed network, along with intermediate stages. You can easily do this yourself, by copying the network that I have constructed in text form below and pasting it into the data entry field on the DeriViz homepage. The network is visualized when you press the OK button, and you can play with it by dragging it around.
Elisabeth → BETT
BETT → Betty
BETT → Bettina
BETT → Bettine
BETT → Betsi
Elisabeth → ELISABETH
ELISABETH → Elise
ELISABETH → Elsbeth
ELISABETH → Else
ELISABETH → Elina
Elisabeth → ILSA
ILSA → Ilsa
ILSA → Ilse
Elisabeth → Isabella
Elisabeth → LISA
LISA → Lieschen
LISA → Liese
LISA → Liesel
LISA → Lis
LISA → Lisa
LISA → Lisbeth
LISA → Lisette
LISA → Lise
LISA → Liesl
Elisabeth → LILA
LILA → Lila
LILA → Liliane
LILA → Lilian
LILA → Lilli
Elisabeth → Sisi
I intentionally reduced the amount of data here, in order to make sure that the graphic can still be inspected. But it is clear that even this simple model, which assumes unique ancestor-descendant relations among all of the derived person names, is stretched to its limits when applied to names as productive as Elisabeth, at least as far as the visualization is concerned.
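A derivation network of this kind is easy to hold in code. The following Python sketch (a toy re-implementation for illustration, not DeriViz's actual code) parses a few of the "ancestor → descendant" lines above into a directed graph and lists all forms reachable from a given name:

```python
from collections import defaultdict

# A subset of the DeriViz-style input from above.
data = """
Elisabeth → LISA
LISA → Liese
LISA → Liesel
LISA → Lisbeth
Elisabeth → ILSA
ILSA → Ilse
"""

# Build the directed graph: ancestor form -> list of descendant forms.
graph = defaultdict(list)
for line in data.strip().splitlines():
    ancestor, descendant = (part.strip() for part in line.split("→"))
    graph[ancestor].append(descendant)

def descendants(name):
    """All forms reachable from a name by following derivation edges."""
    found = []
    for child in graph.get(name, []):
        found.append(child)
        found.extend(descendants(child))
    return found

print(descendants("Elisabeth"))
```

Distinguishing different processes (shortening, diminutive formation, borrowing) would then amount to adding a label to each edge, which is the extension suggested further below.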

Derivation network of names derived from Elisabeth


If you now imagine that there are various processes that turn an ancestral name into a descendant name, and that one would ideally want to model the differences between these processes as well, one can see easily that it is indeed not a trivial problem to model the evolution of person names (and we are not even speaking of inferring any of these relations).

How names evolve

Names evolve in various ways along different dimensions. With respect to their primary function, or their use, we find, among other things, nicknames. Formally, nicknames are often a short form of an original name, but depending on the speech community, there may also be a formal procedure by which a nickname is derived from a base name. Thus, every speaker of Russian knows that Jekaterina can be turned into Katerina, which can be turned into Katja, which can be turned into Katjuscha, or, in the vocative, into Katj. Once the primary function of a name changes, its form usually changes as well, as these examples show.

But the form can also change when a name crosses language borders. If you take your name to another country, and the speakers there have problems pronouncing certain sounds that occur in it, it is very likely that they will adjust your name's pronunciation to the phonetic needs of their own language. Names cross language borders very quickly, since we tend not to leave them at home when visiting or migrating to foreign countries. As a result, a great deal of the diversity of person names observed today is due to the migration of names across the world's larger linguistic communities.

How we change names when building short forms or nick names, or when trying to adapt a name to a given target language, depends on the structure of the language. The most important part is the phonology of the language in which the change happens. For example, when transferring a name from one language to another, and the new language lacks some of the sounds in the original name, speakers will replace them with those sounds which they perceive to be closest to the lacking ones.

But the modification is not restricted to the replacement of sounds. My own given name, Mattis, for example, usually has the stress on the first syllable, but in France, most people tend to call me Matisse, with the accent on the second syllable, reflecting the general tendency to stress the last syllable of a word in French. In Russian, on the other hand, Mattis could be perfectly pronounced, but since people do not know the name, they often confuse it with its variant Matthias, which then sounds like Matjes when pronounced in Russian (which is the name for soused herring in Germany). There are more extreme cases; and both English and German speakers are also good at drastically adjusting foreign names to the needs of their mother tongues.

It would be nice if it were possible to investigate the huge diversity in the evolution of person names more systematically. In principle, this should be possible. Starting from directed networks is definitely a good idea, but the model would probably have to be extended by distinguishing different types of graph edges. Even if a given selection of edge types did not handle all of the processes known to us, it might help to collect some primary data in the first place.

With a large enough set of well-annotated data, on the other hand, one might start to look into the development of algorithms that could infer derivation relationships between person names; or one could analyze the data and search for the most frequent processes of person name evolution. Many more analyses might be possible. One could see to which degree the processes differ across languages, or how names migrate from one language to another across times, usage types, and maybe even across fashions.

Outlook

I assume that the result of such a collection would be interesting not only for couples who are about to replicate themselves, but also for historical research and research in the field of cultural evolution. Whether such a collection will ever exist, however, seems less likely. The problem is that there are not enough scholars in the world who are interested in this topic, as one can see from the very small number of studies that have been devoted to the problem up to now (as one of the few exceptions known to me, compare the nice overview of person-name classification by Handschuh 2019). I myself would not be able to help in this endeavour, since I lack the scholarly competence for investigating name evolution. But I would surely like to investigate and inspect the results, if they ever become available.

Reference

Handschuh, Corinna (2019) The classification of names. A crosslinguistic study of sex-specific forms, classifiers, and gender marking on personal names. STUF — Language Typology and Universals 72.4: 539-572.

Myth of the impartial machine

In its inaugural issue, Parametric Press describes how bias can easily come about when working with data:

Even big data are susceptible to non-sampling errors. A study by researchers at Google found that the United States (which accounts for 4% of the world population) contributed over 45% of the data for ImageNet, a database of more than 14 million labelled images. Meanwhile, China and India combined contribute just 3% of images, despite accounting for over 36% of the world population. As a result of this skewed data distribution, image classification algorithms that use the ImageNet database would often correctly label an image of a traditional US bride with words like “bride” and “wedding” but label an image of an Indian bride with words like “costume”.

Click through to check out the interactives that serve as learning aids. The other essays in this first issue are also worth a look.


Game of Thrones death predictor

Monica Ramirez tried her hand at modeling deaths on Game of Thrones and predicting the next ones:

Since the series is so famous for killing principal characters (It's true! You can't have a favourite character because he/she would die, and slowly, other characters take the lead… and would probably die too), I decided to make a classification model in Python, to try to find any rule or pattern and discover: Who will die in this last season?

I’m always on a viewing delay with this stuff, so I’m not sure whether this is right or completely wrong, but there you go. The above shows the characters ordered by probability of death (not order in which they will die).


Keeping it simple in phylogenetics


This is a post by Guido, with a bit of help from David.

There's an old saying in physics, to the effect that: "If you think you need a more complex model, then you actually need better data." This is often considered to be nonsense in the biological sciences and the humanities, because the data produced by biodiversity is orders of magnitude more complex than anything known to physicists:
The success of physics has been obtained by applying extremely complicated methods to extremely simple systems ... The electrons in copper may describe complicated trajectories but this complexity pales in comparison with that of an earthworm. (Craig Bohren)
Or, more succinctly:
If it isn’t simple, it isn’t physics. (Polykarp Kusch)
So, in both biology and the humanities there has been a long-standing trend towards developing and using more and more complex models for data analysis. Sometimes, it seems like every little nuance in the data is important, and needs to be modeled.


However, even at the grossest level, complexity can be important. For example, in evolutionary studies, a tree-based model is often adequate for analyzing the origin and development of biodiversity, but it is inadequate for studying many reticulation processes, such as hybridization and transfer (either in biology or linguistics, for example). In the latter case, a network-based model is more appropriate.

Nevertheless, the physicists do have a point. After all, it is a long-standing truism in science that we should keep things simple:
We may assume the superiority, all things being equal, of the demonstration that derives from fewer postulates or hypotheses. (Aristotle)
It is futile to do with more things that which can be done with fewer. (William of Ockham) 
Plurality must never be posited without necessity. (William of Ockham) 
Everything should be as simple as it can be, but not simpler. (Albert Einstein)
To this end, it is often instructive to investigate your data with a simple model, before proceeding to a more complex analysis.

Simplicity in phylogenetics

In the case of phylogenetics, there are two parts to a model: (i) the biodiversity model (eg. chain, tree, network), and (ii) the character-evolution model. A simple analysis might drop the latter, for example, and simply display the data unadorned by any considerations of how characters might evolve, or what processes might lead to changes in biodiversity.

This way, we can see what patterns are supported by our actual data, rather than by the data processed through some pre-conceived model of change. If we were physicists, then we might find the outcome to be a more reliable representation of the real world. Furthermore, if the complex model and the simple model produce roughly the same answer, then we may not need "better data".


Modern-day geographic distribution of Dravidian languages (Fig. 1 of Kolipakam, Jordan, et al., 2018)

Historical linguistics of Dravidian languages

Vishnupriya Kolipakam, Fiona M. Jordan, Michael Dunn, Simon J. Greenhill, Remco Bouckaert, Russell D. Gray, and Annemarie Verkerk (2018: A Bayesian phylogenetic study of the Dravidian language family. Royal Society Open Science) dated the splits within the Dravidian language family in a Bayesian framework. Aware of uncertainty regarding the phylogeny of this language family, they constrained and dated several topological alternatives. Furthermore, they checked how stable the age estimates are when using different, increasingly elaborate linguistic substitution models implemented in the software BEAST2.

The preferred and unconstrained result of the Bayesian optimization is shown in their Figs 3 and 4 (their Fig. 2 shows the neighbour-net).

Fig. 3 of Kolipakam et al. (2018), constraining the North (purple), South I (red) and South II (yellow) groups as clades (PP := 1)
Fig. 4 of Kolipakam et al. (2018), result of the Bayesian dating using the same model but no constraints. The Central and South II groups are mixed up.

As you can see, many branches have rather low PP support, which is a common (and inevitable) phenomenon when analyzing non-molecular data matrices providing non-trivial signals. This is a situation where support consensus networks may come in handy, which Guido pointed out in his (as yet unpublished) comment to the paper (find it here).

On Twitter, Simon Greenhill (one of the authors) posted a Bayesian PP support network as a reply.

A PP consensus network of the Bayesian tree sample, probably the one used for Fig. 3 of Kolipakam et al. 2018, constraining the North, South I, and South II groups as clades (S. Greenhill, 23/3/2018, on Twitter).

Greenhill, himself, didn't find it too revealing, but for fans of exploratory data analysis it shows, for example, that the low support for Tulu as sister to the remainder of the South I clade (PP = 0.25) is due to a lack of decisive signal. In the case of the low support (PP = 0.37) for the North-Central clade, we face two alternatives: it is equally likely that the Central languages Parji and Ollari Gadba are related to the South II group, which forms a highly supported clade (PP = 0.95) including the third language of the Central group (one of the topological alternatives tested by the authors).

A question that pops up is: when we want to explore the signal in this matrix, do we need to consider complex models?

Using the simplest-possible model

The maximum-likelihood inference used here is naive in the sense that each binary character in the matrix is treated as an independent character. The matrix, however, represents a binary sequence of concepts in the lexica of the Dravidian languages (see the original paper for details).

For instance, the first, invariant, character encodes for "I" (same for all languages and coded as "1"), characters 2–16 encode for "all", and so on. Whereas "I" (character 1) may be independent from "all" (characters 2–16), the binary encodings for "all" are inter-dependent, and effectively encode a micro-phylogeny for the concept "all": characters 2–4 are parsimony-informative (ie. each splits the taxon set into two subsets with at least two members, and they are compatible with each other); the remainder are parsimony-uninformative (ie. unique to a single taxon).
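The binarization scheme can be illustrated with a small sketch. The language subset and word forms below are invented placeholders, not taken from the actual matrix; the informativeness check is the standard parsimony criterion (both states present in at least two taxa):

```python
# Toy binarization: each observed word form for a concept becomes one
# binary character (1 = the language uses that form). Forms are invented.
languages = ["Tamil", "Telugu", "Kannada", "Parji"]
forms_for_all = {               # which languages share which form of "all"
    "form_a": {"Tamil", "Kannada"},
    "form_b": {"Telugu", "Parji"},
    "form_c": {"Tamil"},        # a form unique to a single language
}

for form, users in forms_for_all.items():
    pattern = [1 if lang in users else 0 for lang in languages]
    # Parsimony-informative: both states (0 and 1) occur in at least
    # two languages each; a form unique to one language is uninformative.
    informative = 2 <= sum(pattern) <= len(pattern) - 2
    print(form, pattern, "informative" if informative else "uninformative")
```

Because every form of a concept becomes its own character, the characters for one concept are inter-dependent by construction, which is exactly the independence assumption the naive model ignores.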


The binary sequence for "All" defines three non-trivial splits, visualized as branches, which are partly compatible with the Bayesian tree; eg. Kolami groups with members of South I, and within South II we have two groups matching the subclades in the Bayesian tree.

Two maximum-likelihood analyses were run: the first used the standard binary model, Lewis' Mk (1-parameter) model, allowing for site-specific rate variation modelled with a Gamma distribution (option -m BINGAMMA). As in the case of morphological data matrices (or certain SNP data sets), and in contrast to molecular data matrices, most of the characters in linguistic matrices are variable (not constant). The lack of invariant sites may lead to so-called "ascertainment bias" when optimizing the substitution model and calculating the likelihood.

Hence, RAxML includes an option to correct for this bias in morphological or other binary or multi-state matrices. In the case of the Dravidian language matrix, four of the more than 700 characters (sites) are invariant; they were removed before rerunning the analysis with the correction applied (option -m ASC_BINGAMMA). The results of both runs are highly correlated: the Pearson correlation coefficient of the bipartition frequencies (bootstrap support, BS) is 0.964. Nonetheless, BS support for individual branches can differ by up to 20 (which may be a genuine or a random result; we don't know yet). The following figures show the bootstrap consensus networks for the standard analysis and for the analysis correcting for the ascertainment bias.
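The reported 0.964 is a plain Pearson coefficient computed over per-branch support values. As a sketch of how such a comparison works, here is the computation in plain Python; the support values below are invented for illustration, not the actual bipartition frequencies:

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length samples."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Invented bootstrap supports for the same branches under the two setups.
bs_standard = [95, 80, 49, 21, 40, 70]   # -m BINGAMMA
bs_corrected = [97, 85, 69, 18, 35, 72]  # -m ASC_BINGAMMA
print(round(pearson(bs_standard, bs_corrected), 3))
```

A high overall correlation, as here, can still hide differences of 20 support points on individual branches, which is why the per-branch comparison in the networks below is worth doing.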

Maximum likelihood (ML) bootstrap (BS) consensus network for the standard analysis. Green edges correspond to branches seen in the unconstrained Bayesian tree in Kolipakam et al. (2018, fig. 4), the olive edges to alternatives in the PP support network by S. Greenhill. Edge values show ML-BS support, and PP for comparison.

ML-BS consensus network for the analysis correcting for the ascertainment bias. BSasc annotated at the edges in bold font, with BSunc and PP (from the previous graph) provided for comparison. Note the higher tree-likeness of the graph.

Both graphs show that this character-naïve approach is relatively decisive, even more so when we correct for the ascertainment bias. The graphs show relatively few boxes, which refer to competing, tree-incompatible signals in the underlying matrix.

Differences involve Kannada, a language that is resolved as equally related to Malayalam-Tamil and Kodava-Yeruva (BSasc = 39/35 when correcting for ascertainment bias, but BSunc < 20/40 in the standard analysis); and Kolami, which is supported as sister to Koya-Telugu (BSasc = 69 vs. BSunc = 49) rather than to Gondi (BSasc < 20, BSunc = 21).

They also show that from a tree-inference point of view, we don't need highly sophisticated models. All branches with high (or unambiguous) PP in the original analysis are also inferred, and can be supported, using maximum likelihood with the simple 1-parameter Mk model. This also means that if the scoring includes certain biases, the models may not correct for them. At best, they help to increase the support and minimize the alternatives, although the opposite can also be true.

For relationships within the Central-South II clade (unconstrained and constrained analyses), the PP were low. The character-naïve maximum-likelihood analysis reflects some signal ambiguity, too, and its support values can occasionally be higher than the PP. BS > PP values are directly indicative of issues with the phylogenetic signal (eg. lack of discriminative signal, topological ambiguity), because in general PP tend to overestimate and BS to underestimate support. The only obvious difference is that maximum likelihood failed to provide support for the putative sister relationship between Ollari Gadba and Parji of the Central group.

The crux with using trees

When inferring a tree as the basis of our hypothesis testing, we do this under the assumption that a series of dichotomies can model the diversification process. Languages are particularly difficult in this respect, because even when we clean the data of borrowings, we cannot be sure that the formation of languages represents a simple split of one unit into two units. Support consensus networks based on the Bayesian or bootstrap tree samples can open a new viewpoint by visualizing internal conflict.

This tree-model conflict may be genuine. For example, as languages evolve and establish themselves, they may move closer to or farther from their respective sibling languages, and may have undergone some non-dichotomous sorting process. Alternatively, the conflict may be due to the character scoring, the way one transforms a lexicon into a sequence of (here) binary characters. The support networks allow us to explore these phenomena beyond the model question. Ideally, a BS of 40 vs. 30 means that 40% of the binary characters support the one alternative and 30% support the competing one.

In this respect, historical-linguistic and morphological-biological matrices have a lot in common. Languages and morphologies can provide tree-incompatible signals, or contain signals supporting different topologies. By mapping the characters onto the alternatives, we can investigate whether this is a genuine signal or one related to our character coding.

Mapping the binary sequences for the concept "all" (the example used above to illustrate the basic properties of the matrix; equalling 15 binary characters) onto the ML-BS consensus network. We can see that the concept's evolution is in pretty good agreement with the overall reconstruction. Two binaries support the sister relationship of the South II languages Koya and Telugu, and a third collects most members of the South I group. All other binaries are specific to one language and hence do not conflict with the edges in the network.