Birding and data visualization

Jer Thorp has combined birding and data visualization into a unique course called Binoculars to Binomials:

I dreamt up Binoculars to Binomials as a hybrid site of learning. It’s for coders who are interested in cultivating an observational practice, and for birders who want to dive into the rich pool of data that comes out of their hobby.

More broadly, it’s for anyone who’s interested in the overlap between nature, data and creativity.

Sounds good to me.

One of the best ways to learn how to visualize data is to apply it to a specific field. You figure out the mechanics and the context behind the data, which makes visualization meaningful and useful. In this case, you get your hands in all parts of the process.


Shifting bird populations

Using data from the crowdsourced database eBird, Harry Stevens mapped the shifts in bird populations for the Washington Post. Increased building and climate change have led to population declines for many species over the past decade, but some species, such as the blue jay, have seen growth.

Be sure to check out the interactive at the end that lets you search the full species list.

Diligent birders log data on eBird, which they can use to keep track of their own observations. Collectively, researchers can then generate reliable models with the data. The scale of this project continues to amaze.


Crows might understand probabilities

Researchers at the University of Tübingen are studying crows’ abilities to understand statistical inference. For Ars Technica, Kenna Hughes-Castleberry reports:

To do this, Johnston and her team began by training two crows to peck at various images on touchscreens to earn food treats. From this simple routine of peck-then-treat, the researchers significantly raised the stakes. “We introduce the concept of probabilities, such as that not every peck to an image will result in a reward,” Johnston elaborated. “This is where the crows learn the unique pairings between the image on the screen and the likelihood of obtaining a reward.” The crows quickly learned to associate each of the images with a different reward probability.

In the experiment, the two crows had to choose between two of these images, each corresponding to a different reward probability. “Crows were tasked with learning rather abstract quantities (i.e., not whole numbers), associating them with abstract symbols, and then applying that combination of information in a reward maximizing way,” Johnston said. Over 10 days of training and 5,000 trials, the researchers found that the two crows continued to pick the higher probability of reward, showing their ability to use statistical inference.


Database of feathers

There’s a database of feathers called Featherbase, because of course there is:

Featherbase is a working group of German feather scientists and other collectors worldwide who came together with their personal collections and created the biggest and most comprehensive online feather library in the world. Using our website, it is possible to identify feathers from hundreds of different species, compare similarities between them, work out gender or age-specific characteristics and look at the statistics of countless feather measurements.


Bird power rankings

Using data from Project FeederWatch, which is a community tracking project to count birds around feeders, Miller et al. estimated the pecking order among 200 species. This was in 2017. For The Washington Post, Andrew Van Dam and Alyssa Fowers worked with the researchers for an updated ranking using a more comprehensive dataset. The result is bird power rankings 2021 edition.


Bird migration forecast maps

BirdCast, from Colorado State University and the Cornell Lab of Ornithology, shows current forecasts for where birds are headed over the United States:

Bird migration forecasts show predicted nocturnal migration 3 hours after local sunset and are updated every 6 hours. These forecasts come from models trained on the last 23 years of bird movements in the atmosphere as detected by the US NEXRAD weather surveillance radar network. In these models we use the Global Forecasting System (GFS) to predict suitable conditions for migration occurring three hours after local sunset.


Bird flight patterns captured through long-exposure photography

For several years, Xavi Bou has been using long-exposure photography to capture stills of bird flight patterns. The project, Ornitographies, has produced gloriously abstract images. There’s also a video piece (above) built on the same premise.

Jessica McKenzie, reporting for Audubon:

More recently, Bou has expanded the project to video, including one called Murmurations that shows a flock of starlings evading a hawk. “What happens is, if in this moment a hawk appears to attack them, it’s when they do this dance,” he says. “The hawk is like carving this ephemeral sculpture that’s in the air.” As with the still images, Bou knit multiple series of photographs together to create an animation. He estimates that every day of filming requires two weeks of post-production work; for Murmurations, he also enlisted the help of a film editor. The final product, which was filmed in southern Catalonia, was then set to ethereal music.

The video deserves the full-screen treatment.

See also the swallows of essex by Dennis Hlynsky.


Large morphomatrices – trivial signal


In my last post about fossils, Farris and Felsenstein Zones, I gave an example of a trivial (signal-wise perfect) binary phylogenetic matrix, which will give us the true tree no matter which optimality criterion we use. In this post, we will look at a real-world example, a huge bird and theropod matrix.
S. Hartman, M. Mortimer, W. R. Wahl, D. R. Lomax, J. Lippincott, D. M. Lovelace
A new paravian dinosaur from the Late Jurassic of North America supports a late acquisition of avian flight. PeerJ 7: e7247.
What intrigued me about this particular paper (I have no idea about dinosaurs, but the documentation, pictures, data and presentation seem impeccable) was the following sentence:
The analysis resulted in >99999 most parsimonious trees with a length of 12,123 steps. The recovered trees had a consistency index of 0.073, and a retention index of 0.589.
What can you possibly do with strict consensus trees (Losing information in phylogenetic consensus) based on an unknown number of MPTs that have a CI converging to 0 (but an RI of 0.6; The curious case[s] of tree-like matrices with no synapomorphies)? And isn't this a case for some network-based exploratory data analysis?
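For readers who do not use these indices every day, here is a minimal sketch of how the ensemble consistency index (CI) and retention index (RI) are commonly defined: CI = m/s and RI = (g - s)/(g - m), with s the observed tree length, m the minimum and g the maximum conceivable number of steps summed over all characters. The function names are my own, and the values of m and g below are hypothetical, picked only so that the output matches the indices quoted above; only the tree length of 12,123 steps comes from the paper.

```python
def consistency_index(min_steps: int, observed_steps: int) -> float:
    """Ensemble CI: minimum conceivable steps divided by observed tree length."""
    return min_steps / observed_steps


def retention_index(min_steps: int, max_steps: int, observed_steps: int) -> float:
    """Ensemble RI: (g - s) / (g - m)."""
    return (max_steps - observed_steps) / (max_steps - min_steps)


TREE_LENGTH = 12_123  # reported in the paper
MIN_STEPS = 885       # hypothetical value, NOT taken from the paper
MAX_STEPS = 28_228    # hypothetical value, NOT taken from the paper

ci = consistency_index(MIN_STEPS, TREE_LENGTH)
ri = retention_index(MIN_STEPS, MAX_STEPS, TREE_LENGTH)
print(f"CI = {ci:.3f}, RI = {ri:.3f}")
```

A CI this low means that the tree needs roughly 14 times more changes than the theoretical minimum, while the RI shows that a fair share of the potential shared-derived signal is still retained on the tree.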

The complete matrix has 501 taxa and 700 characters (the largest plant morphological matrices have hardly more than 100 characters) but also a gappyness of 72%. In this case, 255,969 of the 353,500 cells in the matrix are ambiguous or undefined (missing). The matrix is a (rich) Swiss cheese with very big holes. The high number of MPTs is hence not surprising, and neither is the low CI.

Why run elaborate tree inferences on such a Swiss-cheese matrix? One answer is that (some) vertebrate palaeophylogeneticists are convinced that few-taxa, many-character matrices can lead to wrong clades (clades that are not monophyletic), and that each added taxon, no matter how many characters can be scored, will lead to a better tree by eliminating (parsimony) branching artefacts (see Q&A to the paper). At least 56 of the 501 taxa have 5% or fewer defined characters; still, with 700 characters, 5% equals up to 35 defined traits, which is more than we can recruit for most plant fossils. The median missing data proportion is 74%; more than half of the taxa are scored for less than 26% (< 182 out of 700) of the characters. Can such taxa really save the all-inclusive tree from branching artefacts, or is the high number of MPTs an indication of signal conflicts and data-gap issues?
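The gappyness figures above are simple cell counts. As a rough sketch, assuming the matrix is held as a mapping from taxon name to a string of character states and that "?" and "-" mark missing or inapplicable cells (both are my assumptions, as are the toy taxa), the overall and per-taxon proportions can be computed like this:

```python
from statistics import median

MISSING = {"?", "-"}  # assumed symbols for undefined cells


def defined_fraction(row: str) -> float:
    """Share of characters that are scored for one taxon."""
    return sum(state not in MISSING for state in row) / len(row)


def gappyness(matrix: dict) -> float:
    """Share of undefined cells in the whole matrix."""
    cells = [state for row in matrix.values() for state in row]
    return sum(state in MISSING for state in cells) / len(cells)


# Toy matrix, invented for illustration only.
toy = {
    "taxon_A": "0101100110",
    "taxon_B": "01?11001?0",
    "taxon_C": "??????1?0?",
    "taxon_D": "1?0?-?????",
}

overall = gappyness(toy)
coverage = {taxon: defined_fraction(row) for taxon, row in toy.items()}
sparse = [taxon for taxon, share in coverage.items() if share <= 0.2]
median_missing = median(1 - share for share in coverage.values())

# The cell counts quoted in the text give the reported gappyness.
quoted = 255_969 / 353_500
print(f"toy: {overall:.0%} missing, sparse taxa: {sparse}")
print(f"quoted counts: {quoted:.1%} missing")
```

The same per-taxon coverage values can be used to pick a data-dense subset, such as the taxa with less than 15% gappyness used in the next section.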

For this post, we will just look at the tip of the iceberg. What is the signal from the 700 characters to start with?

The basic signal

Here's the heat map for the 19 taxa that have a gappyness of less than 15% (i.e. at least 595 of 700 possible characters are defined). The taxon order is mostly the one from the original matrix, sorted by phylogenetic groups; for more orientation, I added next-inclusive superclass "Clades" from Wikipedia (so I apologize for any errors).


In my last post, I showed that evolutionary lineages (and monophyly) can be directly deduced from such a heat map following the simple logic: two taxa sharing a (direct) common origin are usually more similar to each other than to a third, fourth etc. taxon not part of the same lineage. Exceptions include fossils close to the last common ancestors lacking advanced traits.
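A heat map of this kind is just a coloured pairwise distance matrix. The sketch below assumes a plain mean character difference (the share of differing states among the characters scored in both taxa); this is my choice for illustration, not necessarily the distance used for the figure, and the toy taxa and states are invented.

```python
from itertools import combinations

MISSING = {"?", "-"}  # assumed symbols for undefined cells


def mean_character_distance(a: str, b: str) -> float:
    """Share of differing states among characters scored in both taxa."""
    shared = [(x, y) for x, y in zip(a, b) if x not in MISSING and y not in MISSING]
    if not shared:
        raise ValueError("no characters scored in both taxa")
    return sum(x != y for x, y in shared) / len(shared)


def distance_matrix(matrix: dict) -> dict:
    """Pairwise distances, keyed by (taxon, taxon) tuples."""
    return {
        (s, t): mean_character_distance(matrix[s], matrix[t])
        for s, t in combinations(matrix, 2)
    }


# Toy data, invented for illustration only.
toy = {
    "outgroup": "0000000000",
    "fossil_1": "1100000?00",
    "bird_1":   "1111110011",
    "bird_2":   "11111?1011",
}

dist = distance_matrix(toy)
closest = min(dist, key=dist.get)
for pair, d in sorted(dist.items(), key=lambda item: item[1]):
    print(pair, round(d, 3))
```

In the toy data the two "birds" form the closest pair and the outgroup is most distant from both, which is the pattern one reads off the heat map: low distances within a lineage, increasing distances towards the outgroup.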

The outgroup taxa as used (in this taxon sample: Allosaurus to Tyrannosaurus) are most similar to each other but not monophyletic. One (Allosaurus) represents the sister lineage of the lineage that led to the birds (Coelurosauria: Tyrannoraptora), the other an early split within it. The extinct (monophyletic) families (Tyrannosauridae, Ornithomimidae, Dromaeosauridae) are, however, well visible, being defined by low intra-family and higher inter-family pairwise distances. The same is true for the direct relatives (Clade Ornithurae) of modern birds (class Aves).

Very typical for such datasets is the increasing distance between the (primitive?) outgroups and the most derived, modern-day taxa (living birds: Struthio – ostrich, Anas – duck, Meleagris – turkey). Closest relatives in the taxon set, phylogenetically and time-wise, are (much) more similar than distant ones. Allosaurus may be most similar to the tyrannosaurs, not because of common ancestry but because both are scored as being primitive with respect to the group of interest.

The only tree

This situation becomes very obvious from the only possible (single-optimal) tree that can be inferred from this matrix, when visualized as a phylogram (Stop using cladograms!)

The ML, MP and LS/NJ trees, overlaid and scaled to equal root (first split within Tyrannoraptora) to tip (split between Anas and Meleagris) distance (phylogenetic distance, via the tree). Pink, the LS clade conflicting with the ML and MP trees, and Wikipedia's tree(s).

No matter which optimisation criterion is used (here Least-Squares via Neighbor-joining, Maximum Parsimony, Maximum Likelihood), the result is the same. The only exception is that the NJ/LS tree places Archaeopteryx as sister to Dromaeosauridae; and the relative branch lengths of roots vs. tips also differ.

Because our matrix has favorable properties (few taxa, many defined characters), it's straightforward to establish branch support. This is a bit frowned upon in palaeontological circles, but having dealt with morphological evolution in cases where we have molecular data, I want to know how robust my clades are, and what the alternatives may be, before I conclude that they reflect monophyly. Bootstrapping coupled with consensus networks is a quick and simple way to test robustness and investigate ambiguous support (Connecting tree and network edges).
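At its core, non-parametric bootstrapping means resampling the characters (matrix columns) with replacement until the pseudoreplicate has as many columns as the original. The analyses here were run in PAUP* and RAxML; the following is only a sketch of the resampling step, with a function name and toy matrix of my own.

```python
import random


def bootstrap_replicate(matrix: dict, rng: random.Random) -> dict:
    """Draw one pseudoreplicate by sampling columns with replacement."""
    n_chars = len(next(iter(matrix.values())))
    columns = [rng.randrange(n_chars) for _ in range(n_chars)]
    return {taxon: "".join(row[i] for i in columns) for taxon, row in matrix.items()}


# Toy matrix, invented for illustration only.
toy = {
    "taxon_A": "0101100110",
    "taxon_B": "0111100100",
    "taxon_C": "1100011010",
}

rng = random.Random(42)  # fixed seed so the sketch is repeatable
replicates = [bootstrap_replicate(toy, rng) for _ in range(100)]
print(replicates[0])
```

Each pseudoreplicate is then analysed with the same method as the original matrix (NJ/LS, MP or ML), and the clades of the resulting trees are tallied. If a clade hinges on a few characters, it will be missing from every pseudoreplicate that happens not to draw them.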

The BS support consensus networks for NJ/LS and ML have only a single reticulation each.

Rooted support consensus networks based on the NJ/LS (10,000 pseudoreplicates, PAUP*) and ML bootstrap (100, number of necessary replicates determined by the bootstop criterion implemented in RAxML) samples. Only splits that occurred in at least 15% of the BS pseudoreplicates are shown.
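The input for such a support consensus network is a table of clade (split) frequencies across the bootstrap trees, cut at a threshold. A minimal sketch, assuming each rooted tree has already been reduced to the set of clades it contains; the representation, the function names and the toy trees are mine and not the output of PAUP* or RAxML.

```python
from collections import Counter


def split_support(trees: list) -> dict:
    """Fraction of trees in which each clade occurs."""
    counts = Counter(clade for tree in trees for clade in tree)
    return {clade: n / len(trees) for clade, n in counts.items()}


def splits_to_show(trees: list, threshold: float = 0.15) -> dict:
    """Keep only clades found in at least `threshold` of the trees."""
    return {c: f for c, f in split_support(trees).items() if f >= threshold}


def clade(taxa: str) -> frozenset:
    """Helper: 'AB' -> frozenset({'A', 'B'})."""
    return frozenset(taxa)


# Toy sample of 20 bootstrap trees on taxa A to E, invented for illustration.
trees = (
    [{clade("AB"), clade("ABC")}] * 13
    + [{clade("AB"), clade("CD")}] * 5
    + [{clade("AB"), clade("ABD")}] * 2
)

shown = splits_to_show(trees)
for c, f in sorted(shown.items(), key=lambda item: -item[1]):
    print("".join(sorted(c)), f"{f:.0%}")
```

In the toy sample, clades ABC and CD both pass the 15% cut although they cannot occur in the same tree. A consensus network draws such a pair as a box, whereas a consensus tree would have to drop one of them or collapse the branch.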

The MP BS support consensus network, however, has many more reticulations.

Rooted MP-BS support consensus network (10,000 BS pseudoreplicates, PAUP*). Green — edge bundles corresponding to clades in the all-optimal tree(s); orange — less supported conflicting alternatives; red – higher supported conflicting alternatives; pink – wrong clade in NJ/LS tree.

We can make two generally relevant observations here:
  1. The wrong Archaeopteryx-Dromaeosauridae clade (pink edge/branch) masks split bootstrap support under NJ: 68 for the wrong clade, 31 for the right one. While resampling under ML appears to be inert to this conflict, MP is not.
  2. The NJ and ML support networks are very tree-like: all clades in the inferred tree have high to unambiguous support, and the two networks are near-congruent. The MP network is much more boxy. In some cases the split in agreement with the all-optimal tree has a lower BS support than an alternative (here usually in conflict with the gold tree).
Similar observations can be made with other data sets: although NJ/LS and ML optimisation are fundamentally different (distance- vs. character-based, equal change vs. varying probability of change), they show more agreement with each other when it comes to supporting a topology (or topological alternatives) than MP (character-based like ML, but all changes are treated as equal like NJ/LS). MP is a very conservative approach, highly dependent on possibly a few discerning characters. If they are missing from the BS pseudoreplicate, the backbone tree collapses or changes, and BS values may decrease rapidly. This is so even for a very data-dense matrix like the one used here (few taxa, many characters, low gappyness).

On the positive side, we can expect that MP will produce fewer false positives. On the negative side, it is also more dependent on character coverage, and will produce many more false negatives. Any fossil lacking the crucial characters (or showing too few of them) may still be resolved (placed and supported) under NJ/LS and ML but not using MP. When inferring trees, these fossils will quickly increase the number of MPTs and decrease branch support for the part of the tree they interact with. Personally, given how hard it can be to place a fossil per se with the data at hand, I have always preferred a method that can give some result and point towards possible alternatives (even at the risk of including erroneous ones), rather than no result at all.

The simplest of networks

Naturally, we can use the distance matrix directly to infer a Neighbor-net, and explore the basic differentiation signal beyond trees but also with regard to the all-optimal tree.

Neighbor-net based on the pairwise distance matrix. Coloration highlights edges found (or not) in the optimised trees.

The Neighbor-net recovers the clades from the all-optimal tree (green, purple the NJ/LS-unique branch), but shows additional edges (orange). The principal signal in the data has, for instance, problems with placing Archaeopteryx, because it is (signal-wise) intermediate between the Avebrevicauda, the lineage including modern birds, and the Dromaeosauridae, their sister lineage (note that the vertebrate fossil record is considered to be free of ancestors and precursors; all fossils represent extinct sister lineages, i.e. evolutionary dead-ends). Skeleton IGM 100042 (an oviraptorid), placed as sister to both in the all-optimal tree, also lacks obvious affinities: this is a taxon where the tree inference makes a decision that is not based on a trivial signal encoded in the matrix.

The central boxy part of the Neighbor-net correlates with the 2/3-dimensional part of the parsimony BS consensus network: to resolve these relationships, we need a large set of characters (under MP). On the other hand, recognizing the Ornithurae, members of an extinct family, or a relative of IGM 100042, should be straightforward even with a limited amount of defined characters. Based on the Neighbor-net, which is inferred in a blink no matter how large the matrix, we can also make a decision, as to which taxa interfere and which ones facilitate tree-inferences. The more tree-like the Neighbor-net graph becomes, the easier it is for a tree inference to be made.

Placing fossils, quickly and easily

Using this backbone graph, it is easy to assess in which phylogenetic neighborhood a newly coded fossil falls, e.g. Hesperornithoides, the fossil newly described in Hartman et al. and scored for 267 unambiguously defined traits.

Neighbor-net including Hesperornithoides.

Hesperornithoides is obviously a member of the Eumaniraptora (= Paraves), morphologically somewhat intermediate between the Avialae, the "flying dinosaurs", and Dromaeosauridae, but doesn't seem to be part of either of these sister lineages. The graph lacks a prominent neighborhood; the Archaeopteryx-Bambiraptor neighborhood may reflect local long-edge attraction (note the long terminal edges) or convergent evolution in both taxa and, possibly, also the Hesperornithoides lineage. Just based on this simple and quick-to-infer network, Hartman et al.'s title "A new paravian dinosaur from the Late Jurassic of North America supports a late acquisition of avian flight" appears to be correct (in future posts, we may come back to this morphological supermatrix to see what else networks could have quickly shown).
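Even before drawing a network, a first impression of where a new fossil falls can be had by ranking the backbone taxa by their distance to it, using only the characters scored in both. The sketch below does just that; it is not a Neighbor-net, the mean character difference is my assumed distance, and all taxon names and states are invented.

```python
MISSING = {"?", "-"}  # assumed symbols for undefined cells


def distance_and_overlap(a: str, b: str):
    """Mean character difference and number of characters scored in both rows."""
    shared = [(x, y) for x, y in zip(a, b) if x not in MISSING and y not in MISSING]
    if not shared:
        return None, 0
    return sum(x != y for x, y in shared) / len(shared), len(shared)


def nearest_neighbours(new_row: str, backbone: dict) -> list:
    """Backbone taxa sorted by distance to the new taxon (closest first)."""
    ranked = []
    for taxon, row in backbone.items():
        d, n_shared = distance_and_overlap(new_row, row)
        if d is not None:  # skip taxa without any overlapping characters
            ranked.append((d, n_shared, taxon))
    return sorted(ranked)


# Toy backbone and new fossil, invented for illustration only.
backbone = {
    "lineage_X_1": "0011001100",
    "lineage_X_2": "0011001010",
    "lineage_Y_1": "1100110011",
}
new_fossil = "00?10?10??"

ranked = nearest_neighbours(new_fossil, backbone)
for d, n_shared, taxon in ranked:
    print(f"{taxon}: distance {d:.2f} over {n_shared} shared characters")
```

Reporting the number of shared characters next to each distance matters for poorly scored fossils: a small distance based on a handful of characters is much weaker evidence than the same distance based on a few hundred.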

One should be willing to leave the phylogenetic beaten track, i.e. relying on strict consensus parsimony trees as the sole basis for phylogenetic hypotheses. The Neighbor-net is a valuable tool for quick pre- and post-analysis because it can:
  • visualize how coherent the clades in our trees are, 
  • show how easy it will be for the tree inference (especially MP) to find and support clades, 
  • help to differentiate ambiguous from important taxa, and finally, 
  • assess whether a new fossil really requires an in-depth re-analysis of the full matrix (and dealing with >99,999 MPTs) instead of using a more focussed taxon (and character) set.

Nature Observations. January 1, 2020. #iNaturalist #Birds #BirdPhotography

So one of my New Year's resolutions is to try to make and post nature observations of some kind every day this year. I am hoping to post pictures and also post observations to iNaturalist.

On January 1, I made some observations in my yard. I posted some pics to iNaturalist: https://www.inaturalist.org/calendar/phylogenomics/2020/1/1.

I am trying to just copy straight from that iNaturalist page into this blog - not sure how well that will work.


8 taxa (8 birds)







See below for embedded versions of these posts though I really do not like the way the embed works right now.




Trail desk at the Davis Wetlands

So I am just finally coming out of a rough patch of about three weeks with an infection or series of infections. Still have a bad cough and my brain is in a fuzz.  But I felt physically a bit better today.  As I did not want to risk spreading my cooties to people at work, I stayed away from the office and ended up scheduling four phone calls / video chats today.  Two of them were this AM and so, since I did not feel horribly bad, I decided to do the AM calls while at one of my trail desks.

My "trail desk" is the term I use to refer to doing calls while outdoors.  Today I went to the Davis Wetlands since they are only open on Mondays in the Winter.  And I did two phone calls while walking around there and taking pictures.

As usual I posted entries to iNaturalist: https://www.inaturalist.org/calendar/phylogenomics/2019/12/16

Best part of the outing.  Lots of Sandhill Cranes migrating overhead.   So if you were on one of these calls and could hear the background when I was not on mute,  you may have been hearing the Sandhill Cranes.  Or the geese.  Or me coughing (yes I am not totally better).



And I posted some of the better pics to Smugmug.