Exploring the relationship between gender and author order and composition in NIH-funded research

Last week there was a brief but interesting conversation on Twitter about the practice of “co-first” authors on scientific papers that led me to do some research on the relationship between author order and gender using data from the NIH’s Public Access Policy.

I want to note at the outset that this is my first foray into analyzing this kind of data, so I would love feedback on the data, analyses and findings, especially links to other work on the subject, as I know some of these issues have been addressed elsewhere.

A long post follows, but here are some main things I found:

  • The number of female authors falls off as you go down the list of authors of a paper, with fewer than 30% of senior authors female.
  • Contrary to my expectation, there doesn’t seem to be a bias to put the male author first when there are male-female co-first author pairs.
  • There are, however, far fewer male-female co-first author pairs than there should be based on the number of male and female first and second authors.
  • The same thing holds true more generally for first-second author pairs. There is a deficit of cross gender pairs and a surplus of same gender pairs.
  • Part (and maybe most) of this effect is due to an overall skew in gender composition of authors on papers.
  • If you are female, there is a 45% chance that a random co-author on one of your papers is female. If you are male, there is only a 35% chance that a random co-author on one of your papers is female.

Before I explain how I got all this, let me start with a quick explainer about how to parse the list of authors on a scientific paper.

By convention in many scientific disciplines (including biology, which this post is about), the first position on the author list of a paper goes to the person who was most responsible for doing the work it describes (typically a graduate student or postdoc) and the last position to the person who supervised the project (typically the person in whose lab the work was done). If there are more than two authors an effort is made to order them in rough relationship to their contributions from the front, and degree of supervision from the back.

Of course a single linear ordering cannot do justice to the complexity of contribution to a scientific work, especially in an era of increasingly collaborative research. One can imagine many better systems. But, unfortunately, author order is currently the only way that the relative contributions of different people to a work are formally recorded. And when a scientist’s CV is being scrutinized for jobs, grants, promotions, etc… where they are in the author order matters A LOT – you only really get full credit if you are first or last.

Because of the disproportionate weight placed on the ends of the author list, these positions are particularly coveted, and discussions within and between labs about who should go where, while sometimes amicable, are often difficult and contentious.

In recent years it has become increasingly common for scientists to try and acknowledge ambiguity and resolve conflicts in author order by declaring that two or more authors should be treated as “co-first authors” who contributed equally to the work, marking them all with a * to designate this special status.

But, as the discussion on Twitter pointed out, this is a bit of a ruse. First is still first, even if it’s first among equals (the most obvious manifestation of this is that people consider it to be dishonest to list yourself first on the author list on your CV if you were second with a * on the original paper).

Anyway, during this discussion I began to wonder about how the various power dynamics at play in academia played out in the ordering of co-equal authors. And it seemed like an interesting opportunity to actually see these power dynamics at play, since the * designation indicates that the contributions of the *’d authors were similar and therefore any non-randomness in the ordering of *’d authors with respect to gender, race, nationality or other factors likely reflects biases or power asymmetries.

I’m interested in all of these questions, but the one that seemed most accessible was to look at the role of gender. There are probably many ways to do this, but I decided to use data from PubMed Central (PMC), the NIH’s archive of full-text scientific papers. Papers in PMC are available in a fairly robust XML format that has several advantages over other publicly available databases: 1) full names of authors are generally provided, making it possible to infer many of their genders with a reasonable degree of accuracy, and 2) co-first authorship is listed in the file in a structured manner.

I downloaded two sets of papers from PMC: 1,355,350 papers in their “open access” (OA) subset, which contains papers from publishers like PLOS that allow the free text to be redistributed and reused, and 424,063 papers from the “author manuscript” (AM) subset, which contains papers submitted as part of the NIH’s Public Access Policy. These papers are all available here.

I then wrote some custom Python scripts to parse the XML, extracting from each paper the author order, the authors’ given names and whether or not they were listed as “co-first” or “equal” authors (this turned out to be a bit trickier than it should have been, since the encoding of this information is not consistent). I will comment up the code and post it here ASAP.
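Until the real scripts are posted, here is a minimal sketch of the kind of extraction described above. It is not the code used for this analysis. It assumes the author list is encoded as JATS-style `contrib` elements with a `given-names` child, and that equal contribution is flagged with the `equal-contrib="yes"` attribute. As noted, real files encode this inconsistently (often in footnotes instead), so this covers only the simplest case. The sample record and the function name `extract_authors` are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Made-up record in the general shape of a JATS author list; not a real paper.
SAMPLE = """<article><front><article-meta><contrib-group>
  <contrib contrib-type="author" equal-contrib="yes">
    <name><surname>Doe</surname><given-names>Jane A.</given-names></name>
  </contrib>
  <contrib contrib-type="author" equal-contrib="yes">
    <name><surname>Roe</surname><given-names>John</given-names></name>
  </contrib>
  <contrib contrib-type="author">
    <name><surname>Poe</surname><given-names>Maria</given-names></name>
  </contrib>
</contrib-group></article-meta></front></article>"""


def extract_authors(xml_text):
    """Return [(given_names, is_equal_contributor), ...] in author-list order."""
    root = ET.fromstring(xml_text)
    authors = []
    # iter() walks the tree in document order, which preserves author order.
    for contrib in root.iter("contrib"):
        if contrib.get("contrib-type") != "author":
            continue
        given = contrib.findtext("name/given-names")
        if given is None:
            # Skip entries with no personal name (e.g. consortium authors).
            continue
        # Assumption: only the attribute-based encoding is handled here.
        is_equal = contrib.get("equal-contrib") == "yes"
        authors.append((given.strip(), is_equal))
    return authors
```

Calling `extract_authors(SAMPLE)` yields one tuple per author, with the first two flagged as equal contributors.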

I looked at several options for inferring an author’s gender from their given name, recognizing that this is a non-trivial challenge, with many potential pitfalls. I found that a program called genderReader, recommended by Philip Cohen, worked very well. It’s a bit out of date, but names don’t change that quickly, so I decided to use it for my analyses.

I parsed all the files (a bit of a slow process even on a fast computer) and started to look at the data. I’m going to focus on the AM subset here first, because these are all NIH funded papers and thus mostly from the US, so intercountry differences in authorship practices won’t confound the analyses, and because the set is likely more representative of the universe of papers as a whole than is the OA subset. I will try to note where these two datasets agree and disagree.

Of the 424,063 papers in AM, there are 2,568,858 total authors, with a maximum of 496 authors on a single paper and a wide distribution of author counts.

[Figure: Author number histogram]

There are 219,559 unique given names (including first name + middle initials), of which about 75% were classified successfully by genderReader as male, mostly male, female, mostly female or unisex. About 25% were not in their database. For the purpose of these analyses, I treated mostly male as male and mostly female as female. I’m sure there are some errors in this process, but I looked over a reasonable subset of the calls and the only clear bias I saw was that it didn’t do a good job of classifying Asian names – treating most of them as unisex, and thereby excluding them from my analysis. Altogether there were 1,206,616 male authors, 737,424 female authors and 624,818 who weren’t classified. Of the authors who were classified, 62% were male.
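The collapsing of categories described above can be sketched as follows. This is only an illustration: `NAME_LABELS` is a tiny hypothetical stand-in for the name database, not genderReader's actual data or interface, and matching on the first token of the given name is my simplifying assumption here.

```python
# Hypothetical stand-in for the name database; labels are for illustration only.
NAME_LABELS = {
    "maria": "female",
    "john": "male",
    "andrea": "mostly female",
    "jean": "mostly male",
    "kim": "unisex",
}

# "mostly male" is treated as male and "mostly female" as female.
# "unisex" and names missing from the database are left unclassified.
COLLAPSE = {
    "male": "M",
    "mostly male": "M",
    "female": "F",
    "mostly female": "F",
}


def infer_gender(given_names):
    """Return 'M', 'F', or None when the name cannot be classified."""
    tokens = given_names.split()
    if not tokens:
        return None
    first = tokens[0].strip(".").lower()  # assumption: use the first token only
    return COLLAPSE.get(NAME_LABELS.get(first))


# Share of classified authors who are male, from the counts reported above.
male, female = 1_206_616, 737_424
male_share = male / (male + female)  # about 0.62
```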

Of the 424K papers, 32,304 contained co-equal authors, and 28,184 contained two or more co-first authors (assessed by asking if the co-equal authors were at the beginning of the author list). Of these, 85% (24,087) had exactly two co-first authors and 12% (3,285) had three co-first authors (one had 20 co-first authors, which I’m just going to leave here for discussion). I decided to use only those with exactly two co-first authors for the next set of analyses.
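The "co-equal authors at the beginning of the author list" test can be expressed as a short function. This is a sketch of one reasonable reading of that rule, not the original code: it takes a list of per-author equal-contribution flags in author order and counts the unbroken run of flagged authors at the front. Treating a lone flagged first author as "no co-first authors" is my assumption.

```python
def count_co_first(equal_flags):
    """Count equal-contribution authors in an unbroken run at the start of the list.

    Returns 0 when fewer than two authors lead the list with the flag set,
    since a single starred author has nobody to share first position with.
    """
    run = 0
    for flag in equal_flags:
        if not flag:
            break
        run += 1
    return run if run >= 2 else 0
```

Under this reading, a paper whose flags are `[True, True, False]` has two co-first authors, while `[False, True, True]` has co-equal authors but no co-first authors.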

There were a total of 11,340 papers with exactly two co-first authors both of whose genders were inferred. Of these, the author order counts were as follows:

                 Count   Percent
Male-Male         4286      37.8
Male-Female       2479      21.9
Female-Male       2399      21.1
Female-Female     2176      19.2

I will admit I expected to see a lot more papers with Male-Female than Female-Male orders amongst two co-first authors. That is, however, not what the data show.

However, that doesn’t mean there’s not something interesting going on with gender here. First, there are obviously a lot more male authors than female authors. In this set of papers, only 40.3% of authors in position 1 and 41.0% in position 2 are female. Given this you can easily calculate the expected number of MM, MF, FM and FF pairs there should be.

                 Expected   Observed
Male-Male            3994       4286
Male-Female          2776       2479
Female-Male          2696       2399
Female-Female        1874       2176

Although there doesn’t seem to be a bias in favor of M-F over F-M, there are significantly (p << .0000000001 by Chi-square) fewer mixed gender co-first author pairs than you’d expect given the overall number of male and female co-first authors.
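For anyone who wants to check this kind of calculation, here is a sketch using only the observed counts from the table above. Expected counts come from multiplying the position 1 and position 2 gender frequencies, and the test statistic is Pearson's chi-square. I am assuming one degree of freedom (a 2x2 test of independence), which lets the p-value be computed with `math.erfc` from the standard library. Computed this way, the expected counts come out at roughly 3,988, 2,777, 2,697 and 1,878, within a few counts of the table, so the table was presumably built from rounded or slightly different base frequencies.

```python
import math

# Observed (position 1, position 2) counts for two-co-first-author papers.
OBSERVED = {
    ("M", "M"): 4286,
    ("M", "F"): 2479,
    ("F", "M"): 2399,
    ("F", "F"): 2176,
}


def expected_counts(observed):
    """Expected pair counts if gender at position 1 and position 2 were independent."""
    total = sum(observed.values())
    first = {g: sum(n for (a, _), n in observed.items() if a == g) for g in "MF"}
    second = {g: sum(n for (_, b), n in observed.items() if b == g) for g in "MF"}
    return {(a, b): first[a] * second[b] / total for a in "MF" for b in "MF"}


def chi_square_independence(observed):
    """Return (statistic, p_value), assuming a 2x2 table with 1 degree of freedom."""
    expected = expected_counts(observed)
    stat = sum((observed[k] - expected[k]) ** 2 / expected[k] for k in observed)
    # For 1 degree of freedom, P(X > stat) = erfc(sqrt(stat / 2)).
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value


statistic, p_value = chi_square_independence(OBSERVED)
```

The same two functions can be applied unchanged to the first-second author counts for all papers.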

What can explain this? Are young scientists less likely to collaborate across gender lines than within them? Are male and female pairs better able to resolve their authorship disputes, and are thus underrepresented amongst co-first authors? Or are there fewer opportunities for them to collaborate because of biased lab compositions?

First I wanted to ask if there was a similar bias if we looked at all papers, not just the relatively rare co-first author papers. Here is the fraction of female authors by position in the author list for all papers (excluding the last author for now).

[Figure: Author gender by position]

Female authors are most common in the first author position and they are increasingly less represented as you go back in the author order. Maybe this has to do with the well documented problem of academia driving out women between graduate school and faculty position. So next I asked what fraction of senior authors are women.

[Figure: Gender by senior author position]

Yikes. Only 28% of senior authors of NIH author manuscripts are female compared to 46% of first authors. That’s horrible.

So what about the question from above? Are mixed gender first and second author pairs less common across all papers, not just co-firsts? The answer is yes.

                 Expected   Observed
Male-Male           60052      66807
Male-Female         48874      42120
Female-Male         51623      44869
Female-Female       42014      48769

Again, there are lots of possible explanations for this, but I was curious about the effect of biased lab composition (if the gender composition of labs is skewed away from parity then you’d expect more same gender author pairs). It’s hard to look at this directly with this data, but if one were going to guess at a covariate for skewed lab gender it would be the gender of the PI, and this I can look at with this data.

So, I next broke the data down by the gender of the senior author.

[Figure: Author gender by PI gender]

And in tabular form since the data are so striking.

Position (% female)   PI Female   PI Male
1st                      56.3 %    41.0 %
2nd                      53.0 %    40.6 %
3rd                      50.6 %    40.0 %
4th                      48.5 %    39.0 %
5th                      45.1 %    37.1 %

This data very strongly suggests that women are more likely to join labs with female PIs and men more likely to join labs with male PIs. But it doesn’t say why. It could be that people simply choose labs with a PI of their gender, or that PIs select people of the same gender for their labs. This could have to do with direct gender bias, or with lab style or many other things. Or it could be that there’s a hidden field effect here – that different fields have different gender biases, which would drive the gender distribution of labs on average away from parity.

But whatever the reason it’s a clear confounding factor in looking at gender and authorship. Interestingly, the bias against mixed gender first and second authorship is still there (p-values << .0000000001) even if you control for the gender of the PI.

Next I asked if we could detect a skew in the gender composition of the entire author list of papers. So I took sets of papers with number of authors ranging from 2 to 8 (these are the ones for which we have enough data), filtered out papers where one or more authors didn’t have an inferred gender, and compared the distribution of the number of female authors to that expected by the frequency of male and female authors at each position. There is very consistently a skew towards the extremes, with a significant excess in every case of papers with authors of one gender.
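The expected distribution in this comparison can be computed as follows, as a sketch under the assumption that author positions are independent. Given the probability that the author at each position is female, the number of female authors on a paper follows what is sometimes called a Poisson binomial distribution, which can be built up one position at a time. The function names are mine, and the example probabilities (0.46 and 0.28) are simply the overall first and senior author figures quoted earlier, used for illustration rather than the per-paper-size frequencies the analysis would need.

```python
def female_count_distribution(p_female_by_position):
    """Return [P(0 female), P(1 female), ..., P(n female)] for an n-author paper.

    Assumes the gender at each position is independent of the others, which
    is exactly the null hypothesis being tested against the observed counts.
    """
    dist = [1.0]  # with zero authors, there are zero female authors for sure
    for p in p_female_by_position:
        new = [0.0] * (len(dist) + 1)
        for k, prob in enumerate(dist):
            new[k] += prob * (1 - p)   # this position is male
            new[k + 1] += prob * p     # this position is female
        dist = new
    return dist


def expected_histogram(p_female_by_position, n_papers):
    """Scale the distribution to an expected number of papers per female count."""
    return [n_papers * p for p in female_count_distribution(p_female_by_position)]


# Illustration only: a two-author paper with the overall first/senior frequencies.
example = female_count_distribution([0.46, 0.28])
```

A skew towards the extremes shows up as observed counts that exceed the first and last entries of `expected_histogram` while falling short in the middle.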

[Figure: Gender skew]

So there’s a pretty systemic skew in the gender composition of authors on papers, but where that skew comes from is unclear. Let’s look at the gender mix of all of the other authors on a paper as a function of the gender of the last author.

[Figure: Gender skew by last author]

Again, there’s a pretty strong skew. But is this due to the PI’s gender or to a more general gender imbalance? It’s a bit hard to tell from this data alone. It turns out the skew you see after dividing based on the gender of the last author is roughly the same if you divide based on the gender of any other position in the author order. Here, for example, is what you get for papers with six authors.

[Figure: Effect of reference author]

There’s a lot more one could and should do with this data, and I will come back to it later, but for now I will end with this observation. If you are female, there is a 45% chance that a random co-author on one of your papers is female. If you are male, it goes down to 35%. That’s a pretty big and striking difference, and I’m curious if anyone has a good explanation for it.
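One way to compute a number like this is sketched below. There is more than one reasonable definition of "a random co-author on one of your papers". This version pools every (author, co-author) pair across all papers, which weights papers with many authors more heavily, and that is my assumption rather than a description of the original calculation. It also assumes each paper has already been reduced to a list of inferred genders with unclassified authors dropped. The toy input is made up.

```python
def female_coauthor_rate(papers):
    """For each gender, the fraction of that gender's co-author slots held by women.

    `papers` is a list of papers, each a list of 'M'/'F' labels, one per author.
    """
    female_slots = {"M": 0, "F": 0}
    total_slots = {"M": 0, "F": 0}
    for genders in papers:
        n_authors = len(genders)
        n_female = genders.count("F")
        for g in genders:
            # Every other author on the paper is a co-author of this one.
            total_slots[g] += n_authors - 1
            female_slots[g] += n_female - (1 if g == "F" else 0)
    return {g: female_slots[g] / total_slots[g] for g in total_slots if total_slots[g]}


# Toy input, for illustration only.
toy_papers = [["F", "F", "M"], ["M", "M", "F"], ["M", "M"]]
toy_rates = female_coauthor_rate(toy_papers)
```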

The current system of scholarly publishing is the real infringement of academic freedom

Rick Anderson has a piece on “Open Access and Academic Freedom” at Inside Higher Ed arguing that the open access policies being put into place by many research funders and some universities, which require authors to make their work available under open licenses (most commonly Creative Commons’ CC-BY), are a violation of academic freedom and should be viewed with skepticism.

Here is the basic crux of his argument:

The meaningful right that the law provides the copyright holder is the exclusive (though limited) right to say how, whether, and by whom these things may be done with his work by others.

So the question is not whether I can, for example, republish or sell copies of my work under CC BY — of course I can. The question is whether I have any say in whether someone else republishes or sells copies of my work — and under CC BY, I don’t.

This is where it becomes clear that requiring authors to adopt CC BY has a bearing on academic freedom, if we assume that academic freedom includes the right to have some say as to how, where, whether, and by whom one’s work is published. This right is precisely what is lost under CC BY. To respond to the question “should authors be compelled to choose CC BY?” with the answer “authors have nothing to fear from CC BY” or “authors benefit from CC BY” is to avoid answering it. The question is not about whether CC BY does good things; the question is whether authors ought to have the right to choose something other than CC BY.

Although for reasons I outline below I disagree with Anderson’s conclusion that concerns about academic freedom should trump the push for greater access, the point bears some consideration, especially because he is far from the only one raising it.

But what actually is this “academic freedom” we are talking about?  I will admit that, even though I am a long-time academic, and have a general sense of what academic freedom is, when I first started hearing this complaint about open access mandates, I didn’t really understand what the term “academic freedom” actually means. And part of the problem is that there isn’t really a thing called “academic freedom”.

The Wikipedia definition pretty much captures the concept:

Academic freedom is the belief that the freedom of inquiry by faculty members is essential to the mission of the academy as well as the principles of academia, and that scholars should have freedom to teach or communicate ideas or facts (including those that are inconvenient to external political groups or to authorities) without being targeted for repression, job loss, or imprisonment.

But this broad concept lacks a unified concrete reality. Anderson cites as his evidence that CC-BY mandates violate academic freedom the following passage from the widely-cited “1940 Statement of Principles on Academic Freedom and Tenure” from the American Association of University Professors:

Teachers are entitled to full freedom in research and in the publication of the results, subject to the adequate performance of their other academic duties; but research for pecuniary return should be based upon an understanding with the authorities of the institution.

Note that while this document provides a definition of academic freedom that has been fairly widely accepted, it is not in any way legally binding nor, more importantly, does it reflect a universal consensus about what academic freedom is. Nonetheless, it’s hard not to get behind the general principle that academics should have the “freedom to publish”. However, it is by no means clear what this actually entails.

Virtually everything I have ever read about academic freedom starts with the importance of giving academics the freedom to express the results of their scholarship irrespective of their specific conclusions. We grant them tenure in large part to protect this freedom, and I know of no academic who would sanction their employer telling them that they cannot publish something they wish to publish.

But imposing a requirement that academics employ a CC-BY license does not impose a restriction on the content of their publication, but rather imposes a limit on venues available for publication (and it’s important for open access supporters to acknowledge this – there exist journals today that would not accept papers that were available online elsewhere, with or without a CC-BY license). But I’m not sure this constitutes a limit on academic freedom.

Clearly some restrictions on venues would have the effect of restricting authors’ ability to communicate their work. If a university told its academics that they could only publish in venues that appeared exclusively in print, they would unambiguously limit their ability to communicate and we would not sanction it. But what if they required that all works be available online to facilitate assessment and access for students? This would also impose some limits on where they could publish, but, in the current online-heavy universe, this would not be a meaningful limit on the authors’ ability to communicate.

So it seems to me that we have to make a choice. Approach 1 would be to evaluate such conditions on a case by case basis to determine if the limitations placed on authors actually limit academic freedom.  Approach 2 would be to enshrine the principle that any conditions placed on how or where academics publish by universities and funders are unacceptable.

If we take the case-by-case approach, we have to ask if the specific requirement that authors make their work available under a CC-BY license constitutes an infringement of their freedom to communicate their work. It certainly imposes some limits on where they can publish, but, given the wide diversity of journals that don’t prohibit pre-prints, it’s hard to describe this as a significant infringement.

The second issue raised by Anderson is that by requiring CC-BY, and thereby granting others the right to reuse and republish a work without author permission, you are depriving authors of the right to control how their work is used. I am a bit sympathetic to this point of view. But in reality authors have actually already lost an element of this control, as the fair use component of copyright law grants others the right to use published works in certain ways without author permission (to write reviews of the work, for example), so it’s hard to see this as a major intrusion.

Which brings me to one of my main points. Anderson argues that the principle of “freedom to publish” should be sacrosanct. But it clearly is not. While scholars may have the theoretical ability to publish their work wherever they want to, in reality the hiring, promotion, tenure and funding policies of universities and funding agencies impose a major constraint on how and where academics publish. Scientists are expected to publish in certain journals; other academics are expected to publish books with certain publishers. Large parts of the academic enterprise are currently premised on restricting the freedom of academics to publish where and how they want. In comparison to these restrictions – which manifest themselves on a daily basis – the added imposition of requiring a CC-BY license seems insignificant.

Furthermore, one has to view the push for CC-BY licenses in a broader context in which they are part of an effort to alter the ecology of scholarly publishing so that authors are not judged by their publication in a narrow group of journals or with a narrow group of university presses. Thus I would argue that, viewed practically, the shift to CC-BY would actually promote academic freedom and the freedom of authors to publish how and where they want.

One could reasonably respond that it’s not my place to decide on behalf of other scholars what does and does not constitute an imposition on their academic freedom. Which brings us to approach 2, enshrining the principle that any conditions placed on how or where academics publish by universities and funders are unacceptable. If you hold this position then you will clearly view a mandatory CC-BY policy as an unacceptable imposition on academic freedom. But you would then also have to see the hiring, promotion, tenure and funding policies that push authors to certain venues as an even bigger betrayal of academic freedom. I am happy to completely embrace this point of view.

In the end, I didn’t find Anderson’s article as repugnant as many of my open access friends did. Academic freedom is important, and it should be defended. And the points he raised are interesting and important to consider. But I take exception to Anderson’s focus on the supposed negative effects of the use of a CC-BY license on academic freedom when, if we are serious about defending academic freedom, we should instead be looking at how the entire system of scholarly publishing limits it. Indeed, I have now been inspired by Anderson’s article to make academic freedom a major lynchpin of my future arguments in favor of fundamental reform of scholarly publishing.

 

The DOE’s public access policy sells out the public

Yesterday the Department of Energy became one of the first federal agencies to announce its plan to comply with a 2013 White House directive ordering federal agencies to provide the public with access to the results of research that they fund.

Here are the main features:

  • DOE will host a centralized database of metadata (title, authors)
  • The full-text of the articles will be made available on publisher websites, primarily through their CHORUS system
  • Articles will be made available within 12 months of publication

Although it may not seem like it at first glance, this is a terrible turn for public access. Yes, this policy will make a good number of publications freely available, and that is a step forward. But the choice to go with the “link to publisher website” model being pushed by publishers, instead of the centralized database model already successfully used by the NIH, is a disaster.

Most importantly, the DOE has bought into the ridiculous notion that publishers should own the results of federally funded research, and that the interest of the publishers in maintaining control of the content they publish trumps the public interest in making this content freely available and free to use.

PAGES (the Public Access Gateway for Energy and Science, the DOE’s portal for this policy) does not include, as far as I can tell, the ability to do full-text searches. Because access will be provided by publishers and not the DOE itself, the process of getting and reading articles will likely be cumbersome. But the clearest evidence that the DOE cares more about publishers than the public is found in their attitude towards bulk-downloading of the freely-available content (see page 6 of the formal policy announcement):

The distributed nature of PAGES’ full-text content inherently makes unauthorized mass downloading and redistribution more difficult. For the limited full-text content it hosts publicly, OSTI will enforce a download limit and post appropriate fair use policies.

Note that not only does this policy prevent the perfectly reasonable action of downloading and reusing content produced by US taxpayer dollars, the DOE is celebrating the fact that they have made this impossible. This is completely unacceptable.

By any reasonable standard the product of government-funded research should belong to the public. And indeed, it DOES belong to the public, until the moment that authors assign their copyright over to journals. All the DOE has to do is forbid its authors from assigning copyright to publishers and instead require them to place their papers in the public domain. This would not only ensure public access, but would also give researchers and companies access to the full contents to develop new and interesting ways to use the results of publicly-funded research.

That the DOE eschewed this path in favor of reifying publisher ownership and control of government-funded literature is unforgivable.

Why I, a founder of PLOS, am forsaking open access

PLEASE NOTE BEFORE YOU READ THIS THAT IT WAS WRITTEN FOR APRIL FOOLS DAY!!!


I co-founded the Public Library of Science (PLOS) in 2002 because I believed deeply that the open access publishing model PLOS espoused and has come to dominate was good for science, scientists and the public.  Over the past decade open access has become a personal crusade – my own religion – one I have fervently promoted here on this blog, on social media, and to thousands of colleagues at meetings and social engagements. To back up my commitment to open access, since 2000, I have exclusively published papers from my lab in open access journals, and have urged – some might say hectored and harassed – my colleagues to do the same.

But in the last few weeks I have had a major change of heart. Yesterday at group meeting I told the members of my lab that they are free to send their papers to any journal they want to – including (and especially) the previously reviled Nature, Cell and Science. I am announcing this here today because I have been so publicly associated with open access, and I feel I owe my readers and the community an explanation for why I have made this dramatic change.

The most immediate reason is that, to be honest, I’m jealous. I just got back from the annual fly meeting in San Diego. Throughout the meeting – after talks, in the poster sessions and at the bar – people kept coming up to me and telling me how much they love our work, how they’re using our data, our methods or our ideas. But these words of praise rang hollow, lacking as they did that glint in the eye people get when they say “I really loved your Nature paper”.

It used to be cool to publish in PLOS. The small band of early open access adherents  – identifiable by our gaudily colored, slightly risqué  t-shirts (“Where would Jesus publish?”) – were everyone’s favorite rebels with a cause. Maybe people didn’t share our willingness to stand up to The Man. But they wished they did. And we had their respect.

But now those t-shirts are ratty, and PLOS has become The Man. Its reviews are slow. Its editorial decisions are capricious. And, frankly, nobody ever really cared about whether the public could read their papers anyway.

What people do care about is the cachet that comes from having an overworked editor at one of the big three journals decide that their paper is “The One”. I could see it in my students’ and postdocs’ eyes every time we passed by an adoring horde gathered round the latest winner of the great “Science, Nature and Cell” game, listening to them tell tales of how they worked the latest buzzwords into their abstract and buried all their confusing data in supplemental materials. Who am I to deny this joy to the young scientists who have entrusted their careers to me, just because I don’t think it’s “right”?

And who’s to say what’s right anyway. I’ve been going back over the last several years of posts from The Scholarly Kitchen. And when I listen to what they – especially Kent Anderson – say free of the haze of an open access zealot they start to make a lot of sense.

First of all, the whole idea that the public is clamoring for free access to the scientific literature is a pipe dream. Sure PubMed Central – the free database of papers produced with funding from the National Institutes of Health – gets over 1,000,000 hits a day. But do you really believe numbers from the government? After all, these are the same people who are saying that 7,000,000 people have signed up for Obamacare. The open access lobby can always dig up some people – cancer patients or something like that – who have benefited from open access. But we never hear about the people who’ve been hurt – like all the students at places like Harvard and Stanford who no longer have better access to the scientific literature than hoi polloi at lesser institutions.

And now that I’ve had a chance to think about it, it makes no sense to wrest the system of forging new scientists, making promotions and assigning tenure at institutions of higher learning away from the for-profit corporations that control it today. Who’s going to do it instead? Scientists???? Have you been to a faculty meeting? Or served on a study section? These kinds of decisions are best left to people who are far removed from the messy details of the science and who care primarily about making money – only they can be truly objective.

I have come to appreciate the important role that prestigious journals like Science, Nature and Cell play in filtering out bad science, and protecting both the public and other researchers from wasting their time reading about – or following up on – results that are not believable. You have all undoubtedly heard about recent studies examining the reproducibility of scientific results. For example, a recent Nature paper [paywalled, so you can believe it] described how scientists at drug company Amgen were able to successfully replicate six of 53 landmark studies in cancer research.

As these were landmark studies, most were published in the highest profile subscription journals. And these results prove that – contrary to what I would have expected – the top subscription journals are doing a great job of picking papers. First, Amgen, who doesn’t like to waste their money, found 53 of these studies important enough to try to replicate. I don’t think they’ve bothered to try even a dozen PLOS ONE papers. But more amazingly these scientists at Amgen were able to get the same results as important academic scientists OVER ten percent of the time. This means that the papers must have described the methods extremely clearly – a hallmark of high profile journals.

Finally, there’s the issue of money. Funding agencies and universities across the world spend over $10,000,000,000 a year subscribing to research journals in science, technology and medicine that publish, collectively, about 1,500,000 articles (or around $6,500 per article). We all know that the point of economies is to expand, and journal publishing has been doing its part, with cost increases exceeding inflation (meaning it is growing fast!) every year for the past few decades. But imagine what will happen if we switch to universal open access as I have been advocating. Everyone agrees that open access journals charge scientists a lot less than $6,500 to publish their papers. So, if we start publishing more open access papers, we’ll be spending less money (a LOT less if publishers like PeerJ get their way) for every article, and therefore LESS money on publishing. This is called contraction, and it’s what caused the Great Depression.

This is why I now strongly support CHORUS – the publishers’ answer to calls from Congress and The President to provide better public access to government funded research. CHORUS will provide people with access to papers after a delay – timed to ensure that no subscription revenues will be lost. Thus for the entire period of time when articles are actually useful to people they will be behind a paywall where they can generate money for the economy. This makes sense, whereas using “open access” publishing to make these articles immediately freely available to everyone at a lower cost clearly does not.

I have a lot to answer for. I want to apologize to all the people who have followed me into the abyss of open access. All I can say is that I meant well, and that I hope you will forgive me for the joy I have taken out of your lives and for the broken dreams of the career you could have had if you’d only published your postdoc paper in Cell.

FIRST of all, THIS is why you should never trust publishers

When President Obama announced last year that he was requiring federal agencies that fund science to develop policies to make papers arising from the work they fund freely available to the public, major subscription-based publishers responded in a generally favorable manner – reflecting the extent to which they had drawn the White House back from more aggressive proposals on the table. They even put forth a proposal – called CHORUS (Clearinghouse for the Open Research of the United States) – through which they offered to implement these public access policies for federal agencies, providing free access to articles on their own websites.

I wrote at the time about why CHORUS was a ruse that would never work. In particular, I warned that, despite their public veneer of support, publishers would continue to work to reverse these public access policies, and, because with CHORUS they would never have to give up control of published papers, they could just turn off public access if they ever succeeded.

Well, they’re trying to do just this. Last week a bill was introduced – The Frontiers in Innovation, Research, Science and Technology (FIRST) Act of 2014 – a section of which (Section 303) is designed to undermine this already fairly weak policy. The language is a bit dense and confusing, but here is what it would do.

  • The surest way to kill a policy initiative in DC is to call for more study. FIRST calls for 18 additional months of study of public access policies, specifically calling for “data-driven” justification for embargo periods. This is language publishers have used before and is code for “set embargo periods so that they do not harm the bottom line of publishers”.
  • It calls for the use of existing infrastructure, including the NLM, but also in the private sector, and minimizing the burden of providing access – things that the publishers use to promote CHORUS.
  • It weakens the embargo period to 24 months from its already unacceptably long 12 months, and allows for agencies to EXTEND this period for up to an additional year.

I don’t have direct evidence that publishers are behind this, but it echoes all the main talking points they’ve been using to complain about public access. It’s not clear how far this will go, since this is just the House version of the bill, and the Senate version has not emerged. But it’s worth using the tools available through SPARC to voice your opposition to this bill.

PubMed Commons: Post publication peer review goes mainstream

I have written a lot about how I think the biggest problem in science communication today is the disproportionate value we place on where papers are published when assessing the validity and import of a work of science, and the contribution of its authors. And I have argued that the best way to change this is to develop a robust system of post publication peer review (PPPR), in which works are assessed continuously after they are published so that flaws can be identified and corrected and so that the most credit is reserved for works that withstand the test of time.

There have been LOTS of efforts to get post-publication peer review off the ground – usually in the form of comments on a journal’s website – but these have, with few exceptions, failed to generate sustained use. There are lots of possible reasons for this – from poor implementation, to lack of interest on the part of potential discussants. However, I’ve always felt the biggest flaw was that these were on journal websites – that you had to think about where the work was published, and whether they had a commenting system, and whether you had an account, etc…

What we’ve always needed was a central place where you know you can always go to record comments on a paper you are reading, and, conversely, where you can get all of the comments other scientists have on a paper you’re reading or are interested in. There have been a couple of services that have tried to create such a system – cf. PubPeer, which lets you comment on any paper in PubMed – but they have been slow to gain traction in the community.

The obvious place to build such a commenting/post publication review system has always been directly in PubMed – it has everything and everyone already uses it. This is why I am excited – and cautiously optimistic – about a new project called PubMed Commons that will allow registered users (for now primarily NIH grantees) to post comments on any paper in PubMed, which will then appear alongside the paper when it is retrieved in a search.

Here is how PubMed Commons describes itself:

PubMed Commons is a system that enables researchers to share their opinions about scientific publications. Researchers can comment on any publication indexed by PubMed, and read the comments of others.

PubMed Commons is a forum for open and constructive criticism and discussion of scientific issues. It will thrive with high quality interchange from the scientific community.

The system is still pretty threadbare – it only allows simple commenting, and not, for example, rating of the work – but I’ve used it and it is easy to get in, comment and get out. A lot more info on the project can be found here.

This is a great opportunity for us to make PPPR real. But it’s only going to work if people participate. So, if you’re an NIH grantee, and you want to see science communication improve, make a commitment to comment on a paper you’ve read at least once a week, and let’s make this thing work!!

Let’s not get too excited about the new UC open access policy

It was announced today that the systemwide Academic Senate representing the 10 campuses of the University of California system had passed an “open access” policy.

The policy will work like this. Before assigning copyright to publishers, all UC faculty will grant the university a non-exclusive license to make the works freely available, provide the university with a copy of the work, and select a Creative Commons license under which it will be made freely available in UC’s eScholarship archive.

A lot of work went into passing this, and its backers – especially UCLA’s Chris Kelty – are to be commended for the cat herding process required to get it through UC’s faculty governance process.

I’m already seeing lots of people celebrating this step as a great advance for open access. But color me skeptical. This policy has a major, major hole – an optional faculty opt-out. This is there because enough faculty wanted the right to publish their works in ways that were incompatible with the policy that the policy would not have passed without the provision.

Unfortunately, this means that the policy is completely toothless. It provides a ready means for people to make their works available – which is great. And having the default be open is great. But nobody is compelled to do it in any meaningful way – therefore it is little more than a voluntary system.

More importantly, the opt-out provides journals with a way of ensuring that works published in their journals are not subject to the policy. At UCSF and MIT and other places, many large publishers, especially in biomedicine, are requiring that authors at institutions with policies like the UC policy opt out of the system as a condition of publishing. At MIT, these publishers include AAAS, Nature, PNAS, Elsevier and many others.

We can expect more and more publishers to demand opt-outs as the number of institutions with open/public access policies grows. In the early days of such “green” open access, publishers were pretty open about allowing authors to post manuscript versions of their papers in university archives. They were open because there was no cost to them. Nobody was going to cancel a subscription because they could get a tiny fraction of the articles in a journal for free somewhere on the internet.

However, as more universities – especially big ones like UC – move towards institutional archiving policies, an increasing fraction of the papers published in subscription journals could end up in archives – which WOULD threaten their business models. So, of course (and as I and others predicted a decade ago), subscription publishers are now doing their best to prevent these articles from becoming available.

So long as the incentives in academia push people to publish in journals of high prestige, authors are going to do whatever the journal wants with respect to voluntary policies at their universities. And so, we’re really back to where we were before. Faculty can make their work freely available if they want to – and now have an extra way to do it. But if they don’t want to, they don’t have to.

The only way this is going to change is if universities create mandatory open access policies – with no opt-outs or exceptions. But this would likely require actions from university administrators who have, for decades, completely ignored this issue.

So don’t get me wrong. I’m happy the faculty senate at UC did something, and I think the eScholarship repository will likely become an important source of scholarly papers in many fields, and the use of CC licenses is great. And maybe the opt out will be eliminated as the policy is reviewed (I doubt it). But, because of the opt out, this is a largely symbolic gesture – a minor event in the history of open access, not the watershed event that some people are making it out to be.

A CHORUS of boos: publishers offer their “solution” to public access

As expected, a coalition of subscription-based journal publishers has responded to the White House’s mandate that federal agencies develop systems to make the research they fund available to the public by offering to implement the system themselves.

This system, which they call CHORUS (for ClearingHouse for the Open Research of the United States), would set up a site where people could search for federally funded articles, which they could then retrieve from the original publisher’s website. There is no official proposal, just a circulating set of principles along with a post at the publisher blog The Scholarly Kitchen and a few news stories (1,2), so I’ll have to wait to comment on details. But I’ve seen enough to know that this would be a terrible, terrible idea – one I hope government agencies don’t buy in to.

The Association of American Publishers, who are behind this proposal, have been, and continue to be, the most vocal opponent of public access policies. They have been trying for years to roll back the NIH’s Public Access Policy and to defeat any and all efforts to launch new public access policies at the federal and state levels. And CHORUS does not reflect a change of heart on their part – just last month they filed a lengthy (and incredibly deceptive) brief opposing a bill in the California Assembly that would provide public access to state funded research.

Putting the AAP in charge of implementing public access policies is thus the logical equivalent of passing a bill mandating background checks for firearms purchasing and putting the NRA in charge of developing and operating the database. They would have no interest in making the system any more than minimally functional. Indeed, given that the AAP clearly thinks that public access policies are bad for their businesses, they would have a strong incentive to make their implementation of a public access policy as difficult to use and as functionless as possible in order to drive down usage and make the policies appear to be a failure.

You can already see this effect at work – the CHORUS document makes no mention of enabling, let alone encouraging, text mining of publicly funded research papers, even though the White House clearly stated that these new policies must enable text mining as well as access to published papers. Subscription publishers have an awful track record in enabling reuse of their content, and nobody should be under any illusions that CHORUS will be any different.

The main argument the CHORUS publishers are making to funding agencies is that allowing them to implement a solution will save the agencies money, since they would not have to develop and maintain a system of their own, and would not have to pay to convert author manuscripts into a common, distributable format. But this is true only if you look at costs in the narrowest possible sense.

First, there is no need for any agency to develop their own system. The federal government already has PubMed Central – a highly functional, widely used and popular system. This system already does everything CHORUS is supposed to do, and offers seamless full-text searching (something not mentioned in the CHORUS text), as well as integration with numerous other databases at the National Library of Medicine. It would not be costless to expand PMC to handle papers from other agencies, and there would be some small costs associated with handling each submitted paper. However, these costs would be trivial compared to the costs of funding the research in question, and would produce tremendous value for the public. What’s more, most of these costs would be eliminated if publishers agreed to deposit their final published version of the paper directly to PMC – something most have steadfastly refused to do.

But even if we stipulate that running their own public access systems would cost agencies some money, the idea that CHORUS is free is risible. There is a reason most subscription publishers have opposed public access policies – they are worried that, as more and more articles become freely available, their negotiating position with libraries will be weakened and they will lose subscription revenues as a consequence. Since a large fraction of these subscription revenues (on the order of 10%, or around $1 billion/year) come from the federal government through overhead payments to libraries, the federal government stands to save far, far, far more money in lower subscription expenditures than even the most gilded public access system could ever cost to develop and operate.
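For anyone who wants to poke at the arithmetic, here is a minimal back-of-envelope sketch in Python. The only numbers taken from the text are the roughly 10% federal share and the roughly $10 billion a year in total subscription spending that share implies. The price-drop percentages are hypothetical values picked purely for illustration, not estimates of what would actually happen.

```python
# Rough check of the federal share of journal subscription spending.
# TOTAL_SPEND_USD is the ~$10 billion/year implied by "10%, or around $1 billion/year".
TOTAL_SPEND_USD = 10_000_000_000
FEDERAL_SHARE_PERCENT = 10  # "on the order of 10%"

# Hypothetical subscription price drops, chosen only for illustration.
HYPOTHETICAL_PRICE_DROPS_PERCENT = [5, 10, 20]


def federal_spend(total_usd: int, share_percent: int) -> int:
    """Annual federal contribution to subscription revenue, in whole dollars."""
    return total_usd * share_percent // 100


def annual_savings(spend_usd: int, price_drop_percent: int) -> int:
    """Dollars saved per year if subscription prices fell by the given percent."""
    return spend_usd * price_drop_percent // 100


federal_usd = federal_spend(TOTAL_SPEND_USD, FEDERAL_SHARE_PERCENT)
savings_by_drop = {
    drop: annual_savings(federal_usd, drop)
    for drop in HYPOTHETICAL_PRICE_DROPS_PERCENT
}

print(f"Federal share: ${federal_usd:,} per year")
for drop, saved in savings_by_drop.items():
    print(f"{drop}% cheaper subscriptions -> ${saved:,} saved per year")
```

Swap in your own guesses for the price drops. The point is only that even small percentage declines on a base of this size add up to large annual sums.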

CHORUS is clearly an effort on the part of publishers to minimize the savings that will ultimately accrue to the federal government, other funders and universities from public access policies. If CHORUS is adopted, publishers will without a doubt try to fold the costs of creating and maintaining the system into their subscription/site license charges – they routinely ask libraries to pay for all of their “value added” services. Thus not only would potential savings never materialize, the government would end up paying the costs of CHORUS indirectly.

Publishers desperately want the federal agencies covered by the White House public access policy to view CHORUS as something new and different – the long awaited “constructive” response from publishers to public access mandates. But there is nothing new here. Publishers proposed this “link out” model when PMC was launched and when the NIH Public Access policy came into effect, and it was rejected both times. Publishers hate PMC not because it is expensive, or even because it leads to a (small) drop in their ad revenue. They hate it because it works, is popular and makes most people who use it realize that we don’t really need publishers to do all the things they insist only they can do.

CHORUS is little more than window dressing on the status quo – a proposal that would not only undermine the laudable goals of the White House policy, but would invariably cost the government money. Let’s all hope this CHORUS is silenced.