2013-01-28

Free for non-commercial academic use only sucks

Scientific software released for free under an "academic only"/"non-commercial" licence really irks me, even if the source code is included. This is mainly because it isn't open source, and I believe science should be open, but it also causes a whole lot of practical headaches. Right now I'm particularly disappointed by the GATK licensing debacle from The Broad Institute - and I'm not alone in this: for example, Mick Watson has just blogged about why the GATK re-licensing matters, and there has been plenty of comment on Twitter.

As the tone of this post will hopefully convey, I'm a big fan of open source software (OSS) in general. I use it daily, and help support OSS with reproducible bug reports, usability feedback, and patches. Any code I write and share online I release as open source software too. This also applies to my day job doing bioinformatics - I should probably disclose at this point that just over a year ago I was elected to the board of directors of the Open Bioinformatics Foundation (OBF). I believe that the openness in open source software is vital for scientific software - a binary blob is a black box, with only the author's description to tell you what they think it does with your data (and in some cases they don't even tell you that).

Commercial vs Academic

One of the practical issues with an "academic only"/"non-commercial" licence is where does that line lie? Research done at a university is probably OK... but what if the grant is part funded by an industrial partner, as in the BBSRC Industrial Partnership Awards (IPA), or a CASE studentship (a scheme offered by most of the RCUK funding agencies)? Every licence like this seems to be different, meaning wasted time reviewing their terms - something you don't have to worry about with a mainstream OSS licence.

How about government funded research institutes? I know from first-hand experience that their academic status is often viewed as borderline. In one case Apple said yes, and granted access to their Education Store with its lower prices (very worthwhile for their computers). On the other hand, The MathWorks said no, meaning they charged their extortionate full prices for MATLAB (later they introduced a third category for such borderline cases). That wasn't a problem for me personally as I'm quite happy using Python + NumPy + matplotlib instead, all OSS and free.

Then there are research organisations which do a mixture of 'pure' academic work and analysis as a service. For example, many sequencing centres offer to ship you just raw data, raw data plus a paid analysis service, or a collaborative approach.

Non-standard Licences

In my experience, every free for "academic only"/"non-commercial" usage licence is unique, and frankly I don't want to waste time reading it - but you need to. For instance, a licence may require each individual user to agree - which is a logistical nightmare if you want to wrap the software as a service (for example in Galaxy). There are similar concerns if there are usage restrictions based on location - multiple sites or campuses might not be covered.

The initial version of the GATK v2 licence was a particularly jaw-dropping example: it required that you only disclose the results of data analysis to other academic non-commercial users! That was quickly addressed following user outcry (see this thread and the Twitter comments from July 2012).

Mixed licences for a large codebase are another headache - if you're only able or willing to accept one of the licences on offer, sorting out which parts of the code this gives you access to (and if that is enough for your needs) can be another barrier to usage.

No redistribution or packaging

One of the practical benefits of OSS is that the licences allow anyone (even companies) to modify and redistribute the code. In particular, Linux distributions and similar efforts on Windows (e.g. Cygwin) and Mac OS X (e.g. MacPorts) can provide repositories of packaged software with meta-data describing inter-dependencies, which can be downloaded and installed at the click of a button or a single command at the terminal. This is enormously useful. Similarly, OSS licences allow the creation and sharing of virtual machines as a way of sharing preconfigured systems.

On a smaller scale, webserver or GUI front ends written to provide a user friendly interface to command line tools also benefit from being able to bundle the underlying tools (both for ease of installation and to avoid changes in behaviour between different versions of a dependency).

None of that is possible with "academic only"/"non-commercial" licenses.

Why use a free for academic use only licence?

The only reasons I can see for doing this are money and control. Overzealous university intellectual property agents might think there is money to be made selling software to commercial users - which could be defensible in some poorly funded cases I suppose. In the case of control, having a novel tool unique to your group can give you a leg up over your academic rivals - rationalised on the grounds that the method itself has been published so your rivals can reimplement it if they want to. I find that ethically distasteful.

Why are the Broad doing this? Why are they making things worse?!

Initially GATK v1 was MIT licensed (one of the simplest and most liberal OSS licenses).

According to the Broad, they were getting requests for support and therefore wanted to offer that as a commercial service. That's fine - but that doesn't explain why they didn't continue releasing all of GATK under an open source licence while selling optional support (a proven commercial strategy, as used by Red Hat for its Linux distribution). I've yet to hear a clear answer from the Broad on this.

In July 2012 for GATK v2, a mixed model was adopted - the functionality of GATK v1 remained open source under the name GATK-Lite, but new functionality would be released without source code, available to commercial users for a fee or free of charge to academic non-commercial users. People complained, especially about not being able to even see the source.

I didn't like the idea at the time, but I felt the announced hybrid approach for GATK v2 didn't seem too bad, providing they stuck to the outlined plan where the core "GATK Lite" remained open source, and closed source functionality in the full GATK would migrate to it in time (sadly this did not happen). Quoting from the July 2012 GATK v2 announcement:
"GATK-Lite isn't a dead-end branch of GATK1. All GATK-Lite infrastructure will be fully supported -- to the same degree as GATK1 -- by the GSA team, as we will rely on these tools day-in and day-out. GATK-Lite is evolve in lock-step with the full GATK, GATK-Lite and GATK(-full) will carry the same release numbers, and will be pushed out by the GSA group simultaneously. As we add new file formats to the GATK (BCF2, for example) these changes will go into the core of GATK, and be available through both GATK and GATK-Lite."
And their FAQ claimed:
"Will you ever make the new GATK 2.0 tools open source? Yes, over time we plan to migrate closed source tools into the open source branch of the GATK."
This month (Jan 2013), the Broad Institute announced new licence terms for GATK v2.4 - they still offer a free-as-in-beer option for academic use only (but this time including source code - the only good news), or the option to buy a commercial licence with support via their partner company. The bad news was the open source GATK-Lite has been dropped, with only the core programming framework remaining open source under the MIT license.

Note that some of the previously open source analysis tools ("walkers") which were released with GATK-Lite are no longer open source - causing considerable inconvenience to groups already using them (and breaking the expectations set with the release of GATK v2 and GATK-Lite).

To my mind, the description of GATK-Lite given in July 2012, and how it was described in Jan 2013 are very different:
Second, we did a poor job of communicating the purpose of Lite and how it differed from the Full version. Even though Lite was always intended as an interim solution, some organizations opted to adopt it instead of the Full version and seem to view it as a viable long-term solution for genetic analysis.
I doubt anyone outside the Broad Institute was surprised that lots of people and groups adopted GATK-Lite. It was described as the open source core of GATK v2, with new functionality to be added over time. But now GATK-Lite is being dropped.

GATK licensing in a nutshell

GATK v1 was 100% open source; as of July 2012 only GATK-Lite was open source; and as of Jan 2013 even less was open source (only the core GATK framework).

The core framework does remain open source (MIT licensed), but that is all. Some of the previously open source analysis tools ("walkers") which were released under GATK-Lite are no longer open source - causing considerable inconvenience to groups already using them (and breaking the expectations set with the release of GATK v2 and GATK-Lite).

The GATK v2 suite on top of the framework is not open source. If you're eligible, you can use the free-as-in-beer non-commercial academic license - which lets you see the source code. Since this is not free-as-in-speech, this is a look but don't touch arrangement. If it is even possible to edit the source code and recompile it, you won't be able to share your changes. If you want to reuse their implementation ideas in your own software, you can't. Clearly this is antithetical to the ideals of using open source software in science - something that I had thought was now mainstream in bioinformatics.

I'm upset because I care.


P.S. See also Mick Watson's post about why the GATK re-licensing matters, and this growing GATK Twitter archive on Storify.

4 comments:

  1. Hi Peter,

    You make some fair points, particularly that we handled things poorly in the first round of mixed-licensing the GATK. Not much we can say to that except that we're trying to do a better job of communicating this time around and hopefully fix some of the mistakes that were made at the time.

    We are genuinely sorry that we're breaking expectations based on the 2.0 announcements. But we need to move forward without being tied down by past mistakes.

    My response to Mick's post may address some of your other concerns, particularly the question of why we didn't go the RedHat route for paid support, which lots of people have been asking.

    I also want to point out that the MIT-licensed framework is really not a minor resource as sometimes seems to be implied in these discussions. It includes the engine, infrastructure libraries and a great many "utility" walkers. Taken together that is a powerful package -- and it is completely open and free for the taking. We also provide documentation on how people can write their own tools to use the framework. So really anyone could go and write their own alternative to GATK based on the GATK's own core.

    As for the full suite (which just additionally includes the dozen or so "Best Practices pipeline" analysis tools), any academic researcher is free to clone the repo, make modifications and recompile the source as they please. The limitation is that they are not allowed to distribute it outside of their academic institution. I agree that in principle this is not ideal, but in practice we have found that essentially no-one makes changes to the walkers. People typically make changes that affect the framework, engine etc, which remains open and ok to share.

    Having the source accessible, along with the publication of our algorithms in the scientific literature, should give people all they really need to reuse our ideas in their own (non-commercial) software. So we don't think this will truly amount to an obstacle to research methods development.

    Replies
    1. Dear Geraldine. Thank you for commenting here and on Mick's blog. It will be interesting to see how the wider community of developers using (and in some cases contributing to) GATK responds to this over the next few months.

  2. Thanks for posting. It's interesting to hear a variety of perspectives on this issue.

    You mention "money and control" - I agree that control is a terrible reason to restrict access to software, but certainly in today's funding landscape the importance of money can't be overstated. Funding is needed for initial planning and development, maintenance/enhancement, and sometimes hosting of any given tool. If the groups that make these tools can't recover these costs from their (commercial) users, and are having an increasingly difficult time finding other sources of funding like public grants (which often won't extend to long-term maintenance or hosting anyway), where will that money come from?

    I wouldn't personally begrudge software developers who decided to charge funded academic users, either. Other material costs are written into their grants (and lab equipment isn't purchased at cost, but instead helps someone make a profit and recoup the expense of designing and creating the equipment.) So why should the cost of software tools be any different?

    BioPython, of course, is a counterexample - something that's free and developed by the community (although some of our contributions are supported in other ways.) But this can't be everyone's solution - if BioPython had a component that required expensive hosting on a server cluster, where would that money come from?

    Replies
    1. I strongly disagree with attempts by academics to charge other academics to use their research software - the point about scientific progress is sharing knowledge and building on each other's discoveries. That works nicely with open source licensing, but is blocked by proprietary licences.

      Funding from research councils and similar bodies for software maintenance is currently difficult to get but possible, especially for higher profile projects where it is easier to make a strong case with many letters of support from a wide range of users (e.g. the EMBOSS project had BBSRC funding - their website mentions grants BB/D018358/1 and BBR/G02264X/1 at least). Tracking usage via download counts, mailing list subscriber counts, or even a 'phone home' system for active users can also help here. And of course, there are citations - which are currently the main way researchers have to justify their software output.

      In general, with an open source model the burden of support is shared between technical users, and if a tool is widely used this can be sustainable as a community project.

      In your example of software requiring expensive hosting on a server cluster, except for big central projects, I would expect the user to provide, buy or rent the server cluster themselves. Cloud computing makes this much easier than in the past - e.g. provide a virtual machine image (as well as installable tools for use on other machines/architectures). If you make this as easy as possible, you'll get more users, more citations, and have more impact.
