Archive for computers

Kudos to Dropbox for supporting Linux

There are many reasons to use Linux, but the main drawback is a lack of support for third-party software. When I set up my new Fedora Desktop, I was caught off guard by the fact that Box.com does not provide sync software (there is a hack, but it’s not really a solution) — I hadn’t realized how dependent I had become on Box.com to back up my work and synchronize my computers. UC Berkeley contracts with both Box.com and Google to provide cloud storage, but neither comes through in this situation.

However, my free Dropbox account has come to the rescue! The Dropbox client had one minor glitch but otherwise works perfectly. Now I’m shifting to using it for my day-to-day backups, and limiting my use of Box and Drive to sharing large files.

Thanks Dropbox!

Comments off

Why be a good bioinformatician?

Here is some “advice” on how NOT to be a bioinformatician (i.e. how to make bad software for biology). This makes me ask the question: “Why be a bioinformatician?”

Much of the advice in here makes me think that a lot of “bioinformaticists” don’t really have a good reason for doing what they do. I have to say that I’ve seen a lot of bad biology-focused software. I’ve even heard respected biologists declare that the entire field of bioinformatics is worthless (at least, the stuff published in bioinformatic-focused journals is worthless).

So what is a bioinformaticist trying to achieve?

One approach to bioinformatics is to create software that addresses one’s own research interest. The funny thing is, these typically are not the programs that are published in bioinformatics journals — they are published in biology journals. When I look at the software tools that have been most useful to me, they are not made by people I consider bioinformaticists — they are made by biologists, who are programming computers as a tool to solve problems that they are interested in. Even when these scientists are trained in statistics and CS, they are still tightly connected to a particular biological community and they are designing software that answers research questions that this community cares about. This often allows them to answer questions that nobody has been able to answer before.

The other approach to bioinformatics is to build a tool that others will use. This seems to be the focus of the linked SCFBM article.

All too often, these software/algorithm development projects aim only to produce incremental improvements in existing methods (e.g. making them more accurate or faster or user-friendly). These typically don’t lead anywhere, and I don’t consider these to be appropriate academic projects — this type of optimization should be performed within teams that are interested in some sort of mass-production and have real accountability for the performance of their software (e.g. at commercial firms). Publishing this type of work is an invitation for BS.

There is still space for applying serious CS to improving bioinformatic tools, but these should focus on radically different approaches to the analysis, so that they enable order-of-magnitude improvements in the efficiency of the algorithm.

This same problem of misguided motivation is seen in the plethora of web services that have emerged during the mass-sequencing era. I have been very frustrated by these, since the vast majority of them simply waste my time by promising things that they cannot deliver. Many of them are not maintained — which makes perfect sense given their limited utility to begin with.

If you are going to make a software tool “for biologists”, you need to ask yourself whether it will be useful enough to be worth making properly and maintaining it. If your service is very narrowly focused, are you going to bother maintaining it just to serve the one user per month? Are biologists going to bother discovering your service if it nearly duplicates an existing service that they are already familiar with (e.g. NCBI)? Will they ever hear about it if it provides a single narrowly focused service? Does the service actually provide useful information, or does it simply make predictions that a biologist will need to test anyway if the prediction really matters?

So before trying to figure out how to properly develop bioinformatics software, figure out why you want to make these tools at all.

Comments off

Agent Based Modelling with Repast

A student brought my attention to Repast when he did some modelling with me this past summer. This is an agent-based modelling (ABM) platform, primarily for Java (though there are also tools for other languages). It is apparently based on an earlier system called “Swarm”, which I’ve heard is slightly more powerful and slightly more difficult to use. Since I was already familiar with Java and Eclipse (of which Repast Simphony is a derivative), we decided to give Repast a shot. In addition to the core agent-manipulation libraries, Repast has powerful visualization tools, active developers, and a decent-sized user community.

The main difficulty in using Repast is the dearth of documentation. This post is meant to help on that front by collecting links to the materials that I have found useful (as much for my own use as anyone else’s). But before getting into that, I’ll provide a little context for why Repast is interesting and perhaps why its documentation is so difficult. Everything I write is based on Repast Simphony 2.1, which is built on top of Eclipse Kepler (Build id: 20130614-0229).

One appealing feature of Repast is that it provides interfaces that hide the inner workings from the user, allowing researchers with different levels of programming skill to access the tools. At the simplest end is ReLogo, which is similar to the very accessible NetLogo (apparently both derive from a language called Logo). I played with this briefly, but got stumped on how to do simple arithmetic in this language. Rather than learn another language, I transferred over to the Java side of Repast, which promised greater power anyway. Even here, the core of the modelling engine is still hidden from me, which became an issue when I wanted to access the scheduling mechanism. It seems that one consequence of developing these different levels of accessibility is that the user community is split into three groups, each of which requires separate documentation.

Since I want to learn about the advanced features of Repast, the obvious place to go is the developers, who are active on the Repast-interest mailing list. However, I don’t want to bother them with questions that they’ve answered a million times, so I should first search the list archives (as suggested on the sign-up page). But the archive page does not include a search feature (really?). So I did a Google search and found that someone else had asked for a solution on the mailing list, and was told to try searching on Nabble. This is way too meta. (To reiterate: I did a Google search asking how to search the Repast-interest archives, which directed me to a Nabble page containing an old discussion from Repast-interest, where the answer was that we should use Nabble.)

Another good resource is the large collection of demonstration models. The downside is that there is no quick way to find the model that demonstrates the technique that you are interested in. In my experience, the StupidModel series of models shows the most sophisticated methods.

The Repast GUI provides a powerful interface to the models, but it makes it difficult to just open up the source code and track the logic. For instance, the main class for all GUI models is “RepastMain”, which is a quite terse and cryptic launcher. As I understand things, Repast launches the GUI (which appears to be implemented at a lower level than Java), which then gets its instructions from a collection of XML files associated with your project. For batch runs, the main class is called BatchMain (but the header indicates that this is deprecated in favor of RepastBatchMain).

After poking around the web a bit, it seems that the classes to use to get at the core of the model are RepastEssentials and RunEnvironment.

With all that searching for answers, I wonder: does StackOverflow have information about Repast? They don’t seem to have anything specific to Repast, but they do have some discussions of agent-based modelling that place Repast in the general context of ABM.

Other links I found helpful/interesting:

Stackoverflow discussion of ABM approaches…and another.

Wikipedia comparison of ABM tools

Repast self-study guide (links to tons of resources)

Comments (3)

Why I deleted my ResearchGate account

Several months ago, I was excited to discover ResearchGate, an online community for scientists. I was initially attracted by the discussion boards, which included a lot of useful technical feedback. I set up an account, and proceeded to use the service occasionally and share my expertise. The service was not terribly useful to me, but it seemed to be growing and improving, so I was happy to play along. A couple of months ago, I noticed that I could not see anything on the site without first logging in.

I have finally decided to delete the account. Here’s what I told them:

I was originally attracted to Research Gate due to the discussions. Like any other professional/technical discussion board (e.g. StackOverflow), I expect public discussions to be truly public — not controlled by the service. I am very disappointed that Research Gate has placed a virtual wall around its content.

This is a deal breaker for me. I will not contribute content to any service that tries to take control of that content.

Too many companies are trying to make a buck by gaining control over our social interactions. This is sick, and ResearchGate does not offer nearly enough benefits to keep me on board through this process. I hope they will change their business model and recognize the users and content creators as true “members”, not just a commodity to be fed into a pipeline. If not, good riddance.

Comments (1)

Software Carpentry workshops

Software Carpentry seeks to train biologists in the basics of software design. Unfortunately, I was not able to attend when one was held at UC Berkeley, but I suspect that this is exactly what is needed for most biologists.

Comments off

Resources for aspiring genomicists

Here is a brain-dump targeted at an incoming graduate student:

At some point, you should take a look at the following resources. They are very useful for any genomic analysis:

Reference Sequences from the National Center for Biotechnology Information (i.e. NCBI’s RefSeq)
http://www.ncbi.nlm.nih.gov/RefSeq/

Bacterial sequences
ftp://ftp.ncbi.nlm.nih.gov/genomes/Bacteria/

X. fastidiosa Temecula1
ftp://ftp.ncbi.nlm.nih.gov/genomes/Bacteria/Xylella_fastidiosa_Temecula1_uid57869/

The most readable file is “.gbk”. Good for humans, bad for computers.
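Since .gbk files are plain text, the command line is all you need to skim one. Here is a minimal sketch using a made-up miniature record (real files come from the FTP directories above; the file name here is just for illustration):

```shell
# A tiny GenBank-style record for illustration (real .gbk files come from the FTP site above)
cat > mini.gbk <<'EOF'
LOCUS       EXAMPLE                 12 bp    DNA     linear   BCT
DEFINITION  Example record.
FEATURES             Location/Qualifiers
     CDS             1..12
                     /product="hypothetical protein"
ORIGIN
        1 acgtacgtacgt
//
EOF

# Skim the header
head -n 2 mini.gbk

# Pull out the product annotations
grep '/product=' mini.gbk
```

The same `head`/`grep` tricks work on a full bacterial genome record, which is far too long to read top to bottom.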

This file (README) describes the sequence formats
ftp://ftp.ncbi.nlm.nih.gov/genomes/Bacteria/ReadMe.txt

NCBI also has a convenient portal for all sequenced species, e.g. X. fastidiosa
http://www.ncbi.nlm.nih.gov/genome/173

Here’s a good system for getting information about genomes:
http://www.microbesonline.org/

If you are going to do anything yourself, you should be familiar with BLAST (the best bioinformatic software ever made)
There’s the website for searching the general database:
http://blast.ncbi.nlm.nih.gov/Blast.cgi

…and the stand-alone package (BLAST+) which will be useful for looking at any new sequences we have:
http://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=Web&PAGE_TYPE=BlastDocs&DOC_TYPE=Download
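A typical stand-alone run has two steps: build a database from your sequences, then search it. Here is a sketch (it assumes the BLAST+ package above is installed and on your PATH; the toy sequences are made up so the commands are self-contained):

```shell
# Make a toy genome and query so the sketch is self-contained
printf '>toy_genome\nACGTACGTACGTGGGGCCCCACGTACGTACGT\n' > genome.fna
printf '>toy_query\nACGTACGTACGT\n' > query.fna

# 1. Build a nucleotide database from the FASTA file
makeblastdb -in genome.fna -dbtype nucl -out genome_db

# 2. Search it; -outfmt 6 gives tab-separated hits
#    (-task blastn-short because the toy query is tiny)
blastn -query query.fna -db genome_db -outfmt 6 -task blastn-short > hits.tsv
cat hits.tsv
```

The tabular output (query, subject, % identity, alignment length, …) is easy to feed into downstream scripts, which is much of why BLAST+ beats the website for any serious analysis.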

Finally, useful GUI software for genome analysis:
Mauve: http://gel.ahabs.wisc.edu/mauve/
Tablet: https://ngslib.i-med.ac.at/node/120

-adam

Comments off

Biologists, please learn to use the command line

The other day, a young microbiologist and I were discussing the skill-set that was necessary for him to do his research. He indicated that he didn’t expect to ever need any software that had to be called from the command line (of course, he didn’t know the term “command line”). I quickly laid that idea to rest.

This attitude is common, yet frustrating. The computer is a central tool in modern biology, yet many biologists are happy to have only the most superficial familiarity with it. They act as though everything they need will be provided in a neat (and affordable) little software package with an intuitive graphical user interface (GUI). It won’t. GUIs almost always cripple the underlying analytical software, and they introduce a whole new layer of bugs and complexity. They are often harder to describe than simple command-line interfaces, and they are less standardized. All the time a biologist spends learning the ins and outs of some arbitrary GUI for a single commodity analysis could be spent learning the standard command-line interfaces used by the most powerful and cutting-edge (and often free) software out there.

So here’s my plea (and advice) to biologists: learn to use the command line. I’m not saying that you should learn to program*. Just the command line.

How to use the command line

On Mac and Linux it’s called “the terminal”. Just right-click (in Linux) to select it from the menu. In Windows it’s called the “command window” — right-click while holding Shift to select it from the menu. If you don’t know what to do once you have the window open, try this:

ping www.google.com

Hit Ctrl-C when you want to end. To see the options, type “ping -h”.
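Once you’re comfortable, the real payoff is chaining small commands together with pipes. A minimal sketch (the file is made up here so you can try it anywhere):

```shell
# Make a small FASTA file to play with
printf '>seq1\nACGT\n>seq2\nGGCC\n>seq3\nTTAA\n' > example.fa

# grep finds lines matching a pattern; -c counts them.
# FASTA headers start with ">", so this counts the sequences:
grep -c '>' example.fa    # prints 3

# The pipe (|) feeds one command's output into the next:
grep '>' example.fa | sort | head -n 2
```

Three tiny commands, one line — and the same pattern scales to files with millions of sequences.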

A new world awaits.

* if you want to do an analysis of any complexity, it would help to learn a scripting language (e.g. python), and probably regular expressions too.

Comments off