Large BLAST runs and output formats

I have used BLAST in many different forms and on many scales, from single-gene analysis to large "all vs all" comparisons. This is a short story of how I decided to delete 164GB of BLAST output.

I will save my reasoning for doing such a large BLAST for another post. For now, all you have to know is that I am doing an "all vs all" BLAST for 2,178,194 proteins. That is (2,178,194)^2 = 4,744,529,101,636 comparisons. Quite a few, to be sure, but nothing that a large compute cluster can't handle (go big or go home is usually my motto).

I usually use the tabular output format for BLAST (-m 8). However, one of the nice functions in BioPerl lets you calculate the coverage of all HSPs with respect to the query or subject sequence. BioPerl handles the tiling of the HSPs, which is annoying to have to code yourself, and I often use this coverage metric to filter BLAST hits downstream in my pipeline. So here comes the annoying thing: the tabular output gives the start and end positions of each alignment, but not the full length of the query or subject sequence. To calculate coverage you therefore need to go back to the original sequences and retrieve their lengths. I know this is not a hard thing to do, but I am a lazy programmer and I like fewer steps whenever possible.

So I decided to try out the BLAST XML format (-m 7). A few test runs showed that the files were much larger (about 5X), but this format includes all the information about the BLAST run, including the sequence lengths and coordinates. I decided not to worry about space issues and launched my jobs. Bad decision.
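For anyone who would rather stay with the tabular format, here is a minimal Python sketch of the coverage calculation. It is not BioPerl's tiling code, just a plain union of the query intervals of all HSPs for each query/subject pair. It assumes the standard 12-column -m 8 layout, where columns 7 and 8 hold the query start and end of each alignment, and it assumes you supply the sequence lengths yourself (the `query_lengths` dictionary and both function names are made up for this example).

```python
from collections import defaultdict


def merged_length(intervals):
    """Count positions covered by 1-based, inclusive intervals, merging overlaps."""
    covered = 0
    current_start = current_end = None
    # Normalise each interval so start <= end, then sweep left to right.
    for start, end in sorted((min(a, b), max(a, b)) for a, b in intervals):
        if current_end is None or start > current_end:
            # No overlap with the running block: close it and open a new one.
            if current_end is not None:
                covered += current_end - current_start + 1
            current_start, current_end = start, end
        else:
            # Overlap: extend the running block if this HSP reaches further.
            current_end = max(current_end, end)
    if current_end is not None:
        covered += current_end - current_start + 1
    return covered


def query_coverage(tabular_lines, query_lengths):
    """Return {(query, subject): fraction of the query covered by its HSPs}.

    query_lengths is a hypothetical dict of query id -> sequence length,
    built separately (for example from the original FASTA file).
    """
    hsps = defaultdict(list)
    for line in tabular_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and -m 9 style comment lines
        fields = line.rstrip("\n").split("\t")
        query, subject = fields[0], fields[1]
        hsps[(query, subject)].append((int(fields[6]), int(fields[7])))
    return {
        pair: merged_length(intervals) / query_lengths[pair[0]]
        for pair, intervals in hsps.items()
    }
```

Subject coverage works the same way using columns 9 and 10 and a dictionary of subject lengths.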

Well, 3 days later I found out that my quota is 300GB, and since I already had 150GB from another experiment, the BLAST output put me over. I can't easily tell which jobs completed normally, so I am faced with a decision: either write a script to figure out which jobs completed normally, or scrap all the data and re-run it the right way. I have opted to delete my 164GB of BLAST output and re-run it using the tabular format, and I might even gzip the data on the fly to ensure this doesn't happen again.
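Gzipping on the fly could look something like the rough Python sketch below. It pipes a command's standard output straight into a gzip file, and it only moves the file to its final name if the command exits cleanly, so a finished file means a finished job. The function name and the ".part" suffix are my own choices for illustration, not anything BLAST provides.

```python
import gzip
import os
import shutil
import subprocess


def run_and_compress(command, out_path):
    """Run command and gzip its stdout into out_path.

    Output is written to out_path + ".part" first and renamed only when the
    command exits with status 0, so out_path never holds a partial result.
    """
    tmp_path = out_path + ".part"
    with subprocess.Popen(command, stdout=subprocess.PIPE) as proc:
        with gzip.open(tmp_path, "wb") as out:
            # Stream in chunks; the uncompressed output never touches disk.
            shutil.copyfileobj(proc.stdout, out)
    # Leaving the Popen block waits for the process and sets returncode.
    if proc.returncode != 0:
        os.remove(tmp_path)
        raise RuntimeError(f"command failed with exit status {proc.returncode}")
    os.replace(tmp_path, out_path)
    return out_path
```

A job would then be launched with something like `run_and_compress(["blastall", "-p", "blastp", "-d", "all_proteins", "-i", "chunk_001.fasta", "-m", "8"], "chunk_001.tsv.gz")`, where the database and file names are placeholders.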

Of course this isn't rocket science, but I thought I would tell my tale in case others are in similar circumstances.

Hello California!

Well UC Davis to be more precise. I accepted a postdoctoral fellowship from Jonathan Eisen to be a part of the iSEEM project working on metagenomics. I have only been here for a few days, and first impressions seem great. First, the research field is exactly what I was most interested in; second, my previous PhD research is definitely of relevance; and third, I feel like I have lots to learn from the people around me.

Since my previous blog tagline/description is now inaccurate:
"A PhD student's point of view on bioinformatics, evolution, and microbial diversity; with an interest in cutting edge computer tools that make them all a bit easier."

I decided to radically change it to:
"A post-doc's point of view on bioinformatics, evolution, and microbial diversity; with an interest in cutting edge computer tools that make them all a bit easier."
Jonathan's opinion on open-access publishing is quite similar to my own, so in addition to blogging about microbial evolution, expect to see more posts about my views on academic publishing.

Looking for a bioinformatics expert?

What I have to offer:
  • A balanced background in both biology (BSc) and computer science (BCS)
  • Soon to be completed PhD
  • Extensive research experience in bioinformatics, genomics, phylogenetics/phylogenomics, evolution, and bacterial pathogenesis
  • Some previous research experience in medical imaging, ontology development, and metagenomics
  • An impressive publishing record (7 papers, 3 as first author, 2 more first-author papers under review)
  • Solid computational skills including Perl programming, database design (MySQL), parallel programming, and web design (PHP & JavaScript)
  • Good communication and social skills
  • More information
What I am looking for:
  • Post-doc or job (academic or industrial)
  • Preferably, a position where I have some significant management or leadership responsibilities
  • Geographically interested in northeastern parts of North America (Ottawa down to New York), but would entertain positions elsewhere in N.A.
I didn't put any limitations on research interests, since I am open to many areas. However, anything having to do with the Human Microbiome Project, human-bacteria interactions, or metagenomics would be of particular interest.

Please email me if you are interested or if you have suggestions on some good openings.