Balti and Bioinformatics “On Air” Wednesday September 10th 2014

Doing something a little bit different, our balti and bioinformatics meeting goes virtual this time, with a Google Hangout-on-Air, streaming live to YouTube!

Join us for a nanopore-themed programme which will be streamed from Google Hangouts to YouTube.

Draft schedule (subject to change)

Times are British Summer Time (GMT+1)

Start: 14:00 BST, 13:00 GMT, 09:00 Eastern Daylight Time
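If you are tuning in from elsewhere, the conversion is easy to check. This is a minimal sketch that assumes fixed offsets for 10 September 2014 (UK on UTC+1, US Eastern on UTC-4) rather than looking them up in a time zone database; `local_times` is just an illustrative helper name.

```python
from datetime import datetime, timedelta, timezone

# Assumed fixed offsets in force on 10 September 2014 (not read from a tz database).
BST = timezone(timedelta(hours=1), "BST")    # British Summer Time, UTC+1
UTC = timezone.utc
EDT = timezone(timedelta(hours=-4), "EDT")   # US Eastern in September, UTC-4

start = datetime(2014, 9, 10, 14, 0, tzinfo=BST)


def local_times(moment, zones=(BST, UTC, EDT)):
    """Return the wall-clock time of `moment` in each zone, keyed by zone name."""
    return {tz.tzname(None): moment.astimezone(tz).strftime("%H:%M") for tz in zones}


if __name__ == "__main__":
    for name, clock in local_times(start).items():
        print(name, clock)
```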

14.00 – Intro
14.05 – Clive Brown, Nanopore sequencing
14.45 – Nick Loman, Early data from nanopore sequencing: bioinformatics opportunities and challenges
15.15 – Matt Loose, Streaming data solutions for nanopore
15.30 – Josh Quick, Nanopore sequencing in outbreaks
15.45 – Torsten Seemann, Awesome pipelines for microbial genomics
16.15 – Finish

Please note that some talks may not be recorded, so make sure you tune in live so as not to miss anything!

Go register at Google Plus.

Update 11th September:

This went great! Please check out the entire event on YouTube:

Balti and Bioinformatics: 27th May, 2014

We’re back after a bit of a hiatus following last year’s triumphant Beatles and Bioinformatics (available to watch on YouTube).

As usual this is a meeting organised by juniors, aimed at those toiling at the coal-face (but all ages welcome!) with a strong focus on methods and technologies and discussion. We go for a Birmingham balti (it’s a curry) afterwards. What’s not to like?

Tuesday, 27th May 12-5.
Mechanical Engineering Lecture Theatre G29 (Ground Floor)
Building Y3 on Campus Map
University of Birmingham, Edgbaston, Birmingham, B15 2TT. Book trains to “University (Birmingham)” and go via Birmingham New St.

Park in the new multi-storey North East car park.

Strictly capped at 100 people, please register early at this link.

The loose theme: “Emerging technologies”. Details to emerge.

AGENDA

12:30 Registration

Part I: New Technologies

13:00 – 13:05 Introductions

13:05 – 13:35 Alan McNally, PI, Nottingham Trent University, “Parallel independent evolution of pathogenicity within the genus Yersinia”

13:35 – 13:55 Zhemin Zhou, Post-doc, Achtman Group, Warwick University – “Hidden Markov models for detection of recombination and Darwinian selection from whole-genome data”

13:55 – 14:35 Mark Pallen and Tom Connor, the MRC Cloud Infrastructure for Microbial Bioinformatics Consortium (Universities of Warwick, Birmingham, Cardiff and Swansea) – “Introducing CLIMB”

14:35 – 14:40 Yannick Wurm, Queen Mary University of London – “Lessons learnt from cloud computing”

Discussion: Cloud computing for genomics

15:00 Tea, coffee

Part II: Microbial Genomics

15:20 – 15:40 Elita Jauneikaite, PhD student, University of Southampton, “Analysis of Streptococcus pneumoniae whole-genomes from Singapore”

15:40 – 16:00 Lauren Cowley, PhD student, Public Health England, “Food festival: attack of the aggs”

16:00 – 16:20 Justin O’Grady, University of East Anglia, “Rapid diagnostics-by-sequencing of low-input DNA samples”

16:30 – 16:50 Phil Ashton, Public Health England, “Routine sequencing of Salmonella enterica for identification and molecular epidemiology”

Discussion: WGS – why isn’t this routine yet?

17:15 close, taxis and curry

An outsider’s guide to bacterial genome sequencing on the Pacific Biosciences RS

It had to happen eventually. My Twitter feed in recent times had become unbearable with the insufferably smug PacBio mafia (that’s you Keith, Lex, Adam and David) crowing about their PacBio completed bacterial genomes. So, if you can’t beat ‘em, join ‘em. Right now we have a couple of bacterial genomic epidemiology projects that would benefit from a complete reference genome. In these cases our chosen reference genomes are quite different in gene content and gene organisation, and their nucleotide identity is divergent to the extent that we are worried about how much of the genome is truly mappable. And in terms of presenting the work for publication, there is a certain aesthetic appeal to having a complete genome.

And so, after several false starts relating to getting the strains selected and enough pure DNA isolated, we finally sent some DNA off to Lex Nederbragt at the end of last year for Pacific Biosciences RS (PacBio) sequencing!

This week I received the data back, and I thought it would be interesting to document a few things I have learnt about PacBio sequencing during this intriguing process.

It’s all about the library, stupid

The early PacBio publications in 2011 showing the results of de novo assembly of PacBio data weren’t great, giving N50 results not materially better than 454 sequencing. This was despite the vastly longer read length achieved by the instrument: even then, mean read lengths of 2kb were achievable. Since then, incremental improvements to all aspects of the sequencing workflow have resulted in dramatic improvements to assembly performance, such that single contig bacterial genome assemblies can be expected routinely. This is probably best illustrated by the HGAP paper published last year, where single contig assemblies for three different bacterial species including E. coli were demonstrated. HGAP is PacBio’s assembly pipeline.
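For anyone who hasn’t met it, N50 is the contig length at which the longest contigs, added up in descending order, first cover half of the assembled bases. A minimal sketch of the standard definition follows; the function name and the toy contig lengths are made up for illustration and are not from any real assembly.

```python
def n50(contig_lengths):
    """Return the N50 of an assembly.

    Contigs are taken longest first; the N50 is the length of the contig
    at which the running total first reaches half of all assembled bases.
    """
    lengths = sorted(contig_lengths, reverse=True)
    half = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half:
            return length
    raise ValueError("no contigs given")


if __name__ == "__main__":
    # Toy numbers only: a fragmented assembly versus a single-contig one.
    print(n50([80, 70, 50, 40, 30, 20]))
    print(n50([5_000_000]))
```

A single-contig assembly has an N50 equal to the genome length, which is why the jump from hundreds of contigs to one or two shows up so dramatically in this statistic.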

The main improvements have been:

You need a lot of DNA (like, a lot)

There is a trade-off between the amount of input DNA and which library prep you can do. 5 micrograms for 10kb libraries, ideally 10 micrograms for 20kb libraries. Not always a trivial amount to get your hands on, even for fast-growing bacteria. This is one of the things that would limit our use of PacBio for metagenomics pathogen discovery right now, because this amount of microbial DNA from a culture-free sample is basically impossible to get.

However, we did in fact manage to get a library made from 2.6ug of DNA, but in this case the BluePippin size cut-off had to be dropped to 4kb (from 7kb).

Input DNA   Library prep   BluePippin cut-off   Number of reads   Average read length (bases)   MBases
2.6ug       10kb           4kb                  36,696            5529                          202.9
                                                70,123            5125                          359.4
>5ug        10kb           7kb                  49,970            6898                          334.7
                                                58,755            6597                          387.6
>10ug       20kb           7kb                  42,431            6829                          289.9
                                                59,156            7093                          419.6

Table 1. PacBio library construction parameters and accompanying output statistics (per SMRTcell, 180 minute movie)
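As a sanity check on Table 1, the MBases column should be roughly the number of reads multiplied by the average read length. A quick sketch, with the figures copied from the table and `yield_mbases` as an illustrative helper name:

```python
# (number of reads, average read length in bases, reported MBases),
# one tuple per SMRTcell, copied from Table 1.
SMRTCELLS = [
    (36696, 5529, 202.9),
    (70123, 5125, 359.4),
    (49970, 6898, 334.7),
    (58755, 6597, 387.6),
    (42431, 6829, 289.9),
    (59156, 7093, 419.6),
]


def yield_mbases(n_reads, mean_length):
    """Approximate yield in megabases from a read count and a mean read length."""
    return n_reads * mean_length / 1e6


if __name__ == "__main__":
    for n_reads, mean_length, reported in SMRTCELLS:
        estimate = yield_mbases(n_reads, mean_length)
        print(f"{n_reads:>6} reads x {mean_length} b = {estimate:6.1f} MBases (reported {reported})")
```

Five of the six rows reproduce the reported figure to within rounding of the average length. The third row comes out about 10 MBases higher than the 334.7 reported, so one of its three figures may contain a transcription slip.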

(An aside, I wonder if Oxford Nanopore’s reported shorter than expected read length from David Jaffe’s AGBT talk may just be a question of feeding the right length fragments into the workflow. Short fragments in equals short reads out).

Choose the right polymerase

For bacterial de novo assemblies, all the experts I spoke to recommended the P4-C2 enzyme. This polymerase doesn’t generate the very longest reads, but is recommended for de novo assembly because the newer P5-C3 has systematic error issues with homopolymers (as presented by Keith Robison at AGBT). P5-C3 is therefore recommended for scaffolding assemblies, or could be used in conjunction with P4-C2.

Longer reads may mean reduced loading efficiency

You want long fragments, but we were warned by several centres that 20kb libraries load less efficiently than 10kb libraries, meaning throughput is reduced. It was suggested we would need 3 SMRTcells to get 100x coverage for an E. coli sized genome of 5 Mb. However, that didn’t really seem to be the case for us (Table 1).
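The coverage arithmetic behind that advice is simple: coverage is total sequenced bases divided by genome size. A minimal sketch, using the two 20kb-library yields from Table 1 and a 5 Mb genome. The 200 MBases per-cell figure in `cells_needed` is a pessimistic assumption of mine for illustration, not something a provider quoted.

```python
import math

GENOME_SIZE = 5_000_000  # E. coli sized genome, as in the text


def coverage(total_bases, genome_size=GENOME_SIZE):
    """Fold coverage given total sequenced bases."""
    return total_bases / genome_size


def cells_needed(target_coverage, bases_per_cell, genome_size=GENOME_SIZE):
    """Number of SMRTcells needed to reach a target coverage, rounded up."""
    return math.ceil(target_coverage * genome_size / bases_per_cell)


if __name__ == "__main__":
    # Yields of the two SMRTcells run on the 20kb library (Table 1).
    twenty_kb_yields = [289.9e6, 419.6e6]
    print(f"{coverage(sum(twenty_kb_yields)):.1f}x from two SMRTcells")

    # Assumed pessimistic yield of 200 MBases per cell.
    print(cells_needed(100, 200_000_000), "cells at 200 MBases each")
```

With the yields we actually got, two cells of the 20kb library already clear 100x, whereas at an assumed 200 MBases per cell you would indeed need three.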

Shop around for best prices

As you almost certainly can’t afford your own PacBio, and even if you could your floor wouldn’t support its weight, you will probably be using an external provider like I did. Prices vary, but the prices I had per SMRTcell were around £350 and quotes for 10kb libraries around £400, with 20kb libraries being more expensive. In the end I went with Lex Nederbragt and the Oslo CEES – not the very cheapest but I know and trust Lex not to screw up my project and to communicate well, an important consideration (see Mick Watson’s guide to choosing a sequencing provider). In the UK, the University of Liverpool CGR have just acquired a PacBio and also would be worth a try. TGAC also provide PacBio sequencing. In the US, Duke provide a useful online quote generator and the prices seem keen.

What language you speaking?

It’s both refreshing and a bit unnerving to be doing something so familiar as bacterial assembly, but having to wrap your head around a bunch of new nomenclature. Specifically, the terms you need to understand are the following:

  • Polymerase reads: these are basically just ‘raw reads’
  • Subreads: aka ‘reads of insert’. This is, I think, the sequence between the adaptors, factoring in that the PacBio has a hairpin permitting reading of a fragment and its reverse strand. This term also relates to the circular consensus sequencing mode, which is becoming obsolete. Lex has a description here: (http://flxlexblog.wordpress.com/2013/06/19/longing-for-the-longest-reads-pacbio-and-bluepippin/)
  • Seeds: in the HGAP assembly process, these are the long reads which will be corrected by shorter reads
  • Pre-assembled reads: a bit confusing, these are the seeds which have been corrected, they are only assembled in the sense that they are consensus sequence from alignment of short reads to long reads and that PacBio uses an acyclic graph to generate the consensus
  • Draft assembly: the results of the Celera assembler, before polishing with the read set

The key parameter for HGAP assembly is the Mean Seed Cutoff

The seed length cutoff is the set of longest reads which give >30x coverage

This parameter is critical and defines how many of the longer, corrected reads go into the draft Celera assembly process. The default is to try and get 30x coverage from the longest reads in the dataset. This is calculated from the genome size you specify, which ideally you would know in advance. If this drops below 6000 then 6000 will be used instead. You can also specify the mean seed cutoff manually. According to Jason Chin the trade-off here is simply the time taken to correct the reads, versus the coverage going in the assembly. I am not clear if there is also any quality trade-off. Tuning this value did seem to make important differences to the assembly (a lower cut-off gave better results). The HGAP2 (for it is this version you want) tutorial is helpful on tuneable parameters for assembly.
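My understanding of how the default is chosen can be sketched as follows: sort the reads longest first, add up their lengths until you reach 30x the genome size, and take the length of the last read you needed as the cutoff, with a floor of 6000. This is an illustration of the description above, not HGAP’s actual code. The function and parameter names are my own, and falling back to the floor when the dataset never reaches the target is an assumption.

```python
def seed_length_cutoff(read_lengths, genome_size, target_coverage=30, minimum=6000):
    """Pick the read length above which the longest reads give the target coverage.

    Illustrative only: names and the fall-back behaviour are assumptions,
    not taken from the HGAP source.
    """
    target_bases = target_coverage * genome_size
    running = 0
    for length in sorted(read_lengths, reverse=True):
        running += length
        if running >= target_bases:
            # Enough long reads collected; never go below the floor.
            return max(length, minimum)
    # Assumed behaviour when the whole dataset is below the target coverage.
    return minimum


if __name__ == "__main__":
    # Toy read set: 20 reads of 10kb, 20 of 8kb and 100 of 3kb.
    reads = [10_000] * 20 + [8_000] * 20 + [3_000] * 100
    print(seed_length_cutoff(reads, genome_size=10_000))
    print(seed_length_cutoff(reads, genome_size=20_000))
```

In the toy example the smaller genome reaches 30x partway through the 8kb reads, so the cutoff is 8000. The larger genome would need the 3kb reads as well, so the floor of 6000 applies.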

SMRTportal is cool, but flakey

I used Amazon m2.2xlarge (32Gb RAM) instances with the latest SMRTportal 2.1.0 AMI. About half the assembly jobs I started failed, with different errors, even though I was doing the same thing each time; sometimes a job then worked with the same settings. I am not sure why this should be; maybe my VM was dodgy.

HGAP is slooooooow

Being used to smashing out a Velvet assembly in a matter of minutes, the glacial speed of the HGAP pipeline is a bit of a shock. On the Amazon AMI instance assemblies were taking well over 24 hours. According to Keith Robison on Twitter this is because much of the pipeline is single-threaded, with multi-threading only occurring on a per-contig basis. So if you are expecting a single contig you are bottlenecked onto a single processor. We therefore chose the m2.2xlarge instance type because the high-memory instances have the fastest serial performance of the available instance types. Speed actually matters in a clinical context. Gene Myers (yes, THAT Gene Myers) presented at AGBT 2014 to say that he had a new assembler which can do an E. coli sized genome in 30 minutes. It can’t come soon enough as far as I’m concerned.

Single contig assemblies are cool

A very long contig, yesterday

Well, my first few attempts have given me two contigs, but that is cool enough. And it is pretty damn cool. If money was no object (and locally we are looking at a 20:1 cost ratio for PacBio sequencing over Illumina) then I would get them every time. As it is, for now, we will probably confine our use to when we really need to generate a quality reference sequence to map Illumina reads against, for example when investigating an outbreak without a good reference. Open pan-genome species like Pseudomonas and E. coli are good potential applications for this technology, where you have a reasonable expectation of large scale genome differences between unrelated genomes. Our Pseudomonas genomes went from 1000 contigs to 2 contigs, which does make a huge difference to alignments. As far as I can see it is pointless to use PacBio for monomorphic organisms, unless you are interested in the methylation patterns. Keith Robison wrote recently and eloquently predicting the demise of draft genome sequencing, but whilst the price differential remains I think this is premature.

Polished assemblies still need fixing

Inspecting the alignment of Illumina reads back to the polished assemblies reveals that errors remain. These are typically single-base insertions relative to the reference, which need further polishing (Torsten Seemann’s Nesoni would be a good choice for this).

The Norwegian Sequencing Centre rocks

I’m very grateful to Lex Nederbragt and Ave Tooming-Klunderud and the rest of the staff of the Norwegian Sequencing Centre in Oslo for their help with our projects. They have been very helpful and I recommend them highly. Send them your samples and first born!

Also many thanks to those on Twitter who have answered my stupid questions about PacBio, particularly Keith Robison, Jason Chin, Adam Phillippy and Torsten Seemann.

When I have more time I will dig into the assemblies produced and look a bit more at what they mean for both the bioinformatics analysis and the biology.

The biggest genome sequencing projects: the uber-list!

I am just writing a short presentation for a meeting in Hinxton. I wanted to demonstrate the profound effect that whole-genome sequencing is having on the study of biology, and the size and scope of recent studies.

So I thought it would be fun to catalogue the largest – in terms of samples – genome projects that have been published so far.

A few things are notable here. As expected, many of the biggest studies in terms of numbers are bacterial, enabled partly due to their smaller genome size.

Update: My attention has just been drawn to a study of 2,007 C. elegans genomes!

I found it interesting that all the bacterial studies listed hail from the UK; we are clearly blazing a trail in this field of study!

A PhD for sequencing a gene? A single genome? A hundred genomes? How about a thousand genomes? A million?

Name                                          Number   Reference
S. pyogenes                                   3,615    Nasser et al. 2014
S. pneumoniae                                 3,085    Chewapreecha et al. 2014
Rice (Oryza sativa)                           3,000    The 3,000 rice genomes project
C. elegans                                    2,007    Thompson et al. 2013
Clostridium difficile                         1,250    Eyre et al. 2013
The thousand genome project (human genomes)   1,092    1000 Genomes Project Consortium, 2013
Mycobacterium tuberculosis                    1,000    Casali et al. 2014
Plasmodium falciparum                         825      Miotto et al. 2013
Streptococcus pneumoniae                      616      Croucher et al. 2013
Mycobacterium tuberculosis                    390      Walker et al. 2013
Salmonella in cattle and humans               373      Mather et al. 2013
Shigella sonnei                               263      Holt et al. 2013
Mycobacterium tuberculosis                    259      Comas et al. 2013
Streptococcus pneumoniae                      240      Croucher et al. 2011
Methicillin-resistant Staphylococcus aureus   193      Holden et al. 2013
Campylobacter jejuni                          192      Sheppard et al. 2013
Mycobacterium abscessus in CF                 170      Bryant et al. 2013

So, what’s coming up that could potentially knock these studies off their perch?

Did I miss a study? Please drop a comment below.

Rules for inclusion:

  • whole-genome sequencing >10X average per sample (no exome, target capture)
  • at least one library per sample (e.g. no pooled species, quasispecies)
  • not a meta-analysis, fresh data for the paper

Thanks to: Casey Bergman, Scott Edmunds, Prashant, Liz Batty, Craig Duffy, Cui Yujun, Lex Nederbragt for suggestions!

Update 10-02-2014: Added Chewapreecha et al, Casali et al, now occupying positions 1 and 4 respectively in the uber-list!
Update 15-04-2014: Added Nasser et al, new position 1!
Update 29-05-2014: Added 3,000 rice genome project, new position 3!

Beatles and Bioinformatics: Our best meeting yet

Wow, so last Wednesday we held the fourth instalment of our Balti and Bioinformatics series which was a brilliant success, attracting over 100 participants. The idea of this meeting is to bring those developing cutting-edge bioinformatics methods together with those who actually use them. Thanks to the generous sponsorship of the Centre for Genomic Research at the University of Liverpool and the Medical Research Council and the BBSRC we were able to change up a gear, inviting two incredible international speakers: Sébastien Boisvert and Daniel Huson. We were also able to afford a proper lunch for everyone – of course this was the traditional Liverpool dish ‘scouse’.

One thing that gave the meeting a little extra edge was that we did a live ‘webcast’ for the very first time through YouTube’s live events system. Apart from a few issues with the sound right at the beginning, this was a great success and the YouTube statistics told me that 461 playbacks were made during the broadcast. There was also a flurry of Twitter activity on the #BeatlesAndBioinformatics hashtag (see the Storify by Surya Saha here) and we even managed to take a question over Twitter.

The great thing about the YouTube live events is that it also saves a copy, and so we are able to record the event for posterity. I’ve had a few people ask about how to set up such a webcast themselves, we will try and write a short guide for the blog at some point.

A great meeting, I am incredibly grateful to our speakers: Séb, Daniel, Chris Quince, Susannah Salter, Sujai Kumar, Mike Cox, Rebecca Gladstone and Chris Hayman. I am also massively grateful to the team at Liverpool for helping to organise; Neil Hall, Christiane Hertz-Fowler and especially Lesley Parsons. Also thanks to Christina Bronowski and Ian Goodhead and the Free State Kitchen for help with the evening catering. And finally to Barbara Myers, Paul Loman and Josh Quick for organising the live video webcast.

Finally, we ended up in the Cavern Pub where we were entertained by the guitar antics of The Amazing Kappa.

Check out the webcast!

You can even jump directly to a talk, thanks to the tags that Sebastien Boisvert has put in.

13.00 – KEYNOTE: Sebastien Boisvert, Université Laval, Québec, Canada – “Ray and Ray Cloud Browser for Metagenomics” 5:46

13.50 – Chris Stewart, University of Northumbria at Newcastle – “Development of the Gut Microbiome in Preterm Infants at Risk of Necrotising Enterocolitis and Sepsis” 59:48

14.10 – Chris Quince, University of Glasgow – “CONCOCT: Clustering cONtigs on COverage and ComposiTion” 1:15:32

14.30 – Susannah Salter, Wellcome Trust Sanger Institute – “What’s lurking in your kits?” 1:39:00

15.10 – KEYNOTE: Daniel Huson – Center for Bioinformatics, University of Tübingen – “Identifying Organisms from a Stream of DNA Sequences” 2:32:30

16.00 – Sujai Kumar, University of Oxford – “Blobology: exploring raw (meta)genome data for contaminants, symbionts and parasites using taxon-annotated GC-coverage plots” 3:10:50

16.20 – Mike Cox, Imperial College – “Copy number correction in 16S analysis” 3:33:16

16.40 – Rebecca Gladstone, University of Southampton / Wellcome Trust Sanger Institute – “Managing hundreds and thousands of bacterial genome sequences” 3:48:30

Course Advert: NERC Workshop on Population and Metagenomics Analysis

I help out with a fair few workshops on genomic analysis, but most are limited by being restricted to just a day or two, hardly enough time to cover more than the basics. So the 10 day long NERC Population and Metagenomics Analysis course, organised by the fantastic Konrad Paszkiewicz from the University of Exeter is a quite amazing opportunity. The teaching will be from some of the best people in the field (and me!) and of the highest quality. Incredibly, if you are a UK-funded researcher the costs of the course are entirely waived, with accommodation and meals also included! Spaces are limited but I would definitely urge you to sign-up. Full details below,  and please head over to the course workshop page to register. Hurry!

Workshop Overview

A ten-day workshop taking place between 25 February – 6 March 2014 providing detailed hands-on training for population and meta-genomics analysis for researchers with little or no background in mathematics or computing.

Venue: Dartington Hall, Totnes, Devon (nearest train station – Totnes)

Times: 25 February – 6th March 2014.

Arrival evening of Tuesday 25 February 2014. Departure morning of 6th March 2014. The course itself will take place 9am-12pm, 2pm-5pm and on some evenings 7pm-10pm every day 26 February-5th March. Students are expected to attend the entire course.

Contact: research-events@exeter.ac.uk

Registration

The course itself is free of charge and is funded by a Professional Postgraduate Development Award from NERC.

A total of 30 funded places are available which cover the costs of accommodation and food, but not the cost of transportation to/from the venue.

An additional 10 places are available for participants from industry. The cost of accommodation and meals will need to be covered by the participants.

You should register your interest by 31 December 2013. Participants will be informed by 10th January 2014 as to whether they have been selected. Please note that preference will be given to researchers funded by NERC.

Accommodation and Transport:

For UK-based academic researchers:

The course is free of charge for up to 30 academic researchers working at recognised UK HEIs and research institutes. Accommodation at Dartington Hall is included, along with breakfast, lunch and dinner. Transportation to/from Dartington Hall is NOT included.

For all other participants:

Whilst course fees will be waived, the cost of accommodation and meals will NOT be included. If selected, you will need to book accommodation with Dartington Hall separately. A special rate of £102 per night plus VAT has been negotiated (this includes breakfast, lunch and dinner).

Requirements:


Selected participants must bring their own laptops to the course. This needs to be a modern laptop (Windows, Mac OSX or Linux) with a full-size screen and wireless (please do not bring netbooks etc). Ethernet sockets may not be available during the course so please plan accordingly.

 

Draft Programme

An outline of the short course is given below with lead instructors in brackets. Please note that this is subject to change.

Tuesday 25 February 2014:

Arrival at Dartington Hall

Evening welcome buffet

Wednesday 26 February 2014:

Breakfast

Morning session 9am-12pm:

Introduction to the course

Hands-on workshop: Introduction to Amazon EC2 cloud (Konrad Paszkiewicz)

Lunch

Afternoon session 2pm-5pm:

Hands-on workshop: Introduction to Linux (Konrad Paszkiewicz and Julian Catchen)

Dinner

Evening session 7pm-10pm:

Hands-on workshop: Introduction to Linux (Konrad Paszkiewicz and Julian Catchen)

Thursday 27 February 2014:

Breakfast

Morning session 9am-12pm:

Hands-on workshop: Introduction to Linux (Konrad Paszkiewicz and Julian Catchen)

Lunch

Afternoon session 2pm-5pm:

Lecture: Introduction to genomics and bioinformatics (David Studholme)

Dinner

Evening session 7pm-10pm:

Hands-on workshop: Short read genomics (Konrad Paszkiewicz and David Studholme)

Friday 28 February 2014:

Breakfast

Morning session 9am-12pm:

Lecture: Introduction to RAD-seq (William Cresko and Julian Catchen)

Lunch

Afternoon session 2pm-5pm:

Hands-on workshop: RAD-seq (William Cresko and Julian Catchen)

Dinner

Evening session 7pm-10pm:

Hands-on workshop: RAD-seq (William Cresko and Julian Catchen)

Saturday 1 March 2014:

Breakfast

Morning session 9am-12pm:

Hands-on workshop: RAD-seq (William Cresko and Julian Catchen)

Lunch

Afternoon session 2pm-5pm:

Participant presentations

Dinner

Evening session 7pm-10pm:

 

Sunday 2 March 2014:

Free day with organised activities

Monday 3 March 2014:

Breakfast

Morning session 9am-12pm:

Lecture: Marker-based metagenomics introduction (Jose Clemente)

Lecture: Statistical challenges in metagenomics analysis (Chris Quince)

Lunch

Afternoon session 2pm-5pm:

Hands-on workshop: Introduction to QIIME (Jose Clemente, Daniel McDonald)

Dinner

Evening session 7pm-10pm:

Hands-on workshop: Introduction to QIIME (Jose Clemente, Daniel McDonald)

 

Tuesday 4 March 2014:

Breakfast

Morning session 9am-12pm:

Lecture: Whole genome metagenomics introduction (Nick Loman)

Lunch

Afternoon session 2pm-5pm:

Hands-on workshop: Whole genome metagenomics (Nick Loman, Chris Quince)

Dinner

Evening session 7pm-10pm:

Hands-on workshop: Whole genome metagenomics (Nick Loman, Chris Quince)

Wednesday 5 March 2014:

Breakfast

Morning session 9am-12pm:

Hands-on workshop: Free session (bring your own data)

Lunch

Afternoon session 2pm-5pm:

Hands-on workshop: Free session (bring your own data)

Dinner

Evening session 7pm-10pm:

End of workshop party

Thursday 6 March 2014:

Breakfast

Departure

Full event information

Population and metagenomics analysis are fields which have developed rapidly in recent years and have opened up new methodologies to researchers in ecology, systematics, evolutionary development and ecotoxicology. However, the software which has been developed to analyse these types of data is typically non-graphical and complex to master for researchers in biological sciences who have not been specifically trained in bioinformatics. In this short course the Amazon EC2 cloud will be used for training using laptops.

One of the most significant recent developments in population genomics is Restriction-site Associated DNA sequencing (RAD-seq) [1]. This technique uses high-throughput sequencing to simultaneously sequence and genotype organisms at tens of thousands of loci. The number of markers generated makes analysis much more sensitive than traditional microsatellite-based approaches, enabling resolution between very closely related individuals who belong to the same microsatellite type. It also requires comparatively less development and optimization time since the number of markers is proportional to the number of fragments digested. It can be applied even in the absence of a reference genome and can assist with genome assembly as well as provide functional information [1,2].

Metagenomics involves the study of communities of microbial organisms in particular environments. The combination of culture-independent techniques and high-throughput sequencing technology has made possible a comprehensive characterization of whole communities for a fraction of the cost. This enables studies of particular environmental niches over time or changing conditions at different resolution levels [3,8].

We have arranged for leading population genomic and metagenomic experts in the US and the UK to serve as instructors for this short-course. The US instructors below have either developed the molecular methods, the theory behind the analysis, or have actively developed the relevant software to perform the analysis.

 

Workshop instructors

  1. Professor William Cresko is a Principal Investigator and Director of the Institute of Ecology and Evolution at the University of Oregon. He is a pioneer of the RAD-seq technique, and has used the approach extensively to perform genetic mapping of stickleback fish phenotypic variation [1] as well as the evolutionary genomics of pipefishes and seahorses (http://creskolab.uoregon.edu/).
  2. Dr Julian Catchen is a Postdoctoral Research Fellow at the University of Oregon Institute of Ecology and Evolution. He is the author of Stacks – the most popular software package designed to process and analyse RAD-seq data.
  3. Prof. Jose Clemente is based at the Icahn School of Medicine at Mount Sinai, New York. He is a contributing author to the QIIME (Quantitative Insights Into Microbial Ecology) [4] software package which is one of the most popular software tools for performing metagenomic analysis. His lab at Mount Sinai is particularly focused on characterizing the mechanisms of action of the microbiome in IBD.
  4. Dr Nick Loman is an MRC Special Training Research Fellow currently working at the University of Birmingham. His research program focuses on the genomic and metagenomic analysis of microbial sequence data in a clinical context.
  5. Daniel McDonald is a graduate student in the Interdisciplinary Quantitative Biology program in the BioFrontiers Institute at the University of Colorado, and a part of Prof. Rob Knight’s lab, a recognized leader in microbiome research. Daniel is a contributing author of QIIME and a core software developer on the project.
  6. Dr David Studholme is a Senior Lecturer in Bioinformatics at the University of Exeter. His research interests encompass applications of genomics, transcriptomics and metagenomics to plant-pathogen interactions. His recent projects have focussed on tree-pathogens Chalara fraxinea and Phytophthora ramorum as well as bacterial pathogens of banana, enset, tomato and other crops.
  7. Dr Konrad Paszkiewicz is the Director of the Wellcome Trust Biomedical Informatics hub. He is responsible for the provision of training for PhD students and researchers as well as bioinformatics facilities and capabilities within the University of Exeter.
  8. Prof. Peter Kille is the director of Bio-Initiatives at the University of Cardiff. His primary research expertise lies in the application of molecular techniques such as proteomics and genomics to eco-toxicology. His research interests encompass the effect on biological systems of the release of heavy metals into the environment.
  9. Dr Christopher Quince is a Reader at the University of Glasgow. He leads the Computational Microbial genomics group which focuses on the development of novel algorithms to aid the analysis of microbial community structures. The group also develop engineering systems using microbial communities including microbial fuel cells and filtration systems. He is the author of PyroNoise and AmpliconNoise which are integral to the analysis of many high-throughput metagenomic datasets.

All of the instructors have extensive experience teaching in a short-course/workshop environment. These include the National Evolutionary Synthesis Centre Workshops (NESCent) in Next Generation Sequencing 2011 and 2012 held at Duke University, North Carolina, USA and the Evomics workshop held every January in Český Krumlov, Czech Republic. The QIIME group hold regular workshops in the US and worldwide. Much of the teaching material for the proposed short course has already been produced, delivered and tested to a large number of audiences. Amazon cloud images for each section of the short course have already been produced by the instructors and have been extensively tested in previous workshops.

The Amazon and Linux training will be based on a modified version of the ‘Unix & Perl for Biologists’ [7] course adapted for use on the cloud with an extensive EC2 tutorial. Pre-existing in-house workshop materials will be used to teach the basics of remapping, assembly and variant calling. The Stacks software suite will be used in conjunction with R to teach RAD-seq analysis. We will use QIIME [4], MEGAN [5] and MetaPhlAn [6] software packages to teach various aspects of marker based and shotgun metagenomics.

 

References:

  1. Hohenlohe PA, Bassham S, Etter PD, Stiffler, N. Cresko, W.A. Population Genomics of Parallel Adaptation in Threespine Stickleback using Sequenced RAD Tags. PLoS Genetics 2010;6.
  2. J. Catchen, P. Hohenlohe, S. Bassham, A. Amores, and W. Cresko. Stacks: an analysis tool set for population genomics. Molecular Ecology. 2013.
  3. J Rousk, E Bååth, PC Brookes, CL Lauber, C Lozupone, JG Caporaso, R Knight. Soil bacterial and fungal communities across a pH gradient in an arable soil. The ISME Journal 4 (10), 1340-1351
  4. J. Caporaso et al. QIIME allows analysis of high-throughput community sequencing data. Nature Methods 7, 335-336 (2010)
  5. Huson, DH, Mitra, S, Weber, N, Ruscheweyh, H, and Schuster, SC (2011). Integrative analysis of environmental sequences using MEGAN4. Genome Research, 21:1552-1560
  6. Segata, N et al. Metagenomic microbial community profiling using unique clade-specific marker genes. Nature Methods 9, 811-814 (2012).
  7. Bates, S. et al. Global biogeography of highly diverse protisan communities in soil. ISME J 7: 652-659; Dec. 2012. http://korflab.ucdavis.edu/Unix_and_Perl/

Learning to say no: Seven questions to ask before accepting an invitation

I just got back from a short meeting in the States. Even though I was only out of the office for a few days, I am now even further behind with overdue peer reviews, unfinished manuscript drafts and collaborations. My long-suffering family now have to cope with me being tired and irritable while I catch up on missed sleep and adjust back to local time. And I can see several more weeks out of the office coming before Christmas.

During the meeting dinner I fell to discussing the pressures of attending all these meetings and keeping sane with Julie Segre, an investigator at the NIH. It was great to meet Julie, who is one of the genome detectives featured in Carl Zimmer’s inspiring article Mutants. She told me her brilliant scheme for dealing with academic invitations — she has seven questions she will ask of every invitation. Unless all the answers are yes, she will decline.

I thought this was such a good idea, I asked her to send them to me to share on this blog.

Over to Julie!

I think the important part is to formulate your own questions and revisit as your life changes. I have switched the order of these questions at different times.  Right now, I’m wondering why the invitation being an honor is so high on my list – I think this was intended to capture the times that it is grad students inviting me rather than some committee.  My important criteria are:

1. Is the person inviting me someone whom I really like or owe a favor?
2. Will I meet people that I need/want to meet?
3. Is this invitation really an honor?
4. Have I already agreed to other trips during this time period?
5. Do I completely understand the commitment? E.g. if this is a review of a candidate, is it just on paper or does it require travel to the site? If I agree to be on the program committee, will I have to review 100+ abstracts?
6. Is the trip somewhere that I really want to go? (What is the travel time? Are there direct flights?)
7. Have I checked with my spouse?

I say ‘no’ and don’t feel bad.  I truly enjoy my open days with protected time.  I find that writing this down and making myself go through these questions has helped to separate the things that I want to do from things that I should do from things that I really need to do.

Blog away – others have loved this idea, although I’m not sure they have adopted it.  I’ve tried to make a similar decision making tree to pick a journal for submission, but I have yet to understand that rather irrational process.

Well, I think this is great advice. I am really grateful to Julie for letting me post these.

Saying no is always a difficult thing to do, but I realise it is actually vital if you want to keep control of your work, family and sanity! I will adapt Julie’s list and try and stick to it as my New Year’s resolution.

The Oxford Nanopore Golden Ticket

There have been rumblings of news from Oxford Nanopore at the American Society of Human Genetics meeting taking place at the moment in Boston. I have to admit I may have had advance sight of their exciting press release today, leading to some very childish “I have a secret” type tweets between myself and a few others in the know!

Luckily it is a secret no more. Today’s announcement of the MinION access programme (MAP) is very exciting! For a refundable deposit of $1,000, Oxford Nanopore say – if you are selected – they will send you a MinION and a regular supply of flow cells “sufficient to allow frequent usage of the system”, with only shipping costs funded by the user!

Wow!

Sign me up, obviously – and the rest of the world.

What I love about this is that anyone with a proposal that would benefit from nanopore sequencing, in their words “where long reads, simple workflow, low costs, and real time analysis can be shown to make a key difference” has a chance of getting a golden ticket. They do suggest that those prepared to buy multiple MinIONs have a better chance, so you could adopt a Veruca Salt strategy here …

It goes without saying we will be applying, and no doubt you will too.

As is typical, the announcement is a little short on detail. I caught up with a representative of Oxford Nanopore to try and coax a bit more information out of them. For example, I asked how many flow cells a MAP participant might expect to receive. This is “still in discussion”.

They promise that “MAP participants will be the first to publish data from their own samples”. I take this to mean that they have decided they won’t, in fact, be putting out the long awaited “proof of sequencing” data, but the participants on the programme will be responsible for keeping the community updated as to progress with the platform until it is released “properly”.

How long does each flow-cell last? Here, “one of the key parts of the programme will be to work out the useful life time when used and possibly abused by customers”. OK.

Sample preparation kits are again mentioned; in Clive Brown’s talk at the UK Genome Science meeting in Nottingham at the start of September he discussed sample prep in a little more detail. Again, he confirmed that the fundamental requirement for nanopore sequencing was double-stranded DNA with a 5′ overhang. He mentioned several preparation methods for this including the “standard” prep method using PCR, and the “duo” prep which is similar to the standard Illumina TruSeq method with physical fragmentation, end-repair and A-tailing. However he also mentioned a transposon based method and the use of molecular “tethers” in order to help guide fragments to the pore as they “range in 3-dimensional space”, in order to increase pore occupancy. These tethers could also serve to preferentially target sequences of interest to the pore.

Another interesting factoid is that flow cells are supplied separately from the MinION itself, suggesting that the fully-disposable sequencer idea may have run its course and they are transitioning to a more standard instrument-consumable model.

What is the input requirement? “small”

So — many questions remain. For example, registration is open in November, but we do not know how long the selection process will go on for before the first MinIONs are sent out. Let’s hope it is a matter of weeks or a few months.

How many participants will get to enrol in this programme? It seems clear to me that Oxford Nanopore will be hugely, massively oversubscribed no matter how many slots they have planned for.

How long will this programme go on before MinION is on general release? “If everything goes perfectly, it will be a short programme but realistically there is plenty for us to learn about the tech, customers, applications, training and support.”

The burn-in period makes sense in terms of making sure participants are getting the best out of the instrument before running their own samples, but we don’t have any information about how long this might last and what the pre-agreed criteria are, potentially a source of frustration if the platform is beset with technical problems.

You know what? I wouldn’t bother applying actually, I’ll apply and check it out for you ;)

Full release below …

You can follow the rest of the news from ASHG through the Twitter hashtag, #ASHG2013.

MinION Access Programme
In late November, Oxford Nanopore will open registration for a MinION Access Programme (MAP – product preview). This is a substantial but initially controlled programme designed to give life science researchers access to nanopore sequencing technology at no risk and minimal cost.
MAP participants will be at the forefront of applying a completely novel, long-read, real-time sequencing system to existing and new application areas. MAP participants will gain hands-on understanding of the MinION technology, its capabilities and features. They will also play an active role in assessing and developing the system over time. Oxford Nanopore believes that any life science researcher can and should be able to exploit MinION in their own work. Accordingly, Oxford Nanopore is accepting applications for MAP participation from all [1, 2].
About the programme
A substantial number of selected participants will receive a MinION Access programme package. This will include:
* At least one complete MinION system (device, flowcells and software tools).
* MAP participants will be asked to pay a refundable $1,000 deposit on the MinION USB device, plus shipping.
* Oxford Nanopore will provide a regular baseline supply of flowcells sufficient to allow frequent usage of the system. MAP participants will ONLY pay shipping costs on these flowcells. Any additional flowcells required at the participants’ discretion may be available for purchase at a MAP-only price of $999 each plus shipping and taxes.
* Oxford Nanopore will provide Sequencing Preparation Kits. MAP participants may choose to develop their own sample preparation and analysis methods; however, at this stage on an unsupported basis.
What are the terms of the MAP agreement?
Participation in the MAP product preview program will require participants to sign up to an End User License Agreement (EULA) and simple terms intended to allow Oxford Nanopore to further develop the utility of the products, applications and customer support while also maximising scientific benefits for MAP participants. Further details will be provided when registration opens, however in outline:
* MAP participants will be invited to provide Oxford Nanopore with feedback regarding their experiences through channels provided by the company.
* All used flow cells are to be returned to Oxford Nanopore [3].
* MAP participants will receive training and support through an online participant community and support portal.
* MAP participants will go through an initial restricted ‘burn-in’ period, during which test samples will be run and data shared with Oxford Nanopore. After consistent and satisfactory performance has been achieved under pre-agreed criteria, the MAP participants will be able to conduct experiments with their own samples. Data can be published whilst participants are utilising the baseline supply of flowcells.
* MAP participants or Oxford Nanopore may terminate participation in the programme at any time, for any reason. Deposits will be refunded after all of the MAP hardware is returned.
* MAP participants will be the first to publish data from their own samples. Oxford Nanopore does not intend to restrict use or dissemination of the biological results obtained by participants using MinIONs to analyse their own samples. Oxford Nanopore is interested in the quality and performance of the MinION system itself.
* Oxford Nanopore intends to give preferential status for the GridION Access Programme (GAP) when announced to successful participants in the MinION access programme.
* The MinION software will generate reports on the quality of each experiment and will be provided to Oxford Nanopore only to facilitate support and debugging.
Registration process
Registration will open in late November for a specific and limited time period. Oxford Nanopore will operate a controlled release of spaces on the programme.
MAP participants will be notified upon acceptance to the programme. They will then be able to review and accept the EULA before providing the refundable deposit and joining the programme. MAP participants will then receive a login for the participant support portal and a target delivery date for their MinION(s) and initial flow cells.
The online participant support portal will provide training materials, FAQs, support and other information such as data examples from Oxford Nanopore. It will also include a community forum to allow participants to share experiences.
Who can join?
Anybody who is not affiliated with competitors of Oxford Nanopore. Strong preference will be given to biologists/researchers working within the field of applied NGS where long reads, simple workflow, low costs, and real time analysis can be shown to make a key difference. Preference may also be given to individuals/sites opting for multiple MinIONs. If the programme is oversubscribed, some element of fairly applied random selection may be used to further prioritise participants.
1. If you would like us to keep you informed of the opening of this registration please visit our contact page and select the box marked ‘Keep me informed on the MinION Access programme’.
2. The MinION system is for Research Use Only
3. Flowcells can be easily, quickly and thoroughly washed through with water and dried before return.

Beatles and Bioinformatics! 27th November 2013

So far we have held three meetings in the Baltis and Bioinformatics series, which has had some great talks and resulted in some great connections being made. The meetings are based around the linked topics of genome sequencing and bioinformatics, with the idea to take leaders in the bioinformatics field, and put them together with students, post-docs, ‘service’ bioinformaticians and others at the coal-face of data analysis. The meetings should stimulate discussion, collaboration opportunities and learning. We traditionally finish off with a bioinformatics clinic, where you can ask the hardest (or stupidest) questions you can think of. There is a heavy emphasis on method at this meeting – ‘how did you do a particular analysis?’, ‘What software is best to try for this particular problem?’, ‘What issues did you encounter along the way?’, rather than glossy re-hashing of already published work.

For the fourth meeting in this series, I am delighted and in fact bowled over that the Centre for Genomic Research at the University of Liverpool have decided not only to host but to spend a bit of money on the meeting. Not only that, they have stumped up a bit of cash to invite some international speakers, and they have subsidised lunch and the venue fee so that we can keep this meeting free. They are great people, and I should also thank the BBSRC and NERC for their generosity in sponsoring the CGR workshops. Many thanks to Neil Hall and Christiane Hertz-Fowler for their kind support of this meeting. This does also mean we need plenty of attendees, so please spread this announcement far and wide so we can make this the largest meeting yet.

BEETLES

BEATLES

Finally, the loose theme for the meeting is ‘metagenomics’, although this meeting will also be of interest to those doing ‘just genomics’. And of course, being in Liverpool we just had to have a BEATLES (not BEETLES) theme.

A 16S & metagenomics workshop will be held on the 28th and 29th of November; further details will be announced shortly.

It is free to attend, but please register through this link! And whilst you are waiting for the meeting, please check the highly entertaining #BeatlesAndBioinformatics hashtag.

Due to high demand, registration for this meeting will close on Thursday, 31st October 2013 at 5pm (GMT).

AGENDA

Venue: The Chapel, The Foresight Centre, University of Liverpool, Brownlow Street, Liverpool, L69 3GL. http://www.foresightcentre.co.uk

12.00 – Registration & Lunch

13.00 – KEYNOTE: Sebastien Boisvert (@sebhtml), Université Laval, Québec, Canada – “Ray and Ray Cloud Browser for Metagenomics”

13.50 – Chris Stewart (@CJStewart7), University of Northumbria at Newcastle – “Development of the Gut Microbiome in Preterm Infants at Risk of Necrotising Enterocolitis and Sepsis”

14.10 – Chris Quince, University of Glasgow – “CONCOCT: Clustering cONtigs on COverage and ComposiTion”

14.30 – Susannah Salter (@Zannah_Du), Wellcome Trust Sanger Institute – “What’s lurking in your kits?”

14.50 – Refreshment break

15.10 – KEYNOTE: Prof. Daniel Huson – Center for Bioinformatics, University of Tübingen – “Identifying Organisms from a Stream of DNA Sequences”

16.00 – Sujai Kumar (@SujaiK), University of Oxford – “Blobology: exploring raw (meta)genome data for contaminants, symbionts and parasites using taxon-annotated GC-coverage plots”

16.20 – Mike Cox (@MikeyJ), Imperial College – “Copy number correction in 16S analysis”

16.40 – Rebecca Gladstone (@RAGladstone), University of Southampton / Wellcome Trust Sanger Institute – “Managing hundreds and thousands of bacterial genome sequences”

17.00 – Chris Hayward, Amazon Web Services – “Getting started with genomics in the cloud”

17.30 – The Magical Mystery Tour (dinner, followed by a visit to Mathew Street and the Cavern Club)

Beatles and Bioinformatics! 27th November 2013

AGENDA

Venue: The Chapel, The Foresight Centre, University of Liverpool, Brownlow Street, Liverpool, L69 3GL. http://www.foresightcentre.co.uk

12.00 – Registration & Lunch

13.00 – Sebastien Boisvert (@sebhtml), Université Laval, Québec, Canada – “Ray and Ray Cloud Browser for Metagenomics”

13.50 – Chris Stewart (@CJStewart7), University of Northumbria at Newcastle – “Development of the Gut Microbiome in Preterm Infants at Risk of Necrotising Enterocolitis and Sepsis”

14.10 – Mike Cox (@MikeyJ), Imperial College – “Copy number correction in 16S analysis”

14.30 – Refreshment break

15.00 – TBA

15.50 – Sujai Kumar (@SujaiK), University of Oxford – “Blobology: exploring raw (meta)genome data for contaminants, symbionts and parasites using taxon-annotated GC-coverage plots”

16.10 – Rebecca Gladstone (@RAGladstone), University of Southampton / Wellcome Trust Sanger Institute – “Managing hundreds and thousands of bacterial genome sequences”

16.30 – Bioinformatics clinic and discussion

17.00 – The Magical Mystery Tour