An original paper published in Nature Biotechnology this month reports that a 290-fold improvement in variant calling accuracy can be gained through the application of a set of filters that include a consensus filter. Brian Hilbush and John Cleary of Real Time Genomics collaborated with VIB scientists on the paper, which is freely available online (Optimized filtering reduces the error rate in detecting genomic variants by short read sequencing).
With great advances in technology come pundits crying out that this will be the year that everything changes. It seems that every year from 1982 to 1992 was the "Year of the LAN." Yet, in 1993 the Internet would be the overnight success story in networking.
Thus, with great humility and modest expectations, I bring to you my top three predictions for genomics in 2012. Not a single one of these will make this the actual Year of the Genome. But you can be sure that we are well on our way to the actual overnight success story in genomics.
Last week I had the great privilege to join over 60 cancer investigators and bioinformaticians at the Cancer Genome Sequencing Summit, in Boston, MA. Leading practitioners from major pharma and biotech companies were in attendance to discuss real opportunities to translate actionable information from foundational cancer genome research into effective clinical therapies. I had extensive conversations with nearly half of them, and can say that we made more quality contacts for our company at this meeting than at any other event this year.
Are researchers really buried in genomic data? Does it really cost more to analyze a genome than to produce the sequence for it? A recent article by Andrew Pollack, biotechnology editor for the NY Times, titled "DNA Sequencing Caught in Deluge of Data," suggests an exploding bioinformatics market that specifically addresses the DNA data deluge. Advances in bioinformatics are necessary because, according to David Haussler of UCSC, "Data handling is now the bottleneck. It costs more to analyze a genome than to sequence a genome."
Through a unique sequence analysis pipeline for high quality variant detection in whole genome and exome sequencing applications, RTG Investigator offers the highest possible confidence in SNP calling. The accurate identification of known genetic variants and the discovery of "private" variants in previously unsequenced individuals are prerequisites for personalized medicine and genome-based diagnostics.
This technical white paper (The RTG Pipeline for High Performance Variant Detection in Whole Genome and Exome Sequencing Applications) provides an in-depth description of the workflow and summarizes the variant analysis results, particularly in comparison with those of the BWA/GATK pipeline. The performance metrics reported, both accuracy and overall analysis speed, indicate that this method for SNP/Indel calling will provide an excellent foundation for human disease investigations.
Download RTG Investigator now for free individual use, or contact us for an evaluation with large scale datasets.
As I mentioned in my last post, here is a description of the poster I presented at the conference.
RTG Investigator - High Throughput Metagenomics Analysis Toolkit
Early in 2011, the German E. coli outbreak sparked competition between several research groups to assemble and identify the pathogenic components and provide a taxonomic classification of the strain from shotgun sequencing data. Since NGS and 3rd generation sequencing technologies can generate reads from these samples within hours, the challenge is to get the same quick turn around with the software used to perform downstream analysis on the data. Computational tasks for the German outbreak include the assembly and identification of known and novel components of the new strain and determining molecular phylogenetic relationships among E. coli species.
Real Time Genomics has built RTG Investigator, a software toolkit with metagenomics and variant detection features bundled into a single package. A novel hash-based indexing search engine provides high-throughput mapping and gapped alignment of shotgun reads to both genome references and protein databases. The engine has been adapted to perform large-scale k-mer phylogenetic analysis on genome sequence and metagenomic samples. Multiple outputs are produced to provide a rich set of alternative views and afford unique insights into phylogenetic relationships. Continue reading ›
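To give a feel for what a k-mer comparison between sequences involves, here is a minimal Python sketch. It is not the RTG Investigator engine, whose internals are not described in this post; the function names and the choice of Jaccard distance over k-mer sets are assumptions made purely for illustration.

```python
from collections import Counter


def kmer_counts(sequence, k):
    """Count every overlapping k-mer in a sequence (case-insensitive)."""
    sequence = sequence.upper()
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


def jaccard_distance(seq_a, seq_b, k=4):
    """Distance in [0, 1] based on the shared fraction of distinct k-mers.

    Illustrative only: real phylogenetic pipelines typically use longer
    k-mers and more refined distance measures.
    """
    kmers_a = set(kmer_counts(seq_a, k))
    kmers_b = set(kmer_counts(seq_b, k))
    union = kmers_a | kmers_b
    if not union:
        return 0.0  # both sequences shorter than k: nothing to compare
    return 1.0 - len(kmers_a & kmers_b) / len(union)
```

Pairwise distances computed this way could be fed into any standard clustering or tree-building method to get a rough picture of how samples relate.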
There were two sessions of talks today and a poster session in the evening, at which I presented our poster "RTG Investigator - High Throughput Metagenomics Analysis Toolkit".
The first session of the day, entitled "Sequencing Pipelines and Assembly", had a couple of interesting talks on the evaluation of genome assemblers: one on the Assemblathon 1 genome assembly competition by Benedict Paten of UCSC, and one on GAGE by Steven Salzberg of Johns Hopkins University. Both talks described different metrics, from the traditional N50 measures through to more detailed statistical analysis of contig lengths, gaps and scaffold gap bridging. What was evident is that each assembler's performance varied on different genomes, and that the quality of the reads used has a significant effect on the final assemblies. The genome assembly domain has a lot of black art involved in it, with most assemblers requiring data to be 'cleaned' before they can perform well. Another talk by Michael Schatz of CSHL showed how combining the outputs from several assemblers can both improve contig lengths and remove incorrect deletions/insertions from within the contigs. So overall, if you can afford the computation resources and time, then running several assemblers will definitely garner better results. Continue reading ›
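For readers unfamiliar with N50, it is the contig length at which contigs of that length or longer account for at least half of the total assembled bases. A minimal Python sketch of the standard definition follows; the function name and example lengths are made up for illustration and do not come from either talk.

```python
def n50(contig_lengths):
    """Return the N50 of a collection of contig lengths."""
    lengths = sorted(contig_lengths, reverse=True)
    if not lengths:
        raise ValueError("need at least one contig length")
    half_total = sum(lengths) / 2
    running = 0
    # Walk from the longest contig down until half the bases are covered.
    for length in lengths:
        running += length
        if running >= half_total:
            return length


# Hypothetical assembly: 290 bases in total, so the halfway mark is 145.
# The two longest contigs (80 + 70 = 150) reach it, giving an N50 of 70.
print(n50([80, 70, 50, 40, 30, 20]))
```

Because N50 only looks at lengths, an assembler can score well while still joining contigs incorrectly, which is one reason the talks went on to more detailed measures.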
Only one session today, and it was entitled "Transcriptions, Alternative Splicing and Gene Predictions". A lot of the content was new to me, but there were some interesting insights from a computation and accuracy of mapping point of view.
RNA-Seq data is still a popular form of input for many studies, as it has become cheap and easy to produce on modern sequencers. This allows larger studies to be performed on a cohort rather than just individuals, which in turn gives more meaning to the results obtained. Also, while it is cheaper to do RNA-Seq runs, there is still a need to do whole genome sequencing. It was shown that some SNPs may or may not be present in either form of analysis depending on locus-specific read coverage and other factors, but more importantly, with RNA-Seq data you only get reads in regions that are expressed, meaning some regions are not covered at all.
From a computational point of view, RNA-Seq data is easier to process than WGS data due to the smaller quantities of reads. While there are many pieces of software out there to map/align reads of any type, three interesting facts showed up in the talks. Continue reading ›
Montreal, October 14th, 2011. The data keep getting richer and deeper. Several presentations at the ICHG/ASHG meeting reviewed progress and provided a view of the surging scope of the large-scale genome projects that were launched a few years ago. For me and about 200 others, Ion Torrent’s first User Group meeting and its most excellent party at the Intercontinental were a bonus to end the evening.
From Montreal, October 12th, 2011. I reveled in a day featuring great science around genome instability and gene networks at the 12th annual meeting of the American Society of Human Genetics (ASHG) in Montreal.
Careful and thorough attention to detail takes a bit more time up front, but avoids rework and increases satisfaction in the final result. As the carpenter says, measure twice and cut once. With RTG Investigator, Real Time Genomics applied the expert software craftsmanship of senior programmers with Ph.D. degrees in computer science to the demanding requirements of NGS sequence analysis. The goal? Enable biological researchers to accelerate their scientific investigation with large NGS data sets using a state-of-the-art software platform. What exactly were the requirements for the platform? Continue reading ›
Take a second look at your NGS data with RTG Investigator. This short article suggests an efficient method for validation and extension of variant detection analysis results from whole genome and exome sequencing experiments. RTG Investigator from Real Time Genomics requires only a two day incremental investment in time and 10% more processing on a local cluster. Continue reading ›
The RTG development team is continually evolving, improving the core algorithms of, and adding new functionality to, RTG Investigator. To ensure new code is an improvement, an arsenal of automated tests is employed, both low level unit tests and higher level regression tests. While this testing is designed to give quick turn around so any coding issues that arise can be detected and resolved in a timely manner, we also test candidate releases on large real world data sets. In particular, we have several standard human NGS data sets that we regularly push through RTG's short variant detection pipeline, and as a result we have built up a set of useful scripts to help manage this process.
At the RTG San Francisco office we have a small cluster of 8 compute nodes, affectionately known as "The Tanks" (for historical reasons I won't go into here), that we use to process the NGS data sets. Each tank runs Linux (CentOS 5.6) and has 2 Xeon CPUs (4 cores each), 48GB of RAM and 1TB of local disk space, and they all share an NFS network file system. We run (Oracle) Sun Grid Engine (SGE) to manage the cluster and to distribute work to the compute nodes.
Following is a shell script that we use to set up and manage the dozens of rtg map and rtg snp jobs required to process an Illumina NGS read set for a human sample. In the script the following RTG Investigator features are used to split lanes of reads, make use of node disk resources and to perform variant calling on regions of the genome: Continue reading ›
Among the many tasks required for full investigation of a genomic DNA sample from tumors or tissue biopsies, blood or complex body sites is the evaluation of viral or microbial species content. Whole genome sequencing provided by the Complete Genomics Analysis Service (CGA™) delivers data with respect to a single human reference genome but does not map or assemble reads against other individual genomes, sequence databases or non-human reference genomes.
RTG Investigator tools can be employed to rapidly detect and quantitate viral and bacterial species in human DNA samples sent to Complete Genomics. To begin the analysis, the primary read data received in the reads.tsv format is converted to RTG’s sequence data format (SDF) using a command line utility called cg2sdf. The process creates paired directories (31 to 35bp reads with 62 to 70bp per DNB) which can then be mapped to any reference as paired-end reads. RTG’s cgmap is a gap-tolerant aligner that recognizes the unique read structure resulting from Complete Genomics’ sequencing chemistry (see Science. 2010 Jan 1;327(5961):78-81). With RTG software, read mapping is performed in a similar manner to other sequencing data types, where an index is created in RAM for the reads and the genome is scanned sequentially. Outputs are written to SAM alignment files with CIGARs designed to incorporate information related to the variable gap read structure of Complete Genomics sequence data.
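The "index the reads, scan the genome" idea can be illustrated with a toy Python sketch. This is not how cgmap is implemented: it handles exact matches only, ignores the gapped Complete Genomics read structure and paired ends, and every name in it (build_read_index, scan_reference, seed_length) is hypothetical.

```python
from collections import defaultdict


def build_read_index(reads, seed_length):
    """Hash each read by its first seed_length bases; the index lives in RAM."""
    index = defaultdict(list)
    for read_id, read in enumerate(reads):
        if len(read) >= seed_length:
            index[read[:seed_length]].append(read_id)
    return index


def scan_reference(reference, reads, seed_length):
    """Slide along the reference once, looking up each window in the read index.

    Returns (read_id, position) pairs for reads that match the reference
    exactly at that position. Real aligners tolerate mismatches and gaps.
    """
    index = build_read_index(reads, seed_length)
    hits = []
    for pos in range(len(reference) - seed_length + 1):
        seed = reference[pos:pos + seed_length]
        for read_id in index.get(seed, ()):
            read = reads[read_id]
            # Confirm the whole read matches, not just the seed.
            if reference[pos:pos + len(read)] == read:
                hits.append((read_id, pos))
    return hits
```

The appeal of this arrangement is that the reference is read once from start to finish, so memory use is governed by the size of the read index rather than by the size of the genome being scanned.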
Utilizing RTG’s CG aligner, collections of viral or bacterial genomes can be screened to detect the presence of known or homologous sequence in the DNA sample of interest. As an example, we processed a subset of read data from NA12878, a DNA sample used in the International HapMap and 1000 Genomes Projects (Nature 2005 Oct 27;437(7063):1299-320; Nature 2010 Oct 28;467(7319):1061-73). Continue reading ›
Anybody who has had to process NGS reads for mammalian or plant genomes will appreciate the amount of disk space that is required to store the raw data and the disk space that is used during the mapping/alignment process. Over the past few months I have heard of several cases where disk failures on network drives have resulted in lost data and required rerunning some or all of the mapping processes.
At RTG we have the philosophy of using as little disk space as possible in all the RTG Investigator tools. For us short term use of RAM is a better alternative than longer term use of the typically inexpensive disks that are generally used for shorter term network storage in modern computation clusters.
We apply these ideas in RTG Investigator tools in the following ways:
- combine as much functionality into each stage of RTG pipelines as possible to reduce the amount of file reading and writing that is needed
- write compressed outputs to reduce disk space
- allow multi-file inputs to variant detection tools to avoid the need to merge outputs from mapping runs
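The last two points can be illustrated with a small Python sketch using only the standard library. It is a generic illustration, not RTG Investigator code: the helper names and file names are invented, and gzip merely stands in for whatever compression a real tool uses.

```python
import gzip
import os
import tempfile


def write_compressed(path, lines):
    """Write text lines straight to a gzip file; no uncompressed copy hits disk."""
    with gzip.open(path, "wt") as handle:
        for line in lines:
            handle.write(line + "\n")


def read_many(paths):
    """Stream lines from several compressed files in turn, without merging them."""
    for path in paths:
        with gzip.open(path, "rt") as handle:
            for line in handle:
                yield line.rstrip("\n")


# Hypothetical outputs from two separate mapping runs.
with tempfile.TemporaryDirectory() as workdir:
    run_a = os.path.join(workdir, "run_a.txt.gz")
    run_b = os.path.join(workdir, "run_b.txt.gz")
    write_compressed(run_a, ["record1", "record2"])
    write_compressed(run_b, ["record3"])
    # A downstream step consumes both files as one stream.
    combined = list(read_many([run_a, run_b]))
    print(combined)
```

Accepting a list of inputs means the intermediate merged file, often the largest temporary artifact in a pipeline, never has to exist.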
As an example of the disk savings made when using RTG Investigator, I recorded the disk space used over time for the mapping of a lane of Illumina 35bp paired-end reads. I compared it to the bwa/samtools steps that are required to produce output that is sorted in reference genome order, ready for further variant call processing. Continue reading ›
