Frequently Asked Questions - RTG Investigator

Product

Q: What is RTG Investigator?
A: RTG Investigator is the most advanced sequence analysis software for variant detection and metagenomic analysis with large-scale Illumina, Complete Genomics and Roche 454 data sets.
Q: What can I do with it?
A: RTG Investigator supports research in diverse fields, such as human health, agriculture, bio-fuels and environmental security, as well as across multiple phases of a project. Anyone starting a new project can get an immediate jump-start on pipeline development. Researchers with existing pipelines can easily apply the additional sensitivity of RTG Investigator to extract premium value from current experiments or unblock investigations that require deeper analysis than existing tools can provide.
Q: Why should I use it?
A: Variant detection and metagenomic analysis pipelines based on RTG algorithms produce high quality results 10-1000x faster than open source and other commercial alternatives. The resulting workflows are simple and efficient yet still achieve the most accurate results. Algorithms developed by Real Time Genomics also provide functionality not previously available in other analysis software tools, such as mapping the overlapping, gapped reads from Complete Genomics and fully integrating these reads into a standard analysis pipeline alongside Illumina sequence data.
Q: Where do I get it?
A: RTG Investigator is available for immediate download at this page.
Q: How much does it cost?
A: RTG Investigator is free for individual use. Organizational pricing is based on production usage at competitive market rates.

License

Q: What do I get for free?
A: The individual use license allows personal use of the full RTG Investigator sequence analysis software suite for one year, running on up to five computers simultaneously.
Q: What happens when my free license expires?
A: An individual use license can be renewed after twelve months of use.
Q: Can I publish my research results?
A: Yes. There is no restriction on the use of experimental results obtained through analysis with the software. However, you must have advance written consent from RTG to publish benchmark comparisons of RTG to other software.
Q: Do I retain complete control over my research data and methods?
A: Yes. Any and all results that are obtained from use of the software by you are solely and fully owned by you.
Q: What do I pay for?
A: Organizations that want additional service and support, centralized deployment, and unlimited use may license RTG Investigator on an annual subscription basis. The paid license allows an organization to implement centrally administered pipelines for which direct or indirect compensation may be obtained.

Download

Q: What do I get in the free download?
A: The RTG Investigator package includes the RTG Investigator executable, an extensive user manual, and a recent version of the Java Runtime Environment (JRE). A "nojre" version download package is available for use with Mac OS X systems. Separately, one may download tutorial files and instructions.
Q: What operating environment do I need to run the software?
A: RTG Investigator has been tested on 64-bit Linux, Windows and Mac OS X systems with the most recent build of the JRE version 1.6.
Q: Why do you say "use educational or government domains for rapid approval"?
A: Any biological investigator working in an academic or government organization fits the profile of a non-competitive researcher who may benefit from use of the software. Thus, anyone with an email suffix of .edu or .gov gets put onto a white list for immediate approval.
Q: What other software/libraries do I need to install to run RTG Investigator?
A: None. Everything needed to run RTG Investigator is included in the download package. As such, it can be installed in user space with no need for administrator/root privileges.
Q: How do I install RTG Investigator?
A: Download the RTG Investigator zip file appropriate for your processing environment from the Real Time Genomics web site. Unzip the rtg zip archive to install and then run the rtg (rtg.bat for Windows) command. Agree to the EULA by typing 'y' when prompted.

Features

Q: Does RTG support industry standard data and file formats?
A: Yes. RTG accepts sequence data for read mapping in FASTA, FASTQ and Complete Genomics formats. Alignments are reported in the Sequence Alignment/Map (SAM) format. Variant calling functions accept SAM format alignments and report results in the Variant Call Format (VCF). Coverage depth is reported in the BED format.
Q: What support does RTG have for compute clusters and parallel processing?
A: RTG automatically splits compute-intensive commands for read mapping and variant calling into multiple threads that run simultaneously on multi-core systems. Use of the "start-read" and "end-read" command parameters allows quick and simple scripting to split large jobs onto multiple processors.
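
The scripting approach described above can be sketched as follows. This is a minimal illustration, not a supported recipe: it assumes the parameters are written --start-read and --end-read on the command line, that the start value is inclusive and the end value exclusive, and it borrows the SDF names from the read group example elsewhere in this FAQ. The read count and chunk size are made-up values. The script only prints the commands; it does not run rtg.

```shell
#!/bin/sh
# Print one "rtg map" command per chunk of reads (nothing is executed).
# Assumptions: --start-read is inclusive, --end-read is exclusive;
# the totals passed in below are placeholder values for illustration.
print_chunk_commands() {
    total_reads=$1
    chunk_size=$2
    start=0
    while [ "$start" -lt "$total_reads" ]; do
        end=$((start + chunk_size))
        # The last chunk may be shorter than chunk_size.
        if [ "$end" -gt "$total_reads" ]; then
            end=$total_reads
        fi
        echo "rtg map -t sdf_hg19 -i sdf_SRR016606 -o map_SRR016606_${start} --start-read ${start} --end-read ${end}"
        start=$end
    done
}

print_chunk_commands 10000000 4000000
```

Each printed line can then be submitted as a separate job on a different processor or cluster node, with each job writing to its own output directory.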
Q: What sequence data can RTG process?
A: RTG accepts large-scale next generation sequencing datasets from Illumina GAII and HiSeq, Complete Genomics, and Roche 454 instruments.
Q: How long can the sequences be?
A: Reads can be any length from 36 to 1000 base pairs. Performance has been optimized for the currently popular sequencing instruments and lab kits.
Q: What is the tolerance for sequencing errors?
A: The default sensitivity settings of the RTG map command balance mapping percentage and speed requirements, delivering highly accurate results for human read sequence from Illumina instruments while accommodating error rates of 2% or less. For novel studies with emerging platforms or cross-species mapping, additional sensitivity up to 5% error tolerance can be obtained with a modest sacrifice of accuracy.
Q: How does RTG handle reads that map multiple times?
A: By default, RTG reports the top five mapping results for any given read, which allows objective reporting of ambiguous reads (i.e. reads that map at multiple locations on the reference template). Extensive filtering options allow precise use of this additional information. Additionally, a "top-random" option allows for use of the mapping command in pipelines developed around the BWA approach of selecting one location at random for the placement of a read that maps multiple times.
Q: Does RTG call SNPs (like Samtools or GATK)?
A: Yes. RTG has an integrated pipeline that includes mapping and variant calling. It reports alignments in the SAM format, allowing interoperation with other variant callers such as GATK.  The RTG variant caller reports variants in the VCF format.
Q: Does RTG work on reference sequences longer than 4GB in total?
A: Yes. With RTG, read sequences are indexed and loaded into RAM at runtime. The reference sequence(s) are then streamed past the reads. Thus, any size reference can be used, including very large databases.

Usage

Q: How do I convert reads to a format that will work with RTG?
A: The map command directly processes normal and compressed FASTA and FASTQ files using Sanger or Illumina quality encodings. The cgmap command directly processes Complete Genomics reads.tsv.bz2 files. For maximum efficiency and speed, we recommend that you convert FASTA, FASTQ and tsv.bz2 files to our SDF format. For other formats, use vendor-supplied tools to first convert those files to FASTQ.
Q: How do I produce alignment outputs that are compatible with samtools?
A: Currently, the map command produces CIGARs in the SAM output that are not compatible with samtools (version 0.1.9). To allow samtools to process RTG SAM files, run the map command with the --legacy-cigars option. This will produce CIGARs without any 'X' or '=' characters in them.
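
To check whether an existing SAM file still contains the 'X' or '=' operators, something like the following may help. It is a minimal sketch that relies only on the SAM layout (tab-separated records, CIGAR in the sixth column, header lines starting with '@'). The function name and the sample records are made up for illustration.

```shell
#!/bin/sh
# Count alignment records whose CIGAR (SAM column 6) uses '=' or 'X'.
# Header lines start with '@' and are skipped.
count_extended_cigars() {
    awk -F '\t' '!/^@/ && $6 ~ /[=X]/ { n++ } END { print n + 0 }' "$1"
}

# Tiny made-up SAM file, for illustration only.
sam_file=$(mktemp)
printf '@HD\tVN:1.4\n' > "$sam_file"
printf 'r1\t0\tchr1\t100\t37\t10=1X5=\t*\t0\t0\t*\t*\n' >> "$sam_file"
printf 'r2\t0\tchr1\t200\t37\t16M\t*\t0\t0\t*\t*\n' >> "$sam_file"

count_extended_cigars "$sam_file"
```

A count of zero suggests the file uses only the older CIGAR operators.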
Q: The mating counts from RTG map are low. How do I increase matings?
A: Check that mated gap values are correct for the PE library you are mapping. See -m/-M in help for the map command.
Q: What is a SAM read group header line and why is it needed?
A: It is an optional line added to the SAM file containing information about the sequencing run that produced the reads. The snp command, and other commands that process mapped reads, use the read group information to apply sequencing run specific calibrations to the reads in order to improve variant calling accuracy.
Q: How do I create a read group header line for use with the map command?
A: Create a file with a single line containing a unique ID, a sample name and the platform the reads came from. For example, for a lane of Illumina reads (SRR016606) from the NA19240 sample, the read group line would be created with:

echo -e "@RG\tID:SRR016606\tSM:NA19240\tPL:ILLUMINA" > rg_SRR016606.txt

It is critical that the \t escapes are written as real tab characters when creating the read group header file; echo -e achieves this. The --sam-rf flag is used in the map command to include the header information in the SAM mappings files that are output:

rtg map -t sdf_hg19 -i sdf_SRR016606 -o map_SRR016606 --sam-rf rg_SRR016606.txt
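
Because the behavior of echo -e differs between shells, it can be worth confirming that the file really contains tab characters. The sketch below writes the same read group line with printf and then checks it. The check_read_group function is a hypothetical helper, and the four-field test reflects only this example line, not a general rule for read group headers.

```shell
#!/bin/sh
# Write the read group line with printf, which expands \t in its
# format string regardless of which shell is in use.
rg_file=rg_SRR016606.txt
printf '@RG\tID:SRR016606\tSM:NA19240\tPL:ILLUMINA\n' > "$rg_file"

# Succeed only if the first line has four tab-separated fields and
# starts with @RG (hypothetical helper for this example line).
check_read_group() {
    awk -F '\t' 'NR == 1 && NF == 4 && $1 == "@RG" { ok = 1 } END { exit !ok }' "$1"
}

if check_read_group "$rg_file"; then
    echo "read group header looks tab-delimited"
else
    echo "read group header is malformed" >&2
fi
```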
Q: How do I create the name mapping file required for rtg species --relabel-species-file?
A: The format command keeps the name of each sequence up to the first space, as SAM files produced by the map command cannot have spaces in their names. For mapping runs intended for whole genome sequence mapping this is not usually an issue, as the chromosomes tend to be named simply, for example 'chr10'. For metagenomics applications, however, this is not usually the case. For example, when using a database of bacteria genomes the sequence names are longer and more descriptive.

ls -1 Escherichia_coli_APEC_O1_uid58623/
NC_008563.fna
NC_009837.fna
NC_009838.fna

head -2 Escherichia_coli_APEC_O1_uid58623/NC_008563.fna
>gi|117622295|ref|NC_008563.1| Escherichia coli APEC O1, complete genome
AACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTCTGATAGCAGCTTCTGAACTGGTTACCTGC

head -2 Escherichia_coli_APEC_O1_uid58623/NC_009837.fna
>gi|157418083|ref|NC_009837.1| Escherichia coli APEC O1 plasmid pAPEC-O1-ColBM, complete sequence
GGTGATGCTGCCAACTTACTGATTTAGTGTATGATGGTGTTTTTGAGGTGCTCCAGTGGCTTCTGTTTC


When formatting these FASTA files to SDF, only the 'gi|117622295|ref|NC_008563.1|' part of the name is stored in the rtg SDF directory. While this is sufficient to differentiate sequences, it is not the most human friendly form. Tools like rtg species have an option to specify a mapping file of the short 'reference' name to a longer form. The following bash script creates a name mapping file called 'rename.txt' from the bacteria sequence .fna files. In this case it replaces any spaces in the names with underscores.

for f in /path/to/sequence/files/*.fna; do
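    # Short name: first field without '>'. Long name: remaining fields joined with '_', commas dropped.
    head -n 1 "$f" | gawk '{gi=substr($1,2); name=$2; for (i=3;i<=NF;i++) {name=name "_" $i}; gsub(/,/,"",name); print name " " gi}'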
done > rename.txt

cat rename.txt
...
Escherichia_coli_APEC_O1_complete_genome gi|117622295|ref|NC_008563.1|
Escherichia_coli_APEC_O1_plasmid_pAPEC-O1-ColBM_complete_sequence gi|157418083|ref|NC_009837.1|
Escherichia_coli_APEC_O1_plasmid_pAPEC-O1-R_complete_sequence gi|157412014|ref|NC_009838.1|
Escherichia_coli_ATCC_8739_complete_genome gi|170018061|ref|NC_010468.1|
Escherichia_coli_'BL21-Gold(DE3)pLysS_AG'_chromosome_complete_genome gi|253771435|ref|NC_012947.1|
Escherichia_coli_B_str._REL606_chromosome_complete_genome gi|254160123|ref|NC_012967.1|
Escherichia_coli_BW2952_complete_genome gi|238899406|ref|NC_012759.1|
Escherichia_coli_CFT073_chromosome_complete_genome gi|26245917|ref|NC_004431.1|
...
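
Before passing a name mapping file to another tool, a quick sanity check may catch formatting slips. The sketch below assumes the two-column layout shown in the listing above (long name, a space, then short name); that layout is inferred from the listing rather than from a format specification. The check_name_map function is a hypothetical helper.

```shell
#!/bin/sh
# Report lines that do not have exactly two fields, and short names
# (second column) that occur more than once.
# Prints nothing when the file passes both checks.
check_name_map() {
    awk 'NF != 2 { print "line " NR ": expected 2 fields, found " NF }
         NF == 2 && seen[$2]++ { print "line " NR ": duplicate name " $2 }' "$1"
}

# Two lines copied from the rename.txt listing, for illustration.
map_file=$(mktemp)
cat > "$map_file" <<'EOF'
Escherichia_coli_APEC_O1_complete_genome gi|117622295|ref|NC_008563.1|
Escherichia_coli_ATCC_8739_complete_genome gi|170018061|ref|NC_010468.1|
EOF

check_name_map "$map_file"
```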
Q: How do I create a single bacteria sequence from whole chromosome and plasmid FASTA files?
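A: Use the following script, which keeps the first header line and replaces each later header line with a run of Ns so that the chromosome and plasmid sequences are joined into a single record: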

cat Escherichia_coli_APEC_O1_uid58623/*.fna | \
    gawk '{if (NR==1) {print $0} else {if (substr($1,1,1) == ">"){print "NNNNNNNNNNNNNNNNNNNNNNNNN"} else {print $0}}}' \
    > Escherichia_coli_APEC_O1_uid58623_merged.fna
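
A quick way to confirm that a merge of this kind worked is to count the header lines in the result. The sketch below builds two made-up FASTA files, merges them with the same rule as the script above (written here with plain awk), and counts the headers. The file names and sequences are placeholders.

```shell
#!/bin/sh
# Count FASTA header lines (lines starting with '>') in a file.
count_headers() {
    grep -c '^>' "$1"
}

# Made-up two-record input: keep the first header, replace any later
# header with a run of Ns, and pass sequence lines through unchanged.
work_dir=$(mktemp -d)
printf '>seq1 chromosome\nACGT\n' > "$work_dir/a.fna"
printf '>seq2 plasmid\nGGCC\n' > "$work_dir/b.fna"
cat "$work_dir"/*.fna | \
    awk 'NR == 1 || !/^>/ { print; next } { print "NNNNNNNNNNNNNNNNNNNNNNNNN" }' \
    > "$work_dir/merged.fna"

count_headers "$work_dir/merged.fna"
```

A merged file should report exactly one header.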