Fast, easy and reliable analysis pipeline for NGS data – also available for you

The Genetics department of the UMCG has developed a routine in-house NGS DNA analysis pipeline, which is now available to everyone.
We offer this as a service: you upload your targeted panel or exome sequencing data (for example FASTQ files) and receive the results within several days (WGS data may take longer).
Alternatively, you can deploy the pipeline on your own cluster using an easy, automated installation method.
The bioinformatics pipeline consists of a number of thoroughly tested and validated tools and steps, which can be divided as follows:

Preprocessing:
During the first preprocessing steps of the pipeline, PhiX reads are inserted in each sample to create control SNPs in the dataset. Subsequently, the Illumina quality encoding is checked and QC metrics are calculated using FastQC.
The BwaAlign script uses the Burrows-Wheeler Aligner (BWA) to align the sequence data to a reference genome, resulting in a SAM file, which is converted to BAM format using Picard SamFormatConverter. The reads in the BAM file are sorted (Sambamba sort), resulting in a sorted BAM file and an index file. If multiple lanes or flow cells were used during sequencing, all BAM files can be merged and indexed into a new file (Sambamba merge). Duplicates of the same read pair are then marked in the (merged) BAM file (Sambamba markdup), and alignment statistics are collected (Sambamba flagstat).
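The encoding check mentioned above can be illustrated with a minimal Python sketch. It is not the pipeline's own implementation: the function name guess_phred_offset and the simple cut-off rule are assumptions made for illustration. The idea is that quality characters below ASCII 59 can only occur in Phred+33 data, so their presence settles the question; when none are present, the sketch guesses Phred+64, which is a heuristic and may be wrong for very high-quality Phred+33 reads.

```python
def guess_phred_offset(quality_lines):
    """Guess the Phred offset (33 or 64) from FASTQ quality strings.

    Hypothetical helper for illustration only. Raises ValueError when
    no quality characters are supplied.
    """
    lowest = min(ord(ch) for line in quality_lines for ch in line)
    # Characters below ';' (ASCII 59) do not exist in the +64 encodings,
    # so seeing one means the data must be Phred+33.
    if lowest < 59:
        return 33
    # Otherwise assume Phred+64 (heuristic, not a proof).
    return 64


if __name__ == "__main__":
    print(guess_phred_offset(["IIIIHHG#5"]))
```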

Indel calling with Manta:
The program Manta is used to search for insertions and deletions (InDels) in the (merged) BAM file that was preprocessed in the previous steps. The results, along with information about the difference in length between reference and alternative alleles, the types of structural variants and the allele depth, are summarized in a VCF file.

Determine gender:
The tool Picard CalculateHsMetrics is used to predict whether a sample has one or two X chromosomes, i.e. is male or female. It calculates the coverage on the non-pseudoautosomal region of the X chromosome and compares this to the average coverage on the complete genome.
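The comparison described above can be sketched in a few lines of Python. This is only an illustration of the reasoning: a sample with one X chromosome is expected to show roughly half the genome-wide coverage on the non-pseudoautosomal X region, a sample with two X chromosomes roughly the same coverage. The function name predict_sex and the cut-off values male_max and female_min are hypothetical and are not taken from the pipeline.

```python
def predict_sex(x_coverage, genome_coverage, male_max=0.7, female_min=0.8):
    """Predict sex from the ratio of X coverage to genome-wide coverage.

    The thresholds are illustrative assumptions, not the pipeline's values.
    """
    if genome_coverage <= 0:
        raise ValueError("genome_coverage must be positive")
    ratio = x_coverage / genome_coverage
    if ratio <= male_max:
        return "Male"      # about 0.5 expected for one X chromosome
    if ratio >= female_min:
        return "Female"    # about 1.0 expected for two X chromosomes
    return "Unknown"       # ratio falls between the two cut-offs


if __name__ == "__main__":
    print(predict_sex(x_coverage=25.0, genome_coverage=50.0))
```

Returning "Unknown" for intermediate ratios keeps ambiguous samples visible instead of forcing a call.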

Coverage calculations:
The GATK DepthOfCoverage tool calculates the coverage per base and per target, resulting in a tab-delimited file which contains information about chromosomal position, coverage per base and gene annotation.
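To make the relation between per-base and per-target coverage concrete, here is a small Python sketch. It does not reproduce GATK DepthOfCoverage or its output format; the function name mean_coverage_per_target and the in-memory data layout are assumptions chosen for the example. Positions without any recorded depth are counted as zero coverage.

```python
def mean_coverage_per_target(per_base_depth, targets):
    """Average per-base depth over each target region.

    per_base_depth: dict mapping (chromosome, position) to read depth.
    targets: list of (chromosome, start, end, gene), 1-based, end inclusive.
    Both structures are hypothetical, chosen for illustration.
    """
    rows = []
    for chrom, start, end, gene in targets:
        depths = [per_base_depth.get((chrom, pos), 0)
                  for pos in range(start, end + 1)]
        rows.append((chrom, start, end, gene, sum(depths) / len(depths)))
    return rows


if __name__ == "__main__":
    depth = {("1", 100): 10, ("1", 101): 20, ("1", 102): 30}
    for row in mean_coverage_per_target(depth, [("1", 100, 102, "GENE_A")]):
        # Print one tab-delimited line per target.
        print("\t".join(str(field) for field in row))
```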

Metrics calculations:
Several Picard QC tools are used to generate text files and PDF files with QC metrics, which can be used to create tables and graphs of the quality of the alignment performed in the previous steps.

Variant discovery:
The GATK HaplotypeCaller estimates the most likely genotypes and allele frequencies in an alignment using a Bayesian likelihood model for every position of the genome, regardless of whether a variant was detected at that site. This information can later be used in the project-based genotyping step.
When a large number of samples (200 or more) is processed, the pipeline automatically divides them into batches of equal size in this step. The batches are then combined again using GATK CombineGVCFs.
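The batching idea can be sketched as follows in Python. The pipeline's own batching logic is not shown in this text, so the function name split_into_batches and the rule for distributing a remainder (the first batches receive one extra sample) are assumptions.

```python
def split_into_batches(samples, n_batches):
    """Split samples into n_batches groups whose sizes differ by at most one.

    Hypothetical helper for illustration only.
    """
    if n_batches <= 0:
        raise ValueError("n_batches must be positive")
    base, extra = divmod(len(samples), n_batches)
    batches, start = [], 0
    for i in range(n_batches):
        # The first 'extra' batches take one additional sample.
        size = base + (1 if i < extra else 0)
        batches.append(samples[start:start + size])
        start += size
    return batches


if __name__ == "__main__":
    samples = [f"sample_{i:03d}" for i in range(250)]
    print([len(batch) for batch in split_into_batches(samples, 4)])
```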
Subsequently, all samples in the project are analyzed jointly. This leads to a posterior probability of a variant allele at each site. SNPs and small indels are written to a VCF file, along with information such as genotype quality, allele frequency, strand bias and read depth for that SNP/indel.
Several other steps are performed, such as running GATK CatVariants to merge all the files created in the genotype variants step into one. In addition, an HTML file with some statistics and a text file with SNPs per gene and region are produced. The SNPs and indels are filtered against quality thresholds based on the GATK best practices and marked as Lowqual or Pass. All SNPs and indels are then merged into one file per project, and the SNPs and indels are also merged per sample. Eventually, a final VCF file is produced.
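The filtering step described above comes down to comparing variant annotations against fixed cut-offs. The Python sketch below illustrates that logic only. The function name filter_status and the threshold table are hypothetical: the values are illustrative, loosely modelled on commonly cited hard-filtering cut-offs for SNPs, and the thresholds actually used by the pipeline are not stated in this text.

```python
# Illustrative cut-offs only (assumption): annotation -> (comparison, limit).
EXAMPLE_SNP_THRESHOLDS = {
    "QD": ("<", 2.0),   # quality by depth too low
    "FS": (">", 60.0),  # strand bias too high
    "MQ": ("<", 40.0),  # mapping quality too low
}


def filter_status(annotations, thresholds=EXAMPLE_SNP_THRESHOLDS):
    """Return 'Lowqual' if any annotation fails its cut-off, else 'Pass'."""
    for key, (comparison, limit) in thresholds.items():
        value = annotations.get(key)
        if value is None:
            continue  # a missing annotation is not treated as a failure
        if comparison == "<" and value < limit:
            return "Lowqual"
        if comparison == ">" and value > limit:
            return "Lowqual"
    return "Pass"


if __name__ == "__main__":
    print(filter_status({"QD": 1.5, "FS": 10.0, "MQ": 60.0}))
```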

Important side steps:
One of these steps involves CRAM conversion and a concordance check, to produce more highly compressed versions of the BAM files (using the tool Scramble).
Another small side step is the generation of md5sums, and a concordance check of the data against Sequenom results from the same sample. The detected SNPs should show significant overlap, and this step also makes it possible to detect sample swaps. Several tools are used for these steps.
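A concordance check of this kind can be sketched as the fraction of shared sites at which both methods report the same genotype. The Python example below is a simplified illustration, not the tooling used by the pipeline; the function name genotype_concordance and the representation of genotypes as two-letter strings are assumptions.

```python
def genotype_concordance(ngs_calls, array_calls):
    """Fraction of shared sites with identical genotypes, or None if none shared.

    Both arguments map a site identifier to a genotype string such as "AG".
    Allele order is ignored, so "AG" and "GA" count as a match.
    """
    shared = set(ngs_calls) & set(array_calls)
    if not shared:
        return None
    matches = sum(
        1 for site in shared
        if sorted(ngs_calls[site]) == sorted(array_calls[site])
    )
    return matches / len(shared)


if __name__ == "__main__":
    ngs = {"rs1": "AG", "rs2": "CC", "rs3": "TT"}
    array = {"rs1": "GA", "rs2": "CT", "rs4": "AA"}
    print(genotype_concordance(ngs, array))
```

A low concordance for a sample against its own array data, combined with a high concordance against another sample's data, would be the typical signature of a sample swap.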

QC reporting:
The control reads inserted in the first steps of the pipeline are checked, resulting in a text file with the output of this concordance check.
Furthermore, all statistics generated in the previous steps are used in these final steps to produce a QC report, with tables and graphs, in HTML and PDF format.
Once a script has verified that all steps are finished, the final results are shipped to the customer. A selection of files from the temporary directory is copied to a results directory, together with an MD5 checksum and a README text file.
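On the receiving side, an MD5 checksum lets you verify that a result file arrived intact. The following Python sketch, using only the standard library, computes such a checksum; the function names md5_of_file and checksum_line are hypothetical, and the output line imitates the layout commonly produced by the md5sum command (hash, two spaces, file name).

```python
import hashlib
import os


def md5_of_file(path, chunk_size=1024 * 1024):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large BAM/VCF files need little memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def checksum_line(path):
    """Format a checksum line in the style of md5sum output."""
    return f"{md5_of_file(path)}  {os.path.basename(path)}"
```

Comparing the computed digest with the one delivered alongside the results shows whether a file was corrupted or truncated during transfer.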
All this information, and more details, can be found at:

https://github.com/molgenis/NGS_DNA

To use/deploy:
The Molgenis NGS-DNA pipeline is released on GitHub and can be deployed using EasyBuild.
For details on this, or if you want your data processed using our pipeline, please contact Pieter Neerincx (Pieter.neerincx@gmail.com) or Elisa Hoekstra (e.j.hoekstra@umcg.nl).

24-02-2017