Improve Variant Calling Accuracy with NVIDIA Parabricks



Built for data scientists and bioinformaticians, NVIDIA Parabricks is a scalable genomics software suite for secondary analysis. It provides GPU-accelerated versions of open-source tools for increased speed and accuracy, so researchers can uncover biological insights faster.

The most recent release, Parabricks v4.6, improves multiple features and, most notably, adds support for Google’s DeepVariant and DeepSomatic 1.9. This includes a pangenome-aware mode for DeepVariant, which improves accuracy across genetic variations and diverse populations.

New features:

  • DeepVariant and DeepSomatic 1.9, including pangenome-aware DeepVariant.
  • DeepSomatic long read and whole exome sequencing (WES) support.
  • STAR quantMode including GeneCounts.

Improved features:

  • STAR speedups: Almost 8x faster on two NVIDIA RTX PRO 6000 GPUs in comparison with CPU-only solutions.
  • Additional arguments for Mutectcaller, including mitochondrial mode.

Improve variant calling with DeepVariant and DeepSomatic 1.9

Variant calling is a critical step in genomic analysis. It identifies differences between the sample genome (for example, from a person or population) and a reference genome. Understanding these genetic differences gives scientists greater insight into diseases and potential treatments.

A wide selection of tools is available for variant calling, including HaplotypeCaller and Mutect2 within the Genome Analysis Toolkit (GATK) from the Broad Institute. Alongside the industry standards from GATK, deep-learning-based variant callers have become widely used.

Developed by Google, DeepVariant and DeepSomatic use deep learning to support variant identification. For germline data, DeepVariant identifies inherited variants. DeepSomatic, by contrast, identifies somatic variants: non-inherited mutations, including those present in tumor cells.

Improving variant calling accuracy is critical, particularly when considering genetic diversity. According to a recent paper, pangenome-aware DeepVariant reduced errors by as much as 25.5% across all settings compared to linear-reference-based DeepVariant.

“Taking genetic diversity into consideration is critical to accurate genome analysis, especially across diverse populations. Recent pangenome methods allow more comprehensive maps of genetic variation to inform analysis,” says Andrew Carroll, product lead at Google Research. “I’m excited by Parabricks v4.6 support for pangenome-aware DeepVariant v1.9, which combines the incredible speed of Parabricks with the new DeepVariant ability to directly use pangenome information during variant calling.”

Improve accuracy even more with Giraffe and DeepVariant v1.9

Traditional linear references, including the Genome Reference Consortium Human Build 38 (GRCh38), are built from the DNA of only a few individuals, providing a universal coordinate system for genomic research. However, these references don’t capture the complete spectrum of genetic variation present across the broader human population. As a result, important subpopulation diversity is often underrepresented. This can introduce bias into subsequent analyses, such as read mapping and variant detection, which can miss or inaccurately interpret important genetic differences tied to ancestry or disease.

Unlike linear references, pangenomes are built by integrating multiple high-quality genomes from diverse individuals, capturing a much wider range of the genetic variation present in human populations. This comprehensive approach reduces reference bias, improves variant detection across populations, and supports more accurate and equitable genomic analyses. Giraffe, a software tool developed by researchers at the University of California, Santa Cruz, enables efficient read alignment to pangenome graphs.

Giraffe maps genomic sequences to a reference pangenome rather than a conventional linear reference, improving variant-calling accuracy across diverse populations. Combining Giraffe with pangenome-aware mode in DeepVariant, which is now available in Parabricks v4.6, improves the accuracy of identified variants and provides the speed of Parabricks GPU acceleration.

  • Accuracy: Open-source pangenome-aware DeepVariant was more accurate than BWA, achieving the following F1 scores according to the Pangenome-aware DeepVariant paper.
    • Pangenome-aware DeepVariant: SNP: 0.9981 | Indel: 0.9971
    • BWA: SNP: 0.9973 | Indel: 0.9968
  • Speed: With GPU acceleration in Parabricks on 4 NVIDIA RTX PRO 6000 GPUs, the combined Giraffe and DeepVariant runtime showed over a 14x speedup compared with CPU-only Giraffe and DeepVariant in pangenome-aware mode.
Figure 1. Using 4 NVIDIA RTX PRO 6000 GPUs, the total runtime for pangenome-aware DeepVariant 1.9 and Giraffe was reduced from more than 9 hours on a CPU-only solution to under 40 minutes
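
To put the F1 scores listed above in perspective, one rough way to compare two callers is to look at the residual error, 1 - F1, instead of F1 itself. The sketch below is only an illustration: the helper name is hypothetical, and 1 - F1 is a crude proxy that is not the same error metric as the one behind the 25.5% figure quoted earlier.

```shell
# Hypothetical helper: percent reduction in residual error (1 - F1)
# when moving from a baseline F1 score (old) to a new F1 score (cur).
residual_error_reduction() {
    awk -v old="$1" -v cur="$2" 'BEGIN {
        if (old >= 1) { print "n/a"; exit }   # avoid dividing by zero
        printf "%.1f\n", 100 * (cur - old) / (1 - old)
    }'
}

# Usage with the scores from the list above (BWA first, pangenome-aware second):
# residual_error_reduction 0.9973 0.9981   # SNP
# residual_error_reduction 0.9968 0.9971   # Indel
```

With the published scores, this works out to roughly a 30% smaller residual for SNPs and about 9% for indels, which shows why small differences in F1 near 1.0 can still matter.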

“Roche’s SBX technology enables sequencing at unparalleled data rates and versatile data processing workflows for various sequencing applications,” says John Mannion, VP Computational Sciences at Roche. “Through our collaboration with NVIDIA, we plan to leverage GPU-accelerated versions of multiple aligners, including Giraffe, to provide users with an integrated solution allowing for faster and more accurate analysis.”

Get started with Giraffe and DeepVariant

Existing users of Parabricks can run DeepVariant after providing:

  • the appropriate FASTA reference file generated from the Giraffe index files,
  • the BAM file output from running Giraffe, and the graph GBZ file.

Instructions on obtaining these files can be found in the Parabricks Giraffe documentation, in the section on using Giraffe in variant calling workflows. The following steps also guide you through the process.

Step 1 

Run baseline VG to generate a FASTA file from the graph.

Please note that step 1 with baseline VG is a one-time run. Once you have the FASTA file from the graph, you don’t need to run step 1 again. Instead, run steps 2 and 3 to process additional FASTQ samples.

# Extract the sequences corresponding to the list of paths to a FASTA file
docker run --rm --volume $(pwd):/workdir \
    --workdir /workdir \
    quay.io/vgteam/vg:v1.59.0 \
    vg paths -x hprc-v1.1-mc-grch38.gbz -p hprc-v1.1-mc-grch38.paths.sub -F > hprc-v1.1-mc-grch38.fa

# Index the fasta file
samtools faidx hprc-v1.1-mc-grch38.fa
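
Because step 1 only needs to run once, a script that processes many samples might first check whether the FASTA file and its samtools index already exist. The following is a minimal sketch under that assumption; the function name `needs_graph_fasta` is hypothetical and not part of Parabricks or VG.

```shell
# Hypothetical helper: succeeds (exit status 0) when the graph FASTA
# or its .fai index is missing or empty, meaning step 1 still has to run.
needs_graph_fasta() {
    fasta="$1"
    [ ! -s "$fasta" ] || [ ! -s "${fasta}.fai" ]
}

# Usage (file name taken from the commands above):
# if needs_graph_fasta hprc-v1.1-mc-grch38.fa; then
#     echo "FASTA or index not found: run step 1 first"
# fi
```

The check relies on `samtools faidx` writing its index next to the FASTA with a `.fai` suffix.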

Step 2

Next, run Giraffe normally.

# This command assumes all of the inputs are in the current working directory and all of the outputs go to the same place.
docker run --rm --gpus all --volume $(pwd):/workdir --volume $(pwd):/outputdir \
    --workdir /workdir \
    nvcr.io/nvidia/clara/clara-parabricks:4.6.0-1 \
    pbrun giraffe --read-group "sample_rg1" \
    --sample "sample-name" --read-group-library "library" \
    --read-group-platform "platform" --read-group-pu "pu" \
    --dist-name /workdir/hprc-v1.1-mc-grch38.dist \
    --minimizer-name /workdir/hprc-v1.1-mc-grch38.min \
    --gbz-name /workdir/hprc-v1.1-mc-grch38.gbz \
    --ref-paths /workdir/hprc-v1.1-mc-grch38.paths.sub \
    --in-fq /workdir/${INPUT_FASTQ_1} /workdir/${INPUT_FASTQ_2} \
    --out-bam /outputdir/${OUTPUT_BAM}
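
When several paired-end samples need to go through step 2, the values for `INPUT_FASTQ_1`, `INPUT_FASTQ_2`, and `OUTPUT_BAM` can be derived from the file names. The sketch below assumes a `<sample>_R1.fastq.gz` / `<sample>_R2.fastq.gz` naming convention and file names without spaces. Both the convention and the helper names are illustrative assumptions, not something Parabricks requires.

```shell
# Hypothetical helper: strip the directory and the assumed _R1.fastq.gz suffix.
sample_name() {
    r1=$(basename "$1")
    printf '%s\n' "${r1%_R1.fastq.gz}"
}

# Print one "R1 R2 BAM" line for every complete FASTQ pair in a directory.
list_giraffe_jobs() {
    dir="$1"
    for r1 in "$dir"/*_R1.fastq.gz; do
        [ -e "$r1" ] || continue                          # glob matched nothing
        sample=$(sample_name "$r1")
        [ -e "$dir/${sample}_R2.fastq.gz" ] || continue   # skip unpaired files
        printf '%s %s %s\n' \
            "${sample}_R1.fastq.gz" "${sample}_R2.fastq.gz" "${sample}.bam"
    done
}

# Usage: feed each line into the Giraffe command shown above.
# list_giraffe_jobs . | while read -r INPUT_FASTQ_1 INPUT_FASTQ_2 OUTPUT_BAM; do
#     ...run the docker / pbrun giraffe command from step 2 here...
# done
```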

Step 3 

Finally, these three files can be used as inputs for DeepVariant. Run pangenome_aware_deepvariant with the BAM from step 2, the FASTA from step 1, and the graph GBZ file.

# Pangenome_aware_deepvariant
# This command assumes all of the inputs are in the current working directory and all of the outputs go to the same place.
docker run --rm --gpus all --volume $(pwd):/workdir --volume $(pwd):/outputdir \
    --workdir /workdir \
    nvcr.io/nvidia/clara/clara-parabricks:4.6.0-1 \
    pbrun pangenome_aware_deepvariant \
    --ref /workdir/hprc-v1.1-mc-grch38.fa \
    --pangenome /workdir/hprc-v1.1-mc-grch38.gbz \
    --in-bam /workdir/${INPUT_BAM} \
    --out-variants /outputdir/${OUTPUT_VCF}
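
After step 3 finishes, a quick sanity check on the output can catch an empty or truncated run. The sketch below assumes the output is an uncompressed, plain-text VCF. The function names are hypothetical; the only format facts relied on are that VCF header lines start with `#` and that the seventh tab-separated column is FILTER.

```shell
# Count all variant records (every line that is not a header line).
count_variants() {
    # grep -c exits non-zero when the count is 0; "|| true" hides that status
    grep -vc '^#' "$1" || true
}

# Count only the records whose FILTER column is PASS.
count_pass_variants() {
    awk -F'\t' '!/^#/ && $7 == "PASS" { n++ } END { print n + 0 }' "$1"
}

# Usage:
# count_variants "${OUTPUT_VCF}"
# count_pass_variants "${OUTPUT_VCF}"
```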

STAR improvements: including quantMode GeneCounts

In addition to pangenome-aware mode for DeepVariant, the latest release of Parabricks also includes improvements to STAR, a tool used for RNA-sequencing alignment. STAR is especially useful because of its speed, its accuracy for RNA-seq data across sequencing platforms, and its scalability for large datasets. Already available in Parabricks, STAR is now further accelerated on GPUs, resulting in a nearly 8x speedup on two NVIDIA RTX PRO 6000 GPUs compared with CPU-only solutions.

In the latest release of Parabricks, quantMode GeneCounts is a new option available for STAR, which is useful for a variety of applications related to gene expression, QC, normalization, and data integration. During the mapping step of alignment, quantMode GeneCounts enables fast generation of gene-level read counts.

Figure 2. Compared with CPU-only solutions that took over 105 minutes, STAR runtimes were reduced to under 14 minutes on two NVIDIA RTX PRO 6000 GPUs

Get started with STAR

quantMode GeneCounts can be added to STAR as an argument. An example command is below.

docker run --rm --gpus all --volume $(pwd):/workdir --volume $(pwd):/outputdir \
    --workdir /workdir \
    nvcr.io/nvidia/clara/clara-parabricks:4.6.0-1 \
    pbrun rna_fq2bam \
    --genome-lib-dir ${GENOME_DIR} \
    --in-fq ${FASTQ1} ${FASTQ2} \
    --output-dir ${OUT_DIR} \
    --ref ${GENOME} \
    --out-bam ${OUT_BAM} \
    --num-gpus ${GPU_NUM} \
    --quantMode GeneCounts
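
Open-source STAR writes GeneCounts results to a tab-separated file (`ReadsPerGene.out.tab`) with four summary rows (`N_unmapped`, `N_multimapping`, `N_noFeature`, `N_ambiguous`) followed by one row per gene, where column 2 holds the unstranded count. Assuming the Parabricks output follows the same layout (worth confirming against the Parabricks documentation), a minimal sketch for extracting per-gene counts could look like this. The function names are hypothetical.

```shell
# Print "gene_id<TAB>unstranded_count" for each gene, skipping the N_* summary rows.
gene_counts_unstranded() {
    awk -F'\t' '$1 !~ /^N_/ { print $1 "\t" $2 }' "$1"
}

# Sum the unstranded counts over all genes.
total_assigned_reads() {
    awk -F'\t' '$1 !~ /^N_/ { sum += $2 } END { print sum + 0 }' "$1"
}

# Usage (the path is an assumption based on STAR's default file name):
# gene_counts_unstranded "${OUT_DIR}/ReadsPerGene.out.tab"
```

Columns 3 and 4 of the same file hold the counts for the two stranded protocols, so a stranded library would read from one of those columns instead.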

Download Parabricks today

Download NVIDIA Parabricks v4.6 to get started with GPU-accelerated genomic analysis, and join the conversation on the NVIDIA Parabricks Developer Forum.


