How scientists decipher COVID-19 — Part 3: IDseq

Zunshi Wang
5 min read · May 30, 2020
Image from IDseq

Introduction

How scientists decipher COVID-19 — Part 2: Sanger and Next Generation Sequencing discussed the economic and technical limitations of existing tools. IDseq is meant to provide an easy and accessible solution for detecting new pathogens.

What are the limitations of existing tools?

  1. Economic limitations: tools require paid subscriptions and have high system and storage requirements. They also demand bioinformatics and programming expertise.
  2. Technical limitations: informatics challenges make it difficult to surface useful data, because irrelevant sequences are not removed sensitively enough.

What benefits does IDseq have?

IDseq reduces mNGS data analysis entry barriers for scientists, clinicians and bioinformaticians through the following benefits:

  1. Is open source (free) and cloud-based.
  2. Accepts raw mNGS data.
  3. Provides rapid and unbiased detection and identification of genes.
  4. Does not need:
    a) a priori knowledge of the microbial landscape
    b) a culture
    c) pathogen specific reagents.

What is the goal of mNGS data analysis?

  1. Determine what nucleic acid derives from the host and environmental contaminants, and what is from the pathogen (through host and QC filtration)
  2. Determine the relative abundances of different taxa in a sample.
  3. Understand trends in the abundances and relatedness of organisms (through aligning sequencing reads to a reference database).
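To make the second goal concrete, here is a minimal Python sketch of a relative abundance calculation. It assumes each read has already been assigned a taxon label; the function name and the toy labels are illustrative and not part of IDseq.

```python
from collections import Counter


def relative_abundance(read_taxa):
    """Return the fraction of reads assigned to each taxon.

    read_taxa: one taxon label per read (hypothetical input format).
    """
    counts = Counter(read_taxa)
    total = sum(counts.values())
    return {taxon: n / total for taxon, n in counts.items()}


# Toy input: four reads, two of them assigned to the same taxon.
print(relative_abundance(["virus", "virus", "bacteria", "host"]))
```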

IDseq assists with mNGS data analysis through the following functions or steps:

  1. Performs host filtration.
  2. Performs quality filtration.
  3. Executes assembly-based alignment pipeline.
  4. Generates assignment of reads and contigs to taxonomic categories.
  5. Visualizes taxonomic relative abundance (making the data easy to interpret and hypotheses easy to generate).
  6. Assists with data interpretation by generating an environmental background model.
  7. Makes the results of every step available for download and shows users the total numbers of reads and basic QC metrics.

Pipeline Implementation Overview

Upload

IDseq accepts raw short-read sequencing data, from DNA or RNA, for any sample type.

Host Filtration

Host filtration and quality control reduce the computational burden and noise in subsequent steps.

IDseq first validates the input files. It then uses the relevant methods and tools to subtract host sequences up front, by aligning the raw reads to a database.
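The idea of host subtraction can be illustrated with a toy Python sketch. This is not how IDseq does it: a real pipeline uses dedicated aligners, whereas this sketch swaps in a simple shared k-mer test. The function names, `k`, and the `min_shared` threshold are all assumptions made for the example.

```python
def kmers(seq, k):
    """Set of all substrings of length k in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def subtract_host(reads, host_reference, k=8, min_shared=0.8):
    """Keep only reads that do not look like the host.

    A read counts as host if at least min_shared of its k-mers also
    occur in host_reference. Reads shorter than k are dropped because
    they cannot be classified. All thresholds are illustrative.
    """
    host_kmers = kmers(host_reference, k)
    non_host = []
    for read in reads:
        read_kmers = kmers(read, k)
        if not read_kmers:
            continue  # read shorter than k
        shared = len(read_kmers & host_kmers) / len(read_kmers)
        if shared < min_shared:
            non_host.append(read)
    return non_host
```

In this sketch a read copied straight from the host reference shares all of its k-mers with the host and is removed, while an unrelated read is kept.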

Quality Control

Quality control uses various tools to remove duplicate, low-quality, and low-complexity reads.
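As a rough illustration of what these three filters do, here is a minimal Python sketch. It is not IDseq's implementation, which relies on dedicated tools. The function names, the thresholds, and the complexity measure (fraction of distinct 3-mers) are assumptions chosen for readability.

```python
def mean_quality(phred_scores):
    """Average Phred score of one read."""
    return sum(phred_scores) / len(phred_scores)


def complexity(seq, k=3):
    """Fraction of distinct k-mers among all k-mers in the read (0 to 1)."""
    all_kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    return len(set(all_kmers)) / len(all_kmers) if all_kmers else 0.0


def quality_control(reads, min_quality=20, min_complexity=0.5):
    """Filter a list of (sequence, phred_scores) tuples.

    Keeps the first copy of each sequence and drops reads that are
    low quality or low complexity. Thresholds are illustrative only.
    """
    seen = set()
    kept = []
    for seq, quals in reads:
        if seq in seen:
            continue  # exact duplicate of an earlier read
        seen.add(seq)
        if mean_quality(quals) < min_quality:
            continue  # low-quality read
        if complexity(seq) < min_complexity:
            continue  # low-complexity read, e.g. a homopolymer run
        kept.append((seq, quals))
    return kept
```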

Assembly-based alignment

IDseq aligns the filtered short reads to the NCBI nt and nr databases, then uses the alignments to assign a taxonomic identity to each read.

Downstream reporting

The following are the methods IDseq offers users to interpret pipeline results:

  1. Relevant QC metrics and pipeline provide:
    a) Number of reads remaining at each step of host filtration and QC.
    b) Estimates of internal control abundances.
  2. Sample report tables provide:
    a) Total number of reads aligning to the taxon (nt and nr).
    b) Statistics from assembly-based alignment.
  3. Tree view provides:
    a) Rapid assessment of taxonomic relatedness of microbes.
  4. User selectable compound query (for easy investigation of data).
  5. Filtering tools (for easy investigation of data).
  6. One click downloads.
  7. Auto generation of coverage plots relative to all corresponding accessions.
  8. Background model generation provides:
    a) Ability to distinguish microbial signal from reagent and environmental contamination.
    b) Ability to evaluate the significance of relative abundance estimates for a taxon vs. water-only (or relative) controls.
  9. Visualization provides:
    a) Taxon heat maps.
    b) Pipeline visualization tool that documents input parameters at each step of the analysis pipeline.

Application #1: Pediatric Meningitis

Main Challenges

  1. Impact of PCR amplification on samples with low amounts of RNA input.
  2. Background contamination.
  3. Genomic similarity between short regions of related organisms.

However, IDseq helped to overcome these challenges.

Sample 1 (CHRF_0094)

In uncovering neuroinvasive chikungunya virus, a major challenge was that host sequences dominated the mNGS library, so the virus represented less than 1% of total sequencing reads. IDseq's host filtering and QC steps boosted the non-host reads to 63%.

A second major challenge was the presence of environmental contaminants. Using the Z-score approach, which compares the relative abundance of each taxon in the sample against background controls chosen by the user, researchers set Z-score thresholds, removed the taxa that did not pass them, and found 4 contigs that align to the virus.

Using the portal's coverage visualization, researchers observed an 11 kb contig that matches the GenBank accession, demonstrating full-genome coverage.
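The Z-score comparison described above can be sketched in a few lines of Python. This is a simplified illustration, not IDseq's background model: the function names, the use of the sample standard deviation, the default threshold, and the choice to keep taxa that never appear in the controls are all assumptions.

```python
from statistics import mean, stdev


def z_score(sample_abundance, background_abundances):
    """How many standard deviations the sample sits above the controls."""
    mu = mean(background_abundances)
    sigma = stdev(background_abundances)  # needs at least two controls
    if sigma == 0:
        raise ValueError("background has zero variance")
    return (sample_abundance - mu) / sigma


def filter_taxa(sample, background, threshold=1.0):
    """Keep taxa whose Z-score meets the threshold.

    sample: {taxon: abundance}
    background: {taxon: [abundance in each control]}
    Taxa absent from the background are kept (an assumption).
    """
    kept = []
    for taxon, abundance in sample.items():
        if taxon not in background:
            kept.append(taxon)
        elif z_score(abundance, background[taxon]) >= threshold:
            kept.append(taxon)
    return kept


print(z_score(10.0, [1.0, 2.0, 3.0]))  # 8.0
```

A taxon that is about as abundant in the sample as in the controls gets a Z-score near zero and is filtered out, which is how reagent and environmental contaminants drop away.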

Sample 2 (CHRF_0002)

The highlight of this analysis is that, although Streptococcus pneumoniae (a pathogen) generated 98.1% of reads in the sample, its coverage breadth was only 87.3%. So despite a proportionally higher share of mNGS reads, coverage was lower for the larger bacterial genome than for viral genomes. Low coverage gives less confidence in identifying bacterial strains, and strain identification is useful in a clinical context.
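Coverage breadth is the fraction of reference positions covered by at least one aligned read or contig. A minimal Python sketch of that calculation follows. The interval format (half-open `(start, end)` pairs on the reference) and the function name are assumptions for the example, and the numbers are made up.

```python
def coverage_breadth(alignments, genome_length):
    """Fraction of the reference covered by at least one alignment.

    alignments: list of (start, end) half-open intervals on the reference.
    Overlapping intervals are counted only once.
    """
    covered = 0
    current_end = 0  # rightmost position already counted
    for start, end in sorted(alignments):
        start = max(start, current_end)
        end = min(end, genome_length)
        if end > start:
            covered += end - start
            current_end = end
    return covered / genome_length


# Three alignments on a 1,000-base reference; the first two overlap.
print(coverage_breadth([(0, 50), (40, 80), (200, 250)], 1000))  # 0.13
```

Many reads stacked on the same region increase depth but not breadth, which is why a sample can have a very high share of reads and still leave parts of a large genome uncovered.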

Sample 3 (CHRF_0000)

In a water sample, it is common to find overamplification of low-biomass nucleic acid input. ERCC (a control) can help back-calculate the total input RNA concentration of a sample. This one had 3.7 pg of total input RNA, while the previous two had 29.6 pg and 213.6 pg. Although the abundance values are similar, the absolute amount of nucleic acid is much lower, which indicates that this is an environmental sample.
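One common way to do this kind of back-calculation is sketched below in Python. It assumes that read counts are proportional to input mass, so the known mass of the ERCC spike-in sets the scale for everything else. The source does not give the exact formula IDseq uses, so the function name, parameters, and example numbers are hypothetical.

```python
def estimate_input_rna_pg(ercc_reads, total_reads, ercc_spike_in_pg):
    """Estimate sample RNA mass (pg) from the ERCC share of reads.

    Assumes reads are proportional to mass:
        sample_mass / ercc_mass = sample_reads / ercc_reads
    """
    if ercc_reads <= 0 or ercc_reads > total_reads:
        raise ValueError("ercc_reads must be in the range 1..total_reads")
    sample_reads = total_reads - ercc_reads
    return ercc_spike_in_pg * sample_reads / ercc_reads


# Made-up numbers: a quarter of the reads come from a 25 pg spike-in.
print(estimate_input_rna_pg(250_000, 1_000_000, 25.0))  # 75.0
```

The larger the share of reads taken up by the control, the less sample RNA there was to begin with, which is the pattern expected from a water control.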

Application #2: IDseq can provide useful pathogen information before a full reference genome is created.

For example, SARS-CoV-2 (the virus that causes COVID-19) did not have a reference genome in NCBI during the early outbreak. mNGS data analysis via IDseq revealed 571 reads, representing about 33% of genome coverage, that aligned to betacoronavirus (the genus that includes the viruses causing severe acute respiratory syndrome). Through collective efforts, researchers built a BLAST database using 54 sequences and deposited it in NCBI, sharing it with the international community. IDseq demonstrated 97.8% read-level recall before the reference genome was available.
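Read-level recall is the share of reads that truly come from the pathogen and that the pipeline also assigns to it. A minimal Python sketch, with a hypothetical function name and toy read IDs:

```python
def read_level_recall(true_read_ids, detected_read_ids):
    """Fraction of true pathogen reads that were also detected."""
    true_ids = set(true_read_ids)
    if not true_ids:
        raise ValueError("need at least one true read")
    found = true_ids & set(detected_read_ids)
    return len(found) / len(true_ids)


# Toy example: 3 of the 4 true reads are recovered.
print(read_level_recall({"r1", "r2", "r3", "r4"}, {"r1", "r2", "r3", "r9"}))  # 0.75
```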

References

IDseq — An Open Source Cloud-based Pipeline and Analysis Service for Metagenomic Pathogen Detection and Monitoring

Continue Reading

How scientists decipher COVID-19 — Part 1: Brief Overview

How scientists decipher COVID-19 — Part 2: Sanger and Next Generation Sequencing

Ending note

This is a three-part pre-internship study I did to facilitate my onboarding and prepare for my Product Design Internship at Chan Zuckerberg Initiative for Summer 2020. I am super excited to help out with such an impactful healthcare product, one that is so relevant and important amidst the pandemic.

Hope everyone is staying safe and healthy!

Last but not least, thank you so much Phoenix Yin (Microbiology@UBC | Pre-med student) for answering all my questions and helping me crash-cram my way through all the papers and concepts :)!
