CobiontID overview

Identifying cobionts and contaminants

Disentangling sequences from different sources can be both interesting and challenging. It can reveal interactions between organisms and their genomes through time (considering extant associations and molecular fossils). In some cases, the aim may simply be to remove contamination that has found its way into a sample. However, determining where a sequence came from is not always straightforward, especially when exploring less well-sampled parts of the tree of life, where few close relatives have been sequenced.

On these pages, you will find some examples showing how we are tackling this issue in the Tree of Life programme. Our approach, which is tailored for HiFi data, combines grouping sequences from the same source with finding reliable taxonomic hints. This allows us to reduce reliance on databases, which can be incomplete and contain mislabelled sequences.

The CobiontID approach

The CobiontID process has two parts. First, MarkerScan provides taxonomic information: HMM profiles of marker genes such as rRNAs, which are well sampled and conserved, can classify sequences from genomes that are otherwise too diverged from their closest sequenced relative. This lets us gauge which species are present in a given sample and construct streamlined databases for read classification. Second, a combination of assembly, read mapping and compositional clustering assigns the sequences to groups, which can then be tagged with this taxonomic information.
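To illustrate the kind of composition signal that compositional clustering can draw on, such as the tetranucleotide counts listed among the standalone tools, here is a minimal Python sketch that turns a sequence into a normalised tetranucleotide frequency vector. It is not the CobiontID implementation. The function name `tetranucleotide_profile`, the decision to ignore windows containing non-ACGT characters, and the absence of strand canonicalisation are all assumptions made for illustration.

```python
from collections import Counter
from itertools import product

K = 4  # tetranucleotides
# Fixed ordering of all 256 possible 4-mers, so every profile has the same layout.
ALL_KMERS = ["".join(p) for p in product("ACGT", repeat=K)]


def tetranucleotide_profile(seq: str) -> list[float]:
    """Return relative tetranucleotide frequencies for one sequence.

    Hypothetical helper: windows containing anything other than A, C, G, T
    (for example N) are simply left out of the profile.
    """
    seq = seq.upper()
    counts = Counter(seq[i:i + K] for i in range(len(seq) - K + 1))
    # Counter returns 0 for k-mers that never occur.
    valid = [counts[kmer] for kmer in ALL_KMERS]
    total = sum(valid)
    if total == 0:
        return [0.0] * len(ALL_KMERS)
    return [c / total for c in valid]


if __name__ == "__main__":
    profile = tetranucleotide_profile("ACGTACGTNNACGT")
    print(len(profile), round(sum(profile), 6))
```

Vectors of this kind place sequences with similar composition close together, which is what makes grouping by source possible before any database lookup.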

What kind of information does CobiontID provide?

See here for an illustration of the outputs the tools presented here provide, and how to interpret them. If you have ever looked at the “Cobionts” section of a page on Tree of Life QC and wondered how to read the tables and plots, your questions will hopefully be answered here.

Software used in the pipelines

Standalone tools

Tool | Description | Application | Language
kmer-counter | Fast k-mer counter for large read sets | Get tetranucleotide counts | Rust
unique-kmers | Count distinct k-mers in sequences | Calculate k-mer diversity | Rust
hexamer | Detect likely coding regions | Estimate coding density | C
fastk-medians | Calculate median number of times each large k-mer in a sequence occurs across the set (modified version of Profex from the original FASTK library) | Approximate k-mer coverage | C
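The two k-mer statistics described for unique-kmers and fastk-medians can be sketched in a few lines of Python. This is only an illustration of the ideas, not the behaviour of the Rust and C tools: the function names, the definition of diversity as distinct k-mers divided by total k-mers, the in-memory count table, and the lack of canonical (strand-merged) k-mers are assumptions.

```python
from collections import Counter
from statistics import median


def kmers(seq: str, k: int) -> list[str]:
    """All overlapping k-mers of a sequence, in order."""
    seq = seq.upper()
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def kmer_diversity(seq: str, k: int) -> float:
    """Fraction of k-mer windows that are distinct (assumed definition)."""
    ks = kmers(seq, k)
    return len(set(ks)) / len(ks) if ks else 0.0


def median_kmer_coverage(seq: str, read_set: list[str], k: int) -> float:
    """Median count, across the whole read set, of the k-mers found in seq."""
    table = Counter(km for read in read_set for km in kmers(read, k))
    ks = kmers(seq, k)
    return median(table[km] for km in ks) if ks else 0.0


if __name__ == "__main__":
    reads = ["ACGT", "ACGT", "CGTA"]
    print(kmer_diversity("AAAAAA", 3))
    print(median_kmer_coverage("ACGT", reads, 3))
```

Low diversity points to repetitive or low-complexity sequence, while the median count gives a coverage estimate that is robust to a few highly repeated k-mers.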

Workflows

Workflow | Description
MarkerScan | Determine taxonomic composition of an assembly; separate and assemble individual components
read VAE | Generate annotated 2D visualisations for long reads; interactively explore and select data for downstream analyses

Additional information

Code

GitHub

Presentations

Publications

  • Disentangling Cobionts and Contamination in Long-Read Genomic Data using Sequence Composition https://www.biorxiv.org/content/10.1101/2024.05.30.596622v1
  • Phylogenomic analysis of Wolbachia genomes from the Darwin Tree of Life biodiversity genomics project https://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.3001972
  • MarkerScan: Separation and assembly of cobionts sequenced alongside target species in biodiversity genomics projects https://doi.org/10.12688/wellcomeopenres.20730.1