Identifying cobionts and contaminants

Disentangling sequences from different sources can be both interesting and challenging. It can reveal interactions between organisms and their genomes through time (considering both extant associations and molecular fossils). In some cases, the aim may simply be to remove contamination that has found its way into a sample. However, determining where a sequence came from is not always straightforward, especially when exploring less well-sampled parts of the tree of life, where few close relatives have been sequenced.

On these pages, you will find some examples showing how we are tackling this issue in the Tree of Life programme. Our approach, which is tailored for HiFi data, combines grouping sequences from the same source with finding reliable taxonomic hints. This allows us to reduce reliance on databases, which can be incomplete and contain mislabelled sequences.

The CobiontID approach

The CobiontID process has two parts. First, MarkerScan provides taxonomic information. HMM profiles of marker genes such as rRNAs, which are well sampled and conserved, are useful for classifying sequences from genomes that are otherwise too diverged from their closest sequenced relative. This lets us gauge which species are present in a given sample and construct streamlined databases for read classification. Second, a combination of assembly, read mapping and compositional clustering assigns the sequences to groups, which can then be tagged with this taxonomic information.
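The final tagging step can be illustrated with a minimal Python sketch. This is not the CobiontID implementation: the function name `tag_clusters`, the majority-vote rule and the `min_fraction` threshold are all assumptions made for illustration. It takes a mapping from sequence IDs to group IDs (as might come from compositional clustering) and a mapping from sequence IDs to taxa (as might come from marker gene hits), and labels each group with its most common taxon.

```python
from collections import Counter, defaultdict


def tag_clusters(cluster_of, marker_taxon, min_fraction=0.6):
    """Label each cluster with the majority taxon among its marker-bearing sequences.

    cluster_of   -- dict: sequence ID -> cluster ID (hypothetical clustering output)
    marker_taxon -- dict: sequence ID -> taxon (hypothetical marker gene hits)
    min_fraction -- assumed threshold: share of votes the top taxon must reach
    """
    # Collect one vote per sequence that carries a classified marker gene.
    votes = defaultdict(Counter)
    for seq_id, taxon in marker_taxon.items():
        cluster = cluster_of.get(seq_id)
        if cluster is not None:
            votes[cluster][taxon] += 1

    labels = {}
    for cluster in set(cluster_of.values()):
        tally = votes.get(cluster)
        if not tally:
            # No sequence in this cluster had a marker hit.
            labels[cluster] = "unclassified"
            continue
        taxon, n = tally.most_common(1)[0]
        share = n / sum(tally.values())
        labels[cluster] = taxon if share >= min_fraction else "ambiguous"
    return labels


if __name__ == "__main__":
    clusters = {"r1": 0, "r2": 0, "r3": 1, "r4": 2, "r5": 2}
    markers = {"r1": "Wolbachia", "r2": "Wolbachia", "r4": "Insecta", "r5": "Fungi"}
    print(tag_clusters(clusters, markers))
```

Most sequences carry no marker gene, so in a scheme like this the few that do lend their label to the whole group; that is why the grouping and the taxonomic hints complement each other.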

What kind of information does CobiontID provide?

See here for an illustration of the outputs the tools presented here provide, and how to interpret them. If you have ever looked at the “Cobionts” section of a page on Tree of Life QC and wondered how to read the tables and plots, your questions will hopefully be answered here (a list of pages with examples can be found here).

Software used in the pipelines

Standalone tools

Tool	Description	Application	Language
kmer-counter	Fast k-mer counter for large read sets	Get tetranucleotide counts	Rust
unique-kmers	Count distinct k-mers in sequences	Calculate k-mer diversity	Rust
hexamer	Detect likely coding regions	Estimate coding density	C
fastk-medians	Calculate median number of times each large k-mer in a sequence occurs across the set (modified version of Profex from the original FASTK library)	Approximate k-mer coverage	C
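
To make the quantities in the table concrete, here is a small Python sketch of three of them: tetranucleotide counts, the number of distinct k-mers, and the median k-mer count per read. It only illustrates what is being measured and does not reflect how the Rust and C tools are implemented. The function names are hypothetical, and counting the forward strand only and skipping k-mers that contain non-ACGT characters are simplifying assumptions.

```python
from collections import Counter
from itertools import product
from statistics import median

VALID = set("ACGT")


def iter_kmers(seq, k):
    """Yield forward-strand k-mers, skipping any that contain non-ACGT characters."""
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) <= VALID:
            yield kmer


def tetranucleotide_counts(seq):
    """Return a dict with a count for each of the 256 possible tetranucleotides."""
    counts = Counter(iter_kmers(seq, 4))
    return {"".join(p): counts.get("".join(p), 0) for p in product("ACGT", repeat=4)}


def distinct_kmers(seq, k):
    """Number of different k-mers in a sequence (a simple measure of k-mer diversity)."""
    return len(set(iter_kmers(seq, k)))


def median_kmer_coverage(reads, k):
    """For each read, the median number of times its k-mers occur across all reads."""
    totals = Counter(kmer for read in reads for kmer in iter_kmers(read, k))
    medians = []
    for read in reads:
        occurrences = [totals[kmer] for kmer in iter_kmers(read, k)]
        medians.append(median(occurrences) if occurrences else 0)
    return medians


if __name__ == "__main__":
    print(tetranucleotide_counts("ACGTACGT")["ACGT"])
    print(distinct_kmers("ACGTACGT", 4))
    print(median_kmer_coverage(["AAAAA", "AAAAC"], 4))
```

Holding every k-mer in memory like this is only workable for toy inputs; real read sets need the dedicated tools listed above.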

Workflows

Workflow	Description
MarkerScan	Determine taxonomic composition of an assembly; separate and assemble individual components
read VAE	Generate annotated 2D visualisations for long reads; interactively explore and select data for downstream analyses

Additional information

Code

GitHub

Presentations

Slides from talk on CobiontID at PopGroup55 (2022)
Flash presentation accompanying PopGroup talk

Publications

Disentangling Cobionts and Contamination in Long-Read Genomic Data using Sequence Composition https://www.biorxiv.org/content/10.1101/2024.05.30.596622v1
Phylogenomic analysis of Wolbachia genomes from the Darwin Tree of Life biodiversity genomics project https://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.3001972
MarkerScan: Separation and assembly of cobionts sequenced alongside target species in biodiversity genomics projects https://doi.org/10.12688/wellcomeopenres.20730.1

CobiontID overview