Identifying cobionts and contaminants
Disentangling sequences from different sources can be both interesting and challenging. It can reveal interactions between organisms and their genomes through time (considering extant associations and molecular fossils). In some cases, the aim may simply be to remove contamination that has found its way into a sample. However, determining where a sequence came from is not always straightforward, especially when exploring less well-sampled parts of the tree of life, where few close relatives have been sequenced.
On these pages, you will find some examples showing how we are tackling this issue in the Tree of Life programme. Our approach, which is tailored for HiFi data, combines grouping sequences from the same source with finding reliable taxonomic hints. This allows us to reduce reliance on databases, which can be incomplete and contain mislabelled sequences.
The CobiontID approach
The CobiontID process has two parts:
- First, MarkerScan provides taxonomic information. HMM profiles of marker genes such as rRNAs, which are well sampled and conserved, can classify sequences from genomes that are otherwise too diverged from their closest sequenced relative. This lets us gauge which species are present in a given sample and construct streamlined databases for read classification.
- Second, a combination of assembly, read mapping and compositional clustering assigns the sequences to groups, which can then be tagged with this taxonomic information.
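To make the compositional side of the second part more concrete, here is a minimal Python sketch of a tetranucleotide frequency profile, the kind of feature vector that compositional clustering can work with. It is an illustration only, not the CobiontID implementation: the function name `tetranucleotide_profile`, the decision to skip windows containing ambiguous bases, and the choice not to merge reverse complements are all assumptions made for this example.

```python
from itertools import product

K = 4
# All 256 possible tetranucleotides in a fixed order, so that every
# sequence maps onto a vector with the same layout.
KMERS = ["".join(p) for p in product("ACGT", repeat=K)]
INDEX = {kmer: i for i, kmer in enumerate(KMERS)}


def tetranucleotide_profile(seq):
    """Return the 256 tetranucleotide frequencies of seq.

    Windows containing anything other than A, C, G or T are skipped
    (an assumption of this sketch). The frequencies sum to 1 unless
    no valid window exists, in which case all entries are 0.
    """
    seq = seq.upper()
    counts = [0] * len(KMERS)
    for i in range(len(seq) - K + 1):
        idx = INDEX.get(seq[i:i + K])
        if idx is not None:
            counts[idx] += 1
    total = sum(counts)
    if total == 0:
        return [0.0] * len(KMERS)
    return [c / total for c in counts]


if __name__ == "__main__":
    profile = tetranucleotide_profile("ACGTACGT")
    print(len(profile), profile[INDEX["ACGT"]])
```

Sequences from the same genome tend to have similar profiles, so vectors like this can be compared or embedded to group reads or contigs before any taxonomic label is attached.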
What kind of information does CobiontID provide?
See here for an illustration of the outputs the tools presented here provide, and how to interpret them. If you have ever looked at the “Cobionts” section of a page on Tree of Life QC and wondered how to read the tables and plots, your questions will hopefully be answered here (a list of pages with examples can be found here).
Software used in the pipelines
Standalone tools
Tool | Description | Application | Language |
---|---|---|---|
kmer-counter | Fast k-mer counter for large read sets | Get tetranucleotide counts | Rust |
unique-kmers | Count distinct k-mers in sequences | Calculate k-mer diversity | Rust |
hexamer | Detect likely coding regions | Estimate coding density | C |
fastk-medians | Calculate median number of times each large k-mer in a sequence occurs across the set (modified version of Profex from the original FASTK library) | Approximate k-mer coverage | C |
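As a rough illustration of what a k-mer diversity measure captures, the Python sketch below computes the fraction of k-mer windows in a sequence that are distinct. It is not the algorithm used by unique-kmers or any other tool in the table: the function name `distinct_kmer_fraction`, the default `k=4`, and the normalisation by window count are assumptions chosen for this example.

```python
def distinct_kmer_fraction(seq, k=4):
    """Return (number of distinct k-mers) / (number of k-mer windows).

    Values near 1 mean almost every window is new; values near 0 mean
    the sequence is dominated by repeats. Sequences shorter than k
    return 0.0.
    """
    seq = seq.upper()
    n_windows = len(seq) - k + 1
    if n_windows <= 0:
        return 0.0
    distinct = {seq[i:i + k] for i in range(n_windows)}
    return len(distinct) / n_windows


if __name__ == "__main__":
    print(distinct_kmer_fraction("AAAAAAAA"))  # homopolymer, low diversity
    print(distinct_kmer_fraction("ACGTTGCA"))  # no repeated 4-mers
```

A measure like this separates low-complexity or repetitive sequences from more information-rich ones, which is one reason diversity is a useful feature alongside composition and coverage.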
Workflows
Workflow | Description |
---|---|
MarkerScan | Determine taxonomic composition of an assembly; separate and assemble individual components |
read VAE | Generate annotated 2D visualisations for long reads; interactively explore and select data for downstream analyses |
Additional information
Code
Presentations
- Slides from talk on CobiontID at PopGroup55 (2022)
- Flash presentation accompanying PopGroup talk
Publications
- Disentangling Cobionts and Contamination in Long-Read Genomic Data using Sequence Composition https://www.biorxiv.org/content/10.1101/2024.05.30.596622v1
- Phylogenomic analysis of Wolbachia genomes from the Darwin Tree of Life biodiversity genomics project https://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.3001972
- MarkerScan: Separation and assembly of cobionts sequenced alongside target species in biodiversity genomics projects https://doi.org/10.12688/wellcomeopenres.20730.1