Introduction

Current metagenomics workflows can exploit three different streams of analysis: read-based, assembly-based, and binning-based. Read-based and assembly-based analyses are often neglected in favor of binning-driven inferences because of differences in reliability and sensitivity. However, the filtering steps involved in moving from reads to bins progressively reduce the amount of information retained, and thus how much can be learned from the data. There was therefore a need for a metagenomic workflow that combines all three streams of analysis, and Geomosaic was created to meet it. The pipeline allows annotations to be performed on each of the three streams, and it is specifically designed for easy integration of the various programs and packages required. This approach maximizes the information gathered from the raw data. At the same time, Geomosaic's flexibility allows the user to fully customize the analysis by choosing any stream of analysis and refining it with specific modules and packages. The Geomosaic workflow is thus tailored to the user's purposes.

Geomosaic Graph Structure

The base graph structure is made up of three main analyses that have to be taken into account when choosing the desired workflow:

Stream            Module            Depends on
Read-based        Pre-processing    -
Assembly-based    Assembly          Pre-processing
Binning-based     Binning           Pre-processing, Assembly

These dependencies cannot be overlooked when generating the workflow with Geomosaic. For instance, skipping the Assembly module prevents the Binning module from running, because Binning depends on Assembly.
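The dependency rule described above can be illustrated with a minimal Python sketch. This is not Geomosaic's actual implementation: the module identifiers (`pre_processing`, `assembly`, `binning`) and the helper functions `runnable_modules` and `blocked_modules` are hypothetical names chosen for illustration, and only the three base modules from the table are modeled.

```python
# Dependencies of the three base modules, as listed in the table above.
# The identifiers are illustrative and are not Geomosaic's internal names.
DEPENDS_ON = {
    "pre_processing": [],
    "assembly": ["pre_processing"],
    "binning": ["pre_processing", "assembly"],
}


def runnable_modules(selected):
    """Return the selected modules whose dependencies are all satisfied.

    A module is runnable only if every module it depends on is itself
    selected and runnable, so a skipped module blocks everything downstream.
    """
    selected = set(selected)
    runnable = set()
    changed = True
    while changed:
        changed = False
        # `selected - runnable` builds a new set, so adding to `runnable`
        # inside the loop is safe.
        for module in selected - runnable:
            if all(dep in runnable for dep in DEPENDS_ON[module]):
                runnable.add(module)
                changed = True
    return runnable


def blocked_modules(selected):
    """Return the selected modules that cannot run with this selection."""
    return set(selected) - runnable_modules(selected)


# Skipping Assembly blocks Binning, even though Pre-processing is selected.
print(sorted(blocked_modules(["pre_processing", "binning"])))  # ['binning']
```

The same check extends to any larger dependency graph by adding entries to `DEPENDS_ON`; the dependencies of the remaining modules are not listed here, so they are left out of the sketch.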

The full tree of dependencies among all modules is shown in the following figure.

[Figure: modules_DAG]

Integrated modules

To summarise, the dependency tree must be considered when excluding specific modules, since an excluded module may inadvertently block other modules in the same or the next stream of analysis.

Stream-level      Modules                       Packages
Read-based        Pre-processing                fastp
                                                trimgalore
                                                trimmomatic
                  Reads Quality Check           fastqc + reads count
                  Functional Annotation         mi-faser
                  Taxonomic Annotation          Kaiju
                                                metaPhlAn
Assembly-based    Assembly                      metaSPAdes
                                                Megahit
                  Assembly Quality Check        Quast
                                                Meta-Quast
                  Read Mapping                  Bowtie2
                                                Bowtie2 - Output without unmapped reads
                                                BBMap
                                                BBMap - Output without unmapped reads
                  Read Coverage                 CoverM (contigs)
                  Taxonomic Annotation          Kraken2
                  ORF Prediction                Prodigal
                  Domain Annotation             reCOGnizer
                  HMM Annotation                HMMSearch
                  ORF Annotation                eggNOG-mapper
                                                KOfam Scan
                  Functional Annotation         Bakta
Binning-based     Binning                       Multi-Binners (Metabat2 + MaxBin2 + SemiBin2)
                  Binning De-replication        DAS Tool
                  Binning Quality Assessment    CheckM
                  MAGs Retrieval                MAGs Retrieval
                  MAGs Functional Annotation    DRAM
                                                Bakta
                  MAGs Taxonomic Annotation     GTDBtk
                  MAGs ORF Prediction           Prodigal
                  MAGs Domain Annotation        reCOGnizer
                  MAGs ORF Annotation           KOfam Scan
                  MAGs Coverage                 CoverM (Genome)
                  MAGs HMM Annotation           HMMSearch

Modules that could be integrated in future

The following modules are under evaluation for future integration.

Read-based:

  • Functional Annotation

    • mi-faser (with custom database)

Assembly-based:

  • Functional Annotation

    • Prokka

  • Taxonomic Annotation

    • cat/bat

If you know of a module or package you would like to see integrated into Geomosaic, you can open an issue describing it and requesting its integration. At the moment, we only accept packages that can be installed from a Conda channel.