Tips on specific tools

On this page you will find suggestions, based on our experience, on how to optimize the execution of some tools. Have a look at the table of contents on the right side of the page.

ARGs-OAP with Custom DB

(Module: func_classification_rb)

For detailed documentation, please refer to the ARGs-OAP Repository: https://github.com/xinehc/args_oap

I will try to simplify it here, drawing on my own experience with this tool. In addition, some checks are performed on the files you provide, to avoid issues later on.

To use this tool, you will need two files:

A FASTA file of protein sequences, named for example sequences.fasta (do not put spaces in the filename). We suggest keeping this file as simple as possible: the header of each sequence should contain only the ID, without spaces, tabs, or other irregular characters such as a forward slash. Avoid duplicated headers and duplicated sequences.

>id1
DQEATRFKT...
>id2
GWTRCMDCQ...

A mapping file (for example, mapping.tsv) that links the IDs in the sequences file to your categories of interest. It must be tab-separated. The file must contain at least one column listing all the IDs of the FASTA sequences; you can add further columns, each representing a Class, Subclass, or any other category of interest. Do not put spaces in the column names; we suggest using “_” instead. Geomosaic will run some checks on this file.

IDs    Class    Subclass    Metal_Resistances
id1    class1    subclass1    iron
id2    class2    subclass2    iron
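Geomosaic runs its own checks, but the rules above can also be verified by hand before starting. The following is a minimal Python sketch, not Geomosaic's actual validation code: the function names, the messages, and the set of characters accepted in an ID (letters, digits, underscore, dot, hyphen) are assumptions made for illustration.

```python
import csv
import io
import re
from collections import Counter

# Assumption: an ID may contain only letters, digits, "_", "." and "-".
SAFE_ID = re.compile(r"^[A-Za-z0-9_.-]+$")


def parse_fasta(text):
    """Return a list of (header, sequence) pairs from FASTA-formatted text."""
    records, header, chunks = [], None, []
    for line in text.splitlines():
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif line.strip() and header is not None:
            chunks.append(line.strip())
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def check_fasta(text):
    """Return a list of problems: irregular or duplicated headers, duplicated sequences."""
    records = parse_fasta(text)
    problems = [f"irregular header: {h!r}" for h, _ in records if not SAFE_ID.match(h)]
    problems += [f"empty sequence: {h!r}" for h, s in records if not s]
    headers = Counter(h for h, _ in records)
    seqs = Counter(s for _, s in records if s)
    problems += [f"duplicated header: {h!r}" for h, n in headers.items() if n > 1]
    problems += [f"duplicated sequence: {s[:20]}" for s, n in seqs.items() if n > 1]
    return problems


def check_mapping(mapping_text, fasta_ids):
    """Return a list of problems found in the tab-separated mapping table."""
    reader = csv.reader(io.StringIO(mapping_text), delimiter="\t", quoting=csv.QUOTE_NONE)
    rows = [row for row in reader if row]  # skip blank lines
    if not rows:
        return ["mapping file is empty"]
    header, body = rows[0], rows[1:]
    problems = []
    for name in header:
        if not name or any(ch.isspace() for ch in name):
            problems.append(f"bad column name: {name!r}")
    for n, row in enumerate(body, start=1):
        if len(row) != len(header):
            problems.append(f"data row {n}: expected {len(header)} fields, got {len(row)}")
    mapped = Counter(row[0] for row in body)  # first column holds the IDs
    problems += [f"duplicated ID in mapping: {i!r}" for i, c in mapped.items() if c > 1]
    problems += [f"ID missing from mapping: {i!r}" for i in sorted(set(fasta_ids) - set(mapped))]
    problems += [f"ID not in FASTA: {i!r}" for i in sorted(set(mapped) - set(fasta_ids))]
    return problems


fasta = ">id1\nDQEATRFKT\n>id2\nGWTRCMDCQ\n"
mapping = "IDs\tClass\tSubclass\nid1\tclass1\tsubclass1\nid2\tclass2\tsubclass2\n"
ids = [header for header, _ in parse_fasta(fasta)]
print(check_fasta(fasta) + check_mapping(mapping, ids))  # prints [] when nothing is wrong
```

To run it on real files, read them first, for example with pathlib.Path("sequences.fasta").read_text(), and pass the resulting strings to the two check functions.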

Geomosaic will then ask you a few questions.

The first question is the following:

[ADDITIONAL PARAMETER REQUEST - CUSTOM DB for ARGs-OAP - FASTA FILE]
Description: you need to specify the path to the protein fasta file for the custom DB of ARGs-OAP

Geomosaic is asking for the absolute path to the FASTA file described above, for example: /path/to/folder/sequences.fasta

The second question concerns the mapping file described above:

[ADDITIONAL PARAMETER REQUEST - CUSTOM DB for ARGs-OAP - Mapping FILE]
Description: you need to specify the path to the mapping file for the custom DB of ARGs-OAP

/path/to/folder/mapping.tsv

The third question concerns a custom name that you can provide for your database:

[ADDITIONAL PARAMETER REQUEST - CUSTOM DB for ARGs-OAP - Database folder]
Description: you need to specify a name for the custom database folder. For example, if this argsoap custom will concern hydrogenases, you may want to rename the database folder as 'argsoap_custom_hydrogenases'

argsoap_custom_hydrogenases

Suggestion

If you have already created a database with this tool/module, you can type the same name to reuse it.

By doing so, you can skip the script that creates the external db (slurm_extdb_geomosaic.sh or parallel_extdb_geomosaic.sh).

The fourth question concerns a custom name for the output folder:

[ADDITIONAL PARAMETER REQUEST - CUSTOM DB for ARGs-OAP - Output folder]
Description: you need to specify a name for the output folder. For example, if this argsoap custom will concern hydrogenases, you may want to rename the output folder as 'argsoap_custom_hydrogenases'

Since this tool can be used with a custom db, Geomosaic asks this question to let you choose a custom name for the output folder. The idea is to have a specific name for each database of interest and each execution. This allows you to run the tool with different databases and obtain several output folders, each with its own name.

argsoap_custom_hydrogenases

You will then find a folder with the name you entered inside each sample folder.

Metaspades

(Module: assembly)

For detailed documentation, please refer to the Spades Repository: http://ablab.github.io/spades/

I will try to simplify it here, drawing on my own experience with this tool.

Assembly is a major bottleneck in metagenomic workflows, as it may require a lot of memory (RAM). For metaspades, as you can see here http://ablab.github.io/spades/running.html#advanced-options, the default memory limit is 250 Gb. This limit may prevent the assembly from completing even if you set a higher --memory value during the geomosaic prerun, because the latter only affects the slurm resources.

For this reason, we always suggest overriding the default value of 250 Gb by adding the --memory option to the corresponding yaml file in the GM_USER_PARAMETERS folder, and then setting the memory in the geomosaic prerun to a similar value; a slightly higher value is safer.

Suggestion

If you don’t know where the GM_USER_PARAMETERS folder is located, you can find its path in the file called gmsetup.yaml

Inside this folder you should see metaspades.yaml.

Suggestion

This is a plain-text configuration file in which you can list additional options for the corresponding tool, one per bullet point.

nano metaspades.yaml

metaspades:
- -k 33,55,77,127
- --memory 500

seqkit:
- --min-len 2000

Then, in the geomosaic prerun, you can set:

geomosaic prerun ... -m 505
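To keep the two values aligned, the memory set in the parameters file can be read back and compared with the one passed to the prerun. The sketch below is a hypothetical helper, not part of Geomosaic: it uses a hand-written parser that only understands the simple "tool name followed by a bullet list" layout shown above, and the 5 Gb margin simply mirrors the 500/505 example.

```python
def tool_options(yaml_text, tool):
    """Collect the '- option' entries listed under 'tool:' in the parameters text."""
    options, inside = [], False
    for raw in yaml_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        if line.endswith(":") and not line.startswith("-"):
            inside = (line[:-1] == tool)  # entering the section of some tool
        elif inside and line.startswith("- "):
            options.append(line[2:].strip())
    return options


def memory_option(options):
    """Return the integer given to --memory (or -m), or None if it is not set."""
    for opt in options:
        parts = opt.split()
        if len(parts) == 2 and parts[0] in ("--memory", "-m"):
            return int(parts[1])
    return None


example = """\
metaspades:
- -k 33,55,77,127
- --memory 500

seqkit:
- --min-len 2000
"""

spades_gb = memory_option(tool_options(example, "metaspades"))
print(spades_gb)      # 500
print(spades_gb + 5)  # 505, the value passed with -m in the prerun example
```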

Megahit

(Module: assembly)

For detailed documentation, please refer to the Megahit wiki: https://github.com/voutcn/megahit/wiki

I will try to simplify it here, drawing on my own experience with this tool.

Megahit memory documentation: https://github.com/voutcn/megahit/wiki/MEGAHIT-Memory-setting

As described in the metaspades section, you can set the memory by adding the corresponding option to the megahit.yaml file in the GM_USER_PARAMETERS folder. Note that here the value is expressed in BYTES.

nano megahit.yaml

megahit:
# More Sensitive, but slower
- --presets meta-sensitive
- --memory 500000000000

seqkit:
- --min-len 2000

Then, in the geomosaic prerun, you can set:

geomosaic prerun ... -m 505
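Because Megahit expects bytes while the prerun example uses a much smaller number, it is easy to slip by a few zeros. As a small illustration, the Python sketch below converts a value in gigabytes to bytes and derives the prerun value. The helper names are hypothetical, and it assumes decimal gigabytes (1 GB = 10^9 bytes) and the same 5-unit margin as the example above.

```python
def gb_to_bytes(gigabytes):
    """Convert decimal gigabytes (1 GB = 10**9 bytes) to bytes."""
    return int(gigabytes * 10**9)


def prerun_memory(tool_gb, margin_gb=5):
    """Memory to request in the prerun: the tool's limit plus a small margin."""
    return tool_gb + margin_gb


tool_gb = 500
print(f"- --memory {gb_to_bytes(tool_gb)}")                  # - --memory 500000000000
print(f"geomosaic prerun ... -m {prerun_memory(tool_gb)}")   # geomosaic prerun ... -m 505
```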