geomosaic setup

Overview

This commands should be the first one to be executed as it sets up the Geomosaic working directory with the corresponding folder and sample names.

geomosaic setup --help
usage: geomosaic setup -d DIRECTORY -t SAMPLE_TABLE [-s SETUP_FILE] [-c CONDAENV_GMFOLDER] [-e EXTERNALDB_GMFOLDER] [-u USERPARAMS_GMFOLDER]
                       [-f {tsv,csv,excel}] [-w WORKING_DIR] [-n PROJECT_NAME] [--move_and_rename] [--skip_checks] [-h]

DESCRIPTION: It creates the geomosaic working directory and the relative samples folders based on the provided sample table

Required Arguments:
  -d DIRECTORY, --directory DIRECTORY
                        Path to the directory containing raw reads (fastq.gz files) (default: None)
  -t SAMPLE_TABLE, --sample_table SAMPLE_TABLE
                        Path to the user sample table (default: None)

Optional Arguments:
  -s SETUP_FILE, --setup_file SETUP_FILE
                        Output name for the geomosaic setup file (yaml extension). This file is necessary for the < geomosaic workflow > command. (default:
                        gmsetup.yaml)
  -c CONDAENV_GMFOLDER, --condaenv_gmfolder CONDAENV_GMFOLDER
                        This option allows to provide a path folder in which geomosaic is going to install all the conda environments of your workflow. This
                        option is very useful if you want to execute Geomosaic for different set of reads, as here you can provide the same folder and
                        prevent multiple installation of the same conda environments. If not specified geomosaic will create a folder called 'gm_conda_envs'
                        inside the directory provided by the '-w' option. (default: None)
  -e EXTERNALDB_GMFOLDER, --externaldb_gmfolder EXTERNALDB_GMFOLDER
                        This option allows to provide a path folder in which geomosaic is going to download all the external databases used by the packages
                        of your workflow. This option is very useful if you want to execute Geomosaic for different set of reads, as here you can provide
                        the same folder and prevent multiple donwload of the same external databases. If not specified geomosaic will create a folder called
                        'gm_external_db' inside the directory provided by the '-w' option. (default: None)
  -u USERPARAMS_GMFOLDER, --userparams_gmfolder USERPARAMS_GMFOLDER
                        This option allows to provide a path folder in which geomosaic is going to put file for additional options/parameters that can be
                        used by the packages of your workflow. This option is very useful if you want to execute Geomosaic for different set of reads and
                        using the tools with the same sets of options/parameters of previous Geomosaic executions. If not specified geomosaic will create a
                        folder called 'gm_user_parameters' inside the directory provided by the '-w' option. (default: None)
  -f {tsv,csv,excel}, --format_table {tsv,csv,excel}
                        Format of the provided table. Allowed: tsv, csv, excel (default: tsv)
  -w WORKING_DIR, --working_dir WORKING_DIR
                        Is the Geomosaic working directory that has to be created for the pipeline execution. Default: 'geomosaic' folder created in the
                        current directory (default: geomosaic)
  -n PROJECT_NAME, --project_name PROJECT_NAME
                        Name of the project. The first 8 Characters will be used for SLURM job name (default: Geomosaic_Workflow)
  --move_and_rename     Suggested flag if the provided raw reads directory is already a backup of the original files. In this case, geomosaic will create
                        only symbolic link of raw reads to its working directory. Note: This flag cannot be used if there are multiple files for each R1 and
                        R2 sample reads, as geomosaic will 'cat' them to a single file. (default: False)
  --skip_checks         If you are sure that every file is in its correct location and the sample names are filled correctly, you can skip checks with this
                        flags. However we do not suggest to use it. (default: False)

Help Arguments:
  -h, --help            show this help message and exit

Arguments

setup command has two required arguments:

  • REQUIRED

    • (-t) the sample table including all names of the reads files and the desired sample names. It is expected to provide a table with three columns having the following headers: r1, r2, sample. For instance:

      r1

      r2

      sample

      S1_L001_R1_fastq.gz

      S1_L001_R2_fastq.gz

      sample1

      S1_L002_R1_fastq.gz

      S1_L002_R2_fastq.gz

      sample1

      S2_L001_R1_fastq.gz

      S2_L001_R2_fastq.gz

      sample2

      S2_L002_R1_fastq.gz

      S2_L002_R2_fastq.gz

      sample2

      S3_L001_R1_fastq.gz

      S3_L001_R2_fastq.gz

      sample3

      NOTE 1: Sample and reads name with no spaces: the values on the sample column (but also on the r1 and r2 columns) should not present any space as it increases the folder organization complexity. However, Geomosaic will perform a check (assertion) and possibly print an Error message describing the line that containes the space.

      NOTE 2: Different lines with the same sample name are allowed: sequencing data are often splitted for a single sample. In this case, is it quite common to cat them all in one file; this is performed for both R1 (read R1) and R2 (read R2). Geomosaic allows the same sample name in multiple lines of the provided table, as it will cat all the reads R1 in one file R1 and all the reads R2 in one file R2. This case is also presented in the example table above (line 2,3 and 3, 4).

      In the NOTE 2 scenario, Geomosaic will perform a group by of R1 and R2 files based on the sample column to then use the cat command. For instance, by performing a group by on the provided example table, the following files are obtained:

      r1

      r2

      sample

      [S1_L001_R1.fastq.gz, S1_L002_R1.fastq.gz]

      [S1_L001_R2.fastq.gz, S1_L002_R2.fastq.gz]

      sample1

      [S2_L001_R1.fastq.gz, S2_L002_R1.fastq.gz]

      [S2_L001_R2.fastq.gz, S2_L002_R2.fastq.gz]

      sample2

      [S3_L001_R1.fastq.gz, … ]

      [S3_L001_R2.fastq.gz, …]

      sample3

    • (-d) the directory containing all raw reads. The name of the listed files has to match the ones provided in the tabular table.

The setup command has various optional arguments:

  • OPTIONAL

    • (-s) The output name for the setup file of Geomosaic. Default: gmsetup.yaml (in the current directory).

    • (-c) This option is very useful if you are going to run geomosaic for different set of raw reads. By specifying this folder, geomosaic will not reinstall all the conda environments. Here you can provide the same folder and prevent multiple installation of the same conda environments. If not specified geomosaic will create a folder called gm_conda_envs inside the directory provided by the -w option.

    • (-e) Similarly to the -c parameter, this option is very useful if you are going to run geomosaic for different set of raw reads. By specifying this folder, geomosaic can use already downloaded external databases. If not specified geomosaic will create a folder called gm_external_db inside the directory provided by the -w option.

      • For example, let’s assume that in the following path /mnt/storage/geomosaic_extdb are already presents the following databases kaiju_extdb, kraken2_extdb and checkm_extdb (that were downloaded with Geomosaic) are already present; you can use this option as -e /mnt/storage/geomosaic_extdb .

    • (-u) Similarly to the -c and -e flags, this options allows you to specify the same folder path that you may have used in previous Geomosaic execution for the user parameter folder. This is useful to execute same tools with the same options that you may have added in those executions. If not specified geomosaic will create a folder called gm_user_parameters inside the directory provided by the -w option.

    • (-f) With this option you can specify the format separated value of the table provided with the flag -t.

    • (-w) The Geomosaic working directory to create for its execution. Default: geomosaic folder created in the current directory.

    • (-n) A project name can be specified with this parameter. It is recommended to use a name without any space.

    • (--move_and_rename) You can use this option to directly move the reads file from the directory specified in -d parameter instead of creating another copy.

      NOTE 3: --move_and_rename parameter not allowed in NOTE 2 scenario: due to the need to cat all sample sequencing reads in one file, the --move_and_rename parameter is not allowed in the NOTE 2 scenario. Indeed, --move_and_rename parameter can be used when there is a single file for both R1 and R2 sequencing reads.

What to expect from this command

After completing this command you should see:

  • a geomosaic setup file, by default called gmsetup.yaml that contains some essential information for the next commands. Do not modify this file.

  • and the working directory for geomosaic (eventually provided with the option -w) which stores a directory for each of the user unique samples. Each sample folder contains containing two files:

    • R1.fastq.gz

    • R2.fastq.gz

../_images/geomosaic_setup_output.png

Example usage geomosaic setup

Assuming my sequencing reads are in the folder termed sequencing_mg_expedition2023 and my tabular file is called sample_table_expedition2023.tsv, the setup command can be executed as follows

geomosaic setup -d sequencing_mg_expedition2023 -t sample_table_expedition2023.tsv

or providing optional parameters to add further information

geomosaic setup -d sequencing_mg_expedition2023 \
                -t sample_table_expedition2023.tsv \
                -s gmsetup_exp2023.yaml \
                -w ./geomosaic_exp2023 \
                -n "Geomosaic EXP2023" \
                --move_and_rename