Integration example 1: Simple package


This tutorial walks you through the integration of a simple package that does not need an external database and is not related to the MAGs module.

What we need for this integration

  • Understand the Stream-level: in this case Read-based

  • Module name: reads_qc

  • Package name: fastqc_readscount

  • Know the conda dependencies for this package, or the conda package name.

  • Know the code needed to actually integrate and execute the package.

Step 1: Clone/fork the repository and install Geomosaic

Since the final goal is to open a pull request against the main repository, we suggest forking our repo and then cloning your fork over SSH:

git clone git@github.com:<YOURNAME>/Geomosaic.git

Remember to replace <YOURNAME> with your GitHub username.

Then install the Geomosaic conda environment. You can follow the Installation Guide.

Once you have cloned the repository, move into the cloned directory and create a new branch named after the package that you are going to integrate:

git checkout -b fastqc

Step 2: Create the module folder (if it does not exist)

In this case we are going to integrate a package that belongs to a module for quality checks of the reads after the pre_processing step, so we create a module folder called reads_qc inside the modules folder (see the figure in Step 4).

Important

This step is necessary only if the module folder does not exist.

Warning

Do not use any special characters or insert spaces in the name.

Highlight

Use only lower-case letters and underscores.

Step 3: Create the package folder

In this step we create the package folder inside the module of interest. In this case, our package folder will be fastqc_readscount (see the figure in Step 4).

Warning

Do not use any special characters or insert spaces in the name.

Highlight

Use only lower-case letters and underscores.
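The naming rule for module and package folders can be checked with a few lines of Python. This is only a sketch and is not part of Geomosaic: the function name is hypothetical, and allowing digits after the first character is an assumption, since the tutorial itself only mentions lower-case letters and underscores.

```python
import re

# Assumption: digits are tolerated after the first character; the tutorial
# itself only mentions lower-case letters and underscores.
_NAME_PATTERN = re.compile(r"[a-z][a-z0-9_]*")


def is_valid_name(name: str) -> bool:
    """Return True if `name` has no spaces, upper-case or special characters."""
    return _NAME_PATTERN.fullmatch(name) is not None


if __name__ == "__main__":
    for candidate in ("reads_qc", "fastqc_readscount", "Reads QC", "fastqc-readscount"):
        print(candidate, is_valid_name(candidate))
```

The first two names pass the check, while the last two are rejected because of the space, the upper-case letters and the hyphen.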

Step 4: Create the package’s Snakefiles

Create three files inside the package folder, with the following filenames:

  • Snakefile.smk

  • Snakefile_target.smk

  • param.yaml

For now you can leave them empty.

Important

The names of these files are standard and are the same for every package. Do not change the filenames.

[Figure: modules_folder]
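Steps 2 to 4 can also be scripted. The following is a minimal sketch, assuming the path to the modules folder is passed in explicitly; the helper name `scaffold_package` is hypothetical and not part of Geomosaic.

```python
import tempfile
from pathlib import Path

# File names required by the tutorial for every package.
PACKAGE_FILES = ("Snakefile.smk", "Snakefile_target.smk", "param.yaml")


def scaffold_package(modules_dir, module: str, package: str) -> Path:
    """Create <modules_dir>/<module>/<package>/ with the three empty files."""
    package_dir = Path(modules_dir) / module / package
    # parents=True also creates the module folder if it is missing;
    # exist_ok=True leaves an already existing folder untouched.
    package_dir.mkdir(parents=True, exist_ok=True)
    for filename in PACKAGE_FILES:
        # touch() creates an empty file and keeps existing content intact.
        (package_dir / filename).touch(exist_ok=True)
    return package_dir


if __name__ == "__main__":
    # A temporary directory stands in for the repository's modules folder.
    with tempfile.TemporaryDirectory() as tmp:
        created = scaffold_package(Path(tmp) / "modules", "reads_qc", "fastqc_readscount")
        print(sorted(p.name for p in created.iterdir()))
```

In a real checkout you would pass the repository's modules folder instead of the temporary directory.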

Step 5: Create the corresponding conda env file

We need to create the conda env file that describes the dependencies of our package. To do so, in the envs folder we create a file with the same name as the package and the yaml extension, and write the dependencies in it. In this case we specify only fastqc from the bioconda channel: the reads count is computed with bash commands, so it does not need any specific dependencies. The name of the conda environment is the package name with geomosaic_ as prefix.

[Figure: condaenvfile]
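To make the expected content of the env file concrete, the sketch below renders it as text. It is an illustration under assumptions: the function `render_env_file` is hypothetical, only the bioconda channel named above is listed, and the real file in the repository may pin versions or list additional channels.

```python
def render_env_file(package: str, dependencies, channels=("bioconda",)) -> str:
    """Render a conda environment file for a Geomosaic package as YAML text."""
    # The environment name is the package name with the geomosaic_ prefix.
    lines = [f"name: geomosaic_{package}", "channels:"]
    lines += [f"  - {channel}" for channel in channels]
    lines.append("dependencies:")
    lines += [f"  - {dependency}" for dependency in dependencies]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # This text would be saved as fastqc_readscount.yaml in the envs folder.
    print(render_env_file("fastqc_readscount", ["fastqc"]), end="")
```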

Now we can write our code inside the Snakefile.

Step 7: Write the actual code.

For this package the code is straightforward. Since it uses only the processed reads, we can use an assembly package as a template: we copy the code from the Snakefile.smk of metaspades, paste it into our Snakefile.smk, and then modify it.

Step 7.1 Snakefile: input/output section

We need to rename the rule so that it consists of the package name with the run_ prefix.

  • Our input section is fine, as we need only the reads from the pre_processing module.

  • In the output section we usually put the output folder, which must have the same name as the package. However, if you know that your package produces a specific output file, you can make this section more detailed by listing that file as well.

[Figure: snakefile_io]

Step 7.2 Snakefile: threads section

The threads section is fine as it is. If we know that our package cannot be executed in parallel, we can set this section to 1; otherwise we leave it unchanged.

Step 7.3 Snakefile: conda section

In this section, we only need to put our package name.

Step 7.4 Snakefile: params section

In each package we define a params variable called user_params, which reads the param.yaml file that we created in Step 4. The code that reads the user parameters is almost always the same, so you don’t need to modify it:


user_params=( lambda x: " ".join(filter(None , yaml.safe_load(open(x, "r"))["fastqc_readscount"])) ) (config["USER_PARAMS"]["fastqc_readscount"])

Just replace fastqc_readscount with your package name.

[Figure: snakefile_params]
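To see what the user_params expression does, the sketch below applies the same filter-and-join logic to data that has already been parsed. The function name `join_user_params` is hypothetical, the dictionaries stand in for what yaml.safe_load returns, and the FastQC options are only illustrative values a user might add.

```python
def join_user_params(parsed: dict, package: str) -> str:
    """Drop empty entries and join the remaining options with spaces."""
    # Same logic as the body of the lambda used in the Snakefile.
    return " ".join(filter(None, parsed[package]))


if __name__ == "__main__":
    # An untouched param.yaml (a single empty "- " item) parses to [None].
    untouched = {"fastqc_readscount": [None]}
    # Illustrative file in which the user has added two FastQC options.
    edited = {"fastqc_readscount": ["--nogroup", "--kmers 7"]}

    print(repr(join_user_params(untouched, "fastqc_readscount")))  # ''
    print(repr(join_user_params(edited, "fastqc_readscount")))  # '--nogroup --kmers 7'
```

Because empty items are filtered out, the empty bullet in the default param.yaml contributes nothing to the command line.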

Step 7.5 Snakefile: shell section

This is the section in which we are going to put the actual code to execute our programs.

So for this package we want to integrate the use of fastqc to check the quality of the reads, and then perform a reads count on the processed reads.

[Figure: snakefile]
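The package itself computes the reads count with bash commands. As an illustration of the arithmetic only, here is a Python sketch that counts the reads of a FASTQ file. It assumes the usual four-lines-per-record layout and a .gz suffix for compressed files; the function name is hypothetical and this is not the code used in the Snakefile.

```python
import gzip
import tempfile
from pathlib import Path


def count_fastq_reads(path) -> int:
    """Count reads in a FASTQ file, assuming exactly four lines per record."""
    path = Path(path)
    # Assumption: compressed files are recognised by their ".gz" suffix.
    opener = gzip.open if path.suffix == ".gz" else open
    with opener(path, "rt") as handle:
        n_lines = sum(1 for _ in handle)
    return n_lines // 4


if __name__ == "__main__":
    record = "@read\nACGT\n+\nIIII\n"
    with tempfile.TemporaryDirectory() as tmp:
        fastq = Path(tmp) / "R1.fastq.gz"
        with gzip.open(fastq, "wt") as handle:
            handle.write(record * 3)
        print(count_fastq_reads(fastq))  # prints 3
```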

Step 8: Snakefile Target

In our Snakefile_target.smk file we only need to write a few lines. First, the name of the rule must be the package name with the all_ prefix. Then we need to change the lines in the input section so that they specify the same output folder, since in this case it was the only output that we declared in Snakefile.smk.

[Figure: snakefile_target]
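The naming conventions used throughout this tutorial all derive from the package name. The sketch below collects them in one place; the function and the dictionary keys are hypothetical and only summarize the rules stated in the steps above.

```python
def conventional_names(package: str) -> dict:
    """Derive the names that the tutorial ties to a package name."""
    return {
        "rule": f"run_{package}",  # rule name in Snakefile.smk
        "target_rule": f"all_{package}",  # rule name in Snakefile_target.smk
        "conda_env": f"geomosaic_{package}",  # name inside the conda env file
        "env_file": f"{package}.yaml",  # file created in the envs folder
        "output_folder": package,  # output folder has the package name
    }


if __name__ == "__main__":
    for key, value in conventional_names("fastqc_readscount").items():
        print(f"{key}: {value}")
```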

Step 9: Param.yaml file

The param.yaml is a file in which the user, before the execution of the workflow, can insert all the optional parameters belonging to the package as bullet points. In this case, we only need to open this file and add the following line at the top:

fastqc_readscount:
- 

Test the integration

Now we should test the integrated package. Activate the Geomosaic conda environment, then update Geomosaic by running the following command from the root of the cloned repository:

pip install .

Once the package has been tested, we can commit the changes and create the pull request.