How to - MiModD recipes

Add the mimodd executable to $PATH

If you get a command not found or similar error message with the mimodd command, you need to make sure that your system can discover a copy of the executable file from the $PATH environment variable.

For this, you need to locate the executable file and add the containing folder to the $PATH variable. Try to:

  1. Find the folder that MiModD has been installed in by executing:

    python3 -m MiModD.config
    

    and record the path reported in the top part of the output.

  2. Truncate this (possibly long) path from the end up to the right-most occurrence of lib/ and replace the lib/ with bin/.

    The result is the <executable_folder> you are looking for.

  3. Finally, you can either

    a. append the folder to your $PATH variable.

      For just the currently opened command line session you can use:

      PATH=$PATH:<executable_folder>
      export PATH
      

      The best way to make the change to $PATH permanent is OS- and shell-dependent. Some guidelines can be found here.

    or, alternatively,

    b. copy the executable file from its current folder to a location already in $PATH.

      Use:

      echo $PATH
      

      to learn which folders are already added to $PATH, then do:

      cp <executable_folder>/mimodd <folder_on_$PATH>
      

    Note

    In a multi-user environment, copying the executable to a folder that is also defined in the $PATH of other users may lead to confusion, so approach a) should, generally, be preferred to b).

Example:

% python3 -m MiModD.config

Settings for package MiModD in: /home/xy/.local/lib/python3.4/site-packages/MiModD

The path to the executable is /home/xy/.local/bin/.

Now either add this folder to $PATH:

PATH=$PATH:/home/xy/.local/bin/
export PATH

or copy the executable. For example:

% echo $PATH

/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin

% cp /home/xy/.local/bin/mimodd /usr/local/bin

where the last command would likely require a prepended sudo.
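The path manipulation from step 2 can also be sketched as a small shell function. The name executable_folder is hypothetical, and the sketch assumes that the installation path contains a /lib/ component, as in the example above:

```shell
# Hypothetical helper: derive the executable folder from the package path
# reported by "python3 -m MiModD.config".
executable_folder() {
    case "$1" in
        */lib/*)
            # ${1%/lib/*} strips the shortest suffix starting with /lib/,
            # i.e. everything from the right-most /lib/ onwards
            printf '%s/bin/\n' "${1%/lib/*}"
            ;;
        *)
            echo "no lib/ component found in: $1" >&2
            return 1
            ;;
    esac
}
```

Called with the package path from the example, executable_folder /home/xy/.local/lib/python3.4/site-packages/MiModD prints /home/xy/.local/bin/.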

Re-align coordinate-sorted (previously aligned) reads

coming soon …


Understand and use multiprocessing efficiently

Parallel processing built into MiModD

Several MiModD tools use parallel data processing transparently to speed up the respective analysis step. By default, these tools, including the computationally intensive core tools snap, snap-batch and varcall, will use as many parallel processes as indicated by the MULTITHREADING_LEVEL setting in the MiModD configuration file.

Note

Make sure you use the config tool to set the MULTITHREADING_LEVEL parameter to a value suitable for your machine.

Alternatively, when working from the command line, you can instruct any tool that uses internal parallel processing to use a specific number of parallel processes, usually by providing the number with the -t or --threads option. This runtime setting overrides the setting in the MiModD configuration file.

For example, to perform the variant calling from the third tutorial example using 6 parallel processes independent of the general configuration settings, you would run:

mimodd varcall WS220.64.fa example3.aln.bam -o example3_calls.bcf -t 6 --quiet --verbose

This method can be used to make one specific analysis step use more or less processor power depending on other tasks currently running on the machine.

The Galaxy interface intentionally offers no equivalent of the -t option. This prevents remote users accessing the server from overcommitting processors on the machine.

What is the optimal setting for parallel processing?

The optimal degree of parallel processing depends on several factors, most importantly the number of logical processors available on your system, which sets the upper limit: splitting calculation-intensive work, as performed by most MiModD tools, into more processes than there are processors to run them will generally reduce rather than improve performance.

In practice, you will usually want to always keep at least one processor unoccupied by MiModD to ensure responsiveness of other system tasks during an ongoing analysis. If you run MiModD from within Galaxy, be aware that one additional process will be occupied with running Galaxy itself. If you are planning to run additional processor-hungry software in parallel to analyses with MiModD, you will probably want to set aside additional processor power for these applications.

An additional factor to consider is the number of chromosomes of your favorite organism. If you are going to analyse mostly data from this organism, it can be advantageous to choose the degree of multiprocessing such that the chromosome number is a multiple of it since, in the current version, variant calling in MiModD is split into processes on a per-chromosome basis.

As a starting point, we recommend setting MULTITHREADING_LEVEL to roughly two thirds of the number of available processors, adjusted to the nearest value that divides your organism’s chromosome number evenly. When using MiModD from the command line, you can then use the -t tool option to allocate more or less processor power in situations with an exceptionally low or high number of additional tasks running on the system.
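The rule of thumb can be sketched as a shell function. The name suggest_level is hypothetical. The sketch also assumes one particular reading of the recommendation: because variant calling is split per chromosome, it picks the divisor of the chromosome number that lies closest to two thirds of the processor count.

```shell
# Hypothetical helper: suggest a starting value for MULTITHREADING_LEVEL.
# $1 = number of logical processors, $2 = number of chromosomes
suggest_level() {
    target=$(( (2 * $1 + 1) / 3 ))    # two thirds of the processors, rounded
    best=1
    best_diff=$(( $1 + $2 ))          # larger than any possible difference
    d=1
    while [ "$d" -le "$2" ]; do
        if [ $(( $2 % d )) -eq 0 ]; then      # d divides the chromosome number
            if [ "$d" -gt "$target" ]; then
                diff=$(( d - target ))
            else
                diff=$(( target - d ))
            fi
            if [ "$diff" -lt "$best_diff" ]; then
                best=$d
                best_diff=$diff
            fi
        fi
        d=$(( d + 1 ))
    done
    echo "$best"
}
```

On many systems the processor count can be obtained with getconf _NPROCESSORS_ONLN, so a call might look like suggest_level "$(getconf _NPROCESSORS_ONLN)" 6, where 6 is an illustrative chromosome number. Treat the result as a starting value and adjust it to the other load on the machine.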

Parallel execution of several MiModD tools

Currently, MiModD does not provide built-in support for parallel processing downstream of variant calling. However, the analysis workflow of the package was designed to allow for simple parallel job execution by users.

Specifically, the varextract, covstats and delcall tools do not need to be run sequentially, but require as input only files generated by upstream tools. After variant calling is performed on a particular dataset, it is, thus, possible to extract the called variants, calculate coverage statistics and call deletions in parallel using three independent processes. From Galaxy, parallel job execution is as simple as executing additional jobs while others are still running. From the command line, you can use standard shell job control syntax, i.e., use bg or append an &, to execute any command as a background job and keep the command prompt available for executing additional commands.

As an example, you could run:

$ mimodd varextract example1_calls.bcf -o example1_extracted_variants.vcf &
$ mimodd delcall SRR101486.aln.bam example1_calls.bcf -o example1_deletions.txt --max-cov 4 --min-size 100

to perform, in parallel, analysis steps 3 and 5 from the first tutorial section.
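In a script, it is often useful to continue only after all background jobs have finished. A minimal sketch using the shell's wait builtin follows; the function name run_in_parallel is hypothetical, and each argument is assumed to be one complete command line:

```shell
# Hypothetical helper: start every argument as a background job,
# then wait for all of them and report whether any job failed.
run_in_parallel() {
    pids=""
    status=0
    for cmd in "$@"; do
        sh -c "$cmd" &            # run the command as a background job
        pids="$pids $!"           # remember its process ID
    done
    for pid in $pids; do
        wait "$pid" || status=1   # collect the exit status of each job
    done
    return "$status"
}
```

The two tutorial commands above could then be run together as run_in_parallel "mimodd varextract example1_calls.bcf -o example1_extracted_variants.vcf" "mimodd delcall SRR101486.aln.bam example1_calls.bcf -o example1_deletions.txt --max-cov 4 --min-size 100". The function returns only once both jobs have finished.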


Change the order of the samples in a multi-sample vcf file

vcf files in MiModD are typically multi-sample files that store information about variant sites on a per-sample basis.

Although these files conform to the official file format specification, some third party tools may not operate correctly on them. Some tools may refuse to work with multi-sample input files, while others may appear to work, but will really just detect and work on the first sample in the file.

For both cases, MiModD offers the possibility to generate compatible output through the vcf-filter command line tool or the VCF Filter tool from Galaxy. The sample-specific information in a vcf file is organized in columns with the first sample described in the left-most column and the tool will allow you to keep just certain columns and to change the order of the retained columns.
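To check the current sample order before filtering, you can read the sample names from the #CHROM header line of the file, where they occupy the tenth and all following tab-separated columns. The sketch below uses plain awk rather than a MiModD tool, and the function name vcf_samples is hypothetical:

```shell
# Hypothetical helper: print the sample names of an uncompressed vcf file,
# one per line, in the order in which they appear in the file.
vcf_samples() {
    awk -F '\t' '/^#CHROM/ {
        for (i = 10; i <= NF; i++) print $i   # columns 10+ hold the samples
        exit
    }' "$1"
}
```

The first name printed is the sample that single-sample tools would pick up.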

To do this from Galaxy, open the tool’s interface and provide the (comma-separated) names of all samples you would like to keep in the output in their new desired order in the sample field, e.g., for a vcf file describing three samples A, B, C, you may write B,C,A to obtain a new file in which sample B comes first and sample A has moved to the end.

[Figure: recipes_sample_order.png]

From the command line, use the --vfilter option to specify the new sample order, e.g.:

mimodd vcf-filter <vcf input file> -o <filtered output file> --vfilter B C A

The result is a new vcf file ready to be used to analyse sample B with tools only capable of detecting the first sample of the file, while retaining all sample information for use by MiModD and other suitable software.

Likewise, for tools rejecting any multi-sample vcf file, you can omit all but one sample from the output. From the command line, to keep just sample B from the above example you can run:

mimodd vcf-filter <vcf input file> -o <filtered output file> --vfilter B