Tool Documentation

As an end user, you have two ways to work with MiModD:

you can use it as a suite of tools from the command line or, more conveniently, through its built-in Galaxy interface.

The former has the advantage of not requiring any additional software installation and avoids the memory and performance overhead of running the Galaxy server.

The latter clearly provides the more intuitive user interface, particularly for beginners.

This chapter discusses both the command line usage and the corresponding Galaxy wrapper for each tool in the MiModD package. For most tools, the relationship between their command line parameters and their configuration in Galaxy should be obvious, but we will point out differences between the two interfaces when they are important.

Overview of the MiModD tools

The command line interface to the analysis tools

The following is a list of all command line analysis tools currently built into MiModD together with a brief description of the purpose of each tool:

info         retrieve information about the samples encoded in a file for
             various supported formats
header       generate a SAM format header from an NGS run description
convert      convert between different sequenced reads file formats
reheader     from a BAM file generate a new file with specified header
             sections modified based on the header of a template SAM file
sort         sort a BAM file by coordinates (or names) of the mapped
             reads
snap         align sequence reads using the SNAP aligner
snap-batch   run several snap jobs and pool the resulting alignments into
             a multi-sample SAM/BAM file
snap-index   index a reference genome for use with the SNAP aligner
varcall      predict SNPs and indels in one or more aligned read samples
             and calculate the coverage of every base in the reference
             genome using samtools/bcftools
varextract   extract variant sites from BCF input as generated by varcall
             and report them in VCF
covstats     summary coverage statistics for varcall output
delcall      predict deletions in one or more samples of aligned paired-
             end reads based on coverage of the reference genome and on
             insert sizes
vcf-filter   extract lines from a vcf variant file based on sample- and
             field-specific filters
annotate     annotate a vcf variant file with information about the
             affected genes
snpeff-genomes
             list installed SnpEff genomes
cloudmap     generate CloudMap-compatible output from a vcf file

You can obtain this list from the command line by running:

mimodd --help

You can run any individual tool through:

mimodd <tool name> <tool specific arguments>

Additional help on each tool and the arguments it accepts can be obtained by typing a variation of:

mimodd <tool name> --help

for example:

$ mimodd header --help

The Galaxy tool interface

If you have enabled MiModD in your local Galaxy, the same tools should be available (under slightly different names) in the MiModD section of the Galaxy Tools bar as shown in the screenshot.

While the MiModD Galaxy tools are really just wrappers around the tools available from the command line, they provide a clickable, graphical interface that typically is easier to handle for the new or occasional user.

In addition to providing user-friendly access to individual tools, Galaxy also lets you combine individual tools into complex workflows without advanced knowledge of shell syntax or scripting languages. The chapter MiModD and Galaxy in this guide explains how MiModD lets you run complete analyses of NGS data - from sequenced reads to lists of variants - as a single Galaxy workflow without user intervention.

The MiModD administrative tools

MiModD provides a small number of tools for managing the package. These tools are available only from the command line, not from Galaxy, and, unlike the analysis tools, are invoked via the general pattern:

python3 -m MiModD.toolname

where python3 refers to the Python executable used to install MiModD.

Help on a specific tool can be obtained by using:

python3 -m MiModD.toolname --help

Currently, these are the administrative tools installed with MiModD:

config        view and modify configuration settings
upgrade       check for and install upgrades of the package (requires pip)
enablegalaxy  integrate the package into a local installation of Galaxy

Depending on how MiModD and/or Galaxy were installed on your system, these tools may have to be run with superuser rights, i.e., by prepending sudo to their invocation.
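The two invocation patterns described so far (mimodd <tool name> for analysis tools, python3 -m MiModD.toolname for administrative tools) can be sketched in a small script. This is only an illustration: the helper names analysis_cmd, admin_cmd and run_if_available are hypothetical and not part of MiModD, and using sys.executable assumes that the script runs under the same Python interpreter that was used to install MiModD.

```python
import shutil
import subprocess
import sys


def analysis_cmd(tool, *args):
    """Build the argument list for an analysis tool: mimodd <tool name> <arguments>."""
    return ["mimodd", tool, *args]


def admin_cmd(tool, *args, python=sys.executable):
    """Build the argument list for an administrative tool: python3 -m MiModD.<toolname>.

    Assumption: `python` is the interpreter that MiModD was installed with.
    """
    return [python, "-m", "MiModD." + tool, *args]


def run_if_available(cmd):
    """Run cmd only if its executable can be found; return its exit code or None."""
    if shutil.which(cmd[0]) is None:
        return None
    return subprocess.run(cmd).returncode


if __name__ == "__main__":
    # Only print the commands here; nothing is executed.
    print(" ".join(analysis_cmd("header", "--help")))
    print(" ".join(admin_cmd("config", "--help", python="python3")))
```

Building the command as a list rather than a single string avoids shell quoting problems when arguments contain spaces.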

In-development tools

In addition to the administrative tools above, the

python3 -m MiModD.toolname

invocation pattern also provides access to a few in-development tools that are not considered mature enough to deserve their place among the fully exposed analysis tools, but which may be useful enough already to be shared with you.

Currently, these in-development tools are accessible:

index         generate a bam index file
              as expected by many external tools
sanitize      ensure input files are compatible with MiModD;
              currently, only works with fasta files

The MiModD tools in detail

The MiModD tools can be grouped into five categories:

the format converter and query tools

info, header, convert, reheader, sort, snpeff-genomes

help you annotate and arrange your sequencing data and let you convert between different formats for storage and use by other MiModD and external tools.

the core tools

snap, snap-batch, snap-index, varcall, varextract and delcall

do the computation-intensive analysis of the data, aligning sequence reads to reference genomes and detecting sequence variants with respect to the reference.

the data exploration tools

covstats, annotate, vcf-filter and cloudmap

let you summarize, annotate and filter the results of these analyses and help you to extract biologically relevant information.

the administrative tools

config, upgrade, enablegalaxy

in-development tools

index and sanitize

The Format Converter and Query Tools

The info tool

Provide information about file contents.

command line usage

mimodd info <input file> [OPTIONS]

The tool expects an input file and reports the metadata it finds encoded in it. It auto-detects and works with almost every file format used anywhere in the MiModD package, i.e., SAM, BAM, vcf, bcf and fasta, and is useful for quick checks of file contents. The tool can produce either plain-text or html output, which, as with many MiModD tools, is printed to the standard output (stdout, i.e., the console/terminal window) by default, but can be redirected to a file with the -o or --ofile option.

available options:

-o <file>, --ofile <file> : [default: stdout]: redirect the output to the specified file

--oformat <html|txt> : [default: txt]

-v, --verbose

Galaxy tool name:
	Retrieve File Information

Choose the input file from a drop-down menu of available files from your user history, then click Execute to run the tool. In Galaxy, a tool’s output goes to a file. Galaxy stores these files in an internal database and presents them to you in the history pane. Upon running the tool, its results file will automatically be added to your history. Click on the eye symbol to see the file contents in the main pane.

The header tool

Construct a SAM format file header from information about read groups and comments, which can be used by other tools (reheader, convert and snap) to annotate sequence reads files with metadata.

command line usage

mimodd header [--rg-id <RG_ID> [RG_ID]...] [--rg-sm <RG_SM> [RG_SM]...]
              [--rg-cn <RG_CN> [RG_CN]...] [--rg-ds <RG_DS> [RG_DS]...]
              [--rg-dt <RG_DT> [RG_DT]...] [--rg-lb <RG_LB> [RG_LB]...]
              [--rg-pl <RG_PL> [RG_PL]...] [--rg-pi <RG_PI> [RG_PI]...]
              [--rg-pu <RG_PU> [RG_PU]...]
              [--co <COMMENT> [COMMENT]... ] [OPTIONS]

available options:

-o <file>, --ofile <file> : [default: stdout]: redirect the output to the specified file
-x, --relaxed: do not enforce a sample name to be specified for every read group

A read group in SAM/BAM format refers to a group of sequences that form a logical unit and should be treated together. Typically, a read group comprises sequence reads obtained from one sample in a particular sequencing run.

A frequent source of confusion for beginners when dealing with SAM/BAM format is the distinction between read group ID and sample name and why it is (sometimes) necessary to have both.

Essentially, the distinction becomes relevant whenever you obtain several sequencing datasets for the same sample. You may then want to keep these datasets separate at certain steps of an analysis, but treat them together at others. If, for example, one dataset consists of single-end reads and the other of paired-end reads, you would want to have them aligned separately (using different algorithms), but may consider calling variants from the combined set of aligned reads.

MiModD can handle such situations provided that you declare read group IDs and sample names accordingly.

Here is a header that you may use in the above scenario:

@HD VN:1.5
@RG ID:001      SM:sample1      DS:sample1 single-end sequencing data
@RG ID:002      SM:sample1      DS:sample1 paired-end sequencing data

This header can be generated with:

mimodd header --rg-id 001 002 --rg-sm sample1 sample1 --rg-ds "sample1 single-end sequencing data" "sample1 paired-end sequencing data"

and will allow you to safely store both read sets in the same file. Most MiModD tools will analyze them separately, but tools for which it makes sense will let you specify whether you want to group data based on read groups or based on sample names (look at the --group-by-id option of the varcall and delcall tools).
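To make the distinction between read group IDs and sample names concrete, the following minimal sketch parses the header shown above and collects the read group IDs that belong to each sample name. It is an illustration only, not MiModD's own code; the function name read_groups_by_sample is hypothetical, and the sketch relies on the SAM convention that header fields are tab-separated TAG:VALUE pairs.

```python
from collections import defaultdict


def read_groups_by_sample(header_text):
    """Map each sample name (SM) to the list of read group IDs declared for it."""
    groups = defaultdict(list)
    for line in header_text.splitlines():
        if not line.startswith("@RG"):
            continue
        # Fields after the record type are tab-separated TAG:VALUE pairs.
        tags = dict(field.split(":", 1) for field in line.split("\t")[1:])
        groups[tags.get("SM")].append(tags["ID"])
    return dict(groups)


HEADER = (
    "@HD\tVN:1.5\n"
    "@RG\tID:001\tSM:sample1\tDS:sample1 single-end sequencing data\n"
    "@RG\tID:002\tSM:sample1\tDS:sample1 paired-end sequencing data\n"
)

print(read_groups_by_sample(HEADER))  # {'sample1': ['001', '002']}
```

Grouping by ID keeps the two datasets apart, while grouping by sample name treats both read groups as one unit.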

You can produce SAM headers with information about one or several read groups using the following command line arguments:

--rg-id <ID> [ID]...: Read group IDs are unique strings unambiguously identifying a read group. They may be provided by the sequencing center generating the data, but you can choose any ID(s) as long as you make sure that your naming scheme does not cause ID collisions in downstream analysis steps.
--rg-sm <sample name> [sample name]...: For each read group declared by providing an ID for it, you must specify a sample name using this option (or use the -x switch to make sample names optional). The sample name should identify the biological sample that the read group provides sequencing data for.
--rg-cn <sequencing center> [sequencing center]...: Optional name of the sequencing center that generated the data for the read group(s).
--rg-ds <description> [description]...: Optional one-line read group description(s).
--rg-dt <YYYY-MM-DD> [YYYY-MM-DD]...: Optional information in YYYY-MM-DD date format about when the data for the read group(s) was generated.
--rg-lb <lib identifier> [lib identifier]...: Optional library identifier(s). A unique identifier of the sequence library that was used for sequencing should be given here.
--rg-pl <platform name> [platform name]...: Optional name of the sequencing technology that was used to produce the data for the read group(s). Should be one of the following officially supported names: CAPILLARY, LS454, ILLUMINA, SOLID, HELICOS, IONTORRENT, PACBIO.
--rg-pi <insert size> [insert size]...: Optional value(s) for the predicted median insert size between sequence reads of the read group(s) on the sequenced DNA fragments. Only applicable for paired-end sequencing data.
--rg-pu <unit ID> [unit ID]...: Optional identifier of the physical sequencing unit that the data for the read group(s) was generated on. Depending on the sequencing platform that was used (see the --rg-pl option) this could be, e.g., a flowcell-barcode or a sequencing slide.

The number of read groups declared in the resulting SAM header is determined by the number of read group IDs and matching number of sample names specified through the --rg-id and --rg-sm options. All other information for any read group is optional. If you do not want to provide a certain parameter for any read group you can simply omit the option. To provide a given parameter for some, but not all of the specified read groups, use the empty string "" as a placeholder. Pairing between parameters is determined by the input order.
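The pairing-by-order rule and the empty-string placeholder can be sketched as a small helper that turns per-read-group records into a mimodd header argument list. This is a hedged illustration: the function name header_args and the record layout (dicts with the keys id, sm and the optional two-letter keys) are assumptions made for this example, and the option spellings follow the option list above.

```python
import shlex


def header_args(read_groups, optional=("cn", "ds", "dt", "lb", "pl", "pi", "pu")):
    """Build a `mimodd header` argument list from a list of read group dicts.

    Values are emitted in input order so that the n-th value of every option
    refers to the n-th read group; missing values become "" placeholders.
    """
    args = ["mimodd", "header"]
    args += ["--rg-id"] + [rg["id"] for rg in read_groups]
    args += ["--rg-sm"] + [rg["sm"] for rg in read_groups]
    for key in optional:
        values = [rg.get(key, "") for rg in read_groups]
        if any(values):  # omit the option entirely if no read group uses it
            args += ["--rg-" + key] + values
    return args


EXAMPLE = [
    {"id": "001", "sm": "sample1", "dt": "2014-10-23"},
    {"id": "002", "sm": "sample2", "dt": "2014-09-14", "ds": "WT control"},
    {"id": "003", "sm": "sample3", "dt": "2014-11-09"},
]

# shlex.join quotes empty strings and values with spaces for use in a shell.
print(shlex.join(header_args(EXAMPLE)))
```

The helper emits --rg-ds before --rg-dt; only the order of the values within one option determines the pairing with read groups.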

Example 1

Specify two read-groups with sequencing date information

$ mimodd header --rg-id 001 002 --rg-sm sample1 sample2 --rg-dt 2014-10-23 2014-09-14

@HD VN:1.5
@RG ID:001  SM:sample1      DT:2014-10-23
@RG ID:002  SM:sample2      DT:2014-09-14

Example 2

Specify three read-groups providing dates for all three, but a description only for the second

$ mimodd header --rg-id 001 002 003 --rg-sm sample1 sample2 sample3 --rg-dt 2014-10-23 2014-09-14 2014-11-09 --rg-ds "" "WT control" ""

@HD VN:1.5
@RG ID:001  SM:sample1      DT:2014-10-23
@RG ID:002  SM:sample2      DS:WT control   DT:2014-09-14
@RG ID:003  SM:sample3      DT:2014-11-09

Example 3

Generate the three-sample header above, but with two general comment lines appended

In addition to or instead of read group information, the header tool also lets you specify an arbitrary number of comment lines. These are not bound to any specific read group and should, generally, be used for remarks concerning the read groups or their analysis as a whole, though you can, of course, refer to a particular read group in the comment text:

$ mimodd header --rg-id 001 002 003 --rg-sm sample1 sample2 sample3 --rg-dt 2014-10-23 2014-09-14 2014-11-09 --rg-ds "" "WT control" "" --co "data from the xx sequencing project" "provided by yy lab"

@HD VN:1.5
@RG ID:001  SM:sample1      DT:2014-10-23
@RG ID:002  SM:sample2      DS:WT control   DT:2014-09-14
@RG ID:003  SM:sample3      DT:2014-11-09
@CO data from the xx sequencing project
@CO provided by yy lab

Note the use of quotes to group elements. Without them a new comment line would be started after every whitespace.

Example 4

Generate a header without read groups and only comment lines

$ mimodd header --co "forgot to say:" this

@HD VN:1.5
@CO forgot to say:
@CO this

While such a header may not be terribly useful on its own, it can be used with the reheader tool to add comments to an existing header.

Note

Within MiModD the order of comment lines will, generally, be preserved, but this is not a file-format guarantee, i.e., other software is free to change it.

Galaxy tool name:
	NGS Run Annotation

This tool is more limited than its command line counterpart in that it only allows a single read group to be specified and described. There is no possibility to incorporate reference sequence information or comments. The main purpose of the tool is to generate annotation for unaligned single-sample input files that can be merged with this data by the Convert and Snap Read Alignment tools. In addition, it can be used in combination with the Reheader tool to modify or replace individual read groups in multi-sample BAM files.

The convert tool

Convert between different sequenced reads file formats.

command line usage

mimodd convert <input file> [input file]... [OPTIONS]

available options:

-o <file>, --ofile <file> : [default: stdout]

redirect the output to the specified file

--iformat <fastq|fastq_pe|gz|gz_pe|sam|bam> : [default: bam]

the format of the input file(s); the input format determines whether the -h option (see below) is applicable, whether multiple input files are allowed, and how they are used:
- with input in SAM or BAM format only one input file is accepted
- with input in single-end fastq format (compressed or not; format strings fastq, gz) multiple input files may be given and the reads from all files will be written to the output in concatenated form, i.e., as if they had come from a single source, providing an easy way to merge split fastq files into a single SAM/BAM file
- with input in paired-reads fastq format (compressed or not; format strings fastq_pe, gz_pe) multiple input files are interpreted as alternating r1 and r2 files, i.e., every other file is expected to contain the read mates of the preceding file
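The alternating r1/r2 order expected for paired-reads fastq input is easy to get wrong when the files are kept in two separate lists. The sketch below interleaves them before building the command. The function name paired_convert_args and the file names are hypothetical; the options used (--iformat, --oformat, -o) are the ones described in this section.

```python
def paired_convert_args(r1_files, r2_files, ofile, oformat="bam", compressed=False):
    """Build a `mimodd convert` argument list for paired-end fastq input.

    Input files are interleaved as r1, r2, r1, r2, ... so that every other
    file contains the read mates of the preceding file.
    """
    if len(r1_files) != len(r2_files):
        raise ValueError("every r1 file needs a matching r2 file")
    inputs = [name for pair in zip(r1_files, r2_files) for name in pair]
    iformat = "gz_pe" if compressed else "fastq_pe"
    return ["mimodd", "convert", *inputs,
            "--iformat", iformat, "--oformat", oformat, "-o", ofile]


# Hypothetical file names, used only for illustration.
cmd = paired_convert_args(["lane1_1.fq", "lane2_1.fq"],
                          ["lane1_2.fq", "lane2_2.fq"],
                          "merged.bam")
print(" ".join(cmd))
```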

--oformat <sam|bam|fastq|gz> : [default: sam]

the output format; the output format determines how the --ofile option is interpreted:
- when writing output in SAM or BAM format, the name provided with the --ofile option is used as the name of the (single) output file. Without the option, the output goes to the standard output stream. Note that the -r option changes this behaviour.
- when writing output in fastq or gzip-compressed fastq format, use of --ofile is mandatory and the name provided with the option is used as the basename of possibly multiple output files. One output file will be generated per read group for single-end data and two output files per read group for paired-end data (the required splitting is auto-detected).

-h <header file>, --header <header file>

use the header of the provided SAM file as the output header; applicable only when --iformat specifies a headerless file format, i.e., one of fastq, fastq_pe, gz or gz_pe

-t <threads>, --threads <threads> : [default: config settings]

the maximum number of threads that the conversion is allowed to use; this option is ignored if the specified conversion type does not support parallel processing

-r, --split-on-rgs

if the input file has reads from different read groups, write them to separate output files. With this option, --ofile is required and gets interpreted as the basename of possibly multiple output files. If --oformat is set to bam or sam, one output file per input read group will be produced. -r is implied when --oformat is set to fastq or gz and leads to the splitting described under --oformat.

Additional specifications:

Standard input can be used instead of an input file by specifying a minus sign - instead of a file name.

The following table summarizes the conversions that are currently supported:

from	to	input file #	output file #	optional header
fastq(.gz)	sam/bam	>= 1	1	yes
sam/bam	bam/sam	1§	1§	no
sam/bam	fastq(.gz)	1	>= 1	no

§ with --split-on-rgs the number of output files will equal the number of read groups in the input file.
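The rules for the number of output files stated in the option descriptions and the table above can be condensed into a few lines. This is a sketch of the documented behaviour, not of MiModD's implementation; the function name expected_output_files is hypothetical, and the sketch assumes that all read groups in the input are of the same type (all single-end or all paired-end).

```python
def expected_output_files(oformat, n_read_groups, paired=False, split_on_rgs=False):
    """Return how many output files a conversion is expected to produce."""
    if oformat in ("sam", "bam"):
        # A single file unless -r/--split-on-rgs asks for one file per read group.
        return n_read_groups if split_on_rgs else 1
    if oformat in ("fastq", "gz"):
        # -r is implied: one file per read group, two for paired-end data.
        return n_read_groups * (2 if paired else 1)
    raise ValueError("unsupported output format: " + oformat)


print(expected_output_files("bam", 3))                     # 1
print(expected_output_files("bam", 3, split_on_rgs=True))  # 3
print(expected_output_files("fastq", 3, paired=True))      # 6
```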

Note

We anticipate future versions of MiModD to implement subsampling and filtering during conversions. This is the reason why, in addition to the format combinations above, the currently meaningless (unless you are looking for a slow way to copy files) “conversions” sam <-> sam and bam <-> bam are also accepted.

Examples:

Example 1

View the contents of a BAM file by converting it to SAM format

$ mimodd convert myreads.bam

Conversion from BAM to SAM is the default behavior.

Example 2

Convert a SAM file to BAM format

$ mimodd header --rg-id 007 --rg-sm "secret sample" --co "for your eyes only" | \
  mimodd convert - --iformat sam --oformat bam

[binary BAM data, printed to the terminal as unreadable characters]

While the output is not too useful, this example demonstrates how the tool can be used in a pipe. Now just to show that the above worked as expected:

$ mimodd header --rg-id 007 --rg-sm "secret sample" --co "for your eyes only" | \
  mimodd convert - --iformat sam --oformat bam | mimodd convert -

@HD VN:1.5
@RG ID:007  SM:secret sample
@CO for your eyes only

Galaxy tool name:
	Convert

This tool has the exact same functionality in Galaxy and from the command line. The Galaxy input form simplifies the choice of parameters, though, by dynamically displaying only applicable options based on the selected input file format.

The reheader tool

command line usage

mimodd reheader [template SAM file] <BAM input file>
                [--rg ignore|update|replace [RG_TEMPLATE] [RG_MAPPING]...]
                [--sq ignore|update|replace [SQ_TEMPLATE] [SQ_MAPPING]...]
                [--co ignore|update|replace [CO_TEMPLATE]]
                [--rgm RG_MAPPING [RG_MAPPING]...]
                [--sqm SQ_MAPPING [SQ_MAPPING]...] [OPTIONS]

available options:

-o <file>, --ofile <file> : [default: stdout]: redirect the output to the specified file
-H: output only the resulting header in SAM format

This complex, but powerful tool rewrites an existing <BAM input file> modifying its header on the fly based on the header information of one or several template SAM files according to at least one of these optional parameters:

template SAM file

if specified the header of this file provides the read group information, the sequence dictionary and the comment lines that will be used to modify the input file header.

--rg ignore|update|replace [RG template SAM file] [RG mapping]...

the --rg option followed by one of the keywords ignore, update or replace specifies how the read group section of the new header should be compiled:
- With ignore, any template read group information is ignored and the read group section of the BAM input file is copied to the new file unchanged.
- With update, the input file header is modified by template read group information according to these rules: read groups found only in the BAM input or only in the template are copied to the new header. For read groups with matching IDs found in both the BAM input and the template, any information about the read group found in the template is written to the new header (possibly overwriting input file information), while information found only in the BAM input is copied unchanged.
- With replace, all template read groups are written to the new header as they are (possibly overwriting read groups with matching IDs in the BAM input, without merging any information). Read groups found only in the BAM input file are discarded.

Both update and replace may optionally be followed by

a read group-specific template file, which will be used as the source of template read group information (i.e., if the general template SAM file is also given, its read group section is ignored)

and/or

a read group mapping to be used to determine read group ID matching between the BAM input file and the template. This mapping has to be provided in the exact form input_id : template_id (with spaces around the colon). Several mappings must also be separated by a space. When a mapping is given and a template read group with the ID template_id is found, it is treated as if its ID was input_id.
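The ignore, update and replace rules, together with the effect of a read group mapping, can be modelled on plain dictionaries. The sketch below is an interpretation of the rules as described above and not MiModD's actual implementation; the function name merge_read_groups and the data layout ({ID: {tag: value}}) are assumptions made for illustration.

```python
def merge_read_groups(input_rgs, template_rgs, mode, mapping=None):
    """Model the --rg modes on read groups given as {ID: {tag: value}} dicts.

    mapping maps input_id -> template_id; a template read group whose ID is
    template_id is treated as if its ID were input_id.
    """
    if mode not in ("ignore", "update", "replace"):
        raise ValueError("mode must be ignore, update or replace")
    if mode == "ignore":
        return {rg_id: dict(tags) for rg_id, tags in input_rgs.items()}
    template_to_input = {t: i for i, t in (mapping or {}).items()}
    renamed = {template_to_input.get(rg_id, rg_id): dict(tags)
               for rg_id, tags in template_rgs.items()}
    if mode == "replace":
        # Template read groups are used as they are; input-only ones are dropped.
        return renamed
    # update: keep everything from the input, let template values win on conflicts.
    merged = {rg_id: dict(tags) for rg_id, tags in input_rgs.items()}
    for rg_id, tags in renamed.items():
        merged.setdefault(rg_id, {}).update(tags)
    return merged


INPUT = {"001": {"SM": "sample 1", "DT": "2013-04-30"},
         "002": {"SM": "sample 2", "DT": "2013-04-28"}}
TEMPLATE = {"752": {"SM": "sample 1", "DS": "labbook #1, exp #216"},
            "753": {"SM": "sample 2", "DS": "labbook #1, exp #217"}}
MAPPING = {"001": "752", "002": "753"}

for rg_id, tags in merge_read_groups(INPUT, TEMPLATE, "update", MAPPING).items():
    print(rg_id, tags)
```

With the mapping, the template information is attached to the input IDs 001 and 002; without it, update mode would simply add 752 and 753 as two additional read groups.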

--sq ignore|update|replace [SQ template SAM file] [SQ mapping]...

the --sq option determines the use of template sequence information in just the same way as --rg does for read groups. The only differences are that sequences will be matched by their name (SN header values) and that the length (LN value) and the MD5 checksum (M5 value) of a sequence will never be modified when rewriting sequence information from the BAM input to the new header.

--co ignore|update|replace [template SAM file for comments]

determines how template comments are used. They can be ignored, appended to (with update) any comments found in the BAM input file or be used to replace any comments from the BAM input. As with the --rg and --sq options, an optional specific template SAM file can also be specified.

--rgm RG mapping [RG mapping]...

this option lets you specify one or more read group mappings (in the format shown above for the --rg option) that are used to rename read groups. Importantly, renaming always happens after all other modifications, and the input_id(s) given in the mapping(s) must still exist at that point.

--sqm SQ mapping [SQ mapping]...

equivalent to --rgm, but for renaming sequences

Additional specifications:

A minus sign - can be used in place of any of the different template file names that the tool supports indicating that the respective header template(s) should be read from standard input. This feature makes it possible to use the tool efficiently in pipes, e.g., with input coming from the header tool.
If a general template SAM file is specified the tool’s default for the --rg, --sq and --co options will be set to replace, otherwise the defaults will be ignore.
The tool autodetects header changes (changes in read group IDs or sequence names, elimination of read groups or sequence records) that may be incompatible with records in the body of the BAM file. In these situations, it will scan the file body for incompatible records and try to fix them or, failing that, will terminate with an error message. In this way, the tool guarantees that its reheadered output will always be a valid BAM file. On the other hand, the scan causes very significant runtime increases leading to the following recommendations:
- reserve the use of the --rgm and --sqm options for vital cases
- use replace mode for read groups and sequence names (--rg and --sq options) with caution and take care not to remove read group or sequence records from the final header that get referred to in the body of the BAM file.

Examples:

The following examples assume that you have a BAM input file named “example_input” with the following header (printed in SAM format):

@HD VN:1.0  SO:coordinate
@RG ID:001  SM:sample 1     DT:2013-04-30
@RG ID:002  SM:sample 2     DT:2013-04-28
@SQ SN:chrI LN:15072423
@SQ SN:chrII        LN:15279345
@SQ SN:chrIII       LN:13783700

This file contains reads from two samples mapped against a common reference genome, i.e. the sequences chrI, chrII, chrIII and the read group IDs 001 and 002 will be referenced in the body of the file.

Note

The reheader tool is quite complex and may take some practice to use efficiently. The -H option instructs the tool to generate only the new header in human readable SAM format and we encourage beginners to use this option to check the outcome of their parameter settings before producing the full reheadered BAM file.

Example 1

Adding descriptions to existing read groups

$ mimodd header -x --rg-id 001 002 --rg-ds "labbook #1, exp #216" "labbook #1, exp #217" | \
  mimodd reheader -H example_input --rg update -

@HD VN:1.0  SO:coordinate
@RG ID:001  SM:sample 1     DS:labbook #1, exp #216 DT:2013-04-30
@RG ID:002  SM:sample 2     DS:labbook #1, exp #217 DT:2013-04-28
@SQ SN:chrI LN:15072423
@SQ SN:chrII        LN:15279345
@SQ SN:chrIII       LN:13783700

Note that, for this illustration, we used the -H option to obtain only the resulting header in SAM format.

We decided to generate the SAM header template with the descriptions to be added using the header tool. Use of the -x option saved us from having to redundantly specify the sample names again. We piped the output directly to the reheader tool to use it as the --rg template. update mode allows us to add the descriptions to the existing information of each read group. Using replace instead would have produced this unintended output:

@HD VN:1.0  SO:coordinate
@RG ID:001  DS:labbook #1, exp #216
@RG ID:002  DS:labbook #1, exp #217
@SQ SN:chrI LN:15072423
@SQ SN:chrII        LN:15279345
@SQ SN:chrIII       LN:13783700

with all the existing read group information replaced by the newly provided one.

Example 2

Adding comments

$ mimodd header --co "let me just say:" "this header needs comments" | \
  mimodd reheader -H example_input --co update -

@HD VN:1.0  SO:coordinate
@RG ID:001  SM:sample 1     DT:2013-04-30
@RG ID:002  SM:sample 2     DT:2013-04-28
@SQ SN:chrI LN:15072423
@SQ SN:chrII        LN:15279345
@SQ SN:chrIII       LN:13783700
@CO let me just say:
@CO this header needs comments

Again, we generated the new header elements using mimodd header. In this particular case, the same output would have been generated with --co replace mode since no comment lines existed in the original BAM file.

Example 3

Renaming a read group

$ mimodd reheader -H example_input --rgm 001 : 003

@HD VN:1.0  SO:coordinate
@RG ID:003  SM:sample 1     DT:2013-04-30
@RG ID:002  SM:sample 2     DT:2013-04-28
@SQ SN:chrI LN:15072423
@SQ SN:chrII        LN:15279345
@SQ SN:chrIII       LN:13783700

Note that if we had not used the -H option, but had produced a full output BAM file with:

$ mimodd reheader example_input --rgm 001 : 003 -o example_output

the tool would have scanned the body of the input BAM file and would have replaced any references to read group ID 001 with references to ID 003.

Example 4

Adding information to read groups and changing their ID in one pass

Assume you have a template SAM file named “template.sam” that has additional information about the read groups, but lists those read groups under different IDs, e.g.:

@HD VN:1.0  SO:coordinate
@RG ID:752  SM:sample 1     DS:labbook #1, exp #216
@RG ID:753  SM:sample 2     DS:labbook #1, exp #217

You could use:

$ mimodd reheader -H example_input --rg update template.sam 001 : 752 002 : 753

@HD VN:1.0  SO:coordinate
@RG ID:001  SM:sample 1     DS:labbook #1, exp #216 DT:2013-04-30
@RG ID:002  SM:sample 2     DS:labbook #1, exp #217 DT:2013-04-28
@SQ SN:chrI LN:15072423
@SQ SN:chrII        LN:15279345
@SQ SN:chrIII       LN:13783700

to assign the new information to the correct read groups. If you not only wanted to update the information about the read groups, but also wanted to change their IDs to match those in the template file, you would have to use:

$ mimodd reheader -H example_input --rg update template.sam 001 : 752 002 : 753 --rgm 001 : 752 002 : 753

@HD VN:1.0  SO:coordinate
@RG ID:752  SM:sample 1     DS:labbook #1, exp #216 DT:2013-04-30
@RG ID:753  SM:sample 2     DS:labbook #1, exp #217 DT:2013-04-28
@SQ SN:chrI LN:15072423
@SQ SN:chrII        LN:15279345
@SQ SN:chrIII       LN:13783700

Providing the mapping twice is necessary in this case because the changes of IDs according to the --rgm option are always made after any operations specified by --rg.
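Because the same mapping has to be written out twice, once for --rg and once for --rgm, it can help to generate the command from a single mapping definition. The following sketch does this; the helper names mapping_tokens and reheader_update_and_rename are hypothetical, and each mapping is emitted as three separate arguments (input_id, colon, template_id), matching the space-separated form used in the commands above.

```python
def mapping_tokens(mapping):
    """Turn {input_id: template_id} into the tokens input_id : template_id ..."""
    tokens = []
    for input_id, template_id in mapping.items():
        tokens += [input_id, ":", template_id]
    return tokens


def reheader_update_and_rename(bam, template, mapping, header_only=True):
    """Build a reheader command that updates read groups, then renames them."""
    cmd = ["mimodd", "reheader"]
    if header_only:
        cmd.append("-H")  # print only the resulting header
    cmd += [bam, "--rg", "update", template, *mapping_tokens(mapping)]
    cmd += ["--rgm", *mapping_tokens(mapping)]
    return cmd


cmd = reheader_update_and_rename("example_input", "template.sam",
                                 {"001": "752", "002": "753"})
print(" ".join(cmd))
```

The printed line reproduces the second command of this example; passing header_only=False drops the -H switch so that a full BAM file would be written.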

Galaxy tool name:
	Reheader BAM file

The sort tool

Sort the reads in a SAM or a BAM file.

command line usage

mimodd sort <input file> [OPTIONS]

available options:

-o <file>, --ofile <file> : [default: stdout]: redirect the output to the specified file

--iformat <bam|sam> : [default: assume bam]

--oformat <bam|sam> : [default: bam]

-n, --by_name

-l <compression level> : [default: 6 (standard BAM compression level)]

-m <memory>, --memory <memory> : [default: config settings]

-t <threads>, --threads <threads> : [default: config settings]

By default, the tool sorts the reads in a BAM input file by their mapped reference coordinates. The -n switch can be used to sort by read names instead. The -l, -m, and -t options let you tweak the tool for optimal performance on your system. Specifically, they control the compression level to use for intermediate and final output, how much memory in GB to use for the sorting and how many cores to engage in it. Since the default values for -m and -t come from MiModD’s configuration settings (see Configuring MiModD for your system) they should be adequate in most situations.

Galaxy tool name:
	Sort BAM File

The Galaxy tool version exposes the --by_name and --oformat options of the command line tool as check boxes and should be self-explanatory.

The snpeff-genomes tool

Query the host machine’s SnpEff installation for registered and installed (i.e., usable) genome annotation files.

command line usage

mimodd snpeff-genomes [OPTIONS]

available options:

-o <file>, --ofile <file> : [default: stdout]: redirect the output to the specified file
-c <path>, --config <path>: use this path to locate the SnpEff installation directory instead of looking it up in the MiModD configuration file

Galaxy tool name:
	List Installed SnpEff Genomes

The Core Tools

The snap tool

Align sequence reads to a reference genome.

command line usage

mimodd snap single|paired <reference genome>|<index directory>
            <input file> [<input file2>] -o <file> [OPTIONS]

For the sake of clarity, the many options of this tool can be grouped into four categories:

general options

paired-end data options that define the exact requirements for a valid read pair and will be ignored when aligning single-end reads

advanced alignment options that adjust parameters of the alignment algorithm and will require a bit of understanding of the alignment process to be used reasonably

indexing options that control the details of reference genome index building in preparation of the alignment step

While all options may occasionally be useful, beginners can safely ignore the advanced alignment and indexing options, for which the default values should work reasonably well in most situations.

general options:

--iformat <fastq|gz|sam|bam> : [default: bam]

--oformat <sam|bam> : [default: bam]

--header <header file>: a SAM file providing information about exactly one read group in its header, which will become the read group information in the aligned output file. Specification of a valid header file is required with headerless <input files>, i.e., input in fastq format or SAM/BAM input without read group information. When used with SAM/BAM input containing read group information, this original information will be overwritten.

-t <threads>, --threads <threads> : [default: config settings]

-m <memory>, --memory <memory> : [default: config settings]

maximal memory to use in GB. Please note that, currently, this setting will only be respected during sorting of the aligned reads, but NOT at the alignment step itself, which will always consume memory as required to index the reference genome.

--no-sort

by default, the tool sorts aligned reads based on their mapped coordinates on the reference genome; with this option, the original order of reads in the <input file(s)> is retained instead.

Please note that ONLY coordinate-sorted aligned read files can be processed with downstream MiModD tools like varcall and delcall.

-X

indicates that CIGAR strings in the output should use = and X to indicate matches/mismatches instead of M for both

USE OF THIS OPTION IS DISCOURAGED as =/X CIGAR strings are still not fully supported by useful third-party tools like IGV

-q, --quiet

-v, --verbose

options affecting paired-end reads alignment:

The options in this category will be ignored by the tool when executed in single mode. In paired mode, however, they can have a relatively big impact on alignment quality so adjusting them to your specific use-case is highly recommended.

-D [RF|FR|FF|RR|ALL]..., --discard-overlapping-mates [RF|FR|FF|RR|ALL]... : default

discard overlapping read mates of the specified orientation(s).

The values RF, FR, FF and RR correspond to the following mapped read orientations (with === indicating the reference genome):

RF:          5' ------>
      =======================
            <------ 5'


FR:      5' ------>
      =======================
                <------ 5'


FF:      5' ------>
             5' ------>
      =======================


RR:
      =======================
            <------ 5'
               <------ 5'

By default, overlapping mate pairs are not treated specially (i.e., they are kept in the output along with non-overlapping read pairs), and this is fine for most sequencing datasets. However, if you know a priori which types of mate overlap must represent anomalous alignments given the sequencing protocol used, you can tell the tool to discard mates of these types. This will generally improve the overall alignment quality, though the size of the improvement depends on the fraction of such reads in the input.

One typical use-case is Illumina PE sequencing, where the RF overlap type must be the result of misalignment or of a genomic DNA template fragment that is too short (so that sequencing runs into the Illumina adapter sequences). Running snap with -D RF will eliminate these potentially misleading reads from the output.

-s MIN MAX, --spacing MIN MAX : [default: 100 10000]

The minimum and maximum insert size for an aligned read pair to be considered a valid pair. Only if the mapped reads show an insert size in the given range, will they be flagged as a proper read pair in the aligned output. Reads with an insert size outside the range will not be lost, but will be treated like single-end reads during alignment. For optimal alignment results, the specified range should match the expected insert size distribution determined by the genomic DNA library preparation (though, in practice, alignments are relatively robust against wrong assumptions). In addition, however, you will not be able to discover deletions larger than MAX when using the output file with delcall.
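The effect of the --spacing range can be pictured with a small Python sketch. This is an illustration only, not MiModD or SNAP code: the function name is hypothetical, and treating both bounds as inclusive and ignoring the sign of the insert size are assumptions.

```python
def is_proper_pair(insert_size, min_size=100, max_size=10000):
    """Return True if the insert size falls inside the --spacing range.

    Assumptions: bounds are inclusive and the sign of the insert size
    (which depends on mate orientation in SAM output) is ignored.
    """
    return min_size <= abs(insert_size) <= max_size

# Pairs outside the range are kept, but are not flagged as proper pairs.
for size in (50, 350, 12000):
    print(size, is_proper_pair(size))
```

With the default range of 100 to 10000, a deletion that stretches the apparent insert size beyond 10000 would no longer yield proper pairs, which is why delcall cannot discover deletions larger than MAX.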

advanced alignment options:

These options map 1:1 to the options of the wrapped SNAP aligner although in some cases MiModD uses different defaults than SNAP as a standalone tool. Most options are explained in detail in the original SNAP manual so we limit ourselves here to listing them briefly together with their MiModD default values.

-d EDIT DISTANCE, --maxdist EDIT DISTANCE : [default: 8]: maximum edit distance allowed per read or pair
-n SEEDS, --maxseeds SEEDS : [default: 25]: number of seeds to use per read
-h HITS, --maxhits HITS : [default: 250]: maximum hits to consider per seed
-c THRESHOLD, --confdiff THRESHOLD : [default: 2]: confidence threshold
-a THRESHOLD, --confadapt THRESHOLD : [default: 7]: confidence adaptation threshold
-e, --error-rep: compute error rate assuming wgsim-generated reads
-P, --no-prefetch: disables cache prefetching in the genome; may be helpful for machines with small caches or lots of cores/cache
-x, --explore: explore some hits of overly popular seeds (useful for filtering)
-f, --stop-on-first: stop on first match within edit distance limit (filtering mode)
-F FILTER, --filter-output FILTER: retain only certain read classes in output (a=aligned only, s=single hit only, u=unaligned only)
-I, --ignore: ignore non-matching IDs in the paired-end aligner
-S SELECTIVITY, --selectivity SELECTIVITY: selectivity; randomly choose 1/selectivity of the reads to score
-C ++|+-|-+|--, --clipping ++|+-|-+|-- : [default: ++]: specify a combination of two + or - symbols to indicate whether to clip low-quality bases from the front and back of reads respectively
-G PENALTY, --gap-penalty PENALTY: specify a gap penalty to use when generating CIGAR strings
-b, --bind-threads: bind each thread to its processor

indexing options:

--idx-seedsize SEED SIZE : [default: 20]: Seed size used in building the index
--idx-slack SLACK : [default: 0.3]: Hash table slack for indexing
--idx-overflow FACTOR : [default: 40]: factor (between 1 and 1000) to set the size of the index build overflow space
--idx-out INDEX DIR: name of the index directory to be created; if given the index directory will be permanent (i.e., available for additional runs of the tool), otherwise a temporary directory will be used

In addition to the above options, the tool recognizes two legacy options solely for backwards-compatibility with earlier versions of MiModD. These are:

-M, --mmatch-notation: indicates that CIGAR strings in the output should use M (alignment match) rather than = and X (sequence (mis-)match); this option has been superseded by the -X option
--sort, --so: sort output file by alignment location; this is now the default behaviour; use --no-sort to turn sorting off

The tool aligns NGS reads in fastq (gzipped or uncompressed), SAM or BAM format (specified with --iformat) to a reference genome using the SNAP aligner.

Besides an <input file> (or two in case of paired-end reads in fastq format) of reads, it requires specification of a sequencing mode, which can either be single or paired, and a <reference genome> to align the reads against. The tool will automatically build an index for this reference using the snap-index tool internally. Alternatively, a precalculated index (obtained from a previous snap call using the --idx-out option or directly from the snap-index tool) may be specified. Whether a reference genome or an index is provided will be auto-detected.

The tool requires your input file(s) to have associated header information. If this is not the case, e.g., with fastq input files, you must provide a SAM file with that information through the --header option. You can also use this option to exchange existing header information in the input just for the alignment.

Note that, currently, the tool does NOT support printing to the standard output, so specifying an output file with the -o option is mandatory.
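The requirements described above (a sequencing mode, a reference or index, one or two input files, a mandatory output file, and a header file for fastq input) can be summarized in a short Python sketch. The helper build_snap_command and all file names are hypothetical; the snippet only assembles and prints a command line from documented options and does not run the aligner.

```python
import shlex

def build_snap_command(mode, reference, inputs, outfile,
                       iformat=None, header=None, extra=()):
    """Assemble a 'mimodd snap' command line; raise on obviously invalid input."""
    if mode not in ("single", "paired"):
        raise ValueError("mode must be 'single' or 'paired'")
    if not outfile:
        # snap cannot write to standard output, so -o is mandatory
        raise ValueError("an output file (-o) is mandatory for snap")
    cmd = ["mimodd", "snap", mode, reference, *inputs, "-o", outfile]
    if iformat:
        cmd += ["--iformat", iformat]
    if header:
        cmd += ["--header", header]     # required for headerless input such as fastq
    cmd += list(extra)
    return cmd

# Hypothetical paired-end fastq job that discards RF-overlapping mates.
cmd = build_snap_command("paired", "reference.fa",
                         ["reads_1.fastq", "reads_2.fastq"], "aligned.bam",
                         iformat="fastq", header="sample_header.sam",
                         extra=["-D", "RF"])
print(shlex.join(cmd))
```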

Galaxy interface¶

Galaxy tool name : -

Note

This tool is not directly accessible from Galaxy. The Galaxy interface of MiModD uses the snap-batch tool instead.

The snap-batch tool:¶

Execute a batch of snap tool calls automatically in an optimized way and merge the results of all calls into a single aligned reads file with several read groups.

command line usage¶

mimodd snap-batch -s <mimodd snap command line> [mimodd snap command line]...

or:

mimodd snap-batch -f <input file with snap command lines>

-s: read individual snap calls from the rest of the command line

The individual commands passed through the -s option or in the file specified with -f should be valid command lines for the mimodd snap tool, but without the leading mimodd.

In case of the -s option, individual commands should be enclosed in quotes to allow command grouping. With the -f option, the file should have one snap command per line.

Since this tool produces only one merged output file for all jobs, the options -o, --oformat and --no-sort are interpreted only once from the first snap call. Conflicting settings in later calls are ignored. Note, however, that -o, --ofile is a required option and, thus, needs to be provided with every call to make it syntactically correct.
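A command file for the -f option could be generated along the following lines. This is a sketch under stated assumptions: the sample names, file names and the helper snap_batch_lines are hypothetical, and the input files are assumed to be BAM files that already carry read group information (so that no --header option is needed). Each line is a snap command without the leading mimodd and includes the required -o option.

```python
import os
import tempfile

# Hypothetical mapping of sample names to unaligned BAM files.
samples = {"sample1": "sample1_reads.bam", "sample2": "sample2_reads.bam"}

def snap_batch_lines(samples, reference, outfile):
    """Return one snap command per sample, without the leading 'mimodd'."""
    lines = []
    for name in sorted(samples):
        # -o is repeated on every line to keep each call syntactically valid;
        # snap-batch only interprets it from the first call.
        lines.append("snap paired {} {} -o {}".format(reference, samples[name], outfile))
    return lines

lines = snap_batch_lines(samples, "reference.fa", "all_samples.bam")
path = os.path.join(tempfile.mkdtemp(), "snap_jobs.txt")
with open(path, "w") as fh:
    fh.write("\n".join(lines) + "\n")

# The resulting file would then be passed to the tool like this:
print("mimodd snap-batch -f", path)
```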

Galaxy tool name:
	Snap Read Alignment

A few restrictions currently apply when using this tool from the Galaxy interface:

As explained above, when the tool is used from the command line, all parameters except the output file, its format and its sort order can differ between the individual snap calls.

In contrast, when the tool is used from Galaxy, all parameters except the input file format and the optional custom input file header information are shared between calls. This behaviour helps to keep the Galaxy interface relatively simple and is good enough for aligning reads from several samples and/or read groups against a common reference with common settings.

For more complex tasks, however, you will have to use the snap-batch command line interface.

The Snap Read Alignment tool with standard options.

The advanced options section, which can be expanded by selecting change settings under further parameter settings.

The snap-index tool:¶

Generate an index of a reference genome for use by the snap and snap-batch tools.

command line usage¶

mimodd snap-index <reference genome> <index_directory> [OPTIONS]

-s <seed size>

-h <slack>

-t <threads>, --threads <threads> : [default: config settings]

Use of this tool is rarely necessary since snap and snap-batch call the underlying indexing function internally if needed.

Galaxy interface¶

Galaxy tool name : -

Note

This tool is not accessible from Galaxy.

The varcall tool:¶

Predict single-nucleotide variants (SNVs) and indels with respect to a reference genome in one or more groups of aligned reads.

command line usage¶

mimodd varcall <reference genome> <input file> [input file]... [OPTIONS]

available options:

-o <file>, --ofile <file> : [default: stdout]: redirect the output to the specified file
-d <depth>, --depth <depth>: set max per-BAM depth
-i, --group-by-id: call variants on per-read group basis instead of per-sample
-x, --relaxed: turn off md5 checksum comparison between sequences in the reference genome and those specified in the BAM input file header(s)

-t <threads>, --threads <threads> : [default: config settings]

-q, --quiet

-v, --verbose

Galaxy tool name:
	Variant Calling

The tool lets you choose a reference genome and one or more input files of aligned reads from drop-down menus.

Toggle switches can be used to enable or disable the md5 checksum comparison and to set the mode of read grouping.

The maximum per-BAM depth can also be set.

The varextract tool:¶

Extract variant sites from BCF input as generated by varcall and report them in VCF format.

command line usage¶

mimodd varextract <input bcf file> [OPTIONS]

-o <file>, --ofile <file> : [default: stdout]: redirect the output to the specified file

-p <vcf file> [<vcf file>]..., --pre-vcf <vcf file> [<vcf file>]...

-a, --keep-alts

-v, --verbose

Galaxy tool name:
	Extract Variant Sites

The delcall tool:¶

Predict deletions in one or more groups of aligned paired-end reads.

command line usage¶

mimodd delcall <input bam file> [input bam file]... <coverage file> [OPTIONS]

-o <file>, --ofile <file> : [default: stdout]: redirect the output to the specified file

--max-cov <maximal coverage>

--min-size <minimal size>

-u, --include-uncovered: include uncovered regions (even if they do not qualify as deletions)

-i, --group-by-id

-v, --verbose

The tool predicts deletions in one or more samples of aligned paired-end reads from the specified input bam files based on coverage of the reference genome (taken from the input cov file) and on observed insert sizes around the region.

By default, the tool reports only those uncovered regions for which there is statistical evidence that they are real deletions. With the -u switch, any uncovered region will be included instead.

The output (in gff format) can be redirected from the standard output to a file specified with the -o option.

The --max-cov and --min-size options can be used to set coverage and size thresholds for reporting uncovered regions/deletions.

As in varcall, there is an -i switch to control the mode of read grouping, but its meaning is slightly different. If enabled, reads are grouped strictly by their read groups. The default is to group reads by their associated sample names, but only for the coverage analysis, i.e., to detect uncovered regions. The statistical test for real deletions, however, will always be performed on a per-read group basis.

Note

This tool requires paired-end data for the statistical assessment of deletions, but single-end data with the same sample name can be included in, and can improve, the initial coverage analysis.

Galaxy tool name:
	Deletion Prediction for Paired-end Data

You can select one or more bam input files and a cov file from drop-down menus.

You can provide custom values for the maximal coverage allowed inside a region to consider it as a deletion and the minimal deletion size.

Toggle switches let you control whether to include uncovered regions that are not statistically significant deletions in the output and how to group reads from the input bam file(s).

The Data Exploration Tools¶

The covstats tool:¶

Provide a coverage summary report for output from the varcall tool.

command line usage¶

mimodd covstats <input bcf file> [-o <file>]

available options:

-o <file>, --ofile <file> : [default: stdout]: redirect the output to the specified file

Galaxy tool name:
	Coverage Statistics

The vcf-filter tool:¶

Filter multi-sample vcf files, like the ones generated by varcall, by chromosomal region, variant type (i.e., INDEL or single nucleotide change), or by sample-specific genotypes and coverage.

command line usage¶

mimodd vcf-filter <input file> [OPTIONS]

-o <file>, --ofile <file> : [default: stdout]: redirect the output to the specified file
-s <sample name> [sample name]..., --samples ...: one or more names of samples that the sample-specific filters --gt , --dp and --gq should work on
--gt <genotype pattern> [genotype pattern]...: one or more genotype patterns of the form x/x[, x/x] where x=0 indicates a reference allele and x=1 a variant allele; exactly one genotype pattern (possibly composed of several comma-separated genotypes) needs to be given per sample name specified through -s and variants will be retained only if every listed sample has a genotype contained in its corresponding genotype patterns; use ANY as the genotype pattern to exclude a sample from being used in genotype filtering
--dp <depth> [depth]...: depth of coverage threshold; exactly one value needed for every sample specified through -s; variants will only be retained if every sample declared with -s shows a read coverage of the variant site at least equal to the corresponding threshold; use 0 to skip the requirement for a given sample
--gq <genotype quality> [genotype quality]: genotype quality threshold; exactly one value needed for every sample specified through -s; variants will only be retained if every sample declared with -s has a quality score for its predicted genotype equal to or higher than the corresponding threshold; specifying 0 as the threshold effectively skips gq filtering for a given sample
--af <allelic fraction specifier> [allelic fraction specifier]: exactly one value is needed for every sample specified through -s; each allelic fraction specifier has the format [ALLELE#]:[MIN FRACTION]:[MAX FRACTION] and variants are only retained if the indicated allele number is found in a fraction of reads between MIN and MAX FRACTION; if omitted, MIN and MAX FRACTION default to 0 and 1, respectively; if no ALLELE# is specified, then the fraction criterion must be met by the most frequent non-reference allele
-r <chromosome>:[start-stop] [chromosome:[start-stop]]..., --region ...: retain variants only if they fall into one of the given chromosomal regions (specified in the format chrom:start-stop or chrom: for a whole chromosome)
-i|-I, --no-indels|--indels-only: mutually exclusive options to filter out or retain only indels
--vfilter <sample name> [sample name]...: filter by sample name(s); instead of filtering variants, this option will retain sample-specific information about each variant only for the specified samples; if several sample names are provided their order also determines the order of sample-specific information in the output; useful for demultiplexing and restructuring (the sample-specific part of) multi-sample vcf files
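The format of the allelic fraction specifiers accepted by --af can be illustrated with a small parser. This is a sketch of the format as described above, not the parser used by MiModD; the function name is hypothetical and the tool's own handling of edge cases may differ.

```python
def parse_af_specifier(spec):
    """Split '[ALLELE#]:[MIN FRACTION]:[MAX FRACTION]' into a 3-tuple.

    A missing allele number is returned as None (meaning: the most frequent
    non-reference allele); missing fractions default to 0 and 1.
    """
    parts = spec.split(":")
    if len(parts) != 3:
        raise ValueError("expected ALLELE#:MIN FRACTION:MAX FRACTION, got %r" % spec)
    allele_s, min_s, max_s = parts
    allele = int(allele_s) if allele_s else None
    min_fraction = float(min_s) if min_s else 0.0
    max_fraction = float(max_s) if max_s else 1.0
    if not 0.0 <= min_fraction <= max_fraction <= 1.0:
        raise ValueError("fractions must satisfy 0 <= MIN <= MAX <= 1")
    return allele, min_fraction, max_fraction

print(parse_af_specifier("1:0.2:0.8"))   # allele 1 in 20% to 80% of reads
print(parse_af_specifier("::0.5"))       # most frequent alt allele, at most 50%
```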

To filter variants based on sample-specific information, first use the -s option to specify one or more samples by name to which the filters should be applied. Then use the --gt, --dp or --gq options to specify the filter criteria. Each of these options, when present, has to be followed by a number of filters matching the number of samples declared with -s and the filters are associated with the samples by position on the command line.

--gt sets genotype filters. Currently, 0/0 (homozygous ref allele), 0/1 (heterozygous) and 1/1 (homozygous alt allele) are allowed as <genotype patterns>. Several genotypes can be specified by separating them with commas. The keyword ANY is allowed as a placeholder (i.e., no genotype filter should be applied to the given sample), and this should be preferred over the alternative 0/0,0/1,1/1 notation for clarity, brevity and efficiency.

--dp sets coverage filters. To pass the filter, a variant site has to be covered by at least as many sample-specific reads as set with --dp. Consequently, 0 can serve as a placeholder here.

--gq sets genotype quality filters. To pass the filter, the quality of the genotype call for the relevant sample has to be at least as high as the filter criterion. Again, 0 has a placeholder function.

All filters discussed above are subtractive and line-based, i.e., if, and only if, a variant passes ALL filters, the full entry is retained.

In addition, the tool supports one column-based/vertical filter via the --vfilter option. If used, then of the sample-specific columns of any line (that passes all other filters) only those describing samples specified with the option are retained.
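The positional pairing of samples with filters, and the rule that a variant is retained only if it passes ALL line-based filters, can be sketched in Python as follows. The record layout (a dictionary with GT, DP and GQ values per sample) and the function names are hypothetical simplifications for illustration, not the internal data structures of MiModD.

```python
def parse_genotype_pattern(pattern):
    """Return the set of accepted genotypes, or None for the ANY placeholder."""
    if pattern == "ANY":
        return None
    return set(pattern.split(","))

def passes_filters(record, samples, gt_patterns=None,
                   dp_thresholds=None, gq_thresholds=None):
    """Return True only if every listed sample passes every given filter.

    Filters are matched to samples by position, as on the command line.
    """
    for i, sample in enumerate(samples):
        info = record[sample]
        if gt_patterns is not None:
            allowed = parse_genotype_pattern(gt_patterns[i])
            if allowed is not None and info["GT"] not in allowed:
                return False
        if dp_thresholds is not None and info["DP"] < dp_thresholds[i]:
            return False
        if gq_thresholds is not None and info["GQ"] < gq_thresholds[i]:
            return False
    return True

# Hypothetical variant record with per-sample genotype, depth and quality.
record = {
    "sample1": {"GT": "0/1", "DP": 12, "GQ": 40},
    "sample2": {"GT": "0/0", "DP": 2, "GQ": 30},
}
# Mirrors: -s sample1 sample2 --gt 0/1,1/1 ANY --dp 3 3
print(passes_filters(record, ["sample1", "sample2"],
                     gt_patterns=["0/1,1/1", "ANY"], dp_thresholds=[3, 3]))
```

In this example the record is rejected because sample2 is covered by only 2 reads; using 0 as the second --dp value would turn off the depth requirement for sample2 and let the record pass.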

Examples:

Example 1¶

Of a vcf input file variants_in.vcf with a sample called sample1 (and possibly others) retain all variants for which sample1’s genotype is 0/1 (heterozygous) or 1/1 (homozygous for the variant allele):

$ mimodd vcf-filter variants_in.vcf -s sample1 --gt 0/1,1/1

Example 2¶

Of the same file, retain all variants, for which sample1’s genotype is 0/1 or 1/1 AND for which sample2’s genotype is 0/0 (homozygous for the reference allele):

$ mimodd vcf-filter variants_in.vcf -s sample1 sample2 --gt 0/1,1/1 0/0

Example 3¶

With the same input again, retain all variants for which the genotype of sample1 is 0/1 or 1/1 AND for which sample1 AND sample2 show at least 3-fold coverage (the genotype of sample2 is not used for filtering):

$ mimodd vcf-filter variants_in.vcf -s sample1 sample2 --gt 0/1,1/1 ANY --dp 3 3

Example 4¶

Retain all variants found on chromosomes chrI and chrII for which the genotype of sample1 is 1/1 AND that of sample2 is 0/0. For these variants, report only sample-specific information for sample1:

$ mimodd vcf-filter variants_in.vcf -s sample1 sample2 --gt 1/1 0/0 --region chrI: chrII: --vfilter sample1

Galaxy tool name:
	VCF Filter

The annotate tool:¶

Generate annotated output for called variants to simplify data exploration.

command line usage¶

mimodd annotate <input file> [OPTIONS]

available options:

-o <file>, --ofile <file> : [default: stdout]: redirect the output to the specified file
-f <html|text>, --oformat <html|text> : [default: html]: output format
--grouping <by_sample|by_genes>: group annotated variants by sample or by affected gene instead of simply keeping the input file order
-l <link formatter file>, --link <link formatter file>: file to read custom hyperlink formatting instructions from; effective only with -f html
--species <species>: the name of the species to assume for the annotations

In addition, the following SnpEff-specific options are supported:

-g <genome>, --genome <genome>: use SnpEff with <genome> as its genome file to annotate variants with their effects on genes and transcripts; the exact name of the file has to be provided; use the snpeff-genomes tool to get a list of all available files
-s <file>, --snpeff-out <file>: generate an additional file <file> which holds the original SnpEff output
--stats <file>: generate an additional file <file> with the SnpEff results summary
-m <memory>, --memory <memory> : [default: config settings]: maximal memory to use in GB

--minC <coverage threshold>

--minQ <quality threshold>

--no-downstream, --no-upstream, --no-intron, --no-intergenic, --no-utr: do not annotate downstream, upstream, intron, intergenic or UTR effects, respectively
--ud <distance> : [default: 500]: specify the upstream/downstream interval length, i.e., variants more than <distance> nts from the next annotated gene are considered to be intergenic

--chr

This tool provides information about the genes and transcripts affected by the variants in a vcf input file.

It uses SnpEff to add annotations to the vcf input, and an installed annotated SnpEff genome has to be specified, from which the annotations will be extracted. The snpeff-genomes tool can be used to get a list of all installed SnpEff genomes on your system.

By default, the tool generates a single output file specified with the -o option. This is an HTML file with a table of all variants, the transcripts they affect and, possibly, hyperlinks to the relevant genes and genome views (supported for selected organisms).

The --grouping option affects the order of this table, changing it from position-based with interleaved samples (the default) to grouped by sample or by affected gene.

On demand, the original annotated vcf generated by SnpEff can be stored in a separate file specified with the -s option.

In addition, an optional SnpEff summary file of the variant effects can be specified with the --stats option.

The -q and -v switches have their usual MiModD meaning, i.e., to suppress the original output from SnpEff and to enable verbose output from the annotate tool itself, respectively.

All other options affect the behaviour of SnpEff.

Galaxy tool name:
	Variant Annotation

The cloudmap tool:¶

From a vcf file, generate output compatible with one of the CloudMap tools EMS Variant Density Mapping and Hawaiian Variant Mapping.

Note

This user guide has a separate section dedicated to a conceptual explanation of mapping-by-sequencing using MiModD and CloudMap.

In the following we focus on a technical description of the tool interface and assume you know what you are trying to achieve with it.

command line usage¶

mimodd cloudmap <vcf input file> SVD|VAF <mapping sample name>
                [-r <parent sample name>] [-u <parent sample name>]
                [OPTIONS]

available options:

-o <file>, --ofile <file> : [default: stdout]: redirect the output to the specified file
-s <file>, --seqdict <file>: generate also the sequence dictionary required by the CloudMap Hawaiian Mapping tool for most species and write it to the specified file
-i, --infer-missing: if variant data for either the related or the unrelated parent strain is not provided, the tool can try to infer the alleles present in that parent from the allele spectrum found in the mapping sample. This is an experimental option, the benefits and caveats of which are currently not well investigated. There should normally be no need to use it!

The tool provides two analysis modes - SVD (for Simple Variant Density mapping) and VAF (Variant Allele Frequency mapping), where

“SVD” mode produces output for visualization with the CloudMap EMS Variant Density Mapping tool, and

“VAF” mode produces output that can be visualized with the Hawaiian Variant Mapping tool.

The <mapping sample name> must specify the name of the sample for which mutation mapping should be performed. VAF mode requires the additional specification of at least one <parent sample name> through the -r, --related-parent or the -u, --unrelated-parent options. These parent samples provide the variants whose inheritance pattern should be analyzed in the mapping sample. A related parent is defined as a sample that has contributed variants to the variant pool of the original mutant strain (or is that mutant strain), while an unrelated parent is a sample that has contributed variants to the mapping strain (or is that strain) that was used in a cross with the original mutant strain to generate the mapping sample. If both a related and an unrelated parent sample are available, both may be specified, and this will typically result in better mapping resolution.

Note

All sample names must be provided exactly as they appear in the <vcf input file>. If unsure, you can use the info tool with:

mimodd info <vcf input file>

to query sample names (and other information) encoded in the input file.

Parent samples cannot be specified in SVD mode. Instead, crossing strain variants should be “subtracted” from the vcf input using the vcf-filter tool.
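The mode-specific rules for parent samples can be captured in a small Python sketch that assembles a cloudmap command line and rejects invalid combinations up front. The helper build_cloudmap_command, the file names and the sample names are hypothetical; only options described in this section are used, and nothing is executed.

```python
import shlex

def build_cloudmap_command(vcf, mode, mapping_sample,
                           related=None, unrelated=None, outfile=None):
    """Assemble a 'mimodd cloudmap' command line and check the parent rules."""
    if mode not in ("SVD", "VAF"):
        raise ValueError("mode must be 'SVD' or 'VAF'")
    parents = [p for p in (related, unrelated) if p]
    if mode == "VAF" and not parents:
        raise ValueError("VAF mode requires at least one parent sample (-r or -u)")
    if mode == "SVD" and parents:
        raise ValueError("parent samples cannot be specified in SVD mode")
    cmd = ["mimodd", "cloudmap", vcf, mode, mapping_sample]
    if related:
        cmd += ["-r", related]
    if unrelated:
        cmd += ["-u", unrelated]
    if outfile:
        cmd += ["-o", outfile]
    return cmd

# Hypothetical sample names; they must match the names in the vcf input file.
cmd = build_cloudmap_command("variants.vcf", "VAF", "outcrossed_pool",
                             related="mutant", unrelated="mapping_strain",
                             outfile="vaf_mapping.txt")
print(shlex.join(cmd))
```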

Galaxy tool name:
	Prepare Variant Data for Mapping

The Galaxy tool has the same functionality as its command line counterpart, but the input form shows/hides the parent sample input fields automatically depending on the selected analysis mode.

The tool interface in SVD mode ...

... and in VAF mode.

The Administrative Tools¶

The config tool¶

View and modify package configuration.

command line usage¶

python3 -m MiModD.config [OPTIONS]

When run without options, the tool displays the install location and the current configuration settings of the package.

Use any of the available options to change the corresponding package settings:

--tmpfiles <path>, --tmpfiles-path <path>: set the directory that MiModD uses to store temporary data in
--snpeff <path>, --snpeff-path <path>: set the directory that you have installed SnpEff into to start using the SnpEff-dependent functionality of the annotate tool
-t <multithreading level>, --threads ..., --multithreading-level ...: set the maximal number of threads that any MiModD tool is allowed to use by default (respected approximately)
-m <max memory>, --memory ..., --max-memory ...: set the maximal amount of memory (in GB) that any MiModD tool is allowed to use by default (most tools will use much less, snap may use more)

The upgrade tool¶

Check for and install upgrades of the package (requires pip).

command line usage¶

python3 -m MiModD.upgrade [install|hg-install] [OPTIONS]

When run without options, the tool checks online for the latest version of MiModD. When used with install, it upgrades the package to the latest available version. hg-install can be used to upgrade to the latest in-development version instead (this requires Mercurial to be installed and is not recommended for normal users as development versions may not be functional).

Your MiModD configuration settings will be preserved during the upgrade.

available options:

-v <version>, --version <version>: upgrade to a specific version of the package instead of the latest version. <version> must be a valid MiModD version number when used with install or an existing changeset identifier when used with hg-install.

Examples:

python3 -m MiModD.upgrade install

The standard command to keep MiModD updated!

Tries to upgrade your installation to the latest available version. The upgrade will abort if no newer version than the currently installed version can be found.

python3 -m MiModD.upgrade install --version 0.1.5.3

Upgrade to version 0.1.5.3 of MiModD independent of the currently installed version. If your current version is newer, it will be downgraded. If your current version is 0.1.5.3, it will get re-installed.

The enablegalaxy tool¶

Enable the package for use from a local installation of Galaxy.

command line usage¶

python3 -m MiModD.enablegalaxy <path to Galaxy> [OPTIONS]

The tool will integrate the package into the local Galaxy installation found at <path to Galaxy>. Specifically, it will autodetect and modify the Galaxy configuration file to add the MiModD section of tools to the Tools bar, and it will modify the installed package to have its tool configuration file point to the tool wrappers. If you have no clue what is meant by all this, do not worry: things should just work. In case they don’t (most likely, because you are using some future backwards-incompatible version of Galaxy), contact us and we may be able to figure out how to use the options below to make things work for you.

available options:

-c <config file>, --config_file <config file>: explicit path to and name of the Galaxy configuration file. Should not normally be required.
-t <line token>, --token <line token> : [default: tool_config_file]: override the name of the key in the configuration file that the path to the MiModD tool wrappers for Galaxy should be added to. Should not normally be required.

In-development Tools¶

The index tool¶

Generate a bam index file.

command line usage¶

python3 -m MiModD.index <bam input file>

As part of the package’s automated file management, the main MiModD analysis tools generate bam indices on the fly when they are needed and remove them again at the end of the analysis. However, several useful third-party tools, such as IGV, require both a bam file and its associated index file as input, so we provide this tool to generate permanent index files explicitly.

The index file will be stored under the same name as the <bam input file>, but will have a .bai extension appended to the name. This naming scheme should allow external tools to auto-detect the index file from the name of the bam file alone.
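The naming scheme can be expressed in a few lines of Python, for example to check whether a permanent index already exists before opening a file in an external viewer. The helper names are hypothetical and not part of MiModD.

```python
import os

def index_path_for(bam_path):
    """Return the index file name: the bam file name with '.bai' appended."""
    return bam_path + ".bai"

def has_index(bam_path):
    """Check whether a permanent index file exists next to the bam file."""
    return os.path.isfile(index_path_for(bam_path))

print(index_path_for("aligned.bam"))  # aligned.bam.bai
```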

Note that only coordinate-sorted bam files can be indexed.

The sanitize tool¶

Ensure input files are compatible with MiModD.

Currently, this tool only works with FASTA files.

command line usage¶

python3 -m MiModD.sanitize <fasta input file> [OPTIONS]

The tool replaces characters in FASTA description lines that are illegal in MiModD and ensures uniform lengths of sequence data lines. The output generated by the tool is guaranteed to be compatible with all MiModD analysis tools.

available options:

-o <file>, --ofile <file> : [default: stdout]

redirect the output to the specified file

-r <string>, --replace-with <string>

replace all illegal characters in FASTA description lines with the specified character or sequence of characters

If the option is not provided, the tool will replace illegal characters with their hexadecimal capitalized percent encoding (e.g., %20 will replace SPACE characters and %3D will replace =).

-b <width>, --block-width <width> : [default: 80]

wrap sequence lines to the specified number of nucleotides per line
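The two transformations described above (replacing illegal characters in description lines and wrapping sequence lines) can be sketched in Python as follows. This is an illustration, not the MiModD implementation: the set of characters treated as illegal here is an assumption made for the example, and the function names are hypothetical.

```python
# ASSUMPTION: this set is for illustration only; the characters that
# MiModD actually considers illegal are not listed in this section.
ILLEGAL = set(" =,;")

def sanitize_title(title, replace_with=None):
    """Replace illegal characters in a FASTA description line.

    Without replace_with, characters are replaced by their capitalized
    hexadecimal percent encoding, mirroring the default described above.
    """
    out = []
    for ch in title:
        if ch in ILLEGAL:
            out.append(replace_with if replace_with is not None
                       else "%{:02X}".format(ord(ch)))
        else:
            out.append(ch)
    return "".join(out)

def wrap_sequence(seq, width=80):
    """Split a sequence into lines of at most 'width' nucleotides."""
    return [seq[i:i + width] for i in range(0, len(seq), width)]

print(sanitize_title("chr I=left arm"))        # chr%20I%3Dleft%20arm
print(sanitize_title("chr I=left arm", "_"))   # chr_I_left_arm
print(wrap_sequence("ACGT" * 5, width=8))
```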