Tool Documentation

As an end user, you have two ways of working with MiModD:

you can use it as a suite of tools from the command line or, more conveniently, through its built-in Galaxy interface.

The first has the advantage of not requiring any additional software installation and avoids the memory and performance overhead of running the Galaxy server.

The latter clearly provides the more intuitive user interface, particularly for beginners.

This chapter focuses on a discussion of the command line interface of the MiModD tools, but it should be relatively easy for Galaxy users to map the Galaxy tool interface settings to their command line equivalents. To assist you with this, we try to point out peculiarities of the Galaxy tool interface whenever a given tool does not provide a clear 1:1 mapping of its parameters to those available from the command line. Watch out for Galaxy-specific hints of this sort:

Hint

Galaxy tool specifics

throughout the documentation of individual tools below.

Command line help system

If you are using MiModD through the command line interface, you can access its help system by invoking:

mimodd help [tool or topic]

Here, mimodd help on its own gives an overview of all available tools and additional help topics, while detailed help for a specific tool or topic can be requested with the long form. For example, you may run:

mimodd help header

to obtain help for the header tool.

Overview of the MiModD tools

The command line interface to the analysis tools

The following is a list of all command line analysis tools currently built into MiModD together with a brief description of the purpose of each tool:

info         retrieve information about the samples encoded in a file for
             various supported formats
header       generate a SAM format header from an NGS run description
convert      convert between different sequenced reads file formats
reheader     from a BAM file generate a new file with specified header
             sections modified based on the header of a template SAM file
sort         sort a BAM file by coordinates (or names) of the mapped
             reads
index        generate fasta, bam or snap index files for use with downstream
             or external tools
snap         align sequence reads using the SNAP aligner
snap-batch   run several snap jobs and pool the resulting alignments into
             a multi-sample SAM/BAM file
varcall      predict SNPs and indels in one or more aligned read samples
             and calculate the coverage of every base in the reference
             genome using samtools/bcftools
varextract   extract variant sites from BCF input as generated by varcall
             and report them in VCF
covstats     compute summary coverage statistics for varcall output
delcall      predict deletions in one or more samples of aligned paired-
             end reads based on coverage of the reference genome and on
             insert sizes
vcf-filter   extract lines from a vcf variant file based on sample- and
             field-specific filters
rebase       rebase the variant positions in a VCF file on a different
             reference coordinate system according to a UCSC chain file
annotate     annotate a vcf variant file with information about the
             affected genes
snpeff-genomes
             list installed SnpEff genomes
varreport    report the variants from a VCF file in a user-friendly form
map          perform a linkage analysis between a selected phenotype and
             identified variants

You can run any individual tool through:

mimodd <tool name> <tool specific arguments>

The Galaxy tool interface

If you have enabled MiModD in your local Galaxy, the same tools should be available (under slightly different names) in the MiModD section of the Galaxy Tools bar as shown in the screenshot.

_images/MiModD_Galaxy_Toolbar.png

While the MiModD Galaxy tools are really just wrappers around the tools available from the command line, they provide a graphical user interface that typically is easier to handle for the new or occasional user.

In addition to providing user-friendly access to individual tools, Galaxy also lets you combine individual tools into complex workflows without advanced knowledge of shell syntax or scripting languages. The chapter MiModD and Galaxy in this guide explains how MiModD lets you run complete analyses of WGS data - from sequenced reads to lists of variants - as a single Galaxy workflow without user intervention.

In general, MiModD offers its full functionality through Galaxy as well; the minor restrictions that apply to a few tools are documented at the individual tool level below.

There are some general differences between the set of MiModD command line tools and their Galaxy counterparts though:

  • Galaxy tool wrappers never let you overwrite MiModD configuration parameters since allowing this would make no sense on a server.

    When the command line version of a tool has options like --threads, --memory or the like that will take precedence over package-wide settings, no equivalent parameter will be offered in the Galaxy interface and the tool will always use the preconfigured settings.

  • Galaxy tools do not let you control output verbosity.

    In their command line version, many MiModD tools support a -v, --verbose switch to turn on MiModD status messages and tool run summaries. In addition, MiModD tools that use third-party tools internally provide a -q, --quiet option to switch off status output from these wrapped tools.

    Since these switches are mainly useful in interactive mode, when the actual analysis output is printed to the terminal rather than written to a file, they are not made accessible from Galaxy (which always writes output to files), and all tools that understand the --verbose and/or --quiet switches will run with maximal verbosity (i.e., with --verbose, but without --quiet) when used from Galaxy.

  • Galaxy tools that require index files cannot use user-provided ones.

    The command line versions of most such tools implement a --index-files option to pass a pre-calculated index or, like the snap tool, accept an index instead of a reference genome.

    Galaxy, however, silently generates many of these index files anyway and MiModD will try to reuse them if possible (though this behavior can be disabled using the config tool). For the same reason, the index tool does not have a Galaxy version.


The MiModD administrative tools

MiModD provides a small number of tools for managing the package. These tools are available only from the command line, not from Galaxy, and, unlike the analysis tools, are invoked via the general pattern:

python3 -m MiModD.toolname

where python3 refers to the python executable used to install MiModD.

Currently, these are the administrative tools installed with MiModD:

config        view and modify configuration settings
upgrade       check for and install upgrades of the package (requires pip)
enablegalaxy  integrate the package into a local installation of Galaxy

Depending on how MiModD and/or Galaxy were installed on your system, these tools may have to be run with superuser rights, i.e., by prepending their invocation with sudo.
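
For example, on a system where MiModD was installed with superuser rights, checking for and installing an upgrade of the package might look like this (the need for sudo depends on your installation):

$ sudo python3 -m MiModD.upgrade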


In-development tools

In addition to the administrative tools above, the

python3 -m MiModD.toolname

invocation pattern also provides access to a few in-development tools that are not yet considered mature enough for a place among the fully exposed analysis tools, but which may already be useful enough to share with you.

Currently, these in-development tools are accessible:

sanitize      ensure input files are compatible with MiModD;
              currently, only works with fasta files

The MiModD tools in detail

Conceptually, you can think of the MiModD tools as being arranged into five categories:


The Format Converter and Query Tools

The info tool

Provide information about file contents.

command line usage
mimodd info <input file> [OPTIONS]

The tool expects an input file and reports the metadata it finds encoded in it. It works with and auto-detects almost every file format used anywhere in the MiModD package, i.e., SAM, BAM, vcf, bcf and fasta, and is useful for quick checks of file contents. The tool can produce either plain-text or html output, which, as for many MiModD tools, is printed to the standard output (stdout, i.e., the console/terminal window) by default, but can be redirected to a specified file with the -o or --ofile option.

available options:

-o <file>, --ofile <file> : [default: stdout]
redirect the output to the specified file
--oformat <html|txt> : [default: txt]
generate output in the specified format

-v, --verbose
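
For illustration (the file name is hypothetical), a metadata report for a BAM file could be written in html format to a separate file with:

$ mimodd info aligned_reads.bam --oformat html -o aligned_reads_info.html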


The header tool

Construct a SAM format file header from information about read groups and comments, which can be used by other tools (reheader, convert and snap) to annotate sequence reads files with metadata.

command line usage
mimodd header [--rg-id <RG_ID> [RG_ID]...] [--rg-sm <RG_SM> [RG_SM]...]
              [--rg-cn <RG_CN> [RG_CN]...] [--rg-ds <RG_DS> [RG_DS]...]
              [--rg-dt <RG_DT> [RG_DT]...] [--rg-lb <RG_LB> [RG_LB]...]
              [--rg-pl <RG_PL> [RG_PL]...] [--rg-pi <RG_PI> [RG_PI]...]
              [--rg-pu <RG_PU> [RG_PU]...]
              [--co <COMMENT> [COMMENT]... ] [OPTIONS]

available options:

-o <file>, --ofile <file> : [default: stdout]
redirect the output to the specified file
-x, --relaxed
do not enforce a sample name to be specified for every read group

A read group in SAM/BAM format refers to a group of sequences that form a logical unit and should be treated together. Typically, a read group comprises sequence reads obtained from one sample in a particular sequencing run.

You can produce SAM headers with information about one or several read groups using the following command line arguments:

--rg-id <ID> [ID]...
Read group IDs are unique strings unambiguously identifying a read group. They may be provided by the sequencing center generating the data, but you can choose any ID(s) as long as you make sure that your naming scheme does not cause ID collisions in downstream analysis steps.
--rg-sm <sample name> [sample name]...
For each read group declared by providing an ID for it, you must specify a sample name using this option (or use the -x switch to make sample names optional). The sample name should identify the biological sample that the read group provides sequencing data for.
--rg-cn <sequencing center> [sequencing center]...
Optional name of the sequencing center that generated the data for the read group(s).
--rg-ds <description> [description]...
Optional one-line read group description(s).
--rg-dt <YYYY-MM-DD> [YYYY-MM-DD]...
Optional information in YYYY-MM-DD date format about when the data for the read group(s) was generated.
--rg-lb <lib identifier> [lib identifier]...
Optional library identifier(s). A unique identifier of the sequence library that was used for sequencing should be given here.
--rg-pl <platform name> [platform name]...
Optional name of the sequencing technology that was used to produce the data for the read group(s). Should be one of the following officially supported names: CAPILLARY, LS454, ILLUMINA, SOLID, HELICOS, IONTORRENT, PACBIO.
--rg-pi <insert size> [insert size]...
Optional value(s) for the predicted median insert size between sequence reads of the read group(s) on the sequenced DNA fragments. Only applicable for paired-end sequencing data.
--rg-pu <unit ID> [unit ID]...
Optional identifier of the physical sequencing unit that the data for the read group(s) was generated on. Depending on the sequencing platform that was used (see the --rg-pl option) this could be, e.g., a flowcell barcode or a sequencing slide.

The number of read groups declared in the resulting SAM header is determined by the number of read group IDs and matching number of sample names specified through the --rg-id and --rg-sm options. All other information for any read group is optional. If you do not want to provide a certain parameter for any read group you can simply omit the option. To provide a given parameter for some, but not all of the specified read groups, use the empty string "" as a placeholder. Pairing between parameters is determined by the input order.

Example 1

Specify two read-groups with sequencing date information

$ mimodd header --rg-id 001 002 --rg-sm sample1 sample2 --rg-dt 2014-10-23 2014-09-14

@HD VN:1.5
@RG ID:001  SM:sample1      DT:2014-10-23
@RG ID:002  SM:sample2      DT:2014-09-14
Example 2

Specify three read-groups providing dates for all three, but a description only for the second

$ mimodd header --rg-id 001 002 003 --rg-sm sample1 sample2 sample3 --rg-dt 2014-10-23 2014-09-14 2014-11-09 --rg-ds "" "WT control" ""

@HD VN:1.5
@RG ID:001  SM:sample1      DT:2014-10-23
@RG ID:002  SM:sample2      DS:WT control   DT:2014-09-14
@RG ID:003  SM:sample3      DT:2014-11-09
Example 3

Generate the three-sample header above, but with two general comment lines appended

In addition to, or instead of, read group information, the header tool also lets you specify an arbitrary number of comment lines. These are not bound to any specific read group and should, generally, be used for remarks concerning the read groups or their analysis as a whole, though you can, of course, refer to a particular read group in the comment text:

$ mimodd header --rg-id 001 002 003 --rg-sm sample1 sample2 sample3 --rg-dt 2014-10-23 2014-09-14 2014-11-09 --rg-ds "" "WT control" "" --co "data from the xx sequencing project" "provided by yy lab"

@HD VN:1.5
@RG ID:001  SM:sample1      DT:2014-10-23
@RG ID:002  SM:sample2      DS:WT control   DT:2014-09-14
@RG ID:003  SM:sample3      DT:2014-11-09
@CO data from the xx sequencing project
@CO provided by yy lab

Note the use of quotes to group elements. Without them, a new comment line would be started at every whitespace.

Example 4

Generate a header without read groups and only comment lines

$ mimodd header --co "forgot to say:" this

@HD VN:1.5
@CO forgot to say:
@CO this

Note

Within MiModD the order of comment lines will, generally, be preserved, but this is not a file-format guarantee, i.e., other software is free to change it.

While such a header may not be terribly useful on its own, it can be used with the reheader tool to add comments to an existing header.

Hint

Galaxy tool specifics

The MiModD Run Annotation tool available through Galaxy is more limited than its command line counterpart in that it only allows a single read group to be specified and described. There is also no option to provide comment fields. This is because the Galaxy Reheader tool takes over this extra functionality of the command line tool.

_images/run_anno.png

The only function of the Run Annotation tool is to generate the metadata for a single dataset of unaligned reads for use with the Convert or Read Alignment tools. Since these tools require an accompanying sample name for every read group, the tool does not support the -x, --relaxed mode available from the command line, i.e., a read group ID and a sample name are always required to run the Galaxy tool.

See also

If you need to provide new header information for an existing multi-read group SAM/BAM dataset, the Galaxy Reheader tool lets you generate this info for an arbitrary number of read groups.

In addition, this tool provides an interface to change reference sequence names listed in the SAM/BAM input and to add/replace comment fields.


The convert tool

Convert between different sequenced reads file formats.

command line usage
mimodd convert <input file> [input file]... [OPTIONS]

available options:

-o <file>, --ofile <file> : [default: stdout]
redirect the output to the specified file
--iformat <fastq|fastq_pe|gz|gz_pe|sam|bam> : [default: bam]

the format of the input file(s); the input format determines

  • whether the -h option (see below) is applicable and
  • whether multiple input files are allowed and how they are used:
    • with input in SAM or BAM format only one input file is accepted
    • with input in single-end fastq format (compressed or not; format strings fastq, gz) multiple input files may be given and the reads from all files will be written to the output in concatenated form, i.e., as if they had come from a single source, providing an easy way to merge split fastq files into a single SAM/BAM file
    • with input in paired-reads fastq format (compressed or not; format strings fastq_pe, gz_pe) multiple input files are interpreted as alternating r1 and r2 files, i.e., every other file is expected to contain the read mates of the preceding file
--oformat <sam|bam|fastq|gz> : [default: sam]

the output format; the output format determines how the --ofile option is interpreted:

  • when writing output in SAM or BAM format, the name provided with the --ofile option is used as the name of the (single) output file. Without the option the output goes to the standard output stream.

    Note that the -r option changes this behaviour.

  • when writing output in fastq or gzip-compressed fastq, use of --ofile is mandatory and the name provided with the option is used as the basename of possibly multiple output files. One output file will be generated per read group for single-end data and two output files per read group for paired-end data (the required splitting is auto-detected).

-h <header file>, --header <header file>
use the header of the provided SAM file as the output header; applicable only when --iformat specifies a headerless file format, i.e., one of fastq, fastq_pe, gz or gz_pe
-r, --split-on-rgs
if the input file has reads from different read groups, write them to separate output files. With this option, --ofile is required and gets interpreted as the basename of possibly multiple output files. If --oformat is set to bam or sam, one output file per input read group will be produced. -r is implied when --oformat is set to fastq or gz and leads to the splitting described under --oformat.
-t <threads>, --threads <threads> : [default: config settings]
the maximum number of threads that the conversion is allowed to use; this option is ignored if the specified conversion type does not support parallel processing

Additional specifications:

Standard input can be used instead of an input file by specifying a minus sign - instead of a file name.

The following table summarizes the conversions that are currently supported:

from          to            input file #   output file # §   optional header
fastq(.gz)    sam/bam       >= 1           1                 yes
sam/bam       bam/sam       1              1                 no
sam/bam       fastq(.gz)    1              >= 1              no

§ with --split-on-rgs the number of output files will equal the number of
  read groups in the input file.

Note

We anticipate future versions of MiModD to implement subsampling and filtering during conversions. This is the reason why, in addition to the format combinations above, the currently meaningless (unless you are looking for a slow way to copy files) “conversions” sam <-> sam and bam <-> bam are also accepted.

Examples:

Example 1

View the contents of a BAM file by converting it to SAM format

$ mimodd convert myreads.bam

Conversion from BAM to SAM is the default behavior.

Example 2

Convert a SAM file to BAM format

$ mimodd header --rg-id 007 --rg-sm "secret sample" --co "for your eyes only" | \
  mimodd convert - --iformat sam --oformat bam

�BCdsr�e�c``p�p�
                ��2�3�r��t�200�
                               ��*NM.J-Q(N�-�I�rp��L�/R��/-RH�L-V��˩���`��J�BCw

While the output is not too useful, this example demonstrates how the tool can be used in a pipe. Now just to show that the above worked as expected:

$ mimodd header --rg-id 007 --rg-sm "secret sample" --co "for your eyes only" | \
  mimodd convert - --iformat sam --oformat bam | mimodd convert -

@HD VN:1.5
@RG ID:007  SM:secret sample
@CO for your eyes only

Hint

Galaxy tool specifics

The MiModD Convert tool available through Galaxy offers the same functionality as the command line tool described above.

However, it handles gzip compression of fastq datasets transparently - just make sure your gzipped fastq datasets have their datatype declared as fastqsanger.gz (see also the Understanding Galaxy datatypes section in this manual).

In addition, it supports the conversion to and from Galaxy dataset collections where appropriate.

_images/convert.png

Support for Galaxy dataset collections by the MiModD Convert tool.


The reheader tool

command line usage
mimodd reheader [template SAM file] <BAM input file>
                [--rg ignore|update|replace [RG_TEMPLATE] [RG_MAPPING]...]
                [--sq ignore|update|replace [SQ_TEMPLATE] [SQ_MAPPING]...]
                [--co ignore|update|replace [CO_TEMPLATE]]
                [--rgm RG_MAPPING [RG_MAPPING]...]
                [--sqm SQ_MAPPING [SQ_MAPPING]...] [OPTIONS]

available options:

-o <file>, --ofile <file> : [default: stdout]
redirect the output to the specified file
-H
output only the resulting header in SAM format

This complex but powerful tool rewrites an existing <BAM input file>, modifying its header on the fly based on the header information of one or several template SAM files, according to at least one of these optional parameters:

template SAM file
if specified, the header of this file provides the read group information, the sequence dictionary and the comment lines that will be used to modify the input file header.
--rg ignore|update|replace [RG template SAM file] [RG mapping]...

the --rg option followed by one of the keywords ignore, update or replace specifies how the read group section of the new header should be compiled. With ignore, any template read group information is ignored and the read group section of the BAM input file is copied to the new file unchanged. With update, the input file header is modified by template read group information according to these rules: any read groups found only in the BAM input or only in the template information are copied to the new header; for read groups with matching IDs found in both the BAM input and the template, any information about this read group found in the template will be written to the new header (possibly overwriting input file information), while information found only in the BAM input will be copied unchanged. With replace, the rules are: all template read groups are written to the new header as they are (possibly overwriting read groups with matching IDs in the BAM input and without information merging), and read groups found only in the BAM input file are discarded.

Both update and replace may optionally be followed by

  • a read group-specific template file, which will be used as the source of template read group information (i.e., if the general template SAM file is also given, its read group section is ignored)

and/or

  • a read group mapping to be used to determine read group ID matching between the BAM input file and the template. This mapping has to be provided in the exact form input_id : template_id (with spaces around the colon). Several mappings must also be separated by a space. When a mapping is given and a template read group with the ID template_id is found, it is treated as if its ID was input_id.
--sq ignore|update|replace [SQ template SAM file] [SQ mapping]...
the --sq option determines the use of template sequence information in just the same way as --rg does for read groups. The only differences are that sequences will be matched by their names (SN header values) and that the length (LN value) and the MD5 checksum (M5 value) of a sequence will never be modified when rewriting sequence information from the BAM input to the new header.
--co ignore|update|replace [template SAM file for comments]
determines how template comments are used. They can be ignored, appended (with update) to any comments found in the BAM input file, or used to replace any comments from the BAM input. As with the --rg and --sq options, an optional specific template SAM file can also be specified.
--rgm RG mapping [RG mapping]...
this option lets you specify one or more read group mappings (in the format shown above for the --rg option) that are used to rename read groups. Importantly, renaming always happens after all other modifications, so the input_id(s) given in the mapping(s) must still exist at that point.
--sqm SQ mapping [SQ mapping]...
equivalent to --rgm for renaming sequence names

Additional specifications:

  • A minus sign - can be used in place of any of the different template file names that the tool supports, indicating that the respective header template(s) should be read from standard input. This feature makes it possible to use the tool efficiently in pipes, e.g., with input coming from the header tool.
  • If a general template SAM file is specified, the tool’s defaults for the --rg, --sq and --co options will be set to replace; otherwise, the defaults will be ignore.
  • The tool autodetects header changes (changes in read group IDs or sequence names, elimination of read groups or sequence records) that may be incompatible with records in the body of the BAM file. In these situations, it will scan the file body for incompatible records and try to fix them or, failing that, will terminate with an error message. In this way, the tool guarantees that its reheadered output will always be a valid BAM file. On the other hand, the scan causes very significant runtime increases, leading to the following recommendations:
    • reserve the use of the --rgm and --sqm options for vital cases
    • use replace mode for read groups and sequence names (--rg and --sq options) with caution and take care not to remove read group or sequence records from the final header that get referred to in the body of the BAM file.

Examples:

The following examples assume that you have a BAM input file named “example_input” with the following header (printed in SAM format):

@HD VN:1.0  SO:coordinate
@RG ID:001  SM:sample 1     DT:2013-04-30
@RG ID:002  SM:sample 2     DT:2013-04-28
@SQ SN:chrI LN:15072423
@SQ SN:chrII        LN:15279345
@SQ SN:chrIII       LN:13783700

This file contains reads from two samples mapped against a common reference genome, i.e. the sequences chrI, chrII, chrIII and the read group IDs 001 and 002 will be referenced in the body of the file.

Note

The reheader tool is quite complex and may take some practice to use efficiently. The -H option instructs the tool to generate only the new header in human readable SAM format and we encourage beginners to use this option to check the outcome of their parameter settings before producing the full reheadered BAM file.

Example 1

Adding descriptions to existing read groups

$ mimodd header -x --rg-id 001 002 --rg-ds "labbook #1, exp #216" "labbook #1, exp #217" | \
  mimodd reheader -H example_input --rg update -

@HD VN:1.0  SO:coordinate
@RG ID:001  SM:sample 1     DS:labbook #1, exp #216 DT:2013-04-30
@RG ID:002  SM:sample 2     DS:labbook #1, exp #217 DT:2013-04-28
@SQ SN:chrI LN:15072423
@SQ SN:chrII        LN:15279345
@SQ SN:chrIII       LN:13783700

Note that, for this illustration, we used the -H option to obtain only the resulting header in SAM format.

We decided to generate the SAM header template with the descriptions to be added using the header tool. Use of the -x option saved us from having to redundantly specify the sample names again. We piped the output directly to the reheader tool to use it as the --rg template. update mode allows us to add the descriptions to the existing information of each read group. Using replace instead would have produced this unintended output:

@HD VN:1.0  SO:coordinate
@RG ID:001  DS:labbook #1, exp #216
@RG ID:002  DS:labbook #1, exp #217
@SQ SN:chrI LN:15072423
@SQ SN:chrII        LN:15279345
@SQ SN:chrIII       LN:13783700

with all the existing read group information replaced by the newly provided one.

Example 2

Adding comments

$ mimodd header --co "let me just say:" "this header needs comments" | \
  mimodd reheader -H example_input --co update -

@HD VN:1.0  SO:coordinate
@RG ID:001  SM:sample 1     DT:2013-04-30
@RG ID:002  SM:sample 2     DT:2013-04-28
@SQ SN:chrI LN:15072423
@SQ SN:chrII        LN:15279345
@SQ SN:chrIII       LN:13783700
@CO let me just say:
@CO this header needs comments

Again, we generated the new header elements using mimodd header. In this particular case, the same output would have been generated with --co replace mode since no comment lines existed in the original BAM file.

Example 3

Renaming a read group

$ mimodd reheader -H example_input --rgm 001 : 003

@HD VN:1.0  SO:coordinate
@RG ID:003  SM:sample 1     DT:2013-04-30
@RG ID:002  SM:sample 2     DT:2013-04-28
@SQ SN:chrI LN:15072423
@SQ SN:chrII        LN:15279345
@SQ SN:chrIII       LN:13783700

Note that if we had not used the -H option, but had produced a full output BAM file with:

$ mimodd reheader example_input --rgm 001 : 003 -o example_output

the tool would have scanned the body of the input BAM file and would have replaced any references to read group ID 001 with references to ID 003.

Example 4

Adding information to read groups and changing their ID in one pass

Assume you have a template SAM file named “template.sam” that has additional information about the read groups, but lists those read groups under different IDs, e.g.:

@HD VN:1.0  SO:coordinate
@RG ID:752  SM:sample 1     DS:labbook #1, exp #216
@RG ID:753  SM:sample 2     DS:labbook #1, exp #217

You could use:

$ mimodd reheader -H example_input --rg update template.sam 001 : 752 002 : 753

@HD VN:1.0  SO:coordinate
@RG ID:001  SM:sample 1     DS:labbook #1, exp #216 DT:2013-04-30
@RG ID:002  SM:sample 2     DS:labbook #1, exp #217 DT:2013-04-28
@SQ SN:chrI LN:15072423
@SQ SN:chrII        LN:15279345
@SQ SN:chrIII       LN:13783700

to assign the new information to the correct read groups. If you not only wanted to update the information about the read groups, but also wanted to change their IDs to match those in the template file, you would have to use:

$ mimodd reheader -H example_input --rg update template.sam 001 : 752 002 : 753 --rgm 001 : 752 002 : 753

@HD VN:1.0  SO:coordinate
@RG ID:752  SM:sample 1     DS:labbook #1, exp #216 DT:2013-04-30
@RG ID:753  SM:sample 2     DS:labbook #1, exp #217 DT:2013-04-28
@SQ SN:chrI LN:15072423
@SQ SN:chrII        LN:15279345
@SQ SN:chrIII       LN:13783700

Providing the mapping twice is necessary in this case because the changes of IDs according to the --rgm option are always made after any operations specified by --rg.

Hint

Galaxy tool specifics

The MiModD Reheader tool available through Galaxy offers the same functionality as described above for the command line tool, yet provides a greatly simplified user interface that should be mostly self-explanatory.

As an additional feature, it lets you specify new read group information directly through its interface, whereas from the command line you would have to produce this data with the upstream header tool as demonstrated in the examples above.

_images/reheader_rg.png

Screenshot of the MiModD Reheader tool illustrating direct specification of new read group information.


The sort tool:

Sort the reads in a SAM or a BAM file.

command line usage
mimodd sort <input file> [OPTIONS]

available options:

-o <file>, --ofile <file> : [default: stdout]
redirect the output to the specified file

--iformat <bam|sam> : [default: assume bam]

--oformat <bam|sam> : [default: bam]

-n, --by_name

-l <compression level> : [default: 6 (standard BAM compression level)]

-m <memory>, --memory <memory> : [default: config settings]

-t <threads>, --threads <threads> : [default: config settings]

By default, the tool sorts the reads in a BAM input file by their mapped reference coordinates. The -n switch can be used to sort by read names instead. The -l, -m, and -t options let you tweak the tool for optimal performance on your system. Specifically, they control the compression level to use for intermediate and final output, how much memory (in GB) to use for the sorting, and how many cores to engage in it. Since the default values for -m and -t come from MiModD’s configuration settings (see Configuring MiModD for your system), they should be adequate in most situations.
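
For illustration (hypothetical file names), coordinate-sorting a BAM file into a new BAM file could look like:

$ mimodd sort aligned.bam -o aligned_sorted.bam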


The index tool:

Generate an index of a reference genome or an aligned reads bam file for use with downstream tools like the snap, snap-batch or varcall tools or with third-party tools like IGV.

command line usage
mimodd index <index type> <file to index> [OPTIONS]
-o <path>, --output <path> : [default: alongside file to index]
specifies the location at which to save the index; the default is to save the index alongside the file to index as <file to index>.<index type> for indices of type fai and bai, or as <file to index>.snap_index for indices of type snap
-t <threads>, --threads <threads> : [default: config settings]
currently, only the building of snap indices can make use of multiple threads and the option is ignored for fai and bai index building

snap-specific indexing options: These options can only be used when index type is snap.

-s <seed size>, --seedsize <seed size>, --idx-seedsize <seed size> : [default: 20]
seed size used in building the index
-h <slack>, --slack <slack>, --idx-slack <slack> : [default: 0.3]
hash table slack for indexing
-O <factor>, --overflow <factor>, --idx-overflow FACTOR : [default: 40]
factor (between 1 and 1000) to set the size of the index build overflow space; for certain genomes you may have to increase this value if you are getting a corresponding error from the tool

As part of the package’s automated file management, the MiModD analysis tools generate indices on the fly when they are needed and remove them again at the end of the analysis. If you find yourself running the same tool repeatedly on the same input (e.g., to test the effect of certain parameter settings), you may consider generating permanent index files with this tool and having the downstream tool use these files instead of recalculating the indices every time. In addition, several useful third-party tools, like IGV, require both a bam or fasta file and its corresponding permanent index file.

Currently, this tool can generate:

  • snap indices of reference genomes,

    which can be passed to the snap tool through its index directory argument; specify snap as index type and a reference genome fasta file as the file to index to generate such an index; the output will take the form of a directory, which, by default, will be created alongside the file to index with the name <file to index>.snap_index; use the --output option to change the location and the name of the directory

  • samtools-style fasta indices of reference genomes;

    these are used by certain samtools subcommands that you may want to run directly occasionally; specify fai as index type and a reference genome fasta file as the file to index to generate this type of index; the output will be an index file, which, by default, will be created alongside the file to index with the name <file to index>.fai; this is the name and the location that samtools subcommands expect and that will allow these tools to autodetect it, but you can use the --output option to change the location and the name of the file

  • samtools-style bam indices;

    the resulting index files can be passed to the varcall and the delcall tools through their --index-files option to avoid on-the-fly index calculation by these tools; in addition, IGV and several other third-party tools require this type of index to be able to work with the corresponding bam files; specify bai as the index type and a coordinate-sorted bam file of aligned reads as the file to index to generate this type of index; by default, the index file will be created alongside the file to index with the name <file to index>.bai as this is the name and the location that will allow IGV and most other tools to autodetect it, but you can use the --output option to change the location and the name of the file; Note that only coordinate-sorted bam files can be indexed.
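
For illustration (hypothetical file names), a permanent snap index of a reference genome and a bai index of a coordinate-sorted BAM file could be generated with:

$ mimodd index snap reference.fa
$ mimodd index bai aligned_sorted.bam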

Hint

Galaxy tool specifics

This tool is not accessible from Galaxy.

All MiModD Galaxy tools will build the indices they require on the fly or reuse indices built by Galaxy itself (see the config tool documentation on how to control the use of Galaxy-built indices).


The snpeff-genomes tool:

Query the host machine’s SnpEff installation for registered and installed (i.e., usable) genome annotation files.

command line usage
mimodd snpeff-genomes [OPTIONS]

available options:

-o <file>, --ofile <file> : [default: stdout]
redirect the output to the specified file
-c <path>, --config <path>
use this path to locate the SnpEff installation directory instead of looking it up in the MiModD configuration file

Hint

Galaxy tool specifics

The Galaxy interface of the List Installed SnpEff Genomes tool does not provide access to the -c, --config option, but otherwise works just like the command line tool.


The rebase tool:

Rebase the variant positions in a VCF file on a different reference coordinate system using the mapping defined in a UCSC chain file.

command line usage
mimodd rebase <input vcf file> <chain file> [OPTIONS]

available options:

-o <file>, --ofile <file> : [default: stdout]
redirect the output to the specified file
-r, --reverse
swap the genome versions specified in the chain file, i.e., assume the coordinates in the input file are based on the chain file target genome version and should be mapped to the source genome version

Warning

The tool ports variant coordinates between different reference genome versions, but it does not re-evaluate the variant calls. In other words, if a variant allele listed in the VCF input is actually the reference allele in the mapped-to version of the reference genome, then the tool will not account for this fact and still report the coordinate-mapped site as a variant site in its output.
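
For illustration (hypothetical file names, including the chain file), variant coordinates could be ported to a different reference genome version with:

$ mimodd rebase variants.vcf oldAssemblyToNewAssembly.over.chain -o variants_rebased.vcf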


The Core Tools

The snap tool:

Align sequence reads to a reference genome.

command line usage
mimodd snap single|paired <reference genome>|<index directory>
            <input file> [<input file2>] -o <file> [OPTIONS]

For the sake of clarity, the many options of this tool can be grouped into four categories:

  • general options
  • paired-end data options that define the exact requirements for a valid read pair and will be ignored when aligning single-end reads
  • advanced alignment options that adjust parameters of the alignment algorithm and will require a bit of understanding of the alignment process to be used reasonably
  • indexing options that control the details of reference genome index building in preparation of the alignment step

While all options may occasionally be useful, beginners can safely ignore the advanced alignment and indexing options, for which the default values should work reasonably well in most situations.

general options:

--iformat <fastq|gz|sam|bam> : [default: bam]

--oformat <sam|bam> : [default: bam]

--header <header file>
a SAM file providing information about exactly one read group in its header, which will become the read group information in the aligned output file. Specification of a valid header file is required with headerless <input files>, i.e., input in fastq format or SAM/BAM input without read group information. When used with SAM/BAM input containing read group information, this original information will be overwritten.

-t <threads>, --threads <threads> : [default: config settings]

-m <memory>, --memory <memory> : [default: config settings]
maximal memory to use in GB. Please note that, currently, this setting will only be respected during sorting of the aligned reads, but NOT at the alignment step itself, which will always consume memory as required to index the reference genome.
--no-sort

by default, the tool sorts aligned reads based on their mapped coordinates on the reference genome; with this option, the original order of reads in the <input file(s)> is retained instead.

Please note that ONLY coordinate-sorted aligned read files can be processed with downstream MiModD tools like varcall and delcall.

-X

indicates that CIGAR strings in the output should use = and X to indicate matches/mismatches instead of M for both

USE OF THIS OPTION IS DISCOURAGED as =/X CIGAR strings are still not fully supported by useful third-party tools like IGV

-q, --quiet

-v, --verbose

options affecting paired-end reads alignment:

The options in this category will be ignored by the tool when executed in single mode. In paired mode, however, they can have a relatively big impact on alignment quality so adjusting them to your specific use-case is highly recommended.

-D [RF|FR|FF|RR|ALL]..., --discard-overlapping-mates [RF|FR|FF|RR|ALL]... : [default: do not discard any overlapping mates]

discard overlapping read mates of the specified orientation(s).

The values RF, FR, FF and RR correspond to the following mapped read orientations (with === indicating the reference genome):

RF:          5' ------>
      =======================
            <------ 5'


FR:      5' ------>
      =======================
                <------ 5'


FF:      5' ------>
             5' ------>
      =======================


RR:
      =======================
            <------ 5'
               <------ 5'

By default, overlapping mate pairs are not treated specially (i.e., they are kept in the output along with non-overlapping read pairs) and this is fine for most sequencing datasets. However, if you have a priori knowledge of which sorts of mate overlaps must represent anomalous alignments given the sequencing protocol used, you can tell the tool to discard mates of these types, which will, generally, improve the overall alignment quality (though how much of an improvement this brings depends on the fraction of such reads in the input).

One typical use-case is with Illumina PE sequencing where you know that the RF overlap type must be the result of misalignment or a too small insert size of the genomic DNA template fragment (resulting in sequencing into Illumina adapter sequences). Running snap with -D RF will eliminate these potentially misinforming reads from the output.

-s MIN MAX, --spacing MIN MAX : [default: 100 10000]
The minimum and maximum insert size for an aligned read pair to be considered a valid pair. Only if the mapped reads show an insert size in the given range, will they be flagged as a proper read pair in the aligned output. Reads with an insert size outside the range will not be lost, but will be treated like single-end reads during alignment. For optimal alignment results, the specified range should match the expected insert size distribution determined by the genomic DNA library preparation (though, in practice, alignments are relatively robust against wrong assumptions). In addition, however, you will not be able to discover deletions larger than MAX when using the output file with delcall.

advanced alignment options:

These options map 1:1 to the options of the wrapped SNAP aligner although in some cases MiModD uses different defaults than SNAP as a standalone tool. Most options are explained in detail in the original SNAP manual so we limit ourselves here to listing them briefly together with their MiModD default values.

-d EDIT DISTANCE, --maxdist EDIT DISTANCE : [default: 8]
maximum edit distance allowed per read or pair
-n SEEDS, --maxseeds SEEDS : [default: 25]
number of seeds to use per read
-h HITS, --maxhits HITS : [default: 250]
maximum hits to consider per seed
-c THRESHOLD, --confdiff THRESHOLD : [default: 2]
confidence threshold
-a THRESHOLD, --confadapt THRESHOLD : [default: 7]
confidence adaptation threshold
-e, --error-rep
compute error rate assuming wgsim-generated reads
-P, --no-prefetch
disables cache prefetching in the genome; may be helpful for machines with small caches or lots of cores/cache
-x, --explore
explore some hits of overly popular seeds (useful for filtering)
-f, --stop-on-first
stop on first match within edit distance limit (filtering mode)
-F FILTER, --filter-output FILTER
retain only certain read classes in output (a=aligned only, s=single hit only, u=unaligned only)
-I, --ignore
ignore non-matching IDs in the paired-end aligner
-S SELECTIVITY, --selectivity SELECTIVITY
selectivity; randomly choose 1/selectivity of the reads to score
-C ++|+x|x+|xx, --clipping ++|+x|x+|xx : [default: ++]
specify a combination of two + or x symbols to indicate whether to clip low-quality bases from the front and back of reads respectively
-G PENALTY, --gap-penalty PENALTY
specify a gap penalty to use when generating CIGAR strings
-b, --bind-threads
bind each thread to its processor

indexing options:

--idx-seedsize SEED SIZE : [default: 20]
seed size used in building the index
--idx-slack SLACK : [default: 0.3]
hash table slack for indexing
--idx-overflow FACTOR : [default: 40]
factor (between 1 and 1000) to set the size of the index build overflow space
--idx-out INDEX DIR
name of the index directory to be created; if given the index directory will be permanent (i.e., available for additional runs of the tool), otherwise a temporary directory will be used

In addition to the above options, the tool recognizes two legacy options solely for backwards compatibility with earlier versions of MiModD. These are:

-M, --mmatch-notation
indicates that CIGAR strings in the output should use M (alignment match) rather than = and X (sequence (mis-)match); this option has been superseded by the -X option
--sort, --so
sort output file by alignment location; this is now the default behaviour; use --no-sort to turn sorting off

The tool aligns NGS reads in fastq (gzipped or uncompressed), SAM or BAM format (specified with --iformat) to a reference genome using the SNAP aligner.

Besides an <input file> (or two in case of paired-end reads in fastq format) of reads, it requires specification of a sequencing mode, which can either be single or paired, and a <reference genome> to align the reads against. The tool will automatically build an index for this reference. Alternatively, a precalculated index (obtained from a previous snap call using the --idx-out option or directly from the index tool) may be specified. Whether a reference genome or an index is provided will be auto-detected.

The tool requires your input file(s) to have associated header information. If this is not the case, e.g., with fastq input files, you must provide a SAM file with that information through the --header option. You can also use this option to exchange existing header information in the input just for the alignment.

Note that, currently, the tool does NOT support printing to the standard output, so specifying an output file with the -o option is mandatory.
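
For illustration (hypothetical file names), aligning gzipped paired-end fastq reads with read group information taken from a header file generated with the header tool could look like:

$ mimodd header --rg-id 001 --rg-sm sample1 -o rg_sample1.sam
$ mimodd snap paired reference.fa sample1_R1.fastq.gz sample1_R2.fastq.gz \
  --iformat gz --header rg_sample1.sam -o sample1_aligned.bam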

Hint

Galaxy tool specifics

This tool is not directly accessible from Galaxy. The Galaxy interface of MiModD always uses the snap-batch tool instead.


The snap-batch tool:

Execute a batch of snap tool calls automatically in an optimized way and merge the results of all calls into a single aligned reads file with several read groups.

command line usage
mimodd snap-batch -s <snap command> [<snap command> ...]

or:

mimodd snap-batch -f <input file with snap command lines>

-s <snap command> [<snap command> ...]

read individual snap calls from the rest of the command line; individual commands should be enclosed in quotes to allow proper grouping.
-f <file>
read individual snap calls from <file>; the file must have one valid snap command per line.

The individual commands passed through the -s option or in the file specified with -f should be valid command lines for the mimodd snap tool, but without the leading mimodd.

Since this tool produces only one merged output file for all jobs, the options -o, --oformat and --no-sort are interpreted only once from the first snap command provided. Conflicting settings in later calls are ignored. Note, however, that -o, --ofile is a required option and, thus, needs to be provided with every command to make it syntactically correct.
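
For illustration (hypothetical file names), two paired-end samples could be aligned against the same reference genome and merged into a single multi-sample BAM file with:

$ mimodd snap-batch -s \
  "snap paired reference.fa s1_R1.fastq.gz s1_R2.fastq.gz --iformat gz --header rg_s1.sam -o merged.bam" \
  "snap paired reference.fa s2_R1.fastq.gz s2_R2.fastq.gz --iformat gz --header rg_s2.sam -o merged.bam"

Note that -o appears in both commands (as required), but only the value from the first command determines the name of the merged output file.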

Hint

Galaxy tool specifics

The Galaxy MiModD Read Alignment tool, internally, uses snap-batch even for aligning just single datasets.

Currently, the only restriction when using this tool as compared to working from the command line is that you can only specify a single reference genome, to which all sequenced reads datasets will be aligned (while, from the command line you could specify a separate reference genome with every snap command). Consequently, you can also specify only one global set of reference indexing options.

As the reference genome, users can choose an uploaded fasta dataset available in their current history, or any reference genome offered by the server (select the Use a built-in genome option for this - if it does not bring up any reference genomes to choose from, there aren’t any installed on the server).

In addition, the Galaxy tool allows you to define global settings for the alignment options that are then applied to all individual alignment jobs that do not overwrite the global default. Using these global settings can save quite some time when populating the tool interface!

_images/snap.png

The MiModD Read Alignment tool in Galaxy.


The varcall tool:

Generate variant call statistics for each base in the reference genome based on the mapped sequenced reads in one or more groups of aligned reads.

command line usage
mimodd varcall <reference genome> <input file> [input file]... [OPTIONS]

available options:

-o <file>, --ofile <file> : [default: stdout]
redirect the output to the specified file
-i, --group-by-id
call variants on per-read group basis instead of per-sample
-x, --relaxed
turn off md5 checksum comparison between sequences in the reference genome and those specified in the BAM input file header(s)
--index-files <index file> [<index file> ...]
pre-computed index files for all input files; while the tool can calculate the required bam indices on the fly, it saves time, with repeated runs on the same input files, to calculate the indices only once and store them for reuse; this can be done with the index tool with its index type argument set to bai; Note: if you are using this option, you need to provide indices for all input files, not just for some of them
-d <depth cap>, --max-depth <depth cap> : [default: 250]
average sample depth cap applied to input with extraordinarily large numbers of samples sequenced at high coverage to limit memory usage (default: 250)

-t <threads>, --threads <threads> : [default: config settings]

-q, --quiet

-v, --verbose

The tool generates BCF output with per-reference nucleotide call statistics independent of whether they provide evidence for a variant or not. This output can be used with:

  • the varextract tool to extract only bona fide variant sites in VCF format
  • the covstats tool to calculate per-chromosome and -sample read coverage
  • the delcall tool, together with the original BAM input files, to find evidence for deletions
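
For illustration (hypothetical file names), a typical multi-sample variant call on two coordinate-sorted BAM files could look like:

$ mimodd varcall reference.fa sample1.bam sample2.bam -o all_sites.bcf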

MD5 checksums for reference contigs/chromosomes

By default, the tool will check whether its BAM input file(s) provide(s) MD5 checksums for the reference genome contig/chromosome sequences used during read alignment (the snap and snap-batch tools always store these in the BAM file header). If it finds MD5 sums for all sequences, it will compare them to the checksums of the provided reference genome sequences and abort with an error message if there is a discrepancy between them. If it finds contigs/chromosomes with matching checksum, but different names in the aligned reads input file(s) and the reference genome, it will use the name from the reference genome in its output.

This behavior has two benefits:

1) It protects from accidental variant calling against a wrong reference genome (i.e., a different one than that used during the alignment step), which would result in wrong calls. This is the primary reason why we recommend leaving the check activated.

2) It provides an opportunity to change sequence names between aligned reads files and variant call files by providing a reference genome file with altered sequence names (but identical sequence data).

Since there may be rare cases where you really want to align against a reference genome with different checksums (e.g., you may have edited the reference sequence based on the alignment results), the check can be turned off using the -x, --relaxed switch, but be sure you only do this if you know exactly why.

Average sample depth cap limit

For each of a total of M BAM input files, the tool will only pile up a maximum number of reads N per position to avoid excessive memory usage with very large numbers of samples sequenced at high coverage. N is calculated as the maximum of 8000 / M and <depth cap> * S, where S is the maximum number of samples found in a single input file and <depth cap> is the average sample depth cap limit specified by the -d, --max-depth option. This option thus controls the average depth of the pile-up per sample that is guaranteed to be used even when there is a very large number of samples. As can be seen from the formula, however, it will rarely become relevant for any regular-size analysis.
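
As a purely hypothetical worked example: with M = 2 input files, a maximum of S = 3 samples in any single file and the default <depth cap> of 250, N = max(8000 / 2, 250 * 3) = max(4000, 750) = 4000 reads would be piled up per position and input file.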

Hint

Galaxy tool specifics

The MiModD Variant Calling tool groups its -d, --max-depth and -x, --relaxed equivalents in a More options section, which is collapsed by default.

Like other Galaxy tools, it does not expose the --index-files option available from the command line, but will reuse Galaxy-built indices if appropriate.


The varextract tool:

Extract variant sites from BCF input as generated by varcall and report them in VCF format.

command line usage
mimodd varextract <input bcf file> [OPTIONS]
-o <file>, --ofile <file> : [default: stdout]
redirect the output to the specified file
-p <vcf file> [<vcf file>]..., --pre-vcf <vcf file> [<vcf file>]...
incorporate precalculated variant site information from the given additional VCF files in the output
-a, --keep-alts
extract all sites from the input file for which at least one non-reference base has been observed in at least one sample; enabling this option should normally not be necessary and will increase the number of extracted sites dramatically; occasionally, though, the option might be helpful for closer inspection of candidate allelic sites that may have been missed by the variant caller
-v, --verbose
currently without effect

By default, the tool defines a variant site as a position in the genome for which variant calling provided statistical evidence for a genotype other than homozygous reference for at least one sample. This behaviour can be changed through the -a option.

If precalculated variant site information is provided through the -p option, the samples found in the additional VCF files will be added to the output and sites from the main BCF input will be included in the output if they qualify as variant sites based on the genotype call in either the BCF or the additional VCF files.

A minimal example of a VCF file with precalculated variant information is (consecutive spaces indicate TAB separation):

##fileformat=VCFv4.2
#CHROM    POS    ID    REF    ALT    QUAL    FILTER    INFO
chrI      1222   .     .      .      .       .         .
chrI      2651   .     .      .      .       .         .

...

If sample and genotype information is omitted as in this example, the sample name will be set to external_source_# (where # indicates the sample number starting at 1 for the first unnamed sample found) and a genotype of 1/1 will be assumed for each listed site. VCF format compliant sample-specific genotype information can, of course, be included in more complex situations.
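
For illustration (hypothetical file names), variant sites could be extracted from varcall output and combined with precalculated sites from an external VCF file with:

$ mimodd varextract all_sites.bcf -p external_sites.vcf -o variants.vcf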


The delcall tool:

Predict deletions in one or more groups of aligned paired-end reads.

command line usage
mimodd delcall <input bam file> [input bam file]... <coverage file> [OPTIONS]
-o <file>, --ofile <file> : [default: stdout]
redirect the output to the specified file

--max-cov <maximal coverage>

--min-size <minimal size>

-u, --include-uncovered
include uncovered regions (even if they do not qualify as deletions)

-i, --group-by-id

--index-files <index file> [<index file> ...]
pre-computed index files for all input bam files; while the tool can calculate the required bam indices on the fly, it saves time, with repeated runs on the same input bam files, to calculate the indices only once and store them for reuse; this can be done with the index tool with its index type argument set to bai; Note: if you are using this option, you need to provide indices for all input bam files, not just for some of them

-v, --verbose

The tool predicts deletions in one or more samples of aligned paired-end reads from the specified input bam files based on coverage of the reference genome (taken from the input cov file) and on observed insert sizes around the region.

By default, the tool reports only those uncovered regions for which there is statistical evidence that they are real deletions. This behaviour can be changed with the -u switch, which causes every uncovered region to be included.

The output (in gff format) can be redirected from the standard output to a file specified with the -o option.

The --max-cov and --min-size options can be used to set coverage and size thresholds for reporting uncovered regions/deletions.

Like varcall, the tool offers an -i switch to control the mode of read grouping, but its meaning is slightly different here. If enabled, reads are grouped strictly by their read groups; by default, reads are grouped by their associated sample names, but only for the coverage analysis, i.e., for detecting uncovered regions. The statistical test for real deletions, however, is always performed on a per-read-group basis.
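
As a sketch (all file names are placeholders; the coverage file is assumed to be varcall output and the thresholds are arbitrary), a run on two paired-end samples could look like:

$ mimodd delcall sample1.bam sample2.bam variant_calls.bcf --max-cov 2 --min-size 100 -o deletions.gff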

Note

This tool requires paired-end data for the statistical assessment of deletions, but single-end data with the same sample name can be included in, and will improve, the initial coverage analysis.

Hint

Galaxy tool specifics

Like other Galaxy tools, the MiModD Deletion Calling tool does not expose the --index-files option available from the command line, but will reuse Galaxy-built indices if appropriate.


The Data Exploration Tools

The covstats tool:

Provide a coverage summary report for output from the varcall tool.

command line usage
mimodd covstats <input bcf file> [-o <file>]

available options:

-o <file>, --ofile <file> : [default: stdout]
redirect the output to the specified file
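
For example (the file names are placeholders):

$ mimodd covstats variant_calls.bcf -o coverage_summary.txt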

The vcf-filter tool:

Filter multi-sample vcf files, like the ones generated by varextract, by chromosomal region, variant type (i.e., INDEL or single nucleotide change), and/or by sample-specific genotypes and coverage.

command line usage
mimodd vcf-filter <input file> [OPTIONS]
-o <file>, --ofile <file> : [default: stdout]
redirect the output to the specified file
-s <sample name> [sample name]..., --samples ...
one or more names of samples that the sample-specific filters --gt, --dp and --gq should work on
--gt <genotype pattern> [genotype pattern]...
one or more genotype patterns of the form x/x[, x/x] where x=0 indicates a reference allele and x=1 a variant allele; exactly one genotype pattern (possibly composed of several comma-separated genotypes) needs to be given per sample name specified through -s and variants will be retained only if every listed sample has a genotype contained in its corresponding genotype pattern; use ANY as the genotype pattern to exclude a sample from being used in genotype filtering
--dp <depth> [depth]...
depth of coverage threshold; exactly one value needed for every sample specified through -s; variants will only be retained if every sample declared with -s shows a read coverage of the variant site at least equal to the corresponding threshold; use 0 to skip the requirement for a given sample
--gq <genotype quality> [genotype quality]...
genotype quality threshold; exactly one value needed for every sample specified through -s; variants will only be retained if every sample declared with -s has a quality score for its predicted genotype equal to or higher than the corresponding threshold; specifying 0 as the threshold effectively skips gq filtering for a given sample
--af <allelic fraction specifier> [allelic fraction specifier]...
exactly one value is needed for every sample specified through -s; each allelic fraction specifier has the format [ALLELE#]:[MIN FRACTION]:[MAX FRACTION] and variants are only retained if the indicated allele number is found in a fraction of reads between MIN and MAX FRACTION; if omitted, MIN and MAX FRACTION default to 0 and 1, respectively; if no ALLELE# is specified, then the fraction criterion must be met by the most frequent non-reference allele
-r <chromosome>:[start-stop] [chromosome:[start-stop]]..., --region ...
retain variants only if they fall into one of the given chromosomal regions (specified in the format chrom:start-stop or chrom: for a whole chromosome)
-i | -I, --no-indels | --indels-only : [default: retain all types of variants]
mutually exclusive options; use -i or --no-indels to remove indel-type variants from the output, -I or --indels-only to retain only indels
--vfilter <sample name> [sample name]...
filter by sample name(s); instead of filtering variants, this option will retain sample-specific information about each variant only for the specified samples; if several sample names are provided their order also determines the order of sample-specific information in the output; useful for demultiplexing and restructuring (the sample-specific part of) multi-sample vcf files

To filter variants based on sample-specific information, first use the -s option to specify one or more samples by name to which the filters should be applied. Then use the --gt, --dp or --gq options to specify the filter criteria. Each of these options, when present, has to be followed by a number of filters matching the number of samples declared with -s and the filters are associated with the samples by position on the command line.

--gt sets genotype filters. Currently, 0/0 (homozygous ref allele), 0/1 (heterozygous) and 1/1 (homozygous alt allele) are allowed as <genotype patterns>. Several genotypes can be specified by separating them with commas. The keyword ANY is allowed as a placeholder (indicating that no genotype filter should be applied to the given sample), and this should be preferred over the alternative 0/0,0/1,1/1 notation for clarity, brevity and efficiency.

--dp sets coverage filters. To pass the filter a variant site has to be covered by at least as many sample-specific reads as set with --dp. Consequently, 0 can serve as a placeholder here.

--gq sets genotype quality filters. To pass the filter the quality of the genotype call for the relevant sample has to be at least as high as the filter criterion. Again, 0 has a placeholder function.

All filters discussed above are subtractive and line-based, i.e., if, and only if, a variant passes ALL filters, the full entry is retained.

In addition, the tool supports one column-based/vertical filter via the --vfilter option. If used, then for any line that passes all other filters, only the sample-specific columns of the samples specified with the option are retained.

Examples:

Example 1

From a vcf input file variants_in.vcf with a sample called sample1 (and possibly others), retain all variants for which sample1’s genotype is 0/1 (heterozygous) or 1/1 (homozygous for the variant allele):

$ mimodd vcf-filter variants_in.vcf -s sample1 --gt 0/1,1/1
Example 2

From the same file, retain all variants for which sample1’s genotype is 0/1 or 1/1 AND for which sample2’s genotype is 0/0 (homozygous for the reference allele):

$ mimodd vcf-filter variants_in.vcf -s sample1 sample2 --gt 0/1,1/1 0/0
Example 3

With the same input again, retain all variants for which the genotype of sample1 is 0/1 or 1/1 AND for which sample1 AND sample2 show at least 3-fold coverage (the genotype of sample2 is not used for filtering):

$ mimodd vcf-filter variants_in.vcf -s sample1 sample2 --gt 0/1,1/1 ANY --dp 3 3
Example 4

Retain all variants found on chromosomes chrI and chrII for which the genotype of sample1 is 1/1 AND that of sample2 is 0/0. For these variants, report only sample-specific information for sample1:

$ mimodd vcf-filter variants_in.vcf -s sample1 sample2 --gt 1/1 0/0 --region chrI: chrII: --vfilter sample1
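
Example 5

As a sketch of the --af option described above (thresholds chosen arbitrarily for illustration), retain only variants for which between 80% and 100% of sample1’s reads support variant allele 1:

$ mimodd vcf-filter variants_in.vcf -s sample1 --af 1:0.8:1.0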

Hint

Galaxy tool specifics

The MiModD VCF Filter tool offers the full functionality available from the command line, but the Galaxy user interface lets us group the settings more intuitively.

[Screenshot: the VCF Filter tool interface with one Sample-specific Filter and one Region Filter added.]


The annotate tool:

Annotate variants with their effects on genes and transcripts

command line usage
mimodd annotate <input file> <annotation_source> [OPTIONS]

available options:

-o <file>, --ofile <file> : [default: stdout]
redirect the output to the specified file

In addition, the following SnpEff-specific options are supported:

--codon-tables <table spec> [<table spec> ...]
specify custom codon tables to be used in assessing variant effects at the protein level. If a codon table should be used for only a specific chromosome, use the format CHROM:TABLE_NAME. Use TABLE_NAME alone to specify a codon table to be used for all chromosomes, for which no chromosome-specific table is provided. Valid TABLE_NAMEs are those defined in the Codon tables section of the SnpEff config file. NOTE: It is also possible to associate chromosomes with a codon table permanently by editing the SnpEff config file.
--stats <file>
generate an additional file <file> with the SnpEff results summary
--no-downstream, --no-upstream, --no-intron, --no-intergenic, --no-utr
do not annotate downstream, upstream, intron, intergenic or UTR effects, respectively
--ud <distance> : [default: 500]
specify the upstream/downstream interval length, i.e., variants more than <distance> nts from the nearest annotated gene are considered to be intergenic
-c <path>, --config <path> : [default: config settings]
override the location of the SnpEff installation directory configured at the package level with <path>
-m <memory>, --memory <memory> : [default: config settings]
maximal memory to use in GB

This tool uses SnpEff to add annotations taken from the SnpEff genome file <annotation_source> to the variants found in the VCF <input file>.

The snpeff-genomes tool can be used to get a list of all installed SnpEff genome files on your system.

By default, the tool generates a single piece of output: a new VCF file with the annotations added to the respective variant records, written to standard output or to the file specified with the --ofile option. This annotated file can be used like the original file with appropriate downstream tools, like the vcf-filter or the map tool, but is typically passed to the varreport tool to generate human-friendly reports of the variants and their effects.

An optional SnpEff summary report about the variant effects can be generated under the name specified through the --stats option.

The --quiet and --verbose switches have their usual MiModD meaning, i.e., they act to suppress the original output from SnpEff and to enable verbose output from the annotate tool itself, respectively.
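
As an illustration (file names are placeholders and <snpeff genome> stands for the name of an installed SnpEff genome as reported by the snpeff-genomes tool), a typical invocation could be:

$ mimodd annotate variants.vcf <snpeff genome> -o annotated_variants.vcf --stats snpeff_summary.html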

Hint

Galaxy tool specifics

All the command line functionality above is also available through the MiModD Variant Annotation tool in Galaxy.

Note that all SnpEff-specific options are offered under the Codon Table Configuration and More SnpEff options sections, which are collapsed by default.

For selection of the SnpEff genome, Galaxy users have two options:

  • if they know the exact name of the SnpEff genome, they can type it directly, otherwise
  • they can select a dataset produced by an earlier run of the List Installed SnpEff Genomes tool, then choose from a list of SnpEff genomes reported in this dataset.

Note

For most use cases this tool can be replaced with the IUC-maintained snpeff tool from the main Galaxy Tool Shed, which offers a somewhat different approach for installing and managing SnpEff genomes than the MiModD tool.

If you prefer to use this independently developed Galaxy tool, make sure you select its Use ‘EFF’ field compatible with older versions (instead of ‘ANN’) option under Annotation options as shown in the screenshot below, or the output produced by the tool will not be compatible with the MiModD Report Variants tool.

[Screenshot: the snpeff tool’s Annotation options section with the EFF field compatibility option selected.]

The varreport tool:

Generate human-friendly variant reports to simplify data exploration

command line usage
mimodd varreport <input file> [OPTIONS]

available options:

-o <file>, --ofile <file> : [default: stdout]
redirect the output to the specified file
-f <html|text>, --oformat <html|text> : [default: html]
output format
-s <species>, --species <species>
the species to be assumed for generating hyperlinks to organism-specific databases and genome browsers; when no species is provided, the tool will try to autodetect it, but this will only succeed if the <input file> was generated with the annotate tool; if no species can be detected, species-specific hyperlinks will not be generated; if a species is provided but the tool does not know how to generate hyperlinks for it, an error will be raised, unless you provide custom hyperlink formatting instructions for the species through the --link option; since hyperlinks are only inserted into output in html format, this option has no effect with --oformat text
-l <link formatter file>, --link <link formatter file>

file to read custom species-specific hyperlink formatting instructions from; like --species, this option is effective only with --oformat html

See also

Prepare and use a custom hyperlink template file for the expected format of the file

This tool generates reports in html or text format about the variants found in its VCF input.

Compared to the input VCF format, the emphasis of these reports is on readability for human beings and on facilitating further data exploration. To this end, the tool can insert - with html as the output format - hyperlinks to species-specific databases and genome browsers into the report, making the document a convenient starting point for exploring the variants and their effects further. With text as the selected output format, the tool produces tabular output (tsv format) that can be imported into and further processed with spreadsheet software like Excel.
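
For example, to produce an html report (file names are placeholders; the input is assumed to come from the annotate tool so that the species can be autodetected):

$ mimodd varreport annotated_variants.vcf -f html -o variant_report.html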

Hint

Galaxy tool specifics

When you use the MiModD Report Variants tool to produce output in html format, your Galaxy may display the result in crippled form (without style information applied to it). This is a security measure to prevent tools from injecting malicious instructions into their html output.

[Screenshot: the Report Variants html output as displayed by Galaxy without style information]

Follow the displayed instructions or whitelist the Report Variants tool to bypass this issue.

If you trust us that we would never attempt such a thing, you can view the output in all its glory after adding the MiModD Report Variants tool to the whitelist of tools for which Galaxy shows unfiltered html output. Doing so requires administrator rights on the Galaxy server, so, as an alternative, you can also follow the steps suggested on the results page itself to bypass the issue on a per-dataset basis.


The map tool:

Analyze and visualize variant patterns to map selected mutations.

Offers three analysis modes. Use:

  • Simple Variant Density (SVD) mapping mode to visualize the distribution along the reference genome of all variants from an input file,
  • Variant Allele Frequency (VAF) mapping mode to analyze the allele inheritance pattern in a hybrid population at biallelic sites (i.e., positions at which two different alleles are present in the two parent genomes), or
  • Variant Allele Contrast (VAC) mapping mode to explore the divergence between two samples at all variant sites in the input file. This mode can be used to map causative variants using two bulked samples selected for contrary phenotypes (or a selected and a non-selected sample).

Note

This user guide has a separate section offering a conceptual explanation of Mapping-by-sequencing with MiModD.

In the following we focus on a technical description of the tool interface and assume you know what you are trying to achieve with it.

command line usage
mimodd map <input file> SVD|VAF|VAC
           [general options]
           [analysis options]
           [file format options]
           [plotting options]
           [VAF mode options|VAC mode options]

available options:

general options:

-o <file>, --ofile <file> : [default: stdout]
redirect the output to the specified file
-q, --quiet
suppress warning messages

analysis options:

-b SIZE [SIZE ...], --bin-sizes SIZE [SIZE ...]

file format options:

-s <file>, --seqinfo <file>

overwrite contig information (names and sizes) in the <input file> with that declared in <file>.

The format of <file> must be FASTA or correspond to that defined for a CloudMap Other Species Configuration File (referred to as a CloudMap-style sequence dictionary throughout this manual).

This option is intended as a way to supply contig information when it is missing from the input file. As such, it is never necessary to use it with MiModD-generated input, but only with input generated by third-party tools.

--cloudmap

generate valid input for CloudMap Mapping tools and save it to the output file specified through the -t option, which is required with this option.

The type of resulting vcf-like output is dependent on the analysis mode:

In SVD mode, it is compatible with the CloudMap EMS Variant Density Mapping tool while, in VAF mode, it can serve as input to the CloudMap Hawaiian Variant Mapping tool. The option cannot be used with VAC mode since CloudMap does not have an equivalent mapping mode.

-t <file>, --text-file <file>
generate text-based output for every variant position and save it to the specified file. When used on its own the file can be used as input for later reruns of the tool to speed up plotting. When used in combination with the --cloudmap option, the file will be written in a CloudMap-compatible format. When generated from a tool run in SVD mode, it will be valid input for the EMS Variant Density Mapping tool, while, when produced in VAF mode, it can serve as input to the Hawaiian Variant Mapping tool of CloudMap.

plotting options:

-p <file>, --plot-file <file> : [default: no graphical output]
generate graphical output and save it to <file>
--fit-width
do not preserve relative contig sizes in plots, but scale each contig to fit the plot width instead

--no-scatter
do not produce scatter plots of observed segregation rates; just plot histograms
-l <float>, --loess-span <float> : [default: 0.1]

span parameter for the Loess regression line through the linkage data

Smaller values make the line more sensitive to local fluctuations in the data. Values that are too small may cause individual regression line calculations to fail, in which case a warning will be issued (unless the -q option is in effect) and no regression line will be plotted for the affected contig.

Specify a value of 0 to skip calculation for all contigs.

--ylim-scatter <float> : [default: 1.0]
upper limit for scatter plot y-axis

-c <color>, --points-color <color> : [default: black]
color to use for the data points in scatter plots

-k <color>, --loess-color <color> : [default: red]
color to use for the Loess regression line in scatter plots

--no-hist
do not produce linkage histogram plots; only generate scatter plots
--no-kde
do not add kernel density estimate lines to histogram plots
--ylim-hist <float> : [default: autoscale]
upper limit for histogram plot y-axis

--hist-colors <color> [<color> ...]

VAF mode options:

-m <sample name>, --mapping-sample <sample name>
specify the name of the (hybrid population) sample (as appearing in the input vcf file) for which the analysis of allelic inheritance patterns should be performed.
-r <sample name>, --related-parent <sample name>
specify the name of a parent sample (as appearing in the input vcf file) that introduced background mutations linked to the causative variant into the mapping sample. Examples of such samples are the originally isolated mutant line, the pre-mutagenesis strain or an ancestor strain, i.e., any strain that can be assumed to have contributed to the variant pool found in the original mutant line. The tool will treat variants found in this strain and in the mapping sample as evidence for linkage between a site and the causative mutation.
-u <sample name>, --unrelated-parent <sample name>
specify the name of a parent sample (as appearing in the input vcf file) that contributed unlinked marker variants to the mapping sample. Examples of such samples are the mapping strain used in the cross from which the mapping sample was obtained or an ancestor strain, i.e., any strain that may have contributed to the variant pool found in the mapping sample, but not to that of the original mutant line. The tool will treat variants found in this sample and in the mapping sample as evidence against linkage between a site and the causative mutation.
-i, --infer-missing

if variant data for either the related or the unrelated parent is not provided, the tool can try to infer the alleles present in that parent from the allele spectrum found in the mapping sample.

Use with caution on carefully filtered variant lists only!

VAF mode requires the specification of at least one parent sample name through the -r, --related-parent and/or -u, --unrelated-parent options. If both a related and an unrelated parent sample are available, both may be specified, and this will generally result in better mapping resolution.

VAC mode options:

-m <sample name>, --mapping-sample <sample name>
specify the name of the sample (as appearing in the input vcf file) that was selected for the causative variant.
-c <sample name>, --contrast-sample <sample name>
specify the name of the sample (as appearing in the input vcf file) that is to be contrasted with the mapping sample.
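
To illustrate, a VAF mode analysis with graphical and per-variant text output might be invoked like this (all sample and file names are placeholders):

$ mimodd map variants.vcf VAF -m mutant_F2_pool -r mutant_parent -p linkage_plots.pdf -t per_variant_data.txt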

Note

  • All sample names must be provided exactly as they appear in the <vcf input file>. If unsure, you can use the info tool with:

    mimodd info <vcf input file>
    

    to query sample names (and other information) encoded in the input file.

  • Parent samples cannot be specified in SVD mode. Instead, crossing strain variants should be “subtracted” beforehand from the vcf input using the vcf-filter tool.

Hint

Galaxy tool specifics

The MiModD NacreousMap tool offers essentially the same functionality as is available from the command line, though the Galaxy user interface lets us present the many different settings in a more structured way.

The only real, though minor, difference from the command line tool is that, should you ever have to provide contig name and size information (corresponding to the command line -s, --seqinfo option), the NacreousMap tool lets you do so in the form of a dataset from your current history or of a reference genome stored on the server.


The Administrative Tools

The config tool

View and modify package configuration.

command line usage
python3 -m MiModD.config [OPTIONS]

When run without options, the tool displays the install location and the current configuration settings of the package.

Use any of the available options to change the corresponding package settings:

--tmpfiles <path>, --tmpfiles-path <path>

set the directory that MiModD uses to store temporary data in

If <path> is omitted, tools will generate the data in the directory they are executed from.

--snpeff <path>, --snpeff-path <path>

set the directory into which you have installed SnpEff to enable the SnpEff-dependent functionality of the annotate tool

If <path> is omitted, SnpEff-dependent functionality will be disabled.

-t <multithreading level>, --threads ..., --multithreading-level ...
set the maximal number of threads that any MiModD tool is allowed to use by default (respected approximately)
-m <max memory>, --memory ..., --max-memory ...
set the maximal amount of memory (in GB) that any MiModD tool is allowed to use by default (most tools will use much less, snap may use more)
--use-galaxy-index-files, --no-use-galaxy-index-files

control whether MiModD should try to use index files generated by Galaxy

These are Galaxy-specific options with no effect if you are using MiModD from the command line.
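
For example, to limit MiModD tools to four threads and 8 GB of memory by default (the values are illustrative only):

python3 -m MiModD.config --threads 4 --memory 8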


The upgrade tool

Check for and install upgrades of the package (requires pip).

command line usage
python3 -m MiModD.upgrade [install|hg-install] [OPTIONS]

When run without options, the tool checks online for the latest version of MiModD. When used with install, it upgrades the package to the latest available version. hg-install can be used to upgrade to the latest in-development version instead (this requires Mercurial to be installed and is not recommended for normal users as development versions may not be fully functional).

Your MiModD configuration settings will be preserved during the upgrade.

available options:

-v <version>, --version <version>
upgrade to a specific version of the package instead of the latest version. <version> must be a valid MiModD version number when used with install or an existing changeset identifier when used with hg-install.

Examples:

python3 -m MiModD.upgrade install

The standard command to keep MiModD updated!

Tries to upgrade your installation to the latest available version. The upgrade will abort if no newer version than the currently installed version can be found.

python3 -m MiModD.upgrade install --version 0.1.7

Upgrade to version 0.1.7 of MiModD independent of the currently installed version. If your current version is newer, it will be downgraded. If your current version is 0.1.7, it will get re-installed.


The enablegalaxy tool

Enable the package for use from a local installation of Galaxy.

command line usage
python3 -m MiModD.enablegalaxy <Galaxy path> [OPTIONS]

The tool will integrate the package into a local installation of Galaxy specified by <Galaxy path>. <Galaxy path> should point to the root directory of the local Galaxy.

Normally, this should be all that’s needed to run the tool successfully and any of the options below should only be required in exceptional circumstances.

Note for Galaxy Admins

Starting with v0.1.8 of MiModD, the tool will also try to expose the copies of samtools and bcftools bundled with MiModD as Galaxy tool dependencies. The tool will take care not to touch any preexisting dependencies, so if your tool dependency folder has a samtools subfolder, the samtools/bcftools integration step will be skipped automatically. If, however, you have so far satisfied non-versioned samtools/bcftools requirements through conda, the tool will create the new Galaxy tool dependency, which could then override the conda-installed versions. To avoid this potential issue, you can provide the --without-samtools option, which will force the samtools/bcftools integration step to be skipped.

If you have no clue what all this means, just do not worry: you’re most likely not affected and can safely ignore this note.

available options:

--without-samtools
see the note above for when you would want to use this option. Generally, if you are enabling MiModD for a freshly set up Galaxy you would never want to specify this option.
-c <config file>, --config-file <config file>
skip autodetection of the Galaxy configuration file and use the indicated file instead. If <config file> is specified as a relative path, that path will be interpreted relative to <Galaxy path>. Providing this option should only be necessary if the tool reports that it cannot autodetect the Galaxy config file.
-d <tool dependency dir>, --dependency-dir <tool dependency dir>

do not try to locate the Galaxy tool dependency folder automatically, but use the indicated folder directly. If <tool dependency dir> is specified as a relative path, that path will be interpreted relative to <Galaxy path>.

Like for the --config-file option above, using this option should only be necessary if the tool reports that it cannot autodetect the tool dependency folder used by your Galaxy instance.

-t <line token>, --token <line token> : [default: tool_config_file]
override the name of the key in the configuration file that the path to the MiModD tool wrappers for Galaxy should be added to. Should not normally be required.
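
In the typical case, a single call with the path to your local Galaxy root directory (a placeholder here) is all that is needed:

python3 -m MiModD.enablegalaxy /home/user/galaxy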

In-development Tools

The sanitize tool

Ensure input files are compatible with MiModD.

Currently, it only works with fasta files.

command line usage
python3 -m MiModD.sanitize <fasta input file> [OPTIONS]

The tool replaces characters in FASTA description lines that are illegal in MiModD and ensures uniform lengths of sequence data lines. The output generated by the tool is guaranteed to be compatible with all MiModD analysis tools.

available options:

-o <file>, --ofile <file> : [default: stdout]
redirect the output to the specified file
-r <string>, --replace-with <string>

replace all illegal characters in FASTA description lines with the specified character or sequence of characters

If the option is not provided, the tool will replace illegal characters with their hexadecimal capitalized UTF-8 based percent encoding (e.g., %20 will replace SPACE characters and %3D will replace =).

-t <string>, --truncate-at <string>
truncate each FASTA description line at the first occurrence of the specified character or sequence of characters
-b <width>, --block-width <width> : [default: 80]
wrap sequence lines to the specified number of nucleotides per line
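
As a sketch (file names are placeholders), the following would replace illegal characters in description lines with underscores and rewrap sequence lines to 60 nucleotides per line:

python3 -m MiModD.sanitize genome.fa -r _ -o genome_sanitized.fa -b 60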