MiModD User Guide: Tool Documentation

As an end-user you have two possibilities to work with MiModD:

you can use it as a suite of tools from the command line or, more conveniently, through its built-in Galaxy interface.

The first has the advantage of not requiring any additional software installation and avoids the memory and performance overhead of running the Galaxy server.

The later clearly provides the more intuitive user-interface, particularly for beginners.

In this chapter we will explain both the command line usage and the corresponding Galaxy wrapper for each tool in the MiModD package. For most tools, the relationship between their command line parameters and their configuration in Galaxy should be obvious, but we will point out differences between the two interfaces when they are important.

Overview of the MiModD tools

The following is a list of all command line tools currently built into MiModD together with a brief description of the purpose of each tool:

info          retrieve information about the samples encoded in a file for
              various supported formats
header        generate a SAM format header from an NGS run description
convert       convert between different sequenced reads file formats
reheader      from a BAM file generate a new file with specified header
              sections modified based on the header of a template SAM file
sort          sort a BAM file by coordinates (or names) of the mapped
              reads
snap          align sequence reads using the SNAP aligner
snap-batch    run several snap jobs and pool the resulting alignments into
              a multi-sample SAM/BAM file
snap-index    index a reference genome for use with the SNAP aligner
varcall       predict SNPs and indels in one or more aligned read samples
              and calculate the coverage of every base in the reference
              genome using samtools/bcftools
varextract    extract variant sites from BCF input as generated by varcall
              and report them in VCF
covstats      summary coverage statistics for varcall output
delcall       predict deletions in one or more samples of aligned paired-
              end reads based on coverage of the reference genome and on
              insert sizes
vcf-filter    extract lines from a vcf variant file based on sample- and
              field-specific filters
annotate      annotate a vcf variant file with information about the
              affected genes
snpeff-genomes
              list installed SnpEff genomes
cloudmap      generate CloudMap-compatible output from a vcf file

as you can obtain it, from the command line, by running:

mimodd --help

You can run any individual tool through:

mimodd <tool name> <tool specific arguments>

Additional help on each tool and the arguments it accepts can be obtained by:

mimodd <tool name> --help

, e.g.:

$ mimodd header --help

usage:  header [-h] [--rg-id RG_ID [RG_ID ...]] [--rg-sm RG_SM [RG_SM ...]]
               [--rg-cn RG_CN [RG_CN ...]] [--rg-ds RG_DS [RG_DS ...]]
               [--rg-dt RG_DT [RG_DT ...]] [--rg-lb RG_LB [RG_LB ...]]
               [--rg-pl RG_PL [RG_PL ...]] [--rg-pi RG_PI [RG_PI ...]]
               [--rg-pu RG_PU [RG_PU ...]] [--co [COMMENT [COMMENT ...]]]
               [-o OFILE] [-x]

optional arguments:
  -h, --help            show this help message and exit
  -o OFILE, --ofile OFILE
                        redirect the output to the specified file (default:
                        stdout)
  -x, --relaxed         do not enforce a sample name to be specified for every
                        read group

read group description:
  --rg-id RG_ID [RG_ID ...]
                        one or more unique read group identifiers
  --rg-sm RG_SM [RG_SM ...]
                        one sample name per read group identifier
  --rg-cn RG_CN [RG_CN ...]
                        one sequencing center name per read group
  --rg-ds RG_DS [RG_DS ...]
                        one description line per read group
  --rg-dt RG_DT [RG_DT ...]
                        date runs were produced (YYYY-MM-DD); one date per
                        read group
  --rg-lb RG_LB [RG_LB ...]
                        library identifier for each read group
  --rg-pl RG_PL [RG_PL ...]
                        sequencing platform/technology used to produce each
                        read group
  --rg-pi RG_PI [RG_PI ...]
                        predicted median insert size for the reads of each
                        read group
  --rg-pu RG_PU [RG_PU ...]
                        platform unit, e.g., flowcell barcode or slide
                        identifier, for each read group

other information:
  --co [COMMENT [COMMENT ...]]
                        an arbitrary number of one-line comment strings

If you have enabled MiModD in your local Galaxy installation, the same tools should be available (under slightly different names) in the MiModD section of the Galaxy Tools bar. This is a screenshot of what the section should look like:

_images/MiModD_Galaxy_Toolbar.png

While the MiModD Galaxy tools are really just wrappers around the tools available from the command line, they provide a graphical interface that typically is easier to handle for the new or occasional user. As an example, here is the Galaxy interface for the mimodd header command (called NGS Run Annotation in Galaxy):

_images/NGS_Run_Annotation.png

In addition, to providing user-friendly access to individual tools, Galaxy also lets you combine individual tools into complex workflows without advanced knowledge of shell syntax or scripting languages. The chapter MiModD and Galaxy in this guide explains how MiModD lets you run complete analyses of NGS data - from sequenced reads to lists of variants - as a single Galaxy workflow without user intervention.


The MiModD tools in detail

The MiModD tools can be grouped in three categories:

  • the format converter and query tools (info, header, convert, reheader, sort, covstats, snpeff-genomes, cloudmap) help you annotate and arrange your sequencing data and let you convert between different formats for storage and use by other MiModD and external tools.
  • the core tools (snap, snap-batch, snap-index, varcall, varextract and delcall) do the computation-intensive analysis of the data, aligning sequence reads to reference genomes and detecting sequence variants (currently SNPs, indels and deletions) with respect to the reference.
  • the data exploration tools (annotate and vcf-filter) let you annotate and filter the results of these analyses and, thus, help you to extract biologically relevant information.

The Format Converter and Query Tools

The info tool

Provide information about file contents.

command line usage
mimodd info <input file> [OPTIONS]

The tool expects an input file and reports the metadata it finds encoded in it. It works with and auto-detects almost every file format used anywhere in the MiModD package, i.e., SAM, BAM, vcf, bcf and fasta, and is useful for quick checks of file contents. The tool can produce either plain-text or html output, which, like for many MiModD tools, is printed to the standard output (stdout, i.e., the console/terminal window) by default, but can be redirected with the -o or --ofile option to the specified file.

available options:

-o <file>, --ofile <file> : [default: stdout]
redirect the output to the specified file

--oformat <html|txt> : [default: txt]

-v, --verbose

Galaxy interface

Galaxy tool name : Retrieve File Information

_images/retrieve_file_info.png

Choose the input file from a drop-down menu of available files from your user history, then click Execute to run the tool. In Galaxy, a tool’s output goes to a file. Galaxy stores these files in an internal database and presents them to you in the history pane. Upon running the tool, its results file will automatically be added to your history. Click on the eye symbol to see the file contents in the main pane.


The header tool

Construct a SAM format file header from information about read groups and comments, which can be used by other tools (reheader, convert and snap) to annotate sequence reads files with metadata.

command line usage
mimodd header [--rg-id <RG_ID> [RG_ID]...] [--rg-sm <RG_SM> [RG_SM]...]
              [--rg-cn <RG_CN> [RG_CN]...] [--rg-ds <RG_DS> [RG_DS]...]
              [--rg-dt <RG_DT> [RG_DT]...] [--rg-lb <RG_LB> [RG_LB]...]
              [--rg_pl <RG_PL> [RG_PL]...] [--rg_pi <RG_PI> [RG_PI]...]
              [--rg_pu <RG_PU> [RG_PU]...]
              [--co <COMMENT> [COMMENT]... ] [OPTIONS]

available options:

-o <file>, --ofile <file> : [default: stdout]
redirect the output to the specified file
-x, --relaxed
do not enforce a sample name to be specified for every read group

A read group in SAM/BAM format refers to a group of sequences that form a logical unit and should be treated together. Typically, a read group comprises sequence reads obtained from one sample in a particular sequencing run. You can add information about one or several read groups to the mimodd header output through the following command line arguments:

--rg-id <ID> [ID]...
Read group IDs are unique strings unambiguously identifying a read group. They may be provided by the sequencing center generating the data, but you can choose any ID(s) as long as you make sure that your naming scheme does not cause ID collisions in downstream analysis steps.
--rg-sm <sample name> [sample name]...
For each read group declared by providing an ID for it, you must specify a sample name using this option (use the -x switch to make sample names optional). The sample name should identify the biological sample that the read group provides sequencing data for.
--rg-cn <sequencing center> [sequencing center]...
Optional name of the sequencing center that generated the data for the read group(s).
--rg-ds <description> [description]...
Optional one-line read group description(s).
--rg-dt <YYYY-MM-DD> [YYYY-MM-DD]...
Optional information in YYYY-MM-DD date format about when the data for the read group(s) was generated.
--rg-lb <lib identifier> [lib identifier]...
Optional library identifier(s). A unique identifier of the sequence library that was used for sequencing should be given here.
--rg-pl <platform name> [platform name]...
Optional name of the sequencing technology that was used to produce the data for the read group(s). Should be one of the following officially supported names: CAPILLARY, LS454, ILLUMINA, SOLID, HELICOS, IONTORRENT, PACBIO.
--rg-pi <insert size> [insert size]...
Optional value(s) for the predicted median insert size between sequence reads of the read group(s) on the sequenced DNA fragments. Only applicable for paired-end sequencing data.
--rg-pu <unit ID> [unit ID]...
Optional identifier of the physical sequencing unit that the data for the read group(s) was generated on. Depending on the sequencing platform that was used (see the –rg_pl option) this could be, e.g., a flowcell-barcode or a sequencing slide.

The number of read groups declared in the resulting SAM header is determined by the number of read group IDs and matching number of sample names specified through the --rg-id and --rg-sm options. All other information for any read group is optional. If you do not want to provide a certain parameter for any read group you can simply omit the option. To provide a given parameter for some, but not all of the specified read groups, use the empty string "" as a placeholder. Pairing between parameters is determined by the input order.

Example 1

Specify two read-groups with optional sequencing date information

$ mimodd header --rg-id 001 002 --rg-sm sample1 sample2 --rg-dt 2014-10-23 2014-09-14

@HD VN:1.5
@RG ID:001  SM:sample1      DT:2014-10-23
@RG ID:002  SM:sample2      DT:2014-09-14
Example 2

Specify three read-groups with dates, but a description only available for the second

$ mimodd header --rg-id 001 002 003 --rg-sm sample1 sample2 sample3 --rg-dt 2014-10-23 2014-09-14 2014-11-09 --rg-ds "" "WT control" ""

@HD VN:1.5
@RG ID:001  SM:sample1      DT:2014-10-23
@RG ID:002  SM:sample2      DS:WT control   DT:2014-09-14
@RG ID:003  SM:sample3      DT:2014-11-09

In addition to or instead of read group information the header tool also lets you specify an arbitrary number of comment lines. These are not bound to any specific read group and should, generally, be used for remarks concerning the read groups or their analysis as a whole though you can, of course, refer to a particular read group in the comment text.

Example 3

Generate the three-sample header above, but with two general comment lines appended

$ mimodd header --rg-id 001 002 003 --rg-sm sample1 sample2 sample3 --rg-dt 2014-10-23 2014-09-14 2014-11-09 --rg-ds "" "WT control" "" --co "data from the xx sequencing project" "provided by yy lab"

@HD VN:1.5
@RG ID:001  SM:sample1      DT:2014-10-23
@RG ID:002  SM:sample2      DS:WT control   DT:2014-09-14
@RG ID:003  SM:sample3      DT:2014-11-09
@CO data from the xx sequencing project
@CO provided by yy lab

Note the use of quotes to group elements. Without them a new comment line would be started after every whitespace.

Example 4

Generate a header without read groups and only comment lines

$ mimodd header --co "forgot to say:" this

@HD VN:1.5
@CO forgot to say:
@CO this

While such a header may not be terribly useful on its own, it can be used with the reheader tool to add comments to an existing header.

Note

Within MiModD the order of comment lines will, generally, be preserved, but this is not a file-format guarantee, i.e., other software is free to change it.

The difference between read group ID and sample name

A frequent You have sequenced two DNA preparations (DNA_prep1 and DNA_prep2) and you store the reads from the two samples in one input file.

A minimal header for this file can be generated with:

mimodd header --rg-id first second --rg-sm DNA_prep1 DNA_prep2

resulting in this SAM header:

@HD VN:1.0
@RG SM:DNA_prep1    ID:first
@RG SM:DNA_prep2    ID:second
@SQ LN:0    SN:

Assume you are not satisfied with the reads from DNA_prep1. You resequence that sample and store the reads in the same file. The new header can be generated with:

mimodd header --rg_id first second third --rg_sm DNA_prep1 DNA_prep2 DNA_prep1 --co "resequenced prep 1 hoping for better quality"

The SAM header now looks like this:

@HD VN:1.0
@RG SM:DNA_prep1    ID:first
@RG SM:DNA_prep2    ID:second
@RG SM:DNA_prep1    ID:third
@SQ SN:     LN:0
@CO resequenced prep 1 hoping for better quality
Galaxy interface

Galaxy tool name : NGS Run Annotation

This tool is more limited than its command line counterpart in that it only allows a single read group to be specified and described. There is no possibility to incorporate reference sequence information or comments. The main purpose of the tool is to generate annotation for unaligned single-sample input files that can be merged with this data by the Convert and Snap Read Alignment tools. In addition, it can be used in combination with the Reheader tool to modify or replace individual read groups in multi-sample BAM files.

_images/ngs_run_anno.png

The convert tool

Convert between different sequenced reads file formats.

command line usage
mimodd convert <input file> [input file]... [OPTIONS]

available options:

-o <file>, --ofile <file> : [default: stdout]
redirect the output to the specified file
--iformat <fastq|fastq_pe|gz|gz_pe|sam|bam> : [default: bam]

the format of the input file(s); the input format determines

  • whether the -h option is applicable and
  • whether multiple input files are allowed and how they are used:
    • with input in SAM or BAM format only one input file is accepted
    • with input in single-end fastq format (compressed or not; format strings fastq, gz) multiple input files may be given and the reads from all files will be written to the output in concatenated form, i.e., as if they had come from a single source, providing an easy way to merge split fastq files into a single SAM/BAM file
    • with input in paired-reads fastq format (compressed or not; format strings fastq_pe, gz_pe) multiple input files are interpreted as alternating r1 and r2 files, i.e., every other file is expected to contain the read mates of the preceding file
--oformat <sam|bam> : [default: sam]
the output format
-h <header file>, --header <header file>
use the header of the provided SAM file as the output header; applicable only when --iformat specifies a headerless file format, i.e., one of fastq, fastq_pe, gz or gz_pe

Additional specifications:

Standard input can be used instead of an input file by specifying a minus sign - instead of a file name.

The following table summarizes the conversions that are currently supported:

from to >1 input files optional header
fastq (or gzip compressed fastq.gz) sam yes yes
fastq (or gzip compressed fastq.gz) bam yes yes
sam bam no no
bam sam no no

Note

We anticipate future versions of MiModD to implement subsampling and filtering during conversions. This is the reason why, in addition to the format combinations above, the currently meaningless (unless you are looking for a slow way to copy files) “conversions” sam <-> sam and bam <-> bam are also accepted.

Examples:

Example 1

View the contents of a BAM file by converting it to SAM format

$ mimodd convert myreads.bam

Conversion from BAM to SAM is the default behavior.

Example 2

Convert a SAM file to BAM format

$ mimodd header --rg-id 007 --rg-sm "secret sample" --co "for your eyes only" | \
  mimodd convert - --iformat sam --oformat bam

�BCdsr�e�c``p�p�
                ��2�3�r��t�200�
                               ��*NM.J-Q(N�-�I�rp��L�/R��/-RH�L-V��˩���`��J�BCw

While the output is not too useful, this example demonstrates how the tool can be used in a pipe. Now just to show that the above worked as expected:

$ mimodd header --rg-id 007 --rg-sm "secret sample" --co "for your eyes only" | \
  mimodd convert - --iformat sam --oformat bam | mimodd convert -

@HD VN:1.5
@RG ID:007  SM:secret sample
@CO for your eyes only
Galaxy interface

Galaxy tool name : Convert

This tool has the exact same functionality in Galaxy and from the command line. The Galaxy input form simplifies the coice of parameters, though, by dynamically displaying only applicable options based on the selected input file format.

_images/convert.png

The reheader tool

command line usage
mimodd reheader [template SAM file] <BAM input file>
                [--rg ignore|update|replace [RG_TEMPLATE] [RG_MAPPING]...]
                [--sq ignore|update|replace [SQ_TEMPLATE] [SQ_MAPPING]...]
                [--co ignore|update|replace [CO_TEMPLATE]]
                [--rgm RG_MAPPING [RG_MAPPING]...]
                [--sqm SQ_MAPPING [SQ_MAPPING]...] [OPTIONS]

available options:

-o <file>, --ofile <file> : [default: stdout]
redirect the output to the specified file
-H
output only the resulting header in SAM format

This complex, but powerful tool rewrites an existing <BAM input file> modifying its header on the fly based on the header information of one or several template SAM files according to at least one of these optional parameters:

template SAM file
if specified the header of this file provides the read group information, the sequence dictionary and the comment lines that will be used to modify the input file header.
--rg ignore|update|replace [RG template SAM file] [RG mapping]...

the –rg option followed by one of the keywords ignore, update or replace specifies how the read group section of the new header should be compiled. With ignore any template read group information is ignored and the read group section of the BAM input file is copied to the new file unchanged. With update the input file header is modified by template read group information according to these rules: any read groups found only in the BAM input or in the template information are copied to the new header. For read groups with matching IDs found in the BAM input and in the template, any information about this read group found in the template will be written to the new header (possibly overwriting input file information), but information found only in the BAM input will be copied unchanged. With replace the rules are: all template read groups are written to the new header as they are (possibly overwriting read groups with matching IDs in the BAM input and without information merging). Read groups found only in the BAM input file are discarded.

Both update and replace may optionally be followed by

  • a read group-specific template file, which will be used as the source of template read group information (i.e., if the general template SAM file is also given, its read group section is ignored)

and/or

  • a read group mapping to be used to determine read group ID matching between the BAM input file and the template. This mapping has to be provided in the exact form input_id : template_id (with spaces around the colon). Several mappings must also be separated by a space. When a mapping is given and a template read group with the ID template_id is found, it is treated as if its ID was input_id.
--sq ignore|update|replace [SQ template SAM file] [SQ mapping]...
the –sq option determines the use of template sequence information in just the same way as –rg does for read groups. The only differences are that sequences will be matched by their name (SN header values) and that the length (LN value) and the MD5 checksum (M5 value) of a sequence will never be modified when rewriting sequence information from the BAM input to the new header.
--co ignore|update|replace [template SAM file for comments]
determines how template comments are used. They can be ignored, appended to (with update) any comments found in the BAM input file or be used to replace any comments from the BAM input. As with the –rg and –sq options, an optional specific template SAM file can also be specified.
--rgm RG mapping [RG mapping]...
this option lets you specify one or read group mappings (in the format shown above for the –rg option) that are used to rename read groups. Importantly, renaming is always happening after all other modifications and the input_id(s) given in the mapping(s) have to still exist at that point.
--sqm SQ mapping [SQ mapping]...
equivalent to –rgm for renaming sequence names

Additional specifications:

  • A minus sign - can be used in place of any of the different template file names that the tool supports indicating that the respective header template(s) should be read from standard input. This feature makes it possible to use the tool efficiently in pipes, e.g., with input coming from the header tool.
  • If a general template SAM file is specified the tool’s default for the --rg, --sq and --co will be set to replace, otherwise the defaults will be ignore.
  • The tool autodetects header changes (changes in read group IDs or sequence names, elimination of read groups or sequence records) that may be incompatible with records in the body of the BAM file. In these situations, it will scan the file body for incompatible records and try to fix them or, failing that, will terminate with an error message. In this way, the tool guarantees that its reheadered output will always be a valid BAM file. On the other hand, the scan causes very significant runtime increases leading to the folowing recommendations:
    • reserve the use of the --rgm and --sqm options for vital cases
    • use replace mode for read groups and sequence names (--rg and --sq options) with caution and take care not to remove read group or sequence records from the final header that get referred to in the body of the BAM file.

Examples:

The following examples assume that you have a BAM input file named “example_input” with the following header (printed in SAM format):

@HD VN:1.0  SO:coordinate
@RG ID:001  SM:sample 1     DT:2013-04-30
@RG ID:002  SM:sample 2     DT:2013-04-28
@SQ SN:chrI LN:15072423
@SQ SN:chrII        LN:15279345
@SQ SN:chrIII       LN:13783700

This file contains reads from two samples mapped against a common reference genome, i.e. the sequences chrI, chrII, chrIII and the read group IDs 001 and 002 will be referenced in the body of the file.

Note

The reheader tool is quite complex and may take some practice to use efficiently. The -H option instructs the tool to generate only the new header in human readable SAM format and we encourage beginners to use this option to check the outcome of their parameter settings before producing the full reheadered BAM file.

Example 1

Adding descriptions to existing read groups

$ mimodd header -x --rg-id 001 002 --rg-ds "labbook #1, exp #216" "labbook #1, exp #217" | \
  mimodd reheader -H example_input --rg update -

@HD VN:1.0  SO:coordinate
@RG ID:001  SM:sample 1     DS:labbook #1, exp #216 DT:2013-04-30
@RG ID:002  SM:sample 2     DS:labbook #1, exp #217 DT:2013-04-28
@SQ SN:chrI LN:15072423
@SQ SN:chrII        LN:15279345
@SQ SN:chrIII       LN:13783700

Note that, for this illustration, we used the -H option to obtain only the resulting header in SAM format.

We decided to generate the SAM header template with the descriptions to be added using the header tool. Use of the -x option saved us from having to redundantly specify the sample names again. We piped the output directly to the reheader tool to use it as the --rg template. update mode allows us to add the descriptions to the existing information of each read group. Using replace instead would have produced this unintended output:

@HD VN:1.0  SO:coordinate
@RG ID:001  DS:labbook #1, exp #216
@RG ID:002  DS:labbook #1, exp #217
@SQ SN:chrI LN:15072423
@SQ SN:chrII        LN:15279345
@SQ SN:chrIII       LN:13783700

with all the existing read group information replaced by the newly porvided one.

Example 2

Adding comments

$ mimodd header --co "let me just say:" "this header needs comments" | \
  mimodd reheader -H example_input --co update -

@HD VN:1.0  SO:coordinate
@RG ID:001  SM:sample 1     DT:2013-04-30
@RG ID:002  SM:sample 2     DT:2013-04-28
@SQ SN:chrI LN:15072423
@SQ SN:chrII        LN:15279345
@SQ SN:chrIII       LN:13783700
@CO let me just say:
@CO this header needs comments

Again, we generated the new header elements using mimodd header. In this particular case, the same output would have been generated with --co replace mode since no comment lines existed in the original BAM file.

Example 3

Renaming a read group

$ mimodd reheader -H example_input --rgm 001 : 003

@HD VN:1.0  SO:coordinate
@RG ID:003  SM:sample 1     DT:2013-04-30
@RG ID:002  SM:sample 2     DT:2013-04-28
@SQ SN:chrI LN:15072423
@SQ SN:chrII        LN:15279345
@SQ SN:chrIII       LN:13783700

Note that if we had not used the -H option, but had produced a full output BAM file with:

$ mimodd reheader example_input --rgm 001 : 003 -o example_output

the tool would have scanned the body of the input BAM file and would have replaced any references to read group ID 001 with references to ID 003.

Example 4

Adding information to read groups and changing their ID in one pass

Assume you have a template SAM file named “template.sam” that has additional information about the read groups, but lists those read groups under different IDs, e.g:

@HD VN:1.0  SO:coordinate
@RG ID:752  SM:sample 1     DS:labbook #1, exp #216
@RG ID:753  SM:sample 2     DS:labbook #1, exp #217

You could use:

$ mimodd reheader -H example_input --rg update template.sam 001 : 752 002 : 753

@HD VN:1.0  SO:coordinate
@RG ID:001  SM:sample 1     DS:labbook #1, exp #216 DT:2013-04-30
@RG ID:002  SM:sample 2     DS:labbook #1, exp #217 DT:2013-04-28
@SQ SN:chrI LN:15072423
@SQ SN:chrII        LN:15279345
@SQ SN:chrIII       LN:13783700

to assign the new information to the correct read groups. If you not only wanted to update the information about the read groups, but also wanted to change their IDs to match those in the template file, you would have to use:

$ mimodd reheader -H example_input --rg update template.sam 001 : 752 002 : 753 --rgm 001 : 752 002 : 753

@HD VN:1.0  SO:coordinate
@RG ID:752  SM:sample 1     DS:labbook #1, exp #216 DT:2013-04-30
@RG ID:753  SM:sample 2     DS:labbook #1, exp #217 DT:2013-04-28
@SQ SN:chrI LN:15072423
@SQ SN:chrII        LN:15279345
@SQ SN:chrIII       LN:13783700

Providing the mapping twice is necessary in this case because the changes of IDs according to the --rgm option are always made after any operations specified by --rg.

Galaxy interface

Galaxy tool name : Reheader BAM file

_images/reheader.png _images/reheader_rg.png

The sort tool:

Sort the reads in a BAM file.

command line usage
mimodd sort <input file> [OPTIONS]

available options:

-o <file>, --ofile <file> : [default: stdout]
redirect the output to the specified file

--oformat <bam|sam> : [default: bam]

-n, --by_name

-l <compression level> : [default: 6 (standard BAM compression level)]

-m <memory>, --memory <memory> : [default: config settings]

-t <threads>, --threads <threads> : [default: config settings]

By default, the tool sorts the reads in a BAM input file by their mapped reference coordinates. The -n switch can be used to sort by read names instead. The -l, -m, and -t options let you tweak the tool for optimal performance on your system. Specifically, they control the compression level to use for intermediate and final output, how much memory in GB to use for the sorting and how many threads to engage in it. Since the default values for -m and -t come from MiModD’s configuration settings (see Configuring MiModD for your system) they should be adequate in most situations.

Galaxy interface

Galaxy tool name : Sort BAM File

The Galaxy tool version exposes the --by_name and --oformat options of the command line tool as check boxes and should be self-explaining.

_images/sort.png

The covstats tool:

Provide a coverage summary report for output from the varcall tool.

command line usage
mimodd covstats <input bcf file> [-o <file>]

available options:

-o <file>, --ofile <file> : [default: stdout]
redirect the output to the specified file
Galaxy interface

Galaxy tool name : Coverage Statistics

_images/coverage_stats.png

The snpeff-genomes tool:

Query the host machine’s SnpEff installation for registered and installed (i.e., usable) genome annotation files.

command line usage
mimodd snpeff-genomes <input bcf file> [OPTIONS]

available options:

-o <file>, --ofile <file> : [default: stdout]
redirect the output to the specified file
-c <path>, --config <path>
use this path to locate the SnpEff installation directory instead of looking it up in the MiModD configuration file
Galaxy interface

Galaxy tool name : List Installed SnpEff Genomes

_images/list_snpeff_genomes.png

The cloudmap tool:

Generate CloudMap-compatible output from a vcf file.

command line usage
mimodd cloudmap <vcf input file> <analysis mode>
                [reference sample name] <sample name> <[OPTIONS]

available options:

-o <file>, --ofile <file> : [default: stdout]
redirect the output to the specified file
-s <file>, --seqdict <file>
generate also the sequence dictionary required by the CloudMap Variant Density Mapping and Hawaiian Mapping tools and write it to the specified file

This tool takes a multi-sample vcf input file (as it can be generated by the varextract tool and generates single-sample vcf from it that is compatible with the mutation mapping tools provided by CloudMap.

Together with the input file name you must specify the intended analysis mode as one of “EMS”, “VARIANT”, “HAWAIIAN” indicating which of the three existing CloudMMap mapping tools (EMS Variant Density Mapping, Variant Discovery Mapping and Hawaiian Variant Mapping you are going to use the output for.

This choice determines the meaning of the reference sample and sample name parameters:

  • In EMS mode, a reference sample cannot be specified and sample name must be an existing sample name in the vcf input file, for which you wish to generate mutation density plots in CloudMap.
  • In VARIANT mode, the sample name also indicates the sample for which CloudMap will generate mutation density plots, but the variants that will be retained in the output and, thus, will be mapped by CloudMap are only those not present in the reference sample.
  • HAWAIIAN mode, finally, is the reverse of VARIANT mode in that only those variants from the specified sample will be included in the output which the reference sample is homozygous mutant for.
Galaxy interface

Galaxy tool name : Prepare Variant Data for Mapping

The Galaxy tool has the same functionality as its command line counterpart, but the input form changes to include or exclude the field for entering a reference sample name depending on the selected analysis mode.

_images/galaxy_cloudmap.png

The tool interface in EMS mode ...

_images/galaxy_cloudmap_variant.png

... and in Hawaiian mode.


The Core Tools

The snap tool:

Align sequence reads to a reference genome.

command line usage
mimodd snap <sequencing mode> <reference genome>|<index directory>
            <input file> [<input file2>] -o <file> [OPTIONS]

available options:

--iformat <fastq|gz|sam|bam> : [default: bam]

--oformat <sam|bam> : [default: bam]

--header <header file>

-t <threads>, --threads <threads> : [default: config settings]

-q, --quiet

-v, -verbose

[advanced options]

The tool aligns NGS reads in fastq (gzipped or uncompressed), SAM or BAM format (specified with --iformat) to a reference genome using the SNAP aligner.

Besides an input file (or two in case of paired-end reads in fastq format) of reads, it expects a specification of the sequencing mode, which can either be “single” or “paired”, and a reference genome to align the reads against. The tool will build a temporary index of the reference genome (to make the index permanent, specify a directory to hold it with the advanced option --idx-out). Alternatively, a precalculated index (through a previous call of the tool or through the snap-index tool) may be specified. Whether a reference genome or an index is provided will be auto-detected.

The tool requires your input file(s) to have associated header information. If this is not the case, e.g., with fastq input files, you must provide a SAM file with that information through the --header option. You can also use this option to exchange existing header information in the input just for the alignment.

Note that, currently, the tool does not support printing to the standard output, so specifying an output file with the -o option is mandatory.

Galaxy interface

Galaxy tool name : -

Note

This tool is not directly accessible from Galaxy. The Galaxy interface of MiModD uses the snap-batch tool instead.


The snap-batch tool:

Execute a batch of snap tool calls automatically in an optimized way and merge the results of all calls into a single aligned reads file with several read groups.

command line usage
mimodd snap-batch -s <mimodd snap command line> [mimodd snap command line]...

or:

mimodd snap-batch -f <input file with snap command lines>

The individual commands passed through the -s option or in the file specified with -f should be valid command lines for the mimodd snap tool, but without the leading mimodd.

In case of the -s option, individual commands should be enclosed in quotes to allow command grouping. With the -f option, the file should have one snap command per line.

The name of the output file and its format (SAM or BAM) are taken from the -o and --oformat options specified for the first snap call. Both options are ignored when used as part of later snap calls. The advanced option --sort is treated the same way.

Galaxy interface

Galaxy tool name : Snap Read Alignment

A few restrictions currently apply when using this tool from the Galaxy interface:

As explained above, all parameters except for the output file, its format and sort order can differ between the indiviual snap calls specified when the tool is used from the command line.

In contrast, all but the input file format, and optional custom input file header information are shared between calls when the tool is used from Galaxy. This behaviour helps to keep the Galaxy interface relatively simple and is good enough for aligning reads from several samples and/or read groups against a common reference with common settings.

For more complex tasks, however, you will have to use the snap-batch command line interface.

_images/snap.png

The Snap Read Alignment tool with standard options.


_images/snap_extended.png

The advanced options section, which can be expanded by selecting change settings under further parameter settings.


The snap-index tool:

Generate an index of a reference genome for use by the snap and snap-batch tools.

command line usage
mimodd snap-index <reference genome> <index_directory> [OPTIONS]

-s <seed size>

-h <slack>

-t <threads>, --threads <threads> : [default: config settings]

Use of this tool is rarely necessary since snap and snap_batch call the underlying indexing function internally if needed.

Galaxy interface

Galaxy tool name : -

Note

This tool is not accessiblle from Galaxy.


The varcall tool:

Predict single-nucleotide variants (SNVs) and indels with respect to a reference genome in one or more groups of aligned reads.

command line usage
mimodd varcall <reference genome> <input file> [input file]... [OPTIONS]

available options:

-o <vcf output file>, --ofile <vcf output file> : [default: stdout]
redirect the output to the specified file

-c <cov output file>

-s <vcf white-list file> [output file]

-d <max per-BAM depth>

-i

-q, --quiet

-v, --verbose

Optionally, the tool can produce additional output for positions listed in the vcf file specified with the -s option. This output can be redirected to a file by providing the optional output file as the second argument following -s.

It can also produce a cov file, specified with the -c option, with sample-specific coverage information for every position of the reference genome. varcall offers the choice between calling variants on a per-read group (enabled with the -i switch) or a per-sample basis (the default). This choice also affects the optional additional files if produced.

The -q switch suppresses the original output from the underlying samtools mpileup and bcftools view command, while the -v(erbose) switch enables additional messages from MiModD.

Galaxy interface

Galaxy tool name : Variant Calling

_images/variant_calling.png

The tool lets you choose a reference genome and one or more input files of aligned reads from drop-down menus.

Toggle switches can be used to enable or disable the coverage and white-list sites output and to set the mode of read grouping.

The maximum per-BAM depth can also be set.


The varextract tool:

Extract variant sites from BCF input as generated by varcall and report them in VCF format.

command line usage
mimodd varextract <input bcf file> [OPTIONS]
-o <file>, --ofile <file> : [default: stdout]
redirect the output to the specified file

-p <vcf file> [<vcf file>]..., --pre-vcf <vcf file> [<vcf file>]...

-a, --keep-alts

-v, --verbose

Galaxy interface

Galaxy tool name : Extract Variant Sites

_images/extract_variant_sites.png

The delcall tool:

Predict deletions in one or more groups of aligned paired-end reads.

command line usage
mimodd delcall <input bam file> [input bam file]... <coverage file> [OPTIONS]
-o <file>, --ofile <file> : [default: stdout]
redirect the output to the specified file

--max-cov <maximal coverage>

--min-size <minimal size>

-u, include-uncovered
include uncovered regions (even if they do not qualify as deletions)

-i, --group-by-id

-v, --verbose

The tool predicts deletions in one or more samples of aligned paired-end reads from the specified input bam files based on coverage of the reference genome (taken from the input cov file) and on observed insert sizes around the region.

By default, the tool reports uncovered regions, for which their is statistical evidence that they are real deletions. This behaviour can be changed so that any uncovered region will be included with the -u switch.

The output (in gff format) can be redirected from the standard output to a file specified with the -o option.

The --max-cov and --min-size options can be used to set coverage and size thresholds for reporting uncovered regions/deletions.

Like in varcall, there is an -i switch to control the mode of read grouping, but its meaning is slightly different. If enabled reads are grouped strictly by their read groups, but the default is to group reads by their associated sample names only for the coverage analysis, i.e., to detect uncovered regions. The statistical test for real deletions, however, will always be performed on a per-read group basis.

Note

This tool requires paired-end data for the statistical assessment of deletions, but single-end data with the same sample name can be used in and improve the initial coverage analysis.

Galaxy interface

Galaxy tool name : Deletion Prediction for Paired-end Data

_images/deletion_prediction.png

You can select one or more bam input files and a cov file from drop-down menus.

You can provide custom values for the maximal coverage allowed inside a region to consider it as a deletion and the minimal deletion size.

Toggle switches let you control whether to inlude uncovered regions that are not statistically significant deletions in the output and how to group reads from the input bam file(s).


The Data Exploration Tools

The vcf-filter tool:

Filter multi-sample vcf files, like the ones generated by varcall, by chromosomal region, variant type (i.e., INDEL or single nucleotide change), or by sample-specific genotypes and coverage.

command line usage
mimodd vcf-filter <input file> [OPTIONS]
-o <file>, --ofile <file> : [default: stdout]
redirect the output to the specified file
-s <sample name> [sample name]..., --samples ...
one or more names of samples that the sample-specific filters --gt , --dp and --gq should work on
--gt <genotype pattern> [genotype pattern]...
one or more genotype patterns of the form x/x[, x/x] where x=0 indicates a reference allele and x=1 a variant allele; exactly one genotype pattern (possibly composed of several comma-separated genotypes) needs to be given per sample name specified through -s and variants will be retained only if every listed sample has a genotype contained in its corresponding genotype patterns; use ANY as the genotype pattern to exclude a sample from being used in genotype filtering
--dp <depth> [depth]...
depth of coverage threshold; exactly one value needed for every sample specified through -s; variants will only be retained if every sample declared with -s shows a read coverage of the variant site at least equal to the corresponding threshold; use 0 to skip the requirement for a given sample
--gq <genotype quality> [genotype quality]
genotype quality threshold; exactly one value needed for every sample specified through -s; variants will only be retained if every sample declared with -s has a quality score for its predicted genotype equal to or higher than the corresponding threshold; specifying 0 as the threshold effectively skips gq filtering for a given sample
-r <chromosome>:[start-stop] [chromosome:[start-stop]]..., --region ...
retain variants only if they fall into one of the given chromosomal regions (specified in the format chrom:start-stop or chrom: for a whole chromosome)
-i``|-I``, --no-indels``|–indels-only``
mutually exclusive options to filter out or retain only indels
--vfilter <sample name> [sample name]...
filter by sample name(s); instead of filtering variants, this option will retain sample-specific information about each variant only for the specified samples; if several sample names are provided their order also determines the order of sample-specific information in the output; useful for demultiplexing and restructuring (the sample-specific part of) multi-sample vcf files

To filter variants based on sample-specific information, first use the -s option to specify one or more samples by name to which the filters should be applied. Then use the --gt, --dp or --gq options to specify the filter criteria. Each of these options, when present, has to be followed by a number of filters matching the number of samples declared with -s and the filters are associated with the samples by position on the command line.

–gt sets genotype filters. Currently, 0/0 (homozygous ref allele), 0/1 (heterozygous) and 1/1 (homozygous alt allele) are allowed as <genotype patterns>. Several genotypes can be specified by separating them with commas. The keyword ANY is allowed as a placeholder (i.e., no genotype filter should be applied to the given sample), and this should be preferred over the alternative 0/0,0/1,1/1 notation for clarity, briefity and efficiency.

–dp sets coverage filters. To pass the filter a variant site has to be covered by at least as many sample-specific reads as set with –dp. Consequently, 0 can serve as a placeholder here.

–gq sets genotype quality filters. To pass the filter the quality of the genotype call for the relevant sample has to be at least as high as the filter criterion. Again, 0 has a placeholder function.

All filters discussed above are subtractive and line-based, i.e., if, and only if, a variant passes ALL filters, the full entry is retained.

In addition, the tool supports one column-based/vertical filter via the –v_filter option. If used, then of the sample-specific columns of any line (that passes all other filters) only those describing samples specified with the option are retained.

Examples:

Example 1

Of a vcf input file variants_in.vcf with a sample called sample1 (and possibly others) retain all variants for which sample1’s genotype is 0/1 (heterozygous) or 1/1 (homozygous for the variant allele):

$ mimodd vcf-filter variants_in.vcf -s sample1 --gt 0/1,1/1
Example 2

Of the same file, retain all variants, for which sample1’s genotype is 0/1 or 1/1 AND for which sample2’s genotype is 0/0 (homozygous for the reference allele):

$ mimodd vcf-filter variants_in.vcf -s sample1 sample2 --gt 0/1,1/1 0/0
Example 3

With the same input again, retain all variants for which the genotype of sample1 is 0/1 or 1/1 AND for which sample1 AND sample2 show at least 3-fold coverage (the genotype of sample2 is not used for filtering):

$ mimodd vcf-filter variants_in.vcf -s sample1 sample2 --gt 0/1,1/1 ANY --dp 3 3
Example 4

Retain all variants found on chromosomes chrI and chrII for which the genotype of sample1 is 1/1 AND that of sample2 is 0/0. For these variants, report only sample-specific information for sample1:

$ mimodd vcf-filter variants_in.vcf -s sample1 sample2 --gt 1/1 0/0 --region chrI: chrII: --vfilter sample1
Galaxy interface

Galaxy tool name : VCF Filter

_images/vcf_filter.png _images/vcf_filter_sample_region.png

The annotate tool:

Generate annotated output for called variants to simplify data exploration

command line usage
mimodd annotate <input file> [OPTIONS]

available options:

-o <file>, --ofile <file> : [default: stdout]
redirect the output to the specified file
-f <html|text>, --oformat <html|text> : [default: html]
output format
--grouping <by_sample|by_genes>
group annotated variants by sample or by affected gene instead of simply keeping the input file order
-l <link formatter file>, --link <link formatter file>
file to read custom hyperlink formatting instructions from; effective only with -f html
--species <species>
the name of the species to assume for the annotations

In addition, the following SnpEff-specific options are supported:

-g <genome>, --genome <genome>
use SnpEff with <genome> as its genome file to annotate variants with their effects on genes and transcripts; the exact name of the file has to be provided; use the snpeff-genomes tool to get a list of all available files
-s <file>, --snpeff-out <file>
generate an additional file <file> which holds the original SnpEff output
--stats <file>
generate an additional file <file> with the SnpEff results summary
-m <memory>, --memory <meemory> : [default: config settings]
maximal memory to use in GB

--minC <coverage threshold>

--minQ <qualtiy threshold>

--no-downstream, --no-upstream, --no-intron, --no-intergenic, --no-utr
do not annotate downstream, upstream, intron, intergenic or UTR effects, respectively
--ud <distance> : [default: 500]
specify the upstream/downstream interval length, i.e., variants more than <distance> nts from the next annotated gene are considered to be intergenic

--chr

This tool provides information about the genes and transcripts affected by the variants in a vcf input file.

It uses SnpEff to add annotations to the vcf input, and an installed annotated SnpEff genome has to be specified, from which the annotations will be extracted. The snpeff_genomes tool (see below) can be used to get a list of all installed SnpEff genomes on your system.

By default, the tool generates a single output file specified with the -o option. This is a html file with a table of all variants, which transcripts they affect and, possibly, hyperlinks to the relevant genes and genome views (supported for selected organisms).

The –group_by_sample switch affects the order of this table, changing it from position-based with interleaved samples (the default) to sorted by sample first.

On demand, the original annotated vcf generated by SnpEff can be stored in a separate file specified with the -s option.

In addition, an optional SnpEff summary file of the variant effects can be specified with the –stats option.

The -q and -v switches have their usual MiModD meaning, i.e., to suppress the original output from SnpEff and to enable verbose output from the annotate tool itself, respectively.

All other options affect the behaviour of SnpEff:

Galaxy interface

Galaxy tool name : Variant Annotation

_images/variant_anno.png