MiModD User Guide: Tool Documentation¶
As an end-user you have two possibilities to work with MiModD:
you can use it as a suite of tools from the command line or, more conveniently, through its built-in Galaxy interface.
The first has the advantage of not requiring any additional software installation and avoids the memory and performance overhead of running the Galaxy server.
The later clearly provides the more intuitive user-interface, particularly for beginners.
In this chapter we will explain both the command line usage and the corresponding Galaxy wrapper for each tool in the MiModD package. For most tools, the relationship between their command line parameters and their configuration in Galaxy should be obvious, but we will point out differences between the two interfaces when they are important.
Overview of the MiModD tools¶
The following is a list of all command line tools currently built into MiModD together with a brief description of the purpose of each tool:
info retrieve information about the samples encoded in a file for
various supported formats
header generate a SAM format header from an NGS run description
convert convert between different sequenced reads file formats
reheader from a BAM file generate a new file with specified header
sections modified based on the header of a template SAM file
sort sort a BAM file by coordinates (or names) of the mapped
reads
snap align sequence reads using the SNAP aligner
snap-batch run several snap jobs and pool the resulting alignments into
a multi-sample SAM/BAM file
snap-index index a reference genome for use with the SNAP aligner
varcall predict SNPs and indels in one or more aligned read samples
and calculate the coverage of every base in the reference
genome using samtools/bcftools
varextract extract variant sites from BCF input as generated by varcall
and report them in VCF
covstats summary coverage statistics for varcall output
delcall predict deletions in one or more samples of aligned paired-
end reads based on coverage of the reference genome and on
insert sizes
vcf-filter extract lines from a vcf variant file based on sample- and
field-specific filters
annotate annotate a vcf variant file with information about the
affected genes
snpeff-genomes
list installed SnpEff genomes
cloudmap generate CloudMap-compatible output from a vcf file
as you can obtain it, from the command line, by running:
mimodd --help
You can run any individual tool through:
mimodd <tool name> <tool specific arguments>
Additional help on each tool and the arguments it accepts can be obtained by:
mimodd <tool name> --help
, e.g.:
$ mimodd header --help
usage: header [-h] [--rg-id RG_ID [RG_ID ...]] [--rg-sm RG_SM [RG_SM ...]]
[--rg-cn RG_CN [RG_CN ...]] [--rg-ds RG_DS [RG_DS ...]]
[--rg-dt RG_DT [RG_DT ...]] [--rg-lb RG_LB [RG_LB ...]]
[--rg-pl RG_PL [RG_PL ...]] [--rg-pi RG_PI [RG_PI ...]]
[--rg-pu RG_PU [RG_PU ...]] [--co [COMMENT [COMMENT ...]]]
[-o OFILE] [-x]
optional arguments:
-h, --help show this help message and exit
-o OFILE, --ofile OFILE
redirect the output to the specified file (default:
stdout)
-x, --relaxed do not enforce a sample name to be specified for every
read group
read group description:
--rg-id RG_ID [RG_ID ...]
one or more unique read group identifiers
--rg-sm RG_SM [RG_SM ...]
one sample name per read group identifier
--rg-cn RG_CN [RG_CN ...]
one sequencing center name per read group
--rg-ds RG_DS [RG_DS ...]
one description line per read group
--rg-dt RG_DT [RG_DT ...]
date runs were produced (YYYY-MM-DD); one date per
read group
--rg-lb RG_LB [RG_LB ...]
library identifier for each read group
--rg-pl RG_PL [RG_PL ...]
sequencing platform/technology used to produce each
read group
--rg-pi RG_PI [RG_PI ...]
predicted median insert size for the reads of each
read group
--rg-pu RG_PU [RG_PU ...]
platform unit, e.g., flowcell barcode or slide
identifier, for each read group
other information:
--co [COMMENT [COMMENT ...]]
an arbitrary number of one-line comment strings
If you have enabled MiModD in your local Galaxy installation, the same tools should be available (under slightly different names) in the MiModD section of the Galaxy Tools bar. This is a screenshot of what the section should look like:
While the MiModD Galaxy tools are really just wrappers around the tools available from the command line, they provide a graphical interface that typically is easier to handle for the new or occasional user. As an example, here is the Galaxy interface for the mimodd header command (called NGS Run Annotation in Galaxy):
In addition, to providing user-friendly access to individual tools, Galaxy also lets you combine individual tools into complex workflows without advanced knowledge of shell syntax or scripting languages. The chapter MiModD and Galaxy in this guide explains how MiModD lets you run complete analyses of NGS data - from sequenced reads to lists of variants - as a single Galaxy workflow without user intervention.
The MiModD tools in detail¶
The MiModD tools can be grouped in three categories:
- the format converter and query tools (info, header, convert, reheader, sort, covstats, snpeff-genomes, cloudmap) help you annotate and arrange your sequencing data and let you convert between different formats for storage and use by other MiModD and external tools.
- the core tools (snap, snap-batch, snap-index, varcall, varextract and delcall) do the computation-intensive analysis of the data, aligning sequence reads to reference genomes and detecting sequence variants (currently SNPs, indels and deletions) with respect to the reference.
- the data exploration tools (annotate and vcf-filter) let you annotate and filter the results of these analyses and, thus, help you to extract biologically relevant information.
The Format Converter and Query Tools¶
The info tool¶
Provide information about file contents.
command line usage¶
mimodd info <input file> [OPTIONS]
The tool expects an input file and reports the metadata it finds encoded
in it. It works with and auto-detects almost every file format used
anywhere in the MiModD package, i.e., SAM, BAM, vcf, bcf and fasta, and is
useful for quick checks of file contents.
The tool can produce either plain-text or html output, which, like for many
MiModD tools, is printed to the standard output (stdout, i.e., the
console/terminal window) by default, but can be redirected with the -o or
--ofile option to the specified file.
available options:
-o <file>,--ofile <file>: [default: stdout]- redirect the output to the specified file
--oformat <html|txt> : [default: txt]
-v, --verbose
Galaxy interface¶
Galaxy tool name : Retrieve File Information
Choose the input file from a drop-down menu of available files from your user history, then click Execute to run the tool. In Galaxy, a tool’s output goes to a file. Galaxy stores these files in an internal database and presents them to you in the history pane. Upon running the tool, its results file will automatically be added to your history. Click on the eye symbol to see the file contents in the main pane.
The header tool¶
Construct a SAM format file header from information about read groups and comments, which can be used by other tools (reheader, convert and snap) to annotate sequence reads files with metadata.
command line usage¶
mimodd header [--rg-id <RG_ID> [RG_ID]...] [--rg-sm <RG_SM> [RG_SM]...]
[--rg-cn <RG_CN> [RG_CN]...] [--rg-ds <RG_DS> [RG_DS]...]
[--rg-dt <RG_DT> [RG_DT]...] [--rg-lb <RG_LB> [RG_LB]...]
[--rg_pl <RG_PL> [RG_PL]...] [--rg_pi <RG_PI> [RG_PI]...]
[--rg_pu <RG_PU> [RG_PU]...]
[--co <COMMENT> [COMMENT]... ] [OPTIONS]
available options:
-o <file>,--ofile <file>: [default: stdout]- redirect the output to the specified file
-x,--relaxed- do not enforce a sample name to be specified for every read group
A read group in SAM/BAM format refers to a group of sequences that form a logical unit and should be treated together. Typically, a read group comprises sequence reads obtained from one sample in a particular sequencing run. You can add information about one or several read groups to the mimodd header output through the following command line arguments:
--rg-id <ID> [ID]...- Read group IDs are unique strings unambiguously identifying a read group. They may be provided by the sequencing center generating the data, but you can choose any ID(s) as long as you make sure that your naming scheme does not cause ID collisions in downstream analysis steps.
--rg-sm <sample name> [sample name]...- For each read group declared by providing an ID for it, you must
specify a sample name using this option (use the
-xswitch to make sample names optional). The sample name should identify the biological sample that the read group provides sequencing data for. --rg-cn <sequencing center> [sequencing center]...- Optional name of the sequencing center that generated the data for the read group(s).
--rg-ds <description> [description]...- Optional one-line read group description(s).
--rg-dt <YYYY-MM-DD> [YYYY-MM-DD]...- Optional information in YYYY-MM-DD date format about when the data for the read group(s) was generated.
--rg-lb <lib identifier> [lib identifier]...- Optional library identifier(s). A unique identifier of the sequence library that was used for sequencing should be given here.
--rg-pl <platform name> [platform name]...- Optional name of the sequencing technology that was used to produce the data for the read group(s). Should be one of the following officially supported names: CAPILLARY, LS454, ILLUMINA, SOLID, HELICOS, IONTORRENT, PACBIO.
--rg-pi <insert size> [insert size]...- Optional value(s) for the predicted median insert size between sequence reads of the read group(s) on the sequenced DNA fragments. Only applicable for paired-end sequencing data.
--rg-pu <unit ID> [unit ID]...- Optional identifier of the physical sequencing unit that the data for the read group(s) was generated on. Depending on the sequencing platform that was used (see the –rg_pl option) this could be, e.g., a flowcell-barcode or a sequencing slide.
The number of read groups declared in the resulting SAM header is determined by
the number of read group IDs and matching number of sample names specified
through the --rg-id and --rg-sm options. All other information for any
read group is optional. If you do not want to provide a certain parameter for
any read group you can simply omit the option. To provide a given parameter for
some, but not all of the specified read groups, use the empty string "" as
a placeholder. Pairing between parameters is determined by the input order.
Example 1¶
Specify two read-groups with optional sequencing date information
$ mimodd header --rg-id 001 002 --rg-sm sample1 sample2 --rg-dt 2014-10-23 2014-09-14
@HD VN:1.5
@RG ID:001 SM:sample1 DT:2014-10-23
@RG ID:002 SM:sample2 DT:2014-09-14
Example 2¶
Specify three read-groups with dates, but a description only available for the second
$ mimodd header --rg-id 001 002 003 --rg-sm sample1 sample2 sample3 --rg-dt 2014-10-23 2014-09-14 2014-11-09 --rg-ds "" "WT control" ""
@HD VN:1.5
@RG ID:001 SM:sample1 DT:2014-10-23
@RG ID:002 SM:sample2 DS:WT control DT:2014-09-14
@RG ID:003 SM:sample3 DT:2014-11-09
In addition to or instead of read group information the header tool also lets you specify an arbitrary number of comment lines. These are not bound to any specific read group and should, generally, be used for remarks concerning the read groups or their analysis as a whole though you can, of course, refer to a particular read group in the comment text.
Example 3¶
Generate the three-sample header above, but with two general comment lines appended
$ mimodd header --rg-id 001 002 003 --rg-sm sample1 sample2 sample3 --rg-dt 2014-10-23 2014-09-14 2014-11-09 --rg-ds "" "WT control" "" --co "data from the xx sequencing project" "provided by yy lab"
@HD VN:1.5
@RG ID:001 SM:sample1 DT:2014-10-23
@RG ID:002 SM:sample2 DS:WT control DT:2014-09-14
@RG ID:003 SM:sample3 DT:2014-11-09
@CO data from the xx sequencing project
@CO provided by yy lab
Note the use of quotes to group elements. Without them a new comment line would be started after every whitespace.
Example 4¶
Generate a header without read groups and only comment lines
$ mimodd header --co "forgot to say:" this
@HD VN:1.5
@CO forgot to say:
@CO this
While such a header may not be terribly useful on its own, it can be used with the reheader tool to add comments to an existing header.
Note
Within MiModD the order of comment lines will, generally, be preserved, but this is not a file-format guarantee, i.e., other software is free to change it.
The difference between read group ID and sample name¶
A frequent You have sequenced two DNA preparations (DNA_prep1 and DNA_prep2) and you store the reads from the two samples in one input file.
A minimal header for this file can be generated with:
mimodd header --rg-id first second --rg-sm DNA_prep1 DNA_prep2
resulting in this SAM header:
@HD VN:1.0
@RG SM:DNA_prep1 ID:first
@RG SM:DNA_prep2 ID:second
@SQ LN:0 SN:
Assume you are not satisfied with the reads from DNA_prep1. You resequence that sample and store the reads in the same file. The new header can be generated with:
mimodd header --rg_id first second third --rg_sm DNA_prep1 DNA_prep2 DNA_prep1 --co "resequenced prep 1 hoping for better quality"
The SAM header now looks like this:
@HD VN:1.0
@RG SM:DNA_prep1 ID:first
@RG SM:DNA_prep2 ID:second
@RG SM:DNA_prep1 ID:third
@SQ SN: LN:0
@CO resequenced prep 1 hoping for better quality
Galaxy interface¶
Galaxy tool name : NGS Run Annotation
This tool is more limited than its command line counterpart in that it only allows a single read group to be specified and described. There is no possibility to incorporate reference sequence information or comments. The main purpose of the tool is to generate annotation for unaligned single-sample input files that can be merged with this data by the Convert and Snap Read Alignment tools. In addition, it can be used in combination with the Reheader tool to modify or replace individual read groups in multi-sample BAM files.
The convert tool¶
Convert between different sequenced reads file formats.
command line usage¶
mimodd convert <input file> [input file]... [OPTIONS]
available options:
-o <file>,--ofile <file>: [default: stdout]- redirect the output to the specified file
--iformat <fastq|fastq_pe|gz|gz_pe|sam|bam>: [default: bam]the format of the input file(s); the input format determines
- whether the
-hoption is applicable and - whether multiple input files are allowed and how they are used:
- with input in SAM or BAM format only one input file is accepted
- with input in single-end fastq format (compressed or not; format strings
fastq,gz) multiple input files may be given and the reads from all files will be written to the output in concatenated form, i.e., as if they had come from a single source, providing an easy way to merge split fastq files into a single SAM/BAM file - with input in paired-reads fastq format (compressed or not; format
strings
fastq_pe,gz_pe) multiple input files are interpreted as alternating r1 and r2 files, i.e., every other file is expected to contain the read mates of the preceding file
- whether the
--oformat <sam|bam>: [default: sam]- the output format
-h <header file>,--header <header file>- use the header of the provided SAM file as the output header;
applicable only when
--iformatspecifies a headerless file format, i.e., one offastq,fastq_pe,gzorgz_pe
Additional specifications:
Standard input can be used instead of an input file by specifying a minus sign
- instead of a file name.
The following table summarizes the conversions that are currently supported:
| from | to | >1 input files | optional header |
|---|---|---|---|
| fastq (or gzip compressed fastq.gz) | sam | yes | yes |
| fastq (or gzip compressed fastq.gz) | bam | yes | yes |
| sam | bam | no | no |
| bam | sam | no | no |
Note
We anticipate future versions of MiModD to implement subsampling and filtering during conversions. This is the reason why, in addition to the format combinations above, the currently meaningless (unless you are looking for a slow way to copy files) “conversions” sam <-> sam and bam <-> bam are also accepted.
Examples:
Example 1¶
View the contents of a BAM file by converting it to SAM format
$ mimodd convert myreads.bam
Conversion from BAM to SAM is the default behavior.
Example 2¶
Convert a SAM file to BAM format
$ mimodd header --rg-id 007 --rg-sm "secret sample" --co "for your eyes only" | \
mimodd convert - --iformat sam --oformat bam
�BCdsr�e�c``p�p�
��2�3�r��t�200�
��*NM.J-Q(N�-�I�rp��L�/R��/-RH�L-V��˩���`��J�BCw
While the output is not too useful, this example demonstrates how the tool can be used in a pipe. Now just to show that the above worked as expected:
$ mimodd header --rg-id 007 --rg-sm "secret sample" --co "for your eyes only" | \
mimodd convert - --iformat sam --oformat bam | mimodd convert -
@HD VN:1.5
@RG ID:007 SM:secret sample
@CO for your eyes only
Galaxy interface¶
Galaxy tool name : Convert
This tool has the exact same functionality in Galaxy and from the command line. The Galaxy input form simplifies the coice of parameters, though, by dynamically displaying only applicable options based on the selected input file format.
The reheader tool¶
command line usage¶
mimodd reheader [template SAM file] <BAM input file>
[--rg ignore|update|replace [RG_TEMPLATE] [RG_MAPPING]...]
[--sq ignore|update|replace [SQ_TEMPLATE] [SQ_MAPPING]...]
[--co ignore|update|replace [CO_TEMPLATE]]
[--rgm RG_MAPPING [RG_MAPPING]...]
[--sqm SQ_MAPPING [SQ_MAPPING]...] [OPTIONS]
available options:
-o <file>,--ofile <file>: [default: stdout]- redirect the output to the specified file
-H- output only the resulting header in SAM format
This complex, but powerful tool rewrites an existing <BAM input file>
modifying its header on the fly based on the header information of one or
several template SAM files according to at least one of these optional
parameters:
template SAM file- if specified the header of this file provides the read group information, the sequence dictionary and the comment lines that will be used to modify the input file header.
--rg ignore|update|replace [RG template SAM file] [RG mapping]...the –rg option followed by one of the keywords ignore, update or replace specifies how the read group section of the new header should be compiled. With
ignoreany template read group information is ignored and the read group section of the BAM input file is copied to the new file unchanged. Withupdatethe input file header is modified by template read group information according to these rules: any read groups found only in the BAM input or in the template information are copied to the new header. For read groups with matching IDs found in the BAM input and in the template, any information about this read group found in the template will be written to the new header (possibly overwriting input file information), but information found only in the BAM input will be copied unchanged. Withreplacethe rules are: all template read groups are written to the new header as they are (possibly overwriting read groups with matching IDs in the BAM input and without information merging). Read groups found only in the BAM input file are discarded.Both
updateandreplacemay optionally be followed by- a read group-specific template file, which will be used as the source of
template read group information (i.e., if the general
template SAM fileis also given, its read group section is ignored)
and/or
- a read group mapping to be used to determine read group ID matching
between the BAM input file and the template. This mapping has to be
provided in the exact form
input_id : template_id(with spaces around the colon). Several mappings must also be separated by a space. When a mapping is given and a template read group with the ID template_id is found, it is treated as if its ID was input_id.
- a read group-specific template file, which will be used as the source of
template read group information (i.e., if the general
--sq ignore|update|replace [SQ template SAM file] [SQ mapping]...- the –sq option determines the use of template sequence information in just the same way as –rg does for read groups. The only differences are that sequences will be matched by their name (SN header values) and that the length (LN value) and the MD5 checksum (M5 value) of a sequence will never be modified when rewriting sequence information from the BAM input to the new header.
--co ignore|update|replace [template SAM file for comments]- determines how template comments are used. They can be ignored, appended to (with update) any comments found in the BAM input file or be used to replace any comments from the BAM input. As with the –rg and –sq options, an optional specific template SAM file can also be specified.
--rgm RG mapping [RG mapping]...- this option lets you specify one or read group mappings (in the format shown above for the –rg option) that are used to rename read groups. Importantly, renaming is always happening after all other modifications and the input_id(s) given in the mapping(s) have to still exist at that point.
--sqm SQ mapping [SQ mapping]...- equivalent to –rgm for renaming sequence names
Additional specifications:
- A minus sign
-can be used in place of any of the different template file names that the tool supports indicating that the respective header template(s) should be read from standard input. This feature makes it possible to use the tool efficiently in pipes, e.g., with input coming from the header tool. - If a general
template SAM fileis specified the tool’s default for the--rg,--sqand--cowill be set toreplace, otherwise the defaults will beignore. - The tool autodetects header changes (changes in read group IDs or sequence
names, elimination of read groups or sequence records) that may be
incompatible with records in the body of the BAM file. In these situations,
it will scan the file body for incompatible records and try to fix them or,
failing that, will terminate with an error message. In this way, the tool
guarantees that its reheadered output will always be a valid BAM file. On
the other hand, the scan causes very significant runtime increases leading to
the folowing recommendations:
- reserve the use of the
--rgmand--sqmoptions for vital cases - use
replacemode for read groups and sequence names (--rgand--sqoptions) with caution and take care not to remove read group or sequence records from the final header that get referred to in the body of the BAM file.
- reserve the use of the
Examples:
The following examples assume that you have a BAM input file named “example_input” with the following header (printed in SAM format):
@HD VN:1.0 SO:coordinate
@RG ID:001 SM:sample 1 DT:2013-04-30
@RG ID:002 SM:sample 2 DT:2013-04-28
@SQ SN:chrI LN:15072423
@SQ SN:chrII LN:15279345
@SQ SN:chrIII LN:13783700
This file contains reads from two samples mapped against a common reference genome, i.e. the sequences chrI, chrII, chrIII and the read group IDs 001 and 002 will be referenced in the body of the file.
Note
The reheader tool is quite complex and may take some practice to use
efficiently. The -H option instructs the tool to generate only the new
header in human readable SAM format and we encourage beginners to use this
option to check the outcome of their parameter settings before producing the
full reheadered BAM file.
Example 1¶
Adding descriptions to existing read groups
$ mimodd header -x --rg-id 001 002 --rg-ds "labbook #1, exp #216" "labbook #1, exp #217" | \
mimodd reheader -H example_input --rg update -
@HD VN:1.0 SO:coordinate
@RG ID:001 SM:sample 1 DS:labbook #1, exp #216 DT:2013-04-30
@RG ID:002 SM:sample 2 DS:labbook #1, exp #217 DT:2013-04-28
@SQ SN:chrI LN:15072423
@SQ SN:chrII LN:15279345
@SQ SN:chrIII LN:13783700
Note that, for this illustration, we used the -H option to obtain only the
resulting header in SAM format.
We decided to generate the SAM header template with the descriptions to be
added using the header tool. Use of the -x option saved us from having to
redundantly specify the sample names again. We piped the output directly to the
reheader tool to use it as the --rg template. update mode allows us to
add the descriptions to the existing information of each read group.
Using replace instead would have produced this unintended output:
@HD VN:1.0 SO:coordinate
@RG ID:001 DS:labbook #1, exp #216
@RG ID:002 DS:labbook #1, exp #217
@SQ SN:chrI LN:15072423
@SQ SN:chrII LN:15279345
@SQ SN:chrIII LN:13783700
with all the existing read group information replaced by the newly porvided one.
Example 2¶
Adding comments
$ mimodd header --co "let me just say:" "this header needs comments" | \
mimodd reheader -H example_input --co update -
@HD VN:1.0 SO:coordinate
@RG ID:001 SM:sample 1 DT:2013-04-30
@RG ID:002 SM:sample 2 DT:2013-04-28
@SQ SN:chrI LN:15072423
@SQ SN:chrII LN:15279345
@SQ SN:chrIII LN:13783700
@CO let me just say:
@CO this header needs comments
Again, we generated the new header elements using mimodd header. In this
particular case, the same output would have been generated with
--co replace mode since no comment lines existed in the original BAM file.
Example 3¶
Renaming a read group
$ mimodd reheader -H example_input --rgm 001 : 003
@HD VN:1.0 SO:coordinate
@RG ID:003 SM:sample 1 DT:2013-04-30
@RG ID:002 SM:sample 2 DT:2013-04-28
@SQ SN:chrI LN:15072423
@SQ SN:chrII LN:15279345
@SQ SN:chrIII LN:13783700
Note that if we had not used the -H option, but had produced a full
output BAM file with:
$ mimodd reheader example_input --rgm 001 : 003 -o example_output
the tool would have scanned the body of the input BAM file and would have replaced any references to read group ID 001 with references to ID 003.
Example 4¶
Adding information to read groups and changing their ID in one pass
Assume you have a template SAM file named “template.sam” that has additional information about the read groups, but lists those read groups under different IDs, e.g:
@HD VN:1.0 SO:coordinate
@RG ID:752 SM:sample 1 DS:labbook #1, exp #216
@RG ID:753 SM:sample 2 DS:labbook #1, exp #217
You could use:
$ mimodd reheader -H example_input --rg update template.sam 001 : 752 002 : 753
@HD VN:1.0 SO:coordinate
@RG ID:001 SM:sample 1 DS:labbook #1, exp #216 DT:2013-04-30
@RG ID:002 SM:sample 2 DS:labbook #1, exp #217 DT:2013-04-28
@SQ SN:chrI LN:15072423
@SQ SN:chrII LN:15279345
@SQ SN:chrIII LN:13783700
to assign the new information to the correct read groups. If you not only wanted to update the information about the read groups, but also wanted to change their IDs to match those in the template file, you would have to use:
$ mimodd reheader -H example_input --rg update template.sam 001 : 752 002 : 753 --rgm 001 : 752 002 : 753
@HD VN:1.0 SO:coordinate
@RG ID:752 SM:sample 1 DS:labbook #1, exp #216 DT:2013-04-30
@RG ID:753 SM:sample 2 DS:labbook #1, exp #217 DT:2013-04-28
@SQ SN:chrI LN:15072423
@SQ SN:chrII LN:15279345
@SQ SN:chrIII LN:13783700
Providing the mapping twice is necessary in this case because the changes of
IDs according to the --rgm option are always made after any operations
specified by --rg.
The sort tool:¶
Sort the reads in a BAM file.
command line usage¶
mimodd sort <input file> [OPTIONS]
available options:
-o <file>,--ofile <file>: [default: stdout]- redirect the output to the specified file
--oformat <bam|sam> : [default: bam]
-n, --by_name
-l <compression level> : [default: 6 (standard BAM compression level)]
-m <memory>, --memory <memory> : [default: config settings]
-t <threads>, --threads <threads> : [default: config settings]
By default, the tool sorts the reads in a BAM input file by their mapped
reference coordinates. The -n switch can be used to sort by read names
instead.
The -l, -m, and -t options let you tweak the tool for optimal
performance on your system. Specifically, they control the compression level to
use for intermediate and final output, how much memory in GB to use for the
sorting and how many threads to engage in it. Since the default values for
-m and -t come from MiModD’s configuration settings (see Configuring
MiModD for your system)
they should be adequate in most situations.
Galaxy interface¶
Galaxy tool name : Sort BAM File
The Galaxy tool version exposes the --by_name and --oformat options of
the command line tool as check boxes and should be self-explaining.
The covstats tool:¶
Provide a coverage summary report for output from the varcall tool.
command line usage¶
mimodd covstats <input bcf file> [-o <file>]
available options:
-o <file>,--ofile <file>: [default: stdout]- redirect the output to the specified file
The snpeff-genomes tool:¶
Query the host machine’s SnpEff installation for registered and installed (i.e., usable) genome annotation files.
command line usage¶
mimodd snpeff-genomes <input bcf file> [OPTIONS]
available options:
-o <file>,--ofile <file>: [default: stdout]- redirect the output to the specified file
-c <path>,--config <path>- use this path to locate the SnpEff installation directory instead of looking it up in the MiModD configuration file
The cloudmap tool:¶
Generate CloudMap-compatible output from a vcf file.
command line usage¶
mimodd cloudmap <vcf input file> <analysis mode>
[reference sample name] <sample name> <[OPTIONS]
available options:
-o <file>,--ofile <file>: [default: stdout]- redirect the output to the specified file
-s <file>,--seqdict <file>- generate also the sequence dictionary required by the CloudMap Variant Density Mapping and Hawaiian Mapping tools and write it to the specified file
This tool takes a multi-sample vcf input file (as it can be generated by the varextract tool and generates single-sample vcf from it that is compatible with the mutation mapping tools provided by CloudMap.
Together with the input file name you must specify the intended analysis mode as one of “EMS”, “VARIANT”, “HAWAIIAN” indicating which of the three existing CloudMMap mapping tools (EMS Variant Density Mapping, Variant Discovery Mapping and Hawaiian Variant Mapping you are going to use the output for.
This choice determines the meaning of the reference sample and sample name parameters:
- In EMS mode, a reference sample cannot be specified and sample name must be an existing sample name in the vcf input file, for which you wish to generate mutation density plots in CloudMap.
- In VARIANT mode, the sample name also indicates the sample for which CloudMap will generate mutation density plots, but the variants that will be retained in the output and, thus, will be mapped by CloudMap are only those not present in the reference sample.
- HAWAIIAN mode, finally, is the reverse of VARIANT mode in that only those variants from the specified sample will be included in the output which the reference sample is homozygous mutant for.
Galaxy interface¶
Galaxy tool name : Prepare Variant Data for Mapping
The Galaxy tool has the same functionality as its command line counterpart, but the input form changes to include or exclude the field for entering a reference sample name depending on the selected analysis mode.
The tool interface in EMS mode ...
... and in Hawaiian mode.
The Core Tools¶
The snap tool:¶
Align sequence reads to a reference genome.
command line usage¶
mimodd snap <sequencing mode> <reference genome>|<index directory>
<input file> [<input file2>] -o <file> [OPTIONS]
available options:
--iformat <fastq|gz|sam|bam> : [default: bam]
--oformat <sam|bam> : [default: bam]
--header <header file>
-t <threads>, --threads <threads> : [default: config settings]
-q, --quiet
-v, -verbose
[advanced options]
The tool aligns NGS reads in fastq (gzipped or uncompressed), SAM or BAM format
(specified with --iformat) to a reference genome using the SNAP aligner.
Besides an input file (or two in case of paired-end reads in fastq
format) of reads, it expects a specification of the sequencing mode,
which can either be “single” or “paired”, and a reference genome to
align the reads against. The tool will build a temporary index of the
reference genome (to make the index permanent, specify a directory to
hold it with the advanced option --idx-out). Alternatively, a
precalculated index (through a previous call of the tool or through the
snap-index tool) may be specified. Whether a reference genome or an
index is provided will be auto-detected.
The tool requires your input file(s) to have associated header
information. If this is not the case, e.g., with fastq input files, you
must provide a SAM file with that information through the --header
option. You can also use this option to exchange existing header
information in the input just for the alignment.
Note that, currently, the tool does not support printing to the standard output, so specifying an output file with the -o option is mandatory.
Galaxy interface¶
Galaxy tool name : -
Note
This tool is not directly accessible from Galaxy. The Galaxy interface of MiModD uses the snap-batch tool instead.
The snap-batch tool:¶
Execute a batch of snap tool calls automatically in an optimized way and
merge the results of all calls into a single aligned reads file with several
read groups.
command line usage¶
mimodd snap-batch -s <mimodd snap command line> [mimodd snap command line]...
or:
mimodd snap-batch -f <input file with snap command lines>
The individual commands passed through the -s option or in the file
specified with -f should be valid command lines for the mimodd snap
tool, but without the leading mimodd.
In case of the -s option, individual commands should be enclosed in
quotes to allow command grouping.
With the -f option, the file should have one snap command per line.
The name of the output file and its format (SAM or BAM) are taken from
the -o and --oformat options specified for the first snap call.
Both options are ignored when used as part of later snap calls.
The advanced option --sort is treated the same way.
Galaxy interface¶
Galaxy tool name : Snap Read Alignment
A few restrictions currently apply when using this tool from the Galaxy interface:
As explained above, all parameters except for the output file, its format and sort order can differ between the indiviual snap calls specified when the tool is used from the command line.
In contrast, all but the input file format, and optional custom input file header information are shared between calls when the tool is used from Galaxy. This behaviour helps to keep the Galaxy interface relatively simple and is good enough for aligning reads from several samples and/or read groups against a common reference with common settings.
For more complex tasks, however, you will have to use the snap-batch command line interface.
The Snap Read Alignment tool with standard options.
The advanced options section, which can be expanded by selecting change settings under further parameter settings.
The snap-index tool:¶
Generate an index of a reference genome for use by the snap and
snap-batch tools.
command line usage¶
mimodd snap-index <reference genome> <index_directory> [OPTIONS]
-s <seed size>
-h <slack>
-t <threads>, --threads <threads> : [default: config settings]
Use of this tool is rarely necessary since snap and snap_batch call the underlying indexing function internally if needed.
The varcall tool:¶
Predict single-nucleotide variants (SNVs) and indels with respect to a reference genome in one or more groups of aligned reads.
command line usage¶
mimodd varcall <reference genome> <input file> [input file]... [OPTIONS]
available options:
-o <vcf output file>,--ofile <vcf output file>: [default: stdout]- redirect the output to the specified file
-c <cov output file>
-s <vcf white-list file> [output file]
-d <max per-BAM depth>
-i
-q, --quiet
-v, --verbose
Optionally, the tool can produce additional output for positions listed
in the vcf file specified with the -s option. This output can be
redirected to a file by providing the optional output file as the second
argument following -s.
It can also produce a cov file, specified with the -c option, with
sample-specific coverage information for every position of the reference
genome. varcall offers the choice between calling variants on a per-read
group (enabled with the -i switch) or a per-sample basis (the default).
This choice also affects the optional additional files if produced.
The -q switch suppresses the original output from the underlying
samtools mpileup and bcftools view command, while the -v(erbose) switch
enables additional messages from MiModD.
Galaxy interface¶
Galaxy tool name : Variant Calling
The tool lets you choose a reference genome and one or more input files of aligned reads from drop-down menus.
Toggle switches can be used to enable or disable the coverage and white-list sites output and to set the mode of read grouping.
The maximum per-BAM depth can also be set.
The varextract tool:¶
Extract variant sites from BCF input as generated by varcall and report them in VCF format.
command line usage¶
mimodd varextract <input bcf file> [OPTIONS]
-o <file>,--ofile <file>: [default: stdout]- redirect the output to the specified file
-p <vcf file> [<vcf file>]..., --pre-vcf <vcf file> [<vcf file>]...
-a, --keep-alts
-v, --verbose
The delcall tool:¶
Predict deletions in one or more groups of aligned paired-end reads.
command line usage¶
mimodd delcall <input bam file> [input bam file]... <coverage file> [OPTIONS]
-o <file>,--ofile <file>: [default: stdout]- redirect the output to the specified file
--max-cov <maximal coverage>
--min-size <minimal size>
-u,include-uncovered- include uncovered regions (even if they do not qualify as deletions)
-i, --group-by-id
-v, --verbose
The tool predicts deletions in one or more samples of aligned paired-end reads from the specified input bam files based on coverage of the reference genome (taken from the input cov file) and on observed insert sizes around the region.
By default, the tool reports uncovered regions, for which their is
statistical evidence that they are real deletions. This behaviour can be
changed so that any uncovered region will be included with the -u
switch.
The output (in gff format) can be redirected from the standard output to
a file specified with the -o option.
The --max-cov and --min-size options can be used to set coverage and
size thresholds for reporting uncovered regions/deletions.
Like in varcall, there is an -i switch to control the mode of read
grouping, but its meaning is slightly different. If enabled reads are
grouped strictly by their read groups, but the default is to group reads
by their associated sample names only for the coverage analysis, i.e.,
to detect uncovered regions. The statistical test for real deletions,
however, will always be performed on a per-read group basis.
Note
This tool requires paired-end data for the statistical assessment of deletions, but single-end data with the same sample name can be used in and improve the initial coverage analysis.
Galaxy interface¶
Galaxy tool name : Deletion Prediction for Paired-end Data
You can select one or more bam input files and a cov file from drop-down menus.
You can provide custom values for the maximal coverage allowed inside a region to consider it as a deletion and the minimal deletion size.
Toggle switches let you control whether to inlude uncovered regions that are not statistically significant deletions in the output and how to group reads from the input bam file(s).
The Data Exploration Tools¶
The vcf-filter tool:¶
Filter multi-sample vcf files, like the ones generated by varcall, by
chromosomal region, variant type (i.e., INDEL or single nucleotide change), or
by sample-specific genotypes and coverage.
command line usage¶
mimodd vcf-filter <input file> [OPTIONS]
-o <file>,--ofile <file>: [default: stdout]- redirect the output to the specified file
-s <sample name> [sample name]...,--samples ...- one or more names of samples that the sample-specific filters
--gt,--dpand--gqshould work on --gt <genotype pattern> [genotype pattern]...- one or more genotype patterns of the form x/x[, x/x] where x=0 indicates a
reference allele and x=1 a variant allele; exactly one genotype pattern
(possibly composed of several comma-separated genotypes) needs to be given
per sample name specified through
-sand variants will be retained only if every listed sample has a genotype contained in its corresponding genotype patterns; use ANY as the genotype pattern to exclude a sample from being used in genotype filtering --dp <depth> [depth]...- depth of coverage threshold; exactly one value needed for every sample
specified through
-s; variants will only be retained if every sample declared with-sshows a read coverage of the variant site at least equal to the corresponding threshold; use 0 to skip the requirement for a given sample --gq <genotype quality> [genotype quality]- genotype quality threshold; exactly one value needed for every sample
specified through
-s; variants will only be retained if every sample declared with-shas a quality score for its predicted genotype equal to or higher than the corresponding threshold; specifying 0 as the threshold effectively skips gq filtering for a given sample -r <chromosome>:[start-stop] [chromosome:[start-stop]]...,--region ...- retain variants only if they fall into one of the given chromosomal regions (specified in the format chrom:start-stop or chrom: for a whole chromosome)
-i``|-I``,--no-indels``|–indels-only``- mutually exclusive options to filter out or retain only indels
--vfilter <sample name> [sample name]...- filter by sample name(s); instead of filtering variants, this option will retain sample-specific information about each variant only for the specified samples; if several sample names are provided their order also determines the order of sample-specific information in the output; useful for demultiplexing and restructuring (the sample-specific part of) multi-sample vcf files
To filter variants based on sample-specific information, first use the -s
option to specify one or more samples by name to which the filters should be
applied. Then use the --gt, --dp or --gq options to specify the
filter criteria. Each of these options, when present, has to be followed by a
number of filters matching the number of samples declared with -s and the
filters are associated with the samples by position on the command line.
–gt sets genotype filters. Currently, 0/0 (homozygous ref allele), 0/1 (heterozygous) and 1/1 (homozygous alt allele) are allowed as <genotype patterns>. Several genotypes can be specified by separating them with commas. The keyword ANY is allowed as a placeholder (i.e., no genotype filter should be applied to the given sample), and this should be preferred over the alternative 0/0,0/1,1/1 notation for clarity, briefity and efficiency.
–dp sets coverage filters. To pass the filter a variant site has to be covered by at least as many sample-specific reads as set with –dp. Consequently, 0 can serve as a placeholder here.
–gq sets genotype quality filters. To pass the filter the quality of the genotype call for the relevant sample has to be at least as high as the filter criterion. Again, 0 has a placeholder function.
All filters discussed above are subtractive and line-based, i.e., if, and only if, a variant passes ALL filters, the full entry is retained.
In addition, the tool supports one column-based/vertical filter via the –v_filter option. If used, then of the sample-specific columns of any line (that passes all other filters) only those describing samples specified with the option are retained.
Examples:
Example 1¶
Of a vcf input file variants_in.vcf with a sample called sample1 (and possibly others) retain all variants for which sample1’s genotype is 0/1 (heterozygous) or 1/1 (homozygous for the variant allele):
$ mimodd vcf-filter variants_in.vcf -s sample1 --gt 0/1,1/1
Example 2¶
Of the same file, retain all variants, for which sample1’s genotype is 0/1 or 1/1 AND for which sample2’s genotype is 0/0 (homozygous for the reference allele):
$ mimodd vcf-filter variants_in.vcf -s sample1 sample2 --gt 0/1,1/1 0/0
Example 3¶
With the same input again, retain all variants for which the genotype of sample1 is 0/1 or 1/1 AND for which sample1 AND sample2 show at least 3-fold coverage (the genotype of sample2 is not used for filtering):
$ mimodd vcf-filter variants_in.vcf -s sample1 sample2 --gt 0/1,1/1 ANY --dp 3 3
Example 4¶
Retain all variants found on chromosomes chrI and chrII for which the genotype of sample1 is 1/1 AND that of sample2 is 0/0. For these variants, report only sample-specific information for sample1:
$ mimodd vcf-filter variants_in.vcf -s sample1 sample2 --gt 1/1 0/0 --region chrI: chrII: --vfilter sample1
The annotate tool:¶
Generate annotated output for called variants to simplify data exploration
command line usage¶
mimodd annotate <input file> [OPTIONS]
available options:
-o <file>,--ofile <file>: [default: stdout]- redirect the output to the specified file
-f <html|text>,--oformat <html|text>: [default: html]- output format
--grouping <by_sample|by_genes>- group annotated variants by sample or by affected gene instead of simply keeping the input file order
-l <link formatter file>,--link <link formatter file>- file to read custom hyperlink formatting instructions from; effective only
with
-f html --species <species>- the name of the species to assume for the annotations
In addition, the following SnpEff-specific options are supported:
-g <genome>,--genome <genome>- use SnpEff with <genome> as its genome file to annotate variants with their
effects on genes and transcripts; the exact name of the file has to be
provided; use the
snpeff-genomestool to get a list of all available files -s <file>,--snpeff-out <file>- generate an additional file <file> which holds the original SnpEff output
--stats <file>- generate an additional file <file> with the SnpEff results summary
-m <memory>,--memory <meemory>: [default: config settings]- maximal memory to use in GB
--minC <coverage threshold>
--minQ <qualtiy threshold>
--no-downstream,--no-upstream,--no-intron,--no-intergenic,--no-utr- do not annotate downstream, upstream, intron, intergenic or UTR effects, respectively
--ud <distance>: [default: 500]- specify the upstream/downstream interval length, i.e., variants more than <distance> nts from the next annotated gene are considered to be intergenic
--chr
This tool provides information about the genes and transcripts affected by the variants in a vcf input file.
It uses SnpEff to add annotations to the vcf input, and an installed annotated SnpEff genome has to be specified, from which the annotations will be extracted. The snpeff_genomes tool (see below) can be used to get a list of all installed SnpEff genomes on your system.
By default, the tool generates a single output file specified with the -o option. This is a html file with a table of all variants, which transcripts they affect and, possibly, hyperlinks to the relevant genes and genome views (supported for selected organisms).
The –group_by_sample switch affects the order of this table, changing it from position-based with interleaved samples (the default) to sorted by sample first.
On demand, the original annotated vcf generated by SnpEff can be stored in a separate file specified with the -s option.
In addition, an optional SnpEff summary file of the variant effects can be specified with the –stats option.
The -q and -v switches have their usual MiModD meaning, i.e., to suppress the original output from SnpEff and to enable verbose output from the annotate tool itself, respectively.
All other options affect the behaviour of SnpEff: