MiModD User Guide: Tool Documentation ************************************* As an end-user you have two possibilities to work with MiModD: you can use it as a suite of tools from the command line or, more conveniently, through its built-in Galaxy interface. The first has the advantage of not requiring any additional software installation and avoids the memory and performance overhead of running the Galaxy server. The later clearly provides the more intuitive user-interface, particularly for beginners. In this chapter we will explain both the command line usage and the corresponding Galaxy wrapper for each tool in the MiModD package. For most tools, the relationship between their command line parameters and their configuration in Galaxy should be obvious, but we will point out differences between the two interfaces when they are important. Overview of the MiModD tools ============================ The following is a list of all command line tools currently built into MiModD together with a brief description of the purpose of each tool:: info retrieve information about the samples encoded in a file for various supported formats header generate a SAM format header from an NGS run description convert convert between different sequenced reads file formats reheader from a BAM file generate a new file with specified header sections modified based on the header of a template SAM file sort sort a BAM file by coordinates (or names) of the mapped reads snap align sequence reads using the SNAP aligner snap-batch run several snap jobs and pool the resulting alignments into a multi-sample SAM/BAM file snap-index index a reference genome for use with the SNAP aligner varcall predict SNPs and indels in one or more aligned read samples and calculate the coverage of every base in the reference genome using samtools/bcftools varextract extract variant sites from BCF input as generated by varcall and report them in VCF covstats summary coverage statistics for varcall output delcall predict deletions in one or more samples of aligned paired- end reads based on coverage of the reference genome and on insert sizes vcf-filter extract lines from a vcf variant file based on sample- and field-specific filters annotate annotate a vcf variant file with information about the affected genes snpeff-genomes list installed SnpEff genomes cloudmap generate CloudMap-compatible output from a vcf file as you can obtain it, from the command line, by running:: mimodd --help You can run any individual tool through:: mimodd Additional help on each tool and the arguments it accepts can be obtained by:: mimodd --help , e.g.:: $ mimodd header --help usage: header [-h] [--rg-id RG_ID [RG_ID ...]] [--rg-sm RG_SM [RG_SM ...]] [--rg-cn RG_CN [RG_CN ...]] [--rg-ds RG_DS [RG_DS ...]] [--rg-dt RG_DT [RG_DT ...]] [--rg-lb RG_LB [RG_LB ...]] [--rg-pl RG_PL [RG_PL ...]] [--rg-pi RG_PI [RG_PI ...]] [--rg-pu RG_PU [RG_PU ...]] [--co [COMMENT [COMMENT ...]]] [-o OFILE] [-x] optional arguments: -h, --help show this help message and exit -o OFILE, --ofile OFILE redirect the output to the specified file (default: stdout) -x, --relaxed do not enforce a sample name to be specified for every read group read group description: --rg-id RG_ID [RG_ID ...] one or more unique read group identifiers --rg-sm RG_SM [RG_SM ...] one sample name per read group identifier --rg-cn RG_CN [RG_CN ...] one sequencing center name per read group --rg-ds RG_DS [RG_DS ...] one description line per read group --rg-dt RG_DT [RG_DT ...] date runs were produced (YYYY-MM-DD); one date per read group --rg-lb RG_LB [RG_LB ...] library identifier for each read group --rg-pl RG_PL [RG_PL ...] sequencing platform/technology used to produce each read group --rg-pi RG_PI [RG_PI ...] predicted median insert size for the reads of each read group --rg-pu RG_PU [RG_PU ...] platform unit, e.g., flowcell barcode or slide identifier, for each read group other information: --co [COMMENT [COMMENT ...]] an arbitrary number of one-line comment strings If you have enabled MiModD in your local Galaxy installation, the same tools should be available (under slightly different names) in the MiModD section of the Galaxy Tools bar. This is a screenshot of what the section should look like: .. image:: /images/MiModD_Galaxy_Toolbar.png While the MiModD Galaxy tools are really just wrappers around the tools available from the command line, they provide a graphical interface that typically is easier to handle for the new or occasional user. As an example, here is the Galaxy interface for the *mimodd header* command (called *NGS Run Annotation* in Galaxy): .. image:: /images/NGS_Run_Annotation.png In addition, to providing user-friendly access to individual tools, Galaxy also lets you combine individual tools into complex workflows without advanced knowledge of shell syntax or scripting languages. The chapter `MiModD and Galaxy `__ in this guide explains how MiModD lets you run complete analyses of NGS data - from sequenced reads to lists of variants - as a single Galaxy workflow without user intervention. -------------- The MiModD tools in detail ========================== The MiModD tools can be grouped in three categories: - **the format converter and query tools (info, header, convert, reheader, sort, covstats, snpeff-genomes, cloudmap)** help you annotate and arrange your sequencing data and let you convert between different formats for storage and use by other MiModD and external tools. - **the core tools (snap, snap-batch, snap-index, varcall, varextract and delcall)** do the computation-intensive analysis of the data, aligning sequence reads to reference genomes and detecting sequence variants (currently SNPs, indels and deletions) with respect to the reference. - **the data exploration tools (annotate and vcf-filter)** let you annotate and filter the results of these analyses and, thus, help you to extract biologically relevant information. -------------- The Format Converter and Query Tools ------------------------------------ .. _info: The info tool ~~~~~~~~~~~~~ Provide information about file contents. command line usage .................. :: mimodd info [OPTIONS] The tool expects an input file and reports the metadata it finds encoded in it. It works with and auto-detects almost every file format used anywhere in the MiModD package, i.e., SAM, BAM, vcf, bcf and fasta, and is useful for quick checks of file contents. The tool can produce either plain-text or html output, which, like for many MiModD tools, is printed to the standard output (stdout, i.e., the console/terminal window) by default, but can be redirected with the ``-o`` or ``--ofile`` option to the specified file. available options: ``-o ``, ``--ofile `` : [default: stdout] redirect the output to the specified file ``--oformat `` : [default: txt] ``-v``, ``--verbose`` Galaxy interface ................ .. _galaxy-info: Galaxy tool name : Retrieve File Information .. image:: /images/tool_doc/retrieve_file_info.png Choose the input file from a drop-down menu of available files from your user history, then click Execute to run the tool. In Galaxy, a tool's output goes to a file. Galaxy stores these files in an internal database and presents them to you in the history pane. Upon running the tool, its results file will automatically be added to your history. Click on the eye symbol to see the file contents in the main pane. -------------- .. _header: The header tool ~~~~~~~~~~~~~~~ Construct a SAM format file header from information about read groups and comments, which can be used by other tools (reheader, convert and snap) to annotate sequence reads files with metadata. command line usage .................. :: mimodd header [--rg-id [RG_ID]...] [--rg-sm [RG_SM]...] [--rg-cn [RG_CN]...] [--rg-ds [RG_DS]...] [--rg-dt [RG_DT]...] [--rg-lb [RG_LB]...] [--rg_pl [RG_PL]...] [--rg_pi [RG_PI]...] [--rg_pu [RG_PU]...] [--co [COMMENT]... ] [OPTIONS] available options: ``-o ``, ``--ofile `` : [default: stdout] redirect the output to the specified file ``-x``, ``--relaxed`` do not enforce a sample name to be specified for every read group A read group in SAM/BAM format refers to a group of sequences that form a logical unit and should be treated together. Typically, a read group comprises sequence reads obtained from one sample in a particular sequencing run. You can add information about one or several read groups to the *mimodd header* output through the following command line arguments: ``--rg-id [ID]...`` Read group IDs are unique strings unambiguously identifying a read group. They may be provided by the sequencing center generating the data, but you can choose any ID(s) as long as you make sure that your naming scheme does not cause ID collisions in downstream analysis steps. ``--rg-sm [sample name]...`` For each read group declared by providing an ID for it, you **must** specify a sample name using this option (use the ``-x`` switch to make sample names optional). The sample name should identify the biological sample that the read group provides sequencing data for. ``--rg-cn [sequencing center]...`` Optional name of the sequencing center that generated the data for the read group(s). ``--rg-ds [description]...`` Optional one-line read group description(s). ``--rg-dt [YYYY-MM-DD]...`` Optional information in YYYY-MM-DD date format about when the data for the read group(s) was generated. ``--rg-lb [lib identifier]...`` Optional library identifier(s). A unique identifier of the sequence library that was used for sequencing should be given here. ``--rg-pl [platform name]...`` Optional name of the sequencing technology that was used to produce the data for the read group(s). Should be one of the following officially supported names: CAPILLARY, LS454, ILLUMINA, SOLID, HELICOS, IONTORRENT, PACBIO. ``--rg-pi [insert size]...`` Optional value(s) for the predicted median insert size between sequence reads of the read group(s) on the sequenced DNA fragments. Only applicable for paired-end sequencing data. ``--rg-pu [unit ID]...`` Optional identifier of the physical sequencing unit that the data for the read group(s) was generated on. Depending on the sequencing platform that was used (see the --rg_pl option) this could be, e.g., a flowcell-barcode or a sequencing slide. The number of read groups declared in the resulting SAM header is determined by the number of read group IDs and matching number of sample names specified through the ``--rg-id`` and ``--rg-sm`` options. All other information for any read group is optional. If you do not want to provide a certain parameter for any read group you can simply omit the option. To provide a given parameter for some, but not all of the specified read groups, use the empty string ``""`` as a placeholder. Pairing between parameters is determined by the input order. Example 1 ,,,,,,,,, **Specify two read-groups with optional sequencing date information** :: $ mimodd header --rg-id 001 002 --rg-sm sample1 sample2 --rg-dt 2014-10-23 2014-09-14 @HD VN:1.5 @RG ID:001 SM:sample1 DT:2014-10-23 @RG ID:002 SM:sample2 DT:2014-09-14 Example 2 ,,,,,,,,, **Specify three read-groups with dates, but a description only available for the second** :: $ mimodd header --rg-id 001 002 003 --rg-sm sample1 sample2 sample3 --rg-dt 2014-10-23 2014-09-14 2014-11-09 --rg-ds "" "WT control" "" @HD VN:1.5 @RG ID:001 SM:sample1 DT:2014-10-23 @RG ID:002 SM:sample2 DS:WT control DT:2014-09-14 @RG ID:003 SM:sample3 DT:2014-11-09 In addition to or instead of read group information the *header* tool also lets you specify an arbitrary number of comment lines. These are not bound to any specific read group and should, generally, be used for remarks concerning the read groups or their analysis as a whole though you can, of course, *refer* to a particular read group in the comment text. Example 3 ,,,,,,,,, **Generate the three-sample header above, but with two general comment lines appended** :: $ mimodd header --rg-id 001 002 003 --rg-sm sample1 sample2 sample3 --rg-dt 2014-10-23 2014-09-14 2014-11-09 --rg-ds "" "WT control" "" --co "data from the xx sequencing project" "provided by yy lab" @HD VN:1.5 @RG ID:001 SM:sample1 DT:2014-10-23 @RG ID:002 SM:sample2 DS:WT control DT:2014-09-14 @RG ID:003 SM:sample3 DT:2014-11-09 @CO data from the xx sequencing project @CO provided by yy lab Note the use of quotes to group elements. Without them a new comment line would be started after every whitespace. Example 4 ,,,,,,,,, **Generate a header without read groups and only comment lines** :: $ mimodd header --co "forgot to say:" this @HD VN:1.5 @CO forgot to say: @CO this While such a header may not be terribly useful on its own, it can be used with the reheader_ tool to add comments to an existing header. .. Note:: Within MiModD the order of comment lines will, generally, be preserved, but this is not a file-format guarantee, i.e., other software is free to change it. The difference between read group ID and sample name ,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,, A frequent You have sequenced two DNA preparations (DNA_prep1 and DNA_prep2) and you store the reads from the two samples in one input file. A minimal header for this file can be generated with: .. code:: mimodd header --rg-id first second --rg-sm DNA_prep1 DNA_prep2 resulting in this SAM header: :: @HD VN:1.0 @RG SM:DNA_prep1 ID:first @RG SM:DNA_prep2 ID:second @SQ LN:0 SN: Assume you are not satisfied with the reads from DNA_prep1. You resequence that sample and store the reads in the same file. The new header can be generated with: .. code:: mimodd header --rg_id first second third --rg_sm DNA_prep1 DNA_prep2 DNA_prep1 --co "resequenced prep 1 hoping for better quality" The SAM header now looks like this: :: @HD VN:1.0 @RG SM:DNA_prep1 ID:first @RG SM:DNA_prep2 ID:second @RG SM:DNA_prep1 ID:third @SQ SN: LN:0 @CO resequenced prep 1 hoping for better quality .. Note: A focus on multi-sample files is a distinguishing feature of MiModD that accounts for much of its simplicity! Be sure you understand the relationship between read group IDs and sample names since it is a recurring theme with many MiModD tools. Galaxy interface ................ .. _galaxy-header: Galaxy tool name : NGS Run Annotation This tool is more limited than its command line counterpart in that it only allows a **single read group** to be specified and described. There is no possibility to incorporate reference sequence information or comments. The main purpose of the tool is to generate annotation for unaligned single-sample input files that can be merged with this data by the `Convert `_ and `Snap Read Alignment `_ tools. In addition, it can be used in combination with the `Reheader `_ tool to modify or replace individual read groups in multi-sample BAM files. .. image:: /images/tool_doc/ngs_run_anno.png -------------- .. _convert: The convert tool ~~~~~~~~~~~~~~~~ Convert between different sequenced reads file formats. command line usage .................. :: mimodd convert [input file]... [OPTIONS] available options: ``-o ``, ``--ofile `` : [default: stdout] redirect the output to the specified file ``--iformat `` : [default: bam] the format of the input file(s); the input format determines - whether the ``-h`` option is applicable and - whether multiple input files are allowed and how they are used: - with input in SAM or BAM format only one input file is accepted - with input in single-end fastq format (compressed or not; format strings ``fastq``, ``gz``) multiple input files may be given and the reads from all files will be written to the output in concatenated form, i.e., as if they had come from a single source, providing an easy way to merge split fastq files into a single SAM/BAM file - with input in paired-reads fastq format (compressed or not; format strings ``fastq_pe``, ``gz_pe``) multiple input files are interpreted as alternating r1 and r2 files, i.e., every other file is expected to contain the read mates of the preceding file ``--oformat `` : [default: sam] the output format ``-h
``, ``--header
`` use the header of the provided SAM file as the output header; applicable only when ``--iformat`` specifies a headerless file format, i.e., one of ``fastq``, ``fastq_pe``, ``gz`` or ``gz_pe`` Additional specifications: Standard input can be used instead of an input file by specifying a minus sign ``-`` instead of a file name. The following table summarizes the conversions that are currently supported: +--------------------------------------+---------+----------------+-----------------+ | from | to | >1 input files | optional header | +======================================+=========+================+=================+ | fastq (or gzip compressed fastq.gz) | sam | yes | yes | +--------------------------------------+---------+----------------+-----------------+ | fastq (or gzip compressed fastq.gz) | bam | yes | yes | +--------------------------------------+---------+----------------+-----------------+ | sam | bam | no | no | +--------------------------------------+---------+----------------+-----------------+ | bam | sam | no | no | +--------------------------------------+---------+----------------+-----------------+ .. Note:: We anticipate future versions of MiModD to implement subsampling and filtering during conversions. This is the reason why, in addition to the format combinations above, the currently meaningless (unless you are looking for a slow way to copy files) *"conversions"* sam <-> sam and bam <-> bam are also accepted. Examples: Example 1 ,,,,,,,,, **View the contents of a BAM file by converting it to SAM format** :: $ mimodd convert myreads.bam Conversion from BAM to SAM is the default behavior. Example 2 ,,,,,,,,, **Convert a SAM file to BAM format** :: $ mimodd header --rg-id 007 --rg-sm "secret sample" --co "for your eyes only" | \ mimodd convert - --iformat sam --oformat bam �BCdsr�e�c``p�p� ��2�3�r��t�200� ��*NM.J-Q(N�-�I�rp��L�/R��/-RH�L-V��˩���`��J�BCw While the output is not too useful, this example demonstrates how the tool can be used in a pipe. Now just to show that the above worked as expected:: $ mimodd header --rg-id 007 --rg-sm "secret sample" --co "for your eyes only" | \ mimodd convert - --iformat sam --oformat bam | mimodd convert - @HD VN:1.5 @RG ID:007 SM:secret sample @CO for your eyes only Galaxy interface ................ .. _galaxy-convert: Galaxy tool name : Convert This tool has the exact same functionality in Galaxy and from the command line. The Galaxy input form simplifies the coice of parameters, though, by dynamically displaying only applicable options based on the selected input file format. .. image:: /images/tool_doc/convert.png -------------- .. _reheader: The reheader tool ~~~~~~~~~~~~~~~~~ command line usage .................. :: mimodd reheader [template SAM file] [--rg ignore|update|replace [RG_TEMPLATE] [RG_MAPPING]...] [--sq ignore|update|replace [SQ_TEMPLATE] [SQ_MAPPING]...] [--co ignore|update|replace [CO_TEMPLATE]] [--rgm RG_MAPPING [RG_MAPPING]...] [--sqm SQ_MAPPING [SQ_MAPPING]...] [OPTIONS] available options: ``-o ``, ``--ofile `` : [default: stdout] redirect the output to the specified file ``-H`` output only the resulting header in SAM format This complex, but powerful tool rewrites an existing ```` modifying its header on the fly based on the header information of one or several template SAM files according to at least one of these optional parameters: ``template SAM file`` if specified the header of this file provides the read group information, the sequence dictionary and the comment lines that will be used to modify the input file header. ``--rg ignore|update|replace [RG template SAM file] [RG mapping]...`` the --rg option followed by one of the keywords ignore, update or replace specifies how the read group section of the new header should be compiled. With ``ignore`` any template read group information is ignored and the read group section of the BAM input file is copied to the new file unchanged. With ``update`` the input file header is modified by template read group information according to these rules: any read groups found *only* in the BAM input *or* in the template information are copied to the new header. For read groups with matching IDs found in the BAM input *and* in the template, any information about this read group found in the template will be written to the new header (possibly overwriting input file information), but information found only in the BAM input will be copied unchanged. With ``replace`` the rules are: all template read groups are written to the new header as they are (possibly overwriting read groups with matching IDs in the BAM input and *without* information merging). Read groups found only in the BAM input file are discarded. Both ``update`` and ``replace`` may optionally be followed by - a read group-specific template file, which will be used as the source of template read group information (i.e., if the general ``template SAM file`` is also given, its read group section is ignored) and/or - a read group mapping to be used to determine read group ID matching between the BAM input file and the template. This mapping has to be provided in the exact form ``input_id : template_id`` (with spaces around the colon). Several mappings must also be separated by a space. When a mapping is given and a template read group with the ID template_id is found, it is treated as if its ID was input_id. ``--sq ignore|update|replace [SQ template SAM file] [SQ mapping]...`` the --sq option determines the use of template sequence information in just the same way as --rg does for read groups. The only differences are that sequences will be matched by their name (SN header values) and that the length (LN value) and the MD5 checksum (M5 value) of a sequence will *never* be modified when rewriting sequence information from the BAM input to the new header. ``--co ignore|update|replace [template SAM file for comments]`` determines how template comments are used. They can be ignored, appended to (with update) any comments found in the BAM input file or be used to replace any comments from the BAM input. As with the --rg and --sq options, an optional specific template SAM file can also be specified. ``--rgm RG mapping [RG mapping]...`` this option lets you specify one or read group mappings (in the format shown above for the --rg option) that are used to *rename* read groups. Importantly, renaming is always happening *after* all other modifications and the input_id(s) given in the mapping(s) have to still exist at that point. ``--sqm SQ mapping [SQ mapping]...`` equivalent to --rgm for renaming sequence names Additional specifications: - A minus sign ``-`` can be used in place of any of the different template file names that the tool supports indicating that the respective header template(s) should be read from standard input. This feature makes it possible to use the tool efficiently in pipes, e.g., with input coming from the header_ tool. - If a general ``template SAM file`` is specified the tool's default for the ``--rg``, ``--sq`` and ``--co`` will be set to ``replace``, otherwise the defaults will be ``ignore``. - The tool autodetects header changes (changes in read group IDs or sequence names, elimination of read groups or sequence records) that may be incompatible with records in the body of the BAM file. In these situations, it will scan the file body for incompatible records and try to fix them or, failing that, will terminate with an error message. In this way, the tool guarantees that its reheadered output will always be a *valid* BAM file. On the other hand, the scan causes very significant runtime increases leading to the folowing recommendations: - reserve the use of the ``--rgm`` and ``--sqm`` options for vital cases - use ``replace`` mode for read groups and sequence names (``--rg`` and ``--sq`` options) with caution and take care not to remove read group or sequence records from the final header that get referred to in the body of the BAM file. Examples: The following examples assume that you have a BAM input file named "example_input" with the following header (printed in SAM format):: @HD VN:1.0 SO:coordinate @RG ID:001 SM:sample 1 DT:2013-04-30 @RG ID:002 SM:sample 2 DT:2013-04-28 @SQ SN:chrI LN:15072423 @SQ SN:chrII LN:15279345 @SQ SN:chrIII LN:13783700 This file contains reads from two samples mapped against a common reference genome, i.e. the sequences *chrI*, *chrII*, *chrIII* and the read group IDs *001* and *002* will be referenced in the body of the file. .. Note:: The reheader tool is quite complex and may take some practice to use efficiently. The ``-H`` option instructs the tool to generate only the new header in human readable SAM format and we encourage beginners to use this option to check the outcome of their parameter settings before producing the full reheadered BAM file. Example 1 ,,,,,,,,, **Adding descriptions to existing read groups** :: $ mimodd header -x --rg-id 001 002 --rg-ds "labbook #1, exp #216" "labbook #1, exp #217" | \ mimodd reheader -H example_input --rg update - @HD VN:1.0 SO:coordinate @RG ID:001 SM:sample 1 DS:labbook #1, exp #216 DT:2013-04-30 @RG ID:002 SM:sample 2 DS:labbook #1, exp #217 DT:2013-04-28 @SQ SN:chrI LN:15072423 @SQ SN:chrII LN:15279345 @SQ SN:chrIII LN:13783700 Note that, for this illustration, we used the ``-H`` option to obtain only the resulting header in SAM format. We decided to generate the SAM header template with the descriptions to be added using the header_ tool. Use of the ``-x`` option saved us from having to redundantly specify the sample names again. We piped the output directly to the reheader tool to use it as the ``--rg`` template. ``update`` mode allows us to *add* the descriptions to the existing information of each read group. Using ``replace`` instead would have produced this unintended output:: @HD VN:1.0 SO:coordinate @RG ID:001 DS:labbook #1, exp #216 @RG ID:002 DS:labbook #1, exp #217 @SQ SN:chrI LN:15072423 @SQ SN:chrII LN:15279345 @SQ SN:chrIII LN:13783700 with all the existing read group information replaced by the newly porvided one. Example 2 ,,,,,,,,, **Adding comments** :: $ mimodd header --co "let me just say:" "this header needs comments" | \ mimodd reheader -H example_input --co update - @HD VN:1.0 SO:coordinate @RG ID:001 SM:sample 1 DT:2013-04-30 @RG ID:002 SM:sample 2 DT:2013-04-28 @SQ SN:chrI LN:15072423 @SQ SN:chrII LN:15279345 @SQ SN:chrIII LN:13783700 @CO let me just say: @CO this header needs comments Again, we generated the new header elements using mimodd header_. In this particular case, the same output would have been generated with ``--co replace`` mode since no comment lines existed in the original BAM file. Example 3 ,,,,,,,,, **Renaming a read group** :: $ mimodd reheader -H example_input --rgm 001 : 003 @HD VN:1.0 SO:coordinate @RG ID:003 SM:sample 1 DT:2013-04-30 @RG ID:002 SM:sample 2 DT:2013-04-28 @SQ SN:chrI LN:15072423 @SQ SN:chrII LN:15279345 @SQ SN:chrIII LN:13783700 Note that if we had not used the ``-H`` option, but had produced a full output BAM file with:: $ mimodd reheader example_input --rgm 001 : 003 -o example_output the tool would have scanned the body of the input BAM file and would have replaced any references to read group ID 001 with references to ID 003. Example 4 ,,,,,,,,, **Adding information to read groups and changing their ID in one pass** Assume you have a template SAM file named "template.sam" that has additional information about the read groups, but lists those read groups under different IDs, e.g:: @HD VN:1.0 SO:coordinate @RG ID:752 SM:sample 1 DS:labbook #1, exp #216 @RG ID:753 SM:sample 2 DS:labbook #1, exp #217 You could use:: $ mimodd reheader -H example_input --rg update template.sam 001 : 752 002 : 753 @HD VN:1.0 SO:coordinate @RG ID:001 SM:sample 1 DS:labbook #1, exp #216 DT:2013-04-30 @RG ID:002 SM:sample 2 DS:labbook #1, exp #217 DT:2013-04-28 @SQ SN:chrI LN:15072423 @SQ SN:chrII LN:15279345 @SQ SN:chrIII LN:13783700 to assign the new information to the correct read groups. If you not only wanted to update the information about the read groups, but also wanted to change their IDs to match those in the template file, you would have to use:: $ mimodd reheader -H example_input --rg update template.sam 001 : 752 002 : 753 --rgm 001 : 752 002 : 753 @HD VN:1.0 SO:coordinate @RG ID:752 SM:sample 1 DS:labbook #1, exp #216 DT:2013-04-30 @RG ID:753 SM:sample 2 DS:labbook #1, exp #217 DT:2013-04-28 @SQ SN:chrI LN:15072423 @SQ SN:chrII LN:15279345 @SQ SN:chrIII LN:13783700 Providing the mapping twice is necessary in this case because the changes of IDs according to the ``--rgm`` option are always made *after* any operations specified by ``--rg``. Galaxy interface ................ .. _galaxy-reheader: Galaxy tool name : Reheader BAM file .. image:: /images/tool_doc/reheader.png .. image:: /images/tool_doc/reheader_rg.png -------------- .. _sort: The sort tool: ~~~~~~~~~~~~~~ Sort the reads in a BAM file. command line usage .................. :: mimodd sort [OPTIONS] available options: ``-o ``, ``--ofile `` : [default: stdout] redirect the output to the specified file ``--oformat `` : [default: bam] ``-n``, ``--by_name`` ``-l `` : [default: 6 (standard BAM compression level)] ``-m ``, ``--memory `` : [default: config settings] ``-t ``, ``--threads `` : [default: config settings] By default, the tool sorts the reads in a BAM input file by their mapped reference coordinates. The ``-n`` switch can be used to sort by read names instead. The ``-l``, ``-m``, and ``-t`` options let you tweak the tool for optimal performance on your system. Specifically, they control the compression level to use for intermediate and final output, how much memory in GB to use for the sorting and how many threads to engage in it. Since the default values for ``-m`` and ``-t`` come from MiModD's configuration settings (see `Configuring MiModD for your system `__) they should be adequate in most situations. Galaxy interface ................ .. _galaxy-sort: Galaxy tool name : Sort BAM File The Galaxy tool version exposes the ``--by_name`` and ``--oformat`` options of the command line tool as check boxes and should be self-explaining. .. image:: /images/tool_doc/sort.png -------------- .. _covstats: The covstats tool: ~~~~~~~~~~~~~~~~~~ Provide a coverage summary report for output from the varcall_ tool. command line usage .................. :: mimodd covstats [-o ] available options: ``-o ``, ``--ofile `` : [default: stdout] redirect the output to the specified file Galaxy interface ................ .. _galaxy-covstats: Galaxy tool name : Coverage Statistics .. image:: /images/tool_doc/coverage_stats.png -------------- .. _snpeff_genomes: The snpeff-genomes tool: ~~~~~~~~~~~~~~~~~~~~~~~~ Query the host machine's SnpEff installation for registered and installed (i.e., usable) genome annotation files. command line usage .................. :: mimodd snpeff-genomes [OPTIONS] available options: ``-o ``, ``--ofile `` : [default: stdout] redirect the output to the specified file ``-c ``, ``--config `` use this path to locate the SnpEff installation directory instead of looking it up in the MiModD configuration file Galaxy interface ................ .. _galaxy-snpeff-genomes: Galaxy tool name : List Installed SnpEff Genomes .. image:: /images/tool_doc/list_snpeff_genomes.png -------------- .. _cloudmap: The cloudmap tool: ~~~~~~~~~~~~~~~~~~ Generate CloudMap-compatible output from a vcf file. command line usage .................. :: mimodd cloudmap [reference sample name] <[OPTIONS] available options: ``-o ``, ``--ofile `` : [default: stdout] redirect the output to the specified file ``-s ``, ``--seqdict `` generate also the sequence dictionary required by the *CloudMap Variant Density Mapping* and *Hawaiian Mapping* tools and write it to the specified file This tool takes a multi-sample vcf input file (as it can be generated by the varextract_ tool and generates single-sample vcf from it that is compatible with the mutation mapping tools provided by `CloudMap `__. Together with the input file name you must specify the intended analysis mode as one of "EMS", "VARIANT", "HAWAIIAN" indicating which of the three existing CloudMMap mapping tools (*EMS Variant Density Mapping*, *Variant Discovery Mapping* and *Hawaiian Variant Mapping* you are going to use the output for. This choice determines the meaning of the *reference sample* and *sample name* parameters: - In EMS mode, a reference sample cannot be specified and sample name must be an existing sample name in the vcf input file, for which you wish to generate mutation density plots in CloudMap. - In VARIANT mode, the sample name also indicates the sample for which CloudMap will generate mutation density plots, but the variants that will be retained in the output and, thus, will be mapped by CloudMap are **only those not present in the reference** sample. - HAWAIIAN mode, finally, is the reverse of VARIANT mode in that only those variants from the specified sample will be included in the output which the reference sample is homozygous mutant for. Galaxy interface ................ .. _galaxy-cloudmap: Galaxy tool name : Prepare Variant Data for Mapping The Galaxy tool has the same functionality as its command line counterpart, but the input form changes to include or exclude the field for entering a reference sample name depending on the selected analysis mode. .. figure:: /images/tool_doc/galaxy_cloudmap.png The tool interface in *EMS* mode ... .. figure:: /images/tool_doc/galaxy_cloudmap_variant.png ... and in *Hawaiian* mode. -------------- The Core Tools -------------- .. _snap: The snap tool: ~~~~~~~~~~~~~~ Align sequence reads to a reference genome. command line usage .................. :: mimodd snap | [] -o [OPTIONS] available options: ``--iformat `` : [default: bam] ``--oformat `` : [default: bam] ``--header
`` ``-t ``, ``--threads `` : [default: config settings] ``-q``, ``--quiet`` ``-v``, ``-verbose`` [advanced options] The tool aligns NGS reads in fastq (gzipped or uncompressed), SAM or BAM format (specified with ``--iformat``) to a reference genome using the SNAP aligner. Besides an input file (or two in case of paired-end reads in fastq format) of reads, it expects a specification of the sequencing mode, which can either be "single" or "paired", and a reference genome to align the reads against. The tool will build a temporary index of the reference genome (to make the index permanent, specify a directory to hold it with the advanced option ``--idx-out``). Alternatively, a precalculated index (through a previous call of the tool or through the snap-index tool) may be specified. Whether a reference genome or an index is provided will be auto-detected. The tool requires your input file(s) to have associated header information. If this is not the case, e.g., with fastq input files, you must provide a SAM file with that information through the ``--header`` option. You can also use this option to exchange existing header information in the input just for the alignment. Note that, currently, the tool does not support printing to the standard output, so specifying an output file with the -o option is mandatory. Galaxy interface ................ Galaxy tool name : - .. Note:: This tool is not directly accessible from Galaxy. The Galaxy interface of MiModD uses the snap-batch tool instead. -------------- .. _snap_batch: The snap-batch tool: ~~~~~~~~~~~~~~~~~~~~~ Execute a batch of ``snap`` tool calls automatically in an optimized way and merge the results of all calls into a single aligned reads file with several read groups. command line usage .................. :: mimodd snap-batch -s [mimodd snap command line]... or: :: mimodd snap-batch -f The individual commands passed through the ``-s`` option or in the file specified with ``-f`` should be valid command lines for the mimodd snap_ tool, but without the leading ``mimodd``. In case of the ``-s`` option, individual commands should be enclosed in quotes to allow command grouping. With the ``-f`` option, the file should have one ``snap`` command per line. The name of the output file and its format (SAM or BAM) are taken from the ``-o`` and ``--oformat`` options specified for the first ``snap`` call. Both options are ignored when used as part of later ``snap`` calls. The advanced option ``--sort`` is treated the same way. Galaxy interface ................ .. _galaxy-snap: Galaxy tool name : Snap Read Alignment A few restrictions currently apply when using this tool from the Galaxy interface: As explained above, all parameters except for the output file, its format and sort order can differ between the indiviual snap calls specified when the tool is used from the command line. In contrast, all but the input file format, and optional custom input file header information are shared between calls when the tool is used from Galaxy. This behaviour helps to keep the Galaxy interface relatively simple and is good enough for aligning reads from several samples and/or read groups against a common reference with common settings. For more complex tasks, however, you will have to use the snap-batch command line interface. .. figure:: /images/tool_doc/snap.png The Snap Read Alignment tool with standard options. -------------- .. figure:: /images/tool_doc/snap_extended.png The advanced options section, which can be expanded by selecting *change settings* under *further parameter settings*. -------------- .. _snap_index: The snap-index tool: ~~~~~~~~~~~~~~~~~~~~~ Generate an index of a reference genome for use by the ``snap`` and ``snap-batch`` tools. command line usage .................. :: mimodd snap-index [OPTIONS] ``-s `` ``-h `` ``-t ``, ``--threads `` : [default: config settings] Use of this tool is rarely necessary since snap and snap_batch call the underlying indexing function internally if needed. Galaxy interface ................ Galaxy tool name : - .. Note:: This tool is not accessiblle from Galaxy. -------------- .. _varcall: The varcall tool: ~~~~~~~~~~~~~~~~~ Predict single-nucleotide variants (SNVs) and indels with respect to a reference genome in one or more groups of aligned reads. command line usage .................. :: mimodd varcall [input file]... [OPTIONS] available options: ``-o ``, ``--ofile `` : [default: stdout] redirect the output to the specified file ``-c `` ``-s [output file]`` ``-d `` ``-i`` ``-q``, ``--quiet`` ``-v``, ``--verbose`` Optionally, the tool can produce additional output for positions listed in the vcf file specified with the -s option. This output can be redirected to a file by providing the optional output file as the second argument following ``-s``. It can also produce a cov file, specified with the -c option, with sample-specific coverage information for every position of the reference genome. varcall offers the choice between calling variants on a per-read group (enabled with the ``-i`` switch) or a per-sample basis (the default). This choice also affects the optional additional files if produced. The ``-q`` switch suppresses the original output from the underlying samtools mpileup and bcftools view command, while the -v(erbose) switch enables additional messages from MiModD. Galaxy interface ................ .. _galaxy-varcall: Galaxy tool name : Variant Calling .. image:: /images/tool_doc/variant_calling.png The tool lets you choose a reference genome and one or more input files of aligned reads from drop-down menus. Toggle switches can be used to enable or disable the coverage and white-list sites output and to set the mode of read grouping. The maximum per-BAM depth can also be set. -------------- .. _varextract: The varextract tool: ~~~~~~~~~~~~~~~~~~~~ Extract variant sites from BCF input as generated by varcall and report them in VCF format. command line usage .................. :: mimodd varextract [OPTIONS] ``-o ``, ``--ofile `` : [default: stdout] redirect the output to the specified file ``-p []...``, ``--pre-vcf []...`` ``-a``, ``--keep-alts`` ``-v``, ``--verbose`` Galaxy interface ................ .. _galaxy-varextract: Galaxy tool name : Extract Variant Sites .. image:: /images/tool_doc/extract_variant_sites.png -------------- .. _delcall: The delcall tool: ~~~~~~~~~~~~~~~~~ Predict deletions in one or more groups of aligned paired-end reads. command line usage .................. :: mimodd delcall [input bam file]... [OPTIONS] ``-o ``, ``--ofile `` : [default: stdout] redirect the output to the specified file ``--max-cov `` ``--min-size `` ``-u``, ``include-uncovered`` include uncovered regions (even if they do not qualify as deletions) ``-i``, ``--group-by-id`` ``-v``, ``--verbose`` The tool predicts deletions in one or more samples of aligned paired-end reads from the specified input bam files based on coverage of the reference genome (taken from the input cov file) and on observed insert sizes around the region. By default, the tool reports uncovered regions, for which their is statistical evidence that they are real deletions. This behaviour can be changed so that any uncovered region will be included with the ``-u`` switch. The output (in gff format) can be redirected from the standard output to a file specified with the ``-o`` option. The ``--max-cov`` and ``--min-size`` options can be used to set coverage and size thresholds for reporting uncovered regions/deletions. Like in varcall, there is an ``-i`` switch to control the mode of read grouping, but its meaning is slightly different. If enabled reads are grouped strictly by their read groups, but the default is to group reads by their associated sample names only for the coverage analysis, i.e., to detect uncovered regions. The statistical test for real deletions, however, will always be performed on a per-read group basis. .. Note:: This tool requires paired-end data for the statistical assessment of deletions, but single-end data with the same sample name can be used in and improve the initial coverage analysis. Galaxy interface ................ .. _galaxy-delcall: Galaxy tool name : Deletion Prediction for Paired-end Data .. image:: /images/tool_doc/deletion_prediction.png You can select one or more bam input files and a cov file from drop-down menus. You can provide custom values for the maximal coverage allowed inside a region to consider it as a deletion and the minimal deletion size. Toggle switches let you control whether to inlude uncovered regions that are not statistically significant deletions in the output and how to group reads from the input bam file(s). -------------- The Data Exploration Tools -------------------------- .. _vcf_filter: The vcf-filter tool: ~~~~~~~~~~~~~~~~~~~~~ Filter multi-sample vcf files, like the ones generated by ``varcall``, by chromosomal region, variant type (i.e., INDEL or single nucleotide change), or by sample-specific genotypes and coverage. command line usage .................. :: mimodd vcf-filter [OPTIONS] ``-o ``, ``--ofile `` : [default: stdout] redirect the output to the specified file ``-s [sample name]...``, ``--samples ...`` one or more names of samples that the sample-specific filters ``--gt`` , ``--dp`` and ``--gq`` should work on ``--gt [genotype pattern]...`` one or more genotype patterns of the form x/x[, x/x] where x=0 indicates a reference allele and x=1 a variant allele; exactly one genotype pattern (possibly composed of several comma-separated genotypes) needs to be given per sample name specified through ``-s`` and variants will be retained only if every listed sample has a genotype contained in its corresponding genotype patterns; use ANY as the genotype pattern to exclude a sample from being used in genotype filtering ``--dp [depth]...`` depth of coverage threshold; exactly one value needed for every sample specified through ``-s``; variants will only be retained if every sample declared with ``-s`` shows a read coverage of the variant site at least equal to the corresponding threshold; use 0 to skip the requirement for a given sample ``--gq [genotype quality]`` genotype quality threshold; exactly one value needed for every sample specified through ``-s``; variants will only be retained if every sample declared with ``-s`` has a quality score for its predicted genotype equal to or higher than the corresponding threshold; specifying 0 as the threshold effectively skips gq filtering for a given sample ``-r :[start-stop] [chromosome:[start-stop]]...``, ``--region ...`` retain variants only if they fall into one of the given chromosomal regions (specified in the format chrom:start-stop or chrom: for a whole chromosome) ``-i``|``-I``, ``--no-indels``|``--indels-only`` mutually exclusive options to filter out or retain only indels ``--vfilter [sample name]...`` filter by sample name(s); instead of filtering variants, this option will retain sample-specific information about each variant only for the specified samples; if several sample names are provided their order also determines the order of sample-specific information in the output; useful for demultiplexing and restructuring (the sample-specific part of) multi-sample vcf files To filter variants based on sample-specific information, first use the ``-s`` option to specify one or more samples by name to which the filters should be applied. Then use the ``--gt``, ``--dp`` or ``--gq`` options to specify the filter criteria. Each of these options, when present, has to be followed by a number of filters matching the number of samples declared with ``-s`` and the filters are associated with the samples by position on the command line. --gt sets genotype filters. Currently, 0/0 (homozygous ref allele), 0/1 (heterozygous) and 1/1 (homozygous alt allele) are allowed as . Several genotypes can be specified by separating them with commas. The keyword ANY is allowed as a placeholder (i.e., no genotype filter should be applied to the given sample), and this should be preferred over the alternative 0/0,0/1,1/1 notation for clarity, briefity and efficiency. --dp sets coverage filters. To pass the filter a variant site has to be covered by at least as many sample-specific reads as set with --dp. Consequently, 0 can serve as a placeholder here. --gq sets genotype quality filters. To pass the filter the quality of the genotype call for the relevant sample has to be at least as high as the filter criterion. Again, 0 has a placeholder function. All filters discussed above are subtractive and line-based, i.e., if, and only if, a variant passes ALL filters, the full entry is retained. In addition, the tool supports one column-based/vertical filter via the --v_filter option. If used, then of the sample-specific columns of any line (that passes all other filters) only those describing samples specified with the option are retained. Examples: Example 1 ,,,,,,,,, Of a vcf input file *variants_in.vcf* with a sample called *sample1* (and possibly others) retain all variants for which sample1's genotype is 0/1 (heterozygous) or 1/1 (homozygous for the variant allele):: $ mimodd vcf-filter variants_in.vcf -s sample1 --gt 0/1,1/1 Example 2 ,,,,,,,,, Of the same file, retain all variants, for which sample1's genotype is 0/1 or 1/1 AND for which sample2's genotype is 0/0 (homozygous for the reference allele):: $ mimodd vcf-filter variants_in.vcf -s sample1 sample2 --gt 0/1,1/1 0/0 Example 3 ,,,,,,,,, With the same input again, retain all variants for which the genotype of *sample1* is 0/1 or 1/1 AND for which *sample1* AND *sample2* show at least 3-fold coverage (the genotype of *sample2* is not used for filtering):: $ mimodd vcf-filter variants_in.vcf -s sample1 sample2 --gt 0/1,1/1 ANY --dp 3 3 Example 4 ,,,,,,,,, Retain all variants found on chromosomes *chrI* and *chrII* for which the genotype of *sample1* is 1/1 AND that of *sample2* is 0/0. For these variants, report only sample-specific information for *sample1*:: $ mimodd vcf-filter variants_in.vcf -s sample1 sample2 --gt 1/1 0/0 --region chrI: chrII: --vfilter sample1 Galaxy interface ................ .. _galaxy-vcf-filter: Galaxy tool name : VCF Filter .. image:: /images/tool_doc/vcf_filter.png .. image:: /images/tool_doc/vcf_filter_sample_region.png -------------- .. _annotate: The annotate tool: ~~~~~~~~~~~~~~~~~~ Generate annotated output for called variants to simplify data exploration command line usage .................. :: mimodd annotate [OPTIONS] available options: ``-o ``, ``--ofile `` : [default: stdout] redirect the output to the specified file ``-f ``, ``--oformat `` : [default: html] output format ``--grouping `` group annotated variants by sample or by affected gene instead of simply keeping the input file order ``-l ``, ``--link `` file to read custom hyperlink formatting instructions from; effective only with ``-f html`` ``--species `` the name of the species to assume for the annotations In addition, the following SnpEff-specific options are supported: ``-g ``, ``--genome `` use SnpEff with as its genome file to annotate variants with their effects on genes and transcripts; the exact name of the file has to be provided; use the ``snpeff-genomes`` tool to get a list of all available files ``-s ``, ``--snpeff-out `` generate an additional file which holds the original SnpEff output ``--stats `` generate an additional file with the SnpEff results summary ``-m ``, ``--memory `` : [default: config settings] maximal memory to use in GB ``--minC `` ``--minQ `` ``--no-downstream``, ``--no-upstream``, ``--no-intron``, ``--no-intergenic``, ``--no-utr`` do not annotate *downstream*, *upstream*, *intron*, *intergenic* or *UTR* effects, respectively ``--ud `` : [default: 500] specify the upstream/downstream interval length, i.e., variants more than nts from the next annotated gene are considered to be intergenic ``--chr`` This tool provides information about the genes and transcripts affected by the variants in a vcf input file. It uses SnpEff to add annotations to the vcf input, and an installed annotated SnpEff genome has to be specified, from which the annotations will be extracted. The *snpeff_genomes* tool (see below) can be used to get a list of all installed SnpEff genomes on your system. By default, the tool generates a single output file specified with the -o option. This is a html file with a table of all variants, which transcripts they affect and, possibly, hyperlinks to the relevant genes and genome views (supported for selected organisms). The --group_by_sample switch affects the order of this table, changing it from position-based with interleaved samples (the default) to sorted by sample first. On demand, the original annotated vcf generated by SnpEff can be stored in a separate file specified with the -s option. In addition, an optional SnpEff summary file of the variant effects can be specified with the --stats option. The -q and -v switches have their usual MiModD meaning, i.e., to suppress the original output from SnpEff and to enable verbose output from the annotate tool itself, respectively. All other options affect the behaviour of SnpEff: Galaxy interface ................ .. _galaxy-annotate: Galaxy tool name : Variant Annotation .. image:: /images/tool_doc/variant_anno.png