First steps: single-sample analysis =================================== Background: ----------- Synechocystis sp. PCC 6803 is a representative, well-studied strain of Synechocystis, a genus of cyanobacteria. It serves as a model organism for studies on photosynthesis and carbon and nitrogen metabolism. Several substrains with particular phenotypes have been derived from PCC 6803 through cultivation in different laboratories since its original isolation from a freshwater lake in 1968. One of them is the so-called Moscow substrain (PCC-M) which belongs to a subgroup of motile strains as compared to the non-motile substrain from which the PCC 6803 reference sequence was obtained in 1996. The sequence of the PCC-M strain was reported and compared to that of the PCC 6803 reference in 2012 [1]_ and in this example we will try to recapitulate that analysis using MiModD. The PCC6803.tar.gz archive file that you can obtain from the `MiModD download site `__ contains two files: - the **PCC 6803 reference sequence** in fasta format, including the bacterial chromosome, the four large and two of the three small plasmids of this strain, and - a BAM file of **NGS reads from the PCC-M substrain** (the reads are a random subset of those available through the NCBI small reads archive (SRA) under accession number `SRX043525 `__). The goal of this example is to demonstrate the use of MiModD for aligning WGS reads against a reference genome and for discovering sequence variants in the alignment with respect to that reference. In addition, we will have a first look at the variant annotation and data exploration tools of MiModD. .. _ex1-step0: Step 0: Preparing the data -------------------------- Data preparation for Galaxy ~~~~~~~~~~~~~~~~~~~~~~~~~~~ To follow this example analysis using MiModD's Galaxy interface, you first have to unpack the downloaded dataset archive into a directory of your choice, then upload the reference fasta and the BAM file of WGS reads to Galaxy to make them accessible from the graphical interface. For these relatively small files, the easiest way of doing this is to expand the *Get Data* section in the Galaxy *Tools Bar* and select *Upload File*, then *Browse* **(1)** to locate the extracted files on disk, or to pop up Galaxy's upload manager by clicking the upload icon at the top of the *Tools Bar* **(2)** to select and upload multiple files at once. .. image:: /images/Ex1_upload_data.png Once the uploads are finsihed and the files have been added to your currently active history (the right sidebar in the Galaxy window) you can proceed to Step 1. .. Note:: Uploading the data as described above will store a copy of the files in Galaxy (and use extra disk space on your machine). For a more efficient way of getting data into Galaxy see our `recipe for importing large files `__. For a general introduction to Galaxy and its terminology, you can check out the `Learn Section of the Galaxy Wiki `__. Data preparation for the command line ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Besides extracting the data files from the downloaded archive no special data preparation is required when working from the command line. Throughout this example, we will assume that you are working from the extracted directory as your current working directory. Hence, if you want to follow this example by simply copying/pasting the command line instructions shown in this section, you should, after unpacking the archive, change to the extracted directory with:: cd tutorial_ex1 To confirm that you are in the right directory, try:: ls , which should list the two data files plus a readme txt file. .. _ex1-step1: Step 1: NGS Reads Alignment with the MiModD snap tool ----------------------------------------------------- The goal of this step is to determine, for the millions of short unaligned sequences in the input BAM file, their mapping to the reference genome. To perform an alignment with MiModD in Galaxy, expand the MiModD section of tools in the *Tools Bar*, then select the *SNAP Read Alignment* tool ( **(1)** in the screenshot below). This opens the tool's input form: .. image:: /images/Ex1_Snap_Alignment.png Select the reference genome **(2)** and a WGS dataset (your BAM file) **(3)**. The reads in the example file come from an Illumina paired-end sequencing run, so you should choose *paired-end* as the mode. You can ignore the *custom header file* choice for now since the BAM file already has header information (more on this later). Likewise, all other parameters can be left at their default settings for this simple example. Click *Execute* **(4)** to run the job (depending on your computer it should finish within 5 to 15 minutes). The job will now be added to your history bar as shown below (note the status changes - from scheduled to running to done - that it will undergo and that are indicated by color changes from grey to yellow to green): .. image:: /images/Ex1_Snap_Alignment_running.png If you are using MiModD from the command line, you can execute the following command to get the exact same result as for the Galaxy job:: mimodd snap paired PCC_6803_ref.fa SRR101486.bam -o SRR101486.aln.bam -C ++ -M --sort --verbose --quiet , which will store the alignment of the reads found in the BAM input file against the reference genome in the output file SRR101486.aln.bam. As explained under `Data Preparation `_, you can simply copy/paste this and the following command line quotes, if you set your current working directory to the directory extracted from the downloaded archive. Otherwise, you will have to add the appropriate paths to all filenames. .. tip:: You can get usage information for any MiModD tool from the command line with: ``mimodd --help`` so for an overview over all command line options of the snap tool, you can type: ``mimodd snap --help`` , which will tell you that the general form of the command is: ``mimodd snap -o [OPTIONS]`` Here, as throughout the documentation angle brackets ``< >`` enclose parameters to be filled in by the user and square brackets ``[ ]`` indicate optional parameters. See also the `snap tool description `__ for more details. .. _ex1-step2: Step 2: Identification of Sequence Variants ------------------------------------------- In the *Variant Calling* step, each nucleotide in the reference genome gets compared to every overlapping read in the aligned reads file from *Step 1* to see whether the reference and the sequenced bases agree or whether there is sufficient evidence for a sequence variation at the position. Below is a snapshot of the Galaxy interface of the *Variant Calling* tool of MiModD: .. image:: /images/Ex1_SNP_calling.png You need to specify the reference genome file **(1)** and (at least) one aligned reads file **(2)**. In our example, of course, you should choose the file produced in the previous step. The remaining settings can be left at their default values shown in the screenshot. As before, click *Execute* **(3)** to start the job. A new dataset will be added to your history: .. image:: /images/Ex1_SNP_calling_running.png It holds information, for every base in the genome, about how many reads in the aligned reads file overlap that base position, how many of them agree and disagree with the reference base, how well their ensemble supports the reference base or an alternate base instead, and more. This per-nucleotide information gets encoded in the `binary variant call format BCF `__ to save disk space. While this means that you cannot view the contents of the file directly in Galaxy (clicking on the dataset's eye symbol in the history bar will pop up your browser's download dialog instead), MiModD uses bcf files produced by the *Variant Calling* tool as a flexible starting point for downstream analyses. The MiModD command line tool for *Variant Calling* is *varcall* and the command line equivalent to the above Galaxy tool run is:: mimodd varcall PCC_6803_ref.fa SRR101486.aln.bam -o example1_calls.bcf -d 250 -i --quiet --verbose .. Tip:: MiModD uses general command line options that are understood by several tools in the package. The ``-o`` option (and its long form ``--ofile`` let the user specify the name of an output file. Without this option, most tools send their output to the standard output stream, which normally means the screen. The ``-v`` or ``--verbose`` option instructs tools to print extra status messages while they are running. The ``-q`` or ``--quiet`` option is recognized by commands that wrap third-party tools and makes them suppress the original output from these tools. The ``--quiet`` and ``--verbose`` options can be combined. The *mimodd varcall* command, for example, internally uses samtools/bcftools for calling variants and ``--quiet`` suppresses any output from these tools, while MiModD itself may produce ``--verbose`` output. .. _ex1-step3: Step 3: Extraction of variants of interest from the binary variant call file ---------------------------------------------------------------------------- The principle of variant extraction in MiModD is that you can interrogate a variant call file like the one produced in the previous step and, using various filters, you can retrieve only variants relevant for a specific scientific question. The question we would like to answer for our example data is: **What genomic variations exist in the sequenced genome of the PCC-M substrain compared to the reference genome ?** and we will use the *varextract* (Galaxy name: *Extract Variant Sites*) and *vcf-filter* (*VCF Filter* in Galaxy) tools to answer it. You will, normally, combine these two tools like this to extract variants of interest: - the *varextract* tool scans a binary variant call file for positions, at which there is evidence (from the aligned reads file) for a variation (single-nucleotide variant, SNV, or a small insertion/deletion event, indel) and **extracts the information about only these variant positions** in `vcf format `__, a human-readable text format. For typical analyses this extracted variant file still contains many thousands of potential variant sites. - the *vcf-filter* tool then lets the user specify various filters to apply to an extracted variant file in order to **reduce the variants to those meeting user-defined criteria**, which are, again, reported in vcf format. In our example, the *Extract Variant Sites* tool is extremely easy to use: .. image:: /images/Ex1_variant_extract.png Just select the BCF variant calls file produced in the `previous step `_ **(1)** and start the analysis **(2)**. Likewise, from the command line simply run:: mimodd varextract example1_calls.bcf -o example1_extracted_variants.vcf This generates a new dataset, containing per-nucleotide information for only those reference bases that represent predicted variant sites. In contrast to the original binary file, the new file is a text file (in VCF format) that can be inspected directly (by clicking the dataset's eye symbol in Galaxy): .. image:: /images/Ex1_variant_view.png While this gives us a first view on the predicted variants present in our data, the format is clearly not the most user-friendly and with a larger number of variants it quickly becomes impractical to work with so, for data exploration, you would like to: 1) reduce the number of variants to consider and 2) have better ways of viewing the data. The *vcf-filter* tool that we would like to introduce next addresses the first need, while the *annotate* tool discussed in the next section addresses the second. In general, you will almost always want to filter extracted variants first with *vcf-filter* before running the *annotate* tool on the resulting smaller list. Filtering an extracted variants file ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ As we have seen above, MiModD reports variant calls in `vcf format `__. To streamline the analysis of multiple samples, it also generally generates multi-sample files. Technically, the *vcf* format uses its tab-separated columns 9 and higher to store sample-specific information (as opposed to the first 8 columns that store general information about the variants) and most vcf filters in MiModD work on these columns, which are called the *genotype columns* - a term that we will stick to in the following discussion. For our example analysis in Galaxy, we want to set up the *VCF Filter* tool as shown below: .. image:: /images/Ex1_vcf_filter.png First, select the vcf input file with extracted variants **(1)**. Then, to define a sample-specific filter, click *Add New Sample-specific Filter* **(2)**, and, in the expanding section, specify the name of the sample that the filter should work on **(3)**. The sample name has to match one of the genotype column names in the input vcf file. In our example the name of our only sample was taken from the original BAM file header (see below) and is *PCC-M*. .. Tip:: If you are unsure about the sample names or about other general information encoded in any MiModD-supported file type, then without knowing anything about the file format, you can use the `info tool `__ (*Retrieve File Information* in Galaxy) to get a standardized summary report. Next, you need to specify, one or several sample-specific citeria that a given line in the vcf input file will have to fulfill to pass the filter. Currently, the tool lets you define filters based on genotype patterns, depth of coverage and genotype quality. Specifying a genotype pattern criterion ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ Under *genotype pattern(s) to be included* specify the diploid genotype(s) x/x (with x = 0 for a wt allele, x = 1 for a mutant allele) that should pass the filter. A variant is retained in the output if the given sample matches one of the specified genotypes for that site. .. Note:: Variant calling in MiModD assumes a diploid genome. Thus, using MiModD (or, in fact, *samtools*) for variant calling in haploid organisms may be argued to suffer a conceptual problem but, in practice, results are affected little by this - you just need to decide whether you want to consider the biologically impossible *heterozygous* genotypes as false calls (as done in this example) or as suboptimally called homozygotes. For this example, try 1/1 as the genotype pattern for sample PCC-M to retain only variants for which the sample appears to be homozygous mutant **(4)**. Specifying a depth of coverage criterion ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ Depth of coverage means the number of reads in your input file(s) of sequenced reads that provided information about the variant site (either favouring the variant or not). Variants called based on only very few reads are often questionable (since they could easily be caused by sequencing errors), so you can set a lower limit for the read depth here. For our example, try setting it to three, an often used minimal requirement **(5)**. .. Note:: *depth of coverage* here refers to the DP value of the genotype fields, i.e., it is the sample-specific depth of coverage not to be confused with the overall depth of coverage indicated in the general INFO field of the vcf format. Specifying a genotype quality criterion ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ Genotype quality is a score for the reliability of the genotype assignment made by the variant caller. As such, this field may be used to create stricter genotype filters than through the use of genotype patterns alone. For this example, this criterion can be left at its default value of zero (disabled). .. admonition:: Other filters The tool also lets you set up two non-sample-specific filters that we will not use in this example. To give a brief description: - *Region* filters can be used to restrict the output to variants in certain regions of the genome. - through the *Variant Type* filter you can specify whether you want to retain just single-nucleotide variants (SNVs) or only indels. The *sample* filter, finally, does not work on rows (= variants), but on the sample-specific genotype columns. It can be used to reduce the number of samples in the output and is supported for compatibility with external tools that may not support multi-sample vcf format. Obviously, this filter cannot be used reasonably on input files with only one sample like our example dataset. Once you have populated the tool form with all parameters you can execute the job, which should finish almost immediately. In general, filtering (and annotating) variants are very fast processes compared to alignment and variant calling making it easy to explore variants once they are called with many different settings. To reproduce the Galaxy output from the command line, run:: mimodd vcf-filter example1_extracted_variants.vcf -o example1_filtered_variants.vcf -s PCC-M --gt 1/1 --dp 3 --gq 0 , where the ``-s`` or ``--samples`` option is, generally, used to specify one or multiple sample names. For each sample name, a corresponding ``--gt``, ``--dp`` and ``--gq`` value needs to be provided, which get then combined in corresponding order into sample-specific filters (see the `vcf-filter tool description `__ for a much more detailed description and examples of the command line usage of this tool). When inspecting the filtered output for our example, you will note that the filter has not reduced the number of variants dramatically (90 out of 117 variants are retained), which shows that for this simple dataset with quite high average coverage most variants have been accurately called (see the note on variant calling in haploid genomes above). verify that the overall file format has not been changed, but that there are now only 68 variants reported instead of an original 76. .. _ex1-step4: Step 4: Generating variant reports ---------------------------------- Among the files produced in the previous steps the *variant output file* and the *deletion output file* are of special interest to answer this question. The variant output file contains the information about all detected SNVs and small indels, the deletion output file the information about the detected deletions > 100 nucleotides. Both files are tab-separated plain text formats, so they can be inspected with any text editor or be imported into spreadsheet software. From Galaxy the content of the files can be displayed by clicking the eye symbol next to the file names in the history bar. - the *annotate* tool (Variant Annotation in Galaxy), on the other hand, will turn a list of variants into a more readable table format and will try to annotate the variants with the genomic features they affect. For a full annotation, the tool requires an existing SnpEff installation and a SnpEff genome annotation file , but limited functionality is offered - and is enough for this first example - independent of SnpEff. .. Note:: The current version of MiModD does not offer any tools to post-process the deletion output file. The *annotate* tool can, actually, perform two tasks at once: - it can simplify the complex information represented in a vcf file to the parts relevant for the average human reader, i.e., it strips off all the statistical information - it can annotate variants with the genomic features they affect and generate output containing hyperlinks to relevant databases for the organism under study. In its simplest form, the tool only carries out the first task and this requires only minimal configuration. In Galaxy, simply select *Variant Annotation* and then prepare the tool interface to look like this: .. image:: /images/Ex1_annotate_simple.png The equivalent command line call is:: mimodd annotate example1_filtered_variants.vcf -o example1_variant_report.txt --oformat text This produces another tab-separated plain text file, which is, however, more pleasant to the eye. In our example you should get something like this: .. image:: /images/Ex1_annotate_simple_result.png .. Note:: In general, most MiModD tools will produce one results file per run, but some, like the *Variant Calling* tool, can generate additional output files on demand. The Galaxy interface will automatically recognize these situations and will add the additional files to the history bar as needed. If you use these same tools from the command line, you must specify names for all output files. .. _ex1-step5: Step 5: Identification of Deletions ----------------------------------- The *Variant Calling* strategy from *Step 2* allows for the identification of *single nucleotide variations* (SNVs) and of small insertion/deletion events (indels), but cannot detect larger alterations to the genome, like larger deletions and insertions, duplications and translocations. In its current version, MiModD provides a (still somewhat experimental) tool to detect deletions in paired-end reads data. Additional tools for the detection of other variations are in development and may be added in future releases. The *Deletion Prediction for paired-end data* tool for Galaxy is accessed through the following input mask, .. image:: /images/Ex1_del_calling.png which requires the user to specify (at least) one aligned reads file **(1)** just as for the *Variant Calling* tool. As second input it expects a coverage file as we produced it in *Step 2* **(2)**. Two additional parameters can be adjusted to change the behaviour of the underlying algorithm: The tool defines a low-coverage region as a contiguous stretch of nucleotides in the reference genome all of which are covered by less than a threshold number of aligned reads from the input file(s). The *maximal coverage allowed inside a low-coverage region* setting specifies this threshold. In general, increasing this number makes the tool consider more regions as candidate deletions and will slow its performance, but result in more hits. There is no generally optimal setting for this parameter, but acceptable settings have to be determined empirically for every dataset. However, for a small genome as in this example that shows very good overall coverage it is certainly a good idea to increase it over its default value of 0 (we determined a value of 4 to give reasonable results) **(3)**. The *minimal deletion size* restricts the reported deletions (and low-coverage regions, see below) to those of at least this number of nucleotides in length. In general, the predictive power of the underlying algorithm decreases with decreasing sizes of the uncovered regions, so setting this parameter to less than its default value of 100 will, in most situations, just increase the run- time of the tool and produce more uncovered regions, but will not change the number of statistically significant deletions. On the other hand, of course, small deletions may be missed with too restrictive settings. For our example the value can be left at its default of 100 nucleotides. Finally, the output of the tool can be changed by selecting *include low-coverage regions* . With this option enabled, low-coverage regions that did not pass the statistical test for real deletions will be included in the output. The resulting list will usually be much longer than the default output and will contain regions that are poorly covered for technical reasons (e.g., regions refractory to sequencing). In some cases, the extended list, however, may contain true deletions missed by the algorithm, but which may be interesting enough to be followed up by other means. For the example, you can leave the option unchecked, but if you are curious, you may repeat the run with low-coverage regions included and compare the results. The *group reads based on read group id only* check-box can be left as it is for this example. As always, click *Execute* **(4)** to start the job. *Deletion Calling* from the command line is done by using the *delcall* tool. The exact command to reproduce the Galaxy job described above is:: mimodd delcall SRR101486.aln.bam example1_calls.bcf -o example1_deletions.txt --max-cov 4 --min-size 100 -i .. Tip:: At this point, you may wish to compare the result of the analysis with MiModD to the `published list of variants `__ [1]_ for this sample. Do not forget to look at the results of the deletion calling! .. [1] `Trautmann et al. (2012): Microevolution in Cyanobacteria: Re-sequencing a Motile Substrain of Synechocystis sp. PCC 6803. DNA Res, 19:435-48. `__