Multi-sample analysis: three-sample analysis of yeast mutants and their parent strain
=====================================================================================

Background:
-----------

Wild-type laboratory strains of *Saccharomyces cerevisiae* , like the
reference strain S288C or the widely used strain CEN.PK113-7D, can grow
on a variety of carbon sources including lactate.

In a recent publication [2]_, a knock-out mutant strain based on CEN.PK113-7D 
was generated, in which the gene for the only known lactate transporter in 
yeast, JEN1, is disrupted by an inserted reporter cassette. As expected this 
strain cannot use lactate as a carbon source. 
Laboratory evolution was used then on this strain to obtain two substrains that 
had regained the ability to grow on lactate.
Both substrains, IMW004 and IMW005, were subjected to WGS to gain
insight into the genomic changes that these strains had undergone
compared to CEN.PK113-7D over ~100 generations of laboratory evolution
and that might explain their ability to grow on lactate despite the
disruption of the JEN1 gene.

Strikingly, the substrains harbored independent point mutations in the
acetate transporter gene ADY2 and subsequent experiments showed that
these mutations are likely to cause changes in the substrate specificity
of this transporter.

This example uses MiModD to recapitulate the primary finding from the
laboratory evolution experiment. It will demonstrate how MiModD enables
the simultaneous analysis of several samples in a few simple steps, and
how the data exploration and annotation tools can be used to narrow down
large lists of variants and affected genes efficiently.

The yeastADY2.tar.gz archive available from the `MiModD download site 
<http://sourceforge.net/projects/mimodd/files/Example%20Datasets/>`__ contains 
the following files: 

-  the **S288C reference sequence** in fasta format,
-  a BAM file of **NGS reads from the CEN.PK113-7D parent strain** (the
   reads are a random subset of those available through the NCBI small
   reads archive (SRA) under accession number
   `SRX129922 <http://www.ncbi.nlm.nih.gov/sra/?term=SRX129922>`__),
-  a BAM file of **NGS reads from the lab-evolved substrain IMW004** (the
   same reads as available under the NCBI SRA accession number
   `SRX129995 <http://www.ncbi.nlm.nih.gov/sra/SRX129995>`__), and
-  a BAM file of **NGS reads from the lab-evolved substrain IMW005** (the
   same reads as available under the NCBI SRA accession number
   `SRX129996 <http://www.ncbi.nlm.nih.gov/sra/SRX129996>`__).

.. [2] `Kok et al. (2012): Laboratory evolution of new lactate transporter 
   genes in a jen1D mutant of Saccharomyces cerevisiae and their identification 
   as ADY2 alleles by whole-genome resequencing and transcriptome analysis. 
   FEMS Yeast Res, 12:359-74. <http://www.ncbi.nlm.nih.gov/pubmed/22257278>`__


.. _ex2-step0:

Step 0: Preparing the data
--------------------------

Data preparation for Galaxy
~~~~~~~~~~~~~~~~~~~~~~~~~~~

If you are using MiModD through its Galaxy interface, you need to upload the 
reference fasta and the BAM files of WGS reads to Galaxy just as in the 
`first example <tutorial_example1.html#ex1-step0>`__. 
Once the uploads are finished you can proceed to Step 1.

Data preparation for the command line
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Besides extracting the data files from the downloaded archive no special data 
preparation is required when working from the command line. As in the `first 
example <tutorial_example1.html#ex1-step0>`__, we will assume though that you 
have set your current working directory to the extracted directory 
tutorial_ex2. Although this is not an absolute requirement, it will allow you 
to execute the command line instructions in this section by simply 
copying/pasting them from this page into a terminal window.


.. _ex2-step1:

Step 1: Multi-sample NGS Reads Alignment with the MiModD snap tool
------------------------------------------------------------------

Alignments for multiple samples are almost as simple to perform with MiModD as 
alignments for a single sample.

In Galaxy, start by selecting the reference genome and the first sequenced 
reads input file. Select *paired-end* as the mode for this file since all WGS 
data in this example was obtained using Illumina paired-end sequencing. Select 
BAM as the input file format. As before, you do not have to select a custom 
header file since all BAM files in this example have appropriate header 
information already. Once you have configured these settings for the first 
input file, simply add the additional input datasets. You can add a dataset by 
clicking on the *Add new datasets* button.

.. image:: /images/Ex2_Snap_Multi.png

Make sure you select the appropriate mode (paired-end) and file format
(BAM) for each dataset.

Leave the output file format and the further parameter settings
unchanged and click on Execute to start the analysis job.

This will add a single results file with the merged alignments for all
input files to your history.

If you are using MiModD from the command line, you will have to use the
*snap-batch* tool to align reads from multiple input files. To this tool
you have to pass a full valid *snap* tool call per input file, either as a
series of literal commands or as a text file, in which the individual commands 
are stored on separate lines.

If this sounds complicated, consider this example:

To perform the alignment for just one input file you would issue the
following command as we have seen in the `First Steps example 
<tutorial_example1.html#ex1-step1>`__, e.g.::

  mimodd snap paired S288C_ref.fa SRR445715.bam -o SRR445715.aln.bam -C ++ -M --sort --verbose --quiet

The command line to execute snap-batch for an alignment of all three BAM input 
files with the same settings as above then takes the following form::

  mimodd snap-batch -s "snap paired S288C_ref.fa SRR445715.bam -o example2.aln.bam -C ++ -M --sort --verbose --quiet" "snap paired S288C_ref.fa SRR445716.bam -o example2.aln.bam -C ++ -M --sort --verbose --quiet" "snap paired S288C_ref.fa SRR445717.bam -o example2.aln.bam -C ++ -M --sort --verbose --quiet"

Here, the ``-s`` switch indicates that the individual *snap* calls are 
specified directly on the command line. Note that the quotes around the 
individual *snap* calls are necessary to group the calls (or the shell would 
split arguments on every whitespace).

Alternatively, you can generate a batch text file (using any editor that
lets you store or export plain text) with the following content::

  snap paired S288C_ref.fa SRR445715.bam -o example2.aln.bam -C ++ -M --sort --verbose --quiet
  snap paired S288C_ref.fa SRR445716.bam -o example2.aln.bam -C ++ -M --sort --verbose --quiet
  snap paired S288C_ref.fa SRR445717.bam -o example2.aln.bam -C ++ -M --sort --verbose --quiet
  
and then call *snap-batch* like this (the ``-f`` switch tells the tool to read 
the *snap* commands from the specified file)::

  mimodd snap-batch -f <batch file>

.. Note::

   In both versions of the *snap-batch* command you need to provide the ``-o`` 
   switch with every single *snap* command just to form a valid call. Since 
   *snap-batch* will always produce just a single merged output, it takes the 
   output file name from the first *snap* command that is provided and ignores 
   additional output file specifications in later commands.

See also the `snap-batch tool <tool_doc.html#snap-batch>`__ description in the 
`Tool Documentation <tool_doc.html>`__.


.. _ex2-step2:

Step 2: Identification of Sequence Variants from a multi-sample alignment
-------------------------------------------------------------------------

*Variant calling* from a multi-sample alignment as input is really no
different from the single-sample procedure illustrated in the `first
example <tutorial_example1.html#ex1-step2>`__.

MiModD will correctly identify read groups and samples present in the
input file of aligned reads and handle them appropriately.

We leave it up to you to run this very simple job from Galaxy, but here is the 
command line equivalent to copy/paste::

  mimodd varcall S288C_ref.fa example2.aln.bam -o example2_calls.bcf -d 250 -i --quiet --verbose


.. _ex2-step3:

Step 3: Identification of Deletions
-----------------------------------

Like for *Variant Calling*, there is no difference between the *Deletion
Calling* step for `single-sample <tutorial_example1.html#ex1-step5>`__ and 
multi-sample input.

Again, MiModD will auto-detect all read groups and samples in the input
file and handle them appropriately. The result is a single output file,
in which deletions (and possibly uncovered regions) are reported along
with the sample in which they were detected.

From the command line you can use *delcall* with::

  mimodd delcall example2.aln.bam example2_calls.bcf -o example2_deletions.txt --max-cov 4 --min-size 100 -i


.. _ex2-step4:

Step 4: Data Exploration and Annotation
---------------------------------------

.. admonition:: Preliminary Note

   The real difference between a single-sample WGS analysis and the analysis of 
   multiple samples is that multi-sample data allows more complex questions to 
   be addressed than single-sample data.

Possible questions in this example are:

#. What differences exist between the parent strain of the study,
   CEN.PK113-7D, and the reference strain S288C ?
#. What differences exist between the evolved strain IMW004 and its
   parent CEN.PK113-7D ?
#. What differences exist between the second evolved strain IMW005 and
   its parent CEN.PK113-7D ?
#. Are there common effects of these differences in IMW004 and IMW005 ?

Note that question 1 is only of secondary interest here compared to
questions 2 and 3, but most importantly, that the naive question

*What differences exist between each of the evolved strains and the
reference ?*

is not very helpful since the set of these differences will be the union
of the differences between CEN.PK113-7D and S288C and the differences
between each evolved strain and CEN.PK113-7D. In fact, this is the
reason why it is essential to include sequencing data for CEN.PK113-7D
in the analysis.

Also note that in question 4) we are not asking for *shared differences*
in IMW004 and IMW005. Since these two strains have evolved completely
separately the chance of them sharing an exactly identical variation
that is not present in their parent strain is extremely low. However, it
is certainly possible that *different* variants in the two strains will
affect the same genomic feature (e.g., the same gene), which could be
taken as evidence that this gene may be of importance for the observed
biological effect (i.e., growth on lactate).

To begin with, we need to extract the total set of variants detected in the 
variant calling step. From the command line this is done through::

  mimodd varextract example2_calls.bcf -o example2_extracted_variants.vcf
  
Again, we leave it up to you to do the same from Galaxy.

.. image:: /images/Ex2_variants.png
   :align: left

A quick look at the results reveals that, in total, almost 30000 (!) variants 
(SNVs and indels) have been detected (see figure on the left).

These include differences between any of the three sequenced samples and
the reference genome along with false-positive calls. The goal of this
analysis step is to filter and annotate this impressive list in
different ways to remove false-positive variants and to retain only
those that help us answer the four questions defined above.

This section demonstrates how the *vcf-filter* and the *annotate*
tool can be used together for this purpose.

.. Note::

   In principle, deletions predicted in the *Deletion Calling* step should be 
   treated the same way, preferably together with the other variants.
   The current version of MiModD, however, does not offer any tools to 
   post-process the deletion output file. Luckily, the number of predicted 
   deletions is typically much smaller than that of other variants, so 
   inspecting the full list of deletions is possible (though, admittedly, 
   inconvenient).

Q1: Differences between the parent strain of the study, CEN.PK113-7D, and the reference strain S288C
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The key to answering most of our questions is to use the *vcf-filter*
tool with appropriate sample-specific filters.

A simple filter corresponding to question 1 can be formulated on the
command line as::

  mimodd vcf-filter example2_extracted_variants.vcf -o example2_q1_filter1_variants.vcf -s CEN.PK113-7D --gt 1/1,0/1 --dp 0 --gq 0
    
, which reports all variants for which the CEN.PK113-7D sample appears to be 
heterozygous or homozygous mutant. With this approach we get rid of ~ 3000 variants, but are still left with around 27000.

It is quite easy, however, to make the filter more stringent, e.g., we can try::

  mimodd vcf-filter example2_extracted_variants.vcf -o example2_q1_filter2_variants.vcf -s CEN.PK113-7D --gt 1/1,0/1 --dp 3 --gq 0
    
, which is very similar to the first version but, in addition, is asking for at 
least three fold coverage of the variant for the CEN.PK113-7D sample. 
With this, we lose another ~ 1200 variants, leaving us with just below 26000.

CEN.PK113-7D is a haploid yeast strain and heterozygous calls are likely
indicating sites at which variant calling is unreliable, so there is 
justification (as `discussed for the first example 
<http://mimodd.readthedocs.org/en/latest/tutorial_example1.html#specifying-a-genotype-pattern-criterion>`__) 
for using::

  mimodd vcf-filter example2_extracted_variants.vcf -o example2_q1_filter3_variants.vcf -s CEN.PK113-7D --gt 1/1 --dp 3 --gq 0
    
to retain only homozygous variants, which turn out to be ~ 23700 in our
case.

Finally, we could reason that true sequence deviations between
CEN.PK113-7D and the reference strain S288C are very likely to occur
also in the derived strains IMW004 and IMW005. Thus, we can add
additional filters for these samples resulting in::

  mimodd vcf-filter example2_extracted_variants.vcf -o example2_q1_filter4_variants.vcf -s CEN.PK113-7D IMW004 IMW005 --gt 1/1 1/1 1/1 --dp 3 3 3 --gq 0 0 0
    
, which only retains variants for which all three sequenced samples appear to 
be homozygous mutant and for which there is an at least 3-fold coverage in all 
three samples. Applying this filter gives us a list of ~ 22500 variants 
representing our best candidates for true variant sites between S288C and 
CEN.PK113-7D.

To compose this last job from Galaxy, open the *VCF Filter* tool, select
the *VCF input file*, then define three sample-specific filters (adding
them by clicking on Add new sample-specific filter) as shown below:

.. image:: /images/Ex2_Q1_filters.png

Q2: Differences between the evolved strain IMW004 and its parent CEN.PK113-7D
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Here we are looking for variants in IMW004 that are *not* present in the parent 
strain CEN.PK113-7D. To get them we can use::

  mimodd vcf-filter example2_extracted_variants.vcf -o example2_q2_variants.vcf -s CEN.PK113-7D IMW004 IMW005 --gt 0/0 1/1 0/0 --dp 3 3 3 --gq 0 0 0

The rationale for including IMW005 in this filter and only accepting variants 
that are absent from this strain is that the laboratory evolution experiment 
was not started with CEN.PK113-7D, but with a derived strain genetically 
engineered to have the JEN1 gene disrupted. The sequence of this strain is not 
available, and the idea is that variants found in IMW004 *and* IMW005, but not CEN.PK113-7D, must have first occurred in the unsequenced JEN1delta strain and, 
thus, were present at the start of the experiment, which makes them less 
interesting.

Applying this filter leaves us with only 40 variants (out of the initial
29000)! The list (produced in Galaxy) is shown below.

.. image:: /images/Ex2_Q2_results.png

Note also the very different overall variant quality scores among the 40 
candidates, which could form a good basis for narrowing down the list further.

Q3: Differences between the evolved strain IMW005 and its parent CEN.PK113-7D
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

This question can, of course, be answered by just slightly modifying our
last filter settings to::

  mimodd vcf-filter example2_extracted_variants.vcf -o example2_q3_variants.vcf -s CEN.PK113-7D IMW005 IMW004 --gt 0/0 1/1 0/0 --dp 3 3 3 --gq 0 0 0

With the roles of IWM005 and IMW004 reverted, we get the following list
of 38 variants:

.. image:: /images/Ex2_Q3_results.png

Q4: Differences with common effects in IMW004 and IMW005
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

This question, asking for *distinct variants* in IMW004 and IMW005 that
have a *common effect* on a genomic feature, is the most challenging one.

In this example, with the small number of variants obtained as answers
to questions 2 and 3, it is not too difficult to find the two variants -
one in IMW004, the other in IMW005 - that are in suspiciously close
genomic vicinity, i.e., may well affect the same gene.

Nevertheless, we will demonstrate here how to identify these variants
through the use of the *annotate* tool.

As part of the first example, we have looked at the most `basic functionality 
of the annotate tool 
<tutorial_example1.html#step-4-generating-variant-reports>`__, which is to turn 
a vcf file into a stripped down tabular report file. 

Yet, *annotate* can do far more than this.

If you choose *HTML* as the output format instead of *Tab-separated plain 
text*, the tool will try to enrich the output with two types of hyperlinks, one 
for genomic positions and one for affected transcripts. These hyperlinks enable 
the user to access the relevant genomic region through the native genome 
browser for the organism of interest and to navigate to the corresponding gene 
page of the organism's standard database, respectively.

The current version of MiModD comes with built-in support for generating such 
hyperlinks for the following six model organisms:

  - the cyanobacterium *Synechocystis* PCC6803
  - baker's yeast (*S. cerevisiae*)
  - thale cress (*A. thaliana*)
  - the roundworm *C. elegans*
  - the fruitfly *D. melanogaster*
  - zebrafish (*D. rerio*)

For these species, it is enough to provide *annotate* with the species name 
(several aliases are supported) to have the hyperlinks generated. 

If you are working with a different species you can provide your own `template 
file with hyperlink formatting instructions <recipes.html#prepare-and-use-a-custom-hyperlink-template-file>`__.

.. Note::

   Weblinks are changing frequently and it is entirely possible that hyperlinks 
   for your species of interest suddenly stop working. Custom template files 
   offer a temporary solution for this kind of problem until we provide a 
   software update.
   
   If you discover broken hyperlinks or generate a template file that you want 
   to share with a larger community, you should contact us and we will be more 
   than happy to fix the problem or to add your organism to the list of 
   natively supported species.

Enhancing variant annotations through genome browser hyperlinks
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Generating hyperlinks to genome browser views of genomic regions has no 
requirements other than a valid hyperlink template or built-in species support. 
For our example, we can generate an annotated file from Galaxy by configuring 
the *Variant Annotation* tool as shown below (the input file here is the list 
of 14 variants generated as answer to Q3):

.. image:: /images/Ex2_Q4_setup1.png

All you have to do is to set the output format to *HTML* and to provide the 
species name.

The command line equivalent of this Galaxy job is::

  mimodd annotate example2_q3_variants.vcf --species S.cerevisiae -o example2_q3_anno.html -f html

The result of this step is a new dataset the contents of which can be
viewed in any web browser (or by clicking on its eye symbol in Galaxy).

It should look like this:

.. image:: /images/Ex2_Q4_results1.png

Following any of the position links will take you to the `yeastgenome.org 
<http://www.yeastgenome.org>`__ genome browser and will zoom in on a 1kb region 
around the variant site.

For the first variant, for example, this would look like this:

.. image:: /images/Ex2_Q4_gbrowse.png

quickly showing that the variant lies in the gene PHO11.

.. Warning::

   If the version of the reference genome you are using is different from that 
   used by the genome browser, the simple 1:1 mapping between variant sites 
   reported by MiModD and positions shown in the genome browser may be 
   inaccurate ! 
   
   Make sure you are using a compatible genome version for your organism and do 
   not rely on the genome browser view alone to determine the exact location of 
   variations !

Including effects on genomic features in annotated reports
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

To annotate variants with their effects on genomic features, such as 
transcripts and genes, MiModD relies on `SnpEff 
<http://snpeff.sourceforge.net/>`__ - a toolbox for genetic variant annotation 
and effect prediction. 
 
The MiModD installation instructions have a separate section on the 
`Installation and inital setup of SnpEff <install_snpeff.html>`__.

If you have configured MiModD to use SnpEff and you have a SnpEff genome file 
for yeast installed (genome file EF3.64 is assumed here), you can get fully 
annotated html output via the following command line::

  mimodd annotate example2_q3_variants.vcf --genome EF3.64 --ud 500 -o example2_q3_anno_full.html -f html

, where the SnpEff genome file gets passed to the tool with the ``--genome`` 
option (which also instructs the tool to use SnpEff for the annotations). 

In addition, the example uses the ``--ud`` switch to restrict the length of 
what is considered by SnpEff to be the upstream/downstream regions of genes. 
This is one way to reduce the output of the tool.

.. Hint::

   The *genome name* following ``--genome`` option is the SnpEff genome file 
   name (which you used for downloading the file) and could be different for 
   your installation (e.g., if you chose to download a different version).
    
   If you do not know or do not remember the name of the genome file you can 
   use the simple *snpeff-genomes* tool like this::
   
     mimodd snpeff-genomes

   to get a list of all currently installed SnpEff genome files on your system 
   together with organism names.

In Galaxy, SnpEff-dependent annotation jobs require one additional step:
you have to generate a text file with the information about the installed 
SnpEff genome files on your system.

This can be done using the *List Installed SnpEff Genomes* tool to generate the 
list of genome files and store them in a dataset in your history. 
The tool does not require any parameters, but only has to be executed to get 
the dataset that you will need in the next step.

Going back to the *Variant Annotation* tool you can now configure its
interface like below:

.. image:: /images/Ex2_Q4_setup2.png

Essentially, you select **(1)** SnpEff as the tool to annotate the input
file, which brings up the SnpEff-specific part of the interface. Next,
you have to select a dataset from your history that holds the SnpEff
genome list **(2)** . This is the dataset you just generated in the last
step. Once selected, the content of the dataset is used to populate a
drop-down menu that lets you select a SnpEff genome file **(3)**. To
generate equivalent output to the command line above, make sure you
restrict the upstream/downstream interval length to 500 bases **(4)** .

.. Note::

   When MiModD uses SnpEff to annotate variants, you will usually not have to 
   set the species explicitly since MiModD can obtain it from the SnpEff genome 
   file. 
      
   If auto-detection of the species fails (resulting in no hyperlinks) or if 
   you want to override its result you can, however, provide a species name 
   explicitly through the *Species* text box or using the ``--species`` option 
   on the command line.