Installing and Configuring SnpEff

MiModD can use SnpEff (v3.3 through 4.x) during variant annotation.

Installation and initial setup

Installing SnpEff is optional, though we recommend it, and can be done at any time: you do not need SnpEff in order to install MiModD.

Additional requirement

SnpEff is written in Java and therefore requires a Java runtime, which you may need to install separately.

You can download SnpEff from http://snpeff.sourceforge.net/.

To install it, simply unpack the downloaded archive to a newly created snpEff folder in your home directory or into any other directory that you see fit.

See also

The official SnpEff installation instructions for the most up-to-date information.

To activate SnpEff-dependent functionality of MiModD, you will have to configure the SNPEFF_PATH parameter in MiModD (see Configuring MiModD for your system).
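
Before configuring MiModD, it can help to confirm that the chosen folder really contains snpEff.jar and that a java executable is on your PATH. The following is only a minimal shell sketch. The function name check_snpeff_dir is hypothetical and not part of MiModD or SnpEff. It assumes that snpEff.jar sits directly in the installation folder.

```shell
# Hypothetical helper, not part of MiModD or SnpEff.
# Usage: check_snpeff_dir /path/to/snpEff
# Returns 0 if the folder looks usable, 1 if snpEff.jar is missing,
# 2 if no java executable can be found on PATH.
check_snpeff_dir() {
    local dir="$1"
    if [ ! -f "$dir/snpEff.jar" ]; then
        echo "no snpEff.jar found in $dir" >&2
        return 1
    fi
    if ! command -v java >/dev/null 2>&1; then
        echo "java not found on PATH" >&2
        return 2
    fi
    echo "$dir looks like a usable SnpEff installation"
}
```

For example, check_snpeff_dir "$HOME/snpEff" checks the default location suggested above.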

Installation of SnpEff genomes

Discovering and obtaining genome files

MiModD intentionally does NOT support SnpEff’s on-demand database download feature. This means that you will have to install SnpEff genome files manually before you can use them.

The procedure for doing so is explained in the SnpEff documentation.

One way to find out which SnpEff genome files are available for download, as suggested by the manual, is to run the following from the command line (assuming you are in the directory you installed SnpEff into):

java -jar snpEff.jar databases

Since this will list all available downloads (more than 20,000 according to the manual), we recommend filtering the output instead:

java -jar snpEff.jar databases | grep "your search term for your organism"

For example, to search for available genome files for C. elegans, you could run:

java -jar snpEff.jar databases | grep "elegans"

At the time of writing, and for the installed version of SnpEff, the output was (with line-wrapping removed):

GCA_000162475.1.18    Granulicatella_elegans_atcc_700633          http://downloads.sourceforge.net/project/snpeff/databases/v3_6/snpEff_v3_6_GCA_000162475.1.18.zip
GCA_000162475.1.21    Granulicatella_elegans_atcc_700633          http://downloads.sourceforge.net/project/snpeff/databases/v3_6/snpEff_v3_6_GCA_000162475.1.21.zip
WBcel215.68           Caenorhabditis_elegans                      http://downloads.sourceforge.net/project/snpeff/databases/v3_6/snpEff_v3_6_WBcel215.68.zip
WBcel215.69           Caenorhabditis_elegans                      http://downloads.sourceforge.net/project/snpeff/databases/v3_6/snpEff_v3_6_WBcel215.69.zip
WBcel215.70           Caenorhabditis_elegans                      http://downloads.sourceforge.net/project/snpeff/databases/v3_6/snpEff_v3_6_WBcel215.70.zip
WBcel235.18           Caenorhabditis_elegans                      http://downloads.sourceforge.net/project/snpeff/databases/v3_6/snpEff_v3_6_WBcel235.18.zip
WBcel235.21           Caenorhabditis_elegans                      http://downloads.sourceforge.net/project/snpeff/databases/v3_6/snpEff_v3_6_WBcel235.21.zip
WBcel235.71           Caenorhabditis_elegans                      http://downloads.sourceforge.net/project/snpeff/databases/v3_6/snpEff_v3_6_WBcel235.71.zip
WBcel235.72           Caenorhabditis_elegans                      http://downloads.sourceforge.net/project/snpeff/databases/v3_6/snpEff_v3_6_WBcel235.72.zip
WBcel235.73           Caenorhabditis_elegans                      http://downloads.sourceforge.net/project/snpeff/databases/v3_6/snpEff_v3_6_WBcel235.73.zip
WBcel235.74           Caenorhabditis_elegans                      http://downloads.sourceforge.net/project/snpeff/databases/v3_6/snpEff_v3_6_WBcel235.74.zip
WBcel235.75           Caenorhabditis_elegans                OK    http://downloads.sourceforge.net/project/snpeff/databases/v3_6/snpEff_v3_6_WBcel235.75.zip
WS220.64              Caenorhabditis_elegans                      http://downloads.sourceforge.net/project/snpeff/databases/v3_6/snpEff_v3_6_WS220.64.zip
WS220.65              Caenorhabditis_elegans                      http://downloads.sourceforge.net/project/snpeff/databases/v3_6/snpEff_v3_6_WS220.65.zip
WS220.66              Caenorhabditis_elegans                      http://downloads.sourceforge.net/project/snpeff/databases/v3_6/snpEff_v3_6_WS220.66.zip
WS241                 C. elegans                                  http://downloads.sourceforge.net/project/snpeff/databases/v3_6/snpEff_v3_6_WS241.zip
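
As the listing shows, a plain grep for "elegans" also matches unrelated organisms such as Granulicatella elegans. The following is a minimal sketch of a stricter filter. It assumes that the output keeps the whitespace-separated layout shown above, with the genome identifier in the first column. The function name celegans_genomes is hypothetical.

```shell
# Hypothetical helper: read a "databases" listing on stdin and print only
# the identifiers (first column) of rows whose organism column reads
# "Caenorhabditis_elegans" or "C. elegans".
celegans_genomes() {
    awk '$2 == "Caenorhabditis_elegans" || ($2 == "C." && $3 == "elegans") { print $1 }'
}

# Intended use, from the SnpEff installation directory:
#   java -jar snpEff.jar databases | celegans_genomes
```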

Then, following the documentation, to install the latest genome version file you could either run:

java -jar snpEff.jar download WS241

or visit the download link for this version and extract the downloaded archive in the data subfolder of your SnpEff installation.
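
Whichever route you choose, you may want to confirm that the genome ended up where SnpEff expects it. The sketch below assumes the default layout, in which each genome lives in its own folder under data and contains a file named snpEffectPredictor.bin. If your snpEff.config sets a different data directory, adjust the path accordingly. The function name genome_installed is hypothetical.

```shell
# Hypothetical helper: succeed if a genome database appears to be installed.
# Usage: genome_installed /path/to/snpEff GENOME_ID
# Assumes the default layout <snpeff_dir>/data/<genome>/snpEffectPredictor.bin.
genome_installed() {
    local snpeff_dir="$1" genome="$2"
    [ -f "$snpeff_dir/data/$genome/snpEffectPredictor.bin" ]
}
```

For example, genome_installed "$HOME/snpEff" WS241 && echo "WS241 is installed" reports on the genome used in the example above.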

Getting genome file versions right

It is important to understand that nucleotide numbering may change between different genome versions of the same species. The nucleotide numbers reported in any MiModD output file are determined by the fasta reference sequence you are using in the analysis. SnpEff, on the other hand, uses its own genome files that specify the positions of features in the genome.

If there is a mismatch between these two coordinate systems, then the functional effects on genes and proteins reported by the SnpEff-dependent parts of MiModD for any discovered variants cannot be expected to be correct!

The easiest way to ensure that a fasta reference genome and a SnpEff genome share the same nucleotide numbering is to use files with matching reference version numbers (in the example above, that would mean using a WS241 fasta reference genome with the WS241 SnpEff genome).

However, not all reference genome versions ever released for any organism have corresponding SnpEff genomes available, so we have collected some workarounds you could consider if a matching SnpEff genome is not available for your reference genome. Not all of these are applicable in every use case, but we hope at least one of them can help in yours:

  • Can you use a different fasta reference version for your analysis?

    If SnpEff does not offer a matching genome for your chosen fasta reference, but has one for a slightly older or newer version, then maybe you can get and use that fasta reference version?

    Of course, it’s best to make this decision at the beginning of an analysis or you will have to start over again.

  • Is the genome file version you need available for a different version of SnpEff (check the available database downloads for SnpEff to see if that is the case)?

    SnpEff genome files are bound to the specific SnpEff version they were produced for. If your version of SnpEff only supports genome versions older than your fasta reference, check whether switching to a newer SnpEff version solves the problem; if it only supports more recent genome versions, check an older SnpEff version instead.

    Switching the SnpEff version used by MiModD is very easy: all you need to do is reconfigure the SNPEFF_PATH parameter in MiModD (see Configuring MiModD for your system).

  • Does SnpEff offer an alternative, coordinate-compatible genome file version?

    Unfortunately, there is no universal way to learn which versions of a reference genome have no coordinate differences between them. You will probably have to read through the changelogs of the reference sequence releases for your organism of interest to get that information.

  • Is there a UCSC chain file that describes the coordinate-mapping between your fasta genome version and any genome version available for SnpEff?

    If so, then you can use this file with the MiModD rebase tool (Galaxy name: Rebase Sites) to convert your VCF variant coordinates to the SnpEff genome file coordinates just before you do the annotation.

    Official UCSC chain files for many organisms are available from the UCSC Sequence and Annotation Downloads page (they are also called LiftOver files there), but there may be additional sources for your organism of interest. UCSC genomes and corresponding chain files use their own versioning scheme, in which version numbers increase only with major changes to the reference genome, i.e., with changes to the nucleotide sequence, but not with mere annotation changes. A good starting point to learn which UCSC genome version identifier corresponds to which organism-specific version number (that you may be more familiar with and that typically also gets used by SnpEff) is the full list of UCSC genome releases.

    The MiModD rebase tool needs only two inputs: the VCF file whose variant coordinates you wish to rebase, and the chain file (gzipped or uncompressed) describing the mapping between the two genome versions. It produces a new VCF with the adjusted variant coordinates. The only caveat is that you must carry out the conversion in the right direction: chain files normally follow the naming convention <from_version>To<to_version>.over.chain and allow you to migrate variants from <from_version> to <to_version>. If you want to go in the opposite direction, you need to provide the -r or --reverse option when running the rebase tool.

    See also

    the detailed documentation of the rebase tool
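
Because the direction of a chain file is easy to get wrong, the following minimal sketch derives it from the file name. It assumes the <from_version>To<to_version>.over.chain naming convention described above (optionally with a .gz suffix) and compares version names case-insensitively, since UCSC file names capitalise the first letter of the target version (for example ce10ToCe11.over.chain.gz). The function name chain_direction is hypothetical and not part of MiModD.

```shell
# Hypothetical helper: decide whether the rebase tool needs --reverse.
# Usage: chain_direction CHAIN_FILE VCF_GENOME_VERSION
# Prints "forward" if the VCF uses the chain file's <from_version>,
# "reverse" if it uses the <to_version>.
# Returns 1 if the version matches neither side, 2 if the file name
# does not follow the <from>To<to>.over.chain convention.
chain_direction() {
    local name from to have
    name=$(basename "$1")
    name=${name%.gz}
    name=${name%.over.chain}
    case "$name" in
        *To*) ;;
        *) echo "unexpected chain file name: $1" >&2; return 2 ;;
    esac
    # Split at the first "To" and lower-case both sides for comparison.
    from=$(printf '%s' "${name%%To*}" | tr '[:upper:]' '[:lower:]')
    to=$(printf '%s' "${name#*To}" | tr '[:upper:]' '[:lower:]')
    have=$(printf '%s' "$2" | tr '[:upper:]' '[:lower:]')
    if [ "$have" = "$from" ]; then
        echo forward
    elif [ "$have" = "$to" ]; then
        echo reverse
    else
        echo "$2 is neither side of $1" >&2
        return 1
    fi
}
```

A result of "forward" means the rebase tool can be run without extra options, whereas "reverse" means you would add -r or --reverse.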