Mapping-by-Sequencing using MiModD and CloudMap
***********************************************

What is mapping-by-sequencing
=============================

The classical approach to identifying the *causative* mutation of any particular mutant phenotype consists of two separate steps: first, genetic mapping is used to narrow down a genomic region for which genetic markers introduced by crossing indicate linked inheritance with the phenotype, then candidate DNA stretches in that region are sequenced to identify the mutation.

Through whole-genome sequencing it is now possible to merge these two steps into one *mapping-by-sequencing* step and to speed up mutation identification enormously. After any mapping cross, the inheritance pattern for any set of genetic markers can now be determined *along* with candidate mutations from the same sequencing data. Moreover, with mapping-by-sequencing essentially all non-causative mutations (including even previously unknown ones) present in any of the strains used for crossing can be used as marker mutations. This makes mapping-by-sequencing not only a fast, but also an extremely sensitive and versatile method.

To be useful for mapping-by-sequencing experiments, analysis tools need to be able to identify mutations, but also to report or visualize the inheritance pattern of marker mutations so that researchers can use that information to identify the most likely candidate for the *causative* mutation from the potentially long list of all identified variants.

What is CloudMap?
=================

*CloudMap*, accessible through the main Galaxy server at http://usegalaxy.org, is a collection of tools for analysis and visualization of mapping-by-sequencing experiments performed with virtually any model organism.
At its core are the three mapping tools:

- CloudMap: EMS Variant Density Mapping,
- CloudMap: Variant Discovery Mapping with WGS data, and
- CloudMap: Hawaiian Variant Mapping with WGS data,

which support the visual interpretation of the marker inheritance patterns obtained from following any of three popular mutation mapping approaches.

More information on CloudMap can be found in the `CloudMap online documentation `__.

MiModD complements CloudMap
===========================

While the CloudMap tools greatly facilitate the interpretation of inheritance patterns of sets of mutations, the suite itself does not offer tools to identify variants in the first place. Instead, it relies on assembling additional tools available on the main Galaxy server into relatively complex workflows.

MiModD makes it possible to perform the complete upstream sequence analysis, which generates the input data for the three core CloudMap tools, efficiently on a local computer. Hence, it eliminates the need to upload primary sequence data (with huge file sizes) to a remote server. As a rather specialized package, MiModD also provides a much simpler interface to the necessary alignment, variant calling and variant filtering steps than what can be realized by combining standard Galaxy tools.

An interface between MiModD and CloudMap
========================================

The standard vcf format variant lists generated by MiModD are not directly compatible with the CloudMap tools, but MiModD offers the *cloudmap* subcommand that can transform the data for use with any of the three CloudMap mapping tools. The resulting CloudMap-compatible vcf files are small enough to be transferred conveniently to remote machines, where the results can then be visualized using CloudMap.

Analyzing whole-genome sequencing data from mapping experiments
===============================================================

In this section, we review the three strategies for mapping-by-sequencing that CloudMap is compatible with and explain how the corresponding data can be analyzed in MiModD. We assume that you are already familiar with the relevant MiModD tools.

Simple Variant Density (SVD) Mapping
------------------------------------

referred to in CloudMap as *EMS Variant Density Mapping*

In this simple form of mapping, a phenotypically defined mutant strain obtained, for example, from a mutagenesis screen gets backcrossed (selecting for the phenotype) to its unmutagenized parent strain or outcrossed to a different strain, then sequenced. Because of linked inheritance, the phenotypic selection will not only work on the causative mutation itself, but also on nearby *non-causative* mutations introduced during mutagenesis, i.e., the causative mutation is expected to be found in the center of a mutation-rich region.

This approach works best if the sequence of the crossing strain (parent strain or unrelated) is analyzed along with that of the outcrossed mutant, or if several mutants derived from the same parent are analyzed together, since this makes it possible to eliminate variants that represent (misinformative) sequence deviations between the reference genome and the crossing strain.

The joint analysis of several sequencing datasets is one of the hallmark features of MiModD, and results are conveniently stored in a single multisample variant call file per analysis. Starting from there, the *vcf-filter* tool enables the straightforward exclusion of variants with any desired pattern of genotypes across the samples. The `Multi-sample analysis `__ section of the Tutorial provides an illustration of how simple this makes it to eliminate common background mutations.
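
The idea behind this kind of genotype-based filtering, followed by a look at variant density, can be illustrated with a minimal Python sketch. This is not how MiModD's *vcf-filter* or CloudMap are implemented; the record layout, the sample names ``mutant`` and ``parent``, the helper functions and the bin size are all assumptions made purely for illustration.

```python
from collections import Counter

# Hypothetical, simplified variant records: (chromosome, position, genotypes),
# where genotypes maps a sample name to a genotype string as it would appear
# in the GT field of a VCF file. All values are made up for illustration.
variants = [
    ("chrI", 1200, {"mutant": "1/1", "parent": "1/1"}),    # background variant
    ("chrI", 51000, {"mutant": "1/1", "parent": "0/0"}),
    ("chrI", 58000, {"mutant": "1/1", "parent": "0/0"}),
    ("chrI", 130000, {"mutant": "0/1", "parent": "0/0"}),  # not homozygous
    ("chrII", 7000, {"mutant": "1/1", "parent": "0/0"}),
]


def informative_markers(records, mutant, background):
    """Keep sites that are homozygous variant in the mutant sample and
    homozygous reference in the background (parent or crossing) sample."""
    return [
        (chrom, pos)
        for chrom, pos, genotypes in records
        if genotypes.get(mutant) == "1/1" and genotypes.get(background) == "0/0"
    ]


def variant_density(markers, bin_size):
    """Count markers per (chromosome, bin index) using fixed-width bins."""
    return Counter((chrom, (pos - 1) // bin_size) for chrom, pos in markers)


markers = informative_markers(variants, "mutant", "parent")
density = variant_density(markers, bin_size=50_000)

for (chrom, bin_index), count in sorted(density.items()):
    start = bin_index * 50_000 + 1
    print(f"{chrom}:{start}-{start + 50_000 - 1}\t{count}")
```

In this toy dataset the variant shared with the parent and the heterozygous site are dropped, and the remaining markers are counted per 50 kb bin. A bin with an unusually high count would be the kind of mutation-rich region in which the causative mutation is expected.
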

After filtering to retain only informative marker mutations, you can simply pass the resulting vcf variants file to the *cloudmap* tool.

Variant Allele Frequency (VAF) Mapping
--------------------------------------

referred to in CloudMap as *Variant Discovery Mapping*

This approach is an extension of the Simple Variant Density Mapping above. Instead of generating a single outcrossed strain over several rounds of crossing, the mutant strain, here, gets crossed only once to the parent strain or an unrelated strain. Then, the non-uniform (segregating) F2 generation is screened for phenotypically mutant individuals, which are sequenced as a pool, an approach often referred to as *bulk segregant analysis*.

Compared to Simple Variant Density Mapping, Variant Allele Frequency Mapping provides finer-grained linkage information with less experimental effort: a variant present in the starting strain is not merely probed for presence or absence after the outcross; instead, the fraction of variant over reference alleles in the sequenced pool provides a direct estimate of the **probability of separating the variant from the phenotype**.

As before, it is essential that the crossing strain sequence is analyzed along with the outcrossed pool so that misinformative variants present already in the crossing strain can be subtracted before the interpretation of the mapping results.

MiModD, by default, always calculates the required ratio between variant- and reference-supporting reads for every detected variant site. Hence, data preparation for Variant Allele Frequency Mapping with CloudMap can proceed almost exactly as with Simple Variant Density Mapping. Sequencing data of the crossing strain and of the outcrossed pool are analyzed together, resulting in a multisample variants file.
However, the *vcf-filter* tool cannot be used efficiently here to eliminate crossing strain variants because the tool defines filters based on genotypes, which, conceptually, does not make sense for the pooled sample. Instead, you should pass the unfiltered multisample variants file to the MiModD *cloudmap* tool, set the mode to *Variant*, and indicate the samples representing the pool and the crossing strain, respectively. The tool will then retain only those variant sites for which there is no evidence of variant reads in the crossing strain sample. The resulting dataset is ready for analysis with CloudMap.

Generating data for use with the CloudMap: Hawaiian Variant Mapping tool
------------------------------------------------------------------------

referred to in CloudMap as *Hawaiian Variant Mapping*

The name of this CloudMap tool is misleading in that it is a general, non-organism-specific Crossing Strain Variant Mapping tool rather than being restricted to the most widely used mapping strain in *C. elegans* research.

The mapping strategy here can be thought of as reversed Variant Discovery Mapping. Just as described in the previous section, bulk F2 segregant analysis is used to obtain linkage information, with the only exception that the variants that are analyzed are those **inherited from the crossing strain** (as opposed to those from the original mutant strain as in Variant Discovery Mapping). Due to this difference, the interpretation of the linkage pattern is reversed in comparison to Variant Discovery Mapping: crossing strain variants tend to be excluded from the phenotypic F2 pool in the vicinity of the phenotype-causing mutation.

If WGS data for the crossing strain is available, MiModD can be used exactly as for Variant Discovery Mapping to prepare data for Hawaiian Variant Mapping analysis with CloudMap.
The only difference is that the *vcf-filter* tool here should be used to retain only the variants that the crossing strain is homozygous for (instead of excluding them).

Most geneticists using this mapping strategy, however, will not resequence the crossing strain, but will use some established crossing strain with already characterized variants for their organism, like the Hawaiian strain for *C. elegans*. Accordingly, MiModD provides the option to use the information about known variant sites instead of aligned reads for the crossing strain at the stage of variant calling. This alternative way of combining MiModD and the CloudMap Hawaiian Variant Mapping tool is described in great detail in the MiModD Tutorial section `Incorporating classical mapping strategies `__.
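
Both bulk segregant strategies described in this section rest on the same quantity: the fraction of variant-supporting reads in the pooled sample, evaluated only at sites selected according to the evidence in the crossing strain. The following Python sketch illustrates that logic under simplifying assumptions. It is not MiModD's or CloudMap's implementation; the read-count layout, the sample names ``pool`` and ``cross``, the function names, and the use of "only variant reads" as a stand-in for "homozygous" are all hypothetical.

```python
# Hypothetical per-site read counts: (chromosome, position, counts), where
# counts maps a sample name to (reference-supporting, variant-supporting)
# read numbers. All values are made up for illustration.
sites = [
    ("chrI", 1000, {"pool": (2, 18), "cross": (25, 0)}),
    ("chrI", 5000, {"pool": (10, 10), "cross": (0, 30)}),
    ("chrI", 9000, {"pool": (19, 1), "cross": (0, 22)}),
    ("chrI", 12000, {"pool": (0, 0), "cross": (20, 0)}),  # no pool coverage
]


def allele_fraction(ref_reads, var_reads):
    """Fraction of variant-supporting reads, or None without coverage."""
    total = ref_reads + var_reads
    return var_reads / total if total else None


def pool_fractions(records, pool, cross, keep_cross_variants):
    """Return (chromosome, position, variant fraction in the pool).

    keep_cross_variants=False keeps sites without any variant reads in the
    crossing strain (markers derived from the mutant strain).
    keep_cross_variants=True keeps sites at which the crossing strain shows
    only variant reads (markers inherited from the crossing strain).
    """
    result = []
    for chrom, pos, counts in records:
        cross_ref, cross_var = counts[cross]
        if keep_cross_variants:
            selected = cross_var > 0 and cross_ref == 0
        else:
            selected = cross_var == 0
        fraction = allele_fraction(*counts[pool])
        if selected and fraction is not None:
            result.append((chrom, pos, fraction))
    return result


mutant_markers = pool_fractions(sites, "pool", "cross", keep_cross_variants=False)
cross_markers = pool_fractions(sites, "pool", "cross", keep_cross_variants=True)
```

Under these assumptions, mutant-derived markers with a pool fraction close to 1 and crossing strain markers with a pool fraction close to 0 would both point toward the region linked to the phenotype, which reflects the reversed interpretation of the two strategies.
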