Mapping-by-Sequencing with MiModD

What is mapping-by-sequencing?

The classical approach to identifying the causative mutation underlying any particular mutant phenotype consists of two separate steps:

  • First, genetic mapping is used to narrow down the genomic region that the mutation resides in.

    This is done by introducing defined genetic markers through crossing to a mapping strain and looking for linked inheritance with the mutant phenotype.

  • Candidate DNA stretches in that region are then sequenced to identify the mutation.

With whole-genome sequencing it has become possible to merge these two steps into one mapping-by-sequencing step and to speed up mutation identification enormously. After any mapping cross, the inheritance pattern for any set of genetic markers can now be determined along with candidate mutations from the same sequencing data.

Moreover, with mapping-by-sequencing essentially all non-causative mutations (including even previously unknown ones) present in any of the strains used for crossing can be used as marker mutations. This makes mapping-by-sequencing not only a fast, but also an extremely sensitive and versatile method.

To be useful for mapping-by-sequencing experiments, analysis tools need to be able to identify mutations, but also to report or visualize the inheritance pattern of marker mutations so that researchers can use that information to identify the most likely candidates for the causative mutation from a potentially long list of all identified variants.

Mapping-by-sequencing support in MiModD

Overview

MiModD offers full support for mapping-by-sequencing analyses - from raw WGS sequencing data all the way to annotated lists of candidate mutations.

_images/mapping-by-sequencing_workflow.png

Mapping-by-sequencing analysis workflows in MiModD.

Typically, such an analysis will start with the sequenced reads files of a mapping sample and of one or two parental samples that contributed marker variants to the mapping and the first steps will be to:

  1. Align the sequenced reads of all samples to the common reference genome in one run of the snap-batch tool from the command line or the Snap Read Alignment tool from Galaxy.

    Remember that joint analysis of several sequencing datasets is one of the strengths of MiModD. If you have sequenced several related samples, you can gain analysis power and simplify your workflow by analyzing them together.

  2. Use the resulting aligned reads file and the reference genome as input to the varcall command line or the Variant Calling Galaxy tool to obtain a bcf file with genome-wide per-nucleotide variant call statistics for all samples.

  3. Extract bona fide variant sites (in any of the samples) from the bcf file and store them in vcf format using the varextract command line tool or the Extract Variant Sites tool in Galaxy.

    3a. At this stage it is possible to merge in independently obtained variant lists for additional samples. See Mapping with known variants for when this can be useful.

At this point, the computationally intensive part of the analysis is completed. While the above steps are long-running tasks that you may want to execute, at least from Galaxy, as an unattended workflow, the remaining steps are fast, but less predictable.

  1. Variants that cannot be causative based on their cross-sample inheritance pattern (e.g., those present in non-phenotypic samples) or that would misguide the linkage analysis can be eliminated using the vcf-filter tool (Galaxy tool name: VCF Filter).
  2. Linkage evidence provided by the remaining variants is analyzed and visualized with the map tool (Galaxy tool name: NacreousMap) to pinpoint causative variants to genomic intervals.
  3. Using this mapping information, the original list of extracted variants from step 3 or 4 can be re-inspected (possibly using the vcf-filter tool again) to identify candidates for the causative variant.
  4. The resulting list of candidate variants can be annotated to learn about their effects on genomic features like transcripts and genes.

Hint

Other chapters of this user guide, in particular the Tutorial, explain in detail how to perform steps 1-3 above to identify variants with MiModD and, if you are very new to MiModD, we recommend you to work through them before returning to this section.

The vcf-filter tool required at steps 4 and 6 is introduced in the Multi-sample analysis section of the Tutorial.

For users of the MiModD Galaxy interface, the accompanying example workflow 2 available from the MiModD download site can serve as a template for automating steps 1-4 above through a Galaxy workflow.

The third section of the Tutorial contains an example of linkage analysis as in step 5, but does not explain things in as much detail as we will try in this section.

Two types of linkage analyses in MiModD

The core tool for linkage analysis in MiModD (step 5 above) is the map tool. It accepts input in VCF format describing variant sites in possibly multiple samples and generated, e.g., by the varextract or vcf-filter tools at steps 3 or 4 above.

Note

The map tool can be used to generate variant linkage reports or plots. For plotting, the tool requires working installations of R and rpy2.

If you want graphical output, but do not want to install additional software, you are encouraged to upload your analysis results to our public NacreousMap server and plot it there.

This tool can be run in either of two modes:

Simple Variant Density (SVD) mapping

In SVD mode the tool generates reports of and/or plots the genome-wide distribution of ALL variants listed in its input without further processing.

The graphical output is in the form of histograms (one per reference chromosome) with user-configurable bin widths of variant counts along the genome. The text report contains the histogram data in table form for import into spreadsheet applicationsn or other analysis software.

In this mode, it is the user’s responsibility to pre-filter the variant input list in an appropriate way so that only meaningful variants are passed to the map tool. With the various options provided by the vcf-filter tool this is a powerful and flexible way of visualizing variant linkage patterns following various kinds of mapping protocol.

Variant Allele Frequency (VAF) mapping

VAF mode is more specialized than SVD mode and intended for bulked segregant analysis, in which data is obtained from a non-uniform (segregating) population of mapping cross progeny sequenced as a pool. In this mode, the map tool reports/plots linkage as the relative contribution to the bulked segregant population of each of two parental alleles at identified biallelic sites.

Methods Background

When a mutant strain displaying a recessive phenotype of interest is crossed to a divergent mapping strain, then, in a bulk of phenotypic F2 progeny, the relative contribution of each parent allele at biallelic variant sites will provide linkage information. While the bulked segregant population should, on average, inherit alleles from the original mutant background and from the mapping strain with approximately equal likelihood, there should be a marked shift towards inheritance of the original mutant background and exclusion of mapping strain alleles in the vicinity of the selected causative mutation.

In Variant Allele Frequency Mapping, the variants present in the starting strains are not just probed for presence or absence in the mapping cross progeny, but the fraction of variant over reference alleles in the sequenced pool provides a direct estimate of the probability of separating the variant from the phenotype, which is why VAF-type analyses often provide very fine-grained linkage information with relatively little experimental effort.

The disadvantage is that the sequence of at least one of the parental backgrounds has to be known - either from previous work or from sequencing along with the bulked segregant population.

Different from SVD mode, VAF mode requires the VCF input to specify variants from multiple samples:

  • one of these samples, obviously, has to be the bulked segregants sample because it holds the inheritance patterns at all biallelic sites, i.e., the crucial mapping information. In MiModD, this sample is referred to as the mapping sample and gets passed to the map tool through its -m or --mapping-sample option.

The two parental backgrounds, variant information of at least one of which must be provided in the VCF input, play opposite roles in a VAF mapping analysis:

  • alleles of the parental background that contributes the mutation to be mapped, behave similar to how they would in a simple backcrossing scheme. They will have a higher likelihood to be inherited by any phenotypic F2 individual the closer they are to the mutation of interest. In MiModD, the sample representing this background is called the related sample (and gets passed to the map tool via the -r or --related-sample option).
  • the other parental background is represented by what is called the unrelated sample in MiModD. It is passed to the map tool through the -u or --unrelated-sample option and inheritance of its alleles is interpreted exactly contrary: low representation in the bulked F2 sample is an indication that the corresponding sites are close to the mutation to be mapped.

The graphical output in VAF mode can, like in SVD mode, take the form of per-chromosome histograms with user-configurable bin widths. These histograms quantify the evidence for linkage to the causative mutation provided by the variants in each bin, i.e., the causative mutation is likely to be found near a peak in the histograms. Like for SVD mode, the text report contains the histogram data in tabular form.

In addition, per-chromosome scatter plots showing allele ratios in the mapping sample at each biallelic variant site can be produced for direct visual identification of regions of linked inheritance. In these plots the y-axis shows observed segregation rate between any variant site and the mutation to be mapped, i.e., the causative mutation is likely to be found in the center of a cluster of data points with y values close to zero. Because with one data point for each biallelic variant site, the scatter plot output becomes crowded easily, a moving average through the y values (more precisely, it is a Loess regression line) can also be plotted and is expected to show a local minimum near the causative variant site.

How linkage evidence is calculated - the technical details

The algorithm used to identify biallelic variant sites and to calculate allele ratios for the mapping sample depends on the type of input you provide for the tool:

  • If the input contains variants found in all three relevant samples - the mapping, the related and the unrelated sample - and you specify the names of all of them through the corresponding command line options, then the map tool can identify biallelic sites directly. All it does is to look for sites at which the related and unrelated samples are homozygous for different alleles. At such sites it then analyzes the contribution of the two alleles to the mapping sample population and forms the ratio.

  • If only one of the two parental samples is available, the tool has to guess the allelic constitution of the missing sample.

    By default, this will be done by considering only sites, at which the available sample is homozygous for a non-reference allele, and assuming that the missing sample is homozygous for the reference. This will typically be a useful assumption if the two parental backgrounds are not closely related themselves.

    Alternatively (with the -i or --infer-missing option specified), the tool will look at all sites, at which the provided sample appears to be homozygous for any allele (including the reference allele) and will use the ratio of this allele over any other allele found in the mapping sample data. Obviously that assumption is VERY speculative and only makes sense to use when combined with appropriate pre-filtering of the variant sites using the vcf-filter tool (see the Backcrossed bulked segregants analysis example later in this chapter for a more detailed explanation of the problem and an illustration how to address it).

The ratio of unrelated over related sample allele contribution calculated through any of the above algorithms is what is shown then in the scatterplot output of the tool.

The histogram plots, on the other hand, show a linkage score for variant sites per bin. This score is calulated as the per-bin sum of individual variant linkage scores, which, in turn, are calulated as:

(related parent allele count - unrelated parent allele count)
-------------------------------------------------------------  ^ 3
(related parent allele count + unrelated parent allele count)

By default, the resulting numbers are normalized to the total number of biallelic sites in the bin and the average linkage evidence provided by the whole corresponding chromosome. The underlying formula used for this normalization is the same as reported in [1], Figure 7. The normalization can be turned off and absolute numbers be plotted with the --no-normalize option.

Which mapping mode should I use?

VAF mode

Simply put, whenever you are analyzing sequencing data from a non-isogenic bulked segregants population, VAF mode should be your first choice. Even though it is often possible to analyze the same data in SVD mode after careful pre-filtering, VAF mode is usually more sensitive at picking up linkage evidence and provides finer mapping.

VAF mode does require variant information from at least two samples - from the bulked segregants population and from at least one of the two parental backgrounds that contributed to its generation. This requirement, however, is less limiting than it may sound at first. For one thing, parental background does not strictly mean that you have to know all the variants present in the parent strains of the mapping cross. Some missing or falsely assumed variants are usually quite tolerable, which means that variant information from a sufficiently related strain will also do. What is more, you can obtain the parental background variants from sequencing the parent or a related strain and analyzing the data along with that of the bulked segregants sample - but you may also use a published variants list (see the info box on Mapping with pre-existing parental variant lists below).

SVD mode

SVD mode should be used to analyze any mapping approach that is based on the sequencing of individuals or (nearly) isogenic populations. In these situations, you know that you can only have one defined genotype at each variant site and analyzing allele ratios will not provide additional information, but just introduce unnecessary noise.

In addition, SVD mode can be used with only one sample and, generally, in a wider range of applications, for which the map tool does not (yet) provide a specialized mode.

Other views on the two modes

Schneeberger [2] has come up with a classification for mapping-by-sequencing strategies. Sticking to his terminology, you would use VAF mode for the analysis of mutant allele frequency (MAF) mapping experiments. SVD mode, on the other hand, could be used for, but is not limited to, homozygosity mapping.

Users of CloudMap may find it convenient to think of SVD and VAF mode as the analogs of the CloudMap EMS Variant Density and Hawaiian Variant Density Mapping tools, respectively, although the mapping modes in MiModD offer significant improvements over their CloudMap counterparts as explained in the section MiModD for users of CloudMap at the end of this chapter.

Hint

Mapping with pre-existing parental variant lists

The simplest way to obtain parent strain variants to work with in VAF mode is to actually sequence and analyze at least one of the parental backgrounds alongside the bulked segregant population. In this case, you can simply combine the sequenced reads obtained from your different samples during the alignment or the variant calling step and tell the map tool about the role (mapping sample, related sample or unrelated sample) of each sample in the linkage analysis. The tool will then automatically detect informative biallelic sites in the multisample variants file produced by variant extraction.

In practice, however, if the mapping strain is (or is closely related to) an established strain in your field with its genomic variants already characterized by others, you may prefer to rely on these previous data instead of resequencing the strain. Often, though, you will not have access to the original sequencing data, but only to published lists of identified variants.

In this common case, you can perform the alignment and variant calling steps just on the data from the bulked segregant sample and include the pre-existing information about variant sites in the parental background at the stage of variant extraction.

To this end, the varextract tool accepts an optional pre-calculated VCF file of variant sites, which it will merge with the variant list extracted from its main called variants input.

This alternative way of including crossing strain information is illustrated in the MiModD Tutorial section on Incorporating mapping-by-sequencing strategies and in the Outcrossed bulked segregants analysis section of the Usage examples below. In addition, you may consult the varextract tool documentation for more information on its -p or --pre-vcf option and the expected format of the pre-calculated VCF file.

Usage examples

Mapping with backcrossed mutant strains

As a typical application that would be analyzed in SVD mode consider a phenotypically defined mutant strain obtained from a mutagenesis screen followed by multiple rounds of backcrossing to the non-mutagenized parent strain.

Although there are typically more powerful strategies, it may be possible to map and identify the causative mutation directly through sequencing the backcrossed strain.

Because of linked inheritance, the phenotypic selection during the backcross will not only have worked on the causative mutation itself, but also on nearby non-causative background mutations introduced during mutagenesis. As a result, the causative mutation, after several rounds of crossing, is expected to be found in the center of a mutation-rich region. This is possible because typical mutagenesis protocols introduce dozens to hundreds of mutations into a genome simultaneously.

To exploit the linkage information, it is essential, of course, to exclude variants from the analysis that exist already in the parent strain and will, thus, be kept during the backcross independently of their distance to the causative mutation. To be able to eliminate these variants, you will have to sequence the non-mutagenized parent strain or, if you have isolated and backcrossed several independent mutant strains from the same mutagenesis screen, sequence all or several of these strains and discard all shared variants as most likely derived from the non-mutagenized parent strain.

Now, let us look at how you might analyze the corresponding sequencing data with MiModD.


Here is your assumed starting material for the variant with one backcrossed strain sequenced together with its parent strain:

ref.fa : fasta format
  the reference genome of your organism

mutant.fw.fq : fastq format
  forward reads from the backcrossed mutant

mutant.rv.fq : fastq format
  reverse reads from the backcrossed mutant

parent.fw.fq : fastq format
  forward reads from the non-mutagenized parent strain

parent.rv.fq : fastq format
  reverse reads from the non-mutagenized parent strain

Step 0 - generate sample metadata for use in the analysis:

# Generate a SAM header for each sample.
# The files are used at the next step to add sample metadata to the
# sequenced reads during the alignment.
# The sample names (specified through the --rg-sm) option is what can
# be used to address a specific sample at later steps.

mimodd header --rg-id 001 --rg-sm bc_mutant --rg-ds "mutant x backcrossed n times to parent strain" -o mutant_hdr.sam
mimodd header --rg-id 002 --rg-sm parent --rg-ds "parent strain of bc_mutant" -o parent_hdr.sam

Step 1 - alignment:

# joint alignment of the bc_mutant and parent sample using paired-end mode for each

mimodd snap-batch -s \
"snap paired ref.fa mutant.fw.fq mutant.rv.fq --iformat fastq --header mutant_hdr.sam -o aligned.bam" \
"snap paired ref.fa parent.fw.fq parent.rv.fq --iformat fastq --header parent_hdr.sam -o aligned.bam"

Step 2 - variant calling:

# combined variant calling for both samples

mimodd varcall ref.fa aligned.bam -o variant_calls.bcf

Step 3 - variant extraction:

# extracting bona fide variants from the binary variant calls file

mimodd varextract variant_calls.bcf -o variants_extracted.vcf

Step 4 - variant filtering:

# Retain only variant sites at which the bc_mutant sample is homozygous mutant
# (genotype 1/1), but at which the parent sample appears to be homozygous wt (0/0),
# i.e., eliminate variants that may have been inherited from the parent strain.

mimodd vcf-filter variants_extracted.vcf --sample bc_mutant parent --gt 1/1 0/0 -o variants_informative.vcf

Step 5 - SVD mapping:

# Generate a plot and a tabular report of the distribution of all variant sites
# that passed the previous step.

mimodd map SVD variants_informative.vcf -o variants_linkage.txt -p variants_linkage.pdf

Step 6 - refiltering variants:

# usually not necessary with just a small number of mutagenesis-induced variants

# We assume that the previous step mapped the causative mutation to
# the interval 4,500,000 - 5,000,000 on chrII.

mimodd vcf-filter variants_informative.vcf --region chrII:4500000-5000000 -o candidates_causative.vcf

And here is a very similar workflow you could use if you had sequenced several independent mutant strains, but not their common parent strain ...


Assumed starting material:

ref.fa : fasta format
  the reference genome of your organism

mut1.fw.fq : fastq format
  forward reads from the backcrossed mutant1 sample

mut1.rv.fq : fastq format
  reverse reads from the backcrossed mutant1 sample

mut2.fw.fq : fastq format
  forward reads from the backcrossed mutant2 sample

mut2.rv.fq : fastq format
  reverse reads from the backcrossed mutant2 sample

mut3.fw.fq : fastq format
  forward reads from the backcrossed mutant3 sample

mut3.rv.fq : fastq format
  reverse reads from the backcrossed mutant3 sample

Step 0 - generate sample metadata for use in the analysis:

# Generate a SAM header for each sample.
# The files are used at the next step to add sample metadata to the
# sequenced reads during the alignment.
# The sample names (specified through the --rg-sm) option is what can
# be used to address a specific sample at later steps.

mimodd header --rg-id 001 --rg-sm bc_mutant1 --rg-ds "mutant1 backcrossed n times to parent strain x" -o mut1_hdr.sam
mimodd header --rg-id 002 --rg-sm bc_mutant2 --rg-ds "mutant2 backcrossed n times to parent strain x" -o mut2_hdr.sam
mimodd header --rg-id 003 --rg-sm bc_mutant3 --rg-ds "mutant3 backcrossed n times to parent strain x" -o mut3_hdr.sam

Step 1 - alignment:

# joint alignment of the bc_mutant samples using paired-end mode for each

mimodd snap-batch -s \
"snap paired ref.fa mut1.fw.fq mut1.rv.fq --iformat fastq --header mut1_hdr.sam -o aligned.bam" \
"snap paired ref.fa mut2.fw.fq mut2.rv.fq --iformat fastq --header mut2_hdr.sam -o aligned.bam" \
"snap paired ref.fa mut3.fw.fq mut3.rv.fq --iformat fastq --header mut3_hdr.sam -o aligned.bam"

Step 2 - variant calling:

# combined variant calling for all samples

mimodd varcall ref.fa aligned.bam -o variant_calls.bcf

Step 3 - variant extraction:

# extracting bona fide variants from the binary variant calls file

mimodd varextract variant_calls.bcf -o variants_extracted.vcf

Step 4 - variant filtering:

# Retain only variant sites at which the bc_mutant1 sample is homozygous mutant
# (genotype 1/1), but at which the other samples appear to be homozygous wt (0/0),
# i.e., eliminate variants that may have been co-inherited from the common parent strain.

mimodd vcf-filter variants_extracted.vcf --sample bc_mutant1 bc_mutant2 bc_mutant3 --gt 1/1 0/0 0/0 -o mut1_variants_informative.vcf

Step 5 - SVD mapping:

# Generate a plot and a tabular report of the distribution of all variant sites
# that passed the previous step.

mimodd map SVD mut1_variants_informative.vcf -o mut1_variants_linkage.txt -p mut1_variants_linkage.pdf

Step 6 - refiltering variants:

# usually not necessary with just a small number of mutagenesis-induced variants

# We assume that the previous step mapped the causative mutation to
# the interval 4,500,000 - 5,000,000 on chrII.

mimodd vcf-filter mut1_variants_informative.vcf --region chrII:4500000-5000000 -o mut1_candidates_causative.vcf

The steps above should identify the causative variant in bc_mutant1. Steps 4-6 could then be repeated with the roles of the mutant samples exhchanged to identify the causative mutations in bc_mutant2 and bc_mutant3.

Outcrossed bulked segregants analysis

In the previous example above we had to rely on background mutations introduced along with the causative variant during mutagenesis. Instead of limiting our set of linkage markers to these co-induced variants we could decide to cross the mutant strains (before or after backcrossing) to a more distantly related mapping strain. This would introduce lots of additional markers into the F1 offspring, which, after a round of selfing, would then be segregating in F2 individuals. After sequencing a bulk of F2 progeny selected for the original mutant phenotype you would expect linkage analysis to reveal a region deprived of mapping strain alleles around the causative mutation.

Since the sequenced sample does not represent an isogenic population with defined genotypes at variant sites, but a bulk of segregants you should prefer to use the map tool in VAF mode for the analysis.

The exact analysis workflow depends on whether you have obtained sequencing data also for the mapping strain or whether you want to rely on a published list of variants so two separate examples are provided below, one for each case.


Assuming you have sequenced the mapping strain along with the F2 pool, this is what could be your starting files:

ref.fa : fasta format
  the reference genome of your organism

mutant_F2.fw.fq : fastq format
  forward reads from the F2 segregant pool

mutant_F2.rv.fq : fastq format
  reverse reads from the F2 segregant pool

polymorph.fw.fq : fastq format
  forward reads from the polymorphic mapping strain

polymorph.rv.fq : fastq format
  reverse reads from the polymorphic mapping strain

Step 0 - generate sample metadata for use in the analysis:

# Generate a SAM header for each sample.
# The sample names (specified through the --rg-sm) option is what can
# be used to address a specific sample at later steps.

mimodd header --rg-id 001 --rg-sm F2_pool --rg-ds "mutant F2 pool obtained from outcrossing mutant x to mapping strain y" -o pool_hdr.sam
mimodd header --rg-id 002 --rg-sm mapping_strain --rg-ds "mapping strain used to generate F2_pool" -o polym_hdr.sam

# just to point out that you can combine sample metadata and sequenced reads before the alignment:
# this will produce a BAM file of the unaligned sequenced reads for each sample with the header included.
# When archiving data this should be your preferred format since it protects against loss of sample annotations.

mimodd convert mutant_F2.fw.fq mutant_F2.rv.fq --iformat fastq_pe --oformat bam -h pool_hdr.sam -o mutant_F2.bam
mimodd convert polymorph.fw.fq polymorph.rv.fq --iformat fastq_pe --oformat bam -h polym_hdr.sam -o polymorph.bam

Step 1 - alignment:

# aligning from BAM files with combined and annotated paired-end data

mimodd snap-batch -s \
"snap paired ref.fa mutant_F2.bam --iformat bam -o aligned.bam" \
"snap paired ref.fa polymorph.bam --iformat bam -o aligned.bam"

Step 2 - variant calling:

# combined variant calling for all samples

mimodd varcall ref.fa aligned.bam -o variant_calls.bcf

Step 3 - variant extraction:

# extracting bona fide variants from the binary variant calls file

mimodd varextract variant_calls.bcf -o variants_extracted.vcf

Step 4 - VAF mapping:

# VAF mapping here can be done directly without prior filtering.
# The map tool will only consider variant sites that appear to be homozygous
# mutant in the mapping strain.

mimodd map VAF variants_extracted.vcf -m F2_pool -u mapping_strain -o variants_linkage.txt -p variants_linkage.pdf

Step 5 - filtering variants:

# Post-filtering is essential in this example because the F2_pool data will
# contain a huge number of variants inherited from the mapping strain.
# We eliminate these by asking for variant sites at which the selected F2_pool is homozygous mutant,
# but at which the mapping strain has a wt genotype.

# We assume that the previous step mapped the causative mutation to
# the interval 4,500,000 - 5,000,000 on chrII.

mimodd vcf-filter variants_extracted.vcf --region chrII:4500000-5000000 --sample F2_pool mapping_strain --gt 1/1 0/0 -o candidates_causative.vcf

If you have a variant list of the mapping strain instead of actual sequenced reads ...


your starting files may look like these:

ref.fa : fasta format
  the reference genome of your organism

mutant_F2.fw.fq : fastq format
  forward reads from the F2 segregant pool

mutant_F2.rv.fq : fastq format
  reverse reads from the F2 segregant pool

variants_polymorph.vcf : vcf format
  preidentified variants for the polymorphic mapping strain

Step 0 - generate sample metadata for use in the analysis:

# Generate a SAM header for the sequenced sample.

mimodd header --rg-id 001 --rg-sm F2_pool --rg-ds "mutant F2 pool obtained from outcrossing mutant x to mapping strain y" -o pool_hdr.sam

# just to point out that you can combine sample metadata and sequenced reads before the alignment
# recommended for archiving as explained in the previous example

mimodd convert mutant_F2.fw.fq mutant_F2.rv.fq --iformat fastq_pe --oformat bam -h pool_hdr.sam -o mutant_F2.bam

Step 1 - alignment:

# aligning from a single BAM file with combined and annotated paired-end data

mimodd snap paired ref.fa mutant_F2.bam --iformat bam -o aligned.bam

Step 2 - variant calling:

# variant calling
# identical to previous example though there is only one sample this time

mimodd varcall ref.fa aligned.bam -o variant_calls.bcf

Step 3 - variant extraction:

# extract bona fide variants from the binary variant calls file and
# MERGE them WITH THE KNOWN VARIANTS from the mapping strain through use of the --p option.
# If variants_polymorph.vcf provides a sample name, this name can be used to address
# the mapping strain sample at later steps, otherwise the name "external_source_1" will be used.

mimodd varextract variant_calls.bcf -p variants_polymorph.vcf -o variants_extracted.vcf

Step 4 - VAF mapping:

# VAF mapping here can be done directly without prior filtering.
# The map tool will only consider variant sites that appear to be homozygous
# mutant in the mapping strain.

mimodd map VAF variants_extracted.vcf -m F2_pool -u external_source_1 -o variants_linkage.txt -p variants_linkage.pdf

Step 5 - filtering variants:

# Post-filtering is essential in this example because the F2_pool data will
# contain a huge number of variants inherited from the mapping strain.
# We eliminate these by asking for variant sites at which the selected F2_pool is homozygous mutant,
# but at which external_source_1 has a wt genotype.

# We assume that the previous step mapped the causative mutation to
# the interval 4,500,000 - 5,000,000 on chrII.

mimodd vcf-filter variants_extracted.vcf --region chrII:4500000-5000000 --sample F2_pool external_source_1 --gt 1/1 0/0 -o candidates_causative.vcf

Backcrossed bulked segregants analysis

As a combination of the first two examples, you may consider a strategy, in which a mutant obtained from a mutagenesis screen is crossed to the non-mutagenized parent strain just once followed by a round of inbreeding and pooling of phenotypic F2 progeny.

In this case, the mapping relies on background mutations introduced during mutagenesis as in the first example, but uses a bulked segregants approach as in the second.

The fact that you have sequenced a non-isogenic bulked population serves as an indication that using the map tool in VAF mode may give better results than an analysis using SVD mode.

What is tricky to get right in this example is the role of the crossing strain. Since we use the non-mutagenized parent strain it is tempting to think of it as the related sample, but it is essential to realize that this term in the context of the map tool refers to the parental sample variants from which would indicate linkage to the causative mutation when inherited by F2 progeny. In the case here, however, only inheritance of mutagenesis-induced variants would be regarded as evidence for linkage, while inheritance of the corresponding non-mutagenized parent allele is evidence against linkage. So, in fact, the crossing strain needs to be assigned the role of the unrelated sample for the linkage analysis to work as expected.

What is more, in the absence of sequencing data for the original mutant strain, we have to deduce the mutagenesis-induced variants from the allele spectrum found in the segregants pool. This is achieved by providing the -i or --infer-missing option to the map tool, but there is a danger associated with doing so: variant sites at which the F2 segregants pool appears to be almost pure for the reference allele, may be sites at which a true mutagenesis-induced variant became, by chance, almost entirely eliminated from the F2 pool, which would serve as evidence against linkage between the site and the causative variant. Quite likely though, the traces of a non-reference allele found in the F2 pool are just a sequencing or analysis artefact at what is really a pure reference allele in all strains involved. Because of this ambiguity it is typically required to filter strictly when the -i option is used for mapping and to remove the questionable sites before the linkage analysis. This is typically easiest to achieve by using an allele-frequency filter (see the following suggested workflow and the documentation of the vcf-filter tool.

The following example analysis considers all of the above aspects.


Here are your presumed starting files:

ref.fa : fasta format
  the reference genome of your organism

mutant_F2.fw.fq.gz : gzipped fastq format
  forward reads from the F2 segregant pool

mutant_F2.rv.fq.gz : gzipped fastq format
  reverse reads from the F2 segregant pool

variants_polymorph.vcf : vcf format
  preidentified variants for the polymorphic mapping strain

and this is how you may analyze them ...

Step 0 - generate sample metadata for use in the analysis:

mimodd header --rg-id 001 --rg-sm F2_pool --rg-ds "mutant F2 pool obtained from outcrossing mutant x to its non-mutagenized parent strain" -o bc_pool_hdr.sam
mimodd header --rg-id 002 --rg-sm parent --rg-ds "non-mutagenized parent strain of mutant x" -o parent_hdr.sam

Step 1 - alignment:

# similar to previous examples
# for a change, we assume the sequenced reads to come from gzip-compressed files

mimodd snap-batch -s \
"snap paired ref.fa mutant_F2.reads1.fq.gz mutant_F2.reads2.fq.gz --iformat gz --header bc_pool_hdr.sam -o mimodd_aln.bam" \
"snap paired ref.fa parent.reads1.fq.gz parent.reads2.fq.gz --iformat gz --header parent_hdr.sam -o mimodd_aln.bam"

Step 2 - joint variant calling:

mimodd varcall ref.fa mimodd_aln.bam -o mimodd_calls.bcf

Step 3 - variant extraction:

mimodd varextract mimodd_calls.bcf -o mimodd_extracted_variants.vcf

Step 4 - variant filtering:

# Filtering is crucial here!
# Variants present already in the parental background HAVE to be removed so
# we use the --gt filter to retain only variant sites at which the parent is
# homozygous for the reference allele, but we allow ANY genotype for the F2
# pool.
# Because we want to map with the -i option later, we also need to remove
# any variant sites at which the F2 pool is almost pure for the reference
# allele. For this we use the --af filter and keep only sites at which the
# predominant non-reference allele contributes at least 30% to the F2 pool.
# To make both the original genotype assignment and the allele frequencies
# more reliable we also keep only sites at which they are backed by at least
# 5 informative sequence reads in each sample.

mimodd vcf-filter mimodd_extracted_variants.vcf -s F2_pool parent --gt ANY 0/0 --af :0.3: :: --dp 5 5 -o mimodd_BCF2_variants.vcf

Step 5 - VAF mapping:

# The parent sample needs to be assigned the role of the unrelated sample using the -u option.

# To have the mutagenesis-induced variants in the unavailable original mutant deduced from the
# allele composition of the F2 pool, we use the -i/--infer-missing option.

# -l 0 turns off the calculation of the Loess regression line for the scatter plots,
# which is not very helpful with limited numbers of variants.

mimodd map VAF mimodd_extracted_variants.vcf -m F2_pool -u parent -i -o mimodd_bcf2_linkage.txt -p mimodd_bcf2_linkage.pdf -l 0

Step 6: refiltering of variants:

# The F2 pool should be homozygous mutant for the causative mutation,
# which should not be present in the parental background.

# We assume that the previous step mapped the causative mutation to
# the interval 0 - 6,000,000 on chrIII.

mimodd vcf-filter mimodd_extracted_variants.vcf -s F2_pool parent --gt 1/1 0/0 -r chrIII:0-6000000 -o candidate_variants.vcf

MiModD for users of CloudMap

For several years, CloudMap [1] - accessible online via the main Galaxy server at https://usegalaxy.org - has been a popular choice for the analysis of mapping-by-sequencing experiments conducted with model organisms. MiModD borrows several key ideas and, consequently, shows some similarities, with that earlier software, but aims at providing a much improved user experience.

Where CloudMap makes use of Galaxy workflows to combine existing Galaxy tools into a pipeline for WGS reads alignment and variant calling, and only provides custom code for linkage analysis, MiModD is a standalone solution for complete mapping-by-sequencing analyses. As a consequence, MiModD can get designed for efficiency and user-friendliness instead of around the input/output requirements of individual tools.

As for linkage analysis, the map tool was inspired by the CloudMap EMS Variant Density Mapping, Variant Discovery Mapping and Hawaiian Variant Mapping tools. The switch from CloudMap to MiModD for linkage analysis is easy because the map tool generates plots that look very similar to those produced by the CloudMap tools. At the same time though, the map tool represents a complete rewrite of and fixes many bugs and inefficiencies found in the original CloudMap tools, among them:

  • a more powerful algorithm for allele frequency analyses

    The tool enables mutation mapping using markers from two parents simultaneously (effectively, combining the CloudMap Variant Discovery Mapping and Hawaiian Variant Mapping approaches into one).

Hint

MiModD distinguishes between a related and an unrelated parent sample in variant allele frequency mapping with respect to the original background of the causative mutation. In CloudMap, you would analyze the variants contributed by the related parent to the allele pool of a bulked segregants population with the Variant Density Mapping tool, while those inherited from the unrelated parent would be analyzed separately with the Hawaiian Variant Mapping tool. The MiModD map tool, in contrast, lets you analyze both sets of variants together for higher mapping accuracy.

  • same control over variant density plots as over allele frequency plots
  • generation of small, responsive pdfs (up to 50x smaller than files generated with CloudMap)
  • more plotting options (scatterplots in any combination of colors with or without a Loess regression line, any number of bin-widths in histograms)
  • true species-independence of all plots
  • less unhandled errors and clearer error messages

If you have your data analyzed through a CloudMap-based workflow, you can still take advantage of these MiModD-specific enhancements to the linkage analysis step because the MiModD map tool can analyze variants found in CloudMap-generated VCF files. As the only restriction, you will need to provide a CloudMap-style sequence dictionary file through the -s or --seqdict-file option (even if CloudMap itself would not require such a file for the specific species you are working with).

Hint

If all you want to use MiModD for is for generating linkage plots, you may be interested in doing so on our public NacreousMap plotting server, instead of installing MiModD on your local system.

If, conversely, you want to use MiModD for the upstream analysis of your WGS data, but to use CloudMap for plotting, the map tool lets you produce CloudMap-ready VCF-like files by combining its --cloudmap and -t or --text-file options.


[1](1, 2) Minevich et al. (2012): CloudMap: A Cloud-Based Pipeline for Analysis of Mutant Genome Sequences. Genetics, 192:1249-69.
[2]Schneeberger et al. (2014): Using next-generation sequencing to isolate mutant genes from forward genetic screens. Nature Reviews Genetics, 15:662-676.