File Formats used by MiModD

Work in progress

We apologize, but this section of the User Guide is still very incomplete.

FASTA

FASTA is a text format that can store multiple sequences in a single file.

Each sequence begins with a single-line description, followed by lines of sequence data. Description lines are distinguished from sequence data lines by a greater-than symbol (>) at the beginning of the line.

Lines can be terminated with either CR+LF (Windows-style) or LF (Unix/Linux-style). Blank lines are not allowed.

The sequence data lines should be formatted as blocks of equal line length.
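The layout rules above can be illustrated with a short Python sketch. It is not part of MiModD; the function name check_fasta_layout is hypothetical, and the sketch assumes that only the final line of each sequence may be shorter than the other lines of that sequence.

```python
def check_fasta_layout(text):
    """Return a list of layout problems found in FASTA-formatted text."""
    problems = []
    records = []  # one [name, line_widths] entry per sequence
    # Accept both CR+LF and LF line terminators.
    lines = text.replace("\r\n", "\n").split("\n")
    if lines and lines[-1] == "":
        lines.pop()  # a final line terminator is not a blank line
    for number, line in enumerate(lines, start=1):
        if line == "":
            problems.append(f"line {number}: blank line")
        elif line.startswith(">"):
            records.append([line[1:], []])
        elif not records:
            problems.append(f"line {number}: sequence data before first description line")
        else:
            records[-1][1].append(len(line))
    for name, widths in records:
        body, last = widths[:-1], widths[-1:]
        # Assumption: all lines but the last share one length,
        # and the last line is not longer than the others.
        if len(set(body)) > 1 or (body and last[0] > body[0]):
            problems.append(f"{name}: sequence lines are not block-formatted")
    return problems
```

An empty list means that no layout problem was detected under these assumptions.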

In MiModD, FASTA format is used exclusively for reference genome input files, and MiModD-specific restrictions apply to the description lines in these files. Specifically, description lines must not contain:

  • non-printable or non-ASCII characters
  • whitespace characters
  • any of the characters: <>[]*;=,

This restriction is enforced by all tools that require a FASTA reference genome. The MiModD.sanitize tool can be used to substitute illegal characters in description lines; it also ensures that sequence data lines are block-formatted.
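As a rough illustration of the character rules listed above, the following Python sketch checks a single description line. It is not the validation code used by MiModD; the function name description_line_problems is hypothetical, and treating the leading > marker as exempt from the character check is an assumption.

```python
# Illustrative sketch only; not the check that MiModD itself performs.
ILLEGAL_CHARS = set("<>[]*;=,")

def description_line_problems(line):
    """Return a list of rule violations found in one FASTA description line."""
    name = line.rstrip("\r\n")
    # Assumption: the leading '>' marks the line type and is not part of the name.
    if name.startswith(">"):
        name = name[1:]
    problems = []
    if any(ord(c) > 127 for c in name):
        problems.append("non-ASCII character")
    if any(c.isspace() for c in name):
        problems.append("whitespace character")
    # Control characters that are not already counted as whitespace.
    if any(ord(c) < 32 or ord(c) == 127 for c in name if not c.isspace()):
        problems.append("non-printable character")
    if any(c in ILLEGAL_CHARS for c in name):
        problems.append("one of the characters <>[]*;=,")
    return problems
```

For example, description_line_problems(">chr I") reports a whitespace problem, while ">chrI" yields an empty list.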

Note

The character restriction exists because MiModD uses the full content of the description line as the sequence name, and this name must be a valid sequence name in all downstream data formats generated during an analysis.

See also

MiModD tools that use FASTA input files

snap, snap-batch, snap-index, varcall

MiModD tools to manipulate FASTA files

MiModD.sanitize


sam

See also

MiModD tools that accept sam input files

snap, snap-batch

MiModD tools that produce sam output files

snap, snap-batch, header

MiModD tools to manipulate sam files

convert, reheader


bam

See also

MiModD tools that accept bam input files

snap, snap-batch, varcall, delcall, MiModD.index

MiModD tools that produce bam output files

snap, snap-batch

MiModD tools to manipulate bam files

convert, reheader, sort


vcf

See also

MiModD tools that use vcf input files

MiModD tools that produce vcf output files

varextract

MiModD tools to manipulate vcf files

vcf-filter


bcf

See also

MiModD tools that use bcf input files

MiModD tools that produce bcf output files

varcall

MiModD tools to manipulate bcf files


CloudMap-style sequence dictionary

This format is defined by the CloudMap analysis pipeline, in which it is referred to as an Other Species Configuration File.

It is used to specify the names of the contigs (or chromosomes) along with their sizes that make up a given reference genome.

Where needed, MiModD embeds this information into its output files, so a sequence dictionary file is never needed as long as you are working with MiModD-generated files. Currently, only the MiModD map tool provides an option for specifying such a file as input, which enables its use with external input files.

A sequence dictionary file has a simple two-column tab-delimited format, in which each line consists of a contig name (exactly as it appears in the corresponding reference file) in the first column and the length of that contig in megabases (rounded up) in the second.

As an example, this is what a sequence dictionary for the six chromosomes of the roundworm C. elegans could look like:

I     16
II    16
III   14
IV    18
V     21
X     18
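The two-column layout and the rounding rule described above can be sketched in Python as follows. This is only an illustration and not part of MiModD or CloudMap; the function name format_sequence_dict and the contig names and lengths in the usage line are hypothetical, and one megabase is assumed to mean 1,000,000 bases.

```python
def format_sequence_dict(contigs):
    """Build sequence dictionary text from (name, length_in_bases) pairs."""
    lines = []
    for name, length in contigs:
        # Integer ceiling division: round the length up to whole megabases.
        megabases = -(-length // 1_000_000)
        # Columns are separated by a single tab character.
        lines.append(f"{name}\t{megabases}\n")
    return "".join(lines)

# Hypothetical contigs, for illustration only.
print(format_sequence_dict([("chrA", 1_500_000), ("chrB", 2_000_000)]), end="")
```

The names passed in would have to match the sequence names in the reference file exactly.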

This and a few other sequence dictionary files are available as shared data through our online version of NacreousMap, but remember that you may have to modify the sequence names to match those defined in your reference file before using any pre-made sequence dictionary.

See also

MiModD tools that use CloudMap-style sequence dictionaries

map