File Formats and Compatibility

This section provides detailed information about specifications and restrictions of the file formats read and produced by MiModD.

Wherever possible, MiModD tries to adhere to the official specification of each file format used. Certain formats, however, have ambiguous specifications or have specifications that are incompatible with those of other formats in MiModD analysis workflows. In these cases, we are forced to interpret or modify the specs for the sake of clarity and user experience.

We have tried hard to make error messages, in particular those related to file format violations, as clear and user-friendly as possible. In most situations, when you pass data to a MiModD tool that cannot process it, the tool should complain loudly and clearly. If the error message is not enough to understand the exact cause of the problem, it may help to study the definition of the problematic file format in the list below.

In addition, the format specifications can help you avoid incompatibilities should you want to pass MiModD-generated files to other software tools.

Character Encoding

The definitions of many text-based file formats used in bioinformatics do not specify their expected character encoding explicitly, but almost all implicitly expect English alphabet characters to be ASCII-encoded. Characters outside the ASCII range, if not explicitly disallowed in a format, are a common cause of problems and incompatibilities. In recent years, UTF-8, a superset of ASCII, has emerged as the standard for universal character encoding, and some modern bioinformatics file format specifications (see the VCF format below) already define UTF-8 as the file encoding of the format.

Since version 0.1.7.2, all output produced by MiModD is UTF-8 encoded. Any input is expected to be UTF-8 encoded as well, but most tools will handle decoding errors flexibly. For users this means:

When passing data to MiModD:

  • if your data (including sequence/chromosome names, annotations, sample names, etc.) contains only ASCII-range characters, you will, in most circumstances, not have to worry about its encoding. If the data has been stored in an ASCII-compatible encoding (UTF-8, latin-1, pure ASCII, etc.), things will just work.
  • if your data contains characters outside the ASCII-range (like non-English alphabet characters) and the input format allows them, you should make sure the input data has been saved in UTF-8 encoding. If not, you should be prepared to see your special characters being misdecoded and not be readable anymore.

When using MiModD-generated data with other tools:

  • if your data contains only ASCII-range characters, you can pass output from MiModD to any other tool that expects ASCII-encoded input.
  • if your data contains characters outside the ASCII-range, you are at the mercy of the downstream tool, which may or may not expect UTF-8 encoding.
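To find out which of the cases above applies to a given input file, you can test whether its bytes decode as ASCII or as UTF-8. The following is a minimal sketch using only the Python standard library. The function names are hypothetical and are not part of MiModD.

```python
from pathlib import Path


def classify_encoding(data: bytes) -> str:
    """Return 'ascii', 'utf-8' or 'other' for a chunk of raw bytes."""
    try:
        data.decode("ascii")
        return "ascii"          # only ASCII-range characters
    except UnicodeDecodeError:
        pass
    try:
        data.decode("utf-8")
        return "utf-8"          # non-ASCII characters, valid UTF-8
    except UnicodeDecodeError:
        return "other"          # e.g. latin-1 with non-ASCII characters


def classify_file(path: str) -> str:
    """Classify a whole file (reads it into memory, so use on small files)."""
    return classify_encoding(Path(path).read_bytes())
```

A result of "other" indicates that non-ASCII characters are present and have not been saved as UTF-8, which is the situation in which special characters may be misdecoded.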

Hint

To increase the chance of successful communication with other tools, make sure UTF-8 is set as default encoding in your system locale settings. MiModD reads and writes UTF-8, independent of your system encoding settings, but other programs may use the system encoding.

MiModD File Format Specifications

FASTA

FASTA is a text format that can store multiple sequences in a single file.

Each sequence begins with a single-line description, followed by lines of sequence data. Description lines are distinguished from sequence data lines by a greater-than > symbol at the beginning of the line.

Lines can be terminated with either CR+LF (Windows-style) or LF (Unix/Linux-style). Blank lines are not allowed.

The sequence data lines should be formatted as blocks of equal line length.
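As an illustration of the block-formatting requirement, the following sketch lists the sequences whose data lines are not of equal length. It assumes that the last line of each sequence may be shorter than the others, and it treats blank lines as formatting errors. The function names are hypothetical, and this is not the check that MiModD itself performs.

```python
from typing import Iterable, List, Sequence


def is_block_formatted(lengths: Sequence[int]) -> bool:
    """True if all lines share one width; the last may be shorter (assumption)."""
    if not lengths:
        return True
    *body, last = lengths
    width = body[0] if body else last
    return all(n == width for n in body) and 0 < last <= width


def badly_formatted_records(lines: Iterable[str]) -> List[str]:
    """Return the description lines (without '>') of offending sequences."""
    offenders = []
    name = None
    lengths: List[int] = []
    for raw in lines:
        line = raw.rstrip("\r\n")       # accept both LF and CR+LF
        if line.startswith(">"):
            if name is not None and not is_block_formatted(lengths):
                offenders.append(name)
            name, lengths = line[1:], []
        else:
            lengths.append(len(line))
    if name is not None and not is_block_formatted(lengths):
        offenders.append(name)
    return offenders
```

Because the function accepts any iterable of lines, an open text file can be passed to it directly.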

In MiModD, FASTA format is used exclusively for reference genome input files and MiModD-specific restrictions apply to the description lines found in the files. Specifically, description lines must not contain:

  • non-printable or non-ASCII characters
  • whitespace characters
  • any of the characters: <>[]*;=,

These restrictions are enforced by all tools that require a FASTA reference genome. The MiModD.sanitize tool can be used to substitute illegal characters in description lines and also ensures that sequence data lines are block-formatted.
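The character rules listed above can be checked before passing a file to MiModD. The sketch below reports the disallowed characters found in a description line. It assumes that the leading > marker itself is not part of the restricted text. The names used are hypothetical, and the code is not taken from MiModD or from MiModD.sanitize.

```python
from typing import List

# Characters explicitly forbidden in description lines (from the list above).
FORBIDDEN = set("<>[]*;=,")


def illegal_characters(name: str) -> List[str]:
    """Return every character in name that violates the rules above."""
    bad = []
    for ch in name:
        printable_ascii = 32 < ord(ch) < 127    # '!' through '~'
        if ch.isspace() or ch in FORBIDDEN or not printable_ascii:
            bad.append(ch)
    return bad


def check_description_line(line: str) -> List[str]:
    """Check one FASTA description line, ignoring the leading '>' marker."""
    if not line.startswith(">"):
        raise ValueError("not a FASTA description line")
    return illegal_characters(line[1:].rstrip("\r\n"))
```

An empty list means the description line passes these three rules. Any returned characters would need to be substituted, for example with MiModD.sanitize.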

Note

The character restriction exists because MiModD will use the full content of the description line as the sequence name and we must ensure that this name is a valid sequence name in all downstream data formats generated during any analysis.

See also

MiModD tools that use fasta input files

snap, snap-batch, index, varcall

MiModD tools to manipulate fasta files

MiModD.sanitize


SAM

The Sequence Alignment/Map format is a TAB-delimited text format defined as part of the hts-specs project.

MiModD sticks quite closely to the official specification, in particular:

  • SAM headers must follow the official format, where every line (except comment lines) must conform to the following regular expression:

    /^@[A-Z][A-Z](\t[A-Za-z][A-Za-z0-9]:[ -~]+)+$/
    

    This also means that only printable characters from the ASCII range are allowed in non-comment header lines.

    Comment lines must conform to the regular expression:

    /^@CO\t.*/
    

    meaning they have to start with @CO and a TAB, but the actual comment text can consist of any characters (which MiModD will encode using UTF-8).

  • Body lines consist of eleven mandatory TAB-delimited columns, optionally followed by further TAB-delimited fields, with the exact character restrictions given in the official specs.
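The two header regular expressions quoted above can be applied directly with Python's re module. The following sketch is only an illustration of those expressions. The function name is hypothetical, and the code does not reproduce MiModD's own validation.

```python
import re

# Regular expressions as quoted above; fullmatch() replaces the ^ and $ anchors.
HEADER_LINE = re.compile(r"@[A-Z][A-Z](\t[A-Za-z][A-Za-z0-9]:[ -~]+)+")
COMMENT_LINE = re.compile(r"@CO\t.*")


def is_valid_header_line(line: str) -> bool:
    """Check a single SAM header line against the official patterns."""
    line = line.rstrip("\r\n")
    if line.startswith("@CO\t"):
        # Comment lines may contain arbitrary characters after the TAB.
        return COMMENT_LINE.match(line) is not None
    return HEADER_LINE.fullmatch(line) is not None


if __name__ == "__main__":
    for example in ("@HD\tVN:1.6\tSO:coordinate", "@SQ SN:chrI"):
        print(repr(example), is_valid_header_line(example))
```

Comment lines are tested first because their text is exempt from the printable-ASCII restriction that applies to all other header lines.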

See also

MiModD tools that accept SAM input files

snap, snap-batch

MiModD tools that produce sam output files

snap, snap-batch, header

MiModD tools to manipulate sam files

convert, reheader


BAM

The Binary Alignment/Map format is the binary companion of the SAM format and is also defined as part of the hts-specs project.

MiModD sticks to these specifications as it does for the SAM format.

See also

MiModD tools that accept bam input files

snap, snap-batch, varcall, delcall, index

MiModD tools that produce bam output files

snap, snap-batch

MiModD tools to manipulate bam files

convert, reheader, sort


VCF

The Variant Call Format is defined as part of the hts-specs project.

See also

MiModD tools that use vcf input files

map, varreport

MiModD tools that produce vcf output files

varextract

MiModD tools to manipulate vcf files

vcf-filter, rebase, annotate


BCF

The Binary Call Format is the binary counterpart of VCF and is likewise defined as part of the hts-specs project.

See also

MiModD tools that use bcf input files

varextract, delcall, covstats

MiModD tools that produce bcf output files

varcall

MiModD tools to manipulate bcf files


CloudMap-style sequence dictionary

This format is defined by the CloudMap analysis pipeline, in which it is referred to as an Other Species Configuration File.

It is used to specify the names of the contigs (or chromosomes) along with their sizes that make up a given reference genome.

Where needed, MiModD embeds this information into its output files. As long as you are working with MiModD-generated files, a sequence dictionary file is therefore never needed. Currently, only the MiModD map tool provides an option for specifying such a file as input, which enables its use with external input files.

A sequence dictionary file has a simple two-column tab-delimited format, in which each line consists of a contig name (exactly as it appears in the corresponding reference file) in the first column and the length of that contig in megabases (rounded up) in the second.

As an example, this is what a sequence dictionary for the six chromosomes of the roundworm C. elegans could look like:

I     16
II    16
III   14
IV    18
V     21
X     18

This and a few other sequence dictionary files are available as shared data through our online version of NacreousMap, but remember that you may have to modify the sequence names to match those defined in your reference file before using any pre-made sequence dictionary.
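If no pre-made file fits your reference, a sequence dictionary in the two-column format described above can be written from known contig lengths. The sketch below assumes the lengths are available in base pairs. The function name and the contig names and lengths in the usage example are hypothetical placeholders, not real genome data.

```python
from typing import Dict


def sequence_dictionary(contig_lengths_bp: Dict[str, int]) -> str:
    """Format contig names and lengths as a tab-delimited sequence dictionary."""
    lines = []
    for name, length_bp in contig_lengths_bp.items():
        # Length in megabases, rounded up, using integer arithmetic.
        megabases = (length_bp + 999_999) // 1_000_000
        lines.append(f"{name}\t{megabases}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Placeholder contigs for illustration only.
    print(sequence_dictionary({"chrA": 1_500_000, "chrB": 3_000_000}), end="")
```

The contig names passed in must match the names in the reference file exactly, as noted above.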

See also

MiModD tools that use CloudMap-style sequence dictionaries

map