mcast [options] <motif file> <sequence file>
MCAST computes statistical confidence estimates by generating at least 100 random sequences that have GC-contents spanning the same range as the discovered matches (plus 500bp on either side of each cluster). After all matches have been found in your input sequences, MCAST generates the random sequences and scores them using the same algorithm as for the input sequences. MCAST bins the scores of the matches found in the random sequences according to the GC-content of the random sequence. For each match in the input sequences, MCAST determines its GC-content and looks up the mean random score in the appropriate GC-bin. MCAST then uses this mean score to estimate the p-value of the match, which is in turn used to compute its E-value and q-value. Note: This binning approach differs from the approach described in the original MCAST paper mentioned below, because MCAST no longer assumes that there is a linear relationship between match GC-content and match score.
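The binning and lookup steps described above can be sketched in Python. This is an illustrative sketch, not MCAST's implementation: the function names, the choice of 10 equal-width GC bins, and the exponential form p = exp(-score / mean) are all assumptions. The text above only states that the mean random score of the matching GC-bin is used to estimate the p-value.

```python
import math
from collections import defaultdict


def gc_content(seq):
    """Fraction of G/C bases in a DNA string (case-insensitive)."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0


def gc_bin(gc, n_bins=10):
    """Map a GC fraction in [0, 1] to a bin index in [0, n_bins - 1]."""
    return min(int(gc * n_bins), n_bins - 1)


def mean_score_by_bin(random_matches, n_bins=10):
    """Average the scores of random-sequence matches within each GC bin.

    random_matches: iterable of (gc_fraction, score) pairs.
    """
    totals = defaultdict(lambda: [0.0, 0])  # bin -> [score sum, count]
    for gc, score in random_matches:
        entry = totals[gc_bin(gc, n_bins)]
        entry[0] += score
        entry[1] += 1
    return {b: total / count for b, (total, count) in totals.items()}


def estimate_p_value(score, gc, bin_means, n_bins=10):
    """ASSUMPTION: exponential tail with the bin's mean random score.

    Requires a positive mean for the bin that the match falls into.
    """
    mean = bin_means[gc_bin(gc, n_bins)]
    return math.exp(-score / mean)
```

A match whose GC-content falls into a bin with no random matches raises a `KeyError` in this sketch; how MCAST handles that case is not described here.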
When computing statistical confidence estimates, MCAST must retain the matches in memory until the final distribution of scores can be estimated. This means that scanning genome-sized datasets has the potential to exhaust all available memory. To avoid this problem, MCAST uses reservoir sampling of the match scores and limits the number of matches that are kept in memory. The default number of matches kept in memory is 100,000, but this value can be adjusted via the --max-stored-scores option. If the maximum number of stored matches is reached, MCAST drops the least significant half of the matches. As a result, some matches may be missing from the MCAST output even though they would have satisfied the user-specified p-value or q-value threshold.
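The "drop the least significant half" behavior can be illustrated with a small Python sketch. The class and attribute names are hypothetical, and the sketch assumes that a higher score means a more significant match. It does not reproduce the reservoir sampling of scores or MCAST's actual data structures.

```python
class BoundedMatchStore:
    """Keep at most max_stored matches; halve the store when it fills up."""

    def __init__(self, max_stored=100_000):
        if max_stored < 2:
            raise ValueError("max_stored must be at least 2")
        self.max_stored = max_stored
        self.matches = []   # list of (score, match_id) tuples
        self.dropped = 0    # number of matches discarded so far

    def add(self, score, match_id):
        self.matches.append((score, match_id))
        if len(self.matches) >= self.max_stored:
            self._drop_least_significant_half()

    def _drop_least_significant_half(self):
        # ASSUMPTION: higher score = more significant match.
        self.matches.sort(key=lambda m: m[0], reverse=True)
        keep = len(self.matches) // 2
        self.dropped += len(self.matches) - keep
        del self.matches[keep:]
```

Because discarded matches are never revisited, a match dropped early cannot reappear in the output, which is why results can be incomplete when the limit is reached.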
MCAST can make use of position-specific priors (PSPs) to improve its identification of true motif occurrences. To take advantage of PSPs in MCAST you must provide two command line options. The --psp option sets the name of a file containing the PSP, and the --prior-dist option sets the name of a file containing the binned distribution of the PSP.
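Since the two PSP options must be supplied together, a script that launches MCAST might assemble its argument list as in the sketch below. The helper name and all file names are hypothetical; only the `mcast` usage line and the `--psp`, `--prior-dist`, and `--max-stored-scores` options come from this document. The sketch builds the command but does not run it.

```python
def build_mcast_command(motif_file, sequence_file, psp_file=None,
                        prior_dist_file=None, max_stored_scores=None):
    """Return an argv list following: mcast [options] <motif file> <sequence file>."""
    # --psp and --prior-dist are only meaningful as a pair.
    if (psp_file is None) != (prior_dist_file is None):
        raise ValueError("--psp and --prior-dist must be given together")
    cmd = ["mcast"]
    if psp_file is not None:
        cmd += ["--psp", psp_file, "--prior-dist", prior_dist_file]
    if max_stored_scores is not None:
        cmd += ["--max-stored-scores", str(max_stored_scores)]
    cmd += [motif_file, sequence_file]
    return cmd
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` on a machine where the MEME Suite is installed.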
The PSP can be provided in MEME PSP file format, or in wiggle format. The MEME PSP file format requires that a PSP be included for every position in the sequence to be scanned. This format is usually only practical for relatively small sequence databases. The wiggle format accommodates sequence segments with missing PSP. When no PSP is available for a given position, MCAST will use the median PSP from the PSP distribution file. The wiggle format will work with large sequence databases, including full genomes. MCAST also uses the median PSP value in scoring all positions in the random sequences it generates for computing statistical confidence estimates.
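The median fallback for positions without a PSP can be pictured as a simple lookup with a default. In this sketch the function name and the dictionary representation of the priors are assumptions, and the median is passed in as a plain number rather than read from a PSP distribution file.

```python
def priors_for_region(start, end, psp_by_position, median_psp):
    """Return one prior per position in [start, end).

    psp_by_position: dict mapping position -> prior, possibly with gaps
    (as the wiggle format allows). Missing positions get median_psp.
    """
    return [psp_by_position.get(pos, median_psp) for pos in range(start, end)]
```

Calling the function with an empty dictionary yields the median at every position, which mirrors how the generated random sequences are scored.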
The PSP and PSP distribution files can be generated from raw scores using the create-priors utility, available when you download and install the MEME Suite on your own computer.
A full description of the algorithm may be found in:
The name of a file containing DNA motifs in MEME format. Outputs from MEME, STREME and DREME are supported, as well as Minimal MEME Format. You can also input DNA motifs in TRANSFAC format if you specify the --transfac option. You can convert many other motif formats to MEME format using conversion scripts available with the MEME Suite.
Note: All motifs must have width at least 2.
The name of a file of DNA sequences in FASTA format.
MCAST will create a directory named mcast_out (the name of this directory can be overridden via the --o or --oc options). The directory will contain:
mcast.html - an HTML file that provides the results in a human-readable format
mcast.tsv - a TSV (tab-separated values) file that provides the results in a format suitable for parsing by scripts and viewing with Excel
mcast.gff - a GFF3 format file that provides the results in a format suitable for display in the UCSC genome browser
cisml.xml - an XML file that provides the results in the CisML schema
mcast.xml - an XML file that describes the inputs to MCAST and references the CisML file cisml.xml
Note: See this detailed description of the MCAST output formats for more information.
Option | Parameter | Description | Default Behavior |
---|---|---|---|
Output | |||
Motifs | |||
--transfac | | MCAST will assume that the motif file is in TRANSFAC matrix format. | MCAST assumes the motif file is in MEME format. |
--max-total-width | max | Limit the combined width of all the input motifs to no more than max columns. This can be set to prevent jobs from exceeding the available memory. The memory requirements of MCAST are quadratic in the combined width of the motifs, and can reach 5 GB when the combined width is greater than 8000 columns. | MCAST does not limit the combined width of all motifs. |