mcast [options] <motifs> <sequence database>
MCAST searches a sequence database for statistically significant clusters of non-overlapping occurrences of a given set of motifs.
A motif "hit" is a sequence position that is sufficiently similar to a motif in the query, where the score for a motif at a particular sequence position is computed without gaps. To compute the p-value of a motif score, MCAST assumes that the sequences in the database were generated by a 0-order Markov process (see option --bgfile, below). To be considered a hit, the p-value of the motif alignment score must be less than the significance threshold, pthresh (see option --motif-pthresh, below). Note that MCAST searches for hits on both strands of the sequences.
A cluster of non-overlapping hits is called a "match". The user specifies the maximum allowed distance between the hits in a match using the --max-gap option. Two hits separated by more than the maximum allowed gap will be reported in separate matches.
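The grouping rule described above can be illustrated with a short Python sketch. This is not MCAST's actual implementation; the `(start, end)` hit representation, the function name `group_hits`, and the definition of the gap as the number of positions strictly between two hits are all assumptions made for illustration. The default of 50 is taken from the --max-gap option.

```python
from typing import List, Tuple

# Hypothetical hit representation: (start, end), inclusive coordinates.
Hit = Tuple[int, int]


def group_hits(hits: List[Hit], max_gap: int = 50) -> List[List[Hit]]:
    """Split non-overlapping hits into clusters wherever the gap exceeds max_gap."""
    matches: List[List[Hit]] = []
    for hit in sorted(hits):
        # Assumed gap definition: positions strictly between the end of the
        # previous hit and the start of this one.
        if matches and hit[0] - matches[-1][-1][1] - 1 <= max_gap:
            matches[-1].append(hit)  # close enough: extend the current cluster
        else:
            matches.append([hit])    # too far (or first hit): start a new cluster
    return matches
```

With this sketch, two hits whose gap is 70 positions end up in separate clusters under the default setting, while hits 9 or 20 positions apart stay together.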
The p-value of a hit is converted to a "p-score" in order to compute the total score of the match it participates in. The p-score for a hit with p-value p is S = -log2(p/pthresh), where pthresh is the value set by the --motif-pthresh option.
The total score of a match is the sum of the p-scores of the hits making up the match.
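As a worked illustration of the scoring, the sketch below uses the formula given under the --motif-pthresh option, S = -log2(p/pthresh), and sums the p-scores of the hits in a match. The function names `p_score` and `match_score` are hypothetical; the default pthresh of 0.0005 is the documented default.

```python
import math
from typing import Iterable


def p_score(p: float, pthresh: float = 0.0005) -> float:
    """p-score of a single hit: S = -log2(p / pthresh)."""
    return -math.log2(p / pthresh)


def match_score(hit_pvalues: Iterable[float], pthresh: float = 0.0005) -> float:
    """Total score of a match: the sum of the p-scores of its hits."""
    return sum(p_score(p, pthresh) for p in hit_pvalues)
```

Under this formula, a hit whose p-value equals pthresh contributes a p-score of 0, and each halving of the p-value adds 1 to the p-score.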
MCAST searches for all possible matches between the query motifs and the sequences in the database, and reports the matches with the largest scores in decreasing order. Three types of statistical confidence estimates (p-value, E-value, and q-value) are computed for each score, and the reported matches can be filtered by applying p-value or q-value thresholds (see the options --output-pthresh and --output-qthresh below).
In order for MCAST to compute statistical confidence estimates, at least 100 matches must be found. If the database contains too few sequences, or if certain other options are made too stringent, then too few matches may exist for significance statistics to be computed. In this case, the p-value, q-value, and E-value columns are set to "NaN", and all matches are printed. This limitation can be overcome by specifying the --synth and --bgfile options. When those options are set, synthetic sequences will be generated from the provided background model and used to estimate significance statistics.
When computing statistical confidence estimates, MCAST must retain the matches in memory until the final distribution of scores can be estimated. This means that scanning genome-sized datasets has the potential to exhaust all available memory. To avoid this problem, MCAST uses reservoir sampling of the match scores, and limits the number of matches that are kept in memory. The default number of matches kept in memory is 100,000, but this value can be adjusted via the --max-stored-scores option. If the maximum number of stored matches is reached, then MCAST will drop the least significant half of the matches. This behavior may result in matches missing from the MCAST output, even though they would have satisfied the user-specified p-value or q-value threshold.
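The "drop the least significant half" step can be sketched as follows. This illustrates only that one step under stated assumptions and is not MCAST's code: the class name `ScoreStore` is hypothetical, a higher score is assumed to be more significant, and the reservoir sampling mentioned above is not modelled.

```python
from typing import List


class ScoreStore:
    """Bounded store that halves itself once it reaches max_stored entries."""

    def __init__(self, max_stored: int = 100_000) -> None:
        self.max_stored = max_stored
        self.scores: List[float] = []
        self.dropped = False  # becomes True once any scores have been discarded

    def add(self, score: float) -> None:
        self.scores.append(score)
        if len(self.scores) >= self.max_stored:
            # Assumption: larger score means more significant, so keep the top half.
            self.scores.sort(reverse=True)
            del self.scores[len(self.scores) // 2:]
            self.dropped = True
```

Once `dropped` is set, some matches that would have passed the output threshold may already be gone, which is why the reported list can be incomplete.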
A full description of the algorithm may be found in the paper cited at the end of this document.
A file containing motifs in MEME format. Output from MEME and DREME is supported, as is the minimal MEME format, for which conversion scripts from other formats are available. Supply motifs that are likely to appear in the sequences.
A collection of DNA sequences in FASTA format.
MCAST will create a directory named mcast_out (the name of this directory can be overridden via the --o or --oc options). The directory will contain:
mcast.xml, describing the inputs to MCAST in XML format
cisml.xml, reporting the matches in XML format using the CisML schema
mcast.html, reporting the matches in HTML format
mcast.txt, reporting the matches in tab-delimited format
mcast.gff, reporting the matches in GFF format
mcast.wig, reporting the matches in wiggle track format

Option | Parameter | Description | Default Behaviour |
---|---|---|---|
General Options | |||
--bgfile | background file | A file with background frequencies of letters. This expects a Markov background model. This also accepts special values like --nrdb-- which has the same effect as not supplying a background, --uniform-- which uses a uniform background and --motif-- which uses the background specified in the motif file. | Uses frequencies embedded in the application from the non-redundant database. |
--bgweight | weight | Add weight times the background frequency to the corresponding letter counts in each motif when converting them to position-specific scoring matrices. | A weight of 4.0 is used. |
--max-gap | max gap | The value of max gap specifies the longest distance allowed between two hits in a match. Hits separated by more than max gap will be placed in different matches. Note: Large values of max gap combined with large values of pthresh may prevent MCAST from computing E-values. | The maximum gap is set to 50. |
--max-stored-scores | max | Set the maximum number of scores that will be stored. Keeping a complete list of scores may exceed available memory. Once the number of stored scores reaches the maximum allowed, the least significant 50% of scores will be dropped. In this case, the list of reported matches may be incomplete and the q-value calculation will be approximate. | The maximum number of stored matches is 100,000. |
--motif-pthresh | pthresh | Sets the scale for calculating p-scores for motif hits. The p-score for a hit with p-value p is S = -log2(p/pthresh). | The motif scaling p-value is set to 0.0005. |
--output-ethresh | out E-value | The E-value threshold for displaying search results. If the E-value of a match is greater than this value, then the match will not be printed. If any combination of --output-ethresh, --output-pthresh, or --output-qthresh is given, whichever option occurs last on the command-line will determine the effective output filter. | The E-value threshold is 10.0. |
--output-pthresh | out p-value | The p-value threshold for displaying search results. If the p-value of a match is greater than this value, then the match will not be printed. If any combination of --output-ethresh, --output-pthresh, or --output-qthresh is given, whichever option occurs last on the command-line will determine the effective output filter. | The E-value is used as the threshold. See --output-ethresh option. |
--output-qthresh | out q-value | The q-value threshold for displaying search results. If the q-value of a match is greater than this value, then the match will not be printed. If any combination of --output-ethresh, --output-pthresh, or --output-qthresh is given, whichever option occurs last on the command-line will determine the effective output filter. | The E-value is used as the threshold. See --output-ethresh option. |
--synth | | Use scores from synthetic sequences, generated from the background model, to estimate the score distribution (see --bgfile). | |
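The "last option wins" rule for the three output thresholds in the table above can be sketched as follows. This is an illustration of the documented behaviour, not MCAST's own argument parser; the function name `effective_output_filter` is hypothetical, and the fallback of an E-value threshold of 10.0 is the documented default.

```python
from typing import List, Tuple

FILTER_OPTIONS = ("--output-ethresh", "--output-pthresh", "--output-qthresh")


def effective_output_filter(argv: List[str]) -> Tuple[str, float]:
    """Return the (option, threshold) pair that determines the output filter."""
    chosen = ("--output-ethresh", 10.0)  # documented default filter
    # Stop one short of the end so every matched option has a value after it.
    for i, arg in enumerate(argv[:-1]):
        if arg in FILTER_OPTIONS:
            chosen = (arg, float(argv[i + 1]))  # later options override earlier ones
    return chosen
```

For example, if both --output-pthresh and --output-qthresh are given, in that order, the q-value threshold is the one that takes effect.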
The HTML output contains
The plain text output contains a line for each match. Each line contains the following fields:
The lines are sorted by score in descending order.
The wiggle track output contains the following entries:
The wiggle track output is sorted by sequence name and position.
If you use MCAST in your research please cite the following paper:
Timothy Bailey and William Stafford Noble,
"Searching for statistically significant regulatory modules",
Bioinformatics (Proceedings of the European Conference on Computational Biology),
19(Suppl. 2):ii16-ii25, 2003.