Usage:

mcast [options] <motifs> <sequence database>

Description

MCAST searches a sequence database for statistically significant clusters of non-overlapping occurrences of a given set of motifs.

A motif "hit" is a sequence position that is sufficiently similar to a motif in the query, where the score for a motif at a particular sequence position is computed without gaps. To compute the p-value of a motif score, MCAST assumes that the sequences in the database were generated by a 0-order Markov process (see option --bgfile, below). To be considered a hit, the p-value of the motif alignment score must be less than the significance threshold, pthresh (see option --motif-pthresh, below). Note that MCAST searches for hits on both strands of the sequences.
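The idea of scoring a motif without gaps on both strands can be illustrated with a short Python sketch. It assumes that the motif is given as per-position letter frequencies and that the score is a sum of log2-odds against a 0-order background; both are assumptions made for illustration, and all function names are hypothetical. It does not reproduce MCAST's own scoring matrices or its p-value calculation.

```python
import math

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def log_odds_matrix(freqs, background):
    """Turn per-position letter frequencies into log2-odds scores (assumed scoring)."""
    return [{b: math.log2(col[b] / background[b]) for b in col} for col in freqs]


def ungapped_score(pssm, seq, start):
    """Sum the column scores over the window of seq starting at start (no gaps)."""
    return sum(col[seq[start + i]] for i, col in enumerate(pssm))


def scan_both_strands(pssm, seq):
    """Yield (start, strand, score) for every window, on both strands."""
    width = len(pssm)
    rc = reverse_complement(seq)
    for start in range(len(seq) - width + 1):
        yield start, "+", ungapped_score(pssm, seq, start)
        # The same forward-strand window, read on the reverse strand.
        rc_start = len(seq) - start - width
        yield start, "-", ungapped_score(pssm, rc, rc_start)


# Hypothetical two-column motif that prefers "AC", with a uniform background.
freqs = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},
]
background = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
pssm = log_odds_matrix(freqs, background)
best = max(scan_both_strands(pssm, "GTAA"), key=lambda hit: hit[2])
```

In this sketch a position becomes a hit only after its score has been converted to a p-value and compared with pthresh, a step that is not shown here.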

A cluster of non-overlapping hits is called a "match". The user specifies the maximum allowed distance between the hits in a match using the --max-gap option. Two hits separated by more than the maximum allowed gap will be reported in separate matches.
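The effect of the maximum gap can be sketched in Python as follows. The sketch assumes that hits on one sequence are given as (start, end) pairs with an exclusive end, and that the gap is the number of positions between the end of one hit and the start of the next; the exact gap definition and the function name group_hits are assumptions, and the sketch is not MCAST's own search procedure.

```python
def group_hits(hits, max_gap):
    """Group non-overlapping hits into matches.

    hits: list of (start, end) pairs on one sequence, end exclusive.
    A hit joins the current match when its distance from the previous
    hit is at most max_gap; otherwise it starts a new match.
    """
    matches = []
    for start, end in sorted(hits):
        if matches and start - matches[-1][-1][1] <= max_gap:
            matches[-1].append((start, end))
        else:
            matches.append([(start, end)])
    return matches


# With a maximum gap of 50, the third hit is too far away to join the first two.
matches = group_hits([(0, 10), (30, 40), (200, 210)], max_gap=50)
```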

The p-value of a hit is converted to a "p-score" in order to compute the total score of the match it participates in. The p-score for a hit with p-value p is

S = -log2(p/pthresh).

The total score of a match is the sum of the p-scores of the hits making up the match.
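As a worked illustration of the two formulas above, the following Python sketch computes the p-score of a hit and the total score of a match. The function names are hypothetical; the default pthresh of 0.0005 is the default listed for --motif-pthresh.

```python
import math


def p_score(p, pthresh=0.0005):
    """p-score of a hit with p-value p: S = -log2(p / pthresh)."""
    return -math.log2(p / pthresh)


def match_score(hit_pvalues, pthresh=0.0005):
    """Total score of a match: the sum of the p-scores of its hits."""
    return sum(p_score(p, pthresh) for p in hit_pvalues)


# A hit exactly at the threshold contributes 0; smaller p-values contribute more.
total = match_score([0.000125, 0.0000625])
```

Because a hit must have a p-value below pthresh, every hit contributes a positive p-score, so adding a hit to a match always increases the total.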

MCAST searches for all possible matches between the query motifs and the sequences in the database, and reports the matches with the largest scores in decreasing order. Three types of statistical confidence estimates (p-value, E-value, and q-value) are computed for each score, and the reported matches can be filtered by applying an E-value, p-value, or q-value threshold (see the options --output-ethresh, --output-pthresh, and --output-qthresh below).

In order for MCAST to compute statistical confidence estimates, at least 100 matches must be found. If the database contains too few sequences, or if certain other options are made too stringent, then too few matches may exist for significance statistics to be computed. In this case, the p-value, q-value, and E-value columns are set to "NaN", and all matches are printed. This limitation can be overcome by specifying the --synth and --bgfile options. When those options are set, synthetic sequences will be generated from the provided background model and used to estimate significance statistics.

When computing statistical confidence estimates, MCAST must retain the matches in memory until the final distribution of scores can be estimated. This means that scanning genome-sized datasets has the potential to exhaust all available memory. To avoid this problem, MCAST uses reservoir sampling of the match scores, and limits the number of matches that are kept in memory. The default number of matches kept in memory is 100,000, but this value can be adjusted via the --max-stored-scores option. If the maximum number of stored matches is reached, then MCAST will drop the least significant half of the matches. This behavior may result in matches missing from the MCAST output, even though they would have satisfied the user-specified p-value or q-value threshold.
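The drop-half rule described above can be sketched in Python as follows. The sketch assumes that a larger match score means a more significant match and shows only the pruning step, not the reservoir sampling; store_score is a hypothetical name, not part of MCAST.

```python
def store_score(stored, score, max_stored):
    """Add a score, pruning to the most significant half once the limit is reached.

    Assumes larger scores are more significant.
    """
    stored.append(score)
    if len(stored) >= max_stored:
        stored.sort(reverse=True)
        del stored[len(stored) // 2:]  # drop the least significant half
    return stored


# Tiny limit of 4 for illustration; the documented default is 100,000.
kept = []
for s in [5, 1, 4, 2]:
    store_score(kept, s, max_stored=4)
```

In this example the scores 1 and 2 are discarded when the fourth score arrives, which illustrates why a match that would have passed the output threshold can be absent from the results.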

A full description of the algorithm may be found in:

Bailey and Noble. "Searching for statistically significant regulatory modules." Bioinformatics (Proceedings of the European Conference on Computational Biology). 19(Suppl. 2):ii16-ii25, 2003.

Input

Motifs

A file containing motifs in MEME format. Output from MEME and DREME is supported, as is the minimal MEME format, for which conversion scripts are available to support other formats. Supply motifs that are likely to appear in the sequences.

Sequence Database

A collection of DNA sequences in FASTA format.

Output

MCAST will create a directory named mcast_out (the name of this directory can be overridden via the --o or --oc options). The directory will contain:

Options

General Options

Option: --bgfile
Parameter: background file
Description: A file with background frequencies of letters. This expects a Markov background model. This also accepts special values like --nrdb--, which has the same effect as not supplying a background; --uniform--, which uses a uniform background; and --motif--, which uses the background specified in the motif file.
Default Behaviour: Uses frequencies embedded in the application from the non-redundant database.

Option: --bgweight
Parameter: weight
Description: Add weight times the background frequency to the corresponding letter counts in each motif when converting them to position-specific scoring matrices.
Default Behaviour: A weight of 4.0 is used.

Option: --max-gap
Parameter: max gap
Description: The value of max gap specifies the longest distance allowed between two hits in a match. Hits separated by more than max gap will be placed in different matches. Note: Large values of max gap combined with large values of pthresh may prevent MCAST from computing E-values.
Default Behaviour: The maximum gap is set to 50.

Option: --max-stored-scores
Parameter: max
Description: Set the maximum number of scores that will be stored. Keeping a complete list of scores may exceed available memory. Once the number of stored scores reaches the maximum allowed, the least significant 50% of scores will be dropped. In this case, the list of reported matches may be incomplete and the q-value calculation will be approximate.
Default Behaviour: The maximum number of stored matches is 100,000.

Option: --motif-pthresh
Parameter: pthresh
Description: Sets the scale for calculating p-scores for motif hits. The p-score for a hit with p-value p is S = -log2(p/pthresh).
Default Behaviour: The motif scaling p-value is set to 0.0005.

Option: --output-ethresh
Parameter: out E-value
Description: The E-value threshold for displaying search results. If the E-value of a match is greater than this value, then the match will not be printed. If any combination of --output-ethresh, --output-pthresh, or --output-qthresh is given, whichever option occurs last on the command line will determine the effective output filter.
Default Behaviour: The E-value threshold is 10.0.

Option: --output-pthresh
Parameter: out p-value
Description: The p-value threshold for displaying search results. If the p-value of a match is greater than this value, then the match will not be printed. If any combination of --output-ethresh, --output-pthresh, or --output-qthresh is given, whichever option occurs last on the command line will determine the effective output filter.
Default Behaviour: The E-value is used as the threshold. See the --output-ethresh option.

Option: --output-qthresh
Parameter: out q-value
Description: The q-value threshold for displaying search results. If the q-value of a match is greater than this value, then the match will not be printed. If any combination of --output-ethresh, --output-pthresh, or --output-qthresh is given, whichever option occurs last on the command line will determine the effective output filter.
Default Behaviour: The E-value is used as the threshold. See the --output-ethresh option.

Option: --synth
Description: Use scores from synthetic sequences to estimate the score distribution.
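The rule that the last output threshold on the command line wins can be sketched in Python. This is only an illustration of the documented behaviour, not MCAST's argument parser; the function name effective_output_filter and the file names motifs.meme and db.fasta are hypothetical.

```python
OUTPUT_FILTERS = ("--output-ethresh", "--output-pthresh", "--output-qthresh")


def effective_output_filter(argv):
    """Return (option, threshold) for the last output filter in argv.

    Falls back to the documented default, an E-value threshold of 10.0,
    when none of the three options is present.
    """
    chosen = ("--output-ethresh", 10.0)
    # Stop one short of the end so every matched option has a value after it.
    for i, arg in enumerate(argv[:-1]):
        if arg in OUTPUT_FILTERS:
            chosen = (arg, float(argv[i + 1]))
    return chosen


# Hypothetical command line: the q-value threshold comes last, so it applies.
argv = ["mcast", "--output-pthresh", "0.01", "--output-qthresh", "0.05",
        "motifs.meme", "db.fasta"]
active = effective_output_filter(argv)
```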

HTML output

The HTML output contains

Text output

The plain text output contains a line for each match. Each line contains the following fields:

The lines are sorted by score in descending order.

Wiggle output

The wiggle track output contains the following entries:

The wiggle track output is sorted by sequence name and position.

Citing

If you use MCAST in your research please cite the following paper:
Timothy Bailey and William Stafford Noble, "Searching for statistically significant regulatory modules", Bioinformatics (Proceedings of the European Conference on Computational Biology), 19(Suppl. 2):ii16-ii25, 2003.