MCAST: Motif Cluster Alignment Search Tool
Usage:
mcast [options] <query> <database>
Description:
MCAST searches a sequence database for statistically significant clusters of non-overlapping "hits" to the motifs in a query.
A "hit" is a sequence position that is sufficiently similar to a motif in the query. To be a hit, the p-value of the motif alignment score must be less than the significance threshold, pthresh (see option -p-thresh, below). The alignment of the motif and the sequence position is done without gaps. To compute the p-value of a motif alignment score, MCAST assumes that the sequences in the database were generated by a 0-order Markov process; see option -bg-file, below. With DNA sequences, MCAST searches for hits on both the sequences given in the database and their reverse complements.
A cluster of non-overlapping hits is called a "match". The user specifies the maximum allowed distance between the hits in a match. (Two hits separated by more than the maximum allowed gap will be reported in separate matches.)
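As a rough illustration of the definitions in this section (not MCAST's actual search procedure), the following Python sketch groups hits into matches and scores each match as the sum of its hits' p-scores, S = -log2(p/pthresh). The Hit class, the function names, and the choice to measure distance from the end of one hit to the start of the next are assumptions made for this example. The hits are assumed to lie on a single sequence and not to overlap.

```python
import math
from dataclasses import dataclass


@dataclass
class Hit:
    start: int      # 0-based start of the hit in the sequence (assumed convention)
    end: int        # exclusive end of the hit
    pvalue: float   # p-value of the motif alignment score


def p_score(pvalue, pthresh):
    """p-score of a hit: S = -log2(p / pthresh)."""
    return -math.log2(pvalue / pthresh)


def find_matches(hits, maxgap, pthresh):
    """Group hits into matches and return (hits, total score) pairs.

    Only positions with p-value < pthresh count as hits. Two consecutive
    hits stay in the same match when the distance between them is at
    most maxgap; otherwise a new match is started.
    """
    kept = sorted((h for h in hits if h.pvalue < pthresh), key=lambda h: h.start)
    groups = []
    for hit in kept:
        if groups and hit.start - groups[-1][-1].end <= maxgap:
            groups[-1].append(hit)      # close enough: extend the current match
        else:
            groups.append([hit])        # too far away: start a new match
    return [(g, sum(p_score(h.pvalue, pthresh) for h in g)) for g in groups]


if __name__ == "__main__":
    example = [Hit(10, 18, 1e-4), Hit(30, 38, 5e-5),
               Hit(500, 508, 1e-4), Hit(600, 608, 0.01)]
    for group, score in find_matches(example, maxgap=50, pthresh=0.0005):
        print([(h.start, h.end) for h in group], round(score, 3))
```

In this example the last position is not a hit because its p-value is not below pthresh, and the third hit lies more than maxgap positions from the second, so it forms a match of its own.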
MCAST searches for all of the matches between the query and the sequences in the database. Each match is assigned an E-value, and matches with E-values below a threshold are printed in order of increasing E-value (see option -e-thresh, below).
The p-value of a hit is converted to a "p-score" in order to compute the total score of the match it participates in. The p-score for a hit with p-value p is
S = -log2(p/pthresh),
where the significance threshold pthresh may be specified by the user. The total score of a match is the sum of the p-scores of the hits making up the match. MCAST finds the matches with the maximum match scores.
In order for E-values to be computed by MCAST, at least 100 matches must be found. If there are too few sequences in the database, or if certain other options are made too stringent (see Options, below), too few matches may exist for E-values to be computed. In this case, the results are sorted by match score, the E-value column is set to "NaN", and all matches are printed.
Input:
- <query> - A set of motifs must be provided, in TRANSFAC format by default or in MEME format if the -meme option is given.
- <database> - MCAST requires a sequence database in FASTA format. If the filename is given as '-', then mcast will attempt to read the database from standard input.
Output:
MCAST uses MHMMSCAN to generate its output. See the MHMMSCAN output documentation for a description of the output format.
Options:
- -p-thresh <pthresh> - The significance threshold that defines a motif "hit" is given by pthresh. Any sequence position in the database with an alignment score with p-value less than pthresh is considered a hit. The value of pthresh must be greater than zero and less than or equal to 1. The default value of pthresh is 0.0005.
Note:
- The proper value for pthresh can only be determined by experimentation, since it depends on the number of motifs, the information content of the motifs, and the value of maxgap.
- If pthresh is too small, there may be few (or no) "hits" and, consequently, few (or no) matches. This may cause MCAST to be unable to compute match E-values, or to report no matches. Small values of pthresh may also cause the reported E-values to be inaccurate. In this case, the E-values will always be too large (conservative).
- If pthresh is too large, the expected length of a match may be longer than most of the sequences in the database you are searching. This will prevent MCAST from being able to compute E-values. Large values of pthresh may yield no hits due to the minimum score for a match becoming very large, especially when the query contains few motifs. Very large values of pthresh, when searching genomic DNA, tend to give high scores to low-complexity sequence and repeated elements.
- -max-gap <maxgap> - The value of maxgap specifies the longest distance allowed between two hits in a match. Hits separated by more than maxgap will be placed in different matches.
Note:
- Large values of maxgap combined with large values of pthresh may prevent MCAST from computing E-values, due to the problem described above in the note on large values of pthresh for the -p-thresh option.
- -e-thresh <Ethresh> - MCAST prints the matches with E-values below Ethresh. The default threshold is 10. If E-values cannot be computed (see the notes for -p-thresh and -max-gap, above), all matches are printed.
- -bg-weight <b> - Add b times each background frequency (see -bg-file, below) to the corresponding letter counts in the query motifs when converting them to PSSMs.
- -bg-file <file> - MCAST needs to know the frequencies of the letters of the sequence alphabet in the database being searched (the "background" letter frequencies). By default, the background letter frequencies of the appropriate (DNA or protein) NCBI non-redundant database are used. The E-values computed by MCAST will be more accurate and the search will be more selective if you provide a file containing the background letter frequencies in background model format. (This format allows "order-n" Markov models to be specified, but MCAST uses only the single-letter frequencies from file, ignoring any higher-order frequencies present there.) MCAST uses the background letter frequencies for two purposes:
- to convert the query motifs from TRANSFAC (counts) to log-odds (scores), and
- to estimate the p-values of individual motif alignment scores.
- -lowcomp <threshold> - Eliminate low-complexity motifs from the model. Motif complexity is the average K-L distance between the "motif background distribution" and each column of the motif. The motif background is just the average distribution of all the columns. The K-L distance, which measures the difference between two distributions, p and f, is the same as the information content:
p1 log(p1/f1) + p2 log(p2/f2) + ... + pn log(pn/fn).
This value increases with increasing complexity.
- -synth - Create synthetic sequences for estimating E-values. This is useful with small input databases where not enough match scores are found to estimate E-values.
- -text - Print output in plain text instead of HTML format.
- -scratch <dir> - Directory for temporary files (default is the current directory).
- -meme - The query file is in MEME format (default is TRANSFAC format).
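The complexity measure described under -lowcomp can be sketched in Python as follows. This is an illustration of the formula only, not MCAST's own code: the function names are hypothetical, each motif column is assumed to be a list of letter probabilities summing to 1, p is taken to be a motif column and f the motif background, and the natural logarithm is assumed because the text does not state a base.

```python
import math


def kl_distance(p, f):
    """K-L distance p1 log(p1/f1) + ... + pn log(pn/fn).

    Terms with p_i == 0 are taken to contribute 0 (the usual convention).
    """
    return sum(pi * math.log(pi / fi) for pi, fi in zip(p, f) if pi > 0)


def motif_complexity(columns):
    """Average K-L distance between each column and the motif background.

    The motif background is the average distribution of all the columns.
    """
    n = len(columns)
    width = len(columns[0])
    background = [sum(col[i] for col in columns) / n for i in range(width)]
    return sum(kl_distance(col, background) for col in columns) / n


if __name__ == "__main__":
    uniform = [[0.25, 0.25, 0.25, 0.25]] * 3          # every column identical
    varied = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
    print(motif_complexity(uniform))
    print(motif_complexity(varied))
```

A motif whose columns are all the same has complexity 0 under this definition, while a motif whose columns differ from one another scores higher, which agrees with the statement that the value increases with increasing complexity.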
Bugs: None known.
Author: Timothy Bailey and William Stafford Noble.