mcast [options] <motifs> <sequence database>
In order for MCAST to compute statistical confidence estimates, at least 200 matches must be found. If the database contains too few sequences, or if certain other options are made too stringent, then too few matches may exist for significance statistics to be computed. In this case, the p-value, q-value, and E-value columns are set to "NaN", and all matches are printed. This limitation can be overcome by specifying the --synth option. When this option is set, synthetic sequences will be generated using a background model generated by choosing a random GC frequency within the range of observed GC minimum and maximum. The synthetic sequences will be used to estimate significance statistics.
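The synthetic-sequence idea described above can be illustrated with a minimal sketch. This is not MCAST's implementation: the function name `synth_sequence`, the uniform draw of the GC fraction, and the even split of probability within the G/C and A/T pairs are all assumptions made for illustration.

```python
import random

def synth_sequence(length, gc_min, gc_max, rng):
    """Generate one synthetic DNA sequence from a 0th-order model (illustrative only)."""
    # Assumption: the GC fraction is drawn uniformly from the observed range.
    gc = rng.uniform(gc_min, gc_max)
    # Assumption: G and C share the GC mass equally, and A and T share the rest.
    weights = [(1 - gc) / 2, gc / 2, gc / 2, (1 - gc) / 2]  # order: A, C, G, T
    return "".join(rng.choices("ACGT", weights=weights, k=length))

rng = random.Random(0)  # fixed seed so the sketch is reproducible
seq = synth_sequence(1000, gc_min=0.35, gc_max=0.60, rng=rng)
gc_fraction = (seq.count("G") + seq.count("C")) / len(seq)
```

Sequences produced this way could then be scanned like real ones to build up a null distribution of match scores.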
When computing statistical confidence estimates, MCAST must retain the matches in memory until the final distribution of scores can be estimated. This means that scanning genome-sized datasets has the potential to exhaust all available memory. To avoid this problem, MCAST uses reservoir sampling of the match scores, and limits the number of matches that are kept in memory. The default number of matches kept in memory is 100,000, but this value can be adjusted via the --max-stored-scores option. If the maximum number of stored matches is reached, then MCAST will drop the least significant half of the matches. This behavior may result in matches missing from the MCAST output, even though they would have satisfied the user-specified p-value or q-value threshold.
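The "drop the least significant half" behavior can be sketched as follows. This is a simplified illustration, not MCAST's code: the class name `BoundedScoreStore` is hypothetical, and the sketch assumes that a higher score means a more significant match.

```python
class BoundedScoreStore:
    """Keep at most max_stored scores, discarding the weakest half when full."""

    def __init__(self, max_stored=100_000):
        self.max_stored = max_stored
        self.scores = []
        self.dropped = 0  # number of scores discarded so far

    def add(self, score):
        self.scores.append(score)
        if len(self.scores) >= self.max_stored:
            # Assumption: higher score = more significant, so sort descending
            # and keep only the top half.
            self.scores.sort(reverse=True)
            keep = len(self.scores) // 2
            self.dropped += len(self.scores) - keep
            del self.scores[keep:]

store = BoundedScoreStore(max_stored=8)
for s in range(20):
    store.add(float(s))
```

Once any scores have been dropped, a match that would have passed the output threshold may no longer be present, which is why the output can be incomplete.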
MCAST can make use of position specific priors (PSPs) to improve its identification of true motif occurrences. To take advantage of PSPs in MCAST, you must provide two command-line options. The --psp option is used to set the name of a file containing the PSP, and the --prior-dist option is used to set the name of a file containing the binned distribution of the PSP.
The PSP can be provided in MEME PSP file format, or in wiggle format. The MEME PSP file format requires that a PSP be included for every position in the sequence to be scanned. This format is usually only practical for relatively small sequence databases. The wiggle format accommodates sequence segments with missing PSP. When no PSP is available for a given position, MCAST will use the median PSP from the PSP distribution file. The wiggle format will work with large sequence databases, including full genomes.
The PSP and PSP distribution files can be generated from raw scores using the create-priors utility available when you download and install the MEME Suite on your own computer.
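As a minimal sketch of how the two PSP options fit into the usage line, the helper below assembles an argument list. The function name `build_mcast_command` and all file names are hypothetical. Only the --psp and --prior-dist options and the `mcast [options] <motifs> <sequence database>` ordering come from this document.

```python
def build_mcast_command(motifs, database, psp=None, prior_dist=None):
    """Assemble an mcast argument list (illustrative helper, not part of MEME Suite)."""
    # Both PSP options are needed together, as described above.
    if (psp is None) != (prior_dist is None):
        raise ValueError("--psp and --prior-dist must be given together")
    cmd = ["mcast"]
    if psp is not None:
        cmd += ["--psp", psp, "--prior-dist", prior_dist]
    # Options come first, followed by the motif file and the sequence database.
    cmd += [motifs, database]
    return cmd

# Hypothetical file names, for illustration only.
cmd = build_mcast_command("motifs.meme", "genome.fa",
                          psp="priors.wig", prior_dist="priors.dist")
```

The resulting list could be passed to `subprocess.run(cmd, check=True)` on a machine where MCAST is installed.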
A full description of the algorithm may be found in:
A file containing MEME-formatted motifs. Outputs from MEME and DREME are supported, as well as Minimal MEME Format. You can convert many other motif formats to MEME format using conversion scripts available with the MEME Suite. Supply motifs that are likely to occur in the sequences.
A collection of DNA sequences in FASTA format.
MCAST will create a directory named mcast_out (the name of this directory can be overridden via the --o or --oc options).
The directory will contain:
mcast.xml describing the inputs to MCAST in XML format
cisml.xml reporting the matches in XML format using the CisML schema
mcast.html reporting the matches in HTML format
mcast.txt reporting the matches in tab-delimited format
mcast.gff reporting the matches in GFF format
The score reported in the GFF output is
|--alpha||alpha||The fraction of all TF binding sites that are binding sites for the TF of interest.||1.0|
|--bgfile||background file||A file with background frequencies of letters. This expects a Markov background model. This also accepts special values like --nrdb-- which has the same effect as not supplying a background, --uniform-- which uses a uniform background and --motif-- which uses the background specified in the motif file.||Uses frequencies embedded in the application from the non-redundant database.|
|--hardmask||Nucleotides in lower case will be converted to the wildcard 'N'. This prevents these positions from being considered in motif matches. This is useful when the input sequence file has been soft-masked for tandem repeats. Without hard masking, MCAST may assign sequence segments containing tandem repeats a highly significant score.||Nucleotides in lower case are converted to upper case.|
|--max-gap||max gap||The value of max gap specifies the longest distance allowed between two hits in a match. Hits separated by more than max gap will be placed in different matches. Note: Large values of max gap combined with large values of pthresh may prevent MCAST from computing E-values.||The maximum gap is set to 50.|
|--max-stored-scores||max||Set the maximum number of scores that will be stored. Keeping a complete list of scores may exceed available memory. Once the number of stored scores reaches the maximum allowed, the least significant 50% of scores will be dropped. In this case, the list of reported motifs may be incomplete and the q-value calculation will be approximate.||The maximum number of stored matches is 100,000.|
|--motif-pthresh||pthresh||Sets the scale for calculating p-scores for motif hits. The
p-score for a hit with p-value p is
S = -log2(p/pthresh),
|The motif scaling p-value defaults to 0.0005.|
|--output-ethresh||out E-value||The E-value threshold for displaying search results. If the E-value of a match is greater than this value, then the match will not be printed. If any combination of --output-ethresh, --output-pthresh, or --output-qthresh is given, whichever option occurs last on the command-line will determine the effective output filter.||The E-value threshold is 10.0.|
|--output-pthresh||out p-value||The p-value threshold for displaying search results. If the p-value of a match is greater than this value, then the match will not be printed. If any combination of --output-ethresh, --output-pthresh, or --output-qthresh is given, whichever option occurs last on the command-line will determine the effective output filter.||The E-value is used as the threshold. See --output-ethresh option.|
|--output-qthresh||out q-value||The q-value threshold for displaying search results. If the q-value of a match is greater than this value, then the match will not be printed. If any combination of --output-ethresh, --output-pthresh, or --output-qthresh is given, whichever option occurs last on the command-line will determine the effective output filter.||The E-value is used as the threshold. See --output-ethresh option.|
|--parse-genomic-coord||When this option is specified each sequence header will be
checked for UCSC-style genomic coordinates. These are of the form:
>sequence name:starting position-ending position Where
|The first position in the sequence will be assumed to be 1.|
|--psp||file||File containing position specific priors (PSP) in MEME PSP format or wiggle format. This file can be generated using the create-priors utility.|
|--prior-dist||file||File containing binned distribution of priors. This file can be generated using the create-priors utility.|
|--synth||Use synthetic scores for distribution. A 0th-order Markov model of nucleotide frequencies will be created by choosing a GC content at random between the observed minimum and maximum values. This model will be used to generate synthetic sequences, and the synthetic sequences will be used to estimate the distribution of p-values.||No synthetic sequences will be generated.|
|--text||Limits output to plain text sent to standard out.|
|--transfac||MCAST will assume that the motif file is in TRANSFAC matrix format.||MCAST assumes the motif file is in MEME format.|
|--version||Display the version and exit.||Run as normal.|
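The p-score formula given for the --motif-pthresh option, S = -log2(p/pthresh), can be checked with a short worked example. The function name `p_score` is hypothetical. The formula and the default pthresh of 0.0005 are taken from the table above.

```python
import math

def p_score(p, pthresh=0.0005):
    """p-score of a motif hit with p-value p: S = -log2(p / pthresh)."""
    return -math.log2(p / pthresh)

at_threshold = p_score(0.0005)         # p equal to pthresh
eight_times_better = p_score(0.0005 / 8)  # p eight times smaller than pthresh
four_times_worse = p_score(0.0005 * 4)    # p four times larger than pthresh
```

By this formula, a hit whose p-value equals pthresh scores 0, each halving of the p-value adds 1 to the score, and hits with p-values above pthresh receive negative scores.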
The HTML output contains
The plain text output contains a line for each match. Each line contains the following fields:
The lines are sorted by score in descending order.
If you use MCAST in your research please cite the following paper:
Timothy Bailey and William Stafford Noble, "Searching for statistically significant regulatory modules", Bioinformatics (Proceedings of the European Conference on Computational Biology), 19(Suppl. 2):ii16-ii25, 2003.