mcast [options] <motif file> <sequence file>
In order for MCAST to compute statistical confidence estimates, at least 200 matches must be found. If the database contains too few sequences, or if certain other options are made too stringent, then too few matches may exist for significance statistics to be computed. In this case, the p-value, q-value, and E-value columns are set to "NaN", and all matches are printed. This limitation can be overcome by specifying the --synth option. When this option is set, synthetic sequences will be generated using a background model generated by choosing a random GC frequency within the range of observed GC minimum and maximum. The synthetic sequences will be used to estimate significance statistics.
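The synthetic-sequence idea can be illustrated with a minimal Python sketch. It is not MCAST's implementation. The function name, its parameters, and the even split of the GC mass between G and C (and of the remainder between A and T) are assumptions made for illustration.

```python
import random

def synthetic_sequence(length, gc_min, gc_max, rng=random):
    """Generate one DNA sequence from a 0-order model with a random GC content.

    Hypothetical helper: the GC fraction is drawn uniformly from the observed
    [gc_min, gc_max] range, then every base is sampled independently.
    """
    gc = rng.uniform(gc_min, gc_max)
    # Assumption: G and C share the GC mass equally, A and T share the rest.
    weights = [(1 - gc) / 2, gc / 2, gc / 2, (1 - gc) / 2]
    return "".join(rng.choices("ACGT", weights=weights, k=length))

# Example: one 500-base sequence with GC content somewhere between 40% and 60%.
example = synthetic_sequence(500, 0.4, 0.6, random.Random(0))
```

Scanning many such sequences would supply the extra match scores needed when the real database yields too few.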
When computing statistical confidence estimates, MCAST must retain the matches in memory until the final distribution of scores can be estimated. This means that scanning genome-sized datasets has the potential to exhaust all available memory. To avoid this problem, MCAST uses reservoir sampling of the match scores and limits the number of matches that are kept in memory. The default number of matches kept in memory is 100,000, but this value can be adjusted via the --max-stored-scores option. If the maximum number of stored matches is reached, then MCAST will drop the least significant half of the matches. As a result, some matches may be missing from the MCAST output even though they would have satisfied the user-specified p-value or q-value threshold.
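The effect of dropping the least significant half can be seen in a small Python sketch. This is an illustration only, not MCAST's code. The class name is hypothetical, and the sketch assumes that a higher score means a more significant match.

```python
class BoundedMatchStore:
    """Hypothetical store that halves itself whenever it reaches its limit."""

    def __init__(self, max_stored=100_000):
        self.max_stored = max_stored
        self.scores = []
        self.dropped = 0  # number of scores discarded so far

    def add(self, score):
        self.scores.append(score)
        if len(self.scores) >= self.max_stored:
            # Assumption: higher score = more significant, so keep the top half.
            self.scores.sort(reverse=True)
            keep = len(self.scores) // 2
            self.dropped += len(self.scores) - keep
            del self.scores[keep:]

# Tiny limit so the halving is visible.
store = BoundedMatchStore(max_stored=4)
for s in [1, 5, 3, 9, 2]:
    store.add(s)
```

Any score discarded this way can never be reported, which is why the output may be incomplete once the limit has been reached.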
MCAST can make use of position-specific priors (PSPs) to improve its identification of true motif occurrences. To take advantage of PSPs in MCAST you must provide two command line options. The --psp option is used to set the name of a file containing the PSP, and the --prior-dist option is used to set the name of a file containing the binned distribution of the PSP.
The PSP can be provided in MEME PSP file format, or in wiggle format. The MEME PSP file format requires that a PSP be included for every position in the sequence to be scanned. This format is usually only practical for relatively small sequence databases. The wiggle format accommodates sequence segments with missing PSP. When no PSP is available for a given position, MCAST will use the median PSP from the PSP distribution file. The wiggle format will work with large sequence databases, including full genomes.
The PSP and PSP distribution files can be generated from raw scores using the
create-priors utility available
when you download and install the MEME Suite on your own computer.
A full description of the algorithm may be found in the paper cited at the end of this document.
The name of a file containing DNA motifs in MEME format.
Outputs from MEME and DREME are supported, as well as Minimal MEME
Format. You can also input DNA motifs in TRANSFAC format if you
specify the --transfac option. You can convert many other motif formats to MEME format
using conversion scripts
available with the MEME Suite. Provide motifs that are likely to appear in the
sequences. Note: All motifs must have a width of at least 2.
The name of a file of DNA sequences in FASTA format.
MCAST will create a directory named
mcast_out (the name of this directory can be overridden via the
--o or --oc options).
The directory will contain:
mcast.html - an HTML file that provides the results in a human-readable format
mcast.tsv - a TSV (tab-separated values) file that provides the results in a format suitable for parsing by scripts and viewing with Excel
mcast.gff - a GFF3 format file that provides the results in a format suitable for display in the UCSC genome browser
cisml.xml - an XML file that provides the results in the CisML schema
mcast.xml - an XML file that describes the inputs to MCAST and references the CisML file
Note: See this detailed description of the MCAST output formats for more information.
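Since mcast.tsv is intended for parsing by scripts, a minimal Python sketch of a reader may help. It takes the column names from the file's own header row rather than assuming them. Skipping blank lines and lines that start with "#" is an assumption about trailing comment lines, and the sample columns below are hypothetical placeholders, not MCAST's actual column names.

```python
import csv
import io

def read_tsv_rows(handle):
    """Yield each data row as a dict keyed by the header row's column names."""
    # Assumption: blank lines and lines starting with '#' carry no data.
    lines = (line for line in handle
             if line.strip() and not line.startswith("#"))
    yield from csv.DictReader(lines, delimiter="\t")

# Hypothetical sample; the real column names come from the file itself.
SAMPLE = "col_a\tcol_b\nm1\t12.5\n\n# a trailing comment line\n"
demo_rows = list(read_tsv_rows(io.StringIO(SAMPLE)))
```

For a real run, the same function could be applied to a handle opened on mcast_out/mcast.tsv.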
|--alpha||alpha||The fraction of all TF binding sites that are binding sites for the TF of interest.||1.0|
|--hardmask||Nucleotides in lower case will be converted to the wildcard 'N'. This prevents these positions from being considered in motif matches. This is useful when the input sequence file has been soft-masked for tandem repeats. Without hard masking, MCAST may assign sequence segments containing tandem repeats a highly significant score.||Nucleotides in lower case are converted to upper case.|
|--max-gap||max gap||The value of max gap specifies the longest distance allowed between two hits in a match. Hits separated by more than max gap will be placed in different matches. Note: Large values of max gap combined with large values of pthresh may prevent MCAST from computing E-values.||The maximum gap is set to 50.|
|--max-stored-scores||max||Set the maximum number of scores that will be stored. Keeping a complete list of scores may exceed available memory. Once the number of stored scores reaches the maximum allowed, the least significant 50% of scores will be dropped. In this case, the list of reported motifs may be incomplete and the q-value calculation will be approximate.||The maximum number of stored matches is 100,000.|
|--max-total-width||max||Limit the combined width of all the input motifs to no more than max columns. This can be set to prevent jobs from exceeding the available memory. The memory requirements of MCAST are quadratic in the combined widths of the motifs, and can reach 5Gb when the combined width is greater than 8000 columns.||MCAST does not limit the combined width of all motifs.|
|--motif-pthresh||pthresh||Set the scale for calculating pscores for motif hits. The pscore for a hit with p-value p is S = -log2(p/pthresh).||The motif scaling p-value defaults to 0.0005.|
|--output-ethresh||out E-value||The E-value threshold for displaying search results. If the E-value of a match is greater than this value, then the match will not be printed. If any combination of --output-ethresh, --output-pthresh, or --output-qthresh is given, whichever option occurs last on the command-line will determine the effective output filter.||The E-value threshold is 10.0.|
|--output-pthresh||out p-value||The p-value threshold for displaying search results. If the p-value of a match is greater than this value, then the match will not be printed. If any combination of --output-ethresh, --output-pthresh, or --output-qthresh is given, whichever option occurs last on the command-line will determine the effective output filter.||The E-value is used as the threshold. See --output-ethresh option.|
|--output-qthresh||out q-value||The q-value threshold for displaying search results. If the q-value of a match is greater than this value, then the match will not be printed. If any combination of --output-ethresh, --output-pthresh, or --output-qthresh is given, whichever option occurs last on the command-line will determine the effective output filter.||The E-value is used as the threshold. See --output-ethresh option.|
|--synth||Use synthetic scores for distribution. A 0-order Markov model of nucleotide frequencies will be created by choosing a GC content at random between the observed minimum and maximum values. This model will be used to generate synthetic sequences, and the synthetic sequences will be used to estimate the distribution of p-values.||No synthetic sequences will be generated.|
|--transfac||MCAST will assume that the motif file is in TRANSFAC matrix format.||MCAST assumes the motif file is in MEME format.|
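The pscore formula listed for --motif-pthresh, S = -log2(p/pthresh), can be checked numerically with a short Python sketch. The function name and the input validation are assumptions for illustration. The sketch only evaluates the formula as written and does not model how MCAST treats hits whose p-value exceeds pthresh.

```python
import math

def pscore(p, pthresh=0.0005):
    """Evaluate S = -log2(p / pthresh) for a motif hit with p-value p."""
    # Assumption: both values are probabilities in the interval (0, 1].
    if not (0 < p <= 1) or not (0 < pthresh <= 1):
        raise ValueError("p and pthresh must be in the interval (0, 1]")
    return -math.log2(p / pthresh)

# A hit exactly at the threshold scores 0; each halving of p adds 1 to S.
at_threshold = pscore(0.0005)
four_times_smaller = pscore(0.000125)
```

Under this formula, lowering pthresh lowers the score of every hit by the same amount, and a hit with p greater than pthresh evaluates to a negative value.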
If you use MCAST in your research please cite the following paper:
Timothy Bailey and William Stafford Noble, "Searching for statistically significant regulatory modules", Bioinformatics (Proceedings of the European Conference on Computational Biology), 19(Suppl. 2):ii16-ii25, 2003.