MCAST: Motif Cluster Alignment and Search Tool

MCAST computes statistical confidence estimates by generating at least 100 random sequences whose GC-contents span the same range as the discovered matches (plus 500bp on either side of each cluster). After all matches have been found in your input sequences, MCAST generates the random sequences and scores them using the same algorithm as for the input sequences. MCAST bins the scores of the matches found in the random sequences according to the GC-content of the random sequence. For each match in the input sequences, MCAST determines its GC-content and looks up the mean random score in the appropriate GC-bin. MCAST then uses this mean score to estimate the p-value of the match, which is in turn used to compute its E-value and q-value. Note: This binning approach is different from the approach described in the original MCAST paper mentioned below because MCAST no longer assumes that there is a linear relationship between match GC-content and match score.
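
The binning and lookup steps described above can be sketched in Python. This is an illustration only, not MCAST's implementation. The number of bins, the function names, and in particular the exponential null distribution used to turn a bin's mean score into a p-value are assumptions made for this sketch; the text above states only that the mean score of the bin is used.

```python
import math
from collections import defaultdict

def gc_content(seq):
    """Fraction of G/C among unambiguous bases (0.0 if there are none)."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    return (seq.count("G") + seq.count("C")) / acgt if acgt else 0.0

def bin_index(gc, n_bins):
    """Map a GC fraction in [0, 1] to a bin index in [0, n_bins - 1]."""
    return min(int(gc * n_bins), n_bins - 1)

def mean_score_by_bin(random_matches, n_bins=10):
    """Average the scores of random-sequence matches within each GC bin.

    random_matches is an iterable of (gc_fraction, score) pairs.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for gc, score in random_matches:
        b = bin_index(gc, n_bins)
        totals[b] += score
        counts[b] += 1
    return {b: totals[b] / counts[b] for b in counts}

def estimate_p_value(score, gc, bin_means, n_bins=10):
    """Look up the mean random score for the match's GC bin.

    ASSUMPTION: scores under the null follow an exponential distribution
    with that mean, so P(S >= score) = exp(-score / mean).
    """
    mean = bin_means[bin_index(gc, n_bins)]
    return math.exp(-score / mean)
```

For example, with random matches `[(0.42, 2.0), (0.45, 4.0), (0.71, 10.0)]` and ten bins, the first two fall in the same bin and give it a mean score of 3.0; a real match with GC-content 0.44 is then evaluated against that mean.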

When computing statistical confidence estimates, MCAST must retain the matches in memory until the final distribution of scores can be estimated. This means that scanning genome-sized datasets has the potential to exhaust all available memory. To avoid this problem, MCAST uses reservoir sampling of the match scores and limits the number of matches that are kept in memory. The default number of matches kept in memory is 100,000, but this value can be adjusted via the --max-stored-scores option. If the maximum number of stored matches is reached, MCAST will drop the least significant half of the matches. As a result, some matches may be missing from the MCAST output even though they would have satisfied the user-specified p-value or q-value threshold.
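
The effect of the storage limit can be sketched as follows. This is an illustration only, not MCAST's implementation: it models just the step that drops the least significant half of the stored matches (not the reservoir sampling), it assumes that a higher score means a more significant match, and the class name `BoundedMatchStore` and its fields are hypothetical.

```python
class BoundedMatchStore:
    """Keep at most max_stored matches, halving the store when it fills up."""

    def __init__(self, max_stored=100_000):
        if max_stored < 2:
            raise ValueError("max_stored must be at least 2")
        self.max_stored = max_stored
        self.matches = []   # list of (score, match_id) pairs
        self.dropped = 0    # number of matches discarded so far

    def add(self, score, match_id):
        self.matches.append((score, match_id))
        if len(self.matches) >= self.max_stored:
            self._drop_least_significant_half()

    def _drop_least_significant_half(self):
        # ASSUMPTION: higher score = more significant match.
        self.matches.sort(key=lambda m: m[0], reverse=True)
        keep = len(self.matches) // 2
        self.dropped += len(self.matches) - keep
        del self.matches[keep:]
```

A non-zero `dropped` count after a scan corresponds to the situation described above, where the reported list of matches may be incomplete.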

MCAST can make use of position-specific priors (PSPs) to improve its identification of true motif occurrences. To take advantage of PSPs in MCAST you must provide two command line options. The --psp option sets the name of a file containing the PSP, and the --prior-dist option sets the name of a file containing the binned distribution of the PSP.

The PSP can be provided in MEME PSP file format, or in wiggle format. The MEME PSP file format requires that a PSP be included for every position in the sequence to be scanned. This format is usually only practical for relatively small sequence databases. The wiggle format accommodates sequence segments with missing PSP. When no PSP is available for a given position, MCAST will use the median PSP from the PSP distribution file. The wiggle format will work with large sequence databases, including full genomes. MCAST also uses the median PSP value in scoring all positions in the random sequences it generates for computing statistical confidence estimates.
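
The fallback to the median PSP can be pictured with a short Python sketch. It is not MCAST code: the dictionary representation of the PSP and the function names are hypothetical, and the median is simply passed in rather than read from a PSP distribution file.

```python
def psp_at(position, psp_by_position, median_psp):
    """Return the PSP for a position, or the median PSP if none is available."""
    return psp_by_position.get(position, median_psp)

def priors_for_region(start, end, psp_by_position, median_psp):
    """PSP values for positions start..end-1, filling gaps with the median."""
    return [psp_at(pos, psp_by_position, median_psp) for pos in range(start, end)]

def priors_for_random_sequence(length, median_psp):
    """Random sequences get the median PSP at every position."""
    return [median_psp] * length
```

With `psp_by_position = {0: 0.1, 2: 0.3}` and a median of 0.05, positions 1 and 3 of a four-position region receive 0.05.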

The PSP and PSP distribution files can be generated from raw scores using the create-priors utility available when you download and install the MEME Suite on your own computer.

A full description of the algorithm may be found in:

<motif file>

The name of a file containing DNA motifs in MEME format. Outputs from MEME, STREME and DREME are supported, as well as Minimal MEME Format. You can also input DNA motifs in TRANSFAC format if you specify the --transfac option. You can convert many other motif formats to MEME format using conversion scripts available with the MEME Suite. Note: All motifs must have width at least 2.

<sequence file>

The name of a file of DNA sequences in FASTA format.
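
As a rough sketch of how the two input files and a few of the options documented on this page fit together, the Python helper below builds an argument list that could be passed to `subprocess.run`. The executable name `mcast`, the argument order (options first, then the motif file, then the sequence file), the helper name, and the file names in the usage note are assumptions made for this illustration.

```python
def build_mcast_command(motif_file, sequence_file, out_dir=None,
                        motif_pthresh=None, max_gap=None, hardmask=False):
    """Build an argument list for running MCAST (nothing is executed here)."""
    cmd = ["mcast"]  # ASSUMPTION: the executable is on PATH under this name
    if out_dir is not None:
        cmd += ["--oc", out_dir]
    if motif_pthresh is not None:
        cmd += ["--motif-pthresh", str(motif_pthresh)]
    if max_gap is not None:
        cmd += ["--max-gap", str(max_gap)]
    if hardmask:
        cmd.append("--hardmask")
    # ASSUMPTION: positional arguments follow the options in this order.
    cmd += [motif_file, sequence_file]
    return cmd
```

For instance, `build_mcast_command("motifs.meme", "seqs.fasta", out_dir="results", max_gap=100)` yields `["mcast", "--oc", "results", "--max-gap", "100", "motifs.meme", "seqs.fasta"]`.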

MCAST will create a directory named mcast_out (the name of this directory can be overridden via the --o or --oc options). The directory will contain:

mcast.html - an HTML file that provides the results in a human-readable format
mcast.tsv - a TSV (tab-separated values) file that provides the results in a format suitable for parsing by scripts and viewing with Excel
mcast.gff - a GFF3 format file that provides the results in a format suitable for display in the UCSC genome browser
cisml.xml - an XML file that provides the results in the CisML schema
mcast.xml - an XML file that describes the inputs to MCAST and references the CisML file cisml.xml

Note: See this detailed description of the MCAST output formats for more information.
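
Because mcast.tsv is intended for parsing by scripts, a minimal Python reader might look like the sketch below. It makes no claim about the actual columns of the file: it takes the field names from the first non-blank line, and it assumes that blank lines and lines starting with `#` can be skipped. The column names in the sample text are hypothetical.

```python
import csv
import io

def read_tsv_rows(handle):
    """Read a tab-separated file into a list of dicts keyed by the header row.

    ASSUMPTION: blank lines and lines beginning with '#' carry no data.
    """
    lines = (ln for ln in handle if ln.strip() and not ln.startswith("#"))
    return list(csv.DictReader(lines, delimiter="\t"))

# Usage with an in-memory sample; the column names here are hypothetical.
sample = io.StringIO("name\tscore\nm1\t3.5\n\n# comment\n")
rows = read_tsv_rows(sample)
```

For a real run, the same function can be given an open file handle, for example `open("mcast_out/mcast.tsv", newline="")`.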

Option	Parameter	Description	Default Behavior
Output
Motifs
--transfac		MCAST will assume that the motif file is in TRANSFAC matrix format.	MCAST assumes the motif file is in MEME format.
--max-total-width	max	Limit the combined width of all the input motifs to no more than max columns. This can be set to prevent jobs from exceeding the available memory. The memory requirements of MCAST are quadratic in the combined widths of the motifs, and can reach 5Gb when the combined width is greater than 8000 columns.	MCAST does not limit the combined width of all motifs.
Sequences
--hardmask		Nucleotides in lower case will be converted to the wildcard 'N'. This prevents these positions from being considered in motif matches. This is useful when the input sequence file has been soft-masked for tandem repeats. Without hard masking, MCAST may assign sequence segments containing tandem repeats a highly significant score.	Nucleotides in lower case are converted to upper case.
Background Model and Priors
Hidden Markov Model
--motif-pthresh	pthresh	The maximum p-value for a motif site to be considered a hit and to be included in a match. Note: Small values of pthresh may prevent MCAST from computing E-values. Note: This value also sets the scale for calculating pscores for motif hits. The pscore for a hit with p-value p is S = -log₂(p/pthresh),	The maximum p-value for a hit is 0.0005.
--max-gap	max gap	The value of max gap specifies the longest distance allowed between two hits in a match. Hits separated by more than max gap will be placed in different matches. Note: Large values of max gap may prevent MCAST from computing E-values.	The maximum gap is set to 50.
Scoring
--output-ethresh	out E-value	The E-value threshold for displaying search results. If the E-value of a match is greater than this value, then the match will not be printed. If any combination of --output-ethresh, --output-pthresh, or --output-qthresh is given, whichever option occurs last on the command-line will determine the effective output filter.	The E-value threshold is 10.
--output-pthresh	out p-value	The p-value threshold for displaying search results. If the p-value of a match is greater than this value, then the match will not be printed. If any combination of --output-ethresh, --output-pthresh, or --output-qthresh is given, whichever option occurs last on the command-line will determine the effective output filter.	The E-value is used as the threshold. See --output-ethresh option.
--output-qthresh	out q-value	The q-value threshold for displaying search results. If the q-value of a match is greater than this value, then the match will not be printed. If any combination of --output-ethresh, --output-pthresh, or --output-qthresh is given, whichever option occurs last on the command-line will determine the effective output filter.	The E-value is used as the threshold. See --output-ethresh option.
--max-stored-scores	max	Set the maximum number of scores that will be stored. Keeping a complete list of scores may exceed available memory. Once the number of stored scores reaches the maximum allowed, the least significant 50% of scores will be dropped. In this case, the list of reported motifs may be incomplete and the q-value calculation will be approximate.	The maximum number of stored matches is 100,000.
--seed		MCAST uses this seed in the generation of the random sequences that it uses to estimate the p-, E- and q-values of the discovered matches.	0
Misc
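
Two of the options above lend themselves to a small worked example in Python: the pscore formula given for --motif-pthresh, and the rule that hits separated by more than max gap are placed in different matches. This is a sketch, not MCAST's algorithm. The function names are hypothetical, hits are represented as (start, end) pairs with an exclusive end, and the gap is assumed to be measured from the end of one hit to the start of the next.

```python
import math

def pscore(p, pthresh=0.0005):
    """pscore of a hit with p-value p: S = -log2(p / pthresh)."""
    if not 0 < p <= pthresh:
        raise ValueError("a hit must have 0 < p <= pthresh")
    return -math.log2(p / pthresh)

def group_hits(hits, max_gap=50):
    """Group hits so that neighbours in a group are at most max_gap apart.

    ASSUMPTION: hits are (start, end) with end exclusive, and the gap is
    next_start - previous_end.
    """
    groups = []
    for start, end in sorted(hits):
        if groups and start - groups[-1][-1][1] <= max_gap:
            groups[-1].append((start, end))
        else:
            groups.append([(start, end)])
    return groups
```

A hit whose p-value equals pthresh has a pscore of 0, and each halving of the p-value adds 1 to the pscore. With the default gap of 50, the hits (0, 10), (40, 50) and (200, 210) form two groups, because the third hit starts 150 positions after the second ends.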
