mast <motif file> <sequence file> [options]
MAST is a tool for searching biological sequence databases for sequences that contain one or more of a group of known motifs.
A motif is a sequence pattern that occurs repeatedly in a group of related protein or DNA sequences. Motifs are represented as position-dependent scoring matrices that describe the score of each possible letter at each position in the pattern. Individual motifs may not contain gaps. Patterns with variable-length gaps must be split into two or more separate motifs before being submitted as input to MAST.
MAST takes as input a file containing the descriptions of one or more motifs and searches a sequence database that you select for sequences that match the motifs. The motif file can be the output of the MEME motif discovery tool or any file in the appropriate format.
<motif file>: A file containing MEME-formatted motifs. Output from MEME and DREME is supported, as is the minimal MEME format, for which conversion scripts are available to support other formats. MAST formerly required a log-odds matrix in each motif; the log-odds section is still used preferentially when present, but it is no longer required.
<sequence file>: A file containing FASTA-formatted sequences that are suspected to contain motif sites. See the -dblist option if you need to specify multiple sequence databases.
MAST outputs an XML file, which can then be converted into HTML or text format. The XML file is designed for machine processing and the HTML file for human viewing. The text format is kept for backwards compatibility; however, because the XML was designed to optimise HTML generation, the text output for separate scoring mode is not identical to the earlier text output, and some options were removed.
MAST outputs three things:
MAST works by calculating match scores for each sequence in the database compared with each of the motifs in the group of motifs you provide. For each sequence, the match scores are converted into various types of p-values and these are used to determine the overall match of the sequence to the group of motifs and the probable order and spacing of occurrences of the motifs in the sequence.
MAST generates a human readable file from the XML output containing:
The HTML version is recommended for human reading and documents all of its sections; the text version has no documentation for the first section. That section lists each motif along with the sequence that would achieve the best possible match score.
In order to avoid biased scores when multiple motif scores are combined, MAST also computes the correlation between each pair of motifs. The correlation between two motifs is the maximum sum of Pearson's correlation coefficients for aligned columns divided by the width of the shorter motif. The maximum is found by trying all alignments of the two motifs. Motifs with correlations below 0.60 have little effect on the accuracy of the combined scores. Pairs of motifs with higher correlations should be removed from the query (see the -remcorr option).
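The correlation measure described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not MAST's own code: each motif is taken to be a list of positions, each position a list of scores over the alphabet; "all alignments" is assumed to mean every offset at which the two motifs overlap by at least one position; and a position with constant scores is assumed to contribute a correlation of 0. The function names `pearson` and `motif_correlation` are hypothetical.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length score vectors.

    Returns 0.0 if either vector is constant (an assumption made here
    to avoid dividing by zero).
    """
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)

def motif_correlation(m1, m2):
    """Maximum, over all overlapping alignments, of the summed per-position
    correlations, divided by the width of the shorter motif."""
    w1, w2 = len(m1), len(m2)
    best = float("-inf")
    # offset is the index in m1 that lines up with position 0 of m2
    for offset in range(-(w2 - 1), w1):
        total = 0.0
        for j in range(w2):
            i = offset + j
            if 0 <= i < w1:
                total += pearson(m1[i], m2[j])
        best = max(best, total)
    return best / min(w1, w2)

# Hypothetical three-position DNA motif (scores for A, C, G, T).
motif = [
    [1.447, 0.188, -4.025, -4.095],
    [0.739, 1.339, -3.945, -2.325],
    [1.764, -3.562, -4.197, -3.895],
]
print(motif_correlation(motif, motif))  # a motif correlates perfectly with itself
```

Under this sketch, a pair of motifs scoring 0.60 or more would be a candidate for removal from the query.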
Option | Parameter | Description | Default Behaviour |
---|---|---|---|
Input Options | | | |
-bfile | file | A file with background frequencies of letters. A Markov background model is expected. | Uses predefined background frequencies from the non-redundant database (NRD). |
-dblist | | Interpret the sequence file as a list of FASTA-formatted databases. | |
Output Options | | | |
-hit_list | | Write a machine-readable list of all hits to standard output. No other output is created. | |
Which Motifs To Use | | | |
-remcorr | | Remove highly correlated motifs from query. | |
-m | n | Use only motifs appearing at the nth position in the file. This option may be repeated. | Use all the motifs. |
-c | count | Only use the first count motifs. | Use all the motifs. |
-mev | evalue | Use only motifs with E-values ≤ evalue. | Use all the motifs. |
-diag | diagram | The nominal order and spacing of motifs is specified by diagram, which is a block diagram. | |
DNA Only Options | | | |
-norc | | Do not score the reverse complement DNA strand. This option is not compatible with the -sep or -dna options. | |
-sep | | Score the reverse complement DNA strand as a separate sequence. This option is not compatible with the -norc or -dna options. | |
-dna | | Translate the DNA sequences to protein so protein motifs may be scanned. The motifs must be protein and the sequences must be DNA. This option is not compatible with the -norc or -sep options. | |
-comp | | Adjust the p-values and E-values for sequence composition. | |
Which Results To Print | | | |
-ev | evalue | Output results for sequences with E-values < evalue. | Output results for sequences with E-values < 10. |
Appearance of Block Diagrams | | | |
-mt | mt | Show motif matches with p-value < mt. | Show motif matches with a p-value < 0.0001. |
-w | | Show weak matches (mt < p-value < mt * 10) in angle brackets in the hit list or when the XML is converted to text. | |
-best | | Include only the best motif hits in -hit_list diagrams. | |
-seqp | | Use SEQUENCE p-values for motif thresholds. | Use POSITION p-values for motif thresholds. |
Miscellaneous | | | |
-mf | mf | In results use mf as motif file name. | |
-df | df | In results use df as database name. This option is ignored when -dblist is specified. | |
-dl | dl | In results use dl as link to search sequence names. The token SEQUENCEID is replaced with the FASTA sequence ID. This option is ignored when -dblist is specified. | |
-minseqs | ms | The lower bound on the number of sequences in the database. | |
-nostatus | | Do not print progress updates to standard error. | |
-notext | | Do not create text output. | |
-nohtml | | Do not create HTML output. | |
The match score of a motif to a position in a sequence is the sum of the score from each column of the position-dependent scoring matrix corresponding to the letter at that position in the sequence. For example, if the sequence is
TAATGTTGGTGCTGGTTTTTGTGGCATCGGGCGAGAATAGCGC
   ========
and the motif is represented by the position-dependent scoring matrix (where each row of the matrix corresponds to a position in the motif)
Position | A | C | G | T |
---|---|---|---|---|
1 | 1.447 | 0.188 | -4.025 | -4.095 |
2 | 0.739 | 1.339 | -3.945 | -2.325 |
3 | 1.764 | -3.562 | -4.197 | -3.895 |
4 | 1.574 | -3.784 | -1.594 | -1.994 |
5 | 1.602 | -3.935 | -4.054 | -1.370 |
6 | 0.797 | -3.647 | -0.814 | 0.215 |
7 | -1.280 | 1.873 | -0.607 | -1.993 |
8 | -3.076 | 1.035 | 1.414 | -3.913 |
then the match score of the fourth position in the sequence (the eight letters marked with = characters) would be found by summing the score for T in position 1, G in position 2 and so on until G in position 8. So the match score would be
score = -4.095 + -3.945 + -3.895 + -1.994 + -4.054 + -0.814 + -1.993 + 1.414 = -19.376
The match scores for other positions in the sequence are calculated in the same way. Match scores are only calculated if the match completely fits within the sequence. Match scores are not calculated if the motif would overhang either end of the sequence.
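The calculation above can be reproduced with a short Python sketch. It uses the sequence and the scoring matrix exactly as tabulated (so T at motif position 7 contributes -1.993), and it skips any placement that would overhang the end of the sequence. The function names are hypothetical and the code is illustrative rather than MAST's implementation.

```python
SEQUENCE = "TAATGTTGGTGCTGGTTTTTGTGGCATCGGGCGAGAATAGCGC"
ALPHABET = "ACGT"  # column order of the scoring matrix

# One row per motif position; columns are scores for A, C, G, T.
MATRIX = [
    [1.447, 0.188, -4.025, -4.095],
    [0.739, 1.339, -3.945, -2.325],
    [1.764, -3.562, -4.197, -3.895],
    [1.574, -3.784, -1.594, -1.994],
    [1.602, -3.935, -4.054, -1.370],
    [0.797, -3.647, -0.814, 0.215],
    [-1.280, 1.873, -0.607, -1.993],
    [-3.076, 1.035, 1.414, -3.913],
]

def match_score(sequence, start, matrix, alphabet=ALPHABET):
    """Match score of the motif placed at 1-based position `start`.

    Returns None if the motif would overhang either end of the sequence.
    """
    width = len(matrix)
    begin = start - 1
    if begin < 0 or begin + width > len(sequence):
        return None
    return sum(
        matrix[k][alphabet.index(sequence[begin + k])] for k in range(width)
    )

def all_match_scores(sequence, matrix):
    """Scores for every start position at which the motif fits completely."""
    last_start = len(sequence) - len(matrix) + 1
    return [match_score(sequence, s, matrix) for s in range(1, last_start + 1)]

print(round(match_score(SEQUENCE, 4, MATRIX), 3))  # -19.376
print(len(all_match_scores(SEQUENCE, MATRIX)))     # number of scored positions
```

A sequence of length L and a motif of width w give L - w + 1 scored positions, which is why sequences shorter than the motif produce none.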
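MAST reports all matches of a sequence to a motif or group of motifs in terms of the p-value of the match. MAST considers four types of events, each defined below: the match of a position within a sequence to a motif (position p-value), the match of a sequence to a motif (sequence p-value), the match of a sequence to a group of motifs (combined p-value), and the match of a sequence in a database to a group of motifs (E-value).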
All p-values are based on a random sequence model that assumes each position in a random sequence is generated according to the average letter frequencies of all sequences in the appropriate (peptide or nucleotide) non-redundant database (ftp://ncbi.nlm.nih.gov/blast/db/) on September 22, 1996. This can be overridden by specifying the -bfile or -comp options (see below). For DNA sequences, unless -norc is given, the positive and reverse complement strand frequencies are averaged together.
-comp
The random model uses the letter frequencies in the current target sequence instead of the non-redundant database frequencies. This causes p-values and E-values to be compensated individually for the actual composition of each sequence in the database. This option can increase search time substantially due to the need to compute a different score distribution for each high-scoring sequence. With this option and DNA sequences, the positive and reverse complement strand frequencies are not averaged together.
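The strand averaging that applies to DNA by default, and that -comp switches off, can be illustrated as follows. The sketch assumes a simple per-letter (order-0) background given as a dictionary; the frequency of a letter on the reverse complement strand is the frequency of its complement on the given strand, so averaging the two strands makes A equal to T and C equal to G. The function name and the example frequencies are hypothetical.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

def average_strands(freqs):
    """Average letter frequencies over the given and reverse complement strands."""
    return {
        base: (freqs[base] + freqs[COMPLEMENT[base]]) / 2
        for base in freqs
    }

# Hypothetical single-strand frequencies.
single_strand = {"A": 0.35, "C": 0.15, "G": 0.25, "T": 0.25}
both_strands = average_strands(single_strand)
print(both_strands)
```

The averaged frequencies still sum to 1, since each original value is counted once for its own letter and once for its complement.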
The p-value of a match of a given position within a sequence to a motif is defined as the probability of a randomly selected position in a randomly generated sequence having a match score at least as large as that of the given position. Note: If MAST is combining reverse complement DNA strands, the position p-value is not corrected for multiple tests.
The p-value of a match of a sequence to a motif is defined as the probability of a randomly generated sequence of the same length having a match score at least as large as the largest match score of any position in the sequence.
The p-value of a match of a sequence to a group of motifs is defined as the probability of a randomly generated sequence of the same length having sequence p-values whose product is at least as small as the product of the sequence p-values of the matches of the motifs to the given sequence.
The E-value of the match of a sequence in a database to a group of motifs is defined as the expected number of sequences in a random database of the same size that would match the motifs as well as the sequence does and is equal to the combined p-value of the sequence times the number of sequences in the database.
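The four definitions above chain together, and the following Python sketch shows one way the chain could be computed. It rests on assumptions that this page does not confirm: positions within a sequence are treated as independent when deriving the sequence p-value, and the combined p-value uses the standard closed form for the product of n independent uniform p-values, P(product ≤ p) = p · Σ_{i=0}^{n-1} (-ln p)^i / i!. Only the last step, E-value = combined p-value × number of sequences, is taken directly from the text. All function names are hypothetical.

```python
from math import exp, expm1, log, log1p

def sequence_pvalue(best_position_pvalue, seq_len, motif_width):
    """Probability that the best of n independent positions scores this well.

    ASSUMPTION: positions are independent, so p_seq = 1 - (1 - p)^n
    with n = seq_len - motif_width + 1 scored positions.
    """
    n = seq_len - motif_width + 1
    if n <= 0:
        raise ValueError("sequence is shorter than the motif")
    if best_position_pvalue >= 1.0:
        return 1.0
    # -expm1(n * log1p(-p)) equals 1 - (1 - p)^n but stays accurate for tiny p
    return -expm1(n * log1p(-best_position_pvalue))

def combined_pvalue(sequence_pvalues):
    """P-value of the product of n independent sequence p-values.

    ASSUMPTION: uses p * sum_{i<n} (-ln p)^i / i!, where p is the product.
    """
    n = len(sequence_pvalues)
    log_product = sum(log(p) for p in sequence_pvalues)
    term, total = 1.0, 1.0
    for i in range(1, n):
        term *= -log_product / i
        total += term
    return min(1.0, exp(log_product) * total)

def evalue(combined_p, num_sequences):
    """Expected number of random sequences matching at least as well."""
    return combined_p * num_sequences

p_seq = [sequence_pvalue(1e-6, 300, 8), sequence_pvalue(1e-5, 300, 8)]
p_comb = combined_pvalue(p_seq)
print(p_seq, p_comb, evalue(p_comb, 5000))
```

Working with the logarithm of the product avoids underflow when many small p-values are multiplied.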
MAST lists the names and part of the descriptive text of all sequences whose E-value is less than E. Sequences shorter than one or more of the motifs are skipped. The sequences are sorted by increasing E-value. The value of E is set to 10 for the web server but is user-selectable (with the -ev option) in the downloadable version of MAST.
Motif diagrams show the order and spacing of non-overlapping matches to the motifs in each high-scoring sequence. Motif occurrences are determined based on the position p-value of matches to the motif. Strong matches (p-value < mt) are shown in square brackets (`[ ]'), weak matches (mt < p-value < mt * 10) are shown in angle brackets (`< >') and the length of non-motif sequence ("spacer") is shown between underscores (`_'). For example,
27_[3]_44_<4>_99_[1]_7
shows an initial spacer of length 27, followed by a strong match to motif 3, a spacer of length 44, a weak match to motif 4, a spacer of length 99, a strong match to motif 1 and a final non-motif sequence of length 7. The value of mt is 0.0001 for the web server but is user-selectable (with the -mt option) in the downloadable version of the MEME Suite.
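For downstream processing, a diagram in this notation can be split into its parts with a small parser. The sketch below handles only the three element types described above (spacer lengths, strong matches in square brackets, weak matches in angle brackets); any other token is rejected. The function name and the tuple layout are hypothetical.

```python
import re

# One diagram element: [n] strong match, <n> weak match, or a bare spacer length.
TOKEN = re.compile(r"\[(\d+)\]|<(\d+)>|(\d+)")

def parse_diagram(diagram):
    """Split a motif diagram into (kind, number) tuples.

    kind is "strong" or "weak" (number = motif) or "spacer" (number = length).
    """
    parts = []
    for token in diagram.split("_"):
        match = TOKEN.fullmatch(token)
        if match is None:
            raise ValueError(f"unrecognised diagram token: {token!r}")
        strong, weak, spacer = match.groups()
        if strong is not None:
            parts.append(("strong", int(strong)))
        elif weak is not None:
            parts.append(("weak", int(weak)))
        else:
            parts.append(("spacer", int(spacer)))
    return parts

for kind, number in parse_diagram("27_[3]_44_<4>_99_[1]_7"):
    print(kind, number)
```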
MAST annotates each high-scoring sequence by printing the sequence along with the position and strength of all the non-overlapping motif occurrences. The four lines above each motif occurrence contain, respectively,
The best possible match to a motif is the sequence of letters which would achieve the highest match score.
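In terms of the scoring matrix, the best possible match is obtained by taking the highest-scoring letter at each motif position. The sketch below illustrates this with a three-position DNA matrix used purely as an example (columns A, C, G, T); the function name is hypothetical, and ties are resolved in favour of the first letter in alphabet order, which is an assumption.

```python
ALPHABET = "ACGT"

# Example matrix: one row per motif position, columns A, C, G, T.
MATRIX = [
    [1.447, 0.188, -4.025, -4.095],
    [0.739, 1.339, -3.945, -2.325],
    [1.764, -3.562, -4.197, -3.895],
]

def best_possible_match(matrix, alphabet=ALPHABET):
    """Return the highest-scoring letter sequence and its match score."""
    letters = [alphabet[row.index(max(row))] for row in matrix]
    score = sum(max(row) for row in matrix)
    return "".join(letters), score

sequence, score = best_possible_match(MATRIX)
print(sequence, round(score, 3))  # ACA 4.55
```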
If you specify the -hit_list switch to MAST, MAST outputs ONLY a list of "hits" in easily machine-readable format. Each line corresponds to one motif occurrence in one sequence. The format of the hit lines is
where
sequence_name | is the name of the sequence containing the hit, |
strand | is the strand (+ or - for DNA, blank for protein), |
motif | is the motif number, |
start | is the starting position of the hit, |
end | is the ending position of the hit, |
score | is the score of the hit, and |
p-value | is the position p-value of the hit. |
Two comment lines (starting with "#") are written above the list of hits, and the MAST command line is printed as a comment line after the list.
An example of the output using the -hit_list switch to MAST is:
MAST can load multiple sequence databases: put the database file names into a list file and give that file in place of the sequence database, together with the -dblist option.
The file list has one file name on each line with the optional name and link as follows:
<file> [<name> <link>] ... ...
If the name is specified, it is used instead of the file name in the output. If the link is specified, every sequence from that database in the HTML output is hyperlinked to that URL, with the text SEQUENCEID replaced by the FASTA sequence ID.
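A list file of this kind is easy to read programmatically. The sketch below assumes that the fields on each line are separated by whitespace and that names and links contain no spaces; neither assumption is confirmed here. The file names, the database name and the URL are made up for illustration, and the function names are hypothetical.

```python
def parse_db_list(text):
    """Parse lines of the form '<file> [<name> <link>]' into dictionaries.

    ASSUMPTION: fields are whitespace-separated. When name and link are
    absent, the file name doubles as the display name and link is None.
    """
    entries = []
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        entry = {"file": fields[0], "name": fields[0], "link": None}
        if len(fields) >= 3:
            entry["name"], entry["link"] = fields[1], fields[2]
        entries.append(entry)
    return entries

def sequence_url(link, sequence_id):
    """Substitute a FASTA sequence ID for the SEQUENCEID token in a link."""
    return link.replace("SEQUENCEID", sequence_id)

# Hypothetical list file contents.
db_list = """\
plants.fasta
yeast.fasta Yeast https://example.org/lookup?id=SEQUENCEID
"""
entries = parse_db_list(db_list)
print(entries[0]["name"])
print(sequence_url(entries[1]["link"], "P12345"))
```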
The following examples assume that the file "meme.results" is the output of a MEME run containing at least 3 motifs, created from the training set "training.fasta"; that the file SwissProt is a copy of the Swiss-Prot database on your local disk; and that DNA_DB is a copy of a DNA database on your local disk.
mast meme.results training.fasta
mast meme.results SwissProt
mast meme.results SwissProt -ev 200
mast meme.results SwissProt -diag "9-[2]-61-[1]-62-[3]-91"
mast meme.results SwissProt -m 1 -m 3
mast meme.results SwissProt -c 2
mast meme.results DNA_DB -dna -comp
If you use MAST in your research, please cite the following paper:
Timothy L. Bailey and Michael Gribskov,
"Combining evidence using p-values: application to sequence homology searches",
Bioinformatics, 14(1):48-54, 1998.