MAST -- Motif Alignment and Search Tool

Motif search tool


MAST sends you two e-mail messages:
  • a confirmation message, and
  • the search results.

The e-mail messages and how MAST computes match scores and their statistical significance (p-values) are explained in the following sections. A sample search results file is also provided.


Confirmation message

The first e-mail message you receive should be a confirmation that your search request has been received. It looks something like this:
            Subject: MAST confirmation: alcohol dehydrogenase motifs
             
            Your MAST search request 14019 is being processed:
            Motif file: adh
            Database to search: SwissProt
            
If you fail to receive the confirmation message, check your e-mail address and try resubmitting your MAST request.

Search Results

The second e-mail message you should receive contains the results of the MAST search. It contains:
  • the version of MAST and the date it was built,
  • the reference to cite if you use MAST in your research,
  • a description of the database and motifs used in the search,
  • an explanation of the results,
  • high-scoring sequences--sequences matching the group of motifs above a stated level of statistical significance,
  • motif diagrams showing the order and spacing of occurrences of the motifs in the high-scoring sequences, and
  • annotated sequences showing the positions and p-values of all motif occurrences in each of the high-scoring sequences.

Each section of the results file contains an explanation of how to interpret it.

Match Scores

The match score of a motif to a position in a sequence is computed by aligning the first row of the motif's position-dependent scoring matrix with that position and summing, over all rows, the score each row assigns to the sequence letter aligned with it. For example, if the sequence is
            TAATGTTGGTGCTGGTTTTTGTGGCATCGGGCGAGAATAGCGC
               ========
            
and the motif is represented by the position-dependent scoring matrix (where each row of the matrix corresponds to a position in the motif)
            =========|=================================
            POSITION |   A        C        G        T
            =========|=================================
              1      | 1.447    0.188   -4.025   -4.095 
              2      | 0.739    1.339   -3.945   -2.325 
              3      | 1.764   -3.562   -4.197   -3.895 
              4      | 1.574   -3.784   -1.594   -1.994 
              5      | 1.602   -3.935   -4.054   -1.370 
              6      | 0.797   -3.647   -0.814    0.215 
              7      |-1.280    1.873   -0.607   -1.933 
              8      |-3.076    1.035    1.414   -3.913 
            =========|=================================
            
then the match score of the fourth position in the sequence (underlined) would be found by summing the score for T in position 1, G in position 2 and so on until G in position 8. So the match score would be
              score = -4.095 + -3.945 + -3.895 + -1.994
                      + -4.054 + -0.814 + -1.933 + 1.414 
                    = -19.316
            
The match scores for other positions in the sequence are calculated in the same way. Match scores are calculated only where the motif fits completely within the sequence; they are not calculated where the motif would overhang either end of the sequence.
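The calculation above can be sketched in Python as follows. This is a minimal illustration and not MAST's own code: the names `MATRIX`, `match_score` and `all_match_scores` are hypothetical, and the matrix and sequence are copied from the example above.

```python
# Illustrative sketch only; all names here are hypothetical.
LETTERS = "ACGT"

# One row per motif position; columns are in the order A, C, G, T.
MATRIX = [
    (1.447, 0.188, -4.025, -4.095),
    (0.739, 1.339, -3.945, -2.325),
    (1.764, -3.562, -4.197, -3.895),
    (1.574, -3.784, -1.594, -1.994),
    (1.602, -3.935, -4.054, -1.370),
    (0.797, -3.647, -0.814, 0.215),
    (-1.280, 1.873, -0.607, -1.933),
    (-3.076, 1.035, 1.414, -3.913),
]

SEQUENCE = "TAATGTTGGTGCTGGTTTTTGTGGCATCGGGCGAGAATAGCGC"


def match_score(sequence, start, matrix=MATRIX):
    """Score the motif placed at 0-based offset `start` in `sequence`."""
    if start < 0 or start + len(matrix) > len(sequence):
        raise ValueError("motif would overhang the end of the sequence")
    return sum(
        row[LETTERS.index(sequence[start + i])] for i, row in enumerate(matrix)
    )


def all_match_scores(sequence, matrix=MATRIX):
    """Scores for every offset at which the motif fits completely."""
    last_start = len(sequence) - len(matrix)
    return [match_score(sequence, s, matrix) for s in range(last_start + 1)]


# The fourth position of the sequence is offset 3 when counting from zero.
print(round(match_score(SEQUENCE, 3), 3))
```

Rounded to three decimals, the printed value agrees with the -19.316 worked out by hand above.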

P-values

MAST reports all matches of a sequence to a motif or group of motifs in terms of the p-value of the match. MAST considers the p-values of four types of events:
  • position p-value: the match of a single position within a sequence to a given motif,
  • sequence p-value: the best match of any position within a sequence to a given motif,
  • combined p-value: the combined best matches of a sequence to a group of motifs, and
  • e-value: observing a combined p-value at least as small in a random database of the same size.
All p-values are based on a random sequence model that assumes each position in a random sequence is generated according to the average letter frequencies of all sequences in the database being searched.

Position p-value

The p-value of a match of a given position within a sequence to a motif is defined as the probability of a randomly selected position in a randomly generated sequence having a match score at least as large as that of the given position. Note: If MAST is combining reverse complement DNA strands, the position p-value is not corrected for multiple tests.
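For a short motif, this definition can be evaluated exactly by building the distribution of match scores under the random model one motif row at a time. The sketch below does that; the function name, the toy two-row matrix and the uniform letter frequencies are assumptions made for illustration, and it is not presented as the algorithm MAST itself uses.

```python
from collections import defaultdict


def position_pvalue(matrix, background, observed_score, scale=1000):
    """Probability that a random position scores >= observed_score.

    matrix:     list of dicts mapping letter -> score, one per motif row
    background: dict mapping letter -> frequency (frequencies sum to 1)
    scale:      scores are turned into integers with this factor so that
                equal sums fall into the same bucket
    """
    distribution = {0: 1.0}  # scaled partial score -> probability
    for row in matrix:
        updated = defaultdict(float)
        for partial, prob in distribution.items():
            for letter, freq in background.items():
                updated[partial + round(row[letter] * scale)] += prob * freq
        distribution = updated
    threshold = round(observed_score * scale)
    return sum(p for s, p in distribution.items() if s >= threshold)


# Hypothetical toy motif: each row rewards an A and nothing else.
toy_matrix = [{"A": 1.0, "C": 0.0, "G": 0.0, "T": 0.0}] * 2
uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
print(position_pvalue(toy_matrix, uniform, 2.0))  # only "AA" scores 2
```

The number of distinct partial scores can grow with every row, so this direct approach is practical only for short motifs or coarsely rounded scores.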

Sequence p-value

The p-value of a match of a sequence to a motif is defined as the probability of a randomly generated sequence of the same length having a match score at least as large as the largest match score of any position in the sequence.
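If the positions of a sequence are treated as independent trials, the sequence p-value can be approximated from the best position p-value p and the number n of positions where the motif fits as 1 - (1 - p)^n. That independence is an assumption of this sketch; the definition above does not say how the probability is evaluated, and the function name is hypothetical.

```python
import math


def sequence_pvalue(best_position_pvalue, sequence_length, motif_width):
    """Approximate 1 - (1 - p)^n, assuming independent positions."""
    n = sequence_length - motif_width + 1  # positions where the motif fits
    if n <= 0:
        raise ValueError("sequence is shorter than the motif")
    if best_position_pvalue >= 1.0:
        return 1.0
    # expm1 and log1p keep precision when the p-value is very small.
    return -math.expm1(n * math.log1p(-best_position_pvalue))


# A 43-letter sequence and an 8-wide motif give 36 candidate positions.
print(sequence_pvalue(1e-6, 43, 8))
```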

Combined p-value

The p-value of a match of a sequence to a group of motifs is defined as the probability of a randomly generated sequence of the same length having sequence p-values whose product is at least as small as the product of the sequence p-values of the matches of the motifs to the given sequence.
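If the individual sequence p-values are assumed to be independent and uniformly distributed between 0 and 1 under the random model, the probability that the product of n of them is at most x has the closed form x · Σ (-ln x)^i / i!, summed over i = 0 … n-1. The sketch below evaluates that expression; both the independence assumption and the function name are introduced here for illustration.

```python
import math


def combined_pvalue(sequence_pvalues):
    """P(product of n independent uniform values <= observed product)."""
    if not sequence_pvalues:
        raise ValueError("at least one sequence p-value is required")
    if any(not (0.0 < p <= 1.0) for p in sequence_pvalues):
        raise ValueError("p-values must lie in the interval (0, 1]")
    log_product = sum(math.log(p) for p in sequence_pvalues)
    term = 1.0   # (-ln x)^i / i! for i = 0
    total = 1.0
    for i in range(1, len(sequence_pvalues)):
        term *= -log_product / i
        total += term
    return min(1.0, math.exp(log_product) * total)


print(combined_pvalue([0.1, 0.1]))  # product is 0.01
```

With a single motif the sum has only its first term, so the combined p-value is the sequence p-value itself.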

E-value

The e-value of the match of a sequence in a database to a group of motifs is defined as the expected number of sequences in a random database of the same size that would match the motifs as well as the sequence does. It is equal to the combined p-value of the sequence times the number of sequences in the database.

Database and Motifs

This section shows information on the database that was searched and the motifs in the search query. The database section gives the date the database was last updated as well as the number of sequences and total sequence characters in it. The motifs are listed by motif number. The width of each motif and the subsequence that would be given the best possible score for it are shown. If there is more than one motif in the query, all pairwise correlations between the motifs are shown. The correlations can range from -1 to +1, with +1 meaning that the shorter motif is exactly identical to part or all of the longer motif. High correlations can cause some combined p-values and e-values to be inaccurate (too low). It may be advisable to remove enough motifs from the query to ensure that no pairs of motifs have high correlations. Any high correlations are indicated along with the suggestion that one of the motifs be removed from the query.

High-scoring Sequences

MAST lists the names and part of the descriptive text of all sequences whose e-value is less than E. Sequences shorter than one or more of the motifs are skipped. The sequences are sorted by increasing e-value. The value of E is set to 10 for the Web server but is user-selectable in the downloadable version of MAST.
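The selection rule can be sketched as follows, using the relation stated in the E-value section (e-value = combined p-value times the number of sequences in the database). The function name, its parameters and the sequence names in the example are hypothetical.

```python
def high_scoring(combined_pvalues, num_sequences, max_evalue=10.0):
    """Return (name, e-value) pairs with e-value < max_evalue, best first.

    combined_pvalues: dict mapping sequence name -> combined p-value
    num_sequences:    number of sequences in the database searched
    """
    hits = []
    for name, pvalue in combined_pvalues.items():
        evalue = pvalue * num_sequences
        if evalue < max_evalue:
            hits.append((name, evalue))
    return sorted(hits, key=lambda hit: hit[1])


# Hypothetical combined p-values for three sequences in a 100-sequence database.
example = {"seqA": 0.5, "seqB": 1e-6, "seqC": 0.001}
for name, evalue in high_scoring(example, num_sequences=100):
    print(name, evalue)
```

In this example `seqA` has an e-value of 50 and is dropped, while `seqB` is listed ahead of `seqC`.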

When nucleotide sequences are searched, the strand (+ or -) is indicated. When nucleotide sequences are searched with peptide motifs, the reading frame (a, b or c) of the best matches is also indicated. Matches are not all required to be in the same reading frame but must all be on the same strand.

Motif Diagrams

Motif diagrams show the order and spacing of non-overlapping matches to the motifs in each high-scoring sequence. Motif occurrences are determined based on the position p-value of matches to the motif. In the MOTIF DIAGRAMS section of the output, diagrams are shown like this:

            [Diagram: motif occurrences numbered 6, 4, 3, 5 and 7, in that order]

In the ANNOTATED SEQUENCES section of the output, diagrams are shown like this:

            27-[3]-44-<4>-99-[1]-7
            
In this notation, strong matches (p-value < M) are shown in square brackets (`[ ]'), weak matches (M < p-value < M × 10) are shown in angle brackets (`< >') and the length of non-motif sequence ("spacer") is shown between dashes (`-'). The example above shows an initial spacer of length 27, followed by a strong match to motif 3, a spacer of length 44, a weak match to motif 4, a spacer of length 99, a strong match to motif 1 and a final non-motif sequence of length 7. The value of M is 0.0001 for the Web server but is user-selectable in the downloadable version of MAST.

When nucleotide databases are searched, all matches must be on the same strand and the strand (+ or -) is indicated in the output. When peptide motifs are used to search nucleotide sequences, the reading frame (a, b or c) of each match is indicated next to the motif numbers in the motif diagrams found in the ANNOTATED SEQUENCES section of the output. For example,

            97-[6b]-17-[4a]-36-[3a]-45-[5a]-96-[7a]-59
            
shows that motif 6 matched in reading frame b while the other motif matches occurred in reading frame a.
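A diagram in this notation can be split back into its elements with a few lines of Python. The sketch below handles only the forms shown above (spacers, strong and weak matches, and an optional reading-frame letter); the function name and the tuple layout are hypothetical.

```python
import re

# Matches "[3]", "[6b]", "<4>", "<4a>" or a bare spacer length such as "27".
_ELEMENT = re.compile(
    r"\[(?P<strong>\d+)(?P<sframe>[abc]?)\]"
    r"|<(?P<weak>\d+)(?P<wframe>[abc]?)>"
    r"|(?P<spacer>\d+)"
)


def parse_diagram(diagram):
    """Return a list of (kind, number, frame) tuples.

    kind is "spacer", "strong" or "weak"; number is the spacer length or
    the motif number; frame is "a", "b", "c" or None.
    """
    elements = []
    for part in diagram.strip().split("-"):
        m = _ELEMENT.fullmatch(part)
        if m is None:
            raise ValueError(f"unrecognized diagram element: {part!r}")
        if m.group("spacer") is not None:
            elements.append(("spacer", int(m.group("spacer")), None))
        elif m.group("strong") is not None:
            frame = m.group("sframe") or None
            elements.append(("strong", int(m.group("strong")), frame))
        else:
            frame = m.group("wframe") or None
            elements.append(("weak", int(m.group("weak")), frame))
    return elements


for element in parse_diagram("97-[6b]-17-[4a]-36-[3a]-45-[5a]-96-[7a]-59"):
    print(element)
```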

Note: If you specify the -hit_list switch to MAST, the motif "diagram" takes the form of a comma-separated list of motif occurrences ("hits"). Each "hit" has the format:

                    <strand><motif> <start> <end> <p-value>
            where
                    <strand>  is the strand (+ or - for DNA, blank for protein),
                    <motif>   is the motif number,
                    <start>   is the starting position of the hit,
                    <end>     is the ending position of the hit, and
                    <p-value> is the position p-value of the hit.
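A hit list in this format can be read with a small parser such as the sketch below. The `Hit` type, the function name and the example string are hypothetical; the example was made up to match the format described above and is not taken from real MAST output.

```python
from typing import NamedTuple, Optional


class Hit(NamedTuple):
    strand: Optional[str]  # "+" or "-" for DNA, None for protein
    motif: int
    start: int
    end: int
    pvalue: float


def parse_hit_list(text):
    """Parse a comma-separated list of '<strand><motif> <start> <end> <p-value>'."""
    hits = []
    for item in text.split(","):
        fields = item.split()
        if not fields:
            continue  # tolerate a trailing comma or empty entry
        if len(fields) != 4:
            raise ValueError(f"expected 4 fields in hit: {item!r}")
        first = fields[0]
        strand = first[0] if first[0] in "+-" else None
        motif = int(first[1:] if strand else first)
        hits.append(
            Hit(strand, motif, int(fields[1]), int(fields[2]), float(fields[3]))
        )
    return hits


# Made-up example in the format described above.
for hit in parse_hit_list("+3 28 40 1.2e-07, -4 85 96 3.4e-05"):
    print(hit)
```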
            

Annotated Sequences

MAST annotates each high-scoring sequence by printing the sequence along with the position and strength of all the non-overlapping motif occurrences. The four lines above each motif occurrence contain, respectively,
  • the motif number of the occurrence,
  • the position p-value of the occurrence,
  • the best possible match to the motif, and
  • a plus sign (`+') above each letter in the occurrence that has a positive match score to the motif.
The best possible match to a motif is the sequence of letters which would achieve the highest match score.
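The last two annotation lines can be derived directly from a motif's scoring matrix. The sketch below uses the example matrix from the Match Scores section; the function names are hypothetical and the code illustrates the definitions above rather than MAST's own implementation.

```python
LETTERS = "ACGT"

# Example matrix from the Match Scores section; columns are A, C, G, T.
MATRIX = [
    (1.447, 0.188, -4.025, -4.095),
    (0.739, 1.339, -3.945, -2.325),
    (1.764, -3.562, -4.197, -3.895),
    (1.574, -3.784, -1.594, -1.994),
    (1.602, -3.935, -4.054, -1.370),
    (0.797, -3.647, -0.814, 0.215),
    (-1.280, 1.873, -0.607, -1.933),
    (-3.076, 1.035, 1.414, -3.913),
]


def best_possible_match(matrix=MATRIX):
    """Letter with the highest score in each row, joined into a string."""
    return "".join(LETTERS[row.index(max(row))] for row in matrix)


def plus_line(occurrence, matrix=MATRIX):
    """'+' under each letter whose score in its row is positive, else a space."""
    return "".join(
        "+" if row[LETTERS.index(letter)] > 0 else " "
        for letter, row in zip(occurrence, matrix)
    )


print(best_possible_match())
print(repr(plus_line("TGTTGGTG")))
```

For the occurrence `TGTTGGTG` from the Match Scores example, only the final G has a positive score, so the plus line is blank except for its last character.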

When peptide motifs are used to search nucleotide sequences, the reading frame (a, b or c) of each match is indicated with the motif number and the peptide translation of the matching sequence is shown just above the motif occurrence.

Sample MAST Search Results

Here is an actual MAST search results file of a search of a nucleotide database with peptide motifs. It has been edited slightly to reduce its size by removing most of the 832 sequences which matched the motifs.