MAST -- Motif Alignment and Search Tool

Motif search tool


MOTIF FORMAT

MAST can search using one or more motifs contained in
  • a MEME output file,
  • a GCG profile file,
  • two or more GCG profile files concatenated together, or
  • a file with the following format.
Motif file format

              ALPHABET= alphabet
              log-odds matrix: alength= alength w= w
              row_1
              row_2
              ...
              row_w  
            
  • A motif is represented by a position-dependent scoring matrix.
  • A scoring matrix is preceded by a line starting with the words log-odds matrix: and specifying alength, the length of the alphabet (number of columns in the scoring matrix), and w, the width of the motif (number of rows in the scoring matrix).
  • The following w lines (no blank lines allowed) contain the rows of the scoring matrix. Row i, column j of the matrix gives the score for the j-th letter in alphabet appearing at position i in an occurrence of the motif.
  • The spaces after the equals signs and the colon are required.
  • The number of letters in alphabet must equal alength.
  • Any number of additional motifs may follow the first one.
  • The motif file must contain a line starting with
                       ALPHABET= 
                
    followed by alphabet, a list containing the letters used in the motifs. The order of the letters in alphabet must be the same as the order of the columns of scores in the motifs. The order need not be alphabetical and case does not matter, but there should be no spaces in alphabet. The letters in alphabet must be a subset of either the IUB/IUPAC DNA (ABCDGHKMNRSTUVWY*-) or protein (ABCDEFGHIKLMNPQRSTUVWXYZ*-) alphabets. DNA alphabets must contain at least the letters ACGT. Protein alphabets must contain at least the letters ACDEFGHIKLMNPQRSTVWY. All other letters in the alphabets are optional. If any of the optional letters are missing from alphabet, MAST automatically generates scores for them by taking the weighted average of the scores for the letters which the missing letter could match. (The weights are the frequencies of the replaced letters in the appropriate non-redundant database.) Replacements for the optional letters are given in the following table.
    Letters matched by optional letters

    optional   matches
    letter     DNA     protein
    B          CGT     DN
    D          AGT
    H          ACT
    K          GT
    M          AC
    N          ACGT
    R          AG
    S          CG
    U          T       ACDEFGHIKLMNPQRSTVWY
    V          CAG
    W          AT
    X                  ACDEFGHIKLMNPQRSTVWY
    Y          CT
    Z                  EQ
    *          ACGT    ACDEFGHIKLMNPQRSTVWY
    -          ACGT    ACDEFGHIKLMNPQRSTVWY
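
The weighted-average rule for missing optional letters can be illustrated with a minimal Python sketch. It is not MAST's own code: the function name fill_optional_letter is hypothetical, the background frequencies in ASSUMED_FREQ are placeholder values (MAST takes its weights from a non-redundant database, and those values are not given here), and dividing by the sum of the weights is an assumption about what "weighted average" means in this context. Only the DNA column of the table is covered.

```python
# DNA replacements copied from the table above:
# optional letter -> letters it matches.
DNA_MATCHES = {
    "B": "CGT", "D": "AGT", "H": "ACT", "K": "GT", "M": "AC", "N": "ACGT",
    "R": "AG", "S": "CG", "U": "T", "V": "CAG", "W": "AT", "Y": "CT",
    "*": "ACGT", "-": "ACGT",
}

# ASSUMPTION: placeholder background frequencies for illustration only.
# MAST uses letter frequencies from a non-redundant database instead.
ASSUMED_FREQ = {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}


def fill_optional_letter(row, alphabet, letter,
                         matches=DNA_MATCHES, freq=ASSUMED_FREQ):
    """Return the score for `letter` at one motif position.

    `row` holds one score per letter of `alphabet`, in the same order.
    If `letter` is already a column of the matrix, its score is returned
    unchanged; otherwise the score is the frequency-weighted average of
    the scores of the letters it can match.
    """
    alphabet = alphabet.upper()
    letter = letter.upper()
    if letter in alphabet:
        return row[alphabet.index(letter)]
    matched = matches[letter]
    total_weight = sum(freq[m] for m in matched)
    weighted_sum = sum(freq[m] * row[alphabet.index(m)] for m in matched)
    return weighted_sum / total_weight


# One matrix row in ACGT order (made-up round numbers, not MAST output).
example_row = [1.0, 2.0, 3.0, 4.0]
score_for_r = fill_optional_letter(example_row, "ACGT", "R")  # R matches A or G
```

With these placeholder frequencies, the score for R is the average of the A and G scores weighted 0.3 to 0.2. A letter that matches a single base, such as U, simply inherits that base's score.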

EXAMPLE

Here is an example of a DNA motif file that contains two motifs.

Sample motif file


              ALPHABET= ACGT
              log-odds matrix: alength= 4 w= 9
               -4.275  -0.182  -4.195   1.408
               -4.296  -1.487   1.880  -0.816
               -2.160  -1.492  -4.171   1.474
               -0.810  -4.076   1.872  -2.164
                1.537  -1.487  -4.195  -4.205
                0.113   0.340  -0.237  -0.209
               -0.454   0.923   0.390  -0.834
               -1.336  -0.082   0.905   0.100
                0.674  -4.183   0.130  -0.201
              log-odds matrix: alength= 4 w= 6
               -2.032   0.324   1.371  -0.781
               -0.409   0.560  -0.250   0.119
               -4.274  -0.519  -0.260   1.167
               -2.188   2.300  -4.191  -2.465
                1.265  -4.111  -0.267  -2.180
               -1.977   2.158  -1.661  -2.071 
            

In the example above, because the order of the letters in alphabet is ACGT, the first column of each motif gives the scores for the letter A at each position in the motif, the second column gives the scores for C, and so forth.
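
To make the column-to-letter mapping concrete, here is a small Python sketch that reads a file in the format described above and sums the scores of one candidate site. It is an illustrative reader under stated assumptions, not MAST's parser: the names parse_motif_file and score_site are hypothetical, the reader relies on the required space after each equals sign, and scoring a site as the plain sum of its per-position scores is an assumption that follows from the row and column definition of the matrix. SAMPLE contains only the second motif of the sample file.

```python
SAMPLE = """\
ALPHABET= ACGT
log-odds matrix: alength= 4 w= 6
 -2.032   0.324   1.371  -0.781
 -0.409   0.560  -0.250   0.119
 -4.274  -0.519  -0.260   1.167
 -2.188   2.300  -4.191  -2.465
  1.265  -4.111  -0.267  -2.180
 -1.977   2.158  -1.661  -2.071
"""


def parse_motif_file(text):
    """Return (alphabet, motifs); each motif is a list of w score rows."""
    alphabet = None
    motifs = []
    lines = iter(text.splitlines())
    for line in lines:
        stripped = line.strip()
        if stripped.startswith("ALPHABET="):
            # Case does not matter, so normalise to upper case.
            alphabet = stripped.split("=", 1)[1].strip().upper()
        elif stripped.startswith("log-odds matrix:"):
            # The required spaces make "alength=" and "w=" separate tokens.
            fields = stripped.split()
            alength = int(fields[fields.index("alength=") + 1])
            w = int(fields[fields.index("w=") + 1])
            rows = []
            for _ in range(w):
                raw = next(lines, None)  # no blank lines allowed in a matrix
                if raw is None:
                    raise ValueError("matrix ended before w rows were read")
                row = [float(x) for x in raw.split()]
                if len(row) != alength:
                    raise ValueError("row length does not match alength")
                rows.append(row)
            motifs.append(rows)
    if alphabet is None:
        raise ValueError("no ALPHABET= line found")
    if any(len(m[0]) != len(alphabet) for m in motifs if m):
        raise ValueError("alength differs from the number of alphabet letters")
    return alphabet, motifs


def score_site(alphabet, matrix, site):
    """Sum matrix[i][j], where j is the column of the letter at position i."""
    if len(site) != len(matrix):
        raise ValueError("site length must equal the motif width")
    return sum(matrix[i][alphabet.index(ch)] for i, ch in enumerate(site.upper()))


alphabet, motifs = parse_motif_file(SAMPLE)
site_score = score_site(alphabet, motifs[0], "GCTCAC")
```

Here the site GCTCAC picks the G column in row 1, the C column in row 2, and so on, so site_score is the sum 1.371 + 0.560 + 1.167 + 2.300 + 1.265 + 2.158.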

