The MEME Suite

Motif-based sequence analysis tools

AME Tutorial

What AME does

Typical AME applications

Determining which motif(s) are enriched in a set of promoters of co-regulated genes.
Determining which motif(s) are relatively enriched in one set of promoters compared to another set of promoters.
Determining which motif(s) are enriched in a set of transcription factor ChIP-seq peaks (CentriMo may be more appropriate for this task).
Determining which type of kinase substrate motif is enriched in a set of proteins.

How AME works

AME determines the (relative) enrichment of each motif in its input separately and reports the significant ones. The first step is to scan each primary and control sequence with the motif, computing an odds score (not log-odds) for each position in each sequence. Next, for each sequence, AME combines the scores according to the chosen scoring method. By default, the website version uses the "Average odds score", which is just what it sounds like: the average odds score over all positions in the sequence where the motif fits. AME then sorts all the sequences (primary and control) according to their scores and applies a statistical test to determine whether the primary sequences have significantly larger scores. By default, the website version of AME uses the "Rank sum test".

Scan each sequence with a motif, computing one score per sequence.
Apply statistical test on the tendency of the primary sequences to have larger scores.
Report motifs with significant (adjusted) p-values.
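
The steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not AME's actual implementation: the function names are hypothetical, the motif is given as a list of per-position probability dictionaries, only the given strand is scanned, and the rank-sum p-value uses a one-sided normal approximation without a tie correction in the variance.

```python
import math

def odds_scores(seq, pwm, background):
    """Odds score (not log-odds) at every position where the motif fits."""
    width = len(pwm)
    scores = []
    for start in range(len(seq) - width + 1):
        score = 1.0
        for offset, column in enumerate(pwm):
            letter = seq[start + offset]
            score *= column[letter] / background[letter]
        scores.append(score)
    return scores

def average_odds_score(seq, pwm, background):
    """One score per sequence: the mean odds score over all fitting positions."""
    scores = odds_scores(seq, pwm, background)
    return sum(scores) / len(scores) if scores else 0.0

def rank_sum_p_value(primary_scores, control_scores):
    """One-sided p-value that primary scores tend to be larger (normal approximation)."""
    n1, n2 = len(primary_scores), len(control_scores)
    labelled = sorted(
        [(s, True) for s in primary_scores] + [(s, False) for s in control_scores],
        key=lambda pair: pair[0],
    )
    rank_sum = 0.0
    i = 0
    while i < len(labelled):
        j = i
        while j < len(labelled) and labelled[j][0] == labelled[i][0]:
            j += 1
        mean_rank = (i + 1 + j) / 2.0  # tied scores share the average of ranks i+1..j
        rank_sum += mean_rank * sum(1 for k in range(i, j) if labelled[k][1])
        i = j
    u = rank_sum - n1 * (n1 + 1) / 2.0
    mean_u = n1 * n2 / 2.0
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    z = (u - mean_u) / sd_u
    return 0.5 * math.erfc(z / math.sqrt(2.0))  # upper-tail probability

# Toy two-column motif favouring "AC", with a uniform background (both made up).
pwm = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},
]
background = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
primary = ["ACACGT", "TTACAC", "ACGGAC", "GACACA"]
control = ["GGTTGG", "TTGGTT", "GTGTGT", "TGTGTG"]
p = rank_sum_p_value(
    [average_odds_score(s, pwm, background) for s in primary],
    [average_odds_score(s, pwm, background) for s in control],
)
print(p)
```

A small p-value here means the primary sequences tend to rank above the controls for this motif. In a real run the same test is repeated for every motif, and the p-values are then adjusted for multiple testing.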

Sequence and Motif Alphabets

AME supports DNA, RNA, protein and custom sequence alphabets. The alphabet is specified in the motif file, and the sequences being searched must be compatible with that alphabet. (DNA motifs can be used to search RNA sequences and vice-versa.) If the motif alphabet is "complementable" (e.g., DNA-like alphabets), AME scores both strands of the sequences. If you have DNA motifs and wish to search only the given strand of each sequence, you can edit the alphabet section of the motif file to contain

ALPHABET = ACGU

rather than

ALPHABET = ACGT

You can also override the alphabet specified in the motif file with an alphabet that contains all the core symbols of the motif alphabet and that may contain additional core symbols. The motifs will be expanded to match the new alphabet, with 0's filling in the probabilities for the new symbols (prior to applying pseudocounts).
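
To make the expansion concrete, here is a small sketch of what "filling in with 0's" could look like. The function name `expand_motif` and the extra symbol `m` are hypothetical, and the pseudocount step that follows is left out because its details are not described here.

```python
def expand_motif(columns, motif_alphabet, new_alphabet):
    """Re-express motif columns in new_alphabet, giving new symbols probability 0."""
    missing = [s for s in motif_alphabet if s not in new_alphabet]
    if missing:
        raise ValueError(f"new alphabet lacks motif symbols: {missing}")
    index = {symbol: i for i, symbol in enumerate(motif_alphabet)}
    return [
        [column[index[s]] if s in index else 0.0 for s in new_alphabet]
        for column in columns
    ]

# One motif column over ACGT, expanded to a made-up alphabet with one extra symbol.
print(expand_motif([[0.7, 0.1, 0.1, 0.1]], "ACGT", "ACGTm"))
```

The check at the top mirrors the requirement that the overriding alphabet contain every core symbol of the motif alphabet.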

Primary sequence set

These sequences should all be in the same sequence alphabet. Their lengths may vary. They can, for example, be a set of promoters thought to be co-regulated, a set of ChIP-seq regions or a set of proteins thought to be phosphorylated by one or more kinases.

Control sequence set

AME detects motifs enriched in the primary sequences relative to these sequences. If you don't provide a control set, the website version of AME will create one by copying the primary sequences and shuffling the letters within each sequence. The shuffling preserves 2-mer frequencies in each sequence individually. The primary and control sequences should have similar length distributions; otherwise AME's reported p-values may not be accurate.
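
The text does not say how the 2-mer-preserving shuffle is done. One standard way to shuffle a sequence while keeping its exact 2-mer counts is a random Euler walk in the style of Altschul and Erickson; the sketch below uses that approach as an assumption, and the function name is hypothetical. It is not necessarily the procedure the website uses.

```python
import random
from collections import Counter, defaultdict

def shuffle_preserving_2mers(seq, rng):
    """Return a random rearrangement of seq with identical 2-mer counts."""
    if len(seq) < 3:
        return seq
    first, last = seq[0], seq[-1]
    successors = defaultdict(list)  # letter -> letters that follow it in seq
    for a, b in zip(seq, seq[1:]):
        successors[a].append(b)
    vertices = [v for v in successors if v != last]
    # Pick a final outgoing edge for every letter except the last one, and
    # retry until following those edges always leads to the last letter.
    while True:
        last_edge = {v: rng.choice(successors[v]) for v in vertices}
        valid = True
        for v in vertices:
            seen, u = set(), v
            while u != last and u not in seen:
                seen.add(u)
                u = last_edge[u]
            if u != last:
                valid = False
                break
        if valid:
            break
    # Shuffle the remaining edges of each letter and keep the chosen edge last.
    order = {}
    for v, succ in successors.items():
        rest = list(succ)
        if v != last:
            rest.remove(last_edge[v])
            rng.shuffle(rest)
            rest.append(last_edge[v])
        else:
            rng.shuffle(rest)
        order[v] = rest
    # Walk the edges in the chosen order, starting from the first letter.
    used = {v: 0 for v in order}
    out, current = [first], first
    for _ in range(len(seq) - 1):
        nxt = order[current][used[current]]
        used[current] += 1
        out.append(nxt)
        current = nxt
    return "".join(out)

def two_mer_counts(s):
    return Counter(zip(s, s[1:]))

rng = random.Random(0)
original = "ACGTACGGTTACGCATTAGC"
shuffled = shuffle_preserving_2mers(original, rng)
print(shuffled, two_mer_counts(shuffled) == two_mer_counts(original))
```

Because each control sequence is a rearrangement of one primary sequence, this kind of control set automatically has the same length distribution as the primary set.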

FAQ

How big should the control sequence set be?

For AME it doesn't hurt at all if the control set is larger than the primary set. Intuitively, if the primary set is small, a much larger control set gives the test more power, so it might allow you to detect motifs whose enrichment is weaker.

Basically, AME works by sorting the sequences by some motif score. Imagine the extreme case where you have just one positive sequence. If it sorts to the top, above all N control sequences, the p-value of that event is 1/(N+1). That gets smaller as you increase N, which you can do by increasing the number of control sequences. (But they should all be the same length as the positive sequence in that case!)
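
As a quick worked example of the 1/(N+1) argument, the snippet below tabulates the best achievable p-value for a single positive sequence, reading N as the number of control sequences. The function name is hypothetical.

```python
from fractions import Fraction

def top_rank_p_value(n_control):
    """Chance that one positive sequence outranks all n_control controls by luck."""
    return Fraction(1, n_control + 1)

for n_control in (9, 99, 999):
    print(n_control, float(top_rank_p_value(n_control)))
# prints: 9 0.1, then 99 0.01, then 999 0.001
```

With only 9 controls, even a perfect ranking cannot reach a p-value below 0.1, which is why a larger control set helps when the primary set is small.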

For this reason, the fasta-shuffle-letters script included in the downloadable MEME Suite has a -copies option, which lets you create N shuffled copies of each input sequence.

As a rule of thumb, if your primary dataset has fewer than 500 sequences you might want to use enough control sequences to bring the total to 1000.
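
The rule of thumb translates into a number of shuffled copies per primary sequence. The helper below is a hypothetical convenience for that arithmetic; it assumes the control set is built entirely from shuffled copies of the primary sequences and that at least one copy is always made.

```python
import math

def copies_needed(n_primary, target_total=1000):
    """Shuffled copies per primary sequence so that primary + control >= target_total."""
    if n_primary <= 0:
        raise ValueError("n_primary must be positive")
    return max(1, math.ceil((target_total - n_primary) / n_primary))

for n_primary in (50, 100, 300, 600):
    copies = copies_needed(n_primary)
    print(n_primary, copies, n_primary * (1 + copies))
```

For example, 100 primary sequences need 9 copies each, giving 900 control sequences and a total of 1000.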

I'm not finding any enriched motifs; should I use a smaller set of motifs?

The more motifs you have in your database, the more of a multiple testing issue you face. If you can eliminate motifs that are irrelevant to your experiment, that reduces the multiple testing problem. The enrichment of each motif is considered separately, so there is no particular advantage to scanning with motifs that you know are not relevant. Of course, if you are not getting any significant enrichment (as judged by the p-value before adjustment), the multiple testing issue is moot. However, reducing the number of motifs in the library also shortens the time needed for AME to run, which may still make it advantageous!
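
The effect of library size on the adjusted p-value can be illustrated with a simple Bonferroni-style correction. This is an assumption made for illustration; the exact adjustment AME applies is not described in this section, and the function name is hypothetical.

```python
def bonferroni_adjust(p_value, n_motifs):
    """Multiply the raw p-value by the number of motifs tested, capped at 1."""
    return min(1.0, p_value * n_motifs)

raw_p = 1e-4
for n_motifs in (50, 1000):
    adjusted = bonferroni_adjust(raw_p, n_motifs)
    print(n_motifs, adjusted, adjusted < 0.05)
```

The same raw p-value passes a 0.05 threshold with a 50-motif library but not with a 1000-motif library, which is the multiple testing cost of scanning with irrelevant motifs.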