AME determines the (relative) enrichment of each motif in its input separately and reports the significant ones. The first step is to scan each primary and control sequence with the motif, computing an odds score (not log-odds) for each position in each sequence. Next, for each sequence, AME combines the scores according to the chosen scoring method. By default, the website version uses the "Average odds score", which is just what it sounds like: the average odds score over all positions in the sequence where the motif fits. AME then sorts all the sequences (primary and control) according to their scores, and applies a statistical test to determine whether the primary sequences have significantly larger scores. By default, the website version of AME uses the "Rank sum test".
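The scan, combine and test steps described above can be sketched as follows. This is a minimal illustration, not AME's implementation: the function names, the toy motif and the uniform background are all hypothetical, only the given strand is scored, and the rank sum p-value uses a plain normal approximation without a tie correction.

```python
from math import erfc, sqrt


def average_odds_score(seq, pwm, background):
    """Average odds score over every position where the motif fits."""
    width = len(pwm)
    scores = []
    for start in range(len(seq) - width + 1):
        odds = 1.0
        for offset, column in enumerate(pwm):
            letter = seq[start + offset]
            # Odds (not log-odds): motif probability over background probability.
            odds *= column[letter] / background[letter]
        scores.append(odds)
    # A sequence shorter than the motif has no valid position.
    return sum(scores) / len(scores) if scores else 0.0


def rank_sum_p_value(primary_scores, control_scores):
    """One-sided p-value that primary scores tend to be larger.

    Assumption: normal approximation to the rank sum statistic,
    average ranks for ties, no tie correction.
    """
    labelled = sorted(
        [(s, 1) for s in primary_scores] + [(s, 0) for s in control_scores]
    )
    ranks = [0.0] * len(labelled)
    i = 0
    while i < len(labelled):
        j = i
        while j + 1 < len(labelled) and labelled[j + 1][0] == labelled[i][0]:
            j += 1
        for k in range(i, j + 1):
            ranks[k] = (i + j) / 2 + 1  # average 1-based rank of the tie group
        i = j + 1
    n1, n2 = len(primary_scores), len(control_scores)
    rank_sum = sum(r for r, (_, is_primary) in zip(ranks, labelled) if is_primary)
    u = rank_sum - n1 * (n1 + 1) / 2
    mean = n1 * n2 / 2
    sd = sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u - mean) / sd
    return 0.5 * erfc(z / sqrt(2))  # upper tail of the standard normal


# Hypothetical two-column motif that prefers "AC", with a uniform background.
toy_pwm = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},
]
uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}

primary = ["ACACAC", "TTACGG", "ACGTAC"]
control = ["GGGTTT", "TGTGTG", "CCGGTT"]
primary_scores = [average_odds_score(s, toy_pwm, uniform) for s in primary]
control_scores = [average_odds_score(s, toy_pwm, uniform) for s in control]
print(rank_sum_p_value(primary_scores, control_scores))
```

With so few sequences the normal approximation is crude; the sketch is only meant to make the order of the steps concrete.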
AME supports DNA, RNA, protein and custom sequence alphabets. The alphabet is specified in the motif file, and the sequences being searched must be compatible with that alphabet. (DNA motifs can be used to search RNA sequences and vice-versa.) If the motif alphabet is "complementable" (e.g., DNA-like alphabets), AME scores both strands of the sequences. If you have DNA motifs and wish to search only the given strand of each sequence, you can edit the alphabet section of the motif file to contain
ALPHABET = ACGU
instead of
ALPHABET = ACGT
These sequences should all be in the same sequence alphabet. Their lengths may vary. They can, for example, be a set of promoters thought to be co-regulated, a set of ChIP-seq regions, or a set of proteins thought to be phosphorylated by one or more kinases.
AME detects motifs enriched in the primary sequences relative to these sequences. If you don't provide a control set, the website version of AME will create one by copying the primary sequences and shuffling the letters within each sequence. The shuffling preserves 2-mer frequencies in each sequence individually. It is advisable that the primary and control sequences have similar length distributions; otherwise AME's reported p-values may not be accurate.
For AME it doesn't hurt at all if the control set is larger than the primary set. Intuitively, if the primary set is small, having a much larger control set might allow you to detect motifs that are less strongly enriched.
Basically, AME works by sorting the sequences by some motif score. Imagine the extreme case where you have just one positive sequence. If it sorts to the top of N sequences, the p-value of that event is 1/(N+1). That gets smaller as you increase N, which you can do by increasing the number of control sequences. (But they should all be the same length as the positive sequence in that case!)
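The arithmetic of this extreme case can be made concrete with a short sketch. It assumes that N counts the control sequences, so that under the null hypothesis each of the N+1 sequences is equally likely to sort to the top; the function name is hypothetical.

```python
from fractions import Fraction


def top_rank_p_value(num_controls):
    """Chance that a single primary sequence outranks all controls by luck.

    Assumption: N = num_controls, and all N + 1 orderings of the top
    position are equally likely under the null hypothesis.
    """
    return Fraction(1, num_controls + 1)


# The best achievable p-value shrinks as control sequences are added.
for n in (9, 99, 999):
    print(n, float(top_rank_p_value(n)))
```

With only 9 controls, even a perfect ranking cannot reach a p-value below 0.1, which is why a larger control set helps when the primary set is tiny.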
For the above reason, the fasta-shuffle-letters script included with the MEME Suite when you download it has the -copies option, which lets you create N shuffled copies of each input sequence.
As a rule of thumb, if your primary dataset has fewer than 500 sequences, you might want to use enough control sequences to bring the total to 1000.
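One way to turn this rule of thumb into a value for the -copies option is sketched below. The helper names and the file names are hypothetical, and the argument order (input file, then output file) is an assumption to check against the script's own usage message. The command is only assembled, not run.

```python
from math import ceil


def copies_needed(num_primary, target_total=1000):
    """Shuffled copies per primary sequence so that primary plus control
    sequences reach at least target_total (at least one copy)."""
    if num_primary <= 0:
        raise ValueError("num_primary must be positive")
    missing = target_total - num_primary
    return max(1, ceil(missing / num_primary))


def shuffle_command(fasta_in, fasta_out, copies):
    """Assemble (but do not run) a fasta-shuffle-letters call.

    Assumption: positional arguments are the input file followed by
    the output file; only the -copies option comes from the text above.
    """
    return ["fasta-shuffle-letters", "-copies", str(copies), fasta_in, fasta_out]


copies = copies_needed(200)  # 200 primary sequences
print(" ".join(shuffle_command("primary.fa", "control.fa", copies)))
```

Because the number of copies is rounded up, the total can overshoot 1000 slightly, which is harmless given that a larger control set does not hurt.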
The more motifs you have in your database, the more of a multiple testing issue you face. If you can eliminate motifs that are irrelevant to your experiment, that reduces the multiple testing problem. The enrichment of each motif is considered separately, so there is no particular advantage to scanning with motifs that you know are not relevant. Of course, if you are not getting any significant enrichment (as judged by the p-value before adjustment), the multiple testing issue is moot. However, reducing the number of motifs in the library also shortens the time needed for AME to run, which may still make it advantageous!
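The effect of library size can be illustrated with a generic Bonferroni correction. This is only a sketch of the multiple testing idea; it is not claimed to be the adjustment AME itself applies, and the function name and numbers are hypothetical.

```python
def bonferroni_adjust(p_value, num_motifs):
    """Generic Bonferroni correction: multiply by the number of tests, cap at 1.

    Assumption: used here purely for illustration, not as AME's own formula.
    """
    return min(1.0, p_value * num_motifs)


raw_p = 0.0001
for library_size in (50, 500, 5000):
    print(library_size, bonferroni_adjust(raw_p, library_size))
```

The same raw p-value stays comfortably significant against a small, focused library but loses significance as the number of motifs tested grows.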