The MEME Suite

Motif-based sequence analysis tools

FIMO Tutorial

What FIMO does

Typical FIMO applications

Determining all the positions where a transcription factor (TF) motif matches in one or more promoter sequences.
Predicting exactly where a TF binds in each of its ChIP-seq peaks using a known motif (from one of the databases provided) or a de novo motif discovered by a motif discovery algorithm such as MEME, STREME or DREME.
Determining all the positions where a protein motif matches in one or more protein sequences.

How FIMO works

FIMO converts each input motif into a log-odds PSSM and uses each PSSM to independently scan each input sequence. It reports all positions in each sequence that match a motif with a statistically significant log-odds score. You can control the match p-value that is considered significant, and whether or not FIMO reports matches on both strands when the sequence alphabet is complementable (e.g., DNA or RNA).
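The scanning step described above can be illustrated with a minimal sketch. It is not FIMO's implementation: the motif values, the uniform background, the pseudocount scheme and the function names are all assumptions made for illustration, and the sketch stops at raw log-odds scores, whereas FIMO goes on to convert scores to p-values and reports only the significant matches.

```python
import math

# Hypothetical 5-column DNA motif resembling TATAA (made-up probabilities).
MOTIF = [
    {"A": 0.05, "C": 0.05, "G": 0.05, "T": 0.85},
    {"A": 0.85, "C": 0.05, "G": 0.05, "T": 0.05},
    {"A": 0.05, "C": 0.05, "G": 0.05, "T": 0.85},
    {"A": 0.85, "C": 0.05, "G": 0.05, "T": 0.05},
    {"A": 0.85, "C": 0.05, "G": 0.05, "T": 0.05},
]
# Assumed uniform background; a real analysis would use measured frequencies.
BACKGROUND = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def to_log_odds(prob_matrix, background, pseudocount=0.01):
    """Convert per-column letter probabilities into log2-odds scores."""
    pssm = []
    for column in prob_matrix:
        scores = {}
        for letter, bg in background.items():
            # A small background-weighted pseudocount avoids log(0).
            p = (column.get(letter, 0.0) + pseudocount * bg) / (1.0 + pseudocount)
            scores[letter] = math.log2(p / bg)
        pssm.append(scores)
    return pssm


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def scan(sequence, pssm, both_strands=True):
    """Score every window; return (1-based start, strand, score) tuples.

    Assumes the sequence contains only the letters A, C, G and T.
    """
    width = len(pssm)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        candidates = [("+", window)]
        if both_strands:
            candidates.append(("-", reverse_complement(window)))
        for strand, site in candidates:
            score = sum(col[base] for col, base in zip(pssm, site))
            hits.append((start + 1, strand, score))
    return hits


pssm = to_log_odds(MOTIF, BACKGROUND)
hits = scan("GGCTATAAGC", pssm)
best = max(hits, key=lambda hit: hit[2])
print(best)
```

Each motif would be scanned independently in this way, so the results for one motif do not influence those for another.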

Sequence and Motif Alphabets

FIMO supports DNA, RNA, protein and custom sequence alphabets. The alphabet is specified in the motif file, and the sequences being searched must be compatible with that alphabet. (DNA motifs can be used to search RNA sequences and vice-versa.)

Sequence Database

The sequences to be searched should all use the same sequence alphabet. Their lengths may vary. They can, for example, be a set of promoters thought to be co-regulated, a set of ChIP-seq regions or the proteome of an organism.

FAQ

When should I use FIMO rather than MAST?

MAST is intended for searching "short" sequences such as proteins, whereas FIMO is more appropriate for sequences of any length. Also, MAST reports a single score for each scanned sequence that combines the best match to each of the motifs in its input. So MAST is most useful when you need a way to rank your sequences based on how well each sequence matches all (or many) of your motifs. FIMO does not combine match scores in any way, and reports all matches separately, so it is useful when you want to know exactly where the signals your motifs describe are located in each input sequence.

Does it make sense to scan an entire genome with transcription factor motifs using FIMO?
This is probably a bad idea for several reasons, and it may be preferable to focus on shorter regions such as promoters, enhancers or ChIP-seq peak regions. Also, if you have them, FIMO can make use of additional sources of information, like DNase hypersensitivity, to help distinguish chance matches from the biologically significant ones. More information about that is available in the FIMO documentation about using position-specific priors.

The first problem with scanning a genome is that genomes are very large and transcription factor affinities are not very specific. Motifs typically only have 8 bases or fewer of specificity, and every 8-mer is expected to occur tens of thousands of times in a mammalian-sized genome (there are only 4^8 = 65,536 distinct 8-mers).

Another way of viewing this problem is that when you use FIMO to scan a eukaryotic genome you have a significant multiple testing problem. FIMO's default match p-value threshold is 0.0001, and scanning a eukaryotic genome would apply that test billions of times. That means you are going to get hundreds of thousands of matches entirely by chance. These will most likely completely overwhelm the biologically significant matches. When scanning something like the mouse genome, for example, you would have to apply a very stringent match threshold, around 1e-10, to reduce the false positive rate to something reasonable.
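The arithmetic behind this can be sketched as follows. The genome length used here (about 2.7 billion bases, a round figure assumed for mouse) and the function name are illustrative assumptions, and the calculation ignores the motif width, which slightly reduces the number of scanned positions.

```python
def expected_chance_matches(sequence_length, p_threshold, both_strands=True):
    """Expected number of matches arising purely by chance.

    Each position on each scanned strand is treated as one independent
    test at the given p-value threshold.
    """
    strands = 2 if both_strands else 1
    return sequence_length * strands * p_threshold


GENOME_LENGTH = 2_700_000_000  # assumed approximate size, for illustration only

for threshold in (1e-4, 1e-10):
    expected = expected_chance_matches(GENOME_LENGTH, threshold)
    print(f"p-value threshold {threshold:g}: about {expected:,.2f} chance matches")
```

Under these assumptions the default threshold yields chance matches in the hundreds of thousands, while a threshold of 1e-10 brings the expected number below one.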

You can reduce the multiple testing problem somewhat by using the command line version of FIMO and using a q-value threshold rather than a p-value threshold. The q-value is similar to a p-value but corrected for multiple testing. Like a p-value, a q-value of 0.01 or less is a reasonable threshold for significance. However, to compute q-values, FIMO has to hold all the matches in memory. When scanning a full genome this can easily require tens of GB of memory. To avoid causing memory problems, FIMO limits the number of results it will hold in memory to 100,000. When using the command line version this can be adjusted using the --max-stored-scores option. Each time FIMO reaches the maximum allowed number of stored scores, it will decrease the p-value threshold until it has eliminated at least half the scores.
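To make the idea of a q-value concrete, here is a minimal sketch of the Benjamini-Hochberg style of adjustment. This is a standard procedure chosen for illustration and is an assumption here; FIMO's own q-value estimation may differ in its details. The function name is hypothetical. Note that the calculation needs the whole list of p-values at once, which is the reason the matches have to be held in memory.

```python
def bh_q_values(p_values):
    """Return Benjamini-Hochberg adjusted values, in the input order.

    The adjusted value for the p-value of rank r (1 = smallest) among m
    tests is the minimum of p * m / rank over all ranks >= r.
    """
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    q_values = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so the minimum can be carried along.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        q_values[i] = running_min
    return q_values


example_p = [0.01, 0.04, 0.03, 0.005]
for p, q in zip(example_p, bh_q_values(example_p)):
    print(f"p = {p:g}  q = {q:.4f}")
```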

To fix the memory problem, you can use the --text option. This turns off the computation of q-values entirely, and directs FIMO to print out each match as it occurs. You could use this to get a full list at your desired p-value threshold. However, this does not fix the multiple testing problem. Unless you set a stringent p-value threshold, the biologically significant matches are going to be overwhelmed by the matches that are statistically significant due to chance. Note that short motifs pose a particular problem, because even a perfect match may not be strongly statistically significant, so that a threshold that is stringent enough to eliminate the chance matches will end up eliminating all matches.

Does it make sense to scan all the promoters in a genome with transcription factor motifs using FIMO?
It might, but only if your definition of "promoter" is shorter than 1000bp.

By default, FIMO uses a p-value threshold of 0.0001, and scans both DNA strands, so that means you will expect about one "match" every five promoters simply by chance, which might be a tolerable false-positive rate (20%), depending on what you intend to do next. The (approximate) formula is: #false-positives/promoter = (2 * promoter length) * p-value threshold.

If your promoters are longer than 1000bp, you would need to decrease the p-value threshold to compensate. For 5000bp promoters, you would need to divide the p-value threshold by five, giving a threshold of 0.00002. The problem is, for some transcription factor motifs, the best possible match to the motif is not significant at this level. Hence the advice to limit FIMO searches with TF motifs to promoter regions no longer than 1000bp.
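The approximate formula above, and its inversion to find a compensating threshold, can be written as a short sketch. The function names are hypothetical, and the target rate of 0.2 chance matches per promoter is simply the rate implied by the 1000bp example.

```python
def false_positives_per_promoter(promoter_length, p_threshold, strands=2):
    """Approximate expected chance matches per promoter."""
    return strands * promoter_length * p_threshold


def threshold_for_rate(promoter_length, target_rate, strands=2):
    """p-value threshold that gives the target chance-match rate."""
    return target_rate / (strands * promoter_length)


# 1000bp promoters at the default threshold: about 0.2 chance matches each.
print(false_positives_per_promoter(1000, 1e-4))

# 5000bp promoters: the threshold needed to keep the same rate, about 2e-5.
print(threshold_for_rate(5000, 0.2))
```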

A more extensive discussion of these multiple testing issues can be found in "How does multiple testing correction work?" by W. Noble (Nature Biotechnology 27:1135-7, 2009). The CTCF example in Figure 1 in that paper was generated using FIMO.

Should I use a background model with FIMO?
Yes, definitely.

If you are using FIMO via the web and select one of the sequence databases provided, FIMO will use a background model based on the base (or residue) frequencies in the sequences you select.

If you run FIMO on the command line, you need to create an appropriate background model and provide it to FIMO. You can do this with the script fasta-get-markov. If your input sequences are diverse and contain several bases/residues or more, you can simply run that script on your input sequences to create a background model. Alternatively, you can create a background model in the same way from a large set of randomly selected sequence regions. Those sequences should be biologically similar to the sequences you intend to scan with FIMO.
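As an illustration of what the simplest kind of background model contains, the sketch below estimates single-letter frequencies from a set of sequences. It is not a replacement for fasta-get-markov: it does not produce that script's output format, does not build higher-order models, and may treat strands and ambiguous letters differently. The function name and the example sequences are hypothetical.

```python
from collections import Counter


def zero_order_background(sequences, alphabet="ACGT"):
    """Estimate letter frequencies, ignoring characters outside the alphabet."""
    counts = Counter()
    for seq in sequences:
        counts.update(ch for ch in seq.upper() if ch in alphabet)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no letters from the alphabet were found")
    return {letter: counts[letter] / total for letter in alphabet}


# Made-up sequences; the ambiguous base "n" is skipped.
background = zero_order_background(["AACG", "ttgn"])
for letter, frequency in background.items():
    print(f"{letter} {frequency:.3f}")
```

Frequencies estimated from only a handful of bases, as in this toy input, would be unreliable in practice, which is why the input should be large and diverse.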