FIMO converts each input motif into a log-odds PSSM and uses each PSSM to independently scan each input sequence. It reports all positions in each sequence that match a motif with a statistically significant log-odds score. You can control the match p-value that is considered significant, and whether or not FIMO reports matches on both strands when the sequence alphabet is complementable (e.g., DNA or RNA).
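As a rough illustration of what scanning with a log-odds PSSM involves, here is a minimal Python sketch. It is not FIMO's code: the function names, the pseudocount value, the base-2 logarithm and the uniform background are all assumptions chosen for the example, and it omits the conversion of scores to p-values as well as reverse-strand scanning.

```python
import math

def log_odds_pssm(freqs, background, pseudocount=0.1):
    """Convert per-position letter frequencies into log2-odds scores.

    The pseudocount (an assumed value) avoids taking log(0) for letters
    that never occur at a motif position.
    """
    pssm = []
    for column in freqs:
        total = sum(column.values()) + pseudocount
        scores = {}
        for letter, bg in background.items():
            # Spread the pseudocount in proportion to the background.
            p = (column.get(letter, 0.0) + pseudocount * bg) / total
            scores[letter] = math.log2(p / bg)
        pssm.append(scores)
    return pssm

def scan(sequence, pssm):
    """Yield (start, score) for every window of the motif's width."""
    width = len(pssm)
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        yield start, sum(col[ch] for col, ch in zip(pssm, window))

# Toy 3-column DNA motif and a uniform background (both made up).
background = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
motif = [
    {"A": 0.9, "C": 0.0, "G": 0.1, "T": 0.0},
    {"A": 0.0, "C": 0.0, "G": 0.0, "T": 1.0},
    {"A": 0.0, "C": 0.0, "G": 1.0, "T": 0.0},
]
pssm = log_odds_pssm(motif, background)
hits = list(scan("CCATGCC", pssm))
best_start, best_score = max(hits, key=lambda h: h[1])
```

Each motif is scored against each window independently, which is why every position above the threshold is reported as a separate match. The sketch does not handle letters outside the alphabet, such as N.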
FIMO supports DNA, RNA, protein and custom sequence alphabets. The alphabet is specified in the motif file, and the sequences being searched must be compatible with that alphabet. (DNA motifs can be used to search RNA sequences and vice-versa.)
These sequences should all be in the same sequence alphabet. Their lengths may vary. They can, for example, be a set of promoters thought to be co-regulated, a set of ChIP-seq regions or the proteome of an organism.
MAST is intended for searching "short" sequences such as proteins, whereas FIMO is more appropriate for sequences of any length. Also, MAST reports a single score for each scanned sequence that combines the best match to each of the motifs in its input. So MAST is most useful when you need a way to rank your sequences based on how well each sequence matches all (or many) of your motifs. FIMO does not combine match scores in any way, and reports all matches separately, so it is useful when you want to know exactly where the signals your motifs describe are located in each input sequence.
This is probably a bad idea for several reasons, and it may be preferable to focus on shorter regions such as promoters, enhancers or ChIP-seq peak regions. Also, if you have them, FIMO can make use of additional sources of information, such as DNase hypersensitivity data, to help distinguish chance matches from the biologically significant ones. More information about that is available in the FIMO documentation about using position-specific priors.
The first problem with scanning a genome is that genomes are very large and transcription factor affinities are not very specific. Motifs typically have only 8 bases or fewer of specificity, and all 8-mers occur many millions of times in a eukaryotic genome.
Another way of viewing this problem is that when you use FIMO to scan a eukaryotic genome you have a significant multiple testing problem. FIMO's default match p-value threshold is 0.0001, and scanning a eukaryotic genome would apply that test billions of times. That means you are going to get hundreds of thousands of matches entirely by chance. These will most likely completely overwhelm the biologically significant matches. When scanning something like the mouse genome, for example, you would have to apply a very stringent match threshold, such as 1e-10, to reduce the false positive rate to something reasonable.
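The scale of the problem can be estimated with a little arithmetic. The sketch below assumes one test per position per strand and a genome of about 2.7 billion bases (an illustrative, roughly mouse-sized figure, not a value taken from FIMO); the function name is hypothetical.

```python
def expected_chance_matches(sequence_length, p_threshold, both_strands=True):
    """Expected number of matches arising purely by chance.

    Assumes one independent test per position on each scanned strand.
    """
    strands = 2 if both_strands else 1
    return strands * sequence_length * p_threshold

genome_length = 2_700_000_000  # assumed genome size, for illustration only

at_default = expected_chance_matches(genome_length, 1e-4)
at_stringent = expected_chance_matches(genome_length, 1e-10)

print(f"p < 1e-4 : about {at_default:,.0f} chance matches")
print(f"p < 1e-10: about {at_stringent:.2f} chance matches")
```

Under these assumptions the default threshold yields several hundred thousand chance matches per motif, whereas a threshold near 1e-10 brings the expected number below one.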
You can reduce the multiple testing problem somewhat by using the command line version of FIMO and using a q-value threshold rather than a p-value threshold. The q-value is similar to a p-value but corrected for multiple testing. Like a p-value, a q-value of 0.01 or less is a reasonable threshold for significance. However, to compute q-values, FIMO has to hold all the matches in memory. When scanning a full genome this can easily require tens of GB of memory. To avoid causing memory problems, FIMO limits the number of results it will hold in memory to 100,000. When using the command line version this can be adjusted using the --max-stored-scores option. Each time FIMO reaches the maximum allowed number of stored scores, it will decrease the p-value threshold until it has eliminated at least half the scores.
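To see why q-values require all matches to be held at once, consider this minimal sketch of a Benjamini-Hochberg style calculation. It assumes that procedure for illustration and may differ in detail from what FIMO does internally; the function name and the n_tests parameter (the total number of positions tested, not just the matches kept) are hypothetical.

```python
def bh_qvalues(pvalues, n_tests=None):
    """Benjamini-Hochberg style q-values for a list of match p-values."""
    n = n_tests if n_tests is not None else len(pvalues)
    # Every p-value must be ranked against all the others, which is why
    # none of them can be discarded until the scan has finished.
    order = sorted(range(len(pvalues)), key=lambda i: pvalues[i])
    qvalues = [0.0] * len(pvalues)
    running_min = 1.0
    # Walk from the largest p-value down so each q-value is the minimum
    # of p * n / rank over its own rank and all larger ranks.
    for rank in range(len(order), 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * n / rank)
        qvalues[i] = running_min
    return qvalues

# Four made-up matches found among 2000 tested positions.
match_pvalues = [1e-7, 5e-5, 2e-6, 9e-5]
qvalues = bh_qvalues(match_pvalues, n_tests=2000)
```

Because each q-value depends on the rank of its p-value among all matches, the calculation cannot be done one match at a time, unlike a plain p-value threshold.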
To fix the memory problem, you can use the --text option. This turns off the computation of q-values entirely, and directs FIMO to print out each match as it occurs. You could use this to get a full list at your desired p-value threshold. However, this does not fix the multiple testing problem. Unless you set a stringent p-value threshold, the biologically significant matches are going to be overwhelmed by the matches that are statistically significant due to chance. Note that short motifs pose a particular problem, because even a perfect match may not be strongly statistically significant, so that a threshold that is stringent enough to eliminate the chance matches will end up eliminating all matches.
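The short-motif problem can be made concrete with a simplified bound. The sketch below assumes a uniform DNA background (each base with probability 0.25) and a motif so specific that only one word achieves the top score; under those assumptions the best attainable p-value is 0.25 raised to the motif width. Real motifs are more degenerate, so their best p-values are no better than this. The function names are hypothetical.

```python
def best_possible_pvalue(width, background_prob=0.25):
    """Smallest p-value a motif of this width can reach.

    Assumes a uniform background and a single top-scoring word.
    """
    return background_prob ** width

def min_width_for_threshold(threshold, background_prob=0.25):
    """Shortest motif whose perfect match can pass the given threshold."""
    width = 1
    while best_possible_pvalue(width, background_prob) > threshold:
        width += 1
    return width

for width in (6, 8, 10):
    print(width, f"{best_possible_pvalue(width):.2e}")

width_for_default = min_width_for_threshold(1e-4)
width_for_genome = min_width_for_threshold(1e-10)
```

Under these assumptions a 6-base motif cannot reach even the default threshold of 0.0001, and a threshold of 1e-10 is out of reach for any motif shorter than 17 fully specified bases.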
It might, but only if your definition of "promoter" is shorter than 1000bp.
By default, FIMO uses a p-value threshold of 0.0001 and scans both DNA strands, so with 1000bp promoters you should expect about one "match" for every five promoters simply by chance. That might be a tolerable false-positive rate (20%), depending on what you intend to do next. The (approximate) formula is: #false-positives/promoter = (2 * promoter length) * p-value threshold.
If your promoters are longer than 1000bp, you would need to decrease the p-value threshold to compensate. For 5000bp promoters, you would need to divide the p-value threshold by five, giving a threshold of 0.00002. The problem is, for some transcription factor motifs, the best possible match to the motif is not significant at this level. Hence the advice to limit FIMO searches with TF motifs to promoter regions no longer than 1000bp.
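The approximate formula can also be turned around to give the p-value threshold needed for a chosen false-positive rate. This is only a sketch of that arithmetic; the function name and its parameters are hypothetical and not part of FIMO.

```python
def pvalue_threshold_for(target_per_sequence, sequence_length, both_strands=True):
    """P-value threshold giving the target number of chance matches
    per sequence, using the approximate formula
    false positives = strands * length * threshold.
    """
    strands = 2 if both_strands else 1
    return target_per_sequence / (strands * sequence_length)

# Keep roughly one chance match per five promoters (0.2 per promoter).
threshold_1000bp = pvalue_threshold_for(0.2, 1000)
threshold_5000bp = pvalue_threshold_for(0.2, 5000)

print(f"1000bp promoters: {threshold_1000bp:.0e}")
print(f"5000bp promoters: {threshold_5000bp:.0e}")
```

For 1000bp promoters this recovers the default threshold of 0.0001, and for 5000bp promoters it gives 0.00002, matching the figures above.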
A more extensive discussion of these multiple testing issues can be found in "How does multiple testing correction work?" by W. Noble (Nature Biotechnology 27:1135-1137, 2009). The CTCF example in Figure 1 in that paper was generated using FIMO.
Yes, definitely.
If you are using FIMO via the web and select one of the sequence databases provided, FIMO will use a background model based on the base (or residue) frequencies in the sequences you select.
If you run FIMO on the command line, you need to create and provide FIMO with an appropriate background model. You can do this with the script fasta-get-markov. If your input sequences are diverse and contain several bases/residues or more, you can simply run that script on your input sequences to create a background model. Alternatively, you can create a background model in the same way from a large set of randomly selected sequence regions. Those sequences should be biologically similar to the sequences you intend to scan with FIMO.
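To make the idea of a background model concrete, here is a minimal Python sketch that computes zero-order (single-letter) frequencies from a set of sequences. It is not the fasta-get-markov implementation and does not write a background file; the function name is hypothetical, and the sequences are toy data.

```python
from collections import Counter

def zero_order_background(sequences, alphabet="ACGT"):
    """Estimate single-letter frequencies from a set of sequences.

    Letters outside the alphabet (for example N) are ignored.
    """
    counts = Counter()
    for seq in sequences:
        counts.update(ch for ch in seq.upper() if ch in alphabet)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no alphabet letters found in the input sequences")
    return {letter: counts[letter] / total for letter in alphabet}

# Toy input; a real background needs far more sequence than this.
background = zero_order_background(["ACGTACGT", "GGCCGGCC", "nnAT"])
for letter, freq in background.items():
    print(letter, f"{freq:.3f}")
```

A background estimated from too little or unrepresentative sequence will distort the log-odds scores, which is why the sequences used should resemble those you intend to scan.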