tom-tom
Usage:
tom-tom [options] -query <query motifs> -target <target motifs>
Description:
The tom-tom program searches one or more query motifs against a database of target motifs, and reports for each query a list of target motifs, ranked by E-value.
For a given pair of motifs, the program considers all offsets, while requiring a minimum number of overlapping positions. For a given offset, each overlapping position is scored using one of seven column similarity functions defined below. In order to compute the scores, tom-tom needs to know the frequencies of the letters of the sequence alphabet in the database being searched (the "background" letter frequencies). By default, the background letter frequencies included in the MEME input files are used. The scores of columns that overlap for a given offset are summed. This summed score is then converted to an E-value. The reported E-value is the minimal E-value over all possible offsets.
Input:
- <query motifs> - A file containing one or more motifs in MEME format. Each of these motifs will be searched against the target database.
- <target motifs> - A file containing one or more motifs in MEME format.
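The offset search outlined in the Description can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the tom-tom implementation: motifs are plain lists of columns of letter frequencies, the column score is the Pearson correlation coefficient (one of the seven functions named under -dist), the function names and the sign convention for the offset are hypothetical, and the best offset is picked by the highest summed score because the conversion to E-values is not reproduced here.

```python
import math

def pearson_column_score(a, b):
    """Pearson correlation between two motif columns (letter frequencies)."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0  # assumption: a flat column is scored as uncorrelated
    return cov / math.sqrt(var_a * var_b)

def best_offset(query, target, min_overlap=5):
    """Scan all offsets; return (summed score, offset, overlap) or None."""
    # A query shorter than min_overlap uses its own width as the requirement.
    required = min(min_overlap, len(query))
    best = None
    # Assumed convention: offset is the target index aligned with query column 0.
    for offset in range(-(len(query) - 1), len(target)):
        start = max(0, offset)
        end = min(len(target), offset + len(query))
        overlap = end - start
        if overlap < required:
            continue
        # Sum the column scores over the overlapping positions only.
        score = sum(
            pearson_column_score(query[t - offset], target[t])
            for t in range(start, end)
        )
        if best is None or score > best[0]:
            best = (score, offset, overlap)
    return best
```

With a three-column query and the default min_overlap of 5, the required overlap drops to 3, so only offsets that place the whole query inside the target are scored.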
Output:
tom-tom prints to standard output a tab-delimited text file. The file begins with a header, indicated by leading "#" characters. This is followed by a single title line, and then the actual values. The columns are:
- Query motif name
- Target motif name
- Optimal offset: the offset between the query and the target motif
- E-value
- Overlap: the number of positions of overlap between the two motifs.
- Query consensus sequence.
- Target consensus sequence.
- Orientation: Orientation of target motif with respect to query motif.
The E-value is the expected number of times that a similarity this strong would be observed by chance in a target database of random motifs. The output contains results for each query, in the order that the queries appear in the input file. With respect to each query, targets are ranked by E-value.
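As a rough illustration of how this output might be consumed, the following Python sketch reads the tab-delimited text into dictionaries. It assumes the layout described above: "#" header lines, one title line that does not itself begin with "#", then one row per match with the eight columns in the listed order. The field names and the function name are hypothetical labels chosen for this example; they are not taken from the actual title line.

```python
import csv

# Column order as listed above; the names themselves are hypothetical labels.
FIELDS = [
    "query", "target", "offset", "evalue", "overlap",
    "query_consensus", "target_consensus", "orientation",
]

def parse_tomtom_output(text):
    """Parse '#' header lines, one title line, then tab-delimited rows."""
    # Drop blank lines and the '#' header block.
    lines = [ln for ln in text.splitlines()
             if ln.strip() and not ln.startswith("#")]
    rows = []
    # lines[0] is the title line; the remaining lines hold the values.
    for record in csv.reader(lines[1:], delimiter="\t"):
        row = dict(zip(FIELDS, record))
        row["offset"] = int(row["offset"])
        row["evalue"] = float(row["evalue"])
        row["overlap"] = int(row["overlap"])
        rows.append(row)
    return rows
```

Because targets are already ranked by E-value within each query, the first row seen for a given query is its best match.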
Options:
-ethresh <value>
- Only report E-values below the specified threshold (Default = 1).
-min-overlap <value>
- Only report motif matches that overlap by this many positions or more. If a query motif is shorter than the value of min-overlap, then the width of that motif is used as the required min-overlap for that query. The default value is 5.
-internal
- This parameter forces the shorter motif to be completely contained in the longer motif.
-dist <pearson|chi|allr|fish|kullback|rms|sandelin>
- These values correspond to Pearson correlation coefficient (PCC), Pearson chi-square test (PCST), average log-likelihood ratio (ALLR), Fisher-Irwin exact test (FIET), Kullback-Leibler divergence (KLD), Euclidean distance (ED) and Sandelin-Wasserman function (SW). Detailed descriptions of these functions can be found in the implementation specifications for tom-tom. FIET is not feasible for large tables, therefore it is not allowed for protein motifs.
-pseudocount <float>
- This option adds a pseudocount to each entry in the given counts matrix for the motif. By default, if the distance metric is chi, allr or kullback, then the pseudocount value is 1.0. Otherwise, the default value is 0. If the given motifs already contain pseudocounts, then use -pseudocount 0.
Bugs: none known.
Future enhancements:
- -bg-file <file> - This parameter would allow the user to provide a file containing the background letter frequencies in background model format. Note that this format allows "order-n" Markov models to be specified, but tom-tom uses only the single-letter frequencies from the file, ignoring any higher-order frequencies present there.
- -bg-weight <b> - Add b times each background frequency to the corresponding letter counts in order to compute the distribution vector fb. This reduces small sample biases.
- Allow additional input formats (JASPAR, TRANSFAC, FASTA alignments, etc.).
- Produce XML/HTML output.
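The arithmetic behind the proposed -bg-weight option can be illustrated with a short Python sketch. The option is not implemented, so this is purely a reading of the one-line description above: b times each background frequency is added to the matching letter count. Dividing by the total so that the vector sums to 1 is an assumption made here, as is the function name.

```python
def distribution_with_background(counts, background, b):
    """Add b * background[i] to counts[i], then normalize (assumed step)."""
    adjusted = [c + b * bg for c, bg in zip(counts, background)]
    total = sum(adjusted)
    # Normalizing to a probability vector is an assumption of this sketch.
    return [a / total for a in adjusted]
```

For a column with counts 8, 0, 0, 0, a uniform background and b = 4, the adjusted counts are 9, 1, 1, 1, so no letter ends up with a frequency of exactly zero.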
Authors: Shobhit Gupta (shobhitg@u.washington.edu), Charles E. Grant (cegrant@gs.washington.edu), William Noble (noble@gs.washington.edu)