motiph [options] <alignment file> <tree file>
<motif file>
Scan a DNA multiple sequence alignment for occurrences of given
motifs, taking into account the phylogenetic tree relating the
sequences. Any sequences containing nothing but gaps are removed from
the alignment.
motiph
supports several evolutionary models that it uses to compute
the score for aligning the motif to each possible region of the
multiple alignment.
Read about the algorithm for more details.
The name of a file containing a DNA multiple alignment in ClustalW format. Alternatively, if the --list option is used, this file may contain a list of alignment files. In this case each of the alignment files will be scanned for each of the motifs.
The name of a file containing a phylogenetic tree in Phylip Newick format. This tree may contain additional species not represented in the alignment.
The name of a file containing MEME formatted motifs.
Output is created as a table in HTML (.html) and text (.txt) formats. Each line in the table represents a match of the motif to a "window" in the alignment. Matches to the motif and its reverse complement are listed separately. The output columns contain the motif name, the alignment name, the start position of the match in the alignment, the end position of the match, the match score, the match p-value, the match q-value, and the matched portion of the first sequence in the alignment, respectively.
Output is also created in GFF format (.gff) and in CISML (.xml) format.
Option | Parameter | Description | Default Behavior | |||||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
General Options | ||||||||||||||||||||||||||||||
--bg | mutation rate | The mutation rate for sites in the background model. | A background mutation rate of 1 is used. | |||||||||||||||||||||||||||
--bfile | background file | Read background frequencies from
background file. The file should be in
MEME background file format.
If the argument is the keyword motif-file , then the frequencies will be taken from
the motif file. If both strands are being scored, the background model
will be modified by averaging the frequencies of letters and their reverse complements. |
Use frequencies embedded in the application from an old version of the NCBI non-redundant DNA database. | |||||||||||||||||||||||||||
--fg | mutation rate | The mutation rate for sites in the foreground model(s). | A foreground mutation rate of 1 is used. | |||||||||||||||||||||||||||
--gap | skip|fixed|wildcard|minimum|model | Specifies the gap handling strategy.
|
Behaves as if --gap skip had been specified. | |||||||||||||||||||||||||||
--gap-cost | cost | Specifies the costs for gaps when using the fixed gap handling strategy. | A gap cost of zero is used. | |||||||||||||||||||||||||||
--hb | Use the Halpern-Bruno modification to the evolutionary model. | |||||||||||||||||||||||||||||
--list | Treat the second required input as a list of alignments, rather than a single alignment. | The second required argument is treated as a single alignment. | ||||||||||||||||||||||||||||
--max-stored-scores | storage | Set the maximum number of scores that will be stored. Keeping a complete list of scores may exceed available memory. Once the number of stored scores reaches the maximum allowed, the least significant 50% of scores will be dropped. In this case, the list of reported motifs may be incomplete and the q-value calculation will be approximate. | The maximum number of stored matches is set to 100,000. | |||||||||||||||||||||||||||
--model | single|average|jc|k2|f81|f84|hky|tn | The evolutionary model to use.
|
Behaves as if --model f81 was specified. | |||||||||||||||||||||||||||
--motif | ID | Use only the motif identified by ID. This option may be repeated. | All motifs are used. | |||||||||||||||||||||||||||
--no-qvalue | Do not compute a q-value for each p-value. The q-value calculation is that of Benjamini and Hochberg (1995). | The program computes q-values. | ||||||||||||||||||||||||||||
--norc | Do not score the reverse complement DNA strand. | Both strands are scored. | ||||||||||||||||||||||||||||
--output-pthresh | p-value threshold | The p-value threshold for displaying search results. If the p-value of a match is greater than this value, then the match will not be printed. If both the --output-pthresh and --output-qthresh options appear on the command line, whichever appears later on the command line will be applied. | The p-value threshold is set to 1e-4. | |||||||||||||||||||||||||||
--output-qthresh | q-value threshold | The q-value threshold for displaying search results. If the q-value of a match is greater than this value, then the match will not be printed. | No q-value threshold is applied. | |||||||||||||||||||||||||||
--pseudocounts | pseudocount | A pseudocount to be added to each count in the motif matrix, weighted by the background frequencies for the nucleotides (Dirichlet prior), before converting the motif to probabilities. | A pseudocount of 0.1 is added. | |||||||||||||||||||||||||||
--pur-pyr | ratio | The ratio of the purine transition rate to pyrimidine transition rate. This parameter is used by the Tamura-nei model. | The ratio is set to 1.0. | |||||||||||||||||||||||||||
--seed | n | Seed for random number generator. | The seed is set to the current time. | |||||||||||||||||||||||||||
--text | Limits output to plain text sent to standard out. This mode allows the program to search an arbitrarily large database, because results are not stored in memory. | |||||||||||||||||||||||||||||
--transition-transversion | ratio | The ratio of the transition rate to the transversion rate. This parameter is used by the Kimura 2-parameter, F84, HKY, and Tamura-nei models. | The ratio is set to 0.5. |
The algorithm is as follows:
A = a multiple alignment T = a phylogenetic tree M = a motif frequency matrix B = the background frequencies of the four bases in A U = a background evolutionary model with equilibrium frequencies B // Build evolutionary models. for j in positions of M { E[j] = an evolutionary model with equilibrium frequencies from the jth position of M } // Build matrix of all possible scores for i in all possible alignment columns { for j in positions of M { foreground = site_likelihood(E[j], A[:][i], T) background = site_likelihood(U, A[:][i] T) score_matrix[i][j] = log_2(foreground / background) } } // Calculate p-values of all possible scores for motif sized windows // windows in the alignment. pvalues = get_pv_lookup(score_matrix, B) // Calculate score for each motif sized window in the alignment. for i in columns of A { score = 0 for j in positions of M { index = calculate the index of A[:][i + j] in the array of all possible alignment columns score = score + score_matrix[index][j] } print pvalues[score] }
The core of the algorithm is a routine (site_likelihood
)
for scoring a particular column of the multiple alignment using a
given evolutionary model and a given phylogenetic tree. The alignment
site provides the observed nucleotides at the base of the tree, and we
sum over the unobserved nucleotides in the rest of the tree,
conditioning on the equilibrium distribution from the evolutionary
model at the root of the tree (Felsenstein Pruning Algorithm). The
tree must be a maximum likelihood tree, of the kind generated by DNAml
from Phylip or by FastDNAml. Branch lengths in the tree are converted
to conditional probabilities using the specified evolutionary
model.