motiph

Usage:

motiph [options] <alignment file> <tree file> <motif file>

Description

Scan a DNA multiple sequence alignment for occurrences of given motifs, taking into account the phylogenetic tree relating the sequences. Any sequences containing nothing but gaps are removed from the alignment. motiph supports several evolutionary models that it uses to compute the score for aligning the motif to each possible region of the multiple alignment. Read about the algorithm for more details.

Input

<alignment file>

The name of a file containing a DNA multiple alignment in ClustalW format. Alternatively, if the --list option is used, this file may contain a list of alignment files. In this case each of the alignment files will be scanned for each of the motifs.

<tree file>

The name of a file containing a phylogenetic tree in Phylip Newick format. This tree may contain additional species not represented in the alignment.

<motif file>

The name of a file containing MEME formatted motifs.

Output

Output is created as a table in HTML (.html) and text (.txt) formats. Each line in the table represents a match of the motif to a "window" in the alignment. Matches to the motif and its reverse complement are listed separately. The output columns contain the motif name, the alignment name, the start position of the match in the alignment, the end position of the match, the match score, the match p-value, the match q-value, and the matched portion of the first sequence in the alignment, respectively.

Output is also created in GFF format (.gff) and in CISML (.xml) format.

Options

Option Parameter Description Default Behavior
General Options
--bgmutation rate The mutation rate for sites in the background model. A background mutation rate of 1 is used.
--bfilebackground file Read background frequencies from background file. The file should be in MEME background file format. If the argument is the keyword motif-file, then the frequencies will be taken from the motif file. If both strands are being scored, the background model will be modified by averaging the frequencies of letters and their reverse complements. Use frequencies embedded in the application from an old version of the NCBI non-redundant DNA database.
--fgmutation rate The mutation rate for sites in the foreground model(s). A foreground mutation rate of 1 is used.
--gapskip|​fixed|​wildcard|​minimum|​model Specifies the gap handling strategy.
ValueName
skip Skip those sites where any position in the alignment window contains a gap.
fixed Sites containing gaps are assigned a fixed score, specified by --gap-cost.
wildcard The gap character matches any base, and the score is the product of the corresponding probabilities.
minimum The gap character is assigned the score corresponding to the least likely letter at the given position.
model Use model-specific gap handling. Currently, the only model that supports this is f81_gap.
Behaves as if --gap skip had been specified.
--gap-costcost Specifies the costs for gaps when using the fixed gap handling strategy. A gap cost of zero is used.
--hb Use the Halpern-Bruno modification to the evolutionary model.
--list Treat the second required input as a list of alignments, rather than a single alignment. The second required argument is treated as a single alignment.
--max-stored-scoresstorage Set the maximum number of scores that will be stored. Keeping a complete list of scores may exceed available memory. Once the number of stored scores reaches the maximum allowed, the least significant 50% of scores will be dropped. In this case, the list of reported motifs may be incomplete and the q-value calculation will be approximate. The maximum number of stored matches is set to 100,000.
--modelsingle|​average|​jc|​k2|​f81|​f84|​hky|​tn The evolutionary model to use.
ValueNameDescription
singleSingle Score score first sequence: compute standard log-odds score of first sequence in the alignment; ignores tree but does NOT remove gaps.
averageAverage Score compute average of standard log-odds score of aligned sites.
jcJukes-Cantor equilibrium base frequencies are all 1/4; the only free parameter is the mutation rate.
k2Kimura 2-parameter equilibrium base frequencies are all 1/4; the free parameters are the mutation rate and the transition/transversion rate ratio.
f81Felsenstein 1981 equilibrium base frequencies are taken from the alignment; the only free parameter is the mutation rate.
f84Felsenstein 1984 equilibrium base frequencies are taken from the alignment; the free parameters are the mutation rate and the transition/transversion rate ratio. The ratio of purine-purine to pyrimidine->pyrimidine transitions is assumed to be 1.
hkyHasegawa-Kishino-Yano equilibrium base frequencies are taken from the alignment; the free parameters are the mutation rate and the transition/transversion rate ratio. The ratio of purine-purine to pyrimidine-pyrimidine transitions is assumed to be equal to the ratio of purines to pyrimidines.
tnTamura-Nei equilibrium base frequencies are taken from the alignment; the free parameters are the mutation rate, the transition/transversion rate ratio, and the ratio of purine-purine transitions to pyrimidine-pyrimidine transitions.
A description of the f81 model is available in chapter 13 of Statistical Methods in Bioinformatics by Ewens and Grant. The other models are described in chapters 9 and 13 of Inferring Phylogenies by Felsenstein.
Behaves as if --model f81 was specified.
--motifID Use only the motif identified by ID. This option may be repeated. All motifs are used.
--no-qvalue Do not compute a q-value for each p-value. The q-value calculation is that of Benjamini and Hochberg (1995). The program computes q-values.
--norc Do not score the reverse complement DNA strand. Both strands are scored.
--output-pthreshp-value threshold The p-value threshold for displaying search results. If the p-value of a match is greater than this value, then the match will not be printed. If both the --output-pthresh and --output-qthresh options appear on the command line, whichever appears later on the command line will be applied. The p-value threshold is set to 1e-4.
--output-qthreshq-value threshold The q-value threshold for displaying search results. If the q-value of a match is greater than this value, then the match will not be printed. No q-value threshold is applied.
--pseudocountspseudocount A pseudocount to be added to each count in the motif matrix, weighted by the background frequencies for the nucleotides (Dirichlet prior), before converting the motif to probabilities. A pseudocount of 0.1 is added.
--pur-pyrratio The ratio of the purine transition rate to pyrimidine transition rate. This parameter is used by the Tamura-nei model. The ratio is set to 1.0.
--seedn Seed for random number generator. The seed is set to the current time.
--text Limits output to plain text sent to standard out. This mode allows the program to search an arbitrarily large database, because results are not stored in memory.
--transition-transversionratio The ratio of the transition rate to the transversion rate. This parameter is used by the Kimura 2-parameter, F84, HKY, and Tamura-nei models. The ratio is set to 0.5.

Algorithm

The algorithm is as follows:

A = a multiple alignment
T = a phylogenetic tree
M = a motif frequency matrix
B = the background frequencies of the four bases in A
U = a background evolutionary model with equilibrium frequencies B

// Build evolutionary models.
for j in positions of M {
  E[j] = an evolutionary model with equilibrium frequencies 
    from the jth position of M
}

// Build matrix of all possible scores
for i in all possible alignment columns {
  for j in positions of M {
    foreground = site_likelihood(E[j], A[:][i], T)
    background = site_likelihood(U, A[:][i] T)
    score_matrix[i][j] = log_2(foreground / background)
  }    
}
// Calculate p-values of all possible scores for motif sized windows
// windows in the alignment.
pvalues = get_pv_lookup(score_matrix, B)

// Calculate score for each motif sized window in the alignment.
for i in columns of A {
  score = 0
  for j in positions of M {
    index = calculate the index of A[:][i + j] in the array
      of all possible alignment columns
    score = score + score_matrix[index][j]
  }    
  print pvalues[score]
}
      

The core of the algorithm is a routine (site_likelihood) for scoring a particular column of the multiple alignment using a given evolutionary model and a given phylogenetic tree. The alignment site provides the observed nucleotides at the base of the tree, and we sum over the unobserved nucleotides in the rest of the tree, conditioning on the equilibrium distribution from the evolutionary model at the root of the tree (Felsenstein Pruning Algorithm). The tree must be a maximum likelihood tree, of the kind generated by DNAml from Phylip or by FastDNAml. Branch lengths in the tree are converted to conditional probabilities using the specified evolutionary model.