ama [options] <motif file> <sequence file> [<background file>]
The name AMA stands for "Average Motif Affinity". The program scores a set of sequences given a binding motif, treating each position in the sequence as a possible binding event. The score is calculated by averaging the likelihood ratio scores for all feasible binding events to the given sequence (and to its reverse strand for complementable alphabets). The binding strength at each potential site is defined as the likelihood ratio of the site under the motif versus under a zero-order background model provided by the user.
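The averaging described above can be sketched in a few lines of Python. This is an illustrative sketch only, not AMA's actual implementation: the toy motif, the uniform background, and the function names are all hypothetical. It also assumes that sites containing ambiguous characters are simply left out of the average.

```python
from math import prod

# Hypothetical toy inputs: a 2-column motif and a uniform zero-order background.
MOTIF = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
]
BACKGROUND = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(site):
    """Reverse complement of a site made only of unambiguous letters."""
    return "".join(COMPLEMENT[c] for c in reversed(site))


def site_odds(site, motif, background):
    """Likelihood ratio of one site: P(site | motif) / P(site | background)."""
    return prod(col[c] / background[c] for col, c in zip(motif, site))


def average_odds(seq, motif, background, both_strands=True):
    """Average the likelihood ratio over every scorable site in seq."""
    width = len(motif)
    scores = []
    for start in range(len(seq) - width + 1):
        site = seq[start:start + width]
        if any(c not in background for c in site):
            continue  # assumption: sites with ambiguous characters are skipped
        scores.append(site_odds(site, motif, background))
        if both_strands:
            scores.append(site_odds(reverse_complement(site), motif, background))
    return sum(scores) / len(scores) if scores else 0.0


print(average_odds("ACGTAGGA", MOTIF, BACKGROUND))
```

Passing `both_strands=False` to this sketch mimics scoring the given strand only.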
By default, AMA reports the average motif affinity score. It can also report p-values, which are estimated analytically using the given zero-order background model or using the GC-content of each sequence. The GC-content options are restricted to alphabets with 4 symbols in 2 complementary pairs, like DNA.
AMA can also compute the sequence-dependent likelihood ratio score used by Clover. The denominator of this score depends on the sequence being scored, and is the likelihood of the site under a Markov model derived from the sequence itself. Unlike Clover, AMA also allows higher-order sequence-derived Markov models (see --sdbg option below).
If the input file contains more than one motif, the motifs will be processed consecutively.
Note: AMA does not score sequence positions that contain ambiguous characters.
Full details are given in the supplement to the GOMO paper:
The name of a file containing motifs in MEME Motif format. Each motif may be no wider than 300 positions.
The name of a file containing sequences in FASTA format.
(Optional in some cases) The name of a file containing a 0-order Markov model in background model format, such as that produced by fasta-get-markov. If both strands are being scored, the background model is modified by averaging the frequencies of letters and their reverse complements.
Note: This file is required unless --sdbg is specified.
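As a small illustration of the strand averaging described above, the following sketch averages each letter's frequency with that of its complement. The function name and the example frequencies are made up for illustration, and AMA's own code may differ in detail.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def symmetrize_background(freqs, complement=COMPLEMENT):
    """Average each letter's frequency with its complement's frequency."""
    return {c: (freqs[c] + freqs[complement[c]]) / 2 for c in freqs}


# Hypothetical single-strand frequencies (they sum to 1).
bg = {"A": 0.35, "C": 0.15, "G": 0.25, "T": 0.25}
sym = symmetrize_background(bg)
print(sym)  # A and T now share one value, as do C and G
```

After averaging, the model assigns the same probability to a site and to its reverse complement, which is what scoring both strands requires.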
AMA writes CisML format to standard output, unless you specify one of --o or --oc. In that case, the --o-format option (if given) is ignored and two output files are written to the directory you specify. The files are ama.xml in CisML format, and ama.txt in (almost) GFF2 format.
AMA's version of the GFF2 format uses the "sequence strand" field (field 7) to hold the p-value of the sequence:
<sequence_name> ama sequence 1 <sequence_length> <sequence_score> <sequence_p-value> . <motif_id>
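A line in this layout could be read with a sketch like the one below. It assumes nine whitespace-separated fields, none of which contains whitespace, and a numeric p-value. The record type, the function name, and the sample line are all hypothetical.

```python
from typing import NamedTuple


class AmaRecord(NamedTuple):
    sequence_name: str
    length: int
    score: float
    p_value: float
    motif_id: str


def parse_ama_line(line):
    """Parse one line of the GFF2-like output described above."""
    fields = line.split()
    if len(fields) != 9:
        raise ValueError(f"expected 9 fields, got {len(fields)}")
    name, _source, _feature, _start, end, score, p_value, _frame, motif_id = fields
    # Field 7 (normally the strand in GFF2) holds the sequence p-value here.
    return AmaRecord(name, int(end), float(score), float(p_value), motif_id)


# Made-up example line, not real AMA output.
record = parse_ama_line("seq1\tama\tsequence\t1\t500\t12.5\t0.003\t.\tmotif1")
print(record)
```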
Option | Parameter | Description | Default Behaviour |
---|---|---|---|
General Options | |||
--sdbg | n | Use a sequence-dependent Markov model of order n when computing likelihood ratios. A different sequence-dependent Markov model is computed for each sequence in the input and used to compute the likelihood ratio of all sites in that sequence. This option overrides --pvalues, --gcbins, and --rma. | The background file is required and is used to compute the likelihood ratio for all sites in all sequences. |
--motif | id | Use only the motif identified by id. This option may be repeated. | All motifs are used. |
--motif-pseudo | float | A pseudocount to be added to each count in the motif matrix, after first multiplying by the corresponding background frequency. | A pseudocount of 0.01 is applied. |
--norc | | Do not score the reverse complement strand (when using a complementable alphabet). | All strands are scored. |
--scoring | avg-odds\|max-odds | Indicates whether the average or the maximum likelihood ratio (odds) score should be calculated. If max-odds is chosen, no p-value will be printed. | Average score will be calculated. |
--rma | | Scale the motif affinity score by the maximum achievable score for each motif. This is termed the Relative Motif Affinity score. This allows for direct comparison between different motifs. | Affinity scores are not scaled. |
--pvalues | | Print the p-value of the average odds score in the output file. The p-value for a score is normally computed (but see --gcbins) assuming the sequences were each generated by the 0-order Markov model specified by the background file frequencies. This option is ignored if max-odds scoring is used. | No p-value will be printed. |
--gcbins | bins | Compensate p-values for the complementary pair content (also known as GC content) of each sequence independently. This is done by computing the score distributions for a range of complementary pair frequency values. Using 41 bins (recommended) computes distributions at intervals of 2.5% GC content. The computation assumes that, within each complementary pair (i.e. A & T, and G & C for the DNA alphabet), the two letters occur in a 1:1 ratio. This assumption will fail if a sequence contains far more of a letter than its complement. This option sets the --pvalues option. This option is ignored if max-odds scoring is used. | Uncompensated p-values are printed. |
--cs | | Enables combining of sequences with the same identifier by taking the average score and the Sidak corrected p-value: 1−(1−α)^(1/n). Different sequences with the same identifier are used in GOMO databases if one gene in the reference species has more than one homologous gene in the related species (one-to-many relationship). | Sequences are processed independently of each other. |
--o-format | gff\|cisml | Set the output file format. | CisML output format is used. |
--max-seq-length | max | Set the maximum length allowed for input sequences to max. | The maximum allowed input sequence length is 250000000. |
--last | n | Use only the scores of (up to) the last n sequence positions to compute AMA. If the sequence is shorter than this value, the entire sequence is scored. If the motif is longer than this value, it will not be scored. | The full sequence is scored. |
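
The arithmetic behind the recommended --gcbins value of 41 can be checked with a short sketch: 41 evenly spaced values from 0% to 100% GC content are 2.5% apart. Assigning a sequence to its nearest grid point, and the function names used here, are assumptions for illustration. The table above does not say how AMA maps a sequence to a bin.

```python
def gc_bin_values(bins):
    """GC percentages at which score distributions would be computed."""
    step = 100.0 / (bins - 1)
    return [i * step for i in range(bins)]


def gc_fraction(seq):
    """Fraction of unambiguous letters in seq that are G or C."""
    seq = seq.upper()
    counted = sum(seq.count(c) for c in "ACGT")
    gc = seq.count("G") + seq.count("C")
    return gc / counted if counted else 0.0


def nearest_bin(seq, bins=41):
    """Index of the grid point closest to the sequence's GC content (assumed rule)."""
    return round(gc_fraction(seq) * (bins - 1))


values = gc_bin_values(41)
print(values[:3])               # first few grid points, in percent GC
print(nearest_bin("GGCCAATT"))  # index of the grid point nearest to 50% GC
```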