ama [options] <motif file> <sequence file> [<background file>]
The name AMA stands for "Average Motif Affinity". The program scores a set of sequences given a binding motif, treating each position in the sequence as a possible binding event. The score is calculated by averaging the likelihood ratio scores for all feasible binding events to the given sequence (and to its reverse strand for complementable alphabets). The binding strength at each potential site is defined as the likelihood ratio of the site under the motif versus under a zero-order background model provided by the user.
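The averaging described above can be illustrated with a minimal Python sketch. This is not AMA's implementation: the motif representation (a list of per-position probability tables), the function names `site_odds` and `average_odds`, and the toy inputs are all assumptions made for illustration, and the sketch ignores pseudocounts, p-values and numerical safeguards.

```python
from typing import Dict, List

# Hypothetical representation: one probability table per motif column.
Motif = List[Dict[str, float]]

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def site_odds(site: str, motif: Motif, background: Dict[str, float]) -> float:
    """Likelihood ratio of one site: P(site | motif) / P(site | background)."""
    ratio = 1.0
    for column, base in zip(motif, site):
        ratio *= column[base] / background[base]
    return ratio


def average_odds(seq: str, motif: Motif, background: Dict[str, float],
                 both_strands: bool = True) -> float:
    """Average the likelihood ratio over every feasible site in the sequence."""
    width = len(motif)
    strands = [seq, reverse_complement(seq)] if both_strands else [seq]
    scores = [
        site_odds(strand[i:i + width], motif, background)
        for strand in strands
        for i in range(len(strand) - width + 1)
    ]
    if not scores:
        raise ValueError("sequence is shorter than the motif")
    return sum(scores) / len(scores)


# Toy example (assumed data): a width-2 motif that only matches "AC".
toy_motif = [
    {"A": 1.0, "C": 0.0, "G": 0.0, "T": 0.0},
    {"A": 0.0, "C": 1.0, "G": 0.0, "T": 0.0},
]
uniform = {base: 0.25 for base in "ACGT"}
print(average_odds("AC", toy_motif, uniform))  # 8.0
```

In the toy example the forward strand contributes one site with odds 16 and the reverse strand ("GT") contributes one site with odds 0, so the average is 8.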
By default, AMA reports the average motif affinity score. It can also report p-values, which are estimated analytically using the given zero-order background model or using the GC-content of each sequence. The GC-content options are restricted to alphabets with 4 symbols in 2 complementary pairs, like DNA.
AMA can also compute the sequence-dependent likelihood ratio score used by Clover. The denominator of this score depends on the sequence being scored, and is the likelihood of the site under a Markov model derived from the sequence itself. Unlike Clover, AMA also allows higher-order sequence-derived Markov models (see --sdbg option below).
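To make the idea of a sequence-derived Markov model concrete, the following Python sketch estimates the conditional symbol probabilities of an order-n model from a single sequence. It is an illustration under assumptions, not a description of how AMA or Clover builds its model: the function name `sequence_markov_model`, the fixed DNA alphabet and the pseudocount handling are all hypothetical.

```python
from collections import Counter
from typing import Dict

ALPHABET = "ACGT"  # assumed alphabet for this sketch


def sequence_markov_model(seq: str, order: int,
                          pseudocount: float = 1.0) -> Dict[str, Dict[str, float]]:
    """Estimate P(symbol | preceding `order` symbols) from the sequence itself.

    Returns a mapping from each observed context to a probability table.
    With order 0 the only context is the empty string.
    """
    counts: Dict[str, Counter] = {}
    for i in range(order, len(seq)):
        context = seq[i - order:i]
        counts.setdefault(context, Counter())[seq[i]] += 1

    model: Dict[str, Dict[str, float]] = {}
    for context, counter in counts.items():
        total = sum(counter.values()) + pseudocount * len(ALPHABET)
        model[context] = {
            symbol: (counter[symbol] + pseudocount) / total
            for symbol in ALPHABET
        }
    return model


# Order 0 reduces to the letter frequencies of the sequence.
print(sequence_markov_model("AACA", order=0, pseudocount=0.0)[""])
```

A site's likelihood under such a model would then serve as the denominator of the likelihood ratio, in place of the likelihood under the user-supplied background.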
If the input file contains more than one motif, the motifs will be processed consecutively.
Full details are given in the supplement to the GOMO paper: Fabian A. Buske, Mikael Bodén, Denis C. Bauer and Timothy L. Bailey, "Assigning roles to DNA regulatory motifs using comparative genomics", Bioinformatics, 26(7):860-866, 2010.
<motif file>: A file containing a list of motifs, in MEME format.
<sequence file>: A file containing a collection of sequences in FASTA format.
<background file>: The file supplying the zero-order background model. It is required unless --sdbg is specified.
AMA writes to standard output unless you specify --o or --oc, in which case the --o-format option (if given) is ignored and a separate file for each output format is written to the named directory. The available output formats are gff and CisML.
gff output has the format:
<sequence_name> ama sequence 1 <sequence_length> <sequence_score> <sequence_p-value> . . .
|--sdbg||n||Use a sequence-dependent Markov model of order n when computing likelihood ratios. A different sequence-dependent Markov model is computed for each sequence in the input and used to compute the likelihood ratio of all sites in that sequence. This option overrides --pvalues, --gcbins, and --rma.||The background file is required and is used to compute the likelihood ratio for all sites in all sequences.|
|--motif||id||Use only the motif identified by id. This option may be repeated.||All motifs are used.|
|--motif-pseudo||float||A pseudocount to be added to each count in the motif matrix, after first multiplying by the corresponding background frequency.||A pseudocount of 0.1 is applied.|
|--norc||Do not score the reverse complement strand (when using a complementable alphabet).||All strands are scored.|
|--scoring||avg-odds|max-odds||Indicates whether the average or the maximum likelihood ratio (odds) score should be calculated. If max-odds is chosen, no p-value will be printed.||Average score will be calculated.|
|--rma||Scale the motif affinity score by the maximum achievable score for each motif. This is termed the Relative Motif Affinity score. This allows for direct comparison between different motifs.||Affinity scores are not scaled.|
|--pvalues||Print the p-value of the average odds score in the output file. The p-value of a score is normally computed (but see --gcbins) assuming the sequences were each generated by the 0-order Markov model specified by the background file frequencies. This option is ignored if max-odds scoring is used.||No p-value will be printed.|
|--gcbins||bins||Compensate p-values for the complementary pair content (aka GC content) of each sequence independently. This is done by computing the score distributions for a range of complementary pair frequency values. Using 41 bins (recommended) computes distributions at intervals of 2.5% GC content. The computation assumes that the two letters in each complementary pair (i.e. A & T, or G & C, for the DNA alphabet) occur in a 1:1 ratio. This assumption will fail if a sequence contains far more of a letter than its complement. This option sets the --pvalues option. This option is ignored if max-odds scoring is used.||Uncompensated p-values are printed.|
|--cs||Combine sequences that share an identifier by taking the average score and the Sidak-corrected p-value: 1−(1−α)^(1/n). Different sequences with the same identifier are used in GOMO databases if one gene in the reference species has more than one homologous gene in the related species (one-to-many relationship).||Sequences are processed independently of each other.|
|--o-format||gff|cisml||Set the output file format.||CisML output format is used.|
|--o||dir||Create a folder called dir and write output files in it. This option is not compatible with --oc as only one output folder is allowed.||The program writes to standard out.|
|--oc||dir||Create a folder called dir but if it already exists allow overwriting the contents. This option is not compatible with --o as only one output folder is allowed.||The program writes to standard out.|
|--max-seq-length||max||Set the maximum length allowed for input sequences to max.||The maximum allowed input sequence length is 250000000.|
|--last||n||Use only the scores of (up to) the last n sequence positions to compute AMA. If the sequence is shorter than this value the entire sequence is scored. If the motif is longer than this value it will not be scored.||The full sequence is scored.|
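The bin arithmetic mentioned for --gcbins (41 bins giving 2.5% steps) can be checked with a short Python sketch. It assumes the bins are evenly spaced from 0% to 100% GC content inclusive and that a sequence is assigned to the nearest bin. Both points are assumptions for illustration rather than documented AMA behaviour, and the helper names are hypothetical.

```python
from typing import List


def gc_fraction(seq: str) -> float:
    """Fraction of G and C among the A, C, G and T letters of a sequence."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    if gc + at == 0:
        raise ValueError("sequence contains no A, C, G or T letters")
    return gc / (gc + at)


def gc_bin_values(bins: int) -> List[float]:
    """GC fractions at which distributions would be computed (assumed 0..1 inclusive)."""
    return [i / (bins - 1) for i in range(bins)]


def nearest_gc_bin(seq: str, bins: int = 41) -> int:
    """Index of the bin whose GC fraction is closest to that of the sequence."""
    return round(gc_fraction(seq) * (bins - 1))


values = gc_bin_values(41)
print(values[1] - values[0])   # spacing between neighbouring bins: 0.025
print(nearest_gc_bin("ACGT"))  # 50% GC falls in the middle bin: 20
```

With 41 bins there are 40 intervals across the 0% to 100% range, which is where the 2.5% spacing comes from.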