Usage:

ame [options] <sequence file> <motif file>+

Description

Input

<sequence file>

The name of a file containing a set of (primary) sequences in FASTA format. The FASTA header line of each sequence may contain a number (called a 'FASTA score') immediately following the sequence name that is used by some of AME's statistical enrichment methods.

The (optional) FASTA scores can represent any biological signal related to the sequences such as expression level, peak height or fluorescence score. If the sequences do not contain FASTA scores, some of AME's statistical enrichment methods utilize the order of the sequences in the sequence file.
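
The header convention described above can be illustrated with a minimal Python sketch. The function name read_fasta_scores is hypothetical, and two points are assumptions rather than documented AME behavior: that the score is the second whitespace-separated token on the header line, and that ranks replace scores for every sequence as soon as one header lacks a parsable number. AME's own parser may differ.

```python
def read_fasta_scores(text):
    """Return (name, score) pairs for each FASTA record in `text`.

    Assumption: the optional FASTA score is the token that immediately
    follows the sequence name on the header line.
    """
    names, scores = [], []
    for line in text.splitlines():
        if not line.startswith(">"):
            continue  # sequence data lines are not needed for this sketch
        tokens = line[1:].split()
        names.append(tokens[0] if tokens else "")
        try:
            scores.append(float(tokens[1]))
        except (IndexError, ValueError):
            scores.append(None)  # no score, or the next token is not a number
    if any(score is None for score in scores):
        # Assumption: without a complete set of scores, fall back to the
        # 1-based order of the sequences in the file.
        scores = [float(rank) for rank in range(1, len(names) + 1)]
    return list(zip(names, scores))


with_scores = read_fasta_scores(">a 0.5 peak one\nACGT\n>b 12\nGGCC\n")
without_scores = read_fasta_scores(">a\nACGT\n>b some description\nGGCC\n")
```

In the first call each sequence keeps its own score; in the second, neither header carries a number, so the sequences are scored 1.0 and 2.0 by position.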

<motif file>+

The names of one or more files containing MEME formatted motifs. Outputs from MEME, STREME and DREME are supported, as well as Minimal MEME Format. You can convert many other motif formats to MEME format using conversion scripts available with the MEME Suite.

Output

AME writes its output to files in a directory named ame_out, which it creates if necessary. You can change the output directory using the --o or --oc options. The directory will contain the following files:

In all output files, only results for significantly enriched motifs are reported.

Note: See this detailed description of the AME output formats for more information.

Algorithm

Scores- AME uses two scores for each sequence in computing motif enrichment. The 'PWM score' is computed by scoring the sequence with the motif. The 'FASTA score' is taken from the sequence header line if one is provided there (see above); otherwise it is the rank of the sequence within the sequence file.

Partition maximization- AME sorts the sequences in increasing order of FASTA score, and then 'partitions' the sequences, labeling the first N sequences 'positive', and the rest 'negative'. AME computes the significance of motif enrichment using this labeling and the PWM scores, and then repeats the process using values of N from 1 to the total number of sequences. AME reports the partition with the highest significance.
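
The loop described above can be sketched in Python as follows. This is an illustration under stated assumptions, not AME's implementation: the function names are hypothetical, a sequence counts as a motif "hit" when its PWM score reaches a caller-supplied threshold, the significance test is a one-tailed Fisher's exact test computed from the hypergeometric distribution, and no correction is made for testing many partitions.

```python
from math import comb


def fisher_one_tailed(pos_hits, n_pos, total_hits, n_total):
    """P(X >= pos_hits) for X ~ Hypergeometric(n_total, total_hits, n_pos)."""
    upper = min(n_pos, total_hits)
    tail = sum(
        comb(total_hits, k) * comb(n_total - total_hits, n_pos - k)
        for k in range(pos_hits, upper + 1)
    )
    return tail / comb(n_total, n_pos)


def maximize_partition(records, pwm_threshold):
    """Return (best_n, best_p) over all partitions of `records`.

    `records` is a list of (fasta_score, pwm_score) pairs. The first N
    sequences in increasing FASTA-score order are labeled 'positive'.
    """
    ordered = sorted(records, key=lambda r: r[0])
    is_hit = [pwm >= pwm_threshold for _, pwm in ordered]
    total_hits = sum(is_hit)
    best_n, best_p = None, 1.0
    pos_hits = 0
    for n in range(1, len(ordered) + 1):
        pos_hits += is_hit[n - 1]  # hits among the first n sequences
        p = fisher_one_tailed(pos_hits, n, total_hits, len(ordered))
        if best_n is None or p < best_p:
            best_n, best_p = n, p
    return best_n, best_p


# Six sequences; the three with the lowest FASTA scores have high PWM scores.
records = [(1, 9.0), (2, 8.0), (3, 7.0), (4, 1.0), (5, 2.0), (6, 1.0)]
best_n, best_p = maximize_partition(records, pwm_threshold=5.0)
```

For this toy input the most significant partition is N = 3, where all three hits fall in the 'positive' set and the one-tailed p-value is 1/20.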

Variations- The above behavior can be modified using the options described below. For example, with some enrichment methods you can switch the roles of the FASTA and PWM scores (see options --poslist and --linreg-switchxy, below). With two enrichment methods (fisher and ranksum), you can provide control sequences (see --control, below), which causes both FASTA scores and sequence order to be ignored. Two other enrichment methods (pearson and spearman), which are based on the correlation coefficient, ignore the 'negative' sequences entirely during partition maximization. You can also define which sequences are 'positive' by specifying '--fix-partition N', which causes the first N sequences (sorted by FASTA score) to be labeled 'positive'.

Options

Option Parameter Description Default Behavior
General Options
--text Output TSV format only to standard output. Default: AME behaves as if --oc ame_out had been specified.
--control file A set of control sequences in FASTA format, or the keyword --shuffle--. AME will determine if each motif is enriched in the primary sequences compared to the control sequences by labeling the primary sequences 'positive' and the control sequences 'negative', and then applying the enrichment method to that labeling. The keyword --shuffle-- causes AME to create (a minimum of 1000) control sequences by shuffling the letters in each primary sequence while preserving the frequencies of k-mers (see option --kmer, below). Note: The control sequences should have (approximately) the same distribution of lengths as the primary sequences or AME may fail to correctly detect enriched motifs and will report inaccurate p-values. Default: AME sorts the sequences by FASTA score and performs partition maximization, labeling the first N sequences as positive, for N = 1, ..., number of sequences.
--kmer k Preserve the frequencies of k-mers when creating a control dataset by shuffling the letters of each primary sequence. Default: A value of 2 is used.
--seed s Use s as the initial random number seed when shuffling sequence letters. Default: A value of 1 is used.
--method fisher|ranksum|pearson|spearman|3dmhg|4dmhg The method for testing motif enrichment.
fisher -
the one-tailed Fisher's Exact test. By default, AME performs partition maximization, labeling sequences sorted by FASTA score, and classifies them using the hit threshold (see --hit-lo-fraction, below). If you specify which sequences are 'positive' using either '--control' or '--fix-partition', AME instead maximizes over all possible PWM thresholds that are at least as large as the sequence threshold defined for the scoring method in use (see --scoring, below).
ranksum -
the one-tailed Wilcoxon rank-sum test, also known as the Mann-Whitney U test.
pearson -
the significance of the Pearson correlation coefficient between the PWM score and the FASTA score. Requires FASTA scores in all the sequence headers. If there are fewer than 30 sequences, AME computes the mean-squared error of the linear regression between the PWM score and the FASTA score instead. Not valid with --control.
spearman -
the significance of Spearman's rank coefficient (ρ) between the PWM score ranks and the FASTA score ranks. Not valid with --control.
3dmhg and 4dmhg -
the 3-dimensional (3dmhg) and 4-dimensional (4dmhg) multi-hypergeometric tests are two-tailed tests described in McLeay and Bailey, "Motif Enrichment Analysis: a unified framework and an evaluation on ChIP data", BMC Bioinformatics 11:165, 2010. These tests require --scoring totalhits; the 3dmhg function discriminates among sequences with 0, 1 or ≥ 2 hits, and the 4dmhg function discriminates among sequences with 0, 1, 2 or ≥ 3 hits. Note: Motifs enriched in either the primary or control sequences (or at the top or bottom of the sequences if you only give one sequence file) are considered significant by these tests. Not valid with --control.
Default: The one-tailed Fisher's exact test (fisher) method is used for testing motif enrichment.
--scoring avg|max|sum|totalhits The method for scoring a single sequence for matches to a motif's PWM. The PWM score assigned to a sequence is either:
avg -
the average motif odds score of all positions in the sequence; the sequence threshold assumes that the sequence has one "hit" (see --hit-lo-fraction, below) and the rest of the sites in the sequence have an average odds of 1.
max -
the maximum motif odds score over all positions in the sequence; the sequence threshold is equal to the hit threshold (see --hit-lo-fraction, below).
sum -
the sum of the motif odds scores of all positions in the sequence; the sequence threshold assumes that the sequence has one "hit" (see --hit-lo-fraction, below) and the rest of the sites in the sequence have an average odds of 1.
totalhits -
the total number of positions in the sequence whose odds score is at least the hit threshold (see --hit-lo-fraction, below); the sequence threshold is 1.
Default: The avg scoring method is used.
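
The four scoring methods can be compared with a short Python sketch. It assumes the per-position motif odds scores have already been computed, and that the hit threshold is supplied in the same units as those scores; the function name sequence_pwm_score and its parameters are hypothetical and are not part of AME.

```python
def sequence_pwm_score(position_odds, method="avg", hit_odds=None):
    """Combine per-position odds scores into one PWM score for a sequence.

    `position_odds` holds one motif odds score per scorable position.
    `hit_odds` is only used by the 'totalhits' method.
    """
    if not position_odds:
        raise ValueError("sequence has no scorable positions")
    if method == "avg":
        return sum(position_odds) / len(position_odds)
    if method == "max":
        return max(position_odds)
    if method == "sum":
        return sum(position_odds)
    if method == "totalhits":
        if hit_odds is None:
            raise ValueError("totalhits needs a hit threshold")
        # Count the positions that score at least the hit threshold.
        return sum(1 for score in position_odds if score >= hit_odds)
    raise ValueError(f"unknown scoring method: {method}")


odds = [0.5, 8.0, 1.5, 6.0]
avg_score = sequence_pwm_score(odds, "avg")
max_score = sequence_pwm_score(odds, "max")
sum_score = sequence_pwm_score(odds, "sum")
hit_count = sequence_pwm_score(odds, "totalhits", hit_odds=4.0)
```

With these four positions, avg gives 4.0, max gives 8.0, sum gives 16.0, and totalhits with a threshold of 4.0 counts 2 hits.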
--hit-lo-fraction fraction The hit threshold for a motif is defined as fraction times the maximum possible log-odds score for the motif. A position is considered a "hit" if the log-odds score is greater than or equal to the hit threshold. Default: A value of 0.25 is used.
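
As a worked illustration of the hit threshold, the Python sketch below computes fraction times the maximum possible log-odds score of a small motif. Several details are assumptions made for the example: the motif is given as per-column letter probabilities, log-odds are taken in base 2 against a uniform background, and no pseudocounts are applied. The function name hit_threshold is hypothetical.

```python
from math import log2


def hit_threshold(pwm, background, fraction=0.25):
    """Return fraction * (maximum possible log-odds score of the motif).

    `pwm` is a list of columns, each a dict mapping a letter to its
    probability; `background` maps each letter to its background frequency.
    """
    max_log_odds = sum(
        # The best letter in each column contributes that column's maximum.
        max(log2(p / background[letter]) for letter, p in column.items() if p > 0)
        for column in pwm
    )
    return fraction * max_log_odds


background = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
pwm = [
    {"A": 1.0, "C": 0.0, "G": 0.0, "T": 0.0},  # best letter scores log2(4) = 2
    {"A": 0.5, "C": 0.5, "G": 0.0, "T": 0.0},  # best letter scores log2(2) = 1
]
threshold = hit_threshold(pwm, background)
```

Here the maximum possible log-odds score is 2 + 1 = 3 bits, so with the fraction 0.25 a position counts as a hit when its log-odds score is at least 0.75.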
--evalue-report-threshold evalue E-value threshold for reporting a motif as significantly enriched. Default: A threshold of 10 is used for reporting a motif.
--fasta-threshold score For the Fisher's exact test only, when you use --poslist pwm and do not use --control or --fix-partition. AME will classify sequences with FASTA scores below score as 'positives'. Default: A maximum FASTA score of 0.001 is used by AME to classify a sequence as 'positive'.
--fix-partition N Causes AME to evaluate only the single partition consisting of the first N sequences. May not be used with --control or --poslist pwm. Default: Partition maximization is performed.
--poslist pwm|fasta For partition maximization, test thresholds on either X (PWM score) or Y (FASTA score). May not be used with --control or --fix-partition.
pwm -
Use PWM score (X).
fasta -
Use FASTA score (Y).
Hint: Be careful switching the poslist. It switches between using X and Y for determining true positives in the contingency matrix, in addition to switching which of X and Y AME uses for partition maximization.
Default: Use the FASTA score.
--log-fscores Convert FASTA scores into log-space. Only relevant for the pearson method. Default: Use the FASTA score directly.
--log-pwmscores Convert PWM scores into log-space. Only relevant for the pearson method. Default: Use the PWM score directly.
--linreg-switchxy Make the x-points FASTA scores and the y-points PWM scores. Only relevant for the pearson and spearman methods. Default: Keep the original axes.
--noseq (--method fisher only) Do not output the TSV (tab-separated values) file sequences.tsv. Note: This option is recommended when there are many motifs and many input sequences, as the TSV file can become extremely large. Default: AME outputs the file sequences.tsv, which lists the true- and false-positive sequences identified by AME using Fisher's Exact test.
--verbose 1|2|3|4|5 A number that regulates the verbosity level of the output information messages. If set to 1 (quiet), AME will only output error messages, whereas the other extreme, 5 (dump), outputs lots of mostly useless information. This option is best placed first. At verbosity level 3, AME will report the significance of each partition of the sequences that it considers. Default: The verbosity level is set to 2 (normal).

Citing

If you use AME in your research, please cite the following paper:
Robert McLeay and Timothy L. Bailey, "Motif Enrichment Analysis: a unified framework and an evaluation on ChIP data", BMC Bioinformatics, 11:165, 2010, doi:10.1186/1471-2105-11-165.