Usage:

enr [options] --p <primary sequences> [--m <motifs>]+

Description

Input

<primary sequences>

The name of a file containing the primary (positive) sequences in FASTA format. The file must contain at least 2 valid sequences or ENR will reject it. Note that the command-line version of ENR does not attempt to detect the alphabet from the primary sequences, so you should specify it with the --dna, --rna, --protein or --alph options.

<motifs>

The name of a file containing motifs in MEME format that ENR will test for enrichment in the primary sequences. This argument may be given more than once, allowing you to analyze motifs from several motif files simultaneously.
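As an illustration of the usage line above, the following sketch assembles an ENR command line from Python. The helper name build_enr_command and the file names are hypothetical, and the sketch assumes that --dna is a flag that takes no argument; only options named in this document are used.

```python
import shlex

def build_enr_command(primary, motif_files, alphabet_flag="--dna", controls=None):
    """Assemble an argv list following: enr [options] --p <primary> [--m <motifs>]+"""
    if not motif_files:
        raise ValueError("at least one motif file is required")
    argv = ["enr", alphabet_flag]      # options come before --p (alphabet flag is an assumption)
    if controls is not None:
        argv += ["--n", controls]      # optional control (negative) sequences
    argv += ["--p", primary]
    for path in motif_files:           # --m may be repeated, once per motif file
        argv += ["--m", path]
    return argv

# Hypothetical file names, for illustration only.
cmd = build_enr_command("primary.fa", ["a.meme", "b.meme"])
print(shlex.join(cmd))
```

The resulting list can be passed to subprocess.run with stdout redirected to a file, since ENR writes its results to standard output.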

Output

ENR writes its output to standard output. The output is in tab-separated values format (TSV). The first line of the output contains the (tab-separated) names of the fields. The names and meanings of each of the fields are described in the table below.

Field  Name          Contents
1      ID            The ID of the motif.
2      ALT_ID        The alternate ID of the motif (or blank).
3      POS_MATCHES   The number of primary sequences matching the motif with scores greater than or equal to the optimal score threshold.
4      NEG_MATCHES   The number of negative sequences matching the motif with scores greater than or equal to the optimal score threshold.
5      SCORE_THR     The match score threshold giving the optimal p-value. This is the score threshold used by ENR to determine the values of POS_MATCHES and NEG_MATCHES.
6      RATIO         The relative enrichment ratio of the motif in the primary vs. control sequences, defined as (POS_MATCHES/NPOS) / (NEG_MATCHES/NNEG), where NPOS is the number of primary sequences in the input, and NNEG is the number of negative sequences in the input.
7      PVALUE        The statistical significance (p-value) of the motif's enrichment, according to the chosen objective function.
8      LOG10_PVALUE  The base-10 logarithm of the p-value.
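The sketch below shows one way to read this output with Python's csv module and to recompute RATIO from its definition. The sample row, the sequence counts (NPOS = NNEG = 100) and the helper name enrichment_ratio are invented for illustration; they are not real ENR results.

```python
import csv
import io

# Hypothetical ENR-style output; the values are invented for illustration only.
SAMPLE_TSV = (
    "ID\tALT_ID\tPOS_MATCHES\tNEG_MATCHES\tSCORE_THR\tRATIO\tPVALUE\tLOG10_PVALUE\n"
    "motif_1\tM1\t40\t10\t7.5\t4.0\t1e-6\t-6.0\n"
)

def enrichment_ratio(pos_matches, npos, neg_matches, nneg):
    """(POS_MATCHES/NPOS) / (NEG_MATCHES/NNEG), as defined for the RATIO field."""
    if neg_matches == 0:
        # Assumption: this document does not say how a zero denominator is reported.
        return float("inf")
    return (pos_matches / npos) / (neg_matches / nneg)

# The first line holds the field names, so DictReader can key each row by name.
rows = list(csv.DictReader(io.StringIO(SAMPLE_TSV), delimiter="\t"))
first = rows[0]
ratio = enrichment_ratio(int(first["POS_MATCHES"]), 100, int(first["NEG_MATCHES"]), 100)
print(first["ID"], ratio)
```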

Options

Option  Parameter  Description  Default Behavior
Objective Function
--objfun  de|cd  This option selects the objective function that ENR optimizes in searching for motifs.
Value  Name  Description
de  Differential Enrichment  This objective function scores motifs based on the enrichment of their sites in the primary sequences compared with the control sequences. ENR estimates motif enrichment using Fisher's exact test if the primary and control sequences have the same average length (within 0.01%); otherwise it uses the Binomial test.
cd  Central Distance  This objective function scores motifs based on their tendency to occur near the center of the primary sequences, which must all be of the same length. No set of control sequences is allowed, and the primary sequences should include an adequate flanking region around the expected motif sites (e.g., use sequences of 500bp for ChIP-seq). ENR estimates the tendency of a motif to occur near the centers of primary sequences using the cumulative Bates distribution applied to the mean distance of the best site from the sequence center.
By default, ENR uses the Differential Enrichment (de) function.
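To make the two statistical tests named above concrete, here is a minimal sketch of a one-sided Fisher's exact test and of the cumulative Bates distribution. It is not ENR's implementation. The function names are hypothetical, the example counts are invented, and for the Bates case the sketch assumes that each distance has been rescaled to the interval [0, 1], a normalization this document does not specify.

```python
from fractions import Fraction
from math import comb, factorial, floor

def fisher_exact_greater(pos_hits, npos, neg_hits, nneg):
    """One-sided Fisher's exact p-value: the probability of seeing at least
    pos_hits matching primary sequences given the table margins
    (upper tail of the hypergeometric distribution)."""
    total = npos + nneg
    hits = pos_hits + neg_hits
    upper = min(npos, hits)
    numer = sum(comb(npos, k) * comb(nneg, hits - k)
                for k in range(pos_hits, upper + 1))
    return numer / comb(total, hits)

def bates_cdf(mean, n):
    """P(mean of n independent Uniform(0,1) variables <= mean),
    computed exactly from the Irwin-Hall CDF of their sum."""
    if mean <= 0:
        return 0.0
    if mean >= 1:
        return 1.0
    x = Fraction(mean) * n  # the sum corresponding to this mean
    total = sum((-1) ** k * comb(n, k) * (x - k) ** n
                for k in range(floor(x) + 1))
    return float(total / factorial(n))

# Invented counts: 40 of 100 primary and 10 of 100 control sequences match.
p_de = fisher_exact_greater(40, 100, 10, 100)
# Invented value: mean rescaled distance of 0.3 over 50 sequences (assumed scaling).
p_cd = bates_cdf(0.3, 50)
print(p_de, p_cd)
```

Exact rational arithmetic is used for the Bates sum because the alternating series cancels badly in floating point; for very large sequence counts a normal approximation would be the more practical choice.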
Control Sequences and Hold-out Set
--n  control sequences  The name of a file containing control (negative) sequences in FASTA format. The control sequences must be in the same sequence alphabet as the primary sequences. If the average length of the control sequences is longer than that of the primary sequences, ENR trims the control sequences so that both sets have the same average length. If you do not provide control sequences, ENR creates them by shuffling a copy of each primary sequence, preserving the frequencies of words of length k (see next option). Shuffling also preserves the positions of non-core (e.g., ambiguous) characters in each sequence to avoid artifacts.
--kmer  k  Preserve the frequencies of words (k-mers) of this size when shuffling primary sequences to create control sequences. k must be in the range [1,..,6]. ENR also estimates a background model of order k-1 from the primary (positive) sequences for use in log-likelihood scoring of motif sites. By default, ENR preserves the frequencies of words of length 3 (DNA, RNA and Custom alphabets) or 1 (Protein), and constructs background models of order 2 (DNA, RNA, Custom) or order 0 (Protein).
--hofract  hofract  The fraction of the primary sequences that ENR will randomly select and hold out to simulate exactly how STREME works. ENR will therefore report the same statistical significance for motifs found by STREME as STREME itself reports. Note: Set this option to zero if you want to measure the statistical significance of your motifs in the complete set of input sequences. By default, ENR sets hofract to 0.1 (10% of the primary sequences).
--seed  seed  Random seed for shuffling and sampling the hold-out set sequences (see above). By default, ENR uses a random seed of 0.
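The hold-out and shuffling behavior described above can be sketched as follows. This is a simplified illustration, not ENR's code: the function names are hypothetical, the way the hold-out count is rounded is an assumption, and the shuffle covers only the simplest case (k = 1) for a DNA alphabet. Preserving longer k-mer frequencies requires a more involved algorithm that is not shown.

```python
import random

CORE = set("ACGT")  # assumption: DNA core alphabet; other characters are non-core

def split_holdout(seqs, hofract=0.1, seed=0):
    """Randomly set aside a fraction of the sequences as a hold-out set."""
    rng = random.Random(seed)
    n_hold = int(round(hofract * len(seqs)))  # rounding rule is an assumption
    held = set(rng.sample(range(len(seqs)), n_hold))
    train = [s for i, s in enumerate(seqs) if i not in held]
    hold = [s for i, s in enumerate(seqs) if i in held]
    return train, hold

def shuffle_core_letters(seq, rng):
    """Order-0 (k = 1) shuffle: permute the core letters while leaving
    non-core characters (e.g. N) at their original positions."""
    core_positions = [i for i, c in enumerate(seq) if c in CORE]
    letters = [seq[i] for i in core_positions]
    rng.shuffle(letters)
    out = list(seq)
    for pos, letter in zip(core_positions, letters):
        out[pos] = letter
    return "".join(out)

# Invented sequences, for illustration only.
primary = ["ACGTNNACGT" for _ in range(20)]
train, hold = split_holdout(primary, hofract=0.1, seed=0)
rng = random.Random(0)
controls = [shuffle_core_letters(s, rng) for s in train]
print(len(train), len(hold), controls[0])
```

Using a fixed seed makes both the hold-out selection and the shuffled controls reproducible from run to run.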
Alphabet
Motif Scoring and Selection
Misc
--verbosity  1|2|3|4|5  A number that regulates the verbosity level of the output information messages. If set to 1 (quiet), ENR outputs only error messages, whereas the other extreme, 5 (dump), outputs a large amount of information intended for debugging. By default, the verbosity level is set to 2 (normal).

ENR algorithm overview

ENR evaluates each motif in the motif file(s) for enrichment in the primary sequences.

  1. Suffix Tree Creation.

    ENR builds a single suffix tree that includes both the primary and control sequences (but not the hold-out set sequences).

  2. Motif Conversion.

    ENR converts each motif from a frequency matrix to a log-odds score matrix. By default, ENR creates a background model from the control sequences, but you can provide a different background model if you wish.

  3. Motif Significance Computation.

    ENR computes the unbiased statistical significance of the motif by using the motif and the optimal discriminative score threshold (based on the primary and control sequences) to classify the hold-out set sequences, and then applying the statistical test (Fisher's exact test, Binomial test, or the cumulative Bates distribution) to the classification. Classification is based on the best match to the motif in each sequence (on either strand when the alphabet is complementable).
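Steps 2 and 3 above can be illustrated with a small sketch that converts a frequency matrix to log-odds scores and classifies sequences by their best match. It makes several simplifying assumptions that differ from ENR: an order-0 background, a fixed pseudocount of 0.01, scanning of the given strand only, and a direct window scan rather than a suffix tree. All function names and data are hypothetical.

```python
from math import log2

def to_log_odds(freq_matrix, background, pseudocount=0.01):
    """Convert per-position letter frequencies into log2-odds scores
    against an order-0 background (pseudocount value is an assumption)."""
    letters = sorted(background)
    norm = 1.0 + pseudocount * len(letters)
    return [
        {a: log2(((col.get(a, 0.0) + pseudocount) / norm) / background[a])
         for a in letters}
        for col in freq_matrix
    ]

def best_match_score(seq, scores):
    """Highest-scoring window in seq (single strand only in this sketch).
    Characters without a score (e.g. N) make a window score -inf."""
    width = len(scores)
    windows = (seq[i:i + width] for i in range(len(seq) - width + 1))
    return max(
        (sum(col.get(c, float("-inf")) for col, c in zip(scores, w))
         for w in windows),
        default=float("-inf"),
    )

def count_matches(seqs, scores, threshold):
    """Number of sequences whose best match reaches the score threshold."""
    return sum(best_match_score(s, scores) >= threshold for s in seqs)

# Invented two-column motif preferring "AC", with a uniform background.
background = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
motif = [{"A": 1.0}, {"C": 1.0}]
scores = to_log_odds(motif, background)
hits = count_matches(["GGACGG", "GGGGGG", "TTACTT"], scores, threshold=0.0)
print(hits)
```

The counts produced this way for the primary and control (or hold-out) sequences are the kind of input a statistical test such as Fisher's exact test would then take.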

Citing

If you use ENR in your research, please cite the following paper:
FIXME.