Usage:

sea [options] --p <primary sequences> [--m <motifs>]+

Description

Input

<primary sequences>

The name of a file containing the primary (positive) sequences in FASTA format. The file must contain at least 2 valid sequences or SEA will reject it.

<motifs>

The name of a file containing motifs in MEME format that SEA will test for enrichment in the primary sequences. This argument may be present more than once, allowing you to simultaneously analyze motifs in several motif files.
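
As an illustration of how repeated --m arguments fit the usage line, the following Python sketch assembles a SEA command line as an argument list. The helper name build_sea_command and the file names are hypothetical. Only the --p, --m and --oc options described in this document are used, and the sketch assumes that --oc takes the output directory name as its argument.

```python
import shlex


def build_sea_command(primary, motif_files, out_dir=None):
    """Return an argv list for SEA (hypothetical helper, not part of SEA)."""
    if not motif_files:
        # The usage line requires at least one --m argument.
        raise ValueError("at least one motif file is required")
    cmd = ["sea"]
    if out_dir is not None:
        # Options precede the required arguments in the usage line.
        cmd += ["--oc", out_dir]
    cmd += ["--p", primary]
    for motif_file in motif_files:
        # One --m per motif file; the argument may be repeated.
        cmd += ["--m", motif_file]
    return cmd


# Hypothetical file names, for illustration only.
example = build_sea_command("peaks.fa", ["db1.meme", "db2.meme"], out_dir="sea_out")
print(shlex.join(example))
```

The list form can be passed directly to subprocess.run, which avoids shell quoting problems with unusual file names.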

Output

SEA writes its output to files in a directory named sea_out, which it creates if necessary. You can change the output directory using the --o or --oc options. The directory will contain:

Note: All options may be preceded by a single dash (-) instead of a double dash (--) if desired.


Options

Option Parameter Description Default Behavior
Output
--text Output TSV format only to standard output. Default: SEA behaves as if --oc sea_out had been specified.
Control Sequences and Background Model
--n control sequences The name of a file containing control (negative) sequences in FASTA format. The control sequences must be in the same sequence alphabet as the primary sequences. If the average length of the control sequences is longer than that of the primary sequences, SEA trims the control sequences so that both sets have the same average length. If you do not provide control sequences, SEA creates them by shuffling a copy of each primary sequence, using an m-order shuffle (see next option). Shuffling also preserves the positions of non-core (e.g., ambiguous) characters in each sequence to avoid artifacts.
--order m If you do not provide control sequences, SEA creates them with an m-order shuffle of a copy of each primary sequence. This preserves the frequencies of words of length m+1 in each shuffled sequence. Unless you specify a background model file (see --bfile, below), SEA also estimates an m-order background model from the control sequences (or from the primary sequences if you do not provide control sequences). m must be in the range 0 to 5. Default: SEA uses m=2 for DNA and RNA, and m=0 for protein and custom alphabets.
--bfile file Specify the source of a background model in Markov Background Model Format, or one of the keywords --motif--, motif-file or --uniform--. The first two keywords cause the 0-order letter frequencies contained in the first motif file to be used, and --uniform-- causes uniform letter frequencies to be used. SEA uses the m-order portion of the background model for log-likelihood scoring of motif sites. Note: SEA will set the value of m to 0 if you specify one of the three keywords instead of the name of a file. Default: SEA estimates a 0-order background model from the control sequences.
--hofract hofract The fraction of the primary and control sequences that SEA randomly selects as the hold-out set, which it uses to compute the best score threshold for each motif. SEA uses this threshold when computing the p-value of the motif in the remaining (non-hold-out) sequences.
Note: If the hold-out set would contain fewer than 5 sequences, SEA does not create it, and the motif p-values and E-values will be less accurate.
Default: SEA sets hofract to 0.1, holding out 10% of the primary and control sequences.
--seed seed Random seed for shuffling and for sampling the hold-out set sequences. Default: SEA uses a random seed of 0.
Alphabet
Motif Scoring and Selection
Output filtering
--thresh thresh Limit the results to motifs whose significance is no greater than thresh. By default, SEA filters motifs on their enrichment E-value, which is computed by multiplying the p-value by the number of motifs in the input to SEA. You can use the --qvalue or --pvalue option (below) if you want to filter motifs on their enrichment q-value or p-value instead. Default: SEA reports motifs with enrichment E-values up to 10 (or up to 0.05 if --qvalue or --pvalue is given).
--qvalue Filter motifs on their enrichment q-value. The q-value is the minimum False Discovery Rate (FDR) required to consider the motif significant. Default: Filter motifs on the enrichment E-value.
--pvalue Filter motifs on their enrichment p-value. Default: Filter motifs on the enrichment E-value.
Misc
--align left | center | right For the site positional distribution diagrams, align the sequences on their left ends (left), on their centers (center), or on their right ends (right). For visualizing motif distributions, center alignment is ideal for ChIP-seq and similar data; right alignment for sequences upstream of transcription start sites; left alignment for many proteins or 3' UTR sequences. Default: Align the sequences on their centers.
--noseqs Do not output a TSV file (sequences.tsv) containing the matching sequences for each significant motif. This file can be quite large, so suppressing its output can be useful if it is not needed. Default: Output a TSV file containing the matching sequences.
--verbosity 1|2|3|4|5 A number that regulates the verbosity level of the output information messages. If set to 1 (quiet), SEA outputs only warning and error messages, whereas the other extreme, 5 (dump), outputs a large amount of information intended for debugging. Default: The verbosity level is set to 2 (normal).
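
The E-value filtering described for --thresh can be sketched in Python as follows. This is a minimal illustration, not SEA's implementation: the function name filter_by_evalue, the dictionary input, and the example motif names and p-values are all hypothetical. It assumes only what the option description states, namely that the E-value is the p-value multiplied by the number of motifs in the input, and that motifs are kept when their E-value is no greater than thresh.

```python
def filter_by_evalue(p_values, thresh=10.0):
    """Keep motifs whose E-value (p-value x number of motifs) is <= thresh.

    p_values maps a motif identifier to its enrichment p-value.
    Returns (motif_id, p_value, e_value) tuples sorted by E-value.
    """
    n_motifs = len(p_values)  # number of motifs given to the analysis
    kept = []
    for motif_id, p in p_values.items():
        e_value = p * n_motifs
        if e_value <= thresh:
            kept.append((motif_id, p, e_value))
    return sorted(kept, key=lambda row: row[2])


# Made-up p-values for three hypothetical motifs.
example = {"motifA": 1e-4, "motifB": 0.5, "motifC": 0.2}
for motif_id, p, e in filter_by_evalue(example, thresh=1.0):
    print(motif_id, p, e)
```

With three input motifs no E-value can exceed 3, so the default threshold of 10 would keep every motif in this example; the stricter thresh=1.0 is used here only to show the filter removing one.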

SEA algorithm overview

SEA evaluates each motif in the motif file(s) for enrichment in the primary sequences.

  1. Suffix Tree Creation.

    SEA builds a single suffix tree that includes both the primary and control sequences (but not the hold-out set sequences).

  2. Motif Conversion.

    SEA converts each motif from a frequency matrix to a log-odds score matrix. By default, SEA creates a background model from the control sequences, but you can provide a different background model if you wish.

  3. Motif Significance Computation.

    SEA estimates motif enrichment using Fisher's exact test if the primary and control sequences have the same average length (within 0.01%); otherwise, it uses the binomial test. SEA first uses the motif to classify the sequences in the hold-out set. Classification is based on the best match to the motif in each sequence (on either strand when the alphabet is complementable). SEA chooses the score threshold that gives the most significant p-value on the hold-out set. Using that score threshold, SEA then classifies the remaining (non-hold-out) sequences and computes the statistical significance of the classification.

    If there are not enough input sequences to construct a hold-out set with at least 5 primary and 5 control sequences, SEA optimizes the score threshold over all the input sequences. It adjusts the optimal p-value for N multiple tests using the formula
        p' = 1 - (1 - p)^N,
    where N is the number of score thresholds tested during the optimization of p.
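
The threshold optimization and multiple-test adjustment described in step 3 can be sketched in Python as below. This is a simplified illustration under stated assumptions, not SEA's code: it assumes a one-sided Fisher's exact test on the counts of sequences scoring at or above each candidate threshold, it tries every distinct best-match score as a threshold, and it takes N to be the number of thresholds tried. All function names and the example scores are hypothetical.

```python
from math import comb, expm1, log1p


def fisher_right_tail(pos_hits, pos_total, neg_hits, neg_total):
    """One-sided Fisher's exact test: P(X >= pos_hits) under the
    hypergeometric null, where X counts primary sequences among the hits."""
    hits = pos_hits + neg_hits
    total = pos_total + neg_total
    upper = min(hits, pos_total)
    tail = sum(
        comb(hits, k) * comb(total - hits, pos_total - k)
        for k in range(pos_hits, upper + 1)
    )
    return tail / comb(total, pos_total)


def adjust_p(p, n_tests):
    """p' = 1 - (1 - p)^N, evaluated via log1p/expm1 so that very
    small p-values do not lose precision."""
    if p >= 1.0:
        return 1.0
    return -expm1(n_tests * log1p(-p))


def optimize_threshold(pos_scores, neg_scores):
    """Try each distinct score as a threshold (assumption of this sketch).

    Returns (best_threshold, best_p, adjusted_p); best_threshold is None
    if no threshold gives a p-value below 1.
    """
    thresholds = sorted(set(pos_scores) | set(neg_scores))
    best_t, best_p = None, 1.0
    for t in thresholds:
        pos_hits = sum(s >= t for s in pos_scores)
        neg_hits = sum(s >= t for s in neg_scores)
        p = fisher_right_tail(pos_hits, len(pos_scores), neg_hits, len(neg_scores))
        if p < best_p:
            best_t, best_p = t, p
    return best_t, best_p, adjust_p(best_p, len(thresholds))


# Made-up best-match scores for four primary and four control sequences.
best_t, best_p, adj_p = optimize_threshold([9, 8, 7, 6], [1, 2, 3, 4])
print(best_t, best_p, adj_p)
```

In this example the threshold 6 separates the two sets perfectly, giving an unadjusted p-value of 1/70; the adjustment over the 8 thresholds tried raises it to 1 - (69/70)^8, which shows why an optimized p-value should not be reported without correction.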

Citing