sea [options] --p <primary sequences> [--m <motifs>]+
The name of a file containing the primary (positive) sequences in FASTA format. The file must contain at least 2 valid sequences or SEA will reject it.
The name of a file containing motifs in MEME format that SEA will test for enrichment in the primary sequences. This argument may be given more than once, allowing you to analyze motifs from several motif files simultaneously.
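As an illustration of the usage line above, the following minimal Python sketch assembles a SEA command line with one primary sequence file and several motif files. The helper `build_sea_command` and the file names are hypothetical and exist only for this example; only the `--p` and `--m` arguments come from the usage line.

```python
import shlex
from typing import List, Sequence


def build_sea_command(primary: str, motif_files: Sequence[str]) -> List[str]:
    """Assemble an argument list that follows the usage line above.

    Hypothetical helper for illustration; it is not part of SEA.
    """
    if not motif_files:
        raise ValueError("at least one motif file is required")
    cmd = ["sea", "--p", primary]
    for motif_file in motif_files:
        # --m may be repeated, once per motif file.
        cmd += ["--m", motif_file]
    return cmd


# File names below are placeholders.
cmd = build_sea_command("peaks.fa", ["motifs_a.meme", "motifs_b.meme"])
print(shlex.join(cmd))
```

Passing the list to `subprocess.run(cmd, check=True)` would run the analysis, provided the `sea` executable is on the PATH.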
SEA writes its output to files in a directory named sea_out, which it creates if necessary. You can change the output directory using the --o or --oc options.
The directory will contain:
- sea.html - an HTML file that provides the results in an interactive, human-readable format
- sea.tsv - a TSV (tab-separated values) file that provides the results in a format suitable for parsing by scripts and viewing with Excel
- sequences.tsv - (optional) a TSV (tab-separated values) file that lists the true- and false-positive sequences identified by SEA
- sites.tsv - (optional) a TSV (tab-separated values) file that lists the positions of the motif sites in the positive (primary) sequences for each motif found to be significant by SEA

Note: All options may be preceded by a single dash (-) instead of a double dash (--) if desired.
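Because the TSV outputs are meant to be parsed by scripts, a small reader can be sketched as follows. This is a generic tab-separated parser, not a description of SEA's exact file layout: it assumes the first retained line is the header and that blank lines and lines starting with `#` are not data rows. The column names in the demo are invented placeholders, and `read_tsv_rows` is a hypothetical helper.

```python
import csv
import io
from typing import Dict, Iterable, List


def read_tsv_rows(lines: Iterable[str]) -> List[Dict[str, str]]:
    """Parse tab-separated text whose first retained line is the header.

    Assumption: blank lines and lines starting with '#' are not data rows.
    """
    kept = [ln for ln in lines if ln.strip() and not ln.startswith("#")]
    return list(csv.DictReader(kept, delimiter="\t"))


# Placeholder content: these column names are invented for the demo and
# are not SEA's actual header.
demo = io.StringIO("motif\tscore\nm1\t0.5\n\n# trailing comment\n")
rows = read_tsv_rows(demo)
print(rows[0]["motif"], rows[0]["score"])
```

For a real run, the same function could be given an open file handle, for example `open("sea_out/sea.tsv", newline="")`.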
Option | Parameter | Description | Default Behavior |
---|---|---|---|
Output | |||
--text | | Output TSV format only to standard output. | SEA behaves as if --oc sea_out had been specified. |
Primary Sequences | |||
Control Sequences and Background Model | |||
--n | control sequences | The name of a file containing control (negative) sequences in FASTA format. The control sequences must be in the same sequence alphabet as the primary sequences. If the average length of the control sequences is longer than that of the primary sequences, SEA trims the control sequences so that both sets have the same average length (unless you also specify option --notrim). | If you do not provide control sequences, SEA creates them by shuffling a copy of each primary sequence, using an m-order shuffle (see next option). Shuffling also preserves the positions of non-core (e.g., ambiguous) characters in each sequence to avoid artifacts. |
--notrim | | Do not trim the control sequences even if their average length exceeds that of the primary sequences. | The control sequences will be trimmed if their average length exceeds that of the primary sequences. This may enable SEA to use the (more accurate) Fisher test rather than the Binomial test. |
--order | m | If you do not provide control sequences, SEA creates them by applying an m-order shuffle to a copy of each primary sequence. This preserves the frequencies of words of length m+1 in each shuffled sequence. Unless you specify a background model file (see --bfile, below), SEA will also estimate an m-order background model from the control sequences (or the primary sequences if you do not provide control sequences). m must be in the range [0, ..., 5]. | SEA uses m=2 (DNA and RNA), and m=0 (Protein and Custom alphabets). |
--bfile | file | Specify the source of a background model in Markov Background Model Format, or one of the keywords --motif--, motif-file or --uniform--. The first two keywords cause the 0-order letter frequencies contained in the first motif file to be used, and --uniform-- causes uniform letter frequencies to be used. SEA uses the m-order portion of the background model for log-likelihood scoring of motif sites. Note: SEA will set the value of m to 0 if you specify one of the three keywords instead of the name of a file. | SEA estimates a 0-order background model from the control sequences. |
--hofract | hofract | The fraction of the primary and control sequences that SEA will randomly select as a hold-out set for computing the best score threshold for each motif. SEA uses this threshold when computing the p-value of the motif in the remaining (non-hold-out) sequences. Note: If the hold-out set would contain fewer than 5 sequences, SEA does not create it, and the motif p-values and E-values will be less accurate. | SEA sets hofract to 0.1, holding out 10% of the primary and control sequences. |
--seed | seed | Random seed for shuffling and sampling the hold-out set sequences. | SEA uses a random seed of 0. |
Alphabet | |||
Motif Scoring and Selection | |||
Output filtering | |||
--thresh | thresh | Limit the results to motifs whose significance is no greater than thresh. By default, SEA filters motifs on their enrichment E-value, which is computed by multiplying the p-value by the number of motifs in the input to SEA. You can use the --qvalue or --pvalue option (below) if you want to filter motifs on their enrichment q-value or p-value instead. | SEA will report motifs with enrichment E-values up to 10 (or with q-values or p-values up to 0.05 if --qvalue or --pvalue is given). |
--qvalue | | Filter motifs on their enrichment q-value. The q-value is the minimum False Discovery Rate (FDR) required to consider the motif significant. | Filter motifs on the enrichment E-value. |
--pvalue | | Filter motifs on their enrichment p-value. | Filter motifs on the enrichment E-value. |
Misc | |||
--align | left | center | right | For the site positional distribution diagrams, align the sequences on their left ends (left), on their centers (center), or on their right ends (right). For visualizing motif distributions, center alignment is ideal for ChIP-seq and similar data; right alignment for sequences upstream of transcription start sites; left alignment for many proteins or 3' UTR sequences. | Align the sequences on their centers. |
--noseqs | | Do not output the TSV file sequences.tsv containing the matching sequences for each significant motif. This file can be quite large, so suppressing its output can be useful if it is not needed. This also suppresses output of the TSV file sites.tsv containing the positions of the sites in the positive (primary) sequences of each motif found to be significant by SEA. | Output two TSV files containing, respectively, the matching sequences, and the positions of the predicted sites in those sequences. |
--verbosity | 1|2|3|4|5 | A number that regulates the verbosity level of the output information messages. If set to 1 (quiet) then SEA will only output warning and error messages, whereas the other extreme 5 (dump) outputs lots of information intended for debugging. | The verbosity level is set to 2 (normal). |
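The E-value filter described for --thresh can be sketched in a few lines of Python. The sketch assumes the dictionary holds one enrichment p-value for every motif in the input, so that its length equals the number of motifs; the motif names and p-values are invented, and `filter_by_evalue` is a hypothetical helper rather than SEA's code.

```python
from typing import Dict, List, Tuple


def filter_by_evalue(
    pvalues: Dict[str, float], thresh: float = 10.0
) -> List[Tuple[str, float]]:
    """Return (motif, E-value) pairs with E-value <= thresh, smallest first.

    Assumption: pvalues holds one entry per motif in the input, so its
    length is the number of motifs used to scale the p-values.
    """
    n_motifs = len(pvalues)
    evalues = {name: p * n_motifs for name, p in pvalues.items()}
    kept = [(name, e) for name, e in evalues.items() if e <= thresh]
    return sorted(kept, key=lambda pair: pair[1])


# Invented p-values for three hypothetical motifs.
demo_pvalues = {"motif_a": 1e-6, "motif_b": 0.2, "motif_c": 4.0e-3}
for name, evalue in filter_by_evalue(demo_pvalues, thresh=0.05):
    print(name, evalue)
```

With three motifs, each p-value is multiplied by 3, so a motif with p = 0.2 has E = 0.6 and is dropped at a threshold of 0.05.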
SEA evaluates each motif in the motif file(s) for enrichment in the primary sequences.
SEA builds a single suffix tree that includes both the primary and control sequences (but not the hold-out set sequences).
SEA converts each motif from a frequency matrix to a log-odds score matrix. By default, SEA creates a background model from the control sequences, but you can provide a different background model if you wish.
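The conversion from a frequency matrix to a log-odds score matrix can be illustrated with a 0-order background. This is a minimal sketch, not SEA's implementation: the base-2 logarithm, the add-one count used for the background, and the small pseudocount mixed into each motif column are all assumptions made so that the example never takes the logarithm of zero. The function names are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List, Sequence


def zero_order_background(
    seqs: Sequence[str], alphabet: str = "ACGT"
) -> Dict[str, float]:
    """Estimate letter frequencies, ignoring letters outside the alphabet."""
    counts = Counter(ch for s in seqs for ch in s.upper() if ch in alphabet)
    total = sum(counts.values())
    # Add-one count per letter (assumption) keeps every frequency positive.
    return {a: (counts[a] + 1) / (total + len(alphabet)) for a in alphabet}


def log_odds_matrix(
    freqs: Sequence[Dict[str, float]],
    background: Dict[str, float],
    pseudo: float = 0.01,
) -> List[Dict[str, float]]:
    """Convert per-position letter frequencies to log2(f / b) scores."""
    rows = []
    for col in freqs:
        row = {}
        for letter, b in background.items():
            # Mix in a little background (assumption) so that f > 0.
            f = (col.get(letter, 0.0) + pseudo * b) / (1.0 + pseudo)
            row[letter] = math.log2(f / b)
        rows.append(row)
    return rows


# Invented control sequences and a two-column motif.
background = zero_order_background(["ACGT", "AACCGGTT"])
motif = [
    {"A": 1.0, "C": 0.0, "G": 0.0, "T": 0.0},
    {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
]
scores = log_odds_matrix(motif, background)
print(scores[0]["A"], scores[1]["A"])
```

A column that matches the background scores close to zero for every letter, while a column dominated by one letter gives that letter a positive score and the others negative scores.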
SEA estimates motif enrichment using Fisher's exact test if the primary and control sequences have the same average length (within 0.01%); otherwise, it uses the Binomial test. SEA first uses the motif to classify the sequences in the hold-out set. Classification is based on the best match to the motif in each sequence (on either strand when the alphabet is complementable). SEA chooses the score threshold that gives the most significant p-value on the hold-out set. Using that score threshold, SEA then classifies the remaining (non-hold-out) sequences and computes the statistical significance of the classification.
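The hold-out procedure above can be sketched for the Fisher's exact test case. This is a simplified illustration under stated assumptions, not SEA's code: each sequence is represented only by its best match score, every observed score is tried as a candidate threshold, and the test is one-sided (enrichment in the primary set). All scores are invented and the function names are hypothetical.

```python
from math import comb
from typing import Optional, Sequence, Tuple


def fisher_one_sided(pos_hits: int, n_pos: int, neg_hits: int, n_neg: int) -> float:
    """Probability of at least pos_hits primary hits under the hypergeometric null."""
    total_hits = pos_hits + neg_hits
    upper = min(n_pos, total_hits)
    tail = sum(
        comb(n_pos, k) * comb(n_neg, total_hits - k)
        for k in range(pos_hits, upper + 1)
    )
    return tail / comb(n_pos + n_neg, total_hits)


def best_threshold(
    pos_scores: Sequence[float], neg_scores: Sequence[float]
) -> Tuple[Optional[float], float]:
    """Try every observed score as a threshold and keep the smallest p-value."""
    best_t, best_p = None, 1.0
    for t in sorted(set(pos_scores) | set(neg_scores)):
        pos_hits = sum(s >= t for s in pos_scores)
        neg_hits = sum(s >= t for s in neg_scores)
        p = fisher_one_sided(pos_hits, len(pos_scores), neg_hits, len(neg_scores))
        if p < best_p:
            best_t, best_p = t, p
    return best_t, best_p


# Invented best-match scores for the hold-out set (5 primary, 5 control).
held_pos = [9.1, 8.4, 7.7, 2.0, 6.5]
held_neg = [1.2, 3.3, 2.8, 7.0, 0.5]
threshold, held_p = best_threshold(held_pos, held_neg)

# Invented scores for the remaining sequences, classified with that threshold.
rest_pos = [8.8, 7.9, 3.1, 9.5, 8.0, 1.4]
rest_neg = [2.2, 0.9, 7.8, 4.0, 3.6, 1.1]
final_p = fisher_one_sided(
    sum(s >= threshold for s in rest_pos), len(rest_pos),
    sum(s >= threshold for s in rest_neg), len(rest_neg),
)
print(threshold, held_p, final_p)
```

Choosing the threshold on one set of sequences and reporting the p-value on another keeps the reported p-value from being biased by the threshold search.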
If there are not enough input sequences to construct a hold-out set with at least 5 primary and 5 control sequences, SEA optimizes the score threshold over all the input sequences. It adjusts the optimal p-value for N multiple tests using the formula

$p' = 1 - (1 - p)^{N}$,

where N is the number of score thresholds tested during the optimization of p.
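The adjustment for N multiple tests, p' = 1 - (1 - p) raised to the power N, can be evaluated as in the short sketch below. The use of `log1p` and `expm1` is a choice made for this example to keep precision when p is very small; it is not a statement about how SEA computes the value, and `adjust_pvalue` is a hypothetical name.

```python
import math


def adjust_pvalue(p: float, n_tests: int) -> float:
    """Return 1 - (1 - p) ** n_tests, computed stably for tiny p."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie in [0, 1]")
    if n_tests < 1:
        raise ValueError("n_tests must be at least 1")
    if p == 1.0:
        return 1.0  # avoids log1p(-1), which is undefined
    # (1 - p) ** N == exp(N * log1p(-p)), so 1 - (1 - p) ** N == -expm1(...)
    return -math.expm1(n_tests * math.log1p(-p))


# An optimal p-value of 0.01 found after testing 10 score thresholds.
print(adjust_pvalue(0.01, 10))
```

For small p the adjusted value is close to N times p, and it never exceeds 1.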