SEA

The name of a file containing the primary (positive) sequences in FASTA format. The file must contain at least 2 valid sequences or SEA will reject it.

The name of a file containing motifs in MEME format that SEA will test for enrichment in the primary sequence. This argument may be present more than once, allowing you to simultaneously analyze motifs in several motif files.

SEA writes its output to files in a directory named sea_out, which it creates if necessary. You can change the output directory using the --o or --oc options. The directory will contain:

sea.html - an HTML file that provides the results in an interactive, human-readable format
sea.tsv - a TSV (tab-separated values) file that provides the results in a format suitable for parsing by scripts and viewing with Excel
sequences.tsv - (optional) a TSV (tab-separated values) file that lists the true- and false-positive sequences identified by SEA
sites.tsv - (optional) a TSV (tab-separated values) file that lists the positions of the motif sites in the positive (primary) sequences for each motif found to be significant by SEA
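
As a rough illustration of script-based parsing, the sketch below reads a results table shaped like sea.tsv using only the Python standard library. The column names (ID, PVALUE, EVALUE), the handling of trailing '#' comment lines, and the example rows are assumptions made for illustration; check the header of your own sea.tsv before relying on them.

```python
import csv
import io


def read_sea_tsv(handle):
    """Yield one dict per motif row, skipping blank lines and '#' comment lines."""
    rows = (line for line in handle if line.strip() and not line.startswith("#"))
    yield from csv.DictReader(rows, delimiter="\t")


# Hypothetical two-motif excerpt; the column names are assumptions.
example = io.StringIO(
    "ID\tPVALUE\tEVALUE\n"
    "motifA\t1e-6\t2e-6\n"
    "motifB\t0.2\t0.4\n"
    "\n"
    "# comment line\n"
)

records = list(read_sea_tsv(example))
# Keep motifs whose (assumed) EVALUE column is at most 0.05.
significant = [r["ID"] for r in records if float(r["EVALUE"]) <= 0.05]
print(significant)
```

To read a real file, pass an open file handle, for example `open("sea_out/sea.tsv")`, in place of the in-memory example.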

Note: All options may be preceded by a single dash (-) instead of a double dash (--) if desired.

Option	Parameter	Description	Default Behavior
Output
--text		Output TSV format only to standard output.	SEA behaves as if `--oc sea_out` had been specified.
Primary Sequences
Control Sequences and Background Model
--n	control sequences	The name of a file containing control (negative) sequences in FASTA format. The control sequences must be in the same sequence alphabet as the primary sequences. If the average length of the control sequences is longer than that of the primary sequences, SEA trims the control sequences so that both sets have the same average length (unless you also specify option --notrim).	If you do not provide control sequences, SEA creates them by shuffling a copy of each primary sequence, using an m-order shuffle (see next option). Shuffling also preserves the positions of non-core (e.g., ambiguous) characters in each sequence to avoid artifacts.
--notrim		Do not trim the control sequences even if their average length exceeds that of the primary sequences.	The control sequences will be trimmed if their average length exceeds that of the primary sequences. This may enable SEA to use the (more accurate) Fisher test rather than the Binomial test.
--order	m	If you do not provide control sequences, SEA creates them by applying an m-order shuffle to a copy of each primary sequence. This preserves the frequencies of words of length m+1 in each shuffled sequence. Unless you specify a background model file (see --bfile, below), SEA will also estimate an m-order background model from the control sequences (or the primary sequences if you do not provide control sequences). m must be in the range [0,..,5].	SEA uses m=2 (DNA and RNA), and m=0 (Protein and Custom alphabets).
--bfile	file	Specify the source of a background model in Markov Background Model Format, or one of the keywords `--motif--`, `motif-file` or `--uniform--`. The first two keywords cause the 0-order letter frequencies contained in the first motif file to be used, and `--uniform--` causes uniform letter frequencies to be used. SEA uses the m-order portion of the background model for log-likelihood scoring of motif sites. Note: SEA will set the value of m to 0 if you specify one of the three keywords instead of the name of a file.	SEA estimates a 0-order background model from the control sequences.
--hofract	hofract	The fraction of the primary and control sequences that SEA will randomly select as the hold-out set for computing the best score threshold for each motif. SEA uses this threshold when computing the p-value of the motif in the remaining (non-hold-out) sequences. Note: If the hold-out set would contain fewer than 5 sequences, SEA does not create it, and the motif p-values and E-values will be less accurate.	SEA sets hofract to 0.1, holding out 10% of the primary and control sequences.
--seed	seed	Random seed for shuffling and sampling the hold-out set sequences.	SEA uses a random seed of 0.
Alphabet
Motif Scoring and Selection
Output filtering
--thresh	thresh	Limit the results to motifs whose significance is no greater than thresh. By default, SEA filters motifs on their enrichment E-value, which is computed by multiplying the p-value by the number of motifs in the input to SEA. You can use the --qvalue or --pvalue option (below) if you want to filter motifs on their enrichment q-value or p-value instead.	SEA will report motifs with enrichment E-values up to 10 (or with q-values or p-values up to 0.05 if --qvalue or --pvalue is given).
--qvalue		Filter motifs on their enrichment q-value. The q-value is the minimum False Discovery Rate (FDR) required to consider the motif significant.	Filter motifs on the enrichment E-value.
--pvalue		Filter motifs on their enrichment p-value.	Filter motifs on the enrichment E-value.
Misc
--align	left | center | right	For the site positional distribution diagrams, align the sequences on their left ends (left), on their centers (center), or on their right ends (right). For visualizing motif distributions, center alignment is ideal for ChIP-seq and similar data; right alignment for sequences upstream of transcription start sites; left alignment for many proteins or 3' UTR sequences.	Align the sequences on their centers.
--noseqs		Do not output the TSV file `sequences.tsv` containing the matching sequences for each significant motif. This file can be quite large, so suppressing its output can be useful if it is not needed. This also suppresses output of the TSV file `sites.tsv` containing the positions of the sites in the positive (primary) sequences of each motif found to be significant by SEA.	Output two TSV files containing, respectively, the matching sequences, and the positions of the predicted sites in those sequences.
--verbosity	1|2|3|4|5	A number that regulates the verbosity level of the output information messages. If set to 1 (quiet) then SEA will only output warning and error messages, whereas the other extreme 5 (dump) outputs lots of information intended for debugging.	The verbosity level is set to 2 (normal).
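
The E-value filter described under --thresh can be sketched in a few lines of Python. This is a minimal illustration of the stated rule (E-value = p-value times the number of input motifs), not SEA's own code; the function name, the dictionary layout, and the example p-values are hypothetical.

```python
def filter_by_evalue(pvalues, thresh=10.0):
    """Return {motif_id: E-value} for motifs whose E-value is at most thresh.

    pvalues maps a motif id to its enrichment p-value. The E-value is the
    p-value multiplied by the number of motifs given as input.
    """
    n_motifs = len(pvalues)
    evalues = {motif: p * n_motifs for motif, p in pvalues.items()}
    return {motif: e for motif, e in evalues.items() if e <= thresh}


# Hypothetical p-values for three input motifs.
pvalues = {"motifA": 1e-6, "motifB": 0.5, "motifC": 4.0e-3}
kept = filter_by_evalue(pvalues, thresh=1.0)
print(sorted(kept))
```

The default of 10 mirrors the default E-value threshold listed in the table; the example passes a stricter value to show a motif being dropped.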

SEA evaluates each motif in the motif file(s) for enrichment in the primary sequences.

Suffix Tree Creation.
SEA builds a single suffix tree that includes both the primary and control sequences (but not the hold-out set sequences).
Motif Conversion.
SEA converts each motif from a frequency matrix to a log-odds score matrix. By default, SEA creates a background model from the control sequences, but you can provide a different background model if you wish.
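
The conversion from frequencies to log-odds scores can be illustrated as follows. This is a sketch under stated assumptions, not SEA's implementation: it uses a 0-order background, base-2 logarithms, and a small background-weighted pseudocount to avoid taking the log of zero. The function name and the pseudocount value are hypothetical.

```python
import math


def to_log_odds(freq_matrix, background, pseudocount=0.1):
    """Convert per-position letter frequencies to log2-odds scores.

    freq_matrix: list of dicts, one per motif position, letter -> frequency.
    background:  dict letter -> 0-order background frequency.
    pseudocount: assumed smoothing weight (hypothetical value).
    """
    scores = []
    for column in freq_matrix:
        row = {}
        for letter, bg in background.items():
            # Blend the observed frequency with the background so it is never 0.
            f = (column.get(letter, 0.0) + pseudocount * bg) / (1.0 + pseudocount)
            row[letter] = math.log2(f / bg)
        scores.append(row)
    return scores


background = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
motif = [
    {"A": 1.0},                                       # position that is always A
    {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},     # uninformative position
]
log_odds = to_log_odds(motif, background)
```

A position whose frequencies equal the background scores 0 for every letter, while a letter that is more frequent than the background scores above 0.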
Motif Significance Computation.
SEA estimates motif enrichment using Fisher's exact test if the primary and control sequences have the same average length (within 0.01%); otherwise, it uses the Binomial test. SEA first uses the motif to classify the sequences in the hold-out set. Classification is based on the best match to the motif in each sequence (on either strand when the alphabet is complementable). SEA chooses the score threshold that gives the most significant p-value on the hold-out set. Using that score threshold, SEA then classifies the remaining (non-hold-out) sequences and computes the statistical significance of the classification.
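
The threshold search on the hold-out set can be sketched as below, for the Fisher's exact test case only. This is an illustration, not SEA's code: it assumes a one-sided (enrichment) test, takes every observed best-match score as a candidate threshold, and uses hypothetical function names and scores.

```python
from math import comb


def fisher_right_tail(tp, fp, n_pos, n_neg):
    """One-sided Fisher p-value: chance of at least tp primary sequences
    among the tp + fp matching sequences under the hypergeometric null."""
    hits = tp + fp
    denom = comb(n_pos + n_neg, hits)
    upper = min(n_pos, hits)
    tail = sum(comb(n_pos, k) * comb(n_neg, hits - k) for k in range(tp, upper + 1))
    return tail / denom


def best_threshold(pos_scores, neg_scores):
    """Return (threshold, p-value) with the smallest p-value.

    pos_scores / neg_scores hold the best motif match score of each
    primary / control sequence in the hold-out set.
    """
    best = None
    for t in sorted(set(pos_scores) | set(neg_scores)):
        tp = sum(s >= t for s in pos_scores)   # primary sequences classified positive
        fp = sum(s >= t for s in neg_scores)   # control sequences classified positive
        p = fisher_right_tail(tp, fp, len(pos_scores), len(neg_scores))
        if best is None or p < best[1]:
            best = (t, p)
    return best


# Hypothetical best-match scores for a hold-out set of 5 + 5 sequences.
threshold, p_holdout = best_threshold([5, 6, 7, 8, 9], [0, 1, 2, 3, 4])
```

The chosen threshold would then be applied once to the remaining sequences, so the p-value reported there is not biased by the search over thresholds.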

If there are not enough input sequences to construct a hold-out set with at least 5 primary and 5 control sequences, SEA optimizes the score threshold over all the input sequences. It adjusts the optimal p-value for N multiple tests using the formula
p' = 1 - (1 - p)^N,
where N is the number of score thresholds tested during the optimization of p.
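
A minimal Python sketch of this adjustment is shown below. The function name is hypothetical, and the use of log1p and expm1 is a numerical choice made here (it keeps precision when p is very small), not something the formula requires.

```python
import math


def adjust_pvalue(p, n_tests):
    """Return p' = 1 - (1 - p)^N for N = n_tests score thresholds tested."""
    if p >= 1.0:
        return 1.0
    # Equivalent to 1 - (1 - p) ** n_tests, but stable for tiny p.
    return -math.expm1(n_tests * math.log1p(-p))


print(adjust_pvalue(0.5, 2))
```

For small p the adjusted value is close to N times p, which is why an optimized p-value found over many thresholds can be much less significant after adjustment.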

The MEME Suite

Motif-based sequence analysis tools

Simple Enrichment Analysis
