MoMo

Input

<algorithm>

The name of the algorithm to use to search for motifs. Available algorithms are motifx, modl, and simple.

motifx: The motif-x algorithm utilizes a greedy iterative search to discover motifs by recursively picking the most statistically significant position/residue pair according to binomial probability, reducing the dataset (foreground and background) to only peptides containing that pair, and continuing until no more position/residue pairs are significant according to a user-defined threshold. If this motif has at least one statistically significant position/residue pair, all instances of the pattern are removed, and the algorithm continues to generate motifs until this condition fails.
The motif-x algorithm is described in the paper Schwartz, D. and Gygi, S. P. (2005). "An iterative statistical approach to the identification of protein phosphorylation motifs from large-scale data sets". Nature Biotechnology, 23(11), 1391-1398.
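The greedy step described for motifx above can be sketched in Python. This is only an illustration under stated assumptions, not MoMo's implementation: the function names (binomial_tail, best_pair, grow_motif) are hypothetical, the background probability of a pair is assumed to be its observed fraction among the background peptides, and the outer loop that removes matched peptides and searches for further motifs is omitted.

```python
from math import comb

def binomial_tail(n, k, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

def best_pair(foreground, background, center, threshold=1e-6):
    """Return (p_value, position, residue) for the most significant pair, or None."""
    best = None
    for pos in range(len(foreground[0])):
        if pos == center:
            continue  # the modified residue itself is not a candidate
        for res in {pep[pos] for pep in foreground}:
            k = sum(pep[pos] == res for pep in foreground)
            # Assumption: background probability = fraction of background peptides
            p_bg = sum(pep[pos] == res for pep in background) / len(background)
            pval = binomial_tail(len(foreground), k, p_bg)
            if pval <= threshold and (best is None or pval < best[0]):
                best = (pval, pos, res)
    return best

def grow_motif(foreground, background, center, threshold=1e-6):
    """Fix significant pairs one at a time, reducing both sets after each pick."""
    fixed = {}
    while foreground and background:
        hit = best_pair(foreground, background, center, threshold)
        if hit is None:
            break  # no remaining pair passes the threshold
        _, pos, res = hit
        fixed[pos] = res
        foreground = [p for p in foreground if p[pos] == res]
        background = [p for p in background if p[pos] == res]
    return fixed
```

For example, with 30 foreground peptides "AAASPAA" and a background in which only 2 of 20 peptides have P at index 4, grow_motif fixes that single pair and then stops.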
modl: The MoDL algorithm is based on the principle of minimum description length (MDL). It searches for a set of motifs that minimizes the number of bits to encode the set of modified peptides and motifs, using a greedy iterative approach. The algorithm uses a list of candidate single-residue motifs (excluding the modified site) that exist in the modification dataset. Then, starting with an empty initial set of motifs, the algorithm generates five potential motif sets by either (1) removing a candidate motif, (2) adding a candidate motif, (3) adding a candidate motif then removing a motif, (4) merging a motif with a candidate motif, or (5) merging a motif with a candidate motif and then removing a motif. The algorithm then chooses the potential motif set with the minimum description length, and makes it the new 'initial' set of motifs. The algorithm iterates the above steps a specified number of times t, or until the description length of the motif set does not change for L iterations (t=50 and L=10 by default).
The MoDL algorithm is described in the paper Ritz, A., Shakhnarovich, G., Salomon, A., and Raphael, B. (2009). "Discovery of phosphorylation motif mixtures in phosphoproteomics data". Bioinformatics, 25(1), 14-21.
simple: Creates a maximum-likelihood position weight matrix (PWM) motif for each distinct central residue present in the modified peptides in the input PTM file(s). The weights in the PWM are the observed frequencies of the amino acids in the equal-length modified peptides, aligned on their central residue (the modified amino acid). If the modified peptides in the input PTM file(s) have differing lengths, their lengths are adjusted to be equal, as described below under the --width option.
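As a rough illustration of the simple algorithm, the following Python sketch groups equal-length peptides by their central residue and computes per-position observed frequencies. The function names are hypothetical, and details of MoMo's actual PWM construction (such as background handling or output formatting) are not modeled.

```python
from collections import Counter, defaultdict

def frequency_matrix(peptides):
    """One dict per position, mapping amino acid -> observed frequency."""
    width = len(peptides[0])
    matrix = []
    for pos in range(width):
        counts = Counter(pep[pos] for pep in peptides)
        matrix.append({aa: n / len(peptides) for aa, n in counts.items()})
    return matrix

def pwms_by_central_residue(peptides):
    """Build one frequency matrix per distinct central (modified) residue."""
    groups = defaultdict(list)
    for pep in peptides:
        groups[pep[len(pep) // 2]].append(pep)
    return {res: frequency_matrix(group) for res, group in groups.items()}
```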

<PTM file>+

The names of one or more files with peptide sequences containing a post-translationally modified amino acid. (These are referred to as the "foreground" peptides.) Each file must be either in PSM format, FASTA format, or "Raw" (one sequence per line) format. All files must use the same format, and MoMo will attempt to determine the format of the file(s) using the following rules, in order:

FASTA: The first non-empty line begins with the '>' character.
PSM: The first non-empty line contains a tab character.
Raw: Otherwise.
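The three rules above can be expressed as a short Python function. This is a sketch of the documented rules only; the name detect_format is hypothetical, and "non-empty" is assumed to mean a line with at least one character besides the line terminator.

```python
def detect_format(lines):
    """Apply the FASTA / PSM / Raw rules to the first non-empty line."""
    for line in lines:
        if line.rstrip("\r\n") == "":
            continue  # skip empty lines
        if line.startswith(">"):
            return "FASTA"
        if "\t" in line:
            return "PSM"
        return "Raw"
    return None  # no non-empty line was found
```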

With FASTA or Raw format files, all sequences must have the same length, which must be 7 unless a different length is specified using the option --width, below. The sequences must be in the Protein IUPAC alphabet, and the modified amino acid is assumed to be the central residue.

With PSM format files, one column must contain the modified peptides with the modified amino acid indicated as described in the PSM format documentation. You can specify the name of the modified peptide column using the --sequence-column option, below. MoMo will attempt to expand all modified peptides to the width of the longest modified peptide or the requested motif width, whichever is wider. (See option --width, below, for a description of how the expansion is done.)

Output

MoMo writes its output to files in a directory named momo_out, which it creates if necessary. You can change the output directory using the --o or --oc options. The directory will contain the following files:

momo.html - an HTML file that provides the results in a human-readable format
momo.tsv - a TSV (tab-separated values) file that provides the results in a format suitable for parsing by scripts and viewing with Excel
momo.txt - a plain text file containing the motifs discovered by MoMo in MEME Motif Format
<name>.png - PNG image files containing sequence logos for each of the motifs found by MoMo (where <name> is the motif name)

Note: See the detailed description of the MoMo output formats for more information.

Options


Option Parameter Description Default Behavior

General Options

--psm-type comet|ms-gf+|tide|percolator
Description: If the PTM file(s) are in tab-separated Peptide-Spectrum Match (PSM) format, you can specify the name of the program that created them. This will cause MoMo to set the name of the column containing the modified peptides appropriately.
Default behavior: MoMo will attempt to determine the type of the PTM file(s), and if they are in a PSM format, you must use the --sequence-column option, below.

--sequence-column name
Description: The name of the column containing the modified peptides in the PTM file. The column names must be specified, separated by tabs, in the first line of the PTM file.
Note: Column names must NOT contain tabs.
Note: This option is required if the PTM file(s) are in PSM format unless you use the --psm-type option, above.
Default behavior: None.

--width width
Description: The width of motifs to discover. Because motifs will be symmetric around the central, modified residue, width must be odd. The behavior of MoMo depends on the format of the PTM input file(s):

PTM file format	MoMo Behavior
FASTA or Raw format	No effect. An error is reported if the length of any sequence in the input files differs from width.
PSM format	If a modified peptide is shorter than width, MoMo will first attempt to expand it by looking up its context in the protein database file, if given (see option --protein-database, below). If the modified peptide is still shorter than width, MoMo will pad it on either side as required using the Protein IUPAC 'X' character. If the longest modified peptide is still shorter than width, MoMo will set the motif width to the length of the longest (expanded and padded) modified peptide.

Default behavior:
FASTA or Raw format: Motifs of width 7 are generated.
PSM format: Motifs whose width is the length of the longest modified peptide are generated.
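The padding step for PSM-format peptides can be illustrated with a small Python sketch. It is not MoMo's code: the function name center_and_pad and its mod_index argument (the 0-based position of the modified residue) are hypothetical, the protein database lookup is not modeled, and trimming of flanks longer than the window is an assumption made here for simplicity.

```python
def center_and_pad(peptide, mod_index, width):
    """Return a window of `width` residues centered on the modified residue,
    padding with 'X' wherever the peptide is too short on either side."""
    half = width // 2  # width is assumed to be odd
    left = peptide[max(0, mod_index - half):mod_index]
    right = peptide[mod_index + 1:mod_index + 1 + half]
    return ("X" * (half - len(left)) + left + peptide[mod_index]
            + right + "X" * (half - len(right)))
```

For example, center_and_pad("ASK", 1, 7) yields "XXASKXX".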

--seed seed
Description: The seed for initializing the random number generator used for shuffling foreground peptides (preserving the central residue) to create the background peptides, unless you specify option --db-background, below.
Default behavior: A value of 0 is used.

--db-background
Description: The background peptides for the motif-x and MoDL algorithms will be extracted from the protein database if you specify option --protein-database, below.
Default behavior: Shuffled versions (preserving the central residue) of each of the foreground peptides will be used as the background peptides for the motif-x and MoDL algorithms.
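A shuffled background that preserves the central residue could be generated along the following lines. This is a sketch only: the function name is hypothetical, and MoMo's own random number generator and shuffling procedure will not produce the same sequences as Python's random module for a given seed.

```python
import random

def shuffle_preserving_center(peptide, rng):
    """Shuffle the flanking residues while keeping the central residue fixed."""
    mid = len(peptide) // 2
    flanks = list(peptide[:mid] + peptide[mid + 1:])
    rng.shuffle(flanks)
    return "".join(flanks[:mid]) + peptide[mid] + "".join(flanks[mid:])

# A fixed seed makes the background reproducible from run to run.
rng = random.Random(0)
background = [shuffle_preserving_center(p, rng) for p in ["AKRSPLG", "GGETDDA"]]
```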

--protein-database protein database file
Description: A protein database that will be used to allow expansion of modified peptides from PSM formatted PTM input file(s) (see option --width, above), for estimating the amino acid background frequencies, and potentially for creating a set of background peptides (see option --db-background, above). (This is typically the protein database that was used to generate the PTM input file(s).) This file may be in either FASTA or Raw format. If it is in FASTA format, the sequences may be of any length. If it is in Raw format, each sequence must be of the length specified by the --width option, above, and there must be exactly one sequence per line, with no sequence ID lines. The background frequencies are used as follows:

Algorithm	MoMo Behavior
motifx	The background frequencies are used to estimate the binomial probabilities.
modl	The background frequencies are used to estimate the description length.
simple	The background frequencies are included in the MEME Motif Format motif output file `momo.txt`.

Default behavior: Modified peptides from PTM file(s) in PSM format are padded using the Protein IUPAC 'X' character as required, and the amino acid background frequencies are derived from the frequencies in the foreground peptides (after expansion and padding).

--filter field,lt|le|eq|ge|gt,threshold
Description: Specifies a (single) filter that causes only modified peptides that pass the given test to be included in the analysis. The test consists of three components separated by commas with no spaces in between. (If the field name contains spaces, enclose the entire test string in quotes.) The field component of the parameter specifies the name of the column in the PSM format PTM file from which the score is drawn. The next component specifies whether only modified peptides with scores less than (lt), less than or equal (le), equal (eq), greater than or equal (ge), or greater than (gt) the threshold are retained.
Default behavior: No filter.
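The structure of the filter test can be illustrated with a short Python sketch that parses a test string and applies it to one row of a PSM file. The function names and the column name "score" are hypothetical, and the row is assumed to be a dict mapping column names to string values (for instance, as produced by csv.DictReader with a tab delimiter).

```python
import operator

# Map each comparison keyword to the corresponding Python operator.
OPS = {"lt": operator.lt, "le": operator.le, "eq": operator.eq,
       "ge": operator.ge, "gt": operator.gt}

def parse_filter(spec):
    """Split 'field,op,threshold' from the right, so the field may contain commas."""
    field, op, threshold = spec.rsplit(",", 2)
    return field, OPS[op], float(threshold)

def passes(row, spec):
    """True if the row's score in the named column satisfies the test."""
    field, op, threshold = parse_filter(spec)
    return op(float(row[field]), threshold)
```

For example, passes({"score": "0.005"}, "score,le,0.01") evaluates to True.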

--remove-unknowns T|F
Description: If TRUE (T), all foreground and background peptides that contain an 'X' (after expansion and padding, see option --width, above) will be removed from the analysis.
Default behavior: Depends on the algorithm. Either: Do not remove peptides just because they contain an 'X'. Or: Remove peptides if they contain an 'X'.

--eliminate-repeats num
Description: Any groups of peptides in the foreground or background sets whose num central residues (after expansion and padding, see option --width, above) are identical will be replaced with a single copy. Because the window is symmetric around the central, modified residue, num must be odd. To turn this option off, set num to 0.
Note: Since shorter peptides will be padded with the 'X' character, which matches any other character, shorter peptides will match longer ones that contain them, and will be subject to elimination.
Default behavior: Behave as if the value width (see option --width, above) was given for num.
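The core of repeat elimination can be sketched as follows. This simplified Python version compares the num central residues exactly and keeps the first peptide of each group; it deliberately does not model the 'X' wildcard matching described in the note above, and the function name is hypothetical.

```python
def eliminate_repeats(peptides, num):
    """Keep one peptide per distinct window of `num` central residues.
    Simplification: 'X' is treated as an ordinary character, not a wildcard."""
    if num == 0:
        return list(peptides)  # option turned off
    seen, kept = set(), []
    for pep in peptides:
        mid = len(pep) // 2
        key = pep[mid - num // 2: mid + num // 2 + 1]
        if key not in seen:
            seen.add(key)
            kept.append(pep)
    return kept
```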

--min-occurrences num
Description: Only attempt to construct a motif for a particular modification of an amino acid if there are at least num foreground and background peptides (after eliminating repeats, see option --eliminate-repeats, above) that contain it.
Default behavior: Behave as if 5 was given for num.

--min-occurrences num (motif-x)
Description: The minimum number of peptides in the post-translationally modified data set needed to match the residue/position pair for each recursive iteration of motif-x. Also, MoMo only attempts to construct a motif for a particular modification of an amino acid if there are at least num foreground and background peptides (after eliminating repeats, see option --eliminate-repeats, above) that contain it.
Default behavior: Behave as if 20 was given for num.

--single-motif-per-mass
Description: Generate a single motif that combines all central residues that have the same modification mass. Only valid with PSM formatted PTM files. (The modification mass is given as a number following the modified amino acid in the modified peptide as described in the PSM format documentation.) For example, phosphorylation is typically specified as a mass of 79.97 added to the residues S, T or Y. If this option is not given, three separate motifs are generated, each with a perfectly conserved central residue. If this option is given, then all the phosphorylation events are combined into a single motif, with a mixture of S, T and Y in the central position.
Default behavior: Generate a motif for each combination of residue and modification mass.

--hash-fasta k
Description: If a protein database is provided in FASTA format, the process of finding the location of a peptide within the protein can be sped up using an O(1) lookup table that hashes each unique k-mer to an array list of locations. If k is 0, the program will proceed using linear search instead of creating a lookup table.
Note: With a full mammalian proteome as the protein database, MoMo will typically run faster using hashing (e.g., set k to 6) if your PTM file(s) contain more than 50,000 peptides in total.
Default behavior: Behave as if 0 was given for k.
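The k-mer lookup idea can be sketched in Python as below. This is an illustration of the general technique, not MoMo's data structure: the function names are hypothetical, proteins is assumed to be a dict mapping protein names to sequences, and peptides shorter than k are simply not found by this version.

```python
from collections import defaultdict

def build_kmer_index(proteins, k):
    """Map each k-mer to the list of (protein name, start offset) where it occurs."""
    index = defaultdict(list)
    for name, seq in proteins.items():
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].append((name, i))
    return index

def find_peptide(peptide, proteins, index, k):
    """Look up the peptide's first k-mer, then verify the full match at each hit."""
    for name, start in index.get(peptide[:k], []):
        if proteins[name].startswith(peptide, start):
            return name, start
    return None
```

The index costs memory proportional to the database size, which is consistent with the note above that hashing pays off mainly when many peptides must be located.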

--max-motifs num
Description: MoDL will stop after it finds num motifs.
Default behavior: Behave as if 100 was given for num.

--max-iterations num
Description: MoDL will stop after num iterations.
Default behavior: Behave as if 50 was given for num.

--max-no-decrease num
Description: MoDL will stop if there is no decrease in MDL for num iterations.
Default behavior: Behave as if 10 was given for num.

--score-threshold num
Description: The largest binomial probability for a residue/position pair to be counted as significant during each recursive iteration of motif-x.
Default behavior: Behave as if 1e-6 was given for num.

--harvard
Description: Mimic the behavior of (the Harvard version of) motif-x more closely by only calculating binomial p-values no smaller than 10^-16 for residue/position pairs. Smaller p-values are set to 10^-16, and ties are broken by sorting residue/position pairs by decreasing number of peptides that match them.
Default behavior: Calculate residue/position pair binomial p-values in log-space, allowing p-values as small as e^(-10^300).
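To show why log-space computation allows much smaller p-values, here is a Python sketch of a binomial tail probability evaluated entirely in logarithms. It is one common way to do this (log-gamma for the coefficients and a log-sum-exp for the tail) and is not taken from MoMo's source; the function names are hypothetical and p is assumed to lie strictly between 0 and 1.

```python
import math

def log_binom_pmf(n, k, p):
    """log P(X = k) for X ~ Binomial(n, p), with 0 < p < 1."""
    return (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
            + k * math.log(p) + (n - k) * math.log1p(-p))

def log_binom_tail(n, k, p):
    """log P(X >= k), summed stably with the log-sum-exp trick."""
    terms = [log_binom_pmf(n, i, p) for i in range(k, n + 1)]
    m = max(terms)
    return m + math.log(sum(math.exp(t - m) for t in terms))
```

A probability such as 0.001 raised to the power 5000 underflows to zero as an ordinary floating-point number, whereas its logarithm (about -34539) is represented without difficulty.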

