FASTA Sequence Format

Description

Various MEME Suite programs require as input a file containing protein or DNA sequences. These input files must be in FASTA format.

Format Specification

Every entry consists of a sequence identifier (ID), an optional comment (COMMENT), and a sequence (SEQUENCE). The format looks like this:

>ID COMMENT
SEQUENCE

The special character ">" marks the beginning of a new sequence. The ">" character is followed immediately by the sequence identifier. The rest of that line is occupied by the optional comment. Subsequent lines contain the sequence itself.
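The layout described above can be read with a few lines of Python. This is a minimal sketch and not the parser used by the MEME Suite; the names `FastaRecord` and `parse_fasta` are hypothetical. It assumes that the identifier ends at the first whitespace character and that blank lines can be skipped.

```python
from typing import Iterable, Iterator, List, NamedTuple, Optional


class FastaRecord(NamedTuple):
    """One FASTA entry (hypothetical helper type for this sketch)."""
    seq_id: str
    comment: str
    sequence: str


def parse_fasta(lines: Iterable[str]) -> Iterator[FastaRecord]:
    """Yield one FastaRecord per ">" header found in `lines`."""
    seq_id: Optional[str] = None
    comment = ""
    chunks: List[str] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # assumption: blank lines carry no information
        if line.startswith(">"):
            # A new header closes the previous entry, if there was one.
            if seq_id is not None:
                yield FastaRecord(seq_id, comment, "".join(chunks))
            # Assumption: the ID ends at the first whitespace character;
            # everything after it on the header line is the comment.
            header = line[1:].split(None, 1)
            seq_id = header[0] if header else ""
            comment = header[1] if len(header) > 1 else ""
            chunks = []
        elif seq_id is not None:
            # Sequence data may span several lines; join them later.
            chunks.append(line)
    if seq_id is not None:
        yield FastaRecord(seq_id, comment, "".join(chunks))
```

Because the function accepts any iterable of lines, it can be given an open file object directly, for example `for record in parse_fasta(open("sequences.fasta")): ...`, where the file name is a placeholder.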

Some rules about representing sequences:

Here is an example of three protein sequences in FASTA format:

>ICYA_MANSE 
GDIFYPGYCPDVKPVNDFDLSAFAGAWHEIAKLPLENENQGKCTIAEYKY
DGKKASVYNSFVSNGVKEYMEGDLEIAPDAKYTKQGKYVMTFKFGQRVVN
LVPWVLATDYKNYAINYNCDYHPDKKAHSIHAWILSKSKVLEGNTKEVVD
NVLKTFSHLIDASKFISNDFSEAACQYSTTYSLTGPDRH

>LACB_BOVIN 
MKCLLLALALTCGAQALIVTQTMKGLDIQKVAGTWYSLAMAASDISLLDA
QSAPLRVYVEELKPTPEGDLEILLQKWENGECAQKKIIAEKTKIPAVFKI
DALNENKVLVLDTDYKKYLLFCMENSAEPEQSLACQCLVRTPEVDDEALE
KFDKALKALPMHIRLSFNPTQLEEQCHI

>BBP_PIEBR 
NVYHDGACPEVKPVDNFDWSNYHGKWWEVAKYPNSVEKYGKCGWAEYTPE
GKSVKVSNYHVIHGKEYFIEGTAYPVGDSKIGKIYHKLTYGGVTKENVFN
VLSTDNKNYIIGYYCKYDEDKKGHQDFVWVLSRSKVLTGEAKTAVENYLI
GSPVVDSQKLVYSDFSEAACKVN

Weights (MEME-only extension)

When running MEME, sequence weights may be specified in the dataset file by special header lines in which the sequence identifier is "WEIGHTS" (all caps) and the comment is a list of sequence weights.

Sequence weights are numbers in the range 0 < w ≤ 1. Weights are assigned to the sequences in the order in which the sequences appear in the file. If there are more sequences than weights, the remaining sequences are given a weight of one. Weights may be specified by more than one "WEIGHTS" entry, and these entries may appear anywhere in the file. When weights are used, sequences contribute to motifs in proportion to their weights.

Here is an example of a file of three sequences in which the first two sequences are very similar and are therefore down-weighted:

>WEIGHTS 0.5 .5 1.0 
>seq1
GDIFYPGYCPDVKPVNDFDLSAFAGAWHEIAK
>seq2
GDMFCPGYCPDVKPVGDFDLSAFAGAWHELAK
>seq3
QKVAGTWYSLAMAASDISLLDAQSAPLRVYVEELKPTPEGDLEILLQKW
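
The weight assignment rules can be sketched in Python as follows. This is an illustration only and not MEME's own code; the function name `read_weighted_sequences` is hypothetical. It assumes that the weights from multiple "WEIGHTS" entries are concatenated in file order, and that surplus weights (more weights than sequences) are ignored, which the description above does not spell out.

```python
from typing import Iterable, List, Optional, Tuple


def read_weighted_sequences(lines: Iterable[str]) -> List[Tuple[str, str, float]]:
    """Return (name, sequence, weight) triples for a MEME-style dataset."""
    weights: List[float] = []
    names: List[str] = []
    seqs: List[List[str]] = []
    current: Optional[List[str]] = None  # chunks of the sequence being read
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            fields = line[1:].split()
            if fields and fields[0] == "WEIGHTS":
                # A WEIGHTS header contributes weights, not a sequence.
                for token in fields[1:]:
                    w = float(token)
                    if not 0.0 < w <= 1.0:
                        raise ValueError(f"weight out of range (0, 1]: {token}")
                    weights.append(w)
                current = None
            else:
                names.append(fields[0] if fields else "")
                current = []
                seqs.append(current)
        elif current is not None:
            current.append(line)
    # Sequences without an explicit weight are given weight one.
    # Assumption: weights beyond the number of sequences are dropped.
    padded = weights[:len(names)] + [1.0] * (len(names) - len(weights))
    return [(n, "".join(s), w) for n, s, w in zip(names, seqs, padded)]
```

Applied to the example above, this sketch pairs seq1 and seq2 with weight 0.5 and seq3 with weight 1.0. If the "WEIGHTS" line listed only the first two values, seq3 would still receive the default weight of one.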