update-sequence-db

Usage:

update-sequence-db [options] <sequence database directory>

Description

This program has three uses:

Input

<sequence database directory>

The directory to contain your MEME Suite sequence database. The MEME Suite expects to find sequence databases in a folder called fasta_databases, either inside the folder <MEME Install Folder>/db or inside the folder given to the configure script with --with-db <DB Install Folder>. Depending on how you configured the MEME Suite, specify either <MEME Install Folder>/db/fasta_databases or <DB Install Folder>/fasta_databases.

Output

You must specify one or more categories of database to update. See the section "Select Category of Databases to Update or List", below.

By default, the program lists all the FASTA sequence files in your chosen database category that are available for download but missing from your database directory. You can use this list to write a regular expression describing the files you want to download. For example, the command

    update-sequence-db --ensemblvertebrates --db_regexp "Human.*DNA" fasta_databases

will download the human genome FASTA file from Ensembl to a directory named fasta_databases on your computer. On the other hand, the command

    update-sequence-db --ensemblvertebrates --db_regexp "Human" fasta_databases

will download both the human genome and the human proteome FASTA files. Finally, the command

    update-sequence-db --ensemblvertebrates --db_all fasta_databases

will download all missing genome and proteome FASTA files for all vertebrates in the current release of Ensembl.
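
The difference between the two regular expressions above can be illustrated with a short Python sketch. The listing names are hypothetical (the real names come from the program's own listing output), the sketch assumes the pattern may match anywhere in a name, and it uses Python's re module rather than Java's regular expression engine, which behaves the same way for simple patterns like these.

```python
import re

# Hypothetical listing names, for illustration only.
available = [
    "Human DNA",
    "Human Protein",
    "Mouse DNA",
]


def select(pattern, names):
    """Return the names in which the pattern is found anywhere (assumed semantics)."""
    rx = re.compile(pattern)
    return [name for name in names if rx.search(name)]


if __name__ == "__main__":
    print(select("Human.*DNA", available))  # the narrower pattern
    print(select("Human", available))       # the broader pattern
```

Listing the missing files first and testing a pattern against that list is a cheap way to avoid downloading more than intended.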

The first time it is called, the program creates two folders, downloads and logs, and a SQLite database called fasta_db.sqlite. Each FASTA sequence file is kept in the downloads folder until it has been completely downloaded. It is then decompressed, or merged from multiple sources, as required and written to a sequence file with the extension .faa (protein) or .fna (DNA). The sequence file is then processed by fasta-get-markov to calculate a first-order background model, which is stored in a file with the extension .bfile. For each genome file, an index file ending in .fai, for use by bed2fasta, is created automatically using the fasta-file-indexer utility. fasta-get-markov is also called to calculate the number of sequences and their shortest, longest and average lengths; all of this information is stored in the SQLite database.
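
The summary statistics described above can be illustrated with a small Python sketch. It is not how fasta-get-markov is implemented; it only shows what the numbers mean. The dictionary keys borrow the column names of tblSequenceFile from the Database Schema section, and the example sequences are made up.

```python
def read_fasta_lengths(lines):
    """Yield the length of each sequence in an iterable of FASTA lines."""
    length = None  # None until the first header line is seen
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if length is not None:
                yield length
            length = 0
        elif line and length is not None:
            length += len(line)
    if length is not None:
        yield length


def summarize(lines):
    """Compute the count and length statistics for a FASTA file."""
    lengths = list(read_fasta_lengths(lines))
    if not lengths:
        raise ValueError("no sequences found")
    return {
        "sequenceCount": len(lengths),
        "totalLen": sum(lengths),
        "minLen": min(lengths),
        "maxLen": max(lengths),
        "avgLen": sum(lengths) / len(lengths),
    }


if __name__ == "__main__":
    # Made-up sequences; the first one spans two lines.
    example = [">seq1", "ACGT", "AC", ">seq2", "GGGTTTAA", ">seq3", "A"]
    print(summarize(example))
```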

Configuration

Configuration files that tweak the behaviors of the sequence database downloaders will be automatically generated in the conf/ subdirectory within the specified sequence database directory.

Additionally, the miscellaneous source downloader (--misc) checks the conf/ subdirectory for files with the extension .csv, which it reads to determine sequence sources. The MEME Suite distribution includes two such files, db_general.csv and db_other_genomes.csv, in its etc folder. You can move them into the conf folder; this is not done automatically during installation.
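
Since the example .csv files are not installed automatically, a small helper can place them. The following Python sketch is one possible approach, not part of the MEME Suite: the function names are hypothetical, and the locations of the etc folder and the sequence database directory are whatever your installation uses. It copies the files rather than moving them, so the originals stay in place.

```python
import shutil
from pathlib import Path

# File names taken from the text above.
EXAMPLE_FILES = ("db_general.csv", "db_other_genomes.csv")


def install_misc_sources(etc_dir, db_dir, names=EXAMPLE_FILES):
    """Copy the example .csv files from etc_dir into db_dir/conf.

    Returns the names of the files that were found and copied.
    """
    conf = Path(db_dir) / "conf"
    conf.mkdir(parents=True, exist_ok=True)
    copied = []
    for name in names:
        src = Path(etc_dir) / name
        if src.is_file():
            shutil.copy2(src, conf / name)
            copied.append(name)
    return copied


def list_misc_sources(db_dir):
    """List the .csv files that the --misc downloader would look for in conf/."""
    return sorted(p.name for p in (Path(db_dir) / "conf").glob("*.csv"))
```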

Options

Option | Parameter | Description | Default Behavior

General Options
--db_all | | Download all available FASTA files from the chosen category. | List missing FASTA files from the chosen category.
--db_regexp | | Download only FASTA files from the chosen category matching the given Java regular expression. | List missing FASTA files from the chosen category.
--db_rel | | Download FASTA files or list missing FASTA files from this specific release of the chosen category. | Use most recent release of the chosen category.
--list_rel | | List available database release numbers of the chosen category and exit. | Download FASTA files or list missing FASTA files from the chosen category.

Select Category of Databases to Update or List
--ensemblbacteria | | Update bacteria genomes and proteins from Ensembl. | Do not update this type of database.
--ensemblfungi | | Update fungi genomes and proteins from Ensembl. | Do not update this type of database.
--ensemblmetazoa | | Update metazoa genomes and proteins from Ensembl. | Do not update this type of database.
--ensemblplants | | Update plant genomes and proteins from Ensembl. | Do not update this type of database.
--ensemblprotists | | Update protist genomes and proteins from Ensembl. | Do not update this type of database.
--ensemblvertebrates | | Update vertebrate genomes and proteins from Ensembl. | Do not update this type of database.
--ensemblvertebratesabinitio | | Update vertebrate ab initio proteins from Ensembl. | Do not update this type of database.
--genbankbacteria | | Update bacteria genomes and proteins from GenBank. | Do not update this type of database.
--genbankfungi | | Update fungi genomes and proteins from GenBank. | Do not update this type of database.
--misc | | Update the miscellaneous sequence databases specified in .csv files in the database subdirectory conf/. There are two example .csv files in the MEME Suite etc/ directory. | Do not update this type of database.
--rsatfungi | | Update the Fungi upstream sequence databases from RSAT. | Do not update this type of database.
--rsatmetazoan | | Update the Metazoan upstream sequence databases from RSAT. | Do not update this type of database.
--rsatplants | | Update the Plant upstream sequence databases from RSAT. | Do not update this type of database.
--rsatprokaryotes | | Update the Prokaryote upstream sequence databases from RSAT. | Do not update this type of database.
--rsatprotists | | Update the Protist upstream sequence databases from RSAT. | Do not update this type of database.
--ucscdeuterostome | | Update the deuterostome genomes from UCSC. | Do not update this type of database.
--ucscinsect | | Update the insect genomes from UCSC. | Do not update this type of database.
--ucscmammal | | Update the mammal genomes from UCSC. | Do not update this type of database.
--ucscnematode | | Update the nematode genomes from UCSC. | Do not update this type of database.
--ucscother | | Update the "other" genomes from UCSC. | Do not update this type of database.
--ucscvertebrate | | Update the vertebrate genomes from UCSC. | Do not update this type of database.
--ucscvirus | | Update the virus genomes from UCSC. | Do not update this type of database.
--updater | classname | Experimental. Specify the classname of a custom updater. |

File Cleanup Options
--obsolete | file pattern | Mark any sequence databases that match the given glob syntax file pattern as obsolete, causing them to be hidden from the interface. This option may be repeated to specify multiple patterns. After the files are obsoleted the updater exits. | Run as normal.
--delete_old | | Sequence databases marked as obsolete (on a previous update) will be deleted. | Sequence databases marked as obsolete will be left untouched.
--retain_missing | | Database entries for missing files are retained. | Database entries for missing files are removed.
--purge_only | | Delete sequence databases for which the files have been lost and exit. | Update the selected database(s).

Backwards Compatibility Options
--csv | :directory | Create a CSV file and an index file listing all the databases, to enable backwards compatibility with older releases. The directory in which to create the CSV and index files can be specified if desired; if it is not specified, they are placed in the sequence database directory. | Don't create a CSV or index file.

Miscellaneous Options
--bin | directory | Specify where to find the fasta-get-markov tool. | The program will search the configured bin directory and, if fasta-get-markov is not present, it will search the path.
--log | log file | Specify the file to write logs. | A log will be written to the logs directory below the sequence database directory.
--priors | tsv file | Specify a tab separated values file listing all the priors that should be listed in the database. The updater will exit after changing the priors. Note that pre-existing priors will be removed! | Run as normal.
--v | log level | Specify the logging level [1-8]. | A default logging level of 3 is used, which outputs errors, warnings and summary information.
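
When the updater is driven from a script, it can help to assemble the command line programmatically. The Python sketch below is a hypothetical helper, not part of the MEME Suite. It only builds the argument list from options documented above (a category switch, --db_regexp, --db_all and --db_rel) and does not run anything. The release value in the usage example is made up.

```python
import shlex


def build_command(db_dir, categories, regexp=None, download_all=False, release=None):
    """Build an argument list for update-sequence-db (hypothetical helper).

    categories are option names without the leading dashes,
    for example "ensemblvertebrates".
    """
    if not categories:
        raise ValueError("at least one database category is required")
    argv = ["update-sequence-db"]
    argv += ["--" + category for category in categories]
    if regexp is not None:
        argv += ["--db_regexp", regexp]
    if download_all:
        argv.append("--db_all")
    if release is not None:
        argv += ["--db_rel", str(release)]
    argv.append(db_dir)
    return argv


if __name__ == "__main__":
    cmd = build_command("fasta_databases", ["ensemblvertebrates"],
                        regexp="Human.*DNA", release=99)
    # shlex.join quotes arguments so the line can be pasted into a shell.
    print(shlex.join(cmd))
```

Passing the list to subprocess.run (without shell=True) avoids having to quote the regular expression by hand.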

Adding Priors

MCAST and FIMO support priors for sequence databases, but adding them is still a manual process. The process will probably be automated in the future; until then, add priors as follows.

  1. Use create-priors to create priors from a prior source such as DNase1 hypersensitivity sequence tag counts. This creates two files, priors.wig and priors.dist, which you should rename in a way that makes sense.
  2. Run gzip on each of the files. This leaves them with the extension ".gz". It is important to keep this extension so that the webservice script can decompress them later.
  3. Move the gzip-ed ".wig" and ".dist" files into the sequence database directory. They may be at the top level or nested within folders, but they must be accessible by a relative path without any ".." elements.
  4. Create a file listing all the priors, one per line, with the fields separated by tabs.
    Field | Description
    Sequence File | The path to the sequence file relative to the sequence database directory.
    Wig File | The path to the gzip-ed ".wig" file relative to the sequence database directory.
    Dist File | The path to the gzip-ed ".dist" file relative to the sequence database directory.
    Biosample | A short descriptive name for the sample used in the experiment that the priors were derived from.
    Assay | A short descriptive name for the experiment that the priors were derived from.
    Source | A short descriptive name of the lab or group that performed the experiment that the priors were derived from.
    URL | A URL linking to further information on the experiment.
    Description | A description of the experiment which may contain HTML.
  5. Finally, run
    update-sequence-db --priors <priors tsv> <sequence database directory>
    which will replace the existing priors in the database with those listed in the TSV file. Check the log file generated by update-sequence-db to ensure that all the priors were added without error.
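
Step 4 can be scripted. The Python sketch below writes one tab-separated line per prior, with the fields in the order listed in step 4, and rejects paths that are absolute or contain ".." elements as required by step 3. It assumes the file has no header line and no quoting, which the text above does not state either way. The function names, dictionary keys and example file names are hypothetical.

```python
import io
from pathlib import PurePosixPath

# Field order from step 4; the key names themselves are made up.
FIELD_ORDER = ("sequence_file", "wig_file", "dist_file", "biosample",
               "assay", "source", "url", "description")
PATH_FIELDS = FIELD_ORDER[:3]


def check_relative(path):
    """Raise ValueError unless path is relative and free of '..' elements."""
    if path.startswith("/") or ".." in PurePosixPath(path).parts:
        raise ValueError("path must be relative without '..': " + path)


def format_prior(entry):
    """Return one tab-separated line (without newline) for a prior entry."""
    for key in PATH_FIELDS:
        check_relative(entry[key])
    values = [entry[key] for key in FIELD_ORDER]
    for value in values:
        if "\t" in value or "\n" in value:
            raise ValueError("fields must not contain tabs or newlines")
    return "\t".join(values)


def write_priors_tsv(entries, out):
    """Write all entries to the text stream out, one per line."""
    for entry in entries:
        out.write(format_prior(entry) + "\n")


if __name__ == "__main__":
    example = {
        "sequence_file": "example.fna",                 # hypothetical names
        "wig_file": "priors/example_priors.wig.gz",
        "dist_file": "priors/example_priors.dist.gz",
        "biosample": "Example cell line",
        "assay": "DNase1 hypersensitivity",
        "source": "Example Lab",
        "url": "http://example.org/experiment",
        "description": "An <b>example</b> entry.",
    }
    buffer = io.StringIO()
    write_priors_tsv([example], buffer)
    print(buffer.getvalue(), end="")
```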

Database Schema

As well as downloading the sequence files from many sources, the updater tracks the files using a SQLite database. The schema of the database is given below.

tblCategory

Column | Type | Constraint | Description
id | INTEGER | PRIMARY KEY | An auto-generated unique identifier for the category. Other tables reference this field.
name | TEXT | UNIQUE NOT NULL | The unique name of the category as shown to users.

tblListing

Column | Type | Constraint | Description
id | INTEGER | PRIMARY KEY | An auto-generated unique identifier for the listing. Other tables reference this field.
categoryId | INTEGER | NOT NULL REFERENCES tblCategory (id) | The identifier of the category that contains this listing.
name | TEXT | NOT NULL | The name of the listing shown to users.
description | TEXT | NOT NULL | The description of the listing shown to users.

The combination of the fields categoryId and name is unique.

tblSequenceFile

Column | Type | Constraint | Description
id | INTEGER | PRIMARY KEY | An auto-generated unique identifier for the sequence file.
retriever | INTEGER | NOT NULL | An identifier for the code module that downloaded this sequence. It allows the individual code modules to ensure they don't change the records of files downloaded by other modules.
listingId | INTEGER | NOT NULL REFERENCES tblListing (id) | The identifier of the listing that contains this sequence file.
alphabet | INTEGER | NOT NULL CHECK (alphabet IN (1, 2, 4)) | Represents the alphabet as powers of 2 so they can be combined into a bitset: RNA = 1, DNA = 2, Protein = 4.
edition | INTEGER | NOT NULL | A machine readable version. This field is used for sorting. Larger numbers are considered newer.
version | TEXT | NOT NULL | A human readable version which is displayed to the user.
description | TEXT | NOT NULL | The description of the sequence file, often containing information about the source.
fileSeq | TEXT | UNIQUE NOT NULL | The relative path to the sequence file.
fileBg | TEXT | UNIQUE NOT NULL | The relative path to the background file.
sequenceCount | INTEGER | NOT NULL | The number of sequences.
totalLen | INTEGER | NOT NULL | The total end-to-end combined length of the sequences.
minLen | INTEGER | NOT NULL | The length of the shortest sequence.
maxLen | INTEGER | NOT NULL | The length of the longest sequence.
avgLen | REAL | NOT NULL | The average length of the sequences.
stdDLen | REAL | NOT NULL | Currently unused! Intended to store the standard deviation of the average length.
obsolete | INTEGER | DEFAULT 0 | Used to flag sequences as obsolete. Sequences flagged as obsolete are hidden from the interface.

The combination of the fields listingId, alphabet and edition is unique.
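
Because the alphabet codes are powers of 2, several alphabets can be combined into one mask and tested with a bitwise AND. The Python sketch below illustrates the idea. The class and function names are hypothetical. Only the numeric codes come from the table above, and each stored row still holds exactly one of 1, 2 or 4.

```python
from enum import IntFlag


class Alphabet(IntFlag):
    """Alphabet codes as stored in tblSequenceFile.alphabet."""
    RNA = 1
    DNA = 2
    PROTEIN = 4


def matches(stored, mask):
    """Return True if the stored alphabet code is included in the mask."""
    return bool(Alphabet(stored) & mask)


if __name__ == "__main__":
    # A mask accepting either nucleic acid alphabet.
    nucleic = Alphabet.RNA | Alphabet.DNA
    for code in (1, 2, 4):
        print(code, matches(code, nucleic))
```

A caller that accepts, say, any nucleotide database can keep a single integer mask instead of a list of allowed codes.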

tblPriorFile

Column | Type | Constraint | Description
id | INTEGER | PRIMARY KEY | An auto-generated unique identifier for the prior.
sequenceId | INTEGER | NOT NULL REFERENCES tblSequenceFile (id) | The identifier of the sequence that is associated with this prior.
filePrior | TEXT | UNIQUE NOT NULL | The relative path to the wig file (which may be gzipped).
fileDist | TEXT | UNIQUE NOT NULL | The relative path to the dist file (which may be gzipped).
biosample | TEXT | NOT NULL | A short descriptive name for the sample used in the experiment that the priors were derived from.
assay | TEXT | NOT NULL | A short descriptive name for the experiment that the priors were derived from.
source | TEXT | NOT NULL | A short descriptive name of the lab or group that performed the experiment that the priors were derived from.
url | TEXT | NOT NULL | A URL linking to further information on the experiment.
description | TEXT | NOT NULL | A description of the experiment which may contain HTML.
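
For experimenting with queries, the schema can be recreated in an in-memory SQLite database. The DDL in the Python sketch below is reconstructed from the tables in this section and is not copied from the updater's source, so details of the real fasta_db.sqlite may differ. The functions open_db and newest_files are hypothetical. The query returns, for each listing and alphabet, the sequence file with the largest edition that is not flagged obsolete.

```python
import sqlite3

# DDL reconstructed from the documented columns and constraints (an assumption).
SCHEMA = """
CREATE TABLE tblCategory (
  id INTEGER PRIMARY KEY,
  name TEXT UNIQUE NOT NULL
);
CREATE TABLE tblListing (
  id INTEGER PRIMARY KEY,
  categoryId INTEGER NOT NULL REFERENCES tblCategory (id),
  name TEXT NOT NULL,
  description TEXT NOT NULL,
  UNIQUE (categoryId, name)
);
CREATE TABLE tblSequenceFile (
  id INTEGER PRIMARY KEY,
  retriever INTEGER NOT NULL,
  listingId INTEGER NOT NULL REFERENCES tblListing (id),
  alphabet INTEGER NOT NULL CHECK (alphabet IN (1, 2, 4)),
  edition INTEGER NOT NULL,
  version TEXT NOT NULL,
  description TEXT NOT NULL,
  fileSeq TEXT UNIQUE NOT NULL,
  fileBg TEXT UNIQUE NOT NULL,
  sequenceCount INTEGER NOT NULL,
  totalLen INTEGER NOT NULL,
  minLen INTEGER NOT NULL,
  maxLen INTEGER NOT NULL,
  avgLen REAL NOT NULL,
  stdDLen REAL NOT NULL,
  obsolete INTEGER DEFAULT 0,
  UNIQUE (listingId, alphabet, edition)
);
CREATE TABLE tblPriorFile (
  id INTEGER PRIMARY KEY,
  sequenceId INTEGER NOT NULL REFERENCES tblSequenceFile (id),
  filePrior TEXT UNIQUE NOT NULL,
  fileDist TEXT UNIQUE NOT NULL,
  biosample TEXT NOT NULL,
  assay TEXT NOT NULL,
  source TEXT NOT NULL,
  url TEXT NOT NULL,
  description TEXT NOT NULL
);
"""


def open_db(path=":memory:"):
    """Create a database with the reconstructed schema and return the connection."""
    conn = sqlite3.connect(path)
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves this off by default
    conn.executescript(SCHEMA)
    return conn


def newest_files(conn):
    """Return (listing name, version, fileSeq) for the newest non-obsolete files."""
    return conn.execute("""
        SELECT l.name, s.version, s.fileSeq
        FROM tblSequenceFile AS s
        JOIN tblListing AS l ON l.id = s.listingId
        WHERE s.obsolete = 0
          AND s.edition = (SELECT MAX(n.edition)
                           FROM tblSequenceFile AS n
                           WHERE n.listingId = s.listingId
                             AND n.alphabet = s.alphabet
                             AND n.obsolete = 0)
        ORDER BY l.name, s.alphabet
    """).fetchall()
```

Work on a copy of fasta_db.sqlite if you query the real database; the updater expects to be the only writer.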