update-sequence-db [options] <sequence database directory>
This program has three uses:
Note: You must select one of the supported categories using an option from the second group of options, below.
This command creates a SQLite database called fasta_db.sqlite and downloads FASTA sequence files from multiple sources while storing information about the sequences in the SQLite database.
The program starts in status display mode, where it gives regular updates on what it is doing. You can switch it to command mode by pressing Enter. In command mode you can type the two basic commands: "help", which shows the available commands, and "status", which switches back to status mode. While sequences are downloading you may use the command "exit" to stop any further downloading.
The directory to contain your MEME Suite sequence database. The MEME Suite expects to find sequence databases in a folder called fasta_databases, either inside the folder MEME Install Folder/db or in the folder specified to the configure script with --with-db DB Install Folder. Depending on how you configured the MEME Suite, you should specify either MEME Install Folder/db/fasta_databases or DB Install Folder/fasta_databases.
You must specify one or more categories of database to update. See the section "Select Category of Databases to Update or List", below.
By default, the program will list all the FASTA sequence files missing from your chosen database category that are available for download. You can use this information to create a regular expression describing the files you want to download. For example, the commandfasta_databases
on your computer. On the other hand, the command
The first time it is called, the program creates a folder called downloads and a folder called logs. It also creates a SQLite database called fasta_db.sqlite. Every FASTA sequence file is initially put in the folder downloads until it has been completely downloaded. Once the download is complete, the file is decompressed or merged from multiple sources as required and put into a sequence file with either a .faa or .fna extension for protein or DNA sequences, respectively. Once the sequence has been expanded it is processed by fasta-get-markov to calculate a first-order background model in a file with the extension .bfile. For each genome file, an index file ending in .fai for use by bed2fasta is automatically created using the fasta-file-indexer utility. Additionally, fasta-get-markov is called to calculate the number of sequences and their shortest, longest and average size, and all this information is stored in the SQLite database.
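To make the stored statistics concrete, here is a minimal Python sketch that computes the same kinds of quantities (number of sequences, total, shortest, longest and average length) from the lines of a FASTA file. It is only an illustration: the updater obtains these values from fasta-get-markov, and the names `FastaStats` and `fasta_stats` are hypothetical.

```python
from typing import Iterable, NamedTuple


class FastaStats(NamedTuple):
    """Hypothetical container for the per-file statistics described above."""
    sequence_count: int
    total_len: int
    min_len: int
    max_len: int
    avg_len: float


def fasta_stats(lines: Iterable[str]) -> FastaStats:
    """Summarise sequence lengths from the lines of a FASTA file."""
    lengths = []
    current = None  # length of the sequence being read; None before the first header
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            # A header line closes the previous sequence and starts a new one.
            if current is not None:
                lengths.append(current)
            current = 0
        elif line and current is not None:
            current += len(line)
    if current is not None:
        lengths.append(current)
    if not lengths:
        raise ValueError("no sequences found")
    total = sum(lengths)
    return FastaStats(len(lengths), total, min(lengths), max(lengths),
                      total / len(lengths))


# Two toy sequences of length 6 and 10.
example = [">seq1", "ACGT", "AC", ">seq2", "ACGTACGTAC"]
stats = fasta_stats(example)
print(stats)
```

For a real file, the same function can be fed an open file handle, since iterating over a text file yields its lines.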
Configuration files that tweak the behaviors of the sequence database downloaders will be automatically generated in the conf/ subdirectory within the specified sequence database directory. Additionally, the miscellaneous source downloader (--misc) will check the conf/ subdirectory for any files ending with the extension .csv, which it reads to determine sequence sources. The MEME Suite includes two files, db_general.csv and db_other_genomes.csv, in the distribution's etc folder which you can move into the conf folder, though this is not done automatically during install.
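As a rough illustration of which files the --misc downloader would consider, the following Python sketch lists the .csv files in the conf/ subdirectory of a sequence database directory. The function name `find_source_csvs` is hypothetical, and the exact matching rules used by the updater (for example, case sensitivity of the extension) are an assumption here.

```python
import tempfile
from pathlib import Path
from typing import List


def find_source_csvs(db_dir: Path) -> List[Path]:
    """Return the .csv files directly inside <db_dir>/conf, sorted by name."""
    conf = db_dir / "conf"
    if not conf.is_dir():
        return []
    # Assumption: only files whose name ends in exactly ".csv" are considered.
    return sorted(p for p in conf.iterdir() if p.is_file() and p.suffix == ".csv")


# Demonstration in a throwaway directory standing in for the database directory.
with tempfile.TemporaryDirectory() as tmp:
    db_dir = Path(tmp)
    conf_dir = db_dir / "conf"
    conf_dir.mkdir()
    for name in ("db_general.csv", "db_other_genomes.csv", "notes.txt"):
        (conf_dir / name).write_text("")
    found_names = [p.name for p in find_source_csvs(db_dir)]

print(found_names)
```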
Option | Parameter | Description | Default Behavior |
---|---|---|---|
General Options | | | |
--db_all | | Download all available FASTA files from the chosen category. | List missing FASTA files from the chosen category. |
--db_regexp | | Download only FASTA files from the chosen category matching the given Java regular expression. | List missing FASTA files from the chosen category. |
--db_rel | | Download FASTA files or list missing FASTA files from this specific release of the chosen category. | Use most recent release of the chosen category. |
--list_rel | | List available database release numbers of the chosen category and exit. | Download FASTA files or list missing FASTA files from the chosen category. |
Select Category of Databases to Update or List | | | |
--ensemblbacteria | | Update bacteria genomes and proteins from Ensembl. | Do not update this type of database. |
--ensemblfungi | | Update fungi genomes and proteins from Ensembl. | Do not update this type of database. |
--ensemblmetazoa | | Update metazoa genomes and proteins from Ensembl. | Do not update this type of database. |
--ensemblplants | | Update plant genomes and proteins from Ensembl. | Do not update this type of database. |
--ensemblprotists | | Update protist genomes and proteins from Ensembl. | Do not update this type of database. |
--ensemblvertebrates | | Update vertebrate genomes and proteins from Ensembl. | Do not update this type of database. |
--ensemblvertebratesabinitio | | Update vertebrate ab initio proteins from Ensembl. | Do not update this type of database. |
--genbankbacteria | | Update bacteria genomes and proteins from GenBank. | Do not update this type of database. |
--genbankfungi | | Update fungi genomes and proteins from GenBank. | Do not update this type of database. |
--misc | | Update the miscellaneous sequence databases specified in .csv files in the database subdirectory conf/. There are two example .csv files in the MEME Suite etc/ directory. | Do not update this type of database. |
--rsatfungi | | Update the Fungi upstream sequence databases from RSAT. | Do not update this type of database. |
--rsatmetazoan | | Update the Metazoan upstream sequence databases from RSAT. | Do not update this type of database. |
--rsatplants | | Update the Plant upstream sequence databases from RSAT. | Do not update this type of database. |
--rsatprokaryotes | | Update the Prokaryote upstream sequence databases from RSAT. | Do not update this type of database. |
--rsatprotists | | Update the Protist upstream sequence databases from RSAT. | Do not update this type of database. |
--ucscdeuterostome | | Update the deuterostome genomes from UCSC. | Do not update this type of database. |
--ucscinsect | | Update the insect genomes from UCSC. | Do not update this type of database. |
--ucscmammal | | Update the mammal genomes from UCSC. | Do not update this type of database. |
--ucscnematode | | Update the nematode genomes from UCSC. | Do not update this type of database. |
--ucscother | | Update the "other" genomes from UCSC. | Do not update this type of database. |
--ucscvertebrate | | Update the vertebrate genomes from UCSC. | Do not update this type of database. |
--ucscvirus | | Update the virus genomes from UCSC. | Do not update this type of database. |
--updater | classname | Experimental. Specify the classname of a custom updater. | |
File Cleanup Options | | | |
--obsolete | file pattern | Mark any sequence databases that match the given glob syntax file pattern as obsolete, causing them to be hidden from the interface. This option may be repeated to specify multiple patterns. After the files are obsoleted the updater exits. | Run as normal. |
--delete_old | | Sequence databases marked as obsolete (on a previous update) will be deleted. | Sequence databases marked as obsolete will be left untouched. |
--retain_missing | | Database entries for missing files are retained. | Database entries for missing files are removed. |
--purge_only | | Delete sequence databases for which the files have been lost and exit. | Update the selected database(s). |
Backwards Compatibility Options | | | |
--csv:directory | | Create a CSV file and index file that lists all the databases to enable backwards compatibility with older releases. The directory in which to create the CSV and index file can be specified if desired; if it is not specified, the CSV and index file will be placed in the sequence database directory. | Don't create a CSV or index file. |
Miscellaneous Options | | | |
--bin | directory | Specify where to find the fasta-get-markov tool. | The program will search the configured bin directory and if fasta-get-markov is not present it will search the path. |
--log | log file | Specify the file to write logs. | A log will be written to the logs directory below the sequence database directory. |
--priors | tsv file | Specify a tab separated values file listing all the priors that should be listed in the database. The updater will exit after changing the priors. Note that pre-existing priors will be removed! | Run as normal. |
--v | log level | Specify the logging level [1-8]. | A default logging level of 3 is used which outputs errors, warnings and summary information. |
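The options above can be combined into a command line as in the following Python sketch, which only assembles the argument list and does not run anything. It assumes that an option's parameter is passed as the next separate argument, and the helper name `build_command`, the directory name and the regular expression are hypothetical examples rather than values taken from the MEME Suite.

```python
import shlex
from typing import List, Optional


def build_command(db_dir: str, category: str,
                  regexp: Optional[str] = None,
                  release: Optional[str] = None,
                  download_all: bool = False) -> List[str]:
    """Assemble an update-sequence-db argument list from documented options."""
    argv = ["update-sequence-db", "--" + category]
    if download_all:
        argv.append("--db_all")
    if regexp is not None:
        # Assumption: the Java regular expression follows the option as its own argument.
        argv += ["--db_regexp", regexp]
    if release is not None:
        argv += ["--db_rel", release]
    argv.append(db_dir)  # the sequence database directory comes last, per the usage line
    return argv


# With no download option, the default behavior is to list missing files.
listing_cmd = build_command("fasta_databases", "ensemblvertebrates")
# Hypothetical pattern; check the listing output for the real file names first.
download_cmd = build_command("fasta_databases", "ensemblvertebrates",
                             regexp="Homo_sapiens.*")

print(shlex.join(listing_cmd))
print(shlex.join(download_cmd))
```

Printing the quoted command rather than executing it makes it easy to review the regular expression before any download starts.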
MCAST and FIMO support priors for sequence databases, but adding them is still a manual process. The process will probably be automated in the future; until then, this is how you add priors.
Start from the prior files priors.wig and priors.dist, which you should rename in a way that makes sense.
Run gzip on each of the files. This should leave them with the extension ".gz". It is important that you leave this extension so they can be ungzip-ed by the webservice script later.
List the priors in a tab separated values file (see the --priors option) with the following fields:
Field | Description |
---|---|
Sequence File | The path to the sequence file relative to the sequence database directory. |
Wig File | The path to the gzip-ed ".wig" file relative to the sequence database directory. |
Dist File | The path to the gzip-ed ".dist" file relative to the sequence database directory. |
Biosample | A short descriptive name for the sample used in the experiment that the priors were derived from. |
Assay | A short descriptive name for the experiment that the priors were derived from. |
Source | A short descriptive name of the lab or group that performed the experiment that the priors were derived from. |
URL | A URL linking to further information on the experiment. |
Description | A description of the experiment which may contain HTML. |
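The manual steps above can be sketched in Python as follows. This is a hedged illustration, not the MEME Suite's own procedure: it assumes the tab separated values file holds one row per prior with the fields in the order of the table above and no header row, and every file name and field value shown is a hypothetical placeholder.

```python
import csv
import gzip
import shutil
import tempfile
from pathlib import Path

# Field order taken from the table above (assumed to be the column order).
FIELDS = ["Sequence File", "Wig File", "Dist File", "Biosample",
          "Assay", "Source", "URL", "Description"]


def gzip_file(path: Path) -> Path:
    """Compress a file, keeping its name and adding the ".gz" extension."""
    gz_path = path.with_name(path.name + ".gz")
    with open(path, "rb") as src, gzip.open(gz_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    path.unlink()  # mimic command-line gzip, which removes the original
    return gz_path


def write_priors_tsv(tsv_path: Path, rows) -> None:
    """Write one tab separated line per prior (assumed layout, no header)."""
    with open(tsv_path, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        for row in rows:
            writer.writerow([row[field] for field in FIELDS])


with tempfile.TemporaryDirectory() as tmp:
    db_dir = Path(tmp)  # stands in for the sequence database directory
    wig = db_dir / "example_sample.wig"    # renamed from priors.wig
    dist = db_dir / "example_sample.dist"  # renamed from priors.dist
    wig.write_text("placeholder wig content\n")
    dist.write_text("placeholder dist content\n")
    wig_gz, dist_gz = gzip_file(wig), gzip_file(dist)
    row = {
        "Sequence File": "example_genome.fna",  # hypothetical sequence file
        "Wig File": str(wig_gz.relative_to(db_dir)),
        "Dist File": str(dist_gz.relative_to(db_dir)),
        "Biosample": "Example sample",
        "Assay": "Example assay",
        "Source": "Example lab",
        "URL": "https://example.org/experiment",
        "Description": "Example description.",
    }
    tsv_path = db_dir / "priors.tsv"
    write_priors_tsv(tsv_path, [row])
    tsv_text = tsv_path.read_text()
    gz_names = sorted(p.name for p in db_dir.glob("*.gz"))

print(tsv_text)
```

The resulting file would then be passed to the updater with the --priors option; note that the option table warns that pre-existing priors are removed.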
As well as downloading the sequence files from many sources, the updater tracks the files using a SQLite database. The schema of the database is given below.
Column | Type | Constraint | Description |
---|---|---|---|
id | INTEGER | PRIMARY KEY | An auto-generated unique identifier for the category. Other tables reference this field. |
name | TEXT | UNIQUE NOT NULL | The unique name of the category as shown to users. |
Column | Type | Constraint | Description |
---|---|---|---|
id | INTEGER | PRIMARY KEY | An auto-generated unique identifier for the listing. Other tables reference this field. |
categoryId | INTEGER | NOT NULL REFERENCES tblCategory (id) | The identifier of the category that contains this listing. |
name | TEXT | NOT NULL | The name of the listing shown to users. |
description | TEXT | NOT NULL | The description of the listing shown to users. |
The combination of the fields categoryId and name is unique.
Column | Type | Constraint | Description |
---|---|---|---|
id | INTEGER | PRIMARY KEY | An auto-generated unique identifier for the sequence file. |
retriever | INTEGER | NOT NULL | An identifier for the code module that downloaded this sequence. It allows the individual code modules to ensure they don't change the records of files downloaded by other modules. |
listingId | INTEGER | NOT NULL REFERENCES tblListing (id) | The identifier of the listing that contains this sequence file. |
alphabet | INTEGER | NOT NULL CHECK (alphabet IN (1, 2, 4)) | Represents the alphabet as powers of 2 so they can be combined into a bitset. |
edition | INTEGER | NOT NULL | A machine readable version. This field is used for sorting. Larger numbers are considered newer. |
version | TEXT | NOT NULL | A human readable version which is displayed to the user. |
description | TEXT | NOT NULL | The description of the sequence file, often containing information about the source. |
fileSeq | TEXT | UNIQUE NOT NULL | The relative path to the sequence file. |
fileBg | TEXT | UNIQUE NOT NULL | The relative path to the background file. |
sequenceCount | INTEGER | NOT NULL | The number of sequences. |
totalLen | INTEGER | NOT NULL | The total end-to-end combined length of the sequences. |
minLen | INTEGER | NOT NULL | The length of the shortest sequence. |
maxLen | INTEGER | NOT NULL | The length of the longest sequence. |
avgLen | REAL | NOT NULL | The average length of the sequences. |
stdDLen | REAL | NOT NULL | Currently unused! Intended to store the standard deviation of the sequence lengths. |
obsolete | INTEGER | DEFAULT 0 | Used to flag sequences as obsolete. Sequences flagged as obsolete are hidden from the interface. |
The combination of the fields listingId, alphabet and edition is unique.
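The three tables described so far can be tried out with Python's sqlite3 module, as in the sketch below. The column definitions follow the tables above, and the table names tblCategory, tblListing and tblSequenceFile come from the REFERENCES constraints; the exact DDL used by the updater, the inserted rows and the query are assumptions made for illustration.

```python
import sqlite3

SCHEMA = """
CREATE TABLE tblCategory (
  id INTEGER PRIMARY KEY,
  name TEXT UNIQUE NOT NULL
);
CREATE TABLE tblListing (
  id INTEGER PRIMARY KEY,
  categoryId INTEGER NOT NULL REFERENCES tblCategory (id),
  name TEXT NOT NULL,
  description TEXT NOT NULL,
  UNIQUE (categoryId, name)
);
CREATE TABLE tblSequenceFile (
  id INTEGER PRIMARY KEY,
  retriever INTEGER NOT NULL,
  listingId INTEGER NOT NULL REFERENCES tblListing (id),
  alphabet INTEGER NOT NULL CHECK (alphabet IN (1, 2, 4)),
  edition INTEGER NOT NULL,
  version TEXT NOT NULL,
  description TEXT NOT NULL,
  fileSeq TEXT UNIQUE NOT NULL,
  fileBg TEXT UNIQUE NOT NULL,
  sequenceCount INTEGER NOT NULL,
  totalLen INTEGER NOT NULL,
  minLen INTEGER NOT NULL,
  maxLen INTEGER NOT NULL,
  avgLen REAL NOT NULL,
  stdDLen REAL NOT NULL,
  obsolete INTEGER DEFAULT 0,
  UNIQUE (listingId, alphabet, edition)
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)

# Hypothetical rows: one category, one listing, two editions of one sequence file.
conn.execute("INSERT INTO tblCategory (id, name) VALUES (1, 'Example Category')")
conn.execute("INSERT INTO tblListing (id, categoryId, name, description) "
             "VALUES (1, 1, 'Example Listing', 'An example listing')")
insert_sequence = (
    "INSERT INTO tblSequenceFile (retriever, listingId, alphabet, edition, version, "
    "description, fileSeq, fileBg, sequenceCount, totalLen, minLen, maxLen, avgLen, stdDLen) "
    "VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)"
)
conn.execute(insert_sequence, (1, 1, 2, 1, "release 1", "older file",
                               "example_1.fna", "example_1.fna.bfile",
                               2, 16, 6, 10, 8.0, 0.0))
conn.execute(insert_sequence, (1, 1, 2, 2, "release 2", "newer file",
                               "example_2.fna", "example_2.fna.bfile",
                               2, 16, 6, 10, 8.0, 0.0))

# Newest non-obsolete file per listing and alphabet: larger edition means newer.
rows = conn.execute("""
    SELECT l.name, s.version, s.fileSeq
    FROM tblSequenceFile AS s
    JOIN tblListing AS l ON l.id = s.listingId
    WHERE s.obsolete = 0
      AND s.edition = (SELECT MAX(s2.edition) FROM tblSequenceFile AS s2
                       WHERE s2.listingId = s.listingId
                         AND s2.alphabet = s.alphabet
                         AND s2.obsolete = 0)
    ORDER BY l.name
""").fetchall()
print(rows)
```

Sorting on edition rather than version reflects the column descriptions: edition is the machine readable value, while version is only displayed to the user.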
Column | Type | Constraint | Description |
---|---|---|---|
id | INTEGER | PRIMARY KEY | An auto-generated unique identifier for the prior. |
sequenceId | INTEGER | NOT NULL REFERENCES tblSequenceFile (id) | The identifier of the sequence that is associated with this prior. |
filePrior | TEXT | UNIQUE NOT NULL | The relative path to the wig file (which may be gzipped). |
fileDist | TEXT | UNIQUE NOT NULL | The relative path to the dist file (which may be gzipped). |
biosample | TEXT | NOT NULL | A short descriptive name for the sample used in the experiment that the priors were derived from. |
assay | TEXT | NOT NULL | A short descriptive name for the experiment that the priors were derived from. |
source | TEXT | NOT NULL | A short descriptive name of the lab or group that performed the experiment that the priors were derived from. |
url | TEXT | NOT NULL | A URL linking to further information on the experiment. |
description | TEXT | NOT NULL | A description of the experiment which may contain HTML. |