index-sequence-db

Create index files for the genomes in the sequence database directory.

This program requires the update-sequence-db program to have been run first.

If necessary, the program adds the fileSeqIndex column to tblSequenceFile. It then loops through all the databases listed in the SQLite database fasta_db.sqlite, which is created by update-sequence-db, and uses fasta-file-indexer to create a FASTA index file for each genome database. Finally, it updates each tblSequenceFile record with the name of the index file in the fileSeqIndex column.
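The steps described above can be sketched in Python with the standard sqlite3 module. This is a rough illustration under stated assumptions, not the program's actual implementation: the function names and the make_index callable (a stand-in for whatever invokes fasta-file-indexer) are hypothetical, and the column is added as plain TEXT because SQLite's ALTER TABLE cannot add a UNIQUE column.

```python
import sqlite3
from typing import Callable


def ensure_index_column(conn: sqlite3.Connection) -> None:
    """Add fileSeqIndex to tblSequenceFile if it is not already present."""
    # PRAGMA table_info yields one row per column; index 1 is the column name.
    columns = [row[1] for row in conn.execute("PRAGMA table_info(tblSequenceFile)")]
    if "fileSeqIndex" not in columns:
        # Assumption: a plain TEXT column. SQLite cannot add a UNIQUE column,
        # or a NOT NULL column without a default, through ALTER TABLE.
        conn.execute("ALTER TABLE tblSequenceFile ADD COLUMN fileSeqIndex TEXT")


def index_all(conn: sqlite3.Connection, make_index: Callable[[str], str]) -> int:
    """Index every tracked sequence file and record the index file name.

    make_index is a hypothetical callable that builds the index for one
    sequence file and returns the relative path of the index it wrote.
    Returns the number of records updated.
    """
    ensure_index_column(conn)
    rows = conn.execute("SELECT id, fileSeq FROM tblSequenceFile").fetchall()
    for file_id, file_seq in rows:
        index_name = make_index(file_seq)
        conn.execute(
            "UPDATE tblSequenceFile SET fileSeqIndex = ? WHERE id = ?",
            (index_name, file_id),
        )
    conn.commit()
    return len(rows)
```

Passing the indexer in as a callable keeps the database bookkeeping separate from the external tool, so the loop can be exercised against an in-memory database without running fasta-file-indexer.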

<sequence database directory>

The folder used to store the sequence database files that should be indexed. The program expects the folder to contain the sequence FASTA files and the SQLite database fasta_db.sqlite created by update-sequence-db.

For each sequence file indexed, the program creates a corresponding index file with the same basename but the suffix .fai. The fasta_db.sqlite database is updated with the name of the index file. The program also creates a folder called logs containing date-stamped logs of the program's activity.

As well as downloading the sequence files from many sources, the updater (update-sequence-db) tracks the files using a SQLite database. The schema of the database is given below.

tblCategory

Column	Type	Constraint	Description
id	INTEGER	PRIMARY KEY	An auto-generated unique identifier for the category. Other tables reference this field.
name	TEXT	UNIQUE NOT NULL	The unique name of the category as shown to users.

tblListing

Column	Type	Constraint	Description
id	INTEGER	PRIMARY KEY	An auto-generated unique identifier for the listing. Other tables reference this field.
categoryId	INTEGER	NOT NULL REFERENCES tblCategory (id)	The identifier of the category that contains this listing.
name	TEXT	NOT NULL	The name of the listing shown to users.
description	TEXT	NOT NULL	The description of the listing shown to users.

The combination of the fields categoryId and name is unique.
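The two tables above can be written out as SQLite DDL to make the constraints concrete. The sketch below is reconstructed from the column descriptions, so the exact statements used by update-sequence-db may differ; the add_listing helper and the example names are hypothetical.

```python
import sqlite3

# DDL reconstructed from the tables above (an assumption, not the shipped schema).
SCHEMA = """
CREATE TABLE tblCategory (
  id INTEGER PRIMARY KEY,
  name TEXT UNIQUE NOT NULL
);
CREATE TABLE tblListing (
  id INTEGER PRIMARY KEY,
  categoryId INTEGER NOT NULL REFERENCES tblCategory (id),
  name TEXT NOT NULL,
  description TEXT NOT NULL,
  UNIQUE (categoryId, name)
);
"""


def add_listing(conn: sqlite3.Connection, category: str, name: str, description: str) -> int:
    """Insert a listing under a category, creating the category if needed."""
    conn.execute("INSERT OR IGNORE INTO tblCategory (name) VALUES (?)", (category,))
    (category_id,) = conn.execute(
        "SELECT id FROM tblCategory WHERE name = ?", (category,)
    ).fetchone()
    cursor = conn.execute(
        "INSERT INTO tblListing (categoryId, name, description) VALUES (?, ?, ?)",
        (category_id, name, description),
    )
    return cursor.lastrowid


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
first_id = add_listing(conn, "Example category", "Example listing", "A made-up listing.")
try:
    # Same category and same name: the UNIQUE (categoryId, name) constraint rejects it.
    add_listing(conn, "Example category", "Example listing", "Same name again.")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True
```

The same listing name remains usable under a different category, because uniqueness applies to the pair rather than to the name alone.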

tblSequenceFile

Column	Type	Constraint	Description
id	INTEGER	PRIMARY KEY	An auto-generated unique identifier for the sequence file.
retriever	INTEGER	NOT NULL	An identifier for the code module that downloaded this sequence. It allows the individual code modules to ensure they don't change the records of files downloaded by other modules.
listingId	INTEGER	NOT NULL REFERENCES tblListing (id)	The identifier of the listing that contains this sequence file.
alphabet	INTEGER	NOT NULL CHECK (alphabet IN (1, 2, 4))	Represents each alphabet as a power of 2 so that alphabets can be combined into a bitset. RNA = 1, DNA = 2, Protein = 4.
edition	INTEGER	NOT NULL	A machine-readable version. This field is used for sorting. Larger numbers are considered newer.
version	TEXT	NOT NULL	A human-readable version which is displayed to the user.
description	TEXT	NOT NULL	The description of the sequence file, often containing information about the source.
fileSeq	TEXT	UNIQUE NOT NULL	The relative path to the sequence file.
fileBg	TEXT	UNIQUE NOT NULL	The relative path to the background file.
fileSeqIndex	TEXT	UNIQUE NOT NULL	The relative path to the index file for the sequence.
sequenceCount	INTEGER	NOT NULL	The number of sequences.
totalLen	INTEGER	NOT NULL	The total end-to-end combined length of the sequences.
minLen	INTEGER	NOT NULL	The length of the shortest sequence.
maxLen	INTEGER	NOT NULL	The length of the longest sequence.
avgLen	REAL	NOT NULL	The average length of the sequences.
stdDLen	REAL	NOT NULL	Currently unused. Intended to store the standard deviation of the sequence lengths.
obsolete	INTEGER	DEFAULT 0	Used to flag sequences as obsolete. Sequences flagged as obsolete are hidden from the interface.

The combination of the fields listingId, alphabet and edition is unique.
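Because each alphabet is stored as a power of 2, a caller can combine several alphabets into one mask and test it against the stored value with a bitwise AND. The sketch below illustrates that idea together with the edition and obsolete columns; the Alphabet class and the newest_file function are hypothetical names, and the query is only one plausible way to pick the newest visible file for a listing.

```python
import sqlite3
from enum import IntFlag
from typing import Optional, Tuple


class Alphabet(IntFlag):
    """Alphabet codes as given in the alphabet column description."""
    RNA = 1
    DNA = 2
    PROTEIN = 4


def matches(stored: int, wanted: Alphabet) -> bool:
    """Return True if the stored alphabet is one of the wanted alphabets."""
    return bool(Alphabet(stored) & wanted)


def newest_file(
    conn: sqlite3.Connection, listing_id: int, wanted: Alphabet
) -> Optional[Tuple[str, str]]:
    """Return (version, fileSeq) for the newest non-obsolete matching file."""
    return conn.execute(
        "SELECT version, fileSeq FROM tblSequenceFile "
        "WHERE listingId = ? AND (alphabet & ?) != 0 AND obsolete = 0 "
        "ORDER BY edition DESC LIMIT 1",  # larger edition means newer
        (listing_id, int(wanted)),
    ).fetchone()


# A mask that accepts either nucleotide alphabet.
NUCLEOTIDE = Alphabet.RNA | Alphabet.DNA
```

Sorting on edition rather than version follows the column descriptions: edition exists for machine ordering, while version is only meant for display.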

tblPriorFile

Column	Type	Constraint	Description
id	INTEGER	PRIMARY KEY	An auto-generated unique identifier for the prior.
sequenceId	INTEGER	NOT NULL REFERENCES tblSequenceFile (id)	The identifier of the sequence that is associated with this prior.
filePrior	TEXT	UNIQUE NOT NULL	The relative path to the wig file (which may be gzipped).
fileDist	TEXT	UNIQUE NOT NULL	The relative path to the dist file (which may be gzipped).
biosample	TEXT	NOT NULL	A short descriptive name for the sample used in the experiment that the priors were derived from.
assay	TEXT	NOT NULL	A short descriptive name for the experiment that the priors were derived from.
source	TEXT	NOT NULL	A short descriptive name of the lab or group that performed the experiment that the priors were derived from.
url	TEXT	NOT NULL	A URL linking to further information on the experiment.
description	TEXT	NOT NULL	A description of the experiment which may contain HTML.
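Since each prior references its sequence file through sequenceId, the priors available for a given sequence file can be found with a join. The query below is a small sketch based only on the columns listed above; the priors_for function name and the choice of returned columns are illustrative.

```python
import sqlite3
from typing import List, Tuple

# Join each prior to its sequence file through sequenceId.
PRIORS_FOR_SEQUENCE = """
SELECT p.biosample, p.assay, p.source, p.filePrior, p.fileDist
FROM tblPriorFile AS p
JOIN tblSequenceFile AS s ON s.id = p.sequenceId
WHERE s.fileSeq = ?
ORDER BY p.biosample, p.assay
"""


def priors_for(conn: sqlite3.Connection, file_seq: str) -> List[Tuple]:
    """List the priors recorded for the sequence file at the given relative path."""
    return conn.execute(PRIORS_FOR_SEQUENCE, (file_seq,)).fetchall()
```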

The MEME Suite

Motif-based sequence analysis tools
