index-sequence-db

Usage:

index-sequence-db <sequence database directory>

Description

Create index files for the genomes in the sequence database directory.

This program requires the update-sequence-db program to have been run first.

If necessary, the program adds the fileSeqIndex column to the tblSequenceFile table. It then loops through all the genome databases recorded in the SQLite database fasta_db.sqlite, which is created by update-sequence-db, and uses fasta-file-indexer to create a FASTA index file for each one. Finally, the program updates each tblSequenceFile record with the name of the index file in the fileSeqIndex column.
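The steps above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the program's actual implementation. The function names ensure_index_column and index_all are hypothetical, and the make_index callback stands in for a call to fasta-file-indexer, whose command-line arguments are not described here. The sketch adds the column as plain TEXT because SQLite's ALTER TABLE ... ADD COLUMN cannot add a column with a UNIQUE constraint.

```python
import sqlite3
from pathlib import Path
from typing import Callable


def ensure_index_column(conn: sqlite3.Connection) -> None:
    """Add fileSeqIndex to tblSequenceFile if it is not already present."""
    # PRAGMA table_info yields one row per column; field 1 is the column name.
    columns = [row[1] for row in conn.execute("PRAGMA table_info(tblSequenceFile)")]
    if "fileSeqIndex" not in columns:
        # Assumption: plain TEXT, since ADD COLUMN cannot create a UNIQUE column.
        conn.execute("ALTER TABLE tblSequenceFile ADD COLUMN fileSeqIndex TEXT")


def index_all(db_dir: Path, make_index: Callable[[Path], Path]) -> int:
    """Index every sequence file recorded in fasta_db.sqlite.

    make_index is a hypothetical stand-in for running fasta-file-indexer:
    it receives the path of a FASTA file and returns the path of the index
    file it created. Returns the number of records processed.
    """
    conn = sqlite3.connect(db_dir / "fasta_db.sqlite")
    try:
        ensure_index_column(conn)
        rows = conn.execute("SELECT id, fileSeq FROM tblSequenceFile").fetchall()
        for file_id, file_seq in rows:
            index_path = make_index(db_dir / file_seq)
            # Store the index path relative to the database directory,
            # in the same way fileSeq is stored as a relative path.
            conn.execute(
                "UPDATE tblSequenceFile SET fileSeqIndex = ? WHERE id = ?",
                (index_path.relative_to(db_dir).as_posix(), file_id),
            )
        conn.commit()
        return len(rows)
    finally:
        conn.close()
```

Passing the indexer as a callback keeps the database bookkeeping separate from the external tool, so the loop can be exercised without fasta-file-indexer installed.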

Input

<sequence database directory>

The folder used to store the sequence database files that should be indexed. The program expects the folder to contain the sequence FASTA files and the SQLite database fasta_db.sqlite created by update-sequence-db.

Output

For each sequence file indexed, the program creates a corresponding index file with the same basename but the suffix .fai. The fasta_db.sqlite database is updated with the name of the index file. The program also creates a folder called logs containing date-stamped logs of the program's activity.

Database Schema

As well as downloading the sequence files from many sources, the updater tracks the files using a SQLite database. The schema of the database is given below.

tblCategory

| Column | Type | Constraint | Description |
| --- | --- | --- | --- |
| id | INTEGER | PRIMARY KEY | An auto-generated unique identifier for the category. Other tables reference this field. |
| name | TEXT | UNIQUE NOT NULL | The unique name of the category as shown to users. |

tblListing

| Column | Type | Constraint | Description |
| --- | --- | --- | --- |
| id | INTEGER | PRIMARY KEY | An auto-generated unique identifier for the listing. Other tables reference this field. |
| categoryId | INTEGER | NOT NULL REFERENCES tblCategory (id) | The identifier of the category that contains this listing. |
| name | TEXT | NOT NULL | The name of the listing shown to users. |
| description | TEXT | NOT NULL | The description of the listing shown to users. |

The combination of the fields categoryId and name is unique.
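As a rough illustration, the two tables above could be declared in SQLite as shown in this Python sketch. The DDL is reconstructed from the column descriptions and is an assumption, not the statements update-sequence-db actually issues; the helper names create_schema and add_listing are hypothetical.

```python
import sqlite3

# DDL reconstructed from the documented columns (an assumption, not the
# updater's own statements).
SCHEMA = """
CREATE TABLE tblCategory (
    id   INTEGER PRIMARY KEY,
    name TEXT UNIQUE NOT NULL
);
CREATE TABLE tblListing (
    id          INTEGER PRIMARY KEY,
    categoryId  INTEGER NOT NULL REFERENCES tblCategory (id),
    name        TEXT NOT NULL,
    description TEXT NOT NULL,
    UNIQUE (categoryId, name)
);
"""


def create_schema(conn: sqlite3.Connection) -> None:
    """Create tblCategory and tblListing in an empty database."""
    conn.executescript(SCHEMA)


def add_listing(conn: sqlite3.Connection, category: str, name: str,
                description: str) -> int:
    """Insert a listing, creating its category if needed; return the listing id."""
    # The UNIQUE constraint on tblCategory.name makes this a no-op when the
    # category already exists.
    conn.execute("INSERT OR IGNORE INTO tblCategory (name) VALUES (?)", (category,))
    (category_id,) = conn.execute(
        "SELECT id FROM tblCategory WHERE name = ?", (category,)
    ).fetchone()
    # Raises sqlite3.IntegrityError if (categoryId, name) already exists.
    cur = conn.execute(
        "INSERT INTO tblListing (categoryId, name, description) VALUES (?, ?, ?)",
        (category_id, name, description),
    )
    return cur.lastrowid
```

With this declaration, the same listing name may appear under two different categories, but inserting it twice under one category fails with sqlite3.IntegrityError.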

tblSequenceFile

| Column | Type | Constraint | Description |
| --- | --- | --- | --- |
| id | INTEGER | PRIMARY KEY | An auto-generated unique identifier for the sequence file. |
| retriever | INTEGER | NOT NULL | An identifier for the code module that downloaded this sequence. It allows the individual code modules to ensure they don't change the records of files downloaded by other modules. |
| listingId | INTEGER | NOT NULL REFERENCES tblListing (id) | The identifier of the listing that contains this sequence file. |
| alphabet | INTEGER | NOT NULL CHECK (alphabet IN (1, 2, 4)) | Represents the alphabet as powers of 2 so they can be combined into a bitset: RNA = 1, DNA = 2, Protein = 4. |
| edition | INTEGER | NOT NULL | A machine-readable version. This field is used for sorting. Larger numbers are considered newer. |
| version | TEXT | NOT NULL | A human-readable version which is displayed to the user. |
| description | TEXT | NOT NULL | The description of the sequence file, often containing information about the source. |
| fileSeq | TEXT | UNIQUE NOT NULL | The relative path to the sequence file. |
| fileBg | TEXT | UNIQUE NOT NULL | The relative path to the background file. |
| fileSeqIndex | TEXT | UNIQUE NOT NULL | The relative path to the index file for the sequence. |
| sequenceCount | INTEGER | NOT NULL | The number of sequences. |
| totalLen | INTEGER | NOT NULL | The total end-to-end combined length of the sequences. |
| minLen | INTEGER | NOT NULL | The length of the shortest sequence. |
| maxLen | INTEGER | NOT NULL | The length of the longest sequence. |
| avgLen | REAL | NOT NULL | The average length of the sequences. |
| stdDLen | REAL | NOT NULL | Currently unused. Intended to store the standard deviation of the sequence lengths. |
| obsolete | INTEGER | DEFAULT 0 | Used to flag sequences as obsolete. Sequences flagged as obsolete are hidden from the interface. |

The combination of the fields listingId, alphabet and edition is unique.
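The alphabet, edition and obsolete columns described above lend themselves to a simple lookup, sketched here in Python. The Alphabet class and the visible_files function are hypothetical names introduced for illustration; only the table name, column names and numeric alphabet codes come from the schema.

```python
import sqlite3
from enum import IntFlag


class Alphabet(IntFlag):
    """Alphabet codes from the schema; powers of 2 so they combine as a bitset."""
    RNA = 1
    DNA = 2
    PROTEIN = 4


def visible_files(conn: sqlite3.Connection, listing_id: int, allowed: Alphabet):
    """Return (alphabet, edition, version, fileSeq) rows for one listing.

    Only rows whose alphabet is in the `allowed` bitset and that are not
    flagged obsolete are returned, newest edition first.
    """
    sql = (
        "SELECT alphabet, edition, version, fileSeq "
        "FROM tblSequenceFile "
        "WHERE listingId = ? "
        "AND (alphabet & ?) != 0 "        # bitwise test against the mask
        "AND COALESCE(obsolete, 0) = 0 "  # the column is nullable; treat NULL as 0
        "ORDER BY edition DESC"           # larger editions are newer
    )
    return [
        (Alphabet(alphabet), edition, version, file_seq)
        for alphabet, edition, version, file_seq
        in conn.execute(sql, (listing_id, int(allowed)))
    ]
```

For example, passing Alphabet.DNA | Alphabet.PROTEIN as the mask selects DNA and protein files while skipping RNA files in the same listing.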

tblPriorFile

| Column | Type | Constraint | Description |
| --- | --- | --- | --- |
| id | INTEGER | PRIMARY KEY | An auto-generated unique identifier for the prior. |
| sequenceId | INTEGER | NOT NULL REFERENCES tblSequenceFile (id) | The identifier of the sequence that is associated with this prior. |
| filePrior | TEXT | UNIQUE NOT NULL | The relative path to the wig file (which may be gzipped). |
| fileDist | TEXT | UNIQUE NOT NULL | The relative path to the dist file (which may be gzipped). |
| biosample | TEXT | NOT NULL | A short descriptive name for the sample used in the experiment that the priors were derived from. |
| assay | TEXT | NOT NULL | A short descriptive name for the experiment that the priors were derived from. |
| source | TEXT | NOT NULL | A short descriptive name of the lab or group that performed the experiment that the priors were derived from. |
| url | TEXT | NOT NULL | A URL linking to further information on the experiment. |
| description | TEXT | NOT NULL | A description of the experiment which may contain HTML. |