fasta-holdout-set

Usage:

fasta-holdout-set [options] --p <primary sequences>

Description

Split primary and control sequences into training and testing sets. Control sequences are generated by shuffling if not specified. Primary sequences will be centrally trimmed to a specified length if requested.

Input

<primary sequences>

The name of a file containing the primary (positive) sequences in FASTA format. The file must contain at least two valid sequences or fasta-holdout-set will reject it.

Output

fasta-holdout-set writes its output to files in a directory named fasta-holdout-set_out, which it creates if necessary. You can change the output directory using the --o or --oc options. The directory will contain:

Unless option --hofract (see below) is given and is equal to zero, the output directory will also contain:

Note: All options may be preceded by a single dash (-) instead of a double dash (--) if desired.

Options

Option Parameter Description Default Behavior
Output
--n control sequences The name of a file containing control (negative) sequences in FASTA format. The control sequences must be in the same sequence alphabet as the primary sequences. If the average length of the control sequences is longer than that of the primary sequences, fasta-holdout-set trims the control sequences so that both sets have the same average length. If you do not provide control sequences, fasta-holdout-set creates them by shuffling a copy of each primary sequence, using an m-order shuffle (see next option). Shuffling also preserves the positions of non-core (e.g., ambiguous) characters in each sequence to avoid artifacts.
--order m If you do not provide control sequences, fasta-holdout-set will do an m-order shuffle of each primary sequence to create control sequences. This preserves the frequencies of words of length m+1 in each shuffled sequence. Unless you specify a background model file (see --bfile, below), fasta-holdout-set will also estimate an m-order Markov background model from the control sequences (or the primary sequences if you do not provide control sequences). fasta-holdout-set uses the fasta-get-markov program with a total pseudocount of 1 to create the Markov model. m must be in the range [0,..,5]. By default, fasta-holdout-set uses m=2 (DNA and RNA) and m=0 (Protein and Custom alphabets).
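To illustrate what an m-order shuffle preserves, the following minimal Python sketch covers only the simplest case, m=0, where single-letter (length m+1 = 1) frequencies are kept and non-core characters stay at their original positions. It is not the implementation used by fasta-holdout-set: the names shuffle_core and word_counts, and the choice of A, C, G, T as the core alphabet, are assumptions made for illustration. Orders above 0 need a more elaborate algorithm that is not shown here.

```python
import random
from collections import Counter

CORE = set("ACGT")  # assumed core alphabet (DNA); illustrative only


def shuffle_core(seq, rng):
    """0-order shuffle: permute core letters, keep non-core letters in place."""
    core_letters = [c for c in seq if c in CORE]
    rng.shuffle(core_letters)
    shuffled = iter(core_letters)
    # Walk the original sequence; refill core positions, copy the rest as-is.
    return "".join(next(shuffled) if c in CORE else c for c in seq)


def word_counts(seq, m):
    """Count overlapping words of length m+1 in seq."""
    k = m + 1
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


rng = random.Random(0)
primary = "ACGTNNACGGTTAC"
control = shuffle_core(primary, rng)
```

Comparing word_counts(primary, 0) with word_counts(control, 0) shows that the letter counts are unchanged, and the two N characters remain at their original positions.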
--hofract hofract The fraction of the primary and control sequences that fasta-holdout-set will randomly select and place in the hold-out set output files.
Note: If a value of 0 is specified, no hold-out set output files are created and the training set files will contain all the original sequences.
Note: If a hold-out set would contain fewer than 5 sequences, fasta-holdout-set creates an empty output file.
By default, fasta-holdout-set places 0.1 (10%) of the primary and control sequences in the hold-out set output files.
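The hold-out selection described above can be sketched in Python as follows. This is a hedged illustration, not the program's code: the function name split_holdout is hypothetical, rounding the hold-out size down is an assumption, and so is keeping every sequence in the training set when the hold-out set would have fewer than 5 sequences (the text only states that the hold-out file is then empty).

```python
import random


def split_holdout(seqs, hofract=0.1, seed=0, min_holdout=5):
    """Return (training, holdout) lists drawn from a list of sequences."""
    n_hold = int(len(seqs) * hofract)  # assumption: round down
    if hofract == 0 or n_hold < min_holdout:
        # Assumption: sequences stay in training when the hold-out set is empty.
        return list(seqs), []
    rng = random.Random(seed)  # fixed seed makes the split reproducible
    hold_idx = set(rng.sample(range(len(seqs)), n_hold))
    training = [s for i, s in enumerate(seqs) if i not in hold_idx]
    holdout = [s for i, s in enumerate(seqs) if i in hold_idx]
    return training, holdout


seqs = [f"seq{i}" for i in range(100)]
training, holdout = split_holdout(seqs)
```

With 100 input sequences and the default fraction of 0.1, the sketch places 10 sequences in the hold-out list and 90 in the training list; calling it again with the same seed gives the same split.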
--seed seed Random seed for shuffling and sampling the hold-out set sequences (see above). By default, fasta-holdout-set uses a random seed of 0.
--ccut size Trim the primary sequences to their central region of size characters before creating the control sequences and before splitting the sequences into training and testing sets. A value of 0 indicates that the primary sequences should not be trimmed. Note: If you provide control sequences, they will never be trimmed. By default, a value of 0 is used.
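Central trimming as described above can be sketched in Python like this. The function name trim_center is hypothetical, and two details are assumptions rather than documented behavior: sequences no longer than the requested size are returned unchanged, and when the number of excess characters is odd, one more character is removed from the right end than from the left.

```python
def trim_center(seq, size):
    """Return the central `size` characters of seq; size 0 means no trimming."""
    if size == 0 or len(seq) <= size:
        # Assumption: short sequences are left unchanged.
        return seq
    # Assumption: with an odd excess, the right end loses one more character.
    start = (len(seq) - size) // 2
    return seq[start:start + size]


trimmed = trim_center("AAACCCGGGTTT", 6)
```

For the 12-character example, the sketch removes three characters from each end and keeps the middle six.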
--verbosity 1|2|3|4|5 A number that regulates the verbosity level of the output information messages. If set to 1 (quiet), fasta-holdout-set will only output warning and error messages, whereas the other extreme, 5 (dump), outputs a large amount of information intended for debugging. By default, the verbosity level is set to 2 (normal).