fasta-holdout-set [options] --p <primary sequences>
Split primary and control sequences into training and testing sets. Control sequences are generated by shuffling if not specified. Primary sequences will be centrally trimmed to a specified length if requested.
The --p argument, <primary sequences>, is the name of a file containing the primary (positive) sequences in FASTA format. The file must contain at least two valid sequences or fasta-holdout-set will reject it.
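
Before running the tool, it can help to check that a FASTA file meets the two-sequence minimum. The following is a minimal sketch, not the tool's own parser: `read_fasta` and `has_enough_sequences` are hypothetical helper names, and "valid" is approximated here as "a record with a non-empty sequence", which may be looser than the check fasta-holdout-set actually performs.

```python
def read_fasta(path):
    """Return a list of (header, sequence) pairs from a FASTA file."""
    records = []
    header, chunks = None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            if line.startswith(">"):
                # A new header closes the previous record, if any.
                if header is not None:
                    records.append((header, "".join(chunks)))
                header, chunks = line[1:], []
            elif header is not None:
                chunks.append(line)  # sequence data may span several lines
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def has_enough_sequences(path, minimum=2):
    """True if the file holds at least `minimum` records with sequence data.

    Assumption: "valid" is treated as "non-empty"; the real tool may also
    check the alphabet.
    """
    non_empty = [seq for _, seq in read_fasta(path) if seq]
    return len(non_empty) >= minimum
```
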
fasta-holdout-set writes its output to files in a directory named fasta-holdout-set_out, which it creates if necessary. You can change the output directory using the --o or --oc options. The directory will contain:
- train_pos.fa: the primary training set
- train_neg.fa: the control training set
- test_pos.fa: the primary testing (hold-out) set
- test_neg.fa: the control testing (hold-out) set

Note: All options may be preceded by a single dash (-) instead of a double dash (--) if desired.
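
When the tool is driven from a script, the command line can be assembled as an argument list. This is a sketch under stated assumptions: `build_command` is a hypothetical helper, it uses only flags named on this page, and it assumes that --oc takes the output directory name as its parameter.

```python
def build_command(primary, control=None, out_dir=None, hofract=None, seed=None):
    """Assemble a fasta-holdout-set argument list from documented flags only."""
    cmd = ["fasta-holdout-set"]
    if out_dir is not None:
        cmd += ["--oc", out_dir]  # assumption: --oc takes a directory name
    if control is not None:
        cmd += ["--n", control]  # optional control (negative) sequences
    if hofract is not None:
        cmd += ["--hofract", str(hofract)]
    if seed is not None:
        cmd += ["--seed", str(seed)]
    cmd += ["--p", primary]  # required primary (positive) sequences
    return cmd
```

If fasta-holdout-set is installed and on the PATH, the returned list can be passed to `subprocess.run(cmd, check=True)`.
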
Option | Parameter | Description | Default Behavior |
---|---|---|---|
Output | | | |
--n | control sequences | The name of a file containing control (negative) sequences in FASTA format. The control sequences must be in the same sequence alphabet as the primary sequences. If the average length of the control sequences is longer than that of the primary sequences, fasta-holdout-set trims the control sequences so that both sets have the same average length. | If you do not provide control sequences, fasta-holdout-set creates them by shuffling a copy of each primary sequence, using an m-order shuffle (see next option). Shuffling also preserves the positions of non-core (e.g., ambiguous) characters in each sequence to avoid artifacts. |
--order | m | If you do not provide control sequences, fasta-holdout-set will do an m-order shuffle of each primary sequence to create control sequences. This preserves the frequencies of words of length m+1 in each shuffled sequence. Unless you specify a background model file (see --bfile, below), fasta-holdout-set will also estimate an m-order Markov background model from the control sequences (or the primary sequences if you do not provide control sequences). fasta-holdout-set uses the fasta-get-markov program with a total pseudocount of 1 to create the Markov model. m must be in the range [0,...,5]. | fasta-holdout-set uses m=2 for DNA and RNA, and m=0 for protein and custom alphabets. |
--hofract | hofract | The fraction of the primary and control sequences that fasta-holdout-set will randomly select and place in the hold-out set output files. Note: If a value of 0 is specified, no hold-out set output files are created and the training set files will contain all the original sequences. Note: If a hold-out set would contain fewer than 5 sequences, fasta-holdout-set creates an empty output file. | fasta-holdout-set places 0.1 (10%) of the primary and control sequences in the hold-out set output files. |
--seed | seed | Random seed for shuffling and for sampling the hold-out set sequences (see above). | fasta-holdout-set uses a random seed of 0. |
--ccut | size | Trim each primary sequence to its central region of length size before creating the control sequences and before splitting the sequences into training and testing sets. A value of 0 indicates that the primary sequences should not be trimmed. Note: If you provide control sequences, they will never be trimmed. | A value of 0 is used. |
--verbosity | 1\|2\|3\|4\|5 | A number that regulates the verbosity level of the output information messages. If set to 1 (quiet), fasta-holdout-set will only output warning and error messages, whereas the other extreme, 5 (dump), outputs a large amount of information intended for debugging. | The verbosity level is set to 2 (normal). |
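
The --ccut, --hofract and --seed options can be illustrated with a short sketch. This is not the tool's implementation: `central_trim` and `split_holdout` are hypothetical names, and several details are assumptions that this page does not specify, namely that the hold-out count is rounded down, that an odd leftover character in trimming falls on the right, and that all sequences stay in the training set when the hold-out set would be too small.

```python
import random


def central_trim(seq, size):
    """Keep the central `size` characters of `seq`; size 0 means no trimming."""
    if size <= 0 or len(seq) <= size:
        return seq
    # Assumption: when the excess is odd, the extra character is cut on the right.
    start = (len(seq) - size) // 2
    return seq[start:start + size]


def split_holdout(records, hofract=0.1, seed=0, min_holdout=5):
    """Return (training, holdout) lists using a seeded random selection."""
    n_hold = int(len(records) * hofract)  # assumption: round down
    if hofract == 0 or n_hold < min_holdout:
        # Documented: a hold-out set of fewer than 5 sequences is left empty.
        # Assumption: the training set then keeps every sequence.
        return list(records), []
    rng = random.Random(seed)  # fixed seed makes the split reproducible
    picked = set(rng.sample(range(len(records)), n_hold))
    training = [r for i, r in enumerate(records) if i not in picked]
    holdout = [r for i, r in enumerate(records) if i in picked]
    return training, holdout
```

With the default fraction of 0.1 and the 5-sequence minimum, this sketch only produces a non-empty hold-out set once the input has at least 50 sequences.
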