parrot-predict
parrot-predict is a command for making predictions using a trained PARROT network. The parrot-train and parrot-optimize commands both output a file with trained network weights, and parrot-predict can use this trained network to make new predictions on unlabeled sequences. The predictions are written to a text file at a specified location. Note that this command only makes predictions for non-redundant sequences in the provided file. Currently, users must supply the hyperparameters (--num-layers and --hidden-size) they originally used to train their network; in future versions of PARROT, parrot-predict will be able to read the saved network and detect these hyperparameters automatically.
Once PARROT is installed, the user can run parrot-predict from the command line:
$ parrot-predict seq_file saved_network output_file <flags>
Where seq_file specifies a file containing a list of sequences. Each line of seq_file should have two whitespace-separated columns: a sequence ID and the amino acid sequence. Optionally, the file may be formatted without the sequence IDs (in which case the --exclude-seq-id flag must be given). Two example seq_file files can be found in the /data folder. saved_network is the path to the file where the trained network is saved. output_file is the path where the predictions will be saved as a text file.
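As an illustration of the seq_file layout described above, the following minimal Python sketch writes and re-reads a two-column file. It uses only the standard library and does not call PARROT itself. The sequence IDs, the sequences, the file name my_seqs.tsv, and the helper names write_seq_file and read_seq_file are all hypothetical.

```python
import tempfile
from pathlib import Path

# Hypothetical example data: IDs and sequences are made up for illustration.
sequences = {
    "seq1": "MKVLAAGIVGLLLA",
    "seq2": "GSHMEEPQSDPSVE",
}


def write_seq_file(path, seqs, include_ids=True):
    """Write one sequence per line, optionally preceded by its ID."""
    lines = []
    for seq_id, seq in seqs.items():
        lines.append(f"{seq_id} {seq}" if include_ids else seq)
    Path(path).write_text("\n".join(lines) + "\n")


def read_seq_file(path, include_ids=True):
    """Parse the file into (id, sequence) pairs; id is None when IDs are absent."""
    records = []
    for line in Path(path).read_text().splitlines():
        fields = line.split()  # columns are whitespace-separated
        if not fields:
            continue  # skip blank lines
        if include_ids:
            records.append((fields[0], fields[1]))
        else:
            records.append((None, fields[0]))
    return records


tmp_dir = Path(tempfile.mkdtemp())
seq_path = tmp_dir / "my_seqs.tsv"  # hypothetical file name
write_seq_file(seq_path, sequences)
parsed = read_seq_file(seq_path)
```

Writing the file with include_ids=False produces the ID-free layout, which would need to be paired with --exclude-seq-id when running parrot-predict.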
Required flags:
--datatype / -d : Describes how values are formatted in datafile. Should be ‘sequence’ if there is a single value per sequence, or ‘residues’ if there are values for every residue in each sequence. See the example datasets in the data folder for more information.
--classes / -c : The number of classes for the machine learning task. If the task is regression, then specify ‘1’.
Optional flags:
--help / -h : Display a help message.
--num-layers / -nl : Number of hidden layers in the network (default is 1). Must be a positive integer and must be identical to the number of layers used when the network was trained.
--hidden-size / -hs : Size of hidden vectors within the network (default is 10). Must be a positive integer and must be identical to the hidden size used when the network was trained.
--encode : Include this flag to specify the numeric encoding scheme for each amino acid. Available options are ‘onehot’ (default), ‘biophysics’ or user-specified. If you wish to manually specify an encoding scheme, provide a path to a text file describing the amino acid to vector mapping. The encoding scheme used for sequence prediction must be identical to that used for network training.
--exclude-seq-id : Include this flag if the seq_file is formatted without sequence IDs as the first column in each row.
--probabilistic-classification : Include this flag to output class predictions as continuous values [0-1], based on the probability that the input sample belongs to each class. Currently only implemented for sequence classification. (NOTE: This is a new feature, let us know if you run into any issues!)
--silent : Flag which, if provided, ensures no output is generated to the terminal.
--print-frequency : Value that defines how often status updates should be printed (in number of sequences predicted). Default=1000
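Because --num-layers, --hidden-size and --encode must match the values used during training, it can help to assemble the command in one place. The sketch below is one possible way to do that in Python. It only builds and prints the argument list using the flags documented above, and does not run PARROT. The helper name build_predict_command and the file names seqs.tsv, network.pt and predictions.tsv are hypothetical.

```python
import shlex


def build_predict_command(seq_file, saved_network, output_file, datatype, classes,
                          num_layers=1, hidden_size=10, encode=None,
                          exclude_seq_id=False, silent=False):
    """Assemble a parrot-predict argument list from the documented flags."""
    if datatype not in ("sequence", "residues"):
        raise ValueError("datatype must be 'sequence' or 'residues'")
    if classes < 1 or num_layers < 1 or hidden_size < 1:
        raise ValueError("classes, num_layers and hidden_size must be positive")

    cmd = [
        "parrot-predict", str(seq_file), str(saved_network), str(output_file),
        "--datatype", datatype,
        "--classes", str(classes),
        # These two must equal the values used when the network was trained.
        "--num-layers", str(num_layers),
        "--hidden-size", str(hidden_size),
    ]
    if encode is not None:
        cmd += ["--encode", encode]  # 'onehot', 'biophysics', or a path
    if exclude_seq_id:
        cmd.append("--exclude-seq-id")
    if silent:
        cmd.append("--silent")
    return cmd


# Hypothetical file names; a regression network trained with 2 layers of size 20.
cmd = build_predict_command("seqs.tsv", "network.pt", "predictions.tsv",
                            datatype="sequence", classes=1,
                            num_layers=2, hidden_size=20)
print(shlex.join(cmd))
```

With PARROT installed, the resulting list could be passed to subprocess.run(cmd, check=True) to launch the prediction.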
Output:
parrot-predict will produce a single text file as output, as well as status updates to the console (if --silent is not specified). This file will be formatted similarly to the original datafiles used for network training: each row contains a sequence ID (excluded if the flag --exclude-seq-id is given), an amino acid sequence, and the prediction values for that sequence.
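The row layout described above can be read back with a few lines of Python. This is a minimal sketch that assumes the output columns are whitespace-separated, like the input seq_file; that separator is an assumption, not something confirmed here. The helper name parse_prediction_line and the example rows are hypothetical, not real PARROT output.

```python
def parse_prediction_line(line, has_seq_id=True):
    """Split one output row into (sequence ID, sequence, list of predicted values).

    Assumes whitespace-separated columns. With has_seq_id=False (matching
    --exclude-seq-id), the first column is taken to be the sequence itself.
    """
    fields = line.split()
    if has_seq_id:
        seq_id, sequence, raw_values = fields[0], fields[1], fields[2:]
    else:
        seq_id, sequence, raw_values = None, fields[0], fields[1:]
    return seq_id, sequence, [float(v) for v in raw_values]


# Made-up rows for illustration only.
sequence_row = "seq1 MKVL 0.25"             # one value per sequence
residues_row = "seq2 GSHM 0.1 0.2 0.3 0.4"  # one value per residue

seq_result = parse_prediction_line(sequence_row)
res_result = parse_prediction_line(residues_row)
```

For ‘residues’ output, checking that the number of values equals the sequence length is a cheap way to confirm the file was parsed with the right column assumptions.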