parrot-predict
parrot-predict is a command for making predictions using a trained PARROT network. The parrot-train and parrot-optimize commands both output a file with trained network weights, and parrot-predict can load this trained network to make new predictions on unlabeled sequences. The predictions are written to a text file at a specified location. Note that this command will only make predictions for non-redundant sequences in the provided file. Currently, users must supply the same hyperparameters (--num-layers and --hidden-size) that they used to train their network originally. In future versions of PARROT, parrot-predict will be able to read in the saved network and detect these hyperparameters automatically.
Once PARROT is installed, the user can run parrot-predict from the command line:
$ parrot-predict seq_file saved_network output_file <flags>
Here, seq_file is a file containing a list of sequences. Each line of seq_file should have two whitespace-separated columns: a sequence ID and the amino acid sequence. Optionally, the file may be formatted without the sequence IDs (see the --exclude-seq-id flag). Two example seq_file files can be found in the /data folder. saved_network is the path to the file containing the trained network weights. output_file is the path where the predictions will be saved as a text file.
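As a minimal sketch of the two seq_file layouts described above, the following writes a two-column file and then derives an ID-free variant from it. The file names, sequence IDs and sequences are made up for illustration and are not taken from the /data folder.

```shell
# Write a two-column seq_file: sequence ID, then amino acid sequence.
# IDs and sequences here are illustrative placeholders.
printf '%s\n' \
  'seq1 MKTAYIAKQRQISFVKSHFSRQ' \
  'seq2 GSSGSSGSSGEEEEKKKK' > example_seqs.txt

# Derive a variant without sequence IDs by keeping only the second column.
# A file like this would be used together with --exclude-seq-id.
awk '{ print $2 }' example_seqs.txt > example_seqs_noid.txt
```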
Required flags:
--datatype/-d: Describes how values are formatted in datafile. Should be ‘sequence’ if there is a single value per sequence, or ‘residues’ if there are values for every residue in each sequence. See the example datasets in the data folder for more information.
--classes/-c: The number of classes for the machine learning task. If the task is regression, then specify ‘1’.
Optional flags:
--help/-h: Display a help message.
--num-layers/-nl: Number of hidden layers in the network (default is 1). Must be a positive integer and must be identical to the number of layers used when the network was trained.
--hidden-size/-hs: Size of hidden vectors within the network (default is 10). Must be a positive integer and must be identical to the hidden size used when the network was trained.
--encode: Include this flag to specify the numeric encoding scheme for each amino acid. Available options are ‘onehot’ (default), ‘biophysics’ or user-specified. If you wish to manually specify an encoding scheme, provide a path to a text file describing the amino acid to vector mapping. The encoding scheme used for sequence prediction must be identical to that used for network training.
--exclude-seq-id: Include this flag if the seq_file is formatted without sequence IDs as the first column in each row.
--probabilistic-classification: Include this flag to output class predictions as continuous values [0-1], based on the probability that the input sample belongs to each class. Currently only implemented for sequence classification. (NOTE: This is a new feature, let us know if you run into any issues!)
--silent: Flag which, if provided, ensures no output is generated to the terminal.
--print-frequency: Value that defines how often status updates should be printed, in number of sequences predicted (default is 1000).
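To show how the positional arguments and flags listed above fit together, here is a sketch of one possible invocation. It assumes a sequence-level regression network that was trained with 2 layers and a hidden size of 10; those values and all three file names are hypothetical and must be replaced with the ones matching your own trained network. The command only runs if parrot-predict is installed and the network file exists; otherwise it is printed.

```shell
# Hypothetical paths; substitute your own files.
SEQ_FILE="example_seqs.txt"
NETWORK="trained_network.pt"
OUT_FILE="predictions.txt"

# Required flags first, then the hyperparameters used at training time.
# (Assumed values: one value per sequence, regression, 2 layers, hidden size 10.)
ARGS="--datatype sequence --classes 1 --num-layers 2 --hidden-size 10 --silent"

if command -v parrot-predict >/dev/null 2>&1 && [ -f "$NETWORK" ]; then
  # $ARGS is intentionally unquoted so it splits into separate flags.
  parrot-predict "$SEQ_FILE" "$NETWORK" "$OUT_FILE" $ARGS
else
  # Show the command that would be run.
  echo "parrot-predict $SEQ_FILE $NETWORK $OUT_FILE $ARGS"
fi
```

If the network had been trained with different --num-layers or --hidden-size values, the same values would need to be passed here, since the flags must match those used for training.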
Output:
parrot-predict will produce a single text file as output, and will also print status updates to the console (unless --silent is specified). The output file is formatted similarly to the original datafiles used for network training: each row contains a sequence ID (excluded if the flag --exclude-seq-id is given), an amino acid sequence, and the prediction values for that sequence.
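As a sketch of how such an output file might be post-processed, the following builds a small mock file in the row layout described above (sequence ID, sequence, one prediction value) and extracts the ID and prediction columns. The mock rows are invented for illustration and are not real PARROT output, and whitespace-separated columns are an assumption carried over from the input format.

```shell
# Mock output rows: ID, sequence, single predicted value (illustrative only).
printf '%s\n' \
  'seq1 MKTAYIAKQRQISFVKSHFSRQ 0.42' \
  'seq2 GSSGSSGSSGEEEEKKKK 0.87' > mock_predictions.txt

# Keep the sequence ID and the prediction, dropping the sequence itself.
# Assumes whitespace-separated columns with the prediction in column 3.
awk '{ print $1, $3 }' mock_predictions.txt > id_and_prediction.txt
```

For per-residue predictions each row would carry one value per residue, so the column selection would need to be adapted accordingly.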