Advanced Examples:

The core usage of PARROT is designed to be as simple as possible so that anyone, regardless of computational expertise, can train a network on their dataset with minimal investment. However, on top of this basic implementation, PARROT has a number of options intended to let experienced users tailor their networks to their particular needs and to facilitate more sophisticated computational workflows.

Advanced parrot-train options:

Automatic determination of number of training epochs with --stop:

This flag determines the stop condition for network training. Currently, two options are implemented: ‘iter’ or ‘auto’. All of the previous examples used the default behavior, ‘iter’, which means that the number specified with the -e flag is the number of epochs the network is trained for. Alternatively, ‘auto’ means that training stops automatically once performance on the validation set has plateaued for -e epochs. Thus, with ‘auto’ it is recommended to use a smaller value (10-20) for -e so that training does not run for an unnecessarily long time.

parrot-train data/seq_regress_dataset.tsv stop_example.pt --datatype sequence -c 1 -nl 2 -hs 5 -lr 0.001 -e 10 -b 32 -v --stop auto
PARROT with user-specified parameters
-------------------------------------
Train on:   cpu
Datatype:   sequence
ML Task:    regression
Learning rate:  0.001000
Number of layers:   2
Hidden vector size: 5
Batch size: 32

Validation set loss per epoch:

Epoch 0 Loss 0.1779
Epoch 1 Loss 0.1752
Epoch 2 Loss 0.1727
...
Epoch 98    Loss 0.0456
Epoch 99    Loss 0.0456
Epoch 100   Loss 0.0456
Epoch 101   Loss 0.0456
Epoch 102   Loss 0.0456
Epoch 103   Loss 0.0456
Epoch 104   Loss 0.0456
Epoch 105   Loss 0.0456
Epoch 106   Loss 0.0456
Epoch 107   Loss 0.0456
Epoch 108   Loss 0.0456
Epoch 109   Loss 0.0456
Epoch 110   Loss 0.0455
Epoch 111   Loss 0.0455
Epoch 112   Loss 0.0455

Training stops here because performance has stopped improving. Worth mentioning: in some cases, such as with this dataset, ‘auto’ can get stuck in a local minimum well before the network is fully trained. Be mindful of this when using the ‘auto’ stop condition.

You might also notice that in this example, the validation loss is listed for every single epoch instead of every 5. This is simply because the verbose -v flag was provided.
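For intuition, the plateau check performed by ‘auto’ is conceptually similar to the sketch below. This is a generic illustration of early stopping, not PARROT’s exact criterion, and the patience and tolerance values are placeholders.

def plateaued(val_losses, patience=10, tol=1e-4):
    # Stop once the most recent 'patience' epochs show no meaningful improvement
    # over the best validation loss seen before them.
    if len(val_losses) <= patience:
        return False
    best_before = min(val_losses[:-patience])
    recent_best = min(val_losses[-patience:])
    return recent_best > best_before - tol

Here ‘patience’ plays roughly the role of the -e value when --stop auto is used.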

Splitting data into train/validation/test sets:

--set-fractions: This flag allows the user to set the proportions of data assigned to the training set, validation set, and test set. By default, the split is 70:15:15. This flag takes three arguments, each between 0 and 1, that must sum to 1.

parrot-train data/seq_regress_dataset.tsv setfractions_network.pt --datatype sequence -c 1 -e 200 --set-fractions 0.5 0.45 0.05

Notice that the output predictions file from this command has fewer datapoints because of the reduced test set. Most likely, the accuracy will be a little worse than with the default proportions because the training set is also smaller.

--split: In some cases, users might want precise control over the training, validation, and test set splits of their input data. This flag allows the user to manually specify the set to which each sample in their dataset is assigned. It requires an argument that is a path to a split_file, which explicitly allocates the sequences in the datafile to the different sets. An example split_file is provided in the /data folder for reference.

parrot-train data/seq_regress_dataset.tsv manualsplit_network.pt --datatype sequence -c 1 -e 200 --split data/split_file.tsv

This can be especially useful if you wish to perform k-fold cross-validation on your dataset, as you can prepare k different split_files, each placing a different 1/kth of your dataset into the test set.
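As an illustration, the sketch below partitions dataset rows into k folds and writes one split_file per fold. It assumes a split_file consists of three whitespace-separated lines of row indices (training, validation, test); confirm this against the bundled data/split_file.tsv before relying on it.

import random

def write_kfold_splits(n_samples, k, val_fraction=0.15, prefix="split"):
    # Shuffle row indices and carve them into k equally sized test folds.
    indices = list(range(n_samples))
    random.shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i, test_idx in enumerate(folds):
        test_set = set(test_idx)
        remaining = [idx for idx in indices if idx not in test_set]
        n_val = int(len(remaining) * val_fraction)
        val_idx, train_idx = remaining[:n_val], remaining[n_val:]
        # Assumed layout: line 1 = training rows, line 2 = validation rows,
        # line 3 = test rows -- verify against the example split_file.
        with open(f"{prefix}_{i}.tsv", "w") as f:
            for subset in (train_idx, val_idx, test_idx):
                f.write(" ".join(str(idx) for idx in subset) + "\n")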

--save-splits: Sometimes a random partition into training/validation/test sets is acceptable, but for reproducibility it is helpful to know where each sample was assigned. For example, if you are comparing multiple types of machine learning networks, it is best practice to use the same training set for each network. Including this flag causes an additional text file (suffix: “_split_file.txt”) to be saved to the output directory. This file is formatted in the same way as a split_file for use with the --split flag.
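For example, a command along these lines (the output network name here is arbitrary) trains as usual and additionally records which samples landed in each set:

parrot-train data/seq_regress_dataset.tsv savesplits_network.pt --datatype sequence -c 1 -e 200 --save-splits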

Amino acid -> vector encoding:

“Encoding” in the context of PARROT refers to the process of converting a sequence of amino acids into computer-readable numeric vectors. By default, PARROT utilizes one-hot encoding, which represents each amino acid as a vector with 19 zeros and a single 1, where the position of the 1 determines its identity. However, users can change how amino acids are encoded using the --encode flag.
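To make the default behavior concrete, the snippet below builds one-hot vectors for a sequence in the way described above. It is a standalone illustration of the encoding idea, not PARROT’s internal code.

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 canonical residues

def one_hot(sequence):
    # Each residue becomes a length-20 vector of zeros with a single 1
    # at the position corresponding to its identity.
    vectors = []
    for aa in sequence:
        vec = [0] * len(AMINO_ACIDS)
        vec[AMINO_ACIDS.index(aa)] = 1
        vectors.append(vec)
    return vectors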

In addition to one-hot encoding, encoding with biophysical scales (a vector of properties such as charge, hydrophobicity, molecular weight, etc.) is also hard-coded into PARROT. Machine learning with biophysical encoding can be carried out by providing ‘biophysics’ after this flag.

parrot-train data/seq_regress_dataset.tsv biophysics_network.pt -d sequence -c 1 -nl 2 -hs 10 -e 200 --encode biophysics

More powerfully, PARROT also allows users to specify their own encoding scheme. An example encoding file can be found in the /data folder. In this case, provide the path to the encoding file after the flag.

parrot-train data/seq_regress_dataset.tsv userencode_network.pt -d sequence -c 1 -nl 2 -hs 10 -e 200 --encode data/encoding_example.txt

With the --encode flag and a user-provided file, PARROT is even flexible enough to work on nucleotide sequences! To illustrate this, we’ve included the file “nucleotide_encoding.txt” which can be passed in via this flag to one-hot encode nucleotide sequences. We’ve also included an example sequence regression dataset (melting temperature prediction) with nucleotide sequences: “nucleotide_dataset.tsv”.

parrot-train data/nucleotide_dataset.tsv nucleotide_network.pt -d sequence -c 1 -nl 2 -hs 10 -e 200 --encode data/nucleotide_encoding.txt

Probabilistic classification with --probabilistic-classification:

The standard behavior of “classification” tasks in PARROT is to predict discrete class labels. However, this behavior does not provide any information on the certainty behind these predictions. For example, in a two-class problem (classes 0 and 1), if sequence A is deemed class 0 with 98% confidence and sequence B is deemed class 0 with 51% confidence, both sequences will appear in the output prediction file as class 0. In some instances, it is useful to provide users with a measure of confidence for each of the class predictions that PARROT makes. This can be accomplished with the --probabilistic-classification flag.

This flag is easy to use and works with parrot-train, parrot-optimize, and parrot-predict. For the first two commands, it changes how predictions on the test set are written to the “_predictions.tsv” file; if combined with --include-figs, it also changes the figures and performance metrics produced for evaluating performance on the test set (see the parrot-train documentation page for more details). For parrot-predict, it changes how the predictions are output. Conveniently, this flag can be used with parrot-predict even if it was not specified during training. As an example, here is the same sequence 3-class classification network making predictions with and without the --probabilistic-classification flag (default number of layers and hidden vector size):

parrot-predict data/seqfile.txt prob_example.pt discrete.txt -d sequence -c 3

Output:

a1 EADDGLYWQQN 2
b2 RRLKHEEDSTSTSTSTSTQ 0
c3 YYYGGAFAFAGRM 2
d4 GGIL 2
e5 GREPCCMLLYILILAAAQRDESSSSST 2
f6 PGDEADLGHRSLVWADD 2

parrot-predict data/seqfile.txt prob_example.pt probabilistic.txt -d sequence -c 3 --probabilistic-classification

Output:

a1 EADDGLYWQQN 0.0527 0.1081 0.8392
b2 RRLKHEEDSTSTSTSTSTQ 0.9819 0.0034 0.0148
c3 YYYGGAFAFAGRM 0.0742 0.0098 0.916
d4 GGIL 0.1509 0.0596 0.7894
e5 GREPCCMLLYILILAAAQRDESSSSST 0.0465 0.0118 0.9418
f6 PGDEADLGHRSLVWADD 0.0645 0.2576 0.678

The three numbers following each sequence represent the probability that the sequence belongs to each of the three classes. Notice the numbers in each row sum to 1.
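This row-wise behavior is what a softmax over per-class scores produces. The sketch below shows that conversion in isolation; it illustrates the general technique rather than PARROT’s exact internals.

import math

def softmax(scores):
    # Convert raw per-class scores into probabilities that sum to 1.
    exps = [math.exp(s - max(scores)) for s in scores]  # subtract max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

softmax([0.2, 1.1, 3.0])  # -> roughly [0.05, 0.12, 0.83], which sums to 1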

Currently, probabilistic classification is only implemented for sequence classification problems. The same principles would work for residue classification; however, we have not yet settled on a convenient way of representing the information in the output files (each sequence has num_classes x seq_len values).

Hyperparameter tuning with parrot-optimize:

parrot-optimize trains a network just as parrot-train does; however, it does not require the user to specify hyperparameters. Instead, it relies upon Bayesian optimization to select them automatically. Although Bayesian optimization is much more efficient than grid search, it still requires many iterations to converge upon the best hyperparameters. Additionally, this command relies upon 5-fold cross-validation for each set of hyperparameters to achieve an accurate estimate of network performance. Altogether, this means that parrot-optimize can take over 100x longer to run than parrot-train. It is strongly recommended to only run this command on a machine with a GPU.

Nonetheless, usage for parrot-optimize is remarkably similar to parrot-train, since many of the flags are identical. As an example, let’s run the command on a residue classification dataset:

parrot-optimize data/res_class_dataset.tsv optimize_example.pt -d residues -c 3 -e 200 --max-iter 20 -b 32 --verbose

Notice how we do not need to specify the number of layers, hidden vector size, or learning rate, as these are the parameters being optimized. Perhaps the most important consideration is the number of epochs. Running the optimization procedure with a large number of epochs is more likely to identify the best-performing hyperparameters; however, more epochs also means significantly longer run times. IMPORTANT: we only used 20 iterations and 200 epochs here to speed up the example, but it is HIGHLY recommended to use at least the default number of iterations for normal usage. It is also recommended to explore your data using parrot-train with a few different parameter combinations and to visualize the training and validation loss per epoch in order to pick the optimal number of epochs. Ideally, the number of epochs should be set to around the point where validation accuracy tends to plateau during training.

Let’s break down what is output to console during the optimization procedure:

PARROT with hyperparameter optimization
---------------------------------------
Train on:   cuda
Datatype:   residues
ML Task:    classification
Batch size: 32
Number of epochs:   200
Number of optimization iterations:  20


Initial search results:
lr  nl  hs  output
0.00100  1  20  11.6680
0.00100  2  20  11.2927
0.00100  3  20  11.0651
0.00100  4  20  10.9217
0.00100  5  20  11.2689
0.01000  2  20  10.7816
0.00050  2  20  11.6328
0.00010  2  20  13.6755
0.00001  2  20  32.7119
0.00100  2   5  11.2988
0.00100  2  15  11.1669
0.00100  2  35  11.2267
0.00100  2  50  11.0833
Noise estimate: 0.7594081234203327

The first chunk of text details the network performance (averaged across the 5 cross-validation folds) during the initial stage of hyperparameter optimization. This stage is used to gather an estimate of the noise (the standard deviation across cross-validation folds) for the subsequent optimization. The hyperparameters used in the initial search stage are hard-coded into the optimization procedure.
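For readers unfamiliar with this part of the procedure, here is a minimal sketch of how a k-fold performance estimate and its noise might be computed for one candidate set of hyperparameters. It is a generic illustration, not PARROT’s implementation, and train_and_evaluate is a hypothetical stand-in for a full training run.

import statistics

def cross_val_performance(samples, hyperparams, train_and_evaluate, k=5):
    # Each fold takes one turn as the validation set while the remaining folds
    # are used for training; the mean loss scores the hyperparameters and the
    # standard deviation across folds provides a noise estimate.
    folds = [samples[i::k] for i in range(k)]
    losses = []
    for i in range(k):
        val_fold = folds[i]
        train_folds = [s for j, fold in enumerate(folds) if j != i for s in fold]
        losses.append(train_and_evaluate(train_folds, val_fold, hyperparams))
    return statistics.mean(losses), statistics.stdev(losses)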

Primary optimization:
--------------------

Learning rate   |   n_layers   |   hidden vector size |  avg CV loss
======================================================================
  0.010000  |      3       |         20           |    10.593
  0.005001  |      3       |         19           |    10.820
  0.010000  |      4       |         21           |    10.715
  0.005513  |      3       |         21           |    10.852
  0.000744  |      4       |         21           |    11.113
  0.004678  |      5       |         22           |    10.847
  0.008415  |      4       |         22           |    10.550
  0.000954  |      4       |         23           |    11.024
  0.010000  |      3       |         23           |    10.597
  0.010000  |      4       |         24           |    10.559
  0.002181  |      3       |         24           |    10.757
  0.000709  |      4       |         25           |    11.065
  0.001744  |      5       |         24           |    11.281
  0.010000  |      3       |         25           |    10.707
  0.010000  |      2       |         22           |    10.869
  0.010000  |      2       |         24           |    10.758
  0.000822  |      2       |         25           |    11.275
  0.000859  |      2       |         23           |    11.100
  0.010000  |      5       |         26           |    10.817
  0.010000  |      4       |         30           |    10.774

The optimal hyperparameters are:
lr = 0.00841
nl = 4
hs = 22

This long block of text shows the main optimization process. The algorithm automatically selects the learning rate, number of layers, and hidden vector size for each iteration. Finally, after the algorithm runs for 20 iterations (default: 50), the optimal hyperparameters are determined. These hyperparameters are also saved to a text file called ‘optimal_hyperparams.txt’ in the output directory. You might notice that the optimization procedure does not appear to sample the entire hyperparameter space; this is because we specified fewer iterations than normally recommended.

Training with optimal hyperparams:
Epoch 0 Loss 31.7953
Epoch 1 Loss 30.4627
Epoch 2 Loss 22.8318
Epoch 3 Loss 26.4293
Epoch 4 Loss 17.9814
Epoch 5 Loss 15.7970
Epoch 6 Loss 15.0506
Epoch 7 Loss 13.6761
Epoch 8 Loss 13.8338
Epoch 9 Loss 14.3309
Epoch 10    Loss 13.1378
...
Epoch 396   Loss 40.0893
Epoch 397   Loss 40.9645
Epoch 398   Loss 41.5348
Epoch 399   Loss 41.8932

Test Loss: 11.1555

Lastly, a network is trained on all of the training data using the optimal hyperparameters and tested on the held-out test set. The output produced is analogous to that of parrot-train.

Integrating trained PARROT networks into Python workflows:

We added the option for users to create a predictor object in Python using their trained PARROT network. This functionality is built into the file “py_predictor.py” that is installed with PARROT. Importing PARROT within Python is simple:

>>> from parrot import py_predictor as ppp

To use a saved network, you need to create a Predictor() object. Initializing this object only requires the path to the saved network weights and specification of whether this network is for sequence or residue prediction.

>>> my_predictor = ppp.Predictor('/path/to/network.pt', dtype='sequence')

Now we’re ready to make predictions! Once a network is loaded, the time to make predictions is negligible, so your predictor can be applied to as many sequences as you want. Just feed in amino acid sequences to the predict() function one at a time and predicted values will be output.

>>> value = my_predictor.predict('MYTESTAMINACIDSEQ')
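Applying the loaded predictor across many sequences is just a loop or comprehension over the same predict() call (the sequences below are arbitrary placeholders):

>>> sequences = ['MYTESTAMINACIDSEQ', 'EADDGLYWQQN', 'GGIL']
>>> values = [my_predictor.predict(seq) for seq in sequences]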

Currently, this Python usage is only implemented for networks that were created using standard, one-hot amino acid encoding. In the future, we may add the option to feed in a particular encoding file so that all trained networks can be used in this manner. If this is a feature you’d be interested in, let us know and we can prioritize adding it!