====================
 Advanced Examples:
====================

The core usage of PARROT is designed to be as simple as possible so that anyone, regardless of computational expertise, can train a network on their dataset with minimal investment. However, on top of this basic implementation, PARROT has a number of options intended to let experienced users tailor their networks to their particular needs and to facilitate more sophisticated computational workflows.

Advanced ``parrot-train`` options:
----------------------------------

**Automatic determination of number of training epochs with --stop:**

This flag determines the stop condition for network training. Currently, two options are implemented: 'iter' and 'auto'. All of the previous examples used the default behavior, 'iter', which means that the number specified for the ``-e`` flag is the number of epochs for which the network is trained. Alternatively, 'auto' means that training stops automatically once performance on the validation set has plateaued for ``-e`` epochs. Thus, with 'auto' it is recommended to use a smaller number of epochs (10-20) for ``-e`` so that training does not run for an excessively long time.

.. code-block:: bash

    parrot-train data/seq_regress_dataset.tsv stop_example.pt --datatype sequence -c 1 -nl 2 -hs 5 -lr 0.001 -e 10 -b 32 -v --stop auto

.. code-block:: bash

    PARROT with user-specified parameters
    -------------------------------------
    Train on:   cpu
    Datatype:   sequence
    ML Task:    regression
    Learning rate:  0.001000
    Number of layers:   2
    Hidden vector size: 5
    Batch size: 32

    Validation set loss per epoch:

    Epoch 0 Loss 0.1779
    Epoch 1 Loss 0.1752
    Epoch 2 Loss 0.1727
    ...
    Epoch 98    Loss 0.0456
    Epoch 99    Loss 0.0456
    Epoch 100   Loss 0.0456
    Epoch 101   Loss 0.0456
    Epoch 102   Loss 0.0456
    Epoch 103   Loss 0.0456
    Epoch 104   Loss 0.0456
    Epoch 105   Loss 0.0456
    Epoch 106   Loss 0.0456
    Epoch 107   Loss 0.0456
    Epoch 108   Loss 0.0456
    Epoch 109   Loss 0.0456
    Epoch 110   Loss 0.0455
    Epoch 111   Loss 0.0455
    Epoch 112   Loss 0.0455

Training stops here because performance has stopped improving. Worth mentioning: in some cases such as this dataset, 'auto' can actually get stuck in a local minimum well before the network is fully trained. Be mindful of this when using 'auto' stop condition.
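
To make the plateau idea concrete, here is a minimal sketch of a patience-based stop rule. It is an illustration only: the function name ``epochs_until_auto_stop`` and the exact rule (stop after ``patience`` consecutive epochs without a new best validation loss) are assumptions, not PARROT's actual implementation.

.. code-block:: python

    def epochs_until_auto_stop(val_losses, patience):
        """Return the epoch index at which training would stop.

        Assumed rule (illustrative, not PARROT's exact code): stop once
        `patience` consecutive epochs pass without a new best validation loss.
        """
        best = float("inf")
        since_best = 0
        for epoch, loss in enumerate(val_losses):
            if loss < best:
                best = loss
                since_best = 0      # new best: reset the counter
            else:
                since_best += 1     # no improvement this epoch
            if since_best >= patience:
                return epoch
        return len(val_losses) - 1  # never plateaued; all epochs were used

    # Hypothetical validation losses, one per epoch
    losses = [0.50, 0.40, 0.35, 0.35, 0.36, 0.35, 0.37]
    stop_epoch = epochs_until_auto_stop(losses, patience=3)
    print(stop_epoch)

A rule of this kind cannot tell a true optimum from a flat stretch of the loss curve, which is why a small ``-e`` value can end training early.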

You might also notice that in this example, the validation loss is listed for every single epoch instead of every 5. This is simply because the verbose ``-v`` flag was provided.

**Splitting data into train/validation/test sets:**

``--set-fractions``:
This flag allows the user to set the proportions of data that will be a part of the training set, validation set, and test set. By default, the split is 70:15:15. This flag takes three input arguments, between 0 and 1, that must sum to 1.

.. code-block:: bash

    parrot-train data/seq_regress_dataset.tsv setfractions_network.pt --datatype sequence -c 1 -e 200 --set-fractions 0.5 0.45 0.05

Notice that the output predictions file from this command has fewer datapoints because of the reduced test set. Most likely, the accuracy will be a little worse than with the default proportions because the training set is also smaller.
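
As a rough illustration of what the three fractions mean, the sketch below checks them and partitions sample indices accordingly. The helper ``split_indices``, the strict 0-to-1 check, the fixed seed and the rounding rule are all assumptions made for this example; PARROT's internal splitting may differ in detail.

.. code-block:: python

    import math
    import random

    def split_indices(n_samples, fractions, seed=0):
        """Randomly partition range(n_samples) into train/val/test index lists."""
        if len(fractions) != 3 or not all(0 < f < 1 for f in fractions):
            raise ValueError("expected three fractions between 0 and 1")
        if not math.isclose(sum(fractions), 1.0):
            raise ValueError("fractions must sum to 1")

        indices = list(range(n_samples))
        random.Random(seed).shuffle(indices)  # seeded for reproducibility

        n_train = round(fractions[0] * n_samples)
        n_val = round(fractions[1] * n_samples)
        train = indices[:n_train]
        val = indices[n_train:n_train + n_val]
        test = indices[n_train + n_val:]      # test set takes the remainder
        return train, val, test

    # Hypothetical dataset of 200 samples with the fractions used above
    train, val, test = split_indices(200, (0.5, 0.45, 0.05))
    print(len(train), len(val), len(test))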

``--split``:
In some cases, users might want precise control over the training, validation and test set splits of their input data. This flag allows the user to manually specify the subset to which each sample in their dataset will be assigned. It requires an argument that is a path to a *split_file*, which explicitly allocates the sequences in ``datafile`` to the different datasets. An example *split_file* is provided in the **/data** folder for reference.

.. code-block:: bash

    parrot-train data/seq_regress_dataset.tsv manualsplit_network.pt --datatype sequence -c 1 -e 200 --split data/split_file.tsv 

This can especially be useful if you wish to perform k-fold cross-validation on your dataset, as you can prepare k different split_files that each specify a particular 1/kth of your dataset into the test set.
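
The index bookkeeping for such a k-fold setup could look like the sketch below. The function ``kfold_partitions`` and its ``val_fraction`` parameter are hypothetical. The sketch only computes the index lists and does not write them to disk, so consult the example *split_file* in **/data** for the format PARROT expects.

.. code-block:: python

    import random

    def kfold_partitions(n_samples, k, val_fraction=0.15, seed=0):
        """Yield (train, val, test) index lists, one tuple per fold.

        Each sample lands in the test set of exactly one fold. The
        validation share taken from the remaining samples is an assumption.
        """
        indices = list(range(n_samples))
        random.Random(seed).shuffle(indices)
        folds = [indices[i::k] for i in range(k)]  # k near-equal chunks

        for i in range(k):
            test = folds[i]
            rest = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
            n_val = round(val_fraction * len(rest))
            yield rest[n_val:], rest[:n_val], test

    # Hypothetical dataset of 100 samples, 5 folds
    partitions = list(kfold_partitions(n_samples=100, k=5))
    for train, val, test in partitions:
        print(len(train), len(val), len(test))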

``--save-splits``:
Sometimes, a random partition into training/validation/test sets is acceptable, but for replicability it is helpful to know where each sample was assigned. For example, if you are comparing multiple types of machine learning networks, it is best practice to use the same training set for each network. Including this flag causes an additional text file (suffix: "_split_file.txt") to be saved to the output directory. This file is formatted in the same way as a *split_file* for use with the ``--split`` flag.

**Amino acid -> vector encoding:**

"Encoding" in the context of PARROT refers to the process of converting a sequence of amino acids into computer-readable numeric vectors. By default, PARROT utilizes *one-hot* encoding, which represents each amino acid as a vector with 19 zeros and a single 1, where the position of the 1 determines its identity. However, users can change how amino acids are encoded using the ``--encode`` flag. 

In addition to one-hot encoding, encoding using biophysical scales (a vector of properties like charge, hydrophobicity, molecular weight, etc.) is also hard-coded into PARROT. Machine learning using biophysical encoding can be carried out by providing 'biophysics' after this flag.

.. code-block:: bash

    parrot-train data/seq_regress_dataset.tsv biophysics_network.pt -d sequence -c 1 -nl 2 -hs 10 -e 200 --encode biophysics
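
For comparison, the default one-hot scheme described above can be sketched in a few lines. The alphabetical residue ordering in ``AMINO_ACIDS`` and the helper name ``one_hot_encode`` are assumptions for illustration; the ordering PARROT uses internally may differ.

.. code-block:: python

    # The 20 standard amino acids (ordering here is an assumption)
    AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

    def one_hot_encode(sequence):
        """Convert a sequence into a list of length-20 one-hot vectors."""
        vectors = []
        for residue in sequence:
            vec = [0] * len(AMINO_ACIDS)
            vec[AMINO_ACIDS.index(residue)] = 1  # ValueError on unknown residues
            vectors.append(vec)
        return vectors

    encoded = one_hot_encode("GGIL")
    for residue, vec in zip("GGIL", encoded):
        print(residue, vec)

Each residue becomes a vector with 19 zeros and a single 1, so a sequence of length L becomes an L x 20 matrix. A biophysical or user-defined encoding changes only the vector assigned to each residue.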

More powerfully, PARROT also allows the user to manually specify their own encoding scheme, if they desire. An example encoding file can be found in the **/data** folder. In this case, provide the path to this encoding file following the flag.

.. code-block:: bash

    parrot-train data/seq_regress_dataset.tsv userencode_network.pt -d sequence -c 1 -nl 2 -hs 10 -e 200 --encode data/encoding_example.txt

With the ``--encode`` flag and a user-provided file, PARROT is even flexible enough to work on nucleotide sequences! To illustrate this, we've included the file "nucleotide_encoding.txt" which can be passed in via this flag to one-hot encode nucleotide sequences. We've also included an example sequence regression dataset (melting temperature prediction) with nucleotide sequences: "nucleotide_dataset.tsv".

.. code-block:: bash

    parrot-train data/nucleotide_dataset.tsv nucleotide_network.pt -d sequence -c 1 -nl 2 -hs 10 -e 200 --encode data/nucleotide_encoding.txt

**Probabilistic classification with --probabilistic-classification:**

The standard behavior of "classification" tasks in PARROT is to make predictions of discrete class labels. In reality though, this sort of behavior does not provide any information on the certainty behind these predictions. For example, in a two-class problem (classes 0 and 1), if sequence A is deemed to be class 0 with 98% confidence, and sequence B is deemed class 0 with 51% confidence, both of these sequences will appear in the output prediction file as class 0. In some instances, it is useful to provide users a measure of confidence for each of the class predictions that PARROT makes. This can be accomplished with the ``--probabilistic-classification`` flag.

This flag is easy to use and works with ``parrot-train``, ``parrot-optimize`` and ``parrot-predict``. For the first two commands, it changes how predictions on the test set are written to the "_predictions.tsv" file. If it is combined with ``--include-figs``, it also changes the figures and performance metrics that are produced for evaluating performance on the test set (see the ``parrot-train`` documentation page for more details). For ``parrot-predict``, it changes how the predictions are output. Conveniently, this flag can be used in ``parrot-predict`` even if it was not specified during training. As an example, here is the same sequence 3-class classification network making predictions with and without the ``--probabilistic-classification`` flag (default layers and hidden vector size):

.. code-block:: bash

    parrot-predict data/seqfile.txt prob_example.pt discrete.txt -d sequence -c 3

Output:

.. code-block:: bash

    a1 EADDGLYWQQN 2
    b2 RRLKHEEDSTSTSTSTSTQ 0
    c3 YYYGGAFAFAGRM 2
    d4 GGIL 2
    e5 GREPCCMLLYILILAAAQRDESSSSST 2
    f6 PGDEADLGHRSLVWADD 2

.. code-block:: bash

    parrot-predict data/seqfile.txt prob_example.pt probabilistic.txt -d sequence -c 3 --probabilistic-classification

Output:

.. code-block:: bash

    a1 EADDGLYWQQN 0.0527 0.1081 0.8392
    b2 RRLKHEEDSTSTSTSTSTQ 0.9819 0.0034 0.0148
    c3 YYYGGAFAFAGRM 0.0742 0.0098 0.916
    d4 GGIL 0.1509 0.0596 0.7894
    e5 GREPCCMLLYILILAAAQRDESSSSST 0.0465 0.0118 0.9418
    f6 PGDEADLGHRSLVWADD 0.0645 0.2576 0.678

The three numbers following each sequence represent the probability that the sequence belongs to each of the three classes. Notice the numbers in each row sum to 1.
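
The two output formats are related in a simple way. In the example above, the discrete label for each sequence is the class with the highest probability. The sketch below recovers labels from three of the probabilistic rows shown above. The helper ``to_discrete`` is hypothetical, and it assumes the whitespace-delimited layout displayed here.

.. code-block:: python

    # Three rows copied from the probabilistic output above
    rows = """\
    a1 EADDGLYWQQN 0.0527 0.1081 0.8392
    b2 RRLKHEEDSTSTSTSTSTQ 0.9819 0.0034 0.0148
    d4 GGIL 0.1509 0.0596 0.7894
    """

    def to_discrete(line):
        """Return (sequence ID, most probable class, its probability)."""
        seq_id, _sequence, *probs = line.split()
        probs = [float(p) for p in probs]
        label = max(range(len(probs)), key=probs.__getitem__)  # argmax
        return seq_id, label, probs[label]

    results = [to_discrete(line) for line in rows.splitlines()]
    for seq_id, label, confidence in results:
        print(seq_id, label, confidence)

Keeping the probability alongside the label makes it easy to filter out low-confidence predictions downstream.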

Currently, probabilistic classification is only implemented for *sequence classification* problems. The same principles would work for *residue classification*, but we have not yet settled on a convenient way of representing the information in the output files (each sequence has num_classes x seq_len values).

Hyperparameter tuning with ``parrot-optimize``:
-----------------------------------------------

``parrot-optimize`` trains a network like ``parrot-train``, but it does not require the user to specify hyperparameters. Instead, it relies upon Bayesian optimization to select hyperparameters automatically. Although Bayesian optimization is much more efficient than grid search, it still requires many iterations to converge upon the best hyperparameters. Additionally, this command uses 5-fold cross-validation for each set of hyperparameters to achieve an accurate estimate of network performance. Altogether, this means that ``parrot-optimize`` can take over 100x longer to run than ``parrot-train``. It is strongly recommended to only run this command on a machine with a GPU.

Nonetheless, usage for ``parrot-optimize`` is remarkably similar to ``parrot-train``, since many of the flags are identical. As an example, let's run the command on a residue classification dataset:

.. code-block:: bash

    parrot-optimize data/res_class_dataset.tsv optimize_example.pt -d residues -c 3 -e 200 --max-iter 20 -b 32 --verbose

Notice how we do not need to specify the number of layers, hidden vector size, or learning rate, as these are the parameters we are optimizing. Perhaps the most important consideration is the number of epochs. Running the optimization procedure with a large number of epochs is more likely to identify the best performing hyperparameters, but more epochs also means a significantly longer run time. **IMPORTANT: We only used 20 iterations and 200 epochs here to speed up the example, but it is HIGHLY recommended to use at least the default number of iterations for normal usage.** It is recommended to experiment with your data using ``parrot-train`` with a few different parameters and to visualize the training and validation loss per epoch in order to pick the optimal number of epochs for training. Ideally, you should set the number of epochs to be around the point where validation accuracy tends to plateau during training.

Let's break down what is output to console during the optimization procedure:

.. code-block:: bash

    PARROT with hyperparameter optimization
    ---------------------------------------
    Train on:   cuda
    Datatype:   residues
    ML Task:    classification
    Batch size: 32
    Number of epochs:   200
    Number of optimization iterations:  20


    Initial search results:
    lr  nl  hs  output
    0.00100  1  20  11.6680
    0.00100  2  20  11.2927
    0.00100  3  20  11.0651
    0.00100  4  20  10.9217
    0.00100  5  20  11.2689
    0.01000  2  20  10.7816
    0.00050  2  20  11.6328
    0.00010  2  20  13.6755
    0.00001  2  20  32.7119
    0.00100  2   5  11.2988
    0.00100  2  15  11.1669
    0.00100  2  35  11.2267
    0.00100  2  50  11.0833
    Noise estimate: 0.7594081234203327


The first chunk of text details the network performance (average of 5 data folds) during the initial stage of hyperparameter optimization. This stage is used to gather an estimate of the noise (standard deviation across cross-val folds) for future optimization. The hyperparameters used in the initial search stage are hard-coded into the optimization procedure.

.. code-block:: bash

    Primary optimization:
    --------------------

    Learning rate   |   n_layers   |   hidden vector size |  avg CV loss  
    ======================================================================
      0.010000  |      3       |         20           |    10.593
      0.005001  |      3       |         19           |    10.820
      0.010000  |      4       |         21           |    10.715
      0.005513  |      3       |         21           |    10.852
      0.000744  |      4       |         21           |    11.113
      0.004678  |      5       |         22           |    10.847
      0.008415  |      4       |         22           |    10.550
      0.000954  |      4       |         23           |    11.024
      0.010000  |      3       |         23           |    10.597
      0.010000  |      4       |         24           |    10.559
      0.002181  |      3       |         24           |    10.757
      0.000709  |      4       |         25           |    11.065
      0.001744  |      5       |         24           |    11.281
      0.010000  |      3       |         25           |    10.707
      0.010000  |      2       |         22           |    10.869
      0.010000  |      2       |         24           |    10.758
      0.000822  |      2       |         25           |    11.275
      0.000859  |      2       |         23           |    11.100
      0.010000  |      5       |         26           |    10.817
      0.010000  |      4       |         30           |    10.774

    The optimal hyperparameters are:
    lr = 0.00841
    nl = 4
    hs = 22


This long block of text is the main process of optimization. The algorithm automatically selects the learning rate, number of layers and hidden vector size for each iteration. Finally, after the algorithm runs for 20 iterations (default: 50 iterations), the optimal hyperparameters are determined. These hyperparameters are also saved to a text file called 'optimal_hyperparams.txt' in the output directory. You might notice that the optimization procedure doesn't appear to sample the entire hyperparameter space, but this is due to the fact that we specified to use fewer iterations than normally recommended.
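
In the run shown above, the reported optimum coincides with the sampled row that has the lowest average cross-validation loss. The sketch below reproduces that selection on four rows copied from the table. It is an assumption that the optimizer always reports the best sampled point rather than, for example, the optimum of its internal model.

.. code-block:: python

    # (learning rate, n_layers, hidden vector size, avg CV loss)
    # Four rows copied from the primary optimization table above
    trials = [
        (0.010000, 3, 20, 10.593),
        (0.008415, 4, 22, 10.550),
        (0.010000, 4, 24, 10.559),
        (0.000744, 4, 21, 11.113),
    ]

    # Lower loss is better, so take the row that minimizes the last column
    best = min(trials, key=lambda trial: trial[3])
    lr, nl, hs, loss = best
    print(lr, nl, hs, loss)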

.. code-block:: bash

    Training with optimal hyperparams:
    Epoch 0 Loss 31.7953
    Epoch 1 Loss 30.4627
    Epoch 2 Loss 22.8318
    Epoch 3 Loss 26.4293
    Epoch 4 Loss 17.9814
    Epoch 5 Loss 15.7970
    Epoch 6 Loss 15.0506
    Epoch 7 Loss 13.6761
    Epoch 8 Loss 13.8338
    Epoch 9 Loss 14.3309
    Epoch 10    Loss 13.1378
    ...
    Epoch 396   Loss 40.0893
    Epoch 397   Loss 40.9645
    Epoch 398   Loss 41.5348
    Epoch 399   Loss 41.8932

    Test Loss: 11.1555


Lastly, a network is trained on all the training data using the optimal hyperparameters and tested on the held-out test set. The output produced is analogous to ``parrot-train``.


Integrating trained PARROT networks into Python workflows:
----------------------------------------------------------

We added the option for users to create a predictor object in Python using their trained PARROT network. This option is built into the file "py_predictor.py" that is installed with PARROT. Importing PARROT within Python is simple:

.. code-block:: python

    >>> from parrot import py_predictor as ppp

To use a saved network, you need to create a Predictor() object. Initializing this object only requires the path to the saved network weights and specification of whether this network is for sequence or residue prediction.

.. code-block:: python

    >>> my_predictor = ppp.Predictor('/path/to/network.pt', dtype='sequence')

Now we're ready to make predictions! Once a network is loaded, the time to make predictions is negligible, so your predictor can be applied to as many sequences as you want. Just feed in amino acid sequences to the predict() function one at a time and predicted values will be output.

.. code-block:: python

    >>> value = my_predictor.predict('MYTESTAMINACIDSEQ')

Currently, this Python usage is only implemented for networks that were created using standard, one-hot amino acid encoding. In the future, we may add the option to feed in a particular encoding file so that all trained networks can be used in this manner. If this is a feature you'd be interested in, let us know and we can prioritize adding it!
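
Because ``predict()`` takes one sequence at a time, scoring many sequences amounts to a simple loop. The sketch below wraps that loop in a hypothetical helper, ``predict_many``, which accepts any object with a ``predict()`` method. The ``LengthPredictor`` class is a stand-in so the example runs without a trained network; in real use you would pass a ``ppp.Predictor`` object instead.

.. code-block:: python

    def predict_many(predictor, sequences):
        """Map each sequence ID to the value returned by predictor.predict()."""
        return {seq_id: predictor.predict(seq) for seq_id, seq in sequences.items()}

    class LengthPredictor:
        """Stand-in for ppp.Predictor (illustration only): returns sequence length."""

        def predict(self, sequence):
            return len(sequence)

    sequences = {"a1": "EADDGLYWQQN", "d4": "GGIL"}
    values = predict_many(LengthPredictor(), sequences)
    print(values)

Since the network is loaded once when the predictor is created, the cost of the loop is dominated by the individual ``predict()`` calls.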