================= Basic Examples: ================= Below are a handful of examples outlining how to apply PARROT to various machine learning tasks. Provided in the PARROT distribution on GitHub is a **/data** folder which contains several example datasets (If you installed via pip you will need to download these files from GitHub). Among these, there are datasets with different data types (sequence-mapped values and residue-mapped values) and for different machine learning task (classification and regression). Read the README within this folder for more details on how to format PARROT datasets and on the particulars of these datasets. This folder also contains an example list of sequences for ``parrot-predict``, and the other files that are used in these examples. parrot-train ============ **Sequence classification:** In our first example, each of the 300 sequences in *seq_class_dataset.tsv* belongs to one of three classes: .. code-block:: bash Frag0 WKHNPKHLRP 0 Frag1 DLFQDEDDAEEEDFMDDIWDPDS 1 Frag2 YHFAFTHMPALISQTSKYHYYSASMRG 2 Frag3 CNRNRNHKLKKFKHKKMGVPRKKRKHWK 0 ... Frag296 PDPLAMEDEVESHMEWCNRTHNRKG 2 Frag297 IWKYTHRSKACMHPH 0 Frag298 EDDEDVDENEEDDEDEEDNEEDPIE 1 Frag299 GEPCWVPYDIAQSADRMFFDKAMR 2 Let's train a network with ``parrot-train`` that learns how to classify sequences into these three data classes. Details on running the ``parrot-train`` command can be be found on its specific documentation page. For starters, we won't worry about the network hyperparameters and we'll just use the default values. In the most basic use case, all we need to provide is the datafile, the location where we want to save the trained network, and some basic information about what kind of network we are training. Here, since we are predicting only one value per sequence, we will indicate that the datatype is "sequence". We also will indicate how many data classes there are, which is '3' in this case. .. code-block:: bash parrot-train data/seq_class_dataset.tsv seq_class_network.pt --datatype sequence --classes 3 Training has a stochastic component, so running this multiple times will yield slightly different results. The output to console should look something like: .. code-block:: bash ############################################# WARNING: Batch size is large relative to the number of samples in the training set. This may decrease training efficiency. ############################################# PARROT with user-specified parameters ------------------------------------- Validation set loss per epoch: Epoch 0 Loss 1.0972 Epoch 5 Loss 1.0831 Epoch 10 Loss 1.0446 Epoch 15 Loss 0.8525 Epoch 20 Loss 0.6279 Epoch 25 Loss 0.3620 Epoch 30 Loss 0.2419 Epoch 35 Loss 0.2805 Epoch 40 Loss 0.2273 Epoch 45 Loss 0.1774 Epoch 50 Loss 0.2103 Epoch 55 Loss 0.1682 Epoch 60 Loss 0.1524 Epoch 65 Loss 0.1488 Epoch 70 Loss 0.1498 Epoch 75 Loss 0.1464 Epoch 80 Loss 0.1448 Epoch 85 Loss 0.1564 Epoch 90 Loss 0.1593 Epoch 95 Loss 0.1581 Test Loss: 0.1507 First we notice that there is a message warning us that our batch size is too large. This isn't super problematic and we can ignore it for now. In future runs, we will decrease our batch size using the ``--batch`` flag (by default it's set to 32, which is pretty large relative to our dataset of only 300 sequences). Also note that you can explicitly hide warning messages with the ``--ignore-warnings`` flag. Turning to the actual training results, we can see that our validation set loss decreases for a while, then plateaus around 0.14-0.15. This is pretty typical, generally this loss will decrease up to a certain point, then start to increase as the network begins to overfit on the training data. Don't worry about this overfitting, since the final network that PARROT returns will be from the iteration that produced the lowest validation set loss. If you look in the current directory, you should also see three files: our trained network "seq_class_network.pt", a predictions file "seq_class_network_predictions.tsv", and a performance stats summary file "seq_class_network_performace_stats.txt". The network file can be used to make predictions on new sequences with ``parrot-predict`` but is not readable by eye. The second file is a bit more interesting to look at: .. code-block:: bash Frag1_TRUE DLFQDEDDAEEEDFMDDIWDPDS 1 Frag1_PRED DLFQDEDDAEEEDFMDDIWDPDS 1 Frag20_TRUE SWQIHMPQWQCKHDMIQWLGDDAQ 2 Frag20_PRED SWQIHMPQWQCKHDMIQWLGDDAQ 2 Frag21_TRUE HQPKRKHHHYQHARHHHHKRVH 0 Frag21_PRED HQPKRKHHHYQHARHHHHKRVH 0 ... Frag273_TRUE LLHRHRFQRSTKRHLLK 0 Frag273_PRED LLHRHRFQRSTKRHLLK 0 Frag286_TRUE DDEDEDYWNEWEETEEIQESE 1 Frag286_PRED DDEDEDYWNEWEETEEIQESE 1 Frag299_TRUE GEPCWVPYDIAQSADRMFFDKAMR 2 Frag299_PRED GEPCWVPYDIAQSADRMFFDKAMR 2 **NOTE: Your file will have the same general format, but with different sequences.** These sequences are the ones that were randomly held out as a test set during the training of our network. After the network concluded training, the best-perfoming network (on the validation set) was applied to these test set sequences. **By analyzing this file, we can get an approximation of how well our network would perform on sequences it has not seen before.** This approximation may not hold in every case, but sometimes, it's the best we can do (see "Machine Learning Resources" for more info). In our case, it seems as if our network did a good job at predicting these test set sequences. The performance stats file is an extension of these test set predictions: .. code-block:: bash Matthews Correlation Coef : 1.000 F1 Score : 1.000 Accuracy : 1.000 This file quantifies performance on the test set using a variety of different metrics, which vary between classification and regression tasks. For classification, as shown here, this file reports on the accuracy, F1 score and MCC of our predictions. You can always prevent this file from being output by providing the ``--no-stats`` flag. See "Machine Learning Resources" (or Google!) for more information on how to interpret these metrics. ............................................................................... Let's demonstrate a few more features of PARROT by training another network. In this run, we'll decrease the ``--batch`` parameter to '8' to get rid of the warning. A smaller batch size will cause the network to update more often during training, which means that training will take longer overall, but the network will learn more each epoch. Additionally, we will also modify the training time with the ``--epochs`` flag. In the context of machine learning, an epoch is one "round" through the entire training set. By default, PARROT trains for 100 epochs, which means that a network will be exposed to every sequence in the training set 100 times. It's often necessary to increase this parameter to ensure that the network learns the data to its maximum potential. The remaining two flags we will add are ``--verbose`` and ``--include-figs``. "Verbose" simply causes the output to terminal to be more descriptive, printing the training results after every epoch instead of every 5. As the name suggests, "include-figs" will cause PNG images to be output into the same directory that we are saving the network. .. code-block:: bash parrot-train data/seq_class_dataset.tsv seq_class_network.pt --datatype sequence --classes 3 --batch 8 --epochs 200 --include-figs --verbose Let's look at the figures that we generated: "seq_class_network_train_val_loss.png" and "seq_class_network_seq_CM.png" .. image:: ../images/seq_class_network_train_val_loss.png :width: 400 .. image:: ../images/seq_class_network_seq_CM.png :width: 400 The first is a plot of the performance achieved by the network on the training and validation sets over the course of training. The validation loss here is the same as what is being output to terminal. This particular plot looks a little funny, but that's due to the fact that this classification task is not very difficult, so our network learns what it needs too by around epoch 20 and the rest of the time is just overfitting and noise. The second figure is provides some insight on how well our network will generalize onto unseen data. After training completes, PARROT networks are applied to a test set of randomly held-out sequences. For a classification task, PARROT displays a confusion matrix detailing the true vs predicted classes for each sequence in this test set. As you can see, our network is perfect (also confirmed by our performance stats file)! **Sequence regression:** Training a PARROT network on a regression task is very similar to classification in terms of syntax. For this example we will use *seq_regress_dataset.tsv*: .. code-block:: bash Frag0 EHCWTYIFQMYRIDQTQRVKRGEKPIIYLEPMAR 3.8235294117647056 Frag1 SDAWVMKFLWDKCGDHFIQYQKPANRWEWVD 3.870967741935484 Frag2 IYPEQSPDNAWAW 3.076923076923077 ... Frag296 VWIMYFIA 8.75 Frag297 WICEWRVP 5.0 Frag298 YMYWTDDWEA 5.0 Frag299 PCHSWSMEGILCNHMH 3.125 The key difference between regression datasets and classification datasets is that each value is a continuous number rather than an integer class label. In terms of command-line syntax, the only difference in the ``parrot-train`` command for this regression case (other than the datafile path) is the ``--classes`` argument. Since we are doing regression, we will put '1' here. For the purposes of demonstration, we will also modify a few of the network hyperparameters in this run. Instead of the default network architecture with one hidden layer (``-nl 1``) and a hidden vector size of 10 (``-hs 10``), we will train a network with 2 layers and a vector size of 20. These two hyperparameters, along with learning rate (``-lr``), are the main ways to tune PARROT networks. .. code-block:: bash parrot-train data/seq_regress_dataset.tsv seq_regress_network.pt --datatype sequence --classes 1 -nl 2 -hs 20 -b 8 --epochs 200 --include-figs You might notice that this network seems to train a bit slower than the previous example. This is because our network has an additional layer. Increasing the ``-nl`` hyperparameter increases training time, but creates a more complex network that may be better at discerning patterns from data. Like before, this command outputs a network file, a prediction file, a performance stats file, a training results PNG and a test set performance PNG into the current directory. In this case, the performance image is a scatterplot that compares the true values of the test set sequences to what was predicted by the PARROT network. .. image:: ../images/seq_regress_network_seq_scatterplot.png :width: 400 The performance stats file provides the Pearson and Spearman correlations for this true vs predicted value scatterplot: .. code-block:: bash Pearson R : 0.958 Spearman R : 0.963 Not bad! **Residue classification:** Now let's try a task where the objective is to classify each residue in a sequence. Unlike before where every sequence had one class label, in *res_class_dataset.tsv* there are labels for every residue in each sequence. .. code-block:: bash Frag0 DEDGTEDDMATTK 1 1 1 1 1 1 1 1 1 1 1 1 1 Frag1 CGSAPSRFVKTCDPDEEDEDDEDE 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 1 1 1 1 1 1 1 1 1 Frag2 EWYEDDKPFPCPERVPHHKKGHRGGWRAKKNWKV 1 1 1 1 1 1 1 0 2 2 2 2 2 2 2 2 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ... Frag297 HHWHRWDYERHKNCPIAGRIRR 0 0 0 0 0 0 0 1 1 1 0 0 0 0 2 2 2 2 0 0 0 0 Frag298 CEDEEEDEDHHQGPHHRT 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 Frag299 DPATGETHHDEDIEDSVEEDEDDDQDS 1 1 2 2 2 2 2 2 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 Despite this major difference, the ``parrot-train`` command is similar to the above examples. The only difference will be the value we input after the ``--datatype`` flag. Before we put 'sequence', and here we will put 'residues'. Just for demonstration, we will also decrease our learning rate (``-lr``) by an order of magnitude for training this network. .. code-block:: bash parrot-train data/res_class_dataset.tsv res_class_network.pt --datatype residues --classes 3 -lr 0.0001 -e 200 -b 8 --include-figs This produces more files to the output directory. If we look at the performance stats file, we can see this network is not perfectly accurate. .. code-block:: bash Matthews Correlation Coef : 0.621 F1 Score : 0.744 Accuracy : 0.748 In this case, the confusion matrix is for every single residue in all of the sequences in the test set. Looking at the confusion matrix can shed some light on which classes our network has trouble with. .. image:: ../images/res_class_network_res_CM.png :width: 400 Evidently class '2' is the tricky one in this example problem. **Residue regression:** The final kind of machine learning task that PARROT can handle is regression on every residue in a sequence. For this command ``--datatype`` should be set to 'residues' and ``--classes`` should be '1'. Notice that for convenience, we can use ``-d`` and ``-c`` for these flags. For this network, we'll use all of the default hyperparameters and train for 300 epochs. .. code-block:: bash parrot-train data/res_regress_dataset.tsv res_regress_network.pt -d residues -c 1 -e 300 -b 8 --include-figs The output from this command is analogous to the other examples. Like the sequence regression task, specifying ``--include-figs`` with a residue regression task will produce a scatter plot that shows the network's performance on the test set. .. image:: ../images/res_regress_network_res_scatterplot.png :width: 400 Here, each point represents a single residue in the test set. Each combination of marker shape and color in this scatterplot belongs to a single sequence, which may provide some insight on whether the network systematically mis-predicts all sequences, or if there are only a few specific sequences that are outliers. parrot-predict ============== You can use a trained network from ``parrot-optimize`` or ``parrot-train`` to predict the values of new, unseen sequences. An example file is provided in **/data** folder: .. code-block:: bash a1 EADDGLYWQQN b2 RRLKHEEDSTSTSTSTSTQ c3 YYYGGAFAFAGRM d4 GGIL e5 GREPCCMLLYILILAAAQRDESSSSST f6 PGDEADLGHRSLVWADD To run ``parrot-predict``, we need to provide the path to this sequence file, the path to our trained network file, the location where we want to output our predictions to, and information on network type and architecture. The most important thing to keep in mind when using ``parrot-predict`` is that your ``-nl`` and ``-hs`` hyperparameters (and encoding scheme) must exactly match those used for network training, or else you will get an error. Let's run our trained sequence regression network on this sequence file. Note the ``-nl`` and ``-hs`` flags are same as we used above. .. code-block:: bash parrot-predict data/seqfile.txt seq_regress_network.pt seq_regress_newPredictions.txt --datatype sequence --classes 1 -nl 2 -hs 20 We can see these predictions in "seq_regress_newPredictions.txt": .. code-block:: bash a1 EADDGLYWQQN 2.8656542 b2 RRLKHEEDSTSTSTSTSTQ 0.7592569 c3 YYYGGAFAFAGRM 4.2728763 d4 GGIL 3.238177 e5 GREPCCMLLYILILAAAQRDESSSSST 3.377026 f6 PGDEADLGHRSLVWADD 2.486051 Remember: results will vary since networks train with stochasticity. Now let's make predictions on the same sequences with our residue classification network. We don't need to provide hyperparameters here because we used the default values above. .. code-block:: bash parrot-predict data/seqfile.txt res_class_network.pt res_class_newPredictions.txt --datatype residues --classes 3 .. code-block:: bash a1 EADDGLYWQQN 1 1 1 1 1 1 1 1 1 1 2 b2 RRLKHEEDSTSTSTSTSTQ 0 0 0 0 0 0 2 2 2 2 2 2 2 2 2 2 2 2 2 c3 YYYGGAFAFAGRM 2 2 2 2 2 2 2 2 2 2 2 0 0 d4 GGIL 2 2 2 2 e5 GREPCCMLLYILILAAAQRDESSSSST 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 f6 PGDEADLGHRSLVWADD 1 1 1 1 1 1 2 2 0 0 2 2 2 2 2 1 1