Module Documentation

brnn_architecture.py

The underlying architecture of the bidirectional LSTM network used in PARROT

Question/comments/concerns? Raise an issue on github: https://github.com/idptools/parrot

Licensed under the MIT license.

class parrot.brnn_architecture.BRNN_MtM(input_size, hidden_size, num_layers, num_classes, device)[source]

A PyTorch many-to-many bidirectional recurrent neural network

A class containing the PyTorch implementation of a BRNN. The network consists of repeating LSTM units in the hidden layers that propagate sequence information in both the forward and reverse directions. A final fully connected layer aggregates the deepest hidden layers of both directions and produces the outputs.

“Many-to-many” refers to the fact that the network will produce outputs corresponding to every item of the input sequence. For example, an input sequence of length 10 will produce 10 sequential outputs.

Variables:
  • device (str) – String describing where the network is physically stored on the computer. Should be either ‘cpu’ or ‘cuda’ (GPU).
  • hidden_size (int) – Size of hidden vectors in the network
  • num_layers (int) – Number of hidden layers (for each direction) in the network
  • num_classes (int) – Number of classes for the machine learning task. If it is a regression problem, num_classes should be 1. If it is a classification problem, it should be the number of classes.
  • lstm (PyTorch LSTM object) – The bidirectional LSTM layer(s) of the recurrent neural network.
  • fc (PyTorch Linear object) – The fully connected linear layer of the recurrent neural network. Across the length of the input sequence, this layer aggregates the output of the LSTM nodes from the deepest forward layer and deepest reverse layer and returns the output for that residue in the sequence.
forward(x)[source]

Propagate input sequences through the network to produce outputs

Parameters:x (3-dimensional PyTorch IntTensor) – Input sequence to the network. Should be in the format: [batch_dim X sequence_length X input_size]
Returns:Output after propagating the sequences through the network. Will be in the format: [batch_dim X sequence_length X num_classes]
Return type:3-dimensional PyTorch FloatTensor
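
Example (a minimal sketch; the hyperparameter values and the random input tensor are placeholders, and in practice the encoded input would come from the encode_sequence module):

    import torch
    from parrot import brnn_architecture

    # Arbitrary dimensions: one-hot input (20), two layers of hidden size 10, regression (1 class)
    network = brnn_architecture.BRNN_MtM(input_size=20, hidden_size=10,
                                         num_layers=2, num_classes=1, device='cpu')

    x = torch.randn(4, 25, 20)   # stands in for a [batch_dim x sequence_length x input_size] batch
    out = network(x)             # calls forward(x)
    print(out.shape)             # documented output shape: [4 x 25 x 1]
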
class parrot.brnn_architecture.BRNN_MtO(input_size, hidden_size, num_layers, num_classes, device)[source]

A PyTorch many-to-one bidirectional recurrent neural network

A class containing the PyTorch implementation of a BRNN. The network consists of repeating LSTM units in the hidden layers that propagate sequence information in both the forward and reverse directions. A final fully connected layer aggregates the deepest hidden layers of both directions and produces the output.

“Many-to-one” refers to the fact that the network will produce a single output for an entire input sequence. For example, an input sequence of length 10 will produce only one output.

Variables:
  • device (str) – String describing where the network is physically stored on the computer. Should be either ‘cpu’ or ‘cuda’ (GPU).
  • hidden_size (int) – Size of hidden vectors in the network
  • num_layers (int) – Number of hidden layers (for each direction) in the network
  • num_classes (int) – Number of classes for the machine learning task. If it is a regression problem, num_classes should be 1. If it is a classification problem, it should be the number of classes.
  • lstm (PyTorch LSTM object) – The bidirectional LSTM layer(s) of the recurrent neural network.
  • fc (PyTorch Linear object) – The fully connected linear layer of the recurrent neural network. This layer aggregates the output of the LSTM nodes from the deepest forward layer and deepest reverse layer and returns a single output for the entire sequence.
forward(x)[source]

Propagate input sequences through the network to produce outputs

Parameters:x (3-dimensional PyTorch IntTensor) – Input sequence to the network. Should be in the format: [batch_dim X sequence_length X input_size]
Returns:Output after propagating the sequences through the network. Will be in the format: [batch_dim X 1 X num_classes]
Return type:3-dimensional PyTorch FloatTensor
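
The many-to-one variant differs only in its output shape; a parallel sketch using the same placeholder values as above:

    import torch
    from parrot import brnn_architecture

    network = brnn_architecture.BRNN_MtO(input_size=20, hidden_size=10,
                                         num_layers=2, num_classes=3, device='cpu')

    x = torch.randn(4, 25, 20)   # [batch_dim x sequence_length x input_size]
    out = network(x)             # one prediction per sequence
    print(out.shape)             # documented output shape: [4 x 1 x 3]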

encode_sequence.py

File containing functions for encoding a string of amino acids into a numeric vector.

Question/comments/concerns? Raise an issue on github: https://github.com/idptools/parrot

Licensed under the MIT license.

class parrot.encode_sequence.UserEncoder(encode_file)[source]

User-specified amino acid-to-vector encoding scheme object

Variables:
  • encode_file (str) – A path to a file that describes the encoding scheme
  • encode_dict (dict) – A dictionary that maps each amino acid to a numeric vector
  • input_size (int) – The length of the encoding vector used for each amino acid
decode(seq_vectors)[source]

Converts a list of sequence vectors back to a list of protein sequences

Parameters:seq_vectors (list of numpy arrays) – A list containing sequence vectors
Returns:Strings of amino acid sequences
Return type:list
encode(seq)[source]

Convert an amino acid sequence into this encoding scheme

Parameters:seq (str) – An uppercase sequence of amino acids (single letter code)
Returns:a PyTorch tensor representing the encoded sequence
Return type:torch.FloatTensor
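
A hedged usage sketch ('my_encoding.txt' is a hypothetical path; its contents must follow whatever layout parse_encode_file() below expects, i.e. a mapping from each amino acid to a numeric vector):

    from parrot import encode_sequence

    encoder = encode_sequence.UserEncoder('my_encoding.txt')
    print(encoder.input_size)              # length of each per-residue vector

    vec = encoder.encode('MEEPQSDPSV')     # torch.FloatTensor, one row per residue
    seqs = encoder.decode([vec.numpy()])   # round trip back to ['MEEPQSDPSV']
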
parrot.encode_sequence.biophysics(seq)[source]

Convert an amino acid sequence to a PyTorch tensor with biophysical encoding

Each amino acid is represented by a length 9 vector with each value representing a biophysical property. The nine encoded biophysical scales are Kyte-Doolittle hydrophobicity, charge, isoelectric point, molecular weight, aromaticity, h-bonding ability, side chain solvent accessible surface area, backbone SASA, and free energy of solvation. Inputting a sequence with a non-canonical amino acid letter will cause the program to exit.

E.g. Glutamic acid (E) is: [-3.5, -1, 3.2, 147.1, 0, 1, 161.8, 68.1, -107.3]

Parameters:seq (str) – An uppercase sequence of amino acids (single letter code)
Returns:a PyTorch tensor representing the encoded sequence
Return type:torch.FloatTensor
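
For example (the per-residue tensor layout is an assumption based on the description above):

    from parrot import encode_sequence

    tensor = encode_sequence.biophysics('EEE')
    print(tensor.shape)   # assumed layout: one length-9 vector per residue, i.e. [3, 9]
    print(tensor[0])      # under that assumption, the glutamic acid vector listed above
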
parrot.encode_sequence.one_hot(seq)[source]

Convert an amino acid sequence to a PyTorch tensor of one-hot vectors

Each amino acid is represented by a length 20 vector with a single 1 and 19 0's. Inputting a sequence with a non-canonical amino acid letter will cause the program to exit.

E.g. Glutamic acid (E) is encoded: [0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]

Parameters:seq (str) – An uppercase sequence of amino acids (single letter code)
Returns:a PyTorch tensor representing the encoded sequence
Return type:torch.IntTensor
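
For example (the per-residue tensor layout is an assumption based on the description above):

    from parrot import encode_sequence

    tensor = encode_sequence.one_hot('ACDE')
    print(tensor.shape)   # assumed layout: one length-20 vector per residue, i.e. [4, 20]
    print(tensor[3])      # under that assumption, the row for 'E' shown above

    # Round trip back to the sequence with rev_one_hot (documented below)
    print(encode_sequence.rev_one_hot([tensor.numpy()]))   # ['ACDE']
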
parrot.encode_sequence.parse_encode_file(file)[source]

Helper function to convert an encoding file into key:value dictionary

parrot.encode_sequence.rev_biophysics(seq_vectors)[source]

Decode a list of biophysically-encoded sequence vectors into amino acid sequences

Parameters:seq_vectors (list of numpy arrays) – A list containing sequence vectors
Returns:Strings of amino acid sequences
Return type:list
parrot.encode_sequence.rev_one_hot(seq_vectors)[source]

Decode a list of one-hot sequence vectors into amino acid sequences

Parameters:seq_vectors (list of numpy arrays) – A list containing sequence vectors
Returns:Strings of amino acid sequences
Return type:list

process_input_data.py

Module with functions for processing an input datafile into a PyTorch-compatible format.

Question/comments/concerns? Raise an issue on github: https://github.com/idptools/parrot

Licensed under the MIT license.

class parrot.process_input_data.SequenceDataset(data, subset=array([], dtype=float64), encoding_scheme='onehot', encoder=None)[source]

A PyTorch-compatible dataset containing sequences and values

Stores a collection of sequences as tensors along with their corresponding target values. This class is designed to be provided to PyTorch Dataloaders.

Variables:
  • data (list of lists) – Each inner list represents a single sequence in the dataset and should have the format: [seqID, sequence, value(s)]
  • encoding_scheme (str) – Description of how an amino acid sequence should be encoded as a numeric vector. Providing a string other than ‘onehot’, ‘biophysics’, or ‘user’ will produce unintended consequences.
  • encoder (UserEncoder object, optional) – If encoding_scheme is ‘user’, encoder should be a UserEncoder object that can convert amino acid sequences to numeric vectors. If encoding_scheme is not ‘user’, use None.
parrot.process_input_data.parse_file(tsvfile, datatype, problem_type, num_classes, excludeSeqID=False, ignoreWarnings=False)[source]

Parse a datafile containing sequences and values.

Each line of the input tsv file contains a sequence of amino acids, a value (or values) corresponding to that sequence, and an optional sequence ID. This file will be parsed into a more convenient list of lists.

If excludeSeqID is False, then the format of each line in the file should be: <seqID> <sequence> <value(s)>

If excludeSeqID is True, then the format of each line in the file should be: <sequence> <value(s)>

value(s) will either be a single number if datatype is ‘sequence’ or a len(sequence) series of whitespace-separated numbers if it is ‘residues’.

If problem_type is ‘regression’, then each value can be any real number. But if it is ‘classification’ then each value should be an integer in the range [0-N] where N is the number of classes.

Parameters:
  • tsvfile (str) – Path to a whitespace-separated datafile
  • datatype (str) – Description of the format of the values in tsvfile. Providing a string other than ‘sequence’ or ‘residues’ will produce unintended behavior.
  • problem_type (str) – Description of the machine-learning task. Providing a string other than ‘regression’ or ‘classification’ will produce unintended behavior.
  • excludeSeqID (bool, optional) – Boolean indicating whether or not each line in tsvfile has a sequence ID (default is False)
  • ignoreWarnings (bool, optional) – If False, assess the structure and balance of the provided dataset with basic heuristics and display warnings for common issues.
Returns:

A list representing the entire tsvfile. Each inner list corresponds to a single line in the file and has the format [seqID, sequence, values].

Return type:

list of lists
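
A minimal sketch tying the file format to the function call (the filename, sequence IDs, sequences, and class labels are invented for illustration, and the exact types of the parsed values are assumed):

    from parrot import process_input_data

    # Hypothetical residue-format classification data: <seqID> <sequence> <one label per residue>
    with open('example_data.tsv', 'w') as f:
        f.write('Frag1 MKAPL 0 0 1 1 2\n')
        f.write('Frag2 LQQPW 1 1 0 0 2\n')

    data = process_input_data.parse_file('example_data.tsv', datatype='residues',
                                         problem_type='classification', num_classes=3)
    print(data[0])   # [seqID, sequence, values] for the first line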

parrot.process_input_data.read_split_file(split_file)[source]

Read in a split_file

Parameters:split_file (str) – Path to a whitespace-separated splitfile
Returns:
  • numpy int array – an array of the indices for the training set samples
  • numpy int array – an array of the indices for the validation set samples
  • numpy int array – an array of the indices for the testing set samples
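
A sketch with a hypothetical split_file (three lines of whitespace-separated indices, as described for split_data() below):

    from parrot import process_input_data

    with open('example_split.txt', 'w') as f:
        f.write('0 1 2 3 4 5 6\n')   # training set indices into the datafile
        f.write('7 8\n')             # validation set indices
        f.write('9 10\n')            # test set indices

    train_idx, val_idx, test_idx = process_input_data.read_split_file('example_split.txt')
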
parrot.process_input_data.read_tsv_raw(tsvfile, delimiter=None)[source]

Internal function for parsing a tsv file. Ignores empty lines and allows for comment lines (lines that start with a # symbol). Does not do any other sanity checking, however.

Parameters:
  • tsvfile (str) – Path to a whitespace-separated datafile
  • delimiter (str or None, optional) – The string used to split columns in the file. Default is None (split on any whitespace character).
Returns:A list of strings, one per non-empty, non-comment line in the file
Return type:list
parrot.process_input_data.res_class_collate(batch)[source]

Collates sequences and their values into a batch

Transforms a collection of tuples of sequence vectors and values into a single tuple by stacking along a newly-created batch dimension. This function is specifically designed for classification problems with residue-mapped data. To account for sequences with different lengths, all sequence vectors are zero-padded to the length of the longest sequence in the batch.

Parameters:batch (list) – A list of tuples of the form (sequence_vector, target_value(s))
Returns:a tuple with concatenated names, sequence_vectors and target_values
Return type:tuple
parrot.process_input_data.res_regress_collate(batch)[source]

Collates sequences and their values into a batch

Transforms a collection of tuples of sequence vectors and values into a single tuple by stacking along a newly-created batch dimension. This function is specifically designed for regression problems with residue-mapped data. To account for sequences with different lengths, all sequence vectors are zero-padded to the length of the longest sequence in the batch.

Parameters:batch (list) – A list of tuples of the form (sequence_vector, target_value(s))
Returns:a tuple with concatenated names, sequence_vectors and target_values
Return type:tuple
parrot.process_input_data.seq_class_collate(batch)[source]

Collates sequences and their values into a batch

Transforms a collection of tuples of sequence vectors and values into a single tuple by stacking along a newly-created batch dimension. This function is specifically designed for classification problems with sequence-mapped data.

Parameters:batch (list) – A list of tuples of the form (sequence_vector, target_value(s))
Returns:a tuple with concatenated names, sequence_vectors and target_values
Return type:tuple
parrot.process_input_data.seq_regress_collate(batch)[source]

Collates sequences and their values into a batch

Transforms a collection of tuples of sequence vectors and values into a single tuple by stacking along a newly-created batch dimension. This function is specifically designed for regression problems with sequence-mapped data.

Parameters:batch (list) – A list of tuples of the form (sequence_vector, target_value(s))
Returns:a tuple with concatenated names, sequence_vectors and target_values
Return type:tuple
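
These collate functions are intended to be passed to a PyTorch DataLoader; a sketch, assuming 'train_set' is a SequenceDataset such as one returned by split_data() below:

    from torch.utils.data import DataLoader
    from parrot import process_input_data

    train_loader = DataLoader(train_set, batch_size=32, shuffle=True,
                              collate_fn=process_input_data.seq_regress_collate)

    for batch in train_loader:
        names, seq_vectors, targets = batch   # tuple layout per the documentation above
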
parrot.process_input_data.split_data(data_file, datatype, problem_type, num_classes, excludeSeqID=False, split_file=None, encoding_scheme='onehot', encoder=None, percent_val=0.15, percent_test=0.15, ignoreWarnings=False, save_splits_output=None)[source]

Divide a datafile into training, validation, and test datasets

Takes in a datafile and specification of the data format and the machine learning problem, and returns PyTorch-compatible Dataset objects for the training, validation and test sets of the data. The user may optionally specify how the dataset should be split into these subsets, as well as how protein sequences should be encoded as numeric vectors.

Parameters:
  • data_file (str) – Path to the datafile containing sequences and corresponding values
  • datatype (str) – Format of the values within data_file. Should be ‘sequence’ if the data_file contains a single value per sequence, or ‘residues’ if it contains a value for each residue per sequence.
  • problem_type (str) – The machine learning task to be addressed. Should be either ‘regression’ or ‘classification’.
  • excludeSeqID (bool, optional) – Flag that indicates how data_file is formatted. If False (default), then each line in the file should begin with a column containing a sequence ID. If True, then the datafile will not have this ID column, and will begin with the protein sequence.
  • split_file (str, optional) – Path to a file containing information on how to divide the data into training, validation and test datasets. Default is None, which will cause the data to be divided randomly, with proportions based on percent_val and percent_test. If split_file is provided it must contain 3 lines in the file, corresponding to the training, validation and test sets. Each line should have whitespace-separated integer indices which correspond to lines in data_file.
  • encoding_scheme (str, optional) – The method to be used for encoding protein sequences as numeric vectors. Currently ‘onehot’ and ‘biophysics’ are implemented (default is ‘onehot’).
  • encoder (UserEncoder object, optional) – If encoding_scheme is ‘user’, encoder should be a UserEncoder object that can convert amino acid sequences to numeric vectors. If encoding_scheme is not ‘user’, use None.
  • percent_val (float, optional) – If split_file is not provided, the fraction of the data that should be randomly assigned to the validation set. Should be in the range [0-1] (default is 0.15).
  • percent_test (float, optional) – If split_file is not provided, the fraction of the data that should be randomly assigned to the test set. Should be in the range [0-1] (default is 0.15). The proportion of the training set is calculated as 1 minus the sum of percent_val and percent_test, so these two values should not sum to more than 1.
  • ignoreWarnings (bool, optional) – If False, assess the structure and balance of the provided dataset with basic heuristics and display warnings for common issues.
  • save_splits_output (str, optional) – Location where the train / val / test splits for this run should be saved
Returns:

  • SequenceDataset object – a dataset containing the training set sequences and values
  • SequenceDataset object – a dataset containing the validation set sequences and values
  • SequenceDataset object – a dataset containing the test set sequences and values
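
A minimal usage sketch (the datafile name is a placeholder and is assumed to follow the format described for parse_file() above):

    from parrot import process_input_data

    train, val, test = process_input_data.split_data(
        'seq_data.tsv', datatype='sequence', problem_type='regression',
        num_classes=1, percent_val=0.15, percent_test=0.15)

    # Roughly 70 / 15 / 15 percent of the data, assuming the datasets support len()
    print(len(train), len(val), len(test))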

parrot.process_input_data.split_data_cv(data_file, datatype, problem_type, num_classes, excludeSeqID=False, split_file=None, encoding_scheme='onehot', encoder=None, percent_val=0.15, percent_test=0.15, n_folds=5, ignoreWarnings=False, save_splits_output=None)[source]

Divide a datafile into training, validation, test and n_folds cross-validation datasets.

Takes in a datafile and specification of the data format and the machine learning problem, and returns PyTorch-compatible Dataset objects for the training, validation, test and cross-validation sets of the data. The user may optionally specify how the dataset should be split into these subsets, as well as how protein sequences should be encoded as numeric vectors.

Parameters:
  • data_file (str) – Path to the datafile containing sequences and corresponding values
  • datatype (str) – Format of the values within data_file. Should be ‘sequence’ if the data_file contains a single value per sequence, or ‘residues’ if it contains a value for each residue per sequence.
  • problem_type (str) – The machine learning task to be addressed. Should be either ‘regression’ or ‘classification’.
  • excludeSeqID (bool, optional) – Flag that indicates how data_file is formatted. If False (default), then each line in the file should begin with a column containing a sequence ID. If True, then the datafile will not have this ID column, and will begin with the protein sequence.
  • split_file (str, optional) – Path to a file containing information on how to divide the data into training, validation and test datasets. Default is None, which will cause the data to be divided randomly, with proportions based on percent_val and percent_test. If split_file is provided it must contain 3 lines in the file, corresponding to the training, validation and test sets. Each line should have whitespace-separated integer indices which correspond to lines in data_file.
  • encoding_scheme (str, optional) – The method to be used for encoding protein sequences as numeric vectors. Currently ‘onehot’ and ‘biophysics’ are implemented (default is ‘onehot’).
  • encoder (UserEncoder object, optional) – If encoding_scheme is ‘user’, encoder should be a UserEncoder object that can convert amino acid sequences to numeric vectors. If encoding_scheme is not ‘user’, use None.
  • percent_val (float, optional) – If split_file is not provided, the fraction of the data that should be randomly assigned to the validation set. Should be in the range [0-1] (default is 0.15).
  • percent_test (float, optional) – If split_file is not provided, the fraction of the data that should be randomly assigned to the test set. Should be in the range [0-1] (default is 0.15). The proportion of the training set is calculated as 1 minus the sum of percent_val and percent_test, so these two values should not sum to more than 1.
  • n_folds (int, optional) – Number of folds for cross-validation (default is 5).
  • ignoreWarnings (bool, optional) – If False, assess the structure and balance of the provided dataset with basic heuristics and display warnings for common issues.
  • save_splits_output (str, optional) – Location where the train / val / test splits for this run should be saved
Returns:

  • list of tuples of SequenceDataset objects – a list of length n_folds, where each tuple contains the training and validation datasets for one of the cross-validation folds.
  • SequenceDataset object – a dataset containing the training set sequences and values
  • SequenceDataset object – a dataset containing the validation set sequences and values
  • SequenceDataset object – a dataset containing the test set sequences and values

parrot.process_input_data.vector_split(v, fraction)[source]

Split a vector randomly by a specified proportion

Randomly divide the values of a vector into two, non-overlapping smaller vectors. The proportions of the two vectors will be fraction and (1 - fraction).

Parameters:
  • v (numpy array) – The vector to divide
  • fraction (float) – Size proportion for the returned vectors. Should be in the range [0-1].
Returns:

  • numpy array – a subset of v of length fraction * len(v) (rounding up)
  • numpy array – a subset of v of length (1-fraction) * len(v).

train_network.py

Core training module of PARROT

Question/comments/concerns? Raise an issue on github: https://github.com/idptools/parrot

Licensed under the MIT license.

parrot.train_network.test_labeled_data(network, test_loader, datatype, problem_type, weights_file, num_classes, probabilistic_classification, include_figs, device, output_file_prefix='')[source]

Test a trained BRNN on labeled sequences

Using the saved weights of a trained network, run a set of sequences through the network and evaluate the performance. Return the average loss per sequence and plot the results. Testing a network on previously-unseen data provides a useful estimate of how generalizable the network's performance is.

Parameters:
  • network (PyTorch network object) – A BRNN network with the desired architecture
  • test_loader (PyTorch DataLoader object) – A DataLoader containing the sequences and targets of the test set
  • datatype (str) – The format of values in the dataset. Should be ‘sequence’ for datasets with a single value (or class label) per sequence, or ‘residues’ for datasets with values (or class labels) for every residue in a sequence.
  • problem_type (str) – The machine learning task–should be either ‘regression’ or ‘classification’.
  • weights_file (str) – A path to the location of the best_performing network weights
  • num_classes (int) – Number of data classes. If regression task, put 1.
  • probabilistic_classification (bool) – Whether output should be binary labels, or “weights” of each label type. This field is only implemented for binary, sequence classification tasks.
  • include_figs (bool) – Whether or not matplotlib figures should be generated.
  • device (str) – Location of where testing will take place–should be either ‘cpu’ or ‘cuda’ (GPU). If available, training on GPU is typically much faster.
  • output_file_prefix (str) – Path and filename prefix to which the test set predictions and plots will be saved.
Returns:

  • float – The average loss across the entire test set
  • list of lists – Details of the output predictions for each of the sequences in the test set. Each inner list represents a sample in the test set, with the format: [sequence_vector, true_value, predicted_value, sequence_ID]

parrot.train_network.test_unlabeled_data(network, sequences, device, encoding_scheme='onehot', encoder=None, print_frequency=None)[source]

Test a trained BRNN on unlabeled sequences

Use a trained network to make predictions on previously-unseen data.

Note: Unlike the previous functions, network here must have pre-loaded weights.

Parameters:
  • network (PyTorch network object) – A BRNN network with the desired architecture and pre-loaded weights
  • sequences (list) – A list of amino acid sequences to test using the network
  • device (str) – Location of where testing will take place–should be either ‘cpu’ or ‘cuda’ (GPU). If available, training on GPU is typically much faster.
  • encoding_scheme (str, optional) – How amino acid sequences are to be encoded as numeric vectors. Currently, ‘onehot’,’biophysics’ and ‘user’ are the implemented options.
  • encoder (UserEncoder object, optional) – If encoding_scheme is ‘user’, encoder should be a UserEncoder object that can convert amino acid sequences to numeric vectors. If encoding_scheme is not ‘user’, use None.
  • print_frequency (int, optional) – If provided, defines the sequence interval at which a progress update is printed. Default = None.
Returns:

A dictionary containing predictions mapped to sequences

Return type:

dict
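
A sketch of loading saved weights and predicting on new sequences (file names, sequences, and hyperparameter values are placeholders; load_state_dict is the standard PyTorch mechanism and is assumed here to match how the weights were saved):

    import torch
    from parrot import brnn_architecture, train_network

    network = brnn_architecture.BRNN_MtO(input_size=20, hidden_size=10,
                                         num_layers=2, num_classes=1, device='cpu')
    network.load_state_dict(torch.load('saved_weights.pt'))

    preds = train_network.test_unlabeled_data(network, ['MKAPLSQW', 'LQQPWAGH'],
                                              device='cpu', encoding_scheme='onehot')
    print(preds['MKAPLSQW'])   # prediction(s) for that sequence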

parrot.train_network.train(network, train_loader, val_loader, datatype, problem_type, weights_file, stop_condition, device, learn_rate, n_epochs, verbose=False, silent=False)[source]

Train a BRNN and save the best performing network weights

Train the network on a training set, and every epoch evaluate its performance on a validation set. Save the network weights that achieve the best performance on the validation set.

The user must specify the machine learning task (problem_type) and the format of the data (datatype). Additionally, this function requires the learning rate hyperparameter and the number of epochs of training. The other hyperparameters, number of hidden layers and hidden vector size, are implicitly determined by the provided network.

The user may specify if they want to train the network for a set number of epochs or until an automatic stopping condition is reached with the argument stop_condition. Depending on the stopping condition used, the n_epochs argument will have a different role.

Parameters:
  • network (PyTorch network object) – A BRNN network with the desired architecture
  • train_loader (PyTorch DataLoader object) – A DataLoader containing the sequences and targets of the training set
  • val_loader (PyTorch DataLoader object) – A DataLoader containing the sequences and targets of the validation set
  • datatype (str) – The format of values in the dataset. Should be ‘sequence’ for datasets with a single value (or class label) per sequence, or ‘residues’ for datasets with values (or class labels) for every residue in a sequence.
  • problem_type (str) – The machine learning task–should be either ‘regression’ or ‘classification’.
  • weights_file (str) – A path to the location where the best_performing network weights will be saved
  • stop_condition (str) – Determines when to conclude network training. If ‘iter’, then the network will train for n_epochs epochs, then stop. If ‘auto’ then the network will train for at least n_epochs epochs, then begin assessing whether performance has sufficiently stagnated. If the performance plateaus for n_epochs consecutive epochs, then training will stop.
  • device (str) – Location of where training will take place–should be either ‘cpu’ or ‘cuda’ (GPU). If available, training on GPU is typically much faster.
  • learn_rate (float) – Initial learning rate of network training. The training process is controlled by the Adam optimization algorithm, so this learning rate will tend to decrease as training progresses.
  • n_epochs (int) – Number of epochs to train for, or required to have stagnated performance for, depending on stop_condition.
  • verbose (bool, optional) – If true, causes training updates to be written every epoch, rather than every 5 epochs.
  • silent (bool, optional) – If true, no training updates are written to standard out.
Returns:

  • list – A list of the average training set losses achieved at each epoch
  • list – A list of the average validation set losses achieved at each epoch
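
A sketch of a complete training call (all file names and hyperparameter values are placeholders; the dataset and DataLoader construction follows the earlier sketches):

    from torch.utils.data import DataLoader
    from parrot import brnn_architecture, process_input_data, train_network

    train_set, val_set, test_set = process_input_data.split_data(
        'seq_data.tsv', datatype='sequence', problem_type='regression', num_classes=1)

    train_loader = DataLoader(train_set, batch_size=32, shuffle=True,
                              collate_fn=process_input_data.seq_regress_collate)
    val_loader = DataLoader(val_set, batch_size=32,
                            collate_fn=process_input_data.seq_regress_collate)

    network = brnn_architecture.BRNN_MtO(input_size=20, hidden_size=10,
                                         num_layers=2, num_classes=1, device='cpu')

    train_losses, val_losses = train_network.train(
        network, train_loader, val_loader, datatype='sequence',
        problem_type='regression', weights_file='best_weights.pt',
        stop_condition='iter', device='cpu', learn_rate=0.001, n_epochs=100)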

bayesian_optimization.py

This file contains code for conducting Bayesian optimization.

Question/comments/concerns? Raise an issue on github: https://github.com/idptools/parrot

Licensed under the MIT license.

class parrot.bayesian_optimization.BayesianOptimizer(cv_dataloaders, input_size, n_epochs, n_classes, dtype, weights_file, max_iterations, device, silent)[source]

A class for conducting Bayesian Optimization on a PyTorch RNN

Sets up and runs GPy Bayesian Optimization in order to choose the best-performing hyperparameters for an RNN on a given machine learning task. Iteratively changes the learning rate, hidden vector size, and number of layers in the network, then trains and validates using 5-fold cross-validation.

Variables:
  • cv_dataloaders (list of tuples of PyTorch DataLoader objects) – For each of the cross-val folds, a tuple containing a training set DataLoader and a validation set DataLoader.
  • input_size (int) – Length of the amino acid encoding vectors
  • n_epochs (int) – Number of epochs to train for each iteration of the algorithm
  • n_classes (int) – Number of classes
  • n_folds (int) – Number of cross-validation folds
  • problem_type (str) – ‘classification’ or ‘regression’
  • dtype (str) – ‘sequence’ or ‘residues’
  • weights_file (str) – Path to which the network weights will be saved during training
  • device (str) – ‘cpu’ or ‘cuda’ depending on system hardware
  • max_iterations (int) – Maximum number of iterations to perform the optimization procedure
  • silent (bool) – If true, do not print updates to console
  • bds (list of dicts) – GPy-compatible bounds for each of the hyperparameters to be optimized
compute_cv_loss(hyperparameters)[source]

Compute the average cross-val loss for a given set of hyperparameters

Given N sets of hyperparameters, determine the average cross-validation loss for BRNNs trained with these parameters.

Parameters:hyperparameters (numpy float array) – Each row corresponds to a set of hyperparameters, in the order: [log_learning_rate, n_layers, hidden_size]
Returns:a Nx1 numpy array of the average cross-val loss per set of input hyperparameters
Return type:numpy float array
eval_cv_brnns(lr, nl, hs)[source]

Train and test a network with given parameters across all cross-val folds

Parameters:
  • lr (float) – Learning rate of the network
  • nl (int) – Number of hidden layers (for each direction) in the network
  • hs (int) – Size of hidden vectors in the network
Returns:

the best validation loss from each fold of cross validation

Return type:

numpy float array

Calculate loss and estimate noise for an initial set of hyperparameters

Parameters:x (numpy array) – Array containing initial hyperparameters to test
Returns:
  • numpy array – Array containing the average losses of the input hyperparameters
  • float – The standard deviation of loss across cross-val folds for the input hyperparameters; an estimation of the training noise
optimize()[source]

Set up and run Bayesian Optimization on the BRNN using GPy

Returns:The best hyperparameters chosen by Bayesian Optimization, returned in the order: [lr, nl, hs]
Return type:list
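
A hedged sketch ('cv_loaders' is assumed to be a list of (train_loader, val_loader) tuples built from the cross-validation folds returned by split_data_cv(); the remaining values are placeholders):

    from parrot import bayesian_optimization

    optimizer = bayesian_optimization.BayesianOptimizer(
        cv_dataloaders=cv_loaders, input_size=20, n_epochs=50, n_classes=1,
        dtype='sequence', weights_file='bo_weights.pt', max_iterations=50,
        device='cpu', silent=False)

    best_lr, best_nl, best_hs = optimizer.optimize()   # [lr, nl, hs] per the documentation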

brnn_plot.py

Plot training results for regression and classification tasks on both sequence-mapped and residue-mapped data.

Question/comments/concerns? Raise an issue on github: https://github.com/idptools/parrot

Licensed under the MIT license.

parrot.brnn_plot.confusion_matrix(true_classes, predicted_classes, num_classes, output_file_prefix='')[source]

Create a confusion matrix for a sequence classification problem

Figure is saved to file at “<output_file_prefix>_seq_CM.png”.

Parameters:
  • true_classes (list of PyTorch IntTensors) – A list where each item is a [1 x 1] tensor with the true class label of a particular sequence
  • predicted_classes (list of PyTorch FloatTensors) – A list where each item is a [1 x num_classes] tensor prediction of the class label for a particular sequence
  • num_classes (int) – Number of distinct data classes
  • output_file_prefix (str, optional) – File to which the plot will be saved as “<output_file_prefix>_seq_CM.png”
parrot.brnn_plot.output_predictions_to_file(sequence_data, excludeSeqID, encoding_scheme, probabilistic_class, encoder=None, output_file_prefix='')[source]

Output sequences, their true values, and their predicted values to a file

Used on the output of the test_labeled_data() function in the train_network module in order to detail the performance of the trained network on the test set. Produces the file "<output_file_prefix>_predictions.tsv". Each pair of lines in this tsvfile corresponds to a particular test set sequence, with the first line containing the true data values and the second line the predicted data values.

Parameters:
  • sequence_data (list of lists) – Details of the output predictions for each of the sequences in the test set. Each inner list represents a sample in the test set, with the format: [sequence_vector, true_value, predicted_value, sequence_ID]
  • excludeSeqID (bool) – Boolean indicating whether or not each line in tsvfile has a sequence ID (default is False)
  • encoding_scheme (str) – Description of how an amino acid sequence should be encoded as a numeric vector. Providing a string other than ‘onehot’, ‘biophysics’, or ‘user’ will produce unintended consequences.
  • probabilistic_class (bool) – Flag indicating if probabilistic classification was specified by the user. If True, instead of class labels, predictions will be output as probabilities of each class.
  • encoder (UserEncoder object, optional) – If encoding_scheme is ‘user’, encoder should be a UserEncoder object that can convert amino acid sequences to numeric vectors. If encoding_scheme is not ‘user’, use None.
  • output_file_prefix (str) – Path and filename prefix to which the test set predictions will be saved. Final file path is “<output_file_prefix>_predictions.tsv”
parrot.brnn_plot.plot_precision_recall_curve(true_classes, predicted_class_probs, num_classes, output_file_prefix='')[source]

Create a PR curve for a sequence classification problem

Figure is saved to file at “<output_file_prefix>_PR_curve.png”.

Parameters:
  • true_classes (list of PyTorch IntTensors) – A list where each item is a [1 x 1] tensor with the true class label of a particular sequence
  • predicted_class_probs (list of PyTorch FloatTensors) – A list where each item is a [1 x num_classes] tensor of the probabilities of assignment to each class
  • num_classes (int) – Number of distinct data classes
  • output_file_prefix (str, optional) – File to which the plot will be saved as “<output_file_prefix>_PR_curve.png”
parrot.brnn_plot.plot_roc_curve(true_classes, predicted_class_probs, num_classes, output_file_prefix='')[source]

Create an ROC curve for a sequence classification problem

Figure is saved to file at “<output_file_prefix>_ROC_curve.png”.

Parameters:
  • true_classes (list of PyTorch IntTensors) – A list where each item is a [1 x 1] tensor with the true class label of a particular sequence
  • predicted_class_probs (list of PyTorch FloatTensors) – A list where each item is a [1 x num_classes] tensor of the probabilities of assignment to each class
  • num_classes (int) – Number of distinct data classes
  • output_file_prefix (str, optional) – File to which the plot will be saved as “<output_file_prefix>_ROC_curve.png”
parrot.brnn_plot.res_confusion_matrix(true_classes, predicted_classes, num_classes, output_file_prefix='')[source]

Create a confusion matrix for a residue classification problem

Figure is saved to file at “<output_file_prefix>_res_CM.png”.

Parameters:
  • true_classes (list of PyTorch IntTensors) – A list where each item is a [1 x len(sequence)] tensor with the true class label of the residues in a particular sequence
  • predicted_classes (list of PyTorch FloatTensors) – A list where each item is a [1 x num_classes x len(sequence)] tensor with predictions of the class label for each residue in a particular sequence
  • num_classes (int) – Number of distinct data classes
  • output_file_prefix (str, optional) – File to which the plot will be saved as “<output_file_prefix>_res_CM.png”
parrot.brnn_plot.residue_regression_scatterplot(true, predicted, output_file_prefix='')[source]

Create a scatterplot for a residue-mapped values regression problem

Each sequence is plotted with a unique marker-color combination, up to 70 different sequences.

Figure is saved to file at “<output_file_prefix>_res_scatterplot.png”.

Parameters:
  • true (list of PyTorch FloatTensors) – A list where each item is a [1 x len(sequence)] tensor with the true regression values of each residue in a sequence
  • predicted (list of PyTorch FloatTensors) – A list where each item is a [1 x len(sequence)] tensor with the regression predictions for each residue in a sequence
  • output_file_prefix (str, optional) – File to which the plot will be saved as “<output_file_prefix>_res_scatterplot.png”
parrot.brnn_plot.sequence_regression_scatterplot(true, predicted, output_file_prefix='')[source]

Create a scatterplot for a sequence-mapped values regression problem

Figure is saved to file at “<output_file_prefix>_seq_scatterplot.png”.

Parameters:
  • true (list of PyTorch FloatTensors) – A list where each item is a [1 x 1] tensor with the true regression value of a particular sequence
  • predicted (list of PyTorch FloatTensors) – A list where each item is a [1 x 1] tensor with the regression prediction for a particular sequence
  • output_file_prefix (str, optional) – File to which the plot will be saved as “<output_file_prefix>_seq_scatterplot.png”
parrot.brnn_plot.training_loss(train_loss, val_loss, output_file_prefix='')[source]

Plot training and validation loss per epoch

Figure is saved to file at “<output_file_prefix>_train_val_loss.png”.

Parameters:
  • train_loss (list) – training loss across each epoch
  • val_loss (list) – validation loss across each epoch
  • output_file_prefix (str, optional) – File to which the plot will be saved as “<output_file_prefix>_train_val_loss.png”
parrot.brnn_plot.write_performance_metrics(sequence_data, dtype, problem_type, prob_class, output_file_prefix='')[source]

Writes a short text file describing performance on a variety of metrics

Writes different output depending on whether a classification or regression task is specified. Also produces unique output if in probabilistic classification mode. File is saved to “<output_file_prefix>_performance_stats.txt”.

Parameters:
  • sequence_data (list of lists) – Details of the output predictions for each of the sequences in the test set. Each inner list represents a sample in the test set, with the format: [sequence_vector, true_value, predicted_value, sequence_ID]
  • dtype (str) – The format of values in the dataset. Should be ‘sequence’ for datasets with a single value (or class label) per sequence, or ‘residues’ for datasets with values (or class labels) for every residue in a sequence.
  • problem_type (str) – The machine learning task–should be either ‘regression’ or ‘classification’.
  • prob_class (bool) – Flag indicating if probabilistic classification was specified by the user.
  • output_file_prefix (str) – Path and filename prefix to which the test set predictions will be saved. Final file path is “<output_file_prefix>_performance_stats.txt”

py_predictor.py

Python module for integrating a trained network directly into a Python workflow.

Question/comments/concerns? Raise an issue on github: https://github.com/idptools/parrot

Licensed under the MIT license.

class parrot.py_predictor.Predictor(saved_weights, dtype)[source]

Class for integrating a trained PARROT network into a Python workflow

Usage:

>>> from parrot import py_predictor
>>> my_predictor = py_predictor.Predictor(</path/to/saved_network.pt>, dtype={“sequence” or “residues”})
>>> value = my_predictor.predict(AA_sequence)

NOTE: Assumes all sequences are composed of canonical amino acids and that all networks were implemented using one-hot encoding.

Variables:
  • dtype (str) – Data format that the network was trained for. Either “sequence” or “residues”.
  • num_layers (int) – Number of hidden layers in the trained network.
  • hidden_vector_size (int) – Size of the hidden vectors in the trained network.
  • n_classes (int) – Number of data classes that the network was trained for. If 1, then network is designed for regression task. If >1, then classification task with n_classes.
  • task (str) – Designates if network is designed for “classification” or “regression”.
  • network (PyTorch object) – Initialized PARROT network with loaded weights.
predict(seq)[source]

Use the network to predict values for a single sequence of valid amino acids

Parameters:seq (str) – Valid amino acid sequence
Returns:A 1D np.ndarray the length of the sequence, where each position is the prediction at that position.
Return type:np.ndarray