tcrpmhcdataset.dataset

This Python 3 module implements the TCRpMHCdataset class.

class TCRpMHCdataset:

Main class for the TCRpMHCDataset package. It takes tabular paired data as input and returns a cohesive dataset that captures the many-to-many nature of TCR and pMHC cross-reactivity. It accepts either a TCR -> multiple pMHC mapping or a pMHC -> multiple TCR mapping, and parses the table into lists of TCR and pMHC objects that can be indexed and called during training/eval. The dataset can also be split into stratified train/test sets.

Args:

  • source (str): The source of the dataset. Either 'tcr' or 'pmhc'.
  • target (str): The target of the dataset. Either 'tcr' or 'pmhc'.
  • use_mhc (bool): Whether to use the MHC sequence or the pMHC sequence.
  • use_pseudo (bool): Whether to use the pseudo MHC sequence or the full MHC sequence.
  • use_cdr3 (bool): Whether to use the CDR3 sequence or the full TCR sequence.
  • use_both_chains (bool): Whether to use both alpha and beta chains of the TCR.

Attributes:

  • source (str): The source of the dataset. Either 'tcr' or 'pmhc'.
  • target (str): The target of the dataset. Either 'tcr' or 'pmhc'.
  • tcrs (list): The list of TCRs in the dataset.
  • pMHCs (list): The list of pMHCs in the dataset.
  • use_mhc (bool): Whether to use the MHC sequence or the pMHC sequence.
  • use_pseudo (bool): Whether to use the pseudo MHC sequence or the full MHC sequence.
  • use_cdr3 (bool): Whether to use the CDR3 sequence or the full TCR sequence.
  • use_both_chains (bool): Whether to use both alpha and beta chains of the TCR.

Implements:

  • __len__ (self): Return the number of TCR:pMHC pairs in the dataset. Can be accessed using len(dataset).
  • __repr__ (self): Return a string representation of the dataset. Can be accessed using repr(dataset).
  • __str__ (self): Return a more user friendly string representation of the dataset. Can be accessed using str(dataset).
  • __getitem__ (self, idx): Return a tuple of (source object, target object) for the given index. Is either (TCR, PMHC) or (PMHC, TCR) depending on the source and target attributes. Can be accessed using dataset[idx].
  • get_srclist (self): Return the list of source objects.
  • get_trglist (self): Return the list of target objects.
  • load_data_from_file (self, path_to_csv, verbose=False): Load the data from a csv file.
  • load_data_from_df (self, df, verbose=False): Load the data from a dataframe.
  • split (self, test_size=0.2, balance_on_allele=True, split_on=None, random_seed=42): Split the dataset into train and test data.
  • to_dict (self, stringify_input=False, stringify_output=True): Return a de-duplicated dictionary representation of the parallel dataset.
  • to_df (self): Return a dataframe representation of the dataset instance.
  • to_csv (self, path_to_csv): Write the dataset to a csv file.
TCRpMHCdataset(source, target, use_mhc=False, use_pseudo=True, use_cdr3=True, use_both_chains=False)
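
The indexing behaviour described under __getitem__ can be pictured with a small stand-in. The sketch below is not the package's implementation: ToyTCR, ToyPMHC, ToyPairedDataset and add_pair are hypothetical names, and it assumes that the tcrs and pMHCs lists are kept in parallel and that source and target always differ. The sequence values are illustrative placeholders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToyTCR:
    cdr3b: str


@dataclass(frozen=True)
class ToyPMHC:
    epitope: str
    allele: str


class ToyPairedDataset:
    """Hypothetical stand-in that mimics the documented indexing behaviour."""

    def __init__(self, source, target):
        # Assumption: one side is 'tcr' and the other is 'pmhc'.
        if {source, target} != {"tcr", "pmhc"}:
            raise ValueError("source and target must be 'tcr' and 'pmhc' in some order")
        self.source, self.target = source, target
        self.tcrs, self.pMHCs = [], []  # parallel lists, one entry per pair

    def add_pair(self, tcr, pmhc):
        self.tcrs.append(tcr)
        self.pMHCs.append(pmhc)

    def __len__(self):
        # Number of TCR:pMHC pairs.
        return len(self.tcrs)

    def __getitem__(self, idx):
        tcr, pmhc = self.tcrs[idx], self.pMHCs[idx]
        # The tuple is ordered (source object, target object).
        return (tcr, pmhc) if self.source == "tcr" else (pmhc, tcr)


dataset = ToyPairedDataset(source="pmhc", target="tcr")
dataset.add_pair(ToyTCR("CASSIRSSYEQYF"), ToyPMHC("GILGFVFTL", "HLA-A*02:01"))
source_obj, target_obj = dataset[0]
print(len(dataset), type(source_obj).__name__, type(target_obj).__name__)
```

Swapping the source and target arguments flips the order of the returned tuple without changing the stored pairs.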
def get_srclist(self):

Return the list of source objects.

def get_trglist(self):

Return the list of target objects.

def load_data_from_file(self, path_to_csv, verbose=False):

Load the data from a csv file with the following required columns:

1. 'CDR3b'
2. 'TRBV'
3. 'TRBJ'
4. 'Epitope'
5. 'Allele'
6. 'Reference'

Args:

  • path_to_csv (str): The path to the csv file with the following columns (required columns are marked with *):
    1. 'CDR3a': The CDR3a sequence in capital single letter Amino Acid Code format (str, optional)
  * 2. 'CDR3b': The CDR3b sequence in capital single letter Amino Acid Code format (str, required)
    3. 'TRAV': The TRAV gene in IMGT format (str, optional)
  * 4. 'TRBV': The TRBV gene in IMGT format (str, required)
    5. 'TRAJ': The TRAJ gene in IMGT format (str, optional)
  * 6. 'TRBJ': The TRBJ gene in IMGT format (str, required)
    7. 'TRAD': The TRAD gene in IMGT format (str, optional)
    8. 'TRBD': The TRBD gene in IMGT format (str, optional)
    9. 'TRA_stitched': The full TRA sequence in capital single letter Amino Acid Code format (str, optional)
    10. 'TRB_stitched': The full TRB sequence in capital single letter Amino Acid Code format (str, optional [can be imputed])
  * 11. 'Epitope': The peptide sequence in capital single letter Amino Acid Code format (str, required)
  * 12. 'Allele': The HLA allele in IMGT format (str, required)
    13. 'Pseudo': The pseudo MHC sequence in capital single letter Amino Acid Code format (str, optional [can be imputed])
    14. 'MHC': The full MHC sequence in capital single letter Amino Acid Code format (str, optional [can be imputed])
  * 15. 'Reference': The reference for the data point (str, required)

Raises:

  • FileNotFoundError: If the file is not found.
  • Warnings: Emitted if specific instances could not be loaded.

Returns:

  • None: This function does not return anything.
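
Before loading, it can help to confirm that a file carries the six required columns. The following is a minimal sketch using only the standard library: missing_required_columns is a hypothetical helper, the single row is an illustrative placeholder, and the import path is assumed from the module name at the top of this page. The loading step only runs if the package can be imported.

```python
import csv
import os
import tempfile

REQUIRED_COLUMNS = ["CDR3b", "TRBV", "TRBJ", "Epitope", "Allele", "Reference"]


def missing_required_columns(path_to_csv):
    """Return the required column names that are absent from the CSV header."""
    with open(path_to_csv, newline="") as handle:
        header = next(csv.reader(handle), [])
    return [col for col in REQUIRED_COLUMNS if col not in header]


# Placeholder row for illustration only.
example_row = {
    "CDR3b": "CASSIRSSYEQYF",
    "TRBV": "TRBV19*01",
    "TRBJ": "TRBJ2-7*01",
    "Epitope": "GILGFVFTL",
    "Allele": "HLA-A*02:01",
    "Reference": "placeholder_reference",
}

path_to_csv = os.path.join(tempfile.mkdtemp(), "example_pairs.csv")
with open(path_to_csv, "w", newline="") as handle:
    writer = csv.DictWriter(handle, fieldnames=REQUIRED_COLUMNS)
    writer.writeheader()
    writer.writerow(example_row)

missing = missing_required_columns(path_to_csv)

try:
    # Assumed import path, taken from the module name of this page.
    from tcrpmhcdataset.dataset import TCRpMHCdataset
except ImportError:
    TCRpMHCdataset = None  # package not installed; only the header check runs

if TCRpMHCdataset is not None and not missing:
    dataset = TCRpMHCdataset(source="pmhc", target="tcr")
    dataset.load_data_from_file(path_to_csv, verbose=True)
    print(len(dataset))
```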
def load_data_from_df(self, df, verbose=False):

Load the data from a dataframe with the following columns. Required and optional columns are the same as for load_data_from_file:

1. 'CDR3a'
2. 'CDR3b'
3. 'TRAV'
4. 'TRBV'
5. 'TRAJ'
6. 'TRBJ'
7. 'TRAD'
8. 'TRBD'
9. 'TRA_stitched'
10. 'TRB_stitched'
11. 'Epitope'
12. 'Allele'
13. 'Pseudo'
14. 'MHC'
15. 'Reference'

Returns:

  • None: This function does not return anything.
def split(self, test_size=0.2, balance_on_allele=True, split_on=None, random_seed=42):

Split the dataset into train and test data. The split is stratified by allele so that the allele distributions of the train and test sets are approximately equal. It can also hold epitopes and/or TCRs out of the training set to assess the generalization capacity of the model.

Args:

  • test_size (float): The proportion of the dataset to include in the test set.
  • balance_on_allele (bool): Whether to balance the train/test split on allele.
  • split_on (list): The column(s) to split on. Ensures that no combination of values from these columns occurs in both train and test.
    • ['Epitope'] ensures that no epitope is shared between train and test.
    • ['Epitope', 'Allele'] ensures that no epitope::allele combination is shared between train and test.
    • ['CDR3a', 'CDR3b'] ensures that no CDR3a::CDR3b combination is shared between train and test.
  • random_seed (int): The random seed to use for the train/test split.

Returns:

  • train_dataset (TCRpMHCdataset): The training dataset.
  • test_dataset (TCRpMHCdataset): The testing dataset.
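
The held-out behaviour of split_on can be sketched as a group-wise split. This is a simplified illustration and not the package's algorithm: group_holdout_split is a hypothetical helper, it ignores allele balancing entirely, and the CDR3b values are placeholders.

```python
import random
from collections import defaultdict


def group_holdout_split(rows, split_on, test_size=0.2, random_seed=42):
    """Split rows so that no combination of `split_on` values lands in both sets."""
    groups = defaultdict(list)
    for row in rows:
        key = tuple(row[col] for col in split_on)
        groups[key].append(row)

    # Shuffle the group keys reproducibly, then assign whole groups.
    keys = sorted(groups)
    random.Random(random_seed).shuffle(keys)

    target = test_size * len(rows)
    train, test = [], []
    for key in keys:
        # Fill the test set group by group until it reaches the target size.
        if len(test) < target:
            test.extend(groups[key])
        else:
            train.extend(groups[key])
    return train, test


epitopes = ["GILGFVFTL", "NLVPMVATV", "GLCTLVAML", "ELAGIGILTV"]
rows = [
    {"CDR3b": f"PLACEHOLDER_{i}_{j}", "Epitope": epitope, "Allele": "HLA-A*02:01"}
    for i, epitope in enumerate(epitopes)
    for j in range(3)
]

train_rows, test_rows = group_holdout_split(rows, ["Epitope"], test_size=0.25)
shared = {r["Epitope"] for r in train_rows} & {r["Epitope"] for r in test_rows}
print(len(train_rows), len(test_rows), shared)
```

Because whole groups are assigned at once, the realised test proportion only approximates test_size when groups differ in size.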
def to_dict(self, stringify_input=False, stringify_output=True):

Return a de-duplicated dictionary representation of the parallel dataset. Keys are the source objects or their string representations; string keys condense the data further because colliding entries are merged.

Args:

  • stringify_input (bool): Whether to convert the input to a string representation using __str__.
  • stringify_output (bool): Whether to convert the output to a string representation using __str__.

Raises:

  • None

Returns:

  • data_dict (dict): The dictionary representation of the dataset, with key-value pairs built from the source and target objects or their string representations, e.g. {TCR: pMHC} or {repr(TCR): repr(pMHC)}.
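
One way to picture the de-duplication is to collapse repeated source keys and collect their distinct targets. The sketch assumes the pairs are already strings and that each key maps to a list of its distinct partners; pairs_to_dict is a hypothetical helper, and the package's to_dict may structure its values differently.

```python
def pairs_to_dict(pairs):
    """Map each source string to the ordered list of its distinct target strings."""
    mapping = {}
    for source, target in pairs:
        targets = mapping.setdefault(source, [])
        if target not in targets:  # drop exact duplicate pairs
            targets.append(target)
    return mapping


# Placeholder string pairs (source, target); the first pair is repeated.
pairs = [
    ("GILGFVFTL|HLA-A*02:01", "PLACEHOLDER_TCR_1"),
    ("GILGFVFTL|HLA-A*02:01", "PLACEHOLDER_TCR_2"),
    ("GILGFVFTL|HLA-A*02:01", "PLACEHOLDER_TCR_1"),
    ("NLVPMVATV|HLA-A*02:01", "PLACEHOLDER_TCR_3"),
]

data_dict = pairs_to_dict(pairs)
print(data_dict)
```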
def to_df(self):

Return a dataframe representation of the dataset instance.

Args:

  • None

Returns:

  • df (pd.DataFrame): The dataframe representation of the dataset where each row is a unique TCR:pMHC pair and the reference is the
    concatenation of the list of references for that pair. Multiple references are separated by a semicolon.
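
The reference handling described above can be sketched without pandas. The helper merge_references and the output keys "TCR" and "pMHC" are hypothetical, the input strings are placeholders, and the sketch assumes that repeated references for the same pair are listed once.

```python
def merge_references(records):
    """Collapse (tcr, pmhc, reference) triples into one row per unique pair."""
    merged = {}
    for tcr, pmhc, reference in records:
        refs = merged.setdefault((tcr, pmhc), [])
        if reference not in refs:
            refs.append(reference)
    # Multiple references for a pair are joined with a semicolon.
    return [
        {"TCR": tcr, "pMHC": pmhc, "Reference": ";".join(refs)}
        for (tcr, pmhc), refs in merged.items()
    ]


records = [
    ("PLACEHOLDER_TCR_1", "GILGFVFTL|HLA-A*02:01", "ref_A"),
    ("PLACEHOLDER_TCR_1", "GILGFVFTL|HLA-A*02:01", "ref_B"),
    ("PLACEHOLDER_TCR_2", "NLVPMVATV|HLA-A*02:01", "ref_A"),
]

merged_rows = merge_references(records)
print(merged_rows)
```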
def to_csv(self, path_to_csv):

Write the dataset to a csv file.

Args:

  • path_to_csv (str): The path to the csv file to write to.

Returns:

  • None: This function does not return anything.