tcrpmhcdataset.dataset
This Python 3 module implements the TCRpMHCdataset class.
Main class for the TCRpMHCDataset package. It takes tabular paired data as input and returns a cohesive dataset designed to capture the many-to-many nature of TCR and pMHC cross-reactivity. It accepts either a TCR -> multiple pMHC mapping or a pMHC -> multiple TCR mapping, and parses the table into lists of TCR and pMHC objects that can be indexed and called during training/eval. The dataset can also be split into stratified train/test sets.
Args:
- source (str): The source of the dataset. Either 'tcr' or 'pmhc'.
- target (str): The target of the dataset. Either 'tcr' or 'pmhc'.
- use_mhc (bool): Whether to use the MHC sequence or the pMHC sequence.
- use_pseudo (bool): Whether to use the pseudo MHC sequence or the full MHC sequence.
- use_cdr3 (bool): Whether to use the CDR3 sequence or the full TCR sequence.
Attributes:
- source (str): The source of the dataset. Either 'tcr' or 'pmhc'.
- target (str): The target of the dataset. Either 'tcr' or 'pmhc'.
- tcrs (list): The list of TCRs in the dataset.
- pMHCs (list): The list of pMHCs in the dataset.
- use_mhc (bool): Whether to use the MHC sequence or the pMHC sequence.
- use_pseudo (bool): Whether to use the pseudo MHC sequence or the full MHC sequence.
- use_cdr3 (bool): Whether to use the CDR3 sequence or the full TCR sequence.
- use_both_chains (bool): Whether to use both alpha and beta chains of the TCR.
Implements:
- __len__ (self): Return the number of TCR:pMHC pairs in the dataset. Can be accessed using len(dataset).
- __repr__ (self): Return a string representation of the dataset. Can be accessed using repr(dataset).
- __str__ (self): Return a more user friendly string representation of the dataset. Can be accessed using str(dataset).
- __getitem__ (self, idx): Return a tuple of (source object, target object) for the given index. Is either (TCR, PMHC) or (PMHC, TCR) depending on the source and target attributes. Can be accessed using dataset[idx].
- get_srclist (self): Return the list of source objects.
- get_trglist (self): Return the list of target objects.
- load_data_from_file (self, path_to_csv): Load the data from a csv file
- load_data_from_df (self, df): Load the data from a dataframe
- split (self, test_size=0.2, balance_on_allele=True, split_on=None, random_seed=42): Split the dataset into train and test data.
- to_dict (self, stringify_input=False, stringify_output=True): Return a de-duplicated dictionary representation of the parallel dataset.
- to_df (self): Return a dataframe representation of the dataset instance.
- to_csv (self, path_to_csv): Write the dataset to a csv file.
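The indexing contract described above can be illustrated with a toy stand-in. This is only a sketch: `ToyPairedDataset` is a hypothetical class written for this example, not the package's implementation, and it assumes that the TCR and pMHC lists are stored in parallel so that index `i` refers to one pair.

```python
class ToyPairedDataset:
    """Toy stand-in showing the indexing contract only; not the package's class."""

    def __init__(self, source, target, tcrs, pmhcs):
        if {source, target} != {"tcr", "pmhc"}:
            raise ValueError("source and target must be 'tcr' and 'pmhc' in some order")
        self.source, self.target = source, target
        # Assumption: the two lists are parallel, one entry per TCR:pMHC pair.
        self.tcrs, self.pMHCs = list(tcrs), list(pmhcs)

    def __len__(self):
        # Number of TCR:pMHC pairs.
        return len(self.tcrs)

    def __getitem__(self, idx):
        # Always return (source object, target object).
        pair = (self.tcrs[idx], self.pMHCs[idx])
        return pair if self.source == "tcr" else pair[::-1]

    def get_srclist(self):
        return self.tcrs if self.source == "tcr" else self.pMHCs

    def get_trglist(self):
        return self.pMHCs if self.source == "tcr" else self.tcrs


# Placeholder strings stand in for TCR and pMHC objects.
ds = ToyPairedDataset("pmhc", "tcr", tcrs=["T1", "T2"], pmhcs=["P1", "P2"])
first_pair = ds[0]
```

With `source='pmhc'`, indexing yields the pMHC first and the TCR second; swapping `source` and `target` reverses the order of each returned tuple.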
Load the data from a csv file with the following required columns:
1. 'CDR3b'
2. 'TRBV'
3. 'TRBJ'
4. 'Epitope'
5. 'Allele'
6. 'Reference'
Args:
- path_to_csv (str): The path to the csv file with the following columns:
1. 'CDR3a': The CDR3a sequence in capital single letter Amino Acid Code format (str, optional)
* 2. 'CDR3b': The CDR3b sequence in capital single letter Amino Acid Code format (str, required)
3. 'TRAV': The TRAV gene in IMGT format (str, optional)
* 4. 'TRBV': The TRBV gene in IMGT format (str, required)
5. 'TRAJ': The TRAJ gene in IMGT format (str, optional)
* 6. 'TRBJ': The TRBJ gene in IMGT format (str, required)
7. 'TRAD': The TRAD gene in IMGT format (str, optional)
8. 'TRBD': The TRBD gene in IMGT format (str, optional)
9. 'TRA_stitched': The full TRA sequence in capital single letter Amino Acid Code format (str, optional)
10. 'TRB_stitched': The full TRB sequence in capital single letter Amino Acid Code format (str, optional [can be imputed])
* 11. 'Epitope': The peptide sequence in capital single letter Amino Acid Code format (str, required)
* 12. 'Allele': The HLA allele in IMGT format (str, required)
13. 'Pseudo': The pseudo MHC sequence in capital single letter Amino Acid Code format (str, optional [can be imputed])
14. 'MHC': The full MHC sequence in capital single letter Amino Acid Code format (str, optional [can be imputed])
* 15. 'Reference': The reference for the data point (str, required)
Raises:
- FileNotFoundError: If the file is not found.
- Warnings: If specific instances could not be loaded.
Returns:
- None: This function does not return anything.
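Before loading, it can be useful to check that a csv file carries the six required columns listed above. The following is a minimal sketch using only the standard library; `missing_required_columns` is a hypothetical helper written for this example and is not part of the package, and the sample row contains illustrative values only.

```python
import csv
import io

# Required column names as listed in the documentation above.
REQUIRED_COLUMNS = ("CDR3b", "TRBV", "TRBJ", "Epitope", "Allele", "Reference")


def missing_required_columns(csv_text):
    """Return the required columns absent from the csv header (hypothetical helper)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    header = reader.fieldnames or []
    return [col for col in REQUIRED_COLUMNS if col not in header]


# Illustrative inputs: one complete header with a sample row, one incomplete header.
good = (
    "CDR3b,TRBV,TRBJ,Epitope,Allele,Reference\n"
    "CASSIRSSYEQYF,TRBV19*01,TRBJ2-7*01,GILGFVFTL,HLA-A*02:01,example\n"
)
bad = "CDR3b,TRBV,Epitope,Allele\n"

missing_good = missing_required_columns(good)
missing_bad = missing_required_columns(bad)
```

Here `missing_good` is empty, while `missing_bad` names the two required columns that the second header lacks.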
Load the data from a dataframe with the following required columns:
1. 'CDR3a'
2. 'CDR3b'
3. 'TRAV'
4. 'TRBV'
5. 'TRAJ'
6. 'TRBJ'
7. 'TRAD'
8. 'TRBD'
9. 'TRA_stitched'
10. 'TRB_stitched'
11. 'Epitope'
12. 'Allele'
13. 'Pseudo'
14. 'MHC'
15. 'Reference'
Returns:
- None: This function does not return anything.
Split the dataset into train and test data. The split is stratified by allele so that the allele distributions of the train and test sets are approximately equal. The train_test_split can also hold epitopes and/or TCRs out of the training set to assess the generalization capacity of the model.
Args:
- test_size (float): The proportion of the dataset to include in the test set.
- balance_on_allele (bool): Whether to balance the train/test split on allele.
- split_on (list): The column(s) to split on; ensures that no combination of values from these columns occurs in both train and test.
- ['Epitope'] ensures that no epitope is shared between train and test.
- ['Epitope', 'Allele'] ensures that no epitope::allele combination is shared between train and test.
- ['CDR3a', 'CDR3b'] ensures that no CDR3a::CDR3b combination is shared between train and test.
- random_seed (int): The random seed to use for the train/test split.
Returns:
- train_dataset (TCRpMHCdataset): The training dataset.
- test_dataset (TCRpMHCdataset): The testing dataset.
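The hold-out behaviour of `split_on` can be sketched as a grouped split: rows are grouped by their `split_on` key and whole groups are assigned to one side. This is an illustration under stated assumptions, not the package's algorithm: `grouped_split` is a hypothetical function, rows are plain dictionaries with placeholder values, and allele balancing is deliberately left out.

```python
import random


def grouped_split(rows, split_on, test_size=0.2, random_seed=42):
    """Hold out whole groups so no split_on key appears in both sets."""
    groups = {}
    for row in rows:
        key = tuple(row[col] for col in split_on)
        groups.setdefault(key, []).append(row)

    # Shuffle the group keys reproducibly.
    keys = sorted(groups)
    random.Random(random_seed).shuffle(keys)

    target = test_size * len(rows)
    train, test = [], []
    for key in keys:
        # Fill the test set group by group until it reaches the requested size.
        bucket = test if len(test) < target else train
        bucket.extend(groups[key])
    return train, test


# Placeholder rows: 20 TCRs spread over 5 epitopes.
rows = [
    {"CDR3b": f"CDR3_{i}", "Epitope": f"EPI_{i % 5}", "Allele": "HLA-A*02:01"}
    for i in range(20)
]
train, test = grouped_split(rows, split_on=["Epitope"])
train_epitopes = {r["Epitope"] for r in train}
test_epitopes = {r["Epitope"] for r in test}
```

Because groups are never divided, `train_epitopes` and `test_epitopes` are disjoint. The realised test fraction only approximates `test_size`, since it depends on the group sizes.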
Return a de-duplicated dictionary representation of the parallel dataset. Keys are the source objects or their string representations; string keys condense the data further by merging collisions.
Args:
- stringify_input (bool): Whether to convert the input to a string representation using __str__.
- stringify_output (bool): Whether to convert the output to a string representation using __str__.
Raises:
- None
Returns:
- data_dict (dict): The dictionary representation of the dataset, with key/value pairs built from the source and target objects or their string representations, e.g. {TCR: pMHC} or {repr(TCR): repr(pMHC)}.
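The effect of string keys merging collisions can be shown with a small sketch. Everything here is hypothetical: `ToyTCR` is a stand-in whose string form omits one field, `to_dict_sketch` is not the package's method, and the sequences are placeholders.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class ToyTCR:
    """Hypothetical stand-in for a TCR object; not the package's class."""

    cdr3b: str
    reference: str

    def __str__(self):
        # The string form ignores the reference, so distinct objects can collide.
        return self.cdr3b


pairs = [
    (ToyTCR("CASSAAAF", "ref1"), "PEPTIDE_A"),
    (ToyTCR("CASSAAAF", "ref2"), "PEPTIDE_B"),
    (ToyTCR("CASSBBBF", "ref1"), "PEPTIDE_A"),
    (ToyTCR("CASSBBBF", "ref1"), "PEPTIDE_A"),  # exact duplicate pair
]


def to_dict_sketch(pairs, stringify_input):
    """Map each source (or its string form) to its de-duplicated targets."""
    out = defaultdict(set)
    for src, trg in pairs:
        key = str(src) if stringify_input else src
        out[key].add(trg)
    return {k: sorted(v) for k, v in out.items()}


by_object = to_dict_sketch(pairs, stringify_input=False)
by_string = to_dict_sketch(pairs, stringify_input=True)
```

With object keys the exact duplicate collapses and three entries remain. With string keys the two `CASSAAAF` objects also merge, leaving two entries, one of which maps to both peptides.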
Return a dataframe representation of the dataset instance.
Args:
- None
Returns:
- df (pd.DataFrame): The dataframe representation of the dataset where each row is a unique TCR:pMHC pair and the reference is the concatenation of the list of references for that pair. Multiple references are separated by a semicolon.
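The reference handling described above can be sketched without pandas. This is an illustration only: `collapse_references` is a hypothetical helper, the output keys `TCR` and `pMHC` are assumed names, the records are placeholders, and the separator is assumed to be a bare `;` with no surrounding whitespace.

```python
def collapse_references(records):
    """Merge duplicate (tcr, pmhc) pairs and join their references with ';'."""
    merged = {}
    for tcr, pmhc, ref in records:
        refs = merged.setdefault((tcr, pmhc), [])
        if ref not in refs:
            refs.append(ref)
    # One output row per unique pair, in first-seen order.
    return [
        {"TCR": tcr, "pMHC": pmhc, "Reference": ";".join(refs)}
        for (tcr, pmhc), refs in merged.items()
    ]


# Placeholder records: the first pair is reported by two references.
records = [
    ("CASSAAAF", "PEPTIDE_A", "ref1"),
    ("CASSAAAF", "PEPTIDE_A", "ref2"),
    ("CASSBBBF", "PEPTIDE_B", "ref1"),
]
unique_rows = collapse_references(records)
```

The three input records reduce to two rows, and the first row carries both references in a single semicolon-separated field.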