Assembly API
Module for full-length TCR sequence assembly.
Overview
The assembly module builds full-length TCR sequences from CDR3 and V/J gene information. It supports:
- Leader sequences: Per-chain configuration - extract from contig FASTAs or use standard signal peptides
- Constant regions: Fetch from Ensembl (TRAC, TRBC1, TRBC2) or use built-in sequences
- 2A linkers: Join alpha and beta chains with self-cleaving peptides (T2A, P2A, E2A, F2A)
- Single-chain constructs: Generate β-linker-α format for expression
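As a rough illustration of how these pieces fit together at the amino acid level, the sketch below concatenates leader, variable domain, and constant region per chain, then joins the chains in β-linker-α order. `build_chain` and `build_single_chain` are hypothetical helpers written for this example and are not part of tcrsift; the variable and constant strings are placeholders, not real sequences.

```python
from __future__ import annotations

# Illustrative only: these helpers are hypothetical, not the tcrsift API.
T2A = "EGRGSLLTCGDVEENPGP"  # T2A linker as listed under Available Linkers


def build_chain(leader: str | None, variable: str, constant: str | None) -> str:
    """Concatenate leader + variable domain + constant region (amino acids)."""
    return (leader or "") + variable + (constant or "")


def build_single_chain(beta: str, alpha: str, linker: str = T2A) -> str:
    """Join two full-length chains in beta-linker-alpha order."""
    return beta + linker + alpha


# Placeholder strings stand in for real variable/constant sequences.
beta = build_chain("MALPVTALLLPLALLLHAARP", "<beta-VDJ>", "<TRBC>")
alpha = build_chain("MLRLLLALNLFPSIQVTG", "<alpha-VJ>", "<TRAC>")
construct = build_single_chain(beta, alpha)
```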
Leader Sequence Options
Each chain (alpha and beta) can have its own leader configuration:
| Option | Description |
|---|---|
| None | No leader sequence |
| "from_contig" | Extract native leader from CellRanger FASTA |
| "CD8A", "CD28", etc. | Use a standard signal peptide |
Default configuration: CD28 on the alpha chain and CD8A on the beta chain, so the two chains carry distinct leaders and are easy to tell apart.
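The three kinds of option can be read as a small dispatch. The sketch below is one way to express that logic; `resolve_leader` and its `native_leader` argument are hypothetical names introduced here, and the dictionary holds only two of the documented leaders.

```python
from __future__ import annotations

# Two entries copied from the leader table; the real module defines more.
LEADERS = {
    "CD8A": "MALPVTALLLPLALLLHAARP",
    "CD28": "MLRLLLALNLFPSIQVTG",
}


def resolve_leader(option: str | None, native_leader: str | None = None) -> str:
    """Hypothetical helper: map a leader option to an amino acid sequence."""
    if option is None:
        return ""  # no leader requested
    if option == "from_contig":
        # Assumes the native leader was already extracted from a contig.
        if native_leader is None:
            raise ValueError("'from_contig' needs a leader extracted from a contig")
        return native_leader
    try:
        return LEADERS[option]
    except KeyError:
        raise ValueError(f"Unknown leader option: {option!r}") from None
```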
Available Leader Sequences
| Leader | Source | Species | Sequence |
|---|---|---|---|
| CD8A | CD8A signal peptide (UniProt P01732) | Human | MALPVTALLLPLALLLHAARP |
| CD28 | CD28 signal peptide (UniProt P10747) | Human | MLRLLLALNLFPSIQVTG |
| IgK | IgGκ light chain signal peptide | Mouse | METDTLLLWVLLLWVPGSTG |
| TRAC | TCR alpha constant signal peptide | Human | MAGTWLLLLLALGCPALPTG |
| TRBC | TCR beta constant signal peptide | Human | MGTSLLCWMALCLLGADHADG |
Available Linkers
| Linker | Source | Sequence |
|---|---|---|
| T2A | Thosea asigna virus | EGRGSLLTCGDVEENPGP |
| P2A | Porcine teschovirus-1 | GSGATNFSLLKQAGDVEENPGP |
| E2A | Equine rhinitis A virus | QCTNYALLKLAGDVESNPGP |
| F2A | Foot-and-mouth disease virus | VKQTLNFDLLKLAGDVESNPGP |
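All four linkers end in the conserved NPGP motif. In 2A peptides, ribosomal skipping occurs between the final glycine and proline, so the upstream protein keeps the linker up to that glycine and the downstream protein starts with the proline. The sketch below illustrates this on a placeholder construct; `split_at_2a` is a hypothetical helper, not a tcrsift function.

```python
from __future__ import annotations

# Amino acid sequences copied from the linker table above.
LINKER_AA = {
    "T2A": "EGRGSLLTCGDVEENPGP",
    "P2A": "GSGATNFSLLKQAGDVEENPGP",
    "E2A": "QCTNYALLKLAGDVESNPGP",
    "F2A": "VKQTLNFDLLKLAGDVESNPGP",
}


def split_at_2a(construct: str, linker: str) -> tuple[str, str]:
    """Split a construct at the skipping site (just before the linker's last P)."""
    start = construct.index(linker)  # raises ValueError if the linker is absent
    cut = start + len(linker) - 1    # index of the linker's final proline
    return construct[:cut], construct[cut:]


# "BETA" and "ALPHA" are placeholders for the two full-length chains.
upstream, downstream = split_at_2a("BETA" + LINKER_AA["T2A"] + "ALPHA", LINKER_AA["T2A"])
```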
Usage Examples
from tcrsift import assemble_full_sequences
# Default: CD28 on alpha, CD8A on beta
assembled = assemble_full_sequences(clonotypes)
# Custom leaders
assembled = assemble_full_sequences(
    clonotypes,
    alpha_leader="CD8A",
    beta_leader="CD28",
)
# Leader only on beta chain (first in 2A construct)
assembled = assemble_full_sequences(
    clonotypes,
    alpha_leader=None,
    beta_leader="CD8A",
)
# Extract native leaders from contigs
assembled = assemble_full_sequences(
    clonotypes,
    contigs_dir="/path/to/contigs",
    alpha_leader="from_contig",
    beta_leader="from_contig",
)
# No leaders at all
assembled = assemble_full_sequences(
    clonotypes,
    alpha_leader=None,
    beta_leader=None,
)
API Reference
assemble
Full-length TCR sequence assembly for TCRsift.
Builds complete TCR sequences including leader peptides and constant regions.
CODON_TABLE
module-attribute
CODON_TABLE = {'ATA': 'I', 'ATC': 'I', 'ATT': 'I', 'ATG': 'M', 'ACA': 'T', 'ACC': 'T', 'ACG': 'T', 'ACT': 'T', 'AAC': 'N', 'AAT': 'N', 'AAA': 'K', 'AAG': 'K', 'AGC': 'S', 'AGT': 'S', 'AGA': 'R', 'AGG': 'R', 'CTA': 'L', 'CTC': 'L', 'CTG': 'L', 'CTT': 'L', 'CCA': 'P', 'CCC': 'P', 'CCG': 'P', 'CCT': 'P', 'CAC': 'H', 'CAT': 'H', 'CAA': 'Q', 'CAG': 'Q', 'CGA': 'R', 'CGC': 'R', 'CGG': 'R', 'CGT': 'R', 'GTA': 'V', 'GTC': 'V', 'GTG': 'V', 'GTT': 'V', 'GCA': 'A', 'GCC': 'A', 'GCG': 'A', 'GCT': 'A', 'GAC': 'D', 'GAT': 'D', 'GAA': 'E', 'GAG': 'E', 'GGA': 'G', 'GGC': 'G', 'GGG': 'G', 'GGT': 'G', 'TCA': 'S', 'TCC': 'S', 'TCG': 'S', 'TCT': 'S', 'TTC': 'F', 'TTT': 'F', 'TTA': 'L', 'TTG': 'L', 'TAC': 'Y', 'TAT': 'Y', 'TAA': '*', 'TAG': '*', 'TGC': 'C', 'TGT': 'C', 'TGA': '*', 'TGG': 'W'}
LINKERS
module-attribute
LINKERS = {'T2A': {'dna': 'GAGGGCAGAGGAAGTCTGCTAACATGCGGTGACGTCGAGGAGAATCCTGGCCCG', 'aa': 'EGRGSLLTCGDVEENPGP', 'source': 'Thosea asigna virus'}, 'P2A': {'dna': 'GGAAGCGGAGCTACTAACTTCAGCCTGCTGAAGCAGGCTGGAGACGTGGAGGAGAACCCTGGACCT', 'aa': 'GSGATNFSLLKQAGDVEENPGP', 'source': 'Porcine teschovirus-1'}, 'E2A': {'dna': 'CAGTGTACTAATTATGCTCTCTTGAAATTGGCTGGAGATGTTGAGAGCAACCCAGGTCCC', 'aa': 'QCTNYALLKLAGDVESNPGP', 'source': 'Equine rhinitis A virus'}, 'F2A': {'dna': 'GTGAAACAGACTTTGAATTTTGACCTTCTCAAGTTGGCGGGAGACGTGGAGTCCAACCCAGGGCCC', 'aa': 'VKQTLNFDLLKLAGDVESNPGP', 'source': 'Foot-and-mouth disease virus'}}
DEFAULT_LEADERS
module-attribute
DEFAULT_LEADERS = {'CD8A': {'aa': 'MALPVTALLLPLALLLHAARP', 'dna': 'ATGGCCCTGCCTGTGACAGCCCTGCTGCTGCCTCTGGCTCTGCTGCTGCATGCCGCTAGACCC', 'source': 'Human CD8A signal peptide (UniProt P01732)', 'species': 'human'}, 'CD28': {'aa': 'MLRLLLALNLFPSIQVTG', 'dna': 'ATGCTCCGCCTGCTGCTGGCCCTGAACCTGTTCCCCAGCATCCAGGTGACCGGC', 'source': 'Human CD28 signal peptide (UniProt P10747)', 'species': 'human'}, 'IgK': {'aa': 'METDTLLLWVLLLWVPGSTG', 'dna': 'ATGGAGACAGACACACTCCTGCTATGGGTACTGCTGCTCTGGGTTCCAGGTTCCACTGGT', 'source': 'Murine IgGκ light chain signal peptide', 'species': 'mouse', 'note': 'Widely used for high secretion efficiency in mammalian expression'}, 'TRAC': {'aa': 'MAGTWLLLLLALGCPALPTG', 'dna': 'ATGGCTGGCACCTGGCTGCTGCTGCTGCTGGCCCTGGGATGCCCAGCACTGCCCACAGGC', 'source': 'Human TRAC native signal peptide', 'species': 'human'}, 'TRBC': {'aa': 'MGTSLLCWMALCLLGADHADG', 'dna': 'ATGGGCACCAGCCTGCTGTGCTGGATGGCCCTGTGCCTGCTGGGAGCAGACCACGCCGATGGC', 'source': 'Human TRBC native signal peptide', 'species': 'human'}}
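Because LINKERS and DEFAULT_LEADERS store both a DNA and an amino acid form, a quick consistency check is to translate the `dna` value and compare it with `aa`. The sketch below does this for two entries copied from the listings above. It rebuilds the standard genetic code locally instead of importing `CODON_TABLE`, so it runs without tcrsift installed; `translate` and `ENTRIES` are names made up for this example.

```python
from itertools import product

# Standard genetic code in TCAG order (same mapping as CODON_TABLE above).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODONS = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

# (dna, aa) pairs copied from LINKERS["T2A"] and DEFAULT_LEADERS["CD28"].
ENTRIES = {
    "T2A": (
        "GAGGGCAGAGGAAGTCTGCTAACATGCGGTGACGTCGAGGAGAATCCTGGCCCG",
        "EGRGSLLTCGDVEENPGP",
    ),
    "CD28": (
        "ATGCTCCGCCTGCTGCTGGCCCTGAACCTGTTCCCCAGCATCCAGGTGACCGGC",
        "MLRLLLALNLFPSIQVTG",
    ),
}


def translate(dna: str) -> str:
    """Translate complete codons in frame 0, ignoring any trailing bases."""
    usable = len(dna) - len(dna) % 3
    return "".join(CODONS[dna[i:i + 3]] for i in range(0, usable, 3))


# Names whose DNA does not translate to the stored amino acid sequence.
mismatches = [name for name, (dna, aa) in ENTRIES.items() if translate(dna) != aa]
```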
assemble_full_sequences
assemble_full_sequences(clonotypes: DataFrame, contigs_dir: str | Path | None = None, alpha_leader: str | None = 'CD28', beta_leader: str | None = 'CD8A', include_constant: bool = True, constant_source: str = 'ensembl', linker: str = 'T2A', verbose: bool = True, show_progress: bool = True) -> pd.DataFrame
Assemble full-length TCR sequences.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| clonotypes | DataFrame | Clonotype DataFrame with VDJ sequences (from fwr1/cdr1/fwr2/cdr2/fwr3/cdr3/fwr4) | required |
| contigs_dir | str or Path | Directory with CellRanger contig FASTA files. Required if alpha_leader or beta_leader is set to "from_contig". | None |
| alpha_leader | str or None | Leader sequence for alpha chain. Options: None (no leader sequence); "from_contig" (extract native leader from contig FASTA, requires contigs_dir); or a key from DEFAULT_LEADERS: "CD8A", "CD28", "IgK", "TRAC", "TRBC". Default is "CD28" to provide distinct sequences from beta chain. | 'CD28' |
| beta_leader | str or None | Leader sequence for beta chain. Same options as alpha_leader. Default is "CD8A" to provide distinct sequences from alpha chain. | 'CD8A' |
| include_constant | bool | Include constant region sequences (fetched from Ensembl or data) | True |
| constant_source | str | Source for constant regions: "ensembl" or "from-data" | 'ensembl' |
| linker | str | Linker sequence for single-chain constructs: "T2A", "P2A", "E2A", "F2A" | 'T2A' |
| verbose | bool | Print progress information | True |
| show_progress | bool | Show progress bar | True |
Returns:
| Type | Description |
|---|---|
| DataFrame | Clonotypes with full sequences added |
Examples:
>>> # Default: CD28 on alpha, CD8A on beta (distinct leaders)
>>> assembled = assemble_full_sequences(clonotypes)
>>> # No leader sequences
>>> assembled = assemble_full_sequences(clonotypes, alpha_leader=None, beta_leader=None)
>>> # Leader only on beta chain (first in 2A construct)
>>> assembled = assemble_full_sequences(clonotypes, alpha_leader=None, beta_leader="CD8A")
>>> # Extract native leaders from contig FASTAs
>>> assembled = assemble_full_sequences(
...     clonotypes,
...     contigs_dir="/path/to/contigs",
...     alpha_leader="from_contig",
...     beta_leader="from_contig",
... )
Source code in tcrsift/assemble.py
translate_dna
Translate DNA sequence to amino acids.
Returns:
| Type | Description |
|---|---|
| tuple | (amino_acid_sequence, ragged_3p_nucleotides) |
Source code in tcrsift/assemble.py
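The return shape suggests that bases left over after the last complete codon are handed back separately. The sketch below shows one way such a function could behave; it is not the tcrsift implementation. `translate_dna_sketch` is a hypothetical name, and two behaviors are assumptions: translation always starts in frame 0 and does not stop at stop codons, and unrecognized codons (for example those containing N) become "X".

```python
from __future__ import annotations

from itertools import product

# Standard genetic code in TCAG order.
CODONS = {
    "".join(c): aa
    for c, aa in zip(
        product("TCAG", repeat=3),
        "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG",
    )
}


def translate_dna_sketch(dna: str) -> tuple[str, str]:
    """Return (amino acids, leftover 3' bases) for a frame-0 translation."""
    dna = dna.upper()
    usable = len(dna) - len(dna) % 3  # length covered by complete codons
    # Assumption: unknown codons map to "X" rather than raising.
    aa = "".join(CODONS.get(dna[i:i + 3], "X") for i in range(0, usable, 3))
    return aa, dna[usable:]


protein, ragged = translate_dna_sketch("ATGGCCTGAAC")
```

Here the input splits into the codons ATG, GCC, TGA plus two trailing bases, so `protein` is "MA*" and `ragged` is "AC".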
find_longest_orf
Find and translate the longest open reading frame.
Returns:
| Type | Description |
|---|---|
| tuple | (amino_acid_sequence, start_offset, ragged_3p_nucleotides) |
Source code in tcrsift/assemble.py
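The page does not spell out how the ORF is chosen, so the following is only one plausible reading of the return tuple, not the library's algorithm. It assumes that an ORF starts at an ATG on the forward strand, ends at the first in-frame stop codon or at the end of the sequence, and that leftover bases are reported only when the ORF runs off the 3' end. `longest_orf_sketch` is a hypothetical name.

```python
from __future__ import annotations

from itertools import product

# Standard genetic code in TCAG order.
CODONS = {
    "".join(c): aa
    for c, aa in zip(
        product("TCAG", repeat=3),
        "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG",
    )
}


def longest_orf_sketch(dna: str) -> tuple[str, int, str]:
    """Return (amino acids, start offset, leftover 3' bases) of the longest ORF."""
    dna = dna.upper()
    best = ("", 0, "")
    for start in range(len(dna) - 2):
        if dna[start:start + 3] != "ATG":
            continue
        residues = []
        pos = start
        while pos + 3 <= len(dna):
            residue = CODONS.get(dna[pos:pos + 3], "X")
            if residue == "*":
                break  # in-frame stop codon ends the ORF
            residues.append(residue)
            pos += 3
        # Leftover bases exist only if the ORF ran off the end of the sequence.
        ragged = dna[pos:] if pos + 3 > len(dna) else ""
        if len(residues) > len(best[0]):  # ties keep the earliest start
            best = ("".join(residues), start, ragged)
    return best
```

The quadratic scan is kept for clarity; contig-sized inputs are short enough that this is unlikely to matter.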
parse_fasta
Parse a FASTA file.
Returns:
| Type | Description |
|---|---|
| dict | Mapping from sequence ID to sequence |
Source code in tcrsift/assemble.py
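For readers unfamiliar with the format, a minimal FASTA reader that produces the same kind of mapping might look like the sketch below. It is a stand-in, not the tcrsift code; `parse_fasta_sketch` is a hypothetical name, and taking the first whitespace-delimited token of the header as the sequence ID is an assumption.

```python
from __future__ import annotations


def parse_fasta_sketch(path: str) -> dict[str, str]:
    """Read a FASTA file into a {sequence_id: sequence} mapping."""
    records: dict[str, list[str]] = {}
    current = None
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            if line.startswith(">"):
                # Assumption: the ID is the first token of the header line.
                tokens = line[1:].split()
                current = tokens[0] if tokens else ""
                records[current] = []
            elif current is not None:
                records[current].append(line)  # sequences may span several lines
    return {name: "".join(parts) for name, parts in records.items()}
```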
load_contigs
Load contig sequences from CellRanger output directories.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| contig_dir | str or Path | Directory containing sample subdirectories with FASTA files | required |
Returns:
| Type | Description |
|---|---|
| dict | Nested dict: sample -> contig_id -> sequence |
Source code in tcrsift/assemble.py
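The nested return value can be pictured with a small directory walk. The sketch below assumes one subdirectory per sample and picks up every `*.fasta` file inside it; the file pattern, the helper names `read_fasta` and `load_contigs_sketch`, and the choice to skip samples without sequences are all assumptions, not documented tcrsift behavior.

```python
from __future__ import annotations

from pathlib import Path


def read_fasta(path: Path) -> dict[str, str]:
    """Minimal FASTA reader: {first header token: sequence}."""
    records: dict[str, str] = {}
    name = None
    for line in path.read_text().splitlines():
        if line.startswith(">"):
            name = (line[1:].split() or [""])[0]
            records[name] = ""
        elif name is not None:
            records[name] += line.strip()
    return records


def load_contigs_sketch(contig_dir: str | Path) -> dict[str, dict[str, str]]:
    """Build a nested mapping: sample -> contig_id -> sequence."""
    contigs: dict[str, dict[str, str]] = {}
    for sample_dir in sorted(Path(contig_dir).iterdir()):
        if not sample_dir.is_dir():
            continue  # only subdirectories are treated as samples
        sequences: dict[str, str] = {}
        # Assumption: contig files carry a .fasta extension.
        for fasta in sorted(sample_dir.glob("*.fasta")):
            sequences.update(read_fasta(fasta))
        if sequences:
            contigs[sample_dir.name] = sequences
    return contigs
```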
get_constant_region_sequences
Get human TCR constant region sequences from Ensembl.
Returns:
| Type | Description |
|---|---|
| dict | Gene name to coding sequence |
Source code in tcrsift/assemble.py
validate_sequences
Validate assembled sequences.
Returns:
| Type | Description |
|---|---|
| list | List of warning messages |
Source code in tcrsift/assemble.py
export_fasta
Export sequences to FASTA format.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| df | DataFrame | DataFrame with sequences | required |
| output_path | str or Path | Output file path | required |
| sequence_col | str | Column containing sequences to export | 'single_chain_aa' |
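To make the output format concrete, the sketch below writes one FASTA record per row for a chosen sequence column. It is a dependency-free stand-in rather than the tcrsift function: rows are plain dictionaries instead of a DataFrame, and the `seq_<index>` record names, the 60-character line width, and the skipping of empty values are assumptions made for this example.

```python
from __future__ import annotations


def write_fasta_sketch(
    rows: list[dict],
    output_path: str,
    sequence_col: str = "single_chain_aa",
    width: int = 60,
) -> int:
    """Write one FASTA record per row; return the number of records written."""
    written = 0
    with open(output_path, "w") as handle:
        for i, row in enumerate(rows):
            seq = row.get(sequence_col)
            if not seq:
                continue  # skip rows without an assembled sequence
            handle.write(f">seq_{i}\n")  # hypothetical record name
            for start in range(0, len(seq), width):
                handle.write(seq[start:start + width] + "\n")
            written += 1
    return written
```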