Skip to content

Assembly API

Module for full-length TCR sequence assembly.

Overview

The assembly module builds full-length TCR sequences from CDR3 and V/J gene information. It supports:

  • Leader sequences: Per-chain configuration - extract from contig FASTAs or use standard signal peptides
  • Constant regions: Fetch from Ensembl (TRAC, TRBC1, TRBC2) or use built-in sequences
  • 2A linkers: Join alpha and beta chains with self-cleaving peptides (T2A, P2A, E2A, F2A)
  • Single-chain constructs: Generate β-linker-α format for expression

Leader Sequence Options

Each chain (alpha and beta) can have its own leader configuration:

Option Description
None No leader sequence
"from_contig" Extract native leader from CellRanger FASTA
"CD8A", "CD28", etc. Use a standard signal peptide

Default configuration: CD28 on alpha chain, CD8A on beta chain (distinct leaders for identification).

Available Leader Sequences

Leader Source Species Sequence
CD8A CD8A signal peptide (UniProt P01732) Human MALPVTALLLPLALLLHAARP
CD28 CD28 signal peptide (UniProt P10747) Human MLRLLLALNLFPSIQVTG
IgK IgGκ light chain signal peptide Mouse METDTLLLWVLLLWVPGSTG
TRAC TCR alpha constant signal peptide Human MAGTWLLLLLALGCPALPTG
TRBC TCR beta constant signal peptide Human MGTSLLCWMALCLLGADHADG

Available Linkers

Linker Source Sequence
T2A Thosea asigna virus EGRGSLLTCGDVEENPGP
P2A Porcine teschovirus-1 GSGATNFSLLKQAGDVEENPGP
E2A Equine rhinitis A virus QCTNYALLKLAGDVESNPGP
F2A Foot-and-mouth disease virus VKQTLNFDLLKLAGDVESNPGP

Usage Examples

from tcrsift import assemble_full_sequences

# Default: CD28 on alpha, CD8A on beta
assembled = assemble_full_sequences(clonotypes)

# Custom leaders
assembled = assemble_full_sequences(
    clonotypes,
    alpha_leader="CD8A",
    beta_leader="CD28",
)

# Leader only on beta chain (first in 2A construct)
assembled = assemble_full_sequences(
    clonotypes,
    alpha_leader=None,
    beta_leader="CD8A",
)

# Extract native leaders from contigs
assembled = assemble_full_sequences(
    clonotypes,
    contigs_dir="/path/to/contigs",
    alpha_leader="from_contig",
    beta_leader="from_contig",
)

# No leaders at all
assembled = assemble_full_sequences(
    clonotypes,
    alpha_leader=None,
    beta_leader=None,
)

API Reference

assemble

Full-length TCR sequence assembly for TCRsift.

Builds complete TCR sequences including leader peptides and constant regions.

CODON_TABLE module-attribute

CODON_TABLE = {'ATA': 'I', 'ATC': 'I', 'ATT': 'I', 'ATG': 'M', 'ACA': 'T', 'ACC': 'T', 'ACG': 'T', 'ACT': 'T', 'AAC': 'N', 'AAT': 'N', 'AAA': 'K', 'AAG': 'K', 'AGC': 'S', 'AGT': 'S', 'AGA': 'R', 'AGG': 'R', 'CTA': 'L', 'CTC': 'L', 'CTG': 'L', 'CTT': 'L', 'CCA': 'P', 'CCC': 'P', 'CCG': 'P', 'CCT': 'P', 'CAC': 'H', 'CAT': 'H', 'CAA': 'Q', 'CAG': 'Q', 'CGA': 'R', 'CGC': 'R', 'CGG': 'R', 'CGT': 'R', 'GTA': 'V', 'GTC': 'V', 'GTG': 'V', 'GTT': 'V', 'GCA': 'A', 'GCC': 'A', 'GCG': 'A', 'GCT': 'A', 'GAC': 'D', 'GAT': 'D', 'GAA': 'E', 'GAG': 'E', 'GGA': 'G', 'GGC': 'G', 'GGG': 'G', 'GGT': 'G', 'TCA': 'S', 'TCC': 'S', 'TCG': 'S', 'TCT': 'S', 'TTC': 'F', 'TTT': 'F', 'TTA': 'L', 'TTG': 'L', 'TAC': 'Y', 'TAT': 'Y', 'TAA': '*', 'TAG': '*', 'TGC': 'C', 'TGT': 'C', 'TGA': '*', 'TGG': 'W'}

LINKERS module-attribute

LINKERS = {'T2A': {'dna': 'GAGGGCAGAGGAAGTCTGCTAACATGCGGTGACGTCGAGGAGAATCCTGGCCCG', 'aa': 'EGRGSLLTCGDVEENPGP', 'source': 'Thosea asigna virus'}, 'P2A': {'dna': 'GGAAGCGGAGCTACTAACTTCAGCCTGCTGAAGCAGGCTGGAGACGTGGAGGAGAACCCTGGACCT', 'aa': 'GSGATNFSLLKQAGDVEENPGP', 'source': 'Porcine teschovirus-1'}, 'E2A': {'dna': 'CAGTGTACTAATTATGCTCTCTTGAAATTGGCTGGAGATGTTGAGAGCAACCCAGGTCCC', 'aa': 'QCTNYALLKLAGDVESNPGP', 'source': 'Equine rhinitis A virus'}, 'F2A': {'dna': 'GTGAAACAGACTTTGAATTTTGACCTTCTCAAGTTGGCGGGAGACGTGGAGTCCAACCCAGGGCCC', 'aa': 'VKQTLNFDLLKLAGDVESNPGP', 'source': 'Foot-and-mouth disease virus'}}

DEFAULT_LEADERS module-attribute

DEFAULT_LEADERS = {'CD8A': {'aa': 'MALPVTALLLPLALLLHAARP', 'dna': 'ATGGCCCTGCCTGTGACAGCCCTGCTGCTGCCTCTGGCTCTGCTGCTGCATGCCGCTAGACCC', 'source': 'Human CD8A signal peptide (UniProt P01732)', 'species': 'human'}, 'CD28': {'aa': 'MLRLLLALNLFPSIQVTG', 'dna': 'ATGCTCCGCCTGCTGCTGGCCCTGAACCTGTTCCCCAGCATCCAGGTGACCGGC', 'source': 'Human CD28 signal peptide (UniProt P10747)', 'species': 'human'}, 'IgK': {'aa': 'METDTLLLWVLLLWVPGSTG', 'dna': 'ATGGAGACAGACACACTCCTGCTATGGGTACTGCTGCTCTGGGTTCCAGGTTCCACTGGT', 'source': 'Murine IgGκ light chain signal peptide', 'species': 'mouse', 'note': 'Widely used for high secretion efficiency in mammalian expression'}, 'TRAC': {'aa': 'MAGTWLLLLLALGCPALPTG', 'dna': 'ATGGCTGGCACCTGGCTGCTGCTGCTGCTGGCCCTGGGATGCCCAGCACTGCCCACAGGC', 'source': 'Human TRAC native signal peptide', 'species': 'human'}, 'TRBC': {'aa': 'MGTSLLCWMALCLLGADHADG', 'dna': 'ATGGGCACCAGCCTGCTGTGCTGGATGGCCCTGTGCCTGCTGGGAGCAGACCACGCCGATGGC', 'source': 'Human TRBC native signal peptide', 'species': 'human'}}

assemble_full_sequences

assemble_full_sequences(clonotypes: DataFrame, contigs_dir: str | Path | None = None, alpha_leader: str | None = 'CD28', beta_leader: str | None = 'CD8A', include_constant: bool = True, constant_source: str = 'ensembl', linker: str = 'T2A', verbose: bool = True, show_progress: bool = True) -> pd.DataFrame

Assemble full-length TCR sequences.

Parameters:

Name Type Description Default
clonotypes DataFrame

Clonotype DataFrame with VDJ sequences (from fwr1/cdr1/fwr2/cdr2/fwr3/cdr3/fwr4)

required
contigs_dir str or Path

Directory with CellRanger contig FASTA files. Required if alpha_leader or beta_leader is set to "from_contig".

None
alpha_leader str or None

Leader sequence for alpha chain. Options: - None: No leader sequence - "from_contig": Extract native leader from contig FASTA (requires contigs_dir) - Key from DEFAULT_LEADERS: "CD8A", "CD28", "IgK", "TRAC", "TRBC" Default is "CD28" to provide distinct sequences from beta chain.

'CD28'
beta_leader str or None

Leader sequence for beta chain. Same options as alpha_leader. Default is "CD8A" to provide distinct sequences from alpha chain.

'CD8A'
include_constant bool

Include constant region sequences (fetched from Ensembl or data)

True
constant_source str

Source for constant regions: "ensembl" or "from-data"

'ensembl'
linker str

Linker sequence for single-chain constructs: "T2A", "P2A", "E2A", "F2A"

'T2A'
verbose bool

Print progress information

True
show_progress bool

Show progress bar

True

Returns:

Type Description
DataFrame

Clonotypes with full sequences added

Examples:

>>> # Default: CD28 on alpha, CD8A on beta (distinct leaders)
>>> assembled = assemble_full_sequences(clonotypes)
>>> # No leader sequences
>>> assembled = assemble_full_sequences(clonotypes, alpha_leader=None, beta_leader=None)
>>> # Leader only on beta chain (first in 2A construct)
>>> assembled = assemble_full_sequences(clonotypes, alpha_leader=None, beta_leader="CD8A")
>>> # Extract native leaders from contig FASTAs
>>> assembled = assemble_full_sequences(
...     clonotypes,
...     contigs_dir="/path/to/contigs",
...     alpha_leader="from_contig",
...     beta_leader="from_contig",
... )
Source code in tcrsift/assemble.py
def assemble_full_sequences(
    clonotypes: pd.DataFrame,
    contigs_dir: str | Path | None = None,
    alpha_leader: str | None = "CD28",
    beta_leader: str | None = "CD8A",
    include_constant: bool = True,
    constant_source: str = "ensembl",
    linker: str = "T2A",
    verbose: bool = True,
    show_progress: bool = True,
) -> pd.DataFrame:
    """
    Assemble full-length TCR sequences.

    Parameters
    ----------
    clonotypes : pd.DataFrame
        Clonotype DataFrame with VDJ sequences (from fwr1/cdr1/fwr2/cdr2/fwr3/cdr3/fwr4)
    contigs_dir : str or Path, optional
        Directory with CellRanger contig FASTA files. Required if alpha_leader or
        beta_leader is set to "from_contig".
    alpha_leader : str or None
        Leader sequence for alpha chain. Options:
        - None: No leader sequence
        - "from_contig": Extract native leader from contig FASTA (requires contigs_dir)
        - Key from DEFAULT_LEADERS: "CD8A", "CD28", "IgK", "TRAC", "TRBC"
        Default is "CD28" to provide distinct sequences from beta chain.
    beta_leader : str or None
        Leader sequence for beta chain. Same options as alpha_leader.
        Default is "CD8A" to provide distinct sequences from alpha chain.
    include_constant : bool
        Include constant region sequences (fetched from Ensembl or data)
    constant_source : str
        Source for constant regions: "ensembl" or "from-data"
    linker : str
        Linker sequence for single-chain constructs: "T2A", "P2A", "E2A", "F2A"
    verbose : bool
        Print progress information
    show_progress : bool
        Show progress bar

    Returns
    -------
    pd.DataFrame
        Clonotypes with full sequences added

    Examples
    --------
    >>> # Default: CD28 on alpha, CD8A on beta (distinct leaders)
    >>> assembled = assemble_full_sequences(clonotypes)

    >>> # No leader sequences
    >>> assembled = assemble_full_sequences(clonotypes, alpha_leader=None, beta_leader=None)

    >>> # Leader only on beta chain (first in 2A construct)
    >>> assembled = assemble_full_sequences(clonotypes, alpha_leader=None, beta_leader="CD8A")

    >>> # Extract native leaders from contig FASTAs
    >>> assembled = assemble_full_sequences(
    ...     clonotypes,
    ...     contigs_dir="/path/to/contigs",
    ...     alpha_leader="from_contig",
    ...     beta_leader="from_contig",
    ... )
    """
    # Validate inputs
    clonotypes = validate_clonotype_df(clonotypes, for_assembly=True)

    valid_constant_sources = ["ensembl", "from-data"]
    if constant_source not in valid_constant_sources:
        raise TCRsiftValidationError(
            f"Invalid constant_source: '{constant_source}'",
            hint=f"Valid options are: {valid_constant_sources}",
        )

    # Validate and resolve leader options for each chain
    leader_config = {}
    for chain, leader_param in [("alpha", alpha_leader), ("beta", beta_leader)]:
        if leader_param is None:
            leader_config[chain] = None
        elif leader_param.lower() == "from_contig":
            if not contigs_dir:
                raise TCRsiftValidationError(
                    f"{chain}_leader='from_contig' requires contigs_dir to be specified",
                    hint="Provide contigs_dir with CellRanger FASTA files, or use a default leader like 'CD8A'",
                )
            leader_config[chain] = "from_contig"
        elif leader_param.upper() in DEFAULT_LEADERS:
            leader_config[chain] = DEFAULT_LEADERS[leader_param.upper()]
        else:
            raise TCRsiftValidationError(
                f"Unknown {chain}_leader: '{leader_param}'",
                hint=f"Valid options are: None, 'from_contig', or one of {list(DEFAULT_LEADERS.keys())}",
            )

    if verbose:
        alpha_desc = _describe_leader(alpha_leader, leader_config["alpha"])
        beta_desc = _describe_leader(beta_leader, leader_config["beta"])
        logger.info(f"Assembling full sequences for {len(clonotypes):,} clonotypes")
        logger.info(f"  Alpha leader: {alpha_desc}")
        logger.info(f"  Beta leader: {beta_desc}")
        logger.info(f"  Constant regions: {include_constant} (source: {constant_source})")
        logger.info(f"  Linker: {linker}")

    df = clonotypes.copy()

    # Load constant regions if needed
    constant_seqs = {}
    if include_constant and constant_source == "ensembl":
        if verbose:
            logger.info("  Loading constant regions from Ensembl...")
        constant_seqs = get_constant_region_sequences()
        if not constant_seqs:
            logger.warning(
                "  Could not load constant regions from Ensembl, will use sequences from data"
            )
        elif verbose:
            logger.info(f"    Loaded {len(constant_seqs)} constant region sequences")

    # Warn if from-data constants requested but not present
    if include_constant and constant_source == "from-data":
        constant_cols = [
            "alpha_constant_aa",
            "alpha_constant_nt",
            "beta_constant_aa",
            "beta_constant_nt",
        ]
        if not any(col in df.columns for col in constant_cols):
            logger.warning(
                "  constant_source='from-data' but no constant region columns found in input. "
                "Constants will be omitted."
            )

    # Load contigs if needed for leader extraction
    sample_contigs = {}
    needs_contigs = (
        leader_config["alpha"] == "from_contig" or leader_config["beta"] == "from_contig"
    )
    if contigs_dir and needs_contigs:
        contigs_dir = validate_directory_exists(Path(contigs_dir), "contigs directory")
        if verbose:
            logger.info(f"  Loading contigs from {contigs_dir}...")
        sample_contigs = load_contigs(contigs_dir)
        if verbose:
            total_contigs = sum(len(c) for c in sample_contigs.values())
            logger.info(f"    Loaded {total_contigs:,} contigs from {len(sample_contigs)} samples")

    # Process each clonotype
    if verbose:
        logger.info("  Assembling sequences...")

    assembly_results = []

    # Create iterator with optional progress bar
    row_iter = df.iterrows()
    if show_progress:
        row_iter = tqdm(
            list(df.iterrows()),
            desc="Assembling sequences",
            unit="clone",
        )

    for idx, row in row_iter:
        result = _assemble_clone(
            row,
            sample_contigs,
            constant_seqs,
            leader_config,
            include_constant,
            constant_source,
        )
        assembly_results.append(result)

    # Add assembly columns to dataframe
    result_df = pd.DataFrame(assembly_results)
    for col in result_df.columns:
        df[col] = result_df[col].values

    # Add single-chain construct if requested
    if linker and "full_beta_aa" in df.columns and "full_alpha_aa" in df.columns:
        if verbose:
            logger.info(f"  Creating single-chain constructs with {linker} linker...")
        df = _add_single_chain(df, linker)

    # Summary
    if verbose:
        n_with_alpha = df["full_alpha_aa"].notna().sum() if "full_alpha_aa" in df.columns else 0
        n_with_beta = df["full_beta_aa"].notna().sum() if "full_beta_aa" in df.columns else 0
        n_single_chain = (
            df["single_chain_aa"].notna().sum() if "single_chain_aa" in df.columns else 0
        )
        logger.info("  Assembly complete:")
        logger.info(f"    With full alpha: {n_with_alpha:,}")
        logger.info(f"    With full beta: {n_with_beta:,}")
        logger.info(f"    Single-chain constructs: {n_single_chain:,}")

    return df

translate_dna

translate_dna(dna_seq: str) -> tuple[str, str]

Translate DNA sequence to amino acids.

Returns:

Type Description
tuple

(amino_acid_sequence, ragged_3p_nucleotides)

Source code in tcrsift/assemble.py
def translate_dna(dna_seq: str) -> tuple[str, str]:
    """
    Translate DNA sequence to amino acids.

    Returns
    -------
    tuple
        (amino_acid_sequence, ragged_3p_nucleotides)
    """
    seq_len = len(dna_seq)
    seq_len_trimmed = (seq_len // 3) * 3

    if seq_len != seq_len_trimmed:
        ragged_nt = dna_seq[seq_len_trimmed:]
        dna_seq = dna_seq[:seq_len_trimmed]
    else:
        ragged_nt = ""

    aa_seq = "".join([CODON_TABLE.get(dna_seq[i : i + 3], "X") for i in range(0, len(dna_seq), 3)])

    # Stop at first stop codon
    if "*" in aa_seq:
        ragged_nt = ""
        aa_seq = aa_seq[: aa_seq.index("*")]

    return aa_seq, ragged_nt

find_longest_orf

find_longest_orf(dna_seq: str) -> tuple[str, int, str]

Find and translate the longest open reading frame.

Returns:

Type Description
tuple

(amino_acid_sequence, start_offset, ragged_3p_nucleotides)

Source code in tcrsift/assemble.py
def find_longest_orf(dna_seq: str) -> tuple[str, int, str]:
    """
    Find and translate the longest open reading frame.

    Returns
    -------
    tuple
        (amino_acid_sequence, start_offset, ragged_3p_nucleotides)
    """
    start_positions = [i for i in range(len(dna_seq)) if dna_seq[i : i + 3] == "ATG"]

    longest_aa = ""
    longest_offset = 0
    longest_ragged = ""

    for start in start_positions:
        subseq = dna_seq[start:]
        aa, ragged = translate_dna(subseq)
        if len(aa) > len(longest_aa):
            longest_aa = aa
            longest_offset = start
            longest_ragged = ragged

    return longest_aa, longest_offset, longest_ragged

parse_fasta

parse_fasta(path: str | Path) -> dict[str, str]

Parse a FASTA file.

Returns:

Type Description
dict

Mapping from sequence ID to sequence

Source code in tcrsift/assemble.py
def parse_fasta(path: str | Path) -> dict[str, str]:
    """
    Parse a FASTA file.

    Returns
    -------
    dict
        Mapping from sequence ID to sequence
    """
    path = Path(path)
    results = {}
    curr_id = None
    lines = []

    with open(path) as f:
        for line in f:
            line = line.strip()
            if line.startswith(">"):
                if curr_id and lines:
                    results[curr_id] = "".join(lines)
                    lines = []
                curr_id = line[1:].split()[0]  # Take first word after >
            else:
                lines.append(line)

        # Don't forget last entry
        if curr_id and lines:
            results[curr_id] = "".join(lines)

    return results

load_contigs

load_contigs(contig_dir: str | Path) -> dict[str, dict[str, str]]

Load contig sequences from CellRanger output directories.

Parameters:

Name Type Description Default
contig_dir str or Path

Directory containing sample subdirectories with FASTA files

required

Returns:

Type Description
dict

Nested dict: sample -> contig_id -> sequence

Source code in tcrsift/assemble.py
def load_contigs(contig_dir: str | Path) -> dict[str, dict[str, str]]:
    """
    Load contig sequences from CellRanger output directories.

    Parameters
    ----------
    contig_dir : str or Path
        Directory containing sample subdirectories with FASTA files

    Returns
    -------
    dict
        Nested dict: sample -> contig_id -> sequence
    """
    contig_dir = Path(contig_dir)
    sample_contigs = {}

    # Look for FASTA files in subdirectories
    for fasta_path in contig_dir.rglob("*contig*.fasta"):
        sample_name = fasta_path.parent.name
        if sample_name not in sample_contigs:
            sample_contigs[sample_name] = {}
        sample_contigs[sample_name].update(parse_fasta(fasta_path))

    # Also check direct files
    for fasta_path in contig_dir.glob("*.fasta"):
        sample_name = fasta_path.stem.split("_")[0]
        if sample_name not in sample_contigs:
            sample_contigs[sample_name] = {}
        sample_contigs[sample_name].update(parse_fasta(fasta_path))

    logger.info(f"Loaded contigs from {len(sample_contigs)} samples")
    return sample_contigs

get_constant_region_sequences

get_constant_region_sequences() -> dict[str, str]

Get human TCR constant region sequences from Ensembl.

Returns:

Type Description
dict

Gene name to coding sequence

Source code in tcrsift/assemble.py
def get_constant_region_sequences() -> dict[str, str]:
    """
    Get human TCR constant region sequences from Ensembl.

    Returns
    -------
    dict
        Gene name to coding sequence
    """
    try:
        from pyensembl import ensembl_grch38

        def find_stop_codon(seq, offset=0):
            for i in range(offset, len(seq), 3):
                codon = seq[i : i + 3]
                if codon in {"TAA", "TAG", "TGA"}:
                    return i
            return None

        constants = {}

        # TRAC
        trac = ensembl_grch38.genes_by_name("TRAC")[0]
        trac_seq = trac.transcripts[0].sequence
        stop_idx = find_stop_codon(trac_seq, offset=2)
        if stop_idx:
            constants["TRAC"] = trac_seq[: stop_idx + 3]

        # TRBC1 and TRBC2
        for name in ["TRBC1", "TRBC2"]:
            gene = ensembl_grch38.genes_by_name(name)[0]
            seq = gene.transcripts[0].sequence
            stop_idx = find_stop_codon(seq, offset=2)
            if stop_idx:
                constants[name] = seq[: stop_idx + 3]

        return constants

    except ImportError:
        logger.warning("pyensembl not available, constant regions will not be included")
        return {}
    except Exception as e:
        logger.warning(f"Could not load constant regions from Ensembl: {e}")
        return {}

validate_sequences

validate_sequences(df: DataFrame) -> list[str]

Validate assembled sequences.

Returns:

Type Description
list

List of warning messages

Source code in tcrsift/assemble.py
def validate_sequences(df: pd.DataFrame) -> list[str]:
    """
    Validate assembled sequences.

    Returns
    -------
    list
        List of warning messages
    """
    warnings = []

    # Check sequence lengths
    for chain in ["alpha", "beta"]:
        col = f"full_{chain}_aa"
        if col not in df.columns:
            continue

        for idx, row in df.iterrows():
            seq = row.get(col, "")
            if not seq:
                continue

            if len(seq) < 200:
                warnings.append(f"Clone {idx}: {chain} chain too short ({len(seq)} aa)")
            if len(seq) > 450:
                warnings.append(f"Clone {idx}: {chain} chain too long ({len(seq)} aa)")

            # Check CDR3 is present
            cdr3_col = f"CDR3_{chain}"
            if cdr3_col in row:
                cdr3 = row[cdr3_col]
                if cdr3 and cdr3 not in seq:
                    warnings.append(f"Clone {idx}: CDR3_{chain} not found in full sequence")

    # Check constant region endings
    for idx, row in df.iterrows():
        for chain in ["alpha", "beta"]:
            c_gene = row.get(f"{chain}_c_gene", "")
            full_seq = row.get(f"full_{chain}_aa", "")

            if c_gene and full_seq and c_gene in CONSTANT_REGION_ENDINGS:
                expected_end = CONSTANT_REGION_ENDINGS[c_gene]
                if not full_seq.endswith(expected_end):
                    warnings.append(
                        f"Clone {idx}: {chain} constant region doesn't end with expected "
                        f"sequence for {c_gene}"
                    )

    return warnings

export_fasta

export_fasta(df: DataFrame, output_path: str | Path, sequence_col: str = 'single_chain_aa')

Export sequences to FASTA format.

Parameters:

Name Type Description Default
df DataFrame

DataFrame with sequences

required
output_path str or Path

Output file path

required
sequence_col str

Column containing sequences to export

'single_chain_aa'
Source code in tcrsift/assemble.py
def export_fasta(df: pd.DataFrame, output_path: str | Path, sequence_col: str = "single_chain_aa"):
    """
    Export sequences to FASTA format.

    Parameters
    ----------
    df : pd.DataFrame
        DataFrame with sequences
    output_path : str or Path
        Output file path
    sequence_col : str
        Column containing sequences to export
    """
    with open(output_path, "w") as f:
        for idx, row in df.iterrows():
            seq = row.get(sequence_col, "")
            if not seq:
                continue

            # Build header
            cdr3ab = row.get("CDR3ab", idx)
            cdr3a = row.get("CDR3_alpha", "")
            cdr3b = row.get("CDR3_beta", "")

            header = f">{cdr3ab} CDR3a={cdr3a} CDR3b={cdr3b}"
            f.write(f"{header}\n{seq}\n")

    logger.info(f"Exported {len(df)} sequences to {output_path}")