Quick Start¶

This guide walks you through a typical TCRsift workflow: preparing a sample sheet, running the pipeline, and exploring the results.

Step 1: Prepare Your Sample Sheet¶

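Create a YAML file describing your samples. Each entry names a sample and points to its VDJ directory; as the second entry shows, the gene expression directory and antigen fields can be left out: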

samples.yaml

samples:
  - sample: "Patient1_CMV"
    vdj_dir: "/data/patient1/vdj"
    gex_dir: "/data/patient1/gex"
    antigen_type: "short_peptide"
    antigen_description: "CMV pp65"
    source: "culture"

  - sample: "Patient1_TIL"
    vdj_dir: "/data/patient1_til/vdj"
    source: "til"
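
Before launching a run, it can help to sanity-check the sheet. The sketch below is not part of TCRsift: `check_sample_sheet` is a hypothetical helper, and it assumes that `sample` and `vdj_dir` are the fields every entry needs (the second entry above carries only those plus `source`). It operates on the already parsed sheet, for example the dictionary returned by `yaml.safe_load` from PyYAML.

```python
from collections import Counter

# Assumption: these are the fields every entry must carry. The actual
# requirements are defined by TCRsift's sample sheet format.
ASSUMED_REQUIRED = ("sample", "vdj_dir")


def check_sample_sheet(sheet: dict) -> list[str]:
    """Return a list of human-readable problems (empty if none were found)."""
    problems = []
    entries = sheet.get("samples") or []
    if not entries:
        problems.append("no entries under 'samples'")
    for i, entry in enumerate(entries):
        for field in ASSUMED_REQUIRED:
            if not entry.get(field):
                problems.append(f"entry {i}: missing '{field}'")
    # Duplicate sample names would make downstream results ambiguous.
    names = Counter(e.get("sample") for e in entries if e.get("sample"))
    for name, n in names.items():
        if n > 1:
            problems.append(f"sample name '{name}' appears {n} times")
    return problems


# The same content as samples.yaml above, written as parsed Python data.
sheet = {
    "samples": [
        {
            "sample": "Patient1_CMV",
            "vdj_dir": "/data/patient1/vdj",
            "gex_dir": "/data/patient1/gex",
            "antigen_type": "short_peptide",
            "antigen_description": "CMV pp65",
            "source": "culture",
        },
        {
            "sample": "Patient1_TIL",
            "vdj_dir": "/data/patient1_til/vdj",
            "source": "til",
        },
    ]
}

problems = check_sample_sheet(sheet)
print(problems or "sample sheet looks consistent")
```

This only checks the shape of the sheet; it does not verify that the directories exist or contain valid data.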

Step 2: Run the Pipeline¶

Command Line (Recommended)¶

Run the complete pipeline with a single command:

tcrsift run \
    --sample-sheet samples.yaml \
    --output-dir results/ \
    --vdjdb /path/to/vdjdb

This will:

1. Load all samples
2. Phenotype cells as CD4+ or CD8+
3. Aggregate clonotypes
4. Apply tiered filtering
5. Annotate with VDJdb
6. Generate a summary report
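
If you launch runs from a script rather than a shell, the same command can be assembled in Python. This is a minimal sketch using only the three flags shown above; `build_run_command` and `run_pipeline` are hypothetical helper names, and treating `--vdjdb` as optional is an assumption based on the annotation output being described as conditional.

```python
import shlex
import subprocess
from typing import Optional


def build_run_command(sample_sheet: str, output_dir: str,
                      vdjdb: Optional[str] = None) -> list[str]:
    """Assemble the `tcrsift run` argument list from the flags shown above."""
    cmd = ["tcrsift", "run",
           "--sample-sheet", sample_sheet,
           "--output-dir", output_dir]
    if vdjdb is not None:  # assumption: annotation is skipped without --vdjdb
        cmd += ["--vdjdb", vdjdb]
    return cmd


def run_pipeline(cmd: list[str]) -> None:
    """Run the command; raises CalledProcessError on a non-zero exit code."""
    subprocess.run(cmd, check=True)


cmd = build_run_command("samples.yaml", "results/", "/path/to/vdjdb")
print(shlex.join(cmd))  # preview the exact command line
# run_pipeline(cmd)     # uncomment once tcrsift is installed and on PATH
```

Passing the arguments as a list avoids shell quoting problems when paths contain spaces.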

Python API¶

import tcrsift

# Load sample sheet
sample_sheet = tcrsift.load_sample_sheet("samples.yaml")

# Load all samples into AnnData
adata = tcrsift.load_samples(sample_sheet)

# Phenotype cells
adata = tcrsift.phenotype_cells(adata, cd4_cd8_ratio=3.0)

# Aggregate clonotypes
clonotypes = tcrsift.aggregate_clonotypes(adata, group_by="CDR3ab")

# Filter clonotypes (default: CD8+ with threshold method)
filtered = tcrsift.filter_clonotypes(
    clonotypes,
    method="threshold",
    tcell_type="cd8",
)

# Annotate with public databases
annotated = tcrsift.annotate_clonotypes(
    filtered,
    vdjdb_path="/path/to/vdjdb",
    exclude_viral=True,
)

# Save results
annotated.to_csv("results/annotated_clonotypes.csv", index=False)
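
To make the aggregation step more concrete, here is a standard-library sketch of what grouping by `CDR3ab` amounts to conceptually: the alpha and beta CDR3 sequences are joined with an underscore, and cells sharing that key are counted. This is not TCRsift's implementation; `toy_cells` and its sequences are invented for illustration, and `tcrsift.aggregate_clonotypes` presumably does considerably more.

```python
from collections import Counter

# Invented toy data: one (CDR3_alpha, CDR3_beta) pair per cell.
toy_cells = [
    ("CAVRDSNYQLIW", "CASSLAPGATNEKLFF"),
    ("CAVRDSNYQLIW", "CASSLAPGATNEKLFF"),
    ("CAASGGSYIPTF", "CASSIRSSYEQYF"),
]


def cdr3ab_key(alpha: str, beta: str) -> str:
    """Build the CDR3_alpha_CDR3_beta identifier for one cell."""
    return f"{alpha}_{beta}"


# Count how many cells share each paired-chain identifier.
cell_counts = Counter(cdr3ab_key(a, b) for a, b in toy_cells)

# Print clonotypes from most to least abundant.
for key, count in cell_counts.most_common():
    print(key, count)
```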

Step 3: Explore Results¶

Output Files¶

The pipeline writes the following files to the output directory:

results/
├── data/
│   ├── loaded.h5ad           # Raw loaded data
│   ├── phenotyped.h5ad       # With CD4/CD8 classification
│   ├── clonotypes.csv        # All clonotypes
│   ├── filtered_tier1.csv    # Highest confidence clones
│   ├── filtered_tier2.csv
│   ├── filtered_tier3.csv
│   ├── filtered_tier4.csv
│   ├── filtered_tier5.csv
│   ├── annotated.csv         # With database annotations (if provided)
│   ├── til_matched.csv       # TIL matching results (if TIL samples provided)
│   └── full_sequences.csv    # Assembled sequences (if assembly enabled)
├── plots/
│   ├── qc.pdf
│   ├── phenotype.pdf
│   ├── clonotypes.pdf
│   └── tcrsift_report.pdf    # Summary report (if enabled)
└── config.yaml               # Resolved config used for the run
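
After a run, a short script can confirm which of these files were actually written. The sketch below is a hypothetical helper, not a TCRsift feature. It assumes that the entries without an "if ..." note in the listing are always produced, and it reports the tier files separately because it is not stated whether every tier file is written on every run.

```python
from pathlib import Path

# Assumption: entries without a condition in the listing above are
# always written. Conditional outputs are deliberately left out.
EXPECTED = [
    "data/loaded.h5ad",
    "data/phenotyped.h5ad",
    "data/clonotypes.csv",
    "plots/qc.pdf",
    "plots/phenotype.pdf",
    "plots/clonotypes.pdf",
    "config.yaml",
]


def missing_outputs(output_dir: str) -> list[str]:
    """Return the expected relative paths that are not present as files."""
    root = Path(output_dir)
    return [rel for rel in EXPECTED if not (root / rel).is_file()]


def tier_files(output_dir: str) -> list[str]:
    """Return the names of the filtered_tier*.csv files that exist."""
    data_dir = Path(output_dir, "data")
    return sorted(p.name for p in data_dir.glob("filtered_tier*.csv"))


print("missing:", missing_outputs("results/"))
print("tier files:", tier_files("results/"))
```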

Key Columns¶

The output CSV files contain the following key columns:

Column	Description
`CDR3ab`	Unique clonotype identifier (CDR3_alpha_CDR3_beta)
`CDR3_alpha`	Alpha chain CDR3 sequence
`CDR3_beta`	Beta chain CDR3 sequence
`cell_count`	Number of cells carrying the clonotype
`max_frequency`	Maximum frequency observed for the clonotype
`tier`	Quality tier (1 = best)
`Tcell_type_consensus`	Consensus T cell type (CD4+ or CD8+)
`db_match`	Whether the clonotype matched an entry in a public database
`is_viral`	Whether the clonotype has a known viral specificity
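
As an illustration of working with these columns, the following sketch keeps the best-tier clonotypes and orders them by `cell_count` using only the standard library. The rows in `toy_csv` are invented stand-ins for a real output file, `top_clonotypes` is a hypothetical helper, and the code assumes `tier` and `cell_count` are stored as plain integers.

```python
import csv
import io

# Invented toy rows standing in for a file such as results/data/annotated.csv.
toy_csv = io.StringIO(
    "CDR3ab,cell_count,tier\n"
    "CAVRDSNYQLIW_CASSLAPGATNEKLFF,12,1\n"
    "CAASGGSYIPTF_CASSIRSSYEQYF,3,2\n"
    "CALSEGNTDKLIF_CASSPGQGYEQYF,40,1\n"
)


def top_clonotypes(handle, max_tier: int = 1) -> list[dict]:
    """Keep rows with tier <= max_tier, largest cell_count first."""
    # Assumption: tier and cell_count parse as integers.
    rows = [r for r in csv.DictReader(handle) if int(r["tier"]) <= max_tier]
    return sorted(rows, key=lambda r: int(r["cell_count"]), reverse=True)


for row in top_clonotypes(toy_csv):
    print(row["CDR3ab"], row["cell_count"])
```

For a real run, pass an open file instead of `toy_csv`, for example `open("results/data/annotated.csv", newline="")`.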

Next Steps¶

Sample Sheet Format - Detailed sample sheet options
Pipeline Overview - Understanding each step
Filtering Strategies - Customizing filters