Recipe B · 01 · ~30 min · intermediate · flagship

Benchmark an aligner against truth.

GenAIRR's signature pitch is absolute ground truth — you know what every V/D/J call should be, regardless of what the aligner reports. This is the recipe that cashes that pitch. Simulate 50k sequences, run an aligner against them, compute V-call accuracy stratified by mutation level, and plot the curve where the aligner starts to drift.

01 Simulate with truth · expose_provenance=True
02 Export blinded FASTA · sequences only · no truth fields
03 Score · join · accuracy · curve
PART 01

Why ground truth makes this benchmark different.

Real-world aligner benchmarks usually compare two tools to each other and treat their agreement as the signal. That's better than nothing, but it measures agreement, not correctness. With GenAIRR's truth_v_call field you know the answer the moment the sequence is born — every disagreement between truth and the aligner's report is, by construction, an aligner mistake.

What the truth gives you
  • truth_v_call · truth_d_call · truth_j_call: the alleles sampled at recombination — fixed, by construction
  • truth_n_mutations: stratify accuracy by mutation load to find the drift threshold
  • corrupted_5prime · corrupted_3prime: stratify by how much sequence the aligner is missing
  • junction_aa: cross-check CDR3 reconstruction independently of V/D/J calls
What you're testing for
  • V-call accuracy: does the aligner pick the right V allele?
  • Drift under SHM: at what mutation rate does accuracy fall below threshold?
  • Robustness to corruption: does 5′ trim eat the aligner's information?
  • Family vs allele resolution: does it get the family right even when the exact allele is wrong?
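The family-vs-allele distinction falls straight out of IMGT-style names: everything before the `*` is the gene/family, the digits after it are the allele. A tiny helper (hypothetical, not part of GenAIRR) makes the split explicit:

```python
def split_call(call: str) -> tuple[str, str]:
    """Split an IMGT-style name into (family, allele), e.g. 'IGHV3-23*01' -> ('IGHV3-23', '01')."""
    family, _, allele = call.partition("*")
    return family, allele

print(split_call("IGHV3-23*01"))  # ('IGHV3-23', '01')
```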
PART 02

Step 1 — simulate the panel.

50k sequences is a healthy panel size: large enough to slice by mutation level with stable sample counts per bucket, small enough to fit in memory for a quick analysis. Vary mutation load across the panel so you have rows to stratify by later.

A standard benchmark panel
import GenAIRR as ga

result = (
    ga.Experiment.on("human_igh")
       .recombine()
       # wide SHM range — gives you the curve, not just one point
       .mutate(model="s5f", count=(0, 40))
       # realistic lab effects — what the aligner actually sees
       .corrupt_5prime_loss(length=(5, 60))
       .corrupt_3prime_loss(length=(0, 30))
       .corrupt_ns(prob=0.001)
       .run_records(
            n=50_000,
            seed=42,
            respect=ga.productive(),
            expose_provenance=True,    # <-- truth_* columns
       )
)

result.to_csv("panel_truth.tsv", sep="\t")
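Before moving on, it's worth confirming the answer key actually carries the truth columns this recipe scores against — a forgotten expose_provenance=True is easiest to catch here. A small sketch (the exact column set GenAIRR emits may vary by version):

```python
import pandas as pd

# Truth columns this recipe relies on; adjust to your GenAIRR version.
REQUIRED = {"truth_v_call", "truth_d_call", "truth_j_call", "truth_n_mutations"}

def missing_truth_columns(df: pd.DataFrame) -> set:
    """Which required truth_* columns the answer key lacks (empty set = all good)."""
    return REQUIRED - set(df.columns)

# truth = pd.read_csv("panel_truth.tsv", sep="\t")
# assert not missing_truth_columns(truth), missing_truth_columns(truth)
```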
PART 03

Step 2 — export a blinded FASTA.

Aligners take raw sequence and a reference. Hand them a FASTA with just sequence_id and the nucleotide string — no truth fields, no V/D/J hints. Keep the full TSV as your private answer key.

Blinded FASTA
# strips every truth_* column — only sequence + sequence_id ship
result.to_fasta("panel.fasta")

# run your aligner of choice — IgBLAST, MiXCR, your tool —
# producing an AIRR TSV with v_call / d_call / j_call
$ igblastn \
    -germline_db_V human_V \
    -germline_db_D human_D \
    -germline_db_J human_J \
    -query panel.fasta \
    -outfmt "19" > panel.aligner.tsv
PART 04

Step 3 — score.

Join the aligner's report with your truth TSV on sequence_id, then score at two resolutions, exact allele and family, each measuring something different.

Join + accuracy
import pandas as pd

truth   = pd.read_csv("panel_truth.tsv",    sep="\t")
called  = pd.read_csv("panel.aligner.tsv", sep="\t")
# suffixes apply only to columns present in both files (e.g. v_call);
# truth_* columns live only in the answer key and keep their names
df = truth.merge(called, on="sequence_id", suffixes=("_t", "_a"))

# exact allele (IGHV3-23*01 == IGHV3-23*01)
df["v_allele_ok"] = df["truth_v_call"] == df["v_call_a"]

# family-only (IGHV3-23 == IGHV3-23, ignore the *N suffix)
df["v_family_ok"] = (
    df["truth_v_call"].str.split("*").str[0] ==
    df["v_call_a"].str.split("*").str[0]
)

print("V allele accuracy:", df["v_allele_ok"].mean())
print("V family accuracy:", df["v_family_ok"].mean())
PART 05

Step 4 — plot the drift curve.

The single most useful chart from this recipe: aligner accuracy as a function of mutation load. Real samples span a range of SHM rates; the curve tells you over which part of that range your aligner is reliable.

Stratify by mutation bucket
import matplotlib.pyplot as plt

df["mut_bucket"] = pd.cut(
    df["truth_n_mutations"],
    bins=[0, 5, 10, 15, 20, 25, 30, 40],
    include_lowest=True,  # keep unmutated sequences in the first bucket
)

curve = df.groupby("mut_bucket", observed=True)["v_allele_ok"].mean()
fam   = df.groupby("mut_bucket", observed=True)["v_family_ok"].mean()

fig, ax = plt.subplots()
ax.plot(curve.index.astype(str), curve.values, label="allele")
ax.plot(fam.index.astype(str),   fam.values,   label="family")
ax.set_xlabel("SHM load (mutations)")
ax.set_ylabel("V accuracy")
ax.legend()
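The drift threshold can also be read off programmatically: the first bucket whose accuracy dips below a floor you pick (0.90 here is an arbitrary example, not a GenAIRR default):

```python
def drift_bucket(curve, floor=0.90):
    """First bucket label whose accuracy falls below `floor`, or None if it never does."""
    for bucket, acc in curve.items():
        if acc < floor:
            return bucket
    return None

# e.g. drift_bucket(curve) on the Series computed above
```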
PART 06

Reading the curve.

Three patterns to watch for. None of them are bugs — they're real properties of allele callers under SHM stress.

Family stays high, allele drops

The aligner identifies the family correctly (IGHV3-23) but picks the wrong allele (*04 instead of *01). Common above ~10 SHM events. Family-level downstream analyses are still trustworthy; allele-level repertoire diversity reports are not.

Both drop at high SHM

Above ~25 SHM events the sequence diverges enough that even the family call slips. This is the working ceiling of any allele caller; the aligner isn't broken, the information simply isn't recoverable.

5′ corruption flattens the curve early

Stratify also by corrupted_5prime. The V family signal lives in the 5′ FWR1/CDR1 region; if you trim it, allele resolution collapses well before SHM would otherwise cost you.
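The same bucketing pattern as the SHM curve works here, assuming the answer key's corrupted_5prime column is the number of bases removed at the 5′ end (a sketch; the bin edges are arbitrary):

```python
import pandas as pd

def accuracy_by_trim(df: pd.DataFrame, ok_col: str = "v_family_ok") -> pd.Series:
    """Mean accuracy per 5'-trim bucket; corrupted_5prime = bases removed."""
    buckets = pd.cut(df["corrupted_5prime"], bins=[0, 10, 20, 40, 60],
                     include_lowest=True)
    return df.groupby(buckets, observed=True)[ok_col].mean()

# e.g. print(accuracy_by_trim(df))   # df from the join in Step 3
```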

Related recipes

Where to next.

B · 02 · Compare two SHM models →

Apply the same workflow with the mutation model varied — does your aligner cope with non-S5F SHM patterns?

Concept · The AIRR Record →

All ~70 fields, including the full set of truth_* columns you can score against.