Recipe B · 01 · ~30 min · intermediate · flagship

Benchmark an aligner against truth.

GenAIRR's signature pitch is absolute ground truth — you know what every V/D/J call should be, regardless of what the aligner reports. This is the recipe that cashes that pitch. Simulate 50k sequences, run an aligner against them, compute V-call accuracy stratified by mutation level, and plot the curve where the aligner starts to drift.

01 Simulate with truth · expose_provenance=True
02 Export blinded FASTA · sequences only · no truth fields
03 Score · join · accuracy · curve
PART 01

Why ground truth makes this benchmark different.

Real-world aligner benchmarks usually compare two tools to each other and treat their agreement as the signal. That's better than nothing, but it measures agreement, not correctness. With GenAIRR's truth_v_call field you know the answer the moment the sequence is born — every disagreement between truth and the aligner's report is, by construction, an aligner mistake.

What the truth gives you
  • truth_v_call · truth_d_call · truth_j_call: the alleles sampled at recombination — fixed, by construction
  • truth_n_mutations: stratify accuracy by mutation load to find the drift threshold
  • corrupted_5prime · corrupted_3prime: stratify by how much sequence the aligner is missing
  • junction_aa: cross-check CDR3 reconstruction independently of V/D/J calls
What you're testing for
  • V-call accuracy: does the aligner pick the right V allele?
  • Drift under SHM: at what mutation rate does accuracy fall below threshold?
  • Robustness to corruption: does 5′ trim eat the aligner's information?
  • Family vs allele resolution: does it get the family right even when the exact allele is wrong?
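The family-vs-allele distinction falls straight out of IMGT-style names: everything before the `*` is the gene/family, the digits after it are the allele. A tiny helper (hypothetical, not part of GenAIRR) makes the split explicit:

```python
def split_call(call: str) -> tuple[str, str]:
    """Split an IMGT-style name into (family, allele), e.g. 'IGHV3-23*01' -> ('IGHV3-23', '01')."""
    family, _, allele = call.partition("*")
    return family, allele

print(split_call("IGHV3-23*01"))  # ('IGHV3-23', '01')
```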
PART 02

Step 1 — simulate the panel.

50k sequences is a healthy panel size: large enough to slice by mutation level with stable sample counts per bucket, small enough to fit in memory for a quick analysis. Vary mutation load across the panel so you have rows to stratify by later.

A standard benchmark panel
import GenAIRR as ga

result = (
    ga.Experiment.on("human_igh")
       .recombine()
       # wide SHM range — gives you the curve, not just one point
       .mutate(model="s5f", count=(0, 40))
       # realistic lab effects — what the aligner actually sees
       .corrupt_5prime_loss(length=(5, 60))
       .corrupt_3prime_loss(length=(0, 30))
       .corrupt_ns(prob=0.001)
       .run_records(
            n=50_000,
            seed=42,
            respect=ga.productive(),
            expose_provenance=True,    # <-- truth_* columns
       )
)

result.to_csv("panel_truth.tsv", sep="\t")
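Before moving on, it's worth confirming the answer key actually carries the truth columns this recipe scores against — a forgotten expose_provenance=True is easiest to catch here. A small sketch (the exact column set GenAIRR emits may vary by version):

```python
import pandas as pd

# Truth columns this recipe relies on; adjust to your GenAIRR version.
REQUIRED = {"truth_v_call", "truth_d_call", "truth_j_call", "truth_n_mutations"}

def missing_truth_columns(df: pd.DataFrame) -> set:
    """Which required truth_* columns the answer key lacks (empty set = all good)."""
    return REQUIRED - set(df.columns)

# truth = pd.read_csv("panel_truth.tsv", sep="\t")
# assert not missing_truth_columns(truth), missing_truth_columns(truth)
```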
PART 03

Step 2 — export a blinded FASTA.

Aligners take raw sequence and a reference. Hand them a FASTA with just sequence_id and the nucleotide string — no truth fields, no V/D/J hints. Keep the full TSV as your private answer key.

Blinded FASTA
# strips every truth_* column — only sequence + sequence_id ship
result.to_fasta("panel.fasta")

# run your aligner of choice — IgBLAST, MiXCR, your tool —
# producing an AIRR TSV with v_call / d_call / j_call
$ igblastn \
    -germline_db_V human_V \
    -germline_db_D human_D \
    -germline_db_J human_J \
    -query panel.fasta \
    -outfmt "19" > panel.aligner.tsv
PART 04

Step 3 — score.

Join the aligner's report with your truth TSV on sequence_id, then score at two resolutions, exact allele and family, each measuring something different.

Join + accuracy
import pandas as pd

truth   = pd.read_csv("panel_truth.tsv",    sep="\t")
called  = pd.read_csv("panel.aligner.tsv", sep="\t")
# suffixes apply only to columns present in both files (e.g. v_call);
# truth_* columns live only in the answer key and keep their names
df = truth.merge(called, on="sequence_id", suffixes=("_t", "_a"))

# exact allele (IGHV3-23*01 == IGHV3-23*01)
df["v_allele_ok"] = df["truth_v_call"] == df["v_call_a"]

# family-only (IGHV3-23 == IGHV3-23, ignore the *N suffix)
df["v_family_ok"] = (
    df["truth_v_call"].str.split("*").str[0] ==
    df["v_call_a"].str.split("*").str[0]
)

print("V allele accuracy:", df["v_allele_ok"].mean())
print("V family accuracy:", df["v_family_ok"].mean())
PART 05

Step 4 — plot the drift curve.

The single most useful chart from this recipe: aligner accuracy as a function of mutation load. Real samples span a range of SHM rates; the curve tells you over which part of that range your aligner is reliable.

Stratify by mutation bucket
import matplotlib.pyplot as plt

df["mut_bucket"] = pd.cut(
    df["truth_n_mutations"],
    bins=[0, 5, 10, 15, 20, 25, 30, 40],
    include_lowest=True,  # keep unmutated sequences in the first bucket
)

curve = df.groupby("mut_bucket", observed=True)["v_allele_ok"].mean()
fam   = df.groupby("mut_bucket", observed=True)["v_family_ok"].mean()

fig, ax = plt.subplots()
ax.plot(curve.index.astype(str), curve.values, label="allele")
ax.plot(fam.index.astype(str),   fam.values,   label="family")
ax.set_xlabel("SHM load (mutations)")
ax.set_ylabel("V accuracy")
ax.legend()
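The drift threshold can also be read off programmatically: the first bucket whose accuracy dips below a floor you pick (0.90 here is an arbitrary example, not a GenAIRR default):

```python
def drift_bucket(curve, floor=0.90):
    """First bucket label whose accuracy falls below `floor`, or None if it never does."""
    for bucket, acc in curve.items():
        if acc < floor:
            return bucket
    return None

# e.g. drift_bucket(curve) on the Series computed above
```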
PART 06

Reading the curve.

Three patterns to watch for. None of them are bugs — they're real properties of allele callers under SHM stress.

Family stays high, allele drops

The aligner identifies the family correctly (IGHV3-23) but picks the wrong allele (*04 instead of *01). Common above ~10 SHM events. Family-level downstream analyses are still trustworthy; allele-level repertoire diversity reports are not.

Both drop at high SHM

Above ~25 SHM events the sequence diverges enough that even the family call slips. This is the working ceiling of any allele caller; the aligner isn't broken, the information simply isn't recoverable.

5′ corruption flattens the curve early

Stratify also by corrupted_5prime. The V family signal lives in the 5′ FWR1/CDR1 region; if you trim it, allele resolution collapses well before SHM would otherwise cost you.
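The same bucketing pattern as the SHM curve works here, assuming the answer key's corrupted_5prime column is the number of bases removed at the 5′ end (a sketch; the bin edges are arbitrary):

```python
import pandas as pd

def accuracy_by_trim(df: pd.DataFrame, ok_col: str = "v_family_ok") -> pd.Series:
    """Mean accuracy per 5'-trim bucket; corrupted_5prime = bases removed."""
    buckets = pd.cut(df["corrupted_5prime"], bins=[0, 10, 20, 40, 60],
                     include_lowest=True)
    return df.groupby(buckets, observed=True)[ok_col].mean()

# e.g. print(accuracy_by_trim(df))   # df from the join in Step 3
```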

Related recipes

Where to next.

B · 02 · Compare two SHM models →

Apply the same workflow with the mutation model varied — does your aligner cope with non-S5F SHM patterns?

Concept · The AIRR Record →

All ~70 fields, including the full set of truth_* columns you can score against.