GenAIRR's signature pitch is absolute ground truth — you know what every V/D/J call should be, regardless of what the aligner reports. This is the recipe that cashes that pitch. Simulate 50k sequences, run an aligner against them, compute V-call accuracy stratified by mutation level, and plot the curve where the aligner starts to drift.
Real-world aligner benchmarks usually compare two tools to each other and treat their
agreement as the signal. That's better than nothing, but it measures agreement,
not correctness. With GenAIRR's truth_v_call field you know the
answer the moment the sequence is born: every disagreement between truth and the
aligner's report is, by construction, an aligner mistake.
50k sequences is a healthy panel size: large enough to slice by mutation level with stable sample counts per bucket, small enough to fit in memory for a quick analysis. Vary mutation load across the panel so you have rows to stratify by later.
import GenAIRR as ga

result = (
    ga.Experiment.on("human_igh")
    .recombine()
    # wide SHM range — gives you the curve, not just one point
    .mutate(model="s5f", count=(0, 40))
    # realistic lab effects — what the aligner actually sees
    .corrupt_5prime_loss(length=(5, 60))
    .corrupt_3prime_loss(length=(0, 30))
    .corrupt_ns(prob=0.001)
    .run_records(
        n=50_000,
        seed=42,
        respect=ga.productive(),
        expose_provenance=True,  # <-- truth_* columns
    )
)
result.to_csv("panel_truth.tsv", sep="\t")
Aligners take raw sequence and a reference. Hand them a FASTA with just sequence_id
and the nucleotide string — no truth fields, no V/D/J hints. Keep the full TSV as your
private answer key.
# strips every truth_* column — only sequence + sequence_id ship
result.to_fasta("panel.fasta")
# run your aligner of choice — IgBLAST, MiXCR, your tool —
# producing an AIRR TSV with v_call / d_call / j_call
$ igblastn \
    -germline_db_V human_V \
    -germline_db_D human_D \
    -germline_db_J human_J \
    -query panel.fasta \
    -outfmt "19" > panel.aligner.tsv
Join the aligner's report with your truth TSV on sequence_id. Three accuracy
tiers, each measuring something different.
import pandas as pd
truth = pd.read_csv("panel_truth.tsv", sep="\t")
called = pd.read_csv("panel.aligner.tsv", sep="\t")
df = truth.merge(called, on="sequence_id", suffixes=("_t", "_a"))
# exact allele (IGHV3-23*01 == IGHV3-23*01)
df["v_allele_ok"] = df["truth_v_call"] == df["v_call_a"]
# family-only (IGHV3-23 == IGHV3-23, ignore the *N suffix)
df["v_family_ok"] = (
    df["truth_v_call"].str.split("*").str[0] ==
    df["v_call_a"].str.split("*").str[0]
)
print("V allele accuracy:", df["v_allele_ok"].mean())
print("V family accuracy:", df["v_family_ok"].mean())
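The third, most forgiving tier handles ambiguous calls. Many aligners (IgBLAST included) report several equally-scoring alleles as a comma-separated v_call, which the exact string comparison above scores as a miss. A self-contained sketch on toy calls — the comma delimiter follows the AIRR Rearrangement convention; adjust if your aligner delimits differently:

```python
import pandas as pd

# toy join result — stand-in for the truth/aligner merge in the recipe
df = pd.DataFrame({
    "truth_v_call": ["IGHV3-23*01", "IGHV3-23*01", "IGHV1-2*02"],
    "v_call_a": [
        "IGHV3-23*01",               # exact hit
        "IGHV3-23*04,IGHV3-23*01",   # ambiguous set containing the truth
        "IGHV1-69*01",               # miss
    ],
})

# set-membership tier: truth allele appears anywhere in the
# (possibly comma-separated) aligner call
df["v_in_set_ok"] = [
    t in a.split(",") for t, a in zip(df["truth_v_call"], df["v_call_a"])
]
print("V in-set accuracy:", df["v_in_set_ok"].mean())
```

In-set accuracy is always ≥ exact-allele accuracy; a large gap between the two tells you the aligner knows the right answer but can't commit to it.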
The single most useful chart from this recipe: aligner accuracy as a function of mutation load. Real samples have a range of SHM rates; the curve tells you at which range your aligner is reliable.
import matplotlib.pyplot as plt

df["mut_bucket"] = pd.cut(
    df["truth_n_mutations"],
    bins=[0, 5, 10, 15, 20, 25, 30, 40],
)
curve = df.groupby("mut_bucket")["v_allele_ok"].mean()
fam = df.groupby("mut_bucket")["v_family_ok"].mean()
fig, ax = plt.subplots()
ax.plot(curve.index.astype(str), curve.values, label="allele")
ax.plot(fam.index.astype(str), fam.values, label="family")
ax.set_xlabel("SHM load (mutations)")
ax.set_ylabel("V accuracy")
ax.legend()
Three patterns to watch for. None of them are bugs — they're real properties of allele callers under SHM stress.
The aligner identifies the family correctly (IGHV3-23) but picks the wrong allele (*04 instead of *01). Common above ~10 SHM events. Family-level downstream analyses are still trustworthy; allele-level repertoire diversity reports are not.
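You can quantify that first pattern directly: the fraction of calls where the family is right but the allele is wrong, sliced by mutation load. A self-contained sketch on toy flags — in the recipe, v_allele_ok and v_family_ok come from the merged frame:

```python
import pandas as pd

# toy merged frame — stand-in for the truth/aligner join in the recipe
df = pd.DataFrame({
    "truth_n_mutations": [2, 4, 12, 14, 16, 28],
    "v_allele_ok": [True, True, False, True, False, False],
    "v_family_ok": [True, True, True, True, True, False],
})

# family right, allele wrong — the "allele slip" pattern
df["allele_slip"] = df["v_family_ok"] & ~df["v_allele_ok"]
df["mut_bucket"] = pd.cut(df["truth_n_mutations"], bins=[0, 10, 20, 40])

slip_rate = df.groupby("mut_bucket", observed=True)["allele_slip"].mean()
print(slip_rate)
```

A slip rate that climbs with mutation load while family accuracy stays flat is exactly this pattern; it tells you which of your downstream analyses (family-level vs allele-level) the panel still supports.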
Above ~25 SHM events the sequence diverges enough that even the family call slips. This is the working ceiling of any allele caller; the aligner isn't broken, the information just isn't recoverable.
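To put a number on that ceiling, find the last mutation bucket where family accuracy still clears a floor you choose. A minimal sketch — the 0.95 floor is an arbitrary choice, not a GenAIRR constant, and the scan assumes accuracy falls monotonically with SHM:

```python
import pandas as pd

# toy per-bucket family accuracy — in the recipe this is
# df.groupby("mut_bucket")["v_family_ok"].mean()
fam = pd.Series(
    [0.99, 0.98, 0.96, 0.93, 0.80],
    index=["0-5", "5-10", "10-15", "15-20", "20-25"],
)

FLOOR = 0.95  # arbitrary reliability floor
reliable = fam[fam >= FLOOR]  # buckets that clear the floor
ceiling = reliable.index[-1] if len(reliable) else None
print("family calls reliable up to bucket:", ceiling)
```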
Also stratify by corrupted_5prime. The V family signal lives in the 5′
FWR1/CDR1 region; trim it away and allele resolution collapses well before SHM
alone would cost you.
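A sketch of that stratification, assuming the truth TSV records the number of bases trimmed from the 5′ end in a corrupted_5prime column — the column name and semantics here are assumptions, so check them against your GenAIRR version's provenance schema:

```python
import pandas as pd

# toy merged frame — corrupted_5prime is assumed to hold the number of
# bases trimmed from the 5' end (verify against your provenance columns)
df = pd.DataFrame({
    "corrupted_5prime": [0, 3, 12, 25, 45, 58],
    "v_allele_ok": [True, True, True, False, False, False],
})

df["trim_bucket"] = pd.cut(
    df["corrupted_5prime"],
    bins=[-1, 0, 10, 30, 60],
    labels=["none", "1-10", "11-30", "31-60"],
)
by_trim = df.groupby("trim_bucket", observed=True)["v_allele_ok"].mean()
print(by_trim)
```

If accuracy drops sharply in the deeper trim buckets even at low SHM, the aligner's failures are a truncation problem, not a mutation problem — a distinction only the truth fields let you make.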