Recipe B · 02 · ~15 min · intermediate

Compare two SHM models.

Hold the recombination seed fixed, vary only the mutation pass, and you get a clean A/B comparison of how your SHM choice shapes the downstream identity and motif distributions. Useful when picking between S5F variants, calibrating a custom model, or stress-testing an aligner against non-canonical mutation patterns.

01 · Fix recombination — shared seed · shared V·D·J pool
02 · Vary the mutate pass — S5F · uniform · custom
03 · Diff the marginals — v_identity · motif counts
PART 01

Why isolating the mutation pass matters.

Naively comparing two simulations end-to-end mixes confounds: different recombination draws, different NP regions, different trim lengths. Any of those can swing downstream metrics. To attribute a difference to the mutation model, you have to keep everything upstream identical and only swap the model.

Three models worth comparing
  • S5F (hh_s5f) · canonical for human heavy chains — motif-aware, AID-shaped
  • Uniform · flat per-base rate — useful as a null model for "did the motif matter?"
  • Custom · your own table — see A · 04
What to compare on
  • v_identity · summary statistic — how much SHM ended up in the V region
  • motif counts at mutated positions — does the chosen model produce the expected hotspot distribution?
  • CDR vs FWR ratio — are mutations concentrated where biology expects?
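The CDR-vs-FWR check boils down to bucketing mutated positions by region span. A minimal sketch, assuming each record carries a `mutation_pos` list and per-record CDR spans — the `cdr_spans` column here is illustrative, not a documented GenAIRR field:

```python
import pandas as pd

# Toy records: mutated positions plus IMGT-style CDR spans (0-based, half-open).
# `cdr_spans` is an illustrative column, not a documented GenAIRR output.
df = pd.DataFrame({
    "mutation_pos": [[5, 30, 80], [12, 55]],
    "cdr_spans":    [[(26, 38), (55, 65), (104, 116)]] * 2,
})

def cdr_fwr_split(row):
    # count mutations landing inside any CDR span; the rest are FWR
    in_cdr = sum(any(lo <= p < hi for lo, hi in row["cdr_spans"])
                 for p in row["mutation_pos"])
    return in_cdr, len(row["mutation_pos"]) - in_cdr

splits = df.apply(cdr_fwr_split, axis=1)
print(list(splits))  # per-record (CDR, FWR) mutation counts: [(1, 2), (1, 1)]
```

Run it once per panel and compare the pooled CDR/FWR ratios; an AID-shaped model should skew further toward CDR than the uniform null.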
PART 02

Recipe — shared seed, swapped model.

Two runs at the same seed produce bit-identical recombination — same V allele, same trims, same NP1/NP2. Only the bases that the mutation pass touches differ.

Two panels, matched recombination
import GenAIRR as ga

def panel(model: str):
    return (
        ga.Experiment.on("human_igh")
          .recombine()
          .mutate(model=model, count=(10, 20))
          .run_records(
              n=5_000,
              seed=42,                # <-- identical across runs
              expose_provenance=True,
          )
    )

a = panel("hh_s5f").to_dataframe()
b = panel("uniform").to_dataframe()

# sanity check: identical seed ⇒ identical recombination, so rows align
# one-to-one with the same truth_v_call in both panels
assert (a["truth_v_call"] == b["truth_v_call"]).all()
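The assert above relies on row order. When order isn't guaranteed, an explicit join by sequence_id is safer — a pandas sketch, with toy frames standing in for the `a` and `b` panels above:

```python
import pandas as pd

# Toy stand-ins for the two panel dataframes; note the scrambled row order in b.
a = pd.DataFrame({"sequence_id": ["s1", "s2"], "v_identity": [0.93, 0.97]})
b = pd.DataFrame({"sequence_id": ["s2", "s1"], "v_identity": [0.96, 0.95]})

# inner join on sequence_id pairs each record with its counterpart regardless
# of row order; suffixes disambiguate the shared v_identity column
paired = a.merge(b, on="sequence_id", suffixes=("_s5f", "_uniform"))
paired["delta"] = paired["v_identity_s5f"] - paired["v_identity_uniform"]
print(len(paired))  # 2 — every record matched
```

A per-record `delta` is also a more sensitive readout than two marginal densities, since it differences away the record-level recombination draw.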
PART 03

Side-by-side density plot.

v_identity is the single most informative axis — every record produces one value in [0, 1], and the distribution shape is sensitive to whether the model concentrates mutations in motifs or scatters them.

Two-density plot
import matplotlib.pyplot as plt
import seaborn as sns

fig, ax = plt.subplots()
sns.kdeplot(data=a["v_identity"], label="S5F",     ax=ax)
sns.kdeplot(data=b["v_identity"], label="uniform", ax=ax)
ax.set_xlabel("V identity")
ax.legend()
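To put a single number on the gap between the two densities, a two-sample Kolmogorov–Smirnov statistic on v_identity works well. A numpy-only sketch (scipy.stats.ks_2samp gives the same statistic plus a p-value, if you prefer):

```python
import numpy as np

def ks_stat(x, y):
    """Two-sample KS statistic: max gap between the two empirical CDFs."""
    x, y = np.sort(np.asarray(x)), np.sort(np.asarray(y))
    grid = np.concatenate([x, y])           # sup is attained at sample points
    cdf_x = np.searchsorted(x, grid, side="right") / len(x)
    cdf_y = np.searchsorted(y, grid, side="right") / len(y)
    return float(np.abs(cdf_x - cdf_y).max())

# identical samples → 0.0; fully separated samples → 1.0
print(ks_stat([0.90, 0.95, 0.99], [0.90, 0.95, 0.99]))  # 0.0
print(ks_stat([0.1, 0.2], [0.8, 0.9]))                  # 1.0
```

Apply it as `ks_stat(a["v_identity"], b["v_identity"])` to summarise the panel difference alongside the plot.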
PART 04

Motif-level diagnostic.

A density plot can hide the structural difference. Bucket every mutated position by its 5-mer motif (centred on the mutated base) and the gap between the two models becomes obvious: S5F over-represents AID hotspots like WRC · GYW.

Counting motifs across mutated positions
from collections import Counter

def motif_counts(df, k=5):
    """Count the k-mer germline context centred on each mutated position."""
    counts = Counter()
    for _, row in df.iterrows():
        germ = row["germline_alignment"]
        for p in row["mutation_pos"]:
            lo, hi = p - k // 2, p + k // 2 + 1
            if lo >= 0 and hi <= len(germ):  # skip windows that run off an end
                counts[germ[lo:hi]] += 1
    return counts

ma = motif_counts(a)
mb = motif_counts(b)
# diff the top 20 motifs — S5F should over-represent WRC, GYW
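The "diff the top 20" step is cleanest if you normalise each counter to frequencies before subtracting, so unequal mutation totals don't dominate the comparison. A sketch, with toy counters standing in for `ma` and `mb`:

```python
from collections import Counter

def motif_diff(ca, cb, top=20):
    # frequency difference per motif: positive ⇒ over-represented in ca
    ta, tb = sum(ca.values()), sum(cb.values())
    motifs = set(ca) | set(cb)               # Counter[missing] is 0, so safe
    deltas = {m: ca[m] / ta - cb[m] / tb for m in motifs}
    return sorted(deltas.items(), key=lambda kv: abs(kv[1]), reverse=True)[:top]

# toy counts standing in for motif_counts(a) / motif_counts(b)
ma = Counter({"AGCTA": 40, "TACGT": 10, "GGCAA": 5})
mb = Counter({"AGCTA": 15, "TACGT": 12, "GGCAA": 6})
print(motif_diff(ma, mb)[0][0])  # AGCTA — the largest frequency gap
```

With real panels, expect WRC/GYW-matching 5-mers to dominate the positive end of the S5F-minus-uniform diff.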
Related recipes

Where to next.

A · 04 · Wire a custom SHM model →

Build your own model and feed it into this comparison harness.

B · 01 · Benchmark an aligner against truth →

Re-run the aligner benchmark under each mutation model — does aligner accuracy vary with SHM shape?