Hold the recombination seed fixed, vary only the mutation pass, and you have a clean A/B isolation of how your SHM choice shapes the downstream identity and motif distributions. Useful when picking between S5F variants, calibrating a custom model, or stress-testing an aligner against non-canonical mutation patterns.
Naively comparing two simulations end-to-end mixes confounds: different recombination draws, different NP regions, different trim lengths. Any of those can swing downstream metrics. To attribute a difference to the mutation model, you have to keep everything upstream identical and only swap the model.
Two runs at the same seed produce bit-identical recombination — same V allele, same trims, same NP1/NP2. Only the bases that the mutation pass touches differ.
import GenAIRR as ga
def panel(model: str):
return (
ga.Experiment.on("human_igh")
.recombine()
.mutate(model=model, count=(10, 20))
.run_records(
n=5_000,
seed=42, # <-- identical across runs
expose_provenance=True,
)
)
a = panel("s5f").to_dataframe()
b = panel("uniform").to_dataframe()
# sanity-check: every row pairs by sequence_id with the same truth_v_call
assert (a["truth_v_call"] == b["truth_v_call"]).all()
v_identity is the single most informative axis — every record produces one
value in [0, 1], and the distribution shape is sensitive to whether the model concentrates
mutations in motifs or scatters them.
import matplotlib.pyplot as plt
import seaborn as sns
fig, ax = plt.subplots()
sns.kdeplot(data=a["v_identity"], label="S5F", ax=ax)
sns.kdeplot(data=b["v_identity"], label="uniform", ax=ax)
ax.set_xlabel("V identity")
ax.legend()
A density plot can hide the structural difference. Bucket every mutated position by its
5-mer motif (centred on the mutated base) and the gap between the two models becomes
obvious: S5F over-represents AID hotspots like WRC · GYW.
from collections import Counter
def motif_counts(df, k=5):
counts = Counter()
for _, row in df.iterrows():
germ = row["germline_alignment"]
for p in row["mutation_pos"]:
lo, hi = p - k // 2, p + k // 2 + 1
if lo >= 0 and hi <= len(germ):
counts[germ[lo:hi]] += 1
return counts
ma = motif_counts(a)
mb = motif_counts(b)
# diff the top 20 motifs — S5F should over-represent WRC, GYW