S5F: mutation isn't uniform
If your simulator scatters point mutations uniformly across a sequence, it's giving you a fairy tale. Real SHM is concentrated in CDRs and shaped by 5-mer context.
The intuition
AID, the enzyme that drives somatic hypermutation, has hot spots. In WRC motifs (where W is A/T and R is A/G) the C is mutated far more often than the surrounding bases. CDRs are enriched for hot spots; FWRs are not. The result is a heatmap, not a sprinkle.
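To make the motif definition concrete, here is a minimal sketch that scans a nucleotide string for WRC and reports where the targeted C sits. The helper name `wrc_hotspots` and the example sequence are made up for illustration and are not part of GenAIRR. It only scans the strand it is given; the reverse-complement form of the motif (GYW) is ignored.

```python
import re

# W = A or T, R = A or G; the trailing C is the base AID targets.
WRC = re.compile(r"[AT][AG]C")

def wrc_hotspots(seq):
    """Return 0-based positions of the C in every WRC motif on the given strand."""
    return [m.end() - 1 for m in WRC.finditer(seq.upper())]

# Hypothetical toy sequence: AGC ends at index 5, AAC ends at index 11.
print(wrc_hotspots("GGTAGCTTTAACGGG"))  # [5, 11]
```

Two WRC motifs cannot overlap, since the C that ends one motif cannot serve as the W or R of another, so a plain non-overlapping scan finds all of them.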
GenAIRR uses the S5F model: a 5-mer context-aware mutability and substitution table derived empirically from human B-cell repertoires. Each position's mutation probability is a function of the base itself and its two upstream and two downstream neighbors. The hotter the 5-mer, the more often its center base mutates.
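The idea of context-weighted targeting can be sketched in a few lines. This is not GenAIRR's implementation: the mutability values below are invented, the table holds three 5-mers instead of all 1,024, and the fallback weight, the edge handling, and the function names are assumptions made for the example. It also only chooses positions; picking the substituted base and updating neighboring contexts after each hit are left out.

```python
import random

# Hypothetical mutability scores keyed by 5-mer; the center base is the target.
# Values are invented for illustration, not taken from any S5F table.
TOY_MUTABILITY = {"AGCTA": 5.0, "TACGT": 4.0, "GGCCC": 0.2}
DEFAULT_MUTABILITY = 1.0  # assumed fallback for 5-mers missing from the toy table

def position_weights(seq):
    """Weight each position by the mutability of the 5-mer centered on it."""
    weights = []
    for i in range(len(seq)):
        if i < 2 or i > len(seq) - 3:
            weights.append(0.0)  # no full 5-mer at the edges (a simplification)
        else:
            weights.append(TOY_MUTABILITY.get(seq[i - 2:i + 3], DEFAULT_MUTABILITY))
    return weights

def sample_positions(seq, n, rng):
    """Draw n distinct positions, each draw proportional to the remaining weights."""
    weights = position_weights(seq)
    if n > sum(w > 0 for w in weights):
        raise ValueError("not enough mutable positions")
    chosen = []
    for _ in range(n):
        pos = rng.choices(range(len(seq)), weights=weights, k=1)[0]
        chosen.append(pos)
        weights[pos] = 0.0  # never hit the same position twice
    return sorted(chosen)

rng = random.Random(42)
print(position_weights("GGAGCTATT"))
print(sample_positions("GGAGCTATT", 3, rng))
```

In the toy sequence the C at index 4 sits at the center of `AGCTA`, so it carries five times the weight of its neighbors and is picked correspondingly more often over many draws.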
Mutation density across the V region
Figure: positional mutation density for 1,000 simulated heavy chains mutated at a 5% rate. Darker green marks positions where more mutations land; amber marks hot-spot regions where the rate exceeds the average.
CDR1 / CDR2 / CDR3 (the antigen-contact loops) are visibly enriched. FWRs (the structural scaffold) are colder — selection pressure rejects mutations that disrupt folding.
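A positional density of this kind can be computed from per-sequence lists of mutated positions. The sketch below assumes those lists are already in hand; how to pull them out of GenAIRR records is not shown here, and the function names and toy data are hypothetical.

```python
from collections import Counter

def positional_density(mutation_positions, length):
    """Fraction of sequences mutated at each position.

    mutation_positions: one list of 0-based mutated positions per sequence.
    """
    counts = Counter(
        pos for positions in mutation_positions for pos in set(positions)
    )
    n = len(mutation_positions)
    return [counts[i] / n for i in range(length)]

def above_average(density):
    """Positions whose density exceeds the mean density."""
    mean = sum(density) / len(density)
    return [i for i, d in enumerate(density) if d > mean]

# Four toy sequences of length 6; position 4 is hit in every one of them.
toy = [[1, 4], [4], [2, 4], [4, 5]]
density = positional_density(toy, length=6)
print(density)                 # [0.0, 0.25, 0.25, 0.0, 1.0, 0.25]
print(above_average(density))  # [4]
```

Comparing each position against the mean is the same rule the figure uses for its amber regions: anything above the average rate counts as hot.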
The corresponding code
S5F is the default mutation model. The count= argument controls how many mutations land on each sequence — a fixed integer, a (min, max) uniform range, or an arbitrary weighted distribution. The s5f_model= argument picks the empirical kernel; four ship with GenAIRR.
import GenAIRR as ga

result = (
    ga.Experiment.on("human_igh")
    .recombine()
    .mutate(
        model="s5f",
        count=(5, 25),
        # kernels: "hh_s5f" (heavy, default), "hh_s5f_60",
        #          "hh_s5f_opposite", "hkl_s5f" (kappa/lambda)
        s5f_model="hh_s5f",
    )
    .run_records(n=1000, seed=42, expose_provenance=True)
)

# Each record exposes mutation summaries at field level
rec = result.records[0]
rec["n_mutations"]    # e.g. 10
rec["mutation_rate"]  # e.g. 0.0258
rec["v_identity"]     # e.g. 0.9763 (S5F concentrates here)
Run the experiment above and inspect mean v_identity, d_identity, j_identity across the 1,000 records. You'll see the three numbers aren't equal — D drops the most (short segments are hit hardest by any mutation that lands), V drops next, and J holds the highest identity. Each region has its own SHM signature. Try swapping s5f_model="hkl_s5f" with a light-chain config to see the kernel matter.
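One way to get those means is a small helper like the sketch below. It assumes each record can be indexed by field name, as `rec["v_identity"]` is above, and that `d_identity` and `j_identity` are exposed the same way. The helper name and the toy records are hypothetical, and the toy values are made up rather than simulator output.

```python
from statistics import mean

def mean_identities(records, fields=("v_identity", "d_identity", "j_identity")):
    """Average each identity field over an iterable of dict-like records."""
    records = list(records)
    return {field: mean(r[field] for r in records) for field in fields}

# Toy stand-ins for result.records; the numbers are invented.
toy_records = [
    {"v_identity": 0.98, "d_identity": 0.90, "j_identity": 1.00},
    {"v_identity": 0.96, "d_identity": 1.00, "j_identity": 0.98},
]
print(mean_identities(toy_records))
```

With a real run, passing `result.records` in place of `toy_records` should work as long as the records behave like dictionaries.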
Run two experiments at the same seed and fixed count=(15, 15): one with model="s5f", one with model="uniform". Compare mean v_identity against mean d_identity in each. With uniform they should sit near each other; with S5F there's a real gap — V is hotter than D for the same mutation budget. Why?