Lesson 03 · ~3 min

S5F: mutation isn't uniform

If your simulator scatters point mutations uniformly across a sequence, it's giving you a fairy tale. Real SHM is concentrated in CDRs and shaped by 5-mer context.

The intuition

AID, the enzyme that drives somatic hypermutation, has hot spots. In WRC motifs (where W is A/T and R is A/G), the C is targeted far more often than bases in colder contexts. CDRs are enriched for hot spots; FWRs are not. The result is a heatmap, not a sprinkle.
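To make the motif concrete, here is a minimal sketch that locates WRC hot spots in a nucleotide string. The function name `wrc_hotspots` and the example sequence are made up for illustration. It scans only the strand it is given; the same motif on the opposite strand reads as GYW (Y is C/T) and would need a second pattern.

```python
import re

# Zero-width lookahead so that overlapping motifs are all reported.
WRC = re.compile(r"(?=([AT][AG]C))")

def wrc_hotspots(seq: str) -> list[int]:
    """Return the 0-based index of the C in every WRC motif on this strand."""
    return [m.start() + 2 for m in WRC.finditer(seq.upper())]

seq = "GGTAGCTATACCAGCTGG"  # illustrative sequence, not a real V gene
print(wrc_hotspots(seq))    # [5, 10, 14]
```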

GenAIRR uses the S5F model — a 5-mer context-aware mutability and substitution table empirically derived from human B-cell repertoires. Each position's mutation probability is a function of its two upstream and two downstream neighbors. The hotter the 5-mer, the more often it mutates.
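The idea of context-dependent mutability can be sketched with a toy lookup table. This is not the S5F table and not GenAIRR's implementation: the scores in `TOY_MUTABILITY`, the `DEFAULT_SCORE` fallback, and both function names are invented for illustration. The sketch also ignores the substitution half of the model and does not rescore neighbors after a mutation changes their context.

```python
import random

# Hypothetical mutability scores for a few 5-mers (made-up numbers).
TOY_MUTABILITY = {"TAGCT": 8.0, "AAGCT": 6.0, "GGGCC": 0.2}
DEFAULT_SCORE = 1.0  # assumed score for any 5-mer missing from the toy table

def position_weights(seq):
    """Score each position by the 5-mer centered on it; edge positions get 0."""
    weights = [0.0] * len(seq)
    for i in range(2, len(seq) - 2):
        weights[i] = TOY_MUTABILITY.get(seq[i - 2 : i + 3], DEFAULT_SCORE)
    return weights

def sample_positions(seq, n, rng):
    """Draw up to n distinct positions, each with probability proportional to its weight."""
    weights = position_weights(seq)
    chosen = []
    for _ in range(n):
        if sum(weights) == 0:
            break  # nothing left to mutate
        i = rng.choices(range(len(seq)), weights=weights, k=1)[0]
        chosen.append(i)
        weights[i] = 0.0  # a position is mutated at most once in this sketch
    return sorted(chosen)

rng = random.Random(42)
print(position_weights("CCTAGCTCC"))
print(sample_positions("CCTAGCTCCAAGCTGG", 3, rng))
```

Over many draws, positions centered on high-scoring 5-mers are picked more often, which is what produces a heatmap rather than an even sprinkle.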

Mutation density across the V region

Below: 1,000 simulated heavy chains, mutated at a 5% rate, summarized as positional density. Darker green = more mutations land here. Amber = hot-spot regions where the rate exceeds the average.

CDR1 / CDR2 / CDR3 (the antigen-contact loops) are visibly enriched. FWRs (the structural scaffold) are colder: they carry fewer hot-spot 5-mers, and in real repertoires selection also removes mutations that disrupt folding.
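A positional summary like this one can be computed in a few lines. The sketch assumes each sequence's mutated positions are already available as a list of 0-based indices in a shared coordinate system. The lesson does not show how GenAIRR exposes them, so the input format and both function names are hypothetical.

```python
from collections import Counter

def positional_density(mutation_positions, length):
    """Fraction of sequences carrying a mutation at each 0-based position."""
    n = len(mutation_positions)
    if n == 0:
        return [0.0] * length
    # set() guards against the same position being listed twice for one sequence.
    counts = Counter(p for positions in mutation_positions for p in set(positions))
    return [counts[i] / n for i in range(length)]

def above_average(density):
    """Indices whose density exceeds the mean across all positions."""
    avg = sum(density) / len(density)
    return [i for i, d in enumerate(density) if d > avg]

toy = [[1, 3], [3], [3], [2, 3]]  # four toy sequences, positions made up
density = positional_density(toy, length=5)
print(density)                 # [0.0, 0.25, 0.25, 1.0, 0.0]
print(above_average(density))  # [3]
```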

The corresponding code

S5F is the default mutation model. The count= argument controls how many mutations land on each sequence — a fixed integer, a (min, max) uniform range, or an arbitrary weighted distribution. The s5f_model= argument picks the empirical kernel; four ship with GenAIRR.

lesson_3.py
import GenAIRR as ga

result = (
    ga.Experiment.on("human_igh")
       .recombine()
       .mutate(
           model="s5f",
           count=(5, 25),
           # kernels: "hh_s5f" (heavy, default), "hh_s5f_60",
           #          "hh_s5f_opposite", "hkl_s5f" (kappa/lambda)
           s5f_model="hh_s5f",
       )
       .run_records(n=1000, seed=42, expose_provenance=True)
)

# Each record exposes mutation summaries as fields
rec = result.records[0]
print(rec["n_mutations"])    # e.g. 10
print(rec["mutation_rate"])  # e.g. 0.0258
print(rec["v_identity"])     # e.g. 0.9763 (S5F concentrates here)

Try it now

Run the experiment above and inspect mean v_identity, d_identity, and j_identity across the 1,000 records. You'll see the three numbers aren't equal: D drops the most (short segments are hit hardest by any mutation that lands), V drops next, and J holds the highest identity. Each region has its own SHM signature. Then set s5f_model="hkl_s5f" and pair it with a light-chain config to see how much the kernel matters.
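A small helper for the averaging step might look like the sketch below. It assumes only that each record can be indexed by field name, as `rec["v_identity"]` is in the listing. The helper name `mean_identities` and the stand-in records are invented, so the numbers say nothing about what GenAIRR produces.

```python
from statistics import mean

def mean_identities(records, fields=("v_identity", "d_identity", "j_identity")):
    """Average each identity field across a collection of dict-like records."""
    return {field: mean(rec[field] for rec in records) for field in fields}

# Stand-in records with made-up values; in the lesson you would pass result.records.
toy_records = [
    {"v_identity": 0.98, "d_identity": 0.90, "j_identity": 1.00},
    {"v_identity": 0.96, "d_identity": 0.80, "j_identity": 0.98},
]

for field, value in mean_identities(toy_records).items():
    print(f"{field}: {value:.4f}")
```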

Exercise

Run two experiments at the same seed and fixed count=(15, 15): one with model="s5f", one with model="uniform". Compare mean v_identity against mean d_identity in each. With uniform they should sit near each other; with S5F there's a real gap — V is hotter than D for the same mutation budget. Why?