Out of the box GenAIRR ships two mutation models: S5F (motif-aware, Yaari et al. 2013) and Uniform (rate-only, useful as a null). Both are good. Neither is your model — if you have a per-base substitution distribution from your own data, you can drop it in as a new mutation pass while leaving the rest of the pipeline intact.
A mutation model has to answer two questions per substitution event: where in the sequence does the mutation land, and which base does it become. The difference between S5F, uniform, and your-own-model is purely how those probabilities are conditioned.
Each position is independently a target with probability proportional to a fixed rate. Substitution is uniform over the 3 alternative bases. Useful as a null model.
Target probability conditioned on a small window — typically the 5-mer centred on the position. Substitution table is also conditioned on the motif. Captures AID hotspots.
Probability conditioned on the position relative to the gene start (e.g. CDR/FWR aware). Most flexible; needs more parameters and a calibration dataset.
A motif-aware model is two tables and a sampler. The targeting table maps each motif to a probability of being targeted; the substitution table maps each motif (and target base) to a distribution over the 4 nucleotides.
from GenAIRR._engine import Simulation, Rng
import json
class MotifMutationPass:
def __init__(self, model_path: str, count: tuple):
with open(model_path) as f:
m = json.load(f)
self.target_w = m["targeting"] # 5-mer → weight
self.sub_dist = m["substitution"] # 5-mer → {A,C,G,T weights}
self.count_lo, self.count_hi = count
def name(self): return "mutate.custom_motif"
def effects(self): return frozenset({"point_mutation"})
def apply(self, sim, rng, contracts):
n_subs = rng.randint(self.count_lo, self.count_hi)
builder = sim.to_builder()
for _ in range(n_subs):
pos = self._draw_target(sim, rng)
new_b = self._draw_sub(sim, pos, rng)
builder.set_base(pos, new_b)
builder.record_choice(
f"mutate.custom_motif.pos[{pos}]",
{"pos": pos, "alt": new_b},
)
return builder.build()
The _draw_target and _draw_sub helpers are where your model's
physics lives. Implement them however you like — weighted reservoir sampling, alias method,
or a precomputed inverse-CDF lookup. The engine only cares that they return valid positions
and bases.
Drop the custom pass into a manually-assembled PassPlan in the same slot the
built-in mutate pass would occupy. The DSL stays available for the surrounding phases.
from GenAIRR._engine import PassPlan, CompiledSimulator
import GenAIRR as ga
refdata = ga.dataconfig_to_refdata(ga.HUMAN_IGH_OGRDB)
plan = PassPlan.recombine(refdata)
# swap in your motif model for the canonical S5F pass
plan.push_python_pass(
MotifMutationPass("./my_5mer_model.json", count=(5, 25))
)
# layer the standard corruption phase after
plan.push_corrupt_5prime_loss([(5, 30, 1.0)])
sim = CompiledSimulator(plan)
out = sim.run(n=1000, seed=42)
A custom model is only useful if it produces the distributions it advertises. Two orthogonal checks: per-base mutation rate, and motif-bias on the targeting distribution.
import numpy as np
rates = np.array([
sim.truth_n_mutations / sim.sequence_length
for o in out
for sim in [o.final_simulation()]
])
print(rates.mean(), rates.std())
from collections import Counter
counts = Counter()
for o in out:
sim = o.final_simulation()
for pos in sim.mutation_positions():
motif = sim.germline_window(pos, 5)
counts[motif] += 1
# compare with your model's targeting table